PyMuPDF.IO

PyMuPDF-Layout Performance on DocLayNet: A Comparative Evaluation

November 6, 2025


This report presents benchmark results for PyMuPDF-Layout evaluated against Docling on the DocLayNet dataset. We compare layout detection accuracy using IoU-based metrics and report model efficiency characteristics.

Methodology

Dataset: DocLayNet (Pfitzmann et al., 2022)

  • Training set: 69,000 document pages
  • Validation set: 6,480 document pages
  • Document categories: financial reports, scientific articles, patents, manuals, legal documents, tender documents
  • Annotation schema: 11 class labels (caption, footnote, formula, list-item, page-footer, page-header, picture, section-header, table, text, title)

Baseline: Docling v2 with RT-DETR architecture

Evaluation metric: F1 score computed from precision and recall at IoU threshold 0.6
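For concreteness, the metric can be sketched in a few lines of plain Python. The greedy one-to-one matching below is an illustrative stand-in; the report does not specify the exact matching procedure used by the evaluation pipeline.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def f1_at_iou(preds, truths, thresh=0.6):
    """Greedily match predictions to ground truth at an IoU threshold,
    then compute F1 from the resulting precision and recall."""
    matched, used = 0, set()
    for p in preds:
        for i, t in enumerate(truths):
            if i not in used and iou(p, t) >= thresh:
                matched += 1
                used.add(i)
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(truths) if truths else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

A prediction counts as a true positive only if it overlaps a so-far-unmatched ground-truth box with IoU ≥ 0.6; per-class F1 is then aggregated over all pages.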

Class harmonization: To account for taxonomy differences between Docling and DocLayNet, we applied the following mappings:

  • Docling classes document-index and form → DocLayNet class table
  • Docling classes key-value-region, code, checkbox-selected, checkbox-unselected → DocLayNet class text
Note: Docling's classification scheme maps all title elements to section-header, resulting in zero coverage of the DocLayNet title class.
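The harmonization above amounts to a small lookup table. The sketch below spells out only the two mappings listed; the pass-through fallback for unlisted classes is an assumption about how the evaluation was run.

```python
# Taxonomy harmonization: Docling class -> DocLayNet class,
# exactly as listed in the two bullets above.
DOCLING_TO_DOCLAYNET = {
    "document-index": "table",
    "form": "table",
    "key-value-region": "text",
    "code": "text",
    "checkbox-selected": "text",
    "checkbox-unselected": "text",
}

def harmonize(label: str) -> str:
    # Classes without an explicit mapping pass through unchanged (assumed).
    return DOCLING_TO_DOCLAYNET.get(label, label)
```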


Experimental Results

Experiment 1: PDF-based features

The first model variant uses features extracted exclusively from PDF internals without rendering to images.

Class             Docling F1    PyMuPDF-Layout F1    Δ
caption           0.8594        0.8157               -0.0437
footnote          0.4827        0.7217               +0.2390
formula           0.7416        0.7370               -0.0046
list-item         0.7955        0.8737               +0.0782
page-footer       0.7937        0.7973               +0.0036
page-header       0.8218        0.8387               +0.0169
picture           0.6314        0.2462               -0.3852
section-header    0.8732        0.7823               -0.0909
table             0.7977        0.6886               -0.1091
text              0.8146        0.8675               +0.0529
title             0.0000        0.7672               +0.7672
Overall           0.8102        0.8270               +0.0168

Model characteristics: 20M parameters (Docling RT-DETR) vs. 1.3M parameters (PyMuPDF-Layout)

Observed performance patterns:

  • Strong performance on structured elements: footnotes (+0.239), list-items (+0.078), text blocks (+0.053)
  • Degraded performance on visual elements: pictures (-0.385)
  • Moderate underperformance on elements requiring document-level context: tables (-0.109), section-headers (-0.091)
  • Large gain on title detection (+0.767), which follows from the baseline's zero coverage of the title class (see the note under Methodology)

Experiment 2: Fusion features (PDF + global image context)

The second model variant augments PDF features with global document features extracted from low-resolution page images using a lightweight CNN backbone (0.5M additional parameters).

Class             Docling F1    PyMuPDF-Layout F1    Δ
caption           0.8594        0.8613               +0.0019
footnote          0.4827        0.7584               +0.2757
formula           0.7416        0.7666               +0.0250
list-item         0.7955        0.8676               +0.0721
page-footer       0.7937        0.9277               +0.1340
page-header       0.8218        0.7953               -0.0265
picture           0.6314        0.2885               -0.3429
section-header    0.8732        0.8389               -0.0343
table             0.7977        0.7966               -0.0011
text              0.8146        0.8489               +0.0343
title             0.0000        0.7189               +0.7189
Overall           0.8102        0.8356               +0.0254

Model characteristics: 20M parameters (Docling RT-DETR) vs. 1.8M parameters (PyMuPDF-Layout with fusion)

Observed effects of global context augmentation:

  • Substantial improvement on page-footer detection (+0.134)
  • Near-elimination of the table detection gap (Δ -0.109 → -0.001)
  • Persistent underperformance on picture classification, though reduced (-0.385 → -0.343)
  • Overall F1 gain of +0.009 over the PDF-only variant (+0.025 over the Docling baseline)

Computational efficiency

Implementation                      Parameters    F1 Score    GPU Dependency
Docling (RT-DETR)                   20M           0.8102      Required
PyMuPDF-Layout (PDF features)       1.3M          0.8270      None
PyMuPDF-Layout (Fusion features)    1.8M          0.8356      None

The PDF-feature variant achieves comparable accuracy with a 15.4× parameter reduction. The fusion variant improves F1 by 2.5 points (0.8102 → 0.8356) with an 11.1× parameter reduction. Both variants operate without GPU acceleration.
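The headline ratios follow directly from the table; a quick arithmetic check:

```python
# Parameter counts from the table above.
docling_params = 20e6
pdf_params = 1.3e6
fusion_params = 1.8e6

pdf_reduction = docling_params / pdf_params        # ~15.4x fewer parameters
fusion_reduction = docling_params / fusion_params  # ~11.1x fewer parameters
f1_gain = 0.8356 - 0.8102                          # ~+0.025 absolute F1
```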


Discussion

The results demonstrate that layout detection models trained on structured PDF features can achieve performance parity with vision-based models while operating at significantly reduced computational cost. The approach exhibits clear strengths (structured text elements, document metadata) and limitations (visual elements, complex tables).

The fusion approach partially addresses global context deficiencies while maintaining computational efficiency. Picture classification remains a structural limitation of PDF-based feature extraction.


Future work

Additional benchmark evaluations are planned using alternative datasets and evaluation frameworks. We will continue to update performance metrics as new baselines become available.

For implementation details and usage instructions, see the PyMuPDF-Layout tutorial.

Benchmark last updated: October 2025