PyMuPDF-Layout Performance on DocLayNet: A Comparative Evaluation
November 6, 2025

This report presents benchmark results for PyMuPDF-Layout evaluated against Docling on the DocLayNet dataset. We compare layout detection accuracy using IoU-based metrics and report model efficiency characteristics.
Methodology
Dataset: DocLayNet (Pfitzmann et al., 2022)
- Training set: 69,000 document pages
- Validation set: 6,480 document pages
- Document categories: financial reports, scientific articles, patents, manuals, legal documents, tender documents
- Annotation schema: 11 class labels (caption, footnote, formula, list-item, page-footer, page-header, picture, section-header, table, text, title)
Baseline: Docling v2 with RT-DETR architecture
Evaluation metric: F1 score computed from precision and recall at an IoU threshold of 0.6 (a sketch of the scoring procedure follows the note below)
Class harmonization: To account for taxonomy differences between Docling and DocLayNet, we applied the following mappings:
- Docling classes `document-index` and `form` → DocLayNet class `table`
- Docling classes `key-value-region`, `code`, `checkbox-selected`, `checkbox-unselected` → DocLayNet class `text`
Note: Docling's classification scheme maps all title elements to section-header, resulting in zero coverage of the DocLayNet title class.
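
To make the scoring procedure concrete, here is a minimal sketch, not the benchmark's actual harness: it encodes the harmonization mappings above as a lookup table and computes per-class F1 via greedy one-to-one IoU matching at the 0.6 threshold. All function and variable names are illustrative.

```python
# Illustrative sketch of the evaluation protocol (not the benchmark harness).

# Class harmonization applied to Docling predictions before scoring.
DOCLING_TO_DOCLAYNET = {
    "document-index": "table",
    "form": "table",
    "key-value-region": "text",
    "code": "text",
    "checkbox-selected": "text",
    "checkbox-unselected": "text",
}

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def class_f1(pred_boxes, gt_boxes, threshold=0.6):
    """Per-class F1 from greedy one-to-one matching at the IoU threshold."""
    unmatched = list(gt_boxes)
    tp = 0
    for pred in pred_boxes:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            tp += 1
            unmatched.remove(best)
    fp = len(pred_boxes) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```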
Experimental Results
Experiment 1: PDF-based features
The first model variant uses features extracted exclusively from PDF internals without rendering to images.
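
The report does not enumerate the model's exact input schema. As an illustration of the kind of PDF-internal signals available without rendering, the sketch below pulls per-span geometry and typography through PyMuPDF's standard text extraction; the specific feature selection here is an assumption for illustration only.

```python
import pymupdf  # PyMuPDF 1.24+; older releases import as "fitz"

def span_features(pdf_path):
    """Collect per-span features from PDF internals, with no page rendering.

    The chosen fields (normalized bbox, font size, bold/italic flags,
    text length) are illustrative, not the model's actual inputs.
    """
    features = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            w, h = page.rect.width, page.rect.height
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    for span in line["spans"]:
                        x0, y0, x1, y1 = span["bbox"]
                        features.append({
                            "bbox": (x0 / w, y0 / h, x1 / w, y1 / h),
                            "size": span["size"],
                            "bold": bool(span["flags"] & 16),
                            "italic": bool(span["flags"] & 2),
                            "chars": len(span["text"]),
                        })
    return features
```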
| Class | Docling F1 | PyMuPDF-Layout F1 | Δ |
|---|---|---|---|
| caption | 0.8594 | 0.8157 | -0.0437 |
| footnote | 0.4827 | 0.7217 | +0.2390 |
| formula | 0.7416 | 0.7370 | -0.0046 |
| list-item | 0.7955 | 0.8737 | +0.0782 |
| page-footer | 0.7937 | 0.7973 | +0.0036 |
| page-header | 0.8218 | 0.8387 | +0.0169 |
| picture | 0.6314 | 0.2462 | -0.3852 |
| section-header | 0.8732 | 0.7823 | -0.0909 |
| table | 0.7977 | 0.6886 | -0.1091 |
| text | 0.8146 | 0.8675 | +0.0529 |
| title | 0.0000 | 0.7672 | +0.7672 |
| Overall | 0.8102 | 0.8270 | +0.0168 |
Model characteristics: 20M parameters (Docling RT-DETR) vs. 1.3M parameters (PyMuPDF-Layout)
Observed performance patterns:
- Strong performance on structured elements: footnotes (+0.239), list-items (+0.078), text blocks (+0.053)
- Degraded performance on visual elements: pictures (-0.385)
- Moderate underperformance on elements requiring document-level context: tables (-0.109), section-headers (-0.091)
- Significant improvement on title detection (+0.767), a direct consequence of the baseline's structural zero on this class (see the note in Methodology)
Experiment 2: Fusion features (PDF + global image context)
The second model variant augments PDF features with global document features extracted from low-resolution page images using a lightweight CNN backbone (0.5M additional parameters).
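
Neither the CNN architecture nor the render resolution is specified here. The sketch below (PyTorch) shows one plausible shape of such a pipeline: render each page at low resolution with PyMuPDF, encode it into a global vector, and concatenate that vector with each span's PDF features. The layer widths, the 256-pixel target, and all names are assumptions; this toy encoder is far smaller than the 0.5M-parameter backbone described above.

```python
import pymupdf
import torch
import torch.nn as nn

class GlobalPageEncoder(nn.Module):
    """Toy CNN over a low-resolution grayscale page render.

    Layer widths are placeholders; the actual 0.5M-parameter backbone
    is not described in this report.
    """
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, page_image):   # (B, 1, H, W)
        return self.net(page_image)  # (B, out_dim)

def render_low_res(page, target=256):
    """Render a PDF page to a small grayscale tensor via PyMuPDF."""
    zoom = target / max(page.rect.width, page.rect.height)
    pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom),
                          colorspace=pymupdf.csGRAY)
    img = torch.frombuffer(bytearray(pix.samples), dtype=torch.uint8)
    return img.reshape(1, 1, pix.height, pix.width).float() / 255.0

# Fusion (conceptually): the global page vector is broadcast and
# concatenated onto each span's PDF feature vector before classification:
# fused = torch.cat([span_feats, global_vec.expand(len(span_feats), -1)], dim=1)
```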
| Class | Docling F1 | PyMuPDF-Layout F1 | Δ |
|---|---|---|---|
| caption | 0.8594 | 0.8613 | +0.0019 |
| footnote | 0.4827 | 0.7584 | +0.2757 |
| formula | 0.7416 | 0.7666 | +0.0250 |
| list-item | 0.7955 | 0.8676 | +0.0721 |
| page-footer | 0.7937 | 0.9277 | +0.1340 |
| page-header | 0.8218 | 0.7953 | -0.0265 |
| picture | 0.6314 | 0.2885 | -0.3429 |
| section-header | 0.8732 | 0.8389 | -0.0343 |
| table | 0.7977 | 0.7966 | -0.0011 |
| text | 0.8146 | 0.8489 | +0.0343 |
| title | 0.0000 | 0.7189 | +0.7189 |
| Overall | 0.8102 | 0.8356 | +0.0254 |
Model characteristics: 20M parameters (Docling RT-DETR) vs. 1.8M parameters (PyMuPDF-Layout with fusion)
Observed effects of global context augmentation:
- Substantial improvement on page-footer detection (+0.134)
- Near parity on table detection (Δ improves from -0.109 in Experiment 1 to -0.001)
- Persistent underperformance on picture classification, though reduced (-0.385 → -0.343)
- Overall F1 of 0.8356: +0.0254 over the Docling baseline and +0.0086 over the PDF-only variant
Computational efficiency
| Implementation | Parameters | F1 Score | GPU Dependency |
|---|---|---|---|
| Docling (RT-DETR) | 20M | 0.8102 | Required |
| PyMuPDF-Layout (PDF features) | 1.3M | 0.8270 | None |
| PyMuPDF-Layout (Fusion features) | 1.8M | 0.8356 | None |
The PDF-feature variant slightly exceeds the baseline's overall F1 (+0.017 absolute) with a 15.4× reduction in parameter count. The fusion variant improves overall F1 by +0.025 absolute with an 11.1× reduction. Both variants run entirely on CPU, with no GPU acceleration required.
Discussion
The results demonstrate that layout detection models trained on structured PDF features can match, and slightly exceed, the overall accuracy of vision-based models while operating at significantly reduced computational cost. The approach exhibits clear strengths (structured text elements, page furniture such as headers and footers) and limitations (visual elements, complex tables).
The fusion approach partially addresses global context deficiencies while maintaining computational efficiency. Picture classification remains a structural limitation of PDF-based feature extraction: pictures often contribute little or no text or vector structure to the PDF content stream, so PDF-internal features underdetermine them.
Future work
Additional benchmark evaluations are planned using alternative datasets and evaluation frameworks. We will continue to update performance metrics as new baselines become available.
For implementation details and usage instructions, see the PyMuPDF-Layout tutorial.
Benchmark last updated: October 2025