PyMuPDF-Layout Performance on DocLayNet: A Comparative Evaluation
November 6, 2025

This report presents benchmark results for PyMuPDF-Layout evaluated against Docling on the DocLayNet dataset. We compare layout detection accuracy using IoU-based metrics and report model efficiency characteristics.
Methodology
Dataset: DocLayNet (Pfitzmann et al., 2022)
- Training set: 69,000 document pages
- Validation set: 6,480 document pages
- Document categories: financial reports, scientific articles, patents, manuals, legal documents, tender documents
- Annotation schema: 11 class labels (caption, footnote, formula, list-item, page-footer, page-header, picture, section-header, table, text, title)
Baseline: Docling v2 with RT-DETR architecture
Evaluation metric: F1 score computed from precision and recall at an IoU threshold of 0.6 (a sketch of the scoring procedure follows the note below)
Class harmonization: To account for taxonomy differences between Docling and DocLayNet, we applied the following mappings:
- Docling classes `document-index` and `form` → DocLayNet class `table`
- Docling classes `key-value-region`, `code`, `checkbox-selected`, `checkbox-unselected` → DocLayNet class `text`
Note: Docling's classification scheme maps all title elements to section-header, resulting in zero coverage of the DocLayNet title class.
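
To make the scoring procedure concrete, here is a minimal sketch, not the benchmark's actual harness: it encodes the harmonization mappings above as a lookup table and computes per-class F1 via greedy one-to-one IoU matching at the 0.6 threshold. All function and variable names are illustrative.

```python
# Illustrative sketch of the evaluation protocol (not the benchmark harness).

# Class harmonization applied to Docling predictions before scoring.
DOCLING_TO_DOCLAYNET = {
    "document-index": "table",
    "form": "table",
    "key-value-region": "text",
    "code": "text",
    "checkbox-selected": "text",
    "checkbox-unselected": "text",
}

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def class_f1(pred_boxes, gt_boxes, threshold=0.6):
    """Per-class F1 from greedy one-to-one matching at the IoU threshold."""
    unmatched = list(gt_boxes)
    tp = 0
    for pred in pred_boxes:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            tp += 1
            unmatched.remove(best)
    fp = len(pred_boxes) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```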
Experimental Results
Experiment 1: PDF-based features
The first model variant uses features extracted exclusively from PDF internals without rendering to images.
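
The report does not enumerate the model's exact input schema. As an illustration of the kind of PDF-internal signals available without rendering, the sketch below pulls per-span geometry and typography through PyMuPDF's standard text extraction; the specific feature selection here is an assumption for illustration only.

```python
import pymupdf  # PyMuPDF 1.24+; older releases import as "fitz"

def span_features(pdf_path):
    """Collect per-span features from PDF internals, with no page rendering.

    The chosen fields (normalized bbox, font size, bold/italic flags,
    text length) are illustrative, not the model's actual inputs.
    """
    features = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            w, h = page.rect.width, page.rect.height
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    for span in line["spans"]:
                        x0, y0, x1, y1 = span["bbox"]
                        features.append({
                            "bbox": (x0 / w, y0 / h, x1 / w, y1 / h),
                            "size": span["size"],
                            "bold": bool(span["flags"] & 16),
                            "italic": bool(span["flags"] & 2),
                            "chars": len(span["text"]),
                        })
    return features
```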
| Class | Docling F1 | PyMuPDF-Layout F1 | Δ |
|---|---|---|---|
| caption | 0.8594 | 0.8157 | -0.0437 |
| footnote | 0.4827 | 0.7217 | +0.2390 |
| formula | 0.7416 | 0.7370 | -0.0046 |
| list-item | 0.7955 | 0.8737 | +0.0782 |
| page-footer | 0.7937 | 0.7973 | +0.0036 |
| page-header | 0.8218 | 0.8387 | +0.0169 |
| picture | 0.6314 | 0.2462 | -0.3852 |
| section-header | 0.8732 | 0.7823 | -0.0909 |
| table | 0.7977 | 0.6886 | -0.1091 |
| text | 0.8146 | 0.8675 | +0.0529 |
| title | 0.0000 | 0.7672 | +0.7672 |
| Overall | 0.8102 | 0.8270 | +0.0168 |
Model characteristics: 20M parameters (Docling RT-DETR) vs. 1.3M parameters (PyMuPDF-Layout)
Observed performance patterns:
- Strong performance on structured elements: footnotes (+0.239), list-items (+0.078), text blocks (+0.053)
- Degraded performance on visual elements: pictures (-0.385)
- Moderate underperformance on elements requiring document-level context: tables (-0.109), section-headers (-0.091)
- Significant improvement on title detection (+0.767), a direct consequence of the baseline's structural zero on this class (see the note in Methodology)
Experiment 2: Fusion features (PDF + global image context)
The second model variant augments PDF features with global document features extracted from low-resolution page images using a lightweight CNN backbone (0.5M additional parameters).
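
Neither the CNN architecture nor the render resolution is specified here. The sketch below (PyTorch) shows one plausible shape of such a pipeline: render each page at low resolution with PyMuPDF, encode it into a global vector, and concatenate that vector with each span's PDF features. The layer widths, the 256-pixel target, and all names are assumptions; this toy encoder is far smaller than the 0.5M-parameter backbone described above.

```python
import pymupdf
import torch
import torch.nn as nn

class GlobalPageEncoder(nn.Module):
    """Toy CNN over a low-resolution grayscale page render.

    Layer widths are placeholders; the actual 0.5M-parameter backbone
    is not described in this report.
    """
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, page_image):   # (B, 1, H, W)
        return self.net(page_image)  # (B, out_dim)

def render_low_res(page, target=256):
    """Render a PDF page to a small grayscale tensor via PyMuPDF."""
    zoom = target / max(page.rect.width, page.rect.height)
    pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom),
                          colorspace=pymupdf.csGRAY)
    img = torch.frombuffer(bytearray(pix.samples), dtype=torch.uint8)
    return img.reshape(1, 1, pix.height, pix.width).float() / 255.0

# Fusion (conceptually): the global page vector is broadcast and
# concatenated onto each span's PDF feature vector before classification:
# fused = torch.cat([span_feats, global_vec.expand(len(span_feats), -1)], dim=1)
```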
| Class | Docling F1 | PyMuPDF-Layout F1 | Δ |
|---|---|---|---|
| caption | 0.8594 | 0.8613 | +0.0019 |
| footnote | 0.4827 | 0.7584 | +0.2757 |
| formula | 0.7416 | 0.7666 | +0.0250 |
| list-item | 0.7955 | 0.8676 | +0.0721 |
| page-footer | 0.7937 | 0.9277 | +0.1340 |
| page-header | 0.8218 | 0.7953 | -0.0265 |
| picture | 0.6314 | 0.2885 | -0.3429 |
| section-header | 0.8732 | 0.8389 | -0.0343 |
| table | 0.7977 | 0.7966 | -0.0011 |
| text | 0.8146 | 0.8489 | +0.0343 |
| title | 0.0000 | 0.7189 | +0.7189 |
| Overall | 0.8102 | 0.8356 | +0.0254 |
Model characteristics: 20M parameters (Docling RT-DETR) vs. 1.8M parameters (PyMuPDF-Layout with fusion)
Observed effects of global context augmentation:
- Substantial improvement on page-footer detection (+0.134)
- Near parity on table detection (Δ improves from -0.109 in Experiment 1 to -0.001)
- Persistent underperformance on picture classification, though reduced (-0.385 → -0.343)
- Overall F1 of 0.8356: +0.0254 over the Docling baseline and +0.0086 over the PDF-only variant
Computational efficiency
| Implementation | Parameters | F1 Score | GPU Dependency |
|---|---|---|---|
| Docling (RT-DETR) | 20M | 0.8102 | Required |
| PyMuPDF-Layout (PDF features) | 1.3M | 0.8270 | None |
| PyMuPDF-Layout (Fusion features) | 1.8M | 0.8356 | None |
The PDF-feature variant slightly exceeds the baseline's overall F1 (+0.017 absolute) with a 15.4× reduction in parameter count. The fusion variant improves overall F1 by +0.025 absolute with an 11.1× reduction. Both variants run entirely on CPU, with no GPU acceleration required.
Discussion
The results demonstrate that layout detection models trained on structured PDF features can match, and slightly exceed, the overall accuracy of vision-based models while operating at significantly reduced computational cost. The approach exhibits clear strengths (structured text elements, page furniture such as headers and footers) and limitations (visual elements, complex tables).
The fusion approach partially addresses global context deficiencies while maintaining computational efficiency. Picture classification remains a structural limitation of PDF-based feature extraction: pictures often contribute little or no text or vector structure to the PDF content stream, so PDF-internal features underdetermine them.
Future work
Additional benchmark evaluations are planned using alternative datasets and evaluation frameworks. We will continue to update performance metrics as new baselines become available.
For implementation details and usage instructions, see the PyMuPDF-Layout tutorial.
Benchmark last updated: October 2025