Why PDF-Native Extraction Beats Vision Models for Document Intelligence
November 21, 2025

Google recently released Gemini 3.0, and the document AI community is buzzing about its multimodal capabilities. Companies in the document processing space are already integrating it for parsing tasks, noting improvements in handwriting recognition and reading order detection.
However, early adopters have identified persistent limitations: struggles with complex layouts, inaccurate recognition of text formatting like strikethroughs, and challenges with bounding box citations.
This isn't surprising. Vision-based systems, even frontier models, face fundamental limitations when parsing PDFs.
They're solving the wrong problem.
The Vision Model Approach
Vision Language Models (VLMs) like Gemini 3.0 treat PDFs as images. They:
- Render pages to pixels
- Process the entire page through a unified neural network that simultaneously handles text recognition, layout detection, and semantic understanding
This end-to-end approach is powerful because the model optimizes directly for the final task. However, it comes with inherent tradeoffs:
- High computational cost: Processing high-dimensional image data requires substantially more parameters and GPU resources than text-based approaches
- No guarantee of text fidelity: Unlike reading embedded text directly, vision models can misrecognize characters, especially for formatting like strikethroughs or font variations
- Difficult error correction: When the model makes a mistake, you can't simply patch the output. Prompt adjustments help in some cases, but persistent failure modes may require fine-tuning the model.
- Over-parameterization for structured documents: The model uses billions of parameters to reconstruct information that's already explicitly encoded in the PDF
For scanned documents and handwritten notes where text must be inferred from pixels, this tradeoff makes sense. But for born-digital PDFs (the vast majority of business documents), it's inefficient.
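For concreteness, the rasterization step that every vision pipeline starts with looks roughly like this. A minimal sketch using PyMuPDF's standard rendering API; the downstream model call is omitted because it depends on the provider.

import pymupdf

# Rasterize every page, as a VLM pipeline must, before any text
# recognition or layout analysis can happen.
doc = pymupdf.open("document.pdf")
page_images = []
for page in doc:
    # 150 dpi is a common compromise; higher resolutions raise the
    # vision model's token and compute cost.
    pix = page.get_pixmap(dpi=150)
    page_images.append(pix.tobytes("png"))  # PNG bytes that would go to the model

# At this point all embedded text, fonts, vector graphics and annotations
# have been flattened into pixels, and the model has to reconstruct them.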
Why PDFs Aren't Images
PDFs contain structured data that vision models can't access:
- Text objects with embedded font information (bold, italic, monospace, …) or applied decorations like strikethroughs, underlines and highlights
- Vector graphics defining table borders, gridlines, and graphical elements
- Image objects representing company logos or complementing vector graphics
- Annotations marking comments, highlights, and form fields
- Metadata describing document structure, bookmarks, and reading order
When you render a PDF to an image, you destroy this information. A vision model must then reconstruct it from pixels (a lossy, computationally expensive process).
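All of that structure is reachable through PyMuPDF's standard API before any rendering happens. A minimal sketch of reading it directly (these are plain PyMuPDF calls, not the PyMuPDF-Layout model itself):

import pymupdf

doc = pymupdf.open("document.pdf")
page = doc[0]

# Text objects: spans with font name, size and style flags per styled run
text_dict = page.get_text("dict")

# Vector graphics: the lines and rectangles that form table grids and rules
drawings = page.get_drawings()

# Annotations: comments, highlights, form-field widgets
annotations = list(page.annots())

# Document-level structure: metadata and bookmarks (table of contents)
info = doc.metadata
toc = doc.get_toc()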
The PDF-Native Advantage
PyMuPDF-Layout extracts information directly from PDF internals, providing efficient document processing without the overhead:
import pymupdf.layout
import pymupdf4llm

# Extract structured content as markdown
doc = pymupdf.open("document.pdf")
md_text = pymupdf4llm.to_markdown(doc)

# Or extract as JSON
json_text = pymupdf4llm.to_json(doc)

This approach delivers:
1. Perfect text fidelity: We extract actual text strings with their formatting properties (no OCR uncertainty). Strikethrough text? Bold vs. italic? Monospaced code? We read it directly from the PDF structure (see the sketch after this list).
2. Accurate table detection: Our GNN model identifies table boundaries, then PyMuPDF extracts rows and columns using vector graphics analysis (not pixel pattern matching). We recently achieved 97% table structure detection on complex financial documents by analyzing embedded gridlines. This document extraction approach preserves cell-level precision that vision models often miss.
3. Resource efficiency: PyMuPDF-Layout runs on CPU with just 1.8 million parameters (including multimodal fusion). Gemini 3.0 requires GPU inference with billions of parameters. For businesses processing thousands of documents daily, the cost difference is substantial. We deliver sub-second processing times on standard hardware.
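To make points 1 and 2 concrete, here is a minimal sketch using plain PyMuPDF calls: the library's span flags and built-in table finder, not the GNN model described above. (Strikethroughs and similar decorations are derived from vector graphics and annotations rather than a single flag, so they're left out here.)

import pymupdf

doc = pymupdf.open("document.pdf")
page = doc[0]

# 1. Text fidelity: style information is stored in span flags
#    (bit values per PyMuPDF: 2 = italic, 8 = monospaced, 16 = bold).
for block in page.get_text("dict")["blocks"]:
    for line in block.get("lines", []):   # image blocks carry no lines
        for span in line["spans"]:
            styles = []
            if span["flags"] & 2:
                styles.append("italic")
            if span["flags"] & 8:
                styles.append("mono")
            if span["flags"] & 16:
                styles.append("bold")
            if styles:
                print(span["text"], "->", ", ".join(styles))

# 2. Table structure: the built-in table finder uses the page's drawn
#    lines by default, so cell boundaries come from the PDF's own gridlines.
for table in page.find_tables().tables:
    rows = table.extract()  # list of rows, each a list of cell strings
    print(f"{table.row_count} x {table.col_count} table, first row: {rows[0] if rows else []}")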
What About Scanned Documents?
PyMuPDF-Layout handles scanned documents through built-in Tesseract-OCR integration. When our system detects that a page would benefit from OCR, it automatically invokes Tesseract to extract text, then applies the same layout analysis as for born-digital PDFs. We're also adding integrations with additional OCR engines like RapidOCR to provide more flexibility.
For documents with significant handwritten content or highly degraded scans, vision models may offer advantages. But for standard scanned business documents (invoices, contracts, reports), our OCR-enhanced pipeline delivers comparable results without requiring GPU infrastructure.
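PyMuPDF-Layout triggers OCR automatically. For illustration only, a comparable fallback can be built from PyMuPDF's Tesseract binding; this sketch assumes a local Tesseract installation, and the 20-character threshold is an arbitrary heuristic.

import pymupdf

def page_text_with_ocr_fallback(page: pymupdf.Page) -> str:
    """Return embedded text, falling back to Tesseract OCR for scanned pages."""
    text = page.get_text()
    if len(text.strip()) > 20:   # page already carries usable embedded text
        return text
    # Scanned page: build an OCR text page via PyMuPDF's Tesseract binding.
    # full=True OCRs the whole page image; dpi controls rasterization quality.
    tp = page.get_textpage_ocr(language="eng", dpi=300, full=True)
    return page.get_text(textpage=tp)

doc = pymupdf.open("scanned_or_mixed.pdf")
print(page_text_with_ocr_fallback(doc[0]))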
Our Strategy: Complementary, Not Competitive
We're not trying to be as good as VLMs at everything. Our goal is to match their performance on structured document data extraction while using fewer resources, by exploiting PDF features that vision models can't access.
We're currently training our next-generation model using a teacher-student approach:
- Train on public datasets (DocLayNet, PubLayNet: 400K pages)
- Train on private datasets (500K pages of reports, presentations, etc.)
- Benchmark and improve performance through comparative analysis with VLMs
This lets us combine the efficiency of PDF-native extraction with the flexibility of vision-based understanding (without requiring GPU infrastructure for inference).
The Bottom Line
If you're parsing born-digital PDFs (invoices, financial reports, contracts, technical documentation), PDF-native document extraction software is faster, more accurate, and dramatically cheaper than vision models.
If you're dealing with scanned documents, PyMuPDF-Layout's OCR integration handles most use cases without the overhead of vision models.
Do not reconstruct what you can read directly.
Get Started
You can download PyMuPDF-Layout from PyPI, or try the live demo.