PyMuPDF.IO

PyMuPDF Layout Tutorial

November 6, 2025

This blog post walks you through getting started with PyMuPDF Layout and explains what it can do. You will need a basic knowledge of Python and the command line, along with some experience installing packages from the Python Package Index (PyPI).

Installing PyMuPDF Layout

The first thing we need to do is to install the required PyMuPDF packages from PyPI.

Open up a command line or terminal window and install the following:

pip install pymupdf-layout
pip install pymupdf4llm

In a nutshell, PyMuPDF Layout detects the layout to extract, and PyMuPDF4LLM provides the output (Markdown, JSON or plain text).


PyMuPDF Layout Capability

There are 2 things we can do with PyMuPDF Layout:

  1. Extract structured data from documents as markdown, JSON or plain text
  2. Omit or include headers & footers when parsing documents

Let’s set up the Python coding environment and open a PDF, then we’ll move on to see how to do these things.

Register packages and open a PDF

First up let’s import the libraries and open a sample document:

import pymupdf.layout
import pymupdf4llm
doc = pymupdf.open("sample.pdf")
Note

In the code above, PyMuPDF Layout must be imported as shown, before PyMuPDF4LLM is imported. This activates PyMuPDF’s layout feature and makes it available to PyMuPDF4LLM.

Omitting the first line means standard PyMuPDF4LLM runs, without the layout feature!

Extract the structured data

We’ve activated the PyMuPDF Layout library and loaded a document, so next let’s extract the structured data. This is now like a supercharged version of standard PyMuPDF4LLM, with Layout working behind the scenes and combining heuristics with machine learning for better extraction results.

As Markdown:

md = pymupdf4llm.to_markdown(doc)

As JSON:

json = pymupdf4llm.to_json(doc)

Or as plain text:

txt = pymupdf4llm.to_text(doc)

Finally we can save the output to an external file as follows:

from pathlib import Path
suffix = ".md"  # or ".json" or ".txt"
Path(doc.name).with_suffix(suffix).write_bytes(md.encode())
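
If you save several formats, the same step can be wrapped in a small helper. The following is only a sketch: save_output is a hypothetical name and not part of PyMuPDF4LLM, and it assumes the converter output is a Python string, as the md.encode() call above suggests.

```python
from pathlib import Path


def save_output(text: str, source: str, suffix: str) -> Path:
    """Write extracted text next to the source file, swapping its extension.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    target = Path(source).with_suffix(suffix)
    # State the encoding explicitly so the file does not depend on
    # the platform's default encoding.
    target.write_text(text, encoding="utf-8")
    return target
```

With the variables from above, this would be called as save_output(md, doc.name, ".md").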

Headers & Footers

Many PDFs carry header and footer information on every page, which you may or may not want to include. This information is often repetitive and simply not needed (for example, a recurring logo, document title or page number rarely matters when extracting the document content).

PyMuPDF Layout is trained to detect these typical document elements and can omit them.

So in this case we can adjust our API calls to ignore these elements as follows:

md = pymupdf4llm.to_markdown(doc, header=False, footer=False)
txt = pymupdf4llm.to_text(doc, header=False, footer=False)
Note

Please note that page header / footer exclusion is not applicable to JSON output as it aims to always represent all data for the included pages.
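
When one script produces all three formats, it can help to keep this rule in a single place. The sketch below uses a hypothetical helper, extraction_options, that is not part of PyMuPDF4LLM. It assumes only the header and footer parameters shown above and returns no options at all for JSON.

```python
def extraction_options(fmt: str, keep_header_footer: bool) -> dict:
    """Build keyword arguments for the converter that matches `fmt`.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    if fmt not in ("markdown", "json", "text"):
        raise ValueError(f"unknown output format: {fmt!r}")
    if fmt == "json":
        # Header / footer exclusion is not applicable to JSON output,
        # so no options are passed for it.
        return {}
    return {"header": keep_header_footer, "footer": keep_header_footer}
```

The result can be unpacked into the call, for example pymupdf4llm.to_markdown(doc, **extraction_options("markdown", False)).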


Extending Capability

We can extend PyMuPDF Layout to work with PyMuPDF Pro, which adds Office documents as supported input files. All we have to do is add the import for PyMuPDF Pro next to the PyMuPDF Layout import and unlock it:

import pymupdf.layout
import pymupdf.pro
import pymupdf4llm
pymupdf.pro.unlock()

Now we can happily load Office files and convert them as follows:

md = pymupdf4llm.to_markdown("sample.docx")

OCR support

The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.

If Tesseract is not installed on your platform, no OCR is attempted.
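
To get a rough idea of whether OCR could run on your machine, you can look for common signs of a Tesseract installation. This is a sketch under assumptions: this post does not describe how the library itself locates Tesseract, so the check below only reports whether a tesseract executable is on the PATH and whether the TESSDATA_PREFIX environment variable is set. The function name tesseract_hints is made up for this example.

```python
import os
import shutil


def tesseract_hints() -> dict:
    """Collect common signs of a local Tesseract installation.

    Hypothetical helper; it does not replicate the library's own detection.
    """
    return {
        # Full path of the tesseract executable, or None if it is not on PATH.
        "executable": shutil.which("tesseract"),
        # Folder with language data, if the environment variable is set.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }
```

If both values are None, Tesseract is probably not installed, and you should expect no OCR to be attempted.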

Sub-Selecting Pages

By default, all pages of a document will be processed. However, even with the superior performance of PyMuPDF4LLM, this may take some time for large documents.

To address this, you can restrict the output to a single page or to a list of pages. Add one of the following to the parameters of to_markdown / to_json / to_text:

  • pages=n - to select a single page
  • pages=[1, 3, 5, 7] - to select a list of pages. Any Python sequence works here, such as a range or a tuple.

All numbers must be 0-based valid page numbers in the document.
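
Because page numbers are 0-based, the numbers people read in a PDF viewer are off by one. The following sketch converts and validates them before they are passed on. The helper name to_zero_based is hypothetical and not part of PyMuPDF4LLM.

```python
def to_zero_based(page_numbers, page_count):
    """Convert 1-based page numbers (as shown in a viewer) to 0-based ones.

    Hypothetical helper for illustration; raises ValueError for any
    number that does not exist in a document with `page_count` pages.
    """
    result = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        result.append(number - 1)
    return result
```

For example, pymupdf4llm.to_markdown(doc, pages=to_zero_based([1, 2, 10], doc.page_count)) would process the first, second and tenth page, and pages=range(5) selects the first five pages directly.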


Wrapping Up

Let’s put it all together into one script that we can run from the command line. The following script loads a file of your choice, converts it to all three output formats (leaving out header and footer information for Markdown and plain text), and saves the results to disk.

import sys
from pathlib import Path

import pymupdf.layout
import pymupdf.pro
import pymupdf4llm
pymupdf.pro.unlock()

filename = sys.argv[1]
doc = pymupdf.open(filename)

md = pymupdf4llm.to_markdown(doc, header=False, footer=False)
json = pymupdf4llm.to_json(doc)
txt = pymupdf4llm.to_text(doc, header=False, footer=False)

Path(filename).with_suffix(".md").write_bytes(md.encode())
Path(filename).with_suffix(".json").write_bytes(json.encode())
Path(filename).with_suffix(".txt").write_bytes(txt.encode())

Let’s save the code above as “test-layout.py” in a folder along with the PDF we want to load (“sample.pdf”), so our folder contents are:

test-layout.py
sample.pdf

Then, to run the code above from the command line or terminal, we just type:

python test-layout.py sample.pdf

This tells the Python code that we want to run it with the provided document (of course we can easily just swap out whatever document we want to load at any time by providing a new one).
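
The script reads sys.argv[1] directly, so running it without an argument ends in an IndexError. If you prefer a friendlier message, a small check like the following sketch could run first. The function name input_path is hypothetical, and the usage text assumes the script name used in this post.

```python
from pathlib import Path


def input_path(argv: list[str]) -> Path:
    """Return the document path from the command-line arguments.

    Hypothetical helper; exits with a short message if the argument
    is missing or does not point to an existing file.
    """
    if len(argv) != 2:
        raise SystemExit("usage: python test-layout.py <document>")
    path = Path(argv[1])
    if not path.is_file():
        raise SystemExit(f"file not found: {path}")
    return path
```

In the script, filename = sys.argv[1] would then become filename = input_path(sys.argv).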

The folder should now contain the output and look like:

test-layout.py
sample.pdf
sample.md
sample.json
sample.txt

Happy Coding! 🙂