PyMuPDF Layout Tutorial
November 6, 2025

This blog post walks you through getting started with PyMuPDF Layout and explains what it is capable of. A basic knowledge of Python and the command line is required, along with experience installing packages from the Python Package Index (PyPI).
Installing PyMuPDF Layout
The first thing we need to do is to install the required PyMuPDF packages from PyPI.
Open up a command line or terminal window and install the following:
pip install pymupdf-layout
pip install pymupdf4llm
In a nutshell, PyMuPDF Layout detects the layout to extract, and PyMuPDF4LLM provides the output (Markdown, JSON or plain text).
PyMuPDF Layout Capability
There are 2 things we can do with PyMuPDF Layout:
- Extract structured data from documents as markdown, JSON or plain text
- Omit or include headers & footers when parsing documents
Let’s set up the Python coding environment and open a PDF to get started. Then we’ll move on to see how to do these things.
Register packages and open a PDF
First up let’s import the libraries and open a sample document:
import pymupdf.layout
import pymupdf4llm
doc = pymupdf.open("sample.pdf")
Note
In the above code, PyMuPDF Layout must be imported as shown, before importing PyMuPDF4LLM, to activate PyMuPDF’s layout feature and make it available to PyMuPDF4LLM.
Omitting the first line would cause execution of standard PyMuPDF4LLM - without the layout feature!
Extract the structured data
We’ve activated the PyMuPDF Layout library and loaded a document, so next let's extract the structured data. This is like a super-charged version of standard PyMuPDF4LLM, with Layout working behind the scenes and combining heuristics with machine learning for better extraction results.
As Markdown:
md = pymupdf4llm.to_markdown(doc)
As JSON:
json = pymupdf4llm.to_json(doc)
Or as plain text:
txt = pymupdf4llm.to_text(doc)
Finally we can save the output to an external file as follows:
from pathlib import Path
suffix = ".md"  # or ".json" or ".txt"
Path(doc.name).with_suffix(suffix).write_bytes(md.encode())
Headers & Footers
Many documents have header and footer information on each page, which you may or may not want to include. This information can be repetitive and unnecessary (e.g. the same logo, document title or page number is rarely important when extracting the document content).
PyMuPDF Layout is trained to detect these typical document elements and can omit them.
So in this case we can adjust our API calls to ignore these elements as follows:
md = pymupdf4llm.to_markdown(doc, header=False, footer=False)
txt = pymupdf4llm.to_text(doc, header=False, footer=False)
Note
Please note that page header / footer exclusion is not applicable to JSON output as it aims to always represent all data for the included pages.
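Since the header and footer switches apply to Markdown and text output but not to JSON, a script that handles all three formats may want to build its keyword arguments per format. The following is a minimal sketch only: the helper name extraction_kwargs and its keep_header_footer parameter are hypothetical and not part of PyMuPDF4LLM, and it assumes that passing True for header / footer keeps those elements.

```python
def extraction_kwargs(fmt: str, keep_header_footer: bool = False) -> dict:
    """Build keyword arguments for to_markdown / to_json / to_text.

    Hypothetical helper: only the `header` and `footer` parameter names come
    from the tutorial; the format names used here are illustrative.
    """
    if fmt not in ("markdown", "json", "text"):
        raise ValueError(f"unknown format: {fmt!r}")
    if fmt == "json":
        # Header / footer exclusion is not applicable to JSON output,
        # so no extra arguments are passed for it.
        return {}
    return {"header": keep_header_footer, "footer": keep_header_footer}


# Intended use (with pymupdf.layout and pymupdf4llm imported as shown earlier):
#   md = pymupdf4llm.to_markdown(doc, **extraction_kwargs("markdown"))
#   data = pymupdf4llm.to_json(doc, **extraction_kwargs("json"))
print(extraction_kwargs("markdown"))  # {'header': False, 'footer': False}
print(extraction_kwargs("json"))      # {}
```

Keeping this decision in one place avoids passing header or footer arguments to the JSON call by accident.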
Extending Capability
We can extend PyMuPDF Layout to work with PyMuPDF Pro, which increases our capability by allowing Office documents to be provided as input files. In this case all we have to do is add the import for PyMuPDF Pro alongside the PyMuPDF Layout import and unlock it, as shown:
import pymupdf.layout
import pymupdf.pro
import pymupdf4llm
pymupdf.pro.unlock()
Now we can happily load Office files and convert them as follows:
md = pymupdf4llm.to_markdown("sample.docx")
OCR support
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
If Tesseract is not installed on your platform, no OCR is attempted.
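Because OCR is skipped silently when Tesseract is missing, it can be useful to check for it before processing scanned documents. The sketch below only looks for a tesseract executable on the PATH, which is an assumption on our part: PyMuPDF may locate Tesseract and its language data differently, so treat the result as a hint rather than a guarantee. The function name tesseract_on_path is hypothetical.

```python
import shutil


def tesseract_on_path() -> bool:
    """Return True if a `tesseract` executable can be found on PATH.

    Assumption: presence on PATH is used here as a rough proxy for
    "Tesseract is installed"; this is not how PyMuPDF itself decides.
    """
    return shutil.which("tesseract") is not None


if tesseract_on_path():
    print("A tesseract executable was found on PATH.")
else:
    print("No tesseract executable on PATH; OCR may not be attempted.")
```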
Sub-Selecting Pages
By default, all pages of a document will be processed. However, even with the superior performance of PyMuPDF4LLM, this may take some time for large documents.
To address this, you can restrict the output generation to a single page or to a list of desired pages. Add one of the following parameters to the parameter lists of to_markdown / to_json / to_text:
- pages=n - to select a single page
- pages=[1, 3, 5, 7] - to select some list of pages. Any Python sequence is good for this like ranges, tuples etc.
All numbers must be 0-based valid page numbers in the document.
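Readers usually count pages from 1, so a small conversion step can help avoid off-by-one mistakes with the 0-based pages parameter. This is a minimal sketch: the helper to_zero_based is hypothetical, and the page count of 10 stands in for the real value (for an open document this would typically come from doc.page_count).

```python
def to_zero_based(pages_1based, page_count):
    """Convert 1-based page numbers to validated 0-based page numbers."""
    result = []
    for p in pages_1based:
        idx = p - 1
        if not 0 <= idx < page_count:
            raise ValueError(f"page {p} is outside 1..{page_count}")
        result.append(idx)
    return result


# Example: a 10-page document, pages 1, 3 and 5 as a reader would count them.
selected = to_zero_based([1, 3, 5], page_count=10)
print(selected)  # [0, 2, 4]

# Any sequence works for `pages`, e.g. a range covering the first three pages:
print(list(range(3)))  # [0, 1, 2]

# Intended use (with the imports shown earlier):
#   md = pymupdf4llm.to_markdown(doc, pages=selected)
```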
Wrapping Up
Let’s put it all together into one script and run it from the command line. The following script loads a file of your choice, produces all three output formats (omitting header and footer information from the Markdown and text output), and saves the results to disk.
import sys
from pathlib import Path
import pymupdf.layout
import pymupdf.pro
import pymupdf4llm
pymupdf.pro.unlock()
filename = sys.argv[1]
doc = pymupdf.open(filename)
md = pymupdf4llm.to_markdown(doc, header=False, footer=False)
json = pymupdf4llm.to_json(doc)
txt = pymupdf4llm.to_text(doc, header=False, footer=False)
Path(filename).with_suffix(".md").write_bytes(md.encode())
Path(filename).with_suffix(".json").write_bytes(json.encode())
Path(filename).with_suffix(".txt").write_bytes(txt.encode())
Let's save the code above as “test-layout.py” in a folder along with the PDF we want to load (“sample.pdf”), so our folder contents are:
test-layout.py
sample.pdf
Then to use the code above from the Command Line/Terminal, we just type:
python test-layout.py sample.pdf
This tells the Python code that we want to run it with the provided document (of course we can easily swap in whatever document we want to load at any time by providing a new one).
The folder should now contain the output and look like:
test-layout.py
sample.pdf
sample.md
sample.json
sample.txt
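The output names in this listing come from replacing the input file's suffix. Here is a small standalone sketch of that naming step, using only pathlib; the SUFFIXES mapping and the output_path helper are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Illustrative mapping from output format to file suffix.
SUFFIXES = {"markdown": ".md", "json": ".json", "text": ".txt"}


def output_path(source: str, fmt: str) -> Path:
    """Return the path next to `source` with the suffix for `fmt`."""
    return Path(source).with_suffix(SUFFIXES[fmt])


for fmt in SUFFIXES:
    print(output_path("sample.pdf", fmt))
# sample.md
# sample.json
# sample.txt
```

Note that with_suffix replaces only the last suffix, so an input such as report.v2.docx would give report.v2.md.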