PyMuPDF.IO

PyMuPDF Layout

10× faster PDF parsing with layout analysis.
Trained on structure, not images. CPU-only.

Trusted by teams from startups to enterprises worldwide

Oracle, DocuSign, Mistral, Harvey
8.6K Stars on GitHub

Open Source with Flexible Licensing

PyMuPDF is built on open collaboration and always will be. Our code is freely available on GitHub under the AGPL license, welcoming contributions from developers worldwide. For projects requiring different terms, we also offer commercial licensing through Artifex.
Get PyMuPDF Pro: Office Support + RAG/LLM + Layout

Everything in Open Source, Plus Three Powerful Extensions

Keep the speed and accuracy you love. Get the full package: Office document support, PyMuPDF4LLM for RAG pipelines, and PyMuPDF Layout for advanced analysis, all with commercial licensing for production.

PyMuPDF Pro for Office Documents

PyMuPDF Pro supports a wide range of Office file formats, including DOC/DOCX, PPT/PPTX, XLS/XLSX, as well as HWP and HWPX, the widely used formats for Korean word processing.


PyMuPDF4LLM for RAG Integrations

PyMuPDF integrates with LangChain, Llamaparse, and more. Prepare your data for RAG solutions and give your LLM data that your users can trust.


Advanced Layout Analysis Included

PyMuPDF Layout delivers enterprise-grade document structure extraction without the enterprise overhead. Built into PyMuPDF Pro, it analyzes PDF internals directly: no GPUs, no cloud dependencies, just CPU performance that's 10× faster than comparable tools.


Get Started with PyMuPDF

Extract Text from a PDF
Extract Image from a PDF
Merging PDF files
Adding a watermark to a PDF
import pymupdf

# open the document and the text output; both are closed automatically
with pymupdf.open("a.pdf") as doc, open("output.txt", "wb") as out:
    for page in doc:  # iterate the document pages
        text = page.get_text().encode("utf8")  # get plain text, encoded as UTF-8
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
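Because each page's text is followed by a form feed, the output file can be split back into per-page strings later. The sketch below uses only the Python standard library. The helper names split_pages and read_pages are hypothetical and not part of PyMuPDF, and it assumes the extracted text itself contains no form feed characters.

```python
from pathlib import Path

FORM_FEED = b"\x0c"  # same byte as bytes((12,)) in the extraction script


def split_pages(data: bytes) -> list[str]:
    """Split form-feed-delimited UTF-8 bytes into one string per page.

    Hypothetical helper: assumes every page is followed by exactly one
    form feed and that page text contains no form feed of its own.
    """
    chunks = data.split(FORM_FEED)
    # Every page ends with a delimiter, so the final chunk is empty: drop it.
    if chunks and chunks[-1] == b"":
        chunks.pop()
    return [chunk.decode("utf8") for chunk in chunks]


def read_pages(path: str = "output.txt") -> list[str]:
    """Read the file written by the extraction script and split it by page."""
    return split_pages(Path(path).read_bytes())
```

For example, split_pages(b"first\x0csecond\x0c") returns ['first', 'second'], and a page with no text comes back as an empty string, so list positions still line up with page numbers.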