PDFMiner Development Guide

This document describes how to use PDFMiner as a library from other applications.

Basic Usage
Accessing Layout Objects
TOC Extraction

Basic Usage

A typical way to parse a PDF file is the following:

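from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# The exception raised below is assumed to be defined in pdfminer.pdfinterp
# in this version of PDFMiner; adjust the import if your version differs.
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice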

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
doc = PDFDocument()
# Connect the parser and document objects.
parser.set_document(doc)
doc.set_parser(parser)
# Supply the password for initialization.
# (If no password is set, give an empty string.)
password = ''
doc.initialize(password)
# Check if the document allows text extraction. If not, abort.
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages():
    interpreter.process_page(page)

In PDFMiner, there are several Python classes involved in parsing a PDF file, as shown in Figure 1.

Figure 1. Relationships between PDFMiner objects

Accessing Layout Objects

A PDF document is closer to a graphic than to a text document. In most cases it presents no logical structure such as sentences or paragraphs. PDFMiner tries to reconstruct the original structure by performing basic layout analysis.
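
To give a feel for what reconstructing structure involves, here is a toy sketch that groups positioned characters into text lines. It is not PDFMiner's algorithm; the function name group_into_lines, the (x, y, text) tuple format, and the tolerance parameter are all hypothetical and exist only for illustration.

```python
def group_into_lines(chars, tolerance=2.0):
    """Group (x, y, text) tuples into lines of text, top to bottom.

    PDF coordinates grow upward, so a larger y means higher on the page.
    A character joins the current line when its y differs from the y of
    the line's first character by at most `tolerance`.
    """
    lines = []  # each entry is [line_y, [(x, text), ...]]
    for x, y, text in sorted(chars, key=lambda c: -c[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the characters from left to right.
    return [''.join(t for _, t in sorted(items)) for _, items in lines]


# Hypothetical character positions, deliberately given out of order.
sample_chars = [(16, 700.5, 'i'), (10, 700, 'H'),
                (22, 680, 'u'), (10, 680, 'y'), (16, 679.5, 'o')]
print(group_into_lines(sample_chars))
```

Real layout analysis also has to deal with word spacing, columns, and rotated text. In PDFMiner, the analysis is configured through LAParams and run by PDFPageAggregator: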

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
    interpreter.process_page(page)
    # Receive the LTPage object, the top-level layout object for this page.
    ltpage = device.get_result()

Layout analysis yields objects of the following classes, which are defined in pdfminer.layout:

LTPage
LTTextBox
LTTextLine
LTChar
LTText
LTFigure
LTImage
LTRect
LTPolygon
LTLine

TOC Extraction

The table of contents of a PDF document is stored as its outlines, which can be read as follows:

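from pdfminer.pdfparser import PDFParser, PDFDocument

# Open a PDF document.
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
# Give an empty string if no password is set.
password = ''
doc.initialize(password)
# Get the outlines of the document.
outlines = doc.get_outlines()
# Each entry is (level, title, destination, action, structure element).
for (level, title, dest, a, se) in outlines:
    print (level, title)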
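
The (level, title) pairs can be turned into an indented table of contents. The following is a minimal sketch that works on any iterable of 5-tuples shaped like the ones above. The helper format_outlines and the sample entries are hypothetical, and the sketch assumes that top-level entries have level 1.

```python
def format_outlines(outlines, indent='  '):
    """Return one indented line of text per outline entry."""
    lines = []
    for (level, title, dest, a, se) in outlines:
        # Assumed: top-level entries have level 1, so they get no indent.
        lines.append(indent * max(level - 1, 0) + title)
    return lines


# Hypothetical entries standing in for the result of doc.get_outlines().
sample_outlines = [
    (1, 'Introduction', None, None, None),
    (2, 'Background', None, None, None),
    (1, 'Usage', None, None, None),
]
for line in format_outlines(sample_outlines):
    print(line)
```

Returning a list instead of printing directly keeps the helper easy to reuse, for example when writing the table of contents to a file.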

Yusuke Shinyama