Programming with PDFMiner

This document explains how to use PDFMiner as a library from other applications.

Basic Usage
Accessing Layout Objects
TOC Extraction

Basic Usage

A typical way to parse a PDF file is the following:

from pdfminer.pdfparser import PDFParser, PDFDocument
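from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# PDFTextExtractionNotAllowed is assumed to live in pdfminer.pdfinterp;
# its location may differ between PDFMiner versions.
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed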
from pdfminer.pdfdevice import PDFDevice

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
doc = PDFDocument()
# Connect the parser and document objects.
parser.set_document(doc)
doc.set_parser(parser)
# Supply the password for initialization.
# (If no password is set, give an empty string.)
password = ''
doc.initialize(password)
# Check if the document allows text extraction. If not, abort.
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages():
    interpreter.process_page(page)

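Parsing a PDF file in PDFMiner involves several Python classes, as shown in Figure 1. The parser reads data from the file, the document object stores the document structure, the resource manager stores shared resources, and the interpreter processes each page and passes the result to the device.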

Figure 1. Relationships between PDFMiner objects

Accessing Layout Objects

PDF documents are more like graphics than text documents. In most cases, they present no logical structure such as sentences or paragraphs. PDFMiner attempts to reconstruct some of that structure by performing basic layout analysis.

Here is a typical way to do it:

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
    interpreter.process_page(page)
    # Receive the LTPage object for the page.
    layout = device.get_result()

The layout analyzer returns an LTPage object for each page in the PDF document. This object contains the child objects found within the page, forming a tree-like structure. Figure 2 shows the relationship between these objects.

Figure 2. Layout objects and their tree structure

LTPage: Represents an entire page. May contain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTPolygon and LTLine.
LTTextBox: Represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects.
LTTextLine: Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the text's writing mode.
LTChar, LTText: These objects represent an actual letter in the text as a Unicode string. Note that, while an LTChar object has actual boundaries, LTText objects do not, as these are "virtual" characters inserted by the layout analyzer according to the relationship between two characters (e.g. a space).
LTFigure: Represents an area used by PDF Form objects. PDF Forms can be used to present figures or pictures by embedding yet another PDF document within a page. Note that LTFigure objects can appear recursively.
LTImage: Represents an image object. Embedded images can be in JPEG or other formats, but currently PDFMiner does not pay much attention to graphical objects.
LTLine: Represents a single straight line shown on a page. Could be used for separating text or figures.
LTRect: Represents a rectangle shown on a page. Could be used for framing other pictures or figures.
LTPolygon: Represents a polygon in a page.
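
Because the layout objects form a tree, a recursive traversal is one way to visit every object on a page. The following is a minimal sketch of such a traversal, not PDFMiner's own code. Leaf, Container, and walk are hypothetical names introduced for illustration, and the stand-in tree is built by hand so the example runs without a PDF file. It assumes that objects holding children can be iterated over; with a real LTPage one would probably use type(obj).__name__ in place of the name attribute.

```python
# Stand-in classes (hypothetical): they only mimic the shape of the layout
# tree described above and are not part of PDFMiner.
class Leaf(object):
    """A layout object without children, e.g. a character or a rectangle."""
    def __init__(self, name):
        self.name = name


class Container(Leaf):
    """A layout object that holds children and can be iterated over."""
    def __init__(self, name, children):
        Leaf.__init__(self, name)
        self.children = list(children)

    def __iter__(self):
        return iter(self.children)


def walk(obj, depth=0):
    """Yield (depth, name) for obj and all of its descendants, depth-first."""
    yield (depth, obj.name)
    # Assumption: only objects with children are iterable.
    if hasattr(obj, '__iter__'):
        for child in obj:
            for item in walk(child, depth + 1):
                yield item


# A hand-built tree that follows the hierarchy listed above.
page = Container('LTPage', [
    Container('LTTextBox', [
        Container('LTTextLine', [Leaf('LTChar'), Leaf('LTText')]),
    ]),
    Container('LTFigure', [Leaf('LTImage')]),
    Leaf('LTRect'),
])

# Print the tree with one level of indentation per depth.
for (depth, name) in walk(page):
    print('  ' * depth + name)
```

Since LTFigure objects can appear recursively, a depth-first walk of this kind handles nested figures without any special casing.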

TOC Extraction

PDFMiner provides functions to access the document's table of contents ("Outlines").

from pdfminer.pdfparser import PDFParser, PDFDocument

fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
# Supply the password (give an empty string if no password is set).
password = ''
doc.initialize(password)

# Get the outlines of the document.
outlines = doc.get_outlines()
for (level,title,dest,a,se) in outlines:
    print (level, title)
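
The level value in each outline entry can be used to render the table of contents with indentation. The sketch below is an illustration only: format_outlines is a hypothetical helper, the sample list is hand-written stand-in data shaped like the tuples in the loop above, and it assumes that top-level entries have level 1.

```python
def format_outlines(outlines, indent='  '):
    """Return one indented line of text per outline entry."""
    lines = []
    for (level, title, dest, a, se) in outlines:
        # Assumption: top-level entries have level 1, so they get no indent.
        lines.append(indent * max(level - 1, 0) + title)
    return lines


# Stand-in data (hypothetical); real entries would come from the document.
sample = [
    (1, 'Introduction', None, None, None),
    (2, 'Background', None, None, None),
    (1, 'Usage', None, None, None),
]

for line in format_outlines(sample):
    print(line)
```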

In some PDF documents, a destination is just a page number. In others, it is a page number plus a location within the page. Since PDF does not provide a way to point to graphical objects on a page, these in-page destinations are normally specified by physical coordinates.
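
To make the difference between page-only and in-page destinations concrete, the following sketch interprets an explicit destination in the form the PDF specification defines: a page followed by a type name and zero or more coordinates. It is a simplified illustration under stated assumptions. describe_destination and COORD_NAMES are hypothetical names, and plain integers and strings stand in for the page references and PDF names that a real document contains, so the values PDFMiner returns would need to be resolved and converted first.

```python
# Coordinate operands that follow each destination type in an explicit
# destination array, as defined by the PDF specification.
COORD_NAMES = {
    'XYZ': ('left', 'top', 'zoom'),
    'Fit': (),
    'FitH': ('top',),
    'FitV': ('left',),
    'FitR': ('left', 'bottom', 'right', 'top'),
    'FitB': (),
    'FitBH': ('top',),
    'FitBV': ('left',),
}


def describe_destination(dest):
    """Split a destination list into (page, kind, coordinates)."""
    page, kind = dest[0], dest[1]
    # Pair each operand after the type name with its meaning.
    coords = dict(zip(COORD_NAMES[kind], dest[2:]))
    return (page, kind, coords)


# Stand-in values (hypothetical): the page is a plain number here.
print(describe_destination([5, 'Fit']))
print(describe_destination([3, 'XYZ', 72, 720, None]))
```

A destination of type Fit or FitB carries no coordinates and so identifies only the page, while the other types add a position within it.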

Yusuke Shinyama