This document explains how to use PDFMiner as a library from other applications.
A typical way to parse a PDF file is the following:
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# (Assumed location of the exception class in this PDFMiner release.)
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
doc = PDFDocument()
# Connect the parser and document objects.
parser.set_document(doc)
doc.set_parser(parser)
# Supply the password for initialization.
# (If no password is set, give an empty string.)
password = ''
doc.initialize(password)
# Check if the document allows text extraction. If not, abort.
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages():
    interpreter.process_page(page)
In PDFMiner, there are several Python classes involved in parsing a PDF file, as shown in Figure 1.
PDF documents are more like graphics than text documents. In most cases, they present no logical structure such as sentences or paragraphs. PDFMiner attempts to reconstruct some of this structure by performing basic layout analysis.
Here is a typical way to do it:
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
    interpreter.process_page(page)
    # Receive the LTPage object for the page.
    layout = device.get_result()

The layout analyzer gives an LTPage object for each page in the PDF document. The object contains child objects within the page, forming a tree-like structure. Figure 2 shows the relationship between these objects.
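LTPage
    Represents an entire page. May contain child objects such as LTTextBox, LTFigure, LTImage, LTRect, LTPolygon and LTLine.
LTTextBox
    Represents a group of text chunks that can be contained in a rectangular area. It contains a list of LTTextLine objects.
LTTextLine
    Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the text's writing mode.
LTChar, LTText
    Represent characters in the text. While an LTChar object has actual boundaries, LTText objects do not, as these are "virtual" characters inserted by the layout analyzer according to the relationship between two characters (e.g. a space).
LTFigure
    Represents an area used by a PDF Form object. LTFigure objects can appear recursively.
LTImage
    Represents an image object.
LTLine
    Represents a single straight line.
LTRect
    Represents a rectangle.
LTPolygon
    Represents a polygon.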
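Because the layout objects form a tree, a short recursive function is enough to visit every object on a page. The sketch below is not part of PDFMiner: walk_layout is a hypothetical helper, and it assumes that container objects such as LTPage, LTTextBox and LTTextLine can be iterated over to reach their children, while leaf objects such as LTChar cannot. The stand-in classes only mimic that shape so that the example runs without a PDF file.

```python
def walk_layout(obj, depth=0):
    """Yield (depth, class name) for obj and every object nested below it."""
    yield depth, type(obj).__name__
    try:
        children = iter(obj)
    except TypeError:
        # Leaf object: nothing to descend into.
        return
    for child in children:
        for item in walk_layout(child, depth + 1):
            yield item


# Stand-in classes, NOT the real PDFMiner ones. They only reproduce
# the nesting described in the list above.
class FakeChar(object):
    pass


class FakeTextLine(list):
    pass


class FakeTextBox(list):
    pass


class FakePage(list):
    pass


page = FakePage([FakeTextBox([FakeTextLine([FakeChar(), FakeChar()])])])
tree = list(walk_layout(page))
for depth, name in tree:
    print("  " * depth + name)
```

With a real page, the same call would be made on the object returned by device.get_result(), provided the iteration behaviour assumed here holds for the installed PDFMiner version.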
PDFMiner provides functions to access the document's table of contents ("Outlines").
from pdfminer.pdfparser import PDFParser, PDFDocument

# Open the PDF file and connect the parser and document objects.
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
# Supply the password (an empty string if none is set).
password = ''
doc.initialize(password)
# Get the outlines of the document.
outlines = doc.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title)
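The tuples produced by get_outlines() lend themselves to a printed table of contents. The following sketch is a hypothetical helper, not a PDFMiner function: it assumes each entry is a five-element tuple whose first two items are the nesting level and the title, as in the loop above, and it indents each title relative to the shallowest level it sees. The sample entries are made up for illustration.

```python
def format_outline(entries, indent="  "):
    """Return the outline as a list of indented title strings."""
    entries = list(entries)
    if not entries:
        return []
    # Indent relative to the shallowest level, so the result does not
    # depend on whether levels start at 0 or 1.
    base = min(entry[0] for entry in entries)
    return [indent * (entry[0] - base) + entry[1] for entry in entries]


# Made-up entries in the (level, title, dest, a, se) shape used above.
sample = [
    (1, "Introduction", None, None, None),
    (2, "Background", None, None, None),
    (2, "Scope", None, None, None),
    (1, "Results", None, None, None),
]
lines = format_outline(sample)
print("\n".join(lines))
```

Passing the outlines object from the example above in place of sample should give the same kind of listing for a real document, under the stated assumption about the tuple layout.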
In some PDF documents, destinations are referred to as page numbers. In other PDF documents, destinations are referred to as page numbers plus the location within the page. Since PDF does not provide a way to point to graphical objects in a page, normally these in-page destinations are specified by physical coordinates.
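To make the two kinds of destination concrete, the sketch below labels the operands of an explicit destination array as the PDF specification lays them out: a page, followed by a fit type and its coordinates. It is an illustration only. describe_destination and DEST_OPERANDS are hypothetical names, and the input is assumed to be already resolved into a plain Python list with the page as an integer and the fit type as a string, which is not necessarily how PDFMiner hands back the dest value.

```python
# Coordinate operands that follow each fit type in an explicit
# destination array, per the PDF specification.
DEST_OPERANDS = {
    "XYZ": ("left", "top", "zoom"),
    "Fit": (),
    "FitH": ("top",),
    "FitV": ("left",),
    "FitR": ("left", "bottom", "right", "top"),
    "FitB": (),
    "FitBH": ("top",),
    "FitBV": ("left",),
}


def describe_destination(dest):
    """Label the parts of a destination given as [page, kind, operands...]."""
    page, kind = dest[0], dest[1]
    names = DEST_OPERANDS[kind]
    # Pair each operand with its name; a value of None means "unchanged".
    coords = dict(zip(names, dest[2:]))
    return {"page": page, "kind": kind, "coords": coords}


# A destination that only names a page, and one that also gives a
# position within the page (sample values, not taken from a real file).
whole_page = describe_destination([0, "Fit"])
in_page = describe_destination([3, "XYZ", 72, 720, None])
print(whole_page)
print(in_page)
```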