2010-04-24 13:31:31 +00:00
|
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
|
|
|
<html>
|
|
|
|
<head>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
2010-05-29 11:51:24 +00:00
|
|
|
<title>Programming with PDFMiner</title>
|
2010-04-24 13:31:31 +00:00
|
|
|
<style type="text/css"><!--
|
|
|
|
blockquote { background: #eeeeee; }
|
|
|
|
.comment { color: darkgreen; }
|
|
|
|
--></style>
|
|
|
|
</head><body>
|
2010-05-29 11:51:24 +00:00
|
|
|
<p>
|
2010-05-29 11:59:51 +00:00
|
|
|
<a href="index.html">[Back to PDFMiner homepage]</a>
|
2010-04-24 13:31:31 +00:00
|
|
|
|
2010-05-29 11:51:24 +00:00
|
|
|
<h1>Programming with PDFMiner</h1>
|
2010-04-24 13:31:31 +00:00
|
|
|
<p>
|
2010-05-29 11:51:24 +00:00
|
|
|
This document explains how to use PDFMiner as a library
|
2010-04-24 13:31:31 +00:00
|
|
|
from other applications.
|
|
|
|
<ul>
|
2010-10-17 05:14:40 +00:00
|
|
|
<li> <a href="#overview">Overview</a>
|
2010-04-24 13:31:31 +00:00
|
|
|
<li> <a href="#basic">Basic Usage</a>
|
|
|
|
<li> <a href="#layout">Layout Analysis</a>
|
|
|
|
<li> <a href="#toc">TOC Extraction</a>
|
2010-10-17 05:15:05 +00:00
|
|
|
<li> <a href="#more">more</a>
|
2010-04-24 13:31:31 +00:00
|
|
|
</ul>
|
|
|
|
|
2010-10-17 05:14:40 +00:00
|
|
|
<a name="overview">
|
|
|
|
<hr noshade>
|
|
|
|
<h2>Overview</h2>
|
|
|
|
<p>
|
2010-10-17 05:15:05 +00:00
|
|
|
<strong>PDF is evil.</strong> Although it is called a PDF
|
|
|
|
"document", it's nothing like Word or HTML. PDF is more like a
|
|
|
|
picture representation. PDF contents are just a bunch of
|
|
|
|
instructions that tell how to place the stuff at each exact
|
|
|
|
position on a display or paper. In most cases, it has no logical
|
|
|
|
structure such as sentences or paragraphs and it cannot adapt
|
|
|
|
itself when the paper size changes. PDFMiner attempts to
|
|
|
|
reconstruct some of those structures by guessing from its
|
|
|
|
positioning, but there's nothing guaranteed to work. Ugly, I
|
|
|
|
know. Again, PDF is evil.
|
|
|
|
|
|
|
|
<p>
|
|
|
|
Because a PDF file has such a big and complex structure,
|
|
|
|
parsing a PDF file as a whole is time and memory consuming. However,
|
|
|
|
not every part is needed for most PDF processing tasks. Therefore
|
|
|
|
PDFMiner takes a strategy of lazy parsing, which is to parse the
|
|
|
|
stuff only when it's necessary. To parse PDF files, you need to use at
|
|
|
|
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
|
|
|
|
These two objects are associated with each other.
|
|
|
|
<code>PDFParser</code> fetches data from a file,
|
2010-10-17 05:14:40 +00:00
|
|
|
and <code>PDFDocument</code> stores it. You'll also need
|
|
|
|
<code>PDFPageInterpreter</code> to process the page contents
|
|
|
|
and <code>PDFDevice</code> to translate it to whatever you need.
|
2010-10-17 05:15:05 +00:00
|
|
|
<code>PDFResourceManager</code> is used to store
|
|
|
|
shared resources such as fonts or images.
|
2010-10-17 05:14:40 +00:00
|
|
|
|
|
|
|
<p>
|
2010-10-17 05:15:05 +00:00
|
|
|
Figure 1 shows the relationship between the classes in PDFMiner.
|
2010-10-17 05:14:40 +00:00
|
|
|
|
|
|
|
<div align=center>
|
|
|
|
<img src="objrel.png"><br>
|
|
|
|
<small>Figure 1. Relationships between PDFMiner classes</small>
|
|
|
|
</div>
|
|
|
|
|
2010-04-24 13:31:31 +00:00
|
|
|
<a name="basic">
|
|
|
|
<hr noshade>
|
|
|
|
<h2>Basic Usage</h2>
|
|
|
|
<p>
|
|
|
|
A typical way to parse a PDF file is the following:
|
|
|
|
<blockquote><pre>
|
2010-05-05 05:51:22 +00:00
|
|
|
from pdfminer.pdfparser import PDFParser, PDFDocument
|
|
|
|
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
|
|
|
|
from pdfminer.pdfdevice import PDFDevice
|
|
|
|
|
2010-04-24 13:31:31 +00:00
|
|
|
<span class="comment"># Open a PDF file.</span>
|
|
|
|
fp = open('mypdf.pdf', 'rb')
|
|
|
|
<span class="comment"># Create a PDF parser object associated with the file object.</span>
|
|
|
|
parser = PDFParser(fp)
|
|
|
|
<span class="comment"># Create a PDF document object that stores the document structure.</span>
|
|
|
|
doc = PDFDocument()
|
|
|
|
<span class="comment"># Connect the parser and document objects.</span>
|
|
|
|
parser.set_document(doc)
|
|
|
|
doc.set_parser(parser)
|
2010-05-05 05:51:22 +00:00
|
|
|
<span class="comment"># Supply the password for initialization.</span>
|
2010-04-24 13:31:31 +00:00
|
|
|
<span class="comment"># (If no password is set, give an empty string.)</span>
|
|
|
|
doc.initialize(password)
|
|
|
|
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
|
|
|
|
if not doc.is_extractable:
|
|
|
|
raise PDFTextExtractionNotAllowed
|
|
|
|
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
|
|
|
|
rsrcmgr = PDFResourceManager()
|
|
|
|
<span class="comment"># Create a PDF device object.</span>
|
|
|
|
device = PDFDevice(rsrcmgr)
|
|
|
|
<span class="comment"># Create a PDF interpreter object.</span>
|
|
|
|
interpreter = PDFPageInterpreter(rsrcmgr, device)
|
|
|
|
<span class="comment"># Process each page contained in the document.</span>
|
|
|
|
for page in doc.get_pages():
|
|
|
|
interpreter.process_page(page)
|
|
|
|
</pre></blockquote>
|
|
|
|
|
|
|
|
<a name="layout">
|
|
|
|
<hr noshade>
|
|
|
|
<h2>Accessing Layout Objects</h2>
|
|
|
|
<p>
|
2010-10-17 05:14:40 +00:00
|
|
|
Here is a typical way to use the layout analysis function:
|
2010-04-24 13:31:31 +00:00
|
|
|
<blockquote><pre>
|
2010-05-05 05:51:22 +00:00
|
|
|
from pdfminer.layout import LAParams
|
|
|
|
from pdfminer.converter import PDFPageAggregator
|
|
|
|
|
2010-04-24 13:31:31 +00:00
|
|
|
<span class="comment"># Set parameters for analysis.</span>
|
|
|
|
laparams = LAParams()
|
|
|
|
<span class="comment"># Create a PDF page aggregator object.</span>
|
|
|
|
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
|
|
|
|
interpreter = PDFPageInterpreter(rsrcmgr, device)
|
|
|
|
for page in doc.get_pages():
|
|
|
|
interpreter.process_page(page)
|
2010-05-29 11:51:24 +00:00
|
|
|
<span class="comment"># receive the LTPage object for the page.</span>
|
|
|
|
layout = device.get_result()
|
2010-04-24 13:31:31 +00:00
|
|
|
</pre></blockquote>
|
|
|
|
|
2010-05-29 11:51:24 +00:00
|
|
|
The layout analyzer gives a "<code>LTPage</code>" object for each page
|
|
|
|
in the PDF document. The object contains child objects within the page,
|
|
|
|
forming a tree-like structure. Figure 2 shows the relationship between
|
|
|
|
these objects.
|
|
|
|
|
|
|
|
<div align=center>
|
|
|
|
<img src="layout.png"><br>
|
|
|
|
<small>Figure 2. Layout objects and its tree structure</small>
|
|
|
|
</div>
|
|
|
|
|
|
|
|
<dl>
|
|
|
|
<dt> <code>LTPage</code>
|
|
|
|
<dd> Represents an entire page. May contain child objects like
|
|
|
|
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
|
|
|
|
<code>LTPolygon</code> and <code>LTLine</code>.
|
|
|
|
|
|
|
|
<dt> <code>LTTextBox</code>
|
|
|
|
<dd> Represents a group of text chunks that can be contained in a rectangular area.
|
|
|
|
Note that this box is created by geometric analysis and does not necessarily
|
|
|
|
represents a logical boundary of the text.
|
|
|
|
It contains a list of <code>LTTextLine</code> objects.
|
|
|
|
|
|
|
|
<dt> <code>LTTextLine</code>
|
|
|
|
<dd> Contains a list of <code>LTChar</code> objects that represent
|
|
|
|
a single text line. The characters are aligned either horizontaly
|
|
|
|
or vertically, depending on the text's writing mode.
|
|
|
|
|
|
|
|
<dt> <code>LTChar</code>
|
|
|
|
<dt> <code>LTText</code>
|
|
|
|
<dd> These objects represent an actual letter in the text as a Unicode string.
|
|
|
|
Note that, while a <code>LTChar</code> object has actual boundaries,
|
|
|
|
<code>LTText</code> objects does not, as these are "virtual" characters,
|
|
|
|
inserted by a layout analyzer according to the relationship between two characters
|
|
|
|
(e.g. a space).
|
|
|
|
|
|
|
|
<dt> <code>LTFigure</code>
|
|
|
|
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
|
|
|
|
present figures or pictures by embedding yet another PDF document within a page.
|
|
|
|
Note that <code>LTFigure</code> objects can appear recursively.
|
|
|
|
|
|
|
|
<dt> <code>LTImage</code>
|
|
|
|
<dd> Represents an image object. Embedded images can be
|
|
|
|
in JPEG or other formats, but currently PDFMiner does not
|
|
|
|
pay much attention to graphical objects.
|
|
|
|
|
|
|
|
<dt> <code>LTLine</code>
|
|
|
|
<dd> Represents a single straight line shown in a page.
|
|
|
|
Could be used for separating texts or figures.
|
|
|
|
|
|
|
|
<dt> <code>LTRect</code>
|
|
|
|
<dd> Represents a rectangle shown in a page.
|
|
|
|
Could be used for framing another pictures or figures.
|
|
|
|
|
|
|
|
<dt> <code>LTPolygon</code>
|
|
|
|
<dd> Represents a polygon in a page.
|
|
|
|
</dl>
|
2010-04-24 13:31:31 +00:00
|
|
|
|
|
|
|
<a name="toc">
|
|
|
|
<hr noshade>
|
|
|
|
<h2>TOC Extraction</h2>
|
2010-05-29 11:51:24 +00:00
|
|
|
<p>
|
|
|
|
PDFMiner provides functions to access the document's table of contents
|
|
|
|
("Outlines").
|
2010-04-24 13:31:31 +00:00
|
|
|
|
|
|
|
<blockquote><pre>
|
2010-05-29 11:51:24 +00:00
|
|
|
from pdfminer.pdfparser import PDFParser, PDFDocument
|
|
|
|
|
2010-04-24 13:31:31 +00:00
|
|
|
fp = open('mypdf.pdf', 'rb')
|
|
|
|
parser = PDFParser(fp)
|
|
|
|
doc = PDFDocument()
|
|
|
|
parser.set_document(doc)
|
|
|
|
doc.set_parser(parser)
|
|
|
|
doc.initialize(password)
|
2010-05-29 11:51:24 +00:00
|
|
|
|
2010-04-24 13:31:31 +00:00
|
|
|
<span class="comment"># Get the outlines of the document.</span>
|
|
|
|
outlines = doc.get_outlines()
|
|
|
|
for (level,title,dest,a,se) in outlines:
|
|
|
|
print (level, title)
|
|
|
|
</pre></blockquote>
|
|
|
|
|
2010-05-29 11:51:24 +00:00
|
|
|
<p>
|
2010-10-17 05:14:40 +00:00
|
|
|
Some PDF documents use page numbers as destinations, while others
|
|
|
|
use page numbers and the physical location within the page. Since
|
|
|
|
PDF does not have a logical strucutre, and it does not provide a
|
|
|
|
way to refer to any in-page object from the outside, there's no
|
|
|
|
way to tell exactly which part of text these destinations are
|
|
|
|
refering to.
|
2010-04-24 13:31:31 +00:00
|
|
|
|
2010-10-17 05:15:05 +00:00
|
|
|
<a name="more">
|
|
|
|
<hr noshade>
|
|
|
|
<h2>More</h2>
|
|
|
|
|
|
|
|
<p>
|
|
|
|
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class
|
|
|
|
in order to process them differently / obtain other information.
|
|
|
|
|
2010-04-24 13:31:31 +00:00
|
|
|
<hr noshade>
|
|
|
|
<address>Yusuke Shinyama</address>
|
|
|
|
</body>
|