224 lines
8.4 KiB
HTML
224 lines
8.4 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
|
<html>
|
|
<head>
|
|
<link rel="stylesheet" type="text/css" href="style.css">
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
<title>Programming with PDFMiner</title>
|
|
</head>
|
|
<body>
|
|
|
|
<div align=right class=lastmod>
|
|
<!-- hhmts start -->
|
|
Last Modified: Mon Mar 24 11:49:28 UTC 2014
|
|
<!-- hhmts end -->
|
|
</div>
|
|
|
|
<p>
|
|
<a href="index.html">[Back to PDFMiner homepage]</a>
|
|
|
|
<h1>Programming with PDFMiner</h1>
|
|
<p>
|
|
This page explains how to use PDFMiner as a library
|
|
from other applications.
|
|
<ul>
|
|
<li> <a href="#overview">Overview</a>
|
|
<li> <a href="#basic">Basic Usage</a>
|
|
<li> <a href="#layout">Performing Layout Analysis</a>
|
|
<li> <a href="#tocextract">Obtaining Table of Contents</a>
|
|
<li> <a href="#extend">Extending Functionality</a>
|
|
</ul>
|
|
|
|
<h2><a name="overview">Overview</a></h2>
|
|
<p>
|
|
<strong>PDF is evil.</strong> Although it is called a PDF
|
|
"document", it's nothing like Word or HTML document. PDF is more
|
|
like a graphic representation. PDF contents are just a bunch of
|
|
instructions that tell how to place the stuff at each exact
|
|
position on a display or paper. In most cases, it has no logical
|
|
structure such as sentences or paragraphs and it cannot adapt
|
|
itself when the paper size changes. PDFMiner attempts to
|
|
reconstruct some of those structures by guessing from its
|
|
positioning, but there's nothing guaranteed to work. Ugly, I
|
|
know. Again, PDF is evil.
|
|
|
|
<p>
|
|
[More technical details about the internal structure of PDF:
|
|
"How to Extract Text Contents from PDF Manually"
|
|
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
|
|
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
|
|
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>]
|
|
|
|
<p>
|
|
Because a PDF file has such a big and complex structure,
|
|
parsing a PDF file as a whole is time and memory consuming. However,
|
|
not every part is needed for most PDF processing tasks. Therefore
|
|
PDFMiner takes a strategy of lazy parsing, which is to parse the
|
|
stuff only when it's necessary. To parse PDF files, you need to use at
|
|
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
|
|
These two objects are associated with each other.
|
|
<code>PDFParser</code> fetches data from a file,
|
|
and <code>PDFDocument</code> stores it. You'll also need
|
|
<code>PDFPageInterpreter</code> to process the page contents
|
|
and <code>PDFDevice</code> to translate it to whatever you need.
|
|
<code>PDFResourceManager</code> is used to store
|
|
shared resources such as fonts or images.
|
|
|
|
<p>
|
|
Figure 1 shows the relationship between the classes in PDFMiner.
|
|
|
|
<div align=center>
|
|
<img src="objrel.png"><br>
|
|
<small>Figure 1. Relationships between PDFMiner classes</small>
|
|
</div>
|
|
|
|
<h2><a name="basic">Basic Usage</a></h2>
|
|
<p>
|
|
A typical way to parse a PDF file is the following:
|
|
<blockquote><pre>
|
|
from pdfminer.pdfparser import PDFParser
|
|
from pdfminer.pdfdocument import PDFDocument
|
|
from pdfminer.pdfpage import PDFPage
|
|
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
|
|
from pdfminer.pdfinterp import PDFResourceManager
|
|
from pdfminer.pdfinterp import PDFPageInterpreter
|
|
from pdfminer.pdfdevice import PDFDevice
|
|
|
|
<span class="comment"># Open a PDF file.</span>
|
|
fp = open('mypdf.pdf', 'rb')
|
|
<span class="comment"># Create a PDF parser object associated with the file object.</span>
|
|
parser = PDFParser(fp)
|
|
<span class="comment"># Create a PDF document object that stores the document structure.</span>
|
|
<span class="comment"># Supply the password for initialization.</span>
|
|
document = PDFDocument(parser, password)
|
|
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
|
|
if not document.is_extractable:
|
|
raise PDFTextExtractionNotAllowed
|
|
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
|
|
rsrcmgr = PDFResourceManager()
|
|
<span class="comment"># Create a PDF device object.</span>
|
|
device = PDFDevice(rsrcmgr)
|
|
<span class="comment"># Create a PDF interpreter object.</span>
|
|
interpreter = PDFPageInterpreter(rsrcmgr, device)
|
|
<span class="comment"># Process each page contained in the document.</span>
|
|
for page in PDFPage.create_pages(document):
|
|
interpreter.process_page(page)
|
|
</pre></blockquote>
|
|
|
|
<h2><a name="layout">Performing Layout Analysis</a></h2>
|
|
<p>
|
|
Here is a typical way to use the layout analysis function:
|
|
<blockquote><pre>
|
|
from pdfminer.layout import LAParams
|
|
from pdfminer.converter import PDFPageAggregator
|
|
|
|
<span class="comment"># Set parameters for analysis.</span>
|
|
laparams = LAParams()
|
|
<span class="comment"># Create a PDF page aggregator object.</span>
|
|
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
|
|
interpreter = PDFPageInterpreter(rsrcmgr, device)
|
|
for page in PDFPage.create_pages(document):
|
|
interpreter.process_page(page)
|
|
<span class="comment"># receive the LTPage object for the page.</span>
|
|
layout = device.get_result()
|
|
</pre></blockquote>
|
|
|
|
A layout analyzer returns a <code>LTPage</code> object for each page
|
|
in the PDF document. This object contains child objects within the page,
|
|
forming a tree structure. Figure 2 shows the relationship between
|
|
these objects.
|
|
|
|
<div align=center>
|
|
<img src="layout.png"><br>
|
|
<small>Figure 2. Layout objects and its tree structure</small>
|
|
</div>
|
|
|
|
<dl>
|
|
<dt> <code>LTPage</code>
|
|
<dd> Represents an entire page. May contain child objects like
|
|
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
|
|
<code>LTCurve</code> and <code>LTLine</code>.
|
|
|
|
<dt> <code>LTTextBox</code>
|
|
<dd> Represents a group of text chunks that can be contained in a rectangular area.
|
|
Note that this box is created by geometric analysis and does not necessarily
|
|
represents a logical boundary of the text.
|
|
It contains a list of <code>LTTextLine</code> objects.
|
|
<code>get_text()</code> method returns the text content.
|
|
|
|
<dt> <code>LTTextLine</code>
|
|
<dd> Contains a list of <code>LTChar</code> objects that represent
|
|
a single text line. The characters are aligned either horizontaly
|
|
or vertically, depending on the text's writing mode.
|
|
<code>get_text()</code> method returns the text content.
|
|
|
|
<dt> <code>LTChar</code>
|
|
<dt> <code>LTAnno</code>
|
|
<dd> Represent an actual letter in the text as a Unicode string.
|
|
Note that, while a <code>LTChar</code> object has actual boundaries,
|
|
<code>LTAnno</code> objects does not, as these are "virtual" characters,
|
|
inserted by a layout analyzer according to the relationship between two characters
|
|
(e.g. a space).
|
|
|
|
<dt> <code>LTFigure</code>
|
|
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
|
|
present figures or pictures by embedding yet another PDF document within a page.
|
|
Note that <code>LTFigure</code> objects can appear recursively.
|
|
|
|
<dt> <code>LTImage</code>
|
|
<dd> Represents an image object. Embedded images can be
|
|
in JPEG or other formats, but currently PDFMiner does not
|
|
pay much attention to graphical objects.
|
|
|
|
<dt> <code>LTLine</code>
|
|
<dd> Represents a single straight line.
|
|
Could be used for separating text or figures.
|
|
|
|
<dt> <code>LTRect</code>
|
|
<dd> Represents a rectangle.
|
|
Could be used for framing another pictures or figures.
|
|
|
|
<dt> <code>LTCurve</code>
|
|
<dd> Represents a generic Bezier curve.
|
|
</dl>
|
|
|
|
<p>
|
|
Also, check out <a href="http://denis.papathanasiou.org/archive/2010.08.04.post.pdf">a more complete example by Denis Papathanasiou(Extracting Text & Images from PDF Files)</a>.
|
|
|
|
<h2><a name="tocextract">Obtaining Table of Contents</a></h2>
|
|
<p>
|
|
PDFMiner provides functions to access the document's table of contents
|
|
("Outlines").
|
|
|
|
<blockquote><pre>
|
|
from pdfminer.pdfparser import PDFParser
|
|
from pdfminer.pdfdocument import PDFDocument
|
|
|
|
<span class="comment"># Open a PDF document.</span>
|
|
fp = open('mypdf.pdf', 'rb')
|
|
parser = PDFParser(fp)
|
|
document = PDFDocument(parser, password)
|
|
|
|
<span class="comment"># Get the outlines of the document.</span>
|
|
outlines = document.get_outlines()
|
|
for (level,title,dest,a,se) in outlines:
|
|
print (level, title)
|
|
</pre></blockquote>
|
|
|
|
<p>
|
|
Some PDF documents use page numbers as destinations, while others
|
|
use page numbers and the physical location within the page. Since
|
|
PDF does not have a logical structure, and it does not provide a
|
|
way to refer to any in-page object from the outside, there's no
|
|
way to tell exactly which part of text these destinations are
|
|
referring to.
|
|
|
|
<h2><a name="extend">Extending Functionality</a></h2>
|
|
|
|
<p>
|
|
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class
|
|
in order to process them differently / obtain other information.
|
|
|
|
<hr noshade>
|
|
<address>Yusuke Shinyama</address>
|
|
</body>
|