pdfminer.six/docs/programming.html

205 lines
7.2 KiB
HTML
Raw Normal View History

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
.comment { color: darkgreen; }
--></style>
</head><body>
<p>
<a href="index.html">[Back to PDFMiner homepage]</a>
<h1>Programming with PDFMiner</h1>
<p>
This document explains how to use PDFMiner as a library
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a>
</ul>
<a name="overview">
<hr noshade>
<h2>Overview</h2>
<p>
<strong>PDF is evil.</strong>
Because a PDF file is normally big and has a complex structure,
parsing a PDF as a whole is time-and-memory
consuming. Furthermore, not every part is needed for most PDF
processing. Therefore, PDFMiner takes a strategy of lazy parsing,
which is to parse the stuff only when it's necessary. To parse PDF
files, you need at least two classes: <code>PDFParser</code>
and <code>PDFDocument</code>. These objects work together.
<code>PDFParser</code> fetches (or parses) data from a PDF,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need.
<p>
PDF documents are more like graphics format, rather than text
format. The contents in PDF are just a bunch of procedures that
tell how to render the stuff on a display or paper. In most
cases, it presents no logical structure such as sentences or
paragraphs. So PDFMiner attempts to reconstruct some of them by
performing layout analysis. Ugly, I know. Again, PDF is evil.
<p>
Figure 1 shows the relationship between these classes:
<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>
<a name="basic">
<hr noshade>
<h2>Basic Usage</h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
doc = PDFDocument()
<span class="comment"># Connect the parser and document objects.</span>
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password for initialization.</span>
<span class="comment"># (If no password is set, give an empty string.)</span>
doc.initialize(password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not doc.is_extractable:
raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in doc.get_pages():
interpreter.process_page(page)
</pre></blockquote>
<a name="layout">
<hr noshade>
<h2>Accessing Layout Objects</h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
interpreter.process_page(page)
<span class="comment"># receive the LTPage object for the page.</span>
layout = device.get_result()
</pre></blockquote>
The layout analyzer gives a "<code>LTPage</code>" object for each page
in the PDF document. The object contains child objects within the page,
forming a tree-like structure. Figure 2 shows the relationship between
these objects.
<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and its tree structure</small>
</div>
<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
<code>LTPolygon</code> and <code>LTLine</code>.
<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represents a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.
<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontaly
or vertically, depending on the text's writing mode.
<dt> <code>LTChar</code>
<dt> <code>LTText</code>
<dd> These objects represent an actual letter in the text as a Unicode string.
Note that, while a <code>LTChar</code> object has actual boundaries,
<code>LTText</code> objects does not, as these are "virtual" characters,
inserted by a layout analyzer according to the relationship between two characters
(e.g. a space).
<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.
<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be
in JPEG or other formats, but currently PDFMiner does not
pay much attention to graphical objects.
<dt> <code>LTLine</code>
<dd> Represents a single straight line shown in a page.
Could be used for separating texts or figures.
<dt> <code>LTRect</code>
<dd> Represents a rectangle shown in a page.
Could be used for framing another pictures or figures.
<dt> <code>LTPolygon</code>
<dd> Represents a polygon in a page.
</dl>
<a name="toc">
<hr noshade>
<h2>TOC Extraction</h2>
<p>
PDFMiner provides functions to access the document's table of contents
("Outlines").
<blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize(password)
<span class="comment"># Get the outlines of the document.</span>
outlines = doc.get_outlines()
for (level,title,dest,a,se) in outlines:
print (level, title)
</pre></blockquote>
<p>
Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since
PDF does not have a logical strucutre, and it does not provide a
way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are
refering to.
<hr noshade>
<address>Yusuke Shinyama</address>
</body>