<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PDFMiner Development Guide</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
.comment { color: darkgreen; }
--></style>
</head><body>

<h1>PDFMiner Development Guide</h1>
<p>
This document describes how to use PDFMiner as a library
from other applications.
<ul>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a>
</ul>

<a name="basic"></a>
<hr noshade>
<h2>Basic Usage</h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
doc = PDFDocument()
<span class="comment"># Connect the parser and document objects.</span>
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password for initialization.</span>
<span class="comment"># (If no password is set, give an empty string.)</span>
password = ''
doc.initialize(password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in doc.get_pages():
    interpreter.process_page(page)
</pre></blockquote>

<p>
Parsing a PDF file in PDFMiner involves several Python classes,
as shown in Figure 1.
The parser and the document object refer to each other.
The page interpreter processes each page and passes the result to a device,
and both share one resource manager.

<div>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner objects</small>
</div>

<a name="layout"></a>
<hr noshade>
<h2>Accessing Layout Objects</h2>
<p>
A PDF document is more like a graphic than a text document.
In most cases it presents no logical structure such as sentences or paragraphs.
PDFMiner tries to reconstruct the original structure by performing
basic layout analysis.
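<p>
To make the idea concrete, here is a toy sketch of what layout analysis has to do.
It is not PDFMiner's actual algorithm. It assumes a page is just a collection of
positioned glyphs, and recovers text lines by grouping glyphs whose vertical
positions are close together and then sorting each group from left to right.
The names <code>Glyph</code>, <code>group_into_lines</code> and
<code>tolerance</code> are hypothetical and exist only in this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a positioned character; not a PDFMiner class.
Glyph = namedtuple('Glyph', ['x', 'y', 'char'])


def group_into_lines(glyphs, tolerance=2.0):
    """Group glyphs into text lines by their vertical position.

    A glyph joins the current line when its y coordinate is within
    `tolerance` of the first glyph placed on that line. Lines are
    returned top to bottom (PDF y coordinates grow upwards), and each
    line is read left to right.
    """
    lines = []  # list of (reference_y, [glyphs on that line])
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - glyph.y) <= tolerance:
            lines[-1][1].append(glyph)
        else:
            lines.append((glyph.y, [glyph]))
    return [''.join(g.char for g in sorted(members, key=lambda g: g.x))
            for (_, members) in lines]


# Glyphs arrive in drawing order, which need not be reading order.
page = [
    Glyph(20, 700, 'i'), Glyph(10, 700, 'H'),
    Glyph(10, 680, 'P'), Glyph(30, 681, 'F'), Glyph(20, 680, 'D'),
]
result = group_into_lines(page)
print(result)
```

<p>
A real analyzer also has to deal with word spacing, columns and varying font
sizes, which this sketch ignores.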

<p>
To obtain the layout objects, use <code>PDFPageAggregator</code> as the device
and retrieve the result after each page is processed.
(<code>rsrcmgr</code> and <code>doc</code> are the objects created in
Basic Usage.)

<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
    interpreter.process_page(page)
    <span class="comment"># Receive the top-level layout object for this page.</span>
    ltpage = device.get_result()
</pre></blockquote>

<p>
The layout objects include the following classes:
<ul>
<li> <code>LTPage</code>
<li> <code>LTTextBox</code>
<li> <code>LTTextLine</code>
<li> <code>LTChar</code>
<li> <code>LTText</code>
<li> <code>LTFigure</code>
<li> <code>LTImage</code>
<li> <code>LTRect</code>
<li> <code>LTPolygon</code>
<li> <code>LTLine</code>
</ul>
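<p>
Since the aggregator returns a single top-level object per page, the other
objects are presumably reached by descending from it. The sketch below shows one
way such a tree could be traversed. It assumes that container objects can be
iterated over to reach their children and that leaf objects cannot.
<code>FakeContainer</code>, <code>FakeLeaf</code> and <code>walk</code> are
hypothetical stand-ins written for this example, not PDFMiner classes, and the
nesting in the sample tree is only an illustration.

```python
class FakeLeaf(object):
    """Hypothetical leaf layout object (for example a single character)."""

    def __init__(self, kind):
        self.kind = kind


class FakeContainer(FakeLeaf):
    """Hypothetical layout object that holds child objects."""

    def __init__(self, kind, children=()):
        FakeLeaf.__init__(self, kind)
        self.children = list(children)

    def __iter__(self):
        return iter(self.children)


def walk(obj, depth=0):
    """Yield (depth, object) pairs for obj and all of its descendants."""
    yield (depth, obj)
    try:
        children = iter(obj)
    except TypeError:
        return  # a leaf object that cannot be iterated over
    for child in children:
        for pair in walk(child, depth + 1):
            yield pair


# Sample tree built from stand-ins; a real one would come from the device.
page = FakeContainer('LTPage', [
    FakeContainer('LTTextBox', [
        FakeContainer('LTTextLine', [FakeLeaf('LTChar'), FakeLeaf('LTChar')]),
    ]),
    FakeLeaf('LTRect'),
])
outline = ['  ' * depth + obj.kind for (depth, obj) in walk(page)]
print('\n'.join(outline))
```

<p>
With real layout objects, the class of each object (for example via
<code>isinstance</code>) would take the place of the <code>kind</code> attribute
used here.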

<a name="toc"></a>
<hr noshade>
<h2>TOC Extraction</h2>
<p>
The table of contents of a document (its outlines) can be obtained
from the document object as follows:

<blockquote><pre>
<span class="comment"># Open the file and connect the parser and document objects, as in Basic Usage.</span>
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password (an empty string if no password is set).</span>
password = ''
doc.initialize(password)
<span class="comment"># Get the outlines of the document.</span>
outlines = doc.get_outlines()
for (level, title, dest, a, se) in outlines:
    print (level, title)
</pre></blockquote>
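<p>
The <code>(level, title)</code> pairs printed above can be turned into an
indented table of contents. The following is a minimal sketch that works on
hand-written sample tuples rather than on a real document. It assumes that
<code>level</code> is 1 for top-level entries and grows by one per nesting
level. <code>format_outlines</code> and its <code>indent</code> parameter are
hypothetical names, not part of PDFMiner.

```python
def format_outlines(outlines, indent='  '):
    """Return one indented line of text per outline entry.

    Each entry is a (level, title, dest, a, se) tuple. Only level and
    title are used; the remaining fields are ignored here.
    """
    lines = []
    for (level, title, dest, a, se) in outlines:
        # Assumption: level 1 is the top level, so it gets no indentation.
        lines.append(indent * max(level - 1, 0) + title)
    return lines


# Hand-written sample data standing in for the outlines of a real document.
sample = [
    (1, 'Introduction', None, None, None),
    (2, 'Background', None, None, None),
    (2, 'Scope', None, None, None),
    (1, 'Conclusion', None, None, None),
]
toc = format_outlines(sample)
print('\n'.join(toc))
```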

<hr noshade>
<address>Yusuke Shinyama</address>
</body>