Updated and fixed the documents.

pull/1/head
Yusuke Shinyama 2013-11-13 14:51:24 +09:00
parent acad011e3f
commit 7504d2bf27
2 changed files with 31 additions and 29 deletions

@ -9,7 +9,7 @@
<div align=right class=lastmod> <div align=right class=lastmod>
<!-- hhmts start --> <!-- hhmts start -->
Last Modified: Mon Nov 11 10:18:06 UTC 2013 Last Modified: Wed Nov 13 05:50:56 UTC 2013
<!-- hhmts end --> <!-- hhmts end -->
</div> </div>
@ -23,9 +23,9 @@ from other applications.
<ul> <ul>
<li> <a href="#overview">Overview</a> <li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a> <li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a> <li> <a href="#layout">Performing Layout Analysis</a>
<li> <a href="#tocextract">TOC Extraction</a> <li> <a href="#tocextract">Obtaining Table of Contents</a>
<li> <a href="#extend">Parser Extension</a> <li> <a href="#extend">Extending Functionality</a>
</ul> </ul>
<h2><a name="overview">Overview</a></h2> <h2><a name="overview">Overview</a></h2>
@ -75,8 +75,12 @@ Figure 1 shows the relationship between the classes in PDFMiner.
<p> <p>
A typical way to parse a PDF file is the following: A typical way to parse a PDF file is the following:
<blockquote><pre> <blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice from pdfminer.pdfdevice import PDFDevice
<span class="comment"># Open a PDF file.</span> <span class="comment"># Open a PDF file.</span>
@ -84,15 +88,12 @@ fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span> <span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp) parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span> <span class="comment"># Create a PDF document object that stores the document structure.</span>
doc = PDFDocument() document = PDFDocument(parser)
<span class="comment"># Connect the parser and document objects.</span>
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password for initialization.</span> <span class="comment"># Supply the password for initialization.</span>
<span class="comment"># (If no password is set, give an empty string.)</span> <span class="comment"># (If no password is set, give an empty string.)</span>
doc.initialize(password) document.initialize(password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span> <span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not doc.is_extractable: if not document.is_extractable:
raise PDFTextExtractionNotAllowed raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span> <span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager() rsrcmgr = PDFResourceManager()
@ -101,11 +102,11 @@ device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span> <span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device) interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span> <span class="comment"># Process each page contained in the document.</span>
for page in doc.get_pages(): for page in PDFPage.create_pages(document):
interpreter.process_page(page) interpreter.process_page(page)
</pre></blockquote> </pre></blockquote>
<h2><a name="layout">Accessing Layout Objects</a></h2> <h2><a name="layout">Performing Layout Analysis</a></h2>
<p> <p>
Here is a typical way to use the layout analysis function: Here is a typical way to use the layout analysis function:
<blockquote><pre> <blockquote><pre>
@ -117,15 +118,15 @@ laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span> <span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams) device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device) interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages(): for page in PDFPage.create_pages(document):
interpreter.process_page(page) interpreter.process_page(page)
<span class="comment"># receive the LTPage object for the page.</span> <span class="comment"># receive the LTPage object for the page.</span>
layout = device.get_result() layout = device.get_result()
</pre></blockquote> </pre></blockquote>
The layout analyzer gives a "<code>LTPage</code>" object for each page A layout analyzer returns a <code>LTPage</code> object for each page
in the PDF document. The object contains child objects within the page, in the PDF document. This object contains child objects within the page,
forming a tree-like structure. Figure 2 shows the relationship between forming a tree structure. Figure 2 shows the relationship between
these objects. these objects.
<div align=center> <div align=center>
@ -179,29 +180,29 @@ Could be used for separating text or figures.
Could be used for framing another pictures or figures. Could be used for framing another pictures or figures.
<dt> <code>LTCurve</code> <dt> <code>LTCurve</code>
<dd> Represents a generic bezier curve. <dd> Represents a generic Bezier curve.
</dl> </dl>
<p> <p>
Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete example by Denis Papathanasiou</a>. Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete example by Denis Papathanasiou</a>.
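<p>
The tree structure described above can be traversed recursively. The following is a minimal sketch that does not import PDFMiner: <code>Node</code> is a hypothetical stand-in class, <code>walk</code> is an illustrative helper, and the name "group" is a placeholder. It assumes only that a container object can be iterated to reach its children.

```python
class Node:
    """Hypothetical stand-in for a layout object (not a PDFMiner class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __iter__(self):
        # Assumption: iterating a container yields its child objects.
        return iter(self.children)


def walk(obj, depth=0):
    """Yield (depth, name) pairs for obj and all of its descendants."""
    yield depth, obj.name
    for child in obj:
        yield from walk(child, depth + 1)


# A tiny hand-built tree; "group" is a placeholder name.
page = Node("LTPage", [
    Node("group", [Node("LTCurve")]),
    Node("LTCurve"),
])

for depth, name in walk(page):
    print("  " * depth + name)
```

With a real page, the same kind of recursion would start from the object returned by <code>device.get_result()</code>, provided the iteration assumption holds for the installed version.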
<h2><a name="tocextract">TOC Extraction</a></h2> <h2><a name="tocextract">Obtaining Table of Contents</a></h2>
<p> <p>
PDFMiner provides functions to access the document's table of contents PDFMiner provides functions to access the document's table of contents
("Outlines"). ("Outlines").
<blockquote><pre> <blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
<span class="comment"># Open a PDF document.</span>
fp = open('mypdf.pdf', 'rb') fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp) parser = PDFParser(fp)
doc = PDFDocument() document = PDFDocument(parser)
parser.set_document(doc) document.initialize(password)
doc.set_parser(parser)
doc.initialize(password)
<span class="comment"># Get the outlines of the document.</span> <span class="comment"># Get the outlines of the document.</span>
outlines = doc.get_outlines() outlines = document.get_outlines()
for (level,title,dest,a,se) in outlines: for (level,title,dest,a,se) in outlines:
print (level, title) print (level, title)
</pre></blockquote> </pre></blockquote>
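<p>
The <code>(level, title, dest, a, se)</code> tuples from the loop above can be turned into an indented listing. This is a small sketch using hand-written sample data rather than a real PDF; <code>format_outlines</code> is a hypothetical helper, and it assumes that top-level entries have level 1.

```python
def format_outlines(outlines, indent="  "):
    """Render (level, title, dest, a, se) tuples as indented lines."""
    lines = []
    for (level, title, dest, a, se) in outlines:
        # Assumption: top-level entries have level 1; clamp to avoid
        # a negative repeat count if a level of 0 ever appears.
        lines.append(indent * max(level - 1, 0) + title)
    return lines


# Hypothetical sample data standing in for document.get_outlines().
sample = [
    (1, "Introduction", None, None, None),
    (2, "Background", None, None, None),
    (1, "Conclusion", None, None, None),
]

for line in format_outlines(sample):
    print(line)
```

Only <code>level</code> and <code>title</code> are used here; the remaining fields are passed through untouched because their interpretation depends on the document.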
@ -209,12 +210,12 @@ for (level,title,dest,a,se) in outlines:
<p> <p>
Some PDF documents use page numbers as destinations, while others Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since use page numbers and the physical location within the page. Since
PDF does not have a logical strucutre, and it does not provide a PDF does not have a logical structure, and it does not provide a
way to refer to any in-page object from the outside, there's no way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are way to tell exactly which part of text these destinations are
refering to. referring to.
<h2><a name="extend">Parser Extension</a></h2> <h2><a name="extend">Extending Functionality</a></h2>
<p> <p>
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class

@ -1,3 +1,4 @@
blockquote { background: #eeeeee; } blockquote { background: #eeeeee; }
h1 { border-bottom: solid black 2px; } h1 { border-bottom: solid black 2px; }
h2 { border-bottom: solid black 1px; } h2 { border-bottom: solid black 1px; }
.comment { color: darkgreen; }