git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@251 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
yusuke.shinyama.dummy 2010-10-17 05:15:05 +00:00
parent 0086971714
commit e0db043260
1 changed files with 32 additions and 18 deletions


@@ -20,35 +20,41 @@ from other applications.
<li> <a href="#basic">Basic Usage</a> <li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a> <li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a> <li> <a href="#toc">TOC Extraction</a>
<li> <a href="#more">more</a>
</ul> </ul>
<a name="overview"> <a name="overview">
<hr noshade> <hr noshade>
<h2>Overview</h2> <h2>Overview</h2>
<p> <p>
<strong>PDF is evil.</strong> <strong>PDF is evil.</strong> Although it is called a PDF
Because a PDF file is normally big and has a complex structure, "document", it's nothing like Word or HTML. PDF is more like a
parsing a PDF as a whole is time-and-memory picture representation. PDF contents are just a bunch of
consuming. Furthermore, not every part is needed for most PDF instructions that tell how to place the stuff at each exact
processing. Therefore, PDFMiner takes a strategy of lazy parsing, position on a display or paper. In most cases, it has no logical
which is to parse the stuff only when it's necessary. To parse PDF structure such as sentences or paragraphs and it cannot adapt
files, you need at least two classes: <code>PDFParser</code> itself when the paper size changes. PDFMiner attempts to
and <code>PDFDocument</code>. These objects work together. reconstruct some of those structures by guessing from its
<code>PDFParser</code> fetches (or parses) data from a PDF, positioning, but there's nothing guaranteed to work. Ugly, I
know. Again, PDF is evil.
<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time- and memory-consuming. However,
not every part is needed for most PDF processing tasks. Therefore,
PDFMiner takes a strategy of lazy parsing, which is to parse
each part only when it's necessary. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents <code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need. and <code>PDFDevice</code> to translate it to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.
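The lazy parsing strategy described above can be pictured with a small sketch. This is not PDFMiner's code: <code>ToyParser</code>, <code>ToyDocument</code>, and <code>getobj</code> are hypothetical stand-ins invented for illustration, and the "parsing" step is just an uppercase conversion. It only shows the general idea of one object fetching data on demand while the other stores the result.

```python
class ToyParser:
    """Hypothetical stand-in for a parser: decodes one raw object on request."""

    def __init__(self, raw_objects):
        self.raw_objects = raw_objects
        self.parse_count = 0  # how many objects were actually parsed

    def parse(self, objid):
        self.parse_count += 1
        # Pretend that "parsing" is expensive; here it is just an uppercase.
        return self.raw_objects[objid].upper()


class ToyDocument:
    """Hypothetical stand-in for a document: stores what the parser fetched."""

    def __init__(self, parser):
        self.parser = parser
        self._cache = {}

    def getobj(self, objid):
        # Parse only on first access; later calls reuse the stored result.
        if objid not in self._cache:
            self._cache[objid] = self.parser.parse(objid)
        return self._cache[objid]


parser = ToyParser({1: "catalog", 2: "page", 3: "font"})
doc = ToyDocument(parser)
first = doc.getobj(2)
again = doc.getobj(2)  # served from the cache, no second parse
```

In this sketch objects 1 and 3 are never parsed, because nothing asked for them.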
<p> <p>
PDF documents are more like graphics format, rather than text Figure 1 shows the relationship between the classes in PDFMiner.
format. The contents in PDF are just a bunch of procedures that
tell how to render the stuff on a display or paper. In most
cases, it presents no logical structure such as sentences or
paragraphs. So PDFMiner attempts to reconstruct some of them by
performing layout analysis. Ugly, I know. Again, PDF is evil.
<p>
Figure 1 shows the relationship between these classes:
<div align=center> <div align=center>
<img src="objrel.png"><br> <img src="objrel.png"><br>
@@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are way to tell exactly which part of text these destinations are
referring to. referring to.
<a name="more">
<hr noshade>
<h2>More</h2>
<p>
You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
to process page contents differently or to obtain other information.
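As a rough illustration of this kind of extension, the sketch below uses toy classes rather than PDFMiner's real ones. <code>ToyDevice</code>, <code>ToyInterpreter</code>, <code>TextCollector</code>, <code>render_string</code>, and the instruction tuples are all assumptions made up for this example; the actual methods you would override in PDFMiner may differ.

```python
class ToyDevice:
    """Hypothetical base device: ignores everything it is given."""

    def render_string(self, text, x, y):
        pass


class TextCollector(ToyDevice):
    """Subclass that records each string together with its position."""

    def __init__(self):
        self.items = []

    def render_string(self, text, x, y):
        self.items.append((x, y, text))


class ToyInterpreter:
    """Hypothetical interpreter: walks page instructions and drives a device."""

    def __init__(self, device):
        self.device = device

    def process_page(self, instructions):
        for op, *args in instructions:
            if op == "text":
                self.device.render_string(*args)
            # Any other instruction is skipped in this sketch.


page = [
    ("text", "Hello", 72, 700),
    ("line", 0, 0, 10, 10),
    ("text", "world", 110, 700),
]
collector = TextCollector()
ToyInterpreter(collector).process_page(page)
```

The interpreter does not need to know what the device does with each string, so swapping in a different device subclass is enough to obtain different information from the same page.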
<hr noshade> <hr noshade>
<address>Yusuke Shinyama</address> <address>Yusuke Shinyama</address>
</body> </body>