git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@251 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
yusuke.shinyama.dummy 2010-10-17 05:15:05 +00:00
parent 0086971714
commit e0db043260
1 changed files with 32 additions and 18 deletions

@ -20,35 +20,41 @@ from other applications.
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a>
<li> <a href="#more">more</a>
</ul>
<a name="overview">
<hr noshade>
<h2>Overview</h2>
<p>
<strong>PDF is evil.</strong>
Because a PDF file is normally big and has a complex structure,
parsing a PDF as a whole is time-and-memory
consuming. Furthermore, not every part is needed for most PDF
processing. Therefore, PDFMiner takes a strategy of lazy parsing,
which is to parse the stuff only when it's necessary. To parse PDF
files, you need at least two classes: <code>PDFParser</code>
and <code>PDFDocument</code>. These objects work together.
<code>PDFParser</code> fetches (or parses) data from a PDF,
<strong>PDF is evil.</strong> Although it is called a PDF
"document", it's nothing like Word or HTML. PDF is more like a
picture representation. PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact
position on a display or paper. In most cases, it has no logical
structure such as sentences or paragraphs and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from the
positioning, but nothing is guaranteed to work. Ugly, I
know. Again, PDF is evil.
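<p>
To give a feel for what "guessing from the positioning" means, here
is a minimal sketch that groups positioned glyphs into text lines.
It is not PDFMiner's layout analysis; the function name
<code>group_into_lines</code>, the <code>(x, y, char)</code> tuple
format and the <code>y_tolerance</code> parameter are assumptions
made up for this illustration.

```python
def group_into_lines(glyphs, y_tolerance=2.0):
    """Group (x, y, char) tuples into strings, one per visual line.

    Hypothetical helper for illustration only. PDF coordinates grow
    upward, so a larger y means the glyph sits higher on the page.
    """
    lines = []  # each entry: [y of the first glyph, [(x, char), ...]]
    # Visit glyphs from the top of the page to the bottom.
    for x, y, ch in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough vertically: assume it is the same line.
            lines[-1][1].append((x, ch))
        else:
            lines.append([y, [(x, ch)]])
    # Within each line, read the glyphs from left to right.
    return ["".join(ch for _, ch in sorted(items)) for _, items in lines]
```

Even this toy version depends on a hand-picked tolerance, which is
why this kind of reconstruction can fail on unusual layouts.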
<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time- and memory-consuming. However,
not every part is needed for most PDF processing tasks. Therefore,
PDFMiner takes a strategy of lazy parsing: it parses each part
only when it is needed. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.
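<p>
The lazy parsing idea can be sketched with two toy classes. These are
not the real <code>PDFParser</code> and <code>PDFDocument</code>;
the names <code>ToyParser</code>, <code>ToyDocument</code>,
<code>fetch</code> and <code>getobj</code> are hypothetical and only
show the division of labor: one object fetches data on request, the
other stores what has been fetched.

```python
class ToyParser:
    """Hypothetical stand-in for a parser: fetches one object on request."""

    def __init__(self, raw_objects):
        self.raw_objects = raw_objects  # object id -> unparsed text
        self.parse_count = 0            # how many objects were actually parsed

    def fetch(self, objid):
        self.parse_count += 1
        # "Parsing" is just int() here; a real parser does far more work.
        return int(self.raw_objects[objid])


class ToyDocument:
    """Hypothetical stand-in for a document: stores parsed objects."""

    def __init__(self):
        self.parser = None
        self.cache = {}

    def set_parser(self, parser):
        # Associate the document with the parser it pulls data from.
        self.parser = parser

    def getobj(self, objid):
        # Parse an object only the first time somebody asks for it.
        if objid not in self.cache:
            self.cache[objid] = self.parser.fetch(objid)
        return self.cache[objid]
```

In this sketch, creating the document parses nothing; asking for the
same object twice parses it once.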
<p>
PDF documents are more like a graphics format than a text
format. The contents in PDF are just a bunch of procedures that
tell how to render the stuff on a display or paper. In most
cases, it presents no logical structure such as sentences or
paragraphs. So PDFMiner attempts to reconstruct some of them by
performing layout analysis. Ugly, I know. Again, PDF is evil.
<p>
Figure 1 shows the relationship between these classes:
Figure 1 shows the relationship between the classes in PDFMiner.
<div align=center>
<img src="objrel.png"><br>
@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are
referring to.
<a name="more">
<hr noshade>
<h2>More</h2>
<p>
You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
in order to process page contents differently or to obtain other information.
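<p>
The general pattern is ordinary subclassing: an interpreter walks the
page contents and calls hooks on a device, and a subclass overrides
the hooks it cares about. The sketch below uses toy classes, not the
real ones; <code>ToyDevice</code>, <code>TextCollector</code>,
<code>toy_interpret</code> and the hook names are all hypothetical,
so check the actual method names in the PDFMiner source before
relying on them.

```python
class ToyDevice:
    """Hypothetical base device: every hook does nothing by default."""

    def begin_page(self, pageno):
        pass

    def render_string(self, text):
        pass

    def end_page(self, pageno):
        pass


class TextCollector(ToyDevice):
    """Subclass that keeps the text of each page instead of drawing it."""

    def __init__(self):
        self.pages = {}      # page number -> collected text
        self._current = []

    def begin_page(self, pageno):
        self._current = []

    def render_string(self, text):
        self._current.append(text)

    def end_page(self, pageno):
        self.pages[pageno] = " ".join(self._current)


def toy_interpret(pages, device):
    """Hypothetical interpreter: feeds each page's strings to the device."""
    for pageno, strings in enumerate(pages, start=1):
        device.begin_page(pageno)
        for s in strings:
            device.render_string(s)
        device.end_page(pageno)
```

Swapping in a different subclass changes what is obtained from the
same page contents without touching the interpreter.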
<hr noshade>
<address>Yusuke Shinyama</address>
</body>