overview
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@251 1aa58f4a-7d42-0410-adbc-911cccaed67c
parent
0086971714
commit
e0db043260
@@ -20,35 +20,41 @@ from other applications.
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a>
<li> <a href="#more">more</a>
</ul>

<a name="overview">
<hr noshade>
<h2>Overview</h2>
<p>
<strong>PDF is evil.</strong>
Because a PDF file is normally big and has a complex structure,
parsing a PDF as a whole is time-and-memory
consuming. Furthermore, not every part is needed for most PDF
processing. Therefore, PDFMiner takes a strategy of lazy parsing,
which is to parse the stuff only when it's necessary. To parse PDF
files, you need at least two classes: <code>PDFParser</code>
and <code>PDFDocument</code>. These objects work together.
<code>PDFParser</code> fetches (or parses) data from a PDF,
<strong>PDF is evil.</strong> Although it is called a PDF
"document", it's nothing like Word or HTML. PDF is more like a
picture representation. PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact
position on a display or paper. In most cases, it has no logical
structure such as sentences or paragraphs and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from its
positioning, but there's nothing guaranteed to work. Ugly, I
know. Again, PDF is evil.

<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time and memory consuming. However,
not every part is needed for most PDF processing tasks. Therefore
PDFMiner takes a strategy of lazy parsing, which is to parse the
stuff only when it's necessary. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.

<p>
PDF documents are more like graphics format, rather than text
format. The contents in PDF are just a bunch of procedures that
tell how to render the stuff on a display or paper. In most
cases, it presents no logical structure such as sentences or
paragraphs. So PDFMiner attempts to reconstruct some of them by
performing layout analysis. Ugly, I know. Again, PDF is evil.

<p>
Figure 1 shows the relationship between these classes:
Figure 1 shows the relationship between the classes in PDFMiner.

<div align=center>
<img src="objrel.png"><br>

@@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are
refering to.

<a name="more">
<hr noshade>
<h2>More</h2>

<p>
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class
in order to process them differently / obtain other information.

<hr noshade>
<address>Yusuke Shinyama</address>
</body>