overview
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@251 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
parent
0086971714
commit
e0db043260
|
@ -20,35 +20,41 @@ from other applications.
|
||||||
<li> <a href="#basic">Basic Usage</a>
|
<li> <a href="#basic">Basic Usage</a>
|
||||||
<li> <a href="#layout">Layout Analysis</a>
|
<li> <a href="#layout">Layout Analysis</a>
|
||||||
<li> <a href="#toc">TOC Extraction</a>
|
<li> <a href="#toc">TOC Extraction</a>
|
||||||
|
<li> <a href="#more">more</a>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<a name="overview">
|
<a name="overview">
|
||||||
<hr noshade>
|
<hr noshade>
|
||||||
<h2>Overview</h2>
|
<h2>Overview</h2>
|
||||||
<p>
|
<p>
|
||||||
<strong>PDF is evil.</strong>
|
<strong>PDF is evil.</strong> Although it is called a PDF
|
||||||
Because a PDF file is normally big and has a complex structure,
|
"document", it's nothing like Word or HTML. PDF is more like a
|
||||||
parsing a PDF as a whole is time-and-memory
|
picture representation. PDF contents are just a bunch of
|
||||||
consuming. Furthermore, not every part is needed for most PDF
|
instructions that tell how to place the stuff at each exact
|
||||||
processing. Therefore, PDFMiner takes a strategy of lazy parsing,
|
position on a display or paper. In most cases, it has no logical
|
||||||
which is to parse the stuff only when it's necessary. To parse PDF
|
structure such as sentences or paragraphs and it cannot adapt
|
||||||
files, you need at least two classes: <code>PDFParser</code>
|
itself when the paper size changes. PDFMiner attempts to
|
||||||
and <code>PDFDocument</code>. These objects work together.
|
reconstruct some of those structures by guessing from its
|
||||||
<code>PDFParser</code> fetches (or parses) data from a PDF,
|
positioning, but there's nothing guaranteed to work. Ugly, I
|
||||||
|
know. Again, PDF is evil.
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Because a PDF file has such a big and complex structure,
|
||||||
|
parsing a PDF file as a whole is time and memory consuming. However,
|
||||||
|
not every part is needed for most PDF processing tasks. Therefore
|
||||||
|
PDFMiner takes a strategy of lazy parsing, which is to parse the
|
||||||
|
stuff only when it's necessary. To parse PDF files, you need to use at
|
||||||
|
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
|
||||||
|
These two objects are associated with each other.
|
||||||
|
<code>PDFParser</code> fetches data from a file,
|
||||||
and <code>PDFDocument</code> stores it. You'll also need
|
and <code>PDFDocument</code> stores it. You'll also need
|
||||||
<code>PDFPageInterpreter</code> to process the page contents
|
<code>PDFPageInterpreter</code> to process the page contents
|
||||||
and <code>PDFDevice</code> to translate it to whatever you need.
|
and <code>PDFDevice</code> to translate it to whatever you need.
|
||||||
|
<code>PDFResourceManager</code> is used to store
|
||||||
|
shared resources such as fonts or images.
|
||||||
|
|
||||||
<p>
|
<p>
|
||||||
PDF documents are more like graphics format, rather than text
|
Figure 1 shows the relationship between the classes in PDFMiner.
|
||||||
format. The contents in PDF are just a bunch of procedures that
|
|
||||||
tell how to render the stuff on a display or paper. In most
|
|
||||||
cases, it presents no logical structure such as sentences or
|
|
||||||
paragraphs. So PDFMiner attempts to reconstruct some of them by
|
|
||||||
performing layout analysis. Ugly, I know. Again, PDF is evil.
|
|
||||||
|
|
||||||
<p>
|
|
||||||
Figure 1 shows the relationship between these classes:
|
|
||||||
|
|
||||||
<div align=center>
|
<div align=center>
|
||||||
<img src="objrel.png"><br>
|
<img src="objrel.png"><br>
|
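Note on the Overview hunk: the lazy-parsing strategy and the parser/document association described there can be illustrated with a small toy model. This is only a sketch of the idea, not PDFMiner's code. `ToyParser`, `ToyDocument`, `build_demo`, the method names, and the byte layout are all hypothetical and were chosen for illustration.

```python
import io


class ToyParser:
    """Fetches raw object data from a file-like source, only on request."""

    def __init__(self, fp, offsets):
        self.fp = fp
        self.offsets = offsets  # object id -> (start, length); hypothetical layout
        self.doc = None
        self.reads = 0          # counts how many objects were actually fetched

    def set_document(self, doc):
        self.doc = doc

    def fetch(self, objid):
        start, length = self.offsets[objid]
        self.fp.seek(start)
        self.reads += 1
        return self.fp.read(length).decode("ascii")


class ToyDocument:
    """Stores objects and asks the parser only for those not yet seen."""

    def __init__(self):
        self.parser = None
        self.cache = {}

    def set_parser(self, parser):
        self.parser = parser

    def getobj(self, objid):
        # Lazy step: nothing is read until an object is first requested.
        if objid not in self.cache:
            self.cache[objid] = self.parser.fetch(objid)
        return self.cache[objid]


def build_demo():
    """Builds an associated parser/document pair over three fake objects."""
    data = b"catalogpage-onepage-two"
    offsets = {1: (0, 7), 2: (7, 8), 3: (15, 8)}
    parser = ToyParser(io.BytesIO(data), offsets)
    doc = ToyDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    return parser, doc
```

In this sketch, building the pair reads nothing from the file. Each object is fetched the first time `getobj` asks for it and is served from the document's cache afterwards.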
||||||
|
@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no
|
||||||
way to tell exactly which part of text these destinations are
|
way to tell exactly which part of text these destinations are
|
||||||
referring to.
|
referring to.
|
||||||
|
|
||||||
|
<a name="more">
|
||||||
|
<hr noshade>
|
||||||
|
<h2>More</h2>
|
||||||
|
|
||||||
|
<p>
|
||||||
|
You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
|
||||||
|
to process page contents differently or to obtain other information.
|
||||||
|
|
||||||
<hr noshade>
|
<hr noshade>
|
||||||
<address>Yusuke Shinyama</address>
|
<address>Yusuke Shinyama</address>
|
||||||
</body>
|
</body>
|
||||||
|
|
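Note on the "More" hunk: extending an interpreter and a device, as that section suggests, might look roughly like the toy model below. This is a sketch under stated assumptions and does not reproduce PDFMiner's actual class interfaces. `ToyDevice`, `TextCollector`, `ToyInterpreter`, their method signatures, and the `("text", x, y, string)` instruction format are all hypothetical.

```python
class ToyDevice:
    """Base device: receives rendering events and ignores them."""

    def begin_page(self, pageno):
        pass

    def render_string(self, x, y, text):
        pass

    def end_page(self, pageno):
        pass


class TextCollector(ToyDevice):
    """Extended device that keeps the text, ordered top to bottom, then left to right."""

    def __init__(self):
        self.pages = {}
        self._items = []

    def begin_page(self, pageno):
        self._items = []

    def render_string(self, x, y, text):
        # Negate y so that a plain sort puts higher positions first.
        self._items.append((-y, x, text))

    def end_page(self, pageno):
        self.pages[pageno] = [text for _, _, text in sorted(self._items)]


class ToyInterpreter:
    """Walks a page's instructions and forwards text events to the device."""

    def __init__(self, device):
        self.device = device

    def process_page(self, pageno, instructions):
        self.device.begin_page(pageno)
        for op, *args in instructions:
            if op == "text":
                self.device.render_string(*args)
            # Any other instruction is skipped in this sketch.
        self.device.end_page(pageno)
```

The interpreter knows nothing about what the device does with each event, so a different result (for example, bounding boxes instead of text) only requires a different device subclass.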