overview

git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@251 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:15:05 +00:00 · e0db043260
parent 0086971714
commit e0db043260
1 changed files with 32 additions and 18 deletions
--- a/docs/programming.html
+++ b/docs/programming.html
@@ -20,35 +20,41 @@ from other applications.
 <li> <a href="#basic">Basic Usage</a>
 <li> <a href="#layout">Layout Analysis</a>
 <li> <a href="#toc">TOC Extraction</a>
 <li> <a href="#more">more</a>
 </ul>
 <a name="overview">
 <hr noshade>
 <h2>Overview</h2>
 <p>
-<strong>PDF is evil.</strong>
+<strong>PDF is evil.</strong>  Although it is called a PDF
-Because a PDF file is normally big and has a complex structure,
+"document", it's nothing like Word or HTML. PDF is more like a
-parsing a PDF as a whole is time-and-memory
+picture representation.  PDF contents are just a bunch of
-consuming. Furthermore, not every part is needed for most PDF
+instructions that tell how to place the stuff at each exact
-processing. Therefore, PDFMiner takes a strategy of lazy parsing,
+position on a display or paper.  In most cases, it has no logical
-which is to parse the stuff only when it's necessary. To parse PDF
+structure such as sentences or paragraphs and it cannot adapt
-files, you need at least two classes: <code>PDFParser</code>
+itself when the paper size changes. PDFMiner attempts to
-and <code>PDFDocument</code>.  These objects work together.
+reconstruct some of those structures by guessing from its
-<code>PDFParser</code> fetches (or parses) data from a PDF,
+positioning, but there's nothing guaranteed to work. Ugly, I
 know. Again, PDF is evil.
 <p>
 Because a PDF file has such a big and complex structure,
 parsing a PDF file as a whole is time and memory consuming. However,
 not every part is needed for most PDF processing tasks. Therefore
 PDFMiner takes a strategy of lazy parsing, which is to parse the
 stuff only when it's necessary. To parse PDF files, you need to use at
 least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.  
 These two objects are associated with each other.
 <code>PDFParser</code> fetches data from a file,
 and <code>PDFDocument</code> stores it. You'll also need
 <code>PDFPageInterpreter</code> to process the page contents
 and <code>PDFDevice</code> to translate it to whatever you need.
 <code>PDFResourceManager</code> is used to store
 shared resources such as fonts or images.
 <p>
-PDF documents are more like graphics format, rather than text
+Figure 1 shows the relationship between the classes in PDFMiner.
 format.  The contents in PDF are just a bunch of procedures that
 tell how to render the stuff on a display or paper.  In most
 cases, it presents no logical structure such as sentences or
 paragraphs.  So PDFMiner attempts to reconstruct some of them by
 performing layout analysis. Ugly, I know. Again, PDF is evil.
 <p>
 Figure 1 shows the relationship between these classes:
 <div align=center>
 <img src="objrel.png"><br>
@@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no
 way to tell exactly which part of text these destinations are
 referring to.
 <a name="more">
 <hr noshade>
 <h2>More</h2>
 <p>
 You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class
 in order to process them differently / obtain other information.
 <hr noshade>
 <address>Yusuke Shinyama</address>
 </body>