overview

git-svn-id: https://pdfminer.googlecode.com/svn/trunk/pdfminer@251 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:15:05 +00:00 · e0db043260
parent 0086971714
commit e0db043260
1 changed file with 32 additions and 18 deletions
--- a/docs/programming.html
+++ b/docs/programming.html
@@ -20,35 +20,41 @@ from other applications.
 <li> <a href="#basic">Basic Usage</a>
 <li> <a href="#layout">Layout Analysis</a>
 <li> <a href="#toc">TOC Extraction</a>
+<li> <a href="#more">More</a>
 </ul>

 <a name="overview">
 <hr noshade>
 <h2>Overview</h2>
 <p>
-<strong>PDF is evil.</strong>
-Because a PDF file is normally big and has a complex structure,
-parsing a PDF as a whole is time-and-memory
-consuming. Furthermore, not every part is needed for most PDF
-processing. Therefore, PDFMiner takes a strategy of lazy parsing,
-which is to parse the stuff only when it's necessary. To parse PDF
-files, you need at least two classes: <code>PDFParser</code>
-and <code>PDFDocument</code>.  These objects work together.
-<code>PDFParser</code> fetches (or parses) data from a PDF,
+<strong>PDF is evil.</strong>  Although it is called a PDF
+"document", it's nothing like Word or HTML. PDF is more like a
+picture representation.  PDF contents are just a bunch of
+instructions that tell how to place the stuff at each exact
+position on a display or paper.  In most cases, it has no logical
+structure such as sentences or paragraphs and it cannot adapt
+itself when the paper size changes. PDFMiner attempts to
+reconstruct some of that structure by guessing from the
+positioning, but nothing is guaranteed to work. Ugly, I
+know. Again, PDF is evil.
+
+<p>
+Because a PDF file has such a big and complex structure,
+parsing a PDF file as a whole is time- and memory-consuming. However,
+not every part is needed for most PDF processing tasks. Therefore,
+PDFMiner takes a strategy of lazy parsing, which is to parse the
+stuff only when it's necessary. To parse PDF files, you need to use at
+least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
+These two objects are associated with each other.
+<code>PDFParser</code> fetches data from a file,
 and <code>PDFDocument</code> stores it. You'll also need
 <code>PDFPageInterpreter</code> to process the page contents
 and <code>PDFDevice</code> to translate it to whatever you need.
+<code>PDFResourceManager</code> is used to store
+shared resources such as fonts or images.

 <p>
-PDF documents are more like graphics format, rather than text
-format.  The contents in PDF are just a bunch of procedures that
-tell how to render the stuff on a display or paper.  In most
-cases, it presents no logical structure such as sentences or
-paragraphs.  So PDFMiner attempts to reconstruct some of them by
-performing layout analysis. Ugly, I know. Again, PDF is evil.
-
-<p>
-Figure 1 shows the relationship between these classes:
+Figure 1 shows the relationship between the classes in PDFMiner.

 <div align=center>
 <img src="objrel.png"><br>
@@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no
 way to tell exactly which part of text these destinations are
 referring to.

+<a name="more">
+<hr noshade>
+<h2>More</h2>
+
+<p>
+You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
+to process page contents differently or to obtain other information.
+
 <hr noshade>
 <address>Yusuke Shinyama</address>
 </body>
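
Review note on the new "More" section: the text says the interpreter and device classes can be extended, but gives no picture of what that looks like. The sketch below is one way to illustrate the idea. It deliberately uses toy stand-in classes (ToyInterpreter, ToyDevice, TextCollector are all hypothetical names, not PDFMiner's API) so that it runs with the standard library alone. It only shows the general pattern of an interpreter forwarding page instructions to a device, and a device subclass collecting the information it cares about.

```python
# Toy stand-ins only: these classes are hypothetical and are NOT PDFMiner's API.
# They mimic the "interpreter feeds a device" relationship described in the docs.


class ToyDevice:
    """Receives drawing events; the base class ignores all of them."""

    def begin_page(self, pageno):
        pass

    def render_text(self, x, y, text):
        pass

    def end_page(self, pageno):
        pass


class ToyInterpreter:
    """Walks a page's instructions and forwards each one to a device."""

    def __init__(self, device):
        self.device = device

    def process_page(self, pageno, instructions):
        self.device.begin_page(pageno)
        for (x, y, text) in instructions:
            self.device.render_text(x, y, text)
        self.device.end_page(pageno)


class TextCollector(ToyDevice):
    """A device subclass that records each piece of text with its position."""

    def __init__(self):
        self.items = []

    def render_text(self, x, y, text):
        self.items.append((x, y, text))

    def lines(self):
        # Order top-to-bottom (larger y first, since PDF's default y axis
        # points upward), then left-to-right within the same y.
        ordered = sorted(self.items, key=lambda item: (-item[1], item[0]))
        return [text for (_x, _y, text) in ordered]


def demo():
    device = TextCollector()
    interpreter = ToyInterpreter(device)
    # Instructions arrive in drawing order, which need not be reading order.
    page = [(72, 700, "Hello"), (72, 680, "second line"), (150, 700, "world")]
    interpreter.process_page(1, page)
    return device.lines()
```

Calling demo() returns the text pieces sorted by position rather than by drawing order. A real subclass would override whichever methods the actual PDFDevice class exposes, which this note does not attempt to enumerate.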