diff --git a/docs/programming.html b/docs/programming.html
index efd3ffc..b76448a 100644
--- a/docs/programming.html
+++ b/docs/programming.html
@@ -20,35 +20,41 @@ from other applications.
  • Basic Usage
  • Layout Analysis
  • TOC Extraction
+ • more

    Overview

-PDF is evil.
-Because a PDF file is normally big and has a complex structure,
-parsing a PDF as a whole is time-and-memory
-consuming. Furthermore, not every part is needed for most PDF
-processing. Therefore, PDFMiner takes a strategy of lazy parsing,
-which is to parse the stuff only when it's necessary. To parse PDF
-files, you need at least two classes: PDFParser
-and PDFDocument. These objects work together.
-PDFParser fetches (or parses) data from a PDF,
+PDF is evil. Although it is called a PDF
+"document", it's nothing like Word or HTML. PDF is more like a
+picture representation. PDF contents are just a bunch of
+instructions that tell how to place the stuff at each exact
+position on a display or paper. In most cases, it has no logical
+structure such as sentences or paragraphs and it cannot adapt
+itself when the paper size changes. PDFMiner attempts to
+reconstruct some of those structures by guessing from its
+positioning, but there's nothing guaranteed to work. Ugly, I
+know. Again, PDF is evil.
+
+

+Because a PDF file has such a big and complex structure,
+parsing a PDF file as a whole is time and memory consuming. However,
+not every part is needed for most PDF processing tasks. Therefore
+PDFMiner takes a strategy of lazy parsing, which is to parse the
+stuff only when it's necessary. To parse PDF files, you need to use at
+least two classes: PDFParser and PDFDocument.
+These two objects are associated with each other.
+PDFParser fetches data from a file,
 and PDFDocument stores it. You'll also need PDFPageInterpreter to process the page contents and PDFDevice to translate it to whatever you need.
+PDFResourceManager is used to store
+shared resources such as fonts or images.

-PDF documents are more like graphics format, rather than text
-format. The contents in PDF are just a bunch of procedures that
-tell how to render the stuff on a display or paper. In most
-cases, it presents no logical structure such as sentences or
-paragraphs. So PDFMiner attempts to reconstruct some of them by
-performing layout analysis. Ugly, I know. Again, PDF is evil.
-
-

-Figure 1 shows the relationship between these classes:
+Figure 1 shows the relationship between the classes in PDFMiner.


    @@ -199,6 +205,14 @@ way to refer to any in-page object from the outside, there's no way to tell exactly which part of text these destinations are referring to. +
    +
    +

    More

    + +

+You can extend PDFPageInterpreter and PDFDevice class
+in order to process them differently / obtain other information.
+


    Yusuke Shinyama
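
The lazy parsing strategy described in the Overview text of this diff can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PDFMiner's implementation or API: LazyDocument, get_object, the offsets table, and the sample data are all hypothetical names made up for this example. The idea it shows is that one object knows where each piece of data lives, and a piece is read and decoded only the first time somebody asks for it.

```python
import io


class LazyDocument:
    """Toy container that parses an object only on first access.

    Hypothetical illustration; not PDFMiner's PDFParser/PDFDocument.
    """

    def __init__(self, stream, offsets):
        self._stream = stream      # seekable binary stream
        self._offsets = offsets    # object id -> (byte offset, length)
        self._cache = {}           # object id -> already parsed value
        self.parse_count = 0       # number of objects actually parsed

    def get_object(self, objid):
        # Parse on demand, then serve later requests from the cache.
        if objid not in self._cache:
            offset, length = self._offsets[objid]
            self._stream.seek(offset)
            raw = self._stream.read(length)
            self._cache[objid] = raw.decode("ascii").strip()
            self.parse_count += 1
        return self._cache[objid]


# Made-up sample data: three newline-terminated "objects" in one stream.
data = b"first object\nsecond object\nthird object\n"
offsets = {1: (0, 13), 2: (13, 14), 3: (27, 13)}
doc = LazyDocument(io.BytesIO(data), offsets)

print(doc.get_object(2))   # reads and decodes object 2 only
print(doc.parse_count)     # objects 1 and 3 have not been touched
```

Asking for the same object again returns the cached value without touching the stream, which is the point of the strategy: the cost is paid per object used, not per file.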