diff --git a/docs/programming.html b/docs/programming.html
index efd3ffc..b76448a 100644
--- a/docs/programming.html
+++ b/docs/programming.html
@@ -20,35 +20,41 @@ from other applications.
-PDF is evil.
-Because a PDF file is normally big and has a complex structure,
-parsing a PDF as a whole is time-and-memory
-consuming. Furthermore, not every part is needed for most PDF
-processing. Therefore, PDFMiner takes a strategy of lazy parsing,
-which is to parse the stuff only when it's necessary. To parse PDF
-files, you need at least two classes: PDFParser
-and PDFDocument
. These objects work together.
-PDFParser
fetches (or parses) data from a PDF,
+PDF is evil. Although it is called a PDF
+"document", it's nothing like Word or HTML. PDF is more like a
+picture representation. PDF contents are just a bunch of
+instructions that tell how to place the stuff at each exact
+position on a display or paper. In most cases, it has no logical
+structure such as sentences or paragraphs and it cannot adapt
+itself when the paper size changes. PDFMiner attempts to
+reconstruct some of those structures by guessing from its
+positioning, but there's nothing guaranteed to work. Ugly, I
+know. Again, PDF is evil.
+
+
+Because a PDF file has such a big and complex structure,
+parsing a PDF file as a whole is time and memory consuming. However,
+not every part is needed for most PDF processing tasks. Therefore
+PDFMiner takes a strategy of lazy parsing, which is to parse the
+stuff only when it's necessary. To parse PDF files, you need to use at
+least two classes: PDFParser
and PDFDocument
.
+These two objects are associated with each other.
+PDFParser
fetches data from a file,
and PDFDocument
stores it. You'll also need
PDFPageInterpreter
to process the page contents
and PDFDevice
to translate it to whatever you need.
+PDFResourceManager
is used to store
+shared resources such as fonts or images.
-PDF documents are more like graphics format, rather than text
-format. The contents in PDF are just a bunch of procedures that
-tell how to render the stuff on a display or paper. In most
-cases, it presents no logical structure such as sentences or
-paragraphs. So PDFMiner attempts to reconstruct some of them by
-performing layout analysis. Ugly, I know. Again, PDF is evil.
-
-
-Figure 1 shows the relationship between these classes:
+Figure 1 shows the relationship between the classes in PDFMiner.
+You can extend PDFPageInterpreter
and PDFDevice
class
+in order to process them differently / obtain other information.
+