diff --git a/docs/index.html b/docs/index.html index a7cff65..50bd581 100644 --- a/docs/index.html +++ b/docs/index.html @@ -5,9 +5,17 @@
Python PDF parser and analyzer @@ -17,31 +25,22 @@ Python PDF parser and analyzer Recent Changes -
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting @@ -51,8 +50,9 @@ other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis. +
-Features: +
-On the performance side, PDFMiner is about 20 times slower than -other C/C++-based software such as XPdf. +other C/C++-based counterparts such as XPdf. - +
-Download from PyPI:
+Source distribution:
http://pypi.python.org/pypi/pdfminer/
+
+SVN repository:
+
+http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
+
+
Discussion: (for questions and comments, post here)
http://groups.google.com/group/pdfminer-users/
-
-View the source:
-
-http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
-
-
Online Demo: (pdf -> html conversion webapp)
@@ -96,13 +95,10 @@ http://pdf2html.tabesugi.net:8080/
-
-
setup.py
to install:+@@ -146,6 +141,7 @@ writing 'CNS1_H.py'... # python setup.py install
On Windows machines which don't have make
command,
paste the following commands on a command line prompt:
@@ -157,16 +153,12 @@ paste the following commands on a command line prompt:
python setup.py install
-
-
PDFMiner comes with two handy tools:
pdf2txt.py
and dumppdf.py
.
-
-
pdf2txt.py
extracts text contents from a PDF file.
It extracts all the text that is to be rendered programmatically,
@@ -176,11 +168,12 @@ It also extracts the corresponding locations, font names, font sizes, writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for a protected PDF document when its access is restricted.
You cannot extract any text from a PDF document which does not have extraction permission.
-
-Note: Not all characters in a PDF can be safely converted to Unicode.
-Examples: +Note: +Not all characters in a PDF can be safely converted to Unicode. + +
-$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf (extract text as an HTML file whose filename is output.html) @@ -192,8 +185,7 @@ $ pdf2txt.py -P mypassword -o output.txt secret.pdf (extract text from an encrypted PDF file)
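The command lines above can also be assembled from a script. The following is a minimal sketch, assuming `pdf2txt.py` is on the PATH; it uses only the `-o` and `-P` options shown in the examples, and the helper names `build_pdf2txt_command` and `run_pdf2txt` are hypothetical, not part of PDFMiner.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output_path, password=None):
    """Return the argument list for one pdf2txt.py run (hypothetical helper)."""
    argv = ["pdf2txt.py"]
    if password is not None:
        # -P supplies the password for a protected document.
        argv += ["-P", password]
    # -o names the output file; the input PDF comes last.
    argv += ["-o", output_path, pdf_path]
    return argv


def run_pdf2txt(pdf_path, output_path, password=None):
    """Run pdf2txt.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdf2txt_command(pdf_path, output_path, password), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_pdf2txt() to actually execute it.
    print(" ".join(build_pdf2txt_command("secret.pdf", "output.txt", "mypassword")))
```

Passing the arguments as a list avoids shell quoting problems when a file name or password contains spaces.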
-Options: +
-o filename
dumppdf.py
dumps the internal contents of a PDF file
in pseudo-XML format. This program is primarily for debugging purposes,
but it's also possible to extract some meaningful contents
(such as images).
-
-Examples: +
-$ dumppdf.py -a foo.pdf (dump all the headers and contents, except stream objects) @@ -307,8 +297,7 @@ $ dumppdf.py -r -i6 foo.pdf > pic.jpeg (extract a JPEG image)
-Options: +
-a
PDFMiner can be used as a library by other Python programs.
@@ -356,21 +344,7 @@ For details, see the Programming with PDFMiner pa
Also, check out a more complete example by Denis Papathanasiou. - -
-
- - -(This is so-called MIT/X License) diff --git a/docs/programming.html b/docs/programming.html index b76448a..8026037 100644 --- a/docs/programming.html +++ b/docs/programming.html @@ -5,31 +5,38 @@
-This document explains how to use PDFMiner as a library +This page explains how to use PDFMiner as a library from other applications.
- -PDF is evil. Although it is called a PDF -"document", it's nothing like Word or HTML. PDF is more like a -picture representation. PDF contents are just a bunch of +"document", it's nothing like a Word or HTML document. PDF is more +like a graphic representation. PDF contents are just a bunch of instructions that tell how to place the stuff at each exact position on a display or paper. In most cases, it has no logical structure such as sentences or paragraphs and it cannot adapt @@ -38,6 +45,13 @@ reconstruct some of those structures by guessing from its positioning, but there's nothing guaranteed to work. Ugly, I know. Again, PDF is evil. +
+[More technical details about the internal structure of PDF: +"How to Extract Text Contents from PDF Manually" +(part 1) +(part 2) +(part 3)] +
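To make "guessing from its positioning" concrete, here is a toy sketch that groups positioned text fragments into lines by their vertical coordinate. It is an illustration only and not PDFMiner's layout analysis; the function name `group_into_lines`, the `(x, y, text)` tuple format, and the `y_tolerance` parameter are all assumptions made for this example.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into text lines, top to bottom.

    Assumes PDF-style coordinates where y grows upward, so a larger y
    means the fragment is higher on the page.
    """
    lines = []  # each item: (y of the line's first fragment, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat it as part of the current line.
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within a line, order the fragments left to right.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


if __name__ == "__main__":
    page = [(100, 700, "world"), (50, 701, "Hello"),
            (50, 680, "Second"), (120, 680, "line")]
    for line in group_into_lines(page):
        print(line)
```

A fixed tolerance like this breaks down on rotated text, multi-column pages, or superscripts, which is why nothing here is guaranteed to work.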
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time and memory consuming. However,
@@ -61,9 +75,7 @@ Figure 1 shows the relationship between the classes in PDFMiner.
Figure 1. Relationships between PDFMiner classes
-
-
A typical way to parse a PDF file is the following:
Here is a typical way to use the layout analysis function:
PDFMiner provides functions to access the document's table of contents
("Outlines").
@@ -205,9 +213,7 @@ way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are
refering to.
-
-
You can extend
-Basic Usage
+Basic Usage
-
-
@@ -97,9 +109,7 @@ for page in doc.get_pages():
interpreter.process_page(page)
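The page loop above can be wrapped in a small function. This is a sketch that relies only on the two calls visible in the loop, `doc.get_pages()` and `interpreter.process_page(page)`; the wrapper `process_all_pages` and the stand-in classes are hypothetical and exist only so the example runs without a PDF file.

```python
def process_all_pages(doc, interpreter):
    """Feed every page of doc to interpreter and return the page count."""
    count = 0
    for page in doc.get_pages():
        interpreter.process_page(page)
        count += 1
    return count


class FakeDocument:
    """Stand-in for a parsed document (illustration only)."""

    def __init__(self, pages):
        self._pages = pages

    def get_pages(self):
        return iter(self._pages)


class RecordingInterpreter:
    """Stand-in interpreter that just remembers the pages it was given."""

    def __init__(self):
        self.seen = []

    def process_page(self, page):
        self.seen.append(page)


if __name__ == "__main__":
    interp = RecordingInterpreter()
    n = process_all_pages(FakeDocument(["page 1", "page 2"]), interp)
    print(n, interp.seen)
```

Because the pages are consumed one at a time, the loop never needs to hold the whole document in memory.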
-Accessing Layout Objects
+Accessing Layout Objects
@@ -174,9 +184,7 @@ Could be used for framing other pictures or figures.
-TOC Extraction
+TOC Extraction
-More
+Parser Extension
PDFPageInterpreter
and PDFDevice
classes