diff --git a/docs/index.html b/docs/index.html
index a7cff65..50bd581 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -5,9 +5,17 @@ PDFMiner +
+ +Last Modified: Sun Oct 17 09:10:34 UTC 2010 + +
+

PDFMiner

Python PDF parser and analyzer @@ -17,31 +25,22 @@ Python PDF parser and analyzer   Recent Changes -

- -Last Modified: Sun Oct 17 05:13:01 UTC 2010 - -
- - -
-

What's It?

+

What's It?

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting @@ -51,8 +50,9 @@ other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis. +

-Features: +

Features

-On the performance side, PDFMiner is about 20 times slower than -other C/C++-based software such as XPdf. +other C/C++-based counterparts such as XPdf. - +

Download

-Download from PyPI:
+Source distribution:
http://pypi.python.org/pypi/pdfminer/ +

+SVN repository:
+ +http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer + +

Discussion: (for questions and comments, post here)
http://groups.google.com/group/pdfminer-users/ -

-View the source:
- -http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer - -

Online Demo: (pdf -> html conversion webapp)
@@ -96,13 +95,10 @@ http://pdf2html.tabesugi.net:8080/ - -


-

Install

- +

How to Install

  1. Install Python 2.4 or newer. -(Python 3 is not supported.) + (Python 3 is not supported.)
  2. Download the PDFMiner source.
  3. Unpack it.
  4. Run setup.py to install:
    @@ -131,9 +127,8 @@ W o r l d
  5. Done!
+

For CJK languages

- -

For CJK languages

In order to process CJK languages, you need an additional step to take during installation:
@@ -146,6 +141,7 @@ writing 'CNS1_H.py'...
 
 # python setup.py install
 
+

On Windows machines which don't have make command, paste the following commands on a command line prompt: @@ -157,16 +153,12 @@ paste the following commands on a command line prompt: python setup.py install - -


-

How to Use

- +

How to Use

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py. - -

pdf2txt.py

+

pdf2txt.py

pdf2txt.py extracts text contents from a PDF file. It extracts all the texts that are to be rendered programmatically, @@ -176,11 +168,12 @@ It also extracts the corresponding locations, font names, font sizes, writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when its access is restricted. You cannot extract any text from a PDF document which does not have extraction permission. -

-Note: Not all characters in a PDF can be safely converted to Unicode.

-Examples: +Note: +Not all characters in a PDF can be safely converted to Unicode. + +

Examples

 $ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
 (extract text as an HTML file whose filename is output.html)
@@ -192,8 +185,7 @@ $ pdf2txt.py -P mypassword -o output.txt secret.pdf
 (extract a text from an encrypted PDF file)
 
-

-Options: +

Options

-o filename
Specifies the output file name. @@ -286,16 +278,14 @@ By default, it extracts all the pages in a document.
Increases the debug level.
- -

dumppdf.py

+

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images). -

-Examples: +

Examples

 $ dumppdf.py -a foo.pdf
 (dump all the headers and contents, except stream objects)
@@ -307,8 +297,7 @@ $ dumppdf.py -r -i6 foo.pdf > pic.jpeg
 (extract a JPEG image)
 
-

-Options: +

Options

-a
Instructs to dump all the objects. @@ -347,8 +336,7 @@ no stream header is displayed for the ease of saving it to a file.
Increases the debug level.
- -

Use as Library

+

Use as Library

PDFMiner can be used as a library by other Python programs.

@@ -356,21 +344,7 @@ For details, see the Programming with PDFMiner pa

Also, check out a more complete example by Denis Papathanasiou. - -


-

Technical Documents

-

-

- - -
-

TODOs

+

TODOs

- -
-

Changes

+

Changes

-

Related Projects

-

Terms and Conditions

(This is so-called MIT/X License)
diff --git a/docs/programming.html b/docs/programming.html
index b76448a..8026037 100644
--- a/docs/programming.html
+++ b/docs/programming.html
@@ -5,31 +5,38 @@ Programming with PDFMiner + +

+ +Last Modified: Sun Oct 17 09:12:03 UTC 2010 + +
+

[Back to PDFMiner homepage]

Programming with PDFMiner

-This document explains how to use PDFMiner as a library +This page explains how to use PDFMiner as a library from other applications.

- -
-

Overview

+

Overview

PDF is evil. Although it is called a PDF -"document", it's nothing like Word or HTML. PDF is more like a -picture representation. PDF contents are just a bunch of +"document", it's nothing like Word or HTML document. PDF is more +like a graphic representation. PDF contents are just a bunch of instructions that tell how to place the stuff at each exact position on a display or paper. In most cases, it has no logical structure such as sentences or paragraphs and it cannot adapt @@ -38,6 +45,13 @@ reconstruct some of those structures by guessing from its positioning, but there's nothing guaranteed to work. Ugly, I know. Again, PDF is evil. +

+[More technical details about the internal structure of PDF: +"How to Extract Text Contents from PDF Manually" +(part 1) +(part 2) +(part 3)] +

Because a PDF file has such a big and complex structure, parsing a PDF file as a whole is time and memory consuming. However, @@ -61,9 +75,7 @@ Figure 1 shows the relationship between the classes in PDFMiner. Figure 1. Relationships between PDFMiner classes - -


-

Basic Usage

+

Basic Usage

A typical way to parse a PDF file is the following:

@@ -97,9 +109,7 @@ for page in doc.get_pages():
     interpreter.process_page(page)
 
- -
-

Accessing Layout Objects

+

Accessing Layout Objects

Here is a typical way to use the layout analysis function:

@@ -174,9 +184,7 @@ Could be used for framing another pictures or figures.
 
Represents a polygon in a page. - -
-

TOC Extraction

+

TOC Extraction

PDFMiner provides functions to access the document's table of contents ("Outlines"). @@ -205,9 +213,7 @@ way to refer to any in-page object from the outside, there's no way to tell exactly which part of text these destinations are refering to. - -


-

More

+

Parser Extension

You can extend PDFPageInterpreter and PDFDevice class