diff --git a/README.html b/README.html
index da23724..ed974f6 100644
--- a/README.html
+++ b/README.html
@@ -18,7 +18,7 @@ Python PDF parser and analyzer
-Last Modified: Sat Jun 20 19:51:02 JST 2009
+Last Modified: Sun Jul 12 00:27:23 JST 2009
@@ -51,8 +51,8 @@ PDF parser that can be used for other purpoes instead of text analysis.

Download:
-
-http://www.unixuser.org/~euske/python/pdfminer/pdfminer-dist-20090517.tar.gz
+
+http://www.unixuser.org/~euske/python/pdfminer/pdfminer-dist-20090711.tar.gz
(1.8Mbytes)
@@ -191,23 +191,63 @@ HTML-like tags. pdf2txt tries to extract its content streams rather than inferri
Tags used here are defined in the PDF specification (See §10.7 "Tagged PDF").

-

-T cluster_margin -
-

+

-M char_margin +
-L line_margin
-W word_margin -
+
These are the parameters used for layout analysis.
+In an actual PDF file, text might be split into several chunks
+in the middle of a run, depending on the authoring software.
+Therefore, text extraction needs to splice text chunks.
+In the figure below, two text chunks whose distance is closer than
+the char_margin (shown as M) are considered
+continuous and get grouped into one. Also, two lines whose distance is closer than
+the line_margin (L) are grouped
+as a text box, which is a rectangular area that contains a "cluster" of text.
+Furthermore, it may be required to insert blank characters (spaces)
+if the distance between two words is greater than the word_margin
+(W), as a blank between words might not be
+represented as a space, but indicated by the positioning of each word.
+

+Each value is specified not as an actual length, but as a proportion of
+the length to the size of each character in question. The default values
+are M = 1.0, L = 0.3, and W = 0.2, respectively.
+[Figure: sample text "Quick brown fox" / "jumps..." with the margins M, W, and L marked]
+
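The margin rules described above can be illustrated with a minimal Python sketch. This is not pdfminer's actual implementation: the `Chunk` class, `splice_line`, and `in_same_box` are hypothetical names, and scaling each margin by the size of the later chunk is an assumption, since the text only says "the size of each character in question".

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Hypothetical record for one text chunk on a single line.
    text: str
    x0: float    # left edge
    x1: float    # right edge
    size: float  # character size

def splice_line(chunks, char_margin=1.0, word_margin=0.2):
    """Group chunks left to right; return one string per group."""
    groups = []
    current = ""
    prev = None
    for c in sorted(chunks, key=lambda c: c.x0):
        if prev is None:
            current = c.text
        else:
            gap = c.x0 - prev.x1
            if gap < char_margin * c.size:
                # Closer than char_margin: treat as continuous text.
                if gap > word_margin * c.size:
                    # Wider than word_margin: the gap stands for a space.
                    current += " "
                current += c.text
            else:
                # Too far apart: start a new group.
                groups.append(current)
                current = c.text
        prev = c
    if prev is not None:
        groups.append(current)
    return groups

def in_same_box(upper_bottom, lower_top, size, line_margin=0.3):
    """True if two lines are closer than line_margin (relative to size)."""
    return (upper_bottom - lower_top) < line_margin * size

line = [
    Chunk("Qu", 0, 20, 10),
    Chunk("ick", 20.5, 50, 10),
    Chunk("brown", 55, 100, 10),
    Chunk("fox", 130, 160, 10),
]
print(splice_line(line))
```

With the default values, a gap of 0.5 joins "Qu" and "ick" directly, a gap of 5 joins "brown" with an inserted space, and a gap of 30 starts a new group for "fox".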

-s scale -
+
Specifies the output scale. Can be used in HTML format only.

-m maxpages -
+
Specifies the maximum number of pages to extract. +By default, it extracts all the pages in a document.

-P password -
Provides the user password to open the PDF file. +
Provides the user password to access PDF contents.

-C CMap directory -
+
Specifies the path of the CMap directory. CMap is needed when extracting
+non-ASCII text (especially in Asian languages). The CMap location can
+also be specified with the CMAP_PATH environment variable.
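As a rough sketch of how the two ways of giving the CMap location might be combined, the following Python function checks an explicit option value first and then the CMAP_PATH environment variable. The function name `resolve_cmap_dir`, the precedence of the option over the variable, and the fallback value `"CMap"` are all assumptions for illustration; the text does not state them.

```python
import os

def resolve_cmap_dir(option_value=None, environ=None, default="CMap"):
    """Pick a CMap directory (hypothetical helper, not part of the tool)."""
    environ = os.environ if environ is None else environ
    if option_value:
        # Assumed: an explicit -C value wins over the environment.
        return option_value
    # Otherwise fall back to CMAP_PATH, then to a placeholder default.
    return environ.get("CMAP_PATH", default)
```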

-d
Increases the debug level. @@ -242,12 +282,13 @@ Options: By default, it only prints the document trailer (like a header).

-i objno,objno, ... -
+
Specifies PDF object IDs to display. +Comma-separated IDs, or multiple -i options are accepted.

-p pageno,pageno, ...
Specifies the page number to be extracted. -Multiple -p options are allowed. -Note that page numbers start from one. +Comma-separated page numbers, or multiple -p options are accepted. +Note that page numbers start from one, not zero.
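The accepted forms for -p (comma-separated numbers, repeated options, numbering from one) can be sketched with Python's standard `getopt` module. This is only an illustration under stated assumptions: `parse_pages` is a hypothetical helper, and converting to zero-based indexes internally is a design choice made here, not something the text confirms.

```python
import getopt

def parse_pages(argv):
    """Collect page numbers from repeated or comma-separated -p options."""
    opts, args = getopt.getopt(argv, "p:")
    pages = set()
    for key, value in opts:
        if key == "-p":
            for token in value.split(","):
                pageno = int(token)
                if pageno < 1:
                    raise ValueError("page numbers start from one, not zero")
                # Assumed: store zero-based indexes internally.
                pages.add(pageno - 1)
    return sorted(pages), args

print(parse_pages(["-p", "1,3", "-p", "5", "file.pdf"]))
```

The same pattern would apply to the -i option, which accepts object IDs in the same two forms.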

-r (raw)
-b (binary) @@ -263,11 +304,11 @@ similar to repr() manner. When -r or -b option is given, no stream header is displayed for the ease of saving it to a file.

-

-P password -
Provides the user password to open the PDF file. -

-T -
+
Shows the table of contents. +

+

-P password +
Provides the user password to access PDF contents.

-d
Increases the debug level. @@ -277,6 +318,7 @@ no stream header is displayed for the ease of saving it to a file.

Changes