diff --git a/docs/index.html b/docs/index.html
index 6e24e87..c63ff28 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -48,12 +48,12 @@ Python PDF parser and analyzer
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting
-and analyzing text data. PDFMiner allows to obtain
-the exact location of texts in a page, as well as
+and analyzing text data. PDFMiner allows one to obtain
+the exact location of text in a page, as well as
other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible
-PDF parser that can be used for other purposes instead of text analysis.
+PDF parser that can be used for other purposes than text analysis.
pdf2txt.py
extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programmatically,
-ie. text represented as ASCII or Unicode strings.
-It cannot recognize texts drawn as images that would require optical character recognition.
+It extracts all the text that is to be rendered programmatically,
+i.e. text represented as ASCII or Unicode strings.
+It cannot recognize text drawn as images that would require optical character recognition.
It also extracts the corresponding locations, font names, font sizes, writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for protected PDF documents when its access is restricted.
@@ -199,8 +199,8 @@ By default, it prints the extracted contents to stdout in text format.
-p pageno[,pageno,...]
-c codec
text
: TEXT format. (Default)
html
: HTML format. Not recommended for extraction purposes because the markup is messy.
-xml
: XML format. Provides the most information available.
+xml
: XML format. Provides the most information.
tag
: "Tagged PDF" format. A tagged PDF has its own contents annotated with
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
Tags used here are defined in the PDF specification (See §10.7 "Tagged PDF").
@@ -224,14 +224,14 @@ Currently only JPEG images are supported.
-L line_margin
-W word_margin
-A
-V
-i
options are accepted.
-p pageno,pageno, ...
-p
options are accepted.
-Note that page numbers start from one, not zero.
+Note that page numbers start at one, not zero.
-r
(raw)
-b
(binary)
diff --git a/docs/programming.html b/docs/programming.html
index 16f3ebe..f71ddcf 100644
--- a/docs/programming.html
+++ b/docs/programming.html
@@ -170,7 +170,7 @@ pay much attention to graphical objects.
LTLine
LTRect