diff --git a/docs/index.html b/docs/index.html
index 6e24e87..c63ff28 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -48,12 +48,12 @@ Python PDF parser and analyzer

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting
-and analyzing text data. PDFMiner allows to obtain
-the exact location of texts in a page, as well as
+and analyzing text data. PDFMiner allows one to obtain
+the exact location of text in a page, as well as
other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible
-PDF parser that can be used for other purposes instead of text analysis.
+PDF parser that can be used for other purposes than text analysis.

Features

@@ -167,9 +167,9 @@ PDFMiner comes with two handy tools:

pdf2txt.py

pdf2txt.py extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programmatically,
-ie. text represented as ASCII or Unicode strings.
-It cannot recognize texts drawn as images that would require optical character recognition.
+It extracts all the text that are to be rendered programmatically,
+i.e. text represented as ASCII or Unicode strings.
+It cannot recognize text drawn as images that would require optical character recognition.
It also extracts the corresponding locations, font names, font sizes, writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when its access is restricted.
@@ -199,8 +199,8 @@ By default, it prints the extracted contents to stdout in text format.

-p pageno[,pageno,...]
Specifies the comma-separated list of the page numbers to be extracted.
-Page numbers are starting from one.
-By default, it extracts texts from all the pages.
+Page numbers start at one.
+By default, it extracts text from all the pages.

-c codec
Specifies the output codec.
@@ -210,7 +210,7 @@ By default, it extracts texts from all the pages.