diff --git a/Makefile b/Makefile
index 01de158..2b23c19 100644
--- a/Makefile
+++ b/Makefile
@@ -26,9 +26,6 @@ clean:
 test:
 	cd samples && make test
 
-cdbcmap: CMap
-	$(CONV_CMAP) CMap
-
 # Maintainance:
 commit: clean
 	$(SVN) commit
diff --git a/README.html b/README.html
index 72e5bba..0345d0f 100644
--- a/README.html
+++ b/README.html
@@ -18,7 +18,7 @@ Python PDF parser and analyzer
-Last Modified: Sat May 16 19:58:11 JST 2009
+Last Modified: Sun May 17 15:39:06 JST 2009
@@ -114,19 +114,21 @@ which is distributed from Adobe. Here is how:
 
-Get
+Get a CMap archive file from
 http://www.unixuser.org/~euske/pub/CMap.tar.bz2
-Do the follwoing:
+Expand the archive and put the CMap directory under the directory
+where pdfminer is installed.
+(Normally this should be something like /usr/lib/python2.5/site-packages.)
+For example:
 
+$ cd /usr/lib/python2.5/site-packages
 $ tar jxf CMap.tar.bz2
 
-Put the CMap directory into the pdfminer directory.
-Go to the pdfminer directory.
 Do the follwoing: (this is optional but highly recommended)
 
-$ make cdbcmap
+$ python -m pdfminer.cmap /usr/lib/python2.5/site-packages/CMap
 
@@ -135,38 +137,36 @@ $ make cdbcmap

How to Use

-PDFMiner comes with two programs:
+PDFMiner comes with two handy tools:
 pdf2txt.py and dumppdf.py.

pdf2txt.py

 pdf2txt.py extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programatically.
-It also extracts the corresponding locations, font names,
-and font sizes for each text portion. However,
-it cannot extract texts embedded within images
-(i.e. it does not do optical character recognition).
-You can provide a password for protected PDF documents
-whose access is limited.
+It extracts all the texts that are to be rendered programatically,
+i.e. it cannot extract texts drawn as images that require optical character recognition.
+It also extracts the corresponding locations, font names, font sizes, writing
+direction (horizontal or vertical) for each text portion.
+You need to provide a password for protected PDF documents when its access is restricted.
+You cannot extract any text from a PDF document which does not have extraction permission.

 For non-ASCII languages, you can specify the output encoding (such as UTF-8).
-Note that not all characters in a PDF can be converted safely
-to Unicode, as some of them are not included in the current
-Unicode Standard.
+

+Note: Not all characters in a PDF can be safely converted to Unicode.

Examples:

-$ python -m pdflib.pdf2txt -o output.html samples/naacl06-shinyama.pdf
+$ pdf2txt.py samples/naacl06-shinyama.pdf > output.html
 (extract text as an HTML file whose filename is output.html)
 
-$ python -m pdflib.pdf2txt -c euc-jp samples/jo.pdf
-(extract Japanese texts in vertical writing, CMap is required)
+$ pdf2txt.py -c euc-jp samples/jo.pdf > output.html
+(extract a Japanese HTML file in vertical writing, CMap is required)
 
-$ python -m pdflib.pdf2txt -P mypassword secret.pdf
-(extract texts from an encrypted PDF file with a password)
+$ pdf2txt.py -P mypassword -t text secret.pdf > output.txt
+(extract a text from an encrypted PDF file)
 

@@ -184,10 +184,6 @@ By default, it extracts texts from all the pages.

-c codec
Specifies the output codec for non-ASCII texts.

--w
-Split each word into a different chunk in the output.
-This makes the word spacing correctly handled.
-

-t type
Specifies the output format. The following formats are currently supported.