diff --git a/Makefile b/Makefile
index 01de158..2b23c19 100644
--- a/Makefile
+++ b/Makefile
@@ -26,9 +26,6 @@ clean:
 test:
 	cd samples && make test
 
-cdbcmap: CMap
-	$(CONV_CMAP) CMap
-
 # Maintainance:
 commit: clean
 	$(SVN) commit
diff --git a/README.html b/README.html
index 72e5bba..0345d0f 100644
--- a/README.html
+++ b/README.html
@@ -18,7 +18,7 @@ Python PDF parser and analyzer
CMap
directory under the directory
+where pdfminer
is installed.
+(Normally this should be something like /usr/lib/python2.5/site-packages
.)
+For example:
-
+$ cd /usr/lib/python2.5/site-packages
$ tar jxf CMap.tar.bz2
CMap
directory into the pdfminer
directory.
-pdfminer
directory.
-$ make cdbcmap
+$ python -m pdfminer.cmap /usr/lib/python2.5/site-packages/CMap
-PDFMiner comes with two programs:
+PDFMiner comes with two handy tools:
pdf2txt.py
and dumppdf.py
.
pdf2txt.py
extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programatically.
-It also extracts the corresponding locations, font names,
-and font sizes for each text portion. However,
-it cannot extract texts embedded within images
-(i.e. it does not do optical character recognition).
-You can provide a password for protected PDF documents
-whose access is limited.
+It extracts all the texts that are to be rendered programatically,
+i.e. it cannot extract texts drawn as images that require optical character recognition.
+It also extracts the corresponding locations, font names, font sizes, writing
+direction (horizontal or vertical) for each text portion.
+You need to provide a password for protected PDF documents when its access is restricted.
+You cannot extract any text from a PDF document which does not have extraction permission.
For non-ASCII languages, you can specify the output encoding (such as UTF-8).
-Note that not all characters in a PDF can be converted safely
-to Unicode, as some of them are not included in the current
-Unicode Standard.
+
+Note: Not all characters in a PDF can be safely converted to Unicode.
Examples:
-$ python -m pdflib.pdf2txt -o output.html samples/naacl06-shinyama.pdf
+$ pdf2txt.py samples/naacl06-shinyama.pdf > output.html
(extract text as an HTML file whose filename is output.html)
-$ python -m pdflib.pdf2txt -c euc-jp samples/jo.pdf
-(extract Japanese texts in vertical writing, CMap is required)
+$ pdf2txt.py -c euc-jp samples/jo.pdf > output.html
+(extract a Japanese HTML file in vertical writing, CMap is required)
-$ python -m pdflib.pdf2txt -P mypassword secret.pdf
-(extract texts from an encrypted PDF file with a password)
+$ pdf2txt.py -P mypassword -t text secret.pdf > output.txt
+(extract a text from an encrypted PDF file)
@@ -184,10 +184,6 @@ By default, it extracts texts from all the pages.
-c codec
-
-w
-
-t type
Examples:
-$ python -m tools.dumppdf -a foo.pdf
+$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)
-$ python -m tools.dumppdf -T foo.pdf
+$ dumppdf.py -T foo.pdf
(dump the table of contents)
-$ python -m tools.dumppdf -r -i6 foo.pdf > pic.jpeg
+$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
diff --git a/pdfminer/cmap.py b/pdfminer/cmap.py
index b0f2442..4e6e315 100644
--- a/pdfminer/cmap.py
+++ b/pdfminer/cmap.py
@@ -204,7 +204,6 @@ class CMapDB(object):
   @classmethod
   def get_cmap(klass, cmapname, strict=True):
-    import os.path
     cmapname = klass.CMAP_ALIAS.get(cmapname, cmapname)
     if cmapname in klass.cmapdb:
       cmap = klass.cmapdb[cmapname]