documentation.

git-svn-id: https://pdfminer.googlecode.com/svn/trunk/pdfminer@109 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-17 06:39:54 +00:00
parent 8cae56a555
commit 5c1cebadbb
3 changed files with 25 additions and 33 deletions
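
Note on the README change in this commit: the revised installation steps tell users to expand `CMap.tar.bz2` under the directory where `pdfminer` is installed, so that a `CMap` directory sits next to the package. The same step could be scripted roughly as below. This is a minimal sketch, not part of the commit. The function name `install_cmap_archive` is hypothetical, the code targets Python 3 rather than the Python 2.5 the README mentions, and the path check is an added precaution that the README does not ask for.

```python
import os
import tarfile


def install_cmap_archive(archive_path, target_dir):
    """Unpack a bzip2-compressed CMap archive into target_dir.

    Hypothetical helper: returns the path of the CMap directory
    that the README expects to find under target_dir.
    """
    target_root = os.path.realpath(target_dir)
    with tarfile.open(archive_path, "r:bz2") as archive:
        for member in archive.getmembers():
            # Refuse entries that would be written outside target_dir.
            dest = os.path.realpath(os.path.join(target_root, member.name))
            if os.path.commonpath([target_root, dest]) != target_root:
                raise ValueError("unsafe path in archive: %r" % member.name)
        archive.extractall(target_root)
    cmap_dir = os.path.join(target_root, "CMap")
    if not os.path.isdir(cmap_dir):
        raise RuntimeError("archive did not contain a CMap directory")
    return cmap_dir
```

The returned path is the argument the README passes to its optional follow-up step, `python -m pdfminer.cmap <path to CMap>`.
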
--- a/3
+++ b/3
@@ -26,9 +26,6 @@ clean:
 test:
 	cd samples && make test

-cdbcmap: CMap
-	$(CONV_CMAP) CMap
-
 # Maintainance:
 commit: clean
 	$(SVN) commit
--- a/README.html
+++ b/README.html
@@ -18,7 +18,7 @@ Python PDF parser and analyzer

 <div align=right class=lastmod>
 <!-- hhmts start -->
-Last Modified: Sat May 16 19:58:11 JST 2009
+Last Modified: Sun May 17 15:39:06 JST 2009
 <!-- hhmts end -->
 </div>

@@ -114,19 +114,21 @@ which is distributed from Adobe.
 Here is how:

 <ol>
-<li> Get 
+<li> Get a CMap archive file from
 <a href="http://www.unixuser.org/~euske/pub/CMap.tar.bz2">
 http://www.unixuser.org/~euske/pub/CMap.tar.bz2
 </a>
-<li> Do the follwoing:
+<li> Expand the archive and put the <code>CMap</code> directory under the directory 
+where <code>pdfminer</code> is installed.
+(Normally this should be something like <code>/usr/lib/python2.5/site-packages</code>.)
+For example:
 <blockquote><pre>
+$ <strong>cd /usr/lib/python2.5/site-packages</strong>
 $ <strong>tar jxf CMap.tar.bz2</strong>
 </pre></blockquote>
-<li> Put the <code>CMap</code> directory into the <code>pdfminer</code> directory.
-<li> Go to the <code>pdfminer</code> directory.
 <li> Do the follwoing: (this is optional but highly recommended)<br>
 <blockquote><pre>
-$ <strong>make cdbcmap</strong>
+$ <strong>python -m pdfminer.cmap /usr/lib/python2.5/site-packages/CMap</strong>
 </pre></blockquote>
 </ol>

@@ -135,38 +137,36 @@ $ <strong>make cdbcmap</strong>
 <h2>How to Use</h2>

 <p>
-PDFMiner comes with two programs:
+PDFMiner comes with two handy tools:
 <code>pdf2txt.py</code> and <code>dumppdf.py</code>.

 <a name="pdf2txt"></a>
 <h3>pdf2txt.py</h3>
 <p>
 <code>pdf2txt.py</code> extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programatically.
-It also extracts the corresponding locations, font names,
-and font sizes for each text portion. However,
-it cannot extract texts embedded within images
-(i.e. it does not do optical character recognition).
-You can provide a password for protected PDF documents 
-whose access is limited.
+It extracts all the texts that are to be rendered programatically,
+i.e. it cannot extract texts drawn as images that require optical character recognition.
+It also extracts the corresponding locations, font names, font sizes, writing
+direction (horizontal or vertical) for each text portion.
+You need to provide a password for protected PDF documents when its access is restricted.
+You cannot extract any text from a PDF document which does not have extraction permission.
 <p>
 For non-ASCII languages, you can specify the output encoding 
 (such as UTF-8).
-Note that not all characters in a PDF can be converted safely
-to Unicode, as some of them are not included in the current
-Unicode Standard.
+<p>
+<strong>Note:</strong> Not all characters in a PDF can be safely converted to Unicode.

 <p>
 Examples:
 <blockquote><pre>
-$ <strong>python -m pdflib.pdf2txt -o output.html samples/naacl06-shinyama.pdf</strong>
+$ <strong>pdf2txt.py samples/naacl06-shinyama.pdf &gt; output.html</strong>
 (extract text as an HTML file whose filename is output.html)

-$ <strong>python -m pdflib.pdf2txt -c euc-jp samples/jo.pdf</strong>
-(extract Japanese texts in vertical writing, CMap is required)
+$ <strong>pdf2txt.py -c euc-jp samples/jo.pdf &gt; output.html</strong>
+(extract a Japanese HTML file in vertical writing, CMap is required)

-$ <strong>python -m pdflib.pdf2txt -P mypassword secret.pdf</strong>
-(extract texts from an encrypted PDF file with a password)
+$ <strong>pdf2txt.py -P mypassword -t text secret.pdf &gt; output.txt</strong>
+(extract a text from an encrypted PDF file)
 </pre></blockquote>

 <p>
@@ -184,10 +184,6 @@ By default, it extracts texts from all the pages.
 <dt> <code>-c <em>codec</em></code> 
 <dd> Specifies the output codec for non-ASCII texts.
 <p>
-<dt> <code>-w</code> 
-<dd> Split each word into a different chunk in the output.
-This makes the word spacing correctly handled.
-<p>
 <dt> <code>-t <em>type</em></code> 
 <dd> Specifies the output format. The following formats are currently supported.
 <ul>
@@ -217,13 +213,13 @@ but it's also possible to extract some meaningful contents
 <p>
 Examples:
 <blockquote><pre>
-$ <strong>python -m tools.dumppdf -a foo.pdf</strong>
+$ <strong>dumppdf.py -a foo.pdf</strong>
 (dump all the headers and contents, except stream objects)

-$ <strong>python -m tools.dumppdf -T foo.pdf</strong>
+$ <strong>dumppdf.py -T foo.pdf</strong>
 (dump the table of contents)

-$ <strong>python -m tools.dumppdf -r -i6 foo.pdf &gt; pic.jpeg</strong>
+$ <strong>dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
 (extract a JPEG image)
 </pre></blockquote>

--- a/pdfminer/cmap.py
+++ b/pdfminer/cmap.py
@@ -204,7 +204,6 @@ class CMapDB(object):

  @classmethod
  def get_cmap(klass, cmapname, strict=True):
-    import os.path
    cmapname = klass.CMAP_ALIAS.get(cmapname, cmapname)
    if cmapname in klass.cmapdb:
      cmap = klass.cmapdb[cmapname]