Updated documentation.

2013-10-27 00:05:26 +09:00 · 2013-10-27 00:05:26 +09:00 · 96667d286f
parent 02ad086f6a
commit 96667d286f
2 changed files with 84 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 PDFMiner
-==========
+========

 PDFMiner is a tool for extracting information from PDF documents.
 Unlike other PDF-related tools, it focuses entirely on getting 
@@ -10,7 +10,8 @@ It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
 PDF parser that can be used for other purposes than text analysis.

-**Features**
+Features
+--------

 * Written entirely in Python.
 * Parse, analyze, and convert PDF documents.
@@ -22,7 +23,8 @@ PDF parser that can be used for other purposes than text analysis.
 * Tagged contents extraction.
 * Automatic layout analysis.

-**How to Install**
+How to Install
+--------------

 * Install Python 2.4 or newer. (**Python 3 is not supported.**)
 * Download the source code.
@@ -35,7 +37,8 @@ PDF parser that can be used for other purposes than text analysis.

    $ pdf2txt.py samples/simple1.pdf

-**For CJK Languages**
+For CJK Languages
+-----------------

 In order to process CJK languages, do the following before
 running setup.py install:
@@ -56,3 +59,75 @@ paste the following commands on a command line prompt:
    python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
    python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
    python setup.py install
+
+Command Line Tools
+------------------
+
+PDFMiner comes with two handy tools:
+pdf2txt.py and dumppdf.py.
+
+pdf2txt.py
+----------
+
+pdf2txt.py extracts text contents from a PDF file.
+It extracts all the text that is to be rendered programmatically,
+i.e. text represented as ASCII or Unicode strings.
+It cannot recognize text drawn as images, which would require optical character recognition.
+It also extracts the corresponding location, font name, font size, and writing
+direction (horizontal or vertical) for each text portion.
+You need to provide a password for a protected PDF document whose access is restricted.
+You cannot extract any text from a PDF document that does not grant extraction permission.
+
+(For details, refer to the HTML document.)
+
+dumppdf.py
+----------
+
+dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. 
+This program is primarily for debugging purposes,
+but it's also possible to extract some meaningful contents (e.g. images).
+
+(For details, refer to the HTML document.)
+
+TODO
+----
+
+ * PEP-8 and PEP-257 conformance.
+ * Better documentation.
+ * Crypt stream filter support.
+
+Related Projects
+----------------
+
+ * <a href="http://pybrary.net/pyPdf/">pyPdf</a>
+ * <a href="http://www.foolabs.com/xpdf/">xpdf</a>
+ * <a href="http://www.pdfbox.org/">pdfbox</a>
+ * <a href="http://mupdf.com/">mupdf</a>
+
+Terms and Conditions
+--------------------
+
+(This is so-called MIT/X License)
+
+Copyright (c) 2004-2013  Yusuke Shinyama <yusuke at cs dot nyu dot edu>
+
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or
+sell copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
+KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
+WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
+PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
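As a reader's aid rather than part of the commit, here is a minimal sketch of how the pdf2txt.py invocation described in this change might be assembled from Python. The helper name `build_pdf2txt_command` is hypothetical. The `-E` flag is the one this commit documents in docs/index.html; `-o` (output file) and `-P` (password) are assumed from the tool's usual usage text and should be checked against the HTML document.

```python
def build_pdf2txt_command(pdf_path, output_path=None, password=None,
                          extract_dir=None):
    """Build an argument list for pdf2txt.py (hypothetical helper).

    Only flags that were requested are emitted, so the simplest call
    matches the README example: pdf2txt.py samples/simple1.pdf
    """
    cmd = ["pdf2txt.py"]
    if output_path is not None:
        # Assumed flag: write the extracted text to this file.
        cmd += ["-o", output_path]
    if password is not None:
        # Assumed flag: password for a protected PDF document.
        cmd += ["-P", password]
    if extract_dir is not None:
        # Flag documented in this commit: directory for embedded files.
        cmd += ["-E", extract_dir]
    cmd.append(pdf_path)
    return cmd


if __name__ == "__main__":
    # Print the command line instead of running it, so the sketch
    # works even where PDFMiner is not installed.
    print(" ".join(build_pdf2txt_command("samples/simple1.pdf",
                                         extract_dir="embedded")))
```

The list can be passed to `subprocess.call` once PDFMiner is installed. Building a list rather than a shell string avoids quoting problems with passwords or paths that contain spaces.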
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@

 <div align=right class=lastmod>
 <!-- hhmts start -->
-Last Modified: Tue Oct 22 15:16:49 UTC 2013
+Last Modified: Sat Oct 26 15:03:35 UTC 2013
 <!-- hhmts end -->
 </div>

@@ -286,6 +286,9 @@ including text contained in figures.
 <li> <code>loose</code> : preserve the overall location of each text block.
 </ul>
 <p>
+<dt> <code>-E <em>extractdir</em></code>
+<dd> Specifies the directory into which embedded files are extracted.
+<p>
 <dt> <code>-s <em>scale</em></code> 
 <dd> Specifies the output scale. Can be used in HTML format only.
 <p>
@@ -429,9 +432,7 @@ Incorporated a lot of patches and robust handling of broken PDFs.
 <a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
 <li> Better documentation.
 <li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
-<li> Robust error handling.
 <li> Crypt stream filter support. (More sample documents are needed!)
-<li> CCITTFax stream filter support.
 </ul>

 <h2><a name="related">Related Projects</a></h2>
@@ -447,7 +448,7 @@ Incorporated a lot of patches and robust handling of broken PDFs.
 (This is so-called MIT/X License)
 <p>
 <small>
-Copyright (c) 2004-2010  Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
+Copyright (c) 2004-2013  Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
 <p>
 Permission is hereby granted, free of charge, to any person
 obtaining a copy of this software and associated documentation