diff --git a/README.md b/README.md
index 45ccb6c..ab0eeed 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 PDFMiner
-==========
+========

 PDFMiner is a tool for extracting information from PDF documents.
 Unlike other PDF-related tools, it focuses entirely on getting
@@ -10,7 +10,8 @@ It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
 PDF parser that can be used for other purposes than text analysis.

-**Features**
+Features
+--------

  * Written entirely in Python.
  * Parse, analyze, and convert PDF documents.
@@ -22,7 +23,8 @@ PDF parser that can be used for other purposes than text analysis.
  * Tagged contents extraction.
  * Automatic layout analysis.

-**How to Install**
+How to Install
+--------------

  * Install Python 2.4 or newer. (**Python 3 is not supported.**)
  * Download the source code.
@@ -35,7 +37,8 @@ PDF parser that can be used for other purposes than text analysis.

     $ pdf2txt.py samples/simple1.pdf

-**For CJK Languages**
+For CJK Languages
+-----------------

 In order to process CJK languages, do the following before
 running setup.py install:
@@ -56,3 +59,75 @@ paste the following commands on a command line prompt:
     python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
     python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
     python setup.py install
+
+Command Line Tools
+------------------
+
+PDFMiner comes with two handy tools:
+pdf2txt.py and dumppdf.py.
+
+pdf2txt.py
+----------
+
+pdf2txt.py extracts text contents from a PDF file.
+It extracts all the text that are to be rendered programmatically,
+i.e. text represented as ASCII or Unicode strings.
+It cannot recognize text drawn as images that would require optical character recognition.
+It also extracts the corresponding locations, font names, font sizes, writing
+direction (horizontal or vertical) for each text portion.
+You need to provide a password for protected PDF documents when its access is restricted.
+You cannot extract any text from a PDF document which does not have extraction permission.
+
+(For details, refer to the html document.)
+
+dumppdf.py
+----------
+
+dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format.
+This program is primarily for debugging purposes,
+but it's also possible to extract some meaningful contents (e.g. images).
+
+(For details, refer to the html document.)
+
+TODO
+----
+
+ * PEP-8 and PEP-257 conformance.
+ * Better documentation.
+ * Crypt stream filter support.
+
+Related Projects
+----------------
+
+ * pyPdf
+ * xpdf
+ * pdfbox
+ * mupdf
+
+Terms and Conditions
+--------------------
+
+(This is so-called MIT/X License)
+
+Copyright (c) 2004-2013 Yusuke Shinyama
+
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or
+sell copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
+KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
+WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
+PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/docs/index.html b/docs/index.html
index 9ce96e1..0039ffd 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-Last Modified: Tue Oct 22 15:16:49 UTC 2013
+Last Modified: Sat Oct 26 15:03:35 UTC 2013
@@ -286,6 +286,9 @@ including text contained in figures.
  • loose : preserve the overall location of each text block.

    +-E extractdir
    +Specifies the extraction directory of embedded files.
    +

    -s scale
    Specifies the output scale. Can be used in HTML format only.

    @@ -429,9 +432,7 @@ Incorporated a lot of patches and robust handling of broken PDFs.
     PEP-257 conformance.

  • Better documentation.
  • Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
- • Robust error handling.
  • Crypt stream filter support. (More sample documents are needed!)
- • CCITTFax stream filter support.

    Related Projects

    @@ -447,7 +448,7 @@ Incorporated a lot of patches and robust handling of broken PDFs.
     (This is so-called MIT/X License)

    -Copyright (c) 2004-2010 Yusuke Shinyama <yusuke at cs dot nyu dot edu>
    +Copyright (c) 2004-2013 Yusuke Shinyama <yusuke at cs dot nyu dot edu>

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation