Updated documentation.
parent
02ad086f6a
commit
96667d286f
83
README.md
|
@@ -1,5 +1,5 @@
|
|||
PDFMiner
|
||||
==========
|
||||
========
|
||||
|
||||
PDFMiner is a tool for extracting information from PDF documents.
|
||||
Unlike other PDF-related tools, it focuses entirely on getting
|
||||
|
@@ -10,7 +10,8 @@ It includes a PDF converter that can transform PDF files
|
|||
into other text formats (such as HTML). It has an extensible
|
||||
PDF parser that can be used for purposes other than text analysis.
|
||||
|
||||
**Features**
|
||||
Features
|
||||
--------
|
||||
|
||||
* Written entirely in Python.
|
||||
* Parse, analyze, and convert PDF documents.
|
||||
|
@@ -22,7 +23,8 @@ PDF parser that can be used for other purposes than text analysis.
|
|||
* Tagged contents extraction.
|
||||
* Automatic layout analysis.
|
||||
|
||||
**How to Install**
|
||||
How to Install
|
||||
--------------
|
||||
|
||||
* Install Python 2.4 or newer. (**Python 3 is not supported.**)
|
||||
* Download the source code.
|
||||
|
@@ -35,7 +37,8 @@ PDF parser that can be used for other purposes than text analysis.
|
|||
|
||||
$ pdf2txt.py samples/simple1.pdf
|
||||
|
||||
**For CJK Languages**
|
||||
For CJK Languages
|
||||
-----------------
|
||||
|
||||
In order to process CJK languages, do the following before
|
||||
running setup.py install:
|
||||
|
@@ -56,3 +59,75 @@ paste the following commands on a command line prompt:
|
|||
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
|
||||
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
|
||||
python setup.py install
|
||||
|
||||
Command Line Tools
|
||||
------------------
|
||||
|
||||
PDFMiner comes with two handy tools:
|
||||
pdf2txt.py and dumppdf.py.
|
||||
|
||||
pdf2txt.py
|
||||
----------
|
||||
|
||||
pdf2txt.py extracts text contents from a PDF file.
|
||||
It extracts all the text that is to be rendered programmatically,
|
||||
i.e. text represented as ASCII or Unicode strings.
|
||||
It cannot recognize text drawn as images, which would require optical character recognition.
|
||||
It also extracts the corresponding locations, font names, font sizes, and writing
|
||||
direction (horizontal or vertical) for each text portion.
|
||||
You need to provide a password for a protected PDF document when its access is restricted.
|
||||
You cannot extract any text from a PDF document that does not have extraction permission.
|
||||
|
||||
(For details, refer to the HTML document.)
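
As a minimal sketch of how pdf2txt.py might be driven from a script, the helper below builds a command line and runs it with `subprocess`. The function names are hypothetical, and the `-P` (password) and `-o` (output file) options are assumed from the tool's usage message rather than stated in this README, so check them against the HTML document before relying on them.

```python
import subprocess


def build_pdf2txt_command(pdf_path, password=None, output_path=None):
    """Return the argument list for a pdf2txt.py invocation (hypothetical helper)."""
    command = ["pdf2txt.py"]
    if password is not None:
        # Assumed option: -P supplies the password for a protected document.
        command += ["-P", password]
    if output_path is not None:
        # Assumed option: -o writes the extracted text to a file instead of stdout.
        command += ["-o", output_path]
    command.append(pdf_path)
    return command


def extract_text(pdf_path, password=None, output_path=None):
    """Run pdf2txt.py and return its exit status. Requires PDFMiner on the PATH."""
    return subprocess.call(build_pdf2txt_command(pdf_path, password, output_path))
```

Calling `build_pdf2txt_command("samples/simple1.pdf")` yields the same invocation as the `$ pdf2txt.py samples/simple1.pdf` example in the install section. Keeping the command construction separate from the call makes the arguments easy to inspect before anything is executed.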
|
||||
|
||||
dumppdf.py
|
||||
----------
|
||||
|
||||
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format.
|
||||
This program is primarily for debugging purposes,
|
||||
but it's also possible to extract some meaningful content (e.g. images).
|
||||
|
||||
(For details, refer to the HTML document.)
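
For debugging sessions it can be convenient to capture the dump in a script. The sketch below is one way to do that, under stated assumptions: the function names are hypothetical, and the `-a` option (dump all objects) is taken from the tool's usage message, not from this README.

```python
import subprocess


def build_dumppdf_command(pdf_path, dump_all=True):
    """Return the argument list for a dumppdf.py invocation (hypothetical helper)."""
    command = ["dumppdf.py"]
    if dump_all:
        # Assumed option: -a asks dumppdf.py to dump every object in the file.
        command.append("-a")
    command.append(pdf_path)
    return command


def dump_pdf(pdf_path):
    """Run dumppdf.py and return whatever it wrote to standard output.

    Requires PDFMiner on the PATH. The result is a byte string, so decode it
    before treating it as text.
    """
    process = subprocess.Popen(build_dumppdf_command(pdf_path),
                               stdout=subprocess.PIPE)
    output, _ = process.communicate()
    return output
```

The output is pseudo-XML, so it is not guaranteed to load in a strict XML parser. Plain string searching is the safer starting point.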
|
||||
|
||||
TODO
|
||||
----
|
||||
|
||||
* PEP-8 and PEP-257 conformance.
|
||||
* Better documentation.
|
||||
* Crypt stream filter support.
|
||||
|
||||
Related Projects
|
||||
----------------
|
||||
|
||||
* <a href="http://pybrary.net/pyPdf/">pyPdf</a>
|
||||
* <a href="http://www.foolabs.com/xpdf/">xpdf</a>
|
||||
* <a href="http://www.pdfbox.org/">pdfbox</a>
|
||||
* <a href="http://mupdf.com/">mupdf</a>
|
||||
|
||||
Terms and Conditions
|
||||
--------------------
|
||||
|
||||
(This is the so-called MIT/X License.)
|
||||
|
||||
Copyright (c) 2004-2013 Yusuke Shinyama <yusuke at cs dot nyu dot edu>
|
||||
|
||||
Permission is hereby granted, free of charge, to any person
|
||||
obtaining a copy of this software and associated documentation
|
||||
files (the "Software"), to deal in the Software without
|
||||
restriction, including without limitation the rights to use,
|
||||
copy, modify, merge, publish, distribute, sublicense, and/or
|
||||
sell copies of the Software, and to permit persons to whom the
|
||||
Software is furnished to do so, subject to the following
|
||||
conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be
|
||||
included in all copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
|
||||
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
|
||||
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
|
||||
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
|
||||
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
|
||||
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
|
||||
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
|
|
@@ -9,7 +9,7 @@
|
|||
|
||||
<div align=right class=lastmod>
|
||||
<!-- hhmts start -->
|
||||
Last Modified: Tue Oct 22 15:16:49 UTC 2013
|
||||
Last Modified: Sat Oct 26 15:03:35 UTC 2013
|
||||
<!-- hhmts end -->
|
||||
</div>
|
||||
|
||||
|
@@ -286,6 +286,9 @@ including text contained in figures.
|
|||
<li> <code>loose</code> : preserve the overall location of each text block.
|
||||
</ul>
|
||||
<p>
|
||||
<dt> <code>-E <em>extractdir</em></code>
|
||||
<dd> Specifies the extraction directory of embedded files.
|
||||
<p>
|
||||
<dt> <code>-s <em>scale</em></code>
|
||||
<dd> Specifies the output scale. Can be used in HTML format only.
|
||||
<p>
|
||||
|
@@ -429,9 +432,7 @@ Incorporated a lot of patches and robust handling of broken PDFs.
|
|||
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
|
||||
<li> Better documentation.
|
||||
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
|
||||
<li> Robust error handling.
|
||||
<li> Crypt stream filter support. (More sample documents are needed!)
|
||||
<li> CCITTFax stream filter support.
|
||||
</ul>
|
||||
|
||||
<h2><a name="related">Related Projects</a></h2>
|
||||
|
@@ -447,7 +448,7 @@ Incorporated a lot of patches and robust handling of broken PDFs.
|
|||
(This is so-called MIT/X License)
|
||||
<p>
|
||||
<small>
|
||||
Copyright (c) 2004-2010 Yusuke Shinyama <yusuke at cs dot nyu dot edu>
|
||||
Copyright (c) 2004-2013 Yusuke Shinyama <yusuke at cs dot nyu dot edu>
|
||||
<p>
|
||||
Permission is hereby granted, free of charge, to any person
|
||||
obtaining a copy of this software and associated documentation
|
||||
|
|