PDFMiner
========

[![Build Status](https://travis-ci.org/euske/pdfminer.svg?branch=master)](https://travis-ci.org/euske/pdfminer)

PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data. PDFMiner allows one to obtain
the exact location of text on a page, as well as
other information such as fonts or lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for purposes other than text analysis.

* Webpage: https://euske.github.io/pdfminer/
* Download (PyPI): https://pypi.python.org/pypi/pdfminer/
* Demo WebApp: http://pdf2html.tabesugi.net:8080/

Features
--------

* Written entirely in Python.
* Parse, analyze, and convert PDF documents.
* PDF-1.7 specification support. (well, almost)
* CJK languages and vertical writing scripts support.
* Various font types (Type1, TrueType, Type3, and CID) support.
* Basic encryption (RC4) support.
* Outline (TOC) extraction.
* Tagged contents extraction.
* Automatic layout analysis.

How to Install
--------------

* Install Python 2.6 or newer. (**For Python 3 support, have a look at [pdfminer.six](https://github.com/goulu/pdfminer)**).
* Download the source code.
* Unpack it.
* Run `setup.py`:

      $ python setup.py install

* Run the following test:

      $ pdf2txt.py samples/simple1.pdf

For CJK Languages
-----------------

In order to process CJK languages, do the following before
running setup.py install:

    $ make cmap
    python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
    reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
    writing 'CNS1_H.py'...
    ...
    $ python setup.py install

On Windows machines that don't have the `make` command,
paste the following commands at a command prompt:

    mkdir pdfminer\cmap
    python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
    python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
    python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
    python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
    python setup.py install

Command Line Tools
------------------

PDFMiner comes with two handy tools:
pdf2txt.py and dumppdf.py.

**pdf2txt.py**

pdf2txt.py extracts text contents from a PDF file.
It extracts all the text that is rendered programmatically,
i.e. text represented as ASCII or Unicode strings.
It cannot recognize text drawn as images, which would require optical character recognition.
It also extracts the corresponding locations, font names, font sizes, and writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for protected PDF documents whose access is restricted.
You cannot extract any text from a PDF document that does not have extraction permission.

(For details, refer to the HTML document.)

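The same information that pdf2txt.py reports can also be reached from Python. The following is a minimal sketch, not the tool's actual implementation: it assumes the `PDFPageAggregator`, `LAParams`, and `LTTextBox` classes from PDFMiner's layout analysis, and `iter_text_boxes` is a hypothetical helper name introduced here for illustration.

```python
def iter_text_boxes(path, password=''):
    """Yield (page_number, bbox, text) for every text box in a PDF file."""
    # Imported inside the function so that loading this sketch does not
    # by itself require PDFMiner to be installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # The aggregator collects the layout objects of each processed page.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        pages = PDFPage.get_pages(fp, password=password)
        for pageno, page in enumerate(pages, 1):
            interpreter.process_page(page)
            layout = device.get_result()
            for obj in layout:
                # bbox is (x0, y0, x1, y1) in page coordinates.
                if isinstance(obj, LTTextBox):
                    yield pageno, obj.bbox, obj.get_text()
```

A caller might loop over the result, for example `for pageno, bbox, text in iter_text_boxes('samples/simple1.pdf'): print((pageno, bbox, text))`. Text drawn as images does not appear, for the reason given above.
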
**dumppdf.py**

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format.
This program is primarily for debugging purposes,
but it's also possible to extract some meaningful contents (e.g. images).

(For details, refer to the HTML document.)

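To give a rough idea of what "internal contents" means, the sketch below walks the cross-reference tables of a document and yields each object it can resolve. It is an illustration only and does not reproduce dumppdf.py's pseudo-XML output; `iter_objects` is a hypothetical name, and the use of `PDFDocument.xrefs`, `get_objids()`, `getobj()`, and `PDFObjectNotFound` is an assumption about the library's internals.

```python
def iter_objects(path, password=''):
    """Yield (object_id, object) for each resolvable object in a PDF file."""
    # Imported inside the function so that loading this sketch does not
    # by itself require PDFMiner to be installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import PDFObjectNotFound

    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser, password)
        seen = set()
        for xref in doc.xrefs:
            for objid in xref.get_objids():
                # The same id can be listed in more than one xref table.
                if objid in seen:
                    continue
                seen.add(objid)
                try:
                    obj = doc.getobj(objid)
                except PDFObjectNotFound:
                    # Skip ids that are listed but cannot be located.
                    continue
                yield objid, obj
```

Printing `type(obj)` for each pair is usually enough to see which objects are dictionaries, streams, or plain values.
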
API Changes
-----------

As of November 2013, a few changes have been made to the PDFMiner API
as it existed prior to October 2013. These changes are the result of
code restructuring. Here is a list of the changes:

* The PDFDocument class has moved to pdfdocument.py.
* The PDFDocument class now takes a PDFParser object as an argument.
  PDFDocument.set_parser() and PDFParser.set_document() have been removed.
* The PDFPage class has moved to pdfpage.py.
* The process_pdf function is now implemented as the class method PDFPage.get_pages.

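The changes listed above can be sketched as follows. This is a hedged illustration rather than official migration guidance: `count_pages` is a hypothetical helper, and the sketch assumes that `PDFParser` lives in `pdfminer.pdfparser` and that `PDFPage.create_pages` accepts an already constructed document.

```python
def count_pages(path, password=''):
    """Return the number of pages in a PDF file using the restructured API."""
    # Imported inside the function so that loading this sketch does not
    # by itself require PDFMiner to be installed.
    from pdfminer.pdfdocument import PDFDocument  # moved to pdfdocument.py
    from pdfminer.pdfpage import PDFPage          # moved to pdfpage.py
    from pdfminer.pdfparser import PDFParser

    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        # The parser is passed to the constructor; there is no separate
        # set_parser() / set_document() step any more.
        document = PDFDocument(parser, password)
        return sum(1 for _ in PDFPage.create_pages(document))
```

When only the pages are needed, `PDFPage.get_pages(fp)` takes the open file directly and handles the parser and document setup itself.
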
TODO
----

* Replace STRICT variable with something better.
* Use logging module instead of sys.stderr.
* Proper test cases.
* PEP-8 and PEP-257 conformance.
* Better documentation.
* Crypt stream filter support.

Related Projects
----------------

* <a href="http://pybrary.net/pyPdf/">pyPdf</a>
* <a href="http://www.foolabs.com/xpdf/">xpdf</a>
* <a href="http://pdfbox.apache.org/">pdfbox</a>
* <a href="http://mupdf.com/">mupdf</a>

Terms and Conditions
--------------------

(This is the so-called MIT/X License.)

Copyright (c) 2004-2016 Yusuke Shinyama <yusuke at shinyama dot jp>

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.