PDFMiner
==========

PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data. PDFMiner allows one to obtain
the exact location of text in a page, as well as
other information such as fonts or lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for purposes other than text analysis.

**Features**

* Written entirely in Python.
* Parse, analyze, and convert PDF documents.
* PDF-1.7 specification support (nearly complete).
* CJK languages and vertical writing scripts support.
* Various font types (Type1, TrueType, Type3, and CID) support.
* Basic encryption (RC4) support.
* Outline (TOC) extraction.
* Tagged contents extraction.
* Automatic layout analysis.

**How to Install**

* Install Python 2.4 or newer. (**Python 3 is not supported.**)
* Download the source code.
* Unpack it.
* Run `setup.py`:

      $ python setup.py install

* Do the following test:

      $ pdf2txt.py samples/simple1.pdf

**For CJK Languages**

In order to process CJK languages, do the following before
running `setup.py install`:

    $ make cmap
    python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
    reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
    writing 'CNS1_H.py'...
    ...
    $ python setup.py install

On Windows machines that lack the `make` command,
paste the following commands at a command prompt:

    mkdir pdfminer\cmap
    python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
    python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
    python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
    python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
    python setup.py install
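
The four `conv_cmap.py` calls above can also be driven from Python, which avoids retyping them on systems without `make`. The following is only a minimal sketch and is not part of PDFMiner: the names `CMAP_JOBS`, `build_commands`, and `main` are made up for illustration, the arguments are copied from the Windows commands above, and the script is assumed to run from the top of the source tree.

```python
import os
import subprocess
import sys

# Registry names and "-c" encoding overrides, copied from the Windows
# commands in this README. The table name is hypothetical.
CMAP_JOBS = [
    ("Adobe-CNS1", ["B5=cp950", "UniCNS-UTF8=utf-8"]),
    ("Adobe-GB1", ["GBK-EUC=cp936", "UniGB-UTF8=utf-8"]),
    ("Adobe-Japan1", ["RKSJ=cp932", "EUC=euc-jp", "UniJIS-UTF8=utf-8"]),
    ("Adobe-Korea1", ["KSC-EUC=euc-kr", "KSC-Johab=johab",
                      "KSCms-UHC=cp949", "UniKS-UTF8=utf-8"]),
]


def build_commands(python=sys.executable):
    """Return one argument list per conv_cmap.py invocation."""
    outdir = os.path.join("pdfminer", "cmap")
    script = os.path.join("tools", "conv_cmap.py")
    commands = []
    for registry, encodings in CMAP_JOBS:
        argv = [python, script]
        for enc in encodings:
            argv.extend(["-c", enc])
        # The cid2code file names use "_" where the registry name has "-".
        source = os.path.join(
            "cmaprsrc", "cid2code_%s.txt" % registry.replace("-", "_"))
        argv.extend([outdir, registry, source])
        commands.append(argv)
    return commands


def main():
    """Create pdfminer/cmap and run each conversion, stopping on failure."""
    outdir = os.path.join("pdfminer", "cmap")
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    for argv in build_commands():
        print(" ".join(argv))
        status = subprocess.call(argv)
        if status != 0:
            return status
    return 0
```

To use it as a script, add `sys.exit(main())` at the bottom of the file, run it from the source directory, and then run `python setup.py install` as before.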