Community maintained fork of pdfminer - we fathom PDF
 
 
Yusuke Shinyama 02ad086f6a fixed: HTMLConverter. 2013-10-25 18:10:40 +09:00
cmaprsrc some wordings and documentations 2010-06-19 03:56:50 +00:00
docs Documentation updated. 2013-10-23 00:17:12 +09:00
pdfminer fixed: HTMLConverter. 2013-10-25 18:10:40 +09:00
samples testcase updated 2011-05-15 01:22:51 +09:00
tools API change: process_pdf -> PDFPage.get_pages 2013-10-22 18:59:16 +09:00
MANIFEST.in another minor fix 2010-12-26 19:30:46 +09:00
Makefile Documentation updated. 2013-10-23 00:17:12 +09:00
README.md Documentation updated. 2013-10-23 00:21:03 +09:00
setup.py renamed: python2 -> python. 2013-10-17 23:05:27 +09:00

README.md

PDFMiner

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on extracting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.
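As a rough illustration of obtaining text locations, the following sketch lists the bounding box of every text box in a file. It assumes the module layout that goes with the `PDFPage.get_pages` API (`pdfminer.pdfpage`, `pdfminer.pdfinterp`, `pdfminer.converter`, `pdfminer.layout`); other versions may place these classes elsewhere. The names `describe_box` and `text_boxes` are illustrative helpers, not part of PDFMiner.

```python
def describe_box(bbox, text):
    """Format one text box as 'x0,y0,x1,y1<TAB>text' (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    return "%.1f,%.1f,%.1f,%.1f\t%s" % (x0, y0, x1, y1, text.strip())


def text_boxes(path):
    """Return (page_number, bbox, text) for each text box found in the PDF."""
    # Imported here so describe_box() can be used without PDFMiner installed.
    # Assumption: these module paths match the installed PDFMiner version.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # The aggregator collects the layout-analysis result for each page.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    results = []
    fp = open(path, "rb")
    try:
        for pageno, page in enumerate(PDFPage.get_pages(fp)):
            interpreter.process_page(page)
            for obj in device.get_result():
                if isinstance(obj, LTTextBox):
                    # bbox is (x0, y0, x1, y1) in PDF points.
                    results.append((pageno + 1, obj.bbox, obj.get_text()))
    finally:
        fp.close()
    return results
```

Calling `text_boxes("samples/simple1.pdf")` and passing each `bbox` and `text` to `describe_box` would give one line per text box.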

Features

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support. (well, almost)
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.

How to Install

  • Install Python 2.4 or newer. (Python 3 is not supported.)

  • Download the source code.

  • Unpack it.

  • Run setup.py:

    $ python setup.py install

  • Run the following test to check the installation:

    $ pdf2txt.py samples/simple1.pdf
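The same check can be scripted. The sketch below is only an illustration: it assumes a Unix-like environment where `pdf2txt.py` is executable and on the `PATH`, and the helper names `pdf2txt_command` and `smoke_test` are made up for this example. It uses the `-t` (output type) and `-o` (output file) options of `pdf2txt.py`.

```python
import subprocess


def pdf2txt_command(pdf_path, output=None, fmt=None):
    """Build the argument list for pdf2txt.py (assumed to be on PATH)."""
    cmd = ["pdf2txt.py"]
    if fmt is not None:
        cmd.extend(["-t", fmt])      # output type, e.g. "text" or "html"
    if output is not None:
        cmd.extend(["-o", output])   # write to a file instead of stdout
    cmd.append(pdf_path)
    return cmd


def smoke_test(pdf_path="samples/simple1.pdf"):
    """Run the converter once and report whether it exited with status 0."""
    return subprocess.call(pdf2txt_command(pdf_path)) == 0
```

For instance, `pdf2txt_command("samples/simple1.pdf", output="simple1.html", fmt="html")` builds the command line for an HTML conversion of the sample file.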

For CJK Languages

In order to process CJK languages, do the following before running setup.py install:

$ make cmap
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
$ python setup.py install

On Windows machines that lack the make command, paste the following commands at a command prompt:

mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install
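Where neither make nor copy-and-paste is convenient, the conversion commands above can be generated from a small table. This is a sketch only: the registry names, codec mappings, and paths are copied from the Windows commands above, while `CMAP_JOBS`, `cmap_commands`, and `build_cmaps` are hypothetical names. It is meant to be run from the top of the source tree and avoids newer Python features so that it also runs on old Python 2 interpreters.

```python
import os
import subprocess
import sys

# (registry, [(CMap name, Python codec), ...]), taken from the commands above.
CMAP_JOBS = [
    ("Adobe-CNS1", [("B5", "cp950"), ("UniCNS-UTF8", "utf-8")]),
    ("Adobe-GB1", [("GBK-EUC", "cp936"), ("UniGB-UTF8", "utf-8")]),
    ("Adobe-Japan1", [("RKSJ", "cp932"), ("EUC", "euc-jp"),
                      ("UniJIS-UTF8", "utf-8")]),
    ("Adobe-Korea1", [("KSC-EUC", "euc-kr"), ("KSC-Johab", "johab"),
                      ("KSCms-UHC", "cp949"), ("UniKS-UTF8", "utf-8")]),
]


def cmap_commands(python=sys.executable):
    """Return one conv_cmap.py argument list per character collection."""
    outdir = os.path.join("pdfminer", "cmap")
    commands = []
    for registry, codecs in CMAP_JOBS:
        cmd = [python, os.path.join("tools", "conv_cmap.py")]
        for name, codec in codecs:
            cmd.extend(["-c", "%s=%s" % (name, codec)])
        # e.g. Adobe-CNS1 -> cmaprsrc/cid2code_Adobe_CNS1.txt
        source = "cid2code_%s.txt" % registry.replace("-", "_")
        cmd.extend([outdir, registry, os.path.join("cmaprsrc", source)])
        commands.append(cmd)
    return commands


def build_cmaps():
    """Create pdfminer/cmap and run each conversion; return an exit status."""
    outdir = os.path.join("pdfminer", "cmap")
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    for cmd in cmap_commands():
        status = subprocess.call(cmd)
        if status != 0:
            return status  # stop at the first failing conversion
    return 0
```

Calling `build_cmaps()` and then running `python setup.py install` mirrors the manual sequence; keeping the mappings in one table makes it easier to compare them against the commands listed above.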