Community maintained fork of pdfminer - we fathom PDF
 
 

PDFMiner

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
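As a rough illustration of obtaining text locations, the sketch below collects the bounding box of every text box in a file. It assumes the module layout of this release (`pdfminer.pdfpage`, `pdfminer.pdfinterp`, `pdfminer.converter`, `pdfminer.layout`). The helper names `format_box` and `list_text_boxes` are made up for this example and are not part of PDFMiner.

```python
def format_box(bbox, text):
    """Format one text box as 'x0,y0,x1,y1<TAB>text' with whitespace collapsed."""
    x0, y0, x1, y1 = bbox
    return "%.1f,%.1f,%.1f,%.1f\t%s" % (x0, y0, x1, y1, " ".join(text.split()))


def list_text_boxes(path):
    """Return a list of (page_number, bbox, text) for each text box in the PDF."""
    # Imported inside the function so that this module still loads (and
    # format_box stays usable) on a machine where pdfminer is not installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # The aggregator device runs layout analysis and keeps the result per page.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    boxes = []
    fp = open(path, "rb")
    try:
        for index, page in enumerate(PDFPage.get_pages(fp)):
            interpreter.process_page(page)
            layout = device.get_result()
            for obj in layout:
                if isinstance(obj, LTTextBox):
                    # bbox is (x0, y0, x1, y1) in the page's coordinate space.
                    boxes.append((index + 1, obj.bbox, obj.get_text()))
    finally:
        fp.close()
    return boxes
```

Calling `list_text_boxes("samples/simple1.pdf")` from the source root and passing each `bbox` and `text` to `format_box` should print one line per text box.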

** Features **

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support. (well, almost)
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.

** How to Install **

  • Install Python 2.4 or newer. (Python 3 is not supported.)

  • Download the source code.

  • Unpack it.

  • Run setup.py:

    $ python setup.py install

  • Do the following test:

    $ pdf2txt.py samples/simple1.pdf
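The version requirement in the first step can be checked before installing. The following is a minimal sketch; `is_supported` is a hypothetical helper, not something shipped with PDFMiner.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True for Python 2.4 or newer within the 2.x series."""
    major_minor = tuple(version_info[:2])
    # Python 3 is excluded because this release does not support it.
    return (2, 4) <= major_minor < (3, 0)
```

For example, `is_supported((2, 7, 18))` is true, while `is_supported((3, 11, 0))` and `is_supported((2, 3, 5))` are false.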

** For CJK Languages **

In order to process CJK languages, do the following before running setup.py install:

    $ make cmap
    python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
    reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
    writing 'CNS1_H.py'...
    ...
    $ python setup.py install

On Windows machines, which do not have the make command, paste the following commands at a command prompt:

    mkdir pdfminer\cmap
    python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
    python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
    python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
    python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
    python setup.py install
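The same commands could also be driven from a small Python script, which avoids depending on make or on path separators. This is only a sketch: the encoding options are copied from the Windows commands above, while `CMAP_ENCODINGS`, `build_cmap_commands`, and `main` are names invented for this example and do not exist in the PDFMiner source tree.

```python
import os
import subprocess
import sys

# "-c" encoding options per character collection, copied from the
# Windows commands listed above.
CMAP_ENCODINGS = [
    ("Adobe-CNS1", ["B5=cp950", "UniCNS-UTF8=utf-8"]),
    ("Adobe-GB1", ["GBK-EUC=cp936", "UniGB-UTF8=utf-8"]),
    ("Adobe-Japan1", ["RKSJ=cp932", "EUC=euc-jp", "UniJIS-UTF8=utf-8"]),
    ("Adobe-Korea1", ["KSC-EUC=euc-kr", "KSC-Johab=johab",
                      "KSCms-UHC=cp949", "UniKS-UTF8=utf-8"]),
]


def build_cmap_commands(python=sys.executable):
    """Return one argument list per collection, mirroring the commands above."""
    script = os.path.join("tools", "conv_cmap.py")
    outdir = os.path.join("pdfminer", "cmap")
    commands = []
    for collection, encodings in CMAP_ENCODINGS:
        # Source files are named cid2code_<collection with "_" for "-">.txt.
        source = os.path.join(
            "cmaprsrc", "cid2code_%s.txt" % collection.replace("-", "_"))
        argv = [python, script]
        for encoding in encodings:
            argv += ["-c", encoding]
        argv += [outdir, collection, source]
        commands.append(argv)
    return commands


def main():
    """Create pdfminer/cmap if needed, then run each conversion in turn."""
    outdir = os.path.join("pdfminer", "cmap")
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    for argv in build_cmap_commands():
        status = subprocess.call(argv)
        if status != 0:
            return status
    return 0
```

Calling `main()` from the top of the unpacked source tree, followed by `python setup.py install`, is intended to match the manual steps above.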