Create sphinx documentation for Read the Docs (#329)

Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building the documentation and for the example code in the documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
pull/335/head
Pieter Marsman 2019-11-07 21:12:34 +01:00 committed by GitHub
parent 40aa2533c9
commit bc034c8e59
40 changed files with 879 additions and 1650 deletions

View File

@ -9,4 +9,4 @@ python:
 install:
   - pip install tox-travis
 script:
-  - tox
+  - tox -r

View File

@ -13,6 +13,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ### Added
 - Simple wrapper to easily extract text from a PDF file [#330](https://github.com/pdfminer/pdfminer.six/pull/330)
 - Support for extracting JBIG2 encoded images ([#311](https://github.com/pdfminer/pdfminer.six/pull/311) and [#46](https://github.com/pdfminer/pdfminer.six/pull/46))
+- Sphinx documentation that is published on
+  [Read the Docs](https://pdfminersix.readthedocs.io/)
+  ([#329](https://github.com/pdfminer/pdfminer.six/pull/329))

 ### Fixed
 - Unhandled AssertionError when dumping pdf containing reference to object id 0

View File

@ -1,21 +1,22 @@
-PDFMiner.six
+pdfminer.six
 ============
-PDFMiner.six is a fork of PDFMiner using six for Python 2+3 compatibility
+[![Build Status](https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master)](https://travis-ci.org/pdfminer/pdfminer.six)
+[![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)
+[![gitter](https://badges.gitter.im/pdfminer-six/Lobby.svg)](https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium)
-[![Build Status](https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master)](https://travis-ci.org/pdfminer/pdfminer.six) [![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)
+Pdfminer.six is a community maintained fork of the original PDFMiner. It is a
+tool for extracting information from PDF documents.
-PDFMiner is a tool for extracting information from PDF documents.
 Unlike other PDF-related tools, it focuses entirely on getting
-and analyzing text data. PDFMiner allows one to obtain
+and analyzing text data. Pdfminer.six allows one to obtain
 the exact location of text in a page, as well as
 other information such as fonts or lines.
 It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
 PDF parser that can be used for other purposes than text analysis.
-* Webpage: https://github.com/pdfminer/
-* Download (PyPI): https://pypi.python.org/pypi/pdfminer.six/
+Check out the full documentation on
+[Read the Docs](https://pdfminersix.readthedocs.io).
 Features
@ -33,53 +34,20 @@ Features
 * Automatic layout analysis.
-How to Install
---------------
+How to use
+----------
-* Install Python 2.7 or newer.
-* Install
+* Install Python 2.7 or newer. Note that Python 2 support is dropped in
+  January, 2020.
 `pip install pdfminer.six`
-* Run the following test:
+* Use the command-line interface to extract text from a pdf:
-`pdf2txt.py samples/simple1.pdf`
+`python pdf2txt.py samples/simple1.pdf`
+* Check out more examples and documentation on
+  [Read the Docs](https://pdfminersix.readthedocs.io).
-Command Line Tools
-------------------
-PDFMiner comes with two handy tools:
-pdf2txt.py and dumppdf.py.
-**pdf2txt.py**
-pdf2txt.py extracts text contents from a PDF file.
-It extracts all the text that are to be rendered programmatically,
-i.e. text represented as ASCII or Unicode strings.
-It cannot recognize text drawn as images that would require optical character recognition.
-It also extracts the corresponding locations, font names, font sizes, writing
-direction (horizontal or vertical) for each text portion.
-You need to provide a password for protected PDF documents when its access is restricted.
-You cannot extract any text from a PDF document which does not have extraction permission.
-(For details, refer to /docs/index.html.)
-**dumppdf.py**
-dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format.
-This program is primarily for debugging purposes,
-but it's also possible to extract some meaningful contents (e.g. images).
-(For details, refer to /docs/index.html.)
-TODO
-----
-* PEP-8 and PEP-257 conformance.
-* Better documentation.
-* Performance improvements.
 Contributing
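
The new "How to use" section points readers at the command-line interface; the simple wrapper added in #330 (listed in the changelog above) exposes the same extraction from Python. A minimal sketch, assuming the extract_text helper that the new docs/source/api/highlevel.rst documents:

# Sketch only: assumes pdfminer.high_level.extract_text as documented in this PR.
from pdfminer.high_level import extract_text

# samples/simple1.pdf is the sample file used in the README example.
text = extract_text('samples/simple1.pdf')
print(text)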

1
docs/.gitignore vendored Normal file
View File

@ -0,0 +1 @@
build/

20
docs/Makefile Normal file
View File

@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

View File

@ -1,225 +0,0 @@
%TGIF 4.1.45-QPL
state(0,37,100.000,0,0,0,16,1,9,1,1,2,0,1,0,1,1,'NewCenturySchlbk-Bold',1,103680,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
% %W%
%
unit("1 pixel/pixel").
color_info(19,65535,0,[
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
"red", 65535, 0, 0, 65535, 0, 0, 1,
"green", 0, 65535, 0, 0, 65535, 0, 1,
"blue", 0, 0, 65535, 0, 0, 65535, 1,
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
"black", 0, 0, 0, 0, 0, 0, 1,
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
]).
script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
page(1,"",1,'').
text('black',90,95,1,1,1,66,20,0,15,5,0,0,0,0,2,66,20,0,0,"",0,0,0,0,110,'',[
minilines(66,20,0,0,1,0,0,[
mini_line(66,15,5,0,0,0,[
str_block(0,66,15,5,0,-1,0,0,0,[
str_seg('black','Courier-Bold',1,103680,66,15,5,0,-1,0,0,0,0,0,
"U+30FC")])
])
])]).
text('black',100,285,1,1,1,66,20,3,15,5,0,0,0,0,2,66,20,0,0,"",0,0,0,0,300,'',[
minilines(66,20,0,0,1,0,0,[
mini_line(66,15,5,0,0,0,[
str_block(0,66,15,5,0,-2,0,0,0,[
str_seg('black','Courier-Bold',1,103680,66,15,5,0,-2,0,0,0,0,0,
"U+5199")])
])
])]).
text('black',400,38,2,1,1,119,30,5,12,3,0,0,0,0,2,119,30,0,0,"",0,0,0,0,50,'',[
minilines(119,30,0,0,1,0,0,[
mini_line(83,12,3,0,0,0,[
str_block(0,83,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
"Adobe-Japan1")])
]),
mini_line(119,12,3,0,0,0,[
str_block(0,119,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,119,12,3,0,-1,0,0,0,0,0,
"CID:660 (horizontal)")])
])
])]).
text('black',400,118,2,1,1,114,30,8,12,3,0,0,0,0,2,114,30,0,0,"",0,0,0,0,130,'',[
minilines(114,30,0,0,1,0,0,[
mini_line(83,12,3,0,0,0,[
str_block(0,83,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
"Adobe-Japan1")])
]),
mini_line(114,12,3,0,0,0,[
str_block(0,114,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,114,12,3,0,-1,0,0,0,0,0,
"CID:7891 (vertical)")])
])
])]).
text('black',400,238,2,1,1,125,30,15,12,3,0,0,0,0,2,125,30,0,0,"",0,0,0,0,250,'',[
minilines(125,30,0,0,1,0,0,[
mini_line(83,12,3,0,0,0,[
str_block(0,83,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
"Adobe-Japan1")])
]),
mini_line(125,12,3,0,0,0,[
str_block(0,125,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,125,12,3,0,-1,0,0,0,0,0,
"CID:2296 (Japanese)")])
])
])]).
text('black',400,318,2,1,1,115,30,16,12,3,0,0,0,0,2,115,30,0,0,"",0,0,0,0,330,'',[
minilines(115,30,0,0,1,0,0,[
mini_line(67,12,3,0,0,0,[
str_block(0,67,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,67,12,3,0,-3,0,0,0,0,0,
"Adobe-GB1")])
]),
mini_line(115,12,3,0,0,0,[
str_block(0,115,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,115,12,3,0,-1,0,0,0,0,0,
"CID:3967 (Chinese)")])
])
])]).
text('black',200,84,2,1,1,116,38,20,16,3,0,0,0,0,2,116,38,0,0,"",0,0,0,0,100,'',[
minilines(116,38,0,0,1,0,0,[
mini_line(70,16,3,0,0,0,[
str_block(0,70,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,70,16,3,0,-1,0,0,0,0,0,
"Japanese")])
]),
mini_line(116,16,3,0,0,0,[
str_block(0,116,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,116,16,3,0,-1,0,0,0,0,0,
"long-vowel sign")])
])
])]).
oval('black','',30,70,280,140,0,1,1,49,0,0,0,0,0,'1',0,[
]).
oval('black','',30,260,280,330,0,1,1,51,0,0,0,0,0,'1',0,[
]).
text('black',200,274,2,1,1,85,38,53,16,3,0,0,0,0,2,85,38,0,0,"",0,0,0,0,290,'',[
minilines(85,38,0,0,1,0,0,[
mini_line(61,16,3,0,0,0,[
str_block(0,61,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,61,16,3,0,-1,0,0,0,0,0,
"Chinese")])
]),
mini_line(85,16,3,0,0,0,[
str_block(0,85,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,85,16,3,0,-1,0,0,0,0,0,
"letter \"sha\"")])
])
])]).
box('black','',330,30,560,80,0,1,1,57,0,0,0,0,0,'1',0,[
]).
box('black','',330,110,560,160,0,1,1,59,0,0,0,0,0,'1',0,[
]).
box('black','',330,230,560,280,0,1,1,60,0,0,0,0,0,'1',0,[
]).
box('black','',330,310,560,360,0,1,1,61,0,0,0,0,0,'1',0,[
]).
group([
poly('black','',4,[
506,246,501,235,541,235,536,246],0,2,1,68,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',5,[
519,238,516,252,529,252,524,275,516,272],0,2,1,69,0,0,0,0,0,0,0,'2',0,0,
"00","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
501,261,541,261],0,2,1,70,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
519,244,529,244],0,2,1,71,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
76,0,0,[
]).
group([
poly('black','',3,[
519,119,524,127,524,152],0,2,1,67,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
78,0,0,[
]).
group([
poly('black','',3,[
540,57,509,57,501,49],0,2,1,66,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
80,0,0,[
]).
group([
poly('black','',4,[
506,326,501,315,541,315,536,326],0,2,1,90,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',5,[
519,318,515,332,531,332,526,355,519,352],0,2,1,89,0,0,0,0,0,0,0,'2',0,0,
"00","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
501,341,526,341],0,2,1,88,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
519,324,529,324],0,2,1,87,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
134,0,0,[
]).
poly('black','',2,[
270,90,320,70],1,3,1,158,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
poly('black','',2,[
280,110,320,130],1,3,1,159,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
poly('black','',2,[
270,280,310,250],1,3,1,160,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
poly('black','',2,[
270,300,310,330],1,3,1,161,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).

Binary file not shown (deleted image, 2.6 KiB)

View File

@ -1,427 +0,0 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PDFMiner</title>
</head>
<body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Wed Jun 25 10:27:52 UTC 2014
<!-- hhmts end -->
</div>
<h1>PDFMiner</h1>
<p>
Python PDF parser and analyzer
<p>
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
&nbsp;
<a href="#changes">Recent Changes</a>
&nbsp;
<a href="programming.html">PDFMiner API</a>
<ul>
<li> <a href="#intro">What's It?</a>
<li> <a href="#download">Download</a>
<li> <a href="#wheretoask">Where to Ask</a>
<li> <a href="#install">How to Install</a>
<ul>
<li> <a href="#cmap">CJK languages support</a>
</ul>
<li> <a href="#tools">Command Line Tools</a>
<ul>
<li> <a href="#pdf2txt">pdf2txt.py</a>
<li> <a href="#dumppdf">dumppdf.py</a>
<li> <a href="programming.html">PDFMiner API</a>
</ul>
<li> <a href="#changes">Changes</a>
<li> <a href="#todo">TODO</a>
<li> <a href="#related">Related Projects</a>
<li> <a href="#license">Terms and Conditions</a>
</ul>
<h2><a name="intro">What's It?</a></h2>
<p>
PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data. PDFMiner allows one to obtain
the exact location of text in a page, as well as
other information such as fonts or lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for other purposes than text analysis.
<p>
<h3>Features</h3>
<ul>
<li> Written entirely in Python. (for version 2.6 or newer)
<li> Parse, analyze, and convert PDF documents.
<li> PDF-1.7 specification support. (well, almost)
<li> CJK languages and vertical writing scripts support.
<li> Various font types (Type1, TrueType, Type3, and CID) support.
<li> Basic encryption (RC4) support.
<li> PDF to HTML conversion.
<li> Outline (TOC) extraction.
<li> Tagged contents extraction.
<li> Reconstruct the original layout by grouping text chunks.
</ul>
<p>
PDFMiner is about 20 times slower than
other C/C++-based counterparts such as XPdf.
<P>
<strong>Online Demo:</strong> (pdf -&gt; html conversion webapp)<br>
<a href="http://pdf2html.tabesugi.net:8080/">
http://pdf2html.tabesugi.net:8080/
</a>
<h3><a name="download">Download</a></h3>
<p>
<strong>Source distribution:</strong><br>
<a href="http://pypi.python.org/pypi/pdfminer_six/">
http://pypi.python.org/pypi/pdfminer_six/
</a>
<P>
<strong>github:</strong><br>
<a href="https://github.com/goulu/pdfminer/">
https://github.com/goulu/pdfminer/
</a>
<h3><a name="wheretoask">Where to Ask</a></h3>
<p>
<p>
<strong>Questions and comments:</strong><br>
<a href="http://groups.google.com/group/pdfminer-users/">
http://groups.google.com/group/pdfminer-users/
</a>
<h2><a name="install">How to Install</a></h2>
<ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.6 or newer.
<li> Download the <a href="#source">PDFMiner source</a>.
<li> Unpack it.
<li> Run <code>setup.py</code> to install:<br>
<blockquote><pre>
# <strong>python setup.py install</strong>
</pre></blockquote>
<li> Do the following test:<br>
<blockquote><pre>
$ <strong>pdf2txt.py samples/simple1.pdf</strong>
Hello
World
Hello
World
H e l l o
W o r l d
H e l l o
W o r l d
</pre></blockquote>
<li> Done!
</ol>
<h3><a name="cmap">For CJK languages</a></h3>
<p>
In order to process CJK languages, you need an additional step to take
during installation:
<blockquote><pre>
# <strong>make cmap</strong>
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
<em>(this may take several minutes)</em>
# <strong>python setup.py install</strong>
</pre></blockquote>
<p>
On Windows machines which don't have <code>make</code> command,
paste the following commands on a command line prompt:
<blockquote><pre>
<strong>mkdir pdfminer\cmap</strong>
<strong>python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt</strong>
<strong>python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt</strong>
<strong>python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt</strong>
<strong>python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt</strong>
<strong>python setup.py install</strong>
</pre></blockquote>
<h2><a name="tools">Command Line Tools</a></h2>
<p>
PDFMiner comes with two handy tools:
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.
<h3><a name="pdf2txt">pdf2txt.py</a></h3>
<p>
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the text that are to be rendered programmatically,
i.e. text represented as ASCII or Unicode strings.
It cannot recognize text drawn as images that would require optical character recognition.
It also extracts the corresponding locations, font names, font sizes, writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for protected PDF documents when its access is restricted.
You cannot extract any text from a PDF document which does not have extraction permission.
<p>
<strong>Note:</strong>
Not all characters in a PDF can be safely converted to Unicode.
<h4>Examples</h4>
<blockquote><pre>
$ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file whose filename is output.html)
$ <strong>pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf</strong>
(extract a Japanese HTML file in vertical writing, CMap is required)
$ <strong>pdf2txt.py -P mypassword -o output.txt secret.pdf</strong>
(extract a text from an encrypted PDF file)
</pre></blockquote>
<h4>Options</h4>
<dl>
<dt> <code>-o <em>filename</em></code>
<dd> Specifies the output file name.
By default, it prints the extracted contents to stdout in text format.
<p>
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
<dd> Specifies the comma-separated list of the page numbers to be extracted.
Page numbers start at one.
By default, it extracts text from all the pages.
<p>
<dt> <code>-c <em>codec</em></code>
<dd> Specifies the output codec.
<p>
<dt> <code>-t <em>type</em></code>
<dd> Specifies the output format. The following formats are currently supported.
<ul>
<li> <code>text</code> : TEXT format. (Default)
<li> <code>html</code> : HTML format. Not recommended for extraction purposes because the markup is messy.
<li> <code>xml</code> : XML format. Provides the most information.
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
Tags used here are defined in the PDF specification (See &sect;10.7 "<em>Tagged PDF</em>").
</ul>
<p>
<dt> <code>-I <em>image_directory</em></code>
<dd> Specifies the output directory for image extraction.
Currently only JPEG images are supported.
<p>
<dt> <code>-M <em>char_margin</em></code>
<dt> <code>-L <em>line_margin</em></code>
<dt> <code>-W <em>word_margin</em></code>
<dd> These are the parameters used for layout analysis.
In an actual PDF file, text portions might be split into several chunks
in the middle of its running, depending on the authoring software.
Therefore, text extraction needs to splice text chunks.
In the figure below, two text chunks whose distance is closer than
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
continuous and get grouped into one. Also, two lines whose distance is closer than
the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
as a text box, which is a rectangular area that contains a "cluster" of text portions.
Furthermore, it may be required to insert blank characters (spaces) as necessary
if the distance between two words is greater than the <em>word_margin</em>
(<em><font color="green">W</font></em>), as a blank between words might not be
represented as a space, but indicated by the positioning of each word.
<p>
Each value is specified not as an actual length, but as a proportion of
the length to the size of each character in question. The default values
are M = 2.0, L = 0.5, and W = 0.1, respectively.
<table style="border:2px gray solid; margin: 10px; padding: 10px;"><tr>
<td style="border-right:1px red solid" align=right>&rarr;</td>
<td style="border-left:1px red solid" colspan="4" align=left>&larr; <em><font color="red">M</font></em></td>
<td></td>
</tr><tr>
<td style="border:1px solid"><code>Q u i</code></td>
<td style="border:1px solid"><code>c k</code></td>
<td width="10px"></td>
<td style="border:1px solid"><code>b r o w</code></td>
<td style="border:1px solid"><code>n &nbsp; f o x</code></td>
<td style="border-bottom:1px blue solid" align=right>&darr;</td>
</tr><tr>
<td style="border-right:1px green solid" colspan="2" align=right>&rarr;</td><td></td>
<td style="border-left:1px green solid" colspan="2" align=left>&larr; <em><font color="green">W</font></em></td>
<td rowspan="2" valign=center align=center><em><font color="blue">L</font></em></td>
</tr><tr height="10px">
</tr><tr>
<td style="padding:0px;" colspan="5">
<table style="border:1px solid"><tr><td><code>j u m p s</code></td><td>...</td></tr></table>
</td>
<td style="border-top:1px blue solid" align=right>&uarr;</td>
</tr></table>
<p>
<dt> <code>-F <em>boxes_flow</em></code>
<dd> Specifies how much a horizontal and vertical position of a text matters
when determining a text order. The value should be within the range of
-1.0 (only horizontal position matters) to +1.0 (only vertical position matters).
The default value is 0.5.
<p>
<dt> <code>-C</code>
<dd> Suppress object caching.
This will reduce the memory consumption but also slows down the process.
<p>
<dt> <code>-n</code>
<dd> Suppress layout analysis.
<p>
<dt> <code>-A</code>
<dd> Forces to perform layout analysis for all the text strings,
including text contained in figures.
<p>
<dt> <code>-V</code>
<dd> Allows vertical writing detection.
<p>
<dt> <code>-Y <em>layout_mode</em></code>
<dd> Specifies how the page layout should be preserved. (Currently only applies to HTML format.)
<ul>
<li> <code>exact</code> : preserve the exact location of each individual character (a large and messy HTML).
<li> <code>normal</code> : preserve the location and line breaks in each text block. (Default)
<li> <code>loose</code> : preserve the overall location of each text block.
</ul>
<p>
<dt> <code>-E <em>extractdir</em></code>
<dd> Specifies the extraction directory of embedded files.
<p>
<dt> <code>-s <em>scale</em></code>
<dd> Specifies the output scale. Can be used in HTML format only.
<p>
<dt> <code>-m <em>maxpages</em></code>
<dd> Specifies the maximum number of pages to extract.
By default, it extracts all the pages in a document.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
<hr noshade>
<h3><a name="dumppdf">dumppdf.py</a></h3>
<p>
<code>dumppdf.py</code> dumps the internal contents of a PDF file
in pseudo-XML format. This program is primarily for debugging purposes,
but it's also possible to extract some meaningful contents
(such as images).
<h4>Examples</h4>
<blockquote><pre>
$ <strong>dumppdf.py -a foo.pdf</strong>
(dump all the headers and contents, except stream objects)
$ <strong>dumppdf.py -T foo.pdf</strong>
(dump the table of contents)
$ <strong>dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image)
</pre></blockquote>
<h4>Options</h4>
<dl>
<dt> <code>-a</code>
<dd> Instructs to dump all the objects.
By default, it only prints the document trailer (like a header).
<p>
<dt> <code>-i <em>objno,objno, ...</em></code>
<dd> Specifies PDF object IDs to display.
Comma-separated IDs, or multiple <code>-i</code> options are accepted.
<p>
<dt> <code>-p <em>pageno,pageno, ...</em></code>
<dd> Specifies the page number to be extracted.
Comma-separated page numbers, or multiple <code>-p</code> options are accepted.
Note that page numbers start at one, not zero.
<p>
<dt> <code>-r</code> (raw)
<dt> <code>-b</code> (binary)
<dt> <code>-t</code> (text)
<dd> Specifies the output format of stream contents.
Because the contents of stream objects can be very large,
they are omitted when none of the options above is specified.
<p>
With <code>-r</code> option, the "raw" stream contents are dumped without decompression.
With <code>-b</code> option, the decompressed contents are dumped as a binary blob.
With <code>-t</code> option, the decompressed contents are dumped in a text format,
similar to <code>repr()</code> manner. When
<code>-r</code> or <code>-b</code> option is given,
no stream header is displayed for the ease of saving it to a file.
<p>
<dt> <code>-T</code>
<dd> Shows the table of contents.
<p>
<dt> <code>-E <em>directory</em></code>
<dd> Extracts embedded files from the pdf into the given directory.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
<h2><a name="changes">Changes:</a></h2>
<ul>
<li> 2014/09/15: pushed on PyPi</li>
<li> 2014/09/10: pdfminer_six forked from pdfminer since Yusuke didn't want to merge and pdfminer3k is outdated</li>
</ul>
<h2><a name="todo">TODO</a></h2>
<ul>
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
<li> Better documentation.
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
<li> Crypt stream filter support. (More sample documents are needed!)
</ul>
<h2><a name="related">Related Projects</a></h2>
<ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
<li> <a href="http://mupdf.com/">mupdf</a>
</ul>
<h2><a name="license">Terms and Conditions</a></h2>
<p>
(This is so-called MIT/X License)
<p>
<small>
Copyright (c) 2004-2013 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
<p>
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>
<hr noshade>
<address>Yusuke Shinyama (yusuke at cs dot nyu dot edu)</address>
</body>

View File

@ -1,391 +0,0 @@
%TGIF 4.2.2
state(0,37,100.000,0,0,0,16,1,9,1,1,0,0,0,0,1,1,'Helvetica-Bold',1,69120,0,0,1,5,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
% %W%
%
unit("1 pixel/pixel").
color_info(19,65535,0,[
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
"red", 65535, 0, 0, 65535, 0, 0, 1,
"green", 0, 65535, 0, 0, 65535, 0, 1,
"blue", 0, 0, 65535, 0, 0, 65535, 1,
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
"black", 0, 0, 0, 0, 0, 0, 1,
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
]).
script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
rotate_pivot(0,0,0,0).
spline_tightness(1).
page(1,"",1,'').
box('black','',50,45,300,355,2,2,1,0,0,0,0,0,0,'2',0,[
]).
box('black','',75,75,195,225,2,1,1,10,8,0,0,0,0,'1',0,[
]).
box('black','',85,105,185,125,2,1,1,18,8,0,0,0,0,'1',0,[
]).
box('black','',85,105,105,125,2,1,1,19,0,0,0,0,0,'1',0,[
]).
box('black','',105,105,125,125,2,1,1,20,0,0,0,0,0,'1',0,[
]).
text('black',95,108,1,1,1,9,15,21,12,3,0,0,0,0,2,9,15,0,0,"",0,0,0,0,120,'',[
minilines(9,15,0,0,1,0,0,[
mini_line(9,12,3,0,0,0,[
str_block(0,9,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica',0,69120,9,12,3,0,-1,0,0,0,0,0,
"A")])
])
])]).
text('black',115,108,1,1,1,8,15,28,12,3,0,0,0,0,2,8,15,0,0,"",0,0,0,0,120,'',[
minilines(8,15,0,0,1,0,0,[
mini_line(8,12,3,0,0,0,[
str_block(0,8,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica',0,69120,8,12,3,0,-1,0,0,0,0,0,
"B")])
])
])]).
box('black','',125,105,145,125,0,1,1,32,0,0,0,0,0,'1',0,[
]).
text('black',135,108,1,1,1,9,15,36,12,3,0,0,0,0,2,9,15,0,0,"",0,0,0,0,120,'',[
minilines(9,15,0,0,1,0,0,[
mini_line(9,12,3,0,0,0,[
str_block(0,9,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica',0,69120,9,12,3,0,-1,0,0,0,0,0,
"C")])
])
])]).
poly('black','',2,[
215,140,215,220],0,3,1,51,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
box('black','',175,265,270,325,0,3,1,65,0,0,0,0,0,'3',0,[
]).
box('black','',185,270,260,320,0,1,1,69,8,0,0,0,0,'1',0,[
]).
poly('black','',6,[
195,295,215,290,235,310,245,285,225,300,195,295],0,2,1,74,0,0,0,0,0,0,0,'2',0,0,
"00","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
box('black','',85,275,140,315,1,2,0,87,0,0,0,0,0,'2',0,[
]).
text('black',85,23,1,1,1,44,15,93,12,3,0,0,0,0,2,44,15,0,0,"",0,0,0,0,35,'',[
minilines(44,15,0,0,1,0,0,[
mini_line(44,12,3,0,0,0,[
str_block(0,44,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,44,12,3,0,-1,0,0,0,0,0,
"LTPage")])
])
])]).
text('black',255,133,1,1,1,39,15,100,12,3,0,0,0,0,2,39,15,0,0,"",0,0,0,0,145,'',[
minilines(39,15,0,0,1,0,0,[
mini_line(39,12,3,0,0,0,[
str_block(0,39,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,39,12,3,0,-1,0,0,0,0,0,
"LTLine")])
])
])]).
text('black',125,83,1,1,1,42,15,104,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,95,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTChar")])
])
])]).
text('black',245,53,1,1,1,65,15,108,12,3,0,0,0,0,2,65,15,0,0,"",0,0,0,0,65,'',[
minilines(65,15,0,0,1,0,0,[
mini_line(65,12,3,0,0,0,[
str_block(0,65,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,65,12,3,0,-1,0,0,0,0,0,
"LTTextBox")])
])
])]).
text('black',245,88,1,1,1,66,15,110,12,3,0,0,0,0,2,66,15,0,0,"",0,0,0,0,100,'',[
minilines(66,15,0,0,1,0,0,[
mini_line(66,12,3,0,0,0,[
str_block(0,66,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,66,12,3,0,-1,0,0,0,0,0,
"LTTextLine")])
])
])]).
text('black',255,243,1,1,1,51,15,112,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,255,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTFigure")])
])
])]).
text('black',140,243,1,1,1,51,15,114,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,255,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTImage")])
])
])]).
text('black',240,223,1,1,1,43,15,116,12,3,0,0,0,0,2,43,15,0,0,"",0,0,0,0,235,'',[
minilines(43,15,0,0,1,0,0,[
mini_line(43,12,3,0,0,0,[
str_block(0,43,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,43,12,3,0,0,0,0,0,0,0,
"LTRect")])
])
])]).
text('black',190,333,1,1,1,50,15,118,12,3,0,0,0,0,2,50,15,0,0,"",0,0,0,0,345,'',[
minilines(50,15,0,0,1,0,0,[
mini_line(50,12,3,0,0,0,[
str_block(0,50,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,50,12,3,0,-1,0,0,0,0,0,
"LTCurve")])
])
])]).
text('black',170,138,1,1,1,42,15,121,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,150,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTText")])
])
])]).
box('black','',145,105,165,125,0,1,1,125,8,0,0,0,0,'1',0,[
]).
poly('black','',2,[
105,95,95,110],0,1,1,135,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
165,140,155,115],0,1,1,138,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
215,65,190,80],0,1,1,139,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
215,100,180,115],0,1,1,140,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
235,140,215,150],0,1,1,141,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
220,235,205,265],0,1,1,146,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
235,255,225,275],0,1,1,147,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
195,330,220,300],0,1,1,148,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
125,255,110,280],0,1,1,149,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
text('black',610,33,1,1,1,44,15,151,12,3,0,0,0,0,2,44,15,0,0,"",0,0,0,0,45,'',[
minilines(44,15,0,0,1,0,0,[
mini_line(44,12,3,0,0,0,[
str_block(0,44,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,44,12,3,0,-1,0,0,0,0,0,
"LTPage")])
])
])]).
text('black',460,108,1,1,1,65,15,152,12,3,0,0,0,0,2,65,15,0,0,"",0,0,0,0,120,'',[
minilines(65,15,0,0,1,0,0,[
mini_line(65,12,3,0,0,0,[
str_block(0,65,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,65,12,3,0,-1,0,0,0,0,0,
"LTTextBox")])
])
])]).
text('black',410,178,1,1,1,66,15,154,12,3,0,0,0,0,2,66,15,0,0,"",0,0,0,0,190,'',[
minilines(66,15,0,0,1,0,0,[
mini_line(66,12,3,0,0,0,[
str_block(0,66,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,66,12,3,0,-1,0,0,0,0,0,
"LTTextLine")])
])
])]).
text('black',360,248,1,1,1,42,15,157,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,260,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTChar")])
])
])]).
text('black',420,248,1,1,1,42,15,159,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,260,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTChar")])
])
])]).
text('black',480,248,1,1,1,42,15,161,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,260,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTText")])
])
])]).
text('black',460,178,1,1,1,12,15,170,12,3,0,0,0,0,2,12,15,0,0,"",0,0,0,0,190,'',[
minilines(12,15,0,0,1,0,0,[
mini_line(12,12,3,0,0,0,[
str_block(0,12,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,12,12,3,0,-1,0,0,0,0,0,
"...")])
])
])]).
text('black',520,248,1,1,1,12,15,172,12,3,0,0,0,0,2,12,15,0,0,"",0,0,0,0,260,'',[
minilines(12,15,0,0,1,0,0,[
mini_line(12,12,3,0,0,0,[
str_block(0,12,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,12,12,3,0,-1,0,0,0,0,0,
"...")])
])
])]).
text('black',560,108,1,1,1,51,15,174,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,120,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTFigure")])
])
])]).
text('black',635,108,1,1,1,39,15,178,12,3,0,0,0,0,2,39,15,0,0,"",0,0,0,0,120,'',[
minilines(39,15,0,0,1,0,0,[
mini_line(39,12,3,0,0,0,[
str_block(0,39,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,39,12,3,0,-1,0,0,0,0,0,
"LTLine")])
])
])]).
text('black',700,108,1,1,1,43,15,180,12,3,0,0,0,0,2,43,15,0,0,"",0,0,0,0,120,'',[
minilines(43,15,0,0,1,0,0,[
mini_line(43,12,3,0,0,0,[
str_block(0,43,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,43,12,3,0,0,0,0,0,0,0,
"LTRect")])
])
])]).
text('black',580,178,1,1,1,50,15,182,12,3,0,0,0,0,2,50,15,0,0,"",0,0,0,0,190,'',[
minilines(50,15,0,0,1,0,0,[
mini_line(50,12,3,0,0,0,[
str_block(0,50,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,50,12,3,0,-1,0,0,0,0,0,
"LTCurve")])
])
])]).
text('black',775,108,1,1,1,51,15,186,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,120,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTImage")])
])
])]).
poly('black','',2,[
475,105,590,50],0,1,1,190,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
560,110,595,50],0,1,1,191,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
635,105,600,50],0,1,1,192,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
610,50,700,100],0,1,1,193,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
765,100,630,50],0,1,1,194,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
460,125,425,175],0,1,1,196,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
560,125,570,175],0,1,1,197,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
415,195,370,245],0,1,1,198,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
415,195,420,245],0,1,1,199,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
415,195,475,245],0,1,1,200,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
470,125,485,175],0,1,1,206,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
420,195,510,220],0,1,1,207,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
565,125,635,175],0,1,1,208,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
text('black',635,178,1,1,1,12,15,215,12,3,0,0,0,0,2,12,15,0,0,"",0,0,0,0,190,'',[
minilines(12,15,0,0,1,0,0,[
mini_line(12,12,3,0,0,0,[
str_block(0,12,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,12,12,3,0,-1,0,0,0,0,0,
"...")])
])
])]).

35
docs/make.bat Normal file
View File

@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

View File

@ -1,187 +0,0 @@
%TGIF 4.2.2
state(0,37,100.000,0,0,0,16,1,9,1,1,1,0,0,2,1,1,'Helvetica-Bold',1,69120,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
% %W%
%
unit("1 pixel/pixel").
color_info(19,65535,0,[
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
"red", 65535, 0, 0, 65535, 0, 0, 1,
"green", 0, 65535, 0, 0, 65535, 0, 1,
"blue", 0, 0, 65535, 0, 0, 65535, 1,
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
"black", 0, 0, 0, 0, 0, 0, 1,
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
]).
script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
rotate_pivot(0,0,0,0).
spline_tightness(1).
page(1,"",1,'').
oval('black','',350,380,450,430,2,2,1,88,0,0,0,0,0,'2',0,[
]).
poly('black','',2,[
270,270,350,230],1,2,1,54,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
270,280,350,320],1,2,1,55,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
box('black','',350,100,450,150,2,2,1,2,0,0,0,0,0,'2',0,[
]).
text('black',400,118,1,1,1,84,15,3,12,3,0,0,0,0,2,84,15,0,0,"",0,0,0,0,130,'',[
minilines(84,15,0,0,1,0,0,[
mini_line(84,12,3,0,0,0,[
str_block(0,84,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,84,12,3,0,0,0,0,0,0,0,
"PDFDocument")])
])
])]).
box('black','',150,100,250,150,2,2,1,13,0,0,0,0,0,'2',0,[
]).
text('black',200,118,1,1,1,63,15,14,12,3,0,0,0,0,2,63,15,0,0,"",0,0,0,0,130,'',[
minilines(63,15,0,0,1,0,0,[
mini_line(63,12,3,0,0,0,[
str_block(0,63,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,63,12,3,0,0,0,0,0,0,0,
"PDFParser")])
])
])]).
box('black','',350,200,450,250,2,2,1,20,0,0,0,0,0,'2',0,[
]).
text('black',400,218,1,1,1,88,15,21,12,3,0,0,0,0,2,88,15,0,0,"",0,0,0,0,230,'',[
minilines(88,15,0,0,1,0,0,[
mini_line(88,12,3,0,0,0,[
str_block(0,88,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,88,12,3,0,0,0,0,0,0,0,
"PDFInterpreter")])
])
])]).
box('black','',350,300,450,350,2,2,1,23,0,0,0,0,0,'2',0,[
]).
text('black',400,318,1,1,1,65,15,24,12,3,0,0,0,0,2,65,15,0,0,"",0,0,0,0,330,'',[
minilines(65,15,0,0,1,0,0,[
mini_line(65,12,3,0,0,0,[
str_block(0,65,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,65,12,3,0,-1,0,0,0,0,0,
"PDFDevice")])
])
])]).
box('black','',180,250,280,300,2,2,1,29,0,0,0,0,0,'2',0,[
]).
text('black',230,268,1,1,1,131,15,30,12,3,2,0,0,0,2,131,15,0,0,"",0,0,0,0,280,'',[
minilines(131,15,0,0,1,0,0,[
mini_line(131,12,3,0,0,0,[
str_block(0,131,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,131,12,3,0,0,0,0,0,0,0,
"PDFResourceManager")])
])
])]).
poly('black','',2,[
250,140,350,140],1,2,1,45,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
350,110,250,110],1,2,1,46,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
400,150,400,200],1,2,1,47,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
400,250,400,300],1,2,1,56,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
400,350,400,380],0,2,1,65,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
text('black',400,388,3,1,1,44,41,71,12,3,0,-2,0,0,2,44,41,0,0,"",0,0,0,0,400,'',[
minilines(44,41,0,0,1,-2,0,[
mini_line(44,12,3,0,0,0,[
str_block(0,44,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,44,12,3,0,-1,0,0,0,0,0,
"Display")])
]),
mini_line(20,12,3,0,0,0,[
str_block(0,20,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,20,12,3,0,-1,0,0,0,0,0,
"File")])
]),
mini_line(23,12,3,0,0,0,[
str_block(0,23,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,23,12,3,0,-1,0,0,0,0,0,
"etc.")])
])
])]).
text('black',300,88,1,1,1,92,15,79,12,3,0,0,0,0,2,92,15,0,0,"",0,0,0,0,100,'',[
minilines(92,15,0,0,1,0,0,[
mini_line(92,12,3,0,0,0,[
str_block(0,92,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,92,12,3,0,-1,0,0,0,0,0,
"request objects")])
])
])]).
text('black',300,148,1,1,1,78,15,84,12,3,0,0,0,0,2,78,15,0,0,"",0,0,0,0,160,'',[
minilines(78,15,0,0,1,0,0,[
mini_line(78,12,3,0,0,0,[
str_block(0,78,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,78,12,3,0,-1,0,0,0,0,0,
"store objects")])
])
])]).
oval('black','',20,100,120,150,2,2,1,106,0,0,0,0,0,'2',0,[
]).
text('black',70,118,1,1,1,46,15,107,12,3,0,0,0,0,2,46,15,0,0,"",0,0,0,0,130,'',[
minilines(46,15,0,0,1,0,0,[
mini_line(46,12,3,0,0,0,[
str_block(0,46,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,46,12,3,0,-1,0,0,0,0,0,
"PDF file")])
])
])]).
poly('black','',2,[
120,120,150,120],0,2,1,114,0,2,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
text('black',400,158,1,1,1,84,15,115,12,3,2,0,0,0,2,84,15,0,0,"",0,0,0,0,170,'',[
minilines(84,15,0,0,1,0,0,[
mini_line(84,12,3,0,0,0,[
str_block(0,84,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,84,12,3,0,-1,0,0,0,0,0,
"page contents")])
])
])]).
text('black',400,258,1,1,1,129,15,119,12,3,2,0,0,0,2,129,15,0,0,"",0,0,0,0,270,'',[
minilines(129,15,0,0,1,0,0,[
mini_line(129,12,3,0,0,0,[
str_block(0,129,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,129,12,3,0,-1,0,0,0,0,0,
"rendering instructions")])
])
])]).

Binary file not shown (deleted image, 2.0 KiB)

View File

@ -1,223 +0,0 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
</head>
<body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Mon Mar 24 11:49:28 UTC 2014
<!-- hhmts end -->
</div>
<p>
<a href="index.html">[Back to PDFMiner homepage]</a>
<h1>Programming with PDFMiner</h1>
<p>
This page explains how to use PDFMiner as a library
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Performing Layout Analysis</a>
<li> <a href="#tocextract">Obtaining Table of Contents</a>
<li> <a href="#extend">Extending Functionality</a>
</ul>
<h2><a name="overview">Overview</a></h2>
<p>
<strong>PDF is evil.</strong> Although it is called a PDF
"document", it's nothing like Word or HTML document. PDF is more
like a graphic representation. PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact
position on a display or paper. In most cases, it has no logical
structure such as sentences or paragraphs and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from its
positioning, but there's nothing guaranteed to work. Ugly, I
know. Again, PDF is evil.
<p>
[More technical details about the internal structure of PDF:
"How to Extract Text Contents from PDF Manually"
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>]
<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time and memory consuming. However,
not every part is needed for most PDF processing tasks. Therefore
PDFMiner takes a strategy of lazy parsing, which is to parse the
stuff only when it's necessary. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.
<p>
Figure 1 shows the relationship between the classes in PDFMiner.
<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>
<h2><a name="basic">Basic Usage</a></h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
<span class="comment"># Supply the password for initialization.</span>
document = PDFDocument(parser, password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not document.is_extractable:
raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in PDFPage.create_pages(document):
interpreter.process_page(page)
</pre></blockquote>
<h2><a name="layout">Performing Layout Analysis</a></h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
interpreter.process_page(page)
<span class="comment"># receive the LTPage object for the page.</span>
layout = device.get_result()
</pre></blockquote>
A layout analyzer returns a <code>LTPage</code> object for each page
in the PDF document. This object contains child objects within the page,
forming a tree structure. Figure 2 shows the relationship between
these objects.
<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and its tree structure</small>
</div>
<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
<code>LTCurve</code> and <code>LTLine</code>.
<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represents a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.
<code>get_text()</code> method returns the text content.
<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontaly
or vertically, depending on the text's writing mode.
<code>get_text()</code> method returns the text content.
<dt> <code>LTChar</code>
<dt> <code>LTAnno</code>
<dd> Represent an actual letter in the text as a Unicode string.
Note that, while a <code>LTChar</code> object has actual boundaries,
<code>LTAnno</code> objects does not, as these are "virtual" characters,
inserted by a layout analyzer according to the relationship between two characters
(e.g. a space).
<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.
<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be
in JPEG or other formats, but currently PDFMiner does not
pay much attention to graphical objects.
<dt> <code>LTLine</code>
<dd> Represents a single straight line.
Could be used for separating text or figures.
<dt> <code>LTRect</code>
<dd> Represents a rectangle.
Could be used for framing another pictures or figures.
<dt> <code>LTCurve</code>
<dd> Represents a generic Bezier curve.
</dl>
<p>
Also, check out <a href="http://denis.papathanasiou.org/archive/2010.08.04.post.pdf">a more complete example by Denis Papathanasiou(Extracting Text & Images from PDF Files)</a>.
<h2><a name="tocextract">Obtaining Table of Contents</a></h2>
<p>
PDFMiner provides functions to access the document's table of contents
("Outlines").
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
<span class="comment"># Open a PDF document.</span>
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser, password)
<span class="comment"># Get the outlines of the document.</span>
outlines = document.get_outlines()
for (level,title,dest,a,se) in outlines:
print (level, title)
</pre></blockquote>
<p>
Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since
PDF does not have a logical structure, and it does not provide a
way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are
referring to.
<h2><a name="extend">Extending Functionality</a></h2>
<p>
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class
in order to process them differently / obtain other information.
<hr noshade>
<address>Yusuke Shinyama</address>
</body>

1
docs/requirements.txt Normal file
View File

@ -0,0 +1 @@
sphinx-argparse

View File

@ -0,0 +1,28 @@
<style>
td {
text-align: center;
}
</style>
<table style="margin: 10px; padding: 10px;">
<tr>
<td style="text-align: right; border-right:1px red solid">&rarr;</td>
<td colspan="4"
style="text-align: left; border-left:1px red solid">&larr; <em><font
color="red">M</font></em></td>
</tr>
<tr>
<td style="border:1px solid"><code>Q u i</code></td>
<td style="border:1px solid"><code>c k</code></td>
<td width="10px"></td>
<td style="border:1px solid"><code>b r o w n</code></td>
</tr>
<tr>
<td colspan="2" style="text-align: right; border-right:1px green solid">
&rarr;
</td>
<td></td>
<td colspan="2"
style="text-align: left; border-left:1px green solid">&larr;
<em><font color="green">W</font></em></td>
</tr>
</table>

View File

@ -0,0 +1,23 @@
<style>
.background-blue {
background-color: lightblue;
border: 2px solid lightblue;
}
</style>
<table style="margin: 10px; padding: 10px;">
<tr>
<td style="border:1px solid; text-align: left">
<code>
Q u i c k &nbsp; b r o w n<br/> f o x
</code>
</td>
<td class="background-blue" colspan="3"></td>
</tr>
<tr style="height: 10px;">
<td class="background-blue" colspan="4"></td>
</tr>
<tr>
<td class="background-blue" colspan="3"></td>
<td style="border:1px solid"><code>j u m p s ...</code></td>
</tr>
</table>


@ -0,0 +1,45 @@
<style>
td {
text-align: center;
}
</style>
<table style="margin: 10px; padding: 10px;">
<tr>
<td></td>
<td></td>
<td align=right style="border-bottom:1px blue solid">&darr;</td>
<td></td>
</tr>
<tr>
<td colspan="2" style="border:1px solid"><code>Q u i c k &nbsp; b r o w
n</code></td>
<td></td>
<td align=right style="border-bottom:1px blue solid">&darr;</td>
</tr>
<tr>
<td></td>
<td></td>
<td align=center valign=center><em><font color="blue">
L<sub>1</sub>
</font></em></td>
<td></td>
</tr>
<tr>
<td style="border:1px solid;">
<code>f o x</code>
</td>
<td>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</td>
<td align=right style="border-top:1px blue solid">&uarr;</td>
<td align=center valign=center><em><font color="blue">
L<sub>2</sub>
</font></em></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td align=right style="border-top:1px blue solid">&uarr;</td>
</tr>
</table>

Binary image file (3.5 KiB)


@ -0,0 +1,25 @@
.. _api_commandline:
Command-line API
****************
.. _api_pdf2txt:
pdf2txt.py
==========
.. argparse::
   :module: tools.pdf2txt
   :func: maketheparser
   :prog: python tools/pdf2txt.py
.. _api_dumppdf:
dumppdf.py
==========
.. argparse::
   :module: tools.dumppdf
   :func: create_parser
   :prog: python tools/dumppdf.py


@ -0,0 +1,20 @@
.. _api_composable:
Composable API
**************
.. _api_laparams:
LAParams
========
.. currentmodule:: pdfminer.layout
.. autoclass:: LAParams
Todo:
=====
- `PDFDevice`
- `TextConverter`
- `PDFPageAggregator`
- `PDFPageInterpreter`


@ -0,0 +1,21 @@
.. _api_highlevel:
High-level functions API
************************
.. _api_extract_text:
extract_text
============
.. currentmodule:: pdfminer.high_level
.. autofunction:: extract_text
.. _api_extract_text_to_fp:
extract_text_to_fp
==================
.. currentmodule:: pdfminer.high_level
.. autofunction:: extract_text_to_fp


@ -0,0 +1,9 @@
API documentation
*****************
.. toctree::
   :maxdepth: 2

   commandline
   highlevel
   composable

docs/source/conf.py

@ -0,0 +1,61 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
sys.path.insert(0, os.path.join(os.path.abspath(os.path.dirname(__file__)), '../../'))
# -- Project information -----------------------------------------------------
project = 'pdfminer.six'
copyright = '2019, Yusuke Shinyama, Philippe Guglielmetti & Pieter Marsman'
author = 'Yusuke Shinyama, Philippe Guglielmetti & Pieter Marsman'
# The full version, including alpha/beta/rc tags
release = '20191020'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinxarg.ext',
'sphinx.ext.autodoc',
'sphinx.ext.doctest',
]
# Root rst file
master_doc = 'index'
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

docs/source/index.rst

@ -0,0 +1,72 @@
Welcome to pdfminer.six's documentation!
****************************************
.. image:: https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master
   :target: https://travis-ci.org/pdfminer/pdfminer.six
   :alt: Travis-ci build badge

.. image:: https://img.shields.io/pypi/v/pdfminer.six.svg
   :target: https://pypi.python.org/pypi/pdfminer.six/
   :alt: PyPi version badge

.. image:: https://badges.gitter.im/pdfminer-six/Lobby.svg
   :target: https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium
   :alt: gitter badge
Pdfminer.six is a Python package for extracting information from PDF documents.
Check out the source on `GitHub <https://github.com/pdfminer/pdfminer.six>`_.
Content
=======
.. toctree::
   :maxdepth: 2

   tutorials/index
   topics/index
   api/index
Features
========
* Parse all objects from a PDF document into Python objects.
* Analyze and group text in a human-readable way.
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
  contents and more.
* Support for (almost all) features from the PDF-1.7 specification.
* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical
  writing.
* Support for various font types (Type1, TrueType, Type3, and CID).
* Support for basic encryption (RC4).
Installation instructions
=========================
Before using it, you must install it. Pdfminer.six works on Python 2.7 or
newer.
::

    $ pip install pdfminer.six

Note that Python 2.7 support will be dropped in January 2020.
Common use-cases
----------------
* :ref:`tutorial_commandline` if you just want to extract text from a pdf once.
* :ref:`tutorial_highlevel` if you want to integrate pdfminer.six with your
  Python code.
* :ref:`tutorial_composable` when you want to tailor the behavior of
  pdfminer.six to your needs.
Contributing
============
We welcome any contributors to pdfminer.six! But, before doing anything, take
a look at the `contribution guide
<https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md>`_.


@ -0,0 +1,132 @@
.. _topic_pdf_to_text:
Converting a PDF file to text
*****************************
Most PDF files look like they contain well-structured text. But the reality is
that a PDF file does not contain anything that resembles paragraphs,
sentences or even words. When it comes to text, a PDF file is only aware of
the characters and their placement.
This makes extracting meaningful pieces of text from PDF files difficult.
The characters that compose a paragraph are no different from those that
compose a table, the page footer or the description of a figure. Unlike
other document formats, like a `.txt` file or a Word document, the PDF format
does not contain a stream of text.
A PDF document consists of a collection of objects that together describe
the appearance of one or more pages, possibly accompanied by additional
interactive elements and higher-level application data. A PDF file contains
the objects making up a PDF document along with associated structural
information, all represented as a single self-contained sequence of bytes. [1]_
Layout analysis algorithm
=========================
Pdfminer.six attempts to reconstruct some of those structures by using heuristics
on the positioning of characters. This works well for sentences and
paragraphs because meaningful groups of nearby characters can be made.
The layout analysis consists of three different stages: it groups characters
into words and lines, then it groups lines into boxes and finally it groups
textboxes hierarchically. These stages are discussed in the following
sections. The resulting output of the layout analysis is an ordered hierarchy
of layout objects on a PDF page.
.. figure:: ../_static/layout_analysis_output.png
   :align: center

   The output of the layout analysis is a hierarchy of layout objects.
The output of the layout analysis heavily depends on a couple of parameters.
All these parameters are part of the :ref:`api_laparams` class.
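For example, a minimal sketch (Python 3; the file name and parameter value are
illustrative only) of where these parameters enter the analysis::

    from io import StringIO

    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    output = StringIO()
    with open('samples/simple1.pdf', 'rb') as pdf_file:
        # Pass a custom LAParams instance to tune the layout analysis.
        extract_text_to_fp(pdf_file, output, laparams=LAParams(line_margin=0.3))
    print(output.getvalue())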
Grouping characters into words and lines
----------------------------------------
The first step in going from characters to text is to group characters in a
meaningful way. Each character has an x-coordinate and a y-coordinate for its
bottom-left corner and upper-right corner, i.e. its bounding box.
Pdfminer.six uses these bounding boxes to decide which characters belong
together.
Characters that are both horizontally and vertically close are grouped. How
close they should be is determined by the `char_margin` (M in the figure) and
the `line_overlap` (not in the figure) parameters. The horizontal *distance*
between the bounding boxes of two characters should be smaller than the
`char_margin`, and the vertical *overlap* between the bounding boxes should be
larger than the `line_overlap`.
.. raw:: html
   :file: ../_static/layout_analysis.html
The values of `char_margin` and `line_overlap` are relative to the size of
the bounding boxes of the characters. The `char_margin` is relative to the
maximum width of either one of the bounding boxes, and the `line_overlap` is
relative to the minimum height of either one of the bounding boxes.
Spaces need to be inserted between characters because the PDF format has no
notion of the space character. A space is inserted if the characters are
further apart than the `word_margin` (W in the figure). The `word_margin` is
relative to the maximum width or height of the new character. Having a
smaller `word_margin` inserts spaces between characters more often, and thus
creates smaller words. Note that the `word_margin` should be smaller than the
`char_margin`, otherwise hardly any spaces will be inserted.
The result of this stage is a list of lines. Each line consists of a list of
characters. These characters are either original `LTChar` characters that
originate from the PDF file, or inserted `LTAnno` characters that
represent spaces between words or newlines at the end of each line.
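As a sketch, the parameters that govern this stage (the values shown are the
library defaults at the time of writing)::

    from pdfminer.layout import LAParams

    # line_overlap, char_margin and word_margin control how characters are
    # grouped into lines and how spaces are inserted between words.
    laparams = LAParams(line_overlap=0.5, char_margin=2.0, word_margin=0.1)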
Grouping lines into boxes
-------------------------
The second step is grouping lines in a meaningful way. Each line has a
bounding box that is determined by the bounding boxes of the characters that
it contains. Like grouping characters, pdfminer.six uses the bounding boxes
to group the lines.
Lines that are both horizontally overlapping and vertically close are grouped.
How vertically close the lines should be is determined by the `line_margin`.
This margin is specified relative to the height of the bounding box. Lines
are close if the gaps between their tops (see L :sub:`1` in the figure) and
their bottoms (see L :sub:`2` in the figure) are smaller than the absolute
line margin, i.e. the `line_margin` multiplied by the height of the bounding
box.
.. raw:: html
   :file: ../_static/layout_analysis_group_lines.html
The result of this stage is a list of text boxes. Each box consists of a list
of lines.
Grouping textboxes hierarchically
---------------------------------
The last step is to group the text boxes in a meaningful way. This step
repeatedly merges the two text boxes that are closest to each other.
The closeness of bounding boxes is computed as the area that is between the
two text boxes (the blue area in the figure). In other words, it is the area of
the bounding box that surrounds both text boxes, minus the area of the bounding
boxes of the individual text boxes.
.. raw:: html
   :file: ../_static/layout_analysis_group_boxes.html
Working with rotated characters
===============================
The algorithm described above assumes that all characters have the same
orientation. However, any writing direction is possible in a PDF. To
accommodate this, pdfminer.six can detect vertical writing with the
`detect_vertical` parameter. It applies all the grouping steps as if the
PDF were rotated 90 (or 270) degrees.
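For example (a sketch; pass the resulting object via the `laparams` argument
of the converter or high-level function you use)::

    from pdfminer.layout import LAParams

    # Also analyze vertically written (e.g. CJK) text.
    laparams = LAParams(detect_vertical=True)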
References
==========
.. [1] Adobe Systems Inc. (2007). *PDF reference: Adobe portable document
   format, version 1.7.*


@ -0,0 +1,7 @@
Using pdfminer.six
******************
.. toctree::
   :maxdepth: 2

   converting_pdf_to_text


@ -0,0 +1,41 @@
.. _tutorial_commandline:
Get started with command-line tools
***********************************
pdfminer.six has several tools that can be used from the command line. The
command-line tools are aimed at users that occasionally want to extract text
from a pdf.
Take a look at the high-level or composable interface if you want to use
pdfminer.six programmatically.
Examples
========
pdf2txt.py
----------
::

    $ python tools/pdf2txt.py example.pdf

    all the text from the pdf appears on the command line
The :ref:`api_pdf2txt` tool extracts all the text from a PDF. It applies layout
analysis with sensible defaults to order and group the text.
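For example, to convert a PDF to HTML instead of plain text (file names are
illustrative)::

    $ python tools/pdf2txt.py -t html -o example.html example.pdf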
dumppdf.py
----------
::

    $ python tools/dumppdf.py -a example.pdf

    <pdf><object id="1">
    ...
    </object>
    ...
    </pdf>
The :ref:`api_dumppdf` tool can be used to extract the internal structure of a
PDF. This tool is primarily meant for debugging, but it can be useful to
anybody working with PDFs.
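For example, to dump only the outline ("table of contents") of a document
(the file name is illustrative)::

    $ python tools/dumppdf.py -T example.pdf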


@ -0,0 +1,33 @@
.. _tutorial_composable:
Get started using the composable components API
***********************************************
The command-line tools and the high-level API are just shortcuts for commonly
used combinations of pdfminer.six components. You can use these components
directly to tailor pdfminer.six to your own needs.
For example, to extract the text from a PDF file and save it in a Python
variable::
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open('samples/simple1.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    print(output_string.getvalue())
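As another sketch, you can swap the `TextConverter` for a `PDFPageAggregator`
to work with the layout objects themselves, for example to get the position of
every text box (the sample path is illustrative)::

    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open('samples/simple1.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()  # an LTPage with the analyzed layout
            for element in layout:
                if isinstance(element, LTTextBox):
                    print(element.x0, element.y0, element.get_text())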


@ -0,0 +1,67 @@
.. testsetup::

   import sys
   from pdfminer.high_level import extract_text_to_fp, extract_text
.. _tutorial_highlevel:
Get started using the high-level functions
******************************************
The high-level API can be used to do common tasks.
The simplest way to extract text from a PDF is to use
:ref:`api_extract_text`:
.. doctest::
>>> text = extract_text('samples/simple1.pdf')
>>> print(repr(text))
'Hello \n\nWorld\n\nWorld\n\nHello \n\nH e l l o \n\nH e l l o \n\nW o r l d\n\nW o r l d\n\n\x0c'
>>> print(text)
... # doctest: +NORMALIZE_WHITESPACE
Hello
<BLANKLINE>
World
<BLANKLINE>
World
<BLANKLINE>
Hello
<BLANKLINE>
H e l l o
<BLANKLINE>
H e l l o
<BLANKLINE>
W o r l d
<BLANKLINE>
W o r l d
<BLANKLINE>
To read text from a PDF and print it on the command line:
.. doctest::
>>> if sys.version_info > (3, 0):
...     from io import StringIO
... else:
...     from io import BytesIO as StringIO
>>> output_string = StringIO()
>>> with open('samples/simple1.pdf', 'rb') as fin:
...     extract_text_to_fp(fin, output_string)
>>> print(output_string.getvalue().strip())
Hello WorldHello WorldHello WorldHello World
Or to convert it to HTML and use layout analysis:
.. doctest::
>>> if sys.version_info > (3, 0):
...     from io import StringIO
... else:
...     from io import BytesIO as StringIO
>>> from pdfminer.layout import LAParams
>>> output_string = StringIO()
>>> with open('samples/simple1.pdf', 'rb') as fin:
...     extract_text_to_fp(fin, output_string, laparams=LAParams(),
...                        output_type='html', codec=None)


@ -0,0 +1,9 @@
Getting started
***************
.. toctree::
   :maxdepth: 2

   commandline
   highlevel
   composable


@ -1,4 +0,0 @@
blockquote { background: #eeeeee; }
h1 { border-bottom: solid black 2px; }
h2 { border-bottom: solid black 1px; }
.comment { color: darkgreen; }


@ -2,6 +2,7 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
import logging import logging
import re import re
import sys
from .pdfdevice import PDFTextDevice from .pdfdevice import PDFTextDevice
from .pdffont import PDFUnicodeNotDefined from .pdffont import PDFUnicodeNotDefined
from .layout import LTContainer from .layout import LTContainer
@ -271,6 +272,8 @@ class HTMLConverter(PDFConverter):
def write(self, text): def write(self, text):
if self.codec: if self.codec:
text = text.encode(self.codec) text = text.encode(self.codec)
if sys.version_info < (3, 0):
text = str(text)
self.outfp.write(text) self.outfp.write(text)
return return


@ -1,26 +1,20 @@
# -*- coding: utf-8 -*- """Functions that can be used for the most common use-cases for pdfminer.six"""
"""
Functions that encapsulate "usual" use-cases for pdfminer, for use making
bundled scripts and for using pdfminer as a module for routine tasks.
"""
import logging import logging
import six
import sys import sys
import six
# Conditional import because python 2 is stupid # Conditional import because python 2 is stupid
if sys.version_info > (3, 0): if sys.version_info > (3, 0):
from io import StringIO from io import StringIO
else: else:
from io import BytesIO as StringIO from io import BytesIO as StringIO
from .pdfdocument import PDFDocument
from .pdfparser import PDFParser
from .pdfinterp import PDFResourceManager, PDFPageInterpreter from .pdfinterp import PDFResourceManager, PDFPageInterpreter
from .pdfdevice import PDFDevice, TagExtractor from .pdfdevice import TagExtractor
from .pdfpage import PDFPage from .pdfpage import PDFPage
from .converter import XMLConverter, HTMLConverter, TextConverter from .converter import XMLConverter, HTMLConverter, TextConverter
from .cmapdb import CMapDB
from .image import ImageWriter from .image import ImageWriter
from .layout import LAParams from .layout import LAParams
@ -35,21 +29,25 @@ def extract_text_to_fp(inf, outfp,
Takes loads of optional arguments but the defaults are somewhat sane. Takes loads of optional arguments but the defaults are somewhat sane.
Beware laparams: Including an empty LAParams is not the same as passing None! Beware laparams: Including an empty LAParams is not the same as passing None!
Returns nothing, acting as it does on two streams. Use StringIO to get strings. Returns nothing, acting as it does on two streams. Use StringIO to get strings.
output_type: May be 'text', 'xml', 'html', 'tag'. Only 'text' works properly. :param inf: a file-like object to read PDF structure from, such as a
codec: Text decoding codec file handler (using the builtin `open()` function) or a `BytesIO`.
laparams: An LAParams object from pdfminer.layout. :param outfp: a file-like object to write the text to.
Default is None but may not layout correctly. :param output_type: May be 'text', 'xml', 'html', 'tag'. Only 'text' works properly.
maxpages: How many pages to stop parsing after :param codec: Text decoding codec
page_numbers: zero-indexed page numbers to operate on. :param laparams: An LAParams object from pdfminer.layout. Default is None but may not layout correctly.
password: For encrypted PDFs, the password to decrypt. :param maxpages: How many pages to stop parsing after
scale: Scale factor :param page_numbers: zero-indexed page numbers to operate on.
rotation: Rotation factor :param password: For encrypted PDFs, the password to decrypt.
layoutmode: Default is 'normal', see pdfminer.converter.HTMLConverter :param scale: Scale factor
output_dir: If given, creates an ImageWriter for extracted images. :param rotation: Rotation factor
strip_control: Does what it says on the tin :param layoutmode: Default is 'normal', see pdfminer.converter.HTMLConverter
debug: Output more logging data :param output_dir: If given, creates an ImageWriter for extracted images.
disable_caching: Does what it says on the tin :param strip_control: Does what it says on the tin
:param debug: Output more logging data
:param disable_caching: Does what it says on the tin
:param other:
:return:
""" """
if '_py2_no_more_posargs' in kwargs is not None: if '_py2_no_more_posargs' in kwargs is not None:
raise DeprecationWarning( raise DeprecationWarning(
@ -67,7 +65,7 @@ def extract_text_to_fp(inf, outfp,
imagewriter = None imagewriter = None
if output_dir: if output_dir:
imagewriter = ImageWriter(output_dir) imagewriter = ImageWriter(output_dir)
rsrcmgr = PDFResourceManager(caching=not disable_caching) rsrcmgr = PDFResourceManager(caching=not disable_caching)
if output_type == 'text': if output_type == 'text':
@ -96,7 +94,7 @@ def extract_text_to_fp(inf, outfp,
caching=not disable_caching, caching=not disable_caching,
check_extractable=True): check_extractable=True):
page.rotate = (page.rotate + rotation) % 360 page.rotate = (page.rotate + rotation) % 360
interpreter.process_page(page) interpreter.process_page(page)
device.close() device.close()


@ -1,17 +1,15 @@
import heapq import heapq
from .utils import INF from .utils import INF
from .utils import Plane from .utils import Plane
from .utils import get_bound
from .utils import uniq
from .utils import fsplit
from .utils import bbox2str
from .utils import matrix2str
from .utils import apply_matrix_pt from .utils import apply_matrix_pt
from .utils import bbox2str
from .utils import fsplit
from .utils import get_bound
from .utils import matrix2str
from .utils import uniq
import six # Python 2+3 compatibility
## IndexAssigner
##
class IndexAssigner(object): class IndexAssigner(object):
def __init__(self, index=0): def __init__(self, index=0):
@ -28,9 +26,33 @@ class IndexAssigner(object):
return return
## LAParams
##
class LAParams(object): class LAParams(object):
"""Parameters for layout analysis
:param line_overlap: If two characters have more overlap than this they
are considered to be on the same line. The overlap is specified
relative to the minimum height of both characters.
:param char_margin: If two characters are closer together than this
margin they are considered to be part of the same word. If
characters are on the same line but not part of the same word, an
intermediate space is inserted. The margin is specified relative to
the width of the character.
:param word_margin: If two words are closer together than this
margin they are considered to be part of the same line. A space is
added in between for readability. The margin is specified relative
to the width of the word.
:param line_margin: If two lines are close together they are
considered to be part of the same paragraph. The margin is
specified relative to the height of a line.
:param boxes_flow: Specifies how much a horizontal and vertical position
of a text matters when determining the order of lines. The value
should be within the range of -1.0 (only horizontal position
matters) to +1.0 (only vertical position matters).
:param detect_vertical: If vertical text should be considered during
layout analysis
:param all_texts: If layout analysis should be performed on text in
figures.
"""
def __init__(self, def __init__(self,
line_overlap=0.5, line_overlap=0.5,
@ -54,30 +76,28 @@ class LAParams(object):
(self.char_margin, self.line_margin, self.word_margin, self.all_texts)) (self.char_margin, self.line_margin, self.word_margin, self.all_texts))
## LTItem
##
class LTItem(object): class LTItem(object):
"""Interface for things that can be analyzed"""
def analyze(self, laparams): def analyze(self, laparams):
"""Perform the layout analysis.""" """Perform the layout analysis."""
return return
## LTText
##
class LTText(object): class LTText(object):
"""Interface for things that have text"""
def __repr__(self): def __repr__(self):
return ('<%s %r>' % return ('<%s %r>' %
(self.__class__.__name__, self.get_text())) (self.__class__.__name__, self.get_text()))
def get_text(self): def get_text(self):
"""Text contained in this object"""
raise NotImplementedError raise NotImplementedError
## LTComponent
##
class LTComponent(LTItem): class LTComponent(LTItem):
"""Object with a bounding box"""
def __init__(self, bbox): def __init__(self, bbox):
LTItem.__init__(self) LTItem.__init__(self)
@ -91,10 +111,13 @@ class LTComponent(LTItem):
# Disable comparison. # Disable comparison.
def __lt__(self, _): def __lt__(self, _):
raise ValueError raise ValueError
def __le__(self, _): def __le__(self, _):
raise ValueError raise ValueError
def __gt__(self, _): def __gt__(self, _):
raise ValueError raise ValueError
def __ge__(self, _): def __ge__(self, _):
raise ValueError raise ValueError
@ -149,9 +172,8 @@ class LTComponent(LTItem):
return 0 return 0
## LTCurve
##
class LTCurve(LTComponent): class LTCurve(LTComponent):
"""A generic Bezier curve"""
def __init__(self, linewidth, pts, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None): def __init__(self, linewidth, pts, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
LTComponent.__init__(self, get_bound(pts)) LTComponent.__init__(self, get_bound(pts))
@ -168,18 +190,22 @@ class LTCurve(LTComponent):
return ','.join('%.3f,%.3f' % p for p in self.pts) return ','.join('%.3f,%.3f' % p for p in self.pts)
## LTLine
##
class LTLine(LTCurve): class LTLine(LTCurve):
"""A single straight line.
Could be used for separating text or figures.
"""
def __init__(self, linewidth, p0, p1, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None): def __init__(self, linewidth, p0, p1, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
LTCurve.__init__(self, linewidth, [p0, p1], stroke, fill, evenodd, stroking_color, non_stroking_color) LTCurve.__init__(self, linewidth, [p0, p1], stroke, fill, evenodd, stroking_color, non_stroking_color)
return return
## LTRect
##
class LTRect(LTCurve): class LTRect(LTCurve):
"""A rectangle.
Could be used for framing other pictures or figures.
"""
def __init__(self, linewidth, bbox, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None): def __init__(self, linewidth, bbox, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
(x0, y0, x1, y1) = bbox (x0, y0, x1, y1) = bbox
@ -187,9 +213,11 @@ class LTRect(LTCurve):
return return
## LTImage
##
class LTImage(LTComponent): class LTImage(LTComponent):
"""An image object.
Embedded images can be in JPEG, Bitmap or JBIG2.
"""
def __init__(self, name, stream, bbox): def __init__(self, name, stream, bbox):
LTComponent.__init__(self, bbox) LTComponent.__init__(self, bbox)
@ -210,9 +238,13 @@ class LTImage(LTComponent):
bbox2str(self.bbox), self.srcsize)) bbox2str(self.bbox), self.srcsize))
## LTAnno
##
class LTAnno(LTItem, LTText): class LTAnno(LTItem, LTText):
"""Actual letter in the text as a Unicode string.
Note that, while a LTChar object has actual boundaries, LTAnno objects do
not, as these are "virtual" characters, inserted by a layout analyzer
according to the relationship between two characters (e.g. a space).
"""
def __init__(self, text): def __init__(self, text):
self._text = text self._text = text
@ -222,9 +254,8 @@ class LTAnno(LTItem, LTText):
return self._text return self._text
## LTChar
##
class LTChar(LTComponent, LTText): class LTChar(LTComponent, LTText):
"""Actual letter in the text as a Unicode string."""
def __init__(self, matrix, font, fontsize, scaling, rise, def __init__(self, matrix, font, fontsize, scaling, rise,
text, textwidth, textdisp, ncs, graphicstate): text, textwidth, textdisp, ncs, graphicstate):
@ -285,9 +316,8 @@ class LTChar(LTComponent, LTText):
return True return True
## LTContainer
##
class LTContainer(LTComponent): class LTContainer(LTComponent):
"""Object that can be extended and analyzed"""
def __init__(self, bbox): def __init__(self, bbox):
LTComponent.__init__(self, bbox) LTComponent.__init__(self, bbox)
@ -315,10 +345,7 @@ class LTContainer(LTComponent):
return return
## LTExpandableContainer
##
class LTExpandableContainer(LTContainer): class LTExpandableContainer(LTContainer):
def __init__(self): def __init__(self):
LTContainer.__init__(self, (+INF, +INF, -INF, -INF)) LTContainer.__init__(self, (+INF, +INF, -INF, -INF))
return return
@ -330,10 +357,7 @@ class LTExpandableContainer(LTContainer):
return return
## LTTextContainer
##
class LTTextContainer(LTExpandableContainer, LTText): class LTTextContainer(LTExpandableContainer, LTText):
def __init__(self): def __init__(self):
LTText.__init__(self) LTText.__init__(self)
LTExpandableContainer.__init__(self) LTExpandableContainer.__init__(self)
@ -343,9 +367,12 @@ class LTTextContainer(LTExpandableContainer, LTText):
return ''.join(obj.get_text() for obj in self if isinstance(obj, LTText)) return ''.join(obj.get_text() for obj in self if isinstance(obj, LTText))
## LTTextLine
##
class LTTextLine(LTTextContainer): class LTTextLine(LTTextContainer):
"""Contains a list of LTChar objects that represent a single text line.
The characters are aligned either horizontally or vertically, depending on
the text's writing mode.
"""
def __init__(self, word_margin): def __init__(self, word_margin):
LTTextContainer.__init__(self) LTTextContainer.__init__(self)
@ -367,7 +394,6 @@ class LTTextLine(LTTextContainer):
class LTTextLineHorizontal(LTTextLine): class LTTextLineHorizontal(LTTextLine):
def __init__(self, word_margin): def __init__(self, word_margin):
LTTextLine.__init__(self, word_margin) LTTextLine.__init__(self, word_margin)
self._x1 = +INF self._x1 = +INF
@ -393,7 +419,6 @@ class LTTextLineHorizontal(LTTextLine):
class LTTextLineVertical(LTTextLine): class LTTextLineVertical(LTTextLine):
def __init__(self, word_margin): def __init__(self, word_margin):
LTTextLine.__init__(self, word_margin) LTTextLine.__init__(self, word_margin)
self._y0 = -INF self._y0 = -INF
@ -418,12 +443,13 @@ class LTTextLineVertical(LTTextLine):
abs(obj.y1-self.y1) < d))] abs(obj.y1-self.y1) < d))]
## LTTextBox
##
## A set of text objects that are grouped within
## a certain rectangular area.
##
class LTTextBox(LTTextContainer): class LTTextBox(LTTextContainer):
"""Represents a group of text chunks in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text. It contains a list of
LTTextLine objects.
"""
def __init__(self): def __init__(self):
LTTextContainer.__init__(self) LTTextContainer.__init__(self)
@ -437,7 +463,6 @@ class LTTextBox(LTTextContainer):
class LTTextBoxHorizontal(LTTextBox): class LTTextBoxHorizontal(LTTextBox):
def analyze(self, laparams): def analyze(self, laparams):
LTTextBox.analyze(self, laparams) LTTextBox.analyze(self, laparams)
self._objs.sort(key=lambda obj: -obj.y1) self._objs.sort(key=lambda obj: -obj.y1)
@ -448,7 +473,6 @@ class LTTextBoxHorizontal(LTTextBox):
class LTTextBoxVertical(LTTextBox): class LTTextBoxVertical(LTTextBox):
def analyze(self, laparams): def analyze(self, laparams):
LTTextBox.analyze(self, laparams) LTTextBox.analyze(self, laparams)
self._objs.sort(key=lambda obj: -obj.x1) self._objs.sort(key=lambda obj: -obj.x1)
@ -458,10 +482,7 @@ class LTTextBoxVertical(LTTextBox):
return 'tb-rl' return 'tb-rl'
## LTTextGroup
##
class LTTextGroup(LTTextContainer): class LTTextGroup(LTTextContainer):
def __init__(self, objs): def __init__(self, objs):
LTTextContainer.__init__(self) LTTextContainer.__init__(self)
self.extend(objs) self.extend(objs)
@ -469,7 +490,6 @@ class LTTextGroup(LTTextContainer):
class LTTextGroupLRTB(LTTextGroup): class LTTextGroupLRTB(LTTextGroup):
def analyze(self, laparams): def analyze(self, laparams):
LTTextGroup.analyze(self, laparams) LTTextGroup.analyze(self, laparams)
# reorder the objects from top-left to bottom-right. # reorder the objects from top-left to bottom-right.
@ -480,7 +500,6 @@ class LTTextGroupLRTB(LTTextGroup):
class LTTextGroupTBRL(LTTextGroup): class LTTextGroupTBRL(LTTextGroup):
def analyze(self, laparams): def analyze(self, laparams):
LTTextGroup.analyze(self, laparams) LTTextGroup.analyze(self, laparams)
# reorder the objects from top-right to bottom-left. # reorder the objects from top-right to bottom-left.
@ -490,10 +509,7 @@ class LTTextGroupTBRL(LTTextGroup):
return return
## LTLayoutContainer
##
class LTLayoutContainer(LTContainer): class LTLayoutContainer(LTContainer):
def __init__(self, bbox): def __init__(self, bbox):
LTContainer.__init__(self, bbox) LTContainer.__init__(self, bbox)
self.groups = None self.groups = None
@ -709,9 +725,13 @@ class LTLayoutContainer(LTContainer):
return return
## LTFigure
##
class LTFigure(LTLayoutContainer): class LTFigure(LTLayoutContainer):
"""Represents an area used by PDF Form objects.
PDF Forms can be used to present figures or pictures by embedding yet
another PDF document within a page. Note that LTFigure objects can appear
recursively.
"""
def __init__(self, name, bbox, matrix): def __init__(self, name, bbox, matrix):
self.name = name self.name = name
@ -734,9 +754,12 @@ class LTFigure(LTLayoutContainer):
return return
## LTPage
##
class LTPage(LTLayoutContainer): class LTPage(LTLayoutContainer):
"""Represents an entire page.
May contain child objects like LTTextBox, LTFigure, LTImage, LTRect,
LTCurve and LTLine.
"""
def __init__(self, pageid, bbox, rotate=0): def __init__(self, pageid, bbox, rotate=0):
LTLayoutContainer.__init__(self, bbox) LTLayoutContainer.__init__(self, bbox)


@ -2,13 +2,13 @@
import six import six
from . import utils
from .pdffont import PDFUnicodeNotDefined from .pdffont import PDFUnicodeNotDefined
from . import utils
## PDFDevice
##
class PDFDevice(object): class PDFDevice(object):
"""Translate the output of PDFPageInterpreter to the output that is needed
"""
def __init__(self, rsrcmgr): def __init__(self, rsrcmgr):
self.rsrcmgr = rsrcmgr self.rsrcmgr = rsrcmgr


@ -318,9 +318,8 @@ class PDFContentParser(PSStackParser):
return return
## Interpreter
##
class PDFPageInterpreter(object): class PDFPageInterpreter(object):
"""Processor for the content of a PDF page"""
def __init__(self, rsrcmgr, device): def __init__(self, rsrcmgr, device):
self.rsrcmgr = rsrcmgr self.rsrcmgr = rsrcmgr


@ -13,7 +13,10 @@ setup(
'six', 'six',
'sortedcontainers', 'sortedcontainers',
], ],
extras_require={"dev": ["nose", "tox"]}, extras_require={
"dev": ["nose", "tox"],
"docs": ["sphinx", "sphinx-argparse"],
},
description='PDF parser and analyzer', description='PDF parser and analyzer',
long_description=package.__doc__, long_description=package.__doc__,
license='MIT/X', license='MIT/X',


@ -240,51 +240,51 @@ def create_parser():
help='One or more paths to PDF files.') help='One or more paths to PDF files.')
parser.add_argument( parser.add_argument(
'-d', '--debug', default=False, action='store_true', '--debug', '-d', default=False, action='store_true',
help='Use debug logging level.') help='Use debug logging level.')
procedure_parser = parser.add_mutually_exclusive_group() procedure_parser = parser.add_mutually_exclusive_group()
procedure_parser.add_argument( procedure_parser.add_argument(
'-T', '--extract-toc', default=False, action='store_true', '--extract-toc', '-T', default=False, action='store_true',
help='Extract structure of outline') help='Extract structure of outline')
procedure_parser.add_argument( procedure_parser.add_argument(
'-E', '--extract-embedded', type=str, '--extract-embedded', '-E', type=str,
help='Extract embedded files') help='Extract embedded files')
parse_params = parser.add_argument_group( parse_params = parser.add_argument_group(
'Parser', description='Used during PDF parsing') 'Parser', description='Used during PDF parsing')
parse_params.add_argument( parse_params.add_argument(
"--page-numbers", type=int, default=None, nargs="+", '--page-numbers', type=int, default=None, nargs='+',
help="A space-seperated list of page numbers to parse.") help='A space-seperated list of page numbers to parse.')
parse_params.add_argument( parse_params.add_argument(
"-p", "--pagenos", type=str, '--pagenos', '-p', type=str,
help="A comma-separated list of page numbers to parse. Included for " help='A comma-separated list of page numbers to parse. Included for '
"legacy applications, use --page-numbers for more idiomatic " 'legacy applications, use --page-numbers for more idiomatic '
"argument entry.") 'argument entry.')
parse_params.add_argument( parse_params.add_argument(
'-i', '--objects', type=str, '--objects', '-i', type=str,
help='Comma separated list of object numbers to extract') help='Comma separated list of object numbers to extract')
parse_params.add_argument( parse_params.add_argument(
'-a', '--all', default=False, action='store_true', '--all', '-a', default=False, action='store_true',
help='If the structure of all objects should be extracted') help='If the structure of all objects should be extracted')
parse_params.add_argument( parse_params.add_argument(
'-P', '--password', type=str, default='', '--password', '-P', type=str, default='',
help='The password to use for decrypting PDF file.') help='The password to use for decrypting PDF file.')
output_params = parser.add_argument_group( output_params = parser.add_argument_group(
'Output', description='Used during output generation.') 'Output', description='Used during output generation.')
output_params.add_argument( output_params.add_argument(
'-o', '--outfile', type=str, default='-', '--outfile', '-o', type=str, default='-',
help='Path to file where output is written. Or "-" (default) to ' help='Path to file where output is written. Or "-" (default) to '
'write to stdout.') 'write to stdout.')
codec_parser = output_params.add_mutually_exclusive_group() codec_parser = output_params.add_mutually_exclusive_group()
codec_parser.add_argument( codec_parser.add_argument(
'-r', '--raw-stream', default=False, action='store_true', '--raw-stream', '-r', default=False, action='store_true',
help='Write stream objects without encoding') help='Write stream objects without encoding')
codec_parser.add_argument( codec_parser.add_argument(
'-b', '--binary-stream', default=False, action='store_true', '--binary-stream', '-b', default=False, action='store_true',
help='Write stream objects with binary encoding') help='Write stream objects with binary encoding')
codec_parser.add_argument( codec_parser.add_argument(
'-t', '--text-stream', default=False, action='store_true', '--text-stream', '-t', default=False, action='store_true',
help='Write stream objects as plain text') help='Write stream objects as plain text')
return parser return parser


@ -1,15 +1,9 @@
#!/usr/bin/env python """A command line tool for extracting text and images from PDF and output it to plain text, html, xml or tags."""
"""
Converts PDF text content (though not images containing text) to plain text, html, xml or "tags".
"""
import argparse import argparse
import logging import logging
import six
import sys import sys
import six
import pdfminer.settings
pdfminer.settings.STRICT = False
import pdfminer.high_level import pdfminer.high_level
import pdfminer.layout import pdfminer.layout
from pdfminer.image import ImageWriter from pdfminer.image import ImageWriter
@ -73,28 +67,68 @@ def extract_text(files=[], outfile='-',
def maketheparser(): def maketheparser():
parser = argparse.ArgumentParser(description=__doc__, add_help=True) parser = argparse.ArgumentParser(description=__doc__, add_help=True)
parser.add_argument("files", type=str, default=None, nargs="+", help="File to process.") parser.add_argument("files", type=str, default=None, nargs="+", help="One or more paths to PDF files.")
parser.add_argument("-d", "--debug", default=False, action="store_true", help="Debug output.")
parser.add_argument("-p", "--pagenos", type=str, help="Comma-separated list of page numbers to parse. Included for legacy applications, use --page-numbers for more idiomatic argument entry.") parser.add_argument("--debug", "-d", default=False, action="store_true",
parser.add_argument("--page-numbers", type=int, default=None, nargs="+", help="Alternative to --pagenos with space-separated numbers; supercedes --pagenos where it is used.") help="Use debug logging level.")
parser.add_argument("-m", "--maxpages", type=int, default=0, help="Maximum pages to parse") parser.add_argument("--disable-caching", "-C", default=False, action="store_true",
parser.add_argument("-P", "--password", type=str, default="", help="Decryption password for PDF") help="If caching or resources, such as fonts, should be disabled.")
parser.add_argument("-o", "--outfile", type=str, default="-", help="Output file (default \"-\" is stdout)")
parser.add_argument("-t", "--output_type", type=str, default="text", help="Output type: text|html|xml|tag (default is text)") parse_params = parser.add_argument_group('Parser', description='Used during PDF parsing')
parser.add_argument("-c", "--codec", type=str, default="utf-8", help="Text encoding") parse_params.add_argument("--page-numbers", type=int, default=None, nargs="+",
parser.add_argument("-s", "--scale", type=float, default=1.0, help="Scale") help="A space-seperated list of page numbers to parse.")
parser.add_argument("-A", "--all-texts", default=None, action="store_true", help="LAParams all texts") parse_params.add_argument("--pagenos", "-p", type=str,
parser.add_argument("-V", "--detect-vertical", default=None, action="store_true", help="LAParams detect vertical") help="A comma-separated list of page numbers to parse. Included for legacy applications, "
parser.add_argument("-W", "--word-margin", type=float, default=None, help="LAParams word margin") "use --page-numbers for more idiomatic argument entry.")
parser.add_argument("-M", "--char-margin", type=float, default=None, help="LAParams char margin") parse_params.add_argument("--maxpages", "-m", type=int, default=0,
parser.add_argument("-L", "--line-margin", type=float, default=None, help="LAParams line margin") help="The maximum number of pages to parse.")
parser.add_argument("-F", "--boxes-flow", type=float, default=None, help="LAParams boxes flow") parse_params.add_argument("--password", "-P", type=str, default="",
parser.add_argument("-Y", "--layoutmode", default="normal", type=str, help="HTML Layout Mode") help="The password to use for decrypting PDF file.")
parser.add_argument("-n", "--no-laparams", default=False, action="store_true", help="Pass None as LAParams") parse_params.add_argument("--rotation", "-R", default=0, type=int,
parser.add_argument("-R", "--rotation", default=0, type=int, help="Rotation") help="The number of degrees to rotate the PDF before other types of processing.")
parser.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
parser.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching") la_params = parser.add_argument_group('Layout analysis', description='Used during layout analysis.')
parser.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode") la_params.add_argument("--no-laparams", "-n", default=False, action="store_true",
help="If layout analysis parameters should be ignored.")
la_params.add_argument("--detect-vertical", "-V", default=False, action="store_true",
help="If vertical text should be considered during layout analysis")
la_params.add_argument("--char-margin", "-M", type=float, default=2.0,
help="If two characters are closer together than this margin they are considered to be part "
"of the same word. The margin is specified relative to the width of the character.")
la_params.add_argument("--word-margin", "-W", type=float, default=0.1,
help="If two words are are closer together than this margin they are considered to be part "
"of the same line. A space is added in between for readability. The margin is "
"specified relative to the width of the word.")
la_params.add_argument("--line-margin", "-L", type=float, default=0.5,
help="If two lines are are close together they are considered to be part of the same "
"paragraph. The margin is specified relative to the height of a line.")
la_params.add_argument("--boxes-flow", "-F", type=float, default=0.5,
help="Specifies how much a horizontal and vertical position of a text matters when "
"determining the order of lines. The value should be within the range of -1.0 (only "
"horizontal position matters) to +1.0 (only vertical position matters).")
la_params.add_argument("--all-texts", "-A", default=True, action="store_true",
help="If layout analysis should be performed on text in figures.")
output_params = parser.add_argument_group('Output', description='Used during output generation.')
output_params.add_argument("--outfile", "-o", type=str, default="-",
help="Path to file where output is written. Or \"-\" (default) to write to stdout.")
output_params.add_argument("--output_type", "-t", type=str, default="text",
help="Type of output to generate {text,html,xml,tag}.")
output_params.add_argument("--codec", "-c", type=str, default="utf-8",
help="Text encoding to use in output file.")
output_params.add_argument("--output-dir", "-O", default=None,
help="The output directory to put extracted images in. If not given, images are not "
"extracted.")
output_params.add_argument("--layoutmode", "-Y", default="normal", type=str,
help="Type of layout to use when generating html {normal,exact,loose}. If normal, "
"each line is positioned separately in the html. If exact, each character is "
"positioned separately in the html. If loose, same result as normal but with an "
"additional newline after each text line. Only used when output_type is html.")
output_params.add_argument("--scale", "-s", type=float, default=1.0,
help="The amount of zoom to use when generating html file. Only used when output_type "
"is html.")
output_params.add_argument("--strip-control", "-S", default=False, action="store_true",
help="Remove control statement from text. Only used when output_type is xml.")
return parser return parser

tox.ini

@ -1,6 +1,11 @@
[tox] [tox]
envlist = py{26, 27, 34, 35, 36} envlist = py{27,34,35,36,37,38}
[testenv] [testenv]
extras = dev extras =
commands = nosetests --nologcapture dev
docs
commands =
nosetests --nologcapture
python -m sphinx -b html docs/source docs/build/html
python -m sphinx -b doctest docs/source docs/build/doctest