Create sphinx documentation for Read the Docs (#329)
Fixes #171, fixes #199, fixes #118, fixes #178.

Added: tests for building documentation and example code in documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
parent 40aa2533c9
commit bc034c8e59
@@ -9,4 +9,4 @@ python:
 install:
   - pip install tox-travis
 script:
-  - tox
+  - tox -r
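The CI change above swaps `tox` for `tox -r`. The flag is short for `--recreate`; it forces tox to rebuild its virtualenvs on every run so that stale cached dependencies cannot mask a broken setup. A sketch of the equivalent local invocation:

```shell
# Run the test suite, recreating the tox virtualenvs from scratch first.
# -r / --recreate discards any cached environment before reinstalling deps.
tox -r
```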
@@ -13,6 +13,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 ### Added
 - Simple wrapper to easily extract text from a PDF file [#330](https://github.com/pdfminer/pdfminer.six/pull/330)
 - Support for extracting JBIG2 encoded images ([#311](https://github.com/pdfminer/pdfminer.six/pull/311) and [#46](https://github.com/pdfminer/pdfminer.six/pull/46))
+- Sphinx documentation that is published on
+  [Read the Docs](https://pdfminersix.readthedocs.io/)
+  ([#329](https://github.com/pdfminer/pdfminer.six/pull/329))

 ### Fixed
 - Unhandled AssertionError when dumping pdf containing reference to object id 0
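The "simple wrapper" changelog entry above refers to a one-call text-extraction helper. A minimal sketch of how such a wrapper is typically used, assuming the `pdfminer.high_level.extract_text` name introduced by #330 (the import is deferred so the sketch can be defined even where pdfminer.six is not installed):

```python
def extract_pdf_text(path, password=""):
    """Return all text in a PDF file as a single string.

    Illustrative thin wrapper: `pdfminer.high_level.extract_text` is the
    assumed name of the helper added in #330. The import is deferred so
    this sketch loads without pdfminer.six being installed.
    """
    from pdfminer.high_level import extract_text
    return extract_text(path, password=password)


# Typical usage (requires pdfminer.six and a sample file):
#   text = extract_pdf_text("samples/simple1.pdf")
```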
README.md (66 changed lines)
@@ -1,21 +1,22 @@
-PDFMiner.six
+pdfminer.six
 ============

-PDFMiner.six is a fork of PDFMiner using six for Python 2+3 compatibility
-[![Build Status](https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master)](https://travis-ci.org/pdfminer/pdfminer.six)
-[![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)
-[![gitter](https://badges.gitter.im/pdfminer-six/Lobby.svg)](https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium)
+[![Build Status](https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master)](https://travis-ci.org/pdfminer/pdfminer.six) [![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)

-PDFMiner is a tool for extracting information from PDF documents.
+Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a
+tool for extracting information from PDF documents.
 Unlike other PDF-related tools, it focuses entirely on getting
-and analyzing text data. PDFMiner allows one to obtain
+and analyzing text data. Pdfminer.six allows one to obtain
 the exact location of text in a page, as well as
 other information such as fonts or lines.
 It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
 PDF parser that can be used for other purposes than text analysis.

-* Webpage: https://github.com/pdfminer/
-* Download (PyPI): https://pypi.python.org/pypi/pdfminer.six/
+Check out the full documentation on
+[Read the Docs](https://pdfminersix.readthedocs.io).


 Features
@@ -33,53 +34,20 @@ Features
 * Automatic layout analysis.


-How to Install
---------------
+How to use
+----------

-* Install Python 2.7 or newer.
-* Install
+* Install Python 2.7 or newer. Note that Python 2 support is dropped in
+  January 2020.

   `pip install pdfminer.six`

-* Run the following test:
+* Use the command-line interface to extract text from a pdf:

-  `pdf2txt.py samples/simple1.pdf`
+  `python pdf2txt.py samples/simple1.pdf`


-Command Line Tools
-------------------
-
-PDFMiner comes with two handy tools:
-pdf2txt.py and dumppdf.py.
-
-**pdf2txt.py**
-
-pdf2txt.py extracts text contents from a PDF file.
-It extracts all the text that are to be rendered programmatically,
-i.e. text represented as ASCII or Unicode strings.
-It cannot recognize text drawn as images that would require optical character recognition.
-It also extracts the corresponding locations, font names, font sizes, writing
-direction (horizontal or vertical) for each text portion.
-You need to provide a password for protected PDF documents when its access is restricted.
-You cannot extract any text from a PDF document which does not have extraction permission.
-
-(For details, refer to /docs/index.html.)
-
-**dumppdf.py**
-
-dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format.
-This program is primarily for debugging purposes,
-but it's also possible to extract some meaningful contents (e.g. images).
-
-(For details, refer to /docs/index.html.)
-
-
-TODO
-----
-
-* PEP-8 and PEP-257 conformance.
-* Better documentation.
-* Performance improvements.
+* Check out more examples and documentation on
+  [Read the Docs](https://pdfminersix.readthedocs.io).


 Contributing
@@ -0,0 +1 @@
+build/
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docs/cid.obj (225 deleted lines)

@@ -1,225 +0,0 @@
|
|||
%TGIF 4.1.45-QPL
|
||||
state(0,37,100.000,0,0,0,16,1,9,1,1,2,0,1,0,1,1,'NewCenturySchlbk-Bold',1,103680,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
|
||||
%
|
||||
% @(#)$Header$
|
||||
% %W%
|
||||
%
|
||||
unit("1 pixel/pixel").
|
||||
color_info(19,65535,0,[
|
||||
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
|
||||
"red", 65535, 0, 0, 65535, 0, 0, 1,
|
||||
"green", 0, 65535, 0, 0, 65535, 0, 1,
|
||||
"blue", 0, 0, 65535, 0, 0, 65535, 1,
|
||||
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
|
||||
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
|
||||
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
|
||||
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
|
||||
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
|
||||
"black", 0, 0, 0, 0, 0, 0, 1,
|
||||
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
|
||||
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
|
||||
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
|
||||
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
|
||||
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
|
||||
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
|
||||
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
|
||||
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
|
||||
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
|
||||
]).
|
||||
script_frac("0.6").
|
||||
fg_bg_colors('black','white').
|
||||
dont_reencode("FFDingbests:ZapfDingbats").
|
||||
objshadow_info('#c0c0c0',2,2).
|
||||
page(1,"",1,'').
|
||||
text('black',90,95,1,1,1,66,20,0,15,5,0,0,0,0,2,66,20,0,0,"",0,0,0,0,110,'',[
|
||||
minilines(66,20,0,0,1,0,0,[
|
||||
mini_line(66,15,5,0,0,0,[
|
||||
str_block(0,66,15,5,0,-1,0,0,0,[
|
||||
str_seg('black','Courier-Bold',1,103680,66,15,5,0,-1,0,0,0,0,0,
|
||||
"U+30FC")])
|
||||
])
|
||||
])]).
|
||||
text('black',100,285,1,1,1,66,20,3,15,5,0,0,0,0,2,66,20,0,0,"",0,0,0,0,300,'',[
|
||||
minilines(66,20,0,0,1,0,0,[
|
||||
mini_line(66,15,5,0,0,0,[
|
||||
str_block(0,66,15,5,0,-2,0,0,0,[
|
||||
str_seg('black','Courier-Bold',1,103680,66,15,5,0,-2,0,0,0,0,0,
|
||||
"U+5199")])
|
||||
])
|
||||
])]).
|
||||
text('black',400,38,2,1,1,119,30,5,12,3,0,0,0,0,2,119,30,0,0,"",0,0,0,0,50,'',[
|
||||
minilines(119,30,0,0,1,0,0,[
|
||||
mini_line(83,12,3,0,0,0,[
|
||||
str_block(0,83,12,3,0,-3,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
|
||||
"Adobe-Japan1")])
|
||||
]),
|
||||
mini_line(119,12,3,0,0,0,[
|
||||
str_block(0,119,12,3,0,-1,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,119,12,3,0,-1,0,0,0,0,0,
|
||||
"CID:660 (horizontal)")])
|
||||
])
|
||||
])]).
|
||||
text('black',400,118,2,1,1,114,30,8,12,3,0,0,0,0,2,114,30,0,0,"",0,0,0,0,130,'',[
|
||||
minilines(114,30,0,0,1,0,0,[
|
||||
mini_line(83,12,3,0,0,0,[
|
||||
str_block(0,83,12,3,0,-3,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
|
||||
"Adobe-Japan1")])
|
||||
]),
|
||||
mini_line(114,12,3,0,0,0,[
|
||||
str_block(0,114,12,3,0,-1,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,114,12,3,0,-1,0,0,0,0,0,
|
||||
"CID:7891 (vertical)")])
|
||||
])
|
||||
])]).
|
||||
text('black',400,238,2,1,1,125,30,15,12,3,0,0,0,0,2,125,30,0,0,"",0,0,0,0,250,'',[
|
||||
minilines(125,30,0,0,1,0,0,[
|
||||
mini_line(83,12,3,0,0,0,[
|
||||
str_block(0,83,12,3,0,-3,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
|
||||
"Adobe-Japan1")])
|
||||
]),
|
||||
mini_line(125,12,3,0,0,0,[
|
||||
str_block(0,125,12,3,0,-1,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,125,12,3,0,-1,0,0,0,0,0,
|
||||
"CID:2296 (Japanese)")])
|
||||
])
|
||||
])]).
|
||||
text('black',400,318,2,1,1,115,30,16,12,3,0,0,0,0,2,115,30,0,0,"",0,0,0,0,330,'',[
|
||||
minilines(115,30,0,0,1,0,0,[
|
||||
mini_line(67,12,3,0,0,0,[
|
||||
str_block(0,67,12,3,0,-3,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,67,12,3,0,-3,0,0,0,0,0,
|
||||
"Adobe-GB1")])
|
||||
]),
|
||||
mini_line(115,12,3,0,0,0,[
|
||||
str_block(0,115,12,3,0,-1,0,0,0,[
|
||||
str_seg('black','Helvetica-Bold',1,69120,115,12,3,0,-1,0,0,0,0,0,
|
||||
"CID:3967 (Chinese)")])
|
||||
])
|
||||
])]).
|
||||
text('black',200,84,2,1,1,116,38,20,16,3,0,0,0,0,2,116,38,0,0,"",0,0,0,0,100,'',[
|
||||
minilines(116,38,0,0,1,0,0,[
|
||||
mini_line(70,16,3,0,0,0,[
|
||||
str_block(0,70,16,3,0,-1,0,0,0,[
|
||||
str_seg('black','NewCenturySchlbk-Roman',0,97920,70,16,3,0,-1,0,0,0,0,0,
|
||||
"Japanese")])
|
||||
]),
|
||||
mini_line(116,16,3,0,0,0,[
|
||||
str_block(0,116,16,3,0,-1,0,0,0,[
|
||||
str_seg('black','NewCenturySchlbk-Roman',0,97920,116,16,3,0,-1,0,0,0,0,0,
|
||||
"long-vowel sign")])
|
||||
])
|
||||
])]).
|
||||
oval('black','',30,70,280,140,0,1,1,49,0,0,0,0,0,'1',0,[
|
||||
]).
|
||||
oval('black','',30,260,280,330,0,1,1,51,0,0,0,0,0,'1',0,[
|
||||
]).
|
||||
text('black',200,274,2,1,1,85,38,53,16,3,0,0,0,0,2,85,38,0,0,"",0,0,0,0,290,'',[
|
||||
minilines(85,38,0,0,1,0,0,[
|
||||
mini_line(61,16,3,0,0,0,[
|
||||
str_block(0,61,16,3,0,-1,0,0,0,[
|
||||
str_seg('black','NewCenturySchlbk-Roman',0,97920,61,16,3,0,-1,0,0,0,0,0,
|
||||
"Chinese")])
|
||||
]),
|
||||
mini_line(85,16,3,0,0,0,[
|
||||
str_block(0,85,16,3,0,-1,0,0,0,[
|
||||
str_seg('black','NewCenturySchlbk-Roman',0,97920,85,16,3,0,-1,0,0,0,0,0,
|
||||
"letter \"sha\"")])
|
||||
])
|
||||
])]).
|
||||
box('black','',330,30,560,80,0,1,1,57,0,0,0,0,0,'1',0,[
|
||||
]).
|
||||
box('black','',330,110,560,160,0,1,1,59,0,0,0,0,0,'1',0,[
|
||||
]).
|
||||
box('black','',330,230,560,280,0,1,1,60,0,0,0,0,0,'1',0,[
|
||||
]).
|
||||
box('black','',330,310,560,360,0,1,1,61,0,0,0,0,0,'1',0,[
|
||||
]).
|
||||
group([
|
||||
poly('black','',4,[
|
||||
506,246,501,235,541,235,536,246],0,2,1,68,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
]),
|
||||
poly('black','',5,[
|
||||
519,238,516,252,529,252,524,275,516,272],0,2,1,69,0,0,0,0,0,0,0,'2',0,0,
|
||||
"00","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
]),
|
||||
poly('black','',2,[
|
||||
501,261,541,261],0,2,1,70,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
]),
|
||||
poly('black','',2,[
|
||||
519,244,529,244],0,2,1,71,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
])
|
||||
],
|
||||
76,0,0,[
|
||||
]).
|
||||
group([
|
||||
poly('black','',3,[
|
||||
519,119,524,127,524,152],0,2,1,67,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
])
|
||||
],
|
||||
78,0,0,[
|
||||
]).
|
||||
group([
|
||||
poly('black','',3,[
|
||||
540,57,509,57,501,49],0,2,1,66,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
])
|
||||
],
|
||||
80,0,0,[
|
||||
]).
|
||||
group([
|
||||
poly('black','',4,[
|
||||
506,326,501,315,541,315,536,326],0,2,1,90,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
]),
|
||||
poly('black','',5,[
|
||||
519,318,515,332,531,332,526,355,519,352],0,2,1,89,0,0,0,0,0,0,0,'2',0,0,
|
||||
"00","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
]),
|
||||
poly('black','',2,[
|
||||
501,341,526,341],0,2,1,88,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
]),
|
||||
poly('black','',2,[
|
||||
519,324,529,324],0,2,1,87,0,0,0,0,0,0,0,'2',0,0,
|
||||
"0","",[
|
||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||
])
|
||||
],
|
||||
134,0,0,[
|
||||
]).
|
||||
poly('black','',2,[
|
||||
270,90,320,70],1,3,1,158,0,0,0,0,0,0,0,'3',0,0,
|
||||
"0","",[
|
||||
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
|
||||
]).
|
||||
poly('black','',2,[
|
||||
280,110,320,130],1,3,1,159,0,0,0,0,0,0,0,'3',0,0,
|
||||
"0","",[
|
||||
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
|
||||
]).
|
||||
poly('black','',2,[
|
||||
270,280,310,250],1,3,1,160,0,0,0,0,0,0,0,'3',0,0,
|
||||
"0","",[
|
||||
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
|
||||
]).
|
||||
poly('black','',2,[
|
||||
270,300,310,330],1,3,1,161,0,0,0,0,0,0,0,'3',0,0,
|
||||
"0","",[
|
||||
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
|
||||
]).
|
docs/cid.png (binary file not shown; before: 2.6 KiB)
docs/index.html (427 deleted lines)

@@ -1,427 +0,0 @@
|
|||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
||||
<html>
|
||||
<head>
|
||||
<link rel="stylesheet" type="text/css" href="style.css">
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
||||
<title>PDFMiner</title>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<div align=right class=lastmod>
|
||||
<!-- hhmts start -->
|
||||
Last Modified: Wed Jun 25 10:27:52 UTC 2014
|
||||
<!-- hhmts end -->
|
||||
</div>
|
||||
|
||||
<h1>PDFMiner</h1>
|
||||
<p>
|
||||
Python PDF parser and analyzer
|
||||
|
||||
<p>
|
||||
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
|
||||
|
||||
<a href="#changes">Recent Changes</a>
|
||||
|
||||
<a href="programming.html">PDFMiner API</a>
|
||||
|
||||
<ul>
|
||||
<li> <a href="#intro">What's It?</a>
|
||||
<li> <a href="#download">Download</a>
|
||||
<li> <a href="#wheretoask">Where to Ask</a>
|
||||
<li> <a href="#install">How to Install</a>
|
||||
<ul>
|
||||
<li> <a href="#cmap">CJK languages support</a>
|
||||
</ul>
|
||||
<li> <a href="#tools">Command Line Tools</a>
|
||||
<ul>
|
||||
<li> <a href="#pdf2txt">pdf2txt.py</a>
|
||||
<li> <a href="#dumppdf">dumppdf.py</a>
|
||||
<li> <a href="programming.html">PDFMiner API</a>
|
||||
</ul>
|
||||
<li> <a href="#changes">Changes</a>
|
||||
<li> <a href="#todo">TODO</a>
|
||||
<li> <a href="#related">Related Projects</a>
|
||||
<li> <a href="#license">Terms and Conditions</a>
|
||||
</ul>
|
||||
|
||||
<h2><a name="intro">What's It?</a></h2>
|
||||
<p>
|
||||
PDFMiner is a tool for extracting information from PDF documents.
|
||||
Unlike other PDF-related tools, it focuses entirely on getting
|
||||
and analyzing text data. PDFMiner allows one to obtain
|
||||
the exact location of text in a page, as well as
|
||||
other information such as fonts or lines.
|
||||
It includes a PDF converter that can transform PDF files
|
||||
into other text formats (such as HTML). It has an extensible
|
||||
PDF parser that can be used for other purposes than text analysis.
|
||||
|
||||
<p>
|
||||
<h3>Features</h3>
|
||||
<ul>
|
||||
<li> Written entirely in Python. (for version 2.6 or newer)
|
||||
<li> Parse, analyze, and convert PDF documents.
|
||||
<li> PDF-1.7 specification support. (well, almost)
|
||||
<li> CJK languages and vertical writing scripts support.
|
||||
<li> Various font types (Type1, TrueType, Type3, and CID) support.
|
||||
<li> Basic encryption (RC4) support.
|
||||
<li> PDF to HTML conversion.
|
||||
<li> Outline (TOC) extraction.
|
||||
<li> Tagged contents extraction.
|
||||
<li> Reconstruct the original layout by grouping text chunks.
|
||||
</ul>
|
||||
<p>
|
||||
PDFMiner is about 20 times slower than
|
||||
other C/C++-based counterparts such as XPdf.
|
||||
|
||||
<P>
|
||||
<strong>Online Demo:</strong> (pdf -> html conversion webapp)<br>
|
||||
<a href="http://pdf2html.tabesugi.net:8080/">
|
||||
http://pdf2html.tabesugi.net:8080/
|
||||
</a>
|
||||
|
||||
<h3><a name="download">Download</a></h3>
|
||||
<p>
|
||||
<strong>Source distribution:</strong><br>
|
||||
<a href="http://pypi.python.org/pypi/pdfminer_six/">
|
||||
http://pypi.python.org/pypi/pdfminer_six/
|
||||
</a>
|
||||
|
||||
<P>
|
||||
<strong>github:</strong><br>
|
||||
<a href="https://github.com/goulu/pdfminer/">
|
||||
https://github.com/goulu/pdfminer/
|
||||
</a>
|
||||
|
||||
<h3><a name="wheretoask">Where to Ask</a></h3>
|
||||
<p>
|
||||
<p>
|
||||
<strong>Questions and comments:</strong><br>
|
||||
<a href="http://groups.google.com/group/pdfminer-users/">
|
||||
http://groups.google.com/group/pdfminer-users/
|
||||
</a>
|
||||
|
||||
<h2><a name="install">How to Install</a></h2>
|
||||
<ol>
|
||||
<li> Install <a href="http://www.python.org/download/">Python</a> 2.6 or newer.
|
||||
<li> Download the <a href="#source">PDFMiner source</a>.
|
||||
<li> Unpack it.
|
||||
<li> Run <code>setup.py</code> to install:<br>
|
||||
<blockquote><pre>
|
||||
# <strong>python setup.py install</strong>
|
||||
</pre></blockquote>
|
||||
<li> Do the following test:<br>
|
||||
<blockquote><pre>
|
||||
$ <strong>pdf2txt.py samples/simple1.pdf</strong>
|
||||
Hello
|
||||
|
||||
World
|
||||
|
||||
Hello
|
||||
|
||||
World
|
||||
|
||||
H e l l o
|
||||
|
||||
W o r l d
|
||||
|
||||
H e l l o
|
||||
|
||||
W o r l d
|
||||
</pre></blockquote>
|
||||
<li> Done!
|
||||
</ol>
|
||||
|
||||
<h3><a name="cmap">For CJK languages</a></h3>
|
||||
<p>
|
||||
In order to process CJK languages, you need an additional step to take
|
||||
during installation:
|
||||
<blockquote><pre>
|
||||
# <strong>make cmap</strong>
|
||||
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
|
||||
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
|
||||
writing 'CNS1_H.py'...
|
||||
...
|
||||
<em>(this may take several minutes)</em>
|
||||
|
||||
# <strong>python setup.py install</strong>
|
||||
</pre></blockquote>
|
||||
|
||||
<p>
|
||||
On Windows machines which don't have <code>make</code> command,
|
||||
paste the following commands on a command line prompt:
|
||||
<blockquote><pre>
|
||||
<strong>mkdir pdfminer\cmap</strong>
|
||||
<strong>python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt</strong>
|
||||
<strong>python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt</strong>
|
||||
<strong>python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt</strong>
|
||||
<strong>python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt</strong>
|
||||
<strong>python setup.py install</strong>
|
||||
</pre></blockquote>
|
||||
|
||||
<h2><a name="tools">Command Line Tools</a></h2>
|
||||
<p>
|
||||
PDFMiner comes with two handy tools:
|
||||
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.
|
||||
|
||||
<h3><a name="pdf2txt">pdf2txt.py</a></h3>
|
||||
<p>
|
||||
<code>pdf2txt.py</code> extracts text contents from a PDF file.
|
||||
It extracts all the text that are to be rendered programmatically,
|
||||
i.e. text represented as ASCII or Unicode strings.
|
||||
It cannot recognize text drawn as images that would require optical character recognition.
|
||||
It also extracts the corresponding locations, font names, font sizes, writing
|
||||
direction (horizontal or vertical) for each text portion.
|
||||
You need to provide a password for protected PDF documents when its access is restricted.
|
||||
You cannot extract any text from a PDF document which does not have extraction permission.
|
||||
|
||||
<p>
|
||||
<strong>Note:</strong>
|
||||
Not all characters in a PDF can be safely converted to Unicode.
|
||||
|
||||
<h4>Examples</h4>
|
||||
<blockquote><pre>
|
||||
$ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong>
|
||||
(extract text as an HTML file whose filename is output.html)
|
||||
|
||||
$ <strong>pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf</strong>
|
||||
(extract a Japanese HTML file in vertical writing, CMap is required)
|
||||
|
||||
$ <strong>pdf2txt.py -P mypassword -o output.txt secret.pdf</strong>
|
||||
(extract a text from an encrypted PDF file)
|
||||
</pre></blockquote>
|
||||
|
||||
<h4>Options</h4>
|
||||
<dl>
|
||||
<dt> <code>-o <em>filename</em></code>
|
||||
<dd> Specifies the output file name.
|
||||
By default, it prints the extracted contents to stdout in text format.
|
||||
<p>
|
||||
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
|
||||
<dd> Specifies the comma-separated list of the page numbers to be extracted.
|
||||
Page numbers start at one.
|
||||
By default, it extracts text from all the pages.
|
||||
<p>
|
||||
<dt> <code>-c <em>codec</em></code>
|
||||
<dd> Specifies the output codec.
|
||||
<p>
|
||||
<dt> <code>-t <em>type</em></code>
|
||||
<dd> Specifies the output format. The following formats are currently supported.
|
||||
<ul>
|
||||
<li> <code>text</code> : TEXT format. (Default)
|
||||
<li> <code>html</code> : HTML format. Not recommended for extraction purposes because the markup is messy.
|
||||
<li> <code>xml</code> : XML format. Provides the most information.
|
||||
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
|
||||
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
|
||||
Tags used here are defined in the PDF specification (See §10.7 "<em>Tagged PDF</em>").
|
||||
</ul>
|
||||
<p>
|
||||
<dt> <code>-I <em>image_directory</em></code>
|
||||
<dd> Specifies the output directory for image extraction.
|
||||
Currently only JPEG images are supported.
|
||||
<p>
|
||||
<dt> <code>-M <em>char_margin</em></code>
|
||||
<dt> <code>-L <em>line_margin</em></code>
|
||||
<dt> <code>-W <em>word_margin</em></code>
|
||||
<dd> These are the parameters used for layout analysis.
|
||||
In an actual PDF file, text portions might be split into several chunks
|
||||
in the middle of its running, depending on the authoring software.
|
||||
Therefore, text extraction needs to splice text chunks.
|
||||
In the figure below, two text chunks whose distance is closer than
|
||||
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
|
||||
continuous and get grouped into one. Also, two lines whose distance is closer than
|
||||
the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
|
||||
as a text box, which is a rectangular area that contains a "cluster" of text portions.
|
||||
Furthermore, it may be required to insert blank characters (spaces) as necessary
|
||||
if the distance between two words is greater than the <em>word_margin</em>
|
||||
(<em><font color="green">W</font></em>), as a blank between words might not be
|
||||
represented as a space, but indicated by the positioning of each word.
|
||||
<p>
|
||||
Each value is specified not as an actual length, but as a proportion of
|
||||
the length to the size of each character in question. The default values
|
||||
are M = 2.0, L = 0.5, and W = 0.1, respectively.
|
||||
<table style="border:2px gray solid; margin: 10px; padding: 10px;"><tr>
|
||||
<td style="border-right:1px red solid" align=right>→</td>
|
||||
<td style="border-left:1px red solid" colspan="4" align=left>← <em><font color="red">M</font></em></td>
|
||||
<td></td>
|
||||
</tr><tr>
|
||||
<td style="border:1px solid"><code>Q u i</code></td>
|
||||
<td style="border:1px solid"><code>c k</code></td>
|
||||
<td width="10px"></td>
|
||||
<td style="border:1px solid"><code>b r o w</code></td>
|
||||
<td style="border:1px solid"><code>n f o x</code></td>
|
||||
<td style="border-bottom:1px blue solid" align=right>↓</td>
|
||||
</tr><tr>
|
||||
<td style="border-right:1px green solid" colspan="2" align=right>→</td><td></td>
|
||||
<td style="border-left:1px green solid" colspan="2" align=left>← <em><font color="green">W</font></em></td>
|
||||
<td rowspan="2" valign=center align=center><em><font color="blue">L</font></em></td>
|
||||
</tr><tr height="10px">
|
||||
</tr><tr>
|
||||
<td style="padding:0px;" colspan="5">
|
||||
<table style="border:1px solid"><tr><td><code>j u m p s</code></td><td>...</td></tr></table>
|
||||
</td>
|
||||
<td style="border-top:1px blue solid" align=right>↑</td>
|
||||
</tr></table>
|
||||
<p>
|
||||
<dt> <code>-F <em>boxes_flow</em></code>
|
||||
<dd> Specifies how much a horizontal and vertical position of a text matters
|
||||
when determining a text order. The value should be within the range of
|
||||
-1.0 (only horizontal position matters) to +1.0 (only vertical position matters).
|
||||
The default value is 0.5.
|
||||
<p>
|
||||
<dt> <code>-C</code>
|
||||
<dd> Suppress object caching.
|
||||
This will reduce the memory consumption but also slows down the process.
|
||||
<p>
|
||||
<dt> <code>-n</code>
|
||||
<dd> Suppress layout analysis.
|
||||
<p>
|
||||
<dt> <code>-A</code>
|
||||
<dd> Forces to perform layout analysis for all the text strings,
|
||||
including text contained in figures.
|
||||
<p>
|
||||
<dt> <code>-V</code>
|
||||
<dd> Allows vertical writing detection.
|
||||
<p>
|
||||
<dt> <code>-Y <em>layout_mode</em></code>
|
||||
<dd> Specifies how the page layout should be preserved. (Currently only applies to HTML format.)
|
||||
<ul>
|
||||
<li> <code>exact</code> : preserve the exact location of each individual character (a large and messy HTML).
|
||||
<li> <code>normal</code> : preserve the location and line breaks in each text block. (Default)
|
||||
<li> <code>loose</code> : preserve the overall location of each text block.
|
||||
</ul>
|
||||
<p>
|
||||
<dt> <code>-E <em>extractdir</em></code>
|
||||
<dd> Specifies the extraction directory of embedded files.
|
||||
<p>
|
||||
<dt> <code>-s <em>scale</em></code>
|
||||
<dd> Specifies the output scale. Can be used in HTML format only.
|
||||
<p>
|
||||
<dt> <code>-m <em>maxpages</em></code>
|
||||
<dd> Specifies the maximum number of pages to extract.
|
||||
By default, it extracts all the pages in a document.
|
||||
<p>
|
||||
<dt> <code>-P <em>password</em></code>
|
||||
<dd> Provides the user password to access PDF contents.
|
||||
<p>
|
||||
<dt> <code>-d</code>
|
||||
<dd> Increases the debug level.
|
||||
</dl>
|
||||
|
||||
<hr noshade>
|
||||
|
||||
<h3><a name="dumppdf">dumppdf.py</a></h3>
|
||||
<p>
|
||||
<code>dumppdf.py</code> dumps the internal contents of a PDF file
|
||||
in pseudo-XML format. This program is primarily for debugging purposes,
|
||||
but it's also possible to extract some meaningful contents
|
||||
(such as images).
|
||||
|
||||
<h4>Examples</h4>
|
||||
<blockquote><pre>
|
||||
$ <strong>dumppdf.py -a foo.pdf</strong>
|
||||
(dump all the headers and contents, except stream objects)
|
||||
|
||||
$ <strong>dumppdf.py -T foo.pdf</strong>
|
||||
(dump the table of contents)
|
||||
|
||||
$ <strong>dumppdf.py -r -i6 foo.pdf > pic.jpeg</strong>
|
||||
(extract a JPEG image)
|
||||
</pre></blockquote>
|
||||
|
||||
<h4>Options</h4>
|
||||
<dl>
|
||||
<dt> <code>-a</code>
|
||||
<dd> Instructs to dump all the objects.
|
||||
By default, it only prints the document trailer (like a header).
|
||||
<p>
|
||||
<dt> <code>-i <em>objno,objno, ...</em></code>
|
||||
<dd> Specifies PDF object IDs to display.
|
||||
Comma-separated IDs, or multiple <code>-i</code> options are accepted.
|
||||
<p>
|
||||
<dt> <code>-p <em>pageno,pageno, ...</em></code>
|
||||
<dd> Specifies the page number to be extracted.
|
||||
Comma-separated page numbers, or multiple <code>-p</code> options are accepted.
|
||||
Note that page numbers start at one, not zero.
|
||||
<p>
|
||||
<dt> <code>-r</code> (raw)
|
||||
<dt> <code>-b</code> (binary)
|
||||
<dt> <code>-t</code> (text)
|
||||
<dd> Specifies the output format of stream contents.
|
||||
Because the contents of stream objects can be very large,
|
||||
they are omitted when none of the options above is specified.
|
||||
<p>
|
||||
With <code>-r</code> option, the "raw" stream contents are dumped without decompression.
|
||||
With <code>-b</code> option, the decompressed contents are dumped as a binary blob.
|
||||
With <code>-t</code> option, the decompressed contents are dumped in a text format,
|
||||
similar to <code>repr()</code> manner. When
|
||||
<code>-r</code> or <code>-b</code> option is given,
|
||||
no stream header is displayed for the ease of saving it to a file.
|
||||
<p>
|
||||
<dt> <code>-T</code>
|
||||
<dd> Shows the table of contents.
<p>
<dt> <code>-E <em>directory</em></code>
<dd> Extracts embedded files from the pdf into the given directory.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>

<h2><a name="changes">Changes:</a></h2>
<ul>
<li> 2014/09/15: pushed on PyPI</li>
<li> 2014/09/10: pdfminer_six forked from pdfminer since Yusuke didn't want to merge and pdfminer3k is outdated</li>
</ul>

<h2><a name="todo">TODO</a></h2>
<ul>
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
<li> Better documentation.
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
<li> Crypt stream filter support. (More sample documents are needed!)
</ul>

<h2><a name="related">Related Projects</a></h2>
<ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
<li> <a href="http://mupdf.com/">mupdf</a>
</ul>

<h2><a name="license">Terms and Conditions</a></h2>
<p>
(This is the so-called MIT/X License.)
<p>
<small>
Copyright (c) 2004-2013 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
<p>
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>

<hr noshade>
<address>Yusuke Shinyama (yusuke at cs dot nyu dot edu)</address>
</body>
docs/layout.obj
@@ -1,391 +0,0 @@
%TGIF 4.2.2
% [Removed: TGIF vector source for docs/layout.png, the layout-analysis
%  diagram (Figure 2). It depicts the LTPage tree with LTTextBox,
%  LTTextLine, LTChar, LTText, LTFigure, LTImage, LTRect, LTCurve and
%  LTLine nodes.]
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
docs/objrel.obj
@@ -1,187 +0,0 @@
%TGIF 4.2.2
% [Removed: TGIF vector source for docs/objrel.png, the class-relationship
%  diagram (Figure 1). It depicts a PDF file feeding PDFParser, which
%  exchanges objects with PDFDocument ("request objects" / "store objects");
%  PDFInterpreter reads page contents, shares resources through
%  PDFResourceManager, and sends rendering instructions to PDFDevice, which
%  writes to a display, file, etc.]
BIN
docs/objrel.png
Binary file not shown. (Before: 2.0 KiB)
@@ -1,223 +0,0 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
</head>
<body>

<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Mon Mar 24 11:49:28 UTC 2014
<!-- hhmts end -->
</div>

<p>
<a href="index.html">[Back to PDFMiner homepage]</a>

<h1>Programming with PDFMiner</h1>
<p>
This page explains how to use PDFMiner as a library
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Performing Layout Analysis</a>
<li> <a href="#tocextract">Obtaining Table of Contents</a>
<li> <a href="#extend">Extending Functionality</a>
</ul>
<h2><a name="overview">Overview</a></h2>
<p>
<strong>PDF is evil.</strong> Although it is called a PDF
"document", it is nothing like a Word or HTML document. PDF is
closer to a graphic representation: its contents are just a bunch of
instructions that tell how to place things at exact
positions on a display or on paper. In most cases it has no logical
structure such as sentences or paragraphs, and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from their
positioning, but nothing is guaranteed to work. Ugly, I
know. Again, PDF is evil.

<p>
[More technical details about the internal structure of PDF:
"How to Extract Text Contents from PDF Manually"
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>]

<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time- and memory-consuming. However,
not every part is needed for most PDF processing tasks. Therefore
PDFMiner takes a strategy of lazy parsing: it parses
things only when they are necessary. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate them to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.

<p>
Figure 1 shows the relationship between the classes in PDFMiner.

<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>
<h2><a name="basic">Basic Usage</a></h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
<span class="comment"># Supply the password for initialization.</span>
document = PDFDocument(parser, password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
</pre></blockquote>
<h2><a name="layout">Performing Layout Analysis</a></h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    <span class="comment"># Receive the LTPage object for the page.</span>
    layout = device.get_result()
</pre></blockquote>

A layout analyzer returns an <code>LTPage</code> object for each page
in the PDF document. This object contains child objects within the page,
forming a tree structure. Figure 2 shows the relationship between
these objects.

<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and their tree structure</small>
</div>
<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
<code>LTCurve</code> and <code>LTLine</code>.

<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.
The <code>get_text()</code> method returns the text content.

<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontally
or vertically, depending on the text's writing mode.
The <code>get_text()</code> method returns the text content.

<dt> <code>LTChar</code>
<dt> <code>LTAnno</code>
<dd> Represent an actual letter in the text as a Unicode string.
Note that, while an <code>LTChar</code> object has actual boundaries,
<code>LTAnno</code> objects do not, as these are "virtual" characters
inserted by the layout analyzer according to the relationship between two characters
(e.g. a space).

<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.

<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be
in JPEG or other formats, but currently PDFMiner does not
pay much attention to graphical objects.

<dt> <code>LTLine</code>
<dd> Represents a single straight line.
Could be used for separating text or figures.

<dt> <code>LTRect</code>
<dd> Represents a rectangle.
Could be used for framing other pictures or figures.

<dt> <code>LTCurve</code>
<dd> Represents a generic Bezier curve.
</dl>
|
||||
|
||||
<p>
|
||||
Also, check out <a href="http://denis.papathanasiou.org/archive/2010.08.04.post.pdf">a more complete example by Denis Papathanasiou(Extracting Text & Images from PDF Files)</a>.
|
||||
|
||||
<h2><a name="tocextract">Obtaining Table of Contents</a></h2>
|
||||
<p>
|
||||
PDFMiner provides functions to access the document's table of contents
|
||||
("Outlines").
|
||||
|
||||
<blockquote><pre>
|
||||
from pdfminer.pdfparser import PDFParser
|
||||
from pdfminer.pdfdocument import PDFDocument
|
||||
|
||||
<span class="comment"># Open a PDF document.</span>
|
||||
fp = open('mypdf.pdf', 'rb')
|
||||
parser = PDFParser(fp)
|
||||
document = PDFDocument(parser, password)
|
||||
|
||||
<span class="comment"># Get the outlines of the document.</span>
|
||||
outlines = document.get_outlines()
|
||||
for (level,title,dest,a,se) in outlines:
|
||||
print (level, title)
|
||||
</pre></blockquote>
|
||||
|
||||
<p>
|
||||
Some PDF documents use page numbers as destinations, while others
|
||||
use page numbers and the physical location within the page. Since
|
||||
PDF does not have a logical structure, and it does not provide a
|
||||
way to refer to any in-page object from the outside, there's no
|
||||
way to tell exactly which part of text these destinations are
|
||||
referring to.
|
||||
|
||||
<h2><a name="extend">Extending Functionality</a></h2>
|
||||
|
||||
<p>
|
||||
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class
|
||||
in order to process them differently / obtain other information.
|
||||
|
||||
<hr noshade>
|
||||
<address>Yusuke Shinyama</address>
|
||||
</body>
|
|
@ -0,0 +1 @@
sphinx-argparse
@ -0,0 +1,28 @@
<style>
  td {
    text-align: center;
  }
</style>
<table style="margin: 10px; padding: 10px;">
  <tr>
    <td style="text-align: right; border-right:1px red solid">→</td>
    <td colspan="4"
        style="text-align: left; border-left:1px red solid">← <em><font
        color="red">M</font></em></td>
  </tr>
  <tr>
    <td style="border:1px solid"><code>Q u i</code></td>
    <td style="border:1px solid"><code>c k</code></td>
    <td width="10px"></td>
    <td style="border:1px solid"><code>b r o w n</code></td>
  </tr>
  <tr>
    <td colspan="2" style="text-align: right; border-right:1px green solid">
      →
    </td>
    <td></td>
    <td colspan="2"
        style="text-align: left; border-left:1px green solid">←
      <em><font color="green">W</font></em></td>
  </tr>
</table>
@ -0,0 +1,23 @@
<style>
  .background-blue {
    background-color: lightblue;
    border: 2px solid lightblue;
  }
</style>
<table style="margin: 10px; padding: 10px;">
  <tr>
    <td style="border:1px solid; text-align: left">
      <code>
        Q u i c k b r o w n<br/> f o x
      </code>
    </td>
    <td class="background-blue" colspan="3"></td>
  </tr>
  <tr style="height: 10px;">
    <td class="background-blue" colspan="4"></td>
  </tr>
  <tr>
    <td class="background-blue" colspan="3"></td>
    <td style="border:1px solid"><code>j u m p s ...</code></td>
  </tr>
</table>
@ -0,0 +1,45 @@
<style>
  td {
    text-align: center;
  }
</style>
<table style="margin: 10px; padding: 10px;">
  <tr>
    <td></td>
    <td></td>
    <td align=right style="border-bottom:1px blue solid">↓</td>
    <td></td>
  </tr>
  <tr>
    <td colspan="2" style="border:1px solid"><code>Q u i c k b r o w
      n</code></td>
    <td></td>
    <td align=right style="border-bottom:1px blue solid">↓</td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td align=center valign=center><em><font color="blue">
      L<sub>1</sub>
    </font></em></td>
    <td></td>
  </tr>
  <tr>
    <td style="border:1px solid;">
      <code>f o x</code>
    </td>
    <td>
    </td>
    <td align=right style="border-top:1px blue solid">↑</td>
    <td align=center valign=center><em><font color="blue">
      L<sub>2</sub>
    </font></em></td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
    <td align=right style="border-top:1px blue solid">↑</td>
  </tr>
</table>
@ -0,0 +1,25 @@
.. _api_commandline:


Command-line API
****************

.. _api_pdf2txt:

pdf2txt.py
==========

.. argparse::
    :module: tools.pdf2txt
    :func: maketheparser
    :prog: python tools/pdf2txt.py

.. _api_dumppdf:

dumppdf.py
==========

.. argparse::
    :module: tools.dumppdf
    :func: create_parser
    :prog: python tools/dumppdf.py
@ -0,0 +1,20 @@
.. _api_composable:

Composable API
**************

.. _api_laparams:

LAParams
========

.. currentmodule:: pdfminer.layout
.. autoclass:: LAParams

Todo:
=====

- `PDFDevice`
- `TextConverter`
- `PDFPageAggregator`
- `PDFPageInterpreter`
@ -0,0 +1,21 @@
.. _api_highlevel:

High-level functions API
************************

.. _api_extract_text:

extract_text
============

.. currentmodule:: pdfminer.high_level
.. autofunction:: extract_text


.. _api_extract_text_to_fp:

extract_text_to_fp
==================

.. currentmodule:: pdfminer.high_level
.. autofunction:: extract_text_to_fp
@ -0,0 +1,9 @@
API documentation
*****************

.. toctree::
    :maxdepth: 2

    commandline
    highlevel
    composable
@ -0,0 +1,61 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.

import os
import sys
sys.path.insert(0, os.path.join(os.path.abspath(os.path.dirname(__file__)), '../../'))


# -- Project information -----------------------------------------------------

project = 'pdfminer.six'
copyright = '2019, Yusuke Shinyama, Philippe Guglielmetti & Pieter Marsman'
author = 'Yusuke Shinyama, Philippe Guglielmetti & Pieter Marsman'

# The full version, including alpha/beta/rc tags
release = '20191020'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinxarg.ext',
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
]

# Root rst file
master_doc = 'index'

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
@ -0,0 +1,72 @@
Welcome to pdfminer.six's documentation!
****************************************

.. image:: https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master
    :target: https://travis-ci.org/pdfminer/pdfminer.six
    :alt: Travis-ci build badge

.. image:: https://img.shields.io/pypi/v/pdfminer.six.svg
    :target: https://pypi.python.org/pypi/pdfminer.six/
    :alt: PyPi version badge

.. image:: https://badges.gitter.im/pdfminer-six/Lobby.svg
    :target: https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium
    :alt: gitter badge


Pdfminer.six is a Python package for extracting information from PDF
documents.

Check out the source on `GitHub <https://github.com/pdfminer/pdfminer.six>`_.

Content
=======

.. toctree::
    :maxdepth: 2

    tutorials/index
    topics/index
    api/index


Features
========

* Parse all objects from a PDF document into Python objects.
* Analyze and group text in a human-readable way.
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
  contents and more.
* Support for (almost all) features from the PDF-1.7 specification.
* Support for Chinese, Japanese and Korean (CJK) languages as well as
  vertical writing.
* Support for various font types (Type1, TrueType, Type3, and CID).
* Support for basic encryption (RC4).


Installation instructions
=========================

Install pdfminer.six using Python 2.7 or newer::

    $ pip install pdfminer.six

Note that Python 2.7 support will be dropped in January 2020.

Common use-cases
----------------

* :ref:`tutorial_commandline` if you just want to extract text from a PDF
  once.
* :ref:`tutorial_highlevel` if you want to integrate pdfminer.six with your
  Python code.
* :ref:`tutorial_composable` when you want to tailor the behavior of
  pdfminer.six to your needs.


Contributing
============

We welcome any contributors to pdfminer.six! But, before doing anything, take
a look at the `contribution guide
<https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md>`_.
@ -0,0 +1,132 @@
.. _topic_pdf_to_text:

Converting a PDF file to text
*****************************

Most PDF files look like they contain well-structured text. But the reality is
that a PDF file does not contain anything that resembles paragraphs,
sentences or even words. When it comes to text, a PDF file is only aware of
the characters and their placement.

This makes extracting meaningful pieces of text from PDF files difficult.
The characters that compose a paragraph are no different from those that
compose a table, a page footer or the description of a figure. Unlike
other document formats, such as a `.txt` file or a Word document, the PDF
format does not contain a stream of text.

A PDF document consists of a collection of objects that together describe
the appearance of one or more pages, possibly accompanied by additional
interactive elements and higher-level application data. A PDF file contains
the objects making up a PDF document along with associated structural
information, all represented as a single self-contained sequence of bytes. [1]_

Layout analysis algorithm
=========================

PDFMiner attempts to reconstruct some of those structures by using heuristics
on the positioning of characters. This works well for sentences and
paragraphs because meaningful groups of nearby characters can be made.

The layout analysis consists of three different stages: it groups characters
into words and lines, then it groups lines into text boxes, and finally it
groups text boxes hierarchically. These stages are discussed in the following
sections. The resulting output of the layout analysis is an ordered hierarchy
of layout objects on a PDF page.

.. figure:: ../_static/layout_analysis_output.png
    :align: center

    The output of the layout analysis is a hierarchy of layout objects.


The output of the layout analysis heavily depends on a couple of parameters.
All these parameters are part of the :ref:`api_laparams` class.

Grouping characters into words and lines
----------------------------------------

The first step in going from characters to text is to group characters in a
meaningful way. Each character has an x-coordinate and a y-coordinate for its
bottom-left corner and upper-right corner, i.e. its bounding box.
Pdfminer.six uses these bounding boxes to decide which characters belong
together.

Characters that are both horizontally and vertically close are grouped. How
close they should be is determined by the `char_margin` (M in the figure) and
the `line_overlap` (not in the figure) parameters. The horizontal *distance*
between the bounding boxes of two characters should be smaller than the
`char_margin`, and the vertical *overlap* between the bounding boxes should
be larger than the `line_overlap`.

.. raw:: html
    :file: ../_static/layout_analysis.html

The values of `char_margin` and `line_overlap` are relative to the size of
the bounding boxes of the characters. The `char_margin` is relative to the
maximum width of either one of the bounding boxes, and the `line_overlap` is
relative to the minimum height of either one of the bounding boxes.
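
The pairwise check described above can be sketched in plain Python. This is an illustrative simplification, not pdfminer.six's actual implementation; it assumes `(x0, y0, x1, y1)` bounding boxes, and the default values mirror `LAParams.char_margin` and `LAParams.line_overlap`:

```python
def same_line(bbox1, bbox2, char_margin=2.0, line_overlap=0.5):
    """Decide whether two character bounding boxes belong on the same line.

    Bounding boxes are (x0, y0, x1, y1) tuples; the margins are relative.
    """
    (ax0, ay0, ax1, ay1) = bbox1
    (bx0, by0, bx1, by1) = bbox2
    # Horizontal distance must be smaller than char_margin times the
    # maximum width of either bounding box.
    hdistance = max(bx0 - ax1, ax0 - bx1, 0)
    max_width = max(ax1 - ax0, bx1 - bx0)
    # Vertical overlap must be larger than line_overlap times the
    # minimum height of either bounding box.
    voverlap = min(ay1, by1) - max(ay0, by0)
    min_height = min(ay1 - ay0, by1 - by0)
    return (hdistance < char_margin * max_width
            and voverlap > line_overlap * min_height)


# Two 10x10 characters, 5 units apart on the same baseline:
print(same_line((0, 0, 10, 10), (15, 0, 25, 10)))  # True
# The same characters, 25 units apart:
print(same_line((0, 0, 10, 10), (35, 0, 45, 10)))  # False
```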

Spaces need to be inserted between characters because the PDF format has no
notion of the space character. A space is inserted if two characters on the
same line are further apart than the `word_margin` (W in the figure). The
`word_margin` is relative to the maximum width or height of the new
character. A smaller `word_margin` creates smaller words, because spaces are
inserted between characters more often. Note that the `word_margin` should be
smaller than the `char_margin`, otherwise characters on the same line are
never separated by a space.

The result of this stage is a list of lines. Each line consists of a list of
characters. These characters are either original `LTChar` characters that
originate from the PDF file, or inserted `LTAnno` characters that represent
spaces between words or newlines at the end of each line.
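
The space-insertion rule above can be illustrated with a small, self-contained sketch (again a simplification, not the pdfminer.six source); characters are given as hypothetical `(text, x0, x1)` triples sorted from left to right:

```python
def insert_spaces(chars, word_margin=0.1):
    """Rebuild the text of one line, inserting a space whenever the gap to
    the next character exceeds word_margin times that character's width."""
    pieces = []
    prev_x1 = None
    for text, x0, x1 in chars:
        if prev_x1 is not None and x0 - prev_x1 > word_margin * (x1 - x0):
            pieces.append(' ')  # the PDF itself contains no space character
        pieces.append(text)
        prev_x1 = x1
    return ''.join(pieces)


line = [('H', 0, 8), ('i', 8, 12), ('y', 30, 38), ('o', 38, 46), ('u', 46, 54)]
print(insert_spaces(line))  # 'Hi you'
```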

Grouping lines into boxes
-------------------------

The second step is grouping lines in a meaningful way. Each line has a
bounding box that is determined by the bounding boxes of the characters that
it contains. Like grouping characters, pdfminer.six uses the bounding boxes
to group the lines.

Lines that are both horizontally overlapping and vertically close are grouped.
How vertically close the lines should be is determined by the `line_margin`.
This margin is specified relative to the height of the bounding box. Lines
are close if the gaps between their tops (see L :sub:`1` in the figure) and
their bottoms (see L :sub:`2` in the figure) are smaller than the absolute
line margin, i.e. the `line_margin` multiplied by the height of the bounding
box.
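
One way to read this rule in code. This is a simplified sketch, not the actual grouping logic of pdfminer.six, and it assumes `(x0, y0, x1, y1)` bounding boxes with the vertical gap measured between the boxes:

```python
def lines_are_close(bbox1, bbox2, line_margin=0.5):
    """Whether two line bounding boxes are vertically close enough to be
    grouped into the same text box (cf. LAParams.line_margin)."""
    (_, ay0, _, ay1) = bbox1
    (_, by0, _, by1) = bbox2
    # Vertical gap between the boxes: from the bottom of the upper box to
    # the top of the lower box (negative when the boxes overlap).
    gap = max(ay0, by0) - min(ay1, by1)
    # The absolute margin is line_margin times the smaller box height.
    margin = line_margin * min(ay1 - ay0, by1 - by0)
    return gap < margin


upper = (0, 100, 200, 110)  # a line of height 10
lower = (0, 88, 180, 98)    # the next line, 2 units below
print(lines_are_close(upper, lower))             # True: gap 2 < margin 5
print(lines_are_close(upper, (0, 60, 180, 70)))  # False: gap 30 > margin 5
```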

.. raw:: html
    :file: ../_static/layout_analysis_group_lines.html

The result of this stage is a list of text boxes. Each box consists of a list
of lines.

Grouping textboxes hierarchically
---------------------------------

The last step is to group the text boxes in a meaningful way. This step
repeatedly merges the two text boxes that are closest to each other.

The closeness of two bounding boxes is computed as the area that is between
the two text boxes (the blue area in the figure). In other words, it is the
area of the bounding box that surrounds both text boxes, minus the area of
the bounding boxes of the individual text boxes.
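
This closeness measure is straightforward to express in code; the sketch below assumes `(x0, y0, x1, y1)` bounding boxes:

```python
def boxes_closeness(bbox1, bbox2):
    """Area between two text boxes: the area of the bounding box that
    surrounds both, minus the areas of the two individual boxes."""
    (ax0, ay0, ax1, ay1) = bbox1
    (bx0, by0, bx1, by1) = bbox2
    x0, y0 = min(ax0, bx0), min(ay0, by0)
    x1, y1 = max(ax1, bx1), max(ay1, by1)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    return (x1 - x0) * (y1 - y0) - area_a - area_b


# Two 10x10 boxes that only share a corner leave 2 * 10 * 10 of empty area:
print(boxes_closeness((0, 0, 10, 10), (10, 10, 20, 20)))  # 200
```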

.. raw:: html
    :file: ../_static/layout_analysis_group_boxes.html


Working with rotated characters
===============================

The algorithm described above assumes that all characters have the same
orientation. However, any writing direction is possible in a PDF. To
accommodate this, pdfminer.six can detect vertical writing with the
`detect_vertical` parameter. This applies all the grouping steps as if the
PDF were rotated 90 (or 270) degrees.

References
==========

.. [1] Adobe Systems Inc. (2007). *PDF reference: Adobe portable document
    format, version 1.7.*
@ -0,0 +1,7 @@
Using pdfminer.six
******************

.. toctree::
    :maxdepth: 2

    converting_pdf_to_text
@ -0,0 +1,41 @@
.. _tutorial_commandline:

Get started with command-line tools
***********************************

pdfminer.six has several tools that can be used from the command line. The
command-line tools are aimed at users that occasionally want to extract text
from a PDF.

Take a look at the high-level or composable interface if you want to use
pdfminer.six programmatically.

Examples
========

pdf2txt.py
----------

::

    $ python tools/pdf2txt.py example.pdf
    all the text from the pdf appears on the command line

The :ref:`api_pdf2txt` tool extracts all the text from a PDF. It uses layout
analysis with sensible defaults to order and group the text in a sensible
way.

dumppdf.py
----------

::

    $ python tools/dumppdf.py -a example.pdf
    <pdf><object id="1">
    ...
    </object>
    ...
    </pdf>

The :ref:`api_dumppdf` tool can be used to extract the internal structure
from a PDF. This tool is primarily for debugging purposes, but it can be
useful to anybody working with PDFs.
@ -0,0 +1,33 @@
.. _tutorial_composable:

Get started using the composable components API
***********************************************

The command-line tools and the high-level API are just shortcuts for often
used combinations of pdfminer.six components. You can use these components to
modify pdfminer.six to your own needs.

For example, to extract the text from a PDF file and save it in a Python
variable::

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open('samples/simple1.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    print(output_string.getvalue())
@ -0,0 +1,67 @@
.. testsetup::

    import sys
    from pdfminer.high_level import extract_text_to_fp, extract_text

.. _tutorial_highlevel:

Get started using the high-level functions
******************************************

The high-level API can be used to do common tasks.

The simplest way to extract text from a PDF is to use
:ref:`api_extract_text`:

.. doctest::

    >>> text = extract_text('samples/simple1.pdf')
    >>> print(repr(text))
    'Hello \n\nWorld\n\nWorld\n\nHello \n\nH e l l o \n\nH e l l o \n\nW o r l d\n\nW o r l d\n\n\x0c'
    >>> print(text)
    ... # doctest: +NORMALIZE_WHITESPACE
    Hello
    <BLANKLINE>
    World
    <BLANKLINE>
    World
    <BLANKLINE>
    Hello
    <BLANKLINE>
    H e l l o
    <BLANKLINE>
    H e l l o
    <BLANKLINE>
    W o r l d
    <BLANKLINE>
    W o r l d
    <BLANKLINE>


To read text from a PDF and print it on the command line:

.. doctest::

    >>> if sys.version_info > (3, 0):
    ...     from io import StringIO
    ... else:
    ...     from io import BytesIO as StringIO
    >>> output_string = StringIO()
    >>> with open('samples/simple1.pdf', 'rb') as fin:
    ...     extract_text_to_fp(fin, output_string)
    >>> print(output_string.getvalue().strip())
    Hello WorldHello WorldHello WorldHello World

Or to convert it to html and use layout analysis:

.. doctest::

    >>> if sys.version_info > (3, 0):
    ...     from io import StringIO
    ... else:
    ...     from io import BytesIO as StringIO
    >>> from pdfminer.layout import LAParams
    >>> output_string = StringIO()
    >>> with open('samples/simple1.pdf', 'rb') as fin:
    ...     extract_text_to_fp(fin, output_string, laparams=LAParams(),
    ...                        output_type='html', codec=None)
@ -0,0 +1,9 @@
Getting started
***************

.. toctree::
    :maxdepth: 2

    commandline
    highlevel
    composable
@ -1,4 +0,0 @@
blockquote { background: #eeeeee; }
h1 { border-bottom: solid black 2px; }
h2 { border-bottom: solid black 1px; }
.comment { color: darkgreen; }
@ -2,6 +2,7 @@
 # -*- coding: utf-8 -*-
 import logging
 import re
+import sys
 from .pdfdevice import PDFTextDevice
 from .pdffont import PDFUnicodeNotDefined
 from .layout import LTContainer
@ -271,6 +272,8 @@ class HTMLConverter(PDFConverter):
     def write(self, text):
         if self.codec:
             text = text.encode(self.codec)
+        if sys.version_info < (3, 0):
+            text = str(text)
         self.outfp.write(text)
         return
 
@ -1,26 +1,20 @@
 # -*- coding: utf-8 -*-
-"""
-Functions that encapsulate "usual" use-cases for pdfminer, for use making
-bundled scripts and for using pdfminer as a module for routine tasks.
-"""
+"""Functions that can be used for the most common use-cases for pdfminer.six"""
 
 import logging
-import six
 import sys
+
+import six
 
-# Conditional import because python 2 is stupid
-if sys.version_info > (3, 0):
-    from io import StringIO
-else:
-    from io import BytesIO as StringIO
 
 from .pdfdocument import PDFDocument
 from .pdfparser import PDFParser
 from .pdfinterp import PDFResourceManager, PDFPageInterpreter
-from .pdfdevice import PDFDevice, TagExtractor
+from .pdfdevice import TagExtractor
 from .pdfpage import PDFPage
 from .converter import XMLConverter, HTMLConverter, TextConverter
 from .cmapdb import CMapDB
 from .image import ImageWriter
 from .layout import LAParams
 
@ -36,20 +30,24 @@ def extract_text_to_fp(inf, outfp,
     Beware laparams: Including an empty LAParams is not the same as passing None!
     Returns nothing, acting as it does on two streams. Use StringIO to get strings.
 
-    output_type: May be 'text', 'xml', 'html', 'tag'. Only 'text' works properly.
-    codec: Text decoding codec
-    laparams: An LAParams object from pdfminer.layout.
-        Default is None but may not layout correctly.
-    maxpages: How many pages to stop parsing after
-    page_numbers: zero-indexed page numbers to operate on.
-    password: For encrypted PDFs, the password to decrypt.
-    scale: Scale factor
-    rotation: Rotation factor
-    layoutmode: Default is 'normal', see pdfminer.converter.HTMLConverter
-    output_dir: If given, creates an ImageWriter for extracted images.
-    strip_control: Does what it says on the tin
-    debug: Output more logging data
-    disable_caching: Does what it says on the tin
+    :param inf: a file-like object to read PDF structure from, such as a
+        file handler (using the builtin `open()` function) or a `BytesIO`.
+    :param outfp: a file-like object to write the text to.
+    :param output_type: May be 'text', 'xml', 'html', 'tag'. Only 'text' works properly.
+    :param codec: Text decoding codec
+    :param laparams: An LAParams object from pdfminer.layout. Default is None but may not layout correctly.
+    :param maxpages: How many pages to stop parsing after
+    :param page_numbers: zero-indexed page numbers to operate on.
+    :param password: For encrypted PDFs, the password to decrypt.
+    :param scale: Scale factor
+    :param rotation: Rotation factor
+    :param layoutmode: Default is 'normal', see pdfminer.converter.HTMLConverter
+    :param output_dir: If given, creates an ImageWriter for extracted images.
+    :param strip_control: Does what it says on the tin
+    :param debug: Output more logging data
+    :param disable_caching: Does what it says on the tin
+    :param other:
+    :return:
     """
     if '_py2_no_more_posargs' in kwargs is not None:
         raise DeprecationWarning(
@ -1,17 +1,15 @@
 import heapq
 
 from .utils import INF
 from .utils import Plane
-from .utils import get_bound
-from .utils import uniq
-from .utils import fsplit
-from .utils import bbox2str
-from .utils import matrix2str
 from .utils import apply_matrix_pt
+from .utils import bbox2str
+from .utils import fsplit
+from .utils import get_bound
+from .utils import matrix2str
+from .utils import uniq
 
 import six  # Python 2+3 compatibility
 
-## IndexAssigner
-##
 class IndexAssigner(object):
 
     def __init__(self, index=0):
@ -28,9 +26,33 @@ class IndexAssigner(object):
         return
 
 
-## LAParams
-##
 class LAParams(object):
+    """Parameters for layout analysis
+
+    :param line_overlap: If two characters have more overlap than this they
+        are considered to be on the same line. The overlap is specified
+        relative to the minimum height of both characters.
+    :param char_margin: If two characters are closer together than this
+        margin they are considered to be part of the same word. If
+        characters are on the same line but not part of the same word, an
+        intermediate space is inserted. The margin is specified relative to
+        the width of the character.
+    :param word_margin: If two words are closer together than this
+        margin they are considered to be part of the same line. A space is
+        added in between for readability. The margin is specified relative
+        to the width of the word.
+    :param line_margin: If two lines are close together they are
+        considered to be part of the same paragraph. The margin is
+        specified relative to the height of a line.
+    :param boxes_flow: Specifies how much a horizontal and vertical position
+        of a text matters when determining the order of lines. The value
+        should be within the range of -1.0 (only horizontal position
+        matters) to +1.0 (only vertical position matters).
+    :param detect_vertical: If vertical text should be considered during
+        layout analysis
+    :param all_texts: If layout analysis should be performed on text in
+        figures.
+    """
 
     def __init__(self,
                  line_overlap=0.5,
@ -54,30 +76,28 @@ class LAParams(object):
                 (self.char_margin, self.line_margin, self.word_margin, self.all_texts))
 
 
-## LTItem
-##
 class LTItem(object):
+    """Interface for things that can be analyzed"""
 
     def analyze(self, laparams):
         """Perform the layout analysis."""
         return
 
 
-## LTText
-##
 class LTText(object):
+    """Interface for things that have text"""
 
     def __repr__(self):
         return ('<%s %r>' %
                 (self.__class__.__name__, self.get_text()))
 
     def get_text(self):
+        """Text contained in this object"""
         raise NotImplementedError
 
 
-## LTComponent
-##
 class LTComponent(LTItem):
+    """Object with a bounding box"""
 
     def __init__(self, bbox):
         LTItem.__init__(self)
@ -91,10 +111,13 @@ class LTComponent(LTItem):
     # Disable comparison.
     def __lt__(self, _):
         raise ValueError
 
     def __le__(self, _):
         raise ValueError
 
     def __gt__(self, _):
         raise ValueError
 
     def __ge__(self, _):
         raise ValueError
 
@ -149,9 +172,8 @@ class LTComponent(LTItem):
         return 0
 
 
-## LTCurve
-##
 class LTCurve(LTComponent):
+    """A generic Bezier curve"""
 
     def __init__(self, linewidth, pts, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
         LTComponent.__init__(self, get_bound(pts))
@ -168,18 +190,22 @@ class LTCurve(LTComponent):
         return ','.join('%.3f,%.3f' % p for p in self.pts)
 
 
-## LTLine
-##
 class LTLine(LTCurve):
+    """A single straight line.
+
+    Could be used for separating text or figures.
+    """
 
     def __init__(self, linewidth, p0, p1, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
         LTCurve.__init__(self, linewidth, [p0, p1], stroke, fill, evenodd, stroking_color, non_stroking_color)
         return
 
 
-## LTRect
-##
 class LTRect(LTCurve):
+    """A rectangle.
+
+    Could be used for framing other pictures or figures.
+    """
 
     def __init__(self, linewidth, bbox, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
         (x0, y0, x1, y1) = bbox
@ -187,9 +213,11 @@ class LTRect(LTCurve):
|
|||
return
|
||||
|
||||
|
||||
## LTImage
|
||||
##
|
||||
class LTImage(LTComponent):
|
||||
"""An image object.
|
||||
|
||||
Embedded images can be in JPEG, Bitmap or JBIG2.
|
||||
"""
|
||||
|
||||
def __init__(self, name, stream, bbox):
|
||||
LTComponent.__init__(self, bbox)
|
||||
|
@ -210,9 +238,13 @@ class LTImage(LTComponent):
|
|||
bbox2str(self.bbox), self.srcsize))
|
||||
|
||||
|
||||
## LTAnno
|
||||
##
|
||||
class LTAnno(LTItem, LTText):
|
||||
"""Actual letter in the text as a Unicode string.
|
||||
|
||||
Note that, while a LTChar object has actual boundaries, LTAnno objects does
|
||||
not, as these are "virtual" characters, inserted by a layout analyzer
|
||||
according to the relationship between two characters (e.g. a space).
|
||||
"""
|
||||
|
||||
def __init__(self, text):
|
||||
self._text = text
|
||||
|
@ -222,9 +254,8 @@ class LTAnno(LTItem, LTText):
|
|||
return self._text
|
||||
|
||||
|
||||
## LTChar
|
||||
##
|
||||
class LTChar(LTComponent, LTText):
|
||||
"""Actual letter in the text as a Unicode string."""
|
||||
|
||||
def __init__(self, matrix, font, fontsize, scaling, rise,
|
||||
text, textwidth, textdisp, ncs, graphicstate):
|
||||
|
@ -285,9 +316,8 @@ class LTChar(LTComponent, LTText):
|
|||
return True
|
||||
|
||||
|
||||
## LTContainer
|
||||
##
|
||||
class LTContainer(LTComponent):
|
||||
"""Object that can be extended and analyzed"""
|
||||
|
||||
def __init__(self, bbox):
|
||||
LTComponent.__init__(self, bbox)
|
||||
|
@ -315,10 +345,7 @@ class LTContainer(LTComponent):
|
|||
return
|
||||
|
||||
|
||||
## LTExpandableContainer
|
||||
##
|
||||
class LTExpandableContainer(LTContainer):
|
||||
|
||||
def __init__(self):
|
||||
LTContainer.__init__(self, (+INF, +INF, -INF, -INF))
|
||||
return
|
||||
|
@ -330,10 +357,7 @@ class LTExpandableContainer(LTContainer):
|
|||
return
|
||||
|
||||
|
||||
## LTTextContainer
|
||||
##
|
||||
class LTTextContainer(LTExpandableContainer, LTText):
|
||||
|
||||
def __init__(self):
|
||||
LTText.__init__(self)
|
||||
LTExpandableContainer.__init__(self)
|
||||
|
@ -343,9 +367,12 @@ class LTTextContainer(LTExpandableContainer, LTText):
|
|||
return ''.join(obj.get_text() for obj in self if isinstance(obj, LTText))
|
||||
|
||||
|
||||
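The one-line get_text above is a composite pattern: a container concatenates the text of any child that is itself an LTText, and silently skips everything else. A minimal stand-in (hypothetical classes, not pdfminer's) makes the behavior concrete:

```python
class Text:
    """Stand-in for an object that carries text (like LTChar/LTAnno)."""

    def __init__(self, s):
        self._s = s

    def get_text(self):
        return self._s


class Container(list):
    """Stand-in for LTTextContainer: children live in the list itself."""

    def get_text(self):
        # Concatenate text from text-carrying children; skip everything else.
        return ''.join(obj.get_text() for obj in self if isinstance(obj, Text))


line = Container([Text('P'), Text('D'), Text('F'), object()])
print(line.get_text())  # 'PDF' — the non-Text child is skipped
```

Because containers are themselves LTText subclasses in pdfminer, the same call recurses naturally through nested boxes and lines.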
## LTTextLine
##
class LTTextLine(LTTextContainer):
    """Contains a list of LTChar objects that represent a single text line.

    The characters are aligned either horizontally or vertically, depending on
    the text's writing mode.
    """

    def __init__(self, word_margin):
        LTTextContainer.__init__(self)

@@ -367,7 +394,6 @@ class LTTextLine(LTTextContainer):

class LTTextLineHorizontal(LTTextLine):

    def __init__(self, word_margin):
        LTTextLine.__init__(self, word_margin)
        self._x1 = +INF

@@ -393,7 +419,6 @@ class LTTextLineHorizontal(LTTextLine):

class LTTextLineVertical(LTTextLine):

    def __init__(self, word_margin):
        LTTextLine.__init__(self, word_margin)
        self._y0 = -INF

@@ -418,12 +443,13 @@ class LTTextLineVertical(LTTextLine):
        abs(obj.y1-self.y1) < d))]


## LTTextBox
##
## A set of text objects that are grouped within
## a certain rectangular area.
##
class LTTextBox(LTTextContainer):
    """Represents a group of text chunks in a rectangular area.

    Note that this box is created by geometric analysis and does not
    necessarily represent a logical boundary of the text. It contains a list
    of LTTextLine objects.
    """

    def __init__(self):
        LTTextContainer.__init__(self)

@@ -437,7 +463,6 @@ class LTTextBox(LTTextContainer):

class LTTextBoxHorizontal(LTTextBox):

    def analyze(self, laparams):
        LTTextBox.analyze(self, laparams)
        self._objs.sort(key=lambda obj: -obj.y1)

@@ -448,7 +473,6 @@ class LTTextBoxHorizontal(LTTextBox):

class LTTextBoxVertical(LTTextBox):

    def analyze(self, laparams):
        LTTextBox.analyze(self, laparams)
        self._objs.sort(key=lambda obj: -obj.x1)

@@ -458,10 +482,7 @@ class LTTextBoxVertical(LTTextBox):
        return 'tb-rl'


## LTTextGroup
##
class LTTextGroup(LTTextContainer):

    def __init__(self, objs):
        LTTextContainer.__init__(self)
        self.extend(objs)

@@ -469,7 +490,6 @@ class LTTextGroup(LTTextContainer):

class LTTextGroupLRTB(LTTextGroup):

    def analyze(self, laparams):
        LTTextGroup.analyze(self, laparams)
        # reorder the objects from top-left to bottom-right.

@@ -480,7 +500,6 @@ class LTTextGroupLRTB(LTTextGroup):

class LTTextGroupTBRL(LTTextGroup):

    def analyze(self, laparams):
        LTTextGroup.analyze(self, laparams)
        # reorder the objects from top-right to bottom-left.

@@ -490,10 +509,7 @@ class LTTextGroupTBRL(LTTextGroup):
        return


## LTLayoutContainer
##
class LTLayoutContainer(LTContainer):

    def __init__(self, bbox):
        LTContainer.__init__(self, bbox)
        self.groups = None

@@ -709,9 +725,13 @@ class LTLayoutContainer(LTContainer):
        return


## LTFigure
##
class LTFigure(LTLayoutContainer):
    """Represents an area used by PDF Form objects.

    PDF Forms can be used to present figures or pictures by embedding yet
    another PDF document within a page. Note that LTFigure objects can appear
    recursively.
    """

    def __init__(self, name, bbox, matrix):
        self.name = name

@@ -734,9 +754,12 @@ class LTFigure(LTLayoutContainer):
        return


## LTPage
##
class LTPage(LTLayoutContainer):
    """Represents an entire page.

    May contain child objects like LTTextBox, LTFigure, LTImage, LTRect,
    LTCurve and LTLine.
    """

    def __init__(self, pageid, bbox, rotate=0):
        LTLayoutContainer.__init__(self, bbox)
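LTTextBoxHorizontal.analyze above sorts child lines by descending y1 because PDF coordinates grow upward: a larger y1 means higher on the page, so descending order is natural reading order. A toy illustration with hypothetical (y1, text) pairs:

```python
# Each stand-in line carries its top edge (y1) and its text.
lines = [(100.0, 'middle'), (180.0, 'top'), (20.0, 'bottom')]

# Same idea as self._objs.sort(key=lambda obj: -obj.y1):
# descending y1 yields top-of-page-first reading order.
lines.sort(key=lambda pair: -pair[0])

print([text for _, text in lines])  # ['top', 'middle', 'bottom']
```

LTTextBoxVertical does the analogous thing with `-obj.x1`, since vertical (tb-rl) text reads right to left across columns.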
@@ -2,13 +2,13 @@

import six

from . import utils
from .pdffont import PDFUnicodeNotDefined

from . import utils

## PDFDevice
##
class PDFDevice(object):
    """Translate the output of PDFPageInterpreter to the output that is needed
    """

    def __init__(self, rsrcmgr):
        self.rsrcmgr = rsrcmgr
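The new PDFDevice docstring describes a sink role: the interpreter walks the page content and calls back into the device, which turns those calls into whatever output format is wanted. A minimal stand-in illustrating that shape (the class and its single method are hypothetical simplifications, not pdfminer's actual device API):

```python
class TextCollector:
    """Hypothetical device-like sink: collects text the interpreter emits."""

    def __init__(self):
        self.chunks = []

    def render_string(self, text):
        # A real PDFDevice callback receives much richer arguments
        # (graphics state, font, text matrix); this sink keeps only the text.
        self.chunks.append(text)

    def get_result(self):
        return ''.join(self.chunks)


sink = TextCollector()
for piece in ('Hello, ', 'PDF'):
    sink.render_string(piece)
print(sink.get_result())  # Hello, PDF
```

Swapping the device is how pdfminer switches between text, HTML and XML output without touching the interpreter.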
@@ -318,9 +318,8 @@ class PDFContentParser(PSStackParser):
        return


## Interpreter
##
class PDFPageInterpreter(object):
    """Processor for the content of a PDF page"""

    def __init__(self, rsrcmgr, device):
        self.rsrcmgr = rsrcmgr
setup.py

@@ -13,7 +13,10 @@ setup(
        'six',
        'sortedcontainers',
    ],
    extras_require={"dev": ["nose", "tox"]},
    extras_require={
        "dev": ["nose", "tox"],
        "docs": ["sphinx", "sphinx-argparse"],
    },
    description='PDF parser and analyzer',
    long_description=package.__doc__,
    license='MIT/X',
@@ -240,51 +240,51 @@ def create_parser():
        help='One or more paths to PDF files.')

    parser.add_argument(
        '-d', '--debug', default=False, action='store_true',
        '--debug', '-d', default=False, action='store_true',
        help='Use debug logging level.')
    procedure_parser = parser.add_mutually_exclusive_group()
    procedure_parser.add_argument(
        '-T', '--extract-toc', default=False, action='store_true',
        '--extract-toc', '-T', default=False, action='store_true',
        help='Extract structure of outline')
    procedure_parser.add_argument(
        '-E', '--extract-embedded', type=str,
        '--extract-embedded', '-E', type=str,
        help='Extract embedded files')

    parse_params = parser.add_argument_group(
        'Parser', description='Used during PDF parsing')
    parse_params.add_argument(
        "--page-numbers", type=int, default=None, nargs="+",
        help="A space-separated list of page numbers to parse.")
        '--page-numbers', type=int, default=None, nargs='+',
        help='A space-separated list of page numbers to parse.')
    parse_params.add_argument(
        "-p", "--pagenos", type=str,
        help="A comma-separated list of page numbers to parse. Included for "
             "legacy applications, use --page-numbers for more idiomatic "
             "argument entry.")
        '--pagenos', '-p', type=str,
        help='A comma-separated list of page numbers to parse. Included for '
             'legacy applications, use --page-numbers for more idiomatic '
             'argument entry.')
    parse_params.add_argument(
        '-i', '--objects', type=str,
        '--objects', '-i', type=str,
        help='Comma separated list of object numbers to extract')
    parse_params.add_argument(
        '-a', '--all', default=False, action='store_true',
        '--all', '-a', default=False, action='store_true',
        help='If the structure of all objects should be extracted')
    parse_params.add_argument(
        '-P', '--password', type=str, default='',
        '--password', '-P', type=str, default='',
        help='The password to use for decrypting PDF file.')

    output_params = parser.add_argument_group(
        'Output', description='Used during output generation.')
    output_params.add_argument(
        '-o', '--outfile', type=str, default='-',
        '--outfile', '-o', type=str, default='-',
        help='Path to file where output is written. Or "-" (default) to '
             'write to stdout.')
    codec_parser = output_params.add_mutually_exclusive_group()
    codec_parser.add_argument(
        '-r', '--raw-stream', default=False, action='store_true',
        '--raw-stream', '-r', default=False, action='store_true',
        help='Write stream objects without encoding')
    codec_parser.add_argument(
        '-b', '--binary-stream', default=False, action='store_true',
        '--binary-stream', '-b', default=False, action='store_true',
        help='Write stream objects with binary encoding')
    codec_parser.add_argument(
        '-t', '--text-stream', default=False, action='store_true',
        '--text-stream', '-t', default=False, action='store_true',
        help='Write stream objects as plain text')

    return parser
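The mutually exclusive groups in the parser above make argparse itself reject conflicting flags, so the tool never has to untangle, say, --raw-stream combined with --text-stream. A minimal sketch of the same pattern (hypothetical program and flag names):

```python
import argparse

parser = argparse.ArgumentParser(prog='demo')
group = parser.add_mutually_exclusive_group()
group.add_argument('--raw-stream', '-r', action='store_true')
group.add_argument('--text-stream', '-t', action='store_true')

# One flag at a time parses fine:
args = parser.parse_args(['-r'])
print(args.raw_stream)  # True

# Passing both flags would make parse_args() exit with a
# "not allowed with argument" error, so callers never see both set.
```

add_argument_group, used for the Parser and Output sections, only affects how --help is rendered; add_mutually_exclusive_group additionally enforces the constraint.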
@@ -1,15 +1,9 @@
#!/usr/bin/env python

"""
Converts PDF text content (though not images containing text) to plain text, html, xml or "tags".
"""
"""A command line tool for extracting text and images from PDF and outputting it to plain text, html, xml or tags."""
import argparse
import logging
import six
import sys
import six

import pdfminer.settings
pdfminer.settings.STRICT = False
import pdfminer.high_level
import pdfminer.layout
from pdfminer.image import ImageWriter

@@ -73,28 +67,68 @@ def extract_text(files=[], outfile='-',

def maketheparser():
    parser = argparse.ArgumentParser(description=__doc__, add_help=True)
    parser.add_argument("files", type=str, default=None, nargs="+", help="File to process.")
    parser.add_argument("-d", "--debug", default=False, action="store_true", help="Debug output.")
    parser.add_argument("-p", "--pagenos", type=str, help="Comma-separated list of page numbers to parse. Included for legacy applications, use --page-numbers for more idiomatic argument entry.")
    parser.add_argument("--page-numbers", type=int, default=None, nargs="+", help="Alternative to --pagenos with space-separated numbers; supersedes --pagenos where it is used.")
    parser.add_argument("-m", "--maxpages", type=int, default=0, help="Maximum pages to parse")
    parser.add_argument("-P", "--password", type=str, default="", help="Decryption password for PDF")
    parser.add_argument("-o", "--outfile", type=str, default="-", help="Output file (default \"-\" is stdout)")
    parser.add_argument("-t", "--output_type", type=str, default="text", help="Output type: text|html|xml|tag (default is text)")
    parser.add_argument("-c", "--codec", type=str, default="utf-8", help="Text encoding")
    parser.add_argument("-s", "--scale", type=float, default=1.0, help="Scale")
    parser.add_argument("-A", "--all-texts", default=None, action="store_true", help="LAParams all texts")
    parser.add_argument("-V", "--detect-vertical", default=None, action="store_true", help="LAParams detect vertical")
    parser.add_argument("-W", "--word-margin", type=float, default=None, help="LAParams word margin")
    parser.add_argument("-M", "--char-margin", type=float, default=None, help="LAParams char margin")
    parser.add_argument("-L", "--line-margin", type=float, default=None, help="LAParams line margin")
    parser.add_argument("-F", "--boxes-flow", type=float, default=None, help="LAParams boxes flow")
    parser.add_argument("-Y", "--layoutmode", default="normal", type=str, help="HTML Layout Mode")
    parser.add_argument("-n", "--no-laparams", default=False, action="store_true", help="Pass None as LAParams")
    parser.add_argument("-R", "--rotation", default=0, type=int, help="Rotation")
    parser.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
    parser.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching")
    parser.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode")
    parser.add_argument("files", type=str, default=None, nargs="+", help="One or more paths to PDF files.")

    parser.add_argument("--debug", "-d", default=False, action="store_true",
                        help="Use debug logging level.")
    parser.add_argument("--disable-caching", "-C", default=False, action="store_true",
                        help="If caching of resources, such as fonts, should be disabled.")

    parse_params = parser.add_argument_group('Parser', description='Used during PDF parsing')
    parse_params.add_argument("--page-numbers", type=int, default=None, nargs="+",
                              help="A space-separated list of page numbers to parse.")
    parse_params.add_argument("--pagenos", "-p", type=str,
                              help="A comma-separated list of page numbers to parse. Included for legacy applications, "
                                   "use --page-numbers for more idiomatic argument entry.")
    parse_params.add_argument("--maxpages", "-m", type=int, default=0,
                              help="The maximum number of pages to parse.")
    parse_params.add_argument("--password", "-P", type=str, default="",
                              help="The password to use for decrypting PDF file.")
    parse_params.add_argument("--rotation", "-R", default=0, type=int,
                              help="The number of degrees to rotate the PDF before other types of processing.")

    la_params = parser.add_argument_group('Layout analysis', description='Used during layout analysis.')
    la_params.add_argument("--no-laparams", "-n", default=False, action="store_true",
                           help="If layout analysis parameters should be ignored.")
    la_params.add_argument("--detect-vertical", "-V", default=False, action="store_true",
                           help="If vertical text should be considered during layout analysis")
    la_params.add_argument("--char-margin", "-M", type=float, default=2.0,
                           help="If two characters are closer together than this margin they are considered to be part "
                                "of the same word. The margin is specified relative to the width of the character.")
    la_params.add_argument("--word-margin", "-W", type=float, default=0.1,
                           help="If two words are closer together than this margin they are considered to be part "
                                "of the same line. A space is added in between for readability. The margin is "
                                "specified relative to the width of the word.")
    la_params.add_argument("--line-margin", "-L", type=float, default=0.5,
                           help="If two lines are close together they are considered to be part of the same "
                                "paragraph. The margin is specified relative to the height of a line.")
    la_params.add_argument("--boxes-flow", "-F", type=float, default=0.5,
                           help="Specifies how much a horizontal and vertical position of a text matters when "
                                "determining the order of lines. The value should be within the range of -1.0 (only "
                                "horizontal position matters) to +1.0 (only vertical position matters).")
    la_params.add_argument("--all-texts", "-A", default=True, action="store_true",
                           help="If layout analysis should be performed on text in figures.")

    output_params = parser.add_argument_group('Output', description='Used during output generation.')
    output_params.add_argument("--outfile", "-o", type=str, default="-",
                               help="Path to file where output is written. Or \"-\" (default) to write to stdout.")
    output_params.add_argument("--output_type", "-t", type=str, default="text",
                               help="Type of output to generate {text,html,xml,tag}.")
    output_params.add_argument("--codec", "-c", type=str, default="utf-8",
                               help="Text encoding to use in output file.")
    output_params.add_argument("--output-dir", "-O", default=None,
                               help="The output directory to put extracted images in. If not given, images are not "
                                    "extracted.")
    output_params.add_argument("--layoutmode", "-Y", default="normal", type=str,
                               help="Type of layout to use when generating html {normal,exact,loose}. If normal, "
                                    "each line is positioned separately in the html. If exact, each character is "
                                    "positioned separately in the html. If loose, same result as normal but with an "
                                    "additional newline after each text line. Only used when output_type is html.")
    output_params.add_argument("--scale", "-s", type=float, default=1.0,
                               help="The amount of zoom to use when generating html file. Only used when output_type "
                                    "is html.")
    output_params.add_argument("--strip-control", "-S", default=False, action="store_true",
                               help="Remove control statement from text. Only used when output_type is xml.")
    return parser
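The --word-margin help text above describes a threshold rule: a space is inserted between two adjacent characters when the horizontal gap between them exceeds the margin times the character width. A toy illustration of that rule (simplified numbers and a hypothetical helper, not pdfminer's actual implementation):

```python
def needs_space(prev_x1, next_x0, char_width, word_margin=0.1):
    """Insert a space when the gap exceeds word_margin * char_width."""
    return (next_x0 - prev_x1) > word_margin * char_width


# Width 10 with word_margin 0.1 -> threshold gap of 1.0 unit.
print(needs_space(100.0, 100.5, 10.0))  # False: gap 0.5, same word
print(needs_space(100.0, 103.0, 10.0))  # True: gap 3.0, space inserted
```

--char-margin and --line-margin apply the same relative-threshold idea to grouping characters into lines and lines into paragraphs.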
tox.ini

@@ -1,6 +1,11 @@
[tox]
envlist = py{26, 27, 34, 35, 36}
envlist = py{27,34,35,36,37,38}

[testenv]
extras = dev
commands = nosetests --nologcapture
extras =
    dev
    docs
commands =
    nosetests --nologcapture
    python -m sphinx -b html docs/source docs/build/html
    python -m sphinx -b doctest docs/source docs/build/doctest