<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PDFMiner</title>
</head>
<body>

<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Wed Jun 25 10:27:52 UTC 2014
<!-- hhmts end -->
</div>

<h1>PDFMiner</h1>
<p>
Python PDF parser and analyzer

<p>
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
<a href="#changes">Recent Changes</a>
<a href="programming.html">PDFMiner API</a>

<ul>
<li> <a href="#intro">What's It?</a>
<li> <a href="#download">Download</a>
<li> <a href="#wheretoask">Where to Ask</a>
<li> <a href="#install">How to Install</a>
<ul>
<li> <a href="#cmap">CJK languages support</a>
</ul>
<li> <a href="#tools">Command Line Tools</a>
<ul>
<li> <a href="#pdf2txt">pdf2txt.py</a>
<li> <a href="#dumppdf">dumppdf.py</a>
<li> <a href="programming.html">PDFMiner API</a>
</ul>
<li> <a href="#changes">Changes</a>
<li> <a href="#todo">TODO</a>
<li> <a href="#related">Related Projects</a>
<li> <a href="#license">Terms and Conditions</a>
</ul>

<h2><a name="intro">What's It?</a></h2>
<p>
PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data. PDFMiner allows one to obtain
the exact location of text in a page, as well as
other information such as fonts or lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for purposes other than text analysis.

<p>
<h3>Features</h3>
<ul>
<li> Written entirely in Python (version 2.6 or newer).
<li> Parse, analyze, and convert PDF documents.
<li> PDF-1.7 specification support. (well, almost)
<li> CJK languages and vertical writing scripts support.
<li> Various font types (Type1, TrueType, Type3, and CID) support.
<li> Basic encryption (RC4) support.
<li> PDF to HTML conversion (with a sample converter web app).
<li> Outline (TOC) extraction.
<li> Tagged contents extraction.
<li> Reconstruct the original layout by grouping text chunks.
</ul>
<p>
PDFMiner is about 20 times slower than
C/C++-based counterparts such as XPdf.

<P>
<strong>Online Demo:</strong> (pdf -&gt; html conversion webapp)<br>
<a href="http://pdf2html.tabesugi.net:8080/">
http://pdf2html.tabesugi.net:8080/
</a>

<h3><a name="download">Download</a></h3>
<p>
<strong>Source distribution:</strong><br>
<a href="http://pypi.python.org/pypi/pdfminer/">
http://pypi.python.org/pypi/pdfminer/
</a>

<P>
<strong>GitHub:</strong><br>
<a href="https://github.com/euske/pdfminer/">
https://github.com/euske/pdfminer/
</a>

<h3><a name="wheretoask">Where to Ask</a></h3>
<p>
<strong>Questions and comments:</strong><br>
<a href="http://groups.google.com/group/pdfminer-users/">
http://groups.google.com/group/pdfminer-users/
</a>

<h2><a name="install">How to Install</a></h2>
<ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.6 or newer.
(<font color=red><strong>Python 3 is not supported.</strong></font>)
<li> Download the <a href="#download">PDFMiner source</a>.
<li> Unpack it.
<li> Run <code>setup.py</code> to install:<br>
<blockquote><pre>
# <strong>python setup.py install</strong>
</pre></blockquote>
<li> Run the following test:<br>
<blockquote><pre>
$ <strong>pdf2txt.py samples/simple1.pdf</strong>
Hello

World

Hello

World

H e l l o

W o r l d

H e l l o

W o r l d
</pre></blockquote>
<li> Done!
</ol>

<h3><a name="cmap">For CJK languages</a></h3>
<p>
In order to process CJK languages, you need to take an additional step
during installation:
<blockquote><pre>
# <strong>make cmap</strong>
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
<em>(this may take several minutes)</em>

# <strong>python setup.py install</strong>
</pre></blockquote>

<p>
On Windows machines that do not have the <code>make</code> command,
paste the following commands at a command prompt:
<blockquote><pre>
<strong>mkdir pdfminer\cmap</strong>
<strong>python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt</strong>
<strong>python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt</strong>
<strong>python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt</strong>
<strong>python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt</strong>
<strong>python setup.py install</strong>
</pre></blockquote>
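<p>
The four conversion commands above follow one pattern, so they can also be generated from a small table.
The following is a minimal sketch, not part of PDFMiner: the function name <code>cmap_commands</code> and
the table <code>CMAP_CODECS</code> are hypothetical, and it assumes the script is run from the top of the
unpacked source tree. The registry names and codec pairs are copied from the commands above.

```python
import os
import sys

# Registry name and its "CMapName=codec" pairs, copied from the commands above.
CMAP_CODECS = [
    ("Adobe-CNS1", ["B5=cp950", "UniCNS-UTF8=utf-8"]),
    ("Adobe-GB1", ["GBK-EUC=cp936", "UniGB-UTF8=utf-8"]),
    ("Adobe-Japan1", ["RKSJ=cp932", "EUC=euc-jp", "UniJIS-UTF8=utf-8"]),
    ("Adobe-Korea1", ["KSC-EUC=euc-kr", "KSC-Johab=johab",
                      "KSCms-UHC=cp949", "UniKS-UTF8=utf-8"]),
]


def cmap_commands(python=sys.executable):
    """Build one conv_cmap.py argument list per CJK registry (hypothetical helper)."""
    outdir = os.path.join("pdfminer", "cmap")
    commands = []
    for registry, codecs in CMAP_CODECS:
        args = [python, os.path.join("tools", "conv_cmap.py")]
        for pair in codecs:
            args.extend(["-c", pair])
        # The cid2code file name uses an underscore where the registry has a hyphen.
        cid2code = "cid2code_%s.txt" % registry.replace("-", "_")
        args.extend([outdir, registry, os.path.join("cmaprsrc", cid2code)])
        commands.append(args)
    return commands


if __name__ == "__main__":
    # Only print the commands; running them is left to the caller.
    for args in cmap_commands():
        print(" ".join(args))
```

<p>
Each argument list could be handed to <code>subprocess.call</code> once the <code>pdfminer/cmap</code>
directory exists; the sketch only prints the commands so that they can be inspected first.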

<h2><a name="tools">Command Line Tools</a></h2>
<p>
PDFMiner comes with two handy tools:
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.

<h3><a name="pdf2txt">pdf2txt.py</a></h3>
<p>
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all text that is rendered programmatically,
i.e. text represented as ASCII or Unicode strings.
It cannot recognize text drawn as images, which would require optical character recognition.
It also extracts the corresponding location, font name, font size, and writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for a protected PDF document whose access is restricted.
You cannot extract any text from a PDF document that does not have extraction permission.

<p>
<strong>Note:</strong>
Not all characters in a PDF can be safely converted to Unicode.

<h4>Examples</h4>
<blockquote><pre>
$ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file whose filename is output.html)

$ <strong>pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf</strong>
(extract a Japanese HTML file in vertical writing; CMap is required)

$ <strong>pdf2txt.py -P mypassword -o output.txt secret.pdf</strong>
(extract text from an encrypted PDF file)
</pre></blockquote>
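<p>
When <code>pdf2txt.py</code> is driven from another script, the invocations above can be assembled as
argument lists. This is a minimal sketch under stated assumptions: the helper name
<code>pdf2txt_args</code> and its keyword names are hypothetical, and it only uses options that this
page documents (<code>-V</code>, <code>-c</code>, <code>-P</code>, <code>-p</code>, <code>-o</code>).

```python
def pdf2txt_args(pdf, output=None, password=None, pages=None,
                 codec=None, vertical=False):
    """Return an argument list for pdf2txt.py (hypothetical helper)."""
    args = ["pdf2txt.py"]
    if vertical:
        args.append("-V")                  # allow vertical writing detection
    if codec:
        args.extend(["-c", codec])         # output codec
    if password:
        args.extend(["-P", password])      # user password
    if pages:
        # -p takes a comma-separated list of one-based page numbers.
        args.extend(["-p", ",".join(str(n) for n in pages)])
    if output:
        args.extend(["-o", output])        # output file name
    args.append(pdf)
    return args


# Rebuilds the second example above.
print(pdf2txt_args("samples/jo.pdf", output="output.html",
                   codec="euc-jp", vertical=True))
```

<p>
The resulting list can be passed to <code>subprocess.call</code>; building a list rather than a single
string avoids quoting problems with file names that contain spaces.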

<h4>Options</h4>
<dl>
<dt> <code>-o <em>filename</em></code>
<dd> Specifies the output file name.
By default, it prints the extracted contents to stdout in text format.
<p>
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
<dd> Specifies the comma-separated list of the page numbers to be extracted.
Page numbers start at one.
By default, it extracts text from all the pages.
<p>
<dt> <code>-c <em>codec</em></code>
<dd> Specifies the output codec.
<p>
<dt> <code>-t <em>type</em></code>
<dd> Specifies the output format. The following formats are currently supported.
<ul>
<li> <code>text</code> : TEXT format. (Default)
<li> <code>html</code> : HTML format. Not recommended for extraction purposes because the markup is messy.
<li> <code>xml</code> : XML format. Provides the most information.
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
Tags used here are defined in the PDF specification (see &sect;10.7 "<em>Tagged PDF</em>").
</ul>
<p>
<dt> <code>-I <em>image_directory</em></code>
<dd> Specifies the output directory for image extraction.
Currently only JPEG images are supported.
<p>
<dt> <code>-M <em>char_margin</em></code>
<dt> <code>-L <em>line_margin</em></code>
<dt> <code>-W <em>word_margin</em></code>
<dd> These are the parameters used for layout analysis.
In an actual PDF file, a run of text might be split into several chunks
in the middle of a line, depending on the authoring software.
Therefore, text extraction needs to splice text chunks.
In the figure below, two text chunks whose distance is closer than
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) are considered
continuous and are grouped into one. Also, two lines whose distance is closer than
the <em>line_margin</em> (<em><font color="blue">L</font></em>) are grouped
as a text box, which is a rectangular area that contains a "cluster" of text portions.
Furthermore, it may be necessary to insert blank characters (spaces)
if the distance between two words is greater than the <em>word_margin</em>
(<em><font color="green">W</font></em>), as a blank between words might not be
represented as a space, but indicated by the positioning of each word.
<p>
Each value is specified not as an actual length, but as a proportion of
the length to the size of each character in question. The default values
are M = 2.0, L = 0.5, and W = 0.1, respectively.
<table style="border:2px gray solid; margin: 10px; padding: 10px;"><tr>
<td style="border-right:1px red solid" align=right>&rarr;</td>
<td style="border-left:1px red solid" colspan="4" align=left>&larr; <em><font color="red">M</font></em></td>
<td></td>
</tr><tr>
<td style="border:1px solid"><code>Q u i</code></td>
<td style="border:1px solid"><code>c k</code></td>
<td width="10px"></td>
<td style="border:1px solid"><code>b r o w</code></td>
<td style="border:1px solid"><code>n f o x</code></td>
<td style="border-bottom:1px blue solid" align=right>&darr;</td>
</tr><tr>
<td style="border-right:1px green solid" colspan="2" align=right>&rarr;</td><td></td>
<td style="border-left:1px green solid" colspan="2" align=left>&larr; <em><font color="green">W</font></em></td>
<td rowspan="2" valign=center align=center><em><font color="blue">L</font></em></td>
</tr><tr height="10px">
</tr><tr>
<td style="padding:0px;" colspan="5">
<table style="border:1px solid"><tr><td><code>j u m p s</code></td><td>...</td></tr></table>
</td>
<td style="border-top:1px blue solid" align=right>&uarr;</td>
</tr></table>
<p>
<dt> <code>-F <em>boxes_flow</em></code>
<dd> Specifies how much the horizontal and vertical positions of a text matter
when determining the text order. The value should be within the range of
-1.0 (only horizontal position matters) to +1.0 (only vertical position matters).
The default value is 0.5.
<p>
<dt> <code>-C</code>
<dd> Suppresses object caching.
This reduces memory consumption but also slows down the process.
<p>
<dt> <code>-n</code>
<dd> Suppresses layout analysis.
<p>
<dt> <code>-A</code>
<dd> Forces layout analysis to be performed on all text strings,
including text contained in figures.
<p>
<dt> <code>-V</code>
<dd> Allows vertical writing detection.
<p>
<dt> <code>-Y <em>layout_mode</em></code>
<dd> Specifies how the page layout should be preserved. (Currently only applies to HTML format.)
<ul>
<li> <code>exact</code> : preserve the exact location of each individual character (a large and messy HTML).
<li> <code>normal</code> : preserve the location and line breaks in each text block. (Default)
<li> <code>loose</code> : preserve the overall location of each text block.
</ul>
<p>
<dt> <code>-E <em>extractdir</em></code>
<dd> Specifies the extraction directory of embedded files.
<p>
<dt> <code>-s <em>scale</em></code>
<dd> Specifies the output scale. Can be used in HTML format only.
<p>
<dt> <code>-m <em>maxpages</em></code>
<dd> Specifies the maximum number of pages to extract.
By default, it extracts all the pages in a document.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
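<p>
The role of <em>char_margin</em> and <em>word_margin</em> described under the <code>-M</code> and
<code>-W</code> options can be illustrated with a deliberately simplified model. This sketch is not
PDFMiner's layout code: it assumes that all chunks lie on one horizontal line, that every character has
the same size, and that a chunk is a hypothetical <code>(x0, x1, text)</code> tuple. Line grouping by
<em>line_margin</em> is left out.

```python
def group_chunks(chunks, size, char_margin=2.0, word_margin=0.1):
    """Splice (x0, x1, text) chunks on one line into strings (simplified model)."""
    groups = []
    prev_x1 = None
    for x0, x1, text in sorted(chunks):
        gap = None if prev_x1 is None else x0 - prev_x1
        if gap is None or gap >= char_margin * size:
            # First chunk, or too far from the previous one: start a new group.
            groups.append(text)
        elif gap > word_margin * size:
            # Close enough to continue, but wide enough to count as a word break.
            groups[-1] += " " + text
        else:
            # Practically adjacent: the same word continues.
            groups[-1] += text
        prev_x1 = x1
    return groups


# Margins are proportions of the character size (10.0 here), as described above.
chunks = [(0.0, 30.0, "Qui"), (30.5, 50.0, "ck"),
          (55.0, 105.0, "brown"), (140.0, 170.0, "fox")]
print(group_chunks(chunks, 10.0))
```

<p>
With the default proportions, a gap of 0.5 joins "Qui" and "ck", a gap of 5 inserts a space before
"brown", and a gap of 35 exceeds the char_margin and starts a new group.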

<hr noshade>

<h3><a name="dumppdf">dumppdf.py</a></h3>
<p>
<code>dumppdf.py</code> dumps the internal contents of a PDF file
in pseudo-XML format. This program is primarily for debugging purposes,
but it can also extract some meaningful contents
(such as images).

<h4>Examples</h4>
<blockquote><pre>
$ <strong>dumppdf.py -a foo.pdf</strong>
(dump all the headers and contents, except stream objects)

$ <strong>dumppdf.py -T foo.pdf</strong>
(dump the table of contents)

$ <strong>dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image)
</pre></blockquote>

<h4>Options</h4>
<dl>
<dt> <code>-a</code>
<dd> Dumps all the objects.
By default, it only prints the document trailer (like a header).
<p>
<dt> <code>-i <em>objno,objno, ...</em></code>
<dd> Specifies PDF object IDs to display.
Comma-separated IDs or multiple <code>-i</code> options are accepted.
<p>
<dt> <code>-p <em>pageno,pageno, ...</em></code>
<dd> Specifies the page numbers to be extracted.
Comma-separated page numbers or multiple <code>-p</code> options are accepted.
Note that page numbers start at one, not zero.
<p>
<dt> <code>-r</code> (raw)
<dt> <code>-b</code> (binary)
<dt> <code>-t</code> (text)
<dd> Specifies the output format of stream contents.
Because the contents of stream objects can be very large,
they are omitted when none of the options above is specified.
<p>
With the <code>-r</code> option, the "raw" stream contents are dumped without decompression.
With the <code>-b</code> option, the decompressed contents are dumped as a binary blob.
With the <code>-t</code> option, the decompressed contents are dumped in a text format
similar to the output of <code>repr()</code>. When the
<code>-r</code> or <code>-b</code> option is given,
no stream header is displayed, which makes it easier to save the output to a file.
<p>
<dt> <code>-T</code>
<dd> Shows the table of contents.
<p>
<dt> <code>-E <em>directory</em></code>
<dd> Extracts embedded files from the PDF into the given directory.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
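<p>
The difference between the <code>-r</code>, <code>-b</code>, and <code>-t</code> stream formats can be
mimicked with the standard <code>zlib</code> module. This is only an analogy, not dumppdf.py's code:
the function <code>dump_stream</code> and its mode names are hypothetical, and it assumes a stream
compressed with a single Flate (zlib) filter, whereas real PDF streams may use other filters.

```python
import zlib


def dump_stream(rawdata, mode):
    """Render a zlib-compressed stream roughly the way -r, -b and -t differ."""
    if mode == "raw":
        return rawdata                  # bytes as stored, still compressed
    data = zlib.decompress(rawdata)
    if mode == "binary":
        return data                     # decompressed bytes, unchanged
    if mode == "text":
        return repr(data)               # escaped, printable representation
    raise ValueError("unknown mode: %r" % mode)


# A tiny stand-in for a page content stream.
stored = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET\n")
print(dump_stream(stored, "text"))
```

<p>
The raw and binary forms are byte strings suitable for writing straight to a file, which matches the
remark above that no stream header is printed for <code>-r</code> and <code>-b</code>.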

<h2><a name="changes">Changes</a></h2>
<ul>
<li> 2014/03/28: Further bugfixes.
<li> 2014/03/24: Bugfixes and improvements for faulty PDFs.<br>
API changes:
<ul>
<li> The <code>PDFDocument.initialize()</code> method is removed and no longer needed.
A password is given as an argument of the <code>PDFDocument</code> constructor.
</ul>
<li> 2013/11/13: Bugfixes and minor improvements.<br>
As of November 2013, the PDFMiner API differs in a few ways from the
versions prior to October 2013, as a result of code restructuring. Here
is a list of the changes:
<ul>
<li> The <code>PDFDocument</code> class is moved to <code>pdfdocument.py</code>.
<li> The <code>PDFDocument</code> class now takes a <code>PDFParser</code> object as an argument.
<li> <code>PDFDocument.set_parser()</code> and <code>PDFParser.set_document()</code> are removed.
<li> The <code>PDFPage</code> class is moved to <code>pdfpage.py</code>.
<li> The <code>process_pdf</code> function is now implemented as <code>PDFPage.get_pages</code>.
</ul>
<li> 2013/10/22: Sudden resurgence of interest. API changes.
Incorporated a lot of patches and robust handling of broken PDFs.
<li> 2011/05/15: Speed improvements for layout analysis.
<li> 2011/05/15: API changes. <code>LTText.get_text()</code> is added.
<li> 2011/04/20: API changes. LTPolygon class was renamed as LTCurve.
<li> 2011/04/20: LTLine now represents horizontal/vertical lines only. Thanks to Koji Nakagawa.
<li> 2011/03/07: Documentation improvements by Jakub Wilk. Memory usage patch by Jonathan Hunt.
<li> 2011/02/27: Bugfixes and layout analysis improvements. Thanks to fujimoto.report.
<li> 2010/12/26: A couple of bugfixes and minor improvements. Thanks to Kevin Brubeck Unhammer and Daniel Gerber.
<li> 2010/10/17: A couple of bugfixes and minor improvements. Thanks to standardabweichung and Alastair Irving.
<li> 2010/09/07: A minor bugfix. Thanks to Alexander Garden.
<li> 2010/08/29: A couple of bugfixes. Thanks to Sahan Malagi, pk, and Humberto Pereira.
<li> 2010/07/06: Minor bugfixes. Thanks to Federico Brega.
<li> 2010/06/13: Bugfixes and improvements on CMap data compression. Thanks to Jakub Wilk.
<li> 2010/04/24: Bugfixes and improvements on TOC extraction. Thanks to Jose Maria.
<li> 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.
<li> 2010/03/22: Improved layout analysis. Added regression tests.
<li> 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.
<li> 2010/02/27: Changed the way of internal layout handling. (LTTextItem -&gt; LTChar)
<li> 2010/02/15: Several bugfixes. Thanks to Sean.
<li> 2010/02/13: Bugfix and enhancement. Thanks to Andr&eacute; Auzi.
<li> 2010/02/07: Several bugfixes. Thanks to Hiroshi Manabe.
<li> 2010/01/31: JPEG image extraction supported. Page rotation bug fixed.
<li> 2010/01/04: Python 2.6 warning removal. More doctest conversion.
<li> 2010/01/01: CMap bug fix. Thanks to Winfried Plappert.
<li> 2009/12/24: RunLengthDecode filter added. Thanks to Troy Bollinger.
<li> 2009/12/20: Experimental polygon shape extraction added. Thanks to Yusuf Dewaswala for reporting.
<li> 2009/12/19: CMap resources are now part of the package. Thanks to Adobe for open-sourcing them.
<li> 2009/11/29: Password encryption bug fixed. Thanks to Yannick Gingras.
<li> 2009/10/31: SGML output format is changed and renamed as XML.
<li> 2009/10/24: Charspace bug fixed. Adjusted for 4-space indentation.
<li> 2009/10/04: Another matrix operation bug fixed. Thanks to Vitaly Sedelnik.
<li> 2009/09/12: Fixed rectangle handling. Able to extract image boundaries.
<li> 2009/08/30: Fixed page rotation handling.
<li> 2009/08/26: Fixed zlib decoding bug. Thanks to Shon Urbas.
<li> 2009/08/24: Fixed a bug in character placing. Thanks to Pawan Jain.
<li> 2009/07/21: Improvement in layout analysis.
<li> 2009/07/11: Improvement in layout analysis. Thanks to Lubos Pintes.
<li> 2009/05/17: Bugfixes, massive code restructuring, and simple graphic element support added. setup.py is supported.
<li> 2009/03/30: Text output mode added.
<li> 2009/03/25: Encoding problems fixed. Word splitting option added.
<li> 2009/02/28: Robust handling of corrupted PDFs. Thanks to Troy Bollinger.
<li> 2009/02/01: Various bugfixes. Thanks to Hiroshi Manabe.
<li> 2009/01/17: Handling a trailer correctly that contains both /XrefStm and /Prev entries.
<li> 2009/01/10: Handling Type3 font metrics correctly.
<li> 2008/12/28: Better handling of word spacing. Thanks to Christian Nentwich.
<li> 2008/09/06: A sample pdf2html webapp added.
<li> 2008/08/30: ASCII85 encoding filter support.
<li> 2008/07/27: Tagged contents extraction support.
<li> 2008/07/10: Outline (TOC) extraction support.
<li> 2008/06/29: HTML output added. Reorganized the directory structure.
<li> 2008/04/29: Bugfix for Win32. Thanks to Chris Clark.
<li> 2008/04/27: Basic encryption and LZW decoding support added.
<li> 2008/01/07: Several bugfixes. Thanks to Nick Fabry for his vast contribution.
<li> 2007/12/31: Initial release.
<li> 2004/12/24: Start writing the code out of boredom...
</ul>
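<p>
The API changes listed for 2013/11/13 and 2014/03/24 suggest the calling pattern sketched below.
It is an illustration rather than official sample code: the function name <code>iter_pages</code> is
hypothetical, the imports are placed inside the function so that the module loads even where PDFMiner
is not installed, and <code>PDFPage.create_pages</code> is used on the assumption that a release
containing these changes is installed.

```python
def iter_pages(path, password=""):
    """Yield PDFPage objects from a PDF file (sketch of the restructured API)."""
    # Module locations follow the change list: pdfdocument.py and pdfpage.py.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage

    with open(path, "rb") as fp:
        parser = PDFParser(fp)
        # The parser is passed to the constructor, together with the password;
        # no set_parser(), set_document() or initialize() calls are made.
        document = PDFDocument(parser, password)
        for page in PDFPage.create_pages(document):
            yield page
```

<p>
A caller would loop over <code>iter_pages("samples/simple1.pdf")</code> and hand each page to an
interpreter; see the <a href="programming.html">PDFMiner API</a> page for the authoritative usage.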

<h2><a name="todo">TODO</a></h2>
<ul>
<li> <a href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
<li> Better documentation.
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
<li> Crypt stream filter support. (More sample documents are needed!)
</ul>

<h2><a name="related">Related Projects</a></h2>
<ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
<li> <a href="http://mupdf.com/">mupdf</a>
</ul>

<h2><a name="license">Terms and Conditions</a></h2>
<p>
(This is the so-called MIT/X License.)
<p>
<small>
Copyright (c) 2004-2013 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
<p>
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>

<hr noshade>
<address>Yusuke Shinyama (yusuke at cs dot nyu dot edu)</address>
</body>