428 lines
15 KiB
HTML
428 lines
15 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
<title>PDFMiner</title>
|
|
<style type="text/css"><!--
|
|
blockquote { background: #eeeeee; }
|
|
--></style>
|
|
</head><body>
|
|
|
|
<h1>PDFMiner</h1>
|
|
<p>
|
|
Python PDF parser and analyzer
|
|
|
|
<p>
|
|
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
|
|
|
|
<a href="#changes">Recent Changes</a>
|
|
|
|
<div align=right class=lastmod>
|
|
<!-- hhmts start -->
|
|
Last Modified: Sat Oct 31 12:03:49 JST 2009
|
|
<!-- hhmts end -->
|
|
</div>
|
|
|
|
<ul>
|
|
<li> <a href="#intro">What's It?</a>
|
|
<li> <a href="#source">Download</a>
|
|
<li> <a href="#install">Install</a>
|
|
<small>(<a href="#cmap">for non-ASCII languages</a>)</small>
|
|
<li> <a href="#usage">How to Use</a>
|
|
<small>(<a href="#pdf2txt">pdf2txt.py</a>, <a href="#dumppdf">dumppdf.py</a>)</small>
|
|
<li> <a href="#todos">TODOs</a>
|
|
<li> <a href="#changes">Changes</a>
|
|
<li> <a href="#related">Related Projects</a>
|
|
<li> <a href="#license">Terms and Conditions</a>
|
|
</ul>
|
|
|
|
<a name="intro"></a>
|
|
<hr noshade>
|
|
<h2>What's It?</h2>
|
|
<p>
|
|
PDFMiner is a suite of programs that help
|
|
extracting and analyzing text data of PDF documents.
|
|
Unlike other PDF-related tools, it allows to obtain
|
|
the exact location of texts in a page, as well as
|
|
other extra information such as font information or ruled lines.
|
|
It includes a PDF converter that can transform PDF files
|
|
into other text formats (such as HTML). It has an extensible
|
|
PDF parser that can be used for other purposes instead of text analysis.
|
|
<p>
|
|
<strong>Features:</strong>
|
|
<ul>
|
|
<li> Written entirely in Python. (for version 2.4 or newer)
|
|
<li> PDF-1.7 specification support. (well, almost)
|
|
<li> Non-ASCII languages and vertical writing scripts support.
|
|
<li> Various font types (Type1, TrueType, Type3, and CID) support.
|
|
<li> Basic encryption (RC4) support.
|
|
<li> PDF to HTML conversion (with a sample converter web app).
|
|
<li> Outline (TOC) extraction.
|
|
<li> Tagged contents extraction.
|
|
<li> Infer text running by using clustering technique.
|
|
</ul>
|
|
|
|
<a name="source"></a>
|
|
<p>
|
|
<strong>Download:</strong><br>
|
|
<a href="http://www.unixuser.org/~euske/python/pdfminer/pdfminer-20091024.tar.gz">
|
|
http://www.unixuser.org/~euske/python/pdfminer/pdfminer-20091024.tar.gz
|
|
</a>
|
|
(1.8Mbytes)
|
|
|
|
<p>
|
|
<strong>Discussion:</strong> (for questions and comments, post here)<br>
|
|
<a href="http://groups.google.com/group/pdfminer-users/">
|
|
http://groups.google.com/group/pdfminer-users/
|
|
</a>
|
|
|
|
<P>
|
|
<strong>View the source:</strong><br>
|
|
<a href="http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer">
|
|
http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
|
|
</a>
|
|
|
|
<P>
|
|
<strong>Online Demonstration:</strong> (pdf -> html conversion webapp)<br>
|
|
<a href="http://pdf2html.tabesugi.net:8080/">
|
|
http://pdf2html.tabesugi.net:8080/
|
|
</a>
|
|
|
|
<a name="install"></a>
|
|
<hr noshade>
|
|
<h2>Install</h2>
|
|
|
|
<ol>
|
|
<li> Install <a href="http://www.python.org/download/">Python</a> 2.4 or newer.
|
|
<li> Download the <a href="#source">PDFMiner source</a>.
|
|
<li> Extract it.
|
|
<li> Run <code>setup.py</code> to install:<br>
|
|
<blockquote><pre>
|
|
# <strong>python setup.py install</strong>
|
|
</pre></blockquote>
|
|
<li> Do the following test:<br>
|
|
<blockquote><pre>
|
|
$ <strong>pdf2txt.py samples/simple1.pdf</strong>
|
|
Hello
|
|
|
|
Hello
|
|
|
|
World
|
|
|
|
World
|
|
|
|
Hello
|
|
|
|
World
|
|
|
|
H e l l o
|
|
|
|
W o r l d
|
|
|
|
</pre></blockquote>
|
|
<li> Done!
|
|
</ol>
|
|
|
|
<p>
|
|
<a name="cmap"></a>
|
|
<h3>For non-ASCII languages</h3>
|
|
In order to handle non-ASCII languages (e.g. Japanese),
|
|
you need to install an additional data called <code>CMap</code>,
|
|
which is distributed from Adobe.
|
|
<p>
|
|
Here is how:
|
|
|
|
<ol>
|
|
<li> Get CMap files from
|
|
<a href="http://www.unixuser.org/~euske/pub/CMap.tar.bz2">
|
|
http://www.unixuser.org/~euske/pub/CMap.tar.bz2
|
|
</a>
|
|
<li> Expand the archive and put the <code>CMap</code> directory under the directory
|
|
where <code>pdfminer</code> is installed.
|
|
(Normally this should be something like <code>/usr/lib/python2.5/site-packages/pdfminer</code>.)
|
|
For example:
|
|
<blockquote><pre>
|
|
$ <strong>cd /usr/lib/python2.5/site-packages/pdfminer</strong>
|
|
$ <strong>tar jxf CMap.tar.bz2</strong>
|
|
</pre></blockquote>
|
|
<li> Do the following. (this is optional and may take several minutes, but highly recommended!)<br>
|
|
<blockquote><pre>
|
|
$ <strong>python -m pdfminer.cmap</strong>
|
|
</pre></blockquote>
|
|
</ol>
|
|
|
|
<a name="usage"></a>
|
|
<hr noshade>
|
|
<h2>How to Use</h2>
|
|
|
|
<p>
|
|
PDFMiner comes with two handy tools:
|
|
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.
|
|
|
|
<a name="pdf2txt"></a>
|
|
<h3>pdf2txt.py</h3>
|
|
<p>
|
|
<code>pdf2txt.py</code> extracts text contents from a PDF file.
|
|
It extracts all the texts that are to be rendered programmatically,
|
|
ie. text represented as ASCII or Unicode strings.
|
|
It cannot recognize texts drawn as images that would require optical character recognition.
|
|
It also extracts the corresponding locations, font names, font sizes, writing
|
|
direction (horizontal or vertical) for each text portion.
|
|
You need to provide a password for protected PDF documents when its access is restricted.
|
|
You cannot extract any text from a PDF document which does not have extraction permission.
|
|
<p>
|
|
For non-ASCII languages, you can specify the output encoding
|
|
(such as UTF-8).
|
|
<p>
|
|
<strong>Note:</strong> Not all characters in a PDF can be safely converted to Unicode.
|
|
|
|
<p>
|
|
Examples:
|
|
<blockquote><pre>
|
|
$ <strong>pdf2txt.py samples/naacl06-shinyama.pdf -o output.html</strong>
|
|
(extract text as an HTML file whose filename is output.html)
|
|
|
|
$ <strong>pdf2txt.py -c euc-jp samples/jo.pdf -D V -o output.html</strong>
|
|
(extract a Japanese HTML file in vertical writing, CMap is required)
|
|
|
|
$ <strong>pdf2txt.py -P mypassword secret.pdf -o output.txt</strong>
|
|
(extract a text from an encrypted PDF file)
|
|
</pre></blockquote>
|
|
|
|
<p>
|
|
Options:
|
|
<dl>
|
|
<dt> <code>-o <em>filename</em></code>
|
|
<dd> Specifies the output file name.
|
|
By default, it prints the extracted contents to stdout in text format.
|
|
<p>
|
|
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
|
|
<dd> Specifies the comma-separated list of the page numbers to be extracted.
|
|
Page numbers are starting from one.
|
|
By default, it extracts texts from all the pages.
|
|
<p>
|
|
<dt> <code>-c <em>codec</em></code>
|
|
<dd> Specifies the output codec for non-ASCII texts.
|
|
<p>
|
|
<dt> <code>-t <em>type</em></code>
|
|
<dd> Specifies the output format. The following formats are currently supported.
|
|
<ul>
|
|
<li> <code>text</code> : TEXT format. (Default)
|
|
<li> <code>html</code> : HTML format. Not recommended for extraction purpose because the markup is very messy.
|
|
<li> <code>xml</code> : XML format. Provides the most information available.
|
|
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
|
|
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
|
|
Tags used here are defined in the PDF specification (See §10.7 "<em>Tagged PDF</em>").
|
|
</ul>
|
|
<p>
|
|
<dt> <code>-D <em>direction</em></code>
|
|
<dt> <code>-M <em>char_margin</em></code>
|
|
<dt> <code>-L <em>line_margin</em></code>
|
|
<dt> <code>-W <em>word_margin</em></code>
|
|
<dd> These are the parameters used for layout analysis.
|
|
In an actual PDF file, texts might be split into several chunks
|
|
in the middle of its running, depending on the authoring software.
|
|
Therefore, text extraction needs to splice text chunks.
|
|
In the figure below, two text chunks whose distance is closer than
|
|
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
|
|
continuous and get grouped into one. Also, two lines whose distance is closer than
|
|
the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
|
|
as a text box, which is a rectangular area that contains a "cluster" of texts.
|
|
Furthermore, it may be required to insert blank characters (spaces) as necessary
|
|
if the distance between two words is greater than the <em>word_margin</em>
|
|
(<em><font color="green">W</font></em>), as a blank between words might not be
|
|
represented as a space, but indicated by the positioning of each word.
|
|
<p>
|
|
Each value is specified not as an actual length, but as a proportion of
|
|
the length to the size of each character in question. The default values
|
|
are M = 1.0, L = 0.3, and W = 0.2, respectively.
|
|
<table style="border:2px gray solid; margin: 10px; padding: 10px;"><tr>
|
|
<td style="border-right:1px red solid" align=right>→</td>
|
|
<td style="border-left:1px red solid" colspan="4" align=left>← <em><font color="red">M</font></em></td>
|
|
<td></td>
|
|
</tr><tr>
|
|
<td style="border:1px solid"><code>Q u i</code></td>
|
|
<td style="border:1px solid"><code>c k</code></td>
|
|
<td width="10px"></td>
|
|
<td style="border:1px solid"><code>b r o w</code></td>
|
|
<td style="border:1px solid"><code>n f o x</code></td>
|
|
<td style="border-bottom:1px blue solid" align=right>↓</td>
|
|
</tr><tr>
|
|
<td style="border-right:1px green solid" colspan="2" align=right>→</td><td></td>
|
|
<td style="border-left:1px green solid" colspan="2" align=left>← <em><font color="green">W</font></em></td>
|
|
<td rowspan="2" valign=center align=center><em><font color="blue">L</font></em></td>
|
|
</tr><tr height="10px">
|
|
</tr><tr>
|
|
<td style="padding:0px;" colspan="5">
|
|
<table style="border:1px solid"><tr><td><code>j u m p s</code></td><td>...</td></tr></table>
|
|
</td>
|
|
<td style="border-top:1px blue solid" align=right>↑</td>
|
|
</tr></table>
|
|
<p>
|
|
<dt> <code>-s <em>scale</em></code>
|
|
<dd> Specifies the output scale. Can be used in HTML format only.
|
|
<p>
|
|
<dt> <code>-m <em>maxpages</em></code>
|
|
<dd> Specifies the maximum number of pages to extract.
|
|
By default, it extracts all the pages in a document.
|
|
<p>
|
|
<dt> <code>-P <em>password</em></code>
|
|
<dd> Provides the user password to access PDF contents.
|
|
<p>
|
|
<dt> <code>-C <em>CMap directory</em></code>
|
|
<dd> Specifies the path of CMap directory. CMap is needed when extracting
|
|
non-ASCII texts (especially in Asian languages). The CMap location can be
|
|
also specified with <code>CMAP_PATH</code> environment variable.
|
|
<p>
|
|
<dt> <code>-d</code>
|
|
<dd> Increases the debug level.
|
|
</dl>
|
|
|
|
<a name="dumppdf"></a>
|
|
<h3>dumppdf.py</h3>
|
|
<p>
|
|
<code>dumppdf.py</code> dumps the internal contents of a PDF file
|
|
in pseudo-XML format. This program is primarily for debugging purpose,
|
|
but it's also possible to extract some meaningful contents
|
|
(such as images).
|
|
|
|
<p>
|
|
Examples:
|
|
<blockquote><pre>
|
|
$ <strong>dumppdf.py -a foo.pdf</strong>
|
|
(dump all the headers and contents, except stream objects)
|
|
|
|
$ <strong>dumppdf.py -T foo.pdf</strong>
|
|
(dump the table of contents)
|
|
|
|
$ <strong>dumppdf.py -r -i6 foo.pdf > pic.jpeg</strong>
|
|
(extract a JPEG image)
|
|
</pre></blockquote>
|
|
|
|
<p>
|
|
Options:
|
|
<dl>
|
|
<dt> <code>-a</code>
|
|
<dd> Instructs to dump all the objects.
|
|
By default, it only prints the document trailer (like a header).
|
|
<p>
|
|
<dt> <code>-i <em>objno,objno, ...</em></code>
|
|
<dd> Specifies PDF object IDs to display.
|
|
Comma-separated IDs, or multiple <code>-i</code> options are accepted.
|
|
<p>
|
|
<dt> <code>-p <em>pageno,pageno, ...</em></code>
|
|
<dd> Specifies the page number to be extracted.
|
|
Comma-separated page numbers, or multiple <code>-p</code> options are accepted.
|
|
Note that page numbers start from one, not zero.
|
|
<p>
|
|
<dt> <code>-r</code> (raw)
|
|
<dt> <code>-b</code> (binary)
|
|
<dt> <code>-t</code> (text)
|
|
<dd> Specifies the output format of stream contents.
|
|
Because the contents of stream objects can be very large,
|
|
they are omitted when none of the options above is specified.
|
|
<p>
|
|
With <code>-r</code> option, the "raw" stream contents are dumped without decompression.
|
|
With <code>-b</code> option, the decompressed contents are dumped as a binary blob.
|
|
With <code>-t</code> option, the decompressed contents are dumped in a text format,
|
|
similar to <code>repr()</code> manner. When
|
|
<code>-r</code> or <code>-b</code> option is given,
|
|
no stream header is displayed for the ease of saving it to a file.
|
|
<p>
|
|
<dt> <code>-T</code>
|
|
<dd> Shows the table of contents.
|
|
<p>
|
|
<dt> <code>-P <em>password</em></code>
|
|
<dd> Provides the user password to access PDF contents.
|
|
<p>
|
|
<dt> <code>-d</code>
|
|
<dd> Increases the debug level.
|
|
</dl>
|
|
|
|
<a name="todos"></a>
|
|
<hr noshade>
|
|
<h2>TODOs</h2>
|
|
<ul>
|
|
<li> PEP-8 conformance.
|
|
<li> Better text extraction / layout analysis.
|
|
<li> Better API Documentation.
|
|
<li> Robust error handling.
|
|
</ul>
|
|
|
|
<a name="changes"></a>
|
|
<hr noshade>
|
|
<h2>Changes</h2>
|
|
<ul>
|
|
<li> 2009/10/xx: Password encryption bug fixed. Thanks to Yannick Gingras.
|
|
<li> 2009/10/24: Charspace bug fixed. Adjusted for 4-space indentation.
|
|
<li> 2009/10/04: Another matrix operation bug fixed. Thanks to Vitaly Sedelnik.
|
|
<li> 2009/09/12: Fixed rectangle handling. Able to extract image boundaries.
|
|
<li> 2009/08/30: Fixed page rotation handling.
|
|
<li> 2009/08/26: Fixed zlib decoding bug. Thanks to Shon Urbas.
|
|
<li> 2009/08/24: Fixed a bug in character placing. Thanks to Pawan Jain.
|
|
<li> 2009/07/21: Improvement in layout analysis.
|
|
<li> 2009/07/11: Improvement in layout analysis. Thanks to Lubos Pintes.
|
|
<li> 2009/05/17: Bugfixes, massive code restructuring, and simple graphic element support added. setup.py is supported.
|
|
<li> 2009/03/30: Text output mode added.
|
|
<li> 2009/03/25: Encoding problems fixed. Word splitting option added.
|
|
<li> 2009/02/28: Robust handling of corrupted PDFs. Thanks to Troy Bollinger.
|
|
<li> 2009/02/01: Various bugfixes. Thanks to Hiroshi Manabe.
|
|
<li> 2009/01/17: Handling a trailer correctly that contains both /XrefStm and /Prev entries.
|
|
<li> 2009/01/10: Handling Type3 font metrics correctly.
|
|
<li> 2008/12/28: Better handling of word spacing. Thanks to Christian Nentwich.
|
|
<li> 2008/09/06: A sample pdf2html webapp added.
|
|
<li> 2008/08/30: ASCII85 encoding filter support.
|
|
<li> 2008/07/27: Tagged contents extraction support.
|
|
<li> 2008/07/10: Outline (TOC) extraction support.
|
|
<li> 2008/06/29: HTML output added. Reorganized the directory structure.
|
|
<li> 2008/04/29: Bugfix for Win32. Thanks to Chris Clark.
|
|
<li> 2008/04/27: Basic encryption and LZW decoding support added.
|
|
<li> 2008/01/07: Several bugfixes. Thanks to Nick Fabry for his vast contribution.
|
|
<li> 2007/12/31: Initial release.
|
|
<li> 2004/12/24: Start writing the code out of boredom...
|
|
</ul>
|
|
|
|
<a name="related"></a>
|
|
<hr noshade>
|
|
<h2>Related Projects</h2>
|
|
<ul>
|
|
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
|
|
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
|
|
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
|
|
</ul>
|
|
|
|
<a name="license"></a>
|
|
<hr noshade>
|
|
<h2>Terms and Conditions</h2>
|
|
<p>
|
|
(This is so-called MIT/X License)
|
|
<p>
|
|
<small>
|
|
Copyright (c) 2004-2009 Yusuke Shinyama <yusuke at cs dot nyu dot edu>
|
|
<p>
|
|
Permission is hereby granted, free of charge, to any person
|
|
obtaining a copy of this software and associated documentation
|
|
files (the "Software"), to deal in the Software without
|
|
restriction, including without limitation the rights to use,
|
|
copy, modify, merge, publish, distribute, sublicense, and/or
|
|
sell copies of the Software, and to permit persons to whom the
|
|
Software is furnished to do so, subject to the following
|
|
conditions:
|
|
<p>
|
|
The above copyright notice and this permission notice shall be
|
|
included in all copies or substantial portions of the Software.
|
|
<p>
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
|
|
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
|
|
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
|
|
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
|
|
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
|
|
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
|
|
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
|
</small>
|
|
|
|
<hr noshade>
|
|
<address>Yusuke Shinyama</address>
|
|
</body>
|