<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PDFMiner</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
--></style>
</head><body>

<h1>PDFMiner</h1>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sun Jun 29 19:58:40 JST 2008
<!-- hhmts end -->
</div>

<a name="intro"></a>
<hr noshade>
<h2>What Is It?</h2>
<p>
PDFMiner is a suite of programs that helps
analyze text data from PDF documents.
It includes a PDF parser, a PDF interpreter
(though only rendering text is supported for now),
and a couple of tools for extracting text.
Unlike other PDF-related tools, it lets you obtain
the exact location of text on a page, as well as
other layout information such as font size or font name,
which can be useful for analyzing the document.
<p>
<strong>Features:</strong>
<ul>
<li> Written entirely in Python (requires version 2.4 or newer).
<li> Roughly supports the PDF specification up to version 1.7.
<li> Supports non-ASCII languages and vertical writing scripts.
<li> Supports various font types (Type1, TrueType, Type3, and CID).
<li> Supports basic encryption (RC4).
</ul>

<p>
<strong>Homepage:</strong><br>
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">
http://www.unixuser.org/~euske/python/pdfminer/index.html
</a>

<a name="source"></a>
<p>
<strong>Download (source):</strong><br>
<a href="http://www.unixuser.org/~euske/python/pdfminer/pdfminer-dist-20080629.tar.gz">
http://www.unixuser.org/~euske/python/pdfminer/pdfminer-dist-20080629.tar.gz
</a>
(1.8 MB)

<p>
<strong>SVN repository:</strong><br>
<a href="http://pdfminerr.googlecode.com/svn/">
http://pdfminerr.googlecode.com/svn/
</a>

<a name="install"></a>
<hr noshade>
<h2>How to Install</h2>
<ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.4 or newer.
<li> Download the <a href="#source">PDFMiner source</a>.
<li> Extract it.
<li> Go to the <code>pdfminer</code> directory.
<li> Run the following test:<br>
<blockquote><pre>
$ <strong>python -m tools.pdf2txt samples/simple1.pdf</strong>
&lt;page id="0" bbox="0.000,0.000,612.000,792.000" rotate="0"&gt;
&lt;text font="Helvetica" direction="1" bbox="100.000,695.032,237.352,719.032" fontsize="24.000"&gt; Hello World &lt;/text&gt;
&lt;/page&gt;
</pre></blockquote>
<li> Done!
</ol>

<p>
<h3>For non-ASCII languages</h3>
In order to handle non-ASCII languages (e.g. Japanese),
you need to install additional data called <code>CMap</code>.
<p>
Here is how:

<ol>
<li> Get
<a href="http://www.unixuser.org/~euske/pub/CMap.tar.bz2">
http://www.unixuser.org/~euske/pub/CMap.tar.bz2
</a>
<li> Do the following:
<blockquote><pre>
$ <strong>tar jxf CMap.tar.bz2</strong>
</pre></blockquote>
<li> Put the <code>CMap</code> directory into the <code>pdfminer</code> directory.
<li> Go to the <code>pdfminer</code> directory.
<li> Do the following (this is optional but highly recommended):<br>
<blockquote><pre>
$ <strong>make cdbcmap</strong>
</pre></blockquote>
</ol>

<a name="usage"></a>
<hr noshade>
<h2>How to Use</h2>

<p>
PDFMiner comes with two programs:
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.

<a name="pdf2txt"></a>
<h3>pdf2txt.py</h3>
<p>
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the text that is rendered programmatically.
It also extracts the corresponding location, font name,
and font size for each text portion. However,
it cannot extract text embedded within images
(i.e. it does not do optical character recognition).
You can provide a password for protected PDF documents
whose access is restricted.
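<p>
The layout attributes can be read back with a few lines of Python.
The following is a minimal sketch and not part of PDFMiner. It assumes the output
is well-formed XML shaped like the single-page sample shown in the installation test;
the names <code>SAMPLE</code>, <code>parse_bbox</code> and <code>text_items</code> are hypothetical.
It is written for a current Python 3 interpreter, and real output
(several pages, unescaped characters) may need extra handling.

```python
import xml.etree.ElementTree as ET

# Sample output copied from the installation test (one page, one text element).
SAMPLE = (
    '<page id="0" bbox="0.000,0.000,612.000,792.000" rotate="0">\n'
    '<text font="Helvetica" direction="1" '
    'bbox="100.000,695.032,237.352,719.032" fontsize="24.000">'
    ' Hello World </text>\n'
    '</page>\n'
)


def parse_bbox(value):
    """Turn a comma-separated bbox attribute into a tuple of four floats."""
    return tuple(float(v) for v in value.split(","))


def text_items(page_xml):
    """Yield (text, font, fontsize, bbox) for each text element of one page."""
    page = ET.fromstring(page_xml)
    for node in page.iter("text"):
        yield (
            (node.text or "").strip(),
            node.get("font"),
            float(node.get("fontsize")),
            parse_bbox(node.get("bbox")),
        )


items = list(text_items(SAMPLE))
for text, font, size, bbox in items:
    print(text, font, size, bbox)
```

Keeping the bounding box as a tuple of floats makes it easy to sort or filter text portions by position afterwards.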
<p>
For non-ASCII languages, you can specify the output encoding
(such as UTF-8).
Note that not all characters in a PDF can be converted safely
to Unicode, as some of them are not included in the current
Unicode Standard.

<p>
Examples:
<blockquote><pre>
$ <strong>./pdf2txt.py -H -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file whose filename is output.html)

$ <strong>./pdf2txt.py -c euc-jp samples/jo.pdf</strong>
(extract Japanese text in vertical writing; CMap is required)

$ <strong>./pdf2txt.py -P mypassword secret.pdf</strong>
(extract text from an encrypted PDF file with a password)
</pre></blockquote>

<p>
Options:
<dl>
<dt> <code>-o <em>filename</em></code>
<dd> Specifies the output file name.
By default, it prints the extracted contents to stdout.
<p>
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
<dd> Specifies a comma-separated list of the page numbers to be extracted.
Page numbers start from zero.
By default, it extracts text from all the pages.
<p>
<dt> <code>-c <em>codec</em></code>
<dd> Specifies the output codec for non-ASCII text.
<p>
<dt> <code>-H</code>
<dd> Writes the output as an HTML file.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to open the PDF file.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
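<p>
When <code>pdf2txt.py</code> is called from another script, the options above can be
assembled into an argument list. This is only a sketch: the function
<code>build_pdf2txt_command</code> and its parameter names are hypothetical, the script path
follows the examples above, and it assumes the options may appear in any order
before the file name.

```python
def build_pdf2txt_command(pdf_path, output=None, pages=None, codec=None,
                          html=False, password=None, script="./pdf2txt.py"):
    """Return an argv list that uses only the options documented above."""
    argv = [script]
    if html:
        argv.append("-H")
    if output is not None:
        argv += ["-o", output]
    if pages:
        # Page numbers are zero-based and passed as one comma-separated value.
        argv += ["-p", ",".join(str(p) for p in pages)]
    if codec is not None:
        argv += ["-c", codec]
    if password is not None:
        argv += ["-P", password]
    argv.append(pdf_path)
    return argv


cmd = build_pdf2txt_command("samples/naacl06-shinyama.pdf",
                            output="output.html", html=True, pages=[0, 2])
print(" ".join(cmd))
# To run it for real, pass the list to subprocess.run(cmd, check=True).
```

Building a list instead of a shell string avoids quoting problems with file names and passwords that contain spaces.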

<a name="dumppdf"></a>
<h3>dumppdf.py</h3>
<p>
<code>dumppdf.py</code> dumps the internal contents of a PDF file
in pseudo-XML format. This program is primarily for debugging purposes,
but it can also extract some meaningful contents
(such as images).

<p>
Examples:
<blockquote><pre>
$ <strong>./dumppdf.py -a foo.pdf</strong>
(dump all the headers and contents, except stream objects)

$ <strong>./dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image)
</pre></blockquote>

<p>
Options:
<dl>
<dt> <code>-a</code>
<dd> Dumps all the objects.
By default, it only prints the document trailer (like a header).
<p>
<dt> <code>-p <em>pageno</em></code>
<dd> Specifies the page number to be extracted.
Multiple <code>-p</code> options are allowed.
Note that page numbers start from zero.
<p>
<dt> <code>-r</code> (raw)
<dt> <code>-b</code> (binary)
<dt> <code>-t</code> (text)
<dd> Specifies the output format of stream contents.
Because the contents of stream objects can be very large,
they are omitted when none of the options above is specified.
<p>
With the <code>-r</code> option, all the stream contents are dumped without decoding.
With the <code>-b</code> option, the contents are dumped as a binary blob.
With the <code>-t</code> option, the contents are dumped in a text format
similar to the output of <code>repr()</code>. When the
<code>-r</code> or <code>-b</code> option is given,
no stream header is displayed, which makes it easier to save the output to a file.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to open the PDF file.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
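<p>
To give an idea of what a <code>repr()</code>-like text dump looks like compared with
raw bytes, here is a small illustration using Python's own <code>repr()</code>.
It is a sketch only: the exact text that <code>dumppdf.py -t</code> prints may differ,
the sample bytes are made up for the example, and under Python 2 the result has no <code>b</code> prefix.

```python
import ast

# The first bytes of a typical JPEG stream, used here only as sample data.
stream = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# repr() escapes every non-printable byte, so the result is plain ASCII text.
as_text = repr(stream)
print(as_text)

# The text form is lossless: evaluating the literal restores the raw bytes.
restored = ast.literal_eval(as_text)
print(restored == stream)
```

The text form is safe to print on a terminal, while the raw form is what you would redirect to a file.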

<a name="changes"></a>
<hr noshade>
<h2>Changes</h2>
<ul>
<li> 2008/06/29: Added HTML output. Reorganized the directory structure.
<li> 2008/04/29: Bugfix for Win32. Thanks to Chris Clark.
<li> 2008/04/27: Basic encryption and LZW decoding support added.
<li> 2008/01/07: Several bugfixes. Thanks to Nick Fabry for his contribution.
<li> 2007/12/31: Initial release.
<li> 2004/12/24: Started writing the code out of boredom...
</ul>

<a name="related"></a>
<hr noshade>
<h2>Related Projects</h2>
<ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
</ul>

<a name="license"></a>
<hr noshade>
<h2>Terms and Conditions</h2>
<p>
<small>
Copyright (c) 2004-2008 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
<p>
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>

<hr noshade>
<address>Yusuke Shinyama</address>
</body>