<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PDFMiner</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
--></style>
</head><body>

<h1>PDFMiner</h1>
<p>
Python PDF parser and analyzer

<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Mon Feb 2 00:01:01 JST 2009
<!-- hhmts end -->
</div>

<a name="intro"></a>
<hr noshade>
<h2>What Is It?</h2>
<p>
PDFMiner is a suite of programs that helps you
extract and analyze text data from PDF documents.
It includes a PDF parser, a PDF renderer
(though only text rendering is supported for now),
and a couple of handy tools for extracting text.
Unlike other PDF-related tools, it lets you obtain
the exact location of text on a page, as well as
other layout information such as font size and font name,
which can be useful for analyzing the document.
<p>
<strong>Features:</strong>
<ul>
<li> Written entirely in Python (requires version 2.5 or newer).
<li> Supports the PDF-1.7 specification.
<li> Supports non-ASCII languages and vertical writing scripts.
<li> Supports various font types (Type1, TrueType, Type3, and CID).
<li> Supports basic encryption (RC4).
<li> PDF to HTML conversion (with a sample converter web app).
<li> Outline (TOC) extraction.
<li> Tagged contents extraction.
</ul>

<p>
<strong>Homepage:</strong><br>
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">
http://www.unixuser.org/~euske/python/pdfminer/index.html
</a>

<a name="source"></a>
<p>
<strong>Download (source):</strong><br>
<a href="http://www.unixuser.org/~euske/python/pdfminer/pdfminer-dist-20090201.tar.gz">
http://www.unixuser.org/~euske/python/pdfminer/pdfminer-dist-20090201.tar.gz
</a>
(1.8 Mbytes)

<P>
<strong>SVN repository:</strong><br>
<a href="http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer">
http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
</a>

<P>
<strong>Demo:</strong> (online PDF -&gt; HTML conversion)<br>
<a href="http://pdf2html.tabesugi.net:8080/">
http://pdf2html.tabesugi.net:8080/
</a>

<a name="install"></a>
<hr noshade>
<h2>How to Install</h2>
<ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.5 or newer.
<li> Download the <a href="#source">PDFMiner source</a>.
<li> Extract it.
<li> Go to the <code>pdfminer</code> directory.
<li> Run the following test:<br>
<blockquote><pre>
$ <strong>python -m pdflib.pdf2txt samples/simple1.pdf</strong>
&lt;html&gt;&lt;head&gt;&lt;meta http-equiv="Content-Type" content="text/html; charset=ascii"&gt;
&lt;/head&gt;&lt;body&gt;
&lt;div style="position:absolute; top:50px;"&gt;&lt;a name="1"&gt;Page 1&lt;/a&gt;&lt;/div&gt;&lt;span style="position:absolute; border: 1px solid gray; left:0px; top:50px; width:612px; height:792px;"&gt;&lt;/span&gt;
&lt;span style="position:absolute; writing-mode:lr-tb; left:100px; top:224px; font-size:22px;"&gt; &lt;/span&gt;
&lt;span style="position:absolute; writing-mode:lr-tb; left:106px; top:224px; font-size:22px;"&gt;Hello &lt;/span&gt;
&lt;span style="position:absolute; writing-mode:lr-tb; left:168px; top:224px; font-size:22px;"&gt;World &lt;/span&gt;
&lt;span style="position:absolute; writing-mode:lr-tb; left:100px; top:124px; font-size:22px;"&gt; &lt;/span&gt;
&lt;span style="position:absolute; writing-mode:lr-tb; left:206px; top:124px; font-size:22px;"&gt;Hello &lt;/span&gt;
&lt;span style="position:absolute; writing-mode:lr-tb; left:368px; top:124px; font-size:22px;"&gt;World &lt;/span&gt;
&lt;div style="position:absolute; top:0px;"&gt;Page: &lt;a href="#1"&gt;1&lt;/a&gt;&lt;/div&gt;
&lt;/body&gt;&lt;/html&gt;
</pre></blockquote>
<li> Done!
</ol>

<p>
<h3>For non-ASCII languages</h3>
In order to handle non-ASCII languages (e.g. Japanese),
you need to install an additional data set called <code>CMap</code>,
which is distributed by Adobe.
<p>
Here is how:

<ol>
<li> Get
<a href="http://www.unixuser.org/~euske/pub/CMap.tar.bz2">
http://www.unixuser.org/~euske/pub/CMap.tar.bz2
</a>
<li> Extract it:
<blockquote><pre>
$ <strong>tar jxf CMap.tar.bz2</strong>
</pre></blockquote>
<li> Put the <code>CMap</code> directory into the <code>pdfminer</code> directory.
<li> Go to the <code>pdfminer</code> directory.
<li> Run the following (optional, but highly recommended):<br>
<blockquote><pre>
$ <strong>make cdbcmap</strong>
</pre></blockquote>
</ol>
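<p>
Where a <code>tar</code> command is not available, the extraction step above
can be approximated with Python's standard <code>tarfile</code> module.
This is only a minimal sketch and not part of PDFMiner:
the function name <code>extract_cmap</code> and its arguments are hypothetical,
and it assumes the archive unpacks to a top-level <code>CMap</code> directory
as the steps above describe.

```python
import os
import tarfile


def extract_cmap(archive_path, pdfminer_dir):
    """Unpack a bzip2-compressed tar archive into pdfminer_dir.

    Roughly equivalent to running `tar jxf CMap.tar.bz2` inside
    pdfminer_dir. Only use this on archives you trust, because
    extractall() writes whatever paths the archive contains.
    """
    archive = tarfile.open(archive_path, "r:bz2")
    try:
        archive.extractall(pdfminer_dir)
    finally:
        archive.close()
    # Path where the CMap directory is expected to appear (an assumption
    # about the archive layout, not something this function verifies).
    return os.path.join(pdfminer_dir, "CMap")
```

<p>
Calling <code>extract_cmap("CMap.tar.bz2", "pdfminer")</code> would cover
the extraction and copying steps in one go; the optional
<code>make cdbcmap</code> step still has to be run separately.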

<a name="usage"></a>
<hr noshade>
<h2>How to Use</h2>

<p>
PDFMiner comes with two programs:
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.

<a name="pdf2txt"></a>
<h3>pdf2txt.py</h3>
<p>
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the text that is rendered programmatically,
along with the corresponding location, font name,
and font size of each text portion. However,
it cannot extract text embedded within images
(i.e. it does not perform optical character recognition).
For protected PDF documents whose access is restricted,
you can provide a password.
<p>
For non-ASCII languages, you can specify the output encoding
(such as UTF-8).
Note that not all characters in a PDF can be converted safely
to Unicode, as some of them are not included in the current
Unicode Standard.

<p>
Examples:
<blockquote><pre>
$ <strong>python -m pdflib.pdf2txt -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file named output.html)

$ <strong>python -m pdflib.pdf2txt -c euc-jp samples/jo.pdf</strong>
(extract Japanese text in vertical writing; CMap is required)

$ <strong>python -m pdflib.pdf2txt -P mypassword secret.pdf</strong>
(extract text from an encrypted PDF file with a password)
</pre></blockquote>

<p>
Options:
<dl>
<dt> <code>-o <em>filename</em></code>
<dd> Specifies the output file name.
By default, it prints the extracted contents to stdout.
<p>
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
<dd> Specifies a comma-separated list of the page numbers to be extracted.
Page numbers start from one.
By default, it extracts text from all the pages.
<p>
<dt> <code>-c <em>codec</em></code>
<dd> Specifies the output codec for non-ASCII text.
<p>
<dt> <code>-t <em>type</em></code>
<dd> Specifies the output format. The following formats are currently supported:
<ul>
<li> <code>html</code> : HTML format. (Default)
<li> <code>sgml</code> : SGML format.
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
Tags used here are defined in the PDF specification (see §10.7 "<em>Tagged PDF</em>").
</ul>
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to open the PDF file.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
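<p>
When <code>pdf2txt</code> is driven from another script, the options listed above
can be assembled into an argument list programmatically.
The following is a minimal sketch under stated assumptions:
the helper <code>build_pdf2txt_command</code> and its parameter names are hypothetical
and not part of PDFMiner; only the flags themselves
(<code>-o</code>, <code>-p</code>, <code>-c</code>, <code>-t</code>, <code>-P</code>, <code>-d</code>)
and the <code>python -m pdflib.pdf2txt</code> invocation are taken from this page.

```python
import sys


def build_pdf2txt_command(pdf_path, output=None, pages=None, codec=None,
                          output_type=None, password=None, debug=False):
    """Build an argument list for `python -m pdflib.pdf2txt`.

    Hypothetical helper: it only assembles the documented flags and
    does not run anything itself.
    """
    cmd = [sys.executable, "-m", "pdflib.pdf2txt"]
    if output is not None:
        cmd += ["-o", output]
    if pages:
        # -p takes a comma-separated list of page numbers starting from one.
        if min(pages) < 1:
            raise ValueError("page numbers start from one")
        cmd += ["-p", ",".join(str(p) for p in pages)]
    if codec is not None:
        cmd += ["-c", codec]
    if output_type is not None:
        # Only the three formats listed in the options are accepted here.
        if output_type not in ("html", "sgml", "tag"):
            raise ValueError("unsupported output type: %r" % (output_type,))
        cmd += ["-t", output_type]
    if password is not None:
        cmd += ["-P", password]
    if debug:
        cmd.append("-d")
    # The PDF file name comes last, after all options.
    cmd.append(pdf_path)
    return cmd


cmd = build_pdf2txt_command("samples/jo.pdf", output="jo.html",
                            pages=[1, 3], codec="euc-jp")
print(" ".join(cmd[1:]))
# prints: -m pdflib.pdf2txt -o jo.html -p 1,3 -c euc-jp samples/jo.pdf
```

<p>
The resulting list can be passed to <code>subprocess.call()</code>
from within the <code>pdfminer</code> directory. Building a list rather than
a single string avoids shell quoting problems with file names and passwords.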

<a name="dumppdf"></a>
<h3>dumppdf.py</h3>
<p>
<code>dumppdf.py</code> dumps the internal contents of a PDF file
in a pseudo-XML format. This program is primarily for debugging purposes,
but it can also be used to extract some meaningful contents
(such as images).

<p>
Examples:
<blockquote><pre>
$ <strong>python -m tools.dumppdf -a foo.pdf</strong>
(dump all the headers and contents, except stream objects)

$ <strong>python -m tools.dumppdf -T foo.pdf</strong>
(dump the table of contents)

$ <strong>python -m tools.dumppdf -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image)
</pre></blockquote>

<p>
Options:
<dl>
<dt> <code>-a</code>
<dd> Dumps all the objects.
By default, it only prints the document trailer (which acts like a header).
<p>
<dt> <code>-p <em>pageno</em></code>
<dd> Specifies the page number to be extracted.
Multiple <code>-p</code> options are allowed.
Note that page numbers start from one.
<p>
<dt> <code>-r</code> (raw)
<dt> <code>-b</code> (binary)
<dt> <code>-t</code> (text)
<dd> Specifies the output format of stream contents.
Because the contents of stream objects can be very large,
they are omitted when none of the options above is specified.
<p>
With the <code>-r</code> option, all the stream contents are dumped without decoding.
With the <code>-b</code> option, the contents are dumped as a binary blob.
With the <code>-t</code> option, the contents are dumped in a text format
similar to the output of <code>repr()</code>. When
the <code>-r</code> or <code>-b</code> option is given,
no stream header is displayed, which makes it easier to save the output to a file.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to open the PDF file.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
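<p>
To give a feel for what a <code>repr()</code>-like text format means for binary stream data,
the short sketch below applies Python's built-in <code>repr()</code> to a few sample bytes.
It illustrates the general idea only and is not the exact output of <code>dumppdf.py</code>;
the sample bytes and variable names are made up for this example,
and the <code>b"..."</code> literal syntax assumes a current Python version.

```python
import ast

# Sample binary data (the first bytes of a typical JPEG stream),
# used here purely for illustration.
raw = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

# repr() escapes every non-printable byte, so the result is plain
# printable text that is safe to show on a terminal.
text_form = repr(raw)
print(text_form)

# The escaped text can be converted back into the original bytes.
restored = ast.literal_eval(text_form)
print(restored == raw)
```

<p>
This is why a text dump is convenient for inspection, while the raw and binary
forms are the ones to redirect into a file.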

<a name="changes"></a>
<hr noshade>
<h2>Changes</h2>
<ul>
<li> 2009/02/01: Various bugfixes. Thanks to Hiroshi Manabe.
<li> 2009/01/17: Correctly handles a trailer that contains both /XrefStm and /Prev entries.
<li> 2009/01/10: Handles Type3 font metrics correctly.
<li> 2008/12/28: Better handling of word spacing. Thanks to Christian Nentwich.
<li> 2008/09/06: A sample pdf2html webapp added.
<li> 2008/08/30: ASCII85 encoding filter support.
<li> 2008/07/27: Tagged contents extraction support.
<li> 2008/07/10: Outline (TOC) extraction support.
<li> 2008/06/29: HTML output added. Reorganized the directory structure.
<li> 2008/04/29: Bugfix for Win32. Thanks to Chris Clark.
<li> 2008/04/27: Basic encryption and LZW decoding support added.
<li> 2008/01/07: Several bugfixes. Thanks to Nick Fabry for his contribution.
<li> 2007/12/31: Initial release.
<li> 2004/12/24: Started writing the code out of boredom...
</ul>

<a name="related"></a>
<hr noshade>
<h2>Related Projects</h2>
<ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
</ul>

<a name="license"></a>
<hr noshade>
<h2>Terms and Conditions</h2>
<p>
<small>
Copyright (c) 2004-2009 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
<p>
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>

<hr noshade>
<address>Yusuke Shinyama</address>
</body>