<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
2007-12-31 04:40:27 +00:00
|
|
|
<html>
|
|
|
|
<head>
|
2008-04-27 11:55:51 +00:00
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
2007-12-31 04:40:27 +00:00
|
|
|
<title>PDFMiner</title>
|
2008-04-27 11:55:51 +00:00
|
|
|
<style type="text/css"><!--
|
|
|
|
blockquote { background: #eeeeee; }
|
|
|
|
--></style>
|
|
|
|
</head><body>
|
2007-12-31 04:40:27 +00:00
|
|
|
|
|
|
|
<h1>PDFMiner</h1>
|
2008-07-29 15:02:20 +00:00
|
|
|
<p>
|
|
|
|
Python PDF parser and analyzer
|
|
|
|
|
2009-03-24 23:10:34 +00:00
|
|
|
<p>
|
|
|
|
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
|
2009-10-31 01:41:30 +00:00
|
|
|
|
2009-04-02 14:22:19 +00:00
|
|
|
<a href="#changes">Recent Changes</a>
|
2009-03-24 23:10:34 +00:00
|
|
|
|
2008-04-27 11:47:38 +00:00
|
|
|
<div align=right class=lastmod>
|
|
|
|
<!-- hhmts start -->
|
2010-01-31 02:09:28 +00:00
|
|
|
Last Modified: Sun Jan 31 10:38:26 JST 2010
|
2008-04-27 11:47:38 +00:00
|
|
|
<!-- hhmts end -->
|
|
|
|
</div>
|
|
|
|
|
2009-10-24 03:44:32 +00:00
|
|
|
<ul>
|
|
|
|
<li> <a href="#intro">What's It?</a>
|
|
|
|
<li> <a href="#source">Download</a>
|
|
|
|
<li> <a href="#install">Install</a>
|
2010-01-01 03:09:26 +00:00
|
|
|
<small>(<a href="#cmap">for East Asian languages</a>)</small>
|
2009-10-24 03:44:32 +00:00
|
|
|
<li> <a href="#usage">How to Use</a>
|
|
|
|
<small>(<a href="#pdf2txt">pdf2txt.py</a>, <a href="#dumppdf">dumppdf.py</a>)</small>
|
|
|
|
<li> <a href="#todos">TODOs</a>
|
|
|
|
<li> <a href="#changes">Changes</a>
|
|
|
|
<li> <a href="#related">Related Projects</a>
|
|
|
|
<li> <a href="#license">Terms and Conditions</a>
|
|
|
|
</ul>
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="intro"></a>
|
|
|
|
<hr noshade>
|
2008-05-03 04:10:59 +00:00
|
|
|
<h2>What's It?</h2>
|
2007-12-31 04:40:27 +00:00
|
|
|
<p>
PDFMiner is a suite of programs that helps
extract meaningful information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data from PDFs. PDFMiner lets you obtain
the exact location of text on a page, as well as
other information such as fonts or ruled lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for purposes other than text analysis.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
|
|
|
<strong>Features:</strong>
|
|
|
|
<ul>
<li> Written entirely in Python (version 2.4 or newer).
<li> PDF-1.7 specification support (well, almost).
<li> Support for East Asian languages and vertical writing.
<li> Support for various font types (Type1, TrueType, Type3, and CID).
<li> Basic encryption (RC4) support.
<li> PDF to HTML conversion (with a sample converter web app).
<li> Outline (TOC) extraction.
<li> Tagged contents extraction.
<li> Reconstruction of the original layout by grouping text chunks.
</ul>
|
2007-12-31 04:40:27 +00:00
|
|
|
|
2008-05-03 04:10:59 +00:00
|
|
|
<a name="source"></a>
|
2007-12-31 04:40:27 +00:00
|
|
|
<p>
|
2009-12-19 06:52:02 +00:00
|
|
|
<strong>Download from PyPI:</strong><br>
|
|
|
|
<a href="http://pypi.python.org/pypi/pdfminer/">
|
|
|
|
http://pypi.python.org/pypi/pdfminer/
|
2007-12-31 04:40:27 +00:00
|
|
|
</a>
|
|
|
|
|
2009-03-24 23:10:34 +00:00
|
|
|
<p>
|
2009-05-17 14:25:14 +00:00
|
|
|
<strong>Discussion:</strong> (for questions and comments, post here)<br>
|
2009-03-24 23:10:34 +00:00
|
|
|
<a href="http://groups.google.com/group/pdfminer-users/">
|
|
|
|
http://groups.google.com/group/pdfminer-users/
|
|
|
|
</a>
|
|
|
|
|
2007-12-31 04:40:27 +00:00
|
|
|
<P>
|
2009-03-24 23:10:34 +00:00
|
|
|
<strong>View the source:</strong><br>
|
2008-07-03 15:51:44 +00:00
|
|
|
<a href="http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer">
|
|
|
|
http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
|
2008-06-29 14:57:42 +00:00
|
|
|
</a>
|
|
|
|
|
|
|
|
<P>
|
2009-03-24 23:10:34 +00:00
|
|
|
<strong>Online Demonstration:</strong> (PDF to HTML conversion web app)<br>
|
2008-06-29 14:57:42 +00:00
|
|
|
<a href="http://pdf2html.tabesugi.net:8080/">
|
|
|
|
http://pdf2html.tabesugi.net:8080/
|
2007-12-31 04:40:27 +00:00
|
|
|
</a>
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="install"></a>
|
|
|
|
<hr noshade>
|
2009-10-24 03:44:32 +00:00
|
|
|
<h2>Install</h2>
|
2009-03-28 17:23:53 +00:00
|
|
|
|
2008-04-27 11:47:38 +00:00
|
|
|
<ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.4 or newer.
<li> Download the <a href="#source">PDFMiner source</a>.
<li> Unpack it.
<li> Run <code>setup.py</code> to install:<br>
<blockquote><pre>
# <strong>python setup.py install</strong>
</pre></blockquote>
<li> Run the following test:<br>
<blockquote><pre>
$ <strong>pdf2txt.py samples/simple1.pdf</strong>
Hello

Hello

World

World

Hello

World

H e l l o

W o r l d

</pre></blockquote>
<li> Done!
</ol>
|
|
|
|
|
|
|
|
<p>
|
2009-08-26 15:20:44 +00:00
|
|
|
<a name="cmap"></a>
|
2010-01-01 03:09:26 +00:00
|
|
|
<h3>For East Asian languages</h3>
|
|
|
|
To handle East Asian languages (Chinese, Japanese, etc.),
additional data called <code>CMap</code> is required.
CMap files are not installed by default.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
2009-12-19 15:10:58 +00:00
|
|
|
Here is the additional step you need:
|
2008-04-27 11:47:38 +00:00
|
|
|
<blockquote><pre>
# <strong>make cmap</strong>
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt cp950 big5
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
<em>(this may take several minutes)</em>

# <strong>python setup.py install</strong>
</pre></blockquote>
|
2007-12-31 04:40:27 +00:00
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="usage"></a>
|
|
|
|
<hr noshade>
|
2008-05-03 04:10:59 +00:00
|
|
|
<h2>How to Use</h2>
|
2007-12-31 04:40:27 +00:00
|
|
|
|
|
|
|
<p>
|
2009-05-17 06:39:54 +00:00
|
|
|
PDFMiner comes with two handy tools:
|
2008-04-27 11:47:38 +00:00
|
|
|
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="pdf2txt"></a>
|
2008-04-27 11:47:38 +00:00
|
|
|
<h3>pdf2txt.py</h3>
|
|
|
|
<p>
|
|
|
|
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the text that is rendered programmatically,
i.e. text represented as ASCII or Unicode strings.
It cannot recognize text drawn as images, which would require optical character recognition.
It also extracts the corresponding location, font name, font size, and writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for a protected PDF document whose access is restricted.
You cannot extract any text from a PDF document that does not have extraction permission.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
2009-05-17 06:39:54 +00:00
|
|
|
<strong>Note:</strong> Not all characters in a PDF can be safely converted to Unicode.
|
2008-04-27 11:47:38 +00:00
|
|
|
|
|
|
|
<p>
|
|
|
|
Examples:
|
2007-12-31 04:40:27 +00:00
|
|
|
<blockquote><pre>
$ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file whose filename is output.html)

$ <strong>pdf2txt.py -c euc-jp -D V -o output.html samples/jo.pdf</strong>
(extract Japanese text as an HTML file in vertical writing; CMap is required)

$ <strong>pdf2txt.py -P mypassword -o output.txt secret.pdf</strong>
(extract text from an encrypted PDF file)
</pre></blockquote>
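<p>
Command lines like the ones above can also be assembled from a script.
The following is a minimal sketch, not part of PDFMiner: the helper name
<code>pdf2txt_args</code> and its keyword arguments are hypothetical, and it only
uses flags documented in this section (<code>-o</code>, <code>-p</code>,
<code>-c</code>, and <code>-P</code>).

```python
# Hypothetical helper (not part of PDFMiner): builds an argument list
# for pdf2txt.py using only the flags documented in this section.

def pdf2txt_args(pdf_path, output=None, pages=None, codec=None, password=None):
    """Return the argv list for one pdf2txt.py invocation."""
    args = ["pdf2txt.py"]
    if output is not None:
        args += ["-o", output]
    if pages:
        # -p takes a comma-separated list; page numbers start from one.
        args += ["-p", ",".join(str(p) for p in pages)]
    if codec is not None:
        args += ["-c", codec]
    if password is not None:
        args += ["-P", password]
    args.append(pdf_path)
    return args


if __name__ == "__main__":
    print(pdf2txt_args("secret.pdf", output="output.txt", password="mypassword"))
```

<p>
The resulting list could then be passed to <code>subprocess.call()</code>,
assuming <code>pdf2txt.py</code> is installed and on the search path.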
|
|
|
|
|
|
|
|
<p>
|
2008-04-27 11:47:38 +00:00
|
|
|
Options:
|
|
|
|
<dl>
|
|
|
|
<dt> <code>-o <em>filename</em></code>
|
2009-02-28 05:44:08 +00:00
|
|
|
<dd> Specifies the output file name.
|
2009-07-11 12:42:12 +00:00
|
|
|
By default, it prints the extracted contents to stdout in text format.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
2008-06-29 08:45:46 +00:00
|
|
|
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
|
2009-02-28 05:44:08 +00:00
|
|
|
<dd> Specifies a comma-separated list of the page numbers to be extracted.
Page numbers start from one.
By default, it extracts text from all the pages.
|
|
|
|
<p>
|
|
|
|
<dt> <code>-c <em>codec</em></code>
|
2010-01-01 03:09:26 +00:00
|
|
|
<dd> Specifies the output codec.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
2008-07-27 04:30:37 +00:00
|
|
|
<dt> <code>-t <em>type</em></code>
|
2009-02-28 05:44:08 +00:00
|
|
|
<dd> Specifies the output format. The following formats are currently supported.
|
2008-07-27 04:30:37 +00:00
|
|
|
<ul>
|
2009-10-23 14:51:40 +00:00
|
|
|
<li> <code>text</code> : TEXT format. (Default)
|
2010-01-10 07:18:05 +00:00
|
|
|
<li> <code>html</code> : HTML format. Not recommended for extraction purposes because the markup is messy.
|
2009-10-31 03:04:56 +00:00
|
|
|
<li> <code>xml</code> : XML format. Provides the most information available.
|
2008-07-27 04:30:37 +00:00
|
|
|
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
|
|
|
|
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
|
2008-08-30 07:40:52 +00:00
|
|
|
Tags used here are defined in the PDF specification (See §10.7 "<em>Tagged PDF</em>").
|
2008-07-27 04:30:37 +00:00
|
|
|
</ul>
|
2008-06-29 08:45:46 +00:00
|
|
|
<p>
|
2010-01-30 07:33:18 +00:00
|
|
|
<dt> <code>-I <em>image_directory</em></code>
|
|
|
|
<dd> Specifies the output directory for image extraction.
|
|
|
|
Currently only JPEG images are supported.
|
|
|
|
<p>
|
2009-07-21 07:55:19 +00:00
|
|
|
<dt> <code>-D <em>direction</em></code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dt> <code>-M <em>char_margin</em></code>
|
|
|
|
<dt> <code>-L <em>line_margin</em></code>
|
2009-07-11 12:42:12 +00:00
|
|
|
<dt> <code>-W <em>word_margin</em></code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dd> These are the parameters used for layout analysis.
In an actual PDF file, a run of text might be split into several chunks,
depending on the authoring software.
Therefore, text extraction needs to splice these chunks back together.
In the figure below, two text chunks whose distance is smaller than
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) are considered
continuous and are grouped into one. Also, two lines whose distance is smaller than
the <em>line_margin</em> (<em><font color="blue">L</font></em>) are grouped
as a text box, which is a rectangular area that contains a "cluster" of text.
Furthermore, a blank character (space) may need to be inserted
if the distance between two words is greater than the <em>word_margin</em>
(<em><font color="green">W</font></em>), because a blank between words might not be
represented as a space character, but only indicated by the positioning of each word.
<p>
Each value is specified not as an actual length, but as a proportion of
the size of each character in question. The default values
are M = 1.0, L = 0.3, and W = 0.2, respectively.
|
|
|
|
<table style="border:2px gray solid; margin: 10px; padding: 10px;"><tr>
<td style="border-right:1px red solid" align=right>&rarr;</td>
<td style="border-left:1px red solid" colspan="4" align=left>&larr; <em><font color="red">M</font></em></td>
<td></td>
</tr><tr>
<td style="border:1px solid"><code>Q u i</code></td>
<td style="border:1px solid"><code>c k</code></td>
<td width="10px"></td>
<td style="border:1px solid"><code>b r o w</code></td>
<td style="border:1px solid"><code>n f o x</code></td>
<td style="border-bottom:1px blue solid" align=right>&darr;</td>
</tr><tr>
<td style="border-right:1px green solid" colspan="2" align=right>&rarr;</td><td></td>
<td style="border-left:1px green solid" colspan="2" align=left>&larr; <em><font color="green">W</font></em></td>
<td rowspan="2" valign=center align=center><em><font color="blue">L</font></em></td>
</tr><tr height="10px">
</tr><tr>
<td style="padding:0px;" colspan="5">
<table style="border:1px solid"><tr><td><code>j u m p s</code></td><td>...</td></tr></table>
</td>
<td style="border-top:1px blue solid" align=right>&uarr;</td>
</tr></table>
|
2009-07-11 12:42:12 +00:00
|
|
|
<p>
|
2009-11-07 09:12:54 +00:00
|
|
|
<dt> <code>-n</code>
|
|
|
|
<dd> Suppress layout analysis.
|
|
|
|
<p>
|
2009-07-11 12:42:12 +00:00
|
|
|
<dt> <code>-s <em>scale</em></code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dd> Specifies the output scale. Can be used in HTML format only.
|
2009-07-11 12:42:12 +00:00
|
|
|
<p>
|
|
|
|
<dt> <code>-m <em>maxpages</em></code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dd> Specifies the maximum number of pages to extract.
|
|
|
|
By default, it extracts all the pages in a document.
|
2009-07-11 12:42:12 +00:00
|
|
|
<p>
|
2008-04-27 11:47:38 +00:00
|
|
|
<dt> <code>-P <em>password</em></code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dd> Provides the user password to access PDF contents.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
|
|
|
<dt> <code>-d</code>
|
|
|
|
<dd> Increases the debug level.
|
|
|
|
</dl>
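<p>
To make the roles of <em>char_margin</em> and <em>word_margin</em> more concrete,
here is a simplified sketch of how chunks on a single line could be grouped.
It is an illustration under stated assumptions, not PDFMiner's actual layout code:
the function name <code>join_chunks</code>, the <code>(x0, x1, text)</code> chunk
representation, and the single <code>char_size</code> value are all hypothetical.
Margins are multiplied by the character size, since they are proportions rather
than absolute lengths.

```python
# Illustrative sketch only; this is NOT PDFMiner's layout analysis code.
# A chunk is a hypothetical (x0, x1, text) tuple on a single line.

def join_chunks(chunks, char_size, char_margin=1.0, word_margin=0.2):
    """Group chunks into runs of text, inserting spaces between words."""
    groups = []
    current = ""
    prev_x1 = None
    for x0, x1, text in sorted(chunks):
        if prev_x1 is None:
            current = text
        else:
            gap = x0 - prev_x1
            if gap >= char_margin * char_size:
                # Not closer than char_margin: start a new group.
                groups.append(current)
                current = text
            elif gap > word_margin * char_size:
                # Same group, but far enough apart to be a word break.
                current += " " + text
            else:
                # Close enough to be part of the same word.
                current += text
        prev_x1 = x1
    if prev_x1 is not None:
        groups.append(current)
    return groups


if __name__ == "__main__":
    line = [(0, 30, "Qui"), (31, 50, "ck"), (54, 90, "brow"),
            (91, 130, "n"), (160, 190, "fox")]
    print(join_chunks(line, char_size=10))
```

<p>
With the default margins and a character size of 10, gaps of 1 unit are joined
directly, a gap of 4 units becomes a space, and a gap of 30 units starts a new group.
Grouping lines into text boxes with <em>line_margin</em> would follow the same
idea in the vertical direction.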
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="dumppdf"></a>
|
2008-04-27 11:47:38 +00:00
|
|
|
<h3>dumppdf.py</h3>
|
|
|
|
<p>
|
|
|
|
<code>dumppdf.py</code> dumps the internal contents of a PDF file
|
2010-01-10 07:18:05 +00:00
|
|
|
in pseudo-XML format. This program is primarily for debugging purposes,
|
2008-04-27 11:47:38 +00:00
|
|
|
but it's also possible to extract some meaningful contents
|
|
|
|
(such as images).
|
|
|
|
|
|
|
|
<p>
|
|
|
|
Examples:
|
2007-12-31 04:40:27 +00:00
|
|
|
<blockquote><pre>
$ <strong>dumppdf.py -a foo.pdf</strong>
(dump all the headers and contents, except stream objects)

$ <strong>dumppdf.py -T foo.pdf</strong>
(dump the table of contents)

$ <strong>dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image)
</pre></blockquote>
|
|
|
|
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
|
|
|
Options:
|
|
|
|
<dl>
|
|
|
|
<dt> <code>-a</code>
|
|
|
|
<dd> Dumps all the objects.
By default, it only prints the document trailer (like a header).
|
|
|
|
<p>
|
2009-07-11 12:42:12 +00:00
|
|
|
<dt> <code>-i <em>objno,objno, ...</em></code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dd> Specifies PDF object IDs to display.
|
|
|
|
Comma-separated IDs, or multiple <code>-i</code> options are accepted.
|
2009-07-11 12:42:12 +00:00
|
|
|
<p>
|
|
|
|
<dt> <code>-p <em>pageno,pageno, ...</em></code>
|
2009-02-28 05:44:08 +00:00
|
|
|
<dd> Specifies the page number to be extracted.
|
2009-07-11 15:28:12 +00:00
|
|
|
Comma-separated page numbers, or multiple <code>-p</code> options are accepted.
|
|
|
|
Note that page numbers start from one, not zero.
|
2008-04-27 11:47:38 +00:00
|
|
|
<p>
|
|
|
|
<dt> <code>-r</code> (raw)
|
|
|
|
<dt> <code>-b</code> (binary)
|
|
|
|
<dt> <code>-t</code> (text)
|
2009-02-28 05:44:08 +00:00
|
|
|
<dd> Specifies the output format of stream contents.
Because the contents of stream objects can be very large,
they are omitted when none of the options above is specified.
<p>
With the <code>-r</code> option, the "raw" stream contents are dumped without decompression.
With the <code>-b</code> option, the decompressed contents are dumped as a binary blob.
With the <code>-t</code> option, the decompressed contents are dumped in a text format
similar to the output of <code>repr()</code>. When the
<code>-r</code> or <code>-b</code> option is given,
no stream header is displayed, so that the output can be easily saved to a file.
|
|
|
|
<p>
|
2009-07-11 12:42:12 +00:00
|
|
|
<dt> <code>-T</code>
|
2009-07-11 15:28:12 +00:00
|
|
|
<dd> Shows the table of contents.
|
|
|
|
<p>
|
|
|
|
<dt> <code>-P <em>password</em></code>
|
|
|
|
<dd> Provides the user password to access PDF contents.
|
2009-07-11 12:42:12 +00:00
|
|
|
<p>
|
2008-04-27 11:47:38 +00:00
|
|
|
<dt> <code>-d</code>
|
|
|
|
<dd> Increases the debug level.
|
|
|
|
</dl>
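<p>
The difference between the <code>-r</code>, <code>-b</code>, and <code>-t</code>
stream formats can be illustrated with a short sketch. It assumes a stream
compressed with zlib (Flate), which is only one of the filters a PDF may use,
and the function <code>dump_stream</code> is a hypothetical stand-in, not
dumppdf.py's own code.

```python
import zlib

# Hypothetical stand-in for the three stream output modes; this is
# not dumppdf.py's code and it assumes a zlib (Flate) compressed stream.

def dump_stream(rawdata, mode):
    """Return what each mode would emit for a zlib-compressed stream."""
    if mode == "r":
        # raw: the stored bytes, without decompression
        return rawdata
    data = zlib.decompress(rawdata)
    if mode == "b":
        # binary: the decompressed bytes as-is
        return data
    if mode == "t":
        # text: an escaped, printable form similar to repr()
        return repr(data)
    raise ValueError("unknown mode: %r" % mode)


if __name__ == "__main__":
    payload = b"BT /F1 12 Tf (Hello) Tj ET\n"
    stored = zlib.compress(payload)
    print(len(dump_stream(stored, "r")), "raw bytes")
    print(dump_stream(stored, "t"))
```

<p>
An already-compressed image such as a JPEG is typically saved with the raw mode,
as in the <code>-r -i6</code> example above.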
|
|
|
|
|
2009-10-24 03:44:32 +00:00
|
|
|
<a name="todos"></a>
|
|
|
|
<hr noshade>
|
|
|
|
<h2>TODOs</h2>
|
|
|
|
<ul>
|
2009-11-29 07:17:36 +00:00
|
|
|
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
|
|
|
|
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
|
2009-10-24 03:44:32 +00:00
|
|
|
<li> Better text extraction / layout analysis.
|
|
|
|
<li> Better API Documentation.
|
2010-01-30 07:33:18 +00:00
|
|
|
<li> Crypt stream filter support. (More sample documents are needed!)
|
|
|
|
<li> CCITTFax stream filter support.
|
2009-10-24 03:44:32 +00:00
|
|
|
<li> Robust error handling.
|
|
|
|
</ul>
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="changes"></a>
|
|
|
|
<hr noshade>
|
2008-04-27 11:47:38 +00:00
|
|
|
<h2>Changes</h2>
|
|
|
|
<ul>
|
2010-01-31 02:09:28 +00:00
|
|
|
<li> 2010/01/31: JPEG image extraction supported. Page rotation bug fixed.
|
2010-01-04 12:50:59 +00:00
|
|
|
<li> 2010/01/04: Python 2.6 warning removal. More doctest conversion.
|
2010-01-01 03:09:26 +00:00
|
|
|
<li> 2010/01/01: CMap bug fix. Thanks to Winfried Plappert.
|
|
|
|
<li> 2009/12/24: RunLengthDecode filter added. Thanks to Troy Bollinger.
|
|
|
|
<li> 2009/12/20: Experimental polygon shape extraction added. Thanks to Yusuf Dewaswala for reporting.
|
2009-12-20 02:38:01 +00:00
|
|
|
<li> 2009/12/19: CMap resources are now part of the package. Thanks to Adobe for open-sourcing them.
|
2009-11-29 07:17:36 +00:00
|
|
|
<li> 2009/11/29: Password encryption bug fixed. Thanks to Yannick Gingras.
|
|
|
|
<li> 2009/10/31: SGML output format is changed and renamed as XML.
|
2009-10-24 04:41:59 +00:00
|
|
|
<li> 2009/10/24: Charspace bug fixed. Adjusted for 4-space indentation.
|
2009-10-04 03:48:11 +00:00
|
|
|
<li> 2009/10/04: Another matrix operation bug fixed. Thanks to Vitaly Sedelnik.
|
2009-09-12 03:05:49 +00:00
|
|
|
<li> 2009/09/12: Fixed rectangle handling. Able to extract image boundaries.
|
2009-08-30 01:23:00 +00:00
|
|
|
<li> 2009/08/30: Fixed page rotation handling.
|
2009-08-26 15:20:44 +00:00
|
|
|
<li> 2009/08/26: Fixed zlib decoding bug. Thanks to Shon Urbas.
|
2009-08-24 06:56:54 +00:00
|
|
|
<li> 2009/08/24: Fixed a bug in character placing. Thanks to Pawan Jain.
|
2009-07-21 07:55:19 +00:00
|
|
|
<li> 2009/07/21: Improvement in layout analysis.
|
2009-07-11 15:28:12 +00:00
|
|
|
<li> 2009/07/11: Improvement in layout analysis. Thanks to Lubos Pintes.
|
2009-05-17 14:02:57 +00:00
|
|
|
<li> 2009/05/17: Bugfixes, massive code restructuring, and simple graphic element support added. setup.py is supported.
|
2009-03-29 15:14:23 +00:00
|
|
|
<li> 2009/03/30: Text output mode added.
|
2009-03-29 15:31:00 +00:00
|
|
|
<li> 2009/03/25: Encoding problems fixed. Word splitting option added.
|
|
|
|
<li> 2009/02/28: Robust handling of corrupted PDFs. Thanks to Troy Bollinger.
|
2009-02-01 15:01:32 +00:00
|
|
|
<li> 2009/02/01: Various bugfixes. Thanks to Hiroshi Manabe.
|
2009-01-17 16:31:42 +00:00
|
|
|
<li> 2009/01/17: Handling a trailer correctly that contains both /XrefStm and /Prev entries.
|
2009-01-10 11:14:17 +00:00
|
|
|
<li> 2009/01/10: Handling Type3 font metrics correctly.
|
|
|
|
<li> 2008/12/28: Better handling of word spacing. Thanks to Christian Nentwich.
|
2008-09-06 04:52:25 +00:00
|
|
|
<li> 2008/09/06: A sample pdf2html webapp added.
|
2008-08-30 07:40:52 +00:00
|
|
|
<li> 2008/08/30: ASCII85 encoding filter support.
|
2008-07-27 04:30:37 +00:00
|
|
|
<li> 2008/07/27: Tagged contents extraction support.
|
2008-07-16 11:38:01 +00:00
|
|
|
<li> 2008/07/10: Outline (TOC) extraction support.
|
|
|
|
<li> 2008/06/29: HTML output added. Reorganized the directory structure.
|
2008-06-29 14:29:36 +00:00
|
|
|
<li> 2008/04/29: Bugfix for Win32. Thanks to Chris Clark.
|
|
|
|
<li> 2008/04/27: Basic encryption and LZW decoding support added.
|
2009-10-31 02:09:36 +00:00
|
|
|
<li> 2008/01/07: Several bugfixes. Thanks to Nick Fabry for his vast contribution.
|
2008-04-27 11:47:38 +00:00
|
|
|
<li> 2007/12/31: Initial release.
|
2008-04-27 11:55:51 +00:00
|
|
|
<li> 2004/12/24: Start writing the code out of boredom...
|
2008-04-27 11:47:38 +00:00
|
|
|
</ul>
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="related"></a>
|
|
|
|
<hr noshade>
|
2008-04-27 11:47:38 +00:00
|
|
|
<h2>Related Projects</h2>
|
2008-01-07 13:47:52 +00:00
|
|
|
<ul>
|
2008-01-09 14:21:24 +00:00
|
|
|
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
|
2008-01-07 13:47:52 +00:00
|
|
|
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
|
|
|
|
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
|
|
|
|
</ul>
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<a name="license"></a>
|
|
|
|
<hr noshade>
|
2008-05-03 04:10:59 +00:00
|
|
|
<h2>Terms and Conditions</h2>
|
2007-12-31 04:40:27 +00:00
|
|
|
<p>
|
2009-10-24 03:44:32 +00:00
|
|
|
(This is the so-called MIT/X License.)
|
|
|
|
<p>
|
2007-12-31 04:40:27 +00:00
|
|
|
<small>
|
2009-01-10 11:14:17 +00:00
|
|
|
Copyright (c) 2004-2009 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
|
2007-12-31 04:40:27 +00:00
|
|
|
<p>
|
|
|
|
Permission is hereby granted, free of charge, to any person
|
|
|
|
obtaining a copy of this software and associated documentation
|
|
|
|
files (the "Software"), to deal in the Software without
|
|
|
|
restriction, including without limitation the rights to use,
|
|
|
|
copy, modify, merge, publish, distribute, sublicense, and/or
|
|
|
|
sell copies of the Software, and to permit persons to whom the
|
|
|
|
Software is furnished to do so, subject to the following
|
|
|
|
conditions:
|
|
|
|
<p>
|
|
|
|
The above copyright notice and this permission notice shall be
|
|
|
|
included in all copies or substantial portions of the Software.
|
|
|
|
<p>
|
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
|
|
|
|
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
|
|
|
|
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
|
|
|
|
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
|
|
|
|
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
|
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
|
|
|
|
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
|
|
|
|
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
|
|
|
</small>
|
|
|
|
|
2008-04-27 11:55:51 +00:00
|
|
|
<hr noshade>
|
2007-12-31 04:40:27 +00:00
|
|
|
<address>Yusuke Shinyama</address>
|
|
|
|
</body>
|