html tidy up

git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@257 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
yusuke.shinyama.dummy 2010-10-17 09:22:39 +00:00
parent 98442ed943
commit 4f4f03fb2d
2 changed files with 65 additions and 89 deletions

@ -5,9 +5,17 @@
<title>PDFMiner</title> <title>PDFMiner</title>
<style type="text/css"><!-- <style type="text/css"><!--
blockquote { background: #eeeeee; } blockquote { background: #eeeeee; }
h1 { border-bottom: solid black 2px; }
h2 { border-bottom: solid black 1px; }
--></style> --></style>
</head><body> </head><body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sun Oct 17 09:10:34 UTC 2010
<!-- hhmts end -->
</div>
<h1>PDFMiner</h1> <h1>PDFMiner</h1>
<p> <p>
Python PDF parser and analyzer Python PDF parser and analyzer
@ -17,31 +25,22 @@ Python PDF parser and analyzer
&nbsp; &nbsp;
<a href="#changes">Recent Changes</a> <a href="#changes">Recent Changes</a>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sun Oct 17 05:13:01 UTC 2010
<!-- hhmts end -->
</div>
<ul> <ul>
<li> <a href="#intro">What's It?</a> <li> <a href="#intro">What's It?</a>
<li> <a href="#source">Download</a> <li> <a href="#download">Download</a>
<li> <a href="#install">Install</a> <li> <a href="#install">How to Install</a>
&nbsp; <small>(<a href="#cmap">for CJK languages</a>)</small> &nbsp; <small>(<a href="#cmap">for CJK languages</a>)</small>
<li> <a href="#usage">How to Use</a> <li> <a href="#usage">How to Use</a>
&nbsp; <small>(<a href="#pdf2txt">pdf2txt.py</a>, &nbsp; <small>(<a href="#pdf2txt">pdf2txt.py</a>,
<a href="#dumppdf">dumppdf.py</a>, <a href="#dumppdf">dumppdf.py</a>,
<a href="programming.html">use as library</a>)</small> <a href="programming.html">use as library</a>)</small>
<li> <a href="#techdocs">Technical Documents</a>
<li> <a href="#todos">TODOs</a> <li> <a href="#todos">TODOs</a>
<li> <a href="#changes">Changes</a> <li> <a href="#changes">Changes</a>
<li> <a href="#related">Related Projects</a> <li> <a href="#related">Related Projects</a>
<li> <a href="#license">Terms and Conditions</a> <li> <a href="#license">Terms and Conditions</a>
</ul> </ul>
<a name="intro"></a> <h2><a name="intro">What's It?</a></h2>
<hr noshade>
<h2>What's It?</h2>
<p> <p>
PDFMiner is a tool for extracting information from PDF documents. PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting Unlike other PDF-related tools, it focuses entirely on getting
@ -51,8 +50,9 @@ other information such as fonts or lines.
It includes a PDF converter that can transform PDF files It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible into other text formats (such as HTML). It has an extensible
PDF parser that can be used for other purposes instead of text analysis. PDF parser that can be used for other purposes instead of text analysis.
<p> <p>
<strong>Features:</strong> <h3>Features</h3>
<ul> <ul>
<li> Written entirely in Python. (for version 2.4 or newer) <li> Written entirely in Python. (for version 2.4 or newer)
<li> Parse, analyze, and convert PDF documents. <li> Parse, analyze, and convert PDF documents.
@ -66,29 +66,28 @@ PDF parser that can be used for other purposes instead of text analysis.
<li> Reconstruct the original layout by grouping text chunks. <li> Reconstruct the original layout by grouping text chunks.
</ul> </ul>
<p> <p>
On the performance side,
PDFMiner is about 20 times slower than PDFMiner is about 20 times slower than
other C/C++-based software such as XPdf. other C/C++-based counterparts such as XPdf.
<a name="source"></a> <h3><a name="download">Download</a></h3>
<p> <p>
<strong>Download from PyPI:</strong><br> <strong>Source distribution:</strong><br>
<a href="http://pypi.python.org/pypi/pdfminer/"> <a href="http://pypi.python.org/pypi/pdfminer/">
http://pypi.python.org/pypi/pdfminer/ http://pypi.python.org/pypi/pdfminer/
</a> </a>
<P>
<strong>SVN repository:</strong><br>
<a href="http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer">
http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
</a>
<p> <p>
<strong>Discussion:</strong> (for questions and comments, post here)<br> <strong>Discussion:</strong> (for questions and comments, post here)<br>
<a href="http://groups.google.com/group/pdfminer-users/"> <a href="http://groups.google.com/group/pdfminer-users/">
http://groups.google.com/group/pdfminer-users/ http://groups.google.com/group/pdfminer-users/
</a> </a>
<P>
<strong>View the source:</strong><br>
<a href="http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer">
http://code.google.com/p/pdfminerr/source/browse/trunk/pdfminer
</a>
<P> <P>
<strong>Online Demo:</strong> (pdf -&gt; html conversion webapp)<br> <strong>Online Demo:</strong> (pdf -&gt; html conversion webapp)<br>
<a href="http://pdf2html.tabesugi.net:8080/"> <a href="http://pdf2html.tabesugi.net:8080/">
@ -96,13 +95,10 @@ http://pdf2html.tabesugi.net:8080/
</a> </a>
<a name="install"></a> <h2><a name="install">How to Install</a></h2>
<hr noshade>
<h2>Install</h2>
<ol> <ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.4 or newer. <li> Install <a href="http://www.python.org/download/">Python</a> 2.4 or newer.
(<font color=red><strong>Python 3 is not supported.</strong></font>) (<font color=red><strong>Python 3 is not supported.</strong></font>)
<li> Download the <a href="#source">PDFMiner source</a>. <li> Download the <a href="#source">PDFMiner source</a>.
<li> Unpack it. <li> Unpack it.
<li> Run <code>setup.py</code> to install:<br> <li> Run <code>setup.py</code> to install:<br>
@ -131,9 +127,8 @@ W o r l d
<li> Done! <li> Done!
</ol> </ol>
<h3><a name="cmap">For CJK languages</a></h3>
<p> <p>
<a name="cmap"></a>
<h3>For CJK languages</h3>
In order to process CJK languages, you need an additional step to take In order to process CJK languages, you need an additional step to take
during installation: during installation:
<blockquote><pre> <blockquote><pre>
@ -146,6 +141,7 @@ writing 'CNS1_H.py'...
# <strong>python setup.py install</strong> # <strong>python setup.py install</strong>
</pre></blockquote> </pre></blockquote>
<p> <p>
On Windows machines which don't have <code>make</code> command, On Windows machines which don't have <code>make</code> command,
paste the following commands on a command line prompt: paste the following commands on a command line prompt:
@ -157,16 +153,12 @@ paste the following commands on a command line prompt:
<strong>python setup.py install</strong> <strong>python setup.py install</strong>
</pre></blockquote> </pre></blockquote>
<a name="usage"></a> <h2><a name="usage">How to Use</a></h2>
<hr noshade>
<h2>How to Use</h2>
<p> <p>
PDFMiner comes with two handy tools: PDFMiner comes with two handy tools:
<code>pdf2txt.py</code> and <code>dumppdf.py</code>. <code>pdf2txt.py</code> and <code>dumppdf.py</code>.
<a name="pdf2txt"></a> <h3><a name="pdf2txt">pdf2txt.py</a></h3>
<h3>pdf2txt.py</h3>
<p> <p>
<code>pdf2txt.py</code> extracts text contents from a PDF file. <code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the texts that are to be rendered programmatically, It extracts all the texts that are to be rendered programmatically,
@ -176,11 +168,12 @@ It also extracts the corresponding locations, font names, font sizes, writing
direction (horizontal or vertical) for each text portion. direction (horizontal or vertical) for each text portion.
You need to provide a password for protected PDF documents when its access is restricted. You need to provide a password for protected PDF documents when its access is restricted.
You cannot extract any text from a PDF document which does not have extraction permission. You cannot extract any text from a PDF document which does not have extraction permission.
<p>
<strong>Note:</strong> Not all characters in a PDF can be safely converted to Unicode.
<p> <p>
Examples: <strong>Note:</strong>
Not all characters in a PDF can be safely converted to Unicode.
<h4>Examples</h4>
<blockquote><pre> <blockquote><pre>
$ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong> $ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file whose filename is output.html) (extract text as an HTML file whose filename is output.html)
@ -192,8 +185,7 @@ $ <strong>pdf2txt.py -P mypassword -o output.txt secret.pdf</strong>
(extract a text from an encrypted PDF file) (extract a text from an encrypted PDF file)
</pre></blockquote> </pre></blockquote>
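The command lines above can also be assembled from another program. The following is a minimal sketch, not part of PDFMiner: the helper names `build_pdf2txt_command` and `run_pdf2txt` are hypothetical, and it assumes `pdf2txt.py` is installed on the PATH. It uses only the `-o` and `-P` options shown in the examples. Because the tool runs as a separate process, the wrapper does not need to run under the same Python version as PDFMiner.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output_path=None, password=None):
    """Return the argument list for one pdf2txt.py invocation (hypothetical helper)."""
    command = ["pdf2txt.py"]
    if password is not None:
        # -P supplies the password for a protected PDF document.
        command += ["-P", password]
    if output_path is not None:
        # -o names the output file.
        command += ["-o", output_path]
    command.append(pdf_path)
    return command


def run_pdf2txt(pdf_path, output_path=None, password=None):
    """Run pdf2txt.py and return its exit status (assumes the script is on PATH)."""
    return subprocess.call(build_pdf2txt_command(pdf_path, output_path, password))
```

For instance, `run_pdf2txt("secret.pdf", output_path="output.txt", password="mypassword")` would issue the same command as the encrypted-file example.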
<p> <h4>Options</h4>
Options:
<dl> <dl>
<dt> <code>-o <em>filename</em></code> <dt> <code>-o <em>filename</em></code>
<dd> Specifies the output file name. <dd> Specifies the output file name.
@ -286,16 +278,14 @@ By default, it extracts all the pages in a document.
<dd> Increases the debug level. <dd> Increases the debug level.
</dl> </dl>
<a name="dumppdf"></a> <h3><a name="dumppdf">dumppdf.py</a></h3>
<h3>dumppdf.py</h3>
<p> <p>
<code>dumppdf.py</code> dumps the internal contents of a PDF file <code>dumppdf.py</code> dumps the internal contents of a PDF file
in pseudo-XML format. This program is primarily for debugging purposes, in pseudo-XML format. This program is primarily for debugging purposes,
but it's also possible to extract some meaningful contents but it's also possible to extract some meaningful contents
(such as images). (such as images).
<p> <h4>Examples</h4>
Examples:
<blockquote><pre> <blockquote><pre>
$ <strong>dumppdf.py -a foo.pdf</strong> $ <strong>dumppdf.py -a foo.pdf</strong>
(dump all the headers and contents, except stream objects) (dump all the headers and contents, except stream objects)
@ -307,8 +297,7 @@ $ <strong>dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image) (extract a JPEG image)
</pre></blockquote> </pre></blockquote>
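The shell redirection in the JPEG example can be reproduced from Python when no shell is available. This is a rough sketch under stated assumptions: `build_dumppdf_command` and `save_stdout` are hypothetical helpers, not PDFMiner functions; the parameter names `raw` and `object_id` are only this sketch's labels for the `-r` and `-i` flags used above; and `dumppdf.py` is assumed to be on the PATH.

```python
import subprocess


def build_dumppdf_command(pdf_path, object_id=None, raw=False, dump_all=False):
    """Return the argument list for one dumppdf.py invocation (hypothetical helper)."""
    command = ["dumppdf.py"]
    if dump_all:
        command.append("-a")
    if raw:
        command.append("-r")
    if object_id is not None:
        # Written in the same attached form as the example above, e.g. -i6.
        command.append("-i%d" % object_id)
    command.append(pdf_path)
    return command


def save_stdout(command, output_path):
    """Run command, write its standard output to output_path, return the exit status."""
    # Binary mode keeps image data byte-for-byte intact.
    with open(output_path, "wb") as output_file:
        return subprocess.call(command, stdout=output_file)
```

With these helpers, `save_stdout(build_dumppdf_command("foo.pdf", object_id=6, raw=True), "pic.jpeg")` corresponds to the last example command.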
<p> <h4>Options</h4>
Options:
<dl> <dl>
<dt> <code>-a</code> <dt> <code>-a</code>
<dd> Instructs to dump all the objects. <dd> Instructs to dump all the objects.
@ -347,8 +336,7 @@ no stream header is displayed for the ease of saving it to a file.
<dd> Increases the debug level. <dd> Increases the debug level.
</dl> </dl>
<a name="library"></a> <h3><a name="library">Use as Library</a></h3>
<h3>Use as Library</h3>
<p> <p>
PDFMiner can be used as a library by other Python programs. PDFMiner can be used as a library by other Python programs.
<p> <p>
@ -356,21 +344,7 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
<p> <p>
Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete example by Denis Papathanasiou</a>. Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete example by Denis Papathanasiou</a>.
<a name="techdocs"></a> <h2><a name="todos">TODOs</a></h2>
<hr noshade>
<h2>Technical Documents</h2>
<p>
<ul>
<li> Video:
"How to Extract Text Contents from PDF by Hand"
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>
</ul>
<a name="todos"></a>
<hr noshade>
<h2>TODOs</h2>
<ul> <ul>
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and <li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance. <a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
@ -381,9 +355,7 @@ Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete
<li> CCITTFax stream filter support. <li> CCITTFax stream filter support.
</ul> </ul>
<a name="changes"></a> <h2><a name="changes">Changes</a></h2>
<hr noshade>
<h2>Changes</h2>
<ul> <ul>
<li> 2010/10/17: A couple of bugfixes and a minor improvement. Thanks to standardabweichung and Alastair Irving. <li> 2010/10/17: A couple of bugfixes and a minor improvement. Thanks to standardabweichung and Alastair Irving.
<li> 2010/09/07: A minor bugfix. Thanks to Alexander Garden. <li> 2010/09/07: A minor bugfix. Thanks to Alexander Garden.
@ -435,7 +407,6 @@ Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete
</ul> </ul>
<a name="related"></a> <a name="related"></a>
<hr noshade>
<h2>Related Projects</h2> <h2>Related Projects</h2>
<ul> <ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a> <li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
@ -445,7 +416,6 @@ Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete
</ul> </ul>
<a name="license"></a> <a name="license"></a>
<hr noshade>
<h2>Terms and Conditions</h2> <h2>Terms and Conditions</h2>
<p> <p>
(This is so-called MIT/X License) (This is so-called MIT/X License)

@ -5,31 +5,38 @@
<title>Programming with PDFMiner</title> <title>Programming with PDFMiner</title>
<style type="text/css"><!-- <style type="text/css"><!--
blockquote { background: #eeeeee; } blockquote { background: #eeeeee; }
h1 { border-bottom: solid black 2px; }
h2 { border-bottom: solid black 1px; }
.comment { color: darkgreen; } .comment { color: darkgreen; }
--></style> --></style>
</head><body> </head><body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sun Oct 17 09:12:03 UTC 2010
<!-- hhmts end -->
</div>
<p> <p>
<a href="index.html">[Back to PDFMiner homepage]</a> <a href="index.html">[Back to PDFMiner homepage]</a>
<h1>Programming with PDFMiner</h1> <h1>Programming with PDFMiner</h1>
<p> <p>
This document explains how to use PDFMiner as a library This page explains how to use PDFMiner as a library
from other applications. from other applications.
<ul> <ul>
<li> <a href="#overview">Overview</a> <li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a> <li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a> <li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a> <li> <a href="#tocextract">TOC Extraction</a>
<li> <a href="#more">more</a> <li> <a href="#extend">Parser Extension</a>
</ul> </ul>
<a name="overview"> <h2><a name="overview">Overview</a></h2>
<hr noshade>
<h2>Overview</h2>
<p> <p>
<strong>PDF is evil.</strong> Although it is called a PDF <strong>PDF is evil.</strong> Although it is called a PDF
"document", it's nothing like Word or HTML. PDF is more like a "document", it's nothing like Word or HTML document. PDF is more
picture representation. PDF contents are just a bunch of like a graphic representation. PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact instructions that tell how to place the stuff at each exact
position on a display or paper. In most cases, it has no logical position on a display or paper. In most cases, it has no logical
structure such as sentences or paragraphs and it cannot adapt structure such as sentences or paragraphs and it cannot adapt
@ -38,6 +45,13 @@ reconstruct some of those structures by guessing from its
positioning, but there's nothing guaranteed to work. Ugly, I positioning, but there's nothing guaranteed to work. Ugly, I
know. Again, PDF is evil. know. Again, PDF is evil.
<p>
[More technical details about the internal structure of PDF:
"How to Extract Text Contents from PDF Manually"
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>]
<p> <p>
Because a PDF file has such a big and complex structure, Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time and memory consuming. However, parsing a PDF file as a whole is time and memory consuming. However,
@ -61,9 +75,7 @@ Figure 1 shows the relationship between the classes in PDFMiner.
<small>Figure 1. Relationships between PDFMiner classes</small> <small>Figure 1. Relationships between PDFMiner classes</small>
</div> </div>
<a name="basic"> <h2><a name="basic">Basic Usage</a></h2>
<hr noshade>
<h2>Basic Usage</h2>
<p> <p>
A typical way to parse a PDF file is the following: A typical way to parse a PDF file is the following:
<blockquote><pre> <blockquote><pre>
@ -97,9 +109,7 @@ for page in doc.get_pages():
interpreter.process_page(page) interpreter.process_page(page)
</pre></blockquote> </pre></blockquote>
<a name="layout"> <h2><a name="layout">Accessing Layout Objects</a></h2>
<hr noshade>
<h2>Accessing Layout Objects</h2>
<p> <p>
Here is a typical way to use the layout analysis function: Here is a typical way to use the layout analysis function:
<blockquote><pre> <blockquote><pre>
@ -174,9 +184,7 @@ Could be used for framing another pictures or figures.
<dd> Represents a polygon in a page. <dd> Represents a polygon in a page.
</dl> </dl>
<a name="toc"> <h2><a name="tocextract">TOC Extraction</a></h2>
<hr noshade>
<h2>TOC Extraction</h2>
<p> <p>
PDFMiner provides functions to access the document's table of contents PDFMiner provides functions to access the document's table of contents
("Outlines"). ("Outlines").
@ -205,9 +213,7 @@ way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are way to tell exactly which part of text these destinations are
refering to. refering to.
<a name="more"> <h2><a name="extend">Parser Extension</a></h2>
<hr noshade>
<h2>More</h2>
<p> <p>
You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class You can extend <code>PDFPageInterpreter</code> and <code>PDFDevice</code> class