documentation.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@109 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
parent
8cae56a555
commit
5c1cebadbb
3 Makefile
@@ -26,9 +26,6 @@ clean:
 test:
 	cd samples && make test
 
-cdbcmap: CMap
-	$(CONV_CMAP) CMap
-
 # Maintainance:
 commit: clean
 	$(SVN) commit
54 README.html
@@ -18,7 +18,7 @@ Python PDF parser and analyzer
 
 <div align=right class=lastmod>
 <!-- hhmts start -->
-Last Modified: Sat May 16 19:58:11 JST 2009
+Last Modified: Sun May 17 15:39:06 JST 2009
 <!-- hhmts end -->
 </div>
 
@@ -114,19 +114,21 @@ which is distributed from Adobe.
 Here is how:
 
 <ol>
-<li> Get
+<li> Get a CMap archive file from
 <a href="http://www.unixuser.org/~euske/pub/CMap.tar.bz2">
 http://www.unixuser.org/~euske/pub/CMap.tar.bz2
 </a>
-<li> Do the follwoing:
+<li> Expand the archive and put the <code>CMap</code> directory under the directory
+where <code>pdfminer</code> is installed.
+(Normally this should be something like <code>/usr/lib/python2.5/site-packages</code>.)
+For example:
 <blockquote><pre>
+$ <strong>cd /usr/lib/python2.5/site-packages</strong>
 $ <strong>tar jxf CMap.tar.bz2</strong>
 </pre></blockquote>
-<li> Put the <code>CMap</code> directory into the <code>pdfminer</code> directory.
-<li> Go to the <code>pdfminer</code> directory.
 <li> Do the follwoing: (this is optional but highly recommended)<br>
 <blockquote><pre>
-$ <strong>make cdbcmap</strong>
+$ <strong>python -m pdfminer.cmap /usr/lib/python2.5/site-packages/CMap</strong>
 </pre></blockquote>
 </ol>
 
@@ -135,38 +137,36 @@ $ <strong>make cdbcmap</strong>
 <h2>How to Use</h2>
 
 <p>
-PDFMiner comes with two programs:
+PDFMiner comes with two handy tools:
 <code>pdf2txt.py</code> and <code>dumppdf.py</code>.
 
 <a name="pdf2txt"></a>
 <h3>pdf2txt.py</h3>
 <p>
 <code>pdf2txt.py</code> extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programatically.
-It also extracts the corresponding locations, font names,
-and font sizes for each text portion. However,
-it cannot extract texts embedded within images
-(i.e. it does not do optical character recognition).
-You can provide a password for protected PDF documents
-whose access is limited.
+It extracts all the texts that are to be rendered programatically,
+i.e. it cannot extract texts drawn as images that require optical character recognition.
+It also extracts the corresponding locations, font names, font sizes, writing
+direction (horizontal or vertical) for each text portion.
+You need to provide a password for protected PDF documents when its access is restricted.
+You cannot extract any text from a PDF document which does not have extraction permission.
 <p>
 For non-ASCII languages, you can specify the output encoding
 (such as UTF-8).
-Note that not all characters in a PDF can be converted safely
-to Unicode, as some of them are not included in the current
-Unicode Standard.
+<p>
+<strong>Note:</strong> Not all characters in a PDF can be safely converted to Unicode.
 
 <p>
 Examples:
 <blockquote><pre>
-$ <strong>python -m pdflib.pdf2txt -o output.html samples/naacl06-shinyama.pdf</strong>
+$ <strong>pdf2txt.py samples/naacl06-shinyama.pdf > output.html</strong>
 (extract text as an HTML file whose filename is output.html)
 
-$ <strong>python -m pdflib.pdf2txt -c euc-jp samples/jo.pdf</strong>
-(extract Japanese texts in vertical writing, CMap is required)
+$ <strong>pdf2txt.py -c euc-jp samples/jo.pdf > output.html</strong>
+(extract a Japanese HTML file in vertical writing, CMap is required)
 
-$ <strong>python -m pdflib.pdf2txt -P mypassword secret.pdf</strong>
-(extract texts from an encrypted PDF file with a password)
+$ <strong>pdf2txt.py -P mypassword -t text secret.pdf > output.txt</strong>
+(extract a text from an encrypted PDF file)
 </pre></blockquote>
 
 <p>
@@ -184,10 +184,6 @@ By default, it extracts texts from all the pages.
 <dt> <code>-c <em>codec</em></code>
 <dd> Specifies the output codec for non-ASCII texts.
 <p>
-<dt> <code>-w</code>
-<dd> Split each word into a different chunk in the output.
-This makes the word spacing correctly handled.
-<p>
 <dt> <code>-t <em>type</em></code>
 <dd> Specifies the output format. The following formats are currently supported.
 <ul>
@@ -217,13 +213,13 @@ but it's also possible to extract some meaningful contents
 <p>
 Examples:
 <blockquote><pre>
-$ <strong>python -m tools.dumppdf -a foo.pdf</strong>
+$ <strong>dumppdf.py -a foo.pdf</strong>
 (dump all the headers and contents, except stream objects)
 
-$ <strong>python -m tools.dumppdf -T foo.pdf</strong>
+$ <strong>dumppdf.py -T foo.pdf</strong>
 (dump the table of contents)
 
-$ <strong>python -m tools.dumppdf -r -i6 foo.pdf > pic.jpeg</strong>
+$ <strong>dumppdf.py -r -i6 foo.pdf > pic.jpeg</strong>
 (extract a JPEG image)
 </pre></blockquote>
 

@@ -204,7 +204,6 @@ class CMapDB(object):
 
   @classmethod
   def get_cmap(klass, cmapname, strict=True):
-    import os.path
     cmapname = klass.CMAP_ALIAS.get(cmapname, cmapname)
     if cmapname in klass.cmapdb:
       cmap = klass.cmapdb[cmapname]
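
Reading note on the last hunk: the visible part of `get_cmap` first resolves an alias through `CMAP_ALIAS` and then checks the `cmapdb` cache. The rest of the method is not shown in this diff, so the following is only a minimal sketch of that alias-then-cache pattern. The `_load` helper, the `CMapNotFound` exception, the table contents, and the meaning given to `strict` are all assumptions made for illustration, not the project's actual code.

```python
class CMapNotFound(Exception):
    """Hypothetical error for a name that cannot be resolved."""


class CMapDB(object):
    # Attribute names follow the hunk; the contents are made up.
    CMAP_ALIAS = {'short': 'canonical-name'}
    cmapdb = {}

    @classmethod
    def _load(klass, cmapname):
        # Hypothetical loader standing in for whatever reads CMap data.
        if cmapname != 'canonical-name':
            raise CMapNotFound(cmapname)
        return {'name': cmapname}

    @classmethod
    def get_cmap(klass, cmapname, strict=True):
        # Step 1: map an alias to its canonical name (unchanged if not aliased).
        cmapname = klass.CMAP_ALIAS.get(cmapname, cmapname)
        # Step 2: reuse the cached object if this name was loaded before.
        if cmapname in klass.cmapdb:
            return klass.cmapdb[cmapname]
        # Step 3 (assumed): load and cache; strict=False returns None on failure.
        try:
            cmap = klass._load(cmapname)
        except CMapNotFound:
            if strict:
                raise
            return None
        klass.cmapdb[cmapname] = cmap
        return cmap
```

With this sketch, `CMapDB.get_cmap('short')` and `CMapDB.get_cmap('canonical-name')` return the same cached object, because both names resolve to one cache key.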