git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@119 1aa58f4a-7d42-0410-adbc-911cccaed67c
parent af63784305
commit 0113486b76
10
README.html
|
@@ -18,7 +18,7 @@ Python PDF parser and analyzer
|
||||||
|
|
||||||
<div align=right class=lastmod>
|
<div align=right class=lastmod>
|
||||||
<!-- hhmts start -->
|
<!-- hhmts start -->
|
||||||
Last Modified: Sun Jul 12 00:27:23 JST 2009
|
Last Modified: Sun Jul 12 00:36:44 JST 2009
|
||||||
<!-- hhmts end -->
|
<!-- hhmts end -->
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
@@ -33,7 +33,7 @@ the exact location of texts in a page, as well as
|
||||||
other extra information such as font information or ruled lines.
|
other extra information such as font information or ruled lines.
|
||||||
It includes a PDF converter that can transform PDF files
|
It includes a PDF converter that can transform PDF files
|
||||||
into other text formats (such as HTML). It has an extensible
|
into other text formats (such as HTML). It has an extensible
|
||||||
PDF parser that can be used for other purpoes instead of text analysis.
|
PDF parser that can be used for other purposes instead of text analysis.
|
||||||
<p>
|
<p>
|
||||||
<strong>Features:</strong>
|
<strong>Features:</strong>
|
||||||
<ul>
|
<ul>
|
||||||
|
@@ -121,7 +121,7 @@ For example:
|
||||||
$ <strong>cd /usr/lib/python2.5/site-packages</strong>
|
$ <strong>cd /usr/lib/python2.5/site-packages</strong>
|
||||||
$ <strong>tar jxf CMap.tar.bz2</strong>
|
$ <strong>tar jxf CMap.tar.bz2</strong>
|
||||||
</pre></blockquote>
|
</pre></blockquote>
|
||||||
<li> Do the follwoing. (this is optional, but highly recommended)<br>
|
<li> Do the following. (this is optional, but highly recommended)<br>
|
||||||
<blockquote><pre>
|
<blockquote><pre>
|
||||||
$ <strong>python -m pdfminer.cmap</strong>
|
$ <strong>python -m pdfminer.cmap</strong>
|
||||||
</pre></blockquote>
|
</pre></blockquote>
|
||||||
|
@@ -140,7 +140,7 @@ PDFMiner comes with two handy tools:
|
||||||
<h3>pdf2txt.py</h3>
|
<h3>pdf2txt.py</h3>
|
||||||
<p>
|
<p>
|
||||||
<code>pdf2txt.py</code> extracts text contents from a PDF file.
|
<code>pdf2txt.py</code> extracts text contents from a PDF file.
|
||||||
It extracts all the texts that are to be rendered programatically,
|
It extracts all the texts that are to be rendered programmatically,
|
||||||
It cannot recognize texts drawn as images that would require optical character recognition.
|
It cannot recognize texts drawn as images that would require optical character recognition.
|
||||||
It also extracts the corresponding locations, font names, font sizes, writing
|
It also extracts the corresponding locations, font names, font sizes, writing
|
||||||
direction (horizontal or vertical) for each text portion.
|
direction (horizontal or vertical) for each text portion.
|
||||||
|
@@ -202,7 +202,7 @@ In the figure below, two text chunks whose distance is closer than
|
||||||
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
|
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
|
||||||
continuous and get grouped into one. Also, two lines whose distance is closer than
|
continuous and get grouped into one. Also, two lines whose distance is closer than
|
||||||
the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
|
the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
|
||||||
as a text box, which is a recutangular area that contains a "cluster" of texts.
|
as a text box, which is a rectangular area that contains a "cluster" of texts.
|
||||||
Furthermore, it may be required to insert blank characters (spaces) as necessary
|
Furthermore, it may be required to insert blank characters (spaces) as necessary
|
||||||
if the distance between two words is greater than the <em>word_margin</em>
|
if the distance between two words is greater than the <em>word_margin</em>
|
||||||
(<em><font color="green">W</font></em>), as a blank between words might not be
|
(<em><font color="green">W</font></em>), as a blank between words might not be
|
||||||
|
|
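The README hunk above describes grouping text chunks whose distance is smaller than char_margin. A minimal sketch of that idea follows. It is not PDFMiner's implementation: the function name `group_chunks`, the one-dimensional `(x0, x1)` spans, and the use of an absolute margin are all assumptions made for illustration.

```python
def group_chunks(chunks, char_margin):
    """Merge (x0, x1) spans whose horizontal gap is smaller than char_margin.

    Hypothetical helper for illustration only; spans are plain tuples,
    not PDFMiner layout objects.
    """
    groups = []
    for x0, x1 in sorted(chunks):
        if groups and x0 - groups[-1][1] < char_margin:
            # Gap to the previous group is below the margin: extend that group.
            groups[-1] = (groups[-1][0], max(groups[-1][1], x1))
        else:
            # Gap is too wide (or this is the first span): start a new group.
            groups.append((x0, x1))
    return groups


if __name__ == "__main__":
    # The first two spans are 1 unit apart and merge; the third is 10 units away.
    print(group_chunks([(0, 10), (11, 20), (30, 40)], char_margin=2))
    # [(0, 20), (30, 40)]
```

The same comparison could be applied vertically with line_margin to cluster lines into text boxes, again only as an approximation of what the README describes.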
36
setup.py
|
@@ -2,12 +2,30 @@
|
||||||
from distutils.core import setup
|
from distutils.core import setup
|
||||||
from pdfminer import __version__
|
from pdfminer import __version__
|
||||||
|
|
||||||
setup(name='pdfminer',
|
setup(
|
||||||
version=__version__,
|
name='pdfminer',
|
||||||
description='PDF parser and analyzer',
|
version=__version__,
|
||||||
license='MIT/X',
|
description='PDF parser and analyzer',
|
||||||
author='Yusuke Shinyama',
|
long_description='''PDFMiner is a suite of programs that help
|
||||||
url='http://www.unixuser.org/~euske/python/pdfminer/index.html',
|
extracting and analyzing text data of PDF documents.
|
||||||
packages=['pdfminer'],
|
Unlike other PDF-related tools, it allows to obtain
|
||||||
scripts=['tools/pdf2txt.py', 'tools/dumppdf.py'],
|
the exact location of texts in a page, as well as
|
||||||
)
|
other extra information such as font information or ruled lines.
|
||||||
|
It includes a PDF converter that can transform PDF files
|
||||||
|
into other text formats (such as HTML). It has an extensible
|
||||||
|
PDF parser that can be used for other purposes instead of text analysis.''',
|
||||||
|
keywords='pdf parser, pdf converter, text mining',
|
||||||
|
license='MIT/X',
|
||||||
|
author='Yusuke Shinyama',
|
||||||
|
author_email='yusuke at cs dot nyu dot edu',
|
||||||
|
url='http://www.unixuser.org/~euske/python/pdfminer/index.html',
|
||||||
|
packages=['pdfminer'],
|
||||||
|
scripts=['tools/pdf2txt.py', 'tools/dumppdf.py'],
|
||||||
|
classifiers=[
|
||||||
|
'Development Status :: 4 - Beta',
|
||||||
|
'Environment :: Console',
|
||||||
|
'Intended Audience :: Developers',
|
||||||
|
'Intended Audience :: Science/Research',
|
||||||
|
'License :: OSI Approved :: MIT License',
|
||||||
|
],
|
||||||
|
)
|
||||||
|
|