git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@119 1aa58f4a-7d42-0410-adbc-911cccaed67c

yusuke.shinyama.dummy 2009-07-11 15:38:13 +00:00
parent af63784305
commit 0113486b76
2 changed files with 32 additions and 14 deletions

@@ -18,7 +18,7 @@ Python PDF parser and analyzer
 <div align=right class=lastmod>
 <!-- hhmts start -->
-Last Modified: Sun Jul 12 00:27:23 JST 2009
+Last Modified: Sun Jul 12 00:36:44 JST 2009
 <!-- hhmts end -->
 </div>
@@ -33,7 +33,7 @@ the exact location of texts in a page, as well as
 other extra information such as font information or ruled lines.
 It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
-PDF parser that can be used for other purpoes instead of text analysis.
+PDF parser that can be used for other purposes instead of text analysis.
 <p>
 <strong>Features:</strong>
 <ul>
@@ -121,7 +121,7 @@ For example:
 $ <strong>cd /usr/lib/python2.5/site-packages</strong>
 $ <strong>tar jxf CMap.tar.bz2</strong>
 </pre></blockquote>
-<li> Do the follwoing. (this is optional, but highly recommended)<br>
+<li> Do the following. (this is optional, but highly recommended)<br>
 <blockquote><pre>
 $ <strong>python -m pdfminer.cmap</strong>
 </pre></blockquote>
@@ -140,7 +140,7 @@ PDFMiner comes with two handy tools:
 <h3>pdf2txt.py</h3>
 <p>
 <code>pdf2txt.py</code> extracts text contents from a PDF file.
-It extracts all the texts that are to be rendered programatically,
+It extracts all the texts that are to be rendered programmatically,
 It cannot recognize texts drawn as images that would require optical character recognition.
 It also extracts the corresponding locations, font names, font sizes, writing
 direction (horizontal or vertical) for each text portion.
@@ -202,7 +202,7 @@ In the figure below, two text chunks whose distance is closer than
 the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
 continuous and get grouped into one. Also, two lines whose distance is closer than
 the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
-as a text box, which is a recutangular area that contains a "cluster" of texts.
+as a text box, which is a rectangular area that contains a "cluster" of texts.
 Furthermore, it may be required to insert blank characters (spaces) as necessary
 if the distance between two words is greater than the <em>word_margin</em>
 (<em><font color="green">W</font></em>), as a blank between words might not be

@@ -2,12 +2,30 @@
 from distutils.core import setup
 from pdfminer import __version__
-setup(name='pdfminer',
-version=__version__,
-description='PDF parser and analyzer',
-license='MIT/X',
-author='Yusuke Shinyama',
-url='http://www.unixuser.org/~euske/python/pdfminer/index.html',
-packages=['pdfminer'],
-scripts=['tools/pdf2txt.py', 'tools/dumppdf.py'],
-)
+setup(
+name='pdfminer',
+version=__version__,
+description='PDF parser and analyzer',
+long_description='''PDFMiner is a suite of programs that help
+extracting and analyzing text data of PDF documents.
+Unlike other PDF-related tools, it allows to obtain
+the exact location of texts in a page, as well as
+other extra information such as font information or ruled lines.
+It includes a PDF converter that can transform PDF files
+into other text formats (such as HTML). It has an extensible
+PDF parser that can be used for other purposes instead of text analysis.''',
+keywords='pdf parser, pdf converter, text mining',
+license='MIT/X',
+author='Yusuke Shinyama',
+author_email='yusuke at cs dot nyu dot edu',
+url='http://www.unixuser.org/~euske/python/pdfminer/index.html',
+packages=['pdfminer'],
+scripts=['tools/pdf2txt.py', 'tools/dumppdf.py'],
+classifiers=[
+'Development Status :: 4 - Beta',
+'Environment :: Console',
+'Intended Audience :: Developers',
+'Intended Audience :: Science/Research',
+'License :: OSI Approved :: MIT License',
+],
+)