Documentation updates.

pull/1/head
Yusuke Shinyama 2013-11-17 15:32:57 +09:00
parent cf1e3c9973
commit e39e39fa12
2 changed files with 34 additions and 2 deletions

@@ -10,6 +10,7 @@ It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
 PDF parser that can be used for other purposes than text analysis.
 Features
 --------
@@ -23,6 +24,7 @@ Features
 * Tagged contents extraction.
 * Automatic layout analysis.
 How to Install
 --------------
@@ -37,6 +39,7 @@ How to Install
 $ pdf2txt.py samples/simple1.pdf
 For CJK Languages
 -----------------
@@ -60,6 +63,7 @@ paste the following commands on a command line prompt:
 python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
 python setup.py install
 Command Line Tools
 ------------------
@@ -87,6 +91,21 @@ but it's also possible to extract some meaningful contents (e.g. images).
 (For details, refer to the html document.)
+API Changes
+-----------
+As of November 2013, the PDFMiner API differs in a few ways from the API
+prior to October 2013. This is the result of code restructuring. Here
+is a list of the changes:
+* PDFDocument class is moved to pdfdocument.py.
+* PDFDocument class now takes a PDFParser object as an argument.
+  PDFDocument.set_parser() and PDFParser.set_document() are removed.
+* PDFPage class is moved to pdfpage.py.
+* process_pdf function is implemented as a class method PDFPage.get_pages.
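The list above could translate into calling code roughly as follows. This is a minimal sketch and is not taken from the PDFMiner sources: the helper names open_document and count_pages are hypothetical, and it assumes that PDFParser is still importable from pdfminer.pdfparser and that PDFPage.get_pages accepts an open binary file and yields one PDFPage object per page.

```python
def open_document(fp):
    """Build a PDFDocument from an open binary file (restructured API)."""
    # Imports are kept inside the functions so that this sketch can be
    # loaded even when PDFMiner is not installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument  # moved to pdfdocument.py

    parser = PDFParser(fp)
    # The parser is passed to the constructor; the removed
    # PDFDocument.set_parser() and PDFParser.set_document() are not called.
    return PDFDocument(parser)


def count_pages(path):
    """Count the pages of the PDF at `path` with PDFPage.get_pages."""
    from pdfminer.pdfpage import PDFPage  # moved to pdfpage.py

    with open(path, 'rb') as fp:
        # get_pages is a class method, so no PDFPage instance is needed.
        return sum(1 for _ in PDFPage.get_pages(fp))
```

With PDFMiner installed, count_pages('samples/simple1.pdf') would be one way to try this against the bundled sample. Note that get_pages only iterates over the pages; turning each page into text presumably still needs an interpreter and an output device, which process_pdf used to drive itself.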
 TODO
 ----
@@ -97,6 +116,7 @@ TODO
 * Better documentation.
 * Crypt stream filter support.
 Related Projects
 ----------------
@@ -105,6 +125,7 @@ Related Projects
 * <a href="http://www.pdfbox.org/">pdfbox</a>
 * <a href="http://mupdf.com/">mupdf</a>
 Terms and Conditions
 --------------------

@@ -9,7 +9,7 @@
 <div align=right class=lastmod>
 <!-- hhmts start -->
-Last Modified: Sat Oct 26 15:03:35 UTC 2013
+Last Modified: Sun Nov 17 06:32:44 UTC 2013
 <!-- hhmts end -->
 </div>
@@ -368,7 +368,18 @@ no stream header is displayed for the ease of saving it to a file.
 <h2><a name="changes">Changes</a></h2>
 <ul>
+<li> 2013/11/13: Bugfixes and minor improvements.<br>
+As of November 2013, the PDFMiner API differs in a few ways from the API
+prior to October 2013. This is the result of code restructuring. Here
+is a list of the changes:
+<ul>
+<li> <code>PDFDocument</code> class is moved to <code>pdfdocument.py</code>.
+<li> <code>PDFDocument</code> class now takes a <code>PDFParser</code> object as an argument.
+<li> <code>PDFDocument.set_parser()</code> and <code>PDFParser.set_document()</code> are removed.
+<li> <code>PDFPage</code> class is moved to <code>pdfpage.py</code>.
+<li> <code>process_pdf</code> function is implemented as <code>PDFPage.get_pages</code>.
+</ul>
-<li> 2013/10/22: Sudden resurge of interests.
+<li> 2013/10/22: Sudden resurge of interests. API changes.
 Incorporated a lot of patches and robust handling of broken PDFs.
 <li> 2011/05/15: Speed improvements for layout analysis.
 <li> 2011/05/15: API changes. <code>LTText.get_text()</code> is added.