Documentation updates.
parent: cf1e3c9973
commit: e39e39fa12
21
README.md
|
@ -10,6 +10,7 @@ It includes a PDF converter that can transform PDF files
|
||||||
into other text formats (such as HTML). It has an extensible
|
into other text formats (such as HTML). It has an extensible
|
||||||
PDF parser that can be used for other purposes than text analysis.
|
PDF parser that can be used for other purposes than text analysis.
|
||||||
|
|
||||||
|
|
||||||
Features
|
Features
|
||||||
--------
|
--------
|
||||||
|
|
||||||
|
@ -23,6 +24,7 @@ Features
|
||||||
* Tagged contents extraction.
|
* Tagged contents extraction.
|
||||||
* Automatic layout analysis.
|
* Automatic layout analysis.
|
||||||
|
|
||||||
|
|
||||||
How to Install
|
How to Install
|
||||||
--------------
|
--------------
|
||||||
|
|
||||||
|
@ -37,6 +39,7 @@ How to Install
|
||||||
|
|
||||||
$ pdf2txt.py samples/simple1.pdf
|
$ pdf2txt.py samples/simple1.pdf
|
||||||
|
|
||||||
|
|
||||||
For CJK Languages
|
For CJK Languages
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
|
@ -60,6 +63,7 @@ paste the following commands on a command line prompt:
|
||||||
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
|
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
|
||||||
python setup.py install
|
python setup.py install
|
||||||
|
|
||||||
|
|
||||||
Command Line Tools
|
Command Line Tools
|
||||||
------------------
|
------------------
|
||||||
|
|
||||||
|
@ -87,6 +91,21 @@ but it's also possible to extract some meaningful contents (e.g. images).
|
||||||
|
|
||||||
(For details, refer to the html document.)
|
(For details, refer to the html document.)
|
||||||
|
|
||||||
|
|
||||||
|
API Changes
|
||||||
|
-----------
|
||||||
|
|
||||||
|
As of November 2013, a few changes have been made to the PDFMiner API
|
||||||
|
as it existed prior to October 2013. These changes result from code
|
||||||
|
restructuring. Here is a list of the changes:
|
||||||
|
|
||||||
|
* PDFDocument class is moved to pdfdocument.py.
|
||||||
|
* PDFDocument class now takes a PDFParser object as an argument.
|
||||||
|
PDFDocument.set_parser() and PDFParser.set_document() are removed.
|
||||||
|
* PDFPage class is moved to pdfpage.py.
|
||||||
|
* process_pdf function is implemented as a class method PDFPage.get_pages.
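
The changes listed above might look as follows in calling code. This is a minimal sketch rather than the project's documented usage: the helper names `open_document` and `count_pages` are hypothetical, the `pdfminer.pdfparser` import path is assumed, and any password or initialization arguments (which may differ between releases) are left out.

```python
def open_document(fp):
    """Build a PDFDocument from an open binary file (hypothetical helper)."""
    # Local imports: pdfminer is a third-party package and may be absent.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument  # now lives in pdfdocument.py

    parser = PDFParser(fp)
    # The parser is passed to the constructor; the old set_parser() and
    # set_document() calls are no longer needed.
    return PDFDocument(parser)


def count_pages(path):
    """Count pages via PDFPage.get_pages, the replacement for process_pdf."""
    from pdfminer.pdfpage import PDFPage  # now lives in pdfpage.py

    with open(path, "rb") as fp:
        # get_pages is a class method that yields one PDFPage per page.
        return sum(1 for _ in PDFPage.get_pages(fp))
```

For example, `count_pages("samples/simple1.pdf")` would return the page count of the sample file used in the install check, assuming pdfminer is installed and the file exists.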
|
||||||
|
|
||||||
|
|
||||||
TODO
|
TODO
|
||||||
----
|
----
|
||||||
|
|
||||||
|
@ -97,6 +116,7 @@ TODO
|
||||||
* Better documentation.
|
* Better documentation.
|
||||||
* Crypt stream filter support.
|
* Crypt stream filter support.
|
||||||
|
|
||||||
|
|
||||||
Related Projects
|
Related Projects
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
|
@ -105,6 +125,7 @@ Related Projects
|
||||||
* <a href="http://www.pdfbox.org/">pdfbox</a>
|
* <a href="http://www.pdfbox.org/">pdfbox</a>
|
||||||
* <a href="http://mupdf.com/">mupdf</a>
|
* <a href="http://mupdf.com/">mupdf</a>
|
||||||
|
|
||||||
|
|
||||||
Terms and Conditions
|
Terms and Conditions
|
||||||
--------------------
|
--------------------
|
||||||
|
|
||||||
|
|
|
@ -9,7 +9,7 @@
|
||||||
|
|
||||||
<div align=right class=lastmod>
|
<div align=right class=lastmod>
|
||||||
<!-- hhmts start -->
|
<!-- hhmts start -->
|
||||||
Last Modified: Sat Oct 26 15:03:35 UTC 2013
|
Last Modified: Sun Nov 17 06:32:44 UTC 2013
|
||||||
<!-- hhmts end -->
|
<!-- hhmts end -->
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
@ -368,7 +368,18 @@ no stream header is displayed for the ease of saving it to a file.
|
||||||
|
|
||||||
<h2><a name="changes">Changes</a></h2>
|
<h2><a name="changes">Changes</a></h2>
|
||||||
<ul>
|
<ul>
|
||||||
<li> 2013/10/22: Sudden resurge of interests.
|
<li> 2013/11/13: Bugfixes and minor improvements.<br>
|
||||||
|
As of November 2013, a few changes have been made to the PDFMiner API
|
||||||
|
as it existed prior to October 2013. These changes result from code
|
||||||
|
restructuring. Here is a list of the changes:
|
||||||
|
<ul>
|
||||||
|
<li> <code>PDFDocument</code> class is moved to <code>pdfdocument.py</code>.
|
||||||
|
<li> <code>PDFDocument</code> class now takes a <code>PDFParser</code> object as an argument.
|
||||||
|
<li> <code>PDFDocument.set_parser()</code> and <code>PDFParser.set_document()</code> are removed.
|
||||||
|
<li> <code>PDFPage</code> class is moved to <code>pdfpage.py</code>.
|
||||||
|
<li> <code>process_pdf</code> function is implemented as <code>PDFPage.get_pages</code>.
|
||||||
|
</ul>
|
||||||
|
<li> 2013/10/22: Sudden resurge of interests. API changes.
|
||||||
Incorporated a lot of patches and robust handling of broken PDFs.
|
Incorporated a lot of patches and robust handling of broken PDFs.
|
||||||
<li> 2011/05/15: Speed improvements for layout analysis.
|
<li> 2011/05/15: Speed improvements for layout analysis.
|
||||||
<li> 2011/05/15: API changes. <code>LTText.get_text()</code> is added.
|
<li> 2011/05/15: API changes. <code>LTText.get_text()</code> is added.
|
||||||
|
|