diff --git a/README.html b/README.html
index b3edc81..50940ea 100644
--- a/README.html
+++ b/README.html
@@ -14,7 +14,7 @@ Python PDF parser and analyzer
-Last Modified: Tue Jul 29 21:34:29 JST 2008
+Last Modified: Sat Aug 30 16:39:32 JST 2008
@@ -81,9 +81,12 @@ http://pdf2html.tabesugi.net:8080/
  • Do the following test:
     $ python -m tools.pdf2txt samples/simple1.pdf
    -<page id="0" bbox="0.000,0.000,612.000,792.000" rotate="0">
    -<text font="Helvetica" direction="1" bbox="100.000,695.032,237.352,719.032" fontsize="24.000"> Hello World </text>
    -</page>
    +<html><head><meta http-equiv="Content-Type" content="text/html; charset=ascii">
    +</head><body>
    +<div style="position:absolute; top:50px;"><a name="0">Page 0</a></div><span style="position:absolute; border: 1px solid gray; left:0px; top:50px; width:612px; height:792px;"></span>
    +<span style="position:absolute; writing-mode:lr-tb; left:100px; top:122px; font-size:24px;"> Hello World </span>
    +<div style="position:absolute; top:0px;">Page: <a href="#0">0</a></div>
    +</body></html>
     
  • Done!
@@ -91,7 +94,8 @@ $ python -m tools.pdf2txt samples/simple1.pdf

    For non-ASCII languages

    In order to handle non-ASCII languages (e.g. Japanese),
   -you need to install an additional data called CMap.
   +you need to install an additional data called CMap,
   +which is distributed from Adobe.

    Here is how:
@@ -173,7 +177,7 @@ By default, it extracts texts from all the pages.

  • sgml : SGML format.
  • tag : "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
   -Tags used here are defined in the PDF specification.
   +Tags used here are defined in the PDF specification (See §10.7 "Tagged PDF").

    -P password
@@ -241,6 +245,7 @@ no stream header is displayed for the ease of saving it to a file.

    Changes