non-free sample files moved into a separate directory

git-svn-id: https://pdfminer.googlecode.com/svn/trunk/pdfminer@227 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
yusuke.shinyama.dummy 2010-06-13 04:35:18 +00:00
parent 3f831c8104
commit f2005bee55
28 changed files with 251 additions and 65 deletions


@ -19,7 +19,7 @@ Python PDF parser and analyzer
<div align=right class=lastmod> <div align=right class=lastmod>
<!-- hhmts start --> <!-- hhmts start -->
Last Modified: Sat May 29 11:57:59 UTC 2010 Last Modified: Sun Jun 13 04:20:47 UTC 2010
<!-- hhmts end --> <!-- hhmts end -->
</div> </div>
@ -29,7 +29,9 @@ Last Modified: Sat May 29 11:57:59 UTC 2010
<li> <a href="#install">Install</a> <li> <a href="#install">Install</a>
&nbsp; <small>(<a href="#cmap">for CJK languages</a>)</small> &nbsp; <small>(<a href="#cmap">for CJK languages</a>)</small>
<li> <a href="#usage">How to Use</a> <li> <a href="#usage">How to Use</a>
&nbsp; <small>(<a href="#pdf2txt">pdf2txt.py</a>, <a href="#dumppdf">dumppdf.py</a>, <a href="#library">use as library</a>)</small> &nbsp; <small>(<a href="#pdf2txt">pdf2txt.py</a>,
<a href="#dumppdf">dumppdf.py</a>,
<a href="programming.html">use as library</a>)</small>
<li> <a href="#techdocs">Technical Documents</a> <li> <a href="#techdocs">Technical Documents</a>
<li> <a href="#todos">TODOs</a> <li> <a href="#todos">TODOs</a>
<li> <a href="#changes">Changes</a> <li> <a href="#changes">Changes</a>
@ -375,7 +377,7 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and <li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance. <a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
<li> Better documentation. <li> Better documentation.
<li> Better text extraction / layout analysis. <li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
<li> Robust error handling. <li> Robust error handling.
<li> Crypt stream filter support. (More sample documents are needed!) <li> Crypt stream filter support. (More sample documents are needed!)
<li> CCITTFax stream filter support. <li> CCITTFax stream filter support.
@ -385,7 +387,8 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
<hr noshade> <hr noshade>
<h2>Changes</h2> <h2>Changes</h2>
<ul> <ul>
<li> 2010/04/24: Bugfixes and tiny improvements on TOC extraction. Thanks to Jose Maria. <li> 2010/06/13: Bugfixes and improvements on CMap data compression. Thanks to Jakub Wilk.
<li> 2010/04/24: Bugfixes and improvements on TOC extraction. Thanks to Jose Maria.
<li> 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes. <li> 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.
<li> 2010/03/22: Improved layout analysis. Added regression tests. <li> 2010/03/22: Improved layout analysis. Added regression tests.
<li> 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield. <li> 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.


@ -6,38 +6,44 @@ CMP=:
PYTHON=python PYTHON=python
PDF2TXT=PYTHONPATH=.. $(PYTHON) ../tools/pdf2txt.py -Dx -p1 PDF2TXT=PYTHONPATH=.. $(PYTHON) ../tools/pdf2txt.py -Dx -p1
HTMLS= \ HTMLS=$(HTMLS_FREE) $(HTMLS_NONFREE)
HTMLS_FREE= \
simple1.html \ simple1.html \
simple2.html \ simple2.html \
dmca.html \ jo.html
f1040nr.html \ HTMLS_NONFREE= \
i1040nr.html \ nonfree/dmca.html \
jo.html \ nonfree/f1040nr.html \
kampo.html \ nonfree/i1040nr.html \
naacl06-shinyama.html \ nonfree/kampo.html \
nlp2004slides.html nonfree/naacl06-shinyama.html \
nonfree/nlp2004slides.html
TEXTS= \ TEXTS=$(TEXTS_FREE) $(TEXTS_NONFREE)
TEXTS_FREE= \
simple1.txt \ simple1.txt \
simple2.txt \ simple2.txt \
dmca.txt \ jo.txt
f1040nr.txt \ TEXTS_NONFREE= \
i1040nr.txt \ nonfree/dmca.txt \
jo.txt \ nonfree/f1040nr.txt \
kampo.txt \ nonfree/i1040nr.txt \
naacl06-shinyama.txt \ nonfree/kampo.txt \
nlp2004slides.txt nonfree/naacl06-shinyama.txt \
nonfree/nlp2004slides.txt
XMLS= \ XMLS=$(XMLS_FREE) $(XMLS_NONFREE)
XMLS_FREE= \
simple1.xml \ simple1.xml \
simple2.xml \ simple2.xml \
dmca.xml \ jo.xml
f1040nr.xml \ XMLS_NONFREE= \
i1040nr.xml \ nonfree/dmca.xml \
jo.xml \ nonfree/f1040nr.xml \
kampo.xml \ nonfree/i1040nr.xml \
naacl06-shinyama.xml \ nonfree/kampo.xml \
nlp2004slides.xml nonfree/naacl06-shinyama.xml \
nonfree/nlp2004slides.xml
test: htmls texts xmls test: htmls texts xmls
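The free/non-free split introduced in this Makefile can be mirrored in a short Python sketch. Only the file stems and the `nonfree/` prefix come from the diff above; the helper name `sample_targets` and the idea of generating the lists programmatically are assumptions made for illustration and are not part of the commit.

```python
# Hypothetical helper (not part of pdfminer): rebuilds the target lists
# that the Makefile spells out as the *_FREE and *_NONFREE variables.
FREE_STEMS = ["simple1", "simple2", "jo"]
NONFREE_STEMS = [
    "dmca", "f1040nr", "i1040nr",
    "kampo", "naacl06-shinyama", "nlp2004slides",
]


def sample_targets(ext, include_nonfree=True):
    """Return the output file names for one extension, e.g. 'html'."""
    # Free samples stay in the top-level samples directory.
    targets = [f"{stem}.{ext}" for stem in FREE_STEMS]
    if include_nonfree:
        # Non-free samples are addressed through the nonfree/ prefix.
        targets += [f"nonfree/{stem}.{ext}" for stem in NONFREE_STEMS]
    return targets


if __name__ == "__main__":
    print(sample_targets("txt", include_nonfree=False))
```

Passing `include_nonfree=False` corresponds to building only the `*_FREE` lists, which could be useful where the `nonfree/` directory is not shipped.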


@ -1,44 +1,48 @@
This directory contains sample PDF files. This directory contains sample PDF files.
The files in the nonfree/ subdirectory can be distributed freely
but do not come with explicit licensing terms or source files.
Here are the credits of the original files: Here are the credits of the original files:
dmca.pdf: simple1.pdf:
U.S. Copyright Office (Originally taken from PDF Specification 1.7,
The Digital Millennium Copyright Act Appendix G. "Simple Text String Example" and modified)
http://www.copyright.gov/legislation/dmca.pdf
f1040nr.pdf: simple2.pdf:
U.S. Department of the Treasury Internal Revenue Service (Originally taken from PDF Specification 1.7,
Form 1040-NR, U.S. Nonresident Alien Income Tax Return Appendix G. "Simple Graphics Example" and modified)
http://www.irs.gov/pub/irs-pdf/f1040nr.pdf
i1040nr.pdf:
U.S. Department of the Treasury Internal Revenue Service
Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
http://www.irs.gov/pub/irs-pdf/i1040nr.pdf
jo.pdf: jo.pdf:
Kenji Miyazawa (1896-1933, copyright expired) Kenji Miyazawa (1896-1933, copyright expired)
Preface of "Haru to Shura" Preface of "Haru to Shura"
(File generated by LaTeX and dvi2pdfm) (File generated by LaTeX and dvi2pdfm)
kampo.pdf: --
nonfree/dmca.pdf:
U.S. Copyright Office
The Digital Millennium Copyright Act
http://www.copyright.gov/legislation/dmca.pdf
nonfree/f1040nr.pdf:
U.S. Department of the Treasury Internal Revenue Service
Form 1040-NR, U.S. Nonresident Alien Income Tax Return
http://www.irs.gov/pub/irs-pdf/f1040nr.pdf
nonfree/i1040nr.pdf:
U.S. Department of the Treasury Internal Revenue Service
Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
http://www.irs.gov/pub/irs-pdf/i1040nr.pdf
nonfree/kampo.pdf:
National Printing Bureau of Japan National Printing Bureau of Japan
Official Gazette, Vol. 4817 Official Gazette, Vol. 4817
http://kanpou.npb.go.jp/ http://kanpou.npb.go.jp/
nlp2004slides.pdf: nonfree/nlp2004slides.pdf:
Yusuke Shinyama and Satoshi Sekine Yusuke Shinyama and Satoshi Sekine
"Named Entity Discovery from Comparable News Corpora" "Named Entity Discovery from Comparable News Corpora"
naacl06-shinyama.pdf: nonfree/naacl06-shinyama.pdf:
Yusuke Shinyama and Satoshi Sekine Yusuke Shinyama and Satoshi Sekine
"Preemptive Information Extraction using Unrestricted Relation Discovery" "Preemptive Information Extraction using Unrestricted Relation Discovery"
simple1.pdf:
(Originally taken from PDF Specification,
Appendix G. "Simple Text String Example" and modified)
simple2.pdf:
(Originally taken from PDF Specification,
Appendix G. "Simple Graphics Example" and modified)

samples/jo.tex (new file, 173 lines)

@ -0,0 +1,173 @@
\documentclass[landscape,twocolumn]{tarticle}
\setlength{\hoffset}{-0.6in}
\setlength{\voffset}{-0.7in}
\setlength{\textwidth}{18cm}
%\setlength{\textheight}{9in}
%\setlength{\oddsidemargin}{-0.5in}
%\setlength{\evensidemargin}{-0.5in}
\setlength{\topmargin}{0in}
\setlength{\columnsep}{0.4in}
\pagestyle{empty}
\makeatletter
\def\kanjistrut{\vrule \@height0.88zw \@depth0.12zw \@width\z@}
\newdimen\mytempdima
\newcommand{\ruby}[2]{%
\leavevmode
\setbox0=\hbox{#1}%
\mytempdima=\f@size\p@
\setbox1=\hbox{\fontsize{0.5\mytempdima}{0pt}\selectfont #2}%
\ifdim\wd0>\wd1 \dimen0=\wd0 \else \dimen0=\wd1 \fi
\hbox{%
\kanjiskip=0pt plus 2fil
\xkanjiskip=0pt plus 2fil
\vbox{%
\hbox to \dimen0{%
\fontsize{0.5\mytempdima}{0pt}\selectfont \kanjistrut\hfil#2\hfil}%
\nointerlineskip
\hbox to \dimen0{\kanjistrut\hfil#1\hfil}}}}
\makeatother
\begin{document}
  序
\vspace{0.4in}
\begin{flushleft}
わたくしといふ現象は
假定された有機交流電燈の
ひとつの青い照明です
(あらゆる透明な幽霊の複合体)
風景やみんなといっしょに
せはしくせはしく明滅しながら
いかにもたしかにともりつづける
因果交流電燈の
ひとつの青い照明です
(ひかりはたもち、その電燈は失はれ)
  
これらは二十二箇月の
過去とかんずる方角から
紙と鑛質インクをつらね
(すべてわたくしと明滅し
 みんなが同時に感ずるもの)
ここまでたもちつゞけられた
かげとひかりのひとくさりづつ
そのとほりの心象スケッチです
  
これらについて人や銀河や修羅や海膽は
宇宙塵をたべ、または空気や塩水を呼吸しながら
それぞれ新鮮な本体論もかんがへませうが
それらも畢竟こゝろのひとつの風物です
たゞたしかに記録されたこれらのけしきは
記録されたそのとほりのこのけしきで
それが虚無ならば虚無自身がこのとほりで
ある程度まではみんなに共通いたします
(すべてがわたくしの中のみんなであるやうに
 みんなのおのおののなかのすべてですから)
\newpage
 
\vspace{1.0in}
けれどもこれら新世代沖積世の
巨大に明るい時間の集積のなかで
正しくうつされた筈のこれらのことばが
わづかその一點にも均しい明暗のうちに
   (あるひは修羅の十億年)
すでにはやくもその組立や質を變じ
しかもわたくしも印刷者も
それを変らないとして感ずることは
傾向としてはあり得ます
けだしわれわれがわれわれの感官や
風景や人物をかんずるやうに
そしてたゞ共通に感ずるだけであるやうに
記録や歴史、あるひは地史といふものも
それのいろいろの論料といっしょに
(因果の時空的制約のもとに)
われわれがかんじてゐるのに過ぎません
おそらくこれから二千年もたったころは
それ相當のちがった地質學が流用され
相當した證據もまた次次過去から現出し
みんなは二千年ぐらゐ前には
青ぞらいっぱいの無色な孔雀が居たとおもひ
新進の大學士たちは気圏のいちばんの上層
きらびやかな氷窒素のあたりから
すてきな化石を發堀したり
あるひは白堊紀砂岩の層面に
透明な人類の巨大な足跡を
発見するかもしれません
  
すべてこれらの命題は
心象や時間それ自身の性質として
第四次延長のなかで主張されます
  
\end{flushleft}
\begin{flushright}
大正十三年一月廿日  宮澤賢治
\end{flushright}
\end{document}