non-free sample files moved into a separate directory
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@227 1aa58f4a-7d42-0410-adbc-911cccaed67c
parent
3f831c8104
commit
f2005bee55
|
@ -19,7 +19,7 @@ Python PDF parser and analyzer
|
||||||
|
|
||||||
<div align=right class=lastmod>
|
<div align=right class=lastmod>
|
||||||
<!-- hhmts start -->
|
<!-- hhmts start -->
|
||||||
Last Modified: Sat May 29 11:57:59 UTC 2010
|
Last Modified: Sun Jun 13 04:20:47 UTC 2010
|
||||||
<!-- hhmts end -->
|
<!-- hhmts end -->
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
@ -29,7 +29,9 @@ Last Modified: Sat May 29 11:57:59 UTC 2010
|
||||||
<li> <a href="#install">Install</a>
|
<li> <a href="#install">Install</a>
|
||||||
<small>(<a href="#cmap">for CJK languages</a>)</small>
|
<small>(<a href="#cmap">for CJK languages</a>)</small>
|
||||||
<li> <a href="#usage">How to Use</a>
|
<li> <a href="#usage">How to Use</a>
|
||||||
<small>(<a href="#pdf2txt">pdf2txt.py</a>, <a href="#dumppdf">dumppdf.py</a>, <a href="#library">use as library</a>)</small>
|
<small>(<a href="#pdf2txt">pdf2txt.py</a>,
|
||||||
|
<a href="#dumppdf">dumppdf.py</a>,
|
||||||
|
<a href="programming.html">use as library</a>)</small>
|
||||||
<li> <a href="#techdocs">Technical Documents</a>
|
<li> <a href="#techdocs">Technical Documents</a>
|
||||||
<li> <a href="#todos">TODOs</a>
|
<li> <a href="#todos">TODOs</a>
|
||||||
<li> <a href="#changes">Changes</a>
|
<li> <a href="#changes">Changes</a>
|
||||||
|
@ -375,7 +377,7 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
|
||||||
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
|
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
|
||||||
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
|
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
|
||||||
<li> Better documentation.
|
<li> Better documentation.
|
||||||
<li> Better text extraction / layout analysis.
|
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
|
||||||
<li> Robust error handling.
|
<li> Robust error handling.
|
||||||
<li> Crypt stream filter support. (More sample documents are needed!)
|
<li> Crypt stream filter support. (More sample documents are needed!)
|
||||||
<li> CCITTFax stream filter support.
|
<li> CCITTFax stream filter support.
|
||||||
|
@ -385,7 +387,8 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
|
||||||
<hr noshade>
|
<hr noshade>
|
||||||
<h2>Changes</h2>
|
<h2>Changes</h2>
|
||||||
<ul>
|
<ul>
|
||||||
<li> 2010/04/24: Bugfixes and tiny improvements on TOC extraction. Thanks to Jose Maria.
|
<li> 2010/06/13: Bugfixes and improvements on CMap data compression. Thanks to Jakub Wilk.
|
||||||
|
<li> 2010/04/24: Bugfixes and improvements on TOC extraction. Thanks to Jose Maria.
|
||||||
<li> 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.
|
<li> 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.
|
||||||
<li> 2010/03/22: Improved layout analysis. Added regression tests.
|
<li> 2010/03/22: Improved layout analysis. Added regression tests.
|
||||||
<li> 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.
|
<li> 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.
|
||||||
|
|
|
@ -6,38 +6,44 @@ CMP=:
|
||||||
PYTHON=python
|
PYTHON=python
|
||||||
PDF2TXT=PYTHONPATH=.. $(PYTHON) ../tools/pdf2txt.py -Dx -p1
|
PDF2TXT=PYTHONPATH=.. $(PYTHON) ../tools/pdf2txt.py -Dx -p1
|
||||||
|
|
||||||
HTMLS= \
|
HTMLS=$(HTMLS_FREE) $(HTMLS_NONFREE)
|
||||||
|
HTMLS_FREE= \
|
||||||
simple1.html \
|
simple1.html \
|
||||||
simple2.html \
|
simple2.html \
|
||||||
dmca.html \
|
jo.html
|
||||||
f1040nr.html \
|
HTMLS_NONFREE= \
|
||||||
i1040nr.html \
|
nonfree/dmca.html \
|
||||||
jo.html \
|
nonfree/f1040nr.html \
|
||||||
kampo.html \
|
nonfree/i1040nr.html \
|
||||||
naacl06-shinyama.html \
|
nonfree/kampo.html \
|
||||||
nlp2004slides.html
|
nonfree/naacl06-shinyama.html \
|
||||||
|
nonfree/nlp2004slides.html
|
||||||
|
|
||||||
TEXTS= \
|
TEXTS=$(TEXTS_FREE) $(TEXTS_NONFREE)
|
||||||
|
TEXTS_FREE= \
|
||||||
simple1.txt \
|
simple1.txt \
|
||||||
simple2.txt \
|
simple2.txt \
|
||||||
dmca.txt \
|
jo.txt
|
||||||
f1040nr.txt \
|
TEXTS_NONFREE= \
|
||||||
i1040nr.txt \
|
nonfree/dmca.txt \
|
||||||
jo.txt \
|
nonfree/f1040nr.txt \
|
||||||
kampo.txt \
|
nonfree/i1040nr.txt \
|
||||||
naacl06-shinyama.txt \
|
nonfree/kampo.txt \
|
||||||
nlp2004slides.txt
|
nonfree/naacl06-shinyama.txt \
|
||||||
|
nonfree/nlp2004slides.txt
|
||||||
|
|
||||||
XMLS= \
|
XMLS=$(XMLS_FREE) $(XMLS_NONFREE)
|
||||||
|
XMLS_FREE= \
|
||||||
simple1.xml \
|
simple1.xml \
|
||||||
simple2.xml \
|
simple2.xml \
|
||||||
dmca.xml \
|
jo.xml
|
||||||
f1040nr.xml \
|
XMLS_NONFREE= \
|
||||||
i1040nr.xml \
|
nonfree/dmca.xml \
|
||||||
jo.xml \
|
nonfree/f1040nr.xml \
|
||||||
kampo.xml \
|
nonfree/i1040nr.xml \
|
||||||
naacl06-shinyama.xml \
|
nonfree/kampo.xml \
|
||||||
nlp2004slides.xml
|
nonfree/naacl06-shinyama.xml \
|
||||||
|
nonfree/nlp2004slides.xml
|
||||||
|
|
||||||
test: htmls texts xmls
|
test: htmls texts xmls
|
||||||
|
|
||||||
|
|
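The Makefile hunk above splits each target list into a free part and a non-free part and then joins them again. A minimal Python sketch of the same composition follows. It is illustrative only: the names `FREE`, `NONFREE`, and `targets` are hypothetical and not part of PDFMiner, and the file stems are copied from the hunk.

```python
# Hypothetical mirror of the Makefile variables; not part of PDFMiner itself.
# Stems are copied from the HTMLS_FREE / HTMLS_NONFREE lists in the hunk above.
FREE = ["simple1", "simple2", "jo"]
NONFREE = [
    "dmca",
    "f1040nr",
    "i1040nr",
    "kampo",
    "naacl06-shinyama",
    "nlp2004slides",
]


def targets(ext, include_nonfree=True):
    """Return the output file names for one extension, free files first."""
    names = [f"{stem}.{ext}" for stem in FREE]
    if include_nonfree:
        # Non-free samples live under the nonfree/ subdirectory.
        names += [f"nonfree/{stem}.{ext}" for stem in NONFREE]
    return names


# Equivalent of HTMLS=$(HTMLS_FREE) $(HTMLS_NONFREE), and likewise for the others.
HTMLS = targets("html")
TEXTS = targets("txt")
XMLS = targets("xml")
```

Calling `targets("txt", include_nonfree=False)` yields only the free samples, which is the kind of subset the split presumably makes possible for a distribution that omits the `nonfree/` directory.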
|
@ -1,44 +1,48 @@
|
||||||
This directory contains sample PDF files.
|
This directory contains sample PDF files.
|
||||||
|
|
||||||
|
The files in the nonfree/ subdirectory can be distributed freely
|
||||||
|
but do not come with explicit licensing terms or source files.
|
||||||
|
|
||||||
Here are the credits of the original files:
|
Here are the credits of the original files:
|
||||||
|
|
||||||
dmca.pdf:
|
|
||||||
U.S. Copyright Office
|
|
||||||
The Digital Millennium Copyright Act
|
|
||||||
http://www.copyright.gov/legislation/dmca.pdf
|
|
||||||
|
|
||||||
f1040nr.pdf:
|
|
||||||
U.S. Department of the Treasury Internal Revenue Service
|
|
||||||
Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
|
||||||
http://www.irs.gov/pub/irs-pdf/f1040nr.pdf
|
|
||||||
|
|
||||||
i1040nr.pdf:
|
|
||||||
U.S. Department of the Treasury Internal Revenue Service
|
|
||||||
Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
|
||||||
http://www.irs.gov/pub/irs-pdf/i1040nr.pdf
|
|
||||||
|
|
||||||
jo.pdf:
|
|
||||||
Kenji Miyazawa (1896-1933, copyright expired)
|
|
||||||
Preface of "Haru to Shura"
|
|
||||||
(File generated by LaTeX and dvi2pdfm)
|
|
||||||
|
|
||||||
kampo.pdf:
|
|
||||||
National Printing Bureau of Japan
|
|
||||||
Official Gazette, Vol. 4817
|
|
||||||
http://kanpou.npb.go.jp/
|
|
||||||
|
|
||||||
nlp2004slides.pdf:
|
|
||||||
Yusuke Shinyama and Satoshi Sekine
|
|
||||||
"Named Entity Discovery from Comparable News Corpora"
|
|
||||||
|
|
||||||
naacl06-shinyama.pdf:
|
|
||||||
Yusuke Shinyama and Satoshi Sekine
|
|
||||||
"Preemptive Information Extraction using Unrestricted Relation Discovery"
|
|
||||||
|
|
||||||
simple1.pdf:
|
simple1.pdf:
|
||||||
(Originally taken from PDF Specification,
|
(Originally taken from PDF Specification 1.7,
|
||||||
Appendix G. "Simple Text String Example" and modified)
|
Appendix G. "Simple Text String Example" and modified)
|
||||||
|
|
||||||
simple2.pdf:
|
simple2.pdf:
|
||||||
(Originally taken from PDF Specification,
|
(Originally taken from PDF Specification 1.7,
|
||||||
Appendix G. "Simple Graphics Example" and modified)
|
Appendix G. "Simple Graphics Example" and modified)
|
||||||
|
|
||||||
|
jo.pdf:
|
||||||
|
Kenji Miyazawa (1896-1933, copyright expired)
|
||||||
|
Preface of "Haru to Shura"
|
||||||
|
(File generated by LaTeX and dvi2pdfm)
|
||||||
|
|
||||||
|
--
|
||||||
|
nonfree/dmca.pdf:
|
||||||
|
U.S. Copyright Office
|
||||||
|
The Digital Millennium Copyright Act
|
||||||
|
http://www.copyright.gov/legislation/dmca.pdf
|
||||||
|
|
||||||
|
nonfree/f1040nr.pdf:
|
||||||
|
U.S. Department of the Treasury Internal Revenue Service
|
||||||
|
Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
||||||
|
http://www.irs.gov/pub/irs-pdf/f1040nr.pdf
|
||||||
|
|
||||||
|
nonfree/i1040nr.pdf:
|
||||||
|
U.S. Department of the Treasury Internal Revenue Service
|
||||||
|
Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
||||||
|
http://www.irs.gov/pub/irs-pdf/i1040nr.pdf
|
||||||
|
|
||||||
|
nonfree/kampo.pdf:
|
||||||
|
National Printing Bureau of Japan
|
||||||
|
Official Gazette, Vol. 4817
|
||||||
|
http://kanpou.npb.go.jp/
|
||||||
|
|
||||||
|
nonfree/nlp2004slides.pdf:
|
||||||
|
Yusuke Shinyama and Satoshi Sekine
|
||||||
|
"Named Entity Discovery from Comparable News Corpora"
|
||||||
|
|
||||||
|
nonfree/naacl06-shinyama.pdf:
|
||||||
|
Yusuke Shinyama and Satoshi Sekine
|
||||||
|
"Preemptive Information Extraction using Unrestricted Relation Discovery"
|
||||||
|
|
|
@ -0,0 +1,173 @@
|
||||||
|
\documentclass[landscape,twocolumn]{tarticle}
|
||||||
|
|
||||||
|
\setlength{\hoffset}{-0.6in}
|
||||||
|
\setlength{\voffset}{-0.7in}
|
||||||
|
|
||||||
|
\setlength{\textwidth}{18cm}
|
||||||
|
%\setlength{\textheight}{9in}
|
||||||
|
|
||||||
|
%\setlength{\oddsidemargin}{-0.5in}
|
||||||
|
%\setlength{\evensidemargin}{-0.5in}
|
||||||
|
\setlength{\topmargin}{0in}
|
||||||
|
\setlength{\columnsep}{0.4in}
|
||||||
|
|
||||||
|
\pagestyle{empty}
|
||||||
|
\makeatletter
|
||||||
|
\def\kanjistrut{\vrule \@height0.88zw \@depth0.12zw \@width\z@}
|
||||||
|
\newdimen\mytempdima
|
||||||
|
\newcommand{\ruby}[2]{%
|
||||||
|
\leavevmode
|
||||||
|
\setbox0=\hbox{#1}%
|
||||||
|
\mytempdima=\f@size\p@
|
||||||
|
\setbox1=\hbox{\fontsize{0.5\mytempdima}{0pt}\selectfont #2}%
|
||||||
|
\ifdim\wd0>\wd1 \dimen0=\wd0 \else \dimen0=\wd1 \fi
|
||||||
|
\hbox{%
|
||||||
|
\kanjiskip=0pt plus 2fil
|
||||||
|
\xkanjiskip=0pt plus 2fil
|
||||||
|
\vbox{%
|
||||||
|
\hbox to \dimen0{%
|
||||||
|
\fontsize{0.5\mytempdima}{0pt}\selectfont \kanjistrut\hfil#2\hfil}%
|
||||||
|
\nointerlineskip
|
||||||
|
\hbox to \dimen0{\kanjistrut\hfil#1\hfil}}}}
|
||||||
|
\makeatother
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
|
||||||
|
序
|
||||||
|
\vspace{0.4in}
|
||||||
|
|
||||||
|
\begin{flushleft}
|
||||||
|
わたくしといふ現象は
|
||||||
|
|
||||||
|
假定された有機交流電燈の
|
||||||
|
|
||||||
|
ひとつの青い照明です
|
||||||
|
|
||||||
|
(あらゆる透明な幽霊の複合体)
|
||||||
|
|
||||||
|
風景やみんなといっしょに
|
||||||
|
|
||||||
|
せはしくせはしく明滅しながら
|
||||||
|
|
||||||
|
いかにもたしかにともりつづける
|
||||||
|
|
||||||
|
因果交流電燈の
|
||||||
|
|
||||||
|
ひとつの青い照明です
|
||||||
|
|
||||||
|
(ひかりはたもち、その電燈は失はれ)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
これらは二十二箇月の
|
||||||
|
|
||||||
|
過去とかんずる方角から
|
||||||
|
|
||||||
|
紙と鑛質インクをつらね
|
||||||
|
|
||||||
|
(すべてわたくしと明滅し
|
||||||
|
|
||||||
|
みんなが同時に感ずるもの)
|
||||||
|
|
||||||
|
ここまでたもちつゞけられた
|
||||||
|
|
||||||
|
かげとひかりのひとくさりづつ
|
||||||
|
|
||||||
|
そのとほりの心象スケッチです
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
これらについて人や銀河や修羅や海膽は
|
||||||
|
|
||||||
|
宇宙塵をたべ、または空気や塩水を呼吸しながら
|
||||||
|
|
||||||
|
それぞれ新鮮な本体論もかんがへませうが
|
||||||
|
|
||||||
|
それらも畢竟こゝろのひとつの風物です
|
||||||
|
|
||||||
|
たゞたしかに記録されたこれらのけしきは
|
||||||
|
|
||||||
|
記録されたそのとほりのこのけしきで
|
||||||
|
|
||||||
|
それが虚無ならば虚無自身がこのとほりで
|
||||||
|
|
||||||
|
ある程度まではみんなに共通いたします
|
||||||
|
|
||||||
|
(すべてがわたくしの中のみんなであるやうに
|
||||||
|
|
||||||
|
みんなのおのおののなかのすべてですから)
|
||||||
|
\newpage
|
||||||
|
|
||||||
|
|
||||||
|
\vspace{1.0in}
|
||||||
|
|
||||||
|
けれどもこれら新世代沖積世の
|
||||||
|
|
||||||
|
巨大に明るい時間の集積のなかで
|
||||||
|
|
||||||
|
正しくうつされた筈のこれらのことばが
|
||||||
|
|
||||||
|
わづかその一點にも均しい明暗のうちに
|
||||||
|
|
||||||
|
(あるひは修羅の十億年)
|
||||||
|
|
||||||
|
すでにはやくもその組立や質を變じ
|
||||||
|
|
||||||
|
しかもわたくしも印刷者も
|
||||||
|
|
||||||
|
それを変らないとして感ずることは
|
||||||
|
|
||||||
|
傾向としてはあり得ます
|
||||||
|
|
||||||
|
けだしわれわれがわれわれの感官や
|
||||||
|
|
||||||
|
風景や人物をかんずるやうに
|
||||||
|
|
||||||
|
そしてたゞ共通に感ずるだけであるやうに
|
||||||
|
|
||||||
|
記録や歴史、あるひは地史といふものも
|
||||||
|
|
||||||
|
それのいろいろの論料といっしょに
|
||||||
|
|
||||||
|
(因果の時空的制約のもとに)
|
||||||
|
|
||||||
|
われわれがかんじてゐるのに過ぎません
|
||||||
|
|
||||||
|
おそらくこれから二千年もたったころは
|
||||||
|
|
||||||
|
それ相當のちがった地質學が流用され
|
||||||
|
|
||||||
|
相當した證據もまた次次過去から現出し
|
||||||
|
|
||||||
|
みんなは二千年ぐらゐ前には
|
||||||
|
|
||||||
|
青ぞらいっぱいの無色な孔雀が居たとおもひ
|
||||||
|
|
||||||
|
新進の大學士たちは気圏のいちばんの上層
|
||||||
|
|
||||||
|
きらびやかな氷窒素のあたりから
|
||||||
|
|
||||||
|
すてきな化石を發堀したり
|
||||||
|
|
||||||
|
あるひは白堊紀砂岩の層面に
|
||||||
|
|
||||||
|
透明な人類の巨大な足跡を
|
||||||
|
|
||||||
|
発見するかもしれません
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
すべてこれらの命題は
|
||||||
|
|
||||||
|
心象や時間それ自身の性質として
|
||||||
|
|
||||||
|
第四次延長のなかで主張されます
|
||||||
|
|
||||||
|
|
||||||
|
\end{flushleft}
|
||||||
|
|
||||||
|
\begin{flushright}
|
||||||
|
大正十三年一月廿日 宮澤賢治
|
||||||
|
\end{flushright}
|
||||||
|
|
||||||
|
\end{document}
|