non-free sample files moved into a separate directory
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@227 1aa58f4a-7d42-0410-adbc-911cccaed67c
parent 3f831c8104
commit f2005bee55
|
@ -19,7 +19,7 @@ Python PDF parser and analyzer
|
|||
|
||||
<div align=right class=lastmod>
|
||||
<!-- hhmts start -->
|
||||
Last Modified: Sat May 29 11:57:59 UTC 2010
|
||||
Last Modified: Sun Jun 13 04:20:47 UTC 2010
|
||||
<!-- hhmts end -->
|
||||
</div>
|
||||
|
||||
|
@ -29,7 +29,9 @@ Last Modified: Sat May 29 11:57:59 UTC 2010
|
|||
<li> <a href="#install">Install</a>
|
||||
<small>(<a href="#cmap">for CJK languages</a>)</small>
|
||||
<li> <a href="#usage">How to Use</a>
|
||||
<small>(<a href="#pdf2txt">pdf2txt.py</a>, <a href="#dumppdf">dumppdf.py</a>, <a href="#library">use as library</a>)</small>
|
||||
<small>(<a href="#pdf2txt">pdf2txt.py</a>,
|
||||
<a href="#dumppdf">dumppdf.py</a>,
|
||||
<a href="programming.html">use as library</a>)</small>
|
||||
<li> <a href="#techdocs">Technical Documents</a>
|
||||
<li> <a href="#todos">TODOs</a>
|
||||
<li> <a href="#changes">Changes</a>
|
||||
|
@ -375,7 +377,7 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
|
|||
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
|
||||
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
|
||||
<li> Better documentation.
|
||||
<li> Better text extraction / layout analysis.
|
||||
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
|
||||
<li> Robust error handling.
|
||||
<li> Crypt stream filter support. (More sample documents are needed!)
|
||||
<li> CCITTFax stream filter support.
|
||||
|
@ -385,7 +387,8 @@ For details, see the <a href="programming.html">Programming with PDFMiner</a> pa
|
|||
<hr noshade>
|
||||
<h2>Changes</h2>
|
||||
<ul>
|
||||
<li> 2010/04/24: Bugfixes and tiny improvements on TOC extraction. Thanks to Jose Maria.
|
||||
<li> 2010/06/13: Bugfixes and improvements on CMap data compression. Thanks to Jakub Wilk.
|
||||
<li> 2010/04/24: Bugfixes and improvements on TOC extraction. Thanks to Jose Maria.
|
||||
<li> 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.
|
||||
<li> 2010/03/22: Improved layout analysis. Added regression tests.
|
||||
<li> 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.
|
||||
|
|
|
@ -6,38 +6,44 @@ CMP=:
|
|||
PYTHON=python
|
||||
PDF2TXT=PYTHONPATH=.. $(PYTHON) ../tools/pdf2txt.py -Dx -p1
|
||||
|
||||
HTMLS= \
|
||||
HTMLS=$(HTMLS_FREE) $(HTMLS_NONFREE)
|
||||
HTMLS_FREE= \
|
||||
simple1.html \
|
||||
simple2.html \
|
||||
dmca.html \
|
||||
f1040nr.html \
|
||||
i1040nr.html \
|
||||
jo.html \
|
||||
kampo.html \
|
||||
naacl06-shinyama.html \
|
||||
nlp2004slides.html
|
||||
jo.html
|
||||
HTMLS_NONFREE= \
|
||||
nonfree/dmca.html \
|
||||
nonfree/f1040nr.html \
|
||||
nonfree/i1040nr.html \
|
||||
nonfree/kampo.html \
|
||||
nonfree/naacl06-shinyama.html \
|
||||
nonfree/nlp2004slides.html
|
||||
|
||||
TEXTS= \
|
||||
TEXTS=$(TEXTS_FREE) $(TEXTS_NONFREE)
|
||||
TEXTS_FREE= \
|
||||
simple1.txt \
|
||||
simple2.txt \
|
||||
dmca.txt \
|
||||
f1040nr.txt \
|
||||
i1040nr.txt \
|
||||
jo.txt \
|
||||
kampo.txt \
|
||||
naacl06-shinyama.txt \
|
||||
nlp2004slides.txt
|
||||
jo.txt
|
||||
TEXTS_NONFREE= \
|
||||
nonfree/dmca.txt \
|
||||
nonfree/f1040nr.txt \
|
||||
nonfree/i1040nr.txt \
|
||||
nonfree/kampo.txt \
|
||||
nonfree/naacl06-shinyama.txt \
|
||||
nonfree/nlp2004slides.txt
|
||||
|
||||
XMLS= \
|
||||
XMLS=$(XMLS_FREE) $(XMLS_NONFREE)
|
||||
XMLS_FREE= \
|
||||
simple1.xml \
|
||||
simple2.xml \
|
||||
dmca.xml \
|
||||
f1040nr.xml \
|
||||
i1040nr.xml \
|
||||
jo.xml \
|
||||
kampo.xml \
|
||||
naacl06-shinyama.xml \
|
||||
nlp2004slides.xml
|
||||
jo.xml
|
||||
XMLS_NONFREE= \
|
||||
nonfree/dmca.xml \
|
||||
nonfree/f1040nr.xml \
|
||||
nonfree/i1040nr.xml \
|
||||
nonfree/kampo.xml \
|
||||
nonfree/naacl06-shinyama.xml \
|
||||
nonfree/nlp2004slides.xml
|
||||
|
||||
test: htmls texts xmls
|
||||
|
||||
|
|
|
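The Makefile hunk above splits each target list into a free part and a nonfree part, then concatenates the two (for example `HTMLS=$(HTMLS_FREE) $(HTMLS_NONFREE)`). A minimal Python sketch of that composition follows. The names `FREE`, `NONFREE`, and `targets` are hypothetical and used only for illustration; they do not appear in the commit.

```python
import posixpath

# Sample basenames as listed in the Makefile hunk above.
FREE = ["simple1", "simple2", "jo"]
NONFREE = ["dmca", "f1040nr", "i1040nr", "kampo",
           "naacl06-shinyama", "nlp2004slides"]


def targets(ext, nonfree_dir="nonfree"):
    """Return free targets followed by nonfree targets for one extension.

    Hypothetical helper: mirrors X=$(X_FREE) $(X_NONFREE) in the Makefile.
    """
    free = [f"{name}.{ext}" for name in FREE]
    nonfree = [posixpath.join(nonfree_dir, f"{name}.{ext}") for name in NONFREE]
    return free + nonfree


if __name__ == "__main__":
    for ext in ("html", "txt", "xml"):
        print(ext, targets(ext))
```

Keeping the two lists separate means the nonfree samples can be dropped or relocated by changing one list (or one directory name) without touching the free targets.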
@ -1,44 +1,48 @@
|
|||
This directory contains sample PDF files.
|
||||
|
||||
The files in the nonfree/ subdirectory can be distributed freely
|
||||
but do not come with explicit licensing terms or source files.
|
||||
|
||||
Here are the credits of the original files:
|
||||
|
||||
dmca.pdf:
|
||||
U.S. Copyright Office
|
||||
The Digital Millennium Copyright Act
|
||||
http://www.copyright.gov/legislation/dmca.pdf
|
||||
|
||||
f1040nr.pdf:
|
||||
U.S. Department of the Treasury Internal Revenue Service
|
||||
Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
||||
http://www.irs.gov/pub/irs-pdf/f1040nr.pdf
|
||||
|
||||
i1040nr.pdf:
|
||||
U.S. Department of the Treasury Internal Revenue Service
|
||||
Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
||||
http://www.irs.gov/pub/irs-pdf/i1040nr.pdf
|
||||
|
||||
jo.pdf:
|
||||
Kenji Miyazawa (1896-1933, copyright expired)
|
||||
Preface of "Haru to Shura"
|
||||
(File generated by LaTeX and dvi2pdfm)
|
||||
|
||||
kampo.pdf:
|
||||
National Printing Bureau of Japan
|
||||
Official Gazette, Vol. 4817
|
||||
http://kanpou.npb.go.jp/
|
||||
|
||||
nlp2004slides.pdf:
|
||||
Yusuke Shinyama and Satoshi Sekine
|
||||
"Named Entity Discovery from Comparable News Corpora"
|
||||
|
||||
naacl06-shinyama.pdf:
|
||||
Yusuke Shinyama and Satoshi Sekine
|
||||
"Preemptive Information Extraction using Unrestricted Relation Discovery"
|
||||
|
||||
simple1.pdf:
|
||||
(Originally taken from PDF Specification,
|
||||
Appendix G. "Simple Text String Example" and modified)
|
||||
(Originally taken from PDF Specification 1.7,
|
||||
Appendix G. "Simple Text String Example" and modified)
|
||||
|
||||
simple2.pdf:
|
||||
(Originally taken from PDF Specification,
|
||||
Appendix G. "Simple Graphics Example" and modified)
|
||||
(Originally taken from PDF Specification 1.7,
|
||||
Appendix G. "Simple Graphics Example" and modified)
|
||||
|
||||
jo.pdf:
|
||||
Kenji Miyazawa (1896-1933, copyright expired)
|
||||
Preface of "Haru to Shura"
|
||||
(File generated by LaTeX and dvi2pdfm)
|
||||
|
||||
--
|
||||
nonfree/dmca.pdf:
|
||||
U.S. Copyright Office
|
||||
The Digital Millennium Copyright Act
|
||||
http://www.copyright.gov/legislation/dmca.pdf
|
||||
|
||||
nonfree/f1040nr.pdf:
|
||||
U.S. Department of the Treasury Internal Revenue Service
|
||||
Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
||||
http://www.irs.gov/pub/irs-pdf/f1040nr.pdf
|
||||
|
||||
nonfree/i1040nr.pdf:
|
||||
U.S. Department of the Treasury Internal Revenue Service
|
||||
Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
|
||||
http://www.irs.gov/pub/irs-pdf/i1040nr.pdf
|
||||
|
||||
nonfree/kampo.pdf:
|
||||
National Printing Bureau of Japan
|
||||
Official Gazette, Vol. 4817
|
||||
http://kanpou.npb.go.jp/
|
||||
|
||||
nonfree/nlp2004slides.pdf:
|
||||
Yusuke Shinyama and Satoshi Sekine
|
||||
"Named Entity Discovery from Comparable News Corpora"
|
||||
|
||||
nonfree/naacl06-shinyama.pdf:
|
||||
Yusuke Shinyama and Satoshi Sekine
|
||||
"Preemptive Information Extraction using Unrestricted Relation Discovery"
|
||||
|
|
|
@ -0,0 +1,173 @@
|
|||
\documentclass[landscape,twocolumn]{tarticle}
|
||||
|
||||
\setlength{\hoffset}{-0.6in}
|
||||
\setlength{\voffset}{-0.7in}
|
||||
|
||||
\setlength{\textwidth}{18cm}
|
||||
%\setlength{\textheight}{9in}
|
||||
|
||||
%\setlength{\oddsidemargin}{-0.5in}
|
||||
%\setlength{\evensidemargin}{-0.5in}
|
||||
\setlength{\topmargin}{0in}
|
||||
\setlength{\columnsep}{0.4in}
|
||||
|
||||
\pagestyle{empty}
|
||||
\makeatletter
|
||||
\def\kanjistrut{\vrule \@height0.88zw \@depth0.12zw \@width\z@}
|
||||
\newdimen\mytempdima
|
||||
\newcommand{\ruby}[2]{%
|
||||
\leavevmode
|
||||
\setbox0=\hbox{#1}%
|
||||
\mytempdima=\f@size\p@
|
||||
\setbox1=\hbox{\fontsize{0.5\mytempdima}{0pt}\selectfont #2}%
|
||||
\ifdim\wd0>\wd1 \dimen0=\wd0 \else \dimen0=\wd1 \fi
|
||||
\hbox{%
|
||||
\kanjiskip=0pt plus 2fil
|
||||
\xkanjiskip=0pt plus 2fil
|
||||
\vbox{%
|
||||
\hbox to \dimen0{%
|
||||
\fontsize{0.5\mytempdima}{0pt}\selectfont \kanjistrut\hfil#2\hfil}%
|
||||
\nointerlineskip
|
||||
\hbox to \dimen0{\kanjistrut\hfil#1\hfil}}}}
|
||||
\makeatother
|
||||
|
||||
\begin{document}
|
||||
|
||||
序
|
||||
\vspace{0.4in}
|
||||
|
||||
\begin{flushleft}
|
||||
わたくしといふ現象は
|
||||
|
||||
假定された有機交流電燈の
|
||||
|
||||
ひとつの青い照明です
|
||||
|
||||
(あらゆる透明な幽霊の複合体)
|
||||
|
||||
風景やみんなといっしょに
|
||||
|
||||
せはしくせはしく明滅しながら
|
||||
|
||||
いかにもたしかにともりつづける
|
||||
|
||||
因果交流電燈の
|
||||
|
||||
ひとつの青い照明です
|
||||
|
||||
(ひかりはたもち、その電燈は失はれ)
|
||||
|
||||
|
||||
|
||||
これらは二十二箇月の
|
||||
|
||||
過去とかんずる方角から
|
||||
|
||||
紙と鑛質インクをつらね
|
||||
|
||||
(すべてわたくしと明滅し
|
||||
|
||||
みんなが同時に感ずるもの)
|
||||
|
||||
ここまでたもちつゞけられた
|
||||
|
||||
かげとひかりのひとくさりづつ
|
||||
|
||||
そのとほりの心象スケッチです
|
||||
|
||||
|
||||
|
||||
これらについて人や銀河や修羅や海膽は
|
||||
|
||||
宇宙塵をたべ、または空気や塩水を呼吸しながら
|
||||
|
||||
それぞれ新鮮な本体論もかんがへませうが
|
||||
|
||||
それらも畢竟こゝろのひとつの風物です
|
||||
|
||||
たゞたしかに記録されたこれらのけしきは
|
||||
|
||||
記録されたそのとほりのこのけしきで
|
||||
|
||||
それが虚無ならば虚無自身がこのとほりで
|
||||
|
||||
ある程度まではみんなに共通いたします
|
||||
|
||||
(すべてがわたくしの中のみんなであるやうに
|
||||
|
||||
みんなのおのおののなかのすべてですから)
|
||||
\newpage
|
||||
|
||||
|
||||
\vspace{1.0in}
|
||||
|
||||
けれどもこれら新世代沖積世の
|
||||
|
||||
巨大に明るい時間の集積のなかで
|
||||
|
||||
正しくうつされた筈のこれらのことばが
|
||||
|
||||
わづかその一點にも均しい明暗のうちに
|
||||
|
||||
(あるひは修羅の十億年)
|
||||
|
||||
すでにはやくもその組立や質を變じ
|
||||
|
||||
しかもわたくしも印刷者も
|
||||
|
||||
それを変らないとして感ずることは
|
||||
|
||||
傾向としてはあり得ます
|
||||
|
||||
けだしわれわれがわれわれの感官や
|
||||
|
||||
風景や人物をかんずるやうに
|
||||
|
||||
そしてたゞ共通に感ずるだけであるやうに
|
||||
|
||||
記録や歴史、あるひは地史といふものも
|
||||
|
||||
それのいろいろの論料といっしょに
|
||||
|
||||
(因果の時空的制約のもとに)
|
||||
|
||||
われわれがかんじてゐるのに過ぎません
|
||||
|
||||
おそらくこれから二千年もたったころは
|
||||
|
||||
それ相當のちがった地質學が流用され
|
||||
|
||||
相當した證據もまた次次過去から現出し
|
||||
|
||||
みんなは二千年ぐらゐ前には
|
||||
|
||||
青ぞらいっぱいの無色な孔雀が居たとおもひ
|
||||
|
||||
新進の大學士たちは気圏のいちばんの上層
|
||||
|
||||
きらびやかな氷窒素のあたりから
|
||||
|
||||
すてきな化石を發堀したり
|
||||
|
||||
あるひは白堊紀砂岩の層面に
|
||||
|
||||
透明な人類の巨大な足跡を
|
||||
|
||||
発見するかもしれません
|
||||
|
||||
|
||||
|
||||
すべてこれらの命題は
|
||||
|
||||
心象や時間それ自身の性質として
|
||||
|
||||
第四次延長のなかで主張されます
|
||||
|
||||
|
||||
\end{flushleft}
|
||||
|
||||
\begin{flushright}
|
||||
大正十三年一月廿日 宮澤賢治
|
||||
\end{flushright}
|
||||
|
||||
\end{document}
|