Merge branch 'develop'

pull/341/head 20191107
Pieter Marsman 2019-11-07 21:52:58 +01:00
commit b63a636512
103 changed files with 1852 additions and 37880 deletions

20
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file

@ -0,0 +1,20 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
1. If any, include the code that you are using
2. If any, include the command line statements that you are using
3. If you have problems with a specific pdf file, include that pdf file
**Expected behavior**
A clear and concise description of what you expected to happen.


@ -0,0 +1,17 @@
---
name: Feature request
about: Suggest an improvement for this project
title: ''
labels: enhancement
assignees: ''
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

17
.github/pull_request_template.md vendored Normal file

@ -0,0 +1,17 @@
**Description**
Please include a summary of the change and which issue is fixed. If this does not fix an issue, then first create a new issue. Please also include relevant motivation and context.
Fixes # (issue)
**How Has This Been Tested?**
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Include an example pdf if you have one.
**Checklist**
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the [README.md](../README.md) and other documentation, or I am sure that this is not necessary
- [ ] I have added a concise human-readable description of the change to [CHANGELOG.md](../CHANGELOG.md)
- [ ] I have added docstrings to newly created methods and classes
- [ ] I have optimized the code at least one time after creating the initial version


@ -4,7 +4,9 @@ python:
 - "3.4"
 - "3.5"
 - "3.6"
+- "3.7"
+- "3.8"
 install:
 - pip install tox-travis
 script:
-- tox
+- tox -r


@ -7,6 +7,33 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 Nothing yet
+## [20191107] - 2019-11-07
+### Deprecated
+- The argument `_py2_no_more_posargs`, because Python 2 support is removed in
+January 2020 ([#328](https://github.com/pdfminer/pdfminer.six/pull/328) and
+[#307](https://github.com/pdfminer/pdfminer.six/pull/307))
+### Added
+- Simple wrapper to easily extract text from a PDF file ([#330](https://github.com/pdfminer/pdfminer.six/pull/330))
+- Support for extracting JBIG2 encoded images ([#311](https://github.com/pdfminer/pdfminer.six/pull/311) and [#46](https://github.com/pdfminer/pdfminer.six/pull/46))
+- Sphinx documentation that is published on
+[Read the Docs](https://pdfminersix.readthedocs.io/)
+([#329](https://github.com/pdfminer/pdfminer.six/pull/329))
+### Fixed
+- Unhandled AssertionError when dumping pdf containing reference to object id 0
+([#318](https://github.com/pdfminer/pdfminer.six/pull/318))
+- Debug flag actually changes logging level to debug for pdf2txt.py and
+dumppdf.py ([#325](https://github.com/pdfminer/pdfminer.six/pull/325))
+### Changed
+- Using argparse instead of getopt for command line interface of dumppdf.py ([#321](https://github.com/pdfminer/pdfminer.six/pull/321))
+- Refactor `LTLayoutContainer.group_textboxes` for a significant speed up in layout analysis ([#315](https://github.com/pdfminer/pdfminer.six/pull/315))
+### Removed
+- Files for external applications such as django, cgi and pyinstaller ([#314](https://github.com/pdfminer/pdfminer.six/issues/314))
 ## [20191020] - 2019-10-20
 ### Deprecated
@ -27,7 +54,7 @@ Nothing yet
 - Allow for bounding boxes with zero height or width by removing assertion ([#246](https://github.com/pdfminer/pdfminer.six/pull/246))
 ### Changed
-- All dependencies are managed in `setup.py` ([#306](https://github.com/pdfminer/pdfminer.six/pull/306), [#219](https://github.com/pdfminer/pdfminer.six/pull/219))
+- All dependencies are managed in `setup.py` ([#306](https://github.com/pdfminer/pdfminer.six/pull/306) and [#219](https://github.com/pdfminer/pdfminer.six/pull/219))
 ## [20181108] - 2018-11-08


@ -1,21 +1,22 @@
-PDFMiner.six
+pdfminer.six
 ============
-PDFMiner.six is a fork of PDFMiner using six for Python 2+3 compatibility
+[![Build Status](https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master)](https://travis-ci.org/pdfminer/pdfminer.six)
+[![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)
+[![gitter](https://badges.gitter.im/pdfminer-six/Lobby.svg)](https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium)
-[![Build Status](https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master)](https://travis-ci.org/pdfminer/pdfminer.six) [![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)
+Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a
+tool for extracting information from PDF documents.
-PDFMiner is a tool for extracting information from PDF documents.
 Unlike other PDF-related tools, it focuses entirely on getting
-and analyzing text data. PDFMiner allows one to obtain
+and analyzing text data. Pdfminer.six allows one to obtain
 the exact location of text in a page, as well as
 other information such as fonts or lines.
 It includes a PDF converter that can transform PDF files
 into other text formats (such as HTML). It has an extensible
 PDF parser that can be used for other purposes than text analysis.
-* Webpage: https://github.com/pdfminer/
-* Download (PyPI): https://pypi.python.org/pypi/pdfminer.six/
+Check out the full documentation on
+[Read the Docs](https://pdfminersix.readthedocs.io).
 Features
@ -23,62 +24,30 @@ Features
 * Written entirely in Python.
 * Parse, analyze, and convert PDF documents.
-* PDF-1.7 specification support. (well, almost)
+* PDF-1.7 specification support. (well, almost).
 * CJK languages and vertical writing scripts support.
 * Various font types (Type1, TrueType, Type3, and CID) support.
+* Support for extracting images (JPG, JBIG2 and Bitmaps).
 * Basic encryption (RC4) support.
 * Outline (TOC) extraction.
 * Tagged contents extraction.
 * Automatic layout analysis.
-How to Install
---------------
+How to use
+----------
-* Install Python 2.7 or newer.
-* Install
+* Install Python 2.7 or newer. Note that Python 2 support is dropped at
+January, 2020.
 `pip install pdfminer.six`
-* Run the following test:
+* Use command-line interface to extract text from pdf:
-`pdf2txt.py samples/simple1.pdf`
+`python pdf2txt.py samples/simple1.pdf`
+* Check out more examples and documentation on
+[Read the Docs](https://pdfminersix.readthedocs.io).
-Command Line Tools
-------------------
-PDFMiner comes with two handy tools:
-pdf2txt.py and dumppdf.py.
-**pdf2txt.py**
-pdf2txt.py extracts text contents from a PDF file.
-It extracts all the text that is to be rendered programmatically,
-i.e. text represented as ASCII or Unicode strings.
-It cannot recognize text drawn as images that would require optical character recognition.
-It also extracts the corresponding locations, font names, font sizes, writing
-direction (horizontal or vertical) for each text portion.
-You need to provide a password for protected PDF documents when their access is restricted.
-You cannot extract any text from a PDF document which does not have extraction permission.
-(For details, refer to /docs/index.html.)
-**dumppdf.py**
-dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format.
-This program is primarily for debugging purposes,
-but it's also possible to extract some meaningful contents (e.g. images).
-(For details, refer to /docs/index.html.)
-TODO
-----
-* PEP-8 and PEP-257 conformance.
-* Better documentation.
-* Performance improvements.
 Contributing

1
docs/.gitignore vendored Normal file

@ -0,0 +1 @@
build/

20
docs/Makefile Normal file

@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
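
The catch-all target in this Makefile forwards any target name to Sphinx's make mode (`sphinx-build -M`). As a rough illustration, the Python sketch below only assembles the equivalent argument list. The function name `sphinx_command` and its keyword defaults are assumptions that mirror the Makefile variables; they are not part of the repository.

```python
import shlex


def sphinx_command(target, sphinxbuild="sphinx-build", sourcedir="source",
                   builddir="build", sphinxopts=""):
    """Return the argument list that `make <target>` would pass to Sphinx.

    Hypothetical helper: the defaults copy SPHINXBUILD, SOURCEDIR and
    BUILDDIR from the Makefile; sphinxopts stands in for $(SPHINXOPTS)
    and $(O).
    """
    # Split the extra options the way a shell would split them.
    extra = shlex.split(sphinxopts)
    return [sphinxbuild, "-M", target, sourcedir, builddir] + extra


if __name__ == "__main__":
    # `make html` corresponds to: sphinx-build -M html source build
    print(" ".join(sphinx_command("html")))
```

Running `make help` (or plain `make`) follows the same pattern with `help` as the target.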


@ -1,225 +0,0 @@
%TGIF 4.1.45-QPL
state(0,37,100.000,0,0,0,16,1,9,1,1,2,0,1,0,1,1,'NewCenturySchlbk-Bold',1,103680,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
% %W%
%
unit("1 pixel/pixel").
color_info(19,65535,0,[
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
"red", 65535, 0, 0, 65535, 0, 0, 1,
"green", 0, 65535, 0, 0, 65535, 0, 1,
"blue", 0, 0, 65535, 0, 0, 65535, 1,
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
"black", 0, 0, 0, 0, 0, 0, 1,
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
]).
script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
page(1,"",1,'').
text('black',90,95,1,1,1,66,20,0,15,5,0,0,0,0,2,66,20,0,0,"",0,0,0,0,110,'',[
minilines(66,20,0,0,1,0,0,[
mini_line(66,15,5,0,0,0,[
str_block(0,66,15,5,0,-1,0,0,0,[
str_seg('black','Courier-Bold',1,103680,66,15,5,0,-1,0,0,0,0,0,
"U+30FC")])
])
])]).
text('black',100,285,1,1,1,66,20,3,15,5,0,0,0,0,2,66,20,0,0,"",0,0,0,0,300,'',[
minilines(66,20,0,0,1,0,0,[
mini_line(66,15,5,0,0,0,[
str_block(0,66,15,5,0,-2,0,0,0,[
str_seg('black','Courier-Bold',1,103680,66,15,5,0,-2,0,0,0,0,0,
"U+5199")])
])
])]).
text('black',400,38,2,1,1,119,30,5,12,3,0,0,0,0,2,119,30,0,0,"",0,0,0,0,50,'',[
minilines(119,30,0,0,1,0,0,[
mini_line(83,12,3,0,0,0,[
str_block(0,83,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
"Adobe-Japan1")])
]),
mini_line(119,12,3,0,0,0,[
str_block(0,119,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,119,12,3,0,-1,0,0,0,0,0,
"CID:660 (horizontal)")])
])
])]).
text('black',400,118,2,1,1,114,30,8,12,3,0,0,0,0,2,114,30,0,0,"",0,0,0,0,130,'',[
minilines(114,30,0,0,1,0,0,[
mini_line(83,12,3,0,0,0,[
str_block(0,83,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
"Adobe-Japan1")])
]),
mini_line(114,12,3,0,0,0,[
str_block(0,114,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,114,12,3,0,-1,0,0,0,0,0,
"CID:7891 (vertical)")])
])
])]).
text('black',400,238,2,1,1,125,30,15,12,3,0,0,0,0,2,125,30,0,0,"",0,0,0,0,250,'',[
minilines(125,30,0,0,1,0,0,[
mini_line(83,12,3,0,0,0,[
str_block(0,83,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,83,12,3,0,-3,0,0,0,0,0,
"Adobe-Japan1")])
]),
mini_line(125,12,3,0,0,0,[
str_block(0,125,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,125,12,3,0,-1,0,0,0,0,0,
"CID:2296 (Japanese)")])
])
])]).
text('black',400,318,2,1,1,115,30,16,12,3,0,0,0,0,2,115,30,0,0,"",0,0,0,0,330,'',[
minilines(115,30,0,0,1,0,0,[
mini_line(67,12,3,0,0,0,[
str_block(0,67,12,3,0,-3,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,67,12,3,0,-3,0,0,0,0,0,
"Adobe-GB1")])
]),
mini_line(115,12,3,0,0,0,[
str_block(0,115,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,115,12,3,0,-1,0,0,0,0,0,
"CID:3967 (Chinese)")])
])
])]).
text('black',200,84,2,1,1,116,38,20,16,3,0,0,0,0,2,116,38,0,0,"",0,0,0,0,100,'',[
minilines(116,38,0,0,1,0,0,[
mini_line(70,16,3,0,0,0,[
str_block(0,70,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,70,16,3,0,-1,0,0,0,0,0,
"Japanese")])
]),
mini_line(116,16,3,0,0,0,[
str_block(0,116,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,116,16,3,0,-1,0,0,0,0,0,
"long-vowel sign")])
])
])]).
oval('black','',30,70,280,140,0,1,1,49,0,0,0,0,0,'1',0,[
]).
oval('black','',30,260,280,330,0,1,1,51,0,0,0,0,0,'1',0,[
]).
text('black',200,274,2,1,1,85,38,53,16,3,0,0,0,0,2,85,38,0,0,"",0,0,0,0,290,'',[
minilines(85,38,0,0,1,0,0,[
mini_line(61,16,3,0,0,0,[
str_block(0,61,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,61,16,3,0,-1,0,0,0,0,0,
"Chinese")])
]),
mini_line(85,16,3,0,0,0,[
str_block(0,85,16,3,0,-1,0,0,0,[
str_seg('black','NewCenturySchlbk-Roman',0,97920,85,16,3,0,-1,0,0,0,0,0,
"letter \"sha\"")])
])
])]).
box('black','',330,30,560,80,0,1,1,57,0,0,0,0,0,'1',0,[
]).
box('black','',330,110,560,160,0,1,1,59,0,0,0,0,0,'1',0,[
]).
box('black','',330,230,560,280,0,1,1,60,0,0,0,0,0,'1',0,[
]).
box('black','',330,310,560,360,0,1,1,61,0,0,0,0,0,'1',0,[
]).
group([
poly('black','',4,[
506,246,501,235,541,235,536,246],0,2,1,68,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',5,[
519,238,516,252,529,252,524,275,516,272],0,2,1,69,0,0,0,0,0,0,0,'2',0,0,
"00","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
501,261,541,261],0,2,1,70,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
519,244,529,244],0,2,1,71,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
76,0,0,[
]).
group([
poly('black','',3,[
519,119,524,127,524,152],0,2,1,67,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
78,0,0,[
]).
group([
poly('black','',3,[
540,57,509,57,501,49],0,2,1,66,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
80,0,0,[
]).
group([
poly('black','',4,[
506,326,501,315,541,315,536,326],0,2,1,90,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',5,[
519,318,515,332,531,332,526,355,519,352],0,2,1,89,0,0,0,0,0,0,0,'2',0,0,
"00","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
501,341,526,341],0,2,1,88,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]),
poly('black','',2,[
519,324,529,324],0,2,1,87,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
])
],
134,0,0,[
]).
poly('black','',2,[
270,90,320,70],1,3,1,158,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
poly('black','',2,[
280,110,320,130],1,3,1,159,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
poly('black','',2,[
270,280,310,250],1,3,1,160,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
poly('black','',2,[
270,300,310,330],1,3,1,161,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).

Binary file not shown.



@ -1,427 +0,0 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PDFMiner</title>
</head>
<body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Wed Jun 25 10:27:52 UTC 2014
<!-- hhmts end -->
</div>
<h1>PDFMiner</h1>
<p>
Python PDF parser and analyzer
<p>
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
&nbsp;
<a href="#changes">Recent Changes</a>
&nbsp;
<a href="programming.html">PDFMiner API</a>
<ul>
<li> <a href="#intro">What's It?</a>
<li> <a href="#download">Download</a>
<li> <a href="#wheretoask">Where to Ask</a>
<li> <a href="#install">How to Install</a>
<ul>
<li> <a href="#cmap">CJK languages support</a>
</ul>
<li> <a href="#tools">Command Line Tools</a>
<ul>
<li> <a href="#pdf2txt">pdf2txt.py</a>
<li> <a href="#dumppdf">dumppdf.py</a>
<li> <a href="programming.html">PDFMiner API</a>
</ul>
<li> <a href="#changes">Changes</a>
<li> <a href="#todo">TODO</a>
<li> <a href="#related">Related Projects</a>
<li> <a href="#license">Terms and Conditions</a>
</ul>
<h2><a name="intro">What's It?</a></h2>
<p>
PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data. PDFMiner allows one to obtain
the exact location of text in a page, as well as
other information such as fonts or lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for other purposes than text analysis.
<p>
<h3>Features</h3>
<ul>
<li> Written entirely in Python. (for version 2.6 or newer)
<li> Parse, analyze, and convert PDF documents.
<li> PDF-1.7 specification support. (well, almost)
<li> CJK languages and vertical writing scripts support.
<li> Various font types (Type1, TrueType, Type3, and CID) support.
<li> Basic encryption (RC4) support.
<li> PDF to HTML conversion (with a sample converter web app).
<li> Outline (TOC) extraction.
<li> Tagged contents extraction.
<li> Reconstruct the original layout by grouping text chunks.
</ul>
<p>
PDFMiner is about 20 times slower than
other C/C++-based counterparts such as XPdf.
<P>
<strong>Online Demo:</strong> (pdf -&gt; html conversion webapp)<br>
<a href="http://pdf2html.tabesugi.net:8080/">
http://pdf2html.tabesugi.net:8080/
</a>
<h3><a name="download">Download</a></h3>
<p>
<strong>Source distribution:</strong><br>
<a href="http://pypi.python.org/pypi/pdfminer_six/">
http://pypi.python.org/pypi/pdfminer_six/
</a>
<P>
<strong>github:</strong><br>
<a href="https://github.com/goulu/pdfminer/">
https://github.com/goulu/pdfminer/
</a>
<h3><a name="wheretoask">Where to Ask</a></h3>
<p>
<p>
<strong>Questions and comments:</strong><br>
<a href="http://groups.google.com/group/pdfminer-users/">
http://groups.google.com/group/pdfminer-users/
</a>
<h2><a name="install">How to Install</a></h2>
<ol>
<li> Install <a href="http://www.python.org/download/">Python</a> 2.6 or newer.
<li> Download the <a href="#source">PDFMiner source</a>.
<li> Unpack it.
<li> Run <code>setup.py</code> to install:<br>
<blockquote><pre>
# <strong>python setup.py install</strong>
</pre></blockquote>
<li> Do the following test:<br>
<blockquote><pre>
$ <strong>pdf2txt.py samples/simple1.pdf</strong>
Hello
World
Hello
World
H e l l o
W o r l d
H e l l o
W o r l d
</pre></blockquote>
<li> Done!
</ol>
<h3><a name="cmap">For CJK languages</a></h3>
<p>
In order to process CJK languages, you need an additional step to take
during installation:
<blockquote><pre>
# <strong>make cmap</strong>
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
<em>(this may take several minutes)</em>
# <strong>python setup.py install</strong>
</pre></blockquote>
<p>
On Windows machines which don't have <code>make</code> command,
paste the following commands on a command line prompt:
<blockquote><pre>
<strong>mkdir pdfminer\cmap</strong>
<strong>python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt</strong>
<strong>python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt</strong>
<strong>python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt</strong>
<strong>python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt</strong>
<strong>python setup.py install</strong>
</pre></blockquote>
<h2><a name="tools">Command Line Tools</a></h2>
<p>
PDFMiner comes with two handy tools:
<code>pdf2txt.py</code> and <code>dumppdf.py</code>.
<h3><a name="pdf2txt">pdf2txt.py</a></h3>
<p>
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the text that is to be rendered programmatically,
i.e. text represented as ASCII or Unicode strings.
It cannot recognize text drawn as images that would require optical character recognition.
It also extracts the corresponding locations, font names, font sizes, writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for protected PDF documents when their access is restricted.
You cannot extract any text from a PDF document which does not have extraction permission.
<p>
<strong>Note:</strong>
Not all characters in a PDF can be safely converted to Unicode.
<h4>Examples</h4>
<blockquote><pre>
$ <strong>pdf2txt.py -o output.html samples/naacl06-shinyama.pdf</strong>
(extract text as an HTML file whose filename is output.html)
$ <strong>pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf</strong>
(extract a Japanese HTML file in vertical writing, CMap is required)
$ <strong>pdf2txt.py -P mypassword -o output.txt secret.pdf</strong>
(extract text from an encrypted PDF file)
</pre></blockquote>
<h4>Options</h4>
<dl>
<dt> <code>-o <em>filename</em></code>
<dd> Specifies the output file name.
By default, it prints the extracted contents to stdout in text format.
<p>
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
<dd> Specifies the comma-separated list of the page numbers to be extracted.
Page numbers start at one.
By default, it extracts text from all the pages.
<p>
<dt> <code>-c <em>codec</em></code>
<dd> Specifies the output codec.
<p>
<dt> <code>-t <em>type</em></code>
<dd> Specifies the output format. The following formats are currently supported.
<ul>
<li> <code>text</code> : TEXT format. (Default)
<li> <code>html</code> : HTML format. Not recommended for extraction purposes because the markup is messy.
<li> <code>xml</code> : XML format. Provides the most information.
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
Tags used here are defined in the PDF specification (See &sect;10.7 "<em>Tagged PDF</em>").
</ul>
<p>
<dt> <code>-I <em>image_directory</em></code>
<dd> Specifies the output directory for image extraction.
Currently only JPEG images are supported.
<p>
<dt> <code>-M <em>char_margin</em></code>
<dt> <code>-L <em>line_margin</em></code>
<dt> <code>-W <em>word_margin</em></code>
<dd> These are the parameters used for layout analysis.
In an actual PDF file, text portions might be split into several chunks
in the middle of its running, depending on the authoring software.
Therefore, text extraction needs to splice text chunks.
In the figure below, two text chunks whose distance is closer than
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) are considered
continuous and are grouped into one. Also, two lines whose distance is closer than
the <em>line_margin</em> (<em><font color="blue">L</font></em>) are grouped
as a text box, which is a rectangular area that contains a "cluster" of text portions.
Furthermore, it may be required to insert blank characters (spaces) as necessary
if the distance between two words is greater than the <em>word_margin</em>
(<em><font color="green">W</font></em>), as a blank between words might not be
represented as a space, but indicated by the positioning of each word.
<p>
Each value is specified not as an actual length, but as a proportion of
the length to the size of each character in question. The default values
are M = 2.0, L = 0.5, and W = 0.1, respectively.
<table style="border:2px gray solid; margin: 10px; padding: 10px;"><tr>
<td style="border-right:1px red solid" align=right>&rarr;</td>
<td style="border-left:1px red solid" colspan="4" align=left>&larr; <em><font color="red">M</font></em></td>
<td></td>
</tr><tr>
<td style="border:1px solid"><code>Q u i</code></td>
<td style="border:1px solid"><code>c k</code></td>
<td width="10px"></td>
<td style="border:1px solid"><code>b r o w</code></td>
<td style="border:1px solid"><code>n &nbsp; f o x</code></td>
<td style="border-bottom:1px blue solid" align=right>&darr;</td>
</tr><tr>
<td style="border-right:1px green solid" colspan="2" align=right>&rarr;</td><td></td>
<td style="border-left:1px green solid" colspan="2" align=left>&larr; <em><font color="green">W</font></em></td>
<td rowspan="2" valign=center align=center><em><font color="blue">L</font></em></td>
</tr><tr height="10px">
</tr><tr>
<td style="padding:0px;" colspan="5">
<table style="border:1px solid"><tr><td><code>j u m p s</code></td><td>...</td></tr></table>
</td>
<td style="border-top:1px blue solid" align=right>&uarr;</td>
</tr></table>
<p>
<dt> <code>-F <em>boxes_flow</em></code>
<dd> Specifies how much the horizontal and vertical positions of a text matter
when determining the text order. The value should be within the range of
-1.0 (only horizontal position matters) to +1.0 (only vertical position matters).
The default value is 0.5.
<p>
<dt> <code>-C</code>
<dd> Suppress object caching.
This reduces memory consumption but also slows down the process.
<p>
<dt> <code>-n</code>
<dd> Suppress layout analysis.
<p>
<dt> <code>-A</code>
<dd> Forces layout analysis for all the text strings,
including text contained in figures.
<p>
<dt> <code>-V</code>
<dd> Allows vertical writing detection.
<p>
<dt> <code>-Y <em>layout_mode</em></code>
<dd> Specifies how the page layout should be preserved. (Currently only applies to HTML format.)
<ul>
<li> <code>exact</code> : preserve the exact location of each individual character (a large and messy HTML).
<li> <code>normal</code> : preserve the location and line breaks in each text block. (Default)
<li> <code>loose</code> : preserve the overall location of each text block.
</ul>
<p>
<dt> <code>-E <em>extractdir</em></code>
<dd> Specifies the extraction directory of embedded files.
<p>
<dt> <code>-s <em>scale</em></code>
<dd> Specifies the output scale. Can be used in HTML format only.
<p>
<dt> <code>-m <em>maxpages</em></code>
<dd> Specifies the maximum number of pages to extract.
By default, it extracts all the pages in a document.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
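<p>
The layout parameters described above (<em>char_margin</em> and <em>word_margin</em>) can be illustrated with a small sketch. This is a one-dimensional simplification written for illustration, not the code PDFMiner actually runs: the <code>Chunk</code> class and the <code>join_chunks</code> function are hypothetical names, the chunks are assumed to sit on a single line, and <em>line_margin</em> is left out. Only the default values (M = 2.0, W = 0.1) and the rule that margins are proportions of the character size come from the text.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A hypothetical text chunk on one line (not a PDFMiner class)."""
    text: str
    x0: float    # left edge
    x1: float    # right edge
    size: float  # character size; the margins are proportions of it


def join_chunks(chunks, char_margin=2.0, word_margin=0.1):
    """Group chunks on one line and return the resulting strings."""
    groups = []
    current = ""
    prev = None
    for chunk in sorted(chunks, key=lambda c: c.x0):
        if prev is None:
            current = chunk.text
        else:
            gap = chunk.x0 - prev.x1
            if gap >= char_margin * chunk.size:
                # Not closer than char_margin: start a new group.
                groups.append(current)
                current = chunk.text
            elif gap > word_margin * chunk.size:
                # Wider than word_margin: the gap stands for a space.
                current += " " + chunk.text
            else:
                # Close enough to be the same word.
                current += chunk.text
        prev = chunk
    if prev is not None:
        groups.append(current)
    return groups


if __name__ == "__main__":
    line = [
        Chunk("Qui", 0, 30, 10),
        Chunk("ck", 30.5, 50, 10),
        Chunk("brow", 55, 95, 10),
        Chunk("n fox", 95, 140, 10),
        Chunk("jumps", 200, 250, 10),
    ]
    print(join_chunks(line))
```

With these sample coordinates the first four chunks end up in one group and the distant fifth chunk starts a new one.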
<hr noshade>
<h3><a name="dumppdf">dumppdf.py</a></h3>
<p>
<code>dumppdf.py</code> dumps the internal contents of a PDF file
in pseudo-XML format. This program is primarily for debugging purposes,
but it's also possible to extract some meaningful contents
(such as images).
<h4>Examples</h4>
<blockquote><pre>
$ <strong>dumppdf.py -a foo.pdf</strong>
(dump all the headers and contents, except stream objects)
$ <strong>dumppdf.py -T foo.pdf</strong>
(dump the table of contents)
$ <strong>dumppdf.py -r -i6 foo.pdf &gt; pic.jpeg</strong>
(extract a JPEG image)
</pre></blockquote>
<h4>Options</h4>
<dl>
<dt> <code>-a</code>
<dd> Dumps all the objects.
By default, it only prints the document trailer (like a header).
<p>
<dt> <code>-i <em>objno,objno, ...</em></code>
<dd> Specifies PDF object IDs to display.
Comma-separated IDs, or multiple <code>-i</code> options are accepted.
<p>
<dt> <code>-p <em>pageno,pageno, ...</em></code>
<dd> Specifies the page number to be extracted.
Comma-separated page numbers, or multiple <code>-p</code> options are accepted.
Note that page numbers start at one, not zero.
<p>
<dt> <code>-r</code> (raw)
<dt> <code>-b</code> (binary)
<dt> <code>-t</code> (text)
<dd> Specifies the output format of stream contents.
Because the contents of stream objects can be very large,
they are omitted when none of the options above is specified.
<p>
With <code>-r</code> option, the "raw" stream contents are dumped without decompression.
With <code>-b</code> option, the decompressed contents are dumped as a binary blob.
With <code>-t</code> option, the decompressed contents are dumped in a text format,
similar to the output of <code>repr()</code>. When
<code>-r</code> or <code>-b</code> option is given,
no stream header is displayed, which makes it easier to save the output to a file.
<p>
<dt> <code>-T</code>
<dd> Shows the table of contents.
<p>
<dt> <code>-E <em>directory</em></code>
<dd> Extracts embedded files from the pdf into the given directory.
<p>
<dt> <code>-P <em>password</em></code>
<dd> Provides the user password to access PDF contents.
<p>
<dt> <code>-d</code>
<dd> Increases the debug level.
</dl>
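<p>
The difference between the <code>-r</code>, <code>-b</code> and <code>-t</code> stream formats described above can be sketched as follows. This is an illustration only and not the code of <code>dumppdf.py</code>: the <code>dump_stream</code> function is a hypothetical name, and the sketch assumes the stream was compressed with zlib (Flate), whereas real PDF streams may use other filters.

```python
import zlib


def dump_stream(rawdata, mode):
    """Return stream bytes the way the -r, -b and -t options are described.

    Hypothetical helper; assumes a zlib (Flate) compressed stream.
    """
    if mode == "raw":
        # -r: the stored bytes, dumped without decompression.
        return rawdata
    data = zlib.decompress(rawdata)
    if mode == "binary":
        # -b: the decompressed contents as a binary blob.
        return data
    if mode == "text":
        # -t: a repr()-style text rendering of the decompressed contents.
        return repr(data).encode("ascii")
    raise ValueError("mode must be 'raw', 'binary' or 'text'")


if __name__ == "__main__":
    stored = zlib.compress(b"BT (Hello) Tj ET")
    for mode in ("raw", "binary", "text"):
        print(mode, dump_stream(stored, mode))
```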
<h2><a name="changes">Changes:</a></h2>
<ul>
<li> 2014/09/15: pushed on PyPi</li>
<li> 2014/09/10: pdfminer_six forked from pdfminer since Yusuke didn't want to merge and pdfminer3k is outdated</li>
</ul>
<h2><a name="todo">TODO</a></h2>
<ul>
<li> <A href="http://www.python.org/dev/peps/pep-0008/">PEP-8</a> and
<a href="http://www.python.org/dev/peps/pep-0257/">PEP-257</a> conformance.
<li> Better documentation.
<li> Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
<li> Crypt stream filter support. (More sample documents are needed!)
</ul>
<h2><a name="related">Related Projects</a></h2>
<ul>
<li> <a href="http://pybrary.net/pyPdf/">pyPdf</a>
<li> <a href="http://www.foolabs.com/xpdf/">xpdf</a>
<li> <a href="http://www.pdfbox.org/">pdfbox</a>
<li> <a href="http://mupdf.com/">mupdf</a>
</ul>
<h2><a name="license">Terms and Conditions</a></h2>
<p>
(This is so-called MIT/X License)
<p>
<small>
Copyright (c) 2004-2013 Yusuke Shinyama &lt;yusuke at cs dot nyu dot edu&gt;
<p>
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>
<hr noshade>
<address>Yusuke Shinyama (yusuke at cs dot nyu dot edu)</address>
</body>

View File

@ -1,391 +0,0 @@
%TGIF 4.2.2
state(0,37,100.000,0,0,0,16,1,9,1,1,0,0,0,0,1,1,'Helvetica-Bold',1,69120,0,0,1,5,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
% %W%
%
unit("1 pixel/pixel").
color_info(19,65535,0,[
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
"red", 65535, 0, 0, 65535, 0, 0, 1,
"green", 0, 65535, 0, 0, 65535, 0, 1,
"blue", 0, 0, 65535, 0, 0, 65535, 1,
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
"black", 0, 0, 0, 0, 0, 0, 1,
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
]).
script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
rotate_pivot(0,0,0,0).
spline_tightness(1).
page(1,"",1,'').
box('black','',50,45,300,355,2,2,1,0,0,0,0,0,0,'2',0,[
]).
box('black','',75,75,195,225,2,1,1,10,8,0,0,0,0,'1',0,[
]).
box('black','',85,105,185,125,2,1,1,18,8,0,0,0,0,'1',0,[
]).
box('black','',85,105,105,125,2,1,1,19,0,0,0,0,0,'1',0,[
]).
box('black','',105,105,125,125,2,1,1,20,0,0,0,0,0,'1',0,[
]).
text('black',95,108,1,1,1,9,15,21,12,3,0,0,0,0,2,9,15,0,0,"",0,0,0,0,120,'',[
minilines(9,15,0,0,1,0,0,[
mini_line(9,12,3,0,0,0,[
str_block(0,9,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica',0,69120,9,12,3,0,-1,0,0,0,0,0,
"A")])
])
])]).
text('black',115,108,1,1,1,8,15,28,12,3,0,0,0,0,2,8,15,0,0,"",0,0,0,0,120,'',[
minilines(8,15,0,0,1,0,0,[
mini_line(8,12,3,0,0,0,[
str_block(0,8,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica',0,69120,8,12,3,0,-1,0,0,0,0,0,
"B")])
])
])]).
box('black','',125,105,145,125,0,1,1,32,0,0,0,0,0,'1',0,[
]).
text('black',135,108,1,1,1,9,15,36,12,3,0,0,0,0,2,9,15,0,0,"",0,0,0,0,120,'',[
minilines(9,15,0,0,1,0,0,[
mini_line(9,12,3,0,0,0,[
str_block(0,9,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica',0,69120,9,12,3,0,-1,0,0,0,0,0,
"C")])
])
])]).
poly('black','',2,[
215,140,215,220],0,3,1,51,0,0,0,0,0,0,0,'3',0,0,
"0","",[
0,12,5,0,'12','5','0'],[0,12,5,0,'12','5','0'],[
]).
box('black','',175,265,270,325,0,3,1,65,0,0,0,0,0,'3',0,[
]).
box('black','',185,270,260,320,0,1,1,69,8,0,0,0,0,'1',0,[
]).
poly('black','',6,[
195,295,215,290,235,310,245,285,225,300,195,295],0,2,1,74,0,0,0,0,0,0,0,'2',0,0,
"00","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
box('black','',85,275,140,315,1,2,0,87,0,0,0,0,0,'2',0,[
]).
text('black',85,23,1,1,1,44,15,93,12,3,0,0,0,0,2,44,15,0,0,"",0,0,0,0,35,'',[
minilines(44,15,0,0,1,0,0,[
mini_line(44,12,3,0,0,0,[
str_block(0,44,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,44,12,3,0,-1,0,0,0,0,0,
"LTPage")])
])
])]).
text('black',255,133,1,1,1,39,15,100,12,3,0,0,0,0,2,39,15,0,0,"",0,0,0,0,145,'',[
minilines(39,15,0,0,1,0,0,[
mini_line(39,12,3,0,0,0,[
str_block(0,39,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,39,12,3,0,-1,0,0,0,0,0,
"LTLine")])
])
])]).
text('black',125,83,1,1,1,42,15,104,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,95,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTChar")])
])
])]).
text('black',245,53,1,1,1,65,15,108,12,3,0,0,0,0,2,65,15,0,0,"",0,0,0,0,65,'',[
minilines(65,15,0,0,1,0,0,[
mini_line(65,12,3,0,0,0,[
str_block(0,65,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,65,12,3,0,-1,0,0,0,0,0,
"LTTextBox")])
])
])]).
text('black',245,88,1,1,1,66,15,110,12,3,0,0,0,0,2,66,15,0,0,"",0,0,0,0,100,'',[
minilines(66,15,0,0,1,0,0,[
mini_line(66,12,3,0,0,0,[
str_block(0,66,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,66,12,3,0,-1,0,0,0,0,0,
"LTTextLine")])
])
])]).
text('black',255,243,1,1,1,51,15,112,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,255,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTFigure")])
])
])]).
text('black',140,243,1,1,1,51,15,114,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,255,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTImage")])
])
])]).
text('black',240,223,1,1,1,43,15,116,12,3,0,0,0,0,2,43,15,0,0,"",0,0,0,0,235,'',[
minilines(43,15,0,0,1,0,0,[
mini_line(43,12,3,0,0,0,[
str_block(0,43,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,43,12,3,0,0,0,0,0,0,0,
"LTRect")])
])
])]).
text('black',190,333,1,1,1,50,15,118,12,3,0,0,0,0,2,50,15,0,0,"",0,0,0,0,345,'',[
minilines(50,15,0,0,1,0,0,[
mini_line(50,12,3,0,0,0,[
str_block(0,50,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,50,12,3,0,-1,0,0,0,0,0,
"LTCurve")])
])
])]).
text('black',170,138,1,1,1,42,15,121,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,150,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTText")])
])
])]).
box('black','',145,105,165,125,0,1,1,125,8,0,0,0,0,'1',0,[
]).
poly('black','',2,[
105,95,95,110],0,1,1,135,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
165,140,155,115],0,1,1,138,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
215,65,190,80],0,1,1,139,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
215,100,180,115],0,1,1,140,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
235,140,215,150],0,1,1,141,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
220,235,205,265],0,1,1,146,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
235,255,225,275],0,1,1,147,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
195,330,220,300],0,1,1,148,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
125,255,110,280],0,1,1,149,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
text('black',610,33,1,1,1,44,15,151,12,3,0,0,0,0,2,44,15,0,0,"",0,0,0,0,45,'',[
minilines(44,15,0,0,1,0,0,[
mini_line(44,12,3,0,0,0,[
str_block(0,44,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,44,12,3,0,-1,0,0,0,0,0,
"LTPage")])
])
])]).
text('black',460,108,1,1,1,65,15,152,12,3,0,0,0,0,2,65,15,0,0,"",0,0,0,0,120,'',[
minilines(65,15,0,0,1,0,0,[
mini_line(65,12,3,0,0,0,[
str_block(0,65,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,65,12,3,0,-1,0,0,0,0,0,
"LTTextBox")])
])
])]).
text('black',410,178,1,1,1,66,15,154,12,3,0,0,0,0,2,66,15,0,0,"",0,0,0,0,190,'',[
minilines(66,15,0,0,1,0,0,[
mini_line(66,12,3,0,0,0,[
str_block(0,66,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,66,12,3,0,-1,0,0,0,0,0,
"LTTextLine")])
])
])]).
text('black',360,248,1,1,1,42,15,157,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,260,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTChar")])
])
])]).
text('black',420,248,1,1,1,42,15,159,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,260,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTChar")])
])
])]).
text('black',480,248,1,1,1,42,15,161,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,260,'',[
minilines(42,15,0,0,1,0,0,[
mini_line(42,12,3,0,0,0,[
str_block(0,42,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,42,12,3,0,0,0,0,0,0,0,
"LTText")])
])
])]).
text('black',460,178,1,1,1,12,15,170,12,3,0,0,0,0,2,12,15,0,0,"",0,0,0,0,190,'',[
minilines(12,15,0,0,1,0,0,[
mini_line(12,12,3,0,0,0,[
str_block(0,12,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,12,12,3,0,-1,0,0,0,0,0,
"...")])
])
])]).
text('black',520,248,1,1,1,12,15,172,12,3,0,0,0,0,2,12,15,0,0,"",0,0,0,0,260,'',[
minilines(12,15,0,0,1,0,0,[
mini_line(12,12,3,0,0,0,[
str_block(0,12,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,12,12,3,0,-1,0,0,0,0,0,
"...")])
])
])]).
text('black',560,108,1,1,1,51,15,174,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,120,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTFigure")])
])
])]).
text('black',635,108,1,1,1,39,15,178,12,3,0,0,0,0,2,39,15,0,0,"",0,0,0,0,120,'',[
minilines(39,15,0,0,1,0,0,[
mini_line(39,12,3,0,0,0,[
str_block(0,39,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,39,12,3,0,-1,0,0,0,0,0,
"LTLine")])
])
])]).
text('black',700,108,1,1,1,43,15,180,12,3,0,0,0,0,2,43,15,0,0,"",0,0,0,0,120,'',[
minilines(43,15,0,0,1,0,0,[
mini_line(43,12,3,0,0,0,[
str_block(0,43,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,43,12,3,0,0,0,0,0,0,0,
"LTRect")])
])
])]).
text('black',580,178,1,1,1,50,15,182,12,3,0,0,0,0,2,50,15,0,0,"",0,0,0,0,190,'',[
minilines(50,15,0,0,1,0,0,[
mini_line(50,12,3,0,0,0,[
str_block(0,50,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,50,12,3,0,-1,0,0,0,0,0,
"LTCurve")])
])
])]).
text('black',775,108,1,1,1,51,15,186,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,120,'',[
minilines(51,15,0,0,1,0,0,[
mini_line(51,12,3,0,0,0,[
str_block(0,51,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,51,12,3,0,-1,0,0,0,0,0,
"LTImage")])
])
])]).
poly('black','',2,[
475,105,590,50],0,1,1,190,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
560,110,595,50],0,1,1,191,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
635,105,600,50],0,1,1,192,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
610,50,700,100],0,1,1,193,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
765,100,630,50],0,1,1,194,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
460,125,425,175],0,1,1,196,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
560,125,570,175],0,1,1,197,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
415,195,370,245],0,1,1,198,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
415,195,420,245],0,1,1,199,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
415,195,475,245],0,1,1,200,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
470,125,485,175],0,1,1,206,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
420,195,510,220],0,1,1,207,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
poly('black','',2,[
565,125,635,175],0,1,1,208,0,0,0,0,0,0,0,'1',0,0,
"0","",[
0,8,3,0,'8','3','0'],[0,8,3,0,'8','3','0'],[
]).
text('black',635,178,1,1,1,12,15,215,12,3,0,0,0,0,2,12,15,0,0,"",0,0,0,0,190,'',[
minilines(12,15,0,0,1,0,0,[
mini_line(12,12,3,0,0,0,[
str_block(0,12,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,12,12,3,0,-1,0,0,0,0,0,
"...")])
])
])]).

35
docs/make.bat Normal file
View File

@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

View File

@ -1,187 +0,0 @@
%TGIF 4.2.2
state(0,37,100.000,0,0,0,16,1,9,1,1,1,0,0,2,1,1,'Helvetica-Bold',1,69120,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
% %W%
%
unit("1 pixel/pixel").
color_info(19,65535,0,[
"magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
"red", 65535, 0, 0, 65535, 0, 0, 1,
"green", 0, 65535, 0, 0, 65535, 0, 1,
"blue", 0, 0, 65535, 0, 0, 65535, 1,
"yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
"pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
"cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
"CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
"white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
"black", 0, 0, 0, 0, 0, 0, 1,
"DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1,
"#00000000c000", 0, 0, 49344, 0, 0, 49152, 1,
"#820782070000", 33410, 33410, 0, 33287, 33287, 0, 1,
"#3cf3fbee34d2", 15420, 64507, 13364, 15603, 64494, 13522, 1,
"#3cf3fbed34d3", 15420, 64507, 13364, 15603, 64493, 13523, 1,
"#ffffa6990000", 65535, 42662, 0, 65535, 42649, 0, 1,
"#ffff0000fffe", 65535, 0, 65535, 65535, 0, 65534, 1,
"#fffe0000fffe", 65535, 0, 65535, 65534, 0, 65534, 1,
"#fffe00000000", 65535, 0, 0, 65534, 0, 0, 1
]).
script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
rotate_pivot(0,0,0,0).
spline_tightness(1).
page(1,"",1,'').
oval('black','',350,380,450,430,2,2,1,88,0,0,0,0,0,'2',0,[
]).
poly('black','',2,[
270,270,350,230],1,2,1,54,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
270,280,350,320],1,2,1,55,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
box('black','',350,100,450,150,2,2,1,2,0,0,0,0,0,'2',0,[
]).
text('black',400,118,1,1,1,84,15,3,12,3,0,0,0,0,2,84,15,0,0,"",0,0,0,0,130,'',[
minilines(84,15,0,0,1,0,0,[
mini_line(84,12,3,0,0,0,[
str_block(0,84,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,84,12,3,0,0,0,0,0,0,0,
"PDFDocument")])
])
])]).
box('black','',150,100,250,150,2,2,1,13,0,0,0,0,0,'2',0,[
]).
text('black',200,118,1,1,1,63,15,14,12,3,0,0,0,0,2,63,15,0,0,"",0,0,0,0,130,'',[
minilines(63,15,0,0,1,0,0,[
mini_line(63,12,3,0,0,0,[
str_block(0,63,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,63,12,3,0,0,0,0,0,0,0,
"PDFParser")])
])
])]).
box('black','',350,200,450,250,2,2,1,20,0,0,0,0,0,'2',0,[
]).
text('black',400,218,1,1,1,88,15,21,12,3,0,0,0,0,2,88,15,0,0,"",0,0,0,0,230,'',[
minilines(88,15,0,0,1,0,0,[
mini_line(88,12,3,0,0,0,[
str_block(0,88,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,88,12,3,0,0,0,0,0,0,0,
"PDFInterpreter")])
])
])]).
box('black','',350,300,450,350,2,2,1,23,0,0,0,0,0,'2',0,[
]).
text('black',400,318,1,1,1,65,15,24,12,3,0,0,0,0,2,65,15,0,0,"",0,0,0,0,330,'',[
minilines(65,15,0,0,1,0,0,[
mini_line(65,12,3,0,0,0,[
str_block(0,65,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,65,12,3,0,-1,0,0,0,0,0,
"PDFDevice")])
])
])]).
box('black','',180,250,280,300,2,2,1,29,0,0,0,0,0,'2',0,[
]).
text('black',230,268,1,1,1,131,15,30,12,3,2,0,0,0,2,131,15,0,0,"",0,0,0,0,280,'',[
minilines(131,15,0,0,1,0,0,[
mini_line(131,12,3,0,0,0,[
str_block(0,131,12,3,0,0,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,131,12,3,0,0,0,0,0,0,0,
"PDFResourceManager")])
])
])]).
poly('black','',2,[
250,140,350,140],1,2,1,45,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
350,110,250,110],1,2,1,46,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
400,150,400,200],1,2,1,47,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
400,250,400,300],1,2,1,56,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
poly('black','',2,[
400,350,400,380],0,2,1,65,0,0,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
text('black',400,388,3,1,1,44,41,71,12,3,0,-2,0,0,2,44,41,0,0,"",0,0,0,0,400,'',[
minilines(44,41,0,0,1,-2,0,[
mini_line(44,12,3,0,0,0,[
str_block(0,44,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,44,12,3,0,-1,0,0,0,0,0,
"Display")])
]),
mini_line(20,12,3,0,0,0,[
str_block(0,20,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,20,12,3,0,-1,0,0,0,0,0,
"File")])
]),
mini_line(23,12,3,0,0,0,[
str_block(0,23,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,23,12,3,0,-1,0,0,0,0,0,
"etc.")])
])
])]).
text('black',300,88,1,1,1,92,15,79,12,3,0,0,0,0,2,92,15,0,0,"",0,0,0,0,100,'',[
minilines(92,15,0,0,1,0,0,[
mini_line(92,12,3,0,0,0,[
str_block(0,92,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,92,12,3,0,-1,0,0,0,0,0,
"request objects")])
])
])]).
text('black',300,148,1,1,1,78,15,84,12,3,0,0,0,0,2,78,15,0,0,"",0,0,0,0,160,'',[
minilines(78,15,0,0,1,0,0,[
mini_line(78,12,3,0,0,0,[
str_block(0,78,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,78,12,3,0,-1,0,0,0,0,0,
"store objects")])
])
])]).
oval('black','',20,100,120,150,2,2,1,106,0,0,0,0,0,'2',0,[
]).
text('black',70,118,1,1,1,46,15,107,12,3,0,0,0,0,2,46,15,0,0,"",0,0,0,0,130,'',[
minilines(46,15,0,0,1,0,0,[
mini_line(46,12,3,0,0,0,[
str_block(0,46,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,46,12,3,0,-1,0,0,0,0,0,
"PDF file")])
])
])]).
poly('black','',2,[
120,120,150,120],0,2,1,114,0,2,0,0,0,0,0,'2',0,0,
"0","",[
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
]).
text('black',400,158,1,1,1,84,15,115,12,3,2,0,0,0,2,84,15,0,0,"",0,0,0,0,170,'',[
minilines(84,15,0,0,1,0,0,[
mini_line(84,12,3,0,0,0,[
str_block(0,84,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,84,12,3,0,-1,0,0,0,0,0,
"page contents")])
])
])]).
text('black',400,258,1,1,1,129,15,119,12,3,2,0,0,0,2,129,15,0,0,"",0,0,0,0,270,'',[
minilines(129,15,0,0,1,0,0,[
mini_line(129,12,3,0,0,0,[
str_block(0,129,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,129,12,3,0,-1,0,0,0,0,0,
"rendering instructions")])
])
])]).

Binary file not shown.

View File

@ -1,223 +0,0 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
</head>
<body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Mon Mar 24 11:49:28 UTC 2014
<!-- hhmts end -->
</div>
<p>
<a href="index.html">[Back to PDFMiner homepage]</a>
<h1>Programming with PDFMiner</h1>
<p>
This page explains how to use PDFMiner as a library
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Performing Layout Analysis</a>
<li> <a href="#tocextract">Obtaining Table of Contents</a>
<li> <a href="#extend">Extending Functionality</a>
</ul>
<h2><a name="overview">Overview</a></h2>
<p>
<strong>PDF is evil.</strong> Although it is called a PDF
"document", it's nothing like a Word or HTML document. PDF is more
like a graphic representation. PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact
position on a display or paper. In most cases, it has no logical
structure such as sentences or paragraphs and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from its
positioning, but there's nothing guaranteed to work. Ugly, I
know. Again, PDF is evil.
<p>
[More technical details about the internal structure of PDF:
"How to Extract Text Contents from PDF Manually"
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>]
<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time and memory consuming. However,
not every part is needed for most PDF processing tasks. Therefore
PDFMiner takes a strategy of lazy parsing, which is to parse the
stuff only when it's necessary. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.
<p>
Figure 1 shows the relationship between the classes in PDFMiner.
<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>
<h2><a name="basic">Basic Usage</a></h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
<span class="comment"># Supply the password for initialization.</span>
document = PDFDocument(parser, password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
</pre></blockquote>
<h2><a name="layout">Performing Layout Analysis</a></h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    <span class="comment"># Receive the LTPage object for the page.</span>
    layout = device.get_result()
</pre></blockquote>
A layout analyzer returns a <code>LTPage</code> object for each page
in the PDF document. This object contains child objects within the page,
forming a tree structure. Figure 2 shows the relationship between
these objects.
<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and their tree structure</small>
</div>
<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
<code>LTCurve</code> and <code>LTLine</code>.
<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.
<code>get_text()</code> method returns the text content.
<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontally
or vertically, depending on the text's writing mode.
<code>get_text()</code> method returns the text content.
<dt> <code>LTChar</code>
<dt> <code>LTAnno</code>
<dd> Represent an actual letter in the text as a Unicode string.
Note that, while a <code>LTChar</code> object has actual boundaries,
<code>LTAnno</code> objects do not, as these are "virtual" characters,
inserted by a layout analyzer according to the relationship between two characters
(e.g. a space).
<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.
<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be
in JPEG or other formats, but currently PDFMiner does not
pay much attention to graphical objects.
<dt> <code>LTLine</code>
<dd> Represents a single straight line.
Could be used for separating text or figures.
<dt> <code>LTRect</code>
<dd> Represents a rectangle.
Could be used for framing other pictures or figures.
<dt> <code>LTCurve</code>
<dd> Represents a generic Bezier curve.
</dl>
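<p>
As a minimal sketch (not part of PDFMiner itself), the tree described above could be traversed with a small recursive generator. It assumes that container objects can be iterated to reach their children and that text objects offer <code>get_text()</code>; the name <code>walk_layout</code> is hypothetical.

```python
def walk_layout(obj, depth=0):
    """Yield (depth, class name, text or None) for obj and its descendants."""
    # Text-bearing objects expose get_text(); others yield None as their text.
    get_text = getattr(obj, "get_text", None)
    text = get_text() if callable(get_text) else None
    yield depth, type(obj).__name__, text

    # Assumption: containers (pages, text boxes, text lines, figures) are
    # iterable over their children, while leaf objects are not.
    try:
        children = iter(obj)
    except TypeError:
        return
    for child in children:
        yield from walk_layout(child, depth + 1)
```

With the layout analysis code above, this could be applied to the <code>LTPage</code> returned by <code>device.get_result()</code>, for example by looping over <code>walk_layout(layout)</code> and printing each class name indented by its depth.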
<p>
Also, check out <a href="http://denis.papathanasiou.org/archive/2010.08.04.post.pdf">a more complete example by Denis Papathanasiou (Extracting Text &amp; Images from PDF Files)</a>.
<h2><a name="tocextract">Obtaining Table of Contents</a></h2>
<p>
PDFMiner provides functions to access the document's table of contents
("Outlines").
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
<span class="comment"># Open a PDF document.</span>
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser, password)
<span class="comment"># Get the outlines of the document.</span>
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title)
</pre></blockquote>
<p>
Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since
PDF does not have a logical structure, and it does not provide a
way to refer to any in-page object from the outside, there's no
way to tell exactly which part of text these destinations are
referring to.
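<p>
A minimal sketch of turning the outline entries into an indented listing. It relies only on the five-element tuple shape used in the loop above; the helper name <code>format_outlines</code>, the two-space indent, and the assumption that the top level is numbered 1 are illustrative, not part of PDFMiner.

```python
def format_outlines(outlines, indent="  "):
    """Return one line per outline entry, indented by its nesting level."""
    lines = []
    for level, title, _dest, _action, _se in outlines:
        # Assumption: the top level is numbered 1. Clamp at zero so an
        # unexpected level never produces a negative repeat count.
        lines.append(indent * max(level - 1, 0) + str(title))
    return lines
```

The result of <code>document.get_outlines()</code> could be passed in directly, and the returned lines joined with newlines for display. Destinations are ignored here because, as noted above, they cannot be mapped reliably to a specific piece of text.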
<h2><a name="extend">Extending Functionality</a></h2>
<p>
You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
in order to process page contents differently or to obtain other information.
<hr noshade>
<address>Yusuke Shinyama</address>
</body>

1
docs/requirements.txt Normal file
View File

@ -0,0 +1 @@
sphinx-argparse

View File

@ -0,0 +1,28 @@
<style>
td {
text-align: center;
}
</style>
<table style="margin: 10px; padding: 10px;">
<tr>
<td style="text-align: right; border-right:1px red solid">&rarr;</td>
<td colspan="4"
style="text-align: left; border-left:1px red solid">&larr; <em><font
color="red">M</font></em></td>
</tr>
<tr>
<td style="border:1px solid"><code>Q u i</code></td>
<td style="border:1px solid"><code>c k</code></td>
<td width="10px"></td>
<td style="border:1px solid"><code>b r o w n</code></td>
</tr>
<tr>
<td colspan="2" style="text-align: right; border-right:1px green solid">
&rarr;
</td>
<td></td>
<td colspan="2"
style="text-align: left; border-left:1px green solid">&larr;
<em><font color="green">W</font></em></td>
</tr>
</table>

View File

@ -0,0 +1,23 @@
<style>
.background-blue {
background-color: lightblue;
border: 2px solid lightblue;
}
</style>
<table style="margin: 10px; padding: 10px;">
<tr>
<td style="border:1px solid; text-align: left">
<code>
Q u i c k &nbsp; b r o w n<br/> f o x
</code>
</td>
<td class="background-blue" colspan="3"></td>
</tr>
<tr style="height: 10px;">
<td class="background-blue" colspan="4"></td>
</tr>
<tr>
<td class="background-blue" colspan="3"></td>
<td style="border:1px solid"><code>j u m p s ...</code></td>
</tr>
</table>

View File

@ -0,0 +1,45 @@
<style>
td {
text-align: center;
}
</style>
<table style="margin: 10px; padding: 10px;">
<tr>
<td></td>
<td></td>
<td align=right style="border-bottom:1px blue solid">&darr;</td>
<td></td>
</tr>
<tr>
<td colspan="2" style="border:1px solid"><code>Q u i c k &nbsp; b r o w
n</code></td>
<td></td>
<td align=right style="border-bottom:1px blue solid">&darr;</td>
</tr>
<tr>
<td></td>
<td></td>
<td align=center valign=center><em><font color="blue">
L<sub>1</sub>
</font></em></td>
<td></td>
</tr>
<tr>
<td style="border:1px solid;">
<code>f o x</code>
</td>
<td>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</td>
<td align=right style="border-top:1px blue solid">&uarr;</td>
<td align=center valign=center><em><font color="blue">
L<sub>2</sub>
</font></em></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td align=right style="border-top:1px blue solid">&uarr;</td>
</tr>
</table>

View File

@ -0,0 +1,25 @@
.. _api_commandline:
Command-line API
****************
.. _api_pdf2txt:
pdf2txt.py
==========
.. argparse::
:module: tools.pdf2txt
:func: maketheparser
:prog: python tools/pdf2txt.py
.. _api_dumppdf:
dumppdf.py
==========
.. argparse::
:module: tools.dumppdf
:func: create_parser
:prog: python tools/dumppdf.py

View File

@ -0,0 +1,20 @@
.. _api_composable:
Composable API
**************
.. _api_laparams:
LAParams
========
.. currentmodule:: pdfminer.layout
.. autoclass:: LAParams
Todo:
=====
- `PDFDevice`
- `TextConverter`
- `PDFPageAggregator`
- `PDFPageInterpreter`

View File

@ -0,0 +1,21 @@
.. _api_highlevel:
High-level functions API
************************
.. _api_extract_text:
extract_text
============
.. currentmodule:: pdfminer.high_level
.. autofunction:: extract_text
.. _api_extract_text_to_fp:
extract_text_to_fp
==================
.. currentmodule:: pdfminer.high_level
.. autofunction:: extract_text_to_fp

View File

@ -0,0 +1,9 @@
API documentation
*****************
.. toctree::
:maxdepth: 2
commandline
highlevel
composable

61
docs/source/conf.py Normal file
View File

@ -0,0 +1,61 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
sys.path.insert(0, os.path.join(os.path.abspath(os.path.dirname(__file__)), '../../'))
# -- Project information -----------------------------------------------------
project = 'pdfminer.six'
copyright = '2019, Yusuke Shinyama, Philippe Guglielmetti & Pieter Marsman'
author = 'Yusuke Shinyama, Philippe Guglielmetti & Pieter Marsman'
# The full version, including alpha/beta/rc tags
release = '20191020'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinxarg.ext',
'sphinx.ext.autodoc',
'sphinx.ext.doctest',
]
# Root rst file
master_doc = 'index'
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

72
docs/source/index.rst Normal file
View File

@ -0,0 +1,72 @@
Welcome to pdfminer.six's documentation!
****************************************
.. image:: https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master
:target: https://travis-ci.org/pdfminer/pdfminer.six
:alt: Travis-ci build badge
.. image:: https://img.shields.io/pypi/v/pdfminer.six.svg
:target: https://pypi.python.org/pypi/pdfminer.six/
:alt: PyPi version badge
.. image:: https://badges.gitter.im/pdfminer-six/Lobby.svg
:target: https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium
:alt: gitter badge
Pdfminer.six is a Python package for extracting information from PDF documents.
Check out the source on `github <https://github.com/pdfminer/pdfminer.six>`_.
Content
=======
.. toctree::
:maxdepth: 2
tutorials/index
topics/index
api/index
Features
========
* Parse all objects from a PDF document into Python objects.
* Analyze and group text in a human-readable way.
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
contents and more.
* Support for (almost all) features from the PDF-1.7 specification.
* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical
writing.
* Support for various font types (Type1, TrueType, Type3, and CID).
* Support for basic encryption (RC4).
Installation instructions
=========================
Before using pdfminer.six, install it with Python 2.7 or newer.
::
$ pip install pdfminer.six
Note that Python 2.7 support will be dropped in January 2020.
Common use-cases
----------------
* :ref:`tutorial_commandline` if you just want to extract text from a pdf once.
* :ref:`tutorial_highlevel` if you want to integrate pdfminer.six with your
Python code.
* :ref:`tutorial_composable` when you want to tailor the behavior of
pdfminer.six to your needs.
Contributing
============
We welcome any contributors to pdfminer.six! But, before doing anything, take
a look at the `contribution guide
<https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md>`_.

View File

@ -0,0 +1,132 @@
.. _topic_pdf_to_text:
Converting a PDF file to text
*****************************
Most PDF files look like they contain well structured text. But the reality is
that a PDF file does not contain anything that resembles paragraphs,
sentences or even words. When it comes to text, a PDF file is only aware of
the characters and their placement.
This makes extracting meaningful pieces of text from PDF files difficult.
The characters that compose a paragraph are no different from those that
compose the table, the page footer or the description of a figure. Unlike
other document formats, like a `.txt` file or a Word document, the PDF format
does not contain a stream of text.
A PDF document consists of a collection of objects that together describe
the appearance of one or more pages, possibly accompanied by additional
interactive elements and higher-level application data. A PDF file contains
the objects making up a PDF document along with associated structural
information, all represented as a single self-contained sequence of bytes. [1]_
Layout analysis algorithm
=========================
PDFMiner attempts to reconstruct some of those structures by using heuristics
on the positioning of characters. This works well for sentences and
paragraphs because meaningful groups of nearby characters can be made.
The layout analysis consists of three different stages: it groups characters
into words and lines, then it groups lines into boxes and finally it groups
textboxes hierarchically. These stages are discussed in the following
sections. The resulting output of the layout analysis is an ordered hierarchy
of layout objects on a PDF page.
.. figure:: ../_static/layout_analysis_output.png
:align: center
The output of the layout analysis is a hierarchy of layout objects.
The output of the layout analysis heavily depends on a couple of parameters.
All these parameters are part of the :ref:`api_laparams` class.
Grouping characters into words and lines
----------------------------------------
The first step in going from characters to text is to group characters in a
meaningful way. Each character has an x-coordinate and a y-coordinate for its
bottom-left corner and upper-right corner, i.e. its bounding box. pdfminer.six
uses these bounding boxes to decide which characters belong together.
Characters that are both horizontally and vertically close are grouped. How
close they should be is determined by the `char_margin` (M in the figure) and
the `line_overlap` (not in the figure) parameters. The horizontal *distance*
between the bounding boxes of two characters should be smaller than the
`char_margin` and the vertical *overlap* between the bounding boxes should be
larger than the `line_overlap`.
.. raw:: html
:file: ../_static/layout_analysis.html
The values of `char_margin` and `line_overlap` are relative to the size of
the bounding boxes of the characters. The `char_margin` is relative to the
maximum width of either one of the bounding boxes, and the `line_overlap` is
relative to the minimum height of either one of the bounding boxes.
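The grouping test for two characters can be sketched in a few lines. This is a minimal illustration, not the code pdfminer.six runs: the `Box` tuple, the helper names and the default values are assumptions made for the example. It follows the `LAParams` description, in which two characters with more vertical overlap than `line_overlap` are considered to be on the same line.

```python
from collections import namedtuple

# Hypothetical stand-in for a character's bounding box (not a pdfminer.six class).
Box = namedtuple("Box", "x0 y0 x1 y1")


def width(box):
    return box.x1 - box.x0


def height(box):
    return box.y1 - box.y0


def horizontal_distance(a, b):
    """Gap between two boxes along the x-axis; 0 if they overlap horizontally."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))


def vertical_overlap(a, b):
    """Length of the shared vertical extent; 0 if the boxes do not overlap."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def belong_together(a, b, char_margin=2.0, line_overlap=0.5):
    # char_margin is relative to the larger width of the two boxes.
    max_gap = char_margin * max(width(a), width(b))
    # line_overlap is relative to the smaller height of the two boxes.
    min_overlap = line_overlap * min(height(a), height(b))
    return (horizontal_distance(a, b) < max_gap
            and vertical_overlap(a, b) > min_overlap)
```

Because both thresholds scale with the size of the characters, the same parameter values behave similarly for small and large fonts.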
Spaces need to be inserted between characters because the PDF format has no
notion of the space character. A space is inserted if the characters are
further apart than the `word_margin` (W in the figure). The `word_margin` is
relative to the maximum width or height of the new character. Having a smaller
`word_margin` creates smaller words and inserts spaces between characters
more often. Note that the `word_margin` should be smaller than the
`char_margin`, otherwise none of the characters will be separated by a space.
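The space-insertion rule can be sketched as follows. This is an illustrative approximation under stated assumptions: the tuple layout `(text, x0, x1, height)`, the function name and the default margin are made up for the example and are not part of the pdfminer.six API.

```python
def join_with_spaces(chars, word_margin=0.1):
    """Join the characters of one line, inserting spaces at large gaps.

    chars: list of (text, x0, x1, height) tuples sorted from left to right.
    """
    pieces = []
    previous_x1 = None
    for text, x0, x1, height in chars:
        # The margin is relative to the larger dimension of the new character.
        size = max(x1 - x0, height)
        if previous_x1 is not None and x0 - previous_x1 > word_margin * size:
            pieces.append(" ")
        pieces.append(text)
        previous_x1 = x1
    return "".join(pieces)


line = [("H", 0, 6, 10), ("i", 6.2, 8.2, 10), ("y", 12, 18, 10), ("o", 18.3, 24.3, 10)]
print(join_with_spaces(line))  # prints "Hi yo"
```

Only the gap between "i" and "y" exceeds the absolute margin of 1.0, so a single space is inserted.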
The result of this stage is a list of lines. Each line consists of a list of
characters. These characters are either original `LTChar` characters that
originate from the PDF file, or inserted `LTAnno` characters that
represent spaces between words or newlines at the end of each line.
Grouping lines into boxes
-------------------------
The second step is grouping lines in a meaningful way. Each line has a
bounding box that is determined by the bounding boxes of the characters that
it contains. Like grouping characters, pdfminer.six uses the bounding boxes
to group the lines.
Lines that are both horizontally overlapping and vertically close are grouped.
How vertically close the lines should be is determined by the `line_margin`.
This margin is specified relative to the height of the bounding box. Lines
are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
(see L :sub:`2` in the figure) of the bounding boxes are closer together
than the absolute line margin, i.e. the `line_margin` multiplied by the
height of the bounding box.
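One possible reading of this rule is to grow the bounding box of a line vertically by the absolute line margin and check whether the other line intersects the grown box. The sketch below follows that reading. The function name, the plain-tuple boxes, the default margin and the choice to take the height from the first line are all assumptions for illustration, not the pdfminer.six implementation.

```python
def lines_are_close(a, b, line_margin=0.5):
    """Decide whether two lines could belong to the same text box.

    a, b: (x0, y0, x1, y1) bounding boxes of two lines.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Absolute margin: the relative line_margin times the height of line a.
    margin = line_margin * (ay1 - ay0)
    overlaps_horizontally = bx0 < ax1 and ax0 < bx1
    # b must intersect a's box after it is grown by the margin above and below.
    close_vertically = by0 < ay1 + margin and ay0 - margin < by1
    return overlaps_horizontally and close_vertically
```

With a line height of 12 and a `line_margin` of 0.5, a neighbouring line qualifies when it starts within 6 units above or below the first line and shares some horizontal extent with it.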
.. raw:: html
:file: ../_static/layout_analysis_group_lines.html
The result of this stage is a list of text boxes. Each box consists of a list
of lines.
Grouping textboxes hierarchically
---------------------------------
The last step is to group the text boxes in a meaningful way. This step
repeatedly merges the two text boxes that are closest to each other.
The closeness of bounding boxes is computed as the area that is between the
two text boxes (the blue area in the figure). In other words, it is the area of
the bounding box that surrounds both text boxes, minus the area of the bounding
boxes of the individual text boxes.
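The closeness measure described above is small enough to write out. The following is a sketch under the assumption that boxes are plain `(x0, y0, x1, y1)` tuples; the function names are chosen for the example and are not pdfminer.six API.

```python
from itertools import combinations


def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def box_distance(a, b):
    """Area of the box enclosing a and b, minus the areas of a and b."""
    enclosing = (min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3]))
    return area(enclosing) - area(a) - area(b)


def closest_pair(boxes):
    """Return the two boxes that would be merged first."""
    return min(combinations(boxes, 2), key=lambda pair: box_distance(*pair))
```

Repeating `closest_pair`, replacing the two boxes by their enclosing box each time, gives the kind of hierarchy this step produces. A naive loop like this compares every pair on every round, so it is only meant to show the idea.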
.. raw:: html
:file: ../_static/layout_analysis_group_boxes.html
Working with rotated characters
===============================
The algorithm described above assumes that all characters have the same
orientation. However, any writing direction is possible in a PDF. To
accommodate this, pdfminer.six can detect vertical writing with the
`detect_vertical` parameter. This will apply all the grouping steps as if the
PDF was rotated 90 (or 270) degrees.
References
==========
.. [1] Adobe Systems Inc. (2007). *PDF reference: Adobe portable document
format, version 1.7.*

View File

@ -0,0 +1,7 @@
Using pdfminer.six
******************
.. toctree::
:maxdepth: 2
converting_pdf_to_text

View File

@ -0,0 +1,41 @@
.. _tutorial_commandline:
Get started with command-line tools
***********************************
pdfminer.six has several tools that can be used from the command line. The
command-line tools are aimed at users that occasionally want to extract text
from a pdf.
Take a look at the high-level or composable interface if you want to use
pdfminer.six programmatically.
Examples
========
pdf2txt.py
----------
::
$ python tools/pdf2txt.py example.pdf
all the text from the pdf appears on the command line
The :ref:`api_pdf2txt` tool extracts all the text from a PDF. It uses layout
analysis with sensible defaults to order and group the text in a sensible way.
dumppdf.py
----------
::
$ python tools/dumppdf.py -a example.pdf
<pdf><object id="1">
...
</object>
...
</pdf>
The :ref:`api_dumppdf` tool can be used to extract the internal structure from a
PDF. This tool is primarily for debugging purposes, but it can be useful to
anybody working with PDFs.

View File

@ -0,0 +1,33 @@
.. _tutorial_composable:
Get started using the composable components API
***********************************************
The command line tools and the high-level API are just shortcuts for often
used combinations of pdfminer.six components. You can use these components to
modify pdfminer.six to your own needs.
For example, to extract the text from a PDF file and save it in a python
variable::

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open('samples/simple1.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    print(output_string.getvalue())

View File

@ -0,0 +1,67 @@
.. testsetup::
import sys
from pdfminer.high_level import extract_text_to_fp, extract_text
.. _tutorial_highlevel:
Get started using the high-level functions
******************************************
The high-level API can be used to do common tasks.
The simplest way to extract text from a PDF is to use
:ref:`api_extract_text`:
.. doctest::
>>> text = extract_text('samples/simple1.pdf')
>>> print(repr(text))
'Hello \n\nWorld\n\nWorld\n\nHello \n\nH e l l o \n\nH e l l o \n\nW o r l d\n\nW o r l d\n\n\x0c'
>>> print(text)
... # doctest: +NORMALIZE_WHITESPACE
Hello
<BLANKLINE>
World
<BLANKLINE>
World
<BLANKLINE>
Hello
<BLANKLINE>
H e l l o
<BLANKLINE>
H e l l o
<BLANKLINE>
W o r l d
<BLANKLINE>
W o r l d
<BLANKLINE>
To read text from a PDF and print it on the command line:
.. doctest::
>>> if sys.version_info > (3, 0):
...     from io import StringIO
... else:
...     from io import BytesIO as StringIO
>>> output_string = StringIO()
>>> with open('samples/simple1.pdf', 'rb') as fin:
...     extract_text_to_fp(fin, output_string)
>>> print(output_string.getvalue().strip())
Hello WorldHello WorldHello WorldHello World
Or to convert it to html and use layout analysis:
.. doctest::
>>> if sys.version_info > (3, 0):
...     from io import StringIO
... else:
...     from io import BytesIO as StringIO
>>> from pdfminer.layout import LAParams
>>> output_string = StringIO()
>>> with open('samples/simple1.pdf', 'rb') as fin:
...     extract_text_to_fp(fin, output_string, laparams=LAParams(),
...                        output_type='html', codec=None)

View File

@ -0,0 +1,9 @@
Getting started
***************
.. toctree::
:maxdepth: 2
commandline
highlevel
composable

View File

@ -1,4 +0,0 @@
blockquote { background: #eeeeee; }
h1 { border-bottom: solid black 2px; }
h2 { border-bottom: solid black 1px; }
.comment { color: darkgreen; }

View File

@ -13,7 +13,7 @@ other purposes instead of text analysis.
import sys import sys
import warnings import warnings
__version__ = '20191020' __version__ = '20191107'
if sys.version_info < (3, 0): if sys.version_info < (3, 0):

View File

@ -2,6 +2,7 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
import logging import logging
import re import re
import sys
from .pdfdevice import PDFTextDevice from .pdfdevice import PDFTextDevice
from .pdffont import PDFUnicodeNotDefined from .pdffont import PDFUnicodeNotDefined
from .layout import LTContainer from .layout import LTContainer
@ -271,6 +272,8 @@ class HTMLConverter(PDFConverter):
def write(self, text): def write(self, text):
if self.codec: if self.codec:
text = text.encode(self.codec) text = text.encode(self.codec)
if sys.version_info < (3, 0):
text = str(text)
self.outfp.write(text) self.outfp.write(text)
return return

View File

@ -1,56 +1,71 @@
# -*- coding: utf-8 -*- """Functions that can be used for the most common use-cases for pdfminer.six"""
"""
Functions that encapsulate "usual" use-cases for pdfminer, for use making import logging
bundled scripts and for using pdfminer as a module for routine tasks.
"""
import six
import sys import sys
from .pdfdocument import PDFDocument import six
from .pdfparser import PDFParser
# Conditional import because python 2 is stupid
if sys.version_info > (3, 0):
from io import StringIO
else:
from io import BytesIO as StringIO
from .pdfinterp import PDFResourceManager, PDFPageInterpreter from .pdfinterp import PDFResourceManager, PDFPageInterpreter
from .pdfdevice import PDFDevice, TagExtractor from .pdfdevice import TagExtractor
from .pdfpage import PDFPage from .pdfpage import PDFPage
from .converter import XMLConverter, HTMLConverter, TextConverter from .converter import XMLConverter, HTMLConverter, TextConverter
from .cmapdb import CMapDB
from .image import ImageWriter from .image import ImageWriter
from .layout import LAParams
def extract_text_to_fp(inf, outfp, def extract_text_to_fp(inf, outfp,
_py2_no_more_posargs=None, # Bloody Python2 needs a shim
output_type='text', codec='utf-8', laparams = None, output_type='text', codec='utf-8', laparams = None,
maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0, maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0,
layoutmode='normal', output_dir=None, strip_control=False, layoutmode='normal', output_dir=None, strip_control=False,
debug=False, disable_caching=False, **other): debug=False, disable_caching=False, **kwargs):
""" """
Parses text from inf-file and writes to outfp file-like object. Parses text from inf-file and writes to outfp file-like object.
Takes loads of optional arguments but the defaults are somewhat sane. Takes loads of optional arguments but the defaults are somewhat sane.
Beware laparams: Including an empty LAParams is not the same as passing None! Beware laparams: Including an empty LAParams is not the same as passing None!
Returns nothing, acting as it does on two streams. Use StringIO to get strings. Returns nothing, acting as it does on two streams. Use StringIO to get strings.
output_type: May be 'text', 'xml', 'html', 'tag'. Only 'text' works properly. :param inf: a file-like object to read PDF structure from, such as a
codec: Text decoding codec file handler (using the builtin `open()` function) or a `BytesIO`.
laparams: An LAParams object from pdfminer.layout. :param outfp: a file-like object to write the text to.
Default is None but may not layout correctly. :param output_type: May be 'text', 'xml', 'html', 'tag'. Only 'text' works properly.
maxpages: How many pages to stop parsing after :param codec: Text decoding codec
page_numbers: zero-indexed page numbers to operate on. :param laparams: An LAParams object from pdfminer.layout. Default is None but may not layout correctly.
password: For encrypted PDFs, the password to decrypt. :param maxpages: How many pages to stop parsing after
scale: Scale factor :param page_numbers: zero-indexed page numbers to operate on.
rotation: Rotation factor :param password: For encrypted PDFs, the password to decrypt.
layoutmode: Default is 'normal', see pdfminer.converter.HTMLConverter :param scale: Scale factor
output_dir: If given, creates an ImageWriter for extracted images. :param rotation: Rotation factor
strip_control: Does what it says on the tin :param layoutmode: Default is 'normal', see pdfminer.converter.HTMLConverter
debug: Output more logging data :param output_dir: If given, creates an ImageWriter for extracted images.
disable_caching: Does what it says on the tin :param strip_control: Does what it says on the tin
:param debug: Output more logging data
:param disable_caching: Does what it says on the tin
:param other:
:return:
""" """
if '_py2_no_more_posargs' in kwargs:
raise DeprecationWarning(
'The `_py2_no_more_posargs` will be removed on January, 2020. At '
'that moment pdfminer.six will stop supporting Python 2. Please '
'upgrade to Python 3. For more information see '
'https://github.com/pdfminer/pdfminer.six/issues/194')
if debug:
logging.getLogger().setLevel(logging.DEBUG)
if six.PY2 and sys.stdin.encoding: if six.PY2 and sys.stdin.encoding:
password = password.decode(sys.stdin.encoding) password = password.decode(sys.stdin.encoding)
imagewriter = None imagewriter = None
if output_dir: if output_dir:
imagewriter = ImageWriter(output_dir) imagewriter = ImageWriter(output_dir)
rsrcmgr = PDFResourceManager(caching=not disable_caching) rsrcmgr = PDFResourceManager(caching=not disable_caching)
if output_type == 'text': if output_type == 'text':
@ -79,6 +94,44 @@ def extract_text_to_fp(inf, outfp,
caching=not disable_caching, caching=not disable_caching,
check_extractable=True): check_extractable=True):
page.rotate = (page.rotate + rotation) % 360 page.rotate = (page.rotate + rotation) % 360
interpreter.process_page(page) interpreter.process_page(page)
device.close() device.close()
def extract_text(pdf_file, password='', page_numbers=None, maxpages=0,
caching=True, codec='utf-8', laparams=None):
"""
Parses and returns the text contained in a PDF file.
Takes loads of optional arguments but the defaults are somewhat sane.
Returns a string containing all of the text extracted.
:param pdf_file: Path to the PDF file to be worked on
:param password: For encrypted PDFs, the password to decrypt.
:param page_numbers: List of zero-indexed page numbers to extract.
:param maxpages: The maximum number of pages to parse
:param caching: If resources should be cached
:param codec: Text decoding codec
:param laparams: LAParams object from pdfminer.layout.
"""
if laparams is None:
laparams = LAParams()
with open(pdf_file, "rb") as fp, StringIO() as output_string:
rsrcmgr = PDFResourceManager()
device = TextConverter(rsrcmgr, output_string, codec=codec,
laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(
fp,
page_numbers,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True,
):
interpreter.process_page(page)
return output_string.getvalue()

View File

@ -1,12 +1,14 @@
import struct
import os import os
import os.path import os.path
import struct
from io import BytesIO from io import BytesIO
from .pdftypes import LITERALS_DCT_DECODE
from .jbig2 import JBIG2StreamReader, JBIG2StreamWriter
from .pdfcolor import LITERAL_DEVICE_CMYK
from .pdfcolor import LITERAL_DEVICE_GRAY from .pdfcolor import LITERAL_DEVICE_GRAY
from .pdfcolor import LITERAL_DEVICE_RGB from .pdfcolor import LITERAL_DEVICE_RGB
from .pdfcolor import LITERAL_DEVICE_CMYK from .pdftypes import LITERALS_DCT_DECODE, LITERALS_JBIG2_DECODE
def align32(x): def align32(x):
@ -57,9 +59,11 @@ class BMPWriter(object):
return return
## ImageWriter
##
class ImageWriter(object): class ImageWriter(object):
"""Write image to a file
Supports various image types: JPEG, JBIG2 and bitmaps
"""
def __init__(self, outdir): def __init__(self, outdir):
self.outdir = outdir self.outdir = outdir
@ -68,21 +72,15 @@ class ImageWriter(object):
return return
def export_image(self, image): def export_image(self, image):
stream = image.stream
filters = stream.get_filters()
(width, height) = image.srcsize (width, height) = image.srcsize
if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
ext = '.jpg' is_jbig2 = self.is_jbig2_image(image)
elif (image.bits == 1 or ext = self._get_image_extension(image, width, height, is_jbig2)
image.bits == 8 and (LITERAL_DEVICE_RGB in image.colorspace or LITERAL_DEVICE_GRAY in image.colorspace)): name, path = self._create_unique_image_name(self.outdir, image.name, ext)
ext = '.%dx%d.bmp' % (width, height)
else: fp = open(path, 'wb')
ext = '.%d.%dx%d.img' % (image.bits, width, height)
name = image.name+ext
path = os.path.join(self.outdir, name)
fp=open(path, 'wb')
if ext == '.jpg': if ext == '.jpg':
raw_data = stream.get_rawdata() raw_data = image.stream.get_rawdata()
if LITERAL_DEVICE_CMYK in image.colorspace: if LITERAL_DEVICE_CMYK in image.colorspace:
from PIL import Image from PIL import Image
from PIL import ImageChops from PIL import ImageChops
@ -93,9 +91,18 @@ class ImageWriter(object):
i.save(fp, 'JPEG') i.save(fp, 'JPEG')
else: else:
fp.write(raw_data) fp.write(raw_data)
elif is_jbig2:
input_stream = BytesIO()
input_stream.write(image.stream.get_data())
input_stream.seek(0)
reader = JBIG2StreamReader(input_stream)
segments = reader.get_segments()
writer = JBIG2StreamWriter(fp)
writer.write_file(segments)
elif image.bits == 1: elif image.bits == 1:
bmp = BMPWriter(fp, 1, width, height) bmp = BMPWriter(fp, 1, width, height)
data = stream.get_data() data = image.stream.get_data()
i = 0 i = 0
width = (width+7)//8 width = (width+7)//8
for y in range(height): for y in range(height):
@ -103,7 +110,7 @@ class ImageWriter(object):
i += width i += width
elif image.bits == 8 and LITERAL_DEVICE_RGB in image.colorspace: elif image.bits == 8 and LITERAL_DEVICE_RGB in image.colorspace:
bmp = BMPWriter(fp, 24, width, height) bmp = BMPWriter(fp, 24, width, height)
data = stream.get_data() data = image.stream.get_data()
i = 0 i = 0
width = width*3 width = width*3
for y in range(height): for y in range(height):
@ -111,12 +118,47 @@ class ImageWriter(object):
i += width i += width
elif image.bits == 8 and LITERAL_DEVICE_GRAY in image.colorspace: elif image.bits == 8 and LITERAL_DEVICE_GRAY in image.colorspace:
bmp = BMPWriter(fp, 8, width, height) bmp = BMPWriter(fp, 8, width, height)
data = stream.get_data() data = image.stream.get_data()
i = 0 i = 0
for y in range(height): for y in range(height):
bmp.write_line(y, data[i:i+width]) bmp.write_line(y, data[i:i+width])
i += width i += width
else: else:
fp.write(stream.get_data()) fp.write(image.stream.get_data())
fp.close() fp.close()
return name return name
@staticmethod
def is_jbig2_image(image):
filters = image.stream.get_filters()
is_jbig2 = False
for filter_name, params in filters:
if filter_name in LITERALS_JBIG2_DECODE:
is_jbig2 = True
break
return is_jbig2
@staticmethod
def _get_image_extension(image, width, height, is_jbig2):
filters = image.stream.get_filters()
if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
ext = '.jpg'
elif is_jbig2:
ext = '.jb2'
elif (image.bits == 1 or
image.bits == 8 and (LITERAL_DEVICE_RGB in image.colorspace or LITERAL_DEVICE_GRAY in image.colorspace)):
ext = '.%dx%d.bmp' % (width, height)
else:
ext = '.%d.%dx%d.img' % (image.bits, width, height)
return ext
@staticmethod
def _create_unique_image_name(dirname, image_name, ext):
name = image_name + ext
path = os.path.join(dirname, name)
img_index = 0
while os.path.exists(path):
name = '%s.%d%s' % (image_name, img_index, ext)
path = os.path.join(dirname, name)
img_index += 1
return name, path

321
pdfminer/jbig2.py Normal file
View File

@ -0,0 +1,321 @@
import math
import os
from struct import pack, unpack, calcsize
# segment structure base
SEG_STRUCT = [
(">L", "number"),
(">B", "flags"),
(">B", "retention_flags"),
(">B", "page_assoc"),
(">L", "data_length"),
]
# segment header literals
HEADER_FLAG_DEFERRED = 0b10000000
HEADER_FLAG_PAGE_ASSOC_LONG = 0b01000000
SEG_TYPE_MASK = 0b00111111
REF_COUNT_SHORT_MASK = 0b11100000
REF_COUNT_LONG_MASK = 0x1fffffff
REF_COUNT_LONG = 7
DATA_LEN_UNKNOWN = 0xffffffff
# segment types
SEG_TYPE_IMMEDIATE_GEN_REGION = 38
SEG_TYPE_END_OF_PAGE = 49
SEG_TYPE_END_OF_FILE = 50
# file literals
FILE_HEADER_ID = b'\x97\x4A\x42\x32\x0D\x0A\x1A\x0A'
FILE_HEAD_FLAG_SEQUENTIAL = 0b00000001
FILE_HEAD_FLAG_PAGES_UNKNOWN = 0b00000010
def bit_set(bit_pos, value):
return bool((value >> bit_pos) & 1)
def check_flag(flag, value):
return bool(flag & value)
def masked_value(mask, value):
for bit_pos in range(0, 31):
if bit_set(bit_pos, mask):
return (value & mask) >> bit_pos
raise Exception("Invalid mask or value")
def mask_value(mask, value):
for bit_pos in range(0, 31):
if bit_set(bit_pos, mask):
return (value & (mask >> bit_pos)) << bit_pos
raise Exception("Invalid mask or value")
class JBIG2StreamReader(object):
"""Read segments from a JBIG2 byte stream"""
def __init__(self, stream):
self.stream = stream
def get_segments(self):
segments = []
while not self.is_eof():
segment = {}
for field_format, name in SEG_STRUCT:
field_len = calcsize(field_format)
field = self.stream.read(field_len)
if len(field) < field_len:
segment["_error"] = True
break
value = unpack(field_format, field)
if len(value) == 1:
[value] = value
parser = getattr(self, "parse_%s" % name, None)
if callable(parser):
value = parser(segment, value, field)
segment[name] = value
if not segment.get("_error"):
segments.append(segment)
return segments
def is_eof(self):
if self.stream.read(1) == b'':
return True
else:
self.stream.seek(-1, os.SEEK_CUR)
return False
def parse_flags(self, segment, flags, field):
return {
"deferred": check_flag(HEADER_FLAG_DEFERRED, flags),
"page_assoc_long": check_flag(HEADER_FLAG_PAGE_ASSOC_LONG, flags),
"type": masked_value(SEG_TYPE_MASK, flags)
}
def parse_retention_flags(self, segment, flags, field):
ref_count = masked_value(REF_COUNT_SHORT_MASK, flags)
retain_segments = []
ref_segments = []
if ref_count < REF_COUNT_LONG:
for bit_pos in range(5):
retain_segments.append(bit_set(bit_pos, flags))
else:
field += self.stream.read(3)
[ref_count] = unpack(">L", field)
ref_count = masked_value(REF_COUNT_LONG_MASK, ref_count)
ret_bytes_count = int(math.ceil((ref_count + 1) / 8))
for ret_byte_index in range(ret_bytes_count):
[ret_byte] = unpack(">B", self.stream.read(1))
for bit_pos in range(7):
retain_segments.append(bit_set(bit_pos, ret_byte))
seg_num = segment["number"]
if seg_num <= 256:
ref_format = ">B"
elif seg_num <= 65536:
ref_format = ">I"
else:
ref_format = ">L"
ref_size = calcsize(ref_format)
for ref_index in range(ref_count):
ref = self.stream.read(ref_size)
[ref] = unpack(ref_format, ref)
ref_segments.append(ref)
return {
"ref_count": ref_count,
"retain_segments": retain_segments,
"ref_segments": ref_segments,
}
def parse_page_assoc(self, segment, page, field):
if segment["flags"]["page_assoc_long"]:
field += self.stream.read(3)
[page] = unpack(">L", field)
return page
def parse_data_length(self, segment, length, field):
if length:
if (segment["flags"]["type"] == SEG_TYPE_IMMEDIATE_GEN_REGION) \
and (length == DATA_LEN_UNKNOWN):
raise NotImplementedError(
"Working with unknown segment length "
"is not implemented yet"
)
else:
segment["raw_data"] = self.stream.read(length)
return length
class JBIG2StreamWriter(object):
"""Write JBIG2 segments to a file in JBIG2 format"""
def __init__(self, stream):
self.stream = stream
def write_segments(self, segments, fix_last_page=True):
data_len = 0
current_page = None
seg_num = None
for segment in segments:
data = self.encode_segment(segment)
self.stream.write(data)
data_len += len(data)
seg_num = segment["number"]
if fix_last_page:
seg_page = segment.get("page_assoc")
if segment["flags"]["type"] == SEG_TYPE_END_OF_PAGE:
current_page = None
elif seg_page:
current_page = seg_page
if fix_last_page and current_page and (seg_num is not None):
segment = self.get_eop_segment(seg_num + 1, current_page)
data = self.encode_segment(segment)
self.stream.write(data)
data_len += len(data)
return data_len
def write_file(self, segments, fix_last_page=True):
header = FILE_HEADER_ID
header_flags = FILE_HEAD_FLAG_SEQUENTIAL | FILE_HEAD_FLAG_PAGES_UNKNOWN
header += pack(">B", header_flags)
self.stream.write(header)
data_len = len(header)
data_len += self.write_segments(segments, fix_last_page)
seg_num = 0
for segment in segments:
seg_num = segment["number"]
eof_segment = self.get_eof_segment(seg_num + 1)
data = self.encode_segment(eof_segment)
self.stream.write(data)
data_len += len(data)
return data_len
def encode_segment(self, segment):
data = b''
for field_format, name in SEG_STRUCT:
value = segment.get(name)
encoder = getattr(self, "encode_%s" % name, None)
if callable(encoder):
field = encoder(value, segment)
else:
field = pack(field_format, value)
data += field
return data
def encode_flags(self, value, segment):
flags = 0
if value.get("deferred"):
flags |= HEADER_FLAG_DEFERRED
if "page_assoc_long" in value:
flags |= HEADER_FLAG_PAGE_ASSOC_LONG \
if value["page_assoc_long"] else flags
else:
flags |= HEADER_FLAG_PAGE_ASSOC_LONG \
if segment.get("page", 0) > 255 else flags
flags |= mask_value(SEG_TYPE_MASK, value["type"])
return pack(">B", flags)
def encode_retention_flags(self, value, segment):
flags = []
flags_format = ">B"
ref_count = value["ref_count"]
retain_segments = value.get("retain_segments", [])
if ref_count <= 4:
flags_byte = mask_value(REF_COUNT_SHORT_MASK, ref_count)
for ref_index, ref_retain in enumerate(retain_segments):
flags_byte |= 1 << ref_index
flags.append(flags_byte)
else:
bytes_count = math.ceil((ref_count + 1) / 8)
flags_format = ">L" + ("B" * bytes_count)
flags_dword = mask_value(
REF_COUNT_SHORT_MASK,
REF_COUNT_LONG
) << 24
flags.append(flags_dword)
for byte_index in range(bytes_count):
ret_byte = 0
ret_part = retain_segments[byte_index * 8:byte_index * 8 + 8]
for bit_pos, ret_seg in enumerate(ret_part):
ret_byte |= 1 << bit_pos if ret_seg else ret_byte
flags.append(ret_byte)
ref_segments = value.get("ref_segments", [])
seg_num = segment["number"]
if seg_num <= 256:
ref_format = "B"
elif seg_num <= 65536:
ref_format = "I"
else:
ref_format = "L"
for ref in ref_segments:
flags_format += ref_format
flags.append(ref)
return pack(flags_format, *flags)
def encode_data_length(self, value, segment):
data = pack(">L", value)
data += segment["raw_data"]
return data
def get_eop_segment(self, seg_number, page_number):
return {
'data_length': 0,
'flags': {'deferred': False, 'type': SEG_TYPE_END_OF_PAGE},
'number': seg_number,
'page_assoc': page_number,
'raw_data': b'',
'retention_flags': {
'ref_count': 0,
'ref_segments': [],
'retain_segments': []
}
}
def get_eof_segment(self, seg_number):
return {
'data_length': 0,
'flags': {'deferred': False, 'type': SEG_TYPE_END_OF_FILE},
'number': seg_number,
'page_assoc': 0,
'raw_data': b'',
'retention_flags': {
'ref_count': 0,
'ref_segments': [],
'retain_segments': []
}
}
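
The short form handled by `encode_retention_flags` above (at most four referred-to segments) can be sketched in isolation. This is a minimal illustration, not the module's code: the value of `REF_COUNT_SHORT_MASK`, the body of `mask_value`, and the helper name `encode_short_retention_flags` are assumptions made for the example, as is setting a retain bit only for truthy entries.

```python
from struct import pack

# Assumed value: the top three bits of the flags byte carry the count.
REF_COUNT_SHORT_MASK = 0b11100000


def mask_value(mask, value):
    """Shift value so it lines up with the lowest set bit of mask."""
    shift = (mask & -mask).bit_length() - 1
    return (value << shift) & mask


def encode_short_retention_flags(ref_count, retain_segments):
    """Pack ref_count (at most 4) and the retain bits into one byte."""
    flags_byte = mask_value(REF_COUNT_SHORT_MASK, ref_count)
    for index, retain in enumerate(retain_segments):
        if retain:
            # One low-order bit per entry, bit 0 first.
            flags_byte |= 1 << index
    return pack(">B", flags_byte)


encoded = encode_short_retention_flags(2, [True, False])
```

With a count of 2 and only the first retain bit set, the byte is `0b01000001`.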

@ -1,18 +1,15 @@
from sortedcontainers import SortedListWithKey import heapq
from .utils import INF from .utils import INF
from .utils import Plane from .utils import Plane
from .utils import get_bound
from .utils import uniq
from .utils import fsplit
from .utils import bbox2str
from .utils import matrix2str
from .utils import apply_matrix_pt from .utils import apply_matrix_pt
from .utils import bbox2str
from .utils import fsplit
from .utils import get_bound
from .utils import matrix2str
from .utils import uniq
import six # Python 2+3 compatibility
## IndexAssigner
##
class IndexAssigner(object): class IndexAssigner(object):
def __init__(self, index=0): def __init__(self, index=0):
@ -29,9 +26,33 @@ class IndexAssigner(object):
return return
## LAParams
##
class LAParams(object): class LAParams(object):
"""Parameters for layout analysis
:param line_overlap: If two characters have more overlap than this they
are considered to be on the same line. The overlap is specified
relative to the minimum height of both characters.
:param char_margin: If two characters are closer together than this
margin they are considered to be part of the same word. If
characters are on the same line but not part of the same word, an
intermediate space is inserted. The margin is specified relative to
the width of the character.
:param word_margin: If two words are closer together than this
margin they are considered to be part of the same line. A space is
added in between for readability. The margin is specified relative
to the width of the word.
:param line_margin: If two lines are close together they are
considered to be part of the same paragraph. The margin is
specified relative to the height of a line.
:param boxes_flow: Specifies how much a horizontal and vertical position
of a text matters when determining the order of lines. The value
should be within the range of -1.0 (only horizontal position
matters) to +1.0 (only vertical position matters).
:param detect_vertical: If vertical text should be considered during
layout analysis
:param all_texts: If layout analysis should be performed on text in
figures.
"""
def __init__(self, def __init__(self,
line_overlap=0.5, line_overlap=0.5,
@ -55,30 +76,28 @@ class LAParams(object):
(self.char_margin, self.line_margin, self.word_margin, self.all_texts)) (self.char_margin, self.line_margin, self.word_margin, self.all_texts))
## LTItem
##
class LTItem(object): class LTItem(object):
"""Interface for things that can be analyzed"""
def analyze(self, laparams): def analyze(self, laparams):
"""Perform the layout analysis.""" """Perform the layout analysis."""
return return
## LTText
##
class LTText(object): class LTText(object):
"""Interface for things that have text"""
def __repr__(self): def __repr__(self):
return ('<%s %r>' % return ('<%s %r>' %
(self.__class__.__name__, self.get_text())) (self.__class__.__name__, self.get_text()))
def get_text(self): def get_text(self):
"""Text contained in this object"""
raise NotImplementedError raise NotImplementedError
## LTComponent
##
class LTComponent(LTItem): class LTComponent(LTItem):
"""Object with a bounding box"""
def __init__(self, bbox): def __init__(self, bbox):
LTItem.__init__(self) LTItem.__init__(self)
@ -92,10 +111,13 @@ class LTComponent(LTItem):
# Disable comparison. # Disable comparison.
def __lt__(self, _): def __lt__(self, _):
raise ValueError raise ValueError
def __le__(self, _): def __le__(self, _):
raise ValueError raise ValueError
def __gt__(self, _): def __gt__(self, _):
raise ValueError raise ValueError
def __ge__(self, _): def __ge__(self, _):
raise ValueError raise ValueError
@ -150,9 +172,8 @@ class LTComponent(LTItem):
return 0 return 0
## LTCurve
##
class LTCurve(LTComponent): class LTCurve(LTComponent):
"""A generic Bezier curve"""
def __init__(self, linewidth, pts, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None): def __init__(self, linewidth, pts, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
LTComponent.__init__(self, get_bound(pts)) LTComponent.__init__(self, get_bound(pts))
@ -169,18 +190,22 @@ class LTCurve(LTComponent):
return ','.join('%.3f,%.3f' % p for p in self.pts) return ','.join('%.3f,%.3f' % p for p in self.pts)
## LTLine
##
class LTLine(LTCurve): class LTLine(LTCurve):
"""A single straight line.
Could be used for separating text or figures.
"""
def __init__(self, linewidth, p0, p1, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None): def __init__(self, linewidth, p0, p1, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
LTCurve.__init__(self, linewidth, [p0, p1], stroke, fill, evenodd, stroking_color, non_stroking_color) LTCurve.__init__(self, linewidth, [p0, p1], stroke, fill, evenodd, stroking_color, non_stroking_color)
return return
## LTRect
##
class LTRect(LTCurve): class LTRect(LTCurve):
"""A rectangle.
Could be used for framing other pictures or figures.
"""
def __init__(self, linewidth, bbox, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None): def __init__(self, linewidth, bbox, stroke = False, fill = False, evenodd = False, stroking_color = None, non_stroking_color = None):
(x0, y0, x1, y1) = bbox (x0, y0, x1, y1) = bbox
@ -188,9 +213,11 @@ class LTRect(LTCurve):
return return
## LTImage
##
class LTImage(LTComponent): class LTImage(LTComponent):
"""An image object.
Embedded images can be in JPEG, Bitmap or JBIG2.
"""
def __init__(self, name, stream, bbox): def __init__(self, name, stream, bbox):
LTComponent.__init__(self, bbox) LTComponent.__init__(self, bbox)
@ -211,9 +238,13 @@ class LTImage(LTComponent):
bbox2str(self.bbox), self.srcsize)) bbox2str(self.bbox), self.srcsize))
## LTAnno
##
class LTAnno(LTItem, LTText): class LTAnno(LTItem, LTText):
"""Actual letter in the text as a Unicode string.
Note that, while an LTChar object has actual boundaries, LTAnno objects do
not, as these are "virtual" characters, inserted by a layout analyzer
according to the relationship between two characters (e.g. a space).
"""
def __init__(self, text): def __init__(self, text):
self._text = text self._text = text
@ -223,9 +254,8 @@ class LTAnno(LTItem, LTText):
return self._text return self._text
## LTChar
##
class LTChar(LTComponent, LTText): class LTChar(LTComponent, LTText):
"""Actual letter in the text as a Unicode string."""
def __init__(self, matrix, font, fontsize, scaling, rise, def __init__(self, matrix, font, fontsize, scaling, rise,
text, textwidth, textdisp, ncs, graphicstate): text, textwidth, textdisp, ncs, graphicstate):
@ -286,9 +316,8 @@ class LTChar(LTComponent, LTText):
return True return True
## LTContainer
##
class LTContainer(LTComponent): class LTContainer(LTComponent):
"""Object that can be extended and analyzed"""
def __init__(self, bbox): def __init__(self, bbox):
LTComponent.__init__(self, bbox) LTComponent.__init__(self, bbox)
@ -316,10 +345,7 @@ class LTContainer(LTComponent):
return return
## LTExpandableContainer
##
class LTExpandableContainer(LTContainer): class LTExpandableContainer(LTContainer):
def __init__(self): def __init__(self):
LTContainer.__init__(self, (+INF, +INF, -INF, -INF)) LTContainer.__init__(self, (+INF, +INF, -INF, -INF))
return return
@ -331,10 +357,7 @@ class LTExpandableContainer(LTContainer):
return return
## LTTextContainer
##
class LTTextContainer(LTExpandableContainer, LTText): class LTTextContainer(LTExpandableContainer, LTText):
def __init__(self): def __init__(self):
LTText.__init__(self) LTText.__init__(self)
LTExpandableContainer.__init__(self) LTExpandableContainer.__init__(self)
@ -344,9 +367,12 @@ class LTTextContainer(LTExpandableContainer, LTText):
return ''.join(obj.get_text() for obj in self if isinstance(obj, LTText)) return ''.join(obj.get_text() for obj in self if isinstance(obj, LTText))
## LTTextLine
##
class LTTextLine(LTTextContainer): class LTTextLine(LTTextContainer):
"""Contains a list of LTChar objects that represent a single text line.
The characters are aligned either horizontally or vertically, depending on
the text's writing mode.
"""
def __init__(self, word_margin): def __init__(self, word_margin):
LTTextContainer.__init__(self) LTTextContainer.__init__(self)
@ -368,7 +394,6 @@ class LTTextLine(LTTextContainer):
class LTTextLineHorizontal(LTTextLine): class LTTextLineHorizontal(LTTextLine):
def __init__(self, word_margin): def __init__(self, word_margin):
LTTextLine.__init__(self, word_margin) LTTextLine.__init__(self, word_margin)
self._x1 = +INF self._x1 = +INF
@ -394,7 +419,6 @@ class LTTextLineHorizontal(LTTextLine):
class LTTextLineVertical(LTTextLine): class LTTextLineVertical(LTTextLine):
def __init__(self, word_margin): def __init__(self, word_margin):
LTTextLine.__init__(self, word_margin) LTTextLine.__init__(self, word_margin)
self._y0 = -INF self._y0 = -INF
@ -419,12 +443,13 @@ class LTTextLineVertical(LTTextLine):
abs(obj.y1-self.y1) < d))] abs(obj.y1-self.y1) < d))]
## LTTextBox
##
## A set of text objects that are grouped within
## a certain rectangular area.
##
class LTTextBox(LTTextContainer): class LTTextBox(LTTextContainer):
"""Represents a group of text chunks in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text. It contains a list of
LTTextLine objects.
"""
def __init__(self): def __init__(self):
LTTextContainer.__init__(self) LTTextContainer.__init__(self)
@ -438,7 +463,6 @@ class LTTextBox(LTTextContainer):
class LTTextBoxHorizontal(LTTextBox): class LTTextBoxHorizontal(LTTextBox):
def analyze(self, laparams): def analyze(self, laparams):
LTTextBox.analyze(self, laparams) LTTextBox.analyze(self, laparams)
self._objs.sort(key=lambda obj: -obj.y1) self._objs.sort(key=lambda obj: -obj.y1)
@ -449,7 +473,6 @@ class LTTextBoxHorizontal(LTTextBox):
class LTTextBoxVertical(LTTextBox): class LTTextBoxVertical(LTTextBox):
def analyze(self, laparams): def analyze(self, laparams):
LTTextBox.analyze(self, laparams) LTTextBox.analyze(self, laparams)
self._objs.sort(key=lambda obj: -obj.x1) self._objs.sort(key=lambda obj: -obj.x1)
@ -459,10 +482,7 @@ class LTTextBoxVertical(LTTextBox):
return 'tb-rl' return 'tb-rl'
## LTTextGroup
##
class LTTextGroup(LTTextContainer): class LTTextGroup(LTTextContainer):
def __init__(self, objs): def __init__(self, objs):
LTTextContainer.__init__(self) LTTextContainer.__init__(self)
self.extend(objs) self.extend(objs)
@ -470,7 +490,6 @@ class LTTextGroup(LTTextContainer):
class LTTextGroupLRTB(LTTextGroup): class LTTextGroupLRTB(LTTextGroup):
def analyze(self, laparams): def analyze(self, laparams):
LTTextGroup.analyze(self, laparams) LTTextGroup.analyze(self, laparams)
# reorder the objects from top-left to bottom-right. # reorder the objects from top-left to bottom-right.
@ -481,7 +500,6 @@ class LTTextGroupLRTB(LTTextGroup):
class LTTextGroupTBRL(LTTextGroup): class LTTextGroupTBRL(LTTextGroup):
def analyze(self, laparams): def analyze(self, laparams):
LTTextGroup.analyze(self, laparams) LTTextGroup.analyze(self, laparams)
# reorder the objects from top-right to bottom-left. # reorder the objects from top-right to bottom-left.
@ -491,10 +509,7 @@ class LTTextGroupTBRL(LTTextGroup):
return return
## LTLayoutContainer
##
class LTLayoutContainer(LTContainer): class LTLayoutContainer(LTContainer):
def __init__(self, bbox): def __init__(self, bbox):
LTContainer.__init__(self, bbox) LTContainer.__init__(self, bbox)
self.groups = None self.groups = None
@ -603,9 +618,22 @@ class LTLayoutContainer(LTContainer):
yield box yield box
return return
# group_textboxes: group textboxes hierarchically.
def group_textboxes(self, laparams, boxes): def group_textboxes(self, laparams, boxes):
assert boxes, str((laparams, boxes)) """Group textboxes hierarchically.
Get pair-wise distances, via dist func defined below, and then merge from the closest textbox pair. Once
obj1 and obj2 are merged / grouped, the resulting group is considered as a new object, and its distances to
other objects & groups are added to the process queue.
For performance reasons, pair-wise distances and object pair info are maintained in a heap of
(idx, dist, id(obj1), id(obj2), obj1, obj2) tuples. It ensures quick access to the smallest element. Note that
since comparison operators, e.g., __lt__, are disabled for LTComponent, id(obj) has to appear before obj in
element tuples.
:param laparams: LAParams object.
:param boxes: All textbox objects to be grouped.
:return: a list that has only one element, the final top level textbox.
"""
def dist(obj1, obj2): def dist(obj1, obj2):
"""A distance function between two TextBoxes. """A distance function between two TextBoxes.
@ -626,8 +654,7 @@ class LTLayoutContainer(LTContainer):
return ((x1-x0)*(y1-y0) - obj1.width*obj1.height - obj2.width*obj2.height) return ((x1-x0)*(y1-y0) - obj1.width*obj1.height - obj2.width*obj2.height)
def isany(obj1, obj2): def isany(obj1, obj2):
"""Check if there's any other object between obj1 and obj2. """Check if there's any other object between obj1 and obj2."""
"""
x0 = min(obj1.x0, obj2.x0) x0 = min(obj1.x0, obj2.x0)
y0 = min(obj1.y0, obj2.y0) y0 = min(obj1.y0, obj2.y0)
x1 = max(obj1.x1, obj2.x1) x1 = max(obj1.x1, obj2.x1)
@ -635,39 +662,36 @@ class LTLayoutContainer(LTContainer):
objs = set(plane.find((x0, y0, x1, y1))) objs = set(plane.find((x0, y0, x1, y1)))
return objs.difference((obj1, obj2)) return objs.difference((obj1, obj2))
def key_obj(t): dists = []
(c,d,_,_) = t
return (c,d)
dists = SortedListWithKey(key=key_obj)
for i in range(len(boxes)): for i in range(len(boxes)):
obj1 = boxes[i] obj1 = boxes[i]
for j in range(i+1, len(boxes)): for j in range(i+1, len(boxes)):
obj2 = boxes[j] obj2 = boxes[j]
dists.add((0, dist(obj1, obj2), obj1, obj2)) dists.append((True, dist(obj1, obj2), id(obj1), id(obj2), obj1, obj2))
heapq.heapify(dists)
plane = Plane(self.bbox) plane = Plane(self.bbox)
plane.extend(boxes) plane.extend(boxes)
while dists: done = set()
(c, d, obj1, obj2) = dists.pop(0) while len(dists) > 0:
if c == 0 and isany(obj1, obj2): (is_first, d, id1, id2, obj1, obj2) = heapq.heappop(dists)
dists.add((1, d, obj1, obj2)) # Skip objects that are already merged
continue if (id1 not in done) and (id2 not in done):
if (isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or if is_first and isany(obj1, obj2):
isinstance(obj2, (LTTextBoxVertical, LTTextGroupTBRL))): heapq.heappush(dists, (False, d, id1, id2, obj1, obj2))
group = LTTextGroupTBRL([obj1, obj2]) continue
else: if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or \
group = LTTextGroupLRTB([obj1, obj2]) isinstance(obj2, (LTTextBoxVertical, LTTextGroupTBRL)):
plane.remove(obj1) group = LTTextGroupTBRL([obj1, obj2])
plane.remove(obj2) else:
removed = [obj1, obj2] group = LTTextGroupLRTB([obj1, obj2])
to_remove = [ (c,d,obj1,obj2) for (c,d,obj1,obj2) in dists plane.remove(obj1)
if (obj1 in removed or obj2 in removed) ] plane.remove(obj2)
for r in to_remove: done.update([id1, id2])
dists.remove(r)
for other in plane: for other in plane:
dists.add((0, dist(group, other), group, other)) heapq.heappush(dists, (False, dist(group, other), id(group), id(other), group, other))
plane.add(group) plane.add(group)
assert len(plane) == 1, str(len(plane))
return list(plane) return list(plane)
def analyze(self, laparams): def analyze(self, laparams):
@ -701,9 +725,13 @@ class LTLayoutContainer(LTContainer):
return return
## LTFigure
##
class LTFigure(LTLayoutContainer): class LTFigure(LTLayoutContainer):
"""Represents an area used by PDF Form objects.
PDF Forms can be used to present figures or pictures by embedding yet
another PDF document within a page. Note that LTFigure objects can appear
recursively.
"""
def __init__(self, name, bbox, matrix): def __init__(self, name, bbox, matrix):
self.name = name self.name = name
@ -726,9 +754,12 @@ class LTFigure(LTLayoutContainer):
return return
## LTPage
##
class LTPage(LTLayoutContainer): class LTPage(LTLayoutContainer):
"""Represents an entire page.
May contain child objects like LTTextBox, LTFigure, LTImage, LTRect,
LTCurve and LTLine.
"""
def __init__(self, pageid, bbox, rotate=0): def __init__(self, pageid, bbox, rotate=0):
LTLayoutContainer.__init__(self, bbox) LTLayoutContainer.__init__(self, bbox)
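
The heap-based grouping described in the `group_textboxes` docstring above can be sketched on one-dimensional spans. This is a simplified illustration under stated assumptions, not the library's implementation: `Span`, `dist` and `group_spans` are hypothetical names, the `isany` check and the `Plane` index are left out, and each group keeps its children so that no `id()` value is reused while stale heap entries still exist.

```python
import heapq


class Span(object):
    """Hypothetical 1-D stand-in for a text box; ordering is disabled."""

    def __init__(self, x0, x1, children=()):
        self.x0 = x0
        self.x1 = x1
        # Keeping the children alive prevents id() reuse.
        self.children = children

    def __lt__(self, _):
        raise ValueError


def dist(a, b):
    """Length of the merged span that neither a nor b covers."""
    merged = max(a.x1, b.x1) - min(a.x0, b.x0)
    return merged - (a.x1 - a.x0) - (b.x1 - b.x0)


def group_spans(spans):
    """Merge the closest pair repeatedly; return the merged extents."""
    heap = []
    for i, s1 in enumerate(spans):
        for s2 in spans[i + 1:]:
            # id() values precede the objects, so ties on the distance
            # are resolved without ever calling Span.__lt__.
            heap.append((dist(s1, s2), id(s1), id(s2), s1, s2))
    heapq.heapify(heap)
    alive = list(spans)
    done = set()
    merges = []
    while heap:
        d, id1, id2, s1, s2 = heapq.heappop(heap)
        if id1 in done or id2 in done:
            # Stale entry: one side was merged earlier (lazy deletion).
            continue
        group = Span(min(s1.x0, s2.x0), max(s1.x1, s2.x1), (s1, s2))
        done.update([id1, id2])
        alive = [s for s in alive if s is not s1 and s is not s2]
        for other in alive:
            entry = (dist(group, other), id(group), id(other), group, other)
            heapq.heappush(heap, entry)
        alive.append(group)
        merges.append((group.x0, group.x1))
    return merges


merges = group_spans([Span(0, 1), Span(2, 3), Span(10, 11)])
```

Skipping stale entries when they are popped replaces the linear scan that removed them from the sorted list after every merge, at the cost of leaving dead tuples in the heap until they surface.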

@ -5,10 +5,13 @@ import six #Python 2+3 compatibility
import logging import logging
logger = logging.getLogger(__name__)
class CorruptDataError(Exception): class CorruptDataError(Exception):
pass pass
## LZWDecoder ## LZWDecoder
## ##
class LZWDecoder(object): class LZWDecoder(object):
@ -90,7 +93,7 @@ class LZWDecoder(object):
# just ignore corrupt data and stop yielding there # just ignore corrupt data and stop yielding there
break break
yield x yield x
logging.debug('nbits=%d, code=%d, output=%r, table=%r' % logger.debug('nbits=%d, code=%d, output=%r, table=%r' %
(self.nbits, code, x, self.table[258:])) (self.nbits, code, x, self.table[258:]))
return return
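
The change above moves from the root `logging.debug` call to a module-level `logger`. A minimal sketch of that pattern follows; the logger name `example.lzw`, the `ListHandler` class and `decode_step` are hypothetical and exist only for the illustration. Unlike the diff, the sketch passes the arguments separately rather than formatting with `%`, which is a variation and not what the commit does.

```python
import logging

# Hypothetical name; inside a real module this would normally be
# logging.getLogger(__name__).
logger = logging.getLogger("example.lzw")


class ListHandler(logging.Handler):
    """Collects messages in a list so the effect can be inspected."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append("%s:%s" % (record.name, record.getMessage()))


def decode_step(nbits, code):
    # Arguments are passed separately, so the message is only
    # formatted when a handler asks for it.
    logger.debug("nbits=%d, code=%d", nbits, code)


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
decode_step(9, 258)
```

Because records carry the logger's name, an application can raise or lower the level for this one module without touching the root logger.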

@ -2,13 +2,13 @@
import six import six
from . import utils
from .pdffont import PDFUnicodeNotDefined from .pdffont import PDFUnicodeNotDefined
from . import utils
## PDFDevice
##
class PDFDevice(object): class PDFDevice(object):
"""Translate the output of PDFPageInterpreter to the output that is needed
"""
def __init__(self, rsrcmgr): def __init__(self, rsrcmgr):
self.rsrcmgr = rsrcmgr self.rsrcmgr = rsrcmgr

@ -671,7 +671,11 @@ class PDFDocument(object):
# can raise PDFObjectNotFound # can raise PDFObjectNotFound
def getobj(self, objid): def getobj(self, objid):
assert objid != 0 """Get object from PDF
:raises PDFException if PDFDocument is not initialized
:raises PDFObjectNotFound if objid does not exist in PDF
"""
if not self.xrefs: if not self.xrefs:
raise PDFException('PDFDocument is not initialized') raise PDFException('PDFDocument is not initialized')
log.debug('getobj: objid=%r', objid) log.debug('getobj: objid=%r', objid)

@ -318,9 +318,8 @@ class PDFContentParser(PSStackParser):
return return
## Interpreter
##
class PDFPageInterpreter(object): class PDFPageInterpreter(object):
"""Processor for the content of a PDF page"""
def __init__(self, rsrcmgr, device): def __init__(self, rsrcmgr, device):
self.rsrcmgr = rsrcmgr self.rsrcmgr = rsrcmgr

@ -27,7 +27,7 @@ LITERALS_ASCIIHEX_DECODE = (LIT('ASCIIHexDecode'), LIT('AHx'))
LITERALS_RUNLENGTH_DECODE = (LIT('RunLengthDecode'), LIT('RL')) LITERALS_RUNLENGTH_DECODE = (LIT('RunLengthDecode'), LIT('RL'))
LITERALS_CCITTFAX_DECODE = (LIT('CCITTFaxDecode'), LIT('CCF')) LITERALS_CCITTFAX_DECODE = (LIT('CCITTFaxDecode'), LIT('CCF'))
LITERALS_DCT_DECODE = (LIT('DCTDecode'), LIT('DCT')) LITERALS_DCT_DECODE = (LIT('DCTDecode'), LIT('DCT'))
LITERALS_JBIG2_DECODE = (LIT('JBIG2Decode'),)
## PDF Objects ## PDF Objects
## ##
@ -275,6 +275,8 @@ class PDFStream(PDFObject):
# This is probably a JPG stream - it does not need to be decoded twice. # This is probably a JPG stream - it does not need to be decoded twice.
# Just return the stream to the user. # Just return the stream to the user.
pass pass
elif f in LITERALS_JBIG2_DECODE:
pass
elif f == LITERAL_CRYPT: elif f == LITERAL_CRYPT:
# not yet.. # not yet..
raise PDFNotImplementedError('/Crypt filter is unsupported') raise PDFNotImplementedError('/Crypt filter is unsupported')

@ -1,8 +1 @@
STRICT = False STRICT = False
try:
from django.conf import settings
STRICT = getattr(settings, 'PDF_MINER_IS_STRICT', STRICT)
except Exception:
# in case it's not a django project
pass

@ -20,6 +20,17 @@ jo.pdf:
(File generated from jo.tex by LaTeX and dvi2pdfm) (File generated from jo.tex by LaTeX and dvi2pdfm)
-- --
contrib/matplotlib.pdf
Copyright 2018, James R Barlow
Example file created in matplotlib to add a Type3 font to the samples
Released under the terms of the "LICENSE" file
--
nonfree/cmp_itext_logo.pdf
Bruno Lowagie
"iText Logo - Type 3 font"
http://gitlab.itextsupport.com/itext/sandbox/raw/master/cmpfiles/fonts/cmp_itext_logo.pdf
nonfree/dmca.pdf: nonfree/dmca.pdf:
U.S. Copyright Office U.S. Copyright Office
The Digital Millenium Copyright Act The Digital Millenium Copyright Act

Binary file not shown.

Binary file not shown.

File diff suppressed because it is too large

@ -1,23 +0,0 @@
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,595.000,842.000" rotate="0">
<textbox id="0" bbox="56.800,771.508,90.688,787.264">
<textline bbox="56.800,771.508,90.688,787.264">
<text font="BAAAAA+TimesNewRomanPSMT" bbox="56.800,771.508,63.472,787.264" size="15.756">S</text>
<text font="BAAAAA+TimesNewRomanPSMT" bbox="63.484,771.508,68.800,787.264" size="15.756">e</text>
<text font="BAAAAA+TimesNewRomanPSMT" bbox="68.788,771.508,74.104,787.264" size="15.756">c</text>
<text font="BAAAAA+TimesNewRomanPSMT" bbox="74.092,771.508,78.088,787.264" size="15.756">r</text>
<text font="BAAAAA+TimesNewRomanPSMT" bbox="78.088,771.508,83.404,787.264" size="15.756">e</text>
<text font="BAAAAA+TimesNewRomanPSMT" bbox="83.392,771.508,86.716,787.264" size="15.756">t</text>
<text font="BAAAAA+TimesNewRomanPSMT" bbox="86.692,771.508,90.688,787.264" size="15.756">!</text>
<text>
</text>
</textline>
</textbox>
<figure name="Tr4" bbox="-9.000,420.000,595.000,840.100">
</figure>
<layout>
<textbox id="0" bbox="56.800,771.508,90.688,787.264" />
</layout>
</page>
</pages>

@ -1,72 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:792px; height:612px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:715px; top:114px; width:11px; height:28px;"><span style="font-family: Ryumin-Light; font-size:11px">  序
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:715px; top:374px; width:11px; height:9px;"><span style="font-family: Ryumin-Light; font-size:11px"> 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:168px; top:105px; width:502px; height:210px;"><span style="font-family: Ryumin-Light; font-size:11px">わたくしといふ現象は
<br>假定された有機交流電燈の
<br>ひとつの青い照明です
<br>(あらゆる透明な幽霊の複合体)
<br>風景やみんなといっしょに
<br>せはしくせはしく明滅しながら
<br>いかにもたしかにともりつづける
<br>因果交流電燈の
<br>ひとつの青い照明です
<br>(ひかりはたもち、その電燈は失はれ)
<br>  
<br>これらは二十二箇月の
<br>過去とかんずる方角から
<br>紙と鑛質インクをつらね
<br>(すべてわたくしと明滅し
<br> みんなが同時に感ずるもの)
<br>ここまでたもちつゞけられた
<br>かげとひかりのひとくさりづつ
<br>そのとほりの心象スケッチです
<br>  
<br>これらについて人や銀河や修羅や海膽は
<br>宇宙塵をたべ、または空気や塩水を呼吸しながら
<br>それぞれ新鮮な本体論もかんがへませうが
<br>それらも畢竟こゝろのひとつの風物です
<br>たゞたしかに記録されたこれらのけしきは
<br></span><span style="font-family: Ryumin-Light; font-size:11px">記録されたそのとほりのこのけしきで
<br></span><span style="font-family: Ryumin-Light; font-size:11px">それが虚無ならば虚無自身がこのとほりで
<br>ある程度まではみんなに共通いたします
<br></span><span style="font-family: Ryumin-Light; font-size:11px">(すべてがわたくしの中のみんなであるやうに
<br></span><span style="font-family: Ryumin-Light; font-size:11px"> みんなのおのおののなかのすべてですから)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:101px; top:374px; width:536px; height:191px;"><span style="font-family: Ryumin-Light; font-size:11px">けれどもこれら新世代沖積世の
<br>巨大に明るい時間の集積のなかで
<br>正しくうつされた筈のこれらのことばが
<br>わづかその一點にも均しい明暗のうちに
<br>   (あるひは修羅の十億年)
<br>すでにはやくもその組立や質を變じ
<br>しかもわたくしも印刷者も
<br>それを変らないとして感ずることは
<br>傾向としてはあり得ます
<br>けだしわれわれがわれわれの感官や
<br>風景や人物をかんずるやうに
<br>そしてたゞ共通に感ずるだけであるやうに
<br>記録や歴史、あるひは地史といふものも
<br>それのいろいろの論料といっしょに
<br>(因果の時空的制約のもとに)
<br>われわれがかんじてゐるのに過ぎません
<br>おそらくこれから二千年もたったころは
<br>それ相當のちがった地質學が流用され
<br>相當した證據もまた次次過去から現出し
<br>みんなは二千年ぐらゐ前には
<br>青ぞらいっぱいの無色な孔雀が居たとおもひ
<br>新進の大學士たちは気圏のいちばんの上層
<br>きらびやかな氷窒素のあたりから
<br></span><span style="font-family: Ryumin-Light; font-size:11px">すてきな化石を發堀したり
<br></span><span style="font-family: Ryumin-Light; font-size:11px">あるひは白堊紀砂岩の層面に
<br>透明な人類の巨大な足跡を
<br></span><span style="font-family: Ryumin-Light; font-size:11px">発見するかもしれません
<br></span><span style="font-family: Ryumin-Light; font-size:11px">  
<br></span><span style="font-family: Ryumin-Light; font-size:11px">すべてこれらの命題は
<br></span><span style="font-family: Ryumin-Light; font-size:11px">心象や時間それ自身の性質として
<br></span><span style="font-family: Ryumin-Light; font-size:11px">第四次延長のなかで主張されます
<br></span><span style="font-family: Ryumin-Light; font-size:11px">  
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:74px; top:471px; width:11px; height:143px;"><span style="font-family: Ryumin-Light; font-size:11px">大正十三年一月廿日  宮澤賢治
<br></span></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

@ -1,173 +0,0 @@
\documentclass[landscape,twocolumn]{tarticle}
\setlength{\hoffset}{-0.6in}
\setlength{\voffset}{-0.7in}
\setlength{\textwidth}{18cm}
%\setlength{\textheight}{9in}
%\setlength{\oddsidemargin}{-0.5in}
%\setlength{\evensidemargin}{-0.5in}
\setlength{\topmargin}{0in}
\setlength{\columnsep}{0.4in}
\pagestyle{empty}
\makeatletter
\def\kanjistrut{\vrule \@height0.88zw \@depth0.12zw \@width\z@}
\newdimen\mytempdima
\newcommand{\ruby}[2]{%
\leavevmode
\setbox0=\hbox{#1}%
\mytempdima=\f@size\p@
\setbox1=\hbox{\fontsize{0.5\mytempdima}{0pt}\selectfont #2}%
\ifdim\wd0>\wd1 \dimen0=\wd0 \else \dimen0=\wd1 \fi
\hbox{%
\kanjiskip=0pt plus 2fil
\xkanjiskip=0pt plus 2fil
\vbox{%
\hbox to \dimen0{%
\fontsize{0.5\mytempdima}{0pt}\selectfont \kanjistrut\hfil#2\hfil}%
\nointerlineskip
\hbox to \dimen0{\kanjistrut\hfil#1\hfil}}}}
\makeatother
\begin{document}
  序
\vspace{0.4in}
\begin{flushleft}
わたくしといふ現象は
假定された有機交流電燈の
ひとつの青い照明です
(あらゆる透明な幽霊の複合体)
風景やみんなといっしょに
せはしくせはしく明滅しながら
いかにもたしかにともりつづける
因果交流電燈の
ひとつの青い照明です
(ひかりはたもち、その電燈は失はれ)
  
これらは二十二箇月の
過去とかんずる方角から
紙と鑛質インクをつらね
(すべてわたくしと明滅し
 みんなが同時に感ずるもの)
ここまでたもちつゞけられた
かげとひかりのひとくさりづつ
そのとほりの心象スケッチです
  
これらについて人や銀河や修羅や海膽は
宇宙塵をたべ、または空気や塩水を呼吸しながら
それぞれ新鮮な本体論もかんがへませうが
それらも畢竟こゝろのひとつの風物です
たゞたしかに記録されたこれらのけしきは
記録されたそのとほりのこのけしきで
それが虚無ならば虚無自身がこのとほりで
ある程度まではみんなに共通いたします
(すべてがわたくしの中のみんなであるやうに
 みんなのおのおののなかのすべてですから)
\newpage
 
\vspace{1.0in}
けれどもこれら新世代沖積世の
巨大に明るい時間の集積のなかで
正しくうつされた筈のこれらのことばが
わづかその一點にも均しい明暗のうちに
   (あるひは修羅の十億年)
すでにはやくもその組立や質を變じ
しかもわたくしも印刷者も
それを変らないとして感ずることは
傾向としてはあり得ます
けだしわれわれがわれわれの感官や
風景や人物をかんずるやうに
そしてたゞ共通に感ずるだけであるやうに
記録や歴史、あるひは地史といふものも
それのいろいろの論料といっしょに
(因果の時空的制約のもとに)
われわれがかんじてゐるのに過ぎません
おそらくこれから二千年もたったころは
それ相當のちがった地質學が流用され
相當した證據もまた次次過去から現出し
みんなは二千年ぐらゐ前には
青ぞらいっぱいの無色な孔雀が居たとおもひ
新進の大學士たちは気圏のいちばんの上層
きらびやかな氷窒素のあたりから
すてきな化石を發堀したり
あるひは白堊紀砂岩の層面に
透明な人類の巨大な足跡を
発見するかもしれません
  
すべてこれらの命題は
心象や時間それ自身の性質として
第四次延長のなかで主張されます
  
\end{flushleft}
\begin{flushright}
大正十三年一月廿日  宮澤賢治
\end{flushright}
\end{document}

@ -1,71 +0,0 @@
  序
 
わたくしといふ現象は
假定された有機交流電燈の
ひとつの青い照明です
(あらゆる透明な幽霊の複合体)
風景やみんなといっしょに
せはしくせはしく明滅しながら
いかにもたしかにともりつづける
因果交流電燈の
ひとつの青い照明です
(ひかりはたもち、その電燈は失はれ)
  
これらは二十二箇月の
過去とかんずる方角から
紙と鑛質インクをつらね
(すべてわたくしと明滅し
 みんなが同時に感ずるもの)
ここまでたもちつゞけられた
かげとひかりのひとくさりづつ
そのとほりの心象スケッチです
  
これらについて人や銀河や修羅や海膽は
宇宙塵をたべ、または空気や塩水を呼吸しながら
それぞれ新鮮な本体論もかんがへませうが
それらも畢竟こゝろのひとつの風物です
たゞたしかに記録されたこれらのけしきは
記録されたそのとほりのこのけしきで
それが虚無ならば虚無自身がこのとほりで
ある程度まではみんなに共通いたします
(すべてがわたくしの中のみんなであるやうに
 みんなのおのおののなかのすべてですから)
けれどもこれら新世代沖積世の
巨大に明るい時間の集積のなかで
正しくうつされた筈のこれらのことばが
わづかその一點にも均しい明暗のうちに
   (あるひは修羅の十億年)
すでにはやくもその組立や質を變じ
しかもわたくしも印刷者も
それを変らないとして感ずることは
傾向としてはあり得ます
けだしわれわれがわれわれの感官や
風景や人物をかんずるやうに
そしてたゞ共通に感ずるだけであるやうに
記録や歴史、あるひは地史といふものも
それのいろいろの論料といっしょに
(因果の時空的制約のもとに)
われわれがかんじてゐるのに過ぎません
おそらくこれから二千年もたったころは
それ相當のちがった地質學が流用され
相當した證據もまた次次過去から現出し
みんなは二千年ぐらゐ前には
青ぞらいっぱいの無色な孔雀が居たとおもひ
新進の大學士たちは気圏のいちばんの上層
きらびやかな氷窒素のあたりから
すてきな化石を發堀したり
あるひは白堊紀砂岩の層面に
透明な人類の巨大な足跡を
発見するかもしれません
  
すべてこれらの命題は
心象や時間それ自身の性質として
第四次延長のなかで主張されます
  
大正十三年一月廿日  宮澤賢治

File diff suppressed because it is too large

Binary file not shown.

@ -1,50 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:138px; top:113px; width:335px; height:17px;"><span style="font-family: Garamond,Bold; font-size:17px">T</span><span style="font-family: Garamond,Bold; font-size:13px">HE </span><span style="font-family: Garamond,Bold; font-size:17px">D</span><span style="font-family: Garamond,Bold; font-size:13px">IGITAL </span><span style="font-family: Garamond,Bold; font-size:17px">M</span><span style="font-family: Garamond,Bold; font-size:13px">ILLENNIUM </span><span style="font-family: Garamond,Bold; font-size:17px">C</span><span style="font-family: Garamond,Bold; font-size:13px">OPYRIGHT </span><span style="font-family: Garamond,Bold; font-size:17px">A</span><span style="font-family: Garamond,Bold; font-size:13px">CT OF </span><span style="font-family: Garamond,Bold; font-size:17px">1998
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:218px; top:132px; width:174px; height:14px;"><span style="font-family: Garamond,Bold; font-size:14px">U.S. Copyright Office Summary
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:240px; width:93px; height:16px;"><span style="font-family: Garamond,Bold; font-size:15px">I</span><span style="font-family: Garamond,Bold; font-size:12px">NTRODUCTION
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:267px; top:214px; width:76px; height:13px;"><span style="font-family: Garamond,Bold; font-size:13px">December 1998
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:108px; top:270px; width:396px; height:67px;"><span style="font-family: Garamond; font-size:13px">The Digital Millennium Copyright Act (DMCA) was signed into law by
<br>President Clinton on October 28, 1998. The legislation implements two 1996 World
<br>Intellectual Property Organization (WIPO) treaties: the WIPO Copyright Treaty and
<br>the WIPO Performances and Phonograms Treaty. The DMCA also addresses a
<br>number of other significant copyright-related issues.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:381px; top:272px; width:3px; height:8px;"><span style="font-family: Garamond; font-size:8px">1
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:144px; top:351px; width:179px; height:13px;"><span style="font-family: Garamond; font-size:13px">The DMCA is divided into five titles:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:144px; top:379px; width:8px; height:12px;"><span style="font-family: ELCKGH+WPTypographicSymbols; font-size:12px">!
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:144px; top:419px; width:8px; height:12px;"><span style="font-family: ELCKGH+WPTypographicSymbols; font-size:12px">!
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:144px; top:460px; width:8px; height:12px;"><span style="font-family: ELCKGH+WPTypographicSymbols; font-size:12px">!
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:143px; top:500px; width:8px; height:12px;"><span style="font-family: ELCKGH+WPTypographicSymbols; font-size:12px">!
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:144px; top:581px; width:8px; height:12px;"><span style="font-family: ELCKGH+WPTypographicSymbols; font-size:12px">!
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:179px; top:377px; width:324px; height:228px;"><span style="font-family: Garamond; font-size:13px">Title I, the “</span><span style="font-family: Garamond,Bold; font-size:13px">WIPO Copyright and Performances and Phonograms
<br>Treaties Implementation Act of 1998</span><span style="font-family: Garamond; font-size:13px">,” implements the WIPO
<br>treaties.
<br>Title II, the “</span><span style="font-family: Garamond,Bold; font-size:13px">Online Copyright Infringement Liability Limitation
<br>Act</span><span style="font-family: Garamond; font-size:13px">,” creates limitations on the liability of online service providers for
<br>copyright infringement when engaging in certain types of activities.
<br>Title III, the “</span><span style="font-family: Garamond,Bold; font-size:13px">Computer Maintenance Competition Assurance
<br>Act</span><span style="font-family: Garamond; font-size:13px">,” creates an exemption for making a copy of a computer program
<br>by activating a computer for purposes of maintenance or repair.
<br>Title IV contains six </span><span style="font-family: Garamond,Bold; font-size:13px">miscellaneous provisions</span><span style="font-family: Garamond; font-size:13px">, relating to the
<br>functions of the Copyright Office, distance education, the exceptions
<br>in the Copyright Act for libraries and for making ephemeral recordings,
<br>“webcasting” of sound recordings on the Internet, and the applicability
<br>of collective bargaining agreement obligations in the case of transfers
<br>of rights in motion pictures.
<br>Title V, the “</span><span style="font-family: Garamond,Bold; font-size:13px">Vessel Hull Design Protection Act</span><span style="font-family: Garamond; font-size:13px">,” creates a new form
<br>of protection for the design of vessel hulls.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:619px; width:396px; height:53px;"><span style="font-family: Garamond; font-size:13px">This memorandum summarizes briefly each title of the DMCA. It provides
<br>merely an overview of the laws provisions; for purposes of length and readability a
<br>significant amount of detail has been omitted. </span><span style="font-family: Garamond,Bold; font-size:13px">A complete understanding of any
<br>provision of the DMCA requires reference to the text of the legislation itself.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:144px; top:726px; width:228px; height:12px;"><span style="font-family: Garamond; font-size:12px">Pub. L. No. 105-304, 112 Stat. 2860 (Oct. 28, 1998).
<br></span><span style="font-family: Garamond; font-size:8px">1
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:108px; top:750px; width:106px; height:13px;"><span style="font-family: Garamond,Italic; font-size:13px">Copyright Office Summary
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:750px; width:63px; height:13px;"><span style="font-family: Garamond,Italic; font-size:13px">December 1998
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:476px; top:750px; width:27px; height:13px;"><span style="font-family: Garamond,Italic; font-size:13px">Page 1
<br></span></div><span style="position:absolute; border: black 1px solid; left:108px; top:719px; width:144px; height:1px;"></span>
<div style="position:absolute; border: figure 1px solid; writing-mode:False; left:285px; top:163px; width:44px; height:42px;"></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>


@ -1,61 +0,0 @@
THE DIGITAL MILLENNIUM COPYRIGHT ACT OF 1998
U.S. Copyright Office Summary
INTRODUCTION
December 1998
The Digital Millennium Copyright Act (DMCA) was signed into law by
President Clinton on October 28, 1998. The legislation implements two 1996 World
Intellectual Property Organization (WIPO) treaties: the WIPO Copyright Treaty and
the WIPO Performances and Phonograms Treaty. The DMCA also addresses a
number of other significant copyright-related issues.
1
The DMCA is divided into five titles:
!
!
!
!
!
Title I, the “WIPO Copyright and Performances and Phonograms
Treaties Implementation Act of 1998,” implements the WIPO
treaties.
Title II, the “Online Copyright Infringement Liability Limitation
Act,” creates limitations on the liability of online service providers for
copyright infringement when engaging in certain types of activities.
Title III, the “Computer Maintenance Competition Assurance
Act,” creates an exemption for making a copy of a computer program
by activating a computer for purposes of maintenance or repair.
Title IV contains six miscellaneous provisions, relating to the
functions of the Copyright Office, distance education, the exceptions
in the Copyright Act for libraries and for making ephemeral recordings,
“webcasting” of sound recordings on the Internet, and the applicability
of collective bargaining agreement obligations in the case of transfers
of rights in motion pictures.
Title V, the “Vessel Hull Design Protection Act,” creates a new form
of protection for the design of vessel hulls.
This memorandum summarizes briefly each title of the DMCA. It provides
merely an overview of the laws provisions; for purposes of length and readability a
significant amount of detail has been omitted. A complete understanding of any
provision of the DMCA requires reference to the text of the legislation itself.
Pub. L. No. 105-304, 112 Stat. 2860 (Oct. 28, 1998).
1
Copyright Office Summary
December 1998
Page 1

File diff suppressed because it is too large


@ -1,475 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:611px; height:791px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:490px; top:85px; width:65px; height:16px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">OMB No. 1545-0074
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:499px; top:93px; width:45px; height:34px;"><span style="font-family: HelveticaNeue-Bold; font-size:26px">20</span><span style="font-family: Helvetica-Condensed-Black; font-size:27px">07
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:425px; top:123px; width:111px; height:18px;"><span style="font-family: HelveticaNeue-Bold; font-size:9px">Identifying number (see page 8)
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:480px; top:150px; width:2px; height:28px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px"> I
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:425px; top:149px; width:32px; height:18px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">Check if:
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:425px; top:150px; width:112px; height:33px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">ndividual
<br>
<br>Estate or Trust
<br>Type of entry visa (see page 8)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:437px; top:113px; width:12px; height:16px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">, 20
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:425px; top:184px; width:6px; height:14px;"><span style="font-family: Universal-NewswithCommPi; font-size:5px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:61px; top:148px; width:330px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">resent home address (number, street, and apt. no., or rural route). If you have a P.O. box, see page 8.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:56px; top:157px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:56px; top:172px; width:323px; height:18px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">City, town or post office, state, and ZIP code. If you have a foreign address, see page 8.
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:56px; top:196px; width:233px; height:37px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">Country </span><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:9px">Give address </span><span style="font-family: HelveticaNeue-Bold; font-size:9px">outside the United States </span><span style="font-family: HelveticaNeue-Roman; font-size:9px">to which you want any
<br>refund check mailed. If same as above, write “Same.”
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:88px; top:207px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:251px; top:196px; width:195px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">Of what country were you a </span><span style="font-family: HelveticaNeue-Bold; font-size:9px">citizen </span><span style="font-family: HelveticaNeue-Roman; font-size:9px">or national during the tax year? </span><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:316px; top:208px; width:241px; height:25px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">Give address in the country where you are a </span><span style="font-family: HelveticaNeue-Bold; font-size:9px">permanent resident.
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:9px">If same as above, write “Same.”
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:439px; top:207px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:176px; top:82px; width:269px; height:17px;"><span style="font-family: FranklinGothic-Demi; font-size:16px">U.S. Nonresident Alien Income Tax Return
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:146px; top:113px; width:30px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">beginning
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:211px; top:100px; width:195px; height:16px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">For the year January 1December 31, 2007, or other tax year
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:260px; top:113px; width:61px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">, 2007, and ending
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:251px; top:123px; width:37px; height:18px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">Last name
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:55px; top:79px; width:66px; height:32px;"><span style="font-family: Helvetica-Condensed-Black; font-size:30px">1040NR
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:34px; top:97px; width:87px; height:32px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">Form
<br>Department of the Treasury
<br>Internal Revenue Service
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:56px; top:123px; width:93px; height:9px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">Your first name and initial
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:56px; top:132px; width:4px; height:24px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px"> </span><span style="font-family: HelveticaNeue-Roman; font-size:8px">P
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:36px; top:146px; width:9px; height:78px;"><span style="font-family: HelveticaNeue-Bold; font-size:5px">P</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">l</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">ea</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">p</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">i</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">y</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">p</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:221px; width:9px; height:2px;"><span style="font-family: HelveticaNeue-Bold; font-size:2px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:40px; top:273px; width:8px; height:159px;"><span style="font-family: HelveticaNeue-Bold; font-size:4px">A</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">l</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">tt</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">c</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">h</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">F</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:6px">m</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">(</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">)</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:3px">1099</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">-</span><span style="font-family: HelveticaNeue-Bold; font-size:5px">R</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:1px">i</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">f</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; 
font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">x</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:5px">w</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:5px">w</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">i</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">hh</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">l</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">d</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:33px; top:312px; width:8px; height:80px;"><span style="font-family: HelveticaNeue-Bold; font-size:4px">A</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">tt</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">c</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">h</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">F</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:6px">m</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:6px">W</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">-</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">2</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">h</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:502px; top:243px; width:10px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">7a
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:538px; top:243px; width:10px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">7b
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:494px; top:256px; width:27px; height:9px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">Yourself
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:532px; top:256px; width:25px; height:9px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">Spouse
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:483px; top:286px; width:3px; height:41px;"><span style="font-family: Universal-GreekwithMathPi; font-size:25px">兵</span><span style="font-family: HelveticaNeue-Roman; font-size:27px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:344px; width:7px; height:6px;"><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:374px; width:7px; height:6px;"><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:398px; width:7px; height:6px;"><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:415px; width:7px; height:6px;"><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:528px; top:438px; width:7px; height:6px;"><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:535px; top:354px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:535px; top:384px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:535px; top:409px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:535px; top:426px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:535px; top:448px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:453px; top:339px; width:73px; height:86px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">No. of boxes checked
<br>on 7a and 7b
<br>
<br>No. of children on
<br>7c who:
<br>
<br></span><span style="font-family: Universal-NewswithCommPi; font-size:6px">● </span><span style="font-family: HelveticaNeue-Bold; font-size:8px">lived with you
<br> </span><span style="font-family: Universal-NewswithCommPi; font-size:6px">● </span><span style="font-family: HelveticaNeue-Bold; font-size:8px">did not live with
<br>you due to divorce
<br>or separation
<br>
<br>Dependents on 7c
<br>not entered above
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:458px; top:432px; width:68px; height:8px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">dd numbers entered
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:453px; top:439px; width:49px; height:15px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">on lines above
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:459px; top:449px; width:10px; height:23px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">8
<br>
<br>9a
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:453px; top:424px; width:4px; height:16px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px"> A
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:457px; top:485px; width:15px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">10a
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:457px; top:509px; width:15px; height:132px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">11
<br>12
<br>13
<br>14
<br>15
<br>16b
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">17b
<br>18
<br>19
<br>20
<br>21
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:459px; top:654px; width:10px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">23
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:459px; top:786px; width:10px; height:23px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">34</span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br>35</span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:484px; top:810px; width:78px; height:11px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">Form </span><span style="font-family: HelveticaNeue-Bold; font-size:11px">1040NR </span><span style="font-family: HelveticaNeue-Roman; font-size:8px">(2007)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:543px; top:820px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:338px; top:297px; width:140px; height:19px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">If you check box 7b, enter your spouses
<br>identifying number </span><span style="font-family: Universal-NewswithCommPi; font-size:5px">䊳
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:331px; top:371px; width:36px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">relationship
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:339px; top:378px; width:20px; height:15px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">to you
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:391px; top:364px; width:43px; height:29px;"><span style="font-family: Helvetica-Condensed-Bold; font-size:8px">(4)
<br></span><span style="font-family: Helvetica-Condensed; font-size:8px">if qualifying
<br>child for child tax
<br>credit (see page 9)
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:304px; top:254px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:413px; top:321px; width:1px; height:6px;"><span style="font-family: HelveticaNeue-Roman; font-size:6px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:350px; top:282px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:331px; top:290px; width:3px; height:41px;"><span style="font-family: Universal-GreekwithMathPi; font-size:25px">其</span><span style="font-family: HelveticaNeue-Roman; font-size:27px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:56px; top:255px; width:367px; height:119px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">Filing status. Check only one box (16 below).
<br>
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">1
<br>
<br>2
<br>
<br>3
<br>
<br>4
<br>
<br>5
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">6
<br>
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">Caution: Do not </span><span style="font-family: HelveticaNeue-Roman; font-size:9px">check box 7a if your parent (or someone else) can claim you as a dependent.
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">Do not </span><span style="font-family: HelveticaNeue-Roman; font-size:9px">check box 7b if your spouse had any U.S. gross income.
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">7c
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:92px; top:271px; width:253px; height:81px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">Single resident of Canada or Mexico, or a single U.S. national
<br>Other single nonresident alien
<br>Married resident of Canada or Mexico, or a married U.S. national
<br>
<br>Married resident of the Republic of Korea (South Korea)
<br>
<br>Other married nonresident alien
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br>Qualifying widow(er) with dependent child (see page 9)
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:365px; width:83px; height:8px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">Dependents: </span><span style="font-family: HelveticaNeue-Roman; font-size:8px">(see page 9)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:324px; top:364px; width:50px; height:8px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">(3)</span><span style="font-family: HelveticaNeue-Roman; font-size:8px"> Dependents
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:219px; top:294px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:90px; top:361px; width:2px; height:9px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:111px; top:243px; width:243px; height:10px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">Filing Status and Exemptions for Individuals </span><span style="font-family: HelveticaNeue-Roman; font-size:10px">(see page 8)
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:117px; top:373px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:378px; width:43px; height:8px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">(1)</span><span style="font-family: HelveticaNeue-Roman; font-size:8px"> First name
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:80px; top:385px; width:1px; height:8px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:169px; top:378px; width:33px; height:15px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">Last name
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:239px; top:367px; width:58px; height:24px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">(2) </span><span style="font-family: HelveticaNeue-Roman; font-size:8px">Dependents
<br>identifying number
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:249px; top:385px; width:2px; height:52px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">.
<br>.
<br>.
<br>
<br>.
<br>.
<br>.
<br>
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:9px">.
<br>.
<br>.
<br>
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:9px">.
<br>.
<br>.
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:275px; top:385px; width:2px; height:52px;"><span style="font-family: HelveticaNeue-Roman; font-size:9px">.
<br>.
<br>.
<br>
<br>.
<br>.
<br>.
<br>
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:9px">.
<br>.
<br>.
<br>
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:9px">.
<br>.
<br>.
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:435px; top:664px; width:2px; height:6px;"><span style="font-family: Universal-NewswithCommPi; font-size:6px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:281px; top:486px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:318px; top:727px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:435px; top:607px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:340px; top:668px; width:2px; height:21px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:473px; width:10px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">9b
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:355px; top:497px; width:15px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">10b
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:315px; top:571px; width:15px; height:23px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">16b
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">17b
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:211px; top:569px; width:15px; height:33px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">16a
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br>17a
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:438px; width:357px; height:377px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">Total number of exemptions claimed
<br>
<br>Wages, salaries, tips, etc. Attach Form(s) W-2
<br>
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">Taxable </span><span style="font-family: HelveticaNeue-Roman; font-size:10px">interest
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">Tax-exempt </span><span style="font-family: HelveticaNeue-Roman; font-size:10px">interest. </span><span style="font-family: HelveticaNeue-Bold; font-size:10px">Do not </span><span style="font-family: HelveticaNeue-Roman; font-size:10px">include on line 9a
<br>Ordinary dividends
<br>
<br>Qualified dividends (see page 11)
<br>Taxable refunds, credits, or offsets of state and local income taxes (see page 11)
<br>
<br>Scholarship and fellowship grants. Attach Form(s) 1042-S or required statement (see page 11)
<br>
<br>Business income or (loss). Attach Schedule C or C-EZ (Form 1040)
<br>
<br>Capital gain or (loss). Attach Schedule D (Form 1040) if required. If not required, check here
<br>
<br>Other gains or (losses). Attach Form 4797
<br>IRA distributions
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br>Pensions and annuities
<br>Rental real estate, royalties, partnerships, trusts, etc. Attach Schedule E (Form 1040)
<br>Farm income or (loss). Attach Schedule F (Form 1040)
<br>
<br>Unemployment compensation
<br>Other income. List type and amount (see page 15)
<br>Total income exempt by a treaty from page 5, Item M
<br>Add lines 8, 9a, 10a, 1115, 16b, and 17b21. This is your </span><span style="font-family: HelveticaNeue-Bold; font-size:10px">total effectively connected income </span><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br>Educator expenses (see page 15)
<br>Health savings account deduction. Attach Form 8889
<br>Moving expenses. Attach Form 3903
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">Self-employed SEP, SIMPLE, and qualified plans
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br>Self-employed health insurance deduction (see page 16)
<br>Penalty on early withdrawal of savings
<br>
<br>Scholarship and fellowship grants excluded
<br>
<br>IRA deduction (see page 16)
<br>
<br>Student loan interest deduction (see page 16)
<br>
<br>Domestic production activities deduction. Attach Form 8903
<br>
<br>Add lines 24 through 33
<br>Subtract line 34 from line 23. Enter here and on line 36. This is your </span><span style="font-family: HelveticaNeue-Bold; font-size:10px">adjusted gross income </span><span style="font-family: Universal-NewswithCommPi; font-size:6px">䊳
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span><span style="font-family: Universal-NewswithCommPi; font-size:6px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:666px; width:10px; height:120px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">24
<br>25
<br>26
<br>27
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">28
<br>29
<br>30
<br>31
<br>32
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">33
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:334px; top:572px; width:116px; height:33px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">Taxable amount (see page 12)
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br>Taxable amount (see page 13)
<br></span><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:358px; top:642px; width:10px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">22
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:184px; top:595px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:206px; top:631px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:296px; top:644px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:308px; top:691px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:241px; top:703px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:150px; top:474px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:256px; top:571px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:223px; top:511px; width:2px; height:10px;"><span style="font-family: HelveticaNeue-Roman; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:63px; top:437px; width:5px; height:11px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">d
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:65px; top:451px; width:16px; height:59px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">8
<br>
<br>9a
<br>b
<br>
<br>10a
<br>b
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:65px; top:511px; width:15px; height:297px;"><span style="font-family: HelveticaNeue-Bold; font-size:10px">11
<br>12
<br>13
<br>14
<br>15
<br>16a
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">17a</span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">18
<br>19
<br>20
<br>21
<br>22
<br>23
<br>24
<br>25
<br>26
<br>27
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">28
<br>29
<br>30
<br>31
<br>32
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">33
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">34</span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br></span><span style="font-family: HelveticaNeue-Bold; font-size:10px">35</span><span style="font-family: HelveticaNeue-Bold; font-size:10px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:51px; top:462px; width:8px; height:189px;"><span style="font-family: HelveticaNeue-Bold; font-size:2px">I</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">c</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:6px">m</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">E</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">ff</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">ec</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">i</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">v</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">l</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">y</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:5px">C</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">nn</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">c</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">d</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; 
font-size:6px">W</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">i</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">h</span><span style="font-family: HelveticaNeue-Bold; font-size:1px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:5px">U</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">.</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">S</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">. </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">T</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">d</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">/</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">B</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">u</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:1px">i</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">ss
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:52px; top:649px; width:8px; height:1px;"><span style="font-family: HelveticaNeue-Bold; font-size:1px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:50px; top:692px; width:9px; height:90px;"><span style="font-family: HelveticaNeue-Bold; font-size:5px">A</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">d</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">j</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">u</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">d</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:6px">G</span><span style="font-family: HelveticaNeue-Bold; font-size:3px">r</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">ss</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:2px">I</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">c</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:7px">m</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:51px; top:780px; width:9px; height:2px;"><span style="font-family: HelveticaNeue-Bold; font-size:2px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:35px; top:531px; width:9px; height:159px;"><span style="font-family: HelveticaNeue-Bold; font-size:5px">E</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">c</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">l</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">s</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">, </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">b</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">u</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">do</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">o</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">tt</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">ac</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">h</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">, </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">y</span><span style="font-family: HelveticaNeue-Bold; 
font-size:2px"> </span><span style="font-family: HelveticaNeue-Bold; font-size:4px">p</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">a</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">y</span><span style="font-family: HelveticaNeue-Bold; font-size:7px">m</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">e</span><span style="font-family: HelveticaNeue-Bold; font-size:4px">n</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">t</span><span style="font-family: HelveticaNeue-Bold; font-size:2px">.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:35px; top:688px; width:9px; height:2px;"><span style="font-family: HelveticaNeue-Bold; font-size:2px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:48px; top:431px; width:8px; height:1px;"><span style="font-family: HelveticaNeue-Bold; font-size:1px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:34px; top:813px; width:269px; height:16px;"><span style="font-family: HelveticaNeue-Bold; font-size:8px">For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see page 32.
<br>
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:367px; top:813px; width:53px; height:15px;"><span style="font-family: HelveticaNeue-Roman; font-size:8px">Cat. No. 11364D
<br>
<br></span></div><span style="position:absolute; border: black 1px solid; left:453px; top:496px; width:21px; height:12px;"></span>
<span style="position:absolute; border: black 1px solid; left:453px; top:472px; width:21px; height:12px;"></span>
<span style="position:absolute; border: black 1px solid; left:453px; top:641px; width:21px; height:12px;"></span>
<span style="position:absolute; border: black 1px solid; left:453px; top:665px; width:21px; height:120px;"></span>
<span style="position:absolute; border: black 1px solid; left:526px; top:315px; width:36px; height:24px;"></span>
<span style="position:absolute; border: black 1px solid; left:526px; top:267px; width:36px; height:24px;"></span>
<span style="position:absolute; border: black 1px solid; left:482px; top:95px; width:79px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:135px; top:87px; width:0px; height:35px;"></span>
<span style="position:absolute; border: black 1px solid; left:482px; top:87px; width:0px; height:35px;"></span>
<span style="position:absolute; border: black 1px solid; left:34px; top:123px; width:528px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:244px; top:123px; width:0px; height:24px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:123px; width:0px; height:686px;"></span>
<span style="position:absolute; border: black 1px solid; left:417px; top:123px; width:0px; height:72px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:147px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:171px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:195px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:207px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:309px; top:207px; width:0px; height:36px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:243px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:526px; top:243px; width:0px; height:96px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:243px; width:0px; height:96px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:255px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:267px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:358px; top:279px; width:121px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:279px; width:36px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:226px; top:291px; width:253px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:291px; width:36px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:303px; width:72px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:415px; top:316px; width:61px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:316px; width:72px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:226px; top:328px; width:253px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:328px; width:36px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:322px; top:340px; width:157px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:490px; top:340px; width:72px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:538px; top:352px; width:24px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:364px; width:397px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:381px; top:364px; width:0px; height:72px;"></span>
<span style="position:absolute; border: black 1px solid; left:316px; top:364px; width:0px; height:72px;"></span>
<span style="position:absolute; border: black 1px solid; left:215px; top:364px; width:0px; height:72px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:388px; width:397px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:538px; top:382px; width:24px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:400px; width:397px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:412px; width:397px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:424px; width:397px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:538px; top:407px; width:24px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:436px; width:397px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:538px; top:424px; width:24px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:238px; top:446px; width:205px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:448px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:475px; top:448px; width:0px; height:361px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:448px; width:0px; height:361px;"></span>
<span style="position:absolute; border: black 1px solid; left:540px; top:448px; width:0px; height:361px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:460px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:286px; top:460px; width:157px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:63px; top:448px; width:0px; height:361px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:472px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:154px; top:472px; width:289px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:286px; top:484px; width:61px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:432px; top:475px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:374px; top:475px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:475px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:484px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:166px; top:496px; width:277px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:496px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:430px; top:520px; width:13px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:532px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:430px; top:532px; width:13px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:544px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:370px; top:544px; width:73px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:556px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:262px; top:568px; width:181px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:568px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:287px; top:571px; width:0px; height:21px;"></span>
<span style="position:absolute; border: black 1px solid; left:229px; top:571px; width:0px; height:21px;"></span>
<span style="position:absolute; border: black 1px solid; left:208px; top:571px; width:0px; height:21px;"></span>
<span style="position:absolute; border: black 1px solid; left:154px; top:581px; width:49px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:208px; top:581px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:581px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:190px; top:593px; width:13px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:208px; top:593px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:593px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:442px; top:605px; width:1px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:605px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:617px; width:133px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:617px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:214px; top:629px; width:229px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:629px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:296px; top:641px; width:145px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:641px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:430px; top:662px; width:1px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:49px; top:665px; width:513px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:689px; width:37px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:677px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:322px; top:725px; width:25px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:701px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:286px; top:713px; width:61px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:725px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:250px; top:737px; width:97px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:737px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:274px; top:749px; width:73px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:773px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:202px; top:760px; width:145px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:785px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:274px; top:773px; width:73px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:34px; top:809px; width:528px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:446px; top:364px; width:0px; height:72px;"></span>
<span style="position:absolute; border: black 1px solid; left:310px; top:652px; width:37px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:432px; top:644px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:374px; top:644px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:644px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:653px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:309px; top:571px; width:0px; height:21px;"></span>
<span style="position:absolute; border: black 1px solid; left:250px; top:701px; width:97px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:689px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:749px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:761px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:400px; top:367px; width:1px; height:3px;"></span>
<span style="position:absolute; border: black 1px solid; left:402px; top:360px; width:5px; height:11px;"></span>
<span style="position:absolute; border: black 1px solid; left:244px; top:195px; width:0px; height:12px;"></span>
<span style="position:absolute; border: black 1px solid; left:238px; top:677px; width:97px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:226px; top:508px; width:121px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:508px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:374px; top:499px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:499px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:432px; top:499px; width:0px; height:9px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:520px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:713px; width:101px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:334px; top:785px; width:13px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:190px; top:796px; width:253px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:430px; top:807px; width:1px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:454px; top:797px; width:108px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:374px; top:665px; width:0px; height:120px;"></span>
<span style="position:absolute; border: black 1px solid; left:467px; top:151px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:467px; top:161px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:272px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:284px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:296px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:308px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:320px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:77px; top:332px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:409px; top:390px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:409px; top:402px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:409px; top:414px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:409px; top:426px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:539px; top:427px; width:21px; height:18px;"></span>
<span style="position:absolute; border: black 1px solid; left:434px; top:549px; width:8px; height:8px;"></span>
<span style="position:absolute; border: black 1px solid; left:432px; top:665px; width:0px; height:120px;"></span>
<span style="position:absolute; border: black 1px solid; left:352px; top:665px; width:0px; height:120px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

@ -1,431 +0,0 @@
OMB No. 1545-0074
2007
Identifying number (see page 8)
I
Check if:
ndividual
Estate or Trust
Type of entry visa (see page 8)
, 20
resent home address (number, street, and apt. no., or rural route). If you have a P.O. box, see page 8.
City, town or post office, state, and ZIP code. If you have a foreign address, see page 8.
Country 䊳
Give address outside the United States to which you want any
refund check mailed. If same as above, write “Same.”
Of what country were you a citizen or national during the tax year? 䊳
Give address in the country where you are a permanent resident.
If same as above, write “Same.”
U.S. Nonresident Alien Income Tax Return
beginning
For the year January 1–December 31, 2007, or other tax year
, 2007, and ending
Last name
1040NR
Form
Department of the Treasury
Internal Revenue Service
Your first name and initial
P
Please print or type.
Also attach Form(s) 1099-R if tax was withheld.
Attach Forms W-2 here.
7a
7b
Yourself
Spouse
No. of boxes checked
on 7a and 7b
No. of children on
7c who:
● lived with you
● did not live with
you due to divorce
or separation
Dependents on 7c
not entered above
dd numbers entered
on lines above
8
9a
A
10a
11
12
13
14
15
16b
17b
18
19
20
21
23
34
35
Form 1040NR (2007)
If you check box 7b, enter your spouse’s
identifying number 䊳
relationship
to you
(4)
if qualifying
child for child tax
credit (see page 9)
Filing status. Check only one box (1–6 below).
1
2
3
4
5
6
Caution: Do not check box 7a if your parent (or someone else) can claim you as a dependent.
Do not check box 7b if your spouse had any U.S. gross income.
7c
Single resident of Canada or Mexico, or a single U.S. national
Other single nonresident alien
Married resident of Canada or Mexico, or a married U.S. national
Married resident of the Republic of Korea (South Korea)
Other married nonresident alien
Qualifying widow(er) with dependent child (see page 9)
Dependents: (see page 9)
(3) Dependent’s
Filing Status and Exemptions for Individuals (see page 8)
(1) First name
Last name
(2) Dependent’s
identifying number
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
9b
10b
16b
17b
16a
17a
Total number of exemptions claimed
Wages, salaries, tips, etc. Attach Form(s) W-2
Taxable interest
Tax-exempt interest. Do not include on line 9a
Ordinary dividends
Qualified dividends (see page 11)
Taxable refunds, credits, or offsets of state and local income taxes (see page 11)
Scholarship and fellowship grants. Attach Form(s) 1042-S or required statement (see page 11)
Business income or (loss). Attach Schedule C or C-EZ (Form 1040)
Capital gain or (loss). Attach Schedule D (Form 1040) if required. If not required, check here
Other gains or (losses). Attach Form 4797
IRA distributions
Pensions and annuities
Rental real estate, royalties, partnerships, trusts, etc. Attach Schedule E (Form 1040)
Farm income or (loss). Attach Schedule F (Form 1040)
Unemployment compensation
Other income. List type and amount (see page 15)
Total income exempt by a treaty from page 5, Item M
Add lines 8, 9a, 10a, 11–15, 16b, and 17b–21. This is your total effectively connected income 䊳
Educator expenses (see page 15)
Health savings account deduction. Attach Form 8889
Moving expenses. Attach Form 3903
Self-employed SEP, SIMPLE, and qualified plans
Self-employed health insurance deduction (see page 16)
Penalty on early withdrawal of savings
Scholarship and fellowship grants excluded
IRA deduction (see page 16)
Student loan interest deduction (see page 16)
Domestic production activities deduction. Attach Form 8903
Add lines 24 through 33
Subtract line 34 from line 23. Enter here and on line 36. This is your adjusted gross income 䊳
24
25
26
27
28
29
30
31
32
33
Taxable amount (see page 12)
Taxable amount (see page 13)
22
d
8
9a
b
10a
b
11
12
13
14
15
16a
17a
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
Income Effectively Connected With U.S. Trade/Business
Adjusted Gross Income
Enclose, but do not attach, any payment.
For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see page 32.
Cat. No. 11364D

File diff suppressed because it is too large

@ -1,209 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:1008px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:36px; top:143px; width:283px; height:37px;"><span style="font-family: PDDIPA+Helvetica; font-size:17px">PAGER/SGML
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:16px">Page 1 of 48 Instructions for Form 1040NR
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:156px; top:134px; width:147px; height:25px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">Userid: ________ DTD INSTR04
<br>Fileid:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:192px; top:150px; width:234px; height:9px;"><span style="font-family: PDDIPA+Helvetica; font-size:9px">D:\USERS\8fllb\documents\epicfiles\2007Instructions1040NR.sgm
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:323px; top:129px; width:154px; height:15px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">Leadpct: 0% Pt. size: 9.5 </span><span style="font-family: PDDJAB+ZapfDingbats; font-size:13px">❏ </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">Draft
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:505px; top:129px; width:57px; height:15px;"><span style="font-family: PDDJAB+ZapfDingbats; font-size:13px">❏ </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">Ok to Print
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:444px; top:149px; width:128px; height:31px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">(Init. &amp; date)
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:16px">7:48 - 6-DEC-2007
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:198px; width:489px; height:10px;"><span style="font-family: PDDJAC+Helvetica-Oblique; font-size:10px">The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:285px; width:267px; height:130px;"><span style="font-family: PDDJAD+Helvetica-Bold; font-size:46px">20</span><span style="font-family: PDDJCD+Helvetica-Condensed-Black; font-size:48px">07
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:32px">Instructions for
<br>Form 1040NR
<br></span><span style="font-family: PDDJCE+FranklinGothic-Demi; font-size:16px">U.S. Nonresident Alien Income Tax Return
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:429px; width:157px; height:20px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">Section references are to the Internal
<br>Revenue Code unless otherwise noted.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:41px; top:428px; width:337px; height:202px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">use a different address this year. See
<br></span><span style="font-family: PDDJAC+Helvetica-Oblique; font-size:10px">Where To File</span><span style="font-family: PDDIPA+Helvetica; font-size:10px"> on page 4.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:20px">General Instructions </span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">deduction. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The deduction rate for
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Domestic production activities
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:16px">What’s New for 2007
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Tax benefits extended. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The following
<br>tax benefits were extended through
<br>2007.
<br></span><span style="font-family: PDDJDF+Symbol; font-size:14px">• </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">Deduction for educator expenses in
<br>figuring adjusted gross income.
<br></span><span style="font-family: PDDJDF+Symbol; font-size:14px">• </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">District of Columbia first-time
<br>homebuyer credit.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Alternative minimum tax (AMT)
<br>exemption amount decreased. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The
<br>AMT exemption amount is decreased to
<br>$33,750 ($45,000 if a qualifying
<br>widow(er); $22,500 if married filing
<br>separately).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:337px; top:672px; width:16px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px"> For
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:335px; top:494px; width:42px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px"> If you are
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:221px; top:471px; width:166px; height:533px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">2007 is increased to 6%.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Unreported social security and
<br>Medicare tax on wages.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">an employee and your employer did not
<br>withhold social security and Medicare
<br>tax, see Form 8919 to figure and report
<br>this tax.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Refundable credit for prior-year
<br>minimum tax.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">If you have an unused
<br>minimum tax credit carryforward from
<br>2004, see Form 8801 to find if you can
<br>take this credit.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Health savings account (HSA)
<br>funding distributions. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">You may be
<br>able to elect to exclude from income a
<br>distribution made from your IRA to your
<br>HSA. See the instructions for lines 16a
<br>and 16b beginning on page 12.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">New recordkeeping requirements for
<br>contributions of money.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">charitable contributions of money,
<br>regardless of the amount, you must
<br>maintain as a record of the contribution
<br>a bank record (such as a cancelled
<br>check) or a written record from the
<br>charity. The written record must include
<br>the name of the charity, date, and
<br>amount of the contribution. See </span><span style="font-family: PDDJAC+Helvetica-Oblique; font-size:10px">Gifts to
<br>U.S. Charities</span><span style="font-family: PDDIPA+Helvetica; font-size:10px"> that begins on page 26.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Exemption for housing a person
<br>displaced by Hurricane Katrina
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">expires. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The additional exemption
<br>amount for housing a person displaced
<br>by Hurricane Katrina does not apply for
<br>2007 or later years.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Telephone excise tax credit.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">credit was available only on your 2006
<br>return. If you filed but did not request it
<br>on your 2006 return, file Form 1040X
<br>using a simplified procedure explained
<br>in its instructions to amend your 2006
<br>return. If you were not required to file a
<br>2006 return, see the 2006 Form
<br>1040EZ-T.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:16px">What’s New for 2008
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">IRA deduction expanded. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">You may
<br>be able to deduct up to $5,000 ($6,000
<br>if age 50 or older at the end of the
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">year). You may be able to take an IRA
<br>deduction if you were covered by a
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:355px; top:837px; width:20px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px"> This
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:276px; top:1015px; width:59px; height:9px;"><span style="font-family: PDDIPA+Helvetica; font-size:9px">Cat. No. 11368V
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:53px; top:642px; width:5px; height:20px;"><span style="font-family: PDDJAD+Helvetica-Bold; font-size:20px">!
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:73px; top:637px; width:122px; height:30px;"><span style="font-family: PDDJAC+Helvetica-Oblique; font-size:10px">At the time these instructions
<br>went to print, Congress was
<br>considering legislation that
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:659px; width:166px; height:111px;"><span style="font-family: PDDJEF+Helvetica-Black; font-size:5px">CAUTION
<br></span><span style="font-family: PDDJAC+Helvetica-Oblique; font-size:10px">would increase the amounts above. To
<br>find out if this legislation was enacted,
<br>and for more details, see the
<br>Instructions for Form 6251.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">IRA deduction expanded.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">covered by a retirement plan, you may
<br>be able to take an IRA deduction if your
<br>2007 modified adjusted gross income
<br>(AGI) is less than $62,000 ($103,000 if
<br>a qualifying widow(er)).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:710px; width:46px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">If you were
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:774px; width:150px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">You may be able to deduct up to an
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:41px; top:784px; width:164px; height:220px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">additional $3,000 if you were a
<br>participant in a 401(k) plan and your
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">employer was in bankruptcy in an
<br>earlier year.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Standard mileage rates. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The 2007
<br>rate for business use of your vehicle is
<br>48</span><span style="font-family: PDDIPA+Helvetica; font-size:6px">1</span><span style="font-family: PDDIPA+Helvetica; font-size:10px">/</span><span style="font-family: PDDIPA+Helvetica; font-size:6px">2</span><span style="font-family: PDDIPA+Helvetica; font-size:10px"> cents a mile. The 2007 rate for
<br>use of your vehicle to move is 20 cents
<br>a mile. The special rate for charitable
<br>use of your vehicle to provide relief
<br>related to Hurricane Katrina has
<br>expired.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Elective salary deferrals. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The
<br>maximum amount you can defer under
<br>all plans is generally limited to $15,500
<br>($10,500 if you only have SIMPLE
<br>plans; $18,500 for section 403(b) plans
<br>if you qualify for the 15-year rule). See
<br>the instructions for line 8 on page 10.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Mailing your return.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px"> If you are filing
<br>the return for an estate or trust, you will
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:445px; top:299px; width:122px; height:23px;"><span style="font-family: PDDIPA+Helvetica; font-size:11px">Department of the Treasury
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Internal Revenue Service
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:401px; top:428px; width:167px; height:30px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">retirement plan and your 2008 modified
<br>AGI is less than $63,000 ($105,000 if a
<br>qualifying widow(er)).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:414px; top:465px; width:150px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">You may be able to deduct up to an
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:401px; top:475px; width:164px; height:239px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">additional $3,000 if you were a
<br>participant in a 401(k) plan and your
<br>employer was in bankruptcy in an
<br>earlier year.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Personal exemption and itemized
<br>deduction phaseouts reduced.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">Taxpayers with adjusted gross income
<br>above a certain amount may lose part
<br>of their deduction for personal
<br>exemptions and itemized deductions.
<br>The amount by which these deductions
<br>are reduced in 2008 will be only </span><span style="font-family: PDDIPA+Helvetica; font-size:6px">1</span><span style="font-family: PDDIPA+Helvetica; font-size:10px">/</span><span style="font-family: PDDIPA+Helvetica; font-size:6px">2</span><span style="font-family: PDDIPA+Helvetica; font-size:10px"> of
<br>the amount of the reduction that
<br>otherwise would have applied in 2007.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Capital gain tax rate reduced.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px"> The
<br>5% capital gain tax rate is reduced to
<br>zero.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Tax on children’s income.
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">8615 will be required to figure the tax
<br>for the following children with
<br>investment income of more than
<br>$1,800.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:525px; top:663px; width:24px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px"> Form
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:414px; top:716px; width:151px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">1. Children under age 18 at the end
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:402px; top:726px; width:34px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">of 2008.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:414px; top:736px; width:133px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">2. The following children if their
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:402px; top:746px; width:151px; height:20px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">earned income is not more than half
<br>their support.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:414px; top:769px; width:135px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">a. Children age 18 at the end of
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:402px; top:779px; width:23px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">2008.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:414px; top:790px; width:146px; height:10px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">b. Children over age 18 and under
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:401px; top:800px; width:164px; height:204px;"><span style="font-family: PDDIPA+Helvetica; font-size:10px">age 24 at the end of 2008 who are
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">full-time students.
<br>The election to report a child’s
<br>investment income on a parent’s return
<br>and the special rule for when a child
<br>must file Form 6251 will also apply to
<br>the children listed above.
<br></span><span style="font-family: PDDJAD+Helvetica-Bold; font-size:11px">Expiring tax benefits. </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The following
<br>benefits are scheduled to expire and
<br>will not apply for 2008.
<br></span><span style="font-family: PDDJDF+Symbol; font-size:14px">• </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">Deduction for educator expenses in
<br>figuring adjusted gross income.
<br></span><span style="font-family: PDDJDF+Symbol; font-size:14px">• </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">The exclusion from income of
<br>qualified charitable deductions.
<br></span><span style="font-family: PDDJDF+Symbol; font-size:14px">• </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">Credit for nonbusiness energy
<br>property.
<br></span><span style="font-family: PDDJDF+Symbol; font-size:14px">• </span><span style="font-family: PDDIPA+Helvetica; font-size:10px">District of Columbia first-time
<br></span><span style="font-family: PDDIPA+Helvetica; font-size:10px">homebuyer credit (for homes
<br>purchased after 2007).
<br></span></div><span style="position:absolute; border: black 1px solid; left:497px; top:157px; width:67px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:42px; top:184px; width:528px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:428px; top:298px; width:11px; height:25px;"></span>
<span style="position:absolute; border: black 1px solid; left:426px; top:298px; width:5px; height:18px;"></span>
<span style="position:absolute; border: black 1px solid; left:430px; top:301px; width:7px; height:23px;"></span>
<span style="position:absolute; border: black 1px solid; left:430px; top:308px; width:4px; height:13px;"></span>
<span style="position:absolute; border: black 1px solid; left:418px; top:304px; width:11px; height:12px;"></span>
<span style="position:absolute; border: black 1px solid; left:422px; top:315px; width:8px; height:10px;"></span>
<span style="position:absolute; border: black 1px solid; left:414px; top:315px; width:8px; height:10px;"></span>
<span style="position:absolute; border: black 1px solid; left:421px; top:309px; width:1px; height:10px;"></span>
<span style="position:absolute; border: black 1px solid; left:407px; top:297px; width:17px; height:26px;"></span>
<span style="position:absolute; border: black 1px solid; left:420px; top:306px; width:3px; height:1px;"></span>
<span style="position:absolute; border: black 1px solid; left:416px; top:318px; width:3px; height:5px;"></span>
<span style="position:absolute; border: black 1px solid; left:424px; top:318px; width:3px; height:5px;"></span>
<span style="position:absolute; border: black 1px solid; left:42px; top:417px; width:528px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:42px; top:638px; width:27px; height:27px;"></span>
<span style="position:absolute; border: black 1px solid; left:45px; top:641px; width:21px; height:18px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

@ -1,220 +0,0 @@
PAGER/SGML
Page 1 of 48 Instructions for Form 1040NR
Userid: ________ DTD INSTR04
Fileid:
D:\USERS\8fllb\documents\epicfiles\2007Instructions1040NR.sgm
Leadpct: 0% Pt. size: 9.5 ❏ Draft
❏ Ok to Print
(Init. & date)
7:48 - 6-DEC-2007
The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
2007
Instructions for
Form 1040NR
U.S. Nonresident Alien Income Tax Return
Section references are to the Internal
Revenue Code unless otherwise noted.
use a different address this year. See
Where To File on page 4.
General Instructions deduction. The deduction rate for
Domestic production activities
What’s New for 2007
Tax benefits extended. The following
tax benefits were extended through
2007.
• Deduction for educator expenses in
figuring adjusted gross income.
• District of Columbia first-time
homebuyer credit.
Alternative minimum tax (AMT)
exemption amount decreased. The
AMT exemption amount is decreased to
$33,750 ($45,000 if a qualifying
widow(er); $22,500 if married filing
separately).
For
If you are
2007 is increased to 6%.
Unreported social security and
Medicare tax on wages.
an employee and your employer did not
withhold social security and Medicare
tax, see Form 8919 to figure and report
this tax.
Refundable credit for prior-year
minimum tax.
If you have an unused
minimum tax credit carryforward from
2004, see Form 8801 to find if you can
take this credit.
Health savings account (HSA)
funding distributions. You may be
able to elect to exclude from income a
distribution made from your IRA to your
HSA. See the instructions for lines 16a
and 16b beginning on page 12.
New recordkeeping requirements for
contributions of money.
charitable contributions of money,
regardless of the amount, you must
maintain as a record of the contribution
a bank record (such as a cancelled
check) or a written record from the
charity. The written record must include
the name of the charity, date, and
amount of the contribution. See Gifts to
U.S. Charities that begins on page 26.
Exemption for housing a person
displaced by Hurricane Katrina
expires. The additional exemption
amount for housing a person displaced
by Hurricane Katrina does not apply for
2007 or later years.
Telephone excise tax credit.
credit was available only on your 2006
return. If you filed but did not request it
on your 2006 return, file Form 1040X
using a simplified procedure explained
in its instructions to amend your 2006
return. If you were not required to file a
2006 return, see the 2006 Form
1040EZ-T.
What’s New for 2008
IRA deduction expanded. You may
be able to deduct up to $5,000 ($6,000
if age 50 or older at the end of the
year). You may be able to take an IRA
deduction if you were covered by a
This
Cat. No. 11368V
!
At the time these instructions
went to print, Congress was
considering legislation that
CAUTION
would increase the amounts above. To
find out if this legislation was enacted,
and for more details, see the
Instructions for Form 6251.
IRA deduction expanded.
covered by a retirement plan, you may
be able to take an IRA deduction if your
2007 modified adjusted gross income
(AGI) is less than $62,000 ($103,000 if
a qualifying widow(er)).
If you were
You may be able to deduct up to an
additional $3,000 if you were a
participant in a 401(k) plan and your
employer was in bankruptcy in an
earlier year.
Standard mileage rates. The 2007
rate for business use of your vehicle is
481/2 cents a mile. The 2007 rate for
use of your vehicle to move is 20 cents
a mile. The special rate for charitable
use of your vehicle to provide relief
related to Hurricane Katrina has
expired.
Elective salary deferrals. The
maximum amount you can defer under
all plans is generally limited to $15,500
($10,500 if you only have SIMPLE
plans; $18,500 for section 403(b) plans
if you qualify for the 15-year rule). See
the instructions for line 8 on page 10.
Mailing your return.
If you are filing
the return for an estate or trust, you will
Department of the Treasury
Internal Revenue Service
retirement plan and your 2008 modified
AGI is less than $63,000 ($105,000) if a
qualifying widow(er)).
You may be able to deduct up to an
additional $3,000 if you were a
participant in a 401(k) plan and your
employer was in bankruptcy in an
earlier year.
Personal exemption and itemized
deduction phaseouts reduced.
Taxpayers with adjusted gross income
above a certain amount may lose part
of their deduction for personal
exemptions and itemized deductions.
The amount by which these deductions
are reduced in 2008 will be only 1/2 of
the amount of the reduction that
otherwise would have applied in 2007.
Capital gain tax rate reduced.
The
5% capital gain tax rate is reduced to
zero.
Tax on childrens income.
8615 will be required to figure the tax
for the following children with
investment income of more than
$1,800.
Form
1. Children under age 18 at the end
of 2008.
2. The following children if their
earned income is not more than half
their support.
a. Children age 18 at the end of
2008.
b. Children over age 18 and under
age 24 at the end of 2008 who are
full-time students.
The election to report a childs
investment income on a parents return
and the special rule for when a child
must file Form 6251 will also apply to
the children listed above.
Expiring tax benefits. The following
benefits are scheduled to expire and
will not apply for 2008.
• Deduction for educator expenses in
figuring adjusted gross income.
• The exclusion from income of
qualified charitable deductions.
• Credit for nonbusiness energy
property.
• District of Columbia first-time
homebuyer credit (for homes
purchased after 2007).

File diff suppressed because it is too large


@ -1,113 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:87px; top:91px; width:155px; height:12px;"><span style="font-family: Ryumin-Light; font-size:12px">平成 </span><span style="font-family: GMALPM+DFHSMincho-W3G014; font-size:9px">™— </span><span style="font-family: Ryumin-Light; font-size:12px">年 </span><span style="font-family: GMALPM+DFHSMincho-W3G014; font-size:9px"> </span><span style="font-family: Ryumin-Light; font-size:12px">月 </span><span style="font-family: GMALPM+DFHSMincho-W3G014; font-size:9px">™œ </span><span style="font-family: Ryumin-Light; font-size:12px">日 金曜日
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:267px; top:89px; width:12px; height:14px;"><span style="font-family: Ryumin-Light; font-size:14px">官
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:315px; top:89px; width:12px; height:14px;"><span style="font-family: Ryumin-Light; font-size:14px">報
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:392px; top:91px; width:65px; height:12px;"><span style="font-family: Ryumin-Light; font-size:12px">第 </span><span style="font-family: GMALPM+DFHSMincho-W3G014; font-size:9px">›Ÿ˜ž </span><span style="font-family: Ryumin-Light; font-size:12px">号
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:527px; top:93px; width:10px; height:9px;"><span style="font-family: GMALPM+DFHSMincho-W3G014; font-size:9px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:540px; top:110px; width:8px; height:65px;"><span style="font-family: GothicBBB-Medium; font-size:9px">政令第百四十九号
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:530px; top:134px; width:8px; height:145px;"><span style="font-family: Ryumin-Light; font-size:9px">道路交通法施行令の一部を改正する政令
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:464px; top:110px; width:64px; height:361px;"><span style="font-family: Ryumin-Light; font-size:9px">内閣は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路交通法の一部を改正する法律</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">平成十九年法律第九十号</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">の一部の施行に伴い</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">並び
<br></span><span style="font-family: Ryumin-Light; font-size:9px">に道路交通法</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">昭和三十五年法律第百五号</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">第四条第一項及び第四項</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第五</span><span style="font-family: Ryumin-Light; font-size:9px">条第一項</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第三十九条第
<br></span><span style="font-family: Ryumin-Light; font-size:9px">一項</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条第九項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二項</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同法第七十二条の二第三項及び</span><span style="font-family: Ryumin-Light; font-size:9px">第七十五条の八第二項に
<br></span><span style="font-family: Ryumin-Light; font-size:9px">おいて準用する場合を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)、</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条の三第一項</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第六十三条の四第一</span><span style="font-family: Ryumin-Light; font-size:9px">項第二号</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第七十一条の
<br></span><span style="font-family: Ryumin-Light; font-size:9px">三第二項ただし書</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第七十一条の六第一項</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第九十条第一項ただし書</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第百</span><span style="font-family: Ryumin-Light; font-size:9px">条の二第一項本文及び第
<br></span><span style="font-family: Ryumin-Light; font-size:9px">四号</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第百二条の二並びに第百二十五条第一項及び第三項の規定に基づき</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">この政令を制定する</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:442px; top:118px; width:20px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">道路交通法施行令</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">昭和三十五年政令第二百七十号</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">の一部を次のように</span><span style="font-family: Ryumin-Light; font-size:9px">改正する</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第一条の二第四項第三号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">一・五メ</span><span style="font-family: Ryumin-Light; font-size:9px">ー</span><span style="font-family: Ryumin-Light; font-size:9px">トル</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">一メ</span><span style="font-family: Ryumin-Light; font-size:9px">ー</span><span style="font-family: Ryumin-Light; font-size:9px">トル</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同条第五項第三号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:431px; top:110px; width:9px; height:247px;"><span style="font-family: Ryumin-Light; font-size:9px">六十三条の四第一項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第六十三条の四第一項第一号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改める</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:420px; top:118px; width:9px; height:351px;"><span style="font-family: Ryumin-Light; font-size:9px">第二条第一項の表の青色の灯火の項第三号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">含む</span><span style="font-family: Ryumin-Light; font-size:9px">。</span><span style="font-family: Ryumin-Light; font-size:9px">青色の</span><span style="font-family: Ryumin-Light; font-size:9px">灯火の矢印の項を除き</span><span style="font-family: Ryumin-Light; font-size:9px">、
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:409px; top:274px; width:9px; height:8px;"><span style="font-family: Ryumin-Light; font-size:9px">「
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:404px; top:286px; width:9px; height:143px;"><span style="font-family: Ryumin-Light; font-size:9px">歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">進行することが</span><span style="font-family: Ryumin-Light; font-size:9px">できること</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:383px; top:110px; width:9px; height:169px;"><span style="font-family: Ryumin-Light; font-size:9px">以下この条において同じ</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同表中
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:379px; top:286px; width:17px; height:185px;"><span style="font-family: Ryumin-Light; font-size:9px">歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路の横断を始</span><span style="font-family: Ryumin-Light; font-size:9px">めてはならず</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">また</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道
<br></span><span style="font-family: Ryumin-Light; font-size:9px">横断を終わるか</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">又は横断</span><span style="font-family: Ryumin-Light; font-size:9px">をやめて引き返さなけれ
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:456px; top:478px; width:93px; height:361px;"><span style="font-family: Ryumin-Light; font-size:9px">一の五 医療機関が</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">傷病者の緊急搬送を</span><span style="font-family: Ryumin-Light; font-size:9px">しようとする都道府県又は市町村の</span><span style="font-family: Ryumin-Light; font-size:9px">要請を受けて</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">当該
<br></span><span style="font-family: Ryumin-Light; font-size:9px">傷病者が医療機関に緊急搬送をされるま</span><span style="font-family: Ryumin-Light; font-size:9px">での間における応急の治療を行う医</span><span style="font-family: Ryumin-Light; font-size:9px">師を当該傷病者の所
<br></span><span style="font-family: Ryumin-Light; font-size:9px">在する場所にまで運搬するために使用す</span><span style="font-family: Ryumin-Light; font-size:9px">る自動車
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第十六条中第二号を削り</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第三号を第二号</span><span style="font-family: Ryumin-Light; font-size:9px">とする</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第十六条の二及び第十六条の三中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第五十</span><span style="font-family: Ryumin-Light; font-size:9px">一条第十一項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条第十</span><span style="font-family: Ryumin-Light; font-size:9px">二項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改める</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第十六条の五中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条第二十項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条第二十一項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改める</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第十七条中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条第二十一項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第五十一条第二十二項</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、「「</span><span style="font-family: Ryumin-Light; font-size:9px">前</span><span style="font-family: Ryumin-Light; font-size:9px">号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">とあるのは</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">前
<br></span><span style="font-family: Ryumin-Light; font-size:9px">号の公示に係る積載物のうち特に貴重と認め</span><span style="font-family: Ryumin-Light; font-size:9px">られるものについては</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">と</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同条第三号中</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を削
<br></span><span style="font-family: Ryumin-Light; font-size:9px">る</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:435px; top:486px; width:20px; height:128px;"><span style="font-family: Ryumin-Light; font-size:9px">第十七条の二を次のように改める</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">委託することのできない事務</span><span style="font-family: Ryumin-Light; font-size:9px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:424px; top:478px; width:9px; height:327px;"><span style="font-family: GothicBBB-Medium; font-size:9px">第十七条の二 </span><span style="font-family: Ryumin-Light; font-size:9px">法第五十一条の三第一項の政</span><span style="font-family: Ryumin-Light; font-size:9px">令で定めるものは</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">次に掲げるとお</span><span style="font-family: Ryumin-Light; font-size:9px">りとする</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:403px; top:486px; width:19px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">一 法第五十一条第五項の規定による車両</span><span style="font-family: Ryumin-Light; font-size:9px">の移動の決定
<br></span><span style="font-family: Ryumin-Light; font-size:9px">二 法第五十一条第六項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二項</span><span style="font-family: Ryumin-Light; font-size:9px">において準用する場合を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の規</span><span style="font-family: Ryumin-Light; font-size:9px">定により保管した車
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:392px; top:494px; width:9px; height:217px;"><span style="font-family: Ryumin-Light; font-size:9px">両</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">積載物を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。</span><span style="font-family: Ryumin-Light; font-size:9px">以下この条において</span><span style="font-family: Ryumin-Light; font-size:9px">同じ</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の返還の決定
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:382px; top:486px; width:9px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">三 法第五十一条第七項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二項</span><span style="font-family: Ryumin-Light; font-size:9px">において読み替えて準用する場合を</span><span style="font-family: Ryumin-Light; font-size:9px">含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">又は第八項の
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:372px; top:494px; width:8px; height:57px;"><span style="font-family: Ryumin-Light; font-size:9px">規定による告知
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:362px; top:286px; width:9px; height:159px;"><span style="font-family: Ryumin-Light; font-size:9px">歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路を横断して</span><span style="font-family: Ryumin-Light; font-size:9px">はならないこと</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:361px; top:486px; width:9px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">四 法第五十一条第九項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二項</span><span style="font-family: Ryumin-Light; font-size:9px">において読み替えて準用する場合を</span><span style="font-family: Ryumin-Light; font-size:9px">含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の規定による
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:346px; top:298px; width:9px; height:8px;"><span style="font-family: Ryumin-Light; font-size:9px">「
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:321px; top:310px; width:29px; height:161px;"><span style="font-family: Ryumin-Light; font-size:9px">一 歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">進行</span><span style="font-family: Ryumin-Light; font-size:9px">することができること</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">二 普通自転車</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">法第六十三条の三に規定す
<br></span><span style="font-family: Ryumin-Light; font-size:9px">号において同じ</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">横断歩道において直
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:294px; top:110px; width:17px; height:169px;"><span style="font-family: Ryumin-Light; font-size:9px">路を横断している歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">すみやかに</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">その
<br>ばならないこと</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:299px; top:290px; width:8px; height:9px;"><span style="font-family: Ryumin-Light; font-size:9px">を
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:272px; top:282px; width:9px; height:8px;"><span style="font-family: Ryumin-Light; font-size:9px">」
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:285px; top:310px; width:28px; height:163px;"><span style="font-family: Ryumin-Light; font-size:9px">一 歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路の</span><span style="font-family: Ryumin-Light; font-size:9px">横断を始めてはならず</span><span style="font-family: Ryumin-Light; font-size:9px">、
<br></span><span style="font-family: Ryumin-Light; font-size:9px">横断を終わるか</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">又は横断をやめて引き返
<br></span><span style="font-family: Ryumin-Light; font-size:9px">二 横断歩道を進行</span><span style="font-family: Ryumin-Light; font-size:9px">しようとする普通自転車
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:256px; top:310px; width:20px; height:161px;"><span style="font-family: Ryumin-Light; font-size:9px">一 歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路</span><span style="font-family: Ryumin-Light; font-size:9px">を横断してはならないこ
<br></span><span style="font-family: Ryumin-Light; font-size:9px">二 横断歩道を進行</span><span style="font-family: Ryumin-Light; font-size:9px">しようとする普通自転車
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:214px; top:110px; width:17px; height:193px;"><span style="font-family: Ryumin-Light; font-size:9px">る普通自転車をいう</span><span style="font-family: Ryumin-Light; font-size:9px">。</span><span style="font-family: Ryumin-Light; font-size:9px">以下この条及び第二十六条第三
<br>進をし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">又は左折することができること</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:177px; top:110px; width:29px; height:193px;"><span style="font-family: Ryumin-Light; font-size:9px">また</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路を横断している歩行者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">速やかに</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">その
<br>さなければならないこと</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路の横断を始めてはならないこと</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:191px; top:310px; width:9px; height:161px;"><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同条第四項</span><span style="font-family: Ryumin-Light; font-size:9px">の表の人の形の記号を有
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:148px; top:110px; width:21px; height:151px;"><span style="font-family: Ryumin-Light; font-size:9px">と</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">道路の横断を始めてはならないこと</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:143px; top:306px; width:9px; height:8px;"><span style="font-family: Ryumin-Light; font-size:9px">」
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:121px; top:110px; width:20px; height:361px;"><span style="font-family: Ryumin-Light; font-size:9px">する青色の灯火の項第二号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">直進</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">右折しようとして右折する地点まで直</span><span style="font-family: Ryumin-Light; font-size:9px">進し</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">その地点において
<br></span><span style="font-family: Ryumin-Light; font-size:9px">右折することを含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">し</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">直進をし</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改める</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:45px; top:110px; width:74px; height:361px;"><span style="font-family: Ryumin-Light; font-size:9px">第三条の二第一項中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">行なわせる</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">行わせる</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に</span><span style="font-family: Ryumin-Light; font-size:9px">、「</span><span style="font-family: Ryumin-Light; font-size:9px">次の各号に</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">次に</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に</span><span style="font-family: Ryumin-Light; font-size:9px">、「</span><span style="font-family: Ryumin-Light; font-size:9px">こえない</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を
<br></span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">超えない</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第十号を第十二号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第四号から第九号までを二号</span><span style="font-family: Ryumin-Light; font-size:9px">ずつ繰り下げ</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第三号を
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第四号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号の次に次の一号を加える</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">五 法第二十五条の二第二項の道路標識等
<br>第三条の二第一項第二号の次に次の一号を加える</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">三 法第十三条第二項の道路標識等
<br>第十三条第一項中第一号の五を第一号の六とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第一号の四の次に次の一</span><span style="font-family: Ryumin-Light; font-size:9px">号を加える</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:351px; top:494px; width:8px; height:17px;"><span style="font-family: Ryumin-Light; font-size:9px">公示
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:340px; top:486px; width:9px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">五 法第五十一条第十項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二項</span><span style="font-family: Ryumin-Light; font-size:9px">において準用する場合を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の規</span><span style="font-family: Ryumin-Light; font-size:9px">定による公示の日付
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:330px; top:494px; width:8px; height:57px;"><span style="font-family: Ryumin-Light; font-size:9px">及び内容の公表
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:319px; top:486px; width:9px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">六 法第五十一条第十二項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二</span><span style="font-family: Ryumin-Light; font-size:9px">項において読み替えて準用する場合</span><span style="font-family: Ryumin-Light; font-size:9px">を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の規定によ
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:309px; top:494px; width:8px; height:73px;"><span style="font-family: Ryumin-Light; font-size:9px">る車両の売却の決定
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:298px; top:486px; width:9px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">七 法第五十一条第十三項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二</span><span style="font-family: Ryumin-Light; font-size:9px">項において準用する場合を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の</span><span style="font-family: Ryumin-Light; font-size:9px">規定による車両の廃
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:288px; top:494px; width:8px; height:33px;"><span style="font-family: Ryumin-Light; font-size:9px">棄の決定
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:277px; top:486px; width:9px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">八 法第五十一条第十六項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二</span><span style="font-family: Ryumin-Light; font-size:9px">項において読み替えて準用する場合</span><span style="font-family: Ryumin-Light; font-size:9px">を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の規定によ
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:267px; top:494px; width:8px; height:25px;"><span style="font-family: Ryumin-Light; font-size:9px">る命令
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:224px; top:486px; width:41px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">九 法第五十一条第十七項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二</span><span style="font-family: Ryumin-Light; font-size:9px">項において準用する場合を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の</span><span style="font-family: Ryumin-Light; font-size:9px">規定による督促
<br></span><span style="font-family: Ryumin-Light; font-size:9px">十 法第五十一条第十八項</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">同条第二十二</span><span style="font-family: Ryumin-Light; font-size:9px">項において準用する場合を含む</span><span style="font-family: Ryumin-Light; font-size:9px">。)</span><span style="font-family: Ryumin-Light; font-size:9px">の</span><span style="font-family: Ryumin-Light; font-size:9px">規定による徴収
<br></span><span style="font-family: Ryumin-Light; font-size:9px">十一 法第五十一条第二十一項の規定によ</span><span style="font-family: Ryumin-Light; font-size:9px">る登録の嘱託
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第十七条の三を削り</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第十七条の四を第十</span><span style="font-family: Ryumin-Light; font-size:9px">七条の三とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第十七条の五から第</span><span style="font-family: Ryumin-Light; font-size:9px">十七条の八までを一
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:213px; top:478px; width:9px; height:71px;"><span style="font-family: Ryumin-Light; font-size:9px">条ずつ繰り上げる</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:171px; top:486px; width:41px; height:280px;"><span style="font-family: Ryumin-Light; font-size:9px">第二十二条第一号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">乗車装置</span><span style="font-family: Ryumin-Light; font-size:9px"></span><span style="font-family: Ryumin-Light; font-size:9px">以下</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">の</span><span style="font-family: Ryumin-Light; font-size:9px">下に</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">この条において</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を加える</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第二十四条の二中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第二十六条</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">第二</span><span style="font-family: Ryumin-Light; font-size:9px">十五条の二</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改める</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第二十六条を第二十五条の二とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">第三章</span><span style="font-family: Ryumin-Light; font-size:9px">中同条の次に次の一条を加える</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span><span style="font-family: Ryumin-Light; font-size:9px">普通自転車により歩道を通行することが</span><span style="font-family: Ryumin-Light; font-size:9px">できる者</span><span style="font-family: Ryumin-Light; font-size:9px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:161px; top:478px; width:9px; height:335px;"><span style="font-family: GothicBBB-Medium; font-size:9px">第二十六条 </span><span style="font-family: Ryumin-Light; font-size:9px">法第六十三条の四第一項第二号</span><span style="font-family: Ryumin-Light; font-size:9px">の政令で定める者は</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">次に掲げると</span><span style="font-family: Ryumin-Light; font-size:9px">おりとする</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:130px; top:486px; width:29px; height:353px;"><span style="font-family: Ryumin-Light; font-size:9px">一 児童及び幼児
<br>二 七十歳以上の者
<br>三 普通自転車により安全に車道を通行す</span><span style="font-family: Ryumin-Light; font-size:9px">ることに支障を生ずる程度の身体の</span><span style="font-family: Ryumin-Light; font-size:9px">障害として内閣府令
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:119px; top:494px; width:8px; height:89px;"><span style="font-family: Ryumin-Light; font-size:9px">で定めるものを有する者
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:45px; top:478px; width:72px; height:361px;"><span style="font-family: Ryumin-Light; font-size:9px">第二十六条の三の二第一項第四号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">次項</span><span style="font-family: Ryumin-Light; font-size:9px">第三号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">次項第四号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同項第七号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">次項
<br></span><span style="font-family: Ryumin-Light; font-size:9px">第六号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">次項第七号</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同条第二</span><span style="font-family: Ryumin-Light; font-size:9px">項第七号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改</span><span style="font-family: Ryumin-Light; font-size:9px">め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第八
<br></span><span style="font-family: Ryumin-Light; font-size:9px">号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同項第六号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に</span><span style="font-family: Ryumin-Light; font-size:9px">改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第七号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同項</span><span style="font-family: Ryumin-Light; font-size:9px">第五号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を
<br></span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第六号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同</span><span style="font-family: Ryumin-Light; font-size:9px">項第四号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改</span><span style="font-family: Ryumin-Light; font-size:9px">め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第五
<br></span><span style="font-family: Ryumin-Light; font-size:9px">号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同項第三号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に</span><span style="font-family: Ryumin-Light; font-size:9px">改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第四号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同項</span><span style="font-family: Ryumin-Light; font-size:9px">第二号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を
<br></span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第三号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同</span><span style="font-family: Ryumin-Light; font-size:9px">項第一号中</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">の横</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">を</span><span style="font-family: Ryumin-Light; font-size:9px">「</span><span style="font-family: Ryumin-Light; font-size:9px">以外</span><span style="font-family: Ryumin-Light; font-size:9px">」</span><span style="font-family: Ryumin-Light; font-size:9px">に改</span><span style="font-family: Ryumin-Light; font-size:9px">め</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同号を同項第二
<br></span><span style="font-family: Ryumin-Light; font-size:9px">号とし</span><span style="font-family: Ryumin-Light; font-size:9px">、</span><span style="font-family: Ryumin-Light; font-size:9px">同項に第一号として次の一号を加え</span><span style="font-family: Ryumin-Light; font-size:9px">る</span><span style="font-family: Ryumin-Light; font-size:9px">。
<br></span></div><span style="position:absolute; border: black 1px solid; left:144px; top:111px; width:273px; height:360px;"></span>
<span style="position:absolute; border: black 1px solid; left:46px; top:475px; width:503px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:37px; top:107px; width:520px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:37px; top:843px; width:520px; height:0px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

@ -1,157 +0,0 @@
平成 ™— 年 月 ™œ 日 金曜日
第 ›Ÿ˜ž 号
政令第百四十九号
道路交通法施行令の一部を改正する政令
内閣は、道路交通法の一部を改正する法律(平成十九年法律第九十号)の一部の施行に伴い、並び
に道路交通法(昭和三十五年法律第百五号)第四条第一項及び第四項、第五条第一項、第三十九条第
一項、第五十一条第九項(同条第二十二項、同法第七十二条の二第三項及び第七十五条の八第二項に
おいて準用する場合を含む。)、第五十一条の三第一項、第六十三条の四第一項第二号、第七十一条の
三第二項ただし書、第七十一条の六第一項、第九十条第一項ただし書、第百条の二第一項本文及び第
四号、第百二条の二並びに第百二十五条第一項及び第三項の規定に基づき、この政令を制定する。
道路交通法施行令(昭和三十五年政令第二百七十号)の一部を次のように改正する。
第一条の二第四項第三号中「一・五メートル」を「一メートル」に改め、同条第五項第三号中「第
六十三条の四第一項」を「第六十三条の四第一項第一号」に改める。
第二条第一項の表の青色の灯火の項第三号中「含む。)」を「含む。青色の灯火の矢印の項を除き、
歩行者は、進行することができること。
以下この条において同じ。)を」に改め、同表中
歩行者は、道路の横断を始めてはならず、また、道
横断を終わるか、又は横断をやめて引き返さなけれ
一の五 医療機関が、傷病者の緊急搬送をしようとする都道府県又は市町村の要請を受けて、当該
傷病者が医療機関に緊急搬送をされるまでの間における応急の治療を行う医師を当該傷病者の所
在する場所にまで運搬するために使用する自動車
第十六条中第二号を削り、第三号を第二号とする。
第十六条の二及び第十六条の三中「第五十一条第十一項」を「第五十一条第十二項」に改める。
第十六条の五中「第五十一条第二十項」を「第五十一条第二十一項」に改める。
第十七条中「第五十一条第二十一項」を「第五十一条第二十二項」に改め、「「前号」とあるのは「前
号の公示に係る積載物のうち特に貴重と認められるものについては、同号」と、同条第三号中」を削
る。
第十七条の二を次のように改める。
(委託することのできない事務)
第十七条の二 法第五十一条の三第一項の政令で定めるものは、次に掲げるとおりとする。
一 法第五十一条第五項の規定による車両の移動の決定
二 法第五十一条第六項(同条第二十二項において準用する場合を含む。)の規定により保管した車
両(積載物を含む。以下この条において同じ。)の返還の決定
三 法第五十一条第七項(同条第二十二項において読み替えて準用する場合を含む。)又は第八項の
規定による告知
歩行者は、道路を横断してはならないこと。
四 法第五十一条第九項(同条第二十二項において読み替えて準用する場合を含む。)の規定による
一 歩行者は、進行することができること。
二 普通自転車(法第六十三条の三に規定す
号において同じ。)は、横断歩道において直
路を横断している歩行者は、すみやかに、その
ばならないこと。
一 歩行者は、道路の横断を始めてはならず、
横断を終わるか、又は横断をやめて引き返
二 横断歩道を進行しようとする普通自転車
一 歩行者は、道路を横断してはならないこ
二 横断歩道を進行しようとする普通自転車
る普通自転車をいう。以下この条及び第二十六条第三
進をし、又は左折することができること。
また、道路を横断している歩行者は、速やかに、その
さなければならないこと。
は、道路の横断を始めてはならないこと。
に改め、同条第四項の表の人の形の記号を有
と。
は、道路の横断を始めてはならないこと。
する青色の灯火の項第二号中「直進(右折しようとして右折する地点まで直進し、その地点において
右折することを含む。)し」を「直進をし」に改める。
第三条の二第一項中「行なわせる」を「行わせる」に、「次の各号に」を「次に」に、「こえない」を
「超えない」に改め、第十号を第十二号とし、第四号から第九号までを二号ずつ繰り下げ、第三号を
第四号とし、同号の次に次の一号を加える。
五 法第二十五条の二第二項の道路標識等
第三条の二第一項第二号の次に次の一号を加える。
三 法第十三条第二項の道路標識等
第十三条第一項中第一号の五を第一号の六とし、第一号の四の次に次の一号を加える。
公示
五 法第五十一条第十項(同条第二十二項において準用する場合を含む。)の規定による公示の日付
及び内容の公表
六 法第五十一条第十二項(同条第二十二項において読み替えて準用する場合を含む。)の規定によ
る車両の売却の決定
七 法第五十一条第十三項(同条第二十二項において準用する場合を含む。)の規定による車両の廃
棄の決定
八 法第五十一条第十六項(同条第二十二項において読み替えて準用する場合を含む。)の規定によ
る命令
九 法第五十一条第十七項(同条第二十二項において準用する場合を含む。)の規定による督促
十 法第五十一条第十八項(同条第二十二項において準用する場合を含む。)の規定による徴収
十一 法第五十一条第二十一項の規定による登録の嘱託
第十七条の三を削り、第十七条の四を第十七条の三とし、第十七条の五から第十七条の八までを一
条ずつ繰り上げる。
第二十二条第一号中「乗車装置(以下」の下に「この条において」を加える。
第二十四条の二中「第二十六条」を「第二十五条の二」に改める。
第二十六条を第二十五条の二とし、第三章中同条の次に次の一条を加える。
(普通自転車により歩道を通行することができる者)
第二十六条 法第六十三条の四第一項第二号の政令で定める者は、次に掲げるとおりとする。
一 児童及び幼児
二 七十歳以上の者
三 普通自転車により安全に車道を通行することに支障を生ずる程度の身体の障害として内閣府令
で定めるものを有する者
第二十六条の三の二第一項第四号中「次項第三号」を「次項第四号」に改め、同項第七号中「次項
第六号」を「次項第七号」に改め、同条第二項第七号中「の横」を「以外」に改め、同号を同項第八
号とし、同項第六号中「の横」を「以外」に改め、同号を同項第七号とし、同項第五号中「の横」を
「以外」に改め、同号を同項第六号とし、同項第四号中「の横」を「以外」に改め、同号を同項第五
号とし、同項第三号中「の横」を「以外」に改め、同号を同項第四号とし、同項第二号中「の横」を
「以外」に改め、同号を同項第三号とし、同項第一号中「の横」を「以外」に改め、同号を同項第二
号とし、同項に第一号として次の一号を加える。

File diff suppressed because it is too large.

@ -1,83 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:125px; width:452px; height:18px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:18px">Preemptive Information Extraction using Unrestricted Relation Discovery
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:148px; top:164px; width:91px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">Yusuke Shinyama
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:389px; top:164px; width:74px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">Satoshi Sekine
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:255px; top:187px; width:101px; height:14px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:14px">New York University
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:244px; top:201px; width:122px; height:14px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:14px">715, Broadway, 7th Floor
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:252px; top:215px; width:106px; height:14px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:14px">New York, NY, 10003
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:222px; top:224px; width:168px; height:18px;"><span style="font-family: ZNQAHA+CMSY10; font-size:18px">{</span><span style="font-family: CXOZYQ+NimbusMonL-Regu; font-size:11px">yusuke,sekine</span><span style="font-family: ZNQAHA+CMSY10; font-size:18px">}</span><span style="font-family: CXOZYQ+NimbusMonL-Regu; font-size:11px">@cs.nyu.edu
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:281px; width:44px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">Abstract
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:310px; width:183px; height:175px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">We are trying to extend the boundary of
<br>Information Extraction (IE) systems. Ex-
<br>isting IE systems require a lot of time and
<br>human effort to tune for a new scenario.
<br>Preemptive Information Extraction is an
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">attempt to automatically create all feasible
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">IE systems in advance without human in-
<br>tervention. We propose a technique called
<br>Unrestricted Relation Discovery that dis-
<br>covers all possible relations from texts and
<br>presents them as tables. We present a pre-
<br>liminary system that obtains reasonably
<br>good results.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:505px; width:80px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">1 Background
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:528px; width:226px; height:243px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">Every day, a large number of news articles are cre-
<br>ated and reported, many of which are unique. But
<br>certain types of events, such as hurricanes or mur-
<br>ders, are reported again and again throughout a year.
<br>The goal of Information Extraction, or IE, is to re-
<br>trieve a certain type of news event from past articles
<br>and present the events as a table whose columns are
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">filled with a name of a person or company, accord-
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">ing to its role in the event. However, existing IE
<br>techniques require a lot of human labor. First, you
<br>have to specify the type of information you want and
<br>collect articles that include this information. Then,
<br>you have to analyze the articles and manually craft
<br>a set of patterns to capture these events. Most exist-
<br>ing IE research focuses on reducing this burden by
<br>helping people create such patterns. But each time
<br>you want to extract a different kind of information,
<br>you need to repeat the whole process: specify arti-
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:284px; width:226px; height:94px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">cles and adjust its patterns, either manually or semi-
<br>automatically. There is a bit of a dangerous pitfall
<br>here. First, it is hard to estimate how good the sys-
<br>tem can be after months of work. Furthermore, you
<br>might not know if the task is even doable in the first
<br>place. Knowing what kind of information is easily
<br>obtained in advance would help reduce this risk.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:379px; width:226px; height:175px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">An IE task can be defined as finding a relation
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">among several entities involved in a certain type of
<br>event. For example, in the MUC-6 management
<br>succession scenario, one seeks a relation between
<br>COMPANY, PERSON and POST involved with hir-
<br>ing/firing events. For each row of an extracted ta-
<br>ble, you can always read it as “COMPANY hired
<br>(or fired) PERSON for POST.” The relation between
<br>these entities is retained throughout the table. There
<br>are many existing works on obtaining extraction pat-
<br>terns for pre-defined relations (Riloff, 1996; Yangar-
<br>ber et al., 2000; Agichtein and Gravano, 2000; Sudo
<br>et al., 2003).
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:555px; width:226px; height:216px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">Unrestricted Relation Discovery is a technique to
<br>automatically discover such relations that repeatedly
<br>appear in a corpus and present them as a table, with
<br>absolutely no human intervention. Unlike most ex-
<br>isting IE research, a user does not specify the type
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">of articles or information wanted. Instead, a system
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">tries to find all the kinds of relations that are reported
<br>multiple times and can be reported in tabular form.
<br>This technique will open up the possibility of try-
<br>ing new IE scenarios. Furthermore, the system itself
<br>can be used as an IE system, since an obtained re-
<br>lation is already presented as a table. If this system
<br>works to a certain extent, tuning an IE system be-
<br>comes a search problem: all the tables are already
<br>built “preemptively.” A user only needs to search
<br>for a relevant table.
<br></span></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

@ -1,91 +0,0 @@
Preemptive Information Extraction using Unrestricted Relation Discovery
Yusuke Shinyama
Satoshi Sekine
New York University
715, Broadway, 7th Floor
New York, NY, 10003
{yusuke,sekine}@cs.nyu.edu
Abstract
We are trying to extend the boundary of
Information Extraction (IE) systems. Ex-
isting IE systems require a lot of time and
human effort to tune for a new scenario.
Preemptive Information Extraction is an
attempt to automatically create all feasible
IE systems in advance without human in-
tervention. We propose a technique called
Unrestricted Relation Discovery that dis-
covers all possible relations from texts and
presents them as tables. We present a pre-
liminary system that obtains reasonably
good results.
1 Background
Every day, a large number of news articles are cre-
ated and reported, many of which are unique. But
certain types of events, such as hurricanes or mur-
ders, are reported again and again throughout a year.
The goal of Information Extraction, or IE, is to re-
trieve a certain type of news event from past articles
and present the events as a table whose columns are
filled with a name of a person or company, accord-
ing to its role in the event. However, existing IE
techniques require a lot of human labor. First, you
have to specify the type of information you want and
collect articles that include this information. Then,
you have to analyze the articles and manually craft
a set of patterns to capture these events. Most exist-
ing IE research focuses on reducing this burden by
helping people create such patterns. But each time
you want to extract a different kind of information,
you need to repeat the whole process: specify arti-
cles and adjust its patterns, either manually or semi-
automatically. There is a bit of a dangerous pitfall
here. First, it is hard to estimate how good the sys-
tem can be after months of work. Furthermore, you
might not know if the task is even doable in the first
place. Knowing what kind of information is easily
obtained in advance would help reduce this risk.
An IE task can be defined as finding a relation
among several entities involved in a certain type of
event. For example, in the MUC-6 management
succession scenario, one seeks a relation between
COMPANY, PERSON and POST involved with hir-
ing/firing events. For each row of an extracted ta-
ble, you can always read it as “COMPANY hired
(or fired) PERSON for POST.” The relation between
these entities is retained throughout the table. There
are many existing works on obtaining extraction pat-
terns for pre-defined relations (Riloff, 1996; Yangar-
ber et al., 2000; Agichtein and Gravano, 2000; Sudo
et al., 2003).
Unrestricted Relation Discovery is a technique to
automatically discover such relations that repeatedly
appear in a corpus and present them as a table, with
absolutely no human intervention. Unlike most ex-
isting IE research, a user does not specify the type
of articles or information wanted. Instead, a system
tries to find all the kinds of relations that are reported
multiple times and can be reported in tabular form.
This technique will open up the possibility of try-
ing new IE scenarios. Furthermore, the system itself
can be used as an IE system, since an obtained re-
lation is already presented as a table. If this system
works to a certain extent, tuning an IE system be-
comes a search problem: all the tables are already
built “preemptively.” A user only needs to search
for a relevant table.

File diff suppressed because it is too large


@ -1,15 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:800px; height:600px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:62px; top:126px; width:672px; height:157px;"><span style="font-family: DAFPJF+HiraKakuPro-W6; font-size:85px">コンパラブルな新聞記事からの
<br>固有表現の発見
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:263px; top:374px; width:468px; height:212px;"><span style="font-family: DAFPJF+HiraKakuPro-W6; font-size:64px">新山 祐介
<br>関根 聡
<br></span><span style="font-family: DAFPJF+HiraKakuPro-W6; font-size:50px">Computer Science Department
<br>New York University
<br></span></div><span style="position:absolute; border: black 1px solid; left:0px; top:50px; width:800px; height:600px;"></span>
<span style="position:absolute; border: black 1px solid; left:50px; top:308px; width:510px; height:0px;"></span>
<div style="position:absolute; border: figure 1px solid; writing-mode:False; left:25px; top:587px; width:41px; height:40px;"></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>


@ -1,9 +0,0 @@
コンパラブルな新聞記事からの
固有表現の発見
新山 祐介
関根 聡
Computer Science Department
New York University


@ -1,120 +0,0 @@
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,800.000,600.000" rotate="0">
<textbox id="0" bbox="62.000,365.240,734.000,523.160">
<textline bbox="62.000,437.240,734.000,523.160">
<text font="DAFPJF+HiraKakuPro-W6" bbox="62.000,437.240,110.000,523.160" size="85.920">コ</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="110.000,437.240,158.000,523.160" size="85.920">ン</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="158.000,437.240,206.000,523.160" size="85.920">パ</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="206.000,437.240,254.000,523.160" size="85.920">ラ</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="254.000,437.240,302.000,523.160" size="85.920">ブ</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="302.000,437.240,350.000,523.160" size="85.920">ル</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="350.000,437.240,398.000,523.160" size="85.920">な</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="398.000,437.240,446.000,523.160" size="85.920">新</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="446.000,437.240,494.000,523.160" size="85.920">聞</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="494.000,437.240,542.000,523.160" size="85.920">記</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="542.000,437.240,590.000,523.160" size="85.920">事</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="590.000,437.240,638.000,523.160" size="85.920">か</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="638.000,437.240,686.000,523.160" size="85.920">ら</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="686.000,437.240,734.000,523.160" size="85.920">の</text>
<text>
</text>
</textline>
<textline bbox="62.000,365.240,398.000,451.160">
<text font="DAFPJF+HiraKakuPro-W6" bbox="62.000,365.240,110.000,451.160" size="85.920">固</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="110.000,365.240,158.000,451.160" size="85.920">有</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="158.000,365.240,206.000,451.160" size="85.920">表</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="206.000,365.240,254.000,451.160" size="85.920">現</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="254.000,365.240,302.000,451.160" size="85.920">の</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="302.000,365.240,350.000,451.160" size="85.920">発</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="350.000,365.240,398.000,451.160" size="85.920">見</text>
<text>
</text>
</textline>
</textbox>
<textbox id="1" bbox="263.532,62.640,732.000,275.120">
<textline bbox="576.012,210.680,732.000,275.120">
<text font="DAFPJF+HiraKakuPro-W6" bbox="576.012,210.680,612.012,275.120" size="64.440">新</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="612.012,210.680,648.012,275.120" size="64.440">山</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="648.012,210.680,660.000,275.120" size="64.440"> </text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="660.000,210.680,696.000,275.120" size="64.440">祐</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="696.000,210.680,732.000,275.120" size="64.440">介</text>
<text>
</text>
</textline>
<textline bbox="612.012,154.680,732.000,219.120">
<text font="DAFPJF+HiraKakuPro-W6" bbox="612.012,154.680,648.012,219.120" size="64.440">関</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="648.012,154.680,684.012,219.120" size="64.440">根</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="684.012,154.680,696.000,219.120" size="64.440"> </text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="696.000,154.680,732.000,219.120" size="64.440">聡</text>
<text>
</text>
</textline>
<textline bbox="263.532,106.640,732.000,156.760">
<text font="DAFPJF+HiraKakuPro-W6" bbox="263.532,106.640,285.736,156.760" size="50.120">C</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="285.736,106.640,304.496,156.760" size="50.120">o</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="304.496,106.640,332.776,156.760" size="50.120">m</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="332.776,106.640,352.600,156.760" size="50.120">p</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="352.600,106.640,371.444,156.760" size="50.120">u</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="371.444,106.640,383.596,156.760" size="50.120">t</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="383.596,106.640,401.432,156.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="401.432,106.640,415.208,156.760" size="50.120">r</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="415.208,106.640,424.532,156.760" size="50.120"> </text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="424.532,106.640,444.552,156.760" size="50.120">S</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="444.552,106.640,462.052,156.760" size="50.120">c</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="462.052,106.640,469.640,156.760" size="50.120">i</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="469.640,106.640,487.476,156.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="487.476,106.640,506.320,156.760" size="50.120">n</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="506.320,106.640,523.820,156.760" size="50.120">c</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="523.820,106.640,541.656,156.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="541.656,106.640,550.980,156.760" size="50.120"> </text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="550.980,106.640,573.548,156.760" size="50.120">D</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="573.548,106.640,591.384,156.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="591.384,106.640,611.208,156.760" size="50.120">p</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="611.208,106.640,628.960,156.760" size="50.120">a</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="628.960,106.640,642.736,156.760" size="50.120">r</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="642.736,106.640,654.888,156.760" size="50.120">t</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="654.888,106.640,683.168,156.760" size="50.120">m</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="683.168,106.640,701.004,156.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="701.004,106.640,719.848,156.760" size="50.120">n</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="719.848,106.640,732.000,156.760" size="50.120">t</text>
<text>
</text>
</textline>
<textline bbox="424.140,62.640,732.000,112.760">
<text font="DAFPJF+HiraKakuPro-W6" bbox="424.140,62.640,447.128,112.760" size="50.120">N</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="447.128,62.640,464.964,112.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="464.964,62.640,488.764,112.760" size="50.120">w</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="488.764,62.640,498.088,112.760" size="50.120"> </text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="498.088,62.640,519.312,112.760" size="50.120">Y</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="519.312,62.640,538.072,112.760" size="50.120">o</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="538.072,62.640,551.848,112.760" size="50.120">r</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="551.848,62.640,570.104,112.760" size="50.120">k</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="570.104,62.640,579.428,112.760" size="50.120"> </text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="579.428,62.640,602.640,112.760" size="50.120">U</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="602.640,62.640,621.484,112.760" size="50.120">n</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="621.484,62.640,629.072,112.760" size="50.120">i</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="629.072,62.640,646.460,112.760" size="50.120">v</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="646.460,62.640,664.296,112.760" size="50.120">e</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="664.296,62.640,678.072,112.760" size="50.120">r</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="678.072,62.640,694.676,112.760" size="50.120">s</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="694.676,62.640,702.264,112.760" size="50.120">i</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="702.264,62.640,714.416,112.760" size="50.120">t</text>
<text font="DAFPJF+HiraKakuPro-W6" bbox="714.416,62.640,732.000,112.760" size="50.120">y</text>
<text>
</text>
</textline>
</textbox>
<rect linewidth="0" bbox="0.000,0.000,800.000,600.000" />
<line linewidth="8" bbox="50.000,342.000,560.000,342.000" />
<figure name="Im1" bbox="25.000,23.000,66.000,63.000">
<image width="41" height="40" />
</figure>
<layout>
<textgroup bbox="62.000,62.640,734.000,523.160">
<textbox id="0" bbox="62.000,365.240,734.000,523.160" />
<textbox id="1" bbox="263.532,62.640,732.000,275.120" />
</textgroup>
</layout>
</page>
</pages>


@ -1,15 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:100px; top:119px; width:61px; height:27px;"><span style="font-family: Helvetica; font-size:27px">Hello
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:261px; top:119px; width:62px; height:27px;"><span style="font-family: Helvetica; font-size:27px">World
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:100px; top:219px; width:61px; height:27px;"><span style="font-family: Helvetica; font-size:27px">Hello
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:261px; top:219px; width:62px; height:27px;"><span style="font-family: Helvetica; font-size:27px">World
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:100px; top:319px; width:111px; height:27px;"><span style="font-family: Helvetica; font-size:27px">H e l l o
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:321px; top:319px; width:102px; height:27px;"><span style="font-family: Helvetica; font-size:27px">W o r l d
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:100px; top:419px; width:111px; height:27px;"><span style="font-family: Helvetica; font-size:27px">H e l l o
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:321px; top:419px; width:102px; height:27px;"><span style="font-family: Helvetica; font-size:27px">W o r l d
<br></span></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>


@ -1,17 +0,0 @@
Hello
World
Hello
World
H e l l o
W o r l d
H e l l o
W o r l d


@ -1,139 +0,0 @@
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="100.000,695.032,161.344,722.776">
<textline bbox="100.000,695.032,161.344,722.776">
<text font="Helvetica" bbox="100.000,695.032,117.328,722.776" size="27.744">H</text>
<text font="Helvetica" bbox="117.328,695.032,130.672,722.776" size="27.744">e</text>
<text font="Helvetica" bbox="130.672,695.032,136.000,722.776" size="27.744">l</text>
<text font="Helvetica" bbox="136.000,695.032,141.328,722.776" size="27.744">l</text>
<text font="Helvetica" bbox="141.328,695.032,154.672,722.776" size="27.744">o</text>
<text font="Helvetica" bbox="154.672,695.032,161.344,722.776" size="27.744"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="1" bbox="261.328,695.032,323.992,722.776">
<textline bbox="261.328,695.032,323.992,722.776">
<text font="Helvetica" bbox="261.328,695.032,283.984,722.776" size="27.744">W</text>
<text font="Helvetica" bbox="283.984,695.032,297.328,722.776" size="27.744">o</text>
<text font="Helvetica" bbox="297.328,695.032,305.320,722.776" size="27.744">r</text>
<text font="Helvetica" bbox="305.320,695.032,310.648,722.776" size="27.744">l</text>
<text font="Helvetica" bbox="310.648,695.032,323.992,722.776" size="27.744">d</text>
<text>
</text>
</textline>
</textbox>
<textbox id="2" bbox="100.000,595.032,161.344,622.776">
<textline bbox="100.000,595.032,161.344,622.776">
<text font="Helvetica" bbox="100.000,595.032,117.328,622.776" size="27.744">H</text>
<text font="Helvetica" bbox="117.328,595.032,130.672,622.776" size="27.744">e</text>
<text font="Helvetica" bbox="130.672,595.032,136.000,622.776" size="27.744">l</text>
<text font="Helvetica" bbox="136.000,595.032,141.328,622.776" size="27.744">l</text>
<text font="Helvetica" bbox="141.328,595.032,154.672,622.776" size="27.744">o</text>
<text font="Helvetica" bbox="154.672,595.032,161.344,622.776" size="27.744"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="3" bbox="261.344,595.032,324.008,622.776">
<textline bbox="261.344,595.032,324.008,622.776">
<text font="Helvetica" bbox="261.344,595.032,284.000,622.776" size="27.744">W</text>
<text font="Helvetica" bbox="284.000,595.032,297.344,622.776" size="27.744">o</text>
<text font="Helvetica" bbox="297.344,595.032,305.336,622.776" size="27.744">r</text>
<text font="Helvetica" bbox="305.336,595.032,310.664,622.776" size="27.744">l</text>
<text font="Helvetica" bbox="310.664,595.032,324.008,622.776" size="27.744">d</text>
<text>
</text>
</textline>
</textbox>
<textbox id="4" bbox="100.000,495.032,211.344,522.776">
<textline bbox="100.000,495.032,211.344,522.776">
<text font="Helvetica" bbox="100.000,495.032,117.328,522.776" size="27.744">H</text>
<text> </text>
<text font="Helvetica" bbox="127.328,495.032,140.672,522.776" size="27.744">e</text>
<text> </text>
<text font="Helvetica" bbox="150.672,495.032,156.000,522.776" size="27.744">l</text>
<text> </text>
<text font="Helvetica" bbox="166.000,495.032,171.328,522.776" size="27.744">l</text>
<text> </text>
<text font="Helvetica" bbox="181.328,495.032,194.672,522.776" size="27.744">o</text>
<text> </text>
<text font="Helvetica" bbox="204.672,495.032,211.344,522.776" size="27.744"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="5" bbox="321.344,495.032,424.008,522.776">
<textline bbox="321.344,495.032,424.008,522.776">
<text font="Helvetica" bbox="321.344,495.032,344.000,522.776" size="27.744">W</text>
<text> </text>
<text font="Helvetica" bbox="354.000,495.032,367.344,522.776" size="27.744">o</text>
<text> </text>
<text font="Helvetica" bbox="377.344,495.032,385.336,522.776" size="27.744">r</text>
<text> </text>
<text font="Helvetica" bbox="395.336,495.032,400.664,522.776" size="27.744">l</text>
<text> </text>
<text font="Helvetica" bbox="410.664,495.032,424.008,522.776" size="27.744">d</text>
<text>
</text>
</textline>
</textbox>
<textbox id="6" bbox="100.000,395.032,211.264,422.776">
<textline bbox="100.000,395.032,211.264,422.776">
<text font="Helvetica" bbox="100.000,395.032,117.328,422.776" size="27.744">H</text>
<text> </text>
<text font="Helvetica" bbox="127.312,395.032,140.656,422.776" size="27.744">e</text>
<text> </text>
<text font="Helvetica" bbox="150.640,395.032,155.968,422.776" size="27.744">l</text>
<text> </text>
<text font="Helvetica" bbox="165.952,395.032,171.280,422.776" size="27.744">l</text>
<text> </text>
<text font="Helvetica" bbox="181.264,395.032,194.608,422.776" size="27.744">o</text>
<text> </text>
<text font="Helvetica" bbox="204.592,395.032,211.264,422.776" size="27.744"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="7" bbox="321.232,395.032,423.832,422.776">
<textline bbox="321.232,395.032,423.832,422.776">
<text font="Helvetica" bbox="321.232,395.032,343.888,422.776" size="27.744">W</text>
<text> </text>
<text font="Helvetica" bbox="353.872,395.032,367.216,422.776" size="27.744">o</text>
<text> </text>
<text font="Helvetica" bbox="377.200,395.032,385.192,422.776" size="27.744">r</text>
<text> </text>
<text font="Helvetica" bbox="395.176,395.032,400.504,422.776" size="27.744">l</text>
<text> </text>
<text font="Helvetica" bbox="410.488,395.032,423.832,422.776" size="27.744">d</text>
<text>
</text>
</textline>
</textbox>
<layout>
<textgroup bbox="100.000,395.032,424.008,722.776">
<textgroup bbox="100.000,595.032,324.008,722.776">
<textgroup bbox="100.000,695.032,323.992,722.776">
<textbox id="0" bbox="100.000,695.032,161.344,722.776" />
<textbox id="1" bbox="261.328,695.032,323.992,722.776" />
</textgroup>
<textgroup bbox="100.000,595.032,324.008,622.776">
<textbox id="2" bbox="100.000,595.032,161.344,622.776" />
<textbox id="3" bbox="261.344,595.032,324.008,622.776" />
</textgroup>
</textgroup>
<textgroup bbox="100.000,395.032,424.008,522.776">
<textgroup bbox="100.000,495.032,424.008,522.776">
<textbox id="4" bbox="100.000,495.032,211.344,522.776" />
<textbox id="5" bbox="321.344,495.032,424.008,522.776" />
</textgroup>
<textgroup bbox="100.000,395.032,423.832,422.776">
<textbox id="6" bbox="100.000,395.032,211.264,422.776" />
<textbox id="7" bbox="321.232,395.032,423.832,422.776" />
</textgroup>
</textgroup>
</textgroup>
</layout>
</page>
</pages>


@ -1,11 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<span style="position:absolute; border: black 1px solid; left:150px; top:492px; width:0px; height:100px;"></span>
<span style="position:absolute; border: black 1px solid; left:150px; top:592px; width:250px; height:0px;"></span>
<span style="position:absolute; border: black 1px solid; left:200px; top:467px; width:50px; height:75px;"></span>
<span style="position:absolute; border: black 1px solid; left:300px; top:442px; width:100px; height:100px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>


@ -1 +0,0 @@


@ -1,9 +0,0 @@
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<line linewidth="0" bbox="150.000,250.000,150.000,350.000" />
<line linewidth="4" bbox="150.000,250.000,400.000,250.000" />
<rect linewidth="1" bbox="200.000,300.000,250.000,375.000" />
<curve linewidth="1" bbox="300.000,300.000,400.000,400.000" pts="300.000,300.000,300.000,400.000,400.000,400.000,400.000,300.000"/>
</page>
</pages>

View File

@ -1,11 +0,0 @@
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:281px; top:575px; width:62px; height:27px;"><span style="font-family: Helvetica; font-size:27px">World
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:241px; top:599px; width:40px; height:27px;"><span style="font-family: Helvetica; font-size:27px">orld
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:tb-rl; left:194px; top:136px; width:48px; height:490px;"><span style="font-family: unknown; font-size:48px">あいうえおあいうえお </span><span style="font-family: Helvetica; font-size:27px">W
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:0px; top:72px; width:218px; height:79px;"><span style="font-family: Helvetica; font-size:55px">HelloHello
<br></span></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

View File

@ -1,9 +0,0 @@
World
orld
あいうえおあいうえお W
HelloHello

View File

@ -1,72 +0,0 @@
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="281.352,239.032,344.016,266.776">
<textline bbox="281.352,239.032,344.016,266.776">
<text font="Helvetica" bbox="281.352,239.032,304.008,266.776" size="27.744">W</text>
<text font="Helvetica" bbox="304.008,239.032,317.352,266.776" size="27.744">o</text>
<text font="Helvetica" bbox="317.352,239.032,325.344,266.776" size="27.744">r</text>
<text font="Helvetica" bbox="325.344,239.032,330.672,266.776" size="27.744">l</text>
<text font="Helvetica" bbox="330.672,239.032,344.016,266.776" size="27.744">d</text>
<text>
</text>
</textline>
</textbox>
<textbox id="1" bbox="241.344,215.032,281.352,242.776">
<textline bbox="241.344,215.032,281.352,242.776">
<text font="Helvetica" bbox="241.344,215.032,254.688,242.776" size="27.744">o</text>
<text font="Helvetica" bbox="254.688,215.032,262.680,242.776" size="27.744">r</text>
<text font="Helvetica" bbox="262.680,215.032,268.008,242.776" size="27.744">l</text>
<text font="Helvetica" bbox="268.008,215.032,281.352,242.776" size="27.744">d</text>
<text>
</text>
</textline>
</textbox>
<textbox id="2" bbox="194.688,215.032,242.688,705.760" wmode="vertical">
<textline bbox="194.688,215.032,242.688,705.760">
<text font="unknown" bbox="194.688,657.760,242.688,705.760" size="48.000">あ</text>
<text font="unknown" bbox="194.688,609.760,242.688,657.760" size="48.000">い</text>
<text font="unknown" bbox="194.688,561.760,242.688,609.760" size="48.000">う</text>
<text font="unknown" bbox="194.688,513.760,242.688,561.760" size="48.000">え</text>
<text font="unknown" bbox="194.688,465.760,242.688,513.760" size="48.000">お</text>
<text font="unknown" bbox="194.688,441.760,242.688,489.760" size="48.000">あ</text>
<text font="unknown" bbox="194.688,393.760,242.688,441.760" size="48.000">い</text>
<text font="unknown" bbox="194.688,345.760,242.688,393.760" size="48.000">う</text>
<text font="unknown" bbox="194.688,297.760,242.688,345.760" size="48.000">え</text>
<text font="unknown" bbox="194.688,249.760,242.688,297.760" size="48.000">お</text>
<text> </text>
<text font="Helvetica" bbox="218.688,215.032,241.344,242.776" size="27.744">W</text>
<text>
</text>
</textline>
</textbox>
<textbox id="3" bbox="0.000,690.064,218.688,769.552">
<textline bbox="0.000,690.064,218.688,769.552">
<text font="Helvetica" bbox="0.000,690.064,34.656,745.552" size="55.488">H</text>
<text font="Helvetica" bbox="34.656,690.064,61.344,745.552" size="55.488">e</text>
<text font="Helvetica" bbox="61.344,690.064,72.000,745.552" size="55.488">l</text>
<text font="Helvetica" bbox="72.000,690.064,82.656,745.552" size="55.488">l</text>
<text font="Helvetica" bbox="82.656,690.064,109.344,745.552" size="55.488">o</text>
<text font="Helvetica" bbox="109.344,714.064,144.000,769.552" size="55.488">H</text>
<text font="Helvetica" bbox="144.000,714.064,170.688,769.552" size="55.488">e</text>
<text font="Helvetica" bbox="170.688,714.064,181.344,769.552" size="55.488">l</text>
<text font="Helvetica" bbox="181.344,714.064,192.000,769.552" size="55.488">l</text>
<text font="Helvetica" bbox="192.000,714.064,218.688,769.552" size="55.488">o</text>
<text>
</text>
</textline>
</textbox>
<layout>
<textgroup bbox="0.000,215.032,344.016,769.552">
<textgroup bbox="241.344,215.032,344.016,266.776">
<textbox id="0" bbox="281.352,239.032,344.016,266.776" />
<textbox id="1" bbox="241.344,215.032,281.352,242.776" />
</textgroup>
<textgroup bbox="0.000,215.032,242.688,769.552">
<textbox id="2" bbox="194.688,215.032,242.688,705.760" />
<textbox id="3" bbox="0.000,690.064,218.688,769.552" />
</textgroup>
</textgroup>
</layout>
</page>
</pages>


@ -13,7 +13,10 @@ setup(
         'six',
         'sortedcontainers',
     ],
-    extras_require={"dev": ["nose", "tox"]},
+    extras_require={
+        "dev": ["nose", "tox"],
+        "docs": ["sphinx", "sphinx-argparse"],
+    },
     description='PDF parser and analyzer',
     long_description=package.__doc__,
     license='MIT/X',

7
tests/helpers.py Normal file

@ -0,0 +1,7 @@
import os


def absolute_sample_path(relative_sample_path):
    sample_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '../samples'))
    sample_file = os.path.join(sample_dir, relative_sample_path)
    return sample_file
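
The path resolution done by the helper above can be sketched in isolation. This is a minimal illustration, not the project's code: `resolve_sample_path` and its `tests_dir` parameter are hypothetical names introduced here so the example does not depend on `__file__`, and the `/repo/tests` layout is made up.

```python
import os


def resolve_sample_path(relative_sample_path, tests_dir):
    """Mirror absolute_sample_path, but with the tests directory passed in.

    tests_dir is a hypothetical parameter for this sketch; the helper in the
    diff derives the directory from os.path.dirname(__file__) instead.
    """
    # '../samples' is joined onto the tests directory, then normalized.
    sample_dir = os.path.abspath(os.path.join(tests_dir, '../samples'))
    return os.path.join(sample_dir, relative_sample_path)


# Made-up repository layout, used only for illustration.
resolved = resolve_sample_path('simple1.pdf', os.path.join(os.sep, 'repo', 'tests'))
print(resolved)
```

Because the result is normalized with `os.path.abspath`, tests can call the helper from any working directory and still reach the `samples` directory that sits next to `tests`.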


@ -1,5 +1,5 @@
-"""
-Tests based on the Adobe Glyph List Specification (https://github.com/adobe-type-tools/agl-specification#2-the-mapping)
+"""Tests based on the Adobe Glyph List Specification
+See: https://github.com/adobe-type-tools/agl-specification#2-the-mapping
 While not in the specification, lowercase unicode often occurs in pdf's. Therefore lowercase unittest variants are
 added.


@ -0,0 +1,38 @@
import unittest

from helpers import absolute_sample_path
from pdfminer.high_level import extract_text


def run(sample_path):
    absolute_path = absolute_sample_path(sample_path)
    s = extract_text(absolute_path)
    return s


test_strings = {
    "simple1.pdf": "Hello \n\nWorld\n\nWorld\n\nHello \n\nH e l l o \n\nH e l l o \n\nW o r l d\n\nW o r l d\n\n\f",
    "simple2.pdf": "\f",
    "simple3.pdf": "HelloHello\n\nWorld\n\nWorld\n\n\f",
}


class TestExtractText(unittest.TestCase):
    def test_simple1(self):
        test_file = "simple1.pdf"
        s = run(test_file)
        self.assertEqual(s, test_strings[test_file])

    def test_simple2(self):
        test_file = "simple2.pdf"
        s = run(test_file)
        self.assertEqual(s, test_strings[test_file])

    def test_simple3(self):
        test_file = "simple3.pdf"
        s = run(test_file)
        self.assertEqual(s, test_strings[test_file])


if __name__ == "__main__":
    unittest.main()
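
Every expected string in `test_strings` ends with a form feed (`\f`). One reasonable reading, stated here as an assumption rather than something this diff confirms, is that the text output writes one `\f` after each page. Under that assumption, a caller could split extracted text into pages as sketched below. `split_pages` is a hypothetical helper and not part of pdfminer.

```python
def split_pages(text):
    """Split extracted text into per-page strings on form feeds.

    Hypothetical helper; assumes each page's text is followed by one form feed.
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty string after the last page.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Expected output for simple3.pdf, copied from test_strings.
simple3 = "HelloHello\n\nWorld\n\nWorld\n\n\f"
pages = split_pages(simple3)
# Drop the empty strings produced by the blank lines between text boxes.
lines = [line for line in pages[0].split("\n") if line]
print(len(pages), lines)
```

For `simple2.pdf`, whose expected text is just `"\f"`, the same helper yields a single empty page, which matches a page that contains only graphics.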

16
tests/test_pdfdocument.py Normal file

@ -0,0 +1,16 @@
from nose.tools import raises

from helpers import absolute_sample_path
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import PDFObjectNotFound


class TestPdfDocument(object):

    @raises(PDFObjectNotFound)
    def test_get_zero_objid_raises_pdfobjectnotfound(self):
        with open(absolute_sample_path('simple1.pdf'), 'rb') as in_file:
            parser = PDFParser(in_file)
            doc = PDFDocument(parser)
            doc.getobj(0)


@ -2,12 +2,14 @@
 # -*- coding: utf-8 -*-
-import nose, logging, os
+import nose
 from pdfminer.cmapdb import IdentityCMap, CMap, IdentityCMapByte
 from pdfminer.pdffont import PDFCIDFont
 from pdfminer.pdftypes import PDFStream
 from pdfminer.psparser import PSLiteral
 class TestPDFEncoding():
     def test_cmapname_onebyteidentityV(self):
@ -45,25 +47,25 @@ class TestPDFEncoding():
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_identityH_as_PSLiteral_stream(self):
-        stream = PDFStream({'CMapName':PSLiteral('Identity-H')}, '')
+        stream = PDFStream({'CMapName': PSLiteral('Identity-H')}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_identityV_as_PSLiteral_stream(self):
-        stream = PDFStream({'CMapName':PSLiteral('Identity-V')}, '')
+        stream = PDFStream({'CMapName': PSLiteral('Identity-V')}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_identityH_as_stream(self):
-        stream = PDFStream({'CMapName':'Identity-H'}, '')
+        stream = PDFStream({'CMapName': 'Identity-H'}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_identityV_as_stream(self):
-        stream = PDFStream({'CMapName':'Identity-V'}, '')
+        stream = PDFStream({'CMapName': 'Identity-V'}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
@ -79,25 +81,25 @@ class TestPDFEncoding():
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_DLIdentH_as_PSLiteral_stream(self):
-        stream = PDFStream({'CMapName':PSLiteral('DLIdent-H')}, '')
+        stream = PDFStream({'CMapName': PSLiteral('DLIdent-H')}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_DLIdentH_as_PSLiteral_stream(self):
-        stream = PDFStream({'CMapName':PSLiteral('DLIdent-V')}, '')
+        stream = PDFStream({'CMapName': PSLiteral('DLIdent-V')}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_DLIdentH_as_stream(self):
-        stream = PDFStream({'CMapName':'DLIdent-H'}, '')
+        stream = PDFStream({'CMapName': 'DLIdent-H'}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)
     def test_encoding_DLIdentV_as_stream(self):
-        stream = PDFStream({'CMapName':'DLIdent-V'}, '')
+        stream = PDFStream({'CMapName': 'DLIdent-V'}, '')
         spec = {'Encoding': stream}
         font = PDFCIDFont(None, spec)
         assert isinstance(font.cmap, IdentityCMap)


@ -1,19 +1,9 @@
#!/usr/bin/env python from nose.tools import assert_equal
# -*- coding: utf-8 -*-
from nose.tools import assert_equal, assert_true, assert_false
from nose import SkipTest
import nose
import logging
from pdfminer.ccitt import * from pdfminer.ccitt import *
## Test cases
##
class TestCCITTG4Parser():
class TestCCITTG4Parser():
def get_parser(self, bits): def get_parser(self, bits):
parser = CCITTG4Parser(len(bits)) parser = CCITTG4Parser(len(bits))
parser._curline = [int(c) for c in bits] parser._curline = [int(c) for c in bits]
@ -163,6 +153,3 @@ class TestCCITTG4Parser():
parser._do_vertical(1) parser._do_vertical(1)
assert_equal(parser._get_bits(), '00000001') assert_equal(parser._get_bits(), '00000001')
return return
if __name__ == '__main__':
nose.runmodule()


@ -1,52 +1,57 @@
#!/usr/bin/env python """Test of various compression/encoding modules (previously in doctests)
# -*- coding: utf-8 -*- """
import binascii
from nose.tools import assert_equal from nose.tools import assert_equal
from nose import SkipTest
import nose
#test of various compression/encoding modules (previously in doctests):
from pdfminer.ascii85 import *
from pdfminer.arcfour import * from pdfminer.arcfour import *
from pdfminer.ascii85 import *
from pdfminer.lzw import * from pdfminer.lzw import *
from pdfminer.runlength import *
from pdfminer.rijndael import * from pdfminer.rijndael import *
from pdfminer.runlength import *
def hex(b):
"""encode('hex')"""
return binascii.hexlify(b)
def dehex(b):
"""decode('hex')"""
return binascii.unhexlify(b)
import binascii
def hex(b): return binascii.hexlify(b) #encode('hex')
def dehex(b): return binascii.unhexlify(b) #decode('hex')
class TestAscii85(): class TestAscii85():
def test_ascii85decode(self): def test_ascii85decode(self):
#The sample string is taken from: http://en.wikipedia.org/w/index.php?title=Ascii85 """The sample string is taken from: http://en.wikipedia.org/w/index.php?title=Ascii85"""
assert_equal(ascii85decode(b'9jqo^BlbD-BleB1DJ+*+F(f,q'),b'Man is distinguished') assert_equal(ascii85decode(b'9jqo^BlbD-BleB1DJ+*+F(f,q'), b'Man is distinguished')
assert_equal(ascii85decode(b'E,9)oF*2M7/c~>'),b'pleasure.') assert_equal(ascii85decode(b'E,9)oF*2M7/c~>'), b'pleasure.')
def test_asciihexdecode(self): def test_asciihexdecode(self):
assert_equal(asciihexdecode(b'61 62 2e6364 65'),b'ab.cde') assert_equal(asciihexdecode(b'61 62 2e6364 65'), b'ab.cde')
assert_equal(asciihexdecode(b'61 62 2e6364 657>'),b'ab.cdep') assert_equal(asciihexdecode(b'61 62 2e6364 657>'), b'ab.cdep')
assert_equal(asciihexdecode(b'7>'),b'p') assert_equal(asciihexdecode(b'7>'), b'p')
class TestArcfour(): class TestArcfour():
def test(self): def test(self):
assert_equal(hex(Arcfour(b'Key').process(b'Plaintext')), b'bbf316e8d940af0ad3')
assert_equal(hex(Arcfour(b'Wiki').process(b'pedia')), b'1021bf0420')
assert_equal(hex(Arcfour(b'Secret').process(b'Attack at dawn')), b'45a01f645fc35b383552544b9bf5')
assert_equal(hex(Arcfour(b'Key').process(b'Plaintext')),b'bbf316e8d940af0ad3')
assert_equal(hex(Arcfour(b'Wiki').process(b'pedia')),b'1021bf0420')
assert_equal(hex(Arcfour(b'Secret').process(b'Attack at dawn')),b'45a01f645fc35b383552544b9bf5')
class TestLzw(): class TestLzw():
def test_lzwdecode(self): def test_lzwdecode(self):
assert_equal(lzwdecode(b'\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01'),b'\x2d\x2d\x2d\x2d\x2d\x41\x2d\x2d\x2d\x42') assert_equal(lzwdecode(b'\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01'), b'\x2d\x2d\x2d\x2d\x2d\x41\x2d\x2d\x2d\x42')
class TestRunlength(): class TestRunlength():
def test_rldecode(self): def test_rldecode(self):
assert_equal(rldecode(b'\x05123456\xfa7\x04abcde\x80junk'),b'1234567777777abcde') assert_equal(rldecode(b'\x05123456\xfa7\x04abcde\x80junk'), b'1234567777777abcde')
class TestRijndaelEncryptor(): class TestRijndaelEncryptor():
def test_RijndaelEncryptor(self): def test_RijndaelEncryptor(self):
key = dehex(b'00010203050607080a0b0c0d0f101112') key = dehex(b'00010203050607080a0b0c0d0f101112')
plaintext = dehex(b'506812a45f08c889b97f5980038b8359') plaintext = dehex(b'506812a45f08c889b97f5980038b8359')
assert_equal(hex(RijndaelEncryptor(key, 128).encrypt(plaintext)),b'd8f532538289ef7d06b506a4fd5be9c9') assert_equal(hex(RijndaelEncryptor(key, 128).encrypt(plaintext)), b'd8f532538289ef7d06b506a4fd5be9c9')
if __name__ == '__main__':
nose.runmodule()
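The `rldecode` test vector above can be reproduced by hand. What follows is a minimal sketch of run-length decoding as the PDF specification describes the RunLengthDecode filter; it is not pdfminer's own implementation, and the function name `run_length_decode` is made up for illustration.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data (illustrative sketch, not pdfminer code)."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:
            # 128 marks end of data; anything after it is ignored.
            break
        if length < 128:
            # Copy the next length + 1 bytes literally.
            out += data[i:i + length + 1]
            i += length + 1
        else:
            # Repeat the single next byte 257 - length times.
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Same input as the test case in the diff above.
decoded = run_length_decode(b'\x05123456\xfa7\x04abcde\x80junk')
```

In the test input, `\x05` copies six literal bytes, `\xfa` repeats `7` seven times, `\x04` copies five literal bytes, and `\x80` ends the stream, so the trailing `junk` is dropped.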


@ -1,18 +1,14 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from nose.tools import assert_equal, assert_true, assert_false
from nose import SkipTest
import nose
import logging import logging
from nose.tools import assert_equal
logger = logging.getLogger(__name__)
from pdfminer.psparser import * from pdfminer.psparser import *
## Simplistic Test cases
##
class TestPSBaseParser: class TestPSBaseParser:
"""Simplistic Test cases"""
TESTDATA = br'''%!PS TESTDATA = br'''%!PS
begin end begin end
@ -35,29 +31,29 @@ func/a/b{(c)do*}def
''' '''
TOKENS = [ TOKENS = [
(5, KWD(b'begin')), (11, KWD(b'end')), (16, KWD(b'"')), (19, KWD(b'@')), (5, KWD(b'begin')), (11, KWD(b'end')), (16, KWD(b'"')), (19, KWD(b'@')),
(21, KWD(b'#')), (23, LIT('a')), (25, LIT('BCD')), (30, LIT('Some_Name')), (21, KWD(b'#')), (23, LIT('a')), (25, LIT('BCD')), (30, LIT('Some_Name')),
(41, LIT('foo_xbaa')), (54, 0), (56, 1), (59, -2), (62, 0.5), (41, LIT('foo_xbaa')), (54, 0), (56, 1), (59, -2), (62, 0.5),
(65, 1.234), (71, b'abc'), (77, b''), (80, b'abc ( def ) ghi'), (65, 1.234), (71, b'abc'), (77, b''), (80, b'abc ( def ) ghi'),
(98, b'def \x00 4ghi'), (118, b'bach\\slask'), (132, b'foo\nbaa'), (98, b'def \x00 4ghi'), (118, b'bach\\slask'), (132, b'foo\nbaa'),
(143, b'this % is not a comment.'), (170, b'foo\nbaa'), (180, b'foobaa'), (143, b'this % is not a comment.'), (170, b'foo\nbaa'), (180, b'foobaa'),
(191, b''), (194, b' '), (199, b'@@ '), (211, b'\xab\xcd\x00\x124\x05'), (191, b''), (194, b' '), (199, b'@@ '), (211, b'\xab\xcd\x00\x124\x05'),
(226, KWD(b'func')), (230, LIT('a')), (232, LIT('b')), (226, KWD(b'func')), (230, LIT('a')), (232, LIT('b')),
(234, KWD(b'{')), (235, b'c'), (238, KWD(b'do*')), (241, KWD(b'}')), (234, KWD(b'{')), (235, b'c'), (238, KWD(b'do*')), (241, KWD(b'}')),
(242, KWD(b'def')), (246, KWD(b'[')), (248, 1), (250, b'z'), (254, KWD(b'!')), (242, KWD(b'def')), (246, KWD(b'[')), (248, 1), (250, b'z'), (254, KWD(b'!')),
(256, KWD(b']')), (258, KWD(b'<<')), (261, LIT('foo')), (266, b'bar'), (256, KWD(b']')), (258, KWD(b'<<')), (261, LIT('foo')), (266, b'bar'),
(272, KWD(b'>>')) (272, KWD(b'>>'))
] ]
OBJS = [ OBJS = [
(23, LIT('a')), (25, LIT('BCD')), (30, LIT('Some_Name')), (23, LIT('a')), (25, LIT('BCD')), (30, LIT('Some_Name')),
(41, LIT('foo_xbaa')), (54, 0), (56, 1), (59, -2), (62, 0.5), (41, LIT('foo_xbaa')), (54, 0), (56, 1), (59, -2), (62, 0.5),
(65, 1.234), (71, b'abc'), (77, b''), (80, b'abc ( def ) ghi'), (65, 1.234), (71, b'abc'), (77, b''), (80, b'abc ( def ) ghi'),
(98, b'def \x00 4ghi'), (118, b'bach\\slask'), (132, b'foo\nbaa'), (98, b'def \x00 4ghi'), (118, b'bach\\slask'), (132, b'foo\nbaa'),
(143, b'this % is not a comment.'), (170, b'foo\nbaa'), (180, b'foobaa'), (143, b'this % is not a comment.'), (170, b'foo\nbaa'), (180, b'foobaa'),
(191, b''), (194, b' '), (199, b'@@ '), (211, b'\xab\xcd\x00\x124\x05'), (191, b''), (194, b' '), (199, b'@@ '), (211, b'\xab\xcd\x00\x124\x05'),
(230, LIT('a')), (232, LIT('b')), (234, [b'c']), (246, [1, b'z']), (230, LIT('a')), (232, LIT('b')), (234, [b'c']), (246, [1, b'z']),
(258, {'foo': b'bar'}), (258, {'foo': b'bar'}),
] ]
def get_tokens(self, s): def get_tokens(self, s):
@ -66,6 +62,7 @@ func/a/b{(c)do*}def
class MyParser(PSBaseParser): class MyParser(PSBaseParser):
def flush(self): def flush(self):
self.add_results(*self.popall()) self.add_results(*self.popall())
parser = MyParser(BytesIO(s)) parser = MyParser(BytesIO(s))
r = [] r = []
try: try:
@ -81,6 +78,7 @@ func/a/b{(c)do*}def
class MyParser(PSStackParser): class MyParser(PSStackParser):
def flush(self): def flush(self):
self.add_results(*self.popall()) self.add_results(*self.popall())
parser = MyParser(BytesIO(s)) parser = MyParser(BytesIO(s))
r = [] r = []
try: try:
@ -92,17 +90,12 @@ func/a/b{(c)do*}def
def test_1(self): def test_1(self):
tokens = self.get_tokens(self.TESTDATA) tokens = self.get_tokens(self.TESTDATA)
logging.info(tokens) logger.info(tokens)
assert_equal(tokens, self.TOKENS) assert_equal(tokens, self.TOKENS)
return return
def test_2(self): def test_2(self):
objs = self.get_objects(self.TESTDATA) objs = self.get_objects(self.TESTDATA)
logging.info(objs) logger.info(objs)
assert_equal(objs, self.OBJS) assert_equal(objs, self.OBJS)
return return
if __name__ == '__main__':
#import logging,sys,os,six
#logging.basicConfig(level=logging.DEBUG, filename='%s_%d.%d.log'%(os.path.basename(__file__),sys.version_info[0],sys.version_info[1]))
nose.runmodule()


@ -1,53 +1,37 @@
#!/usr/bin/env python from tempfile import NamedTemporaryFile
# -*- coding: utf-8 -*- from helpers import absolute_sample_path
import six from tools import dumppdf
import nose, logging, os
if six.PY3: def run(filename, options=None):
from tools import dumppdf absolute_path = absolute_sample_path(filename)
elif six.PY2: with NamedTemporaryFile() as output_file:
import os, sys if options:
sys.path.append(os.path.abspath(os.path.curdir)) s = 'dumppdf -o %s %s %s' % (output_file.name, options, absolute_path)
import tools.dumppdf as dumppdf else:
s = 'dumppdf -o %s %s' % (output_file.name, absolute_path)
dumppdf.main(s.split(' ')[1:])
path=os.path.dirname(os.path.abspath(__file__))+'/'
def run(datapath,filename,options=None):
i=path+datapath+filename+'.pdf'
o=path+filename+'.xml'
if options:
s='dumppdf -o%s %s %s'%(o,options,i)
else:
s='dumppdf -o%s %s'%(o,i)
dumppdf.main(s.split(' '))
class TestDumpPDF(): class TestDumpPDF():
def test_1(self): def test_1(self):
run('../samples/','jo','-t -a') run('jo.pdf', '-t -a')
run('../samples/','simple1','-t -a') run('simple1.pdf', '-t -a')
run('../samples/','simple2','-t -a') run('simple2.pdf', '-t -a')
run('../samples/','simple3','-t -a') run('simple3.pdf', '-t -a')
def test_2(self): def test_2(self):
run('../samples/nonfree/','dmca','-t -a') run('nonfree/dmca.pdf', '-t -a')
def test_3(self): def test_3(self):
run('../samples/nonfree/','f1040nr') run('nonfree/f1040nr.pdf')
def test_4(self): def test_4(self):
run('../samples/nonfree/','i1040nr') run('nonfree/i1040nr.pdf')
def test_5(self):
run('../samples/nonfree/','kampo','-t -a')
def test_6(self):
run('../samples/nonfree/','naacl06-shinyama','-t -a')
if __name__ == '__main__': def test_5(self):
#import logging,sys,os,six run('nonfree/kampo.pdf', '-t -a')
#logging.basicConfig(level=logging.DEBUG, filename='%s_%d.%d.log'%(os.path.basename(__file__),sys.version_info[0],sys.version_info[1]))
nose.runmodule() def test_6(self):
run('nonfree/naacl06-shinyama.pdf', '-t -a')
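The new `run` helper formats a command string and splits it on single spaces, which would break if the temporary file or the sample path contained a space. One possible variation, shown only as a sketch and not part of this commit, builds the argument list directly. The helper name `build_argv` is hypothetical.

```python
def build_argv(output_path, input_path, options=None):
    """Build an argv list for a tool's main() without splitting paths.

    Hypothetical helper: only the option string is split on whitespace,
    so paths that contain spaces stay intact.
    """
    argv = ['-o', output_path]
    if options:
        argv.extend(options.split())
    argv.append(input_path)
    return argv


# A path with a space survives as a single argument.
example = build_argv('out.xml', '/tmp/my samples/jo.pdf', '-t -a')
```

The resulting list could then be passed to `dumppdf.main(...)` in place of `s.split(' ')[1:]`, assuming `main` keeps accepting an argument list as it does in this diff.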


@ -2,72 +2,71 @@ import os
from shutil import rmtree from shutil import rmtree
from tempfile import NamedTemporaryFile, mkdtemp from tempfile import NamedTemporaryFile, mkdtemp
import nose
import tools.pdf2txt as pdf2txt import tools.pdf2txt as pdf2txt
from helpers import absolute_sample_path
def full_path(relative_path_to_this_file): def run(sample_path, options=None):
this_file_dir = os.path.dirname(os.path.abspath(__file__)) absolute_path = absolute_sample_path(sample_path)
abspath = os.path.abspath(os.path.join(this_file_dir, relative_path_to_this_file)) with NamedTemporaryFile() as output_file:
return abspath if options:
s = 'pdf2txt -o %s %s %s' % (output_file.name, options, absolute_path)
else:
def run(datapath, filename, options=None): s = 'pdf2txt -o %s %s' % (output_file.name, absolute_path)
i = full_path(datapath + filename + '.pdf') pdf2txt.main(s.split(' ')[1:])
o = full_path(filename + '.txt')
if options:
s = 'pdf2txt -o%s %s %s' % (o, options, i)
else:
s = 'pdf2txt -o%s %s' % (o, i)
pdf2txt.main(s.split(' ')[1:])
class TestDumpPDF(): class TestDumpPDF():
def test_1(self): def test_jo(self):
run('../samples/', 'jo') run('jo.pdf')
run('../samples/', 'simple1')
run('../samples/', 'simple2')
run('../samples/', 'simple3')
run('../samples/','sampleOneByteIdentityEncode')
def test_2(self): def test_simple1(self):
run('../samples/nonfree/', 'dmca') run('simple1.pdf')
def test_3(self): def test_simple2(self):
run('../samples/nonfree/', 'f1040nr') run('simple2.pdf')
def test_4(self): def test_simple3(self):
run('../samples/nonfree/', 'i1040nr') run('simple3.pdf')
def test_5(self): def test_sample_one_byte_identity_encode(self):
run('../samples/nonfree/', 'kampo') run('sampleOneByteIdentityEncode.pdf')
def test_6(self): def test_nonfree_175(self):
run('../samples/nonfree/', 'naacl06-shinyama') """Regression test for https://github.com/pdfminer/pdfminer.six/issues/65"""
run('nonfree/175.pdf')
# this test works on Windows but on Linux & Travis-CI it says def test_nonfree_dmca(self):
# PDFSyntaxError: No /Root object! - Is this really a PDF? run('nonfree/dmca.pdf')
# TODO: Find why
"""
def test_7(self):
run('../samples/contrib/','stamp-no')
"""
def test_8(self): def test_nonfree_f1040nr(self):
run('../samples/contrib/', '2b', '-A -t xml') run('nonfree/f1040nr.pdf')
def test_9(self): def test_nonfree_i1040nr(self):
run('../samples/nonfree/', '175') # https://github.com/pdfminer/pdfminer.six/issues/65 run('nonfree/i1040nr.pdf')
def test_10(self): def test_nonfree_kampo(self):
run('../samples/scancode/', 'patchelf') # https://github.com/euske/pdfminer/issues/96 run('nonfree/kampo.pdf')
def test_nonfree_naacl06_shinyama(self):
run('nonfree/naacl06-shinyama.pdf')
def test_nlp2004slides(self):
run('nonfree/nlp2004slides.pdf')
def test_contrib_2b(self):
run('contrib/2b.pdf', '-A -t xml')
def test_scancode_patchelf(self):
"""Regression test for https://github.com/euske/pdfminer/issues/96"""
run('scancode/patchelf.pdf')
class TestDumpImages(object): class TestDumpImages(object):
def extract_images(self, input_file): @staticmethod
def extract_images(input_file):
output_dir = mkdtemp() output_dir = mkdtemp()
with NamedTemporaryFile() as output_file: with NamedTemporaryFile() as output_file:
commands = ['-o', output_file.name, '--output-dir', output_dir, input_file] commands = ['-o', output_file.name, '--output-dir', output_dir, input_file]
@ -81,13 +80,25 @@ class TestDumpImages(object):
Regression test for: https://github.com/pdfminer/pdfminer.six/issues/131 Regression test for: https://github.com/pdfminer/pdfminer.six/issues/131
""" """
image_files = self.extract_images(full_path('../samples/nonfree/dmca.pdf')) image_files = self.extract_images(absolute_sample_path('../samples/nonfree/dmca.pdf'))
assert image_files[0].endswith('bmp') assert image_files[0].endswith('bmp')
def test_nonfree_175(self): def test_nonfree_175(self):
"""Extract images of pdf containing jpg images""" """Extract images of pdf containing jpg images"""
self.extract_images(full_path('../samples/nonfree/175.pdf')) self.extract_images(absolute_sample_path('../samples/nonfree/175.pdf'))
def test_jbig2_image_export(self):
"""Extract images of pdf containing jbig2 images
if __name__ == '__main__': Feature test for: https://github.com/pdfminer/pdfminer.six/pull/46
nose.runmodule() """
image_files = self.extract_images(absolute_sample_path('../samples/contrib/pdf-with-jbig2.pdf'))
assert image_files[0].endswith('.jb2')
def test_contrib_matplotlib(self):
"""Test a pdf with Type3 font"""
run('contrib/matplotlib.pdf')
def test_nonfree_cmp_itext_logo(self):
"""Test a pdf with Type3 font"""
run('nonfree/cmp_itext_logo.pdf')


@ -1,7 +1,7 @@
from nose.tools import assert_equal from nose.tools import assert_equal
from pdfminer.layout import LTComponent from pdfminer.layout import LTComponent
from pdfminer.utils import make_compat_str, Plane from pdfminer.utils import Plane
class TestPlane(object): class TestPlane(object):
@ -37,4 +37,4 @@ class TestPlane(object):
plane = Plane(bounding_box, gridsize) plane = Plane(bounding_box, gridsize)
obj = LTComponent((0, 0, object_size, object_size)) obj = LTComponent((0, 0, object_size, object_size))
plane.add(obj) plane.add(obj)
return plane, obj return plane, obj


@ -1,32 +1,31 @@
#!/usr/bin/env python """Extract pdf structure in XML format"""
import logging
import os.path
import re
import sys
from argparse import ArgumentParser
import six
#
# dumppdf.py - dump pdf contents in XML format.
#
# usage: dumppdf.py [options] [files ...]
# options:
# -i objid : object id
#
import sys, os.path, re, logging
from pdfminer.psparser import PSKeyword, PSLiteral, LIT
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import PDFObjectNotFound, PDFValueError from pdfminer.pdftypes import PDFObjectNotFound, PDFValueError
from pdfminer.pdftypes import PDFStream, PDFObjRef, resolve1, stream_value from pdfminer.pdftypes import PDFStream, PDFObjRef, resolve1, stream_value
from pdfminer.pdfpage import PDFPage from pdfminer.psparser import PSKeyword, PSLiteral, LIT
from pdfminer.utils import isnumber from pdfminer.utils import isnumber
logging.basicConfig()
ESC_PAT = re.compile(r'[\000-\037&<>()"\042\047\134\177-\377]') ESC_PAT = re.compile(r'[\000-\037&<>()"\042\047\134\177-\377]')
def e(s): def e(s):
if six.PY3 and isinstance(s,six.binary_type): if six.PY3 and isinstance(s, six.binary_type):
s=str(s,'latin-1') s = str(s, 'latin-1')
return ESC_PAT.sub(lambda m:'&#%d;' % ord(m.group(0)), s) return ESC_PAT.sub(lambda m: '&#%d;' % ord(m.group(0)), s)
import six # Python 2+3 compatibility
# dumpxml
def dumpxml(out, obj, codec=None): def dumpxml(out, obj, codec=None):
if obj is None: if obj is None:
out.write('<null />') out.write('<null />')
@ -34,7 +33,7 @@ def dumpxml(out, obj, codec=None):
if isinstance(obj, dict): if isinstance(obj, dict):
out.write('<dict size="%d">\n' % len(obj)) out.write('<dict size="%d">\n' % len(obj))
for (k,v) in six.iteritems(obj): for (k, v) in six.iteritems(obj):
out.write('<key>%s</key>\n' % k) out.write('<key>%s</key>\n' % k)
out.write('<value>') out.write('<value>')
dumpxml(out, v) dumpxml(out, v)
@ -87,7 +86,7 @@ def dumpxml(out, obj, codec=None):
raise TypeError(obj) raise TypeError(obj)
# dumptrailers
def dumptrailers(out, doc): def dumptrailers(out, doc):
for xref in doc.xrefs: for xref in doc.xrefs:
out.write('<trailer>\n') out.write('<trailer>\n')
@ -95,7 +94,7 @@ def dumptrailers(out, doc):
out.write('\n</trailer>\n\n') out.write('\n</trailer>\n\n')
return return
# dumpallobjs
def dumpallobjs(out, doc, codec=None): def dumpallobjs(out, doc, codec=None):
visited = set() visited = set()
out.write('<pdf>') out.write('<pdf>')
@ -110,19 +109,20 @@ def dumpallobjs(out, doc, codec=None):
dumpxml(out, obj, codec=codec) dumpxml(out, obj, codec=codec)
out.write('\n</object>\n\n') out.write('\n</object>\n\n')
except PDFObjectNotFound as e: except PDFObjectNotFound as e:
print >>sys.stderr, 'not found: %r' % e print('not found: %r' % e)
dumptrailers(out, doc) dumptrailers(out, doc)
out.write('</pdf>') out.write('</pdf>')
return return
# dumpoutline
def dumpoutline(outfp, fname, objids, pagenos, password='', def dumpoutline(outfp, fname, objids, pagenos, password='',
dumpall=False, codec=None, extractdir=None): dumpall=False, codec=None, extractdir=None):
fp = open(fname, 'rb') fp = open(fname, 'rb')
parser = PDFParser(fp) parser = PDFParser(fp)
doc = PDFDocument(parser, password) doc = PDFDocument(parser, password)
pages = dict( (page.pageid, pageno) for (pageno,page) pages = dict((page.pageid, pageno) for (pageno, page)
in enumerate(PDFPage.create_pages(doc), 1) ) in enumerate(PDFPage.create_pages(doc), 1))
def resolve_dest(dest): def resolve_dest(dest):
if isinstance(dest, str): if isinstance(dest, str):
dest = resolve1(doc.get_dest(dest)) dest = resolve1(doc.get_dest(dest))
@ -133,10 +133,11 @@ def dumpoutline(outfp, fname, objids, pagenos, password='',
if isinstance(dest, PDFObjRef): if isinstance(dest, PDFObjRef):
dest = dest.resolve() dest = dest.resolve()
return dest return dest
try: try:
outlines = doc.get_outlines() outlines = doc.get_outlines()
outfp.write('<outlines>\n') outfp.write('<outlines>\n')
for (level,title,dest,a,se) in outlines: for (level, title, dest, a, se) in outlines:
pageno = None pageno = None
if dest: if dest:
dest = resolve_dest(dest) dest = resolve_dest(dest)
@ -145,7 +146,8 @@ def dumpoutline(outfp, fname, objids, pagenos, password='',
action = a action = a
if isinstance(action, dict): if isinstance(action, dict):
subtype = action.get('S') subtype = action.get('S')
if subtype and repr(subtype) == '/\'GoTo\'' and action.get('D'): if subtype and repr(subtype) == '/\'GoTo\'' and action.get(
'D'):
dest = resolve_dest(action['D']) dest = resolve_dest(action['D'])
pageno = pages[dest[0].objid] pageno = pages[dest[0].objid]
s = e(title).encode('utf-8', 'xmlcharrefreplace') s = e(title).encode('utf-8', 'xmlcharrefreplace')
@ -164,9 +166,11 @@ def dumpoutline(outfp, fname, objids, pagenos, password='',
fp.close() fp.close()
return return
# extractembedded
LITERAL_FILESPEC = LIT('Filespec') LITERAL_FILESPEC = LIT('Filespec')
LITERAL_EMBEDDEDFILE = LIT('EmbeddedFile') LITERAL_EMBEDDEDFILE = LIT('EmbeddedFile')
def extractembedded(outfp, fname, objids, pagenos, password='', def extractembedded(outfp, fname, objids, pagenos, password='',
dumpall=False, codec=None, extractdir=None): dumpall=False, codec=None, extractdir=None):
def extract1(obj): def extract1(obj):
@ -184,8 +188,8 @@ def extractembedded(outfp, fname, objids, pagenos, password='',
path = os.path.join(extractdir, filename) path = os.path.join(extractdir, filename)
if os.path.exists(path): if os.path.exists(path):
raise IOError('file exists: %r' % path) raise IOError('file exists: %r' % path)
print >>sys.stderr, 'extracting: %r' % path print('extracting: %r' % path)
out = file(path, 'wb') out = open(path, 'wb')
out.write(fileobj.get_data()) out.write(fileobj.get_data())
out.close() out.close()
return return
@ -201,7 +205,7 @@ def extractembedded(outfp, fname, objids, pagenos, password='',
fp.close() fp.close()
return return
# dumppdf
def dumppdf(outfp, fname, objids, pagenos, password='', def dumppdf(outfp, fname, objids, pagenos, password='',
dumpall=False, codec=None, extractdir=None): dumpall=False, codec=None, extractdir=None):
fp = open(fname, 'rb') fp = open(fname, 'rb')
@ -212,7 +216,7 @@ def dumppdf(outfp, fname, objids, pagenos, password='',
obj = doc.getobj(objid) obj = doc.getobj(objid)
dumpxml(outfp, obj, codec=codec) dumpxml(outfp, obj, codec=codec)
if pagenos: if pagenos:
for (pageno,page) in enumerate(PDFPage.create_pages(doc)): for (pageno, page) in enumerate(PDFPage.create_pages(doc)):
if pageno in pagenos: if pageno in pagenos:
if codec: if codec:
for obj in page.contents: for obj in page.contents:
@ -225,51 +229,119 @@ def dumppdf(outfp, fname, objids, pagenos, password='',
if (not objids) and (not pagenos) and (not dumpall): if (not objids) and (not pagenos) and (not dumpall):
dumptrailers(outfp, doc) dumptrailers(outfp, doc)
fp.close() fp.close()
if codec not in ('raw','binary'): if codec not in ('raw', 'binary'):
outfp.write('\n') outfp.write('\n')
return return
# main def create_parser():
def main(argv): parser = ArgumentParser(description=__doc__, add_help=True)
import getopt parser.add_argument('files', type=str, default=None, nargs='+',
def usage(): help='One or more paths to PDF files.')
print ('usage: %s [-d] [-a] [-p pageid] [-P password] [-r|-b|-t] [-T] [-E directory] [-i objid] file ...' % argv[0])
return 100
try:
(opts, args) = getopt.getopt(argv[1:], 'dap:P:rbtTE:i:o:')
except getopt.GetoptError:
return usage()
if not args: return usage()
objids = []
pagenos = set()
codec = None
password = ''
dumpall = False
proc = dumppdf
outfp = sys.stdout
extractdir = None
for (k, v) in opts:
if k == '-d': logging.getLogger().setLevel(logging.DEBUG)
elif k == '-o': outfp = open(v, 'w')
elif k == '-i': objids.extend( int(x) for x in v.split(',') )
elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
elif k == '-P': password = v
elif k == '-a': dumpall = True
elif k == '-r': codec = 'raw'
elif k == '-b': codec = 'binary'
elif k == '-t': codec = 'text'
elif k == '-T': proc = dumpoutline
elif k == '-E':
extractdir = v
proc = extractembedded
parser.add_argument(
'--debug', '-d', default=False, action='store_true',
help='Use debug logging level.')
procedure_parser = parser.add_mutually_exclusive_group()
procedure_parser.add_argument(
'--extract-toc', '-T', default=False, action='store_true',
help='Extract structure of outline')
procedure_parser.add_argument(
'--extract-embedded', '-E', type=str,
help='Extract embedded files')
parse_params = parser.add_argument_group(
'Parser', description='Used during PDF parsing')
parse_params.add_argument(
'--page-numbers', type=int, default=None, nargs='+',
help='A space-separated list of page numbers to parse.')
parse_params.add_argument(
'--pagenos', '-p', type=str,
help='A comma-separated list of page numbers to parse. Included for '
'legacy applications, use --page-numbers for more idiomatic '
'argument entry.')
parse_params.add_argument(
'--objects', '-i', type=str,
help='Comma separated list of object numbers to extract')
parse_params.add_argument(
'--all', '-a', default=False, action='store_true',
help='If the structure of all objects should be extracted')
parse_params.add_argument(
'--password', '-P', type=str, default='',
help='The password to use for decrypting PDF file.')
output_params = parser.add_argument_group(
'Output', description='Used during output generation.')
output_params.add_argument(
'--outfile', '-o', type=str, default='-',
help='Path to file where output is written. Or "-" (default) to '
'write to stdout.')
codec_parser = output_params.add_mutually_exclusive_group()
codec_parser.add_argument(
'--raw-stream', '-r', default=False, action='store_true',
help='Write stream objects without encoding')
codec_parser.add_argument(
'--binary-stream', '-b', default=False, action='store_true',
help='Write stream objects with binary encoding')
codec_parser.add_argument(
'--text-stream', '-t', default=False, action='store_true',
help='Write stream objects as plain text')
return parser
def main(argv=None):
parser = create_parser()
args = parser.parse_args(args=argv)
if args.debug:
logging.getLogger().setLevel(logging.DEBUG)
if args.outfile == '-':
outfp = sys.stdout
else:
outfp = open(args.outfile, 'w')
if args.objects:
objids = [int(x) for x in args.objects.split(',')]
else:
objids = []
if args.page_numbers:
pagenos = {x - 1 for x in args.page_numbers}
elif args.pagenos:
pagenos = {int(x) - 1 for x in args.pagenos.split(',')}
else:
pagenos = set()
password = args.password
if six.PY2 and sys.stdin.encoding: if six.PY2 and sys.stdin.encoding:
password = password.decode(sys.stdin.encoding) password = password.decode(sys.stdin.encoding)
for fname in args: if args.raw_stream:
codec = 'raw'
elif args.binary_stream:
codec = 'binary'
elif args.text_stream:
codec = 'text'
else:
codec = None
if args.extract_toc:
extractdir = None
proc = dumpoutline
elif args.extract_embedded:
extractdir = args.extract_embedded
proc = extractembedded
else:
extractdir = None
proc = dumppdf
for fname in args.files:
proc(outfp, fname, objids, pagenos, password=password, proc(outfp, fname, objids, pagenos, password=password,
dumpall=dumpall, codec=codec, extractdir=extractdir) dumpall=args.all, codec=codec, extractdir=extractdir)
outfp.close() outfp.close()
if __name__ == '__main__': sys.exit(main(sys.argv))
if __name__ == '__main__':
sys.exit(main())
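The rewritten `main` accepts page selections in two forms: `--page-numbers 1 3` (space-separated integers) and the legacy `-p 1,3` (one comma-separated string). Both are converted to a zero-based set. The sketch below isolates just that logic using the standard `argparse` module. The function names `create_page_parser` and `zero_based_pages` are illustrative and do not appear in the commit.

```python
from argparse import ArgumentParser


def create_page_parser():
    """Parser with only the two page-selection options (illustrative)."""
    parser = ArgumentParser(description='Sketch of page selection')
    parser.add_argument('--page-numbers', type=int, default=None, nargs='+')
    parser.add_argument('--pagenos', '-p', type=str)
    return parser


def zero_based_pages(argv):
    """Return the selected pages as a zero-based set, as main() does."""
    args = create_page_parser().parse_args(argv)
    if args.page_numbers:
        return {x - 1 for x in args.page_numbers}
    if args.pagenos:
        return {int(x) - 1 for x in args.pagenos.split(',')}
    return set()


new_style = zero_based_pages(['--page-numbers', '1', '3'])
legacy_style = zero_based_pages(['-p', '1,3'])
```

Both calls select the same pages. When both options are given, `--page-numbers` wins because it is checked first, which matches the `if`/`elif` order in the diff.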


@ -1,215 +0,0 @@
#!/usr/bin/env python -O
#
# pdf2html.cgi - Gateway script for converting PDF into HTML.
#
# Security consideration for public access:
#
# Limit the process size and/or maximum cpu time.
# The process should be chrooted.
# The user should be imposed quota.
#
# How to Setup:
# $ mkdir $CGIDIR
# $ mkdir $CGIDIR/var
# $ python setup.py install_lib --install-dir=$CGIDIR
# $ cp pdfminer/tools/pdf2html.cgi $CGIDIR
#
import sys, os, os.path, re, time
import cgi, logging, traceback, random
# comment out at this at runtime.
#import cgitb; cgitb.enable()
import pdfminer
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import six #Python 2+3 compatibility
# quote HTML metacharacters
def q(x):
return x.replace('&','&amp;').replace('>','&gt;').replace('<','&lt;').replace('"','&quot;')
# encode parameters as a URL
Q = re.compile(r'[^a-zA-Z0-9_.-=]')
def url(base, **kw):
r = []
for (k,v) in six.iteritems(kw):
v = Q.sub(lambda m: '%%%02X' % ord(m.group(0)), encoder(q(v), 'replace')[0])
r.append('%s=%s' % (k, v))
return base+'&'.join(r)
## convert
##
class FileSizeExceeded(ValueError): pass
def convert(infp, outfp, path, codec='utf-8',
maxpages=0, maxfilesize=0, pagenos=None,
html=True):
# save the input file.
src = open(path, 'wb')
nbytes = 0
while 1:
data = infp.read(4096)
nbytes += len(data)
if maxfilesize and maxfilesize < nbytes:
raise FileSizeExceeded(maxfilesize)
if not data: break
src.write(data)
src.close()
infp.close()
# perform conversion and
# send the results over the network.
rsrcmgr = PDFResourceManager()
laparams = LAParams()
if html:
device = HTMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
layoutmode='exact')
else:
device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
fp = open(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages):
interpreter.process_page(page)
fp.close()
device.close()
return
## WebApp
##
class WebApp(object):
TITLE = 'pdf2html demo'
MAXFILESIZE = 10000000 # set to zero if unlimited.
MAXPAGES = 100 # set to zero if unlimited.
def __init__(self, infp=sys.stdin, outfp=sys.stdout, environ=os.environ,
codec='utf-8', apppath='/'):
self.infp = infp
self.outfp = outfp
self.environ = environ
self.codec = codec
self.apppath = apppath
self.remote_addr = self.environ.get('REMOTE_ADDR')
self.path_info = self.environ.get('PATH_INFO')
self.method = self.environ.get('REQUEST_METHOD', 'GET').upper()
self.server = self.environ.get('SERVER_SOFTWARE', '')
self.tmpdir = self.environ.get('TEMP', './var/')
self.content_type = 'text/html; charset=%s' % codec
self.logger = logging.getLogger()
return
def put(self, *args):
for x in args:
if isinstance(x, str):
self.outfp.write(x)
elif isinstance(x, unicode):
self.outfp.write(x.encode(self.codec, 'xmlcharrefreplace'))
return
def response_200(self):
if self.server.startswith('cgi-httpd'):
# required for cgi-httpd
self.outfp.write('HTTP/1.0 200 OK\r\n')
self.outfp.write('Content-type: %s\r\n' % self.content_type)
self.outfp.write('Connection: close\r\n\r\n')
return
def response_404(self):
if self.server.startswith('cgi-httpd'):
# required for cgi-httpd
self.outfp.write('HTTP/1.0 404 Not Found\r\n')
self.outfp.write('Content-type: text/html\r\n')
self.outfp.write('Connection: close\r\n\r\n')
self.outfp.write('<html><body>page does not exist</body></body>\n')
return
def response_301(self, url):
if self.server.startswith('cgi-httpd'):
# required for cgi-httpd
self.outfp.write('HTTP/1.0 301 Moved\r\n')
self.outfp.write('Location: %s\r\n\r\n' % url)
return
def coverpage(self):
self.put(
'<html><head><title>%s</title></head><body>\n' % q(self.TITLE),
'<h1>%s</h1><hr>\n' % q(self.TITLE),
'<form method="POST" action="%s" enctype="multipart/form-data">\n' % q(self.apppath),
'<p>Upload PDF File: <input name="f" type="file" value="">\n',
'&nbsp; Page numbers (comma-separated):\n',
'<input name="p" type="text" size="10" value="">\n',
'<p>(Text extraction is limited to maximum %d pages.\n' % self.MAXPAGES,
'Maximum file size for input is %d bytes.)\n' % self.MAXFILESIZE,
'<p><input type="submit" name="c" value="Convert to HTML">\n',
'<input type="submit" name="c" value="Convert to TEXT">\n',
'<input type="reset" value="Reset">\n',
'</form><hr>\n',
'<p>Powered by <a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a>-%s\n' % pdfminer.__version__,
'</body></html>\n',
)
return
def setup(self):
self.run = self.response_404
status = 404
if not os.path.isdir(self.tmpdir):
self.logger.error('no tmpdir')
status = 304
elif self.path_info == self.apppath:
self.run = self.convert
status = 200
return status
def convert(self):
form = cgi.FieldStorage(fp=self.infp, environ=self.environ)
if (self.method != 'POST' or
'c' not in form or
'f' not in form):
self.response_200()
self.coverpage()
return
item = form['f']
if not (item.file and item.filename):
self.response_200()
self.coverpage()
return
cmd = form.getvalue('c')
html = (cmd == 'Convert to HTML')
pagenos = []
if 'p' in form:
for m in re.finditer(r'\d+', form.getvalue('p')):
try:
pagenos.append(int(m.group(0)))
except ValueError:
pass
h = abs(hash((random.random(), self.remote_addr, item.filename)))
tmppath = os.path.join(self.tmpdir, '%08x%08x.pdf' % (time.time(), h))
self.logger.info('received: host=%s, name=%r, pagenos=%r, tmppath=%r' %
(self.remote_addr, item.filename, pagenos, tmppath))
try:
if not html:
self.content_type = 'text/plain; charset=%s' % self.codec
self.response_200()
try:
convert(item.file, self.outfp, tmppath, pagenos=pagenos, codec=self.codec,
maxpages=self.MAXPAGES, maxfilesize=self.MAXFILESIZE, html=html)
except Exception as e:
self.put('<p>Sorry, an error has occurred: %s' % q(repr(e)))
self.logger.error('convert: %r: path=%r: %s' % (e, tmppath, traceback.format_exc()))
finally:
try:
os.remove(tmppath)
except OSError:
pass
return
# main
if __name__ == '__main__':
app = WebApp()
app.setup()
sys.exit(app.run())


@ -1,29 +1,30 @@
#!/usr/bin/env python """A command line tool for extracting text and images from PDF and output it to plain text, html, xml or tags."""
"""
Converts PDF text content (though not images containing text) to plain text, html, xml or "tags".
"""
import argparse import argparse
import logging import logging
import six
import sys import sys
import pdfminer.settings import six
pdfminer.settings.STRICT = False
import pdfminer.high_level import pdfminer.high_level
import pdfminer.layout import pdfminer.layout
from pdfminer.image import ImageWriter from pdfminer.image import ImageWriter
logging.basicConfig()
def extract_text(files=[], outfile='-', def extract_text(files=[], outfile='-',
_py2_no_more_posargs=None, # Bloody Python2 needs a shim
no_laparams=False, all_texts=None, detect_vertical=None, # LAParams no_laparams=False, all_texts=None, detect_vertical=None, # LAParams
word_margin=None, char_margin=None, line_margin=None, boxes_flow=None, # LAParams word_margin=None, char_margin=None, line_margin=None, boxes_flow=None, # LAParams
output_type='text', codec='utf-8', strip_control=False, output_type='text', codec='utf-8', strip_control=False,
maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0, maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0,
layoutmode='normal', output_dir=None, debug=False, layoutmode='normal', output_dir=None, debug=False,
disable_caching=False, **other): disable_caching=False, **kwargs):
if _py2_no_more_posargs is not None: if '_py2_no_more_posargs' in kwargs:
raise ValueError("Too many positional arguments passed.") raise DeprecationWarning(
'The `_py2_no_more_posargs` argument will be removed in January 2020. At '
'that moment pdfminer.six will stop supporting Python 2. Please '
'upgrade to Python 3. For more information see '
'https://github.com/pdfminer/pdfminer.six/issues/194')
if not files: if not files:
raise ValueError("Must provide files to work upon!") raise ValueError("Must provide files to work upon!")
@ -66,28 +67,68 @@ def extract_text(files=[], outfile='-',
def maketheparser(): def maketheparser():
parser = argparse.ArgumentParser(description=__doc__, add_help=True) parser = argparse.ArgumentParser(description=__doc__, add_help=True)
parser.add_argument("files", type=str, default=None, nargs="+", help="File to process.") parser.add_argument("files", type=str, default=None, nargs="+", help="One or more paths to PDF files.")
parser.add_argument("-d", "--debug", default=False, action="store_true", help="Debug output.")
parser.add_argument("-p", "--pagenos", type=str, help="Comma-separated list of page numbers to parse. Included for legacy applications, use --page-numbers for more idiomatic argument entry.") parser.add_argument("--debug", "-d", default=False, action="store_true",
parser.add_argument("--page-numbers", type=int, default=None, nargs="+", help="Alternative to --pagenos with space-separated numbers; supersedes --pagenos where it is used.") help="Use debug logging level.")
parser.add_argument("-m", "--maxpages", type=int, default=0, help="Maximum pages to parse") parser.add_argument("--disable-caching", "-C", default=False, action="store_true",
parser.add_argument("-P", "--password", type=str, default="", help="Decryption password for PDF") help="If caching of resources, such as fonts, should be disabled.")
parser.add_argument("-o", "--outfile", type=str, default="-", help="Output file (default \"-\" is stdout)")
parser.add_argument("-t", "--output_type", type=str, default="text", help="Output type: text|html|xml|tag (default is text)") parse_params = parser.add_argument_group('Parser', description='Used during PDF parsing')
parser.add_argument("-c", "--codec", type=str, default="utf-8", help="Text encoding") parse_params.add_argument("--page-numbers", type=int, default=None, nargs="+",
parser.add_argument("-s", "--scale", type=float, default=1.0, help="Scale") help="A space-separated list of page numbers to parse.")
parser.add_argument("-A", "--all-texts", default=None, action="store_true", help="LAParams all texts") parse_params.add_argument("--pagenos", "-p", type=str,
parser.add_argument("-V", "--detect-vertical", default=None, action="store_true", help="LAParams detect vertical") help="A comma-separated list of page numbers to parse. Included for legacy applications, "
parser.add_argument("-W", "--word-margin", type=float, default=None, help="LAParams word margin") "use --page-numbers for more idiomatic argument entry.")
parser.add_argument("-M", "--char-margin", type=float, default=None, help="LAParams char margin") parse_params.add_argument("--maxpages", "-m", type=int, default=0,
parser.add_argument("-L", "--line-margin", type=float, default=None, help="LAParams line margin") help="The maximum number of pages to parse.")
parser.add_argument("-F", "--boxes-flow", type=float, default=None, help="LAParams boxes flow") parse_params.add_argument("--password", "-P", type=str, default="",
parser.add_argument("-Y", "--layoutmode", default="normal", type=str, help="HTML Layout Mode") help="The password to use for decrypting PDF file.")
parser.add_argument("-n", "--no-laparams", default=False, action="store_true", help="Pass None as LAParams") parse_params.add_argument("--rotation", "-R", default=0, type=int,
parser.add_argument("-R", "--rotation", default=0, type=int, help="Rotation") help="The number of degrees to rotate the PDF before other types of processing.")
parser.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
parser.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching") la_params = parser.add_argument_group('Layout analysis', description='Used during layout analysis.')
parser.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode") la_params.add_argument("--no-laparams", "-n", default=False, action="store_true",
help="If layout analysis parameters should be ignored.")
la_params.add_argument("--detect-vertical", "-V", default=False, action="store_true",
help="If vertical text should be considered during layout analysis")
la_params.add_argument("--char-margin", "-M", type=float, default=2.0,
help="If two characters are closer together than this margin they are considered to be part "
"of the same word. The margin is specified relative to the width of the character.")
la_params.add_argument("--word-margin", "-W", type=float, default=0.1,
help="If two words are closer together than this margin they are considered to be part "
"of the same line. A space is added in between for readability. The margin is "
"specified relative to the width of the word.")
la_params.add_argument("--line-margin", "-L", type=float, default=0.5,
help="If two lines are close together they are considered to be part of the same "
"paragraph. The margin is specified relative to the height of a line.")
la_params.add_argument("--boxes-flow", "-F", type=float, default=0.5,
help="Specifies how much a horizontal and vertical position of a text matters when "
"determining the order of lines. The value should be within the range of -1.0 (only "
"horizontal position matters) to +1.0 (only vertical position matters).")
la_params.add_argument("--all-texts", "-A", default=True, action="store_true",
help="If layout analysis should be performed on text in figures.")
output_params = parser.add_argument_group('Output', description='Used during output generation.')
output_params.add_argument("--outfile", "-o", type=str, default="-",
help="Path to file where output is written. Or \"-\" (default) to write to stdout.")
output_params.add_argument("--output_type", "-t", type=str, default="text",
help="Type of output to generate {text,html,xml,tag}.")
output_params.add_argument("--codec", "-c", type=str, default="utf-8",
help="Text encoding to use in output file.")
output_params.add_argument("--output-dir", "-O", default=None,
help="The output directory to put extracted images in. If not given, images are not "
"extracted.")
output_params.add_argument("--layoutmode", "-Y", default="normal", type=str,
help="Type of layout to use when generating html {normal,exact,loose}. If normal, "
"each line is positioned separately in the html. If exact, each character is "
"positioned separately in the html. If loose, same result as normal but with an "
"additional newline after each text line. Only used when output_type is html.")
output_params.add_argument("--scale", "-s", type=float, default=1.0,
help="The amount of zoom to use when generating html file. Only used when output_type "
"is html.")
output_params.add_argument("--strip-control", "-S", default=False, action="store_true",
help="Remove control statement from text. Only used when output_type is xml.")
return parser return parser
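The regrouping above relies on `argparse` argument groups, which only change how `--help` is laid out; parsing behaves the same as with ungrouped arguments. The following is a minimal sketch under that assumption, using a cut-down parser rather than the full one from the diff. `build_parser` and `sample.pdf` are placeholder names, and the conversion to zero-based page indices mirrors the `x-1` expression that appears elsewhere in this diff.

```python
import argparse


def build_parser():
    # Cut-down, illustrative parser; not the complete pdf2txt option set.
    parser = argparse.ArgumentParser(description="Illustrative parser.")
    parser.add_argument("files", type=str, nargs="+",
                        help="One or more paths to PDF files.")

    # Options in a group are listed under their own heading in --help.
    parse_params = parser.add_argument_group(
        "Parser", description="Used during PDF parsing")
    parse_params.add_argument("--page-numbers", type=int, default=None,
                              nargs="+",
                              help="A space-separated list of page numbers.")
    parse_params.add_argument("--maxpages", "-m", type=int, default=0,
                              help="The maximum number of pages to parse.")
    return parser


# Parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["sample.pdf", "--page-numbers", "1", "3"])

# Users count pages from 1; convert to zero-based indices.
zero_based = {x - 1 for x in args.page_numbers}
```

Grouped options are still read from the same namespace (`args.page_numbers`, `args.maxpages`), so callers do not need to know which group an option belongs to.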


@ -1,30 +0,0 @@
# -*- mode: python -*-
block_cipher = None
a = Analysis(['pdf2txt.py'],
pathex=['C:\\Dev\\Python\\pdfminer.six\\tools'],
binaries=[],
datas=[],
hiddenimports=[],
hookspath=[],
runtime_hooks=[],
excludes=['django','matplotlib','PIL','numpy','qt5'],
win_no_prefer_redirects=False,
win_private_assemblies=False,
cipher=block_cipher)
pyz = PYZ(a.pure, a.zipped_data,
cipher=block_cipher)
exe = EXE(pyz,
a.scripts,
a.binaries,
a.zipfiles,
a.datas,
name='pdf2txt',
debug=False,
strip=False,
upx=True,
runtime_tmpdir=None,
console=True )


@ -11,28 +11,34 @@ pdfminer.settings.STRICT = False
import pdfminer.high_level import pdfminer.high_level
import pdfminer.layout import pdfminer.layout
def compare(file1,file2,**args): logging.basicConfig()
if args.get('_py2_no_more_posargs',None) is not None:
raise ValueError("Too many positional arguments passed.")
def compare(file1, file2, **kwargs):
if '_py2_no_more_posargs' in kwargs:
raise DeprecationWarning(
'The `_py2_no_more_posargs` argument will be removed in January 2020. At '
'that moment pdfminer.six will stop supporting Python 2. Please '
'upgrade to Python 3. For more information see '
'https://github.com/pdfminer/pdfminer.six/issues/194')
# If any LAParams group arguments were passed, create an LAParams object and # If any LAParams group arguments were passed, create an LAParams object and
# populate with given args. Otherwise, set it to None. # populate with given args. Otherwise, set it to None.
if args.get('laparams',None) is None: if kwargs.get('laparams', None) is None:
laparams = pdfminer.layout.LAParams() laparams = pdfminer.layout.LAParams()
for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"): for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
paramv = args.get(param, None) paramv = kwargs.get(param, None)
if paramv is not None: if paramv is not None:
laparams[param]=paramv laparams[param]=paramv
args['laparams']=laparams kwargs['laparams']=laparams
s1=six.StringIO() s1=six.StringIO()
with open(file1, "rb") as fp: with open(file1, "rb") as fp:
pdfminer.high_level.extract_text_to_fp(fp,s1, **args) pdfminer.high_level.extract_text_to_fp(fp, s1, **kwargs)
s2=six.StringIO() s2=six.StringIO()
with open(file2, "rb") as fp: with open(file2, "rb") as fp:
pdfminer.high_level.extract_text_to_fp(fp,s2, **args) pdfminer.high_level.extract_text_to_fp(fp, s2, **kwargs)
import difflib import difflib
s1.seek(0) s1.seek(0)
@ -41,12 +47,12 @@ def compare(file1,file2,**args):
import os.path import os.path
try: try:
extension = os.path.splitext(args['outfile'])[1][1:4] extension = os.path.splitext(kwargs['outfile'])[1][1:4]
if extension.lower()=='htm': if extension.lower()=='htm':
return difflib.HtmlDiff().make_file(s1,s2) return difflib.HtmlDiff().make_file(s1,s2)
except KeyError: except KeyError:
pass pass
return difflib.unified_diff(s1,s2,n=args['context_lines']) return difflib.unified_diff(s1, s2, n=kwargs['context_lines'])
# main # main
@ -85,10 +91,12 @@ def main(args=None):
P.add_argument("-O", "--output-dir", default=None, help="Output directory for images") P.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
P.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching") P.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching")
P.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode") P.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode")
A = P.parse_args(args=args) A = P.parse_args(args=args)
if A.debug:
logging.getLogger().setLevel(logging.DEBUG)
if A.page_numbers: if A.page_numbers:
A.page_numbers = set([x-1 for x in A.page_numbers]) A.page_numbers = set([x-1 for x in A.page_numbers])
if A.pagenos: if A.pagenos:

Some files were not shown because too many files have changed in this diff