documentation improved
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@247 1aa58f4a-7d42-0410-adbc-911cccaed67c
pull/1/head
parent
69d9d85685
commit
0e0acfc3ff
|
@ -1,4 +1,4 @@
|
||||||
%TGIF 4.1.45-QPL
|
%TGIF 4.2.2
|
||||||
state(0,37,100.000,0,0,0,16,1,9,1,1,1,0,0,2,1,1,'Helvetica-Bold',1,69120,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
|
state(0,37,100.000,0,0,0,16,1,9,1,1,1,0,0,2,1,1,'Helvetica-Bold',1,69120,0,0,1,10,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
|
||||||
%
|
%
|
||||||
% @(#)$Header$
|
% @(#)$Header$
|
||||||
|
@ -30,6 +30,8 @@ script_frac("0.6").
|
||||||
fg_bg_colors('black','white').
|
fg_bg_colors('black','white').
|
||||||
dont_reencode("FFDingbests:ZapfDingbats").
|
dont_reencode("FFDingbests:ZapfDingbats").
|
||||||
objshadow_info('#c0c0c0',2,2).
|
objshadow_info('#c0c0c0',2,2).
|
||||||
|
rotate_pivot(0,0,0,0).
|
||||||
|
spline_tightness(1).
|
||||||
page(1,"",1,'').
|
page(1,"",1,'').
|
||||||
oval('black','',350,380,450,430,2,2,1,88,0,0,0,0,0,'2',0,[
|
oval('black','',350,380,450,430,2,2,1,88,0,0,0,0,0,'2',0,[
|
||||||
]).
|
]).
|
||||||
|
@ -167,19 +169,19 @@ poly('black','',2,[
|
||||||
"0","",[
|
"0","",[
|
||||||
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
|
||||||
]).
|
]).
|
||||||
text('black',400,158,1,1,1,68,15,115,12,3,2,0,0,0,2,68,15,0,0,"",0,0,0,0,170,'',[
|
text('black',400,158,1,1,1,84,15,115,12,3,2,0,0,0,2,84,15,0,0,"",0,0,0,0,170,'',[
|
||||||
minilines(68,15,0,0,1,0,0,[
|
minilines(84,15,0,0,1,0,0,[
|
||||||
mini_line(68,12,3,0,0,0,[
|
mini_line(84,12,3,0,0,0,[
|
||||||
str_block(0,68,12,3,0,0,0,0,0,[
|
str_block(0,84,12,3,0,-1,0,0,0,[
|
||||||
str_seg('black','Helvetica-Bold',1,69120,68,12,3,0,0,0,0,0,0,0,
|
str_seg('black','Helvetica-Bold',1,69120,84,12,3,0,-1,0,0,0,0,0,
|
||||||
"page object")])
|
"page contents")])
|
||||||
])
|
])
|
||||||
])]).
|
])]).
|
||||||
text('black',400,258,1,1,1,115,15,119,12,3,2,0,0,0,2,115,15,0,0,"",0,0,0,0,270,'',[
|
text('black',400,258,1,1,1,129,15,119,12,3,2,0,0,0,2,129,15,0,0,"",0,0,0,0,270,'',[
|
||||||
minilines(115,15,0,0,1,0,0,[
|
minilines(129,15,0,0,1,0,0,[
|
||||||
mini_line(115,12,3,0,0,0,[
|
mini_line(129,12,3,0,0,0,[
|
||||||
str_block(0,115,12,3,0,-1,0,0,0,[
|
str_block(0,129,12,3,0,-1,0,0,0,[
|
||||||
str_seg('black','Helvetica-Bold',1,69120,115,12,3,0,-1,0,0,0,0,0,
|
str_seg('black','Helvetica-Bold',1,69120,129,12,3,0,-1,0,0,0,0,0,
|
||||||
"rendering sequence")])
|
"rendering instructions")])
|
||||||
])
|
])
|
||||||
])]).
|
])]).
|
||||||
|
|
BIN
docs/objrel.png
Binary file not shown.
Size before: 2.0 KiB | Size after: 2.0 KiB
|
@ -16,11 +16,45 @@ blockquote { background: #eeeeee; }
|
||||||
This document explains how to use PDFMiner as a library
|
This document explains how to use PDFMiner as a library
|
||||||
from other applications.
|
from other applications.
|
||||||
<ul>
|
<ul>
|
||||||
|
<li> <a href="#overview">Overview</a>
|
||||||
<li> <a href="#basic">Basic Usage</a>
|
<li> <a href="#basic">Basic Usage</a>
|
||||||
<li> <a href="#layout">Layout Analysis</a>
|
<li> <a href="#layout">Layout Analysis</a>
|
||||||
<li> <a href="#toc">TOC Extraction</a>
|
<li> <a href="#toc">TOC Extraction</a>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
|
<a name="overview">
|
||||||
|
<hr noshade>
|
||||||
|
<h2>Overview</h2>
|
||||||
|
<p>
|
||||||
|
<strong>PDF is evil.</strong>
|
||||||
|
Because a PDF file is normally big and has a complex structure,
|
||||||
|
parsing a PDF as a whole is time-and-memory
|
||||||
|
consuming. Furthermore, not every part is needed for most PDF
|
||||||
|
processing. Therefore, PDFMiner takes a strategy of lazy parsing,
|
||||||
|
which is to parse the stuff only when it's necessary. To parse PDF
|
||||||
|
files, you need at least two classes: <code>PDFParser</code>
|
||||||
|
and <code>PDFDocument</code>. These objects work together.
|
||||||
|
<code>PDFParser</code> fetches (or parses) data from a PDF,
|
||||||
|
and <code>PDFDocument</code> stores it. You'll also need
|
||||||
|
<code>PDFPageInterpreter</code> to process the page contents
|
||||||
|
and <code>PDFDevice</code> to translate it to whatever you need.
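The lazy parsing strategy described above can be illustrated with a toy sketch. This is not PDFMiner's implementation: `LazyObjectStore`, its `get` method, and the whitespace "parser" are hypothetical stand-ins, used only to show that an object is parsed the first time it is requested and then reused.

```python
class LazyObjectStore:
    """Toy object store that parses an object only on first access.

    Hypothetical illustration; these are not PDFMiner's actual classes.
    """

    def __init__(self, raw_objects):
        # raw_objects maps an object id to its unparsed text.
        self._raw = dict(raw_objects)
        self._cache = {}
        self.parse_count = 0  # number of objects actually parsed so far

    def _parse(self, text):
        # Stand-in for real parsing work: split the text into tokens.
        self.parse_count += 1
        return text.split()

    def get(self, objid):
        # Parse on demand and remember the result for later calls.
        if objid not in self._cache:
            self._cache[objid] = self._parse(self._raw[objid])
        return self._cache[objid]


store = LazyObjectStore({1: "/Type /Catalog", 2: "/Type /Page", 3: "/Type /Font"})
first = store.get(2)   # object 2 is parsed here
again = store.get(2)   # served from the cache, no second parse
```

Objects 1 and 3 are never touched in this run, so no time or memory is spent on them, which is the point of parsing "only when it's necessary".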
|
||||||
|
|
||||||
|
<p>
|
||||||
|
PDF documents are more like a graphics format than a text
|
||||||
|
format. The contents in PDF are just a bunch of procedures that
|
||||||
|
tell how to render the stuff on a display or paper. In most
|
||||||
|
cases, it presents no logical structure such as sentences or
|
||||||
|
paragraphs. So PDFMiner attempts to reconstruct some of them by
|
||||||
|
performing layout analysis. Ugly, I know. Again, PDF is evil.
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Figure 1 shows the relationship between these classes:
|
||||||
|
|
||||||
|
<div align=center>
|
||||||
|
<img src="objrel.png"><br>
|
||||||
|
<small>Figure 1. Relationships between PDFMiner classes</small>
|
||||||
|
</div>
|
||||||
|
|
||||||
<a name="basic">
|
<a name="basic">
|
||||||
<hr noshade>
|
<hr noshade>
|
||||||
<h2>Basic Usage</h2>
|
<h2>Basic Usage</h2>
|
||||||
|
@ -57,25 +91,11 @@ for page in doc.get_pages():
|
||||||
interpreter.process_page(page)
|
interpreter.process_page(page)
|
||||||
</pre></blockquote>
|
</pre></blockquote>
|
||||||
|
|
||||||
<p>
|
|
||||||
In PDFMiner, there are several Python classes involved in parsing a PDF file,
|
|
||||||
as shown in Figure 1.
|
|
||||||
|
|
||||||
<div align=center>
|
|
||||||
<img src="objrel.png"><br>
|
|
||||||
<small>Figure 1. Relationships between PDFMiner objects</small>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<a name="layout">
|
<a name="layout">
|
||||||
<hr noshade>
|
<hr noshade>
|
||||||
<h2>Accessing Layout Objects</h2>
|
<h2>Accessing Layout Objects</h2>
|
||||||
<p>
|
<p>
|
||||||
PDF documents are more like graphics, rather than text documents.
|
Here is a typical way to use the layout analysis function:
|
||||||
In most cases, it presents no logical structure such as sentences or paragraphs.
|
|
||||||
PDFMiner attempts to reconstruct some of them by performing
|
|
||||||
basic layout analysis.
|
|
||||||
<p>
|
|
||||||
Here is a typical way to do it:
|
|
||||||
<blockquote><pre>
|
<blockquote><pre>
|
||||||
from pdfminer.layout import LAParams
|
from pdfminer.layout import LAParams
|
||||||
from pdfminer.converter import PDFPageAggregator
|
from pdfminer.converter import PDFPageAggregator
|
||||||
|
@ -172,11 +192,12 @@ for (level,title,dest,a,se) in outlines:
|
||||||
</pre></blockquote>
|
</pre></blockquote>
|
||||||
|
|
||||||
<p>
|
<p>
|
||||||
In some PDF documents, destinations are referred to as page numbers.
|
Some PDF documents use page numbers as destinations, while others
|
||||||
In other PDF documents, destinations are referred to as page numbers plus
|
use page numbers and the physical location within the page. Since
|
||||||
the location within the page. Since PDF does not provide a way to
|
PDF does not have a logical structure, and it does not provide a
|
||||||
point to graphical objects in a page, normally these in-page destinations
|
way to refer to any in-page object from the outside, there's no
|
||||||
are specified by physical coordinates.
|
way to tell exactly which part of text these destinations are
|
||||||
|
referring to.
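The two kinds of destinations mentioned above can be told apart with a small helper. This is only a sketch under stated assumptions: `describe_destination` is a hypothetical function, and it assumes the destination has already been reduced to plain Python values (an integer page number, or a list whose second item names the destination kind, such as `"XYZ"` followed by left and top coordinates). The values returned by a real PDF library may be wrapped in other object types.

```python
def describe_destination(dest):
    """Summarize a destination given as a page number or a list.

    Assumed shapes (hypothetical, plain Python values only):
      3                          -> page number only
      [3, "XYZ", 72, 700, None]  -> page number plus coordinates
    """
    if isinstance(dest, int):
        return {"page": dest, "position": None}
    page, kind, *args = dest
    if kind == "XYZ" and len(args) >= 2:
        left, top = args[0], args[1]
        return {"page": page, "position": (left, top)}
    # Other kinds (for example "Fit") carry no usable point here.
    return {"page": page, "position": None}


whole_page = describe_destination(3)
in_page = describe_destination([3, "XYZ", 72, 700, None])
```

Even in the second case the result is only a physical coordinate on the page. Finding the text nearest to that point would still require layout analysis.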
|
||||||
|
|
||||||
<hr noshade>
|
<hr noshade>
|
||||||
<address>Yusuke Shinyama</address>
|
<address>Yusuke Shinyama</address>
|
||||||
|
|