<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
.comment { color: darkgreen; }
--></style>
</head><body>
<p>
<a href="index.html">[Back to PDFMiner homepage]</a>

<h1>Programming with PDFMiner</h1>
<p>
This document explains how to use PDFMiner as a library 
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a>
<li> <a href="#more">More</a>
</ul>

<a name="overview">
<hr noshade>
<h2>Overview</h2>
<p>
<strong>PDF is evil.</strong>  Although it is called a PDF
"document", it's nothing like Word or HTML. PDF is more like a
picture representation.  PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact
position on a display or paper.  In most cases, it has no logical
structure such as sentences or paragraphs and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from the
positioning of the contents, but nothing is guaranteed to work. Ugly, I
know. Again, PDF is evil.

<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole is time- and memory-consuming. However,
most PDF processing tasks do not need every part of the file. Therefore
PDFMiner takes a strategy of lazy parsing, which is to parse
each part only when it is needed. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.  
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate it to whatever you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.
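
<p>
The lazy parsing strategy can be illustrated with a toy sketch.
This is not PDFMiner's code: the <code>LazyObjectStore</code> class and
its "key=value" object format are hypothetical and exist only to show
the idea that an object is parsed the first time it is requested
and then served from a cache.
<blockquote><pre>
```python
class LazyObjectStore(object):
    """Toy stand-in for a document that parses objects on demand.

    Hypothetical illustration only; this is not a PDFMiner class.
    """

    def __init__(self, raw_objects):
        # raw_objects maps an object id to its unparsed text.
        self._raw = raw_objects
        self._cache = {}
        self.parse_count = 0

    def get(self, objid):
        # Parse the object only the first time someone asks for it.
        if objid not in self._cache:
            self.parse_count += 1
            self._cache[objid] = self._parse(self._raw[objid])
        return self._cache[objid]

    def _parse(self, text):
        # Pretend parsing: split "key=value" pairs into a dict.
        return dict(pair.split('=', 1) for pair in text.split())


store = LazyObjectStore({
    1: 'Type=Catalog Pages=2',
    2: 'Type=Pages Count=1',
    3: 'Type=Font Name=Helvetica',
})
catalog = store.get(1)
catalog_again = store.get(1)  # served from the cache, not parsed again
print(catalog['Type'])
```
</pre></blockquote>
<p>
Objects 2 and 3 are never parsed in this run, which is the saving
that lazy parsing is meant to provide.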

<p>
Figure 1 shows the relationship between the classes in PDFMiner.

<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>

<a name="basic">
<hr noshade>
<h2>Basic Usage</h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
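from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
<span class="comment"># Assumption: the exception class is defined in pdfminer.pdfinterp;</span>
<span class="comment"># adjust this import if your version keeps it elsewhere.</span>
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice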

<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
doc = PDFDocument()
<span class="comment"># Connect the parser and document objects.</span>
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password for initialization.</span>
<span class="comment"># (If no password is set, give an empty string.)</span>
password = ''
doc.initialize(password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in doc.get_pages():
    interpreter.process_page(page)
</pre></blockquote>

<a name="layout">
<hr noshade>
<h2>Accessing Layout Objects</h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
    interpreter.process_page(page)
    <span class="comment"># Receive the LTPage object for the page.</span>
    layout = device.get_result()
</pre></blockquote>

<p>
The layout analyzer returns an <code>LTPage</code> object for each page
in the PDF document. This object contains the child objects found on the page,
forming a tree-like structure. Figure 2 shows the relationship between
these objects.

<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and their tree structure</small>
</div>

<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>, 
<code>LTPolygon</code> and <code>LTLine</code>.

<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text. 
It contains a list of <code>LTTextLine</code> objects.

<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontally
or vertically, depending on the text's writing mode.

<dt> <code>LTChar</code>
<dt> <code>LTText</code>
<dd> These objects represent an actual letter in the text as a Unicode string.
Note that, while an <code>LTChar</code> object has actual boundaries,
<code>LTText</code> objects do not, as these are "virtual" characters,
inserted by the layout analyzer according to the relationship between two characters
(e.g. a space).

<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.

<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be 
in JPEG or other formats, but currently PDFMiner does not 
pay much attention to graphical objects.

<dt> <code>LTLine</code>
<dd> Represents a single straight line shown on a page. 
Could be used for separating text or figures.

<dt> <code>LTRect</code>
<dd> Represents a rectangle shown on a page. 
Could be used for framing other pictures or figures.

<dt> <code>LTPolygon</code>
<dd> Represents a polygon in a page. 
</dl>
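
<p>
Because the layout objects form a tree, a short recursive function is
enough to visit all of them. The sketch below is only an illustration
under stated assumptions: <code>Node</code> is a hypothetical stand-in
class, not a PDFMiner class, and the tree is built by hand. It assumes
that a container yields its children when iterated. With real layout
objects you would start from the <code>LTPage</code> returned by
<code>device.get_result()</code> and check that an object is a container
before iterating over it.
<blockquote><pre>
```python
class Node(object):
    """Hypothetical stand-in for a layout object; not a PDFMiner class."""

    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

    def __iter__(self):
        # Container objects yield their child objects when iterated.
        return iter(self.children)


def walk(obj, depth=0):
    """Yield (depth, kind) for obj and all of its descendants, depth-first."""
    yield depth, obj.kind
    for child in obj:
        for item in walk(child, depth + 1):
            yield item


# A hand-built tree that mirrors the object types described above.
page = Node('LTPage', [
    Node('LTTextBox', [
        Node('LTTextLine', [Node('LTChar'), Node('LTText')]),
    ]),
    Node('LTFigure', [Node('LTImage')]),
    Node('LTRect'),
])

for depth, kind in walk(page):
    print('  ' * depth + kind)
```
</pre></blockquote>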

<a name="toc">
<hr noshade>
<h2>TOC Extraction</h2>
<p>
PDFMiner provides functions to access the document's table of contents
("Outlines").

<blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument

fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Give an empty string if no password is set.</span>
password = ''
doc.initialize(password)

<span class="comment"># Get the outlines of the document.</span>
outlines = doc.get_outlines()
for (level,title,dest,a,se) in outlines:
    print (level, title)
</pre></blockquote>

<p>
Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since
PDF does not have a logical structure, and it does not provide a
way to refer to any in-page object from the outside, there's no
way to tell exactly which part of the text these destinations are
referring to.
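
<p>
The outline entries can be turned into an indented table of contents
with a few lines of code. This is a minimal sketch, not part of
PDFMiner: <code>format_toc</code> is a hypothetical helper, the sample
entries are made up, and it assumes that the top-level entries are
reported with level 1. In practice you would pass it the result of
<code>doc.get_outlines()</code>.
<blockquote><pre>
```python
def format_toc(outlines, indent='  '):
    """Return one indented line per outline entry."""
    lines = []
    for (level, title, dest, a, se) in outlines:
        # Assumption: top-level entries have level 1, so they get no indent.
        lines.append(indent * max(level - 1, 0) + title)
    return lines


# Hypothetical sample entries; a real document supplies its own.
sample = [
    (1, 'Introduction', None, None, None),
    (2, 'Background', None, None, None),
    (1, 'Conclusion', None, None, None),
]
for line in format_toc(sample):
    print(line)
```
</pre></blockquote>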

<a name="more">
<hr noshade>
<h2>More</h2>

<p>
You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
to process page contents differently or to obtain other information.

<hr noshade>
<address>Yusuke Shinyama</address>
</body>