pdfminer.six/docs/programming.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
.comment { color: darkgreen; }
--></style>
</head><body>
<p>
<a href="index.html">[Back to PDFMiner homepage]</a>

<h1>Programming with PDFMiner</h1>
<p>
This document explains how to use PDFMiner as a library 
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Layout Analysis</a>
<li> <a href="#toc">TOC Extraction</a>
</ul>

<a name="overview">
<hr noshade>
<h2>Overview</h2>
<p>
<strong>PDF is evil.</strong>
Because a PDF file is usually large and has a complex structure,
parsing it as a whole consumes a lot of time and memory.
Furthermore, most PDF processing tasks do not need every part of
the file. PDFMiner therefore uses lazy parsing: it parses each
part only when that part is needed. To parse a PDF
file, you need at least two classes: <code>PDFParser</code>
and <code>PDFDocument</code>.  These objects work together:
<code>PDFParser</code> fetches (or parses) data from the file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate them into whatever you need.
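<p>
The lazy parsing idea can be illustrated with a minimal sketch.
This is not PDFMiner's implementation: <code>LazyObjectStore</code>,
its methods and the sample data are hypothetical names made up for this
illustration, and the "parser" simply splits text into tokens.

```python
class LazyObjectStore:
    """Toy stand-in for a document that parses objects only on first access."""

    def __init__(self, raw_objects):
        # raw_objects maps an object id to its unparsed source text.
        self._raw = raw_objects
        self._cache = {}
        self.parse_count = 0

    def _parse(self, text):
        # Placeholder "parser": split the raw text into tokens.
        self.parse_count += 1
        return text.split()

    def get(self, objid):
        # Parse on demand, then reuse the cached result on later calls.
        if objid not in self._cache:
            self._cache[objid] = self._parse(self._raw[objid])
        return self._cache[objid]


# Hypothetical raw content for two objects.
store = LazyObjectStore({1: "BT /F1 12 Tf ET", 2: "0 0 m 10 10 l S"})
first = store.get(1)
again = store.get(1)
```

<p>
Object 2 is never parsed in this sketch because nothing asks for it,
and object 1 is parsed only once even though it is requested twice.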

<p>
A PDF document is more like a graphics format than a text
format.  Its contents are just a series of instructions that
tell how to render things on a display or on paper.  In most
cases, it has no logical structure such as sentences or
paragraphs.  So PDFMiner attempts to reconstruct some of that structure by
performing layout analysis. Ugly, I know. Again, PDF is evil.
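<p>
To see why this reconstruction is needed, consider a toy sketch in which
a page is nothing more than positioned glyphs. The function below groups
them into lines by comparing baseline positions. It is only an
illustration under simplifying assumptions, not the algorithm PDFMiner
uses; <code>group_into_lines</code>, the <code>tolerance</code> parameter
and the glyph tuples are all hypothetical.

```python
import math


def group_into_lines(glyphs, tolerance=2.0):
    """Group (x, y, char) glyphs into text lines by similar baseline y."""
    lines = []  # each entry: [baseline_y, list of (x, char)]
    # Visit glyphs from the top of the page down (larger y is higher).
    for x, y, char in sorted(glyphs, key=lambda g: -g[1]):
        if lines and math.isclose(lines[-1][0], y, abs_tol=tolerance):
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    # Within each line, order the characters from left to right.
    return ["".join(c for _, c in sorted(chars)) for _, chars in lines]


# Hypothetical glyph positions, deliberately out of reading order.
glyphs = [(20, 700, "i"), (10, 700, "H"), (10, 680.5, "P"),
          (30, 681, "F"), (20, 680, "D")]
text_lines = group_into_lines(glyphs)
```

<p>
Even this small example has to guess at a tolerance, which hints at why
recovering text structure from drawing instructions is inherently heuristic.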

<p>
Figure 1 shows the relationship between these classes:

<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>

<a name="basic">
<hr noshade>
<h2>Basic Usage</h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
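from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
<span class="comment"># PDFTextExtractionNotAllowed is assumed to live in pdfminer.pdfinterp;</span>
<span class="comment"># adjust this import if your PDFMiner version defines it elsewhere.</span>
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice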

<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
doc = PDFDocument()
<span class="comment"># Connect the parser and document objects.</span>
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password for initialization.</span>
<span class="comment"># (If no password is set, give an empty string.)</span>
password = ''
doc.initialize(password)
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in doc.get_pages():
    interpreter.process_page(page)
</pre></blockquote>

<a name="layout">
<hr noshade>
<h2>Layout Analysis</h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in doc.get_pages():
    interpreter.process_page(page)
    <span class="comment"># Receive the LTPage object for the page.</span>
    layout = device.get_result()
</pre></blockquote>

The layout analyzer produces an <code>LTPage</code> object for each page
in the PDF document. This object contains the child objects found within the page,
forming a tree-like structure. Figure 2 shows the relationship between
these objects.

<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and their tree structure</small>
</div>

<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>, 
<code>LTPolygon</code> and <code>LTLine</code>.

<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.

<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontally
or vertically, depending on the text's writing mode.

<dt> <code>LTChar</code>
<dt> <code>LTText</code>
<dd> These objects represent an actual letter in the text as a Unicode string.
Note that, while an <code>LTChar</code> object has actual boundaries,
<code>LTText</code> objects do not, as these are "virtual" characters
inserted by the layout analyzer according to the relationship between two characters
(e.g. a space).

<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.

<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be 
in JPEG or other formats, but currently PDFMiner does not 
pay much attention to graphical objects.

<dt> <code>LTLine</code>
<dd> Represents a single straight line shown in a page.
It can be used to separate text or figures.

<dt> <code>LTRect</code>
<dd> Represents a rectangle shown in a page.
It can be used to frame other pictures or figures.

<dt> <code>LTPolygon</code>
<dd> Represents a polygon in a page. 
</dl>
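<p>
Because the layout objects form a tree, they can be visited recursively.
The following is a minimal sketch that assumes container objects can be
iterated to obtain their children and that leaf objects cannot. The
<code>Node</code> and <code>Leaf</code> classes are hypothetical stand-ins
so that the example runs without a PDF file; they are not PDFMiner classes.

```python
def walk(obj, depth=0):
    """Yield (depth, object) pairs for obj and all of its descendants."""
    yield depth, obj
    # Leaf objects are assumed not to be iterable; containers yield children.
    try:
        children = iter(obj)
    except TypeError:
        return
    for child in children:
        yield from walk(child, depth + 1)


class Node:
    """Hypothetical stand-in for a container layout object."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def __iter__(self):
        return iter(self._children)


class Leaf:
    """Hypothetical stand-in for a non-container object such as a character."""

    def __init__(self, name):
        self.name = name


page = Node("LTPage", [
    Node("LTTextBox", [Node("LTTextLine", [Leaf("LTChar"), Leaf("LTChar")])]),
    Leaf("LTRect"),
])
visited = [(depth, obj.name) for depth, obj in walk(page)]
```

<p>
With a real layout object, the same function could be called on the
value returned by <code>device.get_result()</code>, using the class name
of each object in place of the made-up <code>name</code> attribute.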

<a name="toc">
<hr noshade>
<h2>TOC Extraction</h2>
<p>
PDFMiner provides functions to access the document's table of contents
("Outlines").

<blockquote><pre>
from pdfminer.pdfparser import PDFParser, PDFDocument

fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
<span class="comment"># Supply the password (an empty string if none is set).</span>
password = ''
doc.initialize(password)

<span class="comment"># Get the outlines of the document.</span>
outlines = doc.get_outlines()
for (level,title,dest,a,se) in outlines:
    print (level, title)
</pre></blockquote>
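<p>
The tuples yielded by the loop above can be turned into an indented
table of contents. The sketch below assumes that top-level entries have
level 1; <code>format_outlines</code> and the sample entries are
hypothetical and only mimic the tuple shape used above.

```python
def format_outlines(outlines, indent="  "):
    """Render (level, title, dest, a, se) tuples as indented text lines."""
    lines = []
    for (level, title, dest, a, se) in outlines:
        # Levels are assumed to start at 1 for top-level entries.
        lines.append(indent * max(level - 1, 0) + title)
    return lines


# Hypothetical outline entries; dest, a and se are left as None here.
sample = [
    (1, "Introduction", None, None, None),
    (2, "Background", None, None, None),
    (1, "Conclusion", None, None, None),
]
toc = format_outlines(sample)
```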

<p>
Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since
PDF does not have a logical structure, and it does not provide a
way to refer to any in-page object from the outside, there's no
way to tell exactly which part of the text these destinations
refer to.

<hr noshade>
<address>Yusuke Shinyama</address>
</body>