<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Mining PDF files</title>
<style type="text/css"><!--
blockquote { background: #eeeeee; }
--></style>
</head><body>
<h1>Mining PDF files</h1>
<p>
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sat Nov 14 21:09:01 JST 2009
<!-- hhmts end -->
</div>
<h2>What is PDF?</h2>
<p>
<h3>What PDF is ...</h3>
<ul>
<li> A weird mixture of text and binary data. (Yikes!)
<li> Written sequentially, but requires random access to read.
</ul>
<h3>What PDF is not ...</h3>
<ul>
<li> An editable document format (like Word or HTML).
<li> Friendly from an accessibility point of view.
</ul>
<h2>Structure of PDF</h2>
<p>
From a data structure's point of view, PDF is one of the biggest
messes in computer history. Originally, Adobe had a document format
called PostScript (which is also more of a "graphics" format than a
text format). It has a nice graphical model and can express
commercial-quality typesetting. However, a PostScript file has to be
tailored to a specific printer, and its size tends to bloat because
almost everything is represented as text. PDF was Adobe's attempt to
create a less printer-dependent format with a reduced data size
(that's why it was named "portable" document format). To some degree,
PDF can be seen as a "compressed" version of PostScript with seekable
index tables. Since its drawing model and concepts (coordinates,
color spaces, etc.) remain pretty much the same as its predecessor's,
Adobe decided to partially reuse the original PostScript notation in
PDF. However, this eclectic position ended up in a disastrous
situation.
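<p>
To get a feel for this mixture, here is a sketch of the skeleton of a
hypothetical, uncompressed PDF file (the byte offsets are
illustrative, not computed from a real file). The header, the object
dictionaries and the trailer are plain text, stream contents are
often binary, and the "xref" table near the end records the byte
offset of every object, which is what makes random access possible:
<blockquote><pre>
%PDF-1.4
1 0 obj                           % object number 1
&lt;&lt; /Type /Catalog /Pages 2 0 R &gt;&gt;
endobj
2 0 obj
&lt;&lt; /Type /Pages /Kids [3 0 R] /Count 1 &gt;&gt;
endobj
3 0 obj
&lt;&lt; /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] &gt;&gt;
endobj
xref                              % seekable index table
0 4
0000000000 65535 f
0000000009 00000 n
0000000060 00000 n
0000000117 00000 n
trailer
&lt;&lt; /Size 4 /Root 1 0 R &gt;&gt;
startxref
174                               % byte offset of the xref table
%%EOF
</pre></blockquote>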
<h3>Format Disaster</h3>
<p>
When designing a data format, there are two different strategies:
using text or using binary. Both have obvious merits and demerits.
The biggest merit of a textual representation is that it is human
readable and can be modified with any text editor. Its demerit is a
bloated size, especially when you want to embed things like pictures
or multimedia data such as audio and video. Another demerit of a
textual representation is that you need a program to
serialize/deserialize (parse) the data, which can be very complex and
buggy. A binary representation, on the other hand, normally doesn't
require a complex parser and takes much less space than text.
However, it is not readable by humans. Adobe decided to take the good
parts of both worlds by making PDF a partially textual and partially
binary format, and as a result, PDF inherits the drawbacks of both
worlds without gaining much of their merits: PDF is a human
*unreadable* document format that still requires a complex and
error-prone parser, and its file size is bloated anyway.
<p>
Adobe has probably been aware of this problem from early on, and has
tried to fix it over the years by gradually dropping textual
representations and leaning further toward binary. For example, the
PDF 1.5 specification introduced a new notation called an "object
stream" (which is different from a "stream object", which had been in
the specification from the beginning). By then, however, there were
already tons of PDFs produced under the original standard, and every
PDF viewer is still required to support them.
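<p>
To illustrate the confusing terminology (a sketch with illustrative
numbers, not taken from a real file): a "stream object" is an
ordinary object whose content is a single, possibly binary, stream,
while an "object stream" is a PDF 1.5 stream that packs several other
objects inside itself, usually compressed:
<blockquote><pre>
% a stream object: one object carrying stream data
4 0 obj
&lt;&lt; /Length 36 &gt;&gt;
stream
BT /F1 12 Tf 72 712 Td (Hello) Tj ET
endstream
endobj

% an object stream (PDF 1.5): objects 6 and 7 packed into object 5
5 0 obj
&lt;&lt; /Type /ObjStm /N 2 /First 9 /Length ... &gt;&gt;
stream
6 0 7 21                % pairs of (object number, offset in stream)
&lt;&lt; ...object 6... &gt;&gt;
&lt;&lt; ...object 7... &gt;&gt;
endstream
endobj
</pre></blockquote>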
<h2>Problem of Text Extraction from PDF Documents</h2>
<p>
Many people tend to think that a PDF document is somewhat similar to
a Word or HTML document, which is not true. In fact, the primary
focus of PDF is printing and display on a computer screen, so it is
extremely versatile at describing the precise "look" of text
typography, pictures and graphics. All the text in a PDF document is
just a bunch of string objects floating at various locations on a
blank slate. There is no text flow control and no contextual clue
about the content, except in a few special "tagged" PDF documents
that carry extra annotations denoting things like headlines or page
boundaries, and which require specialized tools to create.
<p>
(OpenOffice, for example, has the ability to create tagged PDF
documents. But the degree of annotation varies from implementation to
implementation, and in many cases it is not possible to obtain the
full layout information from the tags alone.)
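<p>
You can observe this "floating strings" nature directly with
pdfminer's layout analysis. Below is a minimal sketch using the
pdfminer.six high-level API (the file name is a placeholder); it
prints every recovered text box along with the coordinates it floats
at:
<blockquote><pre>
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

# Walk the layout objects that pdfminer reconstructs for each page.
# Each chunk of text carries nothing but its bounding box; any
# "reading order" has to be inferred from these coordinates.
for page_layout in extract_pages("example.pdf"):  # placeholder name
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(round(element.x0), round(element.y0), repr(element.get_text()))
</pre></blockquote>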
<p>
Aside from tagged documents, PDF doesn't care about the order in
which text strings are rendered on a page. You could completely
jumble up every string in a PDF and it would still look like a
perfect document on the surface. Even worse, PDF allows a word to be
split in the middle and drawn as multiple unrelated strings in order
to represent precise text positioning. For example, a certain word
processor creates PDFs in which the word "You" is split into two
separate strings "Y" and "ou" because of the subtle kerning between
the letters.
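<p>
In content stream terms, such output looks something like the sketch
below (operand values are illustrative). The number inside the TJ
array shifts the following string by that many thousandths of a
text-space unit, which is how the kerning between "Y" and "ou" gets
encoded:
<blockquote><pre>
BT
/F1 12 Tf             % select a font at 12 points
72 700 Td             % move to the drawing position
[ (Y) 60 (ou) ] TJ    % draw "Y", tighten spacing, draw "ou"
ET
</pre></blockquote>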
<p>
So there is a huge problem in extracting text properly from PDF
files: it requires almost the same kind of analysis as optical
character recognition (OCR).
<hr noshade>
<address>Yusuke Shinyama</address>
</body>
</html>