remove obsolete documents
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@258 1aa58f4a-7d42-0410-adbc-911cccaed67cpull/1/head
parent
4f4f03fb2d
commit
6d64586502
|
@ -1,121 +0,0 @@
|
||||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
|
||||||
<html>
|
|
||||||
<head>
|
|
||||||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
||||||
<title>Mining PDF files</title>
|
|
||||||
<style type="text/css"><!--
|
|
||||||
blockquote { background: #eeeeee; }
|
|
||||||
--></style>
|
|
||||||
</head><body>
|
|
||||||
|
|
||||||
<h1>Mining PDF files</h1>
|
|
||||||
<p>
|
|
||||||
|
|
||||||
<p>
|
|
||||||
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
|
|
||||||
|
|
||||||
<div align=right class=lastmod>
|
|
||||||
<!-- hhmts start -->
|
|
||||||
Last Modified: Sat Nov 14 21:09:01 JST 2009
|
|
||||||
<!-- hhmts end -->
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<h2>What is PDF?</h2>
|
|
||||||
<p>
|
|
||||||
<h3>What PDF is ...</h3>
|
|
||||||
<ul>
|
|
||||||
<li> A weird mixture of texts and binaries. (Yikes!)
|
|
||||||
<li> Generated sequentially, but needs random access to read.
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h3>What PDF is not ...</h3>
|
|
||||||
<ul>
|
|
||||||
<li> Editable document format (like Word or HTML).
|
|
||||||
<li> Nice for accessility point of view.
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h2>Structure of PDF</h2>
|
|
||||||
<p>
|
|
||||||
From a data structure's point of view, PDF is a total mess in the
|
|
||||||
computer history. Originally, Adobe had a document format called
|
|
||||||
PostScript (which is also more like "graphics" format rather than
|
|
||||||
text format). It has nice graphic representation and is able to
|
|
||||||
express commercial quality typesetting. However, it has to be for
|
|
||||||
a specific printer and its file size tends to get bloated because
|
|
||||||
almost everything is represented as text. PDF is Adobe's attempt
|
|
||||||
to create a less printer dependent format with a reduced data size
|
|
||||||
(that's why it was named "portable" document format). To some
|
|
||||||
degree, PDF can be seen as a "compressed" version of PostScript
|
|
||||||
with seekable index tables. Since its drawing model and concepts
|
|
||||||
(coordinations, color spaces, etc.) remains pretty much the same
|
|
||||||
as its precedessor, Adobe decided to reuse the original PostScript
|
|
||||||
notation partially in PDF. However, this eclectic position ended
|
|
||||||
up with a disastrous situation.
|
|
||||||
|
|
||||||
<h3>Format Disaster</h2>
|
|
||||||
<p>
|
|
||||||
When designing a data format, there are two different strategies:
|
|
||||||
using text or using binary. They both have obvious merits and
|
|
||||||
demerits. The biggest merit of having textual representation is
|
|
||||||
that they are human readable and can be modified with any text
|
|
||||||
editor. The demerits of textual representation is its bloted size,
|
|
||||||
especially if you want to put something like pictures and
|
|
||||||
multimedia data like audio or video. Another demerit of textual
|
|
||||||
representation is that you need a program to serialize/deserialize
|
|
||||||
(parse) the data, which can be very complex and buggy. On the
|
|
||||||
other hand, binary representation normally doesn't require a
|
|
||||||
complex parser and takes much less space than texts. However,
|
|
||||||
they're not readable for humans. Now, Adobe decided to take the
|
|
||||||
good parts from both worlds by making PDF a partially text and
|
|
||||||
partially binary format, and as a result, PDF inherits the
|
|
||||||
drawbacks of both worlds without having much of their merits, i.e.
|
|
||||||
PDF is a human *unreadable* document format that still requires a
|
|
||||||
complex and error-prone parser and has a bloated file size.
|
|
||||||
<p>
|
|
||||||
Adobe has been probably aware of this problem from early on, and
|
|
||||||
they tried to fix this over years. So they gradually dropped text
|
|
||||||
representations and more inclided toward binaries. For example,
|
|
||||||
in PDF specification 1.5, they introduce a new notation called
|
|
||||||
"object stream" (which is different from a "stream object" that
|
|
||||||
was already there in the specification).
|
|
||||||
|
|
||||||
However, by this time there are already tons of PDFs that were
|
|
||||||
produced by the original standard, which still requires every PDF
|
|
||||||
viewer to support.
|
|
||||||
|
|
||||||
<h2>Problem of Text Extraction from PDF Documents</h2>
|
|
||||||
<p>
|
|
||||||
Many people tend to think that a PDF document is somewhat similar
|
|
||||||
to a Word or HTML document, which is not true. In fact, the primary
|
|
||||||
focus of PDF is printing and showing on a computer display, so
|
|
||||||
it is extremely versatile for showing the details of "looks"
|
|
||||||
of text typography, picture and graphics. All the texts in a PDF document is
|
|
||||||
just a bunch of string objects floating at various locations on a
|
|
||||||
blank slate. There is no text flow control and no contexual clue
|
|
||||||
about its content, except few special "tagged" PDF documents with
|
|
||||||
extra annotations that denote headlines or page boundaries, which
|
|
||||||
require specialized tools to create.
|
|
||||||
<p>
|
|
||||||
(OpenOffice, for example, has ability to create tagged PDF
|
|
||||||
documents. But the degree of the annotations is varied depending
|
|
||||||
on its implementation, and in many cases it is not possible to
|
|
||||||
obtain the full layout information by only using tags.)
|
|
||||||
<p>
|
|
||||||
Besides tagged documents, PDF doesn't care the order of text
|
|
||||||
strings rendered in a page. You can completely jumble up every
|
|
||||||
piece of strings in a PDF and still make it look like a
|
|
||||||
perfect document on the surface. Even worse, PDF allows a word to
|
|
||||||
be split in the middle and drawn as multiple unrelated strings in
|
|
||||||
order to represent precise text positioning. For example, a
|
|
||||||
certain word processing software creates a PDF that splits a word
|
|
||||||
"You" into two separate strings "Y" and "ou" because of the subtle
|
|
||||||
kerning between the letters.
|
|
||||||
<p>
|
|
||||||
So there's a huge problem associated with extracting texts properly
|
|
||||||
from PDF files. They require almost similar kinds of analysis
|
|
||||||
to optical character recognition (OCR).
|
|
||||||
|
|
||||||
|
|
||||||
<hr noshade>
|
|
||||||
<address>Yusuke Shinyama</address>
|
|
||||||
</body>
|
|
Loading…
Reference in New Issue