Mining PDF files

Last Modified: Sat Nov 14 21:09:01 JST 2009

What is PDF?

What PDF is ...

What PDF is not ...

Structure of PDF

From a data structure point of view, PDF is one of the biggest messes in computing history. Originally, Adobe had a document format called PostScript (which is itself more of a "graphics" format than a text format). It has a nice graphic representation and can express commercial-quality typesetting. However, a PostScript file tends to be tied to a specific printer, and its size tends to get bloated because almost everything is represented as text. PDF is Adobe's attempt to create a less printer-dependent format with a reduced data size (that's why it was named "Portable" Document Format). To some degree, PDF can be seen as a "compressed" version of PostScript with seekable index tables. Since its drawing model and concepts (coordinates, color spaces, etc.) remain pretty much the same as its predecessor's, Adobe decided to partially reuse the original PostScript notation in PDF. However, this eclectic approach ended up in a disastrous situation.

Format Disaster

When designing a data format, there are two different strategies: using text or using binary. Both have obvious merits and demerits. The biggest merit of a textual representation is that it is human readable and can be modified with any text editor. Its demerit is its bloated size, especially if you want to include pictures or multimedia data such as audio or video. Another demerit is that you need a program to serialize/deserialize (parse) the data, which can be very complex and buggy. On the other hand, a binary representation normally doesn't require a complex parser and takes much less space than text. However, it is not readable by humans. Now, Adobe decided to take the good parts from both worlds by making PDF a partially text and partially binary format. As a result, PDF inherits the drawbacks of both worlds without having much of their merits, i.e. PDF is a human *unreadable* document format that still requires a complex and error-prone parser and has a bloated file size.
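
To get a feel for this mix, here is a minimal Python sketch (not taken from any real PDF library) that builds and reads back a single PDF-style stream object: a textual header and dictionary wrapped around a zlib-compressed binary payload. The helper names make_stream_object and read_stream_object are made up for illustration, and the regular expression only handles this simplified layout. A real parser has to cope with far more variation.

```python
import re
import zlib


def make_stream_object(num, payload):
    """Wrap a payload in a simplified PDF-style stream object (hypothetical helper)."""
    body = zlib.compress(payload)  # binary part, as used by the /FlateDecode filter
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (num, len(body))
    return header + body + b"\nendstream\nendobj\n"


def read_stream_object(data):
    """Parse the textual part, then use /Length to slice out the binary part."""
    m = re.match(rb"(\d+) (\d+) obj\s*<<(.*?)>>\s*stream\r?\n", data, re.S)
    if m is None:
        raise ValueError("not a stream object in the simplified layout")
    length = int(re.search(rb"/Length (\d+)", m.group(3)).group(1))
    start = m.end()
    # The binary bytes cannot be scanned as text, so /Length is the only safe guide.
    return zlib.decompress(data[start:start + length])


payload = b"BT /F1 12 Tf (Hello) Tj ET"
obj = make_stream_object(5, payload)
print(obj[:60])                 # readable header followed by unreadable bytes
print(read_stream_object(obj))  # the original payload
```

Even in this toy case the reader needs a text parser for the dictionary and binary handling for the payload, and neither half is pleasant to inspect in a text editor.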

Adobe has probably been aware of this problem from early on, and they have tried to fix it over the years by gradually dropping text representations and leaning more toward binary ones. For example, PDF specification 1.5 introduced a new notation called an "object stream" (which is different from the "stream object" that was already in the specification). However, by that time there were already tons of PDFs produced according to the original standard, which every PDF viewer still has to support.

Problem of Text Extraction from PDF Documents

Many people tend to think that a PDF document is somewhat similar to a Word or HTML document, which is not true. In fact, the primary focus of PDF is printing and display on a computer screen, so it is extremely versatile at describing the details of how typography, pictures and graphics "look". All the text in a PDF document is just a bunch of string objects floating at various locations on a blank slate. There is no text flow control and no contextual clue about the content, except in a few special "tagged" PDF documents with extra annotations that denote headlines or page boundaries, which require specialized tools to create.

(OpenOffice, for example, is able to create tagged PDF documents. But the degree of annotation varies depending on the implementation, and in many cases it is not possible to obtain the full layout information from tags alone.)

Apart from tagged documents, PDF doesn't care about the order in which text strings are rendered on a page. You can completely jumble up every string in a PDF and still make it look like a perfect document on the surface. Even worse, PDF allows a word to be split in the middle and drawn as multiple unrelated strings in order to achieve precise text positioning. For example, a certain word processor creates PDFs that split the word "You" into two separate strings, "Y" and "ou", because of the subtle kerning between the letters.
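
As a rough illustration, the Python sketch below reads a hand-written, hypothetical page content stream in which every string is placed with an explicit text matrix (the Tm operator) and drawn with Tj. The sample data and the function name fragments are assumptions made for this example. Real content streams use many more operators and string encodings than this regular expression understands.

```python
import re

# Hand-written sample: the strings are drawn out of reading order,
# and the word "You" is split into "Y" and "ou".
CONTENT = b"""BT
/F1 12 Tf
1 0 0 1 100 700 Tm (ou) Tj
1 0 0 1 130 700 Tm (can) Tj
1 0 0 1 92 700 Tm (Y) Tj
ET"""

# Matches only the simplified "1 0 0 1 x y Tm (string) Tj" pattern used above.
TM_TJ = re.compile(rb"1 0 0 1 (\d+) (\d+) Tm \((.*?)\) Tj")


def fragments(content):
    """Return (x, y, text) tuples in the order they appear in the stream."""
    return [(int(x), int(y), s.decode("ascii")) for x, y, s in TM_TJ.findall(content)]


frags = fragments(CONTENT)
naive = "".join(text for _, _, text in frags)
print(frags)
print(naive)  # "oucanY": stream order is not reading order
```

A viewer would render this page as "You can", but simply concatenating the strings in file order gives nonsense.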

So there's a huge problem with extracting text properly from PDF files. Doing so requires almost the same kind of analysis as optical character recognition (OCR).
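
To make the comparison with OCR a bit more concrete, here is a minimal Python sketch of the kind of geometric analysis involved: it groups positioned fragments into lines by their vertical position, sorts each line from left to right, and decides from the horizontal gap whether two neighbouring fragments belong to the same word. The Fragment class, the function rebuild_lines, the tolerance values and the sample coordinates are all assumptions chosen for illustration. This is one simple heuristic, not how any particular tool does it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float      # left edge of the string
    y: float      # baseline position (larger means higher on the page)
    width: float  # rendered width of the string
    text: str


def rebuild_lines(frags, y_tol=2.0, gap_tol=1.5):
    """Group fragments into lines, then join them into words by horizontal gap."""
    lines = []
    for f in sorted(frags, key=lambda f: -f.y):  # top of the page first
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0].y - f.y) <= y_tol:
            lines[-1].append(f)
        else:
            lines.append([f])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left to right
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A small (or negative, kerned) gap means the same word continues.
            text += ("" if gap <= gap_tol else " ") + cur.text
        result.append(text)
    return result


sample = [
    Fragment(100, 700, 12, "ou"),
    Fragment(130, 700, 18, "can"),
    Fragment(92, 700, 8, "Y"),
    Fragment(126, 686.5, 20, "this"),
    Fragment(92, 686, 30, "read"),
]
print(rebuild_lines(sample))  # ['You can', 'read this']
```

Fixed tolerances like these break down quickly with multiple columns, varying font sizes or rotated text, which is why real layout analysis ends up being so involved.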


Yusuke Shinyama