diff --git a/docs/miningpdf.html b/docs/miningpdf.html deleted file mode 100644 index 7f2372f..0000000 --- a/docs/miningpdf.html +++ /dev/null @@ -1,121 +0,0 @@ - - - - -Mining PDF files - - - -

Mining PDF files

-

- -

-Homepage - -

- -Last Modified: Sat Nov 14 21:09:01 JST 2009 - -
- -

What is PDF?

-

-

What PDF is ...

- - -

What PDF is not ...

- - -

Structure of PDF

-

-From a data structure's point of view, PDF is a total mess in the -history of computing. Originally, Adobe had a document format called -PostScript (which is also more of a "graphics" format than a -text format). It has a nice graphic representation and is able to -express commercial-quality typesetting. However, it has to target -a specific printer, and its file size tends to get bloated because -almost everything is represented as text. PDF is Adobe's attempt -to create a less printer-dependent format with a reduced data size -(that's why it was named the "portable" document format). To some -degree, PDF can be seen as a "compressed" version of PostScript -with seekable index tables. Since its drawing model and concepts -(coordinates, color spaces, etc.) remain pretty much the same -as its predecessor's, Adobe decided to partially reuse the original PostScript -notation in PDF. However, this eclectic approach ended -in disaster. - -

Format Disaster

-

-When designing a data format, there are two different strategies: -using text or using binary. Both have obvious merits and -demerits. The biggest merit of a textual representation is -that it is human readable and can be modified with any text -editor. One demerit of a textual representation is its bloated size, -especially if you want to embed something like pictures or -multimedia data such as audio or video. Another demerit is -that you need a program to serialize/deserialize -(parse) the data, which can be very complex and buggy. On the -other hand, a binary representation normally doesn't require a -complex parser and takes much less space than text. However, -it is not readable for humans. Now, Adobe decided to take the -good parts from both worlds by making PDF a partially text and -partially binary format, and as a result, PDF inherits the -drawbacks of both worlds without having much of their merits, i.e. -PDF is a human *unreadable* document format that still requires a -complex and error-prone parser and has a bloated file size. -

-Adobe has probably been aware of this problem from early on, and -they have tried to fix it over the years. So they gradually dropped text -representations and leaned more toward binary ones. For example, -in PDF specification 1.5, they introduced a new notation called -"object stream" (which is different from a "stream object" that -was already there in the specification). - -However, by that time there were already tons of PDFs that had been -produced under the original standard, which every PDF -viewer is still required to support. - -
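As a rough illustration of the mixed text/binary layout described above, the following Python sketch builds something shaped like a PDF stream object: a textual wrapper and dictionary around a Flate-compressed (zlib) payload. The page content, the object number and the helper name `extract_stream` are made up for this example, and the helper is deliberately naive; a real parser would honor the `/Length` entry rather than search for keywords.

```python
import zlib

# Hypothetical page content: drawing operators written as plain text.
content = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 20

# Compress the payload with zlib (deflate), as a Flate-encoded stream would be.
compressed = zlib.compress(content)

# The wrapper and the dictionary stay textual; the payload in between is binary.
stream_object = (
    b"1 0 obj\n"
    + b"<< /Length %d /Filter /FlateDecode >>\n" % len(compressed)
    + b"stream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)


def extract_stream(obj: bytes) -> bytes:
    """Naively locate the payload between the keywords and decompress it."""
    start = obj.index(b"stream\n") + len(b"stream\n")
    end = obj.rindex(b"\nendstream")
    return zlib.decompress(obj[start:end])


if __name__ == "__main__":
    print(stream_object[:60])  # readable header followed by binary bytes
    print(extract_stream(stream_object)[:44])
```

Opening the resulting bytes in a text editor shows a readable header followed by unreadable data, so a reader needs both a text parser and a decompressor to get at the content.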

Problem of Text Extraction from PDF Documents

-

-Many people tend to think that a PDF document is somewhat similar -to a Word or HTML document, which is not true. In fact, the primary -focus of PDF is printing and showing on a computer display, so -it is extremely versatile at reproducing the detailed "look" -of typography, pictures and graphics. All the text in a PDF document is -just a bunch of string objects floating at various locations on a -blank slate. There is no text flow control and no contextual clue -about its content, except in a few special "tagged" PDF documents with -extra annotations that denote headlines or page boundaries, which -require specialized tools to create. -

-(OpenOffice, for example, has the ability to create tagged PDF -documents. But the level of annotation varies depending -on the implementation, and in many cases it is not possible to -obtain the full layout information by using tags alone.) -

-Apart from tagged documents, PDF doesn't care about the order of the text -strings rendered on a page. You can completely jumble up every -string in a PDF and still make it look like a -perfect document on the surface. Even worse, PDF allows a word to -be split in the middle and drawn as multiple unrelated strings in -order to represent precise text positioning. For example, a -certain word processor creates PDFs that split the word -"You" into two separate strings "Y" and "ou" because of the subtle -kerning between the letters. -
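To make the problem concrete, here is a minimal Python sketch of the kind of layout analysis an extractor has to do: group positioned strings into lines by their vertical position, sort each line left to right, and join neighbors whose horizontal gap is too small to be a word space. The `Fragment` type, the `rebuild_lines` function, the tolerance values and the sample coordinates are all assumptions made for this illustration; they are not taken from any particular PDF tool.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    x: float      # left edge of the string, in page units
    y: float      # baseline height (the y axis grows upward)
    width: float  # rendered width of the string
    text: str


def rebuild_lines(fragments: List[Fragment],
                  line_tol: float = 2.0,
                  gap_tol: float = 1.0) -> List[str]:
    """Rebuild text lines from strings given in arbitrary order."""
    # Visit fragments from the top of the page down, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Fragments with nearly the same baseline belong to the same line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A tiny (or negative, kerned) gap means the word was split.
            parts.append(cur.text if gap <= gap_tol else " " + cur.text)
        result.append("".join(parts))
    return result


# Jumbled input in which the word "You" is stored as "Y" and "ou".
fragments = [
    Fragment(100.0, 700.0, 10.0, "hi"),
    Fragment(56.0, 700.0, 12.0, "ou"),
    Fragment(50.0, 680.0, 20.0, "Bye"),
    Fragment(50.0, 700.0, 6.5, "Y"),
    Fragment(72.0, 700.0, 24.0, "said"),
]

if __name__ == "__main__":
    print(rebuild_lines(fragments))
```

Fixed tolerances like these only work for simple single-column pages; multi-column layouts, rotated text and varying font sizes need considerably more analysis.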

-So there's a huge problem associated with properly extracting text -from PDF files. Doing so requires almost the same kind of analysis -as optical character recognition (OCR). - - -


-
Yusuke Shinyama
-