remove obsolete documents

git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@258 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 09:22:57 +00:00 · 2010-10-17 09:22:57 +00:00 · 6d64586502
parent 4f4f03fb2d
commit 6d64586502
1 changed files with 0 additions and 121 deletions
--- a/docs/miningpdf.html
+++ b/docs/miningpdf.html
@@ -1,121 +0,0 @@
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 <title>Mining PDF files</title>
 <style type="text/css"><!--
 blockquote { background: #eeeeee; }
 --></style>
 </head><body>
 <h1>Mining PDF files</h1>
 <p>
 <p>
 <a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
 <div align=right class=lastmod>
 <!-- hhmts start -->
 Last Modified: Sat Nov 14 21:09:01 JST 2009
 <!-- hhmts end -->
 </div>
 <h2>What is PDF?</h2>
 <p>
 <h3>What PDF is ...</h3>
 <ul>
 <li> A weird mixture of text and binary data. (Yikes!)
 <li> Generated sequentially, but needs random access to read.
 </ul>
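 <p>
 The random-access point can be illustrated with a minimal sketch. A reader
 typically starts from the end of the file, looks for the
 <code>startxref</code> keyword, and jumps to the byte offset written after
 it. The function name <code>find_startxref</code>, the 1024-byte tail size,
 and the hand-made sample bytes below are assumptions for illustration; the
 sample is not a valid PDF, and this is not how any particular library does it.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset written after the last 'startxref' keyword.

    Only the last `tail_size` bytes are searched, mimicking a reader that
    seeks to the end of the file instead of scanning it from the start.
    """
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found near the end of the data")
    return int(matches[-1])


# Hypothetical, heavily simplified stand-in for a PDF file (NOT a valid PDF).
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\ntrailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(len(body)).encode("ascii")
    + b"\n%%EOF\n"
)

offset = find_startxref(sample)
# The offset points at the index table, which was written after the body.
print(sample[offset:offset + 4])
```

 <p>
 The index is written last because the writer only knows the offsets after
 emitting the objects, while the reader needs the index first.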
 <h3>What PDF is not ...</h3>
 <ul>
 <li> An editable document format (like Word or HTML).
 <li> Nice from an accessibility point of view.
 </ul>
 <h2>Structure of PDF</h2>
 <p>
 From a data structure point of view, PDF is one of the biggest messes
 in computing history.  Originally, Adobe had a document format called
 PostScript (which is itself more of a "graphics" format than a
 text format). It has a nice graphics representation and can
 express commercial-quality typesetting. However, a PostScript file has
 to be written for a specific printer, and its size tends to bloat because
 almost everything is represented as text. PDF is Adobe's attempt
 to create a less printer-dependent format with a reduced data size
 (that's why it was named the "portable" document format). To some
 degree, PDF can be seen as a "compressed" version of PostScript
 with seekable index tables.  Since its drawing model and concepts
 (coordinates, color spaces, etc.) remain pretty much the same
 as its predecessor's, Adobe decided to partially reuse the original
 PostScript notation in PDF. However, this eclectic approach ended
 in disaster.
 <h3>Format Disaster</h3>
 <p>
 When designing a data format, there are two different strategies:
 using text or using binary. Both have obvious merits and
 demerits.  The biggest merit of a textual representation is
 that it is human readable and can be modified with any text
 editor. One demerit of a textual representation is its bloated size,
 especially if you want to embed something like pictures or
 multimedia data such as audio or video. Another demerit is
 that you need a program to serialize/deserialize
 (parse) the data, which can be very complex and buggy. On the
 other hand, a binary representation normally doesn't require a
 complex parser and takes much less space than text. However,
 it is not readable by humans.  Now, Adobe decided to take the
 good parts of both worlds by making PDF a partially text and
 partially binary format, and as a result, PDF inherits the
 drawbacks of both worlds without having many of their merits, i.e.
 PDF is a human *unreadable* document format that still requires a
 complex and error-prone parser and has a bloated file size.
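 <p>
 The size and parsing trade-off described above can be seen in a small
 sketch that stores the same numbers both ways. The sample values and the
 choice of 8-byte little-endian doubles are assumptions made purely for
 illustration; they have nothing to do with how PDF itself encodes data.

```python
import struct

# Hypothetical sample data: 1000 floating-point numbers.
values = [i / 7 for i in range(1000)]

# Textual form: readable in any editor, but the reader has to split the
# bytes into tokens and parse each token as a number.
as_text = " ".join(repr(v) for v in values).encode("ascii")
parsed_text = [float(token) for token in as_text.split()]

# Binary form: a fixed 8 bytes per value, decoded with one unpack call,
# but meaningless when opened in a text editor.
as_binary = struct.pack("<1000d", *values)
parsed_binary = list(struct.unpack("<1000d", as_binary))

print("text bytes:  ", len(as_text))
print("binary bytes:", len(as_binary))
```

 <p>
 Both forms round-trip to the same values; the difference lies in the size
 and in how much work the reader has to do.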
 <p>
 Adobe was probably aware of this problem from early on, and
 tried to fix it over the years by gradually dropping textual
 representations and leaning more toward binary ones.  For example,
 PDF specification 1.5 introduced a new notation called
 "object stream" (which is different from the "stream object" that
 was already in the specification).
 However, by that time there were already tons of PDFs
 produced under the original standard, which every PDF
 viewer still has to support.
 <h2>Problem of Text Extraction from PDF Documents</h2>
 <p>
 Many people tend to think that a PDF document is somewhat similar
 to a Word or HTML document, which is not true. In fact, the primary
 focus of PDF is printing and showing on a computer display, so
 it is extremely good at describing the exact "look"
 of typography, pictures, and graphics. All the text in a PDF document is
 just a bunch of string objects floating at various locations on a
 blank slate. There is no text flow control and no contextual clue
 about its content, except in a few special "tagged" PDF documents with
 extra annotations that denote headlines or page boundaries, which
 require specialized tools to create.
 <p>
 (OpenOffice, for example, has the ability to create tagged PDF
 documents.  But the degree of annotation varies depending
 on the implementation, and in many cases it is not possible to
 obtain the full layout information by using tags alone.)
 <p>
 Apart from tagged documents, PDF doesn't care about the order in which
 text strings are rendered on a page.  You can completely jumble up every
 string in a PDF and still make it look like a
 perfect document on the surface.  Even worse, PDF allows a word to
 be split in the middle and drawn as multiple unrelated strings in
 order to achieve precise text positioning.  For example, a
 certain word processor creates PDFs that split the word
 "You" into two separate strings, "Y" and "ou", because of the subtle
 kerning between the letters.
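 <p>
 To make the consequence concrete, here is a minimal sketch of how a text
 extractor might put such strings back together using only their positions.
 The <code>Fragment</code> class, the tolerance values, and the sample
 coordinates are all hypothetical, and this is a simplified illustration
 rather than the algorithm any particular tool uses.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A string drawn on the page (hypothetical structure for illustration)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; PDF's y axis grows upward


def rebuild_lines(fragments, line_tol=2.0, gap_tol=1.0):
    """Group fragments into lines by baseline, then join them left to right.

    A space is inserted only when the horizontal gap between two neighbours
    exceeds `gap_tol`, so "Y" + "ou" becomes "You" again.
    """
    # Top of the page first (largest y), then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x0))
    lines = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x0)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            if cur.x0 - prev.x1 > gap_tol:
                text += " "
            text += cur.text
        result.append(text)
    return result


# Fragments in the jumbled order a PDF may store them.
fragments = [
    Fragment("ou", 16.5, 28.0, 700.0),
    Fragment("world", 10.0, 40.0, 686.0),
    Fragment("are", 32.0, 46.0, 700.0),
    Fragment("Y", 10.0, 16.8, 700.0),  # kerned: slightly overlaps "ou"
    Fragment("here", 50.0, 68.0, 700.0),
]
print(rebuild_lines(fragments))  # ['You are here', 'world']
```

 <p>
 Real documents add columns, rotated text, and varying font sizes, so fixed
 tolerances like these are only a starting point.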
 <p>
 So extracting text properly from PDF files is a huge problem.
 It requires almost the same kind of analysis
 as optical character recognition (OCR).
 <hr noshade>
 <address>Yusuke Shinyama</address>
 </body>