remove obsolete documents

git-svn-id: https://pdfminer.googlecode.com/svn/trunk/pdfminer@258 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 09:22:57 +00:00 · 2010-10-17 09:22:57 +00:00 · 6d64586502
parent 4f4f03fb2d
commit 6d64586502
1 changed files with 0 additions and 121 deletions
--- a/docs/miningpdf.html
+++ b/docs/miningpdf.html
@@ -1,121 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
-<html>
-<head>
-<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
-<title>Mining PDF files</title>
-<style type="text/css"><!--
-blockquote { background: #eeeeee; }
--></style>
-</head><body>
-
-<h1>Mining PDF files</h1>
-<p>
-
-<p>
-<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
-
-<div align=right class=lastmod>
-<!-- hhmts start -->
-Last Modified: Sat Nov 14 21:09:01 JST 2009
-<!-- hhmts end -->
-</div>
-
-<h2>What is PDF?</h2>
-<p>
-<h3>What PDF is ...</h3>
-<ul>
-<li> A weird mixture of text and binary data. (Yikes!)
-<li> Generated sequentially, but needs random access to read.
-</ul>
-
-<h3>What PDF is not ...</h3>
-<ul>
-<li> An editable document format (like Word or HTML).
-<li> Nice from an accessibility point of view.
-</ul>
-
-<h2>Structure of PDF</h2>
-<p>
-From a data-structure point of view, PDF is a total mess in the
-history of computing.  Originally, Adobe had a document format called
-PostScript (which is also more of a "graphics" format than a
-text format). It has a nice graphic representation and is able to
-express commercial-quality typesetting. However, it has to target
-a specific printer, and its file size tends to bloat because
-almost everything is represented as text. PDF is Adobe's attempt
-to create a less printer-dependent format with a reduced data size
-(that's why it was named "portable" document format). To some
-degree, PDF can be seen as a "compressed" version of PostScript
-with seekable index tables.  Since its drawing model and concepts
-(coordinates, color spaces, etc.) remain pretty much the same
-as its predecessor's, Adobe decided to partially reuse the original
-PostScript notation in PDF. However, this eclectic approach ended
-in disaster.
-
-<h3>Format Disaster</h3>
-<p>
-When designing a data format, there are two different strategies:
-using text or using binary. Both have obvious merits and
-demerits.  The biggest merit of a textual representation is
-that it is human readable and can be modified with any text
-editor. One demerit of a textual representation is its bloated size,
-especially if you want to include something like pictures or
-multimedia data such as audio or video. Another demerit of a textual
-representation is that you need a program to serialize/deserialize
-(parse) the data, which can be very complex and buggy. On the
-other hand, a binary representation normally doesn't require a
-complex parser and takes much less space than text. However,
-it is not readable by humans.  Now, Adobe decided to take the
-good parts of both worlds by making PDF a partially text and
-partially binary format, and as a result, PDF inherits the
-drawbacks of both worlds without having much of their merits, i.e.
-PDF is a human *unreadable* document format that still requires a
-complex and error-prone parser and has a bloated file size.
-<p>
-Adobe has probably been aware of this problem from early on, and
-has tried to fix it over the years, gradually dropping text
-representations and leaning more toward binary.  For example,
-PDF specification 1.5 introduced a new construct called an
-"object stream" (which is different from the "stream object" that
-was already in the specification).
-
-However, by that time there were already tons of PDFs
-produced under the original standard, which every PDF
-viewer is still required to support.
-
-<h2>Problem of Text Extraction from PDF Documents</h2>
-<p>
-Many people tend to think that a PDF document is somewhat similar
-to a Word or HTML document, which is not true. In fact, the primary
-focus of PDF is printing and on-screen display, so
-it is extremely versatile at describing the "look" of
-typography, pictures and graphics. All the text in a PDF document is
-just a bunch of string objects floating at various locations on a
-blank slate. There is no text flow control and no contextual clue
-about its content, except in a few special "tagged" PDF documents with
-extra annotations that denote headlines or page boundaries, which
-require specialized tools to create.
-<p>
-(OpenOffice, for example, has the ability to create tagged PDF
-documents.  But the degree of annotation varies depending
-on the implementation, and in many cases it is not possible to
-obtain the full layout information using tags alone.)
-<p>
-Apart from tagged documents, PDF doesn't care about the order in which
-text strings are rendered on a page.  You can completely jumble up every
-string in a PDF and still make it look like a
-perfect document on the surface.  Even worse, PDF allows a word to
-be split in the middle and drawn as multiple unrelated strings in
-order to represent precise text positioning.  For example, a
-certain word processor creates PDFs that split the word
-"You" into two separate strings "Y" and "ou" because of the subtle
-kerning between the letters.
-<p>
-So there's a huge problem associated with extracting text properly
-from PDF files. It requires almost the same kind of analysis
-as optical character recognition (OCR).
-
-
-<hr noshade>
-<address>Yusuke Shinyama</address>
-</body>
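
The removed document's closing point, that text extraction has to recover reading order and word boundaries from positions alone, can be illustrated with a small sketch. It is not pdfminer's API: the `Fragment` type, the `rebuild_lines` function, and both tolerance values are hypothetical, and it assumes the strings and their bounding boxes have already been pulled out of the page. Fragments are grouped into lines by baseline, sorted left to right, and glued together when the gap between them is small enough to be kerning rather than a space.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A positioned string as a page might draw it (hypothetical type)."""
    x: float      # left edge
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # horizontal extent of the drawn string
    text: str


def rebuild_lines(fragments: List[Fragment],
                  line_tol: float = 2.0,
                  gap_tol: float = 1.0) -> List[str]:
    """Group fragments into lines by baseline, then join them left to right.

    Fragments separated by at most gap_tol are glued together, so
    "Y" + "ou" becomes "You"; a larger gap becomes a single space.
    Both tolerances are illustrative guesses, not tuned values.
    """
    # Visit fragments from the top of the page down, ignoring drawing order.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Same line if the baseline is close to the current line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        out = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            out += cur.text if gap <= gap_tol else " " + cur.text
        result.append(out)
    return result


if __name__ == "__main__":
    # Strings deliberately listed out of reading order, with "You" split in two.
    jumbled = [
        Fragment(30.0, 700.0, 20.0, "are"),
        Fragment(16.2, 700.0, 11.0, "ou"),
        Fragment(10.0, 680.0, 25.0, "here"),
        Fragment(10.0, 700.0, 6.5, "Y"),
    ]
    print(rebuild_lines(jumbled))
```

Real pages add multiple columns, rotated text, and font-dependent space widths, so fixed tolerances like these are only a starting point.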