Add a bit of documentation.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@159 1aa58f4a-7d42-0410-adbc-911cccaed67c
parent
0298e26acc
commit
2af8eeb3e7
10
Makefile
|
@ -30,15 +30,9 @@ commit: clean
|
||||||
check:
|
check:
|
||||||
cd $(PACKAGE) && make check
|
cd $(PACKAGE) && make check
|
||||||
|
|
||||||
sdist: clean
|
|
||||||
$(PYTHON) setup.py sdist
|
|
||||||
|
|
||||||
register: clean
|
register: clean
|
||||||
$(PYTHON) setup.py sdist upload register
|
$(PYTHON) setup.py sdist upload register
|
||||||
|
|
||||||
VERSION=`$(PYTHON) $(PACKAGE)/__init__.py`
|
|
||||||
DISTFILE=$(PACKAGE)-$(VERSION).tar.gz
|
|
||||||
WEBDIR=$$HOME/Site/unixuser.org/python/$(PACKAGE)
|
WEBDIR=$$HOME/Site/unixuser.org/python/$(PACKAGE)
|
||||||
publish: sdist
|
publish:
|
||||||
$(CP) dist/$(DISTFILE) $(WEBDIR)
|
$(CP) docs/*.html $(WEBDIR)
|
||||||
$(CP) docs/*.html $(WEBDIR)/index.html
|
|
||||||
|
|
|
@ -0,0 +1,121 @@
|
||||||
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
||||||
|
<title>Mining PDF files</title>
|
||||||
|
<style type="text/css"><!--
|
||||||
|
blockquote { background: #eeeeee; }
|
||||||
|
--></style>
|
||||||
|
</head><body>
|
||||||
|
|
||||||
|
<h1>Mining PDF files</h1>
|
||||||
|
<p>
|
||||||
|
|
||||||
|
<p>
|
||||||
|
<a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">Homepage</a>
|
||||||
|
|
||||||
|
<div align=right class=lastmod>
|
||||||
|
<!-- hhmts start -->
|
||||||
|
Last Modified: Sat Nov 14 21:09:01 JST 2009
|
||||||
|
<!-- hhmts end -->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<h2>What is PDF?</h2>
|
||||||
|
<p>
|
||||||
|
<h3>What PDF is ...</h3>
|
||||||
|
<ul>
|
||||||
|
<li> A weird mixture of texts and binaries. (Yikes!)
|
||||||
|
<li> Generated sequentially, but needs random access to read.
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<h3>What PDF is not ...</h3>
|
||||||
|
<ul>
|
||||||
|
<li> Editable document format (like Word or HTML).
|
||||||
|
<li> Nice from an accessibility point of view.
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<h2>Structure of PDF</h2>
|
||||||
|
<p>
|
||||||
|
From a data structure's point of view, PDF is a total mess in the
|
||||||
|
history of computing. Originally, Adobe had a document format called
|
||||||
|
PostScript (which is itself more of a "graphics" format than a
|
||||||
|
text format). It has nice graphic representation and is able to
|
||||||
|
express commercial quality typesetting. However, it has to be for
|
||||||
|
a specific printer and its file size tends to get bloated because
|
||||||
|
almost everything is represented as text. PDF is Adobe's attempt
|
||||||
|
to create a less printer dependent format with a reduced data size
|
||||||
|
(that's why it was named "portable" document format). To some
|
||||||
|
degree, PDF can be seen as a "compressed" version of PostScript
|
||||||
|
with seekable index tables. Since its drawing model and concepts
|
||||||
|
(coordinates, color spaces, etc.) remain pretty much the same
|
||||||
|
as its predecessor, Adobe decided to reuse the original PostScript
|
||||||
|
notation partially in PDF. However, this eclectic position ended
|
||||||
|
up with a disastrous situation.
|
||||||
|
|
||||||
|
<h3>Format Disaster</h3>
|
||||||
|
<p>
|
||||||
|
When designing a data format, there are two different strategies:
|
||||||
|
using text or using binary. They both have obvious merits and
|
||||||
|
demerits. The biggest merit of having textual representation is
|
||||||
|
that it is human readable and can be modified with any text
|
||||||
|
editor. One demerit of textual representation is its bloated size,
|
||||||
|
especially if you want to put something like pictures and
|
||||||
|
multimedia data like audio or video. Another demerit of textual
|
||||||
|
representation is that you need a program to serialize/deserialize
|
||||||
|
(parse) the data, which can be very complex and buggy. On the
|
||||||
|
other hand, binary representation normally doesn't require a
|
||||||
|
complex parser and takes much less space than text. However,
|
||||||
|
it is not readable by humans. Now, Adobe decided to take the
|
||||||
|
good parts from both worlds by making PDF a partially text and
|
||||||
|
partially binary format, and as a result, PDF inherits the
|
||||||
|
drawbacks of both worlds without having much of their merits, i.e.
|
||||||
|
PDF is a human *unreadable* document format that still requires a
|
||||||
|
complex and error-prone parser and has a bloated file size.
|
||||||
|
<p>
|
||||||
|
Adobe has probably been aware of this problem from early on, and
|
||||||
|
they have tried to fix it over the years, gradually dropping text
|
||||||
|
representations and leaning more toward binary ones. For example,
|
||||||
|
in PDF specification 1.5, they introduced a new notation called
|
||||||
|
"object stream" (which is different from a "stream object" that
|
||||||
|
was already there in the specification).
|
||||||
|
|
||||||
|
However, by that time there were already tons of PDFs
|
||||||
|
produced under the original standard, which every PDF
|
||||||
|
viewer is still required to support.
|
||||||
|
|
||||||
|
<h2>Problem of Text Extraction from PDF Documents</h2>
|
||||||
|
<p>
|
||||||
|
Many people tend to think that a PDF document is somewhat similar
|
||||||
|
to a Word or HTML document, which is not true. In fact, the primary
|
||||||
|
focus of PDF is printing and showing on a computer display, so
|
||||||
|
it is extremely versatile for showing the details of "looks"
|
||||||
|
of typography, pictures and graphics. All the text in a PDF document is
|
||||||
|
just a bunch of string objects floating at various locations on a
|
||||||
|
blank slate. There is no text flow control and no contextual clue
|
||||||
|
about its content, except in a few special "tagged" PDF documents with
|
||||||
|
extra annotations that denote headlines or page boundaries, which
|
||||||
|
require specialized tools to create.
|
||||||
|
<p>
|
||||||
|
(OpenOffice, for example, has the ability to create tagged PDF
|
||||||
|
documents. But the degree of annotation varies depending
|
||||||
|
on its implementation, and in many cases it is not possible to
|
||||||
|
obtain the full layout information by only using tags.)
|
||||||
|
<p>
|
||||||
|
Apart from tagged documents, PDF doesn't care about the order of text
|
||||||
|
strings rendered in a page. You can completely jumble up every
|
||||||
|
string in a PDF and still make it look like a
|
||||||
|
perfect document on the surface. Even worse, PDF allows a word to
|
||||||
|
be split in the middle and drawn as multiple unrelated strings in
|
||||||
|
order to represent precise text positioning. For example, a
|
||||||
|
certain word processing software creates a PDF that splits the word
|
||||||
|
"You" into two separate strings "Y" and "ou" because of the subtle
|
||||||
|
kerning between the letters.
|
||||||
|
<p>
|
||||||
|
So there's a huge problem associated with extracting text properly
|
||||||
|
from PDF files. Doing so requires almost the same kind of analysis
|
||||||
|
as optical character recognition (OCR).
|
||||||
|
|
||||||
|
|
||||||
|
<hr noshade>
|
||||||
|
<address>Yusuke Shinyama</address>
|
||||||
|
</body>
|
|
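The documentation added above describes a PDF page as a set of strings placed at coordinates, in no particular order and sometimes split mid-word (the "Y" + "ou" case). The following is a minimal sketch of how such fragments might be put back into reading order. It is written for Python 3, unlike the Python 2 code in this commit, and the Fragment type, the two tolerances, and join_fragments are hypothetical names chosen for illustration; none of them are part of pdfminer, and real layout analysis needs far more than this.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A drawn string with its horizontal extent and baseline (hypothetical type)."""
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; the PDF y axis grows upward
    text: str


def join_fragments(frags: List[Fragment],
                   line_tol: float = 2.0,
                   gap_tol: float = 1.0) -> List[str]:
    """Rebuild text lines from fragments given in arbitrary drawing order."""
    lines: List[List[Fragment]] = []
    # Walk top to bottom; a fragment joins the current line when its
    # baseline is within line_tol of that line's first fragment.
    for frag in sorted(frags, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x0)  # left to right within a line
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            # A gap wider than gap_tol is treated as a word break.
            if cur.x0 - prev.x1 > gap_tol:
                text += " "
            text += cur.text
        result.append(text)
    return result


# Fragments in jumbled drawing order, with "You" split by kerning.
frags = [
    Fragment(18.0, 30.0, 100.0, "ou"),
    Fragment(34.0, 50.0, 100.3, "win"),
    Fragment(10.0, 40.0, 80.0, "next"),
    Fragment(10.0, 17.5, 100.0, "Y"),
]
print(join_fragments(frags))
```

The tolerances are guesses; in practice they would have to scale with font size, which is part of why the documentation compares the task to OCR.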
@ -15,7 +15,6 @@ import sys
|
||||||
import re
|
import re
|
||||||
import os
|
import os
|
||||||
import os.path
|
import os.path
|
||||||
from sys import stderr
|
|
||||||
from struct import pack, unpack
|
from struct import pack, unpack
|
||||||
from psparser import PSStackParser
|
from psparser import PSStackParser
|
||||||
from psparser import PSException, PSSyntaxError, PSTypeError, PSEOF
|
from psparser import PSException, PSSyntaxError, PSTypeError, PSEOF
|
||||||
|
@ -24,8 +23,7 @@ from psparser import literal_name, keyword_name
|
||||||
from fontmetrics import FONT_METRICS
|
from fontmetrics import FONT_METRICS
|
||||||
from latin_enc import ENCODING
|
from latin_enc import ENCODING
|
||||||
from glyphlist import charname2unicode
|
from glyphlist import charname2unicode
|
||||||
from utils import choplist
|
from utils import choplist, nunpack
|
||||||
from utils import nunpack
|
|
||||||
try:
|
try:
|
||||||
import cdb
|
import cdb
|
||||||
except ImportError:
|
except ImportError:
|
||||||
|
@ -38,16 +36,19 @@ class CMapError(Exception): pass
|
||||||
## find_cmap_path
|
## find_cmap_path
|
||||||
##
|
##
|
||||||
def find_cmap_path():
|
def find_cmap_path():
|
||||||
try:
|
"""Returns the location of CMap directory."""
|
||||||
return os.environ['CMAP_PATH']
|
for path in (os.environ['CMAP_PATH'],
|
||||||
except KeyError:
|
os.path.join(os.path.dirname(__file__), 'CMap')):
|
||||||
pass
|
if os.path.isdir(path):
|
||||||
basedir = os.path.dirname(__file__)
|
return path
|
||||||
return os.path.join(basedir, 'CMap')
|
raise IOError
|
||||||
|
|
||||||
|
|
||||||
|
## name2unicode
|
||||||
|
##
|
||||||
STRIP_NAME = re.compile(r'[0-9]+')
|
STRIP_NAME = re.compile(r'[0-9]+')
|
||||||
def name2unicode(name):
|
def name2unicode(name):
|
||||||
|
"""Converts Adobe glyph names to Unicode numbers."""
|
||||||
if name in charname2unicode:
|
if name in charname2unicode:
|
||||||
return charname2unicode[name]
|
return charname2unicode[name]
|
||||||
m = STRIP_NAME.search(name)
|
m = STRIP_NAME.search(name)
|
||||||
|
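A note on the new find_cmap_path in the hunk above: the tuple of candidate paths is built before the loop starts, so os.environ['CMAP_PATH'] raises KeyError when the variable is unset, and the bundled CMap directory is never tried. One possible variant is sketched below for Python 3. The name find_cmap_dir and its default_dir and environ parameters are hypothetical and are not part of pdfminer; the environ parameter exists only to make the lookup easy to exercise.

```python
import os


def find_cmap_dir(default_dir, environ=None):
    """Return the first existing CMap directory, or raise IOError."""
    if environ is None:
        environ = os.environ
    candidates = []
    # Only consider CMAP_PATH when it is set and non-empty.
    env_path = environ.get('CMAP_PATH')
    if env_path:
        candidates.append(env_path)
    # Fall back to the directory shipped next to the module.
    candidates.append(default_dir)
    for path in candidates:
        if os.path.isdir(path):
            return path
    raise IOError('cannot find CMap directory')
```

A caller mirroring the commit would pass os.path.join(os.path.dirname(__file__), 'CMap') as default_dir and keep the existing except IOError handling in main.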
@ -97,7 +98,7 @@ class CMap(object):
|
||||||
|
|
||||||
def decode(self, bytes):
|
def decode(self, bytes):
|
||||||
if self.debug:
|
if self.debug:
|
||||||
print >>stderr, 'decode: %r, %r' % (self, bytes)
|
print >>sys.stderr, 'decode: %r, %r' % (self, bytes)
|
||||||
x = ''
|
x = ''
|
||||||
for c in bytes:
|
for c in bytes:
|
||||||
if x:
|
if x:
|
||||||
|
@ -179,7 +180,7 @@ class CDBCMap(CMap):
|
||||||
|
|
||||||
def decode(self, bytes):
|
def decode(self, bytes):
|
||||||
if self.debug:
|
if self.debug:
|
||||||
print >>stderr, 'decode: %r, %r' % (self, bytes)
|
print >>sys.stderr, 'decode: %r, %r' % (self, bytes)
|
||||||
x = ''
|
x = ''
|
||||||
for c in bytes:
|
for c in bytes:
|
||||||
if x:
|
if x:
|
||||||
|
@ -227,11 +228,11 @@ class CMapDB(object):
|
||||||
cdbname = os.path.join(self.cdbdirname, cmapname+'.cmap.cdb')
|
cdbname = os.path.join(self.cdbdirname, cmapname+'.cmap.cdb')
|
||||||
if os.path.exists(cdbname):
|
if os.path.exists(cdbname):
|
||||||
if 1 <= self.debug:
|
if 1 <= self.debug:
|
||||||
print >>stderr, 'Opening: CDBCMap %r...' % cdbname
|
print >>sys.stderr, 'Opening: CDBCMap %r...' % cdbname
|
||||||
cmap = CDBCMap(cdbname)
|
cmap = CDBCMap(cdbname)
|
||||||
elif os.path.exists(fname):
|
elif os.path.exists(fname):
|
||||||
if 1 <= self.debug:
|
if 1 <= self.debug:
|
||||||
print >>stderr, 'Reading: CMap %r...' % fname
|
print >>sys.stderr, 'Reading: CMap %r...' % fname
|
||||||
cmap = CMap()
|
cmap = CMap()
|
||||||
fp = file(fname, 'rb')
|
fp = file(fname, 'rb')
|
||||||
CMapParser(cmap, fp).run()
|
CMapParser(cmap, fp).run()
|
||||||
|
@ -423,10 +424,11 @@ class EncodingDB(object):
|
||||||
|
|
||||||
## CMap -> CMapCDB conversion
|
## CMap -> CMapCDB conversion
|
||||||
##
|
##
|
||||||
def dumpcdb(cmap, cdbfile, verbose=1):
|
def dump_cdb(cmap, cdbfile, verbose=1):
|
||||||
|
"""Writes a CMap object into a cdb file."""
|
||||||
m = cdb.cdbmake(cdbfile, cdbfile+'.tmp')
|
m = cdb.cdbmake(cdbfile, cdbfile+'.tmp')
|
||||||
if verbose:
|
if verbose:
|
||||||
print >>stderr, 'Writing: %r...' % cdbfile
|
print >>sys.stderr, 'Writing: %r...' % cdbfile
|
||||||
for (k,v) in cmap.getall_attrs():
|
for (k,v) in cmap.getall_attrs():
|
||||||
m.add('/'+k, repr(v))
|
m.add('/'+k, repr(v))
|
||||||
for (code,cid) in cmap.getall_code2cid():
|
for (code,cid) in cmap.getall_code2cid():
|
||||||
|
@ -437,44 +439,55 @@ def dumpcdb(cmap, cdbfile, verbose=1):
|
||||||
return
|
return
|
||||||
|
|
||||||
def convert_cmap(cmapdir, outputdir, force=False):
|
def convert_cmap(cmapdir, outputdir, force=False):
|
||||||
|
"""Convert all CMap source files in a directory into cdb files."""
|
||||||
CMapDB.initialize(cmapdir)
|
CMapDB.initialize(cmapdir)
|
||||||
for fname in os.listdir(cmapdir):
|
for fname in os.listdir(cmapdir):
|
||||||
if '.' in fname: continue
|
if '.' in fname: continue
|
||||||
cmapname = os.path.basename(fname)
|
cmapname = os.path.basename(fname)
|
||||||
cdbname = os.path.join(outputdir, cmapname+'.cmap.cdb')
|
cdbname = os.path.join(outputdir, cmapname+'.cmap.cdb')
|
||||||
if not force and os.path.exists(cdbname):
|
if not force and os.path.exists(cdbname):
|
||||||
print >>stderr, 'Skipping: %r' % cmapname
|
print >>sys.stderr, 'Skipping: %r' % cmapname
|
||||||
continue
|
continue
|
||||||
print >>stderr, 'Reading: %r...' % cmapname
|
print >>sys.stderr, 'Reading: %r...' % cmapname
|
||||||
cmap = CMapDB.get_cmap(cmapname)
|
cmap = CMapDB.get_cmap(cmapname)
|
||||||
dumpcdb(cmap, cdbname)
|
dump_cdb(cmap, cdbname)
|
||||||
return
|
return
|
||||||
|
|
||||||
def main(argv):
|
def main(argv):
|
||||||
|
"""Converts CMap files into cdb files.
|
||||||
|
|
||||||
|
usage: python -m pdfminer.cmap [-f] [cmap_dir [output_dir]]
|
||||||
|
"""
|
||||||
|
|
||||||
import getopt
|
import getopt
|
||||||
def usage():
|
def usage():
|
||||||
print 'usage: %s [-D outputdir] [-f] cmap_dir' % argv[0]
|
print 'usage: %s [-f] [cmap_dir [output_dir]]' % argv[0]
|
||||||
return 100
|
return 100
|
||||||
try:
|
try:
|
||||||
(opts, args) = getopt.getopt(argv[1:], 'C:D:f')
|
(opts, args) = getopt.getopt(argv[1:], 'f')
|
||||||
except getopt.GetoptError:
|
except getopt.GetoptError:
|
||||||
return usage()
|
return usage()
|
||||||
if args:
|
if args:
|
||||||
cmapdir = args.pop(0)
|
cmapdir = args.pop(0)
|
||||||
else:
|
else:
|
||||||
cmapdir = find_cmap_path()
|
try:
|
||||||
outputdir = cmapdir
|
cmapdir = find_cmap_path()
|
||||||
|
except IOError:
|
||||||
|
print >>sys.stderr, 'cannot find CMap directory'
|
||||||
|
return 1
|
||||||
|
if args:
|
||||||
|
outputdir = args.pop(0)
|
||||||
|
else:
|
||||||
|
outputdir = cmapdir
|
||||||
force = False
|
force = False
|
||||||
for (k, v) in opts:
|
for (k, v) in opts:
|
||||||
if k == '-f': force = True
|
if k == '-f': force = True
|
||||||
elif k == '-C': cmapdir = v
|
|
||||||
elif k == '-D': outputdir = v
|
|
||||||
if not os.path.isdir(cmapdir):
|
if not os.path.isdir(cmapdir):
|
||||||
print >>stderr, 'directory does not exist: %r' % cmapdir
|
print >>sys.stderr, 'directory does not exist: %r' % cmapdir
|
||||||
return 111
|
return 1
|
||||||
if not os.path.isdir(outputdir):
|
if not os.path.isdir(outputdir):
|
||||||
print >>stderr, 'directory does not exist: %r' % outputdir
|
print >>sys.stderr, 'directory does not exist: %r' % outputdir
|
||||||
return 111
|
return 1
|
||||||
return convert_cmap(cmapdir, outputdir, force=force)
|
return convert_cmap(cmapdir, outputdir, force=force)
|
||||||
|
|
||||||
if __name__ == '__main__': sys.exit(main(sys.argv))
|
if __name__ == '__main__': sys.exit(main(sys.argv))
|
||||||
|
|
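The new command line, [-f] [cmap_dir [output_dir]], replaces the -C and -D options with positional arguments, and output_dir defaults to cmap_dir. The sketch below isolates that defaulting logic in Python 3. parse_args and default_cmapdir are hypothetical names used for illustration; the commit itself keeps this logic inline in main and obtains the default from find_cmap_path.

```python
import getopt


def parse_args(argv, default_cmapdir):
    """Return (force, cmapdir, outputdir) for '[-f] [cmap_dir [output_dir]]'."""
    # 'f' declares a single flag that takes no value.
    opts, args = getopt.getopt(argv[1:], 'f')
    force = any(k == '-f' for (k, _) in opts)
    # The first positional argument, if any, is the CMap source directory.
    cmapdir = args.pop(0) if args else default_cmapdir
    # The second positional argument, if any, is the output directory;
    # otherwise cdb files are written next to the sources.
    outputdir = args.pop(0) if args else cmapdir
    return force, cmapdir, outputdir


print(parse_args(['cmap.py', '-f', 'CMap', 'out'], 'default'))
print(parse_args(['cmap.py'], 'default'))
```

As in the commit, an unknown option makes getopt.getopt raise getopt.GetoptError, which the caller is expected to turn into a usage message.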