Added support for Paeth PNG filter compression (predictor value = 4) (#537)

* Added support for Paeth PNG filter compression (predictor value = 4)

* Use `above` and `upper_left` as in the pseudo code

* Refactor: use variable names that are very close to the pseudo code and add pieces of the docs to show what is going on.

* Fix line length issues

* Add line about compressions to README.md

* Fix merge conflict on readme

* Fix bug in filter type Up

* Make if-else consistent

Co-authored-by: Eduardo Gonzalez Lopez de Murillas <eduardo.gonzalez@accha.nl>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
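
The Up filter mentioned in the commit message adds each encoded byte to the byte directly above it in the previous decoded scanline, modulo 256. A minimal sketch of that reversal, assuming every scanline is prefixed by a one-byte filter type as in the PNG specification; the helper name `undo_up_filter` and the sample bytes are hypothetical and not part of pdfminer.six:

```python
def undo_up_filter(data, nbytes):
    """Decode scanlines that all use PNG filter type 2 (Up)."""
    line_above = bytes(nbytes)  # the first row sees an all-zero prior row
    out = b''
    # Each scanline occupies nbytes + 1 bytes: filter type, then payload.
    for start in range(0, len(data), nbytes + 1):
        filter_type = data[start]
        if filter_type != 2:
            raise ValueError("this sketch only handles filter type 2 (Up)")
        line_encoded = data[start + 1:start + 1 + nbytes]
        # Raw(x) = Up(x) + Prior(x), computed mod 256
        raw = bytes((up_x + prior_x) & 255
                    for up_x, prior_x in zip(line_encoded, line_above))
        out += raw
        line_above = raw
    return out


# Two rows of three bytes, each prefixed by the filter-type byte 2.
encoded = bytes([2, 1, 2, 3, 2, 10, 20, 255])
decoded = undo_up_filter(encoded, nbytes=3)
```

The last byte of the second row wraps around: 255 + 3 is 258, which becomes 2 after the mod-256 step.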
Eduardo Gonzalez Lopez de Murillas 2021-08-26 20:53:13 +02:00 committed by GitHub
parent 19c1372984
commit ea00f56ac6
3 changed files with 131 additions and 59 deletions


@@ -5,6 +5,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## [Unreleased]
+### Added
+- Support for Paeth PNG filter compression (predictor value = 4) ([#537](https://github.com/pdfminer/pdfminer.six/pull/537))
 ### Fixed
 - Fix issue of TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' ([#529](https://github.com/pdfminer/pdfminer.six/pull/529))
 - `PermissionError` when creating temporary filepaths on windows when running tests ([#469](https://github.com/pdfminer/pdfminer.six/issues/469))
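
The Paeth filter named in the changelog entry predicts each byte from three neighbours: the byte to the left, the byte above, and the byte to the upper left. A minimal sketch of the selection rule, following the pseudocode in the PNG specification; the neighbour values below are invented for illustration:

```python
def paeth_predictor(left, above, upper_left):
    """Return the neighbour closest to the initial estimate p."""
    p = left + above - upper_left  # initial estimate
    pa = abs(p - left)
    pb = abs(p - above)
    pc = abs(p - upper_left)
    # Ties are broken in the order left, above, upper_left.
    if pa <= pb and pa <= pc:
        return left
    elif pb <= pc:
        return above
    return upper_left


# Hypothetical neighbour values
smooth = paeth_predictor(left=100, above=110, upper_left=105)  # p = 105
tie = paeth_predictor(left=50, above=50, upper_left=50)        # p = 50
edge = paeth_predictor(left=0, above=200, upper_left=0)        # p = 200
```

In the first call the upper-left neighbour equals the estimate exactly, so it wins; in the second all three distances are zero, so the tie-break picks `left`; in the third only `above` matches the estimate.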


@@ -7,15 +7,12 @@ pdfminer.six
 *We fathom PDF*
-Pdfminer.six is a community maintained fork of the original PDFMiner. It is a
-tool for extracting information from PDF documents. It focuses on getting
-and analyzing text data. Pdfminer.six extracts the text from a page directly
-from the sourcecode of the PDF. It can also be used to get the exact location,
-font or color of the text.
+Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF
+documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the
+sourcecode of the PDF. It can also be used to get the exact location, font or color of the text.
-It is built in a modular way such that each component of pdfminer.six can be
-replaced easily. You can implement your own interpreter or rendering device
-that uses the power of pdfminer.six for other purposes than text analysis.
+It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own
+interpreter or rendering device that uses the power of pdfminer.six for other purposes than text analysis.
 Check out the full documentation on
 [Read the Docs](https://pdfminersix.readthedocs.io).
@@ -24,31 +21,31 @@ Check out the full documentation on
 Features
 --------
-* Written entirely in Python.
-* Parse, analyze, and convert PDF documents.
-* PDF-1.7 specification support. (well, almost).
-* CJK languages and vertical writing scripts support.
-* Various font types (Type1, TrueType, Type3, and CID) support.
-* Support for extracting images (JPG, JBIG2 and Bitmaps).
-* Support for RC4 and AES encryption.
-* Support for AcroForm interactive form extraction.
-* Table of contents extraction.
-* Tagged contents extraction.
-* Automatic layout analysis.
+* Written entirely in Python.
+* Parse, analyze, and convert PDF documents.
+* PDF-1.7 specification support. (well, almost).
+* CJK languages and vertical writing scripts support.
+* Various font types (Type1, TrueType, Type3, and CID) support.
+* Support for extracting images (JPG, JBIG2, Bitmaps).
+* Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode,
+CCITTFaxDecode)
+* Support for RC4 and AES encryption.
+* Support for AcroForm interactive form extraction.
+* Table of contents extraction.
+* Tagged contents extraction.
+* Automatic layout analysis.
 How to use
 ----------
-* Install Python 3.6 or newer.
-* Install
+* Install Python 3.6 or newer.
+* Install
-`pip install pdfminer.six`
+`pip install pdfminer.six`
-* Use command-line interface to extract text from pdf:
-`python pdf2txt.py samples/simple1.pdf`
+* Use command-line interface to extract text from pdf:
+`python pdf2txt.py samples/simple1.pdf`
 Contributing
 ------------


@@ -77,44 +77,116 @@ def compatible_encode_method(bytesorstring, encoding='utf-8',
     return bytesorstring.decode(encoding, erraction)
+def paeth_predictor(left, above, upper_left):
+    # From http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
+    # Initial estimate
+    p = left + above - upper_left
+    # Distances to a,b,c
+    pa = abs(p - left)
+    pb = abs(p - above)
+    pc = abs(p - upper_left)
+    # Return nearest of a,b,c breaking ties in order a,b,c
+    if pa <= pb and pa <= pc:
+        return left
+    elif pb <= pc:
+        return above
+    else:
+        return upper_left
 def apply_png_predictor(pred, colors, columns, bitspercomponent, data):
+    """Reverse the effect of the PNG predictor
+    Documentation: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
+    """
     if bitspercomponent != 8:
         # unsupported
-        raise ValueError("Unsupported `bitspercomponent': %d" %
-                         bitspercomponent)
+        msg = "Unsupported `bitspercomponent': %d" % bitspercomponent
+        raise ValueError(msg)
     nbytes = colors * columns * bitspercomponent // 8
+    bpp = colors * bitspercomponent // 8  # number of bytes per complete pixel
     buf = b''
-    line0 = b'\x00' * columns
-    for i in range(0, len(data), nbytes + 1):
-        ft = data[i]
-        i += 1
-        line1 = data[i:i + nbytes]
-        line2 = b''
-        if ft == 0:
-            # PNG none
-            line2 += line1
-        elif ft == 1:
-            # PNG sub (UNTESTED)
-            c = 0
-            for b in line1:
-                c = (c + b) & 255
-                line2 += bytes((c,))
-        elif ft == 2:
-            # PNG up
-            for (a, b) in zip(line0, line1):
-                c = (a + b) & 255
-                line2 += bytes((c,))
-        elif ft == 3:
-            # PNG average (UNTESTED)
-            c = 0
-            for (a, b) in zip(line0, line1):
-                c = ((c + a + b) // 2) & 255
-                line2 += bytes((c,))
+    line_above = b'\x00' * columns
+    for scanline_i in range(0, len(data), nbytes + 1):
+        filter_type = data[scanline_i]
+        line_encoded = data[scanline_i + 1:scanline_i + 1 + nbytes]
+        raw = b''
+        if filter_type == 0:
+            # Filter type 0: None
+            raw += line_encoded
+        elif filter_type == 1:
+            # Filter type 1: Sub
+            # To reverse the effect of the Sub() filter after decompression,
+            # output the following value:
+            # Raw(x) = Sub(x) + Raw(x - bpp)
+            # (computed mod 256), where Raw() refers to the bytes already
+            # decoded.
+            for j, sub_x in enumerate(line_encoded):
+                if j - bpp < 0:
+                    raw_x_bpp = 0
+                else:
+                    raw_x_bpp = int(raw[j - bpp])
+                raw_x = (sub_x + raw_x_bpp) & 255
+                raw += bytes((raw_x,))
+        elif filter_type == 2:
+            # Filter type 2: Up
+            # To reverse the effect of the Up() filter after decompression,
+            # output the following value:
+            # Raw(x) = Up(x) + Prior(x)
+            # (computed mod 256), where Prior() refers to the decoded bytes of
+            # the prior scanline.
+            for (up_x, prior_x) in zip(line_encoded, line_above):
+                raw_x = (up_x + prior_x) & 255
+                raw += bytes((raw_x,))
+        elif filter_type == 3:
+            # Filter type 3: Average
+            # To reverse the effect of the Average() filter after
+            # decompression, output the following value:
+            # Raw(x) = Average(x) + floor((Raw(x-bpp)+Prior(x))/2)
+            # where the result is computed mod 256, but the prediction is
+            # calculated in the same way as for encoding. Raw() refers to the
+            # bytes already decoded, and Prior() refers to the decoded bytes of
+            # the prior scanline.
+            for j, average_x in enumerate(line_encoded):
+                if j - bpp < 0:
+                    raw_x_bpp = 0
+                else:
+                    raw_x_bpp = int(raw[j - bpp])
+                prior_x = int(line_above[j])
+                raw_x = (average_x + (raw_x_bpp + prior_x) // 2) & 255
+                raw += bytes((raw_x,))
+        elif filter_type == 4:
+            # Filter type 4: Paeth
+            # To reverse the effect of the Paeth() filter after decompression,
+            # output the following value:
+            # Raw(x) = Paeth(x)
+            # + PaethPredictor(Raw(x-bpp), Prior(x), Prior(x-bpp))
+            # (computed mod 256), where Raw() and Prior() refer to bytes
+            # already decoded. Exactly the same PaethPredictor() function is
+            # used by both encoder and decoder.
+            for j, paeth_x in enumerate(line_encoded):
+                if j - bpp < 0:
+                    raw_x_bpp = 0
+                    prior_x_bpp = 0
+                else:
+                    raw_x_bpp = int(raw[j - bpp])
+                    prior_x_bpp = int(line_above[j - bpp])
+                prior_x = int(line_above[j])
+                paeth = paeth_predictor(raw_x_bpp, prior_x, prior_x_bpp)
+                raw_x = (paeth_x + paeth) & 255
+                raw += bytes((raw_x,))
         else:
             # unsupported
-            raise ValueError("Unsupported predictor value: %d" % ft)
-        buf += line2
-        line0 = line2
+            raise ValueError("Unsupported predictor value: %d" % filter_type)
+        buf += raw
+        line_above = raw
     return buf
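
The comments in the Paeth branch note that encoder and decoder share the same predictor, which is what makes the filter reversible. A minimal round-trip sketch under that assumption: the helpers `paeth_encode_line` and `paeth_decode_line`, the `bpp` value, and the sample scanlines are hypothetical and written only for illustration; only the predictor itself mirrors the function in the diff.

```python
def paeth_predictor(left, above, upper_left):
    p = left + above - upper_left
    pa, pb, pc = abs(p - left), abs(p - above), abs(p - upper_left)
    if pa <= pb and pa <= pc:
        return left
    elif pb <= pc:
        return above
    return upper_left


def paeth_encode_line(raw, prior, bpp):
    """Encoder side: Paeth(x) = Raw(x) - PaethPredictor(...), mod 256."""
    out = bytearray()
    for j, raw_x in enumerate(raw):
        # Bytes left of the first pixel are treated as zero.
        left = raw[j - bpp] if j >= bpp else 0
        upper_left = prior[j - bpp] if j >= bpp else 0
        out.append((raw_x - paeth_predictor(left, prior[j], upper_left)) & 255)
    return bytes(out)


def paeth_decode_line(encoded, prior, bpp):
    """Decoder side: Raw(x) = Paeth(x) + PaethPredictor(...), mod 256."""
    raw = bytearray()
    for j, paeth_x in enumerate(encoded):
        # The decoder reads neighbours from bytes it has already decoded.
        left = raw[j - bpp] if j >= bpp else 0
        upper_left = prior[j - bpp] if j >= bpp else 0
        raw.append((paeth_x + paeth_predictor(left, prior[j], upper_left)) & 255)
    return bytes(raw)


bpp = 3  # assumed: three 8-bit colour components per pixel
prior = bytes([10, 20, 30, 40, 50, 60])    # previous decoded scanline
raw = bytes([12, 22, 33, 44, 250, 5])      # scanline to encode
encoded = paeth_encode_line(raw, prior, bpp)
decoded = paeth_decode_line(encoded, prior, bpp)
```

Because each decoded byte only depends on bytes that were decoded earlier, `decoded` matches `raw` for any input of equal length to `prior`.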