Added support for Paeth PNG filter compression (predictor value = 4) (#537)

* Added support for Paeth PNG filter compression (predictor value = 4)

* Use `above` and `upper_left` as in the pseudo code

* Refactor: use variable names that are very close to the pseudo code and add pieces of the docs to show what is going on.

* Fix line length issues

* Add line about compressions to README.md

* Fix merge conflict on readme

* Fix bug in filter type Up

* Make if-else consistent

Co-authored-by: Eduardo Gonzalez Lopez de Murillas <eduardo.gonzalez@accha.nl>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
pull/593/head^2
Eduardo Gonzalez Lopez de Murillas 2021-08-26 20:53:13 +02:00 committed by GitHub
parent 19c1372984
commit ea00f56ac6
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 131 additions and 59 deletions

View File

@ -5,6 +5,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
## [Unreleased] ## [Unreleased]
### Added
- Support for Paeth PNG filter compression (predictor value = 4) ([#537](https://github.com/pdfminer/pdfminer.six/pull/537))
### Fixed ### Fixed
- Fix issue of TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' ([#529](https://github.com/pdfminer/pdfminer.six/pull/529)) - Fix issue of TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' ([#529](https://github.com/pdfminer/pdfminer.six/pull/529))
- `PermissionError` when creating temporary filepaths on windows when running tests ([#469](https://github.com/pdfminer/pdfminer.six/issues/469)) - `PermissionError` when creating temporary filepaths on windows when running tests ([#469](https://github.com/pdfminer/pdfminer.six/issues/469))

View File

@ -7,15 +7,12 @@ pdfminer.six
*We fathom PDF* *We fathom PDF*
Pdfminer.six is a community maintained fork of the original PDFMiner. It is a Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF
tool for extracting information from PDF documents. It focuses on getting documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the
and analyzing text data. Pdfminer.six extracts the text from a page directly sourcecode of the PDF. It can also be used to get the exact location, font or color of the text.
from the sourcecode of the PDF. It can also be used to get the exact location,
font or color of the text.
It is built in a modular way such that each component of pdfminer.six can be It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own
replaced easily. You can implement your own interpreter or rendering device interpreter or rendering device that uses the power of pdfminer.six for other purposes than text analysis.
that uses the power of pdfminer.six for other purposes than text analysis.
Check out the full documentation on Check out the full documentation on
[Read the Docs](https://pdfminersix.readthedocs.io). [Read the Docs](https://pdfminersix.readthedocs.io).
@ -29,14 +26,15 @@ Features
* PDF-1.7 specification support. (well, almost). * PDF-1.7 specification support. (well, almost).
* CJK languages and vertical writing scripts support. * CJK languages and vertical writing scripts support.
* Various font types (Type1, TrueType, Type3, and CID) support. * Various font types (Type1, TrueType, Type3, and CID) support.
* Support for extracting images (JPG, JBIG2 and Bitmaps). * Support for extracting images (JPG, JBIG2, Bitmaps).
* Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode,
CCITTFaxDecode)
* Support for RC4 and AES encryption. * Support for RC4 and AES encryption.
* Support for AcroForm interactive form extraction. * Support for AcroForm interactive form extraction.
* Table of contents extraction. * Table of contents extraction.
* Tagged contents extraction. * Tagged contents extraction.
* Automatic layout analysis. * Automatic layout analysis.
How to use How to use
---------- ----------
@ -49,7 +47,6 @@ How to use
`python pdf2txt.py samples/simple1.pdf` `python pdf2txt.py samples/simple1.pdf`
Contributing Contributing
------------ ------------

View File

@ -77,44 +77,116 @@ def compatible_encode_method(bytesorstring, encoding='utf-8',
return bytesorstring.decode(encoding, erraction) return bytesorstring.decode(encoding, erraction)
def apply_png_predictor(pred, colors, columns, bitspercomponent, data): def paeth_predictor(left, above, upper_left):
if bitspercomponent != 8: # From http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
# unsupported # Initial estimate
raise ValueError("Unsupported `bitspercomponent': %d" % p = left + above - upper_left
bitspercomponent) # Distances to a,b,c
nbytes = colors * columns * bitspercomponent // 8 pa = abs(p - left)
buf = b'' pb = abs(p - above)
line0 = b'\x00' * columns pc = abs(p - upper_left)
for i in range(0, len(data), nbytes + 1):
ft = data[i] # Return nearest of a,b,c breaking ties in order a,b,c
i += 1 if pa <= pb and pa <= pc:
line1 = data[i:i + nbytes] return left
line2 = b'' elif pb <= pc:
if ft == 0: return above
# PNG none
line2 += line1
elif ft == 1:
# PNG sub (UNTESTED)
c = 0
for b in line1:
c = (c + b) & 255
line2 += bytes((c,))
elif ft == 2:
# PNG up
for (a, b) in zip(line0, line1):
c = (a + b) & 255
line2 += bytes((c,))
elif ft == 3:
# PNG average (UNTESTED)
c = 0
for (a, b) in zip(line0, line1):
c = ((c + a + b) // 2) & 255
line2 += bytes((c,))
else: else:
# unsupported return upper_left
raise ValueError("Unsupported predictor value: %d" % ft)
buf += line2
line0 = line2 def apply_png_predictor(pred, colors, columns, bitspercomponent, data):
"""Reverse the effect of the PNG predictor
Documentation: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
"""
if bitspercomponent != 8:
msg = "Unsupported `bitspercomponent': %d" % bitspercomponent
raise ValueError(msg)
nbytes = colors * columns * bitspercomponent // 8
bpp = colors * bitspercomponent // 8 # number of bytes per complete pixel
buf = b''
line_above = b'\x00' * columns
for scanline_i in range(0, len(data), nbytes + 1):
filter_type = data[scanline_i]
line_encoded = data[scanline_i + 1:scanline_i + 1 + nbytes]
raw = b''
if filter_type == 0:
# Filter type 0: None
raw += line_encoded
elif filter_type == 1:
# Filter type 1: Sub
# To reverse the effect of the Sub() filter after decompression,
# output the following value:
# Raw(x) = Sub(x) + Raw(x - bpp)
# (computed mod 256), where Raw() refers to the bytes already
# decoded.
for j, sub_x in enumerate(line_encoded):
if j - bpp < 0:
raw_x_bpp = 0
else:
raw_x_bpp = int(raw[j - bpp])
raw_x = (sub_x + raw_x_bpp) & 255
raw += bytes((raw_x,))
elif filter_type == 2:
# Filter type 2: Up
# To reverse the effect of the Up() filter after decompression,
# output the following value:
# Raw(x) = Up(x) + Prior(x)
# (computed mod 256), where Prior() refers to the decoded bytes of
# the prior scanline.
for (up_x, prior_x) in zip(line_encoded, line_above):
raw_x = (up_x + prior_x) & 255
raw += bytes((raw_x,))
elif filter_type == 3:
# Filter type 3: Average
# To reverse the effect of the Average() filter after
# decompression, output the following value:
# Raw(x) = Average(x) + floor((Raw(x-bpp)+Prior(x))/2)
# where the result is computed mod 256, but the prediction is
# calculated in the same way as for encoding. Raw() refers to the
# bytes already decoded, and Prior() refers to the decoded bytes of
# the prior scanline.
for j, average_x in enumerate(line_encoded):
if j - bpp < 0:
raw_x_bpp = 0
else:
raw_x_bpp = int(raw[j - bpp])
prior_x = int(line_above[j])
raw_x = (average_x + (raw_x_bpp + prior_x) // 2) & 255
raw += bytes((raw_x,))
elif filter_type == 4:
# Filter type 4: Paeth
# To reverse the effect of the Paeth() filter after decompression,
# output the following value:
# Raw(x) = Paeth(x)
# + PaethPredictor(Raw(x-bpp), Prior(x), Prior(x-bpp))
# (computed mod 256), where Raw() and Prior() refer to bytes
# already decoded. Exactly the same PaethPredictor() function is
# used by both encoder and decoder.
for j, paeth_x in enumerate(line_encoded):
if j - bpp < 0:
raw_x_bpp = 0
prior_x_bpp = 0
else:
raw_x_bpp = int(raw[j - bpp])
prior_x_bpp = int(line_above[j - bpp])
prior_x = int(line_above[j])
paeth = paeth_predictor(raw_x_bpp, prior_x, prior_x_bpp)
raw_x = (paeth_x + paeth) & 255
raw += bytes((raw_x,))
else:
raise ValueError("Unsupported predictor value: %d" % filter_type)
buf += raw
line_above = raw
return buf return buf