Added support for Paeth PNG filter compression (predictor value = 4) (#537)

* Added support for Paeth PNG filter compression (predictor value = 4)
* Use `above` and `upper_left` as in the pseudo code
* Refactor: use variable names that are very close to the pseudo code and add pieces of the docs to show what is going on.
* Fix line length issues
* Add line about compressions to README.md
* Fix merge conflict on readme
* Fix bug in filter type Up
* Make if-else consistent

Co-authored-by: Eduardo Gonzalez Lopez de Murillas <eduardo.gonzalez@accha.nl>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 20:53:13 +02:00 · ea00f56ac6
parent 19c1372984
commit ea00f56ac6
3 changed files with 131 additions and 59 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## [Unreleased]
+### Added
+- Support for Paeth PNG filter compression (predictor value = 4) ([#537](https://github.com/pdfminer/pdfminer.six/pull/537))
 ### Fixed
 - Fix issue of TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' ([#529](https://github.com/pdfminer/pdfminer.six/pull/529))
 - `PermissionError` when creating temporary filepaths on windows when running tests ([#469](https://github.com/pdfminer/pdfminer.six/issues/469))
--- a/README.md
+++ b/README.md
@@ -7,15 +7,12 @@ pdfminer.six
 *We fathom PDF*
-Pdfminer.six is a community maintained fork of the original PDFMiner. It is a
-tool for extracting information from PDF documents. It focuses on getting
-and analyzing text data. Pdfminer.six extracts the text from a page directly
-from the sourcecode of the PDF. It can also be used to get the exact location,
-font or color of the text.
+Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF
+documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the
+sourcecode of the PDF. It can also be used to get the exact location, font or color of the text.
-It is built in a modular way such that each component of pdfminer.six can be
-replaced easily. You can implement your own interpreter or rendering device
-that uses the power of pdfminer.six for other purposes than text analysis.
+It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own
+interpreter or rendering device that uses the power of pdfminer.six for other purposes than text analysis.
 Check out the full documentation on
 [Read the Docs](https://pdfminersix.readthedocs.io).
@@ -24,32 +21,32 @@ Check out the full documentation on
 Features
 --------
- * Written entirely in Python.
- * Parse, analyze, and convert PDF documents.
- * PDF-1.7 specification support. (well, almost).
- * CJK languages and vertical writing scripts support.
- * Various font types (Type1, TrueType, Type3, and CID) support.
- * Support for extracting images (JPG, JBIG2 and Bitmaps).
- * Support for RC4 and AES encryption.
- * Support for AcroForm interactive form extraction.
- * Table of contents extraction.
- * Tagged contents extraction.
- * Automatic layout analysis.
-
+* Written entirely in Python.
+* Parse, analyze, and convert PDF documents.
+* PDF-1.7 specification support. (well, almost).
+* CJK languages and vertical writing scripts support.
+* Various font types (Type1, TrueType, Type3, and CID) support.
+* Support for extracting images (JPG, JBIG2, Bitmaps).
+* Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode,
+  CCITTFaxDecode)
+* Support for RC4 and AES encryption.
+* Support for AcroForm interactive form extraction.
+* Table of contents extraction.
+* Tagged contents extraction.
+* Automatic layout analysis.
 How to use
 ----------
- * Install Python 3.6 or newer.
+* Install Python 3.6 or newer.
- * Install
+* Install
  `pip install pdfminer.six`
- * Use command-line interface to extract text from pdf:
+* Use command-line interface to extract text from pdf:
  `python pdf2txt.py samples/simple1.pdf`
 Contributing
 ------------
--- a/pdfminer/utils.py
+++ b/pdfminer/utils.py
@@ -77,44 +77,116 @@ def compatible_encode_method(bytesorstring, encoding='utf-8',
    return bytesorstring.decode(encoding, erraction)
-def apply_png_predictor(pred, colors, columns, bitspercomponent, data):
-    if bitspercomponent != 8:
-        # unsupported
-        raise ValueError("Unsupported `bitspercomponent': %d" %
-                         bitspercomponent)
-    nbytes = colors * columns * bitspercomponent // 8
-    buf = b''
-    line0 = b'\x00' * columns
-    for i in range(0, len(data), nbytes + 1):
-        ft = data[i]
-        i += 1
-        line1 = data[i:i + nbytes]
-        line2 = b''
-        if ft == 0:
-            # PNG none
-            line2 += line1
-        elif ft == 1:
-            # PNG sub (UNTESTED)
-            c = 0
-            for b in line1:
-                c = (c + b) & 255
-                line2 += bytes((c,))
-        elif ft == 2:
-            # PNG up
-            for (a, b) in zip(line0, line1):
-                c = (a + b) & 255
-                line2 += bytes((c,))
-        elif ft == 3:
-            # PNG average (UNTESTED)
-            c = 0
-            for (a, b) in zip(line0, line1):
-                c = ((c + a + b) // 2) & 255
-                line2 += bytes((c,))
-        else:
-            # unsupported
-            raise ValueError("Unsupported predictor value: %d" % ft)
-        buf += line2
-        line0 = line2
+def paeth_predictor(left, above, upper_left):
+    # From http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
+    # Initial estimate
+    p = left + above - upper_left
+    # Distances to a,b,c
+    pa = abs(p - left)
+    pb = abs(p - above)
+    pc = abs(p - upper_left)
+
+    # Return nearest of a,b,c breaking ties in order a,b,c
+    if pa <= pb and pa <= pc:
+        return left
+    elif pb <= pc:
+        return above
+    else:
+        return upper_left
+
+
+def apply_png_predictor(pred, colors, columns, bitspercomponent, data):
+    """Reverse the effect of the PNG predictor
+    Documentation: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
+    """
+    if bitspercomponent != 8:
+        msg = "Unsupported `bitspercomponent': %d" % bitspercomponent
+        raise ValueError(msg)
+    nbytes = colors * columns * bitspercomponent // 8
+    bpp = colors * bitspercomponent // 8  # number of bytes per complete pixel
+    buf = b''
+    line_above = b'\x00' * columns
+    for scanline_i in range(0, len(data), nbytes + 1):
+        filter_type = data[scanline_i]
+        line_encoded = data[scanline_i + 1:scanline_i + 1 + nbytes]
+        raw = b''
+        if filter_type == 0:
+            # Filter type 0: None
+            raw += line_encoded
+        elif filter_type == 1:
+            # Filter type 1: Sub
+            # To reverse the effect of the Sub() filter after decompression,
+            # output the following value:
+            #   Raw(x) = Sub(x) + Raw(x - bpp)
+            # (computed mod 256), where Raw() refers to the bytes already
+            #  decoded.
+            for j, sub_x in enumerate(line_encoded):
+                if j - bpp < 0:
+                    raw_x_bpp = 0
+                else:
+                    raw_x_bpp = int(raw[j - bpp])
+                raw_x = (sub_x + raw_x_bpp) & 255
+                raw += bytes((raw_x,))
+        elif filter_type == 2:
+            # Filter type 2: Up
+            # To reverse the effect of the Up() filter after decompression,
+            # output the following value:
+            #   Raw(x) = Up(x) + Prior(x)
+            # (computed mod 256), where Prior() refers to the decoded bytes of
+            # the prior scanline.
+            for (up_x, prior_x) in zip(line_encoded, line_above):
+                raw_x = (up_x + prior_x) & 255
+                raw += bytes((raw_x,))
+        elif filter_type == 3:
+            # Filter type 3: Average
+            # To reverse the effect of the Average() filter after
+            # decompression, output the following value:
+            #    Raw(x) = Average(x) + floor((Raw(x-bpp)+Prior(x))/2)
+            # where the result is computed mod 256, but the prediction is
+            # calculated in the same way as for encoding. Raw() refers to the
+            # bytes already decoded, and Prior() refers to the decoded bytes of
+            # the prior scanline.
+            for j, average_x in enumerate(line_encoded):
+                if j - bpp < 0:
+                    raw_x_bpp = 0
+                else:
+                    raw_x_bpp = int(raw[j - bpp])
+                prior_x = int(line_above[j])
+                raw_x = (average_x + (raw_x_bpp + prior_x) // 2) & 255
+                raw += bytes((raw_x,))
+        elif filter_type == 4:
+            # Filter type 4: Paeth
+            # To reverse the effect of the Paeth() filter after decompression,
+            # output the following value:
+            #    Raw(x) = Paeth(x)
+            #             + PaethPredictor(Raw(x-bpp), Prior(x), Prior(x-bpp))
+            # (computed mod 256), where Raw() and Prior() refer to bytes
+            # already decoded. Exactly the same PaethPredictor() function is
+            # used by both encoder and decoder.
+            for j, paeth_x in enumerate(line_encoded):
+                if j - bpp < 0:
+                    raw_x_bpp = 0
+                    prior_x_bpp = 0
+                else:
+                    raw_x_bpp = int(raw[j - bpp])
+                    prior_x_bpp = int(line_above[j - bpp])
+                prior_x = int(line_above[j])
+                paeth = paeth_predictor(raw_x_bpp, prior_x, prior_x_bpp)
+                raw_x = (paeth_x + paeth) & 255
+                raw += bytes((raw_x,))
+        else:
+            raise ValueError("Unsupported predictor value: %d" % filter_type)
+        buf += raw
+        line_above = raw
    return buf
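
One way to sanity-check the new `filter_type == 4` branch is a round trip: filter a few scanlines with Paeth, reverse the filter, and compare with the input. The sketch below is standalone and does not import pdfminer.six. `paeth_filter_row`, `paeth_unfilter_row` and `roundtrip` are hypothetical helper names invented for this illustration, and the encoder is written directly from the PNG specification's formula rather than taken from this commit. It also assumes the all-zero row above the first scanline is sized to the full scanline length, which may differ from the `columns`-sized buffer in the diff.

```python
def paeth_predictor(left, above, upper_left):
    # Same selection rule as the pseudo code in the PNG specification.
    p = left + above - upper_left
    pa, pb, pc = abs(p - left), abs(p - above), abs(p - upper_left)
    if pa <= pb and pa <= pc:
        return left
    elif pb <= pc:
        return above
    else:
        return upper_left


def paeth_filter_row(raw, prior, bpp):
    """Hypothetical encoder: Paeth(x) = Raw(x) - PaethPredictor(...)."""
    out = bytearray()
    for j, raw_x in enumerate(raw):
        left = raw[j - bpp] if j >= bpp else 0
        upper_left = prior[j - bpp] if j >= bpp else 0
        predicted = paeth_predictor(left, prior[j], upper_left)
        out.append((raw_x - predicted) & 255)  # computed mod 256
    return bytes(out)


def paeth_unfilter_row(encoded, prior, bpp):
    """Hypothetical decoder mirroring the filter_type == 4 branch."""
    raw = bytearray()
    for j, paeth_x in enumerate(encoded):
        left = raw[j - bpp] if j >= bpp else 0
        upper_left = prior[j - bpp] if j >= bpp else 0
        predicted = paeth_predictor(left, prior[j], upper_left)
        raw.append((paeth_x + predicted) & 255)  # computed mod 256
    return bytes(raw)


def roundtrip(rows, bpp):
    """Filter then unfilter each row; return the decoded rows."""
    # Assumption: the row "above" the first scanline is all zeros and
    # as long as a full scanline.
    prior_raw = bytes(len(rows[0]))
    prior_decoded = bytes(len(rows[0]))
    decoded_rows = []
    for row in rows:
        encoded = paeth_filter_row(row, prior_raw, bpp)
        decoded = paeth_unfilter_row(encoded, prior_decoded, bpp)
        decoded_rows.append(decoded)
        prior_raw, prior_decoded = row, decoded
    return decoded_rows


# Two RGB pixels per scanline, so bpp = 3 and each row is 6 bytes.
rows = [bytes([10, 20, 30, 40, 50, 60]), bytes([12, 22, 29, 41, 55, 58])]
print(roundtrip(rows, bpp=3) == rows)
```

Because encoder and decoder share the same predictor and both work mod 256, the round trip should hold for any byte values; a test in the project itself would more likely feed filtered bytes to `apply_png_predictor` directly.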