pdfminer.six/pdfminer
Sylvain Thénault 10f6fb40c2
Attempt to handle decompression error on some broken PDF files (#637)
* Attempt to handle decompression error on some broken PDF files

From time to time we come across files where no text is detected, although
readers like evince render the PDF fine. After digging, it turned out that this
is because the PDF includes some badly compressed data. This may be fixed by
decompressing byte by byte and ignoring the error on the last check bytes
(arbitrarily found to be the last 3).

This was largely inspired by https://github.com/mstamy2/PyPDF2/issues/422
and the test file was taken from there, so credit goes to @zegrep.
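
The byte-by-byte approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names `decompress_tolerant`, `CorruptedStreamWarning` and the `check_bytes` parameter are assumptions made for the example, and the default of 3 simply mirrors the "last 3" bytes mentioned above.

```python
import warnings
import zlib


class CorruptedStreamWarning(UserWarning):
    """Hypothetical warning category, used only in this sketch."""


def decompress_tolerant(data: bytes, check_bytes: int = 3) -> bytes:
    """Inflate zlib data one byte at a time, keeping whatever was decoded."""
    decompressor = zlib.decompressobj()
    chunks = []
    for offset in range(len(data)):
        try:
            # Feeding a single byte per call means an error only discards
            # the output of that one call, not the data decoded so far.
            chunks.append(decompressor.decompress(data[offset:offset + 1]))
        except zlib.error as exc:
            # An error inside the trailing check bytes means the payload was
            # already inflated, so it is ignored. Anything earlier is reported.
            if offset < len(data) - check_bytes:
                warnings.warn(
                    f"zlib error at byte {offset}, output may be truncated: {exc}",
                    CorruptedStreamWarning,
                )
            break
    return b"".join(chunks)
```

With a stream whose final checksum byte has been damaged, `zlib.decompress` raises `zlib.error`, while this sketch should still return the full payload because the error only surfaces once the checksum is compared.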

* Use a warning instead of raising an exception

when a zlib error is detected before the CRC checksum.

* Add line to CHANGELOG.md

* Only try decompressing if not in strict mode

* Change error into warning because warnings.warn needs a subclass of Warning
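
The last three points (warn rather than raise, stay strict when asked to, and use a real `Warning` subclass) could look something like the sketch below. All names here (`decode_flate`, `StreamDecodeError`, `StreamDecodeWarning`, the `strict` parameter) are hypothetical and chosen for illustration; they are not taken from pdfminer.six.

```python
import warnings
import zlib


class StreamDecodeError(Exception):
    """Hypothetical error raised in strict mode."""


class StreamDecodeWarning(UserWarning):
    """Hypothetical category; it must derive from Warning to be usable."""


def decode_flate(data: bytes, strict: bool = False) -> bytes:
    """Decompress zlib data, raising in strict mode and warning otherwise."""
    try:
        return zlib.decompress(data)
    except zlib.error as exc:
        if strict:
            # Strict mode: no recovery attempt, surface the problem.
            raise StreamDecodeError(f"Invalid zlib bytes: {exc}") from exc
        # The category passed to warnings.warn must be a Warning subclass.
        warnings.warn(f"Invalid zlib bytes: {exc}", StreamDecodeWarning)
        # Placeholder: a tolerant byte-by-byte decompression would go here.
        return b""
```

Passing a plain exception class such as `StreamDecodeError` as the category of `warnings.warn` raises `TypeError`, which is why a dedicated `Warning` subclass is needed.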

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-12-11 18:25:19 +01:00
..
cmap Include compiled cmap resources to simplify installation for CJK languages 2015-12-27 13:32:29 +09:00
Makefile apply more patches 2010-02-13 15:00:43 +00:00
__init__.py Bump version to 20211012 2021-10-12 20:45:24 +02:00
_saslprep.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
arcfour.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
ascii85.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
ccitt.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
cmapdb.py Add support identity unicode cmap (#626) 2021-10-13 21:52:00 +02:00
converter.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
encodingdb.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
fontmetrics.py Drop support for legacy Python 2 (#346) 2020-01-04 16:47:07 +01:00
glyphlist.py Drop support for legacy Python 2 (#346) 2020-01-04 16:47:07 +01:00
high_level.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
image.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
jbig2.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
latin_enc.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
layout.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
lzw.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdfcolor.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdfdevice.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdfdocument.py Attempt to handle decompression error on some broken PDF files (#637) 2021-12-11 18:25:19 +01:00
pdffont.py Add support identity unicode cmap (#626) 2021-10-13 21:52:00 +02:00
pdfinterp.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdfpage.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdfparser.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdftypes.py Attempt to handle decompression error on some broken PDF files (#637) 2021-12-11 18:25:19 +01:00
psparser.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
runlength.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
settings.py Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320) 2019-10-26 19:16:37 +02:00
utils.py Replace typing-extensions Literal with the type of the Literal & run mypy, nosetest and sphinx in there own environment on cicd (#677) 2021-10-12 20:22:58 +02:00