pdfminer.six/tests
Sylvain Thénault 10f6fb40c2
Attempt to handle decompression error on some broken PDF files (#637)
* Attempt to handle decompression error on some broken PDF files

From time to time we come across files in which no text is detected, although
readers such as evince display the PDF correctly. After some digging, it turned
out that the PDF includes some badly compressed data. This may be fixed by
decompressing byte by byte and ignoring an error raised on the final check bytes
(arbitrarily found to be the last 3).

This was largely inspired by https://github.com/mstamy2/PyPDF2/issues/422,
and the test file was taken from there, so credit goes to @zegrep.

* Use a warning instead of raising an exception

when a zlib error is detected before the CRC checksum.

* Add line to CHANGELOG.md

* Only try decompressing if not in strict mode

* Change error into warning because warnings.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-12-11 18:25:19 +01:00
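
The approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the names `decompress_corrupted`, `inflate`, `DecompressionWarning` and the `strict` parameter are assumptions made for this example. It feeds the stream to zlib one byte at a time, keeps whatever was decoded before an error, and only warns when the error occurs earlier than the last 3 bytes.

```python
import warnings
import zlib


class DecompressionWarning(UserWarning):
    """Hypothetical warning class; warnings.warn needs a Warning subclass."""


def decompress_corrupted(data: bytes) -> bytes:
    """Inflate data byte by byte, keeping the output produced before any error."""
    decompressor = zlib.decompressobj()
    result = b""
    position = 0
    try:
        for position in range(len(data)):
            result += decompressor.decompress(data[position:position + 1])
    except zlib.error:
        # An error in the last 3 bytes is assumed to be a bad checksum and is
        # ignored silently; an earlier error means some data was probably lost.
        if position < len(data) - 3:
            warnings.warn(
                "Data loss while decompressing corrupted data",
                DecompressionWarning,
            )
    return result


def inflate(data: bytes, strict: bool = False) -> bytes:
    """Try a normal decompression first; fall back only when not strict."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        if strict:
            raise
        return decompress_corrupted(data)
```

For example, a stream whose final checksum byte has been flipped makes `zlib.decompress` raise, while `inflate(data)` still returns the full payload; with `strict=True` the original `zlib.error` propagates unchanged.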
..
helpers.py Enforce pep8 coding-style (#345) 2019-12-29 21:20:20 +01:00
tempfilepath.py Issue #469 is fixed (When run on Windows a lot of tests fail with the error: [Errno 13] Permission denied) (#484) 2020-10-26 10:10:11 +01:00
test_converter.py Fix bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed (#616) 2021-09-27 20:30:40 +02:00
test_encodingdb.py Catch ValueError when converting font encoding differences to characters (#389) 2020-03-16 20:12:45 +01:00
test_font_size.py Fix bug in computing character bounding box (#348) 2020-01-16 22:15:50 +01:00
test_highlevel_extracttext.py Attempt to handle decompression error on some broken PDF files (#637) 2021-12-11 18:25:19 +01:00
test_layout.py Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382) (#384) 2020-03-23 22:38:39 +01:00
test_pdfdocument.py Fix 594 use null id when encrypted but no id given (#595) 2021-08-29 21:32:14 +02:00
test_pdfencoding.py add shebang line to script in tools (#408) 2020-04-28 10:58:42 +02:00
test_pdffont.py Always try to get CMap, even if name is not recognized (#438) 2020-07-23 20:27:38 +02:00
test_pdfminer_ccitt.py Enforce pep8 coding-style (#345) 2019-12-29 21:20:20 +01:00
test_pdfminer_crypto.py Remove unused rijndael encryption implementation (#465) 2020-09-10 19:28:00 +02:00
test_pdfminer_psparser.py Enforce pep8 coding-style (#345) 2019-12-29 21:20:20 +01:00
test_tools_dumppdf.py Add type annotations (#661) 2021-10-09 16:23:28 +02:00
test_tools_pdf2txt.py Add support for ISO 32000-2 AES256 encryption (#614) 2021-09-06 22:00:23 +02:00
test_utils.py Allow a pathlib.PurePath object as a input to open_filename (#492) 2020-09-17 21:29:00 +02:00