pdfminer.six/samples
Sylvain Thénault 10f6fb40c2
Attempt to handle decompression error on some broken PDF files (#637)
* Attempt to handle decompression error on some broken PDF files

From time to time we come across files where no text is detected, while readers
like evince read the PDF nicely. After digging, it turned out that this is because
the PDF includes some badly compressed data. This may be fixed by decompressing
byte by byte and ignoring the error on the last check bytes (arbitrarily found to
be the last 3).

This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422
and the test file has been taken from there, so credits to @zegrep.

* Use a warning instead of raising an exception

where a zlib error is detected before the CRC checksum.

* Add line to CHANGELOG.md

* Only try decompressing if not in strict mode

* Change error into warning because warnings.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-12-11 18:25:19 +01:00
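
The approach described in the commit message above (decompress one byte at a time, and tolerate a failure that only shows up in the trailing check bytes) can be sketched with the standard zlib module. This is a minimal illustration, not the code merged into pdfminer.six: the function name `decompress_tolerant` and the warning text are assumptions made for this example, and the 3-byte threshold is the arbitrary value mentioned in the commit message.

```python
import warnings
import zlib


def decompress_tolerant(data: bytes) -> bytes:
    """Decompress zlib data byte by byte, keeping whatever was decoded
    before an error occurred (hypothetical helper for illustration)."""
    decoder = zlib.decompressobj()
    result = b""
    for i in range(len(data)):
        try:
            # Feed a single byte so that output produced before a
            # corrupted region has already been collected.
            result += decoder.decompress(data[i:i + 1])
        except zlib.error:
            # An error within the last 3 bytes is treated as a bad
            # checksum and ignored; an earlier one means data was lost.
            if i < len(data) - 3:
                warnings.warn("Data loss while decompressing corrupted data")
            break
    return result
```

For a stream whose only defect is a damaged trailing checksum, `zlib.decompress` raises `zlib.error`, while the sketch returns the decoded payload. Feeding one byte at a time is slow, which is presumably why the commit limits the fallback to non-strict mode.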
Name | Last commit | Date
acroform | Add section to documentation with howto for AcroForm fields extraction (#458) | 2020-09-10 19:18:41 +02:00
contrib | Add support identity unicode cmap (#626) | 2021-10-13 21:52:00 +02:00
encryption | Add support for ISO 32000-2 AES256 encryption (#614) | 2021-09-06 22:00:23 +02:00
nonfree | Fix failing test on develop & cleaning up test files (#319) | 2019-10-26 18:42:33 +02:00
scancode | Add a test for the previous fix | 2017-10-16 12:35:16 +02:00
README | Added: tests for extracting tests from pdfs with Type3 fonts (#205) | 2019-10-22 18:15:59 +02:00
font-size-test.pdf | Fix bug in computing character bounding box (#348) | 2020-01-16 22:15:50 +01:00
jo.pdf | add samples, fixed silly bugs. | 2007-12-31 05:02:15 +00:00
sampleOneByteIdentityEncode.pdf | Adds Test Case | 2019-08-10 10:19:20 +05:30
simple1.pdf | testcase added | 2009-10-24 02:50:07 +00:00
simple2.pdf | various cleanup for release. | 2008-04-27 11:47:38 +00:00
simple3.pdf | test file simple3.pdf added. | 2010-08-29 06:39:41 +00:00
simple4.pdf | Fix ordering of textlines within a textbox when boxes_flow is disabled (#412) | 2020-05-09 15:37:49 +02:00
zen_of_python_corrupted.pdf | Attempt to handle decompression error on some broken PDF files (#637) | 2021-12-11 18:25:19 +01:00

README

This directory contains sample PDF files.

These files (including the ones in the nonfree/ subdirectory) can be
distributed freely but do not come with explicit licensing
terms or source files.

Here are the credits of the original files:

simple1.pdf:
  (Originally taken from PDF Specification 1.7, 
  Appendix G. "Simple Text String Example" and modified)

simple2.pdf:
  (Originally taken from PDF Specification 1.7, 
  Appendix G. "Simple Graphics Example" and modified)

jo.pdf:
  Kenji Miyazawa (1896-1933, copyright expired)
  Preface of "Haru to Shura"
  (File generated from jo.tex by LaTeX and dvi2pdfm)

--
contrib/matplotlib.pdf
  Copyright 2018, James R Barlow
  Example file created in matplotlib to add a Type3 font to the samples
  Released under the terms of the "LICENSE" file

--
nonfree/cmp_itext_logo.pdf
  Bruno Lowagie
  "iText Logo - Type 3 font"
  http://gitlab.itextsupport.com/itext/sandbox/raw/master/cmpfiles/fonts/cmp_itext_logo.pdf

nonfree/dmca.pdf: 
  U.S. Copyright Office
  The Digital Millennium Copyright Act
  http://www.copyright.gov/legislation/dmca.pdf

nonfree/f1040nr.pdf:
  U.S. Department of the Treasury Internal Revenue Service
  Form 1040-NR, U.S. Nonresident Alien Income Tax Return
  http://www.irs.gov/pub/irs-pdf/f1040nr.pdf

nonfree/i1040nr.pdf:
  U.S. Department of the Treasury Internal Revenue Service
  Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
  http://www.irs.gov/pub/irs-pdf/i1040nr.pdf

nonfree/kampo.pdf:
  National Printing Bureau of Japan
  Official Gazette, Vol. 4817
  http://kanpou.npb.go.jp/

nonfree/nlp2004slides.pdf:
  Yusuke Shinyama and Satoshi Sekine
  "Named Entity Discovery from Comparable News Corpora"

nonfree/naacl06-shinyama.pdf:
  Yusuke Shinyama and Satoshi Sekine
  "Preemptive Information Extraction using Unrestricted Relation Discovery"

--
Files in the encryption folder have been generated with cpdf 1.7 [http://www.coherentpdf.com/]
from the base.pdf file generated with LibreOffice 4.1.1.2 as follows:

cpdf -encrypt 40bit foo baz base.pdf -o rc4-40.pdf
cpdf -encrypt 128bit foo baz base.pdf -o rc4-128.pdf
cpdf -encrypt AES foo baz base.pdf -o aes-128.pdf
cpdf -encrypt AES foo baz base.pdf -no-encrypt-metadata -o aes-128-m.pdf
cpdf -encrypt AES256 foo baz base.pdf -o aes-256.pdf
cpdf -encrypt AES256 foo baz base.pdf -no-encrypt-metadata -o aes-256-m.pdf
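
The six commands above differ only in the encryption method, the metadata flag and the output name, so they could be regenerated from a small table. The following is a sketch under stated assumptions: the helper `cpdf_command` and the `VARIANTS` table are hypothetical names introduced here, and `foo` and `baz` are assumed to be the owner and user passwords in that order. It only builds argument lists; running them requires cpdf to be installed.

```python
# (method, output file, encrypt metadata?) for each sample listed above
VARIANTS = [
    ("40bit", "rc4-40.pdf", True),
    ("128bit", "rc4-128.pdf", True),
    ("AES", "aes-128.pdf", True),
    ("AES", "aes-128-m.pdf", False),
    ("AES256", "aes-256.pdf", True),
    ("AES256", "aes-256-m.pdf", False),
]


def cpdf_command(method, output, encrypt_metadata=True,
                 owner_pw="foo", user_pw="baz", source="base.pdf"):
    """Build the argument list for one cpdf call (hypothetical helper)."""
    args = ["cpdf", "-encrypt", method, owner_pw, user_pw, source]
    if not encrypt_metadata:
        # Matches the "-m" variants, which leave the metadata unencrypted.
        args.append("-no-encrypt-metadata")
    args += ["-o", output]
    return args


if __name__ == "__main__":
    for method, output, encrypt_metadata in VARIANTS:
        # Print the command; pass the list to subprocess.run to execute it.
        print(" ".join(cpdf_command(method, output, encrypt_metadata)))
```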