* Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate scripts that exist only for testing purposes.
* Add CHANGELOG.md
* Cleanup CHANGELOG.md
* Undo deleting conf_glyphlist.py and conf_afm.py and add a deprecation warning instead
* Attempt to handle decompression error on some broken PDF files
From time to time we come across files where no text is detected, even though readers
like evince display the PDF correctly. After some digging, it turned out that this is because the
PDF includes some badly compressed data. This may be fixed by decompressing byte
by byte and ignoring the error on the last check bytes (arbitrarily found to be
the last 3).
This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422
and the test file has been taken from there, so credits to @zegrep.
* Use a warning instead of raising an exception
when the zlib error is detected before the CRC checksum.
* Add line to CHANGELOG.md
* Only try decompressing if not in strict mode
* Change error into warning because warnings.warn needs a subclass of Warning
Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #625
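The fallback described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `DecompressionWarning`, `decompress_corrupted`, `safe_decompress` and the `strict` parameter are hypothetical, and the 3-byte threshold is taken from the description above.

```python
import warnings
import zlib


class DecompressionWarning(UserWarning):
    """Hypothetical warning class; warnings.warn() needs a Warning subclass."""


def decompress_corrupted(data: bytes) -> bytes:
    """Feed the stream to zlib one byte at a time and keep what decodes."""
    decompressor = zlib.decompressobj()
    result = b""
    try:
        for i in range(len(data)):
            result += decompressor.decompress(data[i:i + 1])
    except zlib.error as exc:
        # Assumption: an error within the last 3 bytes is only the trailing
        # checksum failing, so the decoded payload is still usable.
        if i < len(data) - 3:
            warnings.warn(
                f"Data loss while decompressing corrupted data: {exc}",
                DecompressionWarning,
            )
    return result


def safe_decompress(data: bytes, strict: bool = False) -> bytes:
    """Try a normal decompress first; fall back only when not strict."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        if strict:
            raise
        return decompress_corrupted(data)
```

With a stream whose final checksum byte is damaged, `zlib.decompress` raises `zlib.error`, while `safe_decompress` returns the payload decoded so far. In strict mode the original error is re-raised.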
* add support for Identity-H/V cmap fonts
* format code to pass flake8 check
* Remove indent
* Use isinstance instead of type check
* Use or instead of any
* Use str in variable, instead of str.find()
* Fix mypy error: add typing annotations to get_unichr()
* Fix type of PDFCIDFont. Can be any type of CMapBase.
This is a quick fix, the entire cmap structure does not have proper inheritance.
* Added line to CHANGELOG.md
* Add separate class for IdentityUnicodeMap
* Remove ABC from CmapBase
* Remove blank line
Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #566
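The idea of a separate identity Unicode map can be illustrated with the sketch below. The class and function names are hypothetical and are not the library's actual classes. The sketch assumes the CID can be read directly as a Unicode code point, which holds only for some Identity-H/V fonts.

```python
from typing import Dict, Iterable


class UnicodeMapSketch:
    """Hypothetical table-driven map from CID to Unicode text."""

    def __init__(self, cid2unichr: Dict[int, str]) -> None:
        self.cid2unichr = dict(cid2unichr)

    def get_unichr(self, cid: int) -> str:
        return self.cid2unichr[cid]


class IdentityUnicodeMapSketch(UnicodeMapSketch):
    """Hypothetical identity map: no table, the CID is the code point."""

    def __init__(self) -> None:
        super().__init__({})

    def get_unichr(self, cid: int) -> str:
        # Assumption: the CID equals the Unicode code point.
        return chr(cid)


def decode_cids(cids: Iterable[int], unicode_map: UnicodeMapSketch) -> str:
    """Decode a sequence of CIDs with whichever map the font provides."""
    # isinstance (rather than an exact type() comparison) accepts subclasses.
    if not isinstance(unicode_map, UnicodeMapSketch):
        raise TypeError("unicode_map must be a UnicodeMapSketch")
    return "".join(unicode_map.get_unichr(cid) for cid in cids)
```

For example, `decode_cids([0x4E2D, 0x6587], IdentityUnicodeMapSketch())` yields the two CJK characters at those code points, while a table-driven map looks each CID up explicitly.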
* Try to fix the issue where some Chinese characters cannot be extracted
correctly (#566).
* format code to pass flake8 check.
* fix typo and refer to issue 593.
Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Try to get the cmap from a pickle file, and clean up a bit.
* Don't use keyword argument for dict.get
* Add docs
* Make _get_cmap_name static
* Add test
* Add CHANGELOG.md
* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there
* Add CJK characters to expected output of simple3.pdf
* Fix line length
* Add comment
* Remove scaling font height/width with size of font bounding box
* Refactor LTChar bounding box computation
* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.
* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects
* Add line to CHANGELOG
* Code Refactor: Use code-style enforcement #312
* Add flake8 to travis-ci
* Remove python 2 3 comment on six library. 891 errors > 870 errors.
* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.
* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.
* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing
* Cleanup pdfinterp.py and add documentation from PDF Reference
* Cleanup pdfpage.py
* Cleanup pdffont.py
* Clean psparser.py
* Cleanup high_level.py
* Cleanup layout.py
* Cleanup pdfparser.py
* Cleanup pdfcolor.py
* Cleanup rijndael.py
* Cleanup converter.py
* Rename klass to cls if it is the class variable, to be more consistent with standard practice
* Cleanup cmap.py
* Cleanup pdfdevice.py
* flake8 ignore fontmetrics.py
* Cleanup test_pdfminer_psparser.py
* Fix flake8 in pdfdocument.py; 339 errors to go
* Fix flake8 utils.py; 326 errors to go
* pep8 correction for few files in /tools/ 328 > 160 to go (#342)
* pep8 correction for few files in /tools/ 328 > 160 to go
* pep8 correction: 160 > 5 to go
* Fix ascii85.py errors
* Fix error in getting index from target that does not exist
* Remove commented print lines
* Fix flake8 error in pdfinterp.py
* Fix python2 specific error by removing argument from print statement
* Ignore invalid python2 syntax
* Update contributing.md
* Added changelog
* Remove unused import
Co-authored-by: Fakabbir Amin <f4amin@gmail.com>