* Adding in checks for spurious lines that contain either only spaces or new line characters
* Added spurious lines check and unit tests
* Updated CHANGELOG.md with changes
* Simplify code
* Simplify code
* Simplify code
* Remove changes to lines that are not actually changed
* Format import
* Improve CHANGELOG.md
* Improve CHANGELOG.md
* Fix cicd
* Blacken
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* port page label code from pdfannots
* add tests and clean up
* more cleanup; harden against non-conforming input
* one more test
* update CHANGELOG
* cleanup & respond to review feedback (incomplete)
* Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.
* PageLabels *is* a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.
* fix type errors and cleanup slightly
* fix mypy errors (including tweaking code to avoid problematic dynamic types)
* hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
* avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
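
The recursive number tree parsing with a single sortedness check at the end, as described in the page label commits above, might look roughly like this. It is a minimal sketch assuming plain Python dicts with the `Nums` and `Kids` keys from the PDF reference; `flatten_number_tree` is a hypothetical helper, not the `NumberTree` class added in this change.

```python
import warnings
from typing import Any, Dict, List, Tuple


def flatten_number_tree(node: Dict[str, Any]) -> List[Tuple[int, Any]]:
    """Collect (key, value) pairs from a number tree node (hypothetical helper)."""
    items: List[Tuple[int, Any]] = []

    def _parse(n: Dict[str, Any]) -> None:
        # "Nums" is a flat array: [key1, value1, key2, value2, ...]
        nums = n.get("Nums", [])
        for i in range(0, len(nums) - 1, 2):
            items.append((nums[i], nums[i + 1]))
        # "Kids" holds child nodes, which are parsed recursively
        for kid in n.get("Kids", []):
            _parse(kid)

    _parse(node)
    # Check sortedness once, at the end, so malformed input warns only once
    keys = [key for key, _ in items]
    if keys != sorted(keys):
        warnings.warn("Number tree keys are out of order; sorting them")
        items.sort(key=lambda pair: pair[0])
    return items
```

Checking the order after the whole tree has been walked avoids emitting one warning per nested node.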
* Attempt to handle decompression error on some broken PDF files
From time to time we come across files where no text is detected, even though
readers like evince read the PDF nicely. After some digging, it turned out that
this is because the PDF includes some badly compressed data. This may be fixed
by decompressing byte by byte and ignoring the error on the last check bytes
(arbitrarily found to be the last 3).
This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422
and the test file has been taken from there, so credits to @zegrep.
* Use a warning instead of raising an exception
where zlib error is detected before the CRC checksum.
* Add line to CHANGELOG.md
* Only try decompressing if not in strict mode
* Change error into warning because warnings.warn needs a subclass of Warning
Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #625
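
The byte-per-byte decompression described in these commits can be sketched with the standard `zlib` module. This is an illustration under stated assumptions, not the code that was merged: the function name `decompress_corrupted` and the warning text are hypothetical, and the 3-byte threshold is taken from the commit message.

```python
import warnings
import zlib


def decompress_corrupted(data: bytes) -> bytes:
    """Decompress zlib data one byte at a time, tolerating a bad trailing checksum."""
    decompressor = zlib.decompressobj()
    result = bytearray()
    position = 0
    try:
        for position in range(len(data)):
            # Feed a single byte so that everything decoded before the
            # failing byte has already been collected in `result`
            result += decompressor.decompress(data[position:position + 1])
    except zlib.error:
        # Assumption from the commit message: an error within the last
        # 3 bytes is a checksum failure and is ignored; anything earlier
        # is reported as a warning rather than an exception.
        if position < len(data) - 3:
            warnings.warn("Data corrupted before the trailing checksum bytes")
    return bytes(result)
```

Because output is accumulated after every byte, a stream whose final checksum is damaged still yields all of its decoded data, whereas `zlib.decompress` would raise and return nothing.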
* add support for Identity-H/V cmap fonts
* format code to pass flake8 check
* Remove indent
* Remove indent
* Use isinstance instead of type check
* Use or instead of any
* Use str in variable, instead of str.find()
* Fix mypy error: add typing annotations to get_unichr()
* Fix type of PDFCIDFont. Can be any type of CMapBase.
This is a quick fix, the entire cmap structure does not have proper inheritance.
* Added line to CHANGELOG.md
* Add separate class for IdentityUnicodeMap
* Remove ABC from CmapBase
* Remove ABC from CmapBase
* Remove blank line
Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #566
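
As a rough illustration of what an identity CMap implies: Identity-H and Identity-V map each 2-byte character code directly to the CID with the same value. The sketch below assumes that an identity Unicode map then treats the CID as the code point; both function names are hypothetical and are not the classes added in this change.

```python
import struct
from typing import Tuple


def decode_identity(code: bytes) -> Tuple[int, ...]:
    """Interpret a byte string as big-endian 2-byte CIDs (identity mapping)."""
    n = len(code) // 2
    # Ignore a trailing odd byte rather than failing on malformed input
    return struct.unpack(">%dH" % n, code[: 2 * n])


def identity_unichr(cid: int) -> str:
    """Assumed behaviour of an identity Unicode map: the CID is the code point."""
    return chr(cid)
```

For example, the bytes `4E 2D 65 87` decode to the CIDs `0x4E2D` and `0x6587`, which this assumed mapping turns into the two characters of "中文".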
* try to fix the issue where some Chinese characters cannot be extracted
correctly (#566).
* format code to pass flake8 check.
* fix typo and refer to issue 593.
Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix .paint_path handling of single line segments
- Fixes typo ("ml" should have been "mlh")
- Removes if-statement that required individual line segments to be
strictly horizontal or vertical.
* Treat 'ml'-shape paths as lines not curves
Although 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.
Adds tests and sample PDF.
* Fix trailing whitespace
* Fix point-extraction from Bézier path commands
This commit corrects the manner in which "pts" are extracted from Bézier
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on the command (c, v, or y),
the code was extracting some combination of control points (not
on curve) and the actual points-on-curve.
This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.
* Add comments to test_paint_path_quadrilaterals
* Parse rect-forming mllll paths as rects not curves
Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.
* One changelog line with ref to mr
* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring
* Extract variables from if statement to make it easier to read
* Optimize imports order
* Trigger travis build
* Revert "Trigger travis build"
This reverts commit 41c05184
* Update travis badge
* Update travis badge
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
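
The path handling described in the commits above can be sketched as follows. This is a simplified illustration, assuming each path segment is a tuple such as `("m", x, y)`, `("c", x1, y1, x2, y2, x3, y3)` or `("h",)` and that every path starts with `m`; `classify_path` is a hypothetical helper, not the refactored `.paint_path`.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def classify_path(path: List[Sequence]) -> Tuple[str, List[Point]]:
    """Return ("line" | "rect" | "curve", points on the path)."""
    shape = "".join(seg[0] for seg in path)
    pts: List[Point] = []
    for seg in path:
        if seg[0] == "h":
            # Closing the subpath returns to its starting point
            pts.append(pts[0])
        else:
            # For m, l, c, v and y the final two operands are the point on
            # the path; earlier operands of c, v and y are control points
            pts.append((seg[-2], seg[-1]))
    if shape in ("ml", "mlh"):
        # A single segment is a line, whatever its direction
        return "line", pts[:2]
    if shape in ("mlllh", "mllll"):
        (x0, y0), (x1, y1), (x2, y2), (x3, y3) = pts[:4]
        closed = pts[0] == pts[4]
        axis_aligned = (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
            y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
        )
        if closed and axis_aligned:
            return "rect", pts[:4]
    return "curve", pts
```

Taking the last two operands of every command is what keeps control points out of the extracted points, and treating `h` like any other command lets `mlllh` and `mllll` share the same rectangle check.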
* swap pycryptodome for the faster, smaller, and industry-standard cryptography (cryptography.io)
* update changelog
* fix lint
* Update CHANGELOG.md
* from MR, unneeded ex and naming
* add samples to nosetests
* fix lint
* show mismatch
* fix lint
* typo and newline
* Revert "add samples to nosetests"
This reverts commit a49ca302
* Add tests for encrypted documents to nose test suite
* Optimize imports of pdfdocument.py
Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix converting path to multiple rectangles
For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.
fixes #369
* Reduce pdf size by removing font
* Add unittest for PDFLayoutAnalyzer.paint_path()
* Add line to CHANGELOG.md
* Add reference to pdf reference manual
* Cleanup function paint_path a bit
* Reduce line length of tests
* Reduce line length of tests
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
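
Splitting a path such as 'mlllhmlllh...' into one group per rectangle can be sketched with a regular expression over the shape string. This is a hedged illustration, assuming every segment is a tuple whose first element is a single-character command; `split_subpaths` is a hypothetical name, and the merged code may differ.

```python
import re
from typing import List, Sequence


def split_subpaths(path: List[Sequence]) -> List[List[Sequence]]:
    """Split a path into subpaths, each starting at an "m" command."""
    shape = "".join(seg[0] for seg in path)
    # Each command is one character, so offsets into the shape string are
    # also indices into the path list
    return [path[m.start():m.end()] for m in re.finditer(r"m[^m]+", shape)]
```

Each returned group can then be painted on its own, which yields several rects instead of a single curve.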
* Changed error to warning for 'Text extraction is not allowed'
* updated changelog
* fix lint
* made changes suggested in review
* Update CHANGELOG.md
* Add regression test for failing pdf
* Reduce line length to <80
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Remove scaling font height/width with size of font bounding box
* Refactor LTChar bounding box computation
* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.
* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects
* Add line to CHANGELOG
* Fix getting filename when extracting embedded files
* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs
* Add line to CHANGELOG
Fixes #186
* Treat the permissions (the /P entry) as unsigned long, fix #186
* handle negative values for p
* Extract function for resolving a two's-complement value
* Add test for issue #352
* Add line to CHANGELOG.md
* Only ints can be converted to a uint using two's-complement method
* Standardize import style; multiple imports from same module on one line
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
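
Resolving a negative /P value with the two's-complement method can be sketched as below. This is a minimal example, assuming a 32-bit permissions field; `to_unsigned` is a hypothetical name and not necessarily the function extracted in this change.

```python
def to_unsigned(value: int, n_bits: int = 32) -> int:
    """Reinterpret a signed integer as an unsigned n_bits-wide integer."""
    if not isinstance(value, int):
        # Only ints can be converted with the two's-complement method
        raise TypeError("expected int, got %r" % type(value).__name__)
    if value >= 0:
        return value
    # A negative value wraps around: -1 becomes 2**n_bits - 1
    return value + (1 << n_bits)
```

For instance, a /P entry stored as -4 corresponds to the unsigned value 4294967292, in which every bit except the two lowest is set.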
* Drop support for legacy Python 2
* Add python_requires to help pip
* Upgrade Python syntax with pyupgrade
* Upgrade Python syntax with pyupgrade --py3-plus
* Python 3 imports
* Replace six
* Update CONTRIBUTING.md
* Added line to changelog
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>