Closes #469
* Issue #469 is fixed
* One extra comment is added to the code
* TemporaryFilePath context manager is added to facilitate tests
* flake8 complaints fixed
* Update docs of tempfilepath.py
* Fix flake8
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
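
A temporary-file-path context manager for tests could look roughly like the sketch below. This is not the code from tempfilepath.py; the name `temporary_file_path` and its `suffix` parameter are assumptions made for illustration. It creates an empty file, hands the path to the test, and removes the file afterwards.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_file_path(suffix=""):
    """Yield the path of a fresh temporary file and delete it on exit.

    Hypothetical helper: the real TemporaryFilePath may differ in name,
    signature, and behaviour.
    """
    # mkstemp returns an open OS-level handle plus the path; close the
    # handle so the code under test can reopen the file by name.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        # The test may already have removed the file itself.
        if os.path.exists(path):
            os.remove(path)
```

A test would then write its output to the yielded path inside a `with temporary_file_path(".txt") as path:` block and rely on the cleanup once the block exits.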
Closes #191
* Remove support for non-standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream
* Add docstring
* Fix flake8
* Fix paint_path bug noted in issue #473
Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.
Also adds a test for cases presented in issue #473
* Tweak paint_path fix per @pietermarsman review
- Adjusts logic to adhere to if-elif-else rather than early returns.
- Shortens subpath detection/reprocessing step, using re.finditer().
* Reorder paint_path() if-else statements once more
* Fix flake8 issues
* Fix error: should select items 1 and 2 from the list, then possibly items 3 and 4, and so on.
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
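
The subpath decomposition and the if-elif-else type assignment described above can be sketched as follows. This is a simplified illustration, not the actual paint_path() code: `split_subpaths`, `is_axis_aligned_rect`, and `classify` are hypothetical names, the shape is assumed to be a string of path operators, and the corners are assumed to be passed as (x, y) tuples in drawing order.

```python
import re


def split_subpaths(shape):
    """Return (start, end) index spans of the subpaths in an operator string.

    Each subpath starts at an 'm' (moveto) and runs up to the next 'm'.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"m[^m]+", shape)]


def is_axis_aligned_rect(corners):
    """True if four corners, in drawing order, form an axis-aligned rectangle."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    return (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
        y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    )


def classify(shape, corners):
    """Pick a layout type name for a single subpath (sketch only)."""
    if shape == "ml":
        return "line"
    elif shape == "mlllh" and is_axis_aligned_rect(corners[:4]):
        return "rect"
    else:
        # Non-rectangular quadrilaterals and everything else stay curves.
        return "curve"
```

For the shape `"mlllhmlllh"` the spans are `(0, 5)` and `(5, 10)`, so each five-operator group can be classified on its own instead of being merged into one curve.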
* open_filename accepts a pathlib.PurePath object
* Add test for open_filename with pathlib
* Fix a wrong function name
* Cast a pathlib object to string for py3.4/3.5
* Add link to the PR
* Raise an exception when open_filename gets an unsupported type
* Add tests for open_filename
* Update CHANGELOG.md
* Documentation
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
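
The behaviour listed above (accept a pathlib object, cast it to a string, pass through open streams, reject anything else) can be sketched like this. The class name `open_path_or_stream` is hypothetical and the actual open_filename implementation may differ in detail.

```python
import io
import pathlib


class open_path_or_stream:
    """Context manager that opens a path, or passes through an open stream.

    Hypothetical sketch of the idea behind open_filename.
    """

    def __init__(self, filename, *args, **kwargs):
        if isinstance(filename, pathlib.PurePath):
            # Older Python 3 versions cannot open() a pathlib object directly.
            filename = str(filename)
        if isinstance(filename, str):
            self.file_handler = open(filename, *args, **kwargs)
            self.closing = True
        elif isinstance(filename, io.IOBase):
            # The caller owns the stream, so do not close it on exit.
            self.file_handler = filename
            self.closing = False
        else:
            raise TypeError("Unsupported input type: %s" % type(filename))

    def __enter__(self):
        return self.file_handler

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.closing:
            self.file_handler.close()
        return False
```

Only files opened by the context manager itself are closed on exit, so a stream handed in by the caller stays usable afterwards.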
* Try to get the cmap from a pickle file, and clean up a bit
* Don't use keyword argument for dict.get
* Add docs
* Make _get_cmap_name static
* Add test
* Add CHANGELOG.md
* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there
* Add CJK characters to expected output of simple3.pdf
* Fix line length
* Add comment
* Swap pycryptodome for cryptography (cryptography.io), which is faster, smaller, and the industry standard
* update changelog
* fix lint
* Update CHANGELOG.md
* from MR, unneeded ex and naming
* add samples to nosetests
* fix lint
* show mismatch
* fix lint
* typo and newline
* Revert "add samples to nosetests"
This reverts commit a49ca302
* Add tests for encrypted documents to nose test suite
* Optimize imports of pdfdocument.py
Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix converting path to multiple rectangles
For a path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.
Fixes #369
* Reduce pdf size by removing font
* Add unittest for PDFLayoutAnalyzer.paint_path()
* Add line to CHANGELOG.md
* Add reference to pdf reference manual
* Cleanup function paint_path a bit
* Reduce line length of tests
* Reduce line length of tests
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
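
The grouping into sets of 5 points can be illustrated with the sketch below. It assumes one (x, y) point per operator, with the point for 'h' repeating the start point; that layout, and the function name `rect_bboxes`, are assumptions for illustration and not the actual paint_path() data structures.

```python
def rect_bboxes(shape, points):
    """Return one bounding box per 'mlllh' group in a multi-rectangle path."""
    count, remainder = divmod(len(shape), 5)
    if remainder != 0 or shape != "mlllh" * count or len(points) != len(shape):
        raise ValueError("expected a shape of the form 'mlllhmlllh...'")
    boxes = []
    for i in range(0, len(points), 5):
        # The first four points of each group are the rectangle corners.
        corners = points[i:i + 4]
        xs = [x for x, _ in corners]
        ys = [y for _, y in corners]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each returned tuple is `(x0, y0, x1, y1)`, so a path made of two rectangles yields two boxes rather than one combined curve.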
* Changed error to warning for 'Text extraction is not allowed'
* updated changelog
* fix lint
* made changes suggested in review
* Update CHANGELOG.md
* Add regression test for failing pdf
* Reduce line length to <80
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #176
* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xrefs
* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref
* Use warning instead of error, because not outputting xrefs is just fine (there aren't any) but it is something the user should know
* Adding changelog
* Extend help message
* add shebang line to script in tools
* fix: use shebang line with python 3
* Moved changelog to unreleased
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Group text lines if they are centered (#382)
Closes #382
* Add comparison private methods to LTTextLines
* Add missing docstrings
* Add tests for find_neighbors
* Update changelog
* Cosmetic changes from code review
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed
* Add test for catching ValueError and KeyError when font encoding differences are invalid
* Added line to CHANGELOG.md
* Default value for --all-texts should be false, because using the flag enables it
* Fix edge case: when no neighbors are found a line should form its own text box
* Added test for grouping textlines where 1 is outside the parent bounding box
* Added CHANGELOG.md line
* Remove scaling font height/width with size of font bounding box
* Refactor LTChar bounding box computation
* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.
* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects
* Add line to CHANGELOG
* Fix getting filename when extracting embedded files
* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs
* Add line to CHANGELOG
Fixes #186
* Treat the permissions (the /P entry) as unsigned long, fix #186
* handle negative values for p
* Extract function for resolving a two's-complement
* Add test for issue #352
* Add line to CHANGELOG.md
* Only ints can be converted to a uint using two's-complement method
* Standardize import style; multiple imports from same module on one line
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
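
Reading a negative /P value as an unsigned integer via two's complement can be sketched as follows. The function name `to_unsigned` and the default width of 32 bits are assumptions for illustration; the helper extracted in this change may be named and parameterised differently.

```python
def to_unsigned(value, bits=32):
    """Interpret a signed integer as unsigned using two's complement.

    Hypothetical helper; only ints can be converted this way.
    """
    if not isinstance(value, int):
        raise TypeError("only ints can be converted, got %r" % type(value))
    if value >= 0:
        return value
    # For a negative number, adding 2**bits gives the unsigned value
    # that shares the same bit pattern.
    return value + (1 << bits)
```

For example, `to_unsigned(-4)` gives `4294967292`, which has the same 32-bit pattern as -4, so the individual permission bits can be tested as usual.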
* Drop support for legacy Python 2
* Add python_requires to help pip
* Upgrade Python syntax with pyupgrade
* Upgrade Python syntax with pyupgrade --py3-plus
* Python 3 imports
* Replace six
* Update CONTRIBUTING.md
* Added line to changelog
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
* Code Refactor: Use code-style enforcement #312
* Add flake8 to travis-ci
* Remove python 2 3 comment on six library. 891 errors > 870 errors.
* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.
* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.
* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing
* Cleanup pdfinterp.py and add documentation from PDF Reference
* Cleanup pdfpage.py
* Cleanup pdffont.py
* Clean psparser.py
* Cleanup high_level.py
* Cleanup layout.py
* Cleanup pdfparser.py
* Cleanup pdfcolor.py
* Cleanup rijndael.py
* Cleanup converter.py
* Rename klass to cls if it is the class variable, to be more consistent with standard practice
* Cleanup cmap.py
* Cleanup pdfdevice.py
* flake8 ignore fontmetrics.py
* Cleanup test_pdfminer_psparser.py
* Fix flake8 in pdfdocument.py; 339 errors to go
* Fix flake8 utils.py; 326 errors to go
* pep8 correction for few files in /tools/ 328 > 160 to go (#342)
* pep8 correction for few files in /tools/ 328 > 160 to go
* pep8 correction: 160 > 5 to go
* Fix ascii85.py errors
* Fix error in getting index from target that does not exist
* Remove commented print lines
* Fix flake8 error in pdfinterp.py
* Fix python2 specific error by removing argument from print statement
* Ignore invalid python2 syntax
* Update contributing.md
* Added changelog
* Remove unused import
Co-authored-by: Fakabbir Amin <f4amin@gmail.com>