* Remove unused sortedcontainers package
* Fix changelog format
* Fix a link to the PR
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix paint_path bug noted in issue #473
Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.
Also adds a test for cases presented in issue #473
* Tweak paint_path fix per @pietermarsman review
- Adjusts logic to adhere to if-elif-else rather than early returns.
- Shortens subpath detection/reprocessing step, using re.finditer().
* Reorder paint_path() if-else statements once more
* Fix flake8 issues
* Fix error: should select items 1 and 2 from the list, then possibly items [3, 4], and so on.
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
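The subpath decomposition and type assignment described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual paint_path() code: it assumes a path is a list of tuples such as ("m", x, y), ("l", x, y) and ("h",), and the helper names split_subpaths, is_axis_aligned_rect and classify are hypothetical.

```python
import re


def split_subpaths(path):
    """Split a path into subpaths, each starting at an 'm' operator."""
    shape = "".join(op[0] for op in path)
    # Each match covers one 'm' followed by every operator up to the next 'm'.
    return [path[m.start():m.end()] for m in re.finditer(r"m[^m]+", shape)]


def is_axis_aligned_rect(points):
    """Return True if four corners form a rectangle parallel to the axes."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = points
    return (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
        y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    )


def classify(subpath):
    """Label a subpath as 'rect' or 'curve' (simplified, assumed labels)."""
    shape = "".join(op[0] for op in subpath)
    points = [op[-2:] for op in subpath if op[0] != "h"]
    if shape == "mlllh" and is_axis_aligned_rect(points):
        return "rect"
    # Non-rect quadrilaterals and everything else fall back to a curve.
    return "curve"
```

With this sketch, a path whose shape is "mlllhmlllh" yields two subpaths, and only the one with axis-aligned sides is labelled as a rect.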
* Fix not being able to pass boxes flow as None to pdf2txt
* Changes from code review
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* open_filename accepts a pathlib.PurePath object
* Add test for open_filename with pathlib
* Fix a wrong function name
* Cast a pathlib object to string for py3.4/3.5
* Add link to the PR
* Raise an exception when open_filename gets an unsupported type
* Add tests for open_filename
* Update CHANGELOG.md
* Documentation
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
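The behaviour described in this commit (accept a pathlib object, cast it to a string for older interpreters, and raise on unsupported types) might look roughly like the sketch below. It is an assumption-laden illustration rather than the library's open_filename: the name open_path_or_file, the default mode and the error message are all hypothetical.

```python
import io
import pathlib
from contextlib import contextmanager


@contextmanager
def open_path_or_file(source, mode="rb"):
    """Yield a file object for a str path, a pathlib path, or an open file."""
    if isinstance(source, pathlib.PurePath):
        # Cast to str so interpreters without os.PathLike support can open it.
        source = str(source)
    if isinstance(source, str):
        with open(source, mode) as handle:
            yield handle
    elif isinstance(source, io.IOBase):
        # The caller owns an already-open file, so it is not closed here.
        yield source
    else:
        raise TypeError("Unsupported input type: %s" % type(source))
```

Raising TypeError for anything else makes a wrong argument fail at the call site instead of deeper inside the parser.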
* Updated high_level.py
This commit enables caching to be turned on and off rather than be always on regardless of the user input.
* Reverted params back to fix errors
* Updated CHANGELOG.md to reflect quick fix
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Restore PDFTextExtractionNotAllowed
Restore PDFTextExtractionNotAllowed exception class as an alias of the
new PDFTextExtractionNotAllowedError exception that was introduced in
6a9269b432
Removing PDFTextExtractionNotAllowed is an API breakage that made
several tools break.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* Use PDFTextExtractionNotAllowed and prepare PDFTextExtractionNotAllowedError to be removed in the future
* Add line to CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
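Keeping both exception names working can be done with a plain alias. The sketch below is a simplified assumption: the real classes derive from the library's own exception hierarchy rather than directly from Exception, and check_extractable is a hypothetical helper used only for illustration.

```python
class PDFTextExtractionNotAllowed(Exception):
    """Raised when a document forbids text extraction (simplified base)."""


# Alias: both names refer to the same class object, so `except` clauses
# written against either name catch the same exception.
PDFTextExtractionNotAllowedError = PDFTextExtractionNotAllowed


def check_extractable(is_extractable):
    """Hypothetical helper that raises when extraction is not allowed."""
    if not is_extractable:
        raise PDFTextExtractionNotAllowed("Text extraction is not allowed")
```

Because the alias is the same object, downstream tools that import either name keep working until one of them is eventually removed.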
* Try to get the cmap from a pickle file, and clean up a bit.
* Don't use keyword argument for dict.get
* Add docs
* Make _get_cmap_name static
* Add test
* Add CHANGELOG.md
* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there
* Add CJK characters to expected output of simple3.pdf
* Fix line length
* Add comment
* swap pycryptodome for the faster, smaller, and industry-standard cryptography package (cryptography.io)
* update changelog
* fix lint
* Update CHANGELOG.md
* from MR, unneeded ex and naming
* add samples to nosetests
* fix lint
* show mismatch
* fix lint
* typo and newline
* Revert "add samples to nosetests"
This reverts commit a49ca302
* Add tests for encrypted documents to nose test suite
* Optimize imports of pdfdocument.py
Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix converting path to multiple rectangles
For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.
Fixes #369
* Reduce pdf size by removing font
* Add unittest for PDFLayoutAnalyzer.paint_path()
* Add line to CHANGELOG.md
* Add reference to pdf reference manual
* Cleanup function paint_path a bit
* Reduce line length of tests
* Reduce line length of tests
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
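The fixed-size grouping described in this commit can be illustrated as below. This is only a sketch under assumptions: the path is taken to be a list of operator tuples whose first element is the operator letter, and split_rectangles is a hypothetical name rather than the function used in the fix.

```python
def split_rectangles(path):
    """Split a path shaped 'mlllhmlllh...' into groups of five operators."""
    shape = "".join(op[0] for op in path)
    count, remainder = divmod(len(shape), 5)
    if count > 1 and remainder == 0 and shape == "mlllh" * count:
        # Each slice is one 'mlllh' group that can be painted on its own.
        return [path[i:i + 5] for i in range(0, len(path), 5)]
    # Any other shape is left as a single path.
    return [path]
```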
* Changed error to warning for 'Text extraction is not allowed'
* updated changelog
* fix lint
* made changes suggested in review
* Update CHANGELOG.md
* Add regression test for failing pdf
* Reduce line length to <80
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #176
* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's
* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref
* Use warning instead of error, because not outputting xrefs is fine (there aren't any), but it is something the user should know
* Adding changelog
* Extend help message
* add shebang line to script in tools
* fix: use shebang line with python 3
* Moved changelog to unreleased
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Update documentation for boxes_flow, allow None
* Apply comments from code review
* Small wording changes, remove unnecessary comment
* Update boxes_flow documentation for pdf2text
* Pin version of tox to ensure python 3.4 support
* Updated misleading documentation about word_margin
* Small change in sentence about word_margin
* Remove confusing sentence about adding spaces
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Group text lines if they are centered (#382)
Closes #382
* Add comparison private methods to LTTextLines
* Add missing docstrings
* Add tests for find_neighbors
* Update changelog
* Cosmetic changes from code review
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed
* Add test for catching ValueError and KeyError when font encoding differences are invalid
* Added line to CHANGELOG.md
* Default value for --all-texts should be false, because using the flag enables it
* Fix edge case: when no neighbors are found a line should form its own text box
* Added test for grouping textlines where 1 is outside the parent bounding box
* Added CHANGELOG.md line
* Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, which has little to do with PDFs.
* Added line to CHANGELOG.md
* Fix font name by removing subset tag
* Added line to CHANGELOG.md
* Add documentation and clear variable name
* Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`
* Remove scaling font height/width with size of font bounding box
* Refactor LTChar bounding box computation
* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.
* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects
* Add line to CHANGELOG
* Fix getting filename when extracting embedded files
* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs
* Add line to CHANGELOG
Fixes #186
* Treat the permissions (the /P entry) as unsigned long, fix #186
* handle negative values for p
* Extract function for resolving a two's complement
* Add test for issue #352
* Add line to CHANGELOG.md
* Only ints can be converted to a uint using two's-complement method
* Standardize import style; multiple imports from same module on one line
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
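Reading a negative /P value as an unsigned 32-bit integer amounts to a two's-complement conversion. A minimal sketch follows, assuming a 32-bit width; the function name resolve_unsigned32 is hypothetical and not necessarily the helper extracted in this commit.

```python
def resolve_unsigned32(value):
    """Interpret a possibly negative int as an unsigned 32-bit value."""
    if not isinstance(value, int):
        # Only ints can be converted with the two's-complement method.
        raise TypeError("Expected int, got %s" % type(value))
    if value < 0:
        # Adding 2**32 maps -1 to 0xFFFFFFFF, -2 to 0xFFFFFFFE, and so on.
        return value + (1 << 32)
    return value
```

After conversion, individual permission bits can be tested with ordinary bit masks, for example `resolve_unsigned32(p) & 4`.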
* Drop support for legacy Python 2
* Add python_requires to help pip
* Upgrade Python syntax with pyupgrade
* Upgrade Python syntax with pyupgrade --py3-plus
* Python 3 imports
* Replace six
* Update CONTRIBUTING.md
* Added line to changelog
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
* Code Refactor: Use code-style enforcement #312
* Add flake8 to travis-ci
* Remove python 2 3 comment on six library. 891 errors > 870 errors.
* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.
* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.
* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing
* Cleanup pdfinterp.py and add documentation from PDF Reference
* Cleanup pdfpage.py
* Cleanup pdffont.py
* Clean psparser.py
* Cleanup high_level.py
* Cleanup layout.py
* Cleanup pdfparser.py
* Cleanup pdfcolor.py
* Cleanup rijndael.py
* Cleanup converter.py
* Rename klass to cls if it is the class variable, to be more consistent with standard practice
* Cleanup cmap.py
* Cleanup pdfdevice.py
* flake8 ignore fontmetrics.py
* Cleanup test_pdfminer_psparser.py
* Fix flake8 in pdfdocument.py; 339 errors to go
* Fix flake8 utils.py; 326 errors to go
* pep8 correction for few files in /tools/ 328 > 160 to go (#342)
* pep8 correction for few files in /tools/ 328 > 160 to go
* pep8 correction: 160 > 5 to go
* Fix ascii85.py errors
* Fix error in getting index from target that does not exist
* Remove commented print lines
* Fix flake8 error in pdfinterp.py
* Fix python2 specific error by removing argument from print statement
* Ignore invalid python2 syntax
* Update contributing.md
* Added changelog
* Remove unused import
Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration
Changed: avoid potentially huge number of variable assignments in list comprehension.
Changed: avoid repeatedly evaluating `obj is obj` in list comprehension by storing id(obj).
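The heap-based approach mentioned above can be sketched as follows. This is an illustrative simplification, not the layout code itself: Box is a hypothetical one-dimensional stand-in for layout objects, and dist, group_boxes and max_gap are assumed names. The heap is built once and stale entries are skipped on pop instead of rebuilding the heap in each iteration; id(obj) is stored in each entry so that membership tests and tie-breaking never compare the objects themselves.

```python
import heapq


class Box:
    """Hypothetical one-dimensional box spanning x0..x1."""

    def __init__(self, x0, x1):
        self.x0, self.x1 = x0, x1

    def merge(self, other):
        return Box(min(self.x0, other.x0), max(self.x1, other.x1))


def dist(a, b):
    """Width of the combined span that neither box covers."""
    span = max(a.x1, b.x1) - min(a.x0, b.x0)
    return span - (a.x1 - a.x0) - (b.x1 - b.x0)


def group_boxes(boxes, max_gap):
    """Repeatedly merge the closest pair until every gap exceeds max_gap."""
    heap = [
        (dist(a, b), id(a), id(b), a, b)
        for i, a in enumerate(boxes)
        for b in boxes[i + 1:]
    ]
    heapq.heapify(heap)
    alive = {id(box): box for box in boxes}
    while heap:
        d, id_a, id_b, a, b = heapq.heappop(heap)
        if d > max_gap:
            break
        if id_a not in alive or id_b not in alive:
            continue  # stale entry: one of the boxes was already merged
        del alive[id_a], alive[id_b]
        merged = a.merge(b)
        for other in alive.values():
            entry = (dist(merged, other), id(merged), id(other), merged, other)
            heapq.heappush(heap, entry)
        alive[id(merged)] = merged
    return list(alive.values())
```

Stale heap entries still hold references to merged-away boxes, so their ids cannot be reused by new objects while those entries remain in the heap.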