Fixes #176
* Add failing test for dumping simple1.pdf and simple3.pdf, because dumppdf.py should raise an error when it tries to dump a PDF without xrefs
* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref
* Use warning instead of error, because not outputting xrefs is fine (there aren't any), but the user should know about it
* Adding changelog
* Extend help message
* Make structure of documentation more clear: tutorials, how-to, topics and reference
* Add howto for images
* Restructure tutorials section, and add install section
* Always use up-to-date version
* Fix indentation warning in docstring
* Add option to dumppdf.py and pdf2txt.py to show version
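A minimal sketch of how a version flag like the one described above can be wired up with `argparse`. The parser name, the flag spelling and the `VERSION` string are assumptions for illustration, not the actual code in dumppdf.py or pdf2txt.py.

```python
import argparse

# Hypothetical version string; a real tool would read it from its package.
VERSION = "0.0.0"


def build_parser():
    parser = argparse.ArgumentParser(
        prog="example-tool", description="Example command-line tool")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument(
        "--version", "-v", action="version",
        version="%(prog)s {}".format(VERSION))
    return parser


if __name__ == "__main__":
    build_parser().parse_args()
```

Running the script with `--version` prints the program name and version, then exits without processing any other arguments.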
Fixes #162
* add shebang line to script in tools
* fix: use shebang line with python 3
* Moved changelog to unreleased
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Update documentation for boxes_flow, allow None
* Apply comments from code review
* Small wording changes, remove unnecessary comment
* Update boxes_flow documentation for pdf2text
* Pin version of tox to ensure python 3.4 support
* Updated misleading documentation about word_margin
* Small change in sentence about word_margin
* Remove confusing sentence about adding spaces
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Group text lines if they are centered (#382)
Closes #382
* Add comparison private methods to LTTextLines
* Add missing docstrings
* Add tests for find_neighbors
* Update changelog
* Cosmetic changes from code review
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed
* Add test for catching ValueError and KeyError when font encoding differences are invalid
* Added line to CHANGELOG.md
* Default value for --all-texts should be false, because using the flag enables it
* Fix edge case: when no neighbors are found a line should form its own text box
* Added test for grouping textlines where 1 is outside the parent bounding box
* Added CHANGELOG.md line
* Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, which has little to do with PDFs.
* Added line to CHANGELOG.md
* Fix font name by removing subset tag
* Added line to CHANGELOG.md
* Add documentation and clear variable name
* Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`
* Remove scaling font height/width with size of font bounding box
* Refactor LTChar bounding box computation
* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.
* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects
* Add line to CHANGELOG
* Fix getting filename when extracting embedded files
* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs
* Add line to CHANGELOG
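As an illustration of the subset-tag fix listed above: subset fonts in a PDF carry a name prefixed with six uppercase letters and a plus sign (for example `ABCDEF+Helvetica`). The sketch below strips that prefix with a regular expression; the function name `strip_subset_tag` is hypothetical and this is not necessarily how the change was implemented.

```python
import re

# A subset tag is six uppercase letters followed by "+",
# e.g. "ABCDEF+Helvetica".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname):
    """Return the font name without a leading subset tag, if present."""
    return SUBSET_TAG.sub("", fontname)


print(strip_subset_tag("ABCDEF+Helvetica"))
print(strip_subset_tag("Helvetica"))
```

Names without a well-formed tag are returned unchanged, so the helper is safe to apply to every font name.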
Fixes #186
* Treat the permissions (the /P entry) as unsigned long, fix #186
* handle negative values for p
* Extract function for resolving a two's complement
* Add test for issue #352
* Add line to CHANGELOG.md
* Only ints can be converted to a uint using two's-complement method
* Standardize import style; multiple imports from same module on one line
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
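The /P handling above comes down to reinterpreting a negative integer as its unsigned two's-complement value. A minimal sketch, assuming a 32-bit width; the function name `to_unsigned` and the `bits` parameter are illustrative, not the names used in the code base.

```python
def to_unsigned(value, bits=32):
    """Reinterpret a signed integer as unsigned using two's complement."""
    if not isinstance(value, int):
        # Only ints can be converted with the two's-complement method.
        raise TypeError("expected int, got {!r}".format(type(value)))
    if value < 0:
        # Adding 2**bits maps the negative range onto the upper unsigned range.
        value += 1 << bits
    return value


print(to_unsigned(-4))   # 4294967292
print(to_unsigned(12))   # 12
```

Non-negative values pass through untouched, so the function can be applied to every /P entry regardless of how the producer wrote it.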
* Drop support for legacy Python 2
* Add python_requires to help pip
* Upgrade Python syntax with pyupgrade
* Upgrade Python syntax with pyupgrade --py3-plus
* Python 3 imports
* Replace six
* Update CONTRIBUTING.md
* Added line to changelog
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
* Code Refactor: Use code-style enforcement #312
* Add flake8 to travis-ci
* Remove python 2 3 comment on six library. 891 errors > 870 errors.
* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.
* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.
* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing
* Cleanup pdfinterp.py and add documentation from PDF Reference
* Cleanup pdfpage.py
* Cleanup pdffont.py
* Clean psparser.py
* Cleanup high_level.py
* Cleanup layout.py
* Cleanup pdfparser.py
* Cleanup pdfcolor.py
* Cleanup rijndael.py
* Cleanup converter.py
* Rename klass to cls if it is the class variable, to be more consistent with standard practice
* Cleanup cmap.py
* Cleanup pdfdevice.py
* flake8 ignore fontmetrics.py
* Cleanup test_pdfminer_psparser.py
* Fix flake8 in pdfdocument.py; 339 errors to go
* Fix flake8 utils.py; 326 errors to go
* pep8 correction for few files in /tools/ 328 > 160 to go (#342)
* pep8 correction: 160 > 5 to go
* Fix ascii85.py errors
* Fix error in getting index from target that does not exist
* Remove commented print lines
* Fix flake8 error in pdfinterp.py
* Fix python2 specific error by removing argument from print statement
* Ignore invalid python2 syntax
* Update contributing.md
* Added changelog
* Remove unused import
Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration
Changed: avoid potentially huge number of variable assignments in list comprehension.
Changed: avoid repeatedly evaluating `obj is obj` in list comprehension by storing id(obj).
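To illustrate the heap-based approach mentioned above, here is a toy sketch that greedily merges the two closest numbers using `heapq`. It is not the library's grouping code: the function `merge_closest`, the use of plain numbers and the averaging merge rule are all assumptions. The point is that the heap is built once, new candidate pairs are pushed as they appear, and stale entries are skipped on pop instead of rebuilding the heap each iteration.

```python
import heapq
from itertools import combinations


def merge_closest(values):
    """Greedily merge the two closest numbers; return the merge order."""
    items = dict(enumerate(values))  # live items, keyed by integer id
    next_id = len(items)
    # Build the heap once from all pairs: (distance, id_a, id_b).
    heap = [(abs(a - b), i, j)
            for (i, a), (j, b) in combinations(items.items(), 2)]
    heapq.heapify(heap)
    merges = []
    while heap:
        _, i, j = heapq.heappop(heap)
        if i not in items or j not in items:
            continue  # stale entry: one side was already merged
        a, b = items.pop(i), items.pop(j)
        merged = (a + b) / 2
        merges.append((a, b))
        # Push only the pairs that involve the new item.
        for k, v in items.items():
            heapq.heappush(heap, (abs(merged - v), k, next_id))
        items[next_id] = merged
        next_id += 1
    return merges


print(merge_closest([0, 1, 10]))
```

Storing integer ids in the heap entries also means membership checks compare ids rather than the objects themselves.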