Commit Graph

444 Commits (9d7fe2d9ee2e563571b5c941444505525245d880)

Author SHA1 Message Date
Pieter Marsman 9d7fe2d9ee
Catch ValueError when converting font encoding differences to characters (#389)
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed

* Add test for catching ValueError and KeyError when font encoding differences are invalid

* Added line to CHANGELOG.md
2020-03-16 20:12:45 +01:00
Pieter Marsman 1d773dc38a
Fix grouping textlines when bounding box of parent container is wrong (#386)
* Default value for --all-texts should be false, because using the flag enables it

* Fix edge case: when no neighbors are found a line should form its own text box

* Added test for grouping textlines where 1 is outside the parent bounding box

* Added CHANGELOG.md line
2020-03-14 10:33:39 +01:00
Pieter Marsman bab6d154c2 Bump version 20200124 2020-01-24 12:38:11 +01:00
Pieter Marsman bc494ff03c Bump version to 20200121 2020-01-21 21:13:52 +01:00
Pieter Marsman 410d7ecac3
Fix value for font-family in html by removing the subset tag from the PDF font-name (#357)
* Fix font name by removing subset tag

* Added line to CHANGELOG.md

* Add documentation and clear variable name

* Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`
2020-01-16 22:25:20 +01:00
Pieter Marsman fff3ac2ba6
Fix bug in computing character bounding box (#348)
* Remove scaling font height/width with size of font bounding box

* Refactor LTChar bounding box computation

* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.

* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects

* Add line to CHANGELOG
2020-01-16 22:15:50 +01:00
Recursing 0b1741b9bf Pack the /P (ermissions) entry from the /Encrypt dictionionary in the file trailer, as unsigned long (#352)
Fixes #186 

* Tread the permissions (the /P entry) as unsigned long, fix #186

* handle negative values for p

* Extract function for resolving an twos-complement

* Add test for issue #352

* Add line to CHANGELOG.md

* Only ints can be converted to a uint using two's-complement method

* Standardize import style; multiple imports from same module on one line

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-01-07 21:59:13 +01:00
Pieter Marsman b27d3d0aff Bump version 2020-01-04 18:15:15 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refractor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before commiting

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors togo

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exists

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman 803a7d9598 Release 20191110 2019-11-10 12:29:14 +01:00
Pieter Marsman 2bee7d8dcf
Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335)
Fixes #334
2019-11-10 12:18:49 +01:00
Pieter Marsman 5c6fa8f986 Release 20191107 2019-11-07 21:52:44 +01:00
Pieter Marsman bc034c8e59
Create sphinx documentation for Read the Docs (#329)
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for common used functions and classes
Removed: old documentation
2019-11-07 21:12:34 +01:00
Igor Moura 40aa2533c9 Added: simple wrapper to extract text from pdf (#330)
Fixes #327
2019-11-07 07:54:10 +01:00
Martin Hasoň ed1b09c6f2 Fix debug logging for pdf2txt.py and dumppdf.py (#325)
Fixes #313
2019-11-06 21:47:19 +01:00
Pieter Marsman 33b16b3f07
Deprecate the use of _py2_no_more_posargs (#328)
Fixes #324
2019-11-02 10:29:39 +01:00
Jianfeng 44b223cf0a Speedup grouping of textboxes (#315)
Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration
Changed: avoid potentially huge number of variable assignments in list comprehension.
Changed: avoid repeatly evaluating `obj is obj` in list comprehension by storing id(obj).
2019-10-31 09:22:58 +01:00
Pieter Marsman d88d6020a2
Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320)
Fixes #314 
Fixes #105
2019-10-26 19:16:37 +02:00
Pieter Marsman a238a19999
Fix assertionerror when dumping pdf with reference to objid 0 (#318)
Fixes #94 
Added: test to get check if `PDFObjectNotFound` error is raised if objid 0 is requested.
2019-10-25 22:49:58 +02:00
Serj Sintsov cb9cd8ea46 Use named logger instead of root logger (#236) 2019-10-22 20:52:43 +02:00
Pieter Marsman 373c6e7b97
Added: extraction of JBIG2 encoded images (#311)
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46
2019-10-22 17:37:06 +02:00
Pieter Marsman 694aa508c3 Release 20191020 2019-10-20 14:21:48 +02:00
Pieter Marsman adc4726e06
Add warning about dropping python2 support (#307)
Fix #303
2019-10-20 13:59:29 +02:00
Pieter Marsman 9fd7172f7b Cleanup utils.py 2019-10-17 12:14:02 +02:00
jet457 7e40fde320 Removing assertion in drange to allow equal inputs (#246) and mimic behaviour of built-in method range
Fixes #66, since it now allows the bbox to have 0 width or 0 height
Added tests for Plane since it is the API that uses drange
2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev 4df6d4e5ca Changed: comparations for image colorspace literals (#132)
Fixes #131 

Changed: comparations for image colorspace literals
Added: test for extracting images from pdfs
2019-10-15 16:11:54 +02:00
Pieter Marsman 63b2e09ac3
Merge pull request #203 from jbarlow83/negative-descent
Interpret font Descent as a negative number even if specified as positive
2019-10-13 20:06:52 +02:00
Tony Tong 106a09c5bb fix stoke color and non-stroke color in PDFGraphicState 2019-10-12 17:35:46 -04:00
Tata Ganesh f218996fe9
Merge pull request #273 from igormp/develop
Use resolve_all on PdfFont widths and bbox
2019-10-12 21:24:29 +05:30
Fakabbir Amin 7c03d96d25 Corrects Comment 2019-08-20 17:16:10 +05:30
Fakabbir Amin abd685fdc6 Corrects Code Comment 2019-08-20 17:13:27 +05:30
Fakabbir Amin 3d549ea48c Removes code comments 2019-08-20 16:48:40 +05:30
Igor Moura cf4641d877
Merge branch 'develop' into develop 2019-08-15 08:11:28 -03:00
Fakabbir Amin fe38695739
Merge branch 'develop' into pdfstream-as-cmap 2019-08-10 10:44:31 +05:30
Fakabbir Amin 5a0d8db052 Adds decoder for OnebyteIdentityH/V instead of using default CMap 2019-08-10 10:07:23 +05:30
Tata Ganesh 42e2c8143b
Merge pull request #263 from pietermarsman/261-glyph-list-specification
name2unicode() should follow the Adobe Glyph List Specification
2019-07-26 22:13:34 +05:30
Igor Moura 2f4518231f Use resolve_all on PdfFont widths and bbox
Fixes #268
2019-07-24 15:10:13 -03:00
Igor Moura 540df9f676 Replaced .iteritems() and with six.iteritems() for Python 3 compat
This is a squashed commit, the previous messages can be seen bellow

This is the 1st commit message:

Replaced .iteritems() usage for .items()

Fixed some python 2 leftovers, as discussed in #267. Also formatted code according to Black.\nThis possibly breaks some python 2 compatibility

This is the commit message #2:

Reverted formatting and more spread six usage
2019-07-24 14:08:30 -03:00
Fakabbir Amin f1a4dcea88 Adds Test Cases, Neater Code For CMap Assignment 2019-07-24 11:56:06 +05:30
Fakabbir Amin fa400431f5 Adds Test, Removes Unnecessary Assumptions 2019-07-17 11:38:00 +05:30
Pieter Marsman 6f362f53fe Raise a `KeyError` with a useful message if `unicode2name()` does not match any glyph name. Use this message to log debug statements. 2019-07-16 08:52:24 +02:00
Pieter Marsman 0fb83366b6 Remove intermediate variable `full_stop` because it is just a dot 2019-07-16 08:49:57 +02:00
Fakabbir Amin cc40af3d2b Removes @property, Adds docstring 2019-07-15 14:21:21 +05:30
Pieter Marsman c597e95a9f Use KeyError to signal that the name does not resemble any unicode, this pattern is also used in the rest of pdfminer.six 2019-07-14 15:37:15 +02:00
Pieter Marsman 33cc9861ae Add docstring to Type1FontHeaderParser.get_encoding() that describes that the custom CharStrings of the font are mapped to '' 2019-07-14 15:19:17 +02:00
Pieter Marsman f0392f8049 Change implementation of name2unicode such that it follows the Adobe Glyph specs (with allowing lowercase) 2019-07-14 15:16:42 +02:00
Fakabbir Amin 8e4a82ad8b Corrects Indentation 2019-07-13 05:00:25 +05:30
Fakabbir Amin c022358c8d Encapsulates character map name 2019-07-13 04:52:24 +05:30
John Kesegich 8ab2e287be Handle PDFStream as character map name in PDFCIDFont 2019-02-25 11:42:30 -06:00