Commit Graph

58 Commits (ec223d1f1d2483069de0e4b80e7fcfb9d740c6ed)

Author SHA1 Message Date
Jeremy Singer-Vine e83dd26671
Fix .paint_path for non-rectangle quadrilaterals (#512)
* Fix paint_path bug noted in issue #473

Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue #473

* Tweak paint_path fix per @pietermarsman review

- Adjusts logic to adhere to if-elif-else rather than early returns.

- Shortens subpath detection/reprocessing step, using re.finditer().

* Reorder paint_path() if-else statements once more

* Fix flake8 issues

* Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on.

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-12 17:53:00 +02:00
estshorter f03657e5c4
Allow a pathlib.PurePath object as a input to open_filename (#492)
* open_filename accepts a pathlib.PurePath object

* Add test for open_filename with pathlib

* Fix a wrong function name

* Cast a pathlib object to string for py3.4/3.5

* Add link to the PR

* Raise an exception when open_filename gets an unsupported type

* Add tests for open_filename

* Update CHANGELOG.md

* Documentation

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-17 21:29:00 +02:00
Igor Moura a83f853de7
Remove unused rijndael encryption implementation (#465)
* Remove unused rijndael encryption

* Add current PR link to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-10 19:28:00 +02:00
Pieter Marsman 4f65242750
Always try to get CMap, even if name is not recognized (#438)
* Add trying to get cmap from pickle file. And cleaning up a bit.

* Don't use keyword argument for dict.get

* Add docs

* Make _get_cmap_name static

* Add test

* Add CHANGELOG.md

* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there

* Add CJK characters to expected output of simple3.pdf

* Fix line length

* Add comment
2020-07-23 20:27:38 +02:00
lithiumFlower c10cf3cdb8
Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456)
* swap pycryptodome to the faster, smaller, and industry standard crytography io

* update changelog

* fixlint

* Update CHANGELOG.md

* from MR, unneeded ex and naming

* add samples to nosetests

* fix lint

* show mismatch

* fix lint

* typo and newline

* Revert "add samples to nosetests"

This reverts commit a49ca302

* Add tests for encrypted documents to nose test suite

* Optimize imports of pdfdocument.py

Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung 60863cfd55
Fix converting path to multiple rectangles (#371)
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 17:34:38 +02:00
madhurcodes 6a9269b432
Change Text extraction is not allowed error to warning (#453)
* Changed error to warning for 'Text extraction is not allowed'

* updated changelog

* fix lint

* made changes suggested in review

* Update CHANGELOG.md

* Add regression test for failing pdf

* Reduce line length to <80

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 16:04:11 +02:00
Pieter Marsman 6e05baf0b7
Dont dump fallback xref by default when using dumppdf.py, adding a flag to enable it
Fixes #176 

* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's

* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref

* Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know

* Adding changelog

* Extend help message
2020-05-23 18:04:34 +02:00
Jake Stockwin 7254530d27
Fix ordering of textlines within a textbox when boxes_flow is disabled (#412)
* Fix ordering of textlines within a textbox when boxes_flow is disabled

* Add new test PDF sample
2020-05-09 15:37:49 +02:00
fabbox 7eff108fa5
add shebang line to script in tools (#408)
* add shebang line to script in tools

* fix: use shebang line with python 3

* Moved changelog to unreleased

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-04-28 10:58:42 +02:00
Jake Stockwin 68e2ae8632
Fix text coming in reverse order with boxes flow disabled (#399)
Closes #398
2020-04-01 13:37:04 +02:00
Jake Stockwin 1a4a06da9f
Fix #392 Split out IO logic from high level functions (#393)
* Allow file-like inputs to high level functions (#392)

* PR Review - move open_filename to utils
2020-03-26 22:52:00 +01:00
Jake Stockwin 1cc1b961c5
Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382) (#384)
* Group text lines if they are centered (#382)

Closes #382

* Add comparison private methods to LTTextLines

* Add missing docstrings

* Add tests for find_neighbors

* Update changelog

* Cosmetic changes from code review
2020-03-23 22:38:39 +01:00
Pieter Marsman 9d7fe2d9ee
Catch ValueError when converting font encoding differences to characters (#389)
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed

* Add test for catching ValueError and KeyError when font encoding differences are invalid

* Added line to CHANGELOG.md
2020-03-16 20:12:45 +01:00
Pieter Marsman 1d773dc38a
Fix grouping textlines when bounding box of parent container is wrong (#386)
* Default value for --all-texts should be false, because using the flag enables it

* Fix edge case: when no neighbors are found a line should form its own text box

* Added test for grouping textlines where 1 is outside the parent bounding box

* Added CHANGELOG.md line
2020-03-14 10:33:39 +01:00
Pieter Marsman 1c3047b68b
Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364)
Fixes #363 

* Remove samples/ and docs/ from source distribution. The samples/ dictionairy contains pdf's for testing purposes and the docs/ contain readthedocs documentation and is published online.

* Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit.

See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen
And https://github.com/pdfminer/pdfminer.six/issues/363

* Added line to CHANGELOG.md

* Remove unused imports
2020-01-24 12:36:02 +01:00
Pieter Marsman fff3ac2ba6
Fix bug in computing character bounding box (#348)
* Remove scaling font height/width with size of font bounding box

* Refactor LTChar bounding box computation

* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.

* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects

* Add line to CHANGELOG
2020-01-16 22:15:50 +01:00
Pieter Marsman 2f7f5d2667
Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338)
* Fix getting filename when extracting embedded files

* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs

* Add line to CHANGELOG
2020-01-16 22:11:42 +01:00
Recursing 0b1741b9bf Pack the /P (ermissions) entry from the /Encrypt dictionionary in the file trailer, as unsigned long (#352)
Fixes #186 

* Tread the permissions (the /P entry) as unsigned long, fix #186

* handle negative values for p

* Extract function for resolving an twos-complement

* Add test for issue #352

* Add line to CHANGELOG.md

* Only ints can be converted to a uint using two's-complement method

* Standardize import style; multiple imports from same module on one line

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-01-07 21:59:13 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refractor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before commiting

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors togo

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exists

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman 2bee7d8dcf
Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335)
Fixes #334
2019-11-10 12:18:49 +01:00
Igor Moura 40aa2533c9 Added: simple wrapper to extract text from pdf (#330)
Fixes #327
2019-11-07 07:54:10 +01:00
Pieter Marsman 6cc78ee124
Replace opts by argparse in dumppdf.py (#321)
Also add multi-character argument names
Fixes #175
2019-10-27 21:40:04 +01:00
Pieter Marsman 1c4a4167ed
Fix failing test on develop & cleaning up test files (#319) 2019-10-26 18:42:33 +02:00
Pieter Marsman a238a19999
Fix assertionerror when dumping pdf with reference to objid 0 (#318)
Fixes #94 
Added: test to get check if `PDFObjectNotFound` error is raised if objid 0 is requested.
2019-10-25 22:49:58 +02:00
Serj Sintsov cb9cd8ea46 Use named logger instead of root logger (#236) 2019-10-22 20:52:43 +02:00
jbarlow83 733ddf7e57 Added: tests for extracting tests from pdfs with Type3 fonts (#205) 2019-10-22 18:15:59 +02:00
Pieter Marsman 373c6e7b97
Added: extraction of JBIG2 encoded images (#311)
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46
2019-10-22 17:37:06 +02:00
jet457 7e40fde320 Removing assertion in drange to allow equal inputs (#246) and mimic behaviour of built-in method range
Fixes #66, since it now allows the bbox to have 0 width or 0 height
Added tests for Plane since it is the API that uses drange
2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev 4df6d4e5ca Changed: comparations for image colorspace literals (#132)
Fixes #131 

Changed: comparations for image colorspace literals
Added: test for extracting images from pdfs
2019-10-15 16:11:54 +02:00
Fakabbir Amin 3f0f05def6 Merge branch 'pdfstream-as-cmap' of https://github.com/fakabbir/pdfminer.six into pdfstream-as-cmap 2019-08-10 11:04:10 +05:30
Fakabbir Amin 3125d3634a Correct old test cases 2019-08-10 11:03:28 +05:30
Fakabbir Amin fe38695739
Merge branch 'develop' into pdfstream-as-cmap 2019-08-10 10:44:31 +05:30
Fakabbir Amin 5b210981c9 Adds Test Case 2019-08-10 10:19:20 +05:30
Fakabbir Amin f1a4dcea88 Adds Test Cases, Neater Code For CMap Assignment 2019-07-24 11:56:06 +05:30
Fakabbir Amin b4c261b647 Removes Code Comments 2019-07-17 11:43:45 +05:30
Fakabbir Amin fa400431f5 Adds Test, Removes Unnecessary Assumptions 2019-07-17 11:38:00 +05:30
Pieter Marsman 2bb850cdae Fix error, python2 cannot handle unicode in a .py file 2019-07-14 15:43:07 +02:00
Pieter Marsman 1e24bfa0bd Fix error, python2 cannot handle unicode in a .py file 2019-07-14 15:40:22 +02:00
Pieter Marsman c597e95a9f Use KeyError to signal that the name does not resemble any unicode, this pattern is also used in the rest of pdfminer.six 2019-07-14 15:37:15 +02:00
Pieter Marsman fdb7e54862 Add lowercase adobe glyph name tests 2019-07-14 15:20:25 +02:00
Pieter Marsman 5d7ac7e88a Added test for overflow error reported by @jtlz2: https://github.com/pdfminer/pdfminer.six/issues/177#issuecomment-510173228_ 2019-07-10 20:44:23 +02:00
Pieter Marsman ec5218a05f Add some (failing) unittests for name2unicode based on the examples in the Adobe Glyph List Specification 2019-07-10 20:35:42 +02:00
Sebastian Schuberth ec8530f6cf Add a test for the previous fix 2017-10-16 12:35:16 +02:00
Philippe Guglielmetti b010db6049 solves https://github.com/pdfminer/pdfminer.six/issues/65 2017-07-20 21:17:06 +02:00
Philippe Guglielmetti 82af7f0aac issue #56 reproduced, solution attempt unsucessful 2017-04-19 14:19:14 +02:00
Philippe Guglielmetti 5ef8333c5f new test fails on Linux & TRavis-CI. TODO: find why 2017-04-18 18:28:48 +02:00
Philippe Guglielmetti 7055862eaf solves https://github.com/pdfminer/pdfminer.six/issues/50 2017-04-18 18:20:31 +02:00
Philippe Guglielmetti fd63dbf62e no more skipped tests 2017-01-20 10:12:00 +01:00