Commit Graph

61 Commits (016239c146e4479dc14da811f3a934dcf0d9fe5f)

Author SHA1 Message Date
Jeremy Singer-Vine 016239c146
Fix .paint_path handling of single line segments (#530)
* Fix .paint_path handling of single line segments

- Fixes typo ("ml" should have been "mlh")

- Removes if-statement that required individual line segments to be
  strictly horizontal or vertical.

* Treat 'ml'-shape paths as lines not curves

Althoguh 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.

Adds tests and sample PDF.

* Fix trailing whitespace

* Fix point-extraction from Beziér path commands

This commit corrects the manner in which "pts" are extracted from Beziér
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on whether the command (c,
v, or y) the code was extracting some combination of control points (not
on curve) and the actual points-on-curve.

This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.

* Add comments to test_paint_path_quadrilaterals

* Parse rect-forming mllll paths as rects not curves

Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.

* One changelog line with ref to mr

* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring

* Extract variables from if statement to make it easier to read

* Optimize imports order

* Trigger travis build

* Revert "Trigger travis build"

This reverts commit 41c05184

* Update travis badge

* Update travis badge

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-07-27 18:27:32 +02:00
Ev2geny 693e4f48a3
Issue #469 is fixed (When run on Windows a lot of tests fail with the error: [Errno 13] Permission denied) (#484)
Closes #469

* Issue #469 is fixed

* one extra comment to code is added

* TemporaryFilePath context manager is added to facilitate tests

* flake8 complaints fixed

* Update docs of tempfilepath.py

* Fix flake8

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-26 10:10:11 +01:00
Pieter Marsman f8e6ad6ac1
Remove supoprt for non standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream (#523)
Closes #191 

* Remove supoprt for non standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream

* Add docstring

* Fix flake8
2020-10-25 14:37:12 +01:00
Jeremy Singer-Vine e83dd26671
Fix .paint_path for non-rectangle quadrilaterals (#512)
* Fix paint_path bug noted in issue #473

Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue #473

* Tweak paint_path fix per @pietermarsman review

- Adjusts logic to adhere to if-elif-else rather than early returns.

- Shortens subpath detection/reprocessing step, using re.finditer().

* Reorder paint_path() if-else statements once more

* Fix flake8 issues

* Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on.

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-12 17:53:00 +02:00
estshorter f03657e5c4
Allow a pathlib.PurePath object as a input to open_filename (#492)
* open_filename accepts a pathlib.PurePath object

* Add test for open_filename with pathlib

* Fix a wrong function name

* Cast a pathlib object to string for py3.4/3.5

* Add link to the PR

* Raise an exception when open_filename gets an unsupported type

* Add tests for open_filename

* Update CHANGELOG.md

* Documentation

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-17 21:29:00 +02:00
Igor Moura a83f853de7
Remove unused rijndael encryption implementation (#465)
* Remove unused rijndael encryption

* Add current PR link to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-10 19:28:00 +02:00
Pieter Marsman 4f65242750
Always try to get CMap, even if name is not recognized (#438)
* Add trying to get cmap from pickle file. And cleaning up a bit.

* Don't use keyword argument for dict.get

* Add docs

* Make _get_cmap_name static

* Add test

* Add CHANGELOG.md

* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there

* Add CJK characters to expected output of simple3.pdf

* Fix line length

* Add comment
2020-07-23 20:27:38 +02:00
lithiumFlower c10cf3cdb8
Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456)
* swap pycryptodome to the faster, smaller, and industry standard crytography io

* update changelog

* fixlint

* Update CHANGELOG.md

* from MR, unneeded ex and naming

* add samples to nosetests

* fix lint

* show mismatch

* fix lint

* typo and newline

* Revert "add samples to nosetests"

This reverts commit a49ca302

* Add tests for encrypted documents to nose test suite

* Optimize imports of pdfdocument.py

Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung 60863cfd55
Fix converting path to multiple rectangles (#371)
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 17:34:38 +02:00
madhurcodes 6a9269b432
Change Text extraction is not allowed error to warning (#453)
* Changed error to warning for 'Text extraction is not allowed'

* updated changelog

* fix lint

* made changes suggested in review

* Update CHANGELOG.md

* Add regression test for failing pdf

* Reduce line length to <80

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 16:04:11 +02:00
Pieter Marsman 6e05baf0b7
Dont dump fallback xref by default when using dumppdf.py, adding a flag to enable it
Fixes #176 

* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's

* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref

* Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know

* Adding changelog

* Extend help message
2020-05-23 18:04:34 +02:00
Jake Stockwin 7254530d27
Fix ordering of textlines within a textbox when boxes_flow is disabled (#412)
* Fix ordering of textlines within a textbox when boxes_flow is disabled

* Add new test PDF sample
2020-05-09 15:37:49 +02:00
fabbox 7eff108fa5
add shebang line to script in tools (#408)
* add shebang line to script in tools

* fix: use shebang line with python 3

* Moved changelog to unreleased

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-04-28 10:58:42 +02:00
Jake Stockwin 68e2ae8632
Fix text coming in reverse order with boxes flow disabled (#399)
Closes #398
2020-04-01 13:37:04 +02:00
Jake Stockwin 1a4a06da9f
Fix #392 Split out IO logic from high level functions (#393)
* Allow file-like inputs to high level functions (#392)

* PR Review - move open_filename to utils
2020-03-26 22:52:00 +01:00
Jake Stockwin 1cc1b961c5
Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382) (#384)
* Group text lines if they are centered (#382)

Closes #382

* Add comparison private methods to LTTextLines

* Add missing docstrings

* Add tests for find_neighbors

* Update changelog

* Cosmetic changes from code review
2020-03-23 22:38:39 +01:00
Pieter Marsman 9d7fe2d9ee
Catch ValueError when converting font encoding differences to characters (#389)
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed

* Add test for catching ValueError and KeyError when font encoding differences are invalid

* Added line to CHANGELOG.md
2020-03-16 20:12:45 +01:00
Pieter Marsman 1d773dc38a
Fix grouping textlines when bounding box of parent container is wrong (#386)
* Default value for --all-texts should be false, because using the flag enables it

* Fix edge case: when no neighbors are found a line should form its own text box

* Added test for grouping textlines where 1 is outside the parent bounding box

* Added CHANGELOG.md line
2020-03-14 10:33:39 +01:00
Pieter Marsman 1c3047b68b
Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364)
Fixes #363 

* Remove samples/ and docs/ from source distribution. The samples/ dictionairy contains pdf's for testing purposes and the docs/ contain readthedocs documentation and is published online.

* Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit.

See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen
And https://github.com/pdfminer/pdfminer.six/issues/363

* Added line to CHANGELOG.md

* Remove unused imports
2020-01-24 12:36:02 +01:00
Pieter Marsman fff3ac2ba6
Fix bug in computing character bounding box (#348)
* Remove scaling font height/width with size of font bounding box

* Refactor LTChar bounding box computation

* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.

* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects

* Add line to CHANGELOG
2020-01-16 22:15:50 +01:00
Pieter Marsman 2f7f5d2667
Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338)
* Fix getting filename when extracting embedded files

* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs

* Add line to CHANGELOG
2020-01-16 22:11:42 +01:00
Recursing 0b1741b9bf Pack the /P (ermissions) entry from the /Encrypt dictionionary in the file trailer, as unsigned long (#352)
Fixes #186 

* Tread the permissions (the /P entry) as unsigned long, fix #186

* handle negative values for p

* Extract function for resolving an twos-complement

* Add test for issue #352

* Add line to CHANGELOG.md

* Only ints can be converted to a uint using two's-complement method

* Standardize import style; multiple imports from same module on one line

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-01-07 21:59:13 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refractor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before commiting

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors togo

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exists

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman 2bee7d8dcf
Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335)
Fixes #334
2019-11-10 12:18:49 +01:00
Igor Moura 40aa2533c9 Added: simple wrapper to extract text from pdf (#330)
Fixes #327
2019-11-07 07:54:10 +01:00
Pieter Marsman 6cc78ee124
Replace opts by argparse in dumppdf.py (#321)
Also add multi-character argument names
Fixes #175
2019-10-27 21:40:04 +01:00
Pieter Marsman 1c4a4167ed
Fix failing test on develop & cleaning up test files (#319) 2019-10-26 18:42:33 +02:00
Pieter Marsman a238a19999
Fix assertionerror when dumping pdf with reference to objid 0 (#318)
Fixes #94 
Added: test to get check if `PDFObjectNotFound` error is raised if objid 0 is requested.
2019-10-25 22:49:58 +02:00
Serj Sintsov cb9cd8ea46 Use named logger instead of root logger (#236) 2019-10-22 20:52:43 +02:00
jbarlow83 733ddf7e57 Added: tests for extracting tests from pdfs with Type3 fonts (#205) 2019-10-22 18:15:59 +02:00
Pieter Marsman 373c6e7b97
Added: extraction of JBIG2 encoded images (#311)
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46
2019-10-22 17:37:06 +02:00
jet457 7e40fde320 Removing assertion in drange to allow equal inputs (#246) and mimic behaviour of built-in method range
Fixes #66, since it now allows the bbox to have 0 width or 0 height
Added tests for Plane since it is the API that uses drange
2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev 4df6d4e5ca Changed: comparations for image colorspace literals (#132)
Fixes #131 

Changed: comparations for image colorspace literals
Added: test for extracting images from pdfs
2019-10-15 16:11:54 +02:00
Fakabbir Amin 3f0f05def6 Merge branch 'pdfstream-as-cmap' of https://github.com/fakabbir/pdfminer.six into pdfstream-as-cmap 2019-08-10 11:04:10 +05:30
Fakabbir Amin 3125d3634a Correct old test cases 2019-08-10 11:03:28 +05:30
Fakabbir Amin fe38695739
Merge branch 'develop' into pdfstream-as-cmap 2019-08-10 10:44:31 +05:30
Fakabbir Amin 5b210981c9 Adds Test Case 2019-08-10 10:19:20 +05:30
Fakabbir Amin f1a4dcea88 Adds Test Cases, Neater Code For CMap Assignment 2019-07-24 11:56:06 +05:30
Fakabbir Amin b4c261b647 Removes Code Comments 2019-07-17 11:43:45 +05:30
Fakabbir Amin fa400431f5 Adds Test, Removes Unnecessary Assumptions 2019-07-17 11:38:00 +05:30
Pieter Marsman 2bb850cdae Fix error, python2 cannot handle unicode in a .py file 2019-07-14 15:43:07 +02:00
Pieter Marsman 1e24bfa0bd Fix error, python2 cannot handle unicode in a .py file 2019-07-14 15:40:22 +02:00
Pieter Marsman c597e95a9f Use KeyError to signal that the name does not resemble any unicode, this pattern is also used in the rest of pdfminer.six 2019-07-14 15:37:15 +02:00
Pieter Marsman fdb7e54862 Add lowercase adobe glyph name tests 2019-07-14 15:20:25 +02:00
Pieter Marsman 5d7ac7e88a Added test for overflow error reported by @jtlz2: https://github.com/pdfminer/pdfminer.six/issues/177#issuecomment-510173228_ 2019-07-10 20:44:23 +02:00
Pieter Marsman ec5218a05f Add some (failing) unittests for name2unicode based on the examples in the Adobe Glyph List Specification 2019-07-10 20:35:42 +02:00
Sebastian Schuberth ec8530f6cf Add a test for the previous fix 2017-10-16 12:35:16 +02:00
Philippe Guglielmetti b010db6049 solves https://github.com/pdfminer/pdfminer.six/issues/65 2017-07-20 21:17:06 +02:00
Philippe Guglielmetti 82af7f0aac issue #56 reproduced, solution attempt unsucessful 2017-04-19 14:19:14 +02:00