Commit Graph

465 Commits (0b44f7771462363528c109f263276eb254c4fcd0)

Author SHA1 Message Date
Pieter Marsman 391fe149ca Release 20200726 2020-07-26 15:10:36 +02:00
Pieter Marsman 66856a1016 Replace internal usage of PDFTextExtractionNotAllowedError (deprecated) with PDFTextExtractionNotAllowed 2020-07-26 15:09:32 +02:00
Philippe Ombredanne 99f0c09869
Restore PDFTextExtractionNotAllowed exception (#461)
* Restore PDFTextExtractionNotAllowed 

Restore PDFTextExtractionNotAllowed exception class as an alias of the
new PDFTextExtractionNotAllowedError exception that was introduced in
6a9269b432

Removing PDFTextExtractionNotAllowed is an API breakage that made
several tools fail.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Use PDFTextExtractionNotAllowed and prepare PDFTextExtractionNotAllowedError to be removed in the future

* Add line to CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-26 15:06:04 +02:00
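A minimal sketch of the aliasing idea described in the commit above, assuming the final state in which PDFTextExtractionNotAllowed is the canonical name and PDFTextExtractionNotAllowedError is kept only as an alias. The classes below are stand-ins defined for illustration, not the library's actual definitions, and `is_caught_by_old_name` is a hypothetical helper.

```python
class PDFTextExtractionNotAllowed(Exception):
    """Stand-in for the exception name that existing callers already catch."""


# Alias: both names are bound to the same class object, so code written
# against either name keeps working while the alias is phased out.
PDFTextExtractionNotAllowedError = PDFTextExtractionNotAllowed


def is_caught_by_old_name():
    """Raise under the alias and catch under the original name (hypothetical helper)."""
    try:
        raise PDFTextExtractionNotAllowedError("Text extraction is not allowed")
    except PDFTextExtractionNotAllowed:
        return True
```

Because the alias is a plain name binding rather than a subclass, `except` clauses written against either name catch exceptions raised under the other.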
Pieter Marsman 4f65242750
Always try to get CMap, even if name is not recognized (#438)
* Add trying to get cmap from pickle file. And cleaning up a bit.

* Don't use keyword argument for dict.get

* Add docs

* Make _get_cmap_name static

* Add test

* Add CHANGELOG.md

* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there

* Add CJK characters to expected output of simple3.pdf

* Fix line length

* Add comment
2020-07-23 20:27:38 +02:00
Pieter Marsman 3cebf5ef66 Release 20200720 2020-07-20 22:05:19 +02:00
lithiumFlower c10cf3cdb8
Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456)
* swap pycryptodome to the faster, smaller, and industry standard cryptography io

* update changelog

* fix lint

* Update CHANGELOG.md

* from MR, unneeded ex and naming

* add samples to nosetests

* fix lint

* show mismatch

* fix lint

* typo and newline

* Revert "add samples to nosetests"

This reverts commit a49ca302

* Add tests for encrypted documents to nose test suite

* Optimize imports of pdfdocument.py

Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung 60863cfd55
Fix converting path to multiple rectangles (#371)
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 17:34:38 +02:00
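The splitting step described in the commit above can be sketched as follows. This is only an illustration under stated assumptions: a path is taken to be a list of tuples whose first element is the operator letter ('m', 'l', 'h'), and `split_into_rectangles` is a hypothetical helper, not the function the commit changed.

```python
def split_into_rectangles(path):
    """Split a path shaped 'mlllhmlllh...' into groups of 5 segments.

    Assumes each segment is a tuple whose first item is the operator
    letter. Any other shape is returned unchanged as a single group.
    """
    shape = "".join(segment[0] for segment in path)
    count, remainder = divmod(len(shape), 5)
    if count > 1 and remainder == 0 and shape == "mlllh" * count:
        # One group per rectangle; each group can be painted on its own.
        return [path[i:i + 5] for i in range(0, len(path), 5)]
    return [path]


# Two rectangles drawn in a single path.
rect_a = [("m", 0, 0), ("l", 1, 0), ("l", 1, 1), ("l", 0, 1), ("h",)]
rect_b = [("m", 5, 5), ("l", 6, 5), ("l", 6, 6), ("l", 5, 6), ("h",)]
groups = split_into_rectangles(rect_a + rect_b)
```

Here `groups` holds two lists of five segments each, so a caller could handle every rectangle separately instead of treating the whole path as one curve.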
madhurcodes 6a9269b432
Change Text extraction is not allowed error to warning (#453)
* Changed error to warning for 'Text extraction is not allowed'

* updated changelog

* fix lint

* made changes suggested in review

* Update CHANGELOG.md

* Add regression test for failing pdf

* Reduce line length to <80

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 16:04:11 +02:00
Tony(Baojia) Tong 836d312982
Validate that object is PDFStream in do_EI (#451)
* check obj type

* update changelog

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-05 13:42:15 +02:00
Pieter Marsman 6e05baf0b7
Don't dump fallback xref by default when using dumppdf.py, adding a flag to enable it
Fixes #176 

* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's

* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref

* Use warning instead of error, because not outputting xrefs is just fine (there aren't any) but it is something the user should know

* Adding changelog

* Extend help message
2020-05-23 18:04:34 +02:00
Pieter Marsman 33b60dfd54 Bump version 2020-05-17 17:50:01 +02:00
Pieter Marsman 91d89af788
Add section to documentation with howto for image extraction (#427)
* Make structure of documentation more clear: tutorials, how-to, topics and reference

* Add howto for images

* Restructure tutorials section, and add install section

* Always use up-to-date version

* Fix indentation warning in docstring

* Add option to dumppdf.py and pdf2txt.py to show version

Fixes #162
2020-05-17 17:48:06 +02:00
Jake Stockwin 7254530d27
Fix ordering of textlines within a textbox when boxes_flow is disabled (#412)
* Fix ordering of textlines within a textbox when boxes_flow is disabled

* Add new test PDF sample
2020-05-09 15:37:49 +02:00
fabbox 7eff108fa5
add shebang line to script in tools (#408)
* add shebang line to script in tools

* fix: use shebang line with python 3

* Moved changelog to unreleased

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-04-28 10:58:42 +02:00
Pieter Marsman d79bcb75ea Bump version 20200402 2020-04-01 21:37:39 +02:00
Pieter Marsman b8988b6848 Bump version 2020-04-01 21:22:59 +02:00
Jake Stockwin 68e2ae8632
Fix text coming in reverse order with boxes flow disabled (#399)
Closes #398
2020-04-01 13:37:04 +02:00
Jake Stockwin e55560f858
Fix #395: Update documentation for boxes_flow, allow None (#396)
* Update documentation for boxes_flow, allow None

* Apply comments from code review

* Small wording changes, remove unnecessary comment

* Update boxes_flow documentation for pdf2text

* Pin version of tox to ensure python 3.4 support
2020-03-26 23:03:49 +01:00
Jake Stockwin 518b5d6efc
Fix #390: Updated misleading documentation about word_margin (#407)
* Updated misleading documentation about word_margin

* Small change in sentence about word_margin

* Remove confusing sentence about adding spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-03-26 23:02:48 +01:00
Jake Stockwin 1a4a06da9f
Fix #392 Split out IO logic from high level functions (#393)
* Allow file-like inputs to high level functions (#392)

* PR Review - move open_filename to utils
2020-03-26 22:52:00 +01:00
Jake Stockwin 1cc1b961c5
Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382) (#384)
* Group text lines if they are centered (#382)

Closes #382

* Add comparison private methods to LTTextLines

* Add missing docstrings

* Add tests for find_neighbors

* Update changelog

* Cosmetic changes from code review
2020-03-23 22:38:39 +01:00
Pieter Marsman 9d7fe2d9ee
Catch ValueError when converting font encoding differences to characters (#389)
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed

* Add test for catching ValueError and KeyError when font encoding differences are invalid

* Added line to CHANGELOG.md
2020-03-16 20:12:45 +01:00
Pieter Marsman 1d773dc38a
Fix grouping textlines when bounding box of parent container is wrong (#386)
* Default value for --all-texts should be false, because using the flag enables it

* Fix edge case: when no neighbors are found a line should form its own text box

* Added test for grouping textlines where 1 is outside the parent bounding box

* Added CHANGELOG.md line
2020-03-14 10:33:39 +01:00
Pieter Marsman bab6d154c2 Bump version 20200124 2020-01-24 12:38:11 +01:00
Pieter Marsman bc494ff03c Bump version to 20200121 2020-01-21 21:13:52 +01:00
Pieter Marsman 410d7ecac3
Fix value for font-family in html by removing the subset tag from the PDF font-name (#357)
* Fix font name by removing subset tag

* Added line to CHANGELOG.md

* Add documentation and clear variable name

* Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`
2020-01-16 22:25:20 +01:00
Pieter Marsman fff3ac2ba6
Fix bug in computing character bounding box (#348)
* Remove scaling font height/width with size of font bounding box

* Refactor LTChar bounding box computation

* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.

* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects

* Add line to CHANGELOG
2020-01-16 22:15:50 +01:00
Recursing 0b1741b9bf Pack the /P (ermissions) entry from the /Encrypt dictionary in the file trailer, as unsigned long (#352)
Fixes #186 

* Treat the permissions (the /P entry) as unsigned long, fix #186

* handle negative values for p

* Extract function for resolving a two's complement

* Add test for issue #352

* Add line to CHANGELOG.md

* Only ints can be converted to a uint using two's-complement method

* Standardize import style; multiple imports from same module on one line

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-01-07 21:59:13 +01:00
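A minimal sketch of the two's-complement conversion this commit describes, assuming a 32-bit width and little-endian packing. `to_uint32` is a hypothetical name and not necessarily the function the commit extracted.

```python
import struct


def to_uint32(value):
    """Reinterpret a possibly negative int as an unsigned 32-bit value."""
    if not isinstance(value, int):
        # Only ints can be converted with the two's-complement method.
        raise TypeError("expected int, got %s" % type(value).__name__)
    # Masking keeps the low 32 bits, which maps -1 to 0xFFFFFFFF.
    return value & 0xFFFFFFFF


# A negative /P value that struct.pack("<L", ...) would reject directly.
packed = struct.pack("<L", to_uint32(-3904))
```

Without the conversion, `struct.pack("<L", -3904)` raises `struct.error` because the unsigned format does not accept negative numbers.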
Pieter Marsman b27d3d0aff Bump version 2020-01-04 18:15:15 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refactor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors to go

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exist

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman 803a7d9598 Release 20191110 2019-11-10 12:29:14 +01:00
Pieter Marsman 2bee7d8dcf
Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335)
Fixes #334
2019-11-10 12:18:49 +01:00
Pieter Marsman 5c6fa8f986 Release 20191107 2019-11-07 21:52:44 +01:00
Pieter Marsman bc034c8e59
Create sphinx documentation for Read the Docs (#329)
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
2019-11-07 21:12:34 +01:00
Igor Moura 40aa2533c9 Added: simple wrapper to extract text from pdf (#330)
Fixes #327
2019-11-07 07:54:10 +01:00
Martin Hasoň ed1b09c6f2 Fix debug logging for pdf2txt.py and dumppdf.py (#325)
Fixes #313
2019-11-06 21:47:19 +01:00
Pieter Marsman 33b16b3f07
Deprecate the use of _py2_no_more_posargs (#328)
Fixes #324
2019-11-02 10:29:39 +01:00
Jianfeng 44b223cf0a Speedup grouping of textboxes (#315)
Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration
Changed: avoid potentially huge number of variable assignments in list comprehension.
Changed: avoid repeatedly evaluating `obj is obj` in list comprehension by storing id(obj).
2019-10-31 09:22:58 +01:00
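The heap-based approach mentioned in the commit above can be illustrated with a generic sketch. It assumes a greedy closest-pair merge loop; `group_greedily`, `distance`, and `merge` are hypothetical names, and this is not the library's actual grouping code. The heap is built once, stale entries are skipped when popped instead of rebuilding the heap, and objects are tracked by `id(obj)`.

```python
import heapq
from itertools import combinations


def group_greedily(items, distance, merge):
    """Repeatedly merge the closest pair until a single group remains."""
    # Build the heap once; the ids act as tie-breakers so that the objects
    # themselves are never compared.
    heap = [(distance(a, b), id(a), id(b), a, b)
            for a, b in combinations(items, 2)]
    heapq.heapify(heap)
    alive = {id(obj): obj for obj in items}
    while heap:
        _, id_a, id_b, a, b = heapq.heappop(heap)
        if id_a not in alive or id_b not in alive:
            continue  # stale entry: one side was already merged
        del alive[id_a], alive[id_b]
        group = merge(a, b)
        for other in alive.values():
            entry = (distance(group, other), id(group), id(other), group, other)
            heapq.heappush(heap, entry)
        alive[id(group)] = group
    return list(alive.values())


# Toy usage: groups are lists of numbers, distance is the smallest gap.
points = [[0.0], [1.0], [10.0]]
result = group_greedily(
    points,
    distance=lambda a, b: min(abs(x - y) for x in a for y in b),
    merge=lambda a, b: a + b,
)
```

Skipping stale entries lazily trades some extra heap entries for not having to rebuild or re-sort the structure after every merge.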
Pieter Marsman d88d6020a2
Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320)
Fixes #314 
Fixes #105
2019-10-26 19:16:37 +02:00
Pieter Marsman a238a19999
Fix assertionerror when dumping pdf with reference to objid 0 (#318)
Fixes #94 
Added: test to check if `PDFObjectNotFound` error is raised if objid 0 is requested.
2019-10-25 22:49:58 +02:00
Serj Sintsov cb9cd8ea46 Use named logger instead of root logger (#236) 2019-10-22 20:52:43 +02:00
Pieter Marsman 373c6e7b97
Added: extraction of JBIG2 encoded images (#311)
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46
2019-10-22 17:37:06 +02:00
Pieter Marsman 694aa508c3 Release 20191020 2019-10-20 14:21:48 +02:00
Pieter Marsman adc4726e06
Add warning about dropping python2 support (#307)
Fix #303
2019-10-20 13:59:29 +02:00
Pieter Marsman 9fd7172f7b Cleanup utils.py 2019-10-17 12:14:02 +02:00
jet457 7e40fde320 Removing assertion in drange to allow equal inputs (#246) and mimic behaviour of built-in method range
Fixes #66, since it now allows the bbox to have 0 width or 0 height
Added tests for Plane since it is the API that uses drange
2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev 4df6d4e5ca Changed: comparisons for image colorspace literals (#132)
Fixes #131 

Changed: comparisons for image colorspace literals
Added: test for extracting images from pdfs
2019-10-15 16:11:54 +02:00
Pieter Marsman 63b2e09ac3
Merge pull request #203 from jbarlow83/negative-descent
Interpret font Descent as a negative number even if specified as positive
2019-10-13 20:06:52 +02:00
Tony Tong 106a09c5bb fix stroke color and non-stroke color in PDFGraphicState 2019-10-12 17:35:46 -04:00