860 Commits (d821fed340e7569e26568e9202b58a1d557025b1)

Author SHA1 Message Date
X d821fed340
Fix typos in readthedocs documentation. (#579)
* Fix typos and possible mistakes.

* Revert two edits based on discussion in #579

Revert the two changes based on our discussion. 

I read the documentation and had a glimpse at the default code. And perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers.

Maybe it is useful to point that out in the figure but I am not sure how best to do it. 

Another option is to use something like `min_char_margin_threshold` or similar, in the hope that they are easier to understand. Just some thoughts!

* Triggering travis again

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 20:58:50 +02:00
Tony(Baojia) Tong 543976f195
Fix issue of ValueError and KeyError raised in PDFdocument and PDFparser (#574)
* check obj type

* update changelog

* Update CHANGELOG.md

* fix the bug

* fix condition

* update changelog

* update changelog again

* update changelog

* update

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Co-authored-by: Tony Tong <baojia.tong@kensho.com>
2021-08-26 20:55:02 +02:00
Eduardo Gonzalez Lopez de Murillas ea00f56ac6
Added support for Paeth PNG filter compression (predictor value = 4) (#537)
* Added support for Paeth PNG filter compression (predictor value = 4)

* Use `above` and `upper_left` as in the pseudo code

* Refactor: use variable names that are very close to the pseudo code and add pieces of the docs to show what is going on.

* Fix line length issues

* Add line about compressions to README.md

* Fix merge conflict on readme

* Fix bug in filter type Up

* Make if-else consistent

Co-authored-by: Eduardo Gonzalez Lopez de Murillas <eduardo.gonzalez@accha.nl>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 20:53:13 +02:00
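The Paeth filter added in #537 follows the pseudo code of the PNG specification. A minimal sketch of that predictor, using the `left`, `above` and `upper_left` names mentioned in the commit, might look like the block below. The helper `unfilter_paeth_row` and its `bpp` (bytes per pixel) parameter are illustrative names, not the library's API.

```python
def paeth_predictor(left: int, above: int, upper_left: int) -> int:
    """Return the neighbour closest to left + above - upper_left."""
    p = left + above - upper_left
    pa = abs(p - left)
    pb = abs(p - above)
    pc = abs(p - upper_left)
    # Ties are broken in the order left, above, upper_left.
    if pa <= pb and pa <= pc:
        return left
    elif pb <= pc:
        return above
    else:
        return upper_left


def unfilter_paeth_row(filtered: bytes, previous: bytes, bpp: int) -> bytes:
    """Undo the Paeth filter for one scanline (hypothetical helper).

    `previous` is the already reconstructed row above, `bpp` the number
    of bytes per pixel. Bytes left of the first pixel count as zero.
    """
    out = bytearray()
    for i, value in enumerate(filtered):
        left = out[i - bpp] if i >= bpp else 0
        above = previous[i]
        upper_left = previous[i - bpp] if i >= bpp else 0
        out.append((value + paeth_predictor(left, above, upper_left)) & 0xFF)
    return bytes(out)
```

The reconstruction works modulo 256, so each byte is masked with `0xFF` after the predictor is added back.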
Jake Stockwin 19c1372984
Fix for when 'trailer' is indented (#535)
* Fix for when trailer is indented

* Store stripped line

* This commit breaks things...

* Or maybe this one breaks things?

* Remove commented code because no longer used.

* Add CHANGELOG.md

* Add poetry venv management files to gitignore since I started using poetry to manage the python envs for this project

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-15 17:49:56 +02:00
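The idea behind #535, stripping the line before checking for the `trailer` keyword, can be illustrated with a small sketch. The function name `is_trailer_line` is hypothetical and the real parser does more than this single check.

```python
def is_trailer_line(line: bytes) -> bool:
    """Detect the 'trailer' keyword even when the line is indented.

    Hypothetical helper: the surrounding whitespace is stripped first,
    so leading spaces or tabs no longer hide the keyword.
    """
    return line.strip().startswith(b"trailer")
```

Without the `strip()` call, a line such as `b"  trailer\n"` would not match and the xref section would be read incorrectly.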
Jeremy Singer-Vine 016239c146
Fix .paint_path handling of single line segments (#530)
* Fix .paint_path handling of single line segments

- Fixes typo ("ml" should have been "mlh")

- Removes if-statement that required individual line segments to be
  strictly horizontal or vertical.

* Treat 'ml'-shape paths as lines not curves

Although 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.

Adds tests and sample PDF.

* Fix trailing whitespace

* Fix point-extraction from Bézier path commands

This commit corrects the manner in which "pts" are extracted from Bézier
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on the command (c, v, or y),
the code was extracting some combination of control points (not on
curve) and the actual points-on-curve.

This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.

* Add comments to test_paint_path_quadrilaterals

* Parse rect-forming mllll paths as rects not curves

Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.

* One changelog line with ref to mr

* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring

* Extract variables from if statement to make it easier to read

* Optimize imports order

* Trigger travis build

* Revert "Trigger travis build"

This reverts commit 41c05184

* Update travis badge

* Update travis badge

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-07-27 18:27:32 +02:00
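The classification described in #530 can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: `classify_path` is a hypothetical function, and a path is assumed to be a list of tuples whose first element is the operator (`m`, `l`, `c`, `v`, `y` or `h`). For every operator except `h`, the last two operands are the point that lies on the path, which is also true for the Bézier commands.

```python
def classify_path(path):
    """Classify a path as 'line', 'rect' or 'curve' (hypothetical sketch)."""
    # The shape is the sequence of operators, e.g. 'mlllh'.
    shape = "".join(op[0] for op in path)
    # The final operand pair is the on-path point; 'h' carries no point.
    pts = [op[-2:] for op in path if op[0] != "h"]

    if shape in ("ml", "mlh"):
        return "line"

    if shape in ("mlllh", "mllll"):
        (x0, y0), (x1, y1), (x2, y2), (x3, y3) = pts[:4]
        # 'mllll' only forms a rectangle if the last point closes the path.
        is_closed = shape.endswith("h") or pts[4] == pts[0]
        axis_aligned = (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
            y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
        )
        if is_closed and axis_aligned:
            return "rect"

    return "curve"
```

A single segment is a line regardless of its slope, while a quadrilateral whose sides are not axis-aligned falls through to `curve`.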
Jürgen Gmach 22f90521b8
Use python3.9 in tox config
* tox: use Python 3.9 final

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-03-11 20:46:31 +01:00
Pieter Marsman 761410e66c
Fix cryptography build in travis cicd by upgrading distribution from Trusty Tahr to Focal Fossa (#585)
* Update .travis.yml

* Also change 3.9-dev to 3.9 because that is now supported by travis
2021-02-20 10:32:07 +01:00
markfirmware f389b97923
Correct typos and syntax errors in README.md (#538) 2020-11-08 16:20:10 +01:00
Ev2geny 693e4f48a3
Issue #469 is fixed (When run on Windows a lot of tests fail with the error: [Errno 13] Permission denied) (#484)
Closes #469

* Issue #469 is fixed

* one extra comment to code is added

* TemporaryFilePath context manager is added to facilitate tests

* flake8 complaints fixed

* Update docs of tempfilepath.py

* Fix flake8

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-26 10:10:11 +01:00
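A context manager in the spirit of the `TemporaryFilePath` helper from #484 could be sketched as below. The name `temporary_file_path` and its details are assumptions for illustration; the point is that the file handle is closed before the path is handed out, because on Windows a file that is still open cannot be reopened by other code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_file_path(suffix=""):
    """Yield the path of a closed temporary file and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Close our own handle right away so the caller can open the path freely.
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)
```

A test can then write to the path inside the `with` block and rely on the cleanup when the block exits.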
Pieter Marsman f8e6ad6ac1
Remove support for non-standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream (#523)
Closes #191 

* Remove support for non-standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream

* Add docstring

* Fix flake8
2020-10-25 14:37:12 +01:00
EucliTs0 fc75972bbd
Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' (#529)
Closes #518 

* Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2'

An error occurs when the 'DW2' key contains a PDFObjRef object instead of a list of int values, e.g. 'DW2': <PDFObjRef:152>.
To solve this issue, we utilise the resolve1() function.

See: https://github.com/pdfminer/pdfminer.six/issues/518

* Updated CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Dimitrios TSOLAKIDIS <dimitrios.tsolakidis@vialink.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-25 14:34:45 +01:00
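The failure fixed in #529 can be reproduced with stand-ins. In the sketch below, `FakeObjRef`, `resolve_one` and `read_dw2` are hypothetical names that only mimic the behaviour of an indirect reference and of a resolve step; they are not the library's classes. The fallback `[880, -1000]` is the default the PDF reference gives for `DW2`.

```python
class FakeObjRef:
    """Stand-in for an indirect object reference (illustration only)."""

    def __init__(self, value):
        self._value = value

    def resolve(self):
        return self._value


def resolve_one(obj):
    """Follow references until a direct object is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


def read_dw2(spec):
    """Unpack 'DW2' after resolving it, so a reference no longer breaks."""
    vy, w = resolve_one(spec.get("DW2", [880, -1000]))
    return vy, w
```

Unpacking the unresolved reference directly is what raised the `TypeError`, since the reference object itself is not iterable.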
Pieter Marsman 178a831802
Revert "Fix for when 'trailer' is indented (#513)" (#534)
This reverts commit ec223d1f1d.
2020-10-25 13:22:42 +01:00
Pieter Marsman 875e53013a
Remove explicit support for Python 3.4 and 3.5, adding tests for python 3.9 (#522)
Closes #503
2020-10-25 12:34:51 +01:00
Jake Stockwin ec223d1f1d
Fix for when 'trailer' is indented (#513)
* Fix for when 'trailer' is indented

Closes #214

* Address CR comments - strip line after parsing

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-24 18:55:07 +02:00
estshorter 61300eef70
Remove unused dependency on sortedcontainers (#525)
* Remove unused sortedcontainers package

* Fix changelog format

* Fix a link to the PR

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-24 15:55:22 +02:00
Pieter Marsman c8cceb7c58 Release 20201018 2020-10-18 12:57:26 +02:00
Pieter Marsman 2a88fda543
Rebrand the .six by adding a punchline and a faq (#520)
* Add punchline to readme

* Add punchline to docs

* Add frequently asked questions

* Update docs/source/faq.rst

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Update docs/source/faq.rst

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Update docs/source/faq.rst

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Update faq.rst

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-18 12:50:59 +02:00
Pieter Marsman c66eca3c29 Update faq.rst 2020-10-18 12:49:54 +02:00
Jeremy Singer-Vine e83dd26671
Fix .paint_path for non-rectangle quadrilaterals (#512)
* Fix paint_path bug noted in issue #473

Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue #473

* Tweak paint_path fix per @pietermarsman review

- Adjusts logic to adhere to if-elif-else rather than early returns.

- Shortens subpath detection/reprocessing step, using re.finditer().

* Reorder paint_path() if-else statements once more

* Fix flake8 issues

* Fix error: should select items 1 and 2 from the list, and possibly items [3, 4], and so on.

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-12 17:53:00 +02:00
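The subpath detection with `re.finditer()` mentioned in #512 can be illustrated on the shape string alone. The pattern below is an assumption chosen for the sketch (every subpath starts at an `m` and runs until the next `m`); the pattern used in the commit may differ, and `split_subpaths` is a hypothetical name.

```python
import re


def split_subpaths(shape: str):
    """Split a shape string such as 'mlllhmlllh' into its subpaths."""
    # Each match starts at a move-to and stops before the next move-to.
    return [match.group(0) for match in re.finditer(r"m[^m]+", shape)]
```

Each resulting subpath can then be classified on its own, so that one path holding several shapes does not collapse into a single curve.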
Pieter Marsman 599f0391b5 Update faq.rst 2020-10-12 09:22:41 +02:00
Pieter Marsman e59b1bca2f
Update docs/source/faq.rst
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-12 09:20:43 +02:00
Pieter Marsman a805653a83
Update docs/source/faq.rst
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-12 09:20:37 +02:00
Pieter Marsman 4be9757b86
Update docs/source/faq.rst
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-12 09:20:30 +02:00
Pieter Marsman 14cc66ae6d Add frequently asked questions 2020-10-11 20:05:26 +02:00
Pieter Marsman bbc01f749a Add punchline to docs 2020-10-11 20:05:11 +02:00
Pieter Marsman d04c38fb8d Add punchline to readme 2020-10-11 20:04:57 +02:00
estshorter 360b1efc0b
Deprecate Python 3.4 and 3.5 (#507) 2020-10-10 16:15:03 +02:00
Diego Elio Pettenò 67e2d79591
Fix out-of-bound access on some PDFs. (#483)
Replace the non-emptiness check with a minimum length check; you can't get the second to last item in a list of fewer than two items.
2020-10-10 15:18:34 +02:00
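The difference between the two checks in #483 is easy to show in isolation. The function below is a hypothetical illustration, not the code that was changed.

```python
def second_to_last(items):
    """Return the second to last item, or None if there is none."""
    # A plain truthiness check (`if items:`) would pass for a one-item
    # list and `items[-2]` would then raise IndexError.
    if len(items) >= 2:
        return items[-2]
    return None
```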
Jake Stockwin ef4787d8ad
Fix not being able to pass boxes flow as None to pdf2txt (#479)
* Fix not being able to pass boxes flow as None to pdf2txt

* Changes from code review

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-10 15:17:04 +02:00
estshorter f03657e5c4
Allow a pathlib.PurePath object as a input to open_filename (#492)
* open_filename accepts a pathlib.PurePath object

* Add test for open_filename with pathlib

* Fix a wrong function name

* Cast a pathlib object to string for py3.4/3.5

* Add link to the PR

* Raise an exception when open_filename gets an unsupported type

* Add tests for open_filename

* Update CHANGELOG.md

* Documentation

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-17 21:29:00 +02:00
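The input handling described in #492 can be sketched as a small context manager. `open_path_or_stream` is a hypothetical name and the exact checks are assumptions; the sketch only shows the three cases from the commit: a `pathlib` object is cast to a string, an already open stream is passed through, and any other type raises an exception.

```python
import io
import pathlib
from contextlib import contextmanager


@contextmanager
def open_path_or_stream(filename, mode="rb"):
    """Yield a file handle for a str, a pathlib path or an open stream."""
    if isinstance(filename, pathlib.PurePath):
        # Casting to str keeps old interpreters without PathLike support happy.
        filename = str(filename)
    if isinstance(filename, str):
        with open(filename, mode) as handle:
            yield handle
    elif isinstance(filename, io.IOBase):
        # The caller owns the stream, so it is not closed here.
        yield filename
    else:
        raise TypeError("Unsupported input type: %s" % type(filename))
```

Raising early on unsupported types gives a clearer error than a failure deeper in the parser.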
David Nicholson b4054ff4cf
Pass caching parameter to PDFResourceManager in `high_level` functions (#475)
* Updated high_level.py

This commit enables caching to be turned on and off rather than being always on regardless of the user input.

* Reverted params back to fix errors

* Updated CHANGELOG.md to reflect quick fix

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-10 21:09:07 +02:00
Igor Moura a83f853de7
Remove unused rijndael encryption implementation (#465)
* Remove unused rijndael encryption

* Add current PR link to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-09-10 19:28:00 +02:00
typhoon71 4d8b5975cb
Add section to documentation with howto for AcroForm fields extraction (#458)
* Create aforms.rst

Add section to documentation with howto for AcroForm fields extraction

* Update index.rst

Added reference to aforms.rst

* Update aforms.rst

* Update aforms.rst

* Update index.rst

* Update and rename aforms.rst to acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update index.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update pdfdocument.py

* Update pdfdocument.py

* Update pdfdocument.py

* Update acro_forms.rst

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update acro_forms.rst

* reverted changes

* Update README.md

* Proper processing of ComboBox

ComboBox fields hold multiple values, so they must be returned as a list.

* PDF with AcroForm (samples)

* Create tmp

* Delete AcroForm_TEST.pdf

* Delete AcroForm_TEST_compiled.pdf

* PDF file with AcroForms

* Delete tmp

* Fixed typo

* Update index.rst

* Update README.md

* Update index.rst

* Update pdfdocument.py

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update pdfdocument.py

* Update pdfdocument.py

* Update pdfdocument.py

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
2020-09-10 19:18:41 +02:00
Pieter Marsman 0b44f77714 Move changelog line for #438 to current release 2020-07-26 15:14:15 +02:00
Pieter Marsman 391fe149ca Release 20200726 2020-07-26 15:10:36 +02:00
Pieter Marsman 66856a1016 Replace internal usage of PDFTextExtractionNotAllowedError (deprecated) with PDFTextExtractionNotAllowed 2020-07-26 15:09:32 +02:00
Philippe Ombredanne 99f0c09869
Restore PDFTextExtractionNotAllowed exception (#461)
* Restore PDFTextExtractionNotAllowed

Restore PDFTextExtractionNotAllowed exception class as an alias of the
new PDFTextExtractionNotAllowedError exception that was introduced in
6a9269b432

Removing PDFTextExtractionNotAllowed is an API breakage that made
several tools break.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Use PDFTextExtractionNotAllowed and prepare PDFTextExtractionNotAllowedError to be removed in the future

* Add line to CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-26 15:06:04 +02:00
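Keeping both exception names alive, as #461 does, can be done with a plain alias. The classes below are stand-ins defined for this sketch only, and `caught_by_old_name` is a hypothetical helper that shows the effect.

```python
class PDFTextExtractionNotAllowed(Exception):
    """Stand-in for the restored exception class (illustration only)."""


# The newer name points at the same class, so code written against either
# name keeps working until the alias is removed.
PDFTextExtractionNotAllowedError = PDFTextExtractionNotAllowed


def caught_by_old_name():
    """Raise through the new name and catch through the old one."""
    try:
        raise PDFTextExtractionNotAllowedError("Text extraction is not allowed")
    except PDFTextExtractionNotAllowed as error:
        return str(error)
```

Because both names refer to one class, an `except` clause written against either name catches the exception.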
Pieter Marsman 4f65242750
Always try to get CMap, even if name is not recognized (#438)
* Add trying to get cmap from pickle file. And cleaning up a bit.

* Don't use keyword argument for dict.get

* Add docs

* Make _get_cmap_name static

* Add test

* Add CHANGELOG.md

* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there

* Add CJK characters to expected output of simple3.pdf

* Fix line length

* Add comment
2020-07-23 20:27:38 +02:00
Pieter Marsman 3cebf5ef66 Release 20200720 2020-07-20 22:05:19 +02:00
lithiumFlower c10cf3cdb8
Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456)
* swap pycryptodome to the faster, smaller, and industry standard cryptography io

* update changelog

* fixlint

* Update CHANGELOG.md

* from MR, unneeded ex and naming

* add samples to nosetests

* fix lint

* show mismatch

* fix lint

* typo and newline

* Revert "add samples to nosetests"

This reverts commit a49ca302

* Add tests for encrypted documents to nose test suite

* Optimize imports of pdfdocument.py

Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung 60863cfd55
Fix converting path to multiple rectangles (#371)
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 17:34:38 +02:00
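The grouping step from #371 can be sketched as follows, assuming a path is a list of operator tuples such as `('m', x, y)`, `('l', x, y)` and `('h',)`. `split_rect_groups` is a hypothetical name; in the commit each group is fed back into `paint_path`, which is left out here.

```python
import re


def split_rect_groups(path):
    """Split an 'mlllhmlllh...' path into groups of five operators."""
    shape = "".join(op[0] for op in path)
    # Only split when the whole path is a repetition of 'mlllh'.
    if len(shape) > 5 and re.fullmatch(r"(mlllh)+", shape):
        return [path[i:i + 5] for i in range(0, len(path), 5)]
    return [path]
```

Each group of five operators then describes one closed four-sided subpath that can be handled as a rectangle instead of being merged into a single curve.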
madhurcodes 6a9269b432
Change Text extraction is not allowed error to warning (#453)
* Changed error to warning for 'Text extraction is not allowed'

* updated changelog

* fix lint

* made changes suggested in review

* Update CHANGELOG.md

* Add regression test for failing pdf

* Reduce line length to <80

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 16:04:11 +02:00
Tony(Baojia) Tong 836d312982
Validate that object is PDFStream in do_EI (#451)
* check obj type

* update changelog

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-05 13:42:15 +02:00
Michael ce7775fd4f
Add setup.py classifiers for Python 3.7 and 3.8 (#450) 2020-07-05 13:19:02 +02:00
Jake Stockwin ac2b20a79a
[docs] Add extract_pages tutorial (#442)
Closes https://github.com/pdfminer/pdfminer.six/issues/361
2020-06-29 20:07:05 +02:00
AhnHyunJin 09c989f301
Fix spelling error (#436)
* Change rwo to two in pdfdiff.py

Co-authored-by: ahnhyunjin <hj.ahn@promptech.co.kr>
2020-06-06 15:43:57 +02:00
Pieter Marsman 6e05baf0b7
Don't dump fallback xref by default when using dumppdf.py, adding a flag to enable it
Fixes #176 

* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xrefs

* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref

* Use warning instead of error, because not outputting xrefs is just fine (there aren't any) but it is something the user should know

* Adding changelog

* Extend help message
2020-05-23 18:04:34 +02:00
Pieter Marsman 33b60dfd54 Bump version 2020-05-17 17:50:01 +02:00
Pieter Marsman 91d89af788
Add section to documentation with howto for image extraction (#427)
* Make structure of documentation more clear: tutorials, how-to, topics and reference

* Add howto for images

* Restructure tutorials section, and add install section

* Always use up-to-date version

* Fix indentation warning in docstring

* Add option to dumppdf.py and pdf2txt.py to show version

Fixes #162
2020-05-17 17:48:06 +02:00
Jake Stockwin 7254530d27
Fix ordering of textlines within a textbox when boxes_flow is disabled (#412)
* Fix ordering of textlines within a textbox when boxes_flow is disabled

* Add new test PDF sample
2020-05-09 15:37:49 +02:00