16 Commits (1bf3c42b59125f4491d863e1c11dca7ebbe96adc)

Author SHA1 Message Date
jwyawney 43c8fc8557
Ignore empty characters when analyzing layout (#689)
* Adding in checks for spurious lines that contain either only spaces or new line characters

* Added spurious lines check and unit tests

* Updated CHANGELOG.md with changes

* Simplify code

* Simplify code

* Simplify code

* Remove changes to lines that are not actually changed

* Format import

* Improve CHANGELOG.md

* Improve CHANGELOG.md

* Fix cicd

* Blacken

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-22 21:20:26 +01:00
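The spurious-line check described in #689 could look roughly like the sketch below. The helper name `is_spurious_line` and the sample data are hypothetical; this is not the code merged in that pull request, only an illustration of skipping lines that hold nothing but spaces or newline characters.

```python
def is_spurious_line(text: str) -> bool:
    """Return True when a line holds nothing but spaces and newlines."""
    # Hypothetical helper: strip only the two characters named in the
    # commit message, then check whether anything is left.
    return text.strip(" \n") == ""


lines = ["Hello", "   ", "\n", " world \n", ""]
# Keep only the lines that carry visible content.
kept = [line for line in lines if not is_spurious_line(line)]
```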
Andrew Baumann 1d1602e0c5
Added feature: page labels (#680)
* port page label code from pdfannots

* add tests and clean up

* more cleanup; harden against non-conforming input

* one more test

* update CHANGELOG

* cleanup & respond to review feedback (incomplete)

* Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.

* PageLabels *is* a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.

* fix type errors and cleanup slightly

 * fix mypy errors (including tweaking code to avoid problematic dynamic types)
 * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
 * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-01 10:08:05 +01:00
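The number-tree refactoring in #680 (recursive `_parse()`, sortedness checked only at the end) can be sketched as follows. This is a minimal illustration using plain dicts in place of PDF objects; the class name `SimpleNumberTree` and the sample nodes are assumptions, not the pdfminer.six API. The `Nums` and `Kids` keys follow the number-tree layout in the PDF specification.

```python
import warnings
from typing import Any, Dict, List, Tuple


class SimpleNumberTree:
    """Hypothetical stand-in for a PDF number tree built from plain dicts."""

    def __init__(self, root: Dict[str, Any]) -> None:
        self.root = root

    def _parse(self, node: Dict[str, Any]) -> List[Tuple[int, Any]]:
        items: List[Tuple[int, Any]] = []
        # Leaf data: a flat array alternating key, value, key, value, ...
        nums = node.get("Nums", [])
        for key, value in zip(nums[0::2], nums[1::2]):
            items.append((key, value))
        # Intermediate nodes: recurse into each child.
        for kid in node.get("Kids", []):
            items.extend(self._parse(kid))
        return items

    def values(self) -> List[Tuple[int, Any]]:
        items = self._parse(self.root)
        # Check sortedness once, after the whole tree is collected,
        # so a non-conforming file produces a single warning.
        ordered = sorted(items, key=lambda kv: kv[0])
        if items != ordered:
            warnings.warn("number tree keys are not sorted")
        return ordered
```

A page-label tree would then be one such tree whose values are label dictionaries, for example `SimpleNumberTree({"Nums": [0, {"S": "r"}, 4, {"S": "D"}]}).values()`.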
Pieter Marsman aa5dec252f
Fixes jbig2 writer to write valid jb2 files
See: https://github.com/pdfminer/pdfminer.six/pull/653

Squashed commit of the following:

commit 8748c9fcddab0826cca243eee45c40d2b6611e80
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:40:50 2022 +0100

    Remove prints in test

commit bb977258a39fc7baa13bba1c3ea29726e17c0f6d
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:35:12 2022 +0100

    Cleanup exception handling for jbig2 global streams

commit cf0b47b01b7caad8acbd82097aadadb620606a8b
Merge: a5831d1 708dd20
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:29:15 2022 +0100

    Merge branch 'develop' into jbig2_fix

commit a5831d110a
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:59:17 2021 -0400

    flake8 tests

commit 18ffa29387
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:52:11 2021 -0400

    add description in changelog

commit 6c7ee43d6c
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:43:36 2021 -0400

    Fixes jbig2 writer to write valid jb2 files

    - closes #652
2022-01-23 21:41:08 +01:00
wind_chh c883f5e13f
Add support for identity unicode cmap (#626)
Fixes #625 

* add support for Identity-H/V cmap fonts

* format code to pass flake8 check

* Remove indent

* Remove indent

* Use isinstance instead of type check

* Use or instead of any

* Use str in variable, instead of str.find()

* Fix mypy error: add typing annotations to get_unichr()

* Fix type of PDFCIDFont. Can be any type of CMapBase.

This is a quick fix, the entire cmap structure does not have proper inheritance.

* Added line to CHANGELOG.md

* Add separate class for IdentityUnicodeMap

* Remove ABC from CmapBase

* Remove ABC from CmapBase

* Remove blank line

Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-10-13 21:52:00 +02:00
wind_chh 234c466372
Fix extraction of some cjk characters (#593)
Fixes #566 

* try to fix issue where some Chinese characters cannot be extracted
correctly (#566).

* format code to pass flake8 check.

* fix typo and refer to issue 593.

Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 21:05:03 +02:00
Jeremy Singer-Vine 016239c146
Fix .paint_path handling of single line segments (#530)
* Fix .paint_path handling of single line segments

- Fixes typo ("ml" should have been "mlh")

- Removes if-statement that required individual line segments to be
  strictly horizontal or vertical.

* Treat 'ml'-shape paths as lines not curves

Although 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.

Adds tests and sample PDF.

* Fix trailing whitespace

* Fix point-extraction from Bézier path commands

This commit corrects the manner in which "pts" are extracted from Bézier
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on the command (c, v, or y),
the code was extracting some combination of control points (not on
curve) and the actual points-on-curve.

This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.

* Add comments to test_paint_path_quadrilaterals

* Parse rect-forming mllll paths as rects not curves

Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.

* One changelog line with ref to mr

* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring

* Extract variables from if statement to make it easier to read

* Optimize imports order

* Trigger travis build

* Revert "Trigger travis build"

This reverts commit 41c05184

* Update travis badge

* Update travis badge

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-07-27 18:27:32 +02:00
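The point extraction described in #530 can be illustrated with a short sketch. In the PDF path operators `c`, `v` and `y`, the last two operands are the point the curve ends on, and the earlier operands are control points. The function name `on_curve_points` and the tuple-based path format are assumptions made for this example, not the signature of `paint_path`.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def on_curve_points(path: List[tuple]) -> List[Point]:
    """Return the point that each path segment ends on."""
    points: List[Point] = []
    start: Optional[Point] = None  # first point of the current subpath
    for segment in path:
        operator, operands = segment[0], segment[1:]
        if operator == "h":
            # Close path: the segment ends back at the subpath start.
            if start is not None:
                points.append(start)
            continue
        # m, l, c, v, y: the final two operands are the end point;
        # any operands before them are off-curve control points.
        point = (operands[-2], operands[-1])
        if operator == "m":
            start = point
        points.append(point)
    return points


path = [("m", 0, 0), ("c", 1, 2, 3, 4, 5, 6), ("v", 7, 8, 9, 10),
        ("y", 11, 12, 13, 14), ("h",)]
shape = "".join(segment[0] for segment in path)
```

Handling `h` in the same loop as the other operators mirrors the commit's stated goal of treating it more consistently with the rest of the path commands.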
Kwok-kuen Cheung 60863cfd55
Fix converting path to multiple rectangles (#371)
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 17:34:38 +02:00
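The splitting described in #371 can be sketched as below: when the shape string is a repetition of `mlllh`, the path is cut into groups of five segments so that each group can be painted on its own. The helper name `split_rectangles` is hypothetical, and the sketch does not check that each group is geometrically a rectangle.

```python
import re
from typing import List


def split_rectangles(path: List[tuple]) -> List[List[tuple]]:
    """Split an 'mlllhmlllh...' path into one sub-path per 'mlllh' group."""
    shape = "".join(segment[0] for segment in path)
    if re.fullmatch(r"(mlllh)+", shape) is None:
        # Any other shape is left as a single path.
        return [path]
    # Each rectangle candidate is exactly five segments long.
    return [path[i:i + 5] for i in range(0, len(path), 5)]


two_rects = [
    ("m", 0, 0), ("l", 1, 0), ("l", 1, 1), ("l", 0, 1), ("h",),
    ("m", 5, 5), ("l", 6, 5), ("l", 6, 6), ("l", 5, 6), ("h",),
]
groups = split_rectangles(two_rects)
```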
madhurcodes 6a9269b432
Change Text extraction is not allowed error to warning (#453)
* Changed error to warning for 'Text extraction is not allowed'

* updated changelog

* fix lint

* made changes suggested in review

* Update CHANGELOG.md

* Add regression test for failing pdf

* Reduce line length to <80

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 16:04:11 +02:00
Pieter Marsman 1c3047b68b
Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364)
Fixes #363 

* Remove samples/ and docs/ from source distribution. The samples/ directory contains pdf's for testing purposes, and docs/ contains the readthedocs documentation, which is published online.

* Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit.

See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen
And https://github.com/pdfminer/pdfminer.six/issues/363

* Added line to CHANGELOG.md

* Remove unused imports
2020-01-24 12:36:02 +01:00
Pieter Marsman 2f7f5d2667
Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338)
* Fix getting filename when extracting embedded files

* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs

* Add line to CHANGELOG
2020-01-16 22:11:42 +01:00
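The fallback described in #338 amounts to preferring the unicode key and using the older key only when it is absent. A minimal sketch, assuming the file specification has already been resolved into a plain mapping; the function name `embedded_filename` is hypothetical.

```python
from typing import Mapping, Optional


def embedded_filename(filespec: Mapping[str, str]) -> Optional[str]:
    """Pick the file name from a file specification dictionary."""
    # Prefer the unicode entry (UF) when the file provides one.
    name = filespec.get("UF")
    if name is None:
        # Fall back on the backwards-compatible entry (F).
        name = filespec.get("F")
    return name
```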
Recursing 0b1741b9bf
Pack the /P (permissions) entry from the /Encrypt dictionary in the file trailer, as unsigned long (#352)
Fixes #186 

* Treat the permissions (the /P entry) as unsigned long, fix #186

* handle negative values for p

* Extract function for resolving a two's complement

* Add test for issue #352

* Add line to CHANGELOG.md

* Only ints can be converted to a uint using two's-complement method

* Standardize import style; multiple imports from same module on one line

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-01-07 21:59:13 +01:00
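The unsigned handling of /P described in #352 can be sketched as follows. A negative signed value is mapped to its 32-bit two's-complement representation before packing, because `struct.pack` with the `L` format rejects negative numbers. The function names are hypothetical, and the little-endian byte order is an assumption of this sketch.

```python
import struct


def resolve_twos_complement(value: int, bits: int = 32) -> int:
    """Map a signed integer onto its unsigned two's-complement value."""
    if not isinstance(value, int):
        # Only ints can be converted with the two's-complement method.
        raise TypeError("expected an int")
    return value & ((1 << bits) - 1)


def pack_permissions(p: int) -> bytes:
    """Pack the /P entry as a 4-byte unsigned long (little-endian here)."""
    return struct.pack("<L", resolve_twos_complement(p))
```

For example, `resolve_twos_complement(-4)` gives `4294967292`, which packs without error, whereas passing `-4` straight to `struct.pack("<L", ...)` raises `struct.error`.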
Pieter Marsman 1c4a4167ed
Fix failing test on develop & cleaning up test files (#319)
2019-10-26 18:42:33 +02:00
jbarlow83 733ddf7e57
Added: tests for extracting text from pdfs with Type3 fonts (#205)
2019-10-22 18:15:59 +02:00
Pieter Marsman 373c6e7b97
Added: extraction of JBIG2 encoded images (#311)
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46
2019-10-22 17:37:06 +02:00
Philippe Guglielmetti 82af7f0aac
issue #56 reproduced, solution attempt unsuccessful
2017-04-19 14:19:14 +02:00
Philippe Guglielmetti 7055862eaf
solves https://github.com/pdfminer/pdfminer.six/issues/50
2017-04-18 18:20:31 +02:00