898 Commits (a2e1d6a8bf1af359f6c3c85781086d2168ed6c1e)

Author SHA1 Message Date
Pieter Marsman a2e1d6a8bf Fix github actions tag regex 2022-03-19 20:53:14 +01:00
Pieter Marsman c2e516d6df Bump version 2022-03-19 20:49:22 +01:00
Pieter Marsman d89cc357ee
Add github action for releasing to pypi if git tag is added. (#727)
* Add github action for releasing to pypi if git tag is added.

* Checkout code and fix typos.

* Replace end with fi

* Strictly numeric version for testing.

* Remove obsolete Make commands for publishing

* Also create GitHub release

* Update pdfminer/__init__.py

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Remove test pypi release

* Use maintained github action for releasing

* Change tag format for versions

* Undo commenting pypi publishing

* Remove develop branch, since that will be removed in favor of adding tags for releases.

* Change version regex

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-03-19 20:46:00 +01:00
jwyawney 43c8fc8557
Ignore empty characters when analyzing layout (#689)
* Adding in checks for spurious lines that contain either only spaces or new line characters

* Added spurious lines check and unit tests

* Updated CHANGELOG.md with changes

* Simplify code

* Simplify code

* Simplify code

* Remove changes to lines that are not actually changed

* Format import

* Improve CHANGELOG.md

* Improve CHANGELOG.md

* Fix cicd

* Blacken

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-22 21:20:26 +01:00
Pieter Marsman 121235e24b
Raise more specific error if Pillow cannot be imported (#714)
* Raise specific warning if Pillow cannot be imported

* Improve error message

* Update docs

* Update CHANGELOG.md

* Update pdfminer/image.py

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-02-22 20:20:17 +01:00
Pieter Marsman b9a8920cdf
Check blackness in github actions (#711)
* Check blackness in github actions

* Blacken code

* Update github action names

* Add contributing guidelines on using black

* Add to checklist for PR
2022-02-11 22:46:51 +01:00
Pedro Nunes 830acff94c
Changed `log.info` to `log.debug` in six files (#690)
* `log.info` changed to `log.debug` in six files

* Fix indentation

* Remove from CHANGELOG.md since no functionality has changed

Co-authored-by: Pedro Nunes <pedro@paranamodapark.com.br>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-08 21:24:00 +01:00
Pieter Marsman 2254306a52 Update README.md badge for Continuous integration 2022-02-02 22:53:17 +01:00
Pieter Marsman 81f873e105 Update actions.yml so that it will run for all PRs 2022-02-02 22:45:05 +01:00
Pieter Marsman b84cfc98e0
Update development tools: travis ci to github actions, tox to nox, nose to pytest (#704)
* Replace tox with nox

* Replace travis with github actions

* Fix pytest, mypy and flake8 errors

* Add pytest.

* Run on all commits

* Remove nose

* Speedup slow tests to save GitHub actions minutes

* Added line to CHANGELOG.md

* Fix line too long in pdfdocument.py

* Update .github/workflows/actions.yml

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Improve actions.yml

* Fix error with nox name for mypy

* Add names for jobs

* Replace nose.raises with pytest.raises

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-02-02 22:24:32 +01:00
Andrew Baumann 1d1602e0c5
Added feature: page labels (#680)
* port page label code from pdfannots

* add tests and clean up

* more cleanup; harden against non-conforming input

* one more test

* update CHANGELOG

* cleanup & respond to review feedback (incomplete)

* Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.

* PageLabels *is* a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.

* fix type errors and cleanup slightly

 * fix mypy errors (including tweaking code to avoid problematic dynamic types)
 * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
 * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-01 10:08:05 +01:00
Pieter Marsman b19f9e7270
Remove obsolete returns (#707)
* Remove obsolete returns

* Update CHANGELOG.md

* Remove empty lines

* Remove more empty lines
2022-02-01 01:49:46 +01:00
Pieter Marsman 2610ef13af Revert "Remove obsolete returns"
This reverts commit c67abdfab0.
2022-02-01 01:36:17 +01:00
Pieter Marsman c67abdfab0 Remove obsolete returns 2022-02-01 01:35:35 +01:00
Tony(Baojia) Tong 4b138a6bc5
Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (#684)
* check obj type

* update changelog

* Update CHANGELOG.md

* add changes

* update change

* update changelog

* Use fallback in except clause

* Update changelog.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Co-authored-by: Tony Tong <baojia.tong@kensho.com>
2022-02-01 01:20:52 +01:00
htInEdin dc530f3a6f
Use logger.warn instead of warnings.warn if warning cannot be prevented by user (#673)
* Use logging.Logger.warning instead of warning.warn in most cases, following
 the Python official guidance that warning.warn is directed at _developers_,
 not users

 * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning,
			PDFNoValidXRefWarning

 * (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning

 * (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning

 * (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather
				  than PDFNoValidXRefWarning

* get name right

* make flake8 happy

* Keep warning classes such that this does not crash code when these warnings are explicitly ignored

* Update changelog to include pr ref

* Small textual change

* Remove patch

* No need to test whether the warning is actually raised. The tests in test_tools_dumppdf.py only check whether these PDFs are supported.

* Use logger as name for logger

* Add docs to legacy warnings

* Use logger.Logger.warn for failed decompression

* Add reference to docs describing when to use logger and warnings

Co-authored-by: Henry S. Thompson <ht@home.hst.name>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-26 20:41:12 +01:00
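The distinction drawn in #673 (log what the user cannot prevent, warn about what the caller can fix) can be sketched as below. This is a minimal illustration, not the project's code: the logger name `example` and both function names are assumptions made up for this note.

```python
import logging
import warnings

# Hypothetical logger name, chosen only for this illustration.
logger = logging.getLogger("example")


def report_unavoidable_problem(message: str) -> None:
    """The user cannot prevent this condition, so record it in the log."""
    logger.warning(message)


def report_avoidable_problem(message: str) -> None:
    """The calling code can be changed to avoid this, so emit a warning."""
    warnings.warn(message, UserWarning, stacklevel=2)
```

Callers can silence the second kind with `warnings.filterwarnings`, while the first kind is controlled through the logging configuration.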
crisptag c4ac514984
Change log.info into log.debug to make pdfinterp.py less verbose 2022-01-26 19:57:55 +01:00
Andrew Baumann 95dee8d67c
Fix regression in page layout that sometimes returned text lines out of order (#659)
* add a test

* fix the bug

* rewrap long lines

* update CHANGELOG

* re-merge CHANGELOG

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-26 19:55:08 +01:00
Andrew Baumann 9a644aae76
export type annotations in package (#679)
* export type annotations via our pypi package

* update CHANGELOG

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-25 22:11:17 +01:00
Andrew Baumann 24eb15cae5
fix typos in PR template (#681) 2022-01-25 22:08:14 +01:00
Andrew Baumann d87bd025dd
pdf2txt: clean up construction of LAParams from arguments (#682)
* Fix pdf2txt --boxes-flow=disabled

Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```

Related: Issue #477, PR #479

* update CHANGELOG

* merge CHANGELOG

* pdf2txt: clean up handling of layout parameter arguments
 * avoid specifying default values twice
 * construct LAParams earlier, rather than passing its components around
 * fix crash with --boxes_flow=disabled

* update CHANGELOG

* construct new LAParams, so _validate runs

* Improve readability of setting LAParams by explicitly copying them from parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method.

* Add cli argument for line_overlap

* Also use default values from LAParams for --detect-vertical and --all-texts

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-25 22:06:06 +01:00
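The traceback in #682 comes from the string "disabled" reaching arithmetic that expects a number. One way to avoid that class of error is to convert the command-line value before it is used. The sketch below is an assumption for illustration only: the helper name `parse_boxes_flow` is hypothetical and is not claimed to be what pdf2txt.py does.

```python
from typing import Optional


def parse_boxes_flow(value: str) -> Optional[float]:
    """Convert a command-line string into a float, or None for "disabled".

    Hypothetical helper: it only shows why the raw string must not be
    passed on to code that computes `1 - boxes_flow`.
    """
    if value.lower() == "disabled":
        return None
    return float(value)


flow = parse_boxes_flow("0.5")
disabled = parse_boxes_flow("disabled")
```

A function like this could be passed as the `type=` argument of `argparse.ArgumentParser.add_argument`, so the conversion happens while arguments are parsed.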
Pieter Marsman aa5dec252f Fixes jbig2 writer to write valid jb2 files
See: https://github.com/pdfminer/pdfminer.six/pull/653

Squashed commit of the following:

commit 8748c9fcddab0826cca243eee45c40d2b6611e80
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:40:50 2022 +0100

    Remove prints in test

commit bb977258a39fc7baa13bba1c3ea29726e17c0f6d
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:35:12 2022 +0100

    Cleanup exception handling for jbig2 global streams

commit cf0b47b01b7caad8acbd82097aadadb620606a8b
Merge: a5831d1 708dd20
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:29:15 2022 +0100

    Merge branch 'develop' into jbig2_fix

commit a5831d110a
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:59:17 2021 -0400

    flake8 tests

commit 18ffa29387
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:52:11 2021 -0400

    add description in changelog

commit 6c7ee43d6c
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:43:36 2021 -0400

    Fixes jbig2 writer to write valid jb2 files

    - closes #652
2022-01-23 21:41:08 +01:00
Pieter Marsman 708dd20465 Add support for JPEG2000 image encoding 2022-01-23 21:17:47 +01:00
Pieter Marsman b82229245a
Added test case for CCITTFaxDecoder (#700)
* array.array.tostring -> array.array.tobytes

The tostring method has been deprecated since Python 3.2 and was
removed altogether in 3.9. In Python 3.2 the method was renamed
to "tobytes"

Will close #641

* changelog entry

* test for tobytes

* Fix CHANGELOG.md

* Update CHANGELOG.md to PR that I can push on

* Simplify tests

Co-authored-by: Forest Gregg <fgregg@uchicago.edu>
2022-01-23 21:00:13 +01:00
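The rename described in #700 can be seen with the standard library alone. A short sketch, with sample values chosen only for illustration:

```python
import array

# An array of unsigned bytes; the values are arbitrary sample data.
values = array.array("B", [1, 2, 3])

# tobytes() is the replacement for the removed tostring() method.
raw = values.tobytes()
```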
Sylvain Thénault 10f6fb40c2
Attempt to handle decompression error on some broken PDF files (#637)
* Attempt to handle decompression error on some broken PDF files

From time to time we encounter files where no text is detected, even though
readers like evince read the PDF fine. After some digging, it turned out this is
because the PDF includes some badly compressed data. This may be fixed by
decompressing byte by byte and ignoring the error on the last check bytes
(arbitrarily found to be the last 3).

This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422
and the test file has been taken from there, so credits to @zegrep.

* Use a warning instead of raising an exception

where zlib error is detected before the CRC checksum.

* Add line to CHANGELOG.md

* Only try decompressing if not in strict mode

* Change error into warning because warning.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-12-11 18:25:19 +01:00
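The byte-by-byte approach described in #637 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the commit: the function name `decompress_tolerant` is hypothetical, and the sketch swallows any `zlib.error` instead of only those raised on the last check bytes.

```python
import zlib


def decompress_tolerant(data: bytes) -> bytes:
    """Decompress zlib data, keeping whatever was decoded before an error."""
    decompressor = zlib.decompressobj()
    result = b""
    try:
        # Feed one byte at a time so output produced before a failure is kept.
        for i in range(len(data)):
            result += decompressor.decompress(data[i:i + 1])
    except zlib.error:
        # Simplification: stop at the first error, wherever it occurs.
        pass
    return result


original = b"pdfminer " * 20
compressed = zlib.compress(original)
# Damage the final check byte to simulate a broken stream.
broken = compressed[:-1] + bytes([compressed[-1] ^ 0xFF])
recovered = decompress_tolerant(broken)
```

Feeding single bytes is slow, so a fallback like this would normally be tried only after a regular `zlib.decompress` call has failed.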
wind_chh c883f5e13f
Add support identity unicode cmap (#626)
Fixes #625 

* add support for Identity-H/V cmap fonts

* format code to pass flake8 check

* Remove indent

* Remove indent

* Use isinstance instead of type check

* Use or instead of any

* Use str in variable, instead of str.find()

* Fix mypy error: add typing annotations to get_unichr()

* Fix type of PDFCIDFont. Can be any type of CMapBase.

This is a quick fix, the entire cmap structure does not have proper inheritance.

* Added line to CHANGELOG.md

* Add separate class for IdentityUnicodeMap

* Remove ABC from CmapBase

* Remove ABC from CmapBase

* Remove blank line

Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-10-13 21:52:00 +02:00
Pieter Marsman da5b96828e Bump version to 20211012 2021-10-12 20:45:24 +02:00
Pieter Marsman 104883df41
Replace typing-extensions Literal with the type of the Literal & run mypy, nosetest and sphinx in their own environment on cicd (#677)
* Improve tox.ini by running flake8, mypy, nosetests and sphinx in their own environment.

Improves isolation. Dependencies of one package won't influence the next.

This should fail for the current setup with typing-extensions.

* Try to fix actually running tox tests on travis

* Use recent tox

* Fix using Literal[False] for open_filename.

None has the same truth value as False, and therefore it does not matter.

* Replace typing_extensions.Literal by the type of the literal

* Add line to CHANGELOG.md
2021-10-12 20:22:58 +02:00
Andrew Baumann 9406040d8e
Add type annotations (#661)
Squashed commit of the following:

commit fa229f7b7591c07aea4e5a4545f9e0c34246e1cd
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 20:33:06 2021 -0700

    Merge branch 'develop' into mypy (and fixed types)

commit eaab3c65e2e3ab5f1f400cfc5186a3834c4ffe34
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 20:00:45 2021 -0700

    reformat all multi-line function defs to one-arg-per-line

commit 3fe2b69eed9197009d9da6776462f580ebf0dfa3
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:58:48 2021 -0700

    ccitt nit -- avoid casting needlessly

commit 15983d8c1e7162632fde43752c9d1c15938cd980
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:58:36 2021 -0700

    tweak CHANGELOG

commit 13dc0babf782938e7d5b5e482d4c5adf92d82702
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:43:46 2021 -0700

    add failing tests for dumppdf crash

commit 6b509c517876b8c15ac5a98a963884e23bd2e4d8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:24:23 2021 -0700

    ccitt: apply misc PR feedback

commit feb031ba86d3f22e41cfbbda13f17c039359f1e6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:18:26 2021 -0700

    add missing None return type to all __init__ methods

commit c0d62d6c54c7ec37b40bea54a3f6a7a618ec0ec6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:13:08 2021 -0700

    minor cleanup, remove a few more Any types

commit b52a0594e1998a492c172538a9b35491c5fc5f52
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sun Sep 5 22:37:28 2021 -0700

    tighten up types, avoid Any in favour of explicit casts

commit e58fd48bd14f31bebd2de8259f12630ac02756d6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sun Sep 5 14:10:49 2021 -0700

    annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)

commit 605290633e55595e5e0045840df5c5b1d9de843a
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:37:38 2021 -0700

    python 3.7 back-compat

commit 4dbcf8760f8a1d3e3d99f085476f86e6a043c80c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:32:43 2021 -0700

    annotate pdfminer.jbig2

commit 0d40b7c03a8028dc44acd3f457eac71abd681827
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:31:33 2021 -0700

    annotate pdf2txt.py

commit 5f82eb4f5646b5d1285252689191e0a14557ec7b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 09:16:31 2021 -0700

    cleanup: make Plane generic

commit 624fc92b88473ff36a174760883f34c22109da2b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 23:16:51 2021 -0700

    bluntly ignore calls to cryptography.hazmat

commit 96b20439c169f40dbb114cabba6a582ad1ebe91e
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 23:01:06 2021 -0700

    finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2

commit 0ab586347861b72b1d16880dc9293f9ad597e20a
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 21:51:56 2021 -0700

    annotate pdffont

commit 4b689f1bcbdaf654feb9de81023e318ca310a12e
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 18:30:02 2021 -0700

    annotate a couple more scripts; document sketchy code

commit 291981ff3d273952ec9c92ef8ab948473558b787
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 15:02:01 2021 -0700

    pacify flake8

commit 45d2ce91ff333f3b7e34322b16e9c52b99b7a972
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 14:31:48 2021 -0700

    annotate dumppdf, and comment likely bugs

commit 7278d83851cb336a1be3803a0993b5ec0ad39b4c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:49:58 2021 -0700

    enable mypy on tests and tools, fix one implicit reexport bug

commit 4a83166ef4e4733cd2113f43188b585a4fda392b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:25:59 2021 -0700

    pdfdocument: per dumppdf.py, get_dest accepts either bytes or str

commit 43701e1bee068df98f378a253c9c2150ee4ad9f7
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:25:00 2021 -0700

    layout: LAParams.boxes_flow may be None

commit 164f81652f1788e74837466f0ab593e94079bc0f
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:45:09 2021 -0700

    add whitespace, pacify flake8

commit 893b9fb9ec918032b36a30456fc0b7a217da86d8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:40:33 2021 -0700

    support old Python without typing.Protocol

commit dc245084102b7b04c3f5599d75b5d62ba4290787
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:12:03 2021 -0700

    Move "# type: ignore" comments to fix mypy on Python < 3.8

    The placement of these comments got more flexible in 3.8 due to
    https://github.com/python/mypy/issues/1032

    Satisfying older Python and fitting in flake8's 79-character line
    limit was quite a challenge!

commit da03afe7bd2cf3336e611f467f1c901455940ae8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 22:59:58 2021 -0700

    fix text output from HTMLConverter

commit 5401276a2ed3b74a385ebcab5152485224146161
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 22:40:22 2021 -0700

    annotate high_level.py and the immediately-reachable internal APIs (mostly converters)

commit cc490513f8f17a7adc0bcbab2e0e86f37e832300
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 17:04:35 2021 -0700

     * expand and improve annotations in cmap, encryption/decompression and fonts
     * disallow untyped calls; this way, we have a core set of
       typed code that can grow over time
       (just not for ccitt, because there's a ton of work lurking there)
     * expand "typing: none" comments to suppress a specific error code

commit 92df54ba1d53d5dbbd5442757dd85be5b1851f99
Author: Andrew Baumann <ab@ab.id.au>
Date:   Wed Sep 1 20:50:59 2021 -0700

    update CHANGELOG

commit f72aaead45d0615e472a9b3190c9551a6b67b36e
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date:   Wed Sep 1 20:47:03 2021 -0700

    Merge branch 'develop' into mypy

commit ff787a93986c60361536a97182a41774f4a53ac3
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Aug 21 21:46:14 2021 -0700

    be more precise about types on ps/pdf stacks, remove most of the Any annotations

commit be1550189e10717f6827dbb7009d6e8c8b3f4c62
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Aug 21 10:13:58 2021 -0700

    silence missing imports, (maybe?) hook to tox

commit ff4b6a9bd46b352583d823d39065652c9a6f05f4
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 22:49:06 2021 -0700

    turn on more strict checks, and untangle the layout mess with generics

    Status:
    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
    Found 5 errors in 4 files (checked 27 source files)

    pdfdevice.py:191 appears to be a real bug

commit 5c9c0b19d26ae391aea0e69c2c819261cc04460c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 17:22:41 2021 -0700

    finish annotating layout

commit 0e6871c16abb29df2868ab145b4ce451b4b6c777
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 16:54:46 2021 -0700

    general progress on annotations
     * finish utils
     * annotate more of pdfinterp, pdfdevice
     * document reason for # type: ignore comments
     * fix cyclic imports
     * satisfy flake8

commit 17d59f42917fbf9b2b2eb844d3e83a8f2a3f123a
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Aug 19 21:38:50 2021 -0700

    WIP on type annotations

    With the possible exception of psparser.py, this is far from complete.

    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
2021-10-09 16:23:28 +02:00
htInEdin 33d7dde4d1
Fix bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed (#616)
* detect TextIOWrapper as non-binary

* I don't understand the CHANGELOG.md format, hope this is good enough

* Delete \\\r\n in Literal Strings (ref. section 7.3.4.2 of PDF32000_2008)

* Keep Travis CI happy

* Added test

* Remove pdfminer/Changelog

* Prettify _parse_string_1

* Add CHANGELOG.md

* Satisfy flake8

* Update CHANGELOG.md

* Use logging.Logger.warning instead of warning.warn in most cases, following
 the Python official guidance that warning.warn is directed at _developers_,
 not users

 * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning,
			PDFNoValidXRefWarning

 * (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning

 * (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning

 * (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather
				  than PDFNoValidXRefWarning

* get name right

* make flake8 happy

* Revert "make flake8 happy"

This reverts commit 4592769686.

* Revert "get name right"

This reverts commit 80091ea211.

* Revert "Use logging.Logger.warning instead of warning.warn in most cases, following"

This reverts commit 3c1e3d6606.

* Revert "Merge branch 'preferLoggingToWarning' into hst"

This reverts commit 9d9d139921, reversing
changes made to 80091ea211.

* Revert "Revert "Merge branch 'preferLoggingToWarning' into hst""

This reverts commit b3da21934d.

Co-authored-by: Henry S. Thompson <ht@home.hst.name>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-09-27 20:30:40 +02:00
Raphaël Cohen c3e3499a6b
Add support for ISO 32000-2 AES256 encryption (#614)
* feat: Add support for ISO 32000-2 AES256 encryption

* feat: Applies review suggestions
2021-09-06 22:00:23 +02:00
MapleCCC 8ea9f1091a
Fix typos in converting_pdf_to_text.rst (#611)
* Fix typos in converting_pdf_to_text.rst

* The word "pdfminer.six" as a whole should not be split by a newline; otherwise the renderer treats it as two separate words and displays them separately.

* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-31 20:52:13 +02:00
Mingye Wang 46fa21476a
Raise proper error when bad --output-type is used and fix formatting output of TagExtractor
* high_level: emit diagnostic for bad output_type

* TagExtractor: eliminate runtime error

This does not make it usable, but will satisfy my curiosity.

* Use if-elif-else structure

* Fix pycharm spacing warning

* Rename _write_outfp to _write

* Properly format tag names and tag values. Using utils.make_compat_str() such that the tag value is always a string.

* Update CHANGELOG.md

* Fix flake8 errors

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-31 20:46:20 +02:00
Fiete 7f54cefe02
Use visible imports in highlevel.rst documentation (#609)
* add missing import for extract_text_to_fp

* Replace testsetup with visible imports in documentation

* Remove obsolete check for python version; python 2 is not supported anymore

* (Unrelated to this MR) Remove sys from converter.py

* Optimize imports

* (Unrelated to this MR) fix line length error

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-30 22:17:21 +02:00
Daniele Procida 1d33c026e4
Updated link to Diátaxis documentation website (#606)
The canonical home of the documentation framework has moved
from documentation.divio.com to https://diataxis.fr.
2021-08-30 21:47:40 +02:00
estshorter 047a246512
Fix `AttributeError` when dumping a TOC with bytes destinations (#600)
* Fix an error when dumping a TOC

* Fix a bug where a TOC title variable is of type bytes

* Update CHANGELOG.md

* Update CHANGELOG.md

* Rename e() to escape() and merge two isinstance() checks

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-30 21:31:32 +02:00
Richard Millson a70f08818d
Fix 594 use null id when encrypted but no id given (#595)
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-29 21:32:14 +02:00
wind_chh 234c466372
Fix extraction of some cjk characters (#593)
Fixes #566 

* try to fix the issue where some Chinese characters cannot be extracted
correctly (#566).

* format code to pass flake8 check.

* fix typo and refer to issue 593.

Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 21:05:03 +02:00
X d821fed340
Fix typos in readthedocs documentation. (#579)
* Fix typos and possible mistakes.

* Revert two edits based on discussion in #579

Revert the two changes based on our discussion. 

I read the documentation and had a glimpse at the default code. And perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers.

Maybe it is useful to point that out in the figure but I am not sure how best to do it. 

Another option is to use something like `min_char_margin_threshold` or similar, in the hope that they are easier to understand. Just some thoughts!

* Triggering travis again

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 20:58:50 +02:00
Tony(Baojia) Tong 543976f195
Fix issue of ValueError and KeyError raised in PDFdocument and PDFparser (#574)
* check obj type

* update changelog

* Update CHANGELOG.md

* fix the bug

* fix condition

* update changelog

* update changelog again

* update changelog

* update

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Co-authored-by: Tony Tong <baojia.tong@kensho.com>
2021-08-26 20:55:02 +02:00
Eduardo Gonzalez Lopez de Murillas ea00f56ac6
Added support for Paeth PNG filter compression (predictor value = 4) (#537)
* Added support for Paeth PNG filter compression (predictor value = 4)

* Use `above` and `upper_left` as in the pseudo code

* Refactor: use variable names that are very close to the pseudo code and add pieces of the docs to show what is going on.

* Fix line length issues

* Add line about compressions to README.md

* Fix merge conflict on readme

* Fix bug in filter type Up

* Make if-else consistent

Co-authored-by: Eduardo Gonzalez Lopez de Murillas <eduardo.gonzalez@accha.nl>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 20:53:13 +02:00
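The Paeth predictor referenced in #537 can be sketched as follows. This is a minimal illustration that follows the pseudocode in the PNG specification, not the code from the commit. The function name `paeth_predictor` is an assumption; the argument names `above` and `upper_left` mirror those mentioned in the commit message.

```python
def paeth_predictor(left: int, above: int, upper_left: int) -> int:
    """Return whichever neighbour is closest to left + above - upper_left."""
    estimate = left + above - upper_left
    dist_left = abs(estimate - left)
    dist_above = abs(estimate - above)
    dist_upper_left = abs(estimate - upper_left)
    # Ties are broken in the order left, above, upper_left.
    if dist_left <= dist_above and dist_left <= dist_upper_left:
        return left
    elif dist_above <= dist_upper_left:
        return above
    else:
        return upper_left


# Reconstructing one filtered byte: add the prediction modulo 256.
filtered_byte = 7
reconstructed = (filtered_byte + paeth_predictor(50, 60, 55)) % 256
```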
Jake Stockwin 19c1372984
Fix for when 'trailer' is indented (#535)
* Fix for when trailer is indented

* Store stripped line

* This commit breaks things...

* Or maybe this one breaks things?

* Remove commented code because no longer used.

* Add CHANGELOG.md

* Add poetry venv management files to gitignore since I started using poetry to manage the python envs for this project

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-15 17:49:56 +02:00
Jeremy Singer-Vine 016239c146
Fix .paint_path handling of single line segments (#530)
* Fix .paint_path handling of single line segments

- Fixes typo ("ml" should have been "mlh")

- Removes if-statement that required individual line segments to be
  strictly horizontal or vertical.

* Treat 'ml'-shape paths as lines not curves

Although 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.

Adds tests and sample PDF.

* Fix trailing whitespace

* Fix point-extraction from Bézier path commands

This commit corrects the manner in which "pts" are extracted from Bézier
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on the command (c, v, or y),
the code was extracting some combination of control points (not on
curve) and the actual points-on-curve.

This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.

* Add comments to test_paint_path_quadrilaterals

* Parse rect-forming mllll paths as rects not curves

Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.

* One changelog line with ref to mr

* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring

* Extract variables from if statement to make it easier to read

* Optimize imports order

* Trigger travis build

* Revert "Trigger travis build"

This reverts commit 41c05184

* Update travis badge

* Update travis badge

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-07-27 18:27:32 +02:00
Jürgen Gmach 22f90521b8
Use python3.9 in tox config
* tox: use Python 3.9 final

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-03-11 20:46:31 +01:00
Pieter Marsman 761410e66c
Fix cryptography build in travis cicd by upgrading distribution from Trusty Tahr to Focal Fossa (#585)
* Update .travis.yml

* Also change 3.9-dev to 3.9 because that is now supported by travis
2021-02-20 10:32:07 +01:00
markfirmware f389b97923
Correct typos and syntax errors in README.md (#538) 2020-11-08 16:20:10 +01:00
Ev2geny 693e4f48a3
Issue #469 is fixed (When run on Windows a lot of tests fail with the error: [Errno 13] Permission denied) (#484)
Closes #469

* Issue #469 is fixed

* one extra comment to code is added

* TemporaryFilePath context manager is added to facilitate tests

* flake8 complaints fixed

* Update docs of tempfilepath.py

* Fix flake8

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-26 10:10:11 +01:00
Pieter Marsman f8e6ad6ac1
Remove support for non-standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream (#523)
Closes #191 

* Remove support for non-standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream

* Add docstring

* Fix flake8
2020-10-25 14:37:12 +01:00
EucliTs0 fc75972bbd
Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' (#529)
Closes #518 

* Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2'

An error occurs when the 'DW2' key contains a PDFObjRef object instead of a list of int values, e.g. 'DW2': <PDFObjRef:152>.
To solve this issue, we use the resolve1() function.

See: https://github.com/pdfminer/pdfminer.six/issues/518

* Updated CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Dimitrios TSOLAKIDIS <dimitrios.tsolakidis@vialink.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-25 14:34:45 +01:00
Pieter Marsman 178a831802
Revert "Fix for when 'trailer' is indented (#513)" (#534)
This reverts commit ec223d1f1d.
2020-10-25 13:22:42 +01:00