Commit Graph

925 Commits (20221105)

Author SHA1 Message Date
Pieter Marsman ebf7bcdb98
Add FAQ about special characters (#829)
* Add FAQ for extracting special characters

* Update CHANGELOG.md

* Update faq.rst
2022-11-05 17:22:08 +01:00
Pieter Marsman 3688911afe
Fix small typos in documentation (#828)
* Fix #795

* Documentation updates (FAQ and others)

* New how-to for extracting coordinates

* Indent fix in documentation

* Revert "Fix #795"

This reverts commit cac62171fc.

* Move description of iterating LTPage to the docstring of LTPage

* Remove adding how-to for extracting coordinates from this pr

* Add CHANGELOG.md

* Remove FAQ from this branch

* Only add one line to CHANGELOG.md

Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>
2022-11-05 17:08:23 +01:00
Pieter Marsman fa71062c35
Fix `ValueError` when extracting images, due to breaking changes in Pillow (#827)
* Fix #795

* Update CHANGELOG.md

Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>
2022-11-05 16:44:15 +01:00
Pieter Marsman 769dbb6343
Consistent instructions for how to install and use pdfminer.six (#793) 2022-11-05 16:30:39 +01:00
Jeremy Singer-Vine ad6587c697
Fix to set color space from color convenience ops (#794)
Section 4.5 of the PDF reference says: "Color values are interpreted
according to the current color space, another parameter of the graphics
state. A PDF content stream first selects a color space by invoking the
CS operator (for the stroking color) or the cs operator (for the
non-stroking color). It then selects color values within that color
space with the SC operator (stroking) or the sc operator (nonstroking).
There are also convenience operators—G, g, RG, rg, K, and k—that select
both a color space and a color value within it in a single step."

Previously, those convenience operators did *not* set the color space.
This commit, following on filed issue #779, fixes this. It also adds a
test to demonstrate that, at least for the do_rg method, the fix works
as intended.
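
The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the pdfminer.six implementation: the `GraphicState` class and its `ncs`/`ncolor` fields are hypothetical stand-ins, and only the `do_rg` name comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GraphicState:
    """Hypothetical, simplified graphics state (non-stroking side only)."""

    ncs: Optional[str] = None                    # non-stroking color space name
    ncolor: Optional[Tuple[float, ...]] = None   # non-stroking color value


def do_rg(state: GraphicState, r: float, g: float, b: float) -> None:
    """Model the `rg` convenience operator.

    It selects the DeviceRGB color space *and* sets the color value,
    instead of setting the color value alone.
    """
    state.ncs = "DeviceRGB"
    state.ncolor = (r, g, b)


state = GraphicState()
do_rg(state, 1.0, 0.0, 0.0)
```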
2022-08-18 20:38:51 +02:00
sobuen ca9f75a032
Added font name aliases for Arial, Courier New and Times New Roman (#790)
* Fix `unknown` fontname in TrueType(Arial, TimesNewRoman) (#767)

* Add changelog

* Cleanup CHANGELOG.md

* Add comment with source of alias names

Co-authored-by: thirakawa <ewjohnp@gmail.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-08-14 12:12:02 +02:00
Richard Hudson 77df431871
Add HOCRConverter (fixes #650) (#651)
* Add HOCRConverter

* Add line to README.md

* Test cicd

* Test cicd 2

* Changes based on review comments

* Remove whitespace changes to CHANGELOG.md

* Remove duplicated html output

* Add link to hocr wiki

* Add tests for extracting hocr and html

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-08-14 11:52:50 +02:00
pettzilla1 f79ad56f48
Fix ValueError when bmp images with 1 bit channels are decoded (fixes #773) (#784)
* Update utils.py

bitspercomponent = 1 is defined and stored as a .bmp; this worked when I tested it

* Update utils.py

() replaced with []

* Update CHANGELOG.md

added changes for pull request

* Update for flake

* Update CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-08-08 22:35:53 +02:00
Nitesh Oswal 7b7889ff6a
Update README.md (#787)
Update pip install quote for optional extra dependency for extracting images
2022-08-08 22:21:39 +02:00
Pieter Marsman 8f52578e85
Run black locally with nox (#776)
* Run black locally with nox

* Update contributor instructions

* Fix workflow
2022-06-26 18:25:28 +02:00
Pieter Marsman 4733eb333a
Install typing_extensions on Python 3.6 and 3.7 (#775)
* Install typing_extensions on Python 3.6 and 3.7

* Add CHANGELOG.md

* Black setup.py
2022-06-26 17:47:28 +02:00
Christian Christiansen ebf92acf0c
Fix `TypeError` by ignoring null characters in PSBaseParser (#768)
* Ignore null characters in PSBaseParser

Beforehand, null characters were encoded as PSKeyword tokens. This caused
issue #617, as pdfdevice.py would attempt to decode the null character
PSKeyword, when it expects a byte string, as opposed to a PSKeyword, causing
pdfminer.six to crash.

As null characters are superfluous within PSBaseParser, ignore them.

* Update CHANGELOG.md
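
A minimal sketch of the idea behind ignoring null characters, not the actual PSBaseParser code: the hypothetical `tokenize` helper below assumes that NUL bytes can be treated as ignorable separators, so they never become tokens of their own.

```python
from typing import List


def tokenize(data: bytes) -> List[bytes]:
    """Split a byte string into whitespace-separated tokens.

    Assumption for this sketch: NUL bytes carry no meaning, so they are
    mapped to spaces before splitting and never show up as tokens.
    """
    return data.replace(b"\x00", b" ").split()


tokens = tokenize(b"BT\x00 /F1 12 Tf\x00ET")
```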

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-06-26 17:46:39 +02:00
Florian Apolloner f63e9fbee9
Fix `ValueError` with unencrypted metadata values (Fixes #766). (#774)
* Fix crash with unencrypted metadata values (pdfminer#766).

* Explicitly check for length

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-06-26 17:25:30 +02:00
gosiafilipek 1044fc05e8
Fix `TypeError` when getting default width of font (#772)
* Issue #720

resolve1 when getting the default width.

* Add CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-06-25 23:16:28 +02:00
Pieter Marsman 6cbee25b3e
Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. (#756)
* Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes.

* Add CHANGELOG.md

* Cleanup CHANGELOG.md

* Cleanup CHANGELOG.md

* Undo deleting conf_glyphlist.py and conf_afm.py and add a deprecation warning instead
2022-06-25 23:11:10 +02:00
Chris Mayo 86e34873e4
Fix Sphinx warnings and error (#760)
* Fix Sphinx warnings

howto/acro_forms.rst:4: WARNING: Title underline too short.
howto/acro_forms.rst:81: WARNING: Bullet list ends without a blank line; unexpected unindent.
howto/acro_forms.rst:88: WARNING: Bullet list ends without a blank line; unexpected unindent.
howto/acro_forms.rst:122: WARNING: Bullet list ends without a blank line; unexpected unindent.
tutorial/extract_pages.rst:6: WARNING: Failed to create a cross reference. A title or caption not found: api_extract_pages

* Fix documenting pdf2txt.py

reference/commandline.rst:12: ERROR: Module "tools.pdf2txt" has no attribute "maketheparser"
Incorrect argparse :module: or :func: values?

* Add CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-05-24 20:07:04 +02:00
Pieter Marsman 0b09d5f8db
Update CHANGELOG.md for #755 2022-05-24 19:41:54 +02:00
Philippe Ombredanne 7f97e26869
Remove upper version bounds (#755)
Using an upper bound for dependency versions on a library
is a source of troubles for users.
Let's not do it as it makes pdfminer wreak havoc downstream.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
2022-05-07 20:35:18 +02:00
Jeremy Singer-Vine f2c967f500
Ignore path constructors that do not begin with m (#749)
* Ignore path constructors that do not begin with m

Per PDF Reference Section 4.4.1, "path construction operators may be
invoked in any sequence, but the first one invoked must be m or re to
begin a new subpath." Since pdfminer.six already converts all `re`
(rectangle) operators to their equivalent `mlllh` representation, paths
ingested by `.paint_path(...)` that do not begin with the `m` operator
are invalid.

In addition to the advantage of hewing to the PDF Reference, this change
also avoids the `ValueError: not enough values to unpack (expected 2,
got 1)` error raised by the `pts = [apply_matrix_pt(self.ctm, pt) for
pt in raw_pts]` line in `converter.py` when parsing PDFs that
(erroneously) include `("h",)` paths.

* Update CHANGELOG.md
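
The rule from the first item above (a path must begin with `m`) can be sketched as a small check. This is an illustration under assumptions, not the code in `converter.py`: the `is_valid_path` helper and the tuple layout of the segments are hypothetical.

```python
from typing import List, Tuple

# Assumed segment layout for this sketch: ("m", x, y), ("l", x, y), ("h",), ...
PathSegment = Tuple


def is_valid_path(path: List[PathSegment]) -> bool:
    """Return True only if the path is non-empty and starts with `m` (moveto)."""
    return bool(path) and path[0][0] == "m"


good = [("m", 0, 0), ("l", 10, 0), ("l", 10, 10), ("h",)]
bad = [("h",)]  # the kind of path that triggered the ValueError
```

A caller could skip any path for which `is_valid_path` returns False before transforming its points.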

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-05-06 22:15:00 +02:00
Pieter Marsman e19aea932d Bump version 20220506 & fix small issue with types 2022-05-06 22:02:32 +02:00
Pieter Marsman 1bf3c42b59
Use charset-normalizer instead of chardet (#744)
* Use charset-normalizer instead of chardet

* Ignore charset_normalizer type stub

* Add CHANGELOG.md
2022-04-20 21:42:50 +02:00
Pieter Marsman 617e4c8388
Refactor ImageWriter and add method for exporting an image from bytes. (#737)
* Refactor ImageWriter and add method for exporting an image from bytes.

E.g. when FlateDecode just results in a list of RGB bytes.

* Added docstrings

* Add CHANGELOG.md

* Run black

* Run black
2022-03-22 20:58:16 +01:00
Pieter Marsman 894dabf264
Log warning and continue gracefully if errors in cmap (#731)
* Log warning and continue gracefully if errors in cmap

* Fix nox testing

* Also log warning if cid range is larger than actual code

* Format with black

* Add docstring

* Add CHANGELOG.md

* Restore running cmapdb.py directly
2022-03-21 19:39:53 +01:00
Pieter Marsman 13021c9875
Fix log.debug statement in lzw.py by ensuring that self.table is always set (#732)
* Fix log.debug statement in lzw.py by ensuring that self.table is always set.

* Add CHANGELOG.md
2022-03-21 19:27:22 +01:00
Pieter Marsman 782368b911
Raise KeyError when name in name2unicode is not of type str (#733)
* Raise KeyError when name in name2unicode is not of type str

* Add CHANGELOG.md
2022-03-21 19:25:28 +01:00
Pieter Marsman e27cd54aff
Convert fontname to str if it is bytes in HTMLConverter (#734)
* Convert fontname to str if it is bytes

* Add CHANGELOG.md
2022-03-21 19:20:42 +01:00
Pieter Marsman ae7f315746 Fix github actions tag regex 2022-03-19 21:10:02 +01:00
Pieter Marsman a2e1d6a8bf Fix github actions tag regex 2022-03-19 20:53:14 +01:00
Pieter Marsman c2e516d6df Bump version 2022-03-19 20:49:22 +01:00
Pieter Marsman d89cc357ee
Add github action for releasing to pypi if git tag is added. (#727)
* Add github action for releasing to pypi if git tag is added.

* Checkout code and fix typos.

* Replace end with fi

* Strictly numeric version for testing.

* Remove obsolete Make commands for publishing

* Also create GitHub release

* Update pdfminer/__init__.py

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Remove test pypi release

* Use maintained github action for releasing

* Change tag format for versions

* Undo commenting pypi publishing

* Remove develop branch, since that will be removed in favor of adding tags for releases.

* Change version regex

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-03-19 20:46:00 +01:00
jwyawney 43c8fc8557
Ignore empty characters when analyzing layout (#689)
* Adding in checks for spurious lines that contain either only spaces or new line characters

* Added spurious lines check and unit tests

* Updated CHANGELOG.md with changes

* Simplify code

* Simplify code

* Simplify code

* Remove changes to lines that are not actually changed

* Format import

* Improve CHANGELOG.md

* Improve CHANGELOG.md

* Fix cicd

* Blacken

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-22 21:20:26 +01:00
Pieter Marsman 121235e24b
Raise more specific error if Pillow cannot be imported (#714)
* Raise specific warning if Pillow cannot be imported

* Improve error message

* Update docs

* Update CHANGELOG.md

* Update pdfminer/image.py

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-02-22 20:20:17 +01:00
Pieter Marsman b9a8920cdf
Check blackness in github actions (#711)
* Check blackness in github actions

* Blacken code

* Update github action names

* Add contributing guidelines on using black

* Add to checklist for PR
2022-02-11 22:46:51 +01:00
Pedro Nunes 830acff94c
Changed `log.info` to `log.debug` in six files (#690)
* `log.info` changed to `log.debug` in six files

* Fix indentation

* Remove from CHANGELOG.md since no functionality has changed

Co-authored-by: Pedro Nunes <pedro@paranamodapark.com.br>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-08 21:24:00 +01:00
Pieter Marsman 2254306a52 Update README.md badge for Continuous integration 2022-02-02 22:53:17 +01:00
Pieter Marsman 81f873e105 Update actions.yml so that it will run for all PR's 2022-02-02 22:45:05 +01:00
Pieter Marsman b84cfc98e0
Update development tools: travis ci to github actions, tox to nox, nose to pytest (#704)
* Replace tox with nox

* Replace travis with github actions

* Fix pytest, mypy and flake8 errors

* Add pytest.

* Run on all commits

* Remove nose

* Speedup slow tests to save GitHub actions minutes

* Added line to CHANGELOG.md

* Fix line too long in pdfdocument.py

* Update .github/workflows/actions.yml

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>

* Improve actions.yml

* Fix error with nox name for mypy

* Add names for jobs

* Replace nose.raises with pytest.raises

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-02-02 22:24:32 +01:00
Andrew Baumann 1d1602e0c5
Added feature: page labels (#680)
* port page label code from pdfannots

* add tests and clean up

* more cleanup; harden against non-conforming input

* one more test

* update CHANGELOG

* cleanup & respond to review feedback (incomplete)

* Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.

* PageLabels *is* a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.

* fix type errors and cleanup slightly

 * fix mypy errors (including tweaking code to avoid problematic dynamic types)
 * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
 * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-01 10:08:05 +01:00
Pieter Marsman b19f9e7270
Remove obsolete returns (#707)
* Remove obsolete returns

* Update CHANGELOG.md

* Remove empty lines

* Remove more empty lines
2022-02-01 01:49:46 +01:00
Pieter Marsman 2610ef13af Revert "Remove obsolete returns"
This reverts commit c67abdfab0.
2022-02-01 01:36:17 +01:00
Pieter Marsman c67abdfab0 Remove obsolete returns 2022-02-01 01:35:35 +01:00
Tony(Baojia) Tong 4b138a6bc5
Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (#684)
* check obj type

* update changelog

* Update CHANGELOG.md

* add changes

* update change

* update changelog

* Use fallback in except clause

* Update changelog.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Co-authored-by: Tony Tong <baojia.tong@kensho.com>
2022-02-01 01:20:52 +01:00
htInEdin dc530f3a6f
Use logger.warn instead of warnings.warn if warning cannot be prevented by user (#673)
* Use logging.Logger.warning instead of warnings.warn in most cases, following
 the official Python guidance that warnings.warn is directed at _developers_,
 not users

 * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning,
			PDFNoValidXRefWarning

 * (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning

 * (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning

 * (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather
				  than PDFNoValidXRefWarning

* get name right

* make flake8 happy

* Keep warning classes such that this does not crash code when these warnings are explicitly ignored

* Update changelog to include pr ref

* Small textual change

* Remove patch

* No need for testing if the warning is actually raised. The tests in test_tools_dumppdf.py just check whether these pdfs are supported.

* Use logger as name for logger

* Add docs to legacy warnings

* Use logger.Logger.warn for failed decompression

* Add reference to docs describing when to use logger and warnings
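
The distinction drawn in the items above can be sketched with the standard library alone. The two helper names are hypothetical and only illustrate the split: problems a user cannot prevent go to a logger, while problems a developer can fix in their own code go through `warnings.warn`.

```python
import logging
import warnings

logger = logging.getLogger(__name__)


def report_unavoidable_problem(message: str) -> None:
    """Log a problem the caller cannot prevent (e.g. a malformed input file)."""
    logger.warning(message)


def report_avoidable_problem(message: str) -> None:
    """Warn about a problem the calling code could avoid by changing itself."""
    warnings.warn(message, UserWarning, stacklevel=2)
```

Log records can be silenced or redirected through the logging configuration, whereas warnings are controlled with the `warnings` filters.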

Co-authored-by: Henry S. Thompson <ht@home.hst.name>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-26 20:41:12 +01:00
crisptag c4ac514984
Change log.info into log.debug to make pdfinterp.py less verbose 2022-01-26 19:57:55 +01:00
Andrew Baumann 95dee8d67c
Fix regression in page layout that sometimes returned text lines out of order (#659)
* add a test

* fix the bug

* rewrap long lines

* update CHANGELOG

* re-merge CHANGELOG

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-26 19:55:08 +01:00
Andrew Baumann 9a644aae76
export type annotations in package (#679)
* export type annotations via our pypi package

* update CHANGELOG

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-25 22:11:17 +01:00
Andrew Baumann 24eb15cae5
fix typos in PR template (#681) 2022-01-25 22:08:14 +01:00
Andrew Baumann d87bd025dd
pdf2txt: clean up construction of LAParams from arguments (#682)
* Fix pdf2txt --boxes-flow=disabled

Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```

Related: Issue #477, PR #479

* update CHANGELOG

* merge CHANGELOG

* pdf2txt: clean up handling of layout parameter arguments
 * avoid specifying default values twice
 * construct LAParams earlier, rather than passing its components around
 * fix crash with --boxes_flow=disabled

* update CHANGELOG

* construct new LAParams, so _validate runs

* Improve readability of setting LAParams by explicitly copying them from parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method.

* Add cli argument for line_overlap

* Also use default values from LAParams for --detect-vertical and --all-texts
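
The crash in the traceback above comes from the string `"disabled"` reaching arithmetic that expects a number. A minimal sketch of converting the argument up front is shown below. The `parse_boxes_flow` helper is hypothetical, and mapping `"disabled"` to `None` is an assumption of this sketch rather than a statement about how `pdf2txt.py` does it.

```python
from typing import Optional


def parse_boxes_flow(value: str) -> Optional[float]:
    """Convert a --boxes-flow command-line string into a usable value.

    Assumption: "disabled" is represented as None; anything else must
    parse as a float, otherwise float() raises ValueError.
    """
    if value == "disabled":
        return None
    return float(value)


flow = parse_boxes_flow("0.5")
disabled = parse_boxes_flow("disabled")
```

Downstream code then has to handle `None` explicitly before evaluating an expression such as `(1 - boxes_flow) * obj.x0`.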

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-01-25 22:06:06 +01:00
Pieter Marsman aa5dec252f Fixes jbig2 writer to write valid jb2 files
See: https://github.com/pdfminer/pdfminer.six/pull/653

Squashed commit of the following:

commit 8748c9fcddab0826cca243eee45c40d2b6611e80
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:40:50 2022 +0100

    Remove prints in test

commit bb977258a39fc7baa13bba1c3ea29726e17c0f6d
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:35:12 2022 +0100

    Cleanup exception handling for jbig2 global streams

commit cf0b47b01b7caad8acbd82097aadadb620606a8b
Merge: a5831d1 708dd20
Author: Pieter Marsman <pietermarsman@gmail.com>
Date:   Sun Jan 23 21:29:15 2022 +0100

    Merge branch 'develop' into jbig2_fix

commit a5831d110a
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:59:17 2021 -0400

    flake8 tests

commit 18ffa29387
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:52:11 2021 -0400

    add description in changelog

commit 6c7ee43d6c
Author: Forest Gregg <fgregg@datamade.us>
Date:   Sun Aug 1 22:43:36 2021 -0400

    Fixes jbig2 writer to write valid jb2 files

    - closes #652
2022-01-23 21:41:08 +01:00
Pieter Marsman 708dd20465 Add support for JPEG2000 image encoding 2022-01-23 21:17:47 +01:00