Section 4.5 of the PDF reference says: "Color values are interpreted
according to the current color space, another parameter of the graphics
state. A PDF content stream first selects a color space by invoking the
CS operator (for the stroking color) or the cs operator (for the
non-stroking color). It then selects color values within that color
space with the SC operator (stroking) or the sc operator (nonstroking).
There are also convenience operators—G, g, RG, rg, K, and k—that select
both a color space and a color value within it in a single step."
Previously, those convenience operators did *not* set the color space.
This commit, following up on issue #779, fixes this. It also adds a
test to demonstrate that the fix works as intended, at least for the
do_rg method.
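The behaviour described above can be sketched as follows. This is a minimal illustration under assumptions, not the pdfminer.six implementation: `GraphicsState`, `ncs` and `ncolor` are hypothetical names, and only the operator semantics come from the PDF reference.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GraphicsState:
    """Hypothetical, simplified graphics state (non-stroking side only)."""

    ncs: str = "DeviceGray"              # current non-stroking color space
    ncolor: Tuple[float, ...] = (0.0,)   # current non-stroking color value


def do_rg(state: GraphicsState, r: float, g: float, b: float) -> None:
    """Handle the `rg` operator: select DeviceRGB and set the color in one step."""
    state.ncs = "DeviceRGB"   # the step the convenience operators used to skip
    state.ncolor = (r, g, b)


state = GraphicsState()
do_rg(state, 1.0, 0.0, 0.0)
```

After the call, both the color space and the color value reflect the `rg` operands, so later code that interprets `ncolor` according to `ncs` sees a consistent pair.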
* Add HOCRConverter
* Add line to README.md
* Test cicd
* Test cicd 2
* Changes based on review comments
* Remove whitespace changes to CHANGELOG.md
* Remove duplicated html output
* Add link to hocr wiki
* Add tests for extracting hocr and html
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Update utils.py
bitspercomponent = 1 is defined and the image is stored as a .bmp; this worked when I tested it
* Update utils.py
() replaced with []
* Update CHANGELOG.md
added changes for pull request
* Update for flake
* Update CHANGELOG.md
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Ignore null characters in PSBaseParser
Beforehand, null characters were encoded as PSKeyword tokens. This caused
issue #617, as pdfdevice.py would attempt to decode the null character
PSKeyword, when it expects a byte string, as opposed to a PSKeyword, causing
pdfminer.six to crash.
As null characters are superfluous within PSBaseParser, ignore them.
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
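As an illustration of the null-character fix above, here is a minimal sketch of a tokenizer that skips NUL bytes instead of turning them into keyword tokens. `split_keywords` is a hypothetical helper, not part of PSBaseParser; it relies only on the fact that the PDF reference lists NUL among the white-space characters.

```python
from typing import List

# PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE
WHITESPACE = b"\x00\t\n\x0c\r "


def split_keywords(data: bytes) -> List[bytes]:
    """Split a byte string into tokens, treating NUL like any other whitespace."""
    tokens: List[bytes] = []
    current = bytearray()
    for byte in data:
        if byte in WHITESPACE:
            # A separator ends the current token (if any) and is otherwise ignored.
            if current:
                tokens.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens


tokens = split_keywords(b"BT\x00 /F1 \x00ET")
```

Because the NUL bytes never become tokens, downstream code only ever sees the byte strings it expects.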
* Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes.
* Add CHANGELOG.md
* Cleanup CHANGELOG.md
* Cleanup CHANGELOG.md
* Undo deleting conf_glyphlist.py and conf_afm.py and add a deprecation warning instead
* Fix Sphinx warnings
howto/acro_forms.rst:4: WARNING: Title underline too short.
howto/acro_forms.rst:81: WARNING: Bullet list ends without a blank line; unexpected unindent.
howto/acro_forms.rst:88: WARNING: Bullet list ends without a blank line; unexpected unindent.
howto/acro_forms.rst:122: WARNING: Bullet list ends without a blank line; unexpected unindent.
tutorial/extract_pages.rst:6: WARNING: Failed to create a cross reference. A title or caption not found: api_extract_pages
* Fix documenting pdf2txt.py
reference/commandline.rst:12: ERROR: Module "tools.pdf2txt" has no attribute "maketheparser"
Incorrect argparse :module: or :func: values?
* Add CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Using an upper bound for dependency versions in a library
is a source of trouble for users.
Let's not do it, as it makes pdfminer wreak havoc downstream.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* Ignore path constructors that do not begin with m
Per PDF Reference Section 4.4.1, "path construction operators may be
invoked in any sequence, but the first one invoked must be m or re to
begin a new subpath." Since pdfminer.six already converts all `re`
(rectangle) operators to their equivalent `mlllh` representation, paths
ingested by `.paint_path(...)` that do not begin with the `m` operator
are invalid.
In addition to the advantage of hewing to the PDF Reference, this change
also avoids the `ValueError: not enough values to unpack (expected 2,
got 1)` error raised by the ` pts = [apply_matrix_pt(self.ctm, pt) for
pt in raw_pts]` line in `converter.py` when parsing PDFs that
(erroneously) include `("h",)` paths.
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
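The rule quoted from Section 4.4.1 can be sketched as a small check. This is an assumption-laden illustration: `begins_with_moveto` is a hypothetical helper, and the tuple layout `("m", x, y)` / `("h",)` is only inferred from the `("h",)` example in the message above.

```python
from typing import Sequence, Tuple


def begins_with_moveto(path: Sequence[Tuple]) -> bool:
    """Return True if the path's first operator is `m`.

    `re` does not need to be handled here, on the assumption (stated in the
    commit message) that rectangles were already rewritten to `mlllh`.
    """
    return len(path) > 0 and path[0][0] == "m"


valid = [("m", 0, 0), ("l", 10, 0), ("l", 10, 10), ("h",)]
invalid = [("h",)]

kept = [p for p in (valid, invalid) if begins_with_moveto(p)]
```

Filtering before any coordinate transformation means a lone `("h",)` segment, which carries no point to unpack, is never reached.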
* Refactor ImageWriter and add method for exporting an image from bytes.
E.g. when FlateDecode just results in a list of RGB bytes.
* Added docstrings
* Add CHANGELOG.md
* Run black
* Run black
* Log warning and continue gracefully if errors in cmap
* Fix nox testing
* Also log warning if cid range is larger than actual code
* Format with black
* Add docstring
* Add CHANGELOG.md
* Restore running cmapdb.py directly
* Add github action for releasing to pypi if git tag is added.
* Checkout code and fix typos.
* Replace end with fi
* Strictly numeric version for testing.
* Remove obsolete Make commands for publishing
* Also create GitHub release
* Update pdfminer/__init__.py
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
* Remove test pypi release
* Use maintained github action for releasing
* Change tag format for versions
* Undo commenting pypi publishing
* Remove develop branch, since that will be removed in favor of adding tags for releases.
* Change version regex
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
* Adding in checks for spurious lines that contain either only spaces or new line characters
* Added spurious lines check and unit tests
* Updated CHANGELOG.md with changes
* Simplify code
* Simplify code
* Simplify code
* Remove changes to lines that are not actually changed
* Format import
* Improve CHANGELOG.md
* Improve CHANGELOG.md
* Fix cicd
* Blacken
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* `log.info` changed to `log.debug` in six files
* Fix indentation
* Remove from CHANGELOG.md since no functionality has changed
Co-authored-by: Pedro Nunes <pedro@paranamodapark.com.br>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Replace tox with nox
* Replace travis with github actions
* Fix pytest, mypy and flake8 errors
* Add pytest.
* Run on all commits
* Remove nose
* Speedup slow tests to save GitHub actions minutes
* Added line to CHANGELOG.md
* Fix line too long in pdfdocument.py
* Update .github/workflows/actions.yml
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
* Improve actions.yml
* Fix error with nox name for mypy
* Add names for jobs
* Replace nose.raises with pytest.raises
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
* port page label code from pdfannots
* add tests and clean up
* more cleanup; harden against non-conforming input
* one more test
* update CHANGELOG
* cleanup & respond to review feedback (incomplete)
* Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.
* PageLabels *is* a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.
* fix type errors and cleanup slightly
* fix mypy errors (including tweaking code to avoid problematic dynamic types)
* hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
* avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
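For readers unfamiliar with number trees, the structure that the page-label commits above refer to can be sketched like this. The sketch assumes the tree is already available as nested Python dicts with `Nums` and `Kids` entries, as in the PDF reference; `number_tree_items` is a hypothetical function and not the NumberTree class from this change.

```python
from typing import Any, Dict, List, Tuple


def number_tree_items(node: Dict[str, Any]) -> List[Tuple[int, Any]]:
    """Flatten a number tree into (key, value) pairs sorted by key."""
    items: List[Tuple[int, Any]] = []

    # A leaf stores its entries as a flat array: [key1, value1, key2, value2, ...]
    nums = node.get("Nums", [])
    items.extend(zip(nums[0::2], nums[1::2]))

    # An intermediate node delegates to its children.
    for kid in node.get("Kids", []):
        items.extend(number_tree_items(kid))

    # Sort once at the end rather than trusting the input order.
    return sorted(items, key=lambda item: item[0])


tree = {
    "Kids": [
        {"Nums": [4, {"S": "D"}]},
        {"Nums": [0, {"S": "r"}, 2, {"S": "A"}]},
    ]
}
items = number_tree_items(tree)
```

Sorting at the end keeps the function tolerant of non-conforming input where the kids are out of order.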
* Use logging.Logger.warning instead of warnings.warn in most cases, following
the official Python guidance that warnings.warn is directed at _developers_,
not users
* (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning,
PDFNoValidXRefWarning
* (pdfpage.py) Don't import warnings, don't use PDFTextExtractionNotAllowedWarning
* (tools/dumppdf.py) Don't import warnings, don't use PDFNoValidXRefWarning
* (tests/test_tools_dumppdf.py) Don't import warnings, check for logging.WARN rather
than PDFNoValidXRefWarning
* get name right
* make flake8 happy
* Keep warning classes such that this does not crash code when these warnings are explicitly ignored
* Update changelog to include pr ref
* Small textual change
* Remove patch
* No need for testing if the warning is actually raised. The tests in test_tools_dumppdf.py are just test cases that check whether these pdfs are supported.
* Use logger as name for logger
* Add docs to legacy warnings
* Use logging.Logger.warning for failed decompression
* Add reference to docs describing when to use logger and warnings
Co-authored-by: Henry S. Thompson <ht@home.hst.name>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
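The distinction drawn in the commits above, logging for users and `warnings.warn` for developers, can be sketched as follows. The function names, messages and logger name are hypothetical and chosen only for illustration.

```python
import logging
import warnings

logger = logging.getLogger("example")  # hypothetical logger name


def open_document(has_valid_xref: bool) -> None:
    """A problem with the input file concerns the user: report it via logging."""
    if not has_valid_xref:
        logger.warning("No valid xref table found; falling back to a full scan")


def old_helper() -> None:
    """A problem with the calling code concerns the developer: use warnings.warn."""
    warnings.warn("old_helper() is deprecated", DeprecationWarning, stacklevel=2)
```

With this split, an application can route or silence file-related messages through its logging configuration, while deprecations remain visible to developers through the warnings filter.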
* Fix pdf2txt --boxes-flow=disabled
Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```
Related: Issue #477, PR #479
* update CHANGELOG
* merge CHANGELOG
* pdf2txt: clean up handling of layout parameter arguments
* avoid specifying default values twice
* construct LAParams earlier, rather than passing its components around
* fix crash with --boxes_flow=disabled
* update CHANGELOG
* construct new LAParams, so _validate runs
* Improve readability of setting LAParams by explicitly copying them from parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method.
* Add cli argument for line_overlap
* Also use default values from LAParams for --detect-vertical and --all-texts
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
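The `TypeError` in the traceback above comes from the raw string `"disabled"` reaching arithmetic that expects a number. A minimal sketch of the kind of conversion that avoids this, assuming the string should become `None` before it is used; `parse_boxes_flow` is a hypothetical helper, not the function in pdf2txt.py:

```python
from typing import Optional


def parse_boxes_flow(raw: str) -> Optional[float]:
    """Convert the command-line value to a float, or None for "disabled"."""
    if raw == "disabled":
        return None
    return float(raw)


disabled = parse_boxes_flow("disabled")
enabled = parse_boxes_flow("0.5")

# Arithmetic is only attempted when a numeric value is present.
weight = None if enabled is None else 1 - enabled
```

Doing the conversion once, at argument-parsing time, means the layout code never has to handle a string.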
* array.array.tostring -> array.array.tobytes
The tostring method has been deprecated since Python 3.2 and was
removed altogether in 3.9. In Python 3.2 the method was renamed
to "tobytes".
Will close #641
* changelog entry
* test for tobytes
* Fix CHANGELOG.md
* Update CHANGELOG.md to PR that I can push on
* Simplify tests
Co-authored-by: Forest Gregg <fgregg@uchicago.edu>
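For reference, the renamed method from the `tostring` change above is used like this. The array contents are made up for the example.

```python
import array

values = array.array("B", [1, 2, 3])  # unsigned bytes
data = values.tobytes()               # replaces the removed values.tostring()
```

For an array of type code `"B"`, each element maps to exactly one byte, so `data` has the same length as `values`.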
* Attempt to handle decompression error on some broken PDF files
From time to time we come across files where no text is detected, while readers
like evince read the PDF nicely. After some digging, it turned out that this is
because the PDF includes some badly compressed data. This may be fixed by
decompressing byte by byte and ignoring the error on the last check bytes
(arbitrarily found to be the last 3).
This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422
and the test file has been taken from there, so credits to @zegrep.
* Use a warning instead of raising an exception
when a zlib error is detected before the CRC checksum.
* Add line to CHANGELOG.md
* Only try decompressing if not in strict mode
* Change error into warning because warnings.warn needs a subclass of Warning
Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
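The byte-by-byte approach described in the decompression commits above can be sketched as follows. This is an illustration under assumptions rather than the code from this change: the `check_bytes` parameter and the use of `RuntimeWarning` are choices made for the example, and only the "last 3 bytes" heuristic comes from the commit message.

```python
import warnings
import zlib


def decompress_corrupted(data: bytes, check_bytes: int = 3) -> bytes:
    """Decompress one byte at a time, tolerating an error in the trailing check bytes."""
    decompressor = zlib.decompressobj()
    result = b""
    for i in range(len(data)):
        try:
            result += decompressor.decompress(data[i : i + 1])
        except zlib.error:
            if i < len(data) - check_bytes:
                # The stream broke before the check bytes, so output is incomplete.
                warnings.warn("data loss while decompressing", RuntimeWarning)
            break
    return result


payload = b"hello world " * 20
broken = bytearray(zlib.compress(payload))
broken[-1] ^= 0xFF  # damage the final byte of the checksum
recovered = decompress_corrupted(bytes(broken))
```

Feeding single bytes is slow, which is presumably why the change only attempts it outside strict mode, after normal decompression has failed.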