pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Pieter Marsman	ebf7bcdb98	Add FAQ about special characters (#829 ) * Add FAQ for extracting special characters * Update CHANGELOG.md * Update faq.rst	2022-11-05 17:22:08 +01:00
Pieter Marsman	3688911afe	Fix small typos in documentation (#828 ) * Fix #795 * Documentation updates (FAQ and others) * New how-to for extracting coordinates * Indent fix in documentation * Revert "Fix #795" This reverts commit `cac62171fc`. * Move description of iterating LTPage to the docstring of LTPage * Remove adding how-to for extracting coordinates from this pr * Add CHANGELOG.md * Remove FAQ from this branch * Only add one line to CHANGELOG.md Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>	2022-11-05 17:08:23 +01:00
Pieter Marsman	fa71062c35	Fix `ValueError` when extracting images, due to breaking changes in Pillow (#827 ) * Fix #795 * Update CHANGELOG.md Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>	2022-11-05 16:44:15 +01:00
Pieter Marsman	769dbb6343	Consistent instructions for how to install and use pdfminer.six (#793 )	2022-11-05 16:30:39 +01:00
Jeremy Singer-Vine	ad6587c697	Fix to set color space from color convenience ops (#794 ) Section 4.5 of the PDF reference says: "Color values are interpreted according to the current color space, another parameter of the graphics state. A PDF content stream first selects a color space by invoking the CS operator (for the stroking color) or the cs operator (for the non-stroking color). It then selects color values within that color space with the SC operator (stroking) or the sc operator (nonstroking). There are also convenience operators—G, g, RG, rg, K, and k—that select both a color space and a color value within it in a single step." Previously, those convenience operators did not set the color space. This commit, following on filed issue #779, fixes this. It also adds a test to demonstrate that, at least for the do_rg method, the fix works as intended.	2022-08-18 20:38:51 +02:00
sobuen	ca9f75a032	Added font name aliases for Arial, Courier New and Times New Roman (#790 ) * Fix `unknown` fontname in TrueType(Arial, TimesNewRoman) (#767) * Add changelog * Cleanup CHANGELOG.md * Add comment with source of alias names Co-authored-by: thirakawa <ewjohnp@gmail.com> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-08-14 12:12:02 +02:00
Richard Hudson	77df431871	Add HOCRConverter (fixes #650 ) (#651 ) * Add HOCRConverter * Add line to README.md * Test cicd * Test cicd 2 * Changes based on review comments * Remove whitespace changes to CHANGELOG.md * Remove duplicated html output * Add link to hocr wiki * Add tests for extracting hocr and html Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-08-14 11:52:50 +02:00
pettzilla1	f79ad56f48	Fix ValueError when bmp images with 1 bit channels are decoded (fixes #773 ) (#784 ) * Update utils.py bitspercomponent =1 is defined and stores as a .btm worked when I tested it * Update utils.py () replaced with [] * Update CHANGELOG.md added changes for pull request * Update for flake * Update CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-08-08 22:35:53 +02:00
Nitesh Oswal	7b7889ff6a	Update README.md (#787 ) Update pip install quote for optional extra dependency for extracting images	2022-08-08 22:21:39 +02:00
Pieter Marsman	8f52578e85	Run black locally with nox (#776 ) * Run black locally with nox * Update contributor instructions * Fix workflow	2022-06-26 18:25:28 +02:00
Pieter Marsman	4733eb333a	Install typing_extensions on Python 3.6 and 3.7 (#775 ) * Install typing_extensions on Python 3.6 and 3.7 * Add CHANGELOG.md * Black setup.py	2022-06-26 17:47:28 +02:00
Christian Christiansen	ebf92acf0c	Fix `TypeError` by Ignoring null characters in PSBaseParser (#768 ) * Ignore null characters in PSBaseParser Beforehand, null characters were encoded as PSKeyword tokens. This caused issue #617, as pdfdevice.py would attempt to decode the null character PSKeyword, when it expects a byte string, as opposed to a PSKeyword, causing pdfminer.six to crash. As null characters are superfluous within PSBaseParser, ignore them. * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-06-26 17:46:39 +02:00
Florian Apolloner	f63e9fbee9	Fix `ValueError` with unencrypted metadata values (Fixes #766 ). (#774 ) * Fix crash with unencrypted metadata values (pdfminer#766). * Explicitly check for length * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-06-26 17:25:30 +02:00
gosiafilipek	1044fc05e8	Fix `TypeError` when getting default width of font (#772 ) * Issue #720 resolve1 when getting the default width. * Add CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-06-25 23:16:28 +02:00
Pieter Marsman	6cbee25b3e	Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. (#756) * Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. * Add CHANGELOG.md * Cleanup CHANGELOG.md * Cleanup CHANGELOG.md * Undo deleting conf_glyphlist.py and conf_afm.py and add a deprecation warning instead	2022-06-25 23:11:10 +02:00
Chris Mayo	86e34873e4	Fix Sphinx warnings and error (#760 ) * Fix Sphinx warnings howto/acro_forms.rst:4: WARNING: Title underline too short. howto/acro_forms.rst:81: WARNING: Bullet list ends without a blank line; unexpected unindent. howto/acro_forms.rst:88: WARNING: Bullet list ends without a blank line; unexpected unindent. howto/acro_forms.rst:122: WARNING: Bullet list ends without a blank line; unexpected unindent. tutorial/extract_pages.rst:6: WARNING: Failed to create a cross reference. A title or caption not found: api_extract_pages * Fix documenting pdf2txt.py reference/commandline.rst:12: ERROR: Module "tools.pdf2txt" has no attribute "maketheparser" Incorrect argparse :module: or :func: values? * Add CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-05-24 20:07:04 +02:00
Pieter Marsman	0b09d5f8db	Update CHANGELOG.md for #755	2022-05-24 19:41:54 +02:00
Philippe Ombredanne	7f97e26869	Remove upper version bounds (#755) Using an upper bound for dependency versions on a library is a source of trouble for users. Let's not do it as it makes pdfminer wreak havoc downstream. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>	2022-05-07 20:35:18 +02:00
Jeremy Singer-Vine	f2c967f500	Ignore path constructors that do not begin with m (#749) * Ignore path constructors that do not begin with m Per PDF Reference Section 4.4.1, "path construction operators may be invoked in any sequence, but the first one invoked must be m or re to begin a new subpath." Since pdfminer.six already converts all `re` (rectangle) operators to their equivalent `mlllh` representation, paths ingested by `.paint_path(...)` that do not begin with the `m` operator are invalid. In addition to the advantage of hewing to the PDF Reference, this change also avoids the `ValueError: not enough values to unpack (expected 2, got 1)` error raised by the `pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]` line in `converter.py` when parsing PDFs that (erroneously) include `("h",)` paths. * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-05-06 22:15:00 +02:00
Pieter Marsman	e19aea932d	Bump version 20220506 & fix small issue with types	2022-05-06 22:02:32 +02:00
Pieter Marsman	1bf3c42b59	Use charset-normalizer instead of chardet (#744 ) * Use charset-normalizer instead of chardet * Ignore charset_normalizer type stub * Add CHANGELOG.md	2022-04-20 21:42:50 +02:00
Pieter Marsman	617e4c8388	Refactor ImageWriter and add method for exporting an image from bytes. (#737 ) * Refactor ImageWriter and add method for exporting an image from bytes. E.g. when FlateDecode just results in a list of RGB bytes. * Added docstrings * Add CHANGELOG.md * Run black * Run black	2022-03-22 20:58:16 +01:00
Pieter Marsman	894dabf264	Log warning and continue gracefully if errors in cmap (#731 ) * Log warning and continue gracefully if errors in cmap * Fix nox testing * Also log warning if cid range is larger than actual code * Format with black * Add docstring * Add CHANGELOG.md * Restore running cmapdb.py directly	2022-03-21 19:39:53 +01:00
Pieter Marsman	13021c9875	Fix log.debug statement in lzw.py by ensuring that self.table is always set (#732 ) * Fix log.debug statement in lzw.py by ensuring that self.table is always set. * Add CHANGELOG.md	2022-03-21 19:27:22 +01:00
Pieter Marsman	782368b911	Raise KeyError when name in name2unicode is not of type str (#733 ) * Raise KeyError when name in name2unicode is not of type str * Add CHANGELOG.md	2022-03-21 19:25:28 +01:00
Pieter Marsman	e27cd54aff	Convert fontname to str if it is bytes in HTMLConverter (#734 ) * Convert fontname to str if it is bytes * Add CHANGELOG.md	2022-03-21 19:20:42 +01:00
Pieter Marsman	ae7f315746	Fix github actions tag regex	2022-03-19 21:10:02 +01:00
Pieter Marsman	a2e1d6a8bf	Fix github actions tag regex	2022-03-19 20:53:14 +01:00
Pieter Marsman	c2e516d6df	Bump version	2022-03-19 20:49:22 +01:00
Pieter Marsman	d89cc357ee	Add github action for releasing to pypi if git tag is added. (#727) * Add github action for releasing to pypi if git tag is added. * Checkout code and fix typos. * Replace end with fi * Strictly numeric version for testing. * Remove obsolete Make commands for publishing * Also create GitHub release * Update pdfminer/__init__.py Co-authored-by: Jake Stockwin <jstockwin@gmail.com> * Remove test pypi release * Use maintained github action for releasing * Change tag format for versions * Undo commenting pypi publishing * Remove develop branch, since that will be removed in favor of adding tags for releases. * Change version regex Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2022-03-19 20:46:00 +01:00
jwyawney	43c8fc8557	Ignore empty characters when analyzing layout (#689 ) * Adding in checks for spurious lines that contain either only spaces or new line characters * Added spurious lines check and unit tests * Updated CHANGELOG.md with changes * Simplify code * Simplify code * Simplify code * Remove changes to lines that are not actually changed * Format import * Improve CHANGELOG.md * Improve CHANGELOG.md * Fix cicd * Blacken Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-02-22 21:20:26 +01:00
Pieter Marsman	121235e24b	Raise more specific error if Pillow cannot be imported (#714 ) * Raise specific warning if Pillow cannot be imported * Improve error message * Update docs * Update CHANGELOG.md * Update pdfminer/image.py Co-authored-by: Jake Stockwin <jstockwin@gmail.com> Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2022-02-22 20:20:17 +01:00
Pieter Marsman	b9a8920cdf	Check blackness in github actions (#711 ) * Check blackness in github actions * Blacken code * Update github action names * Add contributing guidelines on using black * Add to checklist for PR	2022-02-11 22:46:51 +01:00
Pedro Nunes	830acff94c	Changed `log.info` to `log.debug` in six files (#690) * `log.info` changed to `log.debug` in six files * Fix indentation * Remove from CHANGELOG.md since no functionality has changed Co-authored-by: Pedro Nunes <pedro@paranamodapark.com.br> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-02-08 21:24:00 +01:00
Pieter Marsman	2254306a52	Update README.md batch for Continuous integration	2022-02-02 22:53:17 +01:00
Pieter Marsman	81f873e105	Update actions.yml so that it will run for all PR's	2022-02-02 22:45:05 +01:00
Pieter Marsman	b84cfc98e0	Update development tools: travis ci to github actions, tox to nox, nose to pytest (#704 ) * Replace tox with nox * Replace travis with github actions * Fix pytest, mypy and flake8 errors * Add pytest. * Run on all commits * Remove nose * Speedup slow tests to save GitHub actions minutes * Added line to CHANGELOG.md * Fix line too long in pdfdocument.py * Update .github/workflows/actions.yml Co-authored-by: Jake Stockwin <jstockwin@gmail.com> * Improve actions.yml * Fix error with nox name for mypy * Add names for jobs * Replace nose.raises with pytest.raises Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2022-02-02 22:24:32 +01:00
Andrew Baumann	1d1602e0c5	Added feature: page labels (#680 ) * port page label code from pdfannots * add tests and clean up * more cleanup; harden against non-conforming input * one more test * update CHANGELOG * cleanup & respond to review feedback (incomplete) * Refactor implementation of get_page_labels() into a NumberTree and PageLabels class. * PageLabels is a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more. * fix type errors and cleanup slightly * fix mypy errors (including tweaking code to avoid problematic dynamic types) * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be) * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-02-01 10:08:05 +01:00
Pieter Marsman	b19f9e7270	Remove obsolete returns (#707 ) * Remove obsolete returns * Update CHANGELOG.md * Remove empty lines * Remove more empty lines	2022-02-01 01:49:46 +01:00
Pieter Marsman	2610ef13af	Revert "Remove obsolete returns" This reverts commit `c67abdfab0`.	2022-02-01 01:36:17 +01:00
Pieter Marsman	c67abdfab0	Remove obsolete returns	2022-02-01 01:35:35 +01:00
Tony(Baojia) Tong	4b138a6bc5	Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (#684 ) * check obj type * update changelog * Update CHANGELOG.md * add changes * update change * update changelog * Use fallback in except clause * Update changelog.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com> Co-authored-by: Tony Tong <baojia.tong@kensho.com>	2022-02-01 01:20:52 +01:00
htInEdin	dc530f3a6f	Use logger.warn instead of warnings.warn if warning cannot be prevented by user (#673) * Use logging.Logger.warning instead of warnings.warn in most cases, following the official Python guidance that warnings.warn is directed at _developers_, not users * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning, PDFNoValidXRefWarning * (pdfpage.py) Don't import warnings, don't use PDFTextExtractionNotAllowedWarning * (tools/dumppdf.py) Don't import warnings, don't use PDFNoValidXRefWarning * (tests/test_tools_dumppdf.py) Don't import warnings, check for logging.WARN rather than PDFNoValidXRefWarning * get name right * make flake8 happy * Keep warning classes such that this does not crash code when these warnings are explicitly ignored * Update changelog to include pr ref * Small textual change * Remove patch * No need for testing if the warning is actually raised. The tests in test_tools_dumppdf.py only check whether these pdfs are supported. * Use logger as name for logger * Add docs to legacy warnings * Use logger.Logger.warn for failed decompression * Add reference to docs describing when to use logger and warnings Co-authored-by: Henry S. Thompson <ht@home.hst.name> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-26 20:41:12 +01:00
crisptag	c4ac514984	Change log.info into log.debug to make pdfinterp.py less verbose	2022-01-26 19:57:55 +01:00
Andrew Baumann	95dee8d67c	Fix regression in page layout that sometimes returned text lines out of order (#659 ) * add a test * fix the bug * rewrap long lines * update CHANGELOG * re-merge CHANGELOG Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-26 19:55:08 +01:00
Andrew Baumann	9a644aae76	export type annotations in package (#679 ) * export type annotations via our pypi package * update CHANGELOG Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-25 22:11:17 +01:00
Andrew Baumann	24eb15cae5	fix typos in PR template (#681 )	2022-01-25 22:08:14 +01:00
Andrew Baumann	d87bd025dd	pdf2txt: clean up construction of LAParams from arguments (#682) * Fix pdf2txt --boxes-flow=disabled Fixes: ``` $ pdf2txt.py --boxes-flow=disabled test.pdf Traceback (most recent call last): File "tools/pdf2txt.py", line 204, in <module> sys.exit(main()) File "tools/pdf2txt.py", line 198, in main outfp = extract_text(vars(A)) File "tools/pdf2txt.py", line 66, in extract_text pdfminer.high_level.extract_text_to_fp(fp, locals()) File "pdfminer/high_level.py", line 85, in extract_text_to_fp interpreter.process_page(page) File "pdfminer/pdfinterp.py", line 896, in process_page self.device.end_page(page) File "pdfminer/converter.py", line 51, in end_page self.cur_item.analyze(self.laparams) File "pdfminer/layout.py", line 822, in analyze group.analyze(laparams) File "pdfminer/layout.py", line 575, in analyze LTTextGroup.analyze(self, laparams) File "pdfminer/layout.py", line 362, in analyze obj.analyze(laparams) File "pdfminer/layout.py", line 575, in analyze LTTextGroup.analyze(self, laparams) File "pdfminer/layout.py", line 362, in analyze obj.analyze(laparams) File "pdfminer/layout.py", line 575, in analyze LTTextGroup.analyze(self, laparams) File "pdfminer/layout.py", line 362, in analyze obj.analyze(laparams) File "pdfminer/layout.py", line 577, in analyze self._objs.sort( File "pdfminer/layout.py", line 578, in <lambda> key=lambda obj: (1 - laparams.boxes_flow) * obj.x0 TypeError: unsupported operand type(s) for -: 'int' and 'str' ``` Related: Issue #477, PR #479 * update CHANGELOG * merge CHANGELOG * pdf2txt: clean up handling of layout parameter arguments * avoid specifying default values twice * construct LAParams earlier, rather than passing its components around * fix crash with --boxes_flow=disabled * update CHANGELOG * construct new LAParams, so _validate runs * Improve readability of setting LAParams by explicitly copying them from parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method. * Add cli argument for line_overlap * Also use default values from LAParams for --detect-vertical and --all-texts Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-25 22:06:06 +01:00
Pieter Marsman	aa5dec252f	Fixes jbig2 writer to write valid jb2 files See: https://github.com/pdfminer/pdfminer.six/pull/653 Squashed commit of the following: commit 8748c9fcddab0826cca243eee45c40d2b6611e80 Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:40:50 2022 +0100 Remove prints in test commit bb977258a39fc7baa13bba1c3ea29726e17c0f6d Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:35:12 2022 +0100 Cleanup exception handling for jbig2 global streams commit cf0b47b01b7caad8acbd82097aadadb620606a8b Merge: `a5831d1` `708dd20` Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:29:15 2022 +0100 Merge branch 'develop' into jbig2_fix commit `a5831d110a` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:59:17 2021 -0400 flake8 tests commit `18ffa29387` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:52:11 2021 -0400 add description in changelog commit `6c7ee43d6c` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:43:36 2021 -0400 Fixes jbig2 writer to write valid jb2 files - closes #652	2022-01-23 21:41:08 +01:00
Pieter Marsman	708dd20465	Add support for JPEG2000 image encoding	2022-01-23 21:17:47 +01:00

925 Commits (20221105)