pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Pieter Marsman	875e53013a	Remove explicit support for Python 3.4 and 3.5, adding tests for python 3.9 (#522 ) Closes #503	2020-10-25 12:34:51 +01:00
Jake Stockwin	ec223d1f1d	Fix for when 'trailer' is indented (#513 ) * Fix for when 'trailer' is indented Closes #214 * Address CR comments - strip line after parsing * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-24 18:55:07 +02:00
estshorter	61300eef70	Remove unused dependency on sortedcontainers (#525 ) * Remove unused sortedcontainers package * Fix changelog format * Fix a link to the PR * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-24 15:55:22 +02:00
Pieter Marsman	c8cceb7c58	Release 20201018	2020-10-18 12:57:26 +02:00
Pieter Marsman	2a88fda543	Rebrand the .six by adding a punchline and a faq (#520 ) * Add punchline to readme * Add punchline to docs * Add frequently asked questions * Update docs/source/faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com> * Update docs/source/faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com> * Update docs/source/faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com> * Update faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2020-10-18 12:50:59 +02:00
Pieter Marsman	c66eca3c29	Update faq.rst	2020-10-18 12:49:54 +02:00
Jeremy Singer-Vine	e83dd26671	Fix .paint_path for non-rectangle quadrilaterals (#512 ) * Fix paint_path bug noted in issue #473 Focuses on the handling of non-rect quadrilaterals, the decomposition of complex (m.h) paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue #473 * Tweak paint_path fix per @pietermarsman review - Adjusts logic to adhere to if-elif-else rather than early returns. - Shortens subpath detection/reprocessing step, using re.finditer(). * Reorder paint_path() if-else statements once more * Fix flake8 issues * Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on. Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-12 17:53:00 +02:00
Pieter Marsman	599f0391b5	Update faq.rst	2020-10-12 09:22:41 +02:00
Pieter Marsman	e59b1bca2f	Update docs/source/faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2020-10-12 09:20:43 +02:00
Pieter Marsman	a805653a83	Update docs/source/faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2020-10-12 09:20:37 +02:00
Pieter Marsman	4be9757b86	Update docs/source/faq.rst Co-authored-by: Jake Stockwin <jstockwin@gmail.com>	2020-10-12 09:20:30 +02:00
Pieter Marsman	14cc66ae6d	Add frequently asked questions	2020-10-11 20:05:26 +02:00
Pieter Marsman	bbc01f749a	Add punchline to docs	2020-10-11 20:05:11 +02:00
Pieter Marsman	d04c38fb8d	Add punchline to readme	2020-10-11 20:04:57 +02:00
estshorter	360b1efc0b	Deprecate Python 3.4 and 3.5 (#507 )	2020-10-10 16:15:03 +02:00
Diego Elio Pettenò	67e2d79591	Fix out-of-bound access on some PDFs. (#483 ) Replace the non-emptiness check with a minimum length check — you can't get the second to last item in a list of less than two items.	2020-10-10 15:18:34 +02:00
Jake Stockwin	ef4787d8ad	Fix not being able to pass boxes flow as None to pdf2txt (#479 ) * Fix not being able to pass boxes flow as None to pdf2txt * Changes from code review * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-10 15:17:04 +02:00
estshorter	f03657e5c4	Allow a pathlib.PurePath object as a input to open_filename (#492 ) * open_filename accepts a pathlib.PurePath object * Add test for open_filename with pathlib * Fix a wrong function name * Cast a pathlib object to string for py3.4/3.5 * Add link to the PR * Raise an exception when open_filename gets an unsupported type * Add tests for open_filename * Update CHANGELOG.md * Documentation Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-17 21:29:00 +02:00
David Nicholson	b4054ff4cf	Pass caching parameter to PDFResourceManager in `high_level` functions (#475 ) * Updated high_level.py This commit enables caching to be turned on and off rather than be always on regardless of the user input. * Reverted params back to fix errors * Updated CHANGELOG.md to reflect quick fix * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-10 21:09:07 +02:00
Igor Moura	a83f853de7	Remove unused rijndael encryption implementation (#465 ) * Remove unused rijndael encryption * Add current PR link to CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-10 19:28:00 +02:00
typhoon71	4d8b5975cb	Add section to documentation with howto for AcroForm fields extraction (#458 ) * Create aforms.rst Add section to documentation with howto for AcroForm fields extraction * Update index.rst Added reference to aforms.rst * Update aforms.rst * Update aforms.rst * Update index.rst * Update and rename aforms.rst to acro_forms.rst * Update acro_forms.rst * Update acro_forms.rst * Update acro_forms.rst * Update index.rst * Update acro_forms.rst * Update acro_forms.rst * Update acro_forms.rst * Update pdfdocument.py * Update pdfdocument.py * Update pdfdocument.py * Update acro_forms.rst * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update acro_forms.rst * reverted changes * Update README.md * Proper processing of ComboBox ComboBox fields hold multiple values, so the must be returned as a list. * PDF with AcroForm (samples) * Create tmp * Delete AcroForm_TEST.pdf * Delete AcroForm_TEST_compiled.pdf * PDF file with AcroForms * Delete tmp * Fixed typo * Update index.rst * Update README.md * Update index.rst * Update pdfdocument.py * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update pdfdocument.py * Update pdfdocument.py * Update pdfdocument.py Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>	2020-09-10 19:18:41 +02:00
Pieter Marsman	0b44f77714	Move changelog line for #438 to current release	2020-07-26 15:14:15 +02:00
Pieter Marsman	391fe149ca	Release 20200726	2020-07-26 15:10:36 +02:00
Pieter Marsman	66856a1016	Replace internal usage of PDFTextExtractionNotAllowedError (deprecated) with PDFTextExtractionNotAllowed	2020-07-26 15:09:32 +02:00
Philippe Ombredanne	99f0c09869	Restore PDFTextExtractionNotAllowed exception (#461 ) * Restore PDFTextExtractionNotAllowed Restore PDFTextExtractionNotAllowed exception class as an alias of the new PDFTextExtractionNotAllowedError exception that was introduced in `6a9269b432` Removing PDFTextExtractionNotAllowed is an API breakage that made several tools fail break. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com> * Use PDFTextExtractionNotAllowed and prepare PDFTextExtractionNotAllowedError to be removed in the future * Add line to CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-26 15:06:04 +02:00
Pieter Marsman	4f65242750	Always try to get CMap, even if name is not recognized (#438 ) * Add trying to get cmap from pickle file. And cleaning up a bit. * Don't use keyword argument for dict.get * Add docs * Make _get_cmap_name static * Add test * Add CHANGELOG.md * Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there * Add CJK characters to expected output of simple3.pdf * Fix line length * Add comment	2020-07-23 20:27:38 +02:00
Pieter Marsman	3cebf5ef66	Release 20200720	2020-07-20 22:05:19 +02:00
lithiumFlower	c10cf3cdb8	Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456 ) * swap pycryptodome to the faster, smaller, and industry standard crytography io * update changelog * fixlint * Update CHANGELOG.md * from MR, unneeded ex and naming * add samples to nosetests * fix lint * show mismatch * fix lint * typo and newline * Revert "add samples to nosetests" This reverts commit `a49ca302` * Add tests for encrypted documents to nose test suite * Optimize imports of pdfdocument.py Co-authored-by: Oren Tysor <oren@atakama.com> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung	60863cfd55	Fix converting path to multiple rectangles (#371 ) * Fix converting path to multiple rectangles For path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes #369 * Reduce pdf size by removing font * Add unittest for PDFLayoutAnalyzer.paint_path() * Add line to CHANGELOG.md * Add reference to pdf reference manual * Cleanup function paint_path a bit * Reduce line length of tests * Reduce line length of tests Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 17:34:38 +02:00
madhurcodes	6a9269b432	Change Text extraction is not allowed error to warning (#453 ) * Changed error to warning for 'Text extraction is not allowed' * updated changelog * fix lint * made changes suggested in review * Update CHANGELOG.md * Add regression test for failing pdf * Reduce line length to <80 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 16:04:11 +02:00
Tony(Baojia) Tong	836d312982	Validate that object is PDFStream in do_EI (#451 ) * check obj type * update changelog * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-05 13:42:15 +02:00
Michael	ce7775fd4f	Add setup.py classifiers for Python 3.7 and 3.8 (#450 )	2020-07-05 13:19:02 +02:00
Jake Stockwin	ac2b20a79a	[docs] Add extract_pages tutorial (#442 ) Closes https://github.com/pdfminer/pdfminer.six/issues/361	2020-06-29 20:07:05 +02:00
AhnHyunJin	09c989f301	Fix spelling error (#436 ) * Change rwo to two in pdfdiff.py Co-authored-by: ahnhyunjin <hj.ahn@promptech.co.kr>	2020-06-06 15:43:57 +02:00
Pieter Marsman	6e05baf0b7	Dont dump fallback xref by default when using dumppdf.py, adding a flag to enable it Fixes #176 * Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's * Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref * Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know * Adding changelog * Extend help message	2020-05-23 18:04:34 +02:00
Pieter Marsman	33b60dfd54	Bump version	2020-05-17 17:50:01 +02:00
Pieter Marsman	91d89af788	Add section to documentation with howto for image extraction (#427 ) * Make structure of documentation more clear: tutorials, how-to, topics and reference * Add howto for images * Restructure tutorials section, and add install section * Always use up-to-date version * Fix indentation warning in docstring * Add option to dumppdf.py and pdf2txt.py to show version Fixes #162	2020-05-17 17:48:06 +02:00
Jake Stockwin	7254530d27	Fix ordering of textlines within a textbox when boxes_flow is disabled (#412 ) * Fix ordering of textlines within a textbox when boxes_flow is disabled * Add new test PDF sample	2020-05-09 15:37:49 +02:00
fabbox	7eff108fa5	add shebang line to script in tools (#408 ) * add shebang line to script in tools * fix: use shebang line with python 3 * Moved changelog to unreleased Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-04-28 10:58:42 +02:00
Pieter Marsman	d79bcb75ea	Bump version 20200402	2020-04-01 21:37:39 +02:00
Pieter Marsman	b8988b6848	Bump version	2020-04-01 21:22:59 +02:00
Jake Stockwin	68e2ae8632	Fix text coming in reverse order with boxes flow disabled (#399 ) Closes #398	2020-04-01 13:37:04 +02:00
Jake Stockwin	e55560f858	Fix #395 : Update documentation for boxes_flow, allow None (#396 ) * Update documentation for boxes_flow, allow None * Apply comments from code review * Small wording changes, remove unnecessary comment * Update boxes_flow documentation for pdf2text * Pin version of tox to ensure python 3.4 support	2020-03-26 23:03:49 +01:00
Jake Stockwin	518b5d6efc	Fix #390 : Updated misleading documentation about word_margin (#407 ) * Updated misleading documentation about word_margin * Small change in sentence about word_margin * Remove confusing sentence about adding spaces Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-03-26 23:02:48 +01:00
Jake Stockwin	1a4a06da9f	Fix #392 Split out IO logic from high level functions (#393 ) * Allow file-like inputs to high level functions (#392) * PR Review - move open_filename to utils	2020-03-26 22:52:00 +01:00
Jake Stockwin	1cc1b961c5	Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382 ) (#384 ) * Group text lines if they are centered (#382) Closes #382 * Add comparison private methods to LTTextLines * Add missing docstrings * Add tests for find_neighbors * Update changelog * Cosmetic changes from code review	2020-03-23 22:38:39 +01:00
Pieter Marsman	9d7fe2d9ee	Catch ValueError when converting font encoding differences to characters (#389 ) * Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed * Add test for catching ValueError and KeyError when font encoding differences are invalid * Added line to CHANGELOG.md	2020-03-16 20:12:45 +01:00
fzyzcjy	a087d6dfc8	Fix typo in README.md (#388 )	2020-03-14 11:00:37 +01:00
Pieter Marsman	1d773dc38a	Fix grouping textlines when bounding box of parent container is wrong (#386 ) * Default value for --all-texts should be false, because using the flag enables it * Fix edge case: when no neighbors are found a line should form its own text box * Added test for grouping textlines where 1 is outside the parent bounding box * Added CHANGELOG.md line	2020-03-14 10:33:39 +01:00
Pieter Marsman	7e91d4ec6d	Improve docs and github templates	2020-03-08 15:06:13 +01:00
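
The path-splitting idea described in commits 60863cfd55 and e83dd26671 above (a shape string such as 'mlllhmlllh' is broken into subpaths, and each group of 5 points is processed on its own) can be illustrated with a minimal sketch. This is not the pdfminer.six implementation: the helper names `split_subpaths`, `is_rectangle_series`, and `chunk_points` are hypothetical, and the only assumption taken from the log is that the shape is a string of path operators in which 'm' starts a subpath and that `re.finditer()` is used for the detection step.

```python
import re


def split_subpaths(shape):
    """Split a shape string into subpaths, each starting at an 'm' operator.

    Hypothetical helper for illustration; not part of pdfminer.six.
    """
    return [match.group(0) for match in re.finditer(r"m[^m]+", shape)]


def is_rectangle_series(shape):
    """Return True if the shape is one or more 'mlllh' groups back to back."""
    return re.fullmatch(r"(mlllh)+", shape) is not None


def chunk_points(points, size=5):
    """Group a flat list of points into consecutive chunks of `size` points."""
    return [points[i:i + size] for i in range(0, len(points), size)]


shape = "mlllhmlllh"
points = list(range(10))  # placeholder values standing in for 10 path points

if is_rectangle_series(shape):
    # Each subpath is paired with its own 5 points and could then be
    # handled separately, yielding several rectangles instead of one curve.
    for subshape, subpoints in zip(split_subpaths(shape), chunk_points(points)):
        print(subshape, subpoints)
```

Splitting on the 'm' operator keeps each subpath self-contained, so the rectangle check only ever has to look at one group at a time.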

848 Commits (875e53013ae569cb7e3c675f34d29e42ac961a10)