pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Jake Stockwin	19c1372984	Fix for when 'trailer' is indented (#535 ) * Fix for when trailer is indented * Store stripped line * This commit breaks things... * Or maybe this one breaks things? * Remove commented code because no longer used. * Add CHANGELOG.md * Add poetry venv management files to gitignore since I started using poetry to manage the python envs for this project Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-08-15 17:49:56 +02:00
Jeremy Singer-Vine	016239c146	Fix .paint_path handling of single line segments (#530 ) * Fix .paint_path handling of single line segments - Fixes typo ("ml" should have been "mlh") - Removes if-statement that required individual line segments to be strictly horizontal or vertical. * Treat 'ml'-shape paths as lines not curves Although 'mlh' is the canonical implementation for a single line segment, 'ml' is fairly common. Adds tests and sample PDF. * Fix trailing whitespace * Fix point-extraction from Bézier path commands This commit corrects the manner in which "pts" are extracted from Bézier path commands. See Table 4.9 of PDF reference manual, and new comments in code for details. Previously, depending on the command (c, v, or y), the code was extracting some combination of control points (not on curve) and the actual points-on-curve. This commit also refactors .paint_path, so that apply_matrix_pt is only called in one place, and to treat the "h" command in a manner more consistent with other path commands. * Add comments to test_paint_path_quadrilaterals * Parse rect-forming mllll paths as rects not curves Now that .paint_path has been refactored, adding support for rect-forming mllll paths requires no extra code, beyond a minor tweak to the relevant elif statement. * One changelog line with ref to mr * Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring * Extract variables from if statement to make it easier to read * Optimize imports order * Trigger travis build * Revert "Trigger travis build" This reverts commit `41c05184` * Update travis badge * Update travis badge Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-07-27 18:27:32 +02:00
Ev2geny	693e4f48a3	Issue #469 is fixed (When run on Windows a lot of tests fail with the error: [Errno 13] Permission denied) (#484 ) Closes #469 * Issue #469 is fixed * one extra comment to code is added * TemporaryFilePath context manager is added to facilitate tests * flake8 complaints fixed * Update docs of tempfilepath.py * Fix flake8 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-26 10:10:11 +01:00
Pieter Marsman	f8e6ad6ac1	Remove support for non standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream (#523 ) Closes #191 * Remove support for non standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream * Add docstring * Fix flake8	2020-10-25 14:37:12 +01:00
EucliTs0	fc75972bbd	Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' (#529 ) Closes #518 * Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' An error occurs when the 'DW2' key contains a PDFObjRef object instead of a list of int values, e.g: 'DW2': <PDFObjRef:152>. To solve this issue, we utilise the resolve1() function See: https://github.com/pdfminer/pdfminer.six/issues/518 * Updated CHANGELOG * Update CHANGELOG.md Co-authored-by: Dimitrios TSOLAKIDIS <dimitrios.tsolakidis@vialink.fr> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-25 14:34:45 +01:00
Pieter Marsman	178a831802	Revert "Fix for when 'trailer' is indented (#513 )" (#534 ) This reverts commit `ec223d1f1d`.	2020-10-25 13:22:42 +01:00
Pieter Marsman	875e53013a	Remove explicit support for Python 3.4 and 3.5, adding tests for python 3.9 (#522 ) Closes #503	2020-10-25 12:34:51 +01:00
Jake Stockwin	ec223d1f1d	Fix for when 'trailer' is indented (#513 ) * Fix for when 'trailer' is indented Closes #214 * Address CR comments - strip line after parsing * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-24 18:55:07 +02:00
estshorter	61300eef70	Remove unused dependency on sortedcontainers (#525 ) * Remove unused sortedcontainers package * Fix changelog format * Fix a link to the PR * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-24 15:55:22 +02:00
Pieter Marsman	c8cceb7c58	Release 20201018	2020-10-18 12:57:26 +02:00
Jeremy Singer-Vine	e83dd26671	Fix .paint_path for non-rectangle quadrilaterals (#512 ) * Fix paint_path bug noted in issue #473 Focuses on the handling of non-rect quadrilaterals, the decomposition of complex (m.h) paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue #473 * Tweak paint_path fix per @pietermarsman review - Adjusts logic to adhere to if-elif-else rather than early returns. - Shortens subpath detection/reprocessing step, using re.finditer(). * Reorder paint_path() if-else statements once more * Fix flake8 issues * Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on. Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-12 17:53:00 +02:00
estshorter	360b1efc0b	Deprecate Python 3.4 and 3.5 (#507 )	2020-10-10 16:15:03 +02:00
Diego Elio Pettenò	67e2d79591	Fix out-of-bound access on some PDFs. (#483 ) Replace the non-emptiness check with a minimum length check — you can't get the second to last item in a list of less than two items.	2020-10-10 15:18:34 +02:00
Jake Stockwin	ef4787d8ad	Fix not being able to pass boxes flow as None to pdf2txt (#479 ) * Fix not being able to pass boxes flow as None to pdf2txt * Changes from code review * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-10 15:17:04 +02:00
estshorter	f03657e5c4	Allow a pathlib.PurePath object as a input to open_filename (#492 ) * open_filename accepts a pathlib.PurePath object * Add test for open_filename with pathlib * Fix a wrong function name * Cast a pathlib object to string for py3.4/3.5 * Add link to the PR * Raise an exception when open_filename gets an unsupported type * Add tests for open_filename * Update CHANGELOG.md * Documentation Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-17 21:29:00 +02:00
David Nicholson	b4054ff4cf	Pass caching parameter to PDFResourceManager in `high_level` functions (#475 ) * Updated high_level.py This commit enables caching to be turned on and off rather than be always on regardless of the user input. * Reverted params back to fix errors * Updated CHANGELOG.md to reflect quick fix * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-10 21:09:07 +02:00
Igor Moura	a83f853de7	Remove unused rijndael encryption implementation (#465 ) * Remove unused rijndael encryption * Add current PR link to CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-10 19:28:00 +02:00
Pieter Marsman	0b44f77714	Move changelog line for #438 to current release	2020-07-26 15:14:15 +02:00
Philippe Ombredanne	99f0c09869	Restore PDFTextExtractionNotAllowed exception (#461 ) * Restore PDFTextExtractionNotAllowed Restore PDFTextExtractionNotAllowed exception class as an alias of the new PDFTextExtractionNotAllowedError exception that was introduced in `6a9269b432` Removing PDFTextExtractionNotAllowed is an API breakage that made several tools break. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com> * Use PDFTextExtractionNotAllowed and prepare PDFTextExtractionNotAllowedError to be removed in the future * Add line to CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-26 15:06:04 +02:00
Pieter Marsman	4f65242750	Always try to get CMap, even if name is not recognized (#438 ) * Add trying to get cmap from pickle file. And cleaning up a bit. * Don't use keyword argument for dict.get * Add docs * Make _get_cmap_name static * Add test * Add CHANGELOG.md * Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there * Add CJK characters to expected output of simple3.pdf * Fix line length * Add comment	2020-07-23 20:27:38 +02:00
Pieter Marsman	3cebf5ef66	Release 20200720	2020-07-20 22:05:19 +02:00
lithiumFlower	c10cf3cdb8	Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456 ) * swap pycryptodome to the faster, smaller, and industry standard crytography io * update changelog * fixlint * Update CHANGELOG.md * from MR, unneeded ex and naming * add samples to nosetests * fix lint * show mismatch * fix lint * typo and newline * Revert "add samples to nosetests" This reverts commit `a49ca302` * Add tests for encrypted documents to nose test suite * Optimize imports of pdfdocument.py Co-authored-by: Oren Tysor <oren@atakama.com> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung	60863cfd55	Fix converting path to multiple rectangles (#371 ) * Fix converting path to multiple rectangles For path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes #369 * Reduce pdf size by removing font * Add unittest for PDFLayoutAnalyzer.paint_path() * Add line to CHANGELOG.md * Add reference to pdf reference manual * Cleanup function paint_path a bit * Reduce line length of tests * Reduce line length of tests Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 17:34:38 +02:00
madhurcodes	6a9269b432	Change Text extraction is not allowed error to warning (#453 ) * Changed error to warning for 'Text extraction is not allowed' * updated changelog * fix lint * made changes suggested in review * Update CHANGELOG.md * Add regression test for failing pdf * Reduce line length to <80 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 16:04:11 +02:00
Tony(Baojia) Tong	836d312982	Validate that object is PDFStream in do_EI (#451 ) * check obj type * update changelog * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-05 13:42:15 +02:00
Pieter Marsman	6e05baf0b7	Dont dump fallback xref by default when using dumppdf.py, adding a flag to enable it Fixes #176 * Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's * Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref * Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know * Adding changelog * Extend help message	2020-05-23 18:04:34 +02:00
Pieter Marsman	33b60dfd54	Bump version	2020-05-17 17:50:01 +02:00
Jake Stockwin	7254530d27	Fix ordering of textlines within a textbox when boxes_flow is disabled (#412 ) * Fix ordering of textlines within a textbox when boxes_flow is disabled * Add new test PDF sample	2020-05-09 15:37:49 +02:00
fabbox	7eff108fa5	add shebang line to script in tools (#408 ) * add shebang line to script in tools * fix: use shebang line with python 3 * Moved changelog to unreleased Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-04-28 10:58:42 +02:00
Pieter Marsman	d79bcb75ea	Bump version 20200402	2020-04-01 21:37:39 +02:00
Pieter Marsman	b8988b6848	Bump version	2020-04-01 21:22:59 +02:00
Jake Stockwin	68e2ae8632	Fix text coming in reverse order with boxes flow disabled (#399 ) Closes #398	2020-04-01 13:37:04 +02:00
Jake Stockwin	e55560f858	Fix #395 : Update documentation for boxes_flow, allow None (#396 ) * Update documentation for boxes_flow, allow None * Apply comments from code review * Small wording changes, remove unnecessary comment * Update boxes_flow documentation for pdf2text * Pin version of tox to ensure python 3.4 support	2020-03-26 23:03:49 +01:00
Jake Stockwin	518b5d6efc	Fix #390 : Updated misleading documentation about word_margin (#407 ) * Updated misleading documentation about word_margin * Small change in sentence about word_margin * Remove confusing sentence about adding spaces Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-03-26 23:02:48 +01:00
Jake Stockwin	1a4a06da9f	Fix #392 Split out IO logic from high level functions (#393 ) * Allow file-like inputs to high level functions (#392) * PR Review - move open_filename to utils	2020-03-26 22:52:00 +01:00
Jake Stockwin	1cc1b961c5	Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382 ) (#384 ) * Group text lines if they are centered (#382) Closes #382 * Add comparison private methods to LTTextLines * Add missing docstrings * Add tests for find_neighbors * Update changelog * Cosmetic changes from code review	2020-03-23 22:38:39 +01:00
Pieter Marsman	9d7fe2d9ee	Catch ValueError when converting font encoding differences to characters (#389 ) * Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed * Add test for catching ValueError and KeyError when font encoding differences are invalid * Added line to CHANGELOG.md	2020-03-16 20:12:45 +01:00
Pieter Marsman	1d773dc38a	Fix grouping textlines when bounding box of parent container is wrong (#386 ) * Default value for --all-texts should be false, because using the flag enables it * Fix edge case: when no neighbors are found a line should form its own text box * Added test for grouping textlines where 1 is outside the parent bounding box * Added CHANGELOG.md line	2020-03-14 10:33:39 +01:00
Pieter Marsman	bab6d154c2	Bump version 20200124	2020-01-24 12:38:11 +01:00
Pieter Marsman	1c3047b68b	Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364 ) Fixes #363 * Remove samples/ and docs/ from source distribution. The samples/ directory contains pdf's for testing purposes and docs/ contains readthedocs documentation that is published online. * Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit. See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen And https://github.com/pdfminer/pdfminer.six/issues/363 * Added line to CHANGELOG.md * Remove unused imports	2020-01-24 12:36:02 +01:00
Pieter Marsman	bc494ff03c	Bump version to 20200121	2020-01-21 21:13:52 +01:00
Pieter Marsman	52da65d5eb	Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's. (#360 ) * Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's. * Added line to CHANGELOG.md	2020-01-16 22:26:01 +01:00
Pieter Marsman	410d7ecac3	Fix value for font-family in html by removing the subset tag from the PDF font-name (#357 ) * Fix font name by removing subset tag * Added line to CHANGELOG.md * Add documentation and clear variable name * Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`	2020-01-16 22:25:20 +01:00
Pieter Marsman	fff3ac2ba6	Fix bug in computing character bounding box (#348 ) * Remove scaling font height/width with size of font bounding box * Refactor LTChar bounding box computation * Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font. * Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects * Add line to CHANGELOG	2020-01-16 22:15:50 +01:00
Pieter Marsman	2f7f5d2667	Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338 ) * Fix getting filename when extracting embedded files * Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs * Add line to CHANGELOG	2020-01-16 22:11:42 +01:00
Recursing	0b1741b9bf	Pack the /P (ermissions) entry from the /Encrypt dictionary in the file trailer, as unsigned long (#352 ) Fixes #186 * Treat the permissions (the /P entry) as unsigned long, fix #186 * handle negative values for p * Extract function for resolving a two's-complement * Add test for issue #352 * Add line to CHANGELOG.md * Only ints can be converted to a uint using two's-complement method * Standardize import style; multiple imports from same module on one line Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-01-07 21:59:13 +01:00
Pieter Marsman	b27d3d0aff	Bump version	2020-01-04 18:15:15 +01:00
Pieter Marsman	3502dc9f3b	Drop support for legacy Python 2 (#346 ) * Drop support for legacy Python 2 * Add python_requires to help pip * Upgrade Python syntax with pyupgrade * Upgrade Python syntax with pyupgrade --py3-plus * Python 3 imports * Replace six * Update CONTRIBUTING.md * Added line to changelog Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>	2020-01-04 16:47:07 +01:00
Pieter Marsman	f3ab1bc61e	Enforce pep8 coding-style (#345 ) * Code Refactor: Use code-style enforcement #312 * Add flake8 to travis-ci * Remove python 2 3 comment on six library. 891 errors > 870 errors. * Remove class and functions comments that consist of just the name. 870 errors > 855 errors. * Fix flake8 errors in pdftypes.py. 855 errors > 833 errors. * Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing * Cleanup pdfinterp.py and add documentation from PDF Reference * Cleanup pdfpage.py * Cleanup pdffont.py * Clean psparser.py * Cleanup high_level.py * Cleanup layout.py * Cleanup pdfparser.py * Cleanup pdfcolor.py * Cleanup rijndael.py * Cleanup converter.py * Rename klass to cls if it is the class variable, to be more consistent with standard practice * Cleanup cmap.py * Cleanup pdfdevice.py * flake8 ignore fontmetrics.py * Cleanup test_pdfminer_psparser.py * Fix flake8 in pdfdocument.py; 339 errors to go * Fix flake8 utils.py; 326 errors to go * pep8 correction for few files in /tools/ 328 > 160 to go (#342) * pep8 correction for few files in /tools/ 328 > 160 to go * pep8 correction: 160 > 5 to go * Fix ascii85.py errors * Fix error in getting index from target that does not exists * Remove commented print lines * Fix flake8 error in pdfinterp.py * Fix python2 specific error by removing argument from print statement * Ignore invalid python2 syntax * Update contributing.md * Added changelog * Remove unused import Co-authored-by: Fakabbir Amin <f4amin@gmail.com>	2019-12-29 21:20:20 +01:00
Pieter Marsman	803a7d9598	Release 20191110	2019-11-10 12:29:14 +01:00


66 Commits (19c1372984ef1324359be91b31ae41412f6be82a)