pdfminer.six

Commit Graph

Author	SHA1	Message	Date
htInEdin	dc530f3a6f	Use logger.warn instead of warnings.warn if warning cannot be prevented by user (#673 ) * Use logging.Logger.warning instead of warning.warn in most cases, following the Python official guidance that warning.warn is directed at _developers_, not users * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning, PDFNoValidXRefWarning * (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning * (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning * (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather than PDFNoValidXRefWarning * get name right * make flake8 happy * Keep warning classes such that this does not crash code when these warnings are explicitly ignored * Update changelog to include pr ref * Small textual change * Remove patch * No need for testing if the warning is actually raised. The test_tools_dumppdf.py are just test cases if these pdfs are supported. * Use logger as name for logger * Add docs to legacy warnings * Use logger.Logger.warn for failed decompression * Add reference to docs describing when to use logger and warnings Co-authored-by: Henry S. Thompson <ht@home.hst.name> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-26 20:41:12 +01:00
Andrew Baumann	95dee8d67c	Fix regression in page layout that sometimes returned text lines out of order (#659 ) * add a test * fix the bug * rewrap long lines * update CHANGELOG * re-merge CHANGELOG Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-26 19:55:08 +01:00
Pieter Marsman	aa5dec252f	Fixes jbig2 writer to write valid jb2 files See: https://github.com/pdfminer/pdfminer.six/pull/653 Squashed commit of the following: commit 8748c9fcddab0826cca243eee45c40d2b6611e80 Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:40:50 2022 +0100 Remove prints in test commit bb977258a39fc7baa13bba1c3ea29726e17c0f6d Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:35:12 2022 +0100 Cleanup exception handling for jbig2 global streams commit cf0b47b01b7caad8acbd82097aadadb620606a8b Merge: `a5831d1` `708dd20` Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:29:15 2022 +0100 Merge branch 'develop' into jbig2_fix commit `a5831d110a` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:59:17 2021 -0400 flake8 tests commit `18ffa29387` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:52:11 2021 -0400 add description in changelog commit `6c7ee43d6c` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:43:36 2021 -0400 Fixes jbig2 writer to write valid jb2 files - closes #652	2022-01-23 21:41:08 +01:00
Pieter Marsman	b82229245a	Added test case for CCITTFaxDecoder (#700 ) * array.array.tostring -> array.array.tobytes The tostring method has been deprecated since Python 3.2 and was removed altogether in 3.9. In Python 3.2 the method was renamed to "tobytes" Will close #641 * changelog entry * test for tobytes * Fix CHANGELOG.md * Update CHANGELOG.md to PR that I can push on * Simplify tests Co-authored-by: Forest Gregg <fgregg@uchicago.edu>	2022-01-23 21:00:13 +01:00
Sylvain Thénault	10f6fb40c2	Attempt to handle decompression error on some broken PDF files (#637 ) * Attempt to handle decompression error on some broken PDF files from time to time we go through files where no text is detected, while readers like evince reads the pdf nicely. After digging it occurred this is because the PDF includes some badly compressed data. This may be fixed by uncompressing byte per byte and ignoring the error on the last check bytes (arbitrarily found to be the 3 last). This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422 and the test file has been taken from there, so credits to @zegrep. * Attempt to handle decompression error on some broken PDF files from time to time we go through files where no text is detected, while readers like evince reads the pdf nicely. After digging it occurred this is because the PDF includes some badly compressed data. This may be fixed by uncompressing byte per byte and ignoring the error on the last check bytes (arbitrarily found to be the 3 last). This has been largely inspired by mstamy2/PyPDF2#422 and the test file has been taken from there, so credits to @zegrep. * Use a warning instead of raising exception where zlib error is detected before the CRC checksum. * Add line to CHANGELOG.md * Only try decompressing if not in strict mode * Change error into warning because warning.warn needs a subclass of Warning Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-12-11 18:25:19 +01:00
wind_chh	c883f5e13f	Add support identity unicode cmap (#626 ) Fixes #625 * add support for Identity-H/V cmap fonts * format code to pass flake8 check * Remove indent * Remove indent * Use isinstance instead of type check * Use or instead of any * Use str in variable, instead of str.find() * Fix mypy error: add typing annotations to get_unichr() * Fix type of PDFCIDFont. Can be any type of CMapBase. This is a quick fix, the entire cmap structure does not have proper inheritance. * Added line to CHANGELOG.md * Add separate class for IdentityUnicodeMap * Remove ABC from CmapBase * Remove ABC from CmapBase * Remove blank line Co-authored-by: huan_cheng <huan_cheng@bestsign.cn> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-10-13 21:52:00 +02:00
Andrew Baumann	9406040d8e	Add type annotations (#661 ) Squashed commit of the following: commit fa229f7b7591c07aea4e5a4545f9e0c34246e1cd Merge: eaab3c6 `c3e3499` Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 20:33:06 2021 -0700 Merge branch 'develop' into mypy (and fixed types) commit eaab3c65e2e3ab5f1f400cfc5186a3834c4ffe34 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 20:00:45 2021 -0700 reformat all multi-line function defs to one-arg-per-line commit 3fe2b69eed9197009d9da6776462f580ebf0dfa3 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 15:58:48 2021 -0700 ccitt nit -- avoid casting needlessly commit 15983d8c1e7162632fde43752c9d1c15938cd980 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 15:58:36 2021 -0700 tweak CHANGELOG commit 13dc0babf782938e7d5b5e482d4c5adf92d82702 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 15:43:46 2021 -0700 add failing tests for dumppdf crash commit 6b509c517876b8c15ac5a98a963884e23bd2e4d8 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 15:24:23 2021 -0700 ccitt: apply misc PR feedback commit feb031ba86d3f22e41cfbbda13f17c039359f1e6 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 15:18:26 2021 -0700 add missing None return type to all __init__ methods commit c0d62d6c54c7ec37b40bea54a3f6a7a618ec0ec6 Author: Andrew Baumann <ab@ab.id.au> Date: Mon Sep 6 15:13:08 2021 -0700 minor cleanup, remove a few more Any types commit b52a0594e1998a492c172538a9b35491c5fc5f52 Author: Andrew Baumann <ab@ab.id.au> Date: Sun Sep 5 22:37:28 2021 -0700 tighten up types, avoid Any in favour of explicit casts commit e58fd48bd14f31bebd2de8259f12630ac02756d6 Author: Andrew Baumann <ab@ab.id.au> Date: Sun Sep 5 14:10:49 2021 -0700 annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes) commit 605290633e55595e5e0045840df5c5b1d9de843a Author: Andrew Baumann <ab@ab.id.au> Date: Sat Sep 4 22:37:38 2021 -0700 python 3.7 back-compat commit 4dbcf8760f8a1d3e3d99f085476f86e6a043c80c Author: Andrew 
Baumann <ab@ab.id.au> Date: Sat Sep 4 22:32:43 2021 -0700 annotate pdfminer.jbig2 commit 0d40b7c03a8028dc44acd3f457eac71abd681827 Author: Andrew Baumann <ab@ab.id.au> Date: Sat Sep 4 22:31:33 2021 -0700 annotate pdf2txt.py commit 5f82eb4f5646b5d1285252689191e0a14557ec7b Author: Andrew Baumann <ab@ab.id.au> Date: Sat Sep 4 09:16:31 2021 -0700 cleanup: make Plane generic commit 624fc92b88473ff36a174760883f34c22109da2b Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 23:16:51 2021 -0700 bluntly ignore calls to cryptography.hazmat commit 96b20439c169f40dbb114cabba6a582ad1ebe91e Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 23:01:06 2021 -0700 finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2 commit 0ab586347861b72b1d16880dc9293f9ad597e20a Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 21:51:56 2021 -0700 annotate pdffont commit 4b689f1bcbdaf654feb9de81023e318ca310a12e Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 18:30:02 2021 -0700 annotate a couple more scripts; document sketchy code commit 291981ff3d273952ec9c92ef8ab948473558b787 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 15:02:01 2021 -0700 pacify flake8 commit 45d2ce91ff333f3b7e34322b16e9c52b99b7a972 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 14:31:48 2021 -0700 annotate dumppdf, and comment likely bugs commit 7278d83851cb336a1be3803a0993b5ec0ad39b4c Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 13:49:58 2021 -0700 enable mypy on tests and tools, fix one implicit reexport bug commit 4a83166ef4e4733cd2113f43188b585a4fda392b Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 13:25:59 2021 -0700 pdfdocument: per dumppdf.py, get_dest accepts either bytes or str commit 43701e1bee068df98f378a253c9c2150ee4ad9f7 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 13:25:00 2021 -0700 layout: LAParams.boxes_flow may be None commit 164f81652f1788e74837466f0ab593e94079bc0f Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 09:45:09 2021 
-0700 add whitespace, pacify flake8 commit 893b9fb9ec918032b36a30456fc0b7a217da86d8 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 09:40:33 2021 -0700 support old Python without typing.Protocol commit dc245084102b7b04c3f5599d75b5d62ba4290787 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Sep 3 09:12:03 2021 -0700 Move "# type: ignore" comments to fix mypy on Python < 3.8 The placement of these comments got more flexible in 3.8 due to https://github.com/python/mypy/issues/1032 Satisfying older Python and fitting in flake8's 79-character line limit was quite a challenge! commit da03afe7bd2cf3336e611f467f1c901455940ae8 Author: Andrew Baumann <ab@ab.id.au> Date: Thu Sep 2 22:59:58 2021 -0700 fix text output from HTMLConverter commit 5401276a2ed3b74a385ebcab5152485224146161 Author: Andrew Baumann <ab@ab.id.au> Date: Thu Sep 2 22:40:22 2021 -0700 annotate high_level.py and the immediately-reachable internal APIs (mostly converters) commit cc490513f8f17a7adc0bcbab2e0e86f37e832300 Author: Andrew Baumann <ab@ab.id.au> Date: Thu Sep 2 17:04:35 2021 -0700 * expand and improve annotations in cmap, encryption/decompression and fonts * disallow untyped calls; this way, we have a core set of typed code that can grow over time (just not for ccitt, because there's a ton of work lurking there) * expand "typing: none" comments to suppress a specific error code commit 92df54ba1d53d5dbbd5442757dd85be5b1851f99 Author: Andrew Baumann <ab@ab.id.au> Date: Wed Sep 1 20:50:59 2021 -0700 update CHANGELOG commit f72aaead45d0615e472a9b3190c9551a6b67b36e Merge: ff787a9 `8ea9f10` Author: Andrew Baumann <ab@ab.id.au> Date: Wed Sep 1 20:47:03 2021 -0700 Merge branch 'develop' into mypy commit ff787a93986c60361536a97182a41774f4a53ac3 Author: Andrew Baumann <ab@ab.id.au> Date: Sat Aug 21 21:46:14 2021 -0700 be more precise about types on ps/pdf stacks, remove most of the Any annotations commit be1550189e10717f6827dbb7009d6e8c8b3f4c62 Author: Andrew Baumann <ab@ab.id.au> Date: Sat Aug 21 
10:13:58 2021 -0700 silence missing imports, (maybe?) hook to tox commit ff4b6a9bd46b352583d823d39065652c9a6f05f4 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Aug 20 22:49:06 2021 -0700 turn on more strict checks, and untangle the layout mess with generics Status: $ mypy pdfminer pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame" pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes" pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL" Found 5 errors in 4 files (checked 27 source files) pdfdevice.py:191 appears to be a real bug commit 5c9c0b19d26ae391aea0e69c2c819261cc04460c Author: Andrew Baumann <ab@ab.id.au> Date: Fri Aug 20 17:22:41 2021 -0700 finish annotating layout commit 0e6871c16abb29df2868ab145b4ce451b4b6c777 Author: Andrew Baumann <ab@ab.id.au> Date: Fri Aug 20 16:54:46 2021 -0700 general progress on annotations * finish utils * annotate more of pdfinterp, pdfdevice * document reason for # type: ignore comments * fix cyclic imports * satisfy flake8 commit 17d59f42917fbf9b2b2eb844d3e83a8f2a3f123a Author: Andrew Baumann <ab@ab.id.au> Date: Thu Aug 19 21:38:50 2021 -0700 WIP on type annotations With the possible exception of psparser.py, this is far from complete. 
$ mypy pdfminer pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame" pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"	2021-10-09 16:23:28 +02:00
htInEdin	33d7dde4d1	Fix bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed (#616 ) * detect TextIOWrapper as non-binary * I don't understand the CHANGELOG.md format, hope this is good enough * Delete \\\r\n in Literal Strings (ref. section 7.3.4.2 of PDF32000_2008) * Keep Travis CI happy * Added test * Remove pdfminer/Changelog * Prettify _parse_string_1 * Add CHANGELOG.md * Satisfy flake8 * Update CHANGELOG.md * Use logging.Logger.warning instead of warning.warn in most cases, following the Python official guidance that warning.warn is directed at _developers_, not users * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning, PDFNoValidXRefWarning * (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning * (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning * (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather than PDFNoValidXRefWarning * get name right * make flake8 happy * Revert "make flake8 happy" This reverts commit `4592769686`. * Revert "get name right" This reverts commit `80091ea211`. * Revert "Use logging.Logger.warning instead of warning.warn in most cases, following" This reverts commit `3c1e3d6606`. * Revert "Merge branch 'preferLoggingToWarning' into hst" This reverts commit `9d9d139921`, reversing changes made to `80091ea211`. * Revert "Revert "Merge branch 'preferLoggingToWarning' into hst"" This reverts commit `b3da21934d`. Co-authored-by: Henry S. Thompson <ht@home.hst.name> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-09-27 20:30:40 +02:00
Raphaël Cohen	c3e3499a6b	Add support for ISO 32000-2 AES256 encryption (#614 ) * feat: Add support for ISO 32000-2 AES256 encryption * feat: Applies review suggestions	2021-09-06 22:00:23 +02:00
Richard Millson	a70f08818d	Fix 594 use null id when encrypted but no id given (#595 ) Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-08-29 21:32:14 +02:00
wind_chh	234c466372	Fix extraction of some cjk characters (#593 ) Fixes #566 * try to fix issue of some Chinese characters cannot be extracted correctly (#566). * format code to pass flake8 check. * fix typo and refer to issue 593. Co-authored-by: huan_cheng <huan_cheng@bestsign.cn> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-08-26 21:05:03 +02:00
Jeremy Singer-Vine	016239c146	Fix .paint_path handling of single line segments (#530 ) * Fix .paint_path handling of single line segments - Fixes typo ("ml" should have been "mlh") - Removes if-statement that required individual line segments to be strictly horizontal or vertical. * Treat 'ml'-shape paths as lines not curves Although 'mlh' is the canonical implementation for a single line segment, 'ml' is fairly common. Adds tests and sample PDF. * Fix trailing whitespace * Fix point-extraction from Bézier path commands This commit corrects the manner in which "pts" are extracted from Bézier path commands. See Table 4.9 of PDF reference manual, and new comments in code for details. Previously, depending on the command (c, v, or y), the code was extracting some combination of control points (not on curve) and the actual points-on-curve. This commit also refactors .paint_path, so that apply_matrix_pt is only called in one place, and to treat the "h" command in a manner more consistent with other path commands. * Add comments to test_paint_path_quadrilaterals * Parse rect-forming mllll paths as rects not curves Now that .paint_path has been refactored, adding support for rect-forming mllll paths requires no extra code, beyond a minor tweak to the relevant elif statement. * One changelog line with ref to mr * Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring * Extract variables from if statement to make it easier to read * Optimize imports order * Trigger travis build * Revert "Trigger travis build" This reverts commit `41c05184` * Update travis badge * Update travis badge Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-07-27 18:27:32 +02:00
Ev2geny	693e4f48a3	Issue #469 is fixed (When run on Windows a lot of tests fail with the error: [Errno 13] Permission denied) (#484 ) Closes #469 * Issue #469 is fixed * one extra comment to code is added * TemporaryFilePath context manager is added to facilitate tests * flake8 complaints fixed * Update docs of tempfilepath.py * Fix flake8 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-26 10:10:11 +01:00
Pieter Marsman	f8e6ad6ac1	Remove support for non standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream (#523 ) Closes #191 * Remove support for non standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream * Add docstring * Fix flake8	2020-10-25 14:37:12 +01:00
Jeremy Singer-Vine	e83dd26671	Fix .paint_path for non-rectangle quadrilaterals (#512 ) * Fix paint_path bug noted in issue #473 Focuses on the handling of non-rect quadrilaterals, the decomposition of complex (m.h) paths into subpaths, and assigning those subpaths the correct LTCurve/LTRect type. Also adds a test for cases presented in issue #473 * Tweak paint_path fix per @pietermarsman review - Adjusts logic to adhere to if-elif-else rather than early returns. - Shortens subpath detection/reprocessing step, using re.finditer(). * Reorder paint_path() if-else statements once more * Fix flake8 issues * Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on. Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-10-12 17:53:00 +02:00
estshorter	f03657e5c4	Allow a pathlib.PurePath object as a input to open_filename (#492 ) * open_filename accepts a pathlib.PurePath object * Add test for open_filename with pathlib * Fix a wrong function name * Cast a pathlib object to string for py3.4/3.5 * Add link to the PR * Raise an exception when open_filename gets an unsupported type * Add tests for open_filename * Update CHANGELOG.md * Documentation Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-17 21:29:00 +02:00
Igor Moura	a83f853de7	Remove unused rijndael encryption implementation (#465 ) * Remove unused rijndael encryption * Add current PR link to CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-09-10 19:28:00 +02:00
Pieter Marsman	4f65242750	Always try to get CMap, even if name is not recognized (#438 ) * Add trying to get cmap from pickle file. And cleaning up a bit. * Don't use keyword argument for dict.get * Add docs * Make _get_cmap_name static * Add test * Add CHANGELOG.md * Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there * Add CJK characters to expected output of simple3.pdf * Fix line length * Add comment	2020-07-23 20:27:38 +02:00
lithiumFlower	c10cf3cdb8	Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456 ) * swap pycryptodome to the faster, smaller, and industry standard cryptography io * update changelog * fixlint * Update CHANGELOG.md * from MR, unneeded ex and naming * add samples to nosetests * fix lint * show mismatch * fix lint * typo and newline * Revert "add samples to nosetests" This reverts commit `a49ca302` * Add tests for encrypted documents to nose test suite * Optimize imports of pdfdocument.py Co-authored-by: Oren Tysor <oren@atakama.com> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung	60863cfd55	Fix converting path to multiple rectangles (#371 ) * Fix converting path to multiple rectangles For path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes #369 * Reduce pdf size by removing font * Add unittest for PDFLayoutAnalyzer.paint_path() * Add line to CHANGELOG.md * Add reference to pdf reference manual * Cleanup function paint_path a bit * Reduce line length of tests * Reduce line length of tests Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 17:34:38 +02:00
madhurcodes	6a9269b432	Change Text extraction is not allowed error to warning (#453 ) * Changed error to warning for 'Text extraction is not allowed' * updated changelog * fix lint * made changes suggested in review * Update CHANGELOG.md * Add regression test for failing pdf * Reduce line length to <80 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 16:04:11 +02:00
Pieter Marsman	6e05baf0b7	Dont dump fallback xref by default when using dumppdf.py, adding a flag to enable it Fixes #176 * Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's * Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref * Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know * Adding changelog * Extend help message	2020-05-23 18:04:34 +02:00
Jake Stockwin	7254530d27	Fix ordering of textlines within a textbox when boxes_flow is disabled (#412 ) * Fix ordering of textlines within a textbox when boxes_flow is disabled * Add new test PDF sample	2020-05-09 15:37:49 +02:00
fabbox	7eff108fa5	add shebang line to script in tools (#408 ) * add shebang line to script in tools * fix: use shebang line with python 3 * Moved changelog to unreleased Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-04-28 10:58:42 +02:00
Jake Stockwin	68e2ae8632	Fix text coming in reverse order with boxes flow disabled (#399 ) Closes #398	2020-04-01 13:37:04 +02:00
Jake Stockwin	1a4a06da9f	Fix #392 Split out IO logic from high level functions (#393 ) * Allow file-like inputs to high level functions (#392) * PR Review - move open_filename to utils	2020-03-26 22:52:00 +01:00
Jake Stockwin	1cc1b961c5	Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382 ) (#384 ) * Group text lines if they are centered (#382) Closes #382 * Add comparison private methods to LTTextLines * Add missing docstrings * Add tests for find_neighbors * Update changelog * Cosmetic changes from code review	2020-03-23 22:38:39 +01:00
Pieter Marsman	9d7fe2d9ee	Catch ValueError when converting font encoding differences to characters (#389 ) * Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed * Add test for catching ValueError and KeyError when font encoding differences are invalid * Added line to CHANGELOG.md	2020-03-16 20:12:45 +01:00
Pieter Marsman	1d773dc38a	Fix grouping textlines when bounding box of parent container is wrong (#386 ) * Default value for --all-texts should be false, because using the flag enables it * Fix edge case: when no neighbors are found a line should form its own text box * Added test for grouping textlines where 1 is outside the parent bounding box * Added CHANGELOG.md line	2020-03-14 10:33:39 +01:00
Pieter Marsman	1c3047b68b	Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364 ) Fixes #363 * Remove samples/ and docs/ from source distribution. The samples/ directory contains pdf's for testing purposes and the docs/ contain readthedocs documentation and is published online. * Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit. See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen And https://github.com/pdfminer/pdfminer.six/issues/363 * Added line to CHANGELOG.md * Remove unused imports	2020-01-24 12:36:02 +01:00
Pieter Marsman	fff3ac2ba6	Fix bug in computing character bounding box (#348 ) * Remove scaling font height/width with size of font bounding box * Refactor LTChar bounding box computation * Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font. * Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects * Add line to CHANGELOG	2020-01-16 22:15:50 +01:00
Pieter Marsman	2f7f5d2667	Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338 ) * Fix getting filename when extracting embedded files * Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs * Add line to CHANGELOG	2020-01-16 22:11:42 +01:00
Recursing	0b1741b9bf	Pack the /P (ermissions) entry from the /Encrypt dictionary in the file trailer, as unsigned long (#352 ) Fixes #186 * Treat the permissions (the /P entry) as unsigned long, fix #186 * handle negative values for p * Extract function for resolving a two's-complement * Add test for issue #352 * Add line to CHANGELOG.md * Only ints can be converted to a uint using two's-complement method * Standardize import style; multiple imports from same module on one line Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-01-07 21:59:13 +01:00
Pieter Marsman	3502dc9f3b	Drop support for legacy Python 2 (#346 ) * Drop support for legacy Python 2 * Add python_requires to help pip * Upgrade Python syntax with pyupgrade * Upgrade Python syntax with pyupgrade --py3-plus * Python 3 imports * Replace six * Update CONTRIBUTING.md * Added line to changelog Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>	2020-01-04 16:47:07 +01:00
Pieter Marsman	f3ab1bc61e	Enforce pep8 coding-style (#345 ) * Code Refactor: Use code-style enforcement #312 * Add flake8 to travis-ci * Remove python 2 3 comment on six library. 891 errors > 870 errors. * Remove class and functions comments that consist of just the name. 870 errors > 855 errors. * Fix flake8 errors in pdftypes.py. 855 errors > 833 errors. * Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing * Cleanup pdfinterp.py and add documentation from PDF Reference * Cleanup pdfpage.py * Cleanup pdffont.py * Clean psparser.py * Cleanup high_level.py * Cleanup layout.py * Cleanup pdfparser.py * Cleanup pdfcolor.py * Cleanup rijndael.py * Cleanup converter.py * Rename klass to cls if it is the class variable, to be more consistent with standard practice * Cleanup cmap.py * Cleanup pdfdevice.py * flake8 ignore fontmetrics.py * Cleanup test_pdfminer_psparser.py * Fix flake8 in pdfdocument.py; 339 errors to go * Fix flake8 utils.py; 326 errors to go * pep8 correction for few files in /tools/ 328 > 160 to go (#342) * pep8 correction for few files in /tools/ 328 > 160 to go * pep8 correction: 160 > 5 to go * Fix ascii85.py errors * Fix error in getting index from target that does not exist * Remove commented print lines * Fix flake8 error in pdfinterp.py * Fix python2 specific error by removing argument from print statement * Ignore invalid python2 syntax * Update contributing.md * Added changelog * Remove unused import Co-authored-by: Fakabbir Amin <f4amin@gmail.com>	2019-12-29 21:20:20 +01:00
Pieter Marsman	2bee7d8dcf	Fix wrong ordering of grouping textboxes introduced by #315 . The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335 ) Fixes #334	2019-11-10 12:18:49 +01:00
Igor Moura	40aa2533c9	Added: simple wrapper to extract text from pdf (#330 ) Fixes #327	2019-11-07 07:54:10 +01:00
Pieter Marsman	6cc78ee124	Replace opts by argparse in dumppdf.py (#321 ) Also add multi-character argument names Fixes #175	2019-10-27 21:40:04 +01:00
Pieter Marsman	1c4a4167ed	Fix failing test on develop & cleaning up test files (#319 )	2019-10-26 18:42:33 +02:00
Pieter Marsman	a238a19999	Fix assertionerror when dumping pdf with reference to objid 0 (#318 ) Fixes #94 Added: test to check if `PDFObjectNotFound` error is raised if objid 0 is requested.	2019-10-25 22:49:58 +02:00
Serj Sintsov	cb9cd8ea46	Use named logger instead of root logger (#236 )	2019-10-22 20:52:43 +02:00
jbarlow83	733ddf7e57	Added: tests for extracting tests from pdfs with Type3 fonts (#205 )	2019-10-22 18:15:59 +02:00
Pieter Marsman	373c6e7b97	Added: extraction of JBIG2 encoded images (#311 ) And added test for pdf with JBIG2 image. Fixes #26 Closes #46	2019-10-22 17:37:06 +02:00
jet457	7e40fde320	Removing assertion in drange to allow equal inputs (#246 ) and mimic behaviour of built-in method range Fixes #66, since it now allows the bbox to have 0 width or 0 height Added tests for Plane since it is the API that uses drange	2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev	4df6d4e5ca	Changed: comparisons for image colorspace literals (#132 ) Fixes #131 Changed: comparisons for image colorspace literals Added: test for extracting images from pdfs	2019-10-15 16:11:54 +02:00
Fakabbir Amin	3f0f05def6	Merge branch 'pdfstream-as-cmap' of https://github.com/fakabbir/pdfminer.six into pdfstream-as-cmap	2019-08-10 11:04:10 +05:30
Fakabbir Amin	3125d3634a	Correct old test cases	2019-08-10 11:03:28 +05:30
Fakabbir Amin	fe38695739	Merge branch 'develop' into pdfstream-as-cmap	2019-08-10 10:44:31 +05:30
Fakabbir Amin	5b210981c9	Adds Test Case	2019-08-10 10:19:20 +05:30
Fakabbir Amin	f1a4dcea88	Adds Test Cases, Neater Code For CMap Assignment	2019-07-24 11:56:06 +05:30
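
Commit 10f6fb40c2 in the table above describes recovering text from badly compressed streams by decompressing one byte at a time and ignoring an error raised in the last check bytes. A minimal Python sketch of that idea follows. The function name `decompress_tolerant` and the `tail` parameter are assumptions made for illustration and are not taken from the pdfminer.six source. The sketch also re-raises an error that occurs earlier in the stream, whereas the commit message says a warning is used in that case.

```python
import zlib


def decompress_tolerant(data: bytes, tail: int = 3) -> bytes:
    """Inflate zlib data byte by byte, tolerating a corrupt trailing checksum.

    Hypothetical helper, not the pdfminer.six implementation. `tail` is the
    number of final bytes in which an error is ignored (the commit message
    says 3 was found arbitrarily).
    """
    decompressor = zlib.decompressobj()
    result = b""
    for index in range(len(data)):
        try:
            result += decompressor.decompress(data[index:index + 1])
        except zlib.error:
            # An error before the last `tail` bytes means real data loss,
            # so let it propagate; otherwise keep what was recovered.
            if index < len(data) - tail:
                raise
            break
    return result


payload = b"pdfminer.six " * 20
good = zlib.compress(payload)
# Flip the last three bytes, which belong to the Adler-32 checksum.
broken = good[:-3] + bytes(b ^ 0xFF for b in good[-3:])
recovered = decompress_tolerant(broken)
```

Here `zlib.decompress(broken)` would raise `zlib.error`, while `decompress_tolerant(broken)` returns the original payload because the checksum failure is only detected on the final byte.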

72 Commits (dc530f3a6fe198ae556e799a9288fe537c13644e)
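
Commit 0b1741b9bf in the log above packs the /P permissions entry as an unsigned long and handles negative values with a two's-complement conversion. The sketch below illustrates that conversion in Python. The helper name `to_unsigned_32`, the example value -44, and the little-endian byte order are assumptions for illustration, not details confirmed by the log.

```python
import struct


def to_unsigned_32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned via two's complement.

    Hypothetical helper for illustration; not the pdfminer.six function.
    """
    if not isinstance(value, int):
        raise TypeError("only ints can be converted")
    if not -(1 << 31) <= value < (1 << 32):
        raise ValueError("value does not fit in 32 bits")
    # Masking to 32 bits maps negative values onto their unsigned equivalent.
    return value & 0xFFFFFFFF


# Example /P value (assumed); packing a negative number directly with the
# "L" format would raise struct.error.
packed = struct.pack("<L", to_unsigned_32(-44))
```

Non-negative values pass through unchanged, so the conversion is safe to apply to every /P value before packing.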