pdfminer.six

Commit Graph

Author	SHA1	Message	Date
jwyawney	43c8fc8557	Ignore empty characters when analyzing layout (#689) * Adding in checks for spurious lines that contain either only spaces or new line characters * Added spurious lines check and unit tests * Updated CHANGELOG.md with changes * Simplify code * Simplify code * Simplify code * Remove changes to lines that are not actually changed * Format import * Improve CHANGELOG.md * Improve CHANGELOG.md * Fix cicd * Blacken Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-02-22 21:20:26 +01:00
Andrew Baumann	1d1602e0c5	Added feature: page labels (#680) * port page label code from pdfannots * add tests and clean up * more cleanup; harden against non-conforming input * one more test * update CHANGELOG * cleanup & respond to review feedback (incomplete) * Refactor implementation of get_page_labels() into a NumberTree and PageLabels class. * PageLabels is a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more. * fix type errors and cleanup slightly * fix mypy errors (including tweaking code to avoid problematic dynamic types) * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be) * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-02-01 10:08:05 +01:00
Andrew Baumann	95dee8d67c	Fix regression in page layout that sometimes returned text lines out of order (#659) * add a test * fix the bug * rewrap long lines * update CHANGELOG * re-merge CHANGELOG Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2022-01-26 19:55:08 +01:00
Pieter Marsman	aa5dec252f	Fixes jbig2 writer to write valid jb2 files See: https://github.com/pdfminer/pdfminer.six/pull/653 Squashed commit of the following: commit 8748c9fcddab0826cca243eee45c40d2b6611e80 Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:40:50 2022 +0100 Remove prints in test commit bb977258a39fc7baa13bba1c3ea29726e17c0f6d Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:35:12 2022 +0100 Cleanup exception handling for jbig2 global streams commit cf0b47b01b7caad8acbd82097aadadb620606a8b Merge: `a5831d1` `708dd20` Author: Pieter Marsman <pietermarsman@gmail.com> Date: Sun Jan 23 21:29:15 2022 +0100 Merge branch 'develop' into jbig2_fix commit `a5831d110a` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:59:17 2021 -0400 flake8 tests commit `18ffa29387` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:52:11 2021 -0400 add description in changelog commit `6c7ee43d6c` Author: Forest Gregg <fgregg@datamade.us> Date: Sun Aug 1 22:43:36 2021 -0400 Fixes jbig2 writer to write valid jb2 files - closes #652	2022-01-23 21:41:08 +01:00
Sylvain Thénault	10f6fb40c2	Attempt to handle decompression error on some broken PDF files (#637 ) * Attempt to handle decompression error on some broken PDF files from times to times we go through files where no text is detected, while readers like evince reads the pdf nicely. After digging it occured this is because the PDF includes some badly compressed data. This may be fixed by uncompressing byte per byte and ignoring the error on the last check bytes (arbitrarily found to be the 3 last). This has been largely inspired by https://github.com/mstamy2/PyPDF2/issues/422 and the test file has been taken from there, so credits to @zegrep. * Attempt to handle decompression error on some broken PDF files from times to times we go through files where no text is detected, while readers like evince reads the pdf nicely. After digging it occured this is because the PDF includes some badly compressed data. This may be fixed by uncompressing byte per byte and ignoring the error on the last check bytes (arbitrarily found to be the 3 last). This has been largely inspired by mstamy2/PyPDF2#422 and the test file has been taken from there, so credits to @zegrep. * Use a warnings instead of raising exception where zlib error is detected before the CRC checksum. * Add line to CHANGELOG.md * Only try decompressing if not in strict mode * Change error into warning because warning.warn needs a subclass of Warning Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-12-11 18:25:19 +01:00
wind_chh	c883f5e13f	Add support identity unicode cmap (#626) Fixes #625 * add support for Identity-H/V cmap fonts * format code to pass flake8 check * Remove indent * Remove indent * Use isinstance instead of type check * Use or instead of any * Use str in variable, instead of str.find() * Fix mypy error: add typing annotations to get_unichr() * Fix type of PDFCIDFont. Can be any type of CMapBase. This is a quick fix, the entire cmap structure does not have proper inheritance. * Added line to CHANGELOG.md * Add separate class for IdentityUnicodeMap * Remove ABC from CmapBase * Remove ABC from CmapBase * Remove blank line Co-authored-by: huan_cheng <huan_cheng@bestsign.cn> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-10-13 21:52:00 +02:00
Raphaël Cohen	c3e3499a6b	Add support for ISO 32000-2 AES256 encryption (#614) * feat: Add support for ISO 32000-2 AES256 encryption * feat: Applies review suggestions	2021-09-06 22:00:23 +02:00
Richard Millson	a70f08818d	Fix 594 use null id when encrypted but no id given (#595) Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-08-29 21:32:14 +02:00
wind_chh	234c466372	Fix extraction of some cjk characters (#593) Fixes #566 * try to fix issue of some Chinese characters cannot be extracted correctly (#566). * format code to pass flake8 check. * fix typo and refer to issue 593. Co-authored-by: huan_cheng <huan_cheng@bestsign.cn> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-08-26 21:05:03 +02:00
Jeremy Singer-Vine	016239c146	Fix .paint_path handling of single line segments (#530 ) * Fix .paint_path handling of single line segments - Fixes typo ("ml" should have been "mlh") - Removes if-statement that required individual line segments to be strictly horizontal or vertical. * Treat 'ml'-shape paths as lines not curves Althoguh 'mlh' is the canonical implementation for a single line segment, 'ml' is fairly common. Adds tests and sample PDF. * Fix trailing whitespace * Fix point-extraction from Beziér path commands This commit corrects the manner in which "pts" are extracted from Beziér path commands. See Table 4.9 of PDF reference manual, and new comments in code for details. Previously, depending on whether the command (c, v, or y) the code was extracting some combination of control points (not on curve) and the actual points-on-curve. This commit also refactors .paint_path, so that apply_matrix_pt is only called in one place, and to treat the "h" command in a manner more consistent with other path commands. * Add comments to test_paint_path_quadrilaterals * Parse rect-forming mllll paths as rects not curves Now that .paint_path has been refactored, adding support for rect-forming mllll paths requires no extra code, beyond a minor tweak to the relevant elif statement. * One changelog line with ref to mr * Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring * Extract variables from if statement to make it easier to read * Optimize imports order * Trigger travis build * Revert "Trigger travis build" This reverts commit `41c05184` * Update travis badge * Update travis badge Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2021-07-27 18:27:32 +02:00
typhoon71	4d8b5975cb	Add section to documentation with howto for AcroForm fields extraction (#458 ) * Create aforms.rst Add section to documentation with howto for AcroForm fields extraction * Update index.rst Added reference to aforms.rst * Update aforms.rst * Update aforms.rst * Update index.rst * Update and rename aforms.rst to acro_forms.rst * Update acro_forms.rst * Update acro_forms.rst * Update acro_forms.rst * Update index.rst * Update acro_forms.rst * Update acro_forms.rst * Update acro_forms.rst * Update pdfdocument.py * Update pdfdocument.py * Update pdfdocument.py * Update acro_forms.rst * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update acro_forms.rst * reverted changes * Update README.md * Proper processing of ComboBox ComboBox fields hold multiple values, so the must be returned as a list. * PDF with AcroForm (samples) * Create tmp * Delete AcroForm_TEST.pdf * Delete AcroForm_TEST_compiled.pdf * PDF file with AcroForms * Delete tmp * Fixed typo * Update index.rst * Update README.md * Update index.rst * Update pdfdocument.py * Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com> * Update pdfdocument.py * Update pdfdocument.py * Update pdfdocument.py Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>	2020-09-10 19:18:41 +02:00
lithiumFlower	c10cf3cdb8	Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456 ) * swap pycryptodome to the faster, smaller, and industry standard crytography io * update changelog * fixlint * Update CHANGELOG.md * from MR, unneeded ex and naming * add samples to nosetests * fix lint * show mismatch * fix lint * typo and newline * Revert "add samples to nosetests" This reverts commit `a49ca302` * Add tests for encrypted documents to nose test suite * Optimize imports of pdfdocument.py Co-authored-by: Oren Tysor <oren@atakama.com> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung	60863cfd55	Fix converting path to multiple rectangles (#371) * Fix converting path to multiple rectangles For path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes #369 * Reduce pdf size by removing font * Add unittest for PDFLayoutAnalyzer.paint_path() * Add line to CHANGELOG.md * Add reference to pdf reference manual * Cleanup function paint_path a bit * Reduce line length of tests * Reduce line length of tests Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 17:34:38 +02:00
madhurcodes	6a9269b432	Change Text extraction is not allowed error to warning (#453) * Changed error to warning for 'Text extraction is not allowed' * updated changelog * fix lint * made changes suggested in review * Update CHANGELOG.md * Add regression test for failing pdf * Reduce line length to <80 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 16:04:11 +02:00
Jake Stockwin	7254530d27	Fix ordering of textlines within a textbox when boxes_flow is disabled (#412) * Fix ordering of textlines within a textbox when boxes_flow is disabled * Add new test PDF sample	2020-05-09 15:37:49 +02:00
Pieter Marsman	1c3047b68b	Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364 ) Fixes #363 * Remove samples/ and docs/ from source distribution. The samples/ dictionairy contains pdf's for testing purposes and the docs/ contain readthedocs documentation and is published online. * Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit. See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen And https://github.com/pdfminer/pdfminer.six/issues/363 * Added line to CHANGELOG.md * Remove unused imports	2020-01-24 12:36:02 +01:00
Pieter Marsman	fff3ac2ba6	Fix bug in computing character bounding box (#348 ) * Remove scaling font height/width with size of font bounding box * Refactor LTChar bounding box computation * Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font. * Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects * Add line to CHANGELOG	2020-01-16 22:15:50 +01:00
Pieter Marsman	2f7f5d2667	Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338) * Fix getting filename when extracting embedded files * Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs * Add line to CHANGELOG	2020-01-16 22:11:42 +01:00
Recursing	0b1741b9bf	Pack the /P (ermissions) entry from the /Encrypt dictionionary in the file trailer, as unsigned long (#352 ) Fixes #186 * Tread the permissions (the /P entry) as unsigned long, fix #186 * handle negative values for p * Extract function for resolving an twos-complement * Add test for issue #352 * Add line to CHANGELOG.md * Only ints can be converted to a uint using two's-complement method * Standardize import style; multiple imports from same module on one line Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-01-07 21:59:13 +01:00
Pieter Marsman	3502dc9f3b	Drop support for legacy Python 2 (#346) * Drop support for legacy Python 2 * Add python_requires to help pip * Upgrade Python syntax with pyupgrade * Upgrade Python syntax with pyupgrade --py3-plus * Python 3 imports * Replace six * Update CONTRIBUTING.md * Added line to changelog Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>	2020-01-04 16:47:07 +01:00
Pieter Marsman	1c4a4167ed	Fix failing test on develop & cleaning up test files (#319)	2019-10-26 18:42:33 +02:00
jbarlow83	733ddf7e57	Added: tests for extracting tests from pdfs with Type3 fonts (#205)	2019-10-22 18:15:59 +02:00
Pieter Marsman	373c6e7b97	Added: extraction of JBIG2 encoded images (#311) And added test for pdf with JBIG2 image. Fixes #26 Closes #46	2019-10-22 17:37:06 +02:00
Fakabbir Amin	5b210981c9	Adds Test Case	2019-08-10 10:19:20 +05:30
Sebastian Schuberth	ec8530f6cf	Add a test for the previous fix	2017-10-16 12:35:16 +02:00
Philippe Guglielmetti	b010db6049	solves https://github.com/pdfminer/pdfminer.six/issues/65	2017-07-20 21:17:06 +02:00
Philippe Guglielmetti	82af7f0aac	issue #56 reproduced, solution attempt unsucessful	2017-04-19 14:19:14 +02:00
Philippe Guglielmetti	7055862eaf	solves https://github.com/pdfminer/pdfminer.six/issues/50	2017-04-18 18:20:31 +02:00
Daniel Berthereau	10815bff7b	Fixed tests.	2016-06-27 00:00:00 +02:00
cybjit	2ee7153f6e	add python3 in sample Makefile	2014-09-16 22:56:13 +02:00
Yusuke Shinyama	2e900e5d10	Fixed for consistent test results. (hopefully...)	2014-06-26 17:41:31 +09:00
Yusuke Shinyama	a3ab6c253b	Fixed: loose autotesting.	2014-06-25 19:50:20 +09:00
Yusuke Shinyama	8f9c4dedff	Test rig cleanup.	2014-06-15 11:41:30 +09:00
Yusuke Shinyama	a8ec99a848	More autotest tweaks.	2014-06-15 10:52:59 +09:00
Yusuke Shinyama	fb3f2d9629	Further test tweaks.	2014-06-14 12:00:31 +09:00
Yusuke Shinyama	a7489aaabe	Fixed: autotests	2014-06-14 10:54:40 +09:00
numion	a4997d6f10	Implement revision 4 and 5 encryption handler.	2014-05-19 16:27:43 +02:00
Yusuke Shinyama	c8b6d4112a	Fixed: crash with negative layout bbox.	2013-11-09 15:10:14 +09:00
Matthew Duggan	f02cb11945	Update test references based on recent layout analysis improvements	2013-11-07 17:44:09 +09:00
Yusuke Shinyama	56917a213c	testcase updated	2011-05-15 01:22:51 +09:00
Yusuke Shinyama	e8cd880409	testdata changed	2011-02-27 19:48:22 +09:00
yusuke.shinyama.dummy	5d98a27d9c	test cases updated git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@282 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-12-25 08:41:11 +00:00
yusuke.shinyama.dummy	509ab66319	stay with python2 git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@264 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-10-19 09:57:01 +00:00
yusuke.shinyama.dummy	607d4734db	update test cases git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@255 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-10-17 05:15:28 +00:00
yusuke.shinyama.dummy	3305c07ba2	layout analysis improved git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@245 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-10-17 05:13:39 +00:00
yusuke.shinyama.dummy	0944cfaded	test file simple3.pdf added. git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@240 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-08-29 06:39:41 +00:00
yusuke.shinyama.dummy	83d2086f19	fix minor layout issue git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@239 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-08-29 06:39:31 +00:00
yusuke.shinyama.dummy	f5aff374fc	some wordings and documentations git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@229 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-06-19 03:56:50 +00:00
yusuke.shinyama.dummy	f2005bee55	non-free sample files moved into a separate directory git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@227 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-06-13 04:35:18 +00:00
yusuke.shinyama.dummy	aa7e7d3e35	add a README file to show credits of the sample files. git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@223 1aa58f4a-7d42-0410-adbc-911cccaed67c	2010-06-06 05:16:37 +00:00


78 Commits (d89cc357ee4d3be82412b14f4c1aee9e1141f5d2)