pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Jake Stockwin	ac2b20a79a	[docs] Add extract_pages tutorial (#442) Closes https://github.com/pdfminer/pdfminer.six/issues/361	2020-06-29 20:07:05 +02:00
AhnHyunJin	09c989f301	Fix spelling error (#436) * Change rwo to two in pdfdiff.py Co-authored-by: ahnhyunjin <hj.ahn@promptech.co.kr>	2020-06-06 15:43:57 +02:00
Pieter Marsman	6e05baf0b7	Don't dump fallback xref by default when using dumppdf.py, adding a flag to enable it Fixes #176 * Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's * Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref * Use warning instead of error, because not outputting xrefs is just fine (there aren't any) but it is something the user should know * Adding changelog * Extend help message	2020-05-23 18:04:34 +02:00
Pieter Marsman	33b60dfd54	Bump version	2020-05-17 17:50:01 +02:00
Pieter Marsman	91d89af788	Add section to documentation with howto for image extraction (#427) * Make structure of documentation more clear: tutorials, how-to, topics and reference * Add howto for images * Restructure tutorials section, and add install section * Always use up-to-date version * Fix indentation warning in docstring * Add option to dumppdf.py and pdf2txt.py to show version Fixes #162	2020-05-17 17:48:06 +02:00
Jake Stockwin	7254530d27	Fix ordering of textlines within a textbox when boxes_flow is disabled (#412) * Fix ordering of textlines within a textbox when boxes_flow is disabled * Add new test PDF sample	2020-05-09 15:37:49 +02:00
fabbox	7eff108fa5	add shebang line to script in tools (#408) * add shebang line to script in tools * fix: use shebang line with python 3 * Moved changelog to unreleased Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-04-28 10:58:42 +02:00
Pieter Marsman	d79bcb75ea	Bump version 20200402	2020-04-01 21:37:39 +02:00
Pieter Marsman	b8988b6848	Bump version	2020-04-01 21:22:59 +02:00
Jake Stockwin	68e2ae8632	Fix text coming in reverse order with boxes flow disabled (#399) Closes #398	2020-04-01 13:37:04 +02:00
Jake Stockwin	e55560f858	Fix #395: Update documentation for boxes_flow, allow None (#396) * Update documentation for boxes_flow, allow None * Apply comments from code review * Small wording changes, remove unnecessary comment * Update boxes_flow documentation for pdf2text * Pin version of tox to ensure python 3.4 support	2020-03-26 23:03:49 +01:00
Jake Stockwin	518b5d6efc	Fix #390: Updated misleading documentation about word_margin (#407) * Updated misleading documentation about word_margin * Small change in sentence about word_margin * Remove confusing sentence about adding spaces Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-03-26 23:02:48 +01:00
Jake Stockwin	1a4a06da9f	Fix #392 Split out IO logic from high level functions (#393) * Allow file-like inputs to high level functions (#392) * PR Review - move open_filename to utils	2020-03-26 22:52:00 +01:00
Jake Stockwin	1cc1b961c5	Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382) (#384) * Group text lines if they are centered (#382) Closes #382 * Add comparison private methods to LTTextLines * Add missing docstrings * Add tests for find_neighbors * Update changelog * Cosmetic changes from code review	2020-03-23 22:38:39 +01:00
Pieter Marsman	9d7fe2d9ee	Catch ValueError when converting font encoding differences to characters (#389) * Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed * Add test for catching ValueError and KeyError when font encoding differences are invalid * Added line to CHANGELOG.md	2020-03-16 20:12:45 +01:00
fzyzcjy	a087d6dfc8	Fix typo in README.md (#388)	2020-03-14 11:00:37 +01:00
Pieter Marsman	1d773dc38a	Fix grouping textlines when bounding box of parent container is wrong (#386) * Default value for --all-texts should be false, because using the flag enables it * Fix edge case: when no neighbors are found a line should form its own text box * Added test for grouping textlines where 1 is outside the parent bounding box * Added CHANGELOG.md line	2020-03-14 10:33:39 +01:00
Pieter Marsman	7e91d4ec6d	Improve docs and github templates	2020-03-08 15:06:13 +01:00
Pieter Marsman	bab6d154c2	Bump version 20200124	2020-01-24 12:38:11 +01:00
Pieter Marsman	1c3047b68b	Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364) Fixes #363 * Remove samples/ and docs/ from source distribution. The samples/ directory contains pdf's for testing purposes and the docs/ directory contains readthedocs documentation that is published online. * Remove issue-00152-embedded-pdf.pdf because it contains a possible exploit. See https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Exploit%3AJS%2FShellCode.gen And https://github.com/pdfminer/pdfminer.six/issues/363 * Added line to CHANGELOG.md * Remove unused imports	2020-01-24 12:36:02 +01:00
Pieter Marsman	bc494ff03c	Bump version to 20200121	2020-01-21 21:13:52 +01:00
Pieter Marsman	52da65d5eb	Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's. (#360) * Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's. * Added line to CHANGELOG.md	2020-01-16 22:26:01 +01:00
Pieter Marsman	410d7ecac3	Fix value for font-family in html by removing the subset tag from the PDF font-name (#357) * Fix font name by removing subset tag * Added line to CHANGELOG.md * Add documentation and clear variable name * Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`	2020-01-16 22:25:20 +01:00
Pieter Marsman	fff3ac2ba6	Fix bug in computing character bounding box (#348) * Remove scaling font height/width with size of font bounding box * Refactor LTChar bounding box computation * Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font. * Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects * Add line to CHANGELOG	2020-01-16 22:15:50 +01:00
Pieter Marsman	2f7f5d2667	Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338) * Fix getting filename when extracting embedded files * Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs * Add line to CHANGELOG	2020-01-16 22:11:42 +01:00
Recursing	0b1741b9bf	Pack the /P (ermissions) entry from the /Encrypt dictionary in the file trailer, as unsigned long (#352) Fixes #186 * Treat the permissions (the /P entry) as unsigned long, fix #186 * handle negative values for p * Extract function for resolving a two's-complement * Add test for issue #352 * Add line to CHANGELOG.md * Only ints can be converted to a uint using two's-complement method * Standardize import style; multiple imports from same module on one line Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-01-07 21:59:13 +01:00
Pieter Marsman	e4790fdbc2	Add AES as supported encryption method to docs	2020-01-07 18:38:53 +01:00
Pieter Marsman	b27d3d0aff	Bump version	2020-01-04 18:15:15 +01:00
Pieter Marsman	6eb9957e8a	Update docs: at least python 3.4 is needed now	2020-01-04 16:51:54 +01:00
Pieter Marsman	3502dc9f3b	Drop support for legacy Python 2 (#346) * Drop support for legacy Python 2 * Add python_requires to help pip * Upgrade Python syntax with pyupgrade * Upgrade Python syntax with pyupgrade --py3-plus * Python 3 imports * Replace six * Update CONTRIBUTING.md * Added line to changelog Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>	2020-01-04 16:47:07 +01:00
Pieter Marsman	f3ab1bc61e	Enforce pep8 coding-style (#345) * Code Refactor: Use code-style enforcement #312 * Add flake8 to travis-ci * Remove python 2 3 comment on six library. 891 errors > 870 errors. * Remove class and functions comments that consist of just the name. 870 errors > 855 errors. * Fix flake8 errors in pdftypes.py. 855 errors > 833 errors. * Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing * Cleanup pdfinterp.py and add documentation from PDF Reference * Cleanup pdfpage.py * Cleanup pdffont.py * Clean psparser.py * Cleanup high_level.py * Cleanup layout.py * Cleanup pdfparser.py * Cleanup pdfcolor.py * Cleanup rijndael.py * Cleanup converter.py * Rename klass to cls if it is the class variable, to be more consistent with standard practice * Cleanup cmap.py * Cleanup pdfdevice.py * flake8 ignore fontmetrics.py * Cleanup test_pdfminer_psparser.py * Fix flake8 in pdfdocument.py; 339 errors to go * Fix flake8 utils.py; 326 errors to go * pep8 correction for few files in /tools/ 328 > 160 to go (#342) * pep8 correction for few files in /tools/ 328 > 160 to go * pep8 correction: 160 > 5 to go * Fix ascii85.py errors * Fix error in getting index from target that does not exists * Remove commented print lines * Fix flake8 error in pdfinterp.py * Fix python2 specific error by removing argument from print statement * Ignore invalid python2 syntax * Update contributing.md * Added changelog * Remove unused import Co-authored-by: Fakabbir Amin <f4amin@gmail.com>	2019-12-29 21:20:20 +01:00
Martin Hasoň	78f06225b6	Removed duplicated and therefore unused code from pdf2txt.py (#341)	2019-12-09 22:04:05 +01:00
Pieter Marsman	452f0b4ad0	Merge branch 'develop'	2019-11-10 12:59:55 +01:00
Pieter Marsman	803a7d9598	Release 20191110	2019-11-10 12:29:14 +01:00
Pieter Marsman	2bee7d8dcf	Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335) Fixes #334	2019-11-10 12:18:49 +01:00
Pieter Marsman	b63a636512	Merge branch 'develop'	2019-11-07 21:52:58 +01:00
Pieter Marsman	5c6fa8f986	Release 20191107	2019-11-07 21:52:44 +01:00
Pieter Marsman	bc034c8e59	Create sphinx documentation for Read the Docs (#329) Fixes #171 Fixes #199 Fixes #118 Fixes #178 Added: tests for building documentation and example code in documentation Added: docstrings for commonly used functions and classes Removed: old documentation	2019-11-07 21:12:34 +01:00
Igor Moura	40aa2533c9	Added: simple wrapper to extract text from pdf (#330) Fixes #327	2019-11-07 07:54:10 +01:00
Pieter Marsman	027bb62943	Merge branch 'develop' of github.com:pdfminer/pdfminer.six into develop	2019-11-06 21:51:41 +01:00
Pieter Marsman	548b933a84	Add line to CHANGELOG.md for #325	2019-11-06 21:51:34 +01:00
Martin Hasoň	ed1b09c6f2	Fix debug logging for pdf2txt.py and dumppdf.py (#325) Fixes #313	2019-11-06 21:47:19 +01:00
Pieter Marsman	33b16b3f07	Deprecate the use of _py2_no_more_posargs (#328) Fixes #324	2019-11-02 10:29:39 +01:00
Jianfeng	44b223cf0a	Speedup grouping of textboxes (#315) Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration Changed: avoid potentially huge number of variable assignments in list comprehension. Changed: avoid repeatedly evaluating `obj is obj` in list comprehension by storing id(obj).	2019-10-31 09:22:58 +01:00
Pieter Marsman	6cc78ee124	Replace opts by argparse in dumppdf.py (#321) Also add multi-character argument names Fixes #175	2019-10-27 21:40:04 +01:00
Pieter Marsman	347c125fb8	Revert "Move old documentation to subfolder" This reverts commit `a2e6c7c0`	2019-10-27 14:26:11 +01:00
Pieter Marsman	a2e6c7c0c9	Move old documentation to subfolder	2019-10-27 14:21:47 +01:00
Pieter Marsman	d88d6020a2	Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320) Fixes #314 Fixes #105	2019-10-26 19:16:37 +02:00
Pieter Marsman	1c4a4167ed	Fix failing test on develop & cleaning up test files (#319)	2019-10-26 18:42:33 +02:00
Pieter Marsman	a238a19999	Fix assertionerror when dumping pdf with reference to objid 0 (#318) Fixes #94 Added: test to check if `PDFObjectNotFound` error is raised if objid 0 is requested.	2019-10-25 22:49:58 +02:00

916 Commits (8f52578e85b27831ab8a68a6d86721ea3348a553)
