pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Pieter Marsman	4f65242750	Always try to get CMap, even if name is not recognized (#438 ) * Add trying to get cmap from pickle file. And cleaning up a bit. * Don't use keyword argument for dict.get * Add docs * Make _get_cmap_name static * Add test * Add CHANGELOG.md * Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there * Add CJK characters to expected output of simple3.pdf * Fix line length * Add comment	2020-07-23 20:27:38 +02:00
Pieter Marsman	3cebf5ef66	Release 20200720	2020-07-20 22:05:19 +02:00
lithiumFlower	c10cf3cdb8	Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456 ) * swap pycryptodome to the faster, smaller, and industry standard crytography io * update changelog * fixlint * Update CHANGELOG.md * from MR, unneeded ex and naming * add samples to nosetests * fix lint * show mismatch * fix lint * typo and newline * Revert "add samples to nosetests" This reverts commit `a49ca302` * Add tests for encrypted documents to nose test suite * Optimize imports of pdfdocument.py Co-authored-by: Oren Tysor <oren@atakama.com> Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-20 22:00:54 +02:00
Kwok-kuen Cheung	60863cfd55	Fix converting path to multiple rectangles (#371 ) * Fix converting path to multiple rectangles For path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes #369 * Reduce pdf size by removing font * Add unittest for PDFLayoutAnalyzer.paint_path() * Add line to CHANGELOG.md * Add reference to pdf reference manual * Cleanup function paint_path a bit * Reduce line length of tests * Reduce line length of tests Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 17:34:38 +02:00
madhurcodes	6a9269b432	Change Text extraction is not allowed error to warning (#453 ) * Changed error to warning for 'Text extraction is not allowed' * updated changelog * fix lint * made changes suggested in review * Update CHANGELOG.md * Add regression test for failing pdf * Reduce line length to <80 Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-11 16:04:11 +02:00
Tony(Baojia) Tong	836d312982	Validate that object is PDFStream in do_EI (#451 ) * check obj type * update changelog * Update CHANGELOG.md Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-07-05 13:42:15 +02:00
Pieter Marsman	6e05baf0b7	Dont dump fallback xref by default when using dumppdf.py, adding a flag to enable it Fixes #176 * Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's * Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref * Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know * Adding changelog * Extend help message	2020-05-23 18:04:34 +02:00
Pieter Marsman	33b60dfd54	Bump version	2020-05-17 17:50:01 +02:00
Pieter Marsman	91d89af788	Add section to documentation with howto for image extraction (#427 ) * Make structure of documentation more clear: tutorials, how-to, topics and reference * Add howto for images * Restructure tutorials section, and add install section * Always use up-to-date version * Fix indentation warning in docstring * Add option to dumppdf.py and pdf2txt.py to show version Fixes #162	2020-05-17 17:48:06 +02:00
Jake Stockwin	7254530d27	Fix ordering of textlines within a textbox when boxes_flow is disabled (#412) * Fix ordering of textlines within a textbox when boxes_flow is disabled * Add new test PDF sample	2020-05-09 15:37:49 +02:00
fabbox	7eff108fa5	add shebang line to script in tools (#408) * add shebang line to script in tools * fix: use shebang line with python 3 * Moved changelog to unreleased Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-04-28 10:58:42 +02:00
Pieter Marsman	d79bcb75ea	Bump version 20200402	2020-04-01 21:37:39 +02:00
Pieter Marsman	b8988b6848	Bump version	2020-04-01 21:22:59 +02:00
Jake Stockwin	68e2ae8632	Fix text coming in reverse order with boxes flow disabled (#399) Closes #398	2020-04-01 13:37:04 +02:00
Jake Stockwin	e55560f858	Fix #395: Update documentation for boxes_flow, allow None (#396) * Update documentation for boxes_flow, allow None * Apply comments from code review * Small wording changes, remove unnecessary comment * Update boxes_flow documentation for pdf2text * Pin version of tox to ensure python 3.4 support	2020-03-26 23:03:49 +01:00
Jake Stockwin	518b5d6efc	Fix #390: Updated misleading documentation about word_margin (#407) * Updated misleading documentation about word_margin * Small change in sentence about word_margin * Remove confusing sentence about adding spaces Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-03-26 23:02:48 +01:00
Jake Stockwin	1a4a06da9f	Fix #392 Split out IO logic from high level functions (#393) * Allow file-like inputs to high level functions (#392) * PR Review - move open_filename to utils	2020-03-26 22:52:00 +01:00
Jake Stockwin	1cc1b961c5	Also group center-aligned text lines in addition to left-aligned and right-aligned text lines (#382 ) (#384 ) * Group text lines if they are centered (#382) Closes #382 * Add comparison private methods to LTTextLines * Add missing docstrings * Add tests for find_neighbors * Update changelog * Cosmetic changes from code review	2020-03-23 22:38:39 +01:00
Pieter Marsman	9d7fe2d9ee	Catch ValueError when converting font encoding differences to characters (#389 ) * Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed * Add test for catching ValueError and KeyError when font encoding differences are invalid * Added line to CHANGELOG.md	2020-03-16 20:12:45 +01:00
Pieter Marsman	1d773dc38a	Fix grouping textlines when bounding box of parent container is wrong (#386 ) * Default value for --all-texts should be false, because using the flag enables it * Fix edge case: when no neighbors are found a line should form its own text box * Added test for grouping textlines where 1 is outside the parent bounding box * Added CHANGELOG.md line	2020-03-14 10:33:39 +01:00
Pieter Marsman	bab6d154c2	Bump version 20200124	2020-01-24 12:38:11 +01:00
Pieter Marsman	bc494ff03c	Bump version to 20200121	2020-01-21 21:13:52 +01:00
Pieter Marsman	410d7ecac3	Fix value for font-family in html by removing the subset tag from the PDF font-name (#357 ) * Fix font name by removing subset tag * Added line to CHANGELOG.md * Add documentation and clear variable name * Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`	2020-01-16 22:25:20 +01:00
Pieter Marsman	fff3ac2ba6	Fix bug in computing character bounding box (#348 ) * Remove scaling font height/width with size of font bounding box * Refactor LTChar bounding box computation * Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font. * Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects * Add line to CHANGELOG	2020-01-16 22:15:50 +01:00
Recursing	0b1741b9bf	Pack the /P (ermissions) entry from the /Encrypt dictionionary in the file trailer, as unsigned long (#352 ) Fixes #186 * Tread the permissions (the /P entry) as unsigned long, fix #186 * handle negative values for p * Extract function for resolving an twos-complement * Add test for issue #352 * Add line to CHANGELOG.md * Only ints can be converted to a uint using two's-complement method * Standardize import style; multiple imports from same module on one line Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>	2020-01-07 21:59:13 +01:00
Pieter Marsman	b27d3d0aff	Bump version	2020-01-04 18:15:15 +01:00
Pieter Marsman	3502dc9f3b	Drop support for legacy Python 2 (#346 ) * Drop support for legacy Python 2 * Add python_requires to help pip * Upgrade Python syntax with pyupgrade * Upgrade Python syntax with pyupgrade --py3-plus * Python 3 imports * Replace six * Update CONTRIBUTING.md * Added line to changelog Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>	2020-01-04 16:47:07 +01:00
Pieter Marsman	f3ab1bc61e	Enforce pep8 coding-style (#345 ) * Code Refractor: Use code-style enforcement #312 * Add flake8 to travis-ci * Remove python 2 3 comment on six library. 891 errors > 870 errors. * Remove class and functions comments that consist of just the name. 870 errors > 855 errors. * Fix flake8 errors in pdftypes.py. 855 errors > 833 errors. * Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before commiting * Cleanup pdfinterp.py and add documentation from PDF Reference * Cleanup pdfpage.py * Cleanup pdffont.py * Clean psparser.py * Cleanup high_level.py * Cleanup layout.py * Cleanup pdfparser.py * Cleanup pdfcolor.py * Cleanup rijndael.py * Cleanup converter.py * Rename klass to cls if it is the class variable, to be more consistent with standard practice * Cleanup cmap.py * Cleanup pdfdevice.py * flake8 ignore fontmetrics.py * Cleanup test_pdfminer_psparser.py * Fix flake8 in pdfdocument.py; 339 errors to go * Fix flake8 utils.py; 326 errors togo * pep8 correction for few files in /tools/ 328 > 160 to go (#342) * pep8 correction for few files in /tools/ 328 > 160 to go * pep8 correction: 160 > 5 to go * Fix ascii85.py errors * Fix error in getting index from target that does not exists * Remove commented print lines * Fix flake8 error in pdfinterp.py * Fix python2 specific error by removing argument from print statement * Ignore invalid python2 syntax * Update contributing.md * Added changelog * Remove unused import Co-authored-by: Fakabbir Amin <f4amin@gmail.com>	2019-12-29 21:20:20 +01:00
Pieter Marsman	803a7d9598	Release 20191110	2019-11-10 12:29:14 +01:00
Pieter Marsman	2bee7d8dcf	Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335) Fixes #334	2019-11-10 12:18:49 +01:00
Pieter Marsman	5c6fa8f986	Release 20191107	2019-11-07 21:52:44 +01:00
Pieter Marsman	bc034c8e59	Create sphinx documentation for Read the Docs (#329) Fixes #171 Fixes #199 Fixes #118 Fixes #178 Added: tests for building documentation and example code in documentation Added: docstrings for common used functions and classes Removed: old documentation	2019-11-07 21:12:34 +01:00
Igor Moura	40aa2533c9	Added: simple wrapper to extract text from pdf (#330) Fixes #327	2019-11-07 07:54:10 +01:00
Martin Hasoň	ed1b09c6f2	Fix debug logging for pdf2txt.py and dumppdf.py (#325) Fixes #313	2019-11-06 21:47:19 +01:00
Pieter Marsman	33b16b3f07	Deprecate the use of _py2_no_more_posargs (#328) Fixes #324	2019-11-02 10:29:39 +01:00
Jianfeng	44b223cf0a	Speedup grouping of textboxes (#315 ) Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration Changed: avoid potentially huge number of variable assignments in list comprehension. Changed: avoid repeatly evaluating `obj is obj` in list comprehension by storing id(obj).	2019-10-31 09:22:58 +01:00
Pieter Marsman	d88d6020a2	Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320) Fixes #314 Fixes #105	2019-10-26 19:16:37 +02:00
Pieter Marsman	a238a19999	Fix assertionerror when dumping pdf with reference to objid 0 (#318) Fixes #94 Added: test to get check if `PDFObjectNotFound` error is raised if objid 0 is requested.	2019-10-25 22:49:58 +02:00
Serj Sintsov	cb9cd8ea46	Use named logger instead of root logger (#236)	2019-10-22 20:52:43 +02:00
Pieter Marsman	373c6e7b97	Added: extraction of JBIG2 encoded images (#311) And added test for pdf with JBIG2 image. Fixes #26 Closes #46	2019-10-22 17:37:06 +02:00
Pieter Marsman	694aa508c3	Release 20191020	2019-10-20 14:21:48 +02:00
Pieter Marsman	adc4726e06	Add warning about dropping python2 support (#307) Fix #303	2019-10-20 13:59:29 +02:00
Pieter Marsman	9fd7172f7b	Cleanup utils.py	2019-10-17 12:14:02 +02:00
jet457	7e40fde320	Removing assertion in drange to allow equal inputs (#246 ) and mimic behaviour of built-in method range Fixes #66, since it now allows the bbox to have 0 width or 0 height Added tests for Plane since it is the API that uses drange	2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev	4df6d4e5ca	Changed: comparations for image colorspace literals (#132 ) Fixes #131 Changed: comparations for image colorspace literals Added: test for extracting images from pdfs	2019-10-15 16:11:54 +02:00
Pieter Marsman	63b2e09ac3	Merge pull request #203 from jbarlow83/negative-descent Interpret font Descent as a negative number even if specified as positive	2019-10-13 20:06:52 +02:00
Tony Tong	106a09c5bb	fix stoke color and non-stroke color in PDFGraphicState	2019-10-12 17:35:46 -04:00
Tata Ganesh	f218996fe9	Merge pull request #273 from igormp/develop Use resolve_all on PdfFont widths and bbox	2019-10-12 21:24:29 +05:30
Fakabbir Amin	7c03d96d25	Corrects Comment	2019-08-20 17:16:10 +05:30
Fakabbir Amin	abd685fdc6	Corrects Code Comment	2019-08-20 17:13:27 +05:30
Fakabbir Amin	3d549ea48c	Removes code comments	2019-08-20 16:48:40 +05:30
Igor Moura	cf4641d877	Merge branch 'develop' into develop	2019-08-15 08:11:28 -03:00
Fakabbir Amin	fe38695739	Merge branch 'develop' into pdfstream-as-cmap	2019-08-10 10:44:31 +05:30
Fakabbir Amin	5a0d8db052	Adds decoder for OnebyteIdentityH/V instead of using default CMap	2019-08-10 10:07:23 +05:30
Tata Ganesh	42e2c8143b	Merge pull request #263 from pietermarsman/261-glyph-list-specification name2unicode() should follow the Adobe Glyph List Specification	2019-07-26 22:13:34 +05:30
Igor Moura	2f4518231f	Use resolve_all on PdfFont widths and bbox Fixes #268	2019-07-24 15:10:13 -03:00
Igor Moura	540df9f676	Replaced .iteritems() and with six.iteritems() for Python 3 compat This is a squashed commit, the previous messages can be seen bellow This is the 1st commit message: Replaced .iteritems() usage for .items() Fixed some python 2 leftovers, as discussed in #267. Also formatted code according to Black.\nThis possibly breaks some python 2 compatibility This is the commit message #2: Reverted formatting and more spread six usage	2019-07-24 14:08:30 -03:00
Fakabbir Amin	f1a4dcea88	Adds Test Cases, Neater Code For CMap Assignment	2019-07-24 11:56:06 +05:30
Fakabbir Amin	fa400431f5	Adds Test, Removes Unnecessary Assumptions	2019-07-17 11:38:00 +05:30
Pieter Marsman	6f362f53fe	Raise a `KeyError` with a useful message if `unicode2name()` does not match any glyph name. Use this message to log debug statements.	2019-07-16 08:52:24 +02:00
Pieter Marsman	0fb83366b6	Remove intermediate variable `full_stop` because it is just a dot	2019-07-16 08:49:57 +02:00
Fakabbir Amin	cc40af3d2b	Removes @property, Adds docstring	2019-07-15 14:21:21 +05:30
Pieter Marsman	c597e95a9f	Use KeyError to signal that the name does not resemble any unicode, this pattern is also used in the rest of pdfminer.six	2019-07-14 15:37:15 +02:00
Pieter Marsman	33cc9861ae	Add docstring to Type1FontHeaderParser.get_encoding() that describes that the custom CharStrings of the font are mapped to ''	2019-07-14 15:19:17 +02:00
Pieter Marsman	f0392f8049	Change implementation of name2unicode such that it follows the Adobe Glyph specs (with allowing lowercase)	2019-07-14 15:16:42 +02:00
Fakabbir Amin	8e4a82ad8b	Corrects Indentation	2019-07-13 05:00:25 +05:30
Fakabbir Amin	c022358c8d	Encapsulates character map name	2019-07-13 04:52:24 +05:30
John Kesegich	8ab2e287be	Handle PDFStream as character map name in PDFCIDFont	2019-02-25 11:42:30 -06:00
ganeshtata	b6a5848208	FEAT: Release 20181108	2018-11-08 22:37:11 +05:30
Tata Ganesh	e03ecab856	Merge pull request #141 from timb07/speedup_layout Speed up layout of text boxes	2018-11-08 20:28:40 +05:30
James R. Barlow	2ede124142	Interpet font Descent as a negative number even if specified as positive The PDF RM specifies that Descent should be negative. Fonts that claim to have a positive Descent (not that it would make sense) always seem to be wrong about this claim.	2018-11-03 23:17:48 -07:00
Tata Ganesh	259b29299e	Merge pull request #133 from timb07/speedup Speed up handling of PDFs with large images	2018-07-15 11:27:35 +05:30
Martin Wolf	edaf2c9e3f	move unittest to main()	2018-06-26 00:51:51 +02:00
Martin Wolf	eff3f19886	Merge remote-tracking branch 'upstream/master'	2018-06-25 23:32:52 +02:00
Tata Ganesh	9c7bdcc716	Merge pull request #157 from h2ri/master decode cid: 160 and 173 to spaces	2018-06-25 11:19:27 +05:30
Charles Reid	7b08cdbff9	apply dos2unix to files in pdfminer/ and tools/ to remove \r\n windows line endings	2018-06-21 12:19:48 -07:00
Goulu	1db260609e	render_string must have 5 params in all PDFDevice classes (#158)	2018-06-21 10:21:26 +02:00
Guglielmetti Philippe	70624a64dd	render_string() now takes 3 parameters, not 5 (reverted from commit `95b65536af`)	2018-06-21 09:49:45 +02:00
Guglielmetti Philippe	95b65536af	render_string() now takes 3 parameters, not 5	2018-06-21 09:28:55 +02:00
Healthi	65eb0cef82	decode cid: 160 and 170 to spaces	2018-06-20 17:17:03 +05:30
Martin Wolf	26f80715ed	Merge remote-tracking branch 'upstream/master'	2018-06-20 13:27:18 +02:00
Tata Ganesh	67bc581bd3	Merge pull request #134 from timb07/issue_90 FIX: TypeError caused by bug in _parse_comment; #90 #89 #109	2018-06-14 09:27:34 +05:30
Tata Ganesh	7084d81bd1	Merge pull request #129 from clustree/xml-color FEAT: Send color to XML conversion	2018-06-10 21:02:34 +05:30
Martin Wolf	4bdb3ba8cc	Fixes needed to be able to compile pdfminer.six with Cython	2018-04-12 00:05:38 +02:00
Tim Bell	1cbeaebfce	Fix Python 2.6 incompatibility	2018-04-11 10:34:15 +10:00
Tim Bell	0c8cf748fe	Fix copy-paste error	2018-04-11 10:15:32 +10:00
Tim Bell	8f8a78bb88	Remove now-unused csort()	2018-04-11 09:37:32 +10:00
Tim Bell	2dda2b12b4	Speedup layout with .sort() and sortedcontainers.SortedListWithKey()	2018-04-11 09:03:32 +10:00
Gregory Mori	335c25c045	only check for bytes input to enc() in python3 In python2, isinstance("", bytes) is true, causing enc() to suppress any string input. This results in fontnames being lost when running pdf2txt.py in python2. As this check was not present in the original python2 version of pdfminer, restrict it to only check when running in python3.	2018-04-09 12:21:59 -07:00
Tim Bell	981e3a575e	Fix TypeError caused by bug in _parse_comment; #90 #89 #109	2018-04-03 12:47:40 +10:00
Tim Bell	083f11b165	Fix cases where a bytearray doesn't work in place of bytes	2018-04-03 07:27:29 +10:00
Tim Bell	185ddeb2ab	Speed up handling of PDFs with large images with more minimal change	2018-04-03 07:21:21 +10:00
Tim Bell	fab1c9462c	Speed up handling of PDFs with large images	2018-03-29 14:21:31 +11:00
Tata Ganesh	eddf861fbd	Merge pull request #125 from yosida95/bytes-type Fix type of an argument to PDFFont#decode to bytes in py3	2018-03-19 11:00:10 +05:30
Quentin Pradet	0911703eba	pdfcolor: Fix Python 2.6 compatibility	2018-03-06 14:53:11 +04:00
Quentin Pradet	94f3d61bb2	converter: Fix XML syntax	2018-03-06 14:41:52 +04:00
Quentin Pradet	2231f0892e	Send non-stroke color to XML conversion Inspired by https://github.com/euske/pdfminer/pull/158 from @andruo11 and https://github.com/euske/pdfminer/pull/197 from @staccatosound.	2018-03-06 14:11:48 +04:00
Quentin Pradet	b6c63bedc6	Make DeviceGray the default color as it should be	2018-03-06 11:24:07 +04:00
Quentin Pradet	0ce9a29f83	Fix colorspace determinism with OrderedDict	2018-03-06 11:23:32 +04:00
Kohei YOSHIDA	a636cbcfd4	fix type of an argument to PDFFont#decode to bytes in py3	2018-02-20 13:42:09 +09:00


512 Commits (13021c9875c8e425b575cc2e3ac4d8f406eb36e5)