pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Pieter Marsman	33b16b3f07	Deprecate the use of _py2_no_more_posargs (#328) Fixes #324	2019-11-02 10:29:39 +01:00
Jianfeng	44b223cf0a	Speedup grouping of textboxes (#315) Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration Changed: avoid potentially huge number of variable assignments in list comprehension. Changed: avoid repeatedly evaluating `obj is obj` in list comprehension by storing id(obj).	2019-10-31 09:22:58 +01:00
Pieter Marsman	d88d6020a2	Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320) Fixes #314 Fixes #105	2019-10-26 19:16:37 +02:00
Pieter Marsman	a238a19999	Fix assertionerror when dumping pdf with reference to objid 0 (#318) Fixes #94 Added: test to check if `PDFObjectNotFound` error is raised if objid 0 is requested.	2019-10-25 22:49:58 +02:00
Serj Sintsov	cb9cd8ea46	Use named logger instead of root logger (#236)	2019-10-22 20:52:43 +02:00
Pieter Marsman	373c6e7b97	Added: extraction of JBIG2 encoded images (#311) And added test for pdf with JBIG2 image. Fixes #26 Closes #46	2019-10-22 17:37:06 +02:00
Pieter Marsman	694aa508c3	Release 20191020	2019-10-20 14:21:48 +02:00
Pieter Marsman	adc4726e06	Add warning about dropping python2 support (#307) Fix #303	2019-10-20 13:59:29 +02:00
Pieter Marsman	9fd7172f7b	Cleanup utils.py	2019-10-17 12:14:02 +02:00
jet457	7e40fde320	Removing assertion in drange to allow equal inputs (#246) and mimic behaviour of built-in method range Fixes #66, since it now allows the bbox to have 0 width or 0 height Added tests for Plane since it is the API that uses drange	2019-10-17 12:04:25 +02:00
D.A.Bashkirtsev	4df6d4e5ca	Changed: comparisons for image colorspace literals (#132) Fixes #131 Changed: comparisons for image colorspace literals Added: test for extracting images from pdfs	2019-10-15 16:11:54 +02:00
Pieter Marsman	63b2e09ac3	Merge pull request #203 from jbarlow83/negative-descent Interpret font Descent as a negative number even if specified as positive	2019-10-13 20:06:52 +02:00
Tony Tong	106a09c5bb	fix stroke color and non-stroke color in PDFGraphicState	2019-10-12 17:35:46 -04:00
Tata Ganesh	f218996fe9	Merge pull request #273 from igormp/develop Use resolve_all on PdfFont widths and bbox	2019-10-12 21:24:29 +05:30
Fakabbir Amin	7c03d96d25	Corrects Comment	2019-08-20 17:16:10 +05:30
Fakabbir Amin	abd685fdc6	Corrects Code Comment	2019-08-20 17:13:27 +05:30
Fakabbir Amin	3d549ea48c	Removes code comments	2019-08-20 16:48:40 +05:30
Igor Moura	cf4641d877	Merge branch 'develop' into develop	2019-08-15 08:11:28 -03:00
Fakabbir Amin	fe38695739	Merge branch 'develop' into pdfstream-as-cmap	2019-08-10 10:44:31 +05:30
Fakabbir Amin	5a0d8db052	Adds decoder for OnebyteIdentityH/V instead of using default CMap	2019-08-10 10:07:23 +05:30
Tata Ganesh	42e2c8143b	Merge pull request #263 from pietermarsman/261-glyph-list-specification name2unicode() should follow the Adobe Glyph List Specification	2019-07-26 22:13:34 +05:30
Igor Moura	2f4518231f	Use resolve_all on PdfFont widths and bbox Fixes #268	2019-07-24 15:10:13 -03:00
Igor Moura	540df9f676	Replaced .iteritems() with six.iteritems() for Python 3 compat This is a squashed commit, the previous messages can be seen below This is the 1st commit message: Replaced .iteritems() usage for .items() Fixed some python 2 leftovers, as discussed in #267. Also formatted code according to Black. This possibly breaks some python 2 compatibility This is the commit message #2: Reverted formatting and more spread six usage	2019-07-24 14:08:30 -03:00
Fakabbir Amin	f1a4dcea88	Adds Test Cases, Neater Code For CMap Assignment	2019-07-24 11:56:06 +05:30
Fakabbir Amin	fa400431f5	Adds Test, Removes Unnecessary Assumptions	2019-07-17 11:38:00 +05:30
Pieter Marsman	6f362f53fe	Raise a `KeyError` with a useful message if `unicode2name()` does not match any glyph name. Use this message to log debug statements.	2019-07-16 08:52:24 +02:00
Pieter Marsman	0fb83366b6	Remove intermediate variable `full_stop` because it is just a dot	2019-07-16 08:49:57 +02:00
Fakabbir Amin	cc40af3d2b	Removes @property, Adds docstring	2019-07-15 14:21:21 +05:30
Pieter Marsman	c597e95a9f	Use KeyError to signal that the name does not resemble any unicode, this pattern is also used in the rest of pdfminer.six	2019-07-14 15:37:15 +02:00
Pieter Marsman	33cc9861ae	Add docstring to Type1FontHeaderParser.get_encoding() that describes that the custom CharStrings of the font are mapped to ''	2019-07-14 15:19:17 +02:00
Pieter Marsman	f0392f8049	Change implementation of name2unicode such that it follows the Adobe Glyph specs (with allowing lowercase)	2019-07-14 15:16:42 +02:00
Fakabbir Amin	8e4a82ad8b	Corrects Indentation	2019-07-13 05:00:25 +05:30
Fakabbir Amin	c022358c8d	Encapsulates character map name	2019-07-13 04:52:24 +05:30
John Kesegich	8ab2e287be	Handle PDFStream as character map name in PDFCIDFont	2019-02-25 11:42:30 -06:00
ganeshtata	b6a5848208	FEAT: Release 20181108	2018-11-08 22:37:11 +05:30
Tata Ganesh	e03ecab856	Merge pull request #141 from timb07/speedup_layout Speed up layout of text boxes	2018-11-08 20:28:40 +05:30
James R. Barlow	2ede124142	Interpret font Descent as a negative number even if specified as positive The PDF RM specifies that Descent should be negative. Fonts that claim to have a positive Descent (not that it would make sense) always seem to be wrong about this claim.	2018-11-03 23:17:48 -07:00
Tata Ganesh	259b29299e	Merge pull request #133 from timb07/speedup Speed up handling of PDFs with large images	2018-07-15 11:27:35 +05:30
Martin Wolf	edaf2c9e3f	move unittest to main()	2018-06-26 00:51:51 +02:00
Martin Wolf	eff3f19886	Merge remote-tracking branch 'upstream/master'	2018-06-25 23:32:52 +02:00
Tata Ganesh	9c7bdcc716	Merge pull request #157 from h2ri/master decode cid: 160 and 173 to spaces	2018-06-25 11:19:27 +05:30
Charles Reid	7b08cdbff9	apply dos2unix to files in pdfminer/ and tools/ to remove \r\n windows line endings	2018-06-21 12:19:48 -07:00
Goulu	1db260609e	render_string must have 5 params in all PDFDevice classes (#158)	2018-06-21 10:21:26 +02:00
Guglielmetti Philippe	70624a64dd	render_string() now takes 3 parameters, not 5 (reverted from commit `95b65536af`)	2018-06-21 09:49:45 +02:00
Guglielmetti Philippe	95b65536af	render_string() now takes 3 parameters, not 5	2018-06-21 09:28:55 +02:00
Healthi	65eb0cef82	decode cid: 160 and 170 to spaces	2018-06-20 17:17:03 +05:30
Martin Wolf	26f80715ed	Merge remote-tracking branch 'upstream/master'	2018-06-20 13:27:18 +02:00
Tata Ganesh	67bc581bd3	Merge pull request #134 from timb07/issue_90 FIX: TypeError caused by bug in _parse_comment; #90 #89 #109	2018-06-14 09:27:34 +05:30
Tata Ganesh	7084d81bd1	Merge pull request #129 from clustree/xml-color FEAT: Send color to XML conversion	2018-06-10 21:02:34 +05:30
Martin Wolf	4bdb3ba8cc	Fixes needed to be able to compile pdfminer.six with Cython	2018-04-12 00:05:38 +02:00
Tim Bell	1cbeaebfce	Fix Python 2.6 incompatibility	2018-04-11 10:34:15 +10:00
Tim Bell	0c8cf748fe	Fix copy-paste error	2018-04-11 10:15:32 +10:00
Tim Bell	8f8a78bb88	Remove now-unused csort()	2018-04-11 09:37:32 +10:00
Tim Bell	2dda2b12b4	Speedup layout with .sort() and sortedcontainers.SortedListWithKey()	2018-04-11 09:03:32 +10:00
Gregory Mori	335c25c045	only check for bytes input to enc() in python3 In python2, isinstance("", bytes) is true, causing enc() to suppress any string input. This results in fontnames being lost when running pdf2txt.py in python2. As this check was not present in the original python2 version of pdfminer, restrict it to only check when running in python3.	2018-04-09 12:21:59 -07:00
Tim Bell	981e3a575e	Fix TypeError caused by bug in _parse_comment; #90 #89 #109	2018-04-03 12:47:40 +10:00
Tim Bell	083f11b165	Fix cases where a bytearray doesn't work in place of bytes	2018-04-03 07:27:29 +10:00
Tim Bell	185ddeb2ab	Speed up handling of PDFs with large images with more minimal change	2018-04-03 07:21:21 +10:00
Tim Bell	fab1c9462c	Speed up handling of PDFs with large images	2018-03-29 14:21:31 +11:00
Tata Ganesh	eddf861fbd	Merge pull request #125 from yosida95/bytes-type Fix type of an argument to PDFFont#decode to bytes in py3	2018-03-19 11:00:10 +05:30
Quentin Pradet	0911703eba	pdfcolor: Fix Python 2.6 compatibility	2018-03-06 14:53:11 +04:00
Quentin Pradet	94f3d61bb2	converter: Fix XML syntax	2018-03-06 14:41:52 +04:00
Quentin Pradet	2231f0892e	Send non-stroke color to XML conversion Inspired by https://github.com/euske/pdfminer/pull/158 from @andruo11 and https://github.com/euske/pdfminer/pull/197 from @staccatosound.	2018-03-06 14:11:48 +04:00
Quentin Pradet	b6c63bedc6	Make DeviceGray the default color as it should be	2018-03-06 11:24:07 +04:00
Quentin Pradet	0ce9a29f83	Fix colorspace determinism with OrderedDict	2018-03-06 11:23:32 +04:00
Kohei YOSHIDA	a636cbcfd4	fix type of an argument to PDFFont#decode to bytes in py3	2018-02-20 13:42:09 +09:00
KOLANICH	3bf3c97bbb	Added a vector between 2 boxes which may be useful for users of the library	2018-02-16 14:49:12 +00:00
Tata Ganesh	3e6cc20cb2	Merge pull request #96 from sschuberth/patch-1 TrueTypeFont: Check for enough data to unpack	2018-01-31 18:26:54 +05:30
ganeshtata	1b88575e79	FIX: Null character replaced by blank -The presence of the character '\0' was causing an error with some PDFs. -It has been fixed by replacing all occurrences of '\0' with ''.	2017-11-08 12:50:50 +05:30
Sebastian Schuberth	fcd3e6ce00	Catch an error unpack might throw instead of checking the length before	2017-10-30 19:31:58 +01:00
Sebastian Schuberth	39428fb4f0	TrueTypeFont: Check for enough data to unpack Fixes https://github.com/euske/pdfminer/issues/96 and https://github.com/euske/pdfminer/issues/144.	2017-10-16 12:35:04 +02:00
SUZUKI Masaya	d4118cf5e8	Enabled PDFDevice in the with statement (#88)	2017-08-18 08:15:04 +02:00
Peter Bittner	e39800f14c	Move package description into package docstring (#87) Convert Windows/DOS line endings CR/LF to Unix LF (again!) Add Python 3.6 to classifiers, update project URL	2017-08-18 08:13:15 +02:00
Venelin Stoykov	171cdcc69d	Microoptimization for singlebyte fonts (#84) Instead of list comprehension which will call a function to get the integer value of the bytes directly convert it to bytearray which is more optimal structure for storing list of bytes.	2017-08-18 08:10:27 +02:00
Venelin Stoykov	14de393d5e	Cleanup psparser (#83) - Do not use bytesindex function. Use native slices instead - Fix import ordering	2017-08-18 08:10:06 +02:00
Venelin Stoykov	496bfd0778	Remove leftover from removing shebangs (#81)	2017-08-18 08:09:00 +02:00
Venelin Stoykov	c2432c32f1	Fix assert message for PDFLayoutAnalyzer.end_page (#80) stack is undefined	2017-08-18 08:08:08 +02:00
Philippe Guglielmetti	4c604828e8	v. 20170720	2017-07-20 21:35:49 +02:00
Philippe Guglielmetti	b010db6049	solves https://github.com/pdfminer/pdfminer.six/issues/65	2017-07-20 21:17:06 +02:00
Sergei Maertens	67bf5ab124	Compare byte with byte instead of int (#78)	2017-07-20 20:47:14 +02:00
Sergei Maertens	3e364354da	Fixes #64 -- be less strict when inspecting a tree type (#76) In the PDFStream it's possible that the /Type element is not present, but /type is. According to the spec, these are different elements, but in the case in point they had the same meaning. If PDFMiner is not running in STRICT mode and /Type doesn't resolve, a fallback to /type is used to determine the tree type.	2017-07-20 20:46:35 +02:00
Attila Szász	938419c476	Align dumppdf tool to modified data structures. (#73) * Align dumppdf tool to modified data structures. TOC page numbers should also work now, counting from 1. * Update version number.	2017-07-20 20:46:11 +02:00
Sergei Maertens	d79612c455	Resolve unresolved PDFObjectRefs (#70) Thank you!	2017-06-02 13:35:12 +02:00
Hugh Secker-Walker	488545ddc7	Add string expressions to asserts showing local data (#67)	2017-05-29 09:06:09 +02:00
Michał Pasternak	fe21725f07	Please replace pycrypto with pycryptodome (#63) * Enable 3.6 and replace pycrypto with cryptodome * Upgrade version number	2017-05-29 09:04:38 +02:00
Anton Oleynick	4bc0a0c105	Update pdftypes.py (#61) Fix errors with: File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 850, in process_page self.render_contents(page.resources, page.contents, ctm=ctm) File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 860, in render_contents self.init_resources(resources) File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 360, in init_resources self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec) File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 210, in get_font font = self.get_font(None, subspec) File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 201, in get_font font = PDFCIDFont(self, spec) File "/app/python/lib/python3.5/site-packages/pdfminer/pdffont.py", line 667, in __init__ BytesIO(self.fontfile.get_data())) File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 297, in get_data self.decode() File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 278, in decode if 'Predictor' in params: TypeError: argument of type 'NoneType' is not iterable	2017-05-29 08:55:02 +02:00
Philippe Guglielmetti	baddb25df6	v 20170419 (patches a stupid bug from yesterday...)	2017-04-19 14:24:13 +02:00
Philippe Guglielmetti	82af7f0aac	issue #56 reproduced, solution attempt unsuccessful	2017-04-19 14:19:14 +02:00
Philippe Guglielmetti	cd92883925	logging (stupid bug)	2017-04-19 13:48:45 +02:00
Philippe Guglielmetti	11a4c8b6c1	version 20170418	2017-04-18 19:13:20 +02:00
Philippe Guglielmetti	7055862eaf	solves https://github.com/pdfminer/pdfminer.six/issues/50	2017-04-18 18:20:31 +02:00
Sergei Maertens	f2b0650ad5	Fixes #54 -- don't pass bytestrings through ord() (#55)	2017-04-18 16:57:53 +02:00
Andrew Baumann	9439a3a31a	Miscellaneous bug fixes (#47) * utils.decode_text: fix "TypeError: ord() expected string of length 1, but int found" fixes https://github.com/goulu/pdfminer/issues/24 * pdfinterp.execute: don't assume that every keyword name can be decoded as utf-8 fixes "'str' does not support the buffer interface", https://github.com/goulu/pdfminer/issues/23 * default settings.STRICT to False, for compatibility with the original pdfminer * PDFCIDFont: handle font registry/orderings that may be PDFObjRefs * utils.nunpack: handle 8-byte integers	2017-02-06 14:57:01 +01:00
Philippe Guglielmetti	9b9d69aee9	image export works again with Py3 (issue #15) https://github.com/pdfminer/pdfminer.six/issues/15	2017-01-20 10:11:19 +01:00
Philippe Guglielmetti	f094f0b380	v. 20170119 RC	2017-01-19 08:42:20 +01:00
Philippe Guglielmetti	52feb22eeb	Merge remote-tracking branch 'origin/master' Conflicts: MANIFEST.in README.md pdfminer/latin_enc.py pdfminer/pdfdocument.py pdfminer/pdfinterp.py pdfminer/pdfpage.py pdfminer/pdftypes.py pdfminer/psparser.py pdfminer/utils.py samples/Makefile setup.py	2017-01-19 08:03:16 +01:00
Jin-tae Hwang	61d423d81c	bugfix: if fontname is bytes then skip (#43)	2016-12-14 17:34:16 +01:00
Gabriel Augendre	6cc4abbaa8	Fix import of Django settings (#41) Settings in Django are imported as such, see https://docs.djangoproject.com/en/1.10/topics/settings/#using-settings-in-python-code	2016-11-26 20:26:23 +01:00
Humberto Pereira	e6ad15af79	Added painting information (#37) * added color support to stroking and non stroking color spaces * extended LTCurve, LTLine and LTRect to save painting information * modified PDFLayoutAnalyzer to populate the shapes with painting information	2016-11-08 20:01:58 +01:00
Antonio Ercole De Luca	0fdebc6739	Removing all the "#!/usr/bin/env python" lines, they do not need for … (#34) * Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19. * Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env Removed also the shebang from pdfminer/psparser.py file.	2016-11-08 20:01:11 +01:00
Yusuke Shinyama	8150458718	Added: a simpler ordering mode when 1<F.	2016-09-26 18:06:34 +09:00
Friedrich Lindenberg	447adcf02f	fix STRICT reference	2016-09-24 12:03:22 +02:00
Friedrich Lindenberg	70918095cc	Return an empty list when no `Differences` are found.	2016-09-24 11:57:11 +02:00
Friedrich Lindenberg	865246bd0c	fix print, upstream: `0112112458`	2016-09-23 15:04:07 +02:00
Friedrich Lindenberg	0cb13983f7	Backport LICENSE.	2016-09-23 14:57:28 +02:00
Friedrich Lindenberg	1820f96481	backport changes for upstream: #145, #95, #111, #117, #129, #132.	2016-09-23 14:31:31 +02:00
Jakub Wilk	5ddbecb551	Fix typos	2016-09-13 16:25:09 +02:00
Yusuke Shinyama	3068dcdb4a	Merge pull request #145 from vinayak-mehta/glyphlist_link Replace old Adobe glyphlist link	2016-09-12 00:18:24 +09:00
Yusuke Shinyama	c753dbac4c	Merge pull request #117 from native-api/png_pred_errors make ValueError's descriptive	2016-09-11 23:55:34 +09:00
Yusuke Shinyama	f1dd9ea6d2	Merge pull request #129 from lucanaso/lucanaso-patch-1 Fixed for rendering non breaking spaces (cid:160)	2016-09-11 23:53:03 +09:00
Yusuke Shinyama	177a4ab937	Fixed: #132 (PDFStream.get_filters: support multiple parameterless filters)	2016-09-11 23:52:13 +09:00
Yusuke Shinyama	e95a483790	Merge pull request #134 from speedplane/feature/Fix-Get-Filters Fix Bug with PDF Stream Decoder	2016-09-11 23:48:42 +09:00
Yusuke Shinyama	64fe538b24	Fixed: #114 (UnicodeEncodeError in PSLiteral)	2016-09-11 23:43:22 +09:00
Vinayak Mehta	2926002017	Replace old Adobe glyphlist link	2016-09-08 16:34:53 +05:30
Philippe Guglielmetti	881ea17553	v 20160614	2016-06-14 19:02:07 +02:00
speedplane	2049462f6f	Revert changes unrelated to this branch.	2016-06-13 23:42:21 -04:00
speedplane	b0b8818a41	Fix a bug with pdfminer which occurs when two or more filters are applied to a stream, even though no parameters are specified. The code would previously drop all of the streams after the first due to misapplication of the zip function.	2016-06-13 23:35:11 -04:00
Friedrich Lindenberg	1d54ecd31c	Make the logger run in a namespace.	2016-05-20 21:12:05 +02:00
Philippe Guglielmetti	21fd2bbd23	v 20160202 with Py 2.6 & Py 3.5 support	2016-02-02 15:38:51 +01:00
Goulu	5a23fad6fd	Merge pull request #14 from orangain/close-device Close device to write footer of xml/html files	2016-01-18 11:22:35 +01:00
Goulu	2103e5875e	Merge pull request #13 from orangain/include-cmap Include compiled cmap resources to simplify installation for CJK languages	2016-01-18 11:22:08 +01:00
Steve Hair	92c71436b9	Improved settings management	2016-01-10 12:17:38 -05:00
orangain	f8a051adbd	Close device to write footer of xml/html files	2015-12-27 20:57:00 +09:00
orangain	f1d5d681b6	Include compiled cmap resources to simplify installation for CJK languages * Run `make cmap` and `git add pdfminer/cmap`. * Modify MANIFEST.in not to include cmaprsrc dir in the sdist package. * Add pdfminer/cmap/README.txt to include license in the sdist package. * Remove installation guide specific to CJK languages from README.md.	2015-12-27 13:32:29 +09:00
lucanaso	63bb3caec2	Fixed for rendering non breaking spaces (cid:160) As stated in the PDF specification ISO 32000-1, table in Annex D.2 Latin Character Set and Encodings page 653 to 656 (available here: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf): "The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE." The duplicate key was missing, therefore PDFMiner was returning the string "(cid:160)". This fix adds the duplicate key in latin_enc.py glyphlist.py does not need to be modified as it already contains a key for non breaking space https://github.com/lucanaso/pdfminer/blob/master/pdfminer/glyphlist.py#L2755.	2015-12-09 16:47:32 +01:00
Chris Hager	8149be1669	bugfixes	2015-12-06 00:17:58 +01:00
Chris Hager	2e1be5721f	removed settings.ENFORCE_CHECK_EXTRACTABLE	2015-11-01 22:34:18 +01:00
Chris Hager	b686dd0139	pdfminer/settings.py for STRICT and added ENFORCE_CHECK_EXTRACTABLE	2015-11-01 22:28:08 +01:00
Ivan Pozdeev	63c9378b8b	make ValueError's descriptive	2015-08-10 03:14:51 +03:00
Alex Zagorodniuk	131cb1ea92	change STRICT to be a settings attribute	2015-06-22 19:08:35 -04:00
Goulu	623bd98452	Update __init__.py version 20150601	2015-06-01 10:21:51 +02:00
Cathal Garvey	403711ed13	Whoops, forgot to version-gate chardet in the actual code. Thanks Travis!	2015-05-30 19:33:35 +01:00
Cathal Garvey	a2ad7a6d03	Fixed some bugs preventing all tests from passing in Py2.	2015-05-30 18:02:29 +01:00
Cathal Garvey	79c97ac221	Docstrings.	2015-05-30 17:16:06 +01:00
Cathal Garvey	3b7edba48c	Forgot to add the actual compartmentalised function..	2015-05-30 17:04:28 +01:00
Cathal Garvey	08cb217983	Progress, progress.. not nearly atomic enough, sorry.	2015-05-30 16:14:24 +01:00
Cathal Garvey	1b47bed306	Many changes to make pdf2txt.py work better in Py3, some in that script, others in module! Sorry, changes should have been more atomic. In pdf2txt.py: * Re-wrote main function to use argparse instead of optparse. * Manually tested in Py2/Py3 to get partial consistency. * Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway. * Py2 mode probably unchanged, cannot find any bugs yet... * Kept old main function for posterity, for now. In utils: * Added a few compatibility functions (some string hax required chardet, new dependency): - make_compat_bytes(in_str)-> (py3->bytes | py2->str) - make_compat_str(in_str)-> (str) - compatible_encode_method(bytesorstring, encoding, erraction)-> (str) In pdfdevice: * To handle different output filetypes in Py3, injected lots of calls to new utils methods, as well as some six.PYX checks and logic. These changes are largely responsible for enhanced Py2/Py3 consistency. In converter: * To handle output filetypes in Py2, injected a few checks and fixes particularly around the py2 `str.encode` method and its assumed usual use-analogies in Py3.	2015-05-17 21:08:57 +01:00
Yusuke Shinyama	14fd0fd2d6	Fixed: #84 (fontname was in unicode)	2015-04-05 19:02:02 +09:00
speedplane	806ee603ff	More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear. This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way. For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.	2014-12-12 00:36:59 -05:00
speedplane	45170e7183	There are a number of relatively complex changes here. Comments are in order of where the change appears. 1. When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart. However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words. 2. Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical. 3. Don't detect a vertical line if the previous letter is whitespace. Prevents double spaces being caught as vert lines. 4. Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.	2014-12-12 00:36:59 -05:00
Yusuke Shinyama	0112112458	Fixed: crash on invalid chr number.	2014-12-09 22:55:47 +09:00
enkore	d0379a2c44	Fix utils.decode_text	2014-12-04 17:09:52 +01:00
speedplane	36977fbe08	Add debug flags for much of the debug output.	2014-11-11 23:36:58 -05:00
speedplane	ecc4d05675	Fix a unicode conversion bug. See https://github.com/euske/pdfminer/issues/75	2014-11-11 23:34:33 -05:00
cybjit	515687e1bb	more xrange to range	2014-09-16 23:17:31 +02:00
cybjit	9b2e29396b	apply_png_predictor py3	2014-09-16 22:59:29 +02:00
cybjit	ad05121c69	password py3	2014-09-16 22:59:00 +02:00
cybjit	14585987c3	keep password api unicode, latin1 or utf-8 is encoded in handler	2014-09-16 22:58:25 +02:00
cybjit	2260f77b19	fix dict_value usage in strict mode	2014-09-16 22:57:29 +02:00
cybjit	51a361c145	clean up HTMLConverter and XMLConverter encoding	2014-09-16 22:57:00 +02:00

528 Commits (ebf7bcdb983f36d0ff5b40e4f23b52525cb28f18)