pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Cathal Garvey	1b47bed306	Many changes to make pdf2txt.py work better in Py3, some in that script, others in module! Sorry, changes should have been more atomic. In pdf2txt.py: * Re-wrote main function to use argparse instead of optparse. * Manually tested in Py2/Py3 to get partial consistency. * Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway. * Py2 mode probably unchanged, cannot find any bugs yet... * Kept old main function for posterity, for now. In utils: * Added a few compatibility functions (some string hax required chardet, new dependency): - make_compat_bytes(in_str)-> (py3->bytes | py2->str) - make_compat_str(in_str)-> (str) - compatible_encode_method(bytesorstring, encoding, erraction)-> (str) In pdfdevice: * To handle different output filetypes in Py3, injected lots of calls to new utils methods, as well as some six.PYX checks and logic. These changes are largely responsible for enhanced Py2/Py3 consistency. In converter: * To handle output filetypes in Py2, injected a few checks and fixes particularly around the py2 `str.encode` method and its assumed usual use-analogies in Py3.	2015-05-17 21:08:57 +01:00
Yusuke Shinyama	14fd0fd2d6	Fixed: #84 (fontname was in unicode)	2015-04-05 19:02:02 +09:00
speedplane	806ee603ff	More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear. This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way. For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.	2014-12-12 00:36:59 -05:00
speedplane	45170e7183	There are a number of relatively complex changes here. Comments are in order of where the change appears. 1. When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart. However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words. 2. Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical. 3. Don't detect a vertical line if the previous letter is whitespace. Prevents double spaces being caught as vert lines. 4. Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.	2014-12-12 00:36:59 -05:00
Yusuke Shinyama	0112112458	Fixed: crash on invalid chr number.	2014-12-09 22:55:47 +09:00
enkore	d0379a2c44	Fix utils.decode_text	2014-12-04 17:09:52 +01:00
speedplane	36977fbe08	Add debug flags for much of the debug output.	2014-11-11 23:36:58 -05:00
speedplane	ecc4d05675	Fix a unicode conversion bug. See https://github.com/euske/pdfminer/issues/75	2014-11-11 23:34:33 -05:00
cybjit	515687e1bb	more xrange to range	2014-09-16 23:17:31 +02:00
cybjit	9b2e29396b	apply_png_predictor py3	2014-09-16 22:59:29 +02:00
cybjit	ad05121c69	password py3	2014-09-16 22:59:00 +02:00
cybjit	14585987c3	keep password api unicode, latin1 or utf-8 is encoded in handler	2014-09-16 22:58:25 +02:00
cybjit	2260f77b19	fix dict_value usage in strict mode	2014-09-16 22:57:29 +02:00
cybjit	51a361c145	clean up HTMLConverter and XMLConverter encoding	2014-09-16 22:57:00 +02:00
Goulu	8861d7e0ed	version 20140915 pushed to PyPi as pdfminer_six	2014-09-15 10:33:04 +02:00
cybjit	39942b6642	avoid string formatting when not logging	2014-09-12 00:29:31 +02:00
cybjit	01821c7d1e	rename bytes to avoid built-in collision	2014-09-12 00:29:31 +02:00
cybjit	31e6afc7cf	faster and simpler bytes implementation	2014-09-12 00:29:30 +02:00
cybjit	cba5a42ba8	decipher_all bytes	2014-09-12 00:29:30 +02:00
cybjit	6357e2da80	code2cid uses int, not byte	2014-09-12 00:29:27 +02:00
cybjit	9b0a3ee53e	decode cmap font name	2014-09-11 23:30:02 +02:00
cybjit	a6f31a713d	cmap bytes and decode	2014-09-07 18:41:04 +02:00
cybjit	cc733c8217	fixes for ARC4	2014-09-07 18:38:22 +02:00
cybjit	f9a67db89b	change xrange to range	2014-09-07 18:36:12 +02:00
cybjit	0a2d90c051	pdf2txt: do not double encode stdout	2014-09-07 18:34:11 +02:00
unknown	58b8492783	no logging in travis.ci	2014-09-04 10:19:50 +02:00
unknown	1c93468c7e	faster, less verbose tests	2014-09-04 10:02:29 +02:00
unknown	4ab48d1803	Python 3.4 compatibility + tests	2014-09-04 09:36:19 +02:00
unknown	29c07ea770	Python 3.4 support and tests	2014-09-03 15:26:08 +02:00
unknown	a6475b61b4	Python 3.4 support added and tested	2014-09-03 13:17:41 +02:00
unknown	846cd18186	Python 3.4 support	2014-09-02 15:49:46 +02:00
unknown	faea7291a8	tests pass under Py 2.7 and 3.4	2014-09-01 14:16:49 +02:00
Yusuke Shinyama	b0e035c24f	Style fix: always have an explicit return.	2014-07-15 21:38:29 +09:00
Yusuke Shinyama	f5b5e31921	Fixed: DecodeParms array support.	2014-07-09 19:07:27 +09:00
Yusuke Shinyama	137fc3a1ae	Use KWD instead of token.name.	2014-06-30 19:15:21 +09:00
Yusuke Shinyama	1ccfaff411	String-Bytes distinction (first attempt).	2014-06-30 19:05:56 +09:00
Yusuke Shinyama	8791355e1d	Cleanup imports. Use relative imports.	2014-06-26 18:12:39 +09:00
Yusuke Shinyama	2e900e5d10	Fixed for consistent test results. (hopefully...)	2014-06-26 17:41:31 +09:00
Yusuke Shinyama	fe86b4e64e	Changed: StringIO -> io.BytesIO	2014-06-25 19:55:41 +09:00
Yusuke Shinyama	44074b42ea	Added: stripcontrol for XMLConverter (-S option)	2014-06-22 00:33:00 +09:00
Yusuke Shinyama	81391c09f4	Fixed: #56 (with a derpy fix)	2014-06-18 19:11:45 +09:00
Yusuke Shinyama	bb866ae148	Changed: new except syntax (2.6 or above).	2014-06-16 18:50:07 +09:00
Yusuke Shinyama	28e96ba3d0	Use print as a function.	2014-06-15 12:14:33 +09:00
Yusuke Shinyama	0387a6c260	Removed: tuple-unpacking args.	2014-06-15 12:12:13 +09:00
Yusuke Shinyama	a8ec99a848	More autotest tweaks.	2014-06-15 10:52:59 +09:00
Yusuke Shinyama	1384a3fe8d	Code cleanup: removed some debug flags.	2014-06-14 15:43:10 +09:00
Yusuke Shinyama	d9680fca7e	Plane: preserve the object order so that the test result is always consistent.	2014-06-14 14:44:53 +09:00
Yusuke Shinyama	aed248610c	Fixed: dependency on pygame in a unittest.	2014-06-14 12:05:26 +09:00
Yusuke Shinyama	8e14ebf4e1	Use logging module instead of print.	2014-06-14 12:00:49 +09:00
Yusuke Shinyama	8e8e22c095	Fixed a layout bug introduced at `c97ec304`.	2014-06-13 23:05:04 +09:00
numion	a4997d6f10	Implement revision 4 and 5 encryption handler.	2014-05-19 16:27:43 +02:00
Michael R. Hines	ae2547b0f2	Stop throwing exception on LITERALS_DCT_DECODE I have PDF documents with images stream and two filters, don't throw exceptions on the second one (DCT).	2014-05-14 13:25:30 +08:00
Yusuke Shinyama	6b6fc264ff	Code refactoring: CMap and UnicodeMap both inherit CMapBase.	2014-04-16 18:57:16 +09:00
Yusuke Shinyama	b09c37902f	Fixed: issue #48 (thanks to speedplane)	2014-04-09 17:55:50 +09:00
Yusuke Shinyama	7b354c7ab3	Version 20140328	2014-03-28 22:49:18 +09:00
Yusuke Shinyama	340387bfc6	Cleanup: isinstance	2014-03-28 17:50:59 +09:00
Yusuke Shinyama	7849c8724a	Fixed: PDFXRefStream.get_objids returns invalid objids.	2014-03-28 17:29:26 +09:00
Yusuke Shinyama	57adad55d7	Revert the wrong fix.	2014-03-28 17:24:03 +09:00
Yusuke Shinyama	b18e8c549d	Version 20140327	2014-03-28 00:19:52 +09:00
Yusuke Shinyama	ee47a6603a	Fixed: issues #45	2014-03-28 00:18:17 +09:00
Yusuke Shinyama	ab03037444	Version 20140324	2014-03-24 21:03:46 +09:00
Yusuke Shinyama	4b2beba398	Code cleanup.	2014-03-24 20:59:24 +09:00
Yusuke Shinyama	f9079e4c0a	Fixed dumppdf.py issues.	2014-03-24 20:55:00 +09:00
Yusuke Shinyama	607be269ab	Applied a patch by Axel Kaiser.	2014-03-24 20:45:35 +09:00
Yusuke Shinyama	d7c4ff28e9	Applied a patch by Axel Kaiser.	2014-03-24 20:39:30 +09:00
Yusuke Shinyama	636d4caeb3	Fixed the PNG predictor bug. Thanks to Gabor Molnar.	2014-03-24 19:57:05 +09:00
Yusuke Shinyama	c97ec3048e	Changed / to // for clarity.	2013-11-26 21:35:16 +09:00
Yusuke Shinyama	b589da51b7	Fix for malformed PDFs.	2013-11-26 21:27:45 +09:00
Yusuke Shinyama	cf1e3c9973	Version bump!	2013-11-13 14:52:01 +09:00
Yusuke Shinyama	acad011e3f	Code cleanup.	2013-11-11 20:46:30 +09:00
Yusuke Shinyama	cbef967fbf	Renamed: LTAnon -> LTAnno	2013-11-11 19:17:45 +09:00
Yusuke Shinyama	c8b6d4112a	Fixed: crash with negative layout bbox.	2013-11-09 15:10:14 +09:00
Yusuke Shinyama	2b56b2eedf	Merged.	2013-11-07 19:50:41 +09:00
Matthew Duggan	2caa5edc25	PEP8: Whitespace changes to match pep8	2013-11-07 17:35:04 +09:00
Matthew Duggan	c1da8b835c	PEP8: Remove trailing whitespace	2013-11-07 16:14:53 +09:00
Matthew Duggan	024b821056	Make pyflakes happy by defining variable	2013-11-07 16:10:14 +09:00
Matthew Duggan	10a68c83bd	Remove unused imports identified by pyflakes	2013-11-07 16:09:44 +09:00
Yusuke Shinyama	4ef81ae9d8	Improved word spacing.	2013-11-05 18:25:19 +09:00
Yusuke Shinyama	02ad086f6a	fixed: HTMLConverter.	2013-10-25 18:10:40 +09:00
Yusuke Shinyama	87842233b3	Version bump!	2013-10-22 22:19:38 +09:00
Yusuke Shinyama	d3730a29ec	API change: process_pdf -> PDFPage.get_pages	2013-10-22 18:59:16 +09:00
Yusuke Shinyama	e927bd307e	fixed: https://github.com/euske/pdfminer/issues/8	2013-10-22 18:24:39 +09:00
Yusuke Shinyama	2aa757978b	Reverted to Python2.x syntax. Fixed LZW decoding.	2013-10-19 08:19:40 +09:00
Yusuke Shinyama	bfd9e93c12	Merge branch 'master' of https://github.com/JordanReiter/pdfminer into JordanReiter-master	2013-10-19 07:46:45 +09:00
Yusuke Shinyama	8e4c0c88e3	fixed: https://github.com/euske/pdfminer/issues/26	2013-10-17 23:20:08 +09:00
Yusuke Shinyama	0ea08890d4	renamed: python2 -> python.	2013-10-17 23:05:27 +09:00
Yusuke Shinyama	8d42eec94d	in_cmap is on by default.	2013-10-17 21:40:43 +09:00
Yusuke Shinyama	de9f9715e3	Added: Adobe-UCS	2013-10-17 21:35:25 +09:00
Yusuke Shinyama	1455f134c6	Fixed: missing ObjStm due to invalid seek.	2013-10-10 20:10:57 +09:00
Yusuke Shinyama	f85c374cae	Separated PDFPage to pdfpage.py.	2013-10-10 19:54:55 +09:00
Yusuke Shinyama	2df67d85ae	Expand ObjStm in XRefFallback.	2013-10-10 19:40:43 +09:00
Yusuke Shinyama	e4bc4e43b1	Code cleanup.	2013-10-10 19:17:58 +09:00
Yusuke Shinyama	cfd60eafbf	Removed PDFDocument.read_xref().	2013-10-10 18:57:08 +09:00
Yusuke Shinyama	658be970b8	Separated PDFXRefFallback.	2013-10-10 18:44:12 +09:00
Yusuke Shinyama	c926874d20	API Change: the PDFDocument cstr now takes PDFParser. set_parser() is removed.	2013-10-10 18:40:06 +09:00
Yusuke Shinyama	557c2c72e6	Removed ObjIdRange for terseness.	2013-10-10 18:34:43 +09:00
Yusuke Shinyama	2221163b94	Split pdfparser.py and pdfdocument.py.	2013-10-10 18:29:30 +09:00
Yusuke Shinyama	1467fc674c	Added fallback for broken PDFs.	2013-10-09 22:45:54 +09:00
Yusuke Shinyama	eabe72ee63	Prevent crash with empty layout box.	2013-10-09 22:13:22 +09:00
Yusuke Shinyama	87143cb36f	Fallback when /Pages does not exist.	2013-10-09 22:08:16 +09:00
Yusuke Shinyama	06425bba00	Introducing PDFObjectNotFound	2013-10-09 21:39:23 +09:00
Yusuke Shinyama	3c3cba2ecc	Moved: import PIL.	2013-04-09 18:42:32 +09:00
Yusuke Shinyama	19e7d70ac1	Merge pull request #15 from jcushman/patch-1 2x faster layout analysis: Use set instead of list for Plane's internal collection of objects.	2013-04-09 02:39:46 -07:00
Yusuke Shinyama	4faccff9c9	Merge pull request #16 from jcushman/master 2x faster group_textboxes function.	2013-04-09 01:58:56 -07:00
Yusuke Shinyama	d8bc13b3af	Merge pull request #13 from gendoc/master PDFDocument.lookup_name.lookup isn't searching for 'Names' key.	2013-04-09 01:55:54 -07:00
Jordan Reiter	e28b75a462	StringIO	2013-03-27 13:14:58 -04:00
Jordan Reiter	44653071c3	Fixes for LZW error (see https://bitbucket.org/hsoft/pdfminer3k/commits/ae9a4ca0691a/)	2013-03-27 13:05:29 -04:00
jcushman	f77f196cd3	2x faster group_textboxes function.	2012-06-22 18:11:45 -03:00
jcushman	da3f023b2d	Use set instead of list for Plane's internal collection of objects.	2012-06-22 16:36:33 -03:00
Humberto Pereira	89c81db295	PDFDocument.lookup_names.lookup didn't find 'Names' in some files	2012-03-19 16:42:58 -03:00
Jim Morrison	6413eb7de4	Deal with CMYK images by converting them to RGB. PIL does not invert CMYK images as of PIL 1.1.7, so the invert happens in ImageWriter.	2012-01-24 16:18:36 -08:00
Yusuke Shinyama	c7709045e9	fixed: invalid bmp file output	2011-11-08 00:29:24 +10:00
Yusuke Shinyama	82ff98c7b3	imagewriter now works with text output	2011-11-07 01:15:10 +10:00
Yusuke Shinyama	91174b5665	avoid crash when colorspace is null.	2011-11-06 20:10:48 +10:00
Yusuke Shinyama	3d1652963a	Merge github.com:euske/pdfminer	2011-10-30 15:44:49 +10:00
dwilson	60dbf6bb69	avoids crash in pdf syntax error for missing ids when an object id is out of range, rather than crashing, only raise a pdf syntax error if STRICT is enabled and return None otherwise	2011-08-31 17:03:10 -04:00
Yusuke Shinyama	f638784e1d	experimental layout analysis improvements	2011-08-14 09:44:21 +09:00
Yusuke Shinyama	cbb8d869c7	removed initial cmap/ directory	2011-07-31 18:05:07 +10:00
Yusuke Shinyama	cdef0d7883	Merge github.com:euske/pdfminer	2011-07-31 17:47:20 +10:00
Yusuke Shinyama	46bb0107aa	fixed: crash due to small layout elements (thanks to hsoft)	2011-07-31 17:44:09 +10:00
Yusuke Shinyama	eec317ae10	Merge pull request #6 from rsennrich/master cleaner widths for Adobe core 14 fonts. (thanks to rsennrich)	2011-07-31 00:39:36 -07:00
Yusuke Shinyama	24cd161fb7	CCITTFaxFilter.reversed fix	2011-07-31 17:36:02 +10:00
Rico	6e4f36d9a1	get width based on utf-8 char. fills some gaps and fixes inconsistencies between standard encodings	2011-07-23 16:34:11 +02:00
Yusuke Shinyama	dc8fde0e47	added CCITTFaxFilter support and a very crude image extraction.	2011-07-18 21:07:00 +10:00
Yusuke Shinyama	2707ba75df	added CCITTFaxFilter support and a very crude image extraction.	2011-07-18 21:06:50 +10:00
Yusuke Shinyama	fda6f7ba5d	ccitt.py added.	2011-07-18 17:36:37 +10:00
Yusuke Shinyama	0278076ea8	PNG predictor added	2011-06-07 00:46:33 +09:00
Yusuke Shinyama	18a5058af6	separated predictor functions.	2011-06-07 00:31:03 +09:00
Yusuke Shinyama	170c97a12b	colorspace patch by Lieb Simon	2011-06-06 17:10:12 +09:00
Yusuke Shinyama	2e8180ddee	documentation update and version bump	2011-05-15 01:37:14 +09:00
Yusuke Shinyama	c134596e2f	code cleanup and testcase stabilization	2011-05-15 01:22:19 +09:00
Yusuke Shinyama	e5d02f8653	fixed the infinite recursion bug.	2011-05-14 16:32:09 +09:00
Yusuke Shinyama	0c41b8348e	code cleanup	2011-05-14 15:51:40 +09:00
Yusuke Shinyama	038ce4cd0c	added LTText.get_text() and .text property is no longer accessible.	2011-05-14 15:45:08 +09:00
Yusuke Shinyama	5004e4b28d	layout analysis speedup.	2011-05-14 14:17:39 +09:00
Yusuke Shinyama	095534b294	figure object now does not call analyze.	2011-05-14 14:17:22 +09:00
Yusuke Shinyama	b8d516fc52	extended Plane class.	2011-05-14 14:16:40 +09:00
Yusuke Shinyama	fcf0d74ecc	tweaks for debugging	2011-04-21 22:07:52 +09:00
Yusuke Shinyama	8f9684f6a6	code cleanup: layout analysis	2011-04-21 22:07:04 +09:00
Yusuke Shinyama	0e660dd385	rename: LTPolygon -> LTCurve	2011-04-20 22:05:25 +09:00
Yusuke Shinyama	dab70855bf	LTLine is now strictly horizontal or vertical.	2011-04-20 22:01:54 +09:00
Jonathan J Hunt	ec682539da	Optimized memory usage in TextConverter by ignoring all drawing commands.	2011-03-07 15:11:31 +10:00
Yusuke Shinyama	4918d59bc2	disable caching support	2011-03-03 00:04:43 +09:00
Yusuke Shinyama	18e782f330	canonicalize package names	2011-03-02 23:43:03 +09:00
Yusuke Shinyama	bb26cf9180	eliminate empty textboxes	2011-03-01 20:47:20 +09:00
Yusuke Shinyama	dfd621b98c	minor bugfix. thanks to Hiroshi Manabe.	2011-02-28 19:50:07 +09:00
Yusuke Shinyama	f22b056454	release-20110227	2011-02-27 19:53:12 +09:00
Yusuke Shinyama	a8bf9b159e	docstring fix	2011-02-27 13:09:12 +09:00
Yusuke Shinyama	cabaa10e4f	layout analysis improvement	2011-02-27 12:56:28 +09:00
Yusuke Shinyama	7dbb664db3	code cleanup and more debugging options	2011-02-14 23:42:05 +09:00

392 Commits (90d61f2a3a04a2f783492c50599838c7ddf3fce3)