pdfminer.six

Commit Graph

Author	SHA1	Message	Date
Cathal Garvey	e2d3adc8c1	Adding chardet to Travis	2015-05-30 19:35:05 +01:00
Cathal Garvey	403711ed13	Whoops, forgot to version-gate chardet in the actual code. Thanks Travis!	2015-05-30 19:33:35 +01:00
Cathal Garvey	a2ad7a6d03	Fixed some bugs preventing all tests from passing in Py2.	2015-05-30 18:02:29 +01:00
Cathal Garvey	79c97ac221	Docstrings.	2015-05-30 17:16:06 +01:00
Cathal Garvey	268e9fb2bd	Removed typechecking, nothing's exploded yet and argparse does lots of heavy lifting already.	2015-05-30 17:05:28 +01:00
Cathal Garvey	3b7edba48c	Forgot to add the actual compartmentalised function..	2015-05-30 17:04:28 +01:00
Cathal Garvey	b3553cef10	Cleaning up pdf2txt.py after the partition/move.	2015-05-30 17:03:55 +01:00
Cathal Garvey	cbe270a4bf	Killed the old main function for pdf2txt.py	2015-05-30 16:37:22 +01:00
Cathal Garvey	ead8e778a6	Successfully compartmentalised code, getting closer to moving pdf->text as a module function.	2015-05-30 16:27:58 +01:00
Cathal Garvey	08cb217983	Progress, progress.. not nearly atomic enough, sorry.	2015-05-30 16:14:24 +01:00
Cathal Garvey	1b47bed306	Many changes to make pdf2txt.py work better in Py3, some in that script, others in module! Sorry, changes should have been more atomic. In pdf2txt.py: * Re-wrote main function to use argparse instead of optparse. * Manually tested in Py2/Py3 to get partial consistency. * Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway. * Py2 mode probably unchanged, cannot find any bugs yet... * Kept old main function for posterity, for now. In utils: * Added a few compatibility functions (some string hax required chardet, new dependency): - make_compat_bytes(in_str)-> (py3->bytes | py2->str) - make_compat_str(in_str)-> (str) - compatible_encode_method(bytesorstring, encoding, erraction)-> (str) In pdfdevice: * To handle different output filetypes in Py3, injected lots of calls to new utils methods, as well as some six.PYX checks and logic. These changes are largely responsible for enhanced Py2/Py3 consistency. In converter: * To handle output filetypes in Py2, injected a few checks and fixes particularly around the py2 `str.encode` method and its assumed usual use-analogies in Py3.	2015-05-17 21:08:57 +01:00
Yusuke Shinyama	14fd0fd2d6	Fixed: #84 (fontname was in unicode)	2015-04-05 19:02:02 +09:00
Ashley Blackmore	1dbe9ff7e7	Update setup.py Install missing pycrypto lib	2015-02-18 18:35:53 +01:00
speedplane	5609418351	Add gz to gitignore.	2014-12-14 01:29:39 -05:00
speedplane	69afd3dd30	Use a .gitignore file.	2014-12-14 01:23:44 -05:00
speedplane	2199c25493	Add my own .gitignore.	2014-12-12 00:37:54 -05:00
speedplane	806ee603ff	More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear. This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way. For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.	2014-12-12 00:36:59 -05:00
speedplane	45170e7183	There are a number of relatively complex changes here. Comments are in order of where the change appears. 1. When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. Now, however, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words. 2. Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical. 3. Don't detect a vertical line if the previous letter is whitespace. Prevents double spaces being caught as vert lines. 4. Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.	2014-12-12 00:36:59 -05:00
speedplane	c32550dd4a	Merge branch 'fix-makefile'	2014-12-11 00:54:14 -05:00
speedplane	5cbdd915c7	Remove the dependency on python2. Also, allow tests to be run on cygwin by checking for it, and converting unix2dos line endings.	2014-12-11 00:53:33 -05:00
speedplane	830b2403e2	Merge branch 'euske-main/master'	2014-12-11 00:06:46 -05:00
Yusuke Shinyama	0112112458	Fixed: crash on invalid chr number.	2014-12-09 22:55:47 +09:00
Yusuke Shinyama	75206ba18d	Removed: .gitignore	2014-12-09 22:49:13 +09:00
Yusuke Shinyama	4b585221e2	Merge pull request #76 from speedplane/master Fix Unicode Bug + Add GitIgnore + Add Debug Flags	2014-12-09 22:22:33 +09:00
Philippe Guglielmetti	448aa08bc4	Merge pull request #4 from enkore/master Fix utils.decode_text	2014-12-05 09:58:58 +01:00
enkore	d0379a2c44	Fix utils.decode_text	2014-12-04 17:09:52 +01:00
speedplane	36977fbe08	Add debug flags for much of the debug output.	2014-11-11 23:36:58 -05:00
speedplane	1067cb9f9f	Use a .gitignore file.	2014-11-11 23:36:26 -05:00
speedplane	ecc4d05675	Fix a unicode conversion bug. See https://github.com/euske/pdfminer/issues/75	2014-11-11 23:34:33 -05:00
Philippe Guglielmetti	0e40264071	Merge pull request #3 from Cybjit/master Samples and latin1 passwords	2014-09-17 07:22:52 +02:00
cybjit	515687e1bb	more xrange to range	2014-09-16 23:17:31 +02:00
cybjit	2639b15ef4	guess argv encoding in py2 using sys.stdin.encoding	2014-09-16 23:17:26 +02:00
cybjit	9b2e29396b	apply_png_predictor py3	2014-09-16 22:59:29 +02:00
cybjit	ad05121c69	password py3	2014-09-16 22:59:00 +02:00
cybjit	14585987c3	keep password api unicode, latin1 or utf-8 is encoded in handler	2014-09-16 22:58:25 +02:00
cybjit	2260f77b19	fix dict_value usage in strict mode	2014-09-16 22:57:29 +02:00
cybjit	51a361c145	clean up HTMLConverter and XMLConverter encoding	2014-09-16 22:57:00 +02:00
cybjit	2ee7153f6e	add python3 in sample Makefile	2014-09-16 22:56:13 +02:00
Goulu	f577f76c52	renamed as pdfminer.six in PyPi	2014-09-15 11:10:00 +02:00
Goulu	03de0f4db8	forgot 'six' requirement ...	2014-09-15 10:42:08 +02:00
Goulu	8861d7e0ed	version 20140915 pushed to PyPi as pdfminer_six	2014-09-15 10:33:04 +02:00
Philippe Guglielmetti	4f8aa9ff5b	Merge pull request #2 from Cybjit/master CMap fixes and speed improvements	2014-09-12 07:33:06 +02:00
cybjit	714423883c	setup logging for pdf2txt and fix dumppdf	2014-09-12 00:29:31 +02:00
cybjit	39942b6642	avoid string formatting when not logging	2014-09-12 00:29:31 +02:00
cybjit	01821c7d1e	rename bytes to avoid built-in collision	2014-09-12 00:29:31 +02:00
cybjit	31e6afc7cf	faster and simpler bytes implementation	2014-09-12 00:29:30 +02:00
cybjit	ed13f7c47d	conv_cmap py3 compat	2014-09-12 00:29:30 +02:00
cybjit	cba5a42ba8	decipher_all bytes	2014-09-12 00:29:30 +02:00
cybjit	6357e2da80	code2cid uses int, not byte	2014-09-12 00:29:27 +02:00
cybjit	9b0a3ee53e	decode cmap font name	2014-09-11 23:30:02 +02:00
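
Item 1 of commit 45170e7183 describes adding a space between words on a horizontal line only when the gap exceeds word_margin and no space is already present. The following is a minimal sketch of that rule, not the code from the commit: `Char` and `join_line` are hypothetical names invented for illustration, and scaling word_margin by the character width is an assumption about how the threshold is applied.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    # Hypothetical stand-in for a laid-out character: its text and
    # its left/right x coordinates on the line.
    text: str
    x0: float
    x1: float


def join_line(chars: List[Char], word_margin: float = 0.1) -> str:
    """Join characters left to right, inserting a single space at wide gaps."""
    out: List[str] = []
    prev_x1 = None
    for ch in chars:
        if prev_x1 is not None:
            gap = ch.x0 - prev_x1
            # Assumption: the threshold is word_margin times the character width.
            margin = word_margin * (ch.x1 - ch.x0)
            # Skip the synthetic space if one is already there (previous or
            # incoming character is whitespace), so spaces are never doubled.
            already_spaced = out[-1].isspace() or ch.text.isspace()
            if gap > margin and not already_spaced:
                out.append(" ")
        out.append(ch.text)
        prev_x1 = ch.x1
    return "".join(out)


if __name__ == "__main__":
    line = [Char("a", 0, 1), Char("b", 1, 2), Char("c", 5, 6)]
    print(join_line(line))
```

Checking the incoming character as well as the previous one goes slightly beyond what the commit message states; it is included here so that an explicit space in the PDF never ends up next to a synthetic one.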

637 Commits (ec8530f6cf992ebbb8f23a3fcdbee729f4163689)
