Commit Graph

874 Commits (10f6fb40c258c86fd04d86bade20f69fb07faabd)

Author SHA1 Message Date
speedplane 5609418351 Add gz to gitignore. 2014-12-14 01:29:39 -05:00
speedplane 69afd3dd30 Use a .gitignore file. 2014-12-14 01:23:44 -05:00
speedplane 2199c25493 Add my own .gitignore. 2014-12-12 00:37:54 -05:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in an unusual way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
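
The merge described in 806ee603ff can be sketched roughly as follows. This is only an illustration, not the code from the commit: the dict-based line representation, the names v_overlap and merge_horizontal, and the max_gap threshold are all assumptions made for the example.

```python
def v_overlap(a, b):
    """True if two boxes share some vertical extent (assumed test for 'same line')."""
    return min(a["y1"], b["y1"]) - max(a["y0"], b["y0"]) > 0


def merge_horizontal(lines, max_gap):
    """Merge line boxes that sit side by side on the same visual line.

    lines:   list of dicts with keys x0, x1, y0, y1, text (hypothetical layout)
    max_gap: largest horizontal gap still treated as the same line (assumption)
    """
    merged = []
    # Visit boxes left to right so each box can only extend a box to its left.
    for line in sorted(lines, key=lambda box: box["x0"]):
        for m in merged:
            gap = line["x0"] - m["x1"]
            if v_overlap(m, line) and 0 <= gap <= max_gap:
                # Grow the existing box to cover the neighbor and join the text.
                m["x1"] = max(m["x1"], line["x1"])
                m["y0"] = min(m["y0"], line["y0"])
                m["y1"] = max(m["y1"], line["y1"])
                m["text"] += line["text"]
                break
        else:
            # No horizontal neighbor found: keep it as its own line (copy the dict).
            merged.append(dict(line))
    return merged
```

For a PDF written in column-like fragments, each fragment would arrive as its own box, and a pass like this would join the fragments that share a baseline before text is emitted.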
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. However, we now only do it if there is not already an existing space. This prevents multiple spaces from being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitespace. This prevents double spaces from being caught as vertical lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
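
Item 1 of 45170e7183 (insert a space only when one is not already there) can be illustrated with a small sketch. It is not the commit's code: the function name join_chars and the (text, x0, x1) tuple layout are hypothetical, and word_margin is treated here as an absolute gap, which is a simplification assumed for the example.

```python
def join_chars(chars, word_margin):
    """Join text fragments on one horizontal line into a string.

    chars:       list of (text, x0, x1) tuples sorted left to right (hypothetical layout)
    word_margin: gap above which two fragments count as separate words
                 (treated as an absolute distance here, an assumption)
    """
    out = []
    prev_x1 = None
    for text, x0, x1 in chars:
        if prev_x1 is not None and x0 - prev_x1 > word_margin:
            # Add a space only if neither side already supplies one,
            # so a wide gap next to a real space does not produce two.
            if not out[-1].endswith(" ") and not text.startswith(" "):
                out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)
```

With a margin of 2, join_chars([("Hello ", 0, 12), ("world", 15, 25)], 2) yields a single space between the words, because the fragment on the left already ends with one.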
speedplane c32550dd4a Merge branch 'fix-makefile' 2014-12-11 00:54:14 -05:00
speedplane 5cbdd915c7 Remove the dependency on python2. Also, allow tests to be run on cygwin by checking for it and converting line endings with unix2dos. 2014-12-11 00:53:33 -05:00
speedplane 830b2403e2 Merge branch 'euske-main/master' 2014-12-11 00:06:46 -05:00
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
Yusuke Shinyama 75206ba18d Removed: .gitignore 2014-12-09 22:49:13 +09:00
Yusuke Shinyama 4b585221e2 Merge pull request #76 from speedplane/master
Fix Unicode Bug + Add GitIgnore + Add Debug Flags
2014-12-09 22:22:33 +09:00
Philippe Guglielmetti 448aa08bc4 Merge pull request #4 from enkore/master
Fix utils.decode_text
2014-12-05 09:58:58 +01:00
enkore d0379a2c44 Fix utils.decode_text 2014-12-04 17:09:52 +01:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane 1067cb9f9f Use a .gitignore file. 2014-11-11 23:36:26 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
Philippe Guglielmetti 0e40264071 Merge pull request #3 from Cybjit/master
Samples and latin1 passwords
2014-09-17 07:22:52 +02:00
cybjit 515687e1bb more xrange to range 2014-09-16 23:17:31 +02:00
cybjit 2639b15ef4 guess argv encoding in py2 using sys.stdin.encoding 2014-09-16 23:17:26 +02:00
cybjit 9b2e29396b apply_png_predictor py3 2014-09-16 22:59:29 +02:00
cybjit ad05121c69 password py3 2014-09-16 22:59:00 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 2260f77b19 fix dict_value usage in strict mode 2014-09-16 22:57:29 +02:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
cybjit 2ee7153f6e add python3 in sample Makefile 2014-09-16 22:56:13 +02:00
Goulu f577f76c52 renamed as pdfminer.six in PyPi 2014-09-15 11:10:00 +02:00
Goulu 03de0f4db8 forgot 'six' requirement ... 2014-09-15 10:42:08 +02:00
Goulu 8861d7e0ed version 20140915 pushed to PyPi as pdfminer_six 2014-09-15 10:33:04 +02:00
Philippe Guglielmetti 4f8aa9ff5b Merge pull request #2 from Cybjit/master
CMap fixes and speed improvements
2014-09-12 07:33:06 +02:00
cybjit 714423883c setup logging for pdf2txt and fix dumppdf 2014-09-12 00:29:31 +02:00
cybjit 39942b6642 avoid string formatting when not logging 2014-09-12 00:29:31 +02:00
cybjit 01821c7d1e rename bytes to avoid built-in collision 2014-09-12 00:29:31 +02:00
cybjit 31e6afc7cf faster and simpler bytes implementation 2014-09-12 00:29:30 +02:00
cybjit ed13f7c47d conv_cmap py3 compat 2014-09-12 00:29:30 +02:00
cybjit cba5a42ba8 decipher_all bytes 2014-09-12 00:29:30 +02:00
cybjit 6357e2da80 code2cid uses int, not byte 2014-09-12 00:29:27 +02:00
cybjit 9b0a3ee53e decode cmap font name 2014-09-11 23:30:02 +02:00
Philippe Guglielmetti 7b620b3146 Merge pull request #1 from Cybjit/master
Python 3 text conversion issues
2014-09-09 20:42:37 +02:00
cybjit a6f31a713d cmap bytes and decode 2014-09-07 18:41:04 +02:00
cybjit cc733c8217 fixes for ARC4 2014-09-07 18:38:22 +02:00
cybjit f9a67db89b change xrange to range 2014-09-07 18:36:12 +02:00
cybjit 0a2d90c051 pdf2txt: do not double encode stdout 2014-09-07 18:34:11 +02:00
unknown 28c2a4e6ad 2.7/3.4 encoding corrected 2014-09-04 10:31:33 +02:00
unknown 58b8492783 no logging in travis.ci 2014-09-04 10:19:50 +02:00
unknown 1c93468c7e faster, less verbose tests 2014-09-04 10:02:29 +02:00
unknown 7b610b34be tools must be a module to enable scripts tests 2014-09-04 09:47:33 +02:00
unknown 4ab48d1803 Python 3.4 compatibility + tests 2014-09-04 09:36:19 +02:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
unknown a6475b61b4 Python 3.4 support added and tested 2014-09-03 13:17:41 +02:00
unknown 846cd18186 Python 3.4 support 2014-09-02 15:49:46 +02:00