491 Commits (dcf07272a175f0c2911c62986e16ac9b4afeb1a6)

Author SHA1 Message Date
speedplane dcf07272a1 Revert changes unrelated to this feature. 2016-06-13 23:46:30 -04:00
speedplane 549b560765 Revert changes unrelated to this feature. 2016-06-13 23:44:54 -04:00
speedplane 2049462f6f Revert changes unrelated to this branch. 2016-06-13 23:42:21 -04:00
speedplane b0b8818a41 Fix a bug with pdfminer which occurs when two or more filters are applied to a stream, even though no parameters are specified. The code would previously drop all of the streams after the first due to misapplication of the zip function. 2016-06-13 23:35:11 -04:00
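The zip pitfall described in the commit above can be illustrated with a minimal sketch. This is not the actual pdfminer code: `pair_filters` is a hypothetical helper, and the filter names are only sample values. It assumes the intent is to pair every filter with a (possibly empty) parameter dict.

```python
def pair_filters(filters, params=None):
    """Pair each stream filter with its decode parameters (hypothetical helper)."""
    if not isinstance(filters, list):
        filters = [filters]
    if params is None:
        params = {}
    if not isinstance(params, list):
        # Repeat the single (possibly empty) parameter dict once per filter.
        # Wrapping it as [params] instead would make zip() stop after the
        # first filter, because zip() truncates to its shortest input.
        params = [params] * len(filters)
    return list(zip(filters, params))


filters = ["ASCII85Decode", "FlateDecode"]
truncated = list(zip(filters, [{}]))  # one-element params list: second filter is lost
paired = pair_filters(filters)        # every filter keeps its own parameter slot
print(truncated)
print(paired)
```

Because `zip` silently stops at the shortest input, padding the parameter list to the number of filters is one way to keep every filter in the result.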
speedplane 5609418351 Add gz to gitignore. 2014-12-14 01:29:39 -05:00
speedplane 69afd3dd30 Use a .gitignore file. 2014-12-14 01:23:44 -05:00
speedplane 2199c25493 Add my own .gitignore. 2014-12-12 00:37:54 -05:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
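The merging idea in the commit above can be sketched roughly as follows. This is an illustration only, not the pdfminer layout code: `merge_horizontal`, the `(x0, x1, y, text)` tuple layout, and the `max_gap` threshold are all assumptions made for the example, and segments are treated as being on the same line only when their `y` values are exactly equal.

```python
def merge_horizontal(segments, max_gap=2.0):
    """Merge horizontally adjacent segments that share the same baseline.

    segments: list of (x0, x1, y, text) tuples (hypothetical layout).
    """
    merged = []
    # Sort by baseline first, then left edge, so neighbors end up adjacent.
    for x0, x1, y, text in sorted(segments, key=lambda s: (s[2], s[0])):
        if merged:
            px0, px1, py, ptext = merged[-1]
            if py == y and x0 - px1 <= max_gap:
                # Same line and close enough: extend the previous segment.
                merged[-1] = (px0, max(px1, x1), y, ptext + text)
                continue
        merged.append((x0, x1, y, text))
    return merged


segments = [(30, 50, 100, "lo"), (0, 29, 100, "Hel"), (0, 20, 80, "next")]
result = merge_horizontal(segments)
print(result)
```

A real implementation would probably need a tolerance on the baseline comparison; exact equality keeps the sketch short.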
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. However, we now only do it if there is not already an existing space. This prevents multiple spaces being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitespace. Prevents double spaces being caught as vertical lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen take many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
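Item 1 of the commit above (insert a space only when one is not already present) could look roughly like the sketch below. It is not the actual pdfminer code: `Char`, `join_line`, and the choice to scale `word_margin` by the character width are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical character record: text plus left and right x coordinates.
Char = namedtuple("Char", "text x0 x1")


def join_line(chars, word_margin=0.1):
    """Join characters on one horizontal line, adding at most one space per gap."""
    out = []
    prev = None
    for ch in chars:
        if prev is not None:
            margin = word_margin * (ch.x1 - ch.x0)  # assumed: relative to char width
            gap = ch.x0 - prev.x1
            already_spaced = prev.text.isspace() or ch.text.isspace()
            # Only synthesize a space when the gap is wide and the text
            # does not already contain whitespace at this position.
            if gap > margin and not already_spaced:
                out.append(" ")
        out.append(ch.text)
        prev = ch
    return "".join(out)


no_space = [Char("a", 0, 5), Char("b", 10, 15)]
has_space = [Char("a", 0, 5), Char(" ", 5, 7), Char("b", 12, 17)]
print(repr(join_line(no_space)))
print(repr(join_line(has_space)))
```

Without the `already_spaced` check, the second input would produce two spaces between the letters, which is the behavior the commit says it prevents.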
speedplane c32550dd4a Merge branch 'fix-makefile' 2014-12-11 00:54:14 -05:00
speedplane 5cbdd915c7 Remove the dependency on python2. Also, allow tests to be run on cygwin by checking for it, and converting line endings with unix2dos. 2014-12-11 00:53:33 -05:00
speedplane 830b2403e2 Merge branch 'euske-main/master' 2014-12-11 00:06:46 -05:00
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
Yusuke Shinyama 75206ba18d Removed: .gitignore 2014-12-09 22:49:13 +09:00
Yusuke Shinyama 4b585221e2 Merge pull request #76 from speedplane/master
Fix Unicode Bug + Add GitIgnore + Add Debug Flags
2014-12-09 22:22:33 +09:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane 1067cb9f9f Use a .gitignore file. 2014-11-11 23:36:26 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
Yusuke Shinyama b0e035c24f Style fix: always have an explicit return. 2014-07-15 21:38:29 +09:00
Yusuke Shinyama f5b5e31921 Fixed: DecodeParms array support. 2014-07-09 19:07:27 +09:00
Yusuke Shinyama 137fc3a1ae Use KWD instead of token.name. 2014-06-30 19:15:21 +09:00
Yusuke Shinyama 1ccfaff411 String-Bytes distinction (first attempt). 2014-06-30 19:05:56 +09:00
Yusuke Shinyama 8791355e1d Cleanup imports. Use relative imports. 2014-06-26 18:12:39 +09:00
Yusuke Shinyama 2e900e5d10 Fixed for consistent test results. (hopefully...) 2014-06-26 17:41:31 +09:00
Yusuke Shinyama fe86b4e64e Changed: StringIO -> io.BytesIO 2014-06-25 19:55:41 +09:00
Yusuke Shinyama a3ab6c253b Fixed: loose autotesting. 2014-06-25 19:50:20 +09:00
Yusuke Shinyama 107e071508 Drop Python 2.4 support. The oldest supported version is now Python 2.6. 2014-06-25 19:28:54 +09:00
Yusuke Shinyama 44074b42ea Added: stripcontrol for XMLConverter (-S option) 2014-06-22 00:33:00 +09:00
Yusuke Shinyama 81391c09f4 Fixed: #56 (with a derpy fix) 2014-06-18 19:11:45 +09:00
Yusuke Shinyama bb866ae148 Changed: new except syntax (2.6 or above). 2014-06-16 18:50:07 +09:00
Yusuke Shinyama 28e96ba3d0 Use print as a function. 2014-06-15 12:14:33 +09:00
Yusuke Shinyama 0387a6c260 Removed: tuple-unpacking args. 2014-06-15 12:12:13 +09:00
Yusuke Shinyama 8f9c4dedff Test rig cleanup. 2014-06-15 11:41:30 +09:00
Yusuke Shinyama a8ec99a848 More autotest tweaks. 2014-06-15 10:52:59 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00
Yusuke Shinyama d9680fca7e Plane: preserve the object order so that the test result is always consistent. 2014-06-14 14:44:53 +09:00
Yusuke Shinyama aed248610c Fixed: dependency on pygame in a unittest. 2014-06-14 12:05:26 +09:00
Yusuke Shinyama 8e14ebf4e1 Use logging module instead of print. 2014-06-14 12:00:49 +09:00
Yusuke Shinyama fb3f2d9629 Further test tweaks. 2014-06-14 12:00:31 +09:00
Yusuke Shinyama 2c90e6ac42 Updated: copyright year. 2014-06-14 11:29:42 +09:00
Yusuke Shinyama 9ebd6d5938 Travis-CI tweaks. 2014-06-14 11:24:45 +09:00
Yusuke Shinyama fe0ae545ec added: pip install to travis.yml 2014-06-14 11:16:50 +09:00
Yusuke Shinyama 582fbcbc1b Merge branch 'travis-yaml' of https://github.com/mduggan/pdfminer 2014-06-14 11:05:46 +09:00
Yusuke Shinyama a7489aaabe Fixed: autotests 2014-06-14 10:54:40 +09:00
Yusuke Shinyama 8e8e22c095 Fixed a layout bug introduced at c97ec304. 2014-06-13 23:05:04 +09:00
Matthew Duggan 0786262bac Add travis.yml for CI 2014-05-29 15:58:58 +09:00
Yusuke Shinyama 18817d0e38 Merge pull request #53 from numion/encryption
Support revision 4 and 5 encryption if PyCrypto library is available.
2014-05-28 11:42:43 +09:00
numion a4997d6f10 Implement revision 4 and 5 encryption handler. 2014-05-19 16:27:43 +02:00
Yusuke Shinyama 0be2f5422b Fixed the document, thanks to Darius Thabit. 2014-05-19 23:23:41 +09:00
Yusuke Shinyama 29ebc2d618 Merge pull request #52 from hinesmr/master
Stop throwing exception on LITERALS_DCT_DECODE
2014-05-15 21:52:23 +09:00