speedplane
2199c25493
Add my own .gitignore.
2014-12-12 00:37:54 -05:00
speedplane
806ee603ff
More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.
For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane
45170e7183
There are a number of relatively complex changes here. Comments are in order of where the change appears.
1. When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. Now we only do so if there is not already a space, which prevents multiple spaces being placed between words.
2. Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical lines.
3. Don't detect a vertical line if the previous letter is whitespace. This prevents double spaces being caught as vertical lines.
4. Improve upon an unfortunate O(N^2) algorithm which I have seen take many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only apply it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
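Change 1 in the commit above can be illustrated with a minimal sketch. This is not the code from the commit: `Glyph`, `join_line`, and the rule of scaling the threshold by glyph width are hypothetical names and assumptions chosen for illustration; only the `word_margin` parameter name comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for a laid-out character: text plus horizontal extent."""
    text: str
    x0: float
    x1: float


def join_line(glyphs, word_margin=0.1):
    """Concatenate glyphs left to right, inserting at most one space at a wide gap."""
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x0 - prev.x1
            # Assumption of this sketch: the threshold scales with glyph width.
            threshold = word_margin * (g.x1 - g.x0)
            # Skip the synthetic space if a real one is already adjacent.
            already_spaced = out[-1].isspace() or g.text.isspace()
            if gap > threshold and not already_spaced:
                out.append(" ")
        out.append(g.text)
        prev = g
    return "".join(out)
```

With this sketch, a wide gap between two letters yields one inserted space, while a wide gap next to an existing space character yields no extra space.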
speedplane
c32550dd4a
Merge branch 'fix-makefile'
2014-12-11 00:54:14 -05:00
speedplane
5cbdd915c7
Remove the dependency on python2. Also, allow tests to be run on Cygwin by checking for it and converting line endings with unix2dos.
2014-12-11 00:53:33 -05:00
speedplane
830b2403e2
Merge branch 'euske-main/master'
2014-12-11 00:06:46 -05:00
Yusuke Shinyama
0112112458
Fixed: crash on invalid chr number.
2014-12-09 22:55:47 +09:00
Yusuke Shinyama
75206ba18d
Removed: .gitignore
2014-12-09 22:49:13 +09:00
Yusuke Shinyama
4b585221e2
Merge pull request #76 from speedplane/master
Fix Unicode Bug + Add GitIgnore + Add Debug Flags
2014-12-09 22:22:33 +09:00
speedplane
36977fbe08
Add debug flags for much of the debug output.
2014-11-11 23:36:58 -05:00
speedplane
1067cb9f9f
Use a .gitignore file.
2014-11-11 23:36:26 -05:00
speedplane
ecc4d05675
Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
Yusuke Shinyama
b0e035c24f
Style fix: always have an explicit return.
2014-07-15 21:38:29 +09:00
Yusuke Shinyama
f5b5e31921
Fixed: DecodeParms array support.
2014-07-09 19:07:27 +09:00
Yusuke Shinyama
137fc3a1ae
Use KWD instead of token.name.
2014-06-30 19:15:21 +09:00
Yusuke Shinyama
1ccfaff411
String-Bytes distinction (first attempt).
2014-06-30 19:05:56 +09:00
Yusuke Shinyama
8791355e1d
Cleanup imports. Use relative imports.
2014-06-26 18:12:39 +09:00
Yusuke Shinyama
2e900e5d10
Fixed for consistent test results. (hopefully...)
2014-06-26 17:41:31 +09:00
Yusuke Shinyama
fe86b4e64e
Changed: StringIO -> io.BytesIO
2014-06-25 19:55:41 +09:00
Yusuke Shinyama
a3ab6c253b
Fixed: loose autotesting.
2014-06-25 19:50:20 +09:00
Yusuke Shinyama
107e071508
Drop Python 2.4 support. The oldest supported version is now Python 2.6.
2014-06-25 19:28:54 +09:00
Yusuke Shinyama
44074b42ea
Added: stripcontrol for XMLConverter (-S option)
2014-06-22 00:33:00 +09:00
Yusuke Shinyama
81391c09f4
Fixed: #56 (with a derpy fix)
2014-06-18 19:11:45 +09:00
Yusuke Shinyama
bb866ae148
Changed: new except syntax (2.6 or above).
2014-06-16 18:50:07 +09:00
Yusuke Shinyama
28e96ba3d0
Use print as a function.
2014-06-15 12:14:33 +09:00
Yusuke Shinyama
0387a6c260
Removed: tuple-unpacking args.
2014-06-15 12:12:13 +09:00
Yusuke Shinyama
8f9c4dedff
Test rig cleanup.
2014-06-15 11:41:30 +09:00
Yusuke Shinyama
a8ec99a848
More autotest tweaks.
2014-06-15 10:52:59 +09:00
Yusuke Shinyama
1384a3fe8d
Code cleanup: removed some debug flags.
2014-06-14 15:43:10 +09:00
Yusuke Shinyama
d9680fca7e
Plane: preserve the object order so that the test result is always consistent.
2014-06-14 14:44:53 +09:00
Yusuke Shinyama
aed248610c
Fixed: dependency on pygame in a unittest.
2014-06-14 12:05:26 +09:00
Yusuke Shinyama
8e14ebf4e1
Use logging module instead of print.
2014-06-14 12:00:49 +09:00
Yusuke Shinyama
fb3f2d9629
Further test tweaks.
2014-06-14 12:00:31 +09:00
Yusuke Shinyama
2c90e6ac42
Updated: copyright year.
2014-06-14 11:29:42 +09:00
Yusuke Shinyama
9ebd6d5938
Travis-CI tweaks.
2014-06-14 11:24:45 +09:00
Yusuke Shinyama
fe0ae545ec
Added: pip install to travis.yml
2014-06-14 11:16:50 +09:00
Yusuke Shinyama
582fbcbc1b
Merge branch 'travis-yaml' of https://github.com/mduggan/pdfminer
2014-06-14 11:05:46 +09:00
Yusuke Shinyama
a7489aaabe
Fixed: autotests
2014-06-14 10:54:40 +09:00
Yusuke Shinyama
8e8e22c095
Fixed a layout bug introduced at c97ec304.
2014-06-13 23:05:04 +09:00
Matthew Duggan
0786262bac
Add travis.yml for CI
2014-05-29 15:58:58 +09:00
Yusuke Shinyama
18817d0e38
Merge pull request #53 from numion/encryption
Support revision 4 and 5 encryption if PyCrypto library is available.
2014-05-28 11:42:43 +09:00
numion
a4997d6f10
Implement revision 4 and 5 encryption handler.
2014-05-19 16:27:43 +02:00
Yusuke Shinyama
0be2f5422b
Fixed the document, thanks to Darius Thabit.
2014-05-19 23:23:41 +09:00
Yusuke Shinyama
29ebc2d618
Merge pull request #52 from hinesmr/master
Stop throwing exception on LITERALS_DCT_DECODE
2014-05-15 21:52:23 +09:00
Michael R. Hines
ae2547b0f2
Stop throwing exception on LITERALS_DCT_DECODE
I have PDF documents with image streams that use two filters; don't throw an exception on the second one (DCT).
2014-05-14 13:25:30 +08:00
Yusuke Shinyama
6b6fc264ff
Code refactoring: CMap and UnicodeMap both inherit CMapBase.
2014-04-16 18:57:16 +09:00
Yusuke Shinyama
b09c37902f
Fixed: issue #48 (thanks to speedplane)
2014-04-09 17:55:50 +09:00
Yusuke Shinyama
52d96b3b67
Added a demo app url.
2014-04-05 12:26:33 +09:00
Yusuke Shinyama
17b9b19a26
Fixed for newer version: pdf2html.cgi
2014-04-02 18:54:50 +09:00
Yusuke Shinyama
9242356357
Updated the url.
2014-03-28 22:55:06 +09:00