speedplane
2199c25493
Add my own .gitignore.
2014-12-12 00:37:54 -05:00
speedplane
806ee603ff
More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
...
This commit finds horizontal neighbors in a horizonal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.
For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane
45170e7183
There are a number of relatively complex changes here. Comments are in order of where the change appears.
...
1.
When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart. However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.
2.
Detect a horizontal line if the line is zero width. This improves our detection of horizonal lines when looking for both horizontal and vertical.
3.
Don't detect a vertical line if the previous letter is whitspace. Prevents double spaces being caught as vert lines.
4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
speedplane
c32550dd4a
Merge branch 'fix-makefile'
2014-12-11 00:54:14 -05:00
speedplane
5cbdd915c7
Remove the dependancy on python2. Also, allow tests to be run on cygwin by checking for it, and converting unix2dos line endings.
2014-12-11 00:53:33 -05:00
speedplane
830b2403e2
Merge branch 'euske-main/master'
2014-12-11 00:06:46 -05:00
Yusuke Shinyama
0112112458
Fixed: crash on invalid chr number.
2014-12-09 22:55:47 +09:00
Yusuke Shinyama
75206ba18d
Removed: .gitignore
2014-12-09 22:49:13 +09:00
Yusuke Shinyama
4b585221e2
Merge pull request #76 from speedplane/master
...
Fix Unicode Bug + Add GitIgnore + Add Debug Flags
2014-12-09 22:22:33 +09:00
Philippe Guglielmetti
448aa08bc4
Merge pull request #4 from enkore/master
...
Fix utils.decode_text
2014-12-05 09:58:58 +01:00
enkore
d0379a2c44
Fix utils.decode_text
2014-12-04 17:09:52 +01:00
speedplane
36977fbe08
Add debug flags for much of the debug output.
2014-11-11 23:36:58 -05:00
speedplane
1067cb9f9f
Use a .gitignore file.
2014-11-11 23:36:26 -05:00
speedplane
ecc4d05675
Fix a unicode conversion bug.
...
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
Philippe Guglielmetti
0e40264071
Merge pull request #3 from Cybjit/master
...
Samples and latin1 passwords
2014-09-17 07:22:52 +02:00
cybjit
515687e1bb
more xrange to range
2014-09-16 23:17:31 +02:00
cybjit
2639b15ef4
guess argv encoding in py2 using sys.stdin.encoding
2014-09-16 23:17:26 +02:00
cybjit
9b2e29396b
apply_png_predictor py3
2014-09-16 22:59:29 +02:00
cybjit
ad05121c69
password py3
2014-09-16 22:59:00 +02:00
cybjit
14585987c3
keep password api unicode, latin1 or utf-8 is encoded in handler
2014-09-16 22:58:25 +02:00
cybjit
2260f77b19
fix dict_value usage in strict mode
2014-09-16 22:57:29 +02:00
cybjit
51a361c145
clean up HTMLConverter and XMLConverter encoding
2014-09-16 22:57:00 +02:00
cybjit
2ee7153f6e
add python3 in sample Makefile
2014-09-16 22:56:13 +02:00
Goulu
f577f76c52
renamed as pdfminer.six in PyPi
2014-09-15 11:10:00 +02:00
Goulu
03de0f4db8
forgot 'six' requirement ...
2014-09-15 10:42:08 +02:00
Goulu
8861d7e0ed
version 20140915 pushed to PyPi as pdfminer_six
2014-09-15 10:33:04 +02:00
Philippe Guglielmetti
4f8aa9ff5b
Merge pull request #2 from Cybjit/master
...
CMap fixes and speed improvements
2014-09-12 07:33:06 +02:00
cybjit
714423883c
setup logging for pdf2txt and fix dumppdf
2014-09-12 00:29:31 +02:00
cybjit
39942b6642
avoid string formating when not logging
2014-09-12 00:29:31 +02:00
cybjit
01821c7d1e
rename bytes to avoid built-in collision
2014-09-12 00:29:31 +02:00
cybjit
31e6afc7cf
faster and simpler bytes implementation
2014-09-12 00:29:30 +02:00
cybjit
ed13f7c47d
conv_cmap py3 compat
2014-09-12 00:29:30 +02:00
cybjit
cba5a42ba8
decipher_all bytes
2014-09-12 00:29:30 +02:00
cybjit
6357e2da80
code2cid uses int, not byte
2014-09-12 00:29:27 +02:00
cybjit
9b0a3ee53e
decode cmap font name
2014-09-11 23:30:02 +02:00
Philippe Guglielmetti
7b620b3146
Merge pull request #1 from Cybjit/master
...
Python 3 text conversion issues
2014-09-09 20:42:37 +02:00
cybjit
a6f31a713d
cmap bytes and decode
2014-09-07 18:41:04 +02:00
cybjit
cc733c8217
fixes for ARC4
2014-09-07 18:38:22 +02:00
cybjit
f9a67db89b
change xrange to range
2014-09-07 18:36:12 +02:00
cybjit
0a2d90c051
pdf2txt: do not double encode stdout
2014-09-07 18:34:11 +02:00
unknown
28c2a4e6ad
2.7/3.4 encoding corrected
2014-09-04 10:31:33 +02:00
unknown
58b8492783
no logging in travis.ci
2014-09-04 10:19:50 +02:00
unknown
1c93468c7e
faster, less verbose tests
2014-09-04 10:02:29 +02:00
unknown
7b610b34be
tools must be a module to enable scripts tests
2014-09-04 09:47:33 +02:00
unknown
4ab48d1803
Python 3.4 compatibility + tests
2014-09-04 09:36:19 +02:00
unknown
29c07ea770
Python 3.4 support and tests
2014-09-03 15:26:08 +02:00
unknown
a6475b61b4
Python 3.4 support added and tested
2014-09-03 13:17:41 +02:00
unknown
846cd18186
Python 3.4 support
2014-09-02 15:49:46 +02:00
unknown
faea7291a8
tests pass under Py 2.7 and 3.4
2014-09-01 14:16:49 +02:00
Yusuke Shinyama
b0e035c24f
Style fix: always have an explicit return.
2014-07-15 21:38:29 +09:00