speedplane
806ee603ff
More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
...
This commit finds horizontal neighbors within a horizontal line and merges them into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.
For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
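The merge described in 806ee603ff could look roughly like the sketch below. It is not the code from the commit: the Fragment type, the merge_same_line function, and the y_tol and max_gap thresholds are all hypothetical names chosen for illustration, and a single baseline coordinate stands in for the real bounding-box logic.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    # Hypothetical stand-in for a horizontal text line fragment.
    x0: float   # left edge
    x1: float   # right edge
    y: float    # baseline
    text: str


def merge_same_line(fragments, y_tol=0.5, max_gap=5.0):
    """Merge fragments that share a baseline and sit close together.

    y_tol and max_gap are assumed thresholds, not pdfminer parameters.
    """
    merged = []
    # Top to bottom, then left to right within a baseline.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x0)):
        last = merged[-1] if merged else None
        if (last is not None
                and abs(last.y - frag.y) <= y_tol
                and frag.x0 - last.x1 <= max_gap):
            # Same line and horizontally adjacent: extend the previous one.
            merged[-1] = Fragment(last.x0, max(last.x1, frag.x1),
                                  last.y, last.text + frag.text)
        else:
            merged.append(frag)
    return merged


pieces = [
    Fragment(0, 10, 100, "Hel"),
    Fragment(10.5, 20, 100, "lo"),
    Fragment(200, 210, 100, "far"),
    Fragment(0, 10, 80, "World"),
]
lines = [f.text for f in merge_same_line(pieces)]
```

Here "Hel" and "lo" are joined because they share a baseline and nearly touch, while "far" stays separate because of the wide gap, and "World" stays separate because it is on another line.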
speedplane
45170e7183
There are a number of relatively complex changes here. Comments are in order of where the change appears.
...
1. When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. Now we only do so if there is not already an existing space. This prevents multiple spaces from being placed between words.
2. Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical lines.
3. Don't detect a vertical line if the previous letter is whitespace. This prevents double spaces from being caught as vertical lines.
4. Improve upon an unfortunate O(N^2) algorithm which I have seen take many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only apply it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
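Item 1 of 45170e7183 can be illustrated with a minimal sketch. This is not the code from the commit: join_chars is a hypothetical helper, characters are reduced to (x0, x1, text) tuples, and the gap is compared against word_margin times the current character width, which is an assumption about how the margin is scaled.

```python
def join_chars(chars, word_margin=0.1):
    """Join characters left to right, inserting one space at wide gaps.

    chars is a list of (x0, x1, text) tuples sorted by x0 (assumed layout).
    """
    out = []
    prev_x1 = None
    for x0, x1, text in chars:
        width = x1 - x0
        gap_is_wide = prev_x1 is not None and x0 - prev_x1 > word_margin * width
        # The change in item 1: skip the synthetic space when one is
        # already present, either just emitted or about to be emitted.
        already_spaced = bool(out) and out[-1].isspace()
        if gap_is_wide and not already_spaced and not text.isspace():
            out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)


no_space_in_pdf = join_chars([(0, 5, "a"), (5, 10, "b"), (12, 17, "c")])
space_in_pdf = join_chars([(0, 5, "a"), (5, 7, " "), (10, 15, "b")])
```

In the second call the PDF already contains a space character before the gap, so no extra space is added and the words end up separated by exactly one space.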
Yusuke Shinyama
8791355e1d
Cleanup imports. Use relative imports.
2014-06-26 18:12:39 +09:00
Yusuke Shinyama
2e900e5d10
Fixed for consistent test results. (hopefully...)
2014-06-26 17:41:31 +09:00
Yusuke Shinyama
0387a6c260
Removed: tuple-unpacking args.
2014-06-15 12:12:13 +09:00
Yusuke Shinyama
a8ec99a848
More autotest tweaks.
2014-06-15 10:52:59 +09:00
Yusuke Shinyama
1384a3fe8d
Code cleanup: removed some debug flags.
2014-06-14 15:43:10 +09:00
Yusuke Shinyama
8e8e22c095
Fixed a layout bug introduced at c97ec304.
2014-06-13 23:05:04 +09:00
Yusuke Shinyama
340387bfc6
Cleanup: isinstance
2014-03-28 17:50:59 +09:00
Yusuke Shinyama
c97ec3048e
Changed / to // for clarity.
2013-11-26 21:35:16 +09:00
Yusuke Shinyama
acad011e3f
Code cleanup.
2013-11-11 20:46:30 +09:00
Yusuke Shinyama
cbef967fbf
Renamed: LTAnon -> LTAnno
2013-11-11 19:17:45 +09:00
Yusuke Shinyama
c8b6d4112a
Fixed: crash with negative layout bbox.
2013-11-09 15:10:14 +09:00
Yusuke Shinyama
2b56b2eedf
Merged.
2013-11-07 19:50:41 +09:00
Matthew Duggan
2caa5edc25
PEP8: Whitespace changes to match pep8
2013-11-07 17:35:04 +09:00
Matthew Duggan
c1da8b835c
PEP8: Remove trailing whitespace
2013-11-07 16:14:53 +09:00
Matthew Duggan
10a68c83bd
Remove unused imports identified by pyflakes
2013-11-07 16:09:44 +09:00
Yusuke Shinyama
4ef81ae9d8
Improved word spacing.
2013-11-05 18:25:19 +09:00
Yusuke Shinyama
e927bd307e
fixed: https://github.com/euske/pdfminer/issues/8
2013-10-22 18:24:39 +09:00
Yusuke Shinyama
0ea08890d4
renamed: python2 -> python.
2013-10-17 23:05:27 +09:00
Yusuke Shinyama
eabe72ee63
Prevent crash with empty layout box.
2013-10-09 22:13:22 +09:00
jcushman
f77f196cd3
2x faster group_textboxes function.
2012-06-22 18:11:45 -03:00
Yusuke Shinyama
f638784e1d
experimental layout analysis improvements
2011-08-14 09:44:21 +09:00
Yusuke Shinyama
c134596e2f
code cleanup and testcase stabilization
2011-05-15 01:22:19 +09:00
Yusuke Shinyama
e5d02f8653
fixed the infinite recursion bug.
2011-05-14 16:32:09 +09:00
Yusuke Shinyama
0c41b8348e
code cleanup
2011-05-14 15:51:40 +09:00
Yusuke Shinyama
038ce4cd0c
added LTText.get_text() and .text property is no longer accessible.
2011-05-14 15:45:08 +09:00
Yusuke Shinyama
5004e4b28d
layout analysis speedup.
2011-05-14 14:17:39 +09:00
Yusuke Shinyama
8f9684f6a6
code cleanup: layout analysis
2011-04-21 22:07:04 +09:00
Yusuke Shinyama
0e660dd385
rename: LTPolygon -> LTCurve
2011-04-20 22:05:25 +09:00
Yusuke Shinyama
bb26cf9180
eliminate empty textboxes
2011-03-01 20:47:20 +09:00
Yusuke Shinyama
a8bf9b159e
docstring fix
2011-02-27 13:09:12 +09:00
Yusuke Shinyama
cabaa10e4f
layout analysis improvement
2011-02-27 12:56:28 +09:00
Yusuke Shinyama
f00f1dbd04
better layout analysis
2011-02-14 23:41:23 +09:00
Yusuke Shinyama
cd412308bd
text flow detection bug fix (thanks to fujimoto-san)
2011-02-14 22:32:55 +09:00
Yusuke Shinyama
cbd58121e3
fix aggressive vertical writing detection (which ruins layout)
2011-02-02 23:09:34 +09:00
Yusuke Shinyama
a24c452ba2
boxes_flow patch by Daniel Gerber
2010-12-26 17:26:39 +09:00
Yusuke Shinyama
3da3adad9b
method renamed: finish(self) -> analyze(self, laparams).
2010-12-26 16:56:21 +09:00
yusuke.shinyama.dummy
476ecf7e32
add html exact layout mode; default changed.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@272 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 10:07:41 +00:00
yusuke.shinyama.dummy
0d1f00fa9b
improved layout analysis for vertical script
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@269 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:40:14 +00:00
yusuke.shinyama.dummy
9584845358
layout analysis improved
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@268 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:40:05 +00:00
yusuke.shinyama.dummy
edbd3764a7
html layout output fix
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@267 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:39:48 +00:00
yusuke.shinyama.dummy
509ab66319
stay with python2
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@264 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-19 09:57:01 +00:00
yusuke.shinyama.dummy
438b4953be
documentation bit and code cleanup
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@263 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-18 15:04:49 +00:00
yusuke.shinyama.dummy
3305c07ba2
layout analysis improved
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@245 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:13:39 +00:00
yusuke.shinyama.dummy
bc1303e901
layout analysis improvement 1
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@244 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:13:33 +00:00
yusuke.shinyama.dummy
0944cfaded
test file simple3.pdf added.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@240 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:39:41 +00:00
yusuke.shinyama.dummy
83d2086f19
fix minor layout issue
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@239 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:39:31 +00:00
yusuke.shinyama.dummy
4554705881
glyphlist bug (due to my misunderstanding of spec.)
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@237 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-26 15:02:46 +00:00
yusuke.shinyama.dummy
ac74542d1f
minor bugfixes
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@234 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-26 15:02:29 +00:00