Commit Graph

98 Commits (3d549ea48c11a50d427f4636fb060eac654c044d)

Author SHA1 Message Date
Tata Ganesh e03ecab856
Merge pull request #141 from timb07/speedup_layout
Speed up layout of text boxes
2018-11-08 20:28:40 +05:30
Tim Bell 1cbeaebfce Fix Python 2.6 incompatibility 2018-04-11 10:34:15 +10:00
Tim Bell 0c8cf748fe Fix copy-paste error 2018-04-11 10:15:32 +10:00
Tim Bell 2dda2b12b4 Speedup layout with .sort() and sortedcontainers.SortedListWithKey() 2018-04-11 09:03:32 +10:00
Quentin Pradet 2231f0892e
Send non-stroke color to XML conversion
Inspired by https://github.com/euske/pdfminer/pull/158 from @andruo11
and https://github.com/euske/pdfminer/pull/197 from @staccatosound.
2018-03-06 14:11:48 +04:00
Hugh Secker-Walker 488545ddc7 Add string expressions to asserts showing local data (#67) 2017-05-29 09:06:09 +02:00
Philippe Guglielmetti 52feb22eeb Merge remote-tracking branch 'origin/master'
Conflicts:
	MANIFEST.in
	README.md
	pdfminer/latin_enc.py
	pdfminer/pdfdocument.py
	pdfminer/pdfinterp.py
	pdfminer/pdfpage.py
	pdfminer/pdftypes.py
	pdfminer/psparser.py
	pdfminer/utils.py
	samples/Makefile
	setup.py
2017-01-19 08:03:16 +01:00
Humberto Pereira e6ad15af79 Added painting information (#37)
* added color support to stroking and non stroking color spaces

* extended LTCurve, LTLine and LTRect to save painting information

* modified PDFLayoutAnalyzer to populate the shapes with painting information
2016-11-08 20:01:58 +01:00
Antonio Ercole De Luca 0fdebc6739 Removing all the "#!/usr/bin/env python" lines, they do not need for … (#34)
* Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19.

* Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env
Removed also the shebang from pdfminer/psparser.py file.
2016-11-08 20:01:11 +01:00
Yusuke Shinyama 8150458718 Added: a simpler ordering mode when 1<F. 2016-09-26 18:06:34 +09:00
speedplane 2049462f6f Revert changes unrelated to this branch. 2016-06-13 23:42:21 -04:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizonal line and merges them together into a single horizontal line if necessary.  This leads to much better text extraction  if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart.  However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizonal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitspace. Prevents double spaces being caught as vert lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute.  Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
Yusuke Shinyama 8791355e1d Cleanup imports. Use relative imports. 2014-06-26 18:12:39 +09:00
Yusuke Shinyama 2e900e5d10 Fixed for consistent test results. (hopefully...) 2014-06-26 17:41:31 +09:00
Yusuke Shinyama 0387a6c260 Removed: tuple-unpacking args. 2014-06-15 12:12:13 +09:00
Yusuke Shinyama a8ec99a848 More autotest tweaks. 2014-06-15 10:52:59 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00
Yusuke Shinyama 8e8e22c095 Fixed a layout bug introduced at c97ec304. 2014-06-13 23:05:04 +09:00
Yusuke Shinyama 340387bfc6 Cleanup: isinstance 2014-03-28 17:50:59 +09:00
Yusuke Shinyama c97ec3048e Changed / to // for clarity. 2013-11-26 21:35:16 +09:00
Yusuke Shinyama acad011e3f Code cleanup. 2013-11-11 20:46:30 +09:00
Yusuke Shinyama cbef967fbf Renamed: LTAnon -> LTAnno 2013-11-11 19:17:45 +09:00
Yusuke Shinyama c8b6d4112a Fixed: crash with negative layout bbox. 2013-11-09 15:10:14 +09:00
Yusuke Shinyama 2b56b2eedf Merged. 2013-11-07 19:50:41 +09:00
Matthew Duggan 2caa5edc25 PEP8: Whitespace changes to match pep8 2013-11-07 17:35:04 +09:00
Matthew Duggan c1da8b835c PEP8: Remove trailing whitespace 2013-11-07 16:14:53 +09:00
Matthew Duggan 10a68c83bd Remove unused imports identified by pyflakes 2013-11-07 16:09:44 +09:00
Yusuke Shinyama 4ef81ae9d8 Improved word spacing. 2013-11-05 18:25:19 +09:00
Yusuke Shinyama e927bd307e fixed: https://github.com/euske/pdfminer/issues/8 2013-10-22 18:24:39 +09:00
Yusuke Shinyama 0ea08890d4 renamed: python2 -> python. 2013-10-17 23:05:27 +09:00
Yusuke Shinyama eabe72ee63 Prevent crash with empty layout box. 2013-10-09 22:13:22 +09:00
jcushman f77f196cd3 2x faster group_textboxes function. 2012-06-22 18:11:45 -03:00
Yusuke Shinyama f638784e1d experimental layout analysis improvements 2011-08-14 09:44:21 +09:00
Yusuke Shinyama c134596e2f code cleanup and testcase stabilization 2011-05-15 01:22:19 +09:00
Yusuke Shinyama e5d02f8653 fixed the infinite recursion bug. 2011-05-14 16:32:09 +09:00
Yusuke Shinyama 0c41b8348e code cleanup 2011-05-14 15:51:40 +09:00
Yusuke Shinyama 038ce4cd0c added LTText.get_text() and .text property is no longer accessible. 2011-05-14 15:45:08 +09:00
Yusuke Shinyama 5004e4b28d layout analysis speedup. 2011-05-14 14:17:39 +09:00
Yusuke Shinyama 8f9684f6a6 code cleanup: layout analysis 2011-04-21 22:07:04 +09:00
Yusuke Shinyama 0e660dd385 rename: LTPolygon -> LTCurve 2011-04-20 22:05:25 +09:00
Yusuke Shinyama bb26cf9180 eliminate empty textboxes 2011-03-01 20:47:20 +09:00
Yusuke Shinyama a8bf9b159e docstring fix 2011-02-27 13:09:12 +09:00
Yusuke Shinyama cabaa10e4f layout analysis improvement 2011-02-27 12:56:28 +09:00
Yusuke Shinyama f00f1dbd04 better layout analysis 2011-02-14 23:41:23 +09:00
Yusuke Shinyama cd412308bd text flow detection bug fix (thanks to fujimoto-san) 2011-02-14 22:32:55 +09:00
Yusuke Shinyama cbd58121e3 fix aggressive vertical writing detection (which ruins layout) 2011-02-02 23:09:34 +09:00
Yusuke Shinyama a24c452ba2 boxes_flow patch by Daniel Gerber 2010-12-26 17:26:39 +09:00
Yusuke Shinyama 3da3adad9b method renamed: finish(self) -> analyze(self, laparams). 2010-12-26 16:56:21 +09:00