Fix errors with:
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 850, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 860, in render_contents
self.init_resources(resources)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 360, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 210, in get_font
font = self.get_font(None, subspec)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 201, in get_font
font = PDFCIDFont(self, spec)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdffont.py", line 667, in __init__
BytesIO(self.fontfile.get_data()))
File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 297, in get_data
self.decode()
File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 278, in decode
if 'Predictor' in params:
TypeError: argument of type 'NoneType' is not iterable
* utils.decode_text: fix "TypeError: ord() expected string of length 1, but int found"
fixes https://github.com/goulu/pdfminer/issues/24
* pdfinterp.execute: don't assume that every keyword name can be decoded as utf-8
fixes "'str' does not support the buffer interface", https://github.com/goulu/pdfminer/issues/23
* default settings.STRICT to False, for compatibility with the original pdfminer
* PDFCIDFont: handle font registry/orderings that may be PDFObjRefs
* utils.nunpack: handle 8-byte integers
* added color support to stroking and non stroking color spaces
* extended LTCurve, LTLine and LTRect to save painting information
* modified PDFLayoutAnalyzer to populate the shapes with painting information
* Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19.
* Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env
Removed also the shebang from pdfminer/psparser.py file.
* Run `make cmap` and `git add pdfminer/cmap`.
* Modify MANIFEST.in not to include cmaprsrc dir in the sdist package.
* Add pdfminer/cmap/README.txt to include license in the sdist package.
* Remove installation guide specific to CJK languages from README.md.
As stated in the PDF specification ISO 32000-1, table in Annex D.2 Latin Character Set and Encodings page 653 to 656 (available here: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf):
"The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE."
The duplicate key was missing, therefore PDFMiner was returning the string "(cid:160)".
This fix adds the duplicate key in latin_enc.py
glyphlist.py does not need to be modified as it already contains a key for non breaking space https://github.com/lucanaso/pdfminer/blob/master/pdfminer/glyphlist.py#L2755.
Sorry, changes should have been more atomic.
*In pdf2txt.py:*
* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.
*In utils:*
* Added a few compatibility functions (some string hax required chardet, new dependency):
- make_compat_bytes(in_str)-> (py3->bytes | py2->str)
- make_compat_str(in_str)-> (str)
- compatible_encode_method(bytesorstring, encoding, erraction)-> (str)
*In pdfdevice:*
* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
as well as some six.PYX checks and logic. These changes are largely responsible for
enhanced Py2/Py3 consistency.
*In converter:*
* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
This commit finds horizontal neighbors in a horizonal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.
For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
1.
When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart. However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.
2.
Detect a horizontal line if the line is zero width. This improves our detection of horizonal lines when looking for both horizontal and vertical.
3.
Don't detect a vertical line if the previous letter is whitspace. Prevents double spaces being caught as vert lines.
4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.