Instead of list comprehension which will call a function to get the integer value of the bytes directly convert it to bytearray which is more optimal structure for storing list of bytes.
In the PDFStream it's possible that the /Type element is not
present, but /type is. According to the spec, these are different
elements, but in the case in point they had the same meaning.
If PDFMiner is not running in STRICT mode and /Type doesn't resolve,
a fallback to /type is used to determine the tree type.
Fix errors with:
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 850, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 860, in render_contents
self.init_resources(resources)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 360, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 210, in get_font
font = self.get_font(None, subspec)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 201, in get_font
font = PDFCIDFont(self, spec)
File "/app/python/lib/python3.5/site-packages/pdfminer/pdffont.py", line 667, in __init__
BytesIO(self.fontfile.get_data()))
File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 297, in get_data
self.decode()
File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 278, in decode
if 'Predictor' in params:
TypeError: argument of type 'NoneType' is not iterable
* utils.decode_text: fix "TypeError: ord() expected string of length 1, but int found"
fixes https://github.com/goulu/pdfminer/issues/24
* pdfinterp.execute: don't assume that every keyword name can be decoded as utf-8
fixes "'str' does not support the buffer interface", https://github.com/goulu/pdfminer/issues/23
* default settings.STRICT to False, for compatibility with the original pdfminer
* PDFCIDFont: handle font registry/orderings that may be PDFObjRefs
* utils.nunpack: handle 8-byte integers
* added color support to stroking and non stroking color spaces
* extended LTCurve, LTLine and LTRect to save painting information
* modified PDFLayoutAnalyzer to populate the shapes with painting information
* Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19.
* Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env
Removed also the shebang from pdfminer/psparser.py file.
* Run `make cmap` and `git add pdfminer/cmap`.
* Modify MANIFEST.in not to include cmaprsrc dir in the sdist package.
* Add pdfminer/cmap/README.txt to include license in the sdist package.
* Remove installation guide specific to CJK languages from README.md.
As stated in the PDF specification ISO 32000-1, table in Annex D.2 Latin Character Set and Encodings page 653 to 656 (available here: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf):
"The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE."
The duplicate key was missing, therefore PDFMiner was returning the string "(cid:160)".
This fix adds the duplicate key in latin_enc.py
glyphlist.py does not need to be modified as it already contains a key for non breaking space https://github.com/lucanaso/pdfminer/blob/master/pdfminer/glyphlist.py#L2755.
Sorry, changes should have been more atomic.
*In pdf2txt.py:*
* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.
*In utils:*
* Added a few compatibility functions (some string hax required chardet, new dependency):
- make_compat_bytes(in_str)-> (py3->bytes | py2->str)
- make_compat_str(in_str)-> (str)
- compatible_encode_method(bytesorstring, encoding, erraction)-> (str)
*In pdfdevice:*
* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
as well as some six.PYX checks and logic. These changes are largely responsible for
enhanced Py2/Py3 consistency.
*In converter:*
* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
This commit finds horizontal neighbors in a horizonal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.
For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
1.
When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart. However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.
2.
Detect a horizontal line if the line is zero width. This improves our detection of horizonal lines when looking for both horizontal and vertical.
3.
Don't detect a vertical line if the previous letter is whitspace. Prevents double spaces being caught as vert lines.
4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.