Fix bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed (#616)

* detect TextIOWrapper as non-binary
* I don't understand the CHANGELOG.md format, hope this is good enough
* Delete \\\r\n in Literal Strings (ref. section 7.3.4.2 of PDF32000_2008)
* Keep Travis CI happy
* Added test
* Remove pdfminer/Changelog
* Prettify _parse_string_1
* Add CHANGELOG.md
* Satisfy flake8
* Update CHANGELOG.md
* Use logging.Logger.warning instead of warning.warn in most cases, following the Python official guidance that warning.warn is directed at _developers_, not users
* (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning, PDFNoValidXRefWarning
* (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning
* (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning
* (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather than PDFNoValidXRefWarning
* get name right
* make flake8 happy
* Revert "make flake8 happy"
  This reverts commit pull/661/head^24592769686.
* Revert "get name right"
  This reverts commit 80091ea211.
* Revert "Use logging.Logger.warning instead of warning.warn in most cases, following"
  This reverts commit 3c1e3d6606.
* Revert "Merge branch 'preferLoggingToWarning' into hst"
  This reverts commit 9d9d139921, reversing changes made to 80091ea211.
* Revert "Revert "Merge branch 'preferLoggingToWarning' into hst""
  This reverts commit b3da21934d.

Co-authored-by: Henry S. Thompson <ht@home.hst.name>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
parent c3e3499a6b
commit 33d7dde4d1
@@ -20,6 +20,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - Fix `.paint_path` logic for handling single line segments and extracting point-on-curve positions of Beziér path commands ([#530](https://github.com/pdfminer/pdfminer.six/pull/530))
 - Raising `UnboundLocalError` when a bad `--output-type` is used ([#610](https://github.com/pdfminer/pdfminer.six/pull/610))
 - `TypeError` when using `TagExtractor` with non-string or non-bytes tag values ([#610](https://github.com/pdfminer/pdfminer.six/pull/610))
+- Using `io.TextIOBase` as the file to write to ([#616](https://github.com/pdfminer/pdfminer.six/pull/616))
+- Parsing \r\n after the escape character in a literal string ([#616](https://github.com/pdfminer/pdfminer.six/pull/616))
 
 ## Removed
 - Support for Python 3.4 and 3.5 ([#522](https://github.com/pdfminer/pdfminer.six/pull/522))
@@ -181,6 +181,8 @@ class PDFConverter(PDFLayoutAnalyzer):
             return True
         elif isinstance(outfp, io.StringIO):
             return False
+        elif isinstance(outfp, io.TextIOBase):
+            return False
 
         return True
 
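Note on the hunk above: a minimal sketch, using only the standard library, of why an `io.StringIO` check alone does not catch the `io.TextIOWrapper` named in the commit title, while an `io.TextIOBase` check does. The helper `is_text_stream` is hypothetical and exists only for illustration; it is not part of pdfminer.six.

```python
import io


def is_text_stream(outfp):
    """Hypothetical helper: True for any stream that counts as io.TextIOBase."""
    return isinstance(outfp, io.TextIOBase)


# A text wrapper around an in-memory byte buffer.
wrapper = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

# A StringIO-only check misses the wrapper ...
print(isinstance(wrapper, io.StringIO))
# ... while the TextIOBase check covers both text stream types.
print(is_text_stream(wrapper))
print(is_text_stream(io.StringIO()))
# Byte streams are not text streams.
print(is_text_stream(io.BytesIO()))
```

Because `io.StringIO` also counts as an `io.TextIOBase`, the new branch is a superset of the existing `io.StringIO` branch.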
@@ -444,16 +444,29 @@ class PSBaseParser:
         return j+1
 
     def _parse_string_1(self, s, i):
+        """Parse literal strings
+
+        PDF Reference 3.2.3
+        """
         c = s[i:i+1]
         if OCT_STRING.match(c) and len(self.oct) < 3:
             self.oct += c
             return i+1
-        if self.oct:
+
+        elif self.oct:
             self._curtoken += bytes((int(self.oct, 8),))
             self._parse1 = self._parse_string
             return i
-        if c in ESC_STRING:
+
+        elif c in ESC_STRING:
             self._curtoken += bytes((ESC_STRING[c],))
+
+        elif c == b'\r' and len(s) > i+1 and s[i+1:i+2] == b'\n':
+            # If current and next character is \r\n skip both because enters
+            # after a \ are ignored
+            i += 1
+
+        # default action
         self._parse1 = self._parse_string
         return i+1
 
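Note on the hunk above: a standalone sketch of the rule the commit message cites (section 7.3.4.2 of PDF32000_2008), under which a backslash followed by an end-of-line marker inside a literal string is dropped together with that marker. The function `drop_line_continuations` is hypothetical and is not the parser's state machine; it leaves every other escape sequence untouched and only illustrates the `\r\n` case next to the single-character cases.

```python
def drop_line_continuations(data: bytes) -> bytes:
    """Hypothetical helper: remove backslash + end-of-line pairs from a literal string body."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i:i+1]
        if c != b"\\":
            out += c
            i += 1
            continue
        nxt = data[i+1:i+2]
        if nxt == b"\r" and data[i+2:i+3] == b"\n":
            i += 3  # backslash + CR + LF: emit nothing
        elif nxt in (b"\r", b"\n"):
            i += 2  # backslash + single end-of-line character: emit nothing
        else:
            out += c + nxt  # any other escape sequence is kept as-is
            i += 2
    return bytes(out)


print(drop_line_continuations(b"These \\\r\ntwo lines"))
```

Slicing (`data[i:i+1]`) rather than indexing keeps every comparison in `bytes`, the same idiom the hunk uses with `s[i:i+1]`.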
@@ -207,3 +207,6 @@ class TestBinaryDetector():
 
     def test_non_file_like_object_defaults_to_binary(self):
         assert_true(PDFConverter._is_binary_stream(object()))
+
+    def test_textiowrapper(self):
+        assert_false(PDFConverter._is_binary_stream(io.TextIOBase()))