Ignore path constructors that do not begin with m (#749)
* Ignore path constructors that do not begin with m

Per PDF Reference Section 4.4.1, "path construction operators may be invoked in any sequence, but the first one invoked must be m or re to begin a new subpath." Since pdfminer.six already converts all `re` (rectangle) operators to their equivalent `mlllh` representation, paths ingested by `.paint_path(...)` that do not begin with the `m` operator are invalid.

In addition to the advantage of hewing to the PDF Reference, this change also avoids the `ValueError: not enough values to unpack (expected 2, got 1)` error raised by the `pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]` line in `converter.py` when parsing PDFs that (erroneously) include `("h",)` paths.

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pull/755/head
parent
e19aea932d
commit
f2c967f500
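To make the rule in the commit message concrete, here is a minimal, self-contained sketch of the classification that the new guard performs. It is not the code from `converter.py`: the helper name `split_subpaths` is hypothetical, and it returns the subpaths that would be kept instead of painting them. It assumes, as the commit message states, that `re` operators have already been expanded to `mlllh` before this point.

```python
import re


def split_subpaths(path):
    """Return the subpaths a guard like the one in this commit would keep.

    `path` is a list of tuples whose first element is a one-letter
    path construction operator, e.g. ("m", 0, 0) or ("h",).
    `split_subpaths` is a hypothetical helper used for illustration only.
    """
    # One character per operator, e.g. "mlllh" for an expanded rectangle.
    shape = "".join(op[0] for op in path)

    if shape[:1] != "m":
        # Invalid path: the first operator must be `m`
        # (`re` is assumed to be already converted to `mlllh`).
        return []

    if shape.count("m") > 1:
        # Several `m` operators: one subpath per `m`.
        return [
            path[match.start(0):match.end(0)]
            for match in re.finditer(r"m[^m]+", shape)
        ]

    return [path]
```

With this sketch, both `[("h",)]` and a path that starts with `l` produce an empty list, while a path such as `m l m l h` is split into two subpaths, one per `m`.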
@@ -8,6 +8,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 ### Fixed
 
+- Ignoring (invalid) path constructors that do not begin with `m` ([#749](https://github.com/pdfminer/pdfminer.six/pull/749))
+
+## [20220506]
+
+### Fixed
+
 - `IndexError` when handling invalid bfrange code map in
 CMap ([#731](https://github.com/pdfminer/pdfminer.six/pull/731))
 - `TypeError` in lzw.py when `self.table` is not set ([#732](https://github.com/pdfminer/pdfminer.six/pull/732))
@@ -109,7 +109,16 @@ class PDFLayoutAnalyzer(PDFTextDevice):
         """Paint paths described in section 4.4 of the PDF reference manual"""
         shape = "".join(x[0] for x in path)
 
-        if shape.count("m") > 1:
+        if shape[:1] != "m":
+            # Per PDF Reference Section 4.4.1, "path construction operators may
+            # be invoked in any sequence, but the first one invoked must be m
+            # or re to begin a new subpath." Since pdfminer.six already
+            # converts all `re` (rectangle) operators to their equivalent
+            # `mlllh` representation, paths ingested by `.paint_path(...)` that
+            # do not begin with the `m` operator are invalid.
+            pass
+
+        elif shape.count("m") > 1:
             # recurse if there are multiple m's in this shape
             for m in re.finditer(r"m[^m]+", shape):
                 subpath = path[m.start(0) : m.end(0)]
@@ -215,6 +215,15 @@ class TestPaintPath:
             (71.41, 434.89),
         ]
 
+    def test_paint_path_without_starting_m(self):
+        gs = PDFGraphicState()
+        analyzer = self._get_analyzer()
+        analyzer.cur_item = LTContainer([0, 100, 0, 100])
+        paths = [[("h",)], [("l", 72.41, 433.89), ("l", 82.41, 433.89), ("h",)]]
+        for path in paths:
+            analyzer.paint_path(gs, False, False, False, path)
+        assert len(analyzer.cur_item._objs) == 0
+
 
 class TestBinaryDetector:
     def test_stringio(self):
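The `ValueError` quoted in the commit message can be reproduced in isolation. The sketch below is an illustration under stated assumptions, not the library's code: `apply_matrix_pt_sketch` is a stand-in for the real `apply_matrix_pt`, and the way `raw_pts` is built (last two operands of each operator, with `h` reusing the operands of the first operator) is an assumption that this diff does not show.

```python
def apply_matrix_pt_sketch(matrix, point):
    """Stand-in for an affine point transform (hypothetical helper)."""
    a, b, c, d, e, f = matrix
    x, y = point  # fails if `point` does not hold exactly two values
    return a * x + c * y + e, b * x + d * y + f


IDENTITY = (1, 0, 0, 1, 0, 0)


def transform(path):
    # Assumption: each operator contributes its last two operands, and
    # `h` (close path) falls back to the operands of the first operator.
    raw_pts = [op[-2:] if op[0] != "h" else path[0][-2:] for op in path]
    return [apply_matrix_pt_sketch(IDENTITY, pt) for pt in raw_pts]


# A well-formed path transforms cleanly.
valid_points = transform([("m", 0, 0), ("l", 1, 0), ("h",)])

# A path consisting only of `h` has no coordinates to fall back on.
try:
    transform([("h",)])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Under these assumptions, the lone `("h",)` tuple is passed to the transform as a one-element "point", which is why unpacking it into `x, y` fails. Rejecting paths that do not start with `m` avoids reaching that line at all.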