Ignore path constructors that do not begin with m (#749)
* Ignore path constructors that do not begin with m

Per PDF Reference Section 4.4.1, "path construction operators may be invoked in any sequence, but the first one invoked must be m or re to begin a new subpath." Since pdfminer.six already converts all `re` (rectangle) operators to their equivalent `mlllh` representation, paths ingested by `.paint_path(...)` that do not begin with the `m` operator are invalid.

In addition to the advantage of hewing to the PDF Reference, this change also avoids the `ValueError: not enough values to unpack (expected 2, got 1)` error raised by the `pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]` line in `converter.py` when parsing PDFs that (erroneously) include `("h",)` paths.

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
parent e19aea932d
commit f2c967f500
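The `ValueError` described in the commit message can be reproduced in isolation. The sketch below is not pdfminer.six code: `apply_matrix_pt` is a simplified stand-in written for this example, and `transform_points` is a hypothetical helper that assumes points are gathered by taking the last two values of each operator, with `h` borrowing the values of the first operator in the path.

```python
# Simplified stand-in for a matrix-times-point helper (an assumption for
# illustration, not the library's implementation).
def apply_matrix_pt(m, v):
    a, b, c, d, e, f = m
    x, y = v  # fails when v does not hold exactly two values
    return a * x + c * y + e, b * x + d * y + f


IDENTITY = (1, 0, 0, 1, 0, 0)


def transform_points(path, ctm=IDENTITY):
    """Hypothetical helper: collect one point per operator and transform it."""
    # 'h' carries no coordinates, so it reuses the path's first operator,
    # which is expected to be an 'm' with two coordinates.
    raw_pts = [p[-2:] if p[0] != "h" else path[0][-2:] for p in path]
    return [apply_matrix_pt(ctm, pt) for pt in raw_pts]


# A well-formed path starts with 'm', so 'h' finds a real starting point.
valid = [("m", 10.0, 20.0), ("l", 30.0, 20.0), ("h",)]
print(transform_points(valid))

# A path consisting only of 'h' has no starting point to borrow: the slice
# yields ("h",), and unpacking it into (x, y) raises ValueError.
try:
    transform_points([("h",)])
except ValueError as exc:
    print(f"ValueError: {exc}")
```

Under these assumptions, rejecting paths that do not start with `m` before any points are collected removes the failure at its source instead of catching the exception afterwards.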
@@ -8,6 +8,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

### Fixed

- Ignoring (invalid) path constructors that do not begin with `m` ([#749](https://github.com/pdfminer/pdfminer.six/pull/749))

## [20220506]

### Fixed

- `IndexError` when handling invalid bfrange code map in
  CMap ([#731](https://github.com/pdfminer/pdfminer.six/pull/731))
- `TypeError` in lzw.py when `self.table` is not set ([#732](https://github.com/pdfminer/pdfminer.six/pull/732))

@@ -109,7 +109,16 @@ class PDFLayoutAnalyzer(PDFTextDevice):
         """Paint paths described in section 4.4 of the PDF reference manual"""
         shape = "".join(x[0] for x in path)
 
-        if shape.count("m") > 1:
+        if shape[:1] != "m":
+            # Per PDF Reference Section 4.4.1, "path construction operators may
+            # be invoked in any sequence, but the first one invoked must be m
+            # or re to begin a new subpath." Since pdfminer.six already
+            # converts all `re` (rectangle) operators to their equivalent
+            # `mlllh` representation, paths ingested by `.paint_path(...)` that
+            # do not begin with the `m` operator are invalid.
+            pass
+
+        elif shape.count("m") > 1:
             # recurse if there are multiple m's in this shape
             for m in re.finditer(r"m[^m]+", shape):
                 subpath = path[m.start(0) : m.end(0)]

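The guard and the subpath split in the hunk above can be exercised on their own. This is a minimal sketch rather than the library's code: `split_subpaths` is a hypothetical helper that combines the `shape[:1] != "m"` check with the `re.finditer(r"m[^m]+", shape)` loop, and it assumes every operator name is a single character so that indices into `shape` line up with indices into `path`.

```python
import re


def split_subpaths(path):
    """Hypothetical helper: return the subpaths of `path`, or [] if invalid."""
    # One character per operator, e.g. "mlh" for moveto, lineto, closepath.
    shape = "".join(op[0] for op in path)

    # Slicing with [:1] instead of indexing with [0] also covers an empty
    # path: "" != "m", so it is rejected without raising IndexError.
    if shape[:1] != "m":
        return []

    # Each match is an 'm' followed by at least one non-'m' operator; match
    # offsets in `shape` are also valid offsets in `path`.
    return [path[m.start(0) : m.end(0)] for m in re.finditer(r"m[^m]+", shape)]


two_subpaths = [
    ("m", 0, 0), ("l", 1, 0), ("h",),
    ("m", 5, 5), ("l", 6, 5),
]
print(split_subpaths(two_subpaths))   # two subpaths, each starting with 'm'
print(split_subpaths([("h",)]))       # rejected: does not begin with 'm'
print(split_subpaths([("l", 1, 1)]))  # rejected: does not begin with 'm'
```

Unlike the patched method, which only takes the `re.finditer` branch when `shape` contains more than one `m`, this helper runs the split for every valid path. That keeps the example short but is a simplification.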
@@ -215,6 +215,15 @@ class TestPaintPath:
             (71.41, 434.89),
         ]
 
+    def test_paint_path_without_starting_m(self):
+        gs = PDFGraphicState()
+        analyzer = self._get_analyzer()
+        analyzer.cur_item = LTContainer([0, 100, 0, 100])
+        paths = [[("h",)], [("l", 72.41, 433.89), ("l", 82.41, 433.89), ("h",)]]
+        for path in paths:
+            analyzer.paint_path(gs, False, False, False, path)
+        assert len(analyzer.cur_item._objs) == 0
+
 
 class TestBinaryDetector:
     def test_stringio(self):