* Fix an error when dumping a TOC
* Fix a bug that a TOC title variable is a bytes type
* Update CHANGELOG.md
* Update CHANGELOG.md
* Rename e() to escape() and merge two isinstance() checks
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #566
* Try to fix an issue where some Chinese characters cannot be extracted
correctly (#566).
* format code to pass flake8 check.
* fix typo and refer to issue 593.
Co-authored-by: huan_cheng <huan_cheng@bestsign.cn>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix typos and possible mistakes.
* Revert two edits based on discussion in #579
Revert the two changes based on our discussion.
I read the documentation and had a glimpse at the default code. And perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers.
Maybe it is useful to point that out in the figure but I am not sure how best to do it.
Another option is to use a name like `min_char_margin_threshold` or similar, in the hope that it is easier to understand. Just some thoughts!
* Triggering travis again
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Added support for Paeth PNG filter compression (predictor value = 4)
* Use `above` and `upper_left` as in the pseudo code
* Refactor: use variable names that are very close to the pseudo code and add pieces of the docs to show what is going on.
* Fix line length issues
* Add line about compressions to README.md
* Fix merge conflict on readme
* Fix bug in filter type Up
* Make if-else consistent
Co-authored-by: Eduardo Gonzalez Lopez de Murillas <eduardo.gonzalez@accha.nl>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
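For reference, the Paeth predictor named above can be sketched as follows. This is a minimal illustration that follows the pseudo code in the PNG specification, not the code from this change; the function names `paeth_predictor` and `unfilter_paeth_row` are hypothetical.

```python
def paeth_predictor(left: int, above: int, upper_left: int) -> int:
    """Return the neighbour closest to the estimate left + above - upper_left."""
    p = left + above - upper_left
    pa = abs(p - left)
    pb = abs(p - above)
    pc = abs(p - upper_left)
    # Ties are broken in the order left, above, upper_left.
    if pa <= pb and pa <= pc:
        return left
    elif pb <= pc:
        return above
    else:
        return upper_left


def unfilter_paeth_row(filtered: bytes, previous: bytes, bpp: int) -> bytes:
    """Undo the Paeth filter for one row; bpp is the number of bytes per pixel."""
    out = bytearray()
    for i, value in enumerate(filtered):
        # Bytes to the left of the first pixel are treated as zero.
        left = out[i - bpp] if i >= bpp else 0
        above = previous[i]
        upper_left = previous[i - bpp] if i >= bpp else 0
        out.append((value + paeth_predictor(left, above, upper_left)) & 0xFF)
    return bytes(out)
```

The sketch assumes `previous` is the already decoded row above (all zeros for the first row) and has the same length as `filtered`.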
* Fix for when trailer is indented
* Store stripped line
* This commit breaks things...
* Or maybe this one breaks things?
* Remove commented code because no longer used.
* Add CHANGELOG.md
* Add poetry venv management files to gitignore since I started using poetry to manage the python envs for this project
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix .paint_path handling of single line segments
- Fixes typo ("ml" should have been "mlh")
- Removes if-statement that required individual line segments to be
strictly horizontal or vertical.
* Treat 'ml'-shape paths as lines not curves
Although 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.
Adds tests and sample PDF.
* Fix trailing whitespace
* Fix point-extraction from Bézier path commands
This commit corrects the manner in which "pts" are extracted from Bézier
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on the command (c, v, or y),
the code was extracting some combination of control points (not on
curve) and the actual points-on-curve.
This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.
* Add comments to test_paint_path_quadrilaterals
* Parse rect-forming mllll paths as rects not curves
Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.
* One changelog line with ref to mr
* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring
* Extract variables from if statement to make it easier to read
* Optimize imports order
* Trigger travis build
* Revert "Trigger travis build"
This reverts commit 41c05184
* Update travis badge
* Update travis badge
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
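The point-extraction fix described above can be illustrated with a small sketch. It assumes each path command is a tuple of an operator followed by its operands, and relies on the fact that for `m`, `l`, `c`, `v` and `y` the last two operands are the point that lies on the path (for the Bézier operators, the preceding operands are control points). The function name `points_on_path` is hypothetical, and the sketch handles only a single subpath.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def points_on_path(path: Sequence[tuple]) -> List[Point]:
    """Collect the on-path point of every command in a single subpath."""
    pts: List[Point] = []
    for command in path:
        op, *operands = command
        if op == "h":
            # Closing the path returns to the starting point of the subpath.
            pts.append(pts[0])
        else:
            # m, l, c, v, y: the final operand pair is the point on the path;
            # any earlier operands of c, v and y are off-curve control points.
            pts.append((operands[-2], operands[-1]))
    return pts


example_path = [
    ("m", 0, 0),
    ("c", 1, 1, 2, 2, 3, 0),
    ("v", 4, 1, 5, 0),
    ("y", 6, 1, 7, 0),
    ("h",),
]
example_points = points_on_path(example_path)
```

Applying a transformation matrix to the collected points in one place, as the refactoring describes, would then be a single pass over the returned list.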
Closes #469
* Issue #469 is fixed
* one extra comment to code is added
* TemporaryFilePath context manager is added to facilitate tests
* flake8 complaints fixed
* Update docs of tempfilepath.py
* Fix flake8
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
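A context manager of the kind mentioned above could look roughly like this. It is a sketch under assumptions, not the helper added in this change: the class name `TemporaryFilePathSketch` and its `suffix` parameter are hypothetical, and the real implementation may create and clean up the file differently.

```python
import os
import tempfile


class TemporaryFilePathSketch:
    """Yield the path of a temporary file and delete the file on exit."""

    def __init__(self, suffix=None):
        self.suffix = suffix
        self.path = None

    def __enter__(self) -> str:
        # mkstemp returns an open descriptor; close it so the test can
        # reopen the path freely (this matters on Windows).
        fd, self.path = tempfile.mkstemp(suffix=self.suffix)
        os.close(fd)
        return self.path

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        if os.path.exists(self.path):
            os.remove(self.path)
        return False  # never swallow exceptions raised inside the block


with TemporaryFilePathSketch(suffix=".txt") as tmp_path:
    with open(tmp_path, "w") as handle:
        handle.write("output under test")
    existed_inside = os.path.exists(tmp_path)
exists_after = os.path.exists(tmp_path)
```

Handing out a path rather than an open file lets a test pass the location to a command-line tool that opens the file itself.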
Closes #191
* Remove support for non-standard output streams that are not binary by removing the try-except check that writes a unicode character to the stream
* Add docstring
* Fix flake8
Closes #518
* Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2'
An error occurs when the 'DW2' key contains a PDFObjRef object instead of a list of int values, e.g.: 'DW2': <PDFObjRef:152>.
To solve this issue, we utilise the resolve1() function
See: https://github.com/pdfminer/pdfminer.six/issues/518
* Updated CHANGELOG
* Update CHANGELOG.md
Co-authored-by: Dimitrios TSOLAKIDIS <dimitrios.tsolakidis@vialink.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
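The failure mode and the fix can be illustrated without the library. In the sketch below, `FakeObjRef` and `resolve_once` are hypothetical stand-ins for an indirect object reference and for a resolving helper such as resolve1(); the values 880 and -1000 are used purely as sample data.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def resolve_once(obj):
    """Follow indirect references until a direct object is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


spec = {"DW2": FakeObjRef([880, -1000])}

# Unpacking the reference itself fails, because it is not iterable.
try:
    vy, w = spec["DW2"]
    unpack_failed = False
except TypeError:
    unpack_failed = True

# Resolving first yields the underlying list, which unpacks as expected.
vy, w = resolve_once(spec["DW2"])
```

Resolving is harmless when the value is already a direct list, so the same call covers both cases.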
* Fix for when 'trailer' is indented
Closes #214
* Address CR comments - strip line after parsing
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
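The idea behind the indented-trailer fix is to compare against the stripped line instead of the raw one. A minimal sketch, assuming lines are read as bytes; the helper name `is_trailer_line` is hypothetical and not part of the parser:

```python
def is_trailer_line(line: bytes) -> bool:
    """Return True if the line starts the trailer, ignoring indentation."""
    # Strip once and test the stripped value, so leading spaces or tabs
    # and the trailing newline do not hide the keyword.
    stripped = line.strip()
    return stripped.startswith(b"trailer")
```

Storing the stripped line, as one of the commits above does, also means later parsing steps see the same normalised value.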
* Remove unused sortedcontainers package
* Fix changelog format
* Fix a link to the PR
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix paint_path bug noted in issue #473
Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.
Also adds a test for cases presented in issue #473
* Tweak paint_path fix per @pietermarsman review
- Adjusts logic to adhere to if-elif-else rather than early returns.
- Shortens subpath detection/reprocessing step, using re.finditer().
* Reorder paint_path() if-else statements once more
* Fix flake8 issues
* Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on.
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
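The subpath decomposition with re.finditer() can be sketched as follows. This is an illustration under assumptions rather than the code in paint_path: it assumes the path is a list of command tuples whose first element is the operator, and the function name `split_subpaths` is hypothetical.

```python
import re
from typing import List, Sequence


def split_subpaths(path: Sequence[tuple]) -> List[list]:
    """Split a path into subpaths, each beginning at an 'm' operator."""
    # The "shape" is the string of operators, e.g. "mlllhmlh".
    shape = "".join(command[0] for command in path)
    # Each match runs from one 'm' up to, but not including, the next 'm'.
    return [
        list(path[match.start():match.end()])
        for match in re.finditer(r"m[^m]+", shape)
    ]


example = [
    ("m", 0, 0), ("l", 1, 0), ("l", 1, 1), ("l", 0, 1), ("h",),
    ("m", 5, 5), ("l", 6, 6), ("h",),
]
subpaths = split_subpaths(example)
subpath_shapes = ["".join(c[0] for c in sub) for sub in subpaths]
```

Each subpath can then be classified on its own, so a rectangle followed by a line segment is no longer reported as one curve.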
* Fix not being able to pass boxes flow as None to pdf2txt
* Changes from code review
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* open_filename accepts a pathlib.PurePath object
* Add test for open_filename with pathlib
* Fix a wrong function name
* Cast a pathlib object to string for py3.4/3.5
* Add link to the PR
* Raise an exception when open_filename gets an unsupported type
* Add tests for open_filename
* Update CHANGELOG.md
* Documentation
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
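The behaviour described in these commits can be sketched as a small context manager. The class below is hypothetical (`OpenFilenameSketch` is not the library's open_filename); in particular, passing file-like objects through unchanged is an assumption made for the example.

```python
import io
import pathlib


class OpenFilenameSketch:
    """Open a str or pathlib path; pass file-like objects through."""

    def __init__(self, filename, *args, **kwargs):
        if isinstance(filename, pathlib.PurePath):
            # Cast to str so that older Python versions can open it.
            filename = str(filename)
        if isinstance(filename, str):
            self.file_handler = open(filename, *args, **kwargs)
            self.closing = True
        elif hasattr(filename, "read"):
            # Assumption: an already open file-like object is used as-is.
            self.file_handler = filename
            self.closing = False
        else:
            raise TypeError("Unsupported input type: %s" % type(filename))

    def __enter__(self):
        return self.file_handler

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Only close handles that this object opened itself.
        if self.closing:
            self.file_handler.close()
        return False


buffer = io.BytesIO(b"%PDF-1.4")
with OpenFilenameSketch(buffer) as fp:
    header = fp.read(4)
```

Raising on unsupported types, instead of failing later with an unrelated error, makes a wrong argument easier to diagnose.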
* Updated high_level.py
This commit enables caching to be turned on and off rather than be always on regardless of the user input.
* Reverted params back to fix errors
* Updated CHANGELOG.md to reflect quick fix
* Update CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Restore PDFTextExtractionNotAllowed
Restore PDFTextExtractionNotAllowed exception class as an alias of the
new PDFTextExtractionNotAllowedError exception that was introduced in
6a9269b432
Removing PDFTextExtractionNotAllowed is an API breakage that made
several tools break.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* Use PDFTextExtractionNotAllowed and prepare PDFTextExtractionNotAllowedError to be removed in the future
* Add line to CHANGELOG.md
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
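Keeping both names importable can be done with a plain alias. The sketch below is illustrative only: it uses `Exception` as the base class and a simple assignment, whereas the actual classes may have a different base class and may emit a deprecation warning.

```python
class PDFTextExtractionNotAllowed(Exception):
    """Raised when a document does not permit text extraction."""


# Alias kept so that code written against the newer name keeps working
# until that name is removed.
PDFTextExtractionNotAllowedError = PDFTextExtractionNotAllowed


def caught_by_old_name() -> bool:
    """Show that raising via one name is caught via the other."""
    try:
        raise PDFTextExtractionNotAllowedError("extraction is not allowed")
    except PDFTextExtractionNotAllowed:
        return True
```

Because both names refer to the same class, existing `except` clauses keep working regardless of which name the library raises.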
* Try to get the cmap from a pickle file, and clean up a bit.
* Don't use keyword argument for dict.get
* Add docs
* Make _get_cmap_name static
* Add test
* Add CHANGELOG.md
* Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there
* Add CJK characters to expected output of simple3.pdf
* Fix line length
* Add comment
* swap pycryptodome for the faster, smaller, and industry-standard cryptography.io
* update changelog
* fix lint
* Update CHANGELOG.md
* from MR, unneeded ex and naming
* add samples to nosetests
* fix lint
* show mismatch
* fix lint
* typo and newline
* Revert "add samples to nosetests"
This reverts commit a49ca302
* Add tests for encrypted documents to nose test suite
* Optimize imports of pdfdocument.py
Co-authored-by: Oren Tysor <oren@atakama.com>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Fix converting path to multiple rectangles
For a path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.
Fixes #369
* Reduce pdf size by removing font
* Add unittest for PDFLayoutAnalyzer.paint_path()
* Add line to CHANGELOG.md
* Add reference to pdf reference manual
* Cleanup function paint_path a bit
* Reduce line length of tests
* Reduce line length of tests
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* Changed error to warning for 'Text extraction is not allowed'
* updated changelog
* fix lint
* made changes suggested in review
* Update CHANGELOG.md
* Add regression test for failing pdf
* Reduce line length to <80
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Fixes #176
* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's
* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref
* Use warning instead of error, because not outputting xrefs is just fine (there aren't any) but it is something the user should know
* Adding changelog
* Extend help message