#!/usr/bin/env python3
"""A command line tool for extracting text and images from PDF and
outputting them as plain text, HTML, XML or tags."""
import argparse
Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.
*In pdf2txt.py:*
* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.
*In utils:*
* Added a few compatibility functions (some string hax required chardet, new dependency):
- make_compat_bytes(in_str)-> (py3->bytes | py2->str)
- make_compat_str(in_str)-> (str)
- compatible_encode_method(bytesorstring, encoding, erraction)-> (str)
*In pdfdevice:*
* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
as well as some six.PYX checks and logic. These changes are largely responsible for
enhanced Py2/Py3 consistency.
*In converter:*
* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
import logging
import sys
pdf2txt: clean up construction of LAParams from arguments (#682)
* Fix pdf2txt --boxes-flow=disabled
Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
File "tools/pdf2txt.py", line 204, in <module>
sys.exit(main())
File "tools/pdf2txt.py", line 198, in main
outfp = extract_text(**vars(A))
File "tools/pdf2txt.py", line 66, in extract_text
pdfminer.high_level.extract_text_to_fp(fp, **locals())
File "pdfminer/high_level.py", line 85, in extract_text_to_fp
interpreter.process_page(page)
File "pdfminer/pdfinterp.py", line 896, in process_page
self.device.end_page(page)
File "pdfminer/converter.py", line 51, in end_page
self.cur_item.analyze(self.laparams)
File "pdfminer/layout.py", line 822, in analyze
group.analyze(laparams)
File "pdfminer/layout.py", line 575, in analyze
LTTextGroup.analyze(self, laparams)
File "pdfminer/layout.py", line 362, in analyze
obj.analyze(laparams)
File "pdfminer/layout.py", line 575, in analyze
LTTextGroup.analyze(self, laparams)
File "pdfminer/layout.py", line 362, in analyze
obj.analyze(laparams)
File "pdfminer/layout.py", line 575, in analyze
LTTextGroup.analyze(self, laparams)
File "pdfminer/layout.py", line 362, in analyze
obj.analyze(laparams)
File "pdfminer/layout.py", line 577, in analyze
self._objs.sort(
File "pdfminer/layout.py", line 578, in <lambda>
key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```
Related: Issue #477, PR #479
* update CHANGELOG
* merge CHANGELOG
* pdf2txt: clean up handling of layout parameter arguments
* avoid specifying default values twice
* construct LAParams earlier, rather than passing its components around
* fix crash with --boxes_flow=disabled
* update CHANGELOG
* construct new LAParams, so _validate runs
* Improve readability of setting LAParams by explicitly copying them from parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method.
* Add cli argument for line_overlap
* Also use default values from LAParams for --detect-vertical and --all-texts
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
from typing import Any, Container, Iterable, List, Optional
import pdfminer.high_level
Add type annotations (#661)
Squashed commit of the following:
commit fa229f7b7591c07aea4e5a4545f9e0c34246e1cd
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:33:06 2021 -0700
Merge branch 'develop' into mypy (and fixed types)
commit eaab3c65e2e3ab5f1f400cfc5186a3834c4ffe34
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:00:45 2021 -0700
reformat all multi-line function defs to one-arg-per-line
commit 3fe2b69eed9197009d9da6776462f580ebf0dfa3
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:48 2021 -0700
ccitt nit -- avoid casting needlessly
commit 15983d8c1e7162632fde43752c9d1c15938cd980
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:36 2021 -0700
tweak CHANGELOG
commit 13dc0babf782938e7d5b5e482d4c5adf92d82702
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:43:46 2021 -0700
add failing tests for dumppdf crash
commit 6b509c517876b8c15ac5a98a963884e23bd2e4d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:24:23 2021 -0700
ccitt: apply misc PR feedback
commit feb031ba86d3f22e41cfbbda13f17c039359f1e6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:18:26 2021 -0700
add missing None return type to all __init__ methods
commit c0d62d6c54c7ec37b40bea54a3f6a7a618ec0ec6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:13:08 2021 -0700
minor cleanup, remove a few more Any types
commit b52a0594e1998a492c172538a9b35491c5fc5f52
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 22:37:28 2021 -0700
tighten up types, avoid Any in favour of explicit casts
commit e58fd48bd14f31bebd2de8259f12630ac02756d6
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 14:10:49 2021 -0700
annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)
commit 605290633e55595e5e0045840df5c5b1d9de843a
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:37:38 2021 -0700
python 3.7 back-compat
commit 4dbcf8760f8a1d3e3d99f085476f86e6a043c80c
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:32:43 2021 -0700
annotate pdfminer.jbig2
commit 0d40b7c03a8028dc44acd3f457eac71abd681827
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:31:33 2021 -0700
annotate pdf2txt.py
commit 5f82eb4f5646b5d1285252689191e0a14557ec7b
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 09:16:31 2021 -0700
cleanup: make Plane generic
commit 624fc92b88473ff36a174760883f34c22109da2b
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:16:51 2021 -0700
bluntly ignore calls to cryptography.hazmat
commit 96b20439c169f40dbb114cabba6a582ad1ebe91e
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:01:06 2021 -0700
finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2
commit 0ab586347861b72b1d16880dc9293f9ad597e20a
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 21:51:56 2021 -0700
annotate pdffont
commit 4b689f1bcbdaf654feb9de81023e318ca310a12e
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 18:30:02 2021 -0700
annotate a couple more scripts; document sketchy code
commit 291981ff3d273952ec9c92ef8ab948473558b787
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 15:02:01 2021 -0700
pacify flake8
commit 45d2ce91ff333f3b7e34322b16e9c52b99b7a972
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 14:31:48 2021 -0700
annotate dumppdf, and comment likely bugs
commit 7278d83851cb336a1be3803a0993b5ec0ad39b4c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:49:58 2021 -0700
enable mypy on tests and tools, fix one implicit reexport bug
commit 4a83166ef4e4733cd2113f43188b585a4fda392b
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:59 2021 -0700
pdfdocument: per dumppdf.py, get_dest accepts either bytes or str
commit 43701e1bee068df98f378a253c9c2150ee4ad9f7
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:00 2021 -0700
layout: LAParams.boxes_flow may be None
commit 164f81652f1788e74837466f0ab593e94079bc0f
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:45:09 2021 -0700
add whitespace, pacify flake8
commit 893b9fb9ec918032b36a30456fc0b7a217da86d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:40:33 2021 -0700
support old Python without typing.Protocol
commit dc245084102b7b04c3f5599d75b5d62ba4290787
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:12:03 2021 -0700
Move "# type: ignore" comments to fix mypy on Python < 3.8
The placement of these comments got more flexible in 3.8 due to
https://github.com/python/mypy/issues/1032
Satisfying older Python and fitting in flake8's 79-character line
limit was quite a challenge!
commit da03afe7bd2cf3336e611f467f1c901455940ae8
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:59:58 2021 -0700
fix text output from HTMLConverter
commit 5401276a2ed3b74a385ebcab5152485224146161
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:40:22 2021 -0700
annotate high_level.py and the immediately-reachable internal APIs (mostly converters)
commit cc490513f8f17a7adc0bcbab2e0e86f37e832300
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 17:04:35 2021 -0700
* expand and improve annotations in cmap, encryption/decompression and fonts
* disallow untyped calls; this way, we have a core set of
typed code that can grow over time
(just not for ccitt, because there's a ton of work lurking there)
* expand "typing: none" comments to suppress a specific error code
commit 92df54ba1d53d5dbbd5442757dd85be5b1851f99
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:50:59 2021 -0700
update CHANGELOG
commit f72aaead45d0615e472a9b3190c9551a6b67b36e
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:47:03 2021 -0700
Merge branch 'develop' into mypy
commit ff787a93986c60361536a97182a41774f4a53ac3
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 21:46:14 2021 -0700
be more precise about types on ps/pdf stacks, remove most of the Any annotations
commit be1550189e10717f6827dbb7009d6e8c8b3f4c62
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 10:13:58 2021 -0700
silence missing imports, (maybe?) hook to tox
commit ff4b6a9bd46b352583d823d39065652c9a6f05f4
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 22:49:06 2021 -0700
turn on more strict checks, and untangle the layout mess with generics
Status:
$ mypy pdfminer
pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
Found 5 errors in 4 files (checked 27 source files)
pdfdevice.py:191 appears to be a real bug
commit 5c9c0b19d26ae391aea0e69c2c819261cc04460c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 17:22:41 2021 -0700
finish annotating layout
commit 0e6871c16abb29df2868ab145b4ce451b4b6c777
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 16:54:46 2021 -0700
general progress on annotations
* finish utils
* annotate more of pdfinterp, pdfdevice
* document reason for # type: ignore comments
* fix cyclic imports
* satisfy flake8
commit 17d59f42917fbf9b2b2eb844d3e83a8f2a3f123a
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Aug 19 21:38:50 2021 -0700
WIP on type annotations
With the possible exception of psparser.py, this is far from complete.
$ mypy pdfminer
pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
from pdfminer.layout import LAParams
from pdfminer.utils import AnyIO
logging.basicConfig()
OUTPUT_TYPES = ((".htm", "html"), (".html", "html"), (".xml", "xml"), (".tag", "tag"))
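A table such as `OUTPUT_TYPES` lends itself to a suffix lookup on the output filename. The sketch below illustrates that idea only: the helper name `guess_output_type` and the fallback value `"text"` are assumptions made for this example and are not shown in this listing.

```python
OUTPUT_TYPES = ((".htm", "html"), (".html", "html"), (".xml", "xml"), (".tag", "tag"))


def guess_output_type(outfile: str, default: str = "text") -> str:
    """Return the output type implied by the file suffix, else the default.

    Hypothetical helper, for illustration only.
    """
    for suffix, output_type in OUTPUT_TYPES:
        if outfile.endswith(suffix):
            return output_type
    return default


print(guess_output_type("report.html"))  # html
print(guess_output_type("report.txt"))   # text
```

Keeping the mapping as a tuple of pairs means a new suffix can be supported by adding one entry, without touching the lookup.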
def float_or_disabled(x: str) -> Optional[float]:
    """Parse x as a float, or return None if it is the string "disabled"."""
    if x.lower().strip() == "disabled":
        return None
    try:
        return float(x)
    except ValueError:
        raise argparse.ArgumentTypeError("invalid float value: {}".format(x))
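A minimal sketch of how a converter like `float_or_disabled` can be passed to `argparse` through the `type=` keyword, so that the string `disabled` never reaches arithmetic such as `1 - laparams.boxes_flow`. The parser here is hypothetical: the option name `--boxes-flow` comes from the traceback on this page, while the default of `0.5` and the bare parser are assumptions for illustration, not the tool's actual argument setup.

```python
import argparse
from typing import Optional


def float_or_disabled(x: str) -> Optional[float]:
    """Parse x as a float, or return None if it is the string "disabled"."""
    if x.lower().strip() == "disabled":
        return None
    try:
        return float(x)
    except ValueError:
        raise argparse.ArgumentTypeError("invalid float value: {}".format(x))


# Hypothetical parser, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument("--boxes-flow", type=float_or_disabled, default=0.5)

print(parser.parse_args(["--boxes-flow=disabled"]).boxes_flow)  # None
print(parser.parse_args(["--boxes-flow=0.25"]).boxes_flow)      # 0.25
```

Raising `argparse.ArgumentTypeError` from the converter lets `argparse` report a bad value as a normal usage error instead of an unhandled exception.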
|
|
|
|
|
|
|
|
|
Add type annotations (#661)
Squashed commit of the following:
commit fa229f7b7591c07aea4e5a4545f9e0c34246e1cd
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:33:06 2021 -0700
Merge branch 'develop' into mypy (and fixed types)
commit eaab3c65e2e3ab5f1f400cfc5186a3834c4ffe34
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:00:45 2021 -0700
reformat all multi-line function defs to one-arg-per-line
commit 3fe2b69eed9197009d9da6776462f580ebf0dfa3
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:48 2021 -0700
ccitt nit -- avoid casting needlessly
commit 15983d8c1e7162632fde43752c9d1c15938cd980
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:36 2021 -0700
tweak CHANGELOG
commit 13dc0babf782938e7d5b5e482d4c5adf92d82702
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:43:46 2021 -0700
add failing tests for dumppdf crash
commit 6b509c517876b8c15ac5a98a963884e23bd2e4d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:24:23 2021 -0700
ccitt: apply misc PR feedback
commit feb031ba86d3f22e41cfbbda13f17c039359f1e6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:18:26 2021 -0700
add missing None return type to all __init__ methods
commit c0d62d6c54c7ec37b40bea54a3f6a7a618ec0ec6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:13:08 2021 -0700
minor cleanup, remove a few more Any types
commit b52a0594e1998a492c172538a9b35491c5fc5f52
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 22:37:28 2021 -0700
tighten up types, avoid Any in favour of explicit casts
commit e58fd48bd14f31bebd2de8259f12630ac02756d6
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 14:10:49 2021 -0700
annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)
commit 605290633e55595e5e0045840df5c5b1d9de843a
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:37:38 2021 -0700
python 3.7 back-compat
commit 4dbcf8760f8a1d3e3d99f085476f86e6a043c80c
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:32:43 2021 -0700
annotate pdfminer.jbig2
commit 0d40b7c03a8028dc44acd3f457eac71abd681827
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:31:33 2021 -0700
annotate pdf2txt.py
commit 5f82eb4f5646b5d1285252689191e0a14557ec7b
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 09:16:31 2021 -0700
cleanup: make Plane generic
commit 624fc92b88473ff36a174760883f34c22109da2b
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:16:51 2021 -0700
bluntly ignore calls to cryptography.hazmat
commit 96b20439c169f40dbb114cabba6a582ad1ebe91e
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:01:06 2021 -0700
finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2
commit 0ab586347861b72b1d16880dc9293f9ad597e20a
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 21:51:56 2021 -0700
annotate pdffont
commit 4b689f1bcbdaf654feb9de81023e318ca310a12e
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 18:30:02 2021 -0700
annotate a couple more scripts; document sketchy code
commit 291981ff3d273952ec9c92ef8ab948473558b787
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 15:02:01 2021 -0700
pacify flake8
commit 45d2ce91ff333f3b7e34322b16e9c52b99b7a972
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 14:31:48 2021 -0700
annotate dumppdf, and comment likely bugs
commit 7278d83851cb336a1be3803a0993b5ec0ad39b4c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:49:58 2021 -0700
enable mypy on tests and tools, fix one implicit reexport bug
commit 4a83166ef4e4733cd2113f43188b585a4fda392b
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:59 2021 -0700
pdfdocument: per dumppdf.py, get_dest accepts either bytes or str
commit 43701e1bee068df98f378a253c9c2150ee4ad9f7
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:00 2021 -0700
layout: LAParams.boxes_flow may be None
commit 164f81652f1788e74837466f0ab593e94079bc0f
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:45:09 2021 -0700
add whitespace, pacify flake8
commit 893b9fb9ec918032b36a30456fc0b7a217da86d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:40:33 2021 -0700
support old Python without typing.Protocol
commit dc245084102b7b04c3f5599d75b5d62ba4290787
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:12:03 2021 -0700
Move "# type: ignore" comments to fix mypy on Python < 3.8
The placement of these comments got more flexible in 3.8 due to
https://github.com/python/mypy/issues/1032
Satisfying older Python and fitting in flake8's 79-character line
limit was quite a challenge!
commit da03afe7bd2cf3336e611f467f1c901455940ae8
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:59:58 2021 -0700
fix text output from HTMLConverter
commit 5401276a2ed3b74a385ebcab5152485224146161
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:40:22 2021 -0700
annotate high_level.py and the immediately-reachable internal APIs (mostly converters)
commit cc490513f8f17a7adc0bcbab2e0e86f37e832300
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 17:04:35 2021 -0700
* expand and improve annotations in cmap, encryption/decompression and fonts
* disallow untyped calls; this way, we have a core set of
typed code that can grow over time
(just not for ccitt, because there's a ton of work lurking there)
* expand "typing: none" comments to suppress a specific error code
commit 92df54ba1d53d5dbbd5442757dd85be5b1851f99
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:50:59 2021 -0700
update CHANGELOG
commit f72aaead45d0615e472a9b3190c9551a6b67b36e
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:47:03 2021 -0700
Merge branch 'develop' into mypy
commit ff787a93986c60361536a97182a41774f4a53ac3
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 21:46:14 2021 -0700
be more precise about types on ps/pdf stacks, remove most of the Any annotations
commit be1550189e10717f6827dbb7009d6e8c8b3f4c62
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 10:13:58 2021 -0700
silence missing imports, (maybe?) hook to tox
commit ff4b6a9bd46b352583d823d39065652c9a6f05f4
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 22:49:06 2021 -0700
turn on more strict checks, and untangle the layout mess with generics
Status:
$ mypy pdfminer
pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
Found 5 errors in 4 files (checked 27 source files)
pdfdevice.py:191 appears to be a real bug
commit 5c9c0b19d26ae391aea0e69c2c819261cc04460c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 17:22:41 2021 -0700
finish annotating layout
commit 0e6871c16abb29df2868ab145b4ce451b4b6c777
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 16:54:46 2021 -0700
general progress on annotations
* finish utils
* annotate more of pdfinterp, pdfdevice
* document reason for # type: ignore comments
* fix cyclic imports
* satisfy flake8
commit 17d59f42917fbf9b2b2eb844d3e83a8f2a3f123a
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Aug 19 21:38:50 2021 -0700
WIP on type annotations
With the possible exception of psparser.py, this is far from complete.
$ mypy pdfminer
pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
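The mypy status above flags `pdfdevice.py:191` for passing a `str` to a stream typed as accepting `bytes`. A minimal sketch of that class of bug and one common remedy follows. The helper name `write_text` and the UTF-8 default are illustrative assumptions, not the fix that was applied in pdfminer.

```python
import io
from typing import BinaryIO, Optional


def write_text(outfp: BinaryIO, text: str, codec: str = "utf-8") -> None:
    """Encode text explicitly before writing it to a binary stream."""
    outfp.write(text.encode(codec))


buf = io.BytesIO()
error: Optional[str] = None
try:
    # Writing str to a binary stream is what mypy reports; at runtime
    # io.BytesIO rejects it with a TypeError.
    buf.write("<page>")  # type: ignore[arg-type]
except TypeError as exc:
    error = str(exc)

# Encoding first keeps the stream contents well-defined.
write_text(buf, "<page>")
print(buf.getvalue())
```

Encoding at the call site keeps the stream type honest, so the same converter code can target a file opened in `"wb"` mode.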
2021-10-09 14:23:28 +00:00
|
|
|
def extract_text(
|
|
|
|
files: Iterable[str] = [],
|
2022-02-11 21:46:51 +00:00
|
|
|
outfile: str = "-",
|
pdf2txt: clean up construction of LAParams from arguments (#682)
2022-01-25 21:06:06 +00:00
|
|
|
laparams: Optional[LAParams] = None,
|
2022-02-11 21:46:51 +00:00
|
|
|
output_type: str = "text",
|
|
|
|
codec: str = "utf-8",
|
Add type annotations (#661)
Squashed commit of the following:
commit fa229f7b7591c07aea4e5a4545f9e0c34246e1cd
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:33:06 2021 -0700
Merge branch 'develop' into mypy (and fixed types)
commit eaab3c65e2e3ab5f1f400cfc5186a3834c4ffe34
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:00:45 2021 -0700
reformat all multi-line function defs to one-arg-per-line
commit 3fe2b69eed9197009d9da6776462f580ebf0dfa3
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:48 2021 -0700
ccitt nit -- avoid casting needlessly
commit 15983d8c1e7162632fde43752c9d1c15938cd980
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:36 2021 -0700
tweak CHANGELOG
commit 13dc0babf782938e7d5b5e482d4c5adf92d82702
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:43:46 2021 -0700
add failing tests for dumppdf crash
commit 6b509c517876b8c15ac5a98a963884e23bd2e4d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:24:23 2021 -0700
ccitt: apply misc PR feedback
commit feb031ba86d3f22e41cfbbda13f17c039359f1e6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:18:26 2021 -0700
add missing None return type to all __init__ methods
commit c0d62d6c54c7ec37b40bea54a3f6a7a618ec0ec6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:13:08 2021 -0700
minor cleanup, remove a few more Any types
commit b52a0594e1998a492c172538a9b35491c5fc5f52
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 22:37:28 2021 -0700
tighten up types, avoid Any in favour of explicit casts
commit e58fd48bd14f31bebd2de8259f12630ac02756d6
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 14:10:49 2021 -0700
annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)
commit 605290633e55595e5e0045840df5c5b1d9de843a
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:37:38 2021 -0700
python 3.7 back-compat
commit 4dbcf8760f8a1d3e3d99f085476f86e6a043c80c
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:32:43 2021 -0700
annotate pdfminer.jbig2
commit 0d40b7c03a8028dc44acd3f457eac71abd681827
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:31:33 2021 -0700
annotate pdf2txt.py
commit 5f82eb4f5646b5d1285252689191e0a14557ec7b
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 09:16:31 2021 -0700
cleanup: make Plane generic
commit 624fc92b88473ff36a174760883f34c22109da2b
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:16:51 2021 -0700
bluntly ignore calls to cryptography.hazmat
commit 96b20439c169f40dbb114cabba6a582ad1ebe91e
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:01:06 2021 -0700
finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2
commit 0ab586347861b72b1d16880dc9293f9ad597e20a
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 21:51:56 2021 -0700
annotate pdffont
commit 4b689f1bcbdaf654feb9de81023e318ca310a12e
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 18:30:02 2021 -0700
annotate a couple more scripts; document sketchy code
commit 291981ff3d273952ec9c92ef8ab948473558b787
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 15:02:01 2021 -0700
pacify flake8
commit 45d2ce91ff333f3b7e34322b16e9c52b99b7a972
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 14:31:48 2021 -0700
annotate dumppdf, and comment likely bugs
commit 7278d83851cb336a1be3803a0993b5ec0ad39b4c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:49:58 2021 -0700
enable mypy on tests and tools, fix one implicit reexport bug
commit 4a83166ef4e4733cd2113f43188b585a4fda392b
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:59 2021 -0700
pdfdocument: per dumppdf.py, get_dest accepts either bytes or str
commit 43701e1bee068df98f378a253c9c2150ee4ad9f7
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:00 2021 -0700
layout: LAParams.boxes_flow may be None
commit 164f81652f1788e74837466f0ab593e94079bc0f
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:45:09 2021 -0700
add whitespace, pacify flake8
commit 893b9fb9ec918032b36a30456fc0b7a217da86d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:40:33 2021 -0700
support old Python without typing.Protocol
commit dc245084102b7b04c3f5599d75b5d62ba4290787
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:12:03 2021 -0700
Move "# type: ignore" comments to fix mypy on Python < 3.8
The placement of these comments got more flexible in 3.8 due to
https://github.com/python/mypy/issues/1032
Satisfying older Python and fitting in flake8's 79-character line
limit was quite a challenge!
commit da03afe7bd2cf3336e611f467f1c901455940ae8
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:59:58 2021 -0700
fix text output from HTMLConverter
commit 5401276a2ed3b74a385ebcab5152485224146161
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:40:22 2021 -0700
annotate high_level.py and the immediately-reachable internal APIs (mostly converters)
commit cc490513f8f17a7adc0bcbab2e0e86f37e832300
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 17:04:35 2021 -0700
* expand and improve annotations in cmap, encryption/decompression and fonts
* disallow untyped calls; this way, we have a core set of
typed code that can grow over time
(just not for ccitt, because there's a ton of work lurking there)
* expand "typing: none" comments to suppress a specific error code
commit 92df54ba1d53d5dbbd5442757dd85be5b1851f99
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:50:59 2021 -0700
update CHANGELOG
commit f72aaead45d0615e472a9b3190c9551a6b67b36e
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:47:03 2021 -0700
Merge branch 'develop' into mypy
commit ff787a93986c60361536a97182a41774f4a53ac3
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 21:46:14 2021 -0700
be more precise about types on ps/pdf stacks, remove most of the Any annotations
commit be1550189e10717f6827dbb7009d6e8c8b3f4c62
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 10:13:58 2021 -0700
silence missing imports, (maybe?) hook to tox
commit ff4b6a9bd46b352583d823d39065652c9a6f05f4
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 22:49:06 2021 -0700
turn on more strict checks, and untangle the layout mess with generics
Status:
$ mypy pdfminer
pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
Found 5 errors in 4 files (checked 27 source files)
pdfdevice.py:191 appears to be a real bug
commit 5c9c0b19d26ae391aea0e69c2c819261cc04460c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 17:22:41 2021 -0700
finish annotating layout
commit 0e6871c16abb29df2868ab145b4ce451b4b6c777
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 16:54:46 2021 -0700
general progress on annotations
* finish utils
* annotate more of pdfinterp, pdfdevice
* document reason for # type: ignore comments
* fix cyclic imports
* satisfy flake8
commit 17d59f42917fbf9b2b2eb844d3e83a8f2a3f123a
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Aug 19 21:38:50 2021 -0700
WIP on type annotations
With the possible exception of psparser.py, this is far from complete.
$ mypy pdfminer
pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
2021-10-09 14:23:28 +00:00
|
|
|
strip_control: bool = False,
|
|
|
|
maxpages: int = 0,
|
|
|
|
page_numbers: Optional[Container[int]] = None,
|
|
|
|
password: str = "",
|
|
|
|
scale: float = 1.0,
|
|
|
|
rotation: int = 0,
|
2022-02-11 21:46:51 +00:00
|
|
|
layoutmode: str = "normal",
|
Add type annotations (#661)
2021-10-09 14:23:28 +00:00
|
|
|
output_dir: Optional[str] = None,
|
|
|
|
debug: bool = False,
|
|
|
|
disable_caching: bool = False,
|
|
|
|
**kwargs: Any
|
|
|
|
) -> AnyIO:
|
2015-05-30 15:14:24 +00:00
|
|
|
if not files:
|
|
|
|
raise ValueError("Must provide files to work upon!")
|
|
|
|
|
|
|
|
if output_type == "text" and outfile != "-":
|
2019-12-09 21:04:05 +00:00
|
|
|
for override, alttype in OUTPUT_TYPES:
|
2015-05-30 15:14:24 +00:00
|
|
|
if outfile.endswith(override):
|
|
|
|
output_type = alttype
|
2015-11-01 21:24:30 +00:00
|
|
|
|
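The loop above switches `output_type` when the output filename ends with a known extension. A minimal standalone sketch of that lookup is shown here. The contents of `OUTPUT_TYPES` are assumed for illustration (the table is defined elsewhere in the script), and `infer_output_type` is a hypothetical helper name.

```python
from typing import Sequence, Tuple

# Assumed mapping of filename suffix to output type, for illustration only.
OUTPUT_TYPES: Sequence[Tuple[str, str]] = (
    (".htm", "html"),
    (".html", "html"),
    (".xml", "xml"),
    (".tag", "tag"),
)


def infer_output_type(outfile: str, output_type: str = "text") -> str:
    """Return the output type, overriding "text" based on the file suffix."""
    # Only the default type is overridden; an explicit choice wins,
    # and "-" (stdout) has no suffix to inspect.
    if output_type == "text" and outfile != "-":
        for suffix, alttype in OUTPUT_TYPES:
            if outfile.endswith(suffix):
                output_type = alttype
    return output_type


print(infer_output_type("report.html"))
print(infer_output_type("report.txt"))
```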
2015-05-30 15:14:24 +00:00
|
|
|
if outfile == "-":
|
Add type annotations (#661)
2021-10-09 14:23:28 +00:00
|
|
|
outfp: AnyIO = sys.stdout
|
|
|
|
if sys.stdout.encoding is not None:
|
2022-02-11 21:46:51 +00:00
|
|
|
codec = "utf-8"
|
2015-05-30 15:14:24 +00:00
|
|
|
else:
|
|
|
|
outfp = open(outfile, "wb")
|
2015-11-01 21:24:30 +00:00
|
|
|
|
2015-05-30 15:14:24 +00:00
|
|
|
for fname in files:
|
|
|
|
with open(fname, "rb") as fp:
|
2015-05-30 16:03:55 +00:00
|
|
|
pdfminer.high_level.extract_text_to_fp(fp, **locals())
|
2015-05-30 15:14:24 +00:00
|
|
|
return outfp
|
|
|
|
|
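`extract_text` forwards its whole local namespace with `**locals()`. The sketch below uses a hypothetical stand-in callee, `render`, rather than the real `extract_text_to_fp`, to show what that forwarding implies: every local name, including loop variables, reaches the callee, and names it does not declare land in its `**kwargs`.

```python
from typing import Any, Dict, Iterable, List


def render(
    inf: str,
    outfp: str = "-",
    codec: str = "utf-8",
    **kwargs: Any,
) -> List[str]:
    """Stand-in callee: report which undeclared keyword names arrived."""
    return sorted(kwargs)


def convert(
    files: Iterable[str],
    outfp: str = "-",
    codec: str = "utf-8",
) -> Dict[str, List[str]]:
    extras: Dict[str, List[str]] = {}
    for fname in files:
        # locals() holds files, outfp, codec, extras and fname here;
        # outfp and codec match parameters, the rest fall into **kwargs.
        extras[fname] = render(fname, **locals())
    return extras


print(convert(["a.pdf"]))
```

Passing the needed arguments by name instead would make the interface explicit, at the cost of repeating the parameter list.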
2018-08-13 04:07:52 +00:00
|
|
|
|
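The `TypeError` in the #682 traceback arises because the string `disabled` reached arithmetic on `laparams.boxes_flow`. One way to keep such a value typed is to convert it while parsing arguments. The sketch below is an illustration under assumptions: the converter name `float_or_disabled`, the helper `parse_boxes_flow` and the default of `0.5` are chosen here and are not confirmed by this listing. The only fact taken from the commit history is that `LAParams.boxes_flow` may be `None`.

```python
import argparse
from typing import List, Optional


def float_or_disabled(value: str) -> Optional[float]:
    """Map "disabled" to None and anything else to a float."""
    if value.strip().lower() == "disabled":
        return None
    try:
        return float(value)
    except ValueError:
        # ArgumentTypeError makes argparse print a usage error and exit.
        raise argparse.ArgumentTypeError(
            f"invalid float value: {value}"
        ) from None


def parse_boxes_flow(argv: List[str]) -> Optional[float]:
    parser = argparse.ArgumentParser()
    # The default is illustrative; argparse leaves non-string defaults as-is.
    parser.add_argument("--boxes-flow", type=float_or_disabled, default=0.5)
    return parser.parse_args(argv).boxes_flow


print(parse_boxes_flow(["--boxes-flow=disabled"]))
print(parse_boxes_flow(["--boxes-flow=0.25"]))
```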
pdf2txt: clean up construction of LAParams from arguments (#682)
2022-01-25 21:06:06 +00:00
|
|
|
def parse_args(args: Optional[List[str]]) -> argparse.Namespace:
|
2018-08-13 04:07:52 +00:00
|
|
|
parser = argparse.ArgumentParser(description=__doc__, add_help=True)
|
2019-12-29 20:20:20 +00:00
|
|
|
parser.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"files",
|
|
|
|
type=str,
|
|
|
|
default=None,
|
|
|
|
nargs="+",
|
|
|
|
help="One or more paths to PDF files.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
|
2020-05-17 15:48:06 +00:00
|
|
|
parser.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--version",
|
|
|
|
"-v",
|
|
|
|
action="version",
|
|
|
|
version="pdfminer.six v{}".format(pdfminer.__version__),
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
parser.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--debug",
|
|
|
|
"-d",
|
|
|
|
default=False,
|
|
|
|
action="store_true",
|
|
|
|
help="Use debug logging level.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
parser.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--disable-caching",
|
|
|
|
"-C",
|
|
|
|
default=False,
|
|
|
|
action="store_true",
|
|
|
|
help="If caching of resources, such as fonts, should be disabled.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
|
|
|
|
parse_params = parser.add_argument_group(
|
2022-02-11 21:46:51 +00:00
|
|
|
"Parser", description="Used during PDF parsing"
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
parse_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--page-numbers",
|
|
|
|
type=int,
|
|
|
|
default=None,
|
|
|
|
nargs="+",
|
|
|
|
help="A space-separated list of page numbers to parse.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
parse_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--pagenos",
|
|
|
|
"-p",
|
|
|
|
type=str,
|
2019-12-29 20:20:20 +00:00
|
|
|
help="A comma-separated list of page numbers to parse. "
|
2022-02-11 21:46:51 +00:00
|
|
|
"Included for legacy applications, use --page-numbers "
|
|
|
|
"for more idiomatic argument entry.",
|
|
|
|
)
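The legacy `--pagenos` option arrives as a single comma-separated string, so it has to be split and converted before use. A minimal sketch of that conversion follows; the helper name `parse_page_list` is hypothetical, and the shift from 1-based command-line numbers to 0-based indices is an assumption made for illustration.

```python
from typing import Optional, Set


def parse_page_list(pagenos: Optional[str]) -> Optional[Set[int]]:
    """Turn a string such as "1,3,5" into a set of zero-based page indices."""
    if not pagenos:
        # No value given: signal "all pages" with None.
        return None
    # Assumption: pages are 1-based on the command line, 0-based internally.
    return {int(part) - 1 for part in pagenos.split(",")}
```

Returning a set rather than a list makes duplicate page numbers harmless and keeps membership checks cheap.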
|
2019-12-29 20:20:20 +00:00
|
|
|
parse_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--maxpages",
|
|
|
|
"-m",
|
|
|
|
type=int,
|
|
|
|
default=0,
|
|
|
|
help="The maximum number of pages to parse.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
parse_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--password",
|
|
|
|
"-P",
|
|
|
|
type=str,
|
|
|
|
default="",
|
|
|
|
help="The password to use for decrypting PDF file.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
parse_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--rotation",
|
|
|
|
"-R",
|
|
|
|
default=0,
|
|
|
|
type=int,
|
2019-12-29 20:20:20 +00:00
|
|
|
help="The number of degrees to rotate the PDF "
|
2022-02-11 21:46:51 +00:00
|
|
|
"before other types of processing.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
|
2022-01-25 21:06:06 +00:00
|
|
|
la_params = LAParams() # will be used for defaults
|
|
|
|
la_param_group = parser.add_argument_group(
|
2022-02-11 21:46:51 +00:00
|
|
|
"Layout analysis", description="Used during layout analysis."
|
|
|
|
)
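Building one parameter object up front and reading argparse defaults from it keeps each default value in a single place. The sketch below shows the pattern with a hypothetical `LayoutDefaults` dataclass standing in for `LAParams`; the field values are illustrative and are not taken from the library.

```python
import argparse
from dataclasses import dataclass


@dataclass
class LayoutDefaults:
    """Hypothetical stand-in for LAParams; values are illustrative only."""

    line_overlap: float = 0.5
    char_margin: float = 2.0


def build_parser() -> argparse.ArgumentParser:
    defaults = LayoutDefaults()  # single source of truth for default values
    parser = argparse.ArgumentParser()
    group = parser.add_argument_group(
        "Layout analysis", description="Used during layout analysis."
    )
    # Each option reads its default from the object instead of repeating it.
    group.add_argument("--line-overlap", type=float, default=defaults.line_overlap)
    group.add_argument("--char-margin", type=float, default=defaults.char_margin)
    return parser
```

If a default changes in the parameter class, the command-line tool picks it up without a second edit.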
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--no-laparams",
|
|
|
|
"-n",
|
|
|
|
default=False,
|
|
|
|
action="store_true",
|
|
|
|
help="If layout analysis parameters should be ignored.",
|
|
|
|
)
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--detect-vertical",
|
|
|
|
"-V",
|
|
|
|
default=la_params.detect_vertical,
|
2022-01-25 21:06:06 +00:00
|
|
|
action="store_true",
|
2022-02-11 21:46:51 +00:00
|
|
|
help="If vertical text should be considered during layout analysis",
|
|
|
|
)
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--line-overlap",
|
|
|
|
type=float,
|
|
|
|
default=la_params.line_overlap,
|
|
|
|
help="If two characters have more overlap than this they "
|
|
|
|
"are considered to be on the same line. The overlap is specified "
|
|
|
|
"relative to the minimum height of both characters.",
|
|
|
|
)
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--char-margin",
|
|
|
|
"-M",
|
|
|
|
type=float,
|
|
|
|
default=la_params.char_margin,
|
2019-12-29 20:20:20 +00:00
|
|
|
help="If two characters are closer together than this margin they "
|
2022-02-11 21:46:51 +00:00
|
|
|
"are considered to be part of the same line. The margin is "
|
|
|
|
"specified relative to the width of the character.",
|
|
|
|
)
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--word-margin",
|
|
|
|
"-W",
|
|
|
|
type=float,
|
|
|
|
default=la_params.word_margin,
|
2020-03-26 22:02:48 +00:00
|
|
|
help="If two characters on the same line are further apart than this "
|
2022-02-11 21:46:51 +00:00
|
|
|
"margin then they are considered to be two separate words, and "
|
|
|
|
"an intermediate space will be added for readability. The margin "
|
|
|
|
"is specified relative to the width of the character.",
|
|
|
|
)
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--line-margin",
|
|
|
|
"-L",
|
|
|
|
type=float,
|
|
|
|
default=la_params.line_margin,
|
2022-01-25 21:06:06 +00:00
|
|
|
help="If two lines are close together they are considered to "
|
2022-02-11 21:46:51 +00:00
|
|
|
"be part of the same paragraph. The margin is specified "
|
|
|
|
"relative to the height of a line.",
|
|
|
|
)
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--boxes-flow",
|
|
|
|
"-F",
|
|
|
|
type=float_or_disabled,
|
2022-01-25 21:06:06 +00:00
|
|
|
default=la_params.boxes_flow,
|
2019-12-29 20:20:20 +00:00
|
|
|
help="Specifies how much a horizontal and vertical position of a "
|
2022-02-11 21:46:51 +00:00
|
|
|
"text matters when determining the order of lines. The value "
|
|
|
|
"should be within the range of -1.0 (only horizontal position "
|
|
|
|
"matters) to +1.0 (only vertical position matters). You can also "
|
|
|
|
"pass `disabled` to disable advanced layout analysis, and "
|
|
|
|
"instead return text based on the position of the bottom left "
|
|
|
|
"corner of the text box.",
|
|
|
|
)
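The `--boxes-flow` option accepts either a number or the word `disabled`, which a plain `type=float` cannot express. One way to handle this is a custom argparse type function, sketched below. The name `float_or_none` and the default of `0.5` are assumptions for this example and are not claimed to match the `float_or_disabled` helper referenced above.

```python
import argparse
from typing import Optional


def float_or_none(value: str) -> Optional[float]:
    """Convert "disabled" to None and anything else to a float."""
    if value.strip().lower() == "disabled":
        return None
    try:
        return float(value)
    except ValueError:
        # ArgumentTypeError makes argparse print a clean usage error.
        raise argparse.ArgumentTypeError(f"invalid float value: {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # The default here is illustrative only.
    parser.add_argument("--boxes-flow", type=float_or_none, default=0.5)
    return parser
```

Converting at parse time means later code only ever sees a float or `None`, never the raw string that caused the `TypeError` in the traceback quoted in the commit message.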
|
2022-01-25 21:06:06 +00:00
|
|
|
la_param_group.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--all-texts",
|
|
|
|
"-A",
|
|
|
|
default=la_params.all_texts,
|
|
|
|
action="store_true",
|
|
|
|
help="Whether layout analysis should be performed on text in figures.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
|
|
|
|
output_params = parser.add_argument_group(
|
2022-02-11 21:46:51 +00:00
|
|
|
"Output", description="Used during output generation."
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--outfile",
|
|
|
|
"-o",
|
|
|
|
type=str,
|
|
|
|
default="-",
|
2019-12-29 20:20:20 +00:00
|
|
|
help="Path to file where output is written. "
|
2022-02-11 21:46:51 +00:00
|
|
|
'Or "-" (default) to write to stdout.',
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--output_type",
|
|
|
|
"-t",
|
|
|
|
type=str,
|
|
|
|
default="text",
|
|
|
|
help="Type of output to generate {text,html,xml,tag}.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--codec",
|
|
|
|
"-c",
|
|
|
|
type=str,
|
|
|
|
default="utf-8",
|
|
|
|
help="Text encoding to use in output file.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--output-dir",
|
|
|
|
"-O",
|
|
|
|
default=None,
|
2019-12-29 20:20:20 +00:00
|
|
|
help="The output directory to put extracted images in. If not given, "
|
2022-02-11 21:46:51 +00:00
|
|
|
"images are not extracted.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--layoutmode",
|
|
|
|
"-Y",
|
|
|
|
default="normal",
|
|
|
|
type=str,
|
|
|
|
help="Type of layout to use when generating html "
|
|
|
|
"{normal,exact,loose}. If normal, each line is"
|
|
|
|
" positioned separately in the html. If exact"
|
|
|
|
", each character is positioned separately in"
|
|
|
|
" the html. If loose, same result as normal "
|
|
|
|
"but with an additional newline after each "
|
|
|
|
"text line. Only used when output_type is html.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--scale",
|
|
|
|
"-s",
|
|
|
|
type=float,
|
|
|
|
default=1.0,
|
2019-12-29 20:20:20 +00:00
|
|
|
help="The amount of zoom to use when generating html file. "
|
2022-02-11 21:46:51 +00:00
|
|
|
"Only used when output_type is html.",
|
|
|
|
)
|
2019-12-29 20:20:20 +00:00
|
|
|
output_params.add_argument(
|
2022-02-11 21:46:51 +00:00
|
|
|
"--strip-control",
|
|
|
|
"-S",
|
|
|
|
default=False,
|
|
|
|
action="store_true",
|
2019-12-29 20:20:20 +00:00
|
|
|
help="Remove control characters from text. "
|
2022-02-11 21:46:51 +00:00
|
|
|
"Only used when output_type is xml.",
|
|
|
|
)
|
2018-08-13 04:07:52 +00:00
|
|
|
|
2022-01-25 21:06:06 +00:00
|
|
|
parsed_args = parser.parse_args(args=args)
|
2018-08-13 04:07:52 +00:00
|
|
|
|
2022-01-25 21:06:06 +00:00
|
|
|
# Propagate parsed layout parameters to LAParams object
|
|
|
|
if parsed_args.no_laparams:
|
|
|
|
parsed_args.laparams = None
|
|
|
|
else:
|
|
|
|
parsed_args.laparams = LAParams(
|
|
|
|
line_overlap=parsed_args.line_overlap,
|
|
|
|
char_margin=parsed_args.char_margin,
|
|
|
|
line_margin=parsed_args.line_margin,
|
|
|
|
word_margin=parsed_args.word_margin,
|
|
|
|
boxes_flow=parsed_args.boxes_flow,
|
|
|
|
detect_vertical=parsed_args.detect_vertical,
|
|
|
|
all_texts=parsed_args.all_texts,
|
|
|
|
)
|
|
|
|
|
|
|
|
if parsed_args.page_numbers:
|
2022-02-11 21:46:51 +00:00
|
|
|
parsed_args.page_numbers = {x - 1 for x in parsed_args.page_numbers}
|
2022-01-25 21:06:06 +00:00
|
|
|
|
|
|
|
if parsed_args.pagenos:
|
2022-02-11 21:46:51 +00:00
|
|
|
parsed_args.page_numbers = {int(x) - 1 for x in parsed_args.pagenos.split(",")}
|
2022-01-25 21:06:06 +00:00
|
|
|
|
|
|
|
if parsed_args.output_type == "text" and parsed_args.outfile != "-":
|
|
|
|
for override, alttype in OUTPUT_TYPES:
|
|
|
|
if parsed_args.outfile.endswith(override):
|
|
|
|
parsed_args.output_type = alttype
|
2018-08-13 04:07:52 +00:00
|
|
|
|
2022-01-25 21:06:06 +00:00
|
|
|
return parsed_args
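The post-processing at the end of `parse_args` does two things: it shifts 1-based page numbers to 0-based, and it infers the output type from the output file's suffix when the type is still the default `text`. A standalone sketch of that logic follows. The helper names are hypothetical, and the contents of `OUTPUT_TYPES` are an assumption, since the constant is defined outside this section.

```python
from typing import Iterable, Set

# Assumed (suffix, output_type) pairs; the real constant lives elsewhere in the file.
OUTPUT_TYPES = ((".htm", "html"), (".html", "html"), (".xml", "xml"), (".tag", "tag"))


def to_zero_based(page_numbers: Iterable[int]) -> Set[int]:
    """Convert user-facing 1-based page numbers to 0-based indices."""
    return {n - 1 for n in page_numbers}


def parse_pagenos(pagenos: str) -> Set[int]:
    """Parse a comma-separated string such as '1,2,5' into 0-based indices."""
    return {int(x) - 1 for x in pagenos.split(",")}


def infer_output_type(output_type: str, outfile: str) -> str:
    """Override the default 'text' type based on the output file suffix."""
    if output_type == "text" and outfile != "-":
        for suffix, alttype in OUTPUT_TYPES:
            if outfile.endswith(suffix):
                return alttype
    return output_type
```

Unlike the loop above, this sketch returns on the first matching suffix. With the assumed table the result is the same, because no file name can match more than one suffix.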
|
2015-05-17 20:08:57 +00:00
|
|
|
|
|
|
|
|
2022-01-25 21:06:06 +00:00
|
|
|
def main(args: Optional[List[str]] = None) -> int:
|
|
|
|
parsed_args = parse_args(args)
|
|
|
|
outfp = extract_text(**vars(parsed_args))
|
2015-05-17 20:08:57 +00:00
|
|
|
outfp.close()
|
2015-05-30 16:03:55 +00:00
|
|
|
return 0
|
2015-05-17 20:08:57 +00:00
|
|
|
|
2009-05-15 14:34:53 +00:00
|
|
|
|
2022-02-11 21:46:51 +00:00
|
|
|
if __name__ == "__main__":
|
2019-12-09 21:04:05 +00:00
|
|
|
sys.exit(main())
|