Commit Graph

494 Commits (aa5dec252f43a72857ec2abe18d577bf84a1a16d)

Author SHA1 Message Date
Pieter Marsman 9d7fe2d9ee
Catch ValueError when converting font encoding differences to characters (#389)
* Catch ValueError when calling `name2unicode` when a unicode value cannot be parsed

* Add test for catching ValueError and KeyError when font encoding differences are invalid

* Added line to CHANGELOG.md
2020-03-16 20:12:45 +01:00
Pieter Marsman 1d773dc38a
Fix grouping textlines when bounding box of parent container is wrong (#386)
* Default value for --all-texts should be false, because using the flag enables it

* Fix edge case: when no neighbors are found a line should form its own text box

* Added test for grouping textlines where 1 is outside the parent bounding box

* Added CHANGELOG.md line
2020-03-14 10:33:39 +01:00
Pieter Marsman bab6d154c2 Bump version 20200124 2020-01-24 12:38:11 +01:00
Pieter Marsman bc494ff03c Bump version to 20200121 2020-01-21 21:13:52 +01:00
Pieter Marsman 410d7ecac3
Fix value for font-family in html by removing the subset tag from the PDF font-name (#357)
* Fix font name by removing subset tag

* Added line to CHANGELOG.md

* Add documentation and clear variable name

* Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`
2020-01-16 22:25:20 +01:00
Pieter Marsman fff3ac2ba6
Fix bug in computing character bounding box (#348)
* Remove scaling font height/width with size of font bounding box

* Refactor LTChar bounding box computation

* Change expected outcome of `python tools/pdf2txt.py samples/simple3.pdf`, because it looks like an improvement. However, when I view `samples/simple3.pdf` I don't see any text at all. The change in expected outcome is explained by the fact that the bounding boxes of characters can be different, depending on the `/FontBBox` parameter of the font.

* Add test for font sizes, and for this a high-level function that returns an iterator of LTPage objects

* Add line to CHANGELOG
2020-01-16 22:15:50 +01:00
Recursing 0b1741b9bf Pack the /P (ermissions) entry from the /Encrypt dictionary in the file trailer, as unsigned long (#352)
Fixes #186 

* Treat the permissions (the /P entry) as unsigned long, fix #186

* handle negative values for p

* Extract function for resolving a two's complement

* Add test for issue #352

* Add line to CHANGELOG.md

* Only ints can be converted to a uint using two's-complement method

* Standardize import style; multiple imports from same module on one line

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-01-07 21:59:13 +01:00
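The two's-complement handling described in the #352 entry above can be sketched roughly as follows. This is an illustration only, not the project's code: the names `to_unsigned` and `pack_permissions` are hypothetical, and a 32-bit width with little-endian byte order is assumed.

```python
import struct


def to_unsigned(value: int, n_bits: int = 32) -> int:
    """Reinterpret a signed integer as an unsigned two's-complement value.

    Hypothetical helper for illustration; only ints are accepted.
    """
    if not isinstance(value, int):
        raise TypeError("only ints can be converted to an unsigned value")
    if value < 0:
        # A negative number maps to 2**n_bits + value.
        return value + (1 << n_bits)
    return value


def pack_permissions(p: int) -> bytes:
    """Pack a /P value as a 4-byte unsigned long (little-endian assumed)."""
    return struct.pack("<L", to_unsigned(p))


print(pack_permissions(-4))
```

Without the conversion, `struct.pack("<L", -4)` raises `struct.error`, which is the kind of failure a negative /P value would trigger.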
Pieter Marsman b27d3d0aff Bump version 2020-01-04 18:15:15 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refactor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors to go

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exist

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman 803a7d9598 Release 20191110 2019-11-10 12:29:14 +01:00
Pieter Marsman 2bee7d8dcf
Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335)
Fixes #334
2019-11-10 12:18:49 +01:00
Pieter Marsman 5c6fa8f986 Release 20191107 2019-11-07 21:52:44 +01:00
Pieter Marsman bc034c8e59
Create sphinx documentation for Read the Docs (#329)
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
2019-11-07 21:12:34 +01:00
Igor Moura 40aa2533c9 Added: simple wrapper to extract text from pdf (#330)
Fixes #327
2019-11-07 07:54:10 +01:00
Martin Hasoň ed1b09c6f2 Fix debug logging for pdf2txt.py and dumppdf.py (#325)
Fixes #313
2019-11-06 21:47:19 +01:00
Pieter Marsman 33b16b3f07
Deprecate the use of _py2_no_more_posargs (#328)
Fixes #324
2019-11-02 10:29:39 +01:00
Jianfeng 44b223cf0a Speedup grouping of textboxes (#315)
Changed: using a heap instead of a SortedList and avoid rebuilding the heap in each iteration
Changed: avoid potentially huge number of variable assignments in list comprehension.
Changed: avoid repeatedly evaluating `obj is obj` in list comprehension by storing id(obj).
2019-10-31 09:22:58 +01:00
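The heap-based approach mentioned in the #315 entry above can be illustrated with a simplified sketch. It is not the library's implementation: it works on 1-D intervals instead of text boxes, the function name `group_closest` and the distance measure are assumptions, and integer keys stand in for the `id(obj)` bookkeeping the commit describes. The heap is built once; entries that refer to already-merged items are skipped when popped instead of rebuilding the heap.

```python
import heapq
from itertools import combinations


def group_closest(boxes):
    """Greedily merge the closest pair of 1-D intervals until one remains.

    Returns the list of (a, b) pairs in the order they were merged.
    """
    def dist(a, b):
        # Length of the merged span that neither interval accounts for.
        span = max(a[1], b[1]) - min(a[0], b[0])
        return span - (a[1] - a[0]) - (b[1] - b[0])

    alive = dict(enumerate(boxes))  # key -> interval still available
    next_key = len(boxes)
    heap = [
        (dist(a, b), i, j)
        for (i, a), (j, b) in combinations(alive.items(), 2)
    ]
    heapq.heapify(heap)

    merges = []
    while heap:
        _, i, j = heapq.heappop(heap)
        if i not in alive or j not in alive:
            continue  # stale entry: one side was merged earlier
        a, b = alive.pop(i), alive.pop(j)
        merged = (min(a[0], b[0]), max(a[1], b[1]))
        merges.append((a, b))
        # Only distances to the new group are pushed; the heap is not rebuilt.
        for k, other in alive.items():
            heapq.heappush(heap, (dist(merged, other), k, next_key))
        alive[next_key] = merged
        next_key += 1
    return merges


print(group_closest([(0, 1), (1.5, 2), (10, 11)]))
```

Skipping stale entries lazily trades a somewhat larger heap for not having to search and remove entries after every merge.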
Pieter Marsman d88d6020a2
Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320)
Fixes #314 
Fixes #105
2019-10-26 19:16:37 +02:00
Pieter Marsman a238a19999
Fix assertionerror when dumping pdf with reference to objid 0 (#318)
Fixes #94 
Added: test to check if `PDFObjectNotFound` error is raised if objid 0 is requested.
2019-10-25 22:49:58 +02:00
Serj Sintsov cb9cd8ea46 Use named logger instead of root logger (#236) 2019-10-22 20:52:43 +02:00
Pieter Marsman 373c6e7b97
Added: extraction of JBIG2 encoded images (#311)
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46
2019-10-22 17:37:06 +02:00
Pieter Marsman 694aa508c3 Release 20191020 2019-10-20 14:21:48 +02:00
Pieter Marsman adc4726e06
Add warning about dropping python2 support (#307)
Fix #303
2019-10-20 13:59:29 +02:00
Pieter Marsman 9fd7172f7b Cleanup utils.py 2019-10-17 12:14:02 +02:00
jet457 7e40fde320 Removing assertion in drange to allow equal inputs (#246) and mimic behaviour of built-in method range
Fixes #66, since it now allows the bbox to have 0 width or 0 height
Added tests for Plane since it is the API that uses drange
2019-10-17 12:04:25 +02:00
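To illustrate why allowing equal inputs matters in the `drange` entry above, here is a minimal sketch of a grid-index range. The name `cell_range` and its exact formula are assumptions made for illustration, not a copy of the library's `drange`. Like the built-in `range`, it does not reject a zero-length span, so a box with zero width or height still maps to a grid cell.

```python
def cell_range(v0, v1, d):
    """Return the indices of grid cells of size d touched by the span [v0, v1].

    Hypothetical helper; equal v0 and v1 are allowed and yield one cell.
    """
    return range(int(v0) // d, int(v1 + d) // d)


print(list(cell_range(0, 100, 50)))
print(list(cell_range(10, 10, 50)))
```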
D.A.Bashkirtsev 4df6d4e5ca Changed: comparisons for image colorspace literals (#132)
Fixes #131 

Changed: comparisons for image colorspace literals
Added: test for extracting images from pdfs
2019-10-15 16:11:54 +02:00
Pieter Marsman 63b2e09ac3
Merge pull request #203 from jbarlow83/negative-descent
Interpret font Descent as a negative number even if specified as positive
2019-10-13 20:06:52 +02:00
Tony Tong 106a09c5bb fix stroke color and non-stroke color in PDFGraphicState 2019-10-12 17:35:46 -04:00
Tata Ganesh f218996fe9
Merge pull request #273 from igormp/develop
Use resolve_all on PdfFont widths and bbox
2019-10-12 21:24:29 +05:30
Fakabbir Amin 7c03d96d25 Corrects Comment 2019-08-20 17:16:10 +05:30
Fakabbir Amin abd685fdc6 Corrects Code Comment 2019-08-20 17:13:27 +05:30
Fakabbir Amin 3d549ea48c Removes code comments 2019-08-20 16:48:40 +05:30
Igor Moura cf4641d877
Merge branch 'develop' into develop 2019-08-15 08:11:28 -03:00
Fakabbir Amin fe38695739
Merge branch 'develop' into pdfstream-as-cmap 2019-08-10 10:44:31 +05:30
Fakabbir Amin 5a0d8db052 Adds decoder for OnebyteIdentityH/V instead of using default CMap 2019-08-10 10:07:23 +05:30
Tata Ganesh 42e2c8143b
Merge pull request #263 from pietermarsman/261-glyph-list-specification
name2unicode() should follow the Adobe Glyph List Specification
2019-07-26 22:13:34 +05:30
Igor Moura 2f4518231f Use resolve_all on PdfFont widths and bbox
Fixes #268
2019-07-24 15:10:13 -03:00
Igor Moura 540df9f676 Replaced .iteritems() with six.iteritems() for Python 3 compat
This is a squashed commit, the previous messages can be seen below

This is the 1st commit message:

Replaced .iteritems() usage for .items()

Fixed some python 2 leftovers, as discussed in #267. Also formatted code according to Black. This possibly breaks some python 2 compatibility

This is the commit message #2:

Reverted formatting and more spread six usage
2019-07-24 14:08:30 -03:00
Fakabbir Amin f1a4dcea88 Adds Test Cases, Neater Code For CMap Assignment 2019-07-24 11:56:06 +05:30
Fakabbir Amin fa400431f5 Adds Test, Removes Unnecessary Assumptions 2019-07-17 11:38:00 +05:30
Pieter Marsman 6f362f53fe Raise a `KeyError` with a useful message if `unicode2name()` does not match any glyph name. Use this message to log debug statements. 2019-07-16 08:52:24 +02:00
Pieter Marsman 0fb83366b6 Remove intermediate variable `full_stop` because it is just a dot 2019-07-16 08:49:57 +02:00
Fakabbir Amin cc40af3d2b Removes @property, Adds docstring 2019-07-15 14:21:21 +05:30
Pieter Marsman c597e95a9f Use KeyError to signal that the name does not resemble any unicode, this pattern is also used in the rest of pdfminer.six 2019-07-14 15:37:15 +02:00
Pieter Marsman 33cc9861ae Add docstring to Type1FontHeaderParser.get_encoding() that describes that the custom CharStrings of the font are mapped to '' 2019-07-14 15:19:17 +02:00
Pieter Marsman f0392f8049 Change implementation of name2unicode such that it follows the Adobe Glyph specs (with allowing lowercase) 2019-07-14 15:16:42 +02:00
Fakabbir Amin 8e4a82ad8b Corrects Indentation 2019-07-13 05:00:25 +05:30
Fakabbir Amin c022358c8d Encapsulates character map name 2019-07-13 04:52:24 +05:30
John Kesegich 8ab2e287be Handle PDFStream as character map name in PDFCIDFont 2019-02-25 11:42:30 -06:00
ganeshtata b6a5848208 FEAT: Release 20181108 2018-11-08 22:37:11 +05:30
Tata Ganesh e03ecab856
Merge pull request #141 from timb07/speedup_layout
Speed up layout of text boxes
2018-11-08 20:28:40 +05:30
James R. Barlow 2ede124142 Interpret font Descent as a negative number even if specified as positive
The PDF RM specifies that Descent should be negative. Fonts that claim
to have a positive Descent (not that it would make sense) always seem
to be wrong about this claim.
2018-11-03 23:17:48 -07:00
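The rule stated in the Descent entry above amounts to forcing the sign of the value. A minimal sketch, assuming a hypothetical helper named `normalize_descent` (the actual change in the font code may look different):

```python
def normalize_descent(descent: float) -> float:
    """Return the descent as a non-positive number, whatever sign the font claims."""
    return -abs(descent)


print(normalize_descent(200))
print(normalize_descent(-200))
```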
Tata Ganesh 259b29299e
Merge pull request #133 from timb07/speedup
Speed up handling of PDFs with large images
2018-07-15 11:27:35 +05:30
Martin Wolf edaf2c9e3f move unittest to main() 2018-06-26 00:51:51 +02:00
Martin Wolf eff3f19886 Merge remote-tracking branch 'upstream/master' 2018-06-25 23:32:52 +02:00
Tata Ganesh 9c7bdcc716
Merge pull request #157 from h2ri/master
decode cid: 160 and 173 to spaces
2018-06-25 11:19:27 +05:30
Charles Reid 7b08cdbff9 apply dos2unix to files in pdfminer/ and tools/ to remove \r\n windows line endings 2018-06-21 12:19:48 -07:00
Goulu 1db260609e
render_string must have 5 params in all PDFDevice classes (#158) 2018-06-21 10:21:26 +02:00
Guglielmetti Philippe 70624a64dd render_string() now takes 3 parameters, not 5 (reverted from commit 95b65536af) 2018-06-21 09:49:45 +02:00
Guglielmetti Philippe 95b65536af render_string() now takes 3 parameters, not 5 2018-06-21 09:28:55 +02:00
Healthi 65eb0cef82 decode cid: 160 and 170 to spaces 2018-06-20 17:17:03 +05:30
Martin Wolf 26f80715ed Merge remote-tracking branch 'upstream/master' 2018-06-20 13:27:18 +02:00
Tata Ganesh 67bc581bd3
Merge pull request #134 from timb07/issue_90
FIX: TypeError caused by bug in _parse_comment; #90 #89 #109
2018-06-14 09:27:34 +05:30
Tata Ganesh 7084d81bd1
Merge pull request #129 from clustree/xml-color
FEAT: Send color to XML conversion
2018-06-10 21:02:34 +05:30
Martin Wolf 4bdb3ba8cc Fixes needed to be able to compile pdfminer.six with Cython 2018-04-12 00:05:38 +02:00
Tim Bell 1cbeaebfce Fix Python 2.6 incompatibility 2018-04-11 10:34:15 +10:00
Tim Bell 0c8cf748fe Fix copy-paste error 2018-04-11 10:15:32 +10:00
Tim Bell 8f8a78bb88 Remove now-unused csort() 2018-04-11 09:37:32 +10:00
Tim Bell 2dda2b12b4 Speedup layout with .sort() and sortedcontainers.SortedListWithKey() 2018-04-11 09:03:32 +10:00
Gregory Mori 335c25c045 only check for bytes input to enc() in python3
In python2, isinstance("", bytes) is true, causing enc() to
suppress any string input. This results in fontnames being lost
when running pdf2txt.py in python2.

As this check was not present in the original python2 version of
pdfminer, restrict it to only check when running in python3.
2018-04-09 12:21:59 -07:00
Tim Bell 981e3a575e Fix TypeError caused by bug in _parse_comment; #90 #89 #109 2018-04-03 12:47:40 +10:00
Tim Bell 083f11b165 Fix cases where a bytearray doesn't work in place of bytes 2018-04-03 07:27:29 +10:00
Tim Bell 185ddeb2ab Speed up handling of PDFs with large images with more minimal change 2018-04-03 07:21:21 +10:00
Tim Bell fab1c9462c Speed up handling of PDFs with large images 2018-03-29 14:21:31 +11:00
Tata Ganesh eddf861fbd
Merge pull request #125 from yosida95/bytes-type
Fix type of an argument to PDFFont#decode to bytes in py3
2018-03-19 11:00:10 +05:30
Quentin Pradet 0911703eba
pdfcolor: Fix Python 2.6 compatibility 2018-03-06 14:53:11 +04:00
Quentin Pradet 94f3d61bb2
converter: Fix XML syntax 2018-03-06 14:41:52 +04:00
Quentin Pradet 2231f0892e
Send non-stroke color to XML conversion
Inspired by https://github.com/euske/pdfminer/pull/158 from @andruo11
and https://github.com/euske/pdfminer/pull/197 from @staccatosound.
2018-03-06 14:11:48 +04:00
Quentin Pradet b6c63bedc6
Make DeviceGray the default color as it should be 2018-03-06 11:24:07 +04:00
Quentin Pradet 0ce9a29f83
Fix colorspace determinism with OrderedDict 2018-03-06 11:23:32 +04:00
Kohei YOSHIDA a636cbcfd4 fix type of an argument to PDFFont#decode to bytes in py3 2018-02-20 13:42:09 +09:00
KOLANICH 3bf3c97bbb
Added a vector between 2 boxes which may be useful for users of the library 2018-02-16 14:49:12 +00:00
Tata Ganesh 3e6cc20cb2
Merge pull request #96 from sschuberth/patch-1
TrueTypeFont: Check for enough data to unpack
2018-01-31 18:26:54 +05:30
ganeshtata 1b88575e79 FIX: Null character replaced by blank
-The presence of the character '\0' was causing an error with some PDFs.
-It has been fixed by replacing all occurrences of '\0' with ''.
2017-11-08 12:50:50 +05:30
Sebastian Schuberth fcd3e6ce00 Catch an error unpack might throw instead of checking the length before 2017-10-30 19:31:58 +01:00
Sebastian Schuberth 39428fb4f0 TrueTypeFont: Check for enough data to unpack
Fixes https://github.com/euske/pdfminer/issues/96
and https://github.com/euske/pdfminer/issues/144.
2017-10-16 12:35:04 +02:00
SUZUKI Masaya d4118cf5e8 Enabled PDFDevice in the with statement (#88) 2017-08-18 08:15:04 +02:00
Peter Bittner e39800f14c Move package description into package docstring (#87)
Convert Windows/DOS line endings CR/LF to Unix LF (again!)

Add Python 3.6 to classifiers, update project URL
2017-08-18 08:13:15 +02:00
Venelin Stoykov 171cdcc69d Microoptimization for singlebyte fonts (#84)
Instead of a list comprehension that calls a function to get the integer value of each byte, convert the bytes directly to a bytearray, which is a more suitable structure for storing a list of bytes.
2017-08-18 08:10:27 +02:00
Venelin Stoykov 14de393d5e Cleanup psparser (#83)
- Do not use bytesindex function. Use native slices instead
- Fix import ordering
2017-08-18 08:10:06 +02:00
Venelin Stoykov 496bfd0778 Remove leftover from removing shebangs (#81) 2017-08-18 08:09:00 +02:00
Venelin Stoykov c2432c32f1 Fix assert message for PDFLayoutAnalyzer.end_page (#80)
stack is undefined
2017-08-18 08:08:08 +02:00
Philippe Guglielmetti 4c604828e8 v. 20170720 2017-07-20 21:35:49 +02:00
Philippe Guglielmetti b010db6049 solves https://github.com/pdfminer/pdfminer.six/issues/65 2017-07-20 21:17:06 +02:00
Sergei Maertens 67bf5ab124 Compare byte with byte instead of int (#78) 2017-07-20 20:47:14 +02:00
Sergei Maertens 3e364354da Fixes #64 -- be less strict when inspecting a tree type (#76)
In the PDFStream it's possible that the /Type element is not
present, but /type is. According to the spec, these are different
elements, but in the case in point they had the same meaning.

If PDFMiner is not running in STRICT mode and /Type doesn't resolve,
a fallback to /type is used to determine the tree type.
2017-07-20 20:46:35 +02:00
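The fallback described in the #76 entry above can be sketched with a plain dictionary standing in for the stream attributes. The function name `tree_type` and the `strict` parameter are hypothetical stand-ins for the library's own lookup and its STRICT setting.

```python
def tree_type(attrs: dict, strict: bool = False):
    """Look up /Type, falling back to /type only when not in strict mode."""
    value = attrs.get("Type")
    if value is None and not strict:
        # Per the spec these are different keys, but some files use /type.
        value = attrs.get("type")
    return value


print(tree_type({"type": "Pages"}))
print(tree_type({"type": "Pages"}, strict=True))
```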
Attila Szász 938419c476 Align dumppdf tool to modified data structures. (#73)
* Align dumppdf tool to modified data structures.
TOC page numbers should also work now, counting from 1.

* Update version number.
2017-07-20 20:46:11 +02:00
Sergei Maertens d79612c455 Resolve unresolved PDFObjectRefs (#70)
Thank you !
2017-06-02 13:35:12 +02:00
Hugh Secker-Walker 488545ddc7 Add string expressions to asserts showing local data (#67) 2017-05-29 09:06:09 +02:00