Commit Graph

426 Commits (347c125fb88d0b62d9e0c599e266bf8ca47d915a)

Author SHA1 Message Date
Tim Bell 8f8a78bb88 Remove now-unused csort() 2018-04-11 09:37:32 +10:00
Tim Bell 2dda2b12b4 Speedup layout with .sort() and sortedcontainers.SortedListWithKey() 2018-04-11 09:03:32 +10:00
Gregory Mori 335c25c045 only check for bytes input to enc() in python3
In python2, isinstance("", bytes) is true, causing enc() to
suppress any string input. This results in fontnames being lost
when running pdf2txt.py in python2.

As this check was not present in the original python2 version of
pdfminer, restrict it to only check when running in python3.
2018-04-09 12:21:59 -07:00
Tim Bell 981e3a575e Fix TypeError caused by bug in _parse_comment; #90 #89 #109 2018-04-03 12:47:40 +10:00
Tim Bell 083f11b165 Fix cases where a bytearray doesn't work in place of bytes 2018-04-03 07:27:29 +10:00
Tim Bell 185ddeb2ab Speed up handling of PDFs with large images with more minimal change 2018-04-03 07:21:21 +10:00
Tim Bell fab1c9462c Speed up handling of PDFs with large images 2018-03-29 14:21:31 +11:00
Tata Ganesh eddf861fbd
Merge pull request #125 from yosida95/bytes-type
Fix type of an argument to PDFFont#decode to bytes in py3
2018-03-19 11:00:10 +05:30
Quentin Pradet 0911703eba
pdfcolor: Fix Python 2.6 compatibility 2018-03-06 14:53:11 +04:00
Quentin Pradet 94f3d61bb2
converter: Fix XML syntax 2018-03-06 14:41:52 +04:00
Quentin Pradet 2231f0892e
Send non-stroke color to XML conversion
Inspired by https://github.com/euske/pdfminer/pull/158 from @andruo11
and https://github.com/euske/pdfminer/pull/197 from @staccatosound.
2018-03-06 14:11:48 +04:00
Quentin Pradet b6c63bedc6
Make DeviceGray the default color as it should be 2018-03-06 11:24:07 +04:00
Quentin Pradet 0ce9a29f83
Fix colorspace determinism with OrderedDict 2018-03-06 11:23:32 +04:00
Kohei YOSHIDA a636cbcfd4 fix type of an argument to PDFFont#decode to bytes in py3 2018-02-20 13:42:09 +09:00
KOLANICH 3bf3c97bbb
Added a vector between 2 boxes which may be useful for users of the library 2018-02-16 14:49:12 +00:00
Tata Ganesh 3e6cc20cb2
Merge pull request #96 from sschuberth/patch-1
TrueTypeFont: Check for enough data to unpack
2018-01-31 18:26:54 +05:30
ganeshtata 1b88575e79 FIX: Null character replaced by blank
-The presence of the character '\0' was causing an error with some PDFs.
-It has been fixed by replacing all occurences of '\0' with ''.
2017-11-08 12:50:50 +05:30
Sebastian Schuberth fcd3e6ce00 Catch an error unpack might throw instead of checking the length before 2017-10-30 19:31:58 +01:00
Sebastian Schuberth 39428fb4f0 TrueTypeFont: Check for enough data to unpack
Fixes https://github.com/euske/pdfminer/issues/96
and https://github.com/euske/pdfminer/issues/144.
2017-10-16 12:35:04 +02:00
SUZUKI Masaya d4118cf5e8 Enabled PDFDevice in the with statement (#88) 2017-08-18 08:15:04 +02:00
Peter Bittner e39800f14c Move package description into package docstring (#87)
Convert Windows/DOS line endings CR/LF to Unix LF (again!)

Add Python 3.6 to classifiers, update project URL
2017-08-18 08:13:15 +02:00
Venelin Stoykov 171cdcc69d Microoptimization for singlebyte fonts (#84)
Instead of list comprehension which will call a function to get the integer value of the bytes directly convert it to bytearray which is more optimal structure for storing list of bytes.
2017-08-18 08:10:27 +02:00
Venelin Stoykov 14de393d5e Cleanup psparser (#83)
- Do not use bytesindex function. Use native slices instead
- Fix import ordering
2017-08-18 08:10:06 +02:00
Venelin Stoykov 496bfd0778 Remove leftover from removing shebangs (#81) 2017-08-18 08:09:00 +02:00
Venelin Stoykov c2432c32f1 Fix assert message for PDFLayoutAnalyzer.end_page (#80)
stack is undefined
2017-08-18 08:08:08 +02:00
Philippe Guglielmetti 4c604828e8 v. 20170720 2017-07-20 21:35:49 +02:00
Philippe Guglielmetti b010db6049 solves https://github.com/pdfminer/pdfminer.six/issues/65 2017-07-20 21:17:06 +02:00
Sergei Maertens 67bf5ab124 Compare byte with byte instead of int (#78) 2017-07-20 20:47:14 +02:00
Sergei Maertens 3e364354da Fixes #64 -- be less strict when inspecting a tree type (#76)
In the PDFStream it's possible that the /Type element is not
present, but /type is. According to the spec, these are different
elements, but in the case in point they had the same meaning.

If PDFMiner is not running in STRICT mode and /Type doesn't resolve,
a fallback to /type is used to determine the tree type.
2017-07-20 20:46:35 +02:00
Attila Szász 938419c476 Align dumppdf tool to modified data structures. (#73)
* Align dumppdf tool to modified data structures.
TOC page numbers should also work now, counting from 1.

* Update version number.
2017-07-20 20:46:11 +02:00
Sergei Maertens d79612c455 Resolve unresolved PDFObjectRefs (#70)
Thank you !
2017-06-02 13:35:12 +02:00
Hugh Secker-Walker 488545ddc7 Add string expressions to asserts showing local data (#67) 2017-05-29 09:06:09 +02:00
Michał Pasternak fe21725f07 Please replace pycrypto with pycryptodome (#63)
* Enable 3.6 and replace pycrypto with cryptodome

* Upgrade version number
2017-05-29 09:04:38 +02:00
Anton Oleynick 4bc0a0c105 Update pdftypes.py (#61)
Fix errors with:
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 850, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 860, in render_contents
    self.init_resources(resources)
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 360, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 210, in get_font
    font = self.get_font(None, subspec)
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 201, in get_font
    font = PDFCIDFont(self, spec)
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdffont.py", line 667, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 297, in get_data
    self.decode()
  File "/app/python/lib/python3.5/site-packages/pdfminer/pdftypes.py", line 278, in decode
    if 'Predictor' in params:
TypeError: argument of type 'NoneType' is not iterable
2017-05-29 08:55:02 +02:00
Philippe Guglielmetti baddb25df6 v 20170419 (patches a stupid bug from yesterday...) 2017-04-19 14:24:13 +02:00
Philippe Guglielmetti 82af7f0aac issue #56 reproduced, solution attempt unsucessful 2017-04-19 14:19:14 +02:00
Philippe Guglielmetti cd92883925 logging (stupid bug) 2017-04-19 13:48:45 +02:00
Philippe Guglielmetti 11a4c8b6c1 version 20170418 2017-04-18 19:13:20 +02:00
Philippe Guglielmetti 7055862eaf solves https://github.com/pdfminer/pdfminer.six/issues/50 2017-04-18 18:20:31 +02:00
Sergei Maertens f2b0650ad5 Fixes #54 -- don't pass bytestrings through ord() (#55) 2017-04-18 16:57:53 +02:00
Andrew Baumann 9439a3a31a Miscellaneous bug fixes (#47)
* utils.decode_text: fix "TypeError: ord() expected string of length 1, but int found"

fixes https://github.com/goulu/pdfminer/issues/24

* pdfinterp.execute: don't assume that every keyword name can be decoded as utf-8

fixes "'str' does not support the buffer interface", https://github.com/goulu/pdfminer/issues/23

* default settings.STRICT to False, for compatibility with the original pdfminer

* PDFCIDFont: handle font registry/orderings that may be PDFObjRefs

* utils.nunpack: handle 8-byte integers
2017-02-06 14:57:01 +01:00
Philippe Guglielmetti 9b9d69aee9 image export works again with Py3 (issue #15)
https://github.com/pdfminer/pdfminer.six/issues/15
2017-01-20 10:11:19 +01:00
Philippe Guglielmetti f094f0b380 v. 20170119 RC 2017-01-19 08:42:20 +01:00
Philippe Guglielmetti 52feb22eeb Merge remote-tracking branch 'origin/master'
Conflicts:
	MANIFEST.in
	README.md
	pdfminer/latin_enc.py
	pdfminer/pdfdocument.py
	pdfminer/pdfinterp.py
	pdfminer/pdfpage.py
	pdfminer/pdftypes.py
	pdfminer/psparser.py
	pdfminer/utils.py
	samples/Makefile
	setup.py
2017-01-19 08:03:16 +01:00
Jin-tae Hwang 61d423d81c bugfix: if fontname is bytes then skip (#43) 2016-12-14 17:34:16 +01:00
Gabriel Augendre 6cc4abbaa8 Fix import of Django settings (#41)
Settings in Django are imported as such, see https://docs.djangoproject.com/en/1.10/topics/settings/#using-settings-in-python-code
2016-11-26 20:26:23 +01:00
Humberto Pereira e6ad15af79 Added painting information (#37)
* added color support to stroking and non stroking color spaces

* extended LTCurve, LTLine and LTRect to save painting information

* modified PDFLayoutAnalyzer to populate the shapes with painting information
2016-11-08 20:01:58 +01:00
Antonio Ercole De Luca 0fdebc6739 Removing all the "#!/usr/bin/env python" lines, they do not need for … (#34)
* Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19.

* Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env
Removed also the shebang from pdfminer/psparser.py file.
2016-11-08 20:01:11 +01:00
Yusuke Shinyama 8150458718 Added: a simpler ordering mode when 1<F. 2016-09-26 18:06:34 +09:00
Friedrich Lindenberg 447adcf02f fix STRICT reference 2016-09-24 12:03:22 +02:00
Friedrich Lindenberg 70918095cc Return an empty list when no `Differences` are found. 2016-09-24 11:57:11 +02:00
Friedrich Lindenberg 865246bd0c fix print, upstream: 0112112458 2016-09-23 15:04:07 +02:00
Friedrich Lindenberg 0cb13983f7 Backport LICENSE. 2016-09-23 14:57:28 +02:00
Friedrich Lindenberg 1820f96481 backport changes for upstream: #145, #95, #111, #117, #129, #132. 2016-09-23 14:31:31 +02:00
Jakub Wilk 5ddbecb551 Fix typos 2016-09-13 16:25:09 +02:00
Yusuke Shinyama 3068dcdb4a Merge pull request #145 from vinayak-mehta/glyphlist_link
Replace old Adobe glyphlist link
2016-09-12 00:18:24 +09:00
Yusuke Shinyama c753dbac4c Merge pull request #117 from native-api/png_pred_errors
make ValueError's descriptive
2016-09-11 23:55:34 +09:00
Yusuke Shinyama f1dd9ea6d2 Merge pull request #129 from lucanaso/lucanaso-patch-1
Fixed for rendering non breaking spaces (cid:160)
2016-09-11 23:53:03 +09:00
Yusuke Shinyama 177a4ab937 Fixed: #132 (PDFStream.get_filters: support multiple parameterless filters) 2016-09-11 23:52:13 +09:00
Yusuke Shinyama e95a483790 Merge pull request #134 from speedplane/feature/Fix-Get-Filters
Fix Bug with PDF Stream Decoder
2016-09-11 23:48:42 +09:00
Yusuke Shinyama 64fe538b24 Fixed: #114 (UnicodeEncodeError in PSLiteral) 2016-09-11 23:43:22 +09:00
Vinayak Mehta 2926002017 Replace old Adobe glyphlist link 2016-09-08 16:34:53 +05:30
Philippe Guglielmetti 881ea17553 v 20160614 2016-06-14 19:02:07 +02:00
speedplane 2049462f6f Revert changes unrelated to this branch. 2016-06-13 23:42:21 -04:00
speedplane b0b8818a41 Fix a bug with pdfminer which occurs when two or more filters are applied to a stream, even though no parameters are specified. The code would previously drop all of the streams after the first due to misapplication of the zip function. 2016-06-13 23:35:11 -04:00
Friedrich Lindenberg 1d54ecd31c Make the logger run in a namespace. 2016-05-20 21:12:05 +02:00
Philippe Guglielmetti 21fd2bbd23 v 20160202 with Py 2.6 & Py 3.5 support 2016-02-02 15:38:51 +01:00
Goulu 5a23fad6fd Merge pull request #14 from orangain/close-device
Close device to write footer of xml/html files
2016-01-18 11:22:35 +01:00
Goulu 2103e5875e Merge pull request #13 from orangain/include-cmap
Include compiled cmap resources to simplify installation for CJK languages
2016-01-18 11:22:08 +01:00
Steve Hair 92c71436b9 Improved settings management 2016-01-10 12:17:38 -05:00
orangain f8a051adbd Close device to write footer of xml/html files 2015-12-27 20:57:00 +09:00
orangain f1d5d681b6 Include compiled cmap resources to simplify installation for CJK languages
* Run `make cmap` and `git add pdfminer/cmap`.
* Modify MANIFEST.in not to include cmaprsrc dir in the sdist package.
* Add pdfminer/cmap/README.txt to include license in the sdist package.
* Remove installation guide specific to CJK languages from README.md.
2015-12-27 13:32:29 +09:00
lucanaso 63bb3caec2 Fixed for rendering non breaking spaces (cid:160)
As stated in the PDF specification ISO 32000-1, table in Annex D.2 Latin Character Set and Encodings page 653 to 656 (available here: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf):
"The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE."
The duplicate key was missing, therefore PDFMiner was returning the string "(cid:160)". 

This fix adds the duplicate key in latin_enc.py
glyphlist.py does not need to be modified as it already contains a key for non breaking space https://github.com/lucanaso/pdfminer/blob/master/pdfminer/glyphlist.py#L2755.
2015-12-09 16:47:32 +01:00
Chris Hager 8149be1669 bugfixes 2015-12-06 00:17:58 +01:00
Chris Hager 2e1be5721f removed settings.ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:34:18 +01:00
Chris Hager b686dd0139 pdfminer/settings.py for STRICT and added ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:28:08 +01:00
Ivan Pozdeev 63c9378b8b make ValueError's descriptive 2015-08-10 03:14:51 +03:00
Alex Zagorodniuk 131cb1ea92 change STRICT to be a settings attribute 2015-06-22 19:08:35 -04:00
Goulu 623bd98452 Update __init__.py
version 20150601
2015-06-01 10:21:51 +02:00
Cathal Garvey 403711ed13 Whoops, forgot to version-gate chardet in the actual code. Thanks Travis! 2015-05-30 19:33:35 +01:00
Cathal Garvey a2ad7a6d03 Fixed some bugs preventing all tests from passing in Py2. 2015-05-30 18:02:29 +01:00
Cathal Garvey 79c97ac221 Docstrings. 2015-05-30 17:16:06 +01:00
Cathal Garvey 3b7edba48c Forgot to add the actual compartmentalised function.. 2015-05-30 17:04:28 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
Yusuke Shinyama 14fd0fd2d6 Fixed: #84 (fontname was in unicode) 2015-04-05 19:02:02 +09:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizonal line and merges them together into a single horizontal line if necessary.  This leads to much better text extraction  if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart.  However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizonal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitspace. Prevents double spaces being caught as vert lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute.  Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
enkore d0379a2c44 Fix utils.decode_text 2014-12-04 17:09:52 +01:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
cybjit 515687e1bb more xrange to range 2014-09-16 23:17:31 +02:00
cybjit 9b2e29396b apply_png_predictor py3 2014-09-16 22:59:29 +02:00
cybjit ad05121c69 password py3 2014-09-16 22:59:00 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 2260f77b19 fix dict_value usage in strict mode 2014-09-16 22:57:29 +02:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
Goulu 8861d7e0ed version 20140915 pushed to PyPi as pdfminer_six 2014-09-15 10:33:04 +02:00
cybjit 39942b6642 avoid string formating when not logging 2014-09-12 00:29:31 +02:00