Commit Graph

97 Commits (ec223d1f1d2483069de0e4b80e7fcfb9d740c6ed)

Author SHA1 Message Date
Jeremy Singer-Vine e83dd26671
Fix .paint_path for non-rectangle quadrilaterals (#512)
* Fix paint_path bug noted in issue #473

Focuses on the handling of non-rect quadrilaterals, the decomposition of
complex (m.*h)* paths into subpaths, and assigning those subpaths the
correct LTCurve/LTRect type.

Also adds a test for cases presented in issue #473

* Tweak paint_path fix per @pietermarsman review

- Adjusts logic to adhere to if-elif-else rather than early returns.

- Shortens subpath detection/reprocessing step, using re.finditer().

* Reorder paint_path() if-else statements once more

* Fix flake8 issues

* Fix error: should select item 1 and 2 from the list, and possible items [3, 4], and so on.

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-10-12 17:53:00 +02:00
Kwok-kuen Cheung 60863cfd55
Fix converting path to multiple rectangles (#371)
* Fix converting path to multiple rectangles

For path that consists of a series of rectangles
(shape is 'mlllhmlllh...'), call paint_path again with each group of
5 points. The result is multiple rects instead of a single curve.

fixes #369

* Reduce pdf size by removing font

* Add unittest for PDFLayoutAnalyzer.paint_path()

* Add line to CHANGELOG.md

* Add reference to pdf reference manual

* Cleanup function paint_path a bit

* Reduce line length of tests

* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-07-11 17:34:38 +02:00
Pieter Marsman 410d7ecac3
Fix value for font-family in html by removing the subset tag from the PDF font-name (#357)
* Fix font name by removing subset tag

* Added line to CHANGELOG.md

* Add documentation and clear variable name

* Use `html.escape()` to encode strings for html and always return `str` instead of `bytes`
2020-01-16 22:25:20 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refractor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before commiting

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors togo

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exists

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman bc034c8e59
Create sphinx documentation for Read the Docs (#329)
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for common used functions and classes
Removed: old documentation
2019-11-07 21:12:34 +01:00
Quentin Pradet 94f3d61bb2
converter: Fix XML syntax 2018-03-06 14:41:52 +04:00
Quentin Pradet 2231f0892e
Send non-stroke color to XML conversion
Inspired by https://github.com/euske/pdfminer/pull/158 from @andruo11
and https://github.com/euske/pdfminer/pull/197 from @staccatosound.
2018-03-06 14:11:48 +04:00
Venelin Stoykov c2432c32f1 Fix assert message for PDFLayoutAnalyzer.end_page (#80)
stack is undefined
2017-08-18 08:08:08 +02:00
Hugh Secker-Walker 488545ddc7 Add string expressions to asserts showing local data (#67) 2017-05-29 09:06:09 +02:00
Philippe Guglielmetti 52feb22eeb Merge remote-tracking branch 'origin/master'
Conflicts:
	MANIFEST.in
	README.md
	pdfminer/latin_enc.py
	pdfminer/pdfdocument.py
	pdfminer/pdfinterp.py
	pdfminer/pdfpage.py
	pdfminer/pdftypes.py
	pdfminer/psparser.py
	pdfminer/utils.py
	samples/Makefile
	setup.py
2017-01-19 08:03:16 +01:00
Humberto Pereira e6ad15af79 Added painting information (#37)
* added color support to stroking and non stroking color spaces

* extended LTCurve, LTLine and LTRect to save painting information

* modified PDFLayoutAnalyzer to populate the shapes with painting information
2016-11-08 20:01:58 +01:00
Antonio Ercole De Luca 0fdebc6739 Removing all the "#!/usr/bin/env python" lines, they do not need for … (#34)
* Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19.

* Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env
Removed also the shebang from pdfminer/psparser.py file.
2016-11-08 20:01:11 +01:00
Jakub Wilk 5ddbecb551 Fix typos 2016-09-13 16:25:09 +02:00
Friedrich Lindenberg 1d54ecd31c Make the logger run in a namespace. 2016-05-20 21:12:05 +02:00
Cathal Garvey a2ad7a6d03 Fixed some bugs preventing all tests from passing in Py2. 2015-05-30 18:02:29 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
Yusuke Shinyama 14fd0fd2d6 Fixed: #84 (fontname was in unicode) 2015-04-05 19:02:02 +09:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
cybjit 39942b6642 avoid string formating when not logging 2014-09-12 00:29:31 +02:00
cybjit f9a67db89b change xrange to range 2014-09-07 18:36:12 +02:00
cybjit 0a2d90c051 pdf2txt: do not double encode stdout 2014-09-07 18:34:11 +02:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
Yusuke Shinyama 8791355e1d Cleanup imports. Use relative imports. 2014-06-26 18:12:39 +09:00
Yusuke Shinyama 44074b42ea Added: stripcontrol for XMLConverter (-S option) 2014-06-22 00:33:00 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00
Yusuke Shinyama 8e14ebf4e1 Use logging module instead of print. 2014-06-14 12:00:49 +09:00
Yusuke Shinyama 2b56b2eedf Merged. 2013-11-07 19:50:41 +09:00
Matthew Duggan 2caa5edc25 PEP8: Whitespace changes to match pep8 2013-11-07 17:35:04 +09:00
Matthew Duggan c1da8b835c PEP8: Remove trailing whitespace 2013-11-07 16:14:53 +09:00
Matthew Duggan 10a68c83bd Remove unused imports identified by pyflakes 2013-11-07 16:09:44 +09:00
Yusuke Shinyama 02ad086f6a fixed: HTMLConverter. 2013-10-25 18:10:40 +09:00
Yusuke Shinyama 0ea08890d4 renamed: python2 -> python. 2013-10-17 23:05:27 +09:00
Yusuke Shinyama 82ff98c7b3 imagewriter now works with text output 2011-11-07 01:15:10 +10:00
Yusuke Shinyama dc8fde0e47 added CCITTFaxFilter support and a very crude image extraction. 2011-07-18 21:07:00 +10:00
Yusuke Shinyama 170c97a12b colorspace patch by Lieb Simon 2011-06-06 17:10:12 +09:00
Yusuke Shinyama 0c41b8348e code cleanup 2011-05-14 15:51:40 +09:00
Yusuke Shinyama 038ce4cd0c added LTText.get_text() and .text property is no longer accessible. 2011-05-14 15:45:08 +09:00
Yusuke Shinyama 095534b294 figure object now does not call analyze. 2011-05-14 14:17:22 +09:00
Yusuke Shinyama 0e660dd385 rename: LTPolygon -> LTCurve 2011-04-20 22:05:25 +09:00
Yusuke Shinyama dab70855bf LTLine is now strictly horizontal or vertical. 2011-04-20 22:01:54 +09:00
Jonathan J Hunt ec682539da Optimized memory usage in TextConverter by ignoring all drawing commands. 2011-03-07 15:11:31 +10:00
Yusuke Shinyama 7dbb664db3 code cleanup and more debugging options 2011-02-14 23:42:05 +09:00
Yusuke Shinyama b2d13db29a code cleanup 2011-02-14 22:51:20 +09:00
Yusuke Shinyama 4eb6083c09 code cleanup 2011-01-03 18:11:22 +09:00
Yusuke Shinyama 3da3adad9b method renamed: finish(self) -> analyze(self, laparams). 2010-12-26 16:56:21 +09:00
yusuke.shinyama.dummy 84ed94aec0 another bugfix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@281 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:41:03 +00:00
yusuke.shinyama.dummy 9bba7ac08b oops, forgot to fix this
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@280 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:40:58 +00:00
yusuke.shinyama.dummy 9f78915ea6 show cid for unknown characters
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@275 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-23 10:53:19 +00:00