Commit Graph

146 Commits (1a4a06da9fe295920e23311e12f22a37a2799899)

Author SHA1 Message Date
Pieter Marsman 1d773dc38a
Fix grouping textlines when bounding box of parent container is wrong (#386)
* Default value for --all-texts should be false, because using the flag enables it

* Fix edge case: when no neighbors are found a line should form its own text box

* Added test for grouping textlines where 1 is outside the parent bounding box

* Added CHANGELOG.md line
2020-03-14 10:33:39 +01:00
Pieter Marsman 52da65d5eb
Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's. (#360)
* Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's.

* Added line to CHANGELOG.md
2020-01-16 22:26:01 +01:00
Pieter Marsman 2f7f5d2667
Fallback on backwards-compatible key (F) for embedded files URL's when the unicode URL (UF) does not exist (#338)
* Fix getting filename when extracting embedded files

* Add test for pdf that contains embedded pdf, and fix additional errors in looping over multiple xrefs

* Add line to CHANGELOG
2020-01-16 22:11:42 +01:00
Pieter Marsman 3502dc9f3b
Drop support for legacy Python 2 (#346)
* Drop support for legacy Python 2

* Add python_requires to help pip

* Upgrade Python syntax with pyupgrade

* Upgrade Python syntax with pyupgrade --py3-plus

* Python 3 imports

* Replace six

* Update CONTRIBUTING.md

* Added line to changelog

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
2020-01-04 16:47:07 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refractor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before commiting

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors togo

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exists

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Martin Hasoň 78f06225b6 Removed duplicated and therefore unused code from pdf2txt.py (#341) 2019-12-09 22:04:05 +01:00
Pieter Marsman bc034c8e59
Create sphinx documentation for Read the Docs (#329)
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for common used functions and classes
Removed: old documentation
2019-11-07 21:12:34 +01:00
Martin Hasoň ed1b09c6f2 Fix debug logging for pdf2txt.py and dumppdf.py (#325)
Fixes #313
2019-11-06 21:47:19 +01:00
Pieter Marsman 33b16b3f07
Deprecate the use of _py2_no_more_posargs (#328)
Fixes #324
2019-11-02 10:29:39 +01:00
Pieter Marsman 6cc78ee124
Replace opts by argparse in dumppdf.py (#321)
Also add multi-character argument names
Fixes #175
2019-10-27 21:40:04 +01:00
Pieter Marsman d88d6020a2
Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320)
Fixes #314 
Fixes #105
2019-10-26 19:16:37 +02:00
Igor Moura 540df9f676 Replaced .iteritems() and with six.iteritems() for Python 3 compat
This is a squashed commit, the previous messages can be seen bellow

This is the 1st commit message:

Replaced .iteritems() usage for .items()

Fixed some python 2 leftovers, as discussed in #267. Also formatted code according to Black.\nThis possibly breaks some python 2 compatibility

This is the commit message #2:

Reverted formatting and more spread six usage
2019-07-24 14:08:30 -03:00
Wm Bentley 495c92e050 Move argparse object setup out of main to separate function.
As preparation for implementing Sphinx documentation, create a
separate function that builds and returns the argparse parser.
Move import argparse out of main to the top of the file.
2018-08-12 21:07:52 -07:00
Martin Wolf eff3f19886 Merge remote-tracking branch 'upstream/master' 2018-06-25 23:32:52 +02:00
Charles Reid 7b08cdbff9 apply dos2unix to files in pdfminer/ and tools/ to remove \r\n windows line endings 2018-06-21 12:19:48 -07:00
Martin Wolf 26f80715ed Merge remote-tracking branch 'upstream/master' 2018-06-20 13:27:18 +02:00
Martin Wolf 4bdb3ba8cc Fixes needed to be able to compile pdfminer.six with Cython 2018-04-12 00:05:38 +02:00
Andy Kluger ed7d8308d9 -P is *not* for page numbers, but passwords, so reflect that in the help text 2018-04-03 12:26:01 -04:00
Kohei YOSHIDA baf3cd0c2c use except Exception as e clause for Python 3 compatibility 2018-02-19 23:32:36 +09:00
Guglielmetti Philippe 6d3210d206 pdfdiff tool (and .spec files for compilation with pyinstaller) 2017-11-21 10:48:45 +01:00
Attila Szász 938419c476 Align dumppdf tool to modified data structures. (#73)
* Align dumppdf tool to modified data structures.
TOC page numbers should also work now, counting from 1.

* Update version number.
2017-07-20 20:46:11 +02:00
Hugh Secker-Walker 35a58ee5b5 Add tools/pdfstats.py which counts all LT* types in a PDF (#68) 2017-05-29 09:11:58 +02:00
Hugh Secker-Walker 488545ddc7 Add string expressions to asserts showing local data (#67) 2017-05-29 09:06:09 +02:00
Philippe Guglielmetti 52feb22eeb Merge remote-tracking branch 'origin/master'
Conflicts:
	MANIFEST.in
	README.md
	pdfminer/latin_enc.py
	pdfminer/pdfdocument.py
	pdfminer/pdfinterp.py
	pdfminer/pdfpage.py
	pdfminer/pdftypes.py
	pdfminer/psparser.py
	pdfminer/utils.py
	samples/Makefile
	setup.py
2017-01-19 08:03:16 +01:00
Antonio Ercole De Luca 0fdebc6739 Removing all the "#!/usr/bin/env python" lines, they do not need for … (#34)
* Removing all the "#!/usr/bin/env python" lines, they do not need for python3, solving issue number: #19.

* Restored all the shebangs in the tools and tests folders (because they are real executables) but used "#!/usr/bin/env python" instead of "#!/usr/bin/python" as this blog points out: https://www.peterbe.com/plog/importance-of-env
Removed also the shebang from pdfminer/psparser.py file.
2016-11-08 20:01:11 +01:00
Jakub Wilk 5ddbecb551 Fix typos 2016-09-13 16:25:09 +02:00
Friedrich Lindenberg 1d54ecd31c Make the logger run in a namespace. 2016-05-20 21:12:05 +02:00
Ivan Teoh 2c8f226907 Fix issues #20 - NameError: global name 'ImageWriter' is not defined 2016-04-26 12:38:42 +10:00
Chris Hager 2e1be5721f removed settings.ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:34:18 +01:00
Chris Hager b686dd0139 pdfminer/settings.py for STRICT and added ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:28:08 +01:00
Cathal Garvey 268e9fb2bd Removed typechecking, nothing's exploded yet and argparse does lots of heavy lifting already. 2015-05-30 17:05:28 +01:00
Cathal Garvey b3553cef10 Cleaning up pdf2txt.py after the partition/move. 2015-05-30 17:03:55 +01:00
Cathal Garvey cbe270a4bf Killed the old main function for pdf2txt.py 2015-05-30 16:37:22 +01:00
Cathal Garvey ead8e778a6 Successfully compartmentalised code, getting closer to moving pdf->text as a module function. 2015-05-30 16:27:58 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
cybjit 2639b15ef4 guess argv encoding in py2 using sys.stdin.encoding 2014-09-16 23:17:26 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 714423883c setup logging for pdf2txt and fix dumppdf 2014-09-12 00:29:31 +02:00
cybjit ed13f7c47d conv_cmap py3 compat 2014-09-12 00:29:30 +02:00
cybjit 0a2d90c051 pdf2txt: do not double encode stdout 2014-09-07 18:34:11 +02:00
unknown 28c2a4e6ad 2.7/3.4 encoding corrected 2014-09-04 10:31:33 +02:00
unknown 7b610b34be tools must be a module to enable scripts tests 2014-09-04 09:47:33 +02:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
unknown a6475b61b4 Python 3.4 support added and tested 2014-09-03 13:17:41 +02:00
Yusuke Shinyama fe86b4e64e Changed: StringIO -> io.BytesIO 2014-06-25 19:55:41 +09:00
Yusuke Shinyama 44074b42ea Added: stripcontrol for XMLConverter (-S option) 2014-06-22 00:33:00 +09:00
Yusuke Shinyama bb866ae148 Changed: new except syntax (2.6 or above). 2014-06-16 18:50:07 +09:00
Yusuke Shinyama 28e96ba3d0 Use print as a function. 2014-06-15 12:14:33 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00