pdfminer.six/tests/test_tools_pdf2txt.py

import os
from shutil import rmtree
from tempfile import NamedTemporaryFile, mkdtemp

import tools.pdf2txt as pdf2txt
from helpers import absolute_sample_path


def run(sample_path, options=None):
    """Run pdf2txt on a sample file, writing the output to a temporary file."""
    absolute_path = absolute_sample_path(sample_path)
    with NamedTemporaryFile() as output_file:
        # Build the argument list directly so that a path containing spaces
        # is passed to pdf2txt as a single argument.
        args = ['-o', output_file.name]
        if options:
            args.extend(options.split())
        args.append(absolute_path)
        pdf2txt.main(args)


class TestPdf2Txt:
    def test_jo(self):
        run('jo.pdf')

    def test_simple1(self):
        run('simple1.pdf')

    def test_simple2(self):
        run('simple2.pdf')

    def test_simple3(self):
        run('simple3.pdf')

    def test_sample_one_byte_identity_encode(self):
        run('sampleOneByteIdentityEncode.pdf')

    def test_nonfree_175(self):
        """Regression test for:
        https://github.com/pdfminer/pdfminer.six/issues/65
        """
        run('nonfree/175.pdf')

    def test_nonfree_dmca(self):
        run('nonfree/dmca.pdf')

    def test_nonfree_f1040nr(self):
        run('nonfree/f1040nr.pdf')

    def test_nonfree_i1040nr(self):
        run('nonfree/i1040nr.pdf')

    def test_nonfree_kampo(self):
        run('nonfree/kampo.pdf')

    def test_nonfree_naacl06_shinyama(self):
        run('nonfree/naacl06-shinyama.pdf')

    def test_nlp2004slides(self):
        run('nonfree/nlp2004slides.pdf')

    def test_contrib_2b(self):
        run('contrib/2b.pdf', '-A -t xml')

    def test_scancode_patchelf(self):
        """Regression test for # https://github.com/euske/pdfminer/issues/96"""
        run('scancode/patchelf.pdf')

    def test_contrib_hash_two_complement(self):
        """Check that unsigned integer is added correctly to encryption hash.

        See https://github.com/pdfminer/pdfminer.six/issues/186
        """
        run('contrib/issue-00352-hash-twos-complement.pdf')


class TestDumpImages:

    @staticmethod
    def extract_images(input_file):
        """Extract images with pdf2txt and return their file names."""
        output_dir = mkdtemp()
        try:
            with NamedTemporaryFile() as output_file:
                commands = ['-o', output_file.name, '--output-dir',
                            output_dir, input_file]
                pdf2txt.main(commands)
            image_files = os.listdir(output_dir)
        finally:
            # Remove the temporary directory even if pdf2txt raises.
            rmtree(output_dir)
        return image_files

    def test_nonfree_dmca(self):
        """Extract images of pdf containing bmp images

        Regression test for:
        https://github.com/pdfminer/pdfminer.six/issues/131
        """
        image_files = self.extract_images(
            absolute_sample_path('../samples/nonfree/dmca.pdf'))
        assert image_files[0].endswith('bmp')

    def test_nonfree_175(self):
        """Extract images of pdf containing jpg images"""
        self.extract_images(absolute_sample_path('../samples/nonfree/175.pdf'))

    def test_jbig2_image_export(self):
        """Extract images of pdf containing jbig2 images

        Feature test for: https://github.com/pdfminer/pdfminer.six/pull/46
        """
        image_files = self.extract_images(
            absolute_sample_path('../samples/contrib/pdf-with-jbig2.pdf'))
        assert image_files[0].endswith('.jb2')

    def test_contrib_matplotlib(self):
        """Test a pdf with Type3 font"""
        run('contrib/matplotlib.pdf')

    def test_nonfree_cmp_itext_logo(self):
        """Test a pdf with Type3 font"""
        run('nonfree/cmp_itext_logo.pdf')