# pdfminer.six/tests/test_tools_pdf2txt.py

import os
from shutil import rmtree
from tempfile import NamedTemporaryFile, mkdtemp
import tools.pdf2txt as pdf2txt
from helpers import absolute_sample_path
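

def run(sample_path, options=None):
    """Run pdf2txt on a sample file, writing to a temporary output file."""
    absolute_path = absolute_sample_path(sample_path)
    with NamedTemporaryFile() as output_file:
        if options:
            s = 'pdf2txt -o{} {} {}' \
                .format(output_file.name, options, absolute_path)
        else:
            s = 'pdf2txt -o{} {}'.format(output_file.name, absolute_path)
        # Drop the leading program name. Splitting on single spaces means
        # that paths containing spaces are not supported here.
        pdf2txt.main(s.split(' ')[1:])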
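

# A minimal sketch, not part of the original module: the same arguments that
# run() assembles above, built as a list so that a path containing spaces is
# not split apart. `build_args` is a hypothetical helper name, and it assumes
# that options are whitespace-separated flags such as '-A -t xml'.
def build_args(output_name, absolute_path, options=None):
    args = ['-o', output_name]
    if options:
        # str.split() with no argument also collapses repeated whitespace.
        args.extend(options.split())
    args.append(absolute_path)
    return args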


class TestPdf2Txt():
    def test_jo(self):
        run('jo.pdf')

    def test_simple1(self):
        run('simple1.pdf')

    def test_simple2(self):
        run('simple2.pdf')

    def test_simple3(self):
        run('simple3.pdf')

    def test_sample_one_byte_identity_encode(self):
        run('sampleOneByteIdentityEncode.pdf')

    def test_nonfree_175(self):
        """Regression test for:
        https://github.com/pdfminer/pdfminer.six/issues/65
        """
        run('nonfree/175.pdf')

    def test_nonfree_dmca(self):
        run('nonfree/dmca.pdf')

    def test_nonfree_f1040nr(self):
        run('nonfree/f1040nr.pdf')

    def test_nonfree_i1040nr(self):
        run('nonfree/i1040nr.pdf')

    def test_nonfree_kampo(self):
        run('nonfree/kampo.pdf')

    def test_nonfree_naacl06_shinyama(self):
        run('nonfree/naacl06-shinyama.pdf')

    def test_nlp2004slides(self):
        run('nonfree/nlp2004slides.pdf')

    def test_contrib_2b(self):
        run('contrib/2b.pdf', '-A -t xml')

    def test_scancode_patchelf(self):
        """Regression test for https://github.com/euske/pdfminer/issues/96"""
        run('scancode/patchelf.pdf')

    def test_contrib_hash_two_complement(self):
        """Check that unsigned integer is added correctly to encryption hash.

        See https://github.com/pdfminer/pdfminer.six/issues/186
        """
        run('contrib/issue-00352-hash-twos-complement.pdf')


class TestDumpImages:

    @staticmethod
    def extract_images(input_file):
        output_dir = mkdtemp()
        with NamedTemporaryFile() as output_file:
            commands = ['-o', output_file.name, '--output-dir',
                        output_dir, input_file]
            pdf2txt.main(commands)
        image_files = os.listdir(output_dir)
        rmtree(output_dir)
        return image_files

    def test_nonfree_dmca(self):
        """Extract images of pdf containing bmp images

        Regression test for:
        https://github.com/pdfminer/pdfminer.six/issues/131
        """
        image_files = self.extract_images(
            absolute_sample_path('../samples/nonfree/dmca.pdf'))
        assert image_files[0].endswith('bmp')

    def test_nonfree_175(self):
        """Extract images of pdf containing jpg images"""
        self.extract_images(absolute_sample_path('../samples/nonfree/175.pdf'))

    def test_jbig2_image_export(self):
        """Extract images of pdf containing jbig2 images

        Feature test for: https://github.com/pdfminer/pdfminer.six/pull/46
        """
        image_files = self.extract_images(
            absolute_sample_path('../samples/contrib/pdf-with-jbig2.pdf'))
        assert image_files[0].endswith('.jb2')

    def test_contrib_matplotlib(self):
        """Test a pdf with Type3 font"""
        run('contrib/matplotlib.pdf')

    def test_nonfree_cmp_itext_logo(self):
        """Test a pdf with Type3 font"""
        run('nonfree/cmp_itext_logo.pdf')