pdfminer.six/tests/test_pdfdocument.py

import itertools

from nose.tools import assert_equal, raises

from helpers import absolute_sample_path
from pdfminer.pdfdocument import PDFDocument, PDFNoPageLabels
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import PDFObjectNotFound, dict_value, int_value


class TestPdfDocument(object):

    @raises(PDFObjectNotFound)
    def test_get_zero_objid_raises_pdfobjectnotfound(self):
        with open(absolute_sample_path('simple1.pdf'), 'rb') as in_file:
            parser = PDFParser(in_file)
            doc = PDFDocument(parser)
            doc.getobj(0)

    def test_encrypted_no_id(self):
        # Some documents may be encrypted but not have an /ID key in
        # their trailer. Regression test for
        # https://github.com/pdfminer/pdfminer.six/issues/594
        path = absolute_sample_path('encryption/encrypted_doc_no_id.pdf')
        with open(path, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
            assert_equal(doc.info,
                         [{'Producer': b'European Patent Office'}])

    def test_page_labels(self):
        path = absolute_sample_path('contrib/pagelabels.pdf')
        with open(path, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
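            # Read the page count from the catalog so that only as many
            # labels as the document has pages are taken from the iterator.
            total_pages = int_value(dict_value(doc.catalog['Pages'])['Count'])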
            assert_equal(
                list(itertools.islice(doc.get_page_labels(), total_pages)),
                ['iii', 'iv', '1', '2', '1'])

    @raises(PDFNoPageLabels)
    def test_no_page_labels(self):
        path = absolute_sample_path('simple1.pdf')
        with open(path, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
            doc.get_page_labels()

Commit history for this file:

* 2019-10-25 20:49:58 +00:00: Fix assertionerror when dumping pdf with reference to objid 0 (#318). Fixes #94. Added: test to get check if `PDFObjectNotFound` error is raised if objid 0 is requested.
* 2019-10-26 16:42:33 +00:00: Fix failing test on develop & cleaning up test files (#319)
* 2021-08-29 19:32:14 +00:00: Fix 594 use null id when encrypted but no id given (#595). Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
* 2022-02-01 09:08:05 +00:00: Added feature: page labels (#680). Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
    * port page label code from pdfannots
    * add tests and clean up
    * more cleanup; harden against non-conforming input
    * one more test
    * update CHANGELOG
    * cleanup & respond to review feedback (incomplete)
    * Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.
    * PageLabels is a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.
    * fix type errors and cleanup slightly
    * fix mypy errors (including tweaking code to avoid problematic dynamic types)
    * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
    * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end