pdfminer.six/tests/test_pdfdocument.py

import itertools

import pytest

from helpers import absolute_sample_path
from pdfminer.pdfdocument import PDFDocument, PDFNoPageLabels
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import PDFObjectNotFound, dict_value, int_value


class TestPdfDocument:
    def test_get_zero_objid_raises_pdfobjectnotfound(self):
        with open(absolute_sample_path("simple1.pdf"), "rb") as in_file:
            parser = PDFParser(in_file)
            doc = PDFDocument(parser)
            with pytest.raises(PDFObjectNotFound):
                doc.getobj(0)

    def test_encrypted_no_id(self):
        # Some documents are encrypted but have no /ID key in their
        # trailer. Regression test for
        # https://github.com/pdfminer/pdfminer.six/issues/594
        path = absolute_sample_path("encryption/encrypted_doc_no_id.pdf")
        with open(path, "rb") as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
            assert doc.info == [{"Producer": b"European Patent Office"}]

    def test_page_labels(self):
        path = absolute_sample_path("contrib/pagelabels.pdf")
        with open(path, "rb") as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
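            # Read the page count from the catalog so that the label
            # iterator can be bounded with islice below.
            total_pages = int_value(dict_value(doc.catalog["Pages"])["Count"])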
            assert list(itertools.islice(doc.get_page_labels(), total_pages)) == [
                "iii",
                "iv",
                "1",
                "2",
                "1",
            ]

    def test_no_page_labels(self):
        path = absolute_sample_path("simple1.pdf")
        with open(path, "rb") as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)

            with pytest.raises(PDFNoPageLabels):
                doc.get_page_labels()
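
The label sequence expected in `test_page_labels` ("iii", "iv", "1", "2", "1") is the kind of output a PDF page-label number tree produces when each range carries its own numbering style and start value. The following is a minimal, standalone sketch of that idea, not pdfminer's implementation: `to_roman`, `sketch_page_labels`, and `HYPOTHETICAL_RANGES` are illustrative names, and the range layout is an assumption chosen to reproduce the expected sequence, not read from `contrib/pagelabels.pdf`. It also assumes the last range continues indefinitely, which is one reason a caller might bound the iterator with `itertools.islice` as the test does.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple

# (first page index, style, start value); "r" = lowercase Roman, "D" = decimal.
LabelRange = Tuple[int, str, int]


def to_roman(n: int) -> str:
    """Format a positive integer as a lowercase Roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def sketch_page_labels(ranges: List[LabelRange]) -> Iterator[str]:
    """Yield labels range by range; the final range never ends (assumption)."""
    for i, (first, style, start) in enumerate(ranges):
        values: Iterable[int]
        if i + 1 < len(ranges):
            # A range covers the pages up to the next range's first index.
            length = ranges[i + 1][0] - first
            values = range(start, start + length)
        else:
            values = itertools.count(start)
        for value in values:
            yield to_roman(value) if style == "r" else str(value)


# Hypothetical layout: two Roman-numbered pages starting at iii, then two
# decimal ranges that each restart at 1.
HYPOTHETICAL_RANGES: List[LabelRange] = [(0, "r", 3), (2, "D", 1), (4, "D", 1)]

labels = list(itertools.islice(sketch_page_labels(HYPOTHETICAL_RANGES), 5))
print(labels)
```

Bounding with `islice` keeps the consumer in control of how many labels are produced, so the generator does not need to know the page count.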