.. _acro_forms:

How to extract AcroForm interactive form fields from a PDF using PDFMiner
*************************************************************************

Before you start, make sure you have :ref:`installed pdfminer.six<install>`.

You also need a PDF that contains an AcroForm, such as a file with fillable fields or multiple-choice options. The GitHub repository has some examples under ``samples/acroform``.

Only AcroForm interactive forms are supported; XFA forms are not.

.. code-block:: python

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import PSLiteral, PSKeyword
    from pdfminer.utils import decode_text

    # placeholder path: point this at your own PDF with an AcroForm
    file_path = 'form.pdf'

    data = {}


    def decode_value(value):

        # decode PSLiteral, PSKeyword
        if isinstance(value, (PSLiteral, PSKeyword)):
            value = value.name

        # decode bytes
        if isinstance(value, bytes):
            value = decode_text(value)

        return value


    with open(file_path, 'rb') as fp:
        parser = PDFParser(fp)

        doc = PDFDocument(parser)
        res = resolve1(doc.catalog)

        if 'AcroForm' not in res:
            raise ValueError("No AcroForm Found")

        fields = resolve1(doc.catalog['AcroForm'])['Fields']  # may need further resolving

        for f in fields:
            field = resolve1(f)
            name, values = field.get('T'), field.get('V')

            # decode name
            name = decode_text(name)

            # resolve indirect obj
            values = resolve1(values)

            # decode value(s)
            if isinstance(values, list):
                values = [decode_value(v) for v in values]
            else:
                values = decode_value(values)

            data[name] = values

            print(name, values)

This snippet prints the name and value(s) of each field and stores them in the ``data`` dictionary.


How it works:

- Initialize the parser and the PDFDocument objects

.. code-block:: python

    parser = PDFParser(fp)
    doc = PDFDocument(parser)

- Get the catalog

  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)

.. code-block:: python

    res = resolve1(doc.catalog)

- Check if the catalog contains the AcroForm key and raise ValueError if not 

  (if this key is missing from the catalog, the PDF does not contain an AcroForm interactive form, see section 12.7.2 of the PDF 32000-1:2008 specs)

.. code-block:: python

    if 'AcroForm' not in res:
        raise ValueError("No AcroForm Found")
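
Since XFA forms are not supported, it can help to detect them before reading any fields. The sketch below is not part of pdfminer.six: ``describe_form`` is a hypothetical helper, and it assumes that an ``XFA`` entry in the interactive form dictionary marks an XFA form, possibly alongside regular AcroForm fields.

.. code-block:: python

    from pdfminer.pdftypes import resolve1


    def describe_form(catalog):
        """Classify the interactive form declared in a document catalog."""
        catalog = resolve1(catalog)
        if 'AcroForm' not in catalog:
            return 'none'

        acro_form = resolve1(catalog['AcroForm'])

        # assumption: any XFA entry means the form is at least partly XFA
        if acro_form.get('XFA') is not None:
            return 'xfa'

        return 'acroform'


    # plain dictionaries stand in for a parsed catalog here
    print(describe_form({'AcroForm': {'Fields': []}}))

With a real document, the call would be ``describe_form(doc.catalog)``.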

- Get the field list by resolving the AcroForm entry in the catalog

.. code-block:: python

    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for f in fields:
        field = resolve1(f)
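
The loop above visits only the top-level entries of ``Fields``. Forms can also nest fields through a ``Kids`` array, in which case the full field name is the partial names joined with periods (see section 12.7.3 of the specs). The following is a minimal sketch of such a traversal, not part of pdfminer.six: ``iter_fields`` is a hypothetical helper, and it assumes that kids without a ``T`` entry are widget annotations rather than child fields.

.. code-block:: python

    from pdfminer.pdftypes import resolve1
    from pdfminer.utils import decode_text


    def iter_fields(fields, prefix=''):
        """Yield (fully qualified name, field dictionary) for each terminal field."""
        for ref in resolve1(fields) or []:
            field = resolve1(ref)
            partial = resolve1(field.get('T'))
            if partial is None:
                # no partial name: treated as a widget annotation and skipped
                continue

            part = decode_text(partial) if isinstance(partial, bytes) else str(partial)
            name = prefix + '.' + part if prefix else part

            kids = resolve1(field.get('Kids')) or []
            child_fields = [k for k in kids if 'T' in resolve1(k)]
            if child_fields:
                # descend into named children, extending the qualified name
                yield from iter_fields(child_fields, name)
            else:
                yield name, field


    # plain dictionaries stand in for resolved PDF objects here
    sample = [
        {'T': b'address', 'Kids': [{'T': b'street'}, {'T': b'city'}]},
        {'T': b'agree', 'Kids': [{'Subtype': 'Widget'}]},
    ]
    for name, field in iter_fields(sample):
        print(name)

With a real document, the ``fields`` list from the step above can be passed in place of ``sample``. The sketch does not guard against malformed files whose ``Kids`` arrays refer back to a parent.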

- Get field name and field value(s)

.. code-block:: python

    name, values = field.get('T'), field.get('V')

- Decode the field name

.. code-block:: python

    name = decode_text(name)

- Resolve indirect field value objects

.. code-block:: python

    values = resolve1(values)

- Call the value(s) decoding method as needed

  (a single field can hold multiple values, for example a combo box can hold more than one value at a time)

.. code-block:: python

    if isinstance(values, list):
        values = [decode_value(v) for v in values]
    else:
        values = decode_value(values)
        
(the ``decode_value`` function decodes a field's value, returning a string for literal, keyword and bytes values)

- Decode PSLiteral and PSKeyword field values

.. code-block:: python

    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name

- Decode bytes field values

.. code-block:: python

    if isinstance(value, bytes):
        value = decode_text(value)
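
To see the two decoding branches in isolation, the sketch below calls ``decode_value`` on hand-made values instead of a parsed PDF. The sample values are made up for illustration; ``LIT`` is the pdfminer helper that builds a ``PSLiteral``, similar to the name object a checkbox state would be parsed into.

.. code-block:: python

    from pdfminer.psparser import LIT, PSKeyword, PSLiteral
    from pdfminer.utils import decode_text


    def decode_value(value):
        # same function as in the snippet at the top of this page
        if isinstance(value, (PSLiteral, PSKeyword)):
            value = value.name
        if isinstance(value, bytes):
            value = decode_text(value)
        return value


    # a name object, such as a checkbox state
    checkbox_state = decode_value(LIT('Yes'))

    # a text string in PDFDocEncoding
    plain_text = decode_value(b'John')

    # a text string in UTF-16BE, marked by the leading byte order mark
    utf16_text = decode_value(b'\xfe\xff\x00J\x00o')

    print(checkbox_state, plain_text, utf16_text)

Values of any other type, such as numbers or ``None`` for an empty field, pass through unchanged.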