diff --git a/README.md b/README.md
index 0285d95..e2552a6 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,7 @@ Features
  * Various font types (Type1, TrueType, Type3, and CID) support.
  * Support for extracting images (JPG, JBIG2 and Bitmaps).
  * Support for RC4 and AES encryption.
+ * Support for AcroForm interactive form extraction.
  * Table of contents extraction.
  * Tagged contents extraction.
  * Automatic layout analysis.
diff --git a/docs/source/howto/acro_forms.rst b/docs/source/howto/acro_forms.rst
new file mode 100644
index 0000000..23444ff
--- /dev/null
+++ b/docs/source/howto/acro_forms.rst
@@ -0,0 +1,145 @@
+.. _acro_forms:
+
+How to extract AcroForm interactive form fields from a PDF using PDFMiner
+*************************************************************************
+
+Before you start, make sure you have :ref:`installed pdfminer.six`.
+
+The second thing you need is a PDF with AcroForms (as found in PDF files with fillable forms or multiple-choice fields). There are some examples of these in the GitHub repository under ``samples/acroform``.
+
+Only AcroForm interactive forms are supported; XFA forms are not.
+
+.. code-block:: python
+
+    from pdfminer.pdfparser import PDFParser
+    from pdfminer.pdfdocument import PDFDocument
+    from pdfminer.pdftypes import resolve1
+    from pdfminer.psparser import PSLiteral, PSKeyword
+    from pdfminer.utils import decode_text
+
+
+    data = {}
+
+
+    def decode_value(value):
+
+        # decode PSLiteral, PSKeyword
+        if isinstance(value, (PSLiteral, PSKeyword)):
+            value = value.name
+
+        # decode bytes
+        if isinstance(value, bytes):
+            value = decode_text(value)
+
+        return value
+
+
+    with open(file_path, 'rb') as fp:
+        parser = PDFParser(fp)
+
+        doc = PDFDocument(parser)
+        res = resolve1(doc.catalog)
+
+        if 'AcroForm' not in res:
+            raise ValueError("No AcroForm Found")
+
+        fields = resolve1(doc.catalog['AcroForm'])['Fields']  # may need further resolving
+
+        for f in fields:
+            field = resolve1(f)
+            name, values = field.get('T'), field.get('V')
+
+            # decode name
+            name = decode_text(name)
+
+            # resolve indirect obj
+            values = resolve1(values)
+
+            # decode value(s)
+            if isinstance(values, list):
+                values = [decode_value(v) for v in values]
+            else:
+                values = decode_value(values)
+
+            data.update({name: values})
+
+            print(name, values)
+
+This code snippet prints the name and value of every field and saves them in the ``data`` dictionary.
+
+
+How it works:
+
+- Initialize the parser and the PDFDocument objects
+
+.. code-block:: python
+
+    parser = PDFParser(fp)
+    doc = PDFDocument(parser)
+
+- Get the catalog
+  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of the PDF 32000-1:2008 specification: https://www.adobe.com/devnet/pdf/pdf_reference.html)
+
+.. code-block:: python
+
+    res = resolve1(doc.catalog)
+
+- Check if the catalog contains the AcroForm key and raise ValueError if not
+  (if this key is missing from the catalog, the PDF does not contain AcroForm interactive forms, see section 12.7.2 of the PDF 32000-1:2008 specification)
+
+.. code-block:: python
+
+    if 'AcroForm' not in res:
+        raise ValueError("No AcroForm Found")
+
+- Get the field list by resolving the AcroForm entry in the catalog
+
+.. code-block:: python
+
+    fields = resolve1(doc.catalog['AcroForm'])['Fields']
+    for f in fields:
+        field = resolve1(f)
+
+- Get field name and field value(s)
+
+.. code-block:: python
+
+    name, values = field.get('T'), field.get('V')
+
+- Decode field name
+
+.. code-block:: python
+
+    name = decode_text(name)
+
+- Resolve indirect field value objects
+
+.. code-block:: python
+
+    values = resolve1(values)
+
+- Call the value(s) decoding function as needed
+  (a single field can hold multiple values, for example a list box that allows multiple selections can hold more than one value at a time)
+
+.. code-block:: python
+
+    if isinstance(values, list):
+        values = [decode_value(v) for v in values]
+    else:
+        values = decode_value(values)
+
+(the ``decode_value`` function decodes a field value, returning a string)
+
+- Decode PSLiteral and PSKeyword field values
+
+.. code-block:: python
+
+    if isinstance(value, (PSLiteral, PSKeyword)):
+        value = value.name
+
+- Decode bytes field values
+
+.. code-block:: python
+
+    if isinstance(value, bytes):
+        value = decode_text(value)
diff --git a/docs/source/howto/index.rst b/docs/source/howto/index.rst
index b8a758b..9d3269a 100644
--- a/docs/source/howto/index.rst
+++ b/docs/source/howto/index.rst
@@ -9,3 +9,4 @@ How-to guides help you to solve specific problems with pdfminer.six.
     :maxdepth: 1
 
     images
+    acro_forms
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 75588f9..e5d19a7 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -48,10 +48,10 @@ Features
 * Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
   contents and more.
 * Support for (almost all) features from the PDF-1.7 specification
-* Support for Chinese, Japanese and Korean CJK) languages as well as vertical
-  writing.
+* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical writing.
 * Support for various font types (Type1, TrueType, Type3, and CID).
 * Support for RC4 and AES encryption.
+* Support for AcroForm interactive form extraction.
 
 
 Installation instructions
diff --git a/samples/acroform/AcroForm_TEST.pdf b/samples/acroform/AcroForm_TEST.pdf
new file mode 100644
index 0000000..8c366d5
Binary files /dev/null and b/samples/acroform/AcroForm_TEST.pdf differ
diff --git a/samples/acroform/AcroForm_TEST_compiled.pdf b/samples/acroform/AcroForm_TEST_compiled.pdf
new file mode 100644
index 0000000..1823b69
Binary files /dev/null and b/samples/acroform/AcroForm_TEST_compiled.pdf differ