Add section to documentation with howto for AcroForm fields extraction (#458)

* Create aforms.rst

Add section to documentation with howto for AcroForm fields extraction

* Update index.rst

Added reference to aforms.rst

* Update aforms.rst

* Update aforms.rst

* Update index.rst

* Update and rename aforms.rst to acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update index.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update pdfdocument.py

* Update pdfdocument.py

* Update pdfdocument.py

* Update acro_forms.rst

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update acro_forms.rst

* reverted changes

* Update README.md

* Proper processing of ComboBox

ComboBox fields hold multiple values, so they must be returned as a list.

* PDF with AcroForm (samples)

* Create tmp

* Delete AcroForm_TEST.pdf

* Delete AcroForm_TEST_compiled.pdf

* PDF file with AcroForms

* Delete tmp

* Fixed typo

* Update index.rst

* Update README.md

* Update index.rst

* Update pdfdocument.py

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update pdfdocument.py

* Update pdfdocument.py

* Update pdfdocument.py

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
pull/475/head^2
typhoon71 2020-09-10 19:18:41 +02:00 committed by GitHub
parent 0b44f77714
commit 4d8b5975cb
6 changed files with 149 additions and 2 deletions


@@ -29,6 +29,7 @@ Features
 * Various font types (Type1, TrueType, Type3, and CID) support.
 * Support for extracting images (JPG, JBIG2 and Bitmaps).
 * Support for RC4 and AES encryption.
+* Support for AcroForm interactive form extraction.
 * Table of contents extraction.
 * Tagged contents extraction.
 * Automatic layout analysis.


@@ -0,0 +1,145 @@
.. _acro_forms:

How to extract AcroForm interactive form fields from a PDF using PDFMiner
*************************************************************************

Before you start, make sure you have :ref:`installed pdfminer.six<install>`.

The second thing you need is a PDF with AcroForms (as found in PDF files with fillable forms or multiple choices). There are some examples of these in the GitHub repository under ``samples/acroform``.

Only AcroForm interactive forms are supported; XFA forms are not.

.. code-block:: python

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import PSLiteral, PSKeyword
    from pdfminer.utils import decode_text

    # Placeholder: replace with the path to a PDF that contains an AcroForm
    file_path = "path/to/form.pdf"

    data = {}


    def decode_value(value):

        # decode PSLiteral, PSKeyword
        if isinstance(value, (PSLiteral, PSKeyword)):
            value = value.name

        # decode bytes
        if isinstance(value, bytes):
            value = decode_text(value)

        return value


    with open(file_path, 'rb') as fp:
        parser = PDFParser(fp)

        doc = PDFDocument(parser)
        res = resolve1(doc.catalog)

        if 'AcroForm' not in res:
            raise ValueError("No AcroForm Found")

        fields = resolve1(doc.catalog['AcroForm'])['Fields']  # may need further resolving

        for f in fields:
            field = resolve1(f)
            name, values = field.get('T'), field.get('V')

            # decode name
            name = decode_text(name)

            # resolve indirect obj
            values = resolve1(values)

            # decode value(s)
            if isinstance(values, list):
                values = [decode_value(v) for v in values]
            else:
                values = decode_value(values)

            data.update({name: values})

            print(name, values)

This code snippet prints each field's name and value and stores them in the ``data`` dictionary.

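
Because a field's value ends up either as a single decoded value or as a list, code that consumes ``data`` usually has to cope with both shapes. The following is a minimal sketch of one way to normalize them. It does not open a PDF: the field names and values are made up for illustration, and ``as_list`` is a hypothetical helper, not part of pdfminer.six.

```python
def as_list(value):
    """Wrap a scalar field value in a list; map None (no value set) to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hypothetical contents of ``data`` after running the extraction snippet.
data = {
    "full_name": "Jane Doe",          # text field: a single string
    "languages": ["Python", "Rust"],  # multi-valued field: a list
    "newsletter": None,               # field without a value entry
}

# Every field now maps to a list, whatever shape it had before.
normalized = {name: as_list(value) for name, value in data.items()}

for name, values in normalized.items():
    print(name, values)
```

Normalizing on the consumer side leaves the extraction loop unchanged; whether an empty list is the right representation for an unset field depends on the application.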
How it works:

- Initialize the parser and the PDFDocument objects

.. code-block:: python

    parser = PDFParser(fp)
    doc = PDFDocument(parser)

- Get the catalog

(the catalog contains references to other objects defining the document structure; see section 7.7.2 of the PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)

.. code-block:: python

    res = resolve1(doc.catalog)

- Check whether the catalog contains the AcroForm key and raise ValueError if it does not

(if this key is missing from the catalog, the PDF does not contain AcroForm interactive forms; see section 12.7.2 of the PDF 32000-1:2008 specs)

.. code-block:: python

    if 'AcroForm' not in res:
        raise ValueError("No AcroForm Found")

- Get the field list by resolving the entry in the catalog

.. code-block:: python

    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for f in fields:
        field = resolve1(f)

- Get the field name and field value(s)

.. code-block:: python

    name, values = field.get('T'), field.get('V')

- Decode the field name

.. code-block:: python

    name = decode_text(name)

- Resolve indirect field value objects

.. code-block:: python

    values = resolve1(values)

- Call the value decoding function as needed

(a single field can hold multiple values; for example, a combo box can hold more than one value at a time)

.. code-block:: python

    if isinstance(values, list):
        values = [decode_value(v) for v in values]
    else:
        values = decode_value(values)

(the ``decode_value`` function decodes a field's value and returns a string)

- Decode PSLiteral and PSKeyword field values

.. code-block:: python

    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name

- Decode bytes field values

.. code-block:: python

    if isinstance(value, bytes):
        value = decode_text(value)
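
The value decoding steps above can be tried in isolation, without pdfminer.six or a PDF file. The sketch below is an illustration under stated assumptions, not the library's code: ``FakeLiteral`` is a hypothetical stand-in for ``PSLiteral``, and ``decode_text_standin`` is a simplified replacement for ``decode_text`` that handles UTF-16BE strings with a byte-order mark and otherwise falls back to Latin-1 (the real function is assumed to use PDFDocEncoding for that case).

```python
class FakeLiteral:
    """Hypothetical stand-in for pdfminer's PSLiteral: it only carries a name."""

    def __init__(self, name):
        self.name = name


def decode_text_standin(raw):
    """Simplified stand-in for pdfminer.utils.decode_text (an assumption)."""
    # Strings starting with the UTF-16BE byte-order mark are UTF-16 text
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", "ignore")
    # Fallback used only in this sketch; the real library differs here
    return raw.decode("latin-1")


def decode_value(value):
    # Name-like objects are reduced to their name
    if isinstance(value, FakeLiteral):
        value = value.name
    # Byte strings are decoded to text
    if isinstance(value, bytes):
        value = decode_text_standin(value)
    return value


def decode_values(values):
    # A field may hold a list of values or a single value
    if isinstance(values, list):
        return [decode_value(v) for v in values]
    return decode_value(values)


checkbox = decode_values(FakeLiteral("Yes"))
text = decode_values(b"\xfe\xff\x00H\x00i")
multi = decode_values([b"red", FakeLiteral("green")])
print(checkbox, text, multi)
```

Values that are neither name-like objects nor bytes (for example ``None`` for a field with no value) pass through unchanged, which mirrors the behaviour of the ``decode_value`` function in the how-to.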


@@ -9,3 +9,4 @@ How-to guides help you to solve specific problems with pdfminer.six.
     :maxdepth: 1

     images
+    acro_forms


@@ -48,10 +48,10 @@ Features
 * Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
   contents and more.
 * Support for (almost all) features from the PDF-1.7 specification
-* Support for Chinese, Japanese and Korean CJK) languages as well as vertical
-  writing.
+* Support for Chinese, Japanese and Korean CJK) languages as well as vertical writing.
 * Support for various font types (Type1, TrueType, Type3, and CID).
 * Support for RC4 and AES encryption.
+* Support for AcroForm interactive form extraction.
 Installation instructions

Binary file not shown.

Binary file not shown.