Add section to documentation with howto for AcroForm fields extraction (#458)
* Create aforms.rst Add section to documentation with howto for AcroForm fields extraction
* Update index.rst Added reference to aforms.rst
* Update aforms.rst
* Update aforms.rst
* Update index.rst
* Update and rename aforms.rst to acro_forms.rst
* Update acro_forms.rst
* Update acro_forms.rst
* Update acro_forms.rst
* Update index.rst
* Update acro_forms.rst
* Update acro_forms.rst
* Update acro_forms.rst
* Update pdfdocument.py
* Update pdfdocument.py
* Update pdfdocument.py
* Update acro_forms.rst
* Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
* Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
* Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
* Update acro_forms.rst
* reverted changes
* Update README.md
* Proper processing of ComboBox ComboBox fields hold multiple values, so they must be returned as a list.
* PDF with AcroForm (samples)
* Create tmp
* Delete AcroForm_TEST.pdf
* Delete AcroForm_TEST_compiled.pdf
* PDF file with AcroForms
* Delete tmp
* Fixed typo
* Update index.rst
* Update README.md
* Update index.rst
* Update pdfdocument.py
* Update docs/source/howto/acro_forms.rst Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
* Update pdfdocument.py
* Update pdfdocument.py
* Update pdfdocument.py Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
pull/475/head^2
parent 0b44f77714
commit 4d8b5975cb
|
@ -29,6 +29,7 @@ Features
|
||||||
* Various font types (Type1, TrueType, Type3, and CID) support.
|
* Various font types (Type1, TrueType, Type3, and CID) support.
|
||||||
* Support for extracting images (JPG, JBIG2 and Bitmaps).
|
* Support for extracting images (JPG, JBIG2 and Bitmaps).
|
||||||
* Support for RC4 and AES encryption.
|
* Support for RC4 and AES encryption.
|
||||||
|
* Support for AcroForm interactive form extraction.
|
||||||
* Table of contents extraction.
|
* Table of contents extraction.
|
||||||
* Tagged contents extraction.
|
* Tagged contents extraction.
|
||||||
* Automatic layout analysis.
|
* Automatic layout analysis.
|
||||||
|
|
|
@ -0,0 +1,145 @@
|
||||||
|
.. _acro_forms:
|
||||||
|
|
||||||
|
How to extract AcroForm interactive form fields from a PDF using PDFMiner
|
||||||
|
*************************************************************************
|
||||||
|
|
||||||
|
Before you start, make sure you have :ref:`installed pdfminer.six<install>`.
|
||||||
|
|
||||||
|
You also need a PDF that contains an AcroForm, that is, a PDF with fillable form fields such as text boxes or multiple-choice fields. The GitHub repository has some examples under ``samples/acroform``.
|
||||||
|
|
||||||
|
Only AcroForm interactive forms are supported; XFA forms are not.
|
||||||
|
|
||||||
|
.. code-block:: python

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import PSLiteral, PSKeyword
    from pdfminer.utils import decode_text

    file_path = 'example.pdf'  # placeholder: path to a PDF that contains an AcroForm

    data = {}


    def decode_value(value):
        # decode PSLiteral, PSKeyword
        if isinstance(value, (PSLiteral, PSKeyword)):
            value = value.name

        # decode bytes
        if isinstance(value, bytes):
            value = decode_text(value)

        return value


    with open(file_path, 'rb') as fp:
        parser = PDFParser(fp)

        doc = PDFDocument(parser)
        res = resolve1(doc.catalog)

        if 'AcroForm' not in res:
            raise ValueError("No AcroForm Found")

        fields = resolve1(doc.catalog['AcroForm'])['Fields']  # may need further resolving

        for f in fields:
            field = resolve1(f)
            name, values = field.get('T'), field.get('V')

            # decode name
            name = decode_text(name)

            # resolve indirect obj
            values = resolve1(values)

            # decode value(s)
            if isinstance(values, list):
                values = [decode_value(v) for v in values]
            else:
                values = decode_value(values)

            data[name] = values

            print(name, values)
|
||||||
|
|
||||||
|
This snippet prints the name and value(s) of every field and stores them in the ``data`` dictionary.
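Once the fields are collected, the dictionary can be written out for later use. The following is a minimal sketch using only the standard library. It assumes that, after decoding, ``data`` holds only strings, lists of strings, or ``None``; the sample names and values are made up for illustration and do not come from any real PDF.

```python
import json

# Hypothetical result of the extraction snippet; real names and values depend on the PDF.
data = {
    "first_name": "Ada",
    "newsletter": "Yes",
    "languages": ["Python", "C"],
    "comments": None,  # a field without a /V entry ends up as None
}

# ensure_ascii=False keeps non-ASCII field values readable in the output.
serialized = json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True)
print(serialized)
```

If a value turns out not to be JSON-serializable (for example an object that was not decoded), ``json.dumps`` raises ``TypeError``, which is a convenient signal that the decoding step needs another case.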
|
||||||
|
|
||||||
|
|
||||||
|
How it works:
|
||||||
|
|
||||||
|
- Initialize the parser and the PDFDocument objects

.. code-block:: python

    parser = PDFParser(fp)
    doc = PDFDocument(parser)
|
||||||
|
|
||||||
|
- Get the catalog
  (the catalog contains references to the other objects that define the document structure; see section 7.7.2 of the PDF 32000-1:2008 specification: https://www.adobe.com/devnet/pdf/pdf_reference.html)

.. code-block:: python

    res = resolve1(doc.catalog)
|
||||||
|
|
||||||
|
- Check whether the catalog contains the ``AcroForm`` key and raise ``ValueError`` if it does not
  (if this key is missing from the catalog, the PDF does not contain AcroForm interactive forms; see section 12.7.2 of the PDF 32000-1:2008 specification)

.. code-block:: python

    if 'AcroForm' not in res:
        raise ValueError("No AcroForm Found")
|
||||||
|
|
||||||
|
- Get the field list by resolving the ``AcroForm`` entry in the catalog, then resolve each field in turn

.. code-block:: python

    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for f in fields:
        field = resolve1(f)
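The loop above only visits the top-level entries of ``Fields``. In the PDF specification a field dictionary may also carry a ``Kids`` array of child fields, and a child's fully qualified name is formed by joining the partial names (``T``) from the root down with periods. The following is a minimal sketch of that traversal. It uses plain dictionaries as stand-ins for already resolved field dictionaries; the ``walk_fields`` helper and the sample data are hypothetical and not part of pdfminer.six.

```python
def walk_fields(fields, prefix=""):
    """Yield (fully_qualified_name, value) pairs for terminal fields.

    `fields` is a list of plain dicts standing in for resolved PDF field
    dictionaries (an assumption made to keep the sketch self-contained).
    """
    for field in fields:
        partial = field.get("T")
        if partial is None:
            name = prefix
        elif prefix:
            name = f"{prefix}.{partial}"
        else:
            name = partial

        # Kids that have their own /T are child fields; kids without /T are
        # treated here as widget annotations belonging to this field.
        child_fields = [kid for kid in field.get("Kids", []) if "T" in kid]
        if child_fields:
            yield from walk_fields(child_fields, name)
        else:
            yield name, field.get("V")


# Hypothetical form: one field group with two children, one simple field.
form = [
    {"T": "applicant", "Kids": [
        {"T": "first", "V": "Ada"},
        {"T": "last", "V": "Lovelace"},
    ]},
    {"T": "agree", "V": "Yes", "Kids": [{"Subtype": "Widget"}]},
]

result = dict(walk_fields(form))
print(result)
```

With real pdfminer.six objects, each kid would presumably need to go through ``resolve1`` and each name through ``decode_text`` first. The sketch also ignores value inheritance from parent fields.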
|
||||||
|
|
||||||
|
- Get field name and field value(s)

.. code-block:: python

    name, values = field.get('T'), field.get('V')
|
||||||
|
|
||||||
|
- Decode the field name

.. code-block:: python

    name = decode_text(name)
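For background on why decoding is needed: PDF text strings are, per the PDF specification, stored either as UTF-16BE with a leading byte order mark or in PDFDocEncoding. The sketch below illustrates that distinction using only the standard library. It is a simplified stand-in, not pdfminer.six's implementation: ``decode_pdf_text`` is a hypothetical name, and PDFDocEncoding is approximated with Latin-1, which agrees for ASCII but not for every byte value.

```python
UTF16_BE_BOM = b"\xfe\xff"


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string (simplified; see the assumptions above)."""
    if raw.startswith(UTF16_BE_BOM):
        # Strip the byte order mark, then decode the rest as UTF-16BE.
        return raw[len(UTF16_BE_BOM):].decode("utf-16-be", errors="ignore")
    # Assumption: Latin-1 as a rough approximation of PDFDocEncoding.
    return raw.decode("latin-1")


plain = decode_pdf_text(b"first_name")
accented = decode_pdf_text(UTF16_BE_BOM + "prénom".encode("utf-16-be"))
print(plain, accented)
```

In the how-to itself, ``decode_text`` from ``pdfminer.utils`` should be preferred over a hand-written decoder like this one.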
|
||||||
|
|
||||||
|
- Resolve indirect field value objects

.. code-block:: python

    values = resolve1(values)
|
||||||
|
|
||||||
|
- Call the value decoding function as needed
  (a single field can hold multiple values; for example, a combo box can hold more than one value at a time)

.. code-block:: python

    if isinstance(values, list):
        values = [decode_value(v) for v in values]
    else:
        values = decode_value(values)

(the ``decode_value`` function decodes a field value into a string)
|
||||||
|
|
||||||
|
- Decode PSLiteral and PSKeyword field values

.. code-block:: python

    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name
|
||||||
|
|
||||||
|
- Decode bytes field values

.. code-block:: python

    if isinstance(value, bytes):
        value = decode_text(value)
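The two checks and the list handling can be tried out together without a PDF at hand. The sketch below mirrors the shape of ``decode_value`` but does not import pdfminer.six. ``FakeLiteral`` is a hypothetical stand-in for ``PSLiteral`` and ``PSKeyword`` (anything exposing a ``name`` attribute), and bytes are decoded as Latin-1 purely for illustration instead of calling ``decode_text``.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FakeLiteral:
    """Hypothetical stand-in for PSLiteral / PSKeyword."""
    name: Union[str, bytes]


def decode_value(value):
    # Unwrap literal-like objects first ...
    if isinstance(value, FakeLiteral):
        value = value.name
    # ... then decode bytes (assumption: Latin-1 instead of decode_text).
    if isinstance(value, bytes):
        value = value.decode("latin-1")
    return value


def decode_values(values):
    # A field value may be a single object or a list of objects.
    if isinstance(values, list):
        return [decode_value(v) for v in values]
    return decode_value(values)


print(decode_values(FakeLiteral("Yes")))
print(decode_values([b"Red", b"Blue"]))
print(decode_values(None))
```

The literal check comes before the bytes check on purpose: if ``name`` is itself a bytes object, the second check still decodes it. Values that are neither literal-like nor bytes, including ``None`` for a field without a value, pass through unchanged.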
|
|
@ -9,3 +9,4 @@ How-to guides help you to solve specific problems with pdfminer.six.
|
||||||
:maxdepth: 1
|
:maxdepth: 1
|
||||||
|
|
||||||
images
|
images
|
||||||
|
acro_forms
|
||||||
|
|
|
@ -48,10 +48,10 @@ Features
|
||||||
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
|
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
|
||||||
contents and more.
|
contents and more.
|
||||||
* Support for (almost all) features from the PDF-1.7 specification
|
* Support for (almost all) features from the PDF-1.7 specification
|
||||||
* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical
|
* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical writing.
|
||||||
writing.
|
|
||||||
* Support for various font types (Type1, TrueType, Type3, and CID).
|
* Support for various font types (Type1, TrueType, Type3, and CID).
|
||||||
* Support for RC4 and AES encryption.
|
* Support for RC4 and AES encryption.
|
||||||
|
* Support for AcroForm interactive form extraction.
|
||||||
|
|
||||||
|
|
||||||
Installation instructions
|
Installation instructions
|
||||||
|
|
Binary file not shown.
Binary file not shown.