Welcome to pdfminer.six's documentation!
****************************************

.. image:: https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master
    :target: https://travis-ci.org/pdfminer/pdfminer.six
    :alt: Travis-ci build badge

.. image:: https://img.shields.io/pypi/v/pdfminer.six.svg
    :target: https://pypi.python.org/pypi/pdfminer.six/
    :alt: PyPI version badge

.. image:: https://badges.gitter.im/pdfminer-six/Lobby.svg
    :target: https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium
    :alt: gitter badge

We fathom PDF.

Pdfminer.six is a Python package for extracting information from PDF documents.

Check out the source on `GitHub <https://github.com/pdfminer/pdfminer.six>`_.

Content
=======

This documentation is organized into four sections (according to the `Diátaxis
documentation framework <https://diataxis.fr>`_). The
:ref:`tutorial` section helps you set up and use pdfminer.six for the first
time; start here if you are new to pdfminer.six.
The :ref:`howto` offers specific recipes for solving common problems.
Take a look at the :ref:`topic` if you want more background information on
how pdfminer.six works internally. The :ref:`reference` provides
detailed API documentation for all the common classes and functions in
pdfminer.six.

.. toctree::
    :maxdepth: 2

    tutorial/index
    howto/index
    topic/index
    reference/index
    faq


Features
========

* Parse all objects from a PDF document into Python objects.
* Analyze and group text in a human-readable way.
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
  contents and more.
* Support for (almost all) features from the PDF-1.7 specification.
* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical
  writing.
* Support for various font types (Type1, TrueType, Type3, and CID).
* Support for RC4 and AES encryption.
* Support for AcroForm interactive form extraction.


Installation instructions
=========================

* Install Python 3.6 or newer.
* Install pdfminer.six.

::

    $ pip install pdfminer.six

* (Optionally) install extra dependencies for extracting images.

::

    $ pip install 'pdfminer.six[image]'

* Use the command-line interface to extract text from a PDF.

::

    $ pdf2txt.py example.pdf

* Or use it with Python.

.. code-block:: python

    from pdfminer.high_level import extract_text

    text = extract_text("example.pdf")
    print(text)


Contributing
============

We welcome contributions to pdfminer.six! Before you start, take
a look at the `contribution guide
<https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md>`_.