Welcome to pdfminer.six's documentation!
****************************************

.. image:: https://travis-ci.org/pdfminer/pdfminer.six.svg?branch=master
    :target: https://travis-ci.org/pdfminer/pdfminer.six
    :alt: Travis-ci build badge

.. image:: https://img.shields.io/pypi/v/pdfminer.six.svg
    :target: https://pypi.python.org/pypi/pdfminer.six/
    :alt: PyPi version badge

.. image:: https://badges.gitter.im/pdfminer-six/Lobby.svg
    :target: https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium
    :alt: gitter badge

We fathom PDF.

Pdfminer.six is a Python package for extracting information from PDF documents.

Check out the source on `GitHub <https://github.com/pdfminer/pdfminer.six>`_.

Content
=======

This documentation is organized into four sections (according to the `Diátaxis
documentation framework <https://diataxis.fr>`_). The
:ref:`tutorial` section helps you set up and use pdfminer.six for the first
time. Read this section if this is your first time working with pdfminer.six.
The :ref:`howto` offers specific recipes for solving common problems.
Take a look at the :ref:`topic` if you want more background information on
how pdfminer.six works internally. The :ref:`reference` provides
detailed API documentation for all the common classes and functions in
pdfminer.six.

.. toctree::
    :maxdepth: 2

    tutorial/index
    howto/index
    topic/index
    reference/index
    faq

Features
========

* Parse all objects from a PDF document into Python objects.
* Analyze and group text in a human-readable way.
* Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
  contents and more.
* Support for (almost all) features from the PDF-1.7 specification.
* Support for Chinese, Japanese and Korean (CJK) languages as well as vertical writing.
* Support for various font types (Type1, TrueType, Type3, and CID).
* Support for RC4 and AES encryption.
* Support for AcroForm interactive form extraction.

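
As a rough illustration of the text grouping listed above, the sketch below
walks the layout objects of each page and yields the text of every text
container. The helper name ``iter_text_blocks`` is hypothetical and only used
for this example; ``extract_pages`` and ``LTTextContainer`` are imported from
pdfminer.six's ``high_level`` and ``layout`` modules.

.. code-block:: python

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer


    def iter_text_blocks(pdf_path):
        """Yield (page_number, text) for each text container on each page.

        ``iter_text_blocks`` is an illustrative helper, not part of pdfminer.six.
        """
        # extract_pages runs layout analysis and yields one layout object per page.
        for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
            for element in page_layout:
                # Keep only elements that hold text; skip figures, lines, etc.
                if isinstance(element, LTTextContainer):
                    yield page_number, element.get_text()

Because the helper is a generator, pages are analyzed lazily as you iterate,
for example with ``for page, text in iter_text_blocks("example.pdf"): ...``.
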
Installation instructions
=========================

* Install Python 3.6 or newer.
* Install pdfminer.six.

  ::

      $ pip install pdfminer.six

* (Optionally) install extra dependencies for extracting images.

  ::

      $ pip install 'pdfminer.six[image]'

* Use the command-line interface to extract text from a PDF.

  ::

      $ pdf2txt.py example.pdf

* Or use it with Python.

  .. code-block:: python

      from pdfminer.high_level import extract_text

      text = extract_text("example.pdf")
      print(text)

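
``extract_text`` also takes optional keyword arguments. The sketch below,
assuming you only need the first page, restricts extraction with
``page_numbers`` (zero-indexed) and adjusts layout analysis through
``LAParams``. The function name ``extract_first_page`` and the ``line_margin``
value are illustrative choices, not recommendations from the project.

.. code-block:: python

    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams


    def extract_first_page(pdf_path, password=""):
        """Return the text of the first page of a PDF.

        ``extract_first_page`` is an illustrative helper, not part of pdfminer.six.
        """
        # line_margin=0.3 is an arbitrary example value; the library default is 0.5.
        layout_params = LAParams(line_margin=0.3)
        return extract_text(
            pdf_path,
            password=password,   # only needed for encrypted documents
            page_numbers=[0],    # zero-indexed, so 0 selects the first page
            laparams=layout_params,
        )

A call such as ``extract_first_page("example.pdf")`` returns a single string.
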
Contributing
============

We welcome any contributors to pdfminer.six! Before doing anything, take
a look at the `contribution guide
<https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md>`_.