pdfminer.six
============

[![Continuous integration](https://github.com/pdfminer/pdfminer.six/actions/workflows/actions.yml/badge.svg)](https://github.com/pdfminer/pdfminer.six/actions/workflows/actions.yml)
[![PyPI version](https://img.shields.io/pypi/v/pdfminer.six.svg)](https://pypi.python.org/pypi/pdfminer.six/)
[![gitter](https://badges.gitter.im/pdfminer-six/Lobby.svg)](https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium)

*We fathom PDF*

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF
documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the
source code of the PDF. It can also be used to get the exact location, font or color of the text.
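
The two uses described above (plain text, and text together with its location) can be sketched as follows. This is a minimal sketch that assumes pdfminer.six is installed and relies on its `pdfminer.high_level` helpers `extract_text` and `extract_pages`. The function names `pdf_to_text` and `iter_text_boxes`, and the PDF path passed on the command line, are placeholders chosen for illustration.

```python
def pdf_to_text(pdf_path):
    """Return all text of the PDF as one string."""
    # Imported inside the function so the helper can be defined
    # even where pdfminer.six is not installed yet.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each top-level text container."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
                yield page_number, element.bbox, element.get_text()


if __name__ == "__main__":
    import sys

    path = sys.argv[1]  # hypothetical: a PDF path given on the command line
    print(pdf_to_text(path))
    for page_number, bbox, text in iter_text_boxes(path):
        print(page_number, bbox, repr(text))
```

Font information is not shown here; in the layout tree it is attached to individual character objects (`LTChar`) rather than to whole text boxes.
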

It is built in a modular way, so each component of pdfminer.six can be replaced easily. You can implement your own
interpreter or rendering device that uses the power of pdfminer.six for purposes other than text analysis.
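
As an illustration of swapping in your own rendering device, the sketch below feeds every page through the interpreter but replaces the usual text device with one that only counts pages. It assumes the `PDFDevice`, `PDFPageInterpreter`, `PDFResourceManager` and `PDFPage` classes from pdfminer.six; `PageCountingDevice` and `count_pages_with_custom_device` are hypothetical names, not part of the library.

```python
def count_pages_with_custom_device(pdf_path):
    """Run the interpreter over a PDF with a device that only counts pages."""
    # Lazy imports: pdfminer.six is only needed when the function is called.
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    class PageCountingDevice(PDFDevice):
        """Hypothetical device: ignores page content, counts begin_page calls."""

        def __init__(self, rsrcmgr):
            super().__init__(rsrcmgr)
            self.pages_seen = 0

        def begin_page(self, page, ctm):
            # Called by the interpreter once at the start of each page.
            self.pages_seen += 1

    rsrcmgr = PDFResourceManager()
    device = PageCountingDevice(rsrcmgr)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return device.pages_seen
```

A real device would override further hooks of the base class to react to text, paths or images; this one overrides only `begin_page` to keep the example short.
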

Check out the full documentation on
[Read the Docs](https://pdfminersix.readthedocs.io).

Features
--------

* Written entirely in Python.
* Parse, analyze, and convert PDF documents.
* PDF-1.7 specification support (well, almost).
* CJK languages and vertical writing scripts support.
* Various font types (Type1, TrueType, Type3, and CID) support.
* Support for extracting images (JPG, JBIG2, Bitmaps).
* Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode,
  CCITTFaxDecode).
* Support for RC4 and AES encryption.
* Support for AcroForm interactive form extraction.
* Table of contents extraction.
* Tagged contents extraction.
* Automatic layout analysis.

How to use
----------

* Install Python 3.6 or newer.
* Install pdfminer.six:

  `pip install pdfminer.six`

* (Optionally) install extra dependencies for extracting images:

  `pip install 'pdfminer.six[image]'`

* Use the command-line interface to extract text from a PDF:

  `python pdf2txt.py samples/simple1.pdf`
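
The command-line step above can also be driven from a script. The sketch below only builds and runs the same `pdf2txt.py` command with the current Python interpreter. The helper names `build_pdf2txt_command` and `run_pdf2txt` are made up for this example, and the `-o` (output file) and `-t` (output type) options are assumed to match the `pdf2txt.py` shipped with your installed version; check `python pdf2txt.py --help` before relying on them.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, script="pdf2txt.py", outfile=None, output_type=None):
    """Build the argument list for running pdf2txt.py with the current Python."""
    command = [sys.executable, script]
    if outfile is not None:
        command += ["-o", outfile]  # assumed flag: write to a file instead of stdout
    if output_type is not None:
        command += ["-t", output_type]  # assumed flag: e.g. "text", "html", "xml"
    command.append(pdf_path)
    return command


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py and return whatever it printed to stdout."""
    result = subprocess.run(
        build_pdf2txt_command(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    print(run_pdf2txt("samples/simple1.pdf"))
```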

Contributing
------------

Be sure to read the [contribution guidelines](https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md).

Acknowledgement
---------------

This repository includes code from `pyHanko`; the original license is included [here](/docs/licenses/LICENSE.pyHanko).