Community maintained fork of pdfminer - we fathom PDF
 
 
Andrew Baumann 1d1602e0c5
Added feature: page labels (#680)
* port page label code from pdfannots

* add tests and clean up

* more cleanup; harden against non-conforming input

* one more test

* update CHANGELOG

* cleanup & respond to review feedback (incomplete)

* Refactor implementation of get_page_labels() into a NumberTree and PageLabels class.

* PageLabels *is* a NumberTree and should always behave like one. This justifies inheriting its data and behavior. And it simplifies the code a bit more.

* fix type errors and cleanup slightly

 * fix mypy errors (including tweaking code to avoid problematic dynamic types)
 * hoist dict_value from NumberTree (where it may not be a dict) to PageLabels (where it must be)
 * avoid repeated warnings by calling _parse() recursively, and checking sortedness only at the end

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-02-01 10:08:05 +01:00
.github fix typos in PR template (#681) 2022-01-25 22:08:14 +01:00
cmaprsrc Fix typos 2016-09-13 16:25:09 +02:00
docs Add type annotations (#661) 2021-10-09 16:23:28 +02:00
pdfminer Added feature: page labels (#680) 2022-02-01 10:08:05 +01:00
samples Added feature: page labels (#680) 2022-02-01 10:08:05 +01:00
tests Added feature: page labels (#680) 2022-02-01 10:08:05 +01:00
tools Use logger.warn instead of warnings.warn if warning cannot be prevented by user (#673) 2022-01-26 20:41:12 +01:00
.gitignore Fix extraction of some cjk characters (#593) 2021-08-26 21:05:03 +02:00
.travis.yml Replace typing-extensions Literal with the type of the Literal & run mypy, nosetest and sphinx in there own environment on cicd (#677) 2021-10-12 20:22:58 +02:00
CHANGELOG.md Added feature: page labels (#680) 2022-02-01 10:08:05 +01:00
CONTRIBUTING.md Remove explicit support for Python 3.4 and 3.5, adding tests for python 3.9 (#522) 2020-10-25 12:34:51 +01:00
LICENSE Added: LICENSE 2016-09-11 23:38:18 +09:00
MANIFEST.in Remove samples/ directory from source distribution to prevent downloading all pdf's when installing pdfminer.six (#364) 2020-01-24 12:36:02 +01:00
Makefile Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456) 2020-07-20 22:00:54 +02:00
README.md Add support for ISO 32000-2 AES256 encryption (#614) 2021-09-06 22:00:23 +02:00
mypy.ini Add type annotations (#661) 2021-10-09 16:23:28 +02:00
setup.py export type annotations in package (#679) 2022-01-25 22:11:17 +01:00
tox.ini Replace typing-extensions Literal with the type of the Literal & run mypy, nosetest and sphinx in there own environment on cicd (#677) 2021-10-12 20:22:58 +02:00

README.md

pdfminer.six


We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font, or color of the text.

It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for other purposes than text analysis.
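As an illustration of the location information mentioned above, here is a minimal sketch that lists the bounding box of every text block on every page. It assumes the `extract_pages` function and the `LTTextContainer` class from the pdfminer.six high-level and layout modules; the helper names `format_bbox` and `iter_text_boxes` are made up for this example.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page.
BBox = Tuple[float, float, float, float]


def format_bbox(bbox: BBox) -> str:
    """Render a bounding box as '(x0, y0) - (x1, y1)' with one decimal."""
    x0, y0, x1, y1 = bbox
    return f"({x0:.1f}, {y0:.1f}) - ({x1:.1f}, {y1:.1f})"


def iter_text_boxes(path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page number, bounding box, text) for each text block in a PDF."""
    # Imported here so the helper above stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(path), start=1):
        for element in page:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()
```

With pdfminer.six installed, something like `for page, bbox, text in iter_text_boxes("samples/simple1.pdf"): print(page, format_bbox(bbox), text)` should print one line per text block.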

Check out the full documentation on Read the Docs.

Features

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Support for extracting images (JPG, JBIG2, Bitmaps).
  • Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
  • Support for RC4 and AES encryption.
  • Support for AcroForm interactive form extraction.
  • Table of contents extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.
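To give a feel for what the stream filters in the list above do, the following sketch decodes four of them using only the Python standard library. It is not the pdfminer.six implementation; the function names and the `FILTERS` table are hypothetical, and LZWDecode and CCITTFaxDecode are left out because they have no standard-library equivalent.

```python
import base64
import binascii
import zlib
from typing import Callable, Dict, Iterable


def ascii_hex_decode(data: bytes) -> bytes:
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' ends the data."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        digits += b"0"  # a trailing odd digit is treated as if followed by 0
    return binascii.unhexlify(digits)


def ascii85_decode(data: bytes) -> bytes:
    """ASCII85Decode: base-85 text terminated by '~>'."""
    body = data.strip()
    if body.endswith(b"~>"):
        body = body[:-2]
    return base64.a85decode(body)


def run_length_decode(data: bytes) -> bytes:
    """RunLengthDecode: length byte 0-127 copies, 129-255 repeats, 128 ends."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # copy the next length + 1 bytes literally
            out += data[i + 1 : i + 2 + length]
            i += 2 + length
        else:  # repeat the next byte 257 - length times
            out += data[i + 1 : i + 2] * (257 - length)
            i += 2
    return bytes(out)


# Hypothetical lookup table from PDF filter name to decoder.
FILTERS: Dict[str, Callable[[bytes], bytes]] = {
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": zlib.decompress,
    "RunLengthDecode": run_length_decode,
}


def apply_filters(data: bytes, names: Iterable[str]) -> bytes:
    """Apply the named filters in order, as a stream's filter chain would."""
    for name in names:
        data = FILTERS[name](data)
    return data
```

A stream that was deflated and then hex-encoded would be decoded with `apply_filters(raw, ["ASCIIHexDecode", "FlateDecode"])`, matching the order in which a PDF lists its filters.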

How to use

  • Install Python 3.6 or newer.

  • Install pdfminer.six:

    pip install pdfminer.six

  • Use the command-line interface to extract text from a PDF:

    python pdf2txt.py samples/simple1.pdf
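The same extraction can also be done from Python code. The sketch below assumes the `extract_text` function from `pdfminer.high_level`; the wrapper names `pdf_to_text` and `main` and the exit codes are choices made for this example, not part of the library.

```python
import sys
from pathlib import Path
from typing import List


def pdf_to_text(path: str) -> str:
    """Return the text of the PDF at `path` via the high-level API."""
    # Imported here so the argument checks below work without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def main(argv: List[str]) -> int:
    """Print the text of one PDF; return a shell-style exit code."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE.pdf", file=sys.stderr)
        return 2
    if not Path(argv[1]).is_file():
        print(f"no such file: {argv[1]}", file=sys.stderr)
        return 1
    print(pdf_to_text(argv[1]))
    return 0
```

Calling `main(["extract", "samples/simple1.pdf"])` from the repository root should print the text of the sample file, provided pdfminer.six is installed.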

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.