127 Commits (master)

Author SHA1 Message Date
Pieter Marsman ebf7bcdb98
Add FAQ about special characters (#829)
* Add FAQ for extracting special characters

* Update CHANGELOG.md

* Update faq.rst
2022-11-05 17:22:08 +01:00
Pieter Marsman 3688911afe
Fix small typos in documentation (#828)
* Fix #795

* Documentation updates (FAQ and others)

* New how-to for extracting coordinates

* Indent fix in documentation

* Revert "Fix #795"

This reverts commit cac62171fc.

* Move description of iterating LTPage to the docstring of LTPage

* Remove adding how-to for extracting coordinates from this pr

* Add CHANGELOG.md

* Remove FAQ from this branch

* Only add one line to CHANGELOG.md

Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>
2022-11-05 17:08:23 +01:00
Pieter Marsman 769dbb6343
Consistent instructions for how to install and use pdfminer.six (#793) 2022-11-05 16:30:39 +01:00
Chris Mayo 86e34873e4
Fix Sphinx warnings and error (#760)
* Fix Sphinx warnings

howto/acro_forms.rst:4: WARNING: Title underline too short.
howto/acro_forms.rst:81: WARNING: Bullet list ends without a blank line; unexpected unindent.
howto/acro_forms.rst:88: WARNING: Bullet list ends without a blank line; unexpected unindent.
howto/acro_forms.rst:122: WARNING: Bullet list ends without a blank line; unexpected unindent.
tutorial/extract_pages.rst:6: WARNING: Failed to create a cross reference. A title or caption not found: api_extract_pages

* Fix documenting pdf2txt.py

reference/commandline.rst:12: ERROR: Module "tools.pdf2txt" has no attribute "maketheparser"
Incorrect argparse :module: or :func: values?

* Add CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2022-05-24 20:07:04 +02:00
Pieter Marsman 121235e24b
Raise more specific error if Pillow cannot be imported (#714)
* Raise specific warning if Pillow cannot be imported

* Improve error message

* Update docs

* Update CHANGELOG.md

* Update pdfminer/image.py

Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2022-02-22 20:20:17 +01:00
Pieter Marsman b9a8920cdf
Check blackness in github actions (#711)
* Check blackness in github actions

* Blacken code

* Update github action names

* Add contributing guidelines on using black

* Add to checklist for PR
2022-02-11 22:46:51 +01:00
Andrew Baumann 9406040d8e
Add type annotations (#661)
Squashed commit of the following:

commit fa229f7b7591c07aea4e5a4545f9e0c34246e1cd
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 20:33:06 2021 -0700

    Merge branch 'develop' into mypy (and fixed types)

commit eaab3c65e2e3ab5f1f400cfc5186a3834c4ffe34
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 20:00:45 2021 -0700

    reformat all multi-line function defs to one-arg-per-line

commit 3fe2b69eed9197009d9da6776462f580ebf0dfa3
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:58:48 2021 -0700

    ccitt nit -- avoid casting needlessly

commit 15983d8c1e7162632fde43752c9d1c15938cd980
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:58:36 2021 -0700

    tweak CHANGELOG

commit 13dc0babf782938e7d5b5e482d4c5adf92d82702
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:43:46 2021 -0700

    add failing tests for dumppdf crash

commit 6b509c517876b8c15ac5a98a963884e23bd2e4d8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:24:23 2021 -0700

    ccitt: apply misc PR feedback

commit feb031ba86d3f22e41cfbbda13f17c039359f1e6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:18:26 2021 -0700

    add missing None return type to all __init__ methods

commit c0d62d6c54c7ec37b40bea54a3f6a7a618ec0ec6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:13:08 2021 -0700

    minor cleanup, remove a few more Any types

commit b52a0594e1998a492c172538a9b35491c5fc5f52
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sun Sep 5 22:37:28 2021 -0700

    tighten up types, avoid Any in favour of explicit casts

commit e58fd48bd14f31bebd2de8259f12630ac02756d6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sun Sep 5 14:10:49 2021 -0700

    annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)

commit 605290633e55595e5e0045840df5c5b1d9de843a
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:37:38 2021 -0700

    python 3.7 back-compat

commit 4dbcf8760f8a1d3e3d99f085476f86e6a043c80c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:32:43 2021 -0700

    annotate pdfminer.jbig2

commit 0d40b7c03a8028dc44acd3f457eac71abd681827
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:31:33 2021 -0700

    annotate pdf2txt.py

commit 5f82eb4f5646b5d1285252689191e0a14557ec7b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 09:16:31 2021 -0700

    cleanup: make Plane generic

commit 624fc92b88473ff36a174760883f34c22109da2b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 23:16:51 2021 -0700

    bluntly ignore calls to cryptography.hazmat

commit 96b20439c169f40dbb114cabba6a582ad1ebe91e
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 23:01:06 2021 -0700

    finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2

commit 0ab586347861b72b1d16880dc9293f9ad597e20a
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 21:51:56 2021 -0700

    annotate pdffont

commit 4b689f1bcbdaf654feb9de81023e318ca310a12e
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 18:30:02 2021 -0700

    annotate a couple more scripts; document sketchy code

commit 291981ff3d273952ec9c92ef8ab948473558b787
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 15:02:01 2021 -0700

    pacify flake8

commit 45d2ce91ff333f3b7e34322b16e9c52b99b7a972
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 14:31:48 2021 -0700

    annotate dumppdf, and comment likely bugs

commit 7278d83851cb336a1be3803a0993b5ec0ad39b4c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:49:58 2021 -0700

    enable mypy on tests and tools, fix one implicit reexport bug

commit 4a83166ef4e4733cd2113f43188b585a4fda392b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:25:59 2021 -0700

    pdfdocument: per dumppdf.py, get_dest accepts either bytes or str

commit 43701e1bee068df98f378a253c9c2150ee4ad9f7
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:25:00 2021 -0700

    layout: LAParams.boxes_flow may be None

commit 164f81652f1788e74837466f0ab593e94079bc0f
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:45:09 2021 -0700

    add whitespace, pacify flake8

commit 893b9fb9ec918032b36a30456fc0b7a217da86d8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:40:33 2021 -0700

    support old Python without typing.Protocol

commit dc245084102b7b04c3f5599d75b5d62ba4290787
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:12:03 2021 -0700

    Move "# type: ignore" comments to fix mypy on Python < 3.8

    The placement of these comments got more flexible in 3.8 due to
    https://github.com/python/mypy/issues/1032

    Satisfying older Python and fitting in flake8's 79-character line
    limit was quite a challenge!

commit da03afe7bd2cf3336e611f467f1c901455940ae8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 22:59:58 2021 -0700

    fix text output from HTMLConverter

commit 5401276a2ed3b74a385ebcab5152485224146161
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 22:40:22 2021 -0700

    annotate high_level.py and the immediately-reachable internal APIs (mostly converters)

commit cc490513f8f17a7adc0bcbab2e0e86f37e832300
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 17:04:35 2021 -0700

     * expand and improve annotations in cmap, encryption/decompression and fonts
     * disallow untyped calls; this way, we have a core set of
       typed code that can grow over time
       (just not for ccitt, because there's a ton of work lurking there)
     * expand "typing: none" comments to suppress a specific error code

commit 92df54ba1d53d5dbbd5442757dd85be5b1851f99
Author: Andrew Baumann <ab@ab.id.au>
Date:   Wed Sep 1 20:50:59 2021 -0700

    update CHANGELOG

commit f72aaead45d0615e472a9b3190c9551a6b67b36e
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date:   Wed Sep 1 20:47:03 2021 -0700

    Merge branch 'develop' into mypy

commit ff787a93986c60361536a97182a41774f4a53ac3
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Aug 21 21:46:14 2021 -0700

    be more precise about types on ps/pdf stacks, remove most of the Any annotations

commit be1550189e10717f6827dbb7009d6e8c8b3f4c62
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Aug 21 10:13:58 2021 -0700

    silence missing imports, (maybe?) hook to tox

commit ff4b6a9bd46b352583d823d39065652c9a6f05f4
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 22:49:06 2021 -0700

    turn on more strict checks, and untangle the layout mess with generics

    Status:
    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
    Found 5 errors in 4 files (checked 27 source files)

    pdfdevice.py:191 appears to be a real bug

commit 5c9c0b19d26ae391aea0e69c2c819261cc04460c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 17:22:41 2021 -0700

    finish annotating layout

commit 0e6871c16abb29df2868ab145b4ce451b4b6c777
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 16:54:46 2021 -0700

    general progress on annotations
     * finish utils
     * annotate more of pdfinterp, pdfdevice
     * document reason for # type: ignore comments
     * fix cyclic imports
     * satisfy flake8

commit 17d59f42917fbf9b2b2eb844d3e83a8f2a3f123a
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Aug 19 21:38:50 2021 -0700

    WIP on type annotations

    With the possible exception of psparser.py, this is far from complete.

    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
2021-10-09 16:23:28 +02:00
Raphaël Cohen c3e3499a6b
Add support for ISO 32000-2 AES256 encryption (#614)
* feat: Add support for ISO 32000-2 AES256 encryption

* feat: Applies review suggestions
2021-09-06 22:00:23 +02:00
MapleCCC 8ea9f1091a
Fix typos in converting_pdf_to_text.rst (#611)
* Fix typos in converting_pdf_to_text.rst

* The word "pdfminer.six" should not be split across a newline; otherwise the renderer treats it as two separate words and displays them apart.

* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-31 20:52:13 +02:00
Fiete 7f54cefe02
Use visible imports in highlevel.rst documentation (#609)
* add missing import for extract_text_to_fp

* Replace testsetup with visible imports in documentation

* Remove obsolete check for python version; python 2 is not supported anymore

* (Unrelated to this MR) Remove sys from converter.py

* Optimize imports

* (Unrelated to this MR) fix line length error

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-30 22:17:21 +02:00
Daniele Procida 1d33c026e4
Updated link to Diátaxis documentation website (#606)
The canonical home of the documentation framework has moved
from documentation.divio.com to https://diataxis.fr.
2021-08-30 21:47:40 +02:00
X d821fed340
Fix typos in readthedocs documentation. (#579)
* Fix typos and possible mistakes.

* Revert two edits based on discussion in #579

Revert the two changes based on our discussion. 

I read the documentation and had a glimpse at the default code. And perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers.

Maybe it is useful to point that out in the figure but I am not sure how best to do it. 

Another option is to use a name like `min_char_margin_threshold` or similar, in the hope that it is easier to understand. Just some thoughts!

* Triggering travis again

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-08-26 20:58:50 +02:00
Pieter Marsman 875e53013a
Remove explicit support for Python 3.4 and 3.5, adding tests for python 3.9 (#522)
Closes #503
2020-10-25 12:34:51 +01:00
Pieter Marsman c66eca3c29 Update faq.rst 2020-10-18 12:49:54 +02:00
Pieter Marsman 599f0391b5 Update faq.rst 2020-10-12 09:22:41 +02:00
Pieter Marsman e59b1bca2f
Update docs/source/faq.rst
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-12 09:20:43 +02:00
Pieter Marsman a805653a83
Update docs/source/faq.rst
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-12 09:20:37 +02:00
Pieter Marsman 4be9757b86
Update docs/source/faq.rst
Co-authored-by: Jake Stockwin <jstockwin@gmail.com>
2020-10-12 09:20:30 +02:00
Pieter Marsman 14cc66ae6d Add frequently asked questions 2020-10-11 20:05:26 +02:00
Pieter Marsman bbc01f749a Add punchline to docs 2020-10-11 20:05:11 +02:00
estshorter 360b1efc0b
Deprecate Python 3.4 and 3.5 (#507) 2020-10-10 16:15:03 +02:00
typhoon71 4d8b5975cb
Add section to documentation with howto for AcroForm fields extraction (#458)
* Create aforms.rst

Add section to documentation with howto for AcroForm fields extraction

* Update index.rst

Added reference to aforms.rst

* Update aforms.rst

* Update aforms.rst

* Update index.rst

* Update and rename aforms.rst to acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update index.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update acro_forms.rst

* Update pdfdocument.py

* Update pdfdocument.py

* Update pdfdocument.py

* Update acro_forms.rst

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update acro_forms.rst

* reverted changes

* Update README.md

* Proper processing of ComboBox

ComboBox fields hold multiple values, so they must be returned as a list.

* PDF with AcroForm (samples)

* Create tmp

* Delete AcroForm_TEST.pdf

* Delete AcroForm_TEST_compiled.pdf

* PDF file with AcroForms

* Delete tmp

* Fixed typo

* Update index.rst

* Update README.md

* Update index.rst

* Update pdfdocument.py

* Update docs/source/howto/acro_forms.rst

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>

* Update pdfdocument.py

* Update pdfdocument.py

* Update pdfdocument.py

Co-authored-by: Jake Stockwin <jake.stockwin@optimorlabs.com>
2020-09-10 19:18:41 +02:00
Jake Stockwin ac2b20a79a
[docs] Add extract_pages tutorial (#442)
Closes https://github.com/pdfminer/pdfminer.six/issues/361
2020-06-29 20:07:05 +02:00
Pieter Marsman 91d89af788
Add section to documentation with howto for image extraction (#427)
* Make structure of documentation more clear: tutorials, how-to, topics and reference

* Add howto for images

* Restructure tutorials section, and add install section

* Always use up-to-date version

* Fix indentation warning in docstring

* Add option to dumppdf.py and pdf2txt.py to show version

Fixes #162
2020-05-17 17:48:06 +02:00
Jake Stockwin 518b5d6efc
Fix #390: Updated misleading documentation about word_margin (#407)
* Updated misleading documentation about word_margin

* Small change in sentence about word_margin

* Remove confusing sentence about adding spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2020-03-26 23:02:48 +01:00
Pieter Marsman 7e91d4ec6d Improve docs and github templates 2020-03-08 15:06:13 +01:00
Pieter Marsman e4790fdbc2 Add AES as supported encryption method to docs 2020-01-07 18:38:53 +01:00
Pieter Marsman 6eb9957e8a Update docs: at least python 3.4 is needed now 2020-01-04 16:51:54 +01:00
Pieter Marsman f3ab1bc61e
Enforce pep8 coding-style (#345)
* Code Refactor: Use code-style enforcement #312

* Add flake8 to travis-ci

* Remove python 2 3 comment on six library. 891 errors > 870 errors.

* Remove class and functions comments that consist of just the name. 870 errors > 855 errors.

* Fix flake8 errors in pdftypes.py. 855 errors > 833 errors.

* Moving flake8 testing from .travis.yml to tox.ini to ensure local testing before committing

* Cleanup pdfinterp.py and add documentation from PDF Reference

* Cleanup pdfpage.py

* Cleanup pdffont.py

* Clean psparser.py

* Cleanup high_level.py

* Cleanup layout.py

* Cleanup pdfparser.py

* Cleanup pdfcolor.py

* Cleanup rijndael.py

* Cleanup converter.py

* Rename klass to cls if it is the class variable, to be more consistent with standard practice

* Cleanup cmap.py

* Cleanup pdfdevice.py

* flake8 ignore fontmetrics.py

* Cleanup test_pdfminer_psparser.py

* Fix flake8 in pdfdocument.py; 339 errors to go

* Fix flake8 utils.py; 326 errors to go

* pep8 correction for few files in /tools/ 328 > 160 to go (#342)

* pep8 correction for few files in /tools/ 328 > 160 to go

* pep8 correction: 160 > 5 to go

* Fix ascii85.py errors

* Fix error in getting index from target that does not exist

* Remove commented print lines

* Fix flake8 error in pdfinterp.py

* Fix python2 specific error by removing argument from print statement

* Ignore invalid python2 syntax

* Update contributing.md

* Added changelog

* Remove unused import

Co-authored-by: Fakabbir Amin <f4amin@gmail.com>
2019-12-29 21:20:20 +01:00
Pieter Marsman 2bee7d8dcf
Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335)
Fixes #334
2019-11-10 12:18:49 +01:00
Pieter Marsman bc034c8e59
Create sphinx documentation for Read the Docs (#329)
Fixes #171
Fixes #199
Fixes #118
Fixes #178
Added: tests for building documentation and example code in documentation
Added: docstrings for commonly used functions and classes
Removed: old documentation
2019-11-07 21:12:34 +01:00
Pieter Marsman 347c125fb8 Revert "Move old documentation to subfolder"
This reverts commit a2e6c7c0
2019-10-27 14:26:11 +01:00
Pieter Marsman a2e6c7c0c9 Move old documentation to subfolder 2019-10-27 14:21:47 +01:00
Pieter Marsman d88d6020a2
Remove webapp and other (un)helpful application references: django, cgi, and pyinstaller. (#320)
Fixes #314 
Fixes #105
2019-10-26 19:16:37 +02:00
Kaushik Acharya 963a227b2e Updated URL for the article 2019-08-19 20:16:34 +05:30
Kaushik Acharya bfbb8b8f0b Adding Denis's article name. 2019-08-15 11:59:29 +05:30
Goulu 8861d7e0ed version 20140915 pushed to PyPI as pdfminer_six 2014-09-15 10:33:04 +02:00
Yusuke Shinyama 107e071508 Drop Python 2.4 support. The oldest supported version is now Python 2.6. 2014-06-25 19:28:54 +09:00
Yusuke Shinyama 0be2f5422b Fixed the document, thanks to Darius Thabit. 2014-05-19 23:23:41 +09:00
Yusuke Shinyama 7b354c7ab3 Version 20140328 2014-03-28 22:49:18 +09:00
Yusuke Shinyama 62eab0048b Documentation updated. 2014-03-24 21:03:10 +09:00
Yusuke Shinyama 0e7274de1b Added the description of boxes_flow. 2014-03-24 19:20:40 +09:00
Yusuke Shinyama e39e39fa12 Documentation updates. 2013-11-17 15:32:57 +09:00
Yusuke Shinyama 7504d2bf27 Updated and fixed the documents. 2013-11-13 14:51:24 +09:00
Yusuke Shinyama 0a4bc9dee9 Renamed: LTAnon -> LTAnno 2013-11-11 19:18:16 +09:00
Yusuke Shinyama 96667d286f Updated documentation. 2013-10-27 00:05:26 +09:00
Yusuke Shinyama 86348eba2f Documentation updated. 2013-10-23 00:17:12 +09:00
Yusuke Shinyama 87842233b3 Version bump! 2013-10-22 22:19:38 +09:00
Yusuke Shinyama ead3137121 updated documents. 2013-10-22 19:09:14 +09:00
Yusuke Shinyama 4f677b6bcf fixed: wrong dates in index.html 2013-10-22 19:00:26 +09:00