
.. _tutorial_composable:

Extract text from a PDF using Python - part 2
*********************************************

The command-line tools and the high-level API are just shortcuts for commonly
used combinations of pdfminer.six components. You can combine these components
yourself to adapt pdfminer.six to your own needs.

For example, to extract the text from a PDF file and store it in a Python
variable::

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

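    # In-memory text buffer that receives the extracted text.
    output_string = StringIO()
    with open('samples/simple1.pdf', 'rb') as in_file:
        # Parse the file and build the document structure.
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        # The resource manager caches shared resources such as fonts.
        rsrcmgr = PDFResourceManager()
        # The converter writes the text of each page to output_string.
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    print(output_string.getvalue())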