[docs] Add extract_pages tutorial (#442)
Closes https://github.com/pdfminer/pdfminer.six/issues/361

pull/450/head
parent 09c989f301
commit ac2b20a79a
@ -20,6 +20,9 @@ interactive elements and higher-level application data. A PDF file contains
the objects making up a PDF document along with associated structural
information, all represented as a single self-contained sequence of bytes. [1]_

.. _topic_pdf_to_text_layout:

Layout analysis algorithm
=========================

@ -0,0 +1,47 @@
.. _tutorial_extract_pages:

Extract elements from a PDF using Python
****************************************


The high-level functions can be used to achieve common tasks. In this case,
we can use :ref:`api_extract_pages`:

.. code-block:: python

    from pdfminer.high_level import extract_pages

    # extract_pages yields one layout object per page of the PDF.
    for page_layout in extract_pages("test.pdf"):
        # Iterating over a page layout yields its top-level elements.
        for element in page_layout:
            print(element)


Each ``element`` will be an ``LTTextBox``, ``LTFigure``, ``LTLine``, ``LTRect``
or an ``LTImage``. Some of these can be iterated further. For example, iterating
through an ``LTTextBox`` will give you an ``LTTextLine``, and these in turn can
be iterated through to get an ``LTChar``. See the diagram here:
:ref:`topic_pdf_to_text_layout`.

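To see this hierarchy for a particular file, the nesting can be printed with a
small recursive walk. The sketch below is illustrative rather than part of
pdfminer.six: ``walk_layout`` and ``print_layout_tree`` are hypothetical helper
names, and the code assumes that container objects such as ``LTTextBox`` are
iterable while leaf objects such as ``LTChar`` are not.

```python
from collections.abc import Iterable


def walk_layout(element, depth=0):
    """Yield (depth, class name) for element and everything nested inside it."""
    yield depth, type(element).__name__
    # Assumption: only container objects (e.g. LTTextBox, LTFigure) are iterable.
    if isinstance(element, Iterable):
        for child in element:
            yield from walk_layout(child, depth + 1)


def print_layout_tree(path):
    """Print an indented outline of every page layout in the PDF at path."""
    # Imported here so walk_layout stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path):
        for depth, name in walk_layout(page_layout):
            print("  " * depth + name)
```

Calling ``print_layout_tree("test.pdf")`` would then print one class name per
line, indented by its depth in the layout tree.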

Let's say we want to extract all of the text. We could do:

.. code-block:: python

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages("test.pdf"):
        for element in page_layout:
            # Only text containers provide get_text(); skip figures, lines, etc.
            if isinstance(element, LTTextContainer):
                print(element.get_text())

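If the text is needed as data rather than printed output, the same loop can
collect one string per page. This is a minimal sketch, not a pdfminer.six API:
``page_texts`` and ``extract_text_by_page`` are hypothetical names, and the
only library calls assumed are ``extract_pages`` and
``LTTextContainer.get_text()`` from the example above.

```python
def page_texts(pages, text_type):
    """Return one string per page, joining the text of each text element."""
    texts = []
    for page_layout in pages:
        parts = [
            element.get_text()
            for element in page_layout
            if isinstance(element, text_type)
        ]
        texts.append("".join(parts))
    return texts


def extract_text_by_page(path):
    """Return a list with the text of each page of the PDF at path."""
    # Imported here so page_texts stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    return page_texts(extract_pages(path), LTTextContainer)
```

Passing the element type as an argument keeps the collecting logic separate
from the parsing, so it can be exercised with simple stand-in objects.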

Or, we could extract the font name or size of each individual character:

.. code-block:: python

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar

    for page_layout in extract_pages("test.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        # Lines also contain LTAnno objects (inserted spaces
                        # and newlines), which carry no font information.
                        if isinstance(character, LTChar):
                            print(character.fontname)
                            print(character.size)

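Printing every character quickly becomes noisy, so one option is to count how
often each font name and size combination occurs. The following is a sketch
under assumptions: ``font_histogram`` and ``iter_characters`` are hypothetical
helpers, sizes are rounded to one decimal place purely for grouping, and the
default layout analysis is assumed so that each text element contains lines.

```python
from collections import Counter


def font_histogram(characters):
    """Count characters per (fontname, rounded size) pair."""
    counts = Counter()
    for character in characters:
        # Objects without font attributes (e.g. LTAnno) are skipped.
        if hasattr(character, "fontname") and hasattr(character, "size"):
            counts[(character.fontname, round(character.size, 1))] += 1
    return counts


def iter_characters(path):
    """Yield every object found inside the text lines of the PDF at path."""
    # Imported here so font_histogram stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    yield from text_line
```

``font_histogram(iter_characters("test.pdf")).most_common(3)`` would then list
the three most frequent font and size combinations in the document.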

@ -12,3 +12,4 @@ Tutorials help you get started with specific parts of pdfminer.six.
commandline
highlevel
composable
extract_pages