Fix small typos in documentation (#828)
* Fix #795
* Documentation updates (FAQ and others)
* New how-to for extracting coordinates
* Indent fix in documentation
* Revert "Fix #795"
This reverts commit cac62171fc.
* Move description of iterating LTPage to the docstring of LTPage
* Remove adding how-to for extracting coordinates from this pr
* Add CHANGELOG.md
* Remove FAQ from this branch
* Only add one line to CHANGELOG.md
Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>
pull/801/head
parent fa71062c35
commit 3688911afe
|
@ -20,6 +20,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
||||||
- `TypeError` in cmapdb.py when parsing null characters ([#768](https://github.com/pdfminer/pdfminer.six/pull/768))
|
- `TypeError` in cmapdb.py when parsing null characters ([#768](https://github.com/pdfminer/pdfminer.six/pull/768))
|
||||||
- Color "convenience operators" now (per spec) also set color space ([#794](https://github.com/pdfminer/pdfminer.six/pull/794))
|
- Color "convenience operators" now (per spec) also set color space ([#794](https://github.com/pdfminer/pdfminer.six/pull/794))
|
||||||
- `ValueError` when extracting images, due to breaking changes in Pillow ([#827](https://github.com/pdfminer/pdfminer.six/pull/827))
|
- `ValueError` when extracting images, due to breaking changes in Pillow ([#827](https://github.com/pdfminer/pdfminer.six/pull/827))
|
||||||
|
- Small typo's and issues in the documentation ([#828](https://github.com/pdfminer/pdfminer.six/pull/828))
|
||||||
|
|
||||||
### Deprecated
|
### Deprecated
|
||||||
|
|
||||||
|
|
|
@ -7,11 +7,11 @@ Why is it called pdfminer.six?
|
||||||
==============================
|
==============================
|
||||||
|
|
||||||
Pdfminer.six is a fork of the `original pdfminer created by Euske
|
Pdfminer.six is a fork of the `original pdfminer created by Euske
|
||||||
<https://github.com/euske>`_. Almost all of the code and architecture is in
|
<https://github.com/euske>`_. Almost all of the code and architecture are in
|
||||||
fact created by Euske. But, for a long time this original pdfminer did not
|
fact created by Euske. But, for a long time, this original pdfminer did not
|
||||||
support Python 3. Until 2020 the original pdfminer only supported Python 2.
|
support Python 3. Until 2020 the original pdfminer only supported Python 2.
|
||||||
The original goal of pdfminer.six was to add support for Python 3. This was
|
The original goal of pdfminer.six was to add support for Python 3. This was
|
||||||
done with the six package. The six package helps to write code that is
|
done with the `six` package. The `six` package helps to write code that is
|
||||||
compatible with both Python 2 and Python 3. Hence, pdfminer.six.
|
compatible with both Python 2 and Python 3. Hence, pdfminer.six.
|
||||||
|
|
||||||
As of 2020, pdfminer.six dropped the support for Python 2 because it was
|
As of 2020, pdfminer.six dropped the support for Python 2 because it was
|
||||||
|
@ -27,13 +27,13 @@ also equal to six feet.
|
||||||
How does pdfminer.six compare to other forks of pdfminer?
|
How does pdfminer.six compare to other forks of pdfminer?
|
||||||
==========================================================
|
==========================================================
|
||||||
|
|
||||||
Pdfminer.six is now an independent and community maintained package for
|
Pdfminer.six is now an independent and community-maintained package for
|
||||||
extracting text from PDF's with Python. We actively fix bugs (also for PDF's
|
extracting text from PDFs with Python. We actively fix bugs (also for PDFs
|
||||||
that don't strictly follow the PDF Reference), add new features and improve
|
that don't strictly follow the PDF Reference), add new features and improve
|
||||||
the usability of pdfminer.six. This community separates pdfminer.six from the
|
the usability of pdfminer.six. This community separates pdfminer.six from the
|
||||||
other forks of the original pdfminer. PDF as a format is very diverse and
|
other forks of the original pdfminer. PDF as a format is very diverse and
|
||||||
there are countless deviations from the official format. The only way to
|
there are countless deviations from the official format. The only way to
|
||||||
support all the PDF's out there is to have a community that actively uses and
|
support all the PDFs out there is to have a community that actively uses and
|
||||||
improves pdfminer.
|
improves pdfminer.
|
||||||
|
|
||||||
Since 2020, the original pdfminer is `dormant
|
Since 2020, the original pdfminer is `dormant
|
||||||
|
|
|
@ -65,7 +65,7 @@ Only AcroForm interactive forms are supported, XFA forms are not supported.
|
||||||
|
|
||||||
print(name, values)
|
print(name, values)
|
||||||
|
|
||||||
This code snippet will print all the fields name and value and save them in the "data" dictionary.
|
This code snippet will print all the fields' names and values and save them in the "data" dictionary.
|
||||||
|
|
||||||
|
|
||||||
How it works:
|
How it works:
|
||||||
|
@ -77,9 +77,9 @@ How it works:
|
||||||
parser = PDFParser(fp)
|
parser = PDFParser(fp)
|
||||||
doc = PDFDocument(parser)
|
doc = PDFDocument(parser)
|
||||||
|
|
||||||
- Get the catalog
|
- Get the Catalog
|
||||||
|
|
||||||
(the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)
|
(the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference)
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
|
@ -122,7 +122,7 @@ How it works:
|
||||||
|
|
||||||
- Call the value(s) decoding method as needed
|
- Call the value(s) decoding method as needed
|
||||||
|
|
||||||
(a single field can hold multiple values, for example a combo box can hold more than one value at time)
|
(a single field can hold multiple values, for example, a combo box can hold more than one value at a time)
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
|
@ -131,7 +131,7 @@ How it works:
|
||||||
else:
|
else:
|
||||||
values = decode_value(values)
|
values = decode_value(values)
|
||||||
|
|
||||||
(the decode_value method takes care of decoding the fields value returning a string)
|
(the decode_value method takes care of decoding the field's value, returning a string)
|
||||||
|
|
||||||
- Decode PSLiteral and PSKeyword field values
|
- Decode PSLiteral and PSKeyword field values
|
||||||
|
|
||||||
|
|
|
@ -3,7 +3,7 @@
|
||||||
Converting a PDF file to text
|
Converting a PDF file to text
|
||||||
*****************************
|
*****************************
|
||||||
|
|
||||||
Most PDF files look like they contain well structured text. But the reality is
|
Most PDF files look like they contain well-structured text. But the reality is
|
||||||
that a PDF file does not contain anything that resembles paragraphs,
|
that a PDF file does not contain anything that resembles paragraphs,
|
||||||
sentences or even words. When it comes to text, a PDF file is only aware of
|
sentences or even words. When it comes to text, a PDF file is only aware of
|
||||||
the characters and their placement.
|
the characters and their placement.
|
||||||
|
@ -14,7 +14,7 @@ compose the table, the page footer or the description of a figure. Unlike
|
||||||
other document formats, like a `.txt` file or a word document, the PDF format
|
other document formats, like a `.txt` file or a word document, the PDF format
|
||||||
does not contain a stream of text.
|
does not contain a stream of text.
|
||||||
|
|
||||||
A PDF document does consists of a collection of objects that together describe
|
A PDF document consists of a collection of objects that together describe
|
||||||
the appearance of one or more pages, possibly accompanied by additional
|
the appearance of one or more pages, possibly accompanied by additional
|
||||||
interactive elements and higher-level application data. A PDF file contains
|
interactive elements and higher-level application data. A PDF file contains
|
||||||
the objects making up a PDF document along with associated structural
|
the objects making up a PDF document along with associated structural
|
||||||
|
@ -53,7 +53,7 @@ uses these bounding boxes to decide which characters belong together.
|
||||||
|
|
||||||
Characters that are both horizontally and vertically close are grouped onto
|
Characters that are both horizontally and vertically close are grouped onto
|
||||||
one line. How close they should be is determined by the `char_margin`
|
one line. How close they should be is determined by the `char_margin`
|
||||||
(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
|
(M in the figure) and the `line_overlap` (not in figure) parameter. The horizontal
|
||||||
*distance* between the bounding boxes of two characters should be smaller than
|
*distance* between the bounding boxes of two characters should be smaller than
|
||||||
the `char_margin` and the vertical *overlap* between the bounding boxes should
|
the `char_margin` and the vertical *overlap* between the bounding boxes should
|
||||||
be smaller than the `line_overlap`.
|
be smaller than the `line_overlap`.
|
||||||
|
@ -76,7 +76,7 @@ be separated by a space.
|
||||||
|
|
||||||
The result of this stage is a list of lines. Each line consists of a list of
|
The result of this stage is a list of lines. Each line consists of a list of
|
||||||
characters. These characters are either original `LTChar` characters that
|
characters. These characters are either original `LTChar` characters that
|
||||||
originate from the PDF file, or inserted `LTAnno` characters that
|
originate from the PDF file or inserted `LTAnno` characters that
|
||||||
represent spaces between words or newlines at the end of each line.
|
represent spaces between words or newlines at the end of each line.
|
||||||
|
|
||||||
Grouping lines into boxes
|
Grouping lines into boxes
|
||||||
|
@ -91,7 +91,7 @@ Lines that are both horizontally overlapping and vertically close are grouped.
|
||||||
How vertically close the lines should be is determined by the `line_margin`.
|
How vertically close the lines should be is determined by the `line_margin`.
|
||||||
This margin is specified relative to the height of the bounding box. Lines
|
This margin is specified relative to the height of the bounding box. Lines
|
||||||
are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
|
are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
|
||||||
(see L :sub:`2`) in the figure) of the bounding boxes is closer together
|
(see L :sub:`2`) in the figure) of the bounding boxes are closer together
|
||||||
than the absolute line margin, i.e. the `line_margin` multiplied by the
|
than the absolute line margin, i.e. the `line_margin` multiplied by the
|
||||||
height of the bounding box.
|
height of the bounding box.
|
||||||
|
|
||||||
|
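The `line_margin` rule in the hunk above can be illustrated with a small, self-contained sketch. It is one possible reading of the documented behaviour, not pdfminer.six's actual implementation: the `Box` type and the `are_neighbours` function are hypothetical names introduced here, and the default of 0.5 is chosen for illustration only. The sketch grows one line's bounding box by the absolute margin (`line_margin` multiplied by the box height) above and below, then checks whether the other line's box intersects it while also overlapping horizontally.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned bounding box; y grows upward as in PDF space."""

    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def are_neighbours(a: Box, b: Box, line_margin: float = 0.5) -> bool:
    """Return True if line b should be grouped with line a (illustrative only)."""
    # The margin is relative: multiply by the height of a's bounding box
    # to get the absolute line margin.
    absolute_margin = line_margin * a.height

    # The lines must overlap horizontally.
    horizontally_overlapping = a.x0 < b.x1 and b.x0 < a.x1

    # The lines must be vertically close: b intersects a's box after it has
    # been extended by the absolute margin at the top and at the bottom.
    vertically_close = (a.y0 - absolute_margin) < b.y1 and b.y0 < (
        a.y1 + absolute_margin
    )

    return horizontally_overlapping and vertically_close


line = Box(0, 100, 50, 110)      # height 10, so the absolute margin is 5
next_line = Box(0, 88, 50, 98)   # 2 units below `line`
far_line = Box(0, 60, 50, 70)    # 30 units below `line`
side_line = Box(60, 88, 100, 98) # close vertically, but no horizontal overlap

print(are_neighbours(line, next_line))
print(are_neighbours(line, far_line))
print(are_neighbours(line, side_line))
```

Raising `line_margin` widens the vertical window, so lines that are further apart end up in the same box; lowering it splits text into more, smaller boxes.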
@ -120,7 +120,7 @@ Working with rotated characters
|
||||||
|
|
||||||
The algorithm described above assumes that all characters have the same
|
The algorithm described above assumes that all characters have the same
|
||||||
orientation. However, any writing direction is possible in a PDF. To
|
orientation. However, any writing direction is possible in a PDF. To
|
||||||
accommodate for this, pdfminer.six allows to detect vertical writing with the
|
accommodate for this, pdfminer.six allows detecting vertical writing with the
|
||||||
`detect_vertical` parameter. This will apply all the grouping steps as if the
|
`detect_vertical` parameter. This will apply all the grouping steps as if the
|
||||||
pdf was rotated 90 (or 270) degrees
|
pdf was rotated 90 (or 270) degrees
|
||||||
|
|
||||||
|
|
|
@ -195,7 +195,7 @@ def extract_pages(
|
||||||
:param caching: If resources should be cached
|
:param caching: If resources should be cached
|
||||||
:param laparams: An LAParams object from pdfminer.layout. If None, uses
|
:param laparams: An LAParams object from pdfminer.layout. If None, uses
|
||||||
some default settings that often work well.
|
some default settings that often work well.
|
||||||
:return:
|
:return: LTPage objects
|
||||||
"""
|
"""
|
||||||
if laparams is None:
|
if laparams is None:
|
||||||
laparams = LAParams()
|
laparams = LAParams()
|
||||||
|
|
|
@ -1011,8 +1011,8 @@ class LTFigure(LTLayoutContainer):
|
||||||
class LTPage(LTLayoutContainer):
|
class LTPage(LTLayoutContainer):
|
||||||
"""Represents an entire page.
|
"""Represents an entire page.
|
||||||
|
|
||||||
May contain child objects like LTTextBox, LTFigure, LTImage, LTRect,
|
Like any other LTLayoutContainer, an LTPage can be iterated to obtain child
|
||||||
LTCurve and LTLine.
|
objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:
|
def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:
|
||||||
|
|