Fix small typos in documentation (#828)
* Fix #795
* Documentation updates (FAQ and others)
* New how-to for extracting coordinates
* Indent fix in documentation
* Revert "Fix #795"
This reverts commit cac62171fc.
* Move description of iterating LTPage to the docstring of LTPage
* Remove adding how-to for extracting coordinates from this pr
* Add CHANGELOG.md
* Remove FAQ from this branch
* Only add one line to CHANGELOG.md
Co-authored-by: Kunal Gehlot <kunal.g@360hvpl.com>
pull/801/head
parent fa71062c35
commit 3688911afe
@@ -20,6 +20,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - `TypeError` in cmapdb.py when parsing null characters ([#768](https://github.com/pdfminer/pdfminer.six/pull/768))
 - Color "convenience operators" now (per spec) also set color space ([#794](https://github.com/pdfminer/pdfminer.six/pull/794))
 - `ValueError` when extracting images, due to breaking changes in Pillow ([#827](https://github.com/pdfminer/pdfminer.six/pull/827))
+- Small typo's and issues in the documentation ([#828](https://github.com/pdfminer/pdfminer.six/pull/828))
 
 ### Deprecated
 

@@ -7,11 +7,11 @@ Why is it called pdfminer.six?
 ==============================
 
 Pdfminer.six is a fork of the `original pdfminer created by Euske
-<https://github.com/euske>`_. Almost all of the code and architecture is in
-fact created by Euske. But, for a long time this original pdfminer did not
+<https://github.com/euske>`_. Almost all of the code and architecture are in
+fact created by Euske. But, for a long time, this original pdfminer did not
 support Python 3. Until 2020 the original pdfminer only supported Python 2.
 The original goal of pdfminer.six was to add support for Python 3. This was
-done with the six package. The six package helps to write code that is
+done with the `six` package. The `six` package helps to write code that is
 compatible with both Python 2 and Python 3. Hence, pdfminer.six.
 
 As of 2020, pdfminer.six dropped the support for Python 2 because it was
@@ -27,13 +27,13 @@ also equal to six feet.
 How does pdfminer.six compare to other forks of pdfminer?
 ==========================================================
 
-Pdfminer.six is now an independent and community maintained package for
-extracting text from PDF's with Python. We actively fix bugs (also for PDF's
+Pdfminer.six is now an independent and community-maintained package for
+extracting text from PDFs with Python. We actively fix bugs (also for PDFs
 that don't strictly follow the PDF Reference), add new features and improve
 the usability of pdfminer.six. This community separates pdfminer.six from the
 other forks of the original pdfminer. PDF as a format is very diverse and
 there are countless deviations from the official format. The only way to
-support all the PDF's out there is to have a community that actively uses and
+support all the PDFs out there is to have a community that actively uses and
 improves pdfminer.
 
 Since 2020, the original pdfminer is `dormant

@@ -65,7 +65,7 @@ Only AcroForm interactive forms are supported, XFA forms are not supported.
 
 print(name, values)
 
-This code snippet will print all the fields name and value and save them in the "data" dictionary.
+This code snippet will print all the fields' names and values and save them in the "data" dictionary.
 
 
 How it works:
@@ -77,9 +77,9 @@ How it works:
 parser = PDFParser(fp)
 doc = PDFDocument(parser)
 
-- Get the catalog
+- Get the Catalog
 
-(the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)
+(the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference)
 
 .. code-block:: python
 
@@ -122,7 +122,7 @@ How it works:
 
 - Call the value(s) decoding method as needed
 
-(a single field can hold multiple values, for example a combo box can hold more than one value at time)
+(a single field can hold multiple values, for example, a combo box can hold more than one value at a time)
 
 .. code-block:: python
 
@@ -131,7 +131,7 @@ How it works:
 else:
 values = decode_value(values)
 
-(the decode_value method takes care of decoding the fields value returning a string)
+(the decode_value method takes care of decoding the field's value, returning a string)
 
 - Decode PSLiteral and PSKeyword field values
 

@@ -3,7 +3,7 @@
 Converting a PDF file to text
 *****************************
 
-Most PDF files look like they contain well structured text. But the reality is
+Most PDF files look like they contain well-structured text. But the reality is
 that a PDF file does not contain anything that resembles paragraphs,
 sentences or even words. When it comes to text, a PDF file is only aware of
 the characters and their placement.
@@ -14,7 +14,7 @@ compose the table, the page footer or the description of a figure. Unlike
 other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.
 
-A PDF document does consists of a collection of objects that together describe
+A PDF document consists of a collection of objects that together describe
 the appearance of one or more pages, possibly accompanied by additional
 interactive elements and higher-level application data. A PDF file contains
 the objects making up a PDF document along with associated structural
@@ -53,7 +53,7 @@ uses these bounding boxes to decide which characters belong together.
 
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
-(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
+(M in the figure) and the `line_overlap` (not in figure) parameter. The horizontal
 *distance* between the bounding boxes of two characters should be smaller than
 the `char_margin` and the vertical *overlap* between the bounding boxes should
 be smaller than the `line_overlap`.
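As a rough illustration of the horizontal part of the rule quoted in the hunk above, the sketch below checks whether two character bounding boxes are closer together than a margin. It is not pdfminer.six's implementation: the `(x0, y0, x1, y1)` tuple layout, the function names, and the choice to treat `char_margin` as an absolute distance are assumptions made for this example, and the vertical `line_overlap` check is left out entirely.

```python
from typing import Tuple

# Hypothetical bounding box layout: (x0, y0, x1, y1), with x0 <= x1.
BBox = Tuple[float, float, float, float]


def horizontal_distance(a: BBox, b: BBox) -> float:
    """Return the horizontal gap between two boxes, or 0.0 if they overlap horizontally."""
    if a[2] < b[0]:  # a lies entirely to the left of b
        return b[0] - a[2]
    if b[2] < a[0]:  # b lies entirely to the left of a
        return a[0] - b[2]
    return 0.0


def within_char_margin(a: BBox, b: BBox, char_margin: float) -> bool:
    """True if the horizontal gap is smaller than char_margin (assumed absolute here)."""
    return horizontal_distance(a, b) < char_margin
```

With boxes `(0, 0, 5, 10)` and `(7, 0, 12, 10)` the gap is 2 units, so the pair passes with a margin of 3 and fails with a margin of 2.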
@@ -76,7 +76,7 @@ be separated by a space.
 
 The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
-originate from the PDF file, or inserted `LTAnno` characters that
+originate from the PDF file or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
 
 Grouping lines into boxes
@@ -91,7 +91,7 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes is closer together
+(see L :sub:`2`) in the figure) of the bounding boxes are closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.
 
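The absolute line margin mentioned in the hunk above is a plain product, which the following sketch spells out. The function name and the `(x0, y0, x1, y1)` tuple layout are hypothetical and chosen only for illustration; this is not taken from the pdfminer.six source.

```python
from typing import Tuple

# Hypothetical bounding box layout: (x0, y0, x1, y1), with y0 <= y1.
BBox = Tuple[float, float, float, float]


def absolute_line_margin(bbox: BBox, line_margin: float) -> float:
    """Scale the relative line_margin by the height of the bounding box."""
    height = bbox[3] - bbox[1]
    return line_margin * height
```

For a line whose bounding box is 12 units tall and a `line_margin` of 0.5, this gives an absolute margin of 6 units.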
@@ -120,7 +120,7 @@ Working with rotated characters
 
 The algorithm described above assumes that all characters have the same
 orientation. However, any writing direction is possible in a PDF. To
-accommodate for this, pdfminer.six allows to detect vertical writing with the
+accommodate for this, pdfminer.six allows detecting vertical writing with the
 `detect_vertical` parameter. This will apply all the grouping steps as if the
 pdf was rotated 90 (or 270) degrees
 

@@ -195,7 +195,7 @@ def extract_pages(
 :param caching: If resources should be cached
 :param laparams: An LAParams object from pdfminer.layout. If None, uses
 some default settings that often work well.
-:return:
+:return: LTPage objects
 """
 if laparams is None:
 laparams = LAParams()

@@ -1011,8 +1011,8 @@ class LTFigure(LTLayoutContainer):
 class LTPage(LTLayoutContainer):
 """Represents an entire page.
 
-May contain child objects like LTTextBox, LTFigure, LTImage, LTRect,
-LTCurve and LTLine.
+Like any other LTLayoutContainer, an LTPage can be iterated to obtain child
+objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
 """
 
 def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None: