From 3688911afe1029b59ae09275228fa889679f495b Mon Sep 17 00:00:00 2001
From: Pieter Marsman
Date: Sat, 5 Nov 2022 17:08:23 +0100
Subject: [PATCH] Fix small typos in documentation (#828)

* Fix #795

* Documentation updates (FAQ and others)

* New how-to for extracting coordinates

* Indent fix in documentation

* Revert "Fix #795"

This reverts commit cac62171fc6c8458ff1673137eff233107cae47b.

* Move description of iterating LTPage to the docstring of LTPage

* Remove adding how-to for extracting coordinates from this pr

* Add CHANGELOG.md

* Remove FAQ from this branch

* Only add one line to CHANGELOG.md

Co-authored-by: Kunal Gehlot
---
 CHANGELOG.md                                 |  1 +
 docs/source/faq.rst                          | 12 ++++++------
 docs/source/howto/acro_forms.rst             | 10 +++++-----
 docs/source/topic/converting_pdf_to_text.rst | 12 ++++++------
 pdfminer/high_level.py                       |  2 +-
 pdfminer/layout.py                           |  4 ++--
 6 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index abc7362..51ecc2c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - `TypeError` in cmapdb.py when parsing null characters ([#768](https://github.com/pdfminer/pdfminer.six/pull/768))
 - Color "convenience operators" now (per spec) also set color space ([#794](https://github.com/pdfminer/pdfminer.six/pull/794))
 - `ValueError` when extracting images, due to breaking changes in Pillow ([#827](https://github.com/pdfminer/pdfminer.six/pull/827))
+- Small typos and issues in the documentation ([#828](https://github.com/pdfminer/pdfminer.six/pull/828))

 ### Deprecated

diff --git a/docs/source/faq.rst b/docs/source/faq.rst
index 5a742d6..3461492 100644
--- a/docs/source/faq.rst
+++ b/docs/source/faq.rst
@@ -7,11 +7,11 @@ Why is it called pdfminer.six?
 ==============================

 Pdfminer.six is a fork of the `original pdfminer created by Euske
-`_. Almost all of the code and architecture is in
-fact created by Euske.
But, for a long time this original pdfminer did not
+`_. Almost all of the code and architecture are in
+fact created by Euske. But, for a long time, this original pdfminer did not
 support Python 3. Until 2020 the original pdfminer only supported Python 2.
 The original goal of pdfminer.six was to add support for Python 3. This was
-done with the six package. The six package helps to write code that is
+done with the `six` package. The `six` package helps to write code that is
 compatible with both Python 2 and Python 3. Hence, pdfminer.six.

 As of 2020, pdfminer.six dropped the support for Python 2 because it was
@@ -27,13 +27,13 @@ also equal to six feet.
 How does pdfminer.six compare to other forks of pdfminer?
 ==========================================================

-Pdfminer.six is now an independent and community maintained package for
-extracting text from PDF's with Python. We actively fix bugs (also for PDF's
+Pdfminer.six is now an independent and community-maintained package for
+extracting text from PDFs with Python. We actively fix bugs (also for PDFs
 that don't strictly follow the PDF Reference), add new features and improve
 the usability of pdfminer.six. This community separates pdfminer.six from the
 other forks of the original pdfminer. PDF as a format is very diverse and
 there are countless deviations from the official format. The only way to
-support all the PDF's out there is to have a community that actively uses and
+support all the PDFs out there is to have a community that actively uses and
 improves pdfminer.

 Since 2020, the original pdfminer is `dormant
diff --git a/docs/source/howto/acro_forms.rst b/docs/source/howto/acro_forms.rst
index 276dccf..c4932c3 100644
--- a/docs/source/howto/acro_forms.rst
+++ b/docs/source/howto/acro_forms.rst
@@ -65,7 +65,7 @@ Only AcroForm interactive forms are supported, XFA forms are not supported.
print(name, values)

-This code snippet will print all the fields name and value and save them in the "data" dictionary.
+This code snippet will print all the fields' names and values and save them in the "data" dictionary.

 How it works:

@@ -77,9 +77,9 @@ How it works:
     parser = PDFParser(fp)
     doc = PDFDocument(parser)

-- Get the catalog
+- Get the Catalog

-  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)
+  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference)

   .. code-block:: python

@@ -122,7 +122,7 @@ How it works:

 - Call the value(s) decoding method as needed

-  (a single field can hold multiple values, for example a combo box can hold more than one value at time)
+  (a single field can hold multiple values, for example, a combo box can hold more than one value at a time)

   .. code-block:: python

@@ -131,7 +131,7 @@ How it works:
     else:
         values = decode_value(values)

-(the decode_value method takes care of decoding the fields value returning a string)
+(the decode_value method takes care of decoding the field's value, returning a string)

 - Decode PSLiteral and PSKeyword field values

diff --git a/docs/source/topic/converting_pdf_to_text.rst b/docs/source/topic/converting_pdf_to_text.rst
index 5194b11..18c1cba 100644
--- a/docs/source/topic/converting_pdf_to_text.rst
+++ b/docs/source/topic/converting_pdf_to_text.rst
@@ -3,7 +3,7 @@
 Converting a PDF file to text
 *****************************

-Most PDF files look like they contain well structured text. But the reality is
+Most PDF files look like they contain well-structured text. But the reality is
 that a PDF file does not contain anything that resembles paragraphs, sentences
 or even words.
When it comes to text, a PDF file is only aware of the characters
 and their placement.

@@ -14,7 +14,7 @@ compose the table, the page footer or the description of a figure. Unlike
 other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.

-A PDF document does consists of a collection of objects that together describe
+A PDF document consists of a collection of objects that together describe
 the appearance of one or more pages, possibly accompanied by additional
 interactive elements and higher-level application data. A PDF file contains
 the objects making up a PDF document along with associated structural
@@ -53,7 +53,7 @@ uses these bounding boxes to decide which characters belong together.

 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
-(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
+(M in the figure) and the `line_overlap` (not in figure) parameter. The horizontal
 *distance* between the bounding boxes of two characters should be smaller than
 the `char_margin` and the vertical *overlap* between the bounding boxes should
 be smaller than the `line_overlap`.
@@ -76,7 +76,7 @@ be separated by a space.

 The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
-originate from the PDF file, or inserted `LTAnno` characters that
+originate from the PDF file or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.

 Grouping lines into boxes
@@ -91,7 +91,7 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box.
Lines are
 close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes is closer together
+(see L :sub:`2` in the figure) of the bounding boxes is closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.

@@ -120,7 +120,7 @@ Working with rotated characters

 The algorithm described above assumes that all characters have the same
 orientation. However, any writing direction is possible in a PDF. To
-accommodate for this, pdfminer.six allows to detect vertical writing with the
+accommodate for this, pdfminer.six allows detecting vertical writing with the
 `detect_vertical` parameter. This will apply all the grouping steps as if the
 pdf was rotated 90 (or 270) degrees

diff --git a/pdfminer/high_level.py b/pdfminer/high_level.py
index 94be9d4..6587fde 100644
--- a/pdfminer/high_level.py
+++ b/pdfminer/high_level.py
@@ -195,7 +195,7 @@ def extract_pages(
     :param caching: If resources should be cached
     :param laparams: An LAParams object from pdfminer.layout. If None, uses
         some default settings that often work well.
-    :return:
+    :return: LTPage objects
     """
     if laparams is None:
         laparams = LAParams()
diff --git a/pdfminer/layout.py b/pdfminer/layout.py
index 5158f0e..5bfe759 100644
--- a/pdfminer/layout.py
+++ b/pdfminer/layout.py
@@ -1011,8 +1011,8 @@ class LTFigure(LTLayoutContainer):
 class LTPage(LTLayoutContainer):
     """Represents an entire page.

-    May contain child objects like LTTextBox, LTFigure, LTImage, LTRect,
-    LTCurve and LTLine.
+    Like any other LTLayoutContainer, an LTPage can be iterated to obtain child
+    objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
     """

     def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:
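
The two docstring changes in `pdfminer/high_level.py` and `pdfminer/layout.py` say that `extract_pages` returns LTPage objects and that an LTPage can be iterated to obtain its child objects. A minimal sketch of that usage might look like the following. The helper names `iter_children` and `print_layout` are hypothetical and not part of pdfminer.six, and any path passed to `print_layout` is assumed to point at an existing PDF file.

```python
from typing import Iterable, Iterator, Tuple


def iter_children(pages: Iterable[Iterable[object]]) -> Iterator[Tuple[int, object]]:
    """Yield (page_index, child) for every child of every page.

    Hypothetical helper: it only relies on each page being iterable,
    which is what the updated LTPage docstring describes.
    """
    for page_index, page in enumerate(pages):
        for child in page:
            yield page_index, child


def print_layout(pdf_path: str) -> None:
    """Print the type and bounding box of each top-level layout object."""
    # Imported lazily so iter_children can be tried without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    for page_index, child in iter_children(extract_pages(pdf_path)):
        # Layout objects such as LTTextBox or LTFigure carry a bbox attribute;
        # getattr keeps this sketch tolerant of objects that do not.
        print(page_index, type(child).__name__, getattr(child, "bbox", None))
```

Calling `print_layout("example.pdf")` (a placeholder file name) would list the children of each page in order. Keeping the iteration in a separate helper is only a convenience for trying the loop on plain lists.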