Fix typos in readthedocs documentation. (#579)

* Fix typos and possible mistakes.

* Revert two edits based on discussion in #579

  Revert the two changes based on our discussion. I read the documentation and had a glimpse at the default code. Perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers. Maybe it is useful to point that out in the figure, but I am not sure how best to do it. Another option is to use names like `min_char_margin_threshold` or similar, in the hope that they are easier to understand. Just some thoughts!

* Triggering travis again

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
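The point that `char_margin` and `word_margin` are both relative numbers can be sketched roughly as below. This is only an illustration: the helper names are hypothetical and not part of pdfminer, the parameter values and character sizes are made up, and the scaling of `char_margin` by the larger character width is an assumption (the documentation text in this diff only states the scaling for `word_margin`).

```python
# Hypothetical helpers, not pdfminer API: they only show how a relative
# margin turns into an absolute distance once character sizes are known.

def char_margin_threshold(char_margin: float, width_a: float, width_b: float) -> float:
    """Absolute horizontal distance below which two characters share a line.

    Assumption: the margin is scaled by the larger of the two widths.
    """
    return char_margin * max(width_a, width_b)


def word_margin_threshold(word_margin: float, new_width: float, new_height: float) -> float:
    """Absolute gap above which a space is inserted before the new character.

    Scaled by the maximum width or height of the new character.
    """
    return word_margin * max(new_width, new_height)


# Example values, chosen for illustration only.
char_margin, word_margin = 2.0, 0.1
width, height = 5.0, 10.0  # size of both characters, in PDF units

same_line_below = char_margin_threshold(char_margin, width, width)
space_above = word_margin_threshold(word_margin, width, height)

gap = 3.0  # horizontal distance between the two bounding boxes
on_same_line = gap < same_line_below
needs_space = gap > space_above
```

With these example numbers the two characters stay on one line and are separated by a space, even though the relative `word_margin` is much smaller than the relative `char_margin`.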
2021-08-27 03:58:50 +09:00 · 2021-08-27 03:58:50 +09:00 · d821fed340
parent 543976f195
commit d821fed340
1 changed files with 7 additions and 11 deletions
--- a/docs/source/topic/converting_pdf_to_text.rst
+++ b/docs/source/topic/converting_pdf_to_text.rst
@@ -4,11 +4,11 @@ Converting a PDF file to text
 *****************************

 Most PDF files look like they contain well structured text. But the reality  is
-that a PDF file does not contain anything that resembles a paragraphs,
+that a PDF file does not contain anything that resembles paragraphs,
 sentences or even words. When it comes to text, a PDF file is only aware of
 the characters and their placement.

-This makes extracting meaningful pieces of text from PDF's files difficult.
+This makes extracting meaningful pieces of text from PDF files difficult.
 The characters that compose a paragraph are no different from those that
 compose the table, the page footer or the description of a figure. Unlike
 other documents formats, like a `.txt` file or a word document, the PDF format
@@ -20,7 +20,6 @@ interactive elements and higher-level application data. A PDF file contains
 the objects making up a PDF document along with associated structural
 information, all represented as a single self-contained sequence of bytes. [1]_

-
 .. _topic_pdf_to_text_layout:

 Layout analysis algorithm
@@ -41,7 +40,6 @@ of layout objects on a PDF page.

    The output of the layout analysis is a hierarchy of layout objects.

-
 The output of the layout analysis heavily depends on a couple of parameters.
 All these parameters are part of the :ref:`api_laparams` class.

@@ -56,10 +54,9 @@ bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
 (M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
-*distance* between the bounding boxes of two characters should be smaller that
+*distance* between the bounding boxes of two characters should be smaller than
 the `char_margin` and the vertical *overlap* between the bounding boxes should
-be smaller the the `line_overlap`.
-
+be smaller than the `line_overlap`.

 .. raw:: html
    :file: ../_static/layout_analysis.html
@@ -71,14 +68,14 @@ relative to the minimum height of either one of the bounding boxes.

 Spaces need to be inserted between characters because the PDF format has no
 notion of the space character. A space is inserted if the characters are
-further apart that the `word_margin` (W in the figure). The `word_margin` is
+further apart than the `word_margin` (W in the figure). The `word_margin` is
 relative to the maximum width or height of the new character. Having a smaller
 `word_margin` creates smaller words. Note that the `word_margin` should at
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.

 The result of this stage is a list of lines. Each line consists a list of
-characters. These characters either original `LTChar` characters that
+characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.

@@ -107,7 +104,7 @@ of lines.
 Grouping textboxes hierarchically
 ---------------------------------

-the last step is to group the text boxes in a meaningful way. This step
+The last step is to group the text boxes in a meaningful way. This step
 repeatedly merges the two text boxes that are closest to each other.

 The closeness of bounding boxes is computed as the area that is between the
@@ -118,7 +115,6 @@ boxes of the individual lines.
 .. raw:: html
    :file: ../_static/layout_analysis_group_boxes.html

-
 Working with rotated characters
 ===============================