Fix typos in converting_pdf_to_text.rst (#611)

* Fix typos in converting_pdf_to_text.rst

* The word "pdfminer.six" as a whole should not be split across a newline; otherwise the renderer treats it as two separate words and displays them separately.

* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-09-01 02:52:13 +08:00
parent 46fa21476a
commit 8ea9f1091a
1 changed file with 8 additions and 8 deletions
--- a/docs/source/topic/converting_pdf_to_text.rst
+++ b/docs/source/topic/converting_pdf_to_text.rst
@@ -11,7 +11,7 @@ the characters and their placement.
 This makes extracting meaningful pieces of text from PDF files difficult.
 The characters that compose a paragraph are no different from those that
 compose the table, the page footer or the description of a figure. Unlike
-other documents formats, like a `.txt` file or a word document, the PDF format
+other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.

 A PDF document does consists of a collection of objects that together describe
@@ -29,7 +29,7 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
 on the positioning of characters. This works well for sentences and
 paragraphs because meaningful groups of nearby characters can be made.

-The layout analysis consist of three different stages: it groups characters
+The layout analysis consists of three different stages: it groups characters
 into words and lines, then it groups lines into boxes and finally it groups
 textboxes hierarchically. These stages are discussed in the following
 sections. The resulting output of the layout analysis is an ordered hierarchy
@@ -48,8 +48,8 @@ Grouping characters into words and lines

 The first step in going from characters to text is to group characters in a
 meaningful way. Each character has an x-coordinate and a y-coordinate for its
-bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
-.six uses these bounding boxes to decide which characters belong together.
+bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six 
+uses these bounding boxes to decide which characters belong together.

 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
@@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.

-The result of this stage is a list of lines. Each line consists a list of
+The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
@@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes are closer together
+(see L :sub:`2`) in the figure) of the bounding boxes is closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.

 .. raw:: html
    :file: ../_static/layout_analysis_group_lines.html

-The result of this stage is a list of text boxes. Each box consist of a list
+The result of this stage is a list of text boxes. Each box consists of a list
 of lines.

 Grouping textboxes hierarchically