Fix typos in converting_pdf_to_text.rst (#611)

* Fix typos in converting_pdf_to_text.rst

* Keep "pdfminer.six" on a single line: when the name is split across a newline, the renderer treats the two parts as separate words and displays them separated.

* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
MapleCCC 2021-09-01 02:52:13 +08:00 committed by GitHub
parent 46fa21476a
commit 8ea9f1091a
1 changed file with 8 additions and 8 deletions

@@ -11,7 +11,7 @@ the characters and their placement.
 This makes extracting meaningful pieces of text from PDF files difficult.
 The characters that compose a paragraph are no different from those that
 compose the table, the page footer or the description of a figure. Unlike
-other documents formats, like a `.txt` file or a word document, the PDF format
+other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.
 A PDF document does consists of a collection of objects that together describe
@@ -29,7 +29,7 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
 on the positioning of characters. This works well for sentences and
 paragraphs because meaningful groups of nearby characters can be made.
-The layout analysis consist of three different stages: it groups characters
+The layout analysis consists of three different stages: it groups characters
 into words and lines, then it groups lines into boxes and finally it groups
 textboxes hierarchically. These stages are discussed in the following
 sections. The resulting output of the layout analysis is an ordered hierarchy
@@ -48,8 +48,8 @@ Grouping characters into words and lines
 The first step in going from characters to text is to group characters in a
 meaningful way. Each character has an x-coordinate and a y-coordinate for its
-bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
-.six uses these bounding boxes to decide which characters belong together.
+bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
+uses these bounding boxes to decide which characters belong together.
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
@@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.
-The result of this stage is a list of lines. Each line consists a list of
+The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
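
Reviewer note: the character-grouping stage documented in the hunks above can be illustrated with a small sketch. This is not the pdfminer.six implementation. It assumes the characters arrive in reading order, measures both margins against the width of the new character only, and uses a hypothetical `Char` record and `group_chars` helper. Plain strings stand in for `LTChar` objects, and the inserted `" "` and `"\n"` strings stand in for `LTAnno` objects. The default margin values are picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for a character with its bounding box."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def group_chars(chars, char_margin=2.0, word_margin=0.1):
    """Group characters (assumed to be in reading order) into lines."""
    lines = []
    current = []
    prev = None
    for ch in chars:
        width = ch.x1 - ch.x0
        if prev is not None:
            gap = ch.x0 - prev.x1
            # The two bounding boxes share some vertical extent.
            same_row = min(ch.y1, prev.y1) > max(ch.y0, prev.y0)
            if same_row and gap < char_margin * width:
                # Same line; a large enough gap becomes a space.
                if gap > word_margin * width:
                    current.append(" ")
            else:
                # Too far away: close the current line with a newline.
                current.append("\n")
                lines.append(current)
                current = []
        current.append(ch.text)
        prev = ch
    if current:
        current.append("\n")
        lines.append(current)
    return lines
```

With these assumptions, a `word_margin` that is not smaller than the `char_margin` never yields a space, because any gap wide enough to count as a word break already starts a new line.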
@@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes are closer together
+(see L :sub:`2`) in the figure) of the bounding boxes is closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.
 .. raw:: html
     :file: ../_static/layout_analysis_group_lines.html
-The result of this stage is a list of text boxes. Each box consists of a list
+The result of this stage is a list of text boxes. Each box consist of a list
 of lines.
 Grouping textboxes hierarchically
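
Reviewer note: the line-grouping stage described in the last hunk (lines into text boxes, controlled by `line_margin`) can be sketched in the same way. This is an illustration under stated assumptions, not the library's code. It reads "close" as the vertical gap between two bounding boxes being smaller than `line_margin` times the height of the first line, assumes the lines are given top to bottom, and compares each line only with the last line of the current box. `Line`, `are_neighbors` and `group_lines` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for a text line with its bounding box."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def are_neighbors(a, b, line_margin=0.5):
    """True if the lines overlap horizontally and are vertically close."""
    overlaps_horizontally = a.x0 < b.x1 and b.x0 < a.x1
    # Vertical gap between the boxes; zero when they overlap vertically.
    gap = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    # Absolute margin: line_margin multiplied by the height of line a.
    return overlaps_horizontally and gap < line_margin * (a.y1 - a.y0)


def group_lines(lines, line_margin=0.5):
    """Group lines (assumed sorted top to bottom) into text boxes."""
    boxes = []
    for line in lines:
        if boxes and are_neighbors(boxes[-1][-1], line, line_margin):
            boxes[-1].append(line)
        else:
            boxes.append([line])
    return boxes
```

A single pass like this only handles a single column read from top to bottom; it is meant to make the role of `line_margin` concrete, not to reproduce the full grouping behaviour.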