Fix typos in converting_pdf_to_text.rst (#611)
* Fix typos in converting_pdf_to_text.rst
* The word "pdfminer.six" as a whole should not be separated by newline, otherwise they are treated as two separated words by renderer, and incorrectly displayed as separated.
* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pull/614/head
parent 46fa21476a
commit 8ea9f1091a

@@ -11,7 +11,7 @@ the characters and their placement.
 This makes extracting meaningful pieces of text from PDF files difficult.
 The characters that compose a paragraph are no different from those that
 compose the table, the page footer or the description of a figure. Unlike
-other documents formats, like a `.txt` file or a word document, the PDF format
+other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.

 A PDF document does consists of a collection of objects that together describe
@@ -29,7 +29,7 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
 on the positioning of characters. This works well for sentences and
 paragraphs because meaningful groups of nearby characters can be made.

-The layout analysis consist of three different stages: it groups characters
+The layout analysis consists of three different stages: it groups characters
 into words and lines, then it groups lines into boxes and finally it groups
 textboxes hierarchically. These stages are discussed in the following
 sections. The resulting output of the layout analysis is an ordered hierarchy
@@ -48,8 +48,8 @@ Grouping characters into words and lines

 The first step in going from characters to text is to group characters in a
 meaningful way. Each character has an x-coordinate and a y-coordinate for its
-bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
-.six uses these bounding boxes to decide which characters belong together.
+bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
+uses these bounding boxes to decide which characters belong together.

 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
@@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.

-The result of this stage is a list of lines. Each line consists a list of
+The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
@@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes are closer together
+(see L :sub:`2`) in the figure) of the bounding boxes is closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.

 .. raw:: html
     :file: ../_static/layout_analysis_group_lines.html

-The result of this stage is a list of text boxes. Each box consist of a list
+The result of this stage is a list of text boxes. Each box consists of a list
 of lines.

 Grouping textboxes hierarchically