Fix typos in converting_pdf_to_text.rst (#611)
* Fix typos in converting_pdf_to_text.rst
* The word "pdfminer.six" as a whole should not be separated by newline, otherwise they are treated as two separated words by renderer, and incorrectly displayed as separated.
* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pull/614/head
parent 46fa21476a
commit 8ea9f1091a
@@ -11,7 +11,7 @@ the characters and their placement.
 This makes extracting meaningful pieces of text from PDF files difficult.
 The characters that compose a paragraph are no different from those that
 compose the table, the page footer or the description of a figure. Unlike
-other documents formats, like a `.txt` file or a word document, the PDF format
+other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.
 
 A PDF document does consists of a collection of objects that together describe
@@ -29,7 +29,7 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
 on the positioning of characters. This works well for sentences and
 paragraphs because meaningful groups of nearby characters can be made.
 
-The layout analysis consist of three different stages: it groups characters
+The layout analysis consists of three different stages: it groups characters
 into words and lines, then it groups lines into boxes and finally it groups
 textboxes hierarchically. These stages are discussed in the following
 sections. The resulting output of the layout analysis is an ordered hierarchy
@@ -48,8 +48,8 @@ Grouping characters into words and lines
 
 The first step in going from characters to text is to group characters in a
 meaningful way. Each character has an x-coordinate and a y-coordinate for its
-bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
-.six uses these bounding boxes to decide which characters belong together.
+bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
+uses these bounding boxes to decide which characters belong together.
 
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
@@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.
 
-The result of this stage is a list of lines. Each line consists a list of
+The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
@@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes are closer together
+(see L :sub:`2`) in the figure) of the bounding boxes is closer together
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.
 
 .. raw:: html
 :file: ../_static/layout_analysis_group_lines.html
 
-The result of this stage is a list of text boxes. Each box consist of a list
+The result of this stage is a list of text boxes. Each box consists of a list
 of lines.
 
 Grouping textboxes hierarchically
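Note on the last hunk: the rule it documents (two lines are close when the gaps between their tops and between their bottoms are both smaller than `line_margin` multiplied by the bounding box height) can be sketched in a few lines of Python. This is only an illustration of the documented wording, not pdfminer.six's actual implementation. The function names, and the choice to pass the two gaps in directly, are assumptions made for the example.

```python
def absolute_line_margin(line_margin: float, box_height: float) -> float:
    """Absolute margin as described in the text: line_margin * box height."""
    return line_margin * box_height


def lines_are_close(top_gap: float, bottom_gap: float,
                    line_margin: float, box_height: float) -> bool:
    """Hypothetical helper: True when both gaps (L1 between the tops,
    L2 between the bottoms) are smaller than the absolute line margin.
    This is a literal reading of the documentation, not library code."""
    limit = absolute_line_margin(line_margin, box_height)
    return top_gap < limit and bottom_gap < limit


if __name__ == "__main__":
    # With line_margin=0.5 and a box height of 10, the absolute margin is 5.
    print(absolute_line_margin(0.5, 10.0))
    print(lines_are_close(3.0, 4.0, 0.5, 10.0))
    print(lines_are_close(3.0, 6.0, 0.5, 10.0))
```

A larger `line_margin` raises the limit, so lines that are further apart still end up in the same text box.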