* Updated misleading documentation about word_margin
* Small change in sentence about word_margin
* Remove confusing sentence about adding spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pull/399/head
parent 1a4a06da9f
commit 518b5d6efc
@@ -6,7 +6,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## [Unreleased]

 ### Fixed
-
+- Updated misleading documentation for `word_margin` and `char_margin` ([#407](https://github.com/pdfminer/pdfminer.six/pull/407))
 - Ignore ValueError when converting font encoding differences ([#389](https://github.com/pdfminer/pdfminer.six/pull/389))
 - Grouping of text lines outside of parent container bounding box ([#386](https://github.com/pdfminer/pdfminer.six/pull/386))

@@ -50,12 +50,12 @@ meaningful way. Each character has an x-coordinate and a y-coordinate for its
 bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
 .six uses these bounding boxes to decide which characters belong together.

-Characters that are both horizontally and vertically close are grouped. How
-close they should be is determined by the `char_margin` (M in figure) and the
-`line_overlap` (not in figure) parameter. The horizontal *distance* between the
-bounding boxes of two characters should be smaller that the `char_margin` and
-the vertical *overlap* between the bounding boxes should be smaller the the
-`line_overlap`.
+Characters that are both horizontally and vertically close are grouped onto
+one line. How close they should be is determined by the `char_margin`
+(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
+*distance* between the bounding boxes of two characters should be smaller that
+the `char_margin` and the vertical *overlap* between the bounding boxes should
+be smaller the the `line_overlap`.


 .. raw:: html
@@ -69,10 +69,10 @@ relative to the minimum height of either one of the bounding boxes.
 Spaces need to be inserted between characters because the PDF format has no
 notion of the space character. A space is inserted if the characters are
 further apart that the `word_margin` (W in the figure). The `word_margin` is
-relative to the maximum width or height of the new character. Having a larger
-`word_margin` creates smaller words and inserts spaces between characters
-more often. Note that the `word_margin` should be smaller than the
-`char_margin` otherwise all the characters are seperated by a space.
+relative to the maximum width or height of the new character. Having a smaller
+`word_margin` creates smaller words. Note that the `word_margin` should at
+least be smaller than the `char_margin` otherwise none of the characters will
+be separated by a space.

 The result of this stage is a list of lines. Each line consists a list of
 characters. These characters either original `LTChar` characters that

@@ -36,14 +36,12 @@ class LAParams:
 are considered to be on the same line. The overlap is specified
 relative to the minimum height of both characters.
 :param char_margin: If two characters are closer together than this
-margin they are considered to be part of the same word. If
-characters are on the same line but not part of the same word, an
-intermediate space is inserted. The margin is specified relative to
-the width of the character.
-:param word_margin: If two words are are closer together than this
-margin they are considered to be part of the same line. A space is
-added in between for readability. The margin is specified relative
-to the width of the word.
+margin they are considered part of the same line. The margin is
+specified relative to the width of the character.
+:param word_margin: If two characters on the same line are further apart
+than this margin then they are considered to be two separate words, and
+an intermediate space will be added for readability. The margin is
+specified relative to the width of the character.
 :param line_margin: If two lines are are close together they are
 considered to be part of the same paragraph. The margin is
 specified relative to the height of a line.

@@ -102,14 +102,14 @@ def maketheparser():
 la_params.add_argument(
 "--char-margin", "-M", type=float, default=2.0,
 help="If two characters are closer together than this margin they "
-"are considered to be part of the same word. The margin is "
+"are considered to be part of the same line. The margin is "
 "specified relative to the width of the character.")
 la_params.add_argument(
 "--word-margin", "-W", type=float, default=0.1,
-help="If two words are are closer together than this margin they "
-"are considered to be part of the same line. A space is added "
-"in between for readability. The margin is specified relative "
-"to the width of the word.")
+help="If two characters on the same line are further apart than this "
+"margin then they are considered to be two separate words, and "
+"an intermediate space will be added for readability. The margin "
+"is specified relative to the width of the character.")
 la_params.add_argument(
 "--line-margin", "-L", type=float, default=0.5,
 help="If two lines are are close together they are considered to "
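
The corrected wording describes two thresholds: `char_margin` decides whether two characters share a line, and `word_margin` decides whether a space goes between two characters on that line. The sketch below illustrates that relationship in isolation. It is not pdfminer.six code: the `Char` class and `group_chars` function are hypothetical, only horizontal positions are modelled, and both margins are assumed to scale with the width of the new character, as the updated docstrings state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Char:
    """Hypothetical stand-in for a character; only its horizontal extent is kept."""
    text: str
    x0: float  # left edge of the bounding box
    x1: float  # right edge of the bounding box

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def group_chars(chars: List[Char], char_margin: float, word_margin: float) -> List[str]:
    """Group characters (already sorted left to right) into lines of text.

    Assumption: both margins are relative to the width of the new character.
    """
    lines: List[str] = []
    current = ""
    prev: Optional[Char] = None
    for ch in chars:
        if prev is None:
            current = ch.text
        else:
            gap = ch.x0 - prev.x1
            if gap >= char_margin * ch.width:
                # Not closer than char_margin: the character starts a new line.
                lines.append(current)
                current = ch.text
            elif gap > word_margin * ch.width:
                # Same line, but further apart than word_margin: insert a space.
                current += " " + ch.text
            else:
                # Same line and same word.
                current += ch.text
        prev = ch
    if current:
        lines.append(current)
    return lines


# Four characters of width 10 with gaps of 0, 5 and 65 units between them.
sample = [Char("a", 0, 10), Char("b", 10, 20), Char("c", 25, 35), Char("d", 100, 110)]
print(group_chars(sample, char_margin=2.0, word_margin=0.1))  # ['ab c', 'd']
print(group_chars(sample, char_margin=2.0, word_margin=2.0))  # ['abc', 'd']
```

The second call shows the note from the updated documentation: once `word_margin` is not smaller than `char_margin`, any gap wide enough to trigger a space is already wide enough to split the line, so no spaces are inserted.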