Fix typos in readthedocs documentation. (#579)
* Fix typos and possible mistakes.
* Revert two edits based on discussion in #579

  Revert the two changes based on our discussion. I read the documentation and had a glimpse at the default code. And perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers. Maybe it is useful to point that out in the figure but I am not sure how best to do it. Another option is to mention use something like `min_char_margin_threshold` or similar, in the hope that they are easier to understand. Just some thoughts!
* Triggering travis again

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pull/593/head^2
parent
543976f195
commit
d821fed340
|
@ -4,11 +4,11 @@ Converting a PDF file to text
|
||||||
*****************************
|
*****************************
|
||||||
|
|
||||||
Most PDF files look like they contain well structured text. But the reality is
|
Most PDF files look like they contain well structured text. But the reality is
|
||||||
that a PDF file does not contain anything that resembles a paragraphs,
|
that a PDF file does not contain anything that resembles paragraphs,
|
||||||
sentences or even words. When it comes to text, a PDF file is only aware of
|
sentences or even words. When it comes to text, a PDF file is only aware of
|
||||||
the characters and their placement.
|
the characters and their placement.
|
||||||
|
|
||||||
This makes extracting meaningful pieces of text from PDF's files difficult.
|
This makes extracting meaningful pieces of text from PDF files difficult.
|
||||||
The characters that compose a paragraph are no different from those that
|
The characters that compose a paragraph are no different from those that
|
||||||
compose the table, the page footer or the description of a figure. Unlike
|
compose the table, the page footer or the description of a figure. Unlike
|
||||||
other documents formats, like a `.txt` file or a word document, the PDF format
|
other documents formats, like a `.txt` file or a word document, the PDF format
|
||||||
|
@ -20,7 +20,6 @@ interactive elements and higher-level application data. A PDF file contains
|
||||||
the objects making up a PDF document along with associated structural
|
the objects making up a PDF document along with associated structural
|
||||||
information, all represented as a single self-contained sequence of bytes. [1]_
|
information, all represented as a single self-contained sequence of bytes. [1]_
|
||||||
|
|
||||||
|
|
||||||
.. _topic_pdf_to_text_layout:
|
.. _topic_pdf_to_text_layout:
|
||||||
|
|
||||||
Layout analysis algorithm
|
Layout analysis algorithm
|
||||||
|
@ -41,7 +40,6 @@ of layout objects on a PDF page.
|
||||||
|
|
||||||
The output of the layout analysis is a hierarchy of layout objects.
|
The output of the layout analysis is a hierarchy of layout objects.
|
||||||
|
|
||||||
|
|
||||||
The output of the layout analysis heavily depends on a couple of parameters.
|
The output of the layout analysis heavily depends on a couple of parameters.
|
||||||
All these parameters are part of the :ref:`api_laparams` class.
|
All these parameters are part of the :ref:`api_laparams` class.
|
||||||
|
|
||||||
|
@ -56,10 +54,9 @@ bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
|
||||||
Characters that are both horizontally and vertically close are grouped onto
|
Characters that are both horizontally and vertically close are grouped onto
|
||||||
one line. How close they should be is determined by the `char_margin`
|
one line. How close they should be is determined by the `char_margin`
|
||||||
(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
|
(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
|
||||||
*distance* between the bounding boxes of two characters should be smaller that
|
*distance* between the bounding boxes of two characters should be smaller than
|
||||||
the `char_margin` and the vertical *overlap* between the bounding boxes should
|
the `char_margin` and the vertical *overlap* between the bounding boxes should
|
||||||
be smaller the the `line_overlap`.
|
be smaller than the `line_overlap`.
|
||||||
|
|
||||||
|
|
||||||
.. raw:: html
|
.. raw:: html
|
||||||
:file: ../_static/layout_analysis.html
|
:file: ../_static/layout_analysis.html
|
||||||
|
@ -71,14 +68,14 @@ relative to the minimum height of either one of the bounding boxes.
|
||||||
|
|
||||||
Spaces need to be inserted between characters because the PDF format has no
|
Spaces need to be inserted between characters because the PDF format has no
|
||||||
notion of the space character. A space is inserted if the characters are
|
notion of the space character. A space is inserted if the characters are
|
||||||
further apart that the `word_margin` (W in the figure). The `word_margin` is
|
further apart than the `word_margin` (W in the figure). The `word_margin` is
|
||||||
relative to the maximum width or height of the new character. Having a smaller
|
relative to the maximum width or height of the new character. Having a smaller
|
||||||
`word_margin` creates smaller words. Note that the `word_margin` should at
|
`word_margin` creates smaller words. Note that the `word_margin` should at
|
||||||
least be smaller than the `char_margin` otherwise none of the characters will
|
least be smaller than the `char_margin` otherwise none of the characters will
|
||||||
be separated by a space.
|
be separated by a space.
|
||||||
|
|
||||||
The result of this stage is a list of lines. Each line consists a list of
|
The result of this stage is a list of lines. Each line consists a list of
|
||||||
characters. These characters either original `LTChar` characters that
|
characters. These characters are either original `LTChar` characters that
|
||||||
originate from the PDF file, or inserted `LTAnno` characters that
|
originate from the PDF file, or inserted `LTAnno` characters that
|
||||||
represent spaces between words or newlines at the end of each line.
|
represent spaces between words or newlines at the end of each line.
|
||||||
|
|
||||||
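The interaction between `char_margin` and `word_margin` described in the hunk above can be illustrated with a small, self-contained sketch. This is not pdfminer's implementation: the `Char` class and the `group_line` function are hypothetical names introduced here, the characters are assumed to be pre-sorted on a single baseline, both margins are taken relative to character width only, and `line_overlap` is ignored. The default values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for a character with a horizontal bounding box."""
    text: str
    x0: float  # left edge of the bounding box
    x1: float  # right edge of the bounding box

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def group_line(chars, char_margin=2.0, word_margin=0.1):
    """Group left-to-right sorted chars into lines, inserting spaces.

    Simplifying assumptions: all chars share one baseline, and both
    margins are relative to character width.
    """
    lines = []
    current = ""
    prev = None
    for ch in chars:
        if prev is not None:
            gap = ch.x0 - prev.x1
            if gap >= char_margin * max(prev.width, ch.width):
                # Too far apart to belong to the same line.
                lines.append(current)
                current = ""
            elif gap > word_margin * ch.width:
                # Same line, but far enough apart to be a new word.
                current += " "
        current += ch.text
        prev = ch
    if current:
        lines.append(current)
    return lines


chars = [
    Char("a", 0.0, 10.0),
    Char("b", 10.5, 20.5),    # small gap: same word
    Char("c", 25.0, 35.0),    # medium gap: new word
    Char("d", 100.0, 110.0),  # large gap: new line
]
print(group_line(chars))
print(group_line(chars, word_margin=3.0))
```

In the second call `word_margin` exceeds `char_margin`, so any gap wide enough to count as a space already splits the line and no spaces are inserted. That is the situation the documentation warns about.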
|
@ -107,7 +104,7 @@ of lines.
|
||||||
Grouping textboxes hierarchically
|
Grouping textboxes hierarchically
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
the last step is to group the text boxes in a meaningful way. This step
|
The last step is to group the text boxes in a meaningful way. This step
|
||||||
repeatedly merges the two text boxes that are closest to each other.
|
repeatedly merges the two text boxes that are closest to each other.
|
||||||
|
|
||||||
The closeness of bounding boxes is computed as the area that is between the
|
The closeness of bounding boxes is computed as the area that is between the
|
||||||
|
@ -118,7 +115,6 @@ boxes of the individual lines.
|
||||||
.. raw:: html
|
.. raw:: html
|
||||||
:file: ../_static/layout_analysis_group_boxes.html
|
:file: ../_static/layout_analysis_group_boxes.html
|
||||||
|
|
||||||
|
|
||||||
Working with rotated characters
|
Working with rotated characters
|
||||||
===============================
|
===============================
|
||||||
|
|
||||||
|
|