Fix typos in readthedocs documentation. (#579)

* Fix typos and possible mistakes.

* Revert two edits based on discussion in #579

  Revert the two changes based on our discussion. I read the documentation and had a glimpse at the default code. Perhaps the confusion was caused by the figure that shows the Char Margin (M) and the Word Margin (W). Clearly, M is smaller than W in absolute terms, but as mentioned, they are both relative numbers. Maybe it is useful to point that out in the figure, but I am not sure how best to do it. Another option is to use names like `min_char_margin_threshold` or similar, in the hope that they are easier to understand. Just some thoughts!

* Triggering travis again

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

pull/593/head^2
parent 543976f195
commit d821fed340

@@ -4,11 +4,11 @@ Converting a PDF file to text
 *****************************
 
 Most PDF files look like they contain well structured text. But the reality is
-that a PDF file does not contain anything that resembles a paragraphs,
+that a PDF file does not contain anything that resembles paragraphs,
 sentences or even words. When it comes to text, a PDF file is only aware of
 the characters and their placement.
 
-This makes extracting meaningful pieces of text from PDF's files difficult.
+This makes extracting meaningful pieces of text from PDF files difficult.
 The characters that compose a paragraph are no different from those that
 compose the table, the page footer or the description of a figure. Unlike
 other documents formats, like a `.txt` file or a word document, the PDF format
@@ -20,7 +20,6 @@ interactive elements and higher-level application data. A PDF file contains
 the objects making up a PDF document along with associated structural
 information, all represented as a single self-contained sequence of bytes. [1]_
 
-
 .. _topic_pdf_to_text_layout:
 
 Layout analysis algorithm
@@ -41,7 +40,6 @@ of layout objects on a PDF page.
 
 The output of the layout analysis is a hierarchy of layout objects.
 
-
 The output of the layout analysis heavily depends on a couple of parameters.
 All these parameters are part of the :ref:`api_laparams` class.
 
@@ -56,10 +54,9 @@ bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
 (M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
-*distance* between the bounding boxes of two characters should be smaller that
+*distance* between the bounding boxes of two characters should be smaller than
 the `char_margin` and the vertical *overlap* between the bounding boxes should
-be smaller the the `line_overlap`.
-
+be smaller than the `line_overlap`.
 
 .. raw:: html
     :file: ../_static/layout_analysis.html
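
The line-grouping rule described in the hunk above can be illustrated with a small, self-contained sketch. It does not use pdfminer itself: `Box`, `same_line` and `group_into_lines` are hypothetical names, and the default values are only illustrative. Two points are assumptions of the sketch rather than statements from the documentation: `char_margin` is scaled by the wider of the two boxes, and two characters count as vertically aligned when their overlap is at least `line_overlap` times the smaller height.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a character with a bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def same_line(a: Box, b: Box, char_margin: float, line_overlap: float) -> bool:
    # Horizontal gap between the two boxes (0 if they touch or overlap).
    h_distance = max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))
    # Vertical extent shared by both boxes (0 if they do not overlap).
    v_overlap = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    # Assumption: char_margin is relative to the wider box.
    close_enough = h_distance < char_margin * max(a.width, b.width)
    # Assumption: the overlap must reach line_overlap times the smaller height.
    aligned = v_overlap >= line_overlap * min(a.height, b.height)
    return close_enough and aligned


def group_into_lines(chars, char_margin=2.0, line_overlap=0.5):
    """Greedily append each character to the current line or start a new one."""
    lines = []
    for ch in chars:
        if lines and same_line(lines[-1][-1], ch, char_margin, line_overlap):
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return lines


chars = [
    Box("a", 0, 0, 5, 10),
    Box("b", 6, 0, 11, 10),     # close to "a": same line
    Box("c", 40, 0, 45, 10),    # far to the right: new line
    Box("d", 0, -20, 5, -10),   # lower on the page: new line
]
for line in group_into_lines(chars):
    print("".join(ch.text for ch in line))
```

Because both thresholds are relative to the character size, the same parameter values behave similarly for small and large fonts.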
@@ -71,14 +68,14 @@ relative to the minimum height of either one of the bounding boxes.
 
 Spaces need to be inserted between characters because the PDF format has no
 notion of the space character. A space is inserted if the characters are
-further apart that the `word_margin` (W in the figure). The `word_margin` is
+further apart than the `word_margin` (W in the figure). The `word_margin` is
 relative to the maximum width or height of the new character. Having a smaller
 `word_margin` creates smaller words. Note that the `word_margin` should at
 least be smaller than the `char_margin` otherwise none of the characters will
 be separated by a space.
 
 The result of this stage is a list of lines. Each line consists a list of
-characters. These characters either original `LTChar` characters that
+characters. These characters are either original `LTChar` characters that
 originate from the PDF file, or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
 
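
The space-insertion rule from the hunk above can be sketched in a few lines of plain Python. This is not pdfminer code: `join_with_spaces` and the tuple layout are hypothetical, and the inserted `" "` only plays the role that an `LTAnno` has in the real output. The threshold follows the text: the gap is compared with `word_margin` times the larger of the new character's width and height.

```python
def join_with_spaces(chars, word_margin=0.1):
    """Join characters on one line, inserting spaces at large gaps.

    chars: list of (text, x0, x1, height) tuples, sorted left to right.
    """
    out = []
    prev_x1 = None
    for text, x0, x1, height in chars:
        # The margin is relative to the size of the character being added.
        margin = word_margin * max(x1 - x0, height)
        if prev_x1 is not None and x0 - prev_x1 > margin:
            out.append(" ")  # stand-in for an inserted space character
        out.append(text)
        prev_x1 = x1
    return "".join(out)


line = [("H", 0, 6, 10), ("i", 6.5, 9, 10), ("y", 13, 19, 10), ("o", 19.2, 25, 10)]
print(join_with_spaces(line, word_margin=0.1))   # Hi yo
print(join_with_spaces(line, word_margin=0.5))   # Hiyo
print(join_with_spaces(line, word_margin=0.01))  # H i y o
```

The three calls show the effect described in the text: a smaller `word_margin` splits the line into smaller words, while a larger one leaves fewer spaces.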
@@ -107,7 +104,7 @@ of lines.
 Grouping textboxes hierarchically
 ---------------------------------
 
-the last step is to group the text boxes in a meaningful way. This step
+The last step is to group the text boxes in a meaningful way. This step
 repeatedly merges the two text boxes that are closest to each other.
 
 The closeness of bounding boxes is computed as the area that is between the
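
The "repeatedly merge the two closest text boxes" step from the hunk above can be sketched as a greedy loop. All names here (`area`, `enclosing`, `distance`, `group_boxes`) are hypothetical, and the distance function is an assumption: it takes the area of the box enclosing both inputs minus the areas of the inputs themselves, which is one way to measure the area between two non-overlapping boxes. The sketch is quadratic per merge and meant only to show the idea.

```python
from itertools import combinations


def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def enclosing(a, b):
    """Smallest box (x0, y0, x1, y1) that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def distance(a, b):
    # Assumption: closeness is the enclosing area not covered by either box.
    return area(enclosing(a, b)) - area(a) - area(b)


def group_boxes(boxes):
    """Merge the two closest boxes until one tree remains.

    Returns a nested tuple; leaves are the original (x0, y0, x1, y1) boxes.
    """
    if not boxes:
        raise ValueError("need at least one box")
    # Each node pairs a bounding box with the tree it represents.
    nodes = [(b, b) for b in boxes]
    while len(nodes) > 1:
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda p: distance(nodes[p[0]][0], nodes[p[1]][0]),
        )
        (box_i, tree_i), (box_j, tree_j) = nodes[i], nodes[j]
        merged = (enclosing(box_i, box_j), (tree_i, tree_j))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][1]


a = (0, 0, 10, 10)
b = (0, 12, 10, 22)     # directly above a
c = (100, 0, 110, 10)   # far to the right
print(group_boxes([a, b, c]))
```

In this example `a` and `b` are merged first because only a thin strip separates them, and `c` joins the hierarchy in the final step.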
@@ -118,7 +115,6 @@ boxes of the individual lines.
 .. raw:: html
     :file: ../_static/layout_analysis_group_boxes.html
 
-
 Working with rotated characters
 ===============================
 