.. _faq:

Frequently asked questions
**************************

Why is it called pdfminer.six?
==============================

Pdfminer.six is a fork of the `original pdfminer created by Euske
<https://github.com/euske>`_. Almost all of the code and architecture were
in fact created by Euske. However, until 2020 the original pdfminer only
supported Python 2. The original goal of pdfminer.six was to add support for
Python 3. This was done with the `six` package, which helps to write code
that is compatible with both Python 2 and Python 3. Hence, pdfminer.six.

In 2020, pdfminer.six dropped support for Python 2 because it had reached
`end-of-life <https://www.python.org/doc/sunset-python-2/>`_. While the .six
part is no longer applicable, we kept the name to prevent breaking changes for
existing users.

The current punchline "We fathom PDF" is a `whimsical reference
<https://github.com/pdfminer/pdfminer.six/issues/197#issuecomment-655091942>`_
to the six. To fathom means to understand something deeply, and a fathom is
also a unit of length equal to six feet.

How does pdfminer.six compare to other forks of pdfminer?
==========================================================

Pdfminer.six is now an independent and community-maintained package for
extracting text from PDFs with Python. We actively fix bugs (including for
PDFs that don't strictly follow the PDF Reference), add new features and
improve the usability of pdfminer.six. This active community is what sets
pdfminer.six apart from the other forks of the original pdfminer. PDF as a
format is very diverse and there are countless deviations from the official
format. The only way to support all the PDFs out there is to have a community
that actively uses and improves pdfminer.

Since 2020, the original pdfminer has been `dormant
<https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
that Euske recommends if you need an actively maintained version of pdfminer.

Why are there `(cid:x)` values in the textual output?
=====================================================

One of the most common issues with pdfminer.six is that the textual output
contains raw character IDs `(cid:x)`. This is often confusing because the
text is shown fine in a PDF viewer and other text from the same PDF is
extracted properly.

The underlying problem is that a PDF has two different representations
of each character. Each character is mapped to a glyph that determines
how the character is shown in a PDF viewer. Each character is also
mapped to its Unicode value, which is used when copy-pasting the character.
Some PDFs have incomplete Unicode mappings, and therefore it is impossible
to convert the character to Unicode. In these cases pdfminer.six defaults
to showing the raw character ID `(cid:x)`.

A quick test to see if pdfminer.six should be able to do better is to
copy-paste the text from a PDF viewer to a text editor. If the result
is proper text, pdfminer.six should also be able to extract proper text.
If the result is gibberish, pdfminer.six will not be able to convert
the characters to Unicode either.
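
To see how much of an extraction is affected, you can look for the
placeholders in the extracted text. The following is only a minimal sketch:
the helper name ``find_cids`` and the sample string are made up for
illustration and are not part of pdfminer.six. It assumes the text was
extracted beforehand, for example with ``pdfminer.high_level.extract_text``.

.. code-block:: python

    import re

    # Matches the placeholder as it appears in the extracted text, e.g. "(cid:72)".
    CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


    def find_cids(text):
        """Return the character IDs of all (cid:x) placeholders in text."""
        return [int(cid) for cid in CID_PATTERN.findall(text)]


    # Hypothetical sample; in practice the text would come from an extraction
    # such as pdfminer.high_level.extract_text("example.pdf").
    sample = "Total: (cid:72)(cid:101)llo"
    print(find_cids(sample))  # [72, 101]

An empty list means no placeholders were found. A long list suggests that the
fonts in the PDF lack the Unicode mappings described above.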

References: 

#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_