Add FAQ about special characters (#829)

* Add FAQ for extracting special characters

* Update CHANGELOG.md

* Update faq.rst
Pieter Marsman 2022-11-05 17:22:08 +01:00 committed by GitHub
parent 3688911afe
commit ebf7bcdb98
2 changed files with 28 additions and 0 deletions


@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
- Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
- Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
- Documentation on why special characters can sometimes not be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))
### Fixed


@ -39,3 +39,30 @@ improves pdfminer.
Since 2020, the original pdfminer is `dormant
<https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
which Euske recommends if you need an actively maintained version of pdfminer.

Why are there `(cid:x)` values in the textual output?
=====================================================

One of the most common issues with pdfminer.six is that the textual output
contains raw character IDs `(cid:x)`. This is often confusing because the
text displays correctly in a PDF viewer and other text from the same PDF
is extracted properly.

The underlying problem is that a PDF has two different representations
of each character. Each character is mapped to a glyph that determines
how the character is shown in a PDF viewer. Each character is also
mapped to its Unicode value, which is used when copy-pasting the character.
Some PDFs have incomplete Unicode mappings, which makes it impossible
to convert the affected characters to Unicode. In these cases pdfminer.six
falls back to showing the raw character ID `(cid:x)`.
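
The fallback described above can be pictured with a toy model. This is only an illustrative sketch, not pdfminer.six code: the tables `GLYPHS` and `TO_UNICODE`, their contents, and the function `decode` are hypothetical names made up for this example.

```python
# Toy model of the two per-character mappings. All names and table
# contents are hypothetical; this is not how pdfminer.six is implemented.

# Character ID -> glyph (what a PDF viewer uses to draw the character).
GLYPHS = {1: "H", 2: "i", 3: "fi_ligature"}

# Character ID -> Unicode text (what copy-paste and text extraction use).
# Character ID 3 has no entry, as in a PDF with an incomplete mapping.
TO_UNICODE = {1: "H", 2: "i"}


def decode(cids):
    """Convert character IDs to text, falling back to a (cid:x) placeholder."""
    return "".join(TO_UNICODE.get(cid, f"(cid:{cid})") for cid in cids)


print(decode([1, 2, 3]))  # prints Hi(cid:3)
```

Every character ID in this model has a glyph, so a viewer could draw all three characters, but only two of them can be turned into text.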

A quick test to see if pdfminer.six should be able to do better is to
copy-paste the text from a PDF viewer to a text editor. If the result
is proper text, pdfminer.six should also be able to extract proper text.
If the result is gibberish, pdfminer.six will not be able to convert
the characters to Unicode either.
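
Before running the manual test, it can help to know which character IDs are affected. The sketch below scans already extracted text for `(cid:x)` placeholders; the helper name `unmapped_cids` and the sample string are assumptions made for this example, and the text itself could come from `pdfminer.high_level.extract_text`.

```python
import re

# Matches the placeholder written for a character without a Unicode
# mapping, for example "(cid:12)".
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def unmapped_cids(text):
    """Return the sorted, distinct character IDs found as (cid:x) placeholders."""
    return sorted({int(m.group(1)) for m in CID_PATTERN.finditer(text)})


# Hypothetical sample of extracted text containing placeholders.
sample = "Total: 42 (cid:3)(cid:17) items (cid:3)"
print(unmapped_cids(sample))  # prints [3, 17]
```

An empty result means no placeholders were found. It does not guarantee that the remaining text is correct.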

References:

#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_