Add FAQ about special characters (#829)

* Add FAQ for extracting special characters
* Update CHANGELOG.md
* Update faq.rst
2022-11-05 17:22:08 +01:00 · ebf7bcdb98
parent 3688911afe
commit ebf7bcdb98
2 changed files with 28 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 - Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
 - Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
+- Documentation on why special characters can sometimes not be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))

 ### Fixed

--- a/docs/source/faq.rst
+++ b/docs/source/faq.rst
@@ -39,3 +39,30 @@ improves pdfminer.
 Since 2020, the original pdfminer is `dormant
 <https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
 which Euske recommends if you need an actively maintained version of pdfminer.
+
+Why are there `(cid:x)` values in the textual output?
+=====================================================
+
+One of the most common issues with pdfminer.six is that the textual output
+contains raw character IDs of the form `(cid:x)`. This is often confusing
+because the text displays correctly in a PDF viewer and other text from the
+same PDF is extracted properly.
+
+The underlying problem is that a PDF has two different representations
+of each character. Each character is mapped to a glyph that determines
+how the character is shown in a PDF viewer. Each character is also
+mapped to its Unicode value, which is used when copy-pasting the character.
+Some PDFs have incomplete Unicode mappings, and for the affected characters
+it is impossible to convert the character to Unicode. In these cases
+pdfminer.six defaults to showing the raw character ID `(cid:x)`.
+
+A quick test to see if pdfminer.six should be able to do better is to
+copy-paste the text from a PDF viewer to a text editor. If the result
+is proper text, pdfminer.six should also be able to extract proper text.
+If the result is gibberish, pdfminer.six will not be able to convert
+the characters to Unicode either.
+
+References:
+
+#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
+#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_
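
The FAQ entry above describes the `(cid:x)` placeholder that shows up when a character has no Unicode mapping. As a rough illustration, not part of this commit, the following sketch counts such placeholders in already-extracted text so you can judge how badly a document is affected. The helper names `unresolved_cids` and `summarize` and the sample string are made up for this example, and the pattern assumes the placeholder always has the form `(cid:<decimal number>)`.

```python
import re
from typing import Dict, List

# Matches the placeholder described in the FAQ entry, e.g. "(cid:123)".
# Assumption: the id is always written as a plain decimal number.
CID_MARKER = re.compile(r"\(cid:(\d+)\)")


def unresolved_cids(text: str) -> List[int]:
    """Return the ids of all (cid:x) placeholders in text, in order of appearance."""
    return [int(match.group(1)) for match in CID_MARKER.finditer(text)]


def summarize(text: str) -> Dict[str, object]:
    """Report how many placeholders occur and which distinct ids they carry."""
    cids = unresolved_cids(text)
    return {"count": len(cids), "distinct": sorted(set(cids))}


# Hypothetical extracted text with three unresolved characters.
sample = "Total: (cid:49)(cid:50) EUR, paid (cid:49) May"
print(summarize(sample))  # {'count': 3, 'distinct': [49, 50]}
```

To run this against a real document, pass in the string returned by `pdfminer.high_level.extract_text("example.pdf")`, where `example.pdf` is a placeholder path. A small set of distinct ids that repeat often usually points at a single font with an incomplete Unicode mapping.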