Add FAQ about special characters (#829)

* Add FAQ for extracting special characters
* Update CHANGELOG.md
* Update faq.rst
2022-11-05 17:22:08 +01:00 · ebf7bcdb98
parent 3688911afe
commit ebf7bcdb98
2 changed files with 28 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
 - Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
 - Documentation on why special characters can sometimes not be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))
 ### Fixed
--- a/docs/source/faq.rst
+++ b/docs/source/faq.rst
@@ -39,3 +39,30 @@ improves pdfminer.
 Since 2020, the original pdfminer is `dormant
 <https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
 which Euske recommends if you need an actively maintained version of pdfminer.
 Why are there `(cid:x)` values in the textual output?
 =====================================================
 One of the most common issues with pdfminer.six is that the textual output
 contains raw character IDs of the form `(cid:x)`. This is often confusing
 because a PDF viewer displays the text correctly and other text from the same
 PDF is extracted properly.
 The underlying problem is that a PDF has two different representations
 of each character. Each character is mapped to a glyph that determines
 how the character is shown in a PDF viewer. Each character is also
 mapped to a Unicode value that is used when copy-pasting the character.
 Some PDFs have incomplete Unicode mappings, which makes it impossible
 to convert the affected characters to Unicode. In these cases pdfminer.six
 falls back to showing the raw character ID `(cid:x)`.
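The fallback described above can be illustrated with a toy model. This is only a sketch, not pdfminer.six's internal code: the function `decode_cids` and the `to_unicode` dictionary are hypothetical names invented for this example, and real PDF fonts store this mapping in more complex structures.

```python
from typing import Dict, Iterable


def decode_cids(cids: Iterable[int], to_unicode: Dict[int, str]) -> str:
    """Toy decoder (hypothetical): map character IDs to Unicode where possible.

    IDs without an entry in the mapping are rendered as a "(cid:x)" placeholder,
    mimicking the output format described in this FAQ.
    """
    parts = []
    for cid in cids:
        if cid in to_unicode:
            parts.append(to_unicode[cid])
        else:
            # No Unicode value is known for this character ID.
            parts.append(f"(cid:{cid})")
    return "".join(parts)


# An incomplete mapping: character ID 3 has a glyph in the PDF but no Unicode value.
incomplete_mapping = {1: "H", 2: "i"}
print(decode_cids([1, 2, 3], incomplete_mapping))  # prints "Hi(cid:3)"
```

A viewer could still draw the glyph for ID 3, which is why the page looks fine on screen even though the text cannot be extracted.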
 A quick test to see if pdfminer.six should be able to do better is to
 copy-paste the text from a PDF viewer into a text editor. If the result
 is proper text, pdfminer.six should also be able to extract proper text.
 If the result is gibberish, pdfminer.six will not be able to convert
 the characters to Unicode either.
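To see how much of an extraction is affected, the placeholders can be searched for in the extracted text. The sketch below assumes the text is already available as a string, for example from `pdfminer.high_level.extract_text`. The helper name `find_unmapped_cids` is hypothetical and not part of pdfminer.six.

```python
import re
from typing import List

# Matches placeholders of the form "(cid:42)" described in this FAQ.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_unmapped_cids(text: str) -> List[int]:
    """Return the numeric IDs of all "(cid:x)" placeholders, in order of appearance."""
    return [int(match.group(1)) for match in CID_PATTERN.finditer(text)]


# Example with a hand-written string standing in for extracted text.
sample = "Total: (cid:3)(cid:17) EUR"
print(find_unmapped_cids(sample))  # prints [3, 17]
```

An empty result means no placeholders were found in the text. A long list suggests the PDF's Unicode mappings are incomplete for the fonts used.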
 References: 
 #. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
 #. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_