diff --git a/CHANGELOG.md b/CHANGELOG.md
index 51ecc2c..1a1ead6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 - Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
 - Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
+- Documentation on why special characters can sometimes not be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))

 ### Fixed

diff --git a/docs/source/faq.rst b/docs/source/faq.rst
index 3461492..b209c80 100644
--- a/docs/source/faq.rst
+++ b/docs/source/faq.rst
@@ -39,3 +39,30 @@ improves pdfminer.
 Since 2020, the original pdfminer is `dormant
 `_, and pdfminer.six is the fork
 which Euske recommends if you need an actively maintained version of pdfminer.
+
+Why are there `(cid:x)` values in the textual output?
+=====================================================
+
+One of the most common issues with pdfminer.six is that the textual output
+contains raw character IDs `(cid:x)`. This is often experienced as confusing
+because the text is shown fine in a PDF viewer and other text from the same
+PDF is extracted properly.
+
+The underlying problem is that a PDF has two different representations
+of each character. Each character is mapped to a glyph that determines
+how the character is shown in a PDF viewer. And each character is also
+mapped to its Unicode value that is used when copy-pasting the character.
+Some PDFs have incomplete Unicode mappings and therefore it is impossible
+to convert the character to Unicode. In these cases pdfminer.six defaults
+to showing the raw character ID `(cid:x)`.
+
+A quick test to see if pdfminer.six should be able to do better is to
+copy-paste the text from a PDF viewer to a text editor. If the result
+is proper text, pdfminer.six should also be able to extract proper text.
+If the result is gibberish, pdfminer.six will also not be able to convert
+the characters to Unicode.
+
+References:
+
+#. `Chapter 5: Text, PDF Reference 1.7 `_
+#. `Text: PDF, Wikipedia `_
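
Note on the FAQ entry in this diff: the `(cid:x)` placeholders it describes can also be detected programmatically once text has been extracted. The following is a minimal sketch, not part of this change and not a pdfminer.six API. The helper name `find_cid_tokens` is hypothetical, and the sketch assumes that unmapped characters appear in the output exactly as `(cid:<number>)`, as the FAQ text describes.

```python
import re
from typing import List

# Assumed placeholder format for characters without a Unicode mapping,
# e.g. "(cid:12)". The pattern captures the numeric character ID.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cid_tokens(text: str) -> List[int]:
    """Return the numeric IDs of all (cid:x) placeholders found in text.

    Hypothetical helper for illustration; `text` is expected to be
    already-extracted text, for example the string returned by
    pdfminer.high_level.extract_text.
    """
    return [int(cid) for cid in CID_PATTERN.findall(text)]


if __name__ == "__main__":
    sample = "Hello (cid:12)(cid:7) world"
    print(find_cid_tokens(sample))
```

An empty result means no placeholders were found. A non-empty result suggests the PDF has incomplete Unicode mappings for some characters, which is the case where the copy-paste test from the FAQ entry is worth running.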