Create PDF files with a searchable text layer
Almost everyone has at some point received a PDF file whose text cannot be selected or copied. This is usually because the scanned text has been embedded in the PDF file as an image. The multifunction printer I use at work does this, for example. Fortunately, the problem can be solved relatively easily and reliably.
The tool OCRmyPDF inserts an additional text layer above the image in the PDF file, whose content can be selected, copied and searched. [Tesseract](https://github.com/tesseract-ocr/tesseract) is used for text recognition.
In the best case, you simply run `ocrmypdf input.pdf output.pdf`. Here `input.pdf` is the original file and `output.pdf` is the file that will be saved with the additional text layer. Further functions and optimisation options can be found in the documentation. Depending on the language of the original file's content, you may have to install a language package such as tesseract-data-eng (the English language package) beforehand. If it is missing, OCRmyPDF displays a corresponding message and terminates.
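If you want to run the same call from a script, for example to process several scans in a row, a minimal sketch could look like the following. It assumes that `ocrmypdf` is installed and on the `PATH`. The function names `build_ocr_command` and `add_text_layer` are made up for this example and are not part of OCRmyPDF; the only option used is `-l`, which passes the recognition language on to Tesseract.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf, language=None):
    """Build the ocrmypdf command line as a list of arguments.

    Hypothetical helper for this example, not part of OCRmyPDF.
    """
    cmd = ["ocrmypdf"]
    if language:
        # -l selects the Tesseract language, e.g. "eng" or "deu".
        # The matching language package must be installed.
        cmd += ["-l", language]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def add_text_layer(input_pdf, output_pdf, language=None):
    """Run ocrmypdf and raise an error if it is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf was not found on the PATH")
    # check=True turns a non-zero exit code (for example a missing
    # language package) into a CalledProcessError.
    subprocess.run(build_ocr_command(input_pdf, output_pdf, language), check=True)
```

Calling `add_text_layer("input.pdf", "output.pdf")` then corresponds to the command shown above, and `add_text_layer("input.pdf", "output.pdf", language="eng")` additionally requests English explicitly.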
I tested OCRmyPDF with a few simple PDF files containing easily readable text. With these, the additional text layer was placed very precisely over the image, so selecting and copying the text was no problem. Searching the file also worked. Of course, the new file needs more storage space, but I think the increase is within limits. For example, the original PDF file with a DIN A4 page containing the first lines of “The Raven” by Edgar Allan Poe is 49.4 KB in size. The PDF file with the text layer added has a size of 49.8 KB.
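To put the size difference into perspective: 0.4 KB on top of 49.4 KB is an increase of roughly 0.8 %. If you want to check this for your own files, a small sketch like the following would do. The function names are invented for this example, and the 49,400 and 49,800 bytes in the comment are only the rounded sizes from above, not exact measurements.

```python
from pathlib import Path


def overhead_percent(before_bytes, after_bytes):
    """Return the size increase of the new file in percent of the original."""
    if before_bytes <= 0:
        raise ValueError("the original size must be greater than zero")
    return 100.0 * (after_bytes - before_bytes) / before_bytes


def file_overhead_percent(original_pdf, ocr_pdf):
    """Compare two files on disk using their sizes in bytes."""
    before = Path(original_pdf).stat().st_size
    after = Path(ocr_pdf).stat().st_size
    return overhead_percent(before, after)


# With the rounded sizes from the text (49.4 KB and 49.8 KB):
# round(overhead_percent(49_400, 49_800), 1) gives 0.8
```

For real files, `file_overhead_percent("input.pdf", "output.pdf")` returns the same kind of percentage. How large it turns out will depend on the amount of text per page and on the optimisation options used.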