How to add OCR to PDF with OCRmyPDF

PDF files created by scanning pages, or by printing or exporting from other file formats, may not have a text layer. Without a text layer, every page is just an image, so you cannot search or highlight text in it. The OCRmyPDF tool can add an OCR text layer to such a PDF.

  • Installing is easy:
$ sudo apt install ocrmypdf
  • Usage is straightforward:
$ ocrmypdf in.pdf out.pdf
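
Before running OCR, it can help to check whether a PDF already has a text layer. The sketch below is one possible way to do that, assuming pdftotext from the poppler-utils package is installed. The helper names has_text_layer and ocr_if_needed are made up for illustration.

```shell
# Succeed if the PDF yields any non-whitespace text.
# Assumption: pdftotext (poppler-utils) is installed.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Hypothetical helper: run OCR only when the input has no text layer yet.
ocr_if_needed() {
    if has_text_layer "$1"; then
        echo "skip: $1 already has a text layer"
    else
        ocrmypdf "$1" "$2"
    fi
}
```

Calling ocr_if_needed in.pdf out.pdf would then run the same ocrmypdf command only when in.pdf has no extractable text.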

I noticed that adding an OCR text layer increased the PDF file size by 1.5x! The tool also mentions at the end that this file size increase is surprising.
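
To measure the size change on your own files, a small helper like the one below may be enough. It is only a sketch: size_ratio is a made-up name, and it simply divides the output size by the input size in bytes.

```shell
# Print output size divided by input size, to two decimal places.
# Hypothetical helper; expects a non-empty input file.
size_ratio() {
    in_bytes=$(wc -c < "$1")
    out_bytes=$(wc -c < "$2")
    awk -v a="$in_bytes" -v b="$out_bytes" 'BEGIN { printf "%.2f\n", b / a }'
}
```

For example, size_ratio in.pdf out.pdf prints the growth factor for the two files used above. Newer OCRmyPDF releases also have an --optimize option that may reduce the output size, but whether it is available depends on the installed version.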

How to add OCR to PDF using PDFOCR

PDFOCR is a Ruby script that can be used to add OCR text to a scanned PDF file.

  • Install the OCR engines that it depends on:
$ sudo apt install tesseract-ocr tesseract-ocr-eng exactimage
  • Get the PDFOCR script:
$ git clone https://github.com/gkovacs/pdfocr
  • Use it to add OCR to a scanned PDF:
$ ruby pdfocr/pdfocr.rb -i foo.pdf -o out.pdf
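
If you have a whole folder of scans, the same command can be wrapped in a loop. This is a rough sketch under a few assumptions: the script was cloned to ./pdfocr as above, ruby is on the PATH, and the "-ocr" output suffix and the function names are arbitrary choices of mine, not part of PDFOCR.

```shell
# Path to the cloned script; assumes the git clone above created ./pdfocr.
PDFOCR=${PDFOCR:-./pdfocr/pdfocr.rb}

# Derive an output name: scan.pdf -> scan-ocr.pdf (arbitrary naming choice).
ocr_output_name() {
    printf '%s\n' "${1%.pdf}-ocr.pdf"
}

# Run PDFOCR on every PDF in a directory, skipping outputs of earlier runs.
ocr_all() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue              # the glob matched nothing
        case $f in *-ocr.pdf) continue ;; esac
        ruby "$PDFOCR" -i "$f" -o "$(ocr_output_name "$f")"
    done
}
```

Running ocr_all ~/scans would then write a scan-ocr.pdf next to each scan.pdf in that folder.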

Tried with: Ubuntu 14.04

OCR fonts

Vim using OCR-A font

Do you like that font used in the computer console of old science fiction movies? That font is called OCR and you can actually use it if that is your fancy!

There are two variants of these fonts: OCR-A and OCR-B. These fonts can be obtained here.

To use them, just unzip the files and add them to Ubuntu as described here. You can start using them in any console or application you want.
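
One common way to add fonts for a single user on Ubuntu is to copy them into ~/.fonts and refresh the fontconfig cache. The sketch below assumes that approach, which may differ from the linked instructions, and assumes the unzipped files are .ttf files. The function name install_fonts is made up.

```shell
# Copy .ttf files from a directory into a per-user font directory.
# Assumptions: fonts are .ttf files; ~/.fonts is the default target.
install_fonts() {
    src_dir=$1
    font_dir=${2:-$HOME/.fonts}
    mkdir -p "$font_dir"
    cp "$src_dir"/*.ttf "$font_dir"/
    # Refresh the font cache if fontconfig is available.
    if command -v fc-cache >/dev/null 2>&1; then
        fc-cache -f "$font_dir" || echo "warning: fc-cache failed" >&2
    fi
}
```

After install_fonts ~/Downloads/ocr-fonts, running fc-list | grep -i ocr should list the new fonts if the installation worked.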

Tried with: Ubuntu 12.04 LTS

How to perform OCR on a DJVU document

OCR on DJVU using CuneiDjVu

A DJVU document typically contains both a layer of scanned image and a layer of the text in that image. Sometimes, a DJVU document is produced which does not have the text layer. This makes it hard to search and find text in the document.

Recognising the text in the DJVU document using OCR and adding that as a text layer to that document is easy:

  1. Download CuneiDjVu and unzip its contents.
  2. Run the CuneiDjVu program.
  3. Choose the DJVU document as input, choose the output folder and the OCR language.
  4. Press Process and the resulting file will be CuneiDjVu Result.djvu on your Desktop.

Tried with: CuneiDjVu 1.4 and Windows 7 Professional x64

Adobe Acrobat: OCR

You have scanned in a paper or a section of a book and converted it to a PDF. What next? The next best thing to do would be to run OCR on the PDF.

I use Adobe Acrobat for this. At this point, the converted PDF document is only a container for the scanned bitmap images. By running OCR on it, Acrobat can recognize the text in each image and embed it along with the image. This lets you select and copy text in the PDF, and also search for text in the document.

To do OCR:

Choose Document → OCR Text Recognition → Recognize text using OCR …

In the Recognize Text dialog that pops up, the default options should be fine. Click OK.

The OCR then runs on each page serially and may take some time on long documents. You may also notice that the scanned images in the document get straightened a bit and may also get downsampled. After the OCR is complete, save the PDF.