How to remove OCR from a PDF?

I have been searching Google for some time but cannot find an answer to my question.

I have unwanted OCR layers in a document that I recently scanned with Adobe Acrobat. The OCR was not done properly, and I want to redact some information, but the OCR layer is causing information I want to keep to be erased as well. I converted the files to TIFF, but noticed a (very) significant quality loss. I have heard that printing to another PDF either keeps the text or reduces the image quality.

11 Answers

In Acrobat Pro DC, the appropriate command is "Remove Hidden Information," which is available through both the "Protect" and "Redact" tools.

Running the command only searches for the hidden information; it does not change the document. You must then tell Acrobat which information to remove. In this case, select "Hidden Text" in the Results pane, then click the Remove button and save the changed document.

If, as you say, the documents are scanned and not printed to PDF from Word, for example, you can easily remove the OCR with Acrobat:

Select Document > Examine Document. From there you can remove the hidden text (OCR).

After a lot of experimenting, I found that printing to Adobe PDF from Adobe Acrobat prints the document without the OCR and without a visible loss of quality (some resolution is lost, but it is not noticeable at first glance).

However, many sites claim that this does not work. I also tried other PDF printers, such as Foxit Reader and OneNote, but the quality was reduced. Exporting to JPEG gave the same result.

Please keep in mind that your mileage may vary.

Note: I am leaving this thread marked as unanswered in hope of finding a better answer than mine.

In Acrobat Pro: use "Remove Hidden Information" (under "Protection"). Select all, execute, and the OCR is gone.

In Acrobat X, under Protection, there is a Sanitize Document button that removes EVERYTHING except what can be seen (including the OCR'd text layer), converting the document to a flattened bitmap.

I solved it by exporting to JPEG, then using "Combine Files" in Acrobat on the JPEGs. This was with a document that was originally a Word document and had been converted to PDF. The OCR is gone.

Try the "MS Print to PDF" driver. It ships with all recent Windows versions. Make sure to check "Print As Image" under advanced settings to remove OCR.

The quality loss in printing to PDF is negligible. It does however keep the OCR by default unless you print as image.

Easy way to remove OCR layer from PDF: open PDF in Firefox and "print" into another PDF.

Note that a "nice" PDF (e.g. one created by MS Word) will become much larger (in my case, from 0.5 to 2 MB), and quality is reduced somewhat. Make sure you set the correct paper size when "printing".
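To get a rough feel for why a page that has been flattened to an image grows in size, and why the resolution you print at matters, here is a small sketch that computes the pixel dimensions of a rasterized page. The A4 paper size, the 300 DPI figure, and the 3 bytes per pixel are my own assumptions for illustration; none of them come from the thread, and real PDFs compress the image, so actual file sizes will be much smaller than the uncompressed number.

```python
MM_PER_INCH = 25.4


def raster_size(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at `dpi`."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def uncompressed_bytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> int:
    """Size of the raw bitmap before any compression (3 bytes per pixel = 24-bit RGB)."""
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # Assumed example: an A4 page (210 mm x 297 mm) at 300 DPI.
    w, h = raster_size(210, 297, 300)
    print(w, h, uncompressed_bytes(w, h))
```

Halving the DPI quarters the pixel count, which is why a low-resolution "print as image" looks visibly worse than the original scan.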

If you want to redo the OCR instead of removing it completely, and you don't mind the command line, use ocrmypdf:

ocrmypdf --redo-ocr --output-type=pdf input.pdf output.pdf

On Windows 10, the easiest way to set up and use ocrmypdf is via WSL.
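If you have a whole folder of scans, the same ocrmypdf command can be wrapped in a short script. This is only a sketch: it uses exactly the flags quoted above, and the function names and the "_reocr" output suffix are names I made up for illustration. It assumes ocrmypdf is installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_redo_ocr_command(src: Path, dst: Path) -> list[str]:
    """Return the argv list for the ocrmypdf call quoted in the answer."""
    return ["ocrmypdf", "--redo-ocr", "--output-type=pdf", str(src), str(dst)]


def redo_ocr_in_folder(folder: Path, suffix: str = "_reocr") -> list[Path]:
    """Re-run OCR on every PDF in `folder`, writing new files next to the originals.

    The `suffix` value is an arbitrary, hypothetical naming choice.
    """
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    written = []
    for src in sorted(folder.glob("*.pdf")):
        if src.stem.endswith(suffix):
            continue  # skip files produced by an earlier run
        dst = src.with_name(src.stem + suffix + ".pdf")
        # check=True raises CalledProcessError if ocrmypdf exits with an error
        subprocess.run(build_redo_ocr_command(src, dst), check=True)
        written.append(dst)
    return written
```

Calling redo_ocr_in_folder(Path("scans")) would leave the originals untouched and write a second copy of each file, so you can compare the results before deleting anything.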

Use the PitStop Pro Acrobat plug-in. In the "Actions List", create a new action. In the upper right, look for "Select text fragment" and "Remove selected object", then run it with the scope set to the whole document.

I built a free tool to do this: PDF Redactor. If you upload the file and just click redact, it will flatten your PDF and remove the OCR. If you want, you can also draw redaction marks on the document.

For Acrobat X and above: Tools > Protection > Remove Hidden Information.
For Acrobat 9 and below: Document > Examine Document.

Reference:

Velvet Star Monitor
