How to find searchable PDFs
Andrew Henderson
I have a folder with many PDFs. Some are no doubt searchable. Can I search for and identify only those which are searchable?
Adobe gives an error message if a PDF is an image, asking if you want to convert it to searchable text. I do not know if that is generic or specific to Adobe. I suppose a more complete question would have been how do I set aside the file if an image is encountered? I will read up on man pdfinfo to see if I find anything in there to help.
1 Answer
From inside the folder in question, you can use pdfgrep:
pdfgrep --recursive --count .
The lines ending in zero are the PDFs that are not searchable (the dot is a regex that matches any character). Also,
pdfgrep -r -c . | grep -oP "\:\d*$" | sed 's/^\:0$/Not searchable/g;s/^\:[1-9][0-9]*$/Searchable/' | sort | uniq -c
will give you some stats about how many are searchable or not.
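To list only the searchable files, or to set the image-only ones aside as asked in the follow-up, the same count output can be filtered by its trailing number. The following is a minimal sketch, not a tested tool: the function names list_searchable, list_unsearchable and set_aside are made up for illustration, and it assumes every line of pdfgrep's output has the form path:count and that no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helpers; each reads "path:count" lines (as printed by
# `pdfgrep -r -c .`) on standard input.

# Print the paths whose count is 1 or more (PDFs with a text layer).
list_searchable() {
    sed -n 's/:[1-9][0-9]*$//p'
}

# Print the paths whose count is exactly 0 (most likely scanned images).
list_unsearchable() {
    sed -n 's/:0$//p'
}

# Move every path read from standard input into the directory "$1".
# Assumption: files with the same name in different subfolders do not
# occur, otherwise a later file would overwrite an earlier one.
set_aside() {
    dest=$1
    mkdir -p "$dest" || return 1
    while IFS= read -r f; do
        mv -- "$f" "$dest"/
    done
}

# Possible usage (commented out so that sourcing this file has no effect):
#   pdfgrep -r -c . > /tmp/pdf-counts.txt
#   list_searchable   < /tmp/pdf-counts.txt
#   list_unsearchable < /tmp/pdf-counts.txt | set_aside ../not-searchable
```

Writing the counts to a file first, and choosing a destination outside the folder being searched, keeps the recursive search from running into files that have already been moved.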