Last week while looking for some other function in my version of Adobe Acrobat, I stumbled across "OCR" in one of the pull-down menus. OCR stands for Optical Character Recognition, a technology that recognizes text characters in digital images. OCR has been around in some form for over a decade, with performance improving over that span. I presume some combination of scanning + OCR has already been put to use in corpus linguistics, but coupled with improvements in scanner technology, I think OCR could open up lots of new research angles.
First, a note about text searches in pdf files. As you probably know, much digitally accessible academic literature is available in a type of portable document format (pdf) that has been generated directly from a word processor or text editor. The text in these pdfs is searchable in Acrobat Reader and, depending on the author's settings, can be copied and pasted into other applications. The other type of pdf, generated from an image scan, is not inherently searchable, because each page in the file is a picture rather than an arrangement of characters. These are the pdfs we're used to seeing from course reserves posted online or articles obtained electronically via interlibrary loan. They're readable on screen and print fairly clearly, but they can have huge file sizes, they might print at a slow page-per-minute rate depending on your printer, and they are not searchable.
This is where OCR comes in: once OCR has been run, the pdf becomes searchable in Reader, and the text can then be pasted into another application. Acrobat Pro might also let you export it to another file format as an alternative to copy-and-paste. The search and copy/paste functions aren't necessarily useful if you're just reading a scan of an old article, but suppose you have text data that exists only in hard copy - maybe a printout from an obsolete file format, or a bound grammar or dictionary. If you scan your hard copy to pdf, OCR could let you make a searchable electronic copy of it. I'm thinking you could even convert whatever dictionary you have into a sortable database, or better yet mark it up with XML.
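As a sketch of what that markup step might look like, here's a minimal Python example that parses a dictionary line into an XML entry. The line format here (headword, parenthesized lexical representation, part of speech, gloss) and the sample entry are placeholder assumptions for illustration, not the actual layout of any particular dictionary:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical entry format: headword (lexical form) pos. gloss
LINE = re.compile(r"(?P<head>\S+)\s+\((?P<lex>[^)]+)\)\s+(?P<pos>\w+\.)\s+(?P<gloss>.+)")

def entry_to_xml(line):
    """Turn one dictionary line into an XML <entry> element (or None if it doesn't parse)."""
    m = LINE.match(line)
    if m is None:
        return None
    entry = ET.Element("entry")
    for field in ("head", "lex", "pos", "gloss"):
        ET.SubElement(entry, field).text = m.group(field)
    return entry

# Placeholder entry, not a real Woleaian headword:
elem = entry_to_xml("abc (abcd) n. example gloss")
print(ET.tostring(elem, encoding="unicode"))
```

Once entries are elements like this, sorting and searching them, or loading them into a database, is straightforward.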
OK, this has always been possible even without scanning and OCR, assuming you're willing to re-type your source from scratch. But the potential savings in time (and chiropractic care) make the task a whole lot more palatable.
The one major stumbling block is: how well does OCR work? To answer this, I've decided to put Adobe's OCR capabilities to a brief test, using the Woleaian-English dictionary (Sohn & Tawerilmang 1976). Using a new scanner/copier that sends a pdf to my email address, I had 8 pages scanned and OCRed in 20 minutes. As for the OCR results, there is some good and a little bad.
First, the good:
- Acrobat's OCR is sophisticated enough to know which end is up in your file. If you scan something upside-down or sideways, it figures that out, and you don't have to rotate it yourself. In fact, I had some pages that could be OCRed only in a sideways orientation - but the output was always right-side up.
- Character recognition seems not to be context-dependent. I was worried the program would consult an English lexicon to make top-down guesses about fuzzy character images. That would be really useful if your document were entirely in English, but if you're scanning a dictionary of a 1,000-speaker language with a quirky orthography, fuzzy characters in non-English words would probably be rendered poorly if they were guessed at from an English lexicon.
- Italic text is generally recognizable (but may induce some misreads - see below).
- The error rate is surprisingly low (and regular - see below).
And the bad:
- Special characters can be problematic. In the dictionary I'm using, homonymic headwords are differentiated with subscript numbers, which appear in the post-OCR pdf as commas, l, z, and non-subscript 3.
- After a first OCR run-through, the program maintains a list of "suspects" - characters that it is not fully confident about and so has not yet rendered into text. You have to go through these manually and accept the character that the application proposes for each suspect. Unfortunately, it offers no alternative choices.
- The available characters so far seem to be limited to the ASCII set, though I may test this by scanning a printout of an IPA-ful document.
- Special formats like sub/superscript are troublesome, and there are mixups among the characters 1, l, [square brackets], and (parentheses). And oddly, my entire OCR output (all 8 pages so far) is italic.
Overall, I feel like the good outweighs the bad at this point. I also believe a lot of misread characters can be cleaned out with an automated Perl or Java script once you've exported the pdf to some other text format. I say this only because a lot of the errors I found were regular.
I considered an error to be any substitution or deletion of a character, or coalescence of two into one. Format errors like inaccurate transfer of boldface, italicization, small-caps, and so on were not counted as errors. I checked one pdf page that comprised two dictionary pages and found 57 errors. This seems like a lot, but considering the file contained 5650 non-space characters, it makes for a character accuracy of about 99%. Waaay better than I can type.
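For concreteness, here's the arithmetic behind that figure, as a quick Python check using the counts above:

```python
# Character-level accuracy on the checked page
errors = 57    # substitutions, deletions, and coalescences
chars = 5650   # non-space characters in the file
accuracy = 1 - errors / chars
print(f"{accuracy:.1%}")  # about 99.0%
```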
Moreover, 27 of those errors are the same mistake: the italic string i) is rendered as z) (though in 4 cases it comes out as t)). So if I wanted, I could run a search-and-replace to remove these errors. Another 11 errors are misread subscript characters, half of which could be taken out with another search-and-replace. I should add that many of these errors are partly products of the structure of an entry: headword (possibly with a subscript number), italic lexical representation in parentheses, followed by part-of-speech, glosses, and italic examples.
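To illustrate, here's what such a cleanup might look like - a Python sketch (the same idea works in Perl or Java), assuming the text has already been exported to plain text. The pattern set is illustrative, not exhaustive:

```python
import re

def clean(text):
    """Apply blanket fixes for regular OCR misreads (illustrative set)."""
    # italic "i)" misread as "z)": "z)" is unlikely to occur legitimately,
    # so a global replacement is probably safe
    text = re.sub(r"z\)", "i)", text)
    # "i)" also comes out as "t)" occasionally, but real words can end in t,
    # so those occurrences are better flagged for review than auto-replaced
    for m in re.finditer(r"t\)", text):
        print(f"review offset {m.start()}: ...{text[max(0, m.start() - 10):m.end() + 10]}...")
    return text

# Placeholder entry, not a real Woleaian form:
print(clean("(abcz) v. to do something"))  # -> "(abci) v. to do something"
```

The design choice here is to auto-replace only misreads whose corrected form can't be confused with legitimate text, and merely flag the ambiguous ones.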
The upshot is that OCR plus your own automated error correction leaves about 0.5% of all characters misread. I think this is great, but it also means the result still needs to be checked by hand. Of course, if you would rather type the whole thing in yourself, you still need to proofread - and for errors that are far less predictable, to boot. I should add that "checking by hand" in Acrobat was actually really easy: I just alt-tabbed between the pre-OCR and post-OCR documents, while proofreading a typed copy requires turning your head from book to screen and back every couple of words.
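That figure checks out against the error counts above, assuming the search-and-replace passes catch all 27 i) misreads and about half of the 11 subscript errors:

```python
errors = 57                   # total errors on the checked page
fixable = 27 + 11 // 2        # the i) misreads plus half the subscript errors
remaining = errors - fixable  # 25 errors left for hand-checking
print(f"{remaining / 5650:.2%}")  # roughly 0.44%, i.e. about 0.5%
```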
For the curious, I've posted screenshots of the sample scan before and after applying OCR; I have highlighted errors in the post-OCR document. I guess now I'm going to extend the test to the rest of the dictionary.