Proving OCR Correctness via Blink Comparison

Oct 2009

In junior high school I learned to program with a TI Silent 700 hard-copy terminal which accessed an HP 2000 Timeshare BASIC (TSB) minicomputer system via modem. At the end of my final year I preserved my programs by printing them out, along with a few system programs. I had no idea what to do with the listings, other than maybe type them into a personal microcomputer, but I didn't want to lose my work. It's now almost thirty years later and I still have those old listings. Recently I discovered an HP 2000 emulator and TSB operating system image, so I spent a few days reliving pleasant memories while running it. My thanks go to all the people who made this possible.

It was great fun to go full circle with those program listings by entering them into the emulator and running them again in their native environment. However, I wasn't prepared to spend hours re-typing them and then spend even more hours proofreading; fortunately there was a better way. I decided to OCR the listings to save the typing labour but there was still the matter of proofreading sixty-some pages of source code and being confident that the proofreading was error-free.

That latter point is very important: I've done more than enough work to know that an eyeball comparison of two physical pages, or one text on-screen and one text on paper, is not sufficient to be confident they are identical because the brain gets tired and makes mistakes. The only way to make that type of comparison workable would be to split the material into small chunks and then proceed very slowly; and then after all that labour there is no easy way to prove correctness even if one makes copious notes during the process.

While pondering all of this I remembered a device called a Blink Comparator, which I had read about being used by astronomers for comet and asteroid hunting. It's an optical device that allows its operator to rapidly switch back and forth between two images of a star field that have been optically lined up so that they would appear to be superimposed if both were visible at the same time. When the view is switched from one photo to the other anything that's moved in the time between photos will appear to jump, or blink, because the object's motion has changed its position with respect to the background stars. This seemed to be the best way to efficiently proofread the OCRed source code, while also allowing anyone else to easily audit the correctness of the results.
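The principle can be illustrated in a few lines of Python. This is only a toy sketch and not part of the process described here: the two "images" are small made-up grids of characters that are assumed to be perfectly aligned, and `changed_cells` is a hypothetical helper that reports the positions a blink would make jump.

```python
def changed_cells(frame_a, frame_b):
    """Return the (row, col) positions where two aligned frames differ."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of rows")
    diffs = []
    for r, (row_a, row_b) in enumerate(zip(frame_a, frame_b)):
        if len(row_a) != len(row_b):
            raise ValueError(f"row {r} has different widths")
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                diffs.append((r, c))
    return diffs


# Two hypothetical star fields: '*' is a star, '.' is empty sky.
first_night = [
    "..*....",
    ".....*.",
    "*......",
]
second_night = [
    "..*....",
    "....*..",
    "*......",
]

# The object in the middle row has moved one cell to the left,
# so both its old and its new position are reported.
print(changed_cells(first_night, second_night))
```

The background stars produce no output at all; only the object that moved is reported, which is exactly what the eye picks up when blinking.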

The process I settled on uses an image editor that supports layers (I used Gimp), with one layer per image so that the two can be aligned for blinking. The first layer is a scan of one page of the original source. The second layer is a screen capture of the OCRed text of that page, scaled and stretched so that it lines up with the original text as closely as possible. The screen capture should be scaled rather than the scan of the original because the screen capture has sharper text. Scaling blurs an image slightly: blurring the sharper image will not make the blinking harder, whereas blurring an image that is already slightly blurred will make it more difficult.

The following are the steps I followed. They may, or may not, work for you. Adjust as required.

  1. Scan a page of the original source listing. This is Image #1. It can also be the archival copy of the source if you want one, so take care scanning it.

  2. Prepare to run Image #1 through an OCR engine. If your OCR software is trainable then it is a very good idea to train it for the font used for the listing.

    I used Tesseract 2.04 to perform the OCR. It worked very well after I had trained it for the unique font used by the Silent 700 terminal. Without that training it was terrible, but that's what should be expected from blocky 7x5 dot-matrix thermally printed program source code rather than English text; thus the training.

    If the source listing is on something like aged thermal paper the contrast may be a bit low, or the appearance of the print may be poor. I write from experience: my listings are on thermal paper that occasionally shows all sorts of interesting damage, but for the most part the images could be cleaned up with Gimp, and Tesseract gave me excellent results on all but one page. A simple threshold operation was almost always all that I needed.

  3. Run Image #1 through your OCR software. The output becomes Text #1.

    The text will likely contain some gross errors unless the listing is pristine. Correct all of the errors that you can find. You don't want too many errors going into the blink comparison; otherwise you will need to repeat the blinking with a corrected image to be confident that no errors were missed. It takes some work to align a page, so it's better to spend a little effort correcting the obvious errors before blinking the text.

    Only once did I have a page that Tesseract completely gave up on and refused to process regardless of what I did to the image. I never did figure out why, as the page was clean.

  4. Bring up Text #1 on screen and make a screen capture of it. This is Image #2. It helps to use a font that is similar to that used in the source listing, if that's possible.

  5. Using an image editor, create a new image with Image #1 as the bottom layer (Layer #1), or use a scaled-down version of Image #1 if your scan was made at a high DPI. Import Image #2 as the next layer (Layer #2). Temporarily make Layer #2 transparent enough that you can see Layer #1 through it for the alignment process. Both layers can be cropped to remove extraneous stuff that may surround the text block.

  6. Scale and/or stretch Layer #2 so that its text is aligned reasonably well with the text of Layer #1.

    When the text of both layers is aligned you're ready for the blinking process. This is easy if your image editing software can enable/disable a layer with just a mouse click.

  7. Make Layer #2 non-transparent.

  8. Repeatedly enable and disable Layer #2. This performs the blinking. You'll probably see the characters swimming around because perfect alignment isn't possible, but carefully examine the text and any differences should jump out.

    It's not always possible to get perfect alignment because paper tends to distort if it hasn't been kept under proper conditions such as a constant humidity, or appears distorted if it hasn't been pressed flat on the scanner bed. Under these conditions text will jump around en masse while blinking but the eye/brain can handle surprising amounts of this and still identify the differences.

  9. If an error is found then create a third layer (Layer #3) with 50% transparency for marking errors. I used a yellow coloured blotch over the error as a marker. Questionable things such as a "1" versus an "I" can also be marked for a quick check later. I circled those.

    I made Layer #3 transparent because I found a solid yellow blotch to be very distracting; it didn't dance around with the blinking. Making it transparent allowed me to see the text underneath it moving slightly with the blinking, which was enough.

    When you've finished blinking the page then make the corrections to Text #1. If there are a lot of corrections you may want to go back to step 4 and repeat the process with the new Text #1 so that you have absolute confidence that all errors have been corrected.

    If it proves necessary to repeat the blinking later (days, weeks, months, or whenever) and further errors are discovered, use a different colour of mark for the newly discovered errors so that there is no confusion with the previous corrections.

  10. When finished you should save a copy of the blinking file so that you can check your work later if need be. For important documents it may be vital that you can prove you didn't alter something by mistake; for example, it is very easy to correct a typo that was in the original without even thinking about it. This requires discipline: once you start blinking, the only corrections you make are for errors found by blinking.
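Two of the steps above involve simple arithmetic that can be sketched in code: the threshold clean-up from step 2 and the scale/stretch factors for step 6. Gimp does both interactively, so this is only an illustration under stated assumptions, not what I actually ran. Images are assumed to be plain lists of rows of 0-255 greyscale values, the cutoff values are arbitrary, and the names `threshold`, `text_bbox`, and `scale_factors` are hypothetical.

```python
def threshold(gray, cutoff=128):
    """Binarise a greyscale image given as rows of 0-255 ints.

    Pixels at or above the cutoff become white (255); the rest become black (0).
    """
    return [[255 if p >= cutoff else 0 for p in row] for row in gray]


def text_bbox(binary):
    """Inclusive bounding box (left, top, right, bottom) of the black pixels."""
    rows = [y for y, row in enumerate(binary) if 0 in row]
    cols = [x for row in binary for x, p in enumerate(row) if p == 0]
    if not rows:
        raise ValueError("no black pixels found")
    return min(cols), rows[0], max(cols), rows[-1]


def scale_factors(scan_bbox, capture_bbox):
    """Horizontal and vertical factors that stretch the screen capture's
    text block to the size of the scan's text block."""
    def size(bbox):
        left, top, right, bottom = bbox
        return right - left + 1, bottom - top + 1

    scan_w, scan_h = size(scan_bbox)
    cap_w, cap_h = size(capture_bbox)
    return scan_w / cap_w, scan_h / cap_h


# A tiny made-up "scan": light grey paper with three darker ink pixels.
faded_scan = [
    [230, 228, 231, 229],
    [227,  90, 110, 226],
    [229, 100, 225, 230],
]
clean = threshold(faded_scan, cutoff=160)
print(clean)             # paper becomes 255, ink becomes 0
print(text_bbox(clean))  # where the ink sits in the cleaned image

# Hypothetical text-block boxes measured on a full-size scan and screen capture.
print(scale_factors((100, 50, 899, 1249), (20, 10, 419, 509)))
```

The two factors usually differ, which is why Layer #2 needs to be stretched as well as scaled. Distorted paper will not follow a single pair of factors, so the numbers are only a starting point for aligning by eye.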

Here's a real example of one page that was blinked per the above guidelines. It has more errors than I typically found, but that makes it a very good example. Note the circled part that indicates doubt about a "1" versus an "I". In this case the OCR text was correct but I did need to verify it against the original.

The blink comparison process still involves a human comparing something visually, so the possibility of error isn't eliminated. Is there a better way that can be automated?
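One possible direction, offered only as a sketch and not something the process above used: produce two independent transcriptions of the same page (for example, two OCR passes with different settings) and let a program report where they disagree. This can only catch errors that the two transcriptions do not share, so it would complement blinking rather than replace it. The helper name `disagreements` and the sample BASIC lines are made up; `difflib.SequenceMatcher` is from the Python standard library.

```python
import difflib


def disagreements(text_a, text_b):
    """Return (line_number_in_a, lines_from_a, lines_from_b) for every
    stretch of lines where the two transcriptions differ."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    report = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            report.append((a1 + 1, lines_a[a1:a2], lines_b[b1:b2]))
    return report


# Two hypothetical transcriptions of the same three-line listing.
pass_one = """10 PRINT "HELLO"
20 LET I=1
30 GOTO 10
"""
pass_two = """10 PRINT "HELLO"
20 LET 1=1
30 GOTO 10
"""

for line_no, from_a, from_b in disagreements(pass_one, pass_two):
    print(f"line {line_no}: {from_a} versus {from_b}")
```

Here the only disagreement is the "1" versus "I" confusion on line 20, which would still need to be settled against the original listing.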

Notes: