Snippet: Office OCR Component (C#)
Title: Office OCR Component
Language: C#
Description: Microsoft Office Document Imaging (MODI) provides a method to perform OCR on a saved image file from code.
Author: Matt Schmidt
Date Added: 8/15/2008
        // Requires: using System; using System.Text; and a project reference to the MODI library.

        /// <summary>
        /// Performs optical character recognition on an image file and returns the text.
        /// </summary>
        /// <param name="fileName">Path of the image file to read</param>
        /// <returns>Recognized words separated by spaces, or an empty string if OCR fails</returns>
        private String ConvertImageToText(String fileName)
        {
            StringBuilder result = new StringBuilder();

            try
            {
                // Microsoft Office Document Imaging, used to OCR text on the image
                MODI.Document modiDoc = new MODI.Document();
                modiDoc.Create(fileName);
                modiDoc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);

                // Read the recognized words from the first page
                MODI.Image image = (MODI.Image)modiDoc.Images[0];
                foreach (MODI.Word word in image.Layout.Words)
                {
                    result.Append(word.Text);
                    result.Append(" ");
                }

                // Close without saving so the image file is released
                modiDoc.Close(false);
            }
            catch (Exception)
            {
                // Handle the exception here; OCR throws when no text is found
            }
            return result.ToString();
        }
Usage
Pass the file name of an image file to ConvertImageToText, and the recognized text is returned as a single string with the words separated by spaces.
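As a rough sketch of how a caller might wrap the function, the helper below checks that the file exists and treats an exception or an empty result as "no text". The names OcrUsage and TryReadText are hypothetical and not part of the original snippet. The ocr delegate stands in for ConvertImageToText so that the sketch compiles without the MODI reference.

```csharp
using System;
using System.IO;

public static class OcrUsage
{
    // Hypothetical wrapper: 'ocr' stands in for ConvertImageToText.
    // Returns true only when the file exists and some text was recognized.
    public static bool TryReadText(string fileName, Func<string, string> ocr, out string text)
    {
        text = string.Empty;

        // Avoid calling the OCR engine for a missing or empty path
        if (string.IsNullOrEmpty(fileName) || !File.Exists(fileName))
        {
            return false;
        }

        try
        {
            // Trim the trailing space that the word loop leaves behind
            text = (ocr(fileName) ?? string.Empty).Trim();
            return text.Length > 0;
        }
        catch (Exception)
        {
            // An OCR failure is reported as "no text" rather than rethrown
            return false;
        }
    }
}
```

A call might look like OcrUsage.TryReadText(path, ConvertImageToText, out text), made from the class that contains ConvertImageToText.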
Notes
This method is reasonably accurate, but the OCR call throws an exception if no text is found, so be sure to handle it. Note: your project needs a reference to the Microsoft Office Document Imaging DLL. It's not always part of the default Office install, so check by running the add-components part of the Office setup program.
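One possible way to detect a missing install at run time is to look up the component's COM registration before attempting OCR. This is a minimal sketch, not something the original snippet does. It assumes that MODI registers the ProgID "MODI.Document", and the names ModiCheck and IsModiInstalled are hypothetical.

```csharp
using System;

public static class ModiCheck
{
    // Assumption: an installed MODI registers the COM ProgID "MODI.Document".
    public static bool IsModiInstalled()
    {
        try
        {
            // GetTypeFromProgID returns null when the ProgID is not registered
            return Type.GetTypeFromProgID("MODI.Document") != null;
        }
        catch (PlatformNotSupportedException)
        {
            // COM lookups are only available on Windows
            return false;
        }
    }
}
```

If the check returns false, the application can prompt the user to add Document Imaging through the Office setup program instead of failing inside the OCR call.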