Wednesday, February 2, 2011

Converting Printed Sinhala Documents to Formatted Editable Text

This is my final year project at the Department of Computer Science, Faculty of Engineering, University of Peradeniya. Dr. Roshan Ragel, Shahina Ajward, Nalani Jayasundara and I worked on this research project. We won the Best Student Paper Award for ICT and Social Transformation at ICIAfS 2010.

INTRODUCTION

There are situations when we only have a printed copy of a document and need to make further modifications, or need to merge the content of two documents. In the worst case, even adding a small piece of text means re-applying all the font features and re-adjusting the whole document. We also come across instances where we need to digitize books and other material into editable text so that search engines and tools can be used on them. The typical process of digitizing a text document is to scan the printed copies into images and convert them to editable text. Optical character recognition (OCR) currently plays a vital role in converting scanned images of books, magazines, and newspapers into machine-readable text, avoiding the need to retype already printed material.

Most existing OCR solutions are commercial and produce editable text documents for international languages such as English. In Sri Lanka, both Sinhala and Tamil are widely used in print, yet there have been only a few attempts to develop a system for the Sinhala language. In this project, we identified an OCR algorithm to be used for Sinhala and developed an application for digitizing Sinhala characters. Any OCR implementation consists of a number of pre-processing steps and a classification method to recognize characters. In this study, the approach used to recognize characters is language independent, and therefore we believe our system can be extended to Tamil as well.

In addition to the digitization of Sinhala characters, we have developed a method of preserving a number of selected formatting features of a printed document (such as the font size of characters). We believe this is a useful addition to Sinhala text digitization.

The project can be divided into two phases: character recognition using an OCR technique, and extracting and preserving the layout (formatting) information of the document. An editable Sinhala document that preserves formatting is achieved by integrating the outcomes of the two phases.

In phase one, an optical character recognition method is used to identify characters. This phase comprises identifying connected components in the image, selecting the portions of the image corresponding to the connected components, and extracting the features of those components. A neural network is used to train the system, which enables it to identify characters that are not predetermined.
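To give an idea of the connected-component step, here is a minimal sketch (not our actual implementation; the tiny bitmap and helper names are made up for illustration). It labels 8-connected groups of ink pixels and returns the image portion covered by each:

```python
from collections import deque

def connected_components(bitmap):
    """Label 8-connected foreground regions (1 = ink) in a binary bitmap.
    Returns a list of components, each a list of (row, col) pixels."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] == 1 and not seen[r][c]:
                queue = deque([(r, c)])
                seen[r][c] = True
                pixels = []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < rows and 0 <= nx < cols \
                                    and bitmap[ny][nx] == 1 and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components

def bounding_box(pixels):
    """The portion of the image corresponding to one component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return (min(ys), min(xs), max(ys), max(xs))

# three separate blobs on a tiny 4x6 "scan"
bitmap = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
comps = connected_components(bitmap)
print(len(comps))             # 3
print(bounding_box(comps[0])) # (0, 0, 1, 1)
```

Each bounding box is then cropped out and passed on to feature extraction.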

In phase two, projection profiles are used to extract selected font features of characters. Extracting and preserving the layout or font features of a document greatly reduces the burden on the user when editing and reproducing the same document with modifications.

The recognized characters (the outcome of phase one) are combined with the identified features (the outcome of phase two) to reconstruct the original document in Rich Text Format (RTF) in an editor.

CHARACTER RECOGNITION

Since the system is mainly focused on character recognition, the major analysis targeted Optical Character Recognition (OCR) technology. With this knowledge of OCR, we concluded that the problem consists of two areas: image pre-processing and training the system.


Pre-processing
Since the soft copy of the scanned document is an image file, pre-processing is done to enhance the quality of the image. After identifying and analyzing several processing steps, we concluded that the required processed image could be obtained. We assumed that the image comes from high-quality paper, so that it does not need noise removal, and that documents are scanned without introducing skew.
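Under these assumptions (clean paper, no skew), pre-processing largely reduces to turning the grey-scale scan into a binary ink/background image. A minimal sketch of global thresholding (the threshold value and sample pixels are illustrative only; a real system might pick the threshold adaptively):

```python
def binarize(gray, threshold=128):
    """Global thresholding: pixels darker than the threshold become ink (1),
    lighter pixels become background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

# a tiny grey-scale patch: low values are dark ink, high values are paper
gray = [
    [250, 240,  30, 245],
    [ 20,  25, 235,  40],
]
print(binarize(gray))  # [[0, 0, 1, 0], [1, 1, 0, 1]]
```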

Training the System
A neural network approach is used to train the system to recognize characters. Among the various types of neural networks, we chose a feed-forward back-propagation neural network. With back-propagation, errors can be propagated backward through the network to control weight adjustment, while in the feed-forward pass information moves in only one direction. This allows results to be obtained efficiently and with higher accuracy.

The neural network was trained on characters obtained from the pre-processed image. It maps a set of inputs to a set of target values (outputs); by referring to the target values, the system can recognize characters.
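The mechanics of feed-forward back-propagation can be sketched with a tiny one-hidden-layer network in plain Python. This is a toy, not our trained recognizer: the two-valued "features" and two target classes stand in for real character feature vectors and character classes, and all sizes and learning-rate values are illustrative:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One hidden layer; weights adjusted by plain back-propagation."""
    def __init__(self, n_in, n_hid, n_out):
        rnd = lambda: random.uniform(-1, 1)
        self.w1 = [[rnd() for _ in range(n_in)] for _ in range(n_hid)]
        self.b1 = [rnd() for _ in range(n_hid)]
        self.w2 = [[rnd() for _ in range(n_hid)] for _ in range(n_out)]
        self.b2 = [rnd() for _ in range(n_out)]

    def forward(self, x):
        # feed-forward pass: information moves input -> hidden -> output
        self.h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w1, self.b1)]
        self.o = [sigmoid(sum(w * hi for w, hi in zip(ws, self.h)) + b)
                  for ws, b in zip(self.w2, self.b2)]
        return self.o

    def train_step(self, x, target, lr=0.5):
        o = self.forward(x)
        # error propagated backward to control the weight adjustment
        d_o = [(o[k] - target[k]) * o[k] * (1 - o[k]) for k in range(len(o))]
        d_h = [self.h[j] * (1 - self.h[j]) *
               sum(d_o[k] * self.w2[k][j] for k in range(len(d_o)))
               for j in range(len(self.h))]
        for k in range(len(d_o)):
            for j in range(len(self.h)):
                self.w2[k][j] -= lr * d_o[k] * self.h[j]
            self.b2[k] -= lr * d_o[k]
        for j in range(len(d_h)):
            for i in range(len(x)):
                self.w1[j][i] -= lr * d_h[j] * x[i]
            self.b1[j] -= lr * d_h[j]

# toy training set: feature vectors mapped to one-hot target values
data = [([0, 0], [1, 0]), ([0, 1], [1, 0]),
        ([1, 0], [1, 0]), ([1, 1], [0, 1])]
net = TinyNet(2, 3, 2)
for _ in range(3000):
    for x, t in data:
        net.train_step(x, t)

out = net.forward([1, 1])
print(out.index(max(out)))  # the recognized class index
```

In the real system the input vector is the extracted features of a connected component, and each output unit corresponds to one Sinhala character class.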


EXTRACTING FORMATTING INFORMATION

The projection profile of a text line is a feature-based approach to font attribute recognition. Different features are used for font discrimination, and they can be derived from visual observation of different fonts and their projection profiles. The selected features are extracted from the horizontal or vertical profiles.
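Computing the two profiles is simple: the horizontal profile counts ink pixels per row, the vertical profile per column. A minimal sketch on a made-up glyph bitmap (the size-estimation heuristic is illustrative, not our exact feature set):

```python
def profiles(bitmap):
    """Horizontal and vertical projection profiles of a binary image."""
    horizontal = [sum(row) for row in bitmap]      # ink pixels per row
    vertical = [sum(col) for col in zip(*bitmap)]  # ink pixels per column
    return horizontal, vertical

def glyph_height(horizontal):
    """Rough font-size cue: number of rows that contain any ink."""
    return sum(1 for ink in horizontal if ink > 0)

# a crude 5x6 glyph; heavier strokes would give taller profile peaks
glyph = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
h, v = profiles(glyph)
print(h)               # [4, 2, 4, 2, 2]
print(v)               # [0, 5, 2, 2, 5, 0]
print(glyph_height(h)) # 5
```

From profiles like these, cues such as character height (font size) and stroke weight (boldness) can be read off and compared between fonts.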

RECONSTRUCTING THE DOCUMENT

Terp-Word is an open-source word processor that supports the '.html' file format in addition to a number of other formats such as RTF. As explained earlier, the system is developed as two modules: character recognition and layout preservation. The scanned image goes through two separate processes, each of which generates a text file as output. The text file with the recognized characters and the encoded file with the extracted features are used to generate an HTML file that maps to the original scanned document. In this HTML file, the encoded features are decoded and applied to the corresponding recognized characters. The HTML file can be loaded into the editor and converted to an RTF file, which facilitates any advanced modifications. The resulting RTF file preserves the selected font features of the original document.
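The mapping step can be pictured as follows. This sketch is hypothetical: the run structure, feature names and styling are invented for illustration, and the real encoded-feature file format is not shown here. It simply wraps each recognized run of text in a span carrying its decoded font features:

```python
# hypothetical decoded feature stream: (recognized text, size in pt, bold?)
runs = [("සිංහල", 14, True), (" පෙළ", 12, False)]

def runs_to_html(runs):
    """Wrap each recognized text run in a span with its decoded features."""
    spans = []
    for text, size, bold in runs:
        style = "font-size:%dpt;" % size
        if bold:
            style += "font-weight:bold;"
        spans.append('<span style="%s">%s</span>' % (style, text))
    return "<html><body><p>%s</p></body></html>" % "".join(spans)

print(runs_to_html(runs))
```

The generated HTML file is then opened in the editor and saved as RTF.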

CONCLUSION AND FUTURE WORK

The main idea of our system was to build a tool that supports editing of a scanned image of a Sinhala-language document. While we are familiar with the current technologies used for character recognition of international languages, our objective was mainly focused on character recognition of local languages.

The first-phase implementation results only in character recognition, without preserving the original format of the scanned image file. The tool was further developed by adding the second-phase functionality, which preserves the original layout of the document.

Our objective was mainly focused on character recognition of the Sinhala language, and our tool has currently been tested only for Sinhala. However, it may support Tamil as well, since our implementation is language independent. Due to the shapes of Sinhala characters, there are some limitations in the character properties that can be handled.

The final outcome of the project is a rich software tool that allows users to obtain an editable text file from a scanned image while preserving its original formatting.

Due to limited time, we had to restrict ourselves to a few selected features. The following suggestions can be made for further improvement of the system. The intensity values of the original document could be used to recognize colors. By encoding the font attributes word-wise, we would be able to apply formats word-wise instead of line-wise as we do now. Though we have managed to avoid merging of characters in general, due to the rounded shapes of Sinhala characters there are still a few characters that suffer from this issue. Further, the system could be trained with Tamil character samples so that it supports the Tamil language as well.