Segmentation of Optical Character Recognition
ABSTRACT

An OCR system converts a scanned input document into an editable text document. This report describes the characteristics of the Devanagari script, how it differs from Roman scripts, and what makes OCR for Devanagari different from OCR for a Roman script. The stages of an OCR system are: uploading a scanned image from the computer; segmentation, in which the text zones are extracted from the image; recognition of the text; and post-processing, in which the output of the previous stage goes through error detection and correction. The report also describes the user interface provided with the OCR system, through which a user can easily add to or modify the segmentation produced by the system.
INTRODUCTION
Optical Character Recognition (OCR) is a process that translates images of scanned typewritten text into machine-editable text, or pictures of characters into a standard encoding scheme such as ASCII or Unicode. An OCR system enables us to feed a book or a magazine article directly into an electronic computer file and edit the file using a word processor. Though academic research in the field continues, the focus in OCR has shifted to the implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications that use true optical techniques survive, the term OCR has been broadened to include digital image processing as well. Early systems required training (the provision of known samples of each character) to read a specific font. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems can even reproduce formatted output that closely approximates the original scanned page, including images, columns and other non-textual components. However, such approaches are sensitive to font size and font type, and for handwritten input the task becomes even more formidable. Soft computing has been adopted in character recognition for its ability to create an input-output mapping with good approximation. The alternative for input-output mapping would be a lookup table, which is totally rigid and leaves no room for variations in the input.
A performance of 93% at character level is obtained. We present a complete method for the segmentation of text printed in Devanagari. Our segmentation approach is a hybrid one, in which we try to recognize the parts of a conjunct that form part of a character class. We use a set of robust filters and two distance-based classifiers to classify the segmented images into known classes. We also present a two-level partitioning scheme and a search algorithm for correcting optically read Devanagari characters in a text recognition system. The methodology described here makes use of structural properties that are unique to Indian scripts.
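The report does not say which two distance-based classifiers are used. As a minimal sketch of the general idea, assuming a nearest-mean rule with Euclidean distance, the following assigns a segmented image to the class whose mean feature vector is closest. The feature vectors, the class labels and the function names are hypothetical placeholders, not the authors' features or classifier.

```python
import math
from typing import Dict, List, Sequence


def class_means(samples: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Average the training feature vectors of each class, component by component."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means


def classify(features: Sequence[float], means: Dict[str, List[float]]) -> str:
    """Return the label whose mean vector has the smallest Euclidean distance."""
    return min(means, key=lambda label: math.dist(features, means[label]))


# Hypothetical two-dimensional features for two character classes.
training = {
    "ka": [[0.9, 0.1], [0.8, 0.2]],
    "ga": [[0.1, 0.9], [0.2, 0.7]],
}
prototypes = class_means(training)
print(classify([0.7, 0.3], prototypes))
```

Unlike a rigid lookup table, a distance rule of this kind still returns the closest class when the input varies slightly from every training sample.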
An OCR system has a variety of commercial and practical applications, such as reading forms and manuscripts and archiving them. Such a system facilitates keyboardless user-computer interaction, and text that is either printed or handwritten can be transferred directly to the machine. The challenge of building an OCR system that can match human performance also provides a strong motivation for research in this field.
We start with the binary image of a document, which the initial segmentation process divides into sub-images corresponding to characters and symbols. Initial hypotheses for each sub-image are then generated from the features extracted from it. These hypotheses are composed into words, which are verified and corrected if necessary.
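The verification and correction step is not specified further here. One common way to do it, shown purely as an assumption, is to check each composed word against a lexicon and replace it with the closest entry when the edit distance is small. The lexicon, the `max_distance` threshold and the function names below are illustrative and are not taken from the report.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def correct_word(word: str, lexicon: List[str], max_distance: int = 1) -> str:
    """Return the word itself if it is known, else the closest lexicon entry
    within max_distance edits, else the word unchanged."""
    if word in lexicon:
        return word
    best = min(lexicon, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else word


lexicon = ["कमल", "नमक", "घर"]       # hypothetical word list
print(correct_word("कमन", lexicon))  # one code point differs from a lexicon entry
```

Note that this sketch compares code points, so a vowel sign counts as its own symbol; a real system would more likely compare at the level of the recognized basic symbols.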
Development of OCR systems for Indian scripts is an active area of research today. Indian scripts present great challenges to an OCR designer because of the large number of letters in the alphabet, the sophisticated ways in which they combine, and the complicated graphemes that result. The problem is compounded by the unstructured manner in which popular fonts are designed. There is, however, a lot of common structure across the different Indian scripts. In this project, we argue that a semi-automatic tool can ease the development of recognizers for new font styles and new scripts. We present an OCR system for printed Hindi text in Devanagari script. In text written in Devanagari script, there is no separation between the characters of a word. The preprocessing tasks considered in this paper are the conversion of grayscale images to binary images, image rectification, and the segmentation of text into lines, words and basic symbols. Basic symbols are taken as the fundamental unit of segmentation in this paper and are recognized by a neural classifier.
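The preprocessing steps can be illustrated with a small sketch. It assumes a fixed global threshold for binarization and projection profiles (ink counts per row or column) for line and word segmentation; the report does not state which methods it actually uses, and image rectification is omitted. Images are plain lists of pixel rows so that the example needs no external library, and all function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # one inner list per pixel row


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Map 0-255 grayscale pixels to 1 (ink, darker than threshold) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of non-zero counts."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary: Image) -> List[Tuple[int, int]]:
    """Text lines are runs of rows that contain at least one ink pixel."""
    return ink_runs([sum(row) for row in binary])


def segment_words(binary: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Within rows top..bottom-1, words are runs of columns that contain ink."""
    band = binary[top:bottom]
    width = len(binary[0])
    return ink_runs([sum(row[c] for row in band) for c in range(width)])


gray = [
    [255, 255, 255, 255, 255, 255],
    [0,   0,   0,   255, 0,   0],
    [255, 0,   255, 255, 0,   255],
    [255, 255, 255, 255, 255, 255],
    [0,   0,   255, 255, 255, 255],
]
binary = binarize(gray)
for top, bottom in segment_lines(binary):
    print((top, bottom), segment_words(binary, top, bottom))
```

Splitting words at empty columns suits Devanagari because the header line joins the characters of a word, so empty columns inside a line normally fall only between words. Splitting a word into basic symbols needs a further step.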
Hindi is one of the most widely spoken languages in India; about 300 million people in India speak it. One of the important reasons for the poor recognition rate of OCR systems on difficult Devanagari symbols is error in character segmentation.
The present project is an attempt to understand the concept of OCR and to work towards an OCR system that is capable of recognizing Devanagari script.
ABOUT THE DEVANAGARI SCRIPT
Devanagari is used to write many languages, including Hindi, Nepali, Marathi and Sindhi. More than 300 million people around the world use the Devanagari script. The script forms the foundation of many Indian languages, and so it plays a major role in the development of literature and manuscripts. A great deal of literature survives in old manuscripts, Vedas and scriptures, and because these are so old they are not easily accessible to everyone. The need to read these old scriptures led to their digital conversion by scanning the books. A scanned copy, however, is not editable, so OCR systems for Devanagari text were introduced to convert it into an editable form. The resulting text can be input to various other systems; for example, it can be passed to a speech synthesizer so that the scriptures can be heard.
Devanagari script is written from left to right and from top to bottom. It consists of 11 vowels and 33 basic consonants. Each vowel except the first has a corresponding modifier that is used to modify a consonant. Every word in Devanagari script has a continuous line of black pixels running across the whole word. This line is called the “Shirorekha”. Based on the shirorekha, each character can be divided into three parts. The components in the part above the shirorekha are called upper modifiers. The second part contains the characters themselves, and the third part contains the vowel modifiers called lower modifiers. Moreover, some characters combine to form a new set of characters called joint characters. A character may lie in the shadow of another character, either because of a lower modifier or because of the shapes of two adjacent characters.
[Figures: example words showing (i) header lines, (ii) lower modifiers, (iii) shadow characters, (iv) composite characters, and (v) characters with different height and width.]
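The role of the shirorekha in segmentation can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the report: the header line is taken to be the single pixel row with the most ink, the rows above it are treated as the upper-modifier zone, and characters are cut at empty columns once the header is removed. Real header lines are several pixels thick, and lower modifiers, shadow characters and joint characters need additional handling. The function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary word image: 1 = ink, 0 = background


def find_shirorekha(word: Image) -> int:
    """Return the index of the row with the most ink (topmost row on ties)."""
    profile = [sum(row) for row in word]
    return max(range(len(profile)), key=profile.__getitem__)


def split_at_shirorekha(word: Image) -> Tuple[Image, Image]:
    """Return (rows above the header line, rows below the header line)."""
    header = find_shirorekha(word)
    return word[:header], word[header + 1:]


def split_characters(body: Image) -> List[Tuple[int, int]]:
    """Column ranges (end exclusive) of ink runs in the headerless body."""
    width = len(body[0])
    profile = [sum(row[c] for row in body) for c in range(width)]
    cuts, start = [], None
    for c, count in enumerate(profile + [0]):  # trailing 0 closes a run at the right edge
        if count > 0 and start is None:
            start = c
        elif count == 0 and start is not None:
            cuts.append((start, c))
            start = None
    return cuts


word = [
    [0, 0, 1, 0, 0, 0, 0],  # upper modifier
    [1, 1, 1, 1, 1, 1, 1],  # shirorekha
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
]
upper, body = split_at_shirorekha(word)
print(find_shirorekha(word), split_characters(body))
```

Removing the header row first is what makes the column cuts possible, since with the header in place every column of the word contains ink.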
Devanagari owes its complexity to its rich set of conjuncts, which is what makes Optical Character Recognition for Devanagari fairly complex. The script is partly phonetic, in that a word written in Devanagari can be pronounced in only one way, but not all possible pronunciations can be written perfectly. A syllable ("akshar") is formed by a vowel alone or by any combination of consonants with a vowel.