
seminar class · 04-04-2011, 11:10 AM

“Segmentation of Optical Character Recognition”
ABSTRACT
An OCR system converts a scanned input document into an editable text document. This report describes the characteristics of the Devanagari script, how it differs from Roman scripts, and what makes OCR for Devanagari different from OCR for a Roman script. The stages of an OCR system are: uploading a scanned image from the computer; segmentation, in which the text zones are extracted from the image; recognition of the text; and post-processing, in which the output of the previous stage goes through error detection and correction. The report also explains the user interface provided with the OCR, through which a user can easily add to or modify the segmentation done by the OCR system.
INTRODUCTION
Optical Character Recognition (OCR) is a process that translates scanned images of typewritten text into machine-editable text, or pictures of characters into a standard encoding scheme such as ASCII or Unicode. An OCR system enables us to feed a book or a magazine article directly into an electronic computer file and edit that file using a word processor. Though academic research in the field continues, the focus in OCR has shifted to the implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the term OCR has been broadened to include digital image processing as well. Early systems required training (the provision of known samples of each character) to read a specific font. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems can even reproduce formatted output that closely approximates the original scanned page, including images, columns and other non-textual components. However, recognition remains sensitive to font size and font type, and for handwritten input the task becomes even more formidable. Soft computing has been adopted in character recognition for its ability to create an input-output mapping with good approximation. The alternative for input-output mapping is a lookup table, which is totally rigid and leaves no room for input variations.
A performance of 93% at the character level is obtained. We present a complete method for segmentation of text printed in Devanagari. Our segmentation approach is a hybrid one, in which we try to recognize the parts of a conjunct that form part of a character class. We use a set of robust filters and two distance-based classifiers to classify the segmented images into known classes. We also present a two-level partitioning scheme and a search algorithm for correcting optically read Devanagari characters in a text recognition system for Devanagari script. The methodology described here makes use of structural properties that are unique to Indian scripts.
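The report does not say which distance measures or features its two classifiers use. As a rough illustration of the general idea of distance-based classification, the following minimal sketch assigns a feature vector to the class whose stored template is nearest in Euclidean distance. The class labels, the template values and the function names are all hypothetical and chosen only for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(feature, templates):
    """Return the label whose template vector is closest to `feature`."""
    return min(templates, key=lambda label: euclidean(feature, templates[label]))

# Hypothetical templates: one feature vector per known character class.
templates = {
    "ka": [1.0, 0.0, 1.0],
    "kha": [0.0, 1.0, 1.0],
}

print(classify([0.9, 0.1, 0.8], templates))  # nearest template is "ka"
```

Unlike a rigid lookup table, a nearest-template rule of this kind still returns a class when the input differs slightly from every stored sample.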
OCR has a variety of commercial and practical applications, such as reading forms and manuscripts and archiving them. Such a system facilitates keyboardless user-computer interaction, since text that is either printed or handwritten can be transferred directly to the machine. The challenge of building an OCR system that can match human performance also provides a strong motivation for research in this field.
We start with the binary image of a document. The initial segmentation process divides this image into sub-images corresponding to characters and symbols. Initial hypotheses for each sub-image are then generated from the features extracted from it. These are composed into words, which are verified and corrected if necessary.
Development of OCRs for Indian scripts is an active area of research today. Indian scripts present great challenges to an OCR designer because of the large number of letters in the alphabet, the sophisticated ways in which they combine, and the complicated graphemes that result. The problem is compounded by the unstructured manner in which popular fonts are designed. At the same time, the different Indian scripts share a lot of common structure. In this project, we argue that a semi-automatic tool can ease the development of recognizers for new font styles and new scripts. We present an OCR for printed Hindi text in Devanagari script. In text written in Devanagari script, there is no separation between the characters. The preprocessing tasks considered in this paper are conversion of grayscale images to binary images, image rectification, and segmentation of text into lines, words and basic symbols. Basic symbols are taken as the fundamental unit of segmentation in this paper, and they are recognized by a neural classifier.
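The report does not give the exact binarization or segmentation procedure. A common way to carry out these preprocessing steps is a fixed threshold followed by projection profiles, and the sketch below assumes that approach. The threshold of 128, the tiny sample image and all function names are assumptions made for illustration only.

```python
def binarize(gray, threshold=128):
    """Convert a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    found, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(profile)))
    return found

def segment_lines(binary):
    """Text lines are runs of rows that contain at least one ink pixel."""
    return runs([sum(row) for row in binary])

def segment_words(binary, top, bottom):
    """Within rows top..bottom-1, words are runs of columns that contain ink."""
    width = len(binary[0])
    profile = [sum(binary[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile)

# Tiny hypothetical grayscale page: 0 = black ink, 255 = white paper.
gray = [
    [255, 255, 255, 255, 255],
    [0,   0,   255, 0,   255],
    [0,   255, 255, 0,   255],
    [255, 255, 255, 255, 255],
    [255, 0,   0,   0,   255],
]
binary = binarize(gray)
lines = segment_lines(binary)
print(lines)                              # row ranges of the text lines
print(segment_words(binary, *lines[0]))   # column ranges of words in the first line
```

Because the characters of a Devanagari word are joined by the header line, blank columns in a line separate words rather than individual characters; splitting a word into basic symbols needs the further steps described in the report.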
Hindi is one of the most widely spoken languages in India; about 300 million people in India speak it. One of the important reasons for the poor recognition rate of optical character recognition (OCR) systems on difficult Devanagari symbols is error in character segmentation.
The present project is an attempt to understand the concept of OCR and, on that basis, to work towards an OCR system capable of recognizing Devanagari script.
ABOUT THE DEVANAGARI SCRIPT
Devanagari is used for many Indian languages, such as Hindi, Nepali, Marathi and Sindhi. More than 300 million people around the world use the Devanagari script. The script forms the foundation of Indian languages, so it plays a major role in the development of literature and manuscripts. A great deal of literature survives in old manuscripts, the Vedas and other scriptures, and because these are so old they are not easily accessible to everyone. The need to read these old scriptures led to their digital conversion by scanning the books. A scanned copy is not editable, however, so OCR systems for Devanagari text were introduced to convert the scans into editable form. The resulting text can be fed to various other systems; for example, it can be synthesized into speech so that the chanting of the scriptures can be heard.
Devanagari script is written from left to right and from top to bottom. It consists of 11 vowels and 33 basic consonants. Each vowel except the first has a corresponding modifier that is used to modify a consonant. Every word in Devanagari script has a continuous line of black pixels running across the whole word. This line is called the "Shirorekha" (header line). Based on the Shirorekha, each character can be divided into three parts. The components in the part above the Shirorekha are called upper modifiers. The second part contains the characters themselves, and the third part contains the vowel modifiers called lower modifiers. Moreover, some characters combine to form new characters, called joint characters. A character may lie in the shadow of another character, either because of a lower modifier or because of the shapes of two adjacent characters.
i) Words showing header lines
ii) Words with lower modifiers
iii) Words with shadow characters
iv) Words with composite characters
v) Characters with different height and width.
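The structural description above suggests one simple way to start segmenting a word: locate the Shirorekha as the row with the most black pixels, treat everything above it as the upper-modifier zone, and look for blank columns below it to separate characters. The sketch below is only an illustration under strong assumptions: a one-pixel-thick header line, no shadow or joint characters, and no separate handling of lower modifiers. The sample word image and all function names are hypothetical.

```python
def find_shirorekha(word):
    """Return the index of the row with the most ink pixels (assumed header line)."""
    counts = [sum(row) for row in word]
    return counts.index(max(counts))

def upper_zone(word):
    """Rows above the header line, where upper modifiers are expected."""
    return word[:find_shirorekha(word)]

def split_characters(word):
    """Column ranges (end exclusive) of ink runs below the header line."""
    body = word[find_shirorekha(word) + 1:]
    width = len(word[0])
    profile = [sum(row[c] for row in body) for c in range(width)]
    found, start = [], None
    for c, value in enumerate(profile):
        if value and start is None:
            start = c
        elif not value and start is not None:
            found.append((start, c))
            start = None
    if start is not None:
        found.append((start, width))
    return found

# Hypothetical binary word image: 1 = ink, 0 = background.
word = [
    [0, 0, 0, 1, 0, 0, 0],  # upper modifier
    [1, 1, 1, 1, 1, 1, 1],  # Shirorekha
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
]
print(find_shirorekha(word))   # row index of the header line
print(split_characters(word))  # column ranges of the two characters
```

A column-gap rule like this fails for shadow characters and joint characters, since no blank column separates them; those cases are the ones the report singles out as hard.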
Devanagari owes its complexity to its rich set of conjuncts, and this is what makes optical character recognition for the script fairly complex. The script is partly phonetic: a word written in Devanagari can be pronounced in only one way, but not every possible pronunciation can be written perfectly. A syllable ("akshar") is formed by a vowel alone or by any combination of consonants with a vowel.

