OMNI FONT recognition in Indian language OCR full report

A Project Report on OMNI FONT recognition in Indian language OCR

Presented by
Ravi Kant Yada
Master of Technology
Computer Science & Engineering
Guru Gobind Singh Indraprastha University
Kashmere Gate, Delhi

ABSTRACT
Much of the electronic data in Indian languages is not Unicode encoded but is stored in proprietary, font-dependent encodings. Processing Unicode data is quite straightforward because it follows distinct code ranges for each language and there is a one-to-one correspondence between codes and characters, but no such guarantees hold for font-encoded data. Hence it becomes necessary to identify the font encoding and convert the font-data into a phonetic notation. This project proposes an approach for identifying the font-type (font encoding name) of font-data. It also proposes a generic framework to build font converters for converting font-data into a phonetic transliteration scheme for Indian languages.
Development of OCRs for Indian scripts is an active area of research today. Devanagari scripts present great challenges to an OCR designer due to the large number of letters in the alphabet, the sophisticated ways in which they combine, and the complicated graphemes that result. The problem is compounded by the unstructured manner in which popular fonts are designed. There is a lot of common structure across the different Indian scripts. In this project, we argue that a number of automatic and semi-automatic tools can ease the development of recognizers for new font styles and new scripts. We discuss them briefly and show how they have helped build new OCRs for the purpose of omni-font recognition in the Hindi language. An integrated approach to the design of OCRs for all Devanagari scripts has great benefits. We are building OCRs for the Hindi language following this approach, as part of a system that provides tools to create content in it.
In this project we present a multi-font OCR system to be employed for document processing, which performs recognition of the font-style belonging to a subset of the existing fonts. The detection of the font-style of the document words can guide a rough automatic classification of documents, and can also be used to improve character recognition. An alternative for the crucial task of Optical Font Recognition (OFR) is proposed in this work; it is based on the analysis of texture characteristics of document images formed of pure text. A printed text block with a unique font is suitable to provide the specific texture properties necessary for the process of recognition of the most commonly used fonts in the Hindi language.
A typical OCR system contains three logical components: an image scanner, OCR software and hardware, and an output interface. The image scanner optically captures text images to be recognized. Text images are processed with OCR software and hardware. The process involves three operations: document analysis (extracting individual character images), recognizing these images (based on shape), and contextual processing (either to correct misclassifications made by the recognition algorithm or to limit recognition choices). The output interface is responsible for communication of OCR system results to the outside world.
TABLE OF CONTENTS
Candidate's Declaration
Certificate
Acknowledgement
Abstract
List of Figures
List of Tables
Chapter 1. Introduction
1.1 What is OCR
1.2 History of OCR
1.3 Different uses for OCR
1.4 Accuracy
1.5 Today's position of OCR
1.6 Applications of OCR
1.7 Limitations of OCR
1.8 Project Statement
1.9 Contributions
Chapter 2. MOTIVATION OF THE PROJECT
2.1 What does it take to make a successful OCR System
2.2 Input
2.3 Importance of Font Design
2.4 Documents
2.5 Nature of document
2.6 Nature of output requirements
2.7 Division of Fonts
2.8 Technical Description of OCR
Chapter 3. GENERAL APPROACHES
3.1 Need for handling Font-Data
3.2 Digital storage format
3.3 Problem Statement
3.4 Objective
3.5 Approaches for developing an OCR
Chapter 4. LITERATURE ANALYSIS
4.1 Multi-Linguistic Optical Font Recognition Using Stroke Templates
4.2 Font Recognition Based on Global Texture Analysis
4.3 Font Recognition By Invariant Moments Of Global Textures
4.4 Optical Font Recognition Using Typographical Features
4.5 Optical font recognition from projection profiles
4.6 Identification and Conversion of Font-Data in Indian Languages
4.7 Text Processing for Text-to-Speech Systems in Indian Languages
4.8 Tools for Developing OCRs for Indian Scripts
4.9 Optical Font Recognition for Multi-Font OCR and Document Processing
4.10 Development of a Generic Font OCR
Chapter 5. FUTURE WORK AND CONCLUSION
5.1 Future work
5.2 Conclusion
References
LIST OF FIGURES
Figure 1 Illustration of code mapping for English fonts.
Figure 2 Illustration of code mapping for Indian (Hindi) fonts.
Figure 3 Flow chart of the proposed approach.
Figure 4 The flow chart of the font identification system.
Figure 5 Scheme of font identification system.
Figure 6 Illustration of code mapping for English fonts.
Figure 7 Illustration of code mapping for Hindi fonts.
Figure 8 Block diagram of the data collection tool
LIST OF TABLES
Table 1 Evolution of recognition rates for four text lengths
Table 2 Theoretical confusion rates for weight (normal, bold) and slope (roman, italic) with known family and size (12) and with unknown size (all sizes merged)
Table 3 Theoretical confusion rates between font sizes using h3 with known family, weight and slope
Table 4 Performance of font models
Table 5 Classification of characters Arial, Lucida Console, Lucida Sans Unicode, Tahoma. The label of the columns is the output of the classifier.
Chapter 1
INTRODUCTION
1.1 What is OCR
Optical Character Recognition (OCR) is a type of document image analysis in which a scanned digital image containing either machine-printed or handwritten script is input to an OCR software engine and translated into an editable, machine-readable digital text format (such as ASCII text).
OCR works by first pre-processing the digital page image into its smallest component parts with layout analysis to find text blocks, sentence/line blocks, word blocks and character blocks. Other features such as lines, graphics and photographs are recognized and discarded. The character blocks are then further broken down into component parts, pattern-recognized and compared against the OCR engine's large dictionary of characters from various fonts and languages. Once a likely match is made it is recorded, and the characters in the word block are recognized in turn until all likely characters have been found for the word block. The word is then compared against the OCR engine's large dictionary of complete words that exist for the language.
Character recognition and word recognition together are the key to OCR accuracy: by combining them the OCR engine can deliver much higher levels of accuracy. Modern OCR engines extend this accuracy through more sophisticated pre-processing of source digital images and better algorithms for fuzzy matching, sounds-like matching and grammatical measurements to more accurately establish word accuracy.
1.2 History of OCR
The engineering attempts at automated recognition of printed characters started prior to World War II. But it was not until the early 1950s that a commercial venture was identified that justified the necessary funding for research and development of the technology. This impetus was provided by the American Bankers Association and the Financial Services Industry. They challenged all the major equipment manufacturers to come up with a "Common Language" to automatically process checks. After the war, check processing had become the single largest paper processing application in the world. Although the banking industry eventually chose Magnetic Ink Character Recognition (MICR), some vendors had proposed the use of an optical recognition technology. However, OCR was still in its infancy at the time and did not perform as acceptably as MICR. The advantage of MICR was that it is relatively impervious to change, fraudulent alteration and interference from non-MICR inks.
The "eye'' of early OCR equipment utilized lights, mirrors, fixed slits for the reflected light to pass through, and a moving disk with additional slits. The reflected image was broken into discrete bits of black and white data, presented to a photo-multiplier tube, and converted to electronic bits. The "brain's" logic required the presence or absence of "black'' or "white" data bits at prescribed intervals. This allowed it to recognize a very limited, specially designed character set. To accomplish this, the units required sophisticated transports for documents to be processed. The documents were required to run at a consistent speed and the printed data had to occur in a fixed location on each and every form.
1.3 Different uses for OCR
There are many uses for the output from an OCR engine, and these are not limited to a full text representation online that exactly reproduces the original. Because OCR can, in many circumstances, deliver character recognition accuracy that is below what a good copy typist would achieve, it is often assumed that it has little validity as a process for many historical documents. However, as long as the process is fitted to the information requirement, OCR can have a place even when the accuracy is relatively low (see Accuracy below for more details). Potential uses include:
Indexing - the OCR text is output into a pure text file that is then imported into a search engine. The text is used as the basis for full-text searching of the information resource. However, the user never sees the OCR'd text - they are delivered a page image from the scanned document instead. This allows the OCR accuracy to be quite poor whilst still delivering the document to the user and providing searching capability. However, this mode of searching just identifies the document, not necessarily the word or page on which it appears - in other terms it just indexes that those words appear in a specific item. An example of this is the BUFVC British Universities Newsreels Database.
Full text retrieval - in this mode the OCR text is created as above, but further work is done in the delivery system to allow for true full-text retrieval. The search results are displayed with hit highlighting within the page image displayed. This is a valuable addition to the indexing option from the perspective of the user. An example of this is the Forced Migration Online Digital Library.
Full text representation - in this option the OCR'd text is shown to the end user as a representation of the original document. In this case the OCR must be very accurate indeed or the user will lose confidence in the information resource. All sorts of formatting issues in terms of the look and feel of the original are inherent within this option, and it is rarely used without mark-up (see below) of some kind. The key factor is the accuracy, and this leads to most projects having to check and correct OCR text to ensure the accuracy is suitable for publication, with obvious time and cost implications.
Full text representation with XML mark-up - in this option the OCR output is presented to the end user with layout, structure or metadata added via the XML mark-up. In the majority of cases where OCR'd text is to be delivered there will be at least a minimal amount of mark-up done to represent structure or layout. Currently this process normally requires the highest amount of human intervention out of all the options listed here, as OCR correction is very likely along with additional mark-up of the content in some way.
Many examples of digital text resources with XML mark-up may be found through the Text Encoding Initiative website. The projects listed there also demonstrate the variety in levels of mark-up that are possible, making it possible to vary the activity to match a project's intellectual requirements and economic constraints.
1.4 Accuracy
It is more useful to know how accurate the OCR engine will be on pre-1950s printed texts of very varying print and paper quality. In this context, it is highly unlikely that we will get 99.99% accuracy, and we could assume that even the very best quality pre-1950s printed resources will give no more than 98% (and most would be considerably less than that). In these scenarios the accuracy measure given by the software suppliers is not very useful in deciding whether OCR is appropriate to the original printed resource. It would be more useful to regard accuracy as a measurement of the amount of likely activity required to enable the text output to meet the defined requirements. In this context we might look at the number of words that are incorrect rather than the number of characters.
1.5 Today's position of OCR
The advent of the array method of scanning, coupled with the higher speeds and more compact computing power, has led to the concept of "Image Processing". Image processing does not have to utilize optical recognition to be successful. For example, the ability to change any document to an electronically digitized item may effectively replace microfilm devices. This provides the user a much more convenient method of sorting images compared to handling actual documents or microfilm pictures. Image processing relies on larger more complex arrays than early third generation OCR scanners.
When these image scanners are coupled with OCR logic, they provide an extremely powerful tool for users. Image recognition can be done in an "off-line" mode rather than in "real time" - a tremendous advantage over earlier versions of OCR devices. This allows a much more powerful logic system to work over time and requires less rigorous demands on both the location of the information and the font design of the characters to be scanned. An example of this is found in the coupling of "image with convenience amount recognition" planned for the Financial Services Industry for check processing - still the world's largest paper processing application. This will be the first viable marriage of MICR with optical technology.
1.6 Applications of OCR
OCR has been used to enter data automatically into a computer for dissemination and processing. The earliest systems were dedicated to high volume variable data entry. The first major use of OCR was in processing petroleum credit card sales drafts. This application provides recognition of the purchaser from the imprinted credit card account number and the introduction of a transaction. The early devices were coupled with punch units which made small holes to be read by the computer. As computers and OCR devices became more sophisticated, the scanners provided direct access into the CPU (central processing unit). This quickly led to the payment processing of credit card purchases, known as "remittance processing". These two applications are still the two major applications for OCR.
Over time, other applications evolved. They included cash register tape readers, page scanners, etc. Any standard form or document with repetitive variable data would be a candidate application for OCR. Some very imaginative applications have evolved. Perhaps the most innovative are the Kurzweil scanners which read for the blind. With these devices, the optically scanned pages are converted to spoken words.
1.7 Limitations of OCR
OCR has never achieved a read rate that is 100% perfect. Because of this, a system which permits rapid and accurate correction of rejects is a major requirement. Exception item processing is always a problem because it delays the completion of the job entry, particularly the balancing function.
Of even greater concern is the problem of misreading a character (substitutions). In particular, if the system does not accurately balance dollar data, customer dissatisfaction will occur. The success of any OCR device to read accurately without substitutions is not the sole responsibility of the hardware manufacturer. Much depends on the quality of the items to be processed.
Through the years, the desire has been:
• To increase the accuracy of text recognition, that is, to reduce rejects and substitutions
• To reduce the sensitivity of scanning to read less-controlled input
• To eliminate the need for specially designed fonts (characters), and
• To read printed characters.
However, today's systems, while much more forgiving of printing quality and more accurate than earlier
equipment, still work best when specially designed characters are used and attention to printing quality is maintained. However, these limits are not objectionable to most applications, and dedicated users of
OCR systems are growing each year. But the ability to read a special character is not, by itself, sufficient to create a successful system.
1.8 Project Statement
Developing OCR engine software that can recognize multiple fonts in Devanagari-script text documents for Indian languages is a challenging task, considering the computer-science and linguistic aspects involved in the implementation. The problem could be stated as: given a Hindi text encoded or printed in any format, the objective is to process it into the required format and font in the respective language. We wish to develop omni-font recognition for Indian languages and provide it freely, which would help the Hindi-language community.
1.9 CONTRIBUTIONS
The scope of this project covers understanding the fonts used for the Hindi language and their storage formats, identifying the font-type (font encoding name) of font-data, designing a generic framework for building font converters for Hindi-language scripts, and developing OCR engine software for the Hindi language.
This report makes two main contributions:
(a) An approach for identifying the font-type (font encoding name): statistical models are built for the fonts so that the font encoding name of a given text can be classified or identified.
(b) A generic framework for building font converters for the Hindi language: it helps to build a font converter rapidly by providing a glyph-map table for the font. In this method we try to match a font against a database of typefaces.
Chapter 2
MOTIVATION OF THE PROJECT
2.1 What does it take to make a successful OCR System
1. It takes a complementary merging of the input document stream with the processing requirements of the particular application, within a total system concept that provides for convenient entry of exception-type items and an output that provides cost-effective entry to complete the system. To show a successful example, let's review the early credit card OCR applications. Input was a carbon imprinted document. However, if the carbon was wrinkled, the imprinter was misaligned, or any one of a variety of reasons existed, the imprinted characters were impossible to read accurately.
2. To compensate for this problem, the processing system permitted direct key entry of the failed-to-read items at a fairly high speed. Directly keyed items from the misread document were under intelligent computer control, which placed the proper data in the right location for the data record. Important considerations in designing the system encouraged the use of modulus-controlled check digits for the embossed credit card account number. This, coupled with tight monetary controls by batch totals, reduced the chance of recognition errors.
3. The output of these early systems provided a "country club" type of billing. That is, each of the credit card sales slips was returned to the original purchaser. This provided the credit card customer with the opportunity to review his own purchases to insure the final accuracy of billing. This has been a very successful operation through the years. Today's systems improve the process by increasing the amount of data to be recognized, either directly or through reproduction of details on the sales draft. This provides customers with a "descriptive" billing statement which itemizes each transaction. Attention to the details of each application step is a requirement for successful OCR systems.
2.2 Input
When installing an OCR system, the most important consideration is the manner of creating input.
1. How do we intend to create the input? If the input is typewritten data, how many different printing devices will create the input? Will they be electronic, electric, or manual? What type styles or fonts do they have? Will the printed material be from a fabric or carbon ribbon? This gives you an idea of the information you need to obtain.
2. What kind of a document will be used for the application? For most systems, the data to be scanned must occur in the same location from document to document. Guide lines, or the location of data identifiers, need to be pre-printed. Do they need to be in a "non-reading" color (drop-out ink)? Where will they be printed? What size will they be? Will the form meet the requirements specified by the scanner manufacturer? Will the right data be in the right location for check-digit or balancing routines to facilitate performance? Remember, your attention to detail and reviewing the "what if" possibilities before installation will save a tremendous amount of dissatisfaction later.
3. How will input be handled prior to preparation, after printing, and after processing? If accurate registration must be maintained, moisture-proof wrapping of your pre-printed forms may be necessary. If the item is to be mailed to the processing center individually, you may want to prescribe a heavy duty envelope to prevent damage in transit. If the items are to be picked up in large quantities, a special basket or other carrier may be required to ensure documents are not damaged. Although a rubber band is a fine tool to bind a group of documents together, it is a prime cause of damage to paper documents that are to be processed in an automatic feeding device. If you require subsequent archival of the documents for retrieval purposes, proper storage containers are required. You may also need a pre-printed serial number to help research archived material.
These are but a few of the questions that need to be answered. Your OCR system manufacturer is the best source of information on individual system input requirements.
2.3 Importance of Font Design
Input, as we have seen, is very dependent on the application. This is especially true when considering the design of the font (style of characters).
For example, the first OCR device used in a commercial application read carbon imprinted credit card sales drafts. The font used is known as 7B. This font was designed by the Farrington Corp. for this type of imprinting. The characters are large enough to be embossed on a plastic card. The "lakes" (open areas) of 6, 9, 0, and 8 have been opened so the carbon does not fill in those areas. The numbers are distinctly different from each other to reduce the possibility of substitution. This particular font is still the standard for this application.
One of the earliest OCR devices to read input from a data processing printer was the IBM 1418. At the time this device was designed, the printer used most was an IBM accounting machine called the 407. Therefore, the 1418 was designed specifically to read the 407 font. Due to style problems in some characters the font was modified and now retains the designation 407-1. This font is still used in some applications.
IBM then introduced an OCR machine to read a full alphanumeric character set with its 1428 reader, thereby establishing the 1428 font. A modified version of this font, known as 1428E, is also available in
an elongated style for imprinter applications.
These fonts formed the basis for standardized fonts established by ANSI, the American National Standards Institute. This organization is comprised of participants who have agreed to create voluntary
compliance standards.
Two standard fonts established by this organization will improve the overall performance of OCR systems. The OCR A font is stylized and similar to the early 1428 font. Today, it is widely used in remittance processing billing documents where information to be scanned is on a separate line from the information to be visually read by the customer. Every scanning device manufactured today can read this font due to its proven reliability.
The OCR B font is used in applications where data to be scanned must also be read by humans. It is less stylized in appearance than OCR A and is used to a great extent in European countries.
These fonts are available in three sizes. Size 1 is commonly used by high speed data-processing printers. Size 2 has been expanded for use on devices such as cash registers that use a numbering sheet type of printer. Size 3 is even larger for use as an imprinter font. These gradations in size are proportional, allowing the fonts to be electronically reduced to the same size for presentation to the logic of the readers.
Using these fonts appropriately allows users to select readers that are very reliable and cheaper than devices capable of reading intermixed, multiple fonts. For the most part, today's OCR readers recognize several fonts, although they are most efficient and successful when running documents printed with a single font at a time.
2.4 Documents
Once the application has been defined, a decision must be made regarding which paper should be used and what information should be printed prior to distribution.
Paper Considerations
The vendor of the equipment usually has a list of basic requirements that must be met for the item to be successfully processed. The basis weight is the first characteristic to be specified and will play a major role in the ultimate cost of the document. In general, most systems in place today prefer the use of a 24 lb. sheet. (In other words, 500 sheets of paper, 17" x 22", would weigh 24 lbs.) Paper thickness, or caliper, is also a consideration.
Other characteristics that may be specified include stiffness, tear strength, bursting strength, fold resistance, porosity, etc. Any one of these may play a part in the processability of the document. Other characteristics, such as smoothness, may be specified to provide a better printing surface. In addition, some vendors specify the cleanliness of the paper to avoid inadvertent "slime" spots that can interfere with scanning.
Paper Color
For certain applications, it may be desirable to use colored stock to help users readily identify documents for use in different applications. Color coding is a very simple and satisfactory visual system. However, since OCR depends on the contrast between the printed characters and the background color, some color control is necessary.
Color for the most part is controlled by reflectance. Standardized tests are available to measure the relative reflectance of a sheet in comparison to absolute white (defined as 100% reflectance). For most systems 60% reflectance is the minimum that can be used. The readability of any individual character is determined by the print contrast signal (PCS) it generates, given by the formula: PCS = (reflectance of the background - reflectance of the printing) / reflectance of the background.
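To make the formula concrete, here is a minimal sketch in Python; the function name and the sample reflectance values are illustrative only and are not part of the original specification.

```python
def print_contrast_signal(background_reflectance: float, print_reflectance: float) -> float:
    """PCS = (R_background - R_print) / R_background, with reflectances between 0 and 1."""
    if background_reflectance <= 0:
        raise ValueError("background reflectance must be positive")
    return (background_reflectance - print_reflectance) / background_reflectance

# Example: a sheet at the 60% minimum reflectance printed with 10%-reflectance ink.
print(round(print_contrast_signal(0.60, 0.10), 2))  # 0.83
```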
Printing Inks
As with paper color, the color of the inks used for pre-printing the document is also very important. Some data that is pre-printed, such as a serial number, may be OCR-read, but other pre-printed data should be ignored. This creates the need for a specification of non-reading inks.
Read vs. Non-read Inks
Read inks need to contain sufficient print contrast to be easily read by the OCR system. Black is the preferred color, but even black inks with insufficient density or coverage may not be recognized. Other colored inks will work in most systems if the PCS (print contrast signal) is 30% or less. Non-read inks are dependent on the response level of the OCR system used. For example, if it is a cathode ray scanner, it probably responds to the ultra-violet spectrum. A non-read ink for this system would be light blue. If the system used is sensitive to the infra-red region of the spectrum, then a non read ink would be red.
2.5 Nature of document
Having considered all the physical characteristics of the printed page, it is important to remember the content-related issues for OCR. OCR engines work by having as many examples as possible of character shapes from as many different language and national printing traditions as possible. They are supported by extensive dictionaries and vocabularies to enable natural language matching of grouped characters against known words, to statistically improve the accuracy of the OCR output.
Therefore OCR engines may be fooled by alphanumeric content (such as a part number for an engine, e.g. MK2-3) or by personal names and place names that are unusual, as they may not be in the engine's dictionary of acceptable words. Further to this, symbols, the long s, diacritics and mathematical notation may play havoc with an OCR engine's accuracy levels. Text content from pre-1900 may contain words that are no longer in common usage, and a number of languages have gone through significant changes. Thus words may not be found in the OCR engine's dictionary.
In this example, the Asian text would not be recognised by an OCR engine unless it had the specific dictionary engaged and also was told what language it was. Even then this form of text is usually not amenable to OCR.
In this example, the long s character will cause difficulties for OCR engines (see Christians, desire etc.). Plus the use of words with alternate spellings to those used in modern language (comming, poore) will also diminish the OCR engine's capacity to recognise words.
Consideration of the content, and whether it is supported by the OCR engine's word and font dictionaries, is very worthwhile. Word dictionaries can always be added to in an effort to improve accuracy (such as by adding surname lists or street name content).
2.6 Nature of output requirements
What is needed from the OCR output will define and constrain the effectiveness of the OCR process. Output of text to an ASCII text file with just a list of words is very straightforward and can be heavily automated. Output to a PDF file can also be automated. But once the output has to represent the layout and structure of the original page, this creates technical difficulties that are usually only resolved by human intervention, and is therefore more costly in time and effort. XML mark-up covers such a range of possibilities (from just marking up the pagination to marking up the meaning of a single character) that it can sometimes be automated, but it is usually human mediated and thus has various cost factors associated with it.
2.7 DIVISION OF FONTS
Mono-Font: A document which is written with one specific font. In this type of document we can get very high accuracy in the process of OCR.
Multi-Font: A subset of some selected fonts. Its accuracy is related to the number and the similarity of the fonts under consideration. These systems achieve the best results when a single letter has very similar features in each font and it is easy to discriminate among different classes. On the other hand, the recognition is very difficult when different letters have similar features.
Omni-Font: The set of all fonts; for this reason OCR accuracy on omni-font input is typically lower.
Optical Font Recognizer (OFR): - It is used to detect the font type and subsequently convert the multi-font problem into mono-font character recognition. An OFR can be useful to simply characterize single characters, words or paragraphs in a printed document.
2.8 Technical Description of OCR
The system works with an image of machine-printed or handwritten text (numerals, letters and symbols). The input image, scanned either from books or from paper documents printed in Devanagari script, is processed by OCR software, and all the text from that document becomes available inside the computer just as if it had been typed in. The system produces the output in ASCII format, and the output can be viewed or edited with any editor that supports ASCII. Some of the potential applications of the OCR system are newspapers (printed in Hindi), libraries, office automation, the linguistic community, reading aids for blind people, etc.
CHAPTER 3
GENERAL APPROACHES
3.1 NEED FOR HANDLING FONT-DATA
There is chaos as far as text in Indian languages in electronic form is concerned. Neither can one exchange notes in Indian languages as conveniently as in English, nor can one perform search on texts in Indian languages available over the web. This is so because the texts are being stored in ASCII-based (as opposed to Unicode) font-dependent codes. A large collection of text corpora assumes a primary role in building many language technologies such as large vocabulary speech recognition or unit selection based speech synthesis systems for a new language. In the case of Indian languages, the text which is available in digital format (on the web) is difficult to use as it is, because it is available in numerous encoding schemes (fonts). Applications like screen readers developed for Indian languages, and other applications, have to read or process such text. So it is important to have a generic framework for automatic identification of font-type and conversion of font-data to a phonetic transliteration scheme.
To view websites hosting content in a particular font-type, one requires that font to be installed on the local machine. This was the technology that existed before the era of Unicode, and hence a lot of electronic data in Indian languages was produced and made available in that form. The sources for these data are news websites (mainly), universities/institutes and some other organizations. They use proprietary fonts to protect their data. Collection of these text corpora, identifying the font-type and converting the font-data into a phonetically readable transliteration scheme is essential for building many natural language processing systems and applications.
A character of the English language has the same code irrespective of the font being used to display it. However, most Indian language fonts assign different codes to the same character. For example, 'a' has the same numerical code '97' irrespective of the hardware or software platform.
Consider for example the word hello written in the Roman Script and the Devanagari Script.
Fig. 1: Illustration of code mapping for English fonts.
Arial and Times New Roman are used to display the same word. The underlying codes for the individual characters, however, are the same and according to the ASCII standard.
Fig. 2: Illustration of code mapping for Indian (Hindi) fonts.
The same word is displayed in two different Devanagari fonts, Yogesh and Jagran. The underlying codes for the individual characters depend on the glyphs they are broken into. Not only are the decomposition into glyphs and the codes assigned to them different, but the two fonts even have different codes for the same characters. This leads to difficulties in processing or exchanging texts in these formats.
Three major reasons cause this problem: (i) There is no standard which defines the number of glyphs per language, hence it differs between fonts of a specific language itself. (ii) There is no standard which defines the mapping of a glyph to a number (code value) in a language. (iii) There is no standard procedure to align the glyphs while rendering. The common alignment order starts with the left glyph, followed by the pivotal character, followed by any of the top, right or bottom glyphs. Some font-based scripting and rendering schemes may also violate this order and use their own.
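A small Python sketch of why this matters for processing: the same word compares equal byte-for-byte in ASCII regardless of the display font, while two glyph-based Hindi fonts store it as different, incompatible code sequences. The byte values below are invented purely for illustration and do not correspond to the real Yogesh or Jagran code tables.

```python
# ASCII/Unicode: a character has the same code in every font.
word = "hello"
print([ord(c) for c in word])            # [104, 101, 108, 108, 111], whatever the font

# Glyph-based legacy fonts: the same Hindi word is stored as font-specific glyph codes.
# HYPOTHETICAL values, only to show that the decomposition and the codes both differ.
font_a_codes = [0x6B, 0xC7, 0x92]        # e.g. three glyphs in "Font A"
font_b_codes = [0xD1, 0x45, 0x45, 0x7A]  # e.g. four glyphs in "Font B"

print(font_a_codes == font_b_codes)      # False: byte-level search/exchange breaks
```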
3.2 DIGITAL STORAGE FORMAT
Another aspect of diversity in electronic content in Indian languages is the digital storage format. Formats like ASCII (American Standard Code for Information Interchange), ISCII (Indian Script Code for Information Interchange) and Unicode are often used to store digital text data in Indian languages. The text is rendered using fonts of these formats. This section briefly describes each storage format, and also fonts and glyphs.
3.2.1 ASCII Format
ASCII is a character encoding based on the English alphabet. Digital computers and operating systems in the early 90s supported only ASCII-based encodings, and hence many electronic newspapers in Indian languages used glyph-based fonts to store and render scripts of Indian languages. A font encoding stored in ASCII format specifies a correspondence between digital bit patterns and the symbols/glyphs of a written language. ASCII proper is a seven-bit code (0 to 127); the glyph-based fonts use the full eight-bit range from 0 to 255.
3.2.2 Unicode Format
To allow computers to represent any character in any language, the international standard ISO 10646 defines the Universal Character Set (UCS) [4]. UCS contains the characters needed to represent practically all known languages in the world. ISO 10646 originally defined a 32-bit character set: each character is assigned a 32-bit code. However, these codes vary only in the least-significant 16 bits. UTF: A Universal Transformation Format (UTF) [15] is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. Actual implementations in computer systems represent integers in specific code units of a particular size (8-bit, 16-bit, or 32-bit). Encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. There are several Unicode Transformation Formats, such as UTF-8, UTF-16 and UTF-32. Both UTF-8 and UTF-16 are substantially more compact than UTF-32 when averaging over the world's text in computers. With the advent of Unicode and UTF-8, and their support in operating systems, most current electronic documents are being published in Unicode, specifically in UTF-8. Some of the news websites which produce Indian language content in Unicode format are BBC News, Yahoo, MSN and Google.
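As a small, checkable illustration of these encoding forms, the snippet below encodes the Devanagari letter KA (U+0915): a single code point stored as three 8-bit code units in UTF-8 but as one code unit in UTF-16 or UTF-32.

```python
ka = "\u0915"                      # DEVANAGARI LETTER KA
print(ka.encode("utf-8"))          # b'\xe0\xa4\x95'      -> three 8-bit code units
print(ka.encode("utf-16-be"))      # b'\x09\x15'          -> one 16-bit code unit
print(ka.encode("utf-32-be"))      # b'\x00\x00\x09\x15'  -> one 32-bit code unit
```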
3.2.3 ISCII Format
In India, since the 1970s, different committees of the Department of Official Languages and the Department of Electronics (DOE) have been developing different character encoding schemes which would cater to all the Indian scripts. In 1983, the DOE announced the 7-bit ISCII-83 code, which complied with the ISO 8-bit recommendations [16]. ISCII (Indian Script Code for Information Interchange) is a fixed-length 8-bit encoding. The lower 128 (0-127) code points are plain ASCII and the upper (160-255) code points are ISCII-specific, used for all Indian scripts based on the Brahmi script. This makes it possible to use an Indian script along with the Latin script in an 8-bit environment, and facilitates 8-bit bilingual representation with an Indic script selection code. The ISCII code contains the basic alphabet required by the Indian scripts; all composite characters are formed by combining these basic characters. Unicode is based on ISCII-1988 and incorporates minor revisions of ISCII-1991, thus conversion between one and the other is possible without loss of information.
3.3 Problem Statement
Multilingual OCR is developed with certain limitations. One of the limitations is recognizing multiple fonts in Indian languages. So the problem statement is to develop OCR engine software which can recognize multiple fonts in Devanagari-script text documents.
3.4 OBJECTIVE
Major: - OMNI FONT recognition in Hindi language OCR
Minor: - Develop a Hindi OCR.
- Support maximum number of fonts.
- Recognize Print image.
- Recognize text.
- Provide facilities to correct and edit recognized text.
3.5 APPROACHES FOR DEVELOPING AN OCR
1. The application accepts a picture file as input (the input file must be an image of text).
2. Normalize the text to create a uniform block of text:
(i) Locate text lines and characters
(ii) Character normalization
(iii) Space normalization
(iv) Form a uniform texture image (text padding)
3. Feature extraction: font specification and identification
(i) Typeface (Amarujala, Jagran, Chanakya, ...),
(ii) Weight (light, regular, demi, bold, heavy),
(iii) Slope (roman, italic),
(iv) Width (normal, expanded, condensed), and
(v) Size.
4. Classifier: font recognition (a minimal sketch of this pipeline follows below)
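The outline above can be read as a small pipeline. The sketch below (Python with NumPy) wires the steps together with deliberately simple stand-ins: the normalization and feature-extraction bodies are toy versions of the procedures surveyed in Chapter 4, not the actual implementation.

```python
import numpy as np

def normalize_text_block(page: np.ndarray, size: int = 256) -> np.ndarray:
    """Rough stand-in for step 2: binarize, crop to the ink bounding box,
    and resample to a fixed-size uniform block (assumes dark text on light paper)."""
    binary = (page < page.mean()).astype(np.uint8)            # 1 = ink
    ys, xs = np.nonzero(binary)
    block = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    rows = np.linspace(0, block.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, block.shape[1] - 1, size).astype(int)
    return block[np.ix_(rows, cols)]

def extract_font_features(block: np.ndarray) -> np.ndarray:
    """Toy version of step 3: ink density plus the variances of the
    horizontal and vertical projection profiles."""
    return np.array([block.mean(), block.sum(axis=1).var(), block.sum(axis=0).var()])

def classify_font(features: np.ndarray, font_models: dict) -> str:
    """Step 4: nearest stored font model by Euclidean distance."""
    return min(font_models, key=lambda name: np.linalg.norm(features - font_models[name]))

def recognize_font(page: np.ndarray, font_models: dict) -> str:
    return classify_font(extract_font_features(normalize_text_block(page)), font_models)
```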
This project describes an OCR-based technique for text editing of printed documents. The system accepts a text as input and returns a sequence of text images that are ranked, per Hindi font, according to their similarity with the input. During word recognition, candidates are generated for each document word using a font-independent Devanagari OCR.
Font recognition, that is, the detection of the font style of a document, can be useful for:
_ Document characterization / classification;
_ Document layout analysis;
_ Improvement of Multi-font OCR.
CHAPTER 4
LITERATURE ANALYSIS
Paper Title
4.1 Multi-Linguistic Optical Font Recognition Using Stroke Templates
Analysis
This paper presents a method to automatically extract representative stroke templates from a text image which contains characters of the same typeface. The collected stroke templates are classified and saved to a font database. To recognize an unknown font in an input text image, a Bayes decision rule is used to determine which font candidate in the database provides the best match to the unknown font. The experiment demonstrates that this approach can distinguish between Hindi and English fonts without prior information about their script. Another advantage is that it can learn a new font very quickly.
Proposed Approaches
1. Stroke template extraction
A skeletonization process is used to create a skeleton image from the input text image. The technique proposed by Suzuki is adopted here because it is fast and can produce skeletons with strict connectivity.
2. Classification of stroke templates
The system has two operation phases: template collection and font recognition. During the template-collection phase, a text image containing characters of a certain font is input for collecting stroke templates.
3. Font recognition
In the font-recognition phase, the extracted stroke templates are compared with the templates stored in the font database. A Bayes decision rule is used to determine the best match.
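A minimal sketch of such a decision step is given below, assuming the template-collection phase has already produced per-font priors and template likelihoods; the data structures and the smoothing floor are assumptions, not the paper's exact formulation.

```python
import math

def bayes_font_decision(observed_templates, template_likelihoods, priors, floor=1e-6):
    """Return the font maximizing log P(font) + sum_t log P(template t | font).

    template_likelihoods[font][t] and priors[font] are assumed to come from the
    template-collection phase; `floor` stands in for smoothing of unseen templates.
    """
    best_font, best_score = None, -math.inf
    for font, prior in priors.items():
        score = math.log(prior)
        for t in observed_templates:
            score += math.log(template_likelihoods[font].get(t, floor))
        if score > best_score:
            best_font, best_score = font, score
    return best_font
```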
Benefits
The approach belongs to a class of methods that has fewer limitations on its applications, as such methods can recognize images containing only a single character or word, which other classes of methods cannot accept.
Contribution of this paper
With the help of this paper we can understand a method that is convenient for learning a new font: we just input a text image containing sample characters of the new font in the template-collection phase.
Paper Title
4.2 Font Recognition Based on Global Texture Analysis
Analysis
This paper describes a new texture-analysis-based approach to font recognition. Existing methods are typically based on local features that often require connected-component analysis. In the new method, the document is treated as an image containing some special textures, and font recognition as a texture identification task. The well-established 2-D Gabor filtering technique is applied to extract such features, and a weighted Euclidean distance classifier fulfils the recognition task.
Proposed Approaches
1. Discuss pre-processing in detail.
2. Introduce the multi-channel Gabor filtering technique for texture feature extraction.
3. Understand the classifier.
4. Discuss the experiments and results.
Figure 4. The flow chart of the font identification system.
Preprocessing: creating a uniform block of text:
The original input is a binary image. It may contain characters of different sizes and spaces between text lines. For the purpose of texture feature extraction, the input documents need to be normalized to create a uniform block of text.
Feature extraction: multi-channel Gabor filtering:
In fact, any type of texture analysis method, such as multi-channel Gabor filtering or the grey-level co-occurrence matrix, can be employed here. Experiments showed that the former has better performance, so the multi-channel Gabor filtering approach was chosen to extract texture features.
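A sketch of this kind of feature extraction and classification using scikit-image's Gabor filter is given below; the filter frequencies and orientations are common defaults, not the exact channel set used in the paper.

```python
import numpy as np
from skimage.filters import gabor   # scikit-image

def gabor_texture_features(block, frequencies=(0.1, 0.2, 0.3, 0.4),
                           thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Mean and standard deviation of the response magnitude of each Gabor channel."""
    feats = []
    for f in frequencies:
        for theta in thetas:
            real, imag = gabor(block.astype(float), frequency=f, theta=theta)
            magnitude = np.hypot(real, imag)
            feats.extend([magnitude.mean(), magnitude.std()])
    return np.array(feats)

def weighted_euclidean_classify(features, font_means, font_stds):
    """Weighted Euclidean distance classifier: each feature dimension is scaled
    by that font class's standard deviation (assumed non-zero)."""
    def distance(name):
        return np.sqrt((((features - font_means[name]) / font_stds[name]) ** 2).sum())
    return min(font_means, key=distance)
```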
Benefits
1. In theory, any texture classification and analysis technique can be applied in the new method.
2. As a result of preprocessing, it can function well even when the input image contains a small amount of text.
3. The algorithm has been found to perform extremely well.
Contribution of this paper
This method requires relatively little computation, which makes it easy to apply in practical applications.
Paper Title
4.3 FONT RECOGNITION BY INVARIANT MOMENTS OF GLOBAL TEXTURES
Analysis
This paper proposes an alternative for Optical Font Recognition (OFR) based on the analysis of texture characteristics of document images formed of pure text, using the invariant-moments technique. Page segmentation and paragraph structure analysis are out of the scope of this study. There is no need for explicit local analysis in this method, since its central feature is the extraction of global characteristics from the analysis of textures.
Proposed Approaches
Fig. 5. Scheme of font identification system.
1. Locating Text Lines
The location of a text line was determined by calculating the horizontal projection profile (HPP) of the whole text, which was obtained by adding, over each line, the intensities of the pixels that belong to it (a minimal sketch of this computation appears after step 4 below).
2. Text Line Normalization
Since a text can contain several types of fonts and different sizes of them it is necessary to normalize letters and words of different sizes to a standard one.
Once a text line was located, fonts were normalized to have all of the same size.
3. Spacing Normalization
The normalization of vertical spacing was used to reduce the undesired influence of spaces on each line. In other words it was used to eliminate spaces between words.
4. Text Padding
Since the text may not be justified, refilling of blank spaces was performed when a line of text did not extend as far as the rest of the lines. The chosen option consisted of copying parts of the text of the preceding (or another) line, since there is a higher probability that words are of the same type when they are close together.
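As referenced in step 1 above, here is a minimal NumPy sketch of locating text lines from the horizontal projection profile; the ink threshold is an assumption, and the paper's exact segmentation rules may differ.

```python
import numpy as np

def locate_text_lines(binary, min_ink=1):
    """Return (start_row, end_row) pairs of text lines in a binary image (1 = ink),
    found from the horizontal projection profile (ink pixels per row)."""
    hpp = binary.sum(axis=1)
    lines, in_line, start = [], False, 0
    for row, ink in enumerate(hpp):
        if ink >= min_ink and not in_line:
            in_line, start = True, row
        elif ink < min_ink and in_line:
            in_line = False
            lines.append((start, row - 1))
    if in_line:
        lines.append((start, len(hpp) - 1))
    return lines
```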
Invariant Moments
For this method, the original and preprocessed images were considered as two dimensional arrays of a random variable of dimension N×N. The random variables took values from level 0 to 255. Moments were calculated for the random variable X, which was identified with the image block. In addition, X is a matrix of two coordinates (x,y). The definition of invariant moment around the origin is given by:
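The moment formula itself did not survive in this copy of the report. For reference, the standard definitions on which invariant (Hu) moments are built are reproduced below; this is a reconstruction of the usual notation, not necessarily the paper's exact formula.

```latex
m_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{N} x^{p}\, y^{q}\, X(x,y), \qquad
\mu_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{N} (x-\bar{x})^{p}\,(y-\bar{y})^{q}\, X(x,y),
\quad \bar{x} = \frac{m_{10}}{m_{00}}, \ \bar{y} = \frac{m_{01}}{m_{00}},
\qquad \eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1+(p+q)/2}} .
```

The invariant moments are then polynomial combinations of the normalized central moments eta_pq.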
Contribution of this paper
With the help of this paper we can find better font descriptors for identification in the Hindi language.
Paper Title
4.4 Optical Font Recognition Using Typographical Features
Analysis
The aim here is the identification of the font, on the basis of the typeface, weight, slope and size of the text, from an image block without any knowledge of the content of that text. The recognition is based on a multivariate classifier and operates on a given set of known fonts.
This system uses ApOFIS (A priori Optical Font Identification System), in which global typographical features are extracted from the text image.
THE APOFIS APPROACH TO OPTICAL FONT RECOGNITION:
Font specification and identification attributes: in ApOFIS, a font is fully specified by five attributes:
(i) Typeface (Amarujala, Jagran, Chanakya, ...),
(ii) Weight (light, regular, demi, bold, heavy),
(iii) Slope (roman, italic),
(iv) Width (normal, expanded, condensed), and
(v) Size.
Feature Extraction:
The feature extraction process, which assumes skew-free images, performs three steps:
1) Determination of the typographical structure of the text line;
2) Classification of connected components;
3) Calculation of features using the typographical structure of the text and the classified connected components
TABLE 1: EVOLUTION OF RECOGNITION RATES FOR FOUR TEXT LENGTHS
Effects of Text Length
The discussion so far has not considered text length. In the previous experiments, all text entities, used to create the FMB and to assess the classifier, have similar lengths. The study on the influence of text length is of importance because document analysis may require font identification from fragments of various lengths, e.g., words, lines, or even paragraphs.
We found that the developed system analyzes a text line image and identifies the typeface, the font style, and the size from a given set of fonts. A statistical approach was adopted, based on the extraction of a few well-selected global features from a medium-resolution image of scanned text.
Paper Title
4.5 Optical font recognition from projection profiles
Analysis
This paper presents a statistical approach for font attribute recognition based on features extracted from projection profiles of text lines. The presented features allow the discrimination of the font weight, slope and size.
Font recognition approaches
There are two possible approaches for font recognition:
(1) Global feature extraction from text entities (word, line, paragraph). These features are generally detectable by non-experts in typography (text density, size, orientation and spacing of the letters, serifs, etc.).
(2). Local feature extraction from individual letters. The features are based on letter particularities like the shapes of serifs (coved, squared, triangular, etc.)
Font model
In our OFR system a font is modelled by the following attributes:
1. The font family, such as Amarujala, Jagran, Chanakya. Commonly this corresponds to the definition of typeface;
2. The size, expressed in typographic points;
3. The weight of the font, having one of the following values: light, normal or bold;
4. The slope, indicating the orientation of the letters' main strokes. A font could be roman, slanted or italic;
5. The spacing mode, specifying the pitch of the characters. A font may have a fixed pitch (mono-spaced) or a proportional one. The latter class may have condensed, normal or expanded spacing mode.
Table 2. Theoretical confusion rates for weight (normal, bold) and slope (roman, italic) with known family and size (12) and with unknown size (all sizes merged)
Table 3. Theoretical confusion rates between font sizes using h3 with known family, weight and slope
Evaluation results
Table 2 gives an overview of the power of the dn (density of black pixels) and dr (variance of the horizontal projection profile derivative) features to discriminate the weight and slope of a font when its family and size are known, and when its size is unknown. dn was used to discriminate the weight and dr to discriminate the slope. We can see that:
_ these features are pertinent, and have a discrimination power greater than 97%;
_ the slope is easier to detect than the weight;
_ the font size has a very low influence on the discrimination of weight and slope.
The last line of Table 2 shows that dr is still very accurate in discriminating the slope when the font family is unknown, while dn is less accurate in discriminating the weight. This may be explained by the fact that fonts do not have homogeneous typographic grey levels.
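Read literally, the two features quoted above can be computed from a binarized text-line image as in the sketch below; the normalization details and thresholds of the original paper are omitted.

```python
import numpy as np

def dn_dr_features(binary_line):
    """dn: density of black pixels in the text line.
    dr: variance of the derivative of the horizontal projection profile."""
    dn = binary_line.mean()                      # fraction of ink pixels (0/1 image)
    hpp = binary_line.sum(axis=1).astype(float)  # ink pixels per row
    dr = np.diff(hpp).var()
    return dn, dr
```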
Other tests with other font families (Helvetica, Palatino, Bookman, New Century Schoolbook) gave similar results.
Table 3 gives an overview of the power of the h3 feature to discriminate the font size with known family, weight and slope (h3 is presented because it estimates the main part of a text line). The confusion rates were computed first for sizes 10, 11 and 12, and second for sizes 10, 12 and 14. The table shows that size discrimination is easy for non-consecutive sizes and more difficult for successive ones.
Other tests with the h1 and h2 features led to the same conclusions. In fact, h1, h2 and h3 depend on the font family and have very low discrimination power for merged families, for example h3 has the same value for Helvetica-10 and Times-11.
CONCLUSIONS
We have shown in this paper the importance of font identification and the reliability of an a priori identification based on statistical analysis of projection profiles. We have presented features that allow accurate discrimination of font weight and slope, and a classifier for omni-font recognition (discrimination of the weight and slope of any font family and size). Size discrimination was accurate when the font family was known.
Paper Title
4.6 IDENTIFICATION AND CONVERSION OF FONT-DATA IN INDIAN LANGUAGES
Analysis
1. This paper is concerned with text in Indian languages in electronic form. It surveys the problem, provides solutions, and organizes the discussion into three parts:
(i) The first part presents the nature of the Indian language scripts and their different storage formats.
(ii) The second part presents a new TF-IDF weights based approach to identify the font-types.
(iii) The third part explains a generic framework for building font converters for Indian languages using glyph-map tables and a glyph assimilation process.
2. The paper discusses the authors' efforts in addressing the issues related to font-data, such as font-encoding identification and font-data conversion, in Indian languages.
Proposed Approaches
IDENTIFICATION OF FONT-TYPE
- The identification and classification of text or text documents, based on their content, to a specific encoding type.
- TF-IDF (Term Frequency - Inverse Document Frequency) Weights: the TF-IDF weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection.
- Modeling and Identification:
Modeling: Generating a statistical model for each font-type using these TF-IDF weights is known as modeling the data.
Identification: While identifying the language or encoding type (font encoding name), first generate the terms (like unigram, bigram and trigram) of the data under consideration.
Need for Handling Font-Data
A character of English language has the same code irrespective of the font being used to display it. However, most Indian language fonts assign different codes to the same character.
For example, 'a' has the same numerical code '97' irrespective of the hardware or software platform.
Consider for example the word hello written in the Roman Script and the Devanagari Script.
Fig. 6. Illustration of code mapping for English fonts.
Arial and Times New Roman are used to display the same word. The underlying codes for the individual characters, however, are the same and according to the ASCII standard.
Fig. 7. Illustration of code mapping for Hindi fonts.
The same word is displayed in two different Devanagari fonts, Yogesh and Jagran. The underlying codes for the individual characters depend on the glyphs they are broken into. Not only are the decomposition into glyphs and the codes assigned to them different, but the two fonts even have different codes for the same characters. This leads to difficulties in processing or exchanging texts in these formats.
Three major reasons cause this problem: (i) There is no standard which defines the number of glyphs per language, hence it differs between fonts of a specific language itself. (ii) There is no standard which defines the mapping of a glyph to a number (code value) in a language. (iii) There is no standard procedure to align the glyphs while rendering. The common glyph alignment order followed is first the left glyph, then the pivotal character, and then the top, right or bottom glyph. Some font-based scripting and rendering schemes violate this order as well.
CONVERSION OF FONT-DATA
- By font conversion we mean the conversion of glyphs to graphemes (aksharas), so we want to make the distinction between glyph and grapheme clear: a character or grapheme is a unit of text, whereas a glyph is a graphical unit.
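A toy sketch of the first pass of such a converter is shown below; the glyph codes and phonetic units in the table are hypothetical, since real glyph-map tables are built by hand per font, and the later glyph-assimilation pass (merging and reordering units) is only indicated in a comment.

```python
# HYPOTHETICAL glyph-map table: font-specific glyph codes -> phonetic units (IT3-like).
GLYPH_MAP = {
    0xC4: "k",    # consonant glyph
    0xD8: "aa",   # vowel-sign (matra) glyph
    0xB2: "m",
    0xE1: "l",
}

def glyphs_to_graphemes(glyph_codes):
    """First pass of glyph-to-grapheme conversion: substitute each glyph code by its
    phonetic unit.  A glyph-assimilation pass would then merge and reorder units
    (e.g. attach a matra to the preceding consonant)."""
    return [GLYPH_MAP.get(code, "?") for code in glyph_codes]

print(glyphs_to_graphemes([0xC4, 0xD8, 0xB2]))   # ['k', 'aa', 'm']
```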
Advantages/Disadvantages
1. This paper explained the nature and difficulties associated with font-data processing in Indian languages.
2. Discussed the new TF-IDF weights based approach for font identification.
3. A framework is presented to build font converters for glyph-to-grapheme conversion of font-data using a glyph assimilation process.
Modeling and Identification
Data Preparation: For training we need sufficient data of each particular font-type. Here more than 0.12 million unique words per font-type were collected and prepared manually. Nearly 10 different fonts across 4 languages were tried.
Modeling: Generating a statistical model for each font type using these TF-IDF weights is known as modeling the data. For modeling, three different types of terms were considered: (i) unigram (single term), (ii) bigram (current and next terms) and (iii) trigram (previous, current and next terms). In text modeling the term refers to the glyph-based unigram, bigram or trigram.
Identification: While identifying the language or encoding type (font encoding name), first generate the terms (like unigram, bigram and trigram) of the data under consideration. Get the TF-IDF weight of each term from the models and calculate the summation. The maximum of all summations gives the specific encoding type in terms of the model of the data type itself.
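A minimal sketch of these modeling and identification steps follows; the exact TF-IDF variant, term extraction and smoothing used in the paper may differ, and the corpus structure here is assumed.

```python
import math
from collections import Counter

def build_tfidf_models(font_corpora):
    """font_corpora: {font_name: list of glyph n-gram terms seen in that font's data}.
    Returns one dictionary of TF-IDF weights per font (the 'modeling' step)."""
    n_docs = len(font_corpora)
    doc_freq = Counter()
    for terms in font_corpora.values():
        doc_freq.update(set(terms))
    models = {}
    for font, terms in font_corpora.items():
        tf, total = Counter(terms), len(terms)
        models[font] = {t: (c / total) * math.log(n_docs / doc_freq[t])
                        for t, c in tf.items()}
    return models

def identify_font(unknown_terms, models):
    """The 'identification' step: sum each model's weights over the unknown text's
    terms and return the font with the maximum summation."""
    scores = {font: sum(w.get(t, 0.0) for t in unknown_terms)
              for font, w in models.items()}
    return max(scores, key=scores.get)
```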
CONTRIBUTION OF THIS PAPER
This paper helps in understanding the nature and difficulties associated with font-data processing in Indian languages.
Paper Title
4.7 Text Processing for Text-to-Speech Systems in Indian Languages
Analysis
This paper describes how to build a natural-sounding speech synthesis system. It discusses efforts in addressing the issues of font-to-Akshara mapping, pronunciation rules for Aksharas, and text normalization in the context of building text-to-speech systems in Indian languages.
Proposed Approaches
Identification of Font-Type
In this paper, the use of a TF-IDF approach is proposed for identification of font-type. The term frequency - inverse document frequency (TF-IDF) approach is used to weigh each glyph-sequence in the font-data according to how unique it is.
Font-to-Akshara Mapping
Font-data conversion can be defined as converting the font encoded data into Aksharas represented using phonetic transliteration scheme such as IT3.

Table 4. Performance of font models
Advantages/Disadvantages
A TF-IDF based approach is used for font identification and font-to-Akshara conversion. Syllable-level features could be used to build a text normalization system whose performance is significantly better than with word-level features.
Contribution of this paper
This paper helps in understanding the relevance of font identification and font-to-Akshara conversion, and proposes a TF-IDF based approach for font identification.
Paper Title
4.8 Tools for Developing OCRs for Indian Scripts
Analysis
This paper discusses how the problem is compounded by the unstructured manner in which popular fonts are designed. There is a lot of common structure in the different Indian scripts. The authors argue that a number of automatic and semi-automatic tools can ease the development of recognizers for new font styles and new scripts. They briefly discuss three such tools they developed and show how these have helped build new OCRs.
Proposed Approaches
The approaches used in this paper are given below:
Data Collection
Large amounts of data with sufficient variations are required to train an OCR that should work under different conditions. These variations can be in size, font type, scanning resolution, etc.
Segmentation Analysis and Feature Selection
We bui