seminar class · 03-05-2011, 03:22 PM

ABSTRACT
For languages without word boundary delimiters, dictionaries are
needed for segmenting running text. This makes
segmentation accuracy depend significantly on the quality of the
dictionary used for analysis. If the dictionary is not sufficiently
good, a great number of words will be unknown or unrecognized.
These unrecognized words reduce segmentation
accuracy. To solve this problem, we propose a method based on
decision tree models. Without the use of a dictionary, specific
information, called syntactic attributes, is applied to identify the
structure of Thai words. C4.5 is used as a tool for this purpose.
Using a Thai corpus, experimental results show that our method
outperforms two well-known dictionary-dependent techniques, the
maximum and longest matching methods, when no dictionary is available.
Keywords
Decision trees, Word segmentation without a dictionary
1. INTRODUCTION
Word segmentation is a crucial topic in analysis of languages
without word boundary markers. Many researchers have been
developing and implementing methods to achieve higher accuracy.
Unlike in English, word segmentation in Thai, as well as in many
other Asian languages, is more complex because the language
does not have any explicit word boundary delimiters, such as a
space, to separate words. It is even more complicated
to precisely identify word boundaries in Thai
because Thai characters occupy several levels and play several roles,
which may lead to ambiguity in segmenting words. In
the past, most researchers implemented Thai word
segmentation systems based on a dictionary ([2], [3], [4],
[6], [7]). When using a dictionary, word segmentation has to cope
with the unknown word problem. To date, it is clear that
most research on Thai word segmentation with a dictionary
suffers from this problem and introduces some special
process to handle it. In our preliminary experiment,
we extracted words from a pre-segmented corpus to form a
dictionary, randomly deleted some words from the dictionary, and
used the modified dictionary for segmentation with two
well-known techniques: the maximum matching and longest matching
methods. The result is shown in Figure 1, which plots
accuracy against the percentage of unknown words.
We found that with no unknown words, the
accuracy is around 97% for both maximum matching and longest
matching, but it drops to 54% and 48% respectively
when 50% of words are unknown. As the percentage of
unknown words rises, accuracy drops
continuously. This result reflects the seriousness of the unknown word
problem in word segmentation.
Figure 1. The accuracy of two dictionary-based systems vs.
percentage of unknown words
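The dictionary-degradation step of this preliminary experiment can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy corpus, and the choice to delete distinct dictionary entries (rather than running-text tokens) are ours, since the text does not say how the percentage of unknown words was counted.

```python
import random

def build_dictionary(segmented_sentences):
    """Collect every distinct word of a pre-segmented corpus."""
    return {word for sentence in segmented_sentences for word in sentence}

def remove_words(dictionary, fraction, seed=0):
    """Return a copy of the dictionary with a random fraction of entries deleted.

    Assumption: 'fraction' is applied to distinct entries, not to tokens.
    """
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    k = round(len(dictionary) * fraction)
    removed = set(rng.sample(sorted(dictionary), k))
    return dictionary - removed

# Toy pre-segmented corpus (hypothetical data, for illustration only)
corpus = [["ไป", "หา", "มเหสี"], ["ตา", "กลม"]]
full = build_dictionary(corpus)
reduced = remove_words(full, 0.4)
```

The reduced dictionary would then be passed to a dictionary-based segmenter and the output compared with the original segmentation of the corpus.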
In this paper, to handle both known and unknown words, we
propose a non-dictionary-based system
whose knowledge is based on the decision tree model ([5]). This
model attempts to identify the word boundaries of a Thai text. To do
this, specific information about the structure of Thai words is
needed. In our method, we call this information the syntactic
attributes of Thai words. In the learning stage, a training corpus is
used to construct a decision tree with the C4.5 algorithm. In
the segmentation process, a Thai text is segmented according to
the rules produced by the resulting decision tree. The rest of the
paper presents the proposed method, experimental results, discussion and
conclusion.
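For readers unfamiliar with C4.5: it grows a tree by choosing, at each node, the attribute with the highest gain ratio (information gain divided by split information). The sketch below computes that quantity for one attribute. The attribute values and the "join"/"split" class labels are hypothetical, because the syntactic attributes themselves are not listed in this part of the text.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in 'items'."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of one attribute.

    values[i] is the attribute value of example i, labels[i] its class.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class after splitting on the attribute
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)       # penalizes many-valued attributes
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical attribute that separates the two classes perfectly
ratio = gain_ratio(["a", "a", "b", "b"], ["join", "join", "split", "split"])
```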
2. PREVIOUS APPROACHES
2.1 Longest Matching
Most early work on Thai word segmentation is based on the
longest matching method ([4]). The method scans an input
sentence from left to right and selects the longest match with a
dictionary entry at each point. If the selected match
does not allow the algorithm to find the rest of the words in the
sentence, the algorithm backtracks to the next longest match
and continues from there, and so on. This
algorithm fails to find the correct segmentation in many
cases because of its greedy nature. For example, ไปหามเหสี
(go to see the queen) will be incorrectly segmented as ไป(go) หาม
(carry) เห(deviate) สี (color), while the correct segmentation, which
the algorithm cannot find, is ไป(go) หา(see) มเหสี (Queen).
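The behaviour described above can be reproduced with a short sketch. It is a minimal illustration, not the implementation of [4]: the function name and the six-word toy dictionary are assumptions made for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation with backtracking.

    Returns a list of words, or None if the text cannot be covered
    by dictionary entries.
    """
    if not text:
        return []
    # Try candidate first words from longest to shortest
    for end in range(len(text), 0, -1):
        word = text[:end]
        if word in dictionary:
            rest = longest_matching(text[end:], dictionary)
            if rest is not None:       # otherwise backtrack to a shorter match
                return [word] + rest
    return None

# Toy dictionary containing only the words of the example
toy_dictionary = {"ไป", "หา", "หาม", "เห", "สี", "มเหสี"}
result = longest_matching("ไปหามเหสี", toy_dictionary)
```

With this dictionary the result is ไป, หาม, เห, สี: the greedy choice of หาม still leads to a complete segmentation, so the algorithm never backtracks to the correct หา + มเหสี.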
2.2 Maximum Matching
The maximum matching algorithm was proposed to solve the
problem of the longest matching algorithm described above ([7]).
This algorithm first generates all possible segmentations for a
sentence and then selects the one that contains the fewest words,
which can be done efficiently with dynamic programming.
Because the algorithm actually finds the real maximum
matching instead of using local greedy heuristics to guess, it
always outperforms the longest matching method. Nevertheless,
when the alternatives have the same number of words, the
algorithm cannot determine the best candidate and some other
heuristic has to be applied. The heuristic often used is again the
greedy one: prefer the longest match at each point. For
example, ตาก(expose) ลม(wind) is preferred to ตา(eye) กลม(round).
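A minimal dynamic-programming sketch of this idea follows. It is an illustration under our own assumptions (function name, toy dictionary, and the exact form of the tie-break), not the implementation of [7].

```python
def maximum_matching(text, dictionary):
    """Segmentation with the fewest words; ties prefer the longest first word.

    Returns a list of words, or None if no segmentation exists.
    """
    n = len(text)
    inf = float("inf")
    # best[i] = (fewest words needed for text[i:], end index of the first word)
    best = [(inf, None)] * (n + 1)
    best[n] = (0, None)
    for i in range(n - 1, -1, -1):
        # Longer candidates are tried first, and only a strictly better
        # count replaces the stored one, so ties keep the longest word.
        for j in range(n, i, -1):
            if text[i:j] in dictionary and best[j][0] + 1 < best[i][0]:
                best[i] = (best[j][0] + 1, j)
    if best[0][0] == inf:
        return None
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words

toy_dictionary = {"ไป", "หา", "หาม", "เห", "สี", "มเหสี", "ตา", "กลม", "ตาก", "ลม"}
queen = maximum_matching("ไปหามเหสี", toy_dictionary)
wind = maximum_matching("ตากลม", toy_dictionary)
```

On the first input the three-word segmentation ไป, หา, มเหสี beats the four-word one. On the second, both candidates have two words, so the tie-break selects ตาก, ลม.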
2.3 Feature-based Approach
A number of feature-based methods were developed in
([3]) to resolve ambiguity in word segmentation. In this
approach, the system generates multiple possible segmentations for
a string that has segmentation ambiguity. The problem is
how to select the best segmentation from the set of candidates. For
this, the work in ([3]) applies and compares two learning
techniques, RIPPER and Winnow. RIPPER is a
propositional learning algorithm that constructs a set of rules,
while Winnow is a weighted-majority learning
algorithm that learns a network in which each node is
called a specialist. Each specialist looks at a particular value of an
attribute of the target concept and votes for a value of the
target concept based on its specialty, i.e., on the value of the
attribute it examines. The global algorithm combines the votes
from all specialists and makes the decision. This is a
dictionary-based approach. Its reported accuracy, measured as the
ratio of correctly segmented sentences to the total number of
sentences, is 91-99%.
2.4 Thai Character Cluster
In Thai, some contiguous characters tend to form an
inseparable unit, called a Thai character cluster (TCC). Unlike word
segmentation, which is a very difficult task, segmenting a text into
TCCs is easily done by applying a set of rules. The method for
segmenting a text into TCCs was proposed in ([8]). This method
needs no dictionary and always places a TCC boundary at
every word boundary.
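The rule set of [8] is not reproduced in this text, so the sketch below uses a deliberately simplified stand-in of our own: a leading vowel binds to the character that follows it, and a following vowel or a combining mark (Unicode category Mn, which covers the Thai upper and lower vowels and tone marks) binds to the character before it. The real TCC rules are richer; the function name and both vowel sets are assumptions.

```python
import unicodedata

LEADING_VOWELS = set("เแโใไ")     # written before the consonant they belong to
FOLLOWING_VOWELS = set("ะาำ")     # written after the consonant they belong to

def simple_clusters(text):
    """Split text into inseparable units using three simplified rules."""
    clusters = []
    for ch in text:
        attach = bool(clusters) and (
            unicodedata.category(ch) == "Mn"      # upper/lower vowel, tone mark
            or ch in FOLLOWING_VOWELS
            or clusters[-1][-1] in LEADING_VOWELS
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

units = simple_clusters("ไปหามเหสี")
```

For this input the units are ไป, หา, ม, เห, สี. Every boundary of the correct word segmentation ไป, หา, มเหสี is also a unit boundary, which is the property the section describes.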
3. WORD SEGMENTATION WITH DECISION TREE MODELS
In this paper, we propose a word segmentation method that (1)
uses a set of rules to combine contiguous characters into an
inseparable unit (syllable-like unit) and (2) then applies a learned
decision tree to combine these contiguous units into words. This
section briefly presents the concept of TCC and the proposed
method based on decision trees.
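Step (2) can be pictured as a pass over adjacent units in which a classifier decides, pair by pair, whether the two units belong to the same word. In the sketch below the classifier is a placeholder: `should_join` is a hypothetical callback standing in for the learned decision tree, and the lookup set used in the example is toy data, not output of C4.5.

```python
def combine_units(units, should_join):
    """Merge adjacent units into words.

    should_join(left, right) returns True when the two adjacent units
    belong to the same word; it stands in for the learned decision tree.
    """
    words = []
    for i, unit in enumerate(units):
        if i > 0 and should_join(units[i - 1], unit):
            words[-1] += unit          # extend the current word
        else:
            words.append(unit)         # start a new word
    return words

# Toy stand-in for the classifier: a fixed set of pairs to join
toy_joins = {("ม", "เห"), ("เห", "สี")}
words = combine_units(["ไป", "หา", "ม", "เห", "สี"],
                      lambda left, right: (left, right) in toy_joins)
```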

DOWNLOAD FULL REPORT
http://aclwebanthology/H01-1057.pdf
