Non-Dictionary-Based Thai Word Segmentation Using Decision Trees
#1

ABSTRACT
For languages without word boundary delimiters, dictionaries are
needed for segmenting running texts. This figure makes
segmentation accuracy depend significantly on the quality of the
dictionary used for analysis. If the dictionary is not sufficiently
good, it will lead to a great number of unknown or unrecognized
words. These unrecognized words certainly reduce segmentation
accuracy. To solve such problem, we propose a method based on
decision tree models. Without use of a dictionary, specific
information, called syntactic attribute, is applied to identify the
structure of Thai words. C4.5 is used as a tool for this purpose.
Using a Thai corpus, experiment results show that our method
outperforms some well-known dictionary-dependent techniques,
maximum and longest matching methods, in case of no dictionary.
Keywords
Decision trees, Word segmentation without a dictionary
1. INTRODUCTION
Word segmentation is a crucial topic in analysis of languages
without word boundary markers. Many researchers have been
trying to develop and implement in order to gain higher accuracy.
Unlike in English, word segmentation in Thai, as well as in many
other Asian languages, is more complex because the language
does not have any explicit word boundary delimiters, such as a
space, to separate between each word. It is even more complicated
to precisely segment and identify the word boundary in Thai
language because there are several levels and several roles in Thai
characters that may lead to ambiguity in segmenting the words. In
the past, most researchers had implemented Thai word
segmentation systems based on using a dictionary ([2], [3], [4],
[6], [7]). When using a dictionary, word segmentation has to cope
with an unknown word problem. Up to present, it is clear that
most researches on Thai word segmentation with a dictionary
suffer from this problem and then introduce some particular
process to handle such problem. In our preliminary experiment,
we extracted words from a pre-segmented corpus to form a
dictionary, randomly deleted some words from the dictionary and
used the modified dictionary in segmentation process based two
well-known techniques; Maximum and Longest Matching
methods. The result is shown in Figure 1. The percentages of
accuracy with different percentages of unknown words are
explored. We found out that in case of no unknown words, the
accuracy is around 97% in both maximum matching and longest
matching but the accuracy drops to 54% and 48% respectively, in
case that 50% of words are unknown words. As the percentage of
unknown words rises, the percentage of accuracy drops
continuously. This result reflects seriousness of unknown word
problem in word segmentation. 1
Unknown Accuracy (%)
Figure 1. The accuracy of two dictionary-based systems vs.
percentage of unknown words
In this paper, to take care of both known and unknown words, we
propose the implementation of a non-dictionary-based system
with the knowledge based on the decision tree model ([5]). This
model attempts to identify word boundaries of a Thai text. To do
this, the specific information about the structure of Thai words is
needed. We called such information in our method as syntactic
attributes of Thai words. As the learning stage, a training corpus is
utilized to construct a decision tree based on C4.5 algorithm. In
the segmentation process, a Thai text is segmented according to
the rules produced by the obtained decision tree. The rest shows
the proposed method, experimental results, discussion and
conclusion.
2. PREVIOUS APPROACHES
2.1 Longest Matching

Most of Thai early works in Thai word segmentation are based on
longest matching method ([4]). The method scans an input
sentence from left to right, and select the longest match with a
dictionary entry at each point. In case that the selected match
cannot lead the algorithm to find the rest of the words in the
sentence, the algorithm will backtrack to find the next longest one
and continue finding the rest and so on. It is obvious that this
algorithm will fail to find the correct the segmentation in many
cases because of its greedy characteristic. For example:ไปหามเหสี
(go to see the queen) will be incorrectly segmented as: ไป(go) หาม
(carry) เห(deviate) สี (color), while the correct one that cannot be
found by the algorithm is: ไป(go) หา(see) มเหสี (Queen).
2.2 Maximum Matching
The maximum matching algorithm was proposed to solve the
problem of the longest matching algorithm describes above ([7]).
This algorithm first generates all possible segmentations for a
sentence and then select the one that contain the fewest words,
which can be done efficiently by using dynamic programming
technique. Because the algorithm actually finds real maximum
matching instead of using local greedy heuristics to guess, it
always outperforms the longest matching method. Nevertheless,
when the alternatives have the same number of words, the
algorithm cannot determine the best candidate and some other
heuristics have to be applied. The heuristic often used is again the
greedy one: to prefer the longest matching at each point. For the
example, ตาก(expose) ลม(wind) is preferred to ตา(eye) กลม(round).
2.3 Feature-based Approach
A number of feature-based methods have been developed in
([3]) for solving ambiguity in word segmentation. In this
approach, the system generates multiple possible segmentation for
a string, which has segmentation ambiguity. The problem is that
how to select the best segmentation from the set of candidates. At
this point, this research applies and compares two learning
techniques, called RIPPER and Winnow. RIPPER algorithm is a
propositional learning algorithm that constructs a set of rules
while Winnow algorithm is a weighted-majority learning
algorithm that learns a network, where each node in the network is
called a specialist. Each specialist looks at a particular value of an
attribute of the target concept, and will vote for a value of the
target concept based on its specialty; i.e., based on a value of the
attribute it examines. The global algorithm combines the votes
from all specialists and makes decision. This approach is a
dictionary-based approach. It can acquire up to 91-99% of the
number of correct segmented sentences to the total number of
sentences.
2.4 Thai Character Chuster
In Thai language, some contiguous characters tend to be an
inseparable unit, called Thai character cluster (TCC). Unlike word
segmentation that is a very difficult task, segmenting a text into
TCCs is easily realized by applying a set of rules. The method to
segment a text into TCCs was proposed in ([8]). This method
needs no dictionary and can always correctly segment a text at
every word boundaries.
3. WORD SEGMENTATION WITH DECISION TREE MODELS
In this paper, we propose a word segmentation method that (1)
uses a set of rules to combine contiguous characters to an
inseparable unit (syllable-like unit) and (2) then applies a learned
decision tree to combine these contiguous units to words. This
section briefly shows the concept of TCC and the proposed
method based on decision trees.

DOWNLOAD FULL REPORT
http://aclwebanthology/H01-1057.pdf
Reply

Important Note..!

If you are not satisfied with above reply ,..Please

ASK HERE

So that we will collect data for you and will made reply to the request....OR try below "QUICK REPLY" box to add a reply to this page
Popular Searches: dictionary search word by, medical dictionary and thesaurus, a concise dictionary, french monolingual dictionary online, java intelligent dictionary based encoding, dictionary citation in, dictionary file in word,

[-]
Quick Reply
Message
Type your reply to this message here.

Image Verification
Please enter the text contained within the image into the text box below it. This process is used to prevent automated spam bots.
Image Verification
(case insensitive)

Possibly Related Threads...
Thread Author Replies Views Last Post
  WEB SERVICE SELECTION BASED ON RANKING OF QOS USING ASSOCIATIVE CLASSIFICATION 1 918 15-02-2017, 04:13 PM
Last Post: jaseela123d
  Online Dictionary nit_cal 2 2,311 06-04-2016, 12:16 PM
Last Post: dhanabhagya
  Privacy Preserving Decision Tree Learning Using Unrealized Data Sets Projects9 1 2,352 30-10-2013, 01:18 PM
Last Post: Guest
  Efficient Graph-Based Image Segmentation seminar class 2 3,354 02-02-2013, 01:58 PM
Last Post: seminar details
  A Search Engine Using Case Based Reasoning nit_cal 1 1,620 21-12-2012, 11:01 AM
Last Post: seminar details
  IMAGE SEGMENTATION full report seminar class 5 5,510 30-11-2012, 01:03 PM
Last Post: seminar details
  Medical image segmentation using clustering algorithm computer science technology 2 5,982 08-11-2012, 01:00 PM
Last Post: seminar details
  CBIR - Content Based Image Retrieval Using Shape & Color Characteristics seminar class 1 2,796 19-10-2012, 01:08 PM
Last Post: seminar details
  WATERMARKING RELATIONAL DATABASES USING OPTIMIZATION-BASED TECHNIQUES electronics seminars 9 8,765 14-02-2012, 03:48 PM
Last Post: seminar paper
  Controlling Data Dictionary Attacks using Graphical password and Data Hiding in Video seminar surveyer 1 2,360 10-02-2012, 10:53 AM
Last Post: seminar addict

Forum Jump: