03-03-2017, 04:38 PM
Results from multiple databases make up the deep (or hidden) Web, which is estimated to contain far more high-quality information than the static Web, to be mostly structured, and to grow at a faster rate. For a system that helps users integrate and compare query results returned from multiple Web databases, an important task is to match records from different sources that refer to the same real-world entity. Most advanced record-matching methods are supervised, requiring the user to provide training data. In Web databases, however, records are not available beforehand: they are query-dependent and can only be retrieved after the user submits a query. After duplicates within each source are eliminated, the remaining records from the same source can be assumed to be non-duplicates and used as training examples. The method uses two classifiers, a Weighted Component Similarity Sum (WCSS) classifier and a Support Vector Machine (SVM) classifier, which work together with a Gaussian Mixture Model (GMM) to identify duplicates iteratively. The classifiers cooperate to identify duplicate records. The complete GMM is parameterized by the mean vectors, covariance matrices, and mixture weights of all its components.
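To make the WCSS idea concrete, here is a minimal sketch, not the paper's actual implementation: records are assumed to be dicts of string fields, each field similarity is a generic string-similarity score (difflib's ratio is used here as a stand-in for whatever metric the system employs), and the field weights and duplicate threshold are illustrative values.

```python
# Hypothetical sketch of a Weighted Component Similarity Sum (WCSS)
# classifier over query-result records from two web databases.
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1] (difflib stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def wcss_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities; weights should sum to 1."""
    return sum(w * field_similarity(rec1[f], rec2[f])
               for f, w in weights.items())

def is_duplicate(rec1: dict, rec2: dict, weights: dict,
                 threshold: float = 0.85) -> bool:
    """Classify a record pair as duplicate if its WCSS exceeds a threshold."""
    return wcss_score(rec1, rec2, weights) >= threshold

# Example: two book records returned by different web databases.
r1 = {"title": "Introduction to Algorithms", "author": "T. Cormen"}
r2 = {"title": "Introduction to Algorithms", "author": "Thomas Cormen"}
weights = {"title": 0.7, "author": 0.3}  # illustrative field weights
print(round(wcss_score(r1, r2, weights), 2), is_duplicate(r1, r2, weights))
```

In the iterative scheme the post describes, pairs that WCSS labels confidently would then serve as training examples for the SVM, which in turn relabels the remaining pairs.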
Most previous work relies on matching rules that are either hand-coded by domain experts or learned offline from a set of training examples. Such approaches work well in a traditional database environment, where all instances of the target databases are readily accessible, so that a set of high-quality representative records can be examined by experts or selected for the user to label. For Web databases, however, manual coding and offline learning are inappropriate for two reasons. First, the complete data set is not available beforehand, so good representative training data are difficult to obtain. Second, and more importantly, even when good representative data are found and labeled, rules learned from representatives of the full data set may not work well on a partial and biased portion of that set. Moreover, most previous record-matching work targets only a single record type, and for many domains the dependencies among multiple record types are simply not available.