ASK HERE

seminar class · 15-03-2011, 04:20 PM

[attachment=10227]
Abstract -
In recent years, mining of sequential patterns has been studied extensively in various domains. Most of the existing algorithms find patterns in transactional databases. This paper proposes a novel algorithm to mine closed sequential patterns using an inverted matrix and prefix based sequence element matrix. Inverted matrix minimizes the search space for discovering various sequential patterns of different items. We use a prefix based sequence element matrix to minimize the scans required at levels k and k+1 in the mining process. Our experimental results show the performance improvement of the new algorithm over the previous work.
Keyword : - data mining, sequential patterns mining, closed sequential patterns
I. INTRODUCTION
The goal of data mining or knowledge discovery is to utilize those existing data to find out new facts and to uncover new relationships that were previously unknown, in an efficient manner with minimum utilization of the space and time. Frequent Itemset Mining plays an essential role in many data mining tasks and applications, such as mining association rules, correlations, sequential patterns, classification and clustering. Frequent item set construction has been a major research area over the years and several algorithms have been proposed in the literature to address the problem of mining association rules. FIM algorithm could be broadly classified into two categories. The first category, candidate-generation methods such as Apriori [1] and another category of methods, pattern-growth methods such as FP-growth [2] and Tree Projection [3].
Frequent pattern mining in temporal domain, an emerging research trend is the focus of this study. Existing frequent pattern mining algorithms in temporal domain are all based on Apriori’s candidate generation logic and hence suffer from repeated input scans setback. Prefix Span, a variant of FP-growth proposed for efficient sequential pattern mining requires input to be in the form of transactions or records and does not maintain temporal continuity across
transactions [4]. An algorithm on fast mining of closed sequential patterns mines patterns on transactional records [5].
An efficient algorithm for frequent temporal pattern (EFTP) discusses sequential pattern mining on single lengthy sequence, but it suffers from repeated scans of the original input and traversal of the tree [6].
A sequence s is called closed if there exists no super sequence of s with the same support in the database. Several studies were recently proposed to mine closed sequential patterns. Mining sequential patterns with closed patterns may largely reduce the number of patterns generated in the process and without losing any information because it can be used to derive the complete set of sequential patterns. Mining closed patterns may significantly reduce the number of patterns generated and is information lossless because it can be used to derive the complete set of sequential patterns.
In this paper, we focus on the problem of mining closed sequential patterns based on prefixes using inverted and prefix based sequence element matrices. As a result, this study proposes an efficient algorithm to find various sequential patterns by minimizing the search space and reduces the number of scans of the original input.
The rest of the paper is organized is as follows: Section 2 discusses the problem statement with an illustration. Section 3 illustrates the organization of the inverted matrix. Section 4 describes the prefix based fast mining of closed sequential patterns (PBFMCSP) with algorithms. Section 5 exhibits the experimental results. Finally, Section 6 presents the conclusion.
II. PROBLEM DEFINITION
This section discusses the basic concepts in sequential pattern mining and the problem of mining frequent closed sequential patterns in a sequence database.
Let CD be an input categorical sequence dataset and I= { i1, i2,…., ik }be a set of items in the dataset CD. A sequence S = <s1, s2,…., sk>, (si I) for 1≤i≤k, is an ordered list. We assume that there exists a linear order in S. A sequence S is closed iff no supersequences of S with the same support exist in the database.
Given a minimum support threshold min_sup and an input categorical dataset CD, the task of prefix based mining of sequential patterns is to mine the complete set of frequent patterns based on prefix in the dataset CD.
Example 1: Consider the input sequence ABCDBDACDBACBDCDABCD in which the patterns to be mined and assume that the support count is 2. The frequent patterns of item A of this sequence would be for frequent 1-itemset = {A,B,C,D}, frequent 2-itemset = {AB, AC, AD}, frequent 3-itemset = {ABC, ABD, ACB,ACD,ADB} and frequent 4-itemset = {ABCD, ACBD, ACDB}.
III. INVERTED MATRIX
This paper uses an inverted matrix approach [7] to associate each item with its sequence segments those are subsequences of the sequence in which it occurs as a prefix and to associate all items in each segment using pointers. Similar to the vertical approach in the transactional database representation, the item is the key of each record in this layout. The difference between this approach and the vertical approach is that each attribute on the inverted matrix is not the transaction ID, but a pointer points to the location of the next item on the same sequence segment. The construction of the inverted matrix is assumed to be pre-processing of the mining process. The inverted matrix that is made of two parts: the index and the sequence array. The index contains the items and the sequence array is a set of rows in which each row is associated with one item in the index part. Each row is made of pairs representing pointers, where each pair holds two information: the physical address in the index part of the next item in the same sequence segment, and the physical address in the row of the next item in the same sequence segment. The entries of the sequence segments with items are done as follows: the given sequence is read, the location of the first item in the sequence is sought and an entry to its sequence array is added that holds the location of the next item in this sequence. For the second time the same process occurs, in which an entry in the inverted matrix of the second item is added to hold the location of the third item in the sequence. Table 1 shows the inverted matrix of the sample sequence ABCDBDACDBACBDCDABCD.
IV. PREFIX BASED FAST MINING OF CLOSED SEQUENTIAL PATTERNS (PBFMCSP)
The proposed algorithm PBFMCSP consists of three algorithms namely the creation of inverted matrix, mining of sequential patterns and mining of closed sequential patterns. These three algorithms are developed to mine the frequent and closed sequential patterns in a sequence database. The steps for creation of inverted matrix are given in Algorithm 1. The inverted matrix is used to store the subsequences or sequence segments having various items as prefix in the sequence [6]. The mining phase steps are explained in two sub phases namely Phase I and Phase II as given in Algorithm 2. The process of mining closed sequential patterns is described in Algorithm 3.
A. Creation of Inverted Matrix
Algorithm 1
Step 1: Scan the input sequence and find the frequent 1- itemset for the given threshold. Assign an index to each item in the frequent 1- itemset and create an inverted matrix with index of the items as rows.
Step 2: Scan the input sequence and seek for the current item index in the inverted matrix and an entry to its sequential array is added that holds the index of the next item in this sequence.
Step3: Repeat Step 2 until end of sequence is encountered.
B. Mining of Sequential Patterns
Algorithm 2
Phase I
For each item in the frequent 1-itemset
Step 1: Find all the 2-itemset having the item as a prefix.
Step 2: Scan the sequence segments of the concerned item in the inverted matrix using the index of the prefix item in the itemset and find the frequencies of 2- itemset. Do the pruning process to remove infrequent item sets like Apriori.
Phase II
For each item in k-itemset
Step 1: Create a prefix based sequence element matrix having the itemset as prefix with remaining items as rows and columns. The diagonal elements the frequency of k+2-itemset.
Step 2: Scan the sequence segments having the sequence element matrix prefix as sequence prefix in the inverted matrix, find the frequencies of k+1- itemset and k+2-itemset and update entries in the matrix.

Possibly Related Threads...
Thread		Author	Replies	Views	Last Post
	An Efficient Algorithm for Mining Frequent Patterns full report	project topics	3	4,805	01-10-2016, 10:02 AM Last Post: Guest
	Privacy Preservation in Data Mining	sajidpk123	3	3,004	13-11-2014, 10:48 PM Last Post: jaseela123d
	projects on data mining?	shakir_ali	2	2,060	05-11-2014, 09:30 PM Last Post: jaseela123d
	data mining full report	project report tiger	25	171,341	07-10-2014, 09:10 PM Last Post: ToPWA
	Fast Data Collection in Tree-Based Wireless Sensor Networks	Projects9	9	4,017	12-03-2014, 06:30 PM Last Post: computer topic
	A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Site	project topics	1	2,354	13-12-2012, 12:22 PM Last Post: Guest
	DATA MINING AND WAREHOUSE	smart paper boy	1	1,652	10-11-2012, 12:44 PM Last Post: seminar details
	Java based Data Mining Project Ideas	electronics seminars	1	4,058	10-03-2012, 03:21 PM Last Post: seminar paper
	Two Techniques For Fast Computation of Constrained Shortest Paths	seminar topics	4	3,733	06-03-2012, 12:49 PM Last Post: seminar paper
	HARDWARE ENHANCED ASSOCIATION RULE MINING WITH HASHING AND PIPELINING - KNOWLEDGE AND	electronics seminars	3	3,917	20-02-2012, 04:33 PM Last Post: seminar paper

Important Note..!

ASK HERE