ASK HERE

electronics seminars · 09-01-2010, 03:13 PM

[attachment=1125]
A Novel Algorithm for Hiding Sensitive Frequent Itemsets
Chih-Chia Weng, Shan-Tai Chen*
, Yuan-Chung Chang
Dept. of Information Science, Chung Cheng Institute of Technology, National Defense University

*
Dept. of Information Science, Chung Cheng Institute of Technology, National Defense University
No. 190, Sanyuan 1st
St., Tashi, Taoyuan, Taiwan, R.O.C.
Tel : +886-3-3809331 ext 209 (O)
Fax : +886-3-3806737
E-mail : stchen[at]ccit.edu.tw

Abstracts - With rapid advance of the network and data mining
techniques, the protection of the confidentiality of sensitive
information in a database becomes a critical issue to be resolved.
Association analysis is a powerful and popular tool for
discovering relationships hidden in large data sets. The
relationships can be represented in a form of frequent itemsets
or association rules. One rule is categorized as sensitive if its
disclosure risk is above some given threshold. Privacy-
preserving data mining is an important issue which can be
applied to various domains, such as Web commerce, crime
reconnoitering, health care, and customer's consumption
analysis.
The main approach to hide sensitive frequent itemsets is to
reduce the support of each given sensitive itemsets. This is done
by modifying transactions or items in the database. However,
the modifications will generate side effects, i.e., nonsensitive
frequent itemsets falsely hidden (the loss itemsets) and spurious
frequent itemsets falsely generated (the new itemsets). There is
a trade-off between sensitive frequent itemsets hidden and side
effects generated. Furthermore, it should always take huge
computing time to solve the problem.
In this study, we propose a novel algorithm, FHSFI, for
fast hiding sensitive frequent itemsets (SFI). The FHSFI has
achieved the following goals: 1) all SFI can be completely
hidden while without generating all frequent itemsets; 2) limited
side effects are generated; 3) any minimum support thresholds
are allowed, and 4) only one database scan is required.
Key Words: frequent itemsets, association rules, privacy
preserving data mining, sensitive frequent itemsets, side effects.
1. Introduction
The data mining technologies have been an important
technology for discovering previously unknown and potentially
useful information from large data sets or databases. They can
be applied to various domains, such as Web commerce, crime
reconnoitering, health care, and customer's consumption
analysis. Although these are useful technologies, there is also a
threat to data privacy. For example, the association rule analysis
is a powerful and popular tool for discovering relationships
hidden in large data sets. Therefore, some private information
could be easily discovered by this kind of tools. The protection
of the confidentiality of sensitive information in a database
becomes a critical issue to be resolved.
The relationships discovered from a database can be
represented in a form of frequent itemsets or association rules.
One rule is categorized as sensitive if its disclosure risk is above
some given threshold. With an association analyzer, if an
itemset with support above a given minimal support, we call the
itemset as a frequent itemset.
The problem for finding an optimal sanitization of a source
database with association rule analysis has been proven to be
NP-Hard [1]. In [2,3,4,5] the authors presented different
heuristic algorithms that modify transactions via inserting or
deleting items for hiding sensitive rules or itemsets.
Vassilios S. Verykios et al. [2] presented algorithms to hide
sensitive association rules, but they generate high side effects
and require multiple database scans. Instead of hiding sensitive
association rules, Shyue-Liang Wang [3] proposed algorithms
to hide sensitive items. The algorithm needs less number of
database scans but the side effects generated is higher. Ali
Amiri [4] also presented heuristic algorithms to hide sensitive
items. Finally, Yi-Hung Wu et al. [5] proposed a heuristic
method that could hide sensitive association rules with limited
side effects. However, it spent a lot of time on comparing and
checking if the sensitive rules are hidden and if side effects are
produced. Besides, it could fail to hide some sensitive rules in
some cases.
In this study, we propose a novel algorithm, FHSFI for fast
hiding sensitive frequent itemsets (SFI). The FHSFI has
achieved the following goals: 1) all SFI can be completely
hidden while without generating all frequent itemsets; 2) limited
side effects are generated; 3) any minimum support thresholds
are allowed, and 4) only one database scan is required.
The remainder of this paper is organized as follows:
Section 2 presents the problem formulation and notations. In
Section 3, we introduce the concept of the proposed algorithm
for fast hiding sensitive frequent itemsets and giving examples
to illustrate the proposed algorithm. Section 4 is the
experimental results which present the performance and various
side effects of the proposed algorithm. Section 5 is the
conclusion and further work.
2. Problem Formulation and Notations
In Table 1, we summarize the notations used hereafter in
this paper. Let I be a set of items in a transaction database D.
And let I = {i1, i2, ..., im}; D = {t1, t2, Â¦, tn}, where every
transaction ti is a subset of I, i.e. tiI. An example database is
shown in Table 2. Let X be a set of items in I. If Xti, we say
that the transaction ti supports X. There are nine items, |I|=9,
and five transactions, |D|=5, in the database. The support of
itemset X can be computed by equation (1). An association rule
is an implication of the form XY, where XI, YI and Xn
Y= ÃƒËœ. A rule XY will be extracted from a database if
1) support(XY) = min_support (a given minimum support
threshold) and
2) confidence(XY) = min_confidence (a given minimum
confidence threshold),
where support(XY) and confidence(XY) are given by
equations (2) and (3), .
support(X) = ||X|| / |D| (1)
support(X Y ) = ||X Y || / |D| (2)
confidence(X Y ) = ||XY|| / | X | (3)
In equation (1), ||X|| denotes the number of transactions in
the database that contains the itemset X, and |D| denotes the
number of the transactions in the database D. If support(X) =
min_support, we call X as a frequent itemset. Table 3 shows the
frequent itemsets for a given min_support = 60%.
For the example X = {1,4,7}, since Xt1, Xt2 and X
t3, we obtain ||X||=3. Therefore, support(1,4,7)=60%. Using the
form XY (support, confidence) for association rules, the rules
generated from the above itemset {1,4,7} can be described as
14,7 (60%,75%), 41,7 (60%,100%), 71,4 (60%,75%),
1,47 (60%,100%), 1,74 (60%,100%) and 4,71
(60%,100%).
Figure 1 shows the relationships among the sets, U, Uâ„¢,
and SFI. The study goal is to hide all SFI and to minimize the
loss itemsets. That is, Uâ„¢nSFI = ÃƒËœ and the set Uâ€œUâ„¢â€œSFI should
be minimized.
Table 1. Definitions of variables used in this paper
Variable Definition
D the original database
Dâ„¢ the released database which is transformed from D
U the sets of frequent itemsets generated from D
Uâ„¢ the sets of frequent itemsets generated from Dâ„¢
ti a transaction in Database D
|ti| the number of items in ti
TID a unique identifier of each transaction
SFI the set of sensitive frequent itemsets to be hidden
SFI.tj a sensitive frequent itemset in the SFI
||.|| the support count of an itemset, i.e., the number of
transactions that support the itemset
wi prior weight of ti
PWT a table for storing TID and wi for each transaction
in an order decreasing by wi
MICi the maximal number of itemsets in SFI that contain
an item ik, where ikti, SFI.tjti
SFI.t.i an item in SFI.tj that is supported by the
transaction to be modified

Figure 1. The relationships among the sets, U, Uâ„¢, and SFI
3 The Proposed Algorithm
We now demonstrate the algorithm, FHSFI. Given D, SFI,
and min_support, the algorithm is to generate a database to be
released, Dâ„¢, in which the sensitive frequent itemsets are hidden
and the side effects generated are minimized.
The sketch of the FHSFI algorithm is shown in Figure 2,
which can be depicted as the following stages.
Table 2.
Database D
TID Transaction
1 1,2,4,5,7
2 1,4,5,7
3 1,4,6,7,8
4 1,2,5,9
5 6,7,8
Table 3.
Frequent Itemsets
Itemset Support
1 80%
4 60%
5 60%
7 80%
1,4 60%
1,5 60%
1,7 60%
1,4,7 60%
4,7 60%
SFI

Uâ„¢
U

Figure 2. The pseudo code of the FHSFI algorithm
In stage 1, FHSFI scans database once while collects all
useful information about the correlation with SFI for each
transaction, including ||SFI.tj|| and wi. The ||SFI.tj|| is used for
checking if SFI.tj has been hidden. The wi is a prior weight of a
transaction ti, which provides a heuristic for estimating side
effects and can be computed by equation (4).
wi = 1 / [2( | ti | - 1)
/ MICi]. (4)
Table 4 shows an example of sensitive frequent itemset.
Let t1 = {1,2,4,5,7}, which supports SFI.t1, SFI.t2 and SFI.t3. As
shown in Figure 3 the correlation between t1 and the SFI can be
represented by a graph G=<V,E>. Each node is for an item ik in
t1; the weight associated with each edge in E denotes the
number of the itemsets in SFI that contain the both adjacent
nodes connected by the edge. Each node can be represented as
({SFI.tj | SFI.tj ti, ikSFI.tj}, item_countSFI.t). For example,
the node < {1,2,3}, 3> for item Ëœ1â„¢ indicates that three itemsets
in SFI that contain the item Ëœ1â„¢, namely the SFI.t1, SFI.t2, and
SFI.t3. As shown in Figure 3, item Ëœ1â„¢ has the maximum
item_countSFI.t which is equal to 3. Hence, we obtain MIC1 = 3
and w1 = 3/16.
Stage 2 repeats to modify transitions one-by-one until all
SFI have been hidden. The order of the transaction
modifications is according to the prior weight associated with a
transition. The following tasks are repeated until SFI is empty.
Â¢ Select a transaction tk from PWT such that wk is maximal.
Â¢ Select the item to be deleted, according to the heuristic
shown in Figure 4, and delete it.
Â¢ Recompute wk after modifying each item, and then insert it
into the PWT in the maintained order.
Â¢ Subtract 1 from ||SFI.tj|| if SFI.tj contains the deleted item
and is supported by tk.
Â¢ Remove SFI.tj from SFI, if the (||SFI.tj|| / |D|)< min_support.

Figure 3. The correlation between t1 and SFI

Figure 4 shows the heuristic procedure for determining which
item to be modified and for computing MIC for transaction ti.

Figure 4. The pseudo code of the Heuristic procedure
Table 4.
An example of sensitive
frequent itemsets, SFI
Itemset
1 1,2,5
2 1,4,7
3 1,5,7
4 6,8
Table 5.
The support count for
each itemset in SFI
Itemset ||.||
1 1,2,5 2
2 1,4,7 3
3 1,5,7 2
4 6,8 2
FHSFI( );
Input: D, SFI, min_support;
Output: Dâ„¢;

Stage 1
1 For each transaction ti in the Database D Do
2 Begin
3 If exist any SFI.t (sensitive frequent itemsets) supported by ti then
4 Begin
5 ||SFI.tj|| : = ||SFI.tj|| + 1;
6 MICi := the maximum item_count from Heuristic(i, SFI);
7 Compute the prior weight wi for each ti by the function;
wi = 1 / [2
( | ti | - 1)
/ MICi];
8 Store the TID and the wi in PWT;
9 End;
10 End;

Stage 2
11 While SFI is not empty (ÃƒËœ) do
12 Begin
13 Select a TID from PWT with maximal w;
14 Determine which item in tTID will be modified according to
Heuristic(TID, SFI);
15 Modify the item;
16 Modify wTID
of the tTID and insert the TID into the PWT;
17 For each SFI.tj that contains the modified item Do
18 Begin
19 ||SFI.tj||=||SFI.tj|| â€œ 1;
20 If ||SFI.tj|| / |D| < min_support then
21 Remove SFI.tj from SFI;
22 End;
23 End;
Heuristic ( );
Input: TID, SFI;
Output: the item to be modified, MICi;
1 Begin
2 For each SFI.t in SFI do
3 Begin
4 If the transaction tTID fully supports SFI.tj then
5 Begin
6 For each item SFI.tj.i in SFI.tj Do
7 item_countSFI.t.i = item_countSFI.t.i + 1;
8 End;
9 End;
10 Select the SFI.t.i with maximum item_count as the item of
tTID to be midified;
11 Return(SFI.tj.i, item_count);
12 End;

Now, we use the following example for illustrating the
proposed algorithm FHSFI.
Example 1. Given D, SFI, as shown in Tables 2 and 4, and
min_support = 40%. As shown in Table 5, the support count for
each SFI.t can be obtained from D and SFI. For example, SFI.t2,
{1,4,7}, is supported by t1, t2, and t3, so ||SFI.t2|| = 3. Table 6
lists the length, MIC, and the prior weight for each transaction
in the database. The PWT, as shown in Table 7, can be obtained
by sorting Table 6 in the decreasing order by w. Then, the first
transaction, i.e., t2, in PWT is chosen to be modified. According
the heuristic shown in Figure 3, the item Ëœ1â„¢ in t2 are removed.
Hence, ||SFI.t2|| and ||SFI.t3|| will be reduced by 1. SFI.t3 is
removed from SFI because the (||SFI.t3|| / |D|) < min_support.
The process is repeated until the SFI is empty. Finally, the
FHSFI algorithm removes the item Ëœ1â„¢ in t2, the item Ëœ6â„¢ or Ëœ8â„¢
in t5 (select randomly), and the item Ëœ1â„¢ in t1. Now all sensitive
frequent itemsets in SFI have been hidden. Ã‚Â¦
4. Performance Evaluation
We have performed our experiments on a notebook with
1.5G MHz processor and 512 MB memory, under Windows XP
operating system. The IBM data generator [11] is used to
synthesize the databases for the experiments. Databases with
sizes 5K, 10K, 15K, 20K, 25K, and 30K are generated for the
series of experiments. The average length of transactions of
each database is 10 and 50 items in the generated database. The
minimum support threshold given is 30%. The experimental
results are obtained by averaging from 5 independent trials with
different SFIs.
The performance of the FHSFI algorithm has been
measured according to three criteria: CPU time requirements,
side effects produced, and the number of entries modified.
Tables 8 and 9 present the experimental results for |SFI|=5 and
|SFI|=10, respectively.
The CPU time requirements, side-effect evaluation, and the
number of entries modified for varied |D| and |SFI| are shown in
Figures 6, 7, and 8, respectively.

0
200
400
600
800
1000
1200
1400
1600
1800
2000
0 5000 10000 15000 20000 25000 30000 35000
|D|
CPU time (ms)
|SFI|=5
|SFI|=10

Figure 6. CPU time requirements

0
2
4
6
8
10
12
0 5000 10000 15000 20000 25000 30000 35000
|D|
Loss item sets
|SFI|=5
|SFI|=10

Figure 7. The side-effect evaluation

Table 8. Experiment results for |SFI|=5
|D| CPU time(ms) |U| |Uâ„¢| #loss itemsets #modified entries
5000 326.6 439 428.6 5.4 143
10000 454.2 417 406.4 5.6 307.2
15000 701 426 415.6 5.4 513
20000 905 442 431 6 711.6
25000 1183.6 432 421.2 5.8 902.8
30000 1502 443 432.4 5.6 863.8
Table 9. Experiment results for |SFI|=10
|D| CPU time(ms) |U| |Uâ„¢| #loss itemsets #modified entries
5000 314.4 439 420.2 8.8 236.8
10000 578.8 417 396.4 10.6 604.2
15000 807.2 426 406 10 896.6
20000 1155.6 442 422 10 1177
25000 1550 432 412.6 9.4 1450.6
30000 1899 443 422.8 10 1521.4
Table 6.
The MIC and prior weight for each
transaction in D
TID Transaction |ti| MIC w
1 1,2,4,5,7 5 3 3/16
2 1,4,5,7 4 2 2/8
3 1,4,6,7,8 5 1 1/16
4 1,2,5,9 4 1 1/8
5 6,7,8 3 1 1/4

Table 7.
The example
PWT
TID w
1 2 2/8
2 5 1/4
3 1 3/16
4 4 1/8
5 3 1/16
0
200
400
600
800
1000
1200
1400
1600
0 5000 10000 15000 20000 25000 30000 35000
|D|
Entries modified
|SFI|=5
|SFI|=10

Figure 8. The number of entries modified
The experimental results for FHSFI can be summarized as
follows:
Â¢ As shown in Figure 6, the CPU time is linear growth with
the size of database and is scalable with the size of SFI.
Â¢ The number of loss itemsets is independent of the size of
database, but linear-related with the size of SFI sets, which
can be discovered in Figure 7.
Â¢ The number of the modified entries depends on the size of
the database and the size of SFI. However, since the
heuristic procedures are used to determine the order of
modifications, we can observe in Figure 8 that only a small
part of transactions in the database are modified. For
|D|=10000, only 600 transactions are modified for
completely hiding the 10 itemsets in SFI.
5. Conclusions and further work
In this paper, we have presented the FHSFI algorithm in
order to fast hide sensitive frequent itemsets with limited side
effects. The correlations between the sensitive itemsets and each
transaction in the original database are analyzed. A heuristic
function to obtain a prior weight for each transaction is given.
The order of transactions to be modified can be efficiently
decided by the weight for each transaction. This will reduce the
time to deal with the transactions whose modification is not
helpful for hiding the given sensitive frequent itemsets. In other
words, the number of transactions in D that we have to deal
with could also be reduced.
Our approach has achieved the following goals: 1) all SFI
can be completely hidden while without generating all frequent
itemsets; 2) limited side effects are generated; 3) any minimum
support thresholds are allowed; and 4) only one database scan is
required. In this research, one of our goals is hiding all SFI with
limited side effects, but our algorithm still causes some loss rule
sets. We are currently considering extensions on the algorithms
to solve the problem. Another one is to apply the ideas
introduced in this paper to fast hide sensitive association rules.
These issues could be studied in the future.
Acknowledgements
This research was supported in part by the National
Science Council of ROC under grants NSC 95-3114-P-606-
001-Y.

References
[1] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, V.
Verykios, Disclosure limitation of sensitive rules
Knowledge and Data Engineering Exchange, pp. 45-52,
1999.
[2] Vassilios S. Verykios, A.K. Elmagarmid, E. Bertino, Y.
Saygin, and E. Dasseni, Association Rule Hiding, IEEE
Transactions on Knowledge and Data Engineering, vol.
16, no. 4, pp. 434-447, 2004.
[3] Shyue-Liang Wang, Hiding sensitive predictive
association rules, Systems, Man and Cybernetics, 2005
IEEE International Conference on Information Reuse and
Integration, vol. 1, pp. 164-169, 2005.
[4] Ali Amiri, Dare to share: Protecting sensitive knowledge
with data sanitization", Decision Support Systems archive
vol. 43, issue 1, pp. 181-191, 2007.
[5] Yi-Hung Wu, Chia-Ming Chiang, and Arbee L.P. Chen,
Hiding Sensitive Association Rules with Limited Side
Effects, IEEE Transactions on Knowledge and Data
Engineering, vol. 19, issue 1, pp. 29 - 42, 2007.
[6] Jiuyong Li, Hong Shen, Rodney Topor, Mining the
Smallest Association Rule Set for Predictions,
Proceedings of the 2001 IEEE International Conference
on Data Mining, 2001.
[7] Shyue-Liang Wang, Kuan-Wei Huang, Tien-Chin Wang,
Tzung-Pei Hong, Maintenance of discovered informative
rule sets: incremental deletion, International Conference
on Systems, Man and Cybernetics, 2005.
[8] Hsin-Wei Lin, Don-Lin Yang, Wei-Cheng Liao, and
Jungpin Wu, "Efficient Support Counting of Candidate
Itemsets for Association Rule Mining," Proceedings of
2nd International Workshop on Chance Discovery and
Data Mining, 2007.
[9] Shyue-Liang Wang, Bhavesh Parikh, Ayat Jafari, Hiding
informative association rule sets, Expert Systems with
Applications, 2007.
[10] http://wuyy.idv.tw/research/DM/AssociationRule.php,
2007
[11] http://almaden.ibmcs/projects/iis/hdb/Projects/
data_mining/mining.shtml, 2007.
[attachment=1125]

**seminar details** · 19-11-2012, 12:19 PM

to get information about the topic "hiding sensitive association rules with limited side effects source code" full report ppt and related topic refer the page link bellow

http://studentbank.in/report-hiding-sens...de-effects

http://studentbank.in/report-a-novel-alg...e=threaded

Possibly Related Threads...
Thread		Author	Replies	Views	Last Post
	APRIORI Algorithm	project report helper	1	10,949	07-02-2019, 10:19 AM Last Post:
	Vertical Handoff Decision Algorithm Providing Optimized Performance in Heterogeneous Wireless Networks	computer science topics	2	30,434	07-10-2016, 09:02 AM Last Post: ijasti
	Implementation of RSA Algorithm Using Client-Server full report	seminar topics	6	26,806	10-05-2016, 12:21 PM Last Post: dhanabhagya
	Skin Tone based Secret Data hiding in Images	seminar class	9	7,015	23-12-2015, 04:18 PM Last Post: HelloGFS
	Dynamic Search Algorithm in Unstructured Peer-to-Peer Networks	seminar surveyer	3	2,823	14-07-2015, 02:24 PM Last Post: seminar report asees
	Particle Swarm Optimization Algorithm and Its Application in Engineering Design Optim	computer science crazy	3	5,487	03-05-2013, 10:28 AM Last Post: computer topic
	Public Key Cryptography and the RSA algorithm	seminar class	1	2,469	23-11-2012, 11:32 AM Last Post: seminar details
	rsa algorithm full report	computer science technology	3	8,207	23-11-2012, 11:32 AM Last Post: seminar details
	Page Rank Algorithm	summer project pal	1	2,963	30-10-2012, 02:04 PM Last Post: seminar details
	Advanced Algorithm Design and Analysis FULL PRESENTATION	seminar class	1	2,560	10-10-2012, 12:22 PM Last Post: seminar details

Important Note..!

ASK HERE