Introduction
In the last few years there has been a tremendous increase in connectivity between systems, which has brought about limitless possibilities and opportunities. Unfortunately, security-related problems have increased at the same rate. Computer systems are becoming increasingly vulnerable to attacks. These attacks or intrusions, based on flaws in the operating system or application programs, usually read or modify confidential information or render the system useless. Formally, an intrusion is defined as any activity that violates the confidentiality, integrity or availability of the system.
Intrusion prevention is more desirable, but it cannot be fully achieved due to several reasons like unknown bugs in software, vast base of installed systems, abuse by insiders and human negligence. Many times it is difficult to have good access control while simultaneously making the system user friendly. Attacks are inevitable, but even after the attack has occurred, it is important to determine that the attack has happened, assess the extent of damage and track down the attacker. This helps in preventing future attacks. Due to these reasons, a detection system as a second line of defence is always desirable.
Intrusion detection systems (IDS) can be classified in two ways. The first one is based on the source of data being analyzed by the system. If the data is from operating system logs and application logs, it is called a 'host based' detection system; if the data is from network traffic, it is called a 'network based' detection system. Each method has its own advantages and disadvantages. For example, an attack by a local user cannot be detected by a network based system, but a denial of service attack can be detected more efficiently by a network based system. Thus each method is more efficient in detecting a particular class of attacks than the other.
The other classification is based on the detection method used, irrespective of the source of data. The main types in this classification are misuse detection systems and anomaly detection systems. In misuse detection, well-known intrusions are represented by signatures. Each signature is a pattern of activity which corresponds to the intrusion it represents. A detection system using such signatures is called a 'signature based' or a 'misuse detection' system. These detection systems search for patterns of intrusions in the data being analyzed. Thus misuse detection is basically a pattern matching process. Misuse detection systems are accurate and have a low false alarm rate, but they cannot detect unknown intrusions.
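As a toy illustration of this pattern matching process (the signatures and payload below are hypothetical examples, not real Snort rules), misuse detection over packet payloads can be sketched as:

```python
# Minimal sketch of signature-based (misuse) detection: each signature is a
# byte pattern, and detection is substring matching over packet payloads.

SIGNATURES = {
    "shellcode-nop-sled": b"\x90" * 8,          # run of x86 NOP bytes
    "php-remote-include": b"include(http://",   # remote file inclusion attempt
}

def match_signatures(payload: bytes) -> list:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

# A payload containing a remote-include attempt triggers exactly one signature.
alerts = match_signatures(b"GET /index.php?page=include(http://evil/x HTTP/1.0")
```

A real signature engine would use a multi-pattern algorithm (e.g. Aho-Corasick) rather than one scan per signature, but the accuracy/unknown-attack trade-off described above is the same.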
Anomaly detection systems assume that intrusions are anomalies or deviations from normal system activity. These detection systems try to capture the normal behaviour of the system (also called the normal profile), and then detect deviations from this normal behaviour. If this deviation is greater than a threshold, an alert is raised. Anomaly detection systems can detect unknown intrusions, but they have a high false alarm rate. There is generally a trade-off between detection rate and false alarm rate.
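A minimal sketch of this idea on a single hypothetical feature (connections per minute, with made-up numbers): the normal profile is a mean and standard deviation estimated from attack-free data, and an alert is raised when a new observation deviates beyond a threshold.

```python
# Toy threshold-based anomaly detection on one feature. The threshold
# controls the trade-off: lower -> higher detection rate but more false alarms.
import statistics

training = [10, 12, 11, 9, 13, 10, 11, 12]   # normal activity (attack-free)
mean = statistics.mean(training)
std = statistics.stdev(training)

def is_anomalous(value: float, threshold: float = 3.0) -> bool:
    """Alert when the value deviates from the profile by > threshold std-devs."""
    return abs(value - mean) / std > threshold

is_anomalous(11)   # within the normal profile
is_anomalous(60)   # far outside the profile -> alert
```

An unseen attack that perturbs this feature is detected without any signature for it, but an unusual-yet-legitimate burst of activity triggers a false alarm, which is exactly the trade-off noted above.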
Several IDSs have been developed in the public and private domains using a variety of techniques and with varying features. Commercial IDSs mostly use signature based detection techniques. The features offered by them include scalability, real-time detection and a user friendly interface. Open source IDSs employ either misuse detection or anomaly detection or both. They offer features like scalability and real-time detection. For example, Snort [4], an open source IDS, employs misuse detection and is capable of doing real-time detection. Public domain research IDSs generally employ novel detection techniques. For example, ADAM [6] uses data mining techniques and IDES [17] uses statistical techniques.
Looking at the intrusion detection field from a research perspective, the research in misuse detection is focused mainly on writing signatures which encompass all possible variations of an attack without matching normal activity and on developing efficient methods of pattern matching. In anomaly detection, the main focus is on finding methods for representing the normal profile, selection of features used for constructing the profile and determining threshold levels so that most intrusions are detected while false alarms are minimized. In an overall system perspective, the focus of current research is on developing hybrid systems, i.e. systems that are both network based and host based or that employ both anomaly detection and misuse detection.
1.1 Problem statement and Approach
In this thesis, we describe the design and implementation of a network based, real-time anomaly detection scheme for the Sachet IDS. Sachet is a network based, real-time, hybrid intrusion detection system developed at IIT Kanpur. Sachet employs both misuse detection and anomaly detection; hence it has the benefits of both techniques, i.e. the accuracy of misuse detection systems in detecting known attacks, and the ability of anomaly detection systems to detect unknown attacks. The Sachet IDS has an agent based architecture with a central server. The detection is carried out at each agent and the results are aggregated at the server. The architecture is explained in more detail in Chapter 3. In the remaining part of this section, we describe the main issues involved in the thesis, followed by our approach.
The main task in anomaly detection is to construct the normal profile of the system under observation. This profile should adapt to changes in the system over time. It should also be small enough that real-time detection is possible. The profile is generally constructed from a set of measures or features extracted from the data being analyzed. In this case, the features are extracted from the network packets sniffed at appropriate points in the network being monitored. One of the main issues here is feature extraction in real time.
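To make the feature extraction step concrete, here is a hypothetical sketch: packets are grouped by connection 4-tuple and simple per-connection counters are accumulated as packets arrive. The field names and the two features shown are illustrative; the actual feature set (including payload-derived features) is richer.

```python
# Illustrative per-connection feature extraction from packet headers.
# Each packet is a (src_ip, dst_ip, src_port, dst_port, length) tuple.
from collections import defaultdict

def extract_features(packets):
    """Accumulate packet and byte counts per connection 4-tuple."""
    conns = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, sport, dport, length in packets:
        key = (src, dst, sport, dport)
        conns[key]["packets"] += 1
        conns[key]["bytes"] += length
    return dict(conns)

feats = extract_features([
    ("10.0.0.1", "10.0.0.2", 1025, 80, 100),
    ("10.0.0.1", "10.0.0.2", 1025, 80, 50),
])
```

Keeping the per-connection state incremental like this, rather than buffering whole connections, is what makes extraction feasible in real time.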
The construction of the profile from feature vectors follows the data stream model: we have a continuous stream of feature vectors, and the profile at any point should capture the information in the stream up to that point. If possible, the profile construction method should give more weight to newer data than to older data. Since the amount of network data is generally very large, any method used to construct the profile obviously cannot take as input the entire data seen in the stream so far. Hence, efficiently dealing with the data stream is also a major issue here. Older data in the stream has to be discarded periodically, but the information in the discarded data has to be retained to some extent. Stream handling techniques have to be employed for this purpose. Finally, the detection technique has to be implemented in Sachet so that it requires minimal human intervention.
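One simple way to see these requirements (bounded memory, newer data weighted more) is an exponentially decayed running summary of the stream. This is only an illustrative stand-in, not one of the stream handling techniques used in the thesis, and the decay factor alpha is a free parameter chosen here for illustration.

```python
# Exponentially decayed running mean over a stream of feature vectors:
# O(d) memory for d features, with each new vector weighted more heavily
# than older ones (older contributions decay geometrically).

def update_synopsis(synopsis, vector, alpha=0.1):
    """Fold one feature vector into the synopsis without storing the stream."""
    if synopsis is None:
        return list(vector)
    return [(1 - alpha) * s + alpha * x for s, x in zip(synopsis, vector)]

synopsis = None
for v in [[1.0, 2.0], [1.2, 2.1], [5.0, 9.0]]:
    synopsis = update_synopsis(synopsis, v)
```

The memory footprint is independent of the stream length, which is the property any usable synopsis must share.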
Our approach is as follows: the profile is learned from feature vectors using unsupervised learning (clustering) techniques. The features used for learning the profile are extracted for each connection in real-time, from the header and payload parts of network packets sniffed at various points in the network. Features corresponding to the payload part of the packet are extracted only for commonly used application layer protocols. These features are then aggregated at a single location, the Sachet learning agent, and the profile of the entire network is learned offline. Stream handling techniques are used to deal with the continuous stream of feature vectors. These techniques can be viewed as wrappers around the learning techniques. They construct a synopsis of the stream seen so far, with the possible option that newer data is given more weight in this synopsis. Learning is then applied on this synopsis and the resulting profile is distributed to the detection points where deviations are detected and alerts are raised.
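The learn-then-detect pipeline can be sketched as follows, using a plain k-means as a simplified stand-in for the clustering techniques actually considered (support vector clustering and a modified k-means): the learned profile is a set of cluster centroids, and a new feature vector is flagged as anomalous when it lies too far from every centroid.

```python
# Sketch of clustering-based profile learning and deviation detection.
# Naive k-means with deterministic initialization; assumes k >= 1 and
# len(points) >= k. The threshold plays the same role as in Section 1 above.
import math

def kmeans(points, k=2, iters=20):
    centroids = points[:k]                       # naive deterministic init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def is_anomalous(vector, centroids, threshold=2.0):
    """Deviation = distance to the nearest centroid of the normal profile."""
    return min(math.dist(vector, c) for c in centroids) > threshold
```

In the deployed scheme, `kmeans` would run offline at the learning agent over the stream synopsis, while `is_anomalous` is the cheap per-vector check distributed to the detection points.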
Two different unsupervised learning techniques, support vector clustering [7] and a modified k-means technique [14], were considered for learning the profile. To handle the feature vector stream, three different techniques were considered: the divide-and-conquer technique of clustering over data streams [15], reservoir sampling [25] and bootstrapping [16]. The five valid combinations (a clustering technique and a stream handling technique) resulting from the above were tested on a benchmark data set. The combination that gave the best results was implemented in the Sachet IDS. The implemented anomaly detection scheme was then tested on a benchmark data set of size 20GB, which contains over 50 attacks of various types.
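Of the three stream handling techniques, reservoir sampling is the simplest to state. The classic one-pass version (Vitter's Algorithm R) keeps a uniform random sample of fixed size k from a stream of unknown length:

```python
# Reservoir sampling (Algorithm R): after processing i+1 items, each item
# seen so far is in the reservoir with probability k / (i + 1).
import random

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, i)            # uniform index in [0, i]
            if j < k:
                reservoir[j] = item          # keep item with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10)
```

Used as a wrapper around a clustering technique, the reservoir is the synopsis: memory stays fixed at k feature vectors no matter how long the stream runs, and clustering is applied to the reservoir instead of the full stream.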