
A NOVEL INDEX SUPPORTING HIGH VOLUME DATA WAREHOUSE INSERTIONS
Presented by
D. Vijay Kumar
A. Siva Subrahmanyam.

Abstract
While the desire to support fast, ad hoc query processing for large data warehouses has motivated the recent introduction of many new indexing structures, with a few notable exceptions, little attention has been given to developing new indexing schemes that allow fast insertions. Since additions to a large warehouse may number in the millions per day, indices that require a disk seek (or even a significant fraction of a seek) per insertion are not acceptable.
In this paper, we offer an alternative to the B+-tree called the Y-tree for indexing huge warehouses having frequent insertions. The Y-tree is a new indexing structure supporting both point and range queries over a single attribute, with retrieval performance comparable to the B+-tree. For processing insertions, however, the Y-tree may exhibit a speedup of 100 times over batched insertions into a B+-tree.
1 Introduction
Efficiency in OLAP system operation is of significant current interest, from a research as well as from a practical perspective. There are two primary options for supporting efficient queries over a huge data warehouse. The first option is to allow the user to pre-define a set of views on the warehouse, where query results are at least partially pre-computed and maintained as data are added to the warehouse. The second option is to compute the results of a query only after it has been issued using indexing and fast algorithms, thereby allowing ad-hoc querying of the warehouse. We focus on the second option in this paper.
Work on processing ad-hoc queries over huge warehouses has resulted in the development of a number of special-purpose index structures, such as Projection Indices in Sybase IQ, Bitmapped Indices (BMI) in Oracle, and Bitmapped Join Indices (BJI) in Informix and Red Brick. Together with the regular value-list (B+-tree) index, the various grid-based approaches, and hierarchical, multidimensional structures such as the R-tree, these structures represent a formidable set of options for indexing large warehouses. However, while significant query processing advantages have resulted from these indices, warehouse refresh performance has suffered, seriously affecting the availability of the warehouse.
Warehouse refreshes differ from standard database insertion in that typically, refresh involves the addition of a number of new rows to a single, central fact table. The smaller dimension tables may also grow, but such growth is usually very slow compared to fact table growth. Usually, indexing in a data warehouse is done on foreign keys in the central fact table. If the number of distinct attribute values for a foreign key is relatively small, this can allow for fast index refresh, with only a few localized index changes required for each insertion. It is in this situation that a BMI is particularly useful, since a refresh of the fact table will result in appends of bits to only a few, already existing bitmaps. However, it is not always the case that the number of distinct foreign key values is small. We now present a case where this quantity is not small, and discuss the implications for index refresh.
1.1 Example
We illustrate the problem of maintaining an index in the face of a high insertion rate with an example drawn from the domain of call detail record (CDR) warehousing for telecommunication service providers. CDRs are records generated for every call through a telecommunication network; each record is approximately 700 bytes in length. The AT&T Corporation experiences call detail growth of around 20 GB/day, which translates to approximately 28 million calls per day. When these records are warehoused, assuming significant aggregation with respect to the detail records accumulated in CDR stores, one can reasonably expect an order of magnitude decrease in the number of stored records. This translates to an average addition of nearly 3 million records per day. If seven years' worth of data are maintained, the complete warehouse needs to store approximately 8 billion records.
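As a back-of-the-envelope check of these figures, the sketch below (Python, purely illustrative) recomputes the daily and total record counts from the stated assumptions: 700-byte CDRs, roughly 20 GB of raw call detail per day, an order-of-magnitude reduction from aggregation, and seven years of retention. The results land in the same ballpark as the figures quoted above.

# Back-of-the-envelope check of the warehouse sizing figures above.
CDR_BYTES = 700                      # approximate size of one call detail record
DAILY_DETAIL_BYTES = 20 * 10**9      # ~20 GB of raw call detail per day
AGGREGATION_FACTOR = 10              # stated order-of-magnitude reduction
RETENTION_DAYS = 7 * 365             # seven years of data

calls_per_day = DAILY_DETAIL_BYTES / CDR_BYTES        # ~28.6 million calls
rows_per_day = calls_per_day / AGGREGATION_FACTOR     # ~2.9 million warehouse rows
total_rows = rows_per_day * RETENTION_DAYS            # ~7.3 billion rows retained

print(f"{calls_per_day / 1e6:.1f} M calls/day, "
      f"{rows_per_day / 1e6:.1f} M rows/day, "
      f"{total_rows / 1e9:.1f} B rows total")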
Now, consider a BMI on some attribute of the central fact table of this warehouse, perhaps on the customer account number. It is not unimaginable that on the order of 10 million distinct account numbers would be found in this particular fact table. A BMI on the customer account number would then be made up of 10 million (very sparse) bitmaps composed of 8 billion bits each. Clearly, this is likely a prohibitive storage requirement.
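To make the "prohibitive storage" claim concrete, here is a minimal sketch of the arithmetic, assuming (as above) 10 million distinct account numbers and 8 billion fact-table rows, with one bit per (value, row) combination in an uncompressed BMI.

# Uncompressed size of a BMI on the customer account number.
DISTINCT_ACCOUNTS = 10 * 10**6       # ~10 million distinct account numbers
FACT_TABLE_ROWS = 8 * 10**9          # ~8 billion fact-table rows

total_bits = DISTINCT_ACCOUNTS * FACT_TABLE_ROWS      # one bit per (value, row)
total_petabytes = total_bits / 8 / 10**15
print(f"~{total_petabytes:.0f} PB of uncompressed bitmaps")   # ~10 PB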
Of course, these bitmaps could be compressed, but in such an extreme case, it would probably be preferable to use a value-list index, where instead of a bitmap for each customer account number, a list of pointers to fact table entries is stored. Note that if a compression scheme like RLE were used on the BMI, it would essentially become equivalent to using a value-list index. Because of this, and the prohibitive storage costs associated with using an uncompressed BMI in this warehouse, the value-list index is the primary existing option that we will consider for such a situation throughout this paper. Were a value-list index used instead of a BMI, there are two likely approaches to handling the 3 million insertions per day:
• Incremental, batch insertion could be used. Insertions could be batched, so that each edge in the tree need be traversed at most once. We have found that on our system, incremental, bulk insertion (following the algorithm outlined in [6]) into a similar structure, under similar conditions (cf. Section 4), can be accomplished at a sustained rate of 100,000 (key, ptr) pairs in slightly more than 41 minutes. This means that insertion of 3 million (key, ptr) pairs per day could be expected to take longer than 20 hours to complete (see the back-of-the-envelope sketch following this list). In other words, it would barely be possible to keep up with this insertion rate even if all system resources were devoted to maintenance, 24 hours a day. Even if more hardware were added to combat the problem, one can assume that in the face of ever-increasing warehouse sizes, the problem is bound to recur.
• Or, we could forsake the purely incremental approach and rebuild the index, using the old index as a guide. The LSM-Tree [4] and the Stepped Merge Method [1] are two access methods that use a version of such a rebuild of a B+-tree as their fundamental approach. These methods both have the important advantage that the resulting tree structures can be constructed optimally, with full nodes, and long runs of data can be stored sequentially on disk to allow fast query processing. Also important is the fact that since the new structure can be constructed from fast, sequential scans of the old structure, disk seeks can be minimized during construction, thereby drastically decreasing the average time required per insertion when compared to the value-list index. However, a disadvantage of these methods is that in the case of a skewed insertion distribution, entire nodes must be rewritten, even if only a very few key values need be written to that node. We discuss these issues in more detail later in the paper.
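As a rough sanity check of the timing claims in the first bullet above (and of the Y-tree figure quoted in Section 1.2), the sketch below simply scales the measured batch rate. The 41-minute figure and the roughly 100-times speedup are taken from the text; the rest is straightforward arithmetic.

# Scaling the measured incremental bulk-insertion rate to the daily load.
BATCH_SIZE = 100_000          # (key, ptr) pairs per measured batch
BATCH_MINUTES = 41            # "slightly more than 41 minutes" per batch
DAILY_PAIRS = 3_000_000       # pairs to insert per day
YTREE_SPEEDUP = 100           # order-of-magnitude speedup claimed for the Y-tree

value_list_minutes = DAILY_PAIRS / BATCH_SIZE * BATCH_MINUTES   # ~1230 min
y_tree_minutes = value_list_minutes / YTREE_SPEEDUP             # ~12 min

print(f"value-list index: ~{value_list_minutes / 60:.1f} hours/day, "
      f"Y-tree: ~{y_tree_minutes:.0f} minutes/day")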
1.2 An Index Allowing Fast Insertions
In response to these issues, we have developed the YATS-tree (Yet Another Tree Structure-tree), or Y-tree for short. The Y-tree is an exceedingly simple, hierarchical, secondary indexing structure for use in evaluating point and range queries over a single attribute, much like the value-list index. In fact, it can be used to support the same set of secondary indexing applications as the value-list index.
However, in contrast to the value-list index, the Y-tree is designed to allow very fast insertions into a huge database. This is accomplished with the idea of a single-path, bulk insertion. In a Y-tree, a set of some small number of insertions (say, 500) is batched and inserted into the structure at once. There are no constraints placed on what key values may be in this set, and performance is totally unaffected by the key values a batched insertion set contains. Insertion into the Y-tree is called single-path, bulk insertion because, regardless of the key values, an insertion of a set of (key, ptr) pairs will only require a traversal from the tree root to a single leaf node holding a list of record identifiers. In this way, the Y-tree can achieve speed-ups on the order of 100 times over incremental, batch insertion into a value-list index. The daily insertion of 3 million key values into the value-list index described above (which would take nearly the entire day to complete) would take less than 12 minutes were a Y-tree used instead.
There is a cost associated with the faster insertion times. The Y-tree can produce slower query response times when compared to the value-list index. For example, when used for evaluation of a point query returning a single (key, ptr) pair, the Y-tree is on the order of four times slower than the value-list index (point queries, however, are expected to be rare in a warehousing environment). But as the size of the query result increases, as is the case in standard OLAP queries, the efficiency of the Y-tree increases as well. When used for evaluating range queries returning 1 million such pairs from a large database, the Y-tree is only around 50% slower than an optimally bulk-constructed value-list index, and can be nearly three times faster than a value-list index that has been built incrementally. Depending on certain parameters, a Y-tree may then actually be preferable to a value-list index for handling large queries. Combined with the fact that standard value-list index insertion is virtually unusable for huge, constantly growing databases, we feel that the Y-tree represents an important alternative to the value-list index.
1.3 Paper Organization
This paper is organized as follows. In Section 2, we present the Y-tree structure and the associated algorithms. In Section 3, we present an analytical study of the Y-tree. We compare it to the value-list index, showing that the Y-tree presents a very attractive alternative to the value-list index at query and insertion loads that one would commonly expect in a huge data warehouse. In Section 4, we present experimental results comparing the performance of actual implementations of the two structures. Section 5 presents some related work; we conclude the paper in Section 6.
2 The Y-Tree
The Y-tree is similar in many ways to the value-list index. Like the value-list index, it is a hierarchical structure composed of leaf nodes and internal nodes:
• Leaf Nodes. Assuming that the data are not clustered on disk with respect to the indexed attribute, leaf nodes are simply sets of ordered pairs of the form (key, ptr-list), where key is a value from the domain of the attribute to be indexed, and ptr-list is a list of RIDs of records containing that key value. Each leaf node is guaranteed to be at least 50% full at all times. In practice, we have found that a utilization of 65-70% is typical. This much is similar to the classical value-list index.
• Internal Nodes. The internal nodes of the Y-tree are quite different from those of the value-list index. Each internal node contains two components, the pointer-list and the heap. The pointer-list is borrowed from the value-list index. It is simply a list of the form:
<P_1, K_1, P_2, K_2, ..., P_(f-1), K_(f-1), P_f>.
The associated heap is logically a set of f buckets, where f is a constant chosen before the structure is constructed; f denotes the fanout of the tree. The heap has an associated maximum heap size h, which is likewise chosen a priori. Each of the f buckets is associated with exactly one pointer to a node lower in the tree, and holds a set of ordered pairs of the form (key, ptr). These ordered pairs are identical to those found in the leaf nodes; indeed, they may eventually be moved from buckets located in internal nodes into leaf nodes, as we will describe below. Logically, then, the Y-tree looks something like what is depicted in Figure 1.
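To make the node layout concrete, here is a minimal sketch, in Python, of how the two node types just described might be represented. The class and field names (RID, LeafNode, InternalNode, buckets, and so on) are illustrative choices, not taken from the paper, and disk layout is ignored entirely.

from dataclasses import dataclass, field
from typing import List, Tuple, Union

RID = int  # record identifier; its concrete representation is an assumption here

@dataclass
class LeafNode:
    # Ordered (key, ptr-list) pairs: each indexed key value maps to the RIDs
    # of all fact-table rows containing it.
    entries: List[Tuple[int, List[RID]]] = field(default_factory=list)

@dataclass
class InternalNode:
    # Pointer-list <P_1, K_1, ..., K_(f-1), P_f>: f child pointers separated
    # by f-1 key values, exactly as in a value-list (B+-tree) index.
    keys: List[int] = field(default_factory=list)          # K_1 ... K_(f-1)
    children: List["Node"] = field(default_factory=list)   # P_1 ... P_f
    # The heap: one bucket of pending (key, ptr) pairs per child pointer.
    buckets: List[List[Tuple[int, RID]]] = field(default_factory=list)

Node = Union[LeafNode, InternalNode]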
2.1 Insertion into the Y-tree
The primary goal in designing the Y-tree is to provide for fast insertion while maintaining the functionality of the value-list index for quickly evaluating range queries as well as point queries. We discuss Y-tree insertion in this section.
2.1.1 Why Insertion Is Fast
Insertion into the Y-tree is very fast because of the two important properties of the Y-tree we describe now. The first property is common to both the Y-tree and the value-list index:
Property 1. The insertion of a (key, ptr) pair into the tree results in the reading and writing of nodes on at most one path from root to leaf level in the tree.
The second property is quite different from that of a value-list index, and is at the heart of the speed with which insertion into the Y-tree may be accomplished:
Property 2. The insertion of an entire batch of up to d (key, ptr) pairs, regardless of the key values it contains, likewise results in the reading and writing of nodes on at most one path from root to leaf level in the tree.
2.1.2 The Insertion Algorithm
Algorithm Insert (parameters S: set of (key, ptr) pairs of cardinality no greater than d; N: node having fanout fN)
1) If N is an internal node:
2) For each element s of S, add s into the first heap bucket bi such that the associated key value Ki ≥ s.key; or, insert s into the last heap bucket if there is no such Ki.
3) Choose the bucket bj that has the most (key, ptr) pairs.
4) If the heap now contains more than (fN − 1) × d pairs:
5) Remove min(d, size(bj)) (key, ptr) pairs from bj to create Snew, write N to disk, and recursively call Insert (Snew, node pointed to by Pj).
6) Else, write N to disk.
7) Otherwise, N is a leaf node:
8) Simply add S to the set of (key, ptr) pairs in N, then write N to disk.
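Below is a minimal, in-memory sketch of the insertion procedure above, reusing the hypothetical RID/LeafNode/InternalNode classes from the earlier sketch. The batch size D, the overflow threshold (fN − 1) × d (as reconstructed in step 4), and all helper names are assumptions for illustration; the "write N to disk" steps and node splitting on leaf overflow are omitted.

import bisect
from typing import List, Tuple

D = 500  # maximum batch size d; the paper suggests batching a few hundred pairs

def insert(S: List[Tuple[int, RID]], N: "Node") -> None:
    # Single-path bulk insert of at most D (key, ptr) pairs into node N.
    if isinstance(N, InternalNode):
        fan_out = len(N.children)
        # Step 2: route each pair to the first bucket whose key Ki >= s.key,
        # or to the last bucket if there is no such Ki.
        for key, ptr in S:
            i = bisect.bisect_left(N.keys, key)        # first index with Ki >= key
            N.buckets[min(i, fan_out - 1)].append((key, ptr))
        # Step 3: pick the fullest bucket.
        j = max(range(fan_out), key=lambda i: len(N.buckets[i]))
        # Steps 4-6: if the heap overflows, push one batch down a single path;
        # otherwise this insertion stops at the current node.
        if sum(len(b) for b in N.buckets) > (fan_out - 1) * D:
            batch = [N.buckets[j].pop() for _ in range(min(D, len(N.buckets[j])))]
            insert(batch, N.children[j])
    else:
        # Steps 7-8: leaf node; merge the pairs into the (key, ptr-list) entries.
        merged = dict(N.entries)
        for key, ptr in S:
            merged.setdefault(key, []).append(ptr)
        N.entries = sorted(merged.items())

A daily refresh would then amount to repeatedly collecting up to D new (key, ptr) pairs and calling insert(batch, root), so that each batch touches only a single root-to-leaf path, per Properties 1 and 2 above.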