HBA: Distributed Metadata Management for
Large Cluster-Based Storage Systems
Scope of the project
To create metadata for all the files in the network and to find a particular file quickly and easily using a search engine.
Introduction
Rapid advances in general-purpose communication networks have motivated the use of inexpensive components to build competitive cluster-based storage solutions that meet the increasing demand for scalable computing. In recent years, the bandwidth of these networks has increased by two orders of magnitude, which greatly narrows the performance gap between them and the dedicated networks used in commercial storage systems. Since all I/O requests can be classified into two categories, user data requests and metadata requests, the scalability of accessing both data and metadata must be carefully maintained to avoid any potential performance bottleneck along the data paths. This paper proposes a novel scheme, called Hierarchical Bloom Filter Arrays (HBA), to evenly distribute the tasks of metadata management across a group of metadata servers (MSs). A Bloom filter (BF) is a succinct data structure for probabilistic membership queries. A straightforward extension of the BF approach to decentralizing metadata management onto multiple MSs is to use an array of BFs on each MS. The metadata of each file is stored on some MS, called the home MS.
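The Bloom filter mentioned above can be sketched in a few lines. The following Python sketch is illustrative only (the project itself is built on ASP.NET/C#), and the parameters m and k are chosen arbitrarily here:

```python
import hashlib

class BloomFilter:
    """A simple Bloom filter: k hash indices over an m-bit array."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _indices(self, item):
        # Derive k bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))
```

A BF trades a small false-positive probability for a very compact representation, which is exactly why each MS can afford to replicate the filters of all other servers.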
Modules
1. Login
2. Finding Network Computers
3. Meta Data Creation
4. Searching Files
Module Description
Login
The Login Form module presents site visitors with a form containing username and password fields. If the user enters a valid username/password combination, they are granted access to additional resources on the website. Which additional resources they can access can be configured separately.
Finding Network Computers
In this module we find the available computers on the network. Some folders on some of these computers are shared, and we identify the computers that have shared folders. In this way we gather all the information about the files and build the metadata.
Meta Data Creation
In this module we create metadata for all the system files. The module saves all file names in a database. In addition, it saves some information from each text file. This mechanism is applied to avoid the long-running process of the existing system.
Searching Files
In this module the user enters text to search for the required file. The search mechanism differs from the existing system: whenever the user submits a search term, it is looked up in the database. The search first matches against file names, retrieving related file names. It then also searches the stored file text. Finally, it produces a search result listing the files related to the given text.
Module I/O
Login
Given Input-Login details
Expected Output- Logged-in users can use the software.
Finding Network Computers
Given Input- Click the button to discover network systems.
Expected Output- Shows all the connected nodes in the network.
Meta Data Creation
Given Input- Search all the files and store the necessary information.
Expected Output- Database updated with the created metadata.
Searching Files
Given Input- File name or file size
Expected Output- Shows a page link to get the particular file.
Module diagram
UML Diagrams
Use case diagram
Class diagram
Object diagram
State diagram
Activity diagram
Sequence diagram
Collaboration Diagram
Component Diagram
E-R diagram
Dataflow diagram
Project Flow Diagram
System Architecture
Literature review
Many cluster-based storage systems employ centralized metadata management. Experiments in GFS show that a single MS is not a performance bottleneck in a storage cluster with 100 nodes under a read-only Google searching workload. PVFS, which is a RAID-0-style parallel file system, also uses a single-MS design to provide a clusterwide shared namespace. As data throughput is the most important objective of PVFS, some expensive but indispensable functions, such as concurrency control between data and metadata, are not fully designed and implemented. In CEFT, which is an extension of PVFS that incorporates RAID-10-style fault tolerance and parallel I/O scheduling, the MS synchronizes concurrent updates, which can limit the overall throughput under workloads of intensive concurrent metadata updates. In Lustre, some low-level metadata management tasks are offloaded from the MS to object storage devices, and ongoing efforts are being made to decentralize metadata management to further improve scalability. Some other systems have addressed metadata scalability in their designs. For example, GPFS [18] uses dynamically elected metanodes to manage file metadata. The election is coordinated by a centralized token server. OceanStore, which is designed for LAN-based networked storage systems, scales the data location scheme by using an array of BFs, in which the ith BF is the union of all the BFs for all of the nodes within i hops. Requests are routed to their destinations by following the path with the maximum probability. Panasas ActiveScale not only uses object storage devices to offload some metadata management tasks but also scales up the metadata services by using a group of directory blades. Our target systems differ from the three systems above: while GPFS and Panasas ActiveScale rely on specially designed commercial hardware, our target systems consist only of commodity components.
Our system is also different from OceanStore in that the latter focuses on geographically distributed storage nodes, whereas our design targets cluster-based storage systems, where all nodes are only one hop away. The following summarizes other research projects in scaling metadata management, including table-based mapping, hash-based mapping, static tree partitioning, and dynamic tree partitioning.
Table-Based Mapping
Globally replicating mapping tables is one approach to decentralizing metadata management. There is a salient trade-off between the space requirement and the granularity and flexibility of distribution. A fine-grained table allows more flexibility in metadata placement. In an extreme case, if the table records the home MS for each individual file, then the metadata of a file can be placed on any MS. However, the memory space requirement of this approach makes it unattractive for large-scale storage systems. A back-of-the-envelope calculation shows that it would take as much as 1.8 Gbytes of memory space to store such a table with 10^8 entries when 16 bytes are used for a filename and 2 bytes for an MS ID. In addition, searching for an entry in such a huge table consumes a large number of precious CPU cycles. To reduce the memory space overhead, xFS proposes a coarse-grained table that maps a group of files to an MS. To keep a good trade-off, it is suggested that in xFS, the number of entries in a table should be an order of magnitude larger than the total number of MSs.
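The 1.8-Gbyte figure follows directly from the arithmetic:

```python
entries = 10**8          # 10^8 = 100 million files, one table entry each
bytes_per_entry = 16 + 2 # 16-byte filename + 2-byte MS ID
total = entries * bytes_per_entry
print(total)             # 1,800,000,000 bytes, i.e. roughly 1.8 Gbytes
```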
Hashing-Based Mapping
Modulus-based hashing is another decentralized scheme. This approach hashes the symbolic pathname of a file to a digital value and assigns its metadata to a server according to the modulus value with respect to the total number of MSs. In practice, the likelihood of serious skew of metadata workload is almost negligible in this scheme, since the number of frequently accessed files is usually much larger than the number of MSs. However, a serious problem arises when an upper directory is renamed or the total number of MSs changes: the hash mapping needs to be recomputed, and this requires all affected metadata to be migrated among MSs. Although the metadata of a single file is small, a large number of files may be involved. In particular, the metadata of all files has to be relocated if an MS joins or leaves. This can lead to both disk and network traffic surges and cause serious performance degradation. LazyHybrid is proposed to reduce the impact of metadata migration by updating lazily and also incorporating a small table that maps disjoint hash ranges to MS IDs. The migration overhead, however, can still outweigh the benefits of load balancing in a heavily loaded system.
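A small Python experiment makes the migration problem concrete (MD5 stands in here for whatever hash function a real system would use): when the server count changes from 16 to 17, nearly all files map to a new home MS.

```python
import hashlib

def home_server(path, num_servers):
    """Modulus-based mapping: hash the full pathname, mod the server count."""
    h = int.from_bytes(hashlib.md5(path.encode()).digest(), "big")
    return h % num_servers

# When the number of MSs changes, most files map to a different server,
# so their metadata must migrate.
paths = [f"/data/file{i}" for i in range(10000)]
moved = sum(home_server(p, 16) != home_server(p, 17) for p in paths)
print(f"{moved / len(paths):.0%} of files would migrate going from 16 to 17 MSs")
```

With a uniform hash, the expected fraction of files that keep their home MS is only 1/17, which is why a single join or leave triggers a near-total reshuffle.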
Static Tree Partitioning
Static namespace partitioning is a simple way of distributing metadata operations to a group of MSs. A common partitioning technique has been to divide the directory tree during the process of installing or mounting and to store the information at some well-known locations. Some distributed file systems, such as NFS, AFS, and Coda, follow this approach. This scheme works well only when file access patterns are uniform, resulting in a balanced workload. Unfortunately, access patterns in general file systems are highly skewed, and thus this partitioning scheme can lead to a highly imbalanced workload if files in some particular subdirectories become more popular than others.
Dynamic Tree Partitioning
Dynamic tree partitioning addresses the disadvantages of the static tree partitioning approach by dynamically partitioning the namespace across a cluster of MSs in order to scale up the aggregate metadata throughput. The key design idea is that the partition is initially performed by hashing directories near the root of the hierarchy, and when a server becomes heavily loaded, this busy server automatically migrates some subdirectories to other servers with less load. It also proposes prefix caching to efficiently utilize available RAM on all servers to further improve performance. This approach has three major disadvantages. First, it assumes that an accurate load measurement scheme is available on each server and that all servers periodically exchange load information. Second, when an MS joins or leaves due to failure or recovery, all directories need to be rehashed to reflect the change in the server infrastructure, which generates a prohibitively high overhead in a petabyte-scale storage system. Third, when the hot spots of metadata operations shift as the system evolves, frequent metadata migration to remove these hot spots may impose a large overhead and offset the benefits of load balancing.
Comparison of Existing Schemes
This section summarizes the existing state-of-the-art approaches to decentralizing metadata management and compares them with the HBA scheme, which will be detailed later in this paper. [Table: Comparison of HBA with Existing Decentralization Schemes; n and d are the total number of files and partitioned subdirectories, respectively.] Each existing solution has its own advantages and disadvantages. The hashing-based mapping approach can balance metadata workloads and inherently has fast metadata lookup operations, but it has slow directory operations, such as listing directory contents and renaming directories. In addition, when the total number of MSs changes, rehashing all existing files generates a prohibitive migration overhead. The table-based mapping method does not require any metadata migration, but it fails to balance the load. Furthermore, a back-of-the-envelope calculation shows that it would take as much as 1.8 Gbytes of memory to store such a table with 100 million files. The static tree partitioning approach has zero migration overhead, small memory overhead, and fast directory operations. However, it cannot balance the load, and it has a medium lookup time, since hot spots usually exist in this approach. Similar to hashing-based mapping, dynamic tree partitioning has fast lookup operations and small memory overhead. However, this approach relies on load monitors to balance metadata workloads and thus incurs a large migration overhead. To combine their advantages and avoid their disadvantages, a novel approach, called HBA, is proposed in this paper to efficiently route metadata requests within a group of MSs. The detailed design of HBA will be presented later in this project.
Techniques and Algorithm Used
Static Tree Partitioning
Static namespace partitioning is a simple way of distributing metadata operations to a group of MSs. A common partitioning technique has been to divide the directory tree during the process of installing or mounting and to store the information at some well-known locations. Some distributed file systems, such as NFS, AFS, and Coda, follow this approach. This scheme works well only when file access patterns are uniform, resulting in a balanced workload. Unfortunately, access patterns in general file systems are highly skewed, and thus this partitioning scheme can lead to a highly imbalanced workload if files in some subdirectories become more popular than others.
HBA
A straightforward extension of the BF approach to decentralizing metadata management onto multiple MSs is to use an array of BFs on each MS. The metadata of each file is stored on some MS, called the home MS. In this design, each MS builds a BF that represents all files whose metadata is stored locally and then replicates this filter to all other MSs. Including the replicas of the BFs from the other servers, an MS stores all filters in an array. When a client initiates a metadata request, the client randomly chooses an MS and asks this server to perform the membership query against this array. The BF array is said to have a hit if exactly one filter gives a positive response. A miss is said to have occurred whenever no hit or more than one hit is found in the array. The desired metadata can be found on the MS represented by the hit BF with a very high probability.
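The hit/miss rule described above can be sketched as follows. Plain Python sets stand in for the Bloom filters here; a real BF answers the same membership query with a small false-positive probability, which is what makes multiple-hit misses possible:

```python
def query_bf_array(filters, filename):
    """Return the index of the home MS if exactly one filter answers
    'maybe present'; otherwise report a miss (zero or multiple hits)."""
    positives = [i for i, bf in enumerate(filters) if filename in bf]
    if len(positives) == 1:
        return positives[0]   # hit: very likely the home MS
    return None               # miss: fall back to asking all MSs

# Stand-in filters: one "filter" per MS, listing the files homed there.
filters = [{"/a/x.txt"}, {"/b/y.txt"}, {"/b/z.txt"}]
print(query_bf_array(filters, "/b/y.txt"))  # -> 1
print(query_bf_array(filters, "/c/w.txt"))  # -> None (no filter matches)
```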
Advantages
1. Faster search of different files from all the nodes of a network
2. Good results even when searching by file size or by part of a file name.
Applications
This software can be used in any intranet for fast search of files across the network.
Abstract
An efficient and distributed scheme for file mapping or file lookup is critical in decentralizing metadata management within a group of metadata servers. This paper presents a novel technique called HBA (Hierarchical Bloom filter Arrays) to map filenames to the metadata servers holding their metadata. Two levels of probabilistic arrays, namely, Bloom filter arrays, with different levels of accuracy, are used on each metadata server. One array, with lower accuracy and representing the distribution of the entire metadata, trades accuracy for significantly reduced memory overhead, while the other array, with higher accuracy, caches partial distribution information and exploits the temporal locality of file access patterns. Both arrays are replicated to all metadata servers to support fast local lookups. We evaluate HBA through extensive trace-driven simulations and an implementation in Linux. Simulation results show our HBA design to be highly effective and efficient in improving the performance and scalability of file systems in clusters with 1,000 to 10,000 nodes (or super-clusters) and with the amount of data at the petabyte scale or higher. Our implementation indicates that HBA can reduce the metadata operation time of a single-metadata-server architecture by a factor of up to 43.9 when the system is configured with 16 metadata servers.
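The two-level lookup described in the abstract can be sketched as follows. Sets again stand in for the two BF arrays, and the function and variable names are illustrative, not taken from the paper's implementation:

```python
def hba_lookup(lru_array, global_array, filename):
    """Two-level HBA lookup: try the small, high-accuracy LRU array first
    (recently accessed files), then the low-accuracy global array."""
    for array in (lru_array, global_array):
        positives = [i for i, bf in enumerate(array) if filename in bf]
        if len(positives) == 1:
            return positives[0]   # unique hit at this level
    return None                   # miss at both levels: ask all MSs

# Stand-ins: the LRU level knows only hot files; the global level covers
# everything but, being coarser, can produce extra (false) positives.
lru = [set(), {"/b/y.txt"}, set()]
glb = [{"/a/x.txt"}, {"/b/y.txt"}, {"/b/y.txt"}]
print(hba_lookup(lru, glb, "/b/y.txt"))  # -> 1 (resolved by the LRU level)
print(hba_lookup(lru, glb, "/a/x.txt"))  # -> 0 (falls through to the global level)
```

Because hot files are resolved by the small accurate array, the large global array can afford a higher false-positive rate, which is exactly the memory-versus-accuracy trade-off the abstract describes.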
Software Requirements
Operating System : Windows XP Professional
Front End : Microsoft Visual Studio .NET 2005
Coding Language : ASP.NET 2.0, C# 2.0
Back End : SQL Server 2000