07-02-2012, 12:46 PM
HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems
[attachment=17235]
INTRODUCTION
Rapid advances in general-purpose communication networks
have motivated the deployment of inexpensive
components to build competitive cluster-based storage
solutions that meet the increasing demand for scalable
computing [1], [2], [3], [4], [5], [6]. In recent years, the
bandwidth of these networks has increased by two
orders of magnitude [7], [8], [9], which greatly narrows the
performance gap between them and the dedicated networks
used in commercial storage systems. This significant
improvement offers an appealing opportunity to provide
cost-effective high-performance storage services by aggregating
existing storage resources on each commodity PC in
a computing cluster connected by such networks, provided that a scalable
scheme is in place to efficiently virtualize these distributed
resources into a single-disk image. The key challenge in
realizing this objective lies in the potentially huge number
of nodes (in thousands) in such a cluster. Currently, clusters
with thousands of nodes are already in existence, and
clusters with even larger numbers of nodes are expected in
the near future.
RELATED WORK AND COMPARISON OF
DECENTRALIZATION SCHEMES
Many cluster-based storage systems employ centralized
metadata management. Experiments in GFS show that a
single metadata server (MS) is not a performance bottleneck in a storage
cluster with 100 nodes under a read-only Google searching
workload. PVFS [3], which is a RAID-0-style parallel file
system, also uses a single MS design to provide a
clusterwide shared namespace. As data throughput is the
most important objective of PVFS, some expensive but
indispensable functions, such as concurrency control
between data and metadata, are not fully designed and
implemented. In CEFT [6], [10], [13], [17], which is an
extension of PVFS to incorporate a RAID-10-style fault
tolerance and parallel I/O scheduling, the MS synchronizes
concurrent updates, which can limit the overall throughput
under the workload of intensive concurrent metadata
updates. In Lustre [1], some low-level metadata management
tasks are offloaded from the MS to object storage
devices, and ongoing efforts are being made to decentralize
metadata management to further improve the scalability.
Table-Based Mapping
Globally replicating mapping tables is one approach to
decentralizing metadata management. There is a salient
trade-off between the space requirement and the granularity
and flexibility of distribution. A fine-grained table allows
more flexibility in metadata placement. In an extreme case,
if the table records the home MS for each individual file,
then the metadata of a file can be placed on any MS.
However, the memory space requirement for this approach
makes it unattractive for large-scale storage systems.
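The space cost of the fine-grained extreme can be sketched as follows. This is an illustrative sketch, not the implementation of any system discussed above; the function names and the 64-byte per-entry estimate are assumptions.

```python
# Sketch of a fine-grained (per-file) mapping table: one entry per file
# recording its home metadata server (MS). Names are illustrative only.

mapping = {}  # pathname -> MS id


def place(path: str, server_id: int) -> None:
    """Record the home MS of a file; any file can go to any MS."""
    mapping[path] = server_id


def lookup(path: str) -> int:
    """Resolve the home MS of a file with one table lookup."""
    return mapping[path]


def table_bytes(num_files: int, bytes_per_entry: int = 64) -> int:
    """Rough memory footprint of one table replica (assumed entry size)."""
    return num_files * bytes_per_entry
```

With an assumed 64 bytes per entry, a billion files already require tens of gigabytes per replica, and the table is replicated on every node, which is why this approach does not scale to large systems.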
Hashing-Based Mapping
Modulus-based hashing is another decentralized scheme.
This approach hashes a symbolic pathname of a file to a
digital value and assigns its metadata to a server according
to the modulus value with respect to the total number of
MSs. In practice, the likelihood of a serious skew in the metadata
workload is negligible in this scheme, since the
number of frequently accessed files is usually much larger
than the number of MSs. However, a serious problem arises
when an upper directory is renamed or the total number of
MSs changes: the hash mapping must be recomputed,
which requires all affected metadata to be
migrated among MSs.
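The modulus scheme and its rename sensitivity can be sketched as follows; this is a minimal illustration under assumed names, using MD5 as a stand-in for whatever hash a real system would choose.

```python
import hashlib


def home_server(path: str, num_servers: int) -> int:
    """Map a full symbolic pathname to an MS by hashing it and taking
    the modulus with respect to the total number of MSs."""
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


# Renaming an upper directory changes every descendant's pathname, so
# the hash value -- and typically the home MS -- of each descendant
# changes, forcing metadata migration:
before = home_server("/proj/data/file1", 16)
after = home_server("/archive/data/file1", 16)  # typically != before

# Changing the number of MSs has the same effect: most files map to a
# different server under the new modulus.
```

The same file always hashes to the same server while the namespace and server count are stable; the migration cost appears only on renames and cluster resizing, which is exactly the weakness described above.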
Dynamic Tree Partitioning
Weil et al. [29] observe the disadvantages of the static tree
partition approach and propose to dynamically partition
the namespace across a cluster of MSs in order to scale up
the aggregate metadata throughput. The key design idea is
that initially, the partition is performed by hashing
directories near the root of the hierarchy, and when a
server becomes heavily loaded, this busy server automatically
migrates some subdirectories to other servers with
less load. It also proposes prefix caching to efficiently
utilize available RAM on all servers to further improve the
performance. This approach has three major disadvantages.