03-05-2011, 04:34 PM
Abstract—
We consider an end-to-end approach of inferring probabilistic data-forwarding failures in an externally managed overlay
network, where overlay nodes are independently operated by various administrative domains. Our optimization goal is to minimize the
expected cost of correcting (i.e., diagnosing and repairing) all faulty overlay nodes that cannot properly deliver data. Instead of first
checking the most likely faulty nodes as in conventional fault localization problems, we prove that an optimal strategy should start with
checking one of the candidate nodes, which are identified based on a potential function that we develop. We propose several efficient
heuristics for inferring the best node to be checked in large-scale networks. By extensive simulation, we show that we can infer the
best node in at least 95% of time, and that first checking the candidate nodes rather than the most likely faulty nodes can decrease the
checking cost of correcting all faulty nodes.
Index Terms—network management, network diagnosis and correction, fault localization and repair, reliability engineering.
Note: An earlier and shorter conference version of this
paper appeared in IEEE INFOCOM ’07 [23].
1 INTRODUCTION
Network components are prone to a variety of faults
such as packet loss, page link cut, or node outage. To prevent
the faulty components from hindering network applications,
it is important to diagnose (i.e., detect and localize)
the components that are the root cause of network faults,
as in [18], [19], [32]. However, it is also desirable to repair
the faulty components to enable them to return to their
operational states. Therefore, we focus on network fault
correction, by which we mean not only to diagnose, but
also to repair all faulty components within a network.
In addition, it has been shown that a network outage
can bring significant economic loss. For example, the
revenue loss due to a 24-hour outage of a Switzerlandbased
Internet service provider can be more than CHF
30 million [15]. As a result, we want to devise a costeffective
network fault correction mechanism that corrects
all network faults at minimum cost.
To diagnose (but not repair) network faults, recent
approaches like [4], [28], [36] use all network nodes to
collaboratively achieve this. For instance, in hop-by-hop
authentication [4], each hop inspects packets received
from its previous hop and reports errors when packets
are found to be corrupted. While such a distributed
infrastructure can accurately pinpoint network faults deploying and maintaining numerous monitoring points
in a large-scale network introduces heavy computational
overhead in collecting network statistics [10] and involves
complicated administrative management [7]. In
particular, it is difficult to directly monitor and access all
overlay nodes in an externally managed network, whose
routing nodes are independently operated by various
administrative domains. In this case, we can only infer
the network condition from end-to-end information.
Here, we consider an end-to-end inference approach
which, using end-to-end measurements, infers components
that are probably faulty in forwarding data in an
application-layer overlay network whose overlay nodes
are externally managed by independent administrative
domains. We start with a routing tree topology with a set
of overlay nodes, since a tree-based setting is typically
seen in destination-based routing (e.g., CAN [30] and
Chord [33]), where each overlay node builds a routing
tree with itself as a root, as well as in multicast routing,
where a routing tree is built to connect members in
a multicast group. We then monitor every root-to-leaf
overlay path. If a path exhibits any “anomalous behavior”
in forwarding data, then some “faulty” overlay
node on the path must be responsible. In practice, the
precise definition of an “anomalous behavior” depends
on specific applications. For instance, a path is said to be
anomalous if it fails to deliver a number of correct packets
within a time window. Using the path information
collected at the application endpoints (i.e., leaf nodes),
we can narrow down the space of possibly faulty overlay
nodes.
Download full report
http://ieeexplore.ieeeiel5/71/5400984/04...er=4840335