10-03-2011, 10:28 AM
[attachment=9915]
Search and Replication in Unstructured Peer-to-Peer Networks
Disclaimer
• Results, statements, opinions in this talk do not represent Cisco in anyway
• This talk is about technical problems in networking, and does not discuss moral, legal and other issues related to P2P networks and their applications
Example systems:
– Napster, Gnutella;
– Freenet, FreeHaven, MajoNation, Alpine, ...
– JXTA, Ohaha, …
– Chord, CAN, “Past”, “Tapestry”, Oceanstore
Architecture Comparisons
• Napster: centralized
– A central website to hold file directory of all participants; Very efficient
– Scales
– Problem: Single point of failure
• Gnutella: decentralized
– No central directory; use “flooding w/ TTL”
– Very resilient against failure
– Problem: Doesn’t scale
Architecture Comparisons
• Various research projects such as CAN: decentralized, but “structured”
– CAN: distributed hash table
– “Structure”: all nodes participate in a precise scheme to maintain certain invariants
– Extra work when nodes join and leave
– Scales very well, but can be fragile
Architecture Comparisons
• FreeNet: decentralized, but semi-structured
– Intended for file storage
– Files are stored along a route biased by hints
– Queries for files follow a route biased by the same hints
– Scales very well
– Problem: would it really work?
• Simulation says yes in most cases, but no proof so far
• Our Focus: Gnutella-Style Systems
• Advantages of Gnutella:
– Support more flexible queries
• Typically, precise “name” search is a small portion of all queries
– Simplicity, high resilience against node failures
• Problems of Gnutella: Scalability
– Bottleneck: interrupt rates on individual nodes
– Self-limiting network: nodes have to exit to get real work done!
• Evaluation Methodologies
Simulation based:
Network topology
• Distribution of object popularity
• Distribution of replication density of objects
• Evaluation Methods
Network topologies:
• Uniform Random Graph (Random)
• Average and median node degree is 4
• Power-Law Random Graph (PLRG)
• max node degree: 1746, median: 1, average: 4.46
• Gnutella network snapshot (Gnutella)
• Oct 2000 snapshot
• max degree: 136, median: 2, average: 5.5
• Two-dimensional grid (Grid)
Modeling Methods
• Object popularity distribution pi
• Uniform
• Zipf-like
• Object replication density distribution ri
• Uniform
• Proportional: ri µ pi
• Square-Root: ri µ Ö pi
Evaluation Metrics
• Overhead: average # of messages per node per query
• Probability of search success: Pr(success)
• Delay: # of hops till success
• Load on Individual Nodes
Why is a node interrupted:
• To process a query
• To route the query to other nodes
• To process duplicated queries sent to it
Duplication in Flooding-Based Searches
• Duplication increases as TTL increases in flooding
• Worst case: a node A is interrrupted by N * q * degree(A) messages
• Duplications in Various Network Topologies
• Relationship between TTL and Search Successes
• Problems with Simple TTL-Based Flooding
• Hard to choose TTL:
• For objects that are widely present in the network, small TTLs suffice
• For objects that are rare in the network, large TTLs are necessary
• Number of query messages grow exponentially as TTL grows
• Idea #1: Adaptively Adjust TTL
• “Expanding Ring”
• Multiple floods: start with TTL=1; increment TTL by 2 each time until search succeeds
• Success varies by network topology
• For “Random”, 30- to 70- fold reduction in message traffic
• For Power-law and Gnutella graphs, only
3- to 9- fold reduction
• Limitations of Expanding Ring
• Idea #2: Random Walk
• Simple random walk
• takes too long to find anything!
• Multiple-walker random walk
• N agents after each walking T steps visits as many nodes as 1 agent walking N*T steps
• When to terminate the search: check back with the query originator once every C steps
• Search Traffic Comparison
• Search Delay Comparison
• Lessons Learnt about Search Methods
• Adaptive termination
• Minimize message duplication
• Small expansion in each step
Flexible Replication
• In unstructured systems, search success is essentially about coverage: visiting enough nodes to probabilistically find the object => replication density matters
• Limited node storage => what’s the optimal replication density distribution?
• In Gnutella, only nodes who query an object store it => ri µ pi
• What if we have different replication strategies?
• Optimal ri Distribution
• Goal: minimize S( pi/ ri ), where S ri =R
• Calculation:
• introduce Lagrange multiplier l, find ri and l that minimize:
S( pi/ ri ) + l * (S ri - R)
=> l - pi/ ri2 = 0 for all i
=> ri µ Ö pi
Square-Root Distribution
• General principle: to minimize S( pi/ ri ) under constraint S ri =R, make ri propotional to square root of pi
• Other application examples:
– Bandwidth allocation to minimize expected download times
– Server load balancing to minimize expected request latency
Achieving Square-Root Distribution
• Suggestions from some heuristics
– Store an object at a number of nodes that is proportional to the number of node visited in order to find the object
– Each node uses random replacement
• Two implementations:
– Path replication: store the object along the path of a successful “walk”
– Random replication: store the object randomly among nodes visited by the agents
Evaluation of Replication Methods
• Metrics
– Overall message traffic
– Search delay
• Dynamic simulation
– Assume Zipf-like object query probability
– 5 query/sec Poisson arrival
– Results are during 5000sec-9000sec
Distribution of ri
Total Search Message Comparison
• Observation: path replication is slightly inferior to random replication
Search Delay Comparison
Summary
• Multi-walker random walk scales much better than flooding
– It won’t scale as perfectly as structured network, but current unstructured network can be improved significantly
• Square-root replication distribution is desirable and can be achieved via path replication