Evaluation of Network Stack Parallelism

presented by:
Syed Abdul Gafoor
P. Muralikrishna

Abstract
As technology trends push future microprocessors toward chip multiprocessor designs, operating system network stacks must be parallelized in order to keep pace with improvements in network bandwidth. There are two competing strategies for stack parallelization. Message-parallel network stacks use concurrent threads to carry out network operations on independent messages (usually packets), whereas connection-parallel stacks map operations to groups of connections and permit concurrent processing on independent connection groups. Connection-parallel stacks can use either locks or threads to serialize access to connection groups. This paper evaluates these parallel stack organizations using a modern operating system and chip multiprocessor hardware.
Compared to uniprocessor kernels, all parallel stack organizations incur additional locking overhead, cache inefficiencies, and scheduling overhead. However, the organizations balance these limitations differently, leading to variations in peak performance and connection scalability. Lock-serialized connection-parallel organizations reduce the locking overhead of message-parallel organizations by using many connection groups and eliminate the expensive thread handoff mechanism of thread-serialized connection-parallel organizations. The resultant organization outperforms the others, delivering 5.4 Gb/s of TCP throughput for most connection loads and providing a 126% throughput improvement versus a uniprocessor for the heaviest connection loads.
1 Introduction
As network bandwidths continue to increase at an exponential pace, the performance of modern network stacks must keep pace in order to efficiently utilize that bandwidth. In the past, exponential gains in microprocessor performance have always enabled processing power to catch up with network bandwidth. However, the complexity of modern uniprocessors will prevent such continued performance growth. Instead, microprocessors have begun to provide parallel processing cores to make up for the loss in performance growth of individual processor cores. For network servers to exploit these parallel processors, scalable parallelizations of the network stack are needed.
Modern network stacks can exploit either message-based parallelism or connection-based parallelism. Network stacks that exploit message-based parallelism, such as Linux and FreeBSD, allow multiple threads to simultaneously process different messages from the same or different connections. Network stacks that exploit connection-based parallelism, such as DragonflyBSD and Solaris 10, assign each connection to a group. Threads may then simultaneously process messages as long as they belong to different connection groups. The connection-based approach can use either threads or locks for synchronization, yielding three major parallel network stack organizations: message-based (MsgP), connection-based using threads for synchronization (ConnP-T), and connection-based using locks for synchronization (ConnP-L).
The uniprocessor version of FreeBSD is efficient, but its performance falls short of saturating available network resources in a modern machine and degrades significantly as connections are added. Utilizing 4 cores, the parallel stack organizations can outperform the uniprocessor stack (especially at high connection loads), but each parallel stack organization incurs higher locking overhead, reduced cache efficiency, and higher scheduling overhead than the uniprocessor. MsgP outperforms the uniprocessor for almost all connection loads but experiences significant locking overhead. In contrast, ConnP-T has very low locking overhead but incurs significant scheduling overhead, leading to reduced performance compared to even the uniprocessor kernel for all but the heaviest loads. ConnP-L mitigates the locking overhead of MsgP, by grouping connections so that there is little global locking, and the scheduling overhead of ConnP-T, by using the requesting thread for network processing rather than forwarding the request to another thread. This results in the best performance of all stacks considered, delivering stable performance of 5440 Mb/s for moderate connection loads and providing a 126% improvement over the uniprocessor kernel for large connection loads.
The following section further motivates the need for parallelized network stacks and discusses prior work. Section 3 then describes the parallel network stack architectures. Section 4 presents and discusses the results. Finally, Section 5 concludes the paper.
2 Background
Traditionally, uniprocessors have not been able to saturate the network with the introduction of each new Ethernet bandwidth generation, but exponential gains in uniprocessor performance have always allowed processing power to catch up with network bandwidth. However, the complexity of modern uniprocessors has made it prohibitively expensive to continue to improve processor performance at the same rate as in the past. Not only is it difficult to further increase clock frequencies, but it is also difficult to further improve the efficiency of complex modern uniprocessor architectures.
To further increase performance despite these challenges, industry has turned to single-chip multiprocessors (CMPs). IBM, Sun, AMD, and Intel have all released dual-core processors. Sun's Niagara is perhaps the most aggressive example, with 8 cores on a single chip, each capable of executing four threads of control. However, a CMP trades uniprocessor performance for additional processing cores, which should collectively deliver higher performance on parallel workloads. Therefore, the network stack will have to be parallelized extensively in order to saturate the network with modern microprocessors.
While modern operating systems exploit parallelism by allowing multiple threads to carry out network operations concurrently in the kernel, supporting this parallelism comes at a significant cost. For example, uniprocessor Linux kernels deliver 20% better end-to-end throughput over 10 Gigabit Ethernet than multiprocessor kernels.
In the mid-1990s, two forms of network processing parallelism were extensively examined: message-oriented and connection-oriented parallelism. Using message-oriented parallelism, messages (or packets) may be processed simultaneously by separate threads, even if those messages belong to the same connection. Using connection-oriented parallelism, messages are grouped according to connection, allowing concurrent processing of messages belonging to different connections.
Nahum et al. first examined message-oriented parallelism within the user-space x-kernel utilizing a simulated network device on an SGI Challenge multiprocessor. This study found that finer-grained locking around connection state variables generally degrades performance by introducing additional overhead and does not result in significant improvements in speedup. Rather, coarser-grained locking (with just one lock protecting all TCP state) performed best. They furthermore found that careful attention had to be paid to thread scheduling and lock acquisition ordering on the inbound path to ensure that received packets were not reordered during processing.
Yates et al. later examined a connection-oriented parallel implementation of the x-kernel, also utilizing a simulated network device and running on an SGI Challenge. They found that increasing the number of threads to match the number of connections yielded the best results, even far beyond the number of physical processors. They proposed using as many threads as were supported by the system, which was limited to 384 at that time.
Schmidt and Suda compared message-oriented and connection-oriented network stacks in a modified version of SunOS utilizing a real network interface. They found that with just a few connections, a connection-parallel stack outperforms a message-parallel one. However, they note that context switching increases significantly as connections (and processors) are added to the connection-parallel scheme, and that synchronization cost heavily affects the efficiency with which each scheme operates (especially the message-parallel scheme).
Synchronization and context-switch costs have changed dramatically in recent years. The gap between memory system and processing performance has become much greater, vastly increasing synchronization cost in terms of lost execution cycles and exacerbating the cost of context switches as thread state is swapped in memory. Both the need to close the gap between Ethernet bandwidth and microprocessor performance and the vast changes in the architectural characteristics that shaped prior parallel network stack analyses motivate a fresh examination of parallel network stack architectures on modern parallel hardware.
3 Parallel Network Stack Architectures
Despite the conclusions of the 1990s, no solid consensus exists among modern operating system developers regarding efficient, scalable parallel network stack design. Current versions of FreeBSD and Linux incorporate variations of message parallelism within their network stacks. Conversely, the network stack within Solaris 10 incorporates a variation of connection-based parallelism, as does DragonflyBSD. Willmann et al. present a detailed description of parallel network stack organizations, and a brief overview follows.
3.1 Message-based Parallelism (MsgP)
Message-based parallel (MsgP) network stacks, such as FreeBSD, allow multiple threads to operate within the network stack simultaneously and permit these various threads to process messages independently. Two types of threads may perform network processing: one or more application threads and one or more inbound protocol threads. When an application thread makes a system call, that calling thread context is "borrowed" to carry out the requested service within the kernel. When the network interface card (NIC) interrupts the host, the NIC's associated inbound protocol thread services the NIC and processes received packets "up" through the network stack.

Given these concurrent application and inbound protocol threads, FreeBSD utilizes fine-grained locking around shared kernel structures to ensure proper message ordering and connection state consistency. As a thread attempts to send or receive a message on a connection, it must acquire various locks when accessing shared connection state, such as the global connection hashtable lock (for looking up TCP connections) and per-connection locks (for both socket state and TCP state). This locking organization enables concurrent processing of different messages on the same connection.
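To make this locking discipline concrete, the following C sketch (using POSIX threads and hypothetical names such as conn_lookup and msgp_process; it is not actual FreeBSD code) illustrates the two-level pattern: a global hashtable lock guards connection lookup, while a per-connection lock serializes access to that connection's socket and TCP state.

#include <pthread.h>
#include <stddef.h>

#define HASH_BUCKETS 256

struct conn {
    unsigned         key;        /* stand-in for the TCP 4-tuple */
    pthread_mutex_t  lock;       /* per-connection lock */
    struct conn     *hash_next;  /* hashtable chain */
    /* ... socket and TCP state would live here ... */
};

static pthread_mutex_t conn_hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct conn *conn_hash[HASH_BUCKETS];

static struct conn *conn_lookup(unsigned key)
{
    struct conn *c = conn_hash[key % HASH_BUCKETS];
    while (c != NULL && c->key != key)
        c = c->hash_next;
    return c;
}

/* One MsgP thread (application or inbound protocol) processing one
 * message. A real stack must also keep the connection from being
 * freed between the two acquisitions (e.g., via reference counting),
 * which is elided here. */
void msgp_process(unsigned key, void *msg)
{
    pthread_mutex_lock(&conn_hash_lock);
    struct conn *c = conn_lookup(key);
    pthread_mutex_unlock(&conn_hash_lock);
    if (c == NULL)
        return;

    pthread_mutex_lock(&c->lock);
    /* ... TCP/IP processing of msg against c's state ... */
    (void)msg;
    pthread_mutex_unlock(&c->lock);
}

Note that every thread passes through conn_hash_lock on every operation; contention on such shared locks is one source of the MsgP locking overhead discussed in Section 4.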
Note that the inbound thread configuration described is not the FreeBSD 7 default. Normally, parallel driver threads service each NIC and then hand off inbound packets to a single worker thread. That worker thread then processes the received packets "up" through the network stack. The default configuration limits the performance of MsgP, so it is not considered in this paper. The thread-per-NIC model also differs from the message-parallel organization described by Nahum et al., which used many more worker threads than interfaces. Such an organization requires a sophisticated scheme to ensure these worker threads do not reorder inbound packets, hence it is also not considered.
3.2 Connection-based Parallelism (ConnP)
To compare connection parallelism in the same framework as message parallelism, FreeBSD 7 was modified to support two variants of connection-based parallelism (ConnP) that differ in how they serialize TCP/IP processing within a connection. The first variant assigns each connection to a protocol processing thread (ConnP-T), and the second assigns each connection to a lock (ConnP-L).
3.2.1 Thread Serialization (ConnP-T)
Connection-based parallelism using threads utilizes several kernel threads dedicated to protocol processing, each of which is assigned a subset of the system's connections. At each entry point into the TCP/IP protocol stack, a request for service is enqueued for the appropriate protocol thread based on the TCP connection. Later, the protocol threads, which only carry out TCP/IP processing and are bound to a specific CPU, dequeue requests and process them appropriately. Because connections are uniquely and persistently assigned to a specific protocol thread, no per-connection state locking is required. These protocol threads implement both synchronous operations, for applications that require a return code, and asynchronous operations, for drivers that simply enqueue packets and then continue servicing the NIC.
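As a rough illustration of this dispatch model, consider the minimal C sketch below (hypothetical types and names such as proto_thread and connp_t_submit; not the actual modified FreeBSD code). Each protocol thread owns a request queue; entry points enqueue work for the thread that owns the connection, synchronous callers block until the protocol thread signals completion, and asynchronous callers return to servicing the NIC immediately.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct request {
    struct request  *next;
    void           (*fn)(void *);  /* the protocol operation */
    void            *arg;
    bool             sync;         /* does the caller wait for a result? */
    bool             done;
    pthread_mutex_t  done_lock;    /* assumed initialized by the allocator */
    pthread_cond_t   done_cv;
};

struct proto_thread {              /* one per CPU-bound protocol thread */
    pthread_mutex_t  q_lock;
    pthread_cond_t   q_cv;
    struct request  *q_head, *q_tail;
};

/* Entry point: enqueue a request on the protocol thread that owns the
 * connection. Synchronous callers (applications needing a return code)
 * block; asynchronous callers (drivers) return immediately. */
void connp_t_submit(struct proto_thread *pt, struct request *r)
{
    pthread_mutex_lock(&pt->q_lock);
    r->next = NULL;
    if (pt->q_tail != NULL)
        pt->q_tail->next = r;
    else
        pt->q_head = r;
    pt->q_tail = r;
    pthread_cond_signal(&pt->q_cv);
    pthread_mutex_unlock(&pt->q_lock);

    if (r->sync) {
        pthread_mutex_lock(&r->done_lock);
        while (!r->done)
            pthread_cond_wait(&r->done_cv, &r->done_lock);
        pthread_mutex_unlock(&r->done_lock);
    }
}

/* Protocol thread main loop: no per-connection locking is needed,
 * because this thread is the only one that touches its connections. */
void *proto_thread_main(void *arg)
{
    struct proto_thread *pt = arg;
    for (;;) {
        pthread_mutex_lock(&pt->q_lock);
        while (pt->q_head == NULL)
            pthread_cond_wait(&pt->q_cv, &pt->q_lock);
        struct request *r = pt->q_head;
        pt->q_head = r->next;
        if (pt->q_head == NULL)
            pt->q_tail = NULL;
        pthread_mutex_unlock(&pt->q_lock);

        r->fn(r->arg);             /* TCP/IP processing */

        pthread_mutex_lock(&r->done_lock);
        r->done = true;
        pthread_cond_signal(&r->done_cv);
        pthread_mutex_unlock(&r->done_lock);
    }
    return NULL;
}

The enqueue, wake-up, and completion signaling on every operation are the thread handoff costs that surface as ConnP-T's scheduling overhead.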
The connection-based parallel stack uniquely maps a packet or socket request to a specific protocol thread by hashing the 4-tuple of remote IP address, remote port number, local IP address, and local port number. When the entire tuple is not yet defined (e.g., prior to port assignment during a listen() call), the corresponding operation executes on protocol thread 0 and may later migrate to another thread when the tuple becomes fully defined.
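A minimal sketch of this classification step follows (the multiplicative mixing shown is illustrative, not the paper's actual hash function):

#include <stdint.h>

#define NPROTO_THREADS 4   /* e.g., one protocol thread per core */

/* Map a connection's 4-tuple to its owning protocol thread. Socket
 * requests and inbound packets must use the same function so that a
 * connection is always served by the same thread. */
unsigned conn_to_thread(uint32_t raddr, uint16_t rport,
                        uint32_t laddr, uint16_t lport)
{
    uint32_t h = raddr ^ laddr ^ (((uint32_t)rport << 16) | lport);
    h *= 2654435761u;      /* Knuth multiplicative hash step */
    return h % NPROTO_THREADS;
}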
3.2.2 Lock Serialization (ConnP-L)
Connection-based parallelism using locks also separates connections into groups, but each group is protected by a single lock, rather than only being processed by a single thread. As in connection-based parallelism using threads, application threads entering the kernel for network service and driver threads passing up received packets both classify each request to a particular connection group. However, application threads then acquire the lock for the group associated with the given connection and carry out the request with private access to any group-wide structures (including connection state). For inbound packet processing, the driver thread classifies each inbound packet to a specific group, acquires the group lock associated with the packet, and then processes the packet "up" through the network stack. As in the MsgP case, there is one inbound protocol thread for each NIC, but the number of groups may far exceed the number of threads.
This implementation of connection-oriented parallelism is similar to that of Solaris 10, which permits a network operation either to be carried out directly after acquisition of a group lock or to be passed on to a worker thread for later processing. ConnP-L is more rigidly defined; application and inbound protocol threads always acquire exclusive control of the group lock.
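For contrast with the ConnP-T sketch above, the following minimal C illustration (again with hypothetical names, not the actual modified FreeBSD code) shows the ConnP-L path: the requesting thread classifies the operation to a group, acquires that group's lock, and performs the protocol work itself rather than handing it off to another thread.

#include <pthread.h>

#define NGROUPS 128   /* connection groups; may far exceed thread count */

static pthread_mutex_t group_lock[NGROUPS];

/* Called once at startup to initialize the per-group locks. */
void connp_l_init(void)
{
    for (int i = 0; i < NGROUPS; i++)
        pthread_mutex_init(&group_lock[i], NULL);
}

/* The requesting thread (application or driver) carries out the
 * protocol operation itself, with private access to all state in the
 * connection's group while the group lock is held. */
void connp_l_process(unsigned group, void (*op)(void *), void *arg)
{
    pthread_mutex_lock(&group_lock[group % NGROUPS]);
    op(arg);   /* TCP/IP processing */
    pthread_mutex_unlock(&group_lock[group % NGROUPS]);
}

With many groups, two threads rarely contend for the same lock, which is how ConnP-L avoids both the global locking of MsgP and the handoff cost of ConnP-T.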