Hyper-Threading Technology
ABSTRACT
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
The first implementation of Hyper-Threading Technology was done on the Intel Xeon processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on each logical processor even if the other is stalled, and to deliver full performance even when there is only one active logical processor.
The potential for Hyper-Threading Technology is tremendous; our current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is only gated by the availability and prevalence of threaded applications and workloads in those markets.
INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand, we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvements (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly complex, with more transistors and higher power consumption. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.
Making Hyper-Threading Technology a reality was the result of enormous dedication, planning, and sheer hard work from a large number of designers, validators, architects, and others. There was incredible teamwork from the operating system developers, BIOS writers, and software developers who helped with innovations and provided support for many decisions that were
made during the definition process of Hyper-Threading Technology. Many dedicated engineers are continuing to work with our ISV partners to analyze application performance for this technology. Their contributions and hard work have already made and will continue to make a real difference to our customers.
PROCESSOR MICRO-ARCHITECTURE

Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because there will be far more instructions in flight in a super-pipelined microarchitecture, handling of events that disrupt the pipeline, e.g., cache misses, interrupts, and branch mispredictions, can be costly. ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, a super-scalar processor has multiple parallel execution units that can process instructions simultaneously. With super-scalar execution, several instructions can be executed each clock cycle. However, even with simple in-order execution, it is not enough simply to have multiple execution units; the challenge is to find enough instructions to execute.
One technique is out-of-order execution, where a large window of instructions is simultaneously evaluated and sent to execution units based on instruction dependencies rather than program order. Accesses to DRAM memory are slow compared to the execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, caches can only be fast when they are small. For this reason, processors often are designed with a cache hierarchy in which fast, small caches are located and operated at access latencies very close to that of the processor core, and progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.
The vast majority of techniques used to improve processor performance from one generation to the next are complex and often add significant die-size and power costs. These techniques increase performance, but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double its performance, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double performance, due to the number of processor cycles lost to branch mispredictions.
The figure shows the relative increase in performance and the costs, such as die size and power, over the last ten years on Intel processors. In order to isolate the microarchitecture impact, this comparison assumes that the four generations of processors are on the same silicon process technology and that the speed-ups are normalized to the performance of an Intel 486 processor. Although we use Intel's processor history in this example, other high-performance processor manufacturers during this time period would show similar trends. Intel's processor performance, due to microarchitecture advances alone, has improved integer performance five- or six-fold. Most integer applications have limited ILP, and the instruction flow can be hard to predict.

Over the same period, the relative die size has gone up fifteen-fold, a three-times-higher rate than the gains in integer performance. Fortunately, advances in silicon process technology allow more transistors to be packed into a given amount of die area so that the actual measured die size of each generation microarchitecture has not increased significantly.
The relative power increased almost eighteen-fold during this period. Fortunately, there exist a number of known techniques to significantly reduce power consumption on processors and there is much on-going research in this area. However, current processor power dissipation is at the limit of what can be easily dealt with in desktop platforms and we must put greater emphasis on improving performance in conjunction with new technology, specifically to control power.
CONVENTIONAL MULTI-THREADING
A look at today's software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have been trying to leverage this so-called thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio.
In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on
multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.
In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced. One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die.
The two processors each have a full set of execution and architectural resources. The processors may or may not share a large on-chip cache. CMP is largely orthogonal to conventional multiprocessor systems, as you can have multiple CMP processors in a multiprocessor configuration. Recently announced processors incorporate two processors on each die. However, a CMP chip is significantly larger than a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die size and power considerations.
TIME-SLICE MULTI-THREADING
Time-slice multithreading is where the processor switches between software threads after a fixed time period. Quite a bit of what a CPU does is illusion. For instance, modern out-of-order processor architectures don't actually execute code sequentially in the order in which it was written. An out-of-order execution (OOE) architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions (if possible) so that they make maximum use of the processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory. To the programmer and the user, it looks as if an ordered, sequential stream of instructions went into the CPU and an identically ordered, sequential stream of
computational results emerged. Only the CPU knows in what order the program's instructions were actually executed, and in that respect the processor is like a black box to both the programmer and the user.
The same kind of sleight-of-hand happens when we run multiple programs at once, except that this time the operating system is also involved in the scam. To the end user, it appears as if the processor is "running" more than one program at the same time, and indeed, there actually are multiple programs loaded into memory. But the CPU can execute only one of these programs at a time. The OS maintains the illusion of concurrency by rapidly switching between running programs at a fixed interval, called a time slice. The time slice has to be small enough that the user doesn't notice any degradation in the usability and performance of the running programs, and it has to be large enough that each program has a sufficient amount of CPU time in which to get useful work done. Most modern operating systems include a way to change the size of an individual program's time slice, so a program with a larger time slice gets more actual execution time on the CPU relative to its lower-priority peers, and hence it runs faster.
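To make the time-slice mechanism concrete, here is a minimal sketch of a round-robin scheduler: each runnable program receives a fixed quantum of work before the scheduler switches to the next one. The program names, quantum, and work amounts are hypothetical, and real operating systems layer priorities, preemption, and I/O handling on top of this basic idea.

```python
from collections import deque

def run_time_sliced(programs, quantum=4):
    """Simulate round-robin time slicing.

    programs: dict mapping program name -> units of work remaining.
    quantum:  fixed time slice granted to each program per turn.
    Returns the order in which (program, work_done) turns were executed.
    """
    ready = deque(programs.items())          # ready queue of (name, remaining)
    trace = []
    while ready:
        name, remaining = ready.popleft()
        done = min(quantum, remaining)       # run for one slice (or less)
        remaining -= done
        trace.append((name, done))
        if remaining > 0:                    # not finished: back of the queue
            ready.append((name, remaining))
    return trace

# Example: three hypothetical programs; the user perceives them as concurrent,
# but the CPU only ever executes one at a time.
print(run_time_sliced({"editor": 6, "browser": 10, "compiler": 3}, quantum=4))
```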
Time-slice multithreading can effectively minimize the effects of long latencies to memory, but it can still result in wasted execution slots. Switch-on-event multithreading instead switches threads on long-latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, neither time-slice nor switch-on-event multithreading achieves optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.
Finally, there is simultaneous multi-threading, where multiple threads can execute on a single processor without switching. The threads execute simultaneously and make much better use of the resources.
This approach makes the most effective use of processor resources: it maximizes performance relative to transistor count and power consumption.
CONCEPT OF SIMULTANEOUS MULTI-THREADING
Simultaneous multithreading (SMT) is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to simultaneously compete for and share processor resources. Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread ILP. The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads that include commercial databases, web servers, and scientific applications, in both multiprogrammed and parallel environments. Simultaneous multithreading has already had an impact in both the academic and commercial communities.
It has been found that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or in their sizes. The architecture for simultaneous multithreading achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Simultaneous multithreading enjoys a 2.5-fold improvement over an unmodified superscalar with the same hardware resources. This speedup is enabled by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best"
instructions to the processor. Several heuristics that help to identify and use the best threads for fetch and issue have been found, and such heuristics can increase throughput by as much as 37%. Using the best fetch and issue alternatives, we then use bottleneck analysis to identify opportunities for further gains on the improved architecture.
Simultaneous Multithreading: Maximizing On-Chip Parallelism
The increase in component density on modern microprocessors has led to a substantial increase in on-chip parallelism. In particular, modern superscalar RISCs can issue several instructions to independent functional units each cycle. However, the benefit of such superscalar architectures is ultimately limited by the parallelism available in a single thread.
Simultaneous multithreading is a technique permitting several independent threads to issue instructions to a super-scalar's multiple functional units in a single cycle. In the most general case, the binding between thread and functional unit is completely dynamic. We present several models of simultaneous multithreading and compare them with wide superscalar, fine-grain multithreaded, and single-chip, multiple-issue multiprocessing architectures. To perform these evaluations, a simultaneous multithreaded architecture was simulated based on the DEC Alpha 21164 design, executing code generated by the Multiflow trace-scheduling compiler. The results show that: (1) No single latency-hiding technique is likely to produce acceptable utilization of wide superscalar processors. Increasing processor utilization will therefore require a new approach, one that attacks multiple causes of processor idle cycles. (2) Simultaneous multithreading is such a technique. With our
machine model, an 8-thread, 8-issue simultaneous multithreaded processor sustains over 5 instructions per cycle, while a single-threaded processor can sustain fewer than 1.5 instructions per cycle with similar resources and issue bandwidth. (3) Multithreaded workloads degrade cache performance relative to single-thread performance, as previous studies have shown. We evaluate several cache configurations and demonstrate that private instruction and shared data caches provide excellent performance regardless of the number of threads. (4) Simultaneous multithreading is an attractive alternative to single-chip multiprocessors. We show that simultaneous multithreaded processors with a variety of organizations are all superior to conventional multiprocessors with similar resources.
While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design.
CONVERTING TLP TO ILP VIA SMT
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Although they correspond to different granularities of
parallelism, ILP and TLP are fundamentally identical: they both identify independent instructions that can execute in parallel and therefore can utilize parallel hardware. Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors exploit TLP by executing different threads in parallel on different processors. Unfortunately, neither parallel processing style is capable of adapting to dynamically changing levels of ILP and TLP, because the hardware enforces the distinction between the two types of parallelism. A multiprocessor must statically partition its resources among the multiple CPUs (see Figure 1); if insufficient TLP is available, some of the processors will be idle. A superscalar executes only a single thread; if insufficient ILP exists, much of that processor's multiple-issue hardware will be wasted.
Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996; Gulati et al. 1996; Hirata et al. 1992] allows multiple threads to compete for and share available processor resources every cycle. One of its key advantages when executing parallel applications is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By allowing multiple threads to share the processor's functional units simultaneously, thread-level parallelism is essentially converted into instruction-level parallelism. An SMT processor can therefore accommodate variations in ILP and TLP. When a program has only a single thread (i.e., it lacks TLP), all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP.
Figure 1. A comparison of issue slot (functional unit) utilization in various architectures. Each square corresponds to an issue slot, with white squares signifying unutilized slots. Hardware utilization suffers when a program exhibits insufficient parallelism or when available parallelism is not used effectively. A superscalar processor achieves low utilization because of low ILP in its single thread. Multiprocessors physically partition hardware to exploit TLP, and therefore performance suffers when TLP is low (e.g., in sequential portions of parallel programs). In contrast, simultaneous multithreading avoids resource partitioning. Because it allows multiple threads to compete for all resources in the same cycle, SMT can cope with varying levels of ILP and TLP; consequently utilization is higher, and performance is better.
An SMT processor can uniquely exploit whichever type of parallelism is available, thereby utilizing the functional units more effectively to achieve the goals of greater throughput and significant program speedups.
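As a rough, hypothetical illustration of the issue-slot argument above, the toy model below compares a 4-wide superscalar, which can fill only as many slots as a single thread's ILP allows, against an SMT machine that lets a second thread fill the leftover slots. The per-cycle ILP values are invented for illustration only.

```python
def utilization_superscalar(width, ilp_per_cycle):
    """Fraction of issue slots filled when only one thread can issue."""
    used = sum(min(width, ilp) for ilp in ilp_per_cycle)
    return used / (width * len(ilp_per_cycle))

def utilization_smt(width, threads_ilp_per_cycle):
    """Fraction of issue slots filled when all threads may share the slots."""
    total_cycles = len(threads_ilp_per_cycle[0])
    used = 0
    for cycle in range(total_cycles):
        available = sum(t[cycle] for t in threads_ilp_per_cycle)
        used += min(width, available)       # slots are shared, not partitioned
    return used / (width * total_cycles)

# Hypothetical per-cycle ILP for two threads on a 4-wide machine.
t0 = [1, 3, 0, 2, 1]
t1 = [2, 1, 3, 0, 2]
print("superscalar:", utilization_superscalar(4, t0))   # limited by one thread
print("smt        :", utilization_smt(4, [t0, t1]))     # fills leftover slots
```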
SIMULTANEOUS MULTITHREADING AND MULTIPROCESSORS
Both the SMT (simultaneous multithreading) processor and the on-chip shared-memory MPs we examine are built from a common out-of-order superscalar base processor. The multiprocessor combines several of these superscalar CPUs in a small-scale MP, whereas simultaneous multithreading uses a wider-issue superscalar and then adds support for multiple contexts.
Base Processor Architecture
The base processor is a sophisticated, out-of-order superscalar processor with a dynamic scheduling core similar to the MIPS R10000 [Yeager 1996]. Figure 2 illustrates the organization of this processor, and Figure 3 shows its processor pipeline. On each cycle, the processor fetches a block of instructions from the instruction cache. After decoding these instructions, the register-renaming logic maps the logical registers to a pool of physical renaming registers to remove false dependencies. Instructions are then fed to either the integer or floating-point instruction queues. When their operands become available, instructions are issued from these queues to the corresponding functional units. Instructions are retired in order.
Figure 2. Organization of the base out-of-order superscalar processor.
Figure 3. Processor pipelines for the base superscalar and the SMT.
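The register-renaming step described above can be sketched as a table that maps logical registers onto physical registers drawn from a free list; each new definition of a logical register receives a fresh physical register, which removes false (write-after-write and write-after-read) dependencies. This is a simplified illustration, not the R10000's actual mechanism; the register counts and names are arbitrary.

```python
class RenameTable:
    def __init__(self, num_logical=32, num_physical=64):
        # Initially, logical register i maps to physical register i.
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.free = [f"p{i}" for i in range(num_logical, num_physical)]

    def rename(self, dest, srcs):
        """Rename one instruction: sources read the current mapping,
        the destination is given a fresh physical register."""
        renamed_srcs = [self.map[s] for s in srcs]
        new_phys = self.free.pop(0)          # allocate from the free list
        self.map[dest] = new_phys            # later readers see the new name
        return new_phys, renamed_srcs

rt = RenameTable()
# Two writes to r1 no longer conflict after renaming (false dependence removed):
print(rt.rename("r1", ["r2", "r3"]))   # ('p32', ['p2', 'p3'])
print(rt.rename("r1", ["r1", "r4"]))   # ('p33', ['p32', 'p4']) -- reads p32
```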
SMT ARCHITECTURE
The SMT architecture, which can simultaneously execute threads from up to eight hardware contexts, is a straightforward extension of the base processor. To support simultaneous multithreading, the base processor architecture requires significant changes in only two primary areas: the instruction fetch mechanism and the register file. A conventional system of branch prediction hardware (branch target buffer and pattern history table) drives instruction fetching, although we now have 8 program counters and 8 subroutine return stacks (1 per context). On each cycle, the fetch mechanism selects up to 2 threads (among threads not already incurring I-cache misses) and fetches up to 4 instructions from each thread (the 2.4 scheme from Tullsen et al. [1996]). The total fetch bandwidth of 8 instructions is therefore equivalent to that required for an 8-wide superscalar processor, and only 2 I-cache ports are required. Additional logic, however, is necessary in the SMT to prioritize thread selection. Thread priorities are assigned using the icount feedback technique, which favors threads that are using processor resources most effectively. Under icount, highest priority is given to the threads that have the fewest instructions in the decode, renaming, and queue pipeline stages. This approach prevents a single thread from clogging the instruction queue, avoids thread starvation, and provides a more even distribution of instructions from all threads, thereby heightening interthread parallelism. The peak throughput of our machine is limited by the fetch and decode bandwidth of 8 instructions per cycle.
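A minimal sketch of the icount selection rule described above: each cycle, choose up to two threads that are not blocked on an I-cache miss and have the fewest instructions in the front-end pipeline stages. The data structures are hypothetical simplifications of the real hardware counters.

```python
def icount_select(threads, max_threads=2):
    """threads: list of dicts with keys
       'id', 'frontend_count' (instructions in decode/rename/queue stages),
       'icache_miss' (True if the thread cannot fetch this cycle).
       Returns the ids of the threads chosen to fetch this cycle."""
    eligible = [t for t in threads if not t["icache_miss"]]
    # Favor threads with the fewest in-flight front-end instructions.
    eligible.sort(key=lambda t: t["frontend_count"])
    return [t["id"] for t in eligible[:max_threads]]

threads = [
    {"id": 0, "frontend_count": 12, "icache_miss": False},
    {"id": 1, "frontend_count": 3,  "icache_miss": False},
    {"id": 2, "frontend_count": 7,  "icache_miss": True},   # blocked this cycle
    {"id": 3, "frontend_count": 5,  "icache_miss": False},
]
print(icount_select(threads))   # [1, 3]: the least-clogged runnable threads
```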
Following instruction fetch and decode, register renaming is performed, as in the base processor. Each thread can address 32 architectural integer (and FP) registers. The register-renaming mechanism maps these architectural registers (1 set per thread) onto the machine's physical registers. An 8-context SMT will require at least 8 * 32 = 256 physical registers, plus additional physical registers for register renaming. With a larger register file, longer access times will be required, so the SMT processor pipeline is extended by 2 cycles to avoid an increase in the cycle time. Figure 3 compares the pipeline for the SMT versus that of the base superscalar. On the SMT, register reads take 2 pipe stages and are pipelined. Writes to the register file behave in a similar manner, also using an extra pipeline stage. In practice, we found that the lengthened pipeline degraded performance by less than 2% when running a single thread. The additional pipe stage requires an extra level of bypass logic, but the number of stages has a smaller impact on the complexity and delay of this logic (O(n)) than the issue width (O(n^2)) [Palacharla et al. 1997]. Our previous study contains more details regarding the effects of the two-stage register read/write pipelines on the architecture and performance.
In addition to the new fetch mechanism and the larger register file and longer pipeline, only three processor resources are replicated or added to support SMT: per-thread instruction retirement, trap mechanisms, and an additional thread id field in each branch target buffer entry. No additional hardware is required to perform multithreaded scheduling of instructions to functional units. The register-renaming phase removes any apparent interthread register dependencies, so that the conventional instruction queues can be used to dynamically schedule instructions from multiple threads. Instructions from all threads are dumped into the instruction queues, and an instruction from any thread can be issued from the queues once its operands become available. In this design, few resources are statically partitioned among contexts; consequently, almost all hardware resources are available even when only one thread is executing. The architecture allows us to achieve the performance advantages of simultaneous multithreading, while keeping intact the design and single-thread peak performance of the dynamically scheduled CPU core present in modern superscalar architectures.
Single-Chip Multiprocessor Hardware Configurations
In our analysis of SMT and multiprocessing, we focus on a particular region of the MP design space, specifically, small-scale, single-chip, shared-memory
multiprocessors. As chip densities increase, single-chip multiprocessing will be possible, and some architects have already begun to investigate this use of chip real estate [Olukotun et al. 1996]. An SMT processor and a small-scale, on-chip multiprocessor have many similarities: for example, both have large numbers of registers and functional units, on-chip caches, and the ability to issue multiple instructions each cycle. In this study, we keep these resources approximately similar for the SMT and MP comparisons, and in some cases we give a hardware advantage to the MP.

We look at both two- and four-processor multiprocessors, partitioning the scheduling unit resources of the multiprocessor CPUs (the functional units, instruction queues, and renaming registers) differently for each case. In the two-processor MP (MP2), each processor receives half of the on-chip execution resources previously described, so that the total resources relative to an SMT are comparable (Table I). For a four-processor MP (MP4), each processor contains approximately one-fourth of the chip resources. The issue width for each processor in these two MP models is indicated by the total number of functional units. Note that even within the MP design space, these two alternatives (MP2 and MP4) represent an interesting tradeoff between TLP and ILP. The two-processor machine can exploit more ILP, because each processor has more functional units than its MP4 counterpart, whereas MP4 has additional processors to take advantage of more TLP.

Table I also includes several multiprocessor configurations in which we increase hardware resources. These configurations are designed to reduce bottlenecks in resource usage in order to improve aggregate performance. MP2fu (MP4fu), MP2q (MP4q), and MP2r (MP4r) address bottlenecks of functional units, instruction queues, and renaming registers, respectively. MP2a increases all three of these resources, so that the total execution resources of each processor are equivalent to a single SMT processor. (MP4a is similarly augmented in all three resource classes, so that the entire MP4a multiprocessor also has twice as many resources as our SMT.) For all MP configurations, the base processor uses the out-of-order scheduling processor described earlier and the base pipeline from Figure 3. Each MP processor supports only one context; therefore its register file will be smaller, and access will be faster than the SMT. Hence the shorter pipeline is more appropriate.
Synchronization Mechanisms and Memory Hierarchy
SMT has three key advantages over multiprocessing: flexible use of ILP and TLP, the potential for fast synchronization, and a shared L1 cache. This study focuses on SMT's ability to exploit the first advantage by determining the costs of partitioning execution resources; we therefore allow the multiprocessor to use SMT's synchronization mechanisms and cache hierarchy to avoid tainting our results with effects of the latter two. We implement a set of synchronization primitives for thread creation and termination, as well as hardware blocking locks. Because the threads in an SMT processor share the same scheduling core, inexpensive hardware blocking locks can be implemented in a synchronization functional unit. This cheap synchronization is not available to multiprocessors, because the distinct processors cannot share functional units. In our workload, most interthread synchronization is in the form of barrier synchronization or simple locks, and we found that synchronization time is not critical to performance. Therefore, we allow the MPs to use the same cheap synchronization techniques, so that our comparisons are not colored by synchronization effects.

The entire cache hierarchy, including the L1 caches, is shared by all threads in an SMT processor. Multiprocessors typically do not use a shared L1 cache to exploit data sharing between parallel threads. Each processor in an MP usually has its own private cache (as in the commercial multiprocessors described by Sun Microsystems [1997], Slater [1992], IBM [1997], and Silicon Graphics [1996]) and therefore incurs some coherence overhead when data sharing occurs. Because we allow the MP to use the SMT's shared L1 cache, this coherence overhead is eliminated. Although multiple threads may have working sets that interfere in a shared cache, that interthread interference is not a problem.
Comparing SMT and MP
In comparing the total hardware dedicated to our multiprocessor and SMT configurations, we have not taken into account chip area required for buses or the cycle-time effects of a wider-issue machine. In our study, the intent is not to claim that SMT has an absolute x percent performance advantage over MP, but instead to demonstrate that SMT can overcome some fundamental limitations of multiprocessors, namely, their inability to exploit changing levels of ILP and TLP. We believe that in the target design space we are studying, the intrinsic flaws resulting from resource partitioning in MPs will limit their effectiveness relative to SMT, even taking into consideration cycle time.
PARALLEL APPLICATIONS
SMT is most effective when threads have complementary hardware resource requirements. Multiprogrammed workloads and workloads consisting of parallel applications both provide TLP via independent streams of control, but they compete for hardware resources differently. Because a multiprogrammed workload does not share memory references across threads, it places more stress on the caches. Furthermore, its threads have different instruction execution patterns, causing interference in branch prediction hardware. On the other hand, multiprogrammed workloads are less likely to compete for identical functional units. Although parallel applications have the benefit of sharing the caches and branch prediction hardware, they are an interesting and different test of SMT for several reasons. First, unlike the multiprogrammed workload, all threads in a parallel application execute the same code and, therefore, have similar execution resource requirements, memory reference patterns, and levels of ILP. Because all threads tend to have the same resource needs at the same time, there is potentially more contention for these resources compared to a multiprogrammed workload.
Secondly, parallel applications illustrate the promise of SMT as an architecture
for improving the performance of single applications. By using threads to parallelize programs, SMT can improve processor utilization, but more importantly, it can achieve program speedups. Finally, parallel applications are a natural workload for traditional parallel architectures and therefore serve as a fair basis for comparing SMT and multiprocessors.
Effectively Using Parallelism on an SMT Processor
Rather than adding more execution resources to improve performance, SMT boosts performance and improves utilization of existing resources by using parallelism more effectively. Unlike multiprocessors that suffer from rigid partitioning, simultaneous multithreading permits dynamic resource sharing, so that resources can be flexibly partitioned on a per-cycle basis to match the ILP and TLP needs of the program. When a thread has a lot of ILP, it can access all
processor resources; and TLP can compensate for a lack of per-thread ILP.
As more threads are used, speedups increase (up to 2.68 on average with 8 threads), exceeding the performance gains attained by the enhanced MP configurations. The degree of SMT's improvement varies across the benchmarks, depending on the amount of per-thread ILP. The five programs with the least ILP (radix, tomcatv, hydro2d, water-spatial, and shallow) get the five largest speedups for SMT.T8, because TLP compensates for low ILP; programs that already have a large amount of ILP (LU and FFT) benefit less from using additional threads, because resources are already busy executing useful instructions. In linpack, performance tails off after two threads, because the granularity of parallelism in the program is very small. The gain from parallelism is outweighed by the overhead of parallelization (not only thread creation, but also the work required to set up the loops in each thread).
IMPLICITLY-MULTITHREADED PROCESSORS (IMT)
IMT executes compiler-specified speculative threads from a sequential program on a wide-issue SMT pipeline. IMT is based on the fundamental observation that Multiscalar's execution model, i.e., compiler-specified speculative threads [11], can be decoupled from the processor organization, i.e., distributed processing cores. Multiscalar [11] employs sophisticated specialized hardware, the register ring and address resolution buffer, which are strongly coupled to the distributed core organization. In contrast, IMT proposes to map speculative threads onto generic SMT. IMT differs fundamentally from prior proposals, TME and DMT, for speculative threading on SMT. While TME executes multiple threads only in the uncommon case of branch mispredictions, IMT invokes threads in the common case of correct predictions, thereby enhancing execution parallelism. Unlike IMT, DMT creates threads in hardware. Because of the lack of compile-time information, DMT uses value prediction to break data dependences across threads. Unfortunately, inaccurate value prediction incurs frequent misspeculation stalls, prohibiting DMT from extracting thread-level parallelism effectively. Moreover, selective recovery from misspeculation in DMT requires fast and frequent searches through prohibitively large (e.g., ~1000 entries) custom instruction trace buffers that are difficult to implement efficiently.

In this paper, we find that a naive mapping of compiler-specified speculative threads onto SMT performs poorly. Despite using an advanced compiler [14] to generate threads, a Naive IMT (N-IMT) implementation performs only comparably to an aggressive superscalar. N-IMT's key shortcoming is its indiscriminate approach to fetching/executing instructions from threads, without accounting for resource availability, thread resource usage, and inter-thread dependence information. The resulting poor utilization of pipeline resources (e.g., issue queue, load/store queues, and register file) in N-IMT negatively offsets the advantages of speculative threading.
An Implicitly-MultiThreaded (IMT) processor utilizes SMT's support for multithreading by executing speculative threads. Figure 3 depicts the anatomy of an IMT processor derived from SMT. IMT uses the rename tables for register renaming, the issue queue for out-of-order scheduling, and the per-context load/store queue (LSQ) and active list for memory dependences and instruction reordering prior to commit. As in SMT, IMT shares the functional units, physical registers, issue queue, and memory hierarchy among all contexts. IMT exploits implicit parallelism, as opposed to the programmer-specified, explicit parallelism exploited by conventional SMT and multiprocessors. Like Multiscalar, IMT predicts the threads in succession and maps them to execution resources, with the earliest thread as the non-speculative (head) thread, followed by subsequent speculative threads [11]. IMT honors the inter-thread control-flow and register dependences specified by the compiler. IMT uses the LSQ to enforce inter-thread memory dependences. Upon completion, IMT commits the threads in program order. Two IMT variations are presented: (1) a Naive IMT (N-IMT) that performs comparably to an aggressive superscalar, and (2) an Optimized IMT (O-IMT) that uses novel microarchitectural techniques to enhance performance.
Thread Invocation
Like Multiscalar, both IMT variants invoke threads in program order by predicting the next thread from among the targets of the previous thread (specified by the thread descriptor) using a thread predictor. A descriptor cache
(Figure 3) stores recently-fetched thread descriptors. Although threads are invoked in program order, IMT may fetch later threads' instructions out of order, prior to fetching all of the earlier threads' instructions, thereby interleaving instructions from multiple threads. To decide which thread to fetch from, IMT consults the fetch policy.
Resource Allocation & Fetch Policy
The N-IMT processor uses an unmodified ICOUNT policy [13], in which the thread with the fewest instructions in flight is chosen to fetch instructions from every cycle. The rationale is that the thread with the fewest instructions in flight is the one whose instructions are flowing through the pipeline with the fewest stalls. We also observe that the ICOUNT policy may be suboptimal for a processor in which threads exhibit control-flow and data dependences and resources are relinquished in program (and not thread) order. For instance, later (program-order) threads may cause resource (e.g., physical registers, issue queue, and LSQ entries) starvation in earlier threads, forcing the later threads to squash and relinquish the resources for use by earlier threads. Unfortunately, frequent thread squashing due to indiscriminate resource allocation without regard to demand incurs high overhead. Moreover, treating (control- and data-) dependent and independent threads alike is suboptimal. Fetching and executing instructions from later threads that are dependent on earlier threads may be counter-productive, because it increases inter-thread dependence delays by taking away front-end fetch and processing bandwidth from earlier threads. Finally, dependent instructions from later threads exacerbate issue queue contention because they remain in the queue until the dependences are resolved.
To mitigate the above shortcomings, O-IMT employs a novel resource- and dependence-based fetch policy that is bimodal. In the dependent mode, the policy biases fetch towards the non-speculative thread when the threads are likely to be dependent, fetching sequentially to the highest extent possible. In the independent mode, the policy uses ICOUNT when the threads are potentially independent, enhancing overlap among multiple threads. Because loop iterations are typically independent, the policy employs an Inter-Thread Dependence Heuristic (ITDH) to identify loop iterations for the independent mode, otherwise considering threads to be dependent. ITDH predicts that subsequent threads are loop iterations if the next two threads' start PCs are the same as the non-speculative (head) thread's start PC. To reduce resource contention among threads, the policy employs a Dynamic Resource Predictor (DRP) to initiate fetch from an invoked thread only if the available hardware resources exceed the predicted demand by the thread. The DRP dynamically monitors the threads' activity and allows fetch to be initiated from newly invoked threads when earlier threads commit and resources become available.
Figure 4(a) depicts an example of DRP. O-IMT indexes into a table using the start PC of a thread. Each table entry holds the numbers of active list and LSQ slots, and physical registers used by the thread's last four execution instances. The pipeline monitors a thread's resource needs and, upon thread commit, updates the thread's DRP entry. DRP supplies the maximum among the four instances for each resource as the prediction for the next instance's resource requirement. The policy alleviates inter-thread data dependence by processing producer instructions earlier and decreasing instruction execution stalls, thereby reducing pipeline resource contention. In contrast to O-IMT, prior proposals for speculative threading using SMT use variants of conventional fetch policies. TME uses biased-ICOUNT, a variant of ICOUNT that does not consider resource availability and thread-level independence. DMT's fetch policy statically partitions two fetch ports, and allocates one port for the non-speculative thread and the other for speculative threads in a round-robin manner. However, DMT does not suffer from resource contention because the design assumes prohibitively large custom instruction trace buffers (holding thousands of instructions), allowing threads to make forward progress without regard to resource availability and thread-level independence. Unfortunately, frequent associative searches through such large buffers are slow and impractical.
Figure 4. Using DRP (a) and context multiplexing (b).
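A rough sketch of the two O-IMT mechanisms just described, under simplifying assumptions: the DRP predicts a thread's demand for each resource as the maximum over its last four committed instances, and an ITDH-style check decides whether upcoming threads look like independent loop iterations. All structure names, sizes, and numbers below are illustrative, not the actual hardware design.

```python
from collections import defaultdict, deque

class DynamicResourcePredictor:
    """Predict per-thread resource demand (active-list slots, LSQ slots,
    physical registers) as the max over the last four committed instances."""
    def __init__(self, history=4):
        self.table = defaultdict(lambda: deque(maxlen=history))

    def record_commit(self, start_pc, usage):
        self.table[start_pc].append(usage)      # usage: dict resource -> count

    def predict(self, start_pc):
        hist = self.table[start_pc]
        if not hist:
            return None                          # no prediction yet
        return {r: max(inst[r] for inst in hist) for r in hist[0]}

def may_start_fetch(drp, start_pc, free_resources):
    """Initiate fetch from a newly invoked thread only if its predicted
    demand fits in the currently free resources."""
    pred = drp.predict(start_pc)
    if pred is None:
        return True                              # no history: be permissive
    return all(free_resources[r] >= pred[r] for r in pred)

def independent_mode(head_pc, next_pcs):
    """ITDH-style heuristic: treat threads as independent loop iterations
    if the next two threads' start PCs match the head thread's start PC."""
    return len(next_pcs) >= 2 and next_pcs[0] == head_pc and next_pcs[1] == head_pc

drp = DynamicResourcePredictor()
drp.record_commit(0x400, {"active_list": 30, "lsq": 8, "regs": 20})
drp.record_commit(0x400, {"active_list": 24, "lsq": 12, "regs": 18})
print(drp.predict(0x400))                                 # max of each resource
print(may_start_fetch(drp, 0x400, {"active_list": 40, "lsq": 10, "regs": 32}))
print(independent_mode(0x400, [0x400, 0x400, 0x480]))     # loop-like: True
```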
Multiplexing Hardware Contexts
Much like prior proposals, N-IMT assigns a single thread to a hardware context. Because many programs have short threads [14] and real SMT implementations are bound to have only a few (e.g., 2-8) contexts, this approach often leads to insufficient instruction overlap. Larger threads, however, increase both the likelihood of dependence misspeculation [14] and the number of instructions discarded per misspeculation, and cause speculative buffer overflow [5]. Instead, to increase instruction overlap without the unwanted side-effects of large threads, O-IMT multiplexes the hardware contexts by mapping as many threads as allowed by the resources onto one context (typically 3-6 threads for SPEC2K). Context multiplexing requires, for each context, only an additional fetch PC register and rename table pointer per thread for a given maximum number of threads per context. Context multiplexing differs from prior proposals for mapping multiple threads onto a single processing core [12,3] to alleviate load imbalance, in that multiplexing allows instructions from multiple threads within a context to execute and share resources simultaneously.

Two design complexities arise due to sharing resources in context multiplexing. First, conventional active list and LSQ designs assume that instructions enter these queues in (the predicted) program order. Such an assumption enables the active list to be a non-searchable (potentially large) structure, and allows honoring memory dependences via an ordered (associative) search in the LSQ. If care is not taken, multiplexing would invalidate this assumption if multiple threads were to place instructions out of program order in the shared active list and LSQ. Such out-of-order placement would require an associative search on the active list to determine the correct instruction(s) to be removed upon commit or misspeculation. In the case of the LSQ, the requirements would be even more complicated. A memory access would have to search through the LSQ for an address match among the entries from the accessing thread, and then (conceptually) repeat the search among entries from the thread preceding the accessing thread, working towards older threads. Unfortunately, the active list and LSQ cannot afford these additional design complications, because active lists are made large and therefore non-searchable by design, and the LSQ's ordered, associative search is already complex and time-critical. Second, allowing a single context to have multiple out-of-program-order threads complicates managing inter-thread dependence. Because two in-program-order threads may be mapped to different contexts, honoring memory dependences would require memory accesses to search through multiple contexts, thereby prohibitively increasing LSQ search time and design complexity.

O-IMT avoids the second design complexity by mapping threads to a context in program order. Inter-thread and intra-thread dependences within a single context are treated similarly. Figure 4(b) shows how in-program-order threads X and X+1 are mapped to a context. In addition to program order within contexts, O-IMT tracks the global program order among the contexts themselves for precise interrupts.
Register Renaming
A superscalar's register rename table relies on in-order instruction fetch to link register value producers to consumers. IMT processors' out-of-order fetch raises two issues in linking producers in earlier threads to consumers in later threads. First, IMT has to ensure that the rename maps for earlier threads' source registers are not clobbered by later threads. Second, IMT must guarantee that later threads' consumer instructions obtain the correct rename maps and wait for the yet-to-be-fetched earlier threads' producer instructions. While others [1,7] employ hardware-intensive value prediction to address these issues, potentially incurring frequent misspeculation and recovery overhead, IMT uses the create and use masks combined with conventional SMT rename tables.
Both IMT variants address these issues as follows. Upon thread start-up (and prior to instruction fetch), the processor copies the rename maps of the registers in the create and use masks from a master rename table to a thread's local rename table. To allow for invoking subsequent threads, the processor pre-allocates physical registers and pre-assigns mappings for all the create-mask registers in a pre-assign rename table. Finally, the processor updates the master table with the pre-assigned mappings and marks them as busy to reflect the yet-to-be-created register values. Therefore, upon thread invocation the master table correctly reflects the register mappings that a thread should either use or wait for.
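A simplified sketch of the start-up renaming steps described above, assuming toy mask and table structures: copy the current mappings for the use/create registers from a master table into the thread's local table, pre-allocate physical registers for the create-mask registers, and mark them busy in the master table so that later threads wait for the yet-to-be-created values.

```python
def invoke_thread(master, free_phys, create_mask, use_mask):
    """master: dict logical_reg -> (phys_reg, busy_flag).
    Returns (local_table, preassign_table) for the new thread."""
    # 1. Copy current mappings for registers the thread will read or write.
    local = {r: master[r] for r in (create_mask | use_mask)}
    # 2. Pre-allocate and pre-assign physical registers for created values.
    preassign = {}
    for r in create_mask:
        phys = free_phys.pop(0)
        preassign[r] = phys
        # 3. The master table now points at the yet-to-be-produced value,
        #    marked busy so consumers in later threads will wait for it.
        master[r] = (phys, True)
    return local, preassign

master = {"r1": ("p1", False), "r2": ("p2", False), "r3": ("p3", False)}
free = ["p10", "p11", "p12"]
local, pre = invoke_thread(master, free, create_mask={"r1"}, use_mask={"r2"})
print(local)    # mappings the thread starts from
print(pre)      # {'r1': 'p10'} -- where r1's new value will live
print(master)   # r1 now busy, so later threads wait on p10
```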
POWER CONSUMPTION IN MULTITHREADED PROCESSORS
Processor power and energy consumption are of concern
in two different operating environments. The first is that of a mobile computer, where battery life is still very limited. While the overall energy consumption of a microprocessor is being reduced because of voltage scaling, dynamic clock frequency reduction, and low-power circuit design, optimizations can be applied at the architecture level as well. The other environment where power consumption is important is that of high-performance processors. These processors are used in environments where energy supply is not typically a limitation. For high-performance computing, clock frequencies continue to increase, causing the power dissipated to reach the thresholds of current packaging technology. When the maximum power dissipation becomes a critical design constraint, the architecture which maximizes the performance/power ratio thereby maximizes performance. Multithreading is a processor architecture technique which has been shown to provide a significant performance advantage over conventional architectures which can only follow a single stream of execution. Simultaneous multithreading (SMT) [15, 14] can provide up to twice the throughput of a dynamic superscalar single-threaded processor. Announced architectures that will feature multithreading include the Compaq Alpha EV8 [5] and the Sun MAJC Java processor [12], joining the existing Tera MTA supercomputer architecture [1]. We can show that a multithreading processor is attractive in the context of low-power or power-constrained devices for many of the same reasons that enable its high throughput. First, it supplies extra parallelism via multiple threads, allowing the processor to rely much less heavily on speculation; thus, it wastes fewer resources on speculative, never-committed instructions. Second, it provides both higher and more even parallelism over time when running multiple threads, wasting less power on underutilized execution resources. A multithreading architecture also allows different design decisions than are available in a single-threaded processor, such as the ability to impact power through the thread selection mechanism.
Modelling Power
To describe the power and energy results of the processor, a particular power model must be identified. This power model is integrated into a detailed cycle-by-cycle instruction-level architectural simulator, allowing the model to accurately account for both useful and non-useful (incorrectly speculated) use of all processor resources. The power model utilized for this study is an area-based model. In this power model, the microprocessor is divided into several high-level units, and the corresponding activity of each unit is used in the computation of the overall power consumption. The following processor units are modeled: L1 instruction cache, L1 data cache, L2 unified cache, fetch unit, integer and floating-point instruction queues, branch predictor, instruction TLB, data TLB, load-store queue, integer and floating-point functional units, register file, register renamer, completion queue, and return stack. The total processor energy consumption is the summation of the unit energies, where each unit energy is equal to:
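A plausible form of this unit-energy term, consistent with the area-based description that follows (the exact weighting used in the original model is an assumption here), is

$$E_{\text{unit}} \approx A_{\text{unit}} \times \Big(\sum_{c} r_c\, d_c\Big) \times \Big(\sum_{f} w_f\, AF_f\Big),$$

where $A_{\text{unit}}$ is the unit's relative area, $r_c$ and $d_c$ are the area fraction and energy density of circuit type $c$, and $w_f$ and $AF_f$ are the weight and per-cycle count of activity factor $f$.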
An activity factor is a defined statistic measuring how many architectural events a program or workload generates for a particular unit. For a given unit, there may be multiple activity factors each representing a different action that
can occur within the unit. Activity factors represent high level actions and therefore are independent of the data values causing a particular unit to become active. We can compute the overall power consumption of the processor on a
cycle-by-cycle basis by summing the energy usage of all units in the processor.
The entire power model consists of 44 activity factors. Each of these activity factors is a measurement of the activity of a particular unit or function of the processor (e.g., number of instruction queue writes, L2 cache reads). The model does not assume that a given factor or combination of factors exercises a microprocessor unit in its entirety; instead, each is modeled as exercising a fraction of the area, depending on the particular activity factor or combination of activity factors. The weight of how much of the unit is exercised by a particular activity is an estimate based on knowledge of the unit's functionality and assumed implementation. Each unit is also assigned a relative area value, which is an estimate of the unit's size given its parameters. In addition to the overall area of the unit, each unit is modeled as consisting of a particular ratio of 4 different circuit types: dynamic logic, static logic, memory, and clock circuitry. Each circuit type is given an energy density. The ratios of circuit types for a given logic unit are estimates based on engineering experience and consultation with processor designers. The advantage of an area-based design is its adaptability. It can model an aggregate of several existing processors, as long as average area breakdowns can be derived for each of the units. And it can be adapted for future processors for which no circuit implementations exist, as long as an estimate of the area expansion can be made. In both cases, we rely on the somewhat coarse assumption that the general circuit makeup of these units remains fairly constant across designs. This assumption should be relatively valid for the types of high-level architectural comparisons made in this paper, but might not be for more low-level design options. Our power model is an architecture-level power model, and is not intended to produce precise estimates of absolute whole-processor power consumption. For example, it does not include leakage factors, I/O power (e.g., pin drivers), etc. However, the power model is useful for obtaining relative power numbers for comparison against results obtained from the same power model. The numbers that are obtained from the power model are independent of clock frequency (again, because we focus on relative values).
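The sketch below implements an area-based power model in the spirit of the description above: each unit's per-cycle energy scales with its relative area, a circuit-type mix weighted by energy densities, and the fraction of the unit exercised by its activity factors. All unit names, areas, densities, and weights are invented placeholders, not the calibrated values of the actual model.

```python
# Hypothetical energy densities per circuit type (relative units).
DENSITY = {"dynamic": 1.0, "static": 0.5, "memory": 0.8, "clock": 1.2}

def unit_energy(area, circuit_mix, activity_weights, activity_counts):
    """Energy consumed by one unit in one cycle.
    area:             relative area estimate of the unit
    circuit_mix:      dict circuit type -> fraction of the unit's area
    activity_weights: dict activity factor -> fraction of unit exercised per event
    activity_counts:  dict activity factor -> events observed this cycle
    """
    density = sum(frac * DENSITY[c] for c, frac in circuit_mix.items())
    exercised = sum(activity_weights[a] * activity_counts.get(a, 0)
                    for a in activity_weights)
    return area * density * min(exercised, 1.0)   # cap at the whole unit

def cycle_power(units, counts_this_cycle):
    """Total processor energy for one cycle: the sum over all modeled units."""
    return sum(unit_energy(u["area"], u["mix"], u["weights"],
                           counts_this_cycle.get(name, {}))
               for name, u in units.items())

units = {
    "icache": {"area": 6.0, "mix": {"memory": 0.8, "static": 0.2},
               "weights": {"read": 0.7}},
    "int_iq": {"area": 2.0, "mix": {"dynamic": 0.6, "static": 0.4},
               "weights": {"write": 0.3, "issue": 0.3}},
}
print(cycle_power(units, {"icache": {"read": 1}, "int_iq": {"write": 2, "issue": 1}}))
```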
Power Consumption of a Multithreaded Architecture

The power-energy characteristics of multithreaded execution can now be examined. We examine performance (IPC), energy efficiency (energy per useful instruction executed, E/UI), and power (the average power utilized during each simulation run) for both single-thread and multithread execution. All results (including IPC) are normalized to a baseline (in each case, the lowest single-thread value). This is done for two reasons: first, the power model is not intended to be an indicator of absolute power or energy, so this allows us to focus on the relative values; second, it allows us to view the diverse metrics on a single graph. Later results are normalized to other baseline values. The single-thread results (Figure 5) show diversity in almost all metrics. Performance and power appear to be positively correlated; the more of the processor used, the more power used. However, performance and energy per useful instruction are somewhat negatively correlated; the fewer unused resources and the fewer wastefully used resources, the more efficient the processor. Also included in the graph (the Ideal E/UI bar) is the energy efficiency of each application with perfect branch prediction. This gives an indication of the energy lost to incorrect speculation (when compared with the E/UI result). Gcc and go suffer most from low branch prediction accuracy. Gcc, for example, commits only 39% of fetched instructions and 56% of executed instructions. In cases such as gcc and go, the energy effect of mispredicted branches is quite large (35% - 40%). Figure 6 shows that multithreaded execution has significantly improved energy efficiency over conventional processor execution. Multithreading attacks the two primary sources of wasted energy/power: unutilized resources and, most importantly, resources wasted due to incorrect speculation. Multithreaded processors rely far less on speculation to achieve high throughput than conventional processors.

Figure 5. The relative performance, energy efficiency, and power results for the six benchmarks (single-thread execution).

Figure 6. The relative performance, energy efficiency, and power results as the number of threads varies.
With multithreading, then, we achieve as much as a 90% increase in performance (the 4-thread result), while actually decreasing the energy dissipation per useful instruction of a given workload. This is achieved via a relatively modest increase in power. The increase in power occurs because of the overall increase in utilization and throughput of the processor; that is, the processor completes the same work in a shorter period of time. Figure 7 shows the effect of the reduced misspeculation on the individual components of the power model. The greatest reductions come in the front of the pipeline (e.g., fetch), which is always the slowest to recover from a branch misprediction. The reduction in energy is achieved despite an increase in L2 cache utilization and power. Multithreading, particularly with a multiprogrammed workload, reduces the locality of accesses to the cache, resulting in more cache misses; however, this has a bigger impact on L2 power than L1. The L1 caches see more cache fills, while the L2 sees more accesses per instruction. The increase in L1 cache fills is also mitigated by the same overall efficiency gains, as the L1 sees fewer speculative accesses.
Figure 7. Contributors to overall energy efficiency for different numbers of threads.
The power advantage of simultaneous multithreading can provide a significant gain in both the mobile and high performance domains. The E/UI values demonstrate that a multithreaded processor operates more efficiently, which is desirable in a mobile computing environment. For a high performance processor, the constraint is more likely to be average power or peak power. SMT can make better use of a given average power budget, but it really shines in being able to take greater advantage of a peak power constraint. That is, an SMT processor will achieve sustained performance that is much closer to the peak performance (power) the processor was designed for than a single threaded processor.
Power Optimizations
In this section, we examine power optimizations that either reduce the overall energy usage of a multithreaded processor or reduce its power consumption. We examine three possible power optimizations, each made practical in some way by multithreading: reduced execution bandwidth, dynamic power consumption controls, and thread fetch optimizations. Reduced execution bandwidth exploits the higher execution efficiency of multithreading to achieve high performance while maintaining moderate power consumption. Dynamic power consumption controls allow a high-performance processor to gracefully degrade activity when power consumption thresholds are exceeded. The thread fetch optimization seeks better processor resource utilization via thread selection algorithms.
Reduced Execution Bandwidth
Prior studies have shown that an SMT processor has the potential to achieve double the instruction throughput of a single-threaded processor with similar execution resources. This implies that a given performance goal can be met with a less aggressive architecture, possibly consuming less power. Therefore, a multithreaded processor with a smaller execution bandwidth may achieve performance comparable to a more aggressive single-threaded processor while consuming less power. In this section, for example, we compare the power consumption and energy efficiency of an 8-issue single-threaded processor with those of a 4-issue multithreaded processor.
Controlling Power Consumption via Feedback
We strive to reduce peak power dissipation while still maximizing performance, a goal that future high-performance architects are likely to face. In this section, the processor is given feedback regarding power dissipation in order to limit its activity. Such a feedback mechanism does indeed provide reduced average power, but the real attraction of this technique is the ability to reduce peak power to an arbitrary level. A feedback mechanism that guaranteed that actual power consumption never exceeded a threshold could make peak and achieved power consumption arbitrarily close. Thus, a processor could still be 8-issue (because that width is useful some fraction of the time), yet be designed knowing that the peak power corresponded to 5 IPC sustained, as long as some mechanism in the processor guarantees that sustained power does not exceed that level.

This section models such a mechanism, applied to an SMT processor, which uses the power consumption of the processor as the feedback input. For this optimization, we would like to set an average power consumption target and have the processor stay within a reasonable tolerance of that target while achieving higher performance than the single-threaded processor could achieve. Our target power, in this case, is exactly the average power of all the benchmarks running on the single-threaded processor. This is a somewhat arbitrary threshold, but it makes the results easier to interpret.

The feedback is an estimate of the power the processor is currently consuming. An implementation of this mechanism could use either on-die sensors or a 'power' value computed from performance counters. The advantage of using performance counters is that the lag time for the value is much less than that of a physical temperature sensor. For the experiments listed here, the power value used is the number obtained from the simulator's power model. We used a power sampling interval of 5 cycles (this is essentially the delay before the feedback information can be used by the processor). In each of the techniques, we define thresholds whereby, if power consumption exceeds a given threshold, fetching or branch prediction activity is modified to curb power consumption. The threshold values are fixed throughout a given simulation. In all cases, fetching is halted for all threads when power consumption reaches 110% of the desired power value.
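A minimal sketch of the kind of feedback loop described above is given below. The 5-cycle sampling interval and the rule that fetch is halted for all threads at 110% of the target come from the text; the class name, the intermediate "throttle" action, and all other details are illustrative assumptions, not the authors' actual mechanism.

    # Sketch only: threshold-based power feedback control for fetch activity.

    class PowerFeedbackController:
        def __init__(self, target_power_w, sample_interval=5):
            self.target = target_power_w
            self.interval = sample_interval   # cycles between feedback updates (5 in the text)
            self.last_sample_cycle = 0
            self.estimated_power = 0.0

        def update(self, cycle, power_estimate_w):
            """power_estimate_w would come from on-die sensors or from a value
            computed out of performance counters."""
            if cycle - self.last_sample_cycle >= self.interval:
                self.last_sample_cycle = cycle
                self.estimated_power = power_estimate_w

        def fetch_policy(self):
            """Decide how aggressively to fetch this cycle."""
            ratio = self.estimated_power / self.target
            if ratio >= 1.10:
                return "halt_all_fetch"   # hard cap from the text: stop fetching for all threads
            elif ratio >= 1.00:
                return "throttle"         # assumed intermediate step, e.g. fetch fewer threads
            return "normal"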
Figure 8. The average over all benchmarks of performance and power for 2 and 4 threads with feedback.
Power feedback control is a technique that enables the designer to lower the peak power constraints on the processor. This technique is particularly effective in a multithreaded environment for two reasons. First, even with the drastic step of eliminating speculation, even for fetch, an SMT processor can still make much better progress than a single-threaded processor. Second, we have more dimensions along which to scale back execution: the results were best when we took advantage of the opportunity to scale back threads incrementally.
Thread Selection
This optimization examines the effect of thread selection algorithms on power and performance. The ability to select from among multiple threads provides the opportunity to optimize for power consumption and operating efficiency when making thread fetch decisions. In the optimization we model, two threads are selected each cycle to attempt to fetch instructions from the I-cache. The heuristic used to select which threads fetch can have a large impact on performance [14]. This mechanism can also impact power if we use it to bias against the most speculative threads. The heuristic does not, in fact, favor low-power threads over high-power threads, because that would only delay the running of the high-power threads. Rather, it favors less speculative threads over more speculative threads. This works because the threads that get stalled become less speculative over time (as their branches are resolved in the processor) and quickly become good candidates for fetch. We modify the ICOUNT thread selection scheme from [14] by adding a branch confidence metric to the thread fetch priority. This could be viewed as a derivative of pipeline gating [11] applied to a multithreaded processor; however, we use it to change the fetch decision, not to stop fetching altogether. The low-conf scheme biases heavily against threads with more unresolved low-confidence branches, while the all-branches scheme biases against threads with more unresolved branches overall (regardless of confidence). Both use ICOUNT as the secondary metric.
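The following is a minimal sketch of the modified ICOUNT fetch policy described above, assuming each thread tracks its in-flight instruction count and its unresolved (low-confidence) branch count. The field and function names are illustrative, not taken from the original simulator.

    # Sketch only: bias fetch toward less speculative threads, with ICOUNT as tie-breaker.

    from dataclasses import dataclass

    @dataclass
    class ThreadState:
        tid: int
        insts_in_pipeline: int     # the ICOUNT metric
        unresolved_branches: int   # used by the "all branches" variant
        low_conf_branches: int     # used by the "low-conf" variant

    def select_fetch_threads(threads, scheme="low-conf", n=2):
        """Pick the n threads allowed to fetch this cycle."""
        if scheme == "low-conf":
            key = lambda t: (t.low_conf_branches, t.insts_in_pipeline)
        else:  # "all branches"
            key = lambda t: (t.unresolved_branches, t.insts_in_pipeline)
        # Fewer unresolved (speculative) branches and fewer in-flight instructions
        # give a thread higher fetch priority.
        return sorted(threads, key=key)[:n]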

Figure 9. The performance, power, and energy effects of using branch status to direct thread selection.
Figure 9 shows that this mechanism has the potential to improve both raw performance and energy efficiency at the same time, particularly with the confidence predictor. For the 4-thread simulations, performance increased by 6.3% and 4.9% for the low-conf and all-branches schemes, respectively. The efficiency gains should be enough to outweigh the additional power required for the confidence counters (not modeled), assuming the architecture did not already need them for other purposes. If not, the technique without the confidence counters was equally effective at improving power efficiency, lagging only slightly in performance. The figure shows an improvement even with two threads, because when two threads are chosen for fetch in a cycle, one is still given higher priority and may consume more of the available fetch bandwidth.
Microprocessor power dissipation is becoming increasingly critical, due to various pressures. Low-power embedded and mobile devices are proliferating rapidly. Each processor generation places far greater demands on power and cooling than the previous one. We are approaching a technological window in which power may become a bottleneck before transistor count, even for high-performance processors. In that scenario, the processor architecture that optimizes the performance/power ratio thereby optimizes performance. These results demonstrate that simultaneous multithreading is an attractive architecture when energy and power are constrained.