Hyper-Threading
#1

ABSTRACT


Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
The first implementation of Hyper-Threading Technology was done on the Intel Xeon processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on each logical processor even when the other is stalled, and to deliver full performance even when there is only one active logical processor.
The potential for Hyper-Threading Technology is tremendous; our current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is only gated by the availability and prevalence of threaded applications and workloads in those markets.


INTRODUCTION


The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand, we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.

Making Hyper-Threading Technology a reality was the result of enormous dedication, planning, and sheer hard work from a large number of designers, validators, architects, and others. There was incredible teamwork from the operating system developers, BIOS writers, and software developers who helped with innovations and provided support for many decisions that were
made during the definition process of Hyper-Threading Technology. Many dedicated engineers are continuing to work with our ISV partners to analyze application performance for this technology. Their contributions and hard work have already made and will continue to make a real difference to our customers.
#2
INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand, we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.
Processor Microarchitecture
Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because there will be far more instructions in-flight in a super-pipelined microarchitecture, handling of events that disrupt the pipeline, e.g., cache misses, interrupts and branch mispredictions, can be costly.
ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, a super-scalar processor has multiple parallel execution units that can process instructions simultaneously. With super-scalar execution, several instructions can be executed each clock cycle. However, with simple in-order execution, it is not enough to simply have multiple execution units. The challenge is to find enough instructions to execute. One technique is out-of-order execution where a large window of instructions is simultaneously evaluated and sent to execution units, based on instruction dependencies rather than program order.
Accesses to DRAM memory are slow compared to execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, caches can only be fast when they are small. For this reason, processors often are designed with a cache hierarchy in which fast, small caches are located and operated at access latencies very close to that of the processor core, and progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.
The vast majority of techniques to improve processor performance from one generation to the next is complex and often adds significant die-size and power costs. These techniques increase performance but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance due to the number of processor cycles lost to branch mispredictions.


Figure 1: Single-stream performance vs. cost
Figure 1 shows the relative increase in performance and the costs, such as die size and power, over the last ten years on Intel processors1. In order to isolate the microarchitecture impact, this comparison assumes that the four generations of processors are on the same silicon process technology and that the speed-ups are normalized to the performance of an Intel486™ processor. Although we use Intel's processor history in this example, other high-performance processor manufacturers during this time period would have similar trends. Intel's processor performance, due to microarchitecture advances alone, has improved integer performance five- or six-fold1. Most integer applications have limited ILP and the instruction flow can be hard to predict.
Over the same period, the relative die size has gone up fifteen-fold, a three-times-higher rate than the gains in integer performance. Fortunately, advances in silicon process technology allow more transistors to be packed into a given amount of die area so that the actual measured die size of each generation microarchitecture has not increased significantly.
The relative power increased almost eighteen-fold during this period1. Fortunately, there exist a number of known techniques to significantly reduce power consumption on processors and there is much on-going research in this area. However, current processor power dissipation is at the limit of what can be easily dealt with in desktop platforms and we must put greater emphasis on improving performance in conjunction with new technology, specifically to control power.
Thread-Level Parallelism
A look at today's software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have been trying to leverage this so-called thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio.
In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.
In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced. One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die. The two processors each have a full set of execution and architectural resources. The processors may or may not share a large on-chip cache. CMP is largely orthogonal to conventional multiprocessor systems, as you can have multiple CMP processors in a multiprocessor configuration. Recently announced processors incorporate two processors on each die. However, a CMP chip is significantly larger than the size of a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die size and power considerations.
Another approach is to allow a single processor to execute multiple threads by switching between them. Time-slice multithreading is where the processor switches between software threads after a fixed time period. Time-slice multithreading can result in wasted execution slots but can effectively minimize the effects of long latencies to memory. Switch-on-event multi-threading would switch threads on long latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, both the time-slice and the switch-on-event multi-threading techniques do not achieve optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.
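To make the switch-on-event idea concrete, here is a minimal C sketch (purely illustrative; the context structure, the miss pattern, and the switching policy are invented assumptions, not Intel's implementation). Two software thread contexts share one execution core, and control passes to the other thread only when the running thread hits a simulated long-latency event such as a cache miss.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical per-thread context: just an instruction index here,
   standing in for the full architecture state of a software thread. */
typedef struct {
    int id;
    int next_instr;   /* next "instruction" to execute */
    int remaining;    /* instructions left              */
} context_t;

/* Pretend every fifth instruction misses the cache (a long-latency event). */
static bool long_latency_event(int instr) {
    return instr % 5 == 4;
}

int main(void) {
    context_t threads[2] = { {0, 0, 12}, {1, 0, 12} };
    int cur = 0;   /* which thread currently owns the core */

    while (threads[0].remaining > 0 || threads[1].remaining > 0) {
        context_t *t = &threads[cur];
        if (t->remaining == 0) {   /* finished: hand the core to the other thread */
            cur ^= 1;
            continue;
        }
        int instr = t->next_instr;
        printf("thread %d executes instruction %d\n", t->id, instr);
        t->next_instr++;
        t->remaining--;

        /* Switch-on-event: yield the core only on a long-latency miss. */
        if (long_latency_event(instr))
            cur ^= 1;
    }
    return 0;
}

Time-slice multithreading would differ only in the switch condition: the thread would yield after a fixed instruction or time budget rather than on a simulated miss.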
Finally, there is simultaneous multi-threading, where multiple threads can execute on a single processor without switching. The threads execute simultaneously and make much better use of the resources. This approach makes the most effective use of processor resources: it maximizes the performance vs. transistor count and power consumption.
Hyper-Threading Technology brings the simultaneous multi-threading approach to the Intel architecture. In this paper we discuss the architecture and the first implementation of Hyper-Threading Technology on the Intel® Xeon™ processor family.

ACKNOWLEDGMENTS
Making Hyper-Threading Technology a reality was the result of enormous dedication, planning, and sheer hard work from a large number of designers, validators, architects, and others. There was incredible teamwork from the operating system developers, BIOS writers, and software developers who helped with innovations and provided support for many decisions that were made during the definition process of Hyper-Threading Technology. Many dedicated engineers are continuing to work with our ISV partners to analyze application performance for this technology. Their contributions and hard work have already made and will continue to make a real difference to our customers.
#3



ABSTRACT
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources. This paper describes the Hyper-Threading Technology architecture, and discusses the microarchitecture details of Intel's first implementation on the Intel® Xeon processor family. Hyper-Threading Technology is an important addition to Intel's enterprise product line and will be integrated into a wide variety of products.

INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand, we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.
THE TECHNIQUES BEFORE HYPER-THREADING
Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Accesses to DRAM memory are slow compared to execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss. The vast majority of techniques to improve processor performance from one generation to the next is complex and often adds significant die-size and power costs. These techniques increase performance but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows.
A look at today's software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have been trying to leverage this so-called thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio. In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.
In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced. One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die. The two processors each have a full set of execution and architectural resources. The processors may or may not share a large on-chip cache. CMP is largely orthogonal to conventional multiprocessor systems, as you can have multiple CMP processors in a multiprocessor configuration. Recently announced processors incorporate two processors on each die. However, a CMP chip is significantly larger than the size of a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die size and power considerations.
Another approach is to allow a single processor to execute multiple threads by switching between them. Time-slice multithreading is where the processor switches between software threads after a fixed time period. Time-slice multithreading can result in wasted execution slots but can effectively minimize the effects of long latencies to memory. Switch-on-event multithreading would switch threads on long latency events such as cache misses.
Finally, there is simultaneous multi-threading, where multiple threads can execute on a single processor without switching. The threads execute simultaneously and make much better use of the resources. This approach makes the most effective use of processor resources: it maximizes the performance vs. transistor count and power consumption.
WHAT IS HYPER-THREADING
Hyper-Threading technology is an innovative design from Intel that enables multi-threaded software applications to process threads in parallel within each processor, resulting in increased utilization of processor execution resources. In short, it places two logical processors on a single CPU die. As a result, an average improvement of roughly 40% in CPU resource utilization yields higher processing throughput.
How Hyper-Threading Works
A form of simultaneous multi-threading technology (SMT), Hyper-Threading technology allows multiple threads of software applications to be run simultaneously on one processor by duplicating the architectural state for each logical processor while the processor execution resources are shared. The figure below represents how a Hyper-Threading based processor differs from a traditional multiprocessor. The left-hand configuration shows a traditional multiprocessor system with two physical processors. Each processor has its own independent execution resources and architectural state. The right-hand configuration represents an Intel Hyper-Threading technology based processor. You can see that the architectural state for each logical processor is duplicated, while the execution resources are shared.
Figure: A traditional multiprocessor (left) has two physical processors, each with its own architectural state and its own execution resources; a Hyper-Threading processor (right) duplicates the architectural state for each logical processor while the execution resources are shared.
For multiprocessor-capable software applications, the Hyper-Threading based processor is considered two separate logical processors on which the software applications can run without modification. Also, each logical processor responds to interrupts independently. The first logical processor can track one software thread, while the second logical processor tracks another software thread simultaneously. Because the two threads share the same execution resources, the second thread can use resources that would otherwise be idle if only one thread were executing. This results in an increased utilization of the execution resources within each physical processor.
WINDOWS SUPPORT FOR HT TECHNOLOGY
How Do Windows-Based Servers Recognize Processors with Hyper-Threading Technology
Windows-based servers receive processor information from the BIOS. Each server vendor creates their own BIOS using specifications provided by Intel. Assuming the BIOS is written according to Intel specifications, it begins counting processors using the first logical processor on each physical processor. Once it has counted a logical processor on all of the physical processors, it will count the second logical processor on each physical processor, and so on, as shown in Figure 1.
Figure 1: Numbers indicate the order in which logical processors are recognized by the BIOS when written according to Intel specifications. This example shows a four-way system enabled with Hyper-Threading Technology.
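For completeness, software can also discover Hyper-Threading support directly. Assuming an IA-32 system and a GCC/Clang-style compiler, the minimal C sketch below reads CPUID leaf 1, where bit 28 of EDX is the HTT feature flag and bits 23:16 of EBX report the number of logical processors per physical package; it is a detection sketch only, not the full topology enumeration that the BIOS and operating system perform.

/* Minimal Hyper-Threading detection via CPUID leaf 1 (x86 only). */
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 not supported on this processor\n");
        return 1;
    }

    int htt_flag      = (edx >> 28) & 1;     /* package supports HT Technology */
    int logical_count = (ebx >> 16) & 0xff;  /* logical processors per package */

    printf("HTT flag: %d, logical processors per package: %d\n",
           htt_flag, logical_count);
    return 0;
}

Note that the flag only indicates that the package can expose more than one logical processor; whether Hyper-Threading is actually enabled still depends on the BIOS setting discussed above.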
Windows 2000 Server Family and Hyper-threading Technology
Windows 2000 Server does not distinguish between physical and logical processors on systems enabled with Hyper-Threading Technology; Windows 2000 simply fills out the license limit using the first processors counted by the BIOS. For example, when you launch Windows 2000 Server (4-CPU limit) on a four-way system enabled with Hyper-Threading Technology, Windows will use the first logical processor on each of the four physical processors, as shown in Figure 2; the second logical processor on each physical processor will be unused, because of the 4-CPU license limit. (This assumes the BIOS was written according to Intel specifications. Windows uses the processor count and sequence indicated by the BIOS.)
Figure 2: Numbers indicate the order in which logical processors are used by Windows 2000 Server (4-CPU limit) on a four-way system enabled with Hyper-Threading Technology. Assumes BIOS is written according to Intel specifications.
However, when you launch Windows 2000 Advanced Server (8-CPU limit) on a four-way system enabled with Hyper-Threading Technology, Windows will use all eight logical processors, as shown in Figure 3.
Figure 3: Numbers indicate the order in which logical processors are used by Windows 2000 Advanced Server (8-CPU limit) on a four-way system enabled with Hyper-Threading Technology. Assumes BIOS is written according to Intel specifications.
Although Windows recognizes all eight logical processors in this example, in most cases performance would be better using eight physical processors.
Windows .NET Server Family and Hyper-threading Technology
When examining the processor count provided by the BIOS, Windows .NET Server distinguishes between logical and physical processors, regardless of how they are counted by the BIOS. This provides a powerful advantage over Windows 2000, in that Windows .NET Server only treats physical processors as counting against the license limit. For example, if you launch Windows .NET Standard Server (2-CPU limit) on a two-way system enabled with Hyper-Threading Technology, Windows will use all four logical processors, as shown in Figure 4.
Figure 4: Numbers indicate the order in which logical processors are used by Windows .NET Standard Server (2-CPU limit) on a two-way system enabled with Hyper-Threading Technology. Assumes BIOS is written according to Intel specifications.
Windows Server Applications and Hyper-threading Technology
Regardless of whether an application has been specifically designed to take advantage of Hyper-Threading Technology, or even whether the application is multi-threaded, Intel expects the existing body of applications in the market today to run correctly on systems enabled with Hyper-Threading Technology without further modification, and without being recompiled.
THREAD SCHEDULING AND HYPER-THREADING TECHNOLOGY
Operating systems schedule threads on available processors based on a "ready-to-run" criterion. The set of available threads is contained in a thread pool. A thread is ready-to-run if it has all the resources it needs, except the processor. Threads that are waiting for disk, memory, or other I/O are not in a ready-to-run state. In general, high-priority threads will be selected over low-priority threads. Over time, a low-priority thread will become favored and will eventually be scheduled on an available processor.
In the case where there are more ready-to-run threads than logical processors, the operating system will select higher-priority threads to schedule for each available processor. The lower-priority threads will be delayed to allow the higher priority threads more execution time.
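A heavily simplified sketch of that selection rule is shown below (scheduler internals differ across operating systems; the thread pool layout and priority values are invented for illustration): from the pool of ready-to-run threads, the highest-priority ones are assigned to the available logical processors and the rest are delayed.

#include <stdio.h>

#define NTHREADS 5
#define NLOGICAL 2          /* logical processors available this quantum */

typedef struct {
    int id;
    int priority;           /* larger value = higher priority            */
    int ready;              /* 1 if ready-to-run, 0 if blocked on I/O    */
} thread_t;

int main(void) {
    thread_t pool[NTHREADS] = {
        {0, 8, 1}, {1, 4, 1}, {2, 15, 0}, {3, 10, 1}, {4, 6, 1}
    };
    int picked[NTHREADS] = {0};

    /* Give each logical processor the highest-priority ready thread
       that has not been picked yet; lower-priority threads wait. */
    for (int cpu = 0; cpu < NLOGICAL; cpu++) {
        int best = -1;
        for (int i = 0; i < NTHREADS; i++) {
            if (pool[i].ready && !picked[i] &&
                (best < 0 || pool[i].priority > pool[best].priority))
                best = i;
        }
        if (best >= 0) {
            picked[best] = 1;
            printf("logical processor %d runs thread %d (priority %d)\n",
                   cpu, best, pool[best].priority);
        }
    }
    return 0;
}

Here thread 2 has the highest priority but is blocked, so threads 3 and 0 are chosen. Nothing in this selection step knows that the two logical processors may share one physical processor, which is exactly the contention issue described next.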
In the case where there are two ready-to-run threads and two logical processors, the operating system schedules each thread on a logical processor. The two threads may contend for the same physical processor resources, because HT Technology shares physical resources without respect to thread priority. As the two threads contend for resources, the high-priority thread will complete instructions more slowly than when it has the processor's execution resources to itself.
Thread priority boost is a condition where a lower-priority thread consumes the same CPU resources as a higher-priority thread. This may cause inconsistent, sub-optimal, or even degraded performance of higher-priority threads on systems with Hyper-Threading Technology.
MULTITHREADING, HYPER-THREADING, MULTIPROCESSING: NOW, WHAT'S THE DIFFERENCE?
Hyper-Threading Technology
Today's software consists of multiple threads or processes that can be executed in parallel. A Web server is a good example of an application that benefits from multi-threading, which allows it to serve multiple users concurrently. To fully exploit the benefits of multi-threading, hardware and software engineers have used many techniques over the years. The most straightforward method is to use a multiprocessor system fitted with two or more processors. This method achieves true parallelism, but at the expense of increased cost.
The second technique is chip multi-processing (CMP), where two processors are put on a single die. This technique is very similar to the first, in that you have more than one physical processor, albeit on a single die. However, this technique is also very expensive to implement, as the cost of manufacturing such a chip is high.
The third method is the most conventional and less expensive to implement. The technique is to use a single processor and the operating system to make use of time slicing to switch between threads. In most cases, time slicing is adequate and results in improved performance and throughput. However, time slicing could produce a penalty due to inefficient resource usage and the high cost of context switching. Context switching is the task performed by the CPU when it switches threads to execute different instructions. During this switching process, the CPU needs to save the state of the outgoing thread and load the state of the incoming thread. The CPU does not perform any useful work at this moment and hence context switching adds additional overhead to the execution time.
The fourth and last technique to be discussed is relatively new - Simultaneous Multi-Threading. Simultaneous multi-threading executes multiple threads on a processor without the need for switching. Hyper-Threading Technology enables software to implement this approach.
To understand how Hyper-Threading Technology works, let's take a look at the architecture of a conventional processor. A processor consists mainly of the Architecture State (Arch State) and Processor Execution Resources. The Arch State consists of registers, including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. The Arch State defines and controls the environment of an executing thread or task. The Processor Execution Resources contain other resources such as caches, execution units, branch predictors, control logic, and buses.
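The split between duplicated and shared parts can be sketched as a data structure. The C sketch below is only an illustration of the description above; the field names and sizes are placeholders, not the real IA-32 register file.

#include <stdio.h>
#include <stdint.h>

/* Architecture state: duplicated once per logical processor. */
typedef struct {
    uint32_t general_regs[8];    /* general-purpose registers        */
    uint32_t eip;                /* instruction pointer              */
    uint32_t control_regs[4];    /* control registers (subset)       */
    uint32_t apic_regs[4];       /* local APIC registers (subset)    */
    uint32_t machine_state[4];   /* machine state registers (subset) */
} arch_state_t;

/* Processor execution resources: one shared copy per physical processor. */
typedef struct {
    int execution_units;         /* ALUs, FPUs, load/store ports     */
    int cache_kb;                /* caches                           */
    int branch_predictor_entries;
} exec_resources_t;

/* A Hyper-Threading-capable physical processor: two architecture
   states sharing a single set of execution resources. */
typedef struct {
    arch_state_t     logical[2];
    exec_resources_t shared;
} physical_processor_t;

int main(void) {
    physical_processor_t p = {0};
    printf("per-logical-processor state: %zu bytes, shared resources: %zu bytes\n",
           sizeof p.logical[0], sizeof p.shared);
    return 0;
}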
Multithreading
As the demands on software grew, system programmers began to covet the time that processors wasted running a single thread while waiting for certain events to happen. When a program was waiting for a diskette drive to be ready or a user to type some information, programmers began to wonder if the processor could be doing other work. Under MS-DOS, the answer was unequivocally no. Instructions were executed sequentially, and if there was a pause in the thread of instructions, all downstream instructions had to wait for the pause to terminate. No magic or smoke and mirrors could get around this limitation.
To come up with a solution, software architects began writing operating systems that supported running pieces of programs, called threads. These multithreading operating systems made it possible for one thread to run while another was waiting for something to happen. On Intel® processor-based PCs and servers, today's operating systems, such as Windows 2000 and Windows XP, all support multithreading. In fact, the operating systems themselves are multithreaded. Portions of them can run while other portions are stalled.
To benefit from multithreading, programs also need to be multithreaded themselves. That is, rather than being developed as a single long sequence of instructions, they are broken up into logical units whose execution is controlled by the mainline of the program. This allows, for example, Microsoft Word to repaginate a document while the user is typing. Repagination occurs on one thread and handling keystrokes occurs on another. On single-processor systems, these threads are executed sequentially, not concurrently. The processor switches back and forth between the keystroke thread and the repagination thread quickly enough that both processes appear to occur simultaneously.
When dual-threaded programs are executing on a single processor machine, some overhead is incurred when switching between the threads. Because switching between threads costs time, it would appear that running the two threads this way is less efficient than running the two threads in succession. However, if either thread has to wait on a system device or the user, the ability to have the other thread continue operating compensates very quickly for all the overhead of the switching. And since one thread in our example handles user input, there will certainly be frequent periods when it is just waiting. By switching between threads, operating systems that support multithreaded programs can improve performance, even if they are running on a uniprocessor system.
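The repagination example maps directly onto two threads. Below is a minimal POSIX-threads sketch in C (the keystroke handling and repagination work are simulated with prints and sleeps; this is not how Word is actually written). On a uniprocessor the operating system interleaves the two threads; on a multiprocessor or Hyper-Threading system they can genuinely run at the same time.

/* Compile with: gcc repaginate.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Background thread: stands in for repagination. */
static void *repaginate(void *arg) {
    (void)arg;
    for (int page = 1; page <= 5; page++) {
        printf("repaginated page %d\n", page);
        usleep(100 * 1000);            /* pretend this takes time */
    }
    return NULL;
}

/* Foreground thread: stands in for handling keystrokes. */
static void *handle_keys(void *arg) {
    (void)arg;
    for (int key = 1; key <= 5; key++) {
        printf("handled keystroke %d\n", key);
        usleep(150 * 1000);            /* mostly waiting on the user */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, repaginate, NULL);
    pthread_create(&t2, NULL, handle_keys, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}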
Multiprocessing
Multiprocessing systems have multiple processors running at the same time. Traditional multiprocessing systems have anywhere from 2 to about 128 processors. Beyond that number multiprocessing systems become parallel processors. We will touch on those later.
Multiprocessing systems allow different threads to run on different processors. This capability considerably accelerates program performance. Now two threads can run more or less independently of each other without requiring thread switches to get at the resources of the processor. Multiprocessor operating systems are themselves multithreaded and they too generate threads that can run on the separate processors to best advantage.
In the early days, there were two kinds of multiprocessing:
• Asymmetrical -
On asymmetrical systems, one or more processors were exclusively dedicated to specific tasks, such as running the operating system. The remaining processors were available for all other tasks, generally the user applications. It quickly became apparent that this configuration was not optimal. On some machines, the operating-system processors were running at 100% capacity, while the user-assigned processors were doing nothing.
• Symmetrical -
In short order, system designers came to favor an architecture that balanced the processing load better: Symmetrical Multiprocessing (SMP). The "symmetry" refers to the fact that any thread, be it from the operating system or the user application, can run on any processor. In this way, the total computing load is spread evenly across all computing resources.
Today, symmetrical multiprocessing systems are the norm, and asymmetrical designs have nearly completely disappeared. SMP systems use double the number of processors, and performance generally jumps 80% or more. Why doesn't performance jump 100%? Two factors come into play: the first is the way threads interact, the other is system overhead.
Thread interaction has two components:
1. How threads handle competition for the same resources
2. How threads communicate among themselves
When two threads both want access to the same resource, one of them has to wait. The resource can be a disk drive, a record in a database that another thread is writing to, or any of a myriad other features of the system. The penalties accrued when threads have to wait for each other are so steep that minimizing this delay is a central design issue for hardware installations and the software they run. It is generally the largest factor in preventing perfect scalability of performance of multiprocessing systems, because running threads that never contend for the same resource is effectively impossible.
A second factor is thread synchronization. When a program is designed in threads, there are many occasions where the threads need to interact, and the interaction points require delicate handling. For example, if one thread is preparing data for another thread to process, delays can occur when the first thread does not have data ready when the processing thread needs it. More compelling examples occur when two threads need to share a common area of memory. If both threads can write to the same area in memory, then the thread that wrote first has to check that what it wrote has not been overwritten, or it must lock out other threads until it has finished with the data. This synchronization and inter-thread management is clearly an aspect that does not benefit from having more available processing resources.
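As a concrete sketch of the shared-memory case just described, the usual answer is a lock: each thread locks out the other until it has finished with the data. The C/pthreads example below is a minimal illustration of that pattern (the shared counter is a stand-in for any shared area of memory).

/* Compile with: gcc shared.c -pthread */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* lock out the other thread   */
        shared_counter++;              /* the shared area of memory   */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, writer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 200000)\n", shared_counter);
    return 0;
}

Without the mutex the two writers would race and the final count would usually be wrong; with it, the time one thread spends waiting for the lock is exactly the kind of delay that keeps a two-processor system from reaching a 100% speedup.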
System overhead is the thread management done by the operating system. The more processors are running, the more the operating system has to coordinate. As a result, each new processor adds incrementally to the system management work of the operating system. This means that each new processor will contribute less and less to the overall system performance.
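Both limits, contention between threads and the operating system's coordination overhead, can be summarized with Amdahl's law. The numbers below are only an illustrative assumption: if roughly 10% of the work remains effectively serial, two processors give close to the 80% gain quoted above.

\[
S(n) = \frac{1}{(1-p) + p/n}, \qquad
S(2)\big|_{p=0.9} = \frac{1}{0.1 + 0.45} \approx 1.82
\]

That is, with a parallel fraction p of about 0.9, the second processor yields roughly an 82% speedup rather than 100%.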
HYPER-THREADING TECHNOLOGY ARCHITECTURE
Hyper-Threading Technology makes a single physical processor appear as multiple logical processors. To do this, there is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multiprocessor system. From a microarchitecture perspective, this means that instructions from logical processors will persist and execute simultaneously on shared execution resources.
Figure 2: Processors without Hyper-Threading Technology. Each physical processor has one architecture state and its own set of processor execution resources.
As an example, Figure 2 shows a multiprocessor system with two physical processors that are not Hyper-Threading Technology-capable. Figure 3 shows a multiprocessor system with two physical processors that are Hyper-Threading Technology-capable. With two copies of the architectural state on each physical processor, the system appears to have four logical processors.
Figure 3: Processors with Hyper-Threading Technology. Each physical processor has two architecture states sharing a single set of processor execution resources.
FIRST IMPLEMENTATION ON THE INTEL XEON PROCESSOR FAMILY
Several goals were at the heart of the microarchitecture design choices made for the Intel Xeon processor MP implementation of Hyper-Threading Technology. One goal was to minimize the die area cost of implementing Hyper-Threading Technology. Since the logical processors share the vast majority of microarchitecture resources and only a few small structures were replicated, the die area cost of the first implementation was less than 5% of the total die area.
A second goal was to ensure that when one logical processor is stalled the other logical processor could continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions. Independent forward progress was ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads2 were executing. This is accomplished by either partitioning or limiting the number of active entries each thread can have.
A third goal was to allow a processor running only one active software thread to run at the same speed on a processor with Hyper-Threading Technology as on a processor without this capability. This means that partitioned resources should be recombined when only one software thread is active.
A high-level view of the microarchitecture pipeline is shown in Figure 4. As shown, buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block.
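The partitioning rule that guarantees forward progress can be sketched in code. This toy C example is only an abstraction of the buffering-queue behavior described above; the queue size and the allocation interface are invented for illustration.

#include <stdio.h>
#include <stdbool.h>

#define QUEUE_ENTRIES 8

/* Per-thread occupancy of a shared buffering queue. With two active
   threads each may hold at most half the entries; with one active
   thread the partition is recombined and it may use them all. */
typedef struct {
    int used[2];
    int active_threads;   /* 1 or 2 */
} buffer_queue_t;

static bool try_allocate(buffer_queue_t *q, int thread) {
    int limit = (q->active_threads == 2) ? QUEUE_ENTRIES / 2 : QUEUE_ENTRIES;
    if (q->used[thread] >= limit)
        return false;                 /* would starve the sibling thread */
    q->used[thread]++;
    return true;
}

int main(void) {
    buffer_queue_t q = { {0, 0}, 2 };

    /* Thread 0 keeps allocating, e.g. because it is stalled on a miss. */
    while (try_allocate(&q, 0))
        ;
    printf("thread 0 holds %d entries\n", q.used[0]);   /* capped at 4 */

    /* Thread 1 can still make forward progress. */
    printf("thread 1 allocation %s\n",
           try_allocate(&q, 1) ? "succeeds" : "fails");
    return 0;
}

When only one software thread is active, setting active_threads back to 1 recombines the partition so that the single thread can use every entry, matching the third design goal above.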
FRONT END OF AN INTEL XEON PROCESSOR PIPELINE
The front end of the pipeline is responsible for delivering instructions to the later pipe stages. As shown in Figure 5a, instructions generally come from the Execution Trace Cache (TC), which is the primary or Level 1 (L1) instruction cache. Figure 5b shows that only when there is a TC miss does the machine fetch and decode instructions from the integrated Level 2 (L2) cache. Near the TC is the Microcode ROM, which stores decoded instructions for the longer and more complex IA-32 instructions.
Execution Trace Cache (TC)
The TC stores decoded instructions, called micro-operations or "uops." Most instructions in a program are fetched and executed from the TC. Two sets of next-instruction-pointers independently track the progress of the two software threads executing. The two logical processors arbitrate access to the TC every clock cycle. If both logical processors want access to the TC at the same time, access is granted to one then the other in alternating clock cycles. For example, if one cycle is used to fetch a line for one logical processor, the next cycle would be used to fetch a line for the other logical processor, provided that both logical processors requested access to the trace cache. If one logical processor is stalled or is unable to use the TC, the other logical processor can use the full bandwidth of the trace cache, every cycle. The TC entries are tagged with thread information and are dynamically allocated as needed. The TC is 8-way set associative, and entries are replaced based on a least-recently-used (LRU) algorithm that is based on the full 8 ways. The shared nature of the TC allows one logical processor to have more entries than the other if needed.
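The alternating-cycle arbitration can be modeled as a simple round-robin that falls through to the other logical processor when one of them has no request. The C sketch below is only an abstraction of the behavior described above, not the actual arbitration logic.

#include <stdio.h>
#include <stdbool.h>

/* Returns which logical processor gets the trace cache this cycle:
   alternate between the two, but give the whole bandwidth to one
   thread if the other has no request (or is stalled). */
static int tc_arbitrate(int cycle, bool wants[2]) {
    int preferred = cycle & 1;           /* alternate every clock */
    if (wants[preferred])
        return preferred;
    if (wants[preferred ^ 1])
        return preferred ^ 1;
    return -1;                           /* no fetch this cycle */
}

int main(void) {
    bool wants[2];
    for (int cycle = 0; cycle < 6; cycle++) {
        wants[0] = true;                 /* thread 0 always fetching    */
        wants[1] = (cycle < 3);          /* thread 1 stalls after a bit */
        int who = tc_arbitrate(cycle, wants);
        printf("cycle %d: TC fetch for logical processor %d\n", cycle, who);
    }
    return 0;
}

In the printed trace the two threads alternate for the first few cycles; once thread 1 stalls, thread 0 gets the trace cache every cycle, matching the full-bandwidth case described above.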
Microcode ROM
When a complex instruction is encountered, the TC sends a microcode-instruction pointer to the Microcode ROM. The Microcode ROM controller then fetches the uops needed and returns control to the TC. Two microcode instruction pointers are used to control the flows independently if both logical processors are executing complex IA-32 instructions. Both logical processors share the Microcode ROM entries. Access to the Microcode ROM alternates between logical processors just as in the TC.
ADVANTAGES OF HT TECHNOLOGY
Hyper-threading, officially called Hyper-Threading Technology (HTT), is Intel's trademark for their implementation of the simultaneous multithreading technology on the Pentium 4 microarchitecture. It is basically a more advanced form of Super-threading that debuted on the Intel Xeon processors and was later added to Pentium 4 processors. The technology improves processor performance under certain workloads by providing useful work for execution units that would otherwise be idle, for example during a cache miss.
The advantages of HT Technology include improved support for multithreaded code, allowing multiple threads to run simultaneously; improved reaction and response time; and an increased number of users a server can support.
According to Intel, the first implementation only used an additional 5% of the die area over the "normal" processor, yet yielded performance improvements of 15-30%.
Intel claims up to a 30% speed improvement compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent, however, and some programs actually slow down slightly when Hyper-Threading Technology is turned on. This is due to the replay system of the Pentium 4 tying up valuable execution resources, thereby starving the other thread. However, any performance degradation is unique to the Pentium 4 (due to various architectural nuances), and is not characteristic of simultaneous multithreading in general.
Hyper threading allows the operating system to see two logical processors rather than the one physical processor present.
Hyper-Threading works by duplicating certain sections of the processor (those that store the architectural state) but not duplicating the main execution resources. This allows a Hyper-Threading equipped processor to pretend to be two "logical" processors to the host operating system, allowing the operating system to schedule two threads or processes simultaneously. Where execution resources in a non-Hyper-Threading capable processor are not used by the current task, and especially when the processor is stalled, a Hyper-Threading equipped processor may use those execution resources to execute the other scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.)
Except for its performance implications, this innovation is transparent to operating systems and programs. All that is required to take advantage of Hyper-Threading is symmetric multiprocessing (SMP) support in the operating system, as the logical processors appear as standard separate processors.
APPLICATIONS OF HYPER-THREADING TECHNOLOGY
Enterprise, e-Business, and gaming software applications continue to put higher demands on processors. To improve performance in the past, threading was enabled in the software by splitting instructions into multiple streams so that multiple processors could act upon them. Hyper-Threading Technology (HT Technology) provides thread-level parallelism on each processor, resulting in more efficient use of processor resources, higher processing throughput, and improved performance on today's multithreaded software. The combination of an Intel® processor and chipset that support HT Technology, an operating system that includes optimizations for HT Technology, and a BIOS that supports HT Technology and has it enabled, delivers increased system performance and responsiveness.
Hyper-Threading Technology for Business Desktop PCs
HT Technology helps desktop users get more performance out of existing software in multitasking environments. Many applications are already multithreaded and will automatically benefit from this technology. Business users can run demanding desktop applications simultaneously while maintaining system responsiveness. IT departments can deploy desktop background services that make their environments more secure, efficient and manageable, while minimizing the impact on end-user productivity and providing headroom for future business growth and new solution capabilities.
Hyper-Threading Technology for Gaming and Video
The Intel® Pentium® processor Extreme Edition combines HT Technology with dual-core processing to give people PCs capable of handling four software threads. HT Technology enables gaming enthusiasts to play the latest titles and experience ultra realistic effects and game play. And multimedia enthusiasts can create, edit, and encode graphically intensive files while running a virus scan in the background.
Hyper-Threading Technology for Servers
With HT Technology, multithreaded server software applications can execute threads in parallel within each processor in a server platform. Select products from the Intel® Xeon® processor family use HT Technology to increase compute power and throughput for today's Web-based and enterprise server applications.
Hyper-Threading Technology Benefits for Enterprise and e-Business
• Enables more user support, improving business productivity.
• Provides faster response times for Internet and e-Business applications, enhancing customer experiences.
• Increases the number of transactions that can be processed.
• Allows compatibility with existing IA-32 applications and operating systems.
• Handles larger workloads.
CONCLUSION
Hyper-Threading Technology is a technology that enables a single processor to run two separate threads simultaneously. Although several chip manufacturers have announced their intentions to ship processors with this capability, only Intel has done so at this time. The reason in part stems from a design change Intel made in the release of the Pentium® Pro processor in 1995. The company added multiple execution units to the processor. Even though the chip could execute only one thread, the multiple execution units enabled some instructions to be executed out of order. As the processor handled the main instructions, a look-ahead capability recognized upcoming instructions that could be executed out of order on the other execution pipelines, and their results were folded back into the stream of executed instructions when their turn came up. This facility made for a more optimized flow of executed instructions. It also was used to speculatively execute instructions from a branch in an upcoming "if" test. When the mainline hit the test, results of pre-executed instructions from the correct branch would be used. If the speculation had pre-executed the wrong branch, those instructions were simply discarded.
It is important to note that though the Hyper-Threading Technology gives the operating system the impression that it is running on a multi-processor system, its performance does not exactly duplicate a true multi-processor system. However, there is significant improvement in performance over a conventional processor (up to about 30%), and the slight increases in die size (around 5%) and cost make Hyper-Threading Technology a cost-effective solution. The Intel Xeon® processor family is the first to implement the Hyper-Threading Technology. Hyper-Threading Technology is available on Intel desktop platforms as well.
FUTURE SCOPE
Older Pentium 4 based MPUs use Hyper-Threading, but the current-generation cores, Merom, Conroe and Woodcrest, do not. Hyper-Threading is a specialized form of simultaneous multithreading, which has been said to be on Intel roadmaps for the generation after Merom, Conroe and Woodcrest.
While some have alleged that Hyper-Threading is somehow energy inefficient and claim this to be the reason Intel dropped it from their new cores, this is almost certainly not the case. A number of low-power chips do use multithreading, including the PPE from the Cell processor, the CPUs in the Playstation 3, Sun Microsystems' Niagara and the MIPS 34K.
Multiprocessing systems run threads on separate processors. Systems with Hyper-Threading Technology run two threads on one chip. Intel® Xeon® processor-based servers combine both technologies. They run two Hyper-Threading Technology-enabled processors on the same machine. This creates a machine with four concurrent threads executing. If the instructions are scheduled correctly, and the operating systems are tuned for hyper-threading, the machines get enormous processing capability, and thread-heavy applications like Java* Virtual Machines run considerably faster.
At the end of 2002, all Intel Xeon processors are implemented with Hyper-Threading Technology, and Intel has announced that its desktop processors will support it next. As a result, the multithreading issues and opportunities that hyper-threading provides will become universal programming aspects during the next few years. The similarities between these technologies are further underscored by the fact that Intel and the vendors behind the OpenMP initiative are porting this parallel processing technology to hyper-threaded systems to extract the greatest possible benefit from the multiple processing pipelines.
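Since OpenMP is named as the route for exploiting these extra hardware threads, a minimal example is shown below (standard OpenMP in C, nothing hyper-threading-specific; the work in the loop is an arbitrary stand-in). The runtime spreads the iterations across however many hardware threads the machine exposes, whether they are physical processors or HT logical processors.

/* Compile with: gcc sum.c -fopenmp */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Iterations are divided among the available hardware threads,
       whether they are physical processors or HT logical processors. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("threads available: %d, harmonic sum: %f\n",
           omp_get_max_threads(), sum);
    return 0;
}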
BIBLIOGRAPHY
1. A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing."
2. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porter, and B. Smith, "The TERA Computer System."
3. google.com.
4. seminars4u.com.
CONTENTS
INTRODUCTION
THE TECHNIQUES BEFORE HYPER-THREADING
WHAT IS HYPER-THREADING
WINDOWS SUPPORT FOR HT TECHNOLOGY
THREAD SCHEDULING AND HYPER-THREADING TECHNOLOGY
MULTITHREADING, HYPER-THREADING, MULTIPROCESSING: NOW, WHAT'S THE DIFFERENCE?
HYPER-THREADING TECHNOLOGY ARCHITECTURE
FIRST IMPLEMENTATION ON THE INTEL XEON PROCESSOR FAMILY
ADVANTAGES OF HYPER-THREADING TECHNOLOGY
APPLICATIONS OF HYPER-THREADING TECHNOLOGY
CONCLUSION
FUTURE SCOPE
BIBLIOGRAPHY
#4
read http://studentbank.in/report-HYPER-THREA...ars-report for more
#5
ABSTRACT
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
The first implementation of Hyper-Threading Technology was done on the Intel Xeon processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on logical processors, even if the other is stalled, and to deliver full performance even when there is only one active logical processor.
The potential for Hyper-Threading Technology is tremendous; our current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is only gated by the availability and prevalence of threaded applications and workloads in those markets.
INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand, we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.
Making Hyper-Threading Technology a reality was the result of enormous dedication, planning, and sheer hard work from a large number of designers, validators, architects, and others. There was incredible teamwork from the operating system developers, BIOS writers, and software developers who helped with innovations and provided support for many decisions that were
made during the definition process of Hyper-Threading Technology. Many dedicated engineers are continuing to work with our ISV partners to analyze application performance for this technology. Their contributions and hard work have already made and will continue to make a real difference to our customers.
PROCESSOR MICRO-ARCHITECTURE

Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because there will be far more instructions in-flight in a super-pipelined microarchitecture, handling of events that disrupt the pipeline, e.g., cache misses, interrupts and branch mispredictions, can be costly. ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, a super-scalar processor has multiple parallel execution units that can process instructions simultaneously. With super-scalar execution, several instructions can be executed each clock cycle. However, with simple in-order execution, it is not enough to simply have multiple execution units. The challenge is to find enough instructions to execute.
One technique is out-of-order execution where a large window of instructions is simultaneously evaluated and sent to execution units, based on instruction dependencies rather than program order. Accesses to DRAM memory are slow compared to execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, caches can only be fast when they are small. For this reason, processors often are designed with a cache hierarchy in which fast, small caches are located and operated at access latencies very close to that of the processor core, and progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.
The vast majority of techniques to improve processor performance from one generation to the next is complex and often adds significant die-size and power costs. These techniques increase performance but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance due to the number of processor cycles lost to branch mispredictions.
The figure shows the relative increase in performance and the costs, such as die size and power, over the last ten years on Intel processors1. In order to isolate the microarchitecture impact, this comparison assumes that the four generations of processors are on the same silicon process technology and that the speed-ups are normalized to the performance of an Intel486 processor. Although we use Intel's processor history in this example, other high-performance processor manufacturers during this time period would have similar trends. Intel's processor performance, due to microarchitecture advances alone, has improved integer performance five- or six-fold1. Most integer applications have limited ILP and the instruction flow can be hard to predict.

Over the same period, the relative die size has gone up fifteen-fold, a three-times-higher rate than the gains in integer performance. Fortunately, advances in silicon process technology allow more transistors to be packed into a given amount of die area so that the actual measured die size of each generation microarchitecture has not increased significantly.
The relative power increased almost eighteen-fold during this period. Fortunately, a number of known techniques can significantly reduce processor power consumption, and there is much ongoing research in this area. However, current processor power dissipation is at the limit of what can easily be handled in desktop platforms, so we must put greater emphasis on improving performance in conjunction with new technology, specifically to control power.
CONVENTIONAL MULTI-THREADING
A look at today's software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have been trying to leverage this so-called thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio.
In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on
multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.
In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced. One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die.
The two processors each have a full set of execution and architectural resources, and may or may not share a large on-chip cache. CMP is largely orthogonal to conventional multiprocessor systems, since multiple CMP processors can be used in a multiprocessor configuration. Recently announced processors incorporate two processors on each die. However, a CMP chip is significantly larger than a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die-size and power considerations.
TIME-SLICE MULTI-THREADING
Time-slice multithreading is where the processor switches between software threads after a fixed time period. Quite a bit of what a CPU does is illusion. For instance, modern out-of-order processor architectures don't actually execute code sequentially in the order in which it was written. An out-of-order execution (OOE) architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions (where possible) so that they make maximum use of the processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory. To the programmer and the user, it looks as if an ordered, sequential stream of instructions went into the CPU and an identically ordered, sequential stream of
computational results emerged. Only the CPU knows in what order the program's instructions were actually executed, and in that respect the processor is like a black box to both the programmer and the user.
The same kind of sleight-of-hand happens when we run multiple programs at once, except that this time the operating system is also involved in the scam. To the end user, it appears as if the processor is "running" more than one program at the same time, and indeed, there actually are multiple programs loaded into memory. But the CPU can execute only one of these programs at a time. The OS maintains the illusion of concurrency by rapidly switching between running programs at a fixed interval, called a time slice. The time slice has to be small enough that the user doesn't notice any degradation in the usability and performance of the running programs, and it has to be large enough that each program has a sufficient amount of CPU time in which to get useful work done. Most modern operating systems include a way to change the size of an individual program's time slice, so a program with a larger time slice gets more actual execution time on the CPU relative to its lower-priority peers and hence runs faster.
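A minimal sketch of this time-slice (round-robin) switching, in Python. The program names, their remaining work, and the slice lengths are hypothetical; the slice lengths simply stand in for per-program priority.

from collections import deque

# program -> remaining CPU work (ms); program -> time slice (ms); values are invented
programs = {"editor": 30, "compiler": 120, "player": 60}
time_slice = {"editor": 10, "compiler": 20, "player": 10}   # larger slice = higher priority

def run_round_robin(work, slices):
    queue = deque(work)          # ready queue of program names
    clock = 0
    while queue:
        name = queue.popleft()
        burst = min(slices[name], work[name])
        clock += burst           # the CPU runs only this program during its slice
        work[name] -= burst
        if work[name] > 0:
            queue.append(name)   # not finished: back to the end of the queue
        else:
            print(f"{name} finished at t={clock} ms")

run_round_robin(dict(programs), time_slice)

Because the compiler's slice is twice as long, it accumulates CPU time twice as fast as its peers, which is exactly the priority effect described above.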
Time-slice multithreading can result in wasted execution slots, but it can effectively minimize the effects of long latencies to memory. Switch-on-event multithreading instead switches threads on long-latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, neither time-slice nor switch-on-event multithreading achieves optimal overlap of the many sources of inefficient resource usage, such as branch mispredictions and instruction dependencies.
Finally, there is simultaneous multi-threading, where multiple threads can execute on a single processor without switching. The threads execute simultaneously and make much better use of the resources.
This approach makes the most effective use of processor resources: it maximizes the performance vs. transistor count and power consumption.
CONCEPT OF SIMULTANEOUS MULTI-THREADING
Simultaneous multithreading (SMT) is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to compete for and share processor resources simultaneously. Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread ILP. The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads, including commercial databases, web servers, and scientific applications, in both multiprogrammed and parallel environments. Simultaneous multithreading has already had an impact in both the academic and commercial communities.
It has been found that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or in their sizes. The simultaneous multithreading architecture achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Simultaneous multithreading enjoys a 2.5-fold improvement over an unmodified superscalar with the same hardware resources. This speedup is enabled by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads that are using the processor most efficiently each cycle, thereby providing the "best" instructions to the processor. Several heuristics that help identify and use the best threads for fetch and issue have been found, and such heuristics can increase throughput by as much as 37%. Using the best fetch and issue alternatives, bottleneck analysis then identifies opportunities for further gains on the improved architecture.
Simultaneous Multithreading: Maximizing On-Chip Parallelism
The increase in component density on modern microprocessors has led to a substantial increase in on-chip parallelism. In particular, modern superscalar RISCs can issue several instructions to independent functional units each cycle. However, the benefit of such superscalar architectures is ultimately limited by the parallelism available in a single thread.
Simultaneous multithreading is a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. In the most general case, the binding between thread and functional unit is completely dynamic. We present several models of simultaneous multithreading and compare them with wide superscalar, fine-grain multithreaded, and single-chip, multiple-issue multiprocessing architectures. To perform these evaluations, a simultaneous multithreaded architecture based on the DEC Alpha 21164 design was simulated, executing code generated by the Multiflow trace-scheduling compiler. The results show that: (1) No single latency-hiding technique is likely to produce acceptable utilization of wide superscalar processors; increasing processor utilization therefore requires a new approach, one that attacks multiple causes of processor idle cycles. (2) Simultaneous multithreading is such a technique. With this machine model, an 8-thread, 8-issue simultaneous multithreaded processor sustains over 5 instructions per cycle, while a single-threaded processor can sustain fewer than 1.5 instructions per cycle with similar resources and issue bandwidth. (3) Multithreaded workloads degrade cache performance relative to single-thread performance, as previous studies have shown. We evaluate several cache configurations and demonstrate that private instruction caches combined with shared data caches provide excellent performance regardless of the number of threads. (4) Simultaneous multithreading is an attractive alternative to single-chip multiprocessors: simultaneous multithreaded processors with a variety of organizations are all superior to conventional multiprocessors with similar resources.
While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design.
CONVERTING TLP TO ILP VIA SMT
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Although they correspond to different granularities of
parallelism, ILP and TLP are fundamentally identical: both identify independent instructions that can execute in parallel and can therefore utilize parallel hardware. Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors exploit TLP by executing different threads in parallel on different processors. Unfortunately, neither parallel-processing style is capable of adapting to dynamically changing levels of ILP and TLP, because the hardware enforces the distinction between the two types of parallelism. A multiprocessor must statically partition its resources among the multiple CPUs (see Figure 1); if insufficient TLP is available, some of the processors will be idle. A superscalar executes only a single thread; if insufficient ILP exists, much of that processor's multiple-issue hardware will be wasted.
Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996; Gulati et al. 1996; Hirata et al. 1992] allows multiple threads to compete for and share available processor resources every cycle. One of its key advantages when executing parallel applications is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By allowing multiple threads to share the processor's functional units simultaneously, thread-level parallelism is essentially converted into instruction-level parallelism. An SMT processor can therefore accommodate variations in ILP and TLP. When a program has only a single thread (i.e., it lacks TLP), all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP.
Figure 1. A comparison of issue slot (functional unit) utilization in various architectures. Each square corresponds to an issue slot, with white squares signifying unutilized slots. Hardware utilization suffers when a program exhibits insufficient parallelism or when available parallelism is not used effectively. A superscalar processor achieves low utilization because of low ILP in its single thread. Multiprocessors physically partition hardware to exploit TLP, and therefore performance suffers when TLP is low (e.g., in sequential portions of parallel programs). In contrast, simultaneous multithreading avoids resource partitioning. Because it allows multiple threads to compete for all resources in the same cycle, SMT can cope with varying levels of ILP and TLP; consequently utilization is higher, and performance is better.
An SMT processor can uniquely exploit whichever type of parallelism is available, thereby utilizing the functional units more effectively to achieve the goals of greater throughput and significant program speedups.
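The utilization argument can be made concrete with a toy Python model: each cycle, every thread exposes a small, varying amount of ILP, and the machine can fill at most eight issue slots. The per-thread ILP range below is invented purely for illustration and does not come from any measured workload.

import random
random.seed(1)

ISSUE_WIDTH = 8
CYCLES = 10_000

def per_thread_ilp():
    # Hypothetical: a thread typically exposes 1-4 ready instructions per cycle.
    return random.randint(1, 4)

def utilization(num_threads):
    used = 0
    for _ in range(CYCLES):
        ready = sum(per_thread_ilp() for _ in range(num_threads))
        used += min(ready, ISSUE_WIDTH)    # SMT lets all threads share the slots
    return used / (CYCLES * ISSUE_WIDTH)

for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {utilization(n):.0%} of issue slots filled")
# With one thread (a plain superscalar) most slots stay empty; with several
# threads competing for the same slots each cycle, utilization approaches 100%.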
SIMULTANEOUS MULTITHREADING AND MULTIPROCESSORS
Both the simultaneous multithreading (SMT) processor and the on-chip shared-memory MPs we examine are built from a common out-of-order superscalar base processor. The multiprocessor combines several of these superscalar CPUs in a small-scale MP, whereas simultaneous multithreading uses a wider-issue superscalar and then adds support for multiple hardware contexts.
Base Processor Architecture
The base processor is a sophisticated, out-of-order superscalar processor with a dynamic scheduling core similar to the MIPS R10000 [Yeager 1996]. Figure 2 illustrates the organization of this processor, and Figure 3 shows its pipeline. On each cycle, the processor fetches a block of instructions from the instruction cache. After decoding these instructions, the register-renaming logic maps the logical registers to a pool of physical renaming registers to remove false dependences. Instructions are then fed to either the integer or the floating-point instruction queue. When their operands become available, instructions are issued from these queues to the corresponding functional units. Instructions are retired in order.
Figure 2. Organization of the base out-of-order superscalar processor.
Figure 3. Pipelines of the base superscalar processor and of the SMT processor.
SMT ARCHITECTURE
The SMT architecture, which can simultaneously execute threads from up to eight hardware contexts, is a straightforward extension of the base processor. To support simultaneous multithreading, the base processor architecture requires significant changes in only two primary areas: the instruction fetch mechanism and the register file. A conventional system of branch prediction hardware (branch target buffer and pattern history table) drives instruction fetching, but there are now 8 program counters and 8 subroutine return stacks (1 per context). On each cycle, the fetch mechanism selects up to 2 threads (among threads not already incurring I-cache misses) and fetches up to 4 instructions from each thread (the 2.4 scheme from Tullsen et al. [1996]). The total fetch bandwidth of 8 instructions is therefore equivalent to that required for an 8-wide superscalar processor, and only 2 I-cache ports are required. Additional logic, however, is necessary in the SMT to prioritize thread selection. Thread priorities are assigned using the icount feedback technique, which favors the threads that are using processor resources most effectively. Under icount, highest priority is given to the threads that have the fewest instructions in the decode, renaming, and queue pipeline stages. This approach prevents a single thread from clogging the instruction queue, avoids thread starvation, and provides a more even distribution of instructions from all threads, thereby heightening inter-thread parallelism. The peak throughput of the machine is limited by the fetch and decode bandwidth of 8 instructions per cycle.
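A minimal Python sketch of the icount selection step, assuming a per-context count of instructions currently in the decode, rename, and queue stages and a per-context I-cache-miss flag; the snapshot values are hypothetical. Under the 2.4 scheme, each selected thread may then fetch up to four instructions.

def select_fetch_threads(inflight_counts, icache_miss, num_select=2):
    """Pick the threads with the fewest in-flight instructions (icount),
    skipping threads that are currently waiting on an I-cache miss."""
    candidates = [t for t in range(len(inflight_counts)) if not icache_miss[t]]
    candidates.sort(key=lambda t: inflight_counts[t])
    return candidates[:num_select]

# Hypothetical snapshot of 8 hardware contexts.
inflight = [12, 3, 25, 7, 0, 18, 9, 30]
miss     = [False, False, True, False, False, False, True, False]
print(select_fetch_threads(inflight, miss))   # -> [4, 1], the least-clogged threads
# Each selected thread may then fetch up to 4 instructions this cycle (2.4 scheme).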
Following instruction fetch and decode, register renaming is performed, as in the base processor. Each thread can address 32 architectural integer (and floating-point) registers. The register-renaming mechanism maps these architectural registers (1 set per thread) onto the machine's physical registers. An 8-context SMT therefore requires at least 8 × 32 = 256 physical registers, plus additional physical registers for renaming. A larger register file requires longer access times, so the SMT processor pipeline is extended by 2 cycles to avoid an increase in cycle time. Figure 3 compares the pipeline of the SMT with that of the base superscalar. On the SMT, register reads take 2 pipe stages and are pipelined; writes to the register file behave in a similar manner, also using an extra pipeline stage. In practice, the lengthened pipeline degrades performance by less than 2% when running a single thread. The additional pipe stage requires an extra level of bypass logic, but the number of stages has a smaller impact on the complexity and delay of this logic (O(n)) than the issue width does (O(n^2)) [Palacharla et al. 1997]. The previous study contains more details regarding the effects of the two-stage register read/write pipelines on the architecture and performance.
In addition to the new fetch mechanism, the larger register file, and the longer pipeline, only three processor resources are replicated or added to support SMT: per-thread instruction retirement, per-thread trap mechanisms, and an additional thread-id field in each branch target buffer entry. No additional hardware is required to schedule instructions from multiple threads onto the functional units. The register-renaming phase removes any apparent inter-thread register dependences, so the conventional instruction queues can be used to dynamically schedule instructions from multiple threads. Instructions from all threads are placed into the instruction queues, and an instruction from any thread can be issued from the queues once its operands become available. In this design, few resources are statically partitioned among contexts; consequently, almost all hardware resources are available even when only one thread is executing. The architecture thus achieves the performance advantages of simultaneous multithreading while keeping intact the design and single-thread peak performance of the dynamically scheduled CPU core present in modern superscalar architectures.
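A minimal register-renaming sketch in Python, assuming one rename map per hardware context and a shared free list of physical registers; it only illustrates how per-thread maps keep the same architectural register number in different threads from colliding, which is what lets a single shared instruction queue schedule all threads.

NUM_CONTEXTS, ARCH_REGS, PHYS_REGS = 8, 32, 352   # 8*32 = 256 mappings plus extra registers for renaming

free_list = list(range(PHYS_REGS))
# One rename map per context: architectural register -> physical register.
rename_map = [dict() for _ in range(NUM_CONTEXTS)]

def rename(context, dest_arch_reg):
    """Allocate a fresh physical register for a destination register."""
    phys = free_list.pop()
    rename_map[context][dest_arch_reg] = phys
    return phys

def lookup(context, src_arch_reg):
    """Source operands read the current mapping of their own context only."""
    return rename_map[context][src_arch_reg]

# Two threads both write architectural register r5 but receive different
# physical registers, so the shared queues see no false inter-thread dependence.
p0 = rename(0, 5)
p1 = rename(1, 5)
print(p0, p1, p0 != p1)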
Single-Chip Multiprocessor Hardware Configurations
In our analysis of SMT and multiprocessing, we focus on a particular region of the MP design space, specifically, small-scale, single-chip, shared-memory
multiprocessors. As chip densities increase, single-chip multiprocessing will be possible, and some architects have already begun to investigate this use of chip real estate [Olukotun et al. 1996]. An SMT processor and a small-scale, on-chip multiprocessor have many similarities: for example, both have large numbers of registers and functional units, on-chip caches, and the ability to issue multiple instructions each cycle. In this study, we keep these resources approximately similar for the SMT and MP comparisons, and in some cases we give a hardware advantage to the MP. We look at both two- and four-processor multiprocessors, partitioning the scheduling-unit resources of the multiprocessor CPUs (the functional units, instruction queues, and renaming registers) differently for each case. In the two-processor MP (MP2), each processor receives half of the on-chip execution resources previously described, so that the total resources relative to an SMT are comparable (Table I). For a four-processor MP (MP4), each processor contains approximately one-fourth of the chip resources. The issue width for each processor in these two MP models is indicated by its total number of functional units. Note that even within the MP design space, these two alternatives (MP2 and MP4) represent an interesting tradeoff between TLP and ILP: the two-processor machine can exploit more ILP, because each processor has more functional units than its MP4 counterpart, whereas MP4 has additional processors to take advantage of more TLP. Table I also includes several multiprocessor configurations in which hardware resources are increased. These configurations are designed to reduce bottlenecks in resource usage in order to improve aggregate performance. MP2fu (MP4fu), MP2q (MP4q), and MP2r (MP4r) address bottlenecks of functional units, instruction queues, and renaming registers, respectively. MP2a increases all three of these resources, so that the total execution resources of each processor are equivalent to a single SMT processor. (MP4a is similarly augmented in all three resource classes, so that the entire MP4a multiprocessor has twice as many resources as our SMT.) For all MP configurations, the base processor uses the out-of-order scheduling core described earlier and the base pipeline from Figure 3. Each MP processor supports only one context; its register file is therefore smaller, and access is faster than on the SMT, so the shorter pipeline is appropriate.
Synchronization Mechanisms and Memory Hierarchy
SMT has three key advantages over multiprocessing: flexible use of ILP and TLP, the potential for fast synchronization, and a shared L1 cache. This study focuses on SMT's ability to exploit the first advantage by determining the costs of partitioning execution resources; we therefore allow the multiprocessor to use SMT's synchronization mechanisms and cache hierarchy, to avoid tainting the results with the effects of the latter two. We implement a set of synchronization primitives for thread creation and termination, as well as hardware blocking locks. Because the threads in an SMT processor share the same scheduling core, inexpensive hardware blocking locks can be implemented in a synchronization functional unit. This cheap synchronization is not available to multiprocessors, because the distinct processors cannot share functional units. In our workload, most inter-thread synchronization is in the form of barrier synchronization or simple locks, and we found that synchronization time is not critical to performance. Therefore, we allow the MPs to use the same cheap synchronization techniques, so that our comparisons are not colored by synchronization effects. The entire cache hierarchy, including the L1 caches, is shared by all threads in an SMT processor. Multiprocessors typically do not use a shared L1 cache to exploit data sharing between parallel threads; each processor in an MP usually has its own private cache (as in the commercial multiprocessors described by Sun Microsystems [1997], Slater [1992], IBM [1997], and Silicon Graphics [1996]) and therefore incurs some coherence overhead when data sharing occurs. Because we allow the MP to use the SMT's shared L1 cache, this coherence overhead is eliminated. Although multiple threads may have working sets that interfere in a shared cache, that inter-thread interference is not a problem.
Comparing SMT and MP
In comparing the total hardware dedicated to our multiprocessor and SMT configurations, we have not taken into account chip area required for buses or the cycle-time effects of a wider-issue machine. In our study, the intent is not to claim that SMT has an absolute x percent performance advantage over MP, but instead to demonstrate that SMT can overcome some fundamental limitations of multiprocessors, namely, their inability to exploit changing levels of ILP and TLP. We believe that in the target design space we are studying, the intrinsic flaws resulting from resource partitioning in MPs will limit their effectiveness relative to SMT, even taking into consideration cycle time.
PARALLEL APPLICATIONS
SMT is most effective when threads have complementary hardware resource requirements. Multiprogrammed workloads and workloads consisting of parallel applications both provide TLP via independent streams of control, but they compete for hardware resources differently. Because a multiprogrammed workload does not share memory references across threads, it places more stress on the caches. Furthermore, its threads have different instruction execution patterns, causing interference in the branch prediction hardware. On the other hand, multiprogrammed workloads are less likely to compete for identical functional units. Although parallel applications have the benefit of sharing the caches and branch prediction hardware, they are an interesting and different test of SMT for several reasons. First, unlike the multiprogrammed workload, all threads in a parallel application execute the same code and therefore have similar execution resource requirements, memory reference patterns, and levels of ILP. Because all threads tend to need the same resources at the same time, there is potentially more contention for these resources than in a multiprogrammed workload.
Second, parallel applications illustrate the promise of SMT as an architecture for improving the performance of single applications. By using threads to parallelize programs, SMT can improve processor utilization and, more important, achieve program speedups. Finally, parallel applications are a natural workload for traditional parallel architectures and therefore serve as a fair basis for comparing SMT and multiprocessors.
Effectively Using Parallelism on an SMT Processor
Rather than adding more execution resources to improve performance, SMT boosts performance and improves the utilization of existing resources by using parallelism more effectively. Unlike multiprocessors, which suffer from rigid partitioning, simultaneous multithreading permits dynamic resource sharing, so that resources can be flexibly partitioned on a per-cycle basis to match the ILP and TLP needs of the program. When a thread has a lot of ILP, it can access all processor resources; when it does not, TLP can compensate for the lack of per-thread ILP.
As more threads are used, speedups increase (up to 2.68 on average with 8 threads), exceeding the performance gains attained by the enhanced MP configurations. The degree of SMT's improvement varies across the benchmarks, depending on the amount of per-thread ILP. The five programs with the least ILP (radix, tomcatv, hydro2d, water-spatial, and shallow) get the five largest speedups on SMT.T8, because TLP compensates for low ILP; programs that already have a large amount of ILP (LU and FFT) benefit less from additional threads, because the resources are already busy executing useful instructions. In linpack, performance tails off after two threads, because the granularity of parallelism in the program is very small: the gain from parallelism is outweighed by the overhead of parallelization (not only thread creation, but also the work required to set up the loops in each thread).
IMPLICITLY-MULTITHREADED (IMT) PROCESSORS
IMT executes compiler-specified speculative threads from a sequential program on a wide-issue SMT pipeline. IMT is based on the fundamental observation that Multiscalar's execution model, i.e., compiler-specified speculative threads [11], can be decoupled from its processor organization, i.e., distributed processing cores. Multiscalar [11] employs sophisticated, specialized hardware, the register ring and the address resolution buffer, which are strongly coupled to the distributed core organization. In contrast, IMT proposes to map speculative threads onto a generic SMT. IMT differs fundamentally from prior proposals for speculative threading on SMT, namely TME and DMT. While TME executes multiple threads only in the uncommon case of branch mispredictions, IMT invokes threads in the common case of correct predictions, thereby enhancing execution parallelism. Unlike IMT, DMT creates threads in hardware. Because of the lack of compile-time information, DMT uses value prediction to break data dependences across threads. Unfortunately, inaccurate value prediction incurs frequent misspeculation stalls, preventing DMT from extracting thread-level parallelism effectively. Moreover, selective recovery from misspeculation in DMT requires fast and frequent searches through prohibitively large (e.g., ~1000-entry) custom instruction trace buffers that are difficult to implement efficiently. A naive mapping of compiler-specified speculative threads onto SMT performs poorly: despite using an advanced compiler [14] to generate threads, a Naive IMT (N-IMT) implementation performs only comparably to an aggressive superscalar. N-IMT's key shortcoming is its indiscriminate approach to fetching and executing instructions from threads, without accounting for resource availability, thread resource usage, and inter-thread dependence information. The resulting poor utilization of pipeline resources (e.g., issue queue, load/store queues, and register file) in N-IMT negatively offsets the advantages of speculative threading.
An Implicitly-MultiThreaded (IMT) processor utilizes SMT's support for multithreading by executing speculative threads. Figure 3 depicts the anatomy of an IMT processor derived from SMT. IMT uses the rename tables for register renaming, the issue queue for out-of-order scheduling, and the per-context load/store queue (LSQ) and active list for memory dependences and instruction reordering prior to commit. As in SMT, IMT shares the functional units, physical registers, issue queue, and memory hierarchy among all contexts. IMT exploits implicit parallelism, as opposed to the programmer-specified, explicit parallelism exploited by conventional SMT and multiprocessors. Like Multiscalar, IMT predicts the threads in succession and maps them to execution resources, with the earliest thread as the non-speculative (head) thread, followed by subsequent speculative threads [11]. IMT honors the inter-thread control-flow and register dependences specified by the compiler, and uses the LSQ to enforce inter-thread memory dependences. Upon completion, IMT commits the threads in program order. Two IMT variations are presented: (1) a Naive IMT (N-IMT) that performs comparably to an aggressive superscalar, and (2) an Optimized IMT (O-IMT) that uses novel microarchitectural techniques to enhance performance.
Thread Invocation
Like Multiscalar, both IMT variants invoke threads in program order by predicting the next thread from among the targets of the previous thread (specified by the thread descriptor) using a thread predictor. A descriptor cache
(Figure 3) stores recently fetched thread descriptors. Although threads are invoked in program order, IMT may fetch later threads' instructions out of order, before fetching all of the earlier threads' instructions, thereby interleaving instructions from multiple threads. To decide which thread to fetch from, IMT consults the fetch policy.
Resource Allocation & Fetch Policy
The baseline IMT processor, N-IMT, uses an unmodified ICOUNT policy [13], in which the thread with the fewest instructions in flight is chosen to fetch from every cycle. The rationale is that the thread with the fewest in-flight instructions is the one whose instructions are flowing through the pipeline with the fewest stalls. The ICOUNT policy, however, may be suboptimal for a processor in which threads exhibit control-flow and data dependences and resources are relinquished in program (rather than thread) order. For instance, later (program-order) threads may cause resource (e.g., physical register, issue queue, and LSQ entry) starvation in earlier threads, forcing the later threads to squash and relinquish their resources for use by the earlier threads. Unfortunately, frequent thread squashing due to indiscriminate resource allocation, without regard to demand, incurs high overhead. Moreover, treating (control- and data-) dependent and independent threads alike is suboptimal. Fetching and executing instructions from later threads that depend on earlier threads may be counter-productive, because it increases inter-thread dependence delays by taking front-end fetch and processing bandwidth away from earlier threads. Finally, dependent instructions from later threads exacerbate issue-queue contention, because they remain in the queue until their dependences are resolved.
To mitigate these shortcomings, O-IMT employs a novel resource- and dependence-based fetch policy that is bimodal. In the dependent mode, the policy biases fetch towards the non-speculative thread when the threads are likely to be dependent, fetching sequentially to the highest extent possible. In the independent mode, the policy uses ICOUNT when the threads are potentially independent, enhancing overlap among multiple threads. Because loop iterations are typically independent, the policy employs an Inter-Thread Dependence Heuristic (ITDH) to identify loop iterations for the independent mode, otherwise considering threads to be dependent. ITDH predicts that subsequent threads are loop iterations if the next two threads' start PCs are the same as the non-speculative (head) thread's start PC. To reduce resource contention among threads, the policy employs a Dynamic Resource Predictor (DRP) to initiate fetch from an invoked thread only if the available hardware resources exceed the thread's predicted demand. The DRP dynamically monitors the threads' activity and allows fetch to be initiated from newly invoked threads as earlier threads commit and resources become available.
Figure 4(a) depicts an example of DRP. O-IMT indexes into a table using the start PC of a thread. Each table entry holds the numbers of active-list and LSQ slots and physical registers used by the thread's last four execution instances. The pipeline monitors a thread's resource needs and, upon thread commit, updates the thread's DRP entry. DRP supplies the maximum among the four instances for each resource as the prediction for the next instance's resource requirement. The policy also alleviates inter-thread data dependence by processing producer instructions earlier and decreasing instruction execution stalls, thereby reducing pipeline resource contention. In contrast to O-IMT, prior proposals for speculative threading on SMT use variants of conventional fetch policies. TME uses biased-ICOUNT, a variant of ICOUNT that does not consider resource availability or thread-level independence. DMT's fetch policy statically partitions two fetch ports, allocating one port to the non-speculative thread and the other to the speculative threads in round-robin fashion. DMT avoids resource contention only because its design assumes prohibitively large custom instruction trace buffers (holding thousands of instructions), which allow threads to make forward progress without regard to resource availability or thread-level independence. Unfortunately, frequent associative searches through such large buffers are slow and impractical.
Figure 4. DRP usage (a) and context multiplexing (b).
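A sketch of the Dynamic Resource Predictor as described above, assuming a table keyed by thread start PC that remembers the resource usage of the last four committed instances; the class name, the example start PC, and the usage numbers are all hypothetical.

from collections import defaultdict, deque

HISTORY = 4  # instances remembered per thread start PC

class DynamicResourcePredictor:
    def __init__(self):
        # start PC -> recent (active-list slots, LSQ slots, physical registers)
        self.table = defaultdict(lambda: deque(maxlen=HISTORY))

    def update(self, start_pc, active_list, lsq, regs):
        """Called when a thread commits, with the resources it actually used."""
        self.table[start_pc].append((active_list, lsq, regs))

    def predict(self, start_pc):
        """Predicted demand: per-resource maximum over the recorded instances."""
        history = self.table[start_pc]
        if not history:
            return None               # no information yet; caller falls back to a default
        return tuple(max(col) for col in zip(*history))

drp = DynamicResourcePredictor()
for usage in [(40, 10, 18), (44, 12, 20), (38, 9, 17), (41, 11, 19)]:
    drp.update(0x4000, *usage)
print(drp.predict(0x4000))           # -> (44, 12, 20)
# O-IMT would start fetching from a newly invoked thread only when the free
# resources exceed this predicted demand.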
Multiplexing Hardware Contexts
Much like prior proposals, N-IMT assigns a single thread to a hardware context. Because many programs have short threads [14] and real SMT implementations are bound to have only a few (e.g., 2-8) contexts, this approach often leads to insufficient instruction overlap. Larger threads, however, increase both the likelihood of dependence misspeculation [14] and the number of instructions discarded per misspeculation, and they cause speculative buffer overflow [5]. Instead, to increase instruction overlap without the unwanted side effects of large threads, O-IMT multiplexes the hardware contexts by mapping as many threads as the resources allow onto one context (typically 3-6 threads for SPEC2K). Context multiplexing requires, for each context, only an additional fetch PC register and a rename-table pointer per thread, for a given maximum number of threads per context. Context multiplexing differs from prior proposals for mapping multiple threads onto a single processing core [12,3] to alleviate load imbalance, in that multiplexing allows instructions from multiple threads within a context to execute and share resources simultaneously.
Two design complexities arise from sharing resources in context multiplexing. First, conventional active-list and LSQ designs assume that instructions enter these queues in (the predicted) program order. This assumption enables the active list to be a non-searchable (and potentially large) structure, and allows memory dependences to be honored via an ordered (associative) search in the LSQ. If care is not taken, multiplexing would invalidate this assumption, because multiple threads could place instructions out of program order in the shared active list and LSQ. Such out-of-order placement would require an associative search on the active list to determine the correct instruction(s) to be removed upon commit or misspeculation. In the case of the LSQ, the requirements would be even more complicated: a memory access would have to search the LSQ for an address match among the entries of the accessing thread, and then (conceptually) repeat the search among entries from the thread preceding the accessing thread, working towards older threads. Unfortunately, the active list and LSQ cannot afford these additional complications, because active lists are made large and therefore non-searchable by design, and the LSQ's ordered, associative search is already complex and time-critical. Second, allowing a single context to hold multiple out-of-program-order threads complicates the management of inter-thread dependences. Because two in-program-order threads might be mapped to different contexts, honoring memory dependences would require memory accesses to search through multiple contexts, prohibitively increasing LSQ search time and design complexity.
O-IMT avoids the second complexity by mapping threads to a context in program order. Inter-thread and intra-thread dependences within a single context are treated alike. Figure 4(b) shows how in-program-order threads X and X+1 are mapped to a context. In addition to the program order within contexts, O-IMT tracks the global program order among the contexts themselves for precise interrupts.
Register Renaming
A superscalar's register rename table relies on in-order instruction fetch to link register value producers to consumers. IMT processors' out-of-order fetch raises two issues in linking producers in earlier threads to consumers in later threads. First, IMT has to ensure that the rename maps for earlier threads' source registers are not clobbered by later threads. Second, IMT must guarantee that later threads' consumer instructions obtain the correct rename maps and wait for the yet-to-be-fetched producer instructions of earlier threads. While other proposals [1,7] employ hardware-intensive value prediction to address these issues, potentially incurring frequent misspeculation and recovery overhead, IMT uses the create and use masks combined with conventional SMT rename tables.
Both IMT variants address these issues as follows. Upon thread start-up (and prior to instruction fetch), the processor copies the rename maps of the registers in the create and use masks from a master rename table to the thread's local rename table. To allow subsequent threads to be invoked, the processor pre-allocates physical registers and pre-assigns mappings for all the create-mask registers in a pre-assign rename table. Finally, the processor updates the master table with the pre-assigned mappings and marks them as busy to reflect the yet-to-be-created register values. Upon thread invocation, the master table therefore correctly reflects the register mappings that a thread should either use or wait for.
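A minimal sketch of this start-up step, assuming the create and use masks are simply sets of architectural register numbers taken from the thread descriptor; the table sizes, register numbers, and function name are illustrative, not the hardware's actual structures.

free_list = list(range(512))          # physical register pool (illustrative size)
master = {}                           # architectural reg -> (physical reg, busy flag)

def start_thread(create_mask, use_mask):
    # 1. The thread's local table gets the current maps for the registers it
    #    will read (use mask) or write (create mask).
    local = {r: master.get(r) for r in (create_mask | use_mask)}
    # 2. Registers the thread will create are pre-assigned fresh physical
    #    registers in a pre-assign table...
    pre_assign = {r: free_list.pop() for r in create_mask}
    # 3. ...and the master table is updated with those mappings, marked busy so
    #    that later threads wait for the yet-to-be-produced values.
    for r, phys in pre_assign.items():
        master[r] = (phys, True)
    return local, pre_assign

local_map, pre = start_thread(create_mask={3, 4}, use_mask={1, 3})
print(local_map, pre)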
POWER CONSUMPTION IN MULTITHREADED PROCESSORS
Processor power and energy consumption are of concern
in two different operating environments. The first is the mobile computer, where battery life is still very limited. While the overall energy consumption of a microprocessor is being reduced through voltage scaling, dynamic clock-frequency reduction, and low-power circuit design, optimizations can be applied at the architecture level as well. The other environment where power consumption is important is that of high-performance processors, which are used where energy supply is not typically a limitation. For high-performance computing, clock frequencies continue to increase, causing the power dissipated to reach the limits of current packaging technology. When the maximum power dissipation becomes a critical design constraint, the architecture that maximizes the performance/power ratio thereby maximizes performance. Multithreading is a processor architecture technique that has been shown to provide a significant performance advantage over conventional architectures, which can only follow a single stream of execution. Simultaneous multithreading (SMT) [15, 14] can provide up to twice the throughput of a dynamic superscalar single-threaded processor. Announced architectures that feature multithreading include the Compaq Alpha EV8 [5] and the Sun MAJC Java processor [12], joining the existing Tera MTA supercomputer architecture [1]. A multithreading processor is attractive in the context of low-power or power-constrained devices for many of the same reasons that enable its high throughput. First, it supplies extra parallelism via multiple threads, allowing the processor to rely much less heavily on speculation; thus, it wastes fewer resources on speculative, never-committed instructions. Second, it provides both higher and more even parallelism over time when running multiple threads, wasting less power on underutilized execution resources. A multithreading architecture also allows design decisions that are not available in a single-threaded processor, such as the ability to influence power through the thread-selection mechanism.
Modelling Power
To describe the power and energy behavior of the processor, a power model is needed. This power model is integrated into a detailed, cycle-by-cycle, instruction-level architectural simulator, allowing it to accurately account for both useful and non-useful (incorrectly speculated) use of all processor resources. The power model used in this study is an area-based model: the microprocessor is divided into several high-level units, and the activity of each unit is used in the computation of the overall power consumption. The following processor units are modeled: L1 instruction cache, L1 data cache, L2 unified cache, fetch unit, integer and floating-point instruction queues, branch predictor, instruction TLB, data TLB, load-store queue, integer and floating-point functional units, register file, register renamer, completion queue, and return stack. The total processor energy consumption is the sum of the unit energies, where each unit's energy is computed from its activity factors, its relative area, and its circuit makeup, as described below.
An activity factor is a defined statistic measuring how many architectural events a program or workload generates for a particular unit. For a given unit, there may be multiple activity factors each representing a different action that
can occur within the unit. Activity factors represent high level actions and therefore are independent of the data values causing a particular unit to become active. We can compute the overall power consumption of the processor on a
cycle-by-cycle basis by summing the energy usage of all units in the processor.
The entire power model consists of 44 activity factors. Each activity factor is a measurement of the activity of a particular unit or function of the processor (e.g., the number of instruction-queue writes or L2 cache reads). The model does not assume that a given factor, or combination of factors, exercises a microprocessor unit in its entirety; instead, each activity is modeled as exercising a fraction of the unit's area, depending on the particular activity factor or combination of activity factors. The weight describing how much of the unit is exercised by a particular activity is an estimate based on knowledge of the unit's functionality and assumed implementation. Each unit is also assigned a relative area value, an estimate of the unit's size given its parameters. In addition to its overall area, each unit is modeled as consisting of a particular ratio of four circuit types: dynamic logic, static logic, memory, and clock circuitry. Each circuit type is given an energy density. The ratios of circuit types for a given logic unit are estimates based on engineering experience and consultation with processor designers. The advantage of an area-based model is its adaptability: it can model an aggregate of several existing processors, as long as average area breakdowns can be derived for each of the units, and it can be adapted to future processors for which no circuit implementations exist, as long as an estimate of the area expansion can be made. In both cases, we rely on the somewhat coarse assumption that the general circuit makeup of these units remains fairly constant across designs. This assumption should be reasonably valid for the high-level architectural comparisons made here, but might not hold for lower-level design options. The power model is an architecture-level model and is not intended to produce precise estimates of absolute whole-processor power consumption; for example, it does not include leakage, I/O power (e.g., pin drivers), and so on. However, it is useful for obtaining relative power numbers for comparison against results obtained from the same model. The numbers obtained from the power model are also independent of clock frequency (again, because we focus on relative values).
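The original unit-energy equation is not reproduced in this copy, so the Python sketch below only shows the general form the description implies: relative area, times weighted activity factors, times the average energy density of the unit's assumed circuit mix. All numbers, names, and the exact functional form are assumptions for illustration, not the paper's actual equation.

def unit_energy(area, activity, weights, circuit_mix, densities):
    """Assumed form of the area-based model: relative area x weighted activity
    x average energy density of the unit's circuit mix."""
    activity_term = sum(activity[a] * weights[a] for a in activity)
    density_term = sum(circuit_mix[c] * densities[c] for c in circuit_mix)
    return area * activity_term * density_term

densities = {"dynamic": 1.0, "static": 0.6, "memory": 0.4, "clock": 1.2}   # illustrative
icache = unit_energy(
    area=0.08,                                   # relative area (hypothetical)
    activity={"reads": 2.0, "fills": 0.1},       # events this cycle (hypothetical)
    weights={"reads": 0.7, "fills": 1.0},        # fraction of the unit exercised
    circuit_mix={"memory": 0.8, "static": 0.1, "clock": 0.1},
    densities=densities)
print(icache)
# Total processor energy for the cycle would then be the sum of unit_energy
# over all modeled units.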
Power Consumption of a Multithreaded Architecture

We examine the power-energy characteristics of multithreaded execution: performance (IPC), energy efficiency (energy per useful instruction executed, E/UI), and power (the average power over each simulation run) for both single-thread and multithread execution. All results (including IPC) are normalized to a baseline (in each case, the lowest single-thread value). This is done for two reasons: first, the power model is not intended to be an indicator of absolute power or energy, so normalization lets us focus on the relative values; second, it allows us to view the diverse metrics on a single graph. Later results are normalized to other baseline values. The single-thread results (Figure 5) show diversity in almost all metrics. Performance and power appear to be positively correlated: the more of the processor that is used, the more power is used. However, performance and energy per useful instruction are somewhat negatively correlated: the fewer unused resources and the fewer wastefully used resources, the more efficient the processor. Also included in the graph (the Ideal E/UI bar) is the energy efficiency of each application with perfect branch prediction. Compared with the E/UI result, this gives an indication of the energy lost to incorrect speculation. Gcc and go suffer most from low branch prediction accuracy; gcc, for example, commits only 39% of fetched instructions and 56% of executed instructions. In cases such as gcc and go, the energy cost of mispredicted branches is quite large (35% to 40%). Figure 6 shows that multithreaded execution has significantly better energy efficiency than conventional processor execution. Multithreading attacks the two primary sources of wasted energy and power: unutilized resources and, most importantly, resources wasted on incorrect speculation. Multithreaded processors rely far less on speculation to achieve high throughput than conventional processors do.

Figure 5. Relative performance, energy efficiency, and power for the six benchmarks run with a single thread.

Figure 6. Relative performance, energy efficiency, and power as the number of threads varies.
With multithreading, then, we achieve as much as a 90% increase in performance (the 4-thread result) while actually decreasing the energy dissipated per useful instruction for a given workload. This is achieved with a relatively modest increase in power. The increase in power occurs because of the overall increase in utilization and throughput of the processor, that is, because the processor completes the same work in a shorter period of time. Figure 7 shows the effect of the reduced misspeculation on the individual components of the power model. The greatest reductions come at the front of the pipeline (e.g., fetch), which is always the slowest to recover from a branch misprediction. The reduction in energy is achieved despite an increase in L2 cache utilization and power. Multithreading, particularly with a multiprogrammed workload, reduces the locality of accesses to the caches, resulting in more cache misses; however, this has a bigger impact on L2 power than on L1. The L1 caches see more cache fills, while the L2 sees more accesses per instruction. The increase in L1 cache fills is also mitigated by the same overall efficiency gains, since the L1 sees fewer speculative accesses.
Figure 7. Contributors to overall energy efficiency for different numbers of threads.
The power advantage of simultaneous multithreading can provide a significant gain in both the mobile and the high-performance domains. The E/UI values demonstrate that a multithreaded processor operates more efficiently, which is desirable in a mobile computing environment. For a high-performance processor, the constraint is more likely to be average or peak power. SMT can make better use of a given average power budget, but it really shines in its ability to take greater advantage of a peak power constraint: an SMT processor achieves sustained performance much closer to the peak performance (and power) the processor was designed for than a single-threaded processor does.
Power Optimizations
In this section, we examine power optimizations that either reduce the overall energy usage of a multithreading processor or reduce its power consumption. We examine the following optimizations, each made practical in some way by multithreading: reduced execution bandwidth, dynamic power-consumption controls, and thread fetch optimizations. Reduced execution bandwidth exploits the higher execution efficiency of multithreading to achieve high performance while maintaining moderate power consumption. Dynamic power-consumption controls allow a high-performance processor to gracefully degrade its activity when power-consumption thresholds are exceeded. The thread fetch optimization seeks better processor resource utilization through the thread-selection algorithm.
Reduced Execution Bandwidth
Prior studies have shown that an SMT processor has the potential to achieve double the instruction throughput of a single-threaded processor with similar execution resources. This implies that a given performance goal can be met with a less aggressive architecture, possibly consuming less power. A multithreaded processor with a smaller execution bandwidth may therefore achieve performance comparable to that of a more aggressive single-threaded processor while consuming less power. In this section, for example, we compare the power consumption and energy efficiency of an 8-issue single-threaded processor with a 4-issue multithreaded processor.
Controlling Power Consumption via Feedback
We strive to reduce peak power dissipation while still maximizing performance, a goal that future high-performance architects are likely to face. In this section, the processor is given feedback about its power dissipation in order to limit its activity. Such a feedback mechanism does indeed reduce average power, but the real attraction of this technique is the ability to reduce peak power to an arbitrary level. A feedback mechanism that guaranteed that the actual power consumption did not exceed a threshold could make peak and achieved power consumption arbitrarily close. Thus, a processor could still be 8-issue (because that width is useful some fraction of the time), yet be designed knowing that the peak power corresponds to, say, 5 IPC sustained, as long as some mechanism in the processor can guarantee that sustained power does not exceed that level. This section models such a mechanism, applied to an SMT processor, which uses the power consumption of the processor as the feedback input. For this optimization, we would like to set an average power-consumption target and have the processor stay within a reasonable tolerance of the target while achieving higher performance than a single-threaded processor could. Our target power, in this case, is exactly the average power of all the benchmarks running on the single-threaded processor. This is a somewhat arbitrary threshold, but it makes the results easier to interpret. The feedback is an estimate of the power the processor is currently consuming. An implementation of this mechanism could use either on-die sensors or a 'power' value computed from performance counters; the advantage of performance counters is that the lag time for the value is much less than for a physical temperature sensor. For the experiments listed here, the power value used is the number obtained from the simulator's power model, with a power sampling interval of 5 cycles (essentially the delay before the feedback information can be used by the processor). In each of the techniques, we define thresholds such that if the power consumption exceeds a given threshold, fetching or branch-prediction activity is modified to curb power consumption. The threshold values are fixed throughout a given simulation. In all cases, fetching is halted for all threads when power consumption reaches 110% of the desired power value.
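A minimal sketch of such a feedback controller in Python. The target value and the intermediate throttling thresholds and actions are hypothetical; only the 5-cycle sampling interval and the 110% fetch-halt rule come from the description above.

TARGET = 40.0          # desired average power (arbitrary units, illustrative)
SAMPLE_INTERVAL = 5    # cycles between feedback updates, as in the text

def throttle_decision(estimated_power):
    """Map the sampled power estimate to a throttling action; the intermediate
    thresholds and actions are assumptions made for this sketch."""
    if estimated_power >= 1.10 * TARGET:
        return "halt fetch for all threads"
    if estimated_power >= 1.05 * TARGET:
        return "fetch from fewer threads"
    if estimated_power >= 1.00 * TARGET:
        return "stop fetching down low-confidence (speculative) paths"
    return "no throttling"

for power in (38.0, 41.0, 42.5, 45.0):
    print(power, "->", throttle_decision(power))

Because an SMT processor has several threads to scale back, it can degrade activity in smaller steps than a single-threaded machine, which is the point developed in the next paragraph.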
Figure 8. Performance and power, averaged over all benchmarks, for 2 and 4 threads with power feedback.
Power feedback control is a technique that enables the designer to lower the peak power constraints on the processor. This technique is particularly effective in a multithreading environment for two reasons. First, even with the drastic step of eliminating speculation, even for fetch, an SMT processor can still make much better progress than a single-threaded processor. Second, there are more dimensions along which to scale back execution; the results were best when we took advantage of the opportunity to scale back threads incrementally.
Thread Selection
This optimization examines the effect of thread-selection algorithms on power and performance. The ability to select from among multiple threads provides the opportunity to optimize for power consumption and operating efficiency when making thread fetch decisions. In the mechanism we model, two threads are selected each cycle to attempt to fetch instructions from the I-cache. The heuristic used to select which threads fetch can have a large impact on performance [14]. This mechanism can also influence power if we use it to bias against the most speculative threads. The heuristic does not, in fact, favor low-power threads over high-power threads, because that would only delay the running of the high-power threads; rather, it favors less speculative threads over more speculative threads. This works because the stalled threads become less speculative over time (as their branches are resolved in the processor) and quickly become good candidates for fetch. We modify the ICOUNT thread-selection scheme from [14] by adding a branch-confidence metric to the thread fetch priority. This could be viewed as a derivative of pipeline gating [11] applied to a multithreaded processor; however, we use it to change the fetch decision, not to stop fetching altogether. The low-conf scheme biases heavily against threads with more unresolved low-confidence branches, while the all-branches scheme biases against the threads with more unresolved branches overall in the processor (regardless of confidence). Both use ICOUNT as the secondary metric.
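A sketch of the low-conf variant of this selection in Python, assuming each context reports a count of unresolved low-confidence branches and an ICOUNT-style count of in-flight instructions; the example numbers are hypothetical.

def pick_fetch_threads(threads, num_select=2):
    """Primary key: unresolved low-confidence branches (fewer is better, the
    low-conf scheme); secondary key: ICOUNT (fewer in-flight instructions)."""
    ranked = sorted(threads, key=lambda t: (t["low_conf_branches"], t["inflight"]))
    return [t["id"] for t in ranked[:num_select]]

threads = [
    {"id": 0, "low_conf_branches": 3, "inflight": 5},
    {"id": 1, "low_conf_branches": 0, "inflight": 20},
    {"id": 2, "low_conf_branches": 0, "inflight": 8},
    {"id": 3, "low_conf_branches": 1, "inflight": 2},
]
print(pick_fetch_threads(threads))   # -> [2, 1]: the least speculative threads win
# A stalled thread becomes less speculative as its branches resolve, so it
# regains fetch priority within a few cycles rather than being starved.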

Figure 9. Performance, power, and energy effects of using branch status to direct thread selection.
Figure 9 shows that this mechanism has the potential to improve both raw performance and energy efficiency at the same time, particularly with the confidence predictor. For the 4-thread simulations, performance increased by 6.3% and 4.9% for the low-conf and all-branches schemes, respectively. The efficiency gains should be enough to outweigh the additional power required for the confidence counters (not modeled), assuming the architecture did not already need them for other purposes. If not, the technique without the confidence counters was equally effective at improving power efficiency, lagging only slightly in performance. The figure shows an improvement even with two threads, because when two threads are chosen for fetch in a cycle, one is still chosen to have higher priority and may consume more of the available fetch bandwidth.
Microprocessor power dissipation is becoming increasingly critical due to several pressures. The market for low-power embedded and mobile devices is growing rapidly, and each processor generation places far greater demands on power and cooling than the previous one. We are approaching a point at which power may become a bottleneck before transistor count, even for high-performance processors. In that scenario, the processor architecture that optimizes the performance/power ratio thereby optimizes performance. These results demonstrate that simultaneous multithreading is an attractive architecture when energy and power are constrained: in particular, it provides the flexibility to scale back speculation and thread-level parallelism so that higher performance can be achieved within a fixed power budget.
Reply
#6
PRESENTED BY:
G SURESH

[attachment=9412]
Introduction:
Hyper-Threading technology is a groundbreaking innovation from Intel that enables multi-threaded server software applications to execute threads in parallel within each processor in a server platform. The Intel® Xeon™ processor family uses Hyper-Threading technology, along with the Intel® NetBurst™ microarchitecture, to increase compute power and throughput for today's Internet, e-Business, and enterprise server applications. This level of threading technology has never been seen before in a general-purpose microprocessor. Hyper-Threading technology helps increase transaction rates, reduce end-user response times, and enhance business productivity, providing a competitive edge to e-Businesses and the enterprise. The Intel® Xeon™ processor family for servers represents the next leap forward in processor design and performance by being the first Intel® processor to support thread-level parallelism on a single processor.
With processor and application parallelism becoming more prevalent, today's server platforms are increasingly turning to threading as a way of increasing overall system performance. Server applications have been threaded (split into multiple streams of instructions) to take advantage of multiple processors. Multi-processing-aware operating systems can schedule these threads for processing in parallel, across multiple processors within the server system. These same applications can run unmodified on the Intel® Xeon™ processor family for servers and take advantage of thread-level parallelism on each processor in the system. Hyper-Threading technology complements traditional multi-processing by offering greater parallelism and performance headroom for threaded software.
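As a minimal sketch (not from the original text) of such a threaded application, the C++ example below spawns one worker per schedulable CPU reported by the operating system; with Hyper-Threading enabled, each logical processor counts as a schedulable CPU, so the same unmodified code gains additional parallelism.

```cpp
// Minimal illustration of a threaded workload. The operating system schedules
// these threads across whatever processors it sees; on a Hyper-Threading
// system each logical processor appears as a schedulable CPU.
#include <iostream>
#include <thread>
#include <vector>

void worker(unsigned id) {
    // ... per-thread work, e.g. servicing one stream of requests ...
    std::cout << "thread " << id << " running\n";
}

int main() {
    // hardware_concurrency() reports logical processors, e.g. 2 per physical
    // processor when Hyper-Threading is enabled; it may return 0 if unknown.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker, i);
    for (auto& t : pool) t.join();
}
```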
Overview of Hyper-Threading Technology:
Hyper-Threading technology is a form of simultaneous multi-threading technology (SMT), where multiple threads of software applications can be run simultaneously on one processor. This is achieved by duplicating the architectural state on each processor, while sharing one set of processor execution resources. The architectural state tracks the flow of a program or thread, and the execution resources are the units on the processor that do the work: add, multiply, load, etc.
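A conceptual sketch of this partitioning is shown below. It only illustrates the idea of duplicated architectural state versus shared execution resources; it is not a description of Intel's actual implementation, and all names are illustrative.

```cpp
// Conceptual model of Hyper-Threading resource partitioning: architectural
// state is duplicated per logical processor, while one set of execution
// resources is shared between them. Illustration only.
#include <array>
#include <cstdint>

struct ArchitecturalState {        // duplicated: one copy per logical processor
    std::array<uint64_t, 16> general_registers;
    uint64_t instruction_pointer;
    uint64_t flags;
    // ... control registers, interrupt controller state, etc.
};

struct ExecutionResources {        // shared by both logical processors
    // caches, execution units (add, multiply, load/store), branch predictors, ...
};

struct PhysicalProcessor {
    std::array<ArchitecturalState, 2> logical;  // two logical processors
    ExecutionResources shared;                  // one set of execution resources
};
```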
Dual-processing (DP) server applications in the areas of Web serving, search engines, security, streaming media, departmental or small-business databases, and e-mail/file/print can realize benefits from Hyper-Threading technology using Intel® Xeon™ processor-based servers.
History:
Hyper-threading technology found its roots in Digital Equipment Corporation but was brought to market by Intel. Hyper-Threading was first introduced in the Foster MP-based Xeon in 2002. It appeared on the 3.06 GHz Northwood-based Pentium 4 in the same year, and then appeared in every Pentium 4 HT, Pentium 4 Extreme Edition, and Pentium Extreme Edition processor. Intel's processors based on the Core microarchitecture, the generation preceding Nehalem, do not have Hyper-Threading, because the Core microarchitecture is a descendant of the P6 microarchitecture used from the Pentium Pro through the Pentium III, the Celeron (Covington-, Mendocino-, Coppermine-, and Tualatin-based models), and the Pentium II Xeon and Pentium III Xeon.
Intel released the Nehalem (Core i7) in November 2008, in which hyper-threading returned. The first-generation Nehalem contains 4 cores and effectively scales to 8 threads. Since then, both 2- and 6-core models have been released, scaling to 4 and 12 threads respectively.
The Intel Atom is an in-order processor with hyper-threading, aimed at low-power mobile PCs and low-price desktop PCs.
The Itanium 9300 launched with eight threads per processor (2 threads per core) through enhanced hyper-threading technology. Poulson, the next-generation Itanium, is scheduled to have additional hyper-threading enhancements.
The Intel Xeon 5500 server chips also utilize two-way hyper-threading.
Reply
