24-05-2010, 12:22 PM
Introduction
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture.
Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors.
From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors.
From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
The first implementation of Hyper-Threading Technology was done on the Intel Xeon processor MP.
In this implementation there are two logical processors on each physical processor.
The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor.
The potential for Hyper-Threading Technology is tremendous; our current implementation has only just begun to tap into this potential.
Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is gated only by the availability and prevalence of threaded applications and workloads in those markets.
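Because the architecture state is duplicated, the operating system simply sees twice as many processors as there are physical cores. A quick way to observe this from software (illustrative only; `os.cpu_count()` reports whatever the OS exposes, which on a Hyper-Threading system is typically twice the physical core count):

```python
import os

# os.cpu_count() reports *logical* processors: on a Hyper-Threading
# system with N physical cores it typically returns 2 * N.
logical = os.cpu_count()
print(f"Logical processors visible to the OS: {logical}")
```

The OS scheduler treats each of these logical processors as an independent scheduling target, exactly as the text describes.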
Processor Micro-architecture
Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches.
Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining.
Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second.
To exploit ILP, one technique is out-of-order execution, in which a large window of instructions is evaluated simultaneously and sent to execution units based on instruction dependencies rather than program order.
Accesses to DRAM are slow compared to the execution speed of the processor. One technique to reduce this latency is to add fast caches close to the processor.
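The latency-hiding role of a cache can be sketched with a toy direct-mapped cache model. The sizes and cycle counts below are illustrative assumptions, not figures from the text:

```python
# Toy direct-mapped cache: 8 lines, one word per line.
# Hits cost 1 cycle, misses cost 100 cycles (illustrative numbers).
NUM_LINES, HIT_CYCLES, MISS_CYCLES = 8, 1, 100

def access_cost(addresses):
    cache = [None] * NUM_LINES        # tag stored per line
    cycles = 0
    for addr in addresses:
        line = addr % NUM_LINES       # direct mapping: address -> line
        if cache[line] == addr:
            cycles += HIT_CYCLES      # served from the cache
        else:
            cycles += MISS_CYCLES     # fetched from DRAM
            cache[line] = addr        # fill the line
    return cycles

# Re-reading the same few words benefits from the cache...
reuse = access_cost([0, 1, 2, 3] * 4)
# ...while touching 16 distinct words that alias misses every time.
no_reuse = access_cost(list(range(16)))
print(reuse, no_reuse)
```

With reuse, only the first four accesses miss; without it, every access pays the full DRAM penalty, which is exactly why caches close to the processor pay off.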
Conventional Multi-threading
In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced.
One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die.
The two processors each have a full set of execution and architectural resources. The processors may or may not share a large on-chip cache.
CMP is largely orthogonal to conventional multiprocessor systems: multiple CMP chips can themselves be used in a multiprocessor configuration.
Recently announced processors incorporate two processors on each die.
However, a CMP chip is significantly larger than a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die-size and power considerations.
Time-slice Multi-threading
In time-slice multithreading, the processor switches between software threads after a fixed time period. Quite a bit of what a CPU does is illusion.
For instance, modern out-of-order processor architectures don't actually execute code sequentially in the order in which it was written.
An out-of-order execution (OOE) architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions (where possible) so that they make maximum use of processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory.
To the programmer and the user, it looks as if an ordered, sequential stream of instructions went into the CPU and an identically ordered, sequential stream of computational results emerged.
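The reorder-then-retire-in-order behaviour described above can be sketched with a toy scheduler. This is a simplification (real hardware uses a reorder buffer and register renaming), and the instruction names and dependencies are invented for illustration:

```python
def ooe_schedule(instructions, deps):
    """instructions: list of names in program order.
    deps: name -> set of names it depends on.
    Returns (issue_order, retire_order)."""
    done, issue_order = set(), []
    remaining = list(instructions)
    while remaining:
        # Issue every instruction whose dependencies are satisfied,
        # regardless of program order (out-of-order execution).
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        issue_order.extend(ready)
        done.update(ready)
        remaining = [i for i in remaining if i not in done]
    # Results are retired (made architecturally visible) in the
    # original program order, preserving the sequential illusion.
    return issue_order, list(instructions)

prog = ["load", "add", "mul", "store"]
deps = {"add": {"load"}, "store": {"add"}}   # mul is independent
issued, retired = ooe_schedule(prog, deps)
print(issued)   # the independent mul issues early
print(retired)  # retirement preserves program order
```

Here `mul` jumps ahead of `add` at issue time, yet the retire order matches the program, which is exactly the illusion the text describes.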
Concept of Simultaneous Multi-threading
Simultaneous multithreading is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle.
The architecture for simultaneous multithreading achieves three goals:
(1) it minimizes the architectural impact on the conventional superscalar design,
(2) it has minimal performance impact on a single thread executing alone, and
(3) it achieves significant throughput gains when running multiple threads.
Simultaneous multithreading achieves a 2.5-fold throughput improvement over an unmodified superscalar with the same hardware resources.
In the accompanying figure, each square corresponds to an issue slot, with white squares signifying unutilized slots.
Hardware utilization suffers when a program exhibits insufficient parallelism or when available parallelism is not used effectively.
A superscalar processor achieves low utilization because of low ILP in its single thread.
Multiprocessors physically partition hardware to exploit TLP, and therefore performance suffers when TLP is low (e.g., in sequential portions of parallel programs).
In contrast, simultaneous multithreading avoids resource partitioning.
Because it allows multiple threads to compete for all resources in the same cycle, SMT can cope with varying levels of ILP and TLP; consequently utilization is higher, and performance is better.
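The issue-slot picture can be sketched numerically. Assume a 4-wide machine where each thread offers a limited number of ready instructions per cycle; the per-cycle workloads below are invented to mimic a low-ILP thread:

```python
WIDTH = 4  # issue slots per cycle (assumed machine width)

def utilization(threads):
    """threads: list of per-thread lists, each giving how many
    instructions that thread has ready in each cycle.
    Returns the fraction of issue slots filled when all threads
    compete for every slot (the SMT model)."""
    cycles = len(threads[0])
    used = 0
    for c in range(cycles):
        ready = sum(t[c] for t in threads)
        used += min(WIDTH, ready)   # cap at the machine width
    return used / (cycles * WIDTH)

# A single low-ILP thread rarely fills a 4-wide machine...
one_thread = [[1, 2, 0, 3, 1, 2]]
# ...but two such threads together fill far more slots per cycle.
two_threads = [[1, 2, 0, 3, 1, 2], [2, 1, 3, 0, 2, 1]]
print(round(utilization(one_thread), 3))   # about 0.375
print(round(utilization(two_threads), 3))  # 0.75
```

One thread leaves most slots white; letting a second thread compete for the same slots nearly doubles utilization in this toy case, which is the mechanism behind SMT's throughput gain.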
Base Processor Architecture
The base processor is a sophisticated, out-of-order superscalar processor with a dynamic scheduling core similar to the MIPS R10000.
On each cycle, the processor fetches a block of instructions from the instruction cache.
After decoding these instructions, the register-renaming logic maps the logical registers to a pool of physical renaming registers to remove false dependencies.
Instructions are then fed to either the integer or floating-point instruction queues.
When their operands become available, instructions are issued from these queues to the corresponding functional units. Instructions are retired in order.
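The register-renaming step can be sketched as a simple mapping from logical to physical registers. This is a simplification of the real rename logic (no free-list recycling on retire), and the register names are invented:

```python
def rename(instructions, num_physical=64):
    """instructions: list of (dest, src1, src2) logical register names.
    Returns the renamed instructions and the final mapping."""
    free = [f"p{i}" for i in range(num_physical)]  # free physical regs
    mapping = {}                                    # logical -> physical
    renamed = []
    for dest, *srcs in instructions:
        # Sources read the current mapping: true dependencies survive.
        phys_srcs = [mapping.get(s, s) for s in srcs]
        # Every write gets a fresh physical register, so later writes
        # to the same logical register no longer conflict (removes
        # false WAW/WAR dependencies).
        p = free.pop(0)
        mapping[dest] = p
        renamed.append((p, *phys_srcs))
    return renamed, mapping

# r1 is written twice; renaming gives each write its own register.
code = [("r1", "r2", "r3"), ("r4", "r1", "r5"), ("r1", "r6", "r7")]
out, final = rename(code)
print(out)
```

After renaming, the third instruction's write to `r1` lands in a different physical register than the first, so the two can execute out of order without clobbering each other.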
Power Consumption of a Multithreaded Architecture
This section examines the power and energy characteristics of multithreaded execution: performance (IPC), energy efficiency (energy per useful instruction executed, E/UI), and power (the average power consumed during each simulation run), for both single-thread and multithreaded execution.
All results (including IPC) are normalized to a baseline (in each case, the lowest single-thread value).
This is done for two reasons
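The normalization described above is mechanical; a minimal sketch, with invented sample measurements, looks like this:

```python
def normalize(values):
    """Normalize a list of measurements to its smallest value,
    so the baseline run reads 1.0 and all others are relative
    to it (here the smallest value stands in for the lowest
    single-thread result)."""
    base = min(values)
    return [v / base for v in values]

# Hypothetical IPC measurements: one single-thread run, two SMT runs.
ipc = [1.2, 1.8, 2.4]
print(normalize(ipc))   # the baseline run maps to 1.0
```

The same routine applies unchanged to the E/UI and average-power series, which keeps all three metrics on a comparable relative scale.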