seminar presentation · 12-05-2010, 09:19 PM

Mobile Pentium 4 Architecture Supporting Hyper-Threading Technology

Presented By:
Hakan Burak Duygulu

Introduction
The Intel Pentium 4 Processor Family (2000-2005)
based on the Intel NetBurst® microarchitecture
introduced Streaming SIMD Extensions 2 (SSE2)
introduced Streaming SIMD Extensions 3 (SSE3)
The Intel® Xeon Processor (2001-2005)
introduced support for Hyper-Threading Technology
The Intel® Pentium® M Processor (2003-2005)
designed for extending battery life
The Intel Pentium Processor Extreme Edition (2005)
64-bit addressing, 1024-Gbyte address space
Intel® NetBurst™ microarchitecture

Design Goals
to execute legacy IA-32 applications and applications based on single-instruction, multiple-data (SIMD) technology at high throughput
to operate at high clock rates and to scale to higher performance and clock rates in the future

Design Advantages
a deeply pipelined design (a 20-stage pipeline) that allows for high clock rates, with different parts of the chip running at different clock rates
a pipeline that optimizes for the common case of frequently executed instructions: the most frequently executed instructions in common circumstances (such as a cache hit) are decoded efficiently and executed with short latencies
techniques to hide stall penalties, among them parallel execution, buffering, and speculation; the microarchitecture executes instructions dynamically and out of order, so the time it takes to execute each individual instruction is not always deterministic

Caches
The Intel NetBurst microarchitecture supports up to three levels of on-chip cache.
The first level cache (nearest to the execution core) contains separate caches for instructions and data. These include the first-level data cache and the trace cache (an advanced first-level instruction cache). All other caches are shared between instructions and data.
All caches use a pseudo-LRU (least recently used) replacement algorithm.
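The slide does not say which pseudo-LRU variant the caches use. As a minimal sketch, assuming a tree-based scheme and a 4-way set (both assumptions made only for illustration), the idea of approximating LRU with a few bits per set looks like this:

```python
class TreePLRU4:
    """Tree-based pseudo-LRU for one 4-way cache set (illustrative only)."""

    def __init__(self):
        # bits[0] picks a half (0 = ways 0-1, 1 = ways 2-3);
        # bits[1] picks within ways 0-1; bits[2] picks within ways 2-3.
        # Each bit points toward the side that should be evicted next.
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1            # next victim comes from ways 2-3
            self.bits[1] = 1 - way      # point at the sibling of `way`
        else:
            self.bits[0] = 0            # next victim comes from ways 0-1
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Follow the bits to the way that would be replaced next."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]
```

Three bits stand in for a full access ordering of four ways, which is why the choice is only "pseudo" LRU: after touching ways 0, 1, 2, 3 and then 0 again, true LRU would evict way 1, while this tree picks way 2.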

The Front End Pipeline
Consists of two parts:
Fetch/decode unit:
a hardware instruction fetcher that automatically prefetches instructions
a hardware mechanism that automatically fetches data and instructions into the unified second-level cache
Execution trace cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst microarchitecture. The TC stores decoded IA-32 instructions (µops).

The Front End Pipeline
Prefetches IA-32 instructions that are likely to be executed
Fetches instructions that have not already been prefetched
Decodes IA-32 instructions into micro-operations
Generates microcode for complex instructions and special-purpose code
Delivers decoded instructions from the execution trace cache
Predicts branches using a highly advanced algorithm

The Front End Pipeline - Branch Prediction
Enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction.
Branch prediction in the Intel NetBurst microarchitecture predicts all near branches (conditional branches, unconditional calls, returns, and indirect branches). It does not predict far transfers (far calls and software interrupts).
Mechanisms have been implemented to aid in predicting branches accurately and to reduce the cost of taken branches:
the ability to dynamically predict the direction and target of branches based on an instruction's linear address, using the branch target buffer (BTB)
if no dynamic prediction is available or if it is invalid, the ability to statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken
the ability to predict return addresses using the 16-entry return address stack
the ability to build a trace of instructions across predicted taken branches to avoid branch penalties.
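The return address stack in the list above can be pictured as a small fixed-size stack that is pushed on calls and popped on returns. The sketch below is a software analogy only; the class name and the choice to drop the oldest entry on overflow are assumptions, not details from the slides.

```python
from collections import deque


class ReturnAddressStack:
    """Toy model of a fixed-size return address predictor."""

    def __init__(self, entries=16):
        # With maxlen set, appending to a full deque drops the oldest entry.
        self.stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """A call pushes the address of the instruction after the call."""
        self.stack.append(return_address)

    def predict_return(self):
        """A return pops the most recent address, or None if empty."""
        return self.stack.pop() if self.stack else None
```

Nested calls deeper than the number of entries lose the oldest return addresses, so the outermost returns of a deep call chain would fall back to other prediction mechanisms.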

The Static Predictor.
Once a branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement, such as loop-closing branches) as taken. Forward branches are predicted not taken.
To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of a forward branch is the code that immediately follows it (the fall-through path).

Branch Target Buffer.
Once branch history is available, the Pentium 4 processor can predict the branch outcome even before the branch instruction is decoded. The processor uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction's linear address. Once the branch is retired, the BTB is updated with the target address.
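The interplay between the static predictor and the BTB can be sketched as a lookup with a fallback. This is a toy model, not the processor's actual design: the dictionary keyed by linear address and the 2-bit saturating counter are assumptions chosen for illustration.

```python
class ToyBranchPredictor:
    """BTB lookup with a static (backward-taken) fallback. Illustrative only."""

    def __init__(self):
        # linear address -> (2-bit saturating counter, last target)
        self.btb = {}

    def predict(self, address, displacement):
        """Return (predicted_taken, source_of_prediction)."""
        entry = self.btb.get(address)
        if entry is None:
            # No history: backward branches taken, forward branches not taken.
            return displacement < 0, "static"
        counter, _target = entry
        return counter >= 2, "btb"

    def retire(self, address, taken, target):
        """Update the history once the branch outcome is known."""
        if address not in self.btb:
            counter = 2 if taken else 1
        else:
            counter = self.btb[address][0]
            counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.btb[address] = (counter, target)
```

A loop-closing branch (negative displacement) is predicted taken even on its first encounter, while a forward branch needs at least one retired taken outcome before the dynamic prediction overrides the static one.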

Out-Of-Order Execution Core
The ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed, other µops may proceed around it. The processor employs several buffers to smooth the flow of µops.
The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle. Most pipelines can start executing a new µop every cycle, so several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions can start once every two cycles.

Retirement Unit
The retirement unit receives the results of the executed µops from the out-of-order execution core and processes the results so that the architectural state updates according to the original program order.
When a µop completes and writes its result, it is retired. Up to three µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB). The BTB then purges pre-fetched traces that are no longer needed.
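In-order retirement from a reorder buffer can be illustrated with a queue held in program order. The sketch below assumes a simple dictionary per µop with hypothetical `name` and `done` fields; only the limit of three retirements per cycle comes from the text above.

```python
from collections import deque

RETIRE_WIDTH = 3  # up to three µops retire per cycle


def retire_cycle(rob):
    """Retire completed µops from the head of the ROB, in program order.

    `rob` is a deque of dicts with keys 'name' and 'done'. Retirement stops
    at the first µop that has not completed, even if younger ones have.
    """
    retired = []
    while rob and rob[0]["done"] and len(retired) < RETIRE_WIDTH:
        retired.append(rob.popleft()["name"])
    return retired
```

A µop that finished early still waits behind an older, unfinished µop, which is how out-of-order execution remains invisible to the architectural state.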

Hyper-Threading Technology
Enables software to take advantage of task-level, or thread-level parallelism by providing multiple logical processors within a physical processor package.
The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
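Because each logical processor appears to software as a processor in its own right, an operating system's CPU count reflects logical rather than physical processors. A minimal check, using only the Python standard library:

```python
import os

# os.cpu_count() reports logical CPUs, or None if the count is undetermined.
logical = os.cpu_count()
print(f"Logical processors visible to the OS: {logical}")
```

On a single-core processor with Hyper-Threading Technology enabled, a call like this would be expected to report two processors; it does not by itself distinguish logical processors from physical ones.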

Replicated Resources
Control registers (Architectural Registers AR)
8 general purpose registers (AR)
Machine state registers (AR)
Debug registers (AR)
Instruction pointers (IP)
Register renaming tables (RNT)
Return stack predictor (RSP)

Replicated Resources
ARs are used by the operating system and application code to control program behavior and store data for computations.
IP and RNT are replicated to simultaneously track execution and state changes of the two logical processors.
The RSP is replicated to improve branch prediction of return instructions.

Partitioned Resources
Re-order buffers (ROBs)
Load/store buffers
Various queues, like the scheduling and µop queues

Partitioned Resources
Partitioning provides:
operational fairness
the ability for operations from one logical processor to bypass operations of the other logical processor that may have stalled
For example, a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking the forward progress of the other.
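The effect of partitioning can be sketched with a queue that gives each logical processor a fixed half of the entries. The queue size of eight and the class name are hypothetical; the slides give no sizes.

```python
class PartitionedQueue:
    """Queue whose entries are split evenly between two logical processors."""

    def __init__(self, total_entries=8):
        self.limit = total_entries // 2      # entries per logical processor
        self.entries = {0: [], 1: []}

    def try_enqueue(self, lp, uop):
        """Accept a µop only if logical processor `lp` has room in its half."""
        if len(self.entries[lp]) >= self.limit:
            return False
        self.entries[lp].append(uop)
        return True
```

If logical processor 0 stalls and its half fills up, further enqueues for it are refused, but logical processor 1 still has its own entries available and keeps making progress.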

Shared Resources
Caches: trace cache, L1, L2, L3
Execution units
These resources are fully shared to improve their dynamic utilization.

Front End Pipeline
Access to the execution trace cache is arbitrated between the two logical processors every clock. If a cache line is fetched for one logical processor in one clock cycle, a line is fetched for the other logical processor in the next clock cycle, provided that both logical processors are requesting access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace cache, the other logical processor can use the full bandwidth of the trace cache.
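The alternating access described above amounts to a two-way round-robin arbiter. The function below is a sketch of that policy only; its name and signature are made up for this illustration.

```python
def arbitrate(requests, last_granted):
    """Pick which logical processor accesses the trace cache this clock.

    `requests` is a pair of booleans (lp0_wants, lp1_wants) and
    `last_granted` is the logical processor served in the previous clock.
    Returns 0, 1, or None when nobody is requesting.
    """
    other = 1 - last_granted
    if requests[other]:
        return other            # alternate when the other one is waiting
    if requests[last_granted]:
        return last_granted     # otherwise the same one keeps full bandwidth
    return None
```

When both logical processors request access the grants alternate every clock; when one is stalled, the other is granted on every clock.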

Front End Pipeline
After fetching the instructions and building traces of µops, the µops are placed in a queue. This queue decouples the execution trace cache from the register rename pipeline stage. If both logical processors are active, the queue is partitioned so that both logical processors can make independent forward progress.

Execution Core
The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors.
After execution, instructions are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor uses half the entries.

Retirement
The retirement logic tracks when instructions from the two logical processors are ready to be retired. It retires the instructions in program order for each logical processor by alternating between the two logical processors. If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor.
Once stores have retired, the processor needs to write the store data into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.

Streaming SIMD Extensions 3 (SSE3)
Beginning with the Pentium II and Pentium with Intel MMX technology processor families, four extensions have been introduced into the IA-32 architecture to permit IA-32 processors to perform single-instruction multiple-data (SIMD) operations.
These extensions include the MMX technology, SSE extensions, SSE2 extensions, and SSE3 extensions.

MMX™ Technology
MMX Technology introduced:
64-bit MMX registers
support for SIMD operations on packed byte, word, and doubleword integers
MMX instructions are useful for multimedia and communications software.
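A packed-byte SIMD operation treats one 64-bit register as eight independent 8-bit lanes. The function below emulates in plain Python what a wraparound packed byte add (the MMX `PADDB` instruction) computes; it is a software model for illustration, not how the hardware is programmed.

```python
def paddb(a, b):
    """Add two 64-bit values as eight independent bytes, wrapping on overflow."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # The sum is truncated to 8 bits, so no carry leaks into the next lane.
        result |= ((x + y) & 0xFF) << shift
    return result
```

The key difference from an ordinary 64-bit add is that an overflow in one byte never carries into its neighbor, so eight small additions complete in a single operation.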

Streaming SIMD Extensions
Streaming SIMD extensions introduced:
128-bit XMM registers
data prefetch instructions
non-temporal store instructions and other cacheability and memory ordering instructions
extra 64-bit SIMD integer support
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.

Streaming SIMD Extensions 2
Streaming SIMD extensions 2 add the following:
support for SIMD arithmetic on 64-bit integer operands
instructions for converting between new and existing data types
extended support for data shuffling
extended support for cacheability and memory ordering operations
SSE2 instructions are useful for 3D graphics, video decoding/encoding, and encryption.

Streaming SIMD Extensions 3
Streaming SIMD extensions 3 add the following:
SIMD floating-point instructions for asymmetric and horizontal computation
a special-purpose 128-bit load instruction to avoid cache line splits
instructions to support thread synchronization
SSE3 instructions are useful for scientific, video and multi-threaded applications.
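"Horizontal" and "asymmetric" describe how lanes are combined. The functions below model the lane arithmetic of two SSE3 instructions, `HADDPS` and `ADDSUBPS`, on Python lists of four floats with index 0 as the lowest lane. They are a software model for illustration, not intrinsics.

```python
def haddps(a, b):
    """Horizontal add: sum adjacent pairs within each operand."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def addsubps(a, b):
    """Asymmetric add/subtract: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]
```

Earlier SIMD arithmetic applied the same operation to corresponding lanes of two registers. Horizontal operations combine lanes within one register, which helps with reductions such as dot products, and the alternating add/subtract pattern matches the arithmetic of complex multiplication.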

Enhanced Intel SpeedStep® technology
Enables real-time dynamic switching between multiple voltages and operating frequency points.
The processor features the Auto Halt, Stop Grant, Deep Sleep, and Deeper Sleep low power states.
The processor includes an address bus powerdown capability which removes power from the address and data pins when the FSB is not in use.
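The benefit of switching both voltage and frequency can be estimated with the standard approximation that dynamic power scales with capacitance times voltage squared times frequency. That relation is not stated in the slides, and the operating points below are hypothetical values chosen for illustration, not figures from any datasheet.

```python
def relative_dynamic_power(voltage, frequency, ref_voltage, ref_frequency):
    """Dynamic power relative to a reference point, using P ~ C * V^2 * f."""
    return (voltage / ref_voltage) ** 2 * (frequency / ref_frequency)


# Hypothetical operating points as (volts, GHz); not datasheet values.
HIGH = (1.2, 3.2)
LOW = (0.96, 1.6)

ratio = relative_dynamic_power(*LOW, *HIGH)
print(f"Low point uses about {ratio:.0%} of the high point's dynamic power")
```

Because voltage enters squared, lowering it together with the frequency saves considerably more power than halving the frequency alone would.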

Conclusion
The deeply pipelined (20-stage) NetBurst microarchitecture achieves higher clock rates.
Hyper-Threading Technology improves performance with very little additional die area.
SSE3 adds 13 instructions that accelerate the performance of Streaming SIMD Extensions technology.

Questions

download the ppt
http://cmpe.boun.edu.tr/courses/cmpe511/...uygulu.ppt
