GPU Computing: Full Report
Introduction
Due to physical limitations, the clock speed of CPUs has reached a practical ceiling. However, Moore's Law still holds, which means it remains possible to pack more transistors onto a chip. The recent trend in the microprocessor industry has therefore been to put more cores (processors) into a single chip: parallelism is the future of computing. Future microprocessor development efforts will continue to concentrate on adding cores rather than increasing single-thread performance. One example of this trend is the heterogeneous nine-core Cell Broadband Engine, the main processor in the Sony PlayStation 3, which has also attracted substantial interest from the scientific computing community. Similarly, the highly parallel graphics processing unit (GPU) is rapidly gaining maturity as a powerful engine for computationally demanding applications. The GPU's performance and potential offer a great deal of promise for future computing systems, yet the architecture and programming model of the GPU are markedly different from those of most other commodity single-chip processors.
The GPU is designed for a particular class of applications with the following characteristics:
• Computational requirements are large: Real-time rendering requires billions of pixels per second, and each pixel requires hundreds or more operations. GPUs must deliver an enormous amount of compute performance to satisfy the demand of complex real-time applications.
• Parallelism is substantial: Fortunately, the graphics pipeline is well suited for parallelism.
• Throughput is more important than latency: GPU implementations of the graphics pipeline prioritize throughput over latency. The human visual system operates on millisecond (10^-3 s) time scales, while operations within a modern processor take nanoseconds (10^-9 s). This six-order-of-magnitude gap means that the latency of any individual operation is unimportant (see the sketch following this list).
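To make the throughput orientation concrete, here is a minimal sketch: a standard SAXPY kernel in CUDA (illustrative, not part of the original report). One thread is launched per data element; with millions of threads in flight, the hardware can hide the long latency of any one thread's memory access behind the work of the others, sustaining high aggregate throughput.

    // One thread per element: massive data parallelism, throughput-oriented.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // independent per-element work
    }

    // Launch with enough threads to cover all n elements, e.g.:
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);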
Because of the primitive nature of the tools and techniques, the first generation of applications was notable for simply working at all. As the field matured, the techniques became more sophisticated and the comparisons with non-GPU work more rigorous. We are now entering the third stage of GPU computing: building real applications on which GPUs demonstrate an appreciable advantage.
GPU Architecture
The GPU has always been a processor with ample computational resources. The most important recent trend, however, has been exposing that computation to the programmer. Over the past few years, the GPU has evolved from a fixed-function special-purpose processor into a full-fledged parallel programmable processor with additional fixed-function special-purpose functionality. More than ever, the programmable aspects of the processor have taken centre stage.
We begin by chronicling this evolution, starting from the structure of the graphics pipeline and how the GPU has become a general-purpose architecture, then taking a closer look at the architecture of the modern GPU.
A. The Graphics Pipeline
The input to the GPU is a list of geometric primitives, typically triangles, in a 3-D world coordinate system. Through many steps, those primitives are shaded and mapped onto the screen, where they are assembled to create a final picture. It is instructive to first explain the specific steps in the canonical pipeline before showing how the pipeline has become programmable.
• Vertex Operations: Vertex operations transform the raw 3-D geometry onto the 2-D plane of the monitor, projecting each vertex into screen space. The vertex pipeline can also eliminate unneeded geometry by detecting parts of the scene that are hidden by other parts and simply discarding them.
[Figure: range-based fogging (left) vs. elevation-based fogging (right)]
• Primitive Assembly: The vertices are assembled into triangles, the fundamental hardware-supported primitive in today’s GPUs.
• Rasterization: Rasterization is the process of determining which screen-space pixel locations are covered by each triangle. Each triangle generates a "fragment" at each screen-space pixel location that it covers. Because many triangles may overlap at any pixel location, each pixel's color value may be computed from several fragments.
• Fragment Operations: Using color information from the vertices, and possibly fetching additional data from global memory in the form of textures (images that are mapped onto surfaces), each fragment is shaded to determine its final color. Just as in the vertex stage, each fragment can be computed in parallel. This stage is typically the most computationally demanding stage in the graphics pipeline (see the sketch after this list).
• Composition: Fragments are assembled into a final image with one color per pixel, usually by keeping the closest fragment to the camera for each pixel location.
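A hypothetical sketch of the fragment stage, written as a CUDA kernel (the Fragment struct and shadeFragments name are illustrative; on a real GPU this stage runs in dedicated shader hardware, not as a user-launched kernel). Each thread shades one fragment independently, modulating its interpolated vertex color by a texture fetch:

    struct Fragment { float r, g, b; float u, v; };  // interpolated color + texture coordinates

    __global__ void shadeFragments(const Fragment *frags, float4 *out, int n,
                                   cudaTextureObject_t tex)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Fetch from global memory through a texture (an image mapped onto a surface)
        float4 texel = tex2D<float4>(tex, frags[i].u, frags[i].v);
        // Modulate the interpolated vertex color by the texture color
        out[i] = make_float4(frags[i].r * texel.x,
                             frags[i].g * texel.y,
                             frags[i].b * texel.z, 1.0f);
    }

Because no fragment depends on any other, the hardware is free to shade thousands of them concurrently, which is why this stage maps so well to parallel hardware.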
[Figure: The logical graphics pipeline. The programmable blocks are shown in blue.]
Historically, the operations available at the vertex and fragment stages were configurable but not programmable. For instance, one of the key computations at the vertex stage is computing the color at each vertex as a function of the vertex properties and the lights in the scene. In the fixed-function pipeline, the programmer could control the position and color of the vertex and the lights, but not the lighting model that determined their interaction.
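For concreteness, the kind of per-vertex lighting the fixed-function pipeline hard-wired, and that vertex programs later let developers replace, resembles the diffuse (Lambertian) computation sketched below in CUDA. The function and parameter names are illustrative, not the actual fixed-function specification:

    // Diffuse per-vertex lighting: brightness falls off with the angle
    // between the surface normal and the light direction.
    __device__ float3 shadeVertex(float3 normal, float3 lightDir,
                                  float3 lightColor, float3 materialColor)
    {
        float ndotl = fmaxf(0.0f, normal.x * lightDir.x +
                                  normal.y * lightDir.y +
                                  normal.z * lightDir.z);
        return make_float3(materialColor.x * lightColor.x * ndotl,
                           materialColor.y * lightColor.y * ndotl,
                           materialColor.z * lightColor.z * ndotl);
    }

In the fixed-function pipeline only the inputs (positions, colors, lights) were under the programmer's control; programmable vertex stages put the body of this function under the programmer's control as well.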
B. Evolution of GPU Architecture
The fixed-function pipeline lacked the generality to efficiently express more complicated shading and lighting operations that are essential for complex effects. The key step was replacing the fixed-function per-vertex and per-fragment operations with user-specified programs run on each vertex and fragment. Over the past six years, these vertex programs and fragment programs have become increasingly capable, with larger limits on their size and resource consumption, with more fully featured instruction sets, and with more flexible control-flow operations. After many years of separate instruction sets for vertex and fragment operations, current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders, which imposes the following requirements (exercised in the sketch after this list):
• The hardware must support shader programs of at least 65 k static instructions and unlimited dynamic instructions.
• The instruction set, for the first time, supports both 32-bit integers and 32-bit floating-point numbers.
• The hardware must allow an arbitrary number of both direct and indirect reads from global memory (texture).
• Finally, dynamic flow control in the form of loops and branches must be supported.
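Since CUDA kernels execute on the same unified processor array as Shader Model 4.0 shaders, a CUDA kernel can stand in for a shader here. The sketch below (illustrative, not from the report) exercises each listed capability: 32-bit integer and floating-point arithmetic, an indirect read from global memory, a data-dependent loop, and a branch:

    __global__ void sm4Features(const int *indices, const float *table,
                                float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int idx = indices[i];                 // 32-bit integer arithmetic
        float acc = 0.0f;                     // 32-bit floating point
        for (int k = 0; k < (idx & 7); ++k)   // dynamic loop: trip count known only at run time
            acc += table[idx + k];            // indirect read (assumes valid offsets into table)
        out[i] = (idx & 1) ? acc : -acc;      // data-dependent branch
    }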
As the shader model has evolved and become more powerful, and GPU applications of all types have increased vertex and fragment program complexity, GPU architectures have increasingly focused on the programmable parts of the graphics pipeline. Indeed, while previous generations of GPUs could best be described as additions of programmability to a fixed-function pipeline, today’s GPUs are better characterized as a programmable engine surrounded by supporting fixed-function units.
C. Architecture of a Modern GPU
We noted that the GPU is built for different application demands than the CPU: large, parallel computation requirements with an emphasis on throughput rather than latency. Consequently, the architecture of the GPU has progressed in a different direction than that of the CPU.
[Figure: Basic unified GPU architecture. The programmable shader stages execute on the array of unified processors, and the logical graphics pipeline dataflow recirculates through the processors.]
Consider a pipeline of tasks, such as we see in most graphics APIs (and many other applications), that must process a large number of input elements. In such a pipeline, the output of each successive task is fed into the input of the next task. The pipeline exposes the task parallelism of the application, as data in multiple pipeline stages can be computed at the same time; within each stage, computing more than one element at the same time is data parallelism. To execute such a pipeline, a CPU would take a single element (or group of elements) and process the first stage in the pipeline, then the next stage, and so on. The CPU divides the pipeline in time, applying all resources in the processor to each stage in turn.
GPUs have historically taken a different approach. The GPU divides the resources of the processor among the different stages, such that the pipeline is divided in space, not time. The part of the processor working on one stage feeds its output directly into a different part that works on the next stage.
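As a loose analogy (a fixed-function GPU dedicates distinct silicon to each stage, all running simultaneously, which software cannot reproduce exactly), the CUDA-flavored sketch below contrasts the two organizations; the stage names and bodies are illustrative:

    struct Item { float v; };

    // Placeholder stage bodies.
    __host__ __device__ void stageA(Item *it) { it->v += 1.0f; }
    __host__ __device__ void stageB(Item *it) { it->v *= 2.0f; }

    // CPU organization: the pipeline is divided in time. The whole processor
    // works on one stage of one element before moving to the next stage.
    void cpuPipeline(Item *items, int n)
    {
        for (int i = 0; i < n; ++i) {
            stageA(&items[i]);
            stageB(&items[i]);
        }
    }

    // GPU organization: the pipeline is divided in space. Each stage occupies
    // its own parallel hardware, modeled here as one kernel per stage that
    // processes every element of that stage concurrently.
    __global__ void stageAKernel(Item *items, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) stageA(&items[i]);
    }
    __global__ void stageBKernel(Item *items, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) stageB(&items[i]);
    }

    // stageAKernel<<<(n + 255) / 256, 256>>>(d_items, n);
    // stageBKernel<<<(n + 255) / 256, 256>>>(d_items, n);

On real fixed-function hardware all stages run at once, with each stage's output streaming directly into the next; the kernel-per-stage version only approximates this, but it shows how every element within a stage is processed in parallel.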
This machine organization was highly successful in fixed-function GPUs for two reasons. First, the hardware in any given stage could exploit data parallelism within that stage, processing multiple elements at the same time. Because many task-parallel stages were running at any time, the GPU could meet the large compute needs of the graphics pipeline. Second, each stage's hardware could be customized with special-purpose hardware for its given task, allowing substantially greater compute and area efficiency than a general-purpose solution. For instance, the rasterization stage, which computes pixel coverage information for each input triangle, is more efficient when implemented in special-purpose hardware. As programmable stages (such as the vertex and fragment programs) replaced fixed-function stages, the special-purpose fixed-function components were simply replaced by programmable components, but the task-parallel organization did not change.
The result was a lengthy, feed-forward GPU pipeline with many stages, each typically accelerated by special purpose parallel hardware. In a CPU, any given operation may take on the order of 20 cycles between entering and leaving the CPU pipeline. On a GPU, a graphics operation may take thousands of cycles from start to finish. The latency of any given operation is long. However, the task and data parallelism across and between stages delivers high throughput.