A Vectorizing Compiler for MultiMedia Extensions seminar report


Seminar Report on
A Vectorizing Compiler for Multimedia Extensions

Submitted by :
Venkateswarlu.N.P.
In partial fulfillment of the requirements for the degree of
Master of Technology (M.Tech)
DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
COCHIN 682 022

ABSTRACT
This seminar report describes an implementation of a vectorizing compiler for Intel's MMX (MultiMedia Extensions). The compiler identifies data parallel sections of the code using scalar expansion and array dependence analysis. To enhance the scope for application of the subword semantics, the compiler performs several code transformations, including strip mining, scalar expansion, grouping and reduction, loop fission and loop distribution. Thereafter, inline assembly instructions corresponding to the data parallel sections are generated. The compiler uses the Stanford University Intermediate Format (SUIF), a public domain compiler tool, for its implementation. The performance of the code generated by this compiler is evaluated on multimedia benchmarks. Initial performance results reveal that the compiler-generated code yields a reasonable performance improvement (speedup of 2 to 6.5) over code generated without the vectorizing transformations and inline assembly.

1). INTRODUCTION :
Multimedia is the integration of visual (video, images, animation), audio (music, speech) and textual information. It is basically information represented in different ways by these different media data-types. Media processing is the decoding, encoding, interpretation and enhancement of digital multimedia information. This information is typically in the form of large volumes of low precision or short data-types, and the large volume of data makes compression a necessary step before storage. Media processing applications have come to dominate the personal computing domain. These applications are characterized by:
* Small native data types
* Large data-set sizes
* Large amount of inherent data parallelism
* Computationally intensive features
* Multiple concurrent media streams
* Large I/O requirements
These applications have traditionally been supported by special-purpose chips, also known as media processors. With the rise in the fraction of such applications, however, it became necessary to enhance their performance preferably without an increase in cost, and hence without the support of special hardware. This high computational demand on short data types for media applications has been effectively addressed by modern processors through the introduction of subword parallelism.

Subword parallelism is a specific instance of data parallelism in which a data word is the data-set. It is exhibited by instructions which act on a set of lower precision data packed into a word, resulting in the parallel processing of the data-set. Most current processors support 64-bit words; although Intel x86 processors support 32-bit words, they have 64-bit floating-point units. The word size of a processor determines the width of the general purpose registers and the data-paths for integer and address calculations. In order to support and exploit subword parallelism, modern processors extend their Instruction Set Architecture (ISA). These extensions are popularly referred to as multimedia extensions, e.g., Intel's MMX (MultiMedia Extension), Sun's VIS (Visual Instruction Set), Hewlett Packard's MAX-2 (Multimedia Acceleration eXtension) and PowerPC's AltiVec. Since the same instruction applies to all data-elements in the word, this is a form of small-scale SIMD (Single Instruction Multiple Data).

An application written in a high-level language would not benefit from these extensions to the ISA unless the compiler generates object code making use of these instructions. Unfortunately, this has not been the case for subword parallelism. The vectorization technique, which has traditionally been used by compilers for vector and SIMD machines, can be applied for this purpose. In simple terms, a vectorizing compiler identifies instructions in a loop whose successive instances can be executed in parallel without affecting the semantics of the program. In the absence of compiler support for subword parallelism, the application programmer is currently forced either to write the application at assembly level, which is both tedious and error prone, or to rely on one of the following alternatives.
a). Enhanced System Libraries:
Selected library routines are hand-coded in assembly to exploit the extended set of instructions.
b). Macro Calls for the Extended Set of Instructions :
The system header files define a set of macros that provide a higher level interface to the extended set of instructions.

In the case of hardware-supported enhanced libraries, the programmer can make use of the system version of some function calls, which exploit subword parallelism within the function. However, this loses certain opportunities that a vectorizing compiler can achieve. For example, inlining a function may improve the parallelism and reduce the function-call overhead. A compiler may be able to exploit this enhanced parallelism, whereas inlining is not possible for hardware-enhanced library functions, since their source code is not available. Using macro calls in a program to exploit subword parallelism requires the user to be aware of the code segments which can be optimized by the multimedia extensions and of the macros provided. Further, the code transformations have to be performed manually. Lastly, programming with the macro calls is as hard as programming with the assembly equivalent.

The above reasons strongly motivate the need for a vectorizing compiler as a general approach for exploiting subword parallelism. Further, supporting different architectures as well as changes in the multimedia instruction set would in this approach require modifications only to the code generation module. This also allows easy portability of the application. Lastly, the compiler support approach makes the process of vectorizing transparent to the user, reduces the errors associated with assembly coding and improves the performance of applications.

This vectorizing compiler is a source-to-source vectorizing C compiler for Intel's MMX. The compiler takes a C source file as input. Various code transformations such as strip mining, scalar expansion and condition distribution are applied. The output is a C source file, with the data parallel sections coded in inline assembly. This allows the rest of the code to be optimized by the native C compiler. The vectorizing compiler uses the Stanford University Intermediate Format (SUIF), a public domain compiler tool, for its implementation. The performance of the code generated by this compiler is evaluated with a number of benchmarks (kernels and multimedia applications).
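To make the notion of subword parallelism introduced above concrete, the following is a minimal plain-C sketch (not from the report) that emulates what a single packed-word instruction such as MMX's PADDW does in hardware: four independent, wrapping 16-bit additions carried out inside one 64-bit word. The function and variable names are illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* Emulate a packed add of four 16-bit lanes held in one 64-bit word.
       Even and odd lanes are added separately so that a carry out of one
       16-bit lane cannot spill into its neighbour (wrap-around semantics). */
    static uint64_t paddw_emulated(uint64_t a, uint64_t b)
    {
        const uint64_t even = 0x0000FFFF0000FFFFULL;   /* lanes 0 and 2 */
        const uint64_t odd  = 0xFFFF0000FFFF0000ULL;   /* lanes 1 and 3 */
        uint64_t lo = ((a & even) + (b & even)) & even;
        uint64_t hi = ((a & odd)  + (b & odd))  & odd;
        return lo | hi;
    }

    int main(void)
    {
        uint64_t a = 0x0001000200030004ULL;            /* lanes hold 1, 2, 3, 4     */
        uint64_t b = 0x0010002000300040ULL;            /* lanes hold 16, 32, 48, 64 */
        printf("%016llx\n", (unsigned long long)paddw_emulated(a, b));
        /* prints 0011002200330044: four independent 16-bit sums */
        return 0;
    }

On a processor with MMX, the whole body of paddw_emulated corresponds to a single PADDW instruction operating on one multimedia register.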
2). Background :
This section provides the background required to understand the vectorization techniques and Intel's MMX.
2.1). Dependency Relations :
The control flow in a program is represented by a Control Flow Graph (CFG), a directed graph where each node is a statement or a sequence of statements, depending on the level of abstraction, and an edge represents a control transfer between a pair of nodes. Control dependence, which can be derived from the control flow graph, restricts the order of execution of statements in a program: if a statement S' may or may not be executed based on the outcome of a statement S, then S' is control dependent on S.

Two statements S and S' are said to be data dependent if there is one access each in S and S' to the same location and at least one of the two accesses is a write. Data dependences are represented by a Data Dependence Graph (DDG), whose nodes are the statements of the program and whose directed edges represent dependences. The arcs of the data dependence graph are classified as forward and backward arcs. An arc, or dependence, from S to S' is said to be lexically forward when S' follows S in the program order, and lexically backward when S follows S' in the program order. As long as control flows along the program sequence, the dependence arcs will be lexically forward, but control transfers against the program sequence, as in the case of a loop, can introduce lexically backward arcs. Consider the example code below:
for( i=1; i<=N; i++ )
    for( j=1; j<=N; j++ )
        S1 : . . . = A[i-1][j];
        S2 : A[i][j] = . . .;
        S3 : . . . = A[i][j] + . .;
    endfor
endfor
In the above example, the dependence from S2 to S1 is lexically backward and the dependence from S2 to S3 is lexically forward.
* DDG and CFG for the example code (the arc from S2 to S3 is lexically forward; the arc from S2 to S1 is lexically backward)
Array elements are typically defined and used by statements in a loop. These statements are usually executed more than once. It therefore becomes necessary to talk about instances of execution of a statement. The instances are identified by an iteration vector.
Iteration vector : a vector of the form (i1,i2,i3,..,ik), where i1,i2,i3,.. are the values of the loop indices enclosing the statement, ordered from outermost to innermost. In the previous example the (normalized) iteration vectors for statement S1 are (1,1),(1,2),..,(1,N),(2,1),(2,2),..,(2,N),..,(N,1),..,(N,N).
Consider the data dependence from S2 to S1 in the example. It can be seen that the dependence
would not have been present in the absence of the enclosing loops. Such dependences are said to be loop-carried. A dependence between two accesses in the same iteration, in contrast, is loop-independent.
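For illustration (this loop is not from the report), the code below contains a loop-carried dependence; the comment notes how the loop-independent case would differ. Array names are arbitrary.

    /* S1 writes a[i]; S2 reads a[i-1], which S1 wrote in the previous
       iteration, so the dependence from S1 to S2 is loop-carried. Had S2
       read a[i] instead, the same dependence would be loop-independent,
       i.e. confined to a single iteration. */
    void dependence_example(int a[], int b[], int n)
    {
        for (int i = 1; i < n; i++) {
            a[i] = b[i] + 1;          /* S1 */
            b[i] = a[i - 1] * 2;      /* S2 */
        }
    }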
2.2). The SUIF Framework :
The compiler research community has a great need for compiler infrastructures on which new technology can be implemented and evaluated. The SUIF (Stanford University Intermediate Format) compiler system is a platform for research on compiler techniques for high-performance machines. SUIF is a research compiler used for experimenting with and developing new compiler algorithms. It fosters code reuse, sharing, and modularity. The compiler is structured as a small kernel plus a toolkit consisting of various compilation analyses and optimizations built using the kernel. The kernel performs three major functions :
* Defines an intermediate representation of programs : The program representation is designed to support both high-level program restructuring transformations as well as low-level analyses and optimizations.
* Supports a set of program manipulation primitives : These routines are used for performing several transformations.
* Structures the interface between different compiler passes : Compilation passes are implemented as separate programs that communicate via files, termed SUIF files. SUIF files always use the same output format, so that passes can be reordered simply by running the programs in a different order. Different passes can also communicate by annotating the program representation.
The SUIF kernel provides an object-oriented implementation of the SUIF intermediate format. The intermediate format is a mixed-level program representation. Besides low-level constructs such as SUIF instructions, this representation includes three high-level constructs: loops, conditional statements, and array access operations.
PASSES :
* scc is the driver for the SUIF ANSI C compiler.
* porky makes various transformations to the SUIF code. The purpose of the transformations is either to allow subsequent passes to make simplifying assumptions, such as the assumption that there are no branches in a loop body, or to rearrange the code to make it easier for subsequent passes to extract information, without getting rid of any particular construct.
* s2c reads the specified SUIF file and prints out its translation into standard C. This pass is augmented to emit inline assembly code for the data parallel sections.
2.3). Intel MMX :
This section gives an overview of Intel's MMX (MultiMedia Extension) and its different facets, namely the register set, the data types supported and the instruction set.

2.3.1). Multimedia Registers : The MMX register set consists of eight 64-bit registers, aliased onto the registers of the floating-point register stack. MMX instructions access these registers directly using the register names MM0 through MM7. While operating in MMX mode, the aliasing mechanism ensures that accessing these registers as floating-point values results in NaNs (Not a Number).
2.3.2). Multimedia data types : Intel has introduced the following new 64-bit quantities:
* Packed Bytes : Eight bytes packed into 64 bits.
* Packed Words : Four 16-bit words packed into 64 bits.
* Packed Double-Words : Two 32-bit double-words packed into 64 bits.
* Quad Word : One 64-bit quantity.
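As an illustration (not from the report), the four packed types can be viewed as alternative layouts of the same 64-bit quantity. The union below is a plain-C sketch of that layout; the type name is illustrative, and the lane ordering comment assumes a little-endian x86 machine.

    #include <stdint.h>

    /* One 64-bit MMX register image viewed as each of the packed types.
       On little-endian x86, element 0 of each array is the least
       significant lane of the register. */
    typedef union {
        uint8_t  b[8];   /* Packed Bytes        : eight 8-bit elements */
        uint16_t w[4];   /* Packed Words        : four 16-bit elements */
        uint32_t d[2];   /* Packed Double-Words : two 32-bit elements  */
        uint64_t q;      /* Quad Word           : one 64-bit element   */
    } mmx_value_t;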
2.3.3). MMX Instruction set : The MMX instruction set can be classified as follows.
Data Transfer Instructions : The MOVD and MOVQ instructions move packed data (32-bit and 64-bit data respectively) between MMX registers and memory, or between MMX registers themselves. The 32-bit data transfer instructions always move the low-order 32 bits of the MMX register. The register-to-register version of the MOVD instruction moves data between MMX and integer registers.
Arithmetic Instructions : These include instructions to perform add, subtract and multiply on the packed operand types.
Comparison Instructions : These instructions independently compare all the respective data elements of two packed data types in parallel. They generate a mask of 1s and 0s depending on whether the condition is true or false. These masks can then be used by logical instructions to select elements.
Logical Instructions : These perform bitwise logical operations on quadword registers.
Shift Instructions : MMX implements two versions of the logical left, logical right and arithmetic right shift operations.
Conversion Instructions : These convert (pack and unpack) data elements in packed registers.
The execution of the multimedia instructions exploits data parallelism on the subwords in the multimedia registers. This is referred to as subword parallelism or subword semantics.
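For concreteness, the sketch below (not taken from the report) combines the data transfer and arithmetic instructions described above in GCC-style inline assembly to add four 16-bit words per iteration. The function name, array names and the choice of GCC asm syntax are assumptions; the report's compiler emits inline assembly of this general kind for the data parallel sections it identifies.

    #include <stdint.h>

    /* Add n 16-bit elements of a and b into c, four at a time, using MMX.
       Remainder iterations (n not a multiple of 4) are omitted for brevity. */
    void add_packed_words(const int16_t *a, const int16_t *b, int16_t *c, int n)
    {
        for (int i = 0; i + 4 <= n; i += 4) {
            __asm__ volatile (
                "movq  (%0), %%mm0\n\t"   /* load a[i..i+3] into MM0          */
                "paddw (%1), %%mm0\n\t"   /* packed 16-bit add with b[i..i+3] */
                "movq  %%mm0, (%2)\n\t"   /* store the four sums to c[i..i+3] */
                : /* no outputs */
                : "r" (a + i), "r" (b + i), "r" (c + i)
                : "mm0", "memory");
        }
        __asm__ volatile ("emms");        /* leave MMX state before FP code   */
    }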
3). Vectorization for Multimedia Extensions : Vectorization has traditionally been used for vector and SIMD machines, and compilers for personal computers had never found a need for these techniques. The introduction of the subword model has, however, changed the situation and forced a review of vectorization techniques.

3.1). Vectorization for MMX : The compiler is implemented on the SUIF compiler framework and is structured as a set of passes. The application is converted into the SUIF intermediate format and the passes are applied on this intermediate format. An overview of the vectorizing compiler is given below.
* Overall process of the vectorizing compiler : 'C' source -> scc -> porky -> condition distribution, scalar expansion, strip mining, CFG and dependency graph construction, reduction processing, loop distribution -> s2c -> 'C' output with inline assembly for the data parallel sections.
The motivating example for understanding the vectorizing compiler is given below:
for( i=1; i<=N; i++ )
    for( j=1; j<=N; j++ )
        for( k=1; k<=N; k++ )
            C1 : if( A[k-1] == . . )
            S1 :     A[k] = test + . .;
            S2 : test = . . ;
        endfor
        S3 : B[i][j] = B[i-1][j] + test;
    endfor
endfor
* Motivating Example
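Before data parallel statements are mapped to packed instructions, the loops that contain them are strip mined so that each strip matches the packed width of the multimedia registers. The following is a minimal sketch (assumed, not from the report) of the transformation on a simple loop; all names and the strip length are illustrative.

    #define N    1024   /* illustrative array length                        */
    #define VLEN 4      /* four 16-bit words fit in one 64-bit MMX register */

    /* Original loop:   for (j = 0; j < N; j++) B[j] = C[j] + t;
       Strip-mined form: an outer loop over strips of VLEN iterations and an
       inner loop over one strip; the inner loop is the unit that is later
       replaced by packed (subword) instructions. */
    void strip_mined(short B[N], const short C[N], short t)
    {
        for (int js = 0; js < N; js += VLEN) {           /* loop over strips    */
            int upper = (js + VLEN < N) ? js + VLEN : N; /* handle the tail     */
            for (int j = js; j < upper; j++)             /* loop within a strip */
                B[j] = (short)(C[j] + t);
        }
    }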
3.2). Identification of Data Parallel Sections : Which statements can be executed in parallel using the subword semantics? Assume S1 (in conjunction with C1) is executed using subword semantics, i.e., the operands of successive instances of S1 are packed into multimedia registers and operated upon in parallel by multimedia instructions. When these operations are executed in parallel, it can be seen that certain instances of C1 would use the wrong value of A[k-1]. This is because the kth instance of C1 is executed in parallel with the (k-1)th instance of S1, instead of waiting for it to complete and produce the required result, i.e., A[k-1]. Clearly, S2 cannot be executed in parallel using subword semantics, since successive iterations write to the same location test, and hence, when performed in parallel, would leave the memory location in an inconsistent state. This is a case of output dependence between successive instances of statement S2. On the other hand, S3 accesses the same memory location only in successive iterations of i. Hence instances involving successive iterations of j (and the same iteration of i) can be executed in parallel. Thus the aim of this phase is to identify the statements which can be executed in parallel without violating the semantics of the program.
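The output dependence on the scalar test is the kind of obstacle that the scalar expansion pass in the pipeline is intended to remove. Below is a minimal sketch (assumed, not taken from the report) of the transformation on a simpler loop; all names are illustrative, and whether the expanded loop is actually vectorized still depends on the dependence analysis described next.

    #define N 1024   /* illustrative loop length */

    /* Before: every iteration writes the scalar t, so successive instances
       of the first statement are output dependent and cannot be executed
       with subword semantics. */
    void before_expansion(short a[N], const short b[N], const short c[N])
    {
        short t;
        for (int k = 0; k < N; k++) {
            t    = (short)(b[k] * 2);
            a[k] = (short)(t + c[k]);
        }
    }

    /* After scalar expansion: t becomes the compiler-introduced temporary
       array t_x, giving each iteration its own location. */
    void after_expansion(short a[N], const short b[N], const short c[N])
    {
        static short t_x[N];
        for (int k = 0; k < N; k++) {
            t_x[k] = (short)(b[k] * 2);
            a[k]   = (short)(t_x[k] + c[k]);
        }
    }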
The statements of a loop are grouped into strongly connected components (SCCs) of the data dependence graph. Only a singleton SCC that is not self-dependent is a candidate for exploiting subword parallelism: the presence of a self-dependence arc indicates that successive instances cannot be executed in parallel. In identifying SCCs in the dependence graph, and hence vectorizable loops, we must take the domain constraint into account: statements in a conditional body are executed based on the conditional test. When such a statement S is pulled out of the loop, it must be pulled out in conjunction with the test condition. Any lexically backward dependence between the statement and the test is therefore equivalent to a self-dependence. In order to facilitate the identification of such dependence loops, condition distribution is performed. As in the case of loop distribution, control is distributed over the statements of the then-part and the else-part. Condition distribution is also known in the literature as IF-conversion.
Implementation :
The Control Flow Graph is constructed using the Mach-SUIF CFG library support. The module then builds the array data flow component of the data dependence graph. Dependence vectors are determined for each pair of references, since array definitions are ambiguous. Based on the level of the dependence, it can be determined whether the dependence is loop-independent or carried by the innermost enclosing loop, and in that case an arc is added between the pair of references. For each outer for-loop, the module identifies the strongly connected components of the data dependence graph. Single-statement strongly connected components which are not self-dependent are annotated as data parallel; they can therefore be executed using the subword semantics, provided the result type is sufficiently short.
Illustration : Condition distribution is first performed on the code, giving the transformed loop below. In the strongly connected component graph of the innermost loop, X1, S1 and S2 are contained in one strongly connected component. Since there are no singleton SCCs, the statements in the innermost loop are not vectorizable. Considering level 2 and the corresponding dependence graph, one can see that statement S3 forms a singleton SCC (without a self-arc) at level 2. Hence S3 can be executed using subword semantics.
for( i=1; i<=N; i++ )
    for( j=1; j<=N; j++ )
        for( k=1; k<=N; k++ )
            X1 : C1.temp = A[k-1];
            S1 : if( C1.temp == . . ) A[k] = test + . .;
            S2 : test = . . ;
        endfor
        S3 : B[i][j] = B[i-1][j] + test;
    endfor
endfor
* Condition Distribution
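The SCC-based test described in the implementation above can be summarized in code. The sketch below (assumed, not the report's actual SUIF pass) represents the data dependence graph of the innermost loop of the condition-distributed example as an adjacency matrix, computes strongly connected components with Tarjan's algorithm, and marks as vectorizable only singleton SCCs without a self-dependence arc. The graph encoding and all names are illustrative.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_STMTS 16

    static int  n_stmts;                     /* statements in the loop body */
    static bool dep[MAX_STMTS][MAX_STMTS];   /* dep[u][v]: DDG arc u -> v   */

    static int  order[MAX_STMTS], lowlink[MAX_STMTS], counter;
    static int  stack[MAX_STMTS], sp;
    static bool on_stack[MAX_STMTS];
    static int  scc_id[MAX_STMTS], scc_size[MAX_STMTS], n_sccs;

    /* Tarjan's strongly connected components algorithm on the DDG. */
    static void tarjan(int u)
    {
        order[u] = lowlink[u] = counter++;
        stack[sp++] = u;
        on_stack[u] = true;

        for (int v = 0; v < n_stmts; v++) {
            if (!dep[u][v]) continue;
            if (order[v] < 0) {
                tarjan(v);
                if (lowlink[v] < lowlink[u]) lowlink[u] = lowlink[v];
            } else if (on_stack[v] && order[v] < lowlink[u]) {
                lowlink[u] = order[v];
            }
        }

        if (lowlink[u] == order[u]) {        /* u is the root of an SCC */
            int v, size = 0;
            do {
                v = stack[--sp];
                on_stack[v] = false;
                scc_id[v] = n_sccs;
                size++;
            } while (v != u);
            scc_size[n_sccs++] = size;
        }
    }

    /* A statement is a vectorization candidate iff its SCC is a singleton
       and it has no self-dependence arc. */
    static bool vectorizable(int s)
    {
        return scc_size[scc_id[s]] == 1 && !dep[s][s];
    }

    int main(void)
    {
        /* DDG of the innermost loop of the condition-distributed example:
           0 = X1, 1 = S1, 2 = S2; arcs through A[k]/A[k-1] and test. */
        n_stmts = 3;
        dep[0][1] = dep[1][0] = true;
        dep[1][2] = dep[2][1] = true;

        for (int s = 0; s < n_stmts; s++) order[s] = -1;
        for (int s = 0; s < n_stmts; s++)
            if (order[s] < 0) tarjan(s);

        for (int s = 0; s < n_stmts; s++)
            printf("statement %d: %s\n", s,
                   vectorizable(s) ? "vectorizable" : "not vectorizable");
        return 0;
    }

Run on the innermost loop of the example, where X1, S1 and S2 form one SCC, this reports every statement as not vectorizable, matching the discussion above; S3, considered at level 2, would form a singleton SCC and be accepted.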