seminar class · 25-03-2011, 10:30 AM

Presented by:
Deepthi Amuru

Abstract: An online clock skew scheme is proposed in this paper to improve the performance of asynchronous wave-pipelined circuits. In the conventional pipelining technique, the operating frequency is increased by dividing the combinational logic into a number of stages and introducing registers between the stages, with all registers fed by a global clock. The wave-pipelining technique maximizes logic utilization without inserting internal registers, so the speed of the circuit improves without the cost of increased area and routing complexity. Previous work on wave-pipelined circuits used complex circuitry to adjust the clock period and clock skew offline. The proposed low-complexity control circuit works online: it generates an enable signal that enables the output latch(es) during the stable period, depending on the clock speed. The proposed technique is evaluated by implementing filters using the Distributed Arithmetic Algorithm (DAA) with three different schemes (non-pipelining, pipelining and wave-pipelining) on a Xilinx Spartan III. The schemes are compared in terms of operating frequency, power dissipation, area (in terms of LEs and registers) and latency (at maximum operating frequency) at different frequencies, and a performance analysis is carried out. The wave-pipelined DA filter is faster than the non-pipelined one by a factor of 1.36. The pipelined filter is faster than the wave-pipelined one by a factor of 1.38, but at the cost of a 115.69% increase in logic utilization. The dynamic power of the 4-tap wave-pipelined DA filter is approximately 8% lower than that of the pipelined circuit and approximately 28% higher than that of the non-pipelined circuit.
Index Terms: Asynchronous, wave-pipelining, clock skew, FIR, DAA
I. INTRODUCTION
Field-programmable gate array (FPGA) based systems are gaining extensive popularity due to the flexibility and complexity they provide. FPGAs have potential for parallelism. FPGAs with complexities as high as 10 million gates in a single integrated circuit (IC) have become a reality. This has enabled FPGA vendors to embed a reduced instruction set computer (RISC) processor in part of the core so that, in a single IC, the advantages of both microprocessors and FPGAs can be combined, leading to the design of a complete system on a single chip (SoC) [1]. FPGAs provide solutions that retain the advantages of both the approach based on DSP processors and the approach based on ASICs. Many front-end DSP algorithms, such as FFTs and FIR or IIR filters, previously built with ASICs or programmable DSPs (PDSPs), are now implemented on FPGAs.
Increased performance of digital systems is critically important in applications ranging from general-purpose computing to signal/image processing to telecommunications. In today’s electronics industry, low power and high throughput have emerged as the most important issues in system design. Reducing delay and choosing a proper clocking methodology are very important for maintaining overall system performance. In any digital circuit, the critical path (the path with the longest delay) determines the operating frequency of the system. The operating frequency of digital circuits can be increased by several techniques, such as pipelining, wave-pipelining and asynchronous pipelining.
In the conventional pipelining technique, the operating frequency is increased by dividing the combinational logic into a number of stages and introducing registers between the stages. All registers are fed with a global clock. The registers provide temporary storage for the computed data between successive pipeline stages. This improves the speed of the system, but at the cost of an increased number of registers, area, latency, power and clock routing complexity [2].
The wave-pipelining technique is another approach, which improves the speed of the circuit with less area and lower clock load. In an ordinary pipelined system, there is one “wave” of data between register stages: when a new set of data has been clocked into one set of registers, the values propagate to the next stage of registers before the first set of registers is clocked again. In a wave-pipelined (WP) system, however, multiple coherent “waves” of data propagate between storage elements, as shown in Figure 1.
Many researchers have worked on wave pipelining, which was first introduced by Cotten [4], who named it maximum rate pipelining. He observed that the rate at which logic signals can propagate through a circuit depends not only on the longest path delay but also on the difference between the longest and the shortest path delays. As a result, several computational “waves,” i.e., logic signals related to different clock cycles, can propagate through the logic simultaneously.
The operating speed of a wave-pipelined circuit can be increased through three tasks: adjustment of the clock period, adjustment of the clock skew, and equalization of path delays. A low-complexity control circuit for clock skew generation is proposed in this paper.
The paper is organized as follows. Section II gives an overview of wave-pipelining. Section III reviews related work. Section IV describes the proposed control circuit. Section V describes the DAA algorithm. Section VI gives the implementation results, and Section VII concludes the paper.
II. OVERVIEW OF WAVE-PIPELINING
Figure 2 shows a typical combinational logic circuit along with input and output registers [3]. Figure 3 depicts the data flow graph through the combinational circuit [3]. The skew between the input and output registers is denoted as δ. At the beginning of each clock cycle, data is fed into the combinational logic block through the input register.
A number of paths may exist between the inputs and the output of a logic block. A change in the input causes the output to change after a delay in the range [Dmin, Dmax], where Dmin and Dmax are the delays through the shortest and longest paths respectively. The shaded regions (bounded by Dmin and Dmax) depict the periods in which the logic levels of the logic block vary with time. The non-shaded areas depict the periods in which the logic block is stable. In the conventional system, the output register is clocked in the non-shaded region and the minimum clock period, Tclk, is chosen to be greater than Dmax. In the wave-pipelined system, the clock period is chosen to be (Dmax - Dmin) plus clocking overheads such as setup time and hold time. To ensure correct operation, δ should be adjusted so that the active clock edge occurs in the stable period. The shaded region grows as the logic depth increases, so the operating clock frequency must be reduced as the logic depth increases. Moreover, to maximize the frequency of operation of the wave-pipelined system, the difference (Dmax - Dmin) is minimized by equalizing the path delays.
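As a rough numerical illustration of the two clocking rules described above, the following sketch compares the minimum clock period of a conventional design (bounded by Dmax) with that of a wave-pipelined design (bounded by Dmax - Dmin plus overheads). The function names and the 1 ns overhead are assumptions made for illustration. The delays reuse the Dmax = 12 ns and Dmin = 7 ns example values quoted later in the text, and applying the same overhead to the conventional case is also an assumption.

```python
def conventional_min_period(d_max, overhead=0.0):
    """Conventional clocking: the period must exceed the longest path delay."""
    return d_max + overhead


def wave_pipelined_min_period(d_max, d_min, overhead=0.0):
    """Wave-pipelined clocking: the period is bounded by the delay spread."""
    if d_min > d_max:
        raise ValueError("d_min must not exceed d_max")
    return (d_max - d_min) + overhead


# Example values in nanoseconds; the 1 ns overhead is a hypothetical figure.
D_MAX, D_MIN, OVERHEAD = 12.0, 7.0, 1.0

t_conv = conventional_min_period(D_MAX, OVERHEAD)
t_wave = wave_pipelined_min_period(D_MAX, D_MIN, OVERHEAD)

print(f"conventional   : Tclk >= {t_conv} ns")
print(f"wave-pipelined : Tclk >= {t_wave} ns")
```

Under these assumed numbers the wave-pipelined bound is the smaller of the two, which is why equalizing path delays (shrinking Dmax - Dmin) raises the achievable clock rate.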
Hence, adjustment of the clock period, clock skew (δ), and equalization of path delays are the three tasks required for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as the FPGA editor from Xilinx or Floor planner from Altera may be used for this purpose.
III. REVIEW OF RELATED WORK
The construction of maximum-rate circuits or wave pipelines is centered on the equalization of all path delays. In principle, this technique speeds up a combinational circuit without increasing either the synchronization power (due to the avoidance of intermediate registers) or the spurious activity power (due to the inherent path equalization), or the initial latency (due to the maximum delay of the datapath not being increased by the insertion of intermediate registers). The combination of high-performance integrated circuit (IC) technologies, pipelined architectures, and sophisticated computer-aided design (CAD) tools has converted wave-pipelining from a theoretical oddity into a realistic, although challenging, VLSI design method. Wave-pipelining has been employed for implementing a number of systems on both ASICs and FPGAs [5]. The concept of wave-pipelining has been described in a number of previous works [3] [6].
Adjustment of the clock skew and clock period is also carried out in the conventional pipelining technique; wave-pipelining additionally requires equalization of path delays. In [7] and [1], the clock skew and clock period are adjusted manually. A wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because the actual delays differ from the delays calculated by the layout editor: the layout editor considers only worst-case delays, and the actual delays may be significantly different due to fabrication variations.
This difference becomes important as the logic depth of the circuit increases. Hence, in [1], the design is downloaded to the actual FPGA and its operation is checked using a personal computer (PC) based test system. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading, and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation schemes, as well as a technique for minimizing the difference in path delays, are presented in [8]. In [8], however, the clock period and clock skew are adjusted offline, that is, in test mode only.
IV. PROPOSED ONLINE CLOCK SKEW SCHEME
A. Concept of the proposed online clock skew scheme
The clock speed of wave-pipelined circuits can be increased if the idle time of the non-critical paths can be reduced. The operating frequency of a wave-pipelined circuit depends on the difference between Dmax and Dmin. Adjusting Dmax to increase the operating frequency has been reported in the literature. In [8], the clock period and clock skew are adjusted in testing mode (offline) only, which leaves the circuit exposed to process, voltage and temperature (PVT) variations. In this paper, modeling Dmin is also considered for increasing the operating frequency of wave-pipelined circuits. Loading effects are accommodated with a 10% tolerance, represented by the delays d1 and d2. The proposed control circuit is based on the fact that output data are latched only during the stable period. The control circuit generates an enable (EN) signal which ensures that the output latch is open in the stable region and closed in the unstable region. The stable and unstable regions are bounded by Dmin and Dmax. The enable signal should close the latch at the Dmin instant and open the latch at the Dmax instant.
The basic architecture of the wave-pipelined circuit with the proposed control circuit for the online clock skew scheme is depicted in Fig. 4. It consists of input registers and an output latch. The enable signal ‘en’ generated by the control circuit controls the opening and closing of the output latch(es).
B. Algorithm for Control Logic Design
Steps:
1. Tin = 0
2. Lclose = Tin + Dmin - d2
3. Lopen = Tin + Dmax + d1
4. Tin = Tin + Tclk
5. Goto step 2
Here, Tin is the instant at which input data arrive and Tclk is the clock period. Lopen is the instant at which the output latch is opened and Lclose is the instant at which the output latch is closed. Dmin and Dmax are the minimum and maximum delays through the combinational block. As used in steps 2 and 3, d2 is the tolerance applied to Dmin and d1 is the tolerance applied to Dmax.
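The steps above can be sketched as a small generator that yields the latch close and open instants for successive input waves. This is only an illustrative model: the function name is hypothetical, and taking d1 as 10% of Dmax and d2 as 10% of Dmin is an assumed reading of the 10% tolerance mentioned in the text.

```python
from itertools import islice


def latch_schedule(t_clk, d_min, d_max, d1, d2):
    """Yield (Tin, Lclose, Lopen) for each input wave, following steps 1 to 5."""
    t_in = 0.0                        # step 1
    while True:                       # step 5: go back to step 2
        l_close = t_in + d_min - d2   # step 2
        l_open = t_in + d_max + d1    # step 3
        yield t_in, l_close, l_open
        t_in += t_clk                 # step 4


# Example values in ns; the tolerances are assumed to be 10% of each delay.
T_CLK, D_MIN, D_MAX = 9.0, 7.0, 12.0
D1, D2 = 0.1 * D_MAX, 0.1 * D_MIN

schedule = list(islice(latch_schedule(T_CLK, D_MIN, D_MAX, D1, D2), 3))
for t_in, l_close, l_open in schedule:
    print(f"Tin={t_in:5.1f}  Lclose={l_close:5.1f}  Lopen={l_open:5.1f}")

# The latch is transparent from Lopen of one wave until Lclose of the next.
open_windows = [
    (schedule[i][2], schedule[i + 1][1]) for i in range(len(schedule) - 1)
]
```

With these assumed tolerances each open window has a positive width only while Tclk exceeds (Dmax - Dmin) + d1 + d2, which matches the wave-pipelining clock constraint given in Section II.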
Table 1 explains the operation of the proposed online clock skew scheme in generating the ‘en’ signal for Dmax = 12 ns, Dmin = 7 ns and Tclk = 9 ns, assuming that all nodes are initialized to 0 at 0 ns. The ‘en’ signal waveform is shown in Fig. 5.
C. Control circuit for the proposed online clock skew Scheme
The proposed control circuit for the online clock skew scheme consists of two T flip-flops driven by a common clock ‘clk’, with the T inputs tied high as shown in Fig. 6 so that the flip-flops toggle on every clock. Assuming that the combinational block has a maximum delay of Dmax and a minimum delay of Dmin, the flip-flop outputs are fed to two delay blocks, one representing Dmin and the other representing Dmax. The outputs of the two delay blocks are XORed to generate the enable signal ‘en’, whose pulse width is ideally the difference between Dmax and Dmin. This enable signal is applied to the output latch of the combinational block (Fig. 4) as a control signal to open or close the latch. Initially Q1, Q2, Q3 and Q4 are ‘0’, and the ‘en’ signal is ‘0’. After clk is applied, both T flip-flops toggle and Q1 and Q2 become ‘1’. After a delay of Dmin, Q3 becomes high, making ‘en’ high. The latch is then closed (opaque) and does not allow data to pass through it. After a delay of Dmax, Q4 becomes high, making ‘en’ low. The latch is then transparent and allows data to pass through it. This process repeats for every input, so the latch is open from the Dmax instant of one wave to the Dmin instant of the next, which ensures that the output is latched safely during the stable period.
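The behaviour described above can be checked with a small idealized model. The sketch below is an assumption-laden illustration, not the authors' circuit: it treats both T flip-flops as one toggling signal (they share a clock and have T tied high), ignores clock-to-Q delay and the d1/d2 tolerances, and uses integer nanoseconds with the Dmax = 12, Dmin = 7 and 9 ns clock values quoted for Table 1. The function names are hypothetical.

```python
def t_ff_output(t, t_clk):
    """Q of a T flip-flop (T tied high) that starts at 0 and
    toggles at t = 0, t_clk, 2*t_clk, ..."""
    if t < 0:
        return 0
    edges_seen = t // t_clk + 1
    return edges_seen % 2


def enable(t, t_clk, d_min, d_max):
    """en = Q3 XOR Q4, where Q3 and Q4 are the flip-flop output
    delayed by Dmin and Dmax respectively."""
    q3 = t_ff_output(t - d_min, t_clk)
    q4 = t_ff_output(t - d_max, t_clk)
    return q3 ^ q4


T_CLK, D_MIN, D_MAX = 9, 7, 12  # ns

en_trace = [enable(t, T_CLK, D_MIN, D_MAX) for t in range(30)]

# en = 1 means the latch is closed (opaque); en = 0 means it is transparent.
closed_times = [t for t, en in enumerate(en_trace) if en == 1]
print(closed_times)
```

In this model ‘en’ is high for Dmax - Dmin = 5 ns after each wave's Dmin instant and low otherwise, which agrees with the sequence of events described in the text.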
V. DISTRIBUTED ARITHMETIC ALGORITHM
Distributed Arithmetic (DA) plays an important role in embedding DSP functions in Look-up Table (LUT) based FPGAs and enables FPGAs to achieve performance superior to that of programmable DSPs [8]. DA can be optimized for area efficiency, for speed efficiency or for both. For efficient implementation of DA on FPGAs, a number of techniques, such as Read Only Memory (ROM) decomposition and offset binary coding, have been proposed in the literature [9]. Normally, for the computation of a vector dot product using DA, the content of the DA ROM is stored assuming multiplication using 2’s complement arithmetic with sign extension. The computation of the output of an N-tap Linear Time Invariant (LTI) filter and the computation of the transform of an N×1 vector can both be generalized as the computation of the sum of products given by
$$y(n) = \sum_{k=0}^{N-1} a(n,k)\, x(k) \tag{1}$$
In the case of LTI filters and transform computation, a(n,k) is time invariant and only x(k) varies with time. In view of this, y(n) can be computed by using the look up tables for multiplication. This can be achieved as follows:
The input samples x(k) are assumed to be represented in 2’s complement form using W bits and can be written as
$$x(k) = -x(W-1,k) + \sum_{m=1}^{W-1} x(W-1-m,k)\, 2^{-m} \tag{2}$$
Substituting equation (2) in (1) and interchanging the order of summation w.r.t. m and k, we get
$$y(n) = -S(W-1) + \sum_{m=0}^{W-2} S(m)\, 2^{-(W-1-m)} \tag{3}$$
where
$$S(m) = \sum_{k=0}^{N-1} x(m,k)\, a(n,k) \tag{4}$$
It may be noted that x(m,k), for m = 0, 1, …, W-1, takes the binary values 1 or 0. Hence, S(m) can be computed using a ROM addressed by the bits x(m,0), x(m,1), …, x(m,N-1). Furthermore, the contents of the ROM used to compute S(m) are the same for all values of m.
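To make the look-up idea concrete, here is a minimal sketch of a DA dot product. It is an integer-scaled illustration rather than the paper's implementation: samples are treated as W-bit 2’s complement integers, so the result corresponds to equation (3) multiplied by 2^(W-1). The function names and the example coefficients are hypothetical.

```python
def build_lut(coeffs):
    """ROM contents: entry 'addr' holds the sum of the coefficients
    whose address bit is 1 (one table serves every bit position m)."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(coeffs, samples, width):
    """Dot product of coeffs and samples using only look-ups, shifts and adds."""
    lut = build_lut(coeffs)
    mask = (1 << width) - 1
    acc = 0
    for m in range(width):
        # Address = bit m of every sample: x(m,0), x(m,1), ..., x(m,N-1).
        addr = 0
        for k, x in enumerate(samples):
            addr |= (((x & mask) >> m) & 1) << k
        s_m = lut[addr]
        if m == width - 1:
            acc -= s_m << m   # sign bit carries negative weight
        else:
            acc += s_m << m
    return acc


coeffs = [3, -1, 4, 2]          # hypothetical filter taps a(n,k)
samples = [5, -7, 100, -128]    # 8-bit 2's complement inputs x(k)

result = da_dot(coeffs, samples, width=8)
print(result, sum(c * x for c, x in zip(coeffs, samples)))
```

The loop performs no multiplications; the only per-bit work is one table read and one shifted add or subtract, which is what makes DA attractive on LUT-based FPGAs.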
A. Full parallel DA algorithm
To compute y(n), W ROMs, ROM 0 to ROM (W-1), can be used. ROM 0 to ROM (W-2) hold the same content and correspond to S(0) to S(W-2). ROM (W-1) corresponds to [-S(W-1)], and its content is the 2’s complement of the content of the other ROMs. The MSBs of all the samples are fed as the address to the (W-1)th ROM, the next bits of all the samples to the (W-2)th ROM, and so on down to the LSBs of all the samples, which are fed as the address to the 0th ROM. For W = 8, y(n) can be computed using four stages of adders; y(n) is expressed in terms of S(0) to S(7) in equation (5).
$$y(n) = \left\{\left[-S(7) + S(6)\,2^{-1}\right] + \left[S(5) + S(4)\,2^{-1}\right]2^{-2}\right\} + \left\{\left[S(3) + S(2)\,2^{-1}\right] + \left[S(1) + S(0)\,2^{-1}\right]2^{-2}\right\}2^{-4} \tag{5}$$
Equation (5) requires multiplication of the numbers by $2^{-i}$. If 2’s complement multiplication with sign extension is used, this requires shifting the number to the right i times and replicating the MSB i times. For example, multiplying the number 10100101, represented in 2’s complement form, by $2^{-4}$ results in the number 1111 1010 0101. The full parallel DAA scheme with 2’s complement multiplication with sign extension is shown in Fig. 6. The logic depth, i.e. the number of stages of logic elements required for the DA filter, depends on the number of taps. The numbers of stages required for DA filters with 8, 16 and 32 taps are 4, 5 and 6 respectively.
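The sign-extension example and the grouping in equation (5) can both be checked with a short sketch. It assumes W = 8, uses exact fractions in place of fixed-point hardware, and the function names and test values are hypothetical.

```python
from fractions import Fraction


def shift_right_sign_extend(bits, i):
    """Multiply a 2's complement bit string by 2**-i, keeping all bits:
    replicate the MSB i times (the binary point moves i places left)."""
    return bits[0] * i + bits


def twos_complement_value(bits):
    """Integer value of a 2's complement bit string."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value


def y_flat(s):
    """Equation (3): -S(W-1) plus the weighted sum of S(0)..S(W-2)."""
    w = len(s)
    return -s[w - 1] + sum(Fraction(s[m], 2 ** (w - 1 - m)) for m in range(w - 1))


def y_tree(s):
    """Equation (5): the same sum grouped as an adder tree for W = 8."""
    half, quarter, sixteenth = Fraction(1, 2), Fraction(1, 4), Fraction(1, 16)
    upper = (-s[7] + s[6] * half) + (s[5] + s[4] * half) * quarter
    lower = (s[3] + s[2] * half) + (s[1] + s[0] * half) * quarter
    return upper + lower * sixteenth


extended = shift_right_sign_extend("10100101", 4)
print(extended)

s_values = [11, -3, 7, 0, 25, -14, 6, 9]   # hypothetical S(0)..S(7)
print(y_flat(s_values) == y_tree(s_values))
```

The extended word has the same integer value as the original, so the multiplication by $2^{-4}$ is carried entirely by the implied position of the binary point.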
B. ROM decomposition and Pipelining for DAA
The DA algorithm discussed above can be modified to reduce the size of the ROM required. Fig. 6 shows the ROM decomposition technique for the DA algorithm [9]. It can be verified that an N-tap filter requires Distributed Arithmetic Look-up Tables (DALUTs) with $2^N$ locations.
The exponential growth in the ROM size can be avoided by splitting the N address bits of the ROM into blocks of K address bits each. Only K-input DALUTs are then required, and hence the individual ROM size becomes $2^K$. In total, N/K such DALUTs are required for computing the output corresponding to a particular bit of the input samples. To get the correct output, the outputs of the K-input DALUTs have to be added.
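The decomposition can be illustrated with a short sketch that builds one full table and the equivalent set of K-input tables, then checks that summing the partial look-ups reproduces the full look-up. The function names, the 8-tap coefficient set and the choice K = 4 are assumptions made for illustration.

```python
def build_lut(coeffs):
    """Full DALUT: entry 'addr' is the sum of coefficients selected by addr's bits."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def build_decomposed_luts(coeffs, k):
    """Split the N taps into N/K groups and build one 2**K-entry table per group."""
    if len(coeffs) % k != 0:
        raise ValueError("number of taps must be a multiple of k")
    return [build_lut(coeffs[i:i + k]) for i in range(0, len(coeffs), k)]


def decomposed_lookup(luts, addr, k):
    """Add the outputs of the K-input tables, each addressed by its own bit block."""
    mask = (1 << k) - 1
    return sum(lut[(addr >> (i * k)) & mask] for i, lut in enumerate(luts))


coeffs = [3, -1, 4, 2, -5, 7, 1, -6]   # hypothetical 8-tap filter
K = 4

full = build_lut(coeffs)
parts = build_decomposed_luts(coeffs, K)

full_words = len(full)                         # 2**N locations
split_words = sum(len(lut) for lut in parts)   # (N/K) * 2**K locations
print(full_words, split_words)
```

For these assumed sizes the storage drops from 2^8 locations to two tables of 2^4 locations each, at the cost of one extra adder to combine the partial results.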
