ARCHITECTURAL MODIFICATIONS TO ENHANCE THE FLOATING-POINT PERFORMANCE OF FPGA
Seminar Report
by
ABHIJITH.M.A
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
COLLEGE OF ENGINEERING
THIRUVANANTHAPURAM
2010
ABSTRACT
With the latest technologies, FPGAs have reached the point where they are capable of implementing complex floating-point applications. However, the use of FPGAs for scientific applications that require floating-point operations is still limited, which motivates improvements to the FPGA architecture for floating-point operation. This paper considers three architectural modifications that make floating-point operations more efficient on FPGAs. Before discussing the modifications, the present architecture and the floating-point number system are reviewed. The dominant style of current FPGAs is the island-style FPGA, consisting of a 2-D lattice of CLBs. Three modifications are presented on this basis. The first is an embedded FPU, implemented in an island-style FPGA, that performs a double-precision floating-point multiply-add operation. A first-in first-out (FIFO) buffer is provided in parallel to the multiplier so that the pipelines are balanced. These coarse-grained units provide a dramatic gain in area and clock rate at the cost of dedicating significant silicon resources to hardware that is very domain specific. Floating-point arithmetic also has features that lend themselves to finer-grained approaches: it requires shifters of variable length and direction, referred to as variable length shifters. The first alternative to lookup tables (LUTs) for implementing variable length shifters is a coarse-grained approach: embedding variable length shifters in the FPGA fabric. These give a significant reduction in area with a modest increase in clock rate, and they are smaller and more general than embedded floating-point units. The fine-grained approach is to add a 4:1 multiplexer inside each configurable logic block (CLB), in parallel with each 4-LUT.
This modification provides the smallest overall area improvement but a significant improvement in clock rate, with only a trivial increase in the size of the CLB.
1. INTRODUCTION
Field-Programmable Gate Arrays (FPGAs) are increasingly used as a means to accelerate scientific applications, a domain that has long been the exclusive territory of microprocessors. The performance of microprocessors, however, is limited by their lack of customizability. In contrast, application-specific integrated circuits (ASICs) can be highly efficient at floating-point computations, but they lack the programmability needed for typical scientific computing environments. It is now possible to implement a variety of scientific algorithms on FPGAs, thanks to increases in FPGA density and optimizations of floating-point elements for FPGAs. In spite of this, the floating-point performance of FPGAs must increase dramatically to offer a compelling advantage for this domain.
There are still significant opportunities to improve the floating-point performance of FPGAs by optimizing the device architecture. Fixed-point operations have long been common on FPGAs, and FPGA architectures have introduced targeted optimizations for them, such as fast carry-chains, cascade chains, and embedded multipliers. Xilinx has created an entire family of FPGAs optimized for the signal processing domain, which uses this type of operation intensively. Floating-point operations are becoming more common, but there have not been the same targeted architectures for floating-point as there are for fixed-point. Potential architectural modifications span a spectrum from the extremely coarse-grained to the extremely fine-grained; this paper explores ideas at three points in that spectrum. At the coarse-grained end, we evaluate the addition of IEEE 754 standard floating-point multiply-add units as an embedded block in the reconfigurable fabric. Many scientific algorithms need IEEE compliance, and most of the algorithms explored can fully leverage floating-point multiply-adds. Since a fused multiply-add can often be smaller than a separate multiplier and adder, the fused multiply-add was chosen as the "coarse-grained" enhancement. These coarse-grained units provide a dramatic gain in area and clock rate at the cost of dedicating significant silicon resources to hardware that is very domain specific. IEEE floating-point also has features that lend themselves to finer-grained approaches. The primary example is that floating-point arithmetic requires variable length and direction shifters: in floating-point addition and subtraction, the mantissa must be shifted before the calculation to align the operands and after the calculation to renormalize the result.
In highly optimized double-precision floating-point cores for FPGAs, the shifter accounts for almost a third of the logic in the adder and a quarter of the logic in the multiplier. Thus, better support for variable length shifters can noticeably improve floating-point performance. Two approaches were taken to optimize the FPGA hardware for variable length shifters. The first, at the fine-grained end, is a minor tweak to the traditional CLB: the addition of a 4:1 multiplexer in parallel with each 4-LUT. This provides a surprisingly large clock rate improvement with a more modest area improvement and virtually no extra silicon area. The second is the addition of an embedded block that provides variable length shifting. This approach sits in the middle of the spectrum: it uses slightly more silicon area than the CLB modification and provides a corresponding increase in area savings, but only a modest improvement in clock rate.
VPR (Versatile Place and Route) is a widely used research tool for FPGA placement and routing. It uses simulated annealing with a timing-based semi-perimeter routing estimate for placement and a timing-driven router. A modified version was used to place and route a set of double-precision floating-point benchmarks. Five benchmarks were used to test the performance: matrix multiply, matrix-vector multiply, vector dot product, fast Fourier transform (FFT), and LU decomposition. Each benchmark was built in five versions: CLB only, embedded multiplier, embedded shifter, multiplexer, and embedded FPU.
2. FLOATING-POINT NUMBERING SYSTEM AND OPERATIONS
The IEEE-754 standard specifies a representation for single and double precision floating-point numbers and is currently the standard used for real numbers on most computing platforms. A floating-point number consists of three parts: a sign bit, a mantissa, and an exponent. The mantissa is stored as a fraction (f), which is combined with an implied leading one to form the mantissa (1.f); the represented value is this mantissa multiplied by the base (two) raised to the exponent. The representations of single and double precision numbers are as follows.
A single-precision floating-point number has a sign bit, an 8-bit exponent, and a 23-bit mantissa; a double-precision floating-point number has a sign bit, an 11-bit exponent, and a 52-bit mantissa.
Figure 1. Single-precision IEEE floating-point number
Figure 2. Double-precision IEEE floating-point number
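As a concrete illustration of the field layout described above, the short Python sketch below (Python is used purely for illustration; it is not part of the original design flow) unpacks a double into its sign, exponent, and fraction fields. The double-precision exponent bias is 1023, analogous to the 127 used for single precision.

```python
import struct

def decode_double(x):
    """Split a Python float (an IEEE-754 double) into its sign,
    exponent, and fraction fields, mirroring the 1/11/52-bit layout."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # biased by 1023
    fraction = bits & ((1 << 52) - 1)     # the stored f of 1.f
    return sign, exponent, fraction

# 1.5 is 1.1 (binary) x 2^0: biased exponent 1023, top fraction bit set.
sign, exp, frac = decode_double(1.5)
```

Running `decode_double(-2.0)`, for instance, shows the sign bit set, a biased exponent of 1024 (true exponent 1), and a zero fraction, since -2.0 is exactly -1.0 × 2¹.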
Floating-point operations include addition, subtraction, multiplication, and division. For the sake of explanation we will walk through floating-point multiplication. The steps involved are:
1. Multiply the mantissas, including the implied 1, as 1.f × 1.f.
2. Add the two biased exponents; since each stored exponent is the true exponent plus 127, subtract 127 from the sum to re-bias the result.
3. Normalise the mantissa of the product back to the standard 1.f form.
4. For normalisation, shift the binary point left or right, adding or subtracting 1 from the exponent for every shift.
5. Set the sign bit of the product to the XOR of the two input sign bits.
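The steps above can be sketched at the field level as follows. This is an illustrative Python model of single-precision multiplication, with no rounding or special-case (zero, infinity, NaN) handling; the function name and the tuple representation of a number are my own, not the paper's.

```python
def fp_mul(a, b, frac_bits=23, bias=127):
    """Multiply two floats given as (sign, biased exponent, fraction)
    tuples, following the steps above. Truncates instead of rounding."""
    sa, ea, fa = a
    sb, eb, fb = b
    # Sign of the product is the XOR of the input signs.
    sign = sa ^ sb
    # Step 1: multiply the mantissas including the implied 1 (1.f * 1.f).
    ma = (1 << frac_bits) | fa
    mb = (1 << frac_bits) | fb
    prod = ma * mb
    # Step 2: add the biased exponents; both carry a bias, so subtract one.
    exp = ea + eb - bias
    # Steps 3-4: renormalise to 1.f. The product of two values in [1,2)
    # lies in [1,4), so at most one right shift is needed.
    if prod >> (2 * frac_bits + 1):
        prod >>= 1
        exp += 1
    frac = (prod >> frac_bits) & ((1 << frac_bits) - 1)  # drop implied 1
    return sign, exp, frac
```

For example, 1.5 × 2.5 = 3.75: the model returns biased exponent 128 (true exponent 1) and a fraction encoding 0.875, i.e., 1.875 × 2¹.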
3. BASIC FPGA ARCHITECTURE
All FPGAs have the following basic architectural components, which together meet the requirements of designs. Put simply, an FPGA is a two-dimensional array of logic realised using look-up tables inside CLBs. The basic architectural components are:
1. CONFIGURABLE LOGIC BLOCKS (CLBS)
2. SLICES
3. INPUT/OUTPUT BLOCKS (IOBS)
4. THE STORAGE ELEMENT
5. DISTRIBUTED RAM
6. BLOCK RAM
7. DIGITAL CLOCK MANAGER (DCM) BLOCKS
8. DEDICATED MULTIPLIERS
3.1. CONFIGURABLE LOGIC BLOCKS (CLBS)
The Configurable Logic Blocks (CLBs) constitute the main logic resource for implementing both synchronous and combinatorial circuits; they perform a wide variety of logical functions as well as store data. Each CLB contains four slices, and each slice contains two Look-Up Tables (LUTs) to implement logic and two dedicated storage elements that can be used as flip-flops or latches. Each LUT can also be used as a 16x1 memory (RAM16) or as a 16-bit shift register (SRL16), and additional multiplexers and carry logic simplify wide logic and arithmetic functions. Most general-purpose logic in a design is automatically mapped to the slice resources in the CLBs. Each CLB is identical, and the Spartan-3E family CLB structure is identical to that of the Spartan-3 family.
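A behavioural sketch may clarify the SRL16 shift-register mode mentioned above (an illustrative Python model; the class name is my own, not a vendor primitive):

```python
class SRL16:
    """Behavioural model of a LUT used as a 16-bit shift register
    (SRL16 mode): data shifts in on each clock, and a 4-bit address
    taps any of the 16 positions, giving a 1- to 16-cycle delay line."""
    def __init__(self):
        self.bits = [0] * 16

    def clock(self, d):
        # On each clock edge the register shifts by one position and
        # takes the new data bit at position 0.
        self.bits = [d] + self.bits[:15]

    def read(self, addr):
        # Asynchronous read of the selected tap (0..15).
        return self.bits[addr]
```

After clocking in a 1 followed by three 0s, `read(3)` returns that 1: the bit is now four clocks old, matching a tap address of 3.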
3.2. SLICES
Each CLB comprises four interconnected slices. These slices are grouped in pairs. Each pair is organized as a column with an independent carry chain. The left pair supports both logic and memory functions and its slices are called SLICEM. The right pair supports logic only and its slices are called SLICEL. Therefore half the LUTs support both logic and memory (including both RAM16 and SRL16 shift registers) while half support logic only, and the two types alternate throughout the array columns. The SLICEL reduces the size of the CLB and lowers the cost of the device, and can also provide a performance advantage over the SLICEM.
3.3. INPUT/OUTPUT BLOCKS (IOBS)
IOBs control the flow of data between the I/O pins and the internal logic of the device. Each IOB provides a programmable, unidirectional or bidirectional interface between a package pin and the FPGA's internal logic, supports 3-state operation, includes Double Data-Rate (DDR) registers, and supports a variety of signal standards, including four high-performance differential standards. The input-only block has a subset of the full IOB capabilities: it has no connections or logic for an output path, so any reference to output functionality below does not apply to the input-only blocks. The number of input-only blocks varies with device size, but is never more than 25% of the total IOB count.
3.4. THE STORAGE ELEMENT
The storage element, which is programmable as either a D-type flip-flop or a level-sensitive transparent latch, provides a means for synchronizing data to a clock signal, among other uses. The storage elements in the top and bottom portions of the slice are called FFY and FFX, respectively. FFY has a fixed multiplexer on the D input selecting either the combinatorial output Y or the bypass signal BY. FFX selects between the combinatorial output X or the bypass signal BX.
3.5. DISTRIBUTED RAM
The LUTs in the SLICEM can be programmed as distributed RAM. This type of memory affords moderate amounts of data buffering anywhere along a data path. One SLICEM LUT stores 16 bits (RAM16). Multiple SLICEM LUTs can be combined in various ways to store larger amounts of data, including 16x4, 32x2, or 64x1 configurations in one CLB. The fifth and sixth address lines required for the 32-deep and 64-deep configurations, respectively, are implemented using the BX and BY inputs, which connect to the write enable logic for writing and the F5MUX and F6MUX for reading. Writing to distributed RAM is always synchronous to the SLICEM clock (WCLK for distributed RAM) and enabled by the SLICEM SR input which functions as the active-High Write Enable (WE). The read operation is asynchronous, and, therefore, during a write, the output initially reflects the old data at the address being written.
3.6. BLOCK RAM
Block RAM provides data storage in the form of 18-Kbit dual-port blocks. Spartan-3E devices incorporate 4 to 36 dedicated block RAMs, organized as dual-port configurable 18-Kbit blocks. Block RAM synchronously stores large amounts of data, while distributed RAM, described previously, is better suited for buffering small amounts of data anywhere along signal paths. Each block RAM is configurable by setting the initial contents, the default value of the output registers, the port aspect ratios, and the write modes. Block RAM can be used in single-port or dual-port modes.
3.7. DIGITAL CLOCK MANAGER (DCM) BLOCKS
Digital Clock Managers (DCMs) provide self-calibrating, fully digital solutions for distributing, delaying, multiplying, dividing, and phase-shifting clock signals, giving flexible, complete control over clock frequency, phase shift, and skew. To accomplish this, the DCM employs a Delay-Locked Loop (DLL), a fully digital control system that uses feedback to maintain clock signal characteristics with a high degree of precision despite normal variations in operating temperature and voltage.
3.8. DEDICATED MULTIPLIERS
Most devices provide 4 to 36 dedicated multiplier blocks. The multipliers are located together with the block RAM in one or two columns, depending on device density. The multiplier blocks primarily perform two's-complement numerical multiplication, but can also serve some less obvious applications, such as simple data storage and barrel shifting. Logic slices can also implement efficient small multipliers and thereby supplement the dedicated multipliers. Each multiplier performs the principal operation P = A × B, where A and B are 18-bit words in two's-complement form and P is the full-precision 36-bit product, also in two's-complement form.
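A minimal sketch of the multiplier's P = A × B behaviour, assuming the operands arrive as raw 18-bit fields (an illustrative Python model; the function name is my own):

```python
def mult18x18(a_bits, b_bits):
    """Interpret two 18-bit two's-complement words and return their
    full-precision 36-bit product, also in two's complement."""
    def to_signed(v, width):
        # Reinterpret an unsigned bit field as a two's-complement value.
        return v - (1 << width) if v & (1 << (width - 1)) else v

    p = to_signed(a_bits & 0x3FFFF, 18) * to_signed(b_bits & 0x3FFFF, 18)
    return p & 0xFFFFFFFFF  # wrap the signed result into 36 bits
```

For example, the all-ones pattern 0x3FFFF represents -1, so multiplying it by 2 yields the 36-bit two's-complement encoding of -2 (0xFFFFFFFFE).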
Figure 3. Internal architecture
The elements described above are organized as shown in the figure. A ring of IOBs surrounds a regular array of CLBs. Each device has two columns of block RAM, each column consisting of several 18-Kbit RAM blocks, and each block RAM is associated with a dedicated multiplier. The DCMs are positioned in the center, with two at the top and two at the bottom of the device. The XC3S100E has only one DCM each at the top and bottom, while the XC3S1200E and XC3S1600E add two DCMs in the middle of the left and right sides. All devices in the family feature a rich network of traces that interconnect the five functional elements, transmitting signals among them. Each functional element has an associated switch matrix that permits multiple connections to the routing.
4. MODIFICATIONS ON ARCHITECTURE
In the previous sections we discussed the requirements of floating-point operations and the basic FPGA architecture. On that basis, the following modifications to the architecture are suggested:
1. EMBEDDED FPU
2. EMBEDDED SHIFTER
3. ADDITION OF 4:1 MULTIPLEXER
4.1. EMBEDDED FPU
A floating-point unit that can perform specific floating-point operations directly would be a highly useful part of the architecture, which leads to the addition of an embedded floating-point unit. The embedded FPU implements a double-precision floating-point multiply-add operation as shown in the figure, and can be configured to implement a double-precision multiply, add, or multiply-add operation. As seen in the figure, inputs and outputs are registered at the attachment point to the reconfigurable fabric, while the individual functional units (adder and multiplier) are pipelined internally (not shown). The mode input selects data paths to configure the unit as an adder, multiplier, or multiply-add. A first-in first-out (FIFO) buffer is provided in parallel to the multiplier so that the pipelines are balanced.
The size of the unit includes both the computation logic and the programmable routing. The amount of programmable routing in each floating-point unit depends on the height of the unit, since it must have vertical channels at the edges to interface to the rest of the device and a horizontal channel to pass data across the unit. Thus, the area of the embedded floating-point unit was increased to accommodate one vertical side of the FPU being filled with connection blocks (assumed to be as large as a CLB). This made the true area of the FPU dependent on the shape chosen. The latency of the FPUs was more difficult to estimate appropriately. The final testing of the architecture shows that the embedded FPU improves the results to a great extent.
Figure 4. Block diagram of the FPU
4.2. EMBEDDED SHIFTER
As discussed earlier, the mantissa has to be shifted for floating-point operations. The mantissa can be shifted either left or right, by any distance up to the full length of the mantissa. This means that up to a 24-bit shift can be required for IEEE single precision and up to a 53-bit shift for IEEE double precision. In hardware, however, shifters tend to be implemented in powers of two, so shifters of length 32 and 64 bits were implemented for single and double precision, respectively. Even though floating-point operations only require a logical shift, the embedded shifter should be versatile enough to be used for a wider variety of applications. The embedded shifter used here has five modes: shift left logical/arithmetic, rotate left, shift right logical, shift right arithmetic, and rotate right. During the shifting that accompanies the normalization of floating-point numbers it is necessary to calculate a sticky bit. The sticky bit is the logical OR of all of the bits that are lost during a right shift, and it is an integral part of the shift operation. Adding the necessary logic to compute the sticky bit increases the size of the shifter by less than 1%, so the sticky bit calculation is included in each shifter. The sticky bit outputs are undefined when a shift other than a logical right shift is performed. The embedded shifter also has optional registers on the inputs and outputs of the data path. There are a total of 83 inputs and 66 outputs: the 83 inputs comprise 16 control bits, 64 data bits, and 3 register control bits (clock, reset, and enable), while the 66 outputs comprise 64 data bits and 2 sticky bits (two independent sticky bit outputs are needed when the shifter is used as two independent 32-bit shifters). Internally, the combinational delay of the shifter was 1.52 ns, which is far from the limiting timing path.
The total area of the shifter logic is 1.27 times the size of the CLB and its associated routing; however, this does not account for the area needed for the larger number of connections of the embedded shifter compared to the CLB, or for its connections to the routing structure.
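The sticky-bit behaviour described above can be sketched as follows (an illustrative Python model of the logical right shift mode only; the real shifter is combinational hardware):

```python
def shift_right_sticky(value, shift, width=64):
    """Logical right shift with a sticky bit: the sticky output is the
    OR of every bit shifted out of the right end of the word."""
    mask = (1 << width) - 1
    value &= mask
    # Collect the bits that will fall off the right end.
    lost = value & ((1 << min(shift, width)) - 1)
    sticky = 1 if lost else 0
    return (value >> shift) & mask, sticky
```

Shifting 0b1010 right by two loses the bits 10, so the sticky bit is 1; shifting 0b1000 right by two loses only zeros, so the sticky bit is 0. During rounding, the sticky bit tells the adder whether any nonzero value was discarded.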
Figure 5. Block diagram of the embedded shifter
4.3. ADDITION OF 4:1 MULTIPLEXER
In the section on FPGA architecture we studied the configurable logic block in detail. We now apply a fine-grained approach to modify the internal structure of the CLB. The figure shows a simplified version of one half of the baseline CLB (the lighter shaded blocks). The baseline CLB has two four-input LUTs (4-LUTs), two flip-flops, and some logic to support carry chains (the AND and XOR gates as well as the vertically oriented data path). This allows each half of the CLB to implement any four-input function, including one bit of an add or subtract, as well as some more eclectic functions (e.g., the addition of either of two constants to an input based on a select bit). In modifying the CLB to better implement variable length shifters, two general principles were observed: minimize the impact on the architecture, and have no impact on general-purpose routing. To accomplish these goals, the only change made to the CLB's architecture was to add a single 4:1 multiplexer in parallel with each 4-LUT, as shown in the figure.
The multiplexer and LUT share the same four data inputs. The select lines for the multiplexer are the BX and BY inputs to the CLB. Since each CLB has two LUTs, each CLB has two 4:1 multiplexers, and since there are only two select lines, both multiplexers must share them. However, for shifters and other large data path elements it is easy to find muxes with shared select inputs. The BX and BY inputs are normally used as the independent inputs for the D flip-flops, but are blocked in the new mux mode; the D flip-flops can still be driven by the LUTs in the CLB and can be used as normal when the mux mode is not in use. It was determined that adding the 4:1 multiplexer increased the delay of the 4-LUT by only 1.83%. A 4:1 multiplexer was also laid out and simulated; its delay was 253 ps, which is less than the 270 ps determined for the 4-LUT. The addition of two 4:1 multiplexers to each CLB increases the size of the CLB by less than 0.5%.
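To show why a 4:1 mux with shared select lines maps well onto variable length shifting, here is a behavioural sketch of a 64-bit variable left shifter built from three stages of 4:1 muxes, where every mux in a stage shares its two select bits exactly as the CLB modification requires. This is an illustrative Python model, not the paper's actual circuit:

```python
def mux4(d0, d1, d2, d3, sel):
    """One 4:1 multiplexer, the primitive added to each CLB half."""
    return (d0, d1, d2, d3)[sel]

def shift_left_64(value, shift):
    """64-bit variable left shift as a radix-4 barrel shifter: stage 0
    shifts by 0/1/2/3, stage 1 by 0/4/8/12, stage 2 by 0/16/32/48."""
    bits = [(value >> i) & 1 for i in range(64)]
    for stage in range(3):
        sel = (shift >> (2 * stage)) & 0x3   # two shared select bits
        step = 4 ** stage
        bits = [mux4(bits[i],
                     bits[i - step] if i >= step else 0,
                     bits[i - 2 * step] if i >= 2 * step else 0,
                     bits[i - 3 * step] if i >= 3 * step else 0,
                     sel)
                for i in range(64)]
    result = 0
    for i, b in enumerate(bits):
        result |= b << i
    return result
```

Each of the three stages needs 64 muxes driven by only two select lines, so the full 64-bit shifter consumes 96 CLBs' worth of muxes while leaving every LUT free for other logic.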
A simplified view of the bottom half of the CLB is given in the figure below, along with the basic structure of the FPGA after the discussed modifications.
Figure 6. Simplified representation of the bottom half of the modified CLB showing the addition
Figure 7. Basic architecture
Figure 8. Embedded shifters added
Figure 9. Embedded FPU replacing the multiplier
5. TESTING METHODOLOGY
5.1. VPR
The VPR (Versatile Place and Route) tool is used to test the feasibility of the modifications. VPR uses simulated annealing with a timing-based semi-perimeter routing estimate for placement and a timing-driven detailed router. In previous versions, VPR supported only three types of circuit elements: input pads, output pads, and CLBs. To test the proposed architectural modifications and to incorporate the necessary architectural elements, VPR was modified to allow the use of embedded block units of parameterizable size. These embedded blocks have parameterizable heights and widths quantized by the size of the CLB. Horizontal routing is allowed to cross the embedded units, but vertical routing exists only at the periphery of the embedded blocks. The regular routing structure of the original VPR was maintained. Additionally, a fast carry-chain was incorporated into the existing CLBs to ensure a reasonable comparison with state-of-the-art devices.
5.2. BENCHMARKS
Five benchmarks were used to test the feasibility of the proposed architectural modifications: matrix multiply, matrix-vector multiply, vector dot product, FFT, and an LU decomposition data path. All of the benchmarks use double-precision floating-point addition and multiplication. The LU decomposition also includes floating-point division, which must be implemented in the reconfigurable fabric for all architectures. Each benchmark was sized to be of comparable, though not identical, complexity, and the devices were created to be large enough to accommodate the largest benchmark without resource sharing. To explore the impact of the proposed modifications, the following five versions of each benchmark were created.
5.3. THE FIVE VERSIONS USED
• CLB Only: All floating-point operations are performed using the CLBs. The only other units in this version are embedded RAMs and input/output (I/O).
• Embedded Multiplier: This version adds 18-bit embedded multipliers to the CLB Only version. Floating-point multiplication uses the CLBs and the embedded multipliers; floating-point addition and division are performed using only the CLBs. This version is similar to the Xilinx Virtex-II Pro family of FPGAs, and thus is representative of what is currently available in commercial FPGAs.
• Embedded Shifter: This version further extends the embedded multiplier version with embedded variable length shifters that can be configured as a single 64-bit variable length shifter or two 32-bit variable length shifters. Floating-point multiplication uses the CLBs, embedded multipliers, and embedded shifters. Floating-point addition and division are performed using the CLBs and Embedded shifters (for normalization shifting in both cases).
• Multiplexer: While the same embedded RAMs, embedded multipliers, and I/O of the embedded multiplier version are used, the CLBs have been slightly modified to include a 4:1 multiplexer in parallel with the LUTs. Floating-point multiplication uses the modified CLBs and the embedded multipliers. Floating-point addition and division are performed using only the modified CLBs.
• Embedded FPU: Besides the CLBs, embedded RAMs, and I/O of the CLB ONLY version, this version includes embedded floating-point units (FPUs). Each FPU performs a double-precision floating-point multiply-add. Other floating-point operations are implemented using the general reconfigurable resources.
The floating-point benchmarks were written in the hardware description language VHDL. At the high level, none of the benchmarks changed from one version to another; instead, the back-end tools were modified to remap specific blocks to the new technology. For example, for the embedded FPU, a special, recognizable macro was inserted in place of the electronic data interchange format (EDIF) instance of the floating-point unit built from CLBs. This macro was identified as the design entered the VPR flow and was replaced with the embedded floating-point unit. Similar techniques were used for the embedded shifter and multiplexer design points. The addition of the carry-chain was necessary to make a reasonable comparison between the different benchmark versions; fast carry-chains were used, and VPR was modified accordingly. Along with the two 4-input function generators, two storage elements, and arithmetic logic gates, each CLB has a fast carry chain affecting two output bits. The carry-out of the CLB exits through the top of the CLB and enters the carry-in of the CLB above, as shown in Fig. 8. Each column of CLBs has one carry chain that starts at the bottom of the column and ends at the top. Since each CLB has logic for two output bits, there are two opportunities in each CLB to get on or off of the carry-chain.
6. PERFORMANCE STUDY
Figure 10. Benchmark results: clock rate
Figure 11. Benchmark results: area
Figure 12. Benchmark results: track count
6.1. EMBEDDED FPU
The embedded FPU had the highest clock rate, smallest area, and lowest track count of all the architectures. Adding embedded FPUs gave an average clock rate increase of 33.4%, an average area reduction of 54.2%, and an average track count reduction of 6.83% from the EMBEDDED MULTIPLIER version to the EMBEDDED FPU version. To determine the penalty of using an FPGA with embedded FPUs for non-floating-point computations, the percentage of the chip used for each component was calculated. For the chosen FPU configuration, the FPUs consumed 17.6% of the chip. This is an enormous amount of "wasted" area for many applications and would clearly be received poorly by that community; however, it generally mirrors the introduction of the PowerPC to the Xilinx architecture.
6.2. EMBEDDED SHIFTERS
Even with a conservative size estimate, adding embedded shifters to modern FPGAs significantly reduced circuit size. Adding embedded shifters increased the average clock rate by 3.3% and reduced the average area by 14.6% from the EMBEDDED MULTIPLIER to the EMBEDDED SHIFTER versions. Even though there was an average increase in the track count of 16.5%, a track count of 58 is well within the number of routing tracks on current FPGAs. Only the floating-point operations were optimized for the embedded shifters—the control and remainder of the data path remained unchanged. If we consider only the floating-point units, the embedded shifters reduced the number of CLBs for each double-precision floating-point addition by 31% while requiring only two embedded shifters. For the double-precision floating-point multiplication the number of CLBs decreased by 22% and required two embedded shifters.
6.3. MODIFIED CLBS WITH ADDITIONAL 4:1 MULTIPLEXERS
The small modification to the CLB architecture showed surprising improvements. Even though only the floating-point cores were optimized with the 4:1 multiplexers, there was an average clock rate increase of 11.6% and an average area reduction of 7.3% from the EMBEDDED MULTIPLIER version to the MULTIPLEXER version. The addition of the multiplexer reduced the size of the double-precision floating-point adder by 17% and the double-precision multiplier by 10%. Even though there was an average increase in the track count of 16.1%, a track count of 58 is well within the number of routing tracks on current FPGAs.
7. CONCLUSION
Three architectural modifications that make floating-point operations more efficient have been demonstrated: adding complete double-precision floating-point multiply-add units, adding embedded shifters, and adding a 4:1 multiplexer in parallel with each LUT. Each provides an area and clock rate benefit over traditional approaches, with different tradeoffs. Three levels of approach were applied: coarse-grained, fine-grained, and an intermediate level between the two. At the most coarse-grained end of the spectrum is a major architectural change, the embedded FPU, which consumes significant chip area but provides a dramatic advantage: an average area reduction of 54.2% compared to an FPGA enhanced with embedded 18-bit x 18-bit multipliers, an average speed improvement of 33.4% over those multipliers, and even an average reduction of 6.8% in the number of routing tracks required. The embedded shifter, the intermediate approach, provided an average area savings of 14.3% and an average clock rate increase of 3.3%. At the finest-grained end of the spectrum, adding a 4:1 multiplexer in the CLBs provided an average area savings of 7.3% while achieving an average speed increase of 11.6%. The surprising fact is that the smallest change to the FPGA architecture amounts to the biggest net "win" relative to its cost.
8. REFERENCES
1. Michael J. Beauchamp, Scott Hauck, Keith D. Underwood, and K. Scott Hemmert, "Architectural Modifications to Enhance the Floating-Point Performance of FPGAs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 2, February 2008, pp. 177-187.
2. K. S. Hemmert and K. D. Underwood, "An analysis of the double-precision floating-point FFT on FPGAs," in Proc. IEEE Symp. FPGA Custom Comput. Mach., 2005, pp. 171-180.
3. K. S. Hemmert and K. D. Underwood, "Open source high performance floating-point modules," in Proc. IEEE Symp. FPGAs Custom Comput. Mach., 2006, pp. 349-350.