02-05-2011, 04:42 PM
Abstract
In this paper we report on the development of an efficient and portable implementation
of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in
place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance
will be obtained for all matrix sizes and shapes, and the additional memory needed for
temporary variables has been minimized. Replacing DGEMM with our routine should provide
a significant performance gain for large matrices while providing the same performance for
small matrices. We measure performance of our code on the IBM RS/6000, CRAY YMP
C90, and CRAY T3D single processor, and offer comparisons to other codes. Our performance
data reconfirms that Strassen's algorithm is practical for realistic size matrices. The
usefulness of our implementation is demonstrated by replacing DGEMM with our routine in
a large application code.
Keywords: matrix multiplication, Strassen's algorithm, Winograd variant, Level 3 BLAS
1 Introduction
The multiplication of two matrices is one of the most basic operations of linear algebra and
scientific computing, and it has provided an important focus in the search for methods to speed
up scientific computation. Its central role is evidenced by its inclusion as a key primitive
operation in portable libraries, such as the Level 3 BLAS [7], where it can then be used as
a building block in the implementation of many other routines, as done in LAPACK [1].
Thus, any speedup in matrix multiplication can improve the performance of a wide variety
of numerical algorithms.
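For reference, DGEMM computes the general update C ← αAB + βC, so a routine offered in its place must honor that full contract, not just a plain product. A minimal numpy sketch of the operation (the name `dgemm_like` is our own illustration, not the BLAS binding itself):

```python
import numpy as np

def dgemm_like(alpha, A, B, beta, C):
    """Mimic the Level 3 BLAS DGEMM update: return alpha*A@B + beta*C."""
    return alpha * (A @ B) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))
out = dgemm_like(2.0, A, B, 0.5, C)
```

Any Strassen-based replacement is substituted behind this same interface, which is why it can be dropped into existing programs without source changes beyond the call target.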
Much of the effort invested in speeding up practical matrix multiplication implementations
has concentrated on the well-known standard algorithm, with improvements seen when
the required inner products are computed in various ways that are better suited to a given
machine architecture. Much less effort has been given to the investigation of alternative
algorithms whose asymptotic complexity is less than the Θ(m³) operations required by
the conventional algorithm to multiply m × m matrices. One such algorithm is Strassen's
algorithm, introduced in 1969 [19], which has complexity Θ(m^lg(7)), where lg(7) ≈ 2.807 and
lg(x) denotes the base 2 logarithm of x.
Strassen's algorithm has long suffered from the erroneous assumptions that it is not
efficient for matrix sizes that are seen in practice and that it is unstable. Both of these
assumptions have been questioned in recent work. By stopping the Strassen recursions early
and performing the bottom-level multiplications using the traditional algorithm, competitive
performance is seen for matrix sizes in the hundreds in Bailey's FORTRAN implementation
on the CRAY 2 [2], Douglas, et al.'s [8] C implementation of the Winograd variant of
Strassen's algorithm on various machines, and IBM's ESSL library routine [16]. In addition,
the stability analyses of Brent [4] and then Higham [11, 12] show that Strassen's algorithm
is stable enough to be studied further and considered seriously in the development of high-
performance codes for matrix multiplication.
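The Winograd variant used by Douglas, et al. keeps the seven half-size products but reschedules the additions so that only 15 matrix additions/subtractions are needed per level instead of Strassen's original 18. A sketch under the same power-of-two assumption as before (this is our illustration of the variant, not any of the cited codes):

```python
import numpy as np

def winograd(A, B, cutoff=32):
    """Winograd variant of Strassen: 7 products, 15 additions per level."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 8 additions to form the operands...
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
    # ...7 recursive products...
    P1 = winograd(A11, B11, cutoff); P2 = winograd(A12, B21, cutoff)
    P3 = winograd(S4, B22, cutoff);  P4 = winograd(A22, T4, cutoff)
    P5 = winograd(S1, T1, cutoff);   P6 = winograd(S2, T2, cutoff)
    P7 = winograd(S3, T3, cutoff)
    # ...and 7 additions to assemble C (15 additions in total).
    U2 = P1 + P6; U3 = U2 + P7
    C = np.empty((n, n))
    C[:h, :h] = P1 + P2
    C[:h, h:] = U2 + P5 + P3
    C[h:, :h] = U3 - P4
    C[h:, h:] = U3 + P5
    return C

rng = np.random.default_rng(2)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
C = winograd(A, B, cutoff=8)
```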
A useful implementation of Strassen's algorithm must first handle matrices of arbitrary
size efficiently. It is well known that Strassen's algorithm can be applied in a straightforward
fashion to square matrices whose order is a power of two, but issues arise for matrices that
are non-square or have odd dimensions. Second, establishing an appropriate cutoff
criterion for stopping the recursions early is crucial to obtaining competitive performance on
matrices of practical size. Finally, excessive amounts of memory should not be required to
store temporary results. Earlier work addressing these issues includes the implementations
discussed above [2, 8, 16].
In this paper we report on our development of a general, efficient, and portable implementation
of Strassen's algorithm that is usable in any program in place of calls to DGEMM,
the Level 3 BLAS matrix multiplication routine. Careful consideration has been given to all
of the issues mentioned above. Our analysis provides an improved cutoff condition for
rectangular matrices, a demonstration of the viability of dynamic peeling, a simple technique for
dealing with odd matrix dimensions that had previously been dismissed [8], and a reduction
in the amount of memory required for temporary variables.
We measure performance of our code on the IBM RS/6000, CRAY YMP C90, and CRAY
T3D single processor and examine the results in several ways. Comparisons with machine-specific
implementations of DGEMM reconfirm that Strassen's algorithm can provide an
improvement over the standard algorithm for matrices of practical size. Timings of our code
using several different cutoff criteria are compared, demonstrating the benefits of our new
technique. Comparisons to the Strassen routine in the ESSL RS/6000 and the CRAY C90
libraries and the implementation of Douglas, et al., show that competitive performance can
be obtained in a portable code that uses the previously untried dynamic peeling method
for odd-sized matrices. This is especially significant since for certain cases our memory
requirements have been reduced by 40 to more than 70 percent relative to these other codes.
The remainder of this paper is organized as follows. Section 2 reviews Strassen's algorithm.
In Section 3 we describe our implementation and address implementation issues
related to cutoff, odd dimensions, and memory usage. Performance of our implementation is
examined in Section 4, where we also report on using our Strassen code for the matrix
multiplications in an eigensolver application. We offer a summary and conclusions in Section 5.
DOWNLOAD FULL REPORT
http://javacalculus.googlecodefiles/Stra...orithm.pdf