 
Summary: General Matrix-Matrix Multiplication Using
SIMD Features of the PIII
Douglas Aberdeen Douglas.Aberdeen@anu.edu.au
Jonathan Baxter Jonathan.Baxter@anu.edu.au
Research School of Information Sciences and Engineering
Australian National University
Abstract. Generalised matrix-matrix multiplication forms the kernel of
many mathematical algorithms. A faster matrix-matrix multiply
immediately benefits these algorithms. In this paper we implement efficient
matrix multiplication for large matrices using the floating point Intel
SIMD (Single Instruction Multiple Data) architecture. A description of
the issues and our solution is presented, paying attention to all levels of
the memory hierarchy. Our results demonstrate an average performance
of 2.09 times faster than the leading public domain matrix-matrix
multiply routines.
1 Introduction
A range of applications such as artificial neural networks benefit from GEMM
(generalised matrix-matrix) multiply routines that run as fast as possible. The
challenge is to use the CPU's peak floating point performance when memory
access is fundamentally slow. The SSE (Streaming SIMD Extensions) instruc
