Summary: Fusion of Loops for Parallelism and Locality
Naraig Manjikian and Tarek S. Abdelrahman
Department of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
Loop fusion improves data locality and reduces synchronization in data-parallel applications.
However, loop fusion is not always legal. Even when it is legal, fusion may introduce loop-carried
dependences that prevent parallelism. In addition, cache conflicts in fused loops can degrade
performance. In this paper, we present new techniques to: (1) allow fusion of
loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and
allow the parallel execution of fused loops with minimal synchronization, and (3) eliminate
cache conflicts in fused loops. We describe algorithms for implementing these techniques
in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and
on a 16-processor Convex SPP-1000 multiprocessor. The results demonstrate performance
improvements for both kernels and complete applications. The results also indicate that
careful evaluation of the profitability of fusion is necessary as more processors are used.
Index Terms -- Locality enhancement, loop fusion, cache conflicts, loop transformations,
data-parallel applications, scalable shared-memory multiprocessors.
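
The legality problem described in the summary can be illustrated with a minimal C sketch. It shows only the problem, not the techniques the paper proposes. All names here (unfused, fused, fusion_preserves_result, N, offset) are hypothetical and chosen for illustration. The first loop writes a[i]; the second reads a[i] and a[i + offset]. With offset = -1, naive fusion still computes the right values but creates a loop-carried dependence, since iteration i reads what iteration i-1 wrote. With offset = +1, naive fusion is illegal, because iteration i reads a[i+1] before it has been written.

```c
#include <assert.h>
#include <string.h>

enum { N = 8 };  /* hypothetical array size, for illustration only */

/* Unfused reference: all of a[] is written before the second loop reads it.
   Each loop, taken on its own, has no loop-carried dependence. */
void unfused(const int b[N], int a[N], int c[N], int offset) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 1;
    for (int i = 1; i < N - 1; i++)
        c[i] = a[i] + a[i + offset];
}

/* Naive fusion: both statements share one loop body.
   offset = -1: a[i-1] was written by the previous iteration, so the result
                is correct, but iterations are no longer independent.
   offset = +1: a[i+1] has not been written yet, so a stale value is read. */
void fused(const int b[N], int a[N], int c[N], int offset) {
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + 1;
        if (i >= 1 && i < N - 1)
            c[i] = a[i] + a[i + offset];
    }
}

/* Returns 1 if naive fusion produces the same c[] as the unfused loops. */
int fusion_preserves_result(int offset) {
    int b[N], a1[N], a2[N], c1[N], c2[N];

    assert(offset == -1 || offset == 1);  /* keeps a[i + offset] in bounds */

    for (int i = 0; i < N; i++)
        b[i] = 10 * i;
    memset(a1, 0, sizeof a1);
    memset(a2, 0, sizeof a2);
    memset(c1, 0, sizeof c1);
    memset(c2, 0, sizeof c2);

    unfused(b, a1, c1, offset);
    fused(b, a2, c2, offset);
    return memcmp(c1, c2, sizeof c1) == 0;
}
```

Calling fusion_preserves_result(-1) compares the two versions for the backward reference, and fusion_preserves_result(1) does the same for the forward reference, which is the fusion-preventing case in this sketch.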