The gap between memory speed and processor speed is increasing with each new generation of processors. As a result, lack of sufficient bandwidth between the processor and various levels of memory has become a principle bottleneck in achieving peak performance for large-scale applications. Because application programs are typically written in a clear style that facilitates code development and maintainability they often fail to exploit data reuse to the greatest extent possible. Although significant performance gains can be achieved by careful hand rewriting of code, more automated approaches are needed for improving both performance, productivity and maintainability.
There are two major obstacles to automatic optimization of large scale applications. First the choice of algorithms and data structures in the original source may severely limit opportunities for optimization. Second compiler-based optimizations must be expressed in terms of algorithms that are both general and correct. However, the techniques for achieving performance at the leading edge of capability computing are currently neither simple nor general enough to be applied universally. For this reason, we are pursuing a hybrid approach of building tools that use compiler technology to assist expert human programmers.
Using the Rice compiler infrastructure we have constructed a source to source loop transformation tool that combines new compilation techniques with well-known loop transformations to improve data reuse on various levels of the memory hierarchy. Our optimization strategy builds upon multi-level fusion of loop nests. We employ both data and computation restructuring strategies such as array contraction, selective unroll-and-jam and guard-free core generation.
Controlled Loop Fusion
- We enable multi-level loop fusion by adjusting the alignment of statements in different loops relative to a common iteration space.
- Loop fusion improves cache reuse by reducing the reuse distance between accesses to the same data or data that share the same cache line.
- Loop fusion also opens up possibilities for applying array contraction.
- However, fusing loops too aggressively may cause the memory footprint for the innermost body to be too large.
- To resolve this issue we have extended our fusion algorithm to enable more fine- grained control over which loop nests should be fused.
- Using a compiler directive, we can instruct our fusion algorithm to selectively fuse any pair of loops in any dimension as long as fusing those loops is legal.
- We implemented unroll-and-jam to exploit temporal reuse carried by outer loops. Applying this transformation to a loop nest brings reuse in the outer loops closer together which improves register reuse.
- In a multilevel fused loop nest, the dependences may be carried by different loops within the fused loop nest. Unroll-and-jamming along a particular dimension does not provide us with sufficient benefits in this case.
- To get the benefits of both cache reuse from fusion and register reuse from unroll-and-jam we have augmented the conventional unroll-and-jam strategy. Within a fused loop nest we apply unroll-and-jam selectively to loop nests on the dimension(s) that give best reuse.
- If it is possible to fuse a set of loop nests that include the definition and all of the uses of values of a temporary array, we typically can reduce the storage footprint by applying array contraction
- Reducing the storage footprint leads to reuse of data locations. This can reduce the memory bandwidth required if the reduced-size array can now fit in some level of cache
Guard-Free Core Generation
- Using a symbolic set manipulation package, we compute the iteration space for a guard-free core loop nest from a fused computation.
- Pieces of iteration spaces are clipped off along each dimension of the fused loop nest to uncover a guard-free rectangular core.
- To facilitate unroll-and-jam, the loop bounds of the core are adjusted so that the number of iterations is divisible by the unroll factor in the dimension to be unrolled.
We applied our transformation tool on the Runge-Kutta kernel of the NCOMMAS weather code developed at NCSA. We ran experiments with different combinations of transformations. This enables us to measure the effectiveness of each individual transformation as well as its interplay with other transformations in the framework.
Our results show that applying multi-level loop fusion significantly reduces cache misses. Applying storage reduction with fusion farther improves the cache performance.
We plan to enhance our transformation tool by integrating other loop transformations such as scalar replacement. Also our preliminary experiments suggest that there is significant interplay among the various transformations that we use in our framework. We plan to explore this relationship farther and develop algorithms based on cost models for orchestrating the transformations.
- Increasing Temporal Locality with Skewing and Recursive Blocking, Robert Fowler, Guohua Jin, John Mellor-Crummey, Proceedings of SC01: High-Performance Computing and Networking, Denver, Colorado, November 2001. PDF | Postscript
- Improving Performance with Integrated Program Transformations, Apan Qasem, Guohua Jin, John Mellor-Crummey, Rice University. Available as a Technical Report #TR03-419, October 2003. PDF | Postscript
- Compiler Support for Better Memory Utilization in Scientific Code, Rob Fowler, Guohua Jin, John Mellor-Crummey, Apan Qasem, LACSI Symposium 2002. Powerpoint