Entry Date:
September 23, 2003

Embedded Intelligent SRAM

Principal Investigator: Arvind

Co-investigator: Srinivas Devadas


Many embedded systems use a simple pipelined RISC processor for computation and an on-chip SRAM for data storage. We present an enhancement called Intelligent SRAM (ISRAM), which consists of a small computation unit with an accumulator placed near the on-chip SRAM. The computation unit can perform operations on two words from the same SRAM row, or on one word from the SRAM and one from the accumulator. The ISRAM enhancement requires only a few additional instructions to control the computation unit. We present a computation partitioning algorithm that, given the data flow graph of a program, assigns each computation to the processor or to the new computation unit. Performance improvement comes from reductions in the number of SRAM accesses, the number of instructions, and the number of pipeline stalls, compared to performing the same operations in the processor. Experimental results on various benchmarks show speedups of up to 1.48 with our enhancement.
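As a sketch, the semantics of the near-memory computation unit can be modeled as follows. This is an illustrative Python model under our own naming, not the actual ISRAM instruction set: the unit reads two words from one SRAM row, or one SRAM word and the accumulator, and leaves the result in the accumulator.

```python
class ISRAM:
    """Illustrative model of an on-chip SRAM with a small near-memory
    compute unit and a single accumulator register (names are ours)."""

    def __init__(self, rows, words_per_row):
        self.mem = [[0] * words_per_row for _ in range(rows)]
        self.acc = 0  # accumulator inside the compute unit

    def op_row(self, row, i, j, fn):
        # Operate on two words from the SAME SRAM row: one row access
        # replaces two processor loads plus an ALU instruction.
        self.acc = fn(self.mem[row][i], self.mem[row][j])

    def op_acc(self, row, i, fn):
        # Operate on one SRAM word and the accumulator.
        self.acc = fn(self.mem[row][i], self.acc)


isram = ISRAM(rows=4, words_per_row=4)
isram.mem[0] = [1, 2, 3, 4]
isram.op_row(0, 0, 1, lambda a, b: a + b)  # acc = 1 + 2 = 3
isram.op_acc(0, 2, lambda a, b: a + b)     # acc = 3 + 3 = 6
```

Because both operands of `op_row` live in the same row, the processor never sees them; only the final accumulator value needs to be transferred back, which is where the savings in SRAM accesses and instructions come from.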

We have developed a compiler algorithm that takes advantage of this additional functionality to improve program performance. Our computation partitioning algorithm takes the data flow graph of a program as input and assigns each computation to the processor or to the new computation unit; it runs before instruction selection, scheduling, and register allocation. The branch-and-bound algorithm uses a cost function that estimates, for any given partitioning, the data transfer cost in terms of the number of instructions. The algorithm also generates SRAM-row constraints on the data layout, which must then be satisfied by a data layout algorithm, and it ensures that the generated set of constraints is feasible. We have evaluated our approach on a set of benchmarks and achieved significant performance improvement with our enhancement. The details of our approach, including the compiler algorithm, can be found in.
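A minimal sketch of the branch-and-bound partitioning idea is shown below. It is a simplification under assumed inputs (the real cost model and the SRAM-row constraint generation are richer): each data-flow node gets an estimated instruction count per unit, and each edge crossing the partition is charged one data-transfer instruction.

```python
def partition(nodes, edges, exec_cost):
    """Assign each data-flow node to 'proc' or 'isram', minimizing
    estimated instruction count: per-node execution cost plus one
    data-transfer instruction per edge crossing the partition."""
    best = {"cost": float("inf"), "assign": None}

    def cost(assign):
        c = sum(exec_cost[n][u] for n, u in assign.items())
        c += sum(1 for u, v in edges
                 if u in assign and v in assign and assign[u] != assign[v])
        return c

    def search(i, assign):
        if cost(assign) >= best["cost"]:
            return  # bound: completing the assignment cannot get cheaper
        if i == len(nodes):
            best["cost"], best["assign"] = cost(assign), dict(assign)
            return
        for unit in ("proc", "isram"):
            assign[nodes[i]] = unit
            search(i + 1, assign)
            del assign[nodes[i]]

    search(0, {})
    return best["assign"], best["cost"]


# Hypothetical per-node costs: 'a' is much cheaper on the ISRAM unit,
# 'c' is much cheaper on the processor.
exec_cost = {"a": {"proc": 3, "isram": 1},
             "b": {"proc": 1, "isram": 1},
             "c": {"proc": 1, "isram": 5}}
assign, total = partition(["a", "b", "c"],
                          [("a", "b"), ("b", "c")], exec_cost)  # total = 4
```

The pruning step is sound because every cost term is non-negative, so the cost of a partial assignment is a lower bound on the cost of any of its completions.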

We plan to implement the compiler algorithm in a compiler framework and to evaluate the approach on a set of benchmarks. We will also explore an enhancement in which a computation unit is placed near a cache. The idea of performing operations on words of an SRAM row extends to direct-mapped or set-associative caches, with operations performed on different words of a cache line. In a set-associative cache, an operation can also be performed between two words of cache lines in different ways of the same set.
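The proposed cache variant can be sketched in the same illustrative style (again, the class and method names are ours, not a proposed design): the compute unit either combines two words within one cache line, or combines words at the same offset in two lines from different ways of the same set.

```python
class ComputeCache:
    """Illustrative model of a set-associative cache with a small
    compute unit and accumulator attached to the data array."""

    def __init__(self, sets, ways, words_per_line):
        self.lines = [[[0] * words_per_line for _ in range(ways)]
                      for _ in range(sets)]
        self.acc = 0

    def op_line(self, s, way, i, j, fn):
        # Two words from the same cache line (this case also covers
        # a direct-mapped cache, where ways == 1).
        line = self.lines[s][way]
        self.acc = fn(line[i], line[j])

    def op_ways(self, s, way_a, way_b, i, fn):
        # Same word offset, two lines from different ways of one set.
        self.acc = fn(self.lines[s][way_a][i], self.lines[s][way_b][i])


cache = ComputeCache(sets=2, ways=2, words_per_line=4)
cache.lines[0][0] = [1, 2, 3, 4]
cache.lines[0][1] = [10, 20, 30, 40]
cache.op_line(0, 0, 1, 2, lambda a, b: a + b)  # acc = 2 + 3 = 5
cache.op_ways(0, 0, 1, 3, lambda a, b: a + b)  # acc = 4 + 40 = 44
```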