cpu architecture - What does a 'split' cache mean, and how is it useful (if it is)?

1 Answer

深蓝 · Answer 1 · 2021-10-16T23:34:25+0000

A summary and additional discussion can be found at: L1 caches usually have split design, but L2, L3 caches have unified design, why?.

Introduction

A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated to holding instructions and the other, called the data cache, is dedicated to holding data (i.e., instruction memory operands). The instruction cache and the data cache are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache, and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
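To make the routing rule concrete, here is a minimal sketch of a toy model. The class names (`DirectMappedCache`, `SplitL1`, `UnifiedL1`), the 64-byte line size, and the direct-mapped organization are all assumptions for illustration and don't describe any particular processor.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


@dataclass
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only tags, not data."""
    num_lines: int
    tags: dict = field(default_factory=dict)  # set index -> stored tag
    hits: int = 0
    misses: int = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, fill the line and return False."""
        line = addr // LINE_SIZE
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags.get(index) == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


class SplitL1:
    """Fetches go only to the I-cache; loads and stores go only to the D-cache."""

    def __init__(self, lines_per_half: int):
        self.icache = DirectMappedCache(lines_per_half)
        self.dcache = DirectMappedCache(lines_per_half)

    def fetch(self, addr: int) -> bool:
        return self.icache.access(addr)

    def load_store(self, addr: int) -> bool:
        return self.dcache.access(addr)


class UnifiedL1:
    """Fetches, loads, and stores all share one structure."""

    def __init__(self, total_lines: int):
        self.cache = DirectMappedCache(total_lines)

    def fetch(self, addr: int) -> bool:
        return self.cache.access(addr)

    def load_store(self, addr: int) -> bool:
        return self.cache.access(addr)
```

In this model, a line brought in by `fetch` on a `SplitL1` is not visible to `load_store` on the same address, since the two halves are looked up independently. On a `UnifiedL1`, the same line serves both kinds of requests.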

The Harvard vs. von Neumann architecture distinction originally applied to main memory. However, most modern computer systems implement the modified Harvard architecture, whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called the von Neumann cache design. The Wikipedia article on the modified Harvard architecture discusses three variants of the architecture, of which one is the split cache design.

To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Gordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal (the IEEE version is a bit clearer). Using a simulator, the authors found that, for almost all cache capacities considered in the study, an equal split results in the best performance (see Figure 5). From the paper:

Typically, the best performance occurs with half of the cache devoted to instructions and half to data.

They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.

As shown in Fig. 6, the performance of the best dedicated cache CUXD (half allotted to instructions and half to data) in general is quite similar to that of a homogeneous cache (CUX); the extra complexity of a dedicated cache control is thus not justifiable.

It's not clear to me actually whether the paper evaluated the split design or a cache that is partitioned between instructions and data. One paragraph says:

Thus far, the cache memory has been assumed to be composed of homogeneous cells. But conceivably a functionally specialized partitioning of the cache could give higher performance. For example, perhaps a cache devoted exactly half to instructions and half to data would be more effective than a homogeneous one; alternatively, one that holds just instructions could be better than one holding just data. To test these hypotheses, the effects of dividing the cache into sections dedicated to specific uses were investigated.

It seems to me that the authors are talking about both the split and partitioned designs. But it's not clear what design was implemented in the simulator and how the simulator was configured for evaluation.

Note that the paper didn't discuss why the split design may have better or worse performance than the unified design. Also note how the authors used the terms "dedicated cache" and "homogeneous cache." The terms "split" and "unified" appeared in later works, and I believe they were first used by Alan Jay Smith in Directions for memory hierarchies and their components: research and development in 1978. But I'm not sure, because the way Smith used these terms gives the impression that they were already well known. It appears to me from Smith's paper that the first processor that used the split cache design was the IBM 801 around 1975 and probably the second processor was the S-1 (around 1976). It's possible that the engineers of these processors came up with the split design idea independently.

Advantages of the Split Cache Design

The split cache design was then extensively studied in the next two decades. See, for example, Section 2.8 of this highly influential paper. But it was quickly recognized that the split design is useful for pipelined processors where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache close to both the instruction fetch unit and the memory unit at the same time, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby reducing the latencies of both. (See what it looks like in the S-1 processor in Figure 3 of this document.) This is the primary advantage of the split design over the unified design. This is also the crucial difference between the split design and a unified design that supports cache partitioning. That's why it makes sense to have a split data cache, as proposed in several research works, such as Cache resident data locality analysis and Partitioned first-level cache design for clustered microarchitectures.

Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a single-ported unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline. Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency. A third way to improve the bandwidth is to add more wires to the same port so that more bits can be accessed in the same cycle, but this would probably be restricted to the same cache line (in contrast to the two other approaches). If the cache is off-chip, then the wires that connect it to the pipeline become pins and the impact of the number of wires on area, power, and latency becomes more significant.
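The bandwidth comparison can be illustrated with a rough counting sketch. The function `cache_cycles`, the design names, and the example trace are hypothetical, and the model assumes that every port serves exactly one access per cycle and that every access hits.

```python
from math import ceil


def cache_cycles(trace, design):
    """Count the cache cycles needed to serve a trace.

    trace:  list of (n_fetch, n_data) pairs, the number of instruction
            fetches and data accesses requested in each pipeline cycle.
    design: "split" (one port on the I-cache, one on the D-cache),
            "unified_1port", or "unified_2port".
    """
    total = 0
    for n_fetch, n_data in trace:
        if design == "split":
            # The two halves work in parallel, but each serves only its own kind.
            needed = max(n_fetch, n_data)
        elif design == "unified_1port":
            # Every access competes for the single port.
            needed = n_fetch + n_data
        elif design == "unified_2port":
            # Either port can serve either kind of access.
            needed = ceil((n_fetch + n_data) / 2)
        else:
            raise ValueError(f"unknown design: {design}")
        total += max(needed, 1)  # an idle cycle still takes one cycle
    return total


# One fetch plus one data access in most cycles, then two data accesses at once.
trace = [(1, 1), (1, 0), (1, 1), (0, 2)]
print(cache_cycles(trace, "split"))          # 5
print(cache_cycles(trace, "unified_1port"))  # 7
print(cache_cycles(trace, "unified_2port"))  # 4
```

The last entry of the trace is the case where the dual-ported unified cache beats the split cache: two data accesses in the same cycle can use both ports, whereas in the split design the idle instruction-cache port can't help.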

In addition, processors that use a unified (L1) cache typically include arbitration logic that prioritizes data accesses over instruction accesses; this logic can be eliminated in the split design. (See the discussion on the Z80000 processor below for a unified design that avoids arbitration.) Similarly, if there is another cache level that implements the unified design, arbitration logic will be needed at the unified L2 cache. Simple arbitration policies may reduce performance and better policies may increase area. [TODO: Add examples of policies.]
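As a rough illustration of what such arbitration does, here is a sketch of a fixed-priority, data-first arbiter in front of a single-ported unified cache. The names `data_first_arbiter` and `simulate` are hypothetical, and the model (one grant per cycle, a losing fetch is retried in the next cycle) is an assumption rather than a description of any real processor.

```python
def data_first_arbiter(fetch_pending, data_pending):
    """Grant the single cache port: data accesses always win over fetches."""
    if data_pending:
        return "D"
    if fetch_pending:
        return "I"
    return None


def simulate(trace):
    """Run the arbiter over a trace of (fetch_request, data_request) booleans.

    A fetch that loses arbitration stays pending and is retried next cycle.
    Returns the per-cycle grants and the number of cycles a fetch was stalled.
    """
    fetch_waiting = False
    stalled = 0
    grants = []
    for fetch_req, data_req in trace:
        fetch_waiting = fetch_waiting or fetch_req
        grant = data_first_arbiter(fetch_waiting, data_req)
        grants.append(grant)
        if grant == "I":
            fetch_waiting = False
        elif fetch_waiting:
            stalled += 1  # the fetch lost to a data access this cycle
    return grants, stalled


trace = [(True, True), (False, True), (False, False), (True, False)]
grants, stalled = simulate(trace)
print(grants)   # ['D', 'D', 'I', 'I']
print(stalled)  # 2
```

With a split cache, the fetch in the first cycle would go to the instruction cache immediately, so neither the arbiter nor the two stall cycles would exist in this model.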

Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D (I know for sure that this applies to the Itanium 2 and later, but I'm not sure about the first Itanium). Moreover, starting with Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU. Intel didn't disclose why they decided to use different replacement policies in these processors. In general, it seems to me that it's uncommon for the L1I and L1D to use different replacement policies. I couldn't find a single research paper on this (all papers on replacement policies focus only on data or unified caches). Even for a unified cache, it may be useful for the replacement policy to distinguish between instruction and data lines. In a split design, a cache line fetched into the data cache can never displace a line in the instruction cache. Similarly, a line filled into the instruction cache can never displace a line in the data cache. This kind of displacement, however, may occur in the unified design.
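The displacement effect is easy to reproduce in a toy model. The sketch below assumes small fully associative LRU caches and a made-up workload (a two-line loop body that streams through four new data lines per iteration); `LRUCache` and `instruction_misses` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (tracks lines only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # least recently used line comes first

    def access(self, line):
        """Return True on a hit; on a miss, evict the LRU line if full and fill."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[line] = True
        return False


def instruction_misses(split, iterations=10):
    """Count instruction misses for a loop that streams over fresh data."""
    if split:
        icache, dcache = LRUCache(2), LRUCache(2)  # same total capacity: 4 lines
    else:
        icache = dcache = LRUCache(4)              # one shared structure
    misses = 0
    next_data = 0
    for _ in range(iterations):
        for code_line in ("I0", "I1"):             # the loop body
            if not icache.access(code_line):
                misses += 1
        for _ in range(4):                         # streaming data, never reused
            dcache.access(("D", next_data))
            next_data += 1
    return misses


print(instruction_misses(split=True))   # 2  (only the cold misses)
print(instruction_misses(split=False))  # 20 (data evicts the loop every time)
```

This workload is deliberately adversarial. With a different mix, the unified cache can come out ahead, because it lets either kind of line use the whole capacity.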

The last sub-section of the section on the differences between the modified Harvard architecture and Harvard and von Neumann in the Wikipedia article mentions that the Mark I machine uses different memory technologies for the instruction and data memories. This made me wonder whether this can constitute an advantage for the split design in modern computer systems. Here are some of the papers that show that this is indeed the case:

LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache. The paper shows that by using an SRAM loop cache (like the LSD in Intel processors) and an STT-RAM instruction cache, energy consumption can be significantly reduced, especially when a loop is being executed that fits entirely in the loop cache. The non-volatile property of STT-RAM enables the authors to completely power-gate the instruction cache without losing its contents. In contrast, with an SRAM instruction cache, the static energy consumption is much larger, and power-gating it resu
