## **Accepted Manuscript**

A fault-tolerant last level cache for CMPs operating at ultra-low voltage

Alexandra Ferrerón, Jesús Alastruey-Benedé, Darío Suárez Gracia, Teresa Monreal Arnal, Pablo Ibáñez Marín, Víctor Viñals Yúfera

PII: S0743-7315(18)30781-0

DOI: https://doi.org/10.1016/j.jpdc.2018.10.010

Reference: YJPDC 3966

To appear in: J. Parallel Distrib. Comput.

Received date: 19 December 2017. Revised date: 23 July 2018. Accepted date: 22 October 2018.

Please cite this article as: A. Ferrerón, J. Alastruey-Benedé, D.S. Gracia et al., A fault-tolerant last level cache for CMPs operating at ultra-low voltage, *J. Parallel Distrib. Comput.* (2018), https://doi.org/10.1016/j.jpdc.2018.10.010

## Highlights

- Fault-Tolerant Last Level Cache for CMPs Operating at Ultra-Low Voltage.
- Mechanism that exploits redundancy and reuse to enhance block disabling performance.
- Fault-aware LLC management that maps critical blocks to operative cache entries.
- Detailed evaluation of block disabling techniques in a shared-memory coherent CMP.

# A Fault-Tolerant Last Level Cache for CMPs Operating at Ultra-Low Voltage

Alexandra Ferrerón<sup>a,c,1</sup>, Jesús Alastruey-Benedé<sup>a,c,\*</sup>, Darío Suárez Gracia<sup>a,c</sup>, Teresa Monreal Arnal<sup>b,c</sup>, Pablo Ibáñez Marín<sup>a,c</sup>, Víctor Viñals Yúfera<sup>a,c</sup>

<sup>a</sup>Departamento de Informática e Ingeniería de Sistemas. Instituto de Investigación en Ingeniería de Aragón. Universidad de Zaragoza, Spain

<sup>b</sup>Universitat Politècnica de Catalunya, BarcelonaTech, Spain

<sup>c</sup>HiPEAC European Network of Excellence

<sup>1</sup>Now at Google.

<sup>\*</sup>Corresponding author. Email address: jalastru@unizar.es (Jesús Alastruey-Benedé)

#### Abstract

Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show a behavior consistent with a hard fault. Block disabling is a microarchitectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the voltage supply. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling with negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates to performance enhancements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.

Keywords: Near-threshold voltage, SRAM reliability, fault-tolerance, on-chip caches, cache management

#### 1. Introduction

For recent CMOS technologies, power density is the main performance limiting factor across most computing segments.
Moore's law continues to hold, with a doubling of the number of transistors and integration density in each new process generation, but Dennard scaling no longer applies, and we are not able to keep a constant power density across technology generations. Power budgets prevent us from utilizing all the available transistors, leading to dark silicon [1].

For years, industry has relied on scaling the supply voltage ($V_{dd}$) to reduce power consumption, but this trend has dramatically slowed since the 90 nm generation because of leakage. Reducing operating voltages to values near the threshold voltage ($V_{th}$) would minimize leakage and switching power consumption. The resulting power reduction could be used to activate more chip resources and potentially achieve performance improvements [2].

Unfortunately, $V_{dd}$ scaling is limited by the tight margins of the on-chip cache SRAM transistors. Excessive parameter variations in SRAM cells limit the voltage scaling of memory structures to a minimum voltage, $V_{dd_{min}}$, below which SRAM cells may not operate reliably. $V_{dd_{min}}$ usually determines the minimum voltage of the whole processor, and in current technologies is typically of the order of 0.7–1.0 V when regular 6T SRAM cells are employed.

In the literature, various solutions have been proposed to enable reliable cache operation at low voltages. At the circuit level, the use of larger transistors or more transistors (assist circuitry) improves SRAM cell resilience [3, 4]. The main drawbacks of this approach are the associated increases in area and power consumption. First-level caches in chip multiprocessors (CMPs) occupy little area, and their access time often determines the processor cycle time. Commercial processors, such as the Intel Nehalem family, use robust 8T SRAM cells to build reliable first-level caches, since this represents an affordable overhead [5].
In contrast, last-level caches (LLCs) are usually shared and have larger sizes and associativity, accounting for much of the die area [6]. Hence, for LLCs, minimum-geometry 6T cells are preferred to achieve higher densities.

At the architectural level, fault-tolerant cache designs rely on disabling faulty resources at different granularities [7], or correcting defective bits through either error correction codes (ECCs) [8] or a distributed duplication of blocks [9, 10]. Block Disabling (BD) is a simple technique that disables a cache entry when a defective bit is found [11]. It is already implemented in modern processors to protect against hard faults [6]. However, due to the random distribution of defective cells, the capacity of the cache is compromised. Complex techniques based on ECCs or the combination of faulty resources are able to rescue more cache capacity, but incur large storage overheads and sometimes require complex remapping that penalizes the cache access latency.

In our work, we have developed a new approach to mitigate the impact of SRAM failures in LLCs due to parameter variations, based on BD but also relying on the underlying structures already present in CMPs. We identify a natural source of on-chip data redundancy that arises because of the replication of blocks in inclusive multi-level cache hierarchies and exploit this redundancy through a smart fault-aware cache management policy.

In this paper, we make the following contributions. First, we provide an evaluation of BD techniques in a shared-memory coherent CMP running parallel and multiprogrammed workloads with a complete and detailed memory model, exploring SRAM cells with different probabilities of failure. Second, we introduce a technique that keeps the tags of the LLC and, therefore, the tracking capabilities of the coherence directory operational. This way, a block not physically stored in the LLC can reside in the private level and be made available to other cores.
As an alternative to main memory supply, we set up a cache-to-cache copy service to support code or data sharing (thread migration, operating system, or parallel workloads). Finally, we propose a fault-aware cache management policy that predicts the usefulness of a block based on its use pattern, and guides the allocation of blocks to faulty and non-faulty cache entries, adding no overhead to the original replacement policy.

Our fault-aware cache management policy is able to decrease the LLC misses per kilo instruction (MPKI) by up to 37.3% with respect to BD, which translates to speedup improvements of 2 to 13% for multiprogrammed workloads. For parallel workloads, the MPKI values decrease by 5 to 54.3% with respect to BD for the different SRAM cells considered, improving performance by up to 34.6%.

This paper extends our previous work [12] in several significant ways: i) a new fault-aware cache management policy aimed at caches operating at low voltages, ii) a detailed implementation of the block disabling with operational tags (BDOT) technique, iii) a more realistic SRAM fault model, improving the accuracy of the results, and iv) a more detailed evaluation including multiprogrammed workloads and cache capacity/energy analysis.

The rest of the paper is organized as follows. Section 2 introduces the problem of process variations and its effect on SRAM cell reliability. Section 3 comments on BD and its impact on large cache structures. In Section 4, we describe how to take advantage of the coherence infrastructure to operate at low $V_{dd}$. Section 5 introduces a fault-aware cache management policy for LLCs operating at low voltages. Section 6 describes the methodology. Section 7 presents our evaluation. Section 8 discusses the system impact. In Section 9, we comment on related work, and in Section 10, we outline our conclusions.

#### 2. Process Variations in SRAM cells

SRAM structures are especially vulnerable to failures due to process variations, as they are aggressively sized to meet high density requirements, and because of the vast number of cells that comprise on-chip SRAM structures [13]. In particular, intra-die random dopant fluctuations (RDFs) are the main cause of threshold voltage variation [14]. The stochastic nature of the ion implantation process leads to a distribution of $V_{th}$ values across a chip, which reduces the already tight transistor margins. Hence, SRAM structures have a minimum voltage, $V_{dd_{min}}$, to guarantee reliable operation, which is typically of the order of 0.7–1.0 V in current process generations, when 6T cells are used.

The robustness of SRAM cells under the $V_{dd_{min}}$ range has been extensively analyzed in the literature [10, 9, 3, 8, 15]. Zhou et al. studied six different sizes of 6T SRAM cells in 32 nm technology, and their probabilities of failure as $V_{dd}$ decreases [4]. According to that study, at 0.5 V, the probability of failure of an SRAM cell ($P_{fail}$) is between $10^{-3}$ and $10^{-2}$. The use of larger cells reduces the probability of failure, as non-uniformities average out, increasing read and write margins and resulting in more robust devices. However, large cells reduce the density and increase power and energy consumption.

Table 1 describes the six SRAM cells of Zhou's study (C1, C2, C3, C4, C5, and C6) in terms of their area relative to the smallest cell (C1), and lists the percentage of non-faulty entries in caches built from these cells operating at 0.5 V, assuming 64-byte cache entries. An entry is considered faulty if it contains at least one defective bit. As Table 1 shows, less than 10% of the cache entries are non-faulty for the small cells C1 and C2 at 0.5 V.
If the cache is implemented with the more robust C6 cells, however, the percentage of non-faulty cache entries rises to 60%, but at the cost of a 58% increase in area (relative to C1), and the consequent increase in leakage, which is not a suitable option for a large structure such as an on-chip LLC.

In this work, we take Zhou's reliability study as a reference to test our proposals on a wide range of failure probabilities. We will only consider C2 to C6 operating at 0.5 V (our target near-threshold $V_{dd}$), as at this voltage, a cache built with C1 cells would have all its capacity compromised.

Table 1: Area relative to cell C1 and percentage of non-faulty 64-byte entries in a cache operating at 0.5 V, for the six bit cells introduced in [4].

| Cell type | C1 | C2 | C3 | C4 | C5 | C6 |
|---------------|------|------|------|------|------|------|
| Relative area | 1.00 | 1.12 | 1.23 | 1.35 | 1.46 | 1.58 |
| % non-faulty | 0.0 | 9.9 | 27.8 | 35.8 | 50.6 | 59.9 |

Figure 1: Available associativity of a 16-way set associative block disabling cache (64-byte block) made up of cells C6-C1 operating at 0.5 V.

## 3. Impact of Block Disabling on Large Shared Caches at Ultra-low Voltages

A simple approach to handling hard faults is the disabling of faulty elements. BD deactivates resources at block (cache entry<sup>2</sup>) granularity: when a fault is detected at a given cache entry, that entry is marked as defective and it can no longer store a cache block [11]. This technique is implemented in modern processors to enable them to tolerate hard faults [6].

<sup>2</sup>In this work, we differentiate between cache block and cache entry: block refers to the transfer unit, the content *per se*, while entry refers to the physical group of cells that store a block.

BD has also been studied for operation at low voltages because of its easy implementation and low overhead [15]. From the implementation perspective, only one bit per entry suffices to mark the entry as faulty. The main drawback of this approach is that the amount of capacity dramatically falls when the probability of failure increases, as shown in Table 1. Even if the total count of faulty cells in the cache is less than 1%, the effective cache capacity is strongly affected because of the random distribution of faulty cells. BD results in caches with variable associativity per set, determined by the random distribution of faults in the cache.

The interaction between BD and a system's cache organization also plays an important role. Modern commercial processors, such as the Intel Core i7, implement inclusive hierarchies to facilitate coherence: blocks stored in a private cache are also stored in the shared LLC. The coherence information is embedded in the LLC; i.e., the sharing state and a bit vector to represent the current sharers are added to each block. To force inclusion, when a block is evicted from the LLC, explicit back invalidations are required to remove the copies of the private cache blocks, if present (inclusion victims) [10].

Inclusive hierarchies perform poorly when the aggregated size of the private caches is similar to the size of the LLC [17], and BD exacerbates the problem because of the substantial associativity and capacity degradation in the LLC. Figure 1 shows the available associativity in a 16-way set associative cache bank with 64-byte blocks, when built with cells C1-C6 (Table 1) operating at 0.5 V. The number of faulty ways per set follows a binomial distribution $B(n,p)$, where $n$ is the associativity, and $p$ denotes the probability of failure of a cache entry. Figure 1 shows how the associativity degrades as more faulty cells appear in the cache structure. On average, 50% of the ways are faulty if the cache is built with C5 cells, and this percentage rises to 90% when using C2 cells.
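The binomial model above can be explored numerically. The following is a minimal sketch, not the authors' tooling: it assumes independent cell failures (so a 64-byte entry is non-faulty with probability $(1-P_{fail})^{512}$), and the function and dictionary names are hypothetical. The per-cell non-faulty fractions are copied from Table 1.

```python
from math import comb

BITS_PER_ENTRY = 64 * 8  # a 64-byte cache entry

# Fraction of non-faulty entries at 0.5 V, copied from Table 1.
NON_FAULTY_FRACTION = {"C2": 0.099, "C3": 0.278, "C4": 0.358,
                       "C5": 0.506, "C6": 0.599}


def entry_ok_probability(p_cell_fail, bits=BITS_PER_ENTRY):
    """Probability that an entry has no defective bit.

    Assumption (ours): cell failures are independent.
    """
    return (1.0 - p_cell_fail) ** bits


def faulty_ways_pmf(n_ways, p_entry_faulty):
    """Binomial B(n, p): pmf[k] = P(exactly k faulty ways in a set)."""
    q = 1.0 - p_entry_faulty
    return [comb(n_ways, k) * p_entry_faulty ** k * q ** (n_ways - k)
            for k in range(n_ways + 1)]


def mean_faulty_ways(pmf):
    """Expected number of faulty ways per set."""
    return sum(k * pk for k, pk in enumerate(pmf))


if __name__ == "__main__":
    for cell, ok in NON_FAULTY_FRACTION.items():
        pmf = faulty_ways_pmf(16, 1.0 - ok)
        print(f"{cell}: mean faulty ways = {mean_faulty_ways(pmf):.2f}, "
              f"P(no operative way) = {pmf[16]:.4f}")
```

Under these assumptions, a per-cell failure probability of $10^{-3}$ gives roughly 60% non-faulty entries, in line with the C6 column of Table 1, and the last element of the distribution gives the fraction of sets left with no operative way.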
The associativity loss directly translates to a significant increase in the number of inclusion victims: the number of invalidations in a cache built with C5 cells is 10 times larger than in a cache implemented with fault-free SRAM cells.

This finding suggests that inclusive hierarchies are not particularly suitable for systems that implement BD in the presence of a significant number of faults. From the coherence management perspective, however, only directory inclusion is required: blocks present in the private levels have to be included only in the shared level tag array, without the need for a replica in the data array [16]. This observation is the basis for the techniques we propose in this paper.

Our proposal has been designed for inclusive memory hierarchies, but most of the proposed ideas could benefit non-inclusive and exclusive hierarchies as well. The objectives of our replacement and promotion algorithms are to assign the non-faulty entries to blocks with reuse and to blocks that are not present in the private caches. On the one hand, these objectives still hold in such hierarchies, although our algorithms should consider different priorities for allocation and promotion decisions. On the other hand, our proposal alleviates a specific problem of inclusive hierarchies, namely the need to invalidate a block in a private cache when it is evicted from the shared cache. This problem does not exist in non-inclusive hierarchies, and therefore our proposal is not applicable in this specific aspect.

Note in Figure 1 that, when using cell types C3 and C2, 0.6% and 18.9% of the sets have no operative ways, respectively. To be able to offer a complete comparison with BD, we assume that at least one of the ways in each set is non-faulty, although this is not a requirement for the techniques we present in this paper, and the LLC is able to operate even when all the ways of a set are faulty.

## 4. Exploiting Inclusive Hierarchies to Enable Ultra-low Voltage Operation: Block Disabling with Operational Tags

The BD scheme simply assumes one extra bit per entry to identify faulty cache entries in the data array (one or more faulty cells). Faulty data entries are excluded from tag search and replacement, involving a net reduction in associativity, and a consequent increase in inclusion victims. From the coherence management perspective, however, tracking blocks in the shared level tag array suffices to ensure directory inclusion. This is the basis of our previous work [12], and the starting point of the first technique we propose: Block Disabling with Operational Tags (BDOT).

Assuming a two-level inclusive hierarchy, to force directory inclusion, we turn on the tags of faulty entries in the LLC, including them in the conventional operations of search and replacement. The tag of a faulty entry, if valid, tracks a cache block that might be present in the private caches, but that cannot be stored in the shared cache. Enabling the tags of the faulty entries restores the associativity of the shared cache as seen by the first-level private caches, eliminating the problem of the increase in the number of inclusion victims caused by the loss in associativity.

In this situation, two kinds of LLC entries have to be distinguished: tag-only (T), where the associated data entry is faulty and only the tag is stored, and tag-data (D), where the associated data entry is non-faulty and both tag and data are stored. From the implementation perspective, one resilient bit still suffices to indicate whether the entry is faulty or not. The coherence protocol needs to be adapted to the new solution, where a T entry only stores the block tag and directory structure.
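The distinction between T and D entries can be pictured with a small data-structure sketch. This is an illustration under our own assumptions, not the authors' hardware description: the class `LLCEntry` and the helpers `tracked_ways` and `data_ways` are hypothetical names, and the directory bit vector is modelled as a Python set of core identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class LLCEntry:
    faulty: bool                 # the single resilient bit per entry
    valid: bool = False
    tag: int = 0
    sharers: set = field(default_factory=set)  # directory info (core ids)

    @property
    def kind(self):
        """'T' (tag-only) if the data entry is faulty, else 'D' (tag-data)."""
        return "T" if self.faulty else "D"


def tracked_ways(cache_set, scheme):
    """Ways that take part in tag search and replacement.

    BD excludes faulty entries; BDOT keeps every tag operational.
    """
    if scheme == "BD":
        return sum(1 for e in cache_set if not e.faulty)
    if scheme == "BDOT":
        return len(cache_set)
    raise ValueError(f"unknown scheme: {scheme}")


def data_ways(cache_set):
    """Ways that can actually hold block data (D entries), in either scheme."""
    return sum(1 for e in cache_set if not e.faulty)
```

For a 16-way set with 10 faulty data entries, BD would track 6 blocks while BDOT tracks 16 tags, although only 6 of them can hold data in the LLC.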
Whenever a request to a block stored in a T entry arrives at the LLC bank, the request needs to be sent to the next level (in this case, the off-chip memory) to recover the block, and the same occurs with dirty blocks, which need to be written back to memory after being evicted from a private cache.

To fully exploit this scheme, no failures should occur in the cells of the tag array. This can be accomplished, for example, by using robust cells (e.g., increasing the number of transistors per cell) or increasing the strength of the ECC. Tags occupy very little area in comparison to the data array (around 6% for our configuration, see Table 2 in Section 6), and increasing the cell size by 5.7% (assuming 8T SRAM cells [18]) will only increase the total area of a cache [...] of the tag array, while using resilient tag cells involves little overhead; we opt for the latter. This approach is also consistent with prior work [9, 10]. Moreover, many of today's CPUs use different cell types for tag and data arrays [19].

Contrary to other proposals, our mechanism works even when all entries of a set are faulty. The LLC saves the tags for both faulty and non-faulty entries, maintaining the coherence status of all the blocks, and allowing blocks to be stored in the private levels without the need for a copy of the data in the shared level. Hence, it is possible to store a block in the private caches even if all the data ways of the corresponding LLC set are faulty.

#### 4.1. BDOT Limitations

BDOT, as described above, has two potential limitations, both related to the allocation of blocks to faulty entries.

First, BDOT always forwards requests to blocks allocated to faulty entries to the off-chip memory. However, a block allocated to a faulty entry might be present on-chip, if it is being used by a private cache (L1). This situation is common in parallel workloads, which share data and instructions.
In this case, the directory information can be used to orchestrate cooperation among L1 caches. When the directory protocol receives an L1 request to a shared block mapped to a T entry, it forwards the request to one of the sharers of the block, namely, the L1 cache closest to the requester in terms of Manhattan distance. That L1 will serve the block through a cache-to-cache transfer. Cache-to-cache transfers are already implemented in the baseline coherence protocol for exclusively owned blocks. Hence, no additional hardware is required and a slight modification of the directory protocol suffices to trigger a shared block transfer. So from now on, we assume that BDOT includes this feature.

The second limitation comes from allocating blocks to LLC entries without taking into account their T or D nature. Unfortunately, this blind allocation can result in heavily reused blocks being attached to faulty entries. Indeed, if a given block of the LLC is requested repeatedly from an L1 cache (i.e., the block shows reuse), any replacement algorithm will tend to protect it, reducing its eviction chances. Thus, if a block with reuse is initially allocated to a T entry, unless replicated in other cores, all L1 cache misses will be forwarded off-chip by the LLC. In the next section, we introduce a specific allocation and reallocation policy for BDOT caches that differentiates between T and D entries.

#### 5. Fault-aware Cache Management Policy for BDOT Caches

Conventional cache management policies assume that every cache entry can store a block, while BDOT breaks this assumption: each set in an N-way set associative cache contains T entries that store only tags, and D entries that store tags and data. Keeping in mind the main goal of improving the overall LLC performance under BDOT, this section introduces a fault-aware cache management policy that takes into account the distinct nature of T and D entries, and the reuse pattern of the reference stream.
In particular, we seek to achieve the following two goals:

1. To allocate blocks that are most likely to be used in the future to D entries.
2. To maximize the amount of on-chip data by giving greater priority (higher chances of being allocated to D entries) to blocks that are not present in the private cache levels.

Prior work has shown that reuse is a very effective predictor of the usefulness of a given block in the LLC [20, 21]. Reuse locality can be described as follows: lines accessed at least twice tend to be reused many times in the near future, and recently reused lines are more useful than those reused earlier [20]. Therefore, seeking to achieve our first goal, we exploit reuse locality to predict which blocks should be allocated to D entries.

With respect to our second goal, a request to a block allocated to a T entry and present in L1 can be serviced through a cache-to-cache transaction, whilst if the block is not present in L1, the request will always be forwarded to the off-chip memory, incurring a penalty in access time and energy. Therefore, it is preferable to dedicate D entries to blocks not available in the L1 caches.

Figure 2: Reuse and inclusion states for a block in LLC. NR, R, C, and NC represent: Non-Reused, Reused, Cached (in L1), and Non-Cached (in L1), respectively. Replacement and coherence transitions are not shown.

These goals may be added to any management policy. In this work, we will build on top of a state-of-the-art reuse-based replacement algorithm: Not-Recently Reused (NRR) [20]. Next, we describe the baseline replacement in some depth and then we add awareness of the existence of faulty entries.

#### 5.1. Baseline NRR Replacement Algorithm

The NRR algorithm requires four states per LLC block, as depicted in Figure 2. When a block not present in the LLC is requested by the processor (1st use: L1 request), it is stored in the L1 and the LLC (to force inclusion), its state in the LLC being NR-C (Non-Reused, Cached).
When the block is evicted from the private cache (L1 eviction), its LLC state changes to NR-NC (Non-Reused, Non-Cached). On a new request (2nd use: L1 request), a copy of the block is stored again in L1 and its LLC state is R-C (Reused, Cached). At this point, the block has shown reuse in the LLC and, very likely, it will be reused many times in the near future. Finally, when the block is evicted again from the L1, the state becomes R-NC (Reused, Non-Cached). Subsequent requests and evictions move the block between the R-C and R-NC states.

Having LLC blocks classified this way, the replacement policy can exploit L1 temporal locality and LLC reuse. In an inclusive hierarchy, the replacement of a block in the LLC forces the invalidation of its copies in the private caches, if any, and this usually implies performance degradation, assuming that blocks in L1 are being actively used [17]. Therefore, the highest priority (protection) is given to blocks stored in private caches. As a secondary objective, the highest priority is given to blocks that have shown reuse in the LLC. Hence, NRR selects victims in the following order: NR-NC, R-NC, NR-C, R-C. Reuse recency is taken into account by resetting the reuse bit when all the non-cached blocks are marked as reused (transition from R-NC to NR-NC). This way, more recently reused blocks become more protected.

The implementation of NRR only requires one reuse bit per block. The protection of private copies can be implemented in various different ways [17], but one simple solution is to use the presence bit vector of the coherence directory, assuming non-silent tag evictions of clean blocks.

Figure 3: Insertion and promotion actions for a fault-aware cache management policy example in a cache set with two faulty cache entries. (a) Insertion from main memory after LLC miss. (b) Insertion from L1 after L1 eviction (promotion). Lowercase and capital letters indicate tag and data, respectively.

#### 5.2. Reuse-based and Fault-aware Management for BDOT Caches

Seeking to guarantee that valuable blocks remain in the LLC, we devise a fault-aware management policy by distinguishing between T and D entries. One option is to promote blocks by reallocating them from T to D entries, if needed, to improve the overall cache performance. The design choices include where the promoted data comes from and which victim is chosen as a target of the consequent demotion. At the same time, we want to continue exploiting reuse in the simple and efficient way offered by an NRR-like replacement algorithm, which is unaware of faulty entries. Thus, our goal is to design a cache management policy that merges reuse exploitation and faulty entry management. Below, we elaborate on the two mechanisms that are key to achieving this, namely block insertion/replacement and block promotion/demotion.

#### 5.2.1. Insertion and Replacement of Blocks

On a first insertion (LLC miss), an incoming block has not shown reuse, and hence allocating it to a T entry seems a reasonable idea. Figure 3a shows an example of a cache block to be inserted in a 4-way cache set with two T entries (those storing q and r tags) and two D entries (those storing p and s tags and the corresponding P and S data). A victim is selected among the blocks allocated to T entries. The baseline replacement policy dictates which of those blocks (Q and R) is selected for replacement. This is equivalent to predicting that the incoming block X is not going to be reused. If the reuse pattern of the block is mispredicted, block X should be reallocated to a D entry, to reduce its access time and transfer energy in future L1 misses. This reallocation will be performed using the promotion mechanism we detail in the next subsection.

Dealing with first insertion this way is very simple but has a clear disadvantage, related to the distribution of T and D entries with respect to the percentage of reused and non-reused blocks.
For example, if the number of T entries is small, the insertion policy would place considerable pressure on these scarce entries. Blocks would be unavoidably forced to leave the LLC before having had enough time to show a reuse pattern, even though there are many available D entries. In an extreme case, when all the entries in a set are D type, this cache management policy could not be implemented.

Solving this problem is not easy. We explored various adaptive mechanisms in which some D entries are used as T. However, it is difficult to determine the optimal number of T entries, this being highly dependent on the workload. After carrying out experiments, not shown for the sake of brevity, the performance returns were disappointing given the required complexity.

Given that our promotion mechanism reallocates reused blocks to D entries and non-reused ones to T entries (as we will see in the following subsection), we realized that the baseline NRR replacement policy itself suffices to achieve our initial goals because it protects reused blocks. Since NRR gives a lower priority to non-reused blocks, blocks allocated to T entries will have more chances to be evicted. This implies that, with a balanced distribution between T and D entries, an incoming block will have a higher probability of being inserted in a T entry than in a D entry. If the number of T entries in a set is very low, and even if there are no T entries in a set, the mechanism still works correctly. NRR periodically resets the reuse bits, so that blocks in D entries become replacement candidates with the same priority as those in T entries. Hence, the initial insertion does not necessarily have to consider the nature of the entry, and our implementation relies only on the baseline replacement policy to select the victim block.

#### 5.2.2. Promotion and Demotion of Blocks

A blind allocation of blocks to cache entries may result in valuable blocks (i.e., those with reuse) being initially allocated to T entries, and vice versa.
However, this undesirable situation can be tracked on the fly through the reuse footprint, and reversed by swapping a T entry with a D entry: when a block allocated to a T entry shows reuse, we will *promote* it to a D entry.

Promotion involves a complementary *demotion* of the block stored in the selected D entry. To select which block is demoted, we also rely on reuse and L1 presence information. Reused blocks should be kept in the LLC, but unlike in the baseline replacement policy, block demotion does not involve an LLC tag eviction. Furthermore, if the block is present in L1, losing the contents of the LLC is not critical, because there is at least one on-chip copy of the block, which can be supplied by a cache-to-cache transaction. Thus, to maximize the amount of on-chip data, the demotion algorithm will select the victim block among those present in L1. Among the blocks in L1, non-reused ones should have more chances of being demoted.

Note that the promotion of a block can be performed at two different times: at reuse detection (i.e., on a second L1 request to a block stored in a T entry) or after the second eviction from L1 (i.e., on eviction after reuse). Performing the promotion after the second request from L1 duplicates the content, as a copy of the block is also stored in a private cache, whilst performing the promotion after the L1 eviction meets the goal of maximizing the amount of on-chip data. Therefore, we opt for the latter and trigger promotions on L1 evictions.

The promotion/demotion process is illustrated in Figure 3b. When block R, which is stored in a T entry, is evicted from the L1 cache and selected for promotion (i.e., its reuse bit is set), we select a victim among the demotion candidates (P and S in Figure 3b).
Once the victim is selected (P in our example), we swap the cache contents in three steps: (1) discard the data of entry P, writing back the data to memory if dirty; (2) swap the P and R tags; and (3) copy the data (R) to the available D entry, which was occupied by the demoted block (P).

#### 5.2.3. Summary and Implementation

Figure 4 illustrates the implementation of the aforementioned ideas. The states of the baseline replacement algorithm shown in Figure 2 are now superstates split into T and D substates. The initial allocation of blocks (1st use: L1 request in Figure 4) does not take into account the nature of the entry, and it solely depends on the victim selection arising from reuse and L1 presence; i.e., it only depends on the baseline replacement algorithm.

Figure 4: Reuse and inclusion states for a block in LLC with BDOT.

After insertion, blocks will move among the NR-C, NR-NC, R-C, and R-NC superstates as they would do in a cache without considering faulty entries. To guarantee that high-value blocks (those showing reuse) remain in the LLC, the policy reallocates them from T to D entries when they are evicted from the L1 and reside in a faulty LLC entry: R-C-T state. After L1 eviction, blocks in R-C-T trigger a promotion, which results in the transition to an R-NC-D state and reallocation to a D entry, with the consequent demotion of another block within the set to a T entry.

A block being demoted can be in any of the superstates, and according to the victim selection algorithm, we first demote blocks that are present in the private levels, in order to maximize the amount of data available in the on-chip hierarchy. As a secondary objective, the policy attempts to first demote low-priority blocks, that is, those without reuse. In particular, it selects blocks in the following order: NR-C-D, R-C-D, NR-NC-D, and R-NC-D.
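The demotion order and the three-step swap described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the authors' hardware implementation: the `Block` and `Entry` classes, the function names, and the example states chosen for blocks P and S are all hypothetical, and the presence vector is collapsed into a single `in_l1` flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    tag: str
    reused: bool          # reuse bit: R (True) or NR (False)
    in_l1: bool           # some presence bit set: C (True) or NC (False)
    dirty: bool = False


@dataclass
class Entry:
    is_d: bool                      # True: tag + data (D); False: tag only (T)
    block: Optional[Block] = None


def demotion_rank(block: Block):
    # Lower rank is demoted first: NR-C, R-C, NR-NC, R-NC.
    return (not block.in_l1, block.reused)


def select_demotion_victim(cache_set: List[Entry]) -> Optional[int]:
    # Only valid blocks held in D entries are demotion candidates.
    candidates = [i for i, e in enumerate(cache_set)
                  if e.is_d and e.block is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda i: demotion_rank(cache_set[i].block))


def promote_on_last_l1_eviction(cache_set: List[Entry], index: int) -> List[str]:
    """Handle the eviction of the last L1 copy of the block at `index`.

    Returns the tags written back to memory (hypothetical interface).
    """
    promoted = cache_set[index].block
    promoted.in_l1 = False
    # Only reused blocks tracked by a T entry (R-C-T) trigger a promotion.
    if cache_set[index].is_d or not promoted.reused:
        return []
    victim_index = select_demotion_victim(cache_set)
    if victim_index is None:
        return []
    victim = cache_set[victim_index].block
    writebacks = []
    # Step 1: discard the victim's data, writing it back if dirty.
    if victim.dirty:
        writebacks.append(victim.tag)
        victim.dirty = False
    # Steps 2 and 3: swap the tags and place the promoted data in the D entry.
    cache_set[index].block, cache_set[victim_index].block = victim, promoted
    return writebacks


# Example loosely following Figure 3b; the states of P and S are assumed.
cache_set = [
    Entry(is_d=False, block=Block("R", reused=True, in_l1=True)),
    Entry(is_d=True, block=Block("P", reused=False, in_l1=True)),
    Entry(is_d=True, block=Block("S", reused=True, in_l1=True)),
]
promote_on_last_l1_eviction(cache_set, 0)
```

In this model, ranking by the tuple `(not in_l1, reused)` reproduces the stated order directly: L1 presence is the primary criterion and lack of reuse is the secondary one.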
This reuse-based, fault-aware policy adds no extra storage overhead to the baseline reuse-based replacement policy, as only the bit indicating reuse and the presence bit vector are needed to orchestrate the replacement and promotion actions. Moreover, swapping blocks only requires some extra control logic to perform the following actions: first, the logic reads the demoted victim and inserts the promoted block, as for conventional block insertion, and, then, it writes back the tag of the demoted block. Promoting blocks after L1 eviction implies non-silent eviction of data blocks. This overhead does not affect latency, as L1 replacements are not in the critical path, and has a negligible impact on energy consumption.

The fault-aware cache management technique presented here could be implemented on top of other replacement algorithms (such as LRU or NRU). We decided to rely on NRR because of its simple, yet efficient implementation, and because it fits the general principles behind our ideas. Finally, and regarding the reallocation from T to D entries and vice versa, other policies are also possible. For example, instead of relying on the reuse information of the blocks, a future use predictor [22] could be utilized to decide which blocks should be allocated to D entries, or a dead block predictor [23] could be used to indicate which blocks may be demoted to T entries, but these solutions add complexity to the cache logic as well as requiring more storage overhead.

#### 6. Methodology

#### 6.1. Overview of the System

Our baseline system consists of a tiled CMP, with an inclusive two-level cache hierarchy, where the second level cache or LLC is shared and distributed among the processor cores. Tiles are interconnected by means of a mesh. Each tile has a processor core with a private first level cache (L1) split into instructions and data, and a bank of the shared LLC, both connected to the router (Figure 5).
Similarly to most CMPs, the write policy for L1 data caches is write-back because other policies, such as write-through, may collapse the interconnection network [24]. The mesh would have to convey every single store from the cores to the LLC banks to guarantee content inclusion. The CMP includes two memory controllers located at the edges of the chip. Table 2 shows the parameters of the baseline processor, memory hierarchy, and interconnection network. We assume it runs at a frequency of 1 GHz with an operating voltage of 0.5 V. Note that the DRAM module voltage is not scaled like the rest of the system, and hence, the relative speed of main memory with respect to the chip increases as the voltage decreases. This model is consistent with prior work [9, 10].

Figure 5: Modeled 8-core CMP.

Our baseline coherence protocol relies on a full-map directory with Modified, Exclusive, Shared, Invalid (MESI) states. We use explicit eviction notification of both shared and exclusively owned blocks. L1 caches are built with robust SRAM cells that can run reliably at lower, near-threshold voltages, while LLC data banks are built with conventional 6T SRAM cells and, therefore, they are sensitive to failures [5]. As in previous studies [9, 10], we assume that the LLC tag arrays are hardened by using upsized cells [18].

The baseline LLC replacement policy is Not-Recently Used (NRU) [25] extended with private copy protection [17]. We implement this protection using coherence directory information updated by non-silent L1 block evictions.

#### 6.2. Experimental Set-up

Regarding our experimental set-up, we model the CMP system described in Table 2. We use Simics [26] in combination with GEMS [27] to simulate the on-chip memory hierarchy and interconnection network, and DRAMSim2 [28] to simulate the DDR3 DRAM in detail.
To obtain timing, area, and energy consumption, we use the McPAT framework [29] for the on-chip components, and DRAMSim2 for the DRAM module. We extend the Ruby module (GEMS) to simulate the cache swaps in detail in order to take into account their dynamic energy overhead.

Table 2: Main characteristics of the CMP system.

| Component | Configuration |
|-------------------|------------------------------------------------------------------------------|
| Cores | 8, Ultrasparc III Plus, in-order, 1 instr./cycle, single-threaded, 1 GHz at $V_{dd}$ 0.5 V |
| Coherence protocol | MESI, directory-based (full-map) distributed among LLC cache banks |
| Consistency model | Sequential |
| L1 cache | Private, 64 KB data and instr. caches, set-associative, 64 B block size, LRU, 2-cycle hit access time |
| LLC cache | Shared, inclusive, interleaved by line address, 1 bank/tile, 1 MB/bank, 16-way, 64 B block size, NRU replacement; 8-cycle hit access time (4-cycle tag access + 4-cycle data access) |
| Memory | 2 memory controllers, located at the edges of the chip; 1333 MHz DDR3; 2 channels, 8 Gb/channel, 8 banks, 8 KB page size, open page policy; raw access time 50 cycles |
| NoC | Mesh, 2 virtual networks (VNs): requests and replies; 2 virtual channels per VN; 16-byte flit size; 1-cycle latency hop, 2-stage routing |

We use a set of 20 multiprogrammed workloads built as random combinations of the 29 SPEC CPU 2006 applications [30], with no special distinction between integer and floating point programs. Each application appears on average 5.5 times with a standard deviation of 2.5. Programs were run on a real machine until completion with the reference inputs. Hardware counters were used to locate the end of the initialization phase.
Every multiprogrammed mix was run for as many instructions as the longest initialization phase, and a checkpoint was created at this point. We then run cycle-accurate simulations including 300 million cycles to warm up the memory hierarchy and 700 million cycles for data collection.

We include a selection of shared-memory parallel applications from PARSEC [31] with a significant memory footprint (MPKI<sub>LLC</sub> $\geq$ 1.0) when running the *sim-large* input in the baseline system: canneal (MPKI<sub>LLC</sub> = 4.3), ferret, streamcluster, and vips. Checkpoints are created in a similar way to that used for multiprogrammed workloads<sup>3</sup>; we run 300 million cycles to warm up the memory structures once the parallel phase has started, and then collect statistics for 700 million cycles.

One challenge for analyzing fault mitigation techniques is the large set of required simulations. Running all workload and simulated model combinations for a single fault map can lead to wrong results, as other authors have described [32, 33]. For example, if all the faults affect the most/least frequently accessed cache sets, the observed speed-up would be much lower/higher than in reality. To address this issue, we rely on statistical sampling to generate random fault maps and run Monte Carlo experiments that guarantee a 5% margin of error with a confidence level of 95% [34]. In other words, the number of samples is increased as necessary to reach the margin of error within the desired level of confidence. For our workloads, simulated models, metrics, margin of error, and confidence level, each point of the design space has to be simulated between 20 and 30 times, each one with a different fault map. We pick the 5% margin of error and the 95% confidence level as a good trade-off between simulation time and accuracy, as increasing both has a large impact on the required number of simulations.
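The sampling procedure can be illustrated with a minimal Python sketch. It is not the authors' tooling, and it rests on several assumptions: the stopping rule uses a normal approximation (z = 1.96) for the 95% confidence interval, an entry is treated as faulty when at least one of its independently failing cells is faulty, the bounds of 20 and 30 runs are used as defaults only because they match the range reported above, and `toy_mpki` is a hypothetical stand-in for a full cycle-accurate simulation.

```python
import math
import random
import statistics

Z_95 = 1.96  # assumed normal-approximation critical value for 95% confidence


def random_fault_map(num_entries, bits_per_entry, p_cell_fail, rng):
    """Mark an entry as faulty (True) if at least one of its cells fails."""
    p_entry_ok = (1.0 - p_cell_fail) ** bits_per_entry
    return [rng.random() >= p_entry_ok for _ in range(num_entries)]


def relative_margin(samples):
    """Half-width of the confidence interval, relative to the sample mean."""
    mean = statistics.fmean(samples)
    half_width = Z_95 * statistics.stdev(samples) / math.sqrt(len(samples))
    return half_width / abs(mean)


def run_monte_carlo(simulate, num_entries, bits_per_entry, p_cell_fail,
                    min_runs=20, max_runs=30, target=0.05, seed=0):
    """Simulate new fault maps until the margin of error reaches `target`."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < max_runs:
        fault_map = random_fault_map(num_entries, bits_per_entry,
                                     p_cell_fail, rng)
        samples.append(simulate(fault_map))
        enough = len(samples) >= max(2, min_runs)
        if enough and relative_margin(samples) <= target:
            break
    return samples


def toy_mpki(fault_map):
    # Hypothetical metric: MPKI grows with the fraction of faulty entries.
    return 5.0 * (1.0 + sum(fault_map) / len(fault_map))


samples = run_monte_carlo(toy_mpki, num_entries=1024, bits_per_entry=512,
                          p_cell_fail=1e-3, seed=1)
print(len(samples), round(relative_margin(samples), 4))
```

The loop stops as soon as the relative half-width of the interval falls below the target, so metrics with low variance across fault maps need fewer runs than noisy ones.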
To ensure all simulations have similar numbers of faults but at different locations, we compute the faultiness of each memory cell randomly and independently of other cells [35, 36]. Finally, we consider that the number and location of faulty cells do not change during workload execution.

<sup>3</sup>We observed that no OS activity appeared when our parallel applications were run and the ratio of CPU utilization between the different threads was practically constant across the simulations.

#### 7. Evaluation

This section evaluates the effectiveness of the proposed BDOT management technique for LLC caches in terms of MPKI, adding up the misses in all LLC banks and dividing by the aggregated instruction count of all cores. Later, in Section 8, we analyze the impact on system performance, area, and energy.

To assess the effectiveness of our proposals, we include several additional configurations. First, as an upper bound in performance, a robust cache built with unrealistically robust cells (Robust); i.e., cells that operate at ultra-low voltages with neither failures nor power or area overheads, which corresponds to a perfect unattainable solution. Then, we also include block disabling (BD), as our proposal emerges from it. Finally, we add results for word disabling (WD) [10]. Word disabling is a more complex technique that combines consecutive faulty cache entries to recreate fully functional ones at the cost of reducing the cache capacity. Section 9 presents a comprehensive discussion of this and other techniques versus our proposals.

In summary, we consider the following configurations:

- **Robust**: reference system, the LLC is built with unrealistically robust cells. All data are presented with respect to this system.
- **BD**: system implementing block disabling, as presented in Section 3, with NRU replacement.
- **BDOT-NRU**: system implementing block disabling with operational tags, as presented in Section 4, with NRU replacement.
- **BDOT-NRR**: system implementing BDOT with NRR replacement, as presented in Section 5.1.
- **BDOT-NRR-FA**: system implementing BDOT with fault-aware NRR replacement, as presented in Section 5.2.
- **WD**: system implementing word disabling with NRU replacement [10].

As in the case of NRR, the NRU implementation also includes private copy protection.

Our detailed results include multiprogrammed workloads (the 20 SPEC CPU 2006 mixes) and parallel workloads (the 4 selected PARSEC applications), for the five cell types considered (C6, C5, C4, C3, and C2).

Figure 6: Normalized MPKI (average for the SPEC mixes) with respect to Robust for the different proposals and cell types. Average MPKI for Robust: 5.09.

#### 7.1. Multiprogrammed Workloads

Figure 6 shows the LLC MPKI results for the multiprogrammed workloads. BD is a valid solution for a cache with few defective entries, like one built with C6 cells, where the average MPKI penalization is 23.9%. However, this penalization increases rapidly with the number of faulty entries, reaching 136% for C2. Using the tags of the defective LLC entries to keep the coherence state of blocks stored in L1 allows BDOT-NRU to incur fewer MPKI than BD for C2, but it does not offer any advantage (the MPKI value increases) for the rest of the cells.

To differentiate and quantify the benefit of a reuse-based replacement and our fault-aware cache management policy, we first implement NRR on top of BDOT (BDOT-NRR), without taking into account the nature of cache entries (faulty or non-faulty). This naive implementation offers a slight improvement with respect to BDOT-NRU for all cell types, but it is still worse than BD, except for C2, as in the case of BDOT-NRU. The explanation for this behavior is the blind allocation of blocks to entries, without taking into account whether the entry can store only the tag (T) or both the tag and the data (D).
Allocating a block that shows reuse to a T entry implies that all the requests to that block are forwarded to the next level (in this case, off-chip). Besides, due to the reuse-based policy, this block will remain in the defective entry of the LLC, protected by the replacement algorithm. However, blocks with reuse allocated to D entries are also protected from replacement, and that explains why the relative differences between BDOT-NRR and BDOT-NRU were larger when using larger cells (i.e., with fewer faults, like C6 and C5).

BDOT-NRR-FA addresses this issue, adding the information of defective entries to the cache management policy. The penalization in terms of MPKI is 14.6%, 15.1%, 16%, 18.3%, and 37.3% lower than with BD for C6, C5, C4, C3, and C2, respectively. If we compare BDOT-NRR-FA with BDOT-NRR, there are 20% fewer MPKI, irrespective of the cell type, demonstrating the goodness of the design.

Regarding WD, although there are significant differences in terms of the number of defective entries among the cell types considered (Table 1), the MPKI for the different configurations is almost constant. Two reasons explain this behavior: i) a single defective cell forces the entry to be classified as faulty, and ii) the number of defective cells per entry is usually small (three on average for the smallest cell, C2 [37]) and, therefore, very often blocks are successfully stored by combining two consecutive entries. Thus, the average number of ways per set in our system when implementing WD is eight across the different cell configurations. Compared to BD, WD obtains better results when the average fraction of defective entries is greater than half, which is the case of cells C4, C3, and C2, as shown in Table 1. BDOT-NRR-FA lowers the MPKI with respect to WD by 70%, 16.1%, 8.5%, and 3.4%, for C6, C5, C4, and C3, respectively.
WD only beats BDOT-NRR-FA in caches with a high number of defective cells (C2, where on average 90% of the entries are faulty). However, BDOT-NRR-FA requires no additional overhead, whilst WD requires additional storage and logic to reconstruct blocks.

Figure 7: Normalized MPKI (average for PARSEC) with respect to Robust for the different proposals and cell types. Average MPKI for Robust: 2.01.

#### 7.2. Parallel Workloads

Figure 7 shows the relative LLC MPKI for the parallel workloads, with respect to the baseline. As with multiprogrammed workloads, BDOT-NRR-FA has a lower average MPKI than BD and the non-fault-aware implementations of BDOT. In particular, BDOT-NRR-FA improves MPKI with respect to BD by 5%, 5%, 9.6%, 19.2%, and 54.2% on average for C6, C5, C4, C3, and C2, respectively.

Compared with the multiprogrammed workloads, the relative MPKI numbers shown in Figure 7 are larger, moving away from the Robust system to a greater extent for all cell types, even for the winning alternatives (WD and BDOT-NRR-FA). But it is worth noting that the absolute MPKI values for the parallel applications considered are low (Section 6), which makes the relative increases appear more substantial.

Upon closer examination of the results, we can make some interesting observations. Figure 8 shows the LLC MPKI analysis per application for the different cell types. BD is better than plain BDOTs (BDOT-NRU, BDOT-NRR) in C6-C3 cells (C3 in canneal is an exception), while in cell C2 the trend clearly reverses. On the contrary, BDOT-NRR-FA is better than BD in most cases, vips being the only exception (cells C6-C3), and gives very noticeable reductions in the smallest cell, C2. For vips, BDOT-NRR-FA only beats BD in C2 because its image processing algorithm shows very little reuse. In such a non-demanding environment, BD can store the vips working set.
Finally, the costly WD shows a similar tendency to that observed with multiprogrammed workloads, with a relatively constant performance independently of the cell type. In this case, BDOT-NRR-FA beats WD when using C6 or C5 (12.7% and 6.6% lower MPKI values, respectively), but it cannot reach WD performance for C4, C3, or C2 (5.5%, 12.4% and 2..3% higher MPKI values, respectively).

#### 8. System Impact

This section analyzes the impact of our proposals on the system in terms of performance, area, and energy consumption. As in the previous section, we present results relative to the Robust system and compare with the BD and WD mechanisms.

#### 8.1. Performance

Figure 9 shows the performance relative to the robust cell for both multiprogrammed and parallel workloads.

For multiprogrammed workloads (Figure 9a), performance follows the same trend as MPKI, BDOT-NRR-FA being the best design option except in the case of C2 cells, for which WD outperforms BDOT-NRR-FA by 2.2%. In particular, BDOT-NRR-FA shows a performance degradation with respect to the Robust reference system of 1.3%, 2%, 3.4%, 4.3%, and 6.9% for C6, C5, C4, C3, and C2, respectively, or, in other words, a performance improvement with respect to BD of 2%, 2.2%, 2.7%, 3.6%, and 13.1%.

Figure 8: Per-application normalized MPKI (PARSEC) with respect to Robust for the different proposals and cell types. Average MPKIs for Robust: 4.26, 1.59, 1.0 and 1.19, for canneal, ferret, streamcluster, and vips, respectively.

Figure 9: Normalized speedup (average) with respect to Robust for the different proposals and cell types.

As in the case of multiprogrammed workloads, speedup in parallel application performance (Figure 9b) also follows the same trend as in the MPKI results, with a notable exception. For C3, BDOT-NRU and BDOT-NRR perform slightly better than BD on average, while in Figure 7, the average MPKI value with these techniques was higher than with BD.
As we already mentioned, the LLC MPKI for the parallel applications in the baseline system is small (Section 6), and small MPKI increases with respect to this system appear relatively large in Figure 7. Nevertheless, for C3, streamcluster has a dramatic speedup degradation with BD. This is due to the large number of back invalidations to L1 blocks to force directory inclusion (inclusion victims). Specifically, in this application, the number of invalidations to L1 blocks decreases 20 times when implementing BDOT. The MPKI numbers are similar, but the number of instructions executed differs considerably. For this application, we observe a performance improvement of 6.1% when using BDOT-NRU (6.2% for BDOT-NRR), with respect to BD.

On average, BDOT-NRR-FA shows a similar performance to BD for C6 and C5, where the performance degradation with respect to the reference system is 2.2% and 2.9%, respectively, and for C4, C3, and C2, the performance is better, by 1.8%, 7.1%, and 34.6%, respectively. BDOT-NRR-FA and WD have similar performance (within 1%), except in the case of C2, for which WD achieves a 4.1% better performance.

In summary, BDOT-NRR-FA is an excellent choice for caches with different numbers of defective entries, as it achieves as good performance as more complex fault-tolerant techniques without adding any extra storage overhead to the cache.

#### 8.2. Area and Energy

Figure 10: Normalized EPI (average) with respect to Robust for the different proposals and cell types.

Larger SRAM cells are less likely to fail, but at the cost of larger areas and higher power consumption. Even the largest cell considered by Zhou's study (C6), which requires a 41.1% larger area than C2, is far from reaching fully functional performance: '0.1% of the cache entries are faulty at 0.5 V (Table 1). Our fault-aware mechanism has a minimal impact on area.
Only two extra bits suffice to implement BDOT-NRR-FA: one bit marks entries as defective (as in BD), and the other one stores the replacement policy (i.e., NRR) information. Thus, no extra storage overhead is added compared to the BD system.

Minimizing area helps to reduce energy in the LLC. Signals traveling smaller distances require less dynamic power for switching, and, most importantly, small cells consume less static power. To estimate the sub-threshold current, $I_{sub}$, causing the static consumption, we assume that $I_{sub}$ is directly proportional to the transistor width of the cells considered, and estimate it with respect to C2 [4]. For the unrealistically robust cell, we assume that it is the same size as C2, but with a null probability of failure. Energy consumption also includes the dynamic overhead of LLC block swaps and L1 clean data evictions required by the fault-aware BDOT policy. Finally, we account for both the on-chip power and the off-chip DRAM power.

Figure 10 shows the energy per instruction (EPI) for all the systems and cell types considered, both for multiprogrammed (Figure 10a) and parallel (Figure 10b) workloads, with respect to a system implemented with robust cells at 0.5 V, distinguishing between on-chip and off-chip consumption.

For BD, the 2.4-fold higher MPKI for C2 increases the off-chip DRAM traffic, and in turn, significantly increases off-chip DRAM EPI for both multiprogrammed and parallel workloads. On average, BDOT-NRR-FA results in a 5.4%, 5.8%, 6.8%, 8.2%, and 20.4% lower overall EPI than BD for C6, C5, C4, C3, and C2, respectively, for the multiprogrammed workloads. In the case of parallel workloads, the EPI of BDOT-NRR-FA is within 2% of BD for C6, C5, and C4, and 7.4% and 26.8% lower for C3 and C2, respectively.
Regarding WD, the results show the same trend as performance, namely, BDOT-NRR-FA EPI results are 1.5%, 9.8%, 7%, and 4% lower for C6, C5, C4, and C3, respectively, when considering multiprogrammed workloads, while for parallel workloads, the EPI values of the two techniques are very similar for C6 and C5, but BDOT-NRR-FA cannot achieve the efficiency of WD for the other cell configurations.

The energy results shown above do not consider any block power gating technique [38]. Assuming a more aggressive approach, where fine-grained block power gating is affordable [39], the benefits of BD-based techniques in terms of power and energy will be enhanced, as faulty entries do not consume static power during operation. Applying this technique, the EPI with BDOT-NRR-FA would be 6.2%, 6.7%, 7.2%, 6.3%, and 5.5% lower for the multiprogrammed workloads than the EPI values in Figure 10 with C6, C5, C4, C3, and C2 cells, respectively. The same tendency is observed in the parallel workload results. Figure 11 compares the EPI values with BD and BDOT-NRR-FA when implementing block power gating with those obtained with WD. We observe that for multiprogrammed workloads all the cell configurations achieve significant improvements in terms of EPI with respect to WD, and in the case of parallel workloads, only the C2 configuration is not able to reach the WD efficiency.

Figure 11: Normalized EPI (average) with respect to word disabling, when implementing fine-grained block power gating.

#### 9. Related Work

Conventional 6T SRAM cells fail to operate reliably in the near-threshold regime, as the ratio constraints for read stability and writability of transistors cannot be guaranteed, especially in view of $V_{th}$ variations. Prior proposals to mitigate the impact of SRAM cell failures due to parameter variation at ultra-low voltages can be categorized into circuit and architectural solutions.
Circuit solutions include methods that improve the bit cell resilience by increasing its size [4], or by adding assist/spare circuitry [18, 3]. Increasing the cell size or the number of transistors per cell comes at the cost of significant increases in the SRAM area (lower density) and power consumption. Since the LLC accounts for much of the die size, increasing its area (e.g., ST SRAM cells [3] double the area of the SRAM structure) is not a design option.

Architectural solutions include redundancy through ECCs, disabling techniques, and duplication mechanisms. Our proposal fits in this category.

ECCs are extensively employed to protect designs against soft errors. Some studies have extended the use of ECCs to protect against hard errors when running at ultra-low voltages [40, 8]. ECCs are usually optimized to minimize their storage requirements, at the cost of complex logic to detect and correct errors. Thus, the ability to detect and correct more errors comes at the cost of increasing the complexity of the decoding stage, or the storage requirements of the check bits [8]. Our proposal is orthogonal to the use of ECCs to provide more functional entries (or any other technique that increases the number of functional entries), as it adapts seamlessly to the amount of functional and non-functional data entries in the cache.

Regarding BD [11], Lee et al. examine the performance degradation of disabling cache lines, sets, ways, ports, or the complete cache in a single processor environment [7]. Ladas et al. implement a victim cache to compensate for the loss in associativity [15]. Our approach also relies on BD, but does not require any additional structures.

Ghasemi et al. propose the use of heterogeneous cell sizes for operation in low-power mode. A mixed-cell memory design has also been proposed, where a small portion of the cache is implemented with robust cells, which store dirty cache blocks, and the remainder with non-robust cells [19].
They modify the replacement policy to guide the allocation of blocks based on the type of request (load or store). Zhou et al. combine spare cells, heterogeneous cell sizes, and ECCs into a hybrid design to improve on the effectiveness obtained by any single technique alone [4]. In contrast to these techniques, we do not rely on the existence of robust ways and we guide the allocation of blocks to faulty or operational LLC entries based on their reuse.

The granularity at which capacity is disabled could be finer, though this would add complexity to cache accesses. Word disabling tracks defects at word-level granularity, combining two consecutive cache entries into a single fault-free entry, halving both associativity and capacity [10]. Abella et al. propose to bypass faulty subentries rather than disable full cache lines, but this technique is suitable only for the first-level cache, where accesses are word wide [42]. Palframan et al. follow a similar approach, patching faulty words from other microprocessor structures, such as the store queue or the miss status holding register [43]. Ferrerón et al. compress cache blocks to fit them into faulty entries, allowing the utilization of 100% of the cache entries [37].

More complex schemes couple faulty cache entries using a remapping mechanism [9, 44–45]. They group collision-free cache entries (from the same or different cache banks) relying on the execution of a complex algorithm and structures to store the mapping strategy. Re-mapping mechanisms add a level of indirection to the cache access (increasing its latency), and the combination of cache entries to recreate a cache block adds complexity. Besides, several cache accesses are needed to obtain a fault-free cache block, increasing the energy consumption and/or the block access latency. Unlike the aforementioned proposals, we do not add any additional structures or re-mapping mechanisms, only minor changes to the coherence protocol and replacement policy.
In the context of ultra-low voltage, Keramidas et al. use a PC-indexed spatial predictor to orchestrate the replacement decisions among fully and partially usable entries in first-level caches [46]. We base our allocation predictions on reuse patterns, which simplifies the hardware, and we do not consider the use of partially faulty entries.

Regarding the implementation of our techniques, it is worth referring to the work of Jaleel et al. [17]. In inclusive hierarchies, the private caches filter the temporal locality and hot blocks (i.e., blocks being actively used by the processor) are degraded in the replacement algorithm of the LLC, eventually being evicted. They address this problem by protecting blocks present in the private caches and preventing their replacement in the LLC through several techniques, including: sending hints to the LLC, identifying temporal locality via early invalidation, and querying the private caches about the presence of blocks. We also protect private copies in all the replacement policies considered (including the baseline one), in our case by using the coherence information and assuming non-silent evictions of clean blocks.

Albericio et al. base replacement decisions on block reuse locality [20]. They propose the *Not-Recently Reused* (NRR) algorithm, which protects blocks present in the private caches and blocks that have shown reuse in the LLC. Their simple yet efficient implementation achieved better performance than more complex techniques such as RRIP [47]. Our proposal uses NRR as the base replacement policy.

#### 10. Summary and Conclusions

Voltage reduction has been the primary way to reduce power during recent decades, but ultra-deep-submicron technologies have suddenly stopped this trend because of problems with leakage and stability. Manufacturing-induced parameter variations make SRAM cells fail at lower voltages, meaning that they require a minimum voltage to operate reliably.
SRAM cell failures can be tolerated by deactivating faulty cache entries. This technique is called Block Disabling (BD) and requires only one bit per tag. Unfortunately, as the number of defective entries increases, so does performance degradation, and the energy saved from decreasing $V_{dd}$ does not compensate for the extra energy required for the additional main memory accesses.

The reduction in associativity and capacity experienced by inclusive LLCs extended with BD has two specific drawbacks in multicore systems. First, the number of inclusion victims in private L1 caches increases. Second, the MPKI values also grow, increasing LLC miss latency and main memory energy consumption.

To cope with the first problem, we propose Block Disabling with Operational Tags (BDOT), which uses robust cells to implement the LLC tag array. BDOT enables some cache blocks to be only in private levels by simply tracking their tags, and extends the existing cache-to-cache coherence service to clean blocks. Thus, with regard to inclusion victims, the LLC associativity is fully restored. BDOT requires a small amount of extra control, and it adds no storage overhead to BD (the bit that marks operative entries suffices to distinguish between LLC T and D entries). Any replacement algorithm may work with BDOT, and we have tested NRU and NRR, two low-cost state-of-the-art proposals for LLCs.

After the last copy L1 eviction of a block tracked by a T entry, a future reference to this block will involve an off-chip access even though we know that reuse chances are high. Hence, we tackle the second problem from the key observation that we can preserve the data on chip by exchanging the valuable, just evicted T entry block (promotion), for an L1 present D entry block (demotion).
Furthermore, if all blocks allocated to D entries lack L1 copies, we can still resort to demotion, losing effective on-chip capacity, assuming that an incoming L1 block showing reuse (second L1 replacement) is more valuable than any older block allocated to a D entry.

We have implemented these ideas in BDOT-NRR-FA, the fault-aware version of BDOT that selects for demotion a D entry victim block that has a backup copy in L1 (first criterion) and has not shown reuse in the LLC (second criterion). Compared to a BDOT LLC using NRR replacement, BDOT-NRR-FA improves performance and energy efficiency with no area overhead, because the bits per block required, namely the presence vector, operative entry, and reuse bits, are already required by the coherence mechanism, BD, and conventional replacement, respectively.

We tested our proposals against a wide range of multiprogrammed and parallel workloads and different $P_{fail}$ situations. Our best proposal, BDOT-NRR-FA, compared with BD, results in up to 37.3% and 54.2% lower MPKI values for multiprogrammed and parallel workloads, respectively. These decreases translate to performance improvements of up to 13% and 34.6%, respectively. Regarding energy use, our proposal decreases EPI by between 5.4% and 20.4% for multiprogrammed workloads, and by between 2% and 26.8% for parallel workloads. The largest savings come from LLCs with the most faulty cells, and gains are consistent across programs, making our proposal very suitable for the operation of multicore LLCs at low voltages for current and future technology nodes.

#### References

- [1] M. Taylor, A landscape of the new dark silicon design regime, IEEE Micro 33 (5) (2013) 8–19. doi:10.1109/MM.2013.90.
- [2] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, T. Mudge, Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits, Proc. of the IEEE 98 (2) (2010) 253–266. doi:10.1109/JPROC.2009.2034764.
- [3] J. Kulkarni, K. Kim, K.
Roy, A 160 mV robust Schmitt trigger based subthreshold SRAM, IEEE Journal of Solid-State Circuits 42 (10) (2007) 2303–2313. doi:10.1109/JSSC.
- [4] S.-T. Zhou, S. Katariya, H. Ghasemi, S. Draper, N. S. Kim, Minimizing total area of low-voltage SRAM arrays through joint optimization of cell size, redundancy, and ECC, in: IEEE Int. Conf. on Computer Design, 2010, pp. 112–117. doi:10.1109/ICCD.2010.5647605.
- [5] R. Kumar, G. Hinton, A family of 45nm IA processors, in: IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers, 2009, pp. 58–59. doi:10.1109/ISSCC.2009.4977306.
- [6] J. Chang, M. Huang, J. Shoemaker, J. Benoit, S.-L. Chen, W. Chen, S. Chiu, R. Ganesan, G. Leong, V. Lukka, S. Rusu, D. Srivastava, The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series, IEEE Journal of Solid-State Circuits 42 (4) (2007) 846–852.
- [7] H. Lee, S. Cho, B. Childers, Performance of graceful degradation for cache faults, in: IEEE Computer Society Annual Symp. on VLSI, 2007, pp. 409–415. doi:10.1109/ISVLSI.2007.81.
- [8] Z. Chishti, A. R. Alameldeen, C. Wilkerson, W. Wu, S.-L. Lu, Improving cache lifetime reliability at ultra-low voltages, in: 42nd Annual IEEE/ACM Int. Symp. on Microarchitecture, 2009, pp. 89–99. doi:10.1145/1669112.1669126.
- [9] A. Ansari, S. Feng, S. Gupta, S. Mahlke, Archipelago: A polymorphic cache design for enabling robust near-threshold operation, in: IEEE 17th Int. Symp. on High Performance Computer Architecture, 2011, pp. 539–550. doi:10.1109/HPCA.2011.5749758.
- [10] C. Wilkerson, H. Gao, A. R. Alameldeen, Z. Chishti, M. Khellah, S.-L. Lu, Trading off cache capacity for reliability to enable low voltage operation, in: 35th Annual Int. Symp. on Computer Architecture, 2008, pp. 203–214. doi:10.1109/ISCA.2008.22.
- [11] G. Sohi, Cache memory organization to enhance the yield of high performance VLSI processors, IEEE Trans. on Computers 38 (4) (1989) 484–492.
doi:10.1109/12.21141.
- [12] A. Ferrerón, D. Suárez Gracia, J. Alastruey-Benedé, T. Monreal, V. Viñals, Block disabling characterization and improvements in CMPs operating at ultra-low voltages, in: 2014 IEEE 26<sup>th</sup> International Symposium on Computer Architecture and High Performance Computing, 2014, pp. 238–245. doi:10.1109/SBAC-PAD.2014.12.
- [13] A. J. Bhavnagarwala, X. Tang, J. D. Meindl, The impact of intrinsic device fluctuations on CMOS SRAM cell stability, IEEE Journal of Solid-State Circuits 36 (4) (2001) 658–665. doi:10.1109/4.913744.
- [14] X. Tang, V. K. De, J. D. Meindl, Intrinsic MOSFET parameter fluctuations due to random dopant placement, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 5 (4) (1997) 369–376. doi:10.1109/92.645063.
- [15] Performance-effective operation below Vcc-min, in: IEEE Int. Symp. on Performance Analysis of Systems Software, 2010, pp. 223–234. doi:10.1109/ISPASS.2010.5452017.
- [16] J. L. Baer, W. Wang, On the inclusion properties for multi-level cache hierarchies, in: 15th Annual Int. Symp. on Computer Architecture, 1988, pp. 73–80. doi:10.1145/633625.52409.
- [17] A. Jaleel, E. Borch, M. Bhandaru, S. C. Steely Jr., J. Emer, Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies, in: 43rd Annual IEEE/ACM Int. Symp. on Microarchitecture, 2010, pp. 151–162. doi:10.1109/MICRO.2010.52.
- [18] L. Chang, R. Montoye, Y. Nakamura, K. Batson, R. Eickemeyer, R. Dennard, W. Haensch, D. Jamsek, An 8T-SRAM for variability tolerance and low-voltage operation in high-performance caches, IEEE Journal of Solid-State Circuits 43 (4) (2008) 956–963. doi:10.1109/JSSC.2007.917509.
- [19] S. M. Khan, A. R. Alameldeen, C. Wilkerson, J. Kulkarni, D. A. Jimenez, Improving multi-core performance using mixed-cell cache architecture, in: IEEE 19th Int. Symp. on High Performance Computer Architecture, 2013, pp. 119–130.
doi:10.1109/HPCA.2013.6522312.
- [20] J. Albericio, P. Ibáñez, V. Viñals, J. M. Llabería, Exploiting reuse locality on inclusive shared last-level caches, ACM Trans. Archit. Code Optim. 9 (4) (2013) 38:1–38:19. doi:10.1145/2400682.2400697.
- [21] M. Chaudhuri, J. Gaur, N. Bashyam, S. Subramoney, J. Nuzman, Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches, in: 21st Int. Conf. on Parallel Architectures and Compilation Techniques, 2014, pp. 293–304. doi:10.1145/2370816.2370860.
- [22] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., J. Emer, SHiP: Signature-based hit predictor for high performance caching, in: 44th Annual IEEE/ACM Int. Symp. on Microarchitecture, 2011, pp. 430–441. doi:10.1145/2155620.2155671.
- [23] S. M. Khan, Y. Tian, D. A. Jimenez, Sampling dead block prediction for last-level caches, in: 43rd Annual IEEE/ACM Int. Symp. on Microarchitecture, 2010, pp. 175–186. doi:10.1109/MICRO.2010.24.
- [24] J. Handy, The Cache Memory Book, Morgan Kaufmann, 198.
- [25] S. Microsystems, UltraSPARC T2 supplement to the UltraSPARC architecture, Draft d1.4.3, Sun Microsystems Inc. (2017).
- [26] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, B. Werner, Simics: A full system simulation platform, Computer 35 (2) (2002) 50–58. doi:10.1109/2.982916.
- [27] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, D. A. Wood, Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) toolset, SIGARCH Comput. Archit. News 33 (4) (2005) 92–99. doi:10.1145/1105734.1105747.
- [28] P. Rosenfeld, E. Cooper-Balis, B. Jacob, DRAMSim2: A cycle accurate memory system simulator, Computer Architecture Letters 10 (1) (2011) 16–19. doi:10.1109/L-CA.2011.4.
- [29] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, N. P.
Jouppi, McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, in: 42nd Annual IEEE/ACM Int. Symp. on Microarchitecture, 2009, pp. 469–480. doi:10.1145/1669112.1669172.
- [30] J. L. Henning, SPEC CPU2006 benchmark descriptions, SIGARCH Comput. Archit. News 34 (4) (2006) 1–17. doi:10.1145/1186736.1186737.
- [31] C. Bienia, S. Kumar, J. P. Singh, K. Li, The PARSEC benchmark suite: characterization and architectural implications, in: 17th Int. Conf. on Parallel Architectures and Compilation Techniques, 2008, pp. 72–81. doi:http://doi.acm.org/10.1145/1454115.1454128.
- [32] D. Sánchez, Y. Sazeides, J. M. Cebrián, J. M. García, J. L. Aragón, Modeling the impact of permanent faults in caches, ACM Trans. on Architecture and Code Optimization 10 (4) (2013) 29:1–29:23. doi:10.1145/2541228.2541236.
- [33] D. Hardy, I. Sideris, N. Ladas, Y. Sazeides, The performance vulnerability of architectural and non-architectural arrays to permanent faults, in: 45th Annual IEEE/ACM Int. Symp. on Microarchitecture, 2012, pp. 48–59. doi:10.1109/MICRO.2012.14.
- [34] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, 1991.
- [35] A. Agarwal, B. Paul, H. Mahmoodi, A. Datta, K. Roy, A process-tolerant cache architecture for improved yield in nanoscale technologies, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 13 (1) (2005) 27–38. doi:10.1109/TVLSI.2004.840407.
- [36] L. Cheng, P. Gupta, C. J. Spanos, K. Qian, L. He, Physically justifiable die-level modeling of spatial variation in view of systematic across wafer variability, IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 30 (3) (2011) 388–401. doi:10.1109/TCAD.2010.2089568.
- [37] A. Ferrerón, D. Suárez, J. Alastruey, T. Monreal, P. Ibáñez, Concertina: Squeezing in cache content to operate at near-threshold voltage, IEEE Trans. on Computers 65 (3) (2016) 755–769.
doi:10.1109/TC.2015.2479585.
- [38] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, T. N. Vijaykumar, Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories, in: Int. Symp. on Low Power Electronics and Design, 2000, pp. 90–95. doi:10.1109/LPE.2000.155259.
- [39] M. Gottscho, A. BanaiyanMofrad, N. Dutt, A. Nicolau, P. Gupta, DPCS: Dynamic power/capacity scaling for SRAM caches in the nanoscale era, ACM Trans. on Architecture and Code Optimization 12 (2015) 27:1–27:26. doi:10.1145/2792982.
- [40] A. Alameldeen, I. Wagner, Z. Chishti, W. Wu, C. Wilkerson, S.-L. Lu, Energy-efficient cache design using variable-strength error-correcting codes, in: 38th Annual Int. Symp. on Computer Architecture, 2011, pp. 461–471. doi:10.1145/2000064.2000118.
- [41] H. Ghasemi, S. Draper, N. S. Kim, Low-voltage on-chip cache architecture using heterogeneous cell sizes for high performance processors, in: IEEE 17th Int. Symp. on High Performance Computer Architecture, 2011, pp. 38–49. doi:10.1109/HPCA.2011.5749.5.
- [42] J. Abella, J. Carretero, P. Chaparro, X. Vera, A. González, Low Vccmin fault-tolerant cache with highly predictable performance, in: 42nd Annual IEEE/ACM Int. Symp. on Microarchitecture, 2009, pp. 111–121. doi:10.1145/1669112.1669128.
- [43] D. J. Palframan, N. Kim, M. H. Lipasti, iPatch: Intelligent fault patching to improve energy efficiency, in: IEEE 21st Int. Symp. on High Performance Computer Architecture, 2015, pp. 428–438. doi:10.1109/HPCA.2015.705605
- [44] C.-K. Koh, W.-F. Wong, Y. Chen, H. Li, Tolerating process variations in large, set-associative caches: The buddy cache, ACM Trans. on Architecture and Code Optimization 6 (2) (2009) 8:1–8:34. doi:10.1145/1543753.
- [45] T. Mahmood, S. Kim, S. Hong, Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling, in: IEEE 19th Int. Symp. on High Performance Computer Architecture, 2013, pp. 532–541. doi:10.1109/HPCA.2013.6522347.
- [46] G.
Keramidas, M. Mavropoulos, A. Karvouniari, D. Nikolos, Spatial pattern prediction based management of faulty data caches, in: Conference on Design, Automation & Test in Europe, 2014, pp. 60:1–6. doi:http://dl.acm.org/citation.cfm?id=2616606.2616680.
- [47] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., J. Emer, High performance cache replacement using re-reference interval prediction (RRIP), in: 37th Annual Int. Symp. on Computer Architecture, 2010, pp. 60–71. doi:10.1145/1816038.1815971.

Alexandra Ferrerón received the MS and PhD degrees in computer science engineering from the Universidad de Zaragoza, Spain, in 2013 and 2016, respectively. Her research interests include high-performance low-power on-chip memory hierarchies, ultra-low and near-threshold voltage computing, and High Performance Computing. She currently works as a Site Reliability Engineer for BigQuery (Google Cloud Platform) at Google Switzerland.

Jesús Alastruey-Benedé received the Telecommunications Engineering degree and the PhD degree in Computer Science from the Universidad de Zaragoza, Spain, in 1997 and 2009, respectively. He is a Lecturer in the Departamento de Informática e Ingeniería de Sistemas (DIIS), Universidad de Zaragoza, Spain. His research interests include processor microarchitecture, memory hierarchy, and High Performance Computing (HPC) applications. He is a member of the Instituto de Investigación en Ingeniería de Aragón (I3A) and the European HiPEAC NoE.

Darío Suárez-Gracia (S'08, M'12) received the PhD degree in computer engineering from the Universidad de Zaragoza, Spain, in 2011. From 2012 to 2015, he worked at Qualcomm Research Silicon Valley on power-aware parallel and heterogeneous computing for mobile devices. Currently, he is an assistant professor at the Universidad de Zaragoza in Spain.
His research interests include parallel programming, heterogeneous computing, memory hierarchy design, networks-on-chip, and accelerators for computer vision applications. He is also a member of the IEEE, the IEEE Computer Society, and the Association for Computing Machinery.

Teresa Monreal-Arnal received the MS degree in Mathematics and the PhD degree in Computer Science from the University of Zaragoza, Spain, in 1991 and 2003, respectively. Until 2007, she was with the Informática e Ingeniería de Sistemas Department (DIIS) at the University of Zaragoza, Spain. Currently, she is an Associate Professor with the Computer Architecture Department (DAC) at the Universitat Politècnica de Catalunya (UPC), Spain. Her research interests include processor microarchitecture, memory hierarchy, and parallel computer architecture. She collaborates actively with the Grupo de Arquitectura de Computadores from the University of Zaragoza.

Pablo Ibáñez received the MS degree in Computer Science from the Universitat Politècnica de Catalunya in 1989, and the PhD degree in Computer Science from the Universidad de Zaragoza in 1998. He is an Associate Professor in the Departamento de Informática e Ingeniería de Sistemas (DIIS) at the Universidad de Zaragoza, Spain. His research interests include processor microarchitecture, memory hierarchy, parallel computer architecture, and High Performance Computing (HPC) applications. He is a member of the Instituto de Investigación en Ingeniería de Aragón (I3A) and the European HiPEAC NoE.

Víctor Viñals-Yúfera received the MS degree in Telecommunications and the PhD degree in Computer Science from the Universitat Politècnica de Catalunya (UPC), in 1982 and 1987, respectively. He was an associate professor in the Facultat d'Informàtica de Barcelona from 1983 to 1988.
Currently, he is a full professor in the Informática e Ingeniería de Sistemas Department at the University of Zaragoza (Spain). His research interests include processor microarchitecture, memory hierarchy, and parallel computer architecture. He is a member of the ACM, the IEEE Computer Society, and HiPEAC. He also belongs to the Computer Architecture Group and the I3A Institute of the University of Zaragoza.