Improving GPU cache hierarchy performance with a fetch and replacement cache

Candel, Francisco; Valero, Alejandro; Petit, Salvador; Sahuquillo, Julio
doi:10.1007/978-3-319-96983-1_17
000084363 001__ 84363
000084363 005__ 20250612142454.0
000084363 0247_ $$2doi$$a10.1007/978-3-319-96983-1_17
000084363 0248_ $$2sideral$$a108030
000084363 037__ $$aART-2018-108030
000084363 041__ $$aeng
000084363 100__ $$aCandel, Francisco
000084363 245__ $$aImproving GPU cache hierarchy performance with a fetch and replacement cache
000084363 260__ $$c2018
000084363 5060_ $$aAccess copy available to the general public$$fUnrestricted
000084363 5203_ $$aIn the last few years, GPGPU computing has become one of the most popular computing paradigms in high-performance computers due to its excellent performance-to-power ratio. The memory requirements of GPGPU applications differ widely from those of their CPU counterparts. The number of memory accesses is several orders of magnitude higher in GPU applications than in CPU applications, and they present disparate access patterns. Because of this, large and highly associative Last-Level Caches (LLCs) bring much lower performance gains in GPUs than in CPUs. This paper presents a novel approach to manage LLC misses that efficiently improves LLC hit ratio, memory-level parallelism, and miss latencies in GPU systems. The proposed approach leverages a small additional Fetch and Replacement Cache (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. Then, fetched blocks are swapped with victim blocks to be replaced in the LLC. After that, the eviction of victim blocks is performed from the FRC. This management approach improves performance for three main reasons: (i) the lifetime of blocks being replaced is increased, (ii) the main memory path is unclogged on long bursts of LLC misses, and (iii) the average L2 miss delaying latency is reduced. Experimental results show that our proposal increases the performance (OPC) by over 25% in most of the studied applications, reaching improvements of up to 150% in some applications.
000084363 536__ $$9info:eu-repo/grantAgreement/ES/MINECO/TIN2015-66972-C5-1-R$$9info:eu-repo/grantAgreement/ES/MINECO/TIN2016-76635-C2-1-R
000084363 540__ $$9info:eu-repo/semantics/openAccess$$aAll rights reserved$$uhttp://www.europeana.eu/rights/rr-f/
000084363 592__ $$a0.283$$b2018
000084363 593__ $$aTheoretical Computer Science$$c2018$$dQ2
000084363 593__ $$aComputer Science (miscellaneous)$$c2018$$dQ2
000084363 655_4 $$ainfo:eu-repo/semantics/article$$vinfo:eu-repo/semantics/acceptedVersion
000084363 700__ $$aPetit, Salvador
000084363 700__ $$0(orcid)0000-0002-0824-5833$$aValero, Alejandro$$uUniversidad de Zaragoza
000084363 700__ $$aSahuquillo, Julio
000084363 7102_ $$15007$$2035$$aUniversidad de Zaragoza$$bDpto. Informát.Ingenie.Sistms.$$cÁrea Arquit.Tecnología Comput.
000084363 773__ $$g11014 LNCS (2018), 235-248 [13 pp.]$$pLect. notes comput. sci.$$tLecture Notes in Computer Science$$x0302-9743
000084363 8564_ $$s397693$$uhttps://zaguan.unizar.es/record/84363/files/texto_completo.pdf$$yPostprint
000084363 8564_ $$s61606$$uhttps://zaguan.unizar.es/record/84363/files/texto_completo.jpg?subformat=icon$$xicon$$yPostprint
000084363 909CO $$ooai:zaguan.unizar.es:84363$$particulos$$pdriver
000084363 951__ $$a2025-06-12-14:23:36
000084363 980__ $$aARTICLE