000009229 001__ 9229
000009229 037__ $$aTAZ-TFM-2012-919
000009229 041__ $$aeng
000009229 1001_ $$aFerrerón Labari, Alexandra
000009229 24500 $$aEfficient instruction and data caching for high-performance low-power embedded systems
000009229 260__ $$aZaragoza$$bUniversidad de Zaragoza$$c2012
000009229 506__ $$aby-nc-sa$$bCreative Commons$$c3.0$$uhttp://creativecommons.org/licenses/by-nc-sa/3.0/
000009229 500__ $$aAbstract also available in Spanish.
000009229 520__ $$aAlthough multi-threading processors can increase the performance of embedded systems with minimal overhead, fetching instructions from multiple threads each cycle also increases the pressure on the instruction cache, potentially harming the performance/consumption ratio. Instruction caches are responsible for a high percentage of the total energy consumption of the chip, which becomes a critical issue for battery-powered embedded devices. A direct way to reduce the energy consumption of the first-level instruction cache is to decrease its size and associativity. However, demanding applications, especially those with several threads running together, might suffer a dramatic performance slowdown, or even increase the total energy consumption of the cache hierarchy, due to the extra misses incurred. In this work we introduce iLP-NUCA (Instruction Light Power NUCA), a new instruction cache that replaces the conventional second-level cache (L2) and improves the Energy–Delay product of the system. We provide iLP-NUCA with a new tree-based transport network-in-cache that reduces both the cache line service latency and the energy consumption compared with the former LP-NUCA implementation. We modeled both conventional instruction hierarchies and iLP-NUCAs in our cycle-accurate simulation environment. Our experiments show that, running SPEC CPU2006, iLP-NUCA performs better and consumes less energy than a state-of-the-art high-performance conventional cache hierarchy (three cache levels, dedicated L1 and L2, shared L3). Furthermore, iLP-NUCA reaches, on average, the performance of a conventional instruction cache hierarchy implementing a double-sized L1, independently of the number of threads. This translates into a reduction of the Energy–Delay product of 21%, 18%, and 11%, reaching 90%, 95%, and 99% of the ideal performance for 1, 2, and 4 threads, respectively.
These results are consistent across the considered distribution of applications, and the largest gains appear in the most demanding applications (those with high instruction cache requirements). In addition, we increase the performance of applications with several threads without penalizing any of them. The new transport topology reduces the average service latency of cache lines by 8% and the energy consumption of its components by 20%.
000009229 521__ $$aMáster en Ingeniería de Sistemas e Informática
000009229 540__ $$aRights governed by a Creative Commons license
000009229 6531_ $$acomputer architecture
000009229 6531_ $$acache memory
000009229 6531_ $$amulti-thread
000009229 6531_ $$aembedded systems
000009229 700__ $$aSuárez Gracia, Darío$$edir.
000009229 7102_ $$aUniversidad de Zaragoza$$bInformática e Ingeniería de Sistemas$$cArquitectura y Tecnología de Computadores
000009229 7202_ $$aAlastruey Benedé, Jesús$$eponente
000009229 8560_ $$email@example.com
000009229 8564_ $$s966806$$uhttp://zaguan.unizar.es/TAZ/EINA/2012/9229/TAZ-TFM-2012-919.pdf$$zMemoria (eng)
000009229 8564_ $$s897144$$uhttp://zaguan.unizar.es/TAZ/EINA/2012/9229/TAZ-TFM-2012-919_ANE.pdf$$zAnexos (eng)
000009229 909CO $$ooai:zaguan.unizar.es:9229$$ppublic
000009229 909CO $$pTAZ
000009229 950__ $$a
000009229 980__ $$aTAZ$$bTFM$$cEINA