000112181 001__ 112181
000112181 005__ 20240319080957.0
000112181 0247_ $$2doi$$a10.1016/j.sysarc.2022.102398
000112181 0248_ $$2sideral$$a128413
000112181 037__ $$aART-2022-128413
000112181 041__ $$aeng
000112181 100__ $$aRodríguez, A.
000112181 245__ $$aLightweight asynchronous scheduling in heterogeneous reconfigurable systems
000112181 260__ $$c2022
000112181 5060_ $$aAccess copy available to the general public$$fUnrestricted
000112181 5203_ $$aThe trend for heterogeneous embedded systems is the integration of accelerators and general-purpose CPU cores on the same die. In these integrated architectures, like the Zynq UltraScale+ board (CPU+FPGA) that we target in this work, hardware support for shared memory and low-overhead synchronization between the accelerator and the CPU cores makes the case for exploring strategies that exploit a tight collaboration between the CPUs and the accelerator. In this paper we propose a novel lightweight scheduling strategy, FastFit, targeted to FPGA accelerators, and a new scheduler based on it, named MultiFastFit, which asynchronously tackles heterogeneous systems composed of a variety of CPU cores and FPGA IPs. Our strategy significantly reduces the overhead of automatically computing the near-optimal chunksizes when compared to a previous state-of-the-art auto-tuned approach, which makes our approach more suitable for fine-grained applications. Additionally, our scheduler MultiFastFit has been designed to enable the efficient co-execution of work among compute devices in such a way that all the devices are busy while minimizing the load imbalance. Our approaches have been evaluated using four benchmarks carefully tuned for the low-power UltraScale+ platform. Our experiments demonstrate that the FastFit strategy always finds the near-optimal FPGA chunksize for any device configuration at a reasonable cost, even for fine-grained and irregular applications, and that heterogeneous CPU+FPGA co-executions that exploit all the compute devices are usually faster and more energy-efficient than the CPU-only and FPGA-only executions. We have also compared MultiFastFit with other state-of-the-art scheduling strategies, finding that it outperforms another auto-tuned approach by up to 2x and achieves results similar to those of manually tuned schedulers without requiring an offline search of the ideal CPU-FPGA partition or FPGA chunk granularity. © 2022 The Authors
000112181 536__ $$9info:eu-repo/grantAgreement/ES/MICINN/PID2019-105396RB-I00
000112181 540__ $$9info:eu-repo/semantics/openAccess$$aby$$uhttp://creativecommons.org/licenses/by/3.0/es/
000112181 590__ $$a4.5$$b2022
000112181 592__ $$a1.276$$b2022
000112181 591__ $$aCOMPUTER SCIENCE, SOFTWARE ENGINEERING$$b22 / 108 = 0.204$$c2022$$dQ1$$eT1
000112181 593__ $$aSoftware$$c2022$$dQ1
000112181 591__ $$aCOMPUTER SCIENCE, HARDWARE & ARCHITECTURE$$b11 / 54 = 0.204$$c2022$$dQ1$$eT1
000112181 593__ $$aHardware and Architecture$$c2022$$dQ1
000112181 594__ $$a8.5$$b2022
000112181 655_4 $$ainfo:eu-repo/semantics/article$$vinfo:eu-repo/semantics/publishedVersion
000112181 700__ $$aNavarro, A.
000112181 700__ $$aNikov, K.
000112181 700__ $$aNunez-Yanez, J.
000112181 700__ $$0(orcid)0000-0002-4031-5651$$aGran Tejero, R.$$uUniversidad de Zaragoza
000112181 700__ $$0(orcid)0000-0002-7490-4067$$aSuárez Gracia, D.$$uUniversidad de Zaragoza
000112181 700__ $$aAsenjo, R.
000112181 7102_ $$15007$$2035$$aUniversidad de Zaragoza$$bDpto. Informát.Ingenie.Sistms.$$cÁrea Arquit.Tecnología Comput.
000112181 773__ $$g124 (2022), 102398 [14 pp]$$pJ. systems archit.$$tJournal of Systems Architecture$$x1383-7621
000112181 8564_ $$s2349544$$uhttps://zaguan.unizar.es/record/112181/files/texto_completo.pdf$$yVersión publicada
000112181 8564_ $$s2577679$$uhttps://zaguan.unizar.es/record/112181/files/texto_completo.jpg?subformat=icon$$xicon$$yVersión publicada
000112181 909CO $$ooai:zaguan.unizar.es:112181$$particulos$$pdriver
000112181 951__ $$a2024-03-18-13:43:25
000112181 980__ $$aARTICLE