<?xml version="1.0" encoding="UTF-8"?>
<collection>
<dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:invenio="http://invenio-software.org/elements/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:identifier>doi:10.1016/j.sysarc.2022.102398</dc:identifier><dc:language>eng</dc:language><dc:creator>Rodríguez, A.</dc:creator><dc:creator>Navarro, A.</dc:creator><dc:creator>Nikov, K.</dc:creator><dc:creator>Nunez-Yanez, J.</dc:creator><dc:creator>Gran Tejero, R.</dc:creator><dc:creator>Suárez Gracia, D.</dc:creator><dc:creator>Asenjo, R.</dc:creator><dc:title>Lightweight asynchronous scheduling in heterogeneous reconfigurable systems</dc:title><dc:identifier>ART-2022-128413</dc:identifier><dc:description>The trend for heterogeneous embedded systems is the integration of accelerators and general-purpose CPU cores on the same die. In these integrated architectures, like the Zynq UltraScale+ board (CPU+FPGA) that we target in this work, hardware support for shared memory and low-overhead synchronization between the accelerator and the CPU cores make the case for exploring strategies that exploit a tight collaboration between the CPUs and the accelerator. In this paper we propose a novel lightweight scheduling strategy, FastFit, targeted to FPGA accelerators, and a new scheduler based on it, named MultiFastFit, which asynchronously tackles heterogeneous systems composed of a variety of CPU cores and FPGA IPs. Our strategy significantly reduces the overhead of automatically computing near-optimal chunksizes compared to a previous state-of-the-art auto-tuned approach, which makes our approach more suitable for fine-grained applications. Additionally, our scheduler MultiFastFit has been designed to enable the efficient co-execution of work among compute devices in such a way that all the devices are kept busy while minimizing load imbalance.
Our approaches have been evaluated using four benchmarks carefully tuned for the low-power UltraScale+ platform. Our experiments demonstrate that the FastFit strategy always finds the near-optimal FPGA chunksize for any device configuration at a reasonable cost, even for fine-grained and irregular applications, and that heterogeneous CPU+FPGA co-executions that exploit all the compute devices are usually faster and more energy efficient than the CPU-only and FPGA-only executions. We have also compared MultiFastFit with other state-of-the-art scheduling strategies, finding that it outperforms another auto-tuned approach by up to 2x and achieves results similar to those of manually-tuned schedulers without requiring an offline search for the ideal CPU-FPGA partition or FPGA chunk granularity. © 2022 The Authors</dc:description><dc:date>2022</dc:date><dc:source>http://zaguan.unizar.es/record/112181</dc:source><dc:doi>10.1016/j.sysarc.2022.102398</dc:doi><dc:identifier>http://zaguan.unizar.es/record/112181</dc:identifier><dc:identifier>oai:zaguan.unizar.es:112181</dc:identifier><dc:relation>info:eu-repo/grantAgreement/ES/MICINN/PID2019-105396RB-I00</dc:relation><dc:identifier.citation>Journal of Systems Architecture 124 (2022), 102398 [14 pp]</dc:identifier.citation><dc:rights>by</dc:rights><dc:rights>http://creativecommons.org/licenses/by/3.0/es/</dc:rights><dc:rights>info:eu-repo/semantics/openAccess</dc:rights></dc:dc>

</collection>