Squire: a general-purpose accelerator to exploit fine-grain parallelism on dependency-bound kernels

Langarita, Rubén; Marco-Sola, Santiago; Moretó, Miquel; Armejach, Adrià; Ibáñez-Marín, Pablo; Alastruey-Benedé, Jesús
doi:10.1109/PACT65351.2025.00035
000168108 001__ 168108
000168108 005__ 20260126155509.0
000168108 0247_ $$2doi$$a10.1109/PACT65351.2025.00035
000168108 0248_ $$2sideral$$a147666
000168108 037__ $$aART-2025-147666
000168108 041__ $$aeng
000168108 100__ $$aLangarita, Rubén
000168108 245__ $$aSquire: a general-purpose accelerator to exploit fine-grain parallelism on dependency-bound kernels
000168108 260__ $$c2025
000168108 5060_ $$aAccess copy available to the general public$$fUnrestricted
000168108 5203_ $$aMany HPC applications are bottlenecked by compute-intensive kernels that implement complex dependency patterns (that is, they are data-dependency bound). Traditional general-purpose accelerators struggle to exploit fine-grain parallelism effectively, either because they are limited in implementing convoluted data-dependency patterns (as with SIMD units) or because of synchronization and data-transfer overheads (as with GPGPUs). In contrast, custom FPGA and ASIC designs offer improved performance and energy efficiency, but at a high cost in hardware design and programming complexity, and they often lack the flexibility to process different workloads. We propose Squire, a general-purpose accelerator designed to exploit fine-grain parallelism effectively on dependency-bound kernels. Each Squire accelerator has a set of general-purpose low-power in-order cores that can rapidly communicate among themselves and directly access data from the L2 cache. Our proposal integrates one Squire accelerator per core in a typical multicore system, allowing the acceleration of dependency-bound kernels within parallel tasks with minimal software changes. As a case study, we evaluate Squire’s effectiveness by accelerating five kernels that implement complex dependency patterns. We use three of these kernels to build an end-to-end read-mapping tool, which we also use to evaluate Squire. Squire obtains speedups of up to 7.64× on dynamic programming kernels and an overall speedup of 3.66× on the end-to-end application. In addition, Squire reduces energy consumption by up to 56% with a minimal area overhead of 10.5% compared to a Neoverse N1 baseline.
000168108 536__ $$9info:eu-repo/grantAgreement/ES/AEI/PID2022-136454NB-C22$$9info:eu-repo/grantAgreement/ES/AEI/PID2023-146193OB-I00$$9info:eu-repo/grantAgreement/ES/DGA/T58-23R$$9info:eu-repo/grantAgreement/ES/MICIU/PID2023-146511NB-I00
000168108 540__ $$9info:eu-repo/semantics/openAccess$$aAll rights reserved$$uhttp://www.europeana.eu/rights/rr-f/
000168108 655_4 $$ainfo:eu-repo/semantics/conferenceObject$$vinfo:eu-repo/semantics/acceptedVersion
000168108 700__ $$0(orcid)0000-0003-4164-5078$$aAlastruey-Benedé, Jesús$$uUniversidad de Zaragoza
000168108 700__ $$0(orcid)0000-0002-5916-7898$$aIbáñez-Marín, Pablo$$uUniversidad de Zaragoza
000168108 700__ $$aMarco-Sola, Santiago
000168108 700__ $$aMoretó, Miquel
000168108 700__ $$aArmejach, Adrià
000168108 7102_ $$15007$$2035$$aUniversidad de Zaragoza$$bDpto. Informát.Ingenie.Sistms.$$cÁrea Arquit.Tecnología Comput.
000168108 773__ $$g(2025), 292-305$$tProceedings of the Conference on Parallel Architectures and Compilation Techniques$$x1089-795X
000168108 8564_ $$s453082$$uhttps://zaguan.unizar.es/record/168108/files/texto_completo.pdf$$yPostprint
000168108 8564_ $$s2987409$$uhttps://zaguan.unizar.es/record/168108/files/texto_completo.jpg?subformat=icon$$xicon$$yPostprint
000168108 909CO $$ooai:zaguan.unizar.es:168108$$particulos$$pdriver
000168108 951__ $$a2026-01-26-14:50:40
000168108 980__ $$aARTICLE