<?xml version="1.0" encoding="UTF-8"?>
<collection>
  <dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:invenio="http://invenio-software.org/elements/1.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:identifier>doi:10.1016/j.cviu.2025.104622</dc:identifier>
    <dc:language>eng</dc:language>
    <dc:creator>Plou, Carlos</dc:creator>
    <dc:creator>Mur-Labadia, Lorenzo</dc:creator>
    <dc:creator>Guerrero, Jose J.</dc:creator>
    <dc:creator>Martinez-Cantin, Ruben</dc:creator>
    <dc:creator>Murillo, Ana C.</dc:creator>
    <dc:title>Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors</dc:title>
    <dc:identifier>ART-2026-147757</dc:identifier>
    <dc:description>Video is a crucial perception component in both robotics and wearable devices, two key technologies for enabling innovative assistive applications such as navigation and procedure-execution assistance tools. Video understanding tasks are essential for these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities based on natural language descriptions in long, untrimmed videos. This paper introduces BayesianVSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset and winning the Goal-Step challenge at the EgoVis workshop, CVPR 2024. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications.</dc:description>
    <dc:date>2026</dc:date>
    <dc:source>http://zaguan.unizar.es/record/168247</dc:source>
    <dc:doi>10.1016/j.cviu.2025.104622</dc:doi>
    <dc:identifier>http://zaguan.unizar.es/record/168247</dc:identifier>
    <dc:identifier>oai:zaguan.unizar.es:168247</dc:identifier>
    <dc:relation>info:eu-repo/grantAgreement/ES/DGA/T45-23R</dc:relation>
    <dc:relation>info:eu-repo/grantAgreement/ES/MCIU/PID2024-159284NB-I00</dc:relation>
    <dc:relation>info:eu-repo/grantAgreement/ES/MICINN-AEI/PID2021-125209OB-I00</dc:relation>
    <dc:relation>info:eu-repo/grantAgreement/ES/MICINN/PID2021-125514NB-I00</dc:relation>
    <dc:relation>info:eu-repo/grantAgreement/ES/MICINN/PID2024-158322OB-I00</dc:relation>
    <dc:identifier.citation>COMPUTER VISION AND IMAGE UNDERSTANDING 264 (2026), 104622 [11 pp.]</dc:identifier.citation>
    <dc:rights>by-nc</dc:rights>
    <dc:rights>https://creativecommons.org/licenses/by-nc/4.0/deed.es</dc:rights>
    <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  </dc:dc>

</collection>