000168247 001__ 168247
000168247 005__ 20260130124311.0
000168247 0247_ $$2doi$$a10.1016/j.cviu.2025.104622
000168247 0248_ $$2sideral$$a147757
000168247 037__ $$aART-2026-147757
000168247 041__ $$aeng
000168247 100__ $$aPlou, Carlos$$uUniversidad de Zaragoza
000168247 245__ $$aTemporal video segmentation with natural language using text–video cross attention and Bayesian order-priors
000168247 260__ $$c2026
000168247 5060_ $$aAccess copy available to the general public$$fUnrestricted
000168247 5203_ $$aVideo is a crucial perception component in both robotics and wearable devices, two key technologies for enabling innovative assistive applications, such as navigation and procedure execution assistance tools. Video understanding tasks are essential to enable these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities based on natural language descriptions in long, untrimmed videos. This paper introduces BayesianVSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset and winning the Goal-Step challenge at the EgoVis workshop at CVPR 2024. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications.
000168247 536__ $$9info:eu-repo/grantAgreement/ES/DGA/T45-23R$$9info:eu-repo/grantAgreement/ES/MCIU/PID2024-159284NB-I00$$9info:eu-repo/grantAgreement/ES/MICINN-AEI/PID2021-125209OB-I00$$9info:eu-repo/grantAgreement/ES/MICINN/PID2021-125514NB-I00$$9info:eu-repo/grantAgreement/ES/MICINN/PID2024-158322OB-I00
000168247 540__ $$9info:eu-repo/semantics/openAccess$$aby-nc$$uhttps://creativecommons.org/licenses/by-nc/4.0/deed.es
000168247 655_4 $$ainfo:eu-repo/semantics/article$$vinfo:eu-repo/semantics/publishedVersion
000168247 700__ $$aMur-Labadia, Lorenzo$$uUniversidad de Zaragoza
000168247 700__ $$0(orcid)0000-0001-5209-2267$$aGuerrero, Jose J.$$uUniversidad de Zaragoza
000168247 700__ $$0(orcid)0000-0002-6741-844X$$aMartinez-Cantin, Ruben$$uUniversidad de Zaragoza
000168247 700__ $$0(orcid)0000-0002-7580-9037$$aMurillo, Ana C.$$uUniversidad de Zaragoza
000168247 7102_ $$15007$$2520$$aUniversidad de Zaragoza$$bDpto. Informát.Ingenie.Sistms.$$cÁrea Ingen.Sistemas y Automát.
000168247 773__ $$g264 (2026), 104622 [11 pp.]$$pComput. vis. image underst.$$tCOMPUTER VISION AND IMAGE UNDERSTANDING$$x1077-3142
000168247 787__ $$tBayesianVSLNet - Temporal Video Segmentation with Natural Language using Text-Video Cross Attention and Bayesian Order-priors$$whttps://github.com/cplou99/BayesianVSLNet
000168247 8564_ $$s3180617$$uhttps://zaguan.unizar.es/record/168247/files/texto_completo.pdf$$yVersión publicada
000168247 8564_ $$s2756806$$uhttps://zaguan.unizar.es/record/168247/files/texto_completo.jpg?subformat=icon$$xicon$$yVersión publicada
000168247 909CO $$ooai:zaguan.unizar.es:168247$$particulos$$pdriver
000168247 951__ $$a2026-01-30-12:20:41
000168247 980__ $$aARTICLE