Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors

Plou, Carlos; Mur-Labadia, Lorenzo; Murillo, Ana C.; Martinez-Cantin, Ruben; Guerrero, Jose J.

doi:10.1016/j.cviu.2025.104622

Plou, Carlos (Universidad de Zaragoza); Mur-Labadia, Lorenzo (Universidad de Zaragoza); Guerrero, Jose J. (Universidad de Zaragoza); Martinez-Cantin, Ruben (Universidad de Zaragoza); Murillo, Ana C. (Universidad de Zaragoza)

Abstract: Video is a crucial perception component in both robotics and wearable devices, two key technologies for enabling innovative assistive applications such as navigation and procedure-execution assistance tools. Video understanding tasks are essential for these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities in long, untrimmed videos based on natural language descriptions. This paper introduces BayesianVSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset and winning the Goal-Step challenge at the EgoVis workshop at CVPR 2024. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications.
Language: English
DOI: 10.1016/j.cviu.2025.104622
Year: 2026
Published in: COMPUTER VISION AND IMAGE UNDERSTANDING 264 (2026), 104622 [11 pp.]
ISSN: 1077-3142
Funding: info:eu-repo/grantAgreement/ES/DGA/T45-23R
Funding: info:eu-repo/grantAgreement/ES/MCIU/PID2024-159284NB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN-AEI/PID2021-125209OB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN/PID2021-125514NB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN/PID2024-158322OB-I00
Type and form: Article (Published version)
Area (Department): Área Ingen.Sistemas y Automát. (Dpto. Informát.Ingenie.Sistms.)
Associated dataset: BayesianVSLNet - Temporal Video Segmentation with Natural Language using Text-Video Cross Attention and Bayesian Order-priors (https://github.com/cplou99/BayesianVSLNet)

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

Exported from SIDERAL (2026-01-30-12:20:41)

Record created 2026-01-30, last modified 2026-01-30

Universidad de Zaragoza Repository
