Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors

Plou, Carlos; Mur-Labadia, Lorenzo; Murillo, Ana C.; Martinez-Cantin, Ruben; Guerrero, Jose J.

doi:10.1016/j.cviu.2025.104622


Plou, Carlos (Universidad de Zaragoza) ; Mur-Labadia, Lorenzo (Universidad de Zaragoza) ; Guerrero, Jose J. (Universidad de Zaragoza) ; Martinez-Cantin, Ruben (Universidad de Zaragoza) ; Murillo, Ana C. (Universidad de Zaragoza)

Abstract: Video is a crucial perception component in both robotics and wearable devices, two key technologies for enabling innovative assistive applications such as navigation and procedure-execution assistance tools. Video understanding tasks are essential for these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities in long, untrimmed videos based on natural language descriptions. This paper introduces BayesianVSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset and winning the Goal-Step challenge at the EgoVis workshop at CVPR 2024. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications.
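The refinement step described in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the form of the likelihood or of the temporal-order priors, so the sketch assumes a per-frame likelihood and a Gaussian prior over normalized video time. The names `order_prior`, `bayes_refine`, and `sigma` are hypothetical.

```python
import math


def order_prior(num_frames, step_index, num_steps, sigma=0.15):
    """Hypothetical order prior: a Gaussian over normalized time, centered
    where step `step_index` (0-based) would fall if the `num_steps` steps
    were evenly spaced in order. This form is an assumption for illustration."""
    center = (step_index + 0.5) / num_steps
    weights = [
        math.exp(-0.5 * (((t + 0.5) / num_frames - center) / sigma) ** 2)
        for t in range(num_frames)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def bayes_refine(likelihood, prior):
    """Bayes' rule per frame: posterior is proportional to likelihood * prior."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)
    if evidence == 0:
        raise ValueError("likelihood and prior have no overlapping support")
    return [u / evidence for u in unnormalized]


# Toy data (made up): a repeated action produces two equally strong
# likelihood peaks, at frames 2 and 7 of a 10-frame video.
likelihood = [0.01, 0.04, 0.40, 0.04, 0.01, 0.01, 0.04, 0.40, 0.04, 0.01]

# The queried step is the last of four ordered steps, so the prior
# favors the end of the video and suppresses the early peak.
prior = order_prior(num_frames=10, step_index=3, num_steps=4)
posterior = bayes_refine(likelihood, prior)
best_frame = max(range(len(posterior)), key=posterior.__getitem__)
```

In this toy case the likelihood alone cannot choose between the two peaks, while the posterior concentrates on the later one because the queried step is expected late in the procedure.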
Language: English
DOI: 10.1016/j.cviu.2025.104622
Year: 2026
Published in: Computer Vision and Image Understanding 264 (2026), 104622 [11 pp.]
ISSN: 1077-3142
Funding: info:eu-repo/grantAgreement/ES/DGA/T45-23R
Funding: info:eu-repo/grantAgreement/ES/MCIU/PID2024-159284NB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN-AEI/PID2021-125209OB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN/PID2021-125514NB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN/PID2024-158322OB-I00
Type and form: Article (final version)
Area (Department): Systems Engineering and Automation (Dept. of Computer Science and Systems Engineering)
Associated dataset: BayesianVSLNet - Temporal Video Segmentation with Natural Language using Text-Video Cross Attention and Bayesian Order-priors (https://github.com/cplou99/BayesianVSLNet)

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

Exported from SIDERAL (2026-01-30-12:20:41)

Record created on 2026-01-30, last modified on 2026-01-30
