Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors

Plou, Carlos; Mur-Labadia, Lorenzo; Murillo, Ana C.; Martinez-Cantin, Ruben; Guerrero, Jose J.

doi:10.1016/j.cviu.2025.104622

Plou, Carlos (Universidad de Zaragoza) ; Mur-Labadia, Lorenzo (Universidad de Zaragoza) ; Guerrero, Jose J. (Universidad de Zaragoza) ; Martinez-Cantin, Ruben (Universidad de Zaragoza) ; Murillo, Ana C. (Universidad de Zaragoza)

Abstract: Video is a crucial perception component in both robotics and wearable devices, two key technologies for enabling innovative assistive applications such as navigation and procedure-execution assistance tools. Video understanding tasks are essential for these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities in long, untrimmed videos based on natural language descriptions. This paper introduces BayesianVSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions, which frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset and winning the Goal-Step challenge at EgoVis 2024 (CVPR). Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications.
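The refinement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian shape of the prior, the assumption that steps are evenly spaced along the video, and the names `order_prior`, `posterior`, `best_frame` and `sigma_frac` are all hypothetical choices made for illustration. The sketch only shows how multiplying a per-frame likelihood by a temporal-order prior can separate two equally likely occurrences of a repeated action.

```python
import math


def order_prior(num_frames, step_index, num_steps, sigma_frac=0.15):
    """Hypothetical temporal-order prior over frames.

    Assumes (for illustration only) that steps are evenly spaced, so step
    `step_index` of `num_steps` is expected near the matching fraction of
    the video. Returns a normalized Gaussian-shaped distribution.
    """
    center = (step_index + 0.5) / num_steps * (num_frames - 1)
    sigma = sigma_frac * num_frames
    weights = [math.exp(-0.5 * ((t - center) / sigma) ** 2)
               for t in range(num_frames)]
    total = sum(weights)
    return [w / total for w in weights]


def posterior(likelihood, prior):
    """Bayes' rule per frame: posterior is proportional to likelihood * prior."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)
    if evidence == 0:
        raise ValueError("likelihood and prior have no overlapping support")
    return [u / evidence for u in unnormalized]


def best_frame(distribution):
    """Index of the most probable frame."""
    return max(range(len(distribution)), key=distribution.__getitem__)


# A repeated action: the likelihood has two equal peaks (frames 2 and 17).
likelihood = [0.02] * 20
likelihood[2] = likelihood[17] = 0.32

# The same likelihood resolves differently depending on the step's order.
early = posterior(likelihood, order_prior(20, step_index=0, num_steps=4))
late = posterior(likelihood, order_prior(20, step_index=3, num_steps=4))
print(best_frame(early), best_frame(late))
```

With the likelihood alone, frames 2 and 17 are indistinguishable; under these assumptions the order prior moves the posterior mass toward the occurrence that matches the step's position in the procedure.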
Language: English
DOI: 10.1016/j.cviu.2025.104622
Year: 2026
Published in: COMPUTER VISION AND IMAGE UNDERSTANDING 264 (2026), 104622 [11 pp.]
ISSN: 1077-3142
Funding: info:eu-repo/grantAgreement/ES/DGA/T45-23R
Funding: info:eu-repo/grantAgreement/ES/MCIU/PID2024-159284NB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN-AEI/PID2021-125209OB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN/PID2021-125514NB-I00
Funding: info:eu-repo/grantAgreement/ES/MICINN/PID2024-158322OB-I00
Type and form: Article (Published version)
Area (Department): Área Ingen.Sistemas y Automát. (Dpto. Informát.Ingenie.Sistms.)
Associated dataset: BayesianVSLNet - Temporal Video Segmentation with Natural Language using Text-Video Cross Attention and Bayesian Order-priors (https://github.com/cplou99/BayesianVSLNet)
Exported from SIDERAL (2026-01-30-12:20:41)


Record created 2026-01-30, last modified 2026-01-30
