TAZ-TFM-2024-1587

Language-aware neural feature fusion fields for egocentric video

de Nova Guerrero, Alejandro
Guerrero Campo, José Jesús (dir.) ; Mur Labadía, Lorenzo (dir.)

Universidad de Zaragoza, EINA, 2024
Informática e Ingeniería de Sistemas department, Ingeniería de Sistemas y Automática area

Máster Universitario en Robótica, Gráficos y Visión por Computador

Abstract: Environment understanding is crucial in many applications, from navigation to robot manipulation. Scene comprehension comes naturally to human beings, but it is a very difficult task for machines. Egocentric videos show the world from a first-person view, in which an actor interacts with the environment and its objects. Neural rendering techniques can learn the geometry of a scene but lack semantic understanding, so much of the environment's meaning is lost. Existing works either focus on isolated or short egocentric clips, or handle long sequences but lack semantic information. In this project we present Dynamic Image-Feature Fields (DI-FF), a neural network able to reconstruct an egocentric video by decomposing it into its three principal components: static objects, foreground objects and the person moving around the scene. Furthermore, it incorporates language-image features that contribute to semantic understanding of the egocentric video. By computing relevancy maps, DI-FF is able to perform detailed dynamic object segmentation from open-vocabulary text queries, retaining knowledge of surrounding objects that are barely visible from the current viewpoint. It also enables affordance segmentation for simple action queries. To achieve this, the model is composed of three neural network streams, one for each scene component (background, foreground and actor), each of which outputs pixel color independently so that the scene can be decomposed. Additionally, each of these networks predicts CLIP language embeddings to allow object recognition from text queries. With this implementation, DI-FF outperforms state-of-the-art methods, representing a robust basis for future work on environment understanding in egocentric video.
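The abstract does not state how the relevancy maps are computed. As an illustration only, the sketch below shows one formulation used by some language-embedded field methods: each pixel's predicted embedding is compared with the text query embedding and with a few generic "canonical" phrase embeddings, and the score is the lowest pairwise softmax probability in favour of the query. The function name `relevancy_map`, the `temperature` value, the 0.5 threshold and the random placeholder embeddings are all assumptions made for this example; they are not taken from the thesis, and a real pipeline would obtain the embeddings from CLIP.

```python
import numpy as np


def relevancy_map(pixel_feats, query_emb, canon_embs, temperature=0.1):
    """Hypothetical relevancy score per pixel (not the thesis code).

    pixel_feats: (H, W, D) per-pixel language embeddings rendered by the model
    query_emb:   (D,)      embedding of the text query
    canon_embs:  (K, D)    embeddings of generic phrases used as negatives
    Returns an (H, W) array with values in (0, 1].
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    f = normalize(pixel_feats)
    q = normalize(query_emb)
    c = normalize(canon_embs)

    sim_q = f @ q        # cosine similarity to the query, shape (H, W)
    sim_c = f @ c.T      # cosine similarity to each negative, shape (H, W, K)

    # Pairwise softmax: probability of the query versus each negative phrase.
    e_q = np.exp(sim_q / temperature)[..., None]
    e_c = np.exp(sim_c / temperature)
    pairwise = e_q / (e_q + e_c)

    # Keep the most pessimistic comparison for every pixel.
    return pairwise.min(axis=-1)


# Placeholder data standing in for rendered features and CLIP text embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5, 8))
query = rng.normal(size=8)
canon = rng.normal(size=(3, 8))
feats[2, 3] = query      # make one pixel match the query exactly

rel = relevancy_map(feats, query, canon)
mask = rel > 0.5         # assumed threshold for a binary segmentation mask
```

Thresholding the map, as in the last line, gives a binary segmentation for the query. Under the three-stream design described in the abstract, such a map could in principle be computed from the embeddings of any one stream or from their combination, but the abstract does not say which.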

Universidad de Zaragoza Repository
