000152305 001__ 152305
000152305 005__ 20250401114420.0
000152305 037__ $$aTAZ-TFM-2024-1587
000152305 041__ $$aeng
000152305 1001_ $$ade Nova Guerrero, Alejandro
000152305 24200 $$aLanguage-aware neural feature fusion fields for egocentric video
000152305 24500 $$aLanguage-aware neural feature fusion fields for egocentric video
000152305 260__ $$aZaragoza$$bUniversidad de Zaragoza$$c2024
000152305 506__ $$aby-nc-sa$$bCreative Commons$$c3.0$$uhttp://creativecommons.org/licenses/by-nc-sa/3.0/
000152305 520__ $$aEnvironment understanding is a crucial capability in many applications, from navigation to robot manipulation. Scene comprehension is inherent to human beings, but it is a very difficult task for machines. Egocentric videos show the world from a first-person view in which an actor interacts with the environment and its objects. Neural rendering techniques can learn scene geometry but lack semantic understanding, losing much of the environment's meaning. Some prior works focus on isolated or short egocentric clips, while others handle long recordings but lack semantic information. In this project we present Dynamic Image-Feature Fields (DI-FF), a neural network able to reconstruct an egocentric video by decomposing it into its three principal components: the static background, foreground objects, and the person moving around the scene. Furthermore, it incorporates language-image features that contribute to the semantic understanding of the egocentric video. By computing relevancy maps, DI-FF performs detailed dynamic object segmentation from open-vocabulary text queries, retaining knowledge of surrounding objects that are barely visible in the current viewpoint. It also enables affordance segmentation for simple action queries. To achieve this, the model is composed of three neural network streams, one per scene component (background, foreground, and actor), each predicting pixel color independently so that the scene can be decomposed. Additionally, each of these networks predicts CLIP language embeddings to allow object recognition from text queries. With this implementation, DI-FF outperforms the state of the art, providing a robust basis for future work in egocentric video environment understanding.
000152305 521__ $$aMáster Universitario en Robótica, Gráficos y Visión por Computador
000152305 540__ $$aRights regulated by Creative Commons license
000152305 691__ $$a0
000152305 692__ $$a
000152305 700__ $$aGuerrero Campo, José Jesús$$edir.
000152305 700__ $$aMur Labadía, Lorenzo$$edir.
000152305 7102_ $$aUniversidad de Zaragoza$$bInformática e Ingeniería de Sistemas$$cIngeniería de Sistemas y Automática
000152305 8560_ $$f716351@unizar.es
000152305 8564_ $$s9032497$$uhttps://zaguan.unizar.es/record/152305/files/TAZ-TFM-2024-1587.pdf$$yMemoria (eng)
000152305 909CO $$ooai:zaguan.unizar.es:152305$$pdriver$$ptrabajos-fin-master
000152305 950__ $$a
000152305 951__ $$adeposita:2025-04-01
000152305 980__ $$aTAZ$$bTFM$$cEINA
000152305 999__ $$a20241128181538.CREATION_DATE