TAZ-TFM-2022-016


Modeling human visual behavior in dynamic 360° environments.

Bernal Berdún, Edurne
Martín Serrano, Daniel (dir.) ; Masiá Corcoy, Belén (dir.)

Universidad de Zaragoza, EINA, 2022
Departamento de Informática e Ingeniería de Sistemas, Área de Lenguajes y Sistemas Informáticos

Máster Universitario en Robótica, Gráficos y Visión por Computador

Abstract: Virtual reality (VR) is growing rapidly: advances in hardware, together with the high computational power available today, are driving this technology, which has the potential to change the way people consume content and has been predicted to become the next major computing paradigm. However, although VR has become accessible at a consumer level, much remains unknown about the grammar and visual language of this medium. Understanding and predicting how humans behave in virtual environments remains an open problem, since the visual behavior known for traditional screen-based content does not hold for immersive VR environments: in VR, the user has full control of the camera, so content creators cannot ensure where viewers' attention will be directed. This understanding of visual behavior, however, can be crucial in many applications, such as novel compression and rendering techniques, content design, or virtual tourism, among others.

Some works have been devoted to analyzing and modeling human visual behavior. Most of them focus on identifying the regions of the content that attract the observers' visual attention, resorting to saliency as a topological measure of which parts of a virtual scene may be of greater interest. When consuming virtual reality content, which can be either static (i.e., 360° images) or dynamic (i.e., 360° videos), many factors affect human visual behavior. They are mainly associated with the scene shown in the VR video or image (e.g., colors, shapes, movements), but also depend on the subjects observing it (their mood and background, the task being performed, previous knowledge, etc.). All these variables affecting saliency make its prediction a challenging task.

This master's thesis presents a novel saliency prediction model for VR videos based on a deep learning (DL) approach. DL networks have shown outstanding results in image processing tasks, automatically inferring the most relevant information from images. The proposed model is the first to exploit the joint potential of convolutional (CNN) and recurrent (RNN) neural networks to extract and model the inherent spatio-temporal features of videos, employing RNNs to account for temporal information at the time of feature extraction, rather than to post-process spatial features as in previous works. It is also tailored to the particularities of dynamic VR videos, using spherical convolutions and a novel spherical loss function for saliency prediction that operate in 3D space rather than in traditional image space. To facilitate spatio-temporal learning, this work is also the first to include the optical flow between 360° frames for saliency prediction, since movement is known to be a highly salient feature in dynamic content.

The proposed model was evaluated qualitatively and quantitatively, and shown to outperform state-of-the-art works. Moreover, an exhaustive ablation study demonstrates the effectiveness of the different design decisions made throughout the development of the model.
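The abstract mentions a spherical loss function that operates on the sphere rather than on the flat equirectangular image. The thesis code is not part of this record, so the snippet below is only a minimal sketch of one common way to build such a loss, assuming an equirectangular saliency map, a latitude-dependent solid-angle weighting, and a KL-divergence term; the function names (latitude_weights, spherical_kl_loss) and the choice of KL divergence are illustrative assumptions, not the loss actually proposed in the thesis.

    import math
    import torch

    def latitude_weights(height, width, device=None):
        # Solid-angle weight per row of an equirectangular map: rows near the
        # poles cover less area on the sphere than rows near the equator,
        # proportionally to cos(latitude).
        lat = (torch.arange(height, device=device) + 0.5) / height * math.pi - math.pi / 2
        w = torch.cos(lat).clamp(min=0.0)               # shape (H,)
        return w.view(1, height, 1).expand(1, height, width)

    def spherical_kl_loss(pred, target, eps=1e-8):
        # KL divergence between predicted and ground-truth saliency maps,
        # weighted by per-pixel solid angle so that the oversampling of the
        # equirectangular projection near the poles does not dominate the loss.
        # pred, target: (B, H, W) non-negative saliency maps.
        _, h, w = pred.shape
        weights = latitude_weights(h, w, device=pred.device)
        # Normalize each weighted map into a probability distribution on the sphere.
        p = pred * weights
        q = target * weights
        p = p / (p.sum(dim=(1, 2), keepdim=True) + eps)
        q = q / (q.sum(dim=(1, 2), keepdim=True) + eps)
        kl = (q * torch.log((q + eps) / (p + eps))).sum(dim=(1, 2))
        return kl.mean()

    if __name__ == "__main__":
        # Toy check with random non-negative maps (batch of 2, 64x128 equirectangular).
        pred = torch.rand(2, 64, 128)
        gt = torch.rand(2, 64, 128)
        print(spherical_kl_loss(pred, gt))

The cos(latitude) weighting compensates for the distortion of the equirectangular projection near the poles, which is the same distortion that motivates the spherical convolutions mentioned in the abstract.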

Type of academic work: Master's thesis (Trabajo Fin de Máster)

Creative Commons License



The record belongs to the following collections:
Academic works > Academic works by center > Escuela de Ingeniería y Arquitectura
Academic works > Master's theses


