Class token and knowledge distillation for multi-head self-attention speaker verification systems

Mingote, Victoria; Lleida, Eduardo; Ortega, Alfonso; Miguel, Antonio

doi:10.1016/j.dsp.2022.103859

Mingote, Victoria (Universidad de Zaragoza) ; Miguel, Antonio (Universidad de Zaragoza) ; Ortega, Alfonso (Universidad de Zaragoza) ; Lleida, Eduardo (Universidad de Zaragoza)

Abstract: This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called the class token to replace the global average pooling mechanism for extracting the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a new sampling estimation of the class token. In this approach, the class token is obtained by sampling from a list of several trainable vectors. This strategy introduces uncertainty that helps the model generalize better than a single initialization, as shown in the experiments. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.
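
The token mechanics described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy two-dimensional vectors, the use of soft teacher probabilities as the distillation target, and the equal 0.5/0.5 loss weighting are all assumptions made for illustration, and the MSA layers themselves are omitted (the logits are passed in directly).

```python
import math
import random


def sample_class_token(token_bank, rng):
    """Pick one vector from a bank of candidate class tokens.

    In a real model the bank entries would be trainable parameters;
    here they are plain lists (assumption for illustration).
    """
    return rng.choice(token_bank)


def prepend_tokens(frames, class_token, distill_token):
    """Concatenate the class and distillation tokens in front of the frames."""
    return [class_token, distill_token] + frames


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def combined_loss(class_logits, distill_logits, true_label, teacher_logits):
    """Class token follows the true label; distillation token follows the teacher.

    The equal weighting of the two terms is a hypothetical choice.
    """
    one_hot = [1.0 if i == true_label else 0.0 for i in range(len(class_logits))]
    hard_term = cross_entropy(one_hot, class_logits)
    soft_term = cross_entropy(softmax(teacher_logits), distill_logits)
    return 0.5 * hard_term + 0.5 * soft_term


if __name__ == "__main__":
    rng = random.Random(0)
    bank = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]  # hypothetical token bank
    frames = [[1.0, 1.0], [2.0, 2.0]]            # hypothetical input frames
    sequence = prepend_tokens(frames, sample_class_token(bank, rng), [9.0, 9.0])
    print(len(sequence))
    print(combined_loss([2.0, 0.0], [2.0, 0.0], 0, [2.0, 0.0]))
```

In this sketch the sequence passed to the first attention layer is two positions longer than the input, and the outputs at those two positions would feed the two loss terms.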
Language: English
DOI: 10.1016/j.dsp.2022.103859
Year: 2023
Published in: DIGITAL SIGNAL PROCESSING 133 (2023), 103859 [10 pp.]
ISSN: 1051-2004
JCR impact factor: 2.9 (2023)
JCR category: ENGINEERING, ELECTRICAL & ELECTRONIC rank: 143 / 353 = 0.405 (2023) - Q2 - T2
CiteScore impact factor: 5.3 - Statistics, Probability and Uncertainty (Q1) - Computational Theory and Mathematics (Q1) - Applied Mathematics (Q1) - Signal Processing (Q2) - Computer Vision and Pattern Recognition (Q2) - Electrical and Electronic Engineering (Q2) - Artificial Intelligence (Q2)

SCImago impact factor: 0.799 - Applied Mathematics (Q2) - Artificial Intelligence (Q2) - Electrical and Electronic Engineering (Q2) - Statistics, Probability and Uncertainty (Q2) - Computer Vision and Pattern Recognition (Q2) - Signal Processing (Q2) - Computational Theory and Mathematics (Q2)

Funding: info:eu-repo/grantAgreement/ES/AEI/PDC2021-120846-C41
Funding: info:eu-repo/grantAgreement/ES/DGA/T36-20R
Funding: info:eu-repo/grantAgreement/EC/H2020/101007666/EU/Exchanges for SPEech ReseArch aNd TechnOlogies/ESPERANTO
Funding: info:eu-repo/grantAgreement/ES/MINECO/PRE2018-083312
Type and form: Article (final version)
Area (Department): Área Teoría Señal y Comunicac. (Dpto. Ingeniería Electrón.Com.)

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Exported from SIDERAL (2024-11-22-11:58:05)

Record created on 2022-12-21, last modified on 2024-11-25
