Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

Gimeno, Pablo; Lleida, Eduardo; Miguel, Antonio; Viñals, Ignacio; Ortega, Alfonso

doi:10.1186/s13636-020-00172-6


Gimeno, Pablo (Universidad de Zaragoza) ; Viñals, Ignacio (Universidad de Zaragoza) ; Ortega, Alfonso (Universidad de Zaragoza) ; Miguel, Antonio (Universidad de Zaragoza) ; Lleida, Eduardo (Universidad de Zaragoza)

Abstract: This paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task, whose goal is to classify an audio signal as speech, music, noise, or a combination of these. The proposed system uses bidirectional long short-term memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module that gains long-term stability by means of the tied-state concept in hidden Markov models. We explore different neural architectures, introducing temporal pooling layers to reduce the sampling rate of the neural network output. Our findings show that removing redundant temporal information benefits the segmentation system, yielding a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 when running on CPU and below 0.03 when running on GPU. This new architecture, combined with a data-agnostic data augmentation technique called mixup, allows our system to achieve competitive results on both the Albayzín 2010 and 2012 evaluation datasets, with relative improvements of 19.72% and 5.35%, respectively, over the best results found in the literature for these databases.
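The mixup augmentation named in the abstract is commonly described as a convex combination of two training examples and of their labels, with a mixing weight drawn from a Beta distribution. The following is a minimal sketch of that general idea, not the authors' implementation: the function name `mixup_pair`, the list-based feature vectors, and the value `alpha=0.2` are assumptions made for illustration, since the abstract does not give these details.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) examples.

    alpha=0.2 is an illustrative assumption; the abstract does not
    state the value used in the paper.
    """
    # Mixing weight in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Apply the same convex combination to features and to labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Hypothetical usage: mix a "speech" frame with a "music" frame.
rng = random.Random(0)
speech_x, speech_y = [1.0, 0.0], [1.0, 0.0]
music_x, music_y = [0.0, 1.0], [0.0, 1.0]
mixed_x, mixed_y, lam = mixup_pair(speech_x, speech_y, music_x, music_y, rng=rng)
```

Because the labels are blended with the same weight as the features, the mixed label remains a valid distribution over classes (its entries still sum to 1).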
Language: English
DOI: 10.1186/s13636-020-00172-6
Year: 2020
Published in: EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING 2020 (2020), 5 [19 pp.]
ISSN: 1687-4714
JCR impact factor: 1.558 (2020)
JCR category: ENGINEERING, ELECTRICAL & ELECTRONIC rank: 198 / 273 = 0.725 (2020) - Q3 - T3
JCR category: ACOUSTICS rank: 20 / 32 = 0.625 (2020) - Q3 - T2
SCIMAGO impact factor: 0.259 - Electrical and Electronic Engineering (Q3) - Acoustics and Ultrasonics (Q3)

Funding: info:eu-repo/grantAgreement/ES/DGA-FEDER/T36-17R
Funding: info:eu-repo/grantAgreement/ES/MINECO/TIN2017-85854-C4-1-R
Type and form: Article (Final version)
Area (Department): Signal Theory and Communications (Dept. of Electronic Engineering and Communications)

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Exported from SIDERAL (2023-03-23-12:56:49)


Record created on 2020-05-07, last modified on 2023-03-23
