Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

Gimeno, Pablo; Lleida, Eduardo; Miguel, Antonio; Viñals, Ignacio; Ortega, Alfonso
doi:10.1186/s13636-020-00172-6
000088514 001__ 88514
000088514 005__ 20230323131624.0
000088514 0247_ $$2doi$$a10.1186/s13636-020-00172-6
000088514 0248_ $$2sideral$$a117137
000088514 037__ $$aART-2020-117137
000088514 041__ $$aeng
000088514 100__ $$0(orcid)0000-0002-3142-0708$$aGimeno, Pablo$$uUniversidad de Zaragoza
000088514 245__ $$aMulticlass audio segmentation based on recurrent neural networks for broadcast domain data
000088514 260__ $$c2020
000088514 5060_ $$aAccess copy available to the general public$$fUnrestricted
000088514 5203_ $$aThis paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task, whose goal is to classify an audio signal as speech, music, noise or a combination of these. The proposed system uses bidirectional long short-term memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module, which gains long-term stability by means of the tied-state concept in hidden Markov models. We explore different neural architectures that introduce temporal pooling layers to reduce the output sampling rate of the neural network. Our findings show that removing redundant temporal information benefits the segmentation system, yielding a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 when running on CPU and below 0.03 when running on GPU. This new architecture, combined with a data-agnostic data augmentation technique called mixup, allows our system to achieve competitive results on both the Albayzín 2010 and 2012 evaluation datasets, with relative improvements of 19.72% and 5.35%, respectively, over the best results found in the literature for these databases.
000088514 536__ $$9info:eu-repo/grantAgreement/ES/DGA-FEDER/T36-17R$$9info:eu-repo/grantAgreement/ES/MINECO/TIN2017-85854-C4-1-R
000088514 540__ $$9info:eu-repo/semantics/openAccess$$aby$$uhttp://creativecommons.org/licenses/by/3.0/es/
000088514 590__ $$a1.558$$b2020
000088514 591__ $$aENGINEERING, ELECTRICAL & ELECTRONIC$$b198 / 273 = 0.725$$c2020$$dQ3$$eT3
000088514 591__ $$aACOUSTICS$$b20 / 32 = 0.625$$c2020$$dQ3$$eT2
000088514 592__ $$a0.259$$b2020
000088514 593__ $$aElectrical and Electronic Engineering$$c2020$$dQ3
000088514 593__ $$aAcoustics and Ultrasonics$$c2020$$dQ3
000088514 655_4 $$ainfo:eu-repo/semantics/article$$vinfo:eu-repo/semantics/publishedVersion
000088514 700__ $$0(orcid)0000-0003-1772-0605$$aViñals, Ignacio$$uUniversidad de Zaragoza
000088514 700__ $$0(orcid)0000-0002-3886-7748$$aOrtega, Alfonso$$uUniversidad de Zaragoza
000088514 700__ $$0(orcid)0000-0001-5803-4316$$aMiguel, Antonio$$uUniversidad de Zaragoza
000088514 700__ $$0(orcid)0000-0001-9137-4013$$aLleida, Eduardo$$uUniversidad de Zaragoza
000088514 7102_ $$15008$$2800$$aUniversidad de Zaragoza$$bDpto. Ingeniería Electrón.Com.$$cÁrea Teoría Señal y Comunicac.
000088514 773__ $$g2020 (2020), 5 [19 pp.]$$pEURASIP j. audio, speech music. process.$$tEURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING$$x1687-4714
000088514 8564_ $$s1169246$$uhttps://zaguan.unizar.es/record/88514/files/texto_completo.pdf$$yPublished version
000088514 8564_ $$s48179$$uhttps://zaguan.unizar.es/record/88514/files/texto_completo.jpg?subformat=icon$$xicon$$yPublished version
000088514 909CO $$ooai:zaguan.unizar.es:88514$$particulos$$pdriver
000088514 951__ $$a2023-03-23-12:56:49
000088514 980__ $$aARTICLE
Universidad de Zaragoza Repository