unsupervised learning; acoustic unit descriptors; dysarthric speech; non-negative matrix factorization
Abstract:
In this paper, we investigate unsupervised acoustic model training approaches for dysarthric speech recognition. Three models are compared: first, frame-based Gaussian posteriorgrams obtained from vector quantization (VQ); second, so-called Acoustic Unit Descriptors (AUDs), which are hidden Markov models of phone-like units trained in an unsupervised fashion; and third, posteriorgrams computed on the AUDs. Experiments were carried out on a database collected from a home automation task and containing nine speakers, seven of whom are considered to utter dysarthric speech. All unsupervised modeling approaches delivered significantly better recognition rates than a speaker-independent phoneme recognition baseline, showing the suitability of unsupervised acoustic model training for dysarthric speech. While the AUD models led to the most compact representation of an utterance for the subsequent semantic inference stage, posteriorgram-based representations resulted in higher recognition rates, with the Gaussian posteriorgram achieving the highest slot-filling F-score of 97.02%.
Disciplines:
Computer science
Author, co-author:
Walter, Oliver; University of Paderborn > Department of Communications Engineering
DESPOTOVIC, Vladimir; University of Paderborn > Department of Communications Engineering
Haeb-Umbach, Reinhold; University of Paderborn > Department of Communications Engineering
Gemmeke, Jort; Katholieke Universiteit Leuven - KUL > ESAT - PSI, Processing Speech and Images
Van hamme, Hugo; Katholieke Universiteit Leuven - KUL > ESAT - PSI, Processing Speech and Images
Ons, Bart; Katholieke Universiteit Leuven - KUL > ESAT - PSI, Processing Speech and Images
External co-authors:
yes
Document language:
English
Title:
An Evaluation of Unsupervised Acoustic Model Training for a Dysarthric Speech Interface
Publication date:
September 2014
Event name:
15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014)
Event location:
Singapore, Singapore
Event date:
from 14-09-2014 to 18-09-2014
Event scope:
International
Title of the main work:
Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014)