Modeling pronunciation variation for ASR: a survey of the literature.
Helmer Strik, Catia Cucchiarini (1999)
A2RT, Dept. of Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands

Speech Communication, Vol. 29, No. 2-4, pp. 225-246.

Abstract

The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately.

In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop' (see footnote 1). First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out.

Zusammenfassung

Die Forschungsrichtung der automatischen Spracherkennung (ASR) hat sich nach und nach vom Erkennen isolierter Wörter in Richtung Erkennung frei gesprochener Sprache entwickelt. Das hat zur Folge, daß die Aussprachevariation, so wie sie in der freien Rede zutage tritt, bei der Spracherkennung ein intervenierender Faktor geworden ist. Die Leistung eines ASR-Systems wird nämlich erheblich beeinträchtigt, wenn man diesen Faktor nicht berücksichtigt. Dies ist vermutlich der Hauptgrund dafür, warum die systematische Berücksichtigung der Aussprachevariation bei der ASR in letzter Zeit stark zugenommen hat.

Dieser Artikel stellt einen Überblick der Literatur zu diesem Thema dar, wobei den Beiträgen in diesem 'special issue' sowie denen des 'Rolduc workshop' besondere Aufmerksamkeit geschenkt wird. Zunächst werden die wichtigsten Unterschiede der zahlreichen Arbeiten zur Modellbildung der Aussprachevariation diskutiert. Dann folgt eine Besprechung der Beurteilung und des Vergleichs verschiedener Methoden, die der Modellbildung zugrunde liegen. Dabei wird den wichtigsten Faktoren, die einen objektiven Vergleich der Methoden erschweren, besondere Aufmerksamkeit geschenkt. Letztendlich schließen sich einige Schlußfolgerungen im Hinblick auf die Relevanz objektiver Beurteilung und deren mögliche Realisierung an.

Résumé

Le centre d'intérêt dans la recherche de la reconnaissance automatique de la parole (ASR), parti des mots isolés, s'est engagé vers le discours conversationnel. Par conséquence, la quantité de variation de prononciation présente dans le discours dont nous rapportons les résultats a graduellement augmenté. La variation de prononciation détériorera la performance d'un système ASR si l'on n'en rend pas compte. C'est probablement la raison principale pourquoi la recherche dans le domaine de la modélisation de la variation de prononciation pour ASR a augmenté récemment.

Dans cette contribution on fournit une vue d'ensemble des publications sur ce sujet, et en particulier on réfère aux articles de cette édition spéciale et aux contributions présentées dans les sessions qui ont eu lieu à 'Rolduc'. D'abord, les caractéristiques les plus importantes qui distinguent les diverses études sur modélisation de variation de prononciation sont discutées. Puis les questions d'évaluation et de comparaison sont adressées. Une attention particulière est prêtée à certains des facteurs les plus importants qui rendent difficile de comparer les différentes méthodes d'une manière objective. Enfin quelques conclusions sont tirées quant à l'importance de l'évaluation objective et de la façon dans laquelle elle pourrait être effectuée.




Footnote:

1: Whenever we mention 'the Rolduc workshop' in the text we refer to the ESCA Tutorial and Research Workshop "Modeling pronunciation variation for ASR" that was held in Rolduc from 4 to 6 May 1998. This special issue of Speech Communication contains a selection of papers presented at that workshop.

Last updated on 22-05-2004