Different aspects of expert pronunciation quality ratings and their relation to scores produced by speech recognition algorithms.
Catia Cucchiarini, Helmer Strik, Loe Boves (2000)
A2RT, Dept. of Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands

Speech Communication 30 (2-3), pp. 109-119.


The ultimate aim of the research reported on here is to develop an automatic testing system for Dutch pronunciation. In the experiment described in this paper automatic scores of telephone speech produced by native and non-native speakers of Dutch are compared with specific, i.e. temporal and segmental, and global pronunciation ratings assigned by three groups of experts: three phoneticians and two groups of three speech therapists. The goals of this experiment are to determine 1) whether specific expert ratings of pronunciation quality contribute to our understanding of the relation between human pronunciation scores and machine scores of speech quality; 2) whether different expert groups assign essentially different ratings, and 3) to what extent rater pronunciation scores can be predicted on the basis of automatic scores. The results show that collecting specific ratings along with overall ones leads to a better understanding of the relation between human and automatic pronunciation assessment. Furthermore, after normalization no considerable differences are observed between the ratings by the three expert groups. Finally, it appears that the speech quality scores produced by our speech recognizer can predict expert pronunciation ratings with a high degree of accuracy.

1. Introduction

In the last few years we have witnessed the appearance of numerous software programs for teaching and testing language proficiency, such as those developed by Auralog and Syracuse Language Systems (see URLs in the reference list). The potential advantages of such systems are obvious: lower costs, greater flexibility and, in the case of testing, increased objectivity.

In developing automatic instruments for language testing it soon appeared that for certain skills automation would be easier than for others. In general four skills are distinguished on the basis of the dimensions: mode (oral vs. written) and direction (receptive vs. productive). Since in testing receptive skills it is possible to use response tasks that are easy to score (multiple choice, matching and cloze), developing automatic tests for these skills is relatively easy. For productive skills, on the other hand, automatic tests are difficult to develop, because of the open-ended nature of the input. Furthermore, in the case of speaking, direction and mode conspire to make automatic testing even more difficult.

In spite of these difficulties, various methods for evaluating certain oral sub-skills like pronunciation have been proposed (Bernstein et al., 1990; Neumeyer et al., 1996; Franco et al., 1997). Most of these systems make use of recent developments in automatic speech recognition. However, it seems important that any system intended for testing or improving pronunciation should refer to some standard based on judgments of human raters, the importance of which cannot be overestimated, as human scores are what automatic grading techniques purport to reproduce.

The importance of expert ratings for automatic assessment of pronunciation quality has been underlined by Bernstein et al. (1990). In this study aimed at determining the feasibility of automatic pronunciation grading, the performance of an automatic speech recognizer was tested against speech quality ratings by experts. In Neumeyer et al. (1996) and Franco et al. (1997), pronunciation scores assigned by human experts were also used as a reference to determine the validity of automatic measures of speech quality such as log-likelihood scores, timing scores, phone classification error scores and segment duration scores. While in these studies considerable effort was dedicated to optimizing the automatic measures so as to obtain better correlations between machine scores and human scores, less attention was paid to the ratings assigned by the experts; only overall ratings of pronunciation were collected.

However, research on pronunciation evaluation has revealed that overall scores of pronunciation quality may be affected by a great variety of speech characteristics (Anderson-Hsieh et al., 1992). Non-native speech can deviate from native speech in various aspects such as fluency, syllable structure, word stress, intonation and segmental quality. When native speakers are asked to score non-native speech on pronunciation quality, their scores are usually affected by more than one of these aspects. Research on the relationship between native speaker ratings of non-native pronunciation and deviance in the various aspects of speech quality has revealed that each area affects the overall score to a different extent (Anderson-Hsieh et al., 1992).

These findings suggest that global ratings of pronunciation quality assigned by human raters have a complex structure, which may be problematic when such scores are used as a reference for automatically produced measures of speech quality, because one does not know exactly what the human scores stand for. Questions such as "What do raters exactly evaluate?" and "What influences their judgements most?" should be taken into consideration when trying to develop machine measures that best approach human pronunciation scores. Against this background it seems that more specific pronunciation ratings should be collected along with global ratings of pronunciation quality so as to obtain a better understanding of pronunciation grading by humans.

Another problem with human pronunciation scores collected in previous studies (Neumeyer et al., 1996 and Franco et al., 1997) is that they do not take due account of possible shibboleth sounds. In these studies the experts were asked to assign a global pronunciation score to each of several sentences uttered by each speaker (sentence level rating). The scores for all the sentences by one speaker were then averaged to obtain an overall speaker score (speaker level rating). Although this procedure may seem logical at first sight, there are some problems with it.

The scores assigned by a rater to different sentences uttered by one and the same speaker may differ as a function of segmental make-up (Labov, 1966). For example, if a shibboleth (stigmatizing) sound is present in one sentence, the score for that sentence may be considerably lower than those for other sentences by the same speaker that do not contain that specific sound. Owing to the presence of a stigmatizing sound, pronunciation scores collected at the speaker level could turn out to be lower than the scores that would result from averaging over the various sentences uttered by the same speaker. In other words, the average score might not reflect the effect of the shibboleth sound to the same extent as an overall speaker score does. This seems to suggest that if the researcher is interested in pronunciation scores at the speaker level, (s)he should have the human raters listen to fragments containing the whole phonetic inventory of the language in question.

In our research directed at developing an automatic pronunciation testing system for Dutch, we also took human judgments as a reference. In order to obtain greater insight into how experts evaluate pronunciation, we asked them to assign both global and specific ratings of pronunciation quality. Moreover, in order to take account of the possible effects of stigmatizing sounds on the ratings, in the present experiment the human raters did not assign scores to individual sentences, but judged the pronunciation of each speaker on the basis of two sets of five phonetically rich sentences.

When it came to selecting raters to assess non-native pronunciation of Dutch we found that we could choose from among different groups. Phoneticians are obvious candidates, since they are experts on pronunciation in general. Teachers of Dutch as a second language would seem to be another obvious candidate; however, from these teachers we learned that, in practice, pronunciation problems in learners of Dutch as a second language are not usually addressed by language teachers, but rather by speech therapists. Since it is possible that the ratings vary with the experts' background, we decided to include different groups of raters in the experiment so that we could make comparisons between them.

Another characteristic of the current experiment is that it is not limited to assessing non-native speech, but it also concerns native speech. The reason for doing this is that the presence of native-produced sentences facilitates judgements of non-native speech (Flege and Fletcher, 1992: 385). These authors suggest that although native speech patterns are stored in native listeners' long-term memory, the availability of native speech makes it easier for listeners to make accurate judgments of degree of accent.

Finally, an important feature of this experiment is that telephone speech is used. The rationale behind this is that in the future automatic tests to be administered over the telephone will be required for different applications. In one study that we know of, telephone quality was simulated by using 200-3600 Hz band-limited speech (Bernstein et al., 1990). However, this is only a first approximation of real telephone speech.

The first aim of the experiment reported on here was to determine whether the availability of specific ratings of pronunciation quality along with global ratings would enhance our understanding of the relation between human scores and machine scores. The second aim was to determine whether resorting to different groups of experts would lead to different results. Finally, we wanted to establish to what extent speech quality scores computed by our speech recognizer (see Strik et al., 1997) can predict pronunciation scores assigned by human experts.

This paper is organized as follows: section 2 describes the experimental methodology. The results of this experiment are presented and discussed in section 3, while conclusions are drawn in section 4.

4. Conclusions

The first aim of the experiment reported on in this paper was to find out whether specific ratings of pronunciation quality would increase our insight into the relation between human ratings and machine scores. The results presented above show that this is indeed the case: the comparison between more detailed and global ratings revealed that overall pronunciation is most influenced by segmental quality, which is the human measure that can be predicted most poorly on the basis of our machine scores. It also appeared that specific aspects of pronunciation quality can be predicted more accurately, provided that the right automatic correlate is found. In other words, although overall pronunciation can be predicted accurately on the basis of automatic measures of timing, it appears that these measures can predict fluency and speech rate even more accurately, which is also what one would expect. A clear result of this experiment is that the optimal correlate of segmental quality still eludes us.

It seems therefore that an important contribution of the specific ratings is that they make clear in which direction action should be taken in order to achieve better pronunciation scoring. For example, it is now clear that attempts should be made to obtain a better predictor of segmental quality, because this would prevent speakers with poor pronunciation and the right temporal characteristics from obtaining high pronunciation scores.

The second aim of this experiment was to determine whether taking different groups of experts as a reference would lead to different ratings. The results presented above reveal that raters who did not receive any instructions on the use of the rating scales may differ from each other in the absolute values of the scores assigned. However, one can normalize for these differences by computing standard scores. After normalization no considerable differences between the raters were observed: they all evaluate the speakers in a similar way. We can therefore conclude that expert ratings of pronunciation exhibit a certain degree of stability.
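The normalization step mentioned above can be illustrated with a minimal sketch. It assumes that each rater's raw ratings are converted to standard scores (z-scores) separately, using the population standard deviation; the rating values, the rater labels and the function name `standard_scores` are hypothetical and are not taken from the experiment.

```python
from statistics import mean, pstdev


def standard_scores(ratings):
    """Convert one rater's raw ratings to standard scores (mean 0, SD 1)."""
    m = mean(ratings)
    sd = pstdev(ratings)  # population SD; using the sample SD is an equally valid assumption
    if sd == 0:
        raise ValueError("all ratings are identical; standard scores are undefined")
    return [(r - m) / sd for r in ratings]


# Hypothetical raw ratings given by two raters to the same five speakers.
# The raters differ in the absolute level of their scores, not in the ordering.
lenient_rater = [6, 7, 8, 9, 10]
strict_rater = [2, 3, 4, 5, 6]

z_lenient = standard_scores(lenient_rater)
z_strict = standard_scores(strict_rater)

# After normalization the two raters can be compared speaker by speaker.
for speaker, (a, b) in enumerate(zip(z_lenient, z_strict), start=1):
    print(f"speaker {speaker}: lenient z = {a:+.2f}, strict z = {b:+.2f}")
```

In this constructed example the two raters differ only by a constant offset, so their standard scores coincide; with real ratings the standard scores would agree only to the extent that the raters rank and space the speakers similarly.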

Finally, the third aim of the experiment reported on in this paper was to determine whether pronunciation ratings assigned by human experts can be predicted on the basis of scores produced by an automatic speech recognizer. The results found so far show that a good prediction of both global and specific pronunciation scores can be obtained on the basis of automatic measures of timing such as ROS and TD. However, it seems that further research is needed to determine whether appropriate measures can be found to obtain a more refined assessment of segmental quality.
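How well a single automatic timing measure predicts a human rating can be expressed as a correlation coefficient. The sketch below computes a Pearson correlation under the assumption that one machine score and one expert rating are available per speaker. The values are invented for illustration, ROS is read here as a rate-of-speech measure (an assumption, since this section does not define it), and the function name `pearson_r` is hypothetical; none of this reproduces the actual analysis or results of the experiment.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-speaker values: an automatic timing score (labelled ROS here)
# and the corresponding expert rating of overall pronunciation.
ros_scores = [8.0, 10.0, 12.0, 14.0]
expert_ratings = [3.0, 5.0, 6.0, 9.0]

r = pearson_r(ros_scores, expert_ratings)
print(f"correlation between machine score and expert rating: {r:.2f}")
```

A high correlation of this kind says only that the machine score orders the speakers much as the experts do; as noted above, it does not show that segmental quality is being assessed.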

With a view to the ultimate aim of our research, i.e. developing an automatic testing system for Dutch pronunciation, the results of this experiment are very useful since they show that pronunciation scores assigned by human experts can be accurately predicted on the basis of measures computed by a speech recognizer. Furthermore, they indicate how we should proceed toward developing an automatic pronunciation test. For instance, finding an adequate automatic correlate of segmental quality is necessary to prevent fast speakers with low proficiency from obtaining high pronunciation scores.

To conclude, the results presented in this paper are promising and the fact that they were obtained under rather "normal and realistic" conditions (no laboratory speech, no exclusion of disfluent utterances) makes them even more promising.


This research was supported by SENTER (which is an agency of the Dutch Ministry of Economic Affairs) under the Information Technology Programme, the Dutch National Institute for Educational Measurement (CITO), Swets Test Services of Swets and Zeitlinger and PTT Telecom. The research of Dr. H. Strik has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences. We thank Febe de Wet for her assistance in analyzing the data.


Anderson-Hsieh, J., Johnson, R., Koehler, K., 1992. The relationship between native speaker judgments of non-native pronunciation and deviance in segmentals, prosody, and syllable structure, Language Learning, 42, 529-555.

Auralog http://www.auralog.com/eng/index.htm

Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., Weintraub, M., 1990. Automatic evaluation and training in English pronunciation. In: Proceedings International Congress on Spoken Language Processing (ICSLP) '90, Kobe, pp. 1185-1188.

Cucchiarini, C., Strik, H., Boves, L., 1997. Automatic evaluation of Dutch pronunciation by using speech recognition technology. In: Furui, S., Juang, B.-H., Chou, W. (Eds.), Proceedings IEEE workshop ASRU, Santa Barbara, pp. 622-629.

Ferguson, G.A., 1987. Statistical analysis in psychology and education, fifth edition, McGraw-Hill book company, Singapore.

Flege, J., Fletcher, K., 1992. Talker and listener effects of perceived foreign accent, J. Acoust. Soc. Amer., 91, 370-389.

Franco, H., Neumeyer, L., Kim, Y., Ronen, O., 1997. Automatic pronunciation scoring for language instruction. In: Werner, B. (Ed.), Proc. Int. Congress on Acoustics, Speech and Signal Processing (ICASSP) 1997, München, pp. 1471-1474.

Kraayeveld, H., 1997. Idiosyncrasy in prosody, Doctoral dissertation University of Nijmegen, Nijmegen.

Labov, W., 1966. The social stratification of English in New York City, Center for Applied Linguistics, Washington.

Lee, C.H., 1997. A unified statistical hypothesis testing approach to speaker verification and verbal information verification. In: Proceedings COST workshop Rhodos, pp. 63-72.

Neumeyer, L., Franco, H., Weintraub, M., Price, P., 1996. Automatic text-independent pronunciation scoring of foreign language student speech. In: Bunnel, H.T., Idsardi, W. (Eds.), Proceedings International Congress on Spoken Language Processing (ICSLP) '96, Philadelphia, pp. 1457-1460.

den Os, E.A., Boogaart, T.I., Boves, L., Klabbers, E., 1995. The Dutch Polyphone corpus. In: Pardo, J.M., Enríquez, E., Ortega, J., Ferreiros, J., Macías, J., Valverde, F.J. (Eds.), Proceedings ESCA 4th European Conference on Speech Communication and Technology: EUROSPEECH 95, Madrid, pp. 825-828.

Strik, H., Russel, A., Van den Heuvel, H., Cucchiarini, C., Boves, L., 1997. A spoken dialogue system for the Dutch public transport information service, International Journal of Speech Technology, 121-131.

Syracuse Language Systems http://www.syrlang.com/


Set 1

1) Vitrage is heel ouderwets en past niet bij een modern interieur.

2) De Nederlandse gulden is al lang even hard als de Duitse mark.

3) Een bekertje warme chocolademelk moet je wel lusten.

4) Door jouw gezeur zijn we nu al meer dan een uur te laat voor die afspraak.

5) Met een flinke garage erbij moet je genoeg opbergruimte hebben.

Set 2

1) Een foutje van de stuurman heeft het schip doen kapseizen.

2) Gelokt door een stukje kaas liep het muisje keurig in de val.

3) Het ziet er naar uit dat het deze week bij ons opnieuw gaat regenen.

4) Na die grote lekkage was het dure behang aan vervanging toe.

5) Geduldig hou ik de deur voor je open.

Last updated on 22-05-2004