Different aspects of expert pronunciation quality ratings and
their relation to scores produced by speech recognition algorithms.
Catia Cucchiarini, Helmer Strik, Loe Boves (2000)
Dept. of Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
Speech Communication 30 (2-3), pp. 109-119.
The ultimate aim of the research reported on here is to develop an automatic
testing system for Dutch pronunciation. In the experiment described in this paper
automatic scores of telephone speech produced by native and non-native speakers of
Dutch are compared with specific, i.e. temporal and segmental, and global
pronunciation ratings assigned by three groups of experts: three phoneticians and two
groups of three speech therapists. The goals of this experiment are to determine 1)
whether specific expert ratings of pronunciation quality contribute to our
understanding of the relation between human pronunciation scores and machine
scores of speech quality; 2) whether different expert groups assign essentially
different ratings, and 3) to what extent rater pronunciation scores can be predicted on
the basis of automatic scores. The results show that collecting specific ratings along
with overall ones leads to a better understanding of the relation between human and
automatic pronunciation assessment. Furthermore, after normalization no
considerable differences are observed between the ratings by the three expert groups.
Finally, it appears that the speech quality scores produced by our speech recognizer
can predict expert pronunciation ratings with a high degree of accuracy.
In the last few years we have witnessed the appearance of numerous software
programs for teaching and testing language proficiency, such as those developed by
Auralog and Syracuse Language Systems (see URLs in the reference list). The
potential advantages of such systems are obvious: lower costs, greater flexibility and,
in the case of testing, increased objectivity.
In developing automatic instruments for language testing it soon appeared that for
certain skills automation would be easier than for others. In general, four skills are
distinguished on the basis of two dimensions: mode (oral vs. written) and direction
(receptive vs. productive). Since in testing receptive skills it is possible to use
response tasks that are easy to score (multiple choice, matching and cloze),
developing automatic tests for these skills is relatively easy. For productive skills, on
the other hand, automatic tests are difficult to develop, because of the open-ended
nature of the input. Furthermore, in the case of speaking, direction and mode conspire
to make automatic testing even more difficult.
In spite of these difficulties, various methods for evaluating certain oral sub-skills
like pronunciation have been proposed (Bernstein et al., 1990; Neumeyer et al., 1996;
Franco et al., 1997). Most of these systems make use of recent developments in
automatic speech recognition. However, it seems important that any system intended
for testing or improving pronunciation should refer to some standard based on
judgments of human raters, the importance of which cannot be overestimated, as
human scores are what automatic grading techniques purport to reproduce.
The importance of expert ratings for automatic assessment of pronunciation
quality has been underlined by Bernstein et al. (1990). In this study aimed at
determining the feasibility of automatic pronunciation grading, the performance of an
automatic speech recognizer was tested against speech quality ratings by experts. In
Neumeyer et al. (1996) and Franco et al. (1997), pronunciation scores assigned by
human experts were also used as a reference to determine the validity of automatic
measures of speech quality such as log-likelihood scores, timing scores, phone
classification error scores and segment duration scores. While in these studies
considerable effort was dedicated to optimizing the automatic measures so as to
obtain better correlations between machine scores and human scores, less attention
was paid to the ratings assigned by the experts; only overall ratings of pronunciation
However, research on pronunciation evaluation has revealed that overall scores of
pronunciation quality may be affected by a great variety of speech characteristics
(Anderson-Hsieh et al., 1992). Non-native speech can deviate from native speech in
various aspects such as fluency, syllable structure, word stress, intonation and
segmental quality. When native speakers are asked to score non-native speech on
pronunciation quality, their scores are usually affected by more than one of these
aspects. Research on the relationship between native speaker ratings of non-native
pronunciation and deviance in the various aspects of speech quality has revealed that
each area affects the overall score to a different extent (Anderson-Hsieh et al., 1992).
These findings suggest that global ratings of pronunciation quality assigned by
human raters have a complex structure, which may be problematic when such scores
are used as a reference for automatically produced measures of speech quality,
because one does not know exactly what the human scores stand for. Questions such
as "What do raters exactly evaluate?" and "What influences their judgements
most?" should be taken into consideration when trying to develop machine measures
that best approach human pronunciation scores. Against this background it seems that
more specific pronunciation ratings should be collected along with global ratings of
pronunciation quality so as to obtain a better understanding of pronunciation grading
Another problem with human pronunciation scores collected in previous studies
(Neumeyer et al., 1996 and Franco et al., 1997) is that they do not take due account of
possible shibboleth sounds. In these studies the experts were asked to assign a global
pronunciation score to each of several sentences uttered by each speaker (sentence
level rating). The scores for all the sentences by one speaker were then averaged to
obtain an overall speaker score (speaker level rating) (see Neumeyer et al., 1996 and
Franco et al., 1997). Although this procedure may seem logical at first sight, there are
some problems with it.
The scores assigned by a rater to different sentences uttered by one and the same
speaker may differ as a function of segmental make-up (Labov, 1966). For example,
if a shibboleth (stigmatizing) sound is present in one sentence, the score for that
sentence may be considerably lower than those for other sentences by the same
speaker that do not contain that specific sound. Owing to the presence of a
stigmatizing sound, pronunciation scores collected at the speaker level could turn out
to be lower than the scores that would result from averaging over the various sentences
uttered by the same speaker. In other words, the average score might not reflect the
effect of the shibboleth sound to the same extent as the one expressed in an overall
speaker score. This seems to suggest that if the researcher is interested in
pronunciation scores at the speaker level, (s)he should have the human raters listen to
fragments containing the whole phonetic inventory of the language in question.
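
The dilution effect described above can be made concrete with a small numeric sketch. The sentence-level scores and the 1-10 scale below are invented purely for illustration and are not data from this or any cited study. The sketch only shows how little the averaged score moves when a single sentence is penalised; it does not model how a rater would arrive at an overall speaker score.

```python
from statistics import mean

# Hypothetical sentence-level scores for one speaker (assumed 1-10 scale).
# The fourth sentence is assumed to contain a stigmatizing sound and is
# therefore rated much lower than the others.
sentence_scores = [7, 7, 8, 3, 7]

# Speaker-level score obtained by averaging over sentences.
averaged_speaker_score = mean(sentence_scores)

# The single penalised sentence, which the average largely hides.
lowest_sentence_score = min(sentence_scores)

print(averaged_speaker_score)  # 6.4
print(lowest_sentence_score)   # 3
```

If a rater who hears all the material at once weighs the stigmatizing sound heavily, the overall judgement may lie closer to the low sentence score than to the average, which is the discrepancy the paragraph above points to.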
In our research directed at developing an automatic pronunciation testing system
for Dutch, we also took human judgments as a reference. In order to obtain greater
insight into how experts evaluate pronunciation, we asked them to assign both global
and specific ratings of pronunciation quality. Moreover, in order to take account of
the possible effects of stigmatizing sounds on the ratings, in the present experiment
the human raters did not assign scores to individual sentences, but judged the
pronunciation of each speaker on the basis of two sets of five phonetically rich
When it came to selecting raters to assess non-native pronunciation of Dutch we
found that we could choose from among different groups. Phoneticians are obvious
candidates, since they are experts on pronunciation in general. Teachers of Dutch as a
second language would seem to be another obvious candidate; however, from these
teachers we learned that, in practice, pronunciation problems in learners of Dutch as a
second language are not usually addressed by language teachers, but rather by speech
therapists. Since it is possible that the ratings vary with the experts' background, we
decided to include different groups of raters in the experiment so that we could make
comparisons between them.
Another characteristic of the current experiment is that it is not limited to
assessing non-native speech, but it also concerns native speech. The reason for doing
this is that the presence of native-produced sentences facilitates judgements of
non-native speech (Flege and Fletcher, 1992: 385). These authors suggest that although
native speech patterns are stored in native listeners' long-term memory, the
availability of native speech makes it easier for listeners to make accurate judgments
of degree of accent.
Finally, an important feature of this experiment is that telephone speech is used.
The rationale behind this is that in the future automatic tests to be administered over
the telephone will be required for different applications. In one study that we know of,
telephone quality was simulated by using 200-3600 Hz band-limited speech
(Bernstein et al., 1990). However, this is only a first approximation of real telephone
The first aim of the experiment reported on here was to determine whether the
availability of specific ratings of pronunciation quality along with global ratings
would enhance our understanding of the relation between human scores and machine
scores. The second aim was to determine whether resorting to different groups of
experts would lead to different results. Finally, we wanted to establish to what extent
speech quality scores computed by our speech recognizer (see Strik et al., 1997) can
predict pronunciation scores assigned by human experts.
This paper is organized as follows: section 2 describes the experimental
methodology. The results of this experiment are presented and discussed in section 3,
while conclusions are drawn in section 4.
The first aim of the experiment reported on in this paper was to find out whether
specific ratings of pronunciation quality would increase our insight into the relation
between human ratings and machine scores. The results presented above show that
this is indeed the case: the comparison between more detailed and global ratings
revealed that overall pronunciation is most influenced by segmental quality, which is
the human measure that is predicted least accurately on the basis of our machine
scores. It also appeared that specific aspects of pronunciation quality can be predicted
more accurately, provided that the right automatic correlate is found. In other words,
although overall pronunciation can be predicted accurately on the basis of automatic
measures of timing, it appears that these measures can predict fluency and speech rate
even more accurately, which is also what one would expect. A clear result of this
experiment is that the optimal correlate of segmental quality still eludes us.
It seems therefore that an important contribution of the specific ratings is that they
make clear in which direction action should be taken in order to achieve better
pronunciation scoring. For example, it is now clear that attempts should be made to
obtain a better predictor of segmental quality, because this would prevent speakers
with poor pronunciation and the right temporal characteristics from obtaining high
The second aim of this experiment was to determine whether taking different
groups of experts as a reference would lead to different ratings. The results presented
above reveal that raters who did not receive any instructions on the use of the rating
scales may differ from each other in the absolute values of the scores assigned.
However, one can normalize for these differences by computing standard scores.
After normalization no considerable differences between the raters were observed:
they all evaluate the speakers in a similar way. We can therefore conclude that expert
ratings of pronunciation exhibit a certain degree of stability.
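
The normalization step mentioned above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the experiment: the ratings are invented, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption. The function name `standard_scores` is hypothetical.

```python
from statistics import mean, pstdev


def standard_scores(ratings):
    """Convert one rater's raw ratings to standard (z) scores.

    Each rating is expressed as its distance from that rater's own mean,
    in units of that rater's own standard deviation.
    """
    m = mean(ratings)
    sd = pstdev(ratings)  # assumption: population SD
    if sd == 0:
        raise ValueError("ratings have zero variance; z-scores are undefined")
    return [(r - m) / sd for r in ratings]


# Hypothetical raw ratings of the same five speakers by two raters.
# rater_b is systematically stricter and uses a narrower part of the scale.
rater_a = [2, 4, 6, 8, 10]
rater_b = [1, 2, 3, 4, 5]

z_a = standard_scores(rater_a)
z_b = standard_scores(rater_b)

# After normalization the two raters agree, although their raw scores differ.
for a, b in zip(z_a, z_b):
    print(round(a, 3), round(b, 3))
```

In this constructed case the two raters rank and space the speakers identically, so their standard scores coincide even though their raw scores differ in both level and spread.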
Finally, the third aim of the experiment reported on in this paper was to determine
whether pronunciation ratings assigned by human experts can be predicted on the
basis of scores produced by an automatic speech recognizer. The results found so far
show that a good prediction of both global and specific pronunciation scores can be
obtained on the basis of automatic measures of timing such as ROS and TD.
However, it seems that further research is needed to determine whether appropriate
measures can be found to obtain a more refined assessment of segmental quality.
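
How well an automatic measure predicts a human rating is typically summarised by a correlation coefficient. The sketch below computes a Pearson correlation between a timing measure and an expert rating. All values are invented for illustration; reading ROS as a rate-of-speech measure is an assumption, since the abbreviation is not expanded in this section, and `pearson_r` is a hypothetical helper rather than part of the system described here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-speaker values: an automatic rate-of-speech score and a
# mean expert fluency rating for the same six speakers.
rate_of_speech = [8.1, 9.4, 10.2, 11.0, 12.3, 13.5]
fluency_rating = [3.0, 4.5, 5.0, 6.5, 7.0, 8.5]

r = pearson_r(rate_of_speech, fluency_rating)
print(round(r, 2))
```

A value of r close to 1 would indicate that speakers with a higher automatic score also tend to receive higher expert ratings; the same computation can be repeated for each combination of automatic measure and rating scale.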
With a view to the ultimate aim of our research, i.e. developing an automatic
testing system for Dutch pronunciation, the results of this experiment are very useful
since they show that pronunciation scores assigned by human experts can be
accurately predicted on the basis of measures computed by a speech recognizer.
Furthermore, they indicate how we should proceed toward developing an automatic
pronunciation test. For instance, finding an adequate automatic correlate of segmental
quality is necessary to prevent fast speakers with low proficiency from obtaining high
To conclude, the results presented in this paper are promising, and the fact that they
were obtained under rather "normal and realistic" conditions (no laboratory speech,
no exclusion of disfluent utterances) makes them even more promising.
This research was supported by SENTER (which is an agency of the Dutch Ministry
of Economic Affairs) under the Information Technology Programme, the Dutch
National Institute for Educational Measurement (CITO), Swets Test Services of
Swets and Zeitlinger and PTT Telecom. The research of Dr. H. Strik has been made
possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences. We
thank Febe de Wet for her assistance in analyzing the data.
Anderson-Hsieh, J., Johnson, R., Koehler, K., 1992. The relationship between native
speaker judgments of non-native pronunciation and deviance in segmentals,
prosody, and syllable structure, Language Learning, 42, 529-555.
Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., Weintraub, M., 1990. Automatic
evaluation and training in English pronunciation. In: Proceedings International
Congress on Spoken Language Processing (ICSLP) '90, Kobe, pp. 1185-1188.
Cucchiarini, C., Strik, H., Boves, L., 1997. Automatic evaluation of Dutch
pronunciation by using speech recognition technology. In: Furui, S., Juang, B.-H.,
Chou, W. (Eds.), Proceedings IEEE workshop ASRU, Santa Barbara, pp. 622-629.
Ferguson, G.A., 1987. Statistical analysis in psychology and education, fifth edition,
McGraw-Hill book company, Singapore.
Flege, J., Fletcher, K., 1992. Talker and listener effects of perceived foreign accent, J.
Acoust. Soc. Amer., 91, 370-389.
Franco, H., Neumeyer, L., Kim, Y., Ronen, O., 1997. Automatic pronunciation
scoring for language instruction. In: Werner, B. (Ed.), Proc. Int. Congress on
Acoustics, Speech and Signal Processing (ICASSP) 1997, München, pp. 1471-1474.
Kraayeveld, H., 1997. Idiosyncrasy in prosody, Doctoral dissertation University of
Labov, W., 1966. The social stratification of English in New York City, Center for
Applied Linguistics, Washington.
Lee, C.H., 1997. A unified statistical hypothesis testing approach to speaker
verification and verbal information verification. In: Proceedings COST workshop
Rhodos, pp. 63-72.
Neumeyer, L., Franco, H., Weintraub, M., Price, P., 1996. Automatic
text-independent pronunciation scoring of foreign language student speech. In:
Bunnel, H.T., Idsardi, W. (Eds.), Proceedings International Congress on Spoken
Language Processing (ICSLP) '96, Philadelphia, pp. 1457-1460.
den Os, E.A., Boogaart, T.I., Boves, L., Klabbers, E., 1995. The Dutch Polyphone
corpus. In: Pardo, J.M., Enríquez, E., Ortega, J., Ferreiros, J., Macías, J.,
Valverde, F.J. (Eds.), Proceedings ESCA 4th European Conference on Speech
Communication and Technology: EUROSPEECH 95, Madrid, pp. 825-828.
Strik, H., Russel, A., Van den Heuvel, H., Cucchiarini, C., Boves, L., 1997. A spoken
dialogue system for the Dutch public transport information service, International
Journal of Speech Technology, 121-131.
Syracuse Language Systems http://www.syrlang.com/
1) Vitrage is heel ouderwets en past niet bij een modern interieur.
2) De Nederlandse gulden is al lang even hard als de Duitse mark.
3) Een bekertje warme chocolademelk moet je wel lusten.
4) Door jouw gezeur zijn we nu al meer dan een uur te laat voor die
5) Met een flinke garage erbij moet je genoeg opbergruimte hebben.
1) Een foutje van de stuurman heeft het schip doen kapseizen.
2) Gelokt door een stukje kaas liep het muisje keurig in de val.
3) Het ziet er naar uit dat het deze week bij ons opnieuw gaat regenen.
4) Na die grote lekkage was het dure behang aan vervanging toe.
5) Geduldig hou ik de deur voor je open.