C. Cucchiarini, H. Strik & L. Boves
A2RT, Dept. of Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
Proc. of IEEE ASRU, Santa Barbara, Dec. 1997.
Abstract
The ultimate aim of the research reported on in this paper is to develop
a system for automatic assessment of foreign speakers' pronunciation of
Dutch. The aim of the experiment described here was to determine
whether pronunciation ratings assigned by human experts
could be predicted on the basis of scores calculated by an automatic speech
recognizer. To this end
20 native and 60 non-native speakers of Dutch read ten phonetically rich
sentences over the telephone. The
automatic speech recognizer was trained with read speech of 4019 Dutch
subjects with varying regional accents.
The results show that the human scores can be accurately predicted, even
in the case of telephone speech. Analysis of the various types of human ratings and
automatic measures provides more insight into the relationship between human and machine
scores and indicates how the automatic measures can be further improved to
achieve even greater predictive power.
Introduction
Developing computer tests for productive language skills
such as speaking and
writing is difficult because of the open-ended nature of the input. Recent
advances in speech recognition research seem to suggest that there are
possibilities of using computers to test at least some aspects of oral
proficiency. \cite{Bernstein90,Hiller94,Eskenazi96,Neumeyer96} describe
automatic methods for evaluating English pronunciation. In this paper we report
on an experiment that was aimed at determining whether scores obtained by
means of an automatic speech recognizer correlate with human pronunciation
scores of spoken Dutch. In doing so, we have analyzed both the automatic and
the human experts' scores in detail.
In the methods for automatic pronunciation assessment developed so far
\cite{Bernstein90,Neumeyer96} different machine measures have been used: HMM
log-likelihood scores, timing scores, phone classification error scores and
segment duration scores. Recently, phone log-posterior probability scores have
also been investigated \cite{Franco97}. In all these studies the validity of
machine scores is established by comparing them with pronunciation scores
assigned by human experts who are asked to assign a global pronunciation score
to each of the several sentences uttered by each speaker. The scores for all
the sentences by one speaker are then averaged to obtain an overall speaker
score. Alternatively, the total set of sentences can be scored as a single
item. Of the four measures used in \cite{Neumeyer96}, segment duration scores
show the highest degree of correlation with human-assigned pronunciation scores
(0.86). However, \cite{Franco97} found that phone log-posterior probability
scores are even better predictors of human scores. Attempts to improve the
correlations at the sentence level by combining different machine scores led to
an additional 7\% increase in correlation \cite{Franco97}. The trend in this
kind of research is to look for machine measures that best correlate with human
scores. In this attempt, little effort is made to understand the nature of the
correlation between machine scores and human scores, although such
understanding would certainly be useful for improving automatic pronunciation
assessment. Non-native
speech can deviate from native speech in various aspects such as fluency,
syllable structure, word stress, intonation and segmental quality. In the
literature, considerable attention has been paid to the relative importance of
the various aspects of speech quality for pronunciation assessment
\cite{Anderson92,Munro95}. The fact that human scores depend on several speech characteristics
may be problematic when such scores are used as a benchmark for automatic
measures of speech quality. For this reason, in the present study more specific
pronunciation ratings were collected along with global ratings of pronunciation
quality. We asked the human raters to explicitly assess segmental quality,
fluency and speech rate, in addition to overall pronunciation quality.
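As noted above, combining different machine scores can raise the correlation
with human ratings. The sketch below illustrates that general technique with
ordinary least squares; it is our own illustration, not the procedure of
\cite{Franco97}, and all variable names and the placeholder data are
assumptions.
\begin{verbatim}
import numpy as np

def combine_scores(machine_scores, human_scores):
    """Fit a linear combination of machine measures (columns of
    machine_scores) that approximates the human ratings, and report
    the correlation of the combined score with those ratings."""
    # Append a column of ones so the fit includes an intercept.
    X = np.column_stack([machine_scores, np.ones(len(machine_scores))])
    weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    predicted = X @ weights
    r = np.corrcoef(predicted, human_scores)[0, 1]
    return weights, r

# Placeholder data: 80 speakers, 3 machine measures (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.5, size=80)
weights, r = combine_scores(X, y)
print(f"correlation of combined score with ratings: {r:.2f}")
\end{verbatim}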
The present experiment includes ratings of native speech of two kinds: standard
speech and speech with different regional accents. The presence of
native-produced sentences might facilitate judgments of non-native speech
\cite{Flege92},
and it is interesting to know how native regional accents are evaluated
relative to the speech of foreigners.
Throughout the experiment telephone
speech is used, since in the near future automatic tests to be administered
over the telephone will be required for different applications. In
\cite{Bernstein90}
telephone quality was simulated by using 200-3600 Hz band-limited speech, but
this is different from real telephone speech.
Aims of the present study
Given the successful attempts at developing
automatic pronunciation testing
systems for English, we decided to develop a similar test for assessing foreign
speakers' pronunciation of Dutch. To this end we used the automatic speech
recognizer developed at the University of Nijmegen. Some information
concerning this recognizer is provided below and in
\cite{Strik97}. The first aim of our experiment is to
determine to what extent scores computed by our speech recognizer can predict
pronunciation scores assigned by human experts. Furthermore, we wanted to
determine whether asking the human experts to assign specific ratings of
pronunciation quality along with global ratings would enhance our understanding
of the relation between human scores and machine scores.
The last aim of this
experiment was to determine how real telephone speech would fare in an
experiment of this kind.
Method
Speakers
The speakers in this experiment are 60 non-native speakers (NNS), 16 native
speakers (NS) and 4 speakers of the standard language (SDS). The NNS were
selected on the basis of language background (9 language groups), proficiency
(3 levels) and sex. The NS were selected according to region of origin (4
regions) and sex. The four speakers of Standard Dutch (two males and two
females) were selected on the basis of scores obtained in previous experiments
in which the degree of standardness had been evaluated.
Speech material
Each speaker read two sets of five phonetically rich sentences. In preparing
the sentences, the following criteria were adopted:
the sentences should be meaningful, should not sound strange, and should not
contain foreign words or names, or unusual words with which NNS are unlikely
to be familiar;
the content of the sentences should be as neutral as possible. They
should not contain statements concerning characteristics of particular
countries or nationalities;
each set of five sentences should contain all phonemes of Dutch at
least once.
The average duration of each set is 30 s.
With two sets this amounts to one
minute of speech per speaker. The sentences were read over the telephone. As
the recording system was connected to a Euro-ISDN line, the input signals consist
of 8 kHz 8 bit A-law coded samples. The subjects called from their homes or
from telephone booths, so that the recording conditions were far from ideal.
All speech material was checked and orthographically transcribed before being
used for the experiment.
Raters
The raters involved in this experiment are three expert phoneticians with
considerable experience in judging pronunciation and other speech and speaker
characteristics. A high level of expertise was required because the raters had
to evaluate specific aspects of pronunciation quality.
The rating experiment comprised two sessions held on different days. In session 1
the raters assigned overall pronunciation scores, while in session 2 the
specific scores were given. Scores were not given to individual sentences
but to sets of five phonetically rich sentences. The 80 speakers
were proportionally assigned to the three raters. Each rater judged 20 NNS, 6 NS
(2 NS were evaluated twice) and all 4 SDS. Overall pronunciation quality,
segmental quality and fluency were rated on a scale ranging from 1 to 10. A
scale ranging from -5 to +5 was used to assess speech rate. Per session each
rater scored 52 unique sets plus 44 sets that were added to calculate
intra-rater and inter-rater reliability. The order of the sets was randomized
anew for each session.
Automatic measures
The speech recognizer described in \cite{Strik97} was used. It employs 38
context-independent phone models (continuous mixture density HMMs) and was
trained on 18,000 phonetically rich sentences from 4019 speakers of the
Polyphone database \cite{denOs95}. From the recognizer output
the following measures were calculated:
tdur1 = total duration of speech (no pauses)
tdur2 = total duration of speech plus pauses
MSD = mean segment duration (tdur1 / N-segments)
ROS = rate of speech (N-segments / tdur2)
LL = global log-likelihood (sum of LLs for individual words)
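As an illustration, the following sketch computes these five measures from a
hypothetical recognizer alignment, represented as a time-ordered list of
(label, start, end, log-likelihood) segments with pauses labelled 'sil'. Both
the representation and the pause label are our assumptions. The paper sums
LLs per word; summing the LLs of the non-pause segments is equivalent when a
word's LL is the sum of the LLs of its segments.
\begin{verbatim}
def automatic_measures(segments, pause_label="sil"):
    """Compute tdur1, tdur2, MSD, ROS and LL for one utterance,
    given a time-ordered segment-level alignment."""
    speech = [s for s in segments if s[0] != pause_label]
    n_segments = len(speech)
    tdur1 = sum(end - start for _, start, end, _ in speech)  # speech only
    tdur2 = segments[-1][2] - segments[0][1]                 # speech plus pauses
    return {
        "tdur1": tdur1,
        "tdur2": tdur2,
        "MSD": tdur1 / n_segments,           # mean segment duration
        "ROS": n_segments / tdur2,           # rate of speech
        "LL": sum(ll for *_, ll in speech),  # global log-likelihood
    }
\end{verbatim}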
Results
Human scoring
Both intra-rater and inter-rater reliability coefficients $\alpha$ were very high
($> .95$), except for the speech rate scores, for which two raters had
intra-rater reliabilities on the order of $.75$. Since natives consistently received
higher scores, their presence could have inflated the reliability scores.
However, reliabilities remained high ($> .91$) when they were
computed within the group of non-natives.
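For reference, the reliability coefficient $\alpha$ can be computed as in the
sketch below (Cronbach's alpha over a matrix of ratings, with one row per
rated set and one column per rater or rating occasion). Whether this matches
the exact computation used here is our assumption; the formula itself is
standard.
\begin{verbatim}
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (sets x ratings) matrix, e.g. the same
    speech sets scored by several raters or on several occasions."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                         # ratings per set
    item_vars = ratings.var(axis=0, ddof=1)      # variance per rating column
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the sum scores
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)
\end{verbatim}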
Table~\ref{table:ScalCor} shows the correlations between the scores on the four
expert scales. It is evident that Segmental quality is almost identical to
Overall pronunciation quality, but that the temporal measures are good
predictors of Overall and Segmental quality too. This accords with the
informal observations of many teachers, who report that pupils with a low
proficiency level combine disfluencies and mispronunciations.
Automatic scoring
The correlations between the various automatic measures are shown in Table
\ref{table:AutCor}. All correlations are very high, so we must conclude that
all measures address essentially the same characteristics of the speech.
Automatic scoring and human scoring
Correlation coefficients were calculated between the four types of human scores
and the five automatic measures. The results (corrected for attenuation) are
presented in Table~\ref{table:AutHum}. All correlations between automatic and
human scores are high. The automatic measure
that shows the highest correlations with the human scores is LL. Among the
human-assigned scores, Fluency shows the highest correlations with the
automatic scores.
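The correction for attenuation applied here is the classical Spearman formula:
the observed correlation is divided by the geometric mean of the reliabilities
of the two measures,
\[
\hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}} ,
\]
where $r_{xy}$ is the observed correlation between a human scale and an
automatic measure, and $r_{xx}$ and $r_{yy}$ are their reliabilities.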
The fact that aspects of pronunciation quality regarding speech timing, such
as Fluency and Speech rate, correlate more strongly with the duration-related
automatic measures than Overall pronunciation and Segmental quality do
indicates that the raters did their job properly. When asked to
rate fluency and speech rate, they indeed paid attention to these aspects of
speech timing. In other words, the high correlations between the four types of
human-assigned scores (see Table~\ref{table:ScalCor}) are most probably due to the
fact that these aspects of pronunciation quality are indeed correlated with
each other.
Discussion and conclusions
In this paper we have reported on an experiment aimed at determining whether
pronunciation scores assigned by human experts can be predicted on the basis of
scores produced by an automatic speech recognizer. The analyses of the human
scores revealed that high levels of reliability were achieved, intra-rater as
well as inter-rater, in different conditions and for different scales.
Since the human ratings appeared to be reliable, they can safely be used
as a reference for the automatic scores.
The results show that overall pronunciation scores can be predicted with a
considerable degree of accuracy on the basis of automatic measures. All
correlations between Overall pronunciation and the automatic scores are high;
the highest correlation ($0.79$) is found for LL. This might seem rather
surprising, since in previous research \cite{Neumeyer96} log-likelihood turned
out not to be a good predictor of overall pronunciation. However, it should be
pointed out that in this experiment all automatic scores, even LL, turned out
to be highly correlated with each other. The fact that there is a high
correlation between LL and Overall pronunciation can thus be misleading.
Inspection of the correlation between LL and Overall pronunciation revealed
that the association is mostly due to the close relation between LL and
utterance duration. This strong dependence of LL on utterance duration is
probably due to the way in which LL is calculated by our system at the moment:
the LL for the whole utterance is calculated by summing the LLs of the
individual words. This strong dependence is reflected most clearly in the
extremely high correlation (0.94) of LL with tdur1, the latter being the
total duration of the utterances (i.e. of all words, without the pauses). This
confirms the suggestion that some kind of normalization, e.g. by computing
likelihood ratios, is essential to approximate the intuitive concept of
'segmental quality' or 'overall pronunciation quality' with scores obtained
from an automatic speech recognizer.
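A minimal sketch of such a normalization, under our own assumptions about the
available quantities (the paper does not compute any of these): a per-frame
log-likelihood, optionally turned into a log-likelihood ratio when a score
from a background model is available.
\begin{verbatim}
def normalized_ll(ll_utterance, n_frames, ll_background=None):
    """Duration-normalized log-likelihood; if a background-model score
    is given, return a per-frame log-likelihood ratio instead."""
    if ll_background is not None:
        # Log-likelihood ratio against a general (background) model,
        # normalized by the number of frames in the utterance.
        return (ll_utterance - ll_background) / n_frames
    # Simple duration normalization of the raw log-likelihood.
    return ll_utterance / n_frames
\end{verbatim}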
It is of interest to study the relations between automatic scores and human
scores in more detail, by analyzing the 'factorial' composition of the latter.
By using the specific pronunciation scores it became clear that Overall
pronunciation is most influenced by Segmental quality, which is the human
measure that can be predicted most poorly on the basis of the machine scores.
Even log-likelihood (LL), which was intended to be the automatic measure most
closely related to Segmental quality, is highly correlated with utterance
duration.
Another aspect in which our study differs from previous ones is that telephone
speech was used. People were simply asked to dial a certain number, and they
were free to choose the time and place of the call. Consequently, the resulting
acoustic registrations differ in many ways from those made in a studio or a
(usually quiet) office environment. Here we will mention only the most relevant
ones.
First of all, telephone speech is limited to the 300-3400 Hz band.
Second, not just one high-quality microphone was used, but many different
telephone microphones. Finally, and probably most importantly, relatively
high-level acoustic background signals are frequently present, which is
usually not the case with laboratory speech. We consider these conditions
'normal and realistic', in the sense that when this technology is later used
in applications over the telephone, conditions will most probably be similar.
However, it should be underlined that these conditions make automatic speech
recognition more difficult.
To conclude, the results of this experiment are very promising since they show
that pronunciation scores assigned by human experts can be accurately predicted
on the basis of measures computed by a speech recognizer. Furthermore, these
results indicate how the machine scores could be improved so as to obtain an
even greater predictive power. Finally, the fact that these results were
obtained with telephone speech under 'normal and realistic' conditions makes
them even more promising.
Acknowledgements
This research was supported by SENTER (which is an agency of the Dutch Ministry
of Economic Affairs) under the Information Technology Programme, the Dutch
National Institute for Educational Measurement (CITO), Swets Test Services of
Swets \& Zeitlinger and PTT Telecom. The research of Dr. H. Strik has been made
possible by a fellowship of the Royal Netherlands Academy of Arts and
Sciences.
References
Anderson92
J. Anderson-Hsieh, R. Johnson and K. Koehler
``The relationship between native speaker judgments of nonnative pronunciation
and deviance in segmentals, prosody, and syllable structure,''
{\it Language Learning}, Vol.~42, pp.~529-555, 1992.
Bernstein90
J. Bernstein, M. Cohen, H. Murveit, D. Rtischev, and M. Weintraub
``Automatic evaluation and training in English pronunciation,''
in {\it Proc. Int. Congress
on Spoken Language Processing (ICSLP) '90}, 1990, pp.~1185-1188.
Boeschoten89
J. Boeschoten
{\it Verstaanbaarheid van klanken in het Nederlands gesproken door Turken},
PhD Dissertation, Leyden University, 1989.
Eskenazi96
M. Eskenazi
``Detection of foreign speakers' pronunciation errors for
second language training - preliminary results,'' in {\it Proc. Int.
Congress on Spoken Language Processing (ICSLP) '96}, 1996,
pp.~1465-1468.
Munro95
M.J. Munro
``Nonsegmental factors in foreign accent,'' {\it Studies in Second Language
Acquisition}, Vol.~17,
pp.~17-34, 1995.
Flege92
J. Flege and K. Fletcher
``Talker and listener effects on degree of perceived
foreign accent,'' {\it Journal of the Acoustical Society of America},
Vol.~91, pp.~370-389, 1992.
Franco97
H. Franco, L. Neumeyer, Y. Kim and O. Ronen
``Automatic
pronunciation scoring for language instruction,'' in {\it Proc. Int. Congress on
Acoustics, Speech and Signal Processing (ICASSP) 1997}, pp.~1471-1474.
Hiller94
S. Hiller, E. Rooney, R. Vaughan, M. Eckert, J. Laver and M. Jack
``An automated system for computer-aided pronunciation learning,''
{\it Computer Assisted Language Learning}, Vol.~7, pp.~51-63, 1994.
James76
E. James
``The acquisition of prosodic features using a speech
visualizer,'' {\it International Review of Applied Linguistics and Language
Teaching}, Vol.~14, pp.~227-243, 1976.
Johansson78
S. Johansson
{\it Studies of error gravity: Native reactions to errors
produced by Swedish learners of English}, Göteborg: Acta Universitatis
Gothoburgensis, 1978.
Neumeyer96
L. Neumeyer, H. Franco, M. Weintraub, and P. Price
``Automatic
text-independent pronunciation scoring of foreign language student speech,''
in {\it Proc. Int. Congress on Spoken Language Processing (ICSLP) '96},
Philadelphia, 1996, pp.~1457--1460.
denOs95
E.A. den Os, T.I. Boogaart, L. Boves and E. Klabbers
``The Dutch
Polyphone corpus,'' in {\it Proc. ESCA 4th European Conference on Speech Communication
and Technology: EUROSPEECH 95}, Madrid, 1995, pp.~825-828.
Strik97
H. Strik, A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves
``A
spoken dialogue system for the Dutch public transport information service,''
{\it International Journal of Speech Technology},
Vol.~2, pp.~119-129, 1997.