Using Non-Native Error Patterns to Improve Pronunciation Verification

Joost van Doremalen, Catia Cucchiarini, Helmer Strik

Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands
{j.vandoremalen, c.cucchiarini, h.strik}@let.ru.nl

Abstract

In this paper we show how a pronunciation quality measure can be improved by making use of information on frequent pronunciation errors made by non-native speakers. We propose a new measure, called weighted Goodness of Pronunciation (wGOP), and compare it to the widely used GOP measure. We applied this measure to the task of discriminating correctly from incorrectly realized Dutch vowels produced by non-native speakers and observed a substantial increase in performance when sufficient training material is available.

Index Terms: pronunciation error detection, computer-assisted language learning, confidence measures, weighted GOP

1. Introduction

Adult second language (L2) learners are known to experience difficulties in learning to pronounce the sounds of an L2 (see [1] for reviews). The majority of L2 learners never acquire native-like performance and many of them have problems even in attaining a level of comfortably intelligible speech. An important limiting factor in acquiring the pronunciation of an L2 is considered to be the phonology of the mother tongue (L1). Theories that attempt to explain L1-L2 interference in speech perception and production are based on the tenet that the perceptual salience of phonetic detail becomes tied to the distinctions that are relevant in the L1 [2] [3]. This form of L1 entrenchment leads to "deafness" to phonetic distinctions in the L2 and causes difficulties in learning to perceive and produce L2 speech sounds. The positive finding, however, is that new distinctions in an L2 can be learned, although this requires intensive feedback [4] [5].

Since it is generally not possible to offer intensive feedback on pronunciation in L2 classrooms, there is growing interest in Computer Assisted Pronunciation Training systems that make use of automatic speech recognition to provide feedback on pronunciation. This is also the aim of our DISCO project [6]. An important requirement for such systems is that pronunciation errors are reliably detected. For this purpose various measures of pronunciation quality have been developed [7] [8]. Although in general acceptable levels of performance can be achieved with these measures, it is our impression that better performance could be achieved by using pronunciation quality measures that take greater account of the specific pronunciation errors that are made in the L2. More specifically, the research reported on in this paper evaluates a newly developed pronunciation quality measure on a set of Dutch vowels spoken by L2 learners.

In this paper we first provide a brief overview of the most widely used pronunciation quality measures and explain how such measures could be made more sensitive to error patterns (Section 2). We then describe the case of vowel pronunciation error detection in Dutch (Section 3). In the following sections we report on experiments in which the performance of our new measure is compared to that of the widely used GOP measure introduced in [7].

2. Pronunciation Quality Measures

Most pronunciation quality measures are segmental confidence measures. These confidence measures try to estimate the posterior probability of a phone:

$P(p|O) = \frac{P(O|p)\,P(p)}{P(O)}$   (1)

where p is the target phoneme and O the observation matrix.
If this confidence measure is below a certain predefined threshold, the phone is flagged as incorrectly realized. One well-known instantiation of this notion is the Goodness of Pronunciation (GOP) algorithm [7], in which the conditional probabilities are calculated using Hidden Markov Models (HMMs) trained on native speech material. In applications of this algorithm an equal prior distribution is often assumed and the denominator P(O) is approximated by the likelihood of the most likely phone sequence in the segment. In addition, transforming to a log scale and normalizing by the phone duration dur yields [7]:

$\mathrm{GOP}(p) = \frac{\log P(O|p) - \max_i \log P(O|p_i)}{dur}$   (2)

The decision to accept or reject the phone as a correct pronunciation of the target phoneme is made by simple thresholding, with the threshold determined separately for each target phoneme. This threshold can be calibrated on real non-native speech material or on native material in which artificial errors have been introduced [9].

In [8] the posterior probability is estimated by:

$P(p|O) = \frac{P(O|p)\,P(p)}{\sum_{i=1}^{N} P(O|p_i)\,P(p_i)}$   (3)

where the summation in the denominator runs over all N phonemes. The priors $P(p)$ and $P(p_i)$ represent the prior probabilities of the specific phonemes, estimated from native speech material. Other approaches to pronunciation verification involve discriminative training methods such as Support Vector Machines [10], in which the posterior probability is estimated directly. The research presented in this paper is grounded in the generative modeling approaches taken in [7] and [8].

We have found that the GOP scoring algorithm has difficulties in detecting errors in target phonemes with multiple acoustically close "neighbouring" phonemes. This is specifically the case in the Dutch vowel system, as explained in more detail in the next section. These difficulties are mainly caused by the fact that the denominator in Eq. 2 only takes into account the maximum likelihood phone sequence, which might be an underestimation if there is more than one competing phoneme. In Eq. 3 this problem does not arise, but we think that weighting the likelihoods of the competing phonemes $P(O|p_i)$ according to how important they are for predicting an error in the target phoneme might improve this measure. Therefore, we propose to combine the likelihood ratios of the target phoneme against all competitor phonemes in a logistic regression model, trained on manually annotated non-native speech material. This measure, which we call weighted GOP (wGOP), is explained in more detail in Section 5.2.
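To make the computation in Eq. (2) concrete, the following is a minimal sketch of a GOP scorer in Python. It is an illustration, not the implementation used in this paper: the function and variable names are hypothetical, and the per-phone log-likelihoods are assumed to come from forced alignment and free phone recognition with native HMMs as in [7].

```python
def gop_score(loglik_target, logliks_competitors, dur):
    """Eq. (2): duration-normalised log-likelihood ratio between the
    target phone and the best-scoring phone over the same segment.

    loglik_target       -- log P(O|p) from forced alignment of the target phone
    logliks_competitors -- dict mapping each phone p_i to log P(O|p_i) over the
                           same segment (stand-in for the free phone recognition
                           used in [7] to approximate the denominator P(O))
    dur                 -- segment duration, e.g. in frames
    """
    return (loglik_target - max(logliks_competitors.values())) / dur

def flag_mispronounced(gop, threshold):
    """Accept/reject decision; thresholds are calibrated per target phoneme."""
    return gop < threshold
```

Note that when the target phone is included among the competitors, the score is at most zero, so the per-phoneme thresholds are correspondingly negative.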
3. Dutch Vowel System

The Dutch vowel inventory is relatively complex: it contains thirteen monophthongs, three diphthongs, and some additional vowels found mainly in loan words [11] [12] (see Figure 1 for a vowel chart; SAMPA notation [13] is used throughout this paper). In addition, there are relatively many vowels in the mid-to-high, front-central area of the vowel space:

• /I/ (as in /bIt/, "bid"; "pray")
• /Y/ (as in /pYt/, "put"; "well")
• /y/ (as in /byr/, "buur"; "neighbour")
• /2:/ (as in /l2:k/, "leuk"; "nice")
• /e:/ (as in /be:t/, "beet"; "bite")

Figure 1: Dutch vowel chart.

Research has shown that in the case of Dutch, vowels pose particular problems to L2 learners [14]. The difficulties experienced by Dutch L2 learners in perceiving Dutch vowels do indeed appear to be connected to the relationship between the Dutch vowel system and that of their mother tongue [5], in the sense that L2 learners find it difficult to distinguish vowels that differ along dimensions that are not relevant in their mother tongue. New distinctions can, however, be learned if intensive feedback is provided [5]. With respect to production there is a compounding problem, because acoustic similarity is not the only influencing factor: orthography also plays a role, in the sense that the orthography of the mother tongue interferes with the way Dutch vowels are pronounced [14]. Moreover, in Dutch orthography the same grapheme is sometimes used to indicate two different phonemes, which might cause extra confusions. Automatic classification of Dutch vowels produced by non-natives turned out to be less successful than classification of vowels produced by native speakers [15]. Because of its characteristics (relatively many vowels, concentrated in a specific area of the vowel space), the Dutch vowel system is particularly suited to test the effectiveness of our newly developed pronunciation quality measure.

4. Material

The non-native speech material for the present experiments was taken from the JASMIN speech corpus [16]. This material was recorded from speakers with many different mother tongues and relatively low proficiency levels, namely A1, A2 and B1 of the Common European Framework (CEF). For the experiments reported on in this paper we used the read speech material, obtained from 45 speakers reading the same set of phonetically rich sentences. In total there are 3669 chunks with durations ranging from 5 to 15 seconds. Orthographic transcriptions were created manually and include fluency phenomena such as filled pauses, restarts and repetitions. From these orthographic transcriptions, phonetic transcriptions were automatically generated using a pronunciation lexicon with native and non-native pronunciation variants. Phonetic transcriptions for words containing disfluencies were created manually.

Because the automatically generated phonetic transcriptions can contain errors, we had two transcribers manually correct them at the word level. They were instructed to change the phonetic transcription whenever they thought an error had been made. For this correction, only the SAMPA symbols for Dutch were used. Chunks were presented in a random order. 10% of the material was corrected by both transcribers and another 10% was transcribed twice by the same transcriber, in order to calculate the inter- and intra-transcriber agreement, respectively. These agreement scores are shown in Table 1. Both transcribers changed less than 10% of the segments, and there is considerable overlap in the segments they changed, which explains the high agreement levels.

Table 1: Transcription correction agreement statistics.

  intra T1        0.975
  intra T2        0.948
  inter T1 - T2   0.913

5. Method

5.1. Phonetic Time Alignment

First, an alignment between a canonical phonetic transcription, based on the CGN pronunciation lexicon [17], and the speech signal was created. This canonical transcription represents how the words should have been pronounced in Standard Dutch. Second, an alignment between the manually corrected phonetic transcription and the speech signal was created. The manually corrected transcription represents how the words were actually realized.

The alignments were created by Viterbi alignment with acoustic models trained using the SPRAAK package [18]. 47 three-state monophone Gaussian Mixture Models (GMMs) were trained on native read speech material from the CGN speech database. For preprocessing, the input speech, sampled at 16 kHz, is first divided into overlapping 32 ms Hamming windows with a 10 ms shift and a pre-emphasis factor of 0.95. 12 Mel-frequency cepstral coefficients (MFCCs) plus C0, together with their first and second order derivatives, were calculated, and cepstral mean subtraction (CMS) was applied.
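As an illustration of this front end, the sketch below computes comparable 39-dimensional features with the librosa library. This is a stand-in of our own, not SPRAAK [18]; the exact window, lifter and filterbank settings of SPRAAK may differ from librosa's defaults.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """12 MFCCs + C0 with first and second order derivatives (39 dims),
    from 32 ms Hamming windows, 10 ms shift, pre-emphasis 0.95, plus CMS."""
    y, sr = librosa.load(wav_path, sr=16000)
    y = np.append(y[0], y[1:] - 0.95 * y[:-1])          # pre-emphasis filter
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,  # C0 + 12 coefficients
                                n_fft=512, win_length=512,  # 32 ms at 16 kHz
                                hop_length=160,             # 10 ms shift
                                window="hamming")
    mfcc -= mfcc.mean(axis=1, keepdims=True)            # cepstral mean subtraction
    deltas = librosa.feature.delta(mfcc, order=1)
    deltadeltas = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, deltas, deltadeltas])       # shape (39, n_frames)
```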
The quality of these segmentations was checked semi-automatically. We observed that word-internal disfluencies caused problems in the segmentation. These chunks could be detected relatively easily by spotting extremely long segments at the end of a chunk that were labelled as silence and had low average acoustic likelihoods. We cleaned up the material by removing the 948 chunks that met these criteria.

To determine whether a vowel in the canonical transcription was correctly realized, we checked whether more than 50% of the segment duration, as established in the canonical segmentation, contained the same vowel in the segmentation created from the manually corrected transcription. If this was not the case, the vowel was flagged as incorrectly pronounced. Note that in this way, problems in the segmentation could lead to spurious pronunciation errors, which was the main reason to delete problematic chunks.

5.2. Likelihood Ratio Calculation

For the calculation of likelihood ratios for all vowel segments in the canonical transcription, we used the same monophone acoustic models with which we performed the Viterbi alignment. We calculated these likelihood ratios as:

$\forall v \in V: \; \mathrm{LLR}_v^{v_t} = \frac{\log P(O|v_t) - \log P(O|v)}{dur}$   (4)

where O is the observation matrix, $v_t$ the target vowel sound and V the set of Dutch vowel phonemes. We will call these likelihood ratios $\mathrm{LLR}_v^{v_t}$ vowel scores. The likelihood calculation for "competing" vowel sounds $P(O|v)$ is simplified by following the same state-level segmentation as the Viterbi path calculated for the target phone. That is, the competing vowels v switch states at the same times as the target vowel $v_t$. Following Eq. 2, we also calculated the GOP measure, which we will denote by $\mathrm{LLR}_{\max}^{v_t}$. To calculate the likelihood of the optimal phone sequence in the segment we used an unconstrained free phone recognizer.

5.3. Model Training and Evaluation

Our baseline pronunciation verification system utilizes only the GOP score $\mathrm{LLR}_{\max}^{v_t}$. Our new measure, wGOP, combines the individual vowel scores in a logistic regression model:

$\mathrm{wGOP}(v_t) = \frac{1}{1 + \exp\{-(\beta_0 + \sum_{v \in V} \beta_v\,\mathrm{LLR}_v^{v_t})\}}$   (5)

These models are trained for each vowel phoneme separately. In these models, the dichotomous dependent variable, which represents whether the target phone was correctly or incorrectly pronounced, is predicted by the variables $\mathrm{LLR}_v^{v_t}$, i.e. the vowel scores. To train a specific vowel model, we first extracted the segments for which this vowel appeared in the canonical transcription as the target phone.
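A minimal sketch of this training step is given below, using scikit-learn's logistic regression instead of WEKA [19], which we used in our experiments; the data, shapes and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data for one target vowel: each row holds the vowel scores
# LLR_v^{vt} against all competing vowels; each label says whether the
# segment was annotated as incorrectly pronounced.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))       # 500 segments, 16 competing vowels
y = rng.integers(0, 2, size=500)     # 1 = mispronounced, 0 = correct

model = LogisticRegression()         # fits beta_0 and the beta_v of Eq. (5)
model.fit(X, y)

wgop = model.predict_proba(X)[:, 1]  # wGOP(v_t) for each segment
```

Leave-one-speaker-out cross-validation, as used in the paper, could be reproduced with sklearn.model_selection.LeaveOneGroupOut, passing speaker IDs as the groups argument.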
The number of segments per phoneme is shown in Table 2, together with the percentage of pronunciation errors. We also investigated whether adding the GOP score to the regression model as a predictor increased performance.

We trained and tested the models using leave-one-speaker-out cross-validation within the WEKA package [19]. That is, the $\beta_v$ coefficients are first optimized using all segments of the first 44 speakers and then tested on the segments of the remaining speaker. This is repeated until all segments have been tested. The coefficients indicate to what extent the likelihood of a certain competing vowel is important in predicting whether the realized phone was correctly or incorrectly pronounced.

We evaluated the GOP score, the wGOP score and their combination using the equal error rate (EER), which is the point on the error curve where the false acceptance rate is equal to the false rejection rate.

Table 2: Overall results of the GOP measure and the weighted GOP measure. Column descriptions: (1) target phoneme, (2) number of instances, (3) percentage of incorrectly realized phones, (4) EER using GOP, (5) EER using wGOP, (6) sign of the EER difference between GOP and wGOP, (7) EER using the combination of GOP and wGOP, (8) sign of the EER difference between GOP and the combination.

phoneme  #inst  %errors   GOP    wGOP       Comb
2:        235   44.68    32.34   24.69  +   25.55  +
9y        397   43.83    19.92   15.61  +   14.58  +
Ei       1204   40.78    26.42   22.60  +   22.18  +
Y         738   35.10    24.38   20.86  +   20.16  +
o:       1619   34.03    41.66   32.03  +   31.57  +
e:       1757   31.30    23.62   23.14  +   20.59  +
y         361   29.36    26.35   24.61  +   24.42  +
I        1715   29.16    29.13   24.42  +   21.40  +
A        2730   27.77    31.16   28.55  +   28.02  +
E        1695   17.05    25.57   24.45  +   22.91  +
i:       1637   16.56    23.70   24.29  -   23.26  +
a:       2131   10.00    29.65   22.35  +   22.58  +
Au        404    7.67    32.35   45.10  -   44.97  -
u         563    6.75    23.56   44.84  -   42.29  -
O        1426    4.98    33.13   44.53  -   34.10  -

6. Results

The EERs for each vowel are shown in Table 2, ordered by the percentage of pronunciation errors per vowel. For the vowels for which the EER of the wGOP measure is lower than that of the GOP measure, the improvement is 4.26% on average. This is not the case for /Au/, /u/, /O/ and /i:/, target vowels with very low percentages of pronunciation errors. Because the number of pronunciation errors for these phonemes is low, apparently no reliable regression models could be trained, and the resulting EERs are therefore higher than those obtained using only the GOP measure. For vowels with many pronunciation errors and sufficient training material, our new method yields a substantial increase in performance. Combining the two methods is only beneficial for some vowels, most notably /e:/ and /I/.

To gain insight into these overall results, we investigated whether the phones that had been incorrectly realized were correctly rejected by the GOP and wGOP methods (Table 3). We did this for the three vowels with the highest percentages of pronunciation errors, /2:/, /9y/ and /Ei/, and for their three largest confusions. Here we see that the phones which are most often confused with the three targets (/y/ with /2:/, /Au/ with /9y/ and /a:/ with /Ei/) benefit most from our new measure. Presumably this is caused by the weighting of the vowel scores based on frequent confusion patterns.

Table 3: Distribution of realized phones in the correct rejects for the target phonemes /2:/, /9y/ and /Ei/. Column descriptions: (1) target phoneme, (2) realized phoneme, (3) percentage of the total number of incorrectly pronounced phones for the target phoneme, (4) %correct rejects (%CR) at EER using GOP, (5) %CR at EER using wGOP.

target  realized  %of errors   GOP    wGOP
2:      y         32.38       18.10   23.81
2:      @         20.00       13.33   13.33
2:      Y         18.10        8.57   10.48
9y      Au        69.54       59.77   65.52
9y      a:         7.47        3.45    2.87
9y      A          5.75        4.02    4.59
Ei      a:        58.05       44.60   50.30
Ei      j         18.53       11.61   12.63
Ei      e:         8.35        5.91    3.87
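For reference, here is a small sketch of how an equal error rate of the kind reported above can be computed from per-segment scores and annotations. It is our own illustration with hypothetical inputs, using a simple threshold sweep rather than the tooling used in the experiments.

```python
import numpy as np

def equal_error_rate(scores, is_error):
    """Sweep the decision threshold and return the point where the false
    acceptance rate (mispronunciations accepted as correct) and the false
    rejection rate (correct realizations rejected) are closest.

    scores   -- pronunciation quality scores; higher means 'accept as correct'
    is_error -- 1 if the segment was annotated as mispronounced, else 0
                (both classes must be present)
    """
    is_error = np.asarray(is_error, dtype=bool)
    best_eer, best_gap = 1.0, np.inf
    for t in np.sort(scores):
        accepted = scores >= t
        far = accepted[is_error].mean()      # false acceptance rate
        frr = (~accepted)[~is_error].mean()  # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            best_eer = (far + frr) / 2.0
    return best_eer
```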
7. Discussion and Conclusions

From the results of the experiments we carried out, we can conclude that tuning the wGOP measure with enough real non-native speech data considerably improves its discriminative ability compared to the GOP measure.

An important concern in using such tuning data is generalizability to other speakers and tasks. We trained the models speaker-independently, and the speakers in our material have widely varying L1s, such as Turkish, Arabic, Spanish, Chinese, Persian, Hebrew and English. Although these languages have different phonologies, there is apparently some systematicity in the error patterns of these speakers, at least enough for our measure to profit from it. This means that some phonemic confusions are quite stable across speakers. This was also observed in [14], where a number of phonemic confusions were identified that were common to L2 learners with varying L1s. On the other hand, we think that our measure could be further improved by using data from specific L1s or clusters of typologically similar L1s. With enough data available, our measure could be fine-tuned to the specific confusions that occur within a (type of) L1-L2 pair.

Another important aspect is the kind of task the speakers have to perform. We used read speech data, for which the users had to read sentences from a computer screen. As stated in Section 3, there are some obvious phonemic confusions due to interference from orthography in this task, which are not likely to occur when speakers are not reading but have to repeat spoken utterances. As this might lead to different error patterns, it follows that the tuning data has to be appropriate for the task for which it is employed.

In our experiments we have treated all pronunciation errors as equally serious, which might not be a valid assumption in all cases. Consider for example the pronunciation errors of /2:/ realized as /@/ or /Y/ (Table 3), which might be considered less serious than pronouncing /2:/ as /y/. In such cases one could argue about how "false" the false accepts of the system are and how this differs between the different pronunciation errors. Whether or not these and other error patterns should be treated differently is in essence a matter of pedagogy, but ideally the technology should be able to accommodate such requirements. One way to approach the latter problem would be through the calibration of the threshold. In this paper we have used the EER as a measure of discriminative ability, but pedagogically the EER threshold might not be the optimal one. We could, however, optimize the threshold in such a way that it minimizes the total cost of erroneous decisions, calculated by weighting the different types of errors in a pedagogically sound way. It is not straightforward how these costs should be quantified, and more research is needed on this issue.

In the future we plan to investigate in which ways our method could be improved. For some of the vowel sounds discussed in this paper, this would involve handling their context dependence.
Other aspects that could lead to improvement are the initial segmentation, on which all local confidence scoring heavily depends, and speaker adaptation of the HMMs. We would also like to investigate how our method generalizes to other sounds, such as consonants.

8. Acknowledgements

We would like to thank Laura Graaf, Floor Jansen, Eline van Buuren and Marieke Oenema for correcting the automatic phonetic transcriptions. The DISCO project is carried out within the STEVIN programme, which is funded by the Dutch and Flemish Governments (http://taalunieversum.org/taal/technologie/stevin/).

9. References

[1] Strange, W. (Ed.), "Speech Perception and Linguistic Experience: Issues in Cross-Language Research," York Press, Timonium, MD, 1995.
[2] Best, C.T., "A direct realist view of cross-language speech perception," in Strange, W. (Ed.), Speech Perception and Linguistic Experience: Issues in Cross-Language Research, York Press, Timonium, MD, pp. 171-206, 1995.
[3] Flege, J.E., "Second language speech learning: theory, findings and problems," in Strange, W. (Ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues, York Press, Timonium, MD, pp. 233-273, 1995.
[4] Logan, J., Lively, S. and Pisoni, D., "Training Japanese listeners to identify English /r/ and /l/: a first report," Journal of the Acoustical Society of America, vol. 89, pp. 874-886, 1991.
[5] Goudbeek, M., Cutler, A. and Smits, R., "Supervised and unsupervised learning of multidimensionally varying non-native speech categories," Speech Communication, vol. 50, pp. 109-125, 2008.
[6] http://lands.let.ru.nl/~strik/research/DISCO
[7] Witt, S., "Use of speech recognition in computer assisted language learning," Ph.D. dissertation, University of Cambridge, 1999.
[8] Franco, H., Neumeyer, L., Digalakis, V. and Ronen, O., "Combination of machine scores for automatic grading of pronunciation quality," Speech Communication, vol. 30, pp. 121-130, 2000.
[9] Kanters, S., Cucchiarini, C. and Strik, H., "The Goodness of Pronunciation algorithm: a detailed performance study," in Proceedings of SLaTE 2009, Birmingham, 2009.
[10] Yoon, S-Y., Hasegawa-Johnson, M. and Sproat, R., "Automated Pronunciation Scoring using Confidence Scoring and Landmark-based SVM," in Proceedings of Interspeech, Brighton, United Kingdom, 2009.
[11] Booij, G., "The Phonology of Dutch," Clarendon Press, Oxford, 1995.
[12] Gussenhoven, C., "Dutch," in Handbook of the International Phonetic Association, Part II: Illustrations of the IPA, pp. 74-77, Cambridge University Press, Cambridge, 1999.
[13] http://www.phon.ucl.ac.uk/home/sampa/dutch.htm
[14] Neri, A., Cucchiarini, C. and Strik, H., "Selecting segmental errors in L2 Dutch for optimal pronunciation training," International Review of Applied Linguistics, vol. 44, pp. 357-404, 2006.
[15] Truong, K., Neri, A., Cucchiarini, C. and Strik, H., "Automatic Pronunciation Error Detection: An Acoustic-Phonetic Approach," in Proceedings of InSTIL, Venice, Italy, 2004.
[16] Cucchiarini, C., Driesen, J., Van Hamme, H. and Sanders, E., "Recording speech of children, non-natives and elderly people for HLT applications: the JASMIN-CGN corpus," in Proceedings of LREC, 2008.
[17] Oostdijk, N., "The design of the Spoken Dutch Corpus," in New Frontiers of Corpus Research, Peters, P., Collins, P. and Smith, A. (Eds.), Rodopi, pp. 105-112, 2002.
[18] Demuynck, K., Roelens, J., Van Compernolle, D.
and Wambacq, P., "SPRAAK: an open source SPeech Recognition and Automatic Annotation Kit," in Proceedings of ICSLP, p. 495, 2008.
[19] Witten, I. and Frank, E., "Data Mining: Practical Machine Learning Tools and Techniques," Morgan Kaufmann, 2005.