The Goodness of Pronunciation Algorithm: a Detailed Performance Study

Sandra Kanters 1, Catia Cucchiarini 2, Helmer Strik 2

1 Customer Contact Solutions, Logica, The Netherlands
2 Department of Linguistics, Radboud University Nijmegen, The Netherlands

sandra.kanters@logica.com, [c.cucchiarini|h.strik]@let.ru.nl

Abstract

An inventory was compiled of pronunciation errors frequently made by foreigners speaking Dutch. On the basis of this inventory, artificial errors were created in a native development corpus, which in turn were used to optimize thresholds for the Goodness of Pronunciation (GOP) algorithm. In the current study the GOP algorithm is evaluated in three different ways: (1) using a native test corpus with artificial errors which reflect errors frequently made by non-natives, (2) within an actual application used by non-natives for practicing pronunciation, and (3) post-hoc, using the recorded interactions of the pronunciation training application, to determine what the performance of the algorithm would have been if optimal speaker- and phone-specific thresholds had been used. The results show that the performance of the GOP algorithm was satisfactory and that the procedure by which thresholds were determined by simulating realistic pronunciation errors was appropriate, because performance on the artificially introduced errors closely approximated performance on real data. This finding is particularly welcome if we consider that, in general, paucity of data is a common problem in this kind of research. Furthermore, it appeared that post-hoc threshold optimization only led to a slight increase in performance.

Index Terms: Goodness of Pronunciation (GOP), pronunciation error detection, Computer Assisted Pronunciation Training (CAPT)

1. Introduction

Research on second language (L2) acquisition has indicated that exposure to a second language might not be sufficient for L2 learning (e.g., [1]), especially for adult L2 learners. Relevant in this respect are Swain's output hypothesis [1], which emphasizes the role of output in L2 learning, and Schmidt's 'noticing hypothesis' [2], which underlines that awareness of discrepancies between the learner's output and the L2 is necessary for the acquisition of a specific linguistic item. Since exposure to the L2 and L2 output will not automatically guarantee this kind of awareness, corrective feedback is required to make learners aware of their errors and to stimulate them to attempt self-improvement [3]. In pronunciation learning corrective feedback is particularly required because learners are very often not aware of the pronunciation errors they make. On the other hand, providing individual corrective feedback on pronunciation is particularly time-consuming for teachers, with the result that the amount of practice that is needed is almost never achieved in the classroom. Computer Assisted Language Learning (CALL) systems that make use of Automatic Speech Recognition (ASR) seem to offer an alternative for practicing pronunciation, because they can offer specific feedback on individual errors and extra time for practicing at the learners' own pace. An important requirement is then that the feedback provided be helpful. In part this is determined by the accuracy of the feedback: if learners receive inaccurate feedback (pronunciation errors are indicated where actually no errors occur, or pronunciation errors are missed), they are less likely to actually improve their pronunciation.
Corrective feedback on pronunciation can be given on different aspects. In this paper we focus on corrective feedback at the phoneme level. Providing this kind of detailed feedback is considerably more challenging than providing corrective feedback at a more global level, such as the word or sentence level. As a matter of fact, for global feedback pronunciation measures can be used that are calculated over longer stretches of speech, and therefore more data points, while detailed feedback at the segmental level requires computing a score for each individual realization of a given phone.

Various approaches to segmental error detection can be found in the literature. The best-known example is the Goodness Of Pronunciation (GOP) algorithm proposed by Witt [5], [6]. The GOP algorithm calculates the likelihood ratio that the realized phone corresponds to the phoneme that should have been spoken according to the canonical pronunciation. Thresholds, calculated beforehand, are used to decide which likelihood ratio scores correspond to mispronounced sounds. The GOP algorithm was applied in the Dutch-CAPT system [7], [10], a system designed to provide corrective feedback on a selected number of speech sounds, referred to as target phonemes, which had appeared to be problematic for learners of Dutch from various first language backgrounds [4], [7]. This inventory of errors was determined on the basis of three non-native corpora (not including the Dutch-CAPT corpus) [4].

In the current study the GOP algorithm is evaluated in three different ways to get insight into how GOP scores vary as a function of different parameters, in particular threshold values. The ultimate aim is to determine whether and how pronunciation error detection can be improved. In short, the three procedures are the following (more details on the methodology are provided in sections 2.4.1, 2.4.2, and 2.4.3, respectively, while the results are presented in sections 3.1, 3.2, and 3.3, respectively): (1) Since not enough non-native material was available, as is often the case, GOP thresholds for the target phonemes were optimized by creating artificial pronunciation errors in native data. In our case, these artificial errors reflect the errors frequently made by foreigners [4], [7]. Thresholds per phoneme were optimized on one set of native data and tested on another. (2) These thresholds were employed in an actual application, the Dutch-CAPT system, which was used by non-natives. All interactions were recorded and evaluated afterwards [7], [10]. (3) Finally, we also tested, post-hoc, what the performance of the algorithm would have been if we had used speaker-specific thresholds for all phones that are optimal for the current data.

The most relevant innovative aspects of the current study are that the GOP algorithm is evaluated for foreigners speaking Dutch, that the thresholds are optimized using a native development corpus with artificial errors that reflect errors frequently made by foreigners (based on an inventory made using other corpora), and finally that the GOP algorithm is evaluated in three different ways: (1) using an independent native test corpus with realistic artificial errors, (2) in an actual application used by non-natives, and (3) post-hoc, using speaker-specific thresholds.
2. Method

2.1. Material

The inventory of pronunciation errors was based on three corpora of non-native speech (for more details see [4], [7]). Speech from three other, non-overlapping corpora was used to form the databases of this study. Two corpora were sub-corpora of the Spoken Dutch Corpus (Corpus Gesproken Nederlands; CGN), a corpus of about 9 million words that constitutes a plausible sample of standard Dutch as spoken in the Netherlands and Flanders and contains various annotation layers [8]. We chose two sub-corpora of Dutch spoken by native speakers from the Netherlands; one was used as development corpus (CGN-dev) and an independent one as test corpus (CGN-test). The third corpus contains the speech material that was collected through Dutch-CAPT and consists of interactions between non-native language learners and the Dutch-CAPT training system. The learners had different native languages. This material was manually annotated for pronunciation errors.

The performance of the algorithm was investigated for the 11 target phonemes. The databases were formed with all realizations of these phonemes. This made up a total of 92,798 realizations for CGN-dev, 191,147 realizations for CGN-test (about 50% are errors), and 1,806 for Dutch-CAPT (about 42% are errors).

2.2. The Goodness of Pronunciation algorithm

The GOP algorithm [5], [6] calculates the likelihood ratio that a phone realization corresponds to the phoneme that should have been spoken (the so-called GOP score). The student's speech is subjected to both a forced and a free speech recognition phase. During forced recognition a known orthographic transcription of the speech signal is used to force the recognition of the speech, while in the free recognition phase the phoneme sequence most likely to have been spoken is calculated. The GOP score of a specific phone realization is then calculated by taking the absolute difference between the log probability of the forced recognition and that of the free recognition. Phones with GOP scores above a pre-defined threshold are probably mispronounced and are for this reason rejected by the algorithm. Conversely, phones with scores below the pre-defined threshold are probably well pronounced and are accepted.
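To make the scoring and the accept/reject decision concrete, the following is a minimal sketch of the procedure described above, not the system's actual implementation: the function names and all numeric values are ours, and the per-frame duration normalization is taken from Witt's original formulation [5], [6] rather than spelled out in this paper.

```python
def gop_score(forced_logprob: float, free_logprob: float, num_frames: int) -> float:
    """GOP score of one phone realization: the absolute difference between
    the log probability of the forced recognition and that of the free
    recognition. Dividing by the number of frames (duration normalization)
    follows Witt's formulation and is an assumption here."""
    return abs(forced_logprob - free_logprob) / max(num_frames, 1)

def judge(score: float, threshold: float) -> str:
    """Scores above the pre-defined threshold are rejected (flagged as
    mispronounced); scores at or below it are accepted."""
    return "rejected" if score > threshold else "accepted"

# Hypothetical realization of one target phone (all values illustrative):
score = gop_score(forced_logprob=-310.2, free_logprob=-295.8, num_frames=12)
print(round(score, 2), judge(score, threshold=1.0))
```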
2.3. Performance measures

A classification algorithm like the GOP can produce four types of outcomes: 1) correctly accepted (CA) phone realizations, i.e. phones that were pronounced correctly and were also judged as correct; 2) correctly rejected (CR) phone realizations, i.e. phones that were pronounced incorrectly and were also judged as incorrect; 3) mispronunciations that were falsely judged as being correct (FA: False Accept); and 4) correct pronunciations that were falsely flagged as mispronunciations (FR: False Reject). To achieve optimal performance the algorithm should detect the mispronunciations and, at the same time, it should not flag as mispronunciations those realizations that were actually correct. For this reason both the number of correctly rejected (CR) and the number of correctly accepted (CA) realizations are important in the performance calculation.

The performance of an error detection algorithm can be calculated in different ways. One way is to measure the scoring accuracy (SA), which is calculated by formula (1) below. Other widely used measures for calculating the performance of a classification algorithm are precision, recall, and the F-measure. These metrics can be calculated both for the correct accepts and for the correct rejects (see (2)-(6)).

SA = ((CA + CR) / (CA + CR + FA + FR)) * 100    (1)
Precision of CA = (CA / (CA + FA)) * 100    (2)
Precision of CR = (CR / (CR + FR)) * 100    (3)
Recall of CA = (CA / (CA + FR)) * 100    (4)
Recall of CR = (CR / (CR + FA)) * 100    (5)
F-measure = 2 * (Precision * Recall) / (Precision + Recall)    (6)
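As a concrete check of formulas (1)-(6), the sketch below computes all seven values from the four outcome counts. The function name is ours; the counts in the usage example are reconstructed from the Dutch-CAPT percentages reported in Table 1 (Section 3) and reproduce the reported SA of 81.51%.

```python
def performance(CA: int, CR: int, FA: int, FR: int) -> dict:
    """Scoring accuracy, precision, recall, and F-measure of the correct
    accepts (CA) and correct rejects (CR), as in formulas (1)-(6)."""
    f = lambda p, r: 2 * p * r / (p + r)                      # formula (6)
    sa = 100 * (CA + CR) / (CA + CR + FA + FR)                # formula (1)
    p_ca, r_ca = 100 * CA / (CA + FA), 100 * CA / (CA + FR)   # formulas (2), (4)
    p_cr, r_cr = 100 * CR / (CR + FR), 100 * CR / (CR + FA)   # formulas (3), (5)
    return {"SA": sa, "P_CA": p_ca, "R_CA": r_ca, "F_CA": f(p_ca, r_ca),
            "P_CR": p_cr, "R_CR": r_cr, "F_CR": f(p_cr, r_cr)}

# Counts reconstructed from the Dutch-CAPT column of Table 1
# (percentages of 1,806 realizations); SA comes out at 81.51.
print(performance(CA=897, CR=575, FA=188, FR=146))
```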
2.4. Analyses

2.4.1. Establishing thresholds

The aim of this exercise was to find GOP thresholds that maximize SA while keeping FR below 10%. The rationale behind this decision was that erroneously rejecting correct pronunciations would be more detrimental for learners than erroneously accepting mispronunciations. Optimal GOP thresholds were established in the following way. First, since we did not have enough non-native speech material at our disposal, pronunciation errors were simulated by changing the phonemic representations in the lexicon of the native speech corpus. The artificial errors were introduced in the pronunciation dictionary for the 11 target phonemes, phone by phone. For each phone, for half of the entries containing that phone, the correct pronunciation (i.e. phone) was replaced by an incorrect pronunciation (i.e. another phone). The scheme according to which correct phones were replaced by erroneous ones was based on information that we had collected on how Dutch phones are frequently mispronounced by L2 learners [7], [10]. Optimal thresholds were then established for each phoneme-gender combination by carrying out an exhaustive search. Preliminary experiments had shown that a step size of about 0.25 was sufficient, since generally there is a range of threshold values for which the values of SA do not differ significantly. The GOP thresholds were established using the development corpus CGN-dev and were evaluated on the independent test corpus CGN-test (see results in Section 3.1).

2.4.2. Performance on Dutch-CAPT

The thresholds obtained for the CGN-dev corpus (see Section 2.4.1) were used in the Dutch-CAPT system. In order to get insight into the performance of the GOP algorithm on speech by non-native speakers, GOP scores were calculated for the speech collected from the users of the Dutch-CAPT system [9], [10]. Unlike in the Dutch-CAPT system, where a maximum of three pronunciation errors per utterance was indicated, in this study GOP scores were calculated for all pronunciation errors. The performance was measured in SA and in precision, recall, and F-measure of the correct accepts and the correct rejects.

2.4.3. Threshold optimization

Threshold optimization was carried out post-hoc on the Dutch-CAPT material with the aim of finding out whether the performance of the algorithm could be improved by using thresholds at a more specific level. Instead of using thresholds for each phoneme-gender pair, which pools speakers of the same gender, performance was measured with phoneme-speaker-dependent thresholds, i.e. separate thresholds for each speaker. First, for each phoneme-speaker pair the threshold that yielded the highest SA for that specific pair was calculated; threshold values within a specific range were tried to find those optimal thresholds. Performance was then calculated for each speaker separately. Subsequently, the values for the various speakers were combined to obtain measurements for the whole group. A sketch of this kind of exhaustive search is given below.
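The sketch below illustrates the exhaustive search of Section 2.4.1 for one phoneme-gender (or, as in Section 2.4.3, phoneme-speaker) pair: it scans candidate thresholds in steps of 0.25 and keeps the one that maximizes SA, subject to the FR constraint. The function and parameter names are ours, the candidate range is illustrative, and FR is interpreted as a percentage of all realizations, as it is reported in Table 1; for the post-hoc optimization of Section 2.4.3 the constraint would simply be dropped.

```python
import numpy as np

def optimal_threshold(scores, is_error, step=0.25, max_fr=10.0):
    """Exhaustive search over candidate GOP thresholds for one
    phoneme-gender (or phoneme-speaker) pair: maximize scoring accuracy
    (SA) while keeping the false-reject rate below max_fr percent of all
    realizations. `scores` are GOP scores; `is_error` marks the
    (simulated or manually annotated) mispronunciations."""
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(is_error, dtype=bool)
    n = len(scores)
    best_t, best_sa = None, -1.0
    for t in np.arange(scores.min(), scores.max() + step, step):
        rejected = scores > t
        CR = np.sum(rejected & errors)    # mispronunciations caught
        CA = np.sum(~rejected & ~errors)  # correct phones accepted
        FR = np.sum(rejected & ~errors)   # correct phones falsely flagged
        sa = 100 * (CA + CR) / n
        if 100 * FR / n < max_fr and sa > best_sa:
            best_t, best_sa = t, sa
    return best_t, best_sa
```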
3. Results

3.1. Establishing thresholds

Optimal thresholds were established for each phoneme-gender combination. The GOP thresholds were determined by means of CGN-dev (see Section 2.4.1) and evaluated on CGN-test. The average evaluation results are shown in the second column of Table 1. It can be observed that all performance values (SA, precision, recall, and F) are higher than 80%. The goal was to find GOP thresholds for which SA was high and FR remained below 10%; the FR value in Table 1 is indeed smaller than 10%. The percentage of artificial pronunciation errors in the material is about 50%. It can also be seen that the performance of the algorithm for the correct and the incorrect phonemes does not differ much, since CA and FA do not differ much from CR and FR, respectively.

3.2. Performance on Dutch-CAPT

In the third column of Table 1 the performance results for the Dutch-CAPT database are presented. These results show that SA was 81.51%. For the performance measures precision, recall, and F-measure, slightly higher percentages were obtained for correct accepts than for correct rejects. Remarkably, these values for realistic errors of non-natives do not differ much from those for artificial errors in native data.

3.3. Threshold optimization

In the fourth column of Table 1 the results of the threshold optimization analysis are presented. The performance values are all higher than those in column three. However, if one considers that this is the best that can be obtained (post-hoc) for this method, it can be concluded that the thresholds obtained with the method using realistic artificial errors in native data appear to work very well.

Table 1. The number of phoneme realizations, their distribution over CA, CR, FA, and FR, and the performance results on CGN-test, Dutch-CAPT, and Dutch-CAPT (optimized).

                       CGN-test   Dutch-CAPT   Dutch-CAPT (optimized)
Total # realizations   191,147    1,806        1,806
CA                     40.25 %    49.67 %      51.61 %
CR                     41.42 %    31.84 %      35.99 %
FA                      8.54 %    10.41 %       6.26 %
FR                      9.79 %     8.08 %       6.15 %
SA                     81.67 %    81.51 %      87.60 %
Precision of CA        82.49 %    82.67 %      89.19 %
Recall of CA           80.43 %    86.00 %      89.36 %
F-measure of CA        81.45 %    84.30 %      89.27 %
Precision of CR        80.88 %    79.75 %      85.41 %
Recall of CR           82.90 %    75.36 %      85.19 %
F-measure of CR        81.88 %    77.49 %      85.30 %

4. Discussion

In this paper the performance of the GOP algorithm was studied to get insight into how GOP scores vary as a function of different parameters, in particular threshold values. The ultimate aim was to determine whether and how pronunciation error detection could be improved. The performance of the algorithm was studied for the 11 target phonemes using three databases. CGN-dev was used to determine threshold values for each phoneme-gender pair. With these thresholds the performance of the algorithm was calculated on a database of Dutch spoken by native speakers in which pronunciation errors had been artificially added (CGN-test) and on a database of Dutch spoken by non-natives (Dutch-CAPT), which had been manually annotated for pronunciation errors.

The performance of the GOP algorithm was measured in SA and in precision and recall of CA and CR. The results for CGN-test showed that SA was about 82%, and that precision and recall percentages were roughly the same. Also for Dutch-CAPT SA was about 82%, but precision and recall of CA were slightly higher (83% and 86%, respectively), and precision and recall of CR were slightly lower (80% and 75%, respectively). Although both SA and precision and recall measure the performance of the algorithm, they analyze it from different perspectives: SA shows the percentage of correct classifications (CA and CR) versus incorrect classifications (FA and FR), but it does not focus on either correct accepts or correct rejects, which precision and recall do. Both in CGN-test and in Dutch-CAPT about half of the phones are mispronounced, and FA and FR do not differ much. This explains why the performance measures do not differ considerably for the two corpora.

A post-hoc threshold optimization analysis on the Dutch-CAPT data showed that using more specific thresholds, i.e. thresholds for each phoneme-speaker pair, yielded slightly better performance. For each phoneme-speaker pair the threshold which optimized SA for that specific pair was calculated. With these new thresholds the performance in SA and in precision and recall of CA and CR was measured. This analysis resulted in an SA of approximately 88%. Precision and recall of CA were 89%, and precision and recall of CR were 85%.

Compared to the research by Witt [5] it has to be concluded that lower SA percentages are obtained here (82% against Witt's 90%), but this lower performance can probably be explained by the fact that we used a more realistic simulation of the real world, and consequently the task was more difficult. First, Witt does not mention which and how many different phonemes she used in creating artificial errors; we calculated the performance on the 11 phonemes that tend to be difficult for language learners and that were addressed in Dutch-CAPT. Second, while simulating the pronunciation errors we first checked how Dutch phones are usually mispronounced and used this information in changing the phonemic representations. Witt, on the other hand, created artificial errors by replacing in the lexicon all realizations of a given phoneme, say /a/, by another one, say /i/. However, the chance that language learners will make that type of error is smaller than that they will confuse or mispronounce phonemes that are acoustically more similar, such as /i/ and /I/, or /x/ and /k/. Likewise, the GOP algorithm will have a harder time distinguishing /i/ from /I/ and /x/ from /k/ than distinguishing /a/ from /i/. This might explain the higher SA values obtained by Witt for the native material.

The finding that the accuracy measures for CGN-test are not very different from those for Dutch-CAPT indicates that the performance on real data approximates the performance on artificially introduced pronunciation errors. In other words, the procedure by which thresholds were determined worked properly. This is also partly related to our choice of simulated errors, as these were based on knowledge about pronunciation errors that L2 learners actually make. This finding is particularly reassuring because in this kind of research data sparseness is just a fact of life. These outcomes show that when real data are not available, they can at least be simulated with satisfactory results. Unfortunately, we cannot compare our results for non-natives to those of Witt, because Witt did not present SA results for non-natives.

The finding that post-hoc threshold optimization only led to a slight increase in performance can be explained by the fact that the GOP scores of well-pronounced and mispronounced sounds overlap to a considerable extent. In other words, whichever threshold is chosen, there will always be False Accepts and/or False Rejects. For this reason, the solution for improving the performance has to be sought in using speech characteristics for which such an overlap is minimized. A possible way of doing this is by enhancing the GOP algorithm with acoustic-phonetic information; with the latter approach, results have been obtained that are even better than GOP results [9].
With such an approach specific phoneme characteristics can be included, which could perhaps help the algorithm to better detect which sounds are correctly or incorrectly pronounced.

5. Conclusions

From the results presented in this paper we can draw the following conclusions. First, the performance of the GOP algorithm, at 80-90%, is satisfactory. Second, the procedure by which thresholds were determined was appropriate, because performance on artificially introduced pronunciation errors closely approximated performance on real data. This finding is particularly welcome if we consider that, in general, paucity of data is a common problem in this kind of research. Our results indicate that, in the absence of real data, acceptable results can be obtained by simulating pronunciation errors in a realistic way. Third, the performance of the algorithm could be improved (slightly) by adopting more specific thresholds. Although adopting thresholds for each phoneme-speaker pair will not be easily feasible in practice, it is worth investigating whether groups of speakers can be formed to which the same thresholds can be applied (e.g. speakers with the same or comparable native languages). Fourth, as threshold optimization only led to a slight increase in performance, it is clear that other ways have to be found to improve the performance of the GOP algorithm, for instance by including acoustic-phonetic information (e.g., [9]) that better models specific phoneme characteristics.

6. Acknowledgements

We are indebted to Febe de Wet for her work on establishing the GOP thresholds and to Ambra Neri, who collected the Dutch-CAPT speech material and its annotations.

7. References

[1] Swain, M., "Communicative competence: some roles of comprehensible input and comprehensible output in its development", in Input in Second Language Acquisition, Gass, S.M. and Madden, C.G. [Eds.], Rowley, MA: Newbury House, pp. 235-253, 1985.
[2] Schmidt, R.W., "The role of consciousness in second language learning", Applied Linguistics, vol. 11, pp. 129-158, 1990.
[3] Havranek, G., "When is corrective feedback most likely to succeed?", International Journal of Educational Research, vol. 37, pp. 255-270, 2002.
[4] Neri, A., Cucchiarini, C. and Strik, H., "Selecting segmental errors in L2 Dutch for optimal pronunciation training", IRAL - International Review of Applied Linguistics, vol. 44, pp. 357-404, 2006.
[5] Witt, S.M., "Use of speech recognition in Computer-assisted Language Learning", PhD thesis, Department of Engineering, University of Cambridge, 1999.
[6] Witt, S.M. and Young, S., "Phone-level pronunciation scoring and assessment for interactive language learning", Speech Communication, vol. 30, pp. 95-108, 2000.
[7] Cucchiarini, C., Neri, A. and Strik, H., "Oral proficiency training in Dutch L2: The contribution of ASR-based corrective feedback", to appear in Speech Communication.
[8] Oostdijk, N., "The design of the Spoken Dutch Corpus", in New Frontiers of Corpus Research, Peters, P., Collins, P. and Smith, A. [Eds.], Rodopi, Amsterdam, pp. 105-112, 2002.
[9] Strik, H., Truong, K., de Wet, F. and Cucchiarini, C., "Comparing different approaches for automatic pronunciation error detection", to appear in Speech Communication.
[10] Neri, A., Cucchiarini, C. and Strik, H., "The effectiveness of computer-based corrective feedback for improving segmental quality in L2-Dutch", ReCALL, vol. 20, no. 2, pp. 225-243, 2008.