Utterance Verification in Language Learning Applications

Joost van Doremalen, Helmer Strik, Catia Cucchiarini
Department of Linguistics, Radboud University, Nijmegen, The Netherlands
{j.vandoremalen, h.strik, c.cucchiarini}@let.ru.nl

Abstract

A CALL system for oral proficiency is being developed in which constrained responses are elicited from L2 learners. In the first phase the best matching utterance is selected from a predefined list of possible responses. Since recognition errors may occur and giving feedback on the basis of incorrectly recognized utterances is confusing, we verify the correctness of the recognized utterance in the second phase. In the current paper we focus on the utterance verification process. Combining duration-related features with a likelihood ratio (LR) yielded an equal error rate (EER) of 10.3%, which was significantly better than the EER for the LR alone, 14.4%, and the EER for the duration-related features, 25.3%.

Index Terms: utterance verification, non-native speech processing, computer-assisted language learning

1. Introduction

In second language acquisition research it is widely acknowledged that naturalistic, implicit learning is not always sufficient to achieve high-quality L2 proficiency and that explicit instruction helps overcome some of the problems [1][2]. In the case of oral proficiency, providing sufficient instruction and feedback is more problematic than for other skills because time-consuming interaction with an individual tutor is usually required. This might explain the increasing interest in applying automatic speech recognition (ASR) to oral proficiency learning [3].

The overview by Eskenazi [3] also makes it clear that developing good-quality ASR-based language learning applications is fraught with difficulties. One of the problems concerns the relatively poor performance of ASR systems on non-native speech and the consequent need to develop approaches that restrict the search space and make the task easier. A major distinction can be drawn between strategies that are essentially aimed at constraining the output of the learner, so that the speech becomes more predictable, and techniques that are aimed at improving the decoding of non-native speech. Within the first category, a possible strategy consists in eliciting constrained output from learners by letting them read aloud an utterance from a limited set of answers presented on the screen, or by allowing a limited amount of freedom in formulating responses, as in the Subarashii [4] and Let's Go [5] systems.

However, more freedom in user responses is particularly necessary in ASR-based CALL systems that are intended for practicing grammar in speaking proficiency. While for practicing pronunciation it may suffice to read sentences aloud, to practice grammar learners need some freedom in formulating answers in order to show whether they are able to produce correct forms. This can be achieved by designing exercises that allow the learners some freedom in producing answers, but that are predictable enough to be handled by ASR. In our DISCO project, which is aimed at developing a prototype of an ASR-based CALL application that can provide intelligent feedback on important aspects of L2 speaking such as pronunciation, morphology and syntax [6], this is achieved by generating a predefined list of possible (correct and incorrect) responses for each exercise. We intend to use a two-step procedure in which it is first determined what was said (content), and subsequently how it was said (form).
In the first (recognition) phase the system should tolerate deviations in the way utterances are spoken, while in the second (error detection) phase strictness is required (see also [7] and [8]). Within the first phase of the two-step procedure two stages can be distinguished: a) utterance selection and b) utterance verification (UV). When learners are allowed some freedom in formulating their responses, there is always the possibility that the learner's response is not present in the predefined list and is incorrectly recognized in stage (a) as one of the utterances in the predefined list. Giving feedback on the basis of an incorrectly recognized utterance is confusing and should be avoided. Therefore, utterance verification is carried out in stage (b). An excellent overview of recent work on UV can be found in [9]. In the present paper we focus on the process of verifying the decoded utterance within the framework of a CALL application for oral proficiency.

In the remainder of this paper we first describe the speech material used in our experiments, and subsequently the speech recognizer and the UV approach adopted in these experiments. The results are presented in Section 3. In Section 4 we discuss our findings and speculate on possible ways of utilizing our method for UV in the context of a CALL application like DISCO. We end with some concluding remarks in Section 5.

2. Method

2.1. Material

The speech material for the present experiments was taken from the non-native component of the JASMIN speech corpus [10], which was collected with the aim of facilitating the development of ASR-based language learning applications and is therefore particularly suited for our purpose. Speakers with different mother tongues and relatively low proficiency levels (A1, A2 and B1 of the Common European Framework) were recorded, because this is the group for which ASR-based CALL applications are in demand. The JASMIN corpus contains read speech and human-machine dialogues. The latter were used for our experiments because they more closely resemble the situation we will encounter in the DISCO application. The JASMIN dialogues were designed to elicit typical phenomena of human-machine interaction that are known to be problematic in the development of spoken dialogue systems, i.e. restarts, filled pauses and repetitions.

The material we used for the present experiments consists of speech from 45 speakers, 40% male and 60% female, with 25 different L1 backgrounds. Ages range from 19 to 55, with a mean of 33. Each speaker responded to 39 questions about a journey. We first removed utterances containing crosstalk, background noise or whispering from the corpus. After deletion of these utterances the material consists of 1325 utterances. The mean signal-to-noise ratio (SNR) of the material is 24.9, with a standard deviation of 5.1.

To simulate the DISCO task of selecting and verifying the utterance that was spoken, we generated language models from the lists of responses given by each speaker to each of the 39 questions. These lists mimic the predicted responses in our CALL application task because they contain a) responses to relatively closed questions and b) morphologically and syntactically correct and incorrect responses. Note that in this set the response that was spoken was always present in the language model. To simulate the case in which the spoken utterance is not present in the list, we also generated language models in which the correct utterance is left out. In this way, our dataset consists of 1650 items, because each utterance is decoded twice: once with its representation present in the language model and once without it.
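To make this setup concrete, the following is a minimal sketch of how the two decoding conditions could be constructed. It is not the DISCO implementation: the function names are hypothetical, the real system builds FSM language models in SPRAAK rather than plain word-sequence sets, and the per-question response lists are assumed to be available as orthographic strings.

```python
def build_language_model(responses):
    """Stand-in for a constrained LM: the set of allowed word sequences."""
    return {tuple(r.split()) for r in responses}


def make_evaluation_items(responses_per_question, spoken_utterances):
    """Pair each spoken utterance with an LM that contains its transcription
    and with an LM from which that transcription has been removed."""
    items = []
    for question_id, utterance in spoken_utterances:
        responses = responses_per_question[question_id]
        lm_with = build_language_model(responses)
        lm_without = build_language_model([r for r in responses if r != utterance])
        items.append((utterance, lm_with, True))      # correct path present
        items.append((utterance, lm_without, False))  # correct path absent
    return items
```

Each utterance is then decoded twice, once per condition, which yields the two subsets evaluated below.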
2.2. Utterance selection

For selecting the spoken utterance from a list, we used a speech recognizer with a constrained language model and a small vocabulary. The speech recognizer used in this research is SPRAAK [11], an open source HMM-based ASR package. In the following sections we discuss the setup of this recognizer.

2.2.1. Acoustic preprocessing

Acoustic preprocessing was done by dividing the speech, sampled at 16 kHz, into overlapping 32 ms Hamming windows with a 10 ms shift and a pre-emphasis factor of 0.95. Twelve Mel-frequency cepstral coefficients (MFCCs) plus C0, and their first and second order derivatives, were calculated, and cepstral mean subtraction (CMS) was applied.

2.2.2. Language model and pronunciation lexicon

Constrained language models (LMs) were generated based on the responses to each of the 39 questions. These responses were manually transcribed at the orthographic level; restarts and repetitions were also annotated. The LMs are implemented as finite state machines (FSMs) with parallel paths containing the word sequences of the responses. A priori, each path is equally likely. To be able to decode filled pauses between words, self-loops are added at every node. Filled pauses are represented in the pronunciation lexicon. The pronunciation lexicon contains canonical phonetic representations extracted from the CGN lexicon [12].

2.2.3. Acoustic models

We trained three-state tied Gaussian Mixture Models (GMMs). Baseline triphone models for 47 units, 46 phonemes and one silence model, were trained on 42 hours of native read speech from the CGN corpus [12]. In total 11,660 triphones were created, using 32,738 Gaussians. These native models were retrained on non-native speech by means of a one-pass Viterbi training with 6 hours of non-native read speech from the JASMIN corpus. These utterances were spoken by the same speakers as those in the test material.

Table 1: Equal error rates (EER) for the individual features LR, nr shorter 1, nr shorter 5, nr longer 95 and nr longer 99, and for the combinations duration comb (nr shorter 1, nr shorter 5, nr longer 95, nr longer 99) and all features (all).

Features         EER
LR               14.4%
nr shorter 1     27.3%
nr shorter 5     27.4%
nr longer 95     35.8%
nr longer 99     38.5%
duration comb    25.3%
all              10.3%

2.3. Utterance verification

A common approach to utterance verification is to extract confidence predictors during decoding and combine these using a machine learning model. This model is then trained to predict whether the utterance is correctly or incorrectly recognized. Confidence predictors that are often used include N-best list counts, hypothesis density, acoustic stability and duration-related features [9]. We have also adopted this confidence predictor combination approach and used two types of predictors, an acoustic likelihood ratio and duration-related features, to train a logistic regression model. Details on the predictors and the model are provided below.

2.3.1. Acoustic likelihood ratio

The first confidence predictor, one that has been used in, for example, [13], is the likelihood ratio:

    p(x|u_1) / p(x|u_FPR)    (1)

in which u_1 is the 1-best decoding result given the signal x and u_FPR is the optimal phone string found using free phone recognition. We call this predictor LR. The rationale behind this predictor is that when the input speech is not modelled as a path in the search space, the likelihood p(x|u_1) is smaller relative to p(x|u_FPR) than when it is modelled. This predictor estimates the posterior probability of the utterance given the speech signal x, where p(x|u_FPR) serves as an estimate of the probability of x.
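As an illustration of how this predictor could be computed, the sketch below works in the log domain with acoustic log-likelihoods from the two decoding passes (constrained 1-best decoding and free phone recognition). The numbers and the way the log-likelihoods are obtained are assumptions for illustration, not values from our system.

```python
def likelihood_ratio(loglik_1best, loglik_free_phone):
    """Predictor LR of eq. (1) in the log domain: log p(x|u_1) - log p(x|u_FPR).
    Values near zero mean the constrained 1-best path explains the signal about
    as well as unconstrained free phone recognition; strongly negative values
    suggest the spoken response is not represented in the language model."""
    return loglik_1best - loglik_free_phone


# Hypothetical acoustic log-likelihoods from the two decoding passes:
lr = likelihood_ratio(loglik_1best=-4210.3, loglik_free_phone=-4105.8)
print(lr)  # -104.5 -> the constrained path fits the signal clearly worse
```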
2.3.2. Duration-related features

When the input speech is not modelled as a path in the search space and the utterance is recognized as another sequence of words, the phone segmentation of this word sequence will generally be characterized by deviations in phone durations. A straightforward way to capture this is to count the phones in the segmentation whose durations deviate substantially from the mean phone duration. We implemented this by using predictors similar to those introduced in [14]. Phone duration distributions were derived from manually verified phonemic transcriptions of 42 hours of native read speech from the CGN corpus [12]. For each of the 46 phonemes, the 1st, 5th, 95th and 99th percentile durations were calculated from these distributions. The predictors extracted from the segmentation are the number of phonemes in the decoded utterance that are shorter than the 1st (nr shorter 1) and 5th (nr shorter 5) percentile durations, and the number of phonemes that are longer than the 95th (nr longer 95) and 99th (nr longer 99) percentile durations. These predictors were normalized by the total number of phonemes in the recognized utterance.

Table 2: Percentages of correctly and incorrectly classified decoding results for the two subsets and the total set, using the global EER threshold and all predictors. (a) Set in which the correct transcription was present in the language model. (b) Set in which the correct transcription was not present in the language model. (c) Whole dataset.

(a)
                       actual correct    actual incorrect
predicted correct           80.8%              3.0%
predicted incorrect          9.2%              7.0%

(b)
                       actual correct    actual incorrect
predicted correct            --                8.3%
predicted incorrect          --               91.7%

(c)
                       actual correct    actual incorrect
predicted correct           40.4%              5.6%
predicted incorrect          4.6%             49.4%

2.3.3. Feature combination

To combine the five predictors, i.e. LR, nr shorter 1, nr shorter 5, nr longer 95 and nr longer 99, into one confidence measure we used a logistic regression model. In this model it is assumed that the logit of the probability of a binary variable is a linear function of a set of explanatory variables:

    logit(p(y|p)) = log[ p(y|p) / (1 - p(y|p)) ] = β_0 + Σ_{i=1}^{N} β_i x_i    (2)

where p(y|p) is the probability of a correctly or incorrectly decoded utterance y given the confidence predictors p. The optimal weights β are chosen through Maximum Likelihood Estimation (MLE) in the WEKA machine learning toolkit [15]. We trained and tested the model using leave-one-speaker-out cross-validation, in which the model is trained on all speakers except one and then tested on the utterances of the speaker that was left out during training. This is repeated until all speakers have been tested, and the results of all speakers are averaged.

2.4. Evaluation

We evaluated the discriminative ability of our utterance verifier using Receiver Operating Characteristic (ROC) curves, in which the two types of error rates, i.e. the false positive rate and the false negative rate, are plotted for different thresholds. Using the point on the ROC curve where the two error rates are equal, the equal error rate (EER), the different confidence predictors and their combinations are evaluated. 95% confidence intervals were calculated to investigate whether differences between EERs were significant.
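The duration-related predictors of Section 2.3.2 reduce to normalized percentile counts over the phone segmentation. The following is a minimal sketch, assuming the segmentation is available as (phoneme, duration) pairs and the native percentile durations as a lookup table; all names are placeholders rather than parts of our system.

```python
def duration_predictors(segmentation, percentiles):
    """Normalized counts of phones whose duration falls below the 1st/5th or
    above the 95th/99th percentile of the native duration distribution.

    segmentation: list of (phoneme, duration_in_seconds) pairs.
    percentiles:  dict mapping phoneme -> {1: d1, 5: d5, 95: d95, 99: d99}.
    """
    n = len(segmentation)
    nr_shorter_1 = sum(dur < percentiles[ph][1] for ph, dur in segmentation)
    nr_shorter_5 = sum(dur < percentiles[ph][5] for ph, dur in segmentation)
    nr_longer_95 = sum(dur > percentiles[ph][95] for ph, dur in segmentation)
    nr_longer_99 = sum(dur > percentiles[ph][99] for ph, dur in segmentation)
    # Normalize by the total number of phonemes in the recognized utterance.
    return [nr_shorter_1 / n, nr_shorter_5 / n, nr_longer_95 / n, nr_longer_99 / n]
```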
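The predictor combination of Section 2.3.3 and the EER evaluation of Section 2.4 can be sketched as follows. Our experiments used logistic regression in WEKA; the sketch below substitutes scikit-learn purely for illustration, and the array names (X, y, speakers) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import LeaveOneGroupOut


def loso_scores(X, y, speakers):
    """Leave-one-speaker-out cross-validation: train on all speakers except
    one, score the held-out speaker, and collect the predicted probabilities."""
    scores = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return scores


def equal_error_rate(y, scores):
    """EER: the point on the ROC curve where the false positive rate equals
    the false negative rate (1 - true positive rate)."""
    fpr, tpr, _ = roc_curve(y, scores)
    fnr = 1.0 - tpr
    idx = int(np.argmin(np.abs(fnr - fpr)))
    return (fpr[idx] + fnr[idx]) / 2.0


# X: predictors (LR plus the four duration features), y: 1 for correctly
# decoded utterances, speakers: speaker id per item (all hypothetical arrays).
# eer = equal_error_rate(y, loso_scores(X, y, speakers))
```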
3. Results

The utterance error rate (UER) of our speech decoder on the set of decoding results where the correct transcription was present in the LM was 10.0%. In this case errors consist of substitutions with competing language model paths. The UER on the set without the correct transcription in the LM was of course 100.0%, so 55.0% of all cases were incorrectly recognized. The task for the UV was to discriminate between the correctly and incorrectly recognized cases. Table 1 shows this ability in terms of EER for the individual predictors and several predictor combinations. ROC curves of the best performing predictor and two combinations are shown in Figure 1.

Figure 1: ROC curves for the predictor LR and the combinations duration comb and all (false positive rate versus false negative rate).

Among the individual predictors, LR performs best (14.4%) and all the duration-related predictors perform considerably worse. When we combined all duration-related predictors (duration comb), the EER dropped significantly relative to the best performing duration-related predictor, from 27.3% (with a confidence interval of 1.7) to 25.3%. Finally, by combining LR with duration comb, the EER decreased significantly by 4.1 percentage points relative to LR, from 14.4% to 10.3%.

Tables 2a and 2b show the percentages obtained with the EER threshold and all predictors for the two sets of decoding results, with and without the correct transcription in the LM, respectively. For example, in the set with the correct transcription in the LM, 80.8% of the cases were classified as correct when they were indeed correctly decoded, and 9.2% were classified as incorrect (false rejections). In the set without the correct transcription in the LM, 91.7% of the cases were classified as incorrect when they were incorrectly decoded, and 8.3% were classified as correct (false acceptances). The performance on the whole dataset is shown in Table 2c.

4. Discussion

The duration-related predictors perform weakly individually, but they still contain information that is complementary to the acoustic likelihood ratio LR. The duration-related predictor distributions of correctly and incorrectly decoded utterances overlap severely. This was still the case when we normalized these predictors for the speaking rate within the utterance, or when we used the probability of the phoneme durations in the utterance as a predictor. The latter was calculated through a kernel density estimation of the duration probability density per phoneme, trained on the CGN native read speech data. Using these more complex predictors, the model was not able to make substantially better predictions.

By introducing a UV procedure and using the EER threshold, we are able to filter out 91.7% of the utterances that are not in the predicted list of responses. This comes at the cost of also rejecting utterances that are correctly decoded and accepting utterances that are incorrectly decoded. Of course, these error rates depend not only on the discriminative performance of the UV, but also on the threshold setting. In our CALL application this threshold setting has consequences for the learner, because of the potentially misleading feedback he or she may receive.
Until now we have evaluated the performance of the different predictors and combinations using the EER threshold, but this might not be the optimal threshold setting in the actual application. In our application the recognized utterance will probably be shown to the user, so that he or she knows whether the utterance was correctly recognized. If the system makes an error in recognizing the utterance, this will then be clear to the user. The system can make two types of errors: a) a false rejection, in which case a correctly decoded utterance is classified as incorrect by the UV, or b) a false acceptance, in which case an incorrectly decoded utterance is classified as correct. To determine which of these errors is more detrimental at this stage of the application, it is necessary to consider how such errors can be handled in the application and what their possible consequences are.

In the case of a rejection, and therefore also of a false rejection, it is possible to ask the user to repeat the utterance. In concrete terms, then, a false rejection implies that the user is unnecessarily asked to repeat the utterance. In the case of a false acceptance, an utterance will be shown to the user that he or she did not actually produce. This type of error would seem to be more detrimental because it can affect the credibility of the system. However, the degree of seriousness will depend on the degree of discrepancy between the utterance that was actually produced and the one that was recognized and shown by the system: the larger the deviation, the more serious the error. On the other hand, large deviations are less likely than small deviations. On the basis of such considerations we can indicate the seriousness of the two types of errors and therefore the costs that should be assigned to false rejections and false acceptances. More information on this issue can be found in [16].

There are thus three factors that are important in choosing an application-dependent threshold, namely 1) the prior probability of a correct decoding, p_correct, 2) the cost of a false rejection, C_FR, and 3) the cost of a false acceptance, C_FA. To formalize the idea of taking different error costs and different prior distributions into account when choosing a threshold, we can estimate the total cost of a specific threshold setting with a cost function:

    C_total = p_FR * C_FR * p_correct + p_FA * C_FA * (1 - p_correct)    (3)

where p_FR and p_FA are the probabilities of a false rejection and a false acceptance, respectively. This kind of cost function is also used in the NIST evaluations of speaker recognition systems [17]. Minimizing C_total on a development set will provide us with the optimal threshold setting given the application-dependent parameters C_FR, C_FA and p_correct. Using the UV with this application-dependent threshold calibration procedure will make it an excellent research vehicle for future experiments with different error costs.
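As an illustration of this calibration step, the sketch below selects the decision threshold that minimizes the cost function (3) on a development set. It assumes confidence scores for which higher means "more likely correctly decoded"; the variable names and cost values are hypothetical.

```python
import numpy as np


def total_cost(p_fr, p_fa, c_fr, c_fa, p_correct):
    """Expected cost of eq. (3): false rejections weighted by their cost and
    the prior of a correct decoding, false acceptances by their cost and the
    prior of an incorrect decoding."""
    return p_fr * c_fr * p_correct + p_fa * c_fa * (1.0 - p_correct)


def calibrate_threshold(scores, labels, c_fr, c_fa, p_correct):
    """Return the threshold minimizing C_total on a development set.
    labels: 1 for correctly decoded utterances, 0 otherwise."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best_threshold, best_cost = None, np.inf
    for t in np.unique(scores):
        accept = scores >= t
        p_fa = accept[labels == 0].mean() if np.any(labels == 0) else 0.0
        p_fr = (~accept)[labels == 1].mean() if np.any(labels == 1) else 0.0
        cost = total_cost(p_fr, p_fa, c_fr, c_fa, p_correct)
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost


# Hypothetical example: false acceptances twice as costly as false rejections.
# threshold, cost = calibrate_threshold(dev_scores, dev_labels,
#                                       c_fr=1.0, c_fa=2.0, p_correct=0.45)
```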
5. Conclusion

We have evaluated several procedures for utterance verification. The best result obtained for a single duration-related feature is an EER of 27.3%. By combining four duration-related features, the EER could be reduced significantly to 25.3%. Better results, i.e. an EER of 14.4%, were found for the acoustic likelihood ratio, and a further significant reduction to 10.3% was obtained by combining the likelihood ratio with the four duration-related features.

6. Acknowledgements

The DISCO project is carried out within the STEVIN programme, which is funded by the Dutch and Flemish Governments (http://taalunieversum.org/taal/technologie/stevin/).

7. References

[1] Norris, J.M. and Ortega, L., "Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis", Language Learning, vol. 50, pp. 417-528, 2000.
[2] Ellis, N.C. and Bogart, P.S.H., "Speech and Language Technology in Education: the perspective from SLA research and practice", In Proceedings of the ISCA ITRW SLaTE, Farmington, PA, 2007.
[3] Eskenazi, M., "An overview of Spoken Language Technology for Education", Speech Communication, 2009.
[4] Ehsani, F., Bernstein, J. and Najmi, A., "An interactive dialog system for learning Japanese", Speech Communication, vol. 30, pp. 167-177, 2000.
[5] Raux, A. and Eskenazi, M., "Using task-oriented spoken dialogue systems for language learning: potential, practical applications and challenges", In Proceedings of INSTILL, 2004.
[6] DISCO project website, http://lands.let.ru.nl/~strik/research/DISCO/.
[7] Menzel, W., Herron, D., Morton, R., Pezzotta, D., Bonaventura, P. and Howarth, P., "Interactive pronunciation training", ReCALL, vol. 13, no. 1, pp. 67-78, 2000.
[8] Cucchiarini, C., Neri, A. and Strik, H., "Oral proficiency training in Dutch L2: The contribution of ASR-based corrective feedback", Speech Communication, to appear.
[9] Jiang, H., "Confidence measures for speech recognition: a survey", Speech Communication, vol. 45, pp. 455-470, 2005.
[10] Cucchiarini, C., Driesen, J., Van hamme, H. and Sanders, E., "Recording speech of children, non-natives and elderly people for HLT applications: the JASMIN-CGN Corpus", In Proceedings of LREC, 2008.
[11] Demuynck, K., Roelens, J., Van Compernolle, D. and Wambacq, P., "SPRAAK: an open source SPeech Recognition and Automatic Annotation Kit", In Proceedings of ICSLP, page 495, 2008.
[12] Oostdijk, N., "The design of the Spoken Dutch Corpus", In Peters, P., Collins, P. and Smith, A. (Eds.), New Frontiers of Corpus Research, Rodopi, Amsterdam, pp. 105-112, 2002.
[13] Bouwman, G. and Boves, L., "Utterance verification based on the likelihood distance to alternative paths", In Proceedings of the 5th International Conference on Text, Speech and Dialogue, pp. 213-220, 2002.
[14] Goronzy, S., Marasek, K., Kompe, R. and Haag, A., "Prosodically Motivated Features for Confidence Measures", In ASR2000, vol. 1, pp. 207-212, 2000.
[15] Witten, I.H. and Frank, E., "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[16] Bachman, L., "Fundamental considerations in language testing", Oxford University Press, 1990, pp. 214-218.
[17] van Leeuwen, D. and Brümmer, N., "An Introduction to Application-Independent Evaluation of Speaker Recognition Systems", In Speaker Classification I, Christian Müller (Ed.), Springer, 2007.