Nonnative speech decoding in a CALL application

[Authors]
Centre for Language and Speech Technology, Radboud University, Nijmegen, The Netherlands
[email addresses]

Abstract

We investigate how non-native speech produced in response to closed-ended questions can be decoded reliably in an ASR-based Computer Assisted Language Learning (CALL) application for Dutch as a second language. Using non-native human-machine dialogues from the JASMIN corpus, we compare constrained language models with and without filled pause loops, acoustic models trained on native speech and retrained on non-native speech, and lexica with data-driven pronunciation variants. Language models that account for filled pauses and acoustic models retrained on non-native speech substantially reduce the utterance error rate, whereas adding pronunciation variants to the lexicon does not improve performance.

Index Terms: non-native speech recognition, computer assisted language learning, language modelling

1. Introduction

Owing to the increasing mobility of workers around the world, the demand for language lessons is growing steadily in many host countries. In several cases the demand clearly outstrips the supply, and immigrants have to wait months before they can enrol in a language course. A compounding problem is that many immigrant workers simply do not have the time to attend language courses. Such situations call for innovative solutions that can make language learning more effective, more personalized, less expensive and less time-consuming. Computer Assisted Language Learning (CALL) systems seem to constitute a viable alternative to teacher-fronted lessons. In addition, recent advances in ASR technology open up new opportunities for developing CALL systems that can address oral proficiency, one of the most problematic skills in terms of time investment and costs.

In the Netherlands, speaking proficiency plays an important role within the framework of the civic integration examinations. Foreigners who wish to acquire Dutch citizenship have to show that they are able to get by in Dutch society and that they speak Dutch at the Common European Framework (CEF) A2 level, which means that they can make themselves understood in Dutch and are able to perform activities such as buying a train ticket or applying for a passport. In this context, automatic systems for improving speaking performance are particularly welcome. Such systems should preferably address important aspects of oral proficiency like pronunciation and grammar.

However, developing ASR-based CALL systems that can provide training and feedback for second language speaking is not trivial, as ASR performance on non-native speech is not yet as good as on native speech. The main problems with non-native speech concern deviations in pronunciation, morphology and syntax, and a relatively high rate of disfluencies, such as filled pauses, repetitions, restarts and repairs. To circumvent the ASR problems caused by these phenomena, various techniques have been proposed to restrict the search space and make the task easier. A well-known strategy consists in eliciting output from learners by letting them choose from a finite set of answers that are presented on the screen. This technique was used successfully in the Tell me More and Talk to Me series developed by Auralog back in 2000 (Auralog, 2000; TTM, 2002). The learner could engage in interactive dialogues with the computer by answering oral questions that were simultaneously displayed on the screen, choosing one response from a limited set of three that were phonetically sufficiently different from each other for the spoken response to be easily recognized by the ASR system.
Although this strategy allows for relatively realistic dialogues and is still applied in language learning applications, it is worthwhile to explore whether ASR technology can be improved to such an extent that relatively more freedom can be allowed in the responses. This would mean that, instead of choosing from a limited set of utterances that can be read aloud, the learner is allowed some freedom in formulating his/her answer. This is particularly necessary in ASR-based CALL systems that are intended for practicing grammar in speaking proficiency. While for practicing pronunciation it may suffice to read aloud sentences that appear on the screen, to practice grammar learners need some freedom in formulating answers so that they can show whether they are able to produce correct forms. So, the challenge in developing an ASR-based system for practicing oral proficiency consists in designing exercises that allow the learners some freedom in producing answers, but that are predictable enough to be handled by ASR.

This is precisely the challenge we face in the DISCO project, which is aimed at developing a prototype of an ASR-based CALL application for practicing speaking performance in Dutch as a second language (DL2). The application aims at optimizing learning through interaction in realistic communication situations and at providing intelligent feedback on important aspects of DL2 speaking, viz. pronunciation, morphology and syntax. The application should be able to detect and give feedback on errors made by learners of Dutch as a second language. Within this project we are designing exercises that stimulate students to produce utterances containing the required morphological and syntactic forms by showing them words on the screen, without declensions or conjugations, in random order, possibly in combination with pictograms and figures representing scenes (e.g. a girl reading a book). In addition, use is made of dialogues and scenarios illustrating so-called “crucial practice situations” (in Dutch: cruciale praktijksituaties or CPS), which correspond to realistic situations in which learners might find themselves in Dutch society and in which they have to interact with other citizens. These CPSs form the basis of the various civic integration examinations. In these exercises learners are prompted to produce utterances which are subsequently analyzed to provide the appropriate feedback. However, before the system can proceed to this type of detailed analysis, it has to be ascertained whether the learner produced an appropriate answer and not something completely different such as “I have no idea”, “I hate this system” or the like. Optimizing the process of recognizing the intended utterance in this initial phase of our DISCO application is the focus of the present paper. Because we do not have DISCO speech data yet, we resorted to other non-native speech material that we had at our disposal and that seemed suitable for our research purpose, as described in the Method section.

2. Method

2.1. Material

The speech material for the present experiments was taken from the JASMIN speech corpus [ref], which contains speech from children, non-natives and elderly people. Since the non-native component of the JASMIN corpus was collected with the aim of facilitating the development of ASR-based language learning applications, it seemed particularly suited for our purpose.
Speech from a miscellaneous group of speakers with different mother tongues was collected for this corpus because this realistically reflects the situation in Dutch L2 classes. In addition, these speakers have relatively low proficiency levels, namely A1, A2 and B1 of the Common European Framework (CEF), because it is for these levels that ASR-based CALL applications appear to be most needed. The JASMIN corpus contains speech collected in two different modalities: read speech and human-machine dialogues. The latter were used for our experiments because they more closely resemble the situation we will encounter in the DISCO application. The JASMIN dialogues were collected through a Wizard-of-Oz-based platform and were designed such that the wizard was in control of the dialogue and could intervene when necessary. In addition, recognition errors were simulated and unexpected and difficult questions were asked to elicit some of the typical phenomena of human-machine interaction that are known to be problematic in the development of spoken dialogue systems, such as hyperarticulation, shouting, restarts, filled pauses, silent pauses, self-talk and repetitions. Considering all these characteristics, we can state that the JASMIN non-native speech is considerably more challenging for ASR than the speech we will encounter in the DISCO application.

The material we used for the present experiments consists of speech from 45 speakers, 40% male and 60% female, with different L1 backgrounds. Each speaker answered 39 questions about a hypothetical journey they would like to make. We first deleted from the corpus the utterances containing crosstalk, background noise and whispering, because these phenomena are not likely to occur in the DISCO application. After deletion of these utterances the material consists of 1325 utterances. To simulate the task in the DISCO application, we generated 39 lists, one for each question, containing the answers given by each speaker. These lists mimic the predicted responses in our CALL application task because both contain responses to relatively closed-ended questions and both contain morphologically and syntactically correct and incorrect responses.

2.2. Speech Recognizer

The speech recognizer we used in this research is SPRAAK \cite{SPRAAK}, an open source speech recognition package. For the acoustic models, we trained 47 3-state Gaussian Mixture Models (GMMs): 46 phone models and 1 silence model. The GMMs were trained using a 32 ms Hamming window with a 10 ms step size. Acoustic feature vectors consisted of 12 mel-frequency cepstral coefficients (MFCCs) plus C0, and their first and second order derivatives. The constrained language models and pronunciation lexica are implemented as finite state transducers (FSTs).

2.3. Language Modelling

As mentioned in the introduction, we have chosen to simplify the decoding task by using a constrained language model. In total, 39 language models are generated, one based on the responses to each of the 39 questions. These responses were manually transcribed at the orthographic level. Filled pauses and disfluencies, i.e. restarts and repetitions, are annotated in these orthographic transcriptions. Filled pauses abound in everyday spontaneous speech and generally do not hamper communication. This is why we want to allow the students to produce filled pauses. In our material x% of the utterances contain one or more filled pauses and more than x% of all transcribed units are filled pauses.
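As an illustration, these two percentages could be computed from the orthographic transcriptions along the following lines. This is a minimal sketch only: the one-utterance-per-line file format and the filled pause tokens "uh" and "uhm" are assumptions for illustration, not the actual JASMIN annotation conventions.

# Sketch: filled pause statistics over orthographically transcribed utterances.
# Assumes one utterance per line and filled pauses written as "uh" or "uhm".

FILLED_PAUSES = {"uh", "uhm"}

def filled_pause_stats(transcription_file):
    utterances_with_fp = 0
    total_utterances = 0
    fp_tokens = 0
    total_tokens = 0
    with open(transcription_file, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            total_utterances += 1
            total_tokens += len(tokens)
            n_fp = sum(1 for t in tokens if t.lower() in FILLED_PAUSES)
            fp_tokens += n_fp
            if n_fp > 0:
                utterances_with_fp += 1
    # Percentage of utterances with at least one filled pause,
    # and percentage of all transcribed units that are filled pauses.
    return (100.0 * utterances_with_fp / total_utterances,
            100.0 * fp_tokens / total_tokens)

# Example use (hypothetical file name):
# pct_utterances, pct_tokens = filled_pause_stats("answers_q01.txt")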
While restarts, repetitions and other disfluency phenomena are also common in normal speech, we think that in a CALL application for training oral proficiency these phenomena can be penalized (ref?). In our material x% of the answers contain one or more disfluencies. In this research we have included restarts and repetitions in the language model.

Our language models are implemented as FSTs with parallel paths containing the orthographic transcriptions of every answer to the question. A priori, each path is equally likely. For example, part of a response list is:

- /ik ga met uh... de vliegtuig/ (/I am going er... by plane/*)
- /ik uh... ga met de trein/ (/I er... am going by train/)
- /met de uh... vliegtuig/ (/by er... plane/*)
- /met het vliegtuig/ (/by plane/)

From this particular set of answers a baseline language model is generated (filled pauses are left out). To be able to decode possible filled pauses between words, we generated another language model with self-loops added at every word node. A loop can be taken when /@/ or /@m/ (the phonetic representations of the two most common filled pauses in Dutch) is observed. The filled pause loop penalty was empirically optimized. An example of this language model is depicted in figure x. To examine whether filled pause loops are a sufficient means of modelling filled pauses, we also experimented with a language model containing the original orthographic transcriptions (which include the manually annotated filled pauses) without filled pause loops.

2.4. Acoustic Modelling

Baseline acoustic triphone models were trained on x hours of native read speech from the CGN corpus (ref). In total 11,660 triphones were created, using 32,738 tied Gaussians. In several studies on non-native speech processing (refs) it has been observed that decoding performance can be increased by adapting or retraining native acoustic models with non-native speech. To investigate whether this is also the case in a constrained task such as the one described in this paper, we retrained the baseline acoustic models with non-native speech. New acoustic models were obtained by performing a one-pass Viterbi training, based on the native acoustic models, with x hours of non-native read speech from the JASMIN corpus. This non-native read speech was uttered by the same speakers as those in our test material.

Triphone acoustic models are the de facto choice for most researchers in speech technology. However, the expected performance gain from modelling context dependency by using triphones rather than monophones might be minimal in a constrained task. To examine this hypothesis, we also experimented with monophone acoustic models trained on the same speech (both native and non-native).

2.5. Lexical Modelling

The baseline pronunciation lexicon contains canonical phonetic representations extracted from the CGN lexicon (ref). However, it is known that non-native pronunciation generally deviates from native pronunciation, both at the phonetic and at the phonemic level. To model this pronunciation variation at the phonemic level, pronunciation variants are usually added to the lexicon. In the literature, several researchers report a slight performance gain from including non-native pronunciation variants (refs). To derive pronunciation variants, we extracted context-dependent rewrite rules from an alignment of canonical and realized phonemic representations of non-native speech from the JASMIN corpus (test material was excluded).
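A minimal sketch of how such context-dependent rewrite rules, and the relative frequencies used as their prior probabilities, might be collected from aligned canonical and realized phone sequences is given below. The data structures, the single-phone context on either side and the alignment format are illustrative assumptions, not the exact procedure used.

from collections import defaultdict

# Sketch: derive context-dependent rewrite rules from pairs of aligned
# canonical and realized phone sequences. A rule maps a canonical phone in
# its left/right context to a realized phone (or to "-" for a deletion).
# The alignment itself (e.g. a dynamic-programming phone alignment) is
# assumed to have been produced beforehand, with both sequences padded to
# equal length using "-".

def count_rewrite_rules(aligned_pairs):
    """aligned_pairs: list of (canonical, realized) phone lists of equal length."""
    counts = defaultdict(lambda: defaultdict(int))
    for canonical, realized in aligned_pairs:
        padded = ["#"] + canonical + ["#"]  # "#" marks utterance boundaries
        for i, real in enumerate(realized):
            left, canon, right = padded[i], padded[i + 1], padded[i + 2]
            counts[(left, canon, right)][real] += 1
    return counts

def rule_probabilities(counts):
    """Relative frequency of each realization of a phone in its context."""
    probs = {}
    for context, realizations in counts.items():
        total = sum(realizations.values())
        probs[context] = {real: n / total for real, n in realizations.items()}
    return probs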
Prior probabilities of these rules were assigned by taking the relative frequency of rule applications in their context. We generated pronunciation variants by successively applying the derived rewrite rules to the canonical representations in the baseline lexicon. Probabilities of these variants were calculated by multiplying the probabilities of the applied rules; the canonical representation has a default probability of 1. Afterwards, the probabilities of the pronunciation variants of each word were normalized. We generated pronunciation lexica with variants by introducing a cutoff probability: variants below the cutoff were not included in the lexicon. In this manner, lexica with on average 2, 3, 4 and 5 variants per word were created. Furthermore, using the lexicon with 5 variants, we also took the prior probabilities into account during decoding.

2.6. Evaluation

We evaluated the speech decoding setups by using the utterance error rate (UER), which is the percentage of utterances for which the 1-Best decoding result deviates from the transcription. Filled pauses are not taken into account during evaluation. As said in the introduction, we do not expect our method to correctly discriminate between phonetically close responses. For that reason, we also evaluated a decoding result as correct when one of the responses within a small phonetic distance of this result equals the transcription. [Something should be added here about the distribution of distances among the answers, and about what these distances concretely represent.] We calculated the phonetic distance using a program called ADAPT (ref).

3. Results

Table 1 shows the UER for the different language models and acoustic models. In all cases, the language model with filled pause loops performed significantly better than the language model without loops. Furthermore, it performed even better than the language model with the manually annotated filled pauses (this still has to be confirmed by the new results). Decoding setups with acoustic models trained on non-native speech performed significantly better than those with acoustic models trained on native speech. The performance difference between monophone and triphone acoustic models was not significant. Naturally, error rates are lower when evaluating using clusters of phonetically similar responses. Performance differences between =0 (precisely correct) and <10 (one of the answers with a phonetic distance of 10 or smaller to the 1-Best equals the transcription), and between <5 and <15, were significant.

Acoustic Model    Language Model    dist = 0   dist < 5   dist < 10   dist < 15
Native (tri)      without loops     30.5       30.0       27.9        26.3
Native (mon)      without loops
Native (tri)      with loops        18.9       18.5       16.4        14.9
Native (mon)      with loops
Native (tri)      with positions    16.5       16.2       14.9        14.0
Native (mon)      with positions
Nonnative (tri)   without loops     25.8       24.5       23.1        21.2
Nonnative (mon)   without loops
Nonnative (tri)   with loops        13.7       13.4       11.3        9.8
Nonnative (mon)   with loops
Nonnative (tri)   with positions    11.5       11.2       9.7         8.8
Nonnative (mon)   with positions

Table 1: Utterance Error Rates for the different language models (without FP loops, with FP loops, and with FPs at their annotated positions) and the different acoustic models (trained on native speech and retrained on non-native speech, both monophone and triphone). All setups used the baseline canonical lexicon. The columns =0, <5, <10 and <15 indicate at what phonetic distance from the precisely correct response the 1-Best is still considered correct.
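For concreteness, the distance-threshold scoring underlying the columns of Tables 1 and 2 could be implemented roughly as follows. This is a sketch under assumptions: the function names are illustrative, filled pauses are assumed to have been stripped from both strings, and any phonetic distance measure (ADAPT in our experiments) can be plugged in as the distance function.

def clustered_uer(results, threshold, distance):
    """results: iterable of (one_best, reference) response pairs.
    distance(a, b): phonetic distance between two responses.
    An utterance counts as correct when the reference transcription equals
    the 1-Best exactly, or lies within `threshold` phonetic distance of it
    (the reference is itself one of the predicted responses)."""
    results = list(results)
    errors = sum(1 for one_best, reference in results
                 if one_best != reference
                 and distance(one_best, reference) > threshold)
    return 100.0 * errors / len(results)

With threshold 0 this reduces to the plain utterance error rate; larger thresholds give the clustered evaluation reported in the dist < 5, < 10 and < 15 columns.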
Performance decreased when lexica with pronunciation variants generated by the data-driven method were used: the more variants were added, the worse the performance.

Lexicon       dist = 0   dist < 5   dist < 10   dist < 15
canonical     18.9       18.5       16.4        14.9
priors        23.2       19.7       17.4        16.0
2 variants    20.4       20.0       17.5        15.6
3 variants    20.9       20.5       18.2        15.7
4 variants    22.1       21.7       19.0        17.0
5 variants    22.5       22.0       19.4        17.8

Table 2: Utterance Error Rates for the different lexica: canonical, 5 variants with priors, and 2-5 variants. These rates were obtained using native triphone acoustic models and language models with filled pause loops.

4. Discussion

[To be written:
- Discuss the effects of the language model.
- Discuss the effects of the acoustic model.
- Discuss the effects of the lexicon; also indicate that experiments with priors were done, but that these did not improve the results either.
- Discuss the 'effects' of the answer clusters.]

5. Conclusion

[conclusion]

6. References

[references]