Practicing syntax in spoken interaction: Automatic detection of syntactical errors in non-native utterances

Helmer Strik, Janneke van de Loo, Joost van Doremalen, Catia Cucchiarini
Department of Linguistics, Radboud University Nijmegen, The Netherlands
{h.strik, j.vandoremalen, c.cucchiarini}@let.ru.nl, JannekevandeLoo@student.ru.nl

Abstract

In the current paper we present a new method, called SynPOS: Syntactic analysis using POS-tags. SynPOS is applied to a corpus of spoken human-machine interactions. The results show that language learners of Dutch often make syntactical errors, that there are many different types of syntactical errors, and that their frequencies vary a lot. This information can subsequently be used to select errors and develop exercises for CALL systems.

Index Terms: syntactic analysis, syntactical errors, part-of-speech, POS tags, ASR-based CALL

1. Introduction

Within the framework of Computer Assisted Language Learning (CALL) numerous systems have been developed for practicing grammar (morphology and syntax) in a foreign or second language. In the majority of these systems the learner’s output is provided in the written modality, by means of a keyboard and/or a mouse (clicking, drag & drop, etc.). Although this way of practicing may be successful for learning the grammar of the target language, it is questionable whether the knowledge thus acquired really contributes to speaking the target language more correctly. Two important questions may be raised in this respect. First, according to some researchers this type of explicit knowledge about grammar is essentially different from the implicit knowledge of a language that is acquired from usage, rather than from rules and drills, and that is required for communicative competence (for a brief overview, see [4]). Second, it is not clear whether knowledge acquired in one modality (written) generalizes to other modalities (spoken). Research so far indicates that this is not the case [3].
For these reasons, it is very interesting to develop CALL systems that can handle non-native speech and that make it possible to practice grammar and to receive feedback while speaking in the target language. However, as far as we know, grammar has not yet been systematically addressed in ASR-based CALL systems that analyze L2 learners' speech production. Exceptions are Lee and Seneff [5], who present an approach for automatic grammar correction, and the DISCO (Development and Integration of Speech technology into Courseware for language learning) project [9], which aims to realize a CALL system that uses automatic speech recognition (ASR) to assess the speech of learners of Dutch as a second language and to provide corrective feedback on pronunciation and grammar. In the FASOP (Feedback on Syntax in Oral Proficiency) project [11] we will use the latter system to study the effect of providing different types of feedback on the acquisition of syntax in oral proficiency.

Although an ASR-based CALL system for practicing grammar may seem particularly appealing, developing a good system is far from trivial, mainly because automatic speech recognition of non-native speech is still problematic and thus only limited tasks can be used. For instance, as prompts one often uses written utterances that have to be read aloud or spoken utterances that have to be repeated. While such exercises can be useful for practicing pronunciation, they are not appropriate for practicing grammar. Given the limitations of speech technology, the question then is how grammar can be practiced in such CALL systems. In the current paper the focus is on syntax. In order to practice syntax, we need to know what should be practiced and what the exercises should look like, and then we need to develop the technology to handle these exercises automatically.
The final goal of the current line of research is a method that makes it possible to develop an ASR-based CALL system for practicing grammar. In the Dutch-CAPT project [2, 10] we faced similar problems regarding pronunciation in Dutch as L2 and we adopted the following procedure: make an inventory of the errors, list criteria for selecting the errors, use them to select errors, and finally develop a system, i.e. the exercises to practice these aspects and the technology to handle these exercises (detect the errors and give feedback about them). In the current paper we explore the possibilities of using a similar procedure for syntax.

In the case of pronunciation, one can go through an utterance from the beginning to the end and determine for every sound whether it is pronounced correctly or not. In the case of syntax, the issues are more complex. For instance, it is not possible to simply go through an utterance from the beginning to the end and determine for every word whether it is correct or not. In fact, it is not straightforward what kind of method should be used to analyze non-native speech data, to make the inventory of errors, to select errors, and to generate the system (exercises and technology).

We present a new method for automatically generating an inventory of (syntactical) errors made by non-native speakers by analyzing utterances from a corpus of non-native speech. The method makes use of part-of-speech (POS) tags to label the words in each utterance, and an algorithm that matches words in two utterances: the (correct) target utterance and the (possibly erroneous) realization of the utterance. In section 2 we describe this method together with the non-native speech material we used. The results are presented in section 3 and discussed in section 4.

2. Material and method

2.1. Material

The non-native speech material for the present experiments was taken from the JASMIN speech corpus [1].
Recordings were made for speakers with many different mother tongues who had relatively low proficiency levels, namely A1, A2 and B1 of the Common European Framework (CEF). For the experiments reported on in this paper we used the spontaneous speech material. Orthographic transcriptions were manually created and include (dis-)fluency phenomena such as filled pauses, restarts and repetitions. Grammatical errors were manually annotated. Furthermore, the annotators also entered the corresponding correct target utterance (see the examples presented below). For every utterance containing an error we thus have the realization and the corresponding correct target utterance. The total number of utterances containing at least 1 error is 954. For the time being we selected only the 589 utterances (with 4150 words in the target utterances) that contain only 1 syntactical error. Note that in addition, the utterances often contain other errors, e.g. regarding morphology, pronunciation of sounds and prosody, disfluencies, etc.

2.2. Method

The general method for analyzing non-native utterances for grammatical errors is called SynPOS: Syntactical analysis using POS-tags. It consists of the following four stages, carried out for each pair of utterances (target & realization):

• (1) Add POS-tags
• (2) Align words in the utterances
• (3) Match words in the utterances
• (4) Make an error list

Stages (1) and (2) are interchangeable, but are listed in this order because stages (2) and (3) together serve to match the words in each pair of utterances (target & realization). The four stages are described in more detail in the following sections.

2.2.1. Add POS-tags

TADPOLE is a modular memory-based morphosyntactic tagger, analyzer and dependency parser for Dutch. TADPOLE is an acronym of 'TAgger, Dependency Parser, and mOrphoLogical analyzEr' [6, 8]. For the current research we only use the output of the part-of-speech (POS) tagger and the information about the lemmas.
The POS-tags used are listed in Table 1. The first column contains the Dutch acronym, as obtained with TADPOLE, and the second column an English acronym and short description. An example of a realized utterance, its corresponding target, and the POS-tags of both utterances are provided in Figures 1 and 2.

2.2.2. Align words in the utterances

The program SCLITE is a tool for scoring and evaluating the output of automatic speech recognition (ASR) systems. SCLITE is part of the NIST SCTK Scoring Toolkit [7, 13]. It is generally used to compare the output of the ASR system to the correct target text. In our case, SCLITE is used to align the words (without using the POS-tags) for each pair of utterances. An example of the output for a pair of utterances is provided in Figure 2 (see the lines target, realization & SCLITE).

SCLITE results in an alignment of the two corresponding utterances, containing information on deletions (Del), insertions (Ins), and substitutions (Sub) (see Figure 2). However, this is not enough for our goals, as will become clear below. For instance, in some cases a combination of an insertion in one utterance and a deletion in the other utterance is a transposition (Tp). Therefore, some extra matching steps are needed, as described in the next section.

2.2.3. Match words in the utterances

Below, a short description of the different steps is presented. The effect of these steps is illustrated in the example in Figure 2. First, position numbers are added to the words in the target. Whenever words are matched in one of the following steps, the position number of the target word is copied to the matching word in the realization.

* step a. Match equal words aligned by SCLITE
For words that match exactly (same position and form), copy the position number of the target to the realization. Obviously, the match is not yet complete, and therefore extra steps are needed.

* step b. Match other equal words (except ART)
In step b, words (except words with the POS-tag ART) with the same form but in other positions are matched.

* step c. Match words with equal lemmas (except ART)
In step c, words (except words with the POS-tag ART) with the same lemma are matched. For this step we use the lemmas obtained with TADPOLE (see section 2.2.1).

Steps b & c are not carried out for words with the POS-tag ART. The reason is that many utterances contain multiple articles, and non-native speakers make a lot of errors regarding articles (see Table 1). Treating articles in the same way as words with other POS-tags would produce many erroneous matches. For instance, in the example in Figure 2, look at the two occurrences of the word “de”, which obviously should not be matched. They have the same form, and thus would be matched in step b; they also have the same lemma, and thus would be matched in step c. Matching of articles is resolved in the next steps.

* step d. Match words with small Levenshtein distance
Sometimes the orthographic representations of two words that should be matched differ slightly. The reason could be a typo, a pronunciation error (which in some cases is coded in the ‘orthographic’ representation), a morphological error, etc. To resolve these issues we match words for which the Levenshtein distance divided by the length of the longest word is smaller than or equal to 1/3. This is only done for pairs of words for which the length of the longest word is at least 4. Note that in this step the POS-tag of the realized word “ZWIMBAD” (i.e. WW, shown as VERB in Figure 2) is also replaced by the correct POS-tag of the matching word “ZWEMBAD” (i.e. N) of the target utterance.

* step e. Match words with equal POS-tags in matching post-word context
Match words with the same POS-tag and a matching post-word context, i.e. at least one of the two words following the word in the target utterance should have been matched to one of the two words following the word in the realization.
* step f. Match words in matching (surrounding and post-word) contexts
In this final step, words are matched (see Figure 1) if
• either both surrounding (left & right) words match,
• or both following words match

target    en   TEN   derde  wil     ik    …
POStag    CON  PREP  NUM    VERB    PRON  …
lemma     en   ten   drie   willen  ik    …
pos.nr.   0    1     2      3       4     …

real.     en   DE    derde  wil     ik    …
POStag    CON  ART   NUM    VERB    PRON
lemma     en   de    drie   willen  ik    …
pos.nr.   0    --    2      3       4     …
stepf     0    1     2      3       4     …

Figure 1: Example illustrating the effect of step f.

target    omdat    ik    ALTIJD  met   DE   bus  naar  HET  ZWEMBAD  GA
          because  I     always  with  the  bus  to    the  pool     go
POStag    CON      PRON  ADV     PREP  ART  N    PREP  ART  N        VERB
lemma     omdat    ik    altijd  met   de   bus  naar  het  zwembad  gaan
pos.nr.   0        1     2       3     4    5    6     7    8        9

real.     omdat    ik    GAAT    met   **   bus  naar  DE   ZWIMBAD  ALTIJD
          because  I     goes    with  --   bus  to    the  pool     always
SCLITE    =        =     Sub     =     Del  =    =     Sub  Sub      Sub
POStag    CON      PRON  VERB    PREP  -    N    PREP  ART  VERB     ADV
lemma     omdat    ik    gaan    met   -    bus  naar  de   zwimbad  altijd
pos.nr.
stepa     0        1     --      3     --   5    6     --   --       --
stepb     0        1     --      3     --   5    6     --   --       2
stepc     0        1     9       3     --   5    6     --   --       2
stepd     0        1     9       3     --   5    6     --   8(N)     2
stepe     0        1     9       3     --   5    6     7    8(N)     2
final     0        1     9       3     --   5    6     7    8(N)     2
SynPOS    =        =     Tp+Sub  =     Del  =    =     Sub  Sub      Tp

Figure 2: Made-up example of a pair of utterances illustrating the method: the annotations and the effect of the various steps. SynPOS finds substitutions (Sub), deletions (Del), insertions (Ins), and transpositions (Tp). For further explanation see text.

Table 1. Absolute frequency and relative frequency (%) of syntactical errors. The columns contain the frequencies of Del, Sub, Tp, and Ins; the rows contain the frequencies for the different POS-tags.
Dutch acronym  English acronym and description  Total  Del           Sub           Tp            Ins
                                                4150   399 (9.6%)    302 (7.3%)    212 (5.1%)    125 (3.0%)
LID            ART - article                    350    170 (48.6%)   20 (5.7%)     1 (0.3%)      18 (5.1%)
VNW            PRON - pronoun                   884    133 (15.0%)   62 (7.0%)     32 (3.6%)     18 (2.0%)
VZ             PREP - preposition               384    38 (9.9%)     42 (10.9%)    6 (1.6%)      32 (8.3%)
VG             CON - conjunction                198    10 (5.1%)     4 (2.0%)      1 (0.5%)      14 (7.1%)
WW             VERB - verb                      853    36 (4.2%)     81 (9.5%)     102 (12.0%)   28 (3.3%)
BW             ADV - adverb                     375    5 (1.3%)      20 (5.3%)     27 (7.2%)     3 (0.8%)
N              N - noun                         608    5 (0.8%)      47 (7.7%)     19 (3.1%)     6 (1.0%)
ADJ            ADJ - adjective                  358    2 (0.6%)      62 (17.3%)    22 (6.1%)     4 (1.1%)
TSW            INT - interjection               8      -             -             -             2 (25.0%)
SPEC           SPEC - special token             84     -             4 (4.8%)      2 (2.4%)      -
TW             NUM - numeral                    48     -             -             -             -

2.2.4. Make an error list

After all the steps described above have been carried out, the errors are annotated (see the row SynPOS in Figure 2), and a report with the results is generated. Some results are presented in the next section.

3. Results

An overview of the results obtained with our SynPOS method is presented in Table 1. For the syntactical errors we present both the absolute frequencies (the number of occurrences) and the relative frequencies (which were obtained by dividing the absolute frequencies by the number of occurrences of the POS-tags listed in the column ‘Total’). The order of the results in Table 1 is as follows:

1. First in the columns: decreasing absolute and relative frequency, i.e. Del, Sub, Tp, and Ins.
2. Next in the rows: decreasing relative frequency (%) in the column Del.

It can be observed in Table 1 that many errors are found by our method: 399 (9.6%) deletions, 302 (7.3%) substitutions, and 212 (5.1%) transpositions; thus in total 913 (21.9%) of the words in the target are changed. In addition, 125 (3.0%) insertions were found. There are also many different types of errors, i.e. 35 in Table 1 (35 cells in Table 1 have a value larger than 0). Not all of these types of syntactical errors occur equally often.
Deletion of articles occurs most often, both in terms of absolute and relative frequency; almost half of the articles are not realized.

These results can be useful for selecting syntactical errors for CALL systems. Frequency is obviously an important criterion, both absolute and relative frequency. Absolute and relative frequency can be combined, e.g., by simply multiplying their numbers. As an example, the values for which the product of these two numbers is larger than 2 are listed in bold in Table 1, and those for which the product is between 1 and 2 are in italics. Of course, besides frequency other criteria could be used for selecting syntactical errors.

4. Discussion and conclusions

In the previous sections we have presented a new method, called SynPOS, to analyze syntactical errors in speaking performance for the purpose of developing CALL exercises for practicing syntax in Dutch L2 spoken interaction. SynPOS yields clear and plausible results that are in line with previous findings, especially with respect to the frequent syntactical errors we found. It therefore seems that SynPOS can be employed to analyze corpora in order to identify syntactical errors together with quantitative information about them. These results can then be used to select syntactical errors, and subsequently to develop a system for practicing the more problematic L2 syntactical phenomena. For example, the quantitative information can be employed to develop a language model (LM) for the ASR with different probabilities for the options (paths) present in the language model.

In the current research the method is applied to Dutch utterances. However, the proposed method can also be applied to other languages, if POS taggers exist for those languages. The next thing we are going to study is finding patterns in the results, patterns that generalize from our current data to other data, and thus can be used for system development.
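The selection criterion mentioned in section 3, multiplying absolute and relative frequency, can be sketched as follows. This is only an illustration under stated assumptions: the counts are the Del column of Table 1, the relative frequency is taken as a fraction rather than a percentage (the text does not specify which), and the names DELETIONS, selection_score and rank_error_types are hypothetical.

```python
# Counts copied from the Del column of Table 1:
# POS-tag -> (number of words with this tag in the targets, number of deletions).
DELETIONS = {
    "ART": (350, 170),
    "PRON": (884, 133),
    "PREP": (384, 38),
    "CON": (198, 10),
    "VERB": (853, 36),
    "ADV": (375, 5),
    "N": (608, 5),
    "ADJ": (358, 2),
}


def selection_score(total, count):
    """Absolute frequency times relative frequency.

    Assumption: the relative frequency is used as a fraction (count / total).
    """
    return count * (count / total)


def rank_error_types(table):
    """Return (tag, score) pairs sorted by decreasing score."""
    scored = [(tag, selection_score(total, count))
              for tag, (total, count) in table.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_error_types(DELETIONS)
for tag, score in ranking:
    print(f"{tag:5s} {score:7.2f}")
```

With these counts, deletion of articles ranks first and deletion of pronouns second, which agrees with the observation that article deletion is the most frequent error. The same function could be applied to the Sub, Tp and Ins columns.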
For deletions and substitutions (the largest classes) the situation is probably straightforward: the position of these words in the target utterances is known, and these words can simply be deleted or substituted. In the LM of the ASR we can then add extra arcs (paths), possibly with the corresponding probabilities. However, in the case of insertions and transpositions we have to find patterns that make clear where the words could appear (given the syntactical errors that non-natives make). Maybe the information we have at the moment is not rich enough to make this possible to a sufficient degree. If that turns out to be the case, we will consider gathering extra information.

An obvious alternative would be to use a syntactic parser, e.g. for Dutch the Alpino parser [12]. A disadvantage of using a syntactic parser is that its output may contain more errors than the output of a POS-tagger, even for the correct target utterance. In any case, given that at the moment there probably is no method that can correctly analyze utterances spoken by non-natives that contain errors, it is probably best to use a correct target and its analysis as a reference, as we did in the current method with POS-tags.

In the first three stages of this method some errors are made. We manually checked tags and lemmas of 50 pairs of utterances. The 50 target utterances contained 394 words in total, out of which 15 words (4%) received an incorrect POS-tag from TADPOLE. Of these 15 words, 10 belong to two classes that were often tagged incorrectly, i.e. (1) "het weer" (the weather) which should be tagged as 'ART N', and (2) some adjectives tagged as adverbs. Often, when the POS-tag is incorrect, the lemma is also incorrect; for the words with correct POS-tag the lemma was generally correct as well. For the POS-tags and lemmas we could use other resources, but they probably will contain other errors, and it is not likely that the net gain will be very large.
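To make stage 3 more concrete, steps a to d of section 2.2.3 can be sketched as follows, using the made-up pair of utterances from Figure 2. This is a minimal sketch and not the implementation used for the experiments: the function names are hypothetical, the alignment is assumed to contain no insertions (so slot i of the realization is aligned with target position i, and a gap is None), and steps e and f as well as the POS-tag correction of step d are left out.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def close_enough(a, b):
    """Step d criterion: distance / longest length <= 1/3, longest length >= 4."""
    longest = max(len(a), len(b))
    return longest >= 4 and 3 * levenshtein(a, b) <= longest


def match(target, realization):
    """Copy target position numbers to the realization (steps a to d).

    Words are (form, POS-tag, lemma) tuples; None marks a gap in the alignment.
    """
    positions = [None] * len(realization)
    used = set()

    def sweep(same, skip_articles):
        for i, word in enumerate(realization):
            if word is None or positions[i] is not None:
                continue
            if skip_articles and word[1] == "ART":
                continue
            for t, tword in enumerate(target):
                if t not in used and same(word, tword):
                    positions[i] = t
                    used.add(t)
                    break

    # step a: equal words in the aligned position
    for i, word in enumerate(realization):
        if word is not None and i < len(target) and word[0] == target[i][0]:
            positions[i] = i
            used.add(i)
    sweep(lambda r, t: r[0] == t[0], True)                 # step b: equal forms
    sweep(lambda r, t: r[2] == t[2], True)                 # step c: equal lemmas
    sweep(lambda r, t: close_enough(r[0], t[0]), False)    # step d: small distance
    return positions


target = [("omdat", "CON", "omdat"), ("ik", "PRON", "ik"), ("altijd", "ADV", "altijd"),
          ("met", "PREP", "met"), ("de", "ART", "de"), ("bus", "N", "bus"),
          ("naar", "PREP", "naar"), ("het", "ART", "het"),
          ("zwembad", "N", "zwembad"), ("ga", "VERB", "gaan")]
realization = [("omdat", "CON", "omdat"), ("ik", "PRON", "ik"), ("gaat", "VERB", "gaan"),
               ("met", "PREP", "met"), None, ("bus", "N", "bus"),
               ("naar", "PREP", "naar"), ("de", "ART", "de"),
               ("zwimbad", "VERB", "zwimbad"), ("altijd", "ADV", "altijd")]

print(match(target, realization))
```

For this pair the sketch yields the position numbers of the row stepd in Figure 2; the article “de” stays unmatched until step e.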
Furthermore, the alignments produced by SCLITE are not always optimal. SCLITE offers some possibilities to improve the alignment, for instance by using Levenshtein distance. For the present experiments we used the standard ‘basic’ version of SCLITE. However, the alignment errors that can be resolved in this way probably are already resolved in our stage 3.

Finally, for all 589 target-realization pairs, for which the target utterances contain 4150 words, only 12 matching errors were found, i.e. for only 2.0% of the utterances and 0.29% of the words. Consequently, the number of errors made by SynPOS is small, and some of the errors made in stages 1 and 2 are resolved in stage 3. Still, there might be room for some improvement, but a more thorough analysis requires a larger corpus, and it is not likely that this will result in substantial changes in the analysis results, especially not in the frequent syntactical errors found.

We could also use more fine-grained POS-tags; for instance, within the class of pronouns we could distinguish personal pronouns, demonstrative pronouns, etc. For the analysis this is not necessary, but it may be useful for finding patterns in the results. However, for finding patterns the biggest gain can probably be obtained by using a syntactic parser, as was already mentioned above. We intend to study these issues in future research. We will also look at utterances containing more than 1 syntactical error.

Finally, we will use the information obtained with SynPOS to develop and test ASR-based CALL exercises to train syntax in spoken language in the projects DISCO [9] and FASOP [11]: we will select syntactical errors, develop exercises to train these aspects, develop the technology to handle the spoken replies automatically, analyze them, and provide feedback, and finally compare and test the effect of providing feedback in different ways.

5. References

References to papers and URLs, listed in alphabetical order.
[1] Cucchiarini, C., Driesen, J., Van Hamme, H. and Sanders, E. (2008) “Recording speech of children, non-natives and elderly people for HLT applications: the JASMIN-CGN corpus”, Proceedings of LREC-2008.
[2] Cucchiarini, C., Neri, A. and Strik, H. (2009) “Oral Proficiency Training in Dutch L2: the Contribution of ASR-based Corrective Feedback”, Speech Communication, Volume 51, Issue 10, pp. 853-863.
[3] De Jong, N. (2005) “Can second language grammar be learned through listening? An Experimental Study”, Studies in Second Language Acquisition, 27, pp. 205–234.
[4] Ellis, N.C. and Bogart, P.S.H. (2007) “Speech and Language Technology in Education: the perspective from SLA research and practice”, Proceedings ISCA ITRW SLaTE, Farmington PA.
[5] Lee, J. and Seneff, S. (2006) “Automatic grammar correction for second-language learners”, Proceedings of Interspeech 2006.
[6] Van den Bosch, A., Busser, G.J., Daelemans, W. and Canisius, S. (2007) “An efficient memory-based morphosyntactic tagger and parser for Dutch”, in F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114.
[7] ftp://jaguar.ncsl.nist.gov/current_docs/sctk/doc/sclite.htm
[8] http://ilk.uvt.nl/tadpole/
[9] http://lands.let.ru.nl/~strik/research/DISCO
[10] http://lands.let.ru.nl/~strik/research/Dutch-CAPT/
[11] http://lands.let.ru.nl/~strik/research/FASOP.html
[12] http://www.let.rug.nl/vannoord/alp/Alpino/
[13] http://www.nist.gov/itl/iad/mig/tools.cfm