Automatic Detection of Vowel Pronunciation Errors Using Multiple Information Sources

Joost van Doremalen, Catia Cucchiarini, Helmer Strik
Department of Linguistics, Radboud University Nijmegen, The Netherlands
{j.vandoremalen,c.cucchiarini,h.strik}@let.ru.nl

Abstract—Many of the frequent pronunciation errors made by L2 learners of Dutch concern vowel substitutions. To detect such pronunciation errors, ASR-based confidence measures (CMs) are generally used. In the current paper we compare and combine confidence measures with MFCCs and phonetic features. The results show that MFCCs perform best, followed by CMs and then phonetic features, and that substantial improvements can be obtained by combining different feature sets.

I. INTRODUCTION

The application of Automatic Speech Recognition (ASR) technology to second language (L2) learning, and in particular to pronunciation training, has received growing attention in the last decade [1]. Our institute has been involved in applying ASR to pronunciation training in several projects, e.g. [30][31][32][33]. The research presented here is carried out within the framework of the DISCO project, which is aimed at developing a prototype of an ASR-based CALL application that provides feedback on Dutch L2 pronunciation, morphology and syntax [30].

The use of ASR technology is especially advantageous when it comes to identifying specific pronunciation errors and providing corrective feedback to the learners. L2 learners do indeed appear to have difficulties in identifying their own pronunciation errors [2]. This suggests that Computer Assisted Language Learning (CALL) programs that can provide automatic corrective feedback on pronunciation are preferable to systems that only offer the opportunity of listening to and repeating L2 speech without corrective feedback. In line with these requirements, several studies have addressed pronunciation error detection through ASR [3][4][5][6]. The main challenge in these approaches is to develop algorithms that achieve sufficient accuracy in error detection, so that the feedback provided to the learners is not misleading. In general, achieving sufficient detection accuracy is particularly challenging exactly for those sounds that are easily confused or mispronounced by L2 learners.

Pronunciation errors in a second language can derive from several sources. An important limiting factor in acquiring the pronunciation of an L2 is considered to be interference from the first language (L1) [7], which can affect L2 speech production both at the prosodic and at the segmental level. L2 learners may have difficulties with the different syllable structure of the language to be learned, its rhythm and temporal organization, its phonemic inventory and its phonotactics. Here we will focus on segmental aspects. L2 learners might insert or delete speech sounds, realize L2 phonemes incorrectly, or even use phonemes from their L1. In particular, L2 learners may find it difficult to realize certain phonetic contrasts, either because they do not exist in their L1, or because they do exist but are not phonologically distinctive. Consequently, when trying to pronounce L2 phonemes, L2 learners may end up producing L1 phonemes that are somewhat similar but not identical. In such cases relatively subtle acoustic deviations may lead to phonemic substitutions. Identifying those errors is of course more difficult than identifying substitutions of sounds that are acoustically more distinct.
For these reasons, various studies on pronunciation error detection have focused on sets of L2 phonemes that are very similar in acoustic and articulatory terms, in an attempt to find accurate methods of identifying the mispronounced sounds. In general, ASR-based confidence measures (CMs) such as posterior probabilities or the Goodness of Pronunciation measure (GOP) are used for pronunciation error detection [8][3][4][5]. These CMs give an indication of how confident the recognizer is that a given target sound was pronounced: the lower the confidence, the higher the chance that another sound was pronounced. Such measures have the advantage that they can be obtained fairly easily with an ASR system and that they can be calculated in similar ways for all speech sounds.

However, since segmental pronunciation errors tend to concern specific phonetic contrasts that pose special difficulties to L2 learners, a promising approach to pronunciation error detection might be one that uses phonetic information related to the problematic contrasts. Along these lines, [9] and [10] developed dedicated classifiers to identify pronunciation errors that appeared to be frequent in Dutch L2 and that concerned relatively subtle distinctions, such as that between fricatives and plosives and that between long and short vowels. In [9] it was shown that good classification results can be obtained by using phonetic features; more specifically, by using rather general features for vowels (formants, pitch and duration), and very specific features for differentiating a plosive from a fricative. In [10] different approaches for differentiating a plosive from a fricative were compared. A method in which phonetic features were used together with Linear Discriminant Analysis (LDA) performed better than GOP. However, similar results were obtained for MFCCs in combination with LDA. So, on the one hand, phonetic features seem promising for classification, but on the other hand, simply using MFCCs also provides good results. Furthermore, for the specific cases that were studied, the results for these two methods were better than those for GOP.

These interesting results led to a number of questions: how would these different methods perform on other sounds, and could something be gained by combining different measures? In the current study we tried to answer these questions. The outline of the paper is as follows. In section II we explain the background of this research. In section III we describe the material used and the method adopted in our experiments. The results are presented in section IV and discussed in section V.

II. RESEARCH BACKGROUND

Considering that L2 pronunciation errors are often related to interference from the L1, it seems very advantageous to have CALL systems that are designed for specific combinations of L1 and L2, and that can address the errors one would expect for those specific combinations, for instance German, Italian, Chinese or Japanese students learning English [11][5][12], or Americans learning French [13]. In general, using such fixed combinations of languages also has considerable advantages from the point of view of ASR technology: speech recognition is facilitated and pronunciation errors are more easily predictable. However, the feasibility of such systems heavily depends on the number of students and the approach used in L2 classes. In the Netherlands, it is common practice to have heterogeneous L1 groups of learners in Dutch L2 classes.
For this reason, in our research on ASR-based pronunciation training for Dutch L2 [6][10][14] we have focused on pronunciation errors that can be made by any learner, regardless of his/her L1. Although it is known that pronunciation errors are likely to be affected by the L1, in our research we also found that, at least for Dutch, it is possible to identify a set of phonemes that are particularly problematic for many L2 learners with different mother tongues [14]. This research and observations by Dutch L2 teachers indicate that, in general, vowels are more problematic than consonants [14], which may partly be due to the relatively high number of vocalic phonemes in Dutch compared to other languages [15][16]: Dutch has 13 monophthongs, 3 diphthongs and some additional vowels found mainly in loan words [17]. The vocalic pronunciation errors, which concern almost all vowels and very often the diphthongs, appear to be related to difficulties with actually pronouncing the sounds and to orthographic interference [14].

In particular, vocalic errors are concentrated on realising a number of contrasts that many L2 learners are not familiar with from their L1s, such as /a/ versus /A/, /e/ versus /E/, /o/ versus /O/, /i/ versus /I/, /u/ versus /y/, /u/ versus /Y/ and /y/ versus /Y/ (SAMPA notation [34]). The problems in realising such contrasts are not only related to their absence in the learner's L1, but also to Dutch orthography, as sometimes the same grapheme is used to indicate two different phonemes. For instance, in the words "bonen" (beans) and "bom" (bomb) the grapheme "o" stands for the phoneme /o/ in the first word and for /O/ in the second word. Similarly, in the words "buren" (neighbours) and "bussen" (buses) the grapheme "u" represents the phoneme /y/ in the first word and /Y/ in the second word.

The vowels /a/, /e/, /o/ and /i/ are generally longer than their short counterparts /A/, /E/, /O/ and /I/, but the distinction between long and short vowels seems to be based more on phonological grounds than on phonetic ones [18]. /e/ and /o/ are longer than /E/ and /O/, respectively, while the high vowels /i/, /y/ and /u/ are longer than /I/ and /Y/ only when they are followed by /r/ [18]. According to [19], the difference in length between the long and the short vowels only appears in prosodically strong positions, i.e. a strong syllable in a foot. In addition, duration is not the only characteristic that distinguishes the long vowels from their short counterparts, as the spectral characteristics also vary [18][17]. The vowels /e/, /o/, /i/, /u/ and /y/ are higher than /E/, /O/, /I/ and /Y/; /y/ and /Y/ are more fronted than /u/, and /a/ is more fronted than /A/.

Since many languages do not have such a distinction between vowel pairs that are associated with one grapheme but have different realisations, such as /a/ and /A/, /e/ and /E/, /o/ and /O/, /i/ and /I/, /u/ and /y/, /u/ and /Y/, and /y/ and /Y/, L2 learners attempting to pronounce either of the two vowels in a pair, for instance /a/ or /A/, tend to produce realisations that fall in between. Depending on the amount of deviation from the target sound, these attempts will be classified as either /A/ or /a/. Problems arise when the amount of deviation is such that an attempt at producing /A/ is perceived as /a/ or vice versa, because in such cases another word will be pronounced than the intended one, for instance "maan" (moon) instead of "man" (man).
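To make the problematic contrasts concrete, the sketch below collects them in a small data structure, using the SAMPA labels and example words given in this paper (see also Table I). This is purely illustrative Python; the variable name is our own and does not come from the DISCO implementation.

```python
# The problematic Dutch vowel contrasts discussed above, in SAMPA notation,
# each with example words from the paper in which the two phonemes appear.
# Illustrative only; not part of the paper's actual code.
CONFUSABLE_CONTRASTS = {
    ("a", "A"): ("maan", "man"),    # moon vs. man
    ("e", "E"): ("leeg", "leg"),    # empty vs. put
    ("o", "O"): ("boon", "bon"),    # bean vs. ticket
    ("i", "I"): ("liep", "lip"),    # walked vs. lip
    ("u", "y"): ("boek", "vuur"),   # book vs. fire
    ("u", "Y"): ("boek", "bus"),    # book vs. bus
    ("y", "Y"): ("vuur", "bus"),    # fire vs. bus
}
```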
Given the difficulties posed by the above-mentioned vocalic contrasts to Dutch L2 learners, we set out to investigate whether it is possible to develop specific measures that achieve high accuracy in identifying the resulting pronunciation errors.

III. METHOD

A. Material

The speech material for our experiments was taken from the Spoken Dutch Corpus (CGN), a large corpus of Dutch as spoken in the Netherlands and Flanders by adult native speakers. CGN contains about 9 million words and a great variety of speakers of different age, gender and region of origin, recorded in various socio-situational settings [20]. The speech material was extracted from the Northern Dutch part of CGN and stems from 4 different components: read speech (RS), and different broadcast speech components that can be subsumed under the label 'broadcast monologues' (BM). The RS material was recorded from trained speakers who read aloud novels in a studio environment, while the BM fragments were produced by speakers who were accustomed to speaking in public. These components are among the most formal in CGN and reflect well the types of speech that will be encountered in the final application. We used the RS material as our training set and the BM material as our test set.

CGN is a corpus of native speech and as such it does not contain the pronunciation errors L2 learners usually make. Although there are databases of non-native speech, these were considered too small for the purpose of this research. Given that the vocalic errors we wanted to investigate in this study concern phonemic substitutions, they can easily be simulated by artificially introducing them in a native corpus. In previous research we have used this procedure [6][21] and have seen that it works well, as long as the simulated errors reflect errors that are actually made by L2 learners. Errors that are often made by L2 learners are substitutions of the phonemes mentioned in Table I (see, e.g., [13]). Based on this information on how Dutch phones are frequently mispronounced by L2 learners, the CGN material was manipulated in such a way that realistic L2 errors were introduced. For instance, in order to train and evaluate the classification of /a/, all occurrences of /A/ in the transcriptions were replaced by /a/, and analogously for the other vowels (a minimal sketch of this relabelling is given below Table I). For more details on the procedure, and for results showing that the classifiers obtained in this way show similar performance for real errors in non-native speech, the reader is referred to [21]. Frequencies of the vowels under investigation in our material are shown in Table II.

TABLE I
SUBSTITUTIONS OFTEN MADE BY L2 LEARNERS. EACH ROW CONTAINS PHONEMES THAT ARE OFTEN CONFUSED, TOGETHER WITH AN EXAMPLE OF A DUTCH WORD IN WHICH THEY APPEAR (AND AN ENGLISH TRANSLATION).

/a/ maan (moon), /A/ man (man)
/i/ liep (walked), /I/ lip (lip)
/e/ leeg (empty), /E/ leg (put)
/o/ boon (bean), /O/ bon (ticket)
/u/ boek (book), /y/ vuur (fire), /Y/ bus (bus)
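The relabelling procedure described above can be summarised in a few lines. This is a minimal sketch of our reading of the procedure, not the actual DISCO/CGN processing code; transcriptions are assumed to be lists of SAMPA phone labels.

```python
def simulate_substitution(transcription, target, substituted):
    """Relabel every occurrence of `substituted` as `target`, so that
    native realisations of `substituted` serve as simulated
    mispronunciations of `target`.

    transcription: list of SAMPA phone labels, e.g. ["m", "A", "n"].
    """
    return [target if phone == substituted else phone
            for phone in transcription]

# Example: to train/evaluate the /a/ classifier, /A/ tokens (as in "man")
# are presented to the system as erroneous realisations of /a/:
print(simulate_substitution(["m", "A", "n"], target="a", substituted="A"))
# -> ['m', 'a', 'n']
```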
B. Feature Calculation

First, segmentations of the material were obtained through forced alignment. The segmentations were subsequently used to calculate a number of features. Details on the calculation of these features are provided below.

1) ASR-based Features

As our baseline we employed the widely used segmental confidence measure (CM) introduced in [8], which is the average frame-based posterior probability (AFBPP) of a force-aligned phone given the acoustic observations. The AFBPP of a phone $ph$ is calculated as:

$$\mathrm{afbpp}(ph) = \frac{1}{t_e - t_b + 1} \sum_{t=t_b}^{t_e} \log p(s_t^i \mid x_t)$$

where $p(s_t^i \mid x_t)$ is the frame-based posterior probability of the force-aligned state $s^i$ at time $t$ given the observation vector $x_t$, and $t_b$ and $t_e$ are the begin and end frames of the segment. $p(s_t^i \mid x_t)$ is calculated as:

$$p(s_t^i \mid x_t) = \frac{p(x_t \mid s_t^i)\, p(s_t^i)}{\sum_{j=1}^{N} p(x_t \mid s_t^j)\, p(s_t^j)}$$

where the summation in the denominator ranges over all $N$ states of all triphone models. We will refer to this confidence measure as CMseg.

The HMM models for the automatic phone alignment were trained with SPRAAK [22]. As training material we used the RS material from the CGN corpus. For preprocessing purposes the input speech, sampled at 16 kHz, is first divided into overlapping 32 ms Hamming windows with a 10 ms shift, and a pre-emphasis factor of 0.95 is applied. 12 Mel-frequency cepstral coefficients (MFCCs) plus C0, and their first and second order derivatives, were calculated, and cepstral mean subtraction (CMS) was applied. 47 3-state Gaussian Mixture Models (GMMs) were trained: 46 phone models and 1 silence model. In total 11,660 triphones were created, using 32,738 Gaussians.

Apart from averaging the frame-based probability over the whole segment, we also averaged over the three consecutive hidden states to model vowel onset/offset dynamics, thereby obtaining three state-based confidence measures. This set of three features will be referred to as CMstate.

2) MFCCs

The 13 MFCCs and their first and second order derivatives (as described above) were included in our feature set. We extracted MFCC-based features at three points in time within the segment, i.e. the windows closest to 25%, 50% and 75% of the length of the vowel. This makes a total of 117 (3x3x13) features, referred to as MFCCs.

3) Phonetic Features

Using PRAAT [23], the first three formants (F1, F2 and F3) and F2-F1 were measured at the same three points in time (25%, 50% and 75%). In addition to these 12 features, the mean pitch (F0) and intensity of the segment were also calculated. Since these measures can show considerable variation between speakers, we carried out a normalization at the speaker level. [24] compared different vowel normalization procedures and obtained the best results with Lobanov's z-score transformation [25]; we therefore applied this transformation to our data. These 14 normalized features will be referred to here as Spectral.

Apart from spectral measures, we also extracted the raw segment durations from the automatically generated segmentation. The durations of the three hidden states were also included. Apart from these 4 raw durations, we also included durations normalized for the articulation rate in the utterance, making a total of 8 duration features, referred to as Duration.

C. Classification: training and evaluation

For classification we used support vector machines (SVMs) with a linear kernel function, using the LibSVM package [26]. The reason for choosing a linear kernel was that it performed as well as several non-linear kernels, i.e. Radial Basis Function (RBF) and polynomial kernels, while requiring considerably less CPU time. For each vowel a separate classifier was trained, after the cost parameter had been optimised through 10-fold cross-validation on the training set. First, the individual performance of all feature sets was examined. Afterwards, feature sets were combined.

TABLE II
FREQUENCIES OF VOWELS IN TRAINING AND TEST SET

Phone   Training set   Test set
/a/        7988          4193
/A/       11092          5895
/i/        5411          3328
/I/        6967          3848
/e/        6689          3867
/E/        8242          4195
/o/        5620          3100
/O/        6359          3586
/u/        2127          1078
/y/         957           574
/Y/        1600           824
Total     63052         34488
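As a concrete illustration of Section III's pipeline, here is a minimal numpy/scikit-learn sketch of CMseg (the average frame-based log posterior), Lobanov's z-score normalisation, and the per-vowel SVM training. The frame posteriors and formant values are assumed to come from the ASR system (SPRAAK) and PRAAT respectively, scikit-learn's SVC is a stand-in for LibSVM, and the function names, grid values and toy numbers are our own.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def cm_seg(frame_posteriors):
    """CMseg: average frame-based log posterior probability of the
    force-aligned states over the frames t_b..t_e of the segment."""
    return float(np.mean(np.log(frame_posteriors)))

def lobanov(values):
    """Lobanov z-score normalisation at the speaker level: subtract the
    speaker's mean and divide by the speaker's standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def train_vowel_classifier(X, y):
    """Linear-kernel SVM per vowel; the cost parameter C is optimised by
    10-fold cross-validation, as in Section III.C. The grid of C values
    is an illustrative assumption."""
    grid = GridSearchCV(SVC(kernel="linear"),
                        {"C": [0.01, 0.1, 1, 10]}, cv=10)
    grid.fit(X, y)  # X: feature matrix; y: correct vs. substituted vowel
    return grid.best_estimator_

# Toy example: posteriors p(s_t|x_t) for a 5-frame vowel segment, and F1
# values (Hz) measured for several vowel tokens of one speaker.
print(cm_seg([0.90, 0.80, 0.95, 0.70, 0.85]))   # segmental confidence
print(lobanov([795, 679, 407, 583, 487, 523]))  # speaker-normalised F1
```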
We evaluated the performance of each classifier with the Equal Error Rate (EER) on the Receiver Operating Characteristic (ROC) curve (a minimal sketch of this computation is given further below, after Table III). Furthermore, 95% confidence intervals were calculated to test whether differences in performance were significant.

IV. RESULTS

Table III shows how the different feature sets perform (as EER) for the different vowels. On the whole, the results for MFCCs are somewhat better than those for the CMs: for /a/-/A/, /i/-/I/, /e/-/E/ and /y/ better results are obtained with MFCCs. The results for CMseg and CMstate do not differ much: for /a/ and /o/ significantly better results are obtained with CMstate, while for /A/ CMseg performs significantly better. The phonetic feature sets Spectral and Duration alone achieve only about 60-80% correct.

Table IV shows the performance for combinations of different feature sets. Significant performance gains can be obtained by adding Duration to MFCCs and CMs for /a/, /A/, /o/ and /O/. The combination of Spectral and Duration performs as well as or better than the two sets individually, but worse than the combination of MFCCs and Duration. Adding CMs to the latter combination lowers the error rate for almost all phones, except /I/. Differences between combinations with CMseg and CMstate are not significant.

V. DISCUSSION

Within each subset of vowels, the results are based on the same tokens. For instance, for the /a/ classification results all occurrences of /A/ in the transcriptions were replaced by /a/, and for the /A/ results it was just the other way around. It may therefore seem surprising that the results for the long and the short vowels are not the same. The reason for this discrepancy is that for the /a/ classification the acoustic model for /a/ was used to obtain the automatic segmentations, while for the /A/ classification the same tokens were automatically segmented using the acoustic model for /A/. This is also how it will be done in the application. Inspection of the segmentations indeed revealed that the begin and end times vary. The smallest differences are observed for the /o/ vs. /O/ pair, while the largest ones pertain to the /u/ vs. /y/ and /Y/ distinctions. This explains the large performance differences within the latter group.

TABLE III
EQUAL ERROR RATES FOR INDIVIDUAL FEATURE SETS: CMSEG, CMSTATE, MFCCS, SPECTRAL AND DURATION. ASTERISKS (*) INDICATE THE BEST PERFORMING FEATURE SETS.

Target   CMseg   CMstate   MFCCs   Spectral   Duration
/a/      17.0    15.9      13.8*   29.8       19.6
/A/      22.9    24.7      14.1*   30.3       25.1
/i/      18.7    19.0      13.4*   24.4       30.3
/I/      22.9    22.2      13.9*   22.3       40.8
/e/      11.4    10.7       9.7*   17.7       17.7
/E/      13.3    13.6       9.6*   17.6       32.9
/o/      26.5    24.8*     25.4    38.1       26.7
/O/      24.7*   25.2      26.1    36.9       31.0
/u/       5.0*    5.1       7.5    23.4       18.7
/y/      11.9    12.8      11.8*   22.0       27.2
/Y/      14.6    14.4*     15.1    29.6       40.7
Overall  18.9    18.9      15.0*   26.8       27.7

Note that the results presented in the current paper concern difficult cases. For instance, if we had tried to classify vowels that are acoustically more different from each other (such as /i/, /a/ and /u/), results would probably have been better. However, the latter are not the kind of substitution errors that are frequently made by language learners. For a CALL application it is important to be able to detect the errors that are frequently made by language learners. Therefore, we first studied which errors are frequent (see Table I) [14], and then tried to develop classifiers for these frequent errors.
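As a concrete illustration of the EER evaluation described above, the sketch below estimates the EER as the point on the ROC curve where the false acceptance and false rejection rates are equal. We use scikit-learn's roc_curve as a stand-in for the paper's own (unspecified) tooling; the labels and scores are toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: the operating point on the ROC curve where the false positive
    rate equals the false negative rate (1 - true positive rate).

    y_true: 1 for correct realisations of the target vowel, 0 for
    (simulated) substitutions; scores: classifier confidence scores."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = np.argmin(np.abs(fpr - fnr))   # point closest to FPR == FNR
    return (fpr[i] + fnr[i]) / 2.0     # average in case of no exact match

# Toy example with six tokens:
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2])
print(equal_error_rate(labels, scores))
```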
Here we present results for many tokens from different components of a standard general-purpose corpus (CGN), i.e. 'relatively uncontrolled material', in which different factors may have a negative effect on the performance of our classifiers. First of all, there is a training-test mismatch: read speech was used for training, while broadcast speech was used for testing, so there is a mismatch in speech style, recording channels, etc. Furthermore, we used all tokens, without applying a selection procedure (e.g. for context, position of the words in the utterances, prosodic effects, etc.). In the final CALL application we will have more control over many of these factors: we know who the speaker is (and can adapt to that speaker in various ways) and what the recording channel is, and we can choose the material (the stimuli and the prompts) ourselves in such a way that we can focus on those problematic sounds that can be reliably detected. Even within the speech of natives there is large variation in the realisation of (distinct) vowels, and it is known that realisations of distinct vowels often overlap. By providing feedback only on clear mispronunciations, we can minimise the number of times that a correct realisation of a phoneme is classified as a mispronunciation (false rejections).

TABLE IV
EQUAL ERROR RATES FOR COMBINED FEATURE SETS.

Target   MFCC+      Spectral+   MFCC+Duration+   MFCC+Duration+
         Duration   Duration    CMseg            CMstate
/a/      12.5       18.5        11.3             11.1
/A/      13.0       13.8        11.6             11.7
/i/      13.3       22.2        12.5             12.6
/I/      13.7       22.4        14.0             13.7
/e/       9.1       13.0         7.9              7.8
/E/       9.7       15.7         8.4              8.4
/o/      20.8       26.9        19.3             19.2
/O/      23.9       30.3        19.5             19.7
/u/       7.2       17.8         4.7              4.6
/y/      13.0       19.9         9.7              9.7
/Y/      14.9       22.4         9.9              9.8
Overall  13.9       19.6        12.3             12.3

For the /o/ vs. /O/ distinction, classification performance turns out to be lower than for all other combinations. This may partly be explained by the higher acoustic similarity between /o/ and /O/ as compared to the other vowel sets studied here. Table V shows average frequency values of the formants (F1 and F2): columns 2 and 3 contain values for 50 male speakers (taken from [27]) and columns 4 and 5 values for 16 female speakers (taken from [28]). Although there are differences between the values, which was to be expected because columns 2-3 concern males and columns 4-5 females, it is clear that the differences between /o/ and /O/ are smaller than those within the other vowel sets. In order to obtain better performance for /o/ vs. /O/ we might need to look in more detail at the (phonetic) differences between these vowels, for instance the fact that /o/ often shows a considerable degree of diphthongisation.

Table V also contains information on the average durations of the phonemes: the values in column 6 are taken from [29]. The differences between the average durations in column 6 of Table V reflect the performance of the classifiers using duration alone (see the Duration column of Table III). For instance, the smallest difference in duration is observed for /i/ vs. /I/, and classification with duration alone also shows the highest error rates for these vowels. At the other extreme, the largest difference in duration is observed for /a/ vs. /A/, and the best (average) classification results are also found for this vowel pair. Our classification results are thus in line with the phonetic observations.

TABLE V
AVERAGE VALUES FOR THE PHONEMES IN COLUMN 1. COLUMNS 2-5 CONTAIN AVERAGE FORMANT (F1 & F2) VALUES (IN HZ), TAKEN FROM [27] AND [28] RESPECTIVELY. COLUMN 6 CONTAINS AVERAGE VALUES FOR THE DURATION OF THE PHONEMES (IN MS), TAKEN FROM [29].

Phon.   F1     F2     F1     F2     Dur
/a/     795    1301   948    1644   186
/A/     679    1051   859    1321   103
/i/     -      -      346    2401   105
/I/     388    2003   442    2452    91
/e/     407    2017   438    2443   176
/E/     583    1725   638    2123   107
/o/     487     911   525    1033   162
/O/     523     866   581    1079    99
/u/     339     810   400     893   111
/y/     305    1730   354    2070   140
/Y/     438    1498   482    1832    98

We are aware that there is considerable overlap between the feature sets. For instance, CMs, MFCCs and Spectral are all spectrally based, and it is thus not surprising that there are similarities in the results. Furthermore, there is a large variation in the number of features in the sets used here: 1 for CMseg, 3 for CMstate, 117 for MFCCs, 14 for Spectral, and 4 for Duration. It is interesting to observe that a classifier based on 1 feature (CMseg) performs almost as well as one based on 117 MFCC features.

The different feature sets have their pros and cons. The advantage of the CMs compared to the MFCCs is that the number of features is much smaller. However, more important in the final application is probably the CPU time required. The fact that MFCCs and Duration are part of the standard ASR procedure, i.e. they do not require a large computational overhead, might thus be appealing. On the other hand, phonetic features (Spectral and Duration) have the advantage that they can be more easily interpreted. If formant values or durations are too low or too high, feedback based on these observations can be given to the learner (e.g. "the position of your tongue is too high" if F1 is too low, or "the vowel should be made shorter") and to the teacher (for monitoring the learner). Clearly, the latter can be very useful in a language learning application.

Although in the current paper we studied classifiers for mispronunciation detection of Dutch vowels, the methods used are generic and can easily be ported to other languages and other sounds. First, the results presented are relevant for other languages that contain sets of vocalic phonemes that are very similar in acoustic and articulatory terms (e.g. English, German and Swedish) and as such pose problems to L2 learners. In addition, similar classifiers can be developed for different vowel combinations, but also for consonants, as we have done in our previous research [9][10]. The relative importance of some features will differ between languages. For instance, the importance of duration in the detection of vowels will be smaller in languages in which duration is not such an important factor in the vowel system, such as Italian and Spanish. However, in the latter two languages duration plays a more important role in recognizing consonants (compared to, e.g., Dutch). Classifiers thus have to be optimized for each language, but the procedures used to develop the classifiers can be very similar.

Furthermore, in the current research the classifiers are used to detect pronunciation errors made by language learners. In doing this we focused on substitutions often made by language learners, because this is most important for the application in our language learning project. Thus we trained and tested the classifiers for certain combinations of vowels, e.g. /a/ vs. /A/. However, it is also possible to optimize the classifiers for other purposes: for other combinations of sounds, or, more generally, to detect whether a given sound is indeed the intended sound (e.g. /a/ or not).
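To illustrate the interpretability argument above, here is a minimal sketch of how speaker-normalised phonetic features could be mapped to learner feedback. The thresholds, messages and function name are our own illustrative assumptions; they are not taken from the paper.

```python
def articulation_feedback(f1_z, duration_z, threshold=1.5):
    """Map Lobanov-normalised deviations from the native target vowel
    (here: F1 and duration) to human-readable hints.

    f1_z, duration_z: z-scored deviations of the learner's realisation;
    threshold: illustrative cut-off for 'clearly deviant'."""
    hints = []
    if f1_z < -threshold:                  # F1 too low
        hints.append("The position of your tongue is too high.")
    elif f1_z > threshold:                 # F1 too high
        hints.append("The position of your tongue is too low.")
    if duration_z > threshold:
        hints.append("The vowel should be made shorter.")
    elif duration_z < -threshold:
        hints.append("The vowel should be made longer.")
    return hints

# Example: a realisation with too-low F1 and excessive duration:
print(articulation_feedback(f1_z=-2.0, duration_z=2.3))
```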
The cases we studied here are very difficult ones, since the vowels we try to discern are acoustically very similar. For other sound combinations the task will generally be easier, and performance is thus likely to be higher.

VI. ACKNOWLEDGMENT

The DISCO project is carried out within the STEVIN programme (http://taalunieversum.org/taal/technologie/stevin/), which is funded by the Dutch and Flemish Governments.

REFERENCES

[1] M. Eskenazi, "An overview of spoken language technology for education," Speech Communication, 2009.
[2] A. Dlaska and C. Krekeler, "Self-assessment of pronunciation," System, 36, pp. 506-516, 2008.
[3] S.M. Witt and S.J. Young, "Phone-level pronunciation scoring and assessment for interactive language learning," Speech Communication, 30, pp. 95-108, 2000.
[4] G. Kawai and K. Hirose, "Teaching the pronunciation of Japanese double-mora phonemes using speech recognition technology," Speech Communication, 30 (2), pp. 131-143, 2000.
[5] B. Mak, M. Siu, M. Ng, Y.-C. Tam, Y.-C. Chan, K.-W. Chan, K.-Y. Leung, S. Ho, F.-H. Chong, J. Wong, and J. Lo, "PLASER: Pronunciation Learning via Automatic Speech Recognition," in Proc. HLT-NAACL 2003 Workshop on Building Educational Applications using Natural Language Processing, Edmonton, Canada, pp. 23-29, 2003.
[6] C. Cucchiarini, A. Neri, and H. Strik, "Oral proficiency training in Dutch L2: The contribution of ASR-based corrective feedback," Speech Communication, 2009.
[7] J. Flege, "Second-language speech learning: Findings and problems," in Speech Perception and Linguistic Experience: Theoretical and Methodological Issues in Cross-Language Speech Research, W. Strange (ed.), Timonium, MD: York Press, pp. 233-273, 1995.
[8] H. Franco, L. Neumeyer, V. Digalakis, and O. Ronen, "Combination of machine scores for automatic grading of pronunciation quality," Speech Communication, 30, pp. 121-130, 2000.
[9] K. Truong, A. Neri, C. Cucchiarini, and H. Strik, "Automatic pronunciation error detection: an acoustic-phonetic approach," in Proceedings of InSTIL, Venice, Italy, 2004.
[10] H. Strik, K. Truong, F. de Wet, and C. Cucchiarini, "Comparing different approaches for automatic pronunciation error detection," Speech Communication, 2009.
[11] W. Menzel, D. Herron, R. Morton, D. Pezzotta, P. Bonaventura, and P. Howarth, "Interactive pronunciation training," ReCALL, 13(1), pp. 67-78, 2001.
[12] Y. Tsubota, M. Dantsuji, and T. Kawahara, "An English pronunciation learning system for Japanese students based on diagnosis of critical pronunciation errors," ReCALL, 16(1), pp. 173-188, 2004.
[13] Y. Kim, H. Franco, and L. Neumeyer, "Automatic pronunciation scoring of specific phone segments for language instruction," in Proc. Eurospeech, Vol. 2, pp. 645-648, Rhodes, Greece, 1997.
[14] A. Neri, C. Cucchiarini, and H. Strik, "Selecting segmental errors in L2 Dutch for optimal pronunciation training," International Review of Applied Linguistics, 44, pp. 357-404, 2006.
[15] B. Lindblom, "Phonetic universals in vowel systems," in Experimental Phonology, J.J. Ohala and J.J. Jaeger (eds.), Orlando, FL: Academic Press, pp. 13-44, 1986.
[16] I. Maddieson, Patterns of Sounds. Cambridge: Cambridge University Press, 1984.
[17] C. Gussenhoven, "Dutch," in Handbook of the International Phonetic Association, Part II: Illustrations of the IPA, pp. 74-77. Cambridge: Cambridge University Press, 1999.
[18] G. Booij, The Phonology of Dutch. Oxford: Clarendon Press, 1995.
[19] T. Rietveld and V.J. van Heuven, Algemene Fonetiek [General Phonetics]. Bussum: Coutinho, 2001.
[20] N.H.J. Oostdijk, "The Spoken Dutch Corpus.
Outline and first evaluation," in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, Vol. 2, pp. 887-894, 2000.
[21] S. Kanters, C. Cucchiarini, and H. Strik, "The Goodness of Pronunciation algorithm: a detailed performance study," in Proceedings of SLaTE, 2009.
[22] K. Demuynck, J. Roelens, D. Van Compernolle, and P. Wambacq, "SPRAAK: an open source SPeech Recognition and Automatic Annotation Kit," in Proceedings of ICSLP, p. 495, 2008.
[23] P. Boersma and D. Weenink, Praat: doing phonetics by computer (Version 5.1.10) [Computer program]. Retrieved July 8, 2009, from http://www.praat.org/.
[24] P. Adank, Vowel Normalization: A Perceptual-Acoustic Study of Dutch Vowels. Doctoral dissertation, Radboud University Nijmegen, The Netherlands, 2003.
[25] B.M. Lobanov, "Classification of Russian vowels spoken by different speakers," Journal of the Acoustical Society of America, Vol. 49, Issue 2B, pp. 606-608, 1971.
[26] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[27] L.C.W. Pols, H.R.C. Tromp, and R. Plomp, "Frequency analysis of Dutch vowels from 50 male speakers," Journal of the Acoustical Society of America, 53, pp. 1093-1101, 1973.
[28] R. van Hout, P. Adank, and V.J. van Heuven, "Akoestische metingen van Nederlandse klinkers in algemeen Nederlands en in Zuid-Limburg" [Acoustic measurements of Dutch vowels in Standard Dutch and in South Limburg], Taal en Tongval, 52, pp. 151-162, 2000.
[29] H. Strik and E. Konst, "A duration model for phonetic units in isolated Dutch words," AFN-Proceedings, University of Nijmegen, Vol. 15, pp. 71-78, 1992.
[30] DISCO: http://lands.let.ru.nl/~strik/research/DISCO
[31] ST-AAP: http://lands.let.ru.nl/~strik/research/ST-AAP.html
[32] Dutch-CAPT: http://lands.let.ru.nl/~strik/research/Dutch-CAPT/
[33] Repetitor: http://lands.let.ru.nl/literature/heuvel.2008.6.pdf
[34] J.C. Wells, "SAMPA computer readable phonetic alphabet," in Handbook of Standards and Resources for Spoken Language Systems, D. Gibbon, R. Moore, and R. Winski (eds.), Berlin and New York: Mouton de Gruyter, Part IV, section B, 1997.