Phoneme Errors in Read and Spontaneous Non-Native Speech: 
Relevance for CAPT System Development 


Joost van Doremalen, Catia Cucchiarini, Helmer Strik 

Department of Linguistics, Radboud University Nijmegen, The Netherlands

{j.vandoremalen, c.cucchiarini, h.strik}@let.ru.nl


Abstract 

For the purpose of pronunciation assessment and training 
in a second language both read and spontaneous speech are 
employed. In this paper we present the results of a study on 
the nature of phoneme errors in Dutch read and 
spontaneous non-native speech and discuss the possible 
consequences and relevance of these findings for the 
purpose of developing Computer Assisted Pronunciation 
Training systems. 

1. Introduction 
Pronunciation is considered a difficult skill to learn in a 
second language (L2), with the majority of L2 learners 
never acquiring native-like performance and many of them 
having problems even in attaining a level of comfortably 
intelligible speech. Research has shown that serious 
pronunciation problems can hamper communication [1][2] 
or even put the learner at a social and professional 
disadvantage (for a review, see [3]). The emphasis on
communicative effectiveness in language teaching has 
brought about a renewed interest in pronunciation training. 
In addition, various studies have shown that tailor-made 
training is effective in improving perception and 
production of L2 speech sounds [4]. 

However, the kind of specific and intensive training 
that is required for improving pronunciation cannot be 
generally applied in L2 classrooms because it is too time-
consuming. In the classroom, as well as in natural 
interactions, feedback on pronunciation errors is provided 
incidentally and is not always interpreted correctly [5]. 

Computer Assisted Pronunciation Training systems 
that make use of automatic speech recognition would seem 
to constitute an interesting alternative, as such systems 
can offer virtually unlimited input, provide
individualized, instantaneous feedback and give learners the
possibility of practicing as much as possible.

In spite of these undoubtedly attractive features, there 
are also important technological limitations that should 
be reckoned with, since automatic speech recognition 
(ASR) of non-native speech is still problematic [6]. 

To partly alleviate the ASR problems caused by non-native
speech, various strategies have been proposed to
constrain the output of the learner so that the speech
becomes more predictable and thus more tractable from the
ASR point of view. A common strategy consists in
eliciting constrained output from learners by letting them 
read aloud texts displayed on the screen. Even in this case, 
however, it is possible that learners read something 
different from what is presented to them, but the chances 
are high that they will do what they are asked to do. 

Although this might be a safe option from the ASR
point of view, one might question whether this is
pedagogically sound from the perspective of
pronunciation learning. The type of speech that is elicited
in this way is indeed read speech, which of course is
different from the spontaneous speech that learners will 
have to use in communicating in the L2. On the other 
hand, learning to read aloud is one of the skills that have 
to be learned in the L2. 

For the purpose of CAPT development, such choices 
are very important as they determine which sounds are 
identified as being problematic, which ones are selected 
for pronunciation training and how algorithms for error 
detection are designed and trained, as will be explained 
below. Within the DISCO project [11] [12], which is aimed 
at realizing an ASR-based training system for oral 
proficiency in Dutch L2, we decided to study the nature 
and frequency of phoneme errors in read and spontaneous 
non-native speech with a view to developing a sound 
pronunciation training component. 

In this paper we first discuss the pros and cons of 
using either read or spontaneous speech for the purposes 
of pronunciation assessment and training (section 2). We 
then go on to present a study on the nature of phoneme 
errors in Dutch read and spontaneous non-native speech 
(section 3). We will end with a discussion of the results 
and some concluding remarks. 

2. Pronunciation assessment and training: read speech vs. spontaneous speech
In the previous section we referred to the necessity of using 
read speech in ASR-based CAPT systems to make it easier 
for ASR to handle non-native speech. However, there are 
other reasons for using read speech when it comes to 
pronunciation assessment and training. 

To start with, one could argue that learning grapheme-
phoneme correspondences and being able to read aloud is 
a skill that should be learned in the L2 just as it is learned 
in the L1. In addition, for pronunciation assessment read 
speech offers a number of advantages. First, by eliciting 
read speech it is possible to control what the speakers will 
say and to have them produce the same words and sounds. 
This homogeneity in content ensures that pronunciation 
scores are comparable. When human judges are involved 
this has the additional advantage that raters are not 
influenced by oral production factors lying outside the 
domain of pronunciation such as grammar or lexicon. 

Having the possibility of controlling the content of 
the utterances also has the advantage that phonetically 
balanced material can be used. In turn this is attractive 
because the pronunciation of all phonemes of a language 
can be evaluated. 

So, although there are several good reasons for 
employing read speech when evaluating L2 pronunciation, 
this also has some drawbacks. For instance, the ability to 
read aloud in an L2 partly depends on the familiarity with 
L2 orthography. Interference from L2 orthography might 
cause specific phoneme errors thus providing a biased 
picture of pronunciation difficulties [7]. If some of the
errors observed in L2 read speech are simply decoding
errors caused by insufficient knowledge of L2 
orthography or interference from it, it is legitimate to ask 
whether such errors are pronunciation errors at all. In 
addition, research has shown that the nature and frequency 
of phoneme errors in non-native speech production are 
related to the specific relation between L1 and L2 
orthography [8]. 

By employing spontaneously produced speech such 
forms of orthography interference could be avoided. The 
phoneme errors thus observed are more likely to give an 
indication of real pronunciation problems. However, since 
the content of the utterance cannot be controlled, it is 
questionable whether the speech elicited provides a 
complete representation of potential pronunciation 
problems. 

Given that these choices play an important part in 
CAPT development, we decided to study the nature and 
frequency of phoneme errors in read and spontaneous
non-native speech to be able to make optimal choices for
pronunciation training within the framework of our DISCO 
system for L2 oral proficiency training. 

3. Phoneme errors in read and spontaneous non-native speech: the case of Dutch

3.1. Non-native speech material 
The speech material for the present experiments was taken 
from the JASMIN speech corpus [9], which contains speech 
from speakers with different mother tongues and relatively 
low proficiency levels, namely A1, A2 and B1 of the 
Common European Framework (CEF). The speech was 
collected in two different modalities: read speech and 
human-machine dialogues. The read speech material we 
used for this study consists of utterances produced by 45 
speakers while reading aloud short texts from the screen 
and sets of phonetically rich sentences. The spontaneous 
speech material was derived from the human-machine 
dialogues which were collected through a Wizard-of-Oz-based
platform [9].

Orthographic transcriptions were manually created and 
include fluency phenomena such as filled pauses, restarts 
and repetitions. From these orthographic transcriptions, 
phonetic transcriptions were automatically generated 
using a pronunciation lexicon with native and non-native 
pronunciation variants. Phonetic transcriptions for words 
which contain disfluencies were manually created. Because 
the automatically generated phonetic transcription can 
contain errors, we had two transcribers manually correct 
the phonetic transcriptions on the word level. They were 
instructed to change the phonetic transcription whenever 
they thought that an error had been made. For this 
correction, only the SAMPA symbols for Dutch were used
(SAMPA [10] is also used in the remainder of this paper).
Chunks were presented in a random order. 10% of the
material was corrected by both transcribers and another
10% was transcribed twice by the same transcriber in order
to calculate inter-transcriber and intra-transcriber
agreement, respectively. The inter-transcriber agreement is
0.91 (Cohen's kappa) and the mean intra-transcriber
agreement is 0.96. Both transcribers changed less than
10% of the segments, and there is considerable overlap in
the segments they changed, which explains the high
agreement levels. We refer to this manually corrected 
transcription as the reference transcription. 
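
As an illustration of how an agreement figure of this kind can be obtained, the sketch below computes Cohen's kappa for two equally long sequences of per-segment labels. The function name `cohens_kappa` and the toy labels are assumptions made for this example; they are not taken from the JASMIN annotations, and the text does not specify over which label set the reported kappa was computed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, derived from each annotator's marginal label counts.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: SAMPA symbols chosen by two transcribers for 8 segments.
transcriber_1 = ["v", "f", "z", "s", "G", "x", "v", "z"]
transcriber_2 = ["v", "f", "z", "s", "x", "x", "v", "s"]
kappa = cohens_kappa(transcriber_1, transcriber_2)
```

Because kappa corrects the raw agreement for the agreement expected by chance, it is lower than the plain proportion of identical labels whenever the label distribution is skewed.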

3.2. Automatic analysis 
The speech material was automatically analyzed to identify 
phoneme errors. We aligned the reference transcription with 
a canonical transcription obtained from the CGN lexicon, 
as in [11]. The alignment was done using an algorithm that 
takes two phoneme sequences as input and calculates the 
optimal alignment on the basis of phonetic distances 
between pairs of phonemes [11]. 

The CGN lexicon provides some common 
pronunciation variants. We integrated these variants in the 
procedure by aligning the reference transcription with the 
canonical transcription that has the smallest phonetic 
distance to the reference transcription. In this way, some of 
the pronunciation variation that natives exhibit is taken 
into account. 
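
The alignment and variant selection described above can be sketched as a dynamic-programming alignment with a phoneme-dependent substitution cost. This is only a minimal illustration: the distance function `phone_distance`, the cost `INDEL_COST` and the example pronunciation variants are assumptions made here, not the phonetic distances of [11] or entries from the CGN lexicon.

```python
# Hypothetical vowel inventory (SAMPA), used only to define a toy distance.
VOWELS = {"i", "I", "e:", "E", "a:", "A", "o:", "O", "u", "y", "Y", "2:", "@",
          "Ei", "9y", "Au"}
INDEL_COST = 0.7  # assumed cost of deleting or inserting one phoneme


def phone_distance(a, b):
    """Toy distance: 0 if identical, 0.5 within a class, 1 across classes."""
    if a == b:
        return 0.0
    return 0.5 if (a in VOWELS) == (b in VOWELS) else 1.0


def align(canonical, realized):
    """Return (cost, pairs); '-' in a pair marks a deletion or an insertion."""
    n, m = len(canonical), len(realized)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * INDEL_COST
    for j in range(1, m + 1):
        cost[0][j] = j * INDEL_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + phone_distance(canonical[i - 1], realized[j - 1]),
                cost[i - 1][j] + INDEL_COST,  # canonical phoneme deleted
                cost[i][j - 1] + INDEL_COST,  # extra phoneme inserted
            )
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(
            cost[i][j] - cost[i - 1][j - 1]
            - phone_distance(canonical[i - 1], realized[j - 1])
        ) < 1e-9:
            pairs.append((canonical[i - 1], realized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and abs(cost[i][j] - cost[i - 1][j] - INDEL_COST) < 1e-9:
            pairs.append((canonical[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", realized[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


def best_variant(variants, realized):
    """Align against every canonical variant and keep the closest one."""
    return min((align(v, realized) for v in variants), key=lambda r: r[0])


# Hypothetical example: two canonical variants and one reference transcription.
variants = [["z", "e:", "v", "@", "n"], ["z", "e:", "v", "@"]]
best_cost, best_pairs = best_variant(variants, ["s", "e:", "f", "@"])
```

In this toy case the variant without the final /n/ is selected, so the missing /n/ is not counted as a deletion, while the devoiced /s/ and /f/ remain visible as substitutions.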

After the alignment we calculated confusion matrices 
of phonemic substitutions and deletions. 
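
A minimal sketch of how such confusion counts might be tallied from aligned phoneme pairs is given below. The function name `confusion_summary`, the input format (a list of `(target, realized)` pairs with "-" marking a deletion or insertion) and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def confusion_summary(aligned_pairs, top_n=3):
    """Per target phoneme: frequency, % correct and the top_n confusions."""
    counts = defaultdict(Counter)
    for target, realized in aligned_pairs:
        if target == "-":
            continue  # insertions have no target; only substitutions and deletions count
        counts[target][realized] += 1
    summary = {}
    for target, realizations in counts.items():
        freq = sum(realizations.values())
        # Every realization other than the target itself is a confusion.
        errors = [(r, 100.0 * c / freq)
                  for r, c in realizations.most_common() if r != target]
        summary[target] = {
            "freq": freq,
            "pct_correct": 100.0 * realizations[target] / freq,
            "confusions": errors[:top_n],
        }
    return summary


# Hypothetical aligned pairs; ("h", "-") is a deletion, ("-", "@") an insertion.
pairs = [("v", "v"), ("v", "f"), ("v", "f"), ("v", "v"),
         ("h", "-"), ("h", "h"), ("-", "@")]
summary = confusion_summary(pairs)
```

Each entry of `summary` corresponds to one row of a table with a target phoneme, its frequency, its percentage correct and its most frequent confusions.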

4. Results 
In Tables 1 and 2 all target phonemes are listed together 
with their frequency, percentage correctly realized and their 
three most frequent confusions. The phonemes are divided
into six phonemic groups: diphthongs, monophthongs,
plosives, fricatives, nasals and approximants.

4.1. Vowels vs. consonants in read and spontaneous speech
These tables show that, in general, more
errors are produced in read speech than in spontaneous
speech and that vowels cause more errors than consonants.
Table 3 shows average percentage correct scores for
vowels and consonants not weighted by the individual cell
frequencies, while in Table 4 cell frequency is taken into
account. Both measures indicate that vowels are more
problematic than consonants in read speech, which is in
line with results of previous research [7], while the results
for spontaneous speech are mixed: there, consonants
appear to be more problematic if cell frequency is taken
into account.
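
To make the difference between the two averages concrete, the sketch below computes both for the 16 vowels in read speech, using the frequencies and percentages correct listed in Table 1. The function names are chosen here for illustration. With these inputs the unweighted mean comes out at about 83.4 and the frequency-weighted mean at about 88.7, which agrees with the values given for vowels in read speech in Tables 3 and 4.

```python
# (target, frequency, % correct) for the vowels in read speech, from Table 1.
read_vowels = [
    ("Au", 940, 96.0), ("Ei", 2750, 85.6), ("9y", 1094, 51.7), ("i", 3852, 93.3),
    ("e:", 4206, 75.7), ("a:", 5134, 94.8), ("o:", 3703, 88.0), ("u", 1367, 95.5),
    ("y", 961, 67.4), ("I", 3845, 84.9), ("E", 5366, 84.7), ("A", 6461, 91.6),
    ("O", 3292, 96.8), ("Y", 1656, 61.6), ("2:", 627, 72.9), ("@", 20745, 94.0),
]


def unweighted_mean(rows):
    """Every phoneme counts equally, however often it occurs."""
    return sum(pct for _, _, pct in rows) / len(rows)


def weighted_mean(rows):
    """Every phoneme counts in proportion to its frequency of occurrence."""
    total = sum(freq for _, freq, _ in rows)
    return sum(freq * pct for _, freq, pct in rows) / total


unweighted = unweighted_mean(read_vowels)
weighted = weighted_mean(read_vowels)
```

The weighted figure is higher because the very frequent /@/ is realized correctly most of the time, whereas infrequent vowels such as /9y/ and /Y/ pull the unweighted figure down.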
Among the consonants, we see that the most frequently
incorrectly pronounced consonants (/G/, /v/ and /z/)
are often realized as their devoiced counterparts /x/, /f/ and
/s/, respectively. In many regions in the Netherlands this
phenomenon also occurs among native speakers and it is 
therefore questionable whether these should be regarded as 
pronunciation errors at all. 
The data in Tables 3 and 4 also show that the difference in 
the number of errors between vowels and consonants is 
much smaller in spontaneous speech than in read speech. 
All target vowels appear to be less problematic in
spontaneous speech than in read speech, except for /O/,
while the number of consonant errors is comparable in read 
and spontaneous speech. 
In read speech the most problematic vowels, based on their 
relative error percentages, are /9y/, /Y/, /y/, /2:/, /e:/ and /E/. 
These vowels are much less error prone in spontaneous 
speech. This can be ascribed to different factors. 


target   freq   %cor  error#1    error#2    error#3
Au        940   96.0  o:: 1.5    a:: 0.9    u: 0.5
Ei       2750   85.6  a:: 10.4   e:: 1.1    E: 1.1
9y       1094   51.7  Au: 38.0   o:: 4.8    O: 2.2
i        3852   93.3  I: 3.3     e:: 1.9    @: 1.0
e:       4206   75.7  E: 9.9     i: 7.4     I: 2.6
a:       5134   94.8  A: 4.2     @: 0.5     -: 0.3
o:       3703   88.0  O: 8.2     u: 2.1     @: 0.5
u        1367   95.5  y: 1.1     Y: 1.0     O: 0.9
y         961   67.4  u: 24.7    Y: 2.1     2:: 2.0
I        3845   84.9  i: 12.6    E: 2.1     @: 0.3
E        5366   84.7  @: 8.5     I: 4.2     -: 0.8
A        6461   91.6  a:: 7.0    @: 0.6     O: 0.3
O        3292   96.8  o:: 2.0    a:: 0.5    u: 0.3
Y        1656   61.6  u: 25.4    y: 7.5     O: 2.5
2:        627   72.9  y: 8.8     u: 5.6     @: 2.7
@       20745   94.0  E: 2.6     I: 1.2     -: 1.0
p        2847   96.4  b: 2.8     -: 0.8
t       13899   90.2  -: 7.0     d: 2.5     s: 0.1
k        4751   96.2  g: 2.4     -: 0.7     x: 0.3
b        3149   99.7  p: 0.2     w: 0.1
d        8909   99.2  -: 0.5     t: 0.3
f        1688   89.0  v: 7.0     -: 3.7     w: 0.1
s        7041   91.4  z: 5.9     -: 1.6     S: 0.8
S         145   87.6  s: 11.7    j: 0.7
x        3674   91.5  G: 3.0     -: 2.7     k: 1.3
v        4563   62.0  f: 37.4    w: 0.5
z        2598   74.3  s: 25.7
Z         254   81.5  x: 8.7     G: 4.3     s: 2.4
G        1075   50.6  x: 35.8    h: 6.5     g: 5.1
h        2984   95.4  -: 2.4     G: 1.0     x: 0.8
m        4212   99.2  -: 0.7
n       16380   94.8  -: 3.1     N: 1.3     m: 0.5
N        1192   93.8  n: 3.0     -: 1.8     g: 0.8
j        2827   88.4  S: 9.1     -: 2.4     Z: 0.2
l        6941   98.6  -: 1.2     w: 0.1     j: 0.1
r       12199   92.7  -: 6.0     l: 1.1     j: 0.1
w        2524   98.9  v: 1.0

Table 1: Phonemic substitutions and deletions in read
speech.

target   freq   %cor  error#1    error#2    error#3
Au        206   98.1  a: 1.0     u: 0.5     o:: 0.5
Ei       1279   89.3  a:: 6.5    A: 3.0     i: 0.7
9y        196   61.2  Au: 28.6   O: 4.6     o:: 3.1
i        1966   92.9  I: 4.9     e:: 1.0    @: 0.7
e:       2430   89.1  E: 5.1     i: 4.2     I: 0.4
a:       3152   97.9  A: 1.2     @: 0.4     -: 0.2
o:       1591   95.5  u: 1.9     O: 1.1     2:: 0.5
u         804   96.1  Y: 1.7     2:: 0.6    O: 0.5
y         311   71.1  u: 17.7    @: 4.2     i: 3.5
I        4260   94.2  i: 4.8     E: 0.7     @: 0.2
E        2642   94.2  I: 2.3     @: 1.5     e:: 0.6
A        2685   94.1  a:: 3.6    E: 0.7     e:: 0.5
O        1274   92.4  o:: 3.8    Y: 1.2     A: 1.1
Y         342   65.8  u: 22.2    @: 8.8     I: 1.2
2:        167   80.8  @: 4.8     u: 4.2     Y: 3.6
@        9712   96.5  -: 1.5     I: 0.7     E: 0.7
p         807   91.6  b: 7.6     -: 0.6     g: 0.1
t        6108   90.7  -: 4.9     d: 4.1     j: 0.1
k        4428   94.8  g: 4.1     -: 0.9     x: 0.1
b         962  100.0
d        3107   98.5  -: 1.3     t: 0.1     j: 0.1
f         802   92.6  v: 6.5     -: 0.9
s        3091   90.7  z: 6.8     -: 1.2     S: 0.9
S          73   89.0  Z: 8.2     j: 1.4     s: 1.4
x        1805   91.0  G: 3.7     -: 3.3     g: 0.9
v        1641   60.9  f: 38.8    -: 0.1     b: 0.1
z        1128   66.8  s: 32.8    S: 0.2
Z          21  100.0
G         585   56.4  x: 30.8    g: 5.0     -: 3.6
h        1093   82.5  -: 15.7    x: 1.2     d: 0.2
m        3424   99.4  -: 0.3     n: 0.2     b: 0.1
n        6912   94.1  -: 3.4     N: 1.9     m: 0.3
N         459   97.6  n: 1.5     g: 0.4     x: 0.2
j        1468   93.9  S: 3.5     -: 2.5     h: 0.1
l        3629   98.4  -: 1.5     s: 0.1     g: 0.0
r        4198   97.6  l: 1.4     -: 0.8     n: 0.0
w        1657   99.9  v: 0.1

Table 2: Phonemic substitutions and deletions in
spontaneous speech.

             Read   Spontaneous   Total
Vowels       83.4   88.0          85.7
Consonants   89.1   89.9          89.5
Total        86.6   89.1          87.8

Table 3: Average of %correct of vowels and consonants,
read and spontaneous speech and their totals. These
percentages are not weighted by cell frequencies.

             Read   Spontaneous   Total
Vowels       88.7   93.8          90.4
Consonants   92.0   92.4          92.1
Total        90.7   93.0          91.4

Table 4: Average of %correct of vowels and consonants,
read and spontaneous speech and their totals. These
percentages are weighted by cell frequencies.

First, these sounds are represented graphemically by ‘ui’
(/9y/), ‘u’ (/Y/), ‘uu’ or ‘u’ (/y/), ‘eu’ (/2:/), ‘e’ or ‘ee’ (/e:/)
and ‘e’ (/E/). The use of the same graphemes, ‘e’ and ‘u’, to
represent these phonemes might be responsible for the
higher percentages of confusions in read speech as
opposed to spontaneous speech, where orthography will be
less of an obstacle.
This is corroborated by the finding that /y/ is more often
realized when /2:/ and /Y/ are the target in read speech than
in spontaneous speech.
Second, some problematic vowels occur much less often in
the spontaneous material than in the read material. This is
for example the case for /9y/, /Y/ and /2:/. This difference is
probably related to the different composition of the read
and spontaneous material. A requirement of the
phonetically rich sentences contained in the read speech
material used in this study is that all phonemes appear at
least once in a set of sentences. The frequency of occurrence
of the various phonemes can therefore be different in
spontaneous speech where there are no such requirements.
Among the consonants the biggest differences between
read and spontaneous speech are found for the fricatives /Z/
and /h/.
As noted in [7] the fricative /Z/ is very infrequent in normal
Dutch, in which it represents 0.05% of the consonants
while in the phonetically rich sentences used for this
study, it represents 1% of the consonants.
The glottal fricative /h/ is more often deleted in
spontaneous speech than in read speech. In this case it
seems that orthography has the function of reminding the
speaker of the presence of this phoneme, which is otherwise
neglected, probably due to its relatively low salience.


4.2. Error patterns in read and spontaneous speech 
A final remark concerns the confusion patterns associated 
with the various phonemes. In general there are many 
similarities between read and spontaneous speech, 
although some differences are also present. For instance, 
the diphthong /Ei/ is confused with /a:/, /e:/ and /E/ in read
speech and with /a:/, /A/, and /i/ in spontaneous speech.
The relation between the confusions in read speech and the
grapheme representation ‘ei’ seems rather obvious.
Similarly, /y/ is confused with /u/, /Y/, and /2:/ in read
speech and with /u/, /@/, and /i/ in spontaneous speech.
This also seems to be related to the grapheme
representation ‘u’ or ‘uu’. Finally, the vowel /Y/ is
confused with /u/, /y/, and /O/ in read speech and with /u/,
/@/, and /I/ in spontaneous speech. Again there seems to be 
a relation between the confusion pattern in read speech and 
the grapheme ‘u’. 
Among the consonants we see that the fricative /Z/ is 
confused with /x/, /G/, and /s/ in read speech whereas no 
confusions are found in spontaneous speech. The relation 
between the confusion pattern in read speech and the 
grapheme ‘g’ is evident also in this case. 

5. Discussion and Conclusions 
In the previous section we have seen that there can be 
differences in the occurrence of phoneme errors in read and 
spontaneous non-native speech. Frequency of occurrence, 
either absolute or relative, may be a criterion in selecting 
problematic phonemes that should be the focus of 
pronunciation training [7][12]. It follows that when 
making such selections one should take into account 
which type of speech material was used for assessing 
pronunciation, because this partly determines the results. 
Related to this, it is also important that the type of training 
be based on the nature of the errors identified. If errors are 
caused by interference from the orthography, rather than by 
a difficulty articulating the sound in question, then some 
training in phonics might be more appropriate than 
specific pronunciation training. 
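
The effect of the selection criterion can be illustrated with a small sketch that ranks a handful of read-speech phonemes from Table 1 by relative error rate and by absolute error count. The choice of these five phonemes and the function names are assumptions made for this example; the absolute counts are estimated as frequency times error percentage.

```python
# (target, frequency, % correct) for a few read-speech phonemes, from Table 1.
rows = [("9y", 1094, 51.7), ("@", 20745, 94.0), ("t", 13899, 90.2),
        ("G", 1075, 50.6), ("v", 4563, 62.0)]


def rank_by_relative_errors(rows):
    """Order targets by the percentage of realizations that are incorrect."""
    ranked = sorted(rows, key=lambda row: 100.0 - row[2], reverse=True)
    return [target for target, _, _ in ranked]


def rank_by_absolute_errors(rows):
    """Order targets by the estimated number of incorrect realizations."""
    ranked = sorted(rows, key=lambda row: row[1] * (100.0 - row[2]) / 100.0,
                    reverse=True)
    return [target for target, _, _ in ranked]


relative_ranking = rank_by_relative_errors(rows)
absolute_ranking = rank_by_absolute_errors(rows)
```

For this subset the relative criterion puts /G/ and /9y/ first, whereas the absolute criterion puts /v/ and /t/ first, because frequent phonemes accumulate many errors even at moderate error rates.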
The degree to which phoneme errors are affected by the 
orthography will be related to the degree of orthographic 
transparency or orthographic depth of the L2 [13] and to 
the relation between the L1 and the L2 [8]. To fully 
appreciate the results of this study and the possible 
consequences for CAPT development it is important to 
point out that Dutch is considered to be a relatively 
transparent language with relatively low orthographic 
complexity [13].
Another aspect that should be taken into account when 
developing systems for pronunciation training concerns 
the error patterns. In a recent paper we have shown that 
knowledge of the error patterns can be used to develop 
more sensitive and more accurate metrics for pronunciation 
error detection [11]. Since error patterns may vary 
depending on whether read or spontaneous speech is used, 
it follows that this should be taken into account when 
designing CAPT systems. 

6. Acknowledgements 
The DISCO project is carried out within the STEVIN 
programme funded by the Dutch and Flemish Governments 
(http://taalunieversum.org/taal/technologie/stevin/). 

7. References 
[1] Flege, J., “The relation between L2 production and
perception,” in Proceedings of the XIVth International
Congress of Phonetic Sciences, Berkeley, pp. 1273–1276,
1999.
[2] Van Wijngaarden, S.J., “The intelligibility of
non-native speech,” Doctoral dissertation, Free University,
Amsterdam, 2003.
[3] Eisenstein, M., “Native reactions to nonnative speech:
A review of empirical research,” Studies in Second
Language Acquisition, vol. 13, pp. 23–41, 1983.
[4] Lively, S.E., Logan, J.S. and Pisoni, D.B., “Training
Japanese listeners to identify English /r/ and /l/. II: The
role of phonetic environment and talker variability in
learning new perceptual categories,” Journal of the
Acoustical Society of America, vol. 94, pp.
1242–1255, 1993.
[5] Lyster, R. and Ranta, L., “Corrective feedback and
learner uptake: Negotiation of form in communicative
classrooms,” Studies in Second Language Acquisition,
vol. 19, pp. 37–66, 1997.
[6] Bouselmi, G., Fohr, D., Illina, I. and Haton, J.,
“Multilingual non-native speech recognition using
phonetic confusion-based acoustic model
modification and graphemic constraints,” in
Proceedings of ICSLP, 2006.
[7] Neri, A., Cucchiarini, C. and Strik, H., “Selecting
segmental errors in L2 Dutch for optimal
pronunciation training,” IRAL - International Review
of Applied Linguistics, vol. 44, pp. 357–404, 2006.
[8] Erdener, D.V. and Burnham, D.K., “The role of
audiovisual speech and orthographic information in
nonnative speech production,” Language Learning,
vol. 55, pp. 191–228, 2005.
[9] Cucchiarini, C., Driesen, J., Van Hamme, H. and
Sanders, E., “Recording speech of children, non-
natives and elderly people for HLT applications: the
JASMIN-CGN corpus,” in Proceedings of LREC, 2008.
[10] Wells, J.C., “SAMPA - computer readable phonetic
alphabet,” http://www.phon.ucl.ac.uk/home/sampa/.
[11] Van Doremalen, J., Cucchiarini, C. and Strik H., “Using 
Non-Native Error Patterns to Improve Pronunciation 
Verification”, Submitted to Interspeech 2010. 
[12] Cucchiarini, C., Neri, A. and Strik, H., “Oral proficiency 
training in Dutch L2: The contribution of ASR-based 
corrective feedback,” Speech Communication, 2009. 
[13] Van den Bosch, A., Content, A., Daelemans, W. and De
Gelder, B., “Measuring the complexity of writing
systems,” Journal of Quantitative Linguistics, vol. 1,
no. 3, pp. 177–188, 1994.