Mirjam Wester, Judith Kessens, Catia Cucchiarini, Helmer Strik
A2RT, Dept. of Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
In: H. Strik, N. Oostdijk, C. Cucchiarini and P.A. Coppen (eds.),
Proceedings of the Department of Language & Speech, Vol. 20, pp. 127-137,
Nijmegen, the Netherlands, 1997
Abstract
In this paper we describe a method for improving the performance of a continuous
speech recognizer by modelling pronunciation variation. Although the results obtained
with this method are in line with those reported by other authors, the magnitude of
the improvements is very small. In looking for possible explanations for these results, we computed
various sorts of statistics about the material. Since these data proved to be very useful
in understanding the effects of our method, they are discussed in this paper. Moreover, on the
basis of these statistics we discuss how the system can be improved in the future.
1. Introduction
At the Department of Language and Speech of the University of Nijmegen we
are working on a Spoken Dialogue System (SDS) that will be employed to automate part of
a public transport information service. This system was adapted from a German prototype developed
by Philips Research Labs, and was further improved by means of a bootstrapping method
(Strik et al., 1996, 1997). An important component of this SDS is a continuous speech recognizer (CSR).
This part of the SDS was also gradually improved through the bootstrapping method,
by adding more data. However, since a point was reached at which no further increase
in performance could be obtained by increasing the data, new methods of improving the system
were sought. Given that the SDS is a mixed-initiative system and that the kind of speech
the callers may use is extremely varied, we thought of improving the system's performance
by modelling pronunciation variation. The method used for modelling pronunciation variation is
discussed in detail in section 2. Subsequently, in section 3 the results obtained with
this method are presented together with various sorts of statistics about the material.
In section 4 we discuss how the statistics we computed helped us to understand why the variations
in performance were so small, and how this knowledge can be used to improve the system
in the future.
2. Method and material
2.1 Method
The starting point of the current research was a CSR in which a single pronunciation
lexicon was used. For each word only the transcription we thought was most probable
(the canonical form) was available. In this experiment we wanted to test to what extent
the performance of the CSR could be improved by modelling at least part of the pronunciation
variation that is encountered in the material. The approach we adopted resembles those used
successfully by Cohen (1989) and Lamel and Adda (1996). In this approach phonological rules are used to generate pronunciation variants,
i.e. to expand the lexicon. The expanded lexicon can then be used during training,
recognition (test) or both. During test the old test lexicon is simply replaced by the new
one, in order to make it possible to recognize pronunciation variants. During training the pronunciation
variants can be used to obtain new acoustic models. For training, the whole process can
be schematized as follows (a code sketch is given at the end of this section):
1. Use the old lexicon (single pronunciation) and the training corpus to compute the first version of the phone models.
2. Select phonological rules.
3. Generate a new lexicon with multiple pronunciations on the basis of the selected rules.
4. Do forced recognition to determine which variant is realized in the corpus. The chosen variant is then added to the training corpus. This way a new transcription of the training corpus is obtained.
5. Use the new transcription of the training corpus to calculate new phone models.
Stages 4 and 5 can be repeated a couple of times in iteration so as to obtain different versions of the phone models. Stages 2 to 5 can be repeated with different rules.
Our ultimate goal is to find the rules that are optimal in the sense that
they produce the greatest increase in performance. The goal of the current research was to
test whether the method proposed above was suitable for our purposes. In order to do so we
have tested the method with only four phonological rules, as will be explained below.
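To make the structure of the training scheme explicit, the following minimal sketch (in Python) mirrors steps 1 to 5. All names are hypothetical stand-ins rather than our actual implementation: train() stands for HMM phone-model training and forced_choice() for forced recognition; here forced_choice() simply keeps the canonical form, so the loop runs end to end without a real recognizer.

```python
# Minimal sketch of the five-step training scheme; helpers are placeholders.

def expand_lexicon(lexicon, rules):
    """Step 3: add rule-generated pronunciation variants to each entry."""
    return {w: [t] + [r(t) for r in rules if r(t) != t]
            for w, t in lexicon.items()}

def forced_choice(variants):
    """Step 4 stand-in: a real CSR picks the variant that best matches the
    speech signal; this placeholder simply returns the canonical form."""
    return variants[0]

def train(transcriptions):
    """Steps 1 and 5 stand-in for phone-model (HMM) training."""
    return {"trained_on": transcriptions}

lexicon = {"andere": "And@r@", "melk": "mElk"}     # canonical forms (SAMPA)
rules = [lambda t: t.replace("And@r@", "Andr@")]   # toy /@/-deletion rule
corpus = [["andere"], ["melk"]]                    # word-level transcriptions

models = train([[lexicon[w] for w in utt] for utt in corpus])   # step 1
multi = expand_lexicon(lexicon, rules)                          # steps 2-3
for _ in range(2):                                              # iterate 4-5
    new_trans = [[forced_choice(multi[w]) for w in utt] for utt in corpus]
    models = train(new_trans)                                   # step 5
```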
2.2 Phonological rules
Much of the phonological variation in Dutch has been described in the literature
by means of phonological rules (see for instance Booij, 1995). However, there are also
phenomena which have been described to a lesser extent or even not at all. It is therefore
almost impossible to decide in advance which rules will be relevant to our CSR. Moreover, in
order to make such a decision one needs to know what type of speech (speaking style) is being
dealt with. However, in our research it is difficult to determine what the speaking
style is. Although there is a considerable amount of information on spontaneous speech and human
interaction, relatively little is known about man-machine interaction. On the one hand,
people who call the system for information about the train schedule use spontaneous speech,
in the sense that they do not read aloud some previously prepared text. From this point of
view one might expect that their speech will exhibit all sorts of phenomena (e.g. disfluencies,
hesitations and pronunciation variation) that are known to occur in spontaneous speech.
On the other hand, these people realize that they are talking to a machine that may have problems
in understanding them. This may be a reason for them to monitor their speech
to a greater extent and speak more carefully and therefore more formally, than they would normally
do (see also Shriberg et al., 1992). Moreover, variation of speaking style can be observed even for one speaker
within one recording session. For example, a speaker may begin by talking informally,
but may change to a more formal speech mode if (s)he realizes that the speech recognizer
has difficulties in understanding him/her. The fact that such extreme forms
of variation are present makes it difficult to adopt speaking-style-dependent phonological
rules, simply because one does not know what the speaking style is.
In order to select the initial set of phonological rules a number of criteria
were followed. As is well known, variation occurs both within words and at word
boundaries. Given the use of a lexicon in our CSR, it was obvious to begin with word
internal variation. Therefore, the first criterion was to choose rules of word phonology. Second, we decided to start with rules concerning those phenomena that are
known to be most detrimental to automatic speech recognition. Of the three possible
recognition errors, i.e. insertions, deletions and substitutions, the first two have the greatest
consequences for speech recognition, because they affect the number of segments present in
different realizations of the same word. Therefore, starting with rules concerning
insertions and deletions was the second criterion we adopted. A third criterion was to choose rules that are frequently applied. Actually,
'frequently applied' is amenable to two interpretations. A rule can be frequent either
because it is frequently applied whenever the context for its application is met or because
the context in which it can be applied is very frequent (even though the rule is applied
in only 50% of the cases). Obviously, it is this latter case of 'frequent occurrence' that
is most interesting for automatic speech recognition, since in this case it is difficult to predict
which variant should be taken as canonical form, while in the former case the most frequent form
would probably suffice as sole transcription. A fourth criterion (related to the previous one) we followed was that the
rules should regard phones that are relatively frequent in the language, since rules
that concern infrequent phones probably have fewer consequences for the recognizer's performance.
Finally, we decided to start with rules that have been extensively described in the
literature, so as to avoid possible effects of overgeneration and undergeneration due to incorrect
specifications of the rules. On the basis of the above-mentioned criteria the phonological rules
which were selected are /@/-deletion, /@/-epenthesis, /t/-deletion and /n/-deletion
(Booij, 1995). A short description and an example of each of the rules follow here, after Booij
(1995:127-130, 139-141, 152-154).
1. /@/-deletion: When two consecutive syllables are headed by a schwa, the first schwa may be deleted, provided that the remaining onset consonant cluster is a cluster of an obstruent followed by a liquid.
obs + @ + liq + @ → obs + liq + @
Example: /And@r@/ → /Andr@/
2. /@/-epenthesis: In nonhomorganic consonant clusters in coda position a schwa may be inserted.
Example: /mElk/ → /mEl@k/
3. /t/-deletion: This rule is typically one of the processes that occur in fast speech, but to a lesser extent it also occurs in careful speech. There are three different conditions in which /t/-deletion occurs.
First, if a /t/ in a coda is preceded by an obstruent and followed by another consonant, the /t/ may delete.
obs + t + cons → obs + cons
Example: /snElstmox@l@k/ → /snElsmox@l@k/
Second, if the preceding consonant is a sonorant, /t/-deletion is possible, but then the following consonant must be an obstruent. When the obstruent following the sonorant + /t/ cluster is a /k/, deletion does not apply. When /t/ is preceded by a sonorant and also followed by a sonorant, deletion is impossible.
son + t + obs → son + obs
Example: /EintpYnt/ → /EimpYnt/
And lastly, because in some Dutch dialects /t/-deletion in word-final position also occurs, we decided to apply the rule in word-final position following an obstruent (unless the obstruent is an /s/).
word-final: obs + t → obs
Example: /dElft/ → /dElf/
4. /n/-deletion: In standard Dutch, syllable-final /n/ can be dropped after a schwa, except in the indefinite article 'een' /@n/. For many speakers, in particular in the western part of the Netherlands, the deletion of /n/ is obligatory. An /n/ is deleted if it is the final /n/ of a syllable after a schwa and if that syllable is not a verbal stem.
syllable-final: @ + n → @
Example: /rEiz@n/ → /rEiz@/
There is, however, no deletion of the final /n/ in 'ik teken' (I draw) /tek@n/, because 'teken' is a verbal stem. Booij (1995) also adds that the /n/ must be at the end of a morpheme. However, we did not apply this part of the rule, so /n/-deletion is also applied in our lexicon in words like 'volgende', where the /n/ is not near a morphological boundary.
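To illustrate how such rules can be operationalized, the sketch below implements strongly simplified versions of the four rules as regular-expression rewrites over SAMPA-like strings (with /@/ for schwa). The phone classes and contexts are assumptions for illustration only: syllable structure, the homorganicity test, verbal stems and the nasal assimilation seen in /EimpYnt/ are not modelled.

```python
import re

# Simplified phone classes (assumed inventory, for illustration only).
OBS = "ptkbdfvszxG"   # obstruents
LIQ = "lr"            # liquids
SON = "mnNlrjw"       # sonorant consonants
CONS = OBS + SON

def schwa_deletion(t):
    # obs + @ + liq + @ -> obs + liq + @   e.g. /And@r@/ -> /Andr@/
    return re.sub(f"([{OBS}])@([{LIQ}])@", r"\1\2@", t)

def schwa_epenthesis(t):
    # crude stand-in for a nonhomorganic coda cluster: a liquid or nasal
    # followed by a word-final noncoronal obstruent, e.g. /mElk/ -> /mEl@k/
    return re.sub("([lrmn])([pbkfvxG])$", r"\1@\2", t)

def t_deletion(t):
    s = re.sub(f"([{OBS}])t([{CONS}])", r"\1\2", t)                  # obs + t + cons
    s = re.sub(f"([{SON}])t([{OBS.replace('k', '')}])", r"\1\2", s)  # son + t + obs, not /k/
    s = re.sub(f"([{OBS.replace('s', '')}])t$", r"\1", s)            # word-final, after obs, not /s/
    return s

def n_deletion(t):
    if t == "@n":              # the indefinite article 'een' is exempt
        return t
    # syllable-final @n approximated as word-final; verbal stems not checked
    return re.sub("@n$", "@", t)

assert schwa_deletion("And@r@") == "Andr@"
assert schwa_epenthesis("mElk") == "mEl@k"
assert t_deletion("dElft") == "dElf"
assert n_deletion("rEiz@n") == "rEiz@"
```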
Generating pronunciation variants by hand is time-consuming and error-prone. We therefore
created the multiple pronunciation lexicon by automatically applying the above rules with a
script in which the rules and their conditions were specified. All four rules were applied
wherever possible and in no specific order. Thus a multiple pronunciation lexicon was obtained.
However, generating pronunciation variants automatically is not foolproof either, although
the types of problems encountered differ from those which arise from manual generation.
For example, the conditions in which a phonological rule should be applied are often based
on morphological information, such as morphological boundaries, which is, at present,
missing from our phone transcriptions. So either the phone transcriptions need to be
enriched, or other ways to solve these kinds of problems must be found. The variants
obtained automatically were compared with a lexicon which had been made by hand, to check
whether the script produced the correct variants. A number of the generated variants could
be expected to be highly unlikely to occur, but we chose to overgenerate so as not to
exclude possible variants beforehand.
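A script along these lines can be sketched as follows, reusing the four rule functions from the previous sketch: the rules are applied in every possible order, and all distinct outcomes (including partial applications) are kept, which matches the overgeneration strategy described above.

```python
from itertools import permutations

# Expand a canonical lexicon by applying the rules in every order and
# collecting all distinct intermediate and final forms (overgeneration).
# Reuses schwa_deletion, schwa_epenthesis, t_deletion and n_deletion
# from the sketch above.

RULES = [schwa_deletion, schwa_epenthesis, t_deletion, n_deletion]

def variants(canonical):
    forms = {canonical}
    for order in permutations(RULES):
        form = canonical
        for rule in order:
            form = rule(form)
            forms.add(form)    # keep partial applications as variants too
    return sorted(forms)

multi_lexicon = {w: variants(t) for w, t in
                 {"andere": "And@r@", "melk": "mElk", "delft": "dElft"}.items()}
# e.g. multi_lexicon["melk"] == ["mEl@k", "mElk"]
```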
2.3 Material
The CSR used in this experiment is part of an SDS (Strik et al., 1996,
1997). The speech material was collected with an online version of the SDS, which was connected
to an ISDN line. The training and test material consisted of 24,676 utterances (81,090
words) and 6,276 utterances (21,106 words), respectively. The most important characteristics of the CSR are the following. The input
signals consist of 8 kHz 8 bit A-law coded samples. Feature extraction is done every
10 ms for frames with a width of 16 ms. The first step in feature analysis is an FFT
analysis to calculate the spectrum. Next, the energy in 14 mel-scaled filter bands between 350
and 3400 Hz is calculated. Apart from these 14 filterbank coefficients, the 14 delta coefficients,
log energy, and slope and curvature of the energy are also used. This makes a total
of 31 feature coefficients. The CSR uses acoustic models (HMMs), language models (LMs:
unigram and bigram), and a lexicon. The continuous density HMMs consist of three segments
of two identical states, one of which can be skipped.
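The front-end can be sketched roughly as follows (Python with NumPy). The exact windowing and mel filterbank construction are assumptions, since they are not specified above, but the output dimensions match the 31 coefficients described.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def features(signal, fs=8000, frame_ms=16, shift_ms=10, n_filters=14):
    """16 ms frames every 10 ms -> 31 coefficients per frame."""
    flen, shift = int(fs * frame_ms / 1000), int(fs * shift_ms / 1000)
    nfft = 256
    # triangular mel filterbank between 350 and 3400 Hz (assumed shape)
    edges = inv_mel(np.linspace(mel(350), mel(3400), n_filters + 2))
    bins = np.floor(edges / fs * nfft).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        a, b, c = bins[i], bins[i + 1], bins[i + 2]
        fb[i, a:b] = np.linspace(0, 1, b - a, endpoint=False)
        fb[i, b:c] = np.linspace(1, 0, c - b, endpoint=False)
    frames = np.array([signal[i:i + flen] * np.hamming(flen)
                       for i in range(0, len(signal) - flen, shift)])
    spec = np.abs(np.fft.rfft(frames, nfft)) ** 2         # FFT -> power spectrum
    fbank = np.log(spec @ fb.T + 1e-10)                   # 14 filterbank coefs
    delta = np.gradient(fbank, axis=0)                    # 14 delta coefs
    loge = np.log(np.sum(frames ** 2, axis=1) + 1e-10)    # log energy
    slope = np.gradient(loge)                             # energy slope
    curve = np.gradient(slope)                            # energy curvature
    return np.column_stack([fbank, delta, loge, slope, curve])  # 31 per frame

feats = features(np.random.randn(8000))   # one second -> (n_frames, 31)
```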
In the online SDS the output of the CSR, and thus the input to the following natural language processing component, is a wordgraph (Strik et al., 1996, 1997).
In the research version it is possible to use the LMs to compute the Best Sentence (BS).
Obviously, the error rates for the wordgraph are much lower than those of the BS (Strik et al.,
1996, 1997). Nevertheless, we will use the BS in this article, because it is better
suited to the goals of the present research: evaluation of the results is easier and more transparent.
The single variant training lexicon contains 1,433 entries: these are all
the words contained in the training corpus and also a number of words which could
be expected in this specific application even though they do not (yet) occur in the corpus (for
example station names). The four phonological rules selected for investigation affect 536
of the 1,433 (37%) words in the training lexicon. Of these 536 words 487 words are affected
by one of the four phonological rules. In 47 cases two rules were applied to the same word
and in two cases three rules were applied. There were no words that were affected by all
four rules because /@/-deletion and /@/-epenthesis did not occur within
the same word. On average, 1.3 variants were generated for each of the 536 words. The multiple variant lexicon contains
2,151 entries, 1,433 (67%) of which are canonical. The test lexicon contains 860 entries, which are all the words present in
the online version. The number of out of vocabulary (OOV) words in the test corpus
is 298. The four phonological rules concern 354 of the 860 entries in the test lexicon (41%).
In this case 315 words were subject to one of the four rules. In 37 cases two rules were
applied, and here also two words were affected by three of the four rules. On average, 1.3 variants
were generated for each of the 354 words. The multiple pronunciation lexicon contains 1,341
entries, 860 (64%) of which are canonical.
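As a quick check, the statistics for both lexicons can be recomputed from the counts reported above:

```python
# Recomputing the lexicon statistics quoted above from the reported counts.
for name, entries, canonical, affected in [("train", 2151, 1433, 536),
                                           ("test", 1341, 860, 354)]:
    n_variants = entries - canonical
    print(f"{name}: {n_variants / affected:.2f} variants per affected word "
          f"(about 1.3), canonical share {canonical / entries:.0%}")
# train: 1.34 variants per affected word, canonical share 67%
# test:  1.36 variants per affected word, canonical share 64%
```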
2.4 Forced recognition
Forced recognition was imposed through the language models (LMs). For each
sentence unigram and bigram LMs were derived on the basis of 100,000 repetitions
of the same sentence. After the first forced recognition round, 484 utterances of the
training corpus were not correctly recognized. Of these utterances, 47 turned out to contain obvious
transliteration errors, which were corrected afterwards. Since the remaining 437 sentences
appeared to be problematic for a number of reasons (they contained background noise, disfluencies, unexpectedly long pauses within words and in some cases the loudness level
was insufficient) they were removed from the original training corpus and only 24,667 utterances
were used for further experiments. It turns out that forced recognition is a useful tool to identify all sorts
of errors and utterances which, for some reason, are problematic for the CSR. These utterances
will certainly be examined more closely in the near future. Instead of forced
recognition with LMs, as described above, we could have used a standard Viterbi algorithm.
Although the main advantage of the Viterbi algorithm is that a forced alignment can be
obtained for all utterances, the main disadvantages of this algorithm are (1) that the alignment
is not always meaningful, e.g. because the transliteration contains errors, and (2) that
it is not possible to find the errors and the problematic utterances. The resulting training corpus with 24,667 utterances was again used for
training and forced recognition. In the 24,566 cases in which forced recognition was
successful, the pronunciation variants chosen by forced recognition were substituted for
the original (canonical) transcriptions. In the 101 cases in which forced recognition
was not successful, the canonical form was chosen. The new transcriptions were subsequently
used to train new phone models.
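The effect of deriving per-sentence LMs from 100,000 repetitions can be pictured with the following sketch (not the actual LM training code; the add-one smoothing and the vocabulary size are assumptions): virtually all probability mass ends up on the target word sequence, so recognition is effectively forced along it.

```python
from collections import Counter

def sentence_lm(sentence, repetitions=100_000, vocab_size=1_433, eps=1.0):
    """Unigram/bigram LM estimated from many repetitions of one sentence."""
    words = sentence.split()
    uni = Counter({w: repetitions * words.count(w) for w in set(words)})
    bi = Counter()
    for a, b in zip(words, words[1:]):
        bi[(a, b)] += repetitions
    total = repetitions * len(words)

    def p_uni(w):                      # add-eps smoothing over the vocabulary
        return (uni[w] + eps) / (total + eps * vocab_size)

    def p_bi(a, b):
        return (bi[(a, b)] + eps) / (uni[a] + eps * vocab_size)

    return p_uni, p_bi

p_uni, p_bi = sentence_lm("ik wil naar utrecht")
# p_bi("wil", "naar") ~ 0.99: word pairs from the sentence get nearly all the
# probability mass, while any other pair gets ~1e-5, so the recognizer is
# effectively forced to follow the target word sequence.
```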
3. Results
Above it has been explained how single (S) and multiple (M) pronunciations
during training lead to two different sets of phone models. In addition either single (S)
or multiple (M) pronunciations can be used in the test lexicon. This makes a total of four
combinations, for each of which we present the sentence and word error rates (SER and WER,
respectively) of the best sentences (BS) in Table 1.
Table 1. SER and WER for the BS of four different CSRs.
CSR SS SM MS MM
train S S M M
test S M S M
SER(%) 32.63 32.39 33.03 32.41
WER(%) 23.63 23.50 23.81 23.50
As appears from Table 1, there are only slight variations in recognition
performance between the various conditions. Nevertheless, it is interesting to analyse these
data in more detail, in order to see whether the various tendencies are in line with those reported
in the literature. For instance, the worst performance level appears to be obtained when multiple pronunciations are used for training but not for testing (i.e., when the
new phone models are combined with the old lexicon). This is exactly the pattern reported by
Lamel and Adda (1996). Furthermore, Lamel and Adda (1996) found that using multiple pronunciations
for testing gave better results than using single pronunciation lexicons. This
is confirmed by our data (compare column 2 with column 3). However, these authors also found
that recognition performance improved even further when multiple pronunciations were used
both for training and for testing, which is not confirmed by our data: there is practically
no difference in performance between column 3 and column 5. Therefore, on the basis of these results we can conclude that the applied
method improves the performance, albeit to a small extent. Moreover, the observed
improvements are in line with those reported elsewhere (Lamel and Adda, 1996). However, since
the magnitude of the changes is considerably smaller than that reported by other authors,
it is interesting to consider why this is the case. A possible explanation for these results would be that during forced recognition
the CSR selects the wrong variant. In order to test whether this was the case,
we checked for a small number of words whether the correct pronunciation variant was chosen
by looking at and listening to the signals. Since it turned out that in 90% of the 711
words the correct version was chosen, there is no reason to believe that the small increase
in performance was mainly due to errors in forced recognition. Another reason could be that the number of pronunciation variants that can
be selected is relatively small. Against this background it is interesting to know how
often one of the alternative variants could be chosen and how often it was indeed chosen.
In Table 2, the second row gives the total number of words in the training corpus and the
total number of recognized words in the test corpus for two different conditions
(recognition with the original and with the new phone models, respectively). The third row
lists the number of cases in which only a single variant could be chosen, the fourth row
the number of cases in which an alternative variant could be chosen, and the last row the
number of instances in which an alternative variant was indeed chosen.
Table 2. Number of pronunciation variants possible and chosen
Corpora train corpus test corpus (old phones) test corpus (new phones)
Total 81,090 19,962 20,011
Single 66,590 15,556 15,640
Multiple 14,500 4,406 4,371
Alternative 6,363 2,028 2,128
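The percentages discussed below follow directly from Table 2 and can be recomputed as follows:

```python
# Recomputing the percentages discussed below from the counts in Table 2.
table2 = {
    "train":             dict(total=81_090, multiple=14_500, alternative=6_363),
    "test (old phones)": dict(total=19_962, multiple=4_406, alternative=2_028),
    "test (new phones)": dict(total=20_011, multiple=4_371, alternative=2_128),
}
for name, c in table2.items():
    print(name,
          f"could vary: {c['multiple'] / c['total']:.1%}",        # 17.9 / 22.1 / 21.8
          f"did vary: {c['alternative'] / c['multiple']:.1%}",    # 43.9 / 46.0 / 48.7
          f"of total: {c['alternative'] / c['total']:.1%}")       #  7.8 / 10.2 / 10.6
```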
Percentages have been calculated for the data in Table 2 to give a clearer
picture of how the different rows relate to each other. Alternative pronunciations were available
for 17.9% of all words in the training corpus. In 43.9% of those cases an alternative variant
was actually chosen, which means that 7.8% of the total number of words in the corpus
are chosen as alternative variants. For the test corpus the percentages are similar to
those of the training corpus. Alternative variants could be chosen in 22.1% (original) and 21.8%
(new) of the total number of words in the test corpus; in 46.0% and 48.7% of those cases an
alternative variant was actually chosen. Thus, in 10.2% (original) and 10.6% (new) of the
total number of words an alternative variant was chosen.
From these data we can infer that, on average, one of the alternative variants
is chosen in about 45% of the possible cases, and in 8-10% of the total number of
words. However, most variants will only differ in one phone from the canonical form. A comparison
of the two transcriptions of the training corpus (i.e. the canonical forms versus the
transcriptions obtained with forced recognition) reveals that they differ in 6,594 of the
total 318,774 phones (2.1%). This seems to be one of the reasons why the effects on recognition
performance are far from dramatic. Adding variants to the test lexicon increases confusability, which could
also be one of the reasons why there was not a great deal of improvement in the recognizer's
performance. In the tests in which the multiple pronunciation lexicon (1,341 entries) was
used, 48% of all entries in the test lexicon never occurred in the test corpus: 19% of all
entries were alternative variants which were never chosen; in 5% of the cases the canonical
form of a word was never chosen but, instead, an alternative variant was chosen; and 24% of
the entries were words which never occurred in the test corpus at all, i.e. neither the
canonical form nor an alternative variant of those words was ever chosen. This is partly due to overcoverage of the rules but also to the fact that
a lot of canonical forms in the test lexicon have been added for application specific
purposes. There are, for example, quite a number of station names and time indicators which
do not occur in the test corpus but which must be contained in the test lexicon because
they are considered to be of the utmost importance for the application. In other words, they
may not have occurred yet, but they could very well occur in the future. As the CSR is part
of a system for a public transport information service, it must be able to recognize all
station names and time indicators, because they are crucial for the success of an enquiry.
In order to gain more insight into these data, we compared the four versions of the CSR.
First we determined for each version of the CSR which BS contained an error. Subsequently,
the BS containing errors were compared for four of the six possible pairs of CSR versions
(those in which only one factor changes while the other is kept constant, i.e. SS-SM,
MS-MM, SS-MS and SM-MM). The results of these comparisons are shown
in Table 3.
Table 3. Comparisons of the performance of the four versions of the CSR.
CSR 1 SS MS SS SM
CSR 2 SM MM MS MM
same errors 1630 1592 1089 1066
other errors 364 400 836 844
improvements 54 81 123 123
deteriorations 39 42 148 124
net result +15 +39 -25 -1
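The comparison underlying Table 3 can be sketched as follows; errors_1 and errors_2 are assumed to map each utterance whose BS contains an error to that erroneous BS, and the data shown are invented toy input.

```python
def compare(errors_1, errors_2):
    """Count shared and differing errors, improvements and deteriorations
    when going from CSR 1 (errors_1) to CSR 2 (errors_2)."""
    both = errors_1.keys() & errors_2.keys()
    return {
        "same errors":    sum(errors_1[u] == errors_2[u] for u in both),
        "other errors":   sum(errors_1[u] != errors_2[u] for u in both),
        "improvements":   len(errors_1.keys() - errors_2.keys()),  # fixed in CSR 2
        "deteriorations": len(errors_2.keys() - errors_1.keys()),  # new in CSR 2
    }

errs_ss = {"utt1": "naar utrecht", "utt2": "van dalen"}    # toy error lists
errs_sm = {"utt1": "naar utrecht", "utt3": "om tien uur"}
stats = compare(errs_ss, errs_sm)
stats["net result"] = stats["improvements"] - stats["deteriorations"]
```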
From Table 3 it appears that a considerable number of utterances contain
a recognition error in both CSRs, either the same (row 3) or a different one (row 4). Furthermore,
there are cases in which a better solution is chosen (improvements, row 5). However, since
in an almost equal number of cases a worse solution is chosen (deteriorations, row 6),
the two effects balance each other out and the net result (row 7) is small. This neutralization
effect explains why no considerable changes in the error rates were observed in Table 1.
It is well-known that including alternative pronunciation variants leads to some sort of trading
relation between improving performance (by covering part of the variability in speech) and
deteriorating it (by increasing the confusability between the entries in the lexicon). Based on the fact that only 2.1% of the phones differ between the two transcriptions
of the training corpus and the results shown in Table 1, it could be concluded
that the use of multiple pronunciations during training has few consequences for the
recognition process, for instance, because the acoustic models hardly change. However, comparison
of columns 4 and 5 with columns 2 and 3 in Table 3 reveals that varying the phone models
produces more changes than varying the test lexicon. A comment on this may be in order. Using multiple variants for testing simply means that the CSR can choose
from among a greater number of possibilities for each word. Put differently, the variations
in the system occur at the word level and concern only a limited number of words. When
multiple variants are used for training, on the other hand, they produce different acoustic
models. In other words, in this case the variations occur at the phone level. Since all words
in the corpus are made up of phones, the effects of variation modelling during training are
likely to be more pervasive. Further inspection of Table 3 also reveals that, in spite of the greater
number of changes in columns 4 and 5, the net result is negative, while in columns
2 and 3 it is positive. In other words, the fewer changes in columns 2 and 3 successfully conspire
to achieve better recognition results, while the net result of the larger number of changes
in columns 4 and 5 is a deterioration. A final remark concerns the number of utterances in which there is room
for improvement. It appears that 4,038 of the 6,276 utterances are recognized
correctly in all four systems. Since 1,066 utterances contain OOV words they can never be recognized
correctly. Therefore there is only room for improvement in the remaining 1,172 utterances.
With this in mind no dramatic changes in recognition performance can possibly be expected.
4. Discussion and conclusions
In the previous section we examined the results of an experiment aimed at
determining the contribution of pronunciation variation modelling to improving the performance
of our CSR. One of the things we have learned from this experiment is that forced recognition
as it was implemented in this method is a useful instrument to identify possible errors
in the transliterations and in the lexicons and to spot the utterances that, for
some reason, present insurmountable problems to automatic speech recognition. Studying these
sentences in further detail is certainly worthwhile. Furthermore, in 90% of the cases this forced
recognition procedure selects the correct pronunciation variant. As far as the main goal of this experiment is concerned, i.e. establishing
whether the applied method is suitable for improving the performance of our CSR, we
can conclude that there are no reasons to assume that this is not the case. As a matter of
fact the observed changes are in line with those reported by other researchers. The only problem
seems to be that in our research the variations are very small. In this respect it may
be instructive to consider the following facts. First, the statistics concerning the material may have played an important
part in limiting the effect of pronunciation modelling on recognition performance.
It should be borne in mind that an alternative variant was chosen in only 8-10% of the cases.
Moreover, in most of the cases the alternative transcriptions differed in only one phone from
the canonical form. In connection with this, no more than 2.1% of the phones were changed as
a result of variation modelling. Furthermore, in only 1,172 sentences was there room
for improvement. Finally, another factor that should not be overlooked concerns the phones
involved in the rules under study. Since the four rules concern phones that are very frequent
in Dutch and in the material under study (in the training corpus /n/, /t/ and /@/ are the
three most common phones), there are so many occurrences of these phones, that the impact
of variation modelling is likely to be limited. If we consider all these aspects, it is not surprising that recognition
performance hardly improved. Moreover, it is important to point out that our research
is at an early stage and that a number of things that we intend to do have not been done yet.
For instance, in this experiment we have confined ourselves to within word variation, whereas
modelling variation above the word level may be even more important (Cremelie and Martens, 1995).
Second, since only four rules were investigated, only a small part of the variation
in the material could be covered. However, it is our intention to expand the set of phonological
rules so as to maximize coverage. Another factor that might be responsible for the limited
impact of pronunciation modelling on recognition performance and that we have not
controlled yet is overcoverage, that is the fact that the rules selected generate a great
number of variants (19% of the total lexicon) that are not present in the corpus. This was to be
expected because no pruning of variants whatsoever was carried out. The reason for this is that
in this phase of our research we did not want to exclude variants that might turn out to be useful
at a later stage. Since we opted for overcoverage, this should be considered when analysing
the results. It is obvious that in the future we intend to examine pronunciation variants more
critically, before including them in the lexicon. More attention will be paid to the variants
that are indeed present in the corpus. In addition, the frequency with which they occur
will also be investigated, so that a probability count can be attached to each variant.
In the light of these considerations it is therefore legitimate to conclude that the results of
this experiment are promising, in spite of the limited increase in recognition performance.
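As an illustration of such probability counts, a minimal sketch (not part of the present experiments) that turns forced-recognition choices into a relative frequency per variant:

```python
from collections import Counter

def variant_probabilities(chosen_forms):
    """chosen_forms: list of (word, variant) pairs from forced recognition.
    Returns the relative frequency of each variant given its word."""
    counts = Counter(chosen_forms)
    totals = Counter(w for w, _ in chosen_forms)
    return {(w, v): n / totals[w] for (w, v), n in counts.items()}

probs = variant_probabilities([("reizen", "rEiz@"), ("reizen", "rEiz@"),
                               ("reizen", "rEiz@n"), ("melk", "mElk")])
# probs[("reizen", "rEiz@")] == 2/3; low-probability variants could be pruned.
```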
Acknowledgements
This work was funded by the Netherlands Organisation for Scientific Research
(NWO) as part of the NWO Priority Programme Language and Speech Technology. The research
of Dr. H. Strik has been made possible by a fellowship of the Royal Netherlands
Academy of Arts and Sciences.
References
Booij, G.E. (1995), The phonology of Dutch. Oxford: Clarendon Press.
Cohen, M.H. (1989), Phonological structures for speech recognition. PhD dissertation, University of California, Berkeley.
Cremelie, N. and J.P. Martens (1995), On the use of pronunciation rules for improved word recognition, Proceedings EUROSPEECH'95, Madrid, 1747-1750.
Lamel, L.F. and G. Adda (1996), On designing pronunciation lexicons for large vocabulary, continuous speech recognition, Proceedings ICSLP'96, Philadelphia, 6-9.
Shriberg, E., E. Wade and P. Price (1992), Human-machine problem solving using spoken language systems (SLS): factors affecting performance and user satisfaction, Proceedings Speech and Natural Language Workshop, Harriman, New York, 49-54.
Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1996), Localizing an automatic inquiry system for public transport information, Proceedings ICSLP'96, Philadelphia, 853-856.
Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1997), A spoken dialogue system for the Dutch public transport information service, to appear in International Journal of Speech Technology.