Mirjam Wester, Judith M. Kessens & Helmer Strik
A2RT,
Dept. of Language & Speech,
University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
In: H. Strik, J.M. Kessens, M. Wester (eds.),
Proc. of the ESCA workshop 'Modeling Pronunciation Variation for Automatic Speech Recognition', Rolduc,
Kerkrade, 4-6 May 1998, pp. 145-150.
ABSTRACT
This paper describes how the performance of a continuous speech
recognizer for Dutch has been improved by modeling
pronunciation variation. We used three methods in order to
model pronunciation variation. First, within-word variation was
dealt with. Phonological rules were applied to the words in the
lexicon, thus automatically generating pronunciation variants.
Secondly, cross-word pronunciation variation was accounted for
by adding multi-words and their variants to the lexicon.
Thirdly, probabilities of pronunciation variants were
incorporated in the language model (LM), and thresholds were
used to choose which pronunciation variants to add to the LMs.
For each of the methods, recognition experiments were carried
out. A significant improvement in error rates was measured.
1. INTRODUCTION
The work reported on here concerns the Continuous Speech
Recognition (CSR) component of a Spoken Dialogue System (SDS)
that is employed to automate part of an existing public
transport information service [1]. A large number of telephone
calls made to the on-line version of the SDS have been recorded.
These data clearly show that the manner in which people speak
to the SDS varies, ranging from very sloppy articulation
to hyperarticulation. As pronunciation variation - if it is
not properly accounted for - degrades the performance of the
CSR, solutions must be found to deal with this problem.
Pronunciation variation can be divided into two main kinds of
variation. First, variation in the order and number of phones a
word consists of, and second, variation in the acoustic
realization of phones. In the present research, we are mainly
interested in the first kind of pronunciation variation,
because we expect this variation to be more detrimental to
speech recognition than the second kind. After all, most of the
variation in producing phones should be modeled implicitly when
using mixture models.
Our objectives are to improve the performance of the CSR, but
also to gain more understanding of the processes which play a
role in spontaneous speech. The work reported on in this paper
is exploratory research into how pronunciation variation can
best be dealt with in CSR.
In section 2, the general method for modeling pronunciation
variation is described. It is followed by a detailed
description of three different approaches which we used to
model pronunciation variation. Subsequently, in section 3, the
results obtained with these methods are presented. Finally, in
the last section, we discuss the results and their
implications.
2. METHOD AND MATERIAL
2.1 Method
The approach we use resembles those used previously with
success in [2, 3]. Earlier experiments using this method are
reported on in [4]. First, our baseline lexicon is described
followed by an explanation of the general method for modeling
pronunciation variation. Next, an explanation of the manner in
which the general method is used for modeling within-word
variation (method 1) and cross-word variation (method 2) is
given. The last method (method 3), which is an expansion of the
general method, describes how probabilities of pronunciation
variants were incorporated in the language model (LM).
2.1.1 Baseline
As a baseline we used a CSR with an automatically generated
lexicon. This lexicon is a canonical lexicon, which means that it
contains one transcription per word. It is crucial to start out
with a well-described lexicon, especially when modeling
pronunciation variation, because the transcription chosen for
each word in the canonical lexicon strongly affects the
recognition results. Since improvements or deteriorations in
recognition due to modeling pronunciation variation are measured
relative to the result of the baseline system, the choice of
this baseline is quite important. Furthermore, the pronunciation
variants which we generate are derived from the canonical
transcriptions; therefore the canonical lexicon must be
well-defined.
Our lexicon was automatically generated using the Text-to-Speech (TTS) system [5] developed at the University of
Nijmegen. Phone transcriptions for the words in the lexicon
were obtained by looking them up in two lexica:
ONOMASTICA [6], a lexicon of proper names, and CELEX, a
lexicon with words from mainly fictional texts. The grapheme-to-phoneme converter is employed whenever a word cannot be
found in either of the lexica. There is also the possibility of
manually adding words to a user lexicon, if the words do not
occur in either of the lexica and are not correctly generated
by the grapheme-to-phoneme converter. In this way,
transcriptions of new words are easily obtained automatically
and consistency in transcriptions is achieved.
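
The lookup order described above can be sketched as follows in Python. This is a minimal sketch under assumptions: the lexicon objects and the grapheme_to_phoneme function are illustrative placeholders, not the actual interfaces of the TTS system.

def transcribe(word, onomastica, celex, user_lexicon, grapheme_to_phoneme):
    """Return one canonical phone transcription for a word."""
    for lexicon in (onomastica, celex):          # look up in the two lexica first
        if word in lexicon:
            return lexicon[word]
    if word in user_lexicon:                     # manually added exceptions
        return user_lexicon[word]
    return grapheme_to_phoneme(word)             # fallback: rule-based conversion

def build_canonical_lexicon(words, onomastica, celex, user_lexicon, g2p):
    return {w: transcribe(w, onomastica, celex, user_lexicon, g2p) for w in words}
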
2.1.2 Rule-based lexicon expansion
As explained above, our baseline is a canonical lexicon, with
one entry per word. Pronunciation variants are added to this
lexicon, thus resulting in a lexicon with multiple
pronunciation variants. This lexicon can be used either during
recognition or training, or during both. In short, the whole
training procedure is as follows:
1. Train the first version of phone models using a canonical
lexicon.
2. Choose a set of phonological rules.
3. Generate a multiple-pronunciation lexicon using the rules
from step 2.
4. Use forced recognition to improve the transcription of
the training corpus.
5. Train new phone models using the improved transcriptions.
In step 4, forced recognition is used to determine which
pronunciation variants are realized in the training corpus.
Forced recognition involves "forcing" the recognizer to choose
between variants of a word, instead of between different words.
In this way, an improved transcription of the training corpus
is obtained, which is used to train new phone models.
Steps 4 and 5 can be iterated in order to
gradually improve the transcriptions and the phone models.
Steps 2 through 5 can be repeated for different sets of
phonological rules.
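
Schematically, the five-step procedure and the optional iteration of steps 4 and 5 could look as follows. This is a sketch only: train_phone_models, apply_rules and forced_alignment are hypothetical helper functions, not the recognizer's actual code.

def train_with_variants(canonical_lexicon, rules, training_corpus, iterations=1):
    # step 1: first phone models, trained with the canonical lexicon
    models = train_phone_models(training_corpus, canonical_lexicon)
    # steps 2-3: expand the lexicon with rule-generated pronunciation variants
    variant_lexicon = apply_rules(canonical_lexicon, rules)
    for _ in range(iterations):
        # step 4: forced recognition chooses, per utterance, between the
        # variants of the words that are known to occur in that utterance
        improved_corpus = forced_alignment(training_corpus, variant_lexicon, models)
        # step 5: retrain the phone models on the improved transcriptions
        models = train_phone_models(improved_corpus, variant_lexicon)
    return models, variant_lexicon
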
2.1.3 Method 1: Within-word variation
Pronunciation variants were automatically generated by applying
a set of phonological rules of Dutch to the pronunciations in
the canonical lexicon. The rules were applied to all words in
the lexicon where possible, using a script in which rules and
conditions were specified. All variants generated by the script
were added to the canonical lexicon thus creating a multiple-pronunciation lexicon.
In the first set of experiments, we modeled within-word
variation using four phonological rules: /n/-deletion, /t/-deletion, schwa-deletion and schwa-insertion. In the next set of
experiments, we added a fifth rule: the rule for post-vocalic
/r/-deletion. These rules were chosen according to four
criteria. The rules had to be rules of word-phonology, they had
to concern insertions and deletions, they had to be frequently
applied, and they had to involve phones that are relatively
frequent in Dutch. A more detailed description of the
phonological rules and the criteria for choosing them can be
found in [4, 7, 8].
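
To illustrate how such a script might work, the following sketch applies simplified rules, expressed as substitutions on space-separated phone strings, in all possible combinations. The rule contexts shown here are illustrative assumptions and do not reproduce the exact conditions used in the experiments.

import re
from itertools import combinations

RULES = [
    ("n-deletion",      r"@ n$",  "@"),      # word-final /n/ after schwa
    ("t-deletion",      r"s t$",  "s"),      # /t/ in a word-final /st/ cluster
    ("schwa-deletion",  r"@ l @", "l @"),    # schwa in an unstressed syllable
    ("schwa-insertion", r"l m$",  "l @ m"),  # schwa inside a final /lm/ cluster
]

def expand_word(canonical):
    """Return all variants obtained by applying subsets of the rules."""
    variants = {canonical}
    for k in range(1, len(RULES) + 1):
        for subset in combinations(RULES, k):
            pron = canonical
            for _name, pattern, replacement in subset:
                pron = re.sub(pattern, replacement, pron)  # no effect if no match
            variants.add(pron)
    return sorted(variants)

def expand_lexicon(canonical_lexicon):
    return {word: expand_word(pron) for word, pron in canonical_lexicon.items()}
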
2.1.4 Method 2: Cross-word variation
Cross-word variation was modeled by joining words together with
underscores, thus forming new words which we refer to, in this
paper, as multi-words. This changes the lexica, corpora, and
LMs. The multi-words are added to a lexicon in which the
separate parts that make up the multi-words are still present.
Multi-words are substituted in the corpora wherever the word
sequences occur. The LMs are calculated on the basis of these
adapted corpora.
We used the following criteria to decide whether a word sequence
qualifies as a multi-word. First, the sequence of words had to
occur frequently in the training material. We considered a
minimum of 20 occurrences of the word sequence in the training
material to be adequate. The second criterion which we adopted
was that word sequences had to form an articulatory or
linguistic unit. Thirdly, when a two-part multi-word, for
example "ik_wil", is selected, it is no longer possible to
create a multi-word consisting of three parts which includes
it. Thus, the three-part multi-word "ik_wil_graag" is then no
longer a possible multi-word.
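
The selection and substitution of multi-words could be sketched as follows. Only the 20-occurrence criterion is taken from the text above; the helper functions themselves are assumptions, not the tools actually used.

from collections import Counter

def count_sequences(corpus_sentences, candidates):
    """Count how often each candidate word sequence occurs in the corpus."""
    counts = Counter()
    for sentence in corpus_sentences:                  # sentence: list of words
        for cand in candidates:                        # cand: tuple of words
            n = len(cand)
            counts[cand] += sum(tuple(sentence[i:i + n]) == cand
                                for i in range(len(sentence) - n + 1))
    return counts

def select_multiwords(corpus_sentences, candidates, min_count=20):
    counts = count_sequences(corpus_sentences, candidates)
    return [c for c in candidates if counts[c] >= min_count]

def substitute_multiwords(sentence, multiwords):
    """Replace selected word sequences by underscore-joined multi-words."""
    mw_set = {tuple(mw) for mw in multiwords}          # e.g. ("ik", "wil")
    max_len = max((len(mw) for mw in mw_set), default=1)
    result, i = [], 0
    while i < len(sentence):
        for n in range(max_len, 1, -1):                # prefer the longest match
            if tuple(sentence[i:i + n]) in mw_set:
                result.append("_".join(sentence[i:i + n]))
                i += n
                break
        else:
            result.append(sentence[i])
            i += 1
    return result
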
Experiments were carried out to measure the effect of adding
multi-words to the lexicon, and the effect of adding
pronunciation variants of multi-words. The pronunciation
variants of the multi-words were automatically generated using
the five within-word phonological rules mentioned earlier and a
number of cross-word phenomena, namely: cliticization,
contraction and reduction. The underscores were disregarded
during the scoring procedure, so whether the word sequence was
recognized as a multi-word or in separate parts had no effect
on the word error rates.
2.1.5 Method 3: Probabilities
In previous experiments [4], we found that it is crucial to
determine which pronunciation variants should be added to the
lexicon. Adding variants to the lexicon can lead to a higher
degree of confusability during recognition. Consequently,
pronunciation variants not only correct some of the mistakes
made, but also introduce new mistakes. Therefore, we started
looking for automatic ways to reduce this confusability. First,
we incorporated probabilities in the LMs, and second, we
applied a threshold to determine which pronunciation variants
should be included in both the LMs and the lexicon.
A forced recognition was carried out on a large corpus (see
section 2.2) with a lexicon containing 50 multi-words and
pronunciation variants. Word counts and counts of pronunciation
variants were made on the basis of the resulting corpus. These
counts were used to create new LMs (unigram and bigram).
Pronunciation variants were added to the LMs, thus creating new
entries. This is in contrast to the earlier described methods 1
and 2, where the pronunciation variants were not incorporated
in the LMs, but only in the lexicon.
We assumed that not all words occurred frequently enough in
the training material to correctly estimate the probabilities
of all variants. Therefore, a number of thresholds were chosen,
to find out how often a word must occur in order to correctly
estimate the probabilities of the pronunciation variants.
The thresholds (N) are applied to both the LM and the test
lexicon. The word count is used to determine if pronunciation
variants are included in the LM. If a word occurs N times or
more, all pronunciation variants of that word and their counts
are included in the LM and the lexicon. If a word occurs fewer
times than the threshold, only the most frequent pronunciation
variant is included in the LM and the lexicon.
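
The threshold rule can be illustrated with the following sketch, in which the data layout (counts per word and counts per pronunciation variant) is assumed for illustration only.

def apply_threshold(word_counts, variant_counts, threshold):
    """
    word_counts:    {word: total count in the training corpus}
    variant_counts: {word: {pronunciation variant: count}}
    Returns {word: {variant: count}} from which the LM and lexicon are built.
    """
    selected = {}
    for word, variants in variant_counts.items():
        if word_counts[word] >= threshold:
            selected[word] = dict(variants)                # keep all variants
        else:
            best = max(variants, key=variants.get)         # most frequent only
            selected[word] = {best: variants[best]}
    return selected
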
2.2 CSR and Material
The CSR used in this experiment is part of an SDS [1], as was
mentioned earlier. The speech material was collected with an
online version of the SDS, which was connected to an ISDN line.
The input signals consisted of 8 kHz 8 bit A-law coded samples.
The speech can be described as spontaneous or conversational.
Recordings with high levels of background noise were excluded
from the material used for training and testing.
The most important characteristics of the CSR are as follows.
Feature extraction is done every 10 ms for frames with a width
of 16 ms. The first step in feature analysis is an FFT analysis
to calculate the spectrum. Next, the energy in 14 Mel-scaled
filter bands between 350 and 3400 Hz is calculated. The final
processing stage is the application of a discrete cosine
transformation on the log filterband coefficients. Besides 14
cepstral coefficients (c0-c13), 14 delta coefficients are also
used. This makes a total of 28 feature coefficients. The CSR
uses acoustic models (HMMs), language models (unigram and
bigram), and a lexicon. The continuous density HMMs consist of
three segments of two identical states, one of which can be
skipped. In total, 38 HMMs were used: 35 of these models
represent phonemes of Dutch, two represent allophones of the
phonemes /l/ and /r/, and one model is used for the non-speech
sounds.
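
For illustration, a schematic version of this feature extraction is sketched below (10 ms shift, 16 ms window, FFT, 14 mel-scaled bands between 350 and 3400 Hz, log, DCT to 14 cepstra plus 14 simple deltas). The filterbank construction and the delta computation are textbook approximations and are not necessarily identical to the recognizer's implementation.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands=14, n_fft=256, sr=8000, fmin=350.0, fmax=3400.0):
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for i in range(n_bands):                           # triangular filters
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fb[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fb

def extract_features(signal, sr=8000, shift_ms=10, win_ms=16, n_fft=256):
    shift, win = int(sr * shift_ms / 1000), int(sr * win_ms / 1000)
    window = np.hamming(win)
    fb = mel_filterbank(n_fft=n_fft, sr=sr)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, shift)]
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    log_mel = np.log(power @ fb.T + 1e-10)
    cepstra = dct(log_mel, type=2, axis=1, norm="ortho")[:, :14]   # c0-c13
    deltas = np.gradient(cepstra, axis=0)                          # simple deltas
    return np.hstack([cepstra, deltas])                            # 28 features
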
For the experiments conducted using methods 1 and 2, our
training and test material consisted of 25,104 utterances
(81,090 words) and 6267 utterances (21,106 words),
respectively. The training material was used to train the HMMs
and the LMs. In a later stage, the training corpus was expanded
with 49,822 utterances leading to a total of 74,926 utterances
(225,775 words). The enlarged training corpus is only used for
method 3 to estimate the probabilities of pronunciation
variants. In the future, this enlarged corpus will also be used
in methods 1 and 2.
The single variant training lexicon contains 1412 entries,
which are all the words in the training material. Adding
pronunciation variants generated by the five phonological rules
increases the size of the lexicon to 2729 entries (an average
of about 2 entries per word). Adding 50 multi-words plus their
variants leads to a lexicon with 2845 entries. The maximum
number of variants that occurs for a single word is 16.
The single variant test lexicon contains 1158 entries, which
are all the words in the test corpus, plus a number of words
which must be in the lexicon because they are part of the
domain of the application. The testing corpus does not contain
any out-of-vocabulary (OOV) words. This is a somewhat
artificial situation, but we did not want the recognition
performance to be influenced by words which could never be
recognized correctly, simply because they were not present in
the lexicon. Adding pronunciation variants generated by the
five phonological rules leads to a lexicon with 2273 entries
(also about 2 entries on average per word). Adding 50 multi-words and their variants results in a lexicon with 2389
entries.
The results presented in the next section are best-sentence
word error rates. The word error rate (WER) is determined by:

WER = (S + D + I) / N x 100%

where S is the number of substitutions, D the number of
deletions, I the number of insertions and N the total number of
words. During the scoring procedure only the orthographic
representation is used. Whether or not the correct
pronunciation variant was recognized is not taken into account.
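
As an illustration, S, D and I are usually obtained from a Levenshtein alignment between the reference and the recognized word sequence. The sketch below shows this standard definition, not necessarily the exact scoring tool that was used.

def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, computed via edit distance over word lists."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i                  # i deletions
    for j in range(m + 1): d[0][j] = j                  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + sub)         # match / substitution
    return d[n][m] / n

# e.g. word_error_rate("ik wil naar utrecht".split(), "ik wil utrecht".split())
# -> 0.25 (one deletion out of four reference words)
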
3. RESULTS
Recognition can be carried out with phone models trained on a
corpus with single-pronunciation variants (S), or with phone
models trained on a corpus with multiple-pronunciation variants
(M). In addition, either a single (S) or a multiple (M)
pronunciation lexicon can be used during recognition. In the
following tables the different conditions are indicated in the
row entitled "CSR". The first letter indicates what kind of
training corpus was used and the second letter denotes what
type of lexicon was used during testing.
3.1 Method 1: Within-word variation
Table 1 shows the results obtained for two rule sets: four and
five rules (see 2.1.3). Adding a pronunciation rule, in this
case the /r/-deletion rule, gives the same result for the SM
condition, but leads to improvements of 0.32% and 0.31% in WER
for the MS and MM conditions, respectively. Therefore, the rest
of the results discussed here concern the CSR with five rules.
Table 1
The effect of adding pronunciation variants during recognition
can be seen when comparing the SS and SM conditions. In column
2, the results are shown for the baseline condition (SS).
Adding pronunciation variants to the lexicon (resulting in a
multiple-pronunciation lexicon, SM) leads to an improvement of
0.29% in WERs.
When the multiple-pronunciation lexicon is used to perform a
forced recognition and new phone models are trained on the
resulting updated training corpus (MM), it leads to a further
improvement of 0.30% compared to the condition SM.
Testing with the single-pronunciation lexicon while using
updated phone models (MS) leads to a slight deterioration in WERs
compared to the SS condition. It seems the best results are
found when the phone models are trained on a corpus which is
based on the same lexicon as the lexicon which is used during
recognition. (SS is better than MS and MM is better than SM.)
3.2 Method 2: Cross-word variation
On the basis of the criteria explained in section 2.1.4, we
selected multi-words which were added to the lexicon. Table 2
shows the effect of adding 25, 50 and 75 multi-words compared
to the WER for the case where 0 multi-words have been added to
the lexicon (the SS column in Table 1). The first 50 multi-words were as general as possible; no application-specific
word sequences were included. The next 25 multi-words, which
were added to reach a total of 75 multi-words, were application-specific: they consisted of frequently occurring station names.
This was necessary because no more than 50 word sequences
that were not application-specific adhered to all the
criteria listed in 2.1.4. The station names which we added were
of the type "Driebergen-Zeist", which is simply a station name
consisting of two parts.
Table 2
Adding 50 multi-words leads to an improvement of 0.49% in WERs.
It seems as if there is a maximum to the number of multi-words
which should be added. On the basis of the results shown in
Table 2, we decided to continue using the lexicon containing 50
multi-words, because this gave the largest improvement in WERs.
In the following stage, we added different pronunciation
variants to the lexicon containing 50 multi-words. The results
are shown in Table 3. The second column shows the result for
the condition without pronunciation variants, but with 50
multi-words (see also column 4, Table 2). Next, we added
pronunciation variants generated by the five phonological rules
(see 2.1.3). First, the rules were only applied to the separate
words in the lexicon, not to the multi-words (column 3). The
result in column 4 is due to adding only pronunciation variants
of the 50 multi-words (see 2.1.4) to the lexicon. In the last
column, the result is shown for the situation where all of the
pronunciation variants (5 rules and multi) were added to the
lexicon.
Table 3
Adding variants generated by the five phonological rules (5
rules) gives roughly the same improvement (0.34% compared to
0.29%) as was found in Table 1 when going from SS to SM. When
only variants of the multi-words are added (multi), a
deterioration of 0.51% in WERs is found. Adding both multi-word
variants and the variants generated by the five rules (all)
leads to a deterioration in WERs when compared to the SS
condition.
3.3 Method 3: Probabilities
Probabilities for separate pronunciation variants were
estimated using the enlarged corpus. A forced recognition was
carried out on this corpus in order to obtain the pronunciation
variants for each word. The lexicon which was used for the
forced recognition contained the 50 multi-words and all of the
pronunciation variants (same lexicon as for SMall, last column
in Table 3). The probabilities of the pronunciation variants
were incorporated in the LMs. Column 2 in Table 4 shows the
result of adding probabilities of all pronunciation variants to
the LMs. When this is compared to the same test situation,
without probabilities (last column, Table 3), an improvement of
0.61% in WERs is achieved.
Table 4
Next, we decided to apply thresholds for adding pronunciation
variants to the lexica and LMs as was described in section
2.1.5. We expected that this would also influence recognition,
but the improvements proved to be small, as can be seen in
columns 3 through 5 in Table 4.
3.4 Overall Results for the 3 Methods
In all of the above results, the effects of adding
pronunciation variants cannot be seen clearly, because WERs
only give an indication of the net improvement or
deterioration. Table 5 shows the changes in the utterances,
which occur due to the combination of all three methods which
were tested. A comparison is made between the baseline
condition and the final test (the best condition in Table 4,
threshold 100). In the first column (Table 5) the type of
change is given, in the second column the number of utterances
which are affected.
Table 5
In total 875 of the 6276 utterances changed. The net result is
improvements in 101 utterances, as Table 5 shows, but that is
only part of what actually happens due to applying the three
methods. For instance, in 480 cases the mistakes made in the
utterances change. Although they remain incorrect, the mistakes
which are made are different, so pronunciation modeling has an
effect here which cannot be seen in the WERs.
A significant improvement of 1.58% in sentence error rates
(SERs) is found (McNemar test for significance [9]) when going
from the baseline condition to the final test. The McNemar test
for significance cannot be performed on WERs because the errors
(insertions, deletions and substitutions) are not independent
of each other. All three methods separately also show
significant improvements in SERs. Table 6 shows the SERs for
each of the three methods.
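
For illustration, the McNemar test on paired sentence-level results could be computed as sketched below: each test utterance is scored correct or incorrect under the baseline and under the new condition, and only the discordant pairs enter the statistic. The chi-square approximation with continuity correction is one common form of the test; the paper only states that the McNemar test [9] was used.

from scipy.stats import chi2

def mcnemar(baseline_correct, new_correct):
    """baseline_correct, new_correct: lists of booleans, one per utterance."""
    b = sum(1 for x, y in zip(baseline_correct, new_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, new_correct) if (not x) and y)
    statistic = (abs(b - c) - 1) ** 2 / (b + c)     # continuity-corrected chi-square
    p_value = chi2.sf(statistic, df=1)
    return statistic, p_value
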
Table 6
Adding variants of five rules, and using updated phone models
(method 1), leads to a significant improvement of 0.67% in
SERs, when it is compared to the baseline. Adding 50 multi-words to the baseline condition (method 2) leads to a
significant improvement of 0.73% in SERs. For method 3, a
comparison is made between the SMall condition (see column 5 in
Table 3) and the condition with a threshold of 100 for the LM.
The improvement is 0.64% in SERs, which is also a significant
improvement.
4. DISCUSSION AND CONCLUSIONS
The results of method 1, modeling within-word variation, show
that adding pronunciation variants generated by applying four
phonological rules reduces the WER. Adding another
pronunciation rule, the rule for /r/-deletion, also improves
recognition performance. A further improvement is found when
using updated phone models. This improvement is larger for five
rules than for four rules. In total, for method 1, the WERs
improve by 0.59%, which corresponds to a significant
improvement of 0.67% in
SERs. Therefore, we can conclude that this method works for
improving the performance of our CSR. It is important to
realize, however, that with each rule that is applied, the
variants which are generated will introduce new mistakes in
addition to correcting others. In the future, we will look for
ways to minimize confusability and to maximize the efficiency
of the variants which are added by finding the optimal set of
phonological rules.
Method 2 shows that adding multi-words leads to an
improvement of 0.49% in WERs and a significant improvement of
0.73% in SERs. This improvement may be due to the fact that by
adding multi-words a type of trigram is created in the LM, only
for the most frequent word sequences in the training corpus.
It is unclear why modeling pronunciation variants of multi-words does not lead to an improvement in WERs. The multi-words
are all frequent word sequences and we expected that modeling
pronunciation variation at that level would have an effect.
Furthermore, the pronunciation phenomena which were modeled,
i.e. cliticization, reduction processes and contractions, are
all phenomena which are thought to occur frequently in Dutch
[8]. An analysis of the changes which occur due to adding
pronunciation variants for multi-words shows that the variants
correct some errors but also introduce new ones. Other methods
might model cross-word variation more effectively. Therefore,
we will examine other ways of modeling cross-word variation and
we will also attempt to minimize the confusability between
variants in the future.
The results of method 3 show an improvement of 0.68% in WERs
and a significant improvement of 0.64% in SERs. The steps
undertaken in method 3 consisted of adding counts of the
pronunciation variants to the LMs and defining a number of
thresholds. In the set of experiments, in which probabilities
for pronunciation variants were included in the LM, they were
included in both the unigram and the bigram. An alternative to
this method is to keep the bigram intact and to add the
information about frequency of pronunciation variants to the
unigram only.
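
This alternative can be illustrated with the following sketch, in which the bigram is kept intact over (base) words and the relative frequency of each pronunciation variant enters only at the lexical level, so that the score of variant v of word w following word u is approximated as P(v | w) * P(w | u). The function and the data layout are purely illustrative, not the implementation used here.

def variant_bigram_score(variant, word, prev_word, variant_prob, bigram_prob):
    """
    variant_prob: {word: {variant: P(variant | word)}}     (from variant counts)
    bigram_prob:  {(prev_word, word): P(word | prev_word)}  (unchanged bigram)
    """
    return variant_prob[word][variant] * bigram_prob[(prev_word, word)]
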
The question is whether or not information about
pronunciation variants should be modeled in the bigram. In some
cases, there may be reasons to assume that certain
pronunciation variants will follow each other in the course
of one utterance. For instance, if the speaking rate is high,
it can be expected that it will be high during the whole
utterance. However, the exact relationships between different
pronunciation variants are currently not well understood, and
methods to decide when those relationships occur are not yet
available. So, it may not be optimal to model pronunciation
variation at the word level
in the bigram. In the future, we will experiment with modeling
the unigrams independently of the bigrams to find out if they
should be modeled separately or together.
In our experiments we found a relative improvement of 8.5%
WER (1.08% WER absolute) when going from our baseline condition
to the condition in which a lexicon containing multi-words and
pronunciation variants was used together with an LM containing
probabilities of pronunciation variants. Our results show that all
three methods lead to significant improvements. We found an
overall, significant improvement of 1.58% in SERs. These
results are very promising, and we will continue to elaborate
on this research in order to gain a fuller understanding of the
processes which play a role and to further improve the
performance of the CSR.
5. ACKNOWLEDGMENTS
This work was funded by the Netherlands Organisation for
Scientific Research (NWO) as part of the NWO Priority Programme
Language and Speech Technology. The research of Dr. H. Strik
has been made possible by a fellowship of the Royal Netherlands
Academy of Arts and Sciences.
6. REFERENCES
[1] H. Strik, A. Russel, H. Van den Heuvel, C. Cucchiarini &
L. Boves (1997) A spoken dialogue system for the Dutch
public transport information service. Int. Journal of
Speech Technology, Vol. 2, No. 2, pp. 119-129.
[2] M. H. Cohen (1989) Phonological Structures for Speech
Recognition. Ph.D. dissertation, University of California,
Berkeley.
[3] L. F. Lamel & G. Adda (1996) On designing pronunciation
lexica for large vocabulary, continuous speech
recognition. Proc. of ICSLP '96, Philadelphia, pp. 6-9.
[4] J. M. Kessens & M. Wester (1997) Improving Recognition
Performance by Modeling Pronunciation Variation. Proc. of
the CLS Opening Academic Year '97-'98, pp. 1-19.
[5] J. Kerkhoff & T. Rietveld (1994) Prosody in Niros with
Fonpars and Alfeios, Proc. Dept. of Language & Speech,
University of Nijmegen, Vol.18 pp. 107-119.
[6] Onomastica
http://www2.echo.lu/langeng/en/lre1/onomas.html
[7] C. Cucchiarini & H. van den Heuvel (1995) /r/ deletion in
Standard Dutch, Proc. of the Dept. of Language & Speech,
University of Nijmegen, Vol. 19, pp. 59-65.
[8] G. Booij (1995) The Phonology of Dutch. Oxford: Clarendon
Press.
[9] S. Siegel & N.J. Castellan (1956) Nonparametric Statistics
for the Behavioral Sciences, McGraw Hill, pp.63-67.