MODELLING PRONUNCIATION VARIATION: SOME PRELIMINARY RESULTS

Mirjam Wester, Judith Kessens, Catia Cucchiarini, Helmer Strik
A²RT, Dept. of Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
In: H. Strik, N. Oostdijk, C. Cucchiarini and P.A. Coppen (eds.), Proceedings of the Department of Language & Speech, Vol. 20, pp. 127-137, Nijmegen, the Netherlands, 1997

Abstract

In this paper we describe a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the results obtained with this method are in line with those reported by other authors, the magnitude of the improvements is very small. In looking for possible explanations for these results, we computed various sorts of statistics about the material. Since these data proved to be very useful in understanding the effects of our method, they are discussed in this paper. Moreover, on the basis of these statistics we discuss how the system can be improved in the future.

1. Introduction

At the Department of Language and Speech of the University of Nijmegen we are working on a Spoken Dialogue System (SDS) that will be employed to automate part of a public transport information service. This system was adapted from a German prototype developed by Philips Research Labs, and was further improved by means of a bootstrapping method (Strik et al., 1996, 1997). An important component of this SDS is a continuous speech recognizer (CSR). This part of the SDS was also gradually improved through the bootstrapping method, by adding more data. However, once a point was reached at which no further increase in performance could be obtained by adding data, new methods of improving the system were sought. Given that the SDS is a mixed-initiative system and that the kind of speech the callers may use is extremely varied, we decided to try to improve the system's performance by modelling pronunciation variation. The method used for modelling pronunciation variation is discussed in detail in section 2. Subsequently, in section 3 the results obtained with this method are presented together with various sorts of statistics about the material. In section 4 we discuss how the statistics we computed helped us to understand why the variations in performance were so small, and how this knowledge can be used to improve the system in the future.

2. Method and material

2.1 Method

The starting point of the current research was a CSR in which a single pronunciation lexicon was used. For each word only the transcription we thought was most probable (the canonical form) was available. In this experiment we wanted to test to what extent the performance of the CSR could be improved by modelling at least part of the pronunciation variation that is encountered in the material. The approach we adopted resembles those used previously with success in Cohen (1989) and Lamel and Adda (1996). In this approach phonological rules are used to generate pronunciation variants, i.e. to expand the lexicon. The expanded lexicon can then be used during training, recognition (test) or both. During test the old test lexicon is simply replaced by the new one, in order to make it possible to recognize pronunciation variants. During training the pronunciation variants can be used to obtain new acoustic models. For training, the whole process can be schematized as follows:

1. Use the old lexicon (single pronunciation) and the training corpus to compute the first version of the phone models.

2. Select phonological rules.

3. Generate a new lexicon with multiple pronunciations on the basis of the selected rules.

4. Do forced recognition to determine which variant is realized in the corpus. The chosen variant is then added to the training corpus. This way a new transcription of the training corpus is obtained.

5. Use the new transcription of the training corpus to calculate new phone models.

Stages 4 and 5 can be repeated a couple of times so as to obtain different versions of the phone models. Stages 2 to 5 can be repeated with different rules. Our ultimate goal is to find the rules that are optimal in the sense that they produce the greatest increase in performance. The goal of the current research was to test whether the method proposed above was suitable for our purposes. In order to do so we have tested the method with only four phonological rules, as will be explained below.
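
The control flow of the five training stages can be sketched in Python as follows. This is only an outline under stated assumptions: the callables `expand_lexicon`, `train_models` and `forced_recognition` are hypothetical placeholders for the corresponding CSR tools and are not part of the system described here.

```python
from typing import Callable, Dict, List, Sequence

Lexicon = Dict[str, List[str]]      # word -> list of phone transcriptions
Transcription = List[List[str]]     # one list of phone strings per utterance


def retrain_with_variants(
    corpus: Sequence[List[str]],                    # utterances as word lists
    canonical: Dict[str, str],                      # single-pronunciation lexicon
    expand_lexicon: Callable[[Dict[str, str]], Lexicon],
    train_models: Callable[[Transcription], object],
    forced_recognition: Callable[[object, List[str], Lexicon], List[str]],
    iterations: int = 2,
):
    # Stage 1: first phone models from the canonical transcription.
    transcription = [[canonical[w] for w in utt] for utt in corpus]
    models = train_models(transcription)
    # Stages 2-3: the selected rules are assumed to be wrapped in expand_lexicon.
    lexicon = expand_lexicon(canonical)
    for _ in range(iterations):
        # Stage 4: choose the realized variant of every word in every utterance.
        transcription = [forced_recognition(models, utt, lexicon) for utt in corpus]
        # Stage 5: retrain the phone models on the new transcription.
        models = train_models(transcription)
    return models, transcription
```

Passing the tools in as arguments keeps the sketch independent of any particular recognizer; repeating stages 2 to 5 with different rules amounts to calling the function again with a different `expand_lexicon`.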

2.2 Phonological rules

Much of the phonological variation in Dutch has been described in the literature by means of phonological rules (see for instance Booij, 1995). However, there are also phenomena which have been described to a lesser extent or even not at all. It is therefore almost impossible to decide in advance which rules will be relevant to our CSR. Moreover, in order to make such a decision one needs to know what type of speech (speaking style) is being dealt with. However, in our research it is difficult to determine what the speaking style is. Although there is a considerable amount of information on spontaneous speech and human interaction, relatively little is known about man-machine interaction. On the one hand, people who call the system for information about the train schedule use spontaneous speech, in the sense that they do not read aloud some previously prepared text. From this point of view one might expect that their speech will exhibit all sorts of phenomena (e.g. disfluencies, hesitations and pronunciation variation) that are known to occur in spontaneous speech. On the other hand, these people realize that they are talking to a machine that may have problems in understanding them. This may be a reason for them to monitor their speech to a greater extent and speak more carefully and therefore more formally, than they would normally do (see also Shriberg et al., 1992). Moreover, variation of speaking style can be observed even for one speaker within one recording session. For example, a speaker may begin by talking informally, but may change to a more formal speech mode if (s)he realizes that the speech recognizer has difficulties in understanding him/her. The fact that such extreme forms of variation are present, makes it difficult to adopt speaking style dependent phonological rules, simply because one does not know what the speaking style is.

In order to select the initial set of phonological rules a number of criteria were followed. As is well known, variation occurs both within words and at word boundaries. Given the use of a lexicon in our CSR, it was obvious to begin with word internal variation. Therefore, the first criterion was to choose rules of word phonology. Second, we decided to start with rules concerning those phenomena that are known to be most detrimental to automatic speech recognition. Of the three possible recognition errors, i.e. insertions, deletions and substitutions, the first two have the greatest consequences for speech recognition, because they affect the number of segments present in different realizations of the same word. Therefore, starting with rules concerning insertions and deletions was the second criterion we adopted. A third criterion was to choose rules that are frequently applied. Actually, frequently applied is amenable to two interpretations. A rule can be frequent either because it is frequently applied whenever the context for its application is met or because the context in which it can be applied is very frequent (even though the rule is applied in only 50% of the cases). Obviously, it is this latter case of 'frequent occurrence' that is most interesting for automatic speech recognition, since in this case it is difficult to predict which variant should be taken as canonical form, while in the former case the most frequent form would probably suffice as sole transcription. A fourth criterion (related to the previous one) we followed was that the rules should regard phones that are relatively frequent in the language, since rules that concern infrequent phones probably have fewer consequences for the recognizer's performance. Finally, we decided to start with rules that have been extensively described in the literature, so as to avoid possible effects of overgeneration and undergeneration due to incorrect specifications of the rules. 
On the basis of the above-mentioned criteria, the phonological rules selected were /ə/-deletion, /ə/-epenthesis, /t/-deletion and /n/-deletion (Booij, 1995). A short description and an example of each rule follow here, after Booij (1995:127-130, 139-141, 152-154).

1. /ə/-deletion: When two consecutive syllables are headed by a schwa, the first schwa may be deleted, provided that the remaining onset consonant cluster consists of an obstruent followed by a liquid. obs + ə + liq + ə → obs + liq + ə. Example: /Andərə/ → /Andrə/

2. /ə/-epenthesis: In nonhomorganic consonant clusters in coda position a schwa may be inserted. Example: /mElk/ → /mElək/

3. /t/-deletion: This rule is typically one of the processes that occur in fast speech, but to a lesser extent also in careful speech. There are three different conditions in which /t/-deletion occurs. First, if a /t/ in a coda is preceded by an obstruent and followed by another consonant, the /t/ may delete. obs + t + cons → obs + cons. Example: /snElstmoxələk/ → /snElsmoxələk/. Second, if the preceding consonant is a sonorant, /t/-deletion is possible, but then the following consonant must be an obstruent. When the obstruent following the sonorant + /t/-cluster is a /k/, deletion does not apply. When /t/ is preceded by a sonorant and also followed by a sonorant, deletion is impossible. son + t + obs → son + obs. Example: /Eintp nt/ → /Eimp nt/. Lastly, because /t/-deletion in word-final position also occurs in some Dutch dialects, we decided to apply /t/-deletion in word-final position following an obstruent (unless the obstruent is an /s/). word final: obs + t → obs. Example: /dElft/ → /dElf/

4. /n/-deletion: In standard Dutch, syllable-final /n/s can be dropped after a schwa, except in the indefinite article 'een' /ən/. For many speakers, in particular in the western part of the Netherlands, the deletion of /n/ is obligatory. An /n/ is deleted if it is the final /n/ of a syllable after a schwa and if that syllable is not a verbal stem. syllable final: ə + n → ə. Example: /rEizən/ → /rEizə/. There is, however, no deletion of the final /n/ in 'ik teken' (I draw) /tekən/, because 'teken' is a verbal stem. Booij (1995) adds that the /n/ must be at the end of a morpheme. However, we did not apply this part of the rule, so /n/-deletion is also applied in our lexicon to words like 'volgende' (in which the /n/ is not near a morphological boundary).

Generating pronunciation variants by hand is time-consuming and error-prone. We therefore created a multiple pronunciation lexicon automatically, by applying the above rules with a script in which the rules and their conditions were specified. All four rules were applied wherever possible and in no specific order. However, generating pronunciation variants automatically is not foolproof either, although the types of problems encountered differ from those that arise when variants are generated manually. For example, the conditions under which a phonological rule should be applied are often based on morphological information, such as morphological boundaries, which is, at present, missing from our phone transcriptions. So either the phone transcriptions need to be enriched, or other ways to solve these kinds of problems must be found. The variants obtained automatically were compared with a lexicon which had been made by hand, to check whether the correct variants were being produced by the script. Some of the generated variants are highly unlikely to occur, but we chose to overgenerate so as not to exclude possible variants beforehand.
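
A minimal Python sketch of this kind of rule-application script is given below. It is not the script used in this work. The three regular expressions are deliberately simplified, word-final versions of /n/-deletion, /t/-deletion and schwa epenthesis; the use of `@` for schwa and the consonant classes in the patterns are assumptions made for illustration only. Morphological conditions such as the verbal stem exception are ignored, so the sketch overgenerates in the same spirit as described above.

```python
import re
from typing import List

# Assumption: '@' stands for schwa in the phone strings.
# Each rule is (name, pattern, replacement); all are simplified illustrations.
RULES = [
    ("n-deletion", re.compile(r"@n$"), "@"),                     # rEiz@n -> rEiz@
    ("final t-deletion", re.compile(r"([pbdkfvzx])t$"), r"\1"),  # dElft -> dElf
    ("schwa epenthesis", re.compile(r"([lr])([kmp])$"), r"\1@\2"),  # mElk -> mEl@k
]


def variants(canonical: str) -> List[str]:
    """Return the canonical form followed by all forms reachable by the rules.

    Rules are applied wherever they match and in no fixed order: every newly
    generated form is itself fed back to all rules until nothing new appears.
    """
    seen = {canonical}
    queue = [canonical]
    while queue:
        form = queue.pop()
        for _name, pattern, replacement in RULES:
            new_form = pattern.sub(replacement, form)
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return [canonical] + sorted(seen - {canonical})


def expand(lexicon: dict) -> dict:
    """Map every word of a single-pronunciation lexicon to its variant list."""
    return {word: variants(phones) for word, phones in lexicon.items()}
```

Feeding each new form back to all rules is one way to make the result independent of rule order; words to which no rule applies keep only their canonical form.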

2.3 Material

The CSR used in this experiment is part of an SDS (Strik et al., 1996, 1997). The speech material was collected with an online version of the SDS, which was connected to an ISDN line. The training and test material consisted of 24,676 utterances (81,090 words) and 6,276 utterances (21,106 words), respectively. The most important characteristics of the CSR are the following. The input signals consist of 8 kHz 8 bit A-law coded samples. Feature extraction is done every 10 ms for frames with a width of 16 ms. The first step in feature analysis is an FFT analysis to calculate the spectrum. Next, the energy in 14 mel-scaled filter bands between 350 and 3400 Hz is calculated. Apart from these 14 filterbank coefficients, the 14 delta coefficients, log energy, and slope and curvature of the energy are also used. This makes a total of 31 feature coefficients. The CSR uses acoustic models (HMMs), language models (LMs: unigram and bigram), and a lexicon. The continuous density HMMs consist of three segments of two identical states, one of which can be skipped. In the online SDS the output of the CSR, and thus the input to the following natural language processing component, is a wordgraph (Strik et al., 1996, 1997). In the research version it is possible to use the LMs to compute the Best Sentence (BS). Obviously, the error rates for the wordgraph are much lower than those of the BS (Strik et al., 1996, 1997). Nevertheless, we will use the BS in this article, because it is better suited to the goals of the present research: evaluation of the results is easier and more transparent. The single variant training lexicon contains 1,433 entries. These are all the words contained in the training corpus, plus a number of words which could be expected in this specific application even though they do not (yet) occur in the corpus (for example station names).

The four phonological rules selected for investigation affect 536 of the 1,433 (37%) words in the training lexicon. Of these 536 words 487 words are affected by one of the four phonological rules. In 47 cases two rules were applied to the same word and in two cases three rules were applied. There were no words that were affected by all four rules because /ə/-deletion and /ə/-epenthesis did not occur within the same word. On average, 1.3 variants were generated for each of the 536 words. The multiple variant lexicon contains 2,151 entries, 1,433 (67%) of which are canonical. The test lexicon contains 860 entries, which are all the words present in the online version. The number of out of vocabulary (OOV) words in the test corpus is 298. The four phonological rules concern 354 of the 860 entries in the test lexicon (41%). In this case 315 words were subject to one of the four rules. In 37 cases two rules were applied, and here also two words were affected by three of the four rules. On average, 1.3 variants were generated for each of the 354 words. The multiple pronunciation lexicon contains 1,341 entries, 860 (64%) of which are canonical.
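
The frame arithmetic implied by the feature extraction described in this section can be written out as a short Python sketch. The sample rate, frame shift, frame width and the feature counts come from the text; the assumption that no padding is applied at the edges of the signal is ours and is made only to obtain a concrete frame count.

```python
SAMPLE_RATE_HZ = 8000    # 8 kHz A-law input
FRAME_SHIFT_MS = 10      # a feature vector every 10 ms
FRAME_WIDTH_MS = 16      # each frame spans 16 ms

# 14 filterbank energies + 14 deltas + log energy + slope and curvature of energy
N_FEATURES = 14 + 14 + 1 + 2


def frame_layout(n_samples: int):
    """Return (shift in samples, width in samples, number of full frames).

    Assumption: frames that would run past the end of the signal are dropped.
    """
    shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000
    width = SAMPLE_RATE_HZ * FRAME_WIDTH_MS // 1000
    if n_samples < width:
        return shift, width, 0
    return shift, width, 1 + (n_samples - width) // shift
```

Under these assumptions, consecutive frames of 128 samples overlap by 48 samples, and each frame yields one 31-dimensional feature vector.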

2.4 Forced recognition

Forced recognition was imposed through the language models (LMs). For each sentence, unigram and bigram LMs were derived on the basis of 100,000 repetitions of the same sentence. After the first forced recognition round, 484 utterances of the training corpus were not correctly recognized. 47 of these utterances turned out to contain obvious transliteration errors, which were corrected afterwards. Since the remaining 437 sentences appeared to be problematic for a number of reasons (they contained background noise, disfluencies, unexpectedly long pauses within words and in some cases the loudness level was insufficient) they were removed from the original training corpus and only 24,667 utterances were used for further experiments. It turns out that forced recognition is a useful tool to identify all sorts of errors and utterances which, for some reason, are problematic for the CSR. These utterances will certainly be examined more closely in the near future.

Instead of forced recognition with LMs, as described above, we could have used a standard Viterbi algorithm. Although the main advantage of the Viterbi algorithm is that a forced alignment can be obtained for all utterances, the main disadvantages of this algorithm are (1) that the alignment is not always meaningful, e.g. because the transliteration contains errors, and (2) that it is not possible to find the errors and the problematic utterances.

The resulting training corpus with 24,667 utterances was again used for training and forced recognition. In the 24,566 cases in which forced recognition was successful, the pronunciation variants chosen by forced recognition were substituted for the original (canonical) transcriptions. In the 101 cases in which forced recognition was not successful, the canonical form was chosen. The new transcriptions were subsequently used to train new phone models.
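
The effect of deriving a language model from many repetitions of one sentence can be illustrated with a small Python sketch. The estimation scheme shown here (additive smoothing over a fixed vocabulary) is an assumption made purely for illustration; the text above does not specify how the LMs were estimated. Under this assumption, a large repetition count pushes the probability of each attested word transition towards 1, so that the recognizer is effectively forced to follow the given word sequence while remaining free to choose among pronunciation variants.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bigram_probs(
    sentence: List[str],
    repetitions: int,
    vocab_size: int,
    k: float = 1.0,
) -> Dict[Tuple[str, str], float]:
    """Smoothed bigram probabilities for a sentence repeated `repetitions` times.

    Assumption: add-k smoothing over `vocab_size` word types; sentence
    boundaries are marked with the hypothetical tokens <s> and </s>.
    """
    words = ["<s>"] + sentence + ["</s>"]
    pair_counts = Counter(zip(words, words[1:]))
    history_counts = Counter(words[:-1])
    probs = {}
    for (w1, w2), count in pair_counts.items():
        numerator = count * repetitions + k
        denominator = history_counts[w1] * repetitions + k * vocab_size
        probs[(w1, w2)] = numerator / denominator
    return probs
```

With a single repetition the smoothing mass dominates and the attested transitions receive very small probabilities; with 100,000 repetitions they come close to 1.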

3. Results

Above it has been explained how single (S) and multiple (M) pronunciations during training lead to two different sets of phone models. In addition, either single (S) or multiple (M) pronunciations can be used in the test lexicon. This makes a total of four combinations, for each of which we present the sentence and word error rates (SER and WER, respectively) of the best sentences (BS) in Table 1.

Table 1. SER and WER for the BS of four different CSRs.

CSR       SS      SM      MS      MM
train     S       S       M       M
test      S       M       S       M
SER(%)    32.63   32.39   33.03   32.41
WER(%)    23.63   23.50   23.81   23.50

As appears from Table 1, there are only slight variations in recognition performance between the various conditions. Nevertheless, it is interesting to analyse these data in more detail, in order to see whether the various tendencies are in line with those reported in the literature. For instance, the worst performance level appears to be obtained when multiple pronunciations are used for training but not for testing (i.e., when the new phone models are combined with the old lexicon). This is exactly what appeared in Lamel and Adda (1996). Furthermore, Lamel and Adda (1996) found that using multiple pronunciations for testing gave better results than using single pronunciation lexicons. This is confirmed by our data (compare column 2 with column 3). However, these authors also found that recognition performance improved even further when multiple pronunciations were used both for training and for testing, which is not confirmed by our data: there is practically no difference in performance between column 3 and column 5. Therefore, on the basis of these results we can conclude that the applied method improves the performance, albeit to a small extent. Moreover, the observed improvements are in line with those reported elsewhere (Lamel and Adda, 1996). However, since the magnitude of the changes is considerably smaller than that reported by other authors, it is interesting to consider why this is the case. A possible explanation for these results would be that during forced recognition the CSR selects the wrong variant. In order to test whether this was the case, we checked for a small number of words whether the correct pronunciation variant was chosen by looking at and listening to the signals. Since it turned out that in 90% of the 711 words the correct version was chosen, there is no reason to believe that the small increase in performance was mainly due to errors in forced recognition. 
Another reason could be that the number of pronunciation variants that can be selected is relatively small. Against this background it is interesting to know how often one of the alternative variants could be chosen and how often it was indeed chosen. In Table 2 the total number of words in the training corpus, and the total number of recognized words in the test corpus for two different conditions (recognition with respectively original and new phone models) is given in the second row. The number of cases in which only a single variant could be chosen is listed in the third row, in the fourth row the number of cases in which an alternative variant could be chosen is given and in the last row the number of instances in which an alternative variant was chosen is shown.

Table 2. Number of pronunciation variants possible and chosen

Corpora       train corpus   test corpus (old phones)   test corpus (new phones)
Total         81,090         19,962                     20,011
Single        66,590         15,556                     15,640
Multiple      14,500         4,406                      4,371
Alternative   6,363          2,028                      2,128

Percentages have been calculated for the data in Table 2 to give a clearer picture of how the different rows relate to each other. Alternative pronunciations were available for 17.9% of all words in the training corpus. In 43.9% of those cases an alternative variant was actually chosen, which means that for 7.8% of the total number of words in the corpus an alternative variant was chosen. For the test corpus the percentages are similar to those of the training corpus. Alternative variants could be chosen in 22.1% (original) and 21.8% (new) of the total number of words in the test corpus; in 46.0% and 48.7% of those cases an alternative variant was chosen. This amounts to 10.2% (original) and 10.6% (new) of the total number of words. From these data we can infer that, on average, one of the alternative variants is chosen in about 45% of the possible cases, and in 8-10% of the total number of words. However, most variants will only differ in one phone from the canonical form. A comparison of the two transcriptions of the training corpus (i.e. the canonical forms versus the transcriptions obtained with forced recognition) reveals that they differ in 6,594 of the total 318,774 phones (2.1%). This seems to be one of the reasons why the effects on recognition performance are far from dramatic.

Adding variants to the test lexicon increases confusability, which could also be one of the reasons why there was not a great deal of improvement in the recognizer's performance. In the tests in which the multiple pronunciation lexicon was used, 48% of all entries in the test lexicon (1,341 entries) never occurred in the test corpus. 19% of all entries in the lexicon were alternative variants which were never chosen. In 5% of the cases the canonical form of a word was never chosen but, instead, an alternative variant was chosen, and 24% of the entries in the lexicon were words which never occurred in the test corpus: neither the canonical form nor an alternative variant of those words was ever chosen. This is partly due to overcoverage of the rules but also to the fact that a lot of canonical forms in the test lexicon have been added for application specific purposes. There are, for example, quite a number of station names and time indicators which do not occur in the test corpus but which must be contained in the test lexicon because they are considered to be of utmost importance for the application. In other words, they may not have occurred yet but they could very well occur in the future, and as the CSR is part of a system for a public transport information service, it must be able to recognize all station names and time indicators, as they are crucial for the success of an enquiry.
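
The percentages quoted above follow directly from the counts in Table 2 and can be reproduced with a few lines of Python. The dictionary keys and the function name are ours; the numbers are those of Table 2 and of the phone comparison in the text.

```python
# (total words, words with more than one variant, alternative variant chosen)
TABLE2 = {
    "train": (81090, 14500, 6363),
    "test_old_phones": (19962, 4406, 2028),
    "test_new_phones": (20011, 4371, 2128),
}


def shares(total: int, multiple: int, alternative: int):
    """Return three percentages, each rounded to one decimal:
    words with a choice, alternative chosen given a choice,
    alternative chosen relative to all words."""
    return (
        round(100 * multiple / total, 1),
        round(100 * alternative / multiple, 1),
        round(100 * alternative / total, 1),
    )


# Share of phones that differ between the two training transcriptions.
PHONE_DIFF_PCT = round(100 * 6594 / 318774, 1)

for corpus, counts in TABLE2.items():
    print(corpus, shares(*counts))
```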

In order to gain more insight into these data, we compared the four versions of the CSR. First we determined for each version of the CSR which BS contained an error. Subsequently, for four of the six logical combinations of the CSR (those in which only one factor changes, while the other is kept constant, i.e. SS-SM, MS-MM, SS-MS and SM-MM) the BS containing errors were compared. The results of these comparisons are shown in Table 3.

Table 3. Comparisons of the performance of the four versions of the CSR.

CSR 1             SS     MS     SS     SM
CSR 2             SM     MM     MS     MM
same errors       1630   1592   1089   1066
other errors      364    400    836    844
improvements      54     81     123    123
deteriorations    39     42     148    124
net result        +15    +39    -25    -1

From Table 3 it appears that a considerable number of utterances contain a recognition error in both CSRs, either the same (row 3) or a different one (row 4). Furthermore, there are cases in which a better solution is chosen (improvements, row 5). However, since in an almost equal number of cases a worse solution is chosen (deteriorations, row 6), the two effects cancel each other out and the net result (row 7) is small. This neutralization effect explains why no considerable changes in the error rates were observed in Table 1. It is well known that including alternative pronunciation variants leads to some sort of trading relation between improving performance (by covering part of the variability in speech) and deteriorating it (by increasing the confusability between the entries in the lexicon). Based on the fact that only 2.1% of the phones differ between the two transcriptions of the training corpus and on the results shown in Table 1, it could be concluded that the use of multiple pronunciations during training has few consequences for the recognition process, for instance because the acoustic models hardly change. However, comparison of columns 4 and 5 with columns 2 and 3 in Table 3 reveals that varying the phone models produces more changes than varying the test lexicon. A comment on this may be in order. Using multiple variants for testing simply means that the CSR can choose from among a greater number of possibilities for each word. Put differently, the variations in the system occur at the word level and concern only a limited number of words. When multiple variants are used for training, on the other hand, they produce different acoustic models. In other words, in this case the variations occur at the phone level. Since all words in the corpus are made up of phones, the effects of variation modelling during training are likely to be more pervasive. 
Further inspection of Table 3 also reveals that, in spite of the greater number of changes in columns 4 and 5, the net result is negative, while in columns 2 and 3 it is positive. In other words, the fewer changes in columns 2 and 3 successfully conspire to achieve better recognition results, while the net result of the larger number of changes in columns 4 and 5 is a deterioration. A final remark concerns the number of utterances in which there is room for improvement. It appears that 4,038 of the 6,276 utterances are recognized correctly in all four systems. Since 1,066 utterances contain OOV words they can never be recognized correctly. Therefore there is only room for improvement in the remaining 1,172 utterances. With this in mind no dramatic changes in recognition performance can possibly be expected.
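
The figures in Table 3 can be cross-checked against Table 1 with a short Python sketch. The reading used here is our interpretation rather than something stated in the text: an utterance counted under "improvements" is taken to be wrong in CSR 1 and correct in CSR 2, and one counted under "deteriorations" the reverse, so that the number of erroneous utterances per CSR is the sum of the relevant rows. Under that reading the sentence error rates of Table 1 are recovered exactly.

```python
N_TEST_UTTERANCES = 6276

# (CSR 1, CSR 2): same errors, other errors, improvements, deteriorations
TABLE3 = {
    ("SS", "SM"): (1630, 364, 54, 39),
    ("MS", "MM"): (1592, 400, 81, 42),
    ("SS", "MS"): (1089, 836, 123, 148),
    ("SM", "MM"): (1066, 844, 123, 124),
}


def summarize(same: int, other: int, improved: int, worsened: int):
    """Return (SER of CSR 1 in %, SER of CSR 2 in %, net result in utterances)."""
    errors_csr1 = same + other + improved   # wrong in CSR 1
    errors_csr2 = same + other + worsened   # wrong in CSR 2
    ser1 = round(100 * errors_csr1 / N_TEST_UTTERANCES, 2)
    ser2 = round(100 * errors_csr2 / N_TEST_UTTERANCES, 2)
    return ser1, ser2, improved - worsened


# Utterances in which any change of method could still make a difference:
# all utterances minus those always correct minus those containing OOV words.
ROOM_FOR_IMPROVEMENT = N_TEST_UTTERANCES - 4038 - 1066
```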

4. Discussion and conclusions

In the previous section we examined the results of an experiment aimed at determining the contribution of pronunciation variation modelling to improving the performance of our CSR. One of the things we have learned from this experiment is that forced recognition as it was implemented in this method is a useful instrument to identify possible errors in the transliterations and in the lexicons and to spot the utterances that, for some reason, present insurmountable problems to automatic speech recognition. Studying these sentences in further detail is certainly worthwhile. Furthermore, in 90% of the cases this forced recognition procedure selects the correct pronunciation variant. As far as the main goal of this experiment is concerned, i.e. establishing whether the applied method is suitable for improving the performance of our CSR, we can conclude that there are no reasons to assume that this is not the case. As a matter of fact the observed changes are in line with those reported by other researchers. The only problem seems to be that in our research the variations are very small. In this respect it may be instructive to consider the following facts. First, the statistics concerning the material may have played an important part in limiting the effect of pronunciation modelling on recognition performance. It should be borne in mind that an alternative variant was chosen in only 8-10% of the cases. Moreover, in most of the cases the alternative transcriptions differed in only one phone from the canonical form. In connection with this, no more than 2.1% of the phones were changed as a result of variation modelling. Furthermore, in only 1,172 sentences was there room for improvement. Finally, another factor that should not be overlooked concerns the phones involved in the rules under study. 
Since the four rules concern phones that are very frequent in Dutch and in the material under study (in the training corpus /n/, /t/ and /ə/ are the three most common phones), there are so many occurrences of these phones that the impact of variation modelling is likely to be limited. If we consider all these aspects, it is not surprising that recognition performance hardly improved. Moreover, it is important to point out that our research is at an early stage and that a number of things that we intend to do have not been done yet. First, in this experiment we have confined ourselves to within word variation, whereas modelling variation above the word level may be even more important (Cremelie and Martens, 1995). Second, since only four rules were investigated, only a small part of the variation in the material could be covered. However, it is our intention to expand the set of phonological rules so as to maximize coverage. Another factor that might be responsible for the limited impact of pronunciation modelling on recognition performance and that we have not controlled yet is overcoverage, that is, the fact that the rules selected generate a great number of variants (19% of the total lexicon) that are not present in the corpus. This was to be expected because no pruning of variants whatsoever was carried out. The reason for this is that in this phase of our research we did not want to exclude variants that might turn out to be useful at a later stage. Since we opted for overcoverage, this should be considered when analysing the results. In the future we intend to examine pronunciation variants more critically before including them in the lexicon. More attention will be paid to the variants that are indeed present in the corpus. In addition, the frequency with which they occur will also be investigated, so that a probability can be attached to each variant. 
In the light of these considerations it is therefore legitimate to conclude that the results of this experiment are promising, in spite of the limited increase in recognition performance.

Acknowledgements

This work was funded by the Netherlands Organisation for Scientific Research (NWO) as part of the NWO Priority Programme Language and Speech Technology. The research of Dr. H. Strik has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences.

References

Booij, G.E. (1995), The phonology of Dutch. Oxford: Clarendon Press.

Cohen, M.H. (1989), Phonological structures for speech recognition. PhD dissertation, University of California, Berkeley.

Cremelie, N. and J.P. Martens (1995), On the use of pronunciation rules for improved word recognition, Proceedings EUROSPEECH'95, Madrid, 1747-1750.

Lamel, L.F. and G. Adda (1996), On designing pronunciation lexicons for large vocabulary, continuous speech recognition, Proceedings ICSLP'96, Philadelphia, 6-9.

Shriberg, E., E. Wade and P. Price (1992), Human-machine problem solving using spoken language systems (SLS): factors affecting performance and user satisfaction, Proceedings Speech and Natural Language Workshop, Harriman, New York, 49-54.

Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1996), Localizing an automatic inquiry system for public transport information, Proceedings International Conference on Spoken Language Processing (ICSLP) 96, Philadelphia, 853-856.

Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1997), A spoken dialogue system for the Dutch public transport information service, to appear in International Journal of Speech Technology.

Last updated on 22-05-2004