

M. Wester, J.M. Kessens, J. Sturm, E. Sanders, P. Tielen, H. Strik (2003)
Speech Recogniser.
Deliverable D-5.2-T28 of MUMIS (Multimedia Indexing and Searching Environment), Project ref. no. IST-1999-10651, 14-02-2003, 53 pages.
Security (distribution level): Project internal

Abstract

This report describes the research on automatic speech recognition carried out within the MUMIS project. The main objective of the automatic speech recognition experiments is to obtain the best possible transcription of the spoken commentaries accompanying football matches.

Manual transcriptions were made of recordings of EURO-2000 matches in three languages: German, Dutch and English. In total, 30 matches were manually transcribed: 6 for Dutch, 3 for English and 21 for German.

Experiments were carried out for all three languages to measure the performance of automatic speech recognition. Recognition performance is measured in terms of the word error rate (WER = (insertions + deletions + substitutions) / number of words). Initially, because only a few matches were available for training and testing, oracle experiments were carried out for all three languages. The language models and lexicons used in the oracle experiments were trained on 3/4 of the test match and tested on the remaining 1/4. In general, the WERs found were rather high (between 31% and 90%). This is mainly due to the extremely high level of noise (i.e. different types of speech and non-speech sounds). The Dutch and German experiments showed that matching the training and test data in terms of background noise leads to the best results. From the results of the German experiments, we can also conclude that using more data to train the phone models is not necessarily better: using less training data that better matches the test data in terms of mean SNR leads to lower WERs.
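
To make the WER measure concrete: the error counts follow from a word-level Levenshtein (edit-distance) alignment between the recognized word sequence and the reference transcription. The following Python sketch is illustrative only (the example sentence is invented, not taken from the report's data):

    def wer(reference, hypothesis):
        # Word error rate: (substitutions + insertions + deletions) divided by
        # the number of words in the reference, computed with a standard
        # word-level Levenshtein alignment.
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                      # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution in a four-word reference: WER = 0.25
    print(wer("keeper saves the shot", "keeper saved the shot"))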

As more data became available, non-oracle experiments could be carried out. For these experiments we used a category LM tuned to (but not trained on) the test match (i.e. the player names of the test match were added to the LM and lexicon). The results on the full demonstration matches (Yugoslavia - the Netherlands (YugNed) and England - Germany (EngDld)) show that the category LM performs better than a normal LM. WERs for the baseline non-oracle experiments are much higher than for the oracle experiments (between 83% and 93%), which can be explained by the large number of out-of-vocabulary (OOV) words and a mismatch between training and test data in terms of SNR. Using more data to train the LM and the lexicon results in lower WERs, but the reduction in WER levels off after about 150,000 words have been used to train the LM. Again, WERs proved to be lowest when matching data were used for training and testing, although training separate phone models for different SNR categories did not help. The best results were obtained using noise-robust features for training and recognition (relative improvements of 10%).
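
The abstract does not detail the implementation of the category LM; one common realization of the underlying idea is a class-based n-gram, in which word probabilities are factored through a category such as PLAYER, so that the player names of a new match can be added to the lexicon without re-estimating the n-gram itself. A minimal Python sketch of this idea, with all names and probabilities invented for illustration:

    # Class-based ("category") LM sketch: P(word) is factored as
    # P(category) * P(word | category). Toy unigram case only.
    class_prob = {"PLAYER": 0.05, "the": 0.08}
    word_given_class = {"PLAYER": {"Kluivert": 0.5, "Shearer": 0.5}}

    def add_player(name, weight=1.0):
        # Add a player name of the test match to the PLAYER category and
        # renormalize; the model over categories needs no retraining.
        players = word_given_class["PLAYER"]
        players[name] = weight
        total = sum(players.values())
        for w in players:
            players[w] /= total

    def p_word(word):
        # Probability under the class-based model; words outside the
        # PLAYER category are scored directly.
        if word in word_given_class["PLAYER"]:
            return class_prob["PLAYER"] * word_given_class["PLAYER"][word]
        return class_prob.get(word, 0.0)

    add_player("Davids")
    print(p_word("Davids"))    # nonzero without re-estimating the LM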

Since for a database retrieval task not all words are equally relevant, we categorized all words into three groups (function words, application-specific words such as player names, and other content words) and calculated the recognition performance for each group. We found that function words (which are least important for an information retrieval task) were recognized poorly (between 82% and 90% WER), whereas the important category of application-specific words was recognized relatively well (between 40% and 75% WER). In the non-oracle experiments, too, the application-specific words were recognized best.
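
A per-category WER can be obtained by attributing each reference word, and any error made on it, to its group. The sketch below assumes a word-to-category mapping and a word-level alignment like the one above; the category assignments and the handling of insertions are illustrative choices, not the report's scoring protocol:

    from collections import Counter

    # Invented word-to-category map; the report's three groups are function
    # words, application-specific words (e.g. player names) and other
    # content words. Unlisted words default to "content".
    CATEGORIES = {"the": "function", "a": "function", "Kluivert": "application"}

    def per_category_wer(aligned_pairs):
        # aligned_pairs: (ref_word, hyp_word) from a word-level alignment,
        # with hyp_word None for a deletion. Insertions carry no reference
        # word and are left out here (an illustrative simplification).
        errors, totals = Counter(), Counter()
        for ref_word, hyp_word in aligned_pairs:
            cat = CATEGORIES.get(ref_word, "content")
            totals[cat] += 1
            if hyp_word != ref_word:          # substitution or deletion
                errors[cat] += 1
        return {cat: errors[cat] / totals[cat] for cat in totals}

    # "the" deleted, "Kluivert" and "scores" correct:
    # function WER 1.0, application 0.0, content 0.0
    print(per_category_wer([("the", None), ("Kluivert", "Kluivert"),
                            ("scores", "scores")]))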

Our recommendations for the future are that the commentator's speech should be recorded separately, without the stadium noise mixed in, and that more transcriptions of spoken material be produced in order to train robust generic language models.