Helmer Strik, Albert Russel, Henk van den Heuvel, Catia
Cucchiarini and Lou Boves
In: H. Strik, N. Oostdijk, C. Cucchiarini, & P.A. Coppen (eds.) Proceedings of the Department of Language
and Speech Vol. 19, pp. 129-142, Nijmegen, The Netherlands, 1996
Abstract
In 1995 our department was involved in two projects in the field of continuous
speech recognition. The main aim of these two strongly related projects was the
development of basic technology that can be used to build advanced telephone-based systems
for providing information about public transport. A short description of the work carried
out within these projects is provided in the present article.
1. Introduction
During the last decade the performance of spoken dialogue systems has improved substantially. At the moment, these systems seem to be good enough to support a number of simple practical tasks in small and clearly delimited domains.
As a result, much effort is nowadays devoted to developing prototype telephone-based information systems in different countries. These systems are reminiscent of the well-known Air Travel Information System (ATIS) task that has been a focal point in the American ARPA project.
In Europe two MLAP (Multi-Lingual Action Plan) projects concerning public railway information
have been carried out, viz. RAILTEL and MAIS. These projects differ from the ATIS
task in that they aim to construct truly interactive systems, accessible over the telephone.
There are many reasons why information about public transport is a suitable
domain for testing spoken dialogue systems, of which only some are mentioned here.
First of all, the domain can be limited in ways that are obvious to a caller, which is a
necessary requirement to reach a sufficient performance level. In the Dutch system that we are
developing, the domain is limited by restricting the information to travel between train
stations. Furthermore, there is a large demand for information about public transport. For instance,
in the Netherlands there is one nationwide telephone number for information about
public transport. This number receives about 12 million calls a year, of which only about
9 million are actually answered. At the moment, all calls are handled by human operators. A substantial
cost saving would be achieved if part of these calls could be handled automatically.
Moreover, automatic handling would probably reduce the number of unsuccessful calls.
In the Netherlands 'public transport information' is virtually identical to 'multi-modal address-to-address information'. The human operators who
provide the service must access a large database that contains the schedule information
of all public transport companies in the country. Especially the fine-meshed local transport
networks pose substantial problems, e.g. when specific bus stops must be identified. The
complex dialogues that may be required to disambiguate destinations on the address level are
far beyond what can be achieved with existing speech recognition, natural language processing,
and dialogue management technology. Therefore, we have limited the domain of our experimental
public transport information system to information about journeys between train
stations. However, we intend to enlarge that domain gradually, e.g. by adding metro stations
in Amsterdam and Rotterdam, and by adding tram stops or the major inter-regional buses (Interliners).
2. General description of the system
The starting point of our research was a prototype developed by Philips
Research Labs (Aachen, Germany). This automatic inquiry system can give information about
the schedules of the German railways. Here we will only give a short description of the
system. Further details can be found in Oerder and Ney (1993), Steinbiss et al. (1993),
Aust et al. (1994), Ney and Aubert (1994), Oerder and Aust (1994), Aust et al. (1995), Steinbiss
et al. (1995). Conceptually, the Spoken Dialogue System (SDS) consists of four parts (in
addition to the telephone interface):
1. the Continuous Speech Recognition (CSR) module,
2. the Natural Language Processing (NLP) module,
3. the Dialogue Management (DM) module, and
4. the Text-To-Speech (TTS) module.
In the CSR module acoustic models (HMMs), language models (N-grams), and a lexicon are used for recognition. In the current version monophones are modelled by continuous density HMMs. However, it is also possible to use diphones or triphones as basic
units. In the future we will investigate whether the performance of the system can be improved
significantly by using context-sensitive models.
The lexicon contains orthographic and phonemic transcriptions of each word. Currently, there is exactly one phonemic transcription for each word. Using
only one pronunciation variant is not optimal, since words are often pronounced in
different ways. Therefore, we are now investigating how pronunciation variation can best
be handled within the framework of this SDS (see Section 5).
The output of the CSR module, and thus the input to the NLP module, is a
word graph. The NLP's task is to decide which path through the word graph has
to be chosen. The NLP does not choose this path by looking at the acoustic likelihood of the
path alone. It also uses application-specific knowledge in the form of a concept bigram and
syntactic unit counts. The goal of the NLP module is not to find a parse for the complete
utterance, but to look for sequences of concepts in the word graph. The concepts it looks
for are defined in a stochastic attributed context-free grammar (ACFG) that describes the utterances
which must be understood. For instance, the entry "<departure_station> ::= (121) from
<station_name>" is a part of the ACFG and is one of the many entries that define the concept <departure_station>. It denotes that if a path through the word graph exists
with e.g. the utterance "from Amsterdam" in it, it should be interpreted as a statement
indicating that the departure station is Amsterdam. Of course the same holds for the names of
other cities present in the lexicon. A similar definition exists for the related concept
<arrival_station>: "<arrival_station> ::= (248) to <station_name>". The numbers (121) and (248)
are the syntactic unit counts for these concept definitions. They state that these
syntactic units occurred 121 and 248 times, respectively, in the corpus on which the NLP
was trained. The concept bigram in the NLP describes the frequency of occurrence of ordered
pairs of concepts, just as a standard language model bigram does for ordered pairs
of words. The combination of concept bigram values, syntactic unit counts, and the acoustic
likelihood of the phonemes decides which path through the word graph is most likely in
this application.
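As an illustration, the following sketch shows how such a score combination could look. The unit counts 121 and 248 are taken from the example above, but the bigram probabilities, the names, and the additive log-domain weighting are invented for this example and do not reproduce the actual Philips implementation.

    import math

    # Invented illustration of combining the three knowledge sources the NLP uses
    # to rank one parse of the word graph: the acoustic likelihood of the path,
    # the relative frequency of each syntactic unit (concept definition), and a
    # bigram over the recognized concepts.

    CONCEPT_COUNTS = {                 # syntactic unit counts from the training corpus
        "<departure_station>": 121,
        "<arrival_station>": 248,
    }
    TOTAL_UNIT_COUNT = sum(CONCEPT_COUNTS.values())

    CONCEPT_BIGRAM = {                 # P(next concept | previous concept), invented values
        ("<s>", "<departure_station>"): 0.6,
        ("<departure_station>", "<arrival_station>"): 0.7,
    }

    def path_score(acoustic_log_likelihood, concepts):
        """Log-domain score of one parse: acoustics + unit counts + concept bigram."""
        score = acoustic_log_likelihood
        previous = "<s>"
        for concept in concepts:
            unit_prob = CONCEPT_COUNTS.get(concept, 1) / TOTAL_UNIT_COUNT
            bigram_prob = CONCEPT_BIGRAM.get((previous, concept), 1e-4)
            score += math.log(unit_prob) + math.log(bigram_prob)
            previous = concept
        return score

    # "from Nijmegen to Amsterdam" parsed as two concepts:
    print(path_score(-42.0, ["<departure_station>", "<arrival_station>"]))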
The DM takes care of gathering all the necessary information to perform
a database query. It does so by asking specific questions to the caller. The DM needs
to know the departure and arrival station, the departure or arrival time and the date
on which the caller wants to travel. The opening question of the DM is "From which station to
which station do you want to travel?". If the caller answers "I want to travel from Nijmegen
to Amsterdam tomorrow" the NLP sets the values for departure_station := Nijmegen, arrival_station
:= Amsterdam, and date := tomorrow. The DM will then ask "At what time do you
want to travel from Nijmegen to Amsterdam tomorrow?". It thereby asks for the information
it is still missing to do a database query. At the same time it gives the caller feedback
about what the NLP understood. If the NLP made a mistake, the caller can correct the system, e.g. by saying: "No, I want to travel the day after tomorrow". If the caller does not correct the system, the DM decides that the NLP understood the concepts departure_station, arrival_station, and date correctly. These concepts are then frozen, which means that the
caller can no longer change their values. If the DM has all the information it needs, it will
do the database query and report to the caller what connection(s) it found.
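The slot-filling and freezing behaviour described above can be summarized in a small sketch; the class and slot names are invented for this illustration and are not the actual DM code.

    REQUIRED_SLOTS = ("departure_station", "arrival_station", "time", "date")

    class DialogueManager:
        def __init__(self):
            self.slots = {}      # slot name -> value the NLP understood
            self.frozen = set()  # slots the caller can no longer change

        def update(self, nlp_result):
            """Merge new NLP concepts, ignoring slots that are already frozen."""
            for slot, value in nlp_result.items():
                if slot not in self.frozen:
                    self.slots[slot] = value

        def confirm(self, corrected=()):
            """Freeze every filled slot the caller did not correct in this turn."""
            for slot in self.slots:
                if slot not in corrected:
                    self.frozen.add(slot)

        def missing(self):
            return [s for s in REQUIRED_SLOTS if s not in self.slots]

    dm = DialogueManager()
    dm.update({"departure_station": "Nijmegen",
               "arrival_station": "Amsterdam",
               "date": "tomorrow"})
    dm.confirm()         # no correction from the caller, so these slots freeze
    print(dm.missing())  # ['time'] -> "At what time do you want to travel ...?"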
The information found in the database (and all other feedback mentioned
above) is presented to the caller by means of speech synthesis. Language generation
is limited to concatenating fixed phrases or inserting the right words into open slots
in carrier phrases. Speech synthesis is accomplished by concatenating pre-recorded phrases and
words spoken by a female speaker.
3. The two projects
The lion's share of the work described below has been carried out within
the framework of two projects. A short description of these projects is given in this section.
Because it is difficult to say exactly which part of the work was done in which project,
we will only give a global description of the work carried out in each project.
3.1 MAIS
The European MLAP project MAIS (Multilingual Automatic Inquiry Systems)
started on December 1st, 1994, and ended on December 1st, 1995. The MAIS consortium
consisted of one technology provider: Philips Research Labs (Aachen, Germany); two public
transport companies: SNCF (French railways) and NS (Dutch railways, later associated with the Dutch public transport information service, OVR); and three universities: RWTH
(Aachen, Germany), IRIT (Toulouse, France), and KUN (Nijmegen, the Netherlands).
The goals of the MAIS project were:
[1] to specify the requirements for an automated multilingual public transport
information system that can be accessed over the telephone by the general public;
[2] to specify assessment procedures which can be used to measure users'
satisfaction with the service; and
[3] to provide Dutch and French versions of the CSR, NLP, and DM modules,
which could eventually be used to build laboratory demonstrators of train timetable
information systems for these languages.
For aims [1] and [2] MAIS worked in close collaboration with the MLAP project
RAILTEL. Starting point for [3] was a prototype developed by Philips Research Labs
(Aachen, Germany), which already existed at the beginning of the project (see section
2). This prototype could provide information about the schedules of the German railways.
The work described in Section 4.1 mainly took place within the framework of this
project.
As a follow-up of the MAIS project (and partly also under the PP-TST project described below) we have worked on improving the CSR and NLP modules to a level at which they can be used to implement an operational laboratory system for Dutch. Such a system was needed in order to collect task-specific speech that can be used to bring the modules to a performance level that might be sufficient for actual deployment. A 'training database collection system' has been available in the Dutch Public Switched Telephone Network since December 1995.
3.2 PP-TST
The NWO Priority Programme 'Language and Speech Technology' (in Dutch: 'Prioriteits-Programma Taal- en Spraak-Technologie', PP-TST) is a five-year project which started in January 1995. The partners involved in this project are the Netherlands Organization for Scientific Research (NWO, Den Haag), Philips Corporate Research (PCR, Eindhoven), Royal Dutch PTT (KPN Research, Leidschendam), Nijmegen University (KUN, Nijmegen), Institute for Perception Research (IPO, Eindhoven), Groningen University (RUG, Groningen), and University of Amsterdam (UvA, Amsterdam).
The PP-TST aims at the development of advanced telephone-based information systems. One prominent feature of this programme is its attempt to achieve
scientific as well as practical goals at the same time. The practical goal is to build a demonstrator
of an interactive spoken language information system that can give travel information
about public transport in the Netherlands. A number of increasingly powerful demonstrators
are planned. From a scientific point of view, original contributions are envisaged in
robust speech recognition over the telephone, natural language processing, and dialogue
management in information-seeking dialogues. In the area of speech recognition, the focus
will be on signal processing techniques to remove channel characteristics, on the one hand,
and on explicit modelling of pronunciation variation, on the other. As for NLP aspects of
the system, three approaches will be compared, viz. the AI-type approach presently implemented
in the system, corpus-based parsing, and parsing using a conventional wide-coverage grammar.
On the level of dialogue control it will be investigated how the communication with the
user can be made more effective and user-friendly.
4. Building a Dutch SDS
In order to build and train an SDS for a certain application, a considerable
amount of data is needed. For collecting these data Wizard-of-Oz scenarios are often used.
However, within the framework of the current projects a different approach was chosen, which
consists of the following five stages:
[1] make a first version of the SDS with available data (which need not be application-specific)
[2] ask a limited group of people to use this system, and store the dialogues
[3] use the recorded data (which are application-specific) to improve the
SDS
[4] gradually increase the data and the number of users
[5] repeat steps [2], [3], and [4] until the system works satisfactorily.
4.1 The first version of the SDS
In Section 2 we provided a short description of the system developed by
Philips Aachen. A first version of the SDS was obtained by localizing this German system for
Dutch. How this was done is described in the present section.
4.1.1 CSR
The CSR component of the first version of the SDS was trained with 2500
utterances of the Polyphone database (Damhuis et al., 1994; den Os et al., 1995). The whole database was recorded over the telephone and consists of read speech and (semi-)spontaneous speech. For each speaker 50 items are available. Five of the 50 items were used, namely the so-called phonetically rich sentences. Each subject read a different set of five sentences,
selected so as to elicit all phonemes of Dutch at least once. The more frequent phonemes
are produced much more often, of course. The speech recognizer used in this version of the system is a mixture density HMM recognizer with monophone units. As a first approximation, we trained
about 50 acoustic models; they represent the phonemes of Dutch, plus two allophones
of /l/ and /r/.
Note that the first version of the CSR was trained with read speech (and not spontaneous speech, as in the application) and that only very few sentences were related
to the public transport domain. In the intended application the speech will be spontaneous
and related to public transport information. Therefore, the data used to train the first
version of the CSR cannot be considered application-specific.
Phonemic forms in the lexicon were taken from three different sources: (1)
the names of stations from the ONOMASTICA database (Konst and Boves, 1994), (2) the
lemma forms of other words from the CELEX database (Baayen et al., 1993), and (3) for
words that were not found in those two databases the phonemic forms were generated by means
of our grapheme-to-phoneme converter (Kerkhoff et al., 1984). Up to now, training
and testing have been done completely automatically, i.e., no attempts have been made to improve
recognition rates by making the phonemic representations in the lexicon more homogeneous,
nor by investigating the optimal set of monophone models. Furthermore, as was already
noted above, there is only one phonemic transcription for each word, i.e., pronunciation
variation is not modelled. Therefore, recognition scores obtained so far must be considered
as rough indications of what can be obtained in an initial development job.
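The order in which the three sources were consulted can be sketched as follows; the function names and the tiny example dictionaries are hypothetical, and only the priority order is taken from the description above.

    # Build a lexicon with exactly one phonemic (SAMPA) form per word:
    # (1) station names from ONOMASTICA, (2) lemma forms from CELEX,
    # (3) grapheme-to-phoneme conversion for everything else.

    def phonemic_form(word, onomastica, celex, grapheme_to_phoneme):
        if word in onomastica:
            return onomastica[word]
        if word in celex:
            return celex[word]
        return grapheme_to_phoneme(word)

    def build_lexicon(words, onomastica, celex, grapheme_to_phoneme):
        return {w: phonemic_form(w, onomastica, celex, grapheme_to_phoneme)
                for w in words}

    lexicon = build_lexicon(
        ["geldrop", "amsterdam", "goedendag", "overmorgen"],
        onomastica={"geldrop": "GELdrOp", "amsterdam": "Amst@rdAm"},
        celex={"goedendag": "xud@ndAx"},
        grapheme_to_phoneme=lambda w: w.upper(),   # placeholder for the real converter
    )
    print(lexicon)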
4.1.2 NLP
Since German and Dutch are quite similar from a syntactic point of view,
for some parts of the NLP it was possible to make a direct translation from German to Dutch.
However, in many other cases, such as time and date expressions, things appeared to
be much more complicated. To illustrate this point some examples are mentioned here.
For instance, each language has its own expressions for special days. In Dutch we have "koninginnedag" (the queen's birthday), "sinterklaas" (a festivity on December 5th or 6th), and "oudjaarsdag" (December 31st, literally 'old year's day'). It is very common to say e.g. "de dag na koninginnedag" (literally 'the day after queen's day'). Thus, the system
had to be taught to recognize these expressions, which do not occur in German.
Furthermore, in different countries people assign a different meaning to
time expressions like morning, afternoon, evening, and night. Because these concepts
are used very often to indicate approximate time of departure or arrival, they should
be defined and handled accordingly. For instance, in the German system 'morning' is interpreted as a time between 00:00 and 10:00, while in the Dutch system it is interpreted as a time from 04:00 to 12:00. The time between 00:00 and 04:00 is usually referred to as 'night'
in Dutch.
Apart from different notions of time expressions, there are also differences
in the way the German and Dutch databases are constructed. These two kinds of differences interact and lead to the following problem (which we call the time-frame problem). In
order to construct a database query, the NLP must determine the date on which the caller wants
to travel. It uses the system clock and the caller's information to do so. The system clock
is the internal clock of the computer on which the NLP software runs. It can provide the time
and the calendar date. Let us call the date provided by the internal clock D. The German
system uses a database in which a day starts at 00:00 and ends at 23:59. These times fully
coincide with the beginning and the end of a calendar day. However, in the Dutch system a
database is used for which the day starts at 04:00 and ends at 03:59. This leads to some tricky problems when interpreting time-related expressions from the caller. For instance,
if the system asks a Dutch caller "when do you want to travel?", and the caller answers "tomorrow",
the interpretation of tomorrow depends on the time at which the answer is given.
If the caller says "tomorrow" between 04:00 and 23:59 (s)he really means tomorrow, i.e. the
system should interpret this as D+1. However, if (s)he says "tomorrow" between 00:00 and
03:59, a Dutch caller usually means today and not tomorrow. Consequently, the system should
not interpret this 'tomorrow' as D+1, but instead as D. The special status of the time
frame 00:00 - 04:00 in the Dutch system made it necessary to review all the interpretations
of time and date expressions in the original German system.
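The essence of the time-frame problem can be captured in a few lines of code. This is a sketch under the assumption of a database day running from 04:00 to 03:59; only the behaviour for "tomorrow" is stated above, and the handling of the other expressions is an assumed generalization.

    from datetime import datetime, timedelta

    def travel_date(now, expression):
        """Map a relative date expression onto a database date."""
        d = now.date()                    # calendar date D from the system clock
        if now.hour < 4:                  # 00:00-03:59 still belongs to the
            d -= timedelta(days=1)        # previous database day
        offsets = {"today": 0, "tomorrow": 1, "the day after tomorrow": 2}
        return d + timedelta(days=offsets[expression])

    # At 01:30 a caller saying "tomorrow" means calendar date D, not D+1:
    print(travel_date(datetime(1996, 3, 2, 1, 30), "tomorrow"))   # 1996-03-02
    # At 14:00 the same answer resolves to D+1:
    print(travel_date(datetime(1996, 3, 2, 14, 0), "tomorrow"))   # 1996-03-03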
We were convinced that introspection alone would never reveal all the expressions Dutch people could use to ask for information about public transport.
At the same time, we did not have a large database available that could be used to look
for all possible expressions. Therefore, it was decided to proceed as follows: A preliminary
version of the grammar was made by translating some parts from German and by changing some
other parts. This part of the SDS was then tested independently of the speech interface
with a keyboard version of the dialogue system. People could log in on a system, type their
questions on a keyboard, and get the replies from the system on the screen. Because people
are likely to formulate their sentences differently when they speak or type, they were instructed to try to formulate their sentences as they would if they had to speak them.
In this way we were able to test the grammar and to gather some text material
that could be used to train the language model. It turned out that the sessions
of the users with this version of the NLP were extremely useful. On the basis of the log-files,
many adjustments were made to the system. A nice example is that in the original German grammar
there are 18 ways to give an affirmative answer and 7 ways to give a negative answer.
Based on the log-files we have defined 34 affirmative answers and 18 negative answers
for Dutch.
4.1.3 DM
For the bootstrap version of the system the German DM was translated literally
into Dutch. Some adaptations appeared to be necessary, though. For instance, the interface
to the public transport database had to be modified. Furthermore, some changes were required
in the feedback to the caller. By way of illustration, in the German system train
numbers are mentioned because these appear to be important for the caller. However,
this piece of information is irrelevant in the Netherlands (people never refer to the
train number) and was therefore excluded from the feedback in the Dutch system.
As mentioned above, a database query is initiated only after all necessary
information is available. Before an information item is considered as known and frozen,
the caller is given explicit or implicit feedback about what the system thinks it has recognized.
The caller can then disconfirm erroneous items and replace them with correct information.
4.1.4 TTS
Many adaptations had to be made to the speech output module of the system,
because only the general approach from the German prototype could be copied. An inventory
was made of the phrases that together form the questions and replies the system should be
able to produce. Recordings were made of these utterances spoken by a female speaker. In
the SDS these recorded utterances are concatenated to generate the speech output of the
system.
4.2 Improving the SDS
The first version of the SDS was put in the PSTN in December 1995. This
version was trained with DB0, i.e. the 2500 Polyphone utterances. A small group of people
received the telephone number of this system, and were requested to call it regularly.
Their dialogues were recorded. In this way the databases DB1 to DB5 in Table 1 were collected.
These databases are built up incrementally, which means that DB2 is a superset of DB1, DB3
of DB2, etc.
Table 1. Databases used during development of the SDS [see postscript version]
For every utterance in the databases an orthographic transcription was made
manually. Out-of-vocabulary words were detected automatically from the transcriptions.
In this way words containing typing errors were found as well. All these typing errors
were corrected. The out-of-vocabulary words were phonematized and added to the training
lexicon, in order to make it possible to use all the collected data for training the system.
However, not all new words were added to the recognition lexicon. Only the words that were related
to crucial concepts of the application were included in the recognition lexicon.
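The out-of-vocabulary check itself is straightforward; the following sketch, with invented names and toy data, shows the idea of comparing the manual orthographic transcriptions against the lexicon, which also surfaces typing errors.

    def out_of_vocabulary(transcriptions, lexicon):
        """Return the set of transcription words not covered by the lexicon."""
        oov = set()
        for utterance in transcriptions:
            for word in utterance.lower().split():
                if word not in lexicon:
                    oov.add(word)
        return oov

    transcriptions = ["ik wil morgen van nijmegen naar amsterdam"]
    lexicon = {"ik", "wil", "van", "nijmegen", "naar", "amsterdam"}
    print(out_of_vocabulary(transcriptions, lexicon))   # {'morgen'}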
The average number of out-of-vocabulary words is shown in Figure 1. On the horizontal axis the number of utterances in the database is given (see Table
1). The vertical axis shows the number of out-of-vocabulary words divided by the number of utterances,
i.e. the average number of out-of-vocabulary words per utterance. It can be observed
that the average number of out-of-vocabulary words is small. Apparently, we succeeded in
making a bootstrap lexicon that contains most of the words used.
In Figure 1 one can also see that the average number of out-of-vocabulary
words decreases as the number of utterances increases from 1301 to 6401. In the
beginning a fair number of out-of-vocabulary words are found. However, as the same group
of people is likely to use more or less the same words to ask for information, the number of
unknown words decreases gradually. After DB3 (6401 utterances) had been recorded, the
telephone number of the system was made available to a larger group of people. It is conceivable
that new people will use new words. As a matter of fact, the average number of out-of-vocabulary
words turns out to increase first and to decrease again later on (see Figure 1).
Whenever a sufficient amount of new data was collected, language models
and phoneme models were trained again. The new models were compared to the old
models (as will be described below), and those which performed best were chosen. In
the on-line system the old models were replaced by the better ones.
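This update cycle can be summarized as follows; the function names are placeholders for the real training and evaluation procedures, and the sketch is not the actual code used.

    def update_models(current_models, training_db, test_db, train, evaluate):
        """Retrain on the enlarged database and keep whichever models score better.

        train(training_db) returns new language/phoneme models, and
        evaluate(models, test_db) returns an error rate, e.g. the word-graph
        word error rate used in Section 4.3.
        """
        candidate_models = train(training_db)
        if evaluate(candidate_models, test_db) < evaluate(current_models, test_db):
            return candidate_models      # the new models replace the old ones on-line
        return current_models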
In the early versions of the system we detected some syntactic constructions
that were sometimes used by the callers but not handled correctly by the NLP. To improve the NLP, these syntactic constructions were added to the NLP's context-free
grammar. Furthermore, the NLP was trained with the same data used to train the language
model (the bigram). During the training of the NLP the concept bigram model is constructed
and the number of occurrences of syntactic units in the context-free grammar is
counted and stored in the NLP. As described above (see section 2), the concept bigram and the
syntactic unit counts are used in deciding which parse of the word graph is chosen.
Although the first bootstrap version of the system was quite useful as a
tool for data acquisition, tests performed recently show that some changes at the ergonomic
level are required. For instance, the concatenation synthesis should be improved,
information about complex journeys should be split into smaller chunks, and the caller should
be able to interrupt the machine (barge-in capability). Some of these improvements
of the DM module will be addressed in the near future.
4.3 Evaluating the performance of the CSR module
Part of the data collected with the on-line SDS was kept apart as a test
database (500 utterances). The first test database was created by randomly selecting 500
utterances. The first evaluations were done with this test database. However, after some time
we found out that this database was not well balanced, i.e. it contained many utterances from a few speakers who used the system frequently in the beginning. That is why we decided to create
a second (more balanced) test database, also containing 500 utterances. This database was
used for later evaluations. The total number of words and characters (i.e. phonemes) in
each database is approximately 1,700 and 10,000, respectively. The number of different words in test databases 1 and 2 is 298 and 299, respectively. This means that test databases 1 and 2 are about equally large.
The performance of the CSR module was evaluated for the whole word graph
(WG) and for the best sentence (BS) obtained from this word graph. Both for the
word graph and for the best sentence, the word error rate (WER) and sentence error rate (SER) were
calculated. In total this yields four measures that can be used for evaluation: WG-WER,
WG-SER, BS-WER, and BS-SER.
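For reference, WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, and SER is the fraction of utterances containing at least one word error. The following sketch computes both for a small invented test set; it illustrates the standard definitions, not the evaluation software actually used.

    def word_errors(reference, hypothesis):
        """Levenshtein distance between two word sequences."""
        ref, hyp = reference.split(), hypothesis.split()
        d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
             for i in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)]

    def wer_and_ser(pairs):
        errors = sum(word_errors(r, h) for r, h in pairs)
        words = sum(len(r.split()) for r, _ in pairs)
        wrong = sum(1 for r, h in pairs if word_errors(r, h) > 0)
        return errors / words, wrong / len(pairs)

    pairs = [("van nijmegen naar amsterdam", "van nijmegen naar amsterdam"),
             ("morgen om tien uur", "morgen om negen uur")]
    print(wer_and_ser(pairs))   # (0.125, 0.5)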
In section 2 it was already explained that the NLP looks for specific concepts
in the whole word graph, such as departure station, arrival station etc. Since
these concepts are words, WG-WER would seem to be the most relevant evaluation measure. However,
it is not necessary that the NLP recognizes every single word. Recognition of the
above-mentioned crucial concepts will suffice. Although WG-WER is probably a better measure
of the CSR performance than the other three indices mentioned previously, it is obvious
that it is not an optimal measure. Indeed, the optimal measure would be a concept error rate
for the word graph. In order to provide complete information about the performance of
the CSR, the remaining three measures are also presented. The BS error rates give an
idea of the quality of the phoneme models and bigrams, because the probabilities of the phonemes
and bigrams are used to determine the BS from the WG. The SERs show how often the complete
sentence is recognized correctly.
Note that these results were obtained with a version of the system which
was available at the beginning of the MAIS and PP-TST projects. Thanks to the use of new
features, the performance of the CSR module has now improved. However, this improved version
has not been used for the research described in the present article. Still, the
research findings reported here apply to the improved version too, because they concern basic aspects
of the system.
The different databases were used to train language models and phoneme models.
In all cases the inventory of phonemes remained the same. Language models trained
on database DBj will be called Lj. Phoneme models trained on database DBn
will be called Pn. In addition, phoneme models were trained on DB0 in combination with an application-specific database DBm. These phoneme models will be called P0m.
With test database 1 the error rates for several versions of the system
were obtained (see Table 2). First, the phoneme and language models were trained with
the Polyphone material (DB0). The resulting error rates are given in column 2. DB1 was
not used to train phoneme models and a language model because the number of utterances in
DB1 (i.e. 1301) was too small.
Table 2. Performance level for different phoneme models (Pi) and language
models (Lj). Evaluation is done with test database 1. [see postscript version]
Training the phoneme models on both the Polyphone data (DB0) and application-specific data (DB2) reduces the error rates (compare column
3 to column 2). However, a much larger reduction in the error rates is obtained by training
the language model on DB2 (compare column 4 with 2 and 3). The conclusion is that application-specific data is much more important for training the language models than for training
the phoneme models. Other comparisons of performance levels with different databases
confirmed this conclusion.
Increasing the number of utterances in the database from 5496 to 6401 does
not have much effect on the level of performance (compare columns 5 and 6 with column
4). This could be due to the fact that the amount of added utterances (i.e. 905 utterances)
is small compared to the size of the database. What is more important is that performance
does not deteriorate if the Polyphone material is left out when training the phoneme
models (compare columns 7 and 8 with columns 5 and 6, respectively). On the contrary, BS-WER
is even slightly better for phoneme models trained with DB3 (given in columns 7
and 8), compared to phoneme models trained with DB3 and DB0 together. Therefore, we decided
not to use the data from the Polyphone database anymore for the current application.
Table 3. Performance levels for different phoneme models (Pi) and language
models (Lj). Evaluation is done with test database 1 (column 2: old) and 2 (columns 3-5:
new). [see postscript version]
At this point test database 1 was replaced by test database 2. For phoneme models P3 and language model L3 the error rates obtained with test database 2 were higher than those obtained with test database 1, except for BS-SER (see Table 3, compare columns 2 and 3). However, increasing the size of the training database to 8,000 utterances led to better performance. The effect of increasing the database to 10,003 utterances was small. Evaluation results for a third test database, and for a larger training database (consisting of 21,288 utterances), are presented in Strik et al. (1996).
5. Pronunciation variation and non-speech sounds
Apart from the work described in the previous section, some other research
was carried out in order to improve the SDS. In the present section we will only give a short
description of some issues related to modelling pronunciation variation and recognizing non-speech
sounds.
In order to obtain the phonemic representations of the words in the lexicon,
we first checked whether these words were present in two existing databases, namely
CELEX (Baayen et al., 1993) and ONOMASTICA (Konst and Boves, 1994). Phonemic transcriptions of the words that could not be found in these two databases were derived
by using the grapheme-to-phoneme conversion rules developed at our department (Kerkhoff
et al., 1984). The output of the rules was then checked and, if necessary, corrected by
hand. There are several reasons why a lexicon obtained in this way is not optimal for speech
recognition:
1. since the phonemic transcriptions are obtained from different sources,
they are likely to be inconsistent;
2. for each entry in the lexicon only one pronunciation variant is stored,
while in practice people will pronounce words in many different ways;
3. the pronunciation variant present in the lexicon is not always the optimal
one for speech recognition (see e.g. Cohen, 1989).
For instance, the policy adhered to in the ONOMASTICA project was to limit
reduction phenomena to the bare minimum. As a result many of the station names are
represented by overly formal phonemic forms. By way of illustration, we will give some
examples of recognition errors which are most probably due to pronunciation variation
(in our system and in the examples below SAMPA is used as the computer phonetic alphabet).
In one dialogue a person did not succeed in convincing the SDS that he wanted
to go to a place called Geldrop. Although he tried several times, the system did
not manage to recognize the word, because the speaker in question did not say [GELdrOp]
(the transcription of Geldrop in the lexicon), but [GELd@rOp]. Although this is only a minor
difference for human listeners, who are expert speech recognizers, this example illustrates
that a small difference in pronunciation (i.e. insertion of a schwa) can have serious
consequences for an automatic SDS (i.e. recognizing the wrong place name). Reduction processes,
which are very common in spontaneous speech, also caused several problems. For instance,
many people say something like [xujdAx] instead of [xud@ndAx] or [xuj@ndAx], which are more
careful pronunciations of the Dutch word "goedendag" (a greeting which literally
means "nice day"). Another example of severe reduction is the pronunciation of Amsterdam as
[Ams@dAm] instead of [Amst@rdAm].
As spontaneous speech exhibits a considerable amount of pronunciation variation,
the speech recognizer's performance can be improved if the variation is properly
taken into account. For this reason part of our research on speech recognition is now
concentrated on modelling pronunciation. A first step in this direction consists in making
an inventory of possible pronunciation variants present in spontaneous speech. Although
a large amount of pronunciation variation in Dutch is described in the literature (see, e.g.,
Booij, 1995), we also found variation forms which probably have not been described before (see
Cucchiarini and Van den Heuvel, 1996).
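A first, very rough idea of how such variants could be added to the lexicon is sketched below; the two rules are invented for this illustration (they only cover the schwa insertion and /r/-deletion examples given above) and are not the rule set actually under development.

    import re

    CONSONANTS = "bdfgkmnpstvxzGNSZhwj"   # rough SAMPA consonant set for Dutch

    def variants(sampa):
        forms = {sampa}
        # optional schwa insertion between a plosive and /r/ (GELdrOp -> GELd@rOp)
        forms.update(re.sub(r"([bdptkg])r", r"\1@r", f) for f in list(forms))
        # optional /r/-deletion before a consonant (Amst@rdAm -> Amst@dAm)
        forms.update(re.sub(rf"r(?=[{CONSONANTS}])", "", f) for f in list(forms))
        return forms

    print(variants("GELdrOp"))     # contains 'GELd@rOp'
    print(variants("Amst@rdAm"))   # contains 'Amst@dAm'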
Besides modelling pronunciation variation, correct recognition of non-speech
sounds is also very important. We encountered many examples of this phenomenon,
some of which are mentioned here. One sentence, "ja dat klopt" ('yes, that is correct'), was followed by a long interval of breath noise after the last word. The system recognized the words 'dat' and 'klopt' correctly, but it also recognized the final bit of non-speech as speech, and thus recognized "nee dat klopt niet" ('no, that is not correct'). This is exactly
the opposite of what was meant. Furthermore, non-speech sounds were found very often at the beginning
of an utterance. In many cases a speaker starts an utterance by inhaling. This
inhaling noise is often followed by a lip-smack. Some preliminary experiments revealed that modelling
(and recognizing) these non-speech sounds does improve the performance of the
SDS.
6. Discussion and conclusions
In this paper we have described the development of a system that can be
used for automating part of an existing telephone-based service. An important characteristic
of this system is that it was derived from a prototype that had originally been
developed for German. Moreover, an alternative approach for collecting application-specific material
was adopted, instead of the usual Wizard-of-Oz (WOZ) scenario.
This alternative method appears to have considerable advantages. First of
all, no time is spent on building, testing, debugging, and implementing the WOZ simulation.
Instead, the real system is realized immediately. Consequently, the system used to collect the data is the real system and not some imitation; the specifications of the data-collection system and the final SDS are the same; and thus the properties of the signals collected for development (e.g. background noise, signal-to-noise ratio) closely resemble those of the signals the final system will eventually have to handle. Furthermore, the whole application
with all the modules is used from the beginning, and not just one or some of its components.
In this way practical problems pop up at an early stage and can be solved before the
final implementation takes place. Many of these practical problems are specific to the implementation
of the SDS. Therefore, most of them will not turn up when a WOZ simulation is used.
In short, not all findings and experiences obtained with a bootstrap version can be obtained
with a WOZ simulation. Finally, it is possible to collect speech material and to test,
debug, and evaluate the system at the same time.
However, one important disadvantage of this approach is that it requires the first version of the system to work well enough to be used for data collection. We succeeded in making a suitable bootstrap for the following reasons. Firstly, we could use the German prototype as a starting point. Secondly, we had knowledge of German, Dutch, and this specific application. These three types of knowledge, together
with the fact that German and Dutch are not very different, made it possible to localize
a substantial part of the German prototype for Dutch. Thirdly, speech databases, albeit not application-specific, were available; they were used especially to train the phoneme models. Finally, we used the data collected with the keyboard version. These
data, and our knowledge of Dutch and this application, were used to develop the bigram
and the NLP module. It is possible that under less advantageous circumstances, this
approach would be less successful than it turned out to be in our case. On the basis of our experience, we can therefore conclude that the bootstrap
approach was very successful. Furthermore, we found that phoneme models trained with
data which are not specific for the given application still perform reasonably well. However,
this is not the case for the language models. A large gain in performance was obtained when
the language models were trained with application-specific data. We also showed that
the small test databases used in our research succeeded in revealing the relative improvements
obtained with various versions of the system. However, the absolute numbers for the
performance levels differed between the two test databases. Therefore, it is probably
better to use more than one database for testing. Finally, we are satisfied with the results of the tests so far. Our goal
was to automate part of an existing service. In order to reduce the complexity of the task,
we limited the domain to information about journeys from one train station to another.
So far, it seems that it should be possible to automate this part of the service. However, we are
still improving the system and the final field tests still have to be performed. In the near
future we hope to be able to report positive results on the final evaluation of the system.
Acknowledgements
Part of the research was carried out in the framework of two projects: the MLAP project MAIS and the Dutch Priority Programme 'Language and Speech Technology', which is funded by the Netherlands Organization for Scientific Research (NWO).
References
Aust, H., M. Oerder, F. Seide and V. Steinbiss (1994), 'Experience with the Philips Automatic Train Timetable Information System', in: Proceedings IVTTA'94, 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, Kyoto, 67-72.
Aust, H., M. Oerder, F. Seide and V. Steinbiss (1995), 'A spoken language inquiry system for automatic train timetable information', Philips Journal of Research, 49(4), 399-418.
Baayen, R.H., R. Piepenbrock and H. van Rijn (1993), The CELEX lexical database
(on CD-ROM), Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania.
Booij, G. (1995), The phonology of Dutch. Oxford: Clarendon Press.
Cohen, M.H. (1989), Phonological structures for speech recognition, PhD
dissertation, University of California, Berkeley.
Cucchiarini, C. and H. van den Heuvel (1996), '/r/-deletion in standard Dutch', in: Proceedings of the Department of Language and Speech, Nijmegen University, 19, this issue.
Damhuis, M., T. Boogaart, C. in 't Veld, M. Versteijlen, W. Schelvis, L. Bos and L. Boves (1994), 'Creation and analysis of the Dutch Polyphone corpus', in: Proceedings International Conference on Spoken Language Processing (ICSLP) '94, Yokohama, 1803-1806.
Kerkhoff, J., J. Wester and L. Boves (1984), 'A compiler for implementing the linguistic phase of a text-to-speech conversion system', in: Bennis, H. and W.U.S. van Lessen Kloeke (eds.), Linguistics in the Netherlands, 111-117.
Konst, E.M. and L. Boves (1994), 'Automatic grapheme-to-phoneme conversion of Dutch names', in: Proceedings International Conference on Spoken Language Processing (ICSLP) '94, Yokohama, 735-738.
Ney, H. and X. Aubert (1994), 'A word graph algorithm for large vocabulary, continuous speech recognition', in: Proceedings International Conference on Spoken Language Processing (ICSLP) '94, Yokohama, 1355-1358.
Oerder, M. and H. Aust (1994), 'A real-time prototype of an automatic inquiry system', in: Proceedings International Conference on Spoken Language Processing (ICSLP) '94, Yokohama, 703-706.
Oerder, M. and H. Ney (1993), 'Word graphs: an efficient interface between continuous-speech recognition and language understanding', in: Proceedings ICASSP'93, Minneapolis, 119-122.
den Os, E.A., T.I. Boogaart, L. Boves and E. Klabbers (1995), 'The Dutch Polyphone corpus', in: ESCA 4th European Conference on Speech Communication and Technology: EUROSPEECH '95, Madrid, 825-828.
Steinbiss, V., H. Ney, R. Haeb-Umbach, B. Tran, U. Essen, R. Kneser, M. Oerder, H. Meier, X. Aubert, C. Dugast and D. Geller (1993), 'The Philips research system for large-vocabulary continuous-speech recognition', in: ESCA 3rd European Conference on Speech Communication and Technology: EUROSPEECH '93, Berlin, 2125-2128.
Steinbiss, V., H. Ney, X. Aubert, S. Besling, C. Dugast, U. Essen, D. Geller, R. Haeb-Umbach, R. Kneser, H.-G. Meier, M. Oerder and B.-H. Tran (1995), 'The Philips research system for continuous-speech recognition', Philips Journal of Research, 49(4), 317-352.
Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1996), 'Localizing an automatic inquiry system for public transport information', to appear in: Proceedings International Conference on Spoken Language Processing (ICSLP) '96, Philadelphia.