H. Strik & L. Boves (1991a)
Journal of Phonetics 19, pp. 367-378.
This article has appeared in the Journal of Phonetics. Therefore, I only
have a printed version with the final text and the final layout. If you
want a copy of this article, you can find it in Journal of Phonetics 19,
or you can contact me. The text of the ASCII version
below is slightly different from the text of the article.
Running title: non-linear time-alignment and averaging
Abstract
In analyzing physiological signals related to speech, it is necessary to
average several repetitions in order to improve the Signal to Noise Ratio.
However, in a recent experiment, considerable differences were found in
the articulation rate of repeated realizations of a medium length utterance,
especially for untrained subjects. This makes averaging of related physiological
signals a non-trivial problem. A new method of time-alignment and averaging
of the physiological signals is described. In this method a dynamic programming
algorithm is used, which successfully corrects for the timing differences
between the repetitions.
1. Introduction
A quantitative study of the physiological basis of speech production requires
the simultaneous measurement of acoustic signals and a number of physiological
signals. The usual procedure to overcome the limitations of low Signal to
Noise Ratios in physiological signals, and to avoid misinterpretations caused
by idiosyncrasies of single tokens, is to average multiple repetitions of
the 'same' utterance (Atkinson, 1978; Baer, Gay, and Niimi, 1976; Collier,
1975; Maeda, 1976). To allow averaging, the utterances must be lined up
in time. To that end line-up points must be defined in every repetition.
Typical choices are distinctive events like the release of a plosive or
the onset of voicing, preferably close to the middle of the utterance.
Time alignment and averaging cannot be applied to all speech signals in
the same way. For instance, usually no averaging is applied to fundamental
frequency (F0) signals, probably because their discontinuous nature makes
straightforward averaging questionable. Instead, the F0 contour of
one of the repetitions is chosen to represent the 'average' F0 contour (Atkinson,
1978; Collier, 1975; Maeda, 1976).
The applicability of the method of linear time-alignment for averaging,
as described above, is limited by the inherent variability of speech production.
Two types of variation must be distinguished, viz. variation in speaking
rate and variation in articulation. The two kinds of variation are not independent,
as a pronounced change in speaking rate is likely to affect articulation
as well. But for the experiments we are concerned with, the amount of change
in speaking rate is such that rate-induced articulatory variations are
unlikely to be a first-order effect. This paper mainly deals with techniques
to overcome the effects of temporal variation.
If trained subjects are asked to utter words or short phrases, the variation
in articulation speed usually remains within reasonable bounds. But even
for a trained subject considerable differences were found in the speaking
rate for repetitions of a medium length utterance (Strik and Boves, 1988).
If the variation in the speaking rate is large, averaging after linear time-alignment
would result in signals corresponding to different articulatory events being
averaged.
In this paper we propose a novel processing technique in which a Dynamic
Programming (DP) algorithm is used to time-align the tokens in a non-linear
way. The aim of this method, which is referred to as the method of non-linear
time-alignment and averaging, is to obtain such a degree of time-alignment
that meaningful averaging remains possible. The method was tested with the
data of an experiment with (quasi-)spontaneous speech of a non-trained subject.
Results of analysis with linear and non-linear time-alignment are compared.
The proposed method corrects for the variation in speaking rate, but the
problem of variation in articulation remains. It is safe to assume
that repeated realizations of the same utterance are fairly similar. However,
if the variation in articulation is too large, meaningful averaging after
time-alignment is never possible because then again physiological signals
related to different articulatory events are averaged. The data of our experiment
were also used to check a posteriori whether the amount of articulatory
variation was within reasonable bounds.
2. Method of non-linear time-alignment and averaging
In the method presented in this paper DP is used for non-linear time-alignment
of the tokens. DP has been used successfully in speech recognition, where
it is often referred to as Dynamic Time Warping (DTW).
First a brief description of the DP algorithm is given. For explanation
of the details of DP the reader is referred to the relevant literature (e.g.
Sakoe and Chiba, 1978). Next, an overview is given of the six stages of
the procedure for non-linear time-alignment and averaging of physiological
signals related to speech, followed by a more detailed description of the
separate stages.
2.1. The DP algorithm
The algorithm described here is based on the flowchart given in Sakoe and
Chiba (1978). The DP algorithm finds the optimal time registration
between two patterns, a reference pattern R of length J and a test pattern
T of length I. Both patterns are sequences of feature vectors that are
derived from the speech signals by appropriate feature extraction. The frames
of the two patterns define a grid of IxJ points (Fig. 1a).
A suitable distance metric is used to calculate the distance at point pk,
which is the distance between frame i of test pattern T and frame j of reference
pattern R: d[pk] = d[Ti,Rj]. A path P is a sequence of K grid points (Fig.
1a): P = p1, p2, p3, ..., pk, ..., pK; and pk = (i,j). The total distance
between T and R for a given path P is the weighted sum of the local distances:
DP[T,R] = sum_{k=1}^{K} wk * d[pk].
By definition, the optimal path Po is the path that minimizes DP[T,R]. The
path Po represents a function F, which realizes a mapping from the time
axis of T onto that of R, called the warping function. The warping function
F, or the optimal path Po, can be used to normalize the time axis of T with
respect to the time axis of R. When there are no timing differences between
T and R, the path Po coincides with the line i=j.
The path P is usually constrained. The path has to start in p1 = (1,1),
end in pK = (I,J), and it must remain within an adjustment window (Fig.
1a). In the method proposed in this paper a slope constraint condition of
1/2 (see Sakoe and Chiba, 1978) is used, which means that a diagonal step
can be followed, or preceded, by at most 2 off-diagonal (i.e. horizontal
or vertical) steps. The consequence is that only the five step sequences
given in Fig. 1b are allowed. The symmetric form of DP-matching is used
because Sakoe and Chiba found that it gave better results in speech recognition
than the asymmetric form.
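As an illustration, the symmetric DP recursion with slope constraint 1/2 can be sketched in Python as follows. This is a minimal sketch under our reading of Fig. 1b: the adjustment window and the backtracking needed to recover the optimal path Po are omitted, and all function and variable names are ours, not those of an existing implementation.

```python
import numpy as np

def dtw_symmetric(T, R):
    """Symmetric DP-matching with slope constraint 1/2 (after Sakoe & Chiba).

    T and R are (I, n) and (J, n) arrays of feature vectors (e.g. cepstra).
    Returns the normalized distance DP[T, R] = g(I, J) / (I + J).
    """
    I, J = len(T), len(R)
    # local Euclidean distances d[pk] = d[Ti, Rj]
    d = np.linalg.norm(T[:, None, :] - R[None, :, :], axis=2)
    g = np.full((I, J), np.inf)
    g[0, 0] = 2.0 * d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best = np.inf
            # the five admissible step sequences of Fig. 1b, symmetric weights
            if i >= 1 and j >= 3:   # A: one test frame onto three reference frames
                best = min(best, g[i-1, j-3] + 2*d[i, j-2] + d[i, j-1] + d[i, j])
            if i >= 1 and j >= 2:   # B: one test frame onto two reference frames
                best = min(best, g[i-1, j-2] + 2*d[i, j-1] + d[i, j])
            if i >= 1 and j >= 1:   # C: single diagonal step
                best = min(best, g[i-1, j-1] + 2*d[i, j])
            if i >= 2 and j >= 1:   # D: two test frames onto one reference frame
                best = min(best, g[i-2, j-1] + 2*d[i-1, j] + d[i, j])
            if i >= 3 and j >= 1:   # E: three test frames onto one reference frame
                best = min(best, g[i-3, j-1] + 2*d[i-2, j] + d[i-1, j] + d[i, j])
            g[i, j] = best
    return g[I-1, J-1] / (I + J)
```

Identical patterns yield a distance of zero, since the optimal path then coincides with the line i = j.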
2.2. General overview of the method
The method of non-linear time-alignment of physiological signals, proposed
here, can be split into six successive stages:
1. specification of line-up points,
2. selection of a reference pattern,
3. calculation of cepstrum coefficients of the acoustic signals,
4. calculation of a warping function for each token (DP),
5. mapping of the physiological signals, using the warping function, and
6. calculation of median values and variation of time-normalized signals.
A prerequisite for this method is that all (physiological) signals
be sampled at the same sampling frequency (Fs). For the experiment used
for evaluation of the method, Fs is 200 Hz, so the sampling time (Ts = 1/Fs)
is 5 ms. The individual stages are described below.
2.2.1. Specification of line-up points
Even though DP has proved to be useful in speech recognition, for the purpose
at hand some modifications seemed necessary. First of all, in basic speech
research one is often interested in the (average) physiological signals
before and after an utterance. However, it is difficult to obtain a useful
time registration path by comparing silence with silence. Also, it is often
desirable to have an exact time-alignment of a particular event in an utterance
to study the (average) physiological signals in the neighbourhood of this
event. Therefore, our method allows one to define several line-up points
in an utterance that are time-aligned exactly; the DP algorithm is only
applied between those line-up points (Fig. 2). The first line-up point is
interpreted as the beginning of the utterance, and the last one as the end
of the utterance. Before the first line-up point, and after the last line-up
point, the time registration path runs diagonally (Fig. 2).
2.2.2. Selection of a reference pattern
One of the tokens is chosen as a reference for time-normalization of the
remaining tokens. The best choice for this reference pattern or template
seems to be the token with median length, because it requires the least
adaptation in the other tokens.
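In code, this selection is a one-liner. The sketch below assumes `tokens` is a list of per-token sample arrays; the name and the example lengths are purely illustrative.

```python
import numpy as np

def select_reference(tokens):
    """Return the index of the token whose length is the median length
    (for an even number of tokens, the lower median is taken)."""
    lengths = np.array([len(t) for t in tokens])
    return int(np.argsort(lengths)[(len(lengths) - 1) // 2])

# three hypothetical tokens of 450, 462 and 431 samples
tokens = [np.zeros(450), np.zeros(462), np.zeros(431)]
print(select_reference(tokens))  # 0: the 450-sample token has the median length
```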
2.2.3. Calculation of feature vectors
The recording conditions during experiments in which several physiological
signals are measured are often such that the Signal to Noise Ratio (SNR)
of the audio signals is not high. The current method should also be applicable
to audio signals with mediocre SNR. Cepstrum coefficients are known to give
good results in speech recognition (Davis and Mermelstein, 1980; Paliwal
and Rao, 1982). Therefore, the first 12 cepstrum coefficients are used as
feature vectors. The speech signals were digitized with a sampling frequency
of 10 kHz and submitted to a 12th order LPC analysis using a 250 point Hamming
window and a window shift of Ts = 5 ms. The vectors of LPC coefficients
were subsequently transformed to vectors of 12 cepstrum coefficients (Markel
and Gray, 1976).
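The analysis chain can be sketched as follows. The Levinson-Durbin recursion and the LPC-to-cepstrum recursion are the standard ones from the linear prediction literature (Markel and Gray, 1976); the function names and frame handling are our own illustration, not the code used in the experiment.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for the LPC coefficients
    a_1..a_p of the model s[n] ~ sum_k a_k s[n-k] (Levinson-Durbin)."""
    a = np.zeros(order)
    e = r[0]
    for i in range(order):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / e
        a_prev = a[:i].copy()
        a[i] = k
        a[:i] = a_prev - k * a_prev[::-1]
        e *= 1.0 - k * k
    return a

def lpc_to_cepstrum(a, n_ceps):
    """LPC cepstrum via the recursion
    c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

def frame_to_cepstrum(frame, order=12, n_ceps=12):
    """One analysis frame: Hamming window, autocorrelation, LPC, cepstrum."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode='full')[len(w) - 1:len(w) + order]
    return lpc_to_cepstrum(levinson_durbin(r, order), n_ceps)
```

For a single-pole model with coefficient a, the recursion reproduces the known cepstrum c_n = a^n / n, which is a convenient sanity check.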
2.2.4. Determination of optimal time registration path
In the fourth stage the warping function has to be found that minimizes
the distance between test pattern and reference pattern. The exact choice
of the distance metric does not seem critical for our purpose. A simple
Euclidean distance measure proved to be sufficient. However, the definition
of the adjustment window is critical. Because there can be a substantial
difference in the length of patterns under comparison we used the adjustment
window shown in Fig. 1a, which is different from the one given by Sakoe
and Chiba (1978).
2.2.5. Transformation of the physiological signals
The warping functions computed in the previous stage describe the differences
in the temporal structure of all tokens relative to the reference token,
i.e. they allow normalization of the time axes of the tokens by mapping
them onto that of the reference token. Since the physiological signals are
measured on the same time axis as the speech signal, their time axes can
be normalized using the warping functions derived from the speech signals.
The time-normalized or warped signal W is computed from the original signal
S by using a non-linear function Fn: W(j) = Fn[S(i)]. The calculation starts
at grid point pK = (I,J), and backtracks to grid point p1 = (1,1). Because
only the five step sequences given in Fig. 1b are allowed, the function
Fn only has to be defined for these five partial paths. For time compression,
step sequences D and E in Fig. 1b, W(j) is obtained by averaging over two
and three samples respectively. For time stretching, step sequences A and
B, W(j) and preceding samples are obtained by linear interpolation (Fig.
3). And for a single diagonal step, step sequence C, no local transformation
of the time-axes is made.
The result is a function Fn that is defined in the following way:
Step sequence A. W(j) = [S(i+1) + S(i)]/2; W(j-1) = S(i); W(j-2) = [S(i)
+ S(i-1)]/2
Step sequence B. W(j) = [S(i+1) + 2*S(i)]/3; W(j-1) = [2*S(i) + S(i-1)]/3
Step sequence C. W(j) = S(i)
Step sequence D. W(j) = [S(i) + S(i-1)]/2
Step sequence E. W(j) = [S(i) + S(i-1) + S(i-2)]/3
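The mapping Fn defined above can be sketched in code. The sketch applies the formulas in forward order rather than by backtracking from pK, which is equivalent; it assumes the sequence of step labels has already been recovered from the DP path, and the handling of S(i+1) at the very end of the signal is our own boundary assumption.

```python
import numpy as np

def warp_signal(S, steps):
    """Map signal S onto the reference time axis.

    `steps` is the list of step-sequence labels ('A'..'E') along the optimal
    path, in temporal order.  Step A stretches one input step into three
    output samples, E compresses three input steps into one, etc.
    """
    S = np.asarray(S, dtype=float)
    W = [S[0]]          # grid point p1 = (1,1): copy the first sample
    i = 0               # current input index (0-based)
    for s in steps:
        if s == 'A':    # stretch: interpolation over three output samples
            i += 1
            nxt = S[min(i + 1, len(S) - 1)]       # boundary guard (assumption)
            W += [(S[i] + S[i-1]) / 2, S[i], (nxt + S[i]) / 2]
        elif s == 'B':  # stretch: interpolation over two output samples
            i += 1
            nxt = S[min(i + 1, len(S) - 1)]
            W += [(2*S[i] + S[i-1]) / 3, (nxt + 2*S[i]) / 3]
        elif s == 'C':  # diagonal step: no local transformation
            i += 1
            W.append(S[i])
        elif s == 'D':  # compression: average over two samples
            i += 2
            W.append((S[i] + S[i-1]) / 2)
        elif s == 'E':  # compression: average over three samples
            i += 3
            W.append((S[i] + S[i-1] + S[i-2]) / 3)
    return np.array(W)
```

A path consisting only of diagonal steps (C) returns the signal unchanged, as expected.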
As it is impossible to determine a meaningful warping function for the silent
intervals before and after the utterances, the best thing one can do is
to leave the time structure unchanged. This is achieved by letting the path
run diagonally (Fig. 2).
2.2.6. Averaging
For every physiological process the expected value of the time-normalized
signals must be computed. We prefer the median over the arithmetic mean
value, since it reduces the effect of outliers. The median signals are then
smoothed. In addition to the median value, a measure of the variation around
the median (the phonatory or articulatory variation) can also be important.
We found that the range spanned by all but the n largest and n smallest
values, where n is of course (much) less than half the number of available
tokens, is a useful measure of variation.
The method of averaging, described above, is appropriate for continuous
signals. But F0, one of the signals that has received much attention in
speech research, is a discontinuous signal. For unvoiced frames F0 was set
to zero. We found that taking the median value of F0 gives the appropriate
voiced-unvoiced decision and the desired average F0 value.
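Both steps can be sketched as follows. `X` is a hypothetical stack of time-normalized tokens, one per row; the trimmed-range measure is the one defined above, and the F0 values in the example are invented for illustration.

```python
import numpy as np

def median_and_range(X, n):
    """Per-sample median and trimmed range of time-normalized tokens.

    X: (n_tokens, n_samples) array of warped signals.
    The variation measure is the range spanned by all but the n largest
    and the n smallest values at each time instant.
    """
    Xs = np.sort(X, axis=0)
    return np.median(X, axis=0), Xs[-n - 1] - Xs[n]

# The same operation handles F0, with unvoiced frames set to zero: the
# median is zero (unvoiced) only when most tokens are unvoiced at that frame.
f0 = np.array([[0.0,   0.0, 118.0],
               [0.0, 120.0, 122.0],
               [0.0, 119.0,   0.0],
               [0.0, 121.0, 125.0],
               [0.0,   0.0, 130.0]])
print(np.median(f0, axis=0))  # frame 1: unvoiced (0); frames 2-3: voiced medians
```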
3. Experimental evaluation
To compare the methods of linear and non-linear time-alignment, data of
an experiment were used in which simultaneous recordings were made of the
acoustic signal, electroglottogram (EGG), lung volume (Vl), subglottal pressure
(Psb), supraglottal or oral pressure (Por), and electromyographic (EMG)
activity of the sternohyoid (SH) and vocalis (VOC) muscles. A male non-trained
subject was asked to produce an utterance spontaneously. His answer was:
"Ik heb het idee dat mijn keel wordt afgeknepen door die band" (I have the
feeling that my throat is being pinched off by that band). He was then asked
to repeat that sentence 29 times. All physiological signals were then pre-processed
to obtain signals with a sampling rate of 200 Hz. This experiment is described
in more detail elsewhere (Strik and Boves, 1991).
The original, spontaneous sentence deviated from the 29 repetitions because
in the original there was a pause of almost half a second, due to a swallowing
gesture of the subject. Thus, in order to minimize the risk that utterances
containing different articulatory gestures were averaged, only the last
29 sentences were used for analysis.
3.1. Variation in speaking rate
The oscillograms of three audio signals are shown in Fig. 4. It is obvious
that there are large differences in the durations of the utterances. The
mean length of the 29 utterances was 2310 ms (sd = 130 ms), while the maximum
and the minimum length were 2615 ms and 2165 ms, respectively.
The release of the /k/ of "keel" was used as the line-up point for the method
of linear time-alignment. This line-up point was chosen because it is expected
to be clearly distinguishable, and it is situated near the middle of the
sentence. The mean duration of the first part (from beginning to the line-up
point) was 880 ms (sd=80 ms), with a maximum of 1075 ms and a minimum of
780 ms. The mean duration of the last part (from line-up point to the end)
was 1430 ms (sd=70 ms); the maximum and minimum values were 1590 ms and
1320 ms. Therefore, one can hardly maintain that there is little variation
in the temporal structure of the signals. Also, the subject increased his
articulation rate as he repeated the utterances more often. But even for
the last six sentences the ranges for the first and last parts were 120
ms and 90 ms, respectively. So even after numerous repetitions the variation
is still so large that straightforward averaging of the tokens could result
in combining physiological signals of different articulatory movements.
3.2. Method of linear time-alignment
Although we did not expect linear time-alignment to produce meaningful results,
we still wanted to test its viability. In Fig. 5 the time-aligned transglottal
pressure (Ptr) signals, corresponding to the audio signals of Fig. 4, are
shown in the upper three windows. The timing differences are very large,
and the time-alignment is only reasonable just before and after the line-up
point. This is reflected in the average signal (Fig. 5, bottom trace) that
becomes increasingly meaningless towards both beginning and end of the utterance.
In Fig. 6 the average signals are plotted for F0, Intensity Level (IL),
Ptr, Por, Psb, Vl, SH and VOC. Especially for F0, IL and the pressure signals
it is apparent that the averages are only meaningful in the direct neighbourhood
of the line-up point.
3.3. Method of nonlinear time-alignment
For the method of non-linear time-alignment and averaging warping functions
were calculated for all tokens using the token with median length (2295
ms) as the template. These warping functions were then used to map the physiological
signals. Before averaging the signals, we checked whether the degree of
time-alignment, obtained by warping the signals, was sufficient.
To that end nine labels were placed manually in all 29 tokens at salient
acoustic events. The releases of unvoiced plosives were chosen, one of them
being the /k/ that was used as line-up point. The line-up points were used
to shift the signals, so after linear time-alignment the line-up points
are perfectly time-aligned. This is shown in Fig. 7, where the fifth label
is the /k/ that is used as line-up point. Away from the line-up point the
degree of time-alignment diminishes. Already for the two neighbouring labels,
labels 4 and 6, the timing differences are fairly large. The largest timing
differences were found at the beginning of the utterances. The warping functions
were then used to time-align the labels, and the result is shown in Fig.
8. Apart from some inaccuracies, all labels (i.e. the corresponding acoustic
events) seem to be aligned very well. Because the acoustic events of the
whole sentence are time-aligned by non-linear time-alignment, meaningful
averaging at this stage seems possible.
Median signals are plotted in Fig. 9. It can be seen that the median signals
are not only meaningful near the line-up point, but also towards beginning
and end of the utterance.
3.4. Variation in pronunciation
Non-linear time-alignment seems successful in time-aligning the acoustic
events of all utterances to a reasonable degree. However, for meaningful
averaging another requirement must be fulfilled, viz. that the different
realizations of the utterances are produced with essentially the same articulatory
gestures. After all, averaging the physiological signals belonging to utterances
that were produced very differently is not a meaningful procedure. We cannot
test whether the movements of the articulators were very much alike in the
different utterances, but we can check the amount of variation of some relevant
physiological signals of the speech production system between the utterances.
The dotted lines in Fig. 9 give an idea of the range of the middle 20 values
at each time instant (see method). From these traces we can infer that,
apart from Vl, the amount of variation of the physiological signals between
the different realizations of an utterance is within reasonable bounds.
4. Conclusions and discussion
Both for untrained and trained (see Strik and Boves, 1988) subjects a substantial
degree of time variation between repetitions of a medium length utterance
was found. Even after numerous repetitions these timing differences did
not disappear. With such differences in temporal structure, linear time-alignment
and averaging no longer seems a useful procedure with which to extract meaningful
relations.
A possible solution seems to be the following. Define several line-up points
in each repetition, time-align these line-up points, and do linear time-alignment
in between. However, the timing differences are not distributed uniformly,
and therefore the number of line-up points needed to obtain a reasonable
overall time-alignment would be very large.
We have shown that the method of non-linear time-alignment, presented here,
works satisfactorily, despite the mediocre signal-to-noise ratio of the
speech signals and the highly non-stationary character of the noise. Thus,
the technique of DP, developed in the framework of automatic speech recognition,
can also be a very useful tool in fundamental research for processing physiological
(or comparable) signals related to speech. After time normalization, median
values are obtained for all measured physiological quantities. These median
values can be used for further analysis.
The method of non-linear time-alignment has some further advantages. In
contrast with the method of linear time-alignment, this method also yields
an average signal for F0. Furthermore, the technique can be used (semi-)automatically,
which makes it very attractive in a research situation that
is characterized by the need to handle large numbers of signals.
Finally, the method can be used to time-align and average all kinds of signals
for which timing differences are apparent.
Acknowledgements
This research was supported by the Foundation for Linguistic Research, which
is funded by the Netherlands Organization for Scientific Research, N.W.O.
Special thanks are due to Harco de Blaauw who was the subject of the present
experiment, to Philip Blok M.D. who inserted the EMG electrodes and the
catheter, to Hans Zondag who helped in organizing and running the experiment,
and to Jan Strik who assisted in the processing of the data.
References
Atkinson, J.E. (1978) Correlation analysis of the physiological features
controlling fundamental voice frequency, Journal of the Acoustical Society
of America, 63, 211-222.
Baer, T.; Gay, T. and Niimi, S. (1976) Control of fundamental frequency,
intensity and register of phonation, Haskins Laboratory Status Report on
Speech Research, SR-45/46, 175-185.
Collier, R. (1975) Physiological correlates of intonation patterns, Journal
of the Acoustical Society of America, 58, 249-255.
Davis, S.B. and Mermelstein, P. (1980) Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences, IEEE
Transactions on Acoustics, Speech, and Signal Processing, ASSP-28, 357-366.
Maeda, S. (1976) A characterization of American English intonation. Ph.D.
thesis, MIT, Cambridge.
Markel, J.D. and Gray Jr., A.H. (1976) Linear prediction of speech. Berlin:
Springer-Verlag.
Paliwal, K.K. and Rao, P.V.S. (1982) Evaluation of various linear prediction
parametric representations in vowel recognition, Signal processing, 4, 323-327.
Sakoe, H. and Chiba, S. (1978) Dynamic programming algorithm optimization
for spoken word recognition, IEEE Transactions on Acoustics, Speech, and
Signal Processing, ASSP-26, 43-49.
Strik, H. and Boves, L. (1988) Averaging physiological signals with the
use of a DTW algorithm. In Proceedings SPEECH'88, 7th FASE Symposium, Edinburgh,
Book 3, 883-890.
Strik, H. and Boves, L. (1991) Control of fundamental frequency, intensity
and voice quality in speech. This issue.
Figure captions
Figure 1. (a) A graphical representation of the DP algorithm, with (b) the
five possible step sequences (A-E) in the symmetric DP algorithm when the
slope constraint condition is 1/2. Indicated in Italics are the weighting
coefficients wk.
Figure 2. A graphical representation of non-linear time-alignment, when
three line-up points are used. B indicates the beginning of the utterance,
E the end, and L an acoustic event near the middle of the utterance.
Figure 3. An example of the function Fn for time stretching (step sequence
A). In this example a straight line is used as the input signal S.
Figure 4. Oscillograms of the audio signals of three repetitions of the
same utterance. The straight vertical line at 1.3 s connects the line-up
points of the individual signals.
Figure 5. The three upper panels show the transglottal pressure signals
of the three utterances given in Figure 4. The lower panel contains
the average transglottal pressure signal for 29 repetitions. The straight
vertical line at 1.3 s connects the line-up points of the individual signals.
Figure 6. Average physiological signals for fundamental frequency, intensity
level, transglottal pressure, oral pressure, subglottal pressure, lung volume,
and electromyographic activity of the sternohyoid and vocalis muscles, obtained
by the method of linear time-alignment. The straight vertical line at 1.3
s connects the line-up points of the individual signals.
Figure 7. The labels of the 29 utterances after linear time-alignment. The
straight vertical line at 1.3 s connects the line-up points of the individual
signals.
Figure 8. The labels of the 29 utterances after non-linear time-alignment.
The straight vertical line at 1.3 s connects the line-up points of the individual
signals.
Figure 9. Median physiological signals (solid lines) for fundamental frequency,
intensity level, transglottal pressure, oral pressure, subglottal pressure,
lung volume, and electromyographic activity of the sternohyoid and vocalis
muscles, obtained by the method of non-linear time-alignment and averaging.
The dotted lines are a measure for the amount of variation (see text). The
straight vertical line at 1.3 s connects the line-up points of the individual
signals.