1. INTRODUCTION
Physiological signals related to the speech production system invariably contain some random activity. EMG signals in particular are always very noisy, but other signals, such as subglottal or oral pressure and lung volume, are usually far from clean as well. To extract meaningful relations between the various physiological processes it is necessary to improve the signal-to-noise ratio of these signals. For some applications a form of visual smoothing and comparison of various signals may be adequate. In many other situations, however, one would prefer to have reliable numeric measurement data that can be subjected to multivariate statistical processing.
The usual procedure for highlighting information in physiological signals is to time align and average the signals derived from various repetitions of the ‘same’ utterance [1]. One obvious requirement for this procedure is that the tokens to be averaged are all produced with essentially equivalent physiological gestures. The most apparent characteristic of such repetitions is that they all have a similar temporal structure. Trained speakers often succeed in repeating the same utterance with approximately the same duration, especially if the utterances are not too long and not too complex. But even repetitions of a fairly long sentence (duration approximately 4 sec) produced by persons experienced in acting as subjects in phonetic experiments, time aligned at a line-up point in the middle of the sentence, show deviations of as much as 150 msec towards both ends. With deviations of this magnitude straightforward averaging of tokens becomes questionable.
In this paper we propose a novel processing technique in which a dynamic time warping (DTW) algorithm is used to obtain such a degree of time alignment that averaging of tokens remains meaningful. The power of the procedure is shown by means of an example where straightforward time alignment does not yield useful results, whereas DTW clearly does.
2. METHOD
2.1. The Experiment
An experiment was carried out while the first author stayed at Haskins Laboratories in the fall of 1986. In this experiment simultaneous recordings of speech, electroglottogram (EGG), subglottal pressure (Ps), lung volume, and EMG activity of cricothyroid (CT), vocalis (VOC) and sternohyoid (SH) were obtained while a subject (a native speaker of Dutch) performed several speech tasks, among them the repeated production of a number of ‘meaningful’ Dutch sentences. The EMG signals were recorded using hooked wire electrodes that were inserted percutaneously. Subglottal pressure was recorded by means of a Millar microtip catheter transducer inserted pernasally and fed into the trachea via the posterior commissure of the glottis [2]. The subglottal pressure recordings were calibrated by having the subject blow into a manometer [3]. A more detailed description of this experiment can be found in Strik and Boves [4].
2.2. Speech Material
The speech material that is of relevance to the present paper consisted of four sentences comprising an increasing number of words. Each sentence had to be produced with three different intonation contours, i.e. a ‘flat hat pattern’ [5], two ‘pointed hats’ on the two syllables carrying pitch accents (one very early, the other towards the end of the sentence) and question intonation. Each sentence-contour pair was repeated five times before the next pair was attempted. The subject, who had ample experience with similar speech production experiments, practiced repeating the utterances before the start of the experiment. The example used to illustrate this paper is based on the longest sentence produced with the flat hat pattern. This sentence (U10) reads ‘Piet slikte zijn vierentwintig gele pillen gisteren liever in stilte met bier’. The sentence contains mainly high vowels, in order to prevent large movements of the hyoid bone due to articulatory demands [6]. Only four out of five repetitions yielded useful signals, because in one token a word was not pronounced.
2.3. Recording and Processing of Data
The physiological signals, the audio signal, an octal code and a timing pulse were recorded on a one inch, 14-channel instrumentation recorder [7]. The preprocessing of the data was done with the Haskins Laboratories EMG data processing system [1]. Fundamental frequency (F0) was derived from the EGG signal. This was done because the situation during the experiment precluded the recording of audio signals that are sufficiently clean to allow reliable automatic pitch tracking. After preprocessing all signals were sampled at a 200 Hz rate, copied onto digital magnetic tape and brought to our own laboratory for further processing. The acoustic speech waves were sampled at 10 kHz and also copied onto the tape, in such a way that time synchronization with the down-sampled physiological signals could be restored.
3. TIME ALIGNMENT
After preprocessing, the physiological signals were time aligned with the release of the /p/ of /pillen/ as line-up point. This line-up point was chosen because it is clearly distinguishable and is situated approximately in the middle of the sentence. In Fig. 1 the pitch tracks of the four useful tokens are shown in the upper four panels. It can be seen that the signals are approximately time aligned in the middle part of the tokens, but that the time alignment gets worse towards both ends. At the beginning of the utterances the maximum deviation is 120 msec, while at the end it is as much as 210 msec. If all time aligned signals are now averaged, then at both ends signals will be averaged that belong to different speech events. The result is a signal that becomes increasingly meaningless towards both the beginning and the end of the utterance, as can clearly be seen from the trace in the bottom panel of Fig. 1. In cases like this one might resort to single token processing, with all attendant difficulties, inconveniences (the processing, interpretation, and comparison of individual tokens is extremely time consuming) and dangers. We have, however, attempted to develop a viable alternative in the form of a more intelligent time alignment before averaging. This technique is presented in the next section.
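For concreteness, the straightforward procedure criticised here can be sketched as follows. This is our own illustration, not the Haskins processing system: the function and variable names are hypothetical, and we assume tokens stored as arrays sampled at 200 Hz, each with a known line-up sample index.

```python
import numpy as np

def align_and_average(tokens, lineup_idx):
    """Average tokens after rigid alignment at a common line-up point.

    tokens     : list of 1-D arrays (one physiological signal per token)
    lineup_idx : sample index of the line-up point in each token
    """
    # The shortest stretch available before and after the line-up point
    # across all tokens determines the common time window.
    pre  = min(lineup_idx)
    post = min(len(t) - i for t, i in zip(tokens, lineup_idx))
    # Crop every token to the common window around its line-up point,
    # so that the line-up samples coincide, then average per sample.
    cropped = [t[i - pre:i + post] for t, i in zip(tokens, lineup_idx)]
    return np.mean(cropped, axis=0)
```

With deviations of the magnitude reported above, samples near both ends of this common window stem from different speech events, which is why the averaged trace degrades towards its edges.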
4. TIME NORMALIZATION
One way to circumvent the problem of time variation in utterances that one would like to time align and average might seem to be to define a number of additional line-up points, to compute average signals around all line-up points, and to concatenate these locally correct averaged segments in such a way that a reliable overall representation of the data results. In the sentence at hand, obvious candidates for additional line-up points would be the very start of the utterance and all remaining voiceless plosives. It is far from apparent, however, how the concatenation should be effected in order to obtain meaningful results.
Variability of the temporal structure of different tokens of the ‘same’ word or phrase is a problem that has plagued automatic speech recognition from its very first practical implementation. Several linear and non-linear techniques for time normalizing test utterances to stored templates have been attempted, but no other proposal has been nearly as successful as the dynamic programming technique that has become known under the name Dynamic Time Warping [8].
DTW finds local distortions of the time axis of a test utterance such that the summed difference between the portions of the signals that are aligned is minimized in some sense. In searching for this optimal non-linear time alignment the additional constraint has to be satisfied that the maximum amount of local time distortion remains within reasonable bounds [9]. Once the optimal time warp function has been obtained, every test utterance can be aligned with the template by distorting its time axis according to the warp function. Thus, it seems worthwhile to investigate whether this property generalizes to physiological signals that have been recorded simultaneously with the acoustic speech signal and that are represented in the form of samples taken at equidistant intervals of 5 ms.
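To illustrate the principle, a minimal DTW of this kind might look as follows. This is our own sketch with a simple symmetric step pattern and a fixed band around the diagonal; it is not the exact algorithm of [8] or [9], and all names are hypothetical.

```python
import numpy as np

def dtw(template, test, band=50):
    """Minimal dynamic time warping between two feature sequences.

    template, test : arrays of shape (n_frames, n_coeffs)
    band           : half-width (in frames) of the band around the
                     diagonal that bounds the local time distortion
    Returns the summed Euclidean distance and the warp path as a
    list of (template_frame, test_frame) pairs.
    """
    n, m = len(template), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = np.linalg.norm(template[i - 1] - test[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Trace the optimal path back from the common end point, which
    # enforces the constraint that both ends of the warp function lie
    # on the main diagonal.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```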
Numerous parametric and non-parametric representations of speech signals can be used for computing the optimal time warp function. Similarly, a large number of distance measures for computing the summed distance between the representations of template and test utterance have been described in the literature [10]. In our work the speech signals were submitted to a 12th order autocorrelation LPC analysis using a 250 point Hamming window and a 5 ms window shift. The vectors of LPC coefficients were subsequently transformed to vectors of 12 cepstrum coefficients [11] that were used as input to a DTW algorithm that employed a simple Euclidean distance measure. Token 2, whose duration is closest to the average duration of the sentences, was taken as the template, and warp functions were computed for the remaining three sentences. Before the DTW was started, the test sentence and the template were time aligned in such a way that the line-up points derived from the /p/ in /pillen/ coincided. No special precautions were taken to account for the noisy quality of the speech recordings.
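The LPC-to-cepstrum transformation can be computed with the standard recursion for the cepstrum of an all-pole model [11]. The following is a sketch, assuming the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; the function name and variables are our own.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=12):
    """Convert LPC coefficients to cepstrum coefficients.

    a      : LPC coefficients a_1..a_p of A(z) = 1 + sum a_k z^{-k}
    n_ceps : number of cepstrum coefficients to compute
    """
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        # c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n = 0
        # for n > p (standard recursion for an all-pole model).
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a single-pole model 1/(1 - r z^-1) this reproduces the known cepstrum c_n = r^n / n, which provides a simple sanity check.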
In Fig. 2 the warp function is shown for the mapping of token 3 onto the template. The warp function is calculated under the restrictions that it must remain between the dashed lines, i.e. that there is an upper bound on the local time axis distortion, and that it must lie on the main diagonal at both the beginning and the end of the utterances. If the line-up points are chosen correctly in both sentences, the warp function will be on the diagonal at these points; this appears to be at least approximately true.
The warp functions are applied to all physiological signals belonging to the speech signals. In this way all signals are time normalized, and meaningful averaging of the physiological signals becomes possible. In Fig. 3 the fundamental frequency traces of the time normalized utterances are plotted, together with their average. All voiced parts are similarly spaced in time, and the average signal looks completely reasonable. Fig. 4 shows, reading from top to bottom, averaged values of F0, intensity level (IL), Ps, SH, CT, and VOC, obtained by means of straightforward time alignment. Although some general trends certainly remain visible in the three EMG signals, especially towards the middle of the utterance, the averaged versions are by no means clearer than single tokens. Averaged versions of F0 and IL have clearly lost all meaning, even in the direct neighbourhood of the line-up point. Fig. 5, on the other hand, contains averaged versions of the same signals time aligned and normalized using the DTW algorithm. Obviously, the temporal structure of all signals is completely retained and meaningful statistical processing is now possible.
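Applying a warp function to a simultaneously recorded physiological signal amounts to re-indexing its samples along the warp path. A minimal sketch, assuming the path obtained from the speech-based DTW is given as (template frame, test frame) pairs and that the physiological signals share the 5 ms frame interval; the function name is hypothetical.

```python
import numpy as np

def warp_signal(signal, path, n_template):
    """Time-normalize a physiological signal onto the template time axis.

    signal     : 1-D array sampled at the same frame rate as the
                 speech features used for the DTW
    path       : list of (template_frame, test_frame) pairs
    n_template : number of frames on the template time axis
    Where several test frames map onto one template frame they are
    averaged, so the result has exactly one sample per template frame.
    """
    out = np.zeros(n_template)
    counts = np.zeros(n_template)
    for i, j in path:
        out[i] += signal[j]
        counts[i] += 1
    return out / np.maximum(counts, 1)
```

Once every token has been warped onto the template time axis in this way, an ordinary per-sample mean over tokens yields the averaged traces.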
Thus it appears that DTW, as a way of time aligning signals before averaging, enables one to average noisy physiological signals even in cases where there is so much time variation in the utterances to be processed that straightforward time alignment may mis-align complete syllables. These results were obtained despite the noisy character of the speech signals that were used for the computation of the time warp function. We have limited our work to a single straightforward implementation of the DTW technique. Although it would have been quite easy to run experiments with other choices for the parameters and the distance metric in the computation of the time warp function, it is less easy to think of sensible formal ways to compare the quality of the results. Moreover, we do not believe that these choices are critical for the purpose at hand.
5. CONCLUSIONS
Even when trained subjects are used in speech production experiments in which physiological signals are recorded that are inherently so noisy that averaging of multiple tokens is necessary, the amount of time variation between different tokens of the same word or phrase may be so large that straightforward averaging becomes a meaningless procedure. With untrained subjects, the problem of time variation between repetitions of the same stimulus already becomes noticeable with very short and simple utterances.
We have shown that the technique of Dynamic Time Warping, developed in the framework of automatic speech recognition, can also be a very useful tool in fundamental speech research when it comes to averaging physiological (or comparable) signals. The technique can be used (semi-)automatically, which makes it very attractive in a research situation characterized by the need to handle large numbers of signals. The technique works satisfactorily despite the mediocre signal-to-noise ratio of the speech signal and the highly non-stationary character of the noise.
ACKNOWLEDGEMENTS
This research was supported by the Foundation for Linguistic Research, which is funded by the Netherlands Organization for the Advancement of Scientific Research (N.W.O.). Special thanks are due to Haskins Laboratories, New Haven, Conn.; to Dr. Thomas Baer, who helped make the stay of the first author at Haskins possible and helped organize and run the experiment; and to Dr. Hiroshi Muta, who inserted the EMG electrodes and the subglottal pressure sensor.
6. REFERENCES
[1] D. Kewley-Port (1973). Computer processing of EMG signals at Haskins Laboratories. Haskins Laboratories Status Report on Speech Research SR-37/38: 173-183.
[2] L. Boves (1984). The phonetic basis of perceptual ratings of running speech. Floris Publications, Dordrecht.
[3] B. Cranen and L. Boves (1985). Pressure measurements during speech production using semiconductor miniature pressure transducers : Impact on models for speech. J. Acoust. Soc. Am. 77: 1543-1551.
[4] H. Strik and L. Boves (1987). Regulation of intensity and pitch in chest voice. Proceedings 11th International Congress of Phonetic Sciences, Tallinn, Vol. VI: 32-35.
[5] J. ‘t Hart and R. Collier (1975). Integrating different levels of intonation analysis. J. of Phonetics 3: 235-255.
[6] R. Collier (1975). Physiological correlates of intonation patterns. J. Acoust. Soc. Am. 58: 249-255.
[7] D.K. Port (1971). The EMG data system. Haskins Laboratories Status Report on Speech Research SR-25/26: 67-72.
[8] T.K. Vintsyuk (1968). Recognition of spoken words by the dynamic programming method. Kibernetika 1: 81-88.
[9] H. Sakoe and S. Chiba (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-26: 43-49.
[10] B.A. Hanson and H. Wakita (1987). Spectral slope distance measures with linear prediction analysis for word recognition in noise. IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-35: 968-973.
[11] J.D. Markel and A.H. Gray (1976). Linear prediction of speech. Springer-Verlag, Berlin.