Data processing of physiological signals related to speech
H. Strik & L. Boves (1988e)
AFN-Proceedings, University of Nijmegen, Vol. 12, pp. 41-55.

0. Abstract

The physiological control of 'spontaneous' speech of an untrained subject was studied. One of the measured physiological signals is transglottal pressure (Pt), because Pt is believed to be one of the main factors in the control of the vibratory behaviour of the vocal folds. The data were analyzed using a novel processing technique, because the variation in the time structure of the individual repetitions is too large to make meaningful straightforward averaging possible. The results show that speech gestures are reproducible, and that Pt indeed is one of the most important factors in the control of fundamental frequency (F0) and intensity level (IL) in speech. However, the results do depend on the analysis method used.

1. Introduction

Studies of the physiological control of speech parameters like F0 and IL have traditionally been characterized by the use of trained phoneticians as subjects, and by the use of specially designed speech material, mostly consisting of a small number of short sentences that had to be produced in several ways (different intonation contours, different placement of emphatic stress, etc.). Another quite general characteristic of those studies is the fairly global way of processing the measurement data obtained in the course of the experiments.

This state of affairs has a number of potentially detrimental consequences. One of the most important disadvantages is that what we know about the physiological control of speech parameters is true only under the very limited condition that the talker is trained to serve as a subject in phonetic experiments and that he knows precisely how to produce a specific phonetic effect in a consistent manner. Especially if relations between physiological control parameters and speech parameters are derived from single tokens, or averages of small numbers of tokens, there is the risk that the observations actually are idiosyncrasies that may not be generalized to other talkers or other speech conditions. Lastly, but not least important, the global data processing, sometimes consisting of visual averaging of traces and visual estimation of correlation coefficients, may result in an emphasis of expected relations at the cost of other, perhaps stronger but unexpected, relations. Moreover, the time span of the analysis may also influence the results to an unwanted degree.

In this paper we make a first attempt at addressing the methodological problems encountered in speech physiology research. First of all, we used a subject who had no previous training as a phonetician, linguist or singer. Secondly, he was asked to repeat a 'spontaneously' produced, fairly complex sentence 29 times. Finally, we have analyzed the relations between physiological and speech parameters on different levels, using formal statistical techniques.

Physiological signals that are related to the speech production system invariably contain some random activity. Especially electromyographic (EMG) signals are always very noisy, but other signals like subglottal pressure (Psb), supraglottal pressure (Psp), and lung volume (Vl) are usually also far from clean. To extract meaningful relations between the various physiological processes it is necessary to improve the signal-to-noise ratio of these signals.

The usual procedure to highlight information in physiological signals is to time-align and average the signals derived from various repetitions of the 'same' utterance (Kewley-Port, 1973). For meaningful averaging of physiological signals related to speech, two requirements must be fulfilled. The first is that speech gestures must be reproducible, for it is the causal relation between a physiological process and a speech gesture that one wants to reveal. The second requirement is that the inter-token variation in the temporal structure of the repetitions must be small. Note that these are really different requirements. Two tokens can be produced with essentially equivalent articulatory gestures, while the speech rate is still very different. On the other hand, similar overall articulation rates in two tokens do not guarantee that they were produced with similar articulatory gestures.

Trained speakers often succeed in repeating the same utterance with essentially equivalent articulatory gestures and with approximately the same duration, especially if the utterances are not too long and too complex. But repetitions of a fairly long sentence (duration ±4 s) produced by a person who had experience in acting as a subject in phonetic experiments, time-aligned with a line-up point in the middle of the sentence, show deviations of as much as 150 ms towards both ends (Strik and Boves, 1988). For the repetitions produced by the 'linguistically naive' subject of the present experiment the deviations were even considerably greater. With deviations of this magnitude straightforward averaging of tokens becomes questionable.

Therefore, before averaging the physiological signals, the time axes of the repetitions must be adjusted such that the speech gestures of the various repetitions are not only time-aligned at the line-up point, but at every time point of the utterance. For this purpose a novel processing technique is proposed, in which a dynamic time warping (DTW) algorithm is used to obtain a sufficient degree of time-alignment.

2. Method

2.1 Experimental procedure

In this experiment simultaneous recordings of the acoustic signal, electroglottogram (EGG), Vl, Psb, Psp, and EMG activity of the sternohyoid (SH), cricothyroid (CT) and vocalis muscle (VOC) were obtained while the subject carried out a large variety of speech tasks. All measured signals were stored on a 14-channel instrumentation recorder (TEAC XR-510).

The speech signal was transduced by a condenser microphone (B&K 4134) placed about 10 cm in front of the mouth, and amplified by a measuring amplifier (B&K 2607).

The pressure signals were recorded using a catheter with four pressure transducers, in the way described by Cranen and Boves (1985). The catheter was inserted pernasally and fed through the glottis into the trachea. The pressure measurements were calibrated by recording the pressure signals while the subject held lung pressures of up to 20 cm H2O against a water-filled U-tube manometer. The catheter, situated in the posterior commissure of the glottis, did not have a noticeable effect on phonation.

The EMG signals were recorded using hooked-wire electrodes (Hirose, 1971). The electrodes were inserted percutaneously, and correct electrode placement was confirmed by audio-visual monitoring of the signals during various functional manoeuvres. The EMG signals were calibrated by recording them while a 200 mV sine was used as input for the pre-amplifiers.

The perimeters of chest (X) and abdomen (Y) were measured with mercury-filled strain-gauge wires. Lung volume was calculated from the weighted sum of these two signals: Vl=aX+bY+c. To obtain the relative contribution (a/b) of each signal the subject performed paradoxical movements (chest out, abdomen in and vice versa) with a closed glottis, i.e. at constant lung volume. In a second lung volume calibration manoeuvre the subject filled and emptied a balloon of fixed size several times. This gave absolute values for a and b. In our set-up the value of c cannot be determined, so that the measurements are confined to relative lung volumes. This is, however, sufficient for computing average flow.
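The calibration logic above can be sketched as follows. This is a minimal, noise-free two-equation solution, not the authors' actual procedure; the function name and the numeric excursions are hypothetical.

```python
import numpy as np

def calibrate_lung_volume(dx_paradox, dy_paradox, dx_balloon, dy_balloon, v_balloon):
    """Recover weights a, b in Vl = a*X + b*Y + c from two manoeuvres.

    dx_paradox, dy_paradox: chest/abdomen excursions during a paradoxical
        movement with closed glottis (constant lung volume).
    dx_balloon, dy_balloon: excursions while emptying a balloon of known
        volume v_balloon (in cc).
    """
    # Constant volume: a*dx + b*dy = 0  =>  a/b = -dy/dx
    ratio = -dy_paradox / dx_paradox          # a / b
    # Known volume: a*dx + b*dy = v_balloon, with a = ratio * b
    b = v_balloon / (ratio * dx_balloon + dy_balloon)
    a = ratio * b
    return a, b

# Hypothetical excursions (arbitrary strain-gauge units)
a, b = calibrate_lung_volume(2.0, -1.0, 1.5, 1.0, 800.0)
```

The offset c drops out of both equations, which is why only relative lung volumes can be obtained.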

The subject was a male native speaker of Dutch, with no experience in phonetics or linguistics and with no history of respiratory or laryngeal dysfunction. Near the end of the experiment he was asked to produce an utterance spontaneously. The Dutch sentence he produced was: "Ik heb het idee dat mijn keel wordt afgeknepen door die band" (I have the feeling that my throat is being pinched off by that band). After he spoke this sentence, he was asked to repeat it 29 times. During the experiment the quality of the recordings of VOC and CT decreased considerably. For the utterances described in this paper, which were recorded near the end of the experiment, these two signals were not considered useful for analysis.

2.2 Data preparation

All signals were A/D converted at a 10 kHz sampling rate. The files were stored on a microVAX computer.

F0 and IL were calculated with the SIF program of ILS. Both values were calculated every 5 ms, resulting in F0 and IL signals sampled at a 200 Hz rate. F0 was calculated from the EGG signal, because this gave better results than the F0 calculated from the audio signal. Still, the F0 contours contained so many errors that they had to be corrected manually. IL was calibrated with the use of a pistonphone (B&K 4220). The physical unit of F0 is Hz, and that of IL is dB re 10^-12 W/m^2.

Pressure signals, chest and abdomen signals were low-pass filtered (third order digital elliptic low-pass filter, pass band edge 30 Hz) and downsampled to 200 Hz. For the correlation analyses described in this article only the DC-component of the pressure signal is needed. Lung volume was calculated from the low-pass filtered chest and abdomen signals. The physical units of pressure and of lung volume are cm H2O and cc, respectively.
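A comparable filtering and decimation step might look as follows in Python. The paper does not state the filter's ripple specifications, so the 0.5 dB pass-band ripple and 40 dB stop-band attenuation below are assumptions, as is the use of zero-phase filtering.

```python
import numpy as np
from scipy.signal import ellip, filtfilt

fs = 10_000            # original sampling rate (Hz)
fs_out = 200           # target rate (Hz)
decim = fs // fs_out   # downsampling factor: 50

# Third-order elliptic low-pass, pass-band edge 30 Hz
b, a = ellip(3, 0.5, 40.0, 30 / (fs / 2))

def lowpass_and_downsample(x):
    y = filtfilt(b, a, x)      # zero-phase filtering (assumed)
    return y[::decim]          # 10 kHz -> 200 Hz

x = np.random.randn(20_000)    # 2 s of hypothetical pressure signal
y = lowpass_and_downsample(x)  # 400 samples at 200 Hz
```

Filtering before decimation keeps the DC component, which is all the later correlation analyses need, while suppressing aliasing of faster components.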

Because the EMG signals sometimes contained a disturbing 50 Hz hum and/or a small DC-offset, they were high-pass filtered (third order digital elliptic high-pass filter, pass band edge 200 Hz). The EMG signals recorded with hooked-wire electrodes are interference patterns of (usually a few) single motor unit potentials. Since the motor unit potentials are sharp spikes, high-pass filtering hardly affects them. The integrated rectified EMG was calculated in the way described by Basmajian (1975): first the signal is full-wave rectified, and then it is integrated over successive periods of 5 ms. The integrator is reset after each integration. The physical unit of the EMG signals is mV.
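The rectify-and-integrate step can be sketched as follows. The window arithmetic (5 ms at 10 kHz) follows the text, but expressing each window as a mean rectified value in mV is an assumption of this sketch.

```python
import numpy as np

def integrated_rectified_emg(emg, fs=10_000, win_ms=5):
    """Full-wave rectify, then integrate over successive 5 ms windows,
    resetting the integrator after each window (Basmajian-style).
    Each window is expressed as its mean rectified value (mV)."""
    win = int(fs * win_ms / 1000)              # 50 samples per window
    n = len(emg) // win
    rect = np.abs(emg[:n * win])               # full-wave rectification
    # Averaging within each window approximates the reset integrator
    return rect.reshape(n, win).mean(axis=1)   # one value per 5 ms

emg = np.random.randn(10_000)          # 1 s of hypothetical EMG at 10 kHz
ire = integrated_rectified_emg(emg)    # 200 values -> a 200 Hz signal
```

The output rate matches the 200 Hz rate of the other downsampled signals, so the integrated EMG can enter the same correlation analyses directly.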

The EMG signal is a measure of the electric potential in a muscle. But the mechanical action of a muscle lags behind the main burst of electric potentials. Therefore, in order to correlate muscle activity with the resultant acoustic event, the EMG signal must be shifted forward in time. Discrete cross-correlation functions between F0 data and SH data were calculated. Following Atkinson (1978), the mean response time (MRT) of the muscle was defined as the average lag at which the cross-correlation functions reach their maxima. All SH signals were shifted forward over their MRT of 190 ms. The value found by Atkinson (1978) was 120 ms. At the moment we have no explanation for the discrepancy between his value and ours.
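The lag search behind such an MRT estimate can be sketched as follows; the function name and the synthetic signals are illustrative, not the authors' implementation.

```python
import numpy as np

def response_time_ms(emg, f0, fs=200, max_lag_ms=400):
    """Lag (in ms) at which the cross-correlation between an integrated
    EMG trace and the F0 contour peaks, both sampled at 200 Hz.  The EMG
    is assumed to lead F0, so only non-negative lags are searched."""
    max_lag = int(max_lag_ms * fs / 1000)
    corrs = [np.corrcoef(emg[:len(emg) - k], f0[k:])[0, 1]
             for k in range(max_lag + 1)]
    return 1000 * int(np.argmax(corrs)) / fs

# Synthetic demo: F0 repeats the EMG pattern 38 samples (190 ms) later
rng = np.random.default_rng(0)
base = rng.standard_normal(1038)
emg_trace = base[38:]
f0_trace = base[:-38]
mrt = response_time_ms(emg_trace, f0_trace)   # 190.0
```

The SH traces would then be shifted forward over this lag before entering the correlation analysis.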

2.3 Data processing

2.3.1 Time-alignment

To improve the signal-to-noise ratio of the physiological signals the method of time-alignment and averaging is often used (see Introduction). For this method to be used the subject has to produce several repetitions of the 'same' utterance. Line-up points are defined in each of the tokens, and with these line-up points the signals of the repetitions are time-aligned. Physiological signals are then averaged. However, F0 (and/or IL) signals usually are not averaged. The F0 (and/or IL) contour of one of the repetitions is chosen to represent the 'average' F0 (and/or IL) contour (Collier, 1975; Maeda, 1976).
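The conventional line-up-and-average procedure can be sketched as follows, assuming equal sampling rates across tokens; the line-up indices and signals are hypothetical.

```python
import numpy as np

def lineup_average(signals, lineups):
    """Time-align repetitions at their line-up samples and average them.
    `signals` is a list of 1-D arrays, `lineups` the line-up sample index
    of each repetition; the average covers the span all tokens share."""
    left = min(lineups)                                        # samples before line-up
    right = min(len(s) - l for s, l in zip(signals, lineups))  # samples after
    stack = np.stack([s[l - left : l + right]
                      for s, l in zip(signals, lineups)])
    return stack.mean(axis=0)

# Three hypothetical repetitions of different lengths
sigs = [np.sin(np.linspace(0, 10, n)) for n in (450, 500, 480)]
avg = lineup_average(sigs, [200, 230, 210])
```

Note that this guarantees alignment only at the line-up point itself, which is exactly the limitation the time-normalization of the next section addresses.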

Previous results (Strik and Boves, 1988) already revealed that even with a trained speaker and prescribed sentences, the variation in the temporal structure can be too large to make meaningful averaging possible. It seems necessary to preprocess the signals before averaging them. Therefore, a novel processing technique was developed, which is described in the next section.

2.3.2 Time-normalization

In the novel method the same line-up points, as described above, are used to time-align the signals of all repetitions. The repetition with median length is then chosen as the most representative one, and used as a reference for time-normalization of the remaining tokens. In the following this reference utterance will be called 'the template'. In order to effect time-normalization, cepstrum coefficients are calculated for all speech signals, one set of cepstrum coefficients for every 5 ms of speech. Using a dynamic time warping (DTW) algorithm, the warp functions between the speech signals of all tokens and the speech signal of the template are calculated (Vintsyuk, 1968). DTW finds the local distortions of the time axis of a test utterance, relative to the template, in such a way that the summed spectral distance between the portions of the signals that get aligned is minimized. In searching for this optimal non-linear time-alignment function, the additional constraint has to be satisfied that the maximum amount of local time distortion remains within reasonable bounds (Sakoe and Chiba, 1978). The maximum local time distortion allowed by the DTW algorithm used in the present research was 200 ms in either direction.
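A minimal sketch of DTW with a Sakoe-Chiba band is given below. This is not the authors' implementation: the frame sequences and the Euclidean frame distance stand in for the cepstral representation and spectral distance, and 40 frames corresponds to the 200 ms bound at the 200 Hz frame rate.

```python
import numpy as np

def dtw_path(test, ref, band=40):
    """Minimal DTW between two frame sequences (n_frames x n_coeffs)
    with a Sakoe-Chiba band of +/- `band` frames.  Returns the warp
    path as a list of (i_test, j_template) index pairs."""
    n, m = len(test), len(ref)
    D = np.full((n + 1, m + 1), np.inf)      # cumulative distance matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            d = np.linalg.norm(test[i - 1] - ref[j - 1])   # frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the optimal alignment
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

For two identical frame sequences the recovered path is the main diagonal, i.e. no time distortion.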

Each warp function maps the 200 Hz speech analysis file of a repetition onto the 200 Hz speech analysis file of the template. The warp function of a repetition is then used to distort the time axes of all physiological signals belonging to that repetition, resulting in physiological signals whose time axes are adjusted with reference to the template. Henceforth we will call this procedure 'time-normalization'. After time-normalization, median signals were calculated: at every point in time the median of the 29 values of the time-normalized signals was taken. This was also done for F0 and IL. For F0 this appeared to yield a good V/UV decision criterion for each sample of the utterance.
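Applying a warp path to a physiological signal and taking per-sample medians could be sketched as follows; the rule of averaging test frames that map onto the same template frame, and all data below, are illustrative assumptions.

```python
import numpy as np

def warp_to_template(signal, path, template_len):
    """Map a repetition's 200 Hz signal onto the template time axis using
    a DTW warp path (list of (i_test, j_template) pairs).  Where several
    test frames map to one template frame, their mean is taken."""
    out = np.zeros(template_len)
    counts = np.zeros(template_len)
    for i, j in path:
        out[j] += signal[i]
        counts[j] += 1
    return out / np.maximum(counts, 1)

# Hypothetical stack of 29 time-normalized F0 contours, with unvoiced
# samples coded as 0: the per-sample median is > 0 exactly when at least
# 15 of the 29 tokens are voiced, which doubles as the V/UV decision.
f0_stack = np.abs(np.random.randn(29, 500)) * 100 + 80
median_f0 = np.median(f0_stack, axis=0)
voiced = median_f0 > 0
```

Coding unvoiced samples as 0 makes the median of 29 values a majority vote: the 15th ordered value is nonzero only if a majority of tokens is voiced at that sample.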

2.3.3 Correlation analysis

After time-normalization, the physiological signals of 29 utterances and the median physiological signals were available for further analysis. All signals had a sampling rate of 200 Hz. Further analysis is mainly based on the median signals, although in one case the original signals are used.

All physiological signals, except F0, are continuous functions of time. F0 is a discontinuous function, because it is non-existent during both the voiceless intervals and silent periods. It is not the purpose of this experiment to study the distinction between voiced and unvoiced speech; many laryngeal muscles and Pt are involved in this process. Rather the purpose is to study the control of speech during voicing. Therefore, only the voiced portions of the utterances were used in calculating the correlation coefficients.

Correlation coefficients were calculated, for many different data sets, between all possible pairs of the six variables (i.e. 15 pairs) using the Pearson-Product-Moment formula. Each data set contains a number of 6-dimensional data vectors, consisting of the values of F0, IL, Pt, Psp, Psb and SH at the same point in time. The first data set was created by appending all 29 utterances. It comprised data vectors for all 7937 voiced samples. Three smaller data sets were created by taking the data vectors pertaining to the largest voiced intervals of the median signals. Finally, an additional data set was created consisting of data vectors for all voiced samples of the median signals.
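The pairwise correlation computation can be sketched as follows, with hypothetical data in place of the measured vectors.

```python
import numpy as np

names = ["F0", "IL", "Pt", "Psp", "Psb", "SH"]
# Hypothetical data set: one 6-dimensional vector per voiced 5 ms sample
data = np.random.randn(7937, 6)

R = np.corrcoef(data, rowvar=False)      # 6 x 6 Pearson correlation matrix
# The 15 distinct variable pairs are the upper triangle of R
pairs = [(names[i], names[j], R[i, j])
         for i in range(6) for j in range(i + 1, 6)]
```

With 6 variables there are 6*5/2 = 15 pairs, matching the 15 coefficients reported in each table.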

3. Results

None of the 29 repetitions of the spontaneous sentence included an inhalation pause. In the original, spontaneous sentence there was a pause of almost half a second, due to a swallowing gesture of the subject. Similar gestures did not occur in the 29 repetitions. Thus, in order to minimize the risk that utterances containing different articulatory gestures were averaged, only the last 29 sentences were used for analysis.

The total duration of the utterances and the pauses between them is ±300 s. The total duration of the utterances alone is ±67 s. In total there are 7937 voiced samples, almost 40 seconds of voiced speech. The average number of voiced samples per token is 274 (sd=18). The individual voiced intervals of the utterances have lengths varying from 10 to 595 ms.

3.1 Variability of speech gestures

In the introduction it was already mentioned that the two requirements for meaningful averaging of physiological signals are little inter-token variation in temporal structure and little inter-token variation in speech gestures. Both types of variation are studied in the next sections.

3.1.1 Variation in temporal structure

The average length of the utterances produced by our subject was 2310 ms (sd=130). The maximum and the minimum were 2615 ms and 2165 ms.

The release of the /k/ of /keel/ was used as a line-up point. This line-up point was chosen because it is clearly distinguishable, and it is situated approximately in the middle of the sentence. After defining this line-up point in each sentence it was possible to calculate the variation in the first part (from beginning to the line-up point) and the last part (from line-up point to the end) of the utterances. The average duration of the first part was 880 ms (sd=80), with a maximum of 1075 ms and a minimum of 780 ms. The average duration of the last part was 1430 ms (sd=70); the maximum and minimum value were 1590 ms and 1320 ms.

The results show that one can hardly maintain that there is little variation in the temporal structure of the signals. The maximum deviations for the first and last part were 295 ms and 270 ms, respectively. In general, it appeared that the subject tended to increase his articulation rate as he repeated the utterances more often. But even for the last six sentences the maximum deviations in the first and last part were 120 ms and 90 ms, respectively. So even after numerous repetitions the magnitude of the deviations is still so large that straightforward averaging of the tokens could result in averaging physiological signals of one word in one sentence with those of a totally different word in another sentence. With such differences in temporal structure, time-alignment and averaging no longer seems a useful procedure to extract meaningful relations.

In Figure 1 the average signals are plotted for, from top to bottom, F0, IL, Pt, Psp, Psb, Vl and SH. It can clearly be seen that the averages are only meaningful in the direct neighbourhood of the line-up point.

3.1.2 Variation in speech gestures

Because the variation in the temporal structure of the 29 repetitions is very large, it appears to be necessary to normalize the time axes of the utterances before the signals are averaged. All signals were 'time-normalized' using repetition 5 as the template. After time-normalization, median signals were calculated. At each point in time, for every physiological quantity, the 29 values were ordered from low to high; the median then is the 15th value. The median signals are plotted in Figure 2. To give an idea of the amount of variation around the median, traces for the 5th and the 25th value are also plotted (dotted lines).

From Figure 2 we can conclude that the method of time-normalization worked satisfactorily, because the variation in temporal structure that remains around the median is reasonably small. The second conclusion that can be drawn from Figure 2 is that the variation in the magnitude of the signals is also very small. The largest variations were found for Vl. Fairly large variations were also found for the first word of the utterance. In some utterances the first word was clearly pronounced; in these cases IL and Pt were large, and part of the word was voiced. But in most utterances the clitic version of the personal pronoun was used, turning the pronunciation /Ikh?p/ into /k?p/; in these cases IL and Pt were small, and no part of the pronoun was voiced.

Correlation coefficients were calculated for all voiced samples of the 29 utterances (see Table I). All correlations are highly significant (p<0.0001), reflecting the consistent relations between the variables. Thus, apart from Vl, it seems that speech gestures are reproducible, and the first requirement mentioned above is met. So, after time-normalization, averaging seems useful to extract meaningful relations between the various physiological processes.

Table I. Correlation matrix, mean and standard deviation for all voiced samples.


        F0      IL      Pt     Psp     Psb      SH    mean     SD

F0   1.000   0.561   0.556  -0.068   0.621  -0.382  117.01   9.98
IL           1.000   0.828  -0.574   0.439  -0.207   64.42   4.49
Pt                   1.000  -0.626   0.601  -0.309    4.97   1.32
Psp                         1.000    0.248  -0.114    0.90   1.09
Psb                                  1.000  -0.500    5.87   1.06
SH                                           1.000    8.65  15.10


|R|>0.046 for p<0.0001


3.2 Physiological control of speech

The top panel of Figure 2 shows the oscillogram of the audio signal of repetition 5, the utterance used as a reference for time-normalization. The median signals in the other panels cannot be directly related to this audio signal, because they are not the physiological signals belonging to this particular token but averages of 29 repetitions. For instance, the audio signal may not seem periodic during the last schwa, but if it is periodic in at least 15 of the 29 tokens, then the median F0 value will indicate that it is voiced. The median physiological signals may, however, be compared with each other.

If the data are analyzed quantitatively, using a correlation method, then the results are dependent on the time domain over which the correlations are calculated. The control of F0 and IL on word level is addressed in section 3.2.1, and the control of F0 and IL on sentence level in section 3.2.2. Other apparent results are briefly mentioned here.

One clear result is that the Vl traces of the individual repetitions run parallel, but the top and the bottom trace are separated by ±400 cc. On the other hand, the Psb values show little variation between the individual repetitions, which indicates that it is possible, at least for this subject, to produce the same subglottal pressures with different lung volumes.

Apart from the slow decline in Psb, there is little variation in the median value of Psb. The largest local variation is probably the rise during the /k/ of /keel/, just before the word with emphatic stress. This is probably a combined effect of the prolonged obstruction of the vocal tract and the extra effort of the expiratory muscles that assist the laryngeal muscles in raising the F0 of the vowel immediately following this consonant.

3.2.1 Control on word level

Before analyzing the control of F0 and IL, the F0 and IL signals in Figure 2 are first examined in order to assess variation between the tokens. Considerable inter-token variations were found for the first word, as explained in section 3.1.2. Except for this first word, the inter-token variation in intensity is very small. The inter-token variation in F0 is somewhat larger. This is not because the inter-token variation in the absolute value of F0 is large, but mainly because there is a fairly large inter-token variation in the voiced/unvoiced decision of a few consonants. Some consonants are always voiced, others are always unvoiced, but there are consonants that are voiced in some tokens and unvoiced in others. As mentioned above (see data processing), an F0 sample is classified as voiced if it is voiced in at least 15 of the 29 tokens.

A large number of studies on F0 and IL in speech have reported a positive relation between F0 and Psb (Collier, 1975; Atkinson, 1978; Baer, 1979; Shipp, Doherty and Morrissey, 1979; Gelfer, Harris, Collier and Baer, 1983; Titze and Durham, 1987), and a positive relation between IL and Psb (van den Berg, Zantema and Doornenbal, 1957; Rubin, 1963; Isshiki, 1964; Ladefoged, 1967; Bouhuys, Mead, Proctor and Stevens, 1968; Baer, Gay and Niimi, 1976). But the results of the present experiment suggest that it is Pt that covaries most closely with F0 and IL (see Figure 2). This visual impression was tested by calculating the correlation coefficients for the three largest voiced intervals (Table II). The results clearly show that Pt is far more important in the control of F0 and IL than Psb.

The correlation between Pt and Psp is almost -1 for all three segments. Therefore it would have been possible, on purely statistical grounds, to substitute Psp for Pt in all occurrences above. But on physiological grounds it seems more reasonable to state that it is Pt that is important in the control of the vibratory behaviour of the vocal folds.

Table II. Correlation matrix, mean and standard deviation for the three largest voiced intervals of the median physiological signals.


        F0      IL      Pt     Psp     Psb      SH    mean     SD

F0   1.000   0.838   0.914  -0.841   0.368   0.043  117.80   3.17
IL           1.000   0.923  -0.970   0.035  -0.045   63.27   3.44
Pt                   1.000  -0.941   0.266   0.125    5.40   0.88
Psp                         1.000   -0.195  -0.048    1.14   0.86
Psb                                  1.000   0.248    6.35   0.13
SH                                           1.000    1.27   0.04

|R|>0.315 for p<0.01

F0   1.000   0.833   0.869  -0.921   0.235  -0.460  127.31   2.40
IL           1.000   0.933  -0.920   0.478  -0.433   64.84   1.53
Pt                   1.000  -0.987   0.558  -0.617    5.39   1.03
Psp                         1.000   -0.439   0.571    0.75   1.02
Psb                                  1.000  -0.703    5.94   0.18
SH                                           1.000    7.00   7.63

|R|>0.369 for p<0.01

F0   1.000   0.784   0.837  -0.721   0.541   0.172  110.06   4.36
IL           1.000   0.973  -0.942   0.316   0.138   57.96   4.66
Pt                   1.000  -0.938   0.401   0.172    3.93   1.00
Psp                         1.000   -0.079  -0.337    1.22   0.98
Psb                                  1.000  -0.354    5.04   0.24
SH                                           1.000    6.37   5.78

|R|>0.313 for p<0.01


In section 3.2.2 we will try to explain why Pt is more important than Psb on a local level; here we will try to give a physiological explanation of the importance of Pt in the control of F0 and IL. This explanation is achieved best by dividing the problem into the control during vowel production and the control during consonant production. Of course, there are also intermediate states, but they can be seen as interpolations between both extremes.

For vowel production it is possible to derive a direct causal relation between Pt on the one hand, and F0 and IL on the other. During vowel production there are no obstructions in the vocal tract, and the acoustic impedance of the glottis is much larger than the impedance of the vocal tract. Therefore, Psp is almost zero and Pt is almost equal to Psb (see Figure 2). Titze and Durham (1987) showed that during stable phonation the maximum glottal width (Gm) changes as a function of Psb (=Pt) alone (the activity of all laryngeal muscles is assumed to be constant). They argued that this increase in amplitude of vibration leads to an increase in F0. But if the amplitude of the vibration increases, and if the period time of the vibration decreases, then the vocal folds must close faster. Therefore, the airflow would have a steeper slope during closing, and IL would increase (Gauffin and Sundberg, 1980). The conclusion is that this mechanism would predict that during vowel production F0 and IL are positively related to Pt alone.

For voiced consonants the positive relation could be the result of a combination of the direct relation given above, and more indirect relations given below. During consonant production the vocal tract is constricted at some point along its length and there is a pressure build-up in the supraglottal region. The increase in Psp could be such that Pt drops below a certain threshold value, in which case the vibration of the vocal folds stops. The threshold value depends on the state of the larynx, i.e. the activity of the laryngeal muscles. The average Pt at which voicing stops for the first five tokens (52 voiced intervals) is 2.42 cm H2O (sd=0.88), and for the same tokens the average Pt at which voicing starts is 5.52 cm H2O (sd=0.95). Thus it seems that it is easier to keep vibration going than it is to start vibration.

Stevens (1977) suggested that the laryngeal musculature controlling vocal fold stiffness always responds to a decreasing Pt during consonant production either by increasing the stiffness, to stop vocal fold vibration, or by decreasing the stiffness, to keep vibration going although Pt is lowered. For voiced consonants the vocal folds would then be slackened, and F0 would decrease with decreasing Pt. Due to the constriction in the vocal tract and/or the smaller opening of the mouth, compared with vowel production, IL would also decrease.

The median value of Pt for all unvoiced consonants of this utterance remains below the average value of Pt at which voicing stops (2.42 cm H2O), and for all voiced consonants the median value of Pt remains above this threshold. This is no conclusive evidence against Stevens's suggestion, but it indicates that for the voiced consonants of this utterance the state of the vocal folds need not be changed to keep vibration going, because Pt probably remains sufficiently high without any adjustments. If no adjustments are made during the production of consonants, then F0 and IL would still decrease with decreasing Pt, by the reasoning given above for vowel production. The conclusion is that with these data it is not clear which mechanism is used during consonant production, but that in both cases F0 and IL would be positively related to Pt alone.

3.2.2 Control on sentence level

In the section above, the control of F0 and IL in speech was studied on a local level. It is also possible to analyze the data on a more global level, in order to study the relation between the slow trends of the physiological signals. This is done by calculating the correlation coefficients for the data vectors pertaining to all voiced samples of the median signals (Table III). Compared to the analysis on a local level (Table II), many differences are observed.

Table III. Correlation matrix, mean and standard deviation for all voiced samples of the median physiological signals.


        F0      IL      Pt     Psp     Psb      SH    mean     SD

F0   1.000   0.669   0.714  -0.184   0.746  -0.468  115.71   8.27
IL           1.000   0.910  -0.669   0.483  -0.221   62.10   4.21
Pt                   1.000  -0.670   0.591  -0.378    4.90   1.16
Psp                         1.000    0.186  -0.128    0.90   0.94
Psb                                  1.000  -0.657    5.64   0.86
SH                                           1.000    6.50  11.15


|R|>0.152 for p<0.01


On a local level Pt was determined almost entirely by Psp, while on a global level the contributions of Psp and Psb to the variation in Pt are almost equal. High correlations between IL and Pt were found on both levels. Psp and Psb contribute to the control of IL via Pt, and because Psb becomes more important in the control of Pt on a global level, Psb becomes more important in the control of IL too. Regarding F0, on word level the highest correlations were those with Pt, while on sentence level the correlation with Psb is the highest. On the whole, Psb becomes more important on a global level.

Psb decreases slowly during the course of the utterance, and hardly varies on a local level. F0, IL, Pt and Psp, on the other hand, do vary substantially during these intervals. In Table II we can see that the variances (var = sd^2) of Pt and Psp are much greater than the variance of Psb. Therefore, it is not surprising that Pt is more effective in predicting the local rapid movements of F0 and IL, and that, in its turn, Psp is more effective than Psb in predicting Pt on a local level.

Psp almost entirely determines the rapid fluctuations, while Psb mainly determines the peak values of Pt. The overall pattern is that Psb and F0 decrease during the utterance, while IL decreases only slightly during the final part of the utterance. Therefore Psb becomes more important in the control of F0 on a global level.

4. Conclusions

In this paper a novel technique to process physiological signals related to speech is presented. This technique seemed necessary because it was found that inter-token variations in the time structure of the speech gestures were very large when an untrained subject repeated a fairly long spontaneous sentence 29 times. In fact the variations were so large that straightforward averaging did not result in averaging physiological signals related to the 'same' speech gestures, but in averaging physiological signals related to totally different speech gestures. It is shown that the novel processing technique reduces this time-jitter to such a degree that meaningful averaging is possible.

After time-normalization it is possible to study the reproducibility of speech gestures, which is another requirement for meaningful averaging. The results show that, apart from Vl, the inter-token variation of the physiological signals is small, indicating that speech gestures are reproducible, even if the subject is not a trained speaker.
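
Reproducibility can then be quantified point by point on the time-normalized stack of repetitions. A minimal sketch, assuming the tokens have already been warped to equal length (the function name is illustrative):

```python
import numpy as np

def median_and_spread(tokens):
    """Point-wise median signal and inter-token standard deviation
    over a stack of time-normalized repetitions (tokens x samples).
    A small spread indicates a reproducible speech gesture."""
    stack = np.asarray(tokens, dtype=float)
    return np.median(stack, axis=0), np.std(stack, axis=0, ddof=1)
```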

The median signals reveal that, of all the measured signals, Pt is the most important factor in the control of F0 and IL in voiced speech. A hypothetical physiological model was proposed that could roughly explain how Pt controls F0 and IL. The activity of the laryngeal muscles also influences F0 and IL to some degree; their influence on F0 is probably greater than their influence on IL. For instance, during the word with sentence stress several F0-raising muscles are probably more active than during the other words.


Acknowledgements

This research was supported by the Foundation for Linguistic Research, which is funded by the Netherlands Organization for Scientific Research (N.W.O.).

Special thanks are due to Harco de Blaauw, who was the subject of the present experiment; to Philip Blok, who inserted the EMG electrodes and the catheter; to Hans Zondag, who helped organize and run the experiment; and to Jan Strik, who assisted in the processing of the data.


References

Atkinson, J.E. (1978) Correlation analysis of the physiological features controlling fundamental voice frequency, Journal of the Acoustical Society of America, 63, 211-222.

Baer, T. (1979) Reflex activation of laryngeal muscles by sudden induced subglottal pressure changes, Journal of the Acoustical Society of America, 65, 1271-1275.

Baer, T.; Gay, T. and Niimi, S. (1976) Control of fundamental frequency, intensity and register of phonation, Haskins Lab. Status Report on Speech Research, SR-45/46, 175-185.

Basmajian, J.V. (1967) Muscles Alive, their functions revealed by electromyography (second edition), The Williams & Wilkins company, Baltimore.

Berg, J. van den; Zantema, J. and Doornenbal, P. (1957) On the air resistance and the Bernoulli effect of the human larynx, Journal of the Acoustical Society of America, 29, 626-631.

Bouhuys, A.; Mead, J.; Proctor, D.F. and Stevens, K.N. (1968) Pressure-Flow Events during Singing, Annals of the New York Academy of Sciences, Vol.155, Art.1, New York.

Collier, R. (1975) Physiological correlates of intonation patterns, Journal of the Acoustical Society of America, 58, 249-255.

Cranen, B. and Boves, L. (1985) Pressure measurements during speech production using semiconductor miniature pressure transducers: Impact on models for speech production, Journal of the Acoustical Society of America, 77, 1543-1551.

Gauffin, J. and Sundberg, J. (1980) Data on the glottal voice source behavior in vowel production, Speech Transmission Laboratory, Q. Prog. Status Rep., Royal Institute of Technology, Stockholm, 2-3/1980, 61-70.

Gelfer, C.; Harris, K.; Collier, R. and Baer, T. (1983). Is declination actively controlled? In: I.R. Titze and C. Scherer (eds.), Vocal Fold Physiology, The Denver Center for the Performing Arts, Inc., Denver, Colorado.

Hirose, H. (1971) Electromyography of the Articulatory Muscles: Current Instrumentation and Techniques, Haskins Lab. Status Report on Speech Research, SR-25/26, 73-86.

Isshiki, N. (1964) Regulatory mechanism of voice intensity variation, Journal of Speech and Hearing Research, 7, 17-29.

Kewley-Port, D. (1973) Computer processing of EMG signals at Haskins Laboratories, Haskins Lab. Status Report on Speech Research, SR-33, 173-183.

Ladefoged, P. (1967) Three areas of experimental phonetics, Oxford: Oxford University Press.

Maeda, S. (1976) A characterization of American English intonation, Ph.D. thesis, MIT, Cambridge.

Rubin, H.J. (1963) Experimental studies on vocal pitch and intensity in phonation, The Laryngoscope, 8, 973-1015.

Sakoe, H. and Chiba, S. (1978) Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-26, 43-49.

Shipp, T.; Doherty, E.T. and Morrissey, P. (1979) Predicting vocal frequency from selected physiologic measures, Journal of the Acoustical Society of America, 66, 678-684.

Stevens, K.N. (1977) Physics of Laryngeal Behavior and Larynx Modes, Phonetica, 34, 264-279.

Strik, H. and Boves, L. (1988). Averaging physiological signals with the use of a DTW algorithm. Proceedings SPEECH'88, 7th FASE Symposium, Edinburgh, Book 3, 883-890.

Titze, I.R. and Durham, P.L. (1987) Passive Mechanisms Influencing Fundamental Frequency Control. In: T. Baer, C. Sasaki and K.S. Harris (eds.), Vocal Fold Physiology, College-Hill Press, Boston.

Vintsyuk, T.K. (1968) Recognition of spoken words by the dynamic programming method, Kibernetika, 1, 81-88.