H. Strik & L. Boves (1988b)
Proceedings SPEECH'88, 7th FASE Symposium, Edinburgh, Book 3, pp. 1115-1121.
1. INTRODUCTION
Studies of the physiological control of speech parameters like fundamental frequency (F0) and intensity level (IL) have traditionally been characterized by the use of trained phoneticians as subjects, and by the use of specially designed speech material, mostly consisting of a small number of short sentences that had to be produced in several ways (different intonation contours, different placement of emphatic stress, etc.). Another quite general characteristic of those studies is the fairly global way of processing the measurement data obtained in the course of the experiments.
This state of affairs has a number of potentially detrimental consequences. One of the most important disadvantages is that what we know about physiological control of speech parameters is true only under the very limited condition that the talker is trained to serve as a subject in phonetic experiments and knows precisely how to produce a specific effect in a consistent manner. If relations between physiological control parameters and speech parameters are derived from single tokens, or from averages of small numbers of tokens, there is a risk that the observations are actually idiosyncrasies that cannot be generalized to other talkers or other speech conditions. Last but not least, the global data processing may emphasize some relations that fit especially well with the specific way of processing and de-emphasize, or completely overlook, other relations that happen to fit less well.
In this paper we make a first attempt at addressing the methodological problems encountered in speech physiology research. First of all, we used a subject who had no previous training as a phonetician, linguist or singer. Secondly, he was asked to repeat a 'spontaneously' produced fairly complex sentence 30 times. Finally we have analysed the relations between physiological and speech parameters in a number of ways and on a number of levels.
More specifically, the simultaneous control of F0 and IL of running speech was investigated. Simultaneous recordings of speech, electroglottogram (EGG), lung volume, subglottal pressure (Ps), supraglottal pressure (Po), and EMG activity of the sternohyoid, cricothyroid and vocalis muscles were obtained while a male subject carried out a large variety of speech tasks. The variables used in this study are F0, IL, Ps, Po and transglottal pressure (Pt), recorded during the last speech tasks. Because the quality of the recordings of the EMG signals decreased considerably during the course of the recording session, processing of those signals has been postponed until more insight has been obtained into the methodological problems indicated above.
2. METHOD
2.1. Speech Material
The subject was a male native speaker of Dutch, with no experience in phonetics or linguistics. Near the end of the experiment he was asked to produce an utterance spontaneously. The Dutch sentence he produced was: "Ik heb het idee dat mijn keel wordt afgeknepen door die band" (I have the feeling that my throat is being pinched off by that band). After he had spoken this sentence, he was asked to repeat the same sentence 29 times. None of the sentence productions included an inhalation pause.
2.2. Recording and Processing of Data
The pressure signals were recorded using a catheter with 4 pressure transducers, in the way described by Cranen and Boves [1], i.e. two transducers subglottally and two transducers supraglottally. The physiological signals and the audio signal were recorded on a 14-channel instrumentation recorder. All signals were A/D converted with a 10 kHz sampling rate. The pressure signals were low-pass filtered (cutoff frequency 50 Hz) and down-sampled to 200 Hz. Only this low-frequency component of the pressure signals is used in the present analysis. Transglottal pressure was obtained by subtracting the calibrated Po signal from the calibrated Ps signal (Pt = Ps - Po). The EGG was used to make voiced/unvoiced decisions, and to calculate F0.
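The pressure processing chain described above (low-pass at 50 Hz, down-sampling from 10 kHz to 200 Hz, then Pt = Ps - Po) can be sketched as follows. This is only an illustration: the paper does not specify the filter that was used, so the windowed-sinc FIR design, the tap count and all function names are assumptions.

```python
import math

def lowpass_fir(cutoff_hz, fs_hz, num_taps=401):
    """Windowed-sinc (Hamming) low-pass FIR taps with unit DC gain.
    The filter type and length are assumptions, not taken from the paper."""
    fc = cutoff_hz / fs_hz            # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def filter_and_decimate(signal, taps, factor):
    """Filter with zero padding at the edges and keep every `factor`-th sample."""
    half = len(taps) // 2
    out = []
    for centre in range(0, len(signal), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            i = centre + k - half
            if 0 <= i < len(signal):
                acc += t * signal[i]
        out.append(acc)
    return out

def transglottal_pressure(ps, po):
    """Pt = Ps - Po, sample by sample, for calibrated signals on the same time base."""
    return [s - o for s, o in zip(ps, po)]

# Usage: 10 kHz -> 200 Hz means keeping one sample in 50.
taps = lowpass_fir(cutoff_hz=50.0, fs_hz=10000.0)
```

Filtering before decimation matters here: with a 200 Hz output rate, any energy above 100 Hz that is left in the signal would alias into the retained band.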
2.3. Correlation Analysis
All physiological samples were stored at a 200 Hz rate on a microVAX computer. Correlation coefficients were calculated between all possible pairs of the five variables (F0, IL, Ps, Po and Pt) for the voiced segments of the 30 utterances. The number of voiced samples was about 300 in each sentence. The length of the voiced segments in a sentence varied from 5 to 65 samples. The lengths of corresponding voiced parts in the 30 sentences were approximately the same. Correlation coefficients were computed for different data sets. First of all, 30 sets were defined, each containing all data vectors for one of the 30 sentences. Next, a number of additional data sets were created by appending the data for several sentences. Finally, a large number of much smaller data sets were created by taking the data vectors pertaining to each of the individual voiced intervals of individual sentences as separate sets.
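A minimal sketch of the pairwise analysis follows, assuming the ordinary Pearson product-moment coefficient and a per-sample voiced/unvoiced flag derived from the EGG. The function names and the dictionary layout are illustrative, not taken from the original software.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation of two equally long sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_table(data, voiced):
    """data: dict mapping a variable name to its 200 Hz samples.
    voiced: list of booleans of the same length (True = voiced sample).
    Returns {(name_a, name_b): r} for every pair, using voiced samples only."""
    kept = {name: [v for v, keep in zip(values, voiced) if keep]
            for name, values in data.items()}
    return {(a, b): pearson_r(kept[a], kept[b])
            for a, b in combinations(kept, 2)}
```

With the five variables this yields 10 coefficients per data set. The same function applies to all three kinds of data set: pass the samples of one sentence, of several appended sentences, or of a single voiced interval.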
3. RESULTS
3.1. Control of F0 and IL on Sentence Level
The traces of speech, F0, IL, Pt, Po and Ps for one utterance (UTT18) are shown in Fig. 1. Because the 30 repetitions contained about the same number of voiced samples, there seems to be no objection to pooling and comparing the correlation coefficients of the whole utterances. The histograms of the 30 correlation coefficients for 7 pairs of variables are given in Fig. 2. Each histogram contains the correlations of the 30 sentences for one pair of variables. Correlations of F0 with the four remaining parameters are given in the top row; the corresponding correlation coefficients of IL are shown in the bottom row. In Table 1 all 10 correlation coefficients are listed for four appended sentences in the first row, and for one of those four sentences (UTT18) in the second row. In all histograms the correlation coefficients are grouped in small clusters. All correlations, except those of F0 with Po, are highly significant (p<0.01). Also, the correlation coefficients of UTT18 are in good agreement with those of the four sentences. Apparently there is not much dispersion in the correlation coefficients for whole sentences. This indicates that the spontaneous utterances are produced globally in similar ways, so physiological control of speech seems to be reproducible.
The correlations that are scattered least are R(F0,IL), R(F0,Pt) and R(IL,Pt). Partial correlations show that the high correlation between F0 and IL is almost entirely due to the high correlation that each of them has with Pt. For UTT18, for example, Pt explains 57.3% of the variance of IL, while F0 only explains another 3.3%. For the same utterance Pt explains 27.4%, and IL an extra 5.7% of the variance of F0.
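Statements of the form "X explains a% and Y another b%" can be obtained from the three pairwise correlations alone, assuming they refer to the increase in R-squared when a second predictor is added to a linear regression. The paper does not state the exact procedure, so this reading is an assumption, and the numbers in the usage line are made up for illustration rather than taken from Table 1.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of y on predictors 1 and 2,
    computed from the three pairwise Pearson correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def explained_variance_steps(r_y1, r_y2, r_12):
    """Return (share explained by predictor 1 alone,
               extra share gained by adding predictor 2)."""
    first = r_y1 ** 2
    extra = r2_two_predictors(r_y1, r_y2, r_12) - first
    return first, extra

# Hypothetical values: y = IL, predictor 1 = Pt, predictor 2 = F0.
first, extra = explained_variance_steps(r_y1=0.75, r_y2=0.60, r_12=0.50)
```

The extra share is small whenever the second predictor is itself strongly correlated with the first, which is the pattern the text describes for F0, IL and Pt.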
High positive correlations were found between F0 and Ps, while the correlations between F0 and Po are very small (Fig. 2, Table 1). The significant positive correlations between F0 and Pt result mainly from the high positive correlation between F0 and Ps. For UTT18 Ps explains 59.9% of the variance of F0, and Pt adds 1.0% to the explained variance. Pt alone explains only 27.4% of the variance of F0. Thus, on a global level Ps seems to be the most important factor in the control of F0. This is in line with the general observation that, on the level of complete sentences, the lion's share of the variance in F0 is accounted for by declination effects. Local variations in F0 contribute less to the total variance.
IL is mainly determined by Pt (Fig. 2, Table 1). In its turn, Pt is determined by Ps and Po. It appears that both Ps and Po contribute to the control of IL via Pt, Po probably slightly more so than Ps. This may be explained by observing that Po is the major determinant of the fast fluctuations in Pt, and that the fast fluctuations in IL contribute at least as much to the total variance as the decline of IL from the beginning to the end of the sentences.
The correlations between Po and Ps are small (Table 1). This can be explained with the use of a physiological model. The voiced intervals are mainly made up of vowels. During vowel production the impedance of the glottis is much higher than the impedance of the vocal tract. As a result Po is only slightly affected by Ps, which is reflected in the small positive correlations.
3.2. Control of F0 and IL on Word Level
In Table 1 correlations are listed for four appended sentences. Correlations were calculated for each of the 40 voiced segments of these four sentences. The 40 correlations for each pair of variables are pooled in the histograms shown in Fig. 3. Listed in Table 2 are the correlations for UTT18 and for each voiced segment of UTT18.
Direct comparisons of the correlations obtained for the individual voiced intervals must be done with great caution. Firstly, the correlations no longer pertain to comparable speech events. Also, the number of observations within the data sets varies greatly. From Figure 3 it is obvious that these correlations are scattered more than the correlations for whole sentences. For one thing this is due to the fact that the number of samples used to calculate the correlations is smaller here, so that a fairly large variation is to be expected on purely statistical grounds. Another reason can be that there are other short term factors whose influence is great on a local level, but whose influence is averaged away on a large time scale. These factors could be activity of both intrinsic and extrinsic laryngeal muscles, vertical position of the larynx, jaw opening, relaxation of vocal folds after stepwise increase in tension etc.
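The purely statistical part of this scatter can be illustrated with the Fisher z-transformation, which gives an approximate confidence interval for a correlation coefficient as a function of the number of samples. This is a sketch, not part of the original analysis. It assumes independent samples, which successive samples at 200 Hz are unlikely to be, so the intervals are if anything too narrow.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r
    estimated from n independent samples (requires n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Segment lengths quoted in Section 2.3: 5 to 65 samples per voiced segment,
# about 300 voiced samples per sentence. r = 0.5 is an arbitrary example value.
for n in (5, 65, 300):
    low, high = fisher_ci(0.5, n)
    print(n, round(low, 2), round(high, 2))
```

For the shortest segments the interval spans both signs, so a wide spread of per-segment coefficients is expected even if the underlying relation is the same in every segment.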
In all 30 sentences the results of the last word deviated from the results of the other words. A decrease in IL, Ps and Pt was observed in all last words, while both F0 and Po remained fairly constant. The correlations for one additional syllable, the sixth voiced segment of UTT18 (Figure 1), also deviated from the others for most sentences. In this segment the sharp rise in F0 and IL was the result of the rise in Ps. These deviating results also cause the histograms in Figure 3 to be more scattered.
The most significant correlations in Fig. 3 are those of IL with Pt and Po. With the exception of the sixth and the last voiced segment of UTT18, the correlations of IL with both Pt and Po are higher for the individual voiced segments than for the whole sentence. If the correlations of the two deviating segments of each of the four sentences are excluded, then 31 of the 32 remaining correlations of IL with Po are between -0.8 and -1.0. Thus, it appears that the dominant variable in the control of IL on a word level is Po. This conclusion fits nicely into a simple physiological and acoustic model that predicts a decrease in IL during the production of voiced consonants, relative to IL in the production of vowels [2].
The highest correlations, on word level, were found between Pt and Po. The correlations of Pt with Po are highly significant (p<0.01) for all voiced segments of UTT18 (Table 2), and are always higher (more negative) than the correlation coefficient for all voiced samples of this sentence.
There is a tendency for R(F0,Pt) to be moderately to highly positive. And because there is a high negative correlation between Pt and Po, as mentioned above, there also is a tendency for R(F0,Po) to be moderately to highly negative. In Table 1 it is shown that R(F0,Po) is significantly (p<0.05) positive for UTT18. But for 8 of the 11 voiced parts of the utterance R(F0,Po) is negative. Only the correlation of the deviating sixth voiced segment is significantly (p<0.01) positive. Of the eight negative correlations 3 are significant with p<0.01 and 2 with p<0.05. This example clearly demonstrates the effect of the method of analysis on the results.
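How a correlation can be negative within most segments yet positive for the pooled samples is easiest to see with a toy example. The numbers below are synthetic and chosen only to show the mechanism; they are not the measured F0 and Po values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Three synthetic "voiced segments": within each one, F0 falls as Po rises,
# but the segment means of F0 and Po move in the same direction.
segments = [
    {"Po": [0.0, 1.0, 2.0],    "F0": [102.0, 101.0, 100.0]},
    {"Po": [5.0, 6.0, 7.0],    "F0": [112.0, 111.0, 110.0]},
    {"Po": [10.0, 11.0, 12.0], "F0": [122.0, 121.0, 120.0]},
]

per_segment = [pearson_r(s["Po"], s["F0"]) for s in segments]
pooled = pearson_r(
    [v for s in segments for v in s["Po"]],
    [v for s in segments for v in s["F0"]],
)
```

In this toy case every per-segment coefficient is negative while the pooled coefficient is positive, because pooling mixes the within-segment relation with differences between segment means.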
4. CONCLUSIONS
From our data it appears that both IL and F0 are mainly determined by Pt. This conclusion holds on sentence level as well as on word level. Yet, there are relevant differences between the control of IL and F0, and between the control mechanisms on sentence and word level.
For F0, on the level of the sentence, the most important control mechanism appears to be Ps. On the word level, however, Ps is far less effective in predicting F0. Here Po must also be taken into account, as well as the activity of the intrinsic and extrinsic laryngeal muscles.
On the level of the sentence Ps and Po seem to contribute equally to the control of IL via Pt. On the level of individual words, however, Po appears to be by far the most important factor in the control of IL.
The general conclusion is that the rapidly varying Po is more important in the control of F0 and IL on a local level than on a global level. Results regarding the control of F0 and IL in speech seem to depend on the type of analysis used.
ACKNOWLEDGEMENTS
This research was supported by the Foundation for Linguistic Research, which is funded by the Netherlands Organization for the Advancement of Scientific Research N.W.O. Special thanks are due to Harco de Blaauw, who was the subject in the present experiment; to Philip Blok, who inserted the EMG electrodes and the catheter; to Hans Zondag, who helped organize and run the experiment; and to Jan Strik, who assisted in the processing of the data.
5. REFERENCES
[1] B. Cranen and L. Boves (1985). Pressure measurements during speech production using semiconductor pressure transducers: Impact on models for speech. J. Acoust. Soc. Am. 77: 1543-1551.
[2] J.L. Flanagan (1972). Speech analysis, synthesis and perception. Springer-Verlag, Berlin.