On the relation between voice source characteristics and prosody

H. Strik & L. Boves (1991b)
Proceedings EUROSPEECH 91, Genova, Vol. 3, pp. 1145-1148.

ABSTRACT

The behaviour of the voice source characteristics in connected speech was studied. Voice source parameters were obtained by automatic inverse filtering, followed by automatic fitting of the LF-model to the data. Consistent relations between voice source parameters and prosody were observed.

Keywords: inverse filtering; LF-model; voice source

1. INTRODUCTION

Many present-day text-to-speech systems produce speech that is intelligible but does not sound natural. This lack of naturalness is at least in part due to the absence of voice source control rules. Numerous voice source models have been proposed, and some of them could be very useful for speech synthesis. But if a sophisticated voice source model is used, one has to be able to control its parameters. There is therefore a need for data on the behaviour of the voice source or, more specifically, on the behaviour of those characteristics of the source that can be mapped onto the model parameters. These data can be used to derive rules for the model parameters, and these rules can then be used in synthesis.

To extract rules, a large amount of data is required. Both inverse filtering of the speech and fitting a model to the inverse filter results could be done by hand. However, manual processing is time-consuming and subjective, and therefore probably not reproducible. A procedure was therefore developed to calculate the voice source parameters (semi-)automatically.

Most research on voice source characteristics has dealt with sustained vowels, produced in different ways. For sustained vowels recorded with a high SNR, automatic extraction of the voice source parameters is possible. But it is difficult to formulate rules for whole utterances from data obtained from isolated speech segments. Our aim is therefore to study the behaviour of the voice source in connected, preferably spontaneous, speech. Apart from the vowels, we also want to extract source parameters for voiced consonants and for voiced/unvoiced (V/UV) and unvoiced/voiced (UV/V) transitions. Research on these topics is now in progress, and some results are presented in this article. Special attention is given to the relation between voice source dynamics and prosody.

2. METHOD AND MATERIAL

2.1. SPEECH MATERIAL

To study voice source characteristics, data were obtained for four male subjects. For all subjects recordings were made of the speech signal, the electroglottogram (EGG), subglottal (Psub) and oral (Por) pressure, lung volume, and the activity of some laryngeal muscles (mostly cricothyroid (CT), vocalis (VOC), and sternohyoid (SH)). For the current article only the data of one male subject were used. Near the end of a recording session he was asked to produce an utterance spontaneously. He then repeated this utterance 29 times. The experiment is described in more detail in Strik and Boves (in press). Inverse filter results were obtained for two of the 30 utterances.

2.2. INVERSE FILTERING

The speech signals were transduced by a condenser microphone (B&K 4134) placed about 10 cm in front of the mouth, and amplified by a measuring amplifier (B&K 2607), using the built-in 22.5 Hz high-pass filter to suppress low-frequency vibrations. The digitized speech signal was processed with a phase correction filter in order to undo the phase distortion caused by the analog high-pass filter in the microphone amplifier. Closed glottis interval covariance LPC was used to estimate the parameters of the inverse filter. In Veth, Cranen, Strik & Boves (1990) it was shown that this technique is as powerful as more sophisticated techniques, such as robust ARMA analysis. The moment of glottal closure was determined from the EGG. Inverse filtering yields an estimate of the differentiated glottal volume flow (dUg/dt); integration gives the flow signal (Ug).
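The chain of covariance LPC, inverse filtering, and integration can be illustrated with a minimal sketch. It is not the software used in the study: the function names `covariance_lpc`, `inverse_filter`, and `integrate` are hypothetical, the closed glottis interval is assumed to be already cut out of the signal (in the study it was located with the EGG), and scaling by the sampling period is omitted.

```python
def covariance_lpc(frame, order):
    """Covariance-method LPC on one frame (e.g. a closed glottis interval).

    Returns predictor coefficients a_1..a_order such that
    s[n] is approximated by sum_k a_k * s[n - k].
    Assumes the normal equations are non-singular.
    """
    n = len(frame)
    # phi[i][k] = sum over t of s[t - i] * s[t - k], for t = order .. n - 1
    phi = [[sum(frame[t - i] * frame[t - k] for t in range(order, n))
            for k in range(order + 1)] for i in range(order + 1)]
    # Augmented normal equations: sum_k a_k * phi[i][k] = phi[i][0]
    aug = [[phi[i][k] for k in range(1, order + 1)] + [phi[i][0]]
           for i in range(1, order + 1)]
    # Gaussian elimination with partial pivoting
    for col in range(order):
        pivot = max(range(col, order), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, order):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, order + 1):
                aug[row][c] -= factor * aug[col][c]
    # Back substitution
    coeffs = [0.0] * order
    for i in reversed(range(order)):
        acc = aug[i][order] - sum(aug[i][j] * coeffs[j]
                                  for j in range(i + 1, order))
        coeffs[i] = acc / aug[i][i]
    return coeffs


def inverse_filter(signal, coeffs):
    """Prediction residual: an estimate of dUg/dt (up to a scale factor)."""
    out = []
    for t in range(len(signal)):
        predicted = sum(a * signal[t - 1 - k]
                        for k, a in enumerate(coeffs) if t - 1 - k >= 0)
        out.append(signal[t] - predicted)
    return out


def integrate(dug):
    """Running sum of dUg/dt: an estimate of the flow Ug."""
    total, out = 0.0, []
    for value in dug:
        total += value
        out.append(total)
    return out
```

For a frame that is a pure free oscillation of an all-pole filter, the covariance method recovers the filter coefficients exactly, which is the motivation for restricting the analysis to the closed glottis interval.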

2.3. EXTRACTION OF VOICE SOURCE PARAMETERS

Voice source parameters were obtained by fitting a voice source model to the data. The so-called LF-model was used because it seems useful for synthesis and because it has already been studied in great detail. A description of the parameters of the LF-model is given in Fant, Liljencrants, and Lin (1985). The maximum in Ug (U0) is reached at time Tp, the minimum in dUg/dt (Ee) at time Te, and Ta is defined by the tangent to dUg/dt at the beginning of the return phase. U0 and Ee were not calibrated and are given in arbitrary units. Tn is the length of the interval between Tp and Te and is related to the skewing of Ug. Ta is a measure of the degree of adduction and is related to the spectral tilt (see e.g. Fant and Lin, 1988).
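The definitions of U0, Tp, Ee, Te, and Tn can be illustrated by reading them directly off one period of a flow signal. This is only a sketch: in the study the values come from fitting the LF-model, not from picking sample extrema, and Ta is left out because it requires a fit of the return phase. The function name `measure_pulse` is hypothetical, the period is assumed to start at glottal opening, and Ee is returned as a positive magnitude.

```python
def measure_pulse(ug, fs):
    """Read U0, Tp, Ee, Te and Tn off one period of glottal flow.

    ug : flow samples for one period, starting at glottal opening
    fs : sampling frequency in Hz
    Times are in seconds, relative to the start of the period.
    """
    # Forward difference as a simple estimate of dUg/dt;
    # dug[i] describes the interval between samples i and i + 1.
    dug = [(ug[i + 1] - ug[i]) * fs for i in range(len(ug) - 1)]
    i_peak = max(range(len(ug)), key=lambda i: ug[i])   # maximum of Ug
    i_exc = min(range(len(dug)), key=lambda i: dug[i])  # minimum of dUg/dt
    tp = i_peak / fs
    te = i_exc / fs
    return {
        "U0": ug[i_peak],
        "Tp": tp,
        "Ee": -dug[i_exc],  # magnitude of the negative peak
        "Te": te,
        "Tn": te - tp,      # interval between Tp and Te
    }
```

A small Tn means that the steepest closing slope follows the flow maximum almost immediately, that is, a strongly skewed pulse.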

For automatic fitting of the model to the signals, use was made of a special software package (details are given in Jansen, 1990). The fit is done pitch synchronously. The periods are delimited by the minima in dUg/dt, because these time points can be located most easily. Automatic fitting seems possible, although for the return phase it is difficult to obtain reliable, stable parameters (see Jansen, 1990).

2.4. AVERAGING THE RESULTS

For inverse filtering some analysis parameters must be defined. The most important are the length and exact position of the analysis window, and the order of the analysis. Generally, there seems to be no combination of these parameters that is optimal for each individual pitch period in an utterance. Using an order of 12 worked satisfactorily for almost all voiced frames.

Because no single window setting is optimal for every pitch period, the following strategy was adopted. For each utterance inverse filtering was done with a number of different analysis windows. For the current article inverse filtering was done using all 15 combinations of 5 window lengths and 3 window shifts. In addition, inverse filtering was done for closed glottis intervals (of variable length) that were derived automatically from the EGG. Voice source parameters were extracted for all 16 resulting inverse filter signals by fitting the LF-model to the data. For each pitch period the median value of each parameter over the 16 analyses was calculated. These median values were used for further analysis.
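The averaging step can be sketched as follows. The data layout (one list of per-period parameter dictionaries per analysis) and the name `median_per_period` are assumptions made for illustration, and the sketch assumes that every analysis yields the same sequence of pitch periods.

```python
from statistics import median


def median_per_period(fits):
    """Combine several analyses of one utterance into one robust estimate.

    fits : list with one entry per analysis (16 in the study); each entry
           is a list with one dict of LF parameters per pitch period.
    Returns one dict per pitch period holding the median of each parameter.
    """
    n_periods = len(fits[0])
    names = list(fits[0][0].keys())
    return [
        {name: median(run[p][name] for run in fits) for name in names}
        for p in range(n_periods)
    ]
```

The median is a natural choice here because a single badly placed analysis window can produce an outlying fit, which would pull a mean but leaves the median largely unaffected.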

3. RESULTS

Reasonable inverse filter results were obtained automatically for the vowels; the voiced consonants, and especially the first and last periods of a voiced segment, gave more difficulties. In general it was observed that the lower the transglottal pressure (Ptr), the more difficult it is to obtain reliable inverse filter results. For all data the parameters obtained for the return phase (Ta) were less stable than those of the exponentially growing sine wave.

Rapid changes in the voice source parameters were observed at the beginning and end of voiced intervals. Because there were also differences between voice onset and offset, these data were analyzed separately (see section 3.2). The data of the final vowel are presented in section 3.3. All remaining data fall into the category called steady phonation. These data are treated first (see section 3.1) and serve as a reference against which the other data are compared.

3.1. STEADY PHONATION

During the course of all 30 utterances there was a gradual decline in Psub, Ptr, intensity level (IL), and F0, while within the individual voiced segments Psub was almost constant and Por covaried with Ptr, IL, and F0. A large covariance between Ptr, IL, and F0 was found for all data (Strik and Boves, in press).

For the voice source parameters U0 and Ee the same tendencies were observed (see Table Ia and Fig. 1). The consistently high covariance of Ptr, U0, Ee, and IL does not seem surprising, as an increase in Ptr alone (everything else being equal) would increase the amplitude of vibration of the vocal folds, and therefore lead to an increase in U0 and Ee. Increasing U0 and Ee by roughly the same amount would lift the spectrum (see Fant and Lin, 1988), and thus increase IL. However, the data of steady phonation include voiced consonants as well as stressed and unstressed vowels. Large variations, both in the glottis and in the vocal tract, are expected for these data. For instance, for voiced consonants Ta and Tn are generally higher than for vowels. It is therefore surprising that, in spite of the large variation in articulatory gestures, the covariance between Ptr, U0, Ee, and IL is still consistently high. Further research is needed to unravel the underlying relations.

Fig. 1. Scatterplot of U0 and IL as a function of Ptr for steady phonation, with regression lines.

Table I. Correlations between Ptr, U0, Ee, IL, and F0.

The correlation between Ptr and F0 is much lower than the correlations between Ptr, U0, Ee, and IL (see Table Ib). Strik and Boves (1989) studied the relation between Ptr and F0 in connected speech, and found that the activity of laryngeal muscles is an important factor in this relation. Probably, the variables that are not used in the present article (such as the activity of laryngeal muscles) have more effect on the relation between Ptr and F0 than on the relations between Ptr, U0, Ee, and IL.
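Regression lines such as those in Fig. 1 and correlations such as those in Table I can be computed with ordinary least squares and the Pearson coefficient. The sketch below is a generic illustration, not the analysis code of the study; the name `linear_fit` is hypothetical and no measured values from the paper are reproduced.

```python
from math import sqrt


def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus Pearson's r.

    x : predictor values per pitch period (e.g. Ptr)
    y : dependent values per pitch period (e.g. U0 or IL)
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r
```

Once the line for steady phonation is known, the deviation of any other data point can be expressed as `y - (slope * x + intercept)`, which is one way to quantify how far onset, offset, or final-vowel data fall from the reference.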

3.2. VOICE ONSET AND OFFSET

Scatterplots of U0 and IL as a function of Ptr are given in Fig. 2 and Fig. 3, for voice onset and offset respectively. Also given are the regression lines for steady phonation (see Fig. 1). It is observed that for UV/V and V/UV transitions U0 is lower relative to steady phonation, but that there are also differences between voice onset and voice offset.

Fig. 2. Scatterplot of U0 and IL as a function of Ptr for voice onset, and regression lines obtained from the data of steady phonation (see Fig. 1).

Fig. 3. Scatterplot of U0 and IL as a function of Ptr for voice offset, and regression lines obtained from the data of steady phonation (see Fig. 1).

The average Ptr for voice onset (4.9 cm H2O) is higher than the average Ptr for voice offset (3.6 cm H2O). It seems that higher Ptr values are needed to initiate vibration of the vocal folds than to keep vibration going towards the end of a voiced interval. At the beginning of a voiced interval the average values of IL and F0 (59 dB and 130 Hz) are also higher than those at the end of a voiced interval (57 dB and 120 Hz).

Towards both the beginning and the end of a voiced interval a rise in Ta and Tn was observed. A higher Ta would increase the spectral tilt, and thus lower IL. The result of increasing Tn alone is that the flow pulses are less skewed, which decreases Ee and IL. Both changes would therefore affect IL and the spectrum of the audio signal.

3.3. FINAL VOWEL

Near the end of all 30 utterances there was a substantial decrease in Psub, Ptr, IL, and F0, and a marked increase in the activity of the SH. Also, for the final vowel U0 was relatively high compared to the data for steady phonation (see Fig. ??). The deviating behaviour of the voice source during the utterance-final syllable (see also Klatt and Klatt, 1990) was studied by comparing the inverse filter data of the last vowel /a/ to the data of the first vowel /a/. The results of this comparison are that U0 of the last vowel is higher, although Ptr, Ee, IL, and F0 are considerably lower. Furthermore, when all time parameters are expressed as percentages of the period duration, no major differences were found between the relative time parameters of the fitted flow signals for both vowels. This means that apart from time stretching (increase of T0), there were no significant differences in the shape of the flow pulses.
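Expressing the time parameters as percentages of the period duration amounts to a simple normalization by T0. A minimal sketch, in which the function name `relative_timing` and the dictionary layout are assumptions made for illustration:

```python
def relative_timing(params):
    """Express the LF time parameters as percentages of the period T0.

    params : dict with "T0", "Tp", "Te" and "Ta" in the same time unit
    """
    t0 = params["T0"]
    return {name: 100.0 * params[name] / t0 for name in ("Tp", "Te", "Ta")}
```

Two pulses that differ only by time stretching give the same percentages, so any remaining difference after this normalization reflects a change in pulse shape.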

The fact that U0 of the final vowel is higher, while Ptr is about 1.5 cm H2O lower, indicates that the impedance of the glottis must have been lowered. Because the shape of the flow pulses is almost the same for both vowels, no large differences in degree of adduction and open quotient are expected. The latter was confirmed with measurements from the EGG. Probably the amplitude of vibration of the vocal folds is increased by slackening the vocal folds. This could be done by diminishing the antero-posterior tension of the folds. Strik and Boves (1989) indeed found a suppressed activity of the CT and VOC near the end of a declarative utterance.

As the relative time parameters for both vowels are about equal, the decrease in Ee (approximately 0.7 dB) is the result of the increase in U0 (approximately 0.3 dB) and the increase in T0 (approximately 1.0 dB) alone. The effect of these changes on the spectrum is described in Fant and Lin (1988). The increase in U0 increases the amplitudes of the lower harmonics; the decrease in Ee causes a lowering of the high-frequency part of the spectrum; and the increase of Ta causes an increase in the spectral tilt (Ta/T0 is about the same for both vowels, but Ta is much larger for the last vowel). When the spectra of both vowels are compared, these differences are clearly visible.
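The decibel bookkeeping in the first sentence of this paragraph can be made explicit. The following is a sketch under the assumption that the flow pulses of both vowels share one shape function g and differ only in amplitude and duration; the symbol g and this scaling form are introduced here for illustration and do not appear in the original analysis.

```latex
U_g(t) = U_0 \, g\!\left(\frac{t}{T_0}\right)
\quad\Longrightarrow\quad
\frac{dU_g}{dt} = \frac{U_0}{T_0} \, g'\!\left(\frac{t}{T_0}\right)
\quad\Longrightarrow\quad
E_e = \frac{U_0}{T_0} \left| \min_{\tau} g'(\tau) \right|

\Delta E_e \,[\mathrm{dB}]
  = \Delta U_0 \,[\mathrm{dB}] - \Delta T_0 \,[\mathrm{dB}]
  \approx 0.3 - 1.0 = -0.7
```

Under this assumption Ee is proportional to U0/T0, so a rise of about 0.3 dB in U0 combined with a rise of about 1.0 dB in T0 gives a drop of about 0.7 dB in Ee, in agreement with the values reported above.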

4. CONCLUSIONS

A consistently high covariance between Ptr, U0, Ee, IL, and F0 has been observed for steady phonation. Increasing U0 and Ee would lift the spectrum and thus increase IL, and increasing F0 would change the position of the harmonics in the spectrum. Furthermore, both Ta and Tn rise when going towards the beginning or end of a voiced interval, or from a vowel to a voiced consonant. All these fluctuations in the voice source parameters, and especially those during the final vowel, would probably have perceptual consequences. To improve the naturalness of synthetic speech, these effects have to be taken into account.

ACKNOWLEDGEMENTS

This research was supported by the foundation for linguistic research, which is funded by the Netherlands Organization for the Advancement of Scientific Research N.W.O. Special thanks are due to Dr. Philip Blok, who inserted the EMG electrodes and the pressure catheter in the experiments.

REFERENCES

Fant, G., Liljencrants, J., & Lin, Q. (1985) A four-parameter model of glottal flow. STL-QPSR 4, pp. 1-13.

Fant, G. & Lin, Q. (1988) Frequency domain interpretation and derivation of glottal flow parameters. STL-QPSR 2-3, pp. 1-21.

Jansen, J. (1990) Automatische extractie van parameters voor het stembron-model van Liljencrants & Fant. Unpublished doctoral dissertation, Nijmegen.

Klatt, D.H. & Klatt, L. (1990) Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America 87, pp. 820-857.

Strik, H. & Boves, L. (1989) The fundamental frequency - subglottal pressure ratio. In Proceedings of EUROSPEECH-89, Vol. 2, pp. 425-428.

Strik, H. & Boves, L. (in press) Control of fundamental frequency, intensity and voice quality in speech. Journal of Phonetics.

Veth, J. de, Cranen, B., Strik, H. & Boves, L. (1990) Extraction of control parameters for the voice source in a text-to-speech system. In Proceedings of ICASSP-90, paper 21.S6a.2.

Last updated on 22-05-2004