H. Strik & L. Boves (1991b)
Proceedings EUROSPEECH 91, Genova, Vol. 3, pp. 1145-1148.
ABSTRACT
The behaviour of the voice source characteristics in connected speech was
studied. Voice source parameters were obtained by automatic inverse filtering,
followed by automatic fitting of the LF-model to the data. Consistent relations
between voice source parameters and prosody were observed.
Keywords: inverse filtering; LF-model; voice source
1. INTRODUCTION
Many present-day text-to-speech systems produce speech that is intelligible
but does not sound natural. This lack of naturalness is at least in part
due to the absence of voice source control rules. Numerous voice source
models have been proposed, and some of them could be very useful for
speech synthesis. But if a sophisticated voice source model is used, then
one has to be able to control the parameters of the voice source. Therefore
there is a need for data on the behaviour of the voice source, or more specifically
of the behaviour of those characteristics of the source that can be mapped
onto the model parameters. These data can be used to extract rules for the
model parameters, and these rules can then be used in synthesis.
To extract rules a large amount of data is required. Both inverse filtering
of the speech, and fitting a model to the inverse filter results could be
done by hand. However, doing so is time-consuming and subjective, and
therefore probably not reproducible. Therefore a procedure was developed
to calculate the voice source parameters (semi-)automatically.
Most research on voice source characteristics has dealt with sustained vowels,
produced in different ways. For sustained vowels, recorded with a high SNR,
automatic extraction of the voice parameters is possible. But from these
data obtained from isolated speech segments it is difficult to formulate
rules for whole utterances. Therefore our aim is to study the behaviour
of the voice source in connected, preferably spontaneous, speech. Apart
from the vowels we also want to extract source parameters for voiced consonants
and for V/UV and UV/V transitions. Research on these topics is now in progress.
In this article some results are presented. Special attention is given to
the relation between voice source dynamics and prosody.
2. METHOD AND MATERIAL
2.1. SPEECH MATERIAL
To study voice source characteristics, data were obtained for four male subjects.
For all subjects recordings were made of the speech signal, electroglottogram
(EGG), subglottal (Psub) and oral (Por) pressure, lung volume, and activity
of some laryngeal muscles (mostly cricothyroid, vocalis, and sternohyoid).
For the current article only the data of one male subject were used. Near
the end of a recording session he was asked to produce an utterance spontaneously.
He then repeated this utterance 29 times. The experiment is described in
more detail in Strik and Boves (in press). Inverse filter results were obtained
for two of the 30 utterances.
2.2. INVERSE FILTERING
The speech signals were transduced by a condenser microphone (B&K 4134)
placed about 10 cm in front of the mouth, and amplified by a measuring amplifier
(B&K 2607), using the built-in 22.5 Hz high-pass filter to suppress low
frequency vibrations. The digitized speech signal was processed with a phase
correction filter in order to undo the phase distortion caused by the analog
high-pass filter in the microphone amplifier. Closed glottis interval covariance
LPC was used to estimate the parameters of the inverse filter. In Veth,
Cranen, Strik & Boves (1990) it was shown that this technique is as powerful
as more sophisticated techniques, like robust ARMA analysis. The moment
of glottal closure was determined from the EGG. Inverse filtering yields
an estimate of the differentiated glottal volume flow (dUg/dt); integration
gives the flow signal (Ug).
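The analysis steps of this section can be sketched in code. The Python
fragment below is a minimal illustration, assuming NumPy, and is not the
software used in this study: the function names, the window arguments and
the order are hypothetical. It estimates the inverse filter by covariance
LPC on a window inside the closed glottis interval, applies the filter to
obtain an estimate of dUg/dt, and integrates to obtain Ug.

```python
import numpy as np

def covariance_lpc(s, start, length, order):
    """Covariance-method LPC on s[start:start+length]; requires start >= order."""
    rows = np.arange(start, start + length)
    # Column k-1 holds s[n-k], the k-th past sample for every n in the window.
    past = np.column_stack([s[rows - k] for k in range(1, order + 1)])
    # Least-squares solution of s[n] + sum_k a_k * s[n-k] = 0 over the window.
    coef, *_ = np.linalg.lstsq(past, -s[rows], rcond=None)
    return np.concatenate(([1.0], coef))      # A(z) = 1 + a_1 z^-1 + ...

def inverse_filter(s, a, fs):
    """Apply A(z) to the speech signal, then integrate the residual."""
    d_ug = np.convolve(s, a)[:len(s)]          # estimate of dUg/dt
    ug = np.cumsum(d_ug) / fs                  # estimate of Ug (uncalibrated)
    return d_ug, ug
```

In practice `start` and `length` would follow from the moment of glottal
closure found in the EGG. The phase correction step is not shown.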
2.3. EXTRACTION OF VOICE SOURCE PARAMETERS
Voice source parameters were obtained by fitting a voice source model to
the data. The so-called LF-model was used, because it seems useful for synthesis,
and because it has already been studied in great detail. A description of
the parameters of the LF-model is given in Fant, Liljencrants, and Lin (1985).
The maximum in Ug (U0) is reached at time Tp, the minimum in dUg/dt (Ee)
at time Te, and Ta is defined by the tangent of dUg/dt at the beginning
of the return phase. U0 and Ee were not calibrated, and are given in arbitrary
units. Tn is the length of the interval between Tp and Te, and it is related
to the skewing of Ug. Ta is a measure of the degree of adduction and is
related to the spectral tilt (see e.g. Fant and Lin, 1988).
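The definitions above can be illustrated with direct landmark measurements
on a single period. The sketch below is not a fit of the LF-model and is
not the procedure used in this study. It assumes NumPy, a hypothetical
function name, and a frame of dUg/dt that starts at the moment of glottal
opening, so that times are measured from the first sample. Ta is left out
because it requires a fit of the return phase.

```python
import numpy as np

def source_landmarks(d_ug, fs):
    """U0, Tp, Ee, Te and Tn read directly from one period of dUg/dt."""
    d_ug = np.asarray(d_ug, dtype=float)
    ug = np.cumsum(d_ug) / fs                  # flow by integration
    i_e = int(np.argmin(d_ug))                 # most negative dUg/dt
    i_p = int(np.argmax(ug[:i_e + 1]))         # flow maximum before that point
    return {
        "U0": ug[i_p],                         # maximum of Ug
        "Tp": i_p / fs,                        # time of the flow maximum
        "Ee": -d_ug[i_e],                      # size of the minimum of dUg/dt
        "Te": i_e / fs,                        # time of that minimum
        "Tn": (i_e - i_p) / fs,                # interval between Tp and Te
    }
```

A larger Tn means that the flow maximum lies further from the main
excitation, i.e. the pulse is less skewed.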
For automatic fitting of a model to the signals, use was made of a special
software package (details are given in Jansen, 1990). The fit is done pitch
synchronously. The periods are defined by the minima in dUg/dt, because these
time points can be located most easily. Automatic fitting seems possible,
although for the return phase it is difficult to obtain reliable, stable
parameters (see Jansen, 1990).
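As an illustration of pitch-synchronous segmentation, the sketch below
marks the main negative peaks of dUg/dt and uses them as period
boundaries. It is a simple stand-in, not the method of the software
package cited above. It assumes NumPy, and the function name, the
`max_f0` limit and the `rel_depth` threshold are hypothetical choices.

```python
import numpy as np

def period_boundaries(d_ug, fs, max_f0=400.0, rel_depth=0.3):
    """Sample indices of the main minima of dUg/dt, in increasing order."""
    d_ug = np.asarray(d_ug, dtype=float)
    min_gap = int(fs / max_f0)                 # shortest period allowed, in samples
    threshold = rel_depth * d_ug.min()         # negative: only deep minima qualify
    marks = []
    # Visit samples from the deepest upwards and keep those that lie far
    # enough from every boundary accepted so far.
    for i in np.argsort(d_ug):
        if d_ug[i] > threshold:
            break
        if all(abs(int(i) - m) >= min_gap for m in marks):
            marks.append(int(i))
    return sorted(marks)
```

Successive boundaries delimit one period each, and their spacing divided by
the sampling rate gives T0. The sketch assumes that the signal contains
voiced speech, so that the deepest minimum is clearly negative.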
2.4. AVERAGING THE RESULTS
For inverse filtering some analysis parameters must be defined. The most
important are the length and exact position of the analysis window, and
the order of the analysis. An order of 12 worked satisfactorily for almost
all voiced frames. For the analysis window, however, there seems to be no
combination of length and position that is optimal for each individual
pitch period in an utterance.
Therefore, the following strategy was adopted. For each utterance inverse
filtering was done with a number of different analysis windows. For the
current article inverse filtering was done using all 15 different combinations
of 5 window lengths and 3 window shifts. In addition, inverse filtering
was done for closed glottis intervals (of variable length) that were derived
automatically from the EGG. Voice source parameters were extracted for all
16 resulting inverse filter signals, by fitting the LF-model to the data.
For each pitch period median values were calculated. The median values were
used for further analysis.
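The averaging strategy can be sketched as follows. The fragment assumes
NumPy and uses index labels only, because the actual window lengths and
shifts are not listed here. The function name and the array layout (one
row per analysis, one column per pitch period) are hypothetical.

```python
import numpy as np
from itertools import product

# 5 window lengths x 3 window shifts, identified by index only.
settings = list(product(range(5), range(3)))   # 15 fixed-window analyses
settings.append("EGG")                         # plus the EGG-derived interval

def median_per_period(estimates):
    """Median over analyses; `estimates` has shape (n_analyses, n_periods)."""
    return np.median(np.asarray(estimates, dtype=float), axis=0)
```

The median is less sensitive than the mean to a few analyses that fail
for a given period, which suits a setting where no single window is
optimal for every period.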
3. RESULTS
Reasonable inverse filter results were obtained automatically for the vowels;
the voiced consonants, and especially the first and last periods of a voiced
segment gave more difficulties. In general it was observed that the lower
the transglottal pressure (Ptr), the more difficult it is to obtain reliable
inverse filter results.
For all data the parameters obtained for the return phase (Ta) were less
stable than those of the exponentially growing sine wave.
Rapid changes in the voice source parameters were observed at the beginning
and end of voiced intervals. Because there were also differences between
voice onset and offset, these data were analyzed separately (see section
3.2). The data of the final vowel are presented in section 3.3. All remaining
data fall into the category called steady phonation. These data are treated
first (see section 3.1.) and serve as a reference against which the other
data are compared.
3.1. STEADY PHONATION
During the course of all 30 utterances there was a gradual decline in Psub,
Ptr, intensity level (IL), and F0, while for the individual voiced segments
Psub was almost constant and Por covaried with Ptr, IL, and F0. A large
covariance between
Ptr, IL, and F0 was found for all data (Strik and Boves, in press).
For the voice source parameters U0 and Ee the same tendencies were observed
(see Table Ia and Fig. 1). The consistently high covariance of Ptr, U0,
Ee, and IL does not seem surprising, as an increase in Ptr alone (everything
else being equal) would increase the amplitude of vibration of the vocal
folds, and therefore lead to an increase in U0 and Ee. Increasing U0 and
Ee by roughly the same amount would lift the spectrum (see Fant and Lin,
1988), and thus increase IL. However, included in the data of steady phonation
are voiced consonants, stressed and unstressed vowels. Large variations,
both in the glottis and in the vocal tract, are expected for these data.
For instance, for voiced consonants Ta and Tn are generally higher than
for vowels. Therefore it is surprising that, in spite of the large variation
in articulatory gestures, the covariance between Ptr, U0, Ee, and IL is
still consistently high. Further research is needed to unravel the underlying
relations.
Fig. 1. Scatterplot of U0 and IL as a function of Ptr for steady phonation,
with regression lines.
Table I. Correlations between Ptr, U0, Ee, IL, and F0.
The correlation between Ptr and F0 is much lower than the correlations among
Ptr, U0, Ee, and IL (see Table Ib). Strik and Boves (1989) studied the relation
between Ptr and F0 in connected speech, and found that the activity of laryngeal
muscles is an important factor in this relation. Probably, the variables
that are not used in the present article (like activity of laryngeal muscles)
have more effect on the relation between Ptr and F0, than on the relations
between Ptr, U0, Ee, and IL.
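The correlations and regression lines discussed in this section are of the
ordinary kind, and can be computed as in the sketch below. It assumes
NumPy and a hypothetical function name, and it contains no data from this
study. The inputs are two equally long sequences with one value per pitch
period, e.g. Ptr and U0.

```python
import numpy as np

def correlate_and_fit(x, y):
    """Pearson correlation and least-squares line y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]                # correlation coefficient
    slope, intercept = np.polyfit(x, y, 1)     # first-degree polynomial fit
    return r, slope, intercept
```

A line fitted to the steady phonation data can then be drawn over the
onset and offset data to show how those periods deviate from it.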
3.2. VOICE ONSET AND OFFSET
Scatterplots of U0 and IL as a function of Ptr are given in Fig. 2 and Fig.
3, for voice onset and offset respectively. Also given are the regression
lines for steady phonation (see Fig. 1). It is observed that for UV/V and
V/UV transitions U0 is relatively low compared to steady phonation, but
that there are also differences between voice onset and voice offset.
Fig. 2. Scatterplot of U0 and IL as a function of Ptr for voice onset, and
regression lines obtained from the data of steady phonation (see Fig. 1).
Fig. 3. Scatterplot of U0 and IL as a function of Ptr for voice offset,
and regression lines obtained from the data of steady phonation (see Fig.
1).
The average Ptr for voice onset (4.9 cm H2O) is higher than the average Ptr
for voice offset (3.6 cm H2O). It seems that higher Ptr values are needed
to initiate vibration of the vocal folds than to keep vibration going towards
the end of a voiced interval. At the beginning of a voiced interval the
average values of IL and F0 (59 dB and 130 Hz) are also higher than those
at the end of a voiced interval (57 dB and 120 Hz).
Towards both the beginning and the end of a voiced interval a rise in Ta and Tn
was observed. A higher Ta would increase the spectral tilt, and thus lower
IL. The result of increasing Tn alone is that the flow pulses are less skewed,
decreasing Ee and IL. Both changes would therefore affect IL and the spectrum
of the audio signal.
3.3. FINAL VOWEL
Near the end of all 30 utterances there was a substantial decrease in Psub,
Ptr, IL, and F0, and a marked increase in the activity of the sternohyoid (SH). Also,
for the final vowel U0 was relatively high, compared to the data for steady
phonation (see Fig. ??). The deviating behaviour of the voice source during
the utterance final syllable (see also Klatt and Klatt, 1990) was studied
by comparing the inverse filter data of the last vowel /a/ to the data of
the first vowel /a/. The results of this comparison are that U0 of the last
vowel is higher, although Ptr, Ee, IL, and F0 are considerably lower. Furthermore,
when all parameters are expressed in percentages of the period duration,
no major differences were found between the relative time parameters of
the fitted flow signals for both vowels. This means that apart from time
stretching (increase of T0), there were no significant differences in the
shape of the flow pulses.
The fact that U0 of the final vowel is higher, while Ptr is about 1.5 cm
H2O lower, indicates that the impedance of the glottis must have been lowered.
Because the shape of the flow pulses is almost the same for both vowels,
no large differences in degree of adduction and open quotient are expected.
The latter was confirmed with measurements from the EGG. Probably the amplitude
of vibration of the vocal folds is increased by slackening the vocal folds.
This could be done by diminishing the antero-posterior tension of the folds.
Strik and Boves (1989) indeed found a suppressed activity of the cricothyroid
(CT) and vocalis (VOC) near the end of a declarative utterance.
As the relative time parameters for both vowels are about equal, the decrease
in Ee (about 0.7 dB) is the result of the increase in U0 (about 0.3 dB) and the
increase in T0 (about 1.0 dB) alone. The effect of these changes on the spectrum
is described in Fant and Lin (1988). The increase in U0 increases the amplitudes
of the lower harmonics; the decrease in Ee causes a lowering of the high-frequency
part of the spectrum; while the increase of Ta causes an increase in the
spectral tilt (Ta/T0 is about the same for both vowels, but Ta is much larger
for the last vowel). When the spectra of both vowels are compared these
differences are clearly visible.
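The relation between the changes in U0, T0 and Ee can be checked
numerically. If the pulse shape is fixed in relative time, Ug scales with
U0 and its time derivative scales with U0/T0, so in dB the change in Ee
equals the change in U0 minus the change in T0: 0.3 - 1.0 = -0.7 dB. The
sketch below assumes NumPy. The pulse shape, the reference period of 8 ms
and the use of 20 log10 for all three quantities are assumptions made for
the illustration.

```python
import numpy as np

def excitation_strength(u0, t0, n=2001):
    """Ee (= -min dUg/dt) of a pulse with fixed shape, peak flow u0, period t0."""
    x = np.linspace(0.0, 1.0, n)               # time as a fraction of the period
    shape = x**2 * (1.0 - x)                   # hypothetical skewed pulse shape
    ug = u0 * shape / shape.max()              # flow pulse with maximum u0
    d_ug = np.gradient(ug, x * t0)             # derivative with respect to time
    return -d_ug.min()

def db(ratio):
    return 20.0 * np.log10(ratio)

u0_gain = 10 ** (0.3 / 20)                     # U0 raised by 0.3 dB
t0_gain = 10 ** (1.0 / 20)                     # T0 lengthened by 1.0 dB
ee_first = excitation_strength(1.0, 0.008)
ee_last = excitation_strength(u0_gain, 0.008 * t0_gain)
delta_ee_db = db(ee_last / ee_first)           # change in Ee in dB
```

The result does not depend on the particular pulse shape chosen, as long
as the same shape is used for both vowels.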
4. CONCLUSIONS
A consistently high covariance between Ptr, U0, Ee, IL, and F0 has been
observed for steady phonation. Increasing U0 and Ee would lift the spectrum
and thus increase IL, and increasing F0 would change the position of the
harmonics in the spectrum. Furthermore, both Ta and Tn rise when going
towards the beginning or end of a voiced interval, or from a vowel to a
voiced consonant. All these fluctuations in the voice source parameters,
and especially those during the final vowel, would probably have perceptual
consequences. To improve the naturalness of synthetic speech, these effects
have to be taken into account.
ACKNOWLEDGEMENTS
This research was supported by the Foundation for Linguistic Research, which
is funded by the Netherlands Organization for the Advancement of Scientific
Research (N.W.O.). Special thanks are due to Dr. Philip Blok, who inserted the
EMG electrodes and the pressure catheter in the experiments.
REFERENCES
Fant, G., Liljencrants, J., & Lin, Q. (1985) A four-parameter model of glottal
flow. STL-QPSR 4, pp. 1-13.
Fant, G. & Lin, Q. (1988) Frequency domain interpretation and derivation
of glottal flow parameters. STL-QPSR 2-3, pp. 1-21.
Jansen, J. (1990) Automatische extractie van parameters voor het stembron-model
van Liljencrants & Fant. Unpublished doctoral dissertation, Nijmegen.
Klatt, D.H. & Klatt, L. (1990) Analysis, synthesis, and perception of voice
quality variations among female and male talkers. Journal of the Acoustical
Society of America 87, pp. 820-857.
Strik, H. & Boves, L. (1989) The fundamental frequency - subglottal pressure
ratio. In Proceedings of EUROSPEECH-89, Vol. 2, pp. 425-428.
Strik, H. & Boves, L. (in press) Control of fundamental frequency, intensity
and voice quality in speech. Journal of Phonetics.
Veth, J. de, Cranen, B., Strik, H. & Boves, L. (1990) Extraction of control
parameters for the voice source in a text-to-speech system. In Proceedings
of ICASSP-90, paper 21.S6a.2.