H. Strik & L. Boves (1987a)
Proceedings 11th International Congress of Phonetic Sciences, Tallinn, Vol. VI, pp. 32-35.
This article has appeared in Speech Communication. Therefore, I only have
a printed version with the final text and the final layout. If you want
a copy of this article, you can find it in Speech Communication 11, or you
can contact me. The text of the ASCII version
below is slightly different from the text of the article.
Proposed running head: voice source parameters and prosody
Keywords: inverse filtering; LF-model; voice source
Abstract
The behaviour of the voice source characteristics in connected speech was
studied. Voice source parameters were obtained by automatic inverse filtering,
followed by automatic fitting of a glottal waveform model to the data. Consistent
relations between voice source parameters and prosodic features were observed.
Zusammenfassung
Das Verhalten der Stimm-Quellcharakteristik in kontinuierlicher Sprache
wurde untersucht. Stimm-Quellparameter wurden durch automatisches inverses
Filtern ermittelt. Anschliessend wurden die Daten uber ein automatisches
Fehlerminimierungsverfahren in ein Modell der glottalen Wellenform eingepasst.
Es wurden konsistente Zusammenhange zwischen Stimmquellecharakteristiken
und prosodischen Merkmalen festgestellt.
Resume
Le comportement de la source vocale dans la parole continue a ete examine.
Des parametres de la source vocale ont ete obtenus a l'aide de filtrage
automatique, suivi par un approchement automatique d'un modele d'onde de
debit glottique aux observations. Des relations consistentes entre les parametres
de la source vocale et des traits prosodiques ont ete trouvees.
1. Introduction
Modern text-to-speech systems produce speech that is intelligible, but not
quite natural. This lack of naturalness is at least in part due to the absence
of adequate prosody control. Prosody does not only include fundamental frequency
(F0) and duration, but it also affects more subtle aspects of the speech
signal that can be subsumed under the cover term 'voice quality'. Completely
satisfactory prosody will therefore require the use of adequate voice source
control rules. This opinion is reflected by the fact that many rule based
text-to-speech systems are now being updated, in order to replace a static
voice source with a source that can be dynamically controlled. A number
of different voice source models have been proposed, each with its own specific
advantages and drawbacks. However, it is not our intention to compare different
models. Even the most sophisticated voice source model will not improve
speech quality if it is not being controlled by the right rules. These rules,
on the other hand, cannot be derived without a large amount of data on the
behaviour of the voice source in natural speech, or more specifically, of
the behaviour of those characteristics of the source that can be mapped
onto the model parameters. Fortunately, most modern source models share
a large number of parameters, so that most of the results obtained with
one model should be easy to generalise to other models.
In a text-to-speech synthesis framework all relevant properties of the voice
source can be described in terms of the glottal volume flow signal, and
its time derivative. Those glottal flow signals can be approximated, starting
from the acoustic speech signal, via inverse filtering. Model parameters
can then be estimated by fitting the model waveform to the inverse filtered
waveforms. Inverse filtering and model fitting could in principle be done
interactively. However, interactive measurements would take an inordinate
amount of time, because rule development requires one to process large quantities
of speech. Moreover, interactive measurements are difficult to reproduce.
For these reasons a procedure was developed to derive the voice source parameters
automatically. That procedure is explained in Section 2.
Up to now, most research on voice source characteristics has dealt with
sustained vowels, produced in different ways. For sustained vowels, recorded
with a high SNR, automatic extraction of the voice parameters is fairly
easy. But it is difficult to extrapolate from data acquired from isolated
speech sounds to rules for connected speech. Therefore, our aim is to study
the behaviour of the voice source in connected, preferably spontaneous speech.
And in addition to steady state portions of vowels we also want to extract
source parameters for voiced consonants, as well as for voiced/unvoiced
(V/UV) and UV/V transitions. The results of our work are presented in section
3.
The strategy that we adopt to find relations between several voice source
parameters on the one hand, and between voice source parameters and prosody
on the other, is the following: first we will derive general relations by
averaging over all data; after that we will look for local deviations from
these general relations. Special attention is given to the relation between
voice source parameters and prosodic features like F0, intensity (Int),
and voice quality.
2. Method and material
2.1. Speech material
To study voice source characteristics data were collected for four male
subjects. For all subjects recordings were made of the speech signal, electroglottogram
(EGG), subglottal (Psub) and oral (Por) pressure, lung volume, and electromyographic
activity of some laryngeal muscles (mostly crycothyroid, vocalis, and sternohyoid).
The signals were stored on wide band FM-tape. All recordings were made at
the ENT-clinic of the University Hospital "Sint Radboud" in Nijmegen, in
a room in which no special acoustic precautions were made. For the current
article only data of one subject were used (Strik and Boves, in press).
Near the end of a recording session he was asked to produce an utterance
spontaneously. His response was: "Ik heb het idee dat mijn keel wordt afgeknepen
door die band" ("I have the feeling that my throat is being pinched off
by that band"). He then repeated this utterance 29 times. The 30 utterances
had an average length of 2.3 seconds. For this paper inverse filter results
of the first four utterances were analyzed.
2.2. Inverse filtering
The speech signal was transduced by a condensor microphone (B&K 4134) placed
about 10 cm in front of the mouth, pre-amplified at the microphone (B&K
1619), and amplified by a measuring amplifier (B&K 2607) using the built-in
22.5 Hz high-pass filter to suppress low frequency noise. The speech signal
was A/D converted off-line at a 10 kHz sampling rate, and processed with
a phase correction filter in order to undo the low frequency phase distortion
caused by the high-pass filter.
Closed glottis interval covariance LPC analysis was used to estimate the
parameters of the inverse filter. In de Veth, Cranen, Strik & Boves (1990)
it was shown that this technique for estimating the inverse filter is as
powerful as more sophisticated techniques, like Robust ARMA analysis. The
moment of glottal closure was determined from the EGG, and it is used to
position the analysis window. Inverse filtering yields an estimate of the
differentiated glottal volume flow (dUg); integration of dUg gives the flow
signal (Ug).
Closed glottis interval inverse filtering is a complex process; its implementation
requires several choices to be made to fix parameters. The most important
parameters are the length and exact position of the analysis window, the
pre-emphasis factor, and the order of the analysis. In general, there seems
to be no combination of these parameters that is optimal for each individual
pitch period in a normal speech utterance. However, a 12th order LPC analysis
with a pre-emphasis factor of 0.95 worked satisfactorily for almost all
pitch periods.
Thus window position and window length were left as the parameters to be
varied. Instead of trying to formulate criteria that would allow one to
determine the unique optimal combination of window length and position for
each period, we decided to try a large number of combinations and to leave
it to a simple statistical procedure to make the final selection (see section
2.4.).
2.3. Voice source parameters
For automatic fitting of a glottal waveform model to inverse filtered flow
signals we used a special software package (Jansen, Cranen, and Boves, 1991).
The fit is done pitch synchronously. The periods are defined by the minima
in dUg, because these time points can be located most reliably. This software
package allows one to use different glottal waveform models, different definitions
of the error function, and different optimization routines. The choices
made for this study are given below.
The so called LF-model was used, because it seems useful for synthesis,
and because it has already been studied in great detail (see e.g. Fant,
Liljencrants, and Lin, 1985). The model and its parameters are presented
in Fig. 1. The relations between the dimensionless wave shape parameters
of the LF-model and the spectrum are well-known (see e.g. Fant and Lin,
1988): Rg has a small influence on the amplitude relations of the lower
harmonics, Rk influences the spectral balance, and Ra influences the spectral
tilt.
- insert Figure 1 about here -
The error function describes the difference between the model and the measured
signals. It can be defined in the time domain, the frequency domain, or
in both domains simultaneously. For this study the error function is based
on the time signals of flow and flow derivative. In a pilot experiment it
was found that this error definition minimises the number of discontinuities
in the signals fitted to Ug and dUg. For a given pitch period the error
function is calculated by subtracting the modelled signals from the measured
signals. The best fitting model waveform is found by adapting the model
parameters in such a way that the energy in the error function is minimised.
An adaptive nonlinear least-squares optimisation algorithm called NL2SNO
(Dennis, Gay, and Welsch, 1981) was used to find the best fit. The algorithm
returns the (minimised) error energy, and the parameters for which that
optimum is found. If the minimal error is smaller than a pre-defined threshold,
then the fit is said to be good. But if the minimal error remains above
the threshold, then all LF-parameters for that pitch period are set to -1
to indicate that the fit is not successful.
2.4. Averaging the results
Inverse filtering was done for all 25 combinations of 5 window lengths (33,
34, 35, 36, and 37 samples) and 5 window shifts (-2, -1, 0, 1, and 2 samples
relative to the moment of glottal closure). The LF-parameters were obtained
for all 25 resulting inverse filter signals, by fitting the LF-model to
the data. For each pitch period median values for all parameters in the
LF-model were calculated.
The median value of a parameter for a pitch period can become negative (-1),
if at least 13 of the 25 values of that parameter are equal to -1. This
occurs if in more than half of the cases the fit was not successful. The
data of all pitch periods in which the median value of one of the LF-parameters
is equal to -1 were discarded. In total 128 periods were discarded, and
the data of 613 pitch periods were used for further analysis. The disadvantage
of using such a conservative criterion is that a lot of data have to be
discarded, but the advantage is that the risk of errors in the final data
is reduced. We are convinced that keeping more of the data for the consonants
and onsets/offsets would not have changed our results and conclusions.
3. Results
- insert Figure 2 about here -
The audio signal, automatically calculated inverse filter results, and automatically
obtained fits for five consecutive pitch periods of a vowel /e/ are given
in Figure 2. The differentiated flow signals often contain a pronounced
ripple. It is clear from this figure that attempts to measure the LF-parameters
from the raw dUg or Ug signals would result in noisy estimates. For instance,
the maximum of dUg (Ei) and the place of this maximum (Ti) are to a large
extent determined by the ripple. By fitting a LF-model to the data the measurements
are made more robust. The fit procedure is almost always able to find a
combination of LF-parameters that generates a model signal that closely
resembles the measured flow signal.
- insert Figure 3 about here -
In Fig. 3 the median values of the most relevant parameters are given for
a voiced interval of one of the utterances. For some pitch periods the median
values of all LF-parameters are -1, indicating that for the majority of
the 25 combinations the fit was not successful for these periods. There
are two causes that could hinder a good fit. Sometimes the estimate of the
vocal tract transfer function was not correct, in which case inverse filtering
did not yield a flow signal that resembles a LF-pulse even remotely. There
were also cases, however, in which inverse filtering produced a reasonable
estimate of dUg, but where the optimization routine did not converge. Not
surprisingly, estimation problems occurred more often in voiced consonants,
and during voice onset and offset (the first and last periods of a voiced
segment) than during the steady parts of vowels.
Furthermore, it was observed that estimates of the parameters of the first
part of the LF-model (the exponentially growing sine wave, i.e. Tp, Te,
Ee) varied less than those of the return phase (i.e. Ta). Partly this is
due to the fact that the duration of the first part is longer than the duration
of the return phase. But another cause is that the return phase often is
not smooth and contains a ripple (see Fig. 2). This pronounced ripple often
affects the automatic fitting process for the return phase. In many cases
a reasonable fit could be reached for the first part of the LF-model, but
not for the return phase. The result is that the median value of Ta often
is -1, while the other parameters are not (see Fig. 3).
For the moment we do not know whether the failure of the fit procedure to
converge to an acceptably small error is due to computational problems or
to the failure of the LF-model to approximate all glottal flow pulse forms
that occur in real speech.
3.1. General behaviour
Typical behaviour of the LF-parameters can be observed in Fig. 3. During
transitions from vowel to consonant T0, Ta, and Tn generally increase, while
transglottal pressure (Ptr), Uo, Ee, and Int decrease. The consistent reciprocal
relation between the parameters in these two sets is reflected in the correlation
coefficients (see Table I), which are all negative and highly significant
(p<0.0001). For these and all following correlation coefficients the level
of significance for a two-tailed test was calculated (Ferguson, 1987). The
correlation coefficient between two sets of 613 samples is said to be significant
at the 0.01% level (p<0.0001) if its absolute value is larger than 0.16.
- insert Table I about here -
The rationale behind this very general behaviour is most probably the following.
For vowels the impedance of the glottis is much higher than the impedance
of the vocal tract, and thus Ptr is almost equal to Psub. For consonants
there is a constriction in the vocal tract, causing a pressure build-up
above the glottis and a drop in Ptr. In order to keep vibration going (with
a lowered Ptr) during these voiced consonants, some adjustments must be
made: the vocal folds are slackened and abducted, and the consequence is
that Ta and Tn are raised. Lowering of Ptr and slackening of the folds will
lower F0, and thus raise T0. Although the folds are slackened, the decrease
in Ptr is such that the amplitude of vibration of the folds decreases, and
with it the modulation of the flow (Uo), and eventually Ee and Int.
The observed reciprocal relation provides a natural way for dividing the
LF-parameters into two sets. The first set consists of Ti, Tp, Te, Tn, Ta,
and T0, and will be referred to as the 'time parameters', while the second
set (Ptr, Uo, Ee, Int) will be referred to as the 'amplitude related parameters'.
Relations within the first set are described in section 3.2, and relations
within the second set in section 3.3. The relations between F0 and other
parameters can be derived directly from the relations of these parameters
with T0. Therefore, they are not treated separately, but are part of section
3.2. The behaviour of the wave shape parameters Rg, Rk, and Ra is described
in section 3.4.
3.2. Time parameters
It was already mentioned that during transitions from vowels to consonants
T0, Ta, and Tn are generally raised (see Fig. 3). The following question
than emerges: How does a change in T0 affect the time parameters, or, in
other words, how does the shape of the pulse change with F0? In this section
we try to answer this question by looking at the relations between T0 and
the other time parameters.
The five time parameters Ti, Tp, Te, Ta, and Tn were first plotted as a
function of T0 on a double logarithmic scale, and the best linear fits were
calculated. The resulting lines are of the form:
logTx = loga0 + a1.logT0 <=> Tx = a0.T0^a1, x element of {i, p, e, a, n}
The regression lines for Ti, Tp, Te, Ta, and Tn are shown in Fig. 4. All
correlations between the logarithm of the five time parameters and the logarithm
of T0 are positive and highly significant (p<0.0001). So, on the average,
all time parameters increase with increasing T0, and the glottal pulse is
stretched. However, this stretching is not distributed uniformly over the
entire period.
- insert Figure 4 about here -
If a time parameter changes linearly with T0, then its regression line in
Fig. 4 should have a slope of 1. In that case it would run parallel to the
reference line for T0 that is also given in Fig. 4 (T0 = 1.T0^1), which
obviously has a slope of 1. This is the case for Te, so generally the duration
of the first part of the LF-pulse changes linearly with T0. However, the
increase in Ti and Tp is less than linear, and the increase in Ta and Tn
(Tn = Te - Tp) is more than linear (see Fig. 4). The ordering of the time
parameters with ascending power is Ti, Tp, Te, Tn, Ta. It seems as if the
amount of stretching increases when going towards the end of the LF-pulse.
With regard to the shape of the LF-pulse, the consequence is that the skewing
decreases more than linearly with T0.
3.3. Amplitude related parameters
A constantly high covariance between the amplitude related parameters was
found for all data (see Table II and Fig. 5). At first sight the high covariance
of these parameters does not seem surprising, as an increase in Ptr alone
(everything else being equal) would increase the amplitude of vibration
of the vocal folds, and therefore lead to an increase in Uo and Ee. Increasing
Uo and Ee by roughly the same amount would lift the spectrum (see Fant and
Lin, 1988), and thus increase Int. However, our data form a mix of voiced
consonants, stressed and unstressed vowels. Thus one might expect large
variations, both in the glottis and in the vocal tract. For instance, for
voiced consonants Ta and Tn are generally higher than for vowels (see section
3.2). A change in Ta has little effect on Int, but an increase in Tn (i.e.
less skewing) combined with a decrease in Uo would lead to a decrease in
Ee that is relatively larger than the decrease in Uo. Given the large variation
in articulatory gestures, it is surprising that the covariance between Ptr,
Uo, Ee, and Int is invariably high.
- insert Figure 5 about here -
- insert Table II about here -
Regression lines were calculated for the amplitude related parameters. The
procedure used was analogous to the procedure used for the time parameters,
as described in section 3.2. The regression lines are of the form:
logX = loga0 + a1logPtr <=> X = a0Ptr^a1, X element of {Uo, Ee, Int}
The slope of the regression line for Uo in Fig. 5 is 1.0, indicating that
the relation between Uo and Ptr is approximately linear. In the LF-model
Ee is a function of Uo and the skewing of the glottal pulse. The fact that
both Uo and skewing increase with increasing Ptr explains why the slope
for Ee (of 1.6, see Fig. 5) is larger than the slope for Uo. The slope of
the regression line for Int (of 3.0) is about twice the value found for
Ee, which is not surprising, because the Int of a freely travelling spherical
sound wave is proportional to the square of the derivative of the mouth
flow (Beranek, 1954). However, without the use of a proper production model
it is difficult to unravel the exact underlying relations between the parameters.
3.4. Wave shape parameters
For the dimensionless wave shape parameters Rg, Rk, and Ra the following
general relations can then be derived. Rg is almost constant; the correlation
of Rg with T0 is positive but very small (see Table III). For the range
of Rg values found in this study, the influence of this parameter on the
spectrum (and thus on voice quality) is very small. The correlations of
Ra and Rk with T0 (see Table III) are positive and highly significant (p<0.0001),
which implies that voice quality changes with T0 and consequently with F0.
The correlations of Ra and Rk with Int and Ptr were even higher (see Table
III), so voice quality also changes with Int. The average values of Rg,
Rk, and Ra were 108%, 41%, and 6.5% respectively and are in accordance with
the values given by Carlson et al. (1989).
3.5. Deviations from the general behaviour
The fact that we have a large data set in which most parameters display
consistent relations allows us to identify the outliers, i.e. the instances
that do not fit in with the general pattern. Pitch periods that show different
relations between the parameters are mainly found during voice onset and
voice offset, and in the last syllable of an utterance.
The values of Uo for voice onset and offset generally fall below the regression
line of Uo on Ptr that is given in Fig. 5, but there are also differences
between voice onset and offset. The average Ptr during an UV/V transition
(5.0 cm H2O) is higher than the average Ptr during a V/UV transition (3.7
cm H2O). It seems that higher Ptr values are needed to initiate vibration
of the vocal folds, than to keep vibration going towards the end of a voiced
interval. At the beginning of a voiced interval the average values of Int
and F0 (59 dB and 131 Hz) are also higher than those at the end of a voiced
interval (57 dB and 120 Hz). Furthermore, a rise in Ta and Tn was found
both towards beginning and end of a voiced interval.
Near the end of all 30 utterances there was a substantial decrease in Psub,
Ptr, Int, and F0; and a marked increase in the activity of the sternohyoid.
Also, for the final vowel Uo was relatively high, compared to the general
trend. The deviating behaviour of the voice source during the final syllable
was also observed by Klatt and Klatt (1990). This is described in more detail
in Strik and Boves (in press).
4. Conclusions
In general, the method of automatic inverse filtering and fitting worked
satisfactorily. Most problems were encountered with attempts to obtain a
good approximation for the Ta parameter in pitch periods taken from consonants.
For some glottal periods our method did not succeed in finding a combination
of LF-parameters that define a LF-model that closely resembles dUg. This
could be a shortcoming of the inverse filter or the fitting procedure, but
also of the LF-model. It remains to be seen if the LF-model can describe
all variations in the glottal pulse that occur in different kinds of speech.
Consistent relations were found within the set of the time parameters and
the set of amplitude related parameters, but also between the parameters
of both sets. The highest correlations were found between Ptr, Uo, Ee, and
Int. The behaviour of the voice source during voice onset, voice offset,
and the last syllable was different from the general behaviour. When relating
LF-parameters to prosody the general picture is that voice quality is mainly
affected by Rk and Ra (or Tn and Ta), and that Int is mainly affected by
Ee (or Uo).
All these fluctuations in the voice source parameters are likely to have
perceptual consequences. To improve the naturalness of synthetic speech,
these effects have to be taken into account.
Acknowledgements
This research was supported by the foundation for linguistic research, which
is funded by the Netherlands Organization for the Advancement of Scientific
Research N.W.O. Special thanks are due to dr. Philip Blok who inserted the
hooked-wire electrodes and the pressure catheter in the experiments.
References
L. Beranek (1954), Acoustics (McGraw-Hill Book Company, New York), pp. 23-115.
R. Carlson, G. Fant, C. Gobl, B. Granstrom, I. Karlsson and Q. Lin (1989),
"Voice source rules for text-to-speech synthesis", Proc. ICASSP, Vol. 1,
pp. 223-226.
J.E. Dennis, D.M. Gay, and R.E. Welsch (1981), "An adaptive nonlinear least-squares
algorithm", ACM Transactions on Mathematical Software, Vol. 7, pp. 348-368.
G. Fant, J. Liljencrants, and Q. Lin (1985), "A four-parameter model of
glottal flow", STL-QPSR, Vol. 4, pp. 1-13.
G. Fant and Q. Lin (1988), "Frequency domain interpretation and derivation
of glottal flow parameters", STL-QPSR, Vol. 2-3, pp. 1-21.
G.A. Ferguson (1987), Statistical analysis in psychology and education (McGraw-Hill
Book Company, Singapore), pp. 195.
J. Jansen, B. Cranen, and L. Boves (1991), "Modelling of source characteristics
of speech sounds by means of the LF-model", Proc. of EUROSPEECH '91, Vol.
1, pp. 259-262.
D.H. Klatt and L. Klatt (1990), "Analysis, synthesis, and perception of
voice quality variations among female and male talkers", J. Acoust. Soc.
Am., Vol. 87, pp. 820-857.
H. Strik and L. Boves (in press), "Control of fundamental frequency, intensity
and voice quality in speech", J. of Phon.
J. de Veth, B. Cranen, H. Strik and L. Boves (1990), "Extraction of control
parameters for the voice source in a text-to-speech system", Proc. of ICASSP-90,
paper 21.S6a.2.
- Figure captions -
Fig. 1. Glottal flow (Ug) and glottal flow derivative (dUg) with the parameters
of the LF-model.
Uo: maximum of Ug
Ei: maximum of dUg
Ee: absolute value of the minimum of dUg
t = 0: time of glottal opening
Tc: time of glottal closure
Ti, Tp, Te: time points of Ei, Uo, and Ee respectively
Ta: the time between Te and the projection of the tangent of dUg in t=Te
Tn = Te - Tp
The dimensionless wave shape parameters than can be derived from the LF-parameters
are:
Rg = T0/2Tp
Rk = Te/Tp - 1 = Tn/Tp
Ra = Ta/T0
Fig. 2. Results of the automatic fitting procedure for five periods of a
vowel /e/. Shown are, from top to bottom, audio signal, glottal flow derivative
(dUg, solid line) with fitted signal (dotted line), and glottal flow (Ug,
solid line) with fitted signal (dotted line).
Fig. 3. Results for a voiced interval to illustrate the behaviour of the
voice source parameters. Given are, from top to bottom, phonetic transcription,
audio signal, transglottal pressure (Ptr), median values of Ee and Uo, intensity
(Int), and median values of T0, Ta, and Tn. Although /p/ is phonologically
an unvoiced plosive, it is observed that voicing continues in this utterance.
Fig. 4. The relation between the time parameters and T0. Given are the regression
lines of the time parameters as a function of T0. Note that both the horizontal
and the vertical axis have a logarithmic scale.
Fig. 5. Scatterplots of the amplitude related parameters Uo, Ee, and Int
as a function of Ptr, with regression lines. Note that both the horizontal
and the vertical axis have a logarithmic scale.