Data processing of physiological signals related to speech
H. Strik & L. Boves (1988e)
AFN-Proceedings, University of Nijmegen, Vol. 12, pp. 41-55.

0. Abstract

The physiological control of 'spontaneous' speech of an untrained subject was studied. One of the measured physiological signals is transglottal pressure (Pt), because Pt is believed to be one of the main factors in the control of the vibratory behaviour of the vocal folds. The data were analyzed using a novel processing technique, because the variation in the time structure of the individual repetitions is too large to make meaningful straightforward averaging possible. The results show that speech gestures are reproducible, and that Pt indeed is one of the most important factors in the control of fundamental frequency (F0) and intensity level (IL) in speech. However, the results do depend on the analysis method used.

1. Introduction

Studies of the physiological control of speech parameters like F0 and IL have traditionally been characterized by the use of trained phoneticians as subjects, and by the use of specially designed speech material, mostly consisting of a small number of short sentences that had to be produced in several ways (different intonation contours, different placement of emphatic stress, etc.). Another quite general characteristic of those studies is the fairly global way of processing the measurement data obtained in the course of the experiments.

This state of affairs has a number of potentially detrimental consequences. One of the most important disadvantages is that what we know about the physiological control of speech parameters is true only under the very limited condition that the talker is trained to serve as a subject in phonetic experiments and that he knows precisely how to produce a specific phonetic effect in a consistent manner. Especially if relations between physiological control parameters and speech parameters are derived from single tokens, or averages of small numbers of tokens, there is the risk that the observations actually are idiosyncrasies that may not be generalized to other talkers or other speech conditions. Lastly, but not least important, the global data processing, sometimes consisting of visual averaging of traces and visual estimation of correlation coefficients, may result in an emphasis of expected relations at the cost of other, perhaps stronger but unexpected, relations. Moreover, the time span of the analysis may also influence the results to an unwanted degree.

In this paper we make a first attempt at addressing the methodological problems encountered in speech physiology research. First of all, we used a subject who had no previous training as a phonetician, linguist or singer. Secondly, he was asked to repeat a 'spontaneously' produced, fairly complex sentence 29 times. Finally, we have analyzed the relations between physiological and speech parameters on different levels, using formal statistical techniques.

Physiological signals that are related to the speech production system invariably contain some random activity. Especially electromyographic (EMG) signals are always very noisy, but other signals like subglottal pressure (Psb), supraglottal pressure (Psp), and lung volume (Vl) are usually also far from clean. To extract meaningful relations between the various physiological processes it is necessary to improve the signal-to-noise ratio of these signals.

The usual procedure to highlight information in physiological signals is to time-align and average the signals derived from various repetitions of the 'same' utterance (Kewley-Port, 1973). For meaningful averaging of physiological signals related to speech, two requirements must be fulfilled. The first is that speech gestures must be reproducible, for it is the causal relation between a physiological process and a speech gesture that one wants to reveal. The second requirement is that the inter-token variation in the temporal structure of the repetitions must be small. Note that these are really different requirements. Two tokens can be produced with essentially equivalent articulatory gestures, while the speech rate is still very different. On the other hand, similar overall articulation rates in two tokens do not guarantee that they were produced with similar articulatory gestures.

Trained speakers often succeed in repeating the same utterance with essentially equivalent articulatory gestures and with approximately the same duration, especially if the utterances are not too long and too complex. But repetitions of a fairly long sentence (duration ±4 s) produced by a person who had experience in acting as a subject in phonetic experiments, time-aligned with a line-up point in the middle of the sentence, show deviations of as much as 150 ms towards both ends (Strik and Boves, 1988). For the repetitions produced by the 'linguistically naive' subject of the present experiment the deviations were even considerably greater. With deviations of this magnitude straightforward averaging of tokens becomes questionable.

Therefore, before averaging the physiological signals, the time axes of the repetitions must be adjusted such that the speech gestures of the various repetitions are not only time-aligned at the line-up point, but at every time point of the utterance. For this purpose a novel processing technique is proposed, in which a dynamic time warping (DTW) algorithm is used to obtain a sufficient degree of time-alignment.

2. Method

2.1 Experimental procedure

In this experiment simultaneous recordings of the acoustic signal, electroglottogram (EGG), Vl, Psb, Psp, and EMG activity of the sternohyoid (SH), cricothyroid (CT) and vocalis muscle (VOC) were obtained while the subject carried out a large variety of speech tasks. All measured signals were stored on a 14-channel instrumentation recorder (TEAC XR-510).

The speech signal was transduced by a condenser microphone (B&K 4134) placed about 10 cm in front of the mouth, and amplified by a measuring amplifier (B&K 2607).

The pressure signals were recorded using a catheter with four pressure transducers, in the way described by Cranen and Boves (1985). The catheter was inserted pernasally and fed through the glottis into the trachea. The pressure measurements were calibrated by recording the pressure signals while the subject held lung pressures of up to 20 cm H2O against a water-filled U-tube manometer. The catheter, situated in the posterior commissure of the glottis, did not have a noticeable effect on phonation.

The EMG signals were recorded using hooked-wire electrodes (Hirose, 1971). The electrodes were inserted percutaneously, and correct electrode placement was confirmed by audio-visual monitoring of the signals during various functional manoeuvres. The EMG signals were calibrated by recording them while a 200 mV sine was used as input for the pre-amplifiers.

The perimeters of chest (X) and abdomen (Y) were measured with mercury-filled strain-gauge wires. Lung volume was calculated from the weighted sum of these two signals: Vl=aX+bY+c. To obtain the relative contribution (a/b) of each signal the subject performed paradoxical movements (chest out, abdomen in and vice versa) with a closed glottis, i.e. at constant lung volume. In a second lung volume calibration manoeuvre the subject filled and emptied a balloon of fixed size several times. This gave absolute values for a and b. In our set-up the value of c cannot be determined, so that the measurements are confined to relative lung volumes. This is, however, sufficient for computing average flow.
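The calibration logic above can be sketched as follows. This is a minimal, noise-free two-equation solution, not the authors' actual procedure; the function name and the numeric excursions are hypothetical.

```python
import numpy as np

def calibrate_lung_volume(dx_paradox, dy_paradox, dx_balloon, dy_balloon, v_balloon):
    """Recover weights a, b in Vl = a*X + b*Y + c from two manoeuvres.

    dx_paradox, dy_paradox: chest/abdomen excursions during a paradoxical
        movement with closed glottis (constant lung volume).
    dx_balloon, dy_balloon: excursions while emptying a balloon of known
        volume v_balloon (in cc).
    """
    # Constant volume: a*dx + b*dy = 0  =>  a/b = -dy/dx
    ratio = -dy_paradox / dx_paradox          # a / b
    # Known volume: a*dx + b*dy = v_balloon, with a = ratio * b
    b = v_balloon / (ratio * dx_balloon + dy_balloon)
    a = ratio * b
    return a, b

# Hypothetical excursions (arbitrary strain-gauge units)
a, b = calibrate_lung_volume(2.0, -1.0, 1.5, 1.0, 800.0)
```

The offset c drops out of both equations, which is why only relative lung volumes can be obtained.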

The subject was a male native speaker of Dutch, with no experience in phonetics or linguistics and with no history of respiratory or laryngeal dysfunction. Near the end of the experiment he was asked to produce an utterance spontaneously. The Dutch sentence he produced was: "Ik heb het idee dat mijn keel wordt afgeknepen door die band" (I have the feeling that my throat is being pinched off by that band). After he spoke this sentence, he was asked to repeat it 29 times. During the experiment the quality of the recordings of VOC and CT decreased considerably. For the utterances described in this paper, which were recorded near the end of the experiment, these two signals were not considered useful for analysis.

2.2 Data preparation

All signals were A/D converted at a 10 kHz sampling rate. The files were stored on a microVAX computer.

F0 and IL were calculated with the SIF program of ILS. Both values were calculated every 5 ms, resulting in F0 and IL signals sampled at a 200 Hz rate. F0 was calculated from the EGG signal, because this gave better results than the F0 calculated from the audio signal. Still, the F0 contours contained so many errors that they had to be corrected manually. IL was calibrated with the use of a pistonphone (B&K 4220). The physical unit of F0 is Hz, and that of IL is dB re 10^-12 W/m^2.

Pressure signals, chest and abdomen signals were low-pass filtered (third order digital elliptic low-pass filter, pass band edge 30 Hz) and downsampled to 200 Hz. For the correlation analyses described in this article only the DC-component of the pressure signal is needed. Lung volume was calculated from the low-pass filtered chest and abdomen signals. The physical units of pressure and of lung volume are cm H2O and cc, respectively.
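A comparable filtering and decimation step might look as follows in Python. The paper does not state the filter's ripple specifications, so the 0.5 dB pass-band ripple and 40 dB stop-band attenuation below are assumptions, as is the use of zero-phase filtering.

```python
import numpy as np
from scipy.signal import ellip, filtfilt

fs = 10_000            # original sampling rate (Hz)
fs_out = 200           # target rate (Hz)
decim = fs // fs_out   # downsampling factor: 50

# Third-order elliptic low-pass, pass-band edge 30 Hz
b, a = ellip(3, 0.5, 40.0, 30 / (fs / 2))

def lowpass_and_downsample(x):
    y = filtfilt(b, a, x)      # zero-phase filtering (assumed)
    return y[::decim]          # 10 kHz -> 200 Hz

x = np.random.randn(20_000)    # 2 s of hypothetical pressure signal
y = lowpass_and_downsample(x)  # 400 samples at 200 Hz
```

Filtering before decimation keeps the DC component, which is all the later correlation analyses need, while suppressing aliasing of faster components.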

Because the EMG signals sometimes contained a disturbing 50 Hz hum and/or a small DC-offset, they were high-pass filtered (third order digital elliptic high-pass filter, pass band edge 200 Hz). The EMG signals recorded with hooked-wire electrodes are interference patterns of (usually a few) single motor unit potentials. Since the motor unit potentials are sharp spikes, high-pass filtering hardly affects them. The integrated rectified EMG was calculated in the way described by Basmajian (1975): first the signal is full-wave rectified, and then it is integrated over successive periods of 5 ms. The integrator is reset after each integration. The physical unit of the EMG signals is mV.
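The rectify-and-integrate step can be sketched as follows. The window arithmetic (5 ms at 10 kHz) follows the text, but expressing each window as a mean rectified value in mV is an assumption of this sketch.

```python
import numpy as np

def integrated_rectified_emg(emg, fs=10_000, win_ms=5):
    """Full-wave rectify, then integrate over successive 5 ms windows,
    resetting the integrator after each window (Basmajian-style).
    Each window is expressed as its mean rectified value (mV)."""
    win = int(fs * win_ms / 1000)              # 50 samples per window
    n = len(emg) // win
    rect = np.abs(emg[:n * win])               # full-wave rectification
    # Averaging within each window approximates the reset integrator
    return rect.reshape(n, win).mean(axis=1)   # one value per 5 ms

emg = np.random.randn(10_000)          # 1 s of hypothetical EMG at 10 kHz
ire = integrated_rectified_emg(emg)    # 200 values -> a 200 Hz signal
```

The output rate matches the 200 Hz rate of the other downsampled signals, so the integrated EMG can enter the same correlation analyses directly.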

The EMG signal is a measure of the electric potential in a muscle. But the mechanical action of a muscle lags behind the main burst of electric potentials. Therefore, in order to correlate muscle activity with the resultant acoustic event, the EMG signal must be shifted forward in time. Discrete cross-correlation functions between F0 data and SH data were calculated. Following Atkinson (1978), the mean response time (MRT) of the muscle was defined as the average lag at which the cross-correlation functions reach their maxima. All SH signals were shifted forward over their MRT of 190 ms. The value found by Atkinson (1978) was 120 ms. At the moment we have no explanation for the discrepancy between his value and ours.
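The lag search behind such an MRT estimate can be sketched as follows; the function name and the synthetic signals are illustrative, not the authors' implementation.

```python
import numpy as np

def response_time_ms(emg, f0, fs=200, max_lag_ms=400):
    """Lag (in ms) at which the cross-correlation between an integrated
    EMG trace and the F0 contour peaks, both sampled at 200 Hz.  The EMG
    is assumed to lead F0, so only non-negative lags are searched."""
    max_lag = int(max_lag_ms * fs / 1000)
    corrs = [np.corrcoef(emg[:len(emg) - k], f0[k:])[0, 1]
             for k in range(max_lag + 1)]
    return 1000 * int(np.argmax(corrs)) / fs

# Synthetic demo: F0 repeats the EMG pattern 38 samples (190 ms) later
rng = np.random.default_rng(0)
base = rng.standard_normal(1038)
emg_trace = base[38:]
f0_trace = base[:-38]
mrt = response_time_ms(emg_trace, f0_trace)   # 190.0
```

The SH traces would then be shifted forward over this lag before entering the correlation analysis.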

2.3 Data processing

2.3.1 Time-alignment

To improve the signal-to-noise ratio of the physiological signals the method of time-alignment and averaging is often used (see Introduction). For this method to be used the subject has to produce several repetitions of the 'same' utterance. Line-up points are defined in each of the tokens, and with these line-up points the signals of the repetitions are time-aligned. Physiological signals are then averaged. However, F0 (and/or IL) signals usually are not averaged. The F0 (and/or IL) contour of one of the repetitions is chosen to represent the 'average' F0 (and/or IL) contour (Collier, 1975; Maeda, 1976).
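The conventional line-up-and-average procedure can be sketched as follows, assuming equal sampling rates across tokens; the line-up indices and signals are hypothetical.

```python
import numpy as np

def lineup_average(signals, lineups):
    """Time-align repetitions at their line-up samples and average them.
    `signals` is a list of 1-D arrays, `lineups` the line-up sample index
    of each repetition; the average covers the span all tokens share."""
    left = min(lineups)                                        # samples before line-up
    right = min(len(s) - l for s, l in zip(signals, lineups))  # samples after
    stack = np.stack([s[l - left : l + right]
                      for s, l in zip(signals, lineups)])
    return stack.mean(axis=0)

# Three hypothetical repetitions of different lengths
sigs = [np.sin(np.linspace(0, 10, n)) for n in (450, 500, 480)]
avg = lineup_average(sigs, [200, 230, 210])
```

Note that this guarantees alignment only at the line-up point itself, which is exactly the limitation the time-normalization of the next section addresses.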

Previous results (Strik and Boves, 1988) already revealed that even with a trained speaker and prescribed sentences, the variation in the temporal structure can be too large to make meaningful averaging possible. It seems necessary to preprocess the signals before averaging them. Therefore, a novel processing technique was developed, which is described in the next section.

2.3.2 Time-normalization

In the novel method the same line-up points, as described above, are used to time-align the signals of all repetitions. The repetition with median length is then chosen as the most representative one, and used as a reference for time-normalization of the remaining tokens. In the following this reference utterance will be called 'the template'. In order to effect time-normalization, cepstrum coefficients are calculated for all speech signals, one set of cepstrum coefficients for every 5 ms of speech. Using a dynamic time warping (DTW) algorithm, the warp functions between the speech signals of all tokens and the speech signal of the template are calculated (Vintsyuk, 1968). DTW finds the local distortions of the time axis of a test utterance, relative to the template, in such a way that the summed spectral distance between the portions of the signals that get aligned is minimized. In searching for this optimal non-linear time-alignment function, the additional constraint has to be satisfied that the maximum amount of local time distortion remains within reasonable bounds (Sakoe and Chiba, 1978). The maximum local time distortion allowed by the DTW algorithm used in the present research was 200 ms in either direction.
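A minimal sketch of DTW with a Sakoe-Chiba band is given below. This is not the authors' implementation: the frame sequences and the Euclidean frame distance stand in for the cepstral representation and spectral distance, and 40 frames corresponds to the 200 ms bound at the 200 Hz frame rate.

```python
import numpy as np

def dtw_path(test, ref, band=40):
    """Minimal DTW between two frame sequences (n_frames x n_coeffs)
    with a Sakoe-Chiba band of +/- `band` frames.  Returns the warp
    path as a list of (i_test, j_template) index pairs."""
    n, m = len(test), len(ref)
    D = np.full((n + 1, m + 1), np.inf)      # cumulative distance matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            d = np.linalg.norm(test[i - 1] - ref[j - 1])   # frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the optimal alignment
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

For two identical frame sequences the recovered path is the main diagonal, i.e. no time distortion.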

Each warp function maps the 200 Hz speech analysis file of a repetition onto the 200 Hz speech analysis file of the template. The warp function of a repetition is then used to distort the time axes of all physiological signals belonging to that repetition, resulting in physiological signals whose time axes are adjusted with reference to the template. Henceforth we will call this procedure 'time-normalization'. After time-normalization, median signals were calculated: at every point in time the median of the 29 values of the time-normalized signals was taken. This was also done for F0 and IL. For F0 this appeared to yield a good V/UV decision criterion for each sample of the utterance.
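Applying a warp path to a physiological signal and taking per-sample medians could be sketched as follows; the rule of averaging test frames that map onto the same template frame, and all data below, are illustrative assumptions.

```python
import numpy as np

def warp_to_template(signal, path, template_len):
    """Map a repetition's 200 Hz signal onto the template time axis using
    a DTW warp path (list of (i_test, j_template) pairs).  Where several
    test frames map to one template frame, their mean is taken."""
    out = np.zeros(template_len)
    counts = np.zeros(template_len)
    for i, j in path:
        out[j] += signal[i]
        counts[j] += 1
    return out / np.maximum(counts, 1)

# Hypothetical stack of 29 time-normalized F0 contours, with unvoiced
# samples coded as 0: the per-sample median is > 0 exactly when at least
# 15 of the 29 tokens are voiced, which doubles as the V/UV decision.
f0_stack = np.abs(np.random.randn(29, 500)) * 100 + 80
median_f0 = np.median(f0_stack, axis=0)
voiced = median_f0 > 0
```

Coding unvoiced samples as 0 makes the median of 29 values a majority vote: the 15th ordered value is nonzero only if a majority of tokens is voiced at that sample.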

2.3.3 Correlation analysis

After time-normalization, the physiological signals of 29 utterances and the median physiological signals were available for further analysis. All signals had a sampling rate of 200 Hz. Further analysis is mainly based on the median signals, although in one case the original signals are used.

All physiological signals, except F0, are continuous functions of time. F0 is a discontinuous function, because it is non-existent during both the voiceless intervals and silent periods. It is not the purpose of this experiment to study the distinction between voiced and unvoiced speech; many laryngeal muscles and Pt are involved in this process. Rather the purpose is to study the control of speech during voicing. Therefore, only the voiced portions of the utterances were used in calculating the correlation coefficients.

Correlation coefficients were calculated, for many different data sets, between all possible pairs of the six variables (i.e. 15 pairs) using the Pearson-Product-Moment formula. Each data set contains a number of 6-dimensional data vectors, consisting of the values of F0, IL, Pt, Psp, Psb and SH at the same point in time. The first data set was created by appending all 29 utterances. It comprised data vectors for all 7937 voiced samples. Three smaller data sets were created by taking the data vectors pertaining to the largest voiced intervals of the median signals. Finally, an additional data set was created consisting of data vectors for all voiced samples of the median signals.
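The pairwise correlation computation can be sketched as follows, with hypothetical data in place of the measured vectors.

```python
import numpy as np

names = ["F0", "IL", "Pt", "Psp", "Psb", "SH"]
# Hypothetical data set: one 6-dimensional vector per voiced 5 ms sample
data = np.random.randn(7937, 6)

R = np.corrcoef(data, rowvar=False)      # 6 x 6 Pearson correlation matrix
# The 15 distinct variable pairs are the upper triangle of R
pairs = [(names[i], names[j], R[i, j])
         for i in range(6) for j in range(i + 1, 6)]
```

With 6 variables there are 6*5/2 = 15 pairs, matching the 15 coefficients reported in each table.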

3. Results

None of the 29 repetitions of the spontaneous sentence included an inhalation pause. In the original, spontaneous sentence there was a pause of almost half a second, due to a swallowing gesture of the subject. Similar gestures did not occur in the 29 repetitions. Thus, in order to minimize the risk that utterances containing different articulatory gestures were averaged, only the last 29 sentences were used for analysis.

The total duration of the utterances and the pauses between them is ±300 s. The total duration of the utterances alone is ±67 s. In total there are 7937 voiced samples, almost 40 seconds of voiced speech. The average number of voiced samples per token is 274 (sd=18). The individual voiced intervals of the utterances have lengths varying from 10 to 595 ms.

3.1 Variability of speech gestures

In the introduction it was already mentioned that the two requirements for meaningful averaging of physiological signals are little inter-token variation in temporal structure and little inter-token variation in speech gestures. Both types of variation are studied in the next sections.

3.1.1 Variation in temporal structure

The average length of the utterances produced by our subject was 2310 ms (sd=130). The maximum and the minimum were 2615 ms and 2165 ms.

The release of the /k/ of /keel/ was used as a line-up point. This line-up point was chosen because it is clearly distinguishable, and it is situated approximately in the middle of the sentence. After defining this line-up point in each sentence it was possible to calculate the variation in the first part (from beginning to the line-up point) and the last part (from line-up point to the end) of the utterances. The average duration of the first part was 880 ms (sd=80), with a maximum of 1075 ms and a minimum of 780 ms. The average duration of the last part was 1430 ms (sd=70); the maximum and minimum value were 1590 ms and 1320 ms.

The results show that one can hardly maintain that there is little variation in the temporal structure of the signals. The maximum deviations for the first and last part were 295 ms and 270 ms, respectively. In general, it appeared that the subject tended to increase his articulation rate as he repeated the utterances more often. But even for the last six sentences the maximum deviations in the first and last part were 120 ms and 90 ms, respectively. So even after numerous repetitions the magnitude of the deviations is still so large that straightforward averaging of the tokens could result in averaging physiological signals of one word in one sentence with those of a totally different word in another sentence. With such differences in temporal structure, time-alignment and averaging no longer seems a useful procedure to extract meaningful relations.

In Figure 1 the average signals are plotted for, from top to bottom, F0, IL, Pt, Psp, Psb, Vl and SH. It can clearly be seen that the averages are only meaningful in the direct neighbourhood of the line-up point.

3.1.2 Variation in speech gestures

Because the variation in the temporal structure of the 29 repetitions is very large, it appears to be necessary to normalize the time axes of the utterances before the signals are averaged. All signals were 'time-normalized' using repetition 5 as the template. After time-normalization, median signals were calculated. At each point in time, for every physiological quantity, the 29 values were ordered from low to high; the median then is the 15th value. The median signals are plotted in Figure 2. To give an idea of the amount of variation around the median, traces for the 5th and the 25th value are also plotted (dotted lines).

From Figure 2 we can conclude that the method of time-normalization worked satisfactorily, because the variation in temporal structure that remains around the median is reasonably small. The second conclusion that can be drawn from Figure 2 is that the variation in the magnitude of the signals is also very small. The largest variations were found for Vl. Fairly large variations were also found for the first word of the utterance. In some utterances the first word was clearly pronounced; in these cases IL and Pt were large, and part of the word was voiced. But in most utterances the clitic version of the personal pronoun was used, turning the pronunciation /Ikh?p/ into /k?p/; in these cases IL and Pt were small, and no part of the pronoun was voiced.

Correlation coefficients were calculated for all voiced samples of the 29 utterances (see Table I). All correlations are highly significant (p<0.0001), reflecting the consistent relations between the variables. Thus, apart from Vl, it seems that speech gestures are reproducible, and the first requirement mentioned above is met. So, after time-normalization, averaging seems useful to extract meaningful relations between the various physiological processes.

Table I. Correlation matrix, mean and standard deviation for all voiced samples.


        F0      IL      Pt     Psp     Psb      SH    mean     SD

F0   1.000   0.561   0.556  -0.068   0.621  -0.382  117.01   9.98
IL           1.000   0.828  -0.574   0.439  -0.207   64.42   4.49
Pt                   1.000  -0.626   0.601  -0.309    4.97   1.32
Psp                         1.000    0.248  -0.114    0.90   1.09
Psb                                  1.000  -0.500    5.87   1.06
SH                                           1.000    8.65  15.10


|R|>0.046 for p<0.0001


3.2 Physiological control of speech

The top panel of Figure 2 shows the oscillogram of the audio signal of repetition 5, the utterance used as a reference for time-normalization. The median signals in the other panels cannot be directly related to this audio signal, because they are not the physiological signals belonging to this particular token but averages of 29 repetitions. For instance, the audio signal may not seem periodic during the last schwa, but if it is periodic in at least 15 of the 29 tokens, then the median F0 value will indicate that it is voiced. The median physiological signals may, however, be compared with each other.

If the data are analyzed quantitatively, using a correlation method, then the results are dependent on the time domain over which the correlations are calculated. The control of F0 and IL on word level is addressed in section 3.2.1, and the control of F0 and IL on sentence level in section 3.2.2. Other apparent results are briefly mentioned here.

One clear result is that the Vl traces of the individual repetitions run parallel, but the top and the bottom trace are separated by ±400 cc. On the other hand, the Psb values show little variation between the individual repetitions, which indicates that it is possible, at least for this subject, to produce the same subglottal pressures with different lung volumes.

Apart from the slow decline in Psb, there is little variation in the median value of Psb. The largest local variation is probably the rise during the /k/ of /keel/, just before the word with emphatic stress. This is probably a combined effect of the prolonged obstruction of the vocal tract and the extra effort of the expiratory muscles that assist the laryngeal muscles in raising the F0 of the vowel immediately following this consonant.

3.2.1 Control on word level

Before analyzing the control of F0 and IL, the F0 and IL signals in Figure 2 are first examined in order to assess variation between the tokens. Considerable inter-token variations were found for the first word, as explained in section 3.1.2. Except for this first word, the inter-token variation in intensity is very small. The inter-token variation in F0 is somewhat larger. This is not because the inter-token variation in the absolute value of F0 is large, but mainly because there is a fairly large inter-token variation in the voiced/unvoiced decision of a few consonants. Some consonants are always voiced, others are always unvoiced, but there are consonants that are voiced in some tokens and unvoiced in others. As mentioned above (see data processing), an F0 sample is classified as voiced if it is voiced in at least 15 of the 29 tokens.

A large number of studies on F0 and IL in speech have reported a positive relation between F0 and Psb (Collier, 1975; Atkinson, 1978; Baer, 1979; Shipp, Doherty and Morrissey, 1979; Gelfer, Harris, Collier and Baer, 1983; Titze and Durham, 1987), and a positive relation between IL and Psb (van den Berg, Zantema and Doornenbal, 1957; Rubin, 1963; Isshiki, 1964; Ladefoged, 1967; Bouhuys, Mead, Proctor and Stevens, 1968; Baer, Gay and Niimi, 1976). But the results of the present experiment suggest that it is Pt that covaries most closely with F0 and IL (see Figure 2). This visual impression was tested by calculating the correlation coefficients for the three largest voiced intervals (Table II). The results clearly show that Pt is far more important in the control of F0 and IL than Psb.

The correlation between Pt and Psp is almost -1 for all three segments. Therefore it would have been possible, on purely statistical grounds, to substitute Psp for Pt in all occurrences above. But on physiological grounds it seems more reasonable to state that it is Pt that is important in the control of the vibratory behaviour of the vocal folds.

Table II. Correlation matrix, mean and standard deviation for the three largest voiced intervals of the median physiological signals.


        F0      IL      Pt     Psp     Psb      SH    mean     SD

F0   1.000   0.838   0.914  -0.841   0.368   0.043  117.80   3.17
IL           1.000   0.923  -0.970   0.035  -0.045   63.27   3.44
Pt                   1.000  -0.941   0.266   0.125    5.40   0.88
Psp                         1.000   -0.195  -0.048    1.14   0.86
Psb                                  1.000   0.248    6.35   0.13
SH                                           1.000    1.27   0.04

|R|>0.315 for p<0.01

F0   1.000   0.833   0.869  -0.921   0.235  -0.460  127.31   2.40
IL           1.000   0.933  -0.920   0.478  -0.433   64.84   1.53
Pt                   1.000  -0.987   0.558  -0.617    5.39   1.03
Psp                         1.000   -0.439   0.571    0.75   1.02
Psb                                  1.000  -0.703    5.94   0.18
SH                                           1.000    7.00   7.63

|R|>0.369 for p<0.01

F0   1.000   0.784   0.837  -0.721   0.541   0.172  110.06   4.36
IL           1.000   0.973  -0.942   0.316   0.138   57.96   4.66
Pt                   1.000  -0.938   0.401   0.172    3.93   1.00
Psp                         1.000   -0.079  -0.337    1.22   0.98
Psb                                  1.000  -0.354    5.04   0.24
SH                                           1.000    6.37   5.78

|R|>0.313 for p<0.01


In section 3.2.2 we will try to explain why Pt is more important than Psb on a local level; here we will try to give a physiological explanation of the importance of Pt in the control of F0 and IL. This explanation is achieved best by dividing the problem into the control during vowel production and the control during consonant production. Of course, there are also intermediate states, but they can be seen as interpolations between both extremes.

For vowel production it is possible to derive a direct causal relation between Pt on the one hand, and F0 and IL on the other. During vowel production there are no obstructions in the vocal tract, and the acoustic impedance of the glottis is much larger than the impedance of the vocal tract. Therefore, Psp is almost zero and Pt is almost equal to Psb (see Figure 2). Titze and Durham (1987) showed that during stable phonation the maximum glottal width (Gm) changes as a function of Psb (=Pt) alone (the activity of all laryngeal muscles is assumed to be constant). They argued that this increase in amplitude of vibration leads to an increase in F0. But if the amplitude of the vibration increases, and if the period time of the vibration decreases, then the vocal folds must close faster. Therefore, the airflow would have a steeper slope during closing, and IL would increase (Gauffin and Sundberg, 1980). The conclusion is that this mechanism would predict that during vowel production F0 and IL are positively related to Pt alone.

For voiced consonants the positive relation could be the result of a combination of the direct relation given above, and more indirect relations given below. During consonant production the vocal tract is constricted at some point along its length and there is a pressure build-up in the supraglottal region. The increase in Psp could be such that Pt drops below a certain threshold value, in which case the vibration of the vocal folds stops. The threshold value depends on the state of the larynx, i.e. the activity of the laryngeal muscles. The average Pt at which voicing stops for the first five tokens (52 voiced intervals) is 2.42 cm H2O (sd=0.88), and for the same tokens the average Pt at which voicing starts is 5.52 cm H2O (sd=0.95). Thus it seems that it is easier to keep vibration going than it is to start vibration.

Stevens (1977) suggested that the laryngeal musculature controlling vocal fold stiffness always responds to a decreasing Pt during consonant production either by increasing the stiffness, to stop vocal fold vibration, or by decreasing the stiffness, to keep vibration going although Pt is lowered. For voiced consonants the vocal folds would then be slackened, and F0 would decrease with decreasing Pt. Due to the constriction in the vocal tract and/or the smaller opening of the mouth, compared with vowel production, IL would also decrease.

The median value of Pt for all unvoiced consonants of this utterance remains below the average value of Pt at which voicing stops (2.42 cm H2O), and for all voiced consonants the median value of Pt remains above this threshold. This is no conclusive evidence against Stevens's suggestion, but it indicates that for the voiced consonants of this utterance the state of the vocal folds need not be changed to keep vibration going, because Pt probably remains sufficiently high without any adjustments. If no adjustments are made during the production of consonants, then F0 and IL would still decrease with decreasing Pt, by the reasoning given above for vowel production. The conclusion is that with these data it is not clear which mechanism is used during consonant production, but that in both cases F0 and IL would be positively related to Pt alone.

3.2.2 Control on sentence level

In the section above, the control of F0 and IL in speech was studied on a local level. It is also possible to analyze the data on a more global level, in order to study the relation between the slow trends of the physiological signals. This is done by calculating the correlation coefficients for the data vectors pertaining to all voiced samples of the median signals (Table III). Compared to the analysis on a local level (Table II), many differences are observed.

Table III. Correlation matrix, mean and standard deviation for all voiced samples of the median physiological signals.


        F0      IL      Pt     Psp     Psb      SH    mean     SD

F0   1.000   0.669   0.714  -0.184   0.746  -0.468  115.71   8.27
IL           1.000   0.910  -0.669   0.483  -0.221   62.10   4.21
Pt                   1.000  -0.670   0.591  -0.378    4.90   1.16
Psp                         1.000    0.186  -0.128    0.90   0.94
Psb                                  1.000  -0.657    5.64   0.86
SH                                           1.000    6.50  11.15


|R|>0.152 for p<0.01


On a local level Pt was determined almost entirely by Psp, while on a global level the contributions of Psp and Psb to the variation in Pt are almost equal. High correlations between IL and Pt were found on both levels. Psp and Psb contribute to the control of IL via Pt, and because Psb becomes more important in the control of Pt on a global level, Psb becomes more important in the control of IL too. Regarding F0, on word level the highest correlations were those with Pt, while on sentence level the correlation with Psb is the highest. On the whole, Psb becomes more important on a global level.

Psb decreases slowly during the course of the utterance, and hardly varies on a local level. F0, IL, Pt and Psp, on the other hand, do vary substantially during these intervals. In Table II we can see that the variances (var = sd^2) of Pt and Psp are much greater than the variance of Psb. Therefore, it is not surprising that Pt is more effective in predicting the local rapid movements of F0 and IL, and that, in its turn, Psp is more effective than Psb in predicting Pt on a local level.

Psp almost entirely determines the rapid fluctuations, while Psb mainly determines the peak values of Pt. The overall pattern is that Psb and F0 decrease during the utterance, while IL decreases only slightly during the final part of the utterance. Therefore Psb becomes more important in the control of F0 on a global level.

4. Conclusions

In this paper a novel technique to process physiological signals related to speech is presented. This technique seemed necessary because it was found that inter-token variations in the time structure of the speech gestures were very large when an untrained subject repeated a fairly long spontaneous sentence 29 times. In fact the variations were so large that straightforward averaging did not result in averaging physiological signals related to the 'same' speech gestures, but in averaging physiological signals related to totally different speech gestures. It is shown that the novel processing technique reduces this time-jitter to such a degree that meaningful averaging is possible.

After time-normalization it is possible to study the reproducibility of speech gestures, which is another requirement for meaningful averaging. The results show that, apart from Vl, the inter-token variation of the physiological signals is small, indicating that speech gestures are reproducible, even if the subject is not a trained speaker.
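
Reproducibility can then be quantified point by point on the time-normalized stack of repetitions. A minimal sketch, assuming the tokens have already been warped to equal length (the function name is illustrative):

```python
import numpy as np

def median_and_spread(tokens):
    """Point-wise median signal and inter-token standard deviation
    over a stack of time-normalized repetitions (tokens x samples).
    A small spread indicates a reproducible speech gesture."""
    stack = np.asarray(tokens, dtype=float)
    return np.median(stack, axis=0), np.std(stack, axis=0, ddof=1)
```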

The median signals reveal that, of all the measured signals, Pt is the most important factor in the control of F0 and IL in voiced speech. A hypothetical physiological model was proposed that could roughly explain how Pt controls F0 and IL. The activity of the laryngeal muscles also influences F0 and IL to some degree; their influence on F0 is probably greater than their influence on IL. For instance, during the word with sentence stress several F0-raising muscles are probably more active than during the other words.


Acknowledgements

This research was supported by the Foundation for Linguistic Research, which is funded by the Netherlands Organization for Scientific Research (N.W.O.).

Special thanks are due to Harco de Blaauw, who was the subject of the present experiment; to Philip Blok, who inserted the EMG electrodes and the catheter; to Hans Zondag, who helped organize and run the experiment; and to Jan Strik, who assisted in the processing of the data.


References

Atkinson, J.E. (1978) Correlation analysis of the physiological features controlling fundamental voice frequency, Journal of the Acoustical Society of America, 63, 211-222.

Baer, T. (1979) Reflex activation of laryngeal muscles by sudden induced subglottal pressure changes, Journal of the Acoustical Society of America, 65, 1271-1275.

Baer, T.; Gay, T. and Niimi, S. (1976) Control of fundamental frequency, intensity and register of phonation, Haskins Lab. Status Report on Speech Research, SR-45/46, 175-185.

Basmajian, J.V. (1967) Muscles Alive, their functions revealed by electromyography (second edition), The Williams & Wilkins company, Baltimore.

Berg, J. van den; Zantema, J. and Doornenbal, P. (1957) On the air resistance and the Bernoulli effect of the human larynx, Journal of the Acoustical Society of America, 29, 626-631.

Bouhuys, A.; Mead, J.; Proctor, D.F. and Stevens, K.N. (1968) Pressure-Flow Events during Singing, Annals of the New York Academy of Sciences, Vol.155, Art.1, New York.

Collier, R. (1975) Physiological correlates of intonation patterns, Journal of the Acoustical Society of America, 58, 249-255.

Cranen, B. and Boves, L. (1985) Pressure measurements during speech production using semiconductor miniature pressure transducers: Impact on models for speech production, Journal of the Acoustical Society of America, 77, 1543-1551.

Gauffin, J. and Sundberg, J. (1980) Data on the glottal voice source behavior in vowel production, Speech Transmission Laboratory, Q. Prog. Status Rep., Royal Institute of Technology, Stockholm, 2-3/1980, 61-70.

Gelfer, C.; Harris, K.; Collier, R. and Baer, T. (1983). Is declination actively controlled? In: I.R. Titze and C. Scherer (eds.), Vocal Fold Physiology, The Denver Center for the Performing Arts, Inc., Denver, Colorado.

Hirose, H. (1971) Electromyography of the Articulatory Muscles: Current Instrumentation and Techniques, Haskins Lab. Status Report on Speech Research, SR-25/26, 73-86.

Isshiki, N. (1964) Regulatory mechanism of voice intensity variation, Journal of Speech and Hearing Research, 7, 17-29.

Kewley-Port, D. (1973) Computer processing of EMG signals at Haskins Laboratories, Haskins Lab. Status Report on Speech Research, SR-33, 173-183.

Ladefoged, P. (1967) Three areas of experimental phonetics, Oxford: Oxford University Press.

Maeda, S. (1976) A characterization of American English intonation, Ph.D. thesis, MIT, Cambridge.

Rubin, H.J. (1963) Experimental studies on vocal pitch and intensity in phonation, The Laryngoscope, 8, 973-1015.

Sakoe, H. and Chiba, S. (1978) Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-26, 43-49.

Shipp, T.; Doherty, E.T. and Morrissey, P. (1979) Predicting vocal frequency from selected physiologic measures, Journal of the Acoustical Society of America, 66, 678-684.

Stevens, K.N. (1977) Physics of Laryngeal Behavior and Larynx Modes, Phonetica, 34, 264-279.

Strik, H. and Boves, L. (1988). Averaging physiological signals with the use of a DTW algorithm. Proceedings SPEECH'88, 7th FASE Symposium, Edinburgh, Book 3, 883-890.

Titze, I.R. and Durham, P.L. (1987) Passive Mechanisms Influencing Fundamental Frequency Control. In: T. Baer, C. Sasaki and K.S. Harris (eds.), Vocal Fold Physiology, College-Hill Press, Boston.

Vintsyuk, T.K. (1968) Recognition of spoken words by the dynamic programming method, Kibernetika, 1, 81-88.