H. Strik & L. Boves (1995)
Journal of Phonetics, Vol. 23, pp. 203-220.
This article has appeared in the Journal of Phonetics. Therefore, I only
have a printed version with the final text and the final layout. If you
want a copy of this article, you can find it in Journal of Phonetics 23,
or you can contact me. The text of the ASCII version
below is slightly different from the text of the article.
Helmer Strik & Louis Boves
University of Nijmegen, Department of Language and Speech, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
In the present paper we examine the simultaneous downtrend in fundamental frequency and subglottal pressure that is often observed for running speech. In particular, we will test the hypothesis that the downtrend in fundamental frequency is caused by a gradual decrease in subglottal pressure during the course of an utterance. In the literature various ways to model the downtrend in fundamental frequency have been proposed. Our conclusion is that whether the hypothesis stated above is true depends on the model of downtrend adopted.
1. Introduction
A simultaneous downtrend in fundamental frequency (F0) and subglottal pressure (Psb) has often been observed for running speech (Lieberman, 1967; Ohala, 1970; Collier, 1974, 1975; Atkinson, 1978; Gelfer, 1987; Strik & Boves, 1993). As it is known that changes in Psb will affect F0, everything else being equal (Titze, 1989), it seems plausible to assume that both downtrends are related. However, considerable controversy surrounds the relation between the two downtrends (see e.g. Ohala, 1978, 1990; Cohen, Collier & 't Hart, 1982; Ladd, 1984).
Research on the relation between the downtrend in F0 and Psb is impeded by the fact that there is still no consensus on the correct way to model the downtrend in F0. In the literature various models have been proposed. Many of these models consist of two components: a short-term or local component and a long-term or global component. In these models the global component is used to model the downtrend in F0. Only some of these models provide a physiological explanation of both components. Ohman (1968), Collier (1975), and Fujisaki (1991) agree that the local component is controlled by the laryngeal muscles, but they do not agree about the control of the global component. According to Ohman (1968) and Fujisaki (1991) downtrend is also controlled by the laryngeal muscles, while according to Collier (1975) it is controlled by Psb.
In Strik & Boves (1993) the relation between F0 and some of the physiological mechanisms that are known to be important in the control of F0 is studied by means of a qualitative analysis. Based on our own data and data from the literature it was concluded that from a physiological viewpoint the following hypothesis is plausible: the downtrend in F0 is due to the downtrend in Psb. However, this hypothesis is not unchallenged. In this article we will discuss the two main counter-arguments:
1. the lowering in Psb cannot explain all of the decrease in F0 (section 4.2); and
2. downtrend is part of the linguistic code, and thus it must be controlled by laryngeal muscles and not by Psb (section 4.3).
The fact that this issue is still controversial is expressed in the conclusion of a recent article by Ohala (1990): "It must be concluded that the question of whether F0 declination is caused by laryngeal or by respiratory activity has still not been answered definitively." The purpose of this article is to clarify the relation between the downtrend in F0 and Psb.
In the literature different models of intonation are available, which are motivated both by phonetic and phonological considerations. The primary goal of the present article is to study the relation between the downtrend in F0 and Psb. For this reason we look primarily at intonation from a physiological point of view. As a consequence, we try to avoid theory-laden terms such as 'downdrift', 'declination' and 'baseline' as much as possible. Instead we predominantly use the more neutral term 'downtrend'. In some sections we refer to previous studies in which the term 'declination' is generally used. In these cases we will also use the term 'declination'. In this article 'downtrend' and 'declination' are seen as synonyms, and are used to denote the gradual lowering of a signal during a whole utterance.
The outline of the article is as follows. In section 2 material and method are described. Each experiment consisted of two parts. In part one the subjects were instructed to sustain vowels, and in part two they produced meaningful sentences. The results for 'sustained phonation' are described in section 3. These results are then used in the argumentation of section 4, in which the results for 'running speech' are presented. In section 4.1 our physiological model of intonation is described. Subsequently, the two counter-arguments mentioned above are discussed in sections 4.2 and 4.3, respectively. Section 5 contains a general discussion. Finally, some conclusions are drawn in section 6.
2. Material and method
Recordings were made of the audio signal, electroglottogram, lung volume (Vl), Psb, and the activity of the sternohyoid (SH) and vocalis (VOC) muscles for two Dutch male subjects. Both subjects had normal phonation and hearing, but had not received special voice training. In addition to these signals, the activity of the cricothyroid (CT) muscle was also measured for subject LB (the second author), and oral pressure for subject HB. The electromyographic (EMG) signals of the laryngeal muscles were high-pass filtered, full-wave-rectified, and integrated over successive periods of 5 ms. All EMG signals were shifted forward over their mean response times, using the procedure described in Atkinson (1978).
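The rectification and integration steps described above can be sketched as follows. This is only a minimal illustration, not the processing chain actually used: it assumes that the EMG samples have already been high-pass filtered, and the function name and the sampling rate `fs_hz` are hypothetical (the text does not state a sampling rate).

```python
def rectify_and_integrate(emg, fs_hz, window_ms=5.0):
    """Full-wave rectify an EMG trace and integrate it over successive windows.

    emg       -- list of samples, assumed to be high-pass filtered already
    fs_hz     -- sampling rate in Hz (hypothetical; not given in the text)
    window_ms -- integration period in ms (5 ms as stated in the text)
    """
    samples_per_window = int(round(fs_hz * window_ms / 1000.0))
    if samples_per_window < 1:
        raise ValueError("window is shorter than one sample")
    dt = 1.0 / fs_hz
    rectified = [abs(x) for x in emg]  # full-wave rectification
    integrated = []
    # Successive, non-overlapping windows; a trailing partial window is dropped.
    last_start = len(rectified) - samples_per_window
    for start in range(0, last_start + 1, samples_per_window):
        window = rectified[start:start + samples_per_window]
        integrated.append(sum(window) * dt)  # rectangular-rule integral
    return integrated


# Made-up samples, 1 kHz sampling, 2 ms windows (two samples per window).
example = rectify_and_integrate([1, -1, 2, -2, 3, -3, 4, -4], fs_hz=1000, window_ms=2.0)
```

Because the samples are rectified before integration, the resulting values can never be negative, which is the property relied upon for the laryngeal muscle signals in section 4.1.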
The measurements were made while the subjects produced sustained vowels and meaningful Dutch sentences with different intonation patterns. The sentences spoken by subject LB were "Piet slikte zijn pillen met bier" (SU: Short Utterance); and "Piet slikte gisteren zijn vierentwintig gele pillen liever in stilte met bier" (LU: Long Utterance). The sentences produced by subject HB were "Heleen wil die kleren meenemen" (SU: Short Utterance); "Heleen en Emiel willen die kleren liever wel weer meenemen" (LU: Long Utterance); and "Indien Emiel die kleren wil meenemen, willen wij ze eerst wel even zien" (SWC: Sentence With Comma). These sentences contain mainly high vowels, in order to minimize the involvement of the SH in articulatory gestures.
The intonation contours produced were one "pointed hat" (HB-SU1, early stress); two "pointed hats" (HB-SU2, LB-SU2 and LB-LU2, early and late stress, F0 is lowered in between); a "flat hat" (HB-SU3, LB-SU1 and LB-LU1, early and late stress, F0 is kept high in between); and question intonation (HB-SU4, HB-LU4, LB-SU3 and LB-LU3). The intonation pattern of HB-SWC is more complex. For an explanation of the notions "pointed hat" and "flat hat" the reader is referred to 't Hart, Collier & Cohen (1990).
Some sentences were also produced in reiterant form, using either the syllable /fi/ or /vi/. The subjects repeated each sentence 5 to 8 times. The raw signals of these repetitions were used to calculate median signals for each intonation contour. The method of non-linear time-alignment and averaging was used to average all signals, including F0 (Strik & Boves, 1991). The procedures used for recording and processing the data are described in more detail in Strik & Boves (1992).
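The last step, taking a median over repetitions, can be sketched as below. The sketch assumes that the repetitions have already been mapped onto a common frame grid; the non-linear time-alignment itself (Strik & Boves, 1991) is not reproduced here, and the function name and the numbers are made up for illustration.

```python
from statistics import median


def median_signal(repetitions):
    """Frame-by-frame median over time-aligned repetitions.

    repetitions -- list of equal-length lists, one list per repetition
    """
    if not repetitions:
        raise ValueError("need at least one repetition")
    n_frames = len(repetitions[0])
    if any(len(rep) != n_frames for rep in repetitions):
        raise ValueError("repetitions must be aligned to the same length")
    # zip(*repetitions) yields, for each frame, the values of all repetitions.
    return [median(frame) for frame in zip(*repetitions)]


# Three made-up F0 tracks (Hz); the outlier 130 does not affect the median.
f0_median = median_signal([
    [100, 110, 90],
    [102, 108, 95],
    [98, 130, 92],
])
```

A median is less sensitive than a mean to a single deviant repetition, as the second frame of the example shows.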
3. Sustained vowels
Before the actual measurements of the physiological signals were made, our subjects were trained to produce prolonged vowels for different combinations of F0 and intensity level (IL). When the subjects were asked to sustain a given vowel, a gradual lowering of F0 and IL was generally observed. Subsequently, when they were explicitly instructed to keep F0 and IL constant, the downtrend in F0 and IL diminished, but it was usually still present. Finally, the subjects were given on-line visual feedback of F0 and IL. In this condition they often managed to keep both F0 and IL fairly constant during the production of a vowel.
After the training sessions actual measurements of the physiological signals were obtained. The subjects were given on-line visual feedback and were again instructed to keep F0 and IL constant for a sustained vowel. This task was repeated for different combinations of F0 and IL. The measurements show that the subjects usually managed to keep F0 and IL at the target values. At the beginning of the utterances some variation in Psb and the activity of the laryngeal muscles was observed, probably to reach the target levels for F0 and IL. Apart from the initial variation the physiological signals usually remained constant for the rest of the utterance. Different combinations of F0 and IL were achieved by different levels of Psb, SH, CT, and VOC. The results of this part of the experiment are described in more detail in Strik & Boves (1987).
This experiment shows that subjects who had no special voice training can keep F0, IL, and Psb constant during a simple utterance (a sustained vowel), but only if they are supported by visual feedback. Subjects report that keeping F0 and IL constant requires more effort than allowing a gradual decline, and feels less natural. Without visual feedback F0 and IL (and probably also Psb) tend to fall gradually during the course of an utterance, even if subjects are instructed to keep F0 and IL constant. The results obtained for sustained phonation will be used as support for the argumentation in the next section on running speech.
4. Running speech
4.1. A physiological model of intonation
In Strik & Boves (1993) we proposed a qualitative model of F0 control in running speech. Our model describes consistent behaviour of Psb, CT, VOC, and SH that was observed in the data of subjects LB and HB, and in other data presented in the literature. Figures with the average signals for the recorded utterances of subjects LB and HB can be found in Strik & Boves (1993). Here we will only display the average signals of a typical utterance (see Fig. 1), in order to illustrate our model.
The four physiological signals mentioned above were chosen because it is known that they are important in the control of F0. In our model intonation and its physiological control take place at two levels, viz. a global and a local level. This is in accordance with other physiological models of intonation proposed in the literature (like Ohman, 1968; Collier, 1975; and Fujisaki, 1991).
Short-term variations in F0, Psb, SH, VOC, and CT have often been observed (see e.g. Fig. 1), i.e. all five signals clearly have a local component. But it is not immediately clear whether all of these five physiological signals also have a global component.
** Insert Figure 1 about here. **
A gradual lowering of Psb and F0 during the course of a major syntactic constituent is often observed (see e.g. Lieberman, 1967; Ohala, 1970; Collier, 1974, 1975; Atkinson, 1978; Gelfer, 1987; Strik & Boves, 1993). The domain in which the downtrends in F0 and Psb occur has previously been given many different names, among them "breath group" (Lieberman, 1967), "intonation group" (Breckenridge, 1977), "utterance" (Pierrehumbert & Beckman, 1988), "clause or clause complexes" (Clark & Yallop, 1990), and "major phrase" (Honda & Fujimura, 1991). In this article we will use the term utterance. Within the recorded sentences there were no inspirations (resets of Vl), nor any resets of F0 or Psb.
Our definition of a global component is a gradual change spanning the total duration of an utterance. Therefore, in our model Psb and F0 have a global component. The global component of F0 and Psb in our model will be called F0,g and Psb,g, respectively. In this article the terms F0,g and Psb,g will be used for the global components of our model alone. Global components of other models will be denoted otherwise.
The model presented in Strik & Boves (1993) is a qualitative model. To illustrate our model a possible quantitative decomposition of F0 and Psb in a global and a local component is shown in Fig. 1. Psb,g was obtained by manually fitting an exponential function through most of the valleys of Psb (Fig. 1). Because it is assumed that F0 varies linearly with Psb (Titze, 1989), F0,g was defined in the following way: F0,g = B0 + B1*Psb,g. The values of B0 and B1 that gave a satisfactory result for this utterance were 70 Hz and 5 Hz/cm H2O (Fig. 1), respectively. We would like to note that the manually fitted trend lines are only presented here to illustrate our qualitative model, and to give an example of a procedure that can be used to obtain the global and local components of Psb and F0. These manually fitted trend lines are not used for further analysis in the present article. Instead we will use a more objective statistical method in the following section.
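The relation F0,g = B0 + B1*Psb,g can be illustrated with a short sketch. The values of B0 and B1 are those quoted above; the particular exponential form and its parameters (`p_start`, `p_end`, `tau`) are assumptions chosen for illustration only, since the text states merely that an exponential function was fitted by hand.

```python
import math


def psb_global(t, p_start, p_end, tau):
    """Exponential trend for Psb (cm H2O), decaying from p_start towards p_end.

    p_start, p_end and tau (s) are hypothetical illustration parameters.
    """
    return p_end + (p_start - p_end) * math.exp(-t / tau)


def f0_global(psb_g, b0=70.0, b1=5.0):
    """F0,g = B0 + B1 * Psb,g, with B0 in Hz and B1 in Hz/cm H2O (values from the text)."""
    return b0 + b1 * psb_g


times = [0.0, 0.5, 1.0, 1.5, 2.0]  # seconds
psb_trend = [psb_global(t, p_start=8.0, p_end=5.0, tau=1.0) for t in times]
f0_trend = [f0_global(p) for p in psb_trend]
```

With these assumed parameters a fall of 1 cm H2O in Psb,g corresponds to a fall of 5 Hz in F0,g, so both trend lines decline together.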
A gradual change in the activity of SH, VOC, or CT during a whole utterance was not observed in any of our recordings nor in published data of other researchers (as far as we know). Sometimes the activity of these three laryngeal muscles varied slowly during part of the utterances, but no instance of a slow increase or decrease during the whole utterance (just like Psb and F0) was found. It must therefore be concluded, both from our own data and the data presented in various other papers, that in general SH, VOC, and CT do not seem to have a global component.
At the beginning of utterances CT, VOC, and Psb may have extra high values, and the result will be a so-called 'initial rise' of F0 (Fig. 1). At the end of utterances SH activity often increases while Psb drops sharply. If these effects occur during voiced sounds at the end of the utterance, final lowering of F0 is observed (Fig. 1). Alternatively, increased SH activity and Psb release may be delayed until after the last voiced sound, in which case final lowering is absent (e.g. in most interrogative utterances). The initial rise and final lowering of F0 will add to the F0 fall that results from the downtrend in F0,g alone (Fig. 1).
The local component of Psb (Psb,l = Psb - Psb,g) is generally positive. SH, VOC, and CT only have a local component, which is always positive because these signals can never become negative (see section 2). Finally, the local component of F0 (F0,l = F0 - F0,g) is positive when the effect of the F0-raising mechanisms (VOC, CT, and Psb,l) is larger than the effect of the F0-lowering mechanisms (SH), and F0,l becomes negative when the net effect of F0-raising and F0-lowering mechanisms is negative.
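The two subtractions defined above can be written out as a small sketch. All signal values are made up for illustration; only the coefficients 70 Hz and 5 Hz/cm H2O are taken from section 4.1, and the function name is hypothetical.

```python
def local_component(signal, global_trend):
    """Local component = signal minus its global trend, frame by frame."""
    if len(signal) != len(global_trend):
        raise ValueError("signal and trend must have the same length")
    return [s - g for s, g in zip(signal, global_trend)]


# Made-up values for illustration only.
psb = [8.5, 7.9, 7.6, 7.0]            # cm H2O
psb_g = [8.0, 7.5, 7.0, 6.5]          # trend through the valleys of Psb
f0 = [125.0, 104.0, 112.0, 98.0]      # Hz
f0_g = [70.0 + 5.0 * p for p in psb_g]

psb_l = local_component(psb, psb_g)   # stays positive in this example
f0_l = local_component(f0, f0_g)      # changes sign from frame to frame
```

In this example Psb,l stays positive because the trend runs through the valleys of Psb, whereas F0,l is negative in the frames where F0 lies below F0,g.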
To conclude this section, in our physiological model of intonation SH, VOC, and CT do not have a global component, while F0 and Psb do have a global component. A two-component model was chosen, because from a physiological point of view this seems to be the model that best describes the data. Because a downtrend in F0,g and Psb,g is often observed, the following hypothesis seems likely: The downtrend in F0,g is due to the downtrend in Psb,g. This hypothesis has been challenged for different reasons. Two frequently adduced counter-arguments are discussed in the next two sections.
4.2. The F0-Psb ratio
4.2.1 Counter-argument 1
An argument used against the above-mentioned hypothesis is that the variation in Psb,g cannot explain the total variation in F0,g, because the F0-Psb ratio (FPR) observed in running speech is often larger than 7 Hz/cm H2O (e.g. Maeda, 1976; Ohala, 1978). Studies of the rate of F0 change resulting from a change in Psb alone (generally by externally induced pressure variations) have revealed that the FPR should be in the range 2-7 Hz/cm H2O (e.g. Ladefoged, 1967; Baer, 1979). In the present article this range will be called the FPR-range. Because the FPR obtained for utterances often seems to exceed the FPR-range, the hypothesis is either rejected totally (Ohala, 1978), or an additional mechanism is invoked to explain (part of) the decrease in F0 (the tracheal pull mechanism of Maeda, 1976).
Indeed, there seem to be no reasons to assume that the FPR obtained in experiments with externally induced pressure variations differs from the FPR in running speech. But the problem is that the FPR obtained for running speech depends on the way in which the downtrend in F0 and Psb is defined and modelled.
4.2.2 Modelling the relation between F0 and Psb
In the literature several methods have been proposed to model the downtrend in F0, such as the difference between F0 at the beginning and at the end of an utterance (see method 1 below), the baseline of Maeda (1976), and the bottomline and topline of Cooper & Sorensen (1981). Baseline, bottomline, and topline are trend lines which are generally fitted manually, just like Psb,g and F0,g in Fig. 1. Most probably, the fitting is done manually because it is difficult to define a mathematical error function that could be used to derive the trend lines with an optimization algorithm.
We have done a number of experiments to determine the parameters of the downtrend components. The results of two experiments, in which different definitions of downtrend were used, are presented below. For this aim six utterances of subject LB and six utterances of subject HB were used. For each subject, there are four declarative and two interrogative utterances (see Table I). All signals, including the F0 signals, are average signals (section 2). Figures with the average signals for these twelve utterances can be found in Strik & Boves (1993). The average signals for one utterance of subject LB are shown in Fig. 1.
In method 1 the F0 and Psb values are taken at two instants, one near the beginning (T1) and one near the end (T2). The following values are then calculated: dF0 = F0(T1) - F0(T2), dPsb = Psb(T1) - Psb(T2), FPR1 = dF0/dPsb. The total fall in F0 and Psb from T1 up to T2 (dF0 and dPsb, respectively) is used to model the downtrend in F0 and Psb, respectively. Basing dF0 on two F0 values is error-prone. In some studies the F0 values are obtained from a trend line (e.g. the baseline in Maeda, 1976), while in other studies the F0 values are taken from a single, representative F0 contour (e.g. Collier, 1975; Gelfer, Harris, Collier & Baer, 1983; Collier, 1987). Our data processing procedure allowed us to average the F0 curves of all repetitions of a given sentence, thereby making the estimation procedure more reliable. In previous studies various choices of T1 and T2 have been made, based on different motives (see e.g. Gelfer et al., 1983). In this study T1 is the first voiced frame, and T2 the last voiced frame of each utterance. These instants were mainly chosen because the values of F0 and Psb at these time-points can be determined very easily for each utterance. Given this choice of T1 and T2, all relevant values were calculated for the twelve utterances of subjects LB and HB (see Table I).
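The two-point calculation can be sketched as follows, with T1 and T2 taken as the first and last voiced frames. The function name and the per-frame values are made up for illustration and are not the values of Table I.

```python
def fpr1(f0, psb, voiced):
    """Two-point F0-Psb ratio between the first and the last voiced frame.

    f0, psb -- per-frame values in Hz and cm H2O
    voiced  -- per-frame booleans marking voiced frames
    Returns (dF0, dPsb, FPR1), with FPR1 in Hz/cm H2O.
    """
    voiced_idx = [i for i, is_voiced in enumerate(voiced) if is_voiced]
    if len(voiced_idx) < 2:
        raise ValueError("need at least two voiced frames")
    t1, t2 = voiced_idx[0], voiced_idx[-1]
    d_f0 = f0[t1] - f0[t2]
    d_psb = psb[t1] - psb[t2]
    if d_psb == 0:
        raise ZeroDivisionError("Psb is identical at T1 and T2")
    return d_f0, d_psb, d_f0 / d_psb


# Made-up frames; unvoiced frames carry an F0 of 0.
f0 = [0.0, 130.0, 120.0, 115.0, 100.0, 0.0]
psb = [9.0, 8.0, 7.5, 6.5, 6.0, 3.0]
full = fpr1(f0, psb, [False, True, True, True, True, False])
shorter = fpr1(f0, psb, [False, True, True, True, False, False])
```

Moving T2 back by a single frame changes the ratio from 15 to 10 Hz/cm H2O in this made-up example, which shows how strongly a two-point estimate can depend on the choice of T1 and T2.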
** Insert Table I about here. **
In all utterances dPsb is positive (Table I). For subject LB dPsb is always larger than for subject HB. For both subjects dPsb for the interrogative utterances is smaller than dPsb for the declarative utterances. At the end of each question there is a marked increase in F0, and consequently dF0 is negative for the questions. But for all declarative utterances dF0 is positive. For the declarative utterances, dF0 of subject LB is always larger than dF0 of subject HB. Partly this is because dPsb is larger for subject LB, as noted above. In addition, for subject LB the CT and VOC often show increased activity at the beginning of an utterance, which causes an initial rise in F0, and SH activity is increased at the end of the utterance during the final lowering of F0. Both effects will cause dF0 to be larger than the fall in F0 resulting from dPsb alone, i.e. both Psb and the laryngeal muscles contribute to dF0.
The values of FPR1 can be seen in Table I. Only three of the twelve FPR1 values are within the accepted FPR-range. FPR1 for the four questions is negative because dF0 is negative; of the eight values of FPR1 for the statements, four are larger than 7 Hz/cm H2O and one is smaller than 2 Hz/cm H2O. Based on these FPR1 values one could conclude that the downtrend in Psb cannot explain all the downtrend in F0, and thus other factors should contribute to the downtrend in F0. If downtrend is defined in this way, then this conclusion is correct. After all, dF0 does depend on both dPsb and the activity of the laryngeal muscles (especially for subject LB, as explained above).
The FPR-range is obtained from experiments with externally induced pressure variations (e.g. Ladefoged, 1967; Baer, 1979). The goal of these experiments was to determine the FPR for F0 changes that result from Psb changes alone, i.e. one tried to keep other processes that influence F0 (like the laryngeal muscles) constant (see e.g. Baer, 1979). In these studies the points in a scatterplot for F0 as a function of Psb could usually be fitted reasonably by a straight line. In Fig. 2 an F0-Psb scatterplot is given for a short utterance of subject LB. Clearly, in this scatterplot the points are not grouped around a straight line. The reason is that during this utterance the other factors which influence F0 are not constant. Drawn in Fig. 2 is the straight line that connects the first and the last voiced frame. FPR1 is the slope of this line. In Fig. 2 one can see that the FPR obtained in this way depends heavily on the exact choice of T1 and T2. To sum up, method 1 has two important drawbacks:
1. other factors that can affect F0 are not constant over the course of an utterance; and
2. because the other factors are not constant it is hazardous to make estimates of the FPR which are based on the values of F0 and Psb at two instants only.
** Insert Figure 2 about here. **
In method 2 a multiple regression analysis is used, in which F0 is the criterion and Psb, VOC, and SH are the predictors (Footnote 1). The outcome of the regression analysis is the set of coefficients Ai of the regression equation: F0 = A0 + A1*Psb + A2*VOC + A3*SH. The FPR is the regression coefficient of Psb: FPR2 = A1. This method does not have the drawbacks of method 1 because a correction is made for some important other factors which influence F0, and the regression coefficient is based on the data of all voiced frames.
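A least-squares fit of this regression equation can be sketched as follows. This is an illustrative implementation using only the Python standard library (solving the normal equations by Gaussian elimination), not the statistical software used for the study; in practice a library least-squares routine would normally be preferred. The data are synthetic and noise-free, and the function names and the generating coefficients for VOC and SH are made up.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are (nearly) collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_f0(f0, psb, voc, sh):
    """Fit F0 = A0 + A1*Psb + A2*VOC + A3*SH over the supplied (voiced) frames.

    Returns [A0, A1, A2, A3]; A1 is FPR2 in Hz/cm H2O.
    """
    rows = [[1.0, p, v, s] for p, v, s in zip(psb, voc, sh)]
    k = 4
    # Normal equations: (X^T X) A = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, f0)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic voiced frames (made-up values, arbitrary EMG units).
psb = [8.0, 7.8, 7.5, 7.4, 7.0, 6.9, 6.5, 6.2]
voc = [40.0, 10.0, 25.0, 5.0, 30.0, 8.0, 12.0, 3.0]
sh = [2.0, 5.0, 1.0, 9.0, 3.0, 4.0, 15.0, 20.0]
f0 = [70.0 + 5.0 * p + 0.2 * v - 0.1 * s for p, v, s in zip(psb, voc, sh)]

a0, a1, a2, a3 = fit_f0(f0, psb, voc, sh)
fpr2 = a1  # recovers the generating slope of 5 Hz/cm H2O up to rounding error
```

Because the synthetic F0 values are generated without noise, the fit returns the generating coefficients; with measured signals the coefficients would of course carry estimation error.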
The multiple regression analysis decomposes F0 into four components: A0, A1*Psb, A2*VOC, and A3*SH. The first component is the constant A0, which by definition cannot contribute to a downtrend. VOC and SH do not have a global component either (section 4.1), and thus in this statistical model the downtrend in F0 is due to the downtrend in Psb alone. This is in line with the physiological model presented in section 4.1, except for one essential difference. In method 2 Psb is not decomposed into a global and a local component. However, because there are no reasons to assume that the FPR is different on a global and a local level, this does not seem to be a problem. Consequently, the Psb component in the regression analysis (A1*Psb) contains both the slow downtrend in F0, and the part of the local variations in F0 which is due to the local variations in Psb. The other part of the local variations in F0 is in the VOC and SH components (A2*VOC and A3*SH, respectively).
Instead of using the multiple regression analysis we could have based our estimates of the FPR on the global trend lines Psb,g and F0,g. To that end, Psb,g and F0,g should have been determined in the way described in section 4.1, i.e. by making manual fits for all utterances. This is certainly possible, but we prefer to use objective, statistical methods (like the multiple regression analysis described in the current section) instead of more subjective methods in which trend lines are fitted manually.
For all voiced frames of the twelve utterances a multiple regression analysis was performed in which F0 was the criterion and Psb, VOC and SH were the predictors. The resulting FPR2 values (i.e. the A1 values) can be seen in Table I. The resulting values of A0, A2 and A3 were not used for further analysis. Of the 12 FPR2 values, 11 are in the FPR-range, and one is slightly larger than the maximum of the FPR-range. If the CT had been used as a predictor instead of the VOC for subject LB, then FPR2 would have been 6.44 Hz/cm H2O for this utterance, and thus it would have been within the FPR-range (Footnote 1). Also for the interrogative utterances FPR2 is always within the FPR-range, while this was never the case for FPR1. The rise of F0 at the end of questions is usually due to an increase of CT, VOC, and Psb. In method 2 a correction is made for the increase in VOC, and the result is that the FPR2 is within the FPR-range. The rapid increase in Psb at the end of the questions is part of Psb, and will also explain part of the end rise in F0.
To conclude this section, comparison of FPR1 and FPR2 values for sentences has shown that the actual values obtained are crucially dependent on the way in which the F0-Psb ratio is defined. In our opinion FPR1, which has been used to refute the above-mentioned hypothesis, is not a fair measure because it isolates Psb, but at the same time ignores all other factors affecting F0. If some important additional influences are factored out of F0 by means of a multiple regression analysis, as is done with FPR2, a completely different picture emerges, which is compatible with the hypothesis that the downtrend in Psb explains the downtrend in F0. Even though the way in which the influence of the laryngeal muscles on F0 is modelled is extremely crude (the true relation between the activity of the laryngeal muscles and F0 is very likely to be non-linear) FPR2 is a much fairer measure than FPR1. According to this measure the variation in Psb can explain all the variation in F0, and no additional mechanisms are necessary. Therefore, too large a total F0 drop does not seem a reason to reject the hypothesis. Also, and perhaps even more important, arguments about the relation between F0 and Psb depend fully on the way in which the two downtrends are modelled. As long as the model of F0 downtrend does not partition out effects not related to Psb, it may remain a valid definition of its own, but it should no longer be used in arguments involving Psb.
4.3. Control of F0 and Psb
4.3.1 Counter-argument 2
At the basis of the second counter-argument is the idea that the laryngeal muscles can be controlled linguistically, while this is not possible for the respiratory muscles and thus the downtrend in Psb is a passive process. Subsequently, this idea is used as an argument against the above-stated hypothesis: because the downtrend in F0 is (at least partially) linguistically controlled it cannot result from an automatic process like the downtrend in Psb. The fact that some authors use this argument in the discussion about the physiological causes of declination was also noted by Cohen, Collier & 't Hart (1982).
The second argument against the hypothesis is expressed most clearly by Breckenridge (1977). She states that declination is part of the linguistic system, and therefore it must be controlled by the laryngeal muscles just as other linguistically significant aspects of F0 are. A similar line of reasoning is used by Ohala (1978, 1990). In Ohala (1978, 1990) three possible causes for declination are mentioned: (1) tracheal pull (Maeda, 1976); (2) downtrend in Psb (Collier, 1974, 1975); and (3) graded activity in the laryngeal muscles. According to Ohala the first two causes are automatic, non-purposive physiological causes. Because declination is not automatic but controlled, he argues that a model in which linguistic aspects of F0 are completely determined by actions of the laryngeal muscles is much more likely than a two-component model in which respiratory and laryngeal factors interact.
Clear opinions about the control of the downtrend in Psb can also be found in Gelfer et al. (1983), Ladd (1984) and 't Hart, Collier & Cohen (1990). Gelfer et al. (1983) studied whether declination is actively controlled. They noted a similar downtrend in F0 and Psb. They argue that if the declination in F0 is due to the declination in Psb, then this would suggest that declination is a passive phenomenon. In Ladd (1984) three physiological causes of declination are discussed: (1) the downtrend in Psb (Collier, 1975); (2) the tracheal pull (Maeda, 1976); and (3) F0 rises are harder to produce than F0 falls (Ohala & Ewan, 1973). According to Ladd, the downtrend in Psb and the tracheal pull are automatic mechanisms. Finally, according to 't Hart, Collier & Cohen (1990) the muscular activity involved in the regulation of Vl and Psb is subject to an automatic control system. In their view declination should be seen mainly as an automatic by-product of respiration.
The examples given above clearly illustrate that there seems to be a widespread notion that the downtrend in Psb is an automatic process. If the downtrend in Psb is a completely passive process, then this could indeed be used as a counter-argument against the above-mentioned hypothesis, because there are many indications that declination is under linguistic control, at least to some extent. However, it is not certain that the downtrend in Psb is a passive mechanism. On the contrary, there are many reasons to believe that Psb is controlled. This will be discussed in the next section.
4.3.2 Respiratory system
There are three factors which may affect Psb (see e.g. Ladefoged, 1967):
1. passive forces, like elastic recoil and gravitational forces;
2. active forces, resulting from contractions of respiratory muscles; and
3. the resistance to the air-stream, both at the glottis and in the vocal tract (Zg).
The pressure that results from passive forces alone is generally called the relaxation pressure (Prel), while the pressure change brought about by active muscle contractions is called the muscular pressure. For a speaker who remains in the same position (usually upright) the gravitational forces are roughly constant and thus Prel would depend on Vl alone. If expiration during speech production were a truly passive process, then the muscular pressure should be zero and Psb should be a function of Vl and Zg alone. Several observations reveal that this is not the case:
Our data show that for repetitions of the same sentence the amount of inspiration before the utterance was not always the same. Consequently, the Vl traces run essentially parallel (see e.g. Fig. 3), while Zg can be assumed to be reasonably constant. Although the differences in Vl are large, the Psb contours are very much alike (Fig. 3).
** Insert Figure 3 about here. **
Some of the sentences were also produced in reiterant form, using either the syllable /fi/ or /vi/. The slopes of the Vl traces of these two types of utterances are different, but also in this case the Psb contours showed much resemblance (see e.g. Fig. 4). This was also found by Gelfer (1987).
** Insert Figure 4 about here. **
Speakers can keep their Psb constant during the production of a long sequence of /ma/ syllables (Collier, 1987), and during sustained phonation (section 3). In both cases the activity of the measured laryngeal muscles also remained constant, so Zg was probably constant. The fact that speakers can keep Psb constant while Vl is decreasing also proves that Psb is not simply a function of Vl and Zg alone.
During phonation Psb should not become smaller than a threshold value below which phonation is not possible (the so-called phonation threshold pressure, see Titze, 1992). Furthermore, the loudness of the speech is determined to a large extent by Psb, and thus Psb should be kept within a certain range to produce speech with the desired loudness. After inspiration at the beginning of an utterance Prel is often larger than the desired Psb, while at the end of an utterance Prel is often lower than the desired Psb (see e.g. Ladefoged, 1967). If the respiratory muscles were not used, then Psb and the loudness would decrease rapidly; soon Psb would be smaller than the phonation threshold pressure and phonation would stop. To prevent this, the inspiratory muscles are used at the beginning of an utterance to keep Psb lower than Prel, while expiratory muscles are used when Prel is lower than the desired Psb (Ladefoged, 1967).
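The reasoning above can be made concrete with a small sketch. It rests on a simplifying assumption that is not stated in this form in the text, namely that Psb is the sum of the relaxation pressure and the muscular pressure (Psb = Prel + Pmus), with the influence of Zg ignored. The function names and all pressure values are made up for illustration.

```python
def required_muscular_pressure(p_target, p_rel):
    """Muscular pressure (cm H2O) needed to hold Psb at p_target.

    Assumes Psb = Prel + Pmus, a simplification that ignores Zg.
    """
    return p_target - p_rel


def active_muscle_group(p_target, p_rel):
    """Name the muscle group that has to act, given the sign of Pmus."""
    p_mus = required_muscular_pressure(p_target, p_rel)
    if p_mus < 0:
        return "inspiratory"  # Prel exceeds the desired Psb: brake the recoil
    if p_mus > 0:
        return "expiratory"   # Prel falls short of the desired Psb: add pressure
    return "none"


# Made-up relaxation pressures falling with lung volume during an utterance.
p_rel_course = [12.0, 9.0, 7.0, 5.0, 3.0]
groups = [active_muscle_group(7.0, p_rel) for p_rel in p_rel_course]
```

Under this assumption the sign of the required muscular pressure changes during the utterance, from inspiratory braking at high lung volume to expiratory effort at low lung volume, in line with the description attributed to Ladefoged (1967).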
The arguments given above force one to assume that the respiratory muscles are used to control Psb during speech production. The following question then arises: How are the respiratory muscles used to control Psb? According to Ladefoged (1967) and Ohala (1990) the amount of control is limited, i.e. they claim that these muscles are only used to keep Psb reasonably constant above some minimal level. However, many measurements show that in general Psb is not constant but has a tendency to decline, both in sustained phonation (section 3) and in running speech (Lieberman, 1967; Ohala, 1970; Collier, 1974, 1975; Atkinson, 1978; Gelfer, 1987; Strik & Boves, 1993). Furthermore, Psb contours for repetitions of a sentence appear to be very similar in shape as well as in amplitude (see e.g. Fig. 3), too similar to assume that Psb has just a convenient (more or less random) value above its minimum.
If the respiratory muscles are under voluntary control, then they can be used to control Psb during speech production. Active control of the respiratory muscles and Psb in speech production seems likely, given the following arguments:
The way the respiratory muscles are used during speech production differs from the way they are used in normal breathing. In normal breathing the duration of inhalations and exhalations is about equal, while in speech production the inspiratory phase is much shorter. Furthermore, it has been observed that the posturing of the respiratory system for speech production (the prephonatory posturing of the chest wall) is different from the posturing for normal breathing (Hixon, Goldman & Mead, 1973; Baken, Cavallo & Weissman, 1979; Baken & Cavallo, 1981).
Breathing pauses occur mainly at major constituent breaks (Winkworth et al., 1994). Breathing pauses can also occur at minor constituent boundaries, but as speaking rate increases they are eliminated from these minor breaks (Grosjean & Collins, 1979). Grosjean & Collins (1979) conclude that "it would appear that breathing in speech depends to a large extent on the speaker's preplanned pause patterns", and thus breathing would be linguistically controlled.
The amount of air inspired and the Vl at the beginning of sentences were found to be significantly larger for longer utterances than for shorter ones, and for major syntactic breaks than for more minor ones (Winkworth et al., 1994). According to Winkworth et al. (1994) these findings indicate that speakers pre-plan their Vl and the volume inspired. It should be noted that this study concerned reading, and therefore their results suggest that the respiratory muscles are under linguistic control during reading.
Indications of extra respiratory activity (i.e. increased lung volume decrement) for stressed syllables were found by Ohala (1977), while Ladefoged (1967) and van Katwijk (1974) actually measured increased activity of respiratory muscles for stressed syllables. Although probably not all stressed syllables are accompanied by extra activity of the respiratory muscles, these results indicate that linguistic control of the respiratory muscles is possible, at least at a local level. If active control of the respiratory muscles is possible at a local level, then it is likely that it is also possible at a global level.
Loudness is a prosodic, i.e. a linguistic variable. If speakers are asked to increase loudness, they tend to initiate speech at higher lung volumes (Hixon, Goldman & Mead, 1973). Winkworth et al. (1994) also found that louder utterances within the "comfortable loudness" range are generally associated with higher lung volumes. According to Weismer (1985) it is more efficient to start at higher lung volumes for loud speech, because larger values of Psb are needed to generate loud speech. So, not only is this an example of linguistic control of the respiratory muscles, it is also an indirect indication of linguistic control of Psb. But there are also more direct indications of voluntary control of Psb.
In addition to Psb, a speaker can use many different physiological mechanisms to control F0, and thus a given F0 contour could be produced in various ways. Still, the amount of variation between physiological signals (including Psb) of repetitions of the same utterance is relatively small (Strik & Boves, 1991; Strik & Boves, 1993). The finding that the inter-repetition variation in Psb and the other physiological signals is small suggests that speakers have a notion of the manner in which they want to produce an utterance, and that they have good control over Psb and the other mechanisms.
** Insert Figure 5 about here. **
Another indication that Psb is actively controlled can be seen in Fig. 5. In the middle of a spontaneous utterance subject HB made a swallowing gesture, probably because the pressure catheter was bothering him. During this interruption Psb suddenly drops to about 5 cm H2O. For subject HB phonation with such a level of Psb is possible, because comparable and even lower values of Psb were found at the beginning of many voiced intervals of the repetitions of the same utterance. If the subject's only intention was to provide a Psb above some minimal level at which phonation is possible, he could have kept Psb at approximately 5 cm H2O. However, before he resumed phonation Psb was raised to approximately the value it had before the interruption, and from that point it started declining again.
Finally, after the two subjects in our study had received instructions they were able to keep Psb fairly constant at different levels (section 3), i.e. their Psb was under voluntary control.
The conclusion of this section is that there are several reasons to believe that the respiratory muscles and Psb are actively controlled. If this is the case, then the second counter-argument (specified above) cannot be used either to refute the hypothesis that the lowering of F0,g is generally due to a decrease in Psb,g.
In this paper we have argued in favour of a major role for Psb in the control of the ubiquitous downtrend in F0 contours. The role of Psb has been called into question by a number of authors, and for a number of different reasons. The two most important counter-arguments center around the claim that the total F0 fall in most published data seems to exceed the range that should be expected from the fall in Psb, and the claim that the respiratory system is not suited for so precise a control as needed for the linguistic, communicative purpose served by F0 downtrend. These counter-arguments have been discussed in sections 4.2 and 4.3, respectively.
Before proceeding to a summary of these discussions we would like to address one additional argument. Ohala (1990) claims that there are examples in the literature that show a gradual downtrend of the activity of CT. It appears that these examples are limited to the contours 11 and 15 in Collier (1974). In these registrations a gradual decline of CT activity can indeed be seen, but only in the second half of the utterances. To the best of our knowledge there are no data showing a gradual variation of CT, VOC or SH over complete utterances. But there are numerous examples of Psb decline that span a complete utterance. Thus, we fully acknowledge the possibility that laryngeal muscles contribute to the total fall of F0 over the course of an utterance, but the available data more or less force us to accept the conclusion that the contribution of Psb to the control of F0 downtrend (as the concept is defined in our model) is much more important. For this reason, we think that the physiological validity of the models proposed by Ohman (1968) and Fujisaki (1991), which do not acknowledge a role for Psb, is debatable. Speakers can exploit a large array of physiological means to reach a certain goal, and it would be surprising if some of these means were never exploited. After all, there is no valid reason to suppose that all subjects should always behave in exactly the same way. But individual examples attesting a possible way of control should not be generalized. For the time being, the data speak in favour of Psb.
Coming back to the arguments related to the F0-Psb ratio, it must be concluded that fair estimates of that ratio are extremely difficult to obtain from sentence material. In all naturally produced utterances laryngeal muscles affect F0 in addition to Psb. In order to obtain a fair estimate of FPR these additional contributions must be factored out. That is certainly not done by defining dF0 and dPsb as the difference between the values observed at the beginning and at the end of an utterance, not even when these values are averaged over a large number of tokens, simply because the F0 values are affected by laryngeal muscle activity.
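As a worked illustration of the endpoint-based estimate criticised above, the following Python sketch recomputes dF0, dPsb and their ratio (FPR1, as defined in the caption of Table I) from first and last voiced-sample values. The function name `endpoint_fpr` is ours and serves only this illustration; the input values are taken from Table I for subject LB.

```python
def endpoint_fpr(f0_first, f0_last, psb_first, psb_last):
    """Return (dF0, dPsb, FPR1) from first/last voiced-sample values.

    dF0  = F0(T1) - F0(T2)   in Hz
    dPsb = Psb(T1) - Psb(T2) in cm H2O
    FPR1 = dF0 / dPsb        in Hz/cm H2O
    """
    d_f0 = f0_first - f0_last
    d_psb = psb_first - psb_last
    if d_psb == 0:
        raise ValueError("dPsb is zero; the ratio is undefined")
    return d_f0, d_psb, d_f0 / d_psb


# Table I, subject LB: declarative SU1 and question SU3.
examples = {
    "SU1 (declarative)": (150, 65, 9.58, 3.44),
    "SU3 (question)": (121, 167, 8.44, 4.36),
}
for name, values in examples.items():
    d_f0, d_psb, fpr1 = endpoint_fpr(*values)
    print(f"{name}: dF0 = {d_f0} Hz, dPsb = {d_psb:.2f} cm H2O, "
          f"FPR1 = {fpr1:.1f} Hz/cm H2O")
```

For the declarative utterance the ratio comes out at about 13.8 Hz/cm H2O (Table I lists 13.9, presumably computed from unrounded values), while for the question it is negative because F0 rises at the end although Psb still falls. The endpoint ratio thus reflects laryngeal activity as much as Psb.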
A fundamental problem in studying the physiological causes of downtrend is that the literature abounds with definitions of F0 downtrend. Downtrend, declination or downdrift have been used to denote the tendency of F0 to decrease during the course of an utterance. This qualitative definition can be interpreted in many different ways, and is hardly suitable for studying the relation between physiology and F0 downtrend. Therefore, a more precise definition of downtrend is needed. Some of the definitions used in the literature are illustrated in Fig. 6. Fig. 6 shows hand-fitted estimates of a top line, a bottom line, a line connecting the first and last voiced sample in addition to F0,g, which was derived in the way described in section 4.1 (this is the same trend line as the one shown in Fig. 1). It can easily be seen that the slopes of these lines differ considerably. There is less literature on the definition of downtrend in Psb. Yet, it is clear that the existence of several essentially different definitions or models of F0 downtrend makes it impossible to discuss 'the' relation between downtrend in Psb and F0: the outcome of such a discussion is certain to depend on the exact definition of downtrend that is assumed.
According to our definition of a global component, F0 and Psb do have a global component while CT, VOC and SH generally do not have a global component. The quantitative statistical analysis has shown that, after correcting for the influence of VOC and SH, the variation in Psb can explain all the variation in F0 (i.e. the FPR is usually within the correct range). Consequently, in our physiological two-component model the downtrend in F0 can be explained completely by the downtrend in Psb. However, it is always possible that other (unknown) factors also contribute to the downtrend in F0. That is a possibility which cannot be ruled out.
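The kind of correction described above can be sketched in Python. The exact regression equation is not restated in this section, so the sketch assumes a linear form F0 = a + FPR2*Psb + b*VOC + c*SH. The helper names and all coefficients are hypothetical, and the data are synthetic rather than measured; the only purpose is to show how a Psb coefficient can be estimated while the laryngeal predictors are held in the model.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations act on both sides at once.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def multiple_regression(y, predictors):
    """Ordinary least squares of y on the predictors plus an intercept.

    Returns [intercept, b1, b2, ...], obtained from the normal equations.
    """
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic frames (NOT measured data): F0 is built from known coefficients.
random.seed(0)
n = 200
psb = [10.0 - 5.0 * i / n + random.gauss(0, 0.3) for i in range(n)]  # cm H2O
voc = [random.uniform(0.0, 1.0) for _ in range(n)]  # arbitrary EMG units
sh = [random.uniform(0.0, 1.0) for _ in range(n)]   # arbitrary EMG units
f0 = [60.0 + 4.0 * p + 30.0 * v - 10.0 * s + random.gauss(0, 1.0)
      for p, v, s in zip(psb, voc, sh)]

intercept, fpr2, b_voc, b_sh = multiple_regression(f0, [psb, voc, sh])
print(f"FPR2 estimate: {fpr2:.2f} Hz/cm H2O (value used to build the toy data: 4.00)")
```

Because the toy F0 values are generated with a Psb coefficient of 4 Hz/cm H2O, the estimate should come out close to that value. With real recordings the predictors are correlated and noisy, so such a coefficient is only as trustworthy as the assumed linear model.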
This physiological two-component model was chosen because it seems to be the model which best describes the physiological data. If, for some reason, someone prefers another definition of the global component, such as the top- or bottomline in Fig. 6, the conclusion should indeed be that the downtrend in F0 cannot be determined entirely by the downtrend in Psb, because top- and bottomline are determined to a large extent by the activity of the laryngeal muscles.
To sum up, in our model the downtrend in F0 could be entirely due to the downtrend in Psb. For other definitions of downtrend this does not have to be the case, i.e. these downtrends could be determined partially by the activity of the laryngeal muscles. However, the downtrend in Psb will always explain part of the downtrend in F0.
Ideally, trend lines should not be determined by means of hand fitting, but instead by means of formal, mathematical procedures. However, each and every mathematical fit procedure requires the definition of an error (or cost) function, to quantify the discrepancy between the observed data and the model curve. For the time being, such an error function is almost impossible to define, because it is not possible to reach agreement on the weight of details in the deviations. To a considerable extent, these weights depend on one's theoretical opinions about which details in F0 curves are linguistically relevant and which are not. Another factor complicating the construction of a completely quantitative model of the control of F0 in running speech has to do with the lack of knowledge about the relation between EMG activity of the laryngeal muscles and elastic properties of laryngeal tissue. In our own models we have assumed a simple linear relationship, but that is no more than a very crude first approximation. Thus, we have to be content with models that contain non-quantitative or non-realistic quantitative components for some time to come.
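The dependence of a fitted trend line on the choice of error function can be made concrete with a small Python sketch. It assumes one particular cost, a weighted sum of squared deviations from a straight line, which is only one of many possible choices. The contour, the weights and the function names are hypothetical and are not taken from our data.

```python
def weighted_cost(times, values, intercept, slope, weights):
    """Weighted sum of squared deviations between data and a straight line."""
    return sum(w * (v - (intercept + slope * t)) ** 2
               for t, v, w in zip(times, values, weights))


def fit_line(times, values, weights):
    """Closed-form minimiser of weighted_cost for a straight line."""
    sw = sum(weights)
    mt = sum(w * t for w, t in zip(weights, times)) / sw
    mv = sum(w * v for w, v in zip(weights, values)) / sw
    stt = sum(w * (t - mt) ** 2 for w, t in zip(weights, times))
    stv = sum(w * (t - mt) * (v - mv)
              for w, t, v in zip(weights, times, values))
    if stt == 0:
        raise ValueError("all weighted time points coincide")
    slope = stv / stt
    return mv - slope * mt, slope


# Toy contour (NOT measured data): a line falling at 10 Hz/s plus one peak.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]             # s
f0 = [140.0, 135.0, 160.0, 125.0, 120.0, 115.0, 110.0]  # Hz

uniform = [1.0] * len(times)
# Hypothetical weighting that treats the peak at t = 1.0 s as irrelevant
# for the trend and therefore almost ignores it.
downweight_peak = [1.0, 1.0, 0.01, 1.0, 1.0, 1.0, 1.0]

for name, w in [("uniform", uniform), ("peak down-weighted", downweight_peak)]:
    a, b = fit_line(times, f0, w)
    cost = weighted_cost(times, f0, a, b, w)
    print(f"{name}: slope = {b:.2f} Hz/s, cost = {cost:.1f}")
```

With uniform weights the single peak steepens the fitted slope to about -12.1 Hz/s, whereas down-weighting it gives a slope close to the -10 Hz/s used to build the toy contour. Which of the two is the "right" trend line depends entirely on whether the peak is regarded as part of the global component.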
In this paper we have investigated the relation between downtrend in F0 and Psb, an issue that has been undecided despite considerable discussion in the recent literature. The most important conclusion of our own experiments and a detailed analysis of data published in the literature is that the issue is genuinely not decidable, unless there is agreement about the way in which downtrend in F0 and Psb are defined. In our model of F0 control presented in this paper we take the view that F0 and Psb both have a global component, and that these components are related by definition. Other models or definitions of F0 downtrend, like a line fitted through the F0 peaks (the topline), include effects of other factors affecting F0 besides Psb; therefore, these definitions (or models) of F0 downtrend do not allow a direct link with downtrend in Psb. Also, we have presented data and arguments from our own experiments and from the literature in favour of a tight and precise control of Psb and the underlying respiratory system. Therefore, the phonetic implementation component of any intonation model should include a role for Psb.
This research was supported by the Foundation for Linguistic Research, which is funded by the Netherlands Organization for Scientific Research (N.W.O.). Special thanks are due to Haskins Laboratories where one of the experiments was carried out, especially to dr. Thomas Baer who made this possible. I also express my gratitude to dr. Hiroshi Muta and dr. Philip Blok, who inserted the EMG electrodes and the pressure catheter in the experiments in New Haven and Nijmegen, respectively.
Footnote 1: For subject LB the correlation between CT and F0 is generally larger than the correlation between VOC and F0, and thus CT is a better predictor of F0. But because the behaviour of CT and VOC is almost identical for subject LB, and because the activity of the CT was not measured for subject HB, we have chosen the VOC as a predictor in the regression analysis for both subjects.
Atkinson, J.E. (1978) Correlation analysis of the physiological features controlling fundamental voice frequency, Journal of the Acoustical Society of America, 63, 211-222.
Baer, T. (1979) Reflex activation of laryngeal muscles by sudden induced subglottal pressure changes, Journal of the Acoustical Society of America, 65, 1271-1275.
Baken, R.J. & Cavallo, S.A. (1981) Prephonatory chest wall posturing, Folia Phoniatrica, 33, 193-203.
Baken, R.J., Cavallo, S.A. & Weissman, K.L. (1979) Chest wall movement prior to phonation, Journal of Speech and Hearing Research, 22, 862-872.
Breckenridge, J. (1977) Declination as a phonological process, Bell Laboratories Technical Memorandum, Murray Hill, New Jersey.
Clark, J. & Yallop, C. (1990) An introduction to phonetics and phonology, Oxford: Basil Blackwell.
Cohen, A., Collier, R. & 't Hart, J. (1982) Declination: Construct or intrinsic feature of speech pitch?, Phonetica, 39, 254-273.
Collier, R. (1974) Laryngeal muscle activity, subglottal air pressure, and the control of pitch in speech, Haskins Laboratories Status Report on Speech Research, SR-39/40, 137-170.
Collier, R. (1975) Physiological correlates of intonation patterns, Journal of the Acoustical Society of America, 58, 249-255.
Collier, R. (1987) F0 declination: the control of its setting, resetting, and slope. In Laryngeal function in phonation and respiration (T. Baer, C. Sasaki & K.S. Harris, editors), pp. 403-421. Boston: College-Hill Press.
Cooper, W.E. & Sorensen, J.M. (1981) Fundamental frequency in sentence production. New York: Springer-Verlag.
Fujisaki, H. (1991) Modeling the generation process of F0 contours as manifestation of linguistic and paralinguistic information. In Proceedings of the XIIth International Congress of Phonetic Sciences, supplement, pp. 1-10. Aix-en-Provence.
Gelfer, C.E. (1987) A simultaneous physiological and acoustic study of fundamental frequency declination. Ph.D. thesis, City University of New York.
Gelfer, C., Harris, K., Collier, R. & Baer, T. (1983) Is declination actively controlled? In Vocal Fold Physiology (I.R. Titze & C. Scherer, editors), pp. 113-125. Denver, Colorado: The Denver Center for the Performing Arts, Inc.
Grosjean, F. & Collins, M. (1979) Breathing, pausing and reading, Phonetica, 36, 98-114.
Hart, J. 't, Collier, R. & Cohen, A. (1990) A perceptual study of intonation: an experimental-phonetic approach to speech melody, Cambridge: Cambridge University Press.
Hixon, T.J., Goldman, M.D. & Mead, J. (1973) Kinematics of the chest wall during speech production: volume displacements of the rib cage, abdomen, and lung, Journal of Speech and Hearing Research, 16, 78-115.
Honda, K. & Fujimura, O. (1991) Intrinsic vowel F0 and phrase-final F0 lowering: Phonological vs. biological explanations. In Phonatory mechanisms: physiology, acoustics, and assessment (J. Gauffin & B. Hammarberg, editors), pp. 57-64. San Diego: Singular Publishing Group.
Katwijk, A. van (1974) Accentuation in Dutch: An experimental linguistic study. Ph.D. thesis, Utrecht University.
Ladd, D.R. (1984) Declination: a review and some hypotheses. In Phonology Yearbook, 1, pp. 53-74.
Ladefoged, P. (1967) Three areas of experimental phonetics. Oxford: Oxford University Press.
Lieberman, P. (1967) Intonation, Perception and Language. Cambridge, Massachusetts: The M.I.T. Press.
Maeda, S. (1976) A characterization of American English intonation. Ph.D. thesis, MIT, Cambridge, MA.
Ohala, J. (1970) Aspects of the control and production of speech, UCLA Working Papers Phonetics, 15, 1-192.
Ohala, J.J. (1977) The physiology of stress. In Studies in stress and accent (L. M. Hyman, editor), Vol. 4, pp. 145-168. Southern California Occasional Papers in Linguistics.
Ohala, J. (1978) Production of tone. In Tone: a linguistic survey (V. Fromkin, editor), pp. 5-39. New York: Academic Press.
Ohala, J.J. (1990) Respiratory activity in speech. In Speech production and speech modelling (W.J. Hardcastle & A. Marchal, editors), pp. 23-53. Netherlands: Kluwer Academic Publishers.
Ohala, J. & Ewan, W.G. (1973) Speed of pitch change, Journal of the Acoustical Society of America, 53, 345.
Ohman, S.E.G. (1968) A model of word and sentence intonation, Quart. Prog. & Status Reports, Speech Transmission Lab., Stockholm, 2-3, 6-11.
Pierrehumbert, J. & Beckman, M.E. (1988) Japanese Tone Structure, Linguistic Inquiry Monograph Series, 125, Cambridge, MA: MIT Press.
Strik, H. & Boves, L. (1987) Regulation of intensity and pitch in chest voice. In Proceedings 11th International Congress of Phonetic Sciences, Vol. VI, pp. 32-35. Tallinn.
Strik, H. & Boves, L. (1991) A dynamic programming algorithm for time-aligning and averaging physiological signals related to speech, Journal of Phonetics, 19, 367-378.
Strik, H. & Boves, L. (1992) Control of fundamental frequency, intensity and voice quality in speech, Journal of Phonetics, 20, 15-25.
Strik, H. & Boves, L. (1993) A physiological model of intonation, AFN-Proceedings, Vol. 16/17, pp. 96-105. University of Nijmegen.
Titze, I. (1989) On the relation between subglottal pressure and fundamental frequency in phonation, Journal of the Acoustical Society of America, 85, 901-906.
Titze, I.R. (1992) Phonation threshold pressure: a missing link in glottal aerodynamics, Journal of the Acoustical Society of America, 91, 2926-2935.
Weismer, G. (1985) Speech breathing: contemporary views and findings. In Speech Science, (R.G. Daniloff, editor), pp. 47-72. San Diego: College Hill Press.
Winkworth, A.L., Davis, P.J., Ellis, E. & Adams, R.D. (1994) Variability and consistency in speech breathing during reading: lung volumes, speech intensity, and linguistic factors, Journal of Speech and Hearing Research, 37, 535-556.
Figure 1. Average physiological signals for the Dutch utterance "Piet slikte gisteren zijn vierentwintig gele pillen liever in stilte met bier" (LU1) spoken by subject LB. Also shown in the first and second panel are the global trend lines F0,g and Psb,g, respectively (dashed-dotted lines).
Figure 2. F0 as a function of Psb for the Dutch utterance "Piet slikte zijn pillen met bier" (SU1) spoken by subject LB. The straight line is the line connecting the first and the last voiced frame. FPR1 is the slope of this line.
Figure 3. F0, Psb, and Vl signals for two repetitions of a spontaneous sentence (see Fig. %) spoken by subject HB. The average difference for Vl is 470 cc, and for Psb it is 0.05 cm H2O.
Figure 4. Average F0, Psb, and Vl signals for two utterances produced with reiterant speech: /vi/ (dashed) and /fi/ (solid).
Figure 5. F0, Psb, and Vl signals for a spontaneous utterance spoken by subject HB. The arrow marks the interruption of about 0.5 sec.
Figure 6. Average F0 signal and trend lines for utterance LU1 spoken by subject LB (The average F0 signal is the same signal as in the upper panel of Fig. 1). The following trend lines are shown: F0,g (dashed), the line connecting the first and the last voiced frame (dashed-dotted), topline (dotted), and bottom- or baseline (solid).
Table I. Listed from top to bottom are: utterance type, number of voiced samples (N), length of the utterance (T = T2 - T1) in s, F0 values of first (F0(T1)) and last (F0(T2)) voiced sample in Hz, total fall of F0 (dF0 = F0(T1) - F0(T2)) in Hz, average rate of change of F0 (dF0/T) in Hz/s, Psb values for first (Psb(T1)) and last (Psb(T2)) voiced sample in cm H2O, total fall of Psb (dPsb = Psb(T1) - Psb(T2)) in cm H2O, average rate of change of Psb (dPsb/T) in cm H2O/s, FPR1 = dF0/dPsb in Hz/cm H2O, and the regression coefficient between F0 and Psb (FPR2) in a multiple regression equation, also in Hz/cm H2O (for explanations, see also the text).
                        subject LB                                subject HB
           declarative utterances      questions     declarative utterances      questions
utt         SU1    SU2    LU1    LU2    SU3    LU3    SU1    SU2    SU3    SWC    SU4    LU4
N           234    226    558    524    222    490    314    342    288    680    260    435
T          1.42   1.41   3.46   3.40   1.31   3.18   1.66   1.78   1.54   3.62   1.39   2.40
F0(T1)      150    136    147    136    121    138    119    113    121    132    118    114
F0(T2)       65     67     66     79    167    169    102    106    102    104    200    188
dF0          85     69     81     57    -46    -31     17      7     19     28    -82    -74
dF0/T      60.1   49.1   23.4   16.7  -35.2   -9.7   10.2    3.9   12.3    7.7  -59.0  -30.8
Psb(T1)    9.58   9.92  11.64  11.82   8.44  10.95   6.13   6.47   6.29   5.86   5.83   6.04
Psb(T2)    3.44   3.50   4.82   4.57   4.36   5.10   2.33   1.64   1.42   1.77   4.10   3.96
dPsb       6.14   6.42   6.82   7.25   4.08   5.85   3.80   4.83   4.87   4.09   1.73   2.08
dPsb/T     4.34   4.57   1.98   2.13   3.12   1.84   2.29   2.71   3.16   1.13   1.24   0.87
FPR1       13.9   10.8   11.9   7.87  -11.3  -5.30   4.47   1.45   3.90   6.84  -47.5  -35.5
FPR2       3.97   7.63   2.30   4.58   6.48   4.42   3.20   3.02   4.79   3.78   6.25   4.12