Automatic parametrization of voice source signals:
a novel evaluation procedure is used to compare methods and test the effects of low-pass filtering

Helmer Strik
Internal report, Department of Language and Speech, University of Nijmegen, The Netherlands

ABSTRACT

There is a need for automatic methods for parametrization of the voice source signals. Representatives of the two types of methods that have been used most often for parametrization were tested and compared. For this purpose a novel evaluation procedure is proposed which makes it possible to perform the numerous tests needed for a detailed comparison of the methods. This evaluation procedure revealed that in order to reduce the average error in the estimated voice source parameters the estimation methods should be able to estimate non-integer values of these parameters. The proposed evaluation method was also used to study the influence of low-pass filtering on the estimated voice source parameters. The factor low-pass filtering was chosen because low-pass filtering is probably used in all methods in which voice source parameters are estimated. It turned out that low-pass filtering causes an error in all estimated voice source parameters. On average, the smallest errors were found for a parametrization method in which a voice source model is fitted to the voice source signals, and in which the voice source model is low-pass filtered with the same filter as the voice source signals.

1. INTRODUCTION

The technique of inverse filtering has been available for a long time now. It was first described in Miller (1959). Inverse filtering is based on the linear source-filter model of speech production (Fant, 1960; Flanagan, 1965).

The signal that is inverse filtered most often is the acoustic sound pressure wave recorded with a microphone placed a few centimeters in front of the mouth. In this way an estimate of the first derivative of glottal flow (dUg) can be obtained (Miller, 1959; Rosenberg, 1971; Strube, 1974; Gauffin and Sundberg, 1980; Schoentgen, 1990, 1995; de Veth et al., 1990; Jansen et al., 1991; Alku, 1992; Karlsson, 1992; Strik and Boves, 1992a, 1992b; Fant, 1993). Subsequently, the lip radiation effect can be canceled by integrating dUg to obtain an estimate of true glottal flow (Ug).

It is also possible to inverse filter the airflow signal recorded at the lips to calculate Ug (e.g. Rothenberg, 1973, 1977; Sundberg and Gauffin, 1979; Holmberg, 1993; Hertegard, 1994; Koreman, 1996). This type of research has increased strongly since the introduction of the so-called Rothenberg mask (Rothenberg, 1973, 1977), which consists of a differential pressure transducer attached to a circumferentially vented face mask covered with a wire mesh. The frequency response of this system is flat up to about 1.5 kHz (Sundberg and Gauffin, 1979; Gauffin and Sundberg, 1989; Hertegard and Gauffin, 1992). The mask is usually held as tightly as possible against the subject's face, in order to ensure a tight seal between face and mask.

Inverse filtering has already been studied extensively, and many different methods have been proposed in the literature (see e.g. Funaki and Mitome, 1990; Alku and Vilkman, 1994; Hong et al., 1994; Ding and Kasuya, 1996). However, estimating a voice source signal (either dUg or Ug) is usually not enough. For many applications it is necessary to parametrize the glottal flow signals. Parametrization of the voice source signals, and evaluation of these parametrization methods, has received far less attention in the past. That is why we focus on these aspects in this study.

Parametrization of dUg or Ug can be done in several ways. Usually landmarks (like minima, maxima, zero crossings) are detected in the signals (e.g. Sundberg and Gauffin, 1979; Gauffin and Sundberg, 1980; Gauffin and Sundberg, 1989; Alku, 1992; Strik and Boves, 1992a; Holmberg, 1993; Alku and Vilkman, 1995; Koreman, 1996). Because these landmarks are estimated directly from the voice source signals, these methods will be called direct estimation methods (DE methods).

Voice source parameters are also calculated by fitting a voice source model to the data (e.g. Ananthapadmanabha, 1984; Karlsson, 1990; Schoentgen, 1990, 1995; Jansen et al., 1991; Karlsson, 1992; Strik and Boves, 1992b; Fant, 1993; Milenkovic, 1993; Riegelsberger and Krisnamurthy, 1993). Because in estimation methods of this kind a model fitting procedure is used, they will be referred to as fit estimation methods (FE methods).

In an FE method the voice source model is essential. Many different models have been proposed in the literature (see e.g. Rosenberg, 1971; Fant, 1979; Ananthapadmanabha, 1984; Fant et al., 1985; Fujisaki and Ljungqvist, 1986; Funaki and Mitome, 1990; Lobo and Ainsworth, 1992; Hong et al., 1994; Cummings and Clements, 1995). The Liljencrants-Fant model (LF model) (Fant et al., 1985) is used most often as the voice source model (e.g. Jansen et al., 1991; Karlsson, 1992; Strik and Boves, 1992a, 1992b; Fant, 1993; Riegelsberger and Krisnamurthy, 1993). Since a voice source model is not required in a DE method, some studies do not use one (e.g. Sundberg and Gauffin, 1979; Gauffin and Sundberg, 1980; Gauffin and Sundberg, 1989; Alku, 1992; Holmberg, 1993; Alku and Vilkman, 1995). However, other studies based on DE methods do use a voice source model (Rosenberg, 1971; Fujisaki and Ljungqvist, 1986; Gobl, 1988; Lobo and Ainsworth, 1992; Strik and Boves, 1992a; Koreman, 1996). An important reason for using a voice source model is that the estimated voice source parameters can subsequently be used for speech synthesis.

The parametrization is usually done in the time domain (e.g. Ananthapadmanabha, 1984; Fujisaki and Ljungqvist, 1986; Schoentgen, 1990, 1995; Jansen et al., 1991; Strik and Boves, 1992b; Milenkovic, 1993; Riegelsberger and Krisnamurthy, 1993), sometimes simultaneously in the time and frequency domains (e.g. Gobl and Ní Chasaide, 1988; Karlsson, 1990, 1992; Fant, 1993; Ní Chasaide and Gobl, 1990, 1993), and occasionally in the frequency domain alone (Funaki and Mitome, 1990; Hong et al., 1994; Alku and Vilkman, 1996; Ding and Kasuya, 1996; Alku, Strik and Vilkman, to appear). Which domain is optimal depends on the application and the method used.

Besides the method used to estimate the voice source parameters, it is important to have a look at the method and material used for evaluation. The analyzed material is often limited to a small number of pitch periods of vowels; most often natural vowels (e.g. Fujisaki and Ljungqvist, 1986; Funaki and Mitome, 1990; Jansen et al., 1991; Holmberg, 1993; Milenkovic, 1993), sometimes synthetic vowels (e.g. Strik and Boves, 1994; Darsinos et al., 1995), or both (e.g. Strube, 1974; Alku, 1992; Strik et al., 1992, 1993; Riegelsberger and Krisnamurthy, 1993). Furthermore, the analyzed material usually consists of sustained vowels or carefully produced (e.g. read) utterances. Hardly ever were voice source parameters estimated for all pitch periods of a complete spontaneous sentence. As far as we know the only exception is Strik and Boves (1992a, 1992b). Three important reasons why the material is often limited to a small number of pitch periods of sustained or carefully produced vowels are:

[1] because almost none of the methods is automatic, it is too laborious to process large amounts of speech,

[2] inverse filtering and parametrization are generally easier for vowels than for consonants, and

[3] they are usually more difficult for spontaneous speech compared to sustained vowels or carefully produced utterances.

If voice source parameters are estimated only for a limited number of pitch periods of vowels, there is not much material that can be used for evaluation of the proposed methods. This is one of the reasons why a thorough evaluation generally is not provided. In fact, in the majority of the articles no evaluation is presented at all. In the few cases in which an evaluation was provided, it often consisted merely of a simple qualitative (usually visual) comparison of the glottal flow signals and the model fits for a small number of pitch periods of vowels (Strube, 1974; Fant et al., 1985; Lobo and Ainsworth, 1992; Riegelsberger and Krisnamurthy, 1993; Hong et al., 1994; Ding and Kasuya, 1996).

This qualitative evaluation was generally done for natural speech (Fant et al., 1985; Lobo and Ainsworth, 1992; Hong et al., 1994; Ding and Kasuya, 1996), although it is also possible to use synthetic speech for evaluation (e.g. Strik et al., 1992, 1993; Strik and Boves, 1994; Darsinos et al., 1995). Natural speech has the advantage that it is the kind of speech the method will be used for eventually. However, an important drawback of natural speech is that the correct voice source parameters are not known. This makes it hard to perform a quantitative and detailed evaluation of the estimated voice source parameters. On the other hand, for synthetic speech the correct voice source parameters are known: they are simply the voice source parameters used during synthesis, or some transformation of these parameters. They can be used to calculate the error in the estimated parameters (e.g. Strik et al., 1992, 1993; Strik and Boves, 1994; Darsinos et al., 1995). A drawback of synthetic speech is that it (usually) does not contain all effects (especially the non-linear effects) that are present in natural speech.

We think that evaluation of the estimation methods is important, and therefore should get more attention than it has received so far. That is why we elaborate on this topic in the current article.

Estimation of voice source parameters can be useful for many applications. Without doubt, the application mentioned most often is speech synthesis. However, the estimated voice source parameters are also used for fundamental research on speech production (Ní Chasaide and Gobl, 1993; Holmberg et al., 1994; Strik, 1994; Koreman, 1996). Other areas in which methods to measure voice source behavior could be useful are clinical use, speech analysis, speech coding, automatic speech recognition, and automatic speaker verification and identification. However, in order to be applicable in these areas the methods should be fully automatic. Also for research on speech synthesis and fundamental research on speech production the use of automatic methods would be advantageous. Thus, for various reasons there is an increasing need for automatic methods (see e.g. Fritzel, 1992; Fant, 1993; Ní Chasaide and Gobl, 1993; Ding and Kasuya, 1996). Although a lot of research has already been carried out on this topic, a completely automatic method that works satisfactorily does not seem to exist yet.

The long-term goal of our research therefore is to develop such an automatic method. Both DE methods and FE methods can be made completely automatic. For this reason, and because they are the methods used most often, a representative of the DE method will be compared with a representative of the FE method. The representatives chosen are described in sections 2.3 and 2.4.

The goals of the research reported on in this article are to find out what the pros and cons of each method are, to get a better understanding of the problems involved in estimating voice source parameters, and finally to determine which method performs best. In order to make it easier to compare the two methods, the same voice source model is used in both methods. To this end we use the LF model. The LF model and the reasons for choosing it are described in section 2.2.

To achieve these goals we tried to develop an evaluation procedure with which it is possible to make a thorough and systematic evaluation. The method and material chosen for evaluation are described in sections 3.1 and 3.2, respectively.

This evaluation procedure is then used to study voice source estimation. First, in sections 4.1 and 4.2, it is studied how well the estimation methods succeed in estimating non-integer values of the parameters. This turned out to be a crucial property of the estimation methods.

The evaluation procedure proposed in section 3 can be used to study the effect of different factors. As an example we have chosen to study the effect of low-pass filtering (see section 4.3). The reason is that low-pass filtering influences the estimated parameters (Strik et al., 1992, 1993; Perkell et al., 1994; Alku and Vilkman, 1995; Strik, 1996a; Koreman, 1996). Because low-pass filtering is used in (almost) all methods, it becomes very important to study what the effect of low-pass filtering exactly is. Previously proposed methods are not optimally suited for this task (see Strik, 1996a). We will show that the evaluation procedure proposed here is suitable for studying the effect of the factor low-pass filtering.

In section 5 the findings are discussed and some general conclusions are drawn.

2. ESTIMATION METHODS

In this article two estimation methods used to parametrize dUg are tested and compared. Before going on to describe these two methods (in sections 2.3 and 2.4), we shall first give some definitions in section 2.1 and describe the LF model in section 2.2.

2.1. Some definitions

In the current article it will be assumed that dUg is a discrete signal. Some terms related to these voice source signals, and the A/D conversion used to obtain them, are often used below. In order to avoid confusion later on, we shall first define some of these terms in this section.

For A/D conversion, a choice has to be made for some values like the sampling frequency (Fs), the input range (D = [Xmin, Xmax]), and the number of bits used to code each sample (Bc). Here, Fs = 10 kHz, D = [-2048, 2047], and Bc = 12. As the number of bits used for coding is Bc, the number of amplitude levels is L = 2^Bc, and the step size is d = D/L. The step size is the smallest possible difference between two amplitude values. The distance between two neighboring sample points is called the sample interval or the sampling time Ts = 1/Fs. Throughout this article a time parameter is said to have an integer value if its value is precisely an integer multiple of Ts. Likewise, an amplitude parameter is said to have an integer value if its value is exactly an integer multiple of d.
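The definitions above can be made concrete with a small sketch. It assumes that D/L stands for the width of the input range (counting both end points) divided by the number of levels, which gives a step size of 1 for the values used here. The helper names and the tolerance are illustrative only and are not taken from the original implementation.

```python
# A/D settings quoted in the text.
Fs = 10_000                 # sampling frequency in Hz
Bc = 12                     # number of bits per sample
Xmin, Xmax = -2048, 2047    # input range D

L = 2 ** Bc                 # number of amplitude levels
# ASSUMPTION: "D/L" is read as the width of the range (both end points
# included) divided by the number of levels.
d = (Xmax - Xmin + 1) / L   # step size
Ts = 1.0 / Fs               # sampling time in seconds


def is_integer_time(t, tol=1e-9):
    """True if t (in seconds) is an integer multiple of Ts."""
    k = t / Ts
    return abs(k - round(k)) < tol


def is_integer_amplitude(a, tol=1e-9):
    """True if amplitude a is an integer multiple of the step size d."""
    k = a / d
    return abs(k - round(k)) < tol
```

With these settings a moment of 2.5 ms has an integer value (25 sample intervals), whereas a moment of 2.55 ms does not.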

2.2. LF model

In the current research the LF model is used as voice source model (see Figure 1). It should be noted that the LF model is a mathematically complex model, which is a disadvantage for a model used in a fitting procedure. Nevertheless, we have chosen to use the LF model, because this disadvantage is not crucial (its main effect is that it increases the CPU time), and because the LF model also has a number of advantages:

In previous research the LF model has often been used to estimate voice source parameters, with manual or (semi-)automatic methods. This research has shown that it is a suitable model for description of the voice source signal (see e.g. Fujisaki and Ljungqvist, 1986; Jansen et al., 1991; Karlsson, 1992; Strik and Boves, 1992b; Strik et al., 1992, 1993; Riegelsberger and Krisnamurthy, 1993; Childers and Ahn, 1995; Darsinos et al., 1995).

Fujisaki and Ljungqvist (1986) compared several voice source models. Their results showed that the LF model and their own FL-4 model performed best.

Previous research has also proven that the LF model is suitable for speech synthesis (see e.g. Carlson et al., 1989).

Due to all research already performed, the model and its behavior are well known.

The parameters of the LF model can be divided into three groups (see Table I).

Table I. The LF parameters.

1. amplitudes

Ee: excitation strength, Ee = -min(dUg)

U0: peak glottal flow, U0 = max(Ug)

2. moments

to: moment of opening

tp: moment of peak in Ug, tp = argmax(Ug)

te: moment of excitation, te = argmin(dUg)

tc: moment of closing

3. durations of time intervals

T0: duration of a pitch period, T0 = 1/F0,

Ta: duration of the interval between te and the projection onto the time axis of the tangent to dUg at te.

These parameters, in turn, can be used to derive many other parameters. For instance, speed quotient is often calculated: SQ = (tp-to)/(tc-tp) (e.g. Alku and Vilkman, 1995). However, in our opinion these derived parameters are less suitable for evaluation of the parametrization methods. The reason is that the derived parameters have an important drawback: whenever there is a change in a derived parameter, it is difficult to determine how this change came about (Strik, 1996a). An increase in SQ could be the result of a larger tp, a smaller to, a smaller tc, or a combination of any of these three changes. On the other hand, whenever a derived parameter remains constant, this does not necessarily imply that the underlying parameters (from which the parameter was derived) remain constant. It is always possible that changes in these underlying parameters cancel each other out. Therefore, we prefer to use the LF parameters specified in Table I for the evaluation of estimation methods. Since the parameters Ee, to, tp, te, and Ta give a complete description of an LF pulse, this set of parameters will be used in this article.
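The ambiguity of derived parameters can be illustrated with a few lines of arithmetic. The numerical values below (in ms) are hypothetical and were chosen only to make the point; they are not measurements.

```python
def speed_quotient(to, tp, tc):
    """SQ = (tp - to) / (tc - tp), as defined in the text."""
    return (tp - to) / (tc - tp)


# Hypothetical reference pulse (times in ms).
sq_base = speed_quotient(to=0.0, tp=4.0, tc=6.0)

# Three different changes in the underlying parameters all increase SQ.
sq_larger_tp = speed_quotient(to=0.0, tp=4.4, tc=6.0)
sq_smaller_to = speed_quotient(to=-1.0, tp=4.0, tc=6.0)
sq_smaller_tc = speed_quotient(to=0.0, tp=4.0, tc=5.6)

# Changes in tp and tc that cancel each other out: SQ stays the same
# although the pulse itself is clearly different.
sq_compensated = speed_quotient(to=0.0, tp=5.0, tc=7.5)
```

Observing only SQ, the three increases cannot be told apart, and the compensated case cannot be distinguished from the reference pulse at all.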

2.3. Direct estimation method

In DE methods, voice source parameters are calculated directly from dUg or Ug by means of simple arithmetic operators like min, max, argmin, and argmax. These arithmetic operators are used to detect landmarks in the signals. Some examples of estimations used quite often are: U0 = max(Ug), tp = argmax(Ug), Ee = -min(dUg), and te = argmin(dUg) (see e.g. Sundberg and Gauffin, 1979; Ananthapadmanabha, 1984; Gauffin and Sundberg, 1980; Gauffin and Sundberg, 1989; Alku, 1992; Alku and Vilkman, 1995; Koreman, 1996). Besides the value and the position of a maximum or minimum, the position of a zero crossing is also used to estimate parameters. For instance, in this way to and tc can be estimated (see Figure 1).

With DE methods, estimates of most voice source parameters can be obtained in a relatively simple way. However, DE methods also have some disadvantages. DE methods try to locate (important) events in the voice source signals. Thus the resulting estimates are limited to the place or amplitude of samples in the discrete signals. In other words, the estimated voice source parameters always have integer values. In practice, these (important) events generally will not coincide precisely with a sample point, and amplitudes will not always be exactly an integer multiple of the step size d; i.e. the parameters will not have an integer value. The error in the estimated voice source parameters due to this property of the DE methods will contribute to the total error, as we will show in section 4. This is a major drawback of DE methods.
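The restriction to integer values can be seen in a minimal sketch. The pulse used here is a hypothetical parabolic dip, not the LF model, and the function name and parameter values are assumptions made for illustration. The true excitation lies between two sample points and its amplitude is not a multiple of the step size.

```python
def toy_dug(n, Ee=1000.3, te=50.4, w=10.0):
    """Hypothetical dUg-like dip (NOT the LF model): minimum -Ee at t = te."""
    x = (n - te) / w
    return -Ee * max(0.0, 1.0 - x * x)


# Sample the pulse and round to the nearest amplitude level (d = 1).
samples = [round(toy_dug(n)) for n in range(100)]

# Direct estimation with min and argmin.
Ee_hat = -min(samples)
te_hat = samples.index(min(samples))

te_error = te_hat - 50.4      # limited by the sample grid
Ee_error = Ee_hat - 1000.3    # limited by the grid and by the step size
```

Whatever the true values are, te_hat is always a sample index and Ee_hat is always a multiple of d, so an error remains as soon as the true parameters have non-integer values.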

Another drawback of DE methods is that a disturbance present in the estimated flow pulses can lead to large errors in the estimated parameters. For instance, noise or formant ripple can influence the position and the amplitude of certain events to a large extent. Some other drawbacks of DE methods can be found in Strik (1996a).

One of the aims of the research reported in this article is to test the performance of a DE method, and to compare it with the performance of an FE method. To that end we chose the DE method described in Alku and Vilkman (1995), because their method seemed promising and because the authors provide a fairly detailed description of their method (see especially page 765 of their article). Furthermore, with this method it was possible to estimate the LF parameters Ee, to, tp, and te (for which they use the terms Amin, to, tm, and tdm, respectively).

In their method Alku and Vilkman (1995) do not estimate Ta. They use the parameter tret to describe the return phase. Since Ta cannot be derived from tret and an LF model is not complete without Ta, another method had to be used to estimate Ta. For the current research all estimates were made in the time domain. Because it is very difficult to estimate Ta in the time domain with a DE method, estimates of Ta were obtained by fitting the LF model to the glottal pulse. More precisely, for given values of Ee, to, tp, and te (made with the DE method) the optimal value of Ta was estimated by fitting the LF model to the data. Therefore, strictly speaking, only Ee, to, tp, and te can be said to be the result of the DE method, while Ta is subsequently estimated with a fitting procedure. However, it is important to notice that the estimate of Ta does depend to a large extent on the estimates of Ee, to, tp, and te made before with the DE method. Furthermore, estimating one parameter (here Ta) with a fitting procedure, is a relatively simple operation. Consequently, the results showed that the error in the estimates of Ta is mainly the result of the errors in the estimates of Ee, to, tp, and te made with the DE method. For instance, if estimates of Ee and/or te are too large, the resulting estimates of Ta will generally be too small.

After implementing this method for parameter estimation, numerous experiments were first carried out to improve the implementation. The goal was to make the estimations more robust, and thus to make the resulting average errors in the estimates smaller. In the following stage, the DE method was used for the tests described below.

2.4. Fit estimation methods

Voice source parameters can also be obtained by fitting a voice source model to the data (e.g. Ananthapadmanabha, 1984; Karlsson, 1990; Schoentgen, 1990, 1995; Jansen et al., 1991; Karlsson, 1992; Strik and Boves, 1992b; Fant, 1993; Milenkovic, 1993; Riegelsberger and Krisnamurthy, 1993). In our FE method five LF parameters (Ee, to, tp, te, and Ta) are estimated for each pitch period. The FE method consists of three stages:

1. initial estimate

2. simplex search algorithm

3. Levenberg-Marquardt algorithm

The goal of the FE method is to determine a model fit which resembles the glottal pulse as closely as possible. This resemblance is quantified by means of an error function, which is calculated in the following way. The optimization procedure provides a set of LF parameters. A routine (called the LF routine) uses the analytical expression of the LF model to calculate a continuous LF pulse for these LF parameters. Subsequently, this LF pulse is sampled and zeros are added before to and after tc (until the length of the fitted signal is equal to that of the glottal pulse). The output of the LF routine consists of the samples of the fitted signal. In turn, the samples of the fitted signal together with the samples of the glottal pulse are the input to the error function, which provides a measure of the difference between these samples.

The fitting procedure tries to minimize this error. We have experimented with several error functions which were defined either in the time domain, the frequency domain, or in both domains simultaneously. Defining a suitable error function in the frequency domain, for this automatic fitting procedure, turned out to be problematic. Probably the main reason is that the spectrum contains some details (e.g. the harmonics structure, the high-frequency noise) which need not be fitted exactly. With simple error measures, like e.g. the root-mean-square (rms) error, we did not succeed in obtaining a reasonable model fit. More sophisticated error functions are needed for this task. The desired error function should abstract away from the details which are not important, and emphasize the important aspects (e.g. the slope of the spectrum).

In the time domain it is much easier to obtain a fairly good model fit of dUg. Here a simple rms error does yield plausible results. Still, also in the time domain some aspects of dUg could be more important than others. It is likely that more sophisticated error functions could be defined which emphasize the relevant (e.g. perceptual) aspects. However, what is relevant does depend on the application. In the current research we did not have a specific application in our mind. The goal of this research was to develop a method for which the error in the estimated voice source parameters is small. Therefore, an important property of the error function is that it should decrease when the errors in the voice source parameters become smaller (this may sound trivial, but it is not). The rms error (defined in the time domain) did have this property and thus was suitable for this task, as our experiments revealed.
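A time-domain rms error of the kind described above can be sketched as follows. The function names are hypothetical, and the sampled model pulse is passed in as a plain list because the LF routine itself is not reproduced here.

```python
import math


def pad_with_zeros(pulse, start, length):
    """Embed a sampled model pulse in a zero signal of the given length."""
    if start < 0 or start + len(pulse) > length:
        raise ValueError("pulse does not fit in the analysis frame")
    tail = length - start - len(pulse)
    return [0.0] * start + list(pulse) + [0.0] * tail


def rms_error(fitted, glottal):
    """Root-mean-square difference between two equally long signals."""
    if len(fitted) != len(glottal):
        raise ValueError("signals must have equal length")
    total = sum((f - g) ** 2 for f, g in zip(fitted, glottal))
    return math.sqrt(total / len(glottal))


# Illustrative values only: the fit differs from the data in one sample.
glottal = [0, 0, -3, -8, -2, 0]
fitted = pad_with_zeros([-3.0, -7.0, -2.0], start=2, length=len(glottal))
error = rms_error(fitted, glottal)
```

The error is zero only when the fitted signal equals the glottal pulse sample by sample, and it grows with the size of the deviations.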

For the fitting procedure different non-linear optimization techniques were tested: several gradient algorithms and some versions of a non-gradient algorithm, i.e. the simplex search algorithm of Nelder and Mead (1964). Of the algorithms tested, the simplex search algorithms usually came closer to the global minimum than the gradient algorithms. Owing to discontinuities in the error function, gradient algorithms are more likely to get stuck in local minima than simplex search algorithms are. Therefore, the best version of the simplex search algorithm is used in the second stage of the FE method. However, in the neighborhood of a minimum, the simplex algorithm may do worse (see Nelder and Mead, 1964). As a final optimization, the Levenberg-Marquardt algorithm (a gradient algorithm) is therefore used in the third stage.

In order to start the simplex search algorithm of stage 2 an initial estimate is required, which is made in the first stage. In principle, the best available DE method should be used to provide the initial estimate. In that case the rms error for the FE method can never be larger, and will almost always be smaller than the rms error for the DE method used (because in stages 2 and 3 of our FE method the rms error can never increase, and usually decreases gradually). Consequently, the errors in the voice source parameters estimated with the FE method would almost always be smaller than those estimated with the DE method used for initial estimation. Therefore, if we had used the DE method described in the section above for initial estimation, the performance of this DE method would probably have been worse than that of the FE method. Because we considered this to be an unfair starting point, we decided to use another routine for initial estimation. We simply selected the one we had used in previous research (Strik et al., 1993).
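The first two stages can be illustrated with a reduced sketch under several assumptions. A hypothetical parabolic dip replaces the LF model, only two parameters (Ee and te) are fitted, the initial estimate is simply taken from min and argmin, and the simplex search is a simplified textbook variant rather than the version used in our experiments. The Levenberg-Marquardt stage is omitted.

```python
import math


def toy_pulse(n, Ee, te, w=10.0):
    """Hypothetical dUg-like dip (NOT the LF model)."""
    x = (n - te) / w
    return -Ee * max(0.0, 1.0 - x * x)


def rms_error(params, target):
    Ee, te = params
    sq = sum((toy_pulse(n, Ee, te) - y) ** 2 for n, y in enumerate(target))
    return math.sqrt(sq / len(target))


def nelder_mead(f, x0, steps, max_iter=400, tol=1e-9):
    """Simplified simplex search: reflect, expand, contract, shrink."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += steps[i]
        simplex.append(vertex)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        if f(worst) - f(best) < tol:
            break
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the worst vertex and the centroid.
            return [centroid[i] + t * (centroid[i] - worst[i]) for i in range(n)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(second_worst):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink all vertices towards the best one.
                simplex = [best] + [
                    [best[i] + 0.5 * (v[i] - best[i]) for i in range(n)]
                    for v in simplex[1:]
                ]
    return min(simplex, key=f)


# Test pulse with non-integer parameters, rounded as in A/D conversion.
target = [round(toy_pulse(n, 1000.3, 50.4)) for n in range(100)]

# Stage 1 (simplified): initial estimate from min and argmin.
x0 = [float(-min(target)), float(target.index(min(target)))]

# Stage 2: simplex search on the time-domain rms error.
fit = nelder_mead(lambda p: rms_error(p, target), x0, steps=[1.0, 0.01])
```

Because the best vertex of the simplex is never discarded, the rms error of the result cannot exceed that of the initial estimate. Unlike the initial estimate, the fitted te is not tied to the sample grid.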

In section 4.3 we will introduce a second version of this FE method. This second version differs only slightly from the version described here. Together with the DE method described in section 2.3 this makes a total of three estimation methods that were studied.

3. EVALUATION METHOD AND MATERIAL

3.1. Evaluation method

Estimates of voice source parameters can be influenced by a large number of factors. So far, 11 of these factors have been studied: Fs, Bc, position (shift) and amplitude (Ee) of the glottal pulses, tc, T0, signal-to-noise ratio (i.e. the effect of additive noise), phase distortion (which can be caused e.g. by high-pass filtering), low-pass filtering, and errors in the estimates of formant and bandwidth values during inverse filtering (which will bring about formant ripple in the estimated voice source signals). We have performed at least 1000 model fits for each of these 11 factors, making a total of well over 11,000 model fits.

Due to space limitations it is not possible to present the results of all the tests here. Therefore, we shall confine ourselves to the most important results, viz. those of the factors shift, Ee, and low-pass filtering. The results of other tests can be found in Strik et al. (1993), Strik and Boves (1994), and Strik (1994).

In natural speech many of these factors will simultaneously affect the estimated voice source parameters. Still, we think that it is better to conduct a systematic study of each factor in isolation. First of all, because otherwise it would be difficult to find out what the effect of each factor is. Second, because the contribution of these factors differs from one situation to the other, even within one experiment. If for different magnitudes of each factor it is known what the effects on the voice source parameters are, then for a given setting it could be estimated what the magnitude of each factor is and thus what the errors in the voice source parameters are. And third, because it is impossible to study all combinations (1000 cases for 11 factors make a total of 1000^11 combinations). Certainly, some of the factors will interact. Therefore, after studying the effect of each factor in isolation, some relevant combinations should also be studied later.

Next, we had to decide whether to use natural or synthetic speech for evaluation. Natural speech has the advantage that it is realistic: it is the kind of speech (with all its properties) for which the method eventually should be used. A previous version of the FE method tested with natural speech produced plausible results (Strik and Boves, 1992b; Strik et al., 1992). The results were plausible in the sense that visual inspection revealed that glottal flow signals and model fits were very much alike. This kind of qualitative (visual) evaluation is about the only evaluation which is usually done. Furthermore, we also checked whether the voice source parameters changed slowly in time (this is what one would expect if the voice source parameters are related to articulation), and they did. Of course the latter type of evaluation is not possible if only some pitch periods of vowels are processed (as is often done). In this case analysis of longer stretches of speech is required.

If a voice source model is used during analysis, another type of qualitative evaluation can be done. The estimated voice source parameters can be used to resynthesize the utterance, and by using perception one could try to minimize the difference between natural and resynthesized utterance (analysis-by-synthesis). This method has the disadvantage that (almost) similar percepts can be obtained with different articulatory settings. So with this method one is never sure whether the estimated voice source parameters approximate the correct voice source parameters, or whether it is another set of voice source parameters that just sounds (almost) similar. For some types of research, e.g. fundamental research on speech production, this is an important distinction.

Furthermore, all the qualitative evaluation methods mentioned above have some other disadvantages. First of all, for natural speech it is almost impossible to control all the factors which influence the estimated voice source parameters, and thus to examine the effect of each of these factors in isolation. And even if this were possible, these methods are much too laborious to be used in the numerous (more than 11,000) cases that were studied so far. After all, for every new model fit of each pitch period a qualitative evaluation has to be done by looking at or listening to the signals.

Finally, natural speech has to be inverse filtered before one can start with the parametrization of the glottal flow signals. Current inverse filter techniques work quite well, but they are certainly not perfect. Imperfections in inverse filtering lead to errors in the glottal flow signals. These errors contribute to the final errors in the estimated voice source parameters, and it becomes impossible to determine which part of the error is caused by inverse filtering and which one by parametrization. Inverse filtering has already been studied a lot in the past. Here we want to concentrate on the estimation methods.

Instead of natural speech, synthetic speech can be used for evaluation. The most important drawback of synthetic speech is that it is only an approximation of natural speech, and does not contain all the properties of natural speech. However, synthetic speech also has many advantages. First of all, with synthetic speech inverse filtering and parametrization of the glottal flow signals can be studied in isolation (if a synthesizer is used that outputs both speech and the glottal flow signal). Furthermore, one can control and vary each factor, and thus each factor can be studied in isolation. Desired glottal waveforms with different kinds of shapes can easily be produced. For all these glottal flow signals the correct voice source parameters are known. They are simply the voice source parameters used to synthesize these pulses (or some transformation of them). This makes it easy to calculate the error between estimated and correct voice source parameters. Finally, the experimental cycle is fast, much faster than with the qualitative methods mentioned above. This is very important, because for a systematic and thorough evaluation many experiments have to be done (so far, already more than 11,000 model fits have been carried out).

Given the considerations presented above, we decided to use synthetic speech for our evaluations. Because we want to focus on the parametrization method, we shall not evaluate inverse filtering in the current research. In our experiments we first synthesize glottal flow signals. Subsequently, the three parametrization methods are used to estimate the voice source parameters. Finally, the estimated voice source parameters are compared with the correct ones (used to synthesize the glottal flow signals). In this way the experimental cycle is short, and can be used to perform the numerous tests which are needed. As we use the LF model for the fitting procedure, it is obvious that we also used the LF model to synthesize the glottal flow signals.

This evaluation method is equivalent to the method used by McGowan (1994) to estimate vocal tract parameters. He used the same articulatory synthesizer to produce formant tracks and to recover the articulatory trajectories from these formant tracks. His research showed that this is a useful evaluation method, which can be used to gain insight in the estimation procedure. For example, he found that the estimation could be improved by using additional acoustic information, such as rms amplitude.

In our research, just as in the research by McGowan (1994), all details of the generating procedure are explicitly known. We therefore agree with him that these kinds of studies should be regarded as best case studies which can be used to study the limitations of estimation procedures and to optimize these estimation procedures.

For evaluating the estimation methods 11 base pulses were defined (see section 3.2). These 11 base pulses served as a starting point, and were used to generate the test pulses. For instance, to study the influence of the factor low-pass filtering, the 11 base pulses were filtered with M low-pass filters in order to generate M x 11 test pulses. Calculation of the base pulses and the test pulses was first done in floating point arithmetic. After the test pulses had been created, the sample values were rounded off towards the nearest integer (as is done in standard A/D conversion). Subsequently, for these test signals voice source parameters were estimated with the DE method and the FE method. The resulting values were compared with the correct values, and the errors were calculated:

ERR(X) = 100%*abs(Xest - Xinp)/Xinp, for X = Ee

ERR(Y) = abs(Yest - Yinp), for Y = to, tp, te and Ta.

The experiments were carried out for a number (say N) of test pulses. After calculating the errors in the estimates of the 5 LF parameters for each test pulse, the errors had to be averaged. This can be done in a number of ways. Generally, averaging was done by taking the median of the absolute values of the errors. The absolute values were taken because otherwise positive and negative errors could cancel each other out. In this way the average error could be small, while the individual errors are (much) larger. The median was taken because (compared to the arithmetic mean) it is less affected by outliers which are occasionally present in the estimates. This method of averaging is the default method in the current article. Sometimes other ways of averaging were required. Whenever another way of averaging was used, this is explicitly mentioned in the text.
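As an illustration, the error measures and the default way of averaging can be sketched as follows. This is a minimal sketch, not the code used in the experiments: the function names, the dictionary layout, and the numerical values in the example are all assumptions made for this illustration.

```python
from statistics import median

TIME_PARAMS = ("to", "tp", "te", "Ta")

def estimation_errors(est, inp):
    """ERR per LF parameter: relative (in %) for Ee, absolute for the time parameters."""
    errors = {"Ee": 100.0 * abs(est["Ee"] - inp["Ee"]) / inp["Ee"]}
    for name in TIME_PARAMS:
        errors[name] = abs(est[name] - inp[name])
    return errors

def median_errors(error_list):
    """Default averaging: the median of the absolute errors over all test pulses."""
    return {name: median(e[name] for e in error_list) for name in error_list[0]}

# Hypothetical input values (time parameters in msec.) and three hypothetical
# estimates, the last of which is an outlier.
inp = {"Ee": 1024.0, "to": 10.0, "tp": 15.0, "te": 17.0, "Ta": 1.0}
estimates = [
    {"Ee": 1014.0, "to": 10.8, "tp": 15.0, "te": 17.0, "Ta": 1.0},
    {"Ee": 1024.0, "to": 10.9, "tp": 15.1, "te": 17.0, "Ta": 1.2},
    {"Ee": 512.0, "to": 10.7, "tp": 15.0, "te": 16.0, "Ta": 3.0},
]
averaged = median_errors([estimation_errors(e, inp) for e in estimates])
```

In this example the outlier (an error of 50% in Ee) has no influence on the median error in Ee, whereas it would dominate the arithmetic mean.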

In all figures below, the errors are arranged in a similar fashion (see e.g. Figure 3). In the upper left corner are the errors for Ee (in %), in the middle row are the errors for to and tp, and in the bottom row are the errors for te and Ta. The errors in the time parameters to, tp, te, and Ta are expressed in msec. or in μsec., depending on the magnitude of the errors.

3.2. Material

The three estimation methods used in this study are pitch-synchronous. This implies that a pitch period of dUg first has to be located before it can be parametrized. Among the parameters that have to be estimated are to and tc. Because these two parameters are not known beforehand, the pitch period cannot be segmented exactly. In practice, we first locate the main excitations (i.e. te) and then use a window with a width larger than the length of the longest (expected) pitch period. Generally, the pitch period will be situated between two other pitch periods (except for UV/V and V/UV transitions). Therefore, for each experiment sequences of three equal LF pulses were used. Each time voice source parameters were estimated for the (perturbed) pulse in the middle. Another reason for not using a single glottal pulse for evaluation is that the effects of perturbations cannot always be studied by a single, isolated LF pulse.

Furthermore, LF pulses with different shapes were used. The reason is that the effect of a studied factor can depend on the shape of a pulse. Therefore, to get a general picture of the effect of that factor, the effect has to be studied for a number of pulses with different shapes. These pulses will be called the base pulses. The base pulses were obtained by using the LF model for different values of the LF parameters. The parameters of Ee, T0, to, and tc were kept constant at 1024, 10 msec., 10 msec., and 20 msec., respectively. The values given for to and tc are the values for the second of the three pulses. For the first pulse one should subtract 10 msec., and for the last pulse add 10 msec. T0 and tc were kept constant because the results of our experiments showed that varying these parameters had very little effect on the estimations. The influence of varying Ee and shift (which is strongly related to to) were studied separately (see section 4.2).

For defining the base pulses the values of tp, te, and Ta were varied. Based on the data given in Carlson et al. (1989), and the data from previous experiments (Strik and Boves, 1992a, 1992b, 1994; Strik et al., 1992, 1993; Strik, 1994) the following 11 base pulses were defined:

For all tests Fs = 10 kHz and Bc = 12. If Bc = 12, the minimum value a sample can have is -2048, and thus the maximum value Ee can have is 2048. But in practice (when natural speech is used), even if the amplification during A/D conversion is optimal, the average value of Ee will be smaller than the maximum value of 2048. Therefore, the 11 base pulses were calculated with a value of 1024 for Ee.
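The rounding step and the sample range for Bc = 12 can be illustrated with a short sketch. Two details are assumptions made here and are not specified in the text: ties are rounded upwards, and values outside the representable range are clipped. The function name is hypothetical.

```python
import math

def quantize(samples, bits=12):
    """Round to the nearest integer and clip to the signed range of `bits` bits."""
    lowest = -(2 ** (bits - 1))       # -2048 for 12 bits
    highest = 2 ** (bits - 1) - 1     # +2047 for 12 bits
    # Assumption: ties (x.5) are rounded upwards.
    rounded = (math.floor(s + 0.5) for s in samples)
    # Assumption: out-of-range values are clipped.
    return [min(highest, max(lowest, r)) for r in rounded]

quantized = quantize([-2048.4, -1024.0, 0.49, 0.5, 3000.0])
```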

4. TESTS

Various tests were performed to test the DE method and the FE method. The results of some of these tests are presented in this article. First, the LF routine used to generate the LF pulses is tested in section 4.1. Subsequently, the influence of position (shift) and amplitude (Ee) of the glottal pulses on the estimates is tested in section 4.2. Finally, in section 4.3 it is studied in which way low-pass filtering affects the estimates.

4.1. The LF routine

4.1.1. Introduction

In section 2.3 we argued that one of the drawbacks of the DE methods is that only integer values for the parameters can be estimated. Our intention was to develop an FE method that would make it possible to estimate non-integer values too. In order to make this possible an LF routine is needed which has a certain property: the LF routine should be able to calculate correct LF pulses for integer and non-integer values of the LF parameters. Here we shall test whether our LF routine has the required property, which will be called the 'non-integer' property.

4.1.2. Method

A 10 kHz LF pulse was calculated for the following values of the LF parameters (which are not all integer): to = 10.05, tp = 15.25, te = 17.25, tc = 20.05, Ta = 1.0 msec., and Ee = 1.0. For this 10 kHz pulse all important events (i.e. to = opening, tp = peak of Ug, te = excitation, and tc = closing) are positioned exactly halfway between two sample positions. Next, a 20 kHz LF pulse was calculated with the same values of the LF parameters. In this case, all events coincide with sample positions.

4.1.3. Results and conclusions

As is apparent from Figure 2, the two pulses do not differ. A similar test was also performed for non-integer values of Ee, and different values of Bc (number of bits used for coding). In this case, too, the pulses did not differ. Therefore, the conclusion is that the proposed LF routine succeeds in generating correct LF pulses, also for non-integer values of the time and amplitude parameters. Results of ensuing tests can be found in the next subsection.

4.2. Shift and Ee

4.2.1. Introduction

In the previous subsection it was tested whether it is possible to calculate correct LF pulses, with the proposed LF routine, also for non-integer values of the LF parameters. This was tested by studying some well-chosen examples of the LF pulses. As the test gave positive results, we can now go one step further. In this section a more thorough test is presented. For both the DE method and the FE method it will be tested how well they succeed in estimating (non-integer values of) the voice source parameters.

4.2.2. Method

The definition of the 11 base pulses is such that all time parameters have an integer value (see section 3.2). In order to create test pulses in which the time parameters did not have integer values, the 11 base pulses were shifted in steps of 0.01 msec., from 0.0 up to 0.1 msec. (11 values). This variable will be called shift. For only two of the chosen 11 values of shift (i.e. shift = 0.0 and 0.1), the time parameters will have an integer value, while for the other 9 values of shift all time parameters will have non-integer values. An example of a base pulse shifted over 0.05 msec. is the 10 kHz pulse in Figure 2 (dotted line).

In order to create test pulses in which the amplitude (Ee) does not have integer values the amplitude Ee was varied from 1023 to 1025 in steps of 0.2 (11 values). This makes a total of 1331 test pulses (11 base pulses x 11 shift values x 11 Ee values).
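The test grid described above can be enumerated as follows. This is only a sketch of the bookkeeping: the base pulses are represented by their index, and all variable names are assumptions made for this illustration.

```python
from itertools import product

N_BASE_PULSES = 11

# Shift values in msec.: 0.00, 0.01, ..., 0.10
shifts = [round(0.01 * k, 2) for k in range(11)]
# Ee values: 1023.0, 1023.2, ..., 1025.0
ee_values = [round(1023 + 0.2 * k, 1) for k in range(11)]

# One test pulse per combination of base pulse, shift, and Ee.
test_grid = list(product(range(N_BASE_PULSES), shifts, ee_values))

# With Fs = 10 kHz the sampling period is 0.1 msec., so only every tenth
# shift step puts the events back on sample positions.
on_sample_shifts = [shifts[k] for k in range(11) if k % 10 == 0]
```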

4.2.3. Results of the DE method

First the results of the DE method are presented in Figures 3 and 4. Each error in Figure 3 is the median of 121 errors (11 base pulses x 11 Ee values), while each error in Figure 4 is the median of another set of 121 errors (11 base pulses x 11 shift values).

Let us first look at the errors in Figure 3. To estimate to, a threshold function is used in the DE method. The consequence is that the estimate of to is always much too large (on average about 820 μsec., see Figure 4). For a shift of 0.03 msec. the average error in to is minimal, while for a shift of 0.04 msec. it suddenly becomes maximal. The reason is that this extra shift of 0.01 msec. causes the threshold to be exceeded one sample later in many test pulses, and thus the average error in to suddenly increases.

Except for to, the figures of the average errors of the other parameters all have roughly the expected triangular shape. For a shift of 0.0 and 0.1 msec. the errors are zero, and for other shift values the errors are greater than zero. The fact that (except for tp) the figures are not exactly triangular is caused by certain details of the implementation of the DE method which are not relevant here.

The errors in the estimates for different values of Ee are shown in Figure 4. The errors in the time parameters to, tp, and te obviously do not depend on the value of Ee. Therefore, the errors for these time parameters are constant. If a large number of moments is randomly distributed, the average error (both the arithmetic mean and the median) due to rounding off towards the nearest sample would be Ts/4 = 25 μsec. The average errors of tp, te, and Ta do not deviate much from this theoretical average. The reason that the average errors are not exactly equal to 25 μsec. is that the related moments are not positioned randomly. The reason why the error in to is much larger was already explained above.
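The theoretical value of Ts/4 can be checked numerically. The sketch below assumes event times that are spread evenly over one sample interval at Fs = 10 kHz (Ts = 100 μsec.), as a deterministic stand-in for randomly distributed moments; the names used are illustrative only.

```python
import math
from statistics import mean, median

TS_US = 100.0  # sampling period in microseconds for Fs = 10 kHz

def rounding_error(t_us, ts_us=TS_US):
    """Absolute error made by moving an event to the nearest sample instant."""
    nearest = math.floor(t_us / ts_us + 0.5) * ts_us
    return abs(t_us - nearest)

# Assumption: the events cover one sample interval evenly.
N_EVENTS = 10000
times = [1000.0 + (k + 0.5) * TS_US / N_EVENTS for k in range(N_EVENTS)]
errors = [rounding_error(t) for t in times]

mean_error = mean(errors)      # close to Ts/4 = 25 microseconds
median_error = median(errors)  # also close to Ts/4
```

The individual errors are distributed uniformly between 0 and Ts/2, which is why the mean and the median coincide at Ts/4.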

The figure of the errors in the estimates of Ee also has the expected triangular shape: the average errors are minimal for integer values of Ee, and are larger in between. The median error in Ee is never zero, because it is obtained by averaging over different values of shift, and for most values of shift the error in Ee is larger than zero. The estimate of Ta depends on the estimates of Ee and te, and thus is not constant as a function of Ee. Again, the exact shapes of the figures with the errors of Ee and Ta are a corollary of details in the implementation of the DE method which are not relevant.

4.2.4. Results of the FE method

The resulting average errors for the FE method are shown in Figures 5 and 6. In this case the errors were averaged by taking the mean value. This was done for two reasons: [1] since there are no outliers, median and mean values do not differ much; [2] by taking the mean it is also possible to calculate standard deviations. In turn, this makes it possible to test whether there is a significant difference between two mean values.

In this case for each value of shift the mean and standard deviation of 121 errors (11 base pulses x 11 Ee values) were calculated. The results are shown in Figure 5. Likewise, for each value of Ee the mean and standard deviation of 121 errors (11 base pulses x 11 shift values) were calculated. The results are shown in Figure 6.

In Figures 5 and 6 one can observe that the mean errors do not differ significantly from each other. Furthermore, no trend can be observed in the errors. Put otherwise, the magnitude of the error in all estimated parameters does not depend on the value of the factors shift and Ee. Furthermore, all errors are very small, in general much smaller than the errors for the DE method. Except of course for the cases where all the LF parameters have an integer value. In the latter case the errors for the DE method are zero, which is smaller still than the tiny errors found for the FE method. However, it is clear that in practice the voice source parameters will seldom have exactly an integer value.

4.2.5. Conclusions

The conclusions that can be drawn from these tests are the following. The errors obtained with the FE method are very small, in general much smaller than those for the DE method. It can be concluded that with the FE method non-integer values can be estimated as accurately as integer values. Therefore, the quality of the model fit does not depend on the exact value of Ee and the position of the pulse (which is determined here by the variable shift). This explains why to and Ee could be kept constant in the definition of the base pulses (see section 3.2).

For the DE method the average errors in to are always larger than for the FE method, because in the former a threshold function is used to estimate to. In fact, the error in to can be substantially reduced, simply by subtracting a constant from its estimate. For the other parameters the estimation errors for the DE method are zero if the parameters have exactly an integer value. Since parameters will rarely have an integer value in practice, estimates of parameters will almost always contain an error due to this fact alone. These errors will be called the intrinsic errors, because they are intrinsic to the estimation methods. They will always be present, even if the glottal pulses are perfectly clean glottal pulses, as was the case in these tests. The results presented in this section make it possible to estimate what the average intrinsic errors are. For the DE method the average error in the time parameters (except to) is about Ts/4 = 25 μsec., which is the theoretical average for randomly distributed values, while for Ee it is about 1% (see Figure 4). For the FE method the average error in the time parameters is less than 0.5 μsec., while the average error for Ee is about 0.01% (see Figures 5 and 6).

At this point it may seem more or less trivial that the LF routine has the non-integer property. However, this is not the case. For instance, the first version of our LF routine, i.e. the LF routine described in Lin (1990), did not have the non-integer property. The reason is that in Lin's routine all the input parameters are rounded off towards the nearest integer. Because Lin (1990) used his routine for speech synthesis, rounding off the input parameters was not a serious drawback for his application. For many implementations of a voice source model, rounding off the input seems a logical and practical operation.

Since in this first version of the LF routine the input parameters are rounded off towards the nearest integer, the resulting parameters do not change gradually but instead jump from one integer value to the next. The consequence is that the calculated rms error also jumps from one value to the next, because the shape of the generated LF pulse changes abruptly. Thus the error function has the shape of a staircase. A staircase-like error function is problematic for many optimization algorithms, especially for gradient algorithms. They often get stuck in a local minimum, because the gradient is zero on each stair. Although the simplex search algorithm generally comes closer to the global minimum than the tested gradient algorithms, the staircase-like error function also proved to be problematic for this algorithm. The explanation is the following. The simplex is formed by N+1 points in an N-dimensional space. During optimization the size of the simplex often diminishes gradually. At a certain point the distance between two points of the simplex can become smaller than the width of a stair, and then the simplex is usually stuck on that stair.

In the second version of the LF routine, oversampling was used within the LF routine. For instance, we tried oversampling by a factor of 10. Thus not only integer values can be estimated, but also 9 values between these integers. Therefore, the second version of the LF routine has the required non-integer property. However, the error function still has the shape of a staircase. Since the stairs are 10 times smaller (compared to the first version of the LF routine), the resulting estimates were better. Still, the optimization often did not come close to the global minimum.

Our conclusion is that oversampling can reduce the width of the stairs in the error function, and thus improve the estimates, but it can never take away the fundamental problem for optimization, i.e. that the error function is a staircase. That is why we tried to define an error function which changes gradually. The solution was simple: do not round off the input parameters to integer values; instead use the real values. Subsequently, the analytical expression of the LF model is used to calculate a continuous LF pulse. Finally, this continuous LF pulse is sampled. This is the third version of the LF routine. Lin (1990) did not use the analytical expression, most probably because it requires too much CPU time for his application (real-time speech synthesis). Instead he used an approximation procedure for which it is necessary that the parameters are rounded off towards an integer. For our application CPU time is not essential, and therefore we can use the analytical expression.

The third version of the LF routine has the required non-integer property. More important, for this version the shape of the calculated LF pulse changes gradually when the input parameters change gradually. Consequently, the error function is no longer a staircase but a gradual function. We will call this the 'gradual' property. It is clear that LF routines which have the gradual property also have the non-integer property, i.e. LF routines with the gradual property form a subset of the LF routines with the non-integer property. This gradual property turned out to be essential. An enormous improvement in the FE method was observed when the third version of the LF routine was used (compared to the first and second version). The reason is that a gradual error function is an enormous advantage for both simplex search and gradient algorithms. All results presented in this article are obtained with the third version of the LF routine.
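The difference between a staircase-like and a gradual error function can be illustrated with a toy example. In the sketch below a sampled raised-cosine pulse with a single position parameter stands in for the LF model; this pulse shape, the function names, and all numerical values are assumptions made for the illustration and are not the LF routine itself.

```python
import math

def toy_pulse(center, n_samples=40, width=6.0, round_input=False):
    """Hypothetical stand-in for an LF routine: a sampled raised-cosine pulse.

    With round_input=True the position is first rounded to the nearest sample
    (as in the first version of the routine); otherwise the analytical
    expression is evaluated for the real-valued position and then sampled.
    """
    if round_input:
        center = math.floor(center + 0.5)
    pulse = []
    for n in range(n_samples):
        x = (n - center) / width
        pulse.append(0.5 * (1.0 + math.cos(math.pi * x)) if abs(x) < 1.0 else 0.0)
    return pulse

def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

target = toy_pulse(20.3)                           # "test pulse" at a non-integer position
candidates = [20.0 + 0.05 * k for k in range(13)]  # 20.00, 20.05, ..., 20.60

staircase = [rms_error(toy_pulse(c, round_input=True), target) for c in candidates]
gradual = [rms_error(toy_pulse(c), target) for c in candidates]
```

With rounded input the error takes only two different values over the whole range of candidates, so an optimizer receives no information about the direction of the minimum. Without rounding the error decreases smoothly towards the candidate at 20.3.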

4.3. Low-pass filtering

4.3.1. Introduction

Before the glottal flow signals are parametrized, they are low-pass filtered at least once in all methods, viz. before A/D conversion. Often, they are low-pass filtered again after A/D conversion, usually to cancel the effects of formants that were not inverse filtered or to attenuate the noise component. The latter operation seems very sensible for DE methods, because in these methods high-frequency disturbances can influence the estimated parameters to a large extent. However, low-pass filtering changes the shape of the glottal flow signals, and, consequently, influences the estimated voice source parameters (Strik et al., 1992, 1993; Perkell et al., 1994; Alku and Vilkman, 1995; Strik, 1996a; Koreman, 1996).

An example of the distortion of a flow pulse caused by low-pass filtering is given in Figure 7. For low-pass filtering a convolution with a 19-point Blackman window was used. Shown are a base pulse before (solid) and after (dashed) low-pass filtering, and a model fit on the low-pass filtered pulse (dotted). Besides a picture of the three signals for the whole pitch period, some details around important events are also provided.

One can see in Figure 7 that low-pass filtering does influence the shape of the pulse. From this figure one can deduce that the change in shape can have a large impact on the estimates obtained by means of a DE method. This is most clear for the estimate of Ee, which will generally be too small. But also the estimates of the other parameters will be affected.

Low-pass filtering will also affect the estimates of an FE method. After low-pass filtering the shape of the pulse is changed. The fitting procedure will try to find an LF pulse that resembles the filtered pulse as closely as possible. This is done by minimizing the rms error which is a measure of the difference between the test pulse and the fitted LF pulse. The result is a fitted LF pulse which deviates from the original base pulse (see Figure 7). In Figures 7a and 7d it can be seen that the estimated values of Ee and te are too small, while the estimate of Ta is too large. Furthermore, one can see in Figure 7b that for this example pulse the estimate of to is too large, and in Figure 7c that the estimate of tp is a bit too small. For this example the errors in the estimates obtained by means of the FE method are: Err(Ee) = -11.2%, Err(to) = 46 μsec., Err(tp) = -28 μsec., Err(te) = -52 μsec., and Err(Ta) = 144 μsec.

Since low-pass filtering does affect the shape of the flow pulses, and consequently also the estimated parameters, it becomes important to study the effect of low-pass filtering on the parameter estimates. This will be done in the present section. The distortion of the glottal flow signals depends on a number of factors, like e.g. the type and the bandwidth of the low-pass filter, the frequency contents of the glottal flow signals, and the parametrization method used. We will study the effect of low-pass filtering for two parametrization methods (i.e. the DE method and the FE method), for glottal pulses with different frequency contents (i.e. the 11 base pulses), and for different values of the bandwidth of the low-pass filter.

Low-pass filtering is done by means of a convolution with a Blackman window. The bandwidth of this low-pass filter is varied by changing the length of the Blackman window (the longer the window, the smaller the bandwidth). This type of low-pass filtering was chosen because some preliminary tests showed that the error in the estimates induced by this filter was smaller than that of other tested filters. In part this can be explained by the fact that this low-pass filter does not have a ripple in its impulse response, while a ripple is present for many other low-pass filters. Therefore, for most other low-pass filters (including the generally used standard FIR filters) the estimation errors will be (much) larger than the errors presented below.

4.3.2. Method

The 11 base pulses were low-pass filtered by means of a convolution with a Blackman window of varying length. The length of the window was varied from 3 to 19 samples in steps of 2 samples (9 lengths). For the resulting 99 test pulses (11 base pulses x 9 window lengths) the parameters were estimated with the DE method and the FE method. For each length of the Blackman window the results of the 11 base pulses were pooled and the median values of the absolute errors were calculated. These median values are shown in Figures 8 and 9.

In the example provided in Figure 7 the test signal is low-pass filtered. An LF model is then fitted to the low-pass filtered test pulse. This seems the most obvious way to apply low-pass filtering, and will be called the first version of the FE method. However, there is an alternative (which will be called the second version of the FE method): apart from the test pulse one could also low-pass filter the fitted LF pulse. In that case, test pulse and fitted LF pulse are altered in a similar fashion. In this way we hope to achieve that the error in the estimated parameters (which is due to low-pass filtering) will be smaller than when only the test pulses are low-pass filtered. It is obvious that the same trick cannot be used in a DE method, because in this case the parameters are calculated directly from the (low-pass filtered) signal.
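The idea behind the second version can be illustrated with a toy example in which only the amplitude is estimated. The sketch below makes several assumptions that are not taken from the text: the pulse shape is a hypothetical sharp pulse rather than an LF pulse, the Blackman shape is sampled at interior points so that all taps are non-zero, the taps are normalized to unit sum, and the amplitude is fitted in closed form by least squares instead of by an iterative search.

```python
import math

def blackman_taps(length):
    # Assumption: Blackman shape sampled at interior points, normalized to unit sum.
    raw = [0.42 - 0.5 * math.cos(2 * math.pi * (n + 1) / (length + 1))
           + 0.08 * math.cos(4 * math.pi * (n + 1) / (length + 1))
           for n in range(length)]
    total = sum(raw)
    return [v / total for v in raw]

def lowpass(signal, length):
    """Convolve with a centred window; samples outside the signal count as zero."""
    taps = blackman_taps(length)
    half = length // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += tap * signal[j]
        out.append(acc)
    return out

def toy_shape(n_samples=80):
    """Hypothetical unit pulse: linear fall to -1 at sample 40, exponential return."""
    return [-max(0.0, (n - 20) / 20.0) if n <= 40 else -math.exp(-(n - 40) / 3.0)
            for n in range(n_samples)]

def fit_amplitude(target, model):
    """Amplitude a that minimizes the rms of (target - a * model)."""
    return sum(t * m for t, m in zip(target, model)) / sum(m * m for m in model)

EE_TRUE = 1024.0
shape = toy_shape()
results = {}
for length in range(3, 20, 2):   # window lengths 3, 5, ..., 19
    test_pulse = lowpass([EE_TRUE * v for v in shape], length)
    results[length] = {
        "de": -min(test_pulse),                                    # peak picking
        "fe1": fit_amplitude(test_pulse, shape),                   # model not filtered
        "fe2": fit_amplitude(test_pulse, lowpass(shape, length)),  # model filtered too
    }
```

In this toy setting peak picking and the fit with the unfiltered model both underestimate the amplitude, while the fit with the filtered model recovers it up to numerical precision, because test pulse and model are distorted in the same way.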

4.3.3. Results of the DE method

In Figure 7a one can see that low-pass filtering has most effect on the amplitude of the signal (Ee) and the shape of the return phase. Low-pass filtering causes the excitation peak to be smoother, and thus the estimate of Ee will be too small. Low-pass filtering also makes the return phase less steep, and therefore the estimate of Ta is too large. These effects are enhanced if the length of the Blackman window increases (i.e. if the bandwidth of the low-pass filter is reduced). Therefore, the median errors of Ee and Ta increase with increasing window length.

Low-pass filtering does not have much influence on tp (= the position of the zero crossing in dUg, see Figure 7c). Therefore, in the majority of the cases the error in the estimates remains within half a sample, and the median of the errors is zero.

Usually, low-pass filtering causes the estimates of te to be too small (see Figure 7d). If the window length is 3 or 5, most of the errors in te remain within half a sample, and thus the median error is zero. However, for larger window lengths the errors in te become larger. As a result the median error increases too.

Finally, the error in to remains constant, at the value of 820 μsec. (see also Figure 4). This can be explained with the help of Figures 7a and 7b. In these figures one can see that low-pass filtering has a large effect on the signal in the direct neighborhood of to, and that this effect diminishes away from to. If the threshold is chosen high enough (which is the case for the DE method used in the current research), low-pass filtering will not have much influence on this estimate of to.

Here, we would like to repeat a remark made in the introduction to this subsection. The low-pass filters used in these tests have ripple-free impulse responses, and are chosen because their effect on the estimates is smaller than that of most other low-pass filters. Therefore, it is most likely that for other low-pass filters the errors will be larger. Especially if a low-pass filter with a ripple in its impulse response is used, the errors for a DE method will be much larger (Strik, 1996a).

4.3.4. Results of the FE method

In Figure 8 not only the errors of the DE method are presented, but also those of the first version of the FE method (i.e. the version in which only the test pulses were low-pass filtered). If the median errors of the FE method are compared with those of the DE method, the following observations can be made:

The median errors are larger for tp for all window lengths, and for te for windows with a length of 3 or 5.

In all other cases the errors of the first version of the FE method are smaller than those of the DE method.

The fact that in certain cases the error of the DE method is smaller than the error of the FE method can be explained quite easily. If the effect of a studied phenomenon (here low-pass filtering) on an event (here tp or te) is such that the event is shifted by less than half a sample, the error with the DE method is zero, while that of the FE method is larger than zero. However, one should keep in mind that this is only the case for pulses in which all events coincide exactly with a sample position, as is the case with the test pulses. Only in that case does rounding off towards the nearest sample position mean rounding off towards the correct value.

In practice, events almost never fall exactly on a sample position. In section 4.2 we saw that this leads to substantial errors for the DE method, and much smaller errors for the FE method. Because we decided to study each phenomenon separately, the events of the test pulses used in this subsection coincide exactly with the corresponding sample point. Consequently, the errors of the DE method are sometimes smaller than those of the FE method. If the important events had been positioned randomly, the errors of the FE method would have been slightly larger while those of the DE method would have been substantially larger. In section 4.2 we estimated what the average intrinsic errors are. For the DE method this is about 1% and 25 μsec., and for the FE method 0.01% and 0.5 μsec. For a realistic comparison of the two methods these errors should be added to the errors found in this section. If this is done the average errors of the DE method are always larger than those of the FE method.

In Figure 9 the results of the two versions of the FE method are compared, i.e. the first version, in which only the test pulses are low-pass filtered (solid lines), and the second version, in which both test pulses and fitted LF pulses are low-pass filtered (dashed lines). Clearly, the errors for the second version are much smaller. The errors are not zero, as may seem to be the case from Figure 9, but they are extremely small. The largest error observed in the time parameters is 1 μsec, and the errors in Ee are always smaller than 0.03%.

4.3.5. Conclusions

In the previous sections we have explained why with the test pulses used the errors in the DE method are sometimes smaller than those of the first version of the FE method. However, for a realistic comparison the errors found in section 4.2 should be added. In this case the errors for the DE method are always larger than those of the first version of the FE method. In turn, these errors are larger than the errors of the second version of the FE method. Therefore, the conclusion is that the second version of the FE method is superior. Low-pass filtering both the test pulse and the fitted voice source model seems to be a very good way to reduce the error caused by low-pass filtering. Of course, it cannot be used in a DE method (as was already noted above).

5. Discussion and general conclusions

In the current article the estimation of voice source parameters from flow signals is studied. Since for this purpose DE and FE methods are used most often, a representative of each method is chosen (see sections 2.3 and 2.4, respectively). In section 4.3 a second version of the FE method is proposed, making a total of three estimation methods. The goals of the research are to find out what the advantages of each estimation method are, to get a better understanding of the problems involved in these estimation methods, and finally to determine which method performs best.

In order to do this an evaluation method is needed. In section 3.1 several evaluation methods are discussed. The evaluation method used in this study is best suited for our goals. In this evaluation method synthetic test material is generated by a production model. Subsequently, the same production model is used to re-estimate the synthesis parameters. A similar method was used by McGowan (1994) to evaluate the estimation of vocal tract parameters. This evaluation method was useful for his research, and it also turned out to be useful for our own research. Since in the present research we want to focus on the estimation of voice source parameters from voice source signals, without being distracted by the problems of inverse filtering, we use a voice source model (the LF model) as the production model. For other purposes a vocal tract model or a complete synthesizer could be used.
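The structure of this evaluation procedure (synthesize with known parameters, re-estimate, compare) can be sketched as follows. This is a minimal illustration under strong assumptions: `toy_source` is a hypothetical stand-in for a production model (it is not the LF model), `de_style_estimator` is a simplified estimator that picks the minimum sample, and only two parameters (Ee and te, the latter in samples) are considered.

```python
import math


def toy_source(ee, te, n=100, width=3.0):
    """Hypothetical production model (NOT the LF model): a dUg-like signal
    with one negative peak of depth ee at position te (in samples)."""
    return [-ee * math.exp(-0.5 * ((i - te) / width) ** 2) for i in range(n)]


def de_style_estimator(signal):
    """Estimate (ee, te) from the sample with the minimum value."""
    te_hat = min(range(len(signal)), key=lambda i: signal[i])
    return -signal[te_hat], float(te_hat)


def evaluate(estimator, true_params):
    """Synthesize with known parameters, re-estimate them, and return
    (relative error in ee in %, error in te in samples) for each pulse."""
    results = []
    for ee, te in true_params:
        ee_hat, te_hat = estimator(toy_source(ee, te))
        results.append((100.0 * (ee_hat - ee) / ee, te_hat - te))
    return results


# One pulse with te on a sample position, one with te between two samples.
errors = evaluate(de_style_estimator, [(1.0, 40.0), (1.0, 40.4)])
for ee_err, te_err in errors:
    print(f"Ee error: {ee_err:.3f} %   te error: {te_err:.2f} samples")
```

Because the input parameters are known exactly, the errors can be computed directly, and any estimator with the same interface can be passed to `evaluate` for comparison.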

The evaluation procedure proposed here is used to test the three estimation methods described in this article. For a quantitative evaluation the LF parameters Ee, to, tp, te, and Ta are used. Other parameters can be derived from these 5 LF parameters. These derived parameters are often used in other studies. However, in section 2.2 we argued that using derived parameters for evaluation has some disadvantages. Therefore, we prefer to use Ee, to, tp, te, and Ta themselves for evaluation (see also Strik, 1996a).

With this evaluation method the effect of several factors can be studied in isolation. For instance, in this article results for the factors shift, Ee, and low-pass filtering are presented. However, studying each factor in isolation is not enough because some factors can interact. For example, both low-pass filtering and high-frequency disturbances present in the voice source signals (e.g. noise or formant ripple) cause errors in the estimated voice source parameters. But the errors due to the high-frequency disturbances can be reduced by using an appropriate low-pass filter. Which low-pass filter is optimal for this purpose depends on a number of factors, such as the estimation method and the voice source model used, and the kind and magnitude of the disturbances. With this evaluation method the effect of factors in combination can also be studied. Thus, e.g., the optimal low-pass filter for a given situation can be determined experimentally.

With the proposed evaluation method the effect of the factor low-pass filter was studied. Low-pass filtering is probably used in all methods in which voice source parameters are estimated from inverse filtered signals. Although parametrization of inverse filtered signals has been done in many studies for almost 40 years now (i.e. since Miller, 1959), it has only recently been noted that low-pass filtering can influence the estimated voice source parameters (Strik et al., 1992, 1993; Perkell et al., 1994; Alku and Vilkman, 1995; Strik, 1996a; Koreman, 1996).

In Strik et al. (1992, 1993) we mentioned that low-pass filtering changes the shape of the glottal flow signals, and consequently the estimated voice source parameters. We concluded that Ee and the return phase (i.e. Ta) are affected most by low-pass filtering (Strik et al., 1992). This conclusion is supported by the results presented in section 4.3. Since the amount of change cannot easily be determined for natural speech, we suggested that a correction which is based on calculations for synthetic speech be used (Strik et al., 1992, 1993).

Perkell et al. (1994) describe that in a first version of their data analysis procedure they used a low-pass filter "with a roll-off that began at 700 Hz and achieved 40 dB of attenuation by 1350 Hz" (ibid, p. 697). Subsequently, this procedure was used for some years to analyze large amounts of data (see references to other studies in Perkell et al. 1994). In a second version of the data analysis procedure somewhat less excessive low-pass filters were used. Voice source parameters estimated with the two versions of the software were compared, and differences were observed. So, more or less by accident, they observed that (the amount of) low-pass filtering influences the estimates. Indeed, for natural speech the effect of low-pass filtering cannot easily be observed, if only because for natural speech the correct voice source parameters are not known.

Perkell et al. (1994) concluded that the effect of the excessive low-pass filtering in the data obtained with the first version of the software appears to be confined to mfdr (which is equal to our parameter Ee). Indeed, the largest percentage differences were observed for mfdr. However, low-pass filtering will not only affect the estimates of Ee (their mfdr) but also those of all other voice source parameters. Probably, the evaluation method used by Perkell et al. (1994) was not sensitive enough to observe the (smaller) differences in the other parameters.

To study the effect of low-pass filtering Alku and Vilkman (1995) used a method which was similar to that of Perkell et al. (1994), in the sense that voice source parameters obtained with two different low-pass filters were compared. First voice source parameters were estimated for a low-pass filter with a bandwidth of 4 kHz. These voice source parameters were used as the reference values. Subsequently, the voice source parameters were estimated again for low-pass filters with a bandwidth of 2 and 1 kHz. The resulting values were compared to the reference values. Strik (1996a) showed that in only three cases was the measured difference larger than the standard deviation (in all cases for tret, the length of the return phase). The differences found for Amin (our Ee) were always much smaller than the standard deviation. Our conclusion is that the evaluation method used by Perkell et al. (1994) and Alku and Vilkman (1995) is not optimal for studying the effect of low-pass filtering.

Koreman (1996) uses low-pass filters with small bandwidths (varying from 200 to 1500 Hz) in his data analysis method. He notes that low-pass filtering reduces the value of Ee, and concludes that low-pass filtering does not affect the relative amplitude of Ee (ibid, p. 60). This is certainly not the case. The amount of decrease in Ee due to low-pass filtering does depend on a lot of factors, an important factor being the shape of the glottal pulse. To illustrate this, let us take two pitch periods of dUg with the same Ee. The first one has a sharp negative peak, the other is more sinusoidal. The reduction in Ee due to low-pass filtering will be larger for the first pulse than for the second. Furthermore, if a low-pass filter with a ripple in its impulse response is used (like the standard FIR filters used by Koreman, 1996) the resulting low-pass filtered signals will also contain a ripple (see also Strik, 1996a). The estimates of many voice source parameters will be influenced by this ripple, and in general the error in the estimates is larger for the first pulse with a sharp peak. Since the shape of the glottal pulse changes continuously, the errors in the voice source parameters generally are not constant.
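The dependence of the reduction in Ee on pulse shape can be illustrated with a small sketch. The assumptions are ours: the two pulses are toy Gaussian-shaped negative peaks with the same Ee (not LF pulses), and the low-pass filter is a convolution with a Blackman window of 11 samples, normalized to unit sum, using the textbook window definition.

```python
import math


def blackman(m):
    """Blackman window of odd length m, normalized to unit sum (unity gain at DC)."""
    w = [0.42 - 0.5 * math.cos(2 * math.pi * k / (m - 1))
         + 0.08 * math.cos(4 * math.pi * k / (m - 1)) for k in range(m)]
    total = sum(w)
    return [x / total for x in w]


def smooth(signal, window):
    """Convolve with a symmetric window; samples outside the signal count as zero."""
    half = len(window) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, wk in enumerate(window):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += wk * signal[j]
        out.append(acc)
    return out


def negative_peak(width, n=200, centre=100):
    """Toy pulse with Ee = 1; a small width gives a sharp peak."""
    return [-math.exp(-0.5 * ((i - centre) / width) ** 2) for i in range(n)]


def ee_reduction(width, window_len=11):
    """Reduction of Ee (in %) caused by low-pass filtering the toy pulse."""
    filtered = smooth(negative_peak(width), blackman(window_len))
    return 100.0 * (1.0 + min(filtered))


sharp = ee_reduction(width=2.0)    # sharp negative peak
broad = ee_reduction(width=20.0)   # broad, more sinusoidal peak
print(f"sharp: {sharp:.1f} %   broad: {broad:.1f} %")
```

With the same filter and the same Ee, the sharp pulse loses a much larger fraction of its peak amplitude than the broad one, so the relative error in Ee is not constant when the pulse shape varies.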

To sum up, low-pass filtering changes the shape of the glottal flow signals, and thus affects the estimates of the voice source parameters. The error due to low-pass filtering does depend on a lot of factors, e.g. the shape of the glottal flow signal, and the low-pass filter and the estimation method used. So even for a given low-pass filter and estimation method (i.e. within one experiment) the error is not constant, because the shape of the glottal flow signal is generally not constant. Furthermore, for a low-pass filter with a ripple in its impulse response (like the often used standard FIR filters) the average errors will be larger than for the low-pass filter used in this study (a convolution with a Blackman window).

Before we draw our conclusions regarding the comparison of the three estimation methods, we first discuss some aspects of the FE methods used in this study. The first aspect is the voice source model used in the FE method, in our case the LF model. In the literature several voice source models have been described (see e.g. Rosenberg, 1971; Fant, 1979; Ananthapadmanabha, 1984; Fant et al., 1985; Fujisaki and Ljungqvist, 1986; Funaki and Mitome, 1990; Lobo and Ainsworth, 1992; Hong et al., 1994; Cummings and Clements, 1995). All voice source models for which an analytical expression exists can be used with the proposed FE method to parametrize either Ug or dUg. In the program there is a subroutine which calculates the fitted signal. The model fit is now calculated with the LF model, but this part can easily be substituted by the analytical expression of any voice source model. Furthermore, any number of voice source parameters can be used for parametrization. However, increasing the number of parameters makes the optimization problem (i.e. the error space) more complex, and thus the probability that the fitting procedure gets stuck in a local minimum is increased.

Using a voice source model for parametrization has some advantages, one of them being the possibility that the estimated voice source parameters can subsequently be used for speech synthesis. Of course, for FE methods a voice source model is mandatory. However, probably the most important disadvantage of a voice source model used for this purpose is that it cannot describe all the observed glottal flow signals. Although the LF model is capable of describing many different glottal pulse shapes, it cannot describe all details. For instance, it has been noted that there often is a second (smaller) excitation after the main excitation (Cranen, 1987; Hertegård, 1994; Koreman, 1996). The LF model cannot describe this second excitation, and therefore is not suitable to study this phenomenon. Whether a voice source model is suitable for research depends on the goals of this research. Above we explained that with our FE method it is possible to use many voice source models. The reasons for choosing the LF model in this study are given in section 2.2.

The second aspect of the FE method we want to discuss is the non-integer property and the gradual property. In practice the values of voice source parameters will not be exactly integer, i.e. they can have all kinds of non-integer values. This fact alone will bring about a substantial error in estimates obtained with a DE method, because a DE method can only estimate integer values. In section 4.2 we estimated these average errors to be about 1% for Ee and 25 μsec for the time parameters.

Therefore, our goal was an FE method that could also estimate non-integer values. In this way we would e.g. be able to estimate moments between sample positions and thus reduce the error in the estimates. This was possible with the second and third versions of our LF routine (described in section 4.2.5), which both have the non-integer property. However, another property of the LF routine turned out to be more important, i.e. the gradual property. The reason is that with an LF routine that has the gradual property it is not only possible to estimate instants between sample positions, but, more importantly, the optimization usually comes closer to the global minimum. This finding can probably be generalized to other FE methods and/or other voice source models: a reduction in the errors can be achieved if a routine is used (for calculation of the voice source signal) which has the gradual property.

Milenkovic (1993) also describes an estimation method which has the non-integer property. This method was not used in our research because it has some disadvantages compared to the FE methods used here. First of all, with the method of Milenkovic (1993) only to and te can be estimated, while with our FE method all parameters can be estimated. Furthermore, to and te are calculated with an iterative full search procedure. For two parameters this is feasible. However, for a larger number of parameters the number of combinations that should be tested grows exponentially, which makes this method less attractive.

The third aspect of the FE method which will be discussed is that no anti-aliasing low-pass filter is used. In the LF routine a continuous LF pulse is first calculated, which is then sampled with the same sampling frequency (Fs) as the flow signal which has to be parametrized (here, 10 kHz). We did not use an anti-alias low-pass filter here, because we wanted to be able to study each factor in isolation. If we had used an anti-alias low-pass filter, this factor (and its effect on the estimated voice source parameters) would always have been present, thus making it impossible to study it independently of other factors.

If no anti-aliasing low-pass filter is used, aliasing effects can be present in the digital signals. Careful inspection showed that this was not the case for the LF pulses used in this study. The spectra of the dUg signals on average have a slope of -6 dB/oct. The fundamental is at 100 Hz, so at 5 kHz (i.e. Fs/2) the attenuation usually is more than 30 dB. Using an Fs of 10 kHz made it possible to study the effect of the factor low-pass filter independently of other factors (such as shift and Ee).
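The attenuation figure follows from simple arithmetic, which can be checked with the sketch below. It assumes an idealized spectrum that falls by exactly 6 dB per octave above the fundamental. The function name is hypothetical.

```python
import math


def attenuation_db(f_hz, f0_hz=100.0, slope_db_per_oct=6.0):
    """Attenuation relative to the level at f0 for an idealized spectral slope."""
    octaves = math.log2(f_hz / f0_hz)  # number of octaves between f0 and f
    return slope_db_per_oct * octaves


nyquist_attenuation = attenuation_db(5000.0)  # 5 kHz = Fs/2 for Fs = 10 kHz
print(f"{nyquist_attenuation:.1f} dB")
```

From 100 Hz to 5 kHz is log2(50), i.e. about 5.6 octaves, so the idealized attenuation at Fs/2 is a little under 34 dB, in line with the "more than 30 dB" mentioned above.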

If aliasing is a problem (e.g. because Fs is smaller than 10 kHz), an anti-alias low-pass filter has to be used. The most straightforward way to do this is to sample the continuous LF signal first with a sampling frequency Fs, and next use a digital low-pass filter with a bandwidth smaller than Fs/2. However, in that case the non-integer property is lost, and the error function (which quantifies the difference between the LF signal and the flow signal) becomes a staircase. The result is that the average error in the estimated voice source parameters becomes larger, as mentioned above. A somewhat better solution is to oversample the LF signal before digital low-pass filtering. By oversampling also non-integer values can be estimated. Furthermore, the stairs of the staircase become smaller. Consequently, the average error in the estimated voice source parameters also becomes smaller. Probably the best solution would be to use the analytic anti-alias low-pass filter proposed by Milenkovic (1993), which can be applied in continuous time. In this way the gradual property is preserved, and the error function remains a function that changes gradually (instead of being a staircase).
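The difference between a staircase and a gradually changing error function can be illustrated with a toy example. The assumptions are ours: `pulse` is a hypothetical continuous pulse (not the LF model) that is evaluated at the sample positions, and the only free parameter is the event position te in samples. In the "gradual" case the model is evaluated for the non-integer te itself, whereas in the "staircase" case te is first rounded to a sample position.

```python
import math


def pulse(te, n=100, width=3.0):
    """Toy continuous pulse evaluated at integer sample positions;
    te (in samples) may be non-integer."""
    return [-math.exp(-0.5 * ((i - te) / width) ** 2) for i in range(n)]


def sq_error(a, b):
    """Sum of squared differences between two signals of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


target = pulse(40.4)  # signal to be parametrized; true te lies between samples
candidates = [40.0 + 0.1 * k for k in range(11)]  # 40.0, 40.1, ..., 41.0

# Routine with the gradual property: the error changes smoothly with te.
gradual = [sq_error(pulse(te), target) for te in candidates]

# Routine without it: te is rounded first, so the error is constant per sample.
staircase = [sq_error(pulse(round(te)), target) for te in candidates]

best = candidates[gradual.index(min(gradual))]
print(f"best te (gradual): {best:.1f}")
print(f"distinct error values (staircase): {len(set(staircase))}")
```

In the staircase case all candidates between two sample positions map to one of only two error values, so the optimization cannot distinguish between them. In the gradual case the minimum lies at the true, non-integer position.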

The comparison of DE and FE methods revealed what the pros and cons of each method are. DE methods have the advantage that they are mathematically simple, and require little CPU time. However, DE methods also have many disadvantages. First of all, only integer values can be estimated. Consequently, the intrinsic errors are large. The quality of the estimates depends on how well the corresponding landmarks can be determined. For instance, for to this is problematic because the flow signals generally change slowly during the beginning of opening. Therefore, it is difficult to determine the moment at which opening begins and the error in the estimates of to is generally large. Since the exact beginning of opening cannot easily be determined, a threshold function is generally used. However, our results showed that using a threshold function yields large errors in the estimates of to. Generally, the error in estimates of tc is large too, as the flow signals also change slowly around tc. Furthermore, for parameters for which a clear corresponding landmark is not present in the flow signals, estimates cannot (easily) be obtained with a DE method. An example of such a parameter is Ta, which describes the return phase. In DE methods Ta is generally not estimated. Finally, disturbances present in the signals (like noise and ripple) often will change the position of a minimum, maximum, or zero crossing, and thus result in (large) errors in the estimates obtained with a DE method.

An FE method has many advantages compared to a DE method. With an FE method it is possible to estimate non-integer values, making the intrinsic errors smaller. In fact, errors of a similar magnitude were found for estimates of integer and non-integer parameter values. Furthermore, estimates of all parameters of a voice source model can be obtained, i.e. not only for parameters related to clearly distinguishable events (as was the case for a DE method). The optimal model fit is determined for the whole pitch period, which makes the method more robust against disturbances present in the flow signals. Finally, in an FE method it is relatively easy to exchange voice source models, which is certainly not the case for a DE method. A disadvantage of an FE method is that each voice source model has its limitations, the most important one probably being that the voice source model cannot model all glottal flow pulses that occur in practice. However, as voice source models can be easily exchanged, this is not a major drawback.

In the current study two aspects were examined in detail. As parameters rarely have an integer value, we first estimated what the resulting intrinsic errors are for the two methods. For the DE method they turned out to be much larger than for the FE method. These intrinsic errors will always be present. Therefore, when the errors due to other factors are studied independently (i.e. with all input parameters having an integer value), the errors found for these factors should be increased by the intrinsic errors in order to make a realistic comparison possible. When this is done for the factor low-pass filtering, the arrangement in order of decreasing average error is: DE method, first version of the FE method, second version of the FE method. The factor low-pass filter was chosen because a low-pass filter is probably used in all methods in which voice source parameters are estimated from inverse filtered signals. Consequently, the resulting errors will be present in the estimated voice source parameters.

The conclusion which can be drawn on the basis of the tests presented in this article is that the second version of the FE method is superior. However, the effect of more single factors and factors in combination should be studied to get a more thorough understanding of the intricacies of the various parametrization methods.

Note that in several ways this is a best-case study. First of all, because all details of the generation of the test signals are explicitly known, as was already mentioned in section 3.1. Second, because the test signals are clean LF pulses, and apart from the influence of low-pass filtering contain none of the other disturbances that are generally present in natural speech. And third, because for a standard FIR filter, which is used most often as a low-pass filter, the resulting average errors are larger than for the low-pass filter used in this study. Consequently, when estimation methods are used to parametrize inverse filtered natural speech signals, the errors in the resulting parameters will generally be (much) larger.

In the introduction we already noted that DE methods and FE methods are the methods used most often. Therefore, and because they can be made completely automatic, we have compared representatives of both methods. Before we started to compare these estimation methods, we first tried to improve each estimation method as much as possible. The evaluation method proposed in section 3.1 is very suitable for this purpose. This evaluation method makes it possible to perform numerous different tests relatively easily and fast. However, during improvement of the DE method we never changed the basic algorithm. The reason is that we wanted to use the method as it is described in Alku and Vilkman (1995). We only tried to eliminate as many (obvious) errors as possible, i.e. we made the implementation of the DE method more robust. It is likely that the DE method can be improved, e.g. by using interpolation or by trying to reconstruct the analog signal from the discrete signal (this can be done with the use of sinc functions). However, this is generally not done. Since we wanted to use a representative of an often used method, we also did not do it. Furthermore, we are convinced that it is very unlikely that the improvement in the DE method will be such that its final performance is better than that of the FE methods (especially, that of the second version of the FE method).
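As an illustration of the kind of interpolation that could be added to a DE method, the sketch below refines the position and value of a minimum by fitting a parabola through the minimum sample and its two neighbours. This is only one possible form of interpolation and was not tested in this study. The function name and the toy pulse are hypothetical.

```python
import math


def parabolic_refine(signal, i):
    """Refine an extremum at index i with a parabola through samples i-1, i, i+1.
    Returns (refined position in samples, refined value)."""
    y0, y1, y2 = signal[i - 1], signal[i], signal[i + 1]
    curvature = y0 - 2.0 * y1 + y2
    if curvature == 0.0:  # three samples on a straight line: nothing to refine
        return float(i), y1
    offset = 0.5 * (y0 - y2) / curvature
    value = y1 - 0.25 * (y0 - y2) * offset
    return i + offset, value


# Toy pulse (not an LF pulse) with its negative peak between two samples.
true_te, true_ee = 40.4, 1.0
signal = [-true_ee * math.exp(-0.5 * ((i - true_te) / 3.0) ** 2) for i in range(100)]

coarse_te = min(range(1, len(signal) - 1), key=lambda i: signal[i])  # integer estimate
refined_te, refined_min = parabolic_refine(signal, coarse_te)

print(f"coarse te: {coarse_te}   refined te: {refined_te:.3f}")
print(f"coarse Ee: {-signal[coarse_te]:.4f}   refined Ee: {-refined_min:.4f}")
```

For this clean toy pulse the refined estimates lie much closer to the true values than the integer estimates. How such a refinement behaves for signals with noise or ripple would have to be tested separately.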

The final topic we want to discuss is how the proposed estimation methods can be used to estimate voice source parameters for natural speech. The answer is straightforward: first use inverse filtering to obtain estimates of the glottal flow signals, and next apply the estimation methods. In Strik and Boves (1992b) and Strik et al. (1992) we showed that this is possible for previous versions of the FE method. We only have to exchange the previous version of the FE method with the new improved version. The best solution would be to take the second version of the FE method, and in the error routine use the same low-pass filter as used during the inverse filter procedure.
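The idea of using the same low-pass filter in the error routine can be sketched with a toy example. The assumptions are ours: the model is a hypothetical Gaussian-shaped pulse (not the LF model), the filter is a Blackman window of 11 samples with unit sum, the position te is taken as known, and only the amplitude Ee is fitted by least squares.

```python
import math


def blackman(m):
    """Blackman window of odd length m, normalized to unit sum."""
    w = [0.42 - 0.5 * math.cos(2 * math.pi * k / (m - 1))
         + 0.08 * math.cos(4 * math.pi * k / (m - 1)) for k in range(m)]
    total = sum(w)
    return [x / total for x in w]


def smooth(signal, window):
    """Convolve with a symmetric window; samples outside the signal count as zero."""
    half = len(window) // 2
    return [sum(wk * signal[i + k - half]
                for k, wk in enumerate(window)
                if 0 <= i + k - half < len(signal))
            for i in range(len(signal))]


def model(ee, te=60.0, n=120, width=2.0):
    """Toy voice source pulse (NOT the LF model), linear in ee."""
    return [-ee * math.exp(-0.5 * ((i - te) / width) ** 2) for i in range(n)]


def fit_ee(observed, unit_model):
    """Least-squares amplitude that makes ee * unit_model closest to observed."""
    num = sum(o * u for o, u in zip(observed, unit_model))
    den = sum(u * u for u in unit_model)
    return num / den


window = blackman(11)
true_ee = 0.8
observed = smooth(model(true_ee), window)  # low-pass filtered "flow signal"

ee_first = fit_ee(observed, model(1.0))                   # model not filtered
ee_second = fit_ee(observed, smooth(model(1.0), window))  # model filtered as well

print(f"true: {true_ee}   first version: {ee_first:.3f}   second version: {ee_second:.3f}")
```

When the model is not filtered, the fitted amplitude is biased downwards because the fit has to absorb the effect of the filter. When the model passes through the same filter, the true amplitude is recovered for this toy case.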

Acknowledgments

The research of Dr. H. Strik has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences. I would like to thank Loe Boves, Bert Cranen and Jacques Koreman for their comments on a previous version of this paper.

Footnotes

1 This paper is available at http://lands.let.ru.nl/~strik/publications/. It is an elaborated and improved version of Strik (1996b).

2 The term 'correct voice source parameters' will be used for the voice source parameters which would be obtained if the whole estimation method (i.e. inverse filtering and parametrization of the resulting flow signals) were perfect. Consequently, if a linear source-filter model is used for speech synthesis, the 'correct voice source parameters' are equal to the voice source parameters used as input for the voice source model during synthesis.

3 This idea was suggested to me by Bert Cranen.

References

Alku, P. (1992). "An automatic method to estimate the time-based parameters of the glottal pulseform," Proc. Int. Conf. on Acoustic Speech Signal Process., San Francisco, USA, 2, 29-32.

Alku, P., Strik, H., and Vilkman, E. (to appear). "Parabolic Spectral Parameter - A new method for quantification of the glottal flow," Accepted for publication in Speech Communication.

Alku, P., and Vilkman, E. (1994). "Estimation of the glottal pulseform based on discrete all-pole modeling," Proc. Int. Conf. Spoken Language Process., Yokohama, Jpn., 3, 1619-1622.

Alku, P., and Vilkman, E. (1995). "Effects of bandwidth on glottal airflow waveforms estimated by inverse filtering," J. Acoust. Soc. Am. 98, 763-767.

Alku, P., and Vilkman, E. (1996). "Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering," Speech Communication 18, 131-138.

Ananthapadmanabha, T.V. (1984). "Acoustic analysis of voice source dynamics," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 2-3, 1-24.

Carlson, R., Fant, G., Gobl, C., Granstrom, B., Karlsson, I., and Lin, Q. (1989). "Voice source rules for text-to-speech synthesis," Proc. Int. Conf. on Acoustic Speech Signal Process, Glasgow, Scotland, 1, 223-226.

Childers, D.G., and Ahn, C. (1995). "Modeling the glottal volume-velocity waveform for three voice types," J. Acoust. Soc. Am. 97, 505-519.

Cranen, L.I.J. (1987). "The acoustic impedance of the glottis: Measurements and Modelling," Ph.D. thesis, Univ. of Nijmegen.

Cummings, K.E., and Clements, M.A. (1995). "Analysis of the glottal excitation of emotionally styled and stressed speech," J. Acoust. Soc. Am. 98, 88-98.

Darsinos, V., Galanis, D., and Kokkinakis, G. (1995). "A method for fully automatic analysis and modelling of voice source characteristics," Proc. ESCA 4th European Conf. On Speech Communication and Technology, Madrid, Spain, 1, 413-416.

De Veth, J., Cranen, B., Strik, H., and Boves, L. (1990). "Extraction of control parameters for the voice source in a text-to-speech system," Proc. Int. Conf. on Acoustic Speech Signal Process., 1, 301-304.

Ding, W., and Kasuya, H. (1996). "A novel approach to the estimation of voice source and vocal tract parameters from speech signals," Proc. Int. Conf. Spoken Language Process., Philadelphia, USA, 2, 1257-1260.

Fant, G. (1960). Acoustic Theory of Speech Production (Mouton, The Hague), 2nd ed., 1970.

Fant, G. (1979). "Glottal source and excitation analysis," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 1, 70-85.

Fant, G. (1993). "Some problems in voice source analysis," Speech Communication, 13, 7-22.

Fant, G., Liljencrants, J., and Lin, Q. (1985). "A four-parameter model of glottal flow," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 4, 1-13.

Flanagan, J.L. (1965). Speech Analysis, Synthesis and Perception (Springer-Verlag, Berlin), 2nd ed., 1972.

Fritzell, B. (1992). "Inverse filtering," Journal of Voice, 6, 111-114.

Fujisaki, H., and Ljungqvist, M. (1986). "Proposal and evaluation of models for the glottal source waveform," Proc. Int. Conf. on Acoustic Speech Signal Process., 4, Tokyo, Jpn., 1605-1608.

Funaki, K., and Mitome, Y. (1990). "A speech analysis method based on a glottal source model," Proc. Int. Conf. Spoken Language Process., Kobe, Jpn., 1, 45-48.

Gauffin, J., and Sundberg, J. (1980). "Data on the glottal voice source behavior in vowel production," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 2-3, 61-70.

Gauffin, J., and Sundberg, J. (1989). "Spectral correlates of glottal voice source waveform characteristics," J. of Speech and Hearing Research, 32, 556-565.

Gobl, C. (1988). "Voice source dynamics in connected speech," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 1, 123-159.

Gobl, C., and Ní Chasaide, A. (1988). "The effects of adjacent voiced/voiceless consonants on the vowel voice source: a cross language study," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 2-3, 23-39.

Hertegård, S. (1994). "Vocal fold vibrations as studied with flow inverse filtering," Ph.D. thesis, Univ. of Stockholm.

Hertegård, S., and Gauffin, J. (1992). "Acoustic properties of the Rothenberg mask," Speech Transmiss. Lab. Q. Prog. Stat. Rep., 2-3, 9-18.

Holmberg (1993). "Aerodynamic measurements of normal voice," Ph.D. thesis, Univ. of Stockholm.

Holmberg, E.B., Hillman, R.E., Perkell, J.S., and Gress, C. (1994). "Relationships between intra-speaker variation in aerodynamic measures of voice production and variation in SPL across repeated recordings," J. Speech Hear. Res. 37, 484-495.

Hong, S., Kang, S., and Ann, S. (1994). "Voice parameter estimation using sequential SVD and wave shaping filter bank," Proc. Int. Conf. Spoken Language Process., Yokohama, Jpn., 3, 1059-1062.

Jansen, J., Cranen, B., and Boves, L. (1991). "Modelling of source characteristics of speech sounds by means of the LF-model," Proceedings of Eurospeech, Genova, Italy, 1, 259-262.

Karlsson, I. (1990). "Voice source dynamics of female speakers," Proc. Int. Conf. Spoken Language Process., Kobe, Jpn., 1, 69-72.

Karlsson, I. (1992). "Analysis and synthesis of different voices with emphasis on female speech," Ph.D. dissertation, KTH, Stockholm.

Koreman, J. (1996). "Decoding linguistic information in the glottal airflow," Ph.D. dissertation, Univ. of Nijmegen.

Lin, Q. (1990). "Speech production theory and articulatory speech synthesis," Ph.D. dissertation, KTH, Stockholm.

Lobo, A.P., and Ainsworth, W.A. (1992). "Evaluation of a glottal ARMA model of speech production," Proc. Int. Conf. on Acoustic Speech Signal Process., San Francisco, USA, 2, 13-16.

McGowan (1994). "Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests," Speech Communication 14, 19-48.

Milenkovic (1993). "Voice source model for continuous control of pitch period," J. Acoust. Soc. Am. 93, 1087-1096.

Miller, R.L. (1959). "Nature of the Vocal Cord Wave," J. Acoust. Soc. Am. 31, 667-677.

Nelder, J.A., and Mead, R. (1964). "A simplex method for function minimization," The Computer Journal, 7, 308-313.

Ní Chasaide, A., and Gobl, C. (1990). "Linguistic and paralinguistic variation in the voice source," Proc. Int. Conf. Spoken Language Process., Kobe, Jpn., 1, 85-88.

Ní Chasaide, A., and Gobl, C. (1993). "Contextual variation of the vowel voice source as a function of adjacent consonants," Language and Speech, 36, 303-330.

Perkell, J.S., Hillman, R.E., and Holmberg, E.B. (1994). "Group differences in measures of voice production and revised values of maximum airflow declination rate," J. Acoust. Soc. Am. 96, 695-698.

Riegelsberger, E.L., and Krishnamurthy, A.K. (1993). "Glottal source estimation: methods of applying the LF-model to inverse filtering," Proc. Int. Conf. on Acoustic Speech Signal Process., Minneapolis, USA, 2, 542-545.

Rosenberg, A.E. (1971). "Effect of glottal pulse shape on the quality of natural vowels," J. Acoust. Soc. Am. 49, 583-590.

Rothenberg, M. (1973). "A new inverse filtering technique for deriving the glottal airflow during voicing," J. Acoust. Soc. Am. 53, 1632-1645.

Rothenberg, M. (1977). "Measurement of airflow in speech," J. of Speech and Hearing Research, 20, 155-176.

Schoentgen, J. (1990). "Non-linear signal representation and its application to the modelling of the glottal waveform," Speech Communication 9, 189-201.

Schoentgen, J. (1995). "Dynamic Models of the Glottal Pulse," in Levels in Speech Communication: Relations and Interactions, a tribute to Max Wajskop, edited by C. Sorin, J. Mariani, H. Meloni, and J. Schoentgen (Elsevier, Amsterdam), 249-266.

Strik, H. (1994). "Physiological control and behaviour of the voice source in the production of prosody," Ph.D. dissertation, Univ. of Nijmegen.

Strik, H. (1996a). "Comments on "Effects of bandwidth on glottal airflow waveforms estimated by inverse filtering" [J. Acoust. Soc. Am. 98, 763-767 (1995)]," J. Acoust. Soc. Am. 100, 1246-1249.

Strik, H. (1996b). "Testing two automatic methods for estimation of voice source parameters," in Proceedings of the Department of Language and Speech, edited by H. Strik, N. Oostdijk, C. Cucchiarini, & P.A. Coppen, Vol. 19, pp. 105-127, Nijmegen, The Netherlands.

Strik, H., and Boves, L. (1992a). "Control of fundamental frequency, intensity and voice quality in speech," Journal of Phonetics 20, 15-25.

Strik, H., and Boves, L. (1992b). "On the relation between voice source parameters and prosodic features in connected speech," Speech Communication 11, 167-174.

Strik, H., and Boves, L. (1994). "Automatic estimation of voice source parameters," Proc. Int. Conf. Spoken Language Process., Yokohama, Jpn., 1, 155-158.

Strik, H., Cranen, B., and Boves, L. (1993). "Fitting an LF-model to inverse filter signals," Proc. of the 3rd European Conf. on Speech Technology, Berlin, Germany, 1, 103-106.

Strik, H., Jansen, J., and Boves, L. (1992). "Comparing methods for automatic extraction of voice source parameters from continuous speech," Proc. Int. Conf. Spoken Language Process., Banff, Canada, 1, 121-124.

Strube, H.W. (1974). "Determination of the instant of glottal closure from the speech wave," J. Acoust. Soc. Am. 56, 1625-1629.

Sundberg, J., and Gauffin, J. (1979). "Waveforms and spectrum of the glottal voice source," in Frontiers of speech communication research, Festschrift for Gunnar Fant, edited by B. Lindblom and S. Öhman (Academic Press, London), 301-320.

Last updated on 22-05-2004