Comparing methods for automatic extraction of voice source parameters from continuous speech.

home > publications > a16

Contact

Comparing methods for automatic extraction of voice source parameters from continuous speech.
H. Strik, J. Jansen & L. Boves
Proceedings ICSLP-92, Banff, Vol. 1, pp. 121-124.

ABSTRACT

Two methods are presented for automatic calculation of the voice source parameters from continuous speech. Both methods are used to calculate the voice source parameters for natural speech. However, for natural speech no objective test procedure seems available. Therefore, both methods were also tested on synthetic speech.

INTRODUCTION

Modern text-to-speech systems produce speech that is intelligible, but not quite natural. A substantial improvement of the naturalness of synthetic speech can probably be achieved by the use of a properly controlled voice source model. To derive rules for voice source parameters large amounts of data are required. Extracting voice source parameters by hand is time consuming, subjective and thus probably not reproducible. Automatic extraction of voice source parameters from continuous speech is also far from trivial. Our aim is to develop methods for automatic extraction of voice source parameters from continuous speech.In this article we propose and test two generic automatic methods.

In continuous speech the glottal parameters may change from period to period. Thus, the procedure must be pitch-synchronous. In our case, estimates of the frequency response of the vocal tract are based on an analysis of the closed glottis interval only. It is well known that analysis over such short intervals can yield wildly fluctuating results if, for instance, the analysis window is shifted or extended by just one or two samples. Because the causes behind these fluctuations are not understood, it is not possible to determine the optimal window location and length by simple automatic procedures. Therefore, we estimate the vocal tract transfer function for a number of window positions and lengths for each pitch period. The two methods differ in the way in which these multiple estimates are used. Method 1 computes an inverse filtered flow waveform for each individual estimate, and then computes the median value for the parameters describing the glottal waveforms. Method 2 uses the multiple estimates to obtain a single optimal estimate of the vocal tract transfer function, that is subsequently used to obtain a single optimal glottal flow waveform.

Both methods are used to derive voice source parameters from natural speech. But for natural speech it is difficult to evaluate the performance objectively, because the true glottal flow waveforms are not known. The only possible evaluation is a subjective one, viz. checking that the voice source parameters behave the way they are expected to behave. As long as there is no objective test procedure for natural speech, an objective evaluation can only be done with synthetic speech. In this article we will first test if both methods give plausible results for natural speech. Then we will perform an objective, quantitative test of both methods with synthetic speech.

MATERIAL

Natural speech

To study voice source characteristics data were obtained for four male subjects. For the current article only the data of one subject were used. The speech signal was transduced by a condensor microphone (B&K 4134) placed about 10 cm in front of the mouth, pre-amplified at the microphone (B&K 1619), and amplified by a measuring amplifier (B&K 2607). The speech signal was A/D converted off-line at a 10 kHz sampling rate.

Synthetic speech

The synthesis system that is used to generate the synthetic speech is a serial pole-zero synthesizer [1] that uses the DEC-Talk source [2]. The main reason for using the DEC-Talk source is its computational simplicity. All synthesis signals have a sampling frequency of 10 kHz.

METHOD

Inverse filtering is often used to obtain an estimate of the glottal flow signal. At the base of this method is the assumption that the voice source and the vocal tract filter do not interact. It is known that this assumption is not valid [3], but it is a useful approximation of the human speech system. The approximation is best during the closed glottis intervals, because then there is least interaction between the sub- and supraglottal cavities. Therefore, the analysis window should preferably be confined to the closed glottis interval. In [4] we showed that Closed glottis interval Covariance Linear Predictive (CC-LP) analysis is as powerful as more sophisticated techniques, like robust pole-zero analysis.

General method

A block diagram of the general method is shown in Fig. 1. First the vocal tract filter is estimated. This is done by converting the results of the CC-LP analysis to M Formant-Bandwidth pairs. This module is therefore called "FB-EST" (see Fig. 1). The problem of finding the correct set of analysis parameters is treated in the next section.

The estimated filter is inverted, and the inverse filter is used to filter the audio signal (module IF in Fig. 1). The resulting inverse filtered signal, INV, is a first estimate of the differentiated glottal volume flow. As the inverse filtered signal often contains high-frequency noise, it is low-pass filtered (module LPF in Fig. 1c). The resulting signal is a new, usually better, estimate of the differentiated glottal volume flow, dUg. The glottal volume flow, Ug, is obtained by integration of dUg (module INT in Fig. 1).

Voice source parameters, like open quotient and skewing, could be derived directly from dUg and Ug. However, since these signals often are noisy direct measurements yield unreliable values. Fitting a voice source model to the data probably is a more robust method for obtaining voice source parameters. In our system, the fit is done simultaneously on dUg and Ug [5].

The LF-model was used as voice source model because it seems useful for synthesis, and because it has already been studied in great detail [6]. The model and its parameters are shown in Fig. 2. The LF-model is a four parameter model. There are different combinations of the LF-parameters that uniquely define a flow pulse. The four parameters that are used for the generation of flow pulses during the fit procedure are U0, Tp, Te, and Ta. Both during analysis and synthesis a fifth parameter is necessary to position the LF-pulses, viz. T0.

The general method can be split in two parts. In the first part the formants and bandwidths of the vocal tract filter are estimated (FB-EST), and in the second part this filter is used to derive LF-parameters from the audio signal. This second module is therefore called AUDIO2LF (see Fig. 1).

Analysis parameters

For CC-LP analysis a number of choices have to be made like position and width of the closed glottis interval, order of the analysis, and pre-emphasis factor. Usually, no combination of choices is optimal for the whole utterance. However, for the natural speech that was used in this study a 12th order LPC analysis with a pre-emphasis factor of 1.00 (+6 dB/oct) worked satisfactorily for almost all pitch periods. Thus, window position and window length are left as parameters that can be varied.

Generally, the moments of glottal closure are easier to detect than the moments of opening. It is possible to identify the locations of the main excitation from the audio signal [7]. However, we prefer to determine the moments of glottal closure from the electroglottogram [5]. The signal with the closure markers is called "MARKERS" (see Fig. 1). The window position will be determined relative to a closure marker (this parameter is called window shift). Five window shifts (of -2, -1, 0, 1, 2 samples) were used. For synthetic speech the close markers are found by using the same operations (differentiation, low-pass filtering, and peak-picking) for the source signal.

For natural speech it is difficult to obtain an accurate estimate of the instant of glottal opening. Therefore, LP analysis is done for five different, fixed window lengths. If the length of the window is too large, then the last part will be in the open phase. This will perturb the estimates of the formants and especially of the bandwidths. On the other hand, if the length of the window is too short then LP anal-ysis will not be able to make an accurate estimate. As a compromise we used window lengths of 33, 34, 35, 36, and 37 samples.

In both methods the FB-pairs are estimated for the 25 combinations of analysis parameters (see Fig. 3). The 25 estimations of the vocal tract filter are used by both methods to obtain a single set of voice source parameters.

Method 1

In the first method the module AUDIO2LF is used 25 times (see Fig. 3). The result is 25 sets of LF-parameters. Finally median values are calculated for the LF-parameters (module MEDIAN in Fig. 3), which results in one set of average LF-parameters. This method is described in more detail in [5].

Method 2

In method 2 a Formant-Bandwidth TRacker (module FB-TR in Fig. 3) is used to obtain an optimal set of FB-pairs. In this method the module AUDIO2LF is applied once, where it uses the optimal filter to obtain one set of LF-parameters.

FB-Tracker

The goal of the module FB-TR is to find the first 4 FB-pairs. Generally, the first 4 FB-pairs are among the first 5 poles that are modelled by the LP analysis. Therefore, all 5 possible combinations of 4 out of 5 poles are generated. This is done for the 25 estimates of the vocal tract filter, and the result is a lattice of 4 FB-pairs with a depth of 125. A Viterbi algorithm is used to find the optimal path in this lattice.

The optimal path is the path with minimal total cost. For calculation of the cost, a transition and a local cost function is used. The transition cost function is the Euclidean distance between the values of the current frame and the values of the previous frame. The local cost function is the Euclidean distance between the values of the current frame and reference values. The reference values are obtained by a correlation LP analysis, using an analysis window of 256 samples.

For higher formants the variation in both frequency and bandwidth values is higher than for lower formants. For FB-tracking the effect would be that the FB-values of higher formants are more important in determining the optimal path. Therefore, all variables are converted to standard normal variables by first subtracting the mean value, and subsequently dividing by the standard deviation.

The total cost function has four contributions, viz. the transition and local costs of the formants and the transition and local costs of the bandwidths. Each of these four contributions can be given a weight. The weights that have been used are 4, 2, 1, and 0, respectively. The reference bandwidths that are obtained by correlation LP analysis are usually larger than the bandwidths obtained by CC-LP. These bandwidths can not be compared in a straightforward manner, and consequently the weight of the local cost of the bandwidths is set to zero.

RESULTS

Natural speech

For the current article data are used that are obtained by applying both methods to a natural utterance with a length of about 2.5 seconds. In Fig. 4 the audio signal of part of the utterance is shown, together with T0 and the four LF-parameters as calculated by both methods.

The part of the utterance shown in Fig. 4 clearly demonstrates the dynamics of the LF-parameters. Generally, it is observed that U0 covaries with the amplitude of the speech signal. The same holds for the other amplitude related parameters of the LF-model, like Ee and Ei. During transitions from vowels to consonants it is often observed that U0 decreases, while the time parameters T0, Te, Tp and Ta increase. The same effects were found by Bickley and Stevens for artificial vocal tract constrictions [8].

The four LF-parameters and T0 can be used to calculate all other glottal waveform parameters like open quotient, speed quotient, skewing etc. As an example the 3 dimensionless wave shape parameters have been calculated: Rg = T0/2Tp, Rk = Te/Tp - 1, and Ra = Ta/T0. The average values of Rg, Rk, and Ra for all 194 voiced periods of the utterance are 111, 42, and 6.7 for method 1, and 110, 41, and 7.1 for method 2. These values are in accordance with the values given in [9].

Although the results of both methods are slightly different, both methods seem to give plausible results. Given these results, two questions emerged: what is the reliability of the results, and which method is the best? As there is no objective test procedure for natural speech, we also tested both methods for synthetic speech.

Synthetic speech

The DEC-Talk waveform and the LF waveform are fundamentally different. LF-parameters that are derived by both methods from synthetic speech can not be compared directly to source signal that is used during synthesis. For evaluation of the test results a reference was required. This reference was obtained by performing the same operations on the source signal as on the inverse filter results, i.e. low-pass filtering, integration and fitting of a LF-model.

The rationale behind this is that the differentiated source signal should really be compared to the inverse filtered signals. All operations that are necessary to obtain LF-parameters from the inverse filtered signals should therefore also be applied to the differentiated source signal. In Fig. 5 the low-pass filtered and the fitted signal are shown. Apart from the fitted signal, the fit also yields the four LF-parameters for each pitch period. These reference parameters are used below for evaluation. The utterance used for evaluation was a random concatenation of all vowels, liquids and glides that are used in the synthesis system.

In Fig. 5 it can be seen that low-pass filtering and fitting mainly influences the excitation strength and the return phase. The effects are clear for a synthetic glottal pulse (as in Fig. 5), but the same effects occur for the inverse filtered signals that are derived from the speech signals. In order to calculate the true values of the voice source parameters, a correction is mandatory after both methods. The amount of both corrections can be calculated from synthetic speech, and should be verified for natural speech. In the following part we will only test the similarity of the reference parameters (set 0) and the parameters obtained by both methods (set 1 and 2).

In Fig. 5 one can see that the reference signal and the source signals obtained by both methods from the synthetic speech signal are very much alike. A regression analysis on the voice source parameters has been done to test the degree of resemblance. The results are given in Table I.

Table I. Results of regression analysis for different combinations of the four LF-parameters. For each combination of two variables y and x a straight line is fitted through the data: Y = intercept + slope*X. The correlation coefficient R denotes the goodness of the fit (N = 232). In the top box are given the comparisons of the results of both methods (subscript 1 and 2) with the reference (subscript 0), and in the bottom box the comparisons between both methods.

Y X intercept slope R

U0,1 U0,0 -9.7 1.03 0.96

U0,2 U0,0 -11.4 1.05 0.97

Te,1 Te,0 -0.33 0.95 0.75

Te,2 Te,0 0.07 1.01 0.76

Tp,1 Tp,0 0.25 0.96 0.60

Tp,2 Tp,0 -0.09 1.05 0.62

Ta,1 Ta,0 0.05 0.95 0.69

Ta,2 Ta,0 0.00 1.00 0.64

U0,2 U0,1 3.8 0.99 0.98

Te,2 Te,1 -0.05 1.01 0.97

Tp,2 Tp,1 -0.09 1.02 0.96

Ta,2 Ta,1 0.02 0.90 0.80

The results of the comparisons between both methods and the reference are given in the upper part of Table I. All slopes are about 1, and all intercepts are small which means that the parameters calculated by both methods have, on the average, the same value as the reference parameters. (The intercept of U0 has a larger absolute value, but the value of the intercept relative to the average value of U0 is even smaller than those of Te, Tp, and Ta.) The correlation coefficients in Table I for U0 are almost 1, and those of Te, Tp, and Ta are somewhat smaller but still highly significant. This means that the values calculated by both methods for Te, Tp, and Ta closely resemble the reference, while the calculated values for U0 and the reference values are almost identical. Generally, the results of method 2 are slightly better than those of method 1.

In the lower part of Table I both methods are compared. For U0, Te, and Tp the intercept is small, and the correlation coefficient and the slope are almost 1. This means that for U0, Te, and Tp the results of both methods are very much alike. The resemblance of the values of Ta for both methods is somewhat less.

CONCLUSIONS

In this article 2 methods are proposed for the automatic extraction of voice source parameters from continuous speech. For natural speech both methods produce comparable and reasonable results. There is a need for an objective procedure to test the reliability of the extracted parameters. As long as this test is not available, the best alternative is to test these methods on synthetic speech.

The various operations that are used during the extraction of the voice source parameters from speech, have influence on the magnitude of these parameters. In order to re-estimate the true magnitude of the parameters these effects have to be corrected.

From the tests on synthetic speech it appeared that both methods succeed in estimating the voice source parameters quite accurately. The results obtained for the amplitude parameter U0 are better than those of the time parameters Te, Tp, and Ta. For synthetic speech the second method is slightly better than the first method.

REFERENCES

[1] L. Boves, J. Kerkhoff, and H. Loman. "A new synthesis model for an allophone based text-to-speech system." Proceedings of the European Conference on Speech Technology, J. Laver and M.A. Jack (eds.), Edinburgh, Vol. II, pp. 385-388, 1987.

[2] D.H. Klatt and L. Klatt. "Analysis, synthesis, and perception of voice quality variations among female and male talkers, " J. Acoust. Soc. Am. 87(2), pp. 820-857, 1990.

[3] T.V. Anantapadmanabha and G. Fant. "Calculation of true glottal flow and its components, " Speech Communication, Vol. 1, pp. 167-184, 1982.

[4] J. de Veth, B. Cranen, H. Strik, and L. Boves. "Extraction of control parameters for the voice source in a text-to-speech system, " Proc. Int. Conf. Acoust. Speech Signal Processing, Vol. 1, pp. 301-304, 1990.

[5] H. Strik and L. Boves. "On the relation between voice source parameters and prosodic features in connected speech, " to appear in Speech Communication, Vol. 11, 1992.

[6] G. Fant, J. Liljencrants and Q. Lin. "A four parameter model of glottal flow, " STL-QPSR, Vol. 4, pp. 1-13, 1985.

[7] D.G. Childers and C.K. Lee. "Vocal quality factors: Analysis, synthesis, and perception, " J. Acoust. Soc. Am. 90(5), pp. 2394-2410, 1991.

[8] C.A. Bickley and K.N. Stevens. "Effects of a vocal-tract constriction on the glottal source: experimental and modelling studies, " J. of Phon., Vol. 14, pp. 373-382, 1986.

[9] R. Carlson, G. Fant, C. Gobl, B. Granstrom, I. Karlsson and Q. Lin. "Voice source rules for text-to-speech synthesis, " Proc. Int. Conf. Acoust. Speech Signal Processing, Vol. 1, pp. 223-226, 1989.

Last updated on 22-05-2004