Functional Data Analysis as a Tool for Analyzing Speech Dynamics: A Case Study on the French Word c'était

Michele Gubian1, Francisco Torreira1,2, Helmer Strik1, Lou Boves1
1 Centre for Language & Speech Technology, Radboud University, Nijmegen, NL
2 Max Planck Institute for Psycholinguistics, Nijmegen, NL
{M.Gubian@let.ru.nl, Francisco.Torreira@mpi.nl, w.strik@let.ru.nl, L.Boves@let.ru.nl}

Abstract

In this paper we introduce Functional Data Analysis (FDA) as a tool for analyzing dynamic transitions in speech signals. FDA makes it possible to perform statistical analyses of sets of mathematical functions in the same way as classical multivariate analysis treats scalar measurement data. We illustrate the use of FDA with a reduction phenomenon affecting the French word c'était /setE/ ‘it was’, which can be reduced to [stE] in conversational speech. FDA reveals that the dynamics of the transition from [s] to [t] in fully reduced cases may still be different from the dynamics of [s]-[t] transitions in underlying /st/ clusters such as in the word stage.

Index Terms: Functional Data Analysis, Principal Component Analysis, Categorical vs. gradual phenomena.

1. Introduction

It is well known that most of the information in speech signals is encoded in dynamic changes in formants, pitch, and power. Dynamic changes are best described in the form of some mathematical function, such as a piece of a second or third order polynomial. Nevertheless, more often than not measurements obtained in phonetic experiments are formulated in terms of absence or presence of some phenomenon, or of initial and final values of some dynamically changing parameter, such as pitch or formant frequencies. In other words: experimental data tend to be given in terms of nominal, ordinal, interval or ratio data, none of which are mathematical functions.
The reason for this discrepancy is evident: we have had increasingly powerful methods for the statistical analysis of scalar data (or of sets of scalars in a vector), but there were no statistical methods that could operate on functions instead of on numbers or labels. Perhaps it is fair to say that phonetic research in a tradition where speech is represented as a sequence of discrete units (phonemes or feature vectors) was not hampered too much by the lack of statistical methods that can deal with functions. After all, the properties of discrete units can be summarized in scalar measurements, at least to a large degree. However, we are witnessing a growing interest in what has become known as ‘fine phonetic detail’, which is all about dynamic phenomena [2]. If dynamics is key, the need to map function-valued measurement data onto scalars or vectors to make them amenable to statistical analysis becomes debilitating. After all, any mapping from functions to scalars is bound to destroy or distort essential information.

Fortunately, speech research is not the only field where dynamic processes play a central role. Therefore, it need not come as a surprise that there is a growing number of statistical techniques specifically developed for dealing with function-valued measurements. These techniques were developed in fields like astronomy, but they might be applied to advantage in phonetic research. One such class of methods is known as Functional Data Analysis (FDA). The use of FDA in phonetics is not completely new. It has been used to time align pitch periods (for calculating harmonics-to-noise ratio), kinematic [1] and aerodynamic data [3] with an accuracy that is substantially better than conventional Dynamic Time Warping.
In this paper we show that FDA can also be used for analyzing speech data that can only be described in terms of differences in dynamic transitions, and where precise time alignment is difficult at best because the dynamic processes that we want to investigate may well represent different underlying phenomena.

The rest of this paper is structured as follows. In section 2 we briefly introduce the FDA technique used in our research. In section 3 we explain in detail the phenomenon under study (the reduction of the vowel /e/ in French c'était /setE/ ‘it was’). We also describe the speech material and the measurements that we took to represent the dynamics of the transition from the initial [s] to the medial [t] of /setE/. Then we present the major results (section 4) and we finish with discussion and conclusions (section 5).

2. Functional Data Analysis

Functional Data Analysis [7, 8] is a suite of computational techniques that extend classic methods from statistics so that they can operate on functions instead of on scalars. Thus, they allow one to make quantitative inferences from sets of whole continuous functions (signals) without the need for an intermediate step in which functions are converted into scalars, a process that always causes information loss and that makes inference from dynamic traits of signals problematic.

Analyzing sets of dynamic (function-valued) observations with FDA takes two steps. It must be emphasized that although FDA is applied to digital signals, this does not imply that functions are converted to scalars. The first step is data preparation, which consists in transforming the sampled signals into a functional form, usually employing basis functions like B-splines and standard least squares interpolation, often including a regularization term. In this process, all functions are normalized on the same time interval, to make them comparable across time. In cases when a set of landmarks can be reliably identified in all functions (e.g. a series of peaks with a clear physical interpretation), these landmarks can be used to produce a time-registered version of the whole set of functions, making all corresponding landmarks coincide in (normalized) time (see e.g. [8], Chap. 7).

The second step is data analysis. Many techniques from multivariate statistics have been extended to functions, including functional Principal Component Analysis (fPCA) and different versions of functional linear modeling (cf. [7] for a comprehensive overview). In this study we will use fPCA.

Copyright © 2009 ISCA, 6-10 September, Brighton UK

Classic PCA is a way to extract and display the main modes of variation of a set of multidimensional data [7]. Starting from a data set in its original set of coordinates, a new basis is found such that, when the data points are projected onto this basis, the projection on the first dimension accounts for the largest part of the variance in the data set, the second for the next largest part, and so on. While in PCA principal components are vectors of the same dimension as the data vectors, in fPCA principal components become functions defined on the same time interval as the functional data set. Fig. 1 shows a typical way to display principal components in fPCA. The solid line shows the average signal, i.e. each point is the average of all the functions in the data set at that (normalized) time, while the ‘+’ and ‘-’ curves represent the effect of adding to or subtracting from it a multiple of the first principal component function (fPC1). Data points (i.e. functions) that get a high positive score when projected on fPC1 will then tend to look like the ‘+’ curve, and vice versa for negative scores.

3. Experiments

3.1. Goals of the experiments

We demonstrate the application of fPCA with a study of vowel reduction affecting the French word c'était /setE/ ‘it was’. In conversational speech the vowel /e/ can be reduced, even to the extent that it seems to be completely absent.
The phonetic question that we want to address is whether the vowel /e/ is gradiently reduced or categorically absent in the reduced pronunciations of c'était. To this aim, we investigated tokens of c'était with a vowel between /s/ and /t/, tokens where no voicing was present, and tokens of underlying /st/ clusters extracted from other words such as stage. We want to show that by taking the scores of individual tokens on the first fPC as descriptors of dynamic behavior, standard statistical techniques like ANOVA and k-means clustering can be used to quantitatively assess the nature of the reduction phenomenon. The difference between the use of FDA and classic statistics is that the fPC1 score represents the actual dynamics of the signals, unlike e.g. the mean or variance, which are the result of time averaging. As a corollary, we will show that FDA can be successfully applied to sets of qualitatively heterogeneous signals (with or without a prominent maximum in the middle), which has not yet been described in detail in the literature [8].

If the [s] to [t] transition in c'était tokens with a completely deleted vowel displays the same dynamics as tokens from words with underlying /st/, we will have found support for the hypothesis that the reduction process is categorical. If, on the other hand, the fully reduced tokens of c'était can still be distinguished from underlying /st/ clusters (because their dynamics are different), then the result seems to support the hypothesis that the reduction process is gradual.

In our experiments we will first investigate whether FDA can distinguish between underlying /st/ clusters and c'était tokens that were annotated as fully reduced. Next, we will investigate whether FDA can uncover differences between the dynamics of different forms of c'était and underlying /st/ clusters.

3.2. Materials

The materials used in this study were extracted from the Nijmegen Corpus of Casual French (NCCFr), which contains 35 hours of high-quality audio featuring casual conversations among French university students. A detailed description of the preparation, recording and contents of the NCCFr corpus can be found in [6]. The data set consists of 378 c'était pronunciations and 81 tokens of words starting with a /st/ cluster (e.g. stage ‘internship’). In each token, we took the beginning of [s] and the release of the [t] closure as the start and the end of the signal. Those events were manually marked by the second author by inspecting waveforms and spectrograms according to standard segmentation criteria. It should be noted that a considerable number of tokens exhibiting an incomplete [t] closure (n = 100) were discarded, since we preferred to have landmarks that were as clearly defined as possible. The presence of voicing between [s] and [t] was determined manually on the basis of voicing-like periodicity in the waveform. Of the c'était tokens, 191 contained voicing between [s] and [t], while 187 did not.

3.3. Feature Extraction

For our experiment we decided to characterize the dynamics of the [s] to [t] transition with a single signal feature that could be extracted in exactly the same manner from all tokens in the set, viz. the log-energy contour of a low-pass filtered version of the acoustic signal, henceforth called lowE. First a low-pass filter with a cut-off frequency of 3250 Hz is applied, then a 20 ms window is moved through the output signal in 5 ms steps and the average log-energy is calculated. We subtracted from each lowE sequence its average value, since a global difference in log-energy across tokens reflects random effects such as distance to the microphone and overall speaker volume.
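The lowE extraction described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the processing chain actually used in the study: the filter design (a 101-tap windowed-sinc FIR filter), the reading of ‘average log-energy’ as the log of the mean squared amplitude per window (in dB), and all function names are assumptions made for this example.

```python
import math

def lowpass_fir(signal, sample_rate, cutoff_hz=3250.0, num_taps=101):
    """Low-pass filter with a Hamming-windowed sinc FIR (assumed design)."""
    fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    taps = [t / gain for t in taps]       # unit gain at 0 Hz
    half = num_taps // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k              # centred kernel, zero-padded edges
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

def log_energy_contour(signal, sample_rate, win_ms=20.0, step_ms=5.0, floor=1e-12):
    """Log-energy (dB) in a 20 ms window moved in 5 ms steps."""
    win = int(round(sample_rate * win_ms / 1000))
    step = int(round(sample_rate * step_ms / 1000))
    contour = []
    for start in range(0, len(signal) - win + 1, step):
        frame = signal[start:start + win]
        energy = sum(s * s for s in frame) / win
        contour.append(10 * math.log10(energy + floor))  # floor avoids log(0)
    return contour

def low_e(signal, sample_rate):
    """Mean-subtracted log-energy contour of the low-pass filtered signal."""
    contour = log_energy_contour(lowpass_fir(signal, sample_rate), sample_rate)
    mean = sum(contour) / len(contour)
    return [c - mean for c in contour]
```

Subtracting the mean in the last step removes the token-level offset in log-energy, so only the shape of the contour remains.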
We believe that lowE is a good index of the dynamics of the [st] transition because speech power can reveal opening and closing movements of the tongue related to the articulation of the [e] vowel. If the speaker made a gesture related to the production of a vowel, the constriction for the [s] should become less narrow, and consequently one would expect the power in the frication noise to drop. An intervening full vowel would cause a rise in the acoustic power after the release of the [s]. Even if the release of the [s] constriction does not result in a (voiced or voiceless) vowel, one would still expect a plateau or a somewhat gradual decrease of the acoustic power into the [t] closure. However, in the case of underlying /st/ clusters one would expect a fairly rapid and monotonic decrease of the acoustic power from the [s] into the [t] closure.

3.4. Data Preparation

All data processing from this point on was carried out using the fda library for the R software [4], available at [5]. In order to perform FDA, all sampled contours have to be transformed into functions defined on the same time interval. Since each audio segment has a different duration, we proceeded as follows. Each sampled feature contour was first interpolated using a 4th order B-spline basis with one knot per sample and a 2nd order roughness penalty. The smoothing parameter λ was empirically set to 10. Then each function was re-sampled on 31 equally spaced points, and the obtained sampled contours were once again interpolated, this time using a 6th order B-spline basis with one knot per sample and a 4th order roughness penalty. The latter choice forces continuity up to the 2nd order derivative, thus allowing a greater degree of smoothness. The smoothing parameter λ was empirically set to 10^-12 (this value is very different from the previous one mainly because of a different representation of time: ms in the former case, normalized in [0,1] in the latter). We did not attempt to register data with landmarks, since on such short trajectories we were not able to identify reliable landmarks. A few audio tokens were discarded because of problems in signal processing. In the end we worked with 369 c'était tokens, of which 186 were annotated as containing voicing between [s] and [t], and 80 tokens of underlying /st/ clusters.

4. Results

4.1. Voiceless c'était vs. underlying /st/ clusters

We first applied fPCA to all voiceless tokens, which include the subset of 183 voiceless c'était realizations plus the set of 80 realizations of underlying /st/ clusters. The aim was to investigate traces of any difference in dynamics between those two sets that could distinguish [st] tokens resulting from vowel deletion from underlying /st/ tokens.

Figure 1: First principal component (fPC1) of lowE contours of the fully reduced c'était token set and the underlying [st] clusters. Average signal (solid line) and +/-2 standard deviations of fPC1 (‘+’ and ‘-’ curves).

Using lowE as an indication of the dynamics of the [s] to [t] transition, fPCA suggests that fully reduced c'était and underlying /st/ clusters as in the word stage ‘internship’ are similar, but certainly not identical. Figs. 1 and 2 show fPC1, which explains 46% of the variance, and the empirical densities [9] of the fPC1 scores of the tokens in the two subsets. The solid line in Fig. 1 shows the average signal, i.e. each point is the average of all the 263 lowE contours at that (normalized) time, while the ‘-’ curve (with a rise-fall portion) and the ‘+’ curve (without) represent the effect of 2 standard deviations of fPC1 on the average curve (as in classic PCA, signs have no intrinsic meaning). Qualitatively, we expect tokens with an associated negative fPC1 coefficient (‘-’ curve) to belong to the c'était subset, since those curves will tend to have a slower decrease in acoustic power between [s] and [t] than the underlying /st/ tokens.
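To make the construction of the average curve and the ‘+’ and ‘-’ curves concrete, here is a minimal sketch. It is a discretized stand-in for fPCA, not the B-spline-based procedure of the fda library used in the study: it runs ordinary PCA on curves sampled on a common grid (e.g. 31 points in normalized time) and finds the first component by power iteration. The function names and the choice of power iteration are assumptions made for this example, and the sketch assumes that the curves are not all identical.

```python
import math

def first_principal_component(curves, iterations=200):
    """PCA on curves sampled on a common time grid.

    Returns the mean curve, the unit-norm first component, the score of
    each curve on it, and the proportion of variance it explains.
    """
    n, m = len(curves), len(curves[0])
    mean = [sum(c[j] for c in curves) / n for j in range(m)]
    centered = [[c[j] - mean[j] for j in range(m)] for c in curves]
    # Sample covariance between every pair of grid points.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration, started from the centred curve with the largest norm.
    start = max(centered, key=lambda r: sum(x * x for x in r))
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m))
                     for a in range(m))
    total_variance = sum(cov[j][j] for j in range(m))
    scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    return mean, v, scores, eigenvalue / total_variance

def plus_minus_curves(mean, component, scores, k=2.0):
    """Mean curve plus and minus k standard deviations of the scores."""
    sd = math.sqrt(sum(s * s for s in scores) / (len(scores) - 1))
    plus = [mu + k * sd * p for mu, p in zip(mean, component)]
    minus = [mu - k * sd * p for mu, p in zip(mean, component)]
    return plus, minus
```

The sign of the returned component is arbitrary, which mirrors the remark above that signs have no intrinsic meaning. The scores are the per-token scalars that can then be passed to standard techniques such as a t-test, ANOVA or k-means.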
In accordance with the limited proportion of variance explained by fPC1, the distance between the ‘+’ curve and the ‘-’ curve is not very large. The ‘+’ curve shows a sort of plateau that could be attributed to a gesture related to the articulation of an intervening [e]. It should be noted, however, that manual spectrographic inspection of the materials suggested that the energy in the plateau area might be attributable to [s] frication rather than to genuine formants.

Figure 2: Empirical densities for the fPC1 scores of lowE contours of the fully reduced c'était token set and the underlying [st] tokens.

A t-test on the two empirical distributions (cf. Fig. 2) yielded a statistically significant difference (p < .0001); thus, the two subsets must be considered as originating from different populations. However, by applying k-means with two clusters we could separate c'était tokens from underlying /st/ clusters with only 59% accuracy. The results of the fPCA suggest that fPC1 clearly cannot separate the populations effectively; yet the difference between the two distributions leaves open the possibility that the c'était tokens (several of which have fPC1 scores that are never reached by /st/ tokens) are from a population that is characterized by an underlying [e] in between [s] and [t].

4.2. Three-way analysis

We then applied fPCA to the complete set of 449 tokens. The results are summarized in Figs. 3 and 4. fPC1 explains 63.4% of the total variance. The increase in the proportion of explained variance (compared to the analysis of the two sets of voiceless tokens) is not surprising. After all, the subset that contains clear traces of a vowel can be considered as structurally different from the completely voiceless tokens. An ANOVA on the complete set of fPC1 scores was carried out, with three groups, i.e. c'était with a trace of a vowel, c'était without a trace of a vowel, and underlying /st/. We obtained F(2, 442) = 595.91, p < .0001.
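For readers unfamiliar with how such an F value arises from per-token fPC1 scores, the following sketch computes a one-way ANOVA F statistic from groups of scalar scores. It is an illustration only: the function name is hypothetical, the numbers in the usage note are made up and are not data from this study, and the p-value (which requires the F distribution) is not computed.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Variation of the group mean around the grand mean.
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Variation of individual scores around their own group mean.
        ss_within += sum((x - group_mean) ** 2 for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

With three made-up groups, `one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])` gives F = 13 with (2, 6) degrees of freedom: the between-group sum of squares is 26 and the within-group sum of squares is 6.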
A Tukey HSD post-hoc test revealed that the means of all three groups were statistically different from each other (p < .0001 in all cases). So, we see again that there is support for the hypothesis that we are dealing with three different underlying populations. k-means clustering with two clusters yielded 94.5% agreement with the partition into the set annotated as containing voicing between [s] and [t] on the one hand, and the union of the sets annotated as fully reduced c'était and underlying /st/ clusters on the other. However, clustering with k = 3 did not yield meaningful results. This corroborates the previous finding that the two sets of voiceless tokens cannot easily be separated.

Figure 3: First principal component (fPC1) of lowE contours of the full token set (comprising three classes). Average signal (solid line) and +/-2 standard deviations of fPC1 (‘+’ and ‘-’ curves).

5. Discussion and Conclusion

To the best of our knowledge this paper presents the first application of functional Principal Component Analysis for investigating the presence or absence of differences in the dynamics of the [s] to [t] transition between underlying /st/ clusters and [st] tokens that result from the deletion of an intervening vowel /e/ in French c'était. In the past, Functional Data Analysis has only been applied in phonetic research to obtain very accurate time alignments (cf. [1], [3]). To represent the dynamics of speech phenomena that may or may not contain a vowel between [s] and [t] we used the contours of the speech power as function-valued observations. Our results strongly suggest that fPCA can indeed be used for conducting statistical analyses on sets of data that are functions rather than scalars or vectors.
Both the fPC1 contours (relative to the average contours) and the empirical distributions of the fPC1 scores for the tokens in the three sub-sets indicate that there are three underlying populations, rather than two (the sub-set with vowel present and the union of the two fully voiceless sub-sets). Still, despite the fact that the difference between the distributions of the two voiceless sub-sets is statistically significant, the overlap is very substantial.

Figure 4: Empirical densities for the fPC1 scores of lowE contours of the three token sets, separately computed for the two classes of c'était tokens and the /st/ clusters.

The long-term goal of our research is to investigate the contribution of fine phonetic detail to speech comprehension and its role in speech production. FDA (and fPCA in particular) has proved to be a powerful tool for analyzing speech dynamics. In ongoing research we are investigating whether FDA can also be applied to speech features that involve non-linear processing, such as the extraction of formants. It will be interesting to see whether the heuristic strategies needed to decide whether some spectral peak is a formant or not interfere with FDA.

The phonetic question that was at the basis of this study, viz. whether the reduction in French c'était is a gradual or rather a categorical process, cannot be answered conclusively. While we have found statistically significant differences between the distributions of underlying /st/ clusters and [st] clusters that result from vowel deletion, there are several potentially relevant factors that were not controlled for in this corpus-based study. A study is under way to investigate the effects of prosody on the dynamics of [s] to [t] transitions. After all, French c'était also tends to differ from words with underlying /st/ clusters in that it is rarely stressed, and in that it is very often phrase initial.
In conclusion, we can say that FDA is an extremely promising tool in the study of fine phonetic detail. At the same time, the interaction between fine phonetic detail and other phonetic variables, notably prosody, is so strong that novel experimental designs may need to be developed to come to grips with the intricacies of speech dynamics.

6. Acknowledgements

The research of Michele Gubian is supported by the Marie Curie Research Training Network Sound-to-Sense (http://www.ling.cam.ac.uk/s2s/).

7. References

[1] Byrd, D., Lee, S. and Campos-Astorkiza, R. (2008) Phrase boundary effects on the temporal kinematics of sequential tongue tip consonants. J. Acoust. Soc. Am., Vol. 123, pp. 4456-4465.
[2] Carlson, R. and Hawkins, S. (2007) When is phonetic detail a detail? Proc. ICPhS XVI, pp. 211-214.
[3] Koenig, L. R., Lucero, J. C. and Perlman, E. (2008) Speech production variability in fricatives of children and adults: Results of functional data analysis. J. Acoust. Soc. Am., Vol. 124, pp. 3158-3170.
[4] R Development Core Team (2008) R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.
[5] Online: http://www.functionaldata.org.
[6] Torreira, F., Adda-Decker, M. and Ernestus, M. (submitted) The Nijmegen Corpus of Casual French.
[7] Ramsay, J. O. and Silverman, B. W. (1997) Functional Data Analysis, Springer-Verlag New York, Inc.
[8] Ramsay, J. O. and Silverman, B. W. (2002) Applied Functional Data Analysis - Methods and Case Studies, Springer-Verlag New York, Inc.
[9] Sarkar, D. (2008) Lattice: Multivariate Data Visualization with R, Springer.