Functional 
Data 
Analysis 
as 
a 
Tool 
for 
Analyzing 
Speech 
Dynamics 
A 
Case 
Study 
on 
the 
French 
Word 
c’´etait 


Michele 
Gubian1, 
Francisco 
Torreira1;2, 
Helmer 
Strik1, 
Lou 
Boves1 


1Centre for Language & Speech Technology, Radboud University, Nijmegen, NL 
2Max Planck Institute for Psycholinguistics, Nijmegen, NL 

fM.Gubian@let.ru.nl, 
Francisco.Torreira@mpi.nl, 
w.strik@let.ru.nl, 
L.Boves@let.ru.nl} 


Abstract 


In this paper we introduce Functional Data Analysis (FDA) as 
a tool for analyzing dynamic transitions in speech signals. FDA 
makes it possible to perform statistical analyses of sets of mathematical
 functions 
in the same way as classical multivariate 
analysis treats scalar measurement data. We illustrate the use of 
FDA with a reduction phenomenon affecting the French word 

c’´

etait 
/setE/ 
‘it was’, which can be reduced to [stE] 
in conversational
 speech. FDA reveals that the dynamics of the transition 
from [s] to [t] in fully reduced cases may still be different from 
the dynamics of [s] -[t] transitions in underlying /st/ clusters 
such as in the word stage. 
Index 
Terms: Functional Data Analysis, Principal Component 
Analysis, Categorical vs. 
gradual phenomena. 

1. 
Introduction 
It is well known that most of the information in speech signals 
is encoded in dynamic changes in formants, pitch, and power. 
Dynamic changes are best described in the form of some mathematical
 function, such as a piece of a second or third order 
polynomial. Nevertheless, more often than not measurements 
obtained in phonetic experiments are formulated in terms of 
absence or presence of some phenomenon, or of initial and final
 values of some dynamically changing parameter, such as 
pitch or formant frequencies. In other words: experimental data 
tend to be given in terms of nominal, ordinal, interval or ratio 
data, none of which are mathematical functions. The reason for 
this discrepancy is evident: we had increasingly more powerful 
methods for the statistical analysis of scalar data (or of sets of 
scalars in a vector), but there were no statistical methods that 
could operate on functions, instead of on numbers or labels. 

Perhaps it is fair to say that phonetic research in a tradition
 where speech is represented as a sequence of discrete 
units (phonemes or feature vectors) was not hampered too much 
by the lack of statistical methods that can deal with functions.
 After all, the properties of discrete units can be summarized
 in scalar measurements, at least to a large degree. However,
 we are witnessing a growing interest in what has become 
known as ‘fine phonetic detail’, which is all about dynamic 
phenomena [2]. If dynamics is key, the need to map function-
valued measurement data onto scalars or vectors to make them 
amenable to statistical analysis, becomes debilitating. After all, 
any mapping from functions to scalars is bound to destroy or 
distort essential information. 

Fortunately, speech research is not the only field where dynamic
 processes play a central role. Therefore, it need not come 
as a surprise that there is a growing number of statistical techniques
 specifically developed for dealing with function-valued 

measurement, developed in fields like astronomy, but that might 
be applied to advantage in phonetic research. One such class of 
methods are known as Functional 
Data 
Analysis 
(FDA). 

The use of FDA in phonetics is not completely new. It has 
been used to time align pitch periods (for calculating harmonics 
to noise ratio), kinematic [1] and aerodynamic data [3] with an 
accuracy that is substantially better than conventional Dynamic 
Time Warping. In this paper we show that FDA can also be used 
for analyzing speech data that can only be described in terms 
of differences in dynamic transitions and where precise time 
alignment is difficult at best because the dynamic processes that 
we want to investigate may well represent different underlying 
phenomena. 

The rest of this paper is structured as follows: in section 2 
we briefly introduce the FDA technique used in our research. In 
section 3 we explain the phenomenon under study (the reduction
 of the vowel /e/ in French c’´etait 
/setE/ 
‘it was’) in detail. 
Also, we explain the speech material and the measurements that 
we took to represent the dynamics of the transition from the initial
 [s] to the medial [t] of /setE/. Then we present the major 
results (section 4) and we finish with discussion and conclusions 
(section 5). 

2. 
Functional 
Data 
Analysis 
Functional Data Analysis [7, 8] is a suite of computational techniques
 that extend classic methods from statistics so that they 
can operate on functions instead of on scalars. Thus, they allow 
one to make quantitative inferences from sets of whole continuous
 functions (signals) without the need for an intermediate step 
in which functions are converted into scalars, a process that always
 causes information loss, and that makes inference from 
dynamic traits of signals problematic. 

Analyzing sets of dynamic (function-valued) observations 
with FDA takes two steps. It must be emphasized that although 
FDA is applied to digital signals, this does not imply that functions
 are converted to scalars. The first step is data preparation, 
which consists in transforming the sampled signals into a functional
 form, usually employing basis functions like B-splines 
and standard least squares interpolation, often including a regularization
 term. In this process, all functions are normalized on 
the same time interval, to make them comparable across time. 
In cases when a set of landmarks can be reliably identified in all 
functions (e.g. a series of peaks with a clear physical interpretation)
 these landmarks can be used to produce a time registered 
version of the whole set of functions, making all corresponding 
landmarks coincide in (normalized) time (see e.g. [8], Chap. 7). 
The second part is data analysis. Many techniques from multivariate
 statistics have been extended to functions, including 

Copyright © 2009 ISCA 2199 
6-10 September, Brighton UK 


functional Principal Component Analysis (fPCA) and different 
versions of functional linear modeling (c.f. [7] for a comprehensive 
overview). 

In this study we will use fPCA. Classic PCA is a way to 
extract and display the main modes of variation of a set of multidimensional 
data [7]. Starting from a data set in its original 
set of coordinates, a new basis is found such that by expressing 
(projecting) the data points on this basis, the projection on 
the first dimension accounts for the largest part of the variance 
in the data set, the second for the next largest part, and so on. 
While in PCA principal components are vectors of the same 
dimension as the data vectors, in fPCA principal components 
become functions defined on the same time interval as the functional 
data set. Fig.1 shows a typical way to display principal 
components in fPCA. The solid line shows the average signal, 

i.e. each point is the average of all the functions in the data 
set at that (normalized) time, while the ‘+’ and ‘-’ curves represent 
the effect of adding to or subtracting from it a multiple 
of the first principal component function (fPC1). Data points 
(i.e. functions) that get a high positive score when projected on 
fPC1 will then tend to look like the ‘+’ curve, and vice versa for 
negative scores. 
3. 
Experiments 
3.1. 
Goals 
of 
the 
experiments 
We demonstrate the application of fPCA with a study of vowel 
reduction affecting the French word c’´

etait 
/setE/ 
‘it was’. In 
conversational speech the vowel /e/ can be reduced, even to the 
extent that it seems to be completely absent. The phonetic question 
that we want to address is whether vowel /e/ is gradiently 
reduced or categorically absent in the reduced pronunciations 
of c’´etait 
with a 

etait. To this aim, we investigated tokens of c’´
vowel between /s/ and /t/, tokens where no voicing was present 
and tokens of underlying /st/ clusters extracted from other words 
such as ’stage’. 

We want to show that by taking scores of individual tokens 
on the first fPC as as descriptors of dynamic behavior, standard 
statistical techniques like ANOVA and k-means clustering 
can be used to quantitatively assess the nature of the reduction 
phenomenon. The difference between use of FDA and classic 
statistics is that the fPC1 score represents the actual dynamics 
of the signals, contrary to e.g. mean or variance, which are 
the result of time averaging. As a corollary, we will show that 
FDA can be successfully applied to sets of qualitatively heterogeneous 
signals (with or without a prominent maximum in the 
middle), which has not yet been described in detail in the literature 
[8]. 

If the dynamics of the [s] to [t] transition in c’´etait 
tokens 
with a completely deleted vowel display the same dynamics as 
tokens from words with underlying /st/, we will have found support 
for the hypothesis that the reduction process is categorical. 
If, on the other hand, the fully reduced tokens of c’´etait 
can still 
be distinguished from underlying /st/ clusters (because their dynamics 
are different), then the result seems to support the hypothesis 
that the reduction process is gradual. 

In our experiments we will first investigate whether FDA 
can distinguish between between underlying /st/ clusters and 
c’´

etait 
tokens that were annotated as fully reduced. Next, we 
will investigate if FDA can uncover differences between the dynamics 
of different forms of c’´etait 
and underlying /st/ clusters. 

3.2. 
Materials 
The materials used in this study were extracted from the Nijmegen 
Corpus of Casual French (NCCFr), which contains 
35 hours of high-quality audio featuring casual conversations 
among French university students. A detailed description of the 
preparation, recording and contents of the NCCFr corpus can be 
found in [6]. The data set consists of 378 c’´

etait 
pronunciations 
and 81 tokens of words starting with a /st/ cluster (e.g. stage 
‘internship’). In each token, we decided to take the beginning 
of [s] and the release of the [t] closure as the start and the end of 
the signal. Those events were manually marked by the second 
author by inspecting waveforms and spectrograms according to 
standard segmentation criteria. It should be noted that a considerable 
number of tokens exhibiting an incomplete [t] closure 
(n 
= 
100) were discarded, since we preferred to have as clearly 
defined landmarks as possible. The presence of voicing between 

[s] and [t] was determined manually on the basis of voicing-like 
periodicity in the waveform. From the subset of c’´etait 
tokens, 
191 contained voicing between [s] and [t], while 187 did not. 
3.3. 
Feature 
Extraction 
For our experiment we decided to characterize the dynamics of 
the [s] to [t] transition with a single signal feature that could 
be extracted in the exact same manner from all tokens in the 
set, viz. the log-energy contour of a low-pass filtered version 
of the acoustic signal, henceforth called lowE. First a low-pass 
filter with cut-off frequency 3250Hz is applied, then a 20 ms 
window is moved through the output signal at 5 ms steps and 
the average log-energy is calculated. We subtracted from each 
lowE 
sequence its average value, since a global difference in 
log-energy across tokens reflect random effects such as distance 
to the microphone and overall speaker volume. 

We believe that lowE 
is a good index of the dynamics of 
the [st] transition because the total speech power is able to reveal 
opening and closing movements of the tongue related to 
the articulation of the [e] vowel. If the speaker made a gesture 
related to the production of a vowel, the constriction for the [s] 
should become less narrow, and consequently one would expect 
the power in the frication noise to drop. An intervening full 
vowel would cause a rise in the acoustic power after the release 
of the [s]. Even if the release of the [s] constriction does not 
result in a (voiced of voiceless) vowel, one would still expect a 
plateau or a somewhat gradual decrease of the acoustic power 
into the [t] closure. However, in the case of underlying /st/ clusters 
one would expect a fairly rapid and monotonic decrease of 
the acoustic power from the [s] into the [t] closure. 

3.4. 
Data 
Preparation 
All data processing from this point on was carried out using the 
fda 
library for the R software [4] available at [5]. In order to 
perform FDA all sampled contours have to be transformed into 
functions defined on the same time interval. Since each audio 
segment has a different duration, we proceeded as follow. Each 
sampled feature contour was first interpolated using a 4th order 
B-spline basis with one knot per sample and a 2nd order roughness 
penalty. The smoothing parameter . 
was empirically set 
to 10. Then each function was re-sampled on 31 equally spaced 
points, and the obtained sampled contours were once again interpolated, 
this time using a 6th order B-spline basis with one 
knot per sample and a 4th order roughness penalty. The latter 
choice forces continuity up to the 2nd order derivative, thus allowing 
a greater deal of smoothness. The smoothing parameter 

2200 



t 
principal 
component 
(fPC1) 
of 
lowE 
contours 
of 
the 
fully 
reduced 
c’´

etait 
token 
set 
and 
the 
underlying 
[st] 
clusters. 
Average 
signal 
(solid 
line) 
and 
+/-2 
standard 
deviations 
of 
fPC1 
( 
‘+’ 
and 
‘-’ 
curves). 


. 
was empirically set to 10..12 
(this value is very different from 
the previous one mainly because of a different representation 
of time, ms in the former case, normalized in [0,1] in the latter).
 We did not attempt to register data with landmarks, since 
on such short trajectories we were not able to identify reliable 
landmarks. A few audio tokens were discarded because of problems
 in signal processing. In the end we worked with 369 c’´etait 
tokens, of which 186 were annotated as containing voicing between
 [s] and [t], and 80 tokens of underlying /st/ clusters. 

4. 
Results 
4.1. 
Voiceless 
c’´
etait 
vs. 
underlying 
/st/ 
clusters 


We first applied fPCA to all voiceless tokens, which include the 
subset of 183 voiceless c’´etait 
realizations plus the set of 80 realizations
 of underlying /st/ clusters. The aim was to investigate 
traces of any difference in dynamics between those two sets that 
could distinguish [st] tokens resulting from vowel deletion from 
underlying /st/ tokens. 

Using lowE 
as an indication of the dynamics of the [s] to [t] 
transition, fPCA suggests that fully reduced c’´etait 
and underlying
 /st/ clusters as in the word stage 
‘internship’ are similar, 
but certainly not identical. Fig. 1 and 2 show fPC1, which explains
 46% of variance, and the empirical densities [9] of the 
fPC1 scores of the tokens in the two subsets. 

The solid line in Fig. 1 shows the average signal, i.e. each 
point is the average of all the 263 lowE 
contours at that (normalized)
 time, while the ‘-’ curve (with rise-fall portion) and the ‘+’ 
curve (without) represent the effect of 2 
standard deviations of 
fPC1 on the average curve (like in classic PCA, signs have no 
intrinsic meaning). Qualitatively, we expect tokens with an associated
 negative fPC1 coefficient (‘-’ curve) to belong to the 
c’´etait 
subset, since those curves will tend to have a slower decrease
 in acoustic power between [s] and [t] than the underlying 
/st/ tokens. In accordance with the limited proportion of variance
 explained by fPC1, the distance between the ’+’ curve and 
the ’-’ curve is not very large. The ‘+’ curve shows a sort of 
plateau that could be attributed to a gesture related to the articu-

Figure 2: Empirical 
densities 
for 
the 
fPC1 
scores 
of 
lowE 
contours 
of 
the 
fully 
reduced 
c’´etait 
token 
set 
and 
the 
underlying 
[st] 
tokens. 


lation of an intervening [e]. It should be noted, however, that 
manual spectrographic inspection of the materials suggested 
that the energy in the plateau area might be attributable to [s] 
frication rather than to genuine formants. 

A t-test on the two empirical distributions (cf. Fig. 2) 
yielded a statistically significant difference (p<:0001); thus, 
the two subsets must be considered as originating from different 
populations. However, by applying k-means with two clusters 
we could separate c’´etait 
tokens from underlying /st/ clusters 
only with 59% accuracy. 

The results of the fPCA analysis suggest that fPC1 clearly 
cannot separate the populations effectively; yet the difference 
between the two distributions leaves open the possibility that 
the c’´

etait 
tokens (several of which have fPC1 scores that are 
never reached by /st/ tokens) are from a population that is characterized
 by and underlying [e] in between [s] and [t]. 

4.2. 
Three-way 
analysis 
We then applied fPCA analysis to the complete set of 449 tokens.
 The results are summarized in Figs. 3 and 4. fPC1 explains
 63.4% of the total variance. The increase in the proportion
 of explained variance (compared to the analysis of the two 
sets of voiceless tokens) is not surprising. After all, the subset
 that contains clear traces of a vowel can be considered as 
structurally different from the completely voiceless tokens. 

An ANOVA on the complete set of fPC1 scores was carried
 out, with three groups, i.e. c’´

etait 
with a trace of a vowel, 
c’´etait 
without a trace of a vowel and underlying /st/. We obtained
 F 
(2, 
442) 
= 
595:91;p 
< 
:0001. A Tukey HSD post-
hoc test revealed that the means of all three groups were statistically
 different from each other (p<:0001 
in all cases). So, 
we see again that there is support for the hypothesis that we are 
dealing with three different underlying populations. 

k-means clustering with two clusters yielded 94.5% agreement
 with the set annotated as containing voicing between [s] 
and [t] on the one hand and the union of the sets annotated 
as fully reduced c’´etait 
and underlying /st/ clusters. However, 
clustering with k 
=3 
did not yield meaningful results. This 
corroborates the previous finding that the two sets of voiceless 
tokens cannot easily be separated. 

2201 



t 
principal 
component 
(PC1) 
of 
lowE 
contours 
of 
the 
full 
token 
set 
(comprising 
three 
classes). 
Average 
signal 
(solid 
line) 
and 
+/-2 
standard 
deviations 
of 
fPC1 
( 
‘+’ 
and 
‘-’ 
curves). 


5. 
Discussion 
and 
Conclusion 
To the best of our knowledge this paper presents the first application
 of functional Principle Components Analysis for investigating
 the presence or absence of differences in the dynamics of 
the [s] to [t] transition between underlying /st/ clusters and [st] 
tokens that result from the deletion of an intervening vowel /e/ 
in French c’´

etait. In the past, Functional Data Analysis has only 
been applied in phonetic research to obtain very accurate time 
alignments (cf. [1][3]). To represent the dynamics of speech 
phenomena that may or may not contain a vowel between [s] 
and [t] we used the contours of the speech power as function-
valued observations. 

Our results strongly suggest that fPCA can indeed be used 
for conducting statistical analyses on sets of data that are functions 
rather than scalars) or vectors. Both the fPC1 contours 
(relative to the average contours) and the empirical distributions 
of the fPC1 scores for the tokens in the three sub-sets indicate 
that there are three underlying populations, rather than two (the 
sub-set with vowel present and the union of the two fully voiceless
 sub-sets). Still, despite the fact that the difference between 
the distributions of the two voiceless sub-sets is statistically significant,
 the overlap is very substantial. 

The long-term goal of our research is to investigate the contribution
 of fine phonetic detail to speech comprehension and 
its role in speech production. FDA (and fPCA in particular) has 
proved to be a powerful tool for analyzing speech dynamics. In 
ongoing research we are investigating whether FDA can also be 
applied to speech features that involve non-linear processing, so 
as for example the extraction of formants. It is interesting to see 
whether the heuristic strategies needed to decide whether some 
spectral peak is a formant or not interfere with FDA analysis. 

The phonetic question that was at the basis of this study, viz. 
whether the reduction in French c’´

etait 
is a gradual or rather a 
categorical process cannot be answered conclusively. While we 
have found statistically significant differences between the distributions
 of underlying /st/ clusters and [st] clusters that result 
from vowel deletion, there are several potentially relevant factors
 that were not controlled for in this corpus-based study. A 
study is under way to investigate the effects of prosody on the 

Figure 4: Empirical 
densities 
for 
the 
fPC1 
scores 
of 
lowE 
contours 
of 
the 
three 
token 
sets, 
separately 
computed 
for 
the 
two 
classes 
of 
c’´

etait 
tokens 
and 
the 
/st/ 
clusters. 


dynamics of [s] to [t] transitions. After all, French c’´

etait 
tends 
to also differ from words with underlying /st/ clusters in that it 
is rarely stressed, and in that it is very often phrase initial. 

In conclusion, we can say that FDA is an extremely promising
 tool in the study of fine phonetic detail. At the same time, 
the interaction between fine phonetic detail and other phonetic 
variables, notably prosody, is so strong that novel experimental
 designs may need to be developed to come to grips with the 
intricacies of speech dynamics. 

6. 
Acknowledgements 
The research of Michele Gubian is supported by the Marie Curie 
Research Training Network Sound-to-Sense 1. 

7. 
References 
[1] Byrd, D., Lee, S. and Campos-Astorkiza, R. (2008) Phrase boundary
 effects on the temporal kinematics of sequential tongue tip 
consonants. J. 
Acoust. 
Soc. 
Am., Vol. 123, pp. 4456-4465. 
[2] Carlson, R. and Hawkins, S. (2007) When is phonetic detail a 
detail? Proc. 
ICPhS 
XVI, pp. 211–214. 
[3] Koenig, L. R., Lucero, J. C. and Perlman, E. (2008) Speech production
 variability in fricatives of children and adults: Results of 
functional data analysis. J. 
Acoust. 
Soc. 
Am., Vol. 124, pp. 3158– 
3170. 
[4] R Development Core Team (2008) R: 
A 
language 
and 
environment 
for 
statistical 
computing, R Foundation for Statistical Computing,
 Vienna, Austria. URL http://www.R-project.org. 
[5] Online: http://www.functionaldata.org. 
[6] Torreira, F., Adda-Decker, M., and Ernestus, M. (submitted). The 
Nijmegen Corpus of Casual French. 
[7] Ramsay, J. O. and Silverman, B. W. (1997) Functional 
Data 
Analysis,
 Springer-Verlag New York, Inc. 
[8] Ramsay, J. O. and Silverman, B. W. (2002) Applied 
Functional 
Data 
Analysis 
-Methods 
and 
Case 
Studies, Springer-Verlag New 
York, Inc. 
[9] Sarkar, D. (2008) Lattice: 
Multivariate 
Data 
Visualization 
with 
R, 
Springer. 
1http://www.ling.cam.ac.uk/s2s/ 

2202