Master Thesis
Submitted in September 2007 by
Guillaume Beraud-Sudreau
Master of Cognitive Science / Master de Sciences Cognitives
École Normale Supérieure / École des Hautes Études en Sciences Sociales / Université Paris V
Internship Directors :
Contents

1 Introduction
    Within-language variability
2 Presentation
    Experiments achieved
    Enhancements
Conclusion
Appendix: Comparison between the proposed clustering method and the "K-means over Trajectories" method
    Basic concept
1 Introduction

1.1 The logical problem of language acquisition
During their first years of life, babies spontaneously learn the language spoken by the adults around them. This acquisition, which is remarkably fast, relies on mechanisms that are still unknown.

In 1979, N. Chomsky proposed the concept of Universal Grammar, according to which the acquisition of language rests on innately specified, specialized mechanisms [Chomsky, 1979]. This point of view is supported by the observation that babies acquire the capacity to generate an infinite set of sentences from a finite, small, and noisy set of stimuli. In addition, the input provided to infants contains only positive examples of structures that are admissible in their language, with no counter-examples and no negative feedback. The impossibility of inducing an infinite language from partial and finite data hence motivated the idea that infants come equipped with knowledge about the subject matter, which takes the form of an innate universal grammar (see also [Osherson et al., 1984]).
Such a priori arguments are based on an analysis of the structure of language competence in the adult, not on real acquisition data. They therefore leave unspecified many aspects of the actual learning mechanisms that may be used by human infants (see Elman et al. and Marcus for an extended debate on nativism versus empiricism). Here, we propose to explore the opposite approach, namely, to examine how far one can go with rather simple statistical learning mechanisms or signal processing algorithms running on actual speech data. Our aim is hence to specify which part of the human language competence has to be prewired, and which can be extracted from the data using statistical procedures.
1.2 Experimental studies of early language acquisition
Human babies learn a lot about their language before they speak, but experimental studies of such early language acquisition only appeared 30 years ago, with the development of new experimental techniques (such as the High Amplitude Sucking Paradigm, the Preferential Looking Paradigm, or the Head Turn Procedure [Eimas et al., 1971]). In the 90's, brain imaging techniques allowed a more direct observation of the cerebral activity of babies (using Event-Related Potentials [Dehaene-Lambertz and Dehaene, 1994], Near Infrared Spectroscopy [Pena et al., 2003], or functional Magnetic Resonance Imaging [Dehaene-Lambertz et al., 2002]). These new experimental techniques, based on indirect measures instead of the oral productions of the baby, enabled researchers to gather a great amount of new data and observations regarding the first steps of speech acquisition.
One of the surprising findings of these studies is that newborns and preverbal infants are extremely competent at making very detailed phonetic distinctions. They can discriminate minimal phonetic contrasts in stop consonants, which are notoriously difficult to classify. Infants are typically better at discrimination than adults: for instance, 6-month-old English infants, unlike English adults, can discriminate the contrast between dental [t̪] and retroflex [ʈ], which is used in Hindi but not in English [Werker and Tees, 1984a; 1984b]. They can, however, ignore changes in talker when making consonant or vowel discriminations [Kuhl, 1983]. In addition, human newborns can determine whether pairs of sentences belong to the same language, even if they have never heard it before [Nazzi et al., 1998], and they can discriminate multisyllabic utterances on the basis of their number of syllables [Bertoncini and Mehler, 1981].
During the first year of life, perception becomes more and more specific to the language(s) spoken in the environment, and contrasts that are not used in these language(s) are less and less well perceived. As far as phonetic categories are concerned (the consonants and vowels of the language), this tuning starts with the acquisition of the vowel categories, which seems to take place around 6 months of age [Kuhl et al., 1992], followed by the acquisition of the consonants and the loss of the non-distinctive contrasts at the end of the 1st year [Werker and Tees, 1984a]. At the same time, infants learn the sequential constraints which govern the ordering of consonants and vowels: at 9 months, a baby can detect which sequences of phonemes are allowed and which are not [Friederici and Wessel, 1993]. During the second half of the first year of life, infants also develop increasingly specialized processing of the prosody of their language [Hirsh-Pasek et al., 1987], as well as of the cues which signal word boundaries [Jusczyk et al., 1996]. This early tuning process seems to go hand in hand with a more definite specialization of the left hemisphere for language [Dehaene-Lambertz et al., 2006].
Although the exact mechanisms governing the acquisition of language are still unknown, it is agreed that the baby uses the perceived speech stream to carry out a statistical learning of the particularities of his language. This learning is based on the analysis of regularities within the language. For instance, it has been shown that babies are sensitive to the distribution of sounds (monomodal vs. bimodal distributions), and can build categories according to these distributions [Maye et al., 2002]. Babies are also sensitive to other statistical regularities, such as transition probabilities between segments [Mattys and Jusczyk, 2001], as well as the presence of complementary distributions between segments [Peperkamp and Morgan, in press].
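This mode-seeking idea can be illustrated with a toy sketch (our own illustration, not the procedure of Maye et al.; the VOT-like values, the two-means test, and the 0.2 threshold are all invented assumptions): given samples along one acoustic dimension, a learner can ask whether splitting them into two categories explains the data much better than a single category.

```python
import numpy as np

def two_means_1d(x, iters=50):
    """Toy 1-D k-means with k=2; returns the two centers and assignments."""
    c = np.array([x.min(), x.max()], dtype=float)  # spread initial centers
    for _ in range(iters):
        assign = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                c[k] = x[assign == k].mean()
    return c, assign

def looks_bimodal(x, ratio=0.2):
    """Crude test: is the within-cluster variance of a 2-cluster fit much
    smaller than the total variance? (The `ratio` threshold is arbitrary;
    it must be below ~0.36, the reduction 2-means achieves on a single
    Gaussian, for the test to reject unimodal data.)"""
    c, assign = two_means_1d(x)
    within = np.mean((x - c[assign]) ** 2)
    return within < ratio * x.var()

# Two VOT-like modes (e.g. ~10 ms voiced vs ~60 ms voiceless) vs one mode
rng = np.random.default_rng(0)
bimodal = np.concatenate([rng.normal(10, 4, 200), rng.normal(60, 4, 200)])
unimodal = rng.normal(35, 4, 400)
print(looks_bimodal(bimodal), looks_bimodal(unimodal))  # → True False
```

A real learner faces many dimensions and unknown category counts at once, which is exactly why the clustering question addressed later in this report is hard.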
1.3 Objective of this report
The objective of this report is to present a model of phonetic category acquisition. As we have seen previously, this acquisition is completed before one year of age, which means that infants cannot be using lexical knowledge to drive it. Indeed, at the end of the first year of life, the recognition lexicon of infants is believed to contain only around 20 words. Hence, the data suggest that the only possible strategy for such an early acquisition is an unsupervised algorithm looking for modes in the phonetic distribution [Maye and Gerken, 2002]. Yet, this leaves open a number of hard questions regarding the particular clustering algorithm which is used, as well as the type of preprocessing which is applied to the acoustic signal to maximize the chances of finding proper phonetic categories.
1.4 Difficulties of phonetic category acquisition

1.4.A Sources of variability

i: Continuity of the speech signal
The physical substratum of speech is continuous, both in the spectral dimension and in the time dimension. The segmentation of this signal into discrete phonetic units requires finding simultaneously the spectral and the temporal boundaries of these units. Yet, these boundaries are difficult to find for two reasons: they vary widely across languages, and they are also highly variable within a language.
ii: Cross-language variability
The fact that languages vary in the physical realization of phonetic categories is best illustrated by the inventory of vowels. Vowel categories have been described in terms of prototype theory [Kuhl, 1992; Lacerda, 1995]: they have a center, or best exemplar, and a domain, where exemplars further and further from the center are perceived as less and less typical. For instance, French and English differ in the center and domain of most of their vowels. Kuhl nevertheless proposed that such cross-linguistic variation might be limited by the existence of innate psychophysical boundaries that have to be respected by linguistic systems. Although an innate 30 ms psychoacoustic boundary has been proposed for the durational cue distinguishing voiced from unvoiced consonants (Voice Onset Time) [Pisoni et al., 1980], such boundaries have not been documented for vowels or for other spectral characteristics of consonants.

The language-specific phonetic categories hence appear to be primarily the result of some clustering process applied to the speech sounds of a given language in a space of spectral or spectro-temporal parameters.
iii: Variability in temporal segmentation
A less well studied domain of variation regards the temporal axis. Here, we are not talking about temporal cues for phonetic contrasts (such as VOT, or the slope of formant transitions), but about the way in which a given stretch of speech is segmented into consecutive phonemes. For instance, the non-word "ebzo" is considered to have 2 consonants and 2 vowels by English listeners, but Japanese listeners will report hearing 2 consonants and 3 vowels ("ebuzo"). These perceptual differences are mainly due to the fact that sequences of consonants are allowed in English, while they are not possible in Japanese. The Japanese perceptual system interprets the small formant transition present between /b/ and /z/ as a full vowel, which turns the sequence into a legal one in the language. In contrast, the English or French perceptual system simply ignores this information, or interprets it as being part of the respective consonants [Dupoux et al., 1999]. This problem is very general and not restricted to the perception of illusory vowels. For instance, let us consider the sequence "ts". In some languages (e.g. Italian), it is considered a single affricate phoneme [tˢ]; in others (e.g. French), the concatenation of two phonemes, "t" and "s"; in yet others (e.g. Canadian French), a contextual allophone of the phoneme [t].

In order to construct a model of phonetic category acquisition, it is hence important to pay attention to this double aspect of parsing, simultaneously along the time and the spectral dimensions.
iv: Lack of a priori knowledge
These difficulties are increased by the lack of a priori knowledge about the phonetic categories to learn. In particular, the number of categories to create is a priori unknown, and varies greatly among languages. For instance, some languages use 3 vowels, while others use 20. In the same way, the number of consonants in a given language can vary from 6 to almost a hundred. On the other hand, the average duration of phonetic units seems relatively stable across languages (between 50 ms and 150 ms). This duration window is probably a very useful clue for building the phonetic units.

It is worth noting that the lack of a priori knowledge regarding the phonetic categories makes their acquisition even more difficult if one considers that this task has to be achieved in a fully unsupervised way (i.e., the learning is not guided by any information other than the speech signal itself).
v: Within-language variability
Finally, even if one focuses on a single language, the physical signal available for the acquisition of the phonetic categories is very noisy. This is due to three factors. First, the speech stream which reaches the ears of infants is mixed with a variety of nonspeech noises (environmental noises), and the speech signal is itself distorted (filtering, reverberation). Second, the phonetic realizations of the categories are not the same across individuals: differences in gender, age, and vocal tract size considerably modify the acoustic properties of the speech segments. Finally, and most importantly for us, the phonetic categories are also variable within individuals. This source of variation, highly specific to speech signals, is generally called the coarticulation effect. It is due to the fact that speech is produced by a complex set of independently controlled articulators. In order to minimize energy expenditure, our motor control system tends to anticipate future segments by preparing the articulators for their upcoming targets. For instance, in the syllable /Su/ (as in shoe), the fricative is already rounded in preparation for the rounded vowel; in /Si/ (as in she), in contrast, the fricative is not rounded. The result is that the spectral characteristics of the fricative are contaminated by the following vowel. Conversely, due to the inertia of the articulators, consonants and vowels are also influenced by the preceding segments. Hence, depending on the context, the physical realization of a phoneme will vary strongly from one occurrence to another. Some researchers have even proposed that there are no invariant properties of segments, because segments are planned as entire articulatory gestures [Fowler et al., 1993].

These noise sources, which strongly disturb the mapping between the abstract phonemes and their acoustic realizations, do not seem to disturb either infant acquisition or adult comprehension of the speech signal.
1.4.B Simplifications adopted in this report
In this report, we ignore the first two sources of variability by considering a single speaker in a quiet, non-reverberating environment. These precautions allowed us to focus on what we consider the main difficulties, in particular the coarticulation problem and the unsupervised nature of phonetic acquisition.
1.5 Local feature representations of speech
Early models of speech acquisition have assumed that the speech stream is coded into a low-dimensional space of local features (see [Eimas et al., 1975], for a theory of phonetic features). Speech recognition systems typically encode the speech stream into a low-dimensional vector of local spectral features (for instance, a vector of 20 or so MFCC or LPC coefficients). These coefficients are computed every 5 ms, and represent an analysis of the speech stream over a time window of around 10-30 ms. Unlike the proposal of Eimas, however, these features have not been claimed to have a linguistic interpretation, but are used solely for efficiency reasons. Finally, neurophysiological models of speech perception also propose an early representation of speech in terms of local features: the cochlea performs a spectral decomposition of sounds, and the coding in the auditory nerve has been described as filtering through a basis of short-term wavelet-type functions [Smith and Lewicki, 2006]. All of these models share the idea that the speech perception system is organized hierarchically, starting with local acoustic features, which are used to build higher-order structures such as phonemes, syllables, and words.
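Such a local-feature front end can be sketched as follows (a simplified illustration: real systems go on to compute MFCC or LPC coefficients from each frame, and the 16 kHz sampling rate and window sizes here are merely assumed for the example).

```python
import numpy as np

def log_spectrogram(signal, sr=16000, win_ms=25, hop_ms=5):
    """Slice `signal` into overlapping frames (one every `hop_ms`) and
    return one log-power spectrum per frame: a local spectral vector,
    in the spirit of the MFCC/LPC front ends described above."""
    win = int(sr * win_ms / 1000)        # samples per analysis window
    hop = int(sr * hop_ms / 1000)        # samples between frame starts
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i*hop : i*hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)    # taper to reduce spectral leakage
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(spec + 1e-10)          # one row per 5 ms frame

# A 100 ms, 440 Hz tone yields one spectral vector every 5 ms
t = np.arange(int(0.1 * 16000)) / 16000
feats = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # → (16, 201)
```

Each row of `feats` is the kind of local, short-window description whose limits are discussed in the next section.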
1.5.A Limits of local features
The main deficit of feature approaches is that they are too local to adequately capture the sources of variation that affect the acoustic properties of phonetic categories. Psychoacoustic models have for the most part failed to provide features with the required degree of invariance with respect to coarticulation [Blumstein and Stevens, 1979]. Speech recognition systems compensate for this defect of features by providing another, much more holistic representational level: the level of words. Words are represented as sets of nodes or states connected through transition probabilities, each state representing a probability distribution over feature space. Such models, called Hidden Markov Models (HMMs), are very successful at achieving invariant recognition of words, but they are heavily trained in a supervised fashion (the word or phonetic tags are provided to the learning system). Obviously, such an approach is not available to an unsupervised learning system.
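The word-level HMM idea can be made concrete with the standard forward recursion (a generic textbook sketch, not a full recognizer; the two-state left-to-right model and its probabilities below are invented for illustration): states connected by transition probabilities score a sequence of feature frames through their emission distributions.

```python
import numpy as np

def forward_log_likelihood(obs_loglik, log_trans, log_init):
    """Forward pass of an HMM: obs_loglik[t, s] is the log-probability of
    frame t under the emission distribution of state s; returns the total
    log-likelihood of the whole observation sequence under the model."""
    alpha = log_init + obs_loglik[0]
    for t in range(1, len(obs_loglik)):
        m = alpha.max()                  # stabilized log-sum-exp over predecessors
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_trans)) + obs_loglik[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# A toy two-state left-to-right "word": state 0 must precede state 1.
log_init = np.log(np.array([1.0 - 1e-12, 1e-12]))
log_trans = np.log(np.array([[0.7, 0.3],
                             [1e-12, 1.0 - 1e-12]]))
# Invented per-frame emission log-likelihoods for a 3-frame stimulus.
obs = np.array([[-1.0, -5.0],
                [-1.2, -4.0],
                [-6.0, -0.5]])
print(forward_log_likelihood(obs, log_trans, log_init))
```

The supervision problem noted above lies not in this recursion but in how the emission distributions are trained: they are estimated from frames whose word or phone labels are given.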
1.6 A template-based model
This report proposes to explore another, more recent class of models, based on more global or holistic representations. The basic idea is that the invariant properties that correspond to phonemes are not present at the level of local features, but rather at the level of entire articulatory trajectories (i.e., chunks of speech of the size of syllables). In such models (see Figure 1), the system is based on segmenting and storing large templates, which are the basis for discovering the more abstract segments, which can in turn be used to recover the even more abstract linguistic features. In other words, this model turns the standard hierarchical model on its head and starts with big units rather than small ones.

This approach makes two important presuppositions: (1) that the problem of segmentation into templates or acoustic syllables is simpler than the original problem of segmentation into phonetic categories; (2) that the acoustic templates provide a representation which is more useful for discovering phonetic categories than the original feature representations.

In this report, we mostly focus on the second presupposition, although we report in the Appendix an original algorithm achieving the syllabic segmentation, based on an even more global point of view and thus coherent with the proposed model.
[Figure 1 diagram: two hierarchies side by side. Classical architecture: Acoustics → Acoustic features → Phonemes → Syllables → Words → Grammar. Proposed architecture: Acoustics → Acoustic syllables → Phonemes → Features → Words → Grammar.]
Figure 1. Comparison between the classical hierarchical architecture based on local features, and our anti-hierarchical approach based on templates. Note that a template model is consistent with the finding that newborn infants can count the number of syllables in the speech stream before they can pay attention to individual segments [Bertoncini and Mehler, 1981].
1.6.A Similarity-based representations
The idea of a similarity-based representation of stimuli was introduced by R. N. Shepard [Shepard, 1968], and more recently developed by S. Edelman [Edelman, 1998], who proposed a visual recognition system based on an analysis of similarities over similarities: visual stimuli were compared to a given set of patterns, and the results of these comparisons were used as a new representation of the stimuli.

Conventional recognition systems often categorize objects according to their similarities with prototypes of the categories to recognize; these similarities are usually treated as the result of the classification, since the category associated with a stimulus is the one whose prototype lies at the minimum distance from the input. Edelman instead proposes to base the analysis on similarities between observations, and to compare these similarities with one another (creating similarities of similarities). These analyses, named second-order isomorphism, allow building much more complex classifications than a simple recognition system based on a similarity measure followed by a winner-takes-all decision. In our work, we follow this concept in the most literal way: the stimuli are first compared with the reference sounds, and the results of these comparisons are then compared together through a clustering algorithm to create the phonetic categories (so these categories are built from an analysis of similarities over similarities).
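Taken literally, this two-stage scheme can be sketched as follows (a toy numerical illustration only: random vectors stand in for spectro-temporal templates, and the dimensions, noise level, and cluster count are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 3 hidden "phonetic" prototypes and 5 stored reference sounds.
prototypes = rng.normal(size=(3, 12))
references = np.concatenate([prototypes, rng.normal(size=(2, 12))])

# Stimuli are noisy realizations of the prototypes.
labels = rng.integers(0, 3, size=90)
stimuli = prototypes[labels] + 0.2 * rng.normal(size=(90, 12))

# Step 1: first-order similarities -- each stimulus is re-represented by
# its (negative squared) distance to every reference sound.
sim = -((stimuli[:, None, :] - references[None, :, :]) ** 2).sum(axis=2)

# Step 2: compare the similarity vectors to one another through a plain
# k-means clustering, instead of a winner-takes-all over references.
def kmeans(X, k, iters=50):
    centers = [X[0]]                      # farthest-point initialization
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        a = ((X[:, None] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.stack([X[a == j].mean(axis=0) if (a == j).any()
                            else centers[j] for j in range(k)])
    return a

found = kmeans(sim, 3)
# Clusters should align with the hidden prototypes (up to label permutation).
```

Note that the clusters are discovered in the 5-dimensional similarity space, not in the original 12-dimensional stimulus space: the categories are literally built from similarities of similarities.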
The idea of using a templatic representation of sounds has already been applied in practical work. In particular, M. Coath and S. L. Denham [Coath and Denham, 2005] studied a representation based on comparisons with syllabic templates, with the aim of recognizing spoken digits. They propose to represent the sounds as responses to convolution filters, each filter corresponding to a feature to detect. This operation produces a distance between the stimuli and several instances of the sounds to detect. This new representation of the stimuli is relatively comparable to the one we are proposing: unlike classical speech recognition systems, their model uses the comparisons to create a new representation, and bases recognition on this new representation, instead of classifying the stimuli by a simple winner-takes-all test.

However, the authors compare the syllables using convolution products, which do not provide enough robustness to obtain satisfying results. They also consider only the global result of the sound comparisons, and consequently cannot efficiently distinguish units smaller than the templates used for comparison. This limitation is not a problem for the authors, as they focus on digit recognition rather than phonetic acquisition. Finally, the authors do not report error rates or any efficiency comparison between their system and conventional representations.
Similar ideas are extensively applied to object recognition. In particular, comparing an observed shape with multiple templates of a given set of objects greatly decreases the difficulties related to the rotation or illumination of this object. This strategy has been successfully used in face recognition, providing robustness to face orientation and illumination [Beymer and Poggio, 1996].

These applications, however, raise more difficulties than sound representation, as they involve comparisons of 2-dimensional images of 3-dimensional objects: such comparisons necessarily require more complex considerations than comparisons between speech streams, which are 1-dimensional.
Finally, experiments have been conducted on humans in order to support the concept of visual representation by similarities. For instance, F. Cutzu and S. Edelman [Cutzu and Edelman, 1996] proposed an experiment where subjects indirectly rebuilt the relations between 3-dimensional objects: several three-dimensional animal-like shapes were generated, controlled by a high number of parameters hidden from the subjects. Pairs of pairs of pictures of these shapes were then presented to the subjects, who had to decide which pair was the most similar; from the answers, relations between the animals were rebuilt by multidimensional scaling. These relations, determined from the subjects' judgments, corresponded to the (hidden) relations between the objects in parameter space. This experiment suggests that humans are sensitive to the relations between objects (recognition of objects by features, for instance, would not lead people to reproduce the relations between the shapes presented in this experiment).
It is interesting to highlight that, although it is common to represent data using similarities with reference templates, these templates are usually taken to be prototypes of the objects to identify or, sometimes, subparts of these objects. The model proposed hereafter, on the other hand, aims to identify units (phonemes) that are subparts of the patterns used for the comparisons (syllables).
Here, we propose a new representation of the speech signal based on syllabic templates. This representation can be considered as a first step in the infant's acquisition of phonetic categories, as it facilitates the acquisition of the phonemes of the language. In this report, we will show that this representation helps in learning a given language, while making it more difficult to discriminate phonemes that are not part of this language.
2.1 Presentation

2.1.A Structure of the model
The proposed model relies on an original representation of speech stimuli, based on multiple comparisons between perceived stimuli and a bank of reference sounds (we suppose this set of sounds has been previously stored).

In order to perform these comparisons, it is necessary to segment the speech stream into relatively short templates, of the same form as the reference sounds. This segmentation is a prerequisite for comparing new sounds with the reference base (and, consequently, can be considered a first step in the acquisition of language).

A similarity measure is obtained by comparing every new sound with every reference template. The similarity between a new stimulus and one reference sound constitutes one dimension of the new representation of that sound. The resulting representation thus transforms a sequence of physical values (the physical, or at least acoustic, representation of the speech stream) into a set of sequences of similarities (one similarity per reference sound).

Before the comparison, and in order to improve its efficiency, the sounds are warped. The distortion applied to the sounds can itself carry useful information about the relation between the new stimuli and the reference sounds; it can therefore be valuable to include it in the clustering step. In the experiments presented later in this report, we quantify the gain obtained by using this information.

The acquisition of phonetic categories strictly speaking is achieved in a second step, using the computed representation. We will show in this report that the new representation makes these categories easier to learn, and thus that this representation can be considered as a first step in their acquisition.
[Figure 2 diagram: the acoustic signal (1) is transformed into a low-dimensional spectral feature space (2); segments of it are stored to form the reference base (3); new stimuli are matched (C) against the reference base to produce a high-dimensional similarity space (4).]
Figure 2. Global presentation of the proposed algorithm: the physical signal (1) is transformed (A) into an acoustic representation (2; for the practical implementation of the model, MFCC coefficients are used for efficiency). These data are segmented, and part of them is stored (B) to create the reference base (3). In a second step, new stimuli are compared with this reference base (C), to create a similarity representation (4).
2.1.B Produced representation
The representation obtained after the comparisons is based on the reference templates, which means that each of its dimensions corresponds to one of the reference sounds. For instance, if the syllables [ba], [na], and [ra] have been stored in the reference base, one dimension of the new representation will correspond to the similarity between the stimuli and the reference sound [ba], another to the reference [na], and so on for the other references (cf. Figure 3).
Figure 3. Projection "ma" vs. "mi": two axes of the new representation; the horizontal axis corresponds to the reference template "ma", while the vertical axis corresponds to the sound "mi". Each point corresponds to a frame (5 milliseconds long), indicated by its label. Notice that the "m" sounds have a high similarity along both axes, while the phonemes "a" and "i" have a high similarity along their corresponding reference axis only; the other phonemes match neither axis and lie close to the origin. This figure displays two axes only, but the representation is high-dimensional (with one dimension per reference sound).
2.1.C Choice of the reference sounds
The nature of the reference sounds is fundamental for an efficient representation of the speech stimuli. We assumed that syllabic units were optimal. This assumption was motivated by the following reasons.

The coarticulation effect is very strong within syllables, and thus creates very strong noise if units smaller than syllables are considered. The use of syllabic units cancels this noise, as different instances of the same phoneme (appearing in different contexts) can be stored. On the other hand, the coarticulation effect is very weak between syllables, so it is not necessary to consider units longer than syllables: such units would increase the complexity of the representation without providing any gain.

This consideration is also coherent with observations of newborn babies. The use of syllabic templates for sound comparisons requires babies to be able to detect such units in the speech stream. It has been shown that newborns have this ability, as they can discriminate between strings of bisyllables and trisyllables (which means they can count the number of syllables in the speech stream [Bertoncini and Mehler, 1981]).

Finally, this hypothesis has been successfully tested on our model: experiments have shown the superiority of a syllable-based representation for phoneme acquisition, which is, as we have seen, coherent with both theoretical considerations and experimental observations.

The exact mechanism allowing babies to achieve this segmentation is still unknown, and will not be the focus of this report. Nevertheless, a syllabic segmentation mechanism is proposed in the Appendix.
2.2 Properties of the representation

2.2.A A language-specific representation
The proposed representation of the speech stream differs in several major ways from a "physical" representation such as the spectral representation.

The first noticeable difference is its adaptation to a particular language: the representation is suited to learning the phonemes that appear in the reference base, while it should be inefficient at learning to discriminate phonetic categories that do not appear as distinct in that base. This representation is thus clearly adapted to the learned language, and can be considered a first step in its acquisition. This adaptation is clearly underlined by the tests performed on the proposed implementation of the model.
2.2.B Constant similarities instead of trajectories
As another major difference, it is important to note that our representation transforms data appearing as sequences of physical values into constant similarities. For instance, the physical representation of a phoneme is clearly not constant, but its similarity with a reference sound featuring the same phoneme remains continuously high.

This difference greatly simplifies the learning of the phonetic categories, as a phoneme corresponds to a point, or an area, of the new representation, instead of a complex trajectory in the acoustic representation. On the other hand, phonemes that do not appear in the reference base are not associated with any zone of the new representation (such a phoneme is still represented as a trajectory), so our representation is not suited to learning them.

These differences offer an interesting way of understanding discrimination abilities (or inabilities) as grounded in the specificity (or limitations) of the representation, prior to any clustering.
2.2.C The case of affricates
Noticeable examples are the affricate phonemes. For instance, let us consider a language including the phoneme [tˢ]. A reference base acquired from such a language would incorporate a set of templates including [tˢ], so this sound would correspond to an area of the new representation. New stimuli containing the phoneme [tˢ] would create a cluster in this area, and [tˢ] would be treated as an independent phoneme.

On the other hand, a representation built on a language containing the phonemes [t] and [s], but not [tˢ], would have no constant area corresponding to [tˢ]. If a sound [tˢ] is represented using a reference base built from this language, it does not correspond to a fixed area (it successively matches [t], then [s]). Under these conditions, the sound [tˢ] is treated as the concatenation of two phonemes rather than as a third one.

This property would enable any acquisition technique based on the proposed representation to easily learn discriminations between phonemes that could not be discriminated using local information only.
2.3
2.3.A
The objective was to compare the efficiency of the proposed representation depending on the speech templates considered (for instance, to compare syllabic templates with other kinds of segmentation).
2.3.B
Detailed algorithm
i:
This step was, for most of the trials, achieved in a supervised way. A given number of realizations of every possible syllable of the considered language were randomly selected to compose the reference base. This procedure yielded balanced reference bases, created from optimal templates.
The balance of the reference base is probably not mandatory, as long as every syllable is represented (in the case of an unbalanced template base, the results would probably be equivalent to those of a balanced base with the same number of elements as the least represented syllable of the database).
However, in some experiments, the reference base was built in an unsupervised way (in these cases, it will be specified).
We also developed and implemented an unsupervised segmentation algorithm, which can produce relevant templates. This algorithm is described in annex.
ii :
The comparisons between the database templates and the new stimuli are achieved using the Dynamic Time Warping (DTW) algorithm [Myers and Rabiner, 1981]. The DTW algorithm compares two sounds, deforming one of them if necessary, in order to match the parts of the sounds that most probably correspond to each other.
This comparison method offers robustness against changes in the speed of the speech stream, and synchronizes the templates prior to matching them together.
In some cases, the 1-pass DP algorithm was preferred (1-pass DP is a generalization of the DTW algorithm, allowing one sound to be compared to many references simultaneously, or one continuous stimulus to be compared to a shorter reference, repeating it as many times as necessary); in these cases, the comparison method and the reason for this choice will be specified.
In addition, we extracted from the comparisons information about the distortions that have to be applied to the sounds in order to make them match. To collect a time-distortion measure, which gives additional information about the comparison between the sounds, the optimal path obtained by the DTW algorithm is considered. The distance between the derivative of this path at a given moment and a linear distortion of the reference sound provides a dissimilarity measure (the higher the value of this distortion distance, the more the sounds had to be warped before matching).
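A minimal sketch of how such a time-distortion measure could be computed from a DTW warping path (the function name, its arguments, and the exact deviation formula are assumptions; the report does not give an explicit formula):

```python
import numpy as np

def time_distortion(path, len_ref, len_stim):
    """Distortion measure derived from a DTW warping path.

    `path` is a sequence of (t_ref, t_stim) index pairs, as produced by a
    DTW alignment.  A purely linear distortion would advance through the
    reference at the constant rate len_ref/len_stim; this sketch measures
    the mean deviation of the local slope of the path from that rate.
    """
    path = np.asarray(path, dtype=float)
    steps = np.diff(path, axis=0)                 # local (d_ref, d_stim) moves
    # local slope of the path (rate of progress through the reference)
    slope = steps[:, 0] / np.maximum(steps[:, 1], 1e-9)
    linear = len_ref / len_stim                   # slope of a linear distortion
    return float(np.mean(np.abs(slope - linear)))
```

A perfectly diagonal path between two sounds of equal length yields a distortion of zero.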
iii :
Obtained representation
Given the input data, the representation obtained after comparison is an N- or 2N-dimensional signal (if the reference base counts N distinct templates, the comparisons produce N similarities and N distortion distances), sampled every 5 milliseconds.
From a computational point of view, this representation is relatively expensive to build. Therefore, the proposed algorithm should not be considered as efficient from an engineering point of view.
iv :
Depending on the experiments, the clustering achieved over the new representation uses supervised or unsupervised algorithms. In the supervised case, a perceptron is used, optimized with the Rprop algorithm (Rprop achieves a supervised training of a perceptron network, using a backpropagation optimization technique; see [Riedmiller and Braun, 1993] for further details).
The training and the recognition of the phonetic categories were achieved from a frame-wise point of view: every frame was considered as an independent datum, and the order of the frames was not used for the clustering or the recognition. This point of view is clearly simplistic. However, our objective was to compare different representations of the speech signal, and for this task it was relevant to use a simple clustering algorithm. We only suffered from this simplicity in the unsupervised tests, where the proposed algorithms were unable to converge toward the appropriate clusters.
In all cases, the error rate measured was the percentage of misclassified frames (so this error rate was frame-wise).
In every experiment, a similar clustering algorithm was applied to the MFCC coefficients and their first two derivative coefficients, in order to judge the gain obtained using the proposed representation.
For the learning, we used a training set including 34 realizations of every possible syllable (so from 34x9=306 sounds for the smallest sets to 34x36=1224 sounds for the biggest set; each sound contains an average of 40 to 50 frames, depending on the considered set).
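The frame-wise supervised clustering described above can be sketched as follows (a simplified sketch: a plain multi-class perceptron with fixed-step updates stands in for the Rprop-trained network, and all names are illustrative):

```python
import numpy as np

def train_framewise_perceptron(frames, labels, n_classes, epochs=20, lr=0.1):
    """Frame-wise supervised clustering sketch.

    Each frame (one row of `frames`) is treated as an independent sample;
    temporal order is ignored, as in the report.  Assumption: plain
    perceptron updates replace Rprop's adaptive step sizes.
    """
    n, d = frames.shape
    W = np.zeros((n_classes, d + 1))              # one weight row per class (+bias)
    X = np.hstack([frames, np.ones((n, 1))])      # append a constant bias input
    for _ in range(epochs):
        for x, y in zip(X, labels):
            pred = int(np.argmax(W @ x))
            if pred != y:                         # classic perceptron update
                W[y] += lr * x
                W[pred] -= lr * x
    return W

def framewise_error_rate(W, frames, labels):
    """Percentage of misclassified frames, as in the report."""
    X = np.hstack([frames, np.ones((len(frames), 1))])
    preds = np.argmax(X @ W.T, axis=1)
    return float(np.mean(preds != labels))
```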
v:
From a computer-science point of view, our approach can be considered as related to the DTW-distance-based K-means algorithm (an application of such an algorithm is proposed in [Somervuo and Harma, 2004]). This algorithm computes a K-means clustering directly on sequences of points, using the DTW algorithm to compare the different stimuli to the centroids. It would not need any intermediary representation to compute a relatively efficient clustering of the phonetic categories. But this algorithm can only provide a clustering of entire sound sequences, so, in order to converge to phonetic categories, it would require the speech to be segmented into phonetic units.
More details on this algorithm and its connection with the proposed model are given in annex.
2.3.C
Input data
The input data provided to our algorithm were encoded using the first 13 MFCC coefficients, sampled every 5 ms. The choice of the MFCC coefficients as input representation was motivated by the fact that these coefficients are directly extracted from a physical representation of the speech signal (being the "Fourier transform of the Fourier transform"), and at the same time relevant for speech recognition.
For the tests achieved on the algorithms, we considered 4 pseudo-languages.
All these stimuli were recorded by the same male speaker, in a quiet and non-reverberating environment. In order to evaluate the results of the proposed experiments, all the data were manually labeled (although some of the proposed experiments are based on an unsupervised learning system, and thus do not actually rely on these labels).
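The "Fourier transform of the Fourier transform" idea behind the MFCC coefficients can be illustrated by a plain real cepstrum (a simplification: actual MFCCs additionally apply a mel-scaled filter bank and a discrete cosine transform; this sketch only shows the principle):

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Simplified cepstral encoding of one short speech frame.

    Computes the inverse Fourier transform of the log magnitude spectrum,
    i.e. a "Fourier transform of the Fourier transform", and keeps the
    first 13 coefficients, as in the report's MFCC encoding.
    """
    spectrum = np.abs(np.fft.fft(frame))
    log_spectrum = np.log(spectrum + 1e-12)       # avoid log(0) on silent bins
    cepstrum = np.real(np.fft.ifft(log_spectrum))
    return cepstrum[:n_coeffs]
```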
i:
Easy set : [r m s a e i]
This set was composed of 6 phonemes (a, e, i, r, m and s), selected to be relatively easy to differentiate. These phonemes were used to create syllables with a Consonant-Vowel (CV) structure. These syllables were pronounced independently (thus, the considered data were monosyllabic words).
This set was balanced (every possible syllable appeared the same number of times). Each syllable was recorded 55 times (so this set was constituted of 3 x 3 x 55 = 495 elements, as 3x3=9 syllables can be created with these 3 consonants and 3 vowels).
ii :
Hard set : [p t k u o y]
This set was also composed of 6 phonemes (p, t, k, u, o, y). These phonemes were selected to be difficult to differentiate: in particular, the 3 stops (p, t, k) are extremely hard to differentiate, as are the 3 selected vowels (u, o, y).
These phonemes were also used to create monosyllabic words with a CV structure.
This set was also balanced, and each syllable was recorded 55 times.
iii :
This set was a combination of the 2 sets previously described. It was constituted of CV monosyllabic words, including all possible syllables created from the phonemes of the easy and the hard sets (such as "ru" or "pa").
Like the sets described previously, this set was also balanced, and constituted of 55 realizations of each possible syllable (so it was composed of 6 x 6 x 55 = 1980 elements).
iv :
Polysyllabic set : [R S d m a e i u]
This set contains trisyllabic words, composed of 8 phonemes (R, S, d, m, a, e, i, u). These phonemes are arranged following the same CV structure as in the previous sets, so the recorded trisyllables have a CVCVCV structure. 512 trisyllabic words were recorded. The set was built in such a way that all the phonemes were pronounced the same number of times, in every position.
2.4
Experiments achieved
In order to test the predictions of our model, we carried out several experiments, using the data presented above.
2.4.A
The first test concerned the "efficiency" of the proposed representation for phoneme acquisition. To estimate this efficiency, we measured the frame-wise error rate for the phonetic classification of our data, after either a supervised or an unsupervised clustering.
i:
supervised clustering
This experiment was achieved in order to measure the linear separability of the phonetic categories, for different kinds of representations.
The clustering of the data was achieved using a perceptron, whose parameters were determined using the (supervised) Rprop algorithm [Riedmiller and Braun, 1993].
The perceptron output assigns a label to every input frame, and the error rate was obtained by comparing this label with the actual labels manually attributed to the frames.
We tested the different sets of stimuli under different conditions.
While this experiment showed the superiority of the proposed representation, it gave relatively contrasted results, depending on the sets it was achieved on:
a)
Monosyllabic sets :
The tests over the monosyllabic sets showed a clear superiority of the proposed representation.
For these tests, the considered templates were monosyllabic words, compared together using the DTW algorithm. We used reference bases created with 4 to 12 instances of every possible syllable (there were 36 possible syllables), and tested the discrimination with and without the distortion distance.
[Figure: classification error (training and generalization) as a function of the number of detectors (4x9, 8x9, 12x9), for the MFCC and MFCC+delta2 representations and for the proposed representation, using the spectrum only or the spectrum plus time distortion.]
We achieved the same tests using a more complex representation of the sounds as input for our algorithm: instead of the simple MFCC coefficients, we added the derived coefficients. While the use of these derived coefficients significantly improved the results for direct recognition (clustering directly based on the MFCC coefficients), it didn't show any significant change when used as input of our algorithm, meaning that the information carried by these coefficients is already extracted by our representation.
b)
Polysyllabic sets
Considering the central role of the reference templates in our algorithm, it is relevant to study the impact of their selection on the results.
Figure 6. Polysyllabic set: efficiency of the supervised clustering, over the MFCC coefficients (plus derivatives) and the comparison-based representation, using syllabic and phonetic templates. For the phoneme-based representation, the clustering clearly suffers from over-fitting: although the representation is complex (the learning error rate is lower than the one obtained from the MFCC coefficients), it is inadequate (the rules extracted from the learning set cannot be applied to the generalization set).
These experiments were achieved in "ideal" conditions, as the templates were segmented manually. Of course, any automatic segmentation of the speech signal would be less precise, and generate errors in the template boundaries. But the sounds are warped before the comparisons (using the DTW algorithm), so a high precision of the segmentation is not mandatory.
The result of this experiment, showing the superiority of the syllabic templates, is not surprising: unlike phonetic units, syllabic ones take into account the coarticulation effect, and consequently avoid a very strong source of noise.
The optimality of these syllabic sound units also strongly supports our model, as it has been shown that newborn babies can perceive these units: it would give a role to this early sensitivity (which precedes the phonetic acquisition), and would involve it in the full language acquisition (unlike more classical models).
The unsupervised segmentation algorithm (described in annex) is based on the idea that the most relevant units describe the speech stream optimally. Using this assumption, the algorithm finds a (locally) optimal set of sounds that can, with a minimal distortion, be concatenated to match new speech stimuli in an optimal way.
random templates
For the "random templates", we considered a segmentation of the speech into segments of constant length. We tried several possible lengths, in order to find the optimal one.
Figure 7. 3 different types of segmentation for the creation of the template base: the portions of signal between the red lines will be kept as reference templates. The 1st image corresponds to a syllabic segmentation, the 2nd one to a phonetic segmentation, and the 3rd one to a random segmentation, as the boundaries are fixed at regular intervals, without considering the data contained in the signal. The syllabic and phonetic segmentations are obtained manually.
The comparison was achieved using the 1-pass DP algorithm, independently against every reference sound (so there is no interaction of the best path across the different sounds, and the dimensions are independent).
We also tried to synchronize the paths across the different dimensions (by forcing the path along all the dimensions to come back to the origin of the corresponding reference sounds at the same moment as the optimal path does). However, this type of comparison wasn't more efficient than comparing the stimuli independently with every reference sound.
[Figure: classification error (training and generalization) as a function of the length of the random references (40 to 400 frames), compared with the mfcc+delta, phonetic templates, and syllabic templates conditions.]
As expected, the random templates were not optimal. It should be noted that the best random templates corresponded to an intermediary length, between the length of a phoneme and that of a syllable.
The optimal randomly segmented templates are nevertheless more efficient than the phonetic templates. This result is due to the additional freedom given to the comparisons with the randomly segmented templates by the use of the 1-pass DP algorithm.
These experiments support the idea of syllables as basic perceptual units, prior to the learning of the perception of phonemes.
They also suggest that while an appropriate segmentation improves the efficiency of a later clustering, even a roughly achieved segmentation can give relatively efficient results.
ii :
unsupervised clustering
The same experiment was achieved with unsupervised learning. This experiment should be more relevant in terms of phonetic acquisition, considering that this acquisition is achieved by babies in an unsupervised way (i.e., it rests on a passive listening of the speech, and is not directed by any exterior information).
However, we didn't focus on this part of the learning mechanism in this report: the learning of units such as phonemes involves complex learning algorithms (usually based on Hidden Markov Models (HMM); for an example of such an algorithm, see [Takami and Sagayama, 1992]).
These algorithms usually build allophonic categories, which are merged into phonetic categories
[Peperkamp and Le Calvez, 2003].
We tried to avoid these difficulties by using a simple language, featuring very distinct phonemes. In these conditions, it is possible to obtain a satisfying result with simple algorithms such as K-means or EM (unlike HMM-based algorithms, these algorithms don't use time-related information).
For the more complex languages, the simple clustering methods used didn't allow us to obtain satisfying results: the clusters found, from either the MFCC or the similarity-based representation, did not correspond to phonetic units (the algorithm proposed allophonic units for some phonemes, and merged other ones).
A test was achieved on our simple language (the "Easy set", containing the phonemes [r, m, s, a, e, i]). The simplicity of this language allowed the use of simple unsupervised clustering algorithms (we tested the K-means and EM algorithms).
This test showed a clear superiority of our representation. While the clustering proposed over the MFCC representation proves to be irrelevant, the clustering proposed on our representation corresponds to the categories found by a supervised algorithm. [FIG XXX]
As could be expected, the K-means and EM algorithms obtain the same error rate on our representation, although the 2nd is much more complex. This is due to the special shape of the speech signal in our representation (where the values correspond to similarities with other sounds, and thus "have a meaning" for phonetic clustering).
[Figure 9 plot: unsupervised learning, Easy set [m r s][a e i]: classification error (training and generalization) for the MFCC representation and the 8x9 similarity-based representation, with the K-means and EM algorithms.]
Figure 9. Unsupervised learning of the phonetic categories, using the MFCC (and derived) coefficients versus the similarity-based representation. The similarity-based representation proved to be clearly superior to the MFCC coefficients. In both cases, the number of clusters (corresponding to the 6 phonemes) was provided to the algorithm.
In these experiments, the number of categories to build was provided to the algorithm. However, this additional information is not necessary to find the optimal partition, as shown in Figure 10.
[Figure 10 plot: BIC value (about 3700 to 4100) as a function of the number of clusters.]
Figure 10. Evaluation of the efficiency of the clustering, as a function of the number of clusters created. The efficiency was computed using the Bayesian Information Criterion (BIC). The minimum value of the BIC corresponds to the optimal number of clusters; here, the optimal number of clusters corresponds to the 6 phonetic categories: [m R s a e i].
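A sketch of this BIC-based selection of the number of clusters (assumptions: a deterministic K-means initialization and the simple approximation BIC = n·log(RSS/n) + k·log(n); the report does not specify which BIC variant was used for Figure 10):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain K-means with a deterministic initialization (evenly spaced
    samples of the data sorted on the first coordinate)."""
    order = np.argsort(X[:, 0])
    init = order[np.linspace(0, len(X) - 1, k).astype(int)]
    centroids = X[init].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # assign each point
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    rss = ((X - centroids[labels]) ** 2).sum()    # residual sum of squares
    return labels, rss

def best_k_by_bic(X, k_range):
    """Pick the number of clusters minimizing the BIC approximation."""
    n = len(X)
    scores = {}
    for k in k_range:
        _, rss = kmeans(X, k)
        scores[k] = n * np.log(rss / n + 1e-12) + k * np.log(n)
    return min(scores, key=scores.get)
```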
2.4.B
The representation built strongly depends on the reference base used (as the dimensions of the representation correspond to the sounds of the reference base).
For instance, a well-balanced base containing sounds of the learned language should be optimal for the acquisition of phonetic categories, while learning a language from a representation composed of inadequate sounds (like syllables of another language) should be harder.
In order to test the impact of the language of the reference base, we tested the learning of a language represented on a base containing different phonemes.
We considered 2 languages: the "Hard set" (phonemes [p, t, k, u, o, y]) and the "Easy set" (phonemes [m, r, s, a, e, i]). We tried to discriminate phonemes from the Easy set using a representation based on the Hard set, and vice versa (discriminating the phonemes of the Hard set, represented with a reference base built from sounds of the Easy set).
Both sets contain the same number of phonemes and possible syllables (so the numbers of dimensions of an Easy set-based representation and of a Hard set-based representation are equal).
[Figure 11 plot: classification error (training and generalization) for the four conditions: reference = Easy, learned = Easy; reference = Hard, learned = Easy; reference = Hard, learned = Hard; reference = Easy, learned = Hard.]
Figure 11. Influence of the reference base on the acquisition of the phonetic categories: clustering algorithms were trained to learn phonetic categories from data represented on an adequate representation (when the language of the reference base corresponds to the learned language) or an inadequate one (when the language of the reference base differs from the learned one).
These results clearly demonstrate the impact of the language of the reference base: it is much harder to learn the phonetic categories if the speech is represented on an inadequate base.
This shows that the construction of a reference base adapted to the learned language is a first step in the acquisition of phonetic categories. It corresponds to the observations, as adults are less efficient at discriminating phonemes that do not appear in their native language.
2.4.C
Enhancements
The proposed representation exhibits some important particularities, due to the nature of its dimensions (which are independent similarities). These particularities can be used to adapt the clustering algorithm used to actually create the phonetic categories.
For instance, it can be admitted that each syllable contains a limited number of phonemes (for instance, one could say that, in most languages, most syllables are composed of 3 phonemes or less).
Based on this assumption, and noticing that the detectors are syllabic templates, some observations can be made on the shape of the categories to create. One reference sound should contain a limited number of phonemes (usually up to 3). So one dimension, corresponding to this reference sound, should have a limited number of phonemes (those included in the reference sound, thus matching with it) centered at a high similarity value, while all the other phonemes should be centered at a low similarity.
Following these observations, we tested a simplified, binary version of the representation, in which every frame is described only by the set of references matching it (this information was sufficient to achieve the recognition, as the number of reference sounds was relatively high). In terms of quantity of data manipulated, this binary representation was clearly less complex than the MFCC representation.
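A sketch of this binarization step (one assumption: the threshold is computed per reference; the report does not specify whether the average and standard deviation are global or per reference):

```python
import numpy as np

def binarize_similarities(S):
    """Binary version of the similarity representation.

    S is a (frames x references) similarity matrix.  Following the
    threshold described for Figure 12, every coefficient above the mean
    similarity plus one standard deviation is set to 1, the others to 0.
    """
    threshold = S.mean(axis=0) + S.std(axis=0)    # per-reference threshold
    return (S > threshold).astype(np.uint8)
```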
We tested the efficiency of this simplified representation both on the "Full set" (composed of the phonemes [m, s, R, p, t, k, a, e, i, u, y, o]) and on the polysyllabic set (manually segmented, composed of the phonemes [R, S, d, m, a, e, i, u]).
[Figure 12 plots: classification error (training and generalization) for the monosyllabic "Full set" and for the trisyllabic set [S R d m][a i u 2], for the MFCC and MFCC+delta2 representations, the binary similarity-based representations, and the continuous similarity-based representations (with and without the temporal distance).]
Figure 12. Comparison of the efficiency of different representations: the MFCC representation (and MFCC plus 2 derived coefficients, "delta2"), the binary similarity-based representation (with 12x16 or 24x16 reference sounds), and the continuous similarity-based representation (using or not the temporal distance). For these tests, the threshold used to create the binary representation was set at the average matching plus one standard deviation (all the coefficients higher than this level were set to 1, all the ones matching less were set to 0).
These results show the efficiency of a binary representation. Despite its simplicity, the results obtained are relatively similar to the results obtained using the continuous representation (and still show an improvement compared with the MFCC-based representations).
It is also important to notice that the binary representation allows the results to improve when the database is enlarged (as shown with the polysyllabic examples), while a database of the same size would lead to over-fitting if used with a continuous representation.
These considerations suggest that the thresholding applied to obtain the binary representation actually keeps most of the relevant information for phoneme discrimination.
2.5
One strong criticism against the presented model is that it doesn't correspond to the observed chronology of phonetic discrimination. It has been shown that, prior to any phonetic acquisition, newborn babies can discriminate virtually every phonetic category. However, according to our model, infants should learn discriminations (by adding discriminant pairs of templates to their reference base), and not forget distinctions, as is observed (they would have to start with a full initial reference base and forget references during the learning, which is impossible).
However, an evolution of this model could be suggested, in which the reference base would not be a set of real speech templates, but more abstract data. These data, which could correspond to artificial sounds, could be initialized in a homogeneous way along the possible speech spectrum. While the baby hears his language, he adapts the reference sounds to fit the statistical distribution of the heard speech. By doing so, he decreases his sensitivity to sounds that do not correspond to any receptor, and increases it for sounds in zones containing many receptors.
Conclusion
We have proposed in this report an original model for the acquisition of phonetic categories. While this model mainly focuses on the representation of the speech stream by young infants, it corresponds to a first stage of language acquisition.
The proposed representation can adapt to the linguistic environment in which it is developed, which
simplifies the acquisition of the phonetic categories.
We have shown that the proposed representation clearly depends on the stimuli perceived during the early stage of language acquisition, as an inadequate representation makes the acquisition of phonetic categories harder, while an appropriate one facilitates the learning of these categories.
We showed that a syllabic point of view is the most relevant for the acquisition of phonetic categories through our model, which is coherent with the early perception of syllables by infants. Consequently, the developed model allows proposing a coherent chronology of speech acquisition during the first year of life.
Our approach is closely related to the idea proposed by M. Coath and S. L. Denham, who propose to represent sounds using similarities with reference templates. However, we propose a more complex algorithm (warping the sounds before comparison), which allows us to obtain a relevant representation for phoneme acquisition. We also propose a comparison between our representation and a more conventional one, which gives an estimate of the gain obtained by our system.
We also compare the efficiency of different kinds of templates; while this previous study only considered reference templates corresponding to the sounds they wanted to discriminate (digits pronounced in English), we used templates that did not correspond to the units we wanted to discriminate (using syllables while we were aiming to discriminate phonemes). Nevertheless, this work, like the one presented in this report, is a direct application of the theoretical considerations previously developed by S. Edelman, who underlined the benefits, for recognition, of a representation based on similarity measures.
2 APPENDIXES
1
1.1
DTW algorithm
The DTW algorithm provides an efficient measure of similarity between two sounds. In order to increase the robustness of the comparison, the sounds are warped one relatively to the other. This warping aligns the parts of the two sounds that correspond to each other, in order to compare the sounds in a relevant way.
Considering 2 sounds, coded as two sequences of L1 and L2 vectors (or frames), the comparison is achieved in 3 steps.
First, a distance matrix D between the two sounds is created. Each element D(t1,t2) of this matrix, of size L1xL2, corresponds to the distance between the t1th frame of sound 1 and the t2th frame of sound 2 (so this matrix represents the distance between every pair of frames of the two sounds).
A path along this matrix (a sequence of couples (t1,t2), with t1 and t2 increasing) corresponds to a distortion of sound 1 relatively to sound 2, or vice versa. The aim of the DTW algorithm is to find a path connecting the 1st element of the distance matrix, D(0,0), to the last one, D(L1,L2), minimizing the sum of the elements encountered (each element corresponding to the distance between the 2 sounds at a given time, the found path corresponds to the distortion minimizing the distance between the sounds).
The distance between two frames can be computed using the Euclidean distance, after a normalization of every component of the vectorial representation of the sound. It is also possible to use the Mahalanobis distance between the two frames (this distance takes into account the statistical distribution of the frames, via the covariance matrix of a set of frames). In the experiments presented in this report, the first solution (Euclidean distance between normalized vectors) was used, for computational reasons.
ILLUSTRATION DISTANCE MATRIX
In a 2nd step, the cumulated distance matrix C is created. Each element of this matrix corresponds to the sum of the distances along the shortest path from the 1st element of the distance matrix, D(0,0), to a given couple of coordinates (t1,t2).
The cumulated distance matrix is built recursively: the 1st element of this matrix, C(0,0), is fixed at 0. Then, the element C(t1,t2) is equal to the smallest possible previous element (the minimum of C(t1-1,t2-1), C(t1-1,t2), and C(t1,t2-1); this minimum is called the origin of the element C(t1,t2), as the optimal path leading to this element comes from its origin), plus the value of D(t1,t2) (corresponding to the distance between the t1th frame of the 1st sound and the t2th frame of the 2nd sound). In order to find the best path leading to an element, the origin of every couple of frames is kept in a matrix called the path matrix.
Finally, in a 3rd step, the path matrix is used to find the optimal path connecting the 1st element to the last one. This path is found iteratively, starting from the final element: the previous element of the path is its origin, which can be found using the path matrix. It is then possible to find the origin of this element, and, gradually, to trace back the best path, corresponding to the distortion of the two sounds that minimizes the distance between them.
The 1-pass DP algorithm improves on the DTW algorithm by allowing a long (continuous) sound S to be compared to several shorter templates {S1... Sn}. The output of this algorithm is the optimal sequence of short templates, Si1...Sik, and, for each template Sij, a sequence of couples of frames (one corresponding to the continuous sound, the other one to the template Sij). These sequences create a path through the continuous sound and the set of templates, minimizing the distance between them.
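The three steps above can be sketched as follows (a minimal Python sketch; the frame distance is the plain Euclidean distance, and all names are illustrative):

```python
import numpy as np

def dtw(s1, s2):
    """DTW comparison of two sounds, following the three steps above.

    s1 and s2 are (L1 x d) and (L2 x d) arrays of (normalized) frame
    vectors.  Returns the cumulated distance along the optimal path, and
    the path itself as a list of (t1, t2) frame pairs.
    """
    L1, L2 = len(s1), len(s2)
    # step 1: distance matrix (Euclidean distance between every pair of frames)
    D = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)
    # step 2: cumulated distance matrix C and origin (path) matrix
    C = np.full((L1, L2), np.inf)
    origin = np.zeros((L1, L2, 2), dtype=int)
    C[0, 0] = D[0, 0]
    for t1 in range(L1):
        for t2 in range(L2):
            if t1 == 0 and t2 == 0:
                continue
            candidates = [(t1 - 1, t2 - 1), (t1 - 1, t2), (t1, t2 - 1)]
            candidates = [c for c in candidates if c[0] >= 0 and c[1] >= 0]
            best = min(candidates, key=lambda c: C[c])
            C[t1, t2] = C[best] + D[t1, t2]
            origin[t1, t2] = best
    # step 3: backtrack from the last element to (0, 0) along the origins
    path = [(L1 - 1, L2 - 1)]
    while path[-1] != (0, 0):
        path.append(tuple(origin[path[-1]]))
    return C[L1 - 1, L2 - 1], path[::-1]
```

Comparing a sound with itself yields a cumulated distance of 0 along the diagonal path.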
1.2
Our model can be related to a more conventional algorithm for the clustering of multidimensional time series.
This algorithm is simply an adaptation of the K-means algorithm, and thus is an unsupervised clustering algorithm.
The K-means algorithm usually considers points, instead of sequences. To create N clusters, it is initialized by associating random values to N points, considered as the centroids of the initial clusters. Then, it repeats the following steps:
- For every point, it finds the closest centroid, using the Euclidean distance. The point is considered as part of the cluster associated to its closest centroid.
- For every cluster, the centroid is recomputed, according to the set of points that have been declared as part of it.
After iterating these operations, the centroids should converge, creating a Voronoi decomposition of the space (each cluster being a cell of this decomposition).
This algorithm can be modified in order to classify trajectories instead of points: trajectories are considered as centroids, and the distances between trajectories are computed using the DTW algorithm instead of the Euclidean distance.
Such an algorithm could be used to classify sounds in an unsupervised way, and thus to find the phonetic units. However, it suffers from several weaknesses.
First, it requires the sequences to be segmented in a way that is relevant for the clustering. For the
acquisition of phonetic units, the sounds would have to be segmented along phoneme boundaries,
which raises a bootstrapping problem: how can these boundaries be found if the phonemes are
unknown? An algorithm dealing with this bootstrapping problem can be derived from the
segmentation algorithm proposed below (its complexity, however, is much higher than that of the
algorithm presented in this report).
Second, it is, from a computational point of view, a very expensive algorithm: every iteration of the
K-means algorithm involves computing the distance between every instance of the sounds to
classify and the centroid of every cluster. From this point of view, the proposed algorithm, which
proceeds in two steps (first comparing the sounds to a predetermined database, then applying the
standard K-means algorithm on points), can be considered an approximation of this algorithm, as it
is computationally much less expensive.
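The first of the two steps can be pictured as an embedding: each variable-length sound is mapped to a fixed-length vector of its distances to the reference database, and the standard K-means then operates on these vectors. The sketch below is illustrative; `embed` and the `dist` parameter (which would be the DTW distance) are names chosen here, not taken from the implementation.

```python
import numpy as np

def embed(sound, database, dist):
    """Map a variable-length sound onto a fixed-length point: its vector of
    distances, one per reference template of a predetermined database."""
    return np.array([dist(sound, ref) for ref in database])
```

Each sound is thus aligned once against every database entry, instead of being re-aligned against every centroid at each K-means iteration.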
The algorithm presented in this section proposes an original and purely top-down method for syllabic
or phonetic segmentation. The originality of this technique is its extensive use of global information and
the absence of any direct use of local information.
2.1.A Basic concept
The method proposed hereafter relies on a conception of phonemes or syllables as the best units for a
description of the speech signal: the syllabic or phonetic segmentations are optimal for describing speech
as a permutation/segmentation of a finite number of units.
This algorithm thus tries to find a segmentation of a relatively short speech stimulus into independent
units that describe new speech stimuli as well as possible. The best segmentation is the one minimizing
the deformation that has to be applied to the new stimuli in order to obtain the best matching (i.e. the
best description) between these stimuli and the considered units.
An analogy can be drawn between this task and a jigsaw puzzle: given a set of pieces, one tries to
rebuild a given picture. The pieces of the puzzle are the sound units, and the picture to rebuild is the
speech stream; the algorithm proposed here finds the optimal set of pieces to rebuild the image. We
would like to show that these pieces are the phonetic or syllabic units.
2.1.B
i: Expectation step
In this step, we try to find a "description" of speech stimuli based on a limited set of segmented units.
This description is obtained using the 1pass-DP algorithm, which finds the optimal concatenation of the
syllable units matching the new speech stream.
The 1pass-DP algorithm also provides the deformations of the sounds that are necessary to obtain this
matching; this deformation is the quantity we will try to minimize.
ii: Maximization step
The result of the expectation step described above is used to modify the boundaries of the sound units
under consideration: these boundaries are moved according to the deformation proposed by the
1Pass-DP algorithm.
This algorithm should converge to units that locally minimize the distortion incurred when describing new
speech stimuli. Ideally, there should be two sets of units minimizing this distortion, namely the syllabic
and the phonetic units.
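One naive way to make the maximization step concrete is a greedy hill-climbing over boundary positions: move each internal boundary by one frame whenever this lowers the total deformation needed to describe held-out stimuli with the resulting units. The sketch below is such a stand-in, not the implemented algorithm; the function names, the toy DTW deformation and the coordinate-wise greedy update are all assumptions.

```python
import numpy as np

def dtw(a, b):
    """Plain DTW distance between two 1-D sequences (toy deformation measure)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def refine_boundaries(stream, bounds, stimuli, deformation, n_iter=10):
    """Greedily shift each internal boundary by one frame whenever this lowers
    the total deformation needed to describe the stimuli with the units cut
    out of `stream` by the boundary list `bounds`."""
    b = list(bounds)
    def total(bb):
        units = [stream[bb[i]:bb[i + 1]] for i in range(len(bb) - 1)]
        return sum(deformation(units, s) for s in stimuli)
    for _ in range(n_iter):
        improved = False
        for i in range(1, len(b) - 1):        # internal boundaries only
            for step in (-1, 1):
                cand = list(b)
                cand[i] += step
                if b[i - 1] < cand[i] < b[i + 1] and total(cand) < total(b):
                    b, improved = cand, True
        if not improved:
            break
    return b
```

With stream [0, 0, 0, 1, 1, 1], initial boundaries [0, 2, 6] and stimuli [0, 0, 0] and [1, 1, 1], taking the deformation of a stimulus as its smallest DTW distance to any unit, the middle boundary moves from 2 to 3, the point where the two natural units separate.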
Unfortunately, I could not fully develop and optimize this algorithm. However, a naive version has been
implemented, providing promising results on simple tests; further development would be needed.