
Model of Acquisition of Phonetic Categories by Young Infants:

A Templatic Representation for Speech Signals

Master Thesis
Submitted in September 2007 by

Guillaume Beraud-Sudreau
Master of Cognitive Science / Master de Sciences Cognitives
École Normale Supérieure / École des Hautes Études en Sciences Sociales / Université Paris V

Internship Directors:

Emmanuel Dupoux, Laboratoire de Sciences Cognitives et Psycholinguistique (EHESS/ENS)
Shigeki Sagayama, Sagayama & Ono Laboratory, University of Tokyo

Introduction
1.1 The logical problem of language acquisition
1.2 Early language acquisition
1.3 Focus of the present work
1.4 Why is the acquisition of phonetic categories a hard problem
    1.4.A Parsing of continuous signal
        i: General problem
        ii: Cross-linguistic variations in spectral boundaries
        iii: Cross-linguistic variations in temporal boundaries
        iv: Cross-linguistic variations in inventory sizes
        v: Within-language variability
    1.4.B Problems treated, simplifications
1.5 Featural codes for speech recognition
    1.5.A Limitations of featural codes
1.6 Holistic or template-based codes
    1.6.A Other templatic approaches
2 A template-based representation of the speech stream
    2.1 Presentation
        2.1.A Structure of the model
        2.1.B Produced representation
        2.1.C Nature of the reference sounds
    2.2 Particularities of the new representation
        2.2.A Adaptation to the learned language
        2.2.B Transformation of trajectories into points
        2.2.C Specific case: affricate phonemes
    2.3 Implementation and testing of the model
        2.3.A Objectives and presentation of the implementation
        2.3.B Detailed algorithm
            i: Constitution of the reference base
            ii: Comparison with new stimuli
            iii: Obtained representation
            iv: Clustering and score
            v: Comparison with a direct clustering algorithm
        2.3.C Input data
            i: Easy set: [r m s a e i]
            ii: Hard set: [p t k u o y]
            iii: Complete set ('all'): [r m s p t k a e i u o y]
            iv: Polysyllabic set: [R d m a i u]
    2.4 Experiments achieved
        2.4.A Experiment 1: efficiency of the new comparison; supervised clustering
            i: Supervised clustering
                a) Monosyllabic sets
                b) Polysyllabic sets
                    2.4.A.i.b.1 Presentation of the problem
                    2.4.A.i.b.2 Comparison between syllabic and phonetic segmentations
                    2.4.A.i.b.3 Segmentation of the continuous speech
                    2.4.A.i.b.4 Random templates
            ii: Unsupervised clustering
        2.4.B Impact of the language of the reference base
        2.4.C Enhancements
    2.5 Known limitations of the proposed model
Conclusion
Used algorithms: DTW and 1-pass DP
    1.1 DTW algorithm
    1.2 Comparison between the proposed clustering method and the "K-means over Trajectories" method
Speech segmentation based on template permutations
    2.1.A Basic concept
    2.1.B Description of the algorithm
        i: Expectation step
        ii: Maximization step

1 Introduction

1.1 The logical problem of language acquisition

During their first years of life, babies spontaneously learn the language spoken by the adults around them. This acquisition, which is incredibly fast, relies on mechanisms that are still unknown.
In 1979, N. Chomsky proposed the concept of Universal Grammar, according to which the acquisition of language rests on innately specified, specialized mechanisms [Chomsky 79]. This point of view is supported by the observation that babies acquire the capacity to generate an infinite set of sentences from a finite, small, and noisy set of stimuli. In addition, the information provided to infants contains only positive examples of structures that are admissible in their language, but no counter-examples and no negative feedback. The impossibility of inducing an infinite language from partial and finite data hence motivated the idea that infants come equipped with knowledge about the subject matter, which takes the form of an innate universal grammar (see also [Osherson et al., 1984]).
Such a priori arguments are based on an analysis of the structure of language competence in adults, not on real acquisition data. They therefore leave unspecified many aspects of the actual learning mechanisms that may be used by human infants (see Elman et al. and Marcus for an extended debate on nativism versus empiricism). Here, we propose to explore the opposite approach, namely to examine how far one can go with rather simple statistical learning mechanisms or signal processing algorithms running on actual speech data. Our aim is hence to specify which part of human language competence has to be prewired, and which can be extracted from the data using statistical procedures.

1.2 Early language acquisition

Human babies learn a lot about their language before they speak, but experimental studies of such early language acquisition appeared only 30 years ago, with the development of new experimental techniques (such as the High Amplitude Sucking Paradigm, the Preferential Looking Paradigm, or the Head Turn Procedure [Eimas et al., 1971]). In the 90's, brain imaging techniques allowed a more direct observation of the cerebral activity of babies (using Event-Related Potentials [Dehaene-Lambertz and Dehaene, 1994], Near Infrared Spectroscopy [Pena et al, 2003], or functional Magnetic Resonance Imaging [Dehaene-Lambertz et al. 2002]). These new experimental techniques, based on indirect measures instead of the oral productions of the baby, enabled researchers to gather a great amount of new data and observations regarding the first steps of speech acquisition.
One of the surprising findings of these studies is that newborns and preverbal infants are extremely competent at making very detailed phonetic distinctions. They can discriminate minimal phonetic contrasts in stop consonants, which are notoriously difficult to classify. Infants are typically better at discrimination than adults: for instance, English 6-month-old infants, unlike adults, can discriminate the contrast between dental [t̪] and retroflex [ʈ], which is used in Hindi but not in English [Werker and Tees, 1984a; 1984b]. They can, however, ignore changes in talker when making consonant or vowel discriminations [Kuhl, 1983]. In addition, human newborns can determine whether pairs of sentences belong to the same language or not, even if they have never heard these languages before [Nazzi et al, 1998], and they can discriminate multisyllabic utterances on the basis of their number of syllables [Bertoncini and Mehler, 1981].
During the first year of life, perception becomes more and more specific to the language(s) spoken in the environment, and contrasts which are not used in the language(s) are less and less well perceived. As far as phonetic categories are concerned (the consonants and vowels of the language), this tuning starts with the acquisition of the vowel categories, which seems to take place around 6 months of age [Kuhl et al. 1992], followed by the acquisition of the consonants and the loss of the non-distinctive contrasts at the end of the 1st year [Werker and Tees, 1984a]. At the same time, infants learn the sequential constraints which govern the ordering of consonants and vowels: at 9 months, a baby can detect which sequences of phonemes are allowed and which are not [Friederici and Wessel 1993]. During the second half of the first year of life, infants develop more and more specialized processing of the prosody of their language [Hirsh-Pasek et al. 1987], as well as of the cues which signal word boundaries [Jusczyk et al., 1996]. This early tuning process seems to go hand in hand with a more definite specialization of the left hemisphere for language [Dehaene-Lambertz et al., 2006].

Although the exact mechanisms governing the acquisition of language are still unknown, it is agreed that babies use the perceived speech stream to carry out statistical learning of the particularities of their language. This learning is based on the analysis of regularities within the language. For instance, it has been shown that babies are sensitive to the distribution of sounds (monomodal vs. bimodal distributions), and can build categories according to these distributions [Maye et al, 2002]. Babies are also sensitive to other statistical regularities, such as transition probabilities between segments [Mattys and Jusczyk, 2001], as well as the presence of complementary distributions between segments [Peperkamp and Morgan, in press].

1.3 Focus of the present work

The objective of this report is to present a model of the acquisition of phonetic categories. As we have seen previously, this acquisition is achieved before the age of 1, which means that infants cannot have been using lexical knowledge to drive it. Indeed, at the end of the first year of life, the recognition lexicon of infants is believed to contain only around 20 words. Hence, the data suggest that the only possible strategy for such an early acquisition is an unsupervised algorithm looking for modes in the phonetic distribution [Maye and Gerken, 2002]. Yet this leaves open a number of hard questions regarding the particular clustering algorithm that is used, as well as the type of preprocessing that is applied to the acoustic signal to optimize the chances of finding proper phonetic categories.

1.4 Why is the acquisition of phonetic categories a hard problem

1.4.A Parsing of continuous signal

i: General problem

The physical substratum of speech is continuous, both in the spectral dimension and in the time dimension. The segmentation of this signal into discrete phonetic units requires finding simultaneously the spectral and the temporal boundaries of these units. Yet these boundaries are difficult to find for two reasons: they vary widely across languages, and they are also highly variable within languages.

ii: Cross-linguistic variations in spectral boundaries

The fact that languages vary in the physical realization of phonetic categories is best illustrated in the inventory of vowels. Vowel categories have been described in terms of prototype theory [Kuhl, 1992; Lacerda, 1995]. They have a center, or best exemplar, and a domain, where exemplars further and further from the center are perceived as less and less typical. For instance, French and English differ in the center and domain of most of their vowels. Kuhl nevertheless proposed that such cross-linguistic variation might be limited by the existence of innate psychophysical boundaries that have to be respected by linguistic systems. Although an innate 30 ms psychoacoustical boundary has been proposed for the durational cue distinguishing voiced and unvoiced consonants (Voice Onset Time) [Pisoni et al, 1980], such boundaries have not been documented for vowels or for other spectral characteristics of consonants.
The language-specific phonetic categories hence appear to be primarily the result of some clustering process applied to the speech sounds of a given language in a space of spectral or spectro-temporal parameters.

iii: Cross-linguistic variations in temporal boundaries

A less well studied domain of variation regards the temporal axis. Here we are not talking about temporal cues for phonetic contrasts (such as VOT, or the slope of formant transitions), but about the way in which a given stretch of speech is segmented into consecutive phonemes. For instance, the non-word "ebzo" is considered to have 2 consonants and 2 vowels by English listeners, but Japanese listeners will report hearing 2 consonants and 3 vowels ("ebuzo"). These perceptual differences are mainly due to the fact that sequences of consonants are allowed in English but not in Japanese. The Japanese perceptual system interprets the small formant transition present between /b/ and /z/ as a full vowel, which turns the sequence into a legal one in the language. In contrast, the English or French perceptual system simply ignores this information, or interprets it as being part of the respective consonants [Dupoux et al, 1999]. This problem is very general and not restricted to the perception of illusory vowels. For instance, let us consider the sequence "ts". In some languages (e.g. Italian), it is considered as a single affricate phoneme [tˢ]; in others (e.g. French), as the concatenation of two phonemes, "t" and "s"; in yet others (e.g. Canadian French), as a contextual allophone of the phoneme [t].
In order to construct a model of phonetic category acquisition, it is hence important to pay attention to this double aspect of parsing, simultaneously along the temporal and the spectral dimensions.

iv: Cross-linguistic variations in inventory sizes

These difficulties are increased by the lack of a priori knowledge about the phonetic categories to learn. In particular, the number of categories to create is a priori unknown, and varies greatly among languages. For instance, some languages use 3 vowels, while others use 20. In the same way, the number of consonants in a given language can vary from 6 to almost a hundred. On the other hand, the average duration of the phonetic units seems relatively stable across languages (i.e., between 50 ms and 150 ms). This duration window is probably a very useful cue for building the phonetic units.
It should be noted that the lack of a priori knowledge regarding the phonetic categories makes their acquisition even more difficult if one considers that this task has to be fully achieved in an unsupervised way (i.e., the learning is not guided by any information other than the speech signal itself).

v: Within-language variability

Finally, even when one focuses on a single language, the physical signal considered for the acquisition of the phonetic categories is very noisy. This is due to three factors. First, the speech stream that reaches the ears of infants is mixed with a variety of nonspeech noises (environmental noises), and the speech signal is itself distorted (filtering, reverberation). Second, the phonetic realizations of the categories are not the same across individuals: differences in gender, age, and size of the vocal tract considerably modify the acoustic properties of the speech segments. Finally, and most importantly for us, the phonetic categories are also variable within individuals. This source of variation, highly specific to speech signals, is generally called the coarticulation effect. It is due to the fact that speech is produced by a complex set of independently controlled articulators. In order to minimize energy expenditure, our motor control system tends to anticipate future segments by preparing the articulators for their future targets. For instance, in the syllable /Su/ (as in "shoe"), the fricative is already rounded in preparation for the rounded vowel; in /Si/ (as in "she"), in contrast, the fricative is not rounded. The result is that the spectral characteristics of the fricative are contaminated by the following vowel. Vice versa, due to the inertia of the articulators, consonants and vowels are influenced by the preceding ones. Hence, depending on the context, the physical realization of a phoneme strongly varies from one occurrence to another. Some researchers have proposed that there are no invariant properties of segments, because segments are planned as entire articulatory gestures [Fowler et al., 1993].
These noise sources, which strongly disturb the mapping between the abstract phonemes and their acoustic realizations, seem to disturb neither infant acquisition nor adult comprehension of the speech signal.

1.4.B Problems treated, simplifications

In this report, we ignore the first two sources of variability by considering a single speaker in a quiet and non-reverberating environment. These precautions allowed us to focus on what we consider to be the main difficulties, in particular the coarticulation problem and the unsupervised aspects of phonetic acquisition.

1.5 Featural codes for speech recognition

Early models of speech acquisition have assumed that the speech stream is coded into a low-dimensional space of local features (see [Eimas et al., 1975] for a theory of phonetic features). Speech recognition systems typically encode the speech stream into a low-dimensional vector of local spectral features (for instance, a vector of 20 or so MFCC or LPC coefficients). These coefficients are computed every 5 ms, and represent an analysis of the speech stream over a time window of around 10-30 ms. Unlike the proposal of Eimas, however, these features are not claimed to have a linguistic interpretation, but are used solely for efficiency reasons. Finally, neurophysiological models of speech perception also propose an early representation of speech in terms of local features. The cochlea performs a spectral decomposition of sounds, and the coding in the auditory nerve has been described as filtering through a basis of short-term wavelet-type functions [Smith and Lewicki, 2006]. All of these models share the idea that the speech perception system is organized hierarchically, starting with local acoustic features, which are used to build higher-order structures like phonemes, syllables, and words.

1.5.A Limitations of featural codes

The main deficit of featural approaches is that they are too local to adequately capture the sources of variation that affect the acoustic properties of phonetic categories. Psychoacoustic models have for the most part failed to provide features with the required degree of invariance regarding coarticulation [Blumstein and Stevens, 1979]. Speech recognition systems compensate for this defect of features by providing another, much more holistic representational level: the level of words. Words are represented as sets of nodes or states connected through transition probabilities, each state representing a probability distribution over the feature space. Such models, called Hidden Markov Models (HMMs), have been very successful in achieving invariant recognition of words, but they are heavily trained in a supervised fashion (the words or phonetic tags are provided to the learning system). Obviously, such an approach is not possible for an unsupervised learning system.
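As a rough illustration of such a supervised word model (this is background, not part of the proposed system), here is a minimal Python sketch using the third-party hmmlearn package; the 5-state topology and the randomly generated stand-in data are our assumptions:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM  # third-party package `hmmlearn`

    # One HMM per word: a chain of states, each state holding a Gaussian
    # distribution over the feature space. Training is supervised in the
    # sense that all training sequences are known to be the same word.
    word_examples = [np.random.randn(40, 13) for _ in range(20)]  # stand-in MFCC frames
    X = np.vstack(word_examples)
    lengths = [len(w) for w in word_examples]
    model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    # Recognition compares such models: the word whose HMM assigns the
    # highest log-likelihood to a new utterance wins.
    loglik = model.score(word_examples[0])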

1.6 Holistic or template-based codes

This report proposes to explore another, more recent class of models, based on more global or holistic representations. The basic idea is that the invariant properties that correspond to phonemes are not present at the level of local features, but rather at the level of entire articulatory trajectories (i.e., chunks of speech the size of syllables). In such models (see Figure 1), the system is based on segmenting and storing large-sized templates, which are the basis for discovering the more abstract segments, which can then be used to recover the even more abstract linguistic features. In other words, this model turns the standard hierarchical model on its head and starts with big units rather than small ones.
This approach makes two important presuppositions: 1. that the problem of segmentation into templates or acoustic syllables is simpler than the original problem of segmentation into phonetic categories; 2. that the acoustic templates provide a representation which is more useful for discovering phonetic categories than the original featural representations.
In this report, we mostly focus on the second presupposition, although we report in the Appendix an original algorithm achieving the syllabic segmentation, based on an even more global point of view and thus coherent with the proposed model.
[Figure 1 diagram: two panels, "Classical architecture" and "Proposed architecture". Classical levels: Acoustics, Acoustic features, Phonemes, Syllables, Word, Grammar. Proposed levels: Acoustics, Acoustic syllables, Phonemes, Features, Word, Grammar.]

Figure 1. Comparison between the classical hierarchical architecture based on local features, and our anti-hierarchical approach based on templates. Note that a template model is consistent with the finding that newborn infants can count the number of syllables in the speech stream before they can pay attention to individual segments [Bertoncini and Mehler, 1981].

1.6.A Other templatic approaches

The idea of a similarity-based representation was introduced by R. N. Shepard [Shepard, 1968], and more recently developed by S. Edelman [Edelman 1998], who proposed a visual recognition system based on an analysis of similarities over similarities: visual stimuli are compared to a given set of patterns, and the results of these comparisons are used as a new representation of the stimuli.
Conventional recognition systems often categorize objects according to their similarities with prototypes of the categories to recognize; these similarities are usually considered as the result of the classification, as the category associated with a stimulus is the one whose prototype lies at the minimum distance from the input. Edelman instead proposes to base the analysis on similarities between observations, and to compare these similarities with each other (creating similarities of similarities). This analysis, named second-order isomorphism, allows building much more complex classifications than a simple recognition system based on a measure of similarities followed by a winner-takes-all decision. In our work, we follow this concept in the most literal way: the stimuli are first compared with the reference sounds, and then the results of these comparisons are compared together through a clustering algorithm, to create the phonetic categories (so these categories are built from an analysis of similarities over similarities).

The idea of using a templatic representation of sounds has already been applied in practical work. In particular, M. Coath & S. L. Denham [Coath and Denham, 2005] studied a representation based on comparisons with syllabic templates, with the aim of recognizing spoken digits. They propose to represent sounds as the responses to convolution filters, each filter corresponding to a feature to detect. This operation produces a distance between the stimuli and several instances of the sounds to detect. This representation of the stimuli is relatively comparable to the one we are proposing: unlike classical speech recognition systems, the model uses the comparisons to create a new representation, and bases the recognition on this new representation, instead of classifying the stimuli by a simple winner-takes-all test.
However, the authors compare the syllables using convolution products, which do not provide sufficient robustness to obtain satisfying results. They also consider only the global result of the sound comparisons, and consequently cannot efficiently distinguish units smaller than the templates used for comparison. The authors do not consider this limitation a problem, as they focus on digit recognition rather than phonetic acquisition. Finally, they do not report any error rates or efficiency comparisons between their system and conventional representations.

Similar ideas are extensively applied to object recognition. In particular, comparing an observed shape with multiple templates of a given set of objects greatly decreases the difficulties related to the rotation or illumination of this object. This approach has been used successfully in face recognition, providing robustness to face orientation and illumination [Beymer and Poggio, 1996].
However, these applications raise more difficulties than sound representation, as they involve comparisons of 2-dimensional images of 3-dimensional objects: such comparisons necessarily require more complex machinery than comparisons between speech streams, which are 1-dimensional.

Finally, experiments have been conducted on humans in order to support the concept of visual representation by similarities. For instance, F. Cutzu and S. Edelman [Cutzu and Edelman 1996] designed an experiment in which subjects indirectly rebuilt the relations between 3-dimensional objects. Several three-dimensional animal-like shapes were generated; the shapes were controlled by a large number of parameters, hidden from the subjects. Pairs of pairs of pictures of these shapes were then presented to the subjects, who had to decide which pair was the most similar; from the answers, relations between the animals were rebuilt by multidimensional scaling. These recovered relations corresponded to the (hidden) relations between the objects in parameter space. This experiment suggests that humans are sensitive to the relations between objects (a feature-based recognition of objects, for instance, would not lead people to reproduce the relations between the shapes presented in this experiment).

It is interesting to highlight that, although it is common to represent data using similarities with reference templates, these templates are usually considered as prototypes of the objects to identify, or sometimes as subparts of these objects. The model proposed hereafter, on the other hand, aims to identify units (phonemes) that are subparts of the patterns used for the comparisons (syllables).


2 A template-based representation of the speech stream

Here, we propose a new representation of the speech signal based on syllabic templates. This representation can be considered as a first step in the acquisition of phonetic categories by the infant, as it facilitates the acquisition of the phonemes of the language.
In this report we will show that this representation helps in learning a given language, while making the discrimination of phonemes that are not part of this language more difficult.

2.1 Presentation

2.1.A Structure of the model

The proposed model relies on an original representation of speech stimuli, based on multiple comparisons between perceived stimuli and a bank of reference sounds (we suppose this set of sounds has been previously stored).
In order to perform these comparisons, it is necessary to segment the speech stream into relatively short templates, of the same form as the reference sounds. This segmentation is a first step toward the comparison of new sounds with the reference base (and, consequently, can be considered as a first step in the acquisition of language).
A similarity measure is obtained by comparing every new sound with every reference template. The measure of similarity between a new stimulus and one reference sound constitutes one dimension of the new representation of that stimulus.
The created representation thus transforms a sequence of physical values (the physical, or at least acoustic, representation of the speech stream) into a set of sequences of similarities (one similarity per reference sound).
Before the comparison, and in order to improve its efficiency, the sounds are time-warped. The distortion applied to the sounds can itself be useful information about the relation between the new stimulus and the reference sounds; therefore it can be valuable to take it into account during the clustering step. In the experiments presented later in this report, we will show the gain obtained by the use of this information.
The acquisition of phonetic categories strictly speaking is achieved in a second step, using the computed representation. We will show in this report that the new representation allows easier learning of these categories, and thus that this representation can be considered as a first step of their acquisition.

An illustration of this model is given in Figure 2.
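To make the pipeline concrete, here is a minimal Python sketch of it (the thesis implementation was written in Matlab; all names below are ours, and the Euclidean frame distance and the path backtracking are conventional choices rather than details taken from the text):

    import numpy as np

    def dtw_framewise(stimulus, reference):
        """Align `stimulus` (n x d MFCC frames) with `reference` (m x d frames)
        by dynamic time warping, and return, for every stimulus frame, the
        local distance to the reference frame it is matched with along the
        optimal warping path."""
        n, m = len(stimulus), len(reference)
        # local (frame-to-frame) Euclidean distances
        local = np.linalg.norm(stimulus[:, None, :] - reference[None, :, :], axis=2)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                      acc[i - 1, j],
                                                      acc[i, j - 1])
        # backtrack the optimal path and project its cost onto stimulus frames
        dist = np.full(n, np.inf)
        i, j = n, m
        while i > 0 and j > 0:
            dist[i - 1] = min(dist[i - 1], local[i - 1, j - 1])
            step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return dist

    def similarity_representation(stimulus, reference_base):
        """One column per reference template, one row per 5 ms stimulus frame.
        Distances can be turned into similarities, e.g. s = exp(-d)."""
        return np.column_stack([dtw_framewise(stimulus, ref)
                                for ref in reference_base])

Applied to every stimulus, similarity_representation yields the high-dimensional similarity space labeled (4) in Figure 2.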


[Figure 2 diagram: acoustic signal (1); coding (A) into a low-dimensional spectral feature space (2); segment & store (B) into the reference base (3); match (C) against the base, yielding a high-dimensional similarity space (4).]

Figure 2. Global presentation of the proposed algorithm: the physical signal (1) is transformed (A) into an acoustic representation (2; for practical implementation of the model, MFCC coefficients are used for efficiency). These data are segmented, and part of them is stored (B) to create the reference base (3). In a second step, new stimuli are compared with this reference base (C), to create a similarity representation (4).

2.1.B Produced representation

The representation obtained after comparison is based on the reference templates: each dimension of this representation corresponds to one of the reference sounds.
For instance, if the syllables [ba], [na], [ra] have been stored in the reference base, one dimension of the new representation will correspond to the similarity between the stimuli and the reference sound [ba], another will correspond to the reference [na], and other dimensions will correspond to the other references (cf. Figure 3).

Figure 3. Projection "ma" vs. "mi": two axes of the new representation; the horizontal axis corresponds to the reference template "ma", while the vertical axis corresponds to the sound "mi". Each point corresponds to a frame (5 milliseconds long), indicated by its label. Notice that the "m" sounds have a high similarity along both axes, while the phonemes "a" and "i" have a high similarity along their corresponding reference sound only, and the other phonemes match with neither axis and lie close to the origin. This figure displays two axes only, but the representation is high-dimensional (with one dimension corresponding to every reference sound).


2.1.C Nature of the reference sounds

The nature of the reference sounds is fundamental for an efficient representation of the speech stimuli. We assumed that syllabic units were optimal. This assumption was motivated by the following reasons:
The coarticulation effect is very strong within syllables, and thus coarticulation creates a very strong noise if we consider units smaller than syllables. The use of syllabic units cancels this noise, as different instances of the same phoneme (appearing in different contexts) can be stored. On the other hand, the coarticulation effect is very low between syllables, so it is not necessary to consider units longer than syllables: such units would increase the complexity of the representation without providing any gain.
This consideration is also coherent with observations made on newborn babies. The use of syllabic templates for sound comparisons requires babies to be able to detect such units in the speech stream. It has been shown that newborn babies have this ability, as they can discriminate between strings of bisyllables and trisyllables (which means they can count the number of syllables in the speech stream [Bertoncini and Mehler, 1981]).
Finally, this hypothesis has been successfully tested on our model: experiments have shown the superiority of a syllable-based representation for phoneme acquisition, which is, as we have seen, coherent with both theoretical considerations and experimental observations.

The exact mechanism allowing babies to achieve this segmentation is still unknown, and will not be the focus of this report. Nevertheless, a syllabic segmentation mechanism is proposed in the annex.

2.2 Particularities of the new representation

2.2.A Adaptation to the learned language

The proposed representation of the speech stream shows several major differences with a "physical" representation such as, for instance, the spectral representation.
The first noticeable difference is the adaptation of the representation to a particular language. The representation is adapted to learning the phonemes that appear in the reference base, while it should be inefficient for learning to discriminate phonetic categories that do not appear as distinct in the reference base. So this representation is clearly adapted to the learned language, and can be considered as a first step in its acquisition. This adaptation will be clearly underlined in the tests performed on the proposed implementation of the model.

2.2.B Transformation of trajectories into points

As another major difference, it is important to note that our representation transforms data appearing as sequences of physical values into near-constant similarities. For instance, the physical representation of a phoneme clearly is not constant, but its similarity with a reference sound featuring the same phoneme will be continuously high.
This difference greatly simplifies the learning of the phonetic categories, as a phoneme corresponds to a point or an area of the new representation, instead of a complex trajectory in the acoustic representation.
On the other hand, phonemes that do not appear in the reference base are not associated with any zone of the new representation (the representation of such a phoneme is still a trajectory). So our representation would not be adapted to learning such a phoneme.
These differences provide an interesting understanding of discrimination abilities (or inabilities) as based on a specificity (or limitation) of the representation, prior to any clustering.


2.2.C Specific case: affricate phonemes

Noticeable examples are the affricate phonemes. For instance, let us consider a language including a phoneme [tˢ]. A reference base acquired from such a language would incorporate a set of templates including the phoneme [tˢ], so the sound [tˢ] would correspond to an area in the new representation. New stimuli containing the phoneme [tˢ] would create a cluster in this area, which would be considered as an independent phoneme.

On the other hand, a representation built on a language containing the phonemes "t" and "s", but not [tˢ], would not have any constant area corresponding to the phoneme [tˢ]. If a sound [tˢ] is represented using a reference base built from this language, it would not correspond to a fixed area (it would successively match with "t", then with "s"). In these conditions, the sound [tˢ] would be considered as the concatenation of two phonemes, instead of a third one.
This difference would enable any acquisition technique based on the proposed representation to easily learn discriminations between phonemes that could not possibly be discriminated using local information only.

2.3 Implementation and testing of the model

2.3.A Objectives and presentation of the implementation

A Matlab implementation of the described model has been written.

The objectives of this implementation were:
- to demonstrate the efficiency of the proposed representation, compared with a physical representation of the speech signal, for learning the phonetic categories;
- to compare the efficiency of the proposed representation depending on the speech templates considered (for instance, to compare syllabic templates with other kinds of segmentations);
- to propose some optimizations of the representation.

2.3.B Detailed algorithm

The implemented algorithm proceeded in several successive steps.

The input data provided to the algorithm were coded using MFCC coefficients [Mermelstein, 1976]. The MFCC representation of a sound is computed from the spectrum of the speech signal: the logarithm of the spectrum is mapped along a mel scale, and the MFCC coefficients are the amplitudes of the discrete cosine transform of the obtained signal.
This choice was motivated by technical considerations, as MFCC coefficients provide an efficient way to encode a speech stream for speech recognition and are relatively fast to compute.
In the experiments described in this report, the MFCC coefficients were computed every 5 ms, but each MFCC frame actually represents the speech signal over 15 milliseconds (the frames are computed on overlapping windows). The signal was represented by the first 13 coefficients.
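As an illustration of this front end, here is a minimal sketch of the computation just described (log mel spectrum followed by a DCT). The 15 ms window, 5 ms hop and 13 coefficients follow the text; the Hamming window and the 26-filter mel bank are conventional assumptions, not details taken from the thesis:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr, win_ms=15, hop_ms=5, n_mels=26, n_coefs=13):
        """MFCC frames: mel-warped log power spectrum followed by a DCT."""
        win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
        n_fft = win
        # triangular filters spaced regularly on the mel scale
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
        bins = np.floor((n_fft + 1) * edges / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for i in range(n_mels):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        frames = []
        for start in range(0, len(signal) - win, hop):
            frame = signal[start:start + win] * np.hamming(win)
            power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
            logmel = np.log(fbank @ power + 1e-10)
            frames.append(dct(logmel, type=2, norm='ortho')[:n_coefs])
        return np.array(frames)  # shape: (n_frames, 13)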
In order to judge the efficiency of the proposed representation, we often compared the error rates obtained from clustering based on our representation with those obtained from the MFCC coefficients. In those cases, we used the MFCC coefficients plus the first 2 derived coefficients (so the considered signal was 39-dimensional). These derived coefficients greatly improve the learning and recognition of phonetic categories (as shown for instance in Figure 5).
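A simple finite-difference sketch of these derived coefficients (ASR systems often use a regression formula over several frames instead; this choice is our assumption, not the thesis' exact recipe):

    import numpy as np

    def add_deltas(mfcc_frames):
        """Append first and second temporal derivatives to the 13 MFCC
        coefficients, giving the 39-dimensional control representation."""
        d1 = np.gradient(mfcc_frames, axis=0)
        d2 = np.gradient(d1, axis=0)
        return np.hstack([mfcc_frames, d1, d2])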
With the input sounds represented by their MFCC coefficients, the implemented algorithm proceeded through the following successive steps:

i: Constitution of the reference base

This step was, for most of the trials, performed in a supervised way: a given number of realizations of every possible syllable of the considered language were randomly selected to compose the reference base. This procedure yields balanced reference bases, created from optimal templates.
The balance of the reference base is probably not mandatory, as long as every syllable is represented (in the case of an unbalanced template base, the results would probably be equivalent to those of a balanced base with the same number of elements as the least represented syllable in the database).
However, in some experiments, the reference base was built in an unsupervised way (in these cases, it will be specified).
We also developed and implemented an unsupervised segmentation algorithm, which can produce relevant templates. This algorithm is described in the annex.
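A minimal sketch of this supervised constitution of the base, assuming the recorded sounds come paired with their syllable labels (all names are ours):

    import random
    from collections import defaultdict

    def build_reference_base(labeled_sounds, n_per_syllable, seed=0):
        """Randomly pick the same number of realizations of every syllable,
        yielding a balanced reference base.
        `labeled_sounds`: list of (syllable_label, mfcc_frames) pairs."""
        by_syllable = defaultdict(list)
        for label, frames in labeled_sounds:
            by_syllable[label].append(frames)
        rng = random.Random(seed)
        base = []
        for label, instances in sorted(by_syllable.items()):
            base.extend((label, f) for f in rng.sample(instances, n_per_syllable))
        return base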

ii: Comparison with new stimuli

The comparisons between the database templates and the new stimuli are performed using the Dynamic Time Warping (DTW) algorithm [Myers and Rabiner, 1981]. The DTW algorithm compares two sounds, deforming one of them if necessary, in order to compare the parts of the sounds that most probably match.
This comparison method offers robustness against changes in the speed of the speech stream, and synchronizes the templates prior to matching them together.
Figure 4. Result of the comparison of 2 sounds using the DTW algorithm: one of the sounds (the new stimulus, here "ba") is displayed along a horizontal time axis, while the other (the reference sound, here "be") is displayed along a vertical time axis. The DTW algorithm compares these sounds by computing the distance between every possible pair of frames from the two sounds (these distances are stored in the "distance matrix", represented in the figure: similar frames are represented in yellow, while distant ones are represented in red). Then it finds the best possible path (the path joining the beginnings of the sounds to their ends with a minimal total distance, in blue in the figure). Finally, it projects the distance along this path onto one of the sounds (in our algorithm, the new stimulus). This yields a frame-wise distance between the new stimulus and the reference sound. The same comparison is applied with every reference sound, providing a vector of similarities for every frame of the new stimulus.

In some cases, the 1-pass DP algorithm was preferred (1-pass DP is a generalization of the DTW algorithm which allows comparing one sound to many references simultaneously, or comparing one continuous stimulus to a shorter reference by repeating it as many times as necessary); in these cases, the comparison method and the reason for this choice will be specified.
In addition, we extracted from the comparisons information about the distortions that had to be applied to the sounds in order to make them match. To obtain this time-distortion measure, which gives additional information about the comparison between the sounds, the optimal path obtained by the DTW algorithm is considered: the distance between the derivative of this path at a given moment and a linear distortion of the reference sound provides a dissimilarity measure (the higher the value of this distortion distance, the more the sounds had to be warped before matching).
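The thesis does not give the exact formula for this distortion distance, so the following sketch is one plausible reading of it: compare the local slope of the DTW path (e.g. the list of (stimulus frame, reference frame) pairs produced by backtracking) with the slope of a purely linear alignment:

    import numpy as np

    def distortion_distance(path, n_stim, n_ref):
        """Frame-wise time-distortion measure along a DTW path.
        `path`: sequence of (stimulus_index, reference_index) pairs.
        The larger the deviation of the local slope from the linear
        slope n_ref / n_stim, the more local warping was needed."""
        path = np.asarray(path, dtype=float)
        linear_slope = n_ref / n_stim
        di = np.diff(path[:, 0])
        dj = np.diff(path[:, 1])
        slope = dj / np.maximum(di, 1e-9)  # local derivative of the path
        distortion = np.abs(np.log((slope + 1e-9) / linear_slope))
        # project onto stimulus frames (keep the max distortion per frame)
        per_frame = np.zeros(n_stim)
        for (i, _), d in zip(path[1:].astype(int), distortion):
            per_frame[i] = max(per_frame[i], d)
        return per_frame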

iii: Obtained representation

Given the input data, the representation obtained after comparison is an N- or 2N-dimensional signal (if the reference base counts N distinct templates, the comparisons produce N similarities and N distortion distances), sampled every 5 milliseconds.
From a computational point of view, this representation is relatively expensive. Therefore, the proposed algorithm should not be considered efficient from an engineering point of view.

iv: Clustering and score

Depending on the experiment, the clustering performed over the new representation used supervised or unsupervised algorithms. In the case of supervised clustering, a perceptron was used, optimized with the Rprop algorithm (Rprop performs supervised training of a perceptron network using a backpropagation optimization technique; see [Riedmiller and Braun, 1993] for further details).
The training and the recognition of the phonetic categories were performed from a frame-wise point of view: every frame was considered as an independent datum, and the order of the frames was not used for clustering or recognition. This point of view is clearly simplistic; however, our objective was to compare different representations of the speech signal, and for this task it was relevant to use a simple clustering algorithm. We only suffered from this simplicity in the unsupervised tests, where the proposed algorithms were unable to converge toward the appropriate clusters.
In all cases, the error rate measured was the percentage of misclassified frames (so this error rate was frame-wise).
In every experiment, a similar clustering algorithm was applied to the MFCC coefficients and their two first derived coefficients, in order to judge the gain obtained using the proposed representation.
For the learning, we used a training set including 34 realizations of every possible syllable (so, from 34x9=306 sounds for the smallest sets to 34x36=1224 sounds for the biggest set; each sound contains an average of 40 to 50 frames, depending on the considered set).
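A minimal sketch of this frame-wise evaluation; scikit-learn's Perceptron (trained by stochastic gradient descent) stands in here for the Rprop-trained perceptron network of the thesis:

    import numpy as np
    from sklearn.linear_model import Perceptron

    def framewise_error_rate(train_frames, train_labels, test_frames, test_labels):
        """Train a linear classifier on individual frames (each frame is an
        independent datum) and return the percentage of misclassified frames.
        Inputs: (n_frames, n_dims) feature arrays and (n_frames,) labels."""
        clf = Perceptron(max_iter=1000, tol=1e-3)
        clf.fit(train_frames, train_labels)
        predicted = clf.predict(test_frames)
        return 100.0 * np.mean(predicted != test_labels)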

v: Comparison with a direct clustering algorithm

From a computer-science point of view, our approach can be considered as related to the K-means algorithm based on DTW distances (an application of such an algorithm is proposed, for instance, in [Somervuo and Harma, 2004]). This algorithm computes a K-means clustering directly on sequences of points, using the DTW algorithm to compare the different stimuli to the centroids. It would not need any intermediary representation to compute a relatively efficient clustering of the phonetic categories. But this algorithm can only provide a clustering of entire sound sequences; so, in order to converge to phonetic categories, it would require segmenting the speech into phonetic units.
More details related to this algorithm and its connection with the proposed model are given in the annex.
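For concreteness, a sketch of such a direct clustering of whole sequences. Averaging variable-length trajectories into centroids needs extra machinery, so this sketch uses the k-medoids variant instead (our simplification), with the total DTW path cost as sequence distance; dtw_framewise is the helper from the sketch in section 2.1.A:

    import numpy as np

    def dtw_distance(a, b):
        """Total alignment cost between two sequences."""
        return dtw_framewise(a, b).sum()

    def kmedoids_dtw(sequences, k, n_iter=20, seed=0):
        """Alternately assign each sequence to its nearest medoid, then pick
        as new medoid the member minimizing the within-cluster distance sum."""
        rng = np.random.default_rng(seed)
        n = len(sequences)
        d = np.array([[dtw_distance(sequences[i], sequences[j]) for j in range(n)]
                      for i in range(n)])
        medoids = list(rng.choice(n, size=k, replace=False))
        for _ in range(n_iter):
            assign = np.argmin(d[:, medoids], axis=1)
            new_medoids = []
            for c in range(k):
                members = np.where(assign == c)[0]
                if len(members) == 0:
                    new_medoids.append(medoids[c])
                    continue
                within = d[np.ix_(members, members)].sum(axis=1)
                new_medoids.append(members[np.argmin(within)])
            if new_medoids == medoids:
                break
            medoids = new_medoids
        return medoids, assign

As the text notes, such a clustering operates on whole segmented units, which is why it cannot by itself recover phonemes smaller than the segments it is given.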

2.3.C Input data

The input data provided to our algorithm were encoded using the first 13 MFCC coefficients, sampled every 5 ms. The choice of the MFCC coefficients as input representation was motivated by the fact that these coefficients are directly extracted from a physical representation of the speech signal (being the "Fourier transform of the Fourier transform"), while at the same time being relevant for speech recognition.
For the tests performed on the algorithms, we considered 4 pseudo-languages.
All stimuli were recorded by the same male speaker, in a quiet and non-reverberating environment. In order to evaluate the results of the proposed experiments, all the data were manually labeled (even though some of the proposed experiments are based on an unsupervised learning system, and thus do not actually rely on these labels).

i: Easy set: [r m s a e i]

This set was composed of 6 phonemes (a, e, i, r, m and s), selected to be relatively easy to differentiate. These phonemes were used to create syllables with a Consonant-Vowel (CV) structure. The syllables were pronounced independently (thus, the considered data were monosyllabic words).
This set was balanced (every possible syllable appeared the same number of times). Each syllable was recorded 55 times (so this set was constituted of 3 x 3 x 55 = 495 elements, as 3x3=9 syllables can be created with these 3 consonants and 3 vowels).

ii: Hard set: [p t k u o y]

This set was also composed of 6 phonemes (p, t, k, u, o, y), selected to be difficult to differentiate: in particular, the 3 stops (p, t, k) are extremely hard to tell apart, as are the 3 selected vowels (u, o, y).
These phonemes were also used to create monosyllabic words with a CV structure.
This set was also balanced, and each syllable was recorded 55 times.

iii: Complete set ('all'): [r m s p t k a e i u o y]

This set was the combination of the 2 sets previously described. It was constituted of CV monosyllabic words, including all possible syllables created from the phonemes of the easy and the hard sets (such as "ru" or "pa").
Like the previous sets, this set was balanced, with 55 realizations of each possible syllable (so it was composed of 6 x 6 x 55 = 1980 elements).

iv: Polysyllabic set: [R d m a i u]

This set contains trisyllabic words, composed of the phonemes of the set arranged following the same CV structure as in the previous sets; the trisyllables recorded therefore have a CVCVCV structure. 512 trisyllabic words were recorded. The set was built in such a way that all the phonemes were pronounced the same number of times, in every position.

2.4 Experiments achieved

In order to test the predictions of our model, we performed several experiments, using the data presented above.

2.4.A Experiment 1: efficiency of the new comparison; supervised clustering

The first test concerned the "efficiency" of the proposed representation for phoneme acquisition. To estimate this efficiency, we measured the frame-wise error rate for the phonetic classification of our data, after either a supervised or an unsupervised clustering.

i: Supervised clustering

This experiment was performed in order to measure the linear separability of the phonetic categories under different kinds of representations.
The clustering of the data was achieved using a perceptron, whose parameters were determined using the (supervised) Rprop algorithm [Riedmiller and Braun, 1993].
The perceptron output attaches a label to every input frame, and the error rate was obtained by comparing this predicted label with the actual label manually attributed to the frame.
We tested the different sets of stimuli under different conditions:
- MFCC representation, with the first 2 derived coefficients (control);
- our representation, testing:
    o the influence of the number of reference sounds;
    o the influence of the distortion distance.

While this experiment showed the superiority of the proposed representation, it gave relatively contrasted results according to the sets it was run on:

a) Monosyllabic sets

The tests over the monosyllabic sets showed a clear superiority of the proposed representation. For these tests, the considered templates were monosyllabic words, compared using the DTW algorithm. We used reference bases created with 4 to 12 instances of every possible syllable (there were 36 possible syllables), and tested the discrimination with and without the distortion distance.


[Figure 5 chart, "Hard set 3x3 [ptk][uyo]": classification error (5-30%) for training and generalization, plotted against the number of detectors (mfcc, mfcc+delta2, then 4x9, 8x9 and 12x9 references, for "spectrum only" and "spectrum+time").]

Figure 5. Efficiency of a supervised classification, performed on MFCC coefficients (as control, including or not the 2 first derived coefficients: "mfcc" / "mfcc+delta2"), or on the similarity-based representation, with 4, 8 or 12 examples of every reference, and without or with the time distance ("spectrum only" / "spectrum+time").
As expected, a bigger reference base allows a more precise discrimination of the stimuli. The test also shows the efficiency of the distortion distance for discriminating phonemes.
However, these claims could be considered obvious, as the representation based on the biggest database (12 instances of each syllable) is extremely complex: it counts 12 * 36 * 2 = 864 dimensions (12 = number of instances of every syllable, 36 = number of different syllables in the language, and the number of references is multiplied by 2 because we used the distortion distance). With such a representation, the discrimination is necessarily higher than with simpler data. Nevertheless, all this information is directly extracted from the MFCC coefficients. And, while the representation based on a great number of references is complex, a projection of this representation onto its most significant dimensions (using a Principal Component Analysis) yields much better results than directly processing the MFCC coefficients (and their derivatives) at the same level of complexity.
So it can be deduced from these experiments that, for monosyllabic data, the proposed representation is more efficient for discriminating phonemes. It is also shown that increasing the number of reference sounds improves the discrimination, and that the distortion distance helps discriminate phonemes more efficiently.
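A sketch of this dimensionality-matched comparison, projecting the 864-dimensional similarity representation down to the 39 dimensions of the MFCC control with scikit-learn's PCA (the function name is ours):

    from sklearn.decomposition import PCA

    def project_to_control_dims(similarity_frames, n_dims=39):
        """Project the high-dimensional similarity representation onto its
        most significant dimensions, so that classifiers with inputs of equal
        complexity can be compared (39 = 13 MFCC + delta + delta-delta)."""
        pca = PCA(n_components=n_dims)
        return pca.fit_transform(similarity_frames)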

We performed the same tests using a more complex representation of the sounds as input for our algorithm: instead of the plain MFCC coefficients, we added the derived coefficients. While the use of these derived coefficients significantly improved the results for direct recognition (clustering based directly on the MFCC coefficients), it did not produce any significant change when used as input to our algorithm, meaning that the information carried by these coefficients is already extracted by our representation.

b) Polysyllabic sets

2.4.A.i.b.1 Presentation of the problem


The case of continuous speech is tested here using polysyllabic words (trisyllables).
This kind of material raises additional problems compared with the monosyllabic data presented previously. In particular, the continuous speech has to be segmented into templates prior to the creation of the reference base and to the comparison between the new stimuli and the sounds of this base.
Considering the central role of the reference templates in our algorithm, it is relevant to study the impact of their selection on the results.

2.4.A.i.b.2 Comparison between syllabic and phonetic segmentations


We first compared representations based on syllabic templates with representations based on phonetic templates (both manually created), and with the direct clustering performed over the MFCC coefficients (plus derived coefficients). The use of syllabic templates proved more efficient than phonetic templates. The syllabic-template-based representation also proved more efficient than the MFCC coefficients (and, in order to compare models of equivalent complexity, we artificially increased the number of dimensions of the MFCC representation; in this case, the MFCC-based representation was equivalent to the phonetic-template-based one, but the representation based on syllabic comparisons still allowed a better discrimination of the phonetic categories).
[Figure 6 chart: "Polysyllables: Influence of the Templates", set [SRdm][aiu2]; classification error (training and generalization) for mfcc delta2, mfcc delta (13 * 20 coefs), syllabic templates, and phonetic templates.]
Figure 6. Polysyllabic set: efficiency of the supervised clustering on the MFCC
coefficients (plus derived coefficients), and on the comparison-based representation using syllabic and
phonetic templates. For the phonetic-template-based representation, the clustering clearly suffers
from over-fitting: although the representation is complex (the learning error rate is lower than
the one obtained from the MFCC coefficients), it is inadequate (the rules extracted from the
learning set cannot be applied to the generalization set).

These experiments were performed in "ideal" conditions, as the templates were segmented
manually. Of course, any automatic segmentation of the speech signal would be less precise, and would generate
errors in the template boundaries. But the sounds are warped before the comparisons (using the DTW
algorithm), so a highly precise segmentation is not mandatory.
The result of this experiment, showing the superiority of the syllabic templates, is not surprising: unlike
phonetic units, syllabic units take into account the coarticulation effect, and consequently avoid a very strong
source of noise.
The optimality of these syllabic sound units also strongly supports our model, as it has been shown
that newborn babies can perceive these units: it would give a role to this early sensitivity (which precedes the
phonetic acquisition), and would involve it in full language acquisition (unlike more classical models).

1.2.4.A.i.b.3 Segmentation of the continuous speech


A segmentation algorithm is presented in the appendix. This algorithm is coherent with our approach, as it
mainly considers global information about the speech.
It is based on the idea that the most relevant units are those that optimally describe the speech stream. Using
this assumption, the algorithm finds a (locally) optimal set of sounds that can, with minimal distortion, be
concatenated to match new speech stimuli in an optimal way.

1.2.4.A.i.b.4 Random templates


We also tested the efficiency of a sound representation based on randomly segmented templates.
We considered that 3 types of templates could be relevant:
- syllabic templates (a priori assumed to be optimal, as said above)
- phonetic templates (already described hereinabove)
- random templates
For the "random templates", we considered a segmentation of the speech at a constant length, as sketched
below. We tried several possible lengths, in order to find the optimal one.
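A minimal sketch of this constant-length segmentation (the 10-ms frame shift and all names are assumptions, not values from our experiments):

    import numpy as np

    def constant_length_templates(frames, length_ms, frame_shift_ms=10):
        """Cut a (n_frames, n_coefs) frame sequence every length_ms milliseconds."""
        step = max(1, length_ms // frame_shift_ms)   # frames per template
        return [frames[i:i + step]
                for i in range(0, len(frames) - step + 1, step)]

    frames = np.random.randn(300, 13)                # ~3 s of 10-ms frames
    for length in (40, 80, 120, 200, 400):           # candidate lengths (ms)
        print(length, len(constant_length_templates(frames, length)))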

Figure 7. 3 different types of segmentation for the creation of the template base: the portions of
signal between the red lines will be kept as reference templates. The 1st image corresponds to a
syllabic segmentation, the 2nd one is a phonetic segmentation, and the 3rd one corresponds to a
random segmentation, as the boundaries are fixed at regular intervals, without considering the data
contained in the signal. The syllabic and phonetic segmentations are obtained manually.

The comparison was performed using the 1-pass DP algorithm, independently for every reference
sound (so there is no interaction between the best paths through the different sounds, and the dimensions are
independent).
We also tried to synchronize the paths through the different dimensions (by forcing the paths along all
the dimensions to come back to the origin of their corresponding reference sounds at the same moment as
the optimal path does). However, this type of comparison was not more efficient than comparing the
stimuli independently with every reference sound.


[Figure 8 chart: "Polysyllabic set [SRdm][aiu2]: influence of the templates"; classification error (training and generalization) for mfcc delta, syllabic templates, phonetic templates, and random references of length 40 to 400 ms.]

Figure 8. Comparison between different segmentation techniques; for the randomly
segmented templates, lengths from 40 to 400 ms have been tried. These results show the
superiority of the syllabic templates. The U-shape of the score function according to the length
of the random units is also noticeable. The optimal value (120 ms) is longer than an average
phoneme, but shorter than a syllable. This result suggests that templates have to
represent the transition periods between phonemes to be efficient.

The results are presented in Figure 8.
- The syllabic templates proved to be optimal.
- As expected, the random templates were not optimal. It should be noted that the best random
templates corresponded to a length intermediate between that of a phoneme and that of a
syllable.
- The optimally sized random templates were more efficient than the phonetic templates. This
result is due to the additional freedom given to the comparisons with the randomly segmented
templates by the use of the 1-pass DP algorithm.
These experiments support the idea of syllables as basic perceptual units, prior to the learning of the
perception of phonemes.
They also suggest that, while an appropriate segmentation improves the efficiency of a later
clustering, a roughly achieved segmentation can still give relatively efficient results.

ii : Unsupervised clustering

The same experiment was performed using unsupervised learning. This experiment should be
more relevant in terms of phonetic acquisition, considering that this acquisition is achieved by babies in an
unsupervised way (i.e. it rests on passive listening to speech, and is not directed by any exterior
information).
However, we did not focus on this part of the learning mechanism in this report: the learning of units
such as phonemes involves complex learning algorithms (usually based on Hidden Markov Models (HMM);
for an example of such an algorithm, see [Takami and Sagayama, 1992]).
These algorithms usually build allophonic categories, which are then merged into phonetic categories
[Peperkamp and Le Calvez, 2003].

We tried to avoid these difficulties by using a simple language, featuring very distinct phonemes. In
these conditions, it is possible to obtain a satisfying result with simple algorithms such as K-means or EM
(unlike HMM-based algorithms, these algorithms do not use time-related information).
For the more complex languages, the simple clustering methods used did not allow us to obtain
satisfying results: the clusters found, from both the MFCC and the similarity-based representations, did not
correspond to phonetic units (the algorithm proposed allophonic units for some phonemes, and
merged other ones).

A test was performed on our simple language ("Simple set", containing the phonemes [r, m, s, a, e,
i]). The simplicity of this language allowed the use of simple unsupervised clustering algorithms (we tested
the K-means and EM algorithms), as sketched below.
This test showed a clear superiority of our representation. While the clustering proposed over the
MFCC representation proves to be irrelevant, the clustering proposed on our representation corresponds
to the categories found by a supervised algorithm (Figure 9).
As could be expected, the K-means and EM algorithms obtain the same error rate on our representation,
even though the second is much more complex. This is due to the special shape of the speech signal in our
representation (where the values correspond to similarities with other sounds, and thus "have a meaning"
for phonetic clustering).
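A sketch of such a test, assuming scikit-learn's KMeans and GaussianMixture (EM); the data is a random stand-in sized like the 8*9 detectors of Figure 9:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(600, 72)   # stand-in for 8 references x 9 syllables

    kmeans_labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)
    em_labels = GaussianMixture(n_components=6).fit_predict(X)
    # The cluster labels are then matched against the 6 phonetic categories
    # to compute the classification errors reported in Figure 9.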
[Figure 9 chart: "Unsupervised learning, Easy Set [mrs][aei]"; classification error (training and generalization) for the MFCC representation and for 8*9 detectors, under K-means and EM.]
Figure 9. Unsupervised learning of the phonetic categories, using the MFCC (and derived) coefficients
versus the similarity-based representation. The similarity-based representation proved clearly
superior to the MFCC-based one. In both cases, the number of clusters (corresponding to the 6
phonemes) was provided to the algorithm.

In these experiments, the number of categories to build was provided to the algorithm. However, this
additional information was not necessary to find the optimal partition, as shown in Figure 10.


[Figure 10 chart: "BIC (Bayesian Information Criterion): optimal number of clusters for the easy set [m R s a e i]"; BIC values (about 3700 to 4100) plotted against the number of clusters (3 to 12).]

Figure 10. Evaluation of the efficiency of the clustering, as a function of the number of clusters
created. The efficiency was computed using the Bayesian Information Criterion (BIC). The minimum
value of the BIC corresponds to the optimal number of clusters; here, the optimal number of clusters
corresponds to the 6 phonetic categories: [m R s a e i].
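The model-selection step illustrated in Figure 10 could be sketched as follows (assuming scikit-learn, whose GaussianMixture exposes a bic() method; the data is a stand-in):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(600, 72)   # similarity-based representation (stand-in)

    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(3, 13)}
    best_k = min(bics, key=bics.get)   # number of clusters with the lowest BIC
    print(best_k)                      # expected: 6 (one cluster per phoneme)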

2.4.B Impact of the language of the reference base

The representation built strongly depends on the reference base used (as the dimensions of the
representation correspond to sounds of the reference base).
For instance, a well-balanced base containing sounds of the learned language should be optimal for
the acquisition of phonetic categories, while learning a language from a representation composed of
inadequate sounds (like syllables of another language) should be harder.
In order to test the impact of the language of the reference base, we tested the learning of a language
represented on a base containing different phonemes.
We considered 2 languages: the "Hard Set" (phonemes [p, t, k, u, o, y]) and the "Easy Set" (phonemes
[m, r, s, a, e, i]). We tried to discriminate phonemes from the Easy Set using a representation based on the
Hard Set, and vice versa (discriminating the phonemes of the Hard Set, represented with a reference
base built from sounds of the Easy Set).
Both sets contain the same number of phonemes and possible syllables (so the numbers of dimensions
in an Easy-Set-based representation and in a Hard-Set-based representation are equal).


[Figure 11 chart: "Inadequate Reference Bases: Easy Set [Rms][aei] / Hard Set [ptk][uyo]"; classification error (training and generalization) for the conditions Reference = Easy / Learned = Easy, Reference = Hard / Learned = Easy, Reference = Hard / Learned = Hard, and Reference = Easy / Learned = Hard.]

Figure 11. Influence of the reference base on the acquisition of the phonetic categories: clustering
algorithms were trained to learn phonetic categories from data represented on an adequate
representation (when the language of the reference base corresponds to the learned language), or on an
inadequate one (when the language of the reference base differs from the learned one).

These results clearly demonstrate the impact of the language of the reference base. It is much harder
to learn the phonetic categories if the speech is represented on an inadequate base.
This shows that the construction of a reference base adapted to the learned language is a first step in
the acquisition of phonetic categories. It matches the behavioral observations, as adults are less efficient at
discriminating phonemes that do not appear in their native language.

2.4.C Enhancements

The proposed representation exhibits some important particularities, due to the nature of its dimensions
(which are independent similarities). These particularities can be used in order to adapt the clustering
algorithm used to actually create the phonetic categories.
For instance, it can be admitted that each syllable contains a limited number of phonemes (for
instance, one could say that, in most languages, most syllables are composed of 3 phonemes
or less).
Based on this assumption, and noticing that the detectors are syllabic templates, some observations
can be made on the shape of the categories to create. One reference sound should contain a limited number
of phonemes (usually up to 3). So one dimension, corresponding to this reference sound, should have a
limited number of phonemes (those included in the reference sound, thus matching with it) centered at a
high similarity value, while all the other phonemes should be centered at a low similarity.

It is also possible to simplify the representation, considering it is based on similarities. Instead of
considering the continuous similarity measure used in the proposed experiments, it is possible to use a
simple binary notation: if, at a given moment, the stimulus matches a reference sound, the
corresponding dimension will be set to 1; otherwise its value will be set to 0 (the decision "matching"/"not
matching" can be based on a simple threshold over the similarities).
This simplification offers the advantage of being relatively similar to the data manipulated by neural
models. We compared the efficiency of the recognition based on a continuous and a binary representation,
using a simple artificial neural network (perceptron) trained in a supervised way; a sketch is given below. It
was also coherent with the concept of representation by similarities, as the only information representing the
sounds was the list of references matching them (this information was sufficient to achieve the recognition, as
the number of reference sounds was relatively high). In terms of quantity of data manipulated, this binary
representation was clearly less complex than the MFCC representation.
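A sketch of this binarization and of the perceptron test (assuming scikit-learn's Perceptron; the dimensions, sized like the 12*16 references of Figure 12, and the labels are illustrative stand-ins):

    import numpy as np
    from sklearn.linear_model import Perceptron

    S = np.random.randn(600, 192)            # continuous similarities (stand-in)
    y = np.random.randint(0, 12, size=600)   # phoneme labels (stand-in)

    threshold = S.mean() + S.std()           # "average matching + 1 std dev"
    B = (S > threshold).astype(int)          # binary representation

    clf = Perceptron().fit(B, y)
    print(1.0 - clf.score(B, y))             # training classification error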
We tested the efficiency of this simplified representation both on the "Full Set" (composed of the
phonemes [m, s, R, p, t, k, a, e, i, u, y, o]) and on the polysyllabic set (manually segmented, composed of the
phonemes [R, S, d, m, a, e, i, u]).

[Figure 12 charts: "Trisyllables [SRdm][aiu2]" and "all easy + hard set 9x9 [ptkRms][uyoaei]"; classification error (training and generalization) for the MFCC and mfcc+delta2 representations, the binary similarity-based representations (12*16 and 24*16 references), and the continuous similarity-based representations (with and without the time distance).]

Figure 12. Comparison of the efficiency of different representations: the MFCC representation (and
MFCC plus 2 derived coefficients, "delta2"), the binary similarity-based representation (with 12x16 or
24x16 reference sounds), and the continuous similarity-based representation (using or not the temporal distance). For
these tests, the threshold used to create the binary representation was the average matching plus one
standard deviation (all the coefficients higher than this level were set to 1, all the ones matching less
were set to 0).

These results show the efficiency of a binary representation. Despite its simplicity, the results obtained
are relatively similar to the results obtained using a continuous representation (and still show an
improvement compared with the MFCC-based representations).
It is also important to notice that the binary representation allows the results to keep improving when the
database is enlarged (as shown with the polysyllabic examples), while a database of the same size would lead
to over-fitting if used with a continuous representation.
These considerations suggest that the thresholding applied to obtain the binary representation actually
keeps most of the relevant information for phoneme discrimination.

2.5 Known limitations of the proposed model

One strong criticism against the presented model could be that it does not correspond to the observed
chronology of phonetic discrimination. It has been shown that, prior to any phonetic acquisition, newborn
babies can discriminate virtually every phonetic category. However, according to our model, infants
should learn discriminations (by adding discriminant pairs of templates to their reference base), and not
forget distinctions, as is actually observed (reproducing the observations would require a full initial reference
base from which references are forgotten during learning, which is implausible).
However, an evolution of this model could be suggested, in which the reference base would not be a
set of real speech templates, but more abstract data. These data, which could correspond to artificial
sounds, could be initialized in a homogeneous way along the possible speech spectrum. While the baby
hears his language, he adapts the reference sounds to fit the statistical distribution of the heard speech; a
sketch of this idea is given below. By doing so, he will decrease his sensitivity to sounds that do not
correspond to any receptor, and increase it for sounds in zones containing many receptors.
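One possible sketch of this suggested evolution, which was not implemented in this work, uses simple competitive learning over artificial receptors (all values and names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    # Receptors initialized homogeneously over an abstract spectral space.
    receptors = rng.uniform(-1.0, 1.0, size=(50, 13))

    def adapt(receptors, frame, lr=0.05):
        # The receptor closest to the heard frame moves slightly toward it.
        winner = np.argmin(np.linalg.norm(receptors - frame, axis=1))
        receptors[winner] += lr * (frame - receptors[winner])
        return receptors

    # "Heard speech": frames concentrated in one region of the space.
    for frame in rng.normal(0.3, 0.2, size=(10000, 13)):
        receptors = adapt(receptors, frame)
    # Receptors end up dense where speech is frequent, sparse elsewhere,
    # mirroring the predicted loss of sensitivity for non-native sounds.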

Conclusion

We have proposed in this report an original model for the acquisition of phonetic categories. While
this model mainly focuses on the representation of the speech stream by young infants, we see it as
a first stage of language acquisition.
The proposed representation can adapt to the linguistic environment in which it is developed, which
simplifies the acquisition of the phonetic categories.
We have shown that the proposed representation clearly depends on the stimuli perceived during
the early stage of language acquisition, as an inadequate representation makes the acquisition of phonetic
categories harder, while an appropriate one facilitates the learning of these categories.

We showed that a syllabic point of view is the most relevant for the acquisition of phonetic categories
through our model, which is coherent with the early perception of syllables by infants. Consequently, the
developed model allows us to propose a coherent chronology of speech acquisition during the first year of
life.

Our approach is closely related to the idea proposed by M. Coath and S. L. Denham, who propose to
represent sounds using similarities with reference templates. However, we propose a more complex
algorithm (warping the sounds before comparison), which allows us to obtain a representation relevant
for phoneme acquisition. We also propose a comparison between our representation and a more
conventional one, which gives an estimate of the gain obtained by our system.
We also compare the efficiency of different kinds of templates; while this previous study only considered
reference templates corresponding to the sounds they wanted to discriminate (digits pronounced in
English), we used templates that did not correspond to the units we wanted to discriminate (syllables,
while we were aiming to discriminate phonemes). Nevertheless, this work, like the one
presented in this report, is a direct application of the theoretical considerations previously developed by
S. Edelman, who underlined the benefits for recognition of a representation based on similarity measures.


2 APPENDICES

1 Used algorithms: DTW and 1-Pass DP

1.1 DTW algorithm

The DTW algorithm provides an efficient measure of similarity between two sounds. In order to
increase the robustness of the comparison between the sounds, they are warped relative to each
other. This warping aligns the parts of the two sounds that correspond to each other, so that the
sounds are compared in a relevant way.
Considering 2 sounds, coded as two sequences of L1 and L2 vectors (or frames), the comparison
is performed in 3 steps.
First, a distance matrix D between the two sounds is created. Each element D(t1,t2) of this matrix, of
size L1xL2, corresponds to the distance between the t1th frame of sound 1 and the t2th frame of
sound 2 (so this matrix represents the distance between every pair of frames of the two sounds).
A path along this matrix (a sequence of couples (t1,t2), each element t1 and t2 increasing)
corresponds to a distortion of sound 1 relative to sound 2, or vice versa. The aim of the DTW
algorithm is to find a path connecting the 1st element of the distance matrix, D(0,0), to the last one,
D(L1,L2), minimizing the sum of the elements encountered (each element corresponding to the distance
between the 2 sounds at a given time, the path found corresponds to the distortion minimizing the
distance between the sounds).
The distance between two frames can be computed using the Euclidean distance after a
normalization of every component of the vectorial representation of the sound. It is also possible to use
the Mahalanobis distance between the two frames (this distance takes into account the statistical
distribution of the frames, via the covariance matrix of a set of frames). In the experiments presented in
this report, the first solution (Euclidean distance between normalized vectors) has been used, for
computational reasons. A sketch of both distances is given below.
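Both frame distances can be sketched as follows (assuming NumPy and SciPy; the frame set is a random stand-in):

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    frames = np.random.randn(1000, 13)               # a set of MFCC frames
    mu, sigma = frames.mean(axis=0), frames.std(axis=0)
    f1, f2 = frames[0], frames[1]

    # Euclidean distance between per-component normalized frames.
    d_euclid = np.linalg.norm((f1 - mu) / sigma - (f2 - mu) / sigma)

    # Mahalanobis distance, using the covariance matrix of the frame set.
    VI = np.linalg.inv(np.cov(frames, rowvar=False))  # inverse covariance
    d_mahal = mahalanobis(f1, f2, VI)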
[ILLUSTRATION: distance matrix]
In a 2nd step, the Cumulated Distance matrix C is created. Each element of this matrix
corresponds to the sum of the distances along the shortest path from the 1st element of the distance
matrix, D(0,0), to a given couple of coordinates (t1,t2).
The cumulated distance matrix is built recursively: the 1st element of this matrix, C(0,0), is fixed at 0.
Then, the element C(t1,t2) is equal to the smallest possible previous element (for instance, the minimum
of C(t1-1,t2-1), C(t1-1,t2), or C(t1,t2-1); this minimum is called the origin of the element C(t1,t2), as
the optimal path leading to this element comes from its origin), plus the value of D(t1,t2)
(corresponding to the distance between the t1th frame of the 1st sound and the t2th frame of the 2nd
sound). In order to find the best path leading to an element, the origin of every couple of frames is kept in
a matrix called the path matrix.

[ILLUSTRATION: cumulated distance matrix]


Finally, in a 3rd step, the path matrix is used to find the optimal path connecting the 1st element to
the last one. This path is found iteratively, starting from the final element: the previous element in the
path is its origin, which can be found using the path matrix. The origin of that element can be found in
the same way, so that, gradually, the best path is traced back, corresponding to the distortion of the two
sounds that minimizes the distance between them. A minimal implementation of these three steps is
sketched below.
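The following sketch (Python/NumPy, with the simple move set described above; all names are ours) implements the three steps:

    import numpy as np

    def dtw(s1, s2):
        """s1: (L1, d) array, s2: (L2, d) array of (normalized) feature frames."""
        L1, L2 = len(s1), len(s2)
        # Step 1: distance matrix D between every pair of frames (Euclidean).
        D = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)
        # Step 2: cumulated distance matrix C and path (origin) matrix.
        C = np.full((L1, L2), np.inf)
        origin = np.zeros((L1, L2), dtype=int)  # 0: diagonal, 1: up, 2: left
        C[0, 0] = 0.0                           # fixed at 0, as described above
        for t1 in range(L1):
            for t2 in range(L2):
                if t1 == 0 and t2 == 0:
                    continue
                prev = (C[t1-1, t2-1] if t1 > 0 and t2 > 0 else np.inf,
                        C[t1-1, t2] if t1 > 0 else np.inf,
                        C[t1, t2-1] if t2 > 0 else np.inf)
                origin[t1, t2] = int(np.argmin(prev))
                C[t1, t2] = D[t1, t2] + min(prev)
        # Step 3: trace the optimal path back from the last element.
        path, t1, t2 = [(L1 - 1, L2 - 1)], L1 - 1, L2 - 1
        while (t1, t2) != (0, 0):
            move = origin[t1, t2]
            t1, t2 = (t1 - 1, t2 - 1) if move == 0 else \
                     ((t1 - 1, t2) if move == 1 else (t1, t2 - 1))
            path.append((t1, t2))
        return C[-1, -1], path[::-1]

    cost, path = dtw(np.random.randn(40, 13), np.random.randn(55, 13))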

The 1-Pass DP algorithm improves on the DTW algorithm by allowing the comparison of a long (continuous)
sound S with several shorter templates {S1... Sn}. The output of this algorithm is the optimal sequence of
short templates, Si1...Sik, and, for each template Sij, a sequence of couples of frames (one
corresponding to the continuous sound, the other to the template Sij). These sequences create a
path through the continuous sound and the set of templates, minimizing the distance between them.

1.2 Comparison between the proposed clustering method and the "K-means over Trajectories" method

Our model can be related to a more conventional algorithm for the clustering of multidimensional time
series.
This algorithm is simply an adaptation of the K-means algorithm, and thus is an unsupervised
clustering algorithm.
The K-means algorithm usually considers points, instead of sequences. For creating N clusters, it is
initialized by associating random values to N points, considered as the centroids of the initial clusters.
Then, it repeats the following steps:
- For every point, it finds the closest centroid, using the Euclidean distance. The point will be
considered as part of the cluster associated with its closest centroid.
- For every cluster, the centroid is recomputed, according to the set of points that have been declared
as part of it.
After iterating these operations, the centroids should converge to some values, creating a Voronoi
decomposition of the space (each cluster being a part of this decomposition).
This algorithm can be modified in order to classify trajectories instead of points: trajectories are
considered as centroids, and the distances between trajectories are computed using the DTW
algorithm instead of the Euclidean distance, as sketched below.
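A sketch of this variant, reusing the dtw function above; since averaging warped sequences is not straightforward, the centroid update is replaced here by a medoid update, a substitution of ours rather than part of the original method:

    import numpy as np

    def kmeans_trajectories(sequences, k, n_iter=10, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: k sequences drawn at random serve as centroids.
        centroids = [sequences[i]
                     for i in rng.choice(len(sequences), k, replace=False)]
        labels = [0] * len(sequences)
        for _ in range(n_iter):
            # Assignment step: each sequence joins its DTW-closest centroid.
            labels = [int(np.argmin([dtw(s, c)[0] for c in centroids]))
                      for s in sequences]
            # Update step: the medoid of each cluster (the member minimizing
            # the summed DTW distance to the others) replaces the centroid.
            for j in range(k):
                members = [s for s, l in zip(sequences, labels) if l == j]
                if members:
                    costs = [sum(dtw(m, c)[0] for m in members) for c in members]
                    centroids[j] = members[int(np.argmin(costs))]
        return labels, centroids

    # Example: 30 short random "sounds" of varying length, 3 clusters.
    sounds = [np.random.randn(np.random.randint(20, 40), 13) for _ in range(30)]
    labels, centroids = kmeans_trajectories(sounds, k=3)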
Such an algorithm could be used to classify sounds in an unsupervised way, and thus to find the
phonetic units. However, this algorithm suffers from several weaknesses:
First, it requires the sequences to be segmented in a way relevant for the clustering. For the
acquisition of phonetic units, it would require the sounds to be segmented following the phoneme
boundaries, raising a bootstrapping problem: how to find these boundaries if the phonemes are
unknown? An algorithm dealing with this bootstrapping problem can be derived from the
segmentation algorithm proposed below (however, its complexity is much higher than that of the algorithm
proposed in this report).
Second, it is, from a computational point of view, a very expensive algorithm. It involves, for every iteration
of the K-means algorithm, the computation of the distance between every instance of the sounds to
classify and the centroid of every cluster. From this point of view, the proposed algorithm,
proceeding in two steps (first comparing the sounds to a predetermined database, and then applying
the standard K-means algorithm on points), can be considered as an approximation of this algorithm,
as it is computationally less expensive.

Speech segmentation based on template permutations:

The algorithm presented in this section proposes an original and purely top-down method for syllabic
or phonetic segmentation. The originality of this technique is its extensive use of global information, and the
absence of any direct use of local information.

2.1.A Basic concept

The method proposed hereafter relies on a conception of phonemes or syllables as the best units for a
description of the speech signal: the syllabic or phonetic segmentations are optimal for describing the speech
as a permutation/segmentation of a finite number of units.

This algorithm thus tries to find a segmentation of a relatively short speech stimulus providing
independent units that allow describing new speech stimuli as well as possible. The best segmentation is
the one minimizing the deformation that would have to be applied to the new stimuli in order to
obtain the best matching (i.e. the best description) between these new stimuli and the considered units.
An analogy can be proposed between this task and a jigsaw puzzle: considering a given set of
pieces, one can try to rebuild a given picture. The pieces of the puzzle are the sound units, the picture to
rebuild is the speech stream; the algorithm proposed here finds the optimal set of pieces to rebuild the
image. We would want to prove that these are the phonetic or syllabic units.

2.1.B Description of the algorithm

The proposed mechanism relies on an EM algorithm.
It is initialized with a given set of units. These units do not necessarily have to correspond to real
linguistic units, considering that their boundaries will be modified in order to find the "best units" (corresponding
to syllables or phonetic units). It is an iterative algorithm, computing, on every iteration, an Expectation
phase, which uses the given set of segmented units to describe new input stimuli as well as possible.
The Expectation phase is followed by a Maximization phase, during which, using the description of the new
speech stimuli found previously, the units are redefined.

i:

Expectation step

In this step, we try to find a "description" of speech stimuli, based on a limited set of segmented units.
The description considered is obtained using the 1-pass DP algorithm. This algorithm allows finding the
optimal concatenation of the sound units matching the new speech stream.
The 1-pass DP algorithm also provides the deformations of the sounds that are necessary to obtain this
matching. This deformation is the objective we will try to minimize.

ii : Maximization step

The result of the expectation step described hereinabove is used to modify the boundaries of the
sound units considered.
These boundaries are moved according to the deformation proposed by the 1-Pass DP algorithm.
This algorithm should converge to units locally minimizing the distortion when describing new speech stimuli.
Ideally, there should be two sets of units minimizing this distortion, namely the syllabic and the phonetic units.
Unfortunately, I could not fully develop and optimize this algorithm. However, a naive version has been
implemented, providing promising results on simple tests. Further development would be needed.


3 References
[Bertoncini and Mehler, 1981] Bertoncini, J., Mehler, J. (1981). Syllables as units in infant speech perception. Infant Behavior and Development.
[Beymer and Poggio, 1996] Beymer, D., Poggio, T. (1996). Image Representations for Visual Learning. Science.
[Blumstein and Stevens, 1979] Blumstein, S.E., Stevens, K.N. (1979). Acoustic invariance in speech production: evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America.
[Coath and Denham, 2005] Coath, M., Denham, S.L. (2005). Robust sound classification through the representation of similarity using response fields derived from stimuli during early experience. Biological Cybernetics.
[Cutzu and Edelman, 1996] Cutzu, F., Edelman, S. (1996). Faithful representation of similarities among 3D shapes in human vision. PNAS.
[Dehaene-Lambertz and Dehaene, 1994] Dehaene-Lambertz, G., Dehaene, S. (1994). Speed and cerebral correlates of syllable discrimination in infants. Nature.
[Dehaene-Lambertz et al., 2002] Dehaene-Lambertz, G., Dehaene, S., Hertz-Pannier, L. (2002). Functional neuroimaging of speech perception in infants. Science.
[Dehaene-Lambertz et al., 2006] Dehaene-Lambertz, G., Hertz-Pannier, L., Dubois, J. (2006). Nature and nurture in language acquisition: anatomical and functional brain-imaging studies in infants. Trends in Neurosciences.
[Dupoux et al., 1999] Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., Mehler, J. (1999). Epenthetic vowels in Japanese: a perceptual illusion? Journal of Experimental Psychology.
[Edelman, 1998] Edelman, S. (1998). Representation is representation of similarities. Behavioral and Brain Sciences.
[Eimas et al., 1971] Eimas, P.D., Siqueland, E.R., Jusczyk, P., Vigorito, J. (1971). Speech perception in infants. Science.
[Eimas et al., 1975] Eimas, P.D., Tartter, V.C. (1975). The role of auditory feature detectors in the perception of speech. Perception and Psychophysics.
[Fowler et al., 1993] Fowler, A., Saltzman, E. (1993). Coordination and coarticulation in speech production. Language and Speech.
[Friederici and Wessels, 1993] Friederici, A.D., Wessels, J.M.I. (1993). Phonotactic knowledge of word boundaries and its use in infant speech perception. Perception & Psychophysics.
[Hirsh-Pasek et al., 1987] Hirsh-Pasek, K., Golinkoff, R.M., Cauley, K.M., Gordon, L. (1987). The eyes have it: Lexical and syntactic comprehension in a new paradigm. Journal of Child Language.
[Jusczyk et al., 1996] Jusczyk, P.W., Myers, J., Kemler Nelson, D.G., Charles-Luce, J., Woodward, A.L., Hirsh-Pasek, K. (1996). Infants' sensitivity to word boundaries in fluent speech. Journal of Child Language.
[Kuhl, 1983] Kuhl, P.K. (1983). Perception of auditory equivalence classes for speech in early infancy. Infant Behavior and Development.
[Kuhl et al., 1992] Kuhl, P.K., Williams, K.A., Lacerda, F., Stevens, K.N. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science.
[Kuhl, 1992] Kuhl, P.K. (1992). Infants' Perception and Representation of Speech: Development of a New Theory.
[Lacerda, 1995] Lacerda, F. (1995). The perceptual-magnet effect: An emergent consequence of exemplar-based phonetic memory. Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm.
[Mattys and Jusczyk, 2001] Mattys, S.L., Jusczyk, P.W. (2001). Do Infants Segment Words or Recurring Contiguous Patterns? Journal of Experimental Psychology.
[Maye et al., 2002] Maye, J., Werker, J.F., Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition.
[Mermelstein, 1976] Mermelstein, P. (1976). Distance measures for speech recognition, psychological and instrumental. In Pattern Recognition and Artificial Intelligence.
[Myers and Rabiner, 1981] Myers, C.S., Rabiner, L.R. (1981). A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal.
[Nazzi et al., 1998] Nazzi, T., Bertoncini, J., Mehler, J. (1998). Language discrimination by newborns: towards an understanding of the role of rhythm. Journal of Experimental Psychology.
[Osherson et al., 1984] Osherson, D.N., Stob, M., Weinstein, S. (1984). Learning theory and natural language. Cognition.
[Pena et al., 2003] Pena, M., Maki, A., Kovacic, D., Dehaene-Lambertz, G. (2003). Sounds and silence: An optical topography study of language recognition at birth. PNAS.
[Peperkamp and Le Calvez, 2003] Peperkamp, S., Le Calvez, R. (2003). The acquisition of allophonic rules: statistical learning with linguistic constraints. Cognition.
[Pisoni et al., 1980] Pisoni, D.B., Jusczyk, P.W., Walley, A., Murray, J. (1980). Discrimination of relative onset time of two-component tones by infants. Journal of the Acoustical Society of America.
[Riedmiller and Braun, 1993] Riedmiller, M., Braun, H. (1993). A direct adaptive method for faster back-propagation learning: the RPROP algorithm.
[Shepard, 1968] Shepard, R.N. (1968). Cognitive psychology: A review of the book by U. Neisser. American Journal of Psychology.
[Smith and Lewicki, 2006] Smith, E.C., Lewicki, M.S. (2006). Efficient auditory coding. Nature.
[Somervuo and Harma, 2004] Somervuo, P., Harma, A. (2004). Bird song recognition based on syllable pair histograms. Acoustics, Speech, and Signal Processing.
[Takami and Sagayama, 1992] Takami, J., Sagayama, S. (1992). A successive state splitting algorithm for efficient allophone modeling. Acoustics, Speech, and Signal Processing.
[Werker and Tees, 1984a] Werker, J.F., Tees, R.C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development.
[Werker and Tees, 1984b] Werker, J.F., Tees, R.C. (1984). Phonemic and phonetic factors in adult cross-language speech perception. Journal of the Acoustical Society of America.
