
Assessing the number of speakers through visual assessment of tendency

for effective speech clustering

T. Suneetha Rani*
M. H. M. Krishna Prasad**

* Department of Computer Science and Engineering, JNTUK, Kakinada, AP.


** Department of Computer Science and Engineering, JNTUK, Kakinada, AP.

ABSTRACT

Assessing the number of speakers in unlabeled speech data is a key problem in speech clustering; even well-known clustering methods such as k-means or graph-based methods are unable to find the number of speakers, i.e., the k value in k-means. Thus, several cluster-detection methods have been studied for detecting the number of speakers in speech data. Our investigation shows that visual assessment of tendency (VAT) detects the k value more effectively than other cluster-detection methods. The process of speech clustering is devised as follows: extract features from the speech utterances, model the utterances separately, use the visual method (VAT) to detect the number of speakers (and hence clusters), and perform clustering. In feature extraction, the speech utterances are represented in Mel-Frequency Cepstral Coefficient (MFCC) form, and the modeling step is done by Gaussian Mixture Modeling (GMM), which derives the GMM mean supervectors of the speech utterances. These vectors are compared using distance metrics in VAT to detect the cluster tendency of the speech data; the cluster tendency determines the number of speakers. However, a visual method limits clustering performance due to the high-dimensionality problem. This paper addresses that problem and improves the performance of the proposed clustering methods by deriving i-vectors from the GMM mean supervectors. I-vectors are more robust than GMM mean supervectors; they are intermediate vectors between the high-dimensional space of the GMM mean supervectors and a low-dimensional manifold. This paper develops enhanced clustering approaches based on i-vectors and various subspace learning techniques. The efficiency of the proposed methods is demonstrated and discussed in an experimental study on real-time datasets.

Keywords: Speech Clustering, GMM, GMM mean supervectors, i-vectors

Corresponding Author:
T Suneetha Rani
Department of Computer Science and Engineering
JNTUK
Kakinada, Andhra Pradesh, India
Email: suneetha.qis@gmail.com

1. INTRODUCTION
Speakers (or clusters) detection algorithms has emerged significant research in speech clustering. It
is also required to derive the features space of individual speech segments (or speech utterances) in clusters
detection algorithms. Speech clustering is an unsupervised method, In [1] [2], have presented that it is a two-
step process, the steps are described as follows: a) to find labels of speech segments by similarity features,
and b) to discover the groups of speech utterances (or speech segments) based on their similarity features.
The similarity features are computed by distance metrics. The primary component of speech clustering
system is the back-end in which the speakers utterances are modeled. Modeling of speech utterance is
usually accomplished by a well-known statistical method, called as Gaussian Mixture Model (GMM). The
GMM and universal background model (UBM) are described in [3] [4]. The UBM represents the average
voice of speakers.

Thus, the speaker-specific model is adapted from the UBM using maximum a posteriori (MAP) estimation; this process is clearly described in [3]. Automatic detection of the number of speakers in speech is one of the key problems, and it is addressed in this paper through visual methods. VAT is a visual method for detecting cluster tendency, which makes it a reasonable choice for detecting the number of speakers in speech data. The GMM mean supervectors of the speech utterances are derived from the GMM-UBM framework; the dissimilarity features between GMM mean supervectors are the input of the VAT method, and the output is shown as a VAT image. The VAT image consists of a number of square-shaped dark blocks, where each block indicates the speech utterances of the same speaker. Thus, the number of speakers is assessed by counting the number of square-shaped dark blocks in the VAT image. During dissimilarity computation in VAT, different distance metrics such as Euclidean and cosine are used for better speaker-discrimination results. VAT is also useful for finding segment labels of the speech segments. The clustering itself is performed using the assessed value of k in k-means [5] and in graph-based clustering [6]. However, the GMM mean supervectors are high-dimensional, and they must be transformed into a low-dimensional manifold to improve the speed of the clustering methods. In such low-dimensional transformations, there is a chance of losing feature information. The i-vector is the best alternative for addressing this problem; [7] explains the use of i-vectors in speech-related algorithms. I-vectors are intermediate vectors between high-dimensional and low-dimensional vectors. Recent state-of-the-art speech processing techniques [15], [17] prefer i-vectors over GMM mean supervectors because they produce better results. Fig. 1 shows the steps of the proposed work.

Fig. 1 Key steps of Proposed Work


This paper presents the proposed speech clustering methods, which address the speaker-detection problem and discover effective speech clusters by means of i-vectors. Visual methods and i-vector generation are vital steps in the proposed methods, namely i-vector-VAT-based k-means (iVK) and the i-vector-VAT-based MST clustering method (iVM). The performance of speech clustering methods also depends on the cluster tendency; this is solved by VAT, which detects the cluster tendency as a prior step in speech clustering.
The contributions of the proposed work in this paper are summarized as follows:
1. The GMM model parameters of the speech utterances are derived.
2. Equivalent i-vectors of the GMM mean supervectors are derived to improve the speed of the clustering methods.
3. The number of clusters (or speakers) in unlabeled speech data is detected for effective speech clustering.
4. The i-vector-based clustering results are derived and the performance rate is demonstrated.
5. An empirical analysis demonstrates the effectiveness of the proposed methods.
Unsupervised methods have no prior information about the data labels of the speech utterances; the goal of these methods is to extract the inherent information about the labels. Normally, the aim of clustering algorithms is to partition data objects into groups that exhibit the essential clustering structure. The rest of the paper is organized as follows: related work is discussed in the next section, the following section presents the proposed work, the experimental results and their discussion are presented in a separate section, and the final section presents the conclusion along with the future scope of the paper.

2. RELATED WORK

In speech clustering, the speech data is represented in Mel-Frequency Cepstral Coefficient (MFCC) form; [8] describes the GMM process for finding the model parameters of speech utterances for speaker identification. [11] discusses unsupervised acoustic model training with the Gaussian mixture model (GMM) and the hidden Markov model (HMM). A discriminative process is used to discriminate speakers based on model parameters, as clearly presented in [12]. The probabilistic estimation of a GMM or HMM is also effective for differentiating speaker segments, as proposed in [13]. In all these systems, GMM is the most common approach for characterizing a speaker's voice through estimated parameters; it is used exclusively for deriving speaker models from speaker-specific utterances. Generally, a GMM with m components is abbreviated as m-GMM, and commonly used configurations are 64-GMM, 128-GMM, 256-GMM, 512-GMM, 1024-GMM, etc. The following equations (Eqn. (1) and Eqn. (2)) are used for GMM modeling.

$$P(x \mid \lambda) = \sum_{i=1}^{m} w_i \, N(x \mid \mu_i, \Sigma_i) \qquad (1)$$

$$P(i \mid X_t) = \frac{w_i \, N(X_t \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{m} w_j \, N(X_t \mid \mu_j, \Sigma_j)} \qquad (2)$$

In Eqn. (1) and Eqn. (2), x and X_t denote d-dimensional feature vectors, w_i denotes the prior probability (mixture weight) of the i-th component, and \mu_i and \Sigma_i denote the mean and covariance of the i-th Gaussian component of the GMM \lambda. Using Eqn. (1) and Eqn. (2), the GMM model parameters are found by maximum likelihood estimation (MLE), i.e., the expectation-maximization (EM) algorithm. The GMM mean supervector is shown in Eqn. (3):

$$S = [\mu_1^T \; \mu_2^T \; \mu_3^T \; \cdots \; \mu_m^T]^T \qquad (3)$$
In Eqn. (3), S is the high-dimensional supervector. These vectors are transformed into a low-dimensional form to reduce the time complexity of the speech clustering process; in such transformations, there is a chance of losing feature information. To overcome this problem, this paper uses the i-vector representation, which also compensates channel disparities. The i-vector is a compact representation, and these vectors yield more effective clustering results than plain low-dimensional projections. Therefore, this paper addresses both dimensionality reduction and channel compensation via i-vectors in the proposed framework.
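To make the supervector construction of Eqn. (3) concrete, the sketch below fits a small GMM to one utterance's feature frames and stacks the component means. This is a simplification: the paper derives speaker-specific means by MAP adaptation from a UBM, whereas here the GMM is fitted directly per utterance; the function name and parameters are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_mean_supervector(frames, m=4, seed=0):
    """Fit an m-component GMM to an utterance's MFCC frames and stack
    the component means into a single supervector S = [mu_1^T ... mu_m^T]^T.
    (Illustrative stand-in for MAP adaptation from a UBM.)"""
    gmm = GaussianMixture(n_components=m, covariance_type='diag',
                          random_state=seed).fit(frames)
    return gmm.means_.reshape(-1)   # shape: (m * d,)
```

For d-dimensional frames and m components, the result is an (m * d)-dimensional vector, which is why supervectors are high-dimensional and motivate the i-vector step below.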
[9], [10] discuss bottom-up and top-down clustering, k-means, and other traditional methods for performing speech clustering. These methods require initial information about the number of clusters (or speakers) for a given set of speech utterances. Determining the number of speakers involved in speech data is normally called the cluster tendency of the speech data. Post-validation measures, such as the cluster validation indices of [19], [20], can be used for assessing cluster tendency. However, the cluster tendency is required in the initial steps of k-means or graph-based clustering methods for discovering speech clusters of the expected quality. Thus, pre-assessment methods are required for efficient speech clustering. In this study, it is found that the visual assessment of tendency (VAT) method is well suited for determining prior information about the number of speakers in given speech data. VAT [18] was proposed by Bezdek. In VAT [18], dissimilarity information about the data objects (i.e., speech utterances) is computed in matrix form, called the dissimilarity matrix D. The data objects in D are reordered by Prim's logic; this reordering moves similar objects into the same groups and produces the reordered dissimilarity matrix, known as the RDM. Finally, the VAT procedure displays the image of the RDM, which shows the number of speakers as the number of square-shaped dark blocks along the diagonal of the RDM image (also known as the VAT image). The VAT [18] procedure is shown in the following algorithm.
VAT(int diss[ ][ ], int n)
Begin
Step 1:
    Initialize I = { }; J = {0, 1, ..., n-1}
    Find the maximum of diss[ ][ ]; let its cell be (i, j)
    P[0] = i; I = {i}; J = J - {i}
Step 2:
    for (s = 1; s < n; s++)
    {
        Find (i, j) = argmin { diss[i][j] : i in I, j in J }
        I = I U {j}; J = J - {j}
        P[s] = j
    }
Step 3:
    /* Compute the reordered dissimilarity matrix */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            print(diss[P[i]][P[j]])
End
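The pseudocode above translates almost directly into Python. The following is a minimal sketch; displaying the returned matrix as a grayscale image (e.g., with matplotlib's imshow) is left to the caller.

```python
import numpy as np

def vat(diss):
    """Reorder a dissimilarity matrix with Prim-like logic (VAT).
    Dark square blocks on the diagonal of the returned matrix
    suggest clusters (here: speakers)."""
    n = diss.shape[0]
    # Step 1: start from a row of the maximum dissimilarity entry.
    i, _ = np.unravel_index(np.argmax(diss), diss.shape)
    P = [i]
    J = set(range(n)) - {i}
    # Step 2: repeatedly pull in the object closest to the ordered set.
    for _ in range(1, n):
        _, q = min((diss[p, q], q) for p in P for q in J)
        P.append(q)
        J.remove(q)
    # Step 3: the reordered dissimilarity matrix (the "VAT image").
    order = np.array(P)
    return diss[np.ix_(order, order)]
```

Counting the dark diagonal blocks of the returned matrix gives the k value used later by the clustering step.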
Sample visual results of the assessment of speech cluster tendency are shown in Fig. 2.

a) VAT image for the 3-speaker dataset b) VAT image for the 4-speaker dataset


Fig. 2 Assessment of the number of speakers (cluster tendency) of a speech dataset
Existing clustering methods such as k-means and graph-based techniques can assess the speaker information from GMM mean supervectors via the visual method (i.e., VAT). In the proposed work, i-vectors are used in place of GMM mean supervectors. Hence, the visual method is enhanced to i-vector-based VAT for better assessment of speech cluster tendency, and the respective clustering procedures are presented in the next section.

3. PROPOSED WORK

Clustering of speech data may need optimal dimensional coefficients in each GMM mean supervector. Feature selection is done at each level, and the best features are selected for use in the proposed clustering procedure. The selection of i-vectors gives an optimal feature representation of the speech utterances, and these i-vectors reduce the computational complexity of the clustering methods. The proposed work mainly focuses on optimizing the performance of speech clustering, in terms of considerable improvement in both clustering accuracy and computational time. The procedure for i-vector generation and the proposed system are discussed in this section.
The input speech data is transformed into a sequence of acoustic features (i.e., MFCC form); the MFCC are extracted every 10 ms over a 25 ms window. The GMM is used for modeling the speech data, and a factor analysis method is applied to the GMM supervectors to represent the speaker variabilities and compensate the channel inconsistencies, as discussed in [14]. This approach is known as the Total Variability Approach (TVA) and is shown in Eqn. (4):
M = m + Tw (4)
Here, m is a speaker- and session-independent supervector taken from a large GMM (called the Universal Background Model, UBM), and M is the session-dependent supervector. The UBM is trained on a large amount of speech data and represents the speaker-independent distribution of the acoustic features. The term T is a low-rank rectangular matrix defining the total variability space, and w is a low-dimensional random vector referred to as the total factor vector or i-vector, standing for intermediate vector.
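For intuition only, the factor model of Eqn. (4) can be inverted with a simple least-squares point estimate. A production i-vector extractor instead computes the posterior mean of w from Baum-Welch statistics and the UBM covariances [14], so this sketch (including its function name) is purely illustrative.

```python
import numpy as np

def ivector_point_estimate(M, m, T):
    """Least-squares point estimate of w in the model M = m + T w.
    (Illustrative only: a real extractor uses the posterior of w
    given Baum-Welch statistics, not plain least squares.)"""
    w, *_ = np.linalg.lstsq(T, M - m, rcond=None)
    return w
```

The key property the sketch captures is dimensionality: M and m live in the high-dimensional supervector space, while w is low-dimensional, which is what speeds up the clustering that follows.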

3.1 Dissimilarity Matrix Computation in Visual Access Tendency (VAT)


The Euclidean and cosine distance metrics are the most frequently used metrics; as noted in [7], cosine-based dissimilarity features have been very successful in speaker discrimination. Thus, the cosine metric is used to obtain more accurate dissimilarity features in speech clustering. Eqn. (5) gives the cosine-based dissimilarity between two speech segments x and y:

$$D(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|} \qquad (5)$$

The quality of the clusters depends on the estimation of cluster tendency. Determining the number of speakers involved in the speech data is known as cluster tendency, and it is derived by the non-parametric visual method, which detects the number of speakers from the dissimilarity features obtained with Eqn. (5).
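A vectorized sketch of Eqn. (5) over a whole set of vectors (one i-vector per row) might look like:

```python
import numpy as np

def cosine_dissimilarity_matrix(X):
    """Pairwise D(x, y) = 1 - cos(x, y) for the rows of X.
    Row-normalizing once lets a single matrix product give all pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T
```

The returned matrix is exactly the D that the VAT procedure reorders.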

3.2 Proposed Speech Clustering Methods


Clustering refers to the process of grouping similar speech utterances together based on the similarity features of their respective i-vectors, i.e., one cluster contains the i-vectors of the same speaker. Assessing the number of speakers is the main criterion, and it is solved in the first step of the proposed methods. Two popular clustering methods, k-means and minimum spanning tree (MST) based clustering, are enhanced in the proposed work for discovering effective speech clusters.
The proposed i-vector-VAT-based k-means (iVK) is described in the following algorithm.
Algorithm: iVK

Input: SS = {S1, S2, ..., Sn}, a set of n speech segments
Output: k, the number of speakers; C = {C1, C2, ..., Ck}, k clusters
1. Train the UBM with a large amount of speech data
2. Build the GMM models for the speech segments of SS
3. Adapt the UBM into the GMMs and generate the GMM mean supervectors
   S = [\mu_1^T \mu_2^T \cdots \mu_m^T]^T
4. Find the i-vectors using Eqn. (4)
5. Compute the dissimilarity matrix D of the obtained i-vectors (w)
6. k = VAT(D)
7. Discover the clustering results using k-means with this k value; the result of k-means is C

In iVK, Step 1 trains the UBM on a large amount of data to define the average speaker model. Step 2 builds the speaker-specific GMM models, which are adapted from the UBM to estimate accurate model parameters for the speaker-specific segments; Step 3 then generates the GMM mean supervectors. Step 4 defines the i-vector generation using TVA. In Steps 5 and 6, the cosine metric is applied to determine the dissimilarity matrix of the set of i-vectors, and that matrix is the input of the VAT method for assessing the number of speakers (i.e., the number of clusters). In Step 7, the clustering results are discovered using k-means.
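Once VAT has suggested k, the final step of iVK is ordinary k-means over the i-vectors. A sketch with scikit-learn, length-normalizing the vectors first so that Euclidean k-means approximates the cosine geometry used elsewhere in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def ivk_final_step(ivectors, k):
    """Cluster i-vectors (one per row) into the k groups suggested by VAT.
    Length normalization makes Euclidean distance track cosine distance."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_
```

The function name is illustrative; the paper's pipeline supplies k from the VAT image rather than as a user parameter.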
The steps of i-vectors-VAT-based-MST Clustering (iVM) are shown in following Algorithm.

Algorithm: iVM

Input: SS = {S1, S2, ..., Sn}, a set of n speech segments
Output: k, the number of speakers; C = {C1, C2, ..., Ck}, k clusters
1. Use Steps 1-3 of the iVK algorithm to generate the GMM mean supervectors
   S = [\mu_1^T \mu_2^T \cdots \mu_m^T]^T
2. Generate the i-vectors of S using TVA
3. Use the cosine metric to determine the dissimilarity matrix D of the i-vectors
4. Call VAT(D) to assess the number of speakers k (i.e., the number of clusters)
5. Find the speech clustering results using the following steps of the MST-based clustering method:
   a. Build the graph G over the set of speech utterances, in which each edge weight is the distance between the respective i-vectors of the speech utterances
   b. Construct the MST of graph G using Prim's algorithm
   c. Detect and remove an inconsistent edge (one whose weight is greater than the average edge weight); this splits a tree into two distinct sub-trees, each of which is treated as a speech cluster
   d. Repeat Step 5c until the cluster count reaches k

The MST-based clustering approach is extended with i-vectors and the VAT method; the result is known as iVM. The cluster tendency k is derived in Steps 1 to 4. A graph is constructed over the set of nodes, where each node is a speech utterance and each edge weight is computed by applying the cosine metric to the respective i-vectors of the utterances. One inconsistent edge is removed at a time, splitting a sub-tree into two speech clusters; this is applied recursively until k speech clusters are reached. The proposed method works effectively both in determining the number of speakers and in discovering efficient speech clustering results.
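A compact sketch of Step 5 using SciPy. For simplicity it cuts the k-1 heaviest MST edges in one pass rather than applying the average-weight inconsistency test iteratively; when the inconsistent edges are the heaviest ones, this yields the same k components.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clustering(D, k):
    """Cut the k-1 heaviest edges of the MST over dissimilarity matrix D;
    the surviving connected components are the k clusters."""
    n = len(D)
    mst = minimum_spanning_tree(D).toarray()
    # MST edges, sorted from heaviest (most "inconsistent") downward.
    edges = sorted(((mst[i, j], i, j)
                    for i in range(n) for j in range(n) if mst[i, j] > 0),
                   reverse=True)
    for _, i, j in edges[:k - 1]:   # removing k-1 edges leaves k subtrees
        mst[i, j] = 0
    graph = mst + mst.T
    _, labels = connected_components(graph > 0, directed=False)
    return labels
```

Feeding in the cosine dissimilarity matrix of the i-vectors and the k from VAT reproduces the shape of the iVM pipeline.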
4. EXPERIMENTAL RESULTS
The experiments are conducted on the TSP speech dataset, which is freely available. This dataset consists of a set of speech utterances of different sentences in .wav file format, spoken by different speakers. The experimental settings are described in Table 1. The proposed methods are developed in the MATLAB environment. The i-vectors of the GMM mean supervectors are derived in both methods, i.e., iVK and iVM, and clustering is performed based on the similarity features of the i-vectors of the speech segments. The traditional k-means and MST-based clustering methods discover clustering results without knowing the number of speakers. The proposed methods, iVK and iVM, effectively detect the number of speakers and group the speech utterances based on this prior knowledge. An additional benefit of these methods is their lower computational cost, because they use i-vectors instead of GMM mean supervectors. The descriptions of the training and test datasets used in our experimental study are given in Table 1. The experiments use datasets with a large number of speakers, in which the majority of the data is used for training. For testing, we took several hundred speech utterances of different speakers.
Table 1: Experimental settings for speech clustering

Description of data                          Independent training set    Test set
Number of speakers                           30                          10
Number of utterances (or speech segments)    3000                        200 (max)
Number of utterances of each speaker         100 to 200                  30 to 40
Utterance (.wav file) duration               10 to 20 s                  10 to 20 s
The proposed methods are evaluated with two performance measures: clustering accuracy (AC) and normalized mutual information (NMI). The AC and NMI metrics have been widely used for evaluating speech clustering methods; these performance measures are described in [16].

Suppose that $x_i^g$ is the ground-truth label and $f_i^g$ is the clustering label of the i-th object; then AC is defined as:

$$AC = \max_{map} \frac{\sum_{i=1}^{n} \delta(x_i^g, map(f_i^g))}{n} \qquad (6)$$

where n is the total number of objects and $\delta(x, y) = 1$ if and only if $x = y$, and 0 otherwise. The function map permutes the clustering labels to match the equivalent ground-truth labels; for this purpose we use the Kuhn-Munkres algorithm to obtain the best mapping, following [21]. The evaluations are presented in Table 2 and Table 3, which show a comparative study between the traditional and the proposed clustering methods.
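Eqn. (6) can be sketched in Python with SciPy's linear_sum_assignment serving as the Kuhn-Munkres solver (NMI, the other metric, is available off the shelf, e.g., scikit-learn's normalized_mutual_info_score):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(truth, pred):
    """AC of Eqn. (6): find the best one-to-one relabeling of predicted
    clusters via the Kuhn-Munkres (Hungarian) algorithm, then return
    the fraction of correctly labeled objects."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    labels_t, labels_p = np.unique(truth), np.unique(pred)
    # Negative overlap counts, so min-cost assignment maximizes matches.
    cost = np.zeros((len(labels_p), len(labels_t)), dtype=int)
    for a, lp in enumerate(labels_p):
        for b, lt in enumerate(labels_t):
            cost[a, b] = -np.sum((pred == lp) & (truth == lt))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(truth)
```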
Table 2: Performance of k-means and the proposed clustering method

Speech data        k-means          iVK
                   CA      NMI      CA      NMI
2-speaker data     0.72    0.66     0.86    0.77
3-speaker data     0.81    0.56     0.80    0.57
4-speaker data     0.66    0.56     0.67    0.59
5-speaker data     0.67    0.61     0.71    0.62
6-speaker data     0.56    0.45     0.59    0.47

Table 3: Performance of MST-based clustering and the proposed clustering method

Speech data        MST-based clustering    iVM
                   CA      NMI             CA      NMI
2-speaker data     0.56    0.44            0.61    0.46
3-speaker data     0.77    0.56            0.76    0.58
4-speaker data     0.56    0.46            0.57    0.47
5-speaker data     0.47    0.41            0.48    0.44
6-speaker data     0.53    0.44            0.55    0.45

Table 2 presents the speech clustering results of iVK for the 2-speaker, 3-speaker, 4-speaker, 5-speaker, and 6-speaker datasets, with sizes of 80, 110, 150, 180, and 210 speech utterances respectively. From these experimental values, it is noted that iVK performs better than k-means (GMM version), because iVK uses the i-vectors while k-means uses the GMM mean supervectors; a significant improvement of iVK over k-means is observed. Table 3 shows the comparative study between MST-based clustering and iVM. From the experimental values of Table 3, it is observed that the proposed method (iVM) achieves more accurate clustering results than the MST-based clustering method.
Fig. 3 shows the performance comparison between the two proposed methods, iVK and iVM, with respect to clustering accuracy, and Fig. 4 shows the comparison between these methods with respect to NMI. From this empirical study, it is concluded that iVK and iVM perform better than the k-means and MST-based clustering methods, and that iVK produces higher-quality speech clusters than iVM.

Fig. 3 Clustering Accuracy for Proposed Methods


Fig. 3 and Fig. 4 show the empirical comparison between the existing and proposed methods with respect to clustering accuracy and normalized mutual information. From these results, it is noted that iVK outperforms iVM.

Fig. 4 Normalized Mutual Information for Proposed Methods

5. CONCLUSION AND FUTURE SCOPE

This paper focused on two emerging problems of speech clustering: cluster tendency assessment and the quality of speech clusters. Typically, traditional clustering methods generate clustering results without prior knowledge of the number of speakers involved. Therefore, this is taken as the key problem in this paper, and the value is successfully assessed in the proposed methods through a visual method. The proposed methods, iVK and iVM, use i-vectors instead of GMM mean supervectors for better speaker discrimination, in order to discover high-quality speech clusters. The future scope of this paper is to extend the proposed system to scalable speech data clustering.
REFERENCES

[1] Roberto, T. An overview of speaker identification: accuracy and robustness issues, IEEE Circuits and Systems Magazine, 2011, 23-61.
[2] Haipeng, W., Tan, L., Cheung, C. L., & Bin, M. Acoustic segment modeling with spectral clustering methods, IEEE Trans. on Audio, Speech, and Language Processing, 2015, 23(2), 264-277.
[3] Chu, S., Tang, H., & Huang, T. Fishervoice and semi-supervised speaker clustering, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2009, 4089-4092.
[4] Reynolds, D. A. Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, 2000, 10, 19-41.
[5] Han, J., & Kamber, M. Data Mining: Concepts and Techniques, 3rd edn. (Morgan Kaufmann, San Francisco), 2011.
[6] Arun, P. Data Mining Techniques, Universities Press, 2001.
[7] Mohammad, S., & Patrick, K. A study of the cosine distance-based mean shift for telephone speech diarization, IEEE/ACM Trans. on Audio, Speech, and Language Processing, 2014, 22(1), 217-227.
[8] Tang, S., & Chu, S. Partially supervised speaker clustering, IEEE Trans. Pattern Anal. Machine Intell., 2012, 34(5), 959-971.
[9] Senoussaoui, M., Patrick, K., Themo, S., & Dumouchel, P. Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering, in Proc. ICASSP, 2013, 7712-7715.
[10] Jain, A., & Dubes, R. Algorithms for Clustering Data, Prentice Hall, 1988.
[11] Varadarajan, B., Khudanpur, S., & Dupoux, E. Unsupervised learning of acoustic subword units, in ACL-08: HLT, 2008.
[12] Anguera, X. Speaker independent discriminant feature extraction for acoustic pattern matching, in Proc. ICASSP, 2012.
[13] Lee, Y., & Glass, J. A nonparametric Bayesian approach to acoustic model discovery, in Proc. ACL, 2012.
[14] Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. Front-end factor analysis for speaker verification, IEEE Trans. Audio, Speech, Lang. Process., 2010, 19(4), 788-798.
[15] Shum, S. H., Dehak, N., Dehak, R., & Glass, J. Unsupervised methods for speaker diarization: an integrated and iterative approach, IEEE Trans. on Audio, Speech, and Language Processing, 2013, 21(10).
[16] Cai, D., He, X., & Han, J. Document clustering using locality preserving indexing, IEEE Trans. on Knowledge and Data Engineering, 17(2), 1624-1637.
[17] Sangeeta, B., Rohdin, J., & Shinoda, K. Autonomous selection of i-vectors for PLDA modelling in speaker verification, Speech Communication, 2015, 72, 32-46.
[18] Wang, L., Geng, X., Bezdek, J., Leckie, C., & Ramamohanarao, K. Enhanced visual analysis for cluster tendency assessment and data partitioning, IEEE Trans. on Knowledge and Data Engineering, 2010, 22(10), 1401-1414.
[19] Xie, X. L., & Beni, G. A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. and Mach. Intell., 1991, 13, 841-847.
[20] Havens, T. C., Bezdek, J. C., Keller, J. M., & Popescu, M. Dunn's cluster validity index as a contrast measure of VAT images, Int. Conf. IEEE, 2008.
[21] Lovász, L., & Plummer, M. Matching Theory, North-Holland, Budapest, 1986.
