
Signal Processing 84 (2004) 663–667

www.elsevier.com/locate/sigpro

Fast communication

Receiver-based packet loss concealment for pulse code modulation (PCM G.711) coder
Maha Elsabrouty, Martin Bouchard, Tyseer Aboulnasr
School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5
Received 6 September 2002

Abstract
This paper introduces a high-performance concealment algorithm for packetized PCM-coded speech as in ITU-T Recommendation G.711. The proposed prediction algorithm combines a linear prediction model with the reverse-order replicated pitch period technique implemented in ITU-T G.711 Appendix A (ITU-T Recommendation G.711, November 2000). The new algorithm is compared to the ITU-T G.711 Appendix A standard and to the commercial tool of packet repetition. It is shown to produce better concealment quality in almost all cases.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Concealment algorithm; Pulse code modulation; ITU-T

1. Introduction
Voice-over-IP (VoIP), the transmission of packetized voice over IP networks, is gaining much attention as a possible alternative to conventional public switched telephone networks (PSTN). However, impairments present on IP networks, namely jitter, delay and channel errors, can lead to the loss of packets at the receiving end. This packet loss degrades the speech quality. Model-based coders, especially the G.729A [2] and G.723.1 [3] International Telecommunication Union (ITU-T) standards, have been extensively used for speech coding over IP networks because of their low bit-rate requirements (5.3 and 6.3 kbit/s for G.723.1 and 8 kbit/s for G.729A) and their inherent ability to recover from erasure. Their built-in packet loss concealment makes their quality drop slowly with an increasing amount of packet loss. However, their memory requires a few frames for the transition from a concealed state to a correct state. Thus, they actually tend to corrupt a few good packets before recovery, a phenomenon known as state error [6]. On the other hand, pulse code modulation (PCM, 64 kbit/s) [9], although having a higher quality than G.729 and G.723.1 in periods of normal operation, does not have the ability to conceal erasure. This results in a dramatic drop in the quality of speech during loss periods. Yet, PCM-based coders can recover from packet loss more rapidly than model-based coders, since the first speech sample in the first good packet restores speech to its original quality. The low complexity of PCM and its good performance in tandem coding make it a viable alternative to G.729 or G.723.1 for VoIP.

Corresponding author.
E-mail addresses: melsabro@site.uottawa.ca (M. Elsabrouty), bouchard@site.uottawa.ca (M. Bouchard), aboulnas@site.uottawa.ca (T. Aboulnasr).

0165-1684/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2003.10.021

M. Elsabrouty et al. / Signal Processing 84 (2004) 663–667

Several approaches have been implemented to address the frame erasure problem in PCM streams. The simplest approach is to play a mute (silence) packet in the erasure period. This method, however, introduces annoying voice clipping, and most subjective tests have shown that it deteriorates the speech quality even at very low packet loss rates [4,5]. Many other concealment algorithms depend on the quasi-stationary property of speech (not a lot of new information is delivered in the duration of a 10–30 ms lost packet). One of the popular commercial concealment algorithms repeats the speech signal received in the last speech packet. This method performs better than silence substitution but its quality is still not satisfactory for high-quality applications.
ITU-T has lately standardized (in G.711 Appendix A [1]) a high-quality low-complexity PCM-coded speech concealment method. This method depends on waveform substitution. The packet loss concealment (PLC) algorithm first performs pitch detection on a sufficient length of speech samples kept in the history buffer (390 samples of 8 kHz-sampled speech). The concealment unit then places the pointer one pitch period backward and copies a speech signal of the duration of the lost packet. This pitch-predicted replica is played in the gap resulting from the missing speech segment. The algorithm also performs an overlap-and-add at the transition between the last received good samples and the concealed ones, to ensure a smooth and natural transition and a higher quality for the resulting concealment. However, this results in an added algorithmic delay of 3.75 ms [1]. The algorithm has a very low complexity of 0.5 MIPS.
Another standard method is presented in the ANSI standard T1.521-2000 (Annex B) [7]. This method depends on the well-known linear prediction model in estimating the missing speech waveform. This standard simply adopts the model-based codecs approach. It implements a complete analysis to extract the short- and long-term excitation from the previous correctly received speech. Then, the synthesis unit uses these parameters along with the most recently received speech samples (as initial conditions for the inverse linear prediction (LP) filter) to synthesize an approximation of the missing speech segment. This method introduces an algorithmic delay of 5 ms (half of a correct 10 ms packet) to perform the smoothing transition between the last good speech segment and the beginning of the concealed one. It also requires a much higher complexity (2.3 MIPS for a 10 ms packet), which is around 5 times the complexity of ITU-T G.711 Appendix A [4,7]. The resulting concealment quality of this method is comparable to that of ITU-T G.711 Appendix A [4,7].
In this paper, we present a new receiver-based PLC algorithm for packetized PCM-coded speech. It is designed to work with the conventional sampling rate of 8 kHz and frame sizes of 10 ms. The proposed algorithm does not require any delay and has an affordable complexity of 1.85 MIPS.
The rest of this paper is organized as follows. In Section 2, the concealment model is described. Section 3 presents the quality assessment test for the new method as well as simulation results confirming the improved performance of the proposed algorithm. We then conclude the paper in Section 4, along with the future work that could be added to the proposed method.
2. The new packet loss concealment algorithm
2.1. Prediction equation
The new LP-based concealment technique is based on prediction with a filter of sufficiently large order to model the speech accurately:

$$S(n) = \sum_{i=1}^{P} a(i)\,S(n-i) + b(n), \tag{1}$$

where S(n) is the nth speech sample, P is the prediction order, which was set to 50 as will be explained later, a(i) are the LP coefficients and b(n) is the residual signal.
As can be seen from Eq. (1), the current speech sample S(n) is composed of two components. The first component is the predictable part carrying the information of the vocal tract along with the correlation between the current sample and the previous ones. The second component is the residual signal b(n) that contains the current unpredictable excitation. The ideal case is when the LPC filter is capable of accounting for the whole correlation between the current sample and the past samples. In this case, the prediction error is a random excitation signal reflecting the unpredictability of b(n). However, if the LPC fails to extract the complete correlation between the successive samples, the residual signal is coloured (has some correlation with the original speech signal).
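As a minimal sketch of the decomposition in Eq. (1), the residual b(n) can be obtained by subtracting the LP prediction from each sample. The function name `lp_residual` and the toy first-order signal below are illustrative assumptions and do not come from the paper.

```python
from typing import List, Sequence


def lp_residual(speech: Sequence[float], a: Sequence[float]) -> List[float]:
    """Return b(n) = S(n) - sum_{i=1..P} a(i) * S(n - i) for n >= P.

    `a` holds a(1)..a(P); the first P samples are skipped because they
    do not have P predecessors.
    """
    order = len(a)
    residual = []
    for n in range(order, len(speech)):
        predicted = sum(a[i - 1] * speech[n - i] for i in range(1, order + 1))
        residual.append(speech[n] - predicted)
    return residual


# Toy check: a signal obeying S(n) = 0.9 * S(n - 1) is fully predictable
# by a first-order model, so its residual should be numerically negligible.
signal = [0.9 ** n for n in range(20)]
print(max(abs(b) for b in lp_residual(signal, [0.9])))
```

When the model order is too low for the signal, the same function returns a residual that is still correlated with the input, which is the "coloured" case described above.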
In the case of a lost packet, the previous correct speech samples are present and thus the predictable term in Eq. (1) can be computed by linear prediction synthesis. However, the input residual signal is unknown to the receiver side. In this case, a good choice is to use a small percentage of the pitch-predicted signal as the input excitation for the system. Here, the pitch-predicted signal of the lost frame refers to a reverse-order pitch period replication (RORPP) of the lost frame, estimated in a manner similar to the concealment algorithm implemented in the ITU-T standard G.711 Appendix A [1].
Thus, using a small percentage of the pitch-predicted signal, we can rewrite Eq. (1) as

$$\hat{S}(n) = \sum_{i=1}^{P} a(i)\,S(n-i) + \tilde{S}(n)\,G, \tag{2}$$

where $\hat{S}(n)$ denotes the LPC prediction and $\tilde{S}(n)$ is the pitch-predicted signal obtained from the ITU-T G.711-A (RORPP) concealment standard. G = 0.01 was found to give the best results in practice.
Next, we propose to modify the algorithm by using a weighted summation of the short-term prediction $\left[\sum_{i=1}^{P} a(i)\,S(n-i) + \tilde{S}(n)\,G\right]$ and the pitch-based prediction $\tilde{S}(n)$ to provide a better approximation of the original signal. Thus the final form of the prediction algorithm becomes

$$S_1(n) = \sum_{i=1}^{P} a(i)\,S_1(n-i) + \tilde{S}(n)\,G, \tag{3}$$

$$\hat{S}(n) = \alpha\,S_1(n) + \beta\,\tilde{S}(n), \tag{4}$$

where $\hat{S}(n)$ is the final form of the concealed signal to be played instead of the missing speech frame, and $\alpha$ and $\beta$ are summation weights that add up to unity. The best results were obtained with $\alpha = 0.7$ and $\beta = 0.3$.
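The recursion in Eqs. (3) and (4) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `conceal_frame`, the argument layout and the names `alpha` and `beta` for the two summation weights (0.7 and 0.3) are hypothetical, and the pitch-predicted frame is taken as a ready-made input.

```python
from typing import List, Sequence


def conceal_frame(
    history: Sequence[float],
    a: Sequence[float],
    pitch_replica: Sequence[float],
    gain: float = 0.01,
    alpha: float = 0.7,
    beta: float = 0.3,
) -> List[float]:
    """Synthesize one concealed frame from the LP model and a pitch replica.

    history       : most recent correctly received samples (at least P)
    a             : LP coefficients a(1)..a(P)
    pitch_replica : pitch-predicted samples for the lost frame
    """
    order = len(a)
    if len(history) < order:
        raise ValueError("history must hold at least P samples")
    # Initial conditions of the synthesis filter: the last P good samples.
    state = list(history[len(history) - order:])
    concealed = []
    for p in pitch_replica:
        # Eq. (3): all-pole synthesis driven by the scaled pitch replica.
        s1 = sum(a[i - 1] * state[-i] for i in range(1, order + 1)) + gain * p
        state.append(s1)
        # Eq. (4): weighted sum of the LP output and the pitch replica.
        concealed.append(alpha * s1 + beta * p)
    return concealed


# Toy usage with a first-order model and a constant replica.
print(conceal_frame([0.0, 1.0], [0.5], [1.0, 1.0, 1.0]))
```

Note that the filter state is fed with its own outputs S1(n), not with the weighted sum, which matches the way Eq. (3) is written.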
2.2. How the algorithm works
During the normal operation of the PCM decoder (period of no loss), the receiver decodes the received packets and sends the output to the audio port. Meanwhile, in order to support the concealment algorithm, a copy of the decoded output is saved in a history buffer that is 390 samples long. The history buffer is used to calculate the auto-correlation function, estimate both the pitch and the LP coefficients, extract the pitch replica and provide the past samples S(n − i), 1 ≤ i ≤ P, where P is the order of the prediction filter.
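One simple way to keep such a history is a fixed-length queue that discards the oldest samples as new frames arrive. The sketch below assumes 80-sample frames and uses a hypothetical `HistoryBuffer` class; only the 390-sample length is taken from the text.

```python
from collections import deque
from typing import Iterable, List

HISTORY_LEN = 390  # samples, as stated in the text (48.75 ms at 8 kHz)


class HistoryBuffer:
    """Keeps the most recent decoded samples for use during concealment."""

    def __init__(self, length: int = HISTORY_LEN) -> None:
        # Start with silence so the buffer is always full.
        self._samples = deque([0.0] * length, maxlen=length)

    def push_frame(self, frame: Iterable[float]) -> None:
        """Append a decoded frame; the oldest samples fall off the front."""
        self._samples.extend(frame)

    def last(self, count: int) -> List[float]:
        """Return the `count` most recent samples, oldest first."""
        if count < 0 or count > len(self._samples):
            raise ValueError("count out of range")
        if count == 0:
            return []
        return list(self._samples)[-count:]


buf = HistoryBuffer()
for start in range(0, 400, 80):  # five 10 ms frames of a ramp signal
    buf.push_frame(float(v) for v in range(start, start + 80))
print(buf.last(3))
```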
A lost speech segment contains at least one lost packet but may contain more. The majority of the computational load is in the first 10 ms of erasure (the first lost frame). Fig. 1 shows a block diagram of the principal blocks of the concealment algorithm.
At the start of the erasure period, the pitch detection unit estimates the current value of the pitch by searching among the peaks of the auto-correlation coefficients calculated as in the ITU-T concealment standard G.711-A. The samples $\tilde{S}(n)$ found by this pitch-prediction method will be used twice. They are first multiplied by the gain G, which is equal to 0.01. This re-scaled signal is used as the short-term excitation of the speech production model (3). The same signal is weighted by a factor of 0.3 and then added to the output of the synthesis LP filter S1(n), weighted by a factor of 0.7 as in (4).
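The pitch search described above can be illustrated with a plain correlation-peak search. This sketch is a simplification and not the G.711-A or RORPP procedure: the lag range of 40 to 120 samples, the 160-sample comparison window, the function names and the forward periodic extension used to build the replica are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def estimate_pitch(
    history: Sequence[float],
    min_lag: int = 40,
    max_lag: int = 120,
    window: int = 160,
) -> int:
    """Return the lag (in samples) with the highest normalized correlation."""
    if len(history) < window + max_lag:
        raise ValueError("history too short for the requested search range")
    recent = history[-window:]
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        delayed = history[-window - lag:-lag]
        energy = sum(x * x for x in delayed)
        if energy == 0.0:
            continue  # silent segment, correlation is undefined
        corr = sum(x * y for x, y in zip(recent, delayed))
        score = corr / math.sqrt(energy)
        if score > best_score:  # strict comparison keeps the shortest lag on ties
            best_lag, best_score = lag, score
    return best_lag


def pitch_replica(history: Sequence[float], pitch: int, length: int) -> List[float]:
    """Fill `length` samples by repeating the last pitch period of `history`."""
    period = history[-pitch:]
    return [period[k % pitch] for k in range(length)]


# Toy usage: a sawtooth with a 50-sample period.
toy = [float(n % 50) for n in range(390)]
lag = estimate_pitch(toy)
print(lag, pitch_replica(toy, lag, 5))
```

The resulting replica is the signal that would then be scaled by G as excitation and also mixed into the output with the 0.3 weight.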
Meanwhile, the first 50 coefficients of the autocorrelation function of the last 20 ms (160 samples) of speech are calculated. The LP coefficients are calculated in the LP-analysis block, which implements the Levinson–Durbin algorithm for LP estimation. The LP prediction order was chosen to be 50 to cover at least one pitch period in female speech, whose quality has been shown to deteriorate more severely than that of male speech when both are subject to the same loss rates. These 50 coefficients define the poles of the LP-synthesis filter, which is the model of speech production.
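For readers unfamiliar with the LP-analysis step, the following is a textbook Levinson–Durbin recursion in plain Python. It is a sketch, not the authors' code: the function names are hypothetical, no windowing or lag-window conditioning is applied, and the sign convention is chosen so that the returned a(i) predict S(n) as the sum of a(i) S(n − i), as in Eq. (1).

```python
from typing import List, Sequence


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Return r(0)..r(max_lag) of the (unwindowed) sequence x."""
    n = len(x)
    return [
        sum(x[k] * x[k + lag] for k in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r: Sequence[float], order: int) -> List[float]:
    """Solve the LP normal equations for a(1)..a(order) from r(0)..r(order)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for m in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate input; higher-order coefficients stay at zero
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / error  # reflection coefficient
        updated = a[:]
        updated[m] = k
        for i in range(1, m):
            updated[i] = a[i] - k * a[m - i]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


# Toy usage: the autocorrelation of a first-order process with coefficient 0.5.
print(levinson_durbin([1.0, 0.5, 0.25], 2))
```

With the settings in the text, the inputs would be `autocorrelation(last_160_samples, 50)` and `order=50`; the name `last_160_samples` is a placeholder for the most recent 20 ms of the history buffer.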
Typically one frame has 80 samples. However, we have modified that model to produce 90 samples per lost frame instead of 80, to allow for a smooth transition between packets. The last 10 samples are the predicted values of the packet following the lost packet. If the next packet is lost, then these values are played as the first concealed samples of that lost frame. However, if the next packet is not lost, then these samples are multiplied by a decaying ramp and added to the corresponding first 10 samples of the new correct speech sequence, which are multiplied by a rising ramp. The output of the addition is played instead of the first 10 good samples after erasure. This cross-fading process guarantees a smooth transition from the concealed speech segment to the good speech packets.
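The cross-fade over the 10 overlapping samples can be sketched as below. The paper does not give the ramp shape, so the linear ramps here are an assumption, as is the function name `cross_fade`.

```python
from typing import List, Sequence


def cross_fade(predicted_tail: Sequence[float], good_head: Sequence[float]) -> List[float]:
    """Blend the extra predicted samples into the first good samples.

    predicted_tail : the last 10 of the 90 concealed samples
    good_head      : the first 10 samples of the next correctly received packet
    The two ramps are linear (an assumption) and sum to one at every sample.
    """
    if len(predicted_tail) != len(good_head):
        raise ValueError("both segments must have the same length")
    n = len(predicted_tail)
    blended = []
    for k in range(n):
        fade_in = (k + 1) / (n + 1)  # rising ramp applied to the good samples
        fade_out = 1.0 - fade_in     # decaying ramp applied to the prediction
        blended.append(fade_out * predicted_tail[k] + fade_in * good_head[k])
    return blended


# Toy usage: fading from a constant 0.0 prediction into a constant 1.0 signal.
print(cross_fade([0.0] * 10, [1.0] * 10))
```

Because the overlap uses samples that were predicted ahead of time, nothing has to be buffered from the incoming good packet before playback starts.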

Fig. 1. Block diagram of the new concealment algorithm for the first lost packet. (Blocks shown: speech buffer supplying 240 samples (30 ms), pitch detector, long prediction, autocorrelation unit, LP analysis, inverse LP filter with initial conditions from the last 50 samples, weights 0.7 and 0.3, overlap buffer holding the last 10 samples, and reconstructed signal whose first 80 samples go to the speaker.)

If the erasure lasts more than 10 ms (one packet period), no new parameters are calculated. We re-use the parameters obtained for the concealment of the first lost packet, with the slight modification of changing the long-term estimated period samples, as in ITU-T G.711-A. In the case of consecutive lost packets, the pitch-predicted replica is multiplied by a decaying ramp starting at the initial value 1 and decaying at a rate of 0.2 per 10 ms. This ramp multiplication introduces a smooth decay that increases along the loss period. Eventually, at 60 ms of continuous erasure, the pitch replica and the input residual signal are zero and Eq. (3) turns into a zero-input LP model that eventually decays due to its stability.

3. Performance of the proposed algorithm
The new algorithm is compared to the ITU-T standard concealment tool G.711-A and to the packet repetition method. The test was performed on a set of speech files from four speakers, two males and two females, referred to in the results as M1, M2, F1 and F2. Ten speech files were used for each speaker, each containing two sentences in English of duration 8 s. The format of the files was linear PCM. The files were taken from the ITU-T supplement P.23.
The assessment tool used to evaluate the results of the concealment techniques is the perceptual evaluation of speech quality (PESQ) standard P.862 developed by the ITU-T [8]. It is the newest and most accurate tool [10] among the perceptual-based standards and has been shown to give reliable estimates of the results of subjective quality tests. The score is given in the range [−0.5, 4.5], similar to the standard mean opinion score (MOS) scale.

Fig. 2. Average results for 5% random packet loss: PESQ-MOS per speaker (M1, M2, F1, F2) for the new algorithm, ITU-T G.711-A and packet repetition.

A random loss test was performed at loss rates of 5%, 10% and 25%. Figs. 2–4 summarize the average results of the three loss rates.
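A random loss pattern of this kind can be generated by dropping each packet independently with the target probability. The sketch below is an assumption about how such a test could be set up, not a description of the authors' tooling; `random_loss_pattern` and the fixed seed are hypothetical, and 800 packets corresponds to an 8 s file cut into 10 ms packets.

```python
import random
from typing import List


def random_loss_pattern(num_packets: int, loss_rate: float, seed: int = 0) -> List[bool]:
    """Return one flag per packet; True means the packet is treated as lost."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so that a test run can be repeated
    return [rng.random() < loss_rate for _ in range(num_packets)]


# An 8 s file at 10 ms per packet gives 800 packets.
for rate in (0.05, 0.10, 0.25):
    pattern = random_loss_pattern(800, rate)
    print(rate, sum(pattern) / len(pattern))
```

The realized loss fraction of a single file only approximates the nominal rate, which is one reason to average scores over several files per speaker.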
Fig. 3. Average results for 10% random packet loss: PESQ-MOS per speaker (M1, M2, F1, F2) for the new algorithm, ITU-T G.711-A and packet repetition.

Fig. 4. Average results for 25% random packet loss: PESQ-MOS per speaker (M1, M2, F1, F2) for the new algorithm, ITU-T G.711-A and packet repetition.

We can see from the above figures that the performance of the new algorithm is superior to that of both the existing ITU-T standard and the packet repetition method. In fact, the packet repetition method performs much worse than both the new algorithm and the ITU-T concealment standard. A small but significant and almost steady margin separates the new algorithm from the ITU-T standard. This margin represents the performance gain of incorporating the LP model into the plain long-term pitch-repetition-based concealment standard. Extensive tests with periodic loss patterns were also performed and produced nearly identical results. It should be noted again that the proposed new algorithm does not introduce any delay, as opposed to the ITU-T standard.
4. Conclusion and future work
In this paper, we introduced a new concealment algorithm for PCM packetized speech of 10 ms packet length. The model implemented in (3) and (4) provides very encouraging results for the idea of combining pitch prediction with high-order LP-based prediction to produce the concealed speech segments. The PESQ-MOS scores obtained for the random loss tests show that the algorithm exhibits superior, high-quality concealment performance in all cases when compared to an existing commercial method (packet repetition) or the standard ITU-T G.711-A concealment technique. This high quality could be further enhanced by making use of the delay introduced by the jitter buffer at the receiver side. Taking the future samples into account could result in a better estimation of the missing speech segment. Adapting the gain G and the two weighting coefficients in (3) and (4) based on a voiced/unvoiced decision could also improve the resulting concealment quality. Finally, investigating the interaction of the concealment algorithm with network or acoustic echo cancellers will help to indicate how well the algorithm performs in real transmission situations.
References

[1] Appendix A: a high quality low-complexity algorithm for packet loss concealment with G.711, ITU-T Recommendation G.711, November 2000.
[2] Coding of speech at 8 kb/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP), ITU-T Recommendation G.729, March 1996.
[3] Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kb/s, ITU-T Recommendation G.723.1, March 1996.
[4] E. Gunduzhan, K. Momtahan, A linear prediction based packet loss concealment algorithm for PCM coded speech, IEEE Trans. Speech and Audio Process. 9 (8) (November 2001) 778–785.
[5] M. Hassan, A. Nayandoro, Internet telephony: services, technical challenges, and products, IEEE Communications Magazine, April 2000, pp. 96–103.
[6] C. Montminy, T. Aboulnasr, Improving the performance of ITU-T G.729A for VoIP, International Conference on Multimedia Exposition 2000 (ICME 2000), Vol. 1, New York, NY, USA, 30 July–2 August 2000, pp. 433–436.
[7] Packet loss concealment algorithm for use with ITU-T Recommendation G.711, ANSI Recommendation T1.521-2000 (Annex B), July 2000.
[8] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone network and speech codecs, ITU-T Recommendation P.862, May 2000.
[9] Pulse code modulation (PCM) of voice frequencies, ITU-T Recommendation G.711, November 1998.
[10] A.W. Rix, et al., Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, ICASSP 2001, Vol. 2, Salt Lake City, UT, USA, 7–11 May 2001, pp. 749–752.