TEMS Voice Service Quality Evaluation Techniques and POLQA

Prepared by: Dr. Irina Cotanis
Date: 3 November 2010
Document: NT11-1037

Ascom (2010)
All rights reserved. TEMS is a trademark of Ascom. All other trademarks are the property of their respective holders.

Voice Service Quality Evaluation
Techniques and the New Technology,
POLQA

White Paper

Contents

1 Today's Voice Service Challenges
2 Speech Quality Evaluation Techniques
2.1 Intrusive Techniques
2.2 Non-Intrusive Techniques
2.3 Standardization Status and Evolution Related to the Listening Quality of Voice Service
3 POLQA Technology
3.1 POLQA Algorithms Overview
3.2 Operability Requirements
3.3 Telecommunication Test and Application Scenarios
3.4 Understanding POLQA Limitations
3.5 POLQA Algorithms Performance Evaluation
4 Beyond the MOS Score
5 Ascom Network Testing Presence in the Standardization Work on Objective Evaluation Metrics for Listening Speech Quality
6 Conclusions
7 References

1 Today's Voice Service Challenges
Almost 10 years ago, operators and infrastructure vendors were struggling
to provide speech quality on 2G networks at the level expected by users
accustomed to PSTN levels of quality. Network optimization and
troubleshooting, as well as advanced speech processing techniques and an
in-depth understanding of speech transport on wireless networks, helped
operators bring the level of speech quality on 2G networks to that of fixed
networks. With the 3G network evolution, with the move to all IP, and with
the transition from narrowband (NB) to wideband (WB) speech, it was
expected that wireless voice services would even surpass traditional
PSTN quality.
However, today's voice services are still raising a set of challenges for
operators as they attempt to continue meeting their users' expectations.
The roots of these challenges lie mainly in the convergence and
coexistence of voice, data, and multimedia application services, which
involve a multitude of factors that invariably produce new types of
distortions that dynamically, variably, and sometimes randomly affect
speech quality.
These factors range from the increased demand for capacity, generated by
high and dynamic traffic with application-dependent patterns, to low and
adaptive bit rate codecs with different bandwidths (NB, WB, and super
wideband (SWB)) and complex error concealment solutions. They also include
voice enhancement devices (e.g., noise suppressors, automatic gain control,
echo cancellers) that are designed to counter speech degradation but that,
if not well designed and implemented, could degrade speech quality instead
of enhancing it.
In addition, with the next generation network (NGN) (LTE/SAE-SON) evolution,
network vendors and operators face a challenging transition from
traditional circuit-switched (CS) voice to VoIP, and then to VoIP over IMS
(VoLTE). Details on these challenges can be found in [1].

2 Speech Quality Evaluation Techniques
Providing voice service on NGNs at the quality level demanded by
subscribers while supporting backward compatibility with 3G/2G networks
as well as integrating voice with a myriad of multimedia and data services
increases the need for voice quality testing. Likewise, providing and
ensuring a high quality level for testing and evaluating speech quality
comes with its own series of challenges.
The need for cost-efficient speech quality evaluation techniques to replace
subjective testing while ensuring high accuracy on a larger variety of
network configurations and conditions, codecs, bandwidths, and
applications continues to drive network testing tools and infrastructure
vendors, as well as operators and standardization organizations, to
collaborative work on speech quality evaluation techniques.
Extensive work has been performed during the last decade by both the
ITU-T and the telecommunication industry in developing speech quality
evaluation algorithms designed to accurately evaluate any network
degradation impact on subscriber perception as well as to cope with the
complex testing conditions of the 3G environment.
These speech quality evaluation algorithms have been developed with
different scopes and applications. Intrusive perceptual solutions [2], [3],
[6] perform end-to-end speech quality evaluation on different types of
networks (wireless, VoIP, or fixed) based on the speech signal.
Non-intrusive solutions can evaluate speech quality at different nodes of
the network (including the end node): perceptual (single-ended) algorithms
[4] work on the degraded speech signal, while parametric algorithms [5]
work on network parameters.

2.1 Intrusive Techniques
These algorithms provide speech quality scores by comparing reference
(transmitted) and degraded (received) speech samples. Therefore, intrusive
assessment techniques require access to both the transmission and
reception ends of communication. Comparing time-frequency processed
reference and degraded speech samples based on human perception and
cognition models facilitates an accurate estimation of the subjective
perception of speech quality received by the terminal. An accurate
estimation, however, is performed at the cost of sending the test samples
through the network under test. The connection under test is therefore
withdrawn from normal service and rendered unavailable to the customer.
During peak hours, and for some technologies and certain areas, this
situation may generate artificially low quality scores.
Intrusive perceptual metrics estimate end-to-end speech quality, and thus
are useful and meaningful to network operators for monitoring the quality
of experience (QoE) of their voice service subscribers.

2.2 Non-Intrusive Techniques
Non-intrusive metrics can be network parameter based or speech based.
Parametric methods can use RF and/or IP parameters for predicting
quality. Their limitation is that these algorithms can predict only the
quality impact of either the radio access network or the IP core network.
Only a few ongoing studies are investigating the possibility
of combining the effects of both RF and IP parameters on speech quality.
The non-intrusive speech based methods must estimate
the transmitted original speech from the degraded signal. Strong
degradations could easily affect the accuracy of these estimates and,
therefore, the overall speech quality evaluation. As a result, even though
they are based on the processing of the speech signal using human
perception and cognition models, these algorithms are recommended only
when a large number of samples is available for averaging [4].
Although less accurate than intrusive perceptual metrics, non-intrusive
perceptual and parametric algorithms have an important role in network
monitoring for SLAs as well as in troubleshooting and optimization
of different network elements.

2.3 Standardization Status and Evolution Related to the
Listening Quality of Voice Service
Techniques for objective and subjective evaluation of voice service quality
are developed within ITU-T Study Group 12 (Performance, QoS and QoE).
Standardization organizations such as ETSI/3GPP and other industry
forums work in liaison with ITU-T.
For almost a decade, the intrusive perceptual solution for listening speech
quality evaluation has been the PESQ standard P.862 (along with P.862.1,
P.862.2, and P.862.3) [2]. With the 3G network evolution towards all IP,
particularly NGN (LTE/SAE-SON), ITU-T recognized the industry's immediate need for a
new standard that would both improve current PESQ performance under
certain specific network conditions (e.g., CDMA networks, EVRC codecs)
and cover 3G network evolution for voice service: from traditional CS to
VoIP and VoIP over IMS, from NB to WB and SWB, and from low codec
rates to very low and adaptive codec rates. As a result, POLQA was
developed [3], [6].
POLQA development and the wireless technology evolution toward NGN
showed that more than a subjective mean opinion score (MOS) is needed
for infrastructure vendors and operators to understand subscriber
perception and to appropriately troubleshoot and optimize their networks for
the voice service. Details related to new study items initiated in ITU-T are
presented in [1].
The non-intrusive solution is covered by the perceptual metric P.563 and by
the IP parametric based P.564.
Comprehensive summaries of standardized speech quality evaluation
metrics, their characteristics, and their applications are presented in Figure
1 for perceptual based metrics and in Figure 2 for parametric based
metrics.

Figure 1. Perceptual (Signal-Based)

Intrusive (application: E2E QoE monitoring)
- Principle: uses the original test signal and the degraded speech signal
  to provide a quality score
- Advantages: highly accurate estimator of subscribers' opinion; reflects
  the quality ensured by the entire network as perceived by users; requires
  access only to the end point
- Disadvantages: uses test stimuli that could artificially load the
  network; limited space-time granularity defined by the speech/video
  sample length requirement
- Algorithms: ITU-T P.862 and P.862.1 to P.862.3 (PESQ); ITU-T P.863
  (POLQA), consented by ITU-T on 17 September 2010

Non-intrusive (application: troubleshooting in correlation with the
perceptual metric)
- Principle: uses the impaired, received speech to predict quality
- Advantages: normal usage of the network; troubleshooting of the
  problem-generating node; high time and space granularity
- Disadvantages: low accuracy (high-order averaging is required, and
  therefore possible problems could be smoothed out)
- Algorithms: ITU-T P.563

Figure 2. No Reference (Parametric Based)

Non-intrusive (application: troubleshooting in correlation with the network
parameters and the perceptual metric)
- Principle: uses IP/transport parameters (and could possibly use RF
  parameters, too)
- Advantages: normal usage of the network; troubleshooting of the
  problem-generating node (if access is enabled); high time and space
  granularity; possibility for quick correlation with network behavior
- Disadvantages: low accuracy (high-order averaging is required, and
  therefore possible problems could be smoothed out); quality evaluation is
  one-dimensional, taking into consideration metrics belonging to a single
  segment of the entire network (such as IP)
- Algorithms: ITU-T P.564 (IP parameter based)

3 POLQA Technology
Today, voice service quality is determined by more than the speech codecs
used or the frames lost. Networks and devices now integrate many new
components ranging from voice enhancement devices (e.g., automatic gain
controllers, noise reduction, and smart loss concealment schemes) to new
techniques and features such as time scaling (stretching and compression
of the speech signals in the time domain). All these components have been
designed to ensure, maintain, and possibly even improve the perceived
voice service quality. However, due to the complexity of
the speech processing involved, these components might cause new and
unexpected degradation effects. POLQA is especially designed to handle
disruptive effects caused by these multicomponent distortions.

3.1 POLQA Algorithms Overview
As an intrusive perceptual metric, POLQA processes and compares the
transmitted original speech signal and the degraded received speech signal
in order to provide a prediction of the quality that would be perceived by
subjects (regular subscribers) in a subjective listening test. The high level
architecture of the algorithm is presented in Figure 3.
POLQA processes both the original signal and the degraded signal before
performing the comparison. The processing of the original signal is based
on the fact that since the subjective testing is carried out without a direct
comparison against an original (Absolute Category Rating), the ideal signal
assumption on which the subject bases his or her opinion is unknown
during the test. The processing of the degraded signal is related to high-
level cognitive processes (e.g., relative insensitivity to linear frequency
response distortion and to steady state wideband noise [3]).
POLQA runs a time alignment of the degraded signal against the original
speech signal before the comparison process. The determined delay is
used both to estimate and apply the proper sampling frequency and to
compensate for delay in the comparison process, which is based on
a perceptual model [3]. The accuracy of the comparison process is
determined by the transformation applied to the original and degraded
signals to an internal representation that is similar to the psychophysical
representation of audio signals in the human auditory system. The
transformation is applied in the perceptual frequency (Bark) and the
loudness domains (Sone), and runs in several steps: time alignment, level
alignment to a calibrated listening level, time-frequency mapping, frequency
warping, and compressive loudness scaling [3].
The internal representation takes into account several factors impacting the
perceived quality, such as playback level mapping from the digital signal
representation level, local gain variations, rapid variations, linear filtering,
and noise levels. In addition, it applies different levels of compensation for
these factors depending on their final contribution to the overall perceptual
disturbance. Therefore, minor and stationary differences between the
original and degraded speech signals are compensated, while more severe
effects known to have a greater impact on the perceived quality are only
partially compensated [3].
The final stage computes the quality perception from the
difference between the original and degraded internal representations,
based on a small number of quality indicators that are used to model all
related subjective effects. The cognitive model calculates the following
parameters: frequency response indicator, noise indicator, room
reverberation indicator, and three more indicators describing the internal
differences in the time-pitch-loudness domain. All these indicators are
combined to give an objective listening quality expressed by the raw
POLQA score [3].
The raw POLQA score is then mapped to the subjective MOS domain,
MOS-LQO. The mapping is a third-order polynomial developed
based on a large set of databases (tens of thousands of speech samples)
containing a broad range of network types (fixed, IP, and mobile) and
conditions (simulated error patterns and live degradations), codecs (e.g.,
AMR NB/WB, G.722.1, iLBC, EVRC, EVRC-WB, EVRC-A/B, AAC/AAC
LD, Skype, MP3 low bit rate, G.726, EFR), various background noise (BGN)
types and levels, different languages (American and British English,
German, Swedish, French, Dutch, Czech, Chinese, and Japanese), and three
speech bandwidths (NB, WB, and SWB).
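
As a rough illustration of the mapping step, the sketch below evaluates a
third-order polynomial and clamps the result to the 1 to 5 MOS scale. The
function name, the clamping, and the coefficients are assumptions made for
this example only; the coefficients actually used by POLQA are not given in
this paper.

```python
def map_raw_to_mos_lqo(raw_score, coeffs, lo=1.0, hi=5.0):
    """Evaluate c0 + c1*x + c2*x^2 + c3*x^3 at raw_score and clamp to [lo, hi].

    coeffs is a 4-tuple (c0, c1, c2, c3). Clamping to the MOS scale is an
    assumption of this sketch, not a statement about the standard.
    """
    c0, c1, c2, c3 = coeffs
    # Horner's scheme: fewer multiplications than expanding the powers.
    value = c0 + raw_score * (c1 + raw_score * (c2 + raw_score * c3))
    return max(lo, min(hi, value))


# HYPOTHETICAL coefficients, chosen only to make the example runnable.
EXAMPLE_COEFFS = (0.5, 0.6, 0.1, -0.01)

if __name__ == "__main__":
    for raw in (0.0, 2.0, 4.0):
        print(raw, round(map_raw_to_mos_lqo(raw, EXAMPLE_COEFFS), 3))
```

With real coefficients, the same function shape would apply; only the
tuple passed as coeffs would change.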

Figure 3. High Level Architecture of POLQA Algorithm

3.2 Operability Requirements
The POLQA algorithm is designed to predict overall listening speech quality
under NB, WB, and SWB (50 to 14,000 Hz) conditions in 3G/4G (LTE-SAE)
networks, including advanced speech processing technologies, acoustical
interfaces, and hands-free applications. It should be noted that POLQA has
two operational modes: SWB and NB. The main difference is the bandwidth
of the original speech signal used by the model. In SWB mode, the
received (and potentially degraded) speech signal is compared with an
SWB reference. Therefore, band limitations are considered to be
degradations and are scored accordingly. The listening quality is modelled
as perceived by a human listener using a diffuse-field equalized headphone
with diotic presentation (same signal at both ear-caps). In NB mode, the
received (and potentially degraded) speech signal is compared to an NB
(300 to 3,400 Hz) original. Thus, normal telephone band limitations are not
considered to be severe degradations. NB mode maintains compatibility with
the previously developed ITU-T Recommendation P.862.1 (PESQ) [2]. The
listening quality is modelled as perceived by a human listener using a
loosely coupled IRS type handset at one ear (monotic presentation).
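
The two operational modes can be summarized as a small lookup, which a
test tool might use to pick the reference signal and to interpret scores.
This is only a sketch: the class, field, and function names are
hypothetical and do not correspond to any POLQA API; the values are taken
from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolqaMode:
    name: str
    reference_band_hz: tuple      # (low, high) bandwidth of the reference signal
    listening_model: str          # listening situation assumed by the model
    band_limit_is_degradation: bool  # whether band limitation lowers the score


# Values as described in the text; names are illustrative only.
MODES = {
    "SWB": PolqaMode("SWB", (50, 14000),
                     "diffuse-field equalized headphone, diotic", True),
    "NB": PolqaMode("NB", (300, 3400),
                    "loosely coupled IRS type handset, monotic", False),
}


def select_mode(name):
    """Return the mode description for 'SWB' or 'NB' (case-insensitive)."""
    try:
        return MODES[name.upper()]
    except KeyError:
        raise ValueError(f"unknown mode: {name!r}") from None


if __name__ == "__main__":
    mode = select_mode("nb")
    print(mode.name, mode.reference_band_hz, mode.listening_model)
```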

[Figure 3 block labels: original speech and degraded speech inputs; time
alignment (delay estimates); psycho-acoustic model, in which a perceptual
model produces internal representations of the original (transmitted) and
degraded (received) speech signals; difference between the internal
representations (user perceived); cognitive model (environment modeling,
listening conditions, cognitive perception); raw POLQA score; mapping to
the subjective domain, based on speech databases, giving POLQA MOS-LQO;
possibly various speech based diagnostics (e.g., delay, gain levels, noise).]

3.3 Telecommunication Test and Application Scenarios
The telecommunication scenarios include current transmission
technologies [3]:
- Public switched networks (e.g., fixed wire PSTN, GSM, WCDMA,
CDMA)
- Push-over-Cellular, Voice over IP, and PSTN-to-VoIP
interconnections, TETRA
- Commonly used speech processing components (e.g., codecs such
as AMR NB/WB, G.722.1, iLBC, EVRC, EVRC-WB, EVRC-A/B,
AAC/AAC LD, Skype, MP3 low bit rate, G.726, and EFR; noise
reduction systems for different types of BGN such as office, street,
car, and babble; adaptive gain control; comfort noise; and other
types of voice enhancement devices) and their combinations.

The tested distortion types [3] cover:
- Single speech codecs and speech codecs used in tandem, as
currently used in telecommunication scenarios
- Packet loss and concealment strategies (packet-switched
connections)
- Frame errors and bit errors (wireless connections)
- Interruptions (such as unconcealed packet loss or handover in
GSM)
- Front-end clipping (temporal clipping)
- Amplitude clipping (overload, saturation)
- Variable delay (VoIP, video-telephony) / time warping
- Gain variations
- Linear distortions (spectral shaping), including time-variant
ones
- Non-linear distortions produced by the microphone / transducer at
acoustical interfaces
- Reverberations caused by hands-free test setups in defined
acoustical environments

The application scenarios cover both electrical and acoustical measuring
interfaces as well as different terminal types (handset, headphone, or
hands-free).

3.4 Understanding POLQA Limitations
It should be noted that there are several conditions and applications for
which POLQA was not designed. POLQA scores obtained in these types of
conditions are not reliable and should not be considered for any kind of
speech quality evaluation. These conditions include:
- Other dimensions of speech quality such as conversational aspects
and talking quality.
- Speech quality per call. POLQA is not intended to score longer
sequences of speech. It is focused on prediction of quality for
shorter speech utterances of 6 to 12 seconds.
- Noisy listening environments. POLQA does not predict perceived
speech quality in these environments; it is designed in accordance
with P.800, ACR testing.
- Music (including multimedia).
- Evaluation of performance or ranking of voice enhancement devices
(e.g., noise suppressors).
- Other technologies or components such as speech storage formats
or non-telephony applications such as public safety networks or
professional mobile radio connections.

Although not yet tested or evaluated, POLQA could be cautiously applied
to the following applications:
- Other languages (e.g., Russian, Arabic)
- Longer speech samples
Subjective tests for confirming POLQA performance on these types of
applications are recommended.
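
The 6 to 12 second utterance length mentioned above lends itself to a
simple pre-check before a sample is scored. The sketch below is one
possible way to flag samples outside that range; the function and constant
names are hypothetical, and the check is an illustration, not part of the
POLQA algorithm.

```python
MIN_UTTERANCE_SECONDS = 6.0   # lower bound quoted in the text
MAX_UTTERANCE_SECONDS = 12.0  # upper bound quoted in the text


def sample_duration_seconds(num_samples, sample_rate_hz):
    """Duration of a mono speech sample given its length and sampling rate."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return num_samples / sample_rate_hz


def within_designed_length(num_samples, sample_rate_hz):
    """True if the sample falls in the 6 to 12 second range POLQA targets."""
    duration = sample_duration_seconds(num_samples, sample_rate_hz)
    return MIN_UTTERANCE_SECONDS <= duration <= MAX_UTTERANCE_SECONDS


if __name__ == "__main__":
    print(within_designed_length(8 * 48000, 48000))   # 8 s sample
    print(within_designed_length(20 * 8000, 8000))    # 20 s sample
```

A sample that fails the check could still be scored, but the result would
fall under the "longer speech samples" caution above.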

3.5 POLQA Algorithms Performance Evaluation
Understanding POLQA performance as an estimator of subscriber
perception relies on the fact that results from a subjective experiment
reflect the relative quality between the tested speech samples, while the
absolute values could vary from experiment to experiment depending on
the listener group and the design of the subjective test.
Unlike subjective results, POLQA is independent of test context and
individual voter behavior. POLQA estimates the average subjective score
obtained from a group of voters listening to the same speech sample.
Although it does not provide an exact absolute score of an individual
experiment, POLQA does reproduce the relative quality ranking [3].
Therefore, POLQA performance evaluation involves comparison to
subjective scores as well as consideration of the variability that exists within
a listening panel. In addition, the differences between individual subjective
experiments must be removed. This is achieved by determining and
applying an optimal regression function (third-order polynomial) between the
subjective and objective scores.
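
A minimal sketch of such a per-experiment regression follows, assuming an
ordinary least-squares fit of a third-order polynomial that maps objective
scores onto subjective scores. The function names are hypothetical, and the
sketch does not enforce that the fitted curve is monotonic, which a real
evaluation procedure may require.

```python
def fit_cubic(x, y):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 + c3*x^3.

    Solves the 4x4 normal equations by Gaussian elimination with partial
    pivoting. Returns [c0, c1, c2, c3].
    """
    if len(x) != len(y) or len(x) < 4:
        raise ValueError("need at least 4 (x, y) pairs of equal length")
    n = 4
    # Normal equations: a[j][k] = sum(x^(j+k)), b[j] = sum(y * x^j).
    a = [[sum(xi ** (j + k) for xi in x) for k in range(n)] for j in range(n)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(n)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution on the upper-triangular system.
    c = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][k] * c[k] for k in range(r + 1, n))
        c[r] = (b[r] - tail) / a[r][r]
    return c


def apply_cubic(c, x):
    """Evaluate the fitted polynomial at x."""
    return c[0] + x * (c[1] + x * (c[2] + x * c[3]))


if __name__ == "__main__":
    objective = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
    subjective = [1.2, 1.6, 2.1, 2.7, 3.2, 3.6, 4.1, 4.3]  # made-up scores
    coeffs = fit_cubic(objective, subjective)
    print([round(apply_cubic(coeffs, v), 2) for v in objective])
```
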
Due to the large numbers and types of databases, as well as their content
variability, a rigorous and extensive evaluation procedure has been
developed for POLQA testing. A series of different statistical metrics as well
as statistical significance testing have been used [3]. The core metric,
against which the algorithm has been optimized, is the epsilon insensitive
root mean square error (rmse*). It was chosen because it best reflects the
usability of POLQA and its performance in real life scenarios. The epsilon
insensitive root mean square error expresses the POLQA error against the
average MOS of individual voters, counting only the part of each
difference that falls outside an epsilon-wide band around the
target average value. The uncertainty of a MOS panel is thereby taken
into account, with the epsilon value defined as the 95% confidence interval of
the averaged MOS.
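
$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N-d}\sum_{i=1}^{N} \mathrm{Perror}(i)^{2}}$$

The Perror is defined as:

$$\mathrm{Perror}(i) = \max\bigl(0,\ \lvert \mathrm{MOSLQS}(i) - \mathrm{MOSLQO}(i) \rvert - \mathrm{ci}_{95}(i)\bigr)$$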

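where the index i denotes the condition of the speech sample, N denotes
the number of conditions or speech samples, and d denotes the degrees of
freedom (d = 4 in the case of a third-order regression). MOSLQS(i) is the
subjective score, MOSLQO(i) is the objective (POLQA) score, and ci95(i) is
the 95% confidence interval of the subjective score for condition i.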
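
The two definitions above can be sketched in a few lines of Python. The
function names are hypothetical. The confidence interval helper uses the
normal approximation (factor 1.96), which is an assumption of this sketch;
a formal evaluation may use a Student t factor instead.

```python
import math


def ci95_of_mean(votes):
    """Half-width of the 95% confidence interval of the mean of the votes.

    Assumption: normal approximation (1.96 * standard error).
    """
    n = len(votes)
    if n < 2:
        raise ValueError("need at least two votes")
    mean = sum(votes) / n
    variance = sum((v - mean) ** 2 for v in votes) / (n - 1)
    return 1.96 * math.sqrt(variance / n)


def epsilon_insensitive_rmse(mos_lqs, mos_lqo, ci95, d=4):
    """rmse* over N conditions, with d degrees of freedom (4 for a cubic)."""
    n = len(mos_lqs)
    if not (n == len(mos_lqo) == len(ci95)):
        raise ValueError("input lists must have the same length")
    if n <= d:
        raise ValueError("need more conditions than degrees of freedom")
    total = 0.0
    for subjective, objective, ci in zip(mos_lqs, mos_lqo, ci95):
        # Only the part of the error outside the confidence band counts.
        perror = max(0.0, abs(subjective - objective) - ci)
        total += perror ** 2
    return math.sqrt(total / (n - d))


if __name__ == "__main__":
    lqs = [4.0, 3.0, 2.0, 1.5, 3.5, 2.5]   # made-up subjective scores
    lqo = [4.1, 3.5, 2.0, 1.0, 3.5, 2.5]   # made-up objective scores
    ci = [0.2] * 6
    print(epsilon_insensitive_rmse(lqs, lqo, ci))
```

In this made-up example only two of the six conditions differ by more than
their confidence interval, so only those two contribute to the error.
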
The results reported in [12] provide general information on POLQA
performance on a broad range of databases containing a large variety of
technologies, codecs, and bandwidths. Because they represent overall
performance, these results might be misleading to a certain extent. Due to the
variety of databases and the statistical aggregation procedure of the results
[3], [12], a weaker or better performance for a specific application and/or
bandwidth could be smoothed out or hidden. Therefore, additional analysis
is expected for a more detailed view or for a particular application. This
analysis is planned by ITU-T during the POLQA characterization phase, and
the results are expected to be published in the forthcoming POLQA
Application Guide (estimated for June 2011).

4 Beyond the MOS Score
Due to the complexity of the NGN environment, as well as the challenges in
supporting voice service on LTE-SAE/SON networks, several solutions for
providing voice service are currently envisioned. Therefore, test and
evaluation of speech quality in the NGN environment must be
comprehensive. In order to understand and cost-efficiently control the
speech degradation of different implementation solutions, evaluation
techniques need to go beyond the MOS score.

To a large extent, as in the PESQ case, interim calculations of POLQA as
well as the six degradation parameters used as input to the POLQA
algorithm's cognitive model would allow some network diagnosis based on
speech quality evaluation. Details are discussed in [1], but generally the
diagnosis could cover aspects such as latency, jitter (variable delay),
gain variations, speech signal and BGN level measurements, level clipping,
dropouts (e.g., generated by packet loss), operability of VAD, and
short-term spectra (linear degradations caused by the frequency response
of the devices and/or by the VoIP landline connection).

5 Ascom Network Testing Presence in the
Standardization Work on Objective Evaluation
Metrics for Listening Speech Quality
For more than 10 years, Ascom Network Testing has been an active
member within ITU-T Study Group 12, which develops objective speech
quality evaluation metrics. Our contributions to the standardization work
cover different areas and stages of objective metric development.
Ascom Network Testing contributed live recorded speech databases
needed for accurate training and tuning of the algorithms running in real life
scenarios typical of network troubleshooting, optimization, and operation
applications performed by operators. Within ITU-T, we were the initiator
and developer of the statistical evaluation procedure for objective metrics
that was first applied to PESQ and that was later applied in a modified form
to POLQA [8]. Recently, based on our initial work as well as work
performed for POLQA performance evaluation, Ascom Network Testing
introduced a new study item within ITU-T on a more general statistical
evaluation procedure to be applied to various types of objective metrics [9].
This type of evaluation is increasingly necessary for all kinds of
objective metrics (e.g., speech, video, audio, multimedia) that are designed
for testing in real life networks and therefore for implementation in
network testing tools. We also developed a technique for objective quality
metrics calibration to the MOS scale. As a result, we co-authored two
standards in relation to PESQ: P.862.1 (Mapping PESQ to MOS domain)
and P.862.3 (Guidance for PESQ usage) [2].
Additionally, Ascom Network Testing recently wrote a white paper
contribution [10] on aspects related to POLQA implementation in field
testing tools, as well as a white paper contribution related to topics that are
required to be studied during the POLQA characterization phase [11].

6 Conclusions
The convergence and coexistence of voice, data, and multimedia
application services involve a multitude of factors that invariably
produce new types of distortions that dynamically, variably, and sometimes
randomly affect voice service quality. Today, speech quality is determined
by more than the speech codecs used or the frames lost. Networks and devices
now integrate many new components ranging from voice enhancement
devices to new techniques such as time scaling.

Extensive work has been performed during the past decade by both the
ITU-T and the telecommunication industry in developing speech quality
evaluation algorithms designed to accurately evaluate any network
degradation impact on subscriber perception as well as to cope with the
complex testing conditions of the 3G networks and beyond. The new
technology POLQA was developed to cope with the evolving networks'
complexities. As with all new technologies, extensive live testing is
expected to complete the picture of the POLQA algorithm's performance. Ascom
Network Testing, a proven veteran in ITU-T on objective quality metrics
evaluation, continues to play an active role in the standardization work on
this topic.

7 References
[1] I. Cotanis, Voice Services in the Next Generation Networks/LTE-
SON as Perceived by Users, Ascom Network Testing white paper,
November 2010.
[2] ITU-T P.862.x series (PESQ): P.862 (PESQ algorithm), P.862.1 (Mapping to
MOS domain), P.862.2 (WB-PESQ), P.862.3 (PESQ application guide).
[3] ITU-T P.863, Perceptual Objective Listening Quality Assessment
(POLQA), Geneva, January 2011.
[4] ITU-T P.563, Single-ended method for objective speech quality
assessment in narrow-band telephony applications.
[5] ITU-T P.564, Conformance testing for voice over IP transmission
quality assessment models.
[6] ITU-T TD SG 12 Gen 345, Final report of Working Party 2,
Geneva, May 2010.
[7] ITU-T P.800, Subjective testing of overall listening speech quality.
[8] I. Cotanis, ITU-T SG12/Q9 C137, A procedure for statistical
evaluation of the objective quality metrics performance, May 2008.
[9] I. Cotanis, ITU-T C151, Proposal on statistical evaluation
framework for objective quality algorithms, submitted for ITU-T
January 2011 meeting.
[10] I. Cotanis, ITU-T SG 12 C112, Some aspects related to P.OLQA
standard, May 2010.
[11] I. Cotanis, ITU-T C142, Proposed study items for POLQA
characterization phase, September 2010.
[12] Opticom, TNO, SwissQual, ITU-T C148, Performance of the joint
POLQA model, September 2010.
[13] POLQA coalition, www.polqa.info, July 2010.
