
Speech Quality of AMR Wideband Coding vs.

Narrowband as Perceived by the End-User


R. Müllner and M. Mummert
R. Müllner is with Siemens AG, Communications Mobile Networks, 81541 Munich, Germany (e-mail: robert.muellner@siemens.com). M. Mummert is with Dr.-Ing. M. Mummert, Audio Signal Processing, 80796 Munich, Germany (e-mail: Mummert-ASP@t-online.de).

Abstract: AMR-WB has been standardized for mobile GSM and UMTS networks, for wire-line services, and for VoIP and multimedia applications. Its significance lies in an extended audio bandwidth over traditional narrowband telephony. Listening tests in the German language were performed to assess the perceived difference in speech quality between wideband and narrowband coded speech under various acoustic background noise and channel conditions. End users' perception of speech quality was assessed using a modified version of the MUSHRA test: Rather than selecting experienced listeners, typical mobile subscribers were recruited in the street. The study confirms that AMR-WB provides significantly higher speech quality than AMR-NB and G.711 (ISDN telephony), both of which were rated to be of similar quality. Quality improvement for speech, with or without background noise, corresponds to a full step on a five-grade scale ranging from bad to excellent. Comparing AMR-WB at varying radio conditions with narrowband competitors at error-free conditions, wideband is always perceived to be better than narrowband, unless radio conditions become poor. The turn-over point in this case lies between 7 dB and 4 dB C/I for typical urban full-rate GMSK or 8-PSK interferer channels. When radio conditions are better, speech bandwidth is more important than coded-speech bit rate. In the presence of background noise this effect persists, emphasizing the substantial benefit of AMR-WB compared to AMR-NB or G.711.

Index Terms: AMR-NB, AMR-WB, G.711, ISDN, listening test, MUSHRA test, speech quality.

I. INTRODUCTION

The Adaptive Multi-Rate (AMR) narrowband (NB) speech codec is well established in mobile networks [1]-[6]. Combined with common channel coding, it outperforms older codec standards, such as Enhanced Full-Rate (GSM-EFR), Full-Rate (GSM-FR), and Half-Rate (GSM-HR), with respect to maximum speech quality and error robustness [7]. A major step towards the highest speech quality is achieved by the introduction of AMR wideband (AMR-WB), which has been standardized for GSM and UMTS mobile networks [8], [9]. ITU-T has adopted this codec as Recommendation G.722.2, facilitating the introduction of wideband across networks [10].

3GPP Release 6 defines five different codec modes ranging from 6.60 kbps to 23.85 kbps coded-speech bit rate. In GSM/EDGE single time slot configuration, all five codec rates are supported on 8-PSK full-rate speech channels (O-TCH/WFS), whereas only three codec rates from 6.60 kbps to 12.65 kbps are supported on GMSK full-rate speech channels (TCH/WFS). The latter three codec rates can also be allocated on half-rate channels using 8-PSK modulation (O-TCH/WHS). AMR-NB on GMSK full-rate channels includes eight codec modes with rates ranging from 4.75 kbps to 12.2 kbps (TCH/AFS). ISDN narrowband speech employs G.711 A-law coding [11] delivering 64 kbps. The perceived quality difference between these wideband and narrowband codecs has been analyzed in this study.

The paper is structured as follows. The reasons for selecting the specific test methods are explained in Section II. Section III introduces the experimental design of the audio test. The test conditions are described in Section IV. Results are presented in Section V and discussed in Section VI. Finally, the main conclusions are given in Section VII.

II. SELECTION OF TEST METHOD

The 3GPP/ETSI AMR characterization tests for pure wideband [12] and pure narrowband speech [7] are based on Absolute Category Rating (ACR) [13], delivering Mean Opinion Scores (MOS) as a quality index. When both bandwidths are presented in the same experiment, however, this method is no longer suitable. The reason is that the change of bandwidth has a strong impact on sound impression and is deemed to confuse the test subjects' internal reference implied for absolute ratings. In order to avoid misinterpretations, it is also advisable to avoid the designation MOS for a quality index: Once both bandwidths are presented in an experiment, test settings lose comparability with single bandwidth experiments.

Objective methods like PESQ [14] promote the notion that a combination of codec type and transmission conditions features a particular MOS value. These methods, however, were calibrated by a large number of real listening tests in single bandwidth settings. A multiple bandwidth setting renders this calibration meaningless. Rather than ACR, a method is desirable where (a) an explicit reference is always accessible and, to avoid sequence effects and to support consistent assessments, (b) all sounds are presented virtually at once.

A test method well suited to meet these requirements is the MUlti Stimulus test with Hidden Reference and Anchors (MUSHRA) [15]-[17]. This test method has been designed to provide a reliable and repeatable measure of the audio quality of intermediate-quality signals [15]. However, the major difference between the applied test method and strict MUSHRA test methodology is the intentional selection of standard mobile phone users instead of professional listeners. The quality assessment by people recruited in the street is a new aspect to the MUSHRA test and has been selected in order to explicitly evaluate the end-users' perception.

III. EXPERIMENTAL DESIGN

In MUSHRA, multiple stimulus means that differently processed versions of a sound source are presented simultaneously. This allows the subject to listen to one version and to switch quickly to another version. The original unprocessed version of the source is also available to the subject and is identified as the reference version. This guarantees that the listener knows how the versions really should sound, which would not be the case in ACR tests. Among the versions to be assessed, the MUSHRA test also includes a hidden reference and so-called anchors. Thus the original unprocessed version also appears as one of the versions to be graded by the subject. Anchors are low-pass filtered originals with a band limit at 3.5 kHz (narrowband anchor) and 7 kHz (wideband anchor), respectively [16]. The hidden reference and anchors provide a reference grid that covers the grading scale.

Fig. 1 demonstrates the graphical user interface. The dark green button at the top left position plays the reference. The other versions, including the hidden reference and the anchors, are played by the buttons with a headphone symbol. Under each button, except for the button for the reference, a slider is positioned to grade the quality of the stimulus.

The subjects were instructed to rate the speech quality of each signal on a scale of 0 to 100 by positioning the sliders, observing that 100 corresponds to the quality of the reference. The term speech quality was not defined to the listeners, so their understanding may be a combination of intelligibility and general quality impression. The fact that a hidden reference appeared among the versions to be graded was not revealed to the subjects. Subjects were asked to proceed in two steps. In the first step, the quality was to be classified according to a coarse scale, sub-divided into five equidistant ranges representing excellent, good, fair, poor, and bad quality. In the second step, a refinement of the setting was requested according to the continuous scale between 0 and 100. It was left to the subject to decide that his/her assessments were satisfactory and to terminate the experiment by pressing the button situated at the lower right-hand side.

Each of the computers used for the experiments was connected via an external sound card and an amplifier for individual volume control to a monaural headphone (beyerdynamic DT 252, closed construction). The subjects were trained in a separate experiment in order to make them familiar with the user interface and to expose them to the range of the sound quality. The position of the headphone and the sound volume could also be adjusted during training. These settings were to be kept unchanged during all following experiments. All tests were carried out in a silent environment by an independent research institute.

IV. TEST CONDITIONS AND STIMULI

A total of 150 test subjects (78 male, 72 female) were recruited at three different locations in Germany (Dortmund, Düsseldorf, and Munich). The sample was characterized by a uniform distribution of business users and private users in the age groups 16-25, 26-40, 41-55, and over 55 years. The subjects were screened at the beginning to ensure that they met the selection criteria.
People with hearing impairments were not included in the tests. The complete listening test included five experiment types for the assessment of AMR-WB and AMR-NB: (a) jointly under error-free conditions (one type) and (b) separately under error-prone radio channel conditions (four types). Including the training experiment, each subject performed three to four separate experiments within his/her individual session, which lasted about 30 minutes in total. In the main experiment type, rating of AMR-WB against AMR-NB quality at error-free conditions was conducted with a total of 80 subjects. Four experiment types for error-prone channel conditions within specific GSM channels were conducted, with a total of 32 subjects each (see footnote 1). For each experiment, subjects were randomly selected to reflect the proportions of the full sample. Disregarding training experiments, a specific combination of experiment type and speaker was presented not more than once per session. Audio source material (recorded at 44.1 kHz) comprised eight different utterances in the German language, spoken by male or female speakers, in clean or typical background noise conditions, as shown in Table I. Each sample lasted about 8 s.

Footnote 1: Another four experiments, with 32 subjects each, were conducted for similar purposes and are not reported here; this explains the count of three to four experiments per subject.

Fig. 1. User interface of an experiment, with play buttons, sliders, and the scale ranges excellent, good, fair, poor, and bad. Top text in English: "Please assess the speech quality of each sound by using the sliders on a scale from 0 - 100, whereas 100 corresponds to the speech quality of the reference sound." Dark top left button reads: "reference sound".
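As described in Section III, subjects first chose one of five equidistant ranges on the 0 to 100 scale before refining their setting. The following Python sketch illustrates that coarse mapping. The function name `category` and the handling of boundary values (a score of exactly 20 is treated as poor, 100 as excellent) are assumptions made for illustration; the paper does not specify them.

```python
# Coarse five-range scale, from lowest to highest quality.
LABELS = ["bad", "poor", "fair", "good", "excellent"]

def category(score: float) -> str:
    """Map a slider value in [0, 100] to one of five equidistant ranges.

    Assumption: each range is 20 points wide and a boundary value
    belongs to the upper range, except 100, which stays 'excellent'.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    index = min(int(score // 20), len(LABELS) - 1)
    return LABELS[index]

# Illustrative values roughly 20 points apart fall into adjacent ranges.
print(category(45))  # fair
print(category(66))  # good
```

A difference of about 20 points therefore corresponds to one full range of the coarse scale.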
TABLE I
SPEAKER CHARACTERIZATION
Speakers 1 to 4: clean speech, male and female speakers.
Speakers 5 to 8: speech with background noise, male and female speakers. Noise types: train station noise, crowded street noise (Manhattan), suburban street noise, inside car/vehicle noise.

The stimuli used in the test had been generated from source material using rate conversion, filtering, transcoding, and level adjustment. Characteristics of pre-filtering and possible transcodings to define a stimulus type are summarized in Table II.
TABLE II
STIMULUS TYPES USED IN TEST
For AMR, transcoding implies the codec mode best matching either the error-free or, for a given C/I, the error-prone radio channel.

Stimulus type | Pre-filter | Pre-filter characteristics | Transcoding
Reference (also hidden) | (none) | Original sound recorded at 44.1 kHz | (none)
7 kHz WB-anchor | C7K | Low pass eliminating f > 7 kHz, steep roll-off similar to P.341 | (none)
3.5 kHz NB-anchor | C3K5 | Low pass eliminating f > 3.5 kHz, rather steep roll-off (dB:kHz -1:3.45, -3:3.50, -7:3.55, -12:3.60, -33:3.70, -64:3.80, Gaussian) | (none)
AMR-WB, alternative pre-filtering | P.341 | WB send filter, ITU-T P.341/G.191, eliminates f > 7 kHz and attenuates f < 50 Hz, affects disturbing noise but hardly voice | TCH/WFS
AMR-WB | P.341 + MSIN | (two filter stages, as above/below) | TCH/WFS, O-TCH/WFS, O-TCH/WHS
AMR-NB, alternative pre-filtering | MSIN | Attenuation of low frequencies, ITU-T G.191, -3 dB at 190 Hz, 40 dB/decade | TCH/AFS
AMR-NB | MIRS | Modified IRS send filter, ITU-T P.830/G.191, attenuates f < 300 Hz and f > 3.4 kHz, f-progressive boost in mid-frequency range up to 3 kHz | TCH/AFS
G.711 (ISDN) | MIRS | (as for AMR-NB) | G.711 A-law

In wideband coded speech, the exploitation of the full bandwidth from 50 Hz to 7000 Hz is limited by (a) technical restrictions in mobile phones, such as miniaturization of loudspeakers, and (b) other design considerations, such as suppression of low-frequency noise and optimization of intelligibility. The combination of P.341 and MSIN filtering was applied as standard in order to reflect these restrictions. No post-filtering was applied, i.e. no receive filter has been included. Thus the receive filter mask according to [18] has been slightly exceeded towards frequencies below 100 Hz. This was deemed tolerable because effects of a wideband range almost unrestricted towards lower frequencies (P.341 only) were also investigated. Care was taken in designing the filter roll-off of C3K5 in order to avoid artifacts resulting from excessive steepness (ringing). C7K0 was found to be uncritical in this respect.

For error-free channel conditions, transcoding comprises a single encoder/decoder pair of either AMR-NB, AMR-WB, or G.711 type [19], [20], [11]. Under error-free radio conditions, transcodings of TCH/WFS, O-TCH/WFS, and O-TCH/WHS are alike for the same codec mode. Using a single pair in the case of AMR reflects Tandem Free Operation (TFO), which is a prerequisite for wideband operation [21]. Additionally, for error-prone channel transcoding, the encoded signal was passed through a radio channel simulator. Radio conditions were single interferer, Typical Urban 3 km/h, ideal frequency hopping, GSM 900 with Carrier/Interferer ratios (C/I) of 1, 4, 7, 10, and 13 dB. Simulated performances were similar to those used within ETSI/3GPP AMR performance characterization tests [7], [12]. At each condition only the best-suited AMR codec mode was presented, as determined in advance by experienced listeners.

All audio samples were then converted to 44.1 kHz mono irrespective of the previous processing, whereas the reference stimuli remained unprocessed. Filtering and rate conversion between 48 kHz and 8 or 16 kHz, as well as codec input level equalization to -26 dBov, were performed using the ITU-T G.191 Software Tool Library [22]. A separate tool was used for rate conversion between 44.1 and 48 kHz.

Finally, signal levels were adjusted by two offsets pertaining to the stimulus type and the speaker, respectively. These offsets had been previously determined by experienced listeners to achieve equal subjective loudness for all stimuli. To determine the stimulus type offsets for all AMR-WB and, respectively, for all AMR-NB transcodings, TCH/WFS12.65 and TCH/AFS12.2, both at error-free channel conditions, were selected.

V. RESULTS

A. Error-free channel condition

Fig. 2a shows the results of the main experiment for clean speech including male and female voices. The mean of assigned values and 95% confidence intervals (as calculated for MUSHRA tests) are plotted. The sample size is smaller than 40, representing half of the 80 subjects, namely those that assessed clean speech. Some subjects were rejected, as explained later in part C of this section. The hidden reference receives the highest score but falls short of the expected value of 100. AMR-WB speech quality assessments stand out considerably from narrowband results,

[Fig. 2 bar charts, panels (a) and (b): y-axis "Assigned value" (0 to 100); x-axis stimulus types: Reference, Anchor 7k0, Anchor 3k5, O-TCH/WFS23.85, TCH/WFS12.65, TCH/WFS12.65 (alt), TCH/AFS12.2, TCH/AFS12.2 (alt), G.711 (ISDN). Bar labels in extraction order: (a) 86, 74, 44, 66, 67, 64, 45, 41, 47; (b) 63, 50, 21, 51, 48, 49, 26, 25, 28.]

Fig. 2. Speech quality assessments for (a) clean and (b) noisy speech. Error-free channel, male and female speakers. Mean of assigned values and 95% confidence interval based on (a) N = 38 samples and (b) N = 39 samples. (alt: alternative pre-filtering, see Table II.)

even though they do not match up to the original sound, i.e. the reference. The quality improvement of around 20 corresponds to a full step on the five-grade scale that the experiments maintained in parallel with the continuous scale. The 7 kHz anchor ranks only slightly higher, indicating that losses due to WB transcoding had only minor effects. AMR-WB assessments do not reveal any difference between 23.85 and 12.65 kbps transcoding, and alternative pre-filtering seems to have no impact. The 3.5 kHz anchor adheres to the narrowband group, among which no significant differences can be observed either. Thus AMR-NB transcoding losses in relation to G.711 (ISDN) are of no importance. As with AMR-WB, alternative filtering does not show an impact.

In the results for speech in background noise conditions in Fig. 2b, the shape of the clean speech results is virtually preserved. This confirms the validity of the prior observations. However, the level of the scores is reduced by a common offset (20 on average). Thus the score of the hidden reference misses expectation considerably, a fact which will be discussed later.
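The means and 95% confidence intervals plotted in Fig. 2 can be approximated with a short Python sketch. The paper only states that intervals were calculated "as for MUSHRA tests" and does not give the formula, so the version below is an assumption: it uses the sample standard deviation and a normal quantile (about 1.96) rather than a Student t quantile. The function names are hypothetical. The standard deviation of 23 used in the example is the average value reported later in this section, and N = 38 and N = 77 are sample sizes given for Figs. 2 and 3.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci_half_width(sd: float, n: int, level: float = 0.95) -> float:
    """Half-width of a two-sided confidence interval for a mean.

    Assumption: normal approximation (z quantile); a t quantile
    would give slightly wider intervals for small n.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / sqrt(n)

def mean_with_ci(scores, level: float = 0.95):
    """Return (mean, half-width) for a list of assigned values."""
    scores = list(scores)
    return mean(scores), ci_half_width(stdev(scores), len(scores), level)

# Standard deviation of about 23 with the sample sizes of Fig. 2a and Fig. 3.
print(round(ci_half_width(23, 38), 1))  # about 7.3
print(round(ci_half_width(23, 77), 1))  # about 5.1
```

Under these assumptions, a difference of around 20 points between the wideband and narrowband groups is well outside the interval widths, while differences of a few points within each group are not.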

To further analyze the outcome of the main experiment, Fig. 3a shows the dependency of the mean of assigned values on the speaker group. The left part of the diagram plots the assessments for single speakers or combinations of speakers and features clean speech. The results for noisy speech are provided on the right side. Specifically, the mean values of Fig. 2a can be found directly to the left of the center (clean & noisy), all plotted at the same position one above the other. Similarly, the values of Fig. 2b are plotted directly to the right of the center. Across speaker groups, the assessments of the same stimulus type are linked by lines for visual orientation. This does not necessarily imply dependence between connected points. The sample size, equal to the number of subjects contributing, is added on the abscissa. In a similar manner, Fig. 3b shows the half-width of the 95% confidence interval to illustrate the reliability of the mean values given before. Naturally, the highest reliability is achieved if assessments for all speakers are jointly averaged, thus including the assessments of 77 subjects. The size of the confidence interval and the contrast between different codec

[Fig. 3 plots: (a) y-axis "Mean of assigned values", (b) y-axis "Half-width of confidence interval"; x-axis in both panels: speaker groups (single speakers #1 to #8, clean male, clean female, clean, clean & noisy, noisy, noisy male, noisy female), annotated with "Sample size N". Sample sizes in extraction order: 9, 9, 10, 10, 20, 18, 38, 77, 39, 20, 19, 10, 9, 10, 10. Legend: Reference, Anchor 7k0, Anchor 3k5, O-TCH/WFS23.85, TCH/WFS12.65, TCH/WFS12.65 (alt), TCH/AFS12.2, TCH/AFS12.2 (alt), G.711 (ISDN).]

Fig. 3. Dependency of speech quality on speaker group for error-free channel: (a) mean of assigned values and (b) half-width of confidence interval. Shading to remind of decreasing sample size. (alt: alternative pre-filtering, see Table II.)

modes and filter combinations increase with decreasing sample size, as statistics become less reliable. The average standard deviation was determined to be approximately 23, almost independent of the speaker group.

When shifting focus from clean to noisy speech in Fig. 3a, it becomes apparent that assignment levels vary more across speakers, revealing a dependency on the nature of the background noise (spectral distribution and level). It also seems that the 7 kHz anchor loses its small advantage over AMR-WB, hinting that wideband codec losses are even less important in the presence of background noise. However, this tendency is not really confirmed for error-prone channels, presented below.

Disregarding assessment offset variations, Fig. 3a demonstrates that the observations made previously for clean and noisy speech are essentially reflected by any subgroup. However, when dividing results into individual speakers, two effects have to be considered: (a) low statistical reliability due to small sample size and (b) effects peculiar to individual speakers and background conditions. In any case, AMR-WB was always found to be better than narrowband.

B. Error-prone radio channel conditions

Another four experiments reflect error-prone radio conditions in three AMR-WB channels and one AMR-NB channel. Results are displayed in Figs. 4a to 4d. In each diagram, mean values for clean speech are represented by squares, circles represent noisy speech, and triangles represent all speakers. The first three stimulus types on the left correspond to the hidden reference and the anchors. Linking data points by lines in these cases is for better visual distinction only. The other types following correspond to deteriorating radio conditions for the given channel. Marks for 95% confidence intervals are attached to the data points. Because the sample size is much smaller than in the main experiment, the intervals are somewhat larger.

Although the confidence intervals are of considerable size, the curves of deteriorating speech quality are clearly localized for almost all channels and speaker groups. In other words, the mean values hardly deviate from a visually smooth curve on the deteriorating side of the diagram. This indicates that the variance of assessments only partly accounts for independent random error. Evidently, a subject's assessments in relation to each other are more reliable than the assessments for a particular stimulus between subjects. The subjects' individual interpretations of the scale compose a spread that contributes to variance and, in consequence, to standard deviation and confidence width. This finding is supported by the fact that the large offset between clean and noisy speech observed in the main experiment is reproduced only to a smaller extent, depending on the channel. Hence the assessments feature a warped scale effect, which is discussed in Section VI. An extreme case of this effect probably aggravates the picture for noisy speech results at TCH/WFS in Fig. 4c: AMR-WB assessments hardly surpass the 3.5 kHz anchor that, being the only narrowband stimulus, is already likely to bear a large independent deviation.
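The contribution of individually shifted scales to the spread can be illustrated with a small Python sketch. It follows the procedure mentioned in Section VI, where the mean of each individual experiment across all stimuli is subtracted before the per-stimulus standard deviation is evaluated. The function names and the ratings are hypothetical toy values, chosen so that three subjects agree on the relative difference between two stimuli but use different parts of the scale.

```python
from statistics import mean, stdev

def per_stimulus_sd(experiments):
    """Standard deviation per stimulus across experiments.

    experiments: list of dicts mapping stimulus name to assigned value,
    one dict per subject and experiment (hypothetical data layout).
    """
    stimuli = experiments[0].keys()
    return {s: stdev([e[s] for e in experiments]) for s in stimuli}

def center_each_experiment(experiments):
    """Subtract the mean over all stimuli within each experiment."""
    centered = []
    for e in experiments:
        offset = mean(e.values())
        centered.append({s: v - offset for s, v in e.items()})
    return centered

# Toy data: an "optimist", a "pessimist", and a subject in between.
ratings = [
    {"wideband": 80, "narrowband": 60},
    {"wideband": 50, "narrowband": 30},
    {"wideband": 65, "narrowband": 45},
]

raw_sd = per_stimulus_sd(ratings)
centered_sd = per_stimulus_sd(center_each_experiment(ratings))
print(raw_sd["wideband"], centered_sd["wideband"])
```

In this constructed example the spread vanishes completely after centering because the subjects differ only by an offset. With real data, random deviations remain, and the paper reports a reduction of the typical standard deviation by about a third.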

[Fig. 4 plots, panels (a) to (d): y-axis "Assigned value" (0 to 100); x-axis: Reference, Anchor 7k0, Anchor 3k5, followed by the codec modes of the respective channel under deteriorating radio conditions. Legend sample sizes, in extraction order for panels (a) to (d): (a) clean N = 13, all N = 28, noisy N = 15; (b) clean N = 16, all N = 31, noisy N = 15; (c) clean N = 15, all N = 31, noisy N = 16; (d) clean N = 16, all N = 31, noisy N = 15.]

Fig. 4. Speech quality assessments for error-prone radio channel: (a) O-TCH/WFS, (b) O-TCH/WHS, (c) TCH/WFS, and (d) TCH/AFS. Speaker groups clean, noisy, all. Mean of assigned values and 95% confidence interval.
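The turnover points discussed for Fig. 4 result from comparing wideband scores at decreasing C/I against a fixed narrowband quality level. A minimal Python sketch of that comparison follows. The function name `turnover_interval`, the wideband scores, and the narrowband level of 44 are hypothetical values chosen for illustration only; they are not the measured results.

```python
def turnover_interval(wb_scores, nb_level):
    """Find the pair of adjacent C/I values between which the wideband
    score drops below the narrowband reference level.

    wb_scores: dict mapping C/I in dB to the mean assigned value.
    nb_level: mean assigned value taken as error-free narrowband quality.
    Returns (higher C/I, lower C/I), or None if no crossing occurs.
    """
    ci_values = sorted(wb_scores, reverse=True)  # from good to poor channel
    for better, worse in zip(ci_values, ci_values[1:]):
        if wb_scores[better] >= nb_level > wb_scores[worse]:
            return better, worse
    return None

# Hypothetical wideband scores per C/I (dB) and narrowband level.
example_scores = {13: 62, 10: 58, 7: 50, 4: 35, 1: 20}
print(turnover_interval(example_scores, nb_level=44))  # (7, 4)
```

Because only discrete C/I values were tested, the result is an interval rather than a single point, which matches the way the turnover points are reported in the text.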

It is interesting to note that the best AMR-WB assessments do not come as close to the 7 kHz anchor as they did in the main experiment.

In order to relate the results of wideband channels to the results of narrowband channels, one proceeds as follows. From error-free channel results, e.g. Fig. 3a, it is known that the 3.5 kHz anchor scores almost as well as TCH/AFS12.2. In Fig. 4d the tendency is reversed, even though this may be attributed to the 3.5 kHz anchor being the only narrowband stimulus in a smaller sample set, thus diminishing confidence. In any case, the speech quality of the 3.5 kHz anchor plus/minus a small margin may serve as an orientation for the level of AMR-NB error-free channel quality within Figs. 4a to 4c. This quality level also corresponds to G.711, as evidenced by the error-free channel results. The procedure delivers the following results: Except for poor radio conditions, wideband speech quality always outperforms error-free channel narrowband. The quality turnover points fall between 7 dB and 4 dB C/I for O-TCH/WFS and TCH/WFS and between 10 dB and 7 dB for O-TCH/WHS. For a fair comparison with the AMR-NB channel, its deterioration must also be taken into account. The true turnover points are therefore even lower, if they exist at all.

C. Rejection criterion

All results presented above employ a rejection procedure to remove outliers. Rejection of a subject and all of his/her experiments is based on excessive variance in assessing the hidden reference. For this purpose the measures var1 and var2 have been computed individually for each subject as follows: A is the set of hidden-reference assessments in all experiments in which the subject participated, including the training-phase experiment. B is the set of assessments of all sounds presented in the training-phase experiment. C is the union of A and B, i.e. the hidden reference from the training-phase experiment has been counted only once. The term var1 is the variance of the values given by A, i.e. the absolute variance of assessing a hidden reference.
The term var2 equals var1 divided by the variance of the values given by C, i.e. the variance of A relative to the subject's estimated individual variance for all sounds. The limits applied are 2250 for var1 and 2.25 for var2. These values were found suitable to cut off conspicuous outliers. Yet in effect only between 0 and 3 assessments were removed from the original sample when evaluating the speaker groups clean or noisy.

VI. DISCUSSION

A common finding among the results is that the hidden reference was rated significantly lower than the expected value of 100, especially in the case of noisy speech. Evidently some subjects graded on an absolute scale rather than in relation to the reference, which naturally would discriminate against speech in a distracting background. A likely explanation is that the meaning of the reference (button) was not always understood, or perhaps was in doubt, when background noise made a bad overall impression. Subjects recruited in the street cannot be expected to be as attentive to the test procedure as experienced listeners would be in a genuine MUSHRA test. However, this does not appear to compromise the basic validity of the results: The ordering of stimuli as well as the relative differences in their assessment are not affected when only a smaller fraction of the full scale is used. It may be, though, that sensitivity in detecting differences is reduced a little. In effect, the Degradation Category Rating (DCR) foundation of MUSHRA grasped by some of the subjects merges with Absolute Category Rating (ACR) by others. Basic objections against ACR do not apply because the best quality was always accessible, as opposed to classic ACR.

Another effect that comes into play, especially when ACR is utilized, is which range or fraction of the scale is used by the subject. Some people tend to assign high marks (so-called optimists) and others tend to give low marks (so-called pessimists). Combining both effects, subjects evidently use differently warped scales for grading, which by itself produces a substantial standard deviation. This explains why the consistency of the results is satisfactory even though formal confidence may be poor. Supporting evidence can be seen in the fact that it was possible to reduce the typical standard deviation, and hence the confidence interval, by a third if the following procedure was employed: The mean of each individual experiment across all stimuli was subtracted before evaluating the stimulus standard deviation.

Of course, normal random deviations add on top, which may contain significant outliers. In a genuine MUSHRA test the grading of the hidden reference is a suitable starting point for rejection when grading was not close to 100. The fact that subjects used ACR made this kind of rejection criterion inappropriate. Alternative criteria have been studied in addition to the variance criterion finally used, such as subjective reliability classification and test location. A further reduction of the standard deviation was not found to be possible.
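The variance criterion defined in Section V-C can be sketched in Python as follows. Several details are assumptions, because the extracted text does not preserve them: the sketch uses the sample variance, treats the limits 2250 and 2.25 as upper bounds, and rejects a subject if either limit is exceeded. The function names and the example scores are hypothetical.

```python
from statistics import variance

VAR1_LIMIT = 2250.0  # limit for var1 as stated in the paper
VAR2_LIMIT = 2.25    # limit for var2 as stated in the paper

def rejection_measures(hidden_ref_scores, other_training_scores):
    """Compute var1 and var2 for one subject.

    hidden_ref_scores: set A, the subject's hidden-reference assessments
        in all experiments, including the training experiment.
    other_training_scores: the training-experiment assessments other
        than the hidden reference, so that A plus this list forms set C
        with the training hidden reference counted only once.
    """
    a = list(hidden_ref_scores)
    c = a + list(other_training_scores)
    var1 = variance(a)           # assumption: sample variance
    var2 = var1 / variance(c)
    return var1, var2

def is_rejected(var1, var2):
    """Assumption: exceeding either limit leads to rejection."""
    return var1 > VAR1_LIMIT or var2 > VAR2_LIMIT

# Hypothetical subject with consistent hidden-reference ratings.
steady = rejection_measures([90, 85, 95, 80], [70, 40, 60, 30, 50])
# Hypothetical subject with erratic hidden-reference ratings.
erratic = rejection_measures([100, 0, 100, 0], [50, 50])
print(is_rejected(*steady), is_rejected(*erratic))
```

Normalizing by the variance of C makes var2 insensitive to how much of the scale a subject uses overall, which fits the observation that subjects graded on differently warped scales.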
It seems that outliers manifest themselves in scattered stimulus maladjustments rather than in unusable experiments or individual subjects with inability or unwillingness to adjust values properly. For this reason only the most conspicuous outliers have been removed by the variance criterion.

An observation within Fig. 4d is that assessments among AMR-NB stimuli tend to stick close together in the context of a MUSHRA test. This seems to contradict the more distinct assessments from known listening test results [7] but lends itself to a simple explanation: The substantial quality improvement by wideband compresses differences among narrowband stimuli when using the same scale. Compression is intensified by the presence of the reference providing still better quality, hence consuming space on the scale as well. It is also possible that the ACR/DCR effect contributes to compression. The resulting compressed differences may then be too small to be resolved by the statistical accuracy of this test. On the other hand, compression simply supports the finding that in good channels, bandwidth is more important than coding rate.

Fig. 2a and Fig. 2b, supported by Fig. 4a, show no quality advantage of the 8-PSK modulated channel at 23.85 kbps source bit rate over a GMSK modulated channel at 12.65 kbps. Yet the advantage of 8-PSK modulation lies in providing AMR-WB on half-rate channels and in providing better quality at 23.85 kbps for more complex sounds, such as music on hold.

VII. CONCLUSIONS

The speech quality of AMR-WB has been determined from the mobile phone users' perspective by listening tests and has been compared to AMR-NB and G.711 A-law representing ISDN quality. Tests were based on the MUSHRA method, presenting test subjects with several stimuli to be assessed simultaneously. A modification to the original MUSHRA concept was applied in that non-experienced listeners participated. Variable test conditions for measuring the quality were radio channel distortion, background noise, and pre-filtering. Up to 80 subjects were used per experiment.

The results show a substantial advantage in speech quality for wideband coded speech compared to narrowband. For both clean and noisy speech, AMR-WB was perceived to be of considerably higher quality than AMR-NB and G.711, which were rated to be of approximately equal quality. For error-free channels, the quality difference corresponds to a full step on the five-grade scale ranging from bad to excellent.

Consistent with MUSHRA methodology, a visible reference (original sound), a hidden reference, and anchors in each experiment provided a reference grid covering the grading scale. Many of the test subjects did not rate the hidden reference at 100, its average value being even lower for sounds with background noise. This indicates that subjects employed an Absolute Category Rating approach rather than a Degradation Category Rating as required by MUSHRA instructions. This is attributed to the fact that subjects were recruited in the street and hence could be expected to be less attentive than the experienced listeners required for a genuine MUSHRA setup.
Though subjects use different scales, relative grading is preserved. On the other hand, this effect contributes significantly to the confidence interval size. Because the range of signal qualities presented simultaneously covers original, wideband, and narrowband quality, differences among narrowband stimuli that are more distinct in known studies, e.g. for AMR-NB [7], seem compressed or even leveled. A change of wideband pre-filtering characteristics in favor of an unrestricted low-frequency range did not show any significant impact on user judgment.

Speech quality for error-prone radio conditions was tested using a typical urban, single-interferer, frequency-hopping channel setup. The quality of wideband speech down to a C/I of 7 to 4 dB for full-rate GMSK and 8-PSK channels was assessed to be higher than narrowband speech at error-free conditions. In the case of the half-rate 8-PSK channel, the turn-over point ranges between 10 and 7 dB. These ranges already represent poor radio conditions. Considering an error-prone channel also for AMR-NB, the turn-over point is likely even lower. For radio conditions above the turn-over point, speech bandwidth is more important than coded-speech bit rate. In the presence of background noise this effect persists, and is possibly enhanced, confirming the outstanding advantage of AMR-WB in comparison with AMR-NB and G.711.

REFERENCES
[1] W. C. Chu, Speech Coding Algorithms. Wiley & Sons, 2003.
[2] J. Eberspächer, H.-J. Vögel, C. Bettstetter, GSM Switching, Services and Protocols. Wiley & Sons, 2001.
[3] T. Halonen, J. Romero, and J. Melero, GSM, GPRS and EDGE Performance. Wiley & Sons, 2002.
[4] H. Holma, A. Toskala, WCDMA for UMTS. Wiley & Sons, 2004.
[5] 3GPP TS 26.103 v6.1.0 (2004-12), Speech codec list for GSM and UMTS (Release 6).
[6] 3GPP TS 45.003 v6.7.0 (2005-01), Radio Access Network; Channel coding (Release 6).
[7] 3GPP TR 26.975 v6.0.0 (2004-12), Performance characterization of the Adaptive Multi-Rate (AMR) speech codec (Release 6).
[8] 3GPP TS 26.171 v6.0.0 (2004-12), Adaptive Multi-Rate Wideband (AMR-WB) speech codec; General description (Release 6).
[9] I. Varga, R. Drogo De Iacovo, P. Usai, Standardization of the AMR Wideband Speech Codec in 3GPP and ITU-T, IEEE Communications Magazine, May 2006.
[10] ITU-T G.722.2, Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB), 2003.
[11] ITU-T G.711, Pulse code modulation (PCM) of voice frequencies, 1988.
[12] 3GPP TR 26.976 v6.0.0 (2004-12), Performance characterization of the Adaptive Multi-Rate Wideband (AMR-WB) speech codec (Release 6).
[13] ITU-T P.800, Methods for subjective determination of transmission quality, 1996.
[14] ITU-T P.862, Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, 2001.
[15] A. J. Mason, The MUSHRA audio subjective test method, BBC Research & Development, White Paper WHP 038, 2002.
[16] ITU-R BS.1534-1, Method for the subjective assessment of intermediate quality level of coding systems, 2001-2003.
[17] 3GPP TSG-SA4, Tdoc AHAUC-033, Verification of fixed-point implementation of AMR-WB+, 2005.
[18] 3GPP TS 26.131 v6.0.0 (2004-09), Terminal acoustic characteristics for telephony; Requirements (Release 6).
[19] 3GPP TS 26.073 v6.0.0 (2004-12), ANSI-C code for the Adaptive Multi-Rate (AMR) speech codec (Release 6).
[20] 3GPP TS 26.173 v6.0.0 (2004-12), ANSI-C code for the Adaptive Multi-Rate Wideband (AMR-WB) speech codec (Release 6).
[21] 3GPP TS 28.062 v6.2.0 (2005-12), Inband Tandem Free Operation (TFO) of speech codecs (Release 6).
[22] ITU-T G.191, Software tools for speech and audio coding standardization, 2000.

The authors would like to thank Dr. I. Varga (Siemens AG, Munich) for support in design and review, Dr. T. Fingscheidt (Siemens AG, Munich) and Mr. M. Elliott for review, and psychonomics AG, Cologne, for execution of the experiments.
