
Implementation of 13kbps QCELP Vocoder ASIC

Kyung-Jin Byun*, Minsoo Hahn**, Kyung-Su Kim*


*Semiconductor Technology Division, ETRI, 161 Kajong-dong, Yusong-Gu, Taejeon, 305-350, Korea
**Information and Communications University, Taejeon, 305-438, Korea
Email: kjbyun@etri.re.kr

Abstract
This paper presents an efficient implementation of a 13 kbps QCELP vocoder ASIC that provides the speech compression function used in digital mobile communication. The 13 kbps QCELP algorithm offers better quality than the 8 kbps one, but it requires much more computation. In particular, the pitch and codebook search for speech synthesis dominates the complexity. We propose an optimized routine for the convolution computation that exploits the pipeline structure of the DSP. Our DSP, designed specifically for vocoder applications, is a 16-bit fixed-point processor. We adopt a RISC-type instruction set, distributed decoding, alternating program fetch, a dual-bank memory structure, and repeat loops without loss in order to reduce power consumption and obtain fast operation while keeping the chip size small. Developing the DSP and the QCELP assembly code concurrently enabled us to optimize the assembly code more successfully than would be possible with a general-purpose DSP chip.

I. Introduction
In digital mobile systems, various speech coding algorithms are adopted to use the channel bandwidth efficiently and to communicate with high-quality speech in wireless channel environments. CELP coding is one of the best algorithms for bit rates between 4 kbps and 16 kbps [3]-[6]. Even though this algorithm guarantees good quality at a low bit rate by using the analysis-by-synthesis method, its implementation is not easy because of the high complexity of the optimal excitation search in the process of synthesizing speech [7][8]. The 13 kbps QCELP is a variable-rate coder with four different bit rates ranging from 1 kbps to 13.3 kbps [2]. The coder operates on 20 ms speech frames, corresponding to 160 samples at the 8 kHz sampling frequency. Very high complexity is inevitable since the codebook search is performed 16 times per frame. Besides meeting the performance specifications, low power consumption is another important requirement for the vocoder, since a battery is the usual power source of a mobile handset. The low power consumption of the vocoder has been achieved by algorithm optimization and proper DSP chip design. In this paper, section II gives the DSP architecture suitable for vocoder applications, especially for QCELP. Section III overviews the 13 kbps QCELP algorithm, while section IV discusses the implementation of the QCELP algorithm, including the complexity reduction method. Finally, section V gives some concluding remarks.

II. Design of DSP for vocoder application

We designed a 16-bit fixed-point DSP in order to implement the 13 kbps QCELP algorithm. The DSP consists of four major blocks: the program control, memory, ALU, and I/O blocks, as shown in Figure 1.

[Block diagram: the program control block, I/O block, memory block (two banks, each with 1.5K x 16 data RAM and 2K x 16 data ROM), and ALU block (multiplier, shifters, and adders), connected by the instruction bus, xbus, and ybus.]

Figure 1. DSP block diagram

DSPs for mobile applications should have low power consumption and fast operating capability with small chip size. Therefore, we adopted a RISC-type instruction set, distributed decoding, alternating program fetch, repeat control, and a dual-bank memory system in our architecture. General-purpose DSPs typically support a CISC-type instruction set and diverse addressing modes. In contrast, a RISC-type instruction set with limited addressing modes simplifies the control logic, which reduces chip size and power consumption. Moreover, by adopting a 24-bit instruction format, the immediate and direct addressing modes can be coded in one instruction word. This feature enables execution of all instructions in one clock cycle.

Our DSP contains a 16K x 24-bit internal program ROM. The ROM access time is a bottleneck for the pipeline scheme. Sufficient precharge time within the given pipeline timing constraints is obtained by dividing the program ROM into two separate banks. One bank holds the instruction codes at odd addresses, the other those at even addresses. When an odd PC address is applied, the odd bank operates while the even bank is precharged. This alternating scheme halves the ROM access time.

In this DSP, the four function blocks have separate decoding units. The program control block includes the program ROM, instruction register, and program stack. It fetches a 24-bit instruction code from program memory and then distributes it to the other function blocks. Program control instructions such as repeat, call, and branch are serviced in this block. The DSP has 3 pipeline stages for most instructions; multiply-type instructions are executed in 4 pipeline stages.

The memory block includes the data RAM, data ROM, address registers, stack pointer register, and index registers. The main function of this block is the data read/write operation using the address registers and index registers. The DSP has a dual-bank data memory structure that supports loading or storing two 16-bit operands at the same time.

The ALU block consists of two 36-bit accumulators, four general-purpose 16-bit registers, an 18x18 multiplier, and a 36-bit barrel shifter for arithmetic and logic operations. It can handle both single- and double-precision ALU instructions with parallel data move instructions.

The I/O block generates four phase clocks and supplies them to the other functional blocks. The idle instruction disables the main clock generator to reduce power dissipation. This block is also responsible for the operation of the parallel port, the serial port, the external interrupt, and the emulation interrupts.

The major features of the DSP are:
- 25 ns instruction cycle (40 MIPS)
- MAC operation and 32-bit data load in one cycle
- 2 x 36-bit accumulators
- 16K x 24-bit internal program memory, with support for 64K x 24-bit external memory
- Internal 3K x 16-bit data RAM and 4K x 16-bit data ROM
- Single-cycle exponent evaluation
- One serial port and one parallel port
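The odd/even split of the program ROM can be pictured with a small address-mapping sketch. It is only an illustration, assuming that the least significant PC bit selects the bank and the remaining bits select the word within the bank; the function name and this bit assignment are hypothetical and are not taken from the chip design.

```python
def rom_bank_access(pc):
    """Map a program counter value to (active bank, word index, precharging bank).

    Assumption: bit 0 of the PC selects the bank and the remaining bits
    select the word inside that bank.
    """
    active = "odd" if pc & 1 else "even"
    precharging = "even" if active == "odd" else "odd"
    return active, pc >> 1, precharging


# Sequential fetches alternate between the two banks, so each bank can be
# precharged during the cycle in which the other bank is read.
for pc in range(4):
    print(pc, rom_bank_access(pc))
```

Under this assumed mapping, a 16K-word ROM splits into two banks of 8K words each.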

III. 13 kbps QCELP algorithm

The block diagram of the 13 kbps QCELP speech encoder is shown in Figure 2.

[Block diagram. Frame-level processing of the input s(n): high-pass filter, windowing and autocorrelation (R[ ]), rate decision, rate reduction algorithm, Levinson-Durbin recursion (A(z)), LPC-to-LSP conversion, and LSP quantization (LSP indices). Subframe-level processing: LSP interpolation for subframes, LSP-to-A(z) conversion, impulse response computation (h(n)), target computation for the pitch search (x(n)), pitch delay and gain search (pitch index, gain), target computation for the codebook search (x2(n)), codebook index and gain search (code index), codebook gain quantization, excitation computation, and filter memory update for the next subframe.]

Figure 2. 13 kbps QCELP encoder block

The 13 kbps QCELP is a variable-rate speech compression algorithm whose bit rate depends on the voice activity. Table 1 shows the bit allocation for each bit rate.

Table 1. Bit allocation of 13 kbps QCELP

            Rate 1         Rate 1/2    Rate 1/4      Rate 1/8
LPC         32             32          32            10
Pitch       44             44          0             0
Codebook    188            48          20            6
Total       (264+2) bits   124 bits    (52+2) bits   (16+4) bits
Bit rate    13.3 kbps      6.2 kbps    2.7 kbps      1.0 kbps

In the encoding process, the LP coefficients are extracted to capture the formant characteristics. The coefficients are then transformed into line spectrum pair (LSP) frequencies. For rate 1, rate 1/2, and rate 1/4, a 32-bit vector quantizer is used for the LSP frequency quantization, while rate 1/8 uses a 1-bit scalar quantizer.

The rate determination algorithm in QCELP consists of two stages. In the first stage, voice activity detection decides whether the current frame should be encoded at rate 1, rate 1/2, or rate 1/8. If the codec is operating in a reduced average data rate mode and the first stage selects rate 1 or rate 1/2, the second stage classifies the current frame into rate 1, rate 1/2, or rate 1/4 using 7 features such as the zero-crossing rate (ZCR) and the normalized autocorrelation function (NACF).

A speech frame is divided into 4 subframes of 5 ms each (40 samples) for the pitch search. Pitch analysis is performed only for rate 1 and rate 1/2. The pitch synthesis filter can be expressed as

$$\frac{1}{P(z)} = \frac{1}{1 - b z^{-L}}.$$

The pitch lag $L$ ranges from 17 to 143 and includes half-sample (0.5) fractional lags between 17 and 139. The pitch gain $b$ ranges from 0 to 2.0. An analysis-by-synthesis method is used to select the pitch parameters that minimize the weighted error between the input and the synthesized speech. The synthesized speech is the output of the formant synthesis filter excited by the output of the pitch synthesis filter. The weighted synthesis filter is given as

$$H(z) = \frac{1}{A(z)} W(z) = \frac{1}{A(z/\gamma)},$$

where $\gamma$ is the perceptual weighting factor.
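In the time domain, the pitch synthesis filter $1/P(z) = 1/(1 - b z^{-L})$ corresponds to the recursion e(n) = c(n) + b e(n - L), where c(n) is the filter input. A minimal sketch for integer lags follows; the function name and the zero initial filter memory are assumptions made for illustration, and fractional lags are not handled.

```python
def pitch_synthesis(c, b, lag):
    """Filter c(n) through 1 / (1 - b z^-lag): e(n) = c(n) + b * e(n - lag).

    Assumes an integer lag and zero initial filter memory.
    """
    e = []
    for n, sample in enumerate(c):
        feedback = b * e[n - lag] if n >= lag else 0.0
        e.append(sample + feedback)
    return e


# An impulse reappears every `lag` samples, scaled by b each time.
print(pitch_synthesis([1.0, 0, 0, 0, 0, 0, 0], b=0.5, lag=3))
```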

The weighted error between the original and the synthesized speech can be represented as

$$\sum_{n=0}^{L_P-1} \{x(n) - b\,y_L(n)\}^2,$$

where $x(n)$ is the target signal and $y_L(n)$ is the convolution of the past excitation $u_L(n)$ and $h(n)$ with delay $L$, which can be expressed as

$$y_L(n) = \sum_{i=0}^{L_P-1} h(i)\,u_L(n-i), \quad 16 < L \le 143.$$

The minimum weighted error is found by evaluating the minimum of $-2bE_{xy} + b^2E_{yy}$ over the allowable quantized values of $L$ and $b$, where $E_{xy}$ and $E_{yy}$ are the correlations

$$E_{xy} = \sum_{n=0}^{L_P-1} x(n)\,y_L(n), \qquad E_{yy} = \sum_{n=0}^{L_P-1} y_L(n)\,y_L(n).$$

Each pitch subframe is divided into 4 codebook subframes of 1.25 ms (10 samples). The codebook used at rate 1/2 is exactly the same as the one used by the 8 kbps QCELP at rate 1. For rate 1, however, the codebook is not center-clipped with approximately 80% zeros as in the 8 kbps QCELP, so the computation required for the codebook search increases dramatically. For rate 1 and rate 1/2 frames, the speech codec determines the codebook index and gain for each codebook subframe. For rate 1/4 and rate 1/8 frames, the excitation codebook is not searched; instead, the energy of the excitation signal is coded as a gain parameter by estimating the prediction residual energy. For rate 1/4 frames, the excitation gain parameter is estimated 5 times per frame, while for rate 1/8 frames it is estimated only once.

IV. Implementation of 13 kbps QCELP

In the pitch and codebook search process, the excitation is searched with the analysis-by-synthesis structure. Candidate excitation segments are fed into the synthesis filter, and the one that minimizes a perceptually weighted distortion measure between the original and the synthesized signal is selected. This search is very complex and accounts for a major part of the coder complexity. The complexity of the 13 kbps QCELP is especially high since its number of subframes in the codebook search is twice that of the 8 kbps QCELP. Therefore, we propose an optimization method for the convolution computation that utilizes the characteristics of our DSP pipeline structure.

The convolution $y_L(n)$ is computed for the first delay in the search range, and for the other delays in the search range it is updated using the recursive relationship $y_L(n) = y_{L-1}(n-1) + u(-L)h(n)$. Even though the recursive relationship reduces the computational complexity of the convolution remarkably, the computation is still very complex if we use the general instruction set supported by commercial DSPs. The implementation of the above equation requires the following operations.

    RY2 = *(AY0+) ;
    RX2 = *(AX0) ;
    REPEAT n {
        R0 = R0 + RX2 * RY2 ;
        RY2 = *(AY0+) ;
        RX0 = 0 ;
        *(AX0+) = RX0 ;
        RX0 = *(AX0) ;
    }

The convolution result can be obtained with the above repeat routine, which needs five instructions. The routine can be optimized, since most DSPs support parallel operation of the ALU and data movement. Even so, at least three instructions are needed to implement the routine if the DSP does not support special complex instructions. In this paper, we reduce the convolution complexity by constructing the routine with only two MAC instructions without adding any special instruction. This is possible because our DSP has 4 pipeline stages for multiplication while it has 3 pipeline stages for other instructions. One cycle in the repeat block corresponds to more than one million instruction cycles. The routine is as follows.

    R1 = R1 - R1 || RX2 = *(AX0), RY2 = *(AY0+) ;
    RX0 = *(AX1) ;
    REPEAT n {
        R0 = R1 + RX2 * RY2 || *(AX1+) = RX0 ;
        R0 = R0 + RX2 * RY1 || RX1 = *(AX0+), RY2 = *(AY0+) ;
    }
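The recursive convolution update and the search over the lag L and gain b can be sketched in floating-point Python as follows. This is an illustration of the recursion and of the cost $-2bE_{xy} + b^2E_{yy}$, not the fixed-point DSP routine. The function names, the synthetic data, and the gain grid in steps of 0.25 are assumptions, fractional lags are ignored, and the lags are restricted to values no smaller than the subframe length so that every required excitation sample lies in the past excitation buffer.

```python
import random


def direct_convolution(h, past, lag):
    """y_L(n) = sum_{i=0..n} h(i) * u(n - i - L), where u(-k) = past[-k]."""
    return [
        sum(h[i] * past[n - i - lag] for i in range(n + 1))
        for n in range(len(h))
    ]


def recursive_update(y_prev, h, past, lag):
    """Derive y_L from y_{L-1}: y_L(n) = y_{L-1}(n-1) + u(-L) * h(n)."""
    u = past[-lag]
    return [u * h[0]] + [y_prev[n - 1] + u * h[n] for n in range(1, len(h))]


def pitch_search(x, h, past, lags, gains):
    """Return the (lag, gain) pair minimizing -2*b*E_xy + b^2*E_yy.

    `lags` must be consecutive integers, each at least len(h).
    """
    best = None
    y = direct_convolution(h, past, lags[0])  # full convolution only once
    for k, lag in enumerate(lags):
        if k > 0:
            y = recursive_update(y, h, past, lag)
        e_xy = sum(xv * yv for xv, yv in zip(x, y))
        e_yy = sum(yv * yv for yv in y)
        for b in gains:
            cost = -2.0 * b * e_xy + b * b * e_yy
            if best is None or cost < best[0]:
                best = (cost, lag, b)
    return best[1], best[2]


# Synthetic check: build a target from lag 60 with gain 1.0, then search.
rng = random.Random(0)
h = [0.9 ** n for n in range(40)]            # stand-in impulse response
past = [rng.uniform(-1.0, 1.0) for _ in range(143)]
x = direct_convolution(h, past, 60)
gains = [0.25 * k for k in range(9)]         # hypothetical gain grid, 0 to 2.0
print(pitch_search(x, h, past, list(range(40, 144)), gains))
```

Each recursive update costs one multiply-accumulate per sample, whereas recomputing the convolution directly costs on the order of the subframe length per sample, which is why the inner loop is the natural target for the two-instruction MAC routine.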

Tables 2 and 3 show the DSP resources used and the detailed complexity of each sub-block of the 13 kbps QCELP speech codec, respectively, as our implementation results.

Table 2. Result of implementation of 13 kbps QCELP

Program ROM    11,747 words
Data RAM       2,027 words
Data ROM       2,938 words
MIPS           33.5 MIPS

Table 3. Complexity of 13 kbps QCELP

Function block                            MIPS
Encoder
  Pre-processing (HPF)                    0.43
  Compute LPC                             1.08
  LPC to LSP                              1.36
  Rate decision                           0.03
  LSP vector quantization                 1.59
  LSP to LPC                              0.65
  Compute residual                        0.20
  Rate reduced mode                       0.64
  Pitch search for one subframe           3.02
  Codebook search for one subframe        0.73
  Reconstruction                          0.05
  Pitch*3+(codebk+reconstruct)*3          19.96
  Data packing                            0.27
  Encoder subtotal                        30.02
Decoder total                             3.47
Total                                     33.49
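As a rough cross-check of the real-time margin, the MIPS figures can be converted into instruction cycles per 20 ms frame. The sketch below is plain arithmetic on numbers quoted in this paper (25 ns cycle, 20 ms frame, 33.49 MIPS total); the variable and function names are illustrative, and it assumes that the MIPS figures count instruction cycles per second at the 40 MIPS clock.

```python
FRAME_SECONDS = 0.020   # 20 ms speech frame
PEAK_MIPS = 40.0        # 25 ns instruction cycle
CODEC_MIPS = 33.49      # encoder 30.02 + decoder 3.47 (Table 3)


def cycles_per_frame(mips):
    """Instruction cycles corresponding to `mips` over one 20 ms frame."""
    return mips * 1e6 * FRAME_SECONDS


budget = cycles_per_frame(PEAK_MIPS)
used = cycles_per_frame(CODEC_MIPS)
print(f"budget per frame: {budget:,.0f} cycles")
print(f"codec per frame:  {used:,.0f} cycles ({100 * used / budget:.1f}% load)")
```

Under these assumptions the codec uses about 670,000 of the 800,000 cycles available per frame.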

Table 4 shows the SNR and segSNR measurements for five speech samples of about 3 seconds each. One set of results comes from the floating-point C code simulation and the other from the optimized assembly code simulation.

Table 4. Measurement of SNR (dB) and segSNR (dB)

Speech sample      Floating-point C code     Optimized assembly code
(No. of frames)    SNR       segSNR          SNR       segSNR
Kch5 (153)         18.53     10.75           19.06     11.13
Hji1 (153)         21.80     18.01           22.35     18.40
Jhs3 (153)         20.88     16.29           21.08     16.41
Hji4 (153)         19.45     18.12           19.85     18.31
Kmy2 (204)         22.61     12.27           22.64     12.47

The comparison of the floating-point C code simulation with the optimized assembly code simulation shows that the speech quality is almost the same. The input signal used for the above measurements is the output of the high-pass filter in the encoder block, and the output signal is the formant-synthesized signal, i.e., the speech signal reconstructed from the extracted parameters.

V. Conclusion
This paper presented 1) an efficiently designed DSP architecture for vocoder algorithm implementations and 2) a real-time implementation of the 13 kbps QCELP algorithm. The 13 kbps QCELP implementation on our DSP requires only about 33.5 MIPS of computation, owing to assembly code optimization that fully exploits the characteristics of our DSP architecture. Therefore, the proposed vocoder ASIC is more competitive than implementations based on general-purpose DSPs in the area of mobile applications. Moreover, our DSP with its simple architecture and the convolution optimization method described above could be very useful for future implementations of speech codecs for IMT-2000 applications.

References
[1] TIA/EIA IS-96, Speech service option standard for wideband spread spectrum digital cellular system, April 1994.
[2] Qualcomm Inc., High rate speech service option for wideband spread spectrum communications systems, Feb. 22, 1995.
[3] R. Salami, C. Laflamme, D. Massaloux, A toll quality 8 kb/s speech codec for the personal communications system (PCS), IEEE Trans. Veh. Technol., Vol. 43, No. 3, pp. 808-816, Aug. 1994.
[4] W. B. Kleijn, P. Kroon, and D. Nahumi, The RCELP Speech-Coding Algorithm, European Transactions on Telecomm., Vol. 5, No. 5, pp. 573-582, Sep.-Oct. 1994.
[5] R. Salami, C. Laflamme, J-P. Adoul, Design and Description of CS-ACELP: A Toll Quality 8 kb/s Speech Coder, IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 116-130, Mar. 1998.
[6] R. Salami, C. Laflamme, B. Bessette, Description of GSM enhanced full rate speech codec, ICC 97, Vol. 2, pp. 725-729, 1997.
[7] A. Dejaco, W. Gardner, P. Jacobs, L. Chong, QCELP: The North American CDMA Digital Cellular Variable Rate Speech Coding Standard, IEEE Workshop on Speech Coding for Telecomm., pp. 5-6, 1993.
[8] J. McDonough, C. Chienchung, P. Hantak, C. Sakamaki, A single chip QCELP vocoder for CDMA digital cellular, CICC 94, pp. 211-214, May 1994.