I. Introduction
In digital mobile systems, various speech coding algorithms are adopted to use channel bandwidth efficiently and to deliver high-quality speech over wireless channels. CELP coding is one of the best algorithms for bit rates between 4 kbps and 16 kbps[3]-[6]. Although this algorithm guarantees good quality at low bit rates by using the analysis-by-synthesis method, it is not easy to implement because of the high complexity of the optimal excitation search performed while synthesizing speech[7][8]. The 13 kbps QCELP is a variable-rate coder with four bit rates ranging from 1 kbps to 13.3 kbps[2]. The coder operates on 20 ms speech frames, corresponding to 160 samples at the 8 kHz sampling frequency. Very high complexity is inevitable since the codebook search is performed 16 times per frame. Besides meeting the performance specifications, low power consumption is another important requirement for the vocoder, since mobile handsets are usually battery powered. We achieved low power consumption through algorithm optimization and an appropriate DSP chip design. In this paper, section II gives the DSP architecture
Figure 1. DSP block diagram

DSPs for mobile applications should have low power consumption and fast operation with a small chip size. Therefore, we adopted a RISC-type instruction set, distributed decoding, alternating program fetch, repeat control, and a dual-bank memory system in our architecture. General-purpose DSPs support CISC-type instruction sets and diverse addressing modes. However, a RISC-type instruction set and limited addressing modes simplify the control logic, which reduces chip size and power consumption. Moreover, with the 24-bit instruction format, the immediate and direct addressing modes can be coded in one instruction word. This feature enables execution of all instructions in one clock cycle.

Our DSP contains a 16K x 24 bit internal program ROM. ROM access time is a bottleneck for the pipeline scheme. Enough precharge time within the pipeline timing constraints is obtained by dividing the program ROM into two separate banks: one bank holds the instruction codes at odd addresses, the other those at even addresses. When an odd PC address is applied, the odd bank operates and the even bank is precharged. This alternating scheme halves the ROM access time.

In this DSP, four function blocks have separate decoding units. The program control block distributes the 24-bit instruction codes to each block. It includes the program ROM, the instruction register, and the program stack. It fetches an instruction code from program memory and then distributes it to the other function blocks. Program control instructions such as repeat, call, and branch are serviced in this block. The DSP has 3 pipeline stages for most instructions; multiply-type instructions are executed in 4 pipeline stages.

The memory block includes the data RAM, the data ROM, address registers, a stack pointer register, and index registers. The main function of this block is the data read/write operation using the address registers and index registers. The DSP has a dual-bank data memory structure that supports loading or storing two 16-bit operands at the same time.

The ALU block consists of two 36-bit accumulators, four general-purpose 16-bit registers, an 18x18 multiplier, and a 36-bit barrel shifter for arithmetic and logic operations. It can handle both single-precision and double-precision ALU instructions together with parallel data move instructions.

The I/O block generates four phase clocks and supplies them to the other functional blocks. The idle instruction disables the main clock generator to reduce power dissipation. This block is also responsible for the operation of the parallel port, the serial port, the external interrupt, and the emulation interrupts.

The major features of the DSP are:
- 25 nsec instruction cycle (40 MIPS)
- MAC operation and 32-bit data load in one cycle
- 2 x 36 bit accumulators
- 16K x 24 bit internal program memory, with support for 64K x 24 bit external memory
- Internal 3K x 16 bit data RAM and 4K x 16 bit data ROM
- Single-cycle exponent evaluation
- One serial port and one parallel port
[Figure 2 diagram. Blocks: high-pass filter; windowing and autocorrelation; rate decision; LSP quantization; interpolation of LSPs for subframes and conversion to A(z); impulse response computation; pitch delay and gain search; rate reduction algorithm; target computation for codebook search; codebook gain quantization; excitation computation.]
Figure 2. 13 kbps QCELP encoder block

The 13 kbps QCELP is a variable rate speech compression algorithm having different bit rates depending on the voice activity. Table 1 shows the bit allocation for each bit rate.

Table 1. Bit allocation of 13 kbps QCELP

Rate        LPC   Pitch   Codebook   Total (bits)   Bit rate
Rate 1      32    44      188        264+2          13.3 kbps
Rate 1/2    32    44      48         124            6.2 kbps
Rate 1/4    32    0       20         52+2           2.7 kbps
Rate 1/8    10    0       6          16+4           1.0 kbps
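The bit rates in Table 1 follow from the bits per frame and the 20 ms frame length. The short Python sketch below reproduces that arithmetic. The names `BIT_ALLOCATION` and `bit_rate_kbps` are illustrative only, and the "+2" and "+4" terms in the Total column are simply added as extra bits, since the text does not say what they carry.

```python
FRAME_MS = 20  # frame duration stated in the paper

# (LPC, pitch, codebook, extra) bits per frame, taken from Table 1.
# "extra" is the "+2" / "+4" term of the Total column; its purpose is not
# explained in the text, so it is only added to the total here.
BIT_ALLOCATION = {
    "1":   (32, 44, 188, 2),
    "1/2": (32, 44, 48, 0),
    "1/4": (32, 0, 20, 2),
    "1/8": (10, 0, 6, 4),
}


def bits_per_frame(rate):
    """Total number of bits sent for one 20 ms frame at the given rate."""
    return sum(BIT_ALLOCATION[rate])


def bit_rate_kbps(rate):
    """Bits per frame divided by the frame duration in ms gives kbps."""
    return bits_per_frame(rate) / FRAME_MS


if __name__ == "__main__":
    for rate in BIT_ALLOCATION:
        print(f"rate {rate}: {bits_per_frame(rate)} bits, "
              f"{bit_rate_kbps(rate):.1f} kbps")
```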
In the encoding process, the LP coefficients are extracted to capture the formant characteristics and are then transformed into line spectrum pair (LSP) frequencies. For rate 1, rate 1/2, and rate 1/4, a 32-bit vector quantizer is used for the LSP frequency quantization, while rate 1/8 uses a 1-bit scalar quantizer.

The rate determination algorithm in QCELP consists of two stages. In the first stage, voice activity detection decides whether the current frame should be encoded at rate 1, rate 1/2, or rate 1/8. If the codec is operating in a reduced average data rate mode and the first stage selected rate 1 or rate 1/2, the second stage classifies the current frame into rate 1, rate 1/2, or rate 1/4 using 7 features such as the zero-crossing rate (ZCR) and the normalized autocorrelation function (NACF).

A speech frame is divided into 4 subframes of 5 ms each (40 samples) for pitch search. Pitch analysis is performed only for rate 1 and rate 1/2. The pitch synthesis filter can be expressed as
$$\frac{1}{P(z)} = \frac{1}{1 - b z^{-L}} .$$

The pitch lag, L, ranges from 17 to 143 and includes half-sample (0.5) fractional lags between 17 and 139. The pitch gain, b, ranges from 0 to 2.0. An analysis-by-synthesis method is used to select the pitch parameters that minimize the weighted error between the input and the synthesized speech. The synthesized speech is the output of the formant synthesis filter excited by the output of the pitch synthesis filter. The weighted synthesis filter is given as

$$H(z) = \frac{1}{A(z)} W(z) = \frac{1}{A(z/\gamma)} .$$
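In the time domain, the pitch synthesis filter 1/P(z) is the recursion y(n) = x(n) + b y(n - L). The following Python sketch illustrates that recursion under simplifying assumptions: only integer lags are handled (the fractional lags mentioned above would need interpolation), and the function name `pitch_synthesis` and its `history` argument are hypothetical, not taken from the paper.

```python
def pitch_synthesis(excitation, b, lag, history=None):
    """Filter `excitation` through 1/P(z) = 1/(1 - b z^-lag).

    `history` holds the `lag` most recent past outputs, oldest first.
    It defaults to silence. Integer lags only (simplifying assumption).
    """
    if history is None:
        history = [0.0] * lag
    if len(history) != lag:
        raise ValueError("history must contain exactly `lag` samples")
    buf = list(history)
    out = []
    for x in excitation:
        # y(n) = x(n) + b * y(n - L); buf always ends with y(n - 1).
        y = x + b * buf[len(buf) - lag]
        buf.append(y)
        out.append(y)
    return out


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 7
    print(pitch_synthesis(impulse, b=0.5, lag=3))
```

Feeding a unit impulse shows the expected behaviour: the pulse reappears every L samples, scaled by b each time.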
utilizing the characteristics of our DSP pipeline structure. The convolution $y_L(n)$ is computed directly for the first delay in the search range; for the other delays in the search range, it is updated using the recursive relationship $y_L(n) = y_{L-1}(n-1) + u(-L)\,h(n)$. Even though the recursive relationship reduces the computational complexity of the convolution remarkably, it is still costly with the general instruction sets supported by commercial DSPs. Implementing the above equation requires the following operations.
    RY2 = *(AY0+) ;
    RX2 = *(AX0) ;
    REPEAT n {
        R0 = R0 + RX2 * RY2 ;
        RY2 = *(AY0+) ;
        RX0 = 0 ;
        *(AX0+) = RX0 ;
        RX0 = *(AX0) ;
    }
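The recursive update of the convolution can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the DSP routine itself: the past excitation is stored in a list `past` so that `past[m]` with a negative index `m` plays the role of u(m), the impulse response `h` is truncated to the subframe length, and every lag is at least as long as the subframe so that only past samples are needed. All function names are hypothetical.

```python
def direct_convolution(past, h, lag):
    """y_L(n) = sum_{i=0..n} h(i) * u(n - i - L), with u(m) = past[m], m < 0."""
    if lag < len(h) or lag > len(past):
        raise ValueError("sketch assumes len(h) <= lag <= len(past)")
    return [sum(h[i] * past[n - i - lag] for i in range(n + 1))
            for n in range(len(h))]


def recursive_update(y_prev, past, h, lag):
    """Derive y_L from y_{L-1}: y_L(n) = y_{L-1}(n-1) + u(-L) * h(n)."""
    u_new = past[-lag]            # the one excitation sample that enters
    y = [u_new * h[0]]            # n = 0 has no predecessor term
    for n in range(1, len(h)):
        y.append(y_prev[n - 1] + u_new * h[n])
    return y


def convolutions_for_lags(past, h, first_lag, last_lag):
    """Full convolution once, then one multiply-add per sample per lag."""
    y = direct_convolution(past, h, first_lag)
    results = {first_lag: y}
    for lag in range(first_lag + 1, last_lag + 1):
        y = recursive_update(y, past, h, lag)
        results[lag] = y
    return results


if __name__ == "__main__":
    past = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
    h = [1, 2, -1, 3]
    for lag, y in convolutions_for_lags(past, h, 4, 12).items():
        print(lag, y, y == direct_convolution(past, h, lag))
```

The inner loop of `recursive_update` is the single multiply-accumulate per output sample that the repeat routine has to carry out, together with the load and store of the running result.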
And the weighted error between the original and the synthesized speech can be represented as

$$\sum_{n=0}^{L_P - 1} \{x(n) - b\,y_L(n)\}^2 ,$$

where $x(n)$ is the target signal and $y_L(n)$ is the convolution of the past excitation, $u_L(n)$, and $h(n)$ with delay L, and can be expressed as

$$y_L(n) = \sum_{i=0}^{L_P - 1} h(i)\,u_L(n - i) .$$

The minimum weighted error is computed by evaluating the minimum of $-2bE_{xy} + b^2 E_{yy}$ over the allowable quantized values of L and b, where $E_{xy}$ and $E_{yy}$ are correlations represented as

$$E_{xy} = \sum_{n=0}^{L_P - 1} x(n)\,y_L(n), \qquad E_{yy} = \sum_{n=0}^{L_P - 1} y_L(n)\,y_L(n) .$$
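Expanding the squared error gives the energy of x(n), which does not depend on L or b, plus the term -2b E_xy + b^2 E_yy, so only that term needs to be evaluated in the search. A minimal Python sketch of the exhaustive search follows. The gain table used in the example (steps of 0.25 from 0 to 2.0) is an assumption made for illustration, since the paper gives only the range of b, and `search_pitch` is a hypothetical name.

```python
def search_pitch(x, y_by_lag, gains):
    """Return the (lag, gain) pair minimising -2*b*Exy + b*b*Eyy.

    x        : target signal for the pitch subframe
    y_by_lag : dict mapping each candidate lag L to its convolution y_L
    gains    : allowed quantized gain values (assumed table)
    """
    if not y_by_lag or not gains:
        raise ValueError("need at least one lag and one gain")
    best = None
    for lag, y in y_by_lag.items():
        e_xy = sum(xn * yn for xn, yn in zip(x, y))   # cross-correlation
        e_yy = sum(yn * yn for yn in y)               # energy of y_L
        for b in gains:
            cost = -2.0 * b * e_xy + b * b * e_yy
            if best is None or cost < best[0]:
                best = (cost, lag, b)
    return best[1], best[2]


if __name__ == "__main__":
    gains = [0.25 * k for k in range(9)]              # 0.0 .. 2.0, assumed
    x = [1.0, 2.0, 3.0, 4.0]
    candidates = {20: [2.0, 4.0, 6.0, 8.0], 21: [1.0, -1.0, 1.0, -1.0]}
    print(search_pitch(x, candidates, gains))
```

Because E_xy and E_yy are computed once per lag, the cost of the search is dominated by producing y_L for every lag, which is why the convolution update matters so much for the overall complexity.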
Each pitch subframe is divided into 4 codebook subframes of 1.25 msec (10 samples). The codebook used at rate 1/2 is exactly the same as the one used for the 8 kbps QCELP at rate 1. For rate 1, however, the codebook is not center-clipped with approximately 80% zeros as in the 8 kbps QCELP, so the computation required for the codebook search increases dramatically. For rate 1 and rate 1/2 frames, the speech codec determines the codebook index and gain for each codebook subframe. For rate 1/4 and rate 1/8 frames, the excitation codebook is not searched; instead, the energy of the excitation signal is coded as a gain parameter by estimating the prediction residual energy. For rate 1/4 frames, the excitation gain parameter is estimated 5 times per frame, while for rate 1/8 frames it is estimated only once.
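The frame structure described so far can be summarised with a little arithmetic. The Python sketch below derives the subframe lengths and the number of codebook searches per frame from the figures given in the text; the constant and function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000          # sampling frequency from the paper
FRAME_SAMPLES = 160            # one 20 ms frame
PITCH_SUBFRAMES = 4            # pitch subframes per frame
CODEBOOK_SUBFRAMES = 4         # codebook subframes per pitch subframe


def frame_layout():
    """Subframe sizes and search counts for a rate 1 or rate 1/2 frame."""
    pitch_len = FRAME_SAMPLES // PITCH_SUBFRAMES
    codebook_len = pitch_len // CODEBOOK_SUBFRAMES
    return {
        "pitch_subframe_samples": pitch_len,
        "pitch_subframe_ms": pitch_len * 1000 / SAMPLE_RATE_HZ,
        "codebook_subframe_samples": codebook_len,
        "codebook_subframe_ms": codebook_len * 1000 / SAMPLE_RATE_HZ,
        "codebook_searches_per_frame": PITCH_SUBFRAMES * CODEBOOK_SUBFRAMES,
    }


if __name__ == "__main__":
    for key, value in frame_layout().items():
        print(key, value)
```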
The convolution result can be obtained with the above repeat routine, which needs five instructions. The routine can be optimized, since most DSPs support parallel ALU and data-move operations; even so, at least three instructions are needed unless the DSP provides special complex instructions. In this paper, we reduce the convolution complexity by constructing the routine with only two MAC instructions, without adding any special instructions. This is possible because our DSP has 4 pipeline stages for multiplication and 3 pipeline stages for other instructions. One cycle in the repeat block corresponds to more than one million instruction cycles. The routine is as follows.
    R1 = R1 - R1 || RX2 = *(AX0), RY2 = *(AY0+) ;
    RX0 = *(AX1) ;
    REPEAT n {
        R0 = R1 + RX2 * RY2 || *(AX1+) = RX0 ;
Tables 2 and 3 show the DSP resource usage and the detailed complexity of each sub-block of the 13 kbps QCELP speech codec, respectively, as our implementation results.

Table 2. Result of implementation of 13 kbps QCELP

Resource       Quantity
Program ROM    11,747 words
Data RAM       2,027 words
Data ROM       2,938 words
MIPS           33.5 MIPS
Table 3. Complexity of 13 kbps QCELP

Function block                              MIPS
Encoder
  Pre-processing (HPF)                      0.43
  Compute LPC                               1.08
  LPC to LSP                                1.36
  Rate decision                             0.03
  LSP vector quantization                   1.59
  LSP to LPC                                0.65
  Compute residual                          0.20
  Rate reduced mode                         0.64
  Pitch search for one subframe             3.02
  Codebook search for one subframe          0.73
  Reconstruction                            0.05
  Pitch*3+(codebk+reconstruct)*3            19.96
  Data packing                              0.27
  Encoder total                             30.02
Decoder total                               3.47
Total                                       33.49
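As a consistency check on Table 3, the per-block encoder figures can be summed and compared with the reported totals and with the 40 MIPS peak implied by the 25 nsec instruction cycle. The sketch below assumes that the MIPS values pair with the function blocks in the order listed; the names are illustrative only.

```python
# Encoder rows of Table 3, paired in listing order (assumed pairing).
ENCODER_BLOCKS_MIPS = [
    ("Pre-processing (HPF)", 0.43),
    ("Compute LPC", 1.08),
    ("LPC to LSP", 1.36),
    ("Rate decision", 0.03),
    ("LSP vector quantization", 1.59),
    ("LSP to LPC", 0.65),
    ("Compute residual", 0.20),
    ("Rate reduced mode", 0.64),
    ("Pitch search for one subframe", 3.02),
    ("Codebook search for one subframe", 0.73),
    ("Reconstruction", 0.05),
    ("Pitch*3+(codebk+reconstruct)*3", 19.96),
    ("Data packing", 0.27),
]
REPORTED_ENCODER_TOTAL = 30.02
REPORTED_DECODER_TOTAL = 3.47
REPORTED_TOTAL = 33.49
PEAK_MIPS = 40.0  # 1 / 25 nsec instruction cycle


def encoder_sum():
    """Sum of the individual encoder rows, rounded to two decimals."""
    return round(sum(mips for _, mips in ENCODER_BLOCKS_MIPS), 2)


def headroom_mips():
    """Processing capacity left after running encoder and decoder."""
    return round(PEAK_MIPS - REPORTED_TOTAL, 2)


if __name__ == "__main__":
    print("sum of encoder rows:", encoder_sum())
    print("reported encoder total:", REPORTED_ENCODER_TOTAL)
    print("encoder + decoder:",
          round(REPORTED_ENCODER_TOTAL + REPORTED_DECODER_TOTAL, 2))
    print("headroom at 40 MIPS:", headroom_mips())
```

The rows add up to within 0.01 MIPS of the reported encoder total, which is consistent with each entry having been rounded to two decimals.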
References
[1] TIA/EIA IS-96, Speech service option standard for wideband spread spectrum digital cellular system, April 1994.
[2] Qualcomm Inc., High rate speech service option for wideband spread spectrum communications systems, Feb. 22, 1995.
[3] R. Salami, C. Laflamme, D. Massaloux, A toll quality 8 kb/s speech codec for the personal communications system (PCS), IEEE Trans. Veh. Technol., Vol. 43, No. 3, pp. 808-816, Aug. 1994.
[4] W. B. Kleijn, P. Kroon, and D. Nahumi, The RCELP Speech-Coding Algorithm, European Transactions on Telecomm., Vol. 5, No. 5, pp. 573-582, Sep.-Oct. 1994.
[5] R. Salami, C. Laflamme, J-P. Adoul, Design and Description of CS-ACELP: A Toll Quality 8 kb/s Speech Coder, IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 116-130, Mar. 1998.
[6] R. Salami, C. Laflamme, B. Bessette, Description of GSM enhanced full rate speech codec, ICC 97, Vol. 2, pp. 725-729, 1997.
[7] A. Dejaco, W. Gardner, P. Jacobs, L. Chong, QCELP: The North American CDMA Digital Cellular Variable Rate Speech Coding Standard, IEEE Workshop on Speech Coding for Telecomm., pp. 5-6, 1993.
[8] J. McDonough, C. Chienchung, P. Hantak, C. Sakamaki, A single chip QCELP vocoder for CDMA digital cellular, CICC 94, pp. 211-214, May 1994.
Table 4 shows the SNR and segSNR measurements for five 3-second speech samples. One set of results comes from the floating-point C code simulation and the other from the optimized assembly code simulation.

Table 4. Measurement of SNR (dB) and segSNR (dB)

Speech sample       Floating-point C code     Optimized assembly code
(no. of frames)     SNR       segSNR          SNR       segSNR
Kch5 (153)          18.53     10.75           19.06     11.13
Hji1 (153)          21.80     18.01           22.35     18.40
Jhs3 (153)          20.88     16.29           21.08     16.41
Hji4 (153)          19.45     18.12           19.85     18.31
Kmy2 (204)          22.61     12.27           22.64     12.47

The comparison of the floating-point C code simulation with the optimized assembly code simulation shows that the speech quality is almost the same. The input signal used for the above measurements is the output of the high-pass filter in the encoder block, and the output signal is the formant-synthesized signal, i.e., the speech signal reconstructed from the extracted parameters.
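For readers who want to reproduce this kind of measurement, the sketch below computes SNR and segmental SNR using their usual definitions: the overall SNR is the ratio of total signal energy to total error energy, and segSNR is the average of per-frame SNRs in dB. The paper does not state its exact procedure, so the 160-sample frame length and the rule of skipping frames with zero signal or zero error energy are assumptions, and the function names are illustrative.

```python
import math


def snr_db(reference, test):
    """Overall SNR in dB between a reference and a reconstructed signal."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    if signal == 0:
        raise ValueError("reference signal has zero energy")
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def seg_snr_db(reference, test, frame_len=160):
    """Mean of per-frame SNRs in dB (frame length is an assumption)."""
    values = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        tst = test[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - t) ** 2 for r, t in zip(ref, tst))
        if signal > 0 and noise > 0:      # skip silent or error-free frames
            values.append(10.0 * math.log10(signal / noise))
    if not values:
        raise ValueError("no frame with measurable SNR")
    return sum(values) / len(values)


if __name__ == "__main__":
    reference = [1.0] * 320
    reconstructed = [0.9] * 160 + [0.99] * 160
    print(f"SNR    = {snr_db(reference, reconstructed):.2f} dB")
    print(f"segSNR = {seg_snr_db(reference, reconstructed):.2f} dB")
```

Because segSNR averages in the log domain, frames with poor reconstruction pull it down more than they pull down the overall SNR, which is why the two columns in Table 4 can differ by several dB for the same sample.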
V. Conclusion
This paper presented 1) an efficiently designed DSP architecture for vocoder algorithm implementations and 2) a real-time implementation of the 13 kbps QCELP algorithm. The 13 kbps QCELP implementation on our DSP requires only 33 MIPS of computation, thanks to assembly code optimization that fully exploits the characteristics of our DSP architecture. Therefore, the proposed vocoder ASIC is more competitive than solutions based on general-purpose DSPs in mobile applications. Moreover, our DSP, with its simple architecture, and the convolution optimization methods described above could be very useful for future implementations of speech codecs for IMT-2000 applications.