
Journal of VLSI Signal Processing 35, 43–60, 2003


© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

A JPEG Chip for Image Compression and Decompression

SUNG-HSIEN SUN AND SHIE-JUE LEE


Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan

Received April 18, 2001; Revised December 14, 2001; Accepted January 4, 2002

Abstract. JPEG is an international standard for still-image compression/decompression and has been widely
implemented in hardware. In this paper, we describe the development of a JPEG chip which employs a single-
chip implementation and an efficient architecture of Huffman codec. Firstly, we use VHDL (VHSIC Hardware
Description Language) to describe the behavior of the chip. Each functional block of the chip is defined and
simulated. An architecture consisting of two RAMs is adopted to reduce the size of the Huffman tables. Then we
verify the functionality of our design with field programmable gate arrays (FPGAs) on circuit boards. Finally, a
single chip is implemented using the standard cell design approach with a 0.6 µm triple-metal process. The chip
is compliant with the JPEG baseline system and can work in real time at any compression ratio. The chip contains
411,745 transistors, with a chip size of 6.6 × 6.9 mm².

Keywords: image compression/decompression, VLSI chip, CAD tools, VHDL, FPGA, standard cell design

1. Introduction

Digital devices have become more and more popular than analog ones because of the amazing improvement in die size and speed of the integrated circuits. Digital storage media is free from distortion and is more reliable than analog devices. However, digitized images (video or sound) may occupy an extremely large amount of storage without compression. Furthermore, communication with uncompressed data may increase the load of networks and require a lot of transmission time. Therefore, compression/decompression techniques are important for both data storage and communication.

There exist several standards for image compression/decompression, e.g., JPEG (Joint Photographic Experts Group) [1], MPEG (Motion Picture coding Experts Group) [2], and H.261 [3]. JPEG is a compression/decompression technique for still images [4]. In this paper, we describe the design and implementation of a JPEG chip which complies with the baseline sequential mode of JPEG. For compression, the chip encodes the pixel values of a colored image into codewords, while for decompression, a set of compressed codewords are converted into colored pixel values.

JPEG can be implemented in software. However, for real-time applications, hardware implementation has been widely adopted [5–8]. A chip set consisting of two chips was designed [5] to perform the baseline compression and decompression in real time without any constraint on the minimum compression ratio. In this design, DC and AC Huffman codes are divided into 10 groups and the total size required for the Huffman tables takes about 33 K bits. Because of using two chips, additional wirings are required. Besides, data redundancy occurs in the Huffman tables, making them unnecessarily large. Tree-based architectures of Huffman codec were proposed in [6–8]. In [8], the Huffman tree is mapped onto memory with an area-efficient solution. However, the performance of the method was analyzed with the assumption of 50% compression ratio which is not practical in real-world applications.

We present the design and implementation of a JPEG chip compliant with the JPEG baseline system. This work was motivated by the higher cost-effectiveness

of a single chip which requires no additional wirings in the circuit board. We have also removed data redundancies occurring in [5] and reduced the size of the Huffman tables with a modified RAM architecture. The chip was implemented with VLSI CAD tools. Firstly, VHDL (VHSIC Hardware Description Language) codes were written to describe the architecture and behavior of the chip. Each block of the chip was defined and simulated. Then the functionality of the design was verified with field programmable gate arrays (FPGAs) on circuit boards. Finally, a single chip was implemented using the standard cell design approach with the 0.6 µm triple-metal process. The chip can compress and decompress CCIR601 images, with resolution of 720 × 480 pixels, at a rate of 30 images per second without any restriction on the compression ratio. The chip contains 411,745 transistors, with a chip size of 6.6 × 6.9 mm².

A digital image can be regarded as a two-dimensional array of pixels. For gray-level images, each pixel is a value between 0 and 255 (i.e., represented by 8 bits). For colored images, each pixel contains three values, e.g., RGB or YUV, with each value lying between 0 and 255. In JPEG, values of the same category are processed separately from values of other categories. For example, Y values of a colored image are processed separately from U values and V values. For convenience, we treat each pixel as a value represented by 8 bits in the rest of the paper.

The rest of the paper is organized as follows. In Section 2, we provide a brief introduction of the JPEG baseline system. Section 3 describes the JPEG hardware implementation. Major modules, DCT/IDCT, quantizer/de-quantizer, zig-zag, and Huffman codec, are presented in order. In Section 4, we describe the RAM architecture of the Huffman tables. Section 5 describes the implementation in FPGAs and shows some experimental results. The implementation in ASIC is described in Section 6 and our conclusion is given in Section 7.

2. Overview of JPEG Baseline System

Figure 1 shows the major components of the JPEG compression system. The decompression system is essentially the same with data flowing in the opposite direction and with each function replaced by its inverse. Four operations are involved in the compression process: DCT (Discrete Cosine Transform), quantization, zig-zag, and Huffman coding. The signed pixel data of a picture are grouped into 8 × 8 blocks and each block is transformed by DCT into 8 × 8 = 64 values called DCT coefficients. The upper-left corner in an 8 × 8 block of the DCT coefficients is the DC coefficient and the other 63 values are AC coefficients. The 64 coefficients are then quantized using corresponding values from a quantization table. After quantization, the 64 quantized coefficients are converted into a one-dimensional sequence by the zig-zag operation. Finally, the coefficients are encoded by Huffman coding and codewords are obtained by looking up DC and AC tables. For decompression, the Huffman decoder decodes the given compressed data. Then the data are organized into 8 × 8 two-dimensional blocks. After dequantization the data in each block are transformed to a set of 8 × 8 pixel values by the Inverse DCT (IDCT).

DCT transforms a picture in spatial domain into another in frequency domain. As mentioned, each pixel in an original image is assumed to represent a value between 0 and 255. Before DCT, level shift is done by subtracting 128 from each pixel, making a pixel value range from −128 to 127. Then the image is partitioned into 8 × 8 blocks and these blocks are processed one by one left to right and top to bottom. The DCT for a block is defined as follows:

$$S(u,v) = \frac{C(u)\,C(v)}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} s(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16} \tag{1}$$

Figure 1. JPEG compression system. (Block diagram: image data → offset (−128) → 8 × 8 DCT → quantization → zig-zag → Huffman coder → compressed data.)
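As an illustration only, the level shift and the block DCT of Eq. (1) can be evaluated directly in software. The sketch below is a floating-point reference in Python, not the chip's datapath; the function names `c`, `level_shift`, and `dct_8x8` are chosen here for illustration, and `c(k)` is the usual DCT scale factor (1/√2 for k = 0, otherwise 1).

```python
import math

def c(k):
    # Scale factor C(k): 1/sqrt(2) for k = 0, otherwise 1.
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def level_shift(pixels):
    # Map 8-bit pixel values 0..255 to the signed range -128..127.
    return [[p - 128 for p in row] for row in pixels]

def dct_8x8(block):
    """Direct evaluation of Eq. (1) on an 8x8 block of level-shifted samples."""
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            acc = 0.0
            for x in range(8):
                for y in range(8):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = c(u) * c(v) / 4.0 * acc
    return out

# A flat block has only a DC component: 8 * (200 - 128) = 576.
flat = [[200] * 8 for _ in range(8)]
coeffs = dct_8x8(level_shift(flat))
```

For the flat block, every AC coefficient vanishes up to rounding error, which is the behavior the text relies on when it says that blocks with little variation yield small high-frequency coefficients.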



where s(x, y) is the pixel value, S(u, v) is the DCT coefficient, and

$$C(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{when } k = 0; \\ 1 & \text{when } k \neq 0. \end{cases} \tag{2}$$

When a block is processed by DCT, high-frequency coefficients appear at the lower-right corner of the block while low-frequency coefficients appear at the upper-left corner.

For quantization, the DCT coefficients obtained from the DCT module are divided by the values defined in the quantization table which contains 8 × 8 entries, i.e.,

$$S_q(u,v) = \frac{S(u,v)}{q(u,v)} = S(u,v) \times \frac{1}{q(u,v)} \tag{3}$$

where S_q(u, v) is the quantized coefficient of S(u, v) and 1/q(u, v) is the corresponding quantizing value stored at position (u, v) of the quantization table. If the divisor is large, the bit-rate will be low but the quality of the reconstructed image will be bad; and vice versa. Therefore, the user may select from different quantization tables to achieve a desired trade-off between bit-rate and quality of reconstruction.

For a block with a small variation of pixel values, the high-frequency coefficients obtained by DCT tend to be small. Furthermore, the quantization operation performed previously intentionally makes the high-frequency DCT coefficients smaller, namely, coefficients in lower frequencies are divided by smaller integers while higher ones are divided by larger integers. Therefore, the lower triangular part of a coefficient block tends to contain many zero entries. The zig-zag operation, shown in Fig. 2, places the DCT coefficients of a block in a sequence from low to high frequencies. As a result, long sections of successive zeros are more likely to occur in the tail of the sequence, which is good for Huffman coding.

     0   1   5   6  14  15  27  28
     2   4   7  13  16  26  29  42
     3   8  12  17  25  30  41  43
     9  11  18  24  31  40  44  53
    10  19  23  32  39  45  52  54
    20  22  33  38  46  51  55  60
    21  34  37  47  50  56  59  61
    35  36  48  49  57  58  62  63

Figure 2. Sequence obtained by zig-zag.

Huffman coding is a variable-length coding method. Its idea is to use fewer bits to represent a symbol which appears more frequently and more bits to represent a symbol which appears less frequently. Huffman coder receives the sequences obtained from the zig-zag module and generates one block of codewords for each such sequence. A block of codewords consists of one DC codeword and one or more AC codewords, as shown in Fig. 3. Note that an end-of-block (EOB) mark is inserted at the end of each block. In this way, a block of 8 × 8 pixel values is turned into a block of codewords and the effect of compression is thus achieved. Huffman decoder does the reverse of Huffman coder, i.e., it receives blocks of codewords and generates blocks of corresponding DCT coefficients.

3. JPEG Implementation

The JPEG system is partitioned into four modules: DCT/IDCT, quantizer/dequantizer, zig-zag, and Huffman codec. The DCT coefficients of a block of pixel values are obtained by the cascade of two one-dimensional (1-D) DCT processors. The quantizer/dequantizer is implemented by the radix-4 modified Booth’s algorithm. The zig-zag operation is performed by a dual-buffering mechanism. Finally, the Huffman codec employs an efficient architecture for Huffman decoding.

3.1. DCT/IDCT

The DCT of Eq. (1) can be rewritten as

$$S(u,v) = \frac{C(u)}{2} \sum_{x=0}^{7} \left[ \frac{C(v)}{2} \sum_{y=0}^{7} s(x,y) \cos\frac{(2y+1)v\pi}{16} \right] \cos\frac{(2x+1)u\pi}{16} \tag{4}$$

Figure 3. Codewords obtained from Huffman coder. (Diagram: Block 1 consists of a DC codeword followed by AC codewords, each made up of a Huffman code and an amplitude; the last AC codeword is followed by EOB, after which Block 2 begins with its own DC codeword.)
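The zig-zag ordering of Fig. 2 can be generated programmatically rather than stored by hand. The following Python sketch walks the anti-diagonals of the block in alternating directions; `zigzag_order` and `zigzag_scan` are illustrative names, and the code describes only the ordering, not the dual-RAM hardware that performs it on the chip.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                         # even diagonals run from bottom-left to top-right
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order

def zigzag_scan(block):
    """Flatten a square block into a one-dimensional sequence in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

# Rebuild the index matrix of Fig. 2: entry (r, c) holds its position in the sequence.
index = [[0] * 8 for _ in range(8)]
for k, (r, c) in enumerate(zigzag_order()):
    index[r][c] = k
```

Scanning the matrix `index` itself with `zigzag_scan` returns 0, 1, ..., 63, which is a quick consistency check on the ordering.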

    
which can be performed by applying the cascade of two 1-D DCT processors in vertical and horizontal directions [9, 10]. The 1-D DCT is defined as follows:

$$S(w) = \frac{C(w)}{2} \sum_{t=0}^{7} s(t) \cos\frac{(2t+1)w\pi}{16} \tag{5}$$

where C(w) has the same value as given in Eq. (2). Equation (5) can be written in the following matrix form:

$$\begin{bmatrix} S(0) \\ S(2) \\ S(4) \\ S(6) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \begin{bmatrix} s(0)+s(7) \\ s(1)+s(6) \\ s(2)+s(5) \\ s(3)+s(4) \end{bmatrix} \tag{6}$$

$$\begin{bmatrix} S(1) \\ S(3) \\ S(5) \\ S(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \begin{bmatrix} s(0)-s(7) \\ s(1)-s(6) \\ s(2)-s(5) \\ s(3)-s(4) \end{bmatrix} \tag{7}$$

where A = cos(π/4), B = cos(π/8), C = sin(π/8), D = cos(π/16), E = cos(3π/16), F = sin(3π/16), and G = sin(π/16).

Figure 4 shows the architecture of the DCT module. An 8 × 8 block is processed column by column by the first 1-D DCT processor. The intermediate results are placed in the transposition memory. Then the DCT values are obtained row by row by the second 1-D DCT processor. The architecture of the 1-D DCT processors, implemented in distributed arithmetic [11], is shown in Fig. 5. The first stage, FREG, in a 1-D DCT consists of 8 parallel-in-serial-out registers. The eight pixels in

Figure 4. Architecture of the DCT/IDCT module. (Block diagram of the 2-D DCT/IDCT: data in → 1-D DCT/IDCT → transposition memory → 1-D DCT/IDCT → data out. The diagram is labeled with bus widths 8/12, 10, 10, and 12/8.)
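The even/odd split of Eqs. (6) and (7) can be checked numerically against the direct sum of Eq. (5). The Python sketch below is such a check under two stated assumptions: the overall factor 1/2 of Eq. (5) is applied to both matrix products, and the last entry of the S(7) row is taken as −D, which is what Eq. (5) yields for t = 3. The names `dct_1d_direct` and `dct_1d_butterfly` are illustrative, and the code models the arithmetic only, not the distributed-arithmetic ROM lookup.

```python
import math

A = math.cos(math.pi / 4)
B, C = math.cos(math.pi / 8), math.sin(math.pi / 8)
D, G = math.cos(math.pi / 16), math.sin(math.pi / 16)
E, F = math.cos(3 * math.pi / 16), math.sin(3 * math.pi / 16)

# Rows produce S(0), S(2), S(4), S(6) from the sums s(t) + s(7 - t).
EVEN = [[A, A, A, A], [B, C, -C, -B], [A, -A, -A, A], [C, -B, B, -C]]
# Rows produce S(1), S(3), S(5), S(7) from the differences s(t) - s(7 - t).
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]

def dct_1d_direct(s):
    """Eq. (5) evaluated term by term."""
    out = []
    for w in range(8):
        cw = 1.0 / math.sqrt(2.0) if w == 0 else 1.0
        out.append(cw / 2.0 * sum(s[t] * math.cos((2 * t + 1) * w * math.pi / 16)
                                  for t in range(8)))
    return out

def dct_1d_butterfly(s):
    """Same transform via the add/subtract stage followed by two 4x4 products."""
    sums = [s[t] + s[7 - t] for t in range(4)]
    diffs = [s[t] - s[7 - t] for t in range(4)]
    out = [0.0] * 8
    for i in range(4):
        out[2 * i] = 0.5 * sum(m * a for m, a in zip(EVEN[i], sums))
        out[2 * i + 1] = 0.5 * sum(m * a for m, a in zip(ODD[i], diffs))
    return out

sample = [52, -13, 7, 100, -128, 127, 0, 31]
direct = dct_1d_direct(sample)
fast = dct_1d_butterfly(sample)
```

The split halves the number of multiplications per output, which is the role the ADDSUB stage plays ahead of the ROM-and-accumulator stage.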



Figure 5. Architecture of 1-D DCT/IDCT. (Block diagram: data in (12 bits) → FREG → ADDSUB and butterfly → RAC → ADDSUB and butterfly → BREG → data out (12 bits). Adjacent stages are linked by eight 2-bit paths.)

a column of an 8 × 8 block are fed into the registers, and shifted out serially by 2 bits at a time to the next stage, called ADDSUB, which is responsible for the additions and subtractions appearing in the right-hand sides of Eqs. (6) and (7). The next stage, called RAC (ROMs and Accumulators) and shown in Fig. 6, consists of eight ROMs and eight corresponding accumulators. Note that all the combinations of the constants and pixel values are stored in ROMs. RAC receives eight pixel values and obtains eight DCT coefficients by looking up ROM tables. The second ADDSUB acts as a butterfly connection between RAC and BREG which consists of 8 serial-in-parallel-out registers.

For IDCT, the first ADDSUB acts as a butterfly connection between FREG and RAC, and the second ADDSUB is responsible for the required additions and subtractions.

Figure 6. Architecture of RAC. (Block diagram: the 8-bit input addresses two coefficient ROMs, 4 bits each; their 14-bit outputs feed a CSA and an adder together with a register whose content is shifted 2 bits to the right; the output is 12 bits wide.)

3.2. Quantizer/Dequantizer

We adopted the radix-4 modified Booth’s algorithm [12] to speed up the multiplication operation involved in quantization and dequantization. The algorithm, as shown in Table 1, takes care of two bits at a time. Note that in this table, DCT coefficients, denoted by x’s, are multipliers and quantizing values, Q, are multiplicands. The architecture of the quantizer is shown in Fig. 7 which includes five adders and five

Table 1. The radix-4 modified Booth’s algorithm.

    x_i   x_(i−1)   x_(i−2)   Operation   Comments
    0     0         0         +0          String of zeros
    0     1         0         +Q          A single 1
    1     0         0         −2Q         Beginning of 1’s
    1     1         0         −Q          Beginning of 1’s
    0     0         1         +Q          End of 1’s
    0     1         1         +2Q         End of 1’s
    1     0         1         −Q          A single 0
    1     1         1         +0          A string of 1’s
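The recoding rule of Table 1 amounts to mapping each overlapping bit triple (x_i, x_(i−1), x_(i−2)) to the digit −2·x_i + x_(i−1) + x_(i−2), which selects one of 0, ±Q, ±2Q. The Python sketch below applies that rule to a 12-bit two's-complement multiplier and sums the weighted partial products. It is an integer model written for illustration; the names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the fixed-point scaling of the quantizing values used on the chip is not modeled.

```python
def booth_radix4_digits(x, bits=12):
    """Recode a two's-complement integer into radix-4 Booth digits in {-2, -1, 0, 1, 2}.

    Digit j is derived from the bit triple (x[2j+1], x[2j], x[2j-1]), with x[-1] = 0,
    and carries the weight 4**j.
    """
    u = x & ((1 << bits) - 1)        # two's-complement bit pattern of x
    prev = 0                         # x[-1], the implicit bit to the right of the LSB
    digits = []
    for j in range(0, bits, 2):
        lo = (u >> j) & 1
        hi = (u >> (j + 1)) & 1
        digits.append(-2 * hi + lo + prev)
        prev = hi
    return digits

def booth_multiply(x, q, bits=12):
    """Multiply x by q using one partial product (0, +-q, or +-2q) per Booth digit."""
    return sum(d * q * 4 ** j for j, d in enumerate(booth_radix4_digits(x, bits)))

digits_of_minus_five = booth_radix4_digits(-5)
```

A 12-bit multiplier yields six digits, matching the six selected values described for the quantizer datapath.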

Figure 7. Architecture of the quantizer. (Block diagram: in pipeline stage 1, a 12-bit DCT coefficient and the value Q from the quantization table are latched in registers. The bit triples x1x0x−1, x3x2x1, x5x4x3, x7x6x5, x9x8x7, and x11x10x9 each select one of 0, +Q, −Q, +2Q, −2Q. The selected values are combined by adders, registers, and shifts (>>2, >>4, >>8) across pipeline stages 2 to 4 to give the 12-bit quantized coefficient.)

pipeline registers. In the first clock, a DCT coefficient of 12 bits and the corresponding quantizing value from the quantization table are latched in the pipeline registers of stage 1. In the second clock, six values are selected from the set {+0, +Q, −Q, +2Q, −2Q} by x1x0x−1, x3x2x1, x5x4x3, x7x6x5, x9x8x7, x11x10x9, respectively, and are passed through a set of shift-and-add structures. The results are then stored in the pipeline registers of stage 2. In the third clock, values from the pipeline registers of stage 2 are processed in a similar manner and the results are stored in the pipeline registers of stage 3. After one more shift-and-add operation, the quantized coefficient is stored in the pipeline register of stage 4. For dequantization, the input to Fig. 7 is a quantized coefficient and the table is replaced with the dequantization table. Note that each element of the dequantization table is the inverse of the corresponding element of the quantization table.

3.3. Zig-Zag

The zig-zag module consists of two RAMs and an address control, as shown in Fig. 8. The two RAMs, RAM1 and RAM2, work in double-buffering mode. When the content of RAM1 is read out in the zig-zag order, RAM2 is being loaded with another block of 64 DCT coefficients. Then RAM1 and RAM2 are switched, namely, RAM2 is read out and RAM1 is loaded. The switching of the two RAMs is controlled by a multiplexer.

3.4. Huffman Codec

DC coefficients and AC coefficients are encoded separately. DC coefficients are not encoded directly. Instead, the difference of the DC coefficient of the present block from that of the previous block is used for encoding. To obtain AC codewords for a sequence, the 63 AC coefficients are interpreted into runs of zeros each of which ends with a non-zero coefficient. Then each run of zeros and its following nonzero coefficient are used for encoding. A DC codeword is derived from two parts, size and amplitude, and each AC codeword is derived from run-length/size and amplitude. Amplitude indicates the difference of the underlying DC coefficient from the previous DC coefficient, or the nonzero

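Figure 8. Architecture of the zig-zag module. (Block diagram: a read/write address control supplies the 6-bit addresses Addr1 and Addr2 to RAM1 (64 × 12) and RAM2 (64 × 12); the 12-bit data in is written to the RAMs, and a multiplexer selects Dout1 or Dout2 as the 12-bit data out.)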

AC coefficient following a run of zeros. Size indicates the number of bits required for representing the amplitude in one’s complement form. The relationship between size and amplitude is shown in Table 2. Note that for DC coding, size can be of up to 11 bits, while for AC coding, up to 10 bits. Run-length indicates the number of zeros in a run of zeros. Table 3 shows all the possible combinations allowed for run-length (in horizontal direction) and size (in vertical direction) for AC coding. Note that when the run-length of a run is greater than 15, two or more codes are required for this run. The ZRL mark in Table 3 indicates a run of 15 zeros followed by a zero AC coefficient (i.e., 16 consecutive zeros), and the EOB mark is used to end a block, as mentioned earlier.

When size, amplitude, and run-length are available, we are ready for coding. To obtain the codeword for a DC difference, D, we use the size of D to obtain a Huffman code, CODE1, by looking up the DC Huffman table, as shown in Table 4(a). Let CODE2 be the amplitude of D represented in one’s complement form. Then the codeword for D is the concatenation of CODE1 and CODE2. For example, let D be 6. The size for 6 is 3. Then CODE1 is 100 obtained from Table 4(a), and CODE2 is 110. Therefore, the codeword for D is 100110. For the encoding of a run of zeros followed by a nonzero AC coefficient, A, we use the run-length/size combination to obtain a Huffman code, CODE1, by looking up the AC Huffman table, a small part of which is shown in Table 4(b). Let CODE2 be the amplitude of A represented in one’s complement form. Then the codeword for this run of zeros and the following AC coefficient is the concatenation of CODE1 and CODE2. For example, suppose we want to find the codeword

Table 2. Correspondence between size and amplitude.

    Size   Amplitude
    0      0
    1      −1, 1
    2      −3, −2, 2, 3
    3      −7 ∼ −4, 4 ∼ 7
    4      −15 ∼ −8, 8 ∼ 15
    5      −31 ∼ −16, 16 ∼ 31
    6      −63 ∼ −32, 32 ∼ 63
    7      −127 ∼ −64, 64 ∼ 127
    8      −255 ∼ −128, 128 ∼ 255
    9      −511 ∼ −256, 256 ∼ 511
    10     −1023 ∼ −512, 512 ∼ 1023
    11     −2047 ∼ −1024, 1024 ∼ 2047

Table 3. Possible combinations of run-length and size for AC coding. (Columns: run-length; rows: size.)

    Size   0     1     2     ···   9     10    11    ···   14    15
    0      EOB   N/A   N/A   ···   N/A   N/A   N/A   ···   N/A   ZRL
    1      01    11    21    ···   91    A1    B1    ···   E1    F1
    2      02    12    22    ···   92    A2    B2    ···   E2    F2
    3      03    13    23    ···   93    A3    B3    ···   E3    F3
    4      04    14    24    ···   94    A4    B4    ···   E4    F4
    5      05    15    25    ···   95    A5    B5    ···   E5    F5
    6      06    16    26    ···   96    A6    B6    ···   E6    F6
    7      07    17    27    ···   97    A7    B7    ···   E7    F7
    8      08    18    28    ···   98    A8    B8    ···   E8    F8
    9      09    19    29    ···   99    A9    B9    ···   E9    F9
    10     0A    1A    2A    ···   9A    AA    BA    ···   EA    FA
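The size and amplitude rules of Table 2, and the DC codeword construction described in the text, can be sketched in a few lines of Python. The helper names `size_of`, `amplitude_bits`, and `encode_dc` are illustrative; the dictionary `DC_CODES` copies the DC Huffman table referred to in the text as Table 4(a). Negative amplitudes are written as the one's complement of their magnitude within `size` bits, which is how the sketch interprets "one's complement form".

```python
# DC Huffman codes indexed by size, as listed in Table 4(a).
DC_CODES = {
    0: "00", 1: "010", 2: "011", 3: "100", 4: "101", 5: "110",
    6: "1110", 7: "11110", 8: "111110", 9: "1111110",
    10: "11111110", 11: "111111110",
}

def size_of(value):
    """Number of bits needed for the magnitude of value (the 'size' of Table 2)."""
    return abs(value).bit_length()

def amplitude_bits(value):
    """Amplitude field: the value itself if positive, else the one's complement of |value|."""
    size = size_of(value)
    if size == 0:
        return ""                          # size 0 carries no amplitude bits
    if value < 0:
        value += (1 << size) - 1           # one's complement within 'size' bits
    return format(value, "0{}b".format(size))

def encode_dc(diff):
    """Codeword for a DC difference: Huffman code of the size, then the amplitude bits."""
    return DC_CODES[size_of(diff)] + amplitude_bits(diff)

example = encode_dc(6)   # the worked example in the text: size 3, CODE1 = 100, CODE2 = 110
```

With this convention the leading amplitude bit is 1 for positive values and 0 for negative ones, so a decoder can recover the sign from the first bit alone.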

Table 4. Huffman tables: (a) DC Huffman table; (b) Part of AC Huffman table.

(a)
    Size   Huffman code
    0      00
    1      010
    2      011
    3      100
    4      101
    5      110
    6      1110
    7      11110
    8      111110
    9      1111110
    10     11111110
    11     111111110

(b)
    Run-length/Size   Huffman code
    ···               ···
    3/1               111010
    3/2               111110111
    3/3               111111110101
    3/4               1111111110001111
    3/5               1111111110010000
    3/6               1111111110010001
    3/7               1111111110010010
    3/8               1111111110010011
    3/9               1111111110010100
    3/10              1111111110010101
    ···               ···

for the sequence 0003 with A = 3. The run-length is 3 and the size of A is 2. Then CODE1 is 111110111 obtained from Table 4(b), and CODE2 is 11. Therefore, the codeword for this sequence is 11111011111.

For decoding, each block of codewords is converted back to a block of DCT coefficients by looking up the Huffman tables. The decoding process is totally the reverse of the coding process and the description of it is omitted here.

The overall architecture of the Huffman codec module is shown in Fig. 9. In this figure, zero-run detector, size detector, combiner, and barrel shift-out are involved in the coding process, while barrel shift-in, leading ones detector, and reconstructed data generator are involved in the decoding process. The two components, address MUX and Huffman tables, are involved in both encoding and decoding.

3.4.1. Encoding Path. The zero-run detector is used to count the number of successive zeros in a section of an input block. It is equipped with a zero-run counter. The entries of a block are fed into the zero-run detector one by one. If the input entry is zero, the zero-run counter increments by one. If the input entry is nonzero, it is sent out to the size detector and the zero-run counter is reset to zero. The size detector determines the size of the input value.

For coding DC coefficients, address MUX outputs size to form the address for the Huffman tables. However, for coding AC coefficients, address MUX outputs both run-length and size to form the address for the Huffman tables. A detailed description about addressing for encoding will be given in the next section. The output from the Huffman tables is a unique Huffman code which is concatenated with amplitude in combiner to form a codeword for each input.

Obviously, codewords obtained are variable in length. However, the width of the data bus is fixed. The

Figure 9. The Huffman codec module. (Block diagram of the Huffman coder/decoder. Encoding path: DCT coefficients → zero-run detector (run-len) and size detector (size) → address MUX → Huffman tables → Huffman code → combiner → barrel shift-out → codewords. Decoding path: codewords → barrel shift-in → leading ones detector (no. of leading ones) → address MUX → Huffman tables → run-len and size → reconstructed data generator → reconstructed DCT coefficients.)
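The behavior of the zero-run detector and the formation of an AC codeword can be mimicked in software as follows. This Python sketch is illustrative only: `run_length_symbols` and `encode_ac` are hypothetical names, `AC_CODES_EXCERPT` holds just three entries taken from the AC Huffman table excerpt, and EOB is emitted only when trailing zeros remain, which follows the JPEG baseline convention and is an assumption about the chip.

```python
# Three entries of the AC Huffman table, keyed by (run-length, size).
AC_CODES_EXCERPT = {(3, 1): "111010", (3, 2): "111110111", (3, 3): "111111110101"}

def size_of(value):
    return abs(value).bit_length()

def amplitude_bits(value):
    size = size_of(value)
    if value < 0:
        value += (1 << size) - 1           # one's complement of the magnitude
    return format(value, "0{}b".format(size)) if size else ""

def run_length_symbols(ac):
    """Convert the AC coefficients of one block into (run, value) pairs, 'ZRL', and 'EOB'."""
    symbols, run = [], 0
    # Position of the last nonzero coefficient; the zeros after it are covered by EOB.
    last = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    for v in ac[:last + 1]:
        if v == 0:
            run += 1
            if run == 16:                  # 16 consecutive zeros are sent as one ZRL
                symbols.append("ZRL")
                run = 0
        else:
            symbols.append((run, v))       # a run of zeros ended by a nonzero value
            run = 0
    if last < len(ac) - 1:
        symbols.append("EOB")
    return symbols

def encode_ac(run, value):
    """CODE1 (Huffman code of run-length/size) followed by CODE2 (amplitude bits)."""
    return AC_CODES_EXCERPT[(run, size_of(value))] + amplitude_bits(value)

symbols = run_length_symbols([0, 0, 0, 3] + [0] * 59)
codeword = encode_ac(3, 3)
```

For the sequence 0003 discussed in the text, `symbols` starts with the pair (3, 3) and `codeword` reproduces the codeword 11111011111.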

Figure 10. Architecture of the barrel shift-out module. (Block diagram with the labels: Length (6 bits), Data in (32 bits), Length register, Code register, Mux, Barrel shifter with 5-bit control signals, Register 1 (MSB) and Register 2 (LSB) forming a 64-bit pair, and Data out (32 bits).)

barrel shift-out module is thus used to send out such variable-length codewords onto a fixed-width data bus. Its architecture is shown in Fig. 10. The code register in this figure stores one codeword each time. The barrel shifter concatenates the content of register 1 with that of the code register and the result is latched in register 1 and register 2. When register 1 is full, its content is sent out onto the data bus. Then the content of register 2 is concatenated with the new code in the code register through the barrel shifter, and so on. With this arrangement, codewords of variable length can be lined up on a fixed-width data bus.

3.4.2. Decoding Path. When decoding, codewords are fed into the barrel shift-in module and one codeword is decoded each time. The barrel shift-in module is responsible for eliminating the codeword that has been decoded and receiving more new codewords. Its architecture is shown in Fig. 11. Input codewords are fed into Register 1, and meanwhile the original content of Register 1 is shifted into Register 2. The barrel shifter acts like a sliding window which shifts 32 bits out from the concatenation of Register 1 and Register 2.

The leading ones detector in Fig. 9 is used to detect the number of successive ones at the beginning of each codeword. This number, together with the following bits, forms the address lines for the Huffman table. A detailed description about addressing for decoding will be given in the next section. The output of the table is represented in a pair of run-length and size and the decoded DCT coefficients are obtained from the reconstructed data generator.

4. RAM Architecture of Huffman Tables

Huffman tables are required in encoding and decoding. With a simple brute-force method, a RAM of 256 × 16 bits is needed for AC encoding and another RAM of 2¹⁶ × 8 bits is needed for AC decoding. Ruetz et al. [5] proposed an efficient method to reduce such a large demand of RAM storage. We propose an improvement to further reduce the amount of RAM bits for the Huffman tables.

4.1. Ruetz’s Method

The JPEG Huffman code has 13 codewords for the DC coefficients and 163 codewords for the AC coefficients. The codewords for the DC coefficients are divided into 10 groups, with groups 0 to 8 having 0 to 8 leading

Figure 11. Architecture of the barrel shift-in module. (Block diagram with the labels: Data in (32 bits), Register 1 (LSB) and Register 2 (MSB), each 32 bits wide, Barrel shifter with 5-bit control signals and a 5-bit Length input, and Data out (32 bits).)
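The job of the barrel shift-out module, lining variable-length codewords up on a fixed-width bus, has a simple software analogue. The Python class below is a sketch under stated assumptions: `BitPacker`, `put`, and the 8-bit width in the usage lines are illustrative choices (the chip uses a 32-bit bus), and codewords are passed as non-empty strings of '0' and '1' for readability.

```python
class BitPacker:
    """Accumulate variable-length codewords and release fixed-width words."""

    def __init__(self, width=32):
        self.width = width
        self.acc = 0        # pending bits, oldest bit in the most significant position
        self.nbits = 0      # number of pending bits
        self.words = []     # completed fixed-width words, in output order

    def put(self, code):
        """Append one codeword given as a non-empty string of '0'/'1' characters."""
        self.acc = (self.acc << len(code)) | int(code, 2)
        self.nbits += len(code)
        while self.nbits >= self.width:
            self.nbits -= self.width
            self.words.append(self.acc >> self.nbits)    # emit the oldest 'width' bits
            self.acc &= (1 << self.nbits) - 1            # keep only the leftover bits

packer = BitPacker(width=8)
packer.put("100110")          # a 6-bit codeword
packer.put("11111011111")     # an 11-bit codeword
```

After the two calls, 17 bits have been submitted: two complete 8-bit words are available in `packer.words` and one bit is still pending, mirroring how a full register is sent out while the remainder waits for the next codeword.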

ones, and group 9 having at least 9 leading ones. Therefore, 2³ = 8 RAM locations are needed to decode the codewords in each of the 10 groups, requiring a total of 80 locations for a DC decoding table. Similarly, the codewords for the AC coefficients are divided into 10 groups. The maximum possible tail length of the codewords is 8. Since the first bit of the tail is always 0, up to 7 additional trailing bits are required to decode the codewords within each group. Hence 2⁷ = 128 RAM locations are needed to decode the codewords in each of the 10 groups, requiring a total of 1280 locations for an AC decoding table. Since each entry is 12 bits wide and there are two sets of Huffman tables, 2720 × 12 = 32640 RAM bits are required for the decoding tables.

The encoder was designed to utilize the hardware required for the decoder. By grouping the codewords by code length, the encoder tables can be reduced to 12 bits to fit within the decoding tables. Instead of looking up the codeword directly, the length is used to determine the first code value for that group. Adding the first code value and the offset of the codeword within the group produces the desired codeword. With this approach, only an additional RAM of size 56 × 16 = 896 is required. Therefore, a total of 32640 + 896 = 33536 RAM bits are required for the Huffman codec.

4.2. Our Improvement

As in [5], we treat a Huffman code as the concatenation of two parts, ONES and CBITS, which denote the leading ones and the remaining bits (excluding the leading zero), respectively, of the code. We also divide AC codes and DC codes, respectively, into ten groups. Therefore, like [5], we need 2720 entries for decoding tables. However, in Ruetz’s method, each entry is 12 bits wide. We propose an improvement to reduce the width of each entry as follows. Each of the 2720 entries contains two fields, RUN-LEN and SIZE, only. Therefore, each entry is 8 bits wide. The code length C-LEN required for decoding is derived by looking it up in another small table with each entry being 4 bits wide. Furthermore, encoding is done by taking advantage of the decoding tables without additional storage. As a result, we can save about 10 K bits compared with Ruetz’s method for Huffman tables.

We use two RAMs, RAM3 and RAM4, to implement the Huffman tables, as shown in Fig. 12. RAM3 has 2720 entries with 8 bits in each entry, and RAM4 has 376 entries with 4 bits in each entry. For decoding, an address for RAM3 contains three variable fields: Group, SEVEN/THREE, and N, with 4 bits, 7/3 bits, and one bit, respectively. The Group field indicates the group of the leading ones of the underlying Huffman code. The SEVEN/THREE field contains the 7/3 bits behind the zero that follows the leading ones. This works because Huffman coding ensures that none of the codes is a prefix of another code. Of course, multiple entries in RAM3 may store the same content due to this way of addressing. The N field indicates which set of the tables is

Figure 12. Addressing of RAM tables.

    Address format (12 bits, bit 11 down to bit 0)
      Encoding, AC:   000 | SIZE | RUN-LEN | N
      Encoding, DC:   0001011 | SIZE | N
      Decoding, AC:   Group | SEVEN | N
      Decoding, DC:   1010 | Group | THREE | N

    Data format
      Encoding:   EIGHT (bits 7 to 0) and C-LEN (bits 3 to 0)
      Decoding:   RUN-LEN and SIZE (bits 7 to 0) and C-LEN (bits 3 to 0)

    Memory map
      RAM3 (2720 × 8): 12-bit address, 8-bit data; address labels 000 (0), 177 (375), and A9F (2719)
      RAM4 (376 × 4): 9-bit address, 4-bit data; address labels 000 (0) and 177 (375)



used. Each entry in RAM3 contains the information about run-length and size, indicated as RUN-LEN and SIZE, respectively. However, we need to provide the barrel shifter of the barrel shift-in module with the length of CBITS so that the exact number of bits can be shifted. The length of CBITS can be obtained from the code length field, C-LEN, of RAM4 (in fact, C-LEN is one less than the code length, as explained later). When we have RUN-LEN and SIZE from RAM3, we can get the length of CBITS from RAM4 (the length of CBITS is equal to the difference of C-LEN and the number of leading ones of the codeword).

We also use RAM3 and RAM4 for encoding. From Table 4, it is clear that 12 entries and 11 × 16 = 176 entries are required for DC and AC encoding, respectively. Therefore, 188 × 2 = 376 entries in total are needed for two sets of tables. The way of addressing for encoding is shown in the upper part of Fig. 12. An address for RAM3 and RAM4 contains three variable fields: RUN-LEN, SIZE, and N. An entry in RAM3 contains EIGHT, the lower-order eight bits of a Huffman code, and an entry in RAM4 contains C-LEN, one less than the length of a Huffman code. For example, consider the Huffman code 111010 for 3/1 in Table 4(b). RAM3 will include an entry of

Figure 13. Architecture of the Huffman decoder. (Block diagram with the labels: Codewords (32 bits), Barrel shifter, Next code register (16 bits), Leading ones detector, Address MUX, RAM3 (2720 × 8 words), RAM4 (376 × 4 words), Barrel shifter 1 (shift leading ones), Barrel shifter 2 (shift CBITS), Barrel shifter 3, adders, and the registers Tmp_pos, Sum0, Sum1, and Sum1_reg. The outputs RUN-LEN, SIZE, and C-LEN go to the next stage.)



00111010 and RAM4 will include an entry of 0101 (i.e., 5) at address 00000010011N for this Huffman code. Consider another example of the Huffman code 1111111110010000 for 3/5 in Table 4(b). RAM3 will include an entry of 10010000 and RAM4 will include an entry of 1111 (i.e., 15) at address 00001010011N for this Huffman code. Apparently, the total number of RAM bits required is 2720 × 8 + 376 × 4 = 23264 bits, which is 10272 bits fewer than that required in Ruetz’s method.

Figure 14. The JPEG system implemented with FPGAs: (a) Components involved in the system; (b) Connecting the system to a PC.

The Huffman decoder based on these RAM architectures is shown in Fig. 13. In this figure, a sequence of 16 input bits is loaded into the Next Code register through several barrel shifters. We have the barrel shifters operate in parallel with the other parts of the decoder. When the output of the Next Code register passes through the leading ones detector, the barrel shifter in the shift-in module shifts in the input bit stream. When the data pass through the address multiplexer and RAM3, the barrel shifter shifts in a number of input bits with the size equal to the number of leading ones. When the data pass through RAM4, the barrel shifter shifts in a number of input bits with the size equal to the length of the CBITS of the code.

5. Emulation with FPGAs

VHDL was adopted as the high-level language for implementing the JPEG baseline system. VHDL codes were written to describe the architecture and behavior of each component. After the function of the design had been tested successfully with the VHDL functional simulator, the design was synthesized by Synopsys with the Altera FLEX 10K FPGA technology. The whole system was partitioned and fit into two FLEX 10K FPGAs, an EPF10K100 and an EPF10K70. Placement, routing, and programming of the FPGAs were done by ALTERA Maxplus II.

5.1. Construction of Circuit Boards

The DCT/IDCT module and interface circuits are placed in the EPF10K100 FPGA. The transposition RAM of the DCT module is fit into the embedded RAM architecture of the FPGA. The quantization, zig-zag, and codec modules are placed in the EPF10K70 FPGA. The RAMs of quantization and zig-zag modules are fit into the embedded RAM architecture of the FPGA. Huffman tables are implemented in an external static RAM.

The FPGAs are mounted on two circuit boards, as shown in Fig. 14(a), and are connected together by two 50-bit flat cables and an 8-bit download cable. The two flat cables form the path for the signals flowing between the two FPGAs. The download cable is connected to the parallel port of a PC and programming of the 2 cascaded FPGAs is done by the PC, as shown in Fig. 14(b). The whole system communicates with the outside world via the interface circuits.

The estimated propagation delays of the two circuit boards are 189.7 ns and 218 ns, respectively. The clock rate of the ISA bus on a PC is about 8.2 MHz. We divide the clock on the ISA bus by 2 and apply it to our circuit boards. The system works correctly at the clock rate of 4.1 MHz.

5.2. Experimental Results

A program written in the C language is used to monitor the status of the FPGAs and read results from and

Figure 15. LENA: (a) original image; (b) reconstructed image.


A JPEG Chip 57

Figure 16. PEPPERS: (a) original image; (b) reconstructed image.

write data into the circuit boards. The subroutine “im- is a 512 × 512 image with three color components in
port( )” reads data from the specified I/O address and RGB. The image is first compressed and then decom-
“outport( )” writes data to the specified I/O address. pressed. The bit rate is 0.055 bits/pixel and the SNR is
When encoding, an image file of raw data is opened about 36.7 dB. SNR (Signal to Noise Ratio) is defined
for feeding pixel data into the circuit boards and an- as follows:
other file is opened for storing the resulting codewords M N 2
coming from the circuit boards. When decoding, a i=1 j=1 s̄(i, j)
SNR =  M  N (8)
j=1 [s(i, j) − s̄(i, j)]
file of codewords is fed into the circuit boards and 2
i=1
another file is opened for storing the resulting pixel
values. where M and N denote the dimensions of x and y,
Some standard testing images are used as bench- respectively, of the image, s(i, j) denotes pixel values
marks to test the functionality of our design. Figure 15 of the original image, and s̄(i, j) denotes pixel values

Figure 17. BABOON: (a) original image; (b) reconstructed image.


58 Sun and Lee

Figure 18. Floor plan of the JPEG chip.

of the reconstructed image. Obviously, a large SNR the Compass 0.6 µ standard cell library. Placement and
indicates a small distortion of the reconstructed im- routing are done by Cadence.
age from the original image. A low bit-rate means a The core is divided into 4 parts roughly according to
high compression ratio and thus a small transmission the size of the four major modules: DCT/IDCT, quan-
bandwidth is required. Figure 16 is a color image in tization, zig-zag, and Huffman codec. Each RAM used
RGB format. The bit rate is 0.058 bits/pixel and the is then placed within the respective part and around the
SNR is 32.9 dB. Figure 17 is also a color image. The core properly. The floor plan of the chip is shown in
bit rate and the SNR are 0.127 bits/pixel and 27.8 dB, Fig. 18.
respectively. Clock trees and the constraints on placing stan-
dard cells are added before routing. Finally, we got
the layout as shown in Fig. 19. DRC (design rule
6. Chip Implementation check), ERC (electrical rule check), and LVS (lay-
out versus schematic) are performed to check the
A single chip for the JPEG baseline system is con- correctness of the layout. Post-layout timing analy-
structed by standard cells with the 0.6 µ triple-metal sis is done using TimeMill. Power analysis is done by
process. The design is synthesized using Synopsys with PowerMill.

Table 5. Comparison of JPEG chips.

Author                       Size (mm²)                        Complexity            RAM Memory   Clk Rate   Power Consumed

Sun and Lee                  45.54 (0.6 µ)                     411,745 transistors   23.264 K     27 MHz     1000 mW
Ruetz et al. [5]             Two chips: 62.4 and 96.04 (1 µ)   N/A                   33.536 K     30 MHz     N/A
Kovac and Ranganathan [13]   168 (1 µ)                         N/A                   N/A          100 MHz    N/A
Okada et al. [14]            54 (0.6 µ)                        70 K (Gate Count)     17 K         18 MHz     400 mW
Asada et al. [15]            125 (0.5 µ)                       200 K (Gate Count)    38 K         17.5 MHz   500 mW
Hunter et al. [16]           14 (0.35 µ)                       50 K (Gate Count)     11.5 K       30 MHz     840 mW
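As a rough aid for reading Table 5, the sketch below cross-checks the die area in the first row and divides the listed power by the listed clock rate for the four rows that report both. This energy-per-cycle figure is not a metric reported in the paper; it is a hypothetical normalization that ignores differences in process, supply voltage, and function (some designs are encoder-only). The unit of the Asada et al. clock rate is assumed to be MHz.

```python
# (clock rate in MHz, power in mW), copied from Table 5.
# Only rows that list both values are included.
chips = {
    "Sun and Lee": (27.0, 1000.0),
    "Okada et al. [14]": (18.0, 400.0),
    "Asada et al. [15]": (17.5, 500.0),  # unit assumed to be MHz
    "Hunter et al. [16]": (30.0, 840.0),
}


def energy_per_cycle_nj(clock_mhz, power_mw):
    """mW / MHz = 1e-3 W / 1e6 Hz = 1e-9 J, i.e. nanojoules per clock cycle."""
    return power_mw / clock_mhz


# Die area of the presented chip from its stated dimensions (mm).
die_area_mm2 = round(6.6 * 6.9, 2)
print(die_area_mm2)  # 45.54

for name, (clock_mhz, power_mw) in chips.items():
    print(f"{name}: {energy_per_cycle_nj(clock_mhz, power_mw):.1f} nJ/cycle")
```

Because the designs differ in process and in what one clock cycle accomplishes, these numbers are only a coarse normalization and should not be read as a ranking.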

Figure 19. Layout of the JPEG chip.

Simulation results have shown that the chip can operate in real time at any
compression ratio. A pixel or a codeword can be processed within a single
clock cycle and the maximal working frequency achieved is about 27 MHz. The
chip contains 411,745 transistors, with a die size of 6.6 × 6.9 mm².
Power dissipation is about 1 watt at the maximal working frequency. A
comparison between our chip and other JPEG designs [5, 13–16] is given in
Table 5. Note that the work of [5] contains two chips. DCT and zig-zag are
contained in a chip of 62.4 mm², and coding and quantization are contained
in another chip of 96.04 mm². Some of the chips, e.g., [16], are only for
JPEG encoders. Our chip is highly competitive with the other chips. Note
that our focus has been on the development of a single-chip JPEG codec at a
campus lab with limited resources rather than an application-specific
standard product (ASSP) chip design adopted for commercially available
chips [16].

7. Conclusion

We have presented the design and implementation of a single chip for the
JPEG baseline system. The chip is mainly composed of four modules:
DCT/IDCT, quantizer/dequantizer, zig-zag, and Huffman codec. The chip was
designed with modern VLSI CAD tools. VHDL was used to describe the
architecture and behavior of each module. Then the functionality of the
design was verified with field programmable gate arrays (FPGAs) on circuit
boards. Finally, a single chip was implemented using the standard cell
design approach with the 0.6 µ triple-metal process. The resulting chip
contains 411,745 transistors, with a die size of 6.6 × 6.9 mm². The chip
can operate in real time at any compression ratio.

Acknowledgments

The authors would like to thank the anonymous referees for their
constructive comments and suggestions. This work was supported by the
National Science Council under grants NSC-87-2213-E-110-012 and
NSC-88-2218-E-110-008. A preliminary version of this paper appeared in
Proceedings of the 2000 Asia Pacific Conference on Multimedia Technology
and Applications, Kaohsiung, Taiwan, December 2000.

References

1. “JPEG digital compression and coding of continuous-tone still image,” Technical Report Draft ISO 10918, 1991.
2. “Coding of moving pictures and associated audio,” Committee Draft of Standard ISO11172: ISO/MPEG/90/176, 1990.
3. “Video codec for audio visual services at p × 64 k bits per second,” CCITT Recommendation H.261, 1990.
4. K. Sayood, Introduction to Data Compression. San Francisco, CA: Morgan Kaufmann, 2000.
5. P.A. Ruetz, P. Tong, D. Luthi, and P.H. Ang, “A Video-Rate JPEG Chip Set,” Journal of VLSI Signal Processing, vol. 5, 1993, pp. 29–38.
6. H. Park and V.K. Prasanna, “Area efficient VLSI architectures for Huffman coding,” IEEE Transactions on Circuits and Systems, 1993, pp. 568–575.
7. A. Mukherjee, N. Ranganathan, J.W. Flieder, and T. Acharya, “MARVLE: A VLSI Chip for Data Compression Using Tree-Based Codes,” IEEE Transactions on Very Large Scale Integration Systems, 1993, pp. 203–213.
8. Y.S. Lee, J.J. Jong, T.S. Perng, L.C. Hsu, M.Y. Jaw, and C.Y. Lee, “A Memory-Based Architecture for Very-High-Throughput Variable Length Codec Design,” IEEE International Symposium on Circuits and Systems, June 1997, pp. 2096–2099.
9. M. Sun, T. Chen, and A.M. Gottlieb, “VLSI Implementation of a 16 × 16 Discrete Cosine Transform,” IEEE Transactions on Circuits and Systems, vol. 36, 1989, pp. 610–617.
10. S.-F. Hsiao and J.-M. Tseng, “Parallel, Pipelined and Folded Architectures for Computation of 1-D and 2-D DCT in Image and Video Coding,” Journal of VLSI Signal Processing, vol. 28, no. 3, 2001, pp. 205–220.
11. S.A. White, “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, June 1989, pp. 4–17.
12. I. Koren, Computer Arithmetic Algorithms, Prentice-Hall International Editions, 1993.
13. M. Kovac and N. Ranganathan, “JAGUAR: A Fully Pipelined VLSI Architecture for JPEG Image Compression Standard,” Proceedings of the IEEE, vol. 83, no. 2, 1995, pp. 247–258.
14. S. Okada, Y. Matsuda, T. Watanabe, and K. Kondo, “A Single Chip Motion JPEG Codec LSI,” IEEE Transactions on Consumer Electronics, vol. 43, no. 3, 1997, pp. 418–422.
15. S.K. Asada, H. Ohtsubo, T. Fujihira, and T. Imaide, “Development of a Low-Power MPEG1/JPEG Encode/Decode IC,” IEEE Transactions on Consumer Electronics, vol. 43, no. 3, 1997, pp. 639–644.
16. J.K. Hunter, J.V. McCanny, A. Simpson, Y. Hu, and J.G. Doherty, “JPEG Encoder System-on-a-Chip Demonstrator,” in Proceedings of the Thirty-Third Asilomar Conference, vol. 1, 1999, pp. 762–766.

Sung-Hsien Sun was born on October 11, 1971 in Taipei, Taiwan. He received
bachelor's and master's degrees in electrical engineering from National Sun
Yat-Sen University (NSYSU), Kaohsiung, Taiwan, ROC, in 1994 and 1998,
respectively. His main interests include VLSI design for signal processing
and hardware description languages.

Shie-Jue Lee was born at Kin-Men, ROC on August 15, 1955. He received the
B.S.E.E. and M.S.E.E. degrees in 1977 and 1979, respectively, from National
Taiwan University, and the Ph.D. degree from the Department of Computer
Science at the University of North Carolina, Chapel Hill, USA, in 1990.
Dr. Lee joined the faculty of the Department of Electrical Engineering at
National Sun Yat-Sen University, Taiwan, in 1983, and has been a professor
of the department since 1994. His research interests include machine
intelligence, multimedia communications, and chip design.

Dr. Lee served as the acting director and the director of the
Telecommunication Development and Research Center of National Sun Yat-Sen
University during 1997–2000, and the director of the Southern
Telecommunications Research Center, National Science Council, Taiwan, in
1998–1999. He is now professor and chairman of the Electrical Engineering
Department, National Sun Yat-Sen University.

Dr. Lee is a member of IEEE, IEICE, Association for Automated Reasoning,
Chinese Fuzzy Systems Association, Institute of Information and Computing
Machinery, and Taiwanese Association of Artificial Intelligence.
