

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 16, NO. 7, JULY 2007

Rate Distortion Optimization for H.264 Interframe Coding: A General Framework and Algorithms
En-Hui Yang, Senior Member, IEEE, and Xiang Yu
Abstract: Rate distortion (RD) optimization for H.264 interframe coding with complete baseline decoding compatibility is investigated on a frame basis. Using soft decision quantization (SDQ) rather than the standard hard decision quantization, we first establish a general framework in which motion estimation, quantization, and entropy coding (in H.264) for the current frame can be jointly designed to minimize a true RD cost given previously coded reference frames. We then propose three RD optimization algorithms: a graph-based algorithm for near optimal SDQ in H.264 baseline encoding given motion estimation and quantization step sizes, an algorithm for near optimal residual coding in H.264 baseline encoding given motion estimation, and an iterative overall algorithm to optimize H.264 baseline encoding for each individual frame given previously coded reference frames, with them embedded in the indicated order. The graph-based algorithm for near optimal SDQ is the core; given motion estimation and quantization step sizes, it is guaranteed to perform optimal SDQ if the weak adjacent block dependency utilized in the context adaptive variable length coding of H.264 is ignored for optimization. The proposed algorithms have been implemented based on the reference encoder JM82 of H.264 with complete compatibility to the baseline profile. Experiments show that for a set of typical video testing sequences, the graph-based algorithm for near optimal SDQ, the algorithm for near optimal residual coding, and the overall algorithm achieve, on average, 6%, 8%, and 12% rate reduction, respectively, at the same PSNR (ranging from 30 to 38 dB) when compared with the RD optimization method implemented in the H.264 reference software.

Index Terms: Fixed-slope lossy compression, H.264 hybrid coding, rate distortion (RD) optimization, soft decision quantization (SDQ).

Fig. 1. Illustration of a hybrid coding structure.

I. INTRODUCTION

The H.264 standard, the newest hybrid video compression standard [2], has proved its superiority in coding efficiency over its predecessors; e.g., it shows a more than 40% rate reduction over H.263 [5]. However, as the enormous volume of video data is constantly fueling the demand for better and better compression [19], [20], it is desirable to study how to further enhance the compression performance in the H.264 standard-compliant coding environment.
Manuscript received December 8, 2006; revised March 1, 2007. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grants RGPIN203035-02 and RGPIN203035-06 and under a Collaborative Research and Development Grant, in part by the Premier's Research Excellence Award, in part by the Canadian Foundation for Innovation, in part by the Ontario Distinguished Researcher Award, and in part by the Canada Research Chairs Program. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tamas Sziranyi. The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1 Canada (e-mail: ehyang@uwaterloo.ca; x23yu@bbcr.uwaterloo.ca). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2007.896685

H.264 utilizes a well-known hybrid structure, as shown in Fig. 1. Specifically, since the quantization part introduces permanent information loss to the video data, the hybrid scheme leads to a lossy compression, whose performance is characterized by the rate distortion (RD) function of the source [1]. The four coding parts all contribute to the RD function, and there is no easy way to quantitatively separate their contributions. Therefore, the fundamental tradeoff in the design of a hybrid video compression system, including H.264, is its overall RD performance, based on which many optimization methods, broadly referred to as RD methods, have been developed and widely used in video compression applications [5], [20]. RD methods for video compression can be classified into two categories. The first category computes the theoretical RD function based on a given statistical model for video data, e.g., [15]-[17]. In general, the challenge in designing a method in the first category is the model mismatch due to the nonstationary nature of video data. The second category uses an operational RD function, which is computed based on the data to be compressed. Here there are two main problems. First, in most operational RD methods, the formulated optimization problem is restricted, and the RD cost is optimized only over motion estimation and quantization step sizes. Second, there is no simple way to solve the restricted optimization problem if the actual RD cost is used. By the actual RD cost, we mean a cost based on the final reconstruction error and the entire coding rate. Because hard decision quantization (HDQ) is used, there is no simple analytic formula to represent the actual RD cost as a function of motion estimation and quantization step sizes, and, hence, a brute force approach with high computational complexity is likely to be needed to solve the restricted optimization problem [20].
For this reason, an approximate RD cost is often used in the restricted optimization problem in many operational RD methods. For example, the optimization of motion estimation in [5] is based on the prediction error instead of the actual distortion, which is the quantization error. This paper proposes an operational RD method using the actual RD cost. The target is RD optimization for hybrid video coding subject to the syntax constraints of the H.264 baseline profile. We first discuss a somewhat hidden parameter to be optimized in addition to prediction mode, reference frame



indexes, motion vectors, and quantization step sizes, and formulate a joint optimization framework. Specifically, using soft decision quantization (SDQ) instead of HDQ, we notice that the quantized residual itself is a free parameter that can be optimized in order to improve compression performance. Through SDQ, entropy coding is brought into the quantization design. The general optimization framework can then be formulated as jointly designing motion estimation, quantization, and entropy coding in the H.264 hybrid video coding structure. Surprisingly, this generality not only improves the compression performance in terms of the RD tradeoff, but also makes the optimization problem tractable, at least algorithmically. Indeed, with respect to the baseline profile of H.264, we propose three RD optimization algorithms: a graph-based algorithm for near optimal SDQ, an algorithm for near optimal residual coding, and an iterative overall algorithm to optimize H.264 baseline profile encoding, with them embedded in the indicated order. The SDQ algorithm is the core. It helps to bring all coding components into the optimization scheme with the actual RD cost being its objective function, enabling us to jointly design them in the hybrid coding structure. The proposed RD optimization algorithms for H.264 video coding are inspired by a fixed-slope universal lossy data compression scheme¹ considered in [7], which was first initiated in [6], and was later extended in [8]. Other related works on practical SDQ include, without limitation, SDQ in JPEG image coding and H.263+ video coding (see [9]-[11], [24], [25], and references therein). In [9] and [10], a partial form of SDQ called rate-distortion optimal thresholding was considered. Recently, Yang and Wang [11] successfully developed an algorithm for optimal SDQ in JPEG image coding to further improve the compression performance of a standard JPEG image codec. Without considering optimization over motion estimation and quantization step sizes, Wen et al.
[24] proposed a trellis-based algorithm for optimal SDQ in H.263+ video coding, which, however, is not applicable to SDQ design in H.264 due to the inherent difference in the entropy coding stages of H.264 and H.263+. In [25], Schumitsch et al. studied interframe optimization of transform coefficient levels² based on a simplified linear model of interframe dependencies. Although the SDQ principle is not new, and this paper is not the first attempt to apply SDQ to practical coding standards either, designing algorithms for optimal or near optimal SDQ in conjunction with a specific entropy coding method is still quite challenging, especially when the involved entropy coding method is complicated. Different entropy coding methods require different algorithms for SDQ. In some cases, for example, SDQ for GIF/PNG coding, where the entropy coding methods are the Lempel-Ziv algorithms [29], [30], the SDQ design problem is still open [31]. Fortunately, in the case of H.264, we are able to tackle the SDQ design issue associated with the context adaptive variable length coding (CAVLC) of H.264 by putting it into the fixed-slope framework. Furthermore, our studies of SDQ within the fixed-slope scheme consequently lead to a new framework for jointly designing all key components in Fig. 1 for hybrid video coding. Application of the proposed framework to the syntax-constrained optimization for H.264 has shown a significant improvement in RD performance.

¹Related to fixed-slope compression are entropy-constrained [14] and conditional entropy-constrained scalar/vector quantization; see [8] and [10] for their differences and similarities.

²Transform coefficient levels are also referred to as quantized transform coefficients.

This paper is organized as follows. In Section II, we review hybrid coding in H.264 and RD optimization methods for video compression in the literature. In Section III, we develop a framework for jointly designing the hybrid coding structure in H.264, with discussions on algorithm designs for residual coding optimization, motion estimation, and the overall joint optimization. Section IV is then dedicated to the core algorithm of SDQ based on CAVLC. Experimental results are presented in Section V, and, finally, conclusions are drawn in Section VI.

II. BACKGROUND

RD optimization of hybrid video coding with H.264 compatibility is subject to the decoding syntax constraints specified in the standard. This section reviews hybrid coding in H.264 and some related RD methods.

A. Hybrid Video Compression in H.264

The motion estimation design in H.264 has been significantly improved over previous standards. It allows various block sizes from 4x4 to 16x16. It also uses a higher prediction accuracy of 1/4-pixel. According to Girod's study [18], this is the highest precision required to achieve the best performance for motion estimation. For the transform part, H.264 uses the well-known discrete cosine transform (DCT) with a block size of 4x4 in its baseline profile. Quantization in H.264 is simply achieved by a scalar quantizer. It is defined by 52 step sizes based on an index parameter QP. The quantization step size q for a given QP is specified as

q = q_{QP \bmod 6} \cdot 2^{\lfloor QP/6 \rfloor}    (1)

where QP \bmod 6 and \lfloor QP/6 \rfloor are the remainder and quotient, respectively, of QP divided by 6, and QP takes integer values from 0 to 51. For the purpose of fast implementation, quantization and transform in H.264 are combined. Specifically, suppose that the decoder receives the quantized transform coefficients \mathbf{u} and the quantization parameter QP for a 4x4 block. Then the de-quantization and inverse transform are performed together as follows:

\hat{\mathbf{z}} = T^{-1}\bigl(\mathbf{u} \odot \mathbf{m}_{QP \bmod 6} \cdot 2^{\lfloor QP/6 \rfloor}\bigr)    (2)

where T^{-1} denotes the inverse transform and the entries of \mathbf{m}_{QP \bmod 6} are scaling constants defined in the decoding syntax of H.264 (see [21] for details). H.264 supports two entropy coding methods for residual coding, i.e., CAVLC [4] and context adaptive binary arithmetic coding (CABAC) [2]. In the baseline profile, only CAVLC is supported. As discussed above, each individual coding part in H.264 has been well designed to achieve good coding performance using state-of-the-art technologies. Optimization of an individual part in H.264 alone will unlikely bring much improvement. Meanwhile, a joint optimal design of the whole encoding structure is possible because the standard only specifies a syntax


Fig. 2. Signal flow of a typical hybrid codec as in H.264.

for the coded bit stream, leaving details of the encoding process open to the designer. In this paper, we propose a joint optimization framework and its algorithm designs for hybrid video coding with complete decoding compatibility to the H.264 baseline profile.

B. Review of Related Rate Distortion Optimization Work

Using the generalized Lagrangian multiplier method [22], Wiegand et al. proposed a simple, effective operational RD method for motion estimation optimization [5], [13]. The mode selection for motion estimation is conducted based on the actual RD cost in a macroblock-by-macroblock manner. For a given prediction mode, motion estimation is optimized based on an operational RD cost, which approximates the actual RD cost, as follows:

\min_{i, v} \; d(\mathbf{x}, p(f, i, v)) + \lambda \cdot (r_v(v) + r_i(i))    (3)

where \mathbf{x} stands for the original image block, p(f, i, v) is the prediction with given prediction mode f, reference index i, and motion vector v, d(\cdot, \cdot) is a distortion measure, r_v(v) is the number of bits for coding v, r_i(i) is the number of bits for coding i, and \lambda is the Lagrangian multiplier. Wen et al. [24] proposed an operational RD method for residual coding optimization in H.263+ using a trellis-based SDQ design. In H.263+, residuals are coded with run-length codes followed by variable length coding (VLC). The VLC in H.263+ is simple and does not introduce any dependency among neighboring coefficients; the dependency mainly comes from the run-length code. Therefore, a trellis structure is used to decouple the dependency so that a dynamic programming algorithm can be used to find the optimal path for quantization decisions. In the baseline profile of H.264, however, context adaptive VLC is used after the run-length coding. The context adaptivity introduces strong dependency among neighboring coefficients; thus, a new design criterion is needed to handle the context adaptivity when designing SDQ for H.264.
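A minimal sketch of the operational cost in (3), assuming the sum of squared differences as the distortion measure; the function and variable names are ours, and the bit counts are toy stand-ins for the real entropy coders:

```python
def rd_motion_cost(x, p, mv_bits, ref_bits, lam):
    """Operational RD cost in the spirit of eq. (3): distortion between the
    original block x and the prediction p, plus lambda times the bits spent
    on the motion vector and the reference index."""
    ssd = sum((a - b) ** 2 for a, b in zip(x, p))  # sum of squared differences
    return ssd + lam * (mv_bits + ref_bits)

def best_candidate(x, candidates, lam):
    """Among (prediction, mv_bits, ref_bits) candidates, pick the cheapest."""
    return min(candidates, key=lambda c: rd_motion_cost(x, c[0], c[1], c[2], lam))
```

A motion search would enumerate candidate vectors per reference frame and keep the cheapest; as lambda grows, candidates that are cheap to code win even at somewhat higher distortion.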
A recent study on SDQ in [25] developed a linear model of interframe dependencies and a simplified rate model to formulate an optimization problem for computing the quantization outputs using a quadratic program. From the problem formulation point of view, our SDQ problem formulation shares the same spirit as that in [25], except that the latter is more ambitious as it targets interframe dependencies. From the algorithm design point of view, [25] gives an optimized determination of transform coefficient levels by considering temporal dependencies, but neglecting other factors such as the specific entropy coding

method, while the graph-based SDQ design to be presented later in this paper provides the optimal SDQ under certain conditions, i.e., prediction is given and CAVLC is used for entropy coding.

III. SYNTAX-CONSTRAINED OPTIMIZATION FRAMEWORK FOR H.264 INTERFRAME COMPRESSION

In this section, we investigate the syntax-constrained optimization problem for H.264 video coding. By exploring all possible optimization variabilities within the H.264 hybrid coding scheme, we first establish a general framework in which motion estimation, quantization, and entropy coding for the current frame, given previously coded reference frames, can be jointly designed to minimize the actual RD cost, and then present three RD optimization algorithms as a solution to the syntax-constrained optimization problem.

A. Problem Formulation

Fig. 2 illustrates the signal flow of a typical hybrid encoder as in H.264. Note that the previously coded frames are assumed known in the frame buffer when we discuss optimization of the current frame. For a given distortion measure d(\cdot, \cdot), the actual reproduction error for coding a whole frame \mathbf{X} is d(\mathbf{X}, \hat{\mathbf{X}}), where \hat{\mathbf{X}} is the reconstruction of \mathbf{X}. Correspondingly, the entire rate for coding \mathbf{X} involves five parts, i.e., the prediction modes F, reference frame indexes I, motion vectors V, quantization step sizes Q, and quantized transform coefficients U. For a given entropy coding method with its rate function r(\cdot), the entire coding rate is r(F, I, V, Q, U). Then, the actual RD cost for coding \mathbf{X} is

J = d(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \cdot r(F, I, V, Q, U)    (4)

where \lambda is a positive constant, which is determined by end users based on both the available bandwidth and the expected video quality. From an RD theoretic point of view, a good coding design is to find a set of encoding and decoding algorithms to minimize the actual RD cost as given in (4). However, in the syntax-constrained optimization scenario, the decoding algorithms have already been selected and fixed. Specifically, for a given 4x4 quantized transform coefficient block \mathbf{u} and the corresponding prediction mode f, reference index i, motion vector v, and quantization step size q [note that (4) is defined for a whole frame while H.264 specifies a block-based coding scheme; for simplicity, however, the subscript is omitted


hereafter when the discussion is focused on the block-based coding syntax], the reconstruction is computed by

\hat{\mathbf{x}} = p(f, i, v) + \hat{\mathbf{z}}    (5)

where \hat{\mathbf{z}} is defined as in (2). Under this constraint, we examine the maximal variability and flexibility an encoder can enjoy before establishing our optimization problem based on the actual RD cost of (4). Conventionally, the constraint of (5) is used to derive a deterministic quantization procedure, i.e.,

u = \mathrm{sign}(c) \cdot \lfloor |c|/q + \delta \rfloor    (6)

which mainly minimizes the quantization distortion. The factor \delta is an offset parameter for adapting the quantization outputs to the source distribution to some extent; e.g., there are empirical studies on determining \delta according to the signal statistics to improve the RD compression efficiency. From the syntax-constrained optimization point of view, however, there is no deterministic relationship between the transform coefficients and their quantized values. Indeed, inspired by the fixed-slope lossy data compression scheme in [7], we see that given the quantization step size, each \mathbf{u} (per block, or equivalently per frame) is itself a free parameter, and one has the flexibility to choose the desired \mathbf{u} to minimize (4). Such a way of determining \mathbf{u} (or equivalently \hat{\mathbf{x}}) is called soft decision quantization. The idea of trading off a little distortion for a better RD performance has already been used partially in the H.264 reference software, however, in an ad hoc way. A whole block of quantized coefficients is discarded under certain conditions, e.g., when there is only one nonzero coefficient taking a value of 1 or -1. This is equivalent to quantizing that coefficient to 0, although a hard decision scalar quantizer outputs 1 or -1. Such practice is well justified by experimental results [2]. To get better compression performance, it is interesting and desirable to study SDQ in a systematic way. The purpose of SDQ is to minimize the actual RD cost by adapting quantization to a specific entropy coding method. Fig. 3 shows the structure of the fixed-slope lossy compression method.
Given a residual block \mathbf{z} and a quantization step size q, RD optimal residual coding is to solve a minimization problem of

\min_{\mathbf{u}} \; d(\mathbf{z}, \hat{\mathbf{z}}) + \lambda \cdot r(\mathbf{u})    (7)

where d(\mathbf{z}, \hat{\mathbf{z}}) is the actual distortion due to quantization error, r(\mathbf{u}) is the total rate for residual coding, and \lambda is a constant, which has an interpretation as the slope of the resulting RD curve. In the case of syntax-constrained optimization, the decoding mapping and the lossless coding algorithms are fixed by the standard, i.e., they accord to CAVLC and the de-quantization of (2). In this case, the problem of (7) reduces to finding \mathbf{u} to minimize the RD cost

\min_{\mathbf{u}} \; d(\mathbf{z}, T^{-1}(\mathbf{u} \odot \mathbf{m}_q)) + \lambda \cdot r(\mathbf{u})    (8)

where q is a given quantization step size, and the minimization in (8) is over all possible quantized values. Such a \mathbf{u} is not achieved, in general, by the hard decision process of (6).
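The effect of (8) on a single coefficient can be sketched as follows; the quadratic distortion, the small search window, and the toy rate function are our assumptions (the true CAVLC rate is context dependent, and computing it is exactly what the graph of Section IV is for):

```python
def sdq_level(c, q, lam, rate_bits, window=2):
    """Soft-decision level for one transform coefficient, in the spirit of
    eq. (8): choose the level u minimizing |c - q*u|^2 + lam * rate_bits(u),
    searching a small window around the hard-decision level of eq. (6)."""
    u_hard = int(round(c / q))  # hard decision with zero offset
    candidates = range(u_hard - window, u_hard + window + 1)
    return min(candidates, key=lambda u: (c - q * u) ** 2 + lam * rate_bits(u))
```

With lam = 0 this degenerates to hard decision quantization; as lam grows, levels that are cheap to code (smaller magnitude, often zero) win despite slightly larger distortion.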

Fig. 3. Universal lossy compression scheme for residual coding.

Having described SDQ, we can now state the complete syntax-constrained optimization problem for H.264 hybrid video coding as follows:

\min_{F, I, V, Q, U} \; d(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \cdot r(F, I, V, Q, U)    (9)

In general, the overall solution to (9) represents the best compression performance an encoder under H.264 syntax constraints can possibly achieve for the current frame given previously encoded frames. The optimization problem (9), together with its solution, gives a general framework in which motion estimation and residual coding for the current frame can be jointly designed to minimize the actual RD cost.

B. Problem Solution

In general, (9) is difficult to solve due to the mutual dependency among F, I, V, Q, and U. To make the problem tractable, we propose an iterative solution, in which motion estimation and residual coding are optimized alternately. Specifically, three RD optimization algorithms are developed as follows.

1) Optimal Soft Decision Quantization: Given (F, I, V, Q), in SDQ, we compute

U = \arg\min_{U} \; d(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \cdot r(F, I, V, Q, U)    (10)

Details of our SDQ design based on H.264 baseline coding are presented in the next section.

2) Residual Coding Optimization: Given (F, I, V), in residual coding optimization, we compute

(Q, U) = \arg\min_{Q, U} \; d(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \cdot r(F, I, V, Q, U)    (11)

Examining the distortion term in (11), we see that it is macroblock-wise additive. As will be discussed in the next section, even though the rate for coding U is not strictly macroblock-wise additive, the adjacent block dependency used in coding is so weak that we can ignore it in our optimization and simply regard that rate as being block-wise additive. Thus, the main difficulty lies in the rate for coding Q, which is produced by a first order predictive coding method [2]. As such, the optimization problem in (11) cannot be solved in a macroblock-by-macroblock manner. To tackle the adjacent macroblock dependency from the coding of Q, we develop a trellis structure with one stage per macroblock and 52 states at each stage. Each stage accords to a macroblock, while each state accords to a quantization step size. States between two neighboring stages are fully connected with each other.
The RD cost for a transition from a state at the previous stage to a state at the current stage can be computed in two parts, i.e., the rate for coding the current quantization step size given the previous one, and the RD cost for coding the current macroblock using the current step size, which is computed using SDQ. The RD cost for each state at the initial stage is equal to the RD cost resulting from encoding the first macroblock using that state's step size and the


corresponding optimal SDQ. Then, dynamic programming can be used to solve (11). Apparently, the above solution is computationally expensive, as it involves running SDQ for each one of the 52 states at each stage and then searching the whole trellis. In practice, however, there is no need for this full-scale dynamic programming, because the part of the RD cost associated with the residuals is much greater than the rate for coding the quantization step sizes. This implies that, very likely, the globally optimal quantization step size for each macroblock will be within a small neighboring region around the best quantization step size obtained when the step-size rate is ignored in the cost, and one can apply dynamic programming to a much reduced trellis with states at each stage limited only to such a small neighborhood. To this end, we first propose the following procedure to find the best step size when its coding rate is ignored.

Step 1) Initialize QP using the following empirical equation proposed in [13] with a given \lambda, together with (1):

QP = 12 + 3 \log_2(\lambda / 0.85)    (12)

Step 2) Compute the quantization outputs by the SDQ algorithm.
Step 3) Fix the quantization outputs. Compute the step size that minimizes the RD cost, which is then rounded to one of the 52 predefined values in H.264.
Step 4) Repeat Steps 2 and 3 until the decrement of the RD cost is less than a prescribed threshold.

Simulations show that (12) makes a good initial point. After one iteration, the obtained step size is quite close to the best quantization step size with its coding rate being ignored. We then select a neighboring region of this step size to build up the trellis at each stage, and, hence, the computational complexity is greatly reduced. Our experiments show that dynamic programming applied to this reduced trellis achieves almost the same performance as that applied to the full trellis.

3) Joint Optimization Algorithm: Based on the algorithm for near optimal residual coding, a joint optimization algorithm for solving (9) is proposed to alternately optimize motion estimation and residual coding as follows.

Step 1) (Motion estimation) For a given residual reconstruction, we compute the prediction modes, reference indexes, and motion vectors by

(F, I, V) = \arg\min_{F, I, V} \; d(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \cdot r(F, I, V, Q, U)    (13)

which is equivalent to (9) for given (Q, U).

Step 2) (Residual coding) For given (F, I, V), the process in Section III-B2 is used to find (Q, U).
Step 3) Repeat Steps 1 and 2 until the decrement of the actual RD cost is less than a given threshold.

We now study the solution to (13), which involves mode selection and motion estimation. In [5], the prediction mode is selected for each macroblock by computing the actual RD cost corresponding to each mode and choosing the one with the minimum. This method of mode selection is also used in this paper. Then, for a pixel block \mathbf{x} with its residual reconstruction \hat{\mathbf{z}} and a given mode f, the reference index and motion vector are computed by

(i, v) = \arg\min_{i, v} \; d(\mathbf{x} - \hat{\mathbf{z}}, p(f, i, v)) + \lambda \cdot (r_v(v) + r_i(i))    (14)

Compare (14) with (3). For a given residual reconstruction, (14) is equivalent to searching for a prediction to match \mathbf{x} - \hat{\mathbf{z}} in (3). Thus, the same search algorithm is used to solve (14) as the one for (3) in [5]. The computational complexity for (14) and (3) is almost the same, since the time for computing \mathbf{x} - \hat{\mathbf{z}} is negligible. For a given \lambda, the joint optimization algorithm starts with a zero residual reconstruction, which is equivalent to using the motion estimation in [5] as a starting point. Experiments show that with this initialization, the algorithm converges very fast; after two iterations, the decrement in the total cost is almost negligible.

C. Comparing the Proposed Scheme With the Conventional One

We first review the conventional optimization framework based on HDQ. With HDQ, the quantization outputs are given by a deterministic function of (F, I, V, Q), as shown by (6) for H.264. Therefore, in the conventional framework, the true RD cost is minimized over (F, I, V, Q), i.e.,

\min_{F, I, V, Q} \; d(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \cdot r(F, I, V, Q, U(F, I, V, Q))    (15)

Comparison between the proposed framework in (9) and the conventional one in (15) reveals two advantages for the proposed framework. First, the minimum RD cost in (9) is no larger than that in (15), since, for given (F, I, V, Q), we can always apply the SDQ of Section III-B1 to reduce the RD cost. Second, the problem of optimizing the true RD cost becomes tractable algorithmically; i.e., as discussed in Section III-B, an iterative solution is easily established to optimize over (F, I, V, Q, U). The solution is at least feasible, although it may not be proved to be globally optimal. On the other hand, with the conventional framework of (15), it is impractical to optimize the true RD cost over (F, I, V, Q), because it would require going through the residual coding procedure to evaluate the cost for all possible (F, I, V, Q). Overall, due to SDQ, the new framework supports a better RD performance and features a feasible solution to minimizing the true RD cost for hybrid video coding.

IV. SOFT DECISION QUANTIZATION ALGORITHM DESIGN

In this section, we present our core graph-based SDQ algorithm for solving the minimization problem given in (10). In general, SDQ is a search in a vector space of quantization outputs for a tradeoff between quality and rate. The efficiency of the search largely depends on how we discover and utilize the structure of the vector space, which features the de-quantization syntax and the entropy coding method of CAVLC. In this paper, we propose to use dynamic programming techniques to do the search, which requires an additive evaluation of the RD cost. In the following, we first show the additive distortion computation in the DCT domain based on the de-quantization syntax reviewed in Section II-A. Second, we design a graph for additive evaluation of the rate based on an analysis of CAVLC, with states being defined according to the level coding and connections being specified according to the run coding. Finally, we discuss the optimality of the graph-based algorithm, showing that the graph design helps to solve the minimization problem of (10).
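Both the reduced trellis of Section III-B2 and the graph search of this section rely on the same dynamic programming idea, sketched here generically; the state sets and cost callbacks below are our stand-ins, not the paper's actual trellis:

```python
def viterbi_min_cost(stage_states, init_cost, trans_cost):
    """Generic Viterbi-style dynamic programming as used by the trellis and
    graph searches in this section: per state, keep only the cheapest path
    reaching it; the cheapest survivor at the last stage is the global
    minimum over all paths."""
    # best[s] = (cost of cheapest path ending in state s, that path)
    best = {s: (init_cost(s), [s]) for s in stage_states[0]}
    for t in range(1, len(stage_states)):
        best = {
            s: min(
                ((c + trans_cost(t, p, s), path + [s])
                 for p, (c, path) in best.items()),
                key=lambda cp: cp[0],
            )
            for s in stage_states[t]
        }
    return min(best.values(), key=lambda cp: cp[0])
```

The cost additivity required here is exactly why the distortion of (16) and the graph's rate evaluation are constructed to be element-wise additive.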


Fig. 4. Graph structure for SDQ based on CAVLC. There are 16 columns according to 16 coefficients. A column consists of multiple state groups, according to different ZL. The left panel shows the connections between these groups. Each group initially contains a set of states defined on the right panel, while eventually only those states that receive valid connections remain.

A. Distortion Computation in the DCT Domain

The distortion term in (10) is additive in the pixel domain. However, it contains the inverse DCT, which is not only time consuming, but also makes the optimization problem intractable. Consider that the DCT is a unitary transform, which maintains the Euclidean distance. We choose the Euclidean distance for d(\cdot, \cdot) so that the distortion can be computed in the transform domain in an additive manner. Specifically, for a given residual block \mathbf{z} with transform coefficients c_{ij}, the distortion is computed as in [3] by

d(\mathbf{z}, \hat{\mathbf{z}}) = \sum_{i,j} (c_{ij} - u_{ij} \cdot q \cdot s_{ij})^2    (16)

where the s_{ij} are scaling constants specified by the standard syntax. This equation brings us two advantages. The first is high efficiency for computing distortion. Note that the c_{ij} are computed before SDQ for a given \mathbf{z}. Thus, the evaluation of (16) consumes only two integer multiplications together with some shifts and additions per coefficient. More importantly, the second advantage is the resulting element-wise additive computation of distortion, which enables us to solve the SDQ problem using the Viterbi algorithm, as presented later.

B. Graph Design for Soft Decision Quantization

While CAVLC is designed for each individual block, the coding of CoeffToken (see [4] for details) introduces certain dependency among neighboring blocks. However, the dependency is very weak. Therefore, in the optimization problem given in (10) for the whole frame, we decouple this weak dependency. In doing so, the optimization of the whole frame can be solved in a block-by-block manner, with each block being 4x4. That is, the optimal quantization output can be determined independently for each block. By omitting the block subscript, the optimization problem given in (10) reduces to

\min_{\mathbf{u}} \; d(\mathbf{z}, \hat{\mathbf{z}}) + \lambda \cdot r(\mathbf{u})    (17)

where r(\mathbf{u}) is the number of bits needed for CAVLC to encode \mathbf{u} given that its two neighboring blocks have been optimized.
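The unitarity argument behind (16) can be checked numerically; here is a toy sketch with a 2-point orthonormal transform standing in for the 4x4 case (H.264's integer transform is unitary only up to scaling, which is why the constants s_{ij} appear in (16)):

```python
import math

# A unitary (orthonormal) transform preserves Euclidean distance, so SSD can
# be evaluated additively on transform coefficients instead of pixels.
# Toy 2-point orthonormal transform (our stand-in for the 4x4 DCT).
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

def transform(x):
    """Apply the orthonormal matrix H to a length-2 vector."""
    return [sum(H[i][j] * x[j] for j in range(2)) for i in range(2)]

def ssd(a, b):
    """Sum of squared differences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

x, y = [3.0, 1.0], [2.0, -1.0]
pixel_ssd = ssd(x, y)                        # distance in the pixel domain
coeff_ssd = ssd(transform(x), transform(y))  # distance in the transform domain
assert math.isclose(pixel_ssd, coeff_ssd)    # both equal 5.0
```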

Apply the result of (16) to (17). The problem becomes

\min_{\mathbf{u}} \; \| \mathbf{c} - \mathbf{u} \odot (q \mathbf{s}) \|^2 + \lambda \cdot r(\mathbf{u})    (18)

Note that every bold symbol here, e.g., \mathbf{u}, represents a 4x4 matrix. For entropy coding, the 4x4 matrix \mathbf{u} will be zig-zag ordered into a 1x16 sequence. To facilitate the following discussion on graph design, we introduce a new notation, i.e., a bar on top of a bold symbol indicates the zig-zag ordered sequence of the corresponding matrix. Then, the equation of (18) is rewritten as follows:

\min_{\bar{\mathbf{u}}} \; \| \bar{\mathbf{c}} - \bar{\mathbf{u}} \odot (q \bar{\mathbf{s}}) \|^2 + \lambda \cdot r(\bar{\mathbf{u}})    (19)

where we still use the symbol \odot to indicate the element-wise multiplication between two vectors. The problem of (19) is equivalent to a search in the vector space of \bar{\mathbf{u}}. We now construct a graph, as shown in Fig. 4, to represent this vector space. In the designed graph, each transition stands for a run-level pair, while each path from the initial state HOS to the end state EOS gives a unique sequence \bar{\mathbf{u}}. Moreover, the graph enables an additive rate evaluation corresponding to CAVLC. In the following, we give more details on how to construct this graph.

1) Definition of States According to CAVLC Level Coding: CAVLC encodes levels based on adaptive contexts, which are used to select VLC tables. These adaptive contexts are represented by different states in the graph of Fig. 4. Let us first examine the trailing-one coding rule (see [4] for details). The trailing ones are a set of levels with three features. First, they must be handled at the beginning of the coding process (note that coding is conducted in reverse order of the zig-zag sequence). Second, they are consecutive. Third, there is a restriction to consider, at most, three of them. To meet these three requirements, we design three types of states. In addition, CAVLC requires knowing the number of trailing ones, i.e., T1s, both at the beginning of the coding process (where T1s is transmitted) and at the


Fig. 5. Left panel: States and connections defined according to the trailing-one coding rule of CAVLC. HOS is a dummy state, indicating the start of encoding. Right panel: States and connections defined according to the level coding process of CAVLC.

point where the level coding table is initialized. As such, we define six states, Tn3H, Tn2H, Tn1H, Tn2T, Tn1T, and Tn1TH, as shown in the left panel of Fig. 5, where TnjH represents that the current coefficient is the first trailing one, TnjT represents that it is the jth trailing one, and Tn1TH represents that it is the second trailing one with the level coding table initialized there. Hereafter, these states are also referred to as T-states.

More states are defined based on features for coding levels other than trailing ones. The important factors for coding these levels are the seven coding tables and the table selection criteria. Specifically, denote the seven tables as Vlc(0)-Vlc(6), with a corresponding threshold for table selection assigned to each. The threshold of Vlc(0) is such that the encoder always switches away from it, while the threshold of Vlc(6) is beyond the range of any possible output, meaning that once Vlc(6) is selected, it is used until the end of the current block. Otherwise, the coding table is switched from one table to the next when the current level is greater than the threshold of the current table. Therefore, each coding table except Vlc(6) needs two states in order to determine the context for choosing the coding table for the next level according to the current level. As shown in the right panel of Fig. 5, 13 states are defined in this way. These states are referred to as V-states.

2) Definition of State Groups According to Run Coding: Now we examine the run coding process of CAVLC and explain why and how states are clustered into groups. The context for choosing a table to code runs depends on the parameter ZerosLeft (the number of zeros left to code), which involves future states in the graph structure. To build this dependency into the definition of states, we define a state group for each value of ZerosLeft. As shown in Fig. 4, a state group initially consists of all T-states and V-states, and each column of coefficients contains one group per possible value of ZerosLeft.
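The zig-zag ordering and the CAVLC context quantities discussed here (TotalCoeffs, TrailingOnes, TotalZeros) can be sketched in a few lines of Python. The 4×4 scan order below is the standard H.264 one; the function names are ours, not the paper's:

```python
# Standard H.264 zig-zag scan order for a 4x4 block, as (row, col) pairs.
ZIGZAG_4x4 = [(0,0),(0,1),(1,0),(2,0),(1,1),(0,2),(0,3),(1,2),
              (2,1),(3,0),(3,1),(2,2),(1,3),(2,3),(3,2),(3,3)]

def zigzag_scan(block):
    """Reorder a 4x4 matrix (list of 4 rows) into a 1x16 sequence."""
    return [block[r][c] for r, c in ZIGZAG_4x4]

def cavlc_stats(seq):
    """Return (TotalCoeffs, TrailingOnes, TotalZeros) of a scanned sequence.

    TrailingOnes counts the consecutive +/-1 levels at the end of the
    nonzero coefficients (in reverse scan order), capped at 3; TotalZeros
    counts the zeros before the last nonzero coefficient.
    """
    nz = [i for i, v in enumerate(seq) if v != 0]
    if not nz:
        return 0, 0, 0
    trailing = 0
    for i in reversed(nz):
        if abs(seq[i]) == 1 and trailing < 3:
            trailing += 1
        else:
            break
    total_zeros = nz[-1] + 1 - len(nz)
    return len(nz), trailing, total_zeros
```

For the classic example sequence 0, 3, 0, 1, -1, -1, 0, 1 (in scan order), this yields TotalCoeffs = 5, TrailingOnes = 3, and TotalZeros = 3.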
Besides the run coding table selection, the formation of state groups according to ZerosLeft provides two other advantages. First, it naturally leads to knowledge of TotalZeros for every path in the graph.
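Since each path in the graph is a sequence of run-level pairs, TotalZeros for a path is simply the sum of the run values along it. A one-line sketch (the pair representation is ours):

```python
def total_zeros_of_path(run_level_pairs):
    """TotalZeros for a path given as [(run, level), ...], where each run
    counts the zeros preceding that nonzero level in scan order."""
    return sum(run for run, _ in run_level_pairs)
```

For the run-level pairs (1, 3), (1, 1), (0, -1), (0, -1), (1, 1) of the earlier example sequence, this gives 3.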

Second, it enables us to include the coding rate of CoeffToken in the optimization process by providing the number of nonzero coefficients (referred to as TotalCoeffs hereafter). In addition, TotalCoeffs is also used to initialize the level coding table.

3) Connecting States to Build Up a Graph: Connections from one column to another are now established in two steps. The first is to connect state groups, and the second is to further clarify connections between states in two connected groups. Specifically, HOS is connected to all groups, while a group in a given column is connected to EOS only if its ZerosLeft value is consistent with that column. Moreover, two groups in different columns are connected if and only if their ZerosLeft values are consistent with the run implied between the two columns. The outcome of this rule is illustrated in Fig. 4.

Now we discuss connections between two groups. First, connections between T-states are defined by the rules shown in the left panel of Fig. 5. Second, connections between V-states are established by two rules, as illustrated in the right panel of Fig. 5, each V-state being connected to the two states prescribed by the table selection criterion. Third, we utilize the level coding table initialization rule to set up the other necessary connections, including those from the initial state HOS and those to the end state EOS: 1) connections from HOS to T-states, with HOS connected to Tn3H, Tn2H, or Tn1H in the columns allowed by the number of trailing ones; 2) connections from HOS to V-states in a group, for the case where there are no trailing ones; 3) connections from Tn1H to V-states in a group; 4) connections from Tn1TH to V-states in a group; and 5) connections from Tn1T to the corresponding V-states.
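The search over the resulting column-structured graph (detailed in Section C below) is a shortest-path computation with additive metrics. A generic Python sketch of such a search, with toy states and costs standing in for the actual CAVLC metrics of (20)-(22):

```python
import math

def viterbi_min_rd(columns, cost):
    """Minimum-cost path over a column-structured graph with additive metrics.

    columns: list of lists of state labels (first column holds the start
    state, last column the end state); cost(u, v): transition metric,
    math.inf if u and v are not connected. Assumes at least one path exists.
    Returns (total cost, list of states on the optimal path).
    """
    best = {s: (0.0, [s]) for s in columns[0]}
    for col in columns[1:]:
        nxt = {}
        for v in col:
            # Extend the cheapest surviving path into state v.
            c, path = min(
                ((acc + cost(u, v), p) for u, (acc, p) in best.items()),
                key=lambda t: t[0],
            )
            if c < math.inf:
                nxt[v] = (c, path + [v])
        best = nxt
    return min(best.values(), key=lambda t: t[0])
```

With states HOS, a, b, EOS and costs HOS-a = 1, HOS-b = 2, a-EOS = 5, b-EOS = 1, the search returns the path HOS, b, EOS with total cost 3.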
Eventually, while each group initially contains 19 states as shown in Fig. 4, only those states that receive valid connections remain. The graph ends at a dummy state EOS.

4) Metric Assignment: In general, because the output of a V-state can be any integer within a given range, there exist multiple transitions, called parallel transitions, for a connection to a V-state: one transition per value in the output range of the destination state, each according to a unique output. Now, we assign metrics to three types of transitions, i.e., a transition starting from HOS, a transition ending at EOS, and a transition from a state in one column to another state

in the next column.

The metric for a transition from HOS to a state in a given column is defined in (20), where the first term is the distortion for quantizing the coefficients skipped at the start of encoding to zero, the last two terms accord to the RD cost for the quantization output at that position, and the scaling factor is the corresponding element of the constant vector in (18). The metric for a transition from a state in one column to a state in a later column is defined in (21), where the first term computes the distortion for quantizing the skipped coefficients to zero, the second term is the rate cost for coding the run, given by the run coding table at the originating state, and the last two terms are the RD cost for the quantization output, with its rate determined by the level coding table at the originating state. Finally, for a transition from a state to EOS, the RD cost is given in (22), which accords to the distortion for quantizing all remaining coefficients to zero.

C. Algorithm, Optimality, and Complexity

With the above metric assignments, the problem of (19) can be solved by running dynamic programming over the graph of Fig. 4. In other words, the optimal path resulting from dynamic programming applied to the graph gives rise to an optimal solution to (19), as shown in the following theorem.

Theorem: Given a 4×4 residual block, applying dynamic programming to search the proposed graph gives the optimal solution to the SDQ problem of (19).

The proof of the theorem is sketched as follows. For a given input sequence, any possible sequence of quantization outputs accords to a path in the proposed graph, and vice versa. Define a metric for each transition in the graph by (20)-(22). A careful examination of the details of CAVLC shows that the accumulated metric along any path equals the RD cost in (19) evaluated for the corresponding output sequence. Thus, when dynamic programming, e.g., the Viterbi algorithm, is applied to find the path with the minimum RD cost, the obtained path gives the quantization output sequence that solves (19).

The complexity of the proposed graph-based SDQ algorithm (i.e., dynamic programming applied to the graph of Fig. 4) mainly depends on three factors: the number of columns (16), the number of states in each column, and the number of parallel transitions for each connection. Expansion of the graph into a full graph reveals that the number of states varies from 17 to 171. With states selectively connected, the major computational cost is to handle the parallel transitions. For a connection from a state in one column to a state in another column, the number of parallel transitions equals the number of possible quantization outputs at the destination state. From (20) and (21), it follows that the only difference among the RD costs assigned to these parallel transitions lies in the costs arising from different quantization outputs. Studies on CAVLC show that the rate variation due to different outputs is insignificant compared to the quadratic distortion. This implies that, very likely, the quantization output for the optimal transition is within a small neighborhood around the hard-decision quantization output, which minimizes the quadratic distortion. Thus, the number of parallel transitions to be examined in practice can be much smaller. Our experiments show that it is sufficient to compare as few as four parallel transitions around the hard-decision output; hence, the complexity is reduced to a fairly low level.

V. EXPERIMENTAL RESULTS

Experiments have been conducted to study the coding performance of the proposed three algorithms for SDQ, residual coding optimization, and overall joint optimization. These algorithms are implemented based on the H.264 reference software JM82 [26]. The B-frame is not used since we target baseline decoder compatibility. Each sequence is divided into and encoded by groups of frames. In each group, there is one standard I-frame,3 while all subsequent frames are coded as P-frames. Experimental results are reported with a group size of 21. Five reference frames and a fixed full-pixel search range are used for motion estimation. Comparative studies of the coding performance are shown by RD curves, with the distortion measured by PSNR, defined as 10 log10(255^2/MSE), where MSE is the mean square error. Fig. 6 shows the RD curves for coding various sequences. The RD performance is measured over P-frames only, since I-frames are not optimized. The result is reported on the luma component, as usual.

Comparisons are conducted among four encoders: a baseline encoder with the proposed overall joint optimization method; a main-profile reference encoder with the RD optimization method in [5] and CABAC (the coding setting of this encoder is the same as that of the baseline profile except that CABAC is used instead of CAVLC); a baseline reference encoder with the RD optimization method in [5]; and a baseline reference encoder with compromised RD optimization.4 The RD curve for the proposed method is obtained by varying the slope in (4), while RD curves for the other methods result from varying the quantization step size. Specifically, the six points on the curve of the proposed joint optimization method accord to six values of the slope. As illustrated in Fig. 6, the baseline encoder with the proposed overall joint optimization method achieves a significant rate reduction over the baseline reference encoder with the RD optimization in [5]. Moreover, experiments over a set of eight video sequences (i.e., Highway, Carphone, Foreman, Salesman,

3Intraframes are not optimized in this paper. The joint optimization is designed based on interprediction. However, the proposed SDQ is applicable to residual coding for intraframes.

4This is conducted by disabling the RD optimization option in the JM software. In this case, empirical formulas are used to compute the RD cost for mode selection, resulting in a compromised RD performance.

Fig. 6. RD curves of four coding methods for coding video sequences of Foreman, Highway, and Carphone.

Fig. 7. Comparison of the coding gain for the proposed three algorithms: Enc(SDQ), Enc(SDQ + QP), and Enc(SDQ + QP + ME).

Silent, Container, Mother-Daughter, Grandma) show that the proposed joint optimization method achieves an average 12% rate reduction while preserving the same PSNR over the RD optimization in [5] with the baseline profile, and a 23% rate reduction over the baseline encoder with compromised RD optimization.

Fig. 7 compares the coding gain for the proposed three algorithms. For simplicity, the encoders with the proposed algorithms are referred to as Enc(SDQ), Enc(SDQ + QP), and Enc(SDQ + QP + ME), while the fourth encoder is called Enc(baseline, [5]). For Enc(SDQ), motion estimation and quantization step sizes are computed using the baseline method in [5]; for Enc(SDQ + QP), the proposed residual coding optimization is performed based on the motion estimation obtained using the baseline method in [5]. It is shown that approximately half of the gain for overall joint optimization comes from SDQ,5 while QP and ME together contribute the other half. On average, our experiments show rate reductions of 6%, 8%, and 12% while preserving PSNR by Enc(SDQ), Enc(SDQ + QP), and Enc(SDQ + QP + ME), respectively, over Enc(baseline, [5]). In terms of program execution time with our current implementation, the baseline encoder using the RD optimization of [5] takes 1 s to encode a P-frame; SDQ adds 1 s for each P-frame; Enc(SDQ + QP) takes 6 s to encode each frame; and the overall optimization with Enc(SDQ + QP + ME) takes 15 s per frame. The complexity of Enc(SDQ + QP) comes from the process of exploring a neighboring region of five quantization step sizes. The complexity of the overall algorithm mainly comes from the iterative procedure, for which two iterations are used, since by observation the RD cost does not decrease much after two iterations. Frankly, the current implementation is not efficient, and there is plenty of room to improve the software structure and efficiency. Meanwhile, compared with the RD method in [5] and the compromised RD method, the proposed approach seeks better RD performance while maintaining the decoding complexity. It targets off-line applications such as video delivery, for which RD performance is more important and a complicated encoder is normally acceptable since encoding is carried out only once. The proposed joint optimization algorithm works in a frame-by-frame manner. Clearly, the optimization of the current P-frame encoding will impact the coding of the next

5It may be interesting to relate the SDQ gain to the picture texture. In general, they can be related to each other qualitatively through the effectiveness of motion estimation, i.e., the gain from SDQ is higher when the energy of residual signals is greater. Usually, this accords to a less effective motion estimation, which may be observed for highly textured pictures.

Fig. 8. Relative rate savings averaged over various numbers of frames for coding the sequence of Salesman.
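The relative rate saving plotted in Fig. 8 follows the definition in [5]; under the usual reading (rate reduction at the same quality, as a fraction of the reference rate), it can be sketched as:

```python
def relative_rate_saving(rate_ref, rate_opt):
    """Fraction of the reference bit rate saved at the same PSNR.

    rate_ref: bit rate of the reference encoder; rate_opt: bit rate of the
    optimized encoder, both measured at matched quality.
    """
    return (rate_ref - rate_opt) / rate_ref
```

For example, a drop from 100 kbit/s to 88 kbit/s at the same PSNR corresponds to a 12% relative rate saving.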

P-frame. Thus, it is interesting to see such impact as the number of optimized P-frames increases. Fig. 8 shows the relative rate savings (see the definition in [5]) of the proposed joint optimization algorithm over the baseline reference encoder with compromised RD optimization for various numbers of P-frames. Also shown in Fig. 8 is the result for the RD method in [5]. Although the proposed joint optimization algorithm consistently provides better gains than the RD method in [5], the relative rate savings decrease in both cases as the number of frames increases. This warrants the joint optimization of a group of frames, which is left open for future research.

VI. CONCLUSION AND DISCUSSION

Using SDQ, we have proposed a general framework in which motion estimation, quantization, and entropy coding in the hybrid coding structure for the current frame can be jointly designed to minimize a true RD cost given previously coded reference frames. Within the framework, we have then developed three RD optimization algorithms: a graph-based algorithm for near-optimal SDQ in H.264 baseline encoding given motion estimation and quantization step sizes; an algorithm for near-optimal residual coding in H.264 baseline encoding given motion estimation; and an iterative overall algorithm to optimize H.264 baseline encoding for each individual frame given previously coded reference frames, with them embedded in the indicated order. It has been shown that if the weak adjacent block dependency utilized in the CAVLC of H.264 is ignored for optimization, the proposed graph-based algorithm for SDQ is indeed optimal, and so is the algorithm for residual coding. These algorithms have been implemented based on the reference encoder JM82 of H.264 with complete compatibility to the baseline profile.
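The true RD cost referred to throughout is the fixed-slope Lagrangian cost J = D + lambda * R. A minimal sketch (function names ours) of selecting among candidate encodings by this cost:

```python
def rd_cost(distortion, rate, lam):
    """Fixed-slope Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate

def best_candidate(candidates, lam):
    """Pick the (distortion, rate) pair with the minimum Lagrangian cost."""
    return min(candidates, key=lambda dr: rd_cost(dr[0], dr[1], lam))
```

With lambda = 2, the candidates (D, R) = (10, 1), (4, 5), (6, 2) have costs 12, 14, and 10, so the third is selected; varying lambda traces out the operational RD curve, as done for the proposed method in Fig. 6.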
Experiments have demonstrated that for a set of typical video testing sequences, the graph-based SDQ algorithm, the algorithm for residual coding, and the iterative overall algorithm achieve, on average, 6%, 8%, and 12% rate reductions, respectively, at the same PSNR (ranging from 30 to 38 dB) when compared with the RD optimization method implemented in the H.264 reference software. Although we have focused mainly on H.264, especially its baseline profile, our proposed optimization framework is applicable to other hybrid video coding methods such as H.263, MPEG-2, and MPEG-4 as well. Of course, the detailed optimization algorithm design, especially the SDQ design, will depend on each specific video coding method. The SDQ design proposed in this paper is based on CAVLC in H.264. To improve the coding performance of the main-profile encoder for H.264, SDQ can be

designed based on the CABAC method and be embedded into the joint optimization framework, as shown in [28]. Many problems concerning RD optimization both within and beyond our proposed framework remain open, however. For example, within the proposed framework, it is interesting to see how to further reduce the computational complexity of the proposed algorithm for residual coding and of the iterative overall joint optimization algorithm while maintaining the RD performance. It is also interesting to seek an optimal solution to (13). A more challenging problem is to extend our proposed optimization framework to the joint optimization of a group of frames. These issues are left open for future research.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their constructive comments, which have helped to improve the presentation of this paper.

REFERENCES
[1] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[2] T. Wiegand, G. J. Sullivan, and A. Luthra, "Draft ITU-T Rec. H.264/ISO/IEC 14496-10 AVC," presented at the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Doc. JVT-G050r1, 2003.
[3] E.-H. Yang and X. Yu, "On joint optimization of motion compensation, quantization and baseline entropy coding in H.264 with complete decoder compatibility," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, Mar. 2005, pp. II-325-328.
[4] G. Bjøntegaard and K. Lillevold, "Context-adaptive VLC (CVLC) coding of coefficients," presented at the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Doc. JVT-C028, 3rd Meeting, Fairfax, VA, May 6-10, 2002.
[5] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688-703, Jul. 2003.
[6] E.-H. Yang and S.-Y. Shen, "Distortion program-size complexity with respect to a fidelity criterion and rate distortion function," IEEE Trans. Inf. Theory, vol. 39, no. 1, pp. 288-292, Jan. 1993.
[7] E.-H. Yang, Z. Zhang, and T. Berger, "Fixed-slope universal lossy data compression," IEEE Trans. Inf. Theory, vol. 43, no. 5, pp. 1465-1476, Sep. 1997.
[8] E.-H. Yang and Z. Zhang, "Variable-rate trellis source encoding," IEEE Trans. Inf. Theory, vol. 45, no. 3, pp. 586-608, Mar. 1999.
[9] K. Ramchandran and M. Vetterli, "Rate-distortion optimal fast thresholding with complete JPEG/MPEG decoder compatibility," IEEE Trans. Image Process., vol. 3, no. 9, pp. 700-704, Sep. 1994.
[10] M. Crouse and K. Ramchandran, "Joint thresholding and quantizer selection for transform image coding: Entropy constrained analysis and applications to baseline JPEG," IEEE Trans. Image Process., vol. 6, no. 2, pp. 285-297, Feb. 1997.
[11] E.-H. Yang and L. Wang, "Joint optimization of run-length coding, Huffman coding and quantization table with complete baseline JPEG decoder compatibility," U.S. Patent Application, 2004.

[12] E.-H. Yang and X. Yu, "Optimal soft decision quantization design for H.264," in Proc. 9th Canadian Workshop on Information Theory, Montréal, QC, Canada, Jun. 2005, pp. 223-226.
[13] T. Wiegand and B. Girod, "Lagrangian multiplier selection in hybrid video coder control," in Proc. Int. Conf. Image Processing, Oct. 2001, pp. 542-545.
[14] P. A. Chou, T. Lookabaugh, and R. M. Gray, "Entropy-constrained vector quantization," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 1, pp. 31-42, Jan. 1989.
[15] W. Ding and B. Liu, "Rate control of MPEG video coding and recording by rate-quantization modeling," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 2, pp. 12-20, Feb. 1996.
[16] H. M. Hang and J. J. Chen, "Source model for transform video coder and its application-Part I: Fundamental theory," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 4, pp. 287-298, Apr. 1997.
[17] N. Kamaci and Y. Altunbasak, "Frame bit allocation for H.264 using Cauchy-distribution based source modelling," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, Mar. 2005, pp. II-57-60.
[18] B. Girod, "Efficiency analysis of multihypothesis motion-compensated prediction for video coding," IEEE Trans. Image Process., vol. 9, no. 2, pp. 173-183, Feb. 2000.
[19] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74-90, Nov. 1998.
[20] A. Ortega and K. Ramchandran, "Rate-distortion methods for image and video compression," IEEE Signal Process. Mag., vol. 15, no. 6, pp. 23-49, Nov. 1998.
[21] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. Hoboken, NJ: Wiley, 2003.
[22] H. Everett, "Generalized Lagrange multiplier method for solving problems of optimum allocation of resources," Oper. Res., vol. 11, no. 3, pp. 399-417, Jun. 1963.
[23] K. Ramchandran, A. Ortega, and M. Vetterli, "Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders," IEEE Trans. Image Process., vol. 3, no. 5, pp. 533-545, Sep. 1994.
[24] J. Wen, M. Luttrell, and J. Villasenor, "Trellis-based R-D optimal quantization in H.263+," IEEE Trans. Image Process., vol. 9, no. 8, pp. 1431-1434, Aug. 2000.
[25] B. Schumitsch, H. Schwarz, and T. Wiegand, "Inter-frame optimization of transform coefficient selection in hybrid video coding," presented at the Picture Coding Symp., San Francisco, CA, Dec. 2004.
[26] HHI, H.264 Reference Software. [Online]. Available: http://bs.hhi.de/suehring/tml/
[27] T. Wiegand, M. Lightstone, D. Mukherjee, T. G. Campbell, and S. K. Mitra, "Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 2, pp. 182-190, Apr. 1996.
[28] E.-H. Yang and X. Yu, "Rate distortion optimization of H.264 with main profile compatibility," in Proc. IEEE Int. Symp. Information Theory, Seattle, WA, Jul. 9-14, 2006, pp. 282-286.
[29] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337-342, May 1977.
[30] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530-536, Sep. 1978.
[31] E.-H. Yang and J. Zeng, "Method, system, and software product for color image encoding," U.S. Patent Application 10/831,656, Apr. 23, 2004.

En-Hui Yang (M'97-SM'00) was born in Jiangxi, China, on December 26, 1966. He received the B.S. degree in applied mathematics from HuaQiao University, Quanzhou, China, and the Ph.D. degree in mathematics from Nankai University, Tianjin, China, in 1986 and 1991, respectively.

Since June 1997, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, where he is now a Professor and Canada Research Chair in information theory and multimedia compression. He held a Visiting Professor position at the Chinese University of Hong Kong from September 2003 to June 2004; positions of Research Associate and Visiting Scientist at the University of Minnesota, Minneapolis-St. Paul, the University of Bielefeld, Bielefeld, Germany, and the University of Southern California, Los Angeles, from January 1993 to May 1997; and a faculty position (first as an Assistant Professor and then an Associate Professor) at Nankai University from 1991 to 1992. He is the founding Director of the Leitch-University of Waterloo Multimedia Communications Lab and a Co-Founder of SlipStream Data, Inc. (now a subsidiary of Research In Motion). His current research interests are multimedia compression, multimedia watermarking, multimedia transmission, digital communications, information theory, source and channel coding including distributed source coding and space-time coding, Kolmogorov complexity theory, quantum information theory, and applied probability theory and statistics.

Dr. Yang is a recipient of several research awards, including the 1992 Tianjin Science and Technology Promotion Award for Young Investigators; the 1992 Third Science and Technology Promotion Award of the Chinese National Education Committee; the 2000 Ontario Premier's Research Excellence Award, Canada; the 2000 Marsland Award for Research Excellence, University of Waterloo; and the 2002 Ontario Distinguished Researcher Award. Products based on his inventions and commercialized by SlipStream received the 2006 Ontario Global Traders Provincial Award and were deployed by over 2200 service providers in more than 50 countries, serving millions of home subscribers worldwide every day. He served, among many other roles, as a Technical Program Vice-Chair of the 2006 IEEE International Conference on Multimedia & Expo (ICME), the Chair of the award committee for the 2004 Canadian Award in Telecommunications, a Co-Editor of the 2004 Special Issue of the IEEE TRANSACTIONS ON INFORMATION THEORY, a Co-Chair of the 2003 U.S. National Science Foundation (NSF) workshop on the interface of Information Theory and Computer Science, and a Co-Chair of the 2003 Canadian Workshop on Information Theory.

Xiang Yu received the M.E. degree in physics in 1994 from Tsinghua University, Beijing, China, and the M.E. degree in electrical engineering from Peking University, Beijing, China, in 1997. He is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Waterloo, Waterloo, ON, Canada. His research interests include data compression, multimedia communications, information theory, image processing, and machine learning.