MULTIPLEXING THE ELEMENTARY STREAMS OF H.264 VIDEO AND MPEG4 HE AAC v2 AUDIO USING MPEG2 SYSTEMS SPECIFICATION, DEMULTIPLEXING AND ACHIEVING LIP SYNCHRONIZATION DURING PLAYBACK
by
NAVEEN SIDDARAJU
Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE
ACKNOWLEDGEMENTS

I am greatly thankful to my supervising professor Dr. K. R. Rao, whose constant encouragement, guidance and support have helped me in the smooth completion of the project. He has always been accessible and helpful throughout. I also thank him for introducing me to the field of multimedia processing. I would like to thank Dr. W. Alan Davis and Dr. William E. Dillon for taking interest in my project and agreeing to be part of my project defense committee. I am forever grateful to my parents for their unconditional support at each turn of the road. I thank my brother and sisters, who have always been a source of inspiration. I would like to thank my friends, both in the US and in India, for their encouragement and support.

November 22, 2010
ABSTRACT
MULTIPLEXING THE ELEMENTARY STREAMS OF H.264 VIDEO AND MPEG4 HE AAC v2 AUDIO USING MPEG2 SYSTEMS SPECIFICATION, DEMULTIPLEXING AND ACHIEVING LIP SYNCHRONIZATION DURING PLAYBACK

Naveen Siddaraju, MS

The University of Texas at Arlington, 2010

Supervising Professor: Dr. K. R. Rao
Delivering broadcast quality content to mobile customers is one of the most challenging tasks in the world of digital broadcasting. The limited network bandwidth and processing capability of handheld devices are critical factors that must be considered. Hence the selection of compression schemes for the media content is very important from both economic and quality points of view. H.264, also known as Advanced Video Coding (AVC) [1], is the latest and most advanced video codec available in the market today. The H.264 baseline profile, which is used in applications such as mobile television (mobile DTV) broadcast, offers one of the best compression ratios among the profiles and requires the least processing power at the decoder. MPEG4 HE AAC v2 [2], also known as enhanced aacplus, is the latest audio codec in the AAC (advanced audio coding) [3] family. In addition to the core AAC, it uses tools such as Spectral Band Replication (SBR) [2] and Parametric Stereo (PS) [2], resulting in the best perceived quality at the lowest
bitrates. The audio and video codec standards have been chosen based on ATSC-M/H (advanced television systems committee - mobile/handheld) [17]. For television broadcasting applications such as ATSC-M/H and DVB [16], the encoded audio and video streams should be transmitted in a single transport stream containing fixed sized data packets, which can be easily recognized and decoded at the receiver. The goal of the project is to implement a multiplexing scheme for the elementary streams of H.264 baseline and HE AAC v2 using the MPEG2 systems specifications [4], then demultiplex the transport stream and play back the decoded elementary streams with lip synchronization, or audio-video synchronization. The multiplexing involves two layers of packetization of the elementary streams of audio and video. The first level of packetization results in Packetized Elementary Stream (PES) packets, which are variable size packets and hence not suitable for transport. MPEG2 defines a transport stream where PES packets are logically organized into fixed size packets called Transport Stream (TS) packets, which are 188 bytes long. These packets are continuously generated to form a transport stream, which is decoded by the receiver and the original elementary streams are reconstructed. The PES packets, which are logically encapsulated into the TS packets, carry the time stamp information that is used at the de-multiplexer to achieve synchronization between the audio and video elementary streams.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACRONYMS AND ABBREVIATIONS

Chapter
1. INTRODUCTION
2. OVERVIEW OF H.264
   2.1 H.264/AVC
   2.2 Coding structure
   2.3 Profiles and levels
   2.4 Description of various profiles
      2.4.1 Baseline profile
      2.4.2 Extended profile
      2.4.3 Main profile
      2.4.4 High profiles
   2.5 H.264 encoder and decoder
      2.5.1 Intra prediction
      2.5.2 Inter prediction
      2.5.3 Transform and quantization
      2.5.4 Entropy coding
      2.5.5 Deblocking filter
   2.6 H.264 bitstream
3. OVERVIEW OF HE AAC v2
   3.1 HE AAC v2
   3.2 Spectral Band Replication (SBR)
   3.3 Parametric Stereo (PS)
   3.4 Enhanced aacplus encoder
   3.5 Enhanced aacplus decoder
   3.6 Advanced Audio Coding (AAC)
      3.6.1 AAC encoder
   3.7 HE AAC v2 bitstream formats
4. TRANSPORT PROTOCOLS
   4.1 Introduction
   4.2 Real-Time Protocol (RTP)
   4.3 MPEG2 systems layer
   4.4 Packetized elementary stream (PES)
      4.4.1 PES encapsulation process
   4.5 MPEG transport stream (MPEG-TS)
   4.6 Time stamps
5. MULTIPLEXING
6. DE-MULTIPLEXING
   6.1 Lip or audio-video synchronization
7. RESULTS
   7.1 Buffer fullness
   7.2 Synchronization/skew calculation
8. CONCLUSIONS
9. FUTURE WORK
References
LIST OF FIGURES

Fig 2.1. Video data organization in H.264 [42]
Fig 2.2. Specific coding parts of the profiles in H.264 [5]
Fig 2.3. Different YUV systems
Fig 2.4. H.264 encoder [5]
Fig 2.5. H.264 decoder [5]
Fig 2.6. Intra prediction modes for 4x4 luma in H.264
Fig 2.7. Different layers of JVT coding
Fig 2.8. NAL formatting of VCL and non-VCL data [6]
Fig 2.9. NAL unit format [6]
Fig 2.10. Relationship between parameter sets and picture slices [24]
Fig 3.1. HE AAC audio codec family
Fig 3.2. Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7]
Fig 3.4. Original audio signal [28]
Fig 3.5. High band reconstruction through SBR [28]
Fig 3.6. Enhanced aacplus encoder block diagram [9]
Fig 3.7. Enhanced aacplus decoder block diagram [9]
Fig 3.8. AAC encoder block diagram [10]
Fig 3.9. ADTS elementary stream
Fig 4.1. RTP packet structure (simplified) [22]
Fig 4.2. MPEG2 transport stream [22]
Fig 4.3. Conversion of an elementary stream into PES packets [29]
Fig 4.4. A standard MPEG-TS packet structure [14]
Fig 4.5. Transport stream (TS) packet format used in this project
Fig 5.1. Overall multiplexer flow diagram
Fig 5.2. Flow chart of video processing block
Fig 5.3. Flow chart of audio processing block
Fig 6.1. Flow chart for the de-multiplexer used
LIST OF TABLES

Table 2.1. NAL unit types
Table 3.1. ADTS header format [2][3]
Table 3.2. Profile bits expansion [2][3]
Table 4.1. PES packet header format used [4]
Table 7.1. Video and audio buffer sizes and their respective playback times
Table 7.2. Characteristics of test clips used
Table 7.3. Demultiplexer output
ACRONYMS AND ABBREVIATIONS

3GPP - Third generation partnership project
AAC - Advanced audio coding
AAC LTP - Advanced audio coding - long term prediction
ADIF - Audio data interchange format
ADTS - Audio data transport stream
AFC - Adaptation field control
ATM - Asynchronous transfer mode
ATSC - Advanced television systems committee
ATSC-M/H - Advanced television systems committee - mobile/handheld
AVC - Advanced Video Coding
CABAC - Context-based Adaptive Binary Arithmetic Coding
CAVLC - Context-based Adaptive Variable Length Coding
CC - Continuity counter
CIF - Common intermediate format
CRC - Cyclic redundancy check
DCT - Discrete cosine transform
DPCM - Differential pulse code modulation
DVB - Digital video broadcasting
DVD - Digital video disc
EI - Error indicator
ES - Elementary stream
FMO - Flexible macro block order
GOP - Group of pictures
HDTV - High definition television
HE AAC v2 - High efficiency advanced audio codec version 2
IC - Inter-channel Coherence
IDR - Instantaneous decoder refresh
IID - Inter-channel Intensity Differences
IP - Internet protocol
IPD - Inter-channel Phase Differences
ISDB - Integrated Services Digital Broadcasting system
I-slice - Intra predictive slice
ISO - International Standards Organization
ITU - International Telecommunication Union
JM - Joint model
JVT - Joint Video Team
M4A - Moving picture experts group four file format, audio only
MB - Macro block
MC - Motion compensation
MDCT - Modified discrete cosine transform
ME - Motion estimation
MP4 - Moving picture experts group four file format
MPEG - Moving Picture Experts Group
MPTS - Multi program transport stream
NALU - Network Abstraction Layer Unit
OPD - Overall Phase Difference
PCR - Program clock reference
PES - Packetized elementary stream
PID - Packet identifier
PMT - Program map table
PPS - Picture parameter set
PS - Parametric stereo
P-slice - Predictive slice
PTS - Presentation time stamp
PUSI - Payload unit start indicator
QCIF - Quarter common intermediate format
QMF - Quadrature mirror filter banks
RTP - Real time protocol
SBR - Spectral band replication
SCI - Simplified chrominance intra prediction
S-DMB - Satellite digital multimedia broadcasting
SDTV - Standard definition television
SEI - Supplemental enhancement information
SPS - Sequence parameter set
SPTS - Single program transport stream
STC - System timing clock
TCP - Transmission control protocol
TNS - Temporal noise shaping
TS - Transport stream
UDP - User datagram protocol
VC1 - Video Codec 1
VCEG - Video Coding Experts Group
VCL - Video coding layer
VLC - Variable length coding
VUI - Video usability information
YUV - Luminance and chrominance color components
CHAPTER 1 INTRODUCTION
Mobile broadcast systems are increasingly important as cellular phones and highly efficient digital video compression techniques merge to enable digital TV and multimedia reception on the move. There are several digital mobile TV broadcast systems in the market. Major ones are DVB-H (digital video broadcast - handheld) [16] and ATSC-M/H (advanced television systems committee - mobile/handheld) [17] [18]. Both DVB-H and ATSC-M/H have a relatively small channel bandwidth allocation (~14 Mbps for DVB-H and ~19.6 Mbps for ATSC-M/H), so the choice of multimedia compression standard and transport protocol becomes very important. DVB-H specifies the use of either the VC1 [19] or the H.264 [1] compression standard for video and AAC [3] for audio. ATSC-M/H specifies the H.264 baseline profile for video and HE AAC v2 [2] for audio. The transport protocol is usually RTP (real time protocol) [20] or MPEG2 part 1 systems [4]. The MPEG-2 systems specification [4] describes how MPEG-compressed video and audio data streams may be multiplexed together with other data to form a single data stream suitable for digital transmission or storage. Two alternative streams are specified for the MPEG-2 systems layer. The program stream is used for the storage of multimedia content, as on a DVD, while the transport stream is intended for the simultaneous delivery of a number of programs over potentially error-prone channels. In this project, the compression standards used are the H.264 baseline profile for video and HE AAC v2 for audio, as specified by the ATSC-M/H standard. Distribution is achieved through the MPEG-2 part 1 systems specification transport stream. Chapters 2 and 3 give a brief overview of the H.264 and HE AAC v2 compression standards respectively. Chapter 4 explains the transport stream
protocol used in this project. Chapters 5 and 6 explain the multiplexing and demultiplexing schemes used in this project. All the results are tabulated in chapter 7. Chapters 8 and 9 outline the conclusions of the project and future work respectively.
There are basically three types of slices: I (intra predictive), P (predictive) and B (bi-predictive) slices. I slices are strictly intra coded, i.e. macro blocks (MB) are compressed without using any motion prediction from earlier slices. A special type of picture containing only I-slices is called an instantaneous decoder refresh (IDR) picture. Pictures following the IDR picture do not use the pictures prior to the IDR for their motion prediction. IDR pictures can be used for random access or as entry points for a coded sequence [6]. P-slices on the other hand contain macroblocks which use motion prediction. The MBs of a P-slice can use only one frame as reference (either from the past or the future) for their motion prediction.
H.264/AVC is purely a video codec with as many as seven profiles. Three profiles, namely the main, baseline and extended profiles, were included in its first release. Four new profiles were added in the subsequent releases, defined in the fidelity range extensions, for applications such as content distribution, content contribution, and studio editing and post-processing [5]. The profiles, their specific tools and their common features are shown in fig 2.2.
It can be noted that I-slice, P-slice and CAVLC (Context-based Adaptive Variable Length Coding) entropy coding are common to all the profiles.
The baseline profile supports only I and P slices. It was designed for low delay applications, as well as for applications that run on platforms with low processing power and in high packet loss environments. Among the three profiles, it offers the least coding efficiency [6]. The baseline profile caters to applications such as video conferencing and mobile television broadcast video. This project uses the baseline profile for video encoding since it is specified by ATSC for mobile digital television.
The high profiles are used for applications such as content contribution, content distribution, and studio editing and post-processing [5]. The four high profiles are described below:
High Profile - supports 8-bit video with 4:2:0 sampling for applications using high resolution.
High 10 Profile - supports 4:2:0 sampling with up to 10 bits of representation accuracy per sample.
High 4:2:2 Profile - supports up to 4:2:2 chroma sampling and up to 10 bits per sample.
High 4:4:4 Profile - supports up to 4:4:4 chroma sampling, up to 12 bits per sample, and an integer residual color transform for coding RGB signals.
Different YUV formats are shown in fig 2.3.
For any given profile, a level specifies limits on parameters such as the data bit rate, frame size and picture buffer size.
The decoder works in the reverse manner: it takes in the encoded bitstream, entropy decodes it, and passes the result to the inverse quantization and inverse transform blocks.
Fig 2.6. Intra prediction modes for 4x4 luma in H.264 [39].
For the prediction of each 8x8 luma block, one mode is selected from the 9 modes, similar to the 4x4 intra-block prediction. For the prediction of the whole 16x16 luma component of a macroblock, four modes are available. For mode 0 (vertical), mode 1 (horizontal) and mode 2 (DC), the predictions are similar to the cases of the 4x4 luma block. For mode 3 (plane), a linear plane function is fitted to the upper and left samples.
Decoded pictures, both before and after the current picture in the display order, are stored in the picture buffer. These are classified as short-term and long-term reference pictures. Long-term reference pictures are introduced to extend the motion search range by using multiple decoded pictures, instead of using just one decoded short-term picture. Memory management is required to mark some stored pictures as unused and to decide which pictures to delete from the buffer for efficient memory use [5].
CABAC achieves better coding efficiency, but with higher complexity compared to CAVLC [1].
The H.264 bit stream is encapsulated into packets called NAL units (NALU). NALUs are separated by the 4-byte sequence 0x00000001. After this byte sequence, the following byte is the NAL header and the rest is a variable
length raw byte sequence payload (RBSP). The NAL header/unit format is shown in figure 2.9. The first bit of the NAL header, called the forbidden bit, is always zero. The next two bits indicate whether the NALU consists of a sequence parameter set, a picture parameter set or the slice data of a reference picture. The next 5 bits indicate the type of the NALU (type indicator), depending upon the type of data being carried by that NALU. There are 32 different types of NALUs. These may be classified into VCL NALUs and non-VCL NALUs, depending upon the type of data they carry.
If the type indicator is in the range 1 to 5, it is a VCL NALU; otherwise it is a non-VCL NALU. The different types of NALUs are listed in table 2.1. NALU types 1-5 are VCL NAL units containing coded VCL data. The rest are called non-VCL NAL units and contain information such as SEI, the sequence parameter set, the picture parameter set etc. Of these NALUs, the IDR picture, sequence parameter set and picture parameter set are the most important. An instantaneous decoder refresh (IDR) picture is a picture that is placed at the beginning of a video sequence. When the decoder receives an IDR picture, all information is refreshed, which indicates a new coded video sequence; frames prior to the IDR frame are not required to decode the new sequence. The sequence parameter set contains important header information that applies to all NALUs in the coded sequence. The picture parameter set contains
important header information that is used for decoding one or more frames in the coded sequence.
Type indicator | NALU type
0 | unspecified
1 | coded slice
2 | data partition A
3 | data partition B
4 | data partition C
5 | IDR (instantaneous decoder refresh)
6 | SEI (supplemental enhancement information)
7 | sequence parameter set
8 | picture parameter set
9 | access unit delimiter
10 | end of sequence
11 | end of stream
12 | filler data
13-23 | extended
24-31 | undefined

Table 2.1. NAL unit types [6].
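To make the byte-level layout concrete, here is a minimal C++ sketch (not part of the thesis software) that scans an Annex-B buffer for the 4-byte start code and classifies each NALU by its 5-bit type field; the sample buffer contents are hypothetical.

```cpp
// Scan an H.264 Annex-B byte stream for NAL units and decode the one-byte
// NAL header: 1 forbidden bit, 2 bits nal_ref_idc, 5 bits nal_unit_type.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct NalHeader {
    bool forbidden;  // must be 0 in a valid stream
    int  refIdc;     // non-zero for SPS/PPS/reference-picture slices
    int  type;       // 1-5 = VCL (5 = IDR), 7 = SPS, 8 = PPS, ...
};

NalHeader parseNalHeader(uint8_t b) {
    return NalHeader{ ((b >> 7) & 0x01) != 0,
                      (b >> 5) & 0x03,
                       b       & 0x1F };
}

int main() {
    // Hypothetical Annex-B buffer: an SPS (0x67) followed by an IDR slice (0x65).
    std::vector<uint8_t> stream = {0,0,0,1, 0x67,   // SPS payload would follow
                                   0,0,0,1, 0x65};  // IDR slice would follow
    for (std::size_t i = 0; i + 4 < stream.size(); ++i)
        if (stream[i]==0 && stream[i+1]==0 && stream[i+2]==0 && stream[i+3]==1) {
            NalHeader h = parseNalHeader(stream[i+4]);
            std::printf("NALU type %d (%s)\n", h.type,
                        (h.type >= 1 && h.type <= 5) ? "VCL" : "non-VCL");
        }
}
```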
The relationship between the parameter sets and the slice data is shown in fig 2.10. Each VCL NAL unit contains an identifier that refers to the content of the relevant Picture Parameter Set (PPS) and each PPS contains an identifier that refers to the content of the relevant Sequence Parameter Set (SPS). In this manner, a small amount of data (the identifier) is used to refer to a larger amount of information (the parameter set) without repeating that information within each VCL NAL unit. Sequence and picture parameter sets can be sent well ahead of the VCL NAL units to which they apply, and can be repeated to provide robustness against data loss. In some applications, parameter sets may be sent within the channel that carries the VCL NAL units (termed "in-band" transmission). In other applications, it can be advantageous to convey the
parameter sets "out of band" using a more reliable transport mechanism than the video channel itself. By using this mechanism, H.264 can transmit multiple video sequences (with different parameters) in a single bitstream.
The important information carried by the SPS includes the profile/level indicator, decoding or playback order, frame size, number of reference frames, and Video Usability Information (VUI) such as the aspect ratio and color space details. The SPS remains the same for the entire coded video sequence. The important information carried by the PPS includes the entropy coding scheme used, macro block reordering, quantization parameters, and a flag to indicate whether inter predicted MBs can be used for intra prediction. The PPS remains unchanged within a coded picture. This chapter provided an overview of H.264. The various profiles, the encoder, the decoder and the H.264 bit stream format were discussed in detail. An overview of the HE AAC v2 audio codec is presented in the next chapter.
CHAPTER 3 OVERVIEW OF HE AAC v2
HE AAC v2 is a combination of three technologies: AAC (advanced audio coding), SBR (spectral band replication) and PS (parametric stereo). All three technologies are defined in the MPEG4 audio standard [2]. The combination of AAC and SBR is called HE-AAC or aacplus. AAC is a general audio codec, SBR is a bandwidth extension technique offering substantial coding gain in combination
with AAC, and Parametric Stereo (PS) enables stereo coding at very low bitrates. Figure 3.1 shows the family of AAC audio codecs.
[Fig 3.1: HE AAC audio codec family - the AAC core plus SBR forms HE AAC, and HE AAC plus PS forms HE AAC v2]
Figure 3.2 shows the typical bitrate ranges for stereo plotted against the perceptual quality factor for all three forms of the codec. It can be seen that HE AAC v2 provides the best quality at the lowest bitrates.
Fig 3.2: Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7]
SBR has enabled high-quality stereo sound at bitrates as low as 48 kbps. SBR was developed as a bandwidth extension tool to be used along with AAC, and was adopted as an MPEG4 standard in March 2004 [2].
The IPD parameters describe the phase differences between the channels of the stereo input signal. They do not prescribe the distribution of these phase differences over the left and right channels. Hence, a fourth type of parameter is introduced, describing an overall phase offset or Overall Phase Difference (OPD). In order to reconstruct the stereo image, a number of operations are performed in the PS decoder, consisting of scaling (IID), phase rotations (IPD/OPD) and decorrelation (IC).
The IIR resampler is only applied if the input signal sampling rate differs from the encoding sampling rate. It may either be run as a 3:2 downsampler (e.g. to downsample from 48 kHz to 32 kHz) or as a 1:2 upsampler (e.g. to upsample from 16 kHz to 32 kHz). The QMF filter bank (part of SBR) is used to derive the spectral envelope of the original signal. This envelope data along with some other error information forms the SBR stream. The enhanced aacplus encoder basically consists of the well-known AAC encoder, the SBR high band reconstruction encoding tool and the Parametric Stereo (PS) encoding tool. The enhanced aacplus encoder is operated in a dual frequency mode: the SBR encoder unit operates at the encoding sampling frequency (as delivered from the IIR resampler) and the AAC encoder unit at half of that frequency, the downsampled signal forming the
input to the AAC encoder. For an efficient implementation an IIR (Infinite Impulse Response) filter is used. The parametric stereo tool is used for low-bitrate stereo coding, i.e. up to and including a bitrate of 44 kbps [4].
[Fig 3.6: Enhanced aacplus encoder block diagram [9] - IIR resampler, SBR-related modules (including envelope estimation), AAC encoder and bitstream payload formatter]
The SBR encoder consists of a QMF (Quadrature Mirror Filter) analysis filter bank, which is used to derive the spectral envelope of the original input signal. This spectral envelope data along with transposition information forms the SBR stream. For stereo bitrates at and below 44 kbps, the parametric stereo encoding tool in the enhanced aacplus encoder is used. For stereo bitrates above 44 kbps, normal stereo operation is performed. The parametric stereo encoding tool estimates parameters characterizing the perceived stereo image of the input signal. These stereo parameters are embedded in the SBR stream.
The embedding of the SBR stream (including the parametric stereo data) into the AAC stream is done in a backwards compatible way, i.e. legacy AAC decoders can parse the enhanced aacplus stream and decode the AAC core part.
[Fig 3.7: Enhanced aacplus decoder block diagram [9] - stereo-to-mono downmix, HF generation, envelope adjustment and spline resampler, driven by the SBR guidance information and stereo parameters]
Filter bank: The encoder breaks down the raw audio signal into segments known as blocks. The Modified Discrete Cosine Transform (MDCT) is applied to the blocks to maintain a smooth transition from block to block. AAC dynamically switches between two block sizes, 2048 samples and 256 samples, referred to as long blocks and short blocks, respectively. AAC also switches between two window shapes, sine and Kaiser-Bessel Derived (KBD), according to the complexity of the signal. The MDCT is given by (3.1):

X(k) = 2 \sum_{n=0}^{N-1} x(n) \cos( (2\pi/N)(n + n_0)(k + 1/2) ),  k = 0, ..., N/2 - 1   (3.1)

where N is the block length and n_0 = (N/2 + 1)/2.

Psychoacoustic model: A highly complex block which implements the switching between block sizes, the threshold calculation (the upper limit of the quantization error), spread energy calculation, grouping etc. [10].

Temporal Noise Shaping (TNS) block: This technique does noise shaping in the time domain by performing an open loop prediction in the frequency domain. TNS provides enhanced control of the location, in time, of quantization noise within a filter bank window, and proves especially successful for the improvement of speech quality at low bit-rates.

Mid/Side Stereo: M/S stereo coding is another data reduction module based on channel pair coding. Channel pair elements are analyzed as left/right and sum/difference signals on a block-by-block basis. In cases where the M/S channel pair can be represented by fewer bits, the spectral coefficients are coded and a bit is set to note that the block has used M/S stereo coding. During decoding, the decoded channel pair is de-matrixed back to its original left/right state. M/S stereo is only required when operating the encoder at bitrates at or above 44 kbps; below 44 kbps the parametric stereo coding tool is used instead and the AAC core is operated in mono.
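For concreteness, (3.1) can be evaluated directly as in the sketch below. This is a naive O(N^2) implementation for illustration only; real encoders apply the analysis window first and use FFT-based fast algorithms.

```cpp
// Direct evaluation of the MDCT of (3.1): N input samples -> N/2 coefficients.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> mdct(const std::vector<double>& x) {
    const double PI = 3.14159265358979323846;
    const std::size_t N = x.size();          // 2048 (long block) or 256 (short block)
    const double n0 = (N / 2.0 + 1.0) / 2.0; // phase offset term of (3.1)
    std::vector<double> X(N / 2);
    for (std::size_t k = 0; k < N / 2; ++k) {
        double sum = 0.0;
        for (std::size_t n = 0; n < N; ++n)
            sum += x[n] * std::cos(2.0 * PI / N * (n + n0) * (k + 0.5));
        X[k] = 2.0 * sum;
    }
    return X;
}
```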
Reduction of psychoacoustic requirements block: Usually the requirements of the psychoacoustic model are too strong for the desired bitrate. Thus a threshold reduction strategy is necessary, i.e. the strategy reduces the requirements by increasing the thresholds given by the psychoacoustic model.

Quantization and coding: The majority of the data reduction generally occurs in the quantization phase, after the data has already achieved a certain level of compression in the previous modules. This module contains several further blocks, such as scale factor quantization, noiseless coding and out-of-bits prevention.

Scale factor quantization: This block consists of two sub-blocks, scale factor determination and scale factor difference reduction.

Scale factor determination: The scale factors determine the quantization step size for each scale factor band. By changing the scale factor, the quantization noise is controlled.

Scale factor difference reduction: This block encodes the differences between the scale factors. A smaller difference between two adjacent scale factors requires fewer bits.

Noiseless coding: The quantized spectral coefficients are coded by the noiseless coding block. The encoder uses a so-called greedy merge algorithm to segment the 1024 coefficients of a frame into sections and to find the best Huffman codebook for each section.

Out-of-bits prevention: After noiseless coding, the number of bits actually needed is counted. If this number is too high, it has to be reduced.
[Fig 3.8: AAC encoder block diagram [10] - input signal, stereo preprocessing, filter bank, psychoacoustic model, TNS, M/S and noiseless coding]
Field name | Number of bits | Comment
syncword | 12 | always "111111111111"
ID | 1 | 0: MPEG-4, 1: MPEG-2
layer | 2 | always "00"
protection_absent | 1 |
profile | 2 | explained below (table 3.2)
sampling_frequency_index | 4 |
private_bit | 1 |
channel_configuration | 3 |
original/copy | 1 |
home | 1 |
(the fields above constitute the ADTS fixed header)
copyright_identification_bit | 1 |
copyright_identification_start | 1 |
aac_frame_length | 13 | length of the frame including header (in bytes)
adts_buffer_fullness | 11 | 0x7FF indicates VBR
no_raw_data_blocks_in_frame | 2 |
crc_check | 16 |
raw_data_blocks | variable |

Table 3.1. ADTS header format [2][3]

profile bits | profile
00 (0) | Main
01 (1) | Low Complexity (LC)
10 (2) | Scalable Sample Rate (SSR)
11 (3) | Long Term Prediction (AAC LTP)

Table 3.2. Profile bits expansion [2][3]
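A minimal sketch of reading the fixed-header fields of table 3.1 from the first bytes of an ADTS frame is shown below; the bit positions follow the field order and widths listed above.

```cpp
// Parse the start of an ADTS header (table 3.1).
#include <cstdint>

struct AdtsHeader {
    unsigned profile;        // 2 bits, see table 3.2
    unsigned samplingIndex;  // 4 bits, sampling_frequency_index
    unsigned channels;       // 3 bits, channel_configuration
    unsigned frameLength;    // 13 bits, frame length incl. header, in bytes
};

bool parseAdts(const uint8_t* p, AdtsHeader* out) {
    // The 12-bit syncword spans byte 0 and the top nibble of byte 1.
    if (p[0] != 0xFF || (p[1] & 0xF0) != 0xF0) return false;
    out->profile       = (p[2] >> 6) & 0x03;
    out->samplingIndex = (p[2] >> 2) & 0x0F;
    out->channels      = ((p[2] & 0x01) << 2) | ((p[3] >> 6) & 0x03);
    out->frameLength   = ((unsigned)(p[3] & 0x03) << 11) | ((unsigned)p[4] << 3)
                       | ((p[5] >> 5) & 0x07);
    return true;
}
```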
[Fig 3.9: ADTS elementary stream - a sequence of ADTS frames, each beginning with an ADTS header that starts with the sync word]
[Fig 4.1: RTP packet structure (simplified) [22] - payload type, sequence number, time stamp, unique id and payload]
broadcast techniques such as those used in ATSC and DVB. The TS contains the actual data (payload) as well as timing and synchronization information and some error control mechanisms; the latter play a crucial role during the decoding process. This project is implemented using the MPEG2 transport stream specification with a few modifications. The transport of the H.264 bit stream over MPEG2 systems is covered in amendment 3 of MPEG2 systems [30]. Even though MPEG2 systems support multiple elementary streams for multiplexing, for implementation purposes only two elementary streams, audio and video, are considered. Additional elementary streams, such as data streams or different audio/video streams, may be added by following the same method described next.
[Fig 4.3: Conversion of an elementary stream (e.g. coded video or audio) into PES packets [29]]
A PES packet is an encapsulation of one frame of data from either the audio or the video elementary stream. Each PES packet contains a packet header and payload data from only one particular stream. The PES header contains information which can distinguish between audio and video PES packets. Since the number of bits used to represent a frame in the bit stream varies (for both audio and video), the size of the PES packets also varies and depends on the type of frame that is encoded; for example, I frames require more bits than P frames. Figure 4.3 shows how an elementary stream is converted into a PES stream.
The PES header used is shown in table 4.1. The PES header starts with a 3 byte packet start code prefix, which is always 0x000001, followed by a 1 byte stream id. The stream id is used to uniquely identify a particular stream; the stream id together with the start code prefix is known as the start code (4 bytes). Valid stream ids [30] range from 11000000 to 11011111 for audio streams and from 11100000 to 11101111 for video streams. Stream ids 11000000 (0xC0) and 11100000 (0xE0) are used for audio and video respectively in this implementation.
Field | Size | Description
Packet start code prefix | 3 bytes | always 0x000001
Stream id | 1 byte | unique ID to distinguish between audio and video PES packets; audio streams 0xC0-0xDF, video streams 0xE0-0xEF [30] (together with the prefix, these 4 bytes are known as the start code)
PES packet length | 2 bytes | the PES packet can be of any length; a value of zero can be used only when the PES packet payload is a video elementary stream
Time stamp | 2 bytes | frame number

Table 4.1. PES packet header format used [4].
The PES packet length may vary and go up to 65535 bytes. For longer elementary stream frames the packet length may be set as unbounded, i.e. 0, but only in the case of the video stream. The next two bytes in the header are the time stamp field, which contains the playback time information. In this project the frame number is used to calculate the playback time, which is explained in detail later.
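A sketch of building this 8-byte project-specific PES header (start code prefix, stream id, PES length, frame-number time stamp) follows; the helper name is hypothetical, while the stream ids 0xC0/0xE0 are the ones given in the text.

```cpp
// Build a PES packet with the modified header of table 4.1.
#include <cstdint>
#include <vector>

std::vector<uint8_t> makePes(bool isVideo, uint16_t frameNo,
                             const uint8_t* payload, uint16_t len) {
    std::vector<uint8_t> pes;
    pes.push_back(0x00); pes.push_back(0x00); pes.push_back(0x01);  // prefix
    pes.push_back(isVideo ? 0xE0 : 0xC0);                           // stream id
    pes.push_back(uint8_t(len >> 8));     pes.push_back(uint8_t(len));      // length
    pes.push_back(uint8_t(frameNo >> 8)); pes.push_back(uint8_t(frameNo));  // time stamp
    pes.insert(pes.end(), payload, payload + len);                  // frame data
    return pes;
}
```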
For video, each NAL unit is examined to determine whether it contains a coded frame or a parameter set. Parameter sets are very important and are required for the decoding process, so if a parameter set (SPS or PPS) is found, it is packetized separately and transmitted. If the NAL unit contains slice data, the frame number, counted from the beginning of the stream, is coded as the time stamp in the PES header. It has to be noted that parameter sets are not counted as frames, so while coding parameter sets the time stamp field is coded as zero. The stream id used for video is 11100000. The video PES packet is then formed by encapsulating the start code, frame length, time stamp and payload.
The continuity counter in the header allows the receiver to detect missing packets. Additional optional transport fields, whose presence may be signaled in the optional adaptation field, may follow. The rest of the packet consists of the payload. Packets are most often 188 bytes in length, but some transport streams consist of 204-byte packets which end in 16 bytes of Reed-Solomon error correction data. The 188-byte packet size was originally chosen for compatibility with ATM systems.
The Transport Error Indicator (EI) flag is set by the demodulator if it cannot correct errors in the stream, to tell the demultiplexer that the packet has an uncorrectable error. The Payload Unit Start Indicator (PUSI) flag indicates the start of PES data. The transport priority flag, when set, means higher priority than other packets with the same PID. Out of the 188 bytes, the header occupies 4 bytes and the remaining 184 bytes carry the payload.
For the purposes of this implementation not all the flags and fields mentioned above are required, hence a few changes have been made, although the framework and the packet size remain the same. The whole header is represented in 3 bytes instead of 4, and the rest of the packet is available for payload data. The modified transport stream (TS) packet is shown in fig 4.5.
[Fig 4.5. Transport stream (TS) packet format used in this project - sync byte (0x47), PUSI and AFC flags, 4-bit continuity counter (CC) and 10-bit PID forming a 3-byte header, followed by an optional 8-bit offset byte and the payload]
The sync byte (0x47) indicates the start of a new TS packet. It is followed by the payload unit start indicator (PUSI) flag, which when set indicates that the data payload contains the start of a new PES packet. The Adaptation Field Control (AFC) flag, when set, indicates that the whole of the allotted 185 bytes for the data payload is not occupied by PES data. This occurs when the PES data is smaller than 185 bytes. When this happens, the unoccupied bytes of the data payload are filled with filler data (all zeros in this case), and the length of the filler data is stored in a byte called the offset, placed right after the TS header; the offset is calculated as 185 minus the length of the PES data.
The Continuity Counter (CC) is a 4 bit field which is incremented by the multiplexer for each TS packet sent for a particular stream, i.e. audio PES or video PES; this information is used at the demultiplexer side to determine whether any packets are lost, repeated or out of sequence. The Packet ID (PID) is a unique 10 bit identifier describing the particular stream to which the data payload of the transport stream (TS) packet belongs. The MPEG2 transport stream has a concept of broadcast programs. Each single program is described by a Program Map Table (PMT), and the elementary streams associated with that program have PIDs listed in the PMT. For instance, a transport stream used in digital television might contain three programs, representing three television channels. Suppose each channel consists of one video stream and one audio stream. A receiver wishing to decode a particular "channel" merely has to decode the payloads of each PID associated with its program, and can discard the contents of all other PIDs. A transport stream with more than one program is referred to as an MPTS (multi program transport stream); similarly, a transport stream with a single program is referred to as a single program transport stream (SPTS). A PMT has its own PID and is transmitted at regular intervals. In this implementation only two streams, audio and video, are used, so the PMT is not required and the PIDs are assumed to be known by the decoders. The PIDs used in this implementation are 0000001110 (14) for the audio stream and 0000001111 (15) for the video stream. Optional offset byte: as described above, if the adaptation field control flag is set, this byte is filled with the length of the filler data (zeros).
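Packing the modified 188-byte TS packet of fig 4.5 could look like the sketch below. The thesis fixes only the field widths; the exact bit ordering of PUSI, AFC, CC and PID inside the two header bytes after the sync byte is an assumption here, as is the helper name.

```cpp
// Build one 188-byte TS packet: 3-byte header, optional offset byte, payload.
#include <array>
#include <cstdint>
#include <cstring>

std::array<uint8_t, 188> makeTs(bool pusi, uint8_t cc, uint16_t pid,
                                const uint8_t* data, int len) {  // len <= 185
    std::array<uint8_t, 188> pkt{};            // zero fill doubles as filler data
    bool afc = len < 185;                      // payload shorter than 185 bytes
    uint16_t fields = (pusi ? 1u : 0u) << 15   // PUSI flag
                    | (afc  ? 1u : 0u) << 14   // AFC flag
                    | (uint16_t)(cc & 0x0F) << 10
                    | (pid & 0x3FF);           // 10-bit PID
    pkt[0] = 0x47;                             // sync byte
    pkt[1] = uint8_t(fields >> 8);
    pkt[2] = uint8_t(fields);
    int pos = 3;
    if (afc) pkt[pos++] = uint8_t(185 - len);  // offset = 185 - PES data length
    std::memcpy(&pkt[pos], data, len);
    return pkt;
}
```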
The de-multiplexer reads the time stamp of each access unit and stores the elementary streams in buffers. When the timeline count reaches the value of the time stamp, the buffer is read out. This operation has two desirable results. First, effective time base correction is obtained in each elementary stream. Second, the video and audio elementary streams can be synchronized together to make a program. Traditionally, to enable the decoder to maintain synchronization between the audio track and the video frames, a 33 bit encoder clock sample called the Program Clock Reference (PCR) is transmitted in the adaptation field of a TS packet from time to time (every 100 ms). The PCR transport stream (TS) packet has its own PID that is recognized by the decoder. This is used to generate the system timing clock (STC) in the decoder, which provides an accurate time base. This, along with the presentation time stamp (PTS) field that resides in the PES packet layer of the transport stream, is used to synchronize the audio and video elementary streams. This project uses the frame numbers of both audio and video as time stamps to synchronize the streams. This section explains how frame numbers can be used to synchronize audio and video streams. As explained in sections 2.6 and 3.7, both the H.264 and HE AAC v2 bit streams are organized into access units, i.e. frames, separated by their respective sync sequences. A particular video sequence has a fixed frame rate during playback, specified in frames per second (fps). So, assuming that the decoder has prior knowledge of the fps of the stream, the presentation time or playback time of a particular video frame can be calculated using (4.1):

t_{video} = n_{video} / fps    (4.1)

where n_{video} is the video frame number carried in the time stamp.
The AAC compression standard defines each audio frame to contain 1024 samples per channel, and this is true for HE AAC v2 as well. The sampling frequency of the audio stream can be extracted from the sampling frequency index field of the ADTS header, and it remains the same for a particular audio stream. Since both the samples per frame and the sampling frequency are fixed, the audio frame rate also remains constant throughout a particular audio stream. Hence the presentation time of a particular audio frame (assuming stereo) can be calculated using (4.2):

t_{audio} = (n_{audio} * 1024) / f_s    (4.2)

where n_{audio} is the audio frame number and f_s is the sampling frequency.
The same expression can be expanded for multi-channel audio streams by taking the number of channels into account. Hence, by using (4.1) and (4.2), the presentation times of the frames can be calculated by coding the frame numbers as the time stamps. Also, once the presentation time of one stream is calculated, the frame number of the second stream that has to be played at that particular time can be calculated. This approach is used at the decoder to achieve audio-video synchronization or lip synchronization, and is explained in detail in later chapters. Using frame numbers as time stamps has many advantages over the traditional PCR approach: there is no need to send additional Transport Stream (TS) packets with PCR information, the overall complexity is reduced, clock jitter need not be considered during synchronization, and the time stamp field in the PES packet is smaller, just 16 bits to encode the frame number
compared to 33 bits for the Presentation Time Stamp (PTS), which carries a sample of the encoder clock. The time stamp field in this project is encoded in 2 bytes in the PES header, which implies that it can carry frame numbers up to 65535. Once the frame number of either stream exceeds this number, which is a possibility in the case of long video and audio sequences, the frame number is reset to 1. The reset is done simultaneously on both the audio and video frame numbers as soon as the frame number of either one of the streams crosses the limit. This does not create a frame number conflict at the demultiplexer side during synchronization, because the audio and video buffer sizes are much smaller than the maximum allowed frame number; hence at no point in time will there be two frames in the buffer with the same time stamp. The next chapter addresses the multiplexing scheme used to multiplex the audio and video elementary streams.
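A small sketch of (4.1) and (4.2) as the decoder would use them; the values in main() reproduce the first row of table 7.1 (24 fps video, 24 kHz audio).

```cpp
// Presentation times recovered from the frame numbers in the PES time stamps.
#include <cstdio>

double videoPresentationTime(unsigned frameNo, double fps) {
    return frameNo / fps;                  // (4.1)
}

double audioPresentationTime(unsigned frameNo, double samplingHz) {
    return frameNo * 1024.0 / samplingHz;  // (4.2): 1024 samples per frame
}

int main() {
    std::printf("video frame 100 plays at %.3f s\n", videoPresentationTime(100, 24.0));
    std::printf("audio frame  98 plays at %.3f s\n", audioPresentationTime(98, 24000.0));
}
```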
CHAPTER 5 MULTIPLEXING
Multiplexing is a process where Transport Stream (TS) packets are generated and transmitted in such a way that the buffers at the decoder (demultiplexer) do not overflow or underflow. Buffer overflow or underflow of the video and audio elementary streams can cause skips or freeze/mute errors in video and audio playback. Various systems adopt different methods to prevent this at the decoder side; for example, when a potential buffer overflow is detected, null packets (transmitted to maintain a constant bit rate) are deleted, or the presentation time is delayed by a few frames until both buffers have content to be played back at that particular presentation time.
The flow chart of the multiplexing scheme used in this project is shown in figures 5.1, 5.2 and 5.3. The basic logic relies on both the audio and video sequences having constant frame rates. For video, the frames per second value remains the same throughout the video sequence. In an audio sequence, since the sampling frequency remains constant throughout the sequence and the samples per frame are fixed (1024), the frame duration also remains constant. For transmission, a PES packet, which represents one frame, is logically broken down into n_{TS} TS packets of 188 bytes each, where n_{TS} depends on the PES packet size. The exact presentation time of each TS packet may be calculated as shown in (5.1), (5.2) and (5.3), where n_{TS} is the number of TS packets required to represent the corresponding PES packet or frame:

n_{TS} = ceil( PES packet length / 185 )    (5.1)
TS_{duration} = (1 / fps) / n_{TS}    (5.2)
t_{TS}(i) = t_{TS}(i-1) + TS_{duration}    (5.3)
The corresponding quantities for the audio stream are given by (5.4), (5.5) and (5.6), where the audio frame duration 1024/f_s takes the place of the video frame duration 1/fps:

n_{TS} = ceil( PES packet length / 185 )    (5.4)
TS_{duration} = (1024 / f_s) / n_{TS}    (5.5)
t_{TS}(i) = t_{TS}(i-1) + TS_{duration}    (5.6)
From (5.3) and (5.6) it may be observed that the presentation time of the current TS packet is the cumulative sum of the presentation time of the previous TS packet (of the same type) and the current TS duration. The decision to transmit a particular TS packet (audio or video) is made by comparing their respective presentation times: whichever stream has the lower value is scheduled to transmit a TS packet. This makes sure that the audio and video content get equal priority and are transmitted uniformly. Once the decision about which TS to transmit is made, control goes to one of the blocks where the actual generation and transmission of TS and PES packets takes place. In the audio/video processing block, the first step is to check whether the multiplexer is still in the middle of a frame or at the beginning of a new frame. If a new frame is being processed, (5.2) or (5.5) is executed as appropriate to find the TS duration. This information is used to update the TS presentation time at a
later stage. Next, data is read from the PES packet concerned; if the PES packet is bigger than 185 bytes, only the first 185 bytes are read out and the PES packet is adjusted accordingly. If the current TS packet is the last packet for that PES packet, a new PES packet for the next frame (of that stream) is generated. Now the 185 bytes of payload data and all the remaining information are ready for generating the transport stream (TS) packet. Once a TS packet is generated, the TS presentation time is updated using (5.3) or (5.6). Then control goes back to the presentation time decision block, and the whole process is repeated until all the video and audio frames are transmitted. It has to be noted that one of the streams, i.e. video or audio, may be transmitted completely before the other; in that case only the processing block of the stream still pending transmission is operated. The next chapter describes the de-multiplexing algorithm used and the method used to achieve audio-video synchronization.
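The scheduling decision above can be condensed into the following sketch. Packet generation, PES bookkeeping and real PES sizes are elided; the fixed count of 10 TS packets per frame is a hypothetical stand-in for (5.1)/(5.4).

```cpp
// Alternate between streams by comparing running TS presentation times.
#include <cstdio>

struct StreamState {
    double tsTime;     // presentation time of the last TS packet sent
    double frameDur;   // 1/fps (video) or 1024/fs (audio)
    long   bytesLeft;  // remaining elementary-stream bytes (stand-in)
};

void sendNextTs(StreamState& s) {
    const int nTs = 10;               // hypothetical n_TS per frame, see (5.1)/(5.4)
    s.tsTime    += s.frameDur / nTs;  // (5.2)/(5.5), then (5.3)/(5.6)
    s.bytesLeft -= 185;               // one TS payload consumed
}

int main() {
    StreamState video{0.0, 1.0 / 24.0,       2000000};
    StreamState audio{0.0, 1024.0 / 24000.0,  200000};
    while (video.bytesLeft > 0 || audio.bytesLeft > 0) {
        if (audio.bytesLeft <= 0)              sendNextTs(video);
        else if (video.bytesLeft <= 0)         sendNextTs(audio);
        else if (video.tsTime <= audio.tsTime) sendNextTs(video);  // lower time first
        else                                   sendNextTs(audio);
    }
    std::printf("done: video t=%.3f s, audio t=%.3f s\n", video.tsTime, audio.tsTime);
}
```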
[Fig 5.1. Overall multiplexer flow diagram]
[Fig 5.2. Flow chart of the video processing block - for a new frame, calculate the number of TS packets and the TS duration; take at most the first 185 bytes of the PES packet per TS packet, update the PES data and length, and generate the PES packet for the next video frame when the current one is exhausted]
[Fig 5.3. Flow chart of the audio processing block - identical in structure to the video processing block, operating on audio PES packets]
CHAPTER 6 DE-MULTIPLEXING
The Transport Stream (TS) input to a receiver is separated into a video elementary stream and an audio elementary stream by the demultiplexer. The video and audio elementary streams are temporarily stored in the video and audio buffers, respectively. The basic flow chart of the demultiplexer is shown in figure 6.1. After a packet is received, it is checked for the sync byte (0x47) to determine whether the packet is valid. If invalid, that packet is skipped and de-multiplexing continues with the next packet. The header of a valid TS packet is read to extract fields such as the packet ID (PID), adaptation field control (AFC) flag, payload unit start indicator (PUSI) flag and the 4 bit continuity counter. Then the payload is prepared to be read into the appropriate buffer. By checking the AFC flag it is known whether an offset value has to be applied or all 185 bytes in the TS packet carry payload data; if the AFC flag is set, the payload is extracted by skipping over the stuffing bytes. The PUSI bit is checked to see whether the present TS packet contains a PES header. If so, the PES header is first checked for the presence of the sync sequence (i.e. 0x000001); if the sequence is absent, the packet is discarded and the next TS packet is processed. If valid, the PES header is read and fields such as the stream ID, PES length and frame number are extracted. The PID is then checked to see whether it is an audio or a video TS packet, and the payload is written into the respective buffer. If the TS packet payload contained a PES header, information such as the frame number, its location in the corresponding buffer and the PES length is stored in a separate array, which is later used for synchronizing the audio and video streams.
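A per-packet sketch of these demultiplexer steps is shown below. PES-header validation and frame-number bookkeeping are elided; the header bit layout mirrors the multiplexer sketch of chapter 4 (an assumption), with the PIDs 14 and 15 as given there.

```cpp
// Route one received TS packet into the audio or video buffer.
#include <cstdint>
#include <vector>

bool demuxPacket(const uint8_t pkt[188],
                 std::vector<uint8_t>& audioBuf,
                 std::vector<uint8_t>& videoBuf) {
    if (pkt[0] != 0x47) return false;           // invalid packet: caller skips it
    uint16_t f   = (uint16_t(pkt[1]) << 8) | pkt[2];
    bool     afc = ((f >> 14) & 1) != 0;        // PUSI is bit 15, CC is bits 13-10
    uint16_t pid = f & 0x3FF;                   // 10-bit PID
    int pos = 3, len = 185;
    if (afc) { len = 185 - pkt[3]; pos = 4; }   // offset byte gives filler length
    // PUSI handling (PES sync-sequence check, frame-number bookkeeping) elided.
    if (pid == 14)      audioBuf.insert(audioBuf.end(), pkt + pos, pkt + pos + len);
    else if (pid == 15) videoBuf.insert(videoBuf.end(), pkt + pos, pkt + pos + len);
    return true;                                // packets with other PIDs discarded
}
```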
[Fig 6.1. Flow chart of the de-multiplexer - validate the sync byte, read the PID and AFC flag, adjust for the offset when AFC is set, read the PES header (PES length, frame number, stream ID) when PUSI is set and the sync sequence is present, write the payload into the audio or video buffer and save the frame number and buffer location; once the video buffer is full, search for the next IDR frame, calculate the corresponding audio frame and write both buffers from those frames into their respective bitstream files (.264 and .aac)]
After the payload has been written into the audio/video buffer, the video buffer is checked for fullness. Since video files are always much larger than audio files, the video buffer fills up first. Once the video buffer is full, the next occurring IDR frame is searched for in the video buffer. Once found, its frame number is noted and used to calculate the corresponding audio frame number that has to be played at that time, which is given by (6.1):

n_{audio} = round( (n_{video} / fps) * (f_s / 1024) )    (6.1)
The above equation is used to synchronize the audio and video streams. Once the frame numbers are obtained, the audio and video elementary streams can be reconstructed by writing the audio and video buffer contents from those frames into their respective elementary streams, i.e. .aac and .264 files. The streams are then merged into a container format using the mkvmerge software [31], which is freely available. The resulting container file can be played back by a video player such as VLC media player [32] or GOM media player [33]. In the case of the video sequence, to ensure proper playback, the picture and sequence parameter sets must be inserted before the first IDR frame. The reason the de-multiplexing is carried out from an IDR (instantaneous decoder refresh) frame is that the IDR frame breaks the sequence, making sure that later frames such as P-frames do not use frames before the IDR frame for motion estimation; this is not true of a normal I-frame. So, in a long sequence, the GOPs after the IDR frame are treated as a new sequence by the H.264 decoder. In the case of audio, the HE AAC v2 decoder can play back the sequence from any audio frame.
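Equation (6.1) in code form, as a minimal sketch; the example values in main() are hypothetical.

```cpp
// Map the IDR video frame number to the audio frame playing at the same time.
#include <cmath>
#include <cstdio>

unsigned correspondingAudioFrame(unsigned videoFrameNo, double fps, double fs) {
    double t = videoFrameNo / fps;                 // video presentation time, (4.1)
    return (unsigned)std::lround(t * fs / 1024.0); // nearest audio frame, (6.1)
}

int main() {
    // e.g. an IDR found at video frame 240 in a 24 fps / 24 kHz stream
    std::printf("start audio from frame %u\n",
                correspondingAudioFrame(240, 24.0, 24000.0));
}
```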
The time difference between the playback of the audio and video streams is known as skew [47]. According to the survey in [47], the in-sync region spans a skew of -80 ms (audio behind video) to +80 ms (video behind audio); in-sync refers to the range of skew values where the synchronization error is not perceptible. The MPEG-2 systems specification [4] defines a skew threshold of 40 ms. Once the streams are synchronized, the skew remains constant throughout. This possible maximum skew is a limitation of the project; however, the value remains well within the allowed range. Simulation results using the audio and video test sequences are presented in the next chapter.
CHAPTER 7
RESULTS
The multiplexing algorithm was implemented in MATLAB and the demultiplexing algorithm in C++. JM (joint model) 16.1 [35] and the 3GPP enhanced aacplus encoder [36] were used for encoding the raw video and audio sequences respectively. The GOP sequence adopted for video encoding was IPPP, using the H.264 baseline profile. For audio encoding, the bitrate was set at 32 kbps to enable parametric stereo.
Video frames in buffer | Audio frames in buffer | Video buffer size (KB) | Audio buffer size (KB) | Video content playback time (s) | Audio content playback time (s)
100 | 98 | 771.076 | 17.49 | 4.166 | 4.181
200 | 196 | 1348.359 | 34.889 | 8.333 | 8.362
300 | 293 | 1770.271 | 52.122 | 12.5 | 12.51
400 | 391 | 2238.556 | 69.519 | 16.666 | 16.682
500 | 489 | 2612.134 | 86.949 | 20.833 | 20.864
600 | 586 | 3158.641 | 104.165 | 25 | 25.002
700 | 684 | 3696.039 | 121.627 | 29.166 | 29.184
800 | 782 | 4072.667 | 139.043 | 33.333 | 33.365
900 | 879 | 4500.471 | 156.216 | 37.5 | 37.504
1000 | 977 | 4981.05 | 173.657 | 41.666 | 41.685

Table 7.1. Video and audio buffer sizes and their respective playback times
Characteristic | Test clip 1 | Test clip 2
Clip length (sec) | 30 | 50
Video FPS | 24 | 24
Audio sampling frequency (Hz) | 24000 | 24000
Total video frames | 721 | 1199
Total audio frames | 704 | 1173
Video raw file (.yuv) size (kB) | 105447 | 175354
Audio raw file (.wav) size (kB) | 5626 | 9379
H.264 file size (kB) | 1273 | 1365
AAC file size (kB) | 92 | 204
Video compression ratio | 82.82 | 128.4
Audio compression ratio | 61.15 | 45.97
H.264 encoder bitrate (kBps) | 42.43 | 27.3
AAC encoder bitrate (kbps) | 32 | 32
Total TS packets | 8741 | 9858
Transport stream size (kB) | 1605 | 1810
Transport stream bitrate (kBps) | 53.49 | 36.2
Test clip size (kB) | 1376.78 | 1576.6
Reconstructed clip size (kB) | 1312.45 | 1563.22

Table 7.2. Characteristics of test clips used
Table 7.3 shows the skew for various starting TS packets. The delay column indicates the skew achieved when demultiplexing was started from a different TS packet number. The maximum theoretical value is 21 ms, because the sampling frequency used is 24,000 Hz (an audio frame duration of about 42.7 ms). As seen, the worst skew is 13 ms, and in most cases the skew is around 10 ms or below. This is well below the MPEG2 threshold of 40 ms. Chapter 8 outlines the conclusions, followed by future work in chapter 9.
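The 21 ms figure follows directly from the audio frame duration: playback can start only on an audio frame boundary, and rounding to the nearest frame in (6.1) leaves at most half a frame of skew:

max skew = (1/2) * (1024 / 24000) s ≈ 21.3 ms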
Chosen video frame presentation time (s) | Chosen audio frame presentation time (s) | Delay (ms) | Perceptible?
0.5416 | 0.5546 | 13 | no
1.208 | 1.1946 | 13 | no
1.375 | 1.365 | 10 | no
1.875 | 1.877 | 2 | no
2.208 | 2.218 | 10 | no
3.041 | 3.03 | 11 | no
3.708 | 3.712 | 4 | no

Table 7.3: Demultiplexer output (the delay is the absolute difference of the two presentation times)
CHAPTER 8 CONCLUSIONS
This project implemented an effective multiplexing and demultiplexing scheme with synchronization. The latest codecs, H.264 and HE AAC v2, were used. Both encoders achieved very high compression ratios; as a result, the transport stream bitrate requirement was contained to about 50 kBps. Buffer fullness was also handled effectively, with the maximum buffer difference observed being around 20 ms of media content. During decoding, audio-video synchronization was achieved with a maximum skew of 13 ms.
References:
[1] MPEG-4: ISO/IEC JTC1/SC29 14496-10: Information technology - Coding of audio-visual objects - Part 10: Advanced Video Coding, ISO/IEC, 2005.
[2] MPEG-4: ISO/IEC JTC1/SC29 14496-3: Information technology - Coding of audio-visual objects - Part 3: Audio, Amendment 4: Audio lossless coding (ALS), new audio profiles and BSAC extensions.
[3] MPEG-2: ISO/IEC JTC1/SC29 13818-7: Advanced audio coding (AAC), International Standard IS WG11, 1997.
[4] MPEG-2: ISO/IEC 13818-1: Information technology - Generic coding of moving pictures and associated audio - Part 1: Systems, ISO/IEC, 2005.
[5] Soon-kak Kwon et al, "Overview of H.264/MPEG-4 Part 10", special issue on the emerging H.264/AVC video coding standard, J. Visual Communication and Image Representation, vol. 17, pp. 186-216, April 2006.
[6] A. Puri et al, "Video coding using the H.264/MPEG-4 AVC compression standard", Signal Processing: Image Communication, vol. 19, pp. 793-849, Oct. 2004.
[7] "MPEG-4 HE-AAC v2 - audio coding for today's digital media world", EBU Technical Review (01/2006). Link: http://tech.ebu.ch/docs/techreview/trev_305-moser.pdf
[8] ETSI TS 101 154: Implementation guidelines for the use of video and audio coding in broadcasting applications based on the MPEG-2 transport stream.
[9] 3GPP TS 26.401: General audio codec audio processing functions; Enhanced aacPlus general audio codec; 2009.
[10] 3GPP TS 26.403: Enhanced aacPlus general audio codec; Encoder specification - AAC part.
[11] 3GPP TS 26.404: Enhanced aacPlus general audio codec; Encoder specification - SBR part.
[12] 3GPP TS 26.405: Enhanced aacPlus general audio codec; Encoder specification - Parametric Stereo part.
[13] E. Schuijers et al, "Low complexity parametric stereo coding", Audio Engineering Society, May 2004. Link: http://www.jeroenbreebaart.com/papers/aes/aes116_2.pdf
[15] MPEG-4: ISO/IEC JTC1/SC29 14496-14: Information technology - Coding of audio-visual objects - Part 14: MP4 file format, 2003.
[16] DVB-H: Global mobile TV. Link: http://www.dvb-h.org/
[17] ATSC-M/H. Link: http://www.atsc.org/cms/
[18] Open mobile video coalition. Link: http://www.openmobilevideo.com/about-mobiledtv/standards/
[19] VC-1 Compressed Video Bitstream Format and Decoding Process (SMPTE 421M-2006), SMPTE Standard, 2006 (http://store.smpte.org/category-s/1.htm).
[20] Henning Schulzrinne's RTP page. Link: http://www.cs.columbia.edu/~hgs/rtp/
[21] G.A. Davidson et al, "ATSC video and audio coding", Proc. IEEE, vol. 94, pp. 60-76, Jan. 2006 (www.atsc.org).
[22] I.E.G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, Wiley, 2003.
[23] European Broadcasting Union. Link: http://www.ebu.ch/
[24] Shintaro Ueda et al, "NAL level stream authentication for H.264/AVC", IPSJ Digital Courier, vol. 3, Feb. 2007.
[25] World DMB. Link: http://www.worlddab.org/
[26] ISDB website. Link: http://www.dibeg.org/
[27] 3GPP website. Link: http://www.3gpp.org/
[28] M. Modi, "Audio compression gets better and more complex". Link: http://www.eetimes.com/discussion/other/4025543/Audio-compression-gets-better-andmore-complex
[29] P.A. Sarginson, "MPEG-2: Overview of the systems layer". Link: http://downloads.bbc.co.uk/rd/pubs/reports/1996-02.pdf
[30] MPEG-2: ISO/IEC 13818-1: Generic coding of moving pictures and audio: Part 1 - Systems, Amendment 3: Transport of AVC video data over ITU-T Rec. H.222.0 | ISO/IEC 13818-1 streams, 2003.
[31] MKV merge software. Link: http://www.matroska.org/
[32] VLC media player. Link: http://www.videolan.org/
[33] GOM media player. Link: http://www.gomlab.com/
[34] H. Murugan, "Multiplexing H.264 video bit-stream with AAC audio bit-stream, demultiplexing and achieving lip sync during playback", M.S.E.E. thesis, University of Texas at Arlington, TX, May 2007.
[35] H.264/AVC JM software. Link: http://iphome.hhi.de/suehring/tml/download/
[36] 3GPP Enhanced aacPlus reference software. Link: http://www.3gpp.org/ftp/
[37] MPEG-2: ISO/IEC JTC1/SC29 13818-2: Information technology - Generic coding of moving pictures and associated audio information: Part 2 - Video, ISO/IEC, 2000.
[38] MPEG-4: ISO/IEC JTC1/SC29 14496-2: Information technology - Coding of audio-visual objects: Part 2 - Visual, ISO/IEC, 2004.
[39] T. Wiegand et al, "Overview of the H.264/AVC video coding standard", IEEE Trans. CSVT, vol. 13, pp. 560-576, July 2003.
[40] ATSC-Mobile DTV Standard, Part 7 - AVC and SVC video system characteristics. Link: http://www.atsc.org/cms/standards/a153/a_153-Part-7-2009.pdf
[41] ATSC-Mobile DTV Standard, Part 8 - HE AAC audio system characteristics. Link: http://www.atsc.org/cms/standards/a153/a_153-Part-8-2009.pdf
[42] H.264 Video Codec - Inter Prediction. Link: http://mrutyunjayahiremath.blogspot.com/2010/09/h264-inter-predn.html
[43] B.A. Cipra, "The ubiquitous Reed-Solomon codes". Link: http://www.eccpage.com/reed_solomon_codes.html
[44] VC-1 technical overview. Link: http://www.microsoft.com/windows/windowsmedia/howto/articles/vc1techoverview.aspx
[45] Dirac video compression website. Link: http://diracvideo.org/
[46] MPEG-2: ISO/IEC JTC1/SC29/WG11 13818-3: Coding of moving pictures and associated audio: Part 3 - Audio, Nov. 1994.
[47] G. Blakowski et al, "A media synchronization survey: reference model, specification, and case studies", IEEE Journal on Selected Areas in Communications, vol. 14, no. 1, Jan. 1996.