MULTIPLEXING THE ELEMENTARY STREAMS OF H.264 VIDEO AND MPEG4 HE AAC v2 AUDIO USING MPEG2 SYSTEMS SPECIFICATION, DEMULTIPLEXING AND ACHIEVING LIP SYNCHRONIZATION DURING PLAYBACK
by
NAVEEN SIDDARAJU
Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE
ACKNOWLEDGEMENTS

I am greatly thankful to my supervising professor Dr. K. R. Rao, whose constant encouragement, guidance and support have helped me in the smooth completion of the project. He has always been accessible and helpful throughout. I also thank him for introducing me to the field of multimedia processing. I would like to thank Dr. W. Alan Davis and Dr. William E. Dillon for taking interest in my project and agreeing to be part of my project defense committee. I am forever grateful to my parents for their unconditional support at each turn of the road. I thank my brother and sisters, who have always been a source of inspiration. I would like to thank my friends, both in the US and in India, for their encouragement and support.

November 22, 2010
ABSTRACT
MULTIPLEXING THE ELEMENTARY STREAMS OF H.264 VIDEO AND MPEG4 HE AAC v2 AUDIO USING MPEG2 SYSTEMS SPECIFICATION, DEMULTIPLEXING AND ACHIEVING LIP SYNCHRONIZATION DURING PLAYBACK

Naveen Siddaraju, MS

The University of Texas at Arlington, 2010

Supervising Professor: Dr. K. R. Rao
Delivering broadcast quality content to mobile customers is one of the most challenging tasks in the world of digital broadcasting. The limited network bandwidth and processing capability of handheld devices are critical factors that must be considered. Hence the selection of compression schemes for the media content is very important from both economic and quality points of view. H.264, also known as Advanced Video Coding (AVC) [1], is the latest and most advanced video codec available in the market today. The H.264 baseline profile, which is used in applications such as mobile television (mobile DTV) broadcast, offers one of the best compression ratios among the profiles and requires the least processing power at the decoder. MPEG4 HE AAC v2 [2], also known as enhanced aacplus, is the latest audio codec in the AAC (advanced audio coding) [3] family. In addition to the core AAC, it uses tools such as Spectral Band Replication (SBR) [2] and Parametric Stereo (PS) [2], resulting in the best perceived quality at the lowest
bitrates. The audio and video codec standards have been chosen based on ATSC-M/H (advanced television systems committee - mobile/handheld) [17]. For television broadcasting applications such as ATSC-M/H and DVB [16], the encoded audio and video streams should be transmitted in a single transport stream containing fixed sized data packets, which can be easily recognized and decoded at the receiver. The goal of the project is to implement a multiplexing scheme for the elementary streams of H.264 baseline and HE AAC v2 using the MPEG2 systems specifications [4], then demultiplex the transport stream and play back the decoded elementary streams with lip synchronization, or audio-video synchronization. The multiplexing involves two layers of packetization of the elementary streams of audio and video. The first level of packetization results in Packetized Elementary Stream (PES) packets, which are variable size packets and hence not suitable for transport. MPEG2 defines a transport stream where PES packets are logically organized into fixed size packets called Transport Stream (TS) packets, which are 188 bytes long. These packets are continuously generated to form a transport stream, which is decoded by the receiver and the original elementary streams are reconstructed. The PES packets, which are logically encapsulated into the TS packets, carry the time stamp information that is used at the de-multiplexer to achieve synchronization between the audio and video elementary streams.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACRONYMS AND ABBREVIATIONS

Chapter
1. INTRODUCTION
2. OVERVIEW OF H.264
   2.1 H.264/AVC
   2.2 Coding structure
   2.3 Profiles and levels
   2.4 Description of various profiles
      2.4.1 Baseline profile
      2.4.2 Extended profile
      2.4.3 Main profile
      2.4.4 High profiles
   2.5 H.264 encoder and decoder
      2.5.1 Intra prediction
      2.5.2 Inter prediction
      2.5.3 Transform and quantization
      2.5.4 Entropy coding
      2.5.5 Deblocking filter
   2.6 H.264 bitstream
3. OVERVIEW OF HE AAC v2
   3.1 HE AAC v2
   3.2 Spectral Band Replication (SBR)
   3.3 Parametric Stereo (PS)
   3.4 Enhanced aacplus encoder
   3.5 Enhanced aacplus decoder
   3.6 Advanced Audio Coding (AAC)
      3.6.1 AAC encoder
   3.7 HE AAC v2 bitstream formats
4. TRANSPORT PROTOCOLS
   4.1 Introduction
   4.2 Real-Time Protocol (RTP)
   4.3 MPEG2 systems layer
   4.4 Packetized elementary stream (PES)
      4.4.1 PES encapsulation process
   4.5 MPEG transport stream (MPEG-TS)
   4.6 Time stamps
5. MULTIPLEXING
6. DE-MULTIPLEXING
   6.1 Lip or audio-video synchronization
7. RESULTS
   7.1 Buffer fullness
   7.2 Synchronization/skew calculation
8. CONCLUSIONS
9. FUTURE WORK
References
LIST OF FIGURES

Fig 2.1. Video data organization in H.264 [42]
Fig 2.2. Specific coding parts of the profiles in H.264 [5]
Fig 2.3. Different YUV systems
Fig 2.4. H.264 encoder [5]
Fig 2.5. H.264 decoder [5]
Fig 2.6. Intra prediction modes for 4x4 luma in H.264
Fig 2.7. Different layers of JVT coding
Fig 2.8. NAL formatting of VCL and non-VCL data [6]
Fig 2.9. NAL unit format [6]
Fig 2.10. Relationship between parameter sets and picture slices [24]
Fig 3.1. HE AAC audio codec family
Fig 3.2. Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7]
Fig 3.4. Original audio signal [28]
Fig 3.5. High band reconstruction through SBR [28]
Fig 3.6. Enhanced aacplus encoder block diagram [9]
Fig 3.7. Enhanced aacplus decoder block diagram [9]
Fig 3.8. AAC encoder block diagram [10]
Fig 3.9. ADTS elementary stream
Fig 4.1. RTP packet structure (simplified) [22]
Fig 4.2. MPEG2 transport stream [22]
Fig 4.3. Conversion of an elementary stream into PES packets [29]
Fig 4.4. A standard MPEG-TS packet structure [14]
Fig 4.5. Transport stream (TS) packet format used in this project
Fig 5.1. Overall multiplexer flow diagram
Fig 5.2. Flow chart of video processing block
Fig 5.3. Flow chart of audio processing block
Fig 6.1. Flow chart for the de-multiplexer used
LIST OF TABLES

Table 2.1. NAL unit types
Table 3.1. ADTS header format [2][3]
Table 3.2. Profile bits expansion [2][3]
Table 4.1. PES packet header format used [4]
Table 7.1. Video and audio buffer sizes and their respective playback times
Table 7.2. Characteristics of test clips used
Table 7.3. Demultiplexer output
ACRONYMS AND ABBREVIATIONS

3GPP - Third generation partnership project
AAC - Advanced audio coding
AAC LTP - Advanced audio coding - long term prediction
ADIF - Audio data interchange format
ADTS - Audio data transport stream
AFC - Adaptation field control
ATM - Asynchronous transfer mode
ATSC - Advanced television systems committee
ATSC-M/H - Advanced television systems committee - mobile/handheld
AVC - Advanced Video Coding
CABAC - Context-based Adaptive Binary Arithmetic Coding
CAVLC - Context-based Adaptive Variable Length Coding
CC - Continuity counter
CIF - Common intermediate format
CRC - Cyclic redundancy check
DCT - Discrete cosine transform
DPCM - Differential pulse code modulation
DVB - Digital video broadcasting
DVD - Digital video disc
EI - Error indicator
ES - Elementary stream
FMO - Flexible macro block order
GOP - Group of pictures
HDTV - High definition television
HE AAC v2 - High efficiency advanced audio codec version 2
IC - Inter-channel Coherence
IDR - Instantaneous decoder refresh
IID - Inter-channel Intensity Differences
IP - Internet protocol
IPD - Inter-channel Phase Differences
ISDB - Integrated Services Digital Broadcasting system
I-slice - Intra predictive slice
ISO - International Standards Organization
ITU - International Telecommunication Union
JM - Joint model
JVT - Joint Video Team
M4A - Moving picture experts group four file format, audio only
MB - Macro block
MC - Motion compensation
MDCT - Modified discrete cosine transform
ME - Motion estimation
MP4 - Moving picture experts group four file format
MPEG - Moving Picture Experts Group
MPTS - Multi program transport stream
NALU - Network Abstraction Layer Unit
OPD - Overall Phase Difference
PCR - Program clock reference
PES - Packetized elementary stream
PID - Packet identifier
PMT - Program map table
PPS - Picture parameter set
PS - Parametric stereo
P-slice - Predictive slice
PTS - Presentation time stamp
PUSI - Payload unit start indicator
QCIF - Quarter common intermediate format
QMF - Quadrature mirror filter banks
RTP - Real time protocol
SBR - Spectral band replication
SCI - Simplified chrominance intra prediction
S-DMB - Satellite digital multimedia broadcasting
SDTV - Standard definition television
SEI - Supplemental enhancement information
SPS - Sequence parameter set
SPTS - Single program transport stream
STC - System timing clock
TCP - Transmission control protocol
TNS - Temporal noise shaping
TS - Transport stream
UDP - User datagram protocol
VC1 - Video Codec 1
VCEG - Video Coding Experts Group
VCL - Video coding layer
VLC - Variable length coding
VUI - Video usability information
YUV - Luminance and chrominance color components
CHAPTER 1 INTRODUCTION
Mobile broadcast systems are increasingly important as cellular phones and highly efficient digital video compression techniques merge to enable digital TV and multimedia reception on the move. There are several digital mobile TV broadcast systems in the market. Major ones are DVB-H (digital video broadcast - handheld) [16] and ATSC-M/H (advanced television systems committee - mobile/handheld) [17] [18]. Both DVB-H and ATSC-M/H have a relatively small channel bandwidth allocation (~14 Mbps for DVB-H and ~19.6 Mbps for ATSC-M/H), so the choice of multimedia compression standard and transport protocol becomes very important. DVB-H specifies the use of either the VC1 [19] or the H.264 [1] compression standard for video and AAC [3] for audio. ATSC-M/H specifies the H.264 baseline profile for video and HE AAC v2 [2] for audio. The transport protocol is usually RTP (real time protocol) [20] or MPEG2 part 1 systems [4]. The MPEG-2 systems specification [4] describes how MPEG-compressed video and audio data streams may be multiplexed together with other data to form a single data stream suitable for digital transmission or storage. Two alternative streams are specified for the MPEG-2 systems layer. The program stream is used for the storage of multimedia content, as on a DVD, while the transport stream is intended for the simultaneous delivery of a number of programs over potentially error-prone channels. In this project, the compression standards used are the H.264 baseline profile for video and HE AAC v2 for audio, as specified by the ATSC-M/H standard. Distribution is achieved through the MPEG-2 part 1 systems specification transport stream. Chapters 2 and 3 give a brief overview of the H.264 and HE AAC v2 compression standards respectively. Chapter 4 explains the transport stream
protocol used in this project. Chapters 5 and 6 explain the multiplexing and demultiplexing schemes used in this project. All the results are tabulated in chapter 7. Chapters 8 and 9 outline the conclusions of the project and future work respectively.
There are basically three types of slices: I (intra predictive), P (predictive) and B (bi-predictive) slices. I slices are strictly intra coded, i.e. macro blocks (MB) are compressed without using any motion prediction from earlier slices. A special type of picture containing only I-slices is called an instantaneous decoder refresh (IDR) picture. Pictures following the IDR picture do not use the pictures prior to the IDR for their motion prediction. IDR pictures can be used for random access or as entry points for a coded sequence [6]. P-slices on the other hand contain macroblocks which use motion prediction. The MBs of a P-slice can use only one frame as reference (either from the past or the future) for their motion prediction.
H.264/AVC is purely a video codec with as many as seven profiles. Three profiles, namely the main, baseline and extended profiles, were included in its first release. Four new profiles were added in the subsequent releases, defined in the fidelity range extensions, for applications such as content distribution, content contribution, and studio editing and post-processing [5]. The profiles, their specific tools and their common features are shown in fig 2.2.
It can be noted that I-slice, P-slice and CAVLC (Context-based Adaptive Variable Length Coding) entropy coding are common to all the profiles.
The baseline profile supports only I and P slices. It was designed for low delay applications, as well as for applications that run on platforms with low processing power and in high packet loss environments. Among the three profiles, it offers the least coding efficiency [6]. The baseline profile caters to applications such as video conferencing and mobile television broadcast video. This project uses the baseline profile for video encoding since it is specified by ATSC for mobile digital television.
The high profiles are used for applications such as content contribution, content distribution, and studio editing and post-processing [5]. The four high profiles are described below:
High Profile - supports 8-bit video with 4:2:0 sampling for applications using high resolution.
High 10 Profile - supports 4:2:0 sampling with up to 10 bits of representation accuracy per sample.
High 4:2:2 Profile - supports up to 4:2:2 chroma sampling and up to 10 bits per sample.
High 4:4:4 Profile - supports up to 4:4:4 chroma sampling, up to 12 bits per sample, and an integer residual color transform for coding RGB signals.
Different YUV formats are shown in fig 2.3.
For any given profile, a level specifies limits on parameters such as the data bit rate, frame size and picture buffer size.
The decoder works in the reverse manner: it takes in the encoded bitstream, entropy decodes it, and passes the result to the inverse quantization and inverse transform blocks.
Fig 2.6. Intra prediction modes for 4x4 luma in H.264 [39].
For the prediction of each 8x8 luma block, one mode is selected from the 9 modes, similar to the 4x4 intra-block prediction. For the prediction of the whole 16x16 luma component of a macroblock, four modes are available. For mode 0 (vertical), mode 1 (horizontal) and mode 2 (DC), the predictions are similar to the cases of the 4x4 luma block. For mode 3 (plane), a linear plane function is fitted to the upper and left samples.
Decoded pictures, both before and after the current picture in the display order, are stored in the picture buffer. These are classified as short-term and long-term reference pictures. Long-term reference pictures are introduced to extend the motion search range by using multiple decoded pictures, instead of using just one decoded short-term picture. Memory management is required to mark some stored pictures as unused and to decide which pictures to delete from the buffer for efficient memory use [5].
CABAC achieves better coding efficiency, but with higher complexity compared to CAVLC [1].
The H.264 bit stream is encapsulated into packets called NAL units (NALU). NALUs are separated by the 4-byte sequence 0x00000001. After this byte sequence, the following byte is the NAL header and the rest is a variable
length raw byte sequence payload (RBSP). The NAL header/unit format is shown in figure 2.9. The first bit of the NAL header, called the forbidden bit, is always zero. The next two bits indicate whether the NALU consists of a sequence parameter set, a picture parameter set or the slice data of a reference picture. The next 5 bits indicate the type of the NALU (type indicator), depending upon the type of data being carried by that NALU. There are 32 different types of NALUs. These may be classified into VCL NALUs and non-VCL NALUs, depending upon the type of data they carry.
If the type indicator is in the range 1 to 5, it is a VCL NALU; otherwise it is a non-VCL NALU. The different types of NALUs are listed in table 2.1. NALU types 1-5 are VCL NAL units containing coded VCL data. The rest are called non-VCL NAL units and contain information such as SEI, the sequence parameter set, the picture parameter set etc. Of these NALUs, the IDR picture, sequence parameter set and picture parameter set are the most important. An instantaneous decoder refresh (IDR) picture is a picture that is placed at the beginning of a video sequence. When the decoder receives an IDR picture, all information is refreshed, which indicates a new coded video sequence; frames prior to the IDR frame are not required to decode the new sequence. The sequence parameter set contains important header information that applies to all NALUs in the coded sequence. The picture parameter set contains
important header information that is used for decoding one or more frames in the coded sequence.
Type indicator | NALU type
0 | unspecified
1 | coded slice
2 | data partition A
3 | data partition B
4 | data partition C
5 | IDR (instantaneous decoder refresh)
6 | SEI (supplemental enhancement information)
7 | sequence parameter set
8 | picture parameter set
9 | access unit delimiter
10 | end of sequence
11 | end of stream
12 | filler data
13-23 | extended
24-31 | undefined

Table 2.1. NAL unit types [6].
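To make the byte-level layout concrete, here is a minimal C++ sketch (not part of the thesis software) that scans an Annex-B buffer for the 4-byte start code and classifies each NALU by its 5-bit type field; the sample buffer contents are hypothetical.

```cpp
// Scan an H.264 Annex-B byte stream for NAL units and decode the one-byte
// NAL header: 1 forbidden bit, 2 bits nal_ref_idc, 5 bits nal_unit_type.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct NalHeader {
    bool forbidden;  // must be 0 in a valid stream
    int  refIdc;     // non-zero for SPS/PPS/reference-picture slices
    int  type;       // 1-5 = VCL (5 = IDR), 7 = SPS, 8 = PPS, ...
};

NalHeader parseNalHeader(uint8_t b) {
    return NalHeader{ ((b >> 7) & 0x01) != 0,
                      (b >> 5) & 0x03,
                       b       & 0x1F };
}

int main() {
    // Hypothetical Annex-B buffer: an SPS (0x67) followed by an IDR slice (0x65).
    std::vector<uint8_t> stream = {0,0,0,1, 0x67,   // SPS payload would follow
                                   0,0,0,1, 0x65};  // IDR slice would follow
    for (std::size_t i = 0; i + 4 < stream.size(); ++i)
        if (stream[i]==0 && stream[i+1]==0 && stream[i+2]==0 && stream[i+3]==1) {
            NalHeader h = parseNalHeader(stream[i+4]);
            std::printf("NALU type %d (%s)\n", h.type,
                        (h.type >= 1 && h.type <= 5) ? "VCL" : "non-VCL");
        }
}
```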
The relationship between the parameter sets and the slice data is shown in fig 2.10. Each VCL NAL unit contains an identifier that refers to the content of the relevant Picture Parameter Set (PPS) and each PPS contains an identifier that refers to the content of the relevant Sequence Parameter Set (SPS). In this manner, a small amount of data (the identifier) is used to refer to a larger amount of information (the parameter set) without repeating that information within each VCL NAL unit. Sequence and picture parameter sets can be sent well ahead of the VCL NAL units to which they apply, and can be repeated to provide robustness against data loss. In some applications, parameter sets may be sent within the channel that carries the VCL NAL units (termed "in-band" transmission). In other applications, it can be advantageous to convey the
parameter sets "out of band" using a more reliable transport mechanism than the video channel itself. By using this mechanism, H.264 can transmit multiple video sequences (with different parameters) in a single bitstream.
The important information carried by the SPS includes the profile/level indicator, decoding or playback order, frame size, number of reference frames, and Video Usability Information (VUI) such as the aspect ratio and color space details. The SPS remains the same for the entire coded video sequence. The important information carried by the PPS includes the entropy coding scheme used, macro block reordering, quantization parameters, and a flag to indicate whether inter predicted MBs can be used for intra prediction. The PPS remains unchanged within a coded picture. This chapter provided an overview of H.264. The various profiles, the encoder, the decoder and the H.264 bit stream format were discussed in detail. An overview of the HE AAC v2 audio codec is presented in the next chapter.
CHAPTER 3 OVERVIEW OF HE AAC v2
HE AAC v2 is a combination of three technologies: AAC (advanced audio coding), SBR (spectral band replication) and PS (parametric stereo). All three technologies are defined in the MPEG4 audio standard [2]. The combination of AAC and SBR is called HE-AAC or aacplus. AAC is a general audio codec, SBR is a bandwidth extension technique offering substantial coding gain in combination
with AAC, and Parametric Stereo (PS) enables stereo coding at very low bitrates. Figure 3.1 shows the family of AAC audio codecs.
[Fig 3.1: HE AAC audio codec family - the AAC core plus SBR forms HE AAC, and HE AAC plus PS forms HE AAC v2]
Figure 3.2 shows the typical bitrate ranges for stereo plotted against the perceptual quality factor for all three forms of the codec. It can be seen that HE AAC v2 provides the best quality at the lowest bitrates.
Fig 3.2: Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7]
SBR has enabled high-quality stereo sound at bitrates as low as 48 kbps. SBR was developed as a bandwidth extension tool to be used along with AAC, and was adopted as an MPEG4 standard in March 2004 [2].
The IPD parameters describe the phase differences between the channels of the stereo input signal. They do not prescribe the distribution of these phase differences over the left and right channels. Hence, a fourth type of parameter is introduced, describing an overall phase offset or Overall Phase Difference (OPD). In order to reconstruct the stereo image, a number of operations are performed in the PS decoder, consisting of scaling (IID), phase rotations (IPD/OPD) and decorrelation (IC).
The IIR resampler is only applied if the input signal sampling rate differs from the encoding sampling rate. It may either be run as a 3:2 downsampler (e.g. to downsample from 48 kHz to 32 kHz) or as a 1:2 upsampler (e.g. to upsample from 16 kHz to 32 kHz). The QMF filter bank (part of SBR) is used to derive the spectral envelope of the original signal. This envelope data along with some other error information forms the SBR stream. The enhanced aacplus encoder basically consists of the well-known AAC encoder, the SBR high band reconstruction encoding tool and the Parametric Stereo (PS) encoding tool. The enhanced aacplus encoder is operated in a dual frequency mode: the SBR encoder unit operates at the encoding sampling frequency (as delivered from the IIR resampler) and the AAC encoder unit at half of that frequency, the downsampled signal forming the
input to the AAC encoder. For an efficient implementation an IIR (Infinite Impulse Response) filter is used. The parametric stereo tool is used for low-bitrate stereo coding, i.e. up to and including a bitrate of 44 kbps [4].
[Fig 3.6: Enhanced aacplus encoder block diagram [9] - IIR resampler, SBR-related modules (including envelope estimation), AAC encoder and bitstream payload formatter]
The SBR encoder consists of a QMF (Quadrature Mirror Filter) analysis filter bank, which is used to derive the spectral envelope of the original input signal. This spectral envelope data along with transposition information forms the SBR stream. For stereo bitrates at and below 44 kbps, the parametric stereo encoding tool in the enhanced aacplus encoder is used. For stereo bitrates above 44 kbps, normal stereo operation is performed. The parametric stereo encoding tool estimates parameters characterizing the perceived stereo image of the input signal. These stereo parameters are embedded in the SBR stream.
The embedding of the SBR stream (including the parametric stereo data) into the AAC stream is done in a backwards compatible way, i.e. legacy AAC decoders can parse the enhanced aacplus stream and decode the AAC core part.
[Fig 3.7: Enhanced aacplus decoder block diagram [9] - stereo-to-mono downmix, HF generation, envelope adjustment and spline resampler, driven by the SBR guidance information and stereo parameters]
Filter bank: The encoder breaks down the raw audio signal into segments known as blocks. The Modified Discrete Cosine Transform (MDCT) is applied to the blocks to maintain a smooth transition from block to block. AAC dynamically switches between two block sizes, 2048 samples and 256 samples, referred to as long blocks and short blocks, respectively. AAC also switches between two window shapes, sine and Kaiser-Bessel Derived (KBD), according to the complexity of the signal. The MDCT is given by (3.1):

X(k) = 2 \sum_{n=0}^{N-1} x(n) \cos( (2\pi/N)(n + n_0)(k + 1/2) ),  k = 0, ..., N/2 - 1   (3.1)

where N is the block length and n_0 = (N/2 + 1)/2.

Psychoacoustic model: A highly complex block which implements the switching between block sizes, the threshold calculation (the upper limit of the quantization error), spread energy calculation, grouping etc. [10].

Temporal Noise Shaping (TNS) block: This technique does noise shaping in the time domain by performing an open loop prediction in the frequency domain. TNS provides enhanced control of the location, in time, of quantization noise within a filter bank window, and proves especially successful for the improvement of speech quality at low bit-rates.

Mid/Side Stereo: M/S stereo coding is another data reduction module based on channel pair coding. Channel pair elements are analyzed as left/right and sum/difference signals on a block-by-block basis. In cases where the M/S channel pair can be represented by fewer bits, the spectral coefficients are coded and a bit is set to note that the block has used M/S stereo coding. During decoding, the decoded channel pair is de-matrixed back to its original left/right state. M/S stereo is only required when operating the encoder at bitrates at or above 44 kbps; below 44 kbps the parametric stereo coding tool is used instead and the AAC core is operated in mono.
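For concreteness, (3.1) can be evaluated directly as in the sketch below. This is a naive O(N^2) implementation for illustration only; real encoders apply the analysis window first and use FFT-based fast algorithms.

```cpp
// Direct evaluation of the MDCT of (3.1): N input samples -> N/2 coefficients.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> mdct(const std::vector<double>& x) {
    const double PI = 3.14159265358979323846;
    const std::size_t N = x.size();          // 2048 (long block) or 256 (short block)
    const double n0 = (N / 2.0 + 1.0) / 2.0; // phase offset term of (3.1)
    std::vector<double> X(N / 2);
    for (std::size_t k = 0; k < N / 2; ++k) {
        double sum = 0.0;
        for (std::size_t n = 0; n < N; ++n)
            sum += x[n] * std::cos(2.0 * PI / N * (n + n0) * (k + 0.5));
        X[k] = 2.0 * sum;
    }
    return X;
}
```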
Reduction of psychoacoustic requirements block: Usually the requirements of the psychoacoustic model are too strong for the desired bitrate. Thus a threshold reduction strategy is necessary, i.e. the strategy reduces the requirements by increasing the thresholds given by the psychoacoustic model.

Quantization and coding: The majority of the data reduction generally occurs in the quantization phase, after the data has already achieved a certain level of compression in the previous modules. This module contains several further blocks, such as scale factor quantization, noiseless coding and out-of-bits prevention.

Scale factor quantization: This block consists of two sub-blocks, scale factor determination and scale factor difference reduction.

Scale factor determination: The scale factors determine the quantization step size for each scale factor band. By changing the scale factor, the quantization noise is controlled.

Scale factor difference reduction: This block encodes the differences between the scale factors. A smaller difference between two adjacent scale factors requires fewer bits.

Noiseless coding: The quantized spectral coefficients are coded by the noiseless coding block. The encoder uses a so-called greedy merge algorithm to segment the 1024 coefficients of a frame into sections and to find the best Huffman codebook for each section.

Out-of-bits prevention: After noiseless coding, the number of bits actually needed is counted. If this number is too high, it has to be reduced.
[Fig 3.8: AAC encoder block diagram [10] - input signal, stereo preprocessing, filter bank, psychoacoustic model, TNS, M/S and noiseless coding]
Field name | Number of bits | Comment
syncword | 12 | always "111111111111"
ID | 1 | 0: MPEG-4, 1: MPEG-2
layer | 2 | always "00"
protection_absent | 1 |
profile | 2 | explained below (table 3.2)
sampling_frequency_index | 4 |
private_bit | 1 |
channel_configuration | 3 |
original/copy | 1 |
home | 1 |
(the fields above constitute the ADTS fixed header)
copyright_identification_bit | 1 |
copyright_identification_start | 1 |
aac_frame_length | 13 | length of the frame including header (in bytes)
adts_buffer_fullness | 11 | 0x7FF indicates VBR
no_raw_data_blocks_in_frame | 2 |
crc_check | 16 |
raw_data_blocks | variable |

Table 3.1. ADTS header format [2][3]

profile bits | profile
00 (0) | Main
01 (1) | Low Complexity (LC)
10 (2) | Scalable Sample Rate (SSR)
11 (3) | Long Term Prediction (AAC LTP)

Table 3.2. Profile bits expansion [2][3]
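A minimal sketch of reading the fixed-header fields of table 3.1 from the first bytes of an ADTS frame is shown below; the bit positions follow the field order and widths listed above.

```cpp
// Parse the start of an ADTS header (table 3.1).
#include <cstdint>

struct AdtsHeader {
    unsigned profile;        // 2 bits, see table 3.2
    unsigned samplingIndex;  // 4 bits, sampling_frequency_index
    unsigned channels;       // 3 bits, channel_configuration
    unsigned frameLength;    // 13 bits, frame length incl. header, in bytes
};

bool parseAdts(const uint8_t* p, AdtsHeader* out) {
    // The 12-bit syncword spans byte 0 and the top nibble of byte 1.
    if (p[0] != 0xFF || (p[1] & 0xF0) != 0xF0) return false;
    out->profile       = (p[2] >> 6) & 0x03;
    out->samplingIndex = (p[2] >> 2) & 0x0F;
    out->channels      = ((p[2] & 0x01) << 2) | ((p[3] >> 6) & 0x03);
    out->frameLength   = ((unsigned)(p[3] & 0x03) << 11) | ((unsigned)p[4] << 3)
                       | ((p[5] >> 5) & 0x07);
    return true;
}
```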
[Fig 3.9: ADTS elementary stream - a sequence of ADTS frames, each beginning with an ADTS header that starts with the sync word]
[Fig 4.1: RTP packet structure (simplified) [22] - payload type, sequence number, time stamp, unique id and payload]
broadcast techniques such as those used in ATSC and DVB. The TS contains the actual data (payload) as well as timing and synchronization information and some error control mechanisms; the latter play a crucial role during the decoding process. This project is implemented using the MPEG2 transport stream specification with a few modifications. The transport of the H.264 bit stream over MPEG2 systems is covered in amendment 3 of MPEG2 systems [30]. Even though MPEG2 systems support multiple elementary streams for multiplexing, for implementation purposes only two elementary streams, audio and video, are considered. Additional elementary streams, such as data streams or different audio/video streams, may be added by following the same method described next.
[Fig 4.3: Conversion of an elementary stream (e.g. coded video or audio) into PES packets [29]]
A PES packet is an encapsulation of one frame of data from either the audio or the video elementary stream. Each PES packet contains a packet header and payload data from only one particular stream. The PES header contains information which can distinguish between audio and video PES packets. Since the number of bits used to represent a frame in the bit stream varies (for both audio and video), the size of the PES packets also varies and depends on the type of frame that is encoded; for example, I frames require more bits than P frames. Figure 4.3 shows how an elementary stream is converted into a PES stream.
The PES header used is shown in table 4.1. The PES header starts with a 3 byte packet start code prefix, which is always 0x000001, followed by a 1 byte stream id. The stream id is used to uniquely identify a particular stream; the stream id together with the start code prefix is known as the start code (4 bytes). Valid stream ids [30] range from 11000000 to 11011111 for audio streams and from 11100000 to 11101111 for video streams. Stream ids 11000000 (0xC0) and 11100000 (0xE0) are used for audio and video respectively in this implementation.
Field | Size | Description
Packet start code prefix | 3 bytes | always 0x000001
Stream id | 1 byte | unique ID to distinguish between audio and video PES packets; audio streams 0xC0-0xDF, video streams 0xE0-0xEF [30] (together with the prefix, these 4 bytes are known as the start code)
PES packet length | 2 bytes | the PES packet can be of any length; a value of zero can be used only when the PES packet payload is a video elementary stream
Time stamp | 2 bytes | frame number

Table 4.1. PES packet header format used [4].
The PES packet length may vary and go up to 65535 bytes. For longer elementary stream frames the packet length may be set as unbounded, i.e. 0, but only in the case of the video stream. The next two bytes in the header are the time stamp field, which contains the playback time information. In this project the frame number is used to calculate the playback time, which is explained in detail later.
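A sketch of building this 8-byte project-specific PES header (start code prefix, stream id, PES length, frame-number time stamp) follows; the helper name is hypothetical, while the stream ids 0xC0/0xE0 are the ones given in the text.

```cpp
// Build a PES packet with the modified header of table 4.1.
#include <cstdint>
#include <vector>

std::vector<uint8_t> makePes(bool isVideo, uint16_t frameNo,
                             const uint8_t* payload, uint16_t len) {
    std::vector<uint8_t> pes;
    pes.push_back(0x00); pes.push_back(0x00); pes.push_back(0x01);  // prefix
    pes.push_back(isVideo ? 0xE0 : 0xC0);                           // stream id
    pes.push_back(uint8_t(len >> 8));     pes.push_back(uint8_t(len));      // length
    pes.push_back(uint8_t(frameNo >> 8)); pes.push_back(uint8_t(frameNo));  // time stamp
    pes.insert(pes.end(), payload, payload + len);                  // frame data
    return pes;
}
```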
For video, each NAL unit is examined to determine whether it contains a coded frame or a parameter set. Parameter sets are very important and are required for the decoding process, so if a parameter set (SPS or PPS) is found, it is packetized separately and transmitted. If the NAL unit contains slice data, the frame number, counted from the beginning of the stream, is coded as the time stamp in the PES header. It has to be noted that parameter sets are not counted as frames, so while coding parameter sets the time stamp field is coded as zero. The stream id used for video is 11100000. The video PES packet is then formed by encapsulating the start code, frame length, time stamp and payload.
The continuity counter in the header allows the receiver to detect missing packets. Additional optional transport fields, whose presence may be signaled in the optional adaptation field, may follow. The rest of the packet consists of the payload. Packets are most often 188 bytes in length, but some transport streams consist of 204-byte packets which end in 16 bytes of Reed-Solomon error correction data. The 188-byte packet size was originally chosen for compatibility with ATM systems.
The Transport Error Indicator (EI) flag is set by the demodulator if it cannot correct errors in the stream, to tell the demultiplexer that the packet has an uncorrectable error. The Payload Unit Start Indicator (PUSI) flag indicates the start of PES data. The transport priority flag, when set, means higher priority than other packets with the same PID. Out of the 188 bytes, the header occupies 4 bytes and the remaining 184 bytes carry the payload.
For the purposes of this implementation not all the flags and fields mentioned above are required, hence a few changes have been made, although the framework and the packet size remain the same. The whole header is represented in 3 bytes instead of 4, and the rest of the packet is available for payload data. The modified transport stream (TS) packet is shown in fig 4.5.
[Fig 4.5. Transport stream (TS) packet format used in this project - sync byte (0x47), PUSI and AFC flags, 4-bit continuity counter (CC) and 10-bit PID forming a 3-byte header, followed by an optional 8-bit offset byte and the payload]
The sync byte (0x47) indicates the start of a new TS packet. It is followed by the payload unit start indicator (PUSI) flag, which when set indicates that the data payload contains the start of a new PES packet. The Adaptation Field Control (AFC) flag, when set, indicates that the whole of the allotted 185 bytes for the data payload is not occupied by PES data. This occurs when the PES data is smaller than 185 bytes. When this happens, the unoccupied bytes of the data payload are filled with filler data (all zeros in this case), and the length of the filler data is stored in a byte called the offset, placed right after the TS header; the offset is calculated as 185 minus the length of the PES data.
The Continuity Counter (CC) is a 4 bit field which is incremented by the multiplexer for each TS packet sent for a particular stream, i.e. audio PES or video PES; this information is used at the demultiplexer side to determine whether any packets are lost, repeated or out of sequence. The Packet ID (PID) is a unique 10 bit identifier describing the particular stream to which the data payload of the transport stream (TS) packet belongs. The MPEG2 transport stream has a concept of broadcast programs. Each single program is described by a Program Map Table (PMT), and the elementary streams associated with that program have PIDs listed in the PMT. For instance, a transport stream used in digital television might contain three programs, representing three television channels. Suppose each channel consists of one video stream and one audio stream. A receiver wishing to decode a particular "channel" merely has to decode the payloads of each PID associated with its program, and can discard the contents of all other PIDs. A transport stream with more than one program is referred to as an MPTS (multi program transport stream); similarly, a transport stream with a single program is referred to as a single program transport stream (SPTS). A PMT has its own PID and is transmitted at regular intervals. In this implementation only two streams, audio and video, are used, so the PMT is not required and the PIDs are assumed to be known by the decoders. The PIDs used in this implementation are 0000001110 (14) for the audio stream and 0000001111 (15) for the video stream. Optional offset byte: as described above, if the adaptation field control flag is set, this byte is filled with the length of the filler data (zeros).
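Packing the modified 188-byte TS packet of fig 4.5 could look like the sketch below. The thesis fixes only the field widths; the exact bit ordering of PUSI, AFC, CC and PID inside the two header bytes after the sync byte is an assumption here, as is the helper name.

```cpp
// Build one 188-byte TS packet: 3-byte header, optional offset byte, payload.
#include <array>
#include <cstdint>
#include <cstring>

std::array<uint8_t, 188> makeTs(bool pusi, uint8_t cc, uint16_t pid,
                                const uint8_t* data, int len) {  // len <= 185
    std::array<uint8_t, 188> pkt{};            // zero fill doubles as filler data
    bool afc = len < 185;                      // payload shorter than 185 bytes
    uint16_t fields = (pusi ? 1u : 0u) << 15   // PUSI flag
                    | (afc  ? 1u : 0u) << 14   // AFC flag
                    | (uint16_t)(cc & 0x0F) << 10
                    | (pid & 0x3FF);           // 10-bit PID
    pkt[0] = 0x47;                             // sync byte
    pkt[1] = uint8_t(fields >> 8);
    pkt[2] = uint8_t(fields);
    int pos = 3;
    if (afc) pkt[pos++] = uint8_t(185 - len);  // offset = 185 - PES data length
    std::memcpy(&pkt[pos], data, len);
    return pkt;
}
```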
The de-multiplexer reads the time stamp of each access unit and stores the elementary streams in buffers. When the timeline count reaches the value of the time stamp, the buffer is read out. This operation has two desirable results. First, effective time base correction is obtained in each elementary stream. Second, the video and audio elementary streams can be synchronized together to make a program. Traditionally, to enable the decoder to maintain synchronization between the audio track and the video frames, a 33 bit encoder clock sample called the Program Clock Reference (PCR) is transmitted in the adaptation field of a TS packet from time to time (every 100 ms). The PCR transport stream (TS) packet has its own PID that is recognized by the decoder. This is used to generate the system timing clock (STC) in the decoder, which provides an accurate time base. This, along with the presentation time stamp (PTS) field that resides in the PES packet layer of the transport stream, is used to synchronize the audio and video elementary streams. This project uses the frame numbers of both audio and video as time stamps to synchronize the streams. This section explains how frame numbers can be used to synchronize audio and video streams. As explained in sections 2.6 and 3.7, both the H.264 and HE AAC v2 bit streams are organized into access units, i.e. frames, separated by their respective sync sequences. A particular video sequence has a fixed frame rate during playback, specified in frames per second (fps). So, assuming that the decoder has prior knowledge of the fps of the stream, the presentation time or playback time of a particular video frame can be calculated using (4.1):

t_{video} = n_{video} / fps    (4.1)

where n_{video} is the video frame number carried in the time stamp.
The AAC compression standard defines each audio frame to contain 1024 samples per channel, and this is true for HE AAC v2 as well. The sampling frequency of the audio stream can be extracted from the sampling frequency index field of the ADTS header, and it remains the same for a particular audio stream. Since both the samples per frame and the sampling frequency are fixed, the audio frame rate also remains constant throughout a particular audio stream. Hence the presentation time of a particular audio frame (assuming stereo) can be calculated using (4.2):

t_{audio} = (n_{audio} * 1024) / f_s    (4.2)

where n_{audio} is the audio frame number and f_s is the sampling frequency.
The same expression can be expanded for multi-channel audio streams by taking the number of channels into account. Hence, by using (4.1) and (4.2), the presentation times of the frames can be calculated by coding the frame numbers as the time stamps. Also, once the presentation time of one stream is calculated, the frame number of the second stream that has to be played at that particular time can be calculated. This approach is used at the decoder to achieve audio-video synchronization or lip synchronization, and is explained in detail in later chapters. Using frame numbers as time stamps has many advantages over the traditional PCR approach: there is no need to send additional Transport Stream (TS) packets with PCR information, the overall complexity is reduced, clock jitter need not be considered during synchronization, and the time stamp field in the PES packet is smaller, just 16 bits to encode the frame number
compared to 33 bits for the Presentation Time Stamp (PTS), which carries a sample of the encoder clock. The time stamp field in this project is encoded in 2 bytes in the PES header, which implies that it can carry frame numbers up to 65535. Once the frame number of either stream exceeds this number, which is a possibility in the case of long video and audio sequences, the frame number is reset to 1. The reset is done simultaneously on both the audio and video frame numbers as soon as the frame number of either one of the streams crosses the limit. This does not create a frame number conflict at the demultiplexer side during synchronization, because the audio and video buffer sizes are much smaller than the maximum allowed frame number; hence at no point in time will there be two frames in the buffer with the same time stamp. The next chapter addresses the multiplexing scheme used to multiplex the audio and video elementary streams.
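A small sketch of (4.1) and (4.2) as the decoder would use them; the values in main() reproduce the first row of table 7.1 (24 fps video, 24 kHz audio).

```cpp
// Presentation times recovered from the frame numbers in the PES time stamps.
#include <cstdio>

double videoPresentationTime(unsigned frameNo, double fps) {
    return frameNo / fps;                  // (4.1)
}

double audioPresentationTime(unsigned frameNo, double samplingHz) {
    return frameNo * 1024.0 / samplingHz;  // (4.2): 1024 samples per frame
}

int main() {
    std::printf("video frame 100 plays at %.3f s\n", videoPresentationTime(100, 24.0));
    std::printf("audio frame  98 plays at %.3f s\n", audioPresentationTime(98, 24000.0));
}
```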
CHAPTER 5 MULTIPLEXING
Multiplexing is a process where Transport Stream (TS) packets are generated and transmitted in such a way that the buffers at the decoder (demultiplexer) do not overflow or underflow. Buffer overflow or underflow of the video and audio elementary streams can cause skips or freeze/mute errors in video and audio playback. Various systems adopt different methods to prevent this at the decoder side; for example, when a potential buffer overflow is detected, null packets (transmitted to maintain a constant bit rate) are deleted, or the presentation time is delayed by a few frames until both buffers have content to be played back at that particular presentation time.
The flow chart of the multiplexing scheme used in this project is shown in figures 5.1, 5.2 and 5.3. The basic logic relies on both the audio and video sequences having constant frame rates. For video, the frames per second value remains the same throughout the video sequence. In an audio sequence, since the sampling frequency remains constant throughout the sequence and the samples per frame are fixed (1024), the frame duration also remains constant. For transmission, a PES packet, which represents one frame, is logically broken down into n_{TS} TS packets of 188 bytes each, where n_{TS} depends on the PES packet size. The exact presentation time of each TS packet may be calculated as shown in (5.1), (5.2) and (5.3), where n_{TS} is the number of TS packets required to represent the corresponding PES packet or frame:

n_{TS} = ceil( PES packet length / 185 )    (5.1)
TS_{duration} = (1 / fps) / n_{TS}    (5.2)
t_{TS}(i) = t_{TS}(i-1) + TS_{duration}    (5.3)
The corresponding quantities for the audio stream are given by (5.4), (5.5) and (5.6), where the audio frame duration 1024/f_s takes the place of the video frame duration 1/fps:

n_{TS} = ceil( PES packet length / 185 )    (5.4)
TS_{duration} = (1024 / f_s) / n_{TS}    (5.5)
t_{TS}(i) = t_{TS}(i-1) + TS_{duration}    (5.6)
From (5.3) and (5.6) it may be observed that the presentation time of the current TS packet is the cumulative sum of the presentation time of the previous TS packet (of the same type) and the current TS duration. The decision to transmit a particular TS packet (audio or video) is made by comparing their respective presentation times: whichever stream has the lower value is scheduled to transmit a TS packet. This makes sure that the audio and video content get equal priority and are transmitted uniformly. Once the decision about which TS to transmit is made, control goes to one of the blocks where the actual generation and transmission of TS and PES packets takes place. In the audio/video processing block, the first step is to check whether the multiplexer is still in the middle of a frame or at the beginning of a new frame. If a new frame is being processed, (5.2) or (5.5) is executed as appropriate to find the TS duration. This information is used to update the TS presentation time at a
later stage. Next, data is read from the PES packet concerned; if the PES packet is bigger than 185 bytes, only the first 185 bytes are read out and the PES packet is adjusted accordingly. If the current TS packet is the last packet for that PES packet, a new PES packet for the next frame (of that stream) is generated. Now the 185 bytes of payload data and all the remaining information are ready for generating the transport stream (TS) packet. Once a TS packet is generated, the TS presentation time is updated using (5.3) or (5.6). Then control goes back to the presentation time decision block, and the whole process is repeated until all the video and audio frames are transmitted. It has to be noted that one of the streams, i.e. video or audio, may be transmitted completely before the other; in that case only the processing block of the stream still pending transmission is operated. The next chapter describes the de-multiplexing algorithm used and the method used to achieve audio-video synchronization.
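The scheduling decision above can be condensed into the following sketch. Packet generation, PES bookkeeping and real PES sizes are elided; the fixed count of 10 TS packets per frame is a hypothetical stand-in for (5.1)/(5.4).

```cpp
// Alternate between streams by comparing running TS presentation times.
#include <cstdio>

struct StreamState {
    double tsTime;     // presentation time of the last TS packet sent
    double frameDur;   // 1/fps (video) or 1024/fs (audio)
    long   bytesLeft;  // remaining elementary-stream bytes (stand-in)
};

void sendNextTs(StreamState& s) {
    const int nTs = 10;               // hypothetical n_TS per frame, see (5.1)/(5.4)
    s.tsTime    += s.frameDur / nTs;  // (5.2)/(5.5), then (5.3)/(5.6)
    s.bytesLeft -= 185;               // one TS payload consumed
}

int main() {
    StreamState video{0.0, 1.0 / 24.0,       2000000};
    StreamState audio{0.0, 1024.0 / 24000.0,  200000};
    while (video.bytesLeft > 0 || audio.bytesLeft > 0) {
        if (audio.bytesLeft <= 0)              sendNextTs(video);
        else if (video.bytesLeft <= 0)         sendNextTs(audio);
        else if (video.tsTime <= audio.tsTime) sendNextTs(video);  // lower time first
        else                                   sendNextTs(audio);
    }
    std::printf("done: video t=%.3f s, audio t=%.3f s\n", video.tsTime, audio.tsTime);
}
```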
[Fig 5.1. Overall multiplexer flow diagram]
[Fig 5.2. Flow chart of the video processing block - for a new frame, calculate the number of TS packets and the TS duration; take at most the first 185 bytes of the PES packet per TS packet, update the PES data and length, and generate the PES packet for the next video frame when the current one is exhausted]
[Fig 5.3. Flow chart of the audio processing block - identical in structure to the video processing block, operating on audio PES packets]
CHAPTER 6 DE-MULTIPLEXING
The Transport Stream (TS) input to a receiver is separated into a video elementary stream and an audio elementary stream by the demultiplexer. The video and audio elementary streams are temporarily stored in the video and audio buffers, respectively. The basic flow chart of the demultiplexer is shown in figure 6.1. After a packet is received, it is checked for the sync byte (0x47) to determine whether the packet is valid. If invalid, that packet is skipped and de-multiplexing continues with the next packet. The header of a valid TS packet is read to extract fields such as the packet ID (PID), adaptation field control (AFC) flag, payload unit start indicator (PUSI) flag and the 4 bit continuity counter. Then the payload is prepared to be read into the appropriate buffer. By checking the AFC flag it is known whether an offset value has to be applied or all 185 bytes in the TS packet carry payload data; if the AFC flag is set, the payload is extracted by skipping over the stuffing bytes. The PUSI bit is checked to see whether the present TS packet contains a PES header. If so, the PES header is first checked for the presence of the sync sequence (i.e. 0x000001); if the sequence is absent, the packet is discarded and the next TS packet is processed. If valid, the PES header is read and fields such as the stream ID, PES length and frame number are extracted. The PID is then checked to see whether it is an audio or a video TS packet, and the payload is written into the respective buffer. If the TS packet payload contained a PES header, information such as the frame number, its location in the corresponding buffer and the PES length is stored in a separate array, which is later used for synchronizing the audio and video streams.
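A per-packet sketch of these demultiplexer steps is shown below. PES-header validation and frame-number bookkeeping are elided; the header bit layout mirrors the multiplexer sketch of chapter 4 (an assumption), with the PIDs 14 and 15 as given there.

```cpp
// Route one received TS packet into the audio or video buffer.
#include <cstdint>
#include <vector>

bool demuxPacket(const uint8_t pkt[188],
                 std::vector<uint8_t>& audioBuf,
                 std::vector<uint8_t>& videoBuf) {
    if (pkt[0] != 0x47) return false;           // invalid packet: caller skips it
    uint16_t f   = (uint16_t(pkt[1]) << 8) | pkt[2];
    bool     afc = ((f >> 14) & 1) != 0;        // PUSI is bit 15, CC is bits 13-10
    uint16_t pid = f & 0x3FF;                   // 10-bit PID
    int pos = 3, len = 185;
    if (afc) { len = 185 - pkt[3]; pos = 4; }   // offset byte gives filler length
    // PUSI handling (PES sync-sequence check, frame-number bookkeeping) elided.
    if (pid == 14)      audioBuf.insert(audioBuf.end(), pkt + pos, pkt + pos + len);
    else if (pid == 15) videoBuf.insert(videoBuf.end(), pkt + pos, pkt + pos + len);
    return true;                                // packets with other PIDs discarded
}
```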
[Fig 6.1. Flow chart of the de-multiplexer - validate the sync byte, read the PID and AFC flag, adjust for the offset when AFC is set, read the PES header (PES length, frame number, stream ID) when PUSI is set and the sync sequence is present, write the payload into the audio or video buffer and save the frame number and buffer location; once the video buffer is full, search for the next IDR frame, calculate the corresponding audio frame and write both buffers from those frames into their respective bitstream files (.264 and .aac)]
After the payload has been written into the audio/video buffer, the video buffer is checked for fullness. Since video files are always much larger than audio files, the video buffer fills up first. Once the video buffer is full, the next occurring IDR frame is searched for in the video buffer. Once found, its frame number is noted and used to calculate the corresponding audio frame number that has to be played at that time, which is given by (6.1):

n_{audio} = round( (n_{video} / fps) * (f_s / 1024) )    (6.1)
The above equation is used to synchronize the audio and video streams. Once the frame numbers are obtained, the audio and video elementary streams can be reconstructed by writing the audio and video buffer contents from those frames into their respective elementary streams, i.e. .aac and .264 files. The streams are then merged into a container format using the mkvmerge software [31], which is freely available. The resulting container file can be played back by a video player such as VLC media player [32] or GOM media player [33]. In the case of the video sequence, to ensure proper playback, the picture and sequence parameter sets must be inserted before the first IDR frame. The reason the de-multiplexing is carried out from an IDR (instantaneous decoder refresh) frame is that the IDR frame breaks the sequence, making sure that later frames such as P-frames do not use frames before the IDR frame for motion estimation; this is not true of a normal I-frame. So, in a long sequence, the GOPs after the IDR frame are treated as a new sequence by the H.264 decoder. In the case of audio, the HE AAC v2 decoder can play back the sequence from any audio frame.
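Equation (6.1) in code form, as a minimal sketch; the example values in main() are hypothetical.

```cpp
// Map the IDR video frame number to the audio frame playing at the same time.
#include <cmath>
#include <cstdio>

unsigned correspondingAudioFrame(unsigned videoFrameNo, double fps, double fs) {
    double t = videoFrameNo / fps;                 // video presentation time, (4.1)
    return (unsigned)std::lround(t * fs / 1024.0); // nearest audio frame, (6.1)
}

int main() {
    // e.g. an IDR found at video frame 240 in a 24 fps / 24 kHz stream
    std::printf("start audio from frame %u\n",
                correspondingAudioFrame(240, 24.0, 24000.0));
}
```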
The time difference between the playback of the audio and video streams is known as skew [47]. According to the survey in [47], the in-sync region spans a skew of -80 ms (audio behind video) to +80 ms (video behind audio); in-sync refers to the range of skew values where the synchronization error is not perceptible. The MPEG-2 systems specification [4] defines a skew threshold of 40 ms. Once the streams are synchronized, the skew remains constant throughout. This possible maximum skew is a limitation of the project; however, the value remains well within the allowed range. Simulation results using the audio and video test sequences are presented in the next chapter.
CHAPTER 7
RESULTS
The multiplexing algorithm was implemented in MATLAB and the demultiplexing algorithm in C++. JM (joint model) 16.1 [35] and the 3GPP enhanced aacplus encoder [36] were used for encoding the raw video and audio sequences respectively. The GOP sequence adopted for video encoding was IPPP, using the H.264 baseline profile. For audio encoding, the bitrate was set at 32 kbps to enable parametric stereo.
Video frames in buffer | Audio frames in buffer | Video buffer size (KB) | Audio buffer size (KB) | Video content playback time (s) | Audio content playback time (s)
100 | 98 | 771.076 | 17.49 | 4.166 | 4.181
200 | 196 | 1348.359 | 34.889 | 8.333 | 8.362
300 | 293 | 1770.271 | 52.122 | 12.5 | 12.51
400 | 391 | 2238.556 | 69.519 | 16.666 | 16.682
500 | 489 | 2612.134 | 86.949 | 20.833 | 20.864
600 | 586 | 3158.641 | 104.165 | 25 | 25.002
700 | 684 | 3696.039 | 121.627 | 29.166 | 29.184
800 | 782 | 4072.667 | 139.043 | 33.333 | 33.365
900 | 879 | 4500.471 | 156.216 | 37.5 | 37.504
1000 | 977 | 4981.05 | 173.657 | 41.666 | 41.685

Table 7.1. Video and audio buffer sizes and their respective playback times
Characteristic | Test clip 1 | Test clip 2
Clip length (sec) | 30 | 50
Video FPS | 24 | 24
Audio sampling frequency (Hz) | 24000 | 24000
Total video frames | 721 | 1199
Total audio frames | 704 | 1173
Video raw file (.yuv) size (kB) | 105447 | 175354
Audio raw file (.wav) size (kB) | 5626 | 9379
H.264 file size (kB) | 1273 | 1365
AAC file size (kB) | 92 | 204
Video compression ratio | 82.82 | 128.4
Audio compression ratio | 61.15 | 45.97
H.264 encoder bitrate (kBps) | 42.43 | 27.3
AAC encoder bitrate (kbps) | 32 | 32
Total TS packets | 8741 | 9858
Transport stream size (kB) | 1605 | 1810
Transport stream bitrate (kBps) | 53.49 | 36.2
Test clip size (kB) | 1376.78 | 1576.6
Reconstructed clip size (kB) | 1312.45 | 1563.22

Table 7.2. Characteristics of test clips used
Table 7.3 shows the skew for various starting TS packets. The delay column indicates the skew achieved when demultiplexing was started from a different TS packet number. The maximum theoretical value is 21 ms, because the sampling frequency used is 24,000 Hz (an audio frame duration of about 42.7 ms). As seen, the worst skew is 13 ms, and in most cases the skew is around 10 ms or below. This is well below the MPEG2 threshold of 40 ms. Chapter 8 outlines the conclusions, followed by future work in chapter 9.
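The 21 ms figure follows directly from the audio frame duration: playback can start only on an audio frame boundary, and rounding to the nearest frame in (6.1) leaves at most half a frame of skew:

max skew = (1/2) * (1024 / 24000) s ≈ 21.3 ms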
Chosen video frame presentation time (s) | Chosen audio frame presentation time (s) | Delay (ms) | Perceptible?
0.5416 | 0.5546 | 13 | no
1.208 | 1.1946 | 13 | no
1.375 | 1.365 | 10 | no
1.875 | 1.877 | 2 | no
2.208 | 2.218 | 10 | no
3.041 | 3.03 | 11 | no
3.708 | 3.712 | 4 | no

Table 7.3: Demultiplexer output (the delay is the absolute difference of the two presentation times)
CHAPTER 8 CONCLUSIONS
This project implemented an effective multiplexing and demultiplexing scheme with synchronization. The latest codecs, H.264 and HE AAC v2, were used. Both encoders achieved very high compression ratios; as a result, the transport stream bitrate requirement was contained to about 50 kBps. Buffer fullness was also handled effectively, with the maximum buffer difference observed being around 20 ms of media content. During decoding, audio-video synchronization was achieved with a maximum skew of 13 ms.
References:
[1] MPEG-4: ISO/IEC JTC1/SC29 14496-10: Information technology - Coding of audio-visual objects - Part 10: Advanced Video Coding, ISO/IEC, 2005.
[2] MPEG-4: ISO/IEC JTC1/SC29 14496-3: Information technology - Coding of audio-visual objects - Part 3: Audio, Amendment 4: Audio lossless coding (ALS), new audio profiles and BSAC extensions.
[3] MPEG-2: ISO/IEC JTC1/SC29 13818-7: Advanced audio coding (AAC), International Standard IS WG11, 1997.
[4] MPEG-2: ISO/IEC 13818-1: Information technology - Generic coding of moving pictures and associated audio - Part 1: Systems, ISO/IEC, 2005.
[5] Soon-kak Kwon et al, "Overview of H.264/MPEG-4 Part 10", special issue on the emerging H.264/AVC video coding standard, J. Visual Communication and Image Representation, vol. 17, pp. 186-216, April 2006.
[6] A. Puri et al, "Video coding using the H.264/MPEG-4 AVC compression standard", Signal Processing: Image Communication, vol. 19, pp. 793-849, Oct. 2004.
[7] "MPEG-4 HE-AAC v2 - audio coding for today's digital media world", EBU Technical Review (01/2006). Link: http://tech.ebu.ch/docs/techreview/trev_305-moser.pdf
[8] ETSI TS 101 154: Implementation guidelines for the use of video and audio coding in broadcasting applications based on the MPEG-2 transport stream.
[9] 3GPP TS 26.401: General audio codec audio processing functions; Enhanced aacPlus general audio codec; 2009.
[10] 3GPP TS 26.403: Enhanced aacPlus general audio codec; Encoder specification - AAC part.
[11] 3GPP TS 26.404: Enhanced aacPlus general audio codec; Encoder specification - SBR part.
[12] 3GPP TS 26.405: Enhanced aacPlus general audio codec; Encoder specification - Parametric Stereo part.
[13] E. Schuijers et al, "Low complexity parametric stereo coding", Audio Engineering Society, May 2004. Link: http://www.jeroenbreebaart.com/papers/aes/aes116_2.pdf
[15] MPEG-4: ISO/IEC JTC1/SC29 14496-14: Information technology - Coding of audio-visual objects - Part 14: MP4 file format, 2003.
[16] DVB-H: Global mobile TV. Link: http://www.dvb-h.org/
[17] ATSC-M/H. Link: http://www.atsc.org/cms/
[18] Open mobile video coalition. Link: http://www.openmobilevideo.com/about-mobiledtv/standards/
[19] VC-1 Compressed Video Bitstream Format and Decoding Process (SMPTE 421M-2006), SMPTE Standard, 2006 (http://store.smpte.org/category-s/1.htm).
[20] Henning Schulzrinne's RTP page. Link: http://www.cs.columbia.edu/~hgs/rtp/
[21] G.A. Davidson et al, "ATSC video and audio coding", Proc. IEEE, vol. 94, pp. 60-76, Jan. 2006 (www.atsc.org).
[22] I.E.G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, Wiley, 2003.
[23] European Broadcasting Union. Link: http://www.ebu.ch/
[24] Shintaro Ueda et al, "NAL level stream authentication for H.264/AVC", IPSJ Digital Courier, vol. 3, Feb. 2007.
[25] World DMB. Link: http://www.worlddab.org/
[26] ISDB website. Link: http://www.dibeg.org/
[27] 3GPP website. Link: http://www.3gpp.org/
[28] M. Modi, "Audio compression gets better and more complex". Link: http://www.eetimes.com/discussion/other/4025543/Audio-compression-gets-better-andmore-complex
[29] P.A. Sarginson, "MPEG-2: Overview of the systems layer". Link: http://downloads.bbc.co.uk/rd/pubs/reports/1996-02.pdf
[30] MPEG-2: ISO/IEC 13818-1: Generic coding of moving pictures and audio: Part 1 - Systems, Amendment 3: Transport of AVC video data over ITU-T Rec. H.222.0 | ISO/IEC 13818-1 streams, 2003.
[31] MKV merge software. Link: http://www.matroska.org/
[32] VLC media player. Link: http://www.videolan.org/
[33] GOM media player. Link: http://www.gomlab.com/
[34] H. Murugan, "Multiplexing H.264 video bit-stream with AAC audio bit-stream, demultiplexing and achieving lip sync during playback", M.S.E.E. thesis, University of Texas at Arlington, TX, May 2007.
[35] H.264/AVC JM software. Link: http://iphome.hhi.de/suehring/tml/download/
[36] 3GPP Enhanced aacPlus reference software. Link: http://www.3gpp.org/ftp/
[37] MPEG-2: ISO/IEC JTC1/SC29 13818-2: Information technology - Generic coding of moving pictures and associated audio information: Part 2 - Video, ISO/IEC, 2000.
[38] MPEG-4: ISO/IEC JTC1/SC29 14496-2: Information technology - Coding of audio-visual objects: Part 2 - Visual, ISO/IEC, 2004.
[39] T. Wiegand et al, "Overview of the H.264/AVC video coding standard", IEEE Trans. CSVT, vol. 13, pp. 560-576, July 2003.
[40] ATSC-Mobile DTV Standard, Part 7 - AVC and SVC video system characteristics. Link: http://www.atsc.org/cms/standards/a153/a_153-Part-7-2009.pdf
[41] ATSC-Mobile DTV Standard, Part 8 - HE AAC audio system characteristics. Link: http://www.atsc.org/cms/standards/a153/a_153-Part-8-2009.pdf
[42] H.264 Video Codec - Inter Prediction. Link: http://mrutyunjayahiremath.blogspot.com/2010/09/h264-inter-predn.html
[43] B.A. Cipra, "The ubiquitous Reed-Solomon codes". Link: http://www.eccpage.com/reed_solomon_codes.html
[44] VC-1 technical overview. Link: http://www.microsoft.com/windows/windowsmedia/howto/articles/vc1techoverview.aspx
[45] Dirac video compression website. Link: http://diracvideo.org/
[46] MPEG-2: ISO/IEC JTC1/SC29/WG11 13818-3: Coding of moving pictures and associated audio: Part 3 - Audio, Nov. 1994.
[47] G. Blakowski et al, "A media synchronization survey: reference model, specification, and case studies", IEEE Journal on Selected Areas in Communications, vol. 14, no. 1, Jan. 1996.