ACKNOWLEDGEMENT
I take this opportunity to thank the Almighty for keeping me on the right path and for the immense blessings towards the successful completion of my Seminar.
I wish to express my sincere gratitude to Smt. Geetha Ranjin, H.O.D, Department of Electronics and Communication Engineering, for her expert guidance, constant encouragement and valuable suggestions for the completion of this Seminar.
I am also grateful to my Staff-in-Charge Mr. Ranjith Ram and Mr. Vinod Kumar, Department of Electronics and Communication Engineering, for always being there to hand out invaluable pieces of advice.
Last of all, I thank all my teachers and friends, who extended every possible assistance they could. ROSHITH. P
2005
ABSTRACT
Three-dimensional facial model coding can be employed in various mobile applications to provide an enhanced user experience. Instead of directly encoding the video using conventional coding techniques such as MPEG-2, a one-time 3D computer model of the caller is transmitted at the beginning of the telephone call. Thereafter, capturing 3D movements and mimicry parameters with the camera is all that is required to continually see and hear a true-to-life, synchronized caller on the display. The 3D models are interchangeable, which means that one person can be displayed on the screen with the movements of another, and the technique is suitable for use with various mobile networks, from GSM to UMTS. What is less clear, however, is the sensitivity of 3D-coded data to channel errors.
CONTENTS
Chapter 1. INTRODUCTION
Chapter 2. SYSTEM OVERVIEW
Chapter 3. FACIAL ANIMATION AND SPECIFICATION
    3.1 MPEG-4 standard
    3.2 Face animation parameters
    3.3 Facial animation parameter units
    3.4 Face feature points
    3.5 MPEG-4 facial animation delivery
Chapter 4. CODING OF FAPs
    4.1 Arithmetic coding of FAPs
    4.2 DCT coding of FAPs
    4.3 Interpolation and extrapolation
Chapter 5. SYSTEM ARCHITECTURE
Chapter 6. CHANNEL MODELS FOR FAP
    6.1 GPRS
    6.2 EDGE
    6.3 Results
Chapter 7. ERRORS IN MOBILE FACIAL ANIMATION APPLICATIONS
Chapter 8. APPLICATIONS
    8.1 Embodied agents in spoken dialogue systems
    8.2 Language training with talking heads
    8.3 Synthetic faces as aids in communication
Chapter 9. CONCLUSION
REFERENCES
INTRODUCTION
Facial animation and virtual human technology in computer graphics has made considerable advances during the past decades and has become a research topic attracting an increasing number of commercial applications, such as mobile platforms, telecommunications, tele-presence via the Internet and digital entertainment. A number of mobile applications may benefit from the enhancement that 3-D video can bring, including message services and e-commerce. Despite the possible advantages of such technologies, the effect of the mobile link on 3-D video has not been considered in the design of its syntax. Another issue is delivering the coded bit stream over the wireless network; the bandwidth required should be as narrow as possible. MPEG-4 is the first international standard that standardizes real-time multimedia communication, including natural and synthetic audio, video and 3D graphics. In order to define face models, MPEG-4 provides BIFS (Binary Format for Scenes). Within BIFS, FAP coding provides a lower bit rate for face models. To deliver such services, the possible channel models are GPRS and EDGE; however, the coded data is sensitive to errors, and this must be taken into account. The next sections give an overview of the relevant parts of FAP technology and the coding of FAPs, and discuss different mobile network technologies. This is followed by results obtained when FAPs are delivered through GPRS and EDGE channels, comparing channel errors. Other noticeable issues involved in this technology are then considered, and applications of facial animation in mobile terminals are also discussed.
SYSTEM OVERVIEW
Figure 2.1 System overview

The mobile facial animation system can be described using the above block diagram. Using a projection camera, the 3D input surfaces or facial models are produced, and the movements of the face are tracked using facial animation techniques. An MPEG-4 FAP encoder encodes the high-resolution facial models, and this data stream is transmitted over the wireless network. GPRS and EDGE channel models are preferred here because of their data rates and bandwidth; of the two, EDGE offers the higher data rate. At the receiver, the data stream is received using the same protocol stack as in the transmitter, but in the inverse order. It is then decoded using an MPEG-4 FAP decoder, and the face model is reconstructed.
A combination of the first two features allows elements within scenes to be animated. BIFS also allows scenes to be displayed as the data arrives at the client, while VRML requires the whole scene to be downloaded before anything is shown.
In an effort to standardize face model parameterization, originally for the purposes of efficient model-based coding of moving images, the MPEG consortium developed the MPEG-4 facial animation standard. This standard defines 68 facial animation parameters (FAPs) and 84 facial feature points. The facial feature points are well-defined landmark points on the human face. The FAPs have been designed to be independent of any particular facial model; in other words, essential facial gestures and visual speech derived from a particular performer will produce good results on other faces unknown at the time the encoding takes place. The 68 parameters are categorized into 10 groups related to parts of the face (Table 3.1). FAPs represent a complete set of basic facial actions, including head motion and tongue, eye and mouth control, and allow the representation of natural facial expressions. They can also be used to define facial action units. Exaggerated values permit the definition of actions that are normally not possible for humans but are desirable for cartoon-like characters. The FAP set contains two high-level FAPs and 66 low-level FAPs. The high-level FAPs are visemes and expressions (FAP group 1). A viseme is a visual correlate of a phoneme; only 14 static visemes that are clearly distinguishable are included in the standard set. In order to allow for coarticulation of speech and mouth movement, transitions from one viseme to the next are defined by blending the two visemes with a weighting factor. Similarly, the expression parameter defines 6 high-level facial expressions, such as joy and sadness. In contrast to visemes, facial expressions are animated with a value defining the excitation of the expression, and two facial expressions can be blended with a weighting factor. Since expressions are high-level animation parameters, they allow unknown models to be animated with high subjective quality.
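The viseme and expression blending described above can be sketched in a few lines. This is an illustrative sketch only, not the normative MPEG-4 decoder behaviour; the function name and the sample FAP values are hypothetical.

```python
# Sketch of MPEG-4-style viseme/expression blending: two parameter
# sets are mixed with a single weighting factor, as the standard
# describes for viseme transitions and for pairs of expressions.

def blend(params_a, params_b, weight):
    """Blend two FAP value lists: weight=1.0 -> pure A, 0.0 -> pure B."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("blend weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(params_a, params_b)]

# Mouth-related displacements for two visemes (hypothetical values,
# expressed in FAP units):
viseme_p = [120.0, -40.0, 15.0]   # e.g. lips closed for /p/
viseme_a = [480.0, -220.0, 60.0]  # e.g. jaw open for /a/

# A transition frame halfway between the two visemes:
mid = blend(viseme_p, viseme_a, 0.5)
```

Sweeping the weight from 1.0 to 0.0 over a few frames produces the smooth viseme-to-viseme transition the standard relies on for coarticulation.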
Table 3.1: FAP groups

Group                                               Number of FAPs
1: Visemes and expressions                          2
2: Jaw, chin, inner lowerlip, cornerlips, midlip    16
3: Eyeballs, pupils, eyelids                        12
4: Eyebrow                                          8
5: Cheeks                                           4
6: Tongue                                           5
7: Head rotation                                    3
8: Outer lip positions                              10
9: Nose                                             4
10: Ears                                            4
Figure 3.4.1 An MPEG-4 compliant face model, the dots representing the 84 feature points (left). An example of varying FAPs 4, 5, 6 and 12 to describe mouth opening (right).

Some feature points, like the ones along the hairline, are not affected by FAPs; they are required for defining the shape of a proprietary face model using feature points. Feature points are arranged in groups such as cheeks, eyes and mouth (Table 3.1). The location of these feature points has to be known for any MPEG-4 compliant face model.
This discussion aims at analyzing the issues in the transmission of MPEG-4 compliant facial animation streams over lossy packet networks, such as wireless LANs, the Internet, or third-generation mobile networks. Many web-based applications now exploit three-dimensional, animated virtual characters to enrich their user interface. However, the HTTP/TCP protocols used for animation transport in the majority of existing systems fail to guarantee fast interaction over a wide range of network conditions. The use of unreliable, connectionless transport protocols, such as the Real-time Transport Protocol (RTP) over UDP, for the delivery of multimedia content has been proposed in order to reduce end-to-end latency and improve robustness against network congestion. The MPEG-4 standard allows for the encoding and representation of a wide range of natural and synthetic audio and video sources. A major difference from previous multimedia standards lies in its object-based approach, in which a scene is composed of several AudioVisual Objects, each of them represented through an elementary bit stream. One such object is the Face Object, a three-dimensional face model (either human- or cartoon-like) that may be animated by a set of Facial Animation Parameters (FAPs). Because of the complexity of implementing a complete MPEG-4 Systems architecture, a common approach in web-based applications is to carry a single elementary stream directly over the lightweight RTP protocol. Face models demand very low bit rates, and model-based, variable-length, predictive encoding is used: MPEG-4 employs highly efficient arithmetic and DCT (Discrete Cosine Transform) coding algorithms to reduce temporal redundancy in FAP streams. Bit rates as low as 2 kbps can be achieved; thus, the frame size becomes comparable with the size of the RTP/UDP/IP headers.
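The point about header overhead can be made concrete with some rough arithmetic. The 2 kbps figure comes from the text; the frame rate and the one-frame-per-packet packetization are assumptions for illustration, and the header sizes are the standard uncompressed RTP/UDP/IPv4 values.

```python
# Rough overhead estimate for FAP streaming over RTP/UDP/IP.
# Assumptions (not from the text): 25 frames/s, one FAP frame per
# packet, uncompressed IPv4 headers (RTP 12 + UDP 8 + IP 20 bytes).

bitrate_bps = 2_000          # FAP stream coded at ~2 kbps
frame_rate = 25              # assumed animation frame rate
header_bytes = 12 + 8 + 20   # RTP + UDP + IPv4

payload_bytes = bitrate_bps / 8 / frame_rate   # bytes per FAP frame
overhead = header_bytes / (header_bytes + payload_bytes)

print(f"payload per frame: {payload_bytes:.0f} bytes")
print(f"header overhead:   {overhead:.0%}")
```

Under these assumptions a FAP frame is only about 10 bytes, so the 40-byte header dominates the packet, which is exactly why header size matters at these bit rates.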
The use of these algorithms also means that the loss or late arrival of a single packet may destroy a significant amount of information, hence requiring the use of error resilience and/or concealment techniques. Finally, the specific bit stream syntax often requires a significant amount of look-ahead in the decoding process: if a packet is lost or corrupt, the decoding process is interrupted up to the next reference frame. This work investigates the effects of such losses on bandwidth.
CODING OF FAP'S
One key issue in making use of FAP technology is how FAP parameters are obtained ready for encoding. The standard MPEG-4 FAP encoder software uses text-based FAP files as input. These text-based files contain various parameters specifying how the face moves, from which the binary encoded FAPs are produced. These FAP files may be generated in two ways: manually, or automatically by employing image processing algorithms; a number of image processing techniques have been proposed which are capable of identifying and tracking facial features. For coding facial animation parameters, MPEG-4 provides two tools: coding of quantized and temporally predicted FAPs using an arithmetic coder, which introduces only a small delay; and encoding a sequence of FAPs using a discrete cosine transform (DCT), which introduces significant delay but achieves higher coding efficiency.
In order to avoid transmitting all FAPs for every frame, the encoder can transmit a mask indicating the groups for which FAP values are transmitted. The encoder can also specify for which FAPs within a group values will be transmitted. This allows the encoder to send incomplete sets of FAPs to the decoder.
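The masking idea can be sketched as below. This simplifies the standard's actual mask signalling (which also has per-group mask modes); the function names are illustrative, and "hold the previous value" is just one possible decoder policy for unmasked FAPs.

```python
# Sketch of FAP masking: a per-group mask says which FAPs in the
# group carry values this frame, so the decoder can parse an
# incomplete FAP set and keep the remaining FAPs at their previous
# values.

def build_mask(group_size, transmitted_indices):
    """Return a list of 0/1 flags, one per FAP in the group."""
    chosen = set(transmitted_indices)
    return [1 if i in chosen else 0 for i in range(group_size)]

def apply_mask(mask, values, previous):
    """Merge transmitted values into the previous frame's FAP set."""
    it = iter(values)
    return [next(it) if bit else old for bit, old in zip(mask, previous)]

# Group 2 (jaw/chin/lips) has 16 FAPs; transmit only FAPs 0 and 3:
mask = build_mask(16, [0, 3])
prev_frame = [0.0] * 16
new_frame = apply_mask(mask, [150.0, -80.0], prev_frame)
```

Only two of the sixteen values cross the channel; the mask tells the decoder where to put them.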
Figure 4.1.1 Block diagram of the encoder using arithmetic coding for FAPs.
Figure 4.2.1 Block diagram of the FAP encoder using DCT. DC coefficients are predictively coded. AC coefficients are directly coded.
SYSTEM ARCHITECTURE
On the transmitting machine (TX), an uncompressed FAP file is encoded in real time. A dedicated hardware motion capture system could also serve as the FAP source. The transmitter is responsible for applying the desired encoding parameters and implementing the packetization policy.
Figure 5.1 System Architecture

The HTTP/TCP protocols used for animation transport in the majority of existing systems fail to guarantee fast interaction over a wide range of network conditions. The use of unreliable, connectionless transport protocols, such as the Real-time Transport Protocol (RTP) over UDP, for the delivery of multimedia content has been proposed in order to reduce end-to-end latency and improve robustness against network congestion. On the receiving terminal (RX), a network buffer is used to compensate for jitter and out-of-order arrival of packets. As soon as a sufficient quantity of packets is received (typically 12 packets, or about 1 second; the exact number may be adjusted to fit network conditions), the decoder starts processing the received stream. After reassembling the bit stream, the receiver has to detect network errors and hide them from the animation player. As soon as an error or packet loss is detected, the decoder starts a search for the next reference frame in the bit stream; when this is reached, the decoding process can restart. Another issue is the generation of 3-D face models that resemble the speaker. FAP data can be decoded and applied either to default facial models on the end user's terminal, or to models downloaded for a particular session that more accurately represent the speaker. Various methods have been proposed to produce 3-D models of human faces from camera images.
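The receiver-side network buffer described above can be sketched as a simple reordering buffer. The 12-packet threshold follows the text; the class and method names are illustrative, not from any real player implementation.

```python
# Minimal sketch of a receiver jitter buffer: packets are held until
# a threshold count is reached, then released in sequence-number
# order, absorbing jitter and out-of-order arrival.

import heapq

class JitterBuffer:
    def __init__(self, threshold=12):
        self.threshold = threshold
        self.heap = []               # min-heap keyed on sequence number

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def ready(self):
        """True once enough packets are buffered to start decoding."""
        return len(self.heap) >= self.threshold

    def pop_in_order(self):
        """Drain buffered packets in sequence order."""
        out = []
        while self.heap:
            out.append(heapq.heappop(self.heap))
        return out

buf = JitterBuffer(threshold=3)
for seq in (2, 0, 1):                # packets arrive out of order
    buf.push(seq, f"frame-{seq}")
ordered = buf.pop_in_order() if buf.ready() else []
```

A real player would keep refilling the buffer while decoding; this sketch only shows the reordering step.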
To find an effective compromise between bandwidth and video quality, the choice of encoding and packetization parameters must take into account the characteristics of the channel on which the animation is transmitted. The channel must therefore provide enough bandwidth.
For FAPs, the suitable channel models are the GPRS channel and the EDGE (Enhanced Data rates for GSM and TDMA/136 Evolution) channel. The following table compares the requirements of different application environments; from it, it is clear that the needed combination of bandwidth and low error rate can be provided by GPRS and EDGE.

Table: Comparison of requirements in different application environments

Application environment        Packet loss   IFD    Frames/packet   Animation bit rate   Buffering for error concealment
Uni-directional applications   <5%           5-7    2               5 kbps               ~500 ms
Interactive                    <5%           3-5    2               6 kbps               ~200 ms
                                             1-3    2               9 kbps               ~120 ms
6.1 GPRS
GPRS is a wireless packet-based network architecture using GSM radio systems. The original design of GPRS was driven by non-real-time requirements. Nevertheless, the adaptive multislot capability of GPRS, which allows dynamic allocation of timeslots to a given terminal, provides enough bandwidth for the support of a limited set of multimedia-enabled services. Furthermore, the native support of the IP protocol allows simple interfacing of current IP/RTP-based multimedia applications, such as facial animation streaming, to a GPRS network. For the GPRS channel model, the propagation conditions were those specified in GSM 05.05 as TU50 Ideal Frequency Hopping at 900 MHz. The TU50 channel model represents the multi-path propagation conditions found in typical urban environments. Four channel coding schemes are specified for GPRS, three of which were employed here. The frames are convolutionally coded at different rates: when v output symbols are produced for each input symbol, the convolutional code rate is 1/v; more generally, when k input symbols are shifted through k shift registers and v output symbols are produced, the code rate is k/v. The schemes
used for GPRS are labeled CS-1, CS-2 and CS-3, and respectively correspond to convolutional code rates of 1/2, 2/3 and 3/4. The figure below shows the results of GPRS simulations performed using the various channel coding schemes, at a number of C/I ratios. PSNR values above 45 dB generally indicate very infrequent error bursts. Values between 40 and 45 dB indicate more frequent errors, but overall quality is likely to be acceptable to many users. Taking this as a guide, it is clear that acceptable quality is achievable using all of the channel coding schemes tested. However, relatively high C/I ratios are required when using CS-3, making the use of this scheme undesirable.
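The code-rate arithmetic can be made concrete: a rate-k/v convolutional code emits v output bits for every k information bits, so the redundancy added per useful bit is (v - k)/k. The snippet below compares the three GPRS schemes used here (CS-1 through CS-3 have rates 1/2, 2/3 and 3/4).

```python
# Redundancy added by the GPRS convolutional coding schemes:
# a rate-k/v code transmits (v - k) extra bits per k information bits.

from fractions import Fraction

def redundancy(k, v):
    """Extra bits transmitted per information bit for a rate-k/v code."""
    return Fraction(v - k, k)

schemes = {
    "CS-1": (1, 2),   # rate 1/2: one parity bit per information bit
    "CS-2": (2, 3),   # rate 2/3
    "CS-3": (3, 4),   # rate 3/4: least protection, most payload
}

for name, (k, v) in schemes.items():
    print(f"{name}: rate {Fraction(k, v)}, "
          f"redundancy {redundancy(k, v)} per information bit")
```

This makes the trade-off in the results visible: CS-3 spends the least on protection, which is why it needs the highest C/I ratio to deliver acceptable quality.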
Figure 6.1.1: PSNR results for FAP transmission over a GPRS channel at 11 frames per second
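For reference, the PSNR metric used to grade these results is computed as 10·log10(MAX²/MSE), with MAX = 255 for 8-bit images. A minimal sketch, with illustrative pixel values:

```python
# PSNR between an original and a decoded frame, here represented as
# flat lists of 8-bit pixel values. Identical frames give infinite
# PSNR; small errors give the 40-50 dB values discussed above.

import math

def psnr(original, decoded, max_val=255):
    """Peak signal-to-noise ratio between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return float("inf")        # identical frames
    return 10 * math.log10(max_val ** 2 / mse)

ref = [100, 120, 140, 160]
deg = [101, 119, 141, 159]         # +/-1 errors -> high PSNR
print(f"PSNR: {psnr(ref, deg):.1f} dB")
```

With an MSE of 1 this gives roughly 48 dB, i.e. above the "very infrequent error bursts" threshold quoted in the text.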
6.2 EDGE
Beyond GPRS, EDGE (Enhanced Data Rates for GSM Evolution) is a generation-2.5 air interface which represents a step towards UMTS. It provides higher data rates than GPRS and introduces a new modulation technique, eight-phase shift keying (8-PSK), that allows much higher bit rates and automatically adapts to radio conditions. EDGE shares its available bandwidth among the users on one carrier in a sector; throughput ranges from several tens of kbps to 384 kbps, depending on conditions such as propagation, interference and traffic load. The network chooses a maximum number of retransmissions that may be attempted for each link-layer segment. Link adaptation is used in EDGE so that the system can select the most efficient modulation and coding scheme for each mobile based on its current channel conditions. EDGE uses 8 different channel coding schemes, some of them based on convolutional coding with differing error-correction capabilities. For the EDGE channel model, the propagation conditions were again those specified in GSM 05.05 with ideal frequency hopping; however, for this model the mobile terminal speed is set to 3 km/h. Eight joint modulation-coding schemes are specified, which make use of two different modulation schemes and various convolutional coding rates. Modulation is either GMSK, as used in GSM and GPRS, or 8-PSK, which gives higher data rates. Two GMSK schemes are used here: MCS-1 and MCS-2, corresponding to convolutional code rates of 0.53 and 0.66. Two 8-PSK schemes are also tested: MCS-5 and MCS-6, corresponding to convolutional code rates of 0.37 and 0.49. The other modulation-coding schemes resulted in the transmitted data being subjected to error rates too high to consider for the transmission of FAPs. Results with the EDGE channel model are shown in the figure below. They show that transmission of FAPs using the 8-PSK modulation scheme is likely to result in unacceptable quality unless the C/I ratio is greater than 18 dB.
Even with GMSK modulation, acceptable-quality decoding of FAPs may only realistically be possible using MCS-1, unless a C/I ratio greater than 15 dB can be guaranteed. In terms of error rates, EDGE provides a more hostile environment for multimedia than GPRS.
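The relative link efficiency of the four schemes above can be approximated as (bits per modulation symbol) × (code rate): GMSK carries 1 bit per symbol and 8-PSK carries 3. This is a rough back-of-the-envelope comparison, ignoring burst structure and overheads; the code rates are the ones quoted in the text.

```python
# Approximate information bits carried per modulation symbol for the
# EDGE modulation-coding schemes discussed above.

schemes = {
    # name: (bits per symbol, convolutional code rate)
    "MCS-1": (1, 0.53),   # GMSK
    "MCS-2": (1, 0.66),   # GMSK
    "MCS-5": (3, 0.37),   # 8-PSK
    "MCS-6": (3, 0.49),   # 8-PSK
}

efficiency = {name: bits * rate for name, (bits, rate) in schemes.items()}
for name, eff in sorted(efficiency.items(), key=lambda kv: kv[1]):
    print(f"{name}: ~{eff:.2f} information bits per symbol")
```

The 8-PSK schemes carry roughly twice the payload per symbol, which is precisely why they demand the higher C/I ratios reported above.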
Figure 6.2.1 PSNR results of FAP transmission over an EDGE channel at 11 frames per second.
6.3 RESULTS
Two kinds of error effects were observed in the decoded animations:

1. Freezing of the animation: corrupted data is detected before it is displayed. The display freezes while the decoder searches for the next resync code.

2. Catastrophic display of corrupted data: corrupted data is not detected before it is displayed. This leads to highly obvious, "catastrophic" errors being visible in the decoder display (see figure below).
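The "freeze and resynchronise" behaviour amounts to scanning the byte stream for the next 32-bit resync code and resuming decoding there. A minimal sketch; the marker value used below is an arbitrary placeholder, not the code word defined by the MPEG-4 standard.

```python
# On detecting an error, the decoder freezes the display and scans
# forward for the next 32-bit resync code; decoding resumes at that
# offset.

RESYNC = bytes.fromhex("000001b2")   # placeholder 32-bit marker

def next_resync(stream: bytes, start: int) -> int:
    """Offset of the next resync code at/after `start`, or -1 if none."""
    return stream.find(RESYNC, start)

bitstream = b"\x10\x22" + RESYNC + b"frame1" + RESYNC + b"frame2"

# Error detected at offset 0 -> skip ahead to the first resync point:
resume_at = next_resync(bitstream, 0)
```

Everything between the error and the resync point is discarded, which is why the text recommends inserting resync codes regularly: the denser they are, the less animation is lost per error.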
Typically, an error is detected only indirectly, after some frames, and perfect localization of the error is often impossible. It is thus very important to detect the error as early as possible. With this in mind, the bit stream syntax was analyzed in order to pinpoint the places where an error could be detected. Based on several assumptions that hold for most on-line transmissions, optional checks were introduced for the values of certain fields of the stream. While these fields, such as the gender bit, coding type and object mask, are theoretically unconstrained, in practice they are not supposed to change during a single session. On the one hand, the need for an error concealment module is increased by the fact that the loss of a single P-frame prevents the correct decoding of the following ones. On the other hand, given that the facial animation parameters represent 1-D displacements of the feature points, and that loss bursts are typically comparable with the length of a phoneme (during which the mouth position, or viseme, does not vary significantly), the use of error concealment techniques based on interpolation proves effective in reconstructing FAP trajectories. For MPEG-4, different software implementations of FAP encoding and decoding on mobile platforms are being developed, but they need to add the following functionality. Error-resilient decoding: when errors are detected, the decoder freezes the display and searches for the next resync code; there is no built-in error concealment, so it has to be added. Regular insertion of resync codes: a 32-bit resync code specified in the MPEG-4 standard is inserted before every I-frame to limit the effects of synchronization loss. Output of decoded visual data to file: the displayed output is written to a series of bitmap files, to aid quality evaluation and comparison of test results.
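The interpolation-based concealment described above can be sketched as follows: frames lost in a burst are filled by linearly interpolating each FAP between the last good value and the next correctly decoded one. This is a minimal illustration of the idea, not any particular decoder's implementation.

```python
# Linear-interpolation concealment for a single FAP trajectory.
# Lost frames are marked None; each run of losses is filled by
# interpolating between the surrounding good values. This works well
# when the burst is shorter than a viseme, since the mouth pose
# barely changes over that interval.

def conceal(trajectory):
    """Replace runs of None with linear interpolation between neighbours."""
    out = list(trajectory)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                        # find the end of the burst
            left = out[i - 1] if i > 0 else None
            right = out[j] if j < len(out) else None
            if left is None:
                left = right                  # burst at start: hold next value
            if right is None:
                right = left                  # burst at end: hold last value
            gap = j - i + 1
            for k in range(i, j):
                t = (k - i + 1) / gap
                out[k] = left + t * (right - left)
            i = j
        else:
            i += 1
    return out

received = [10.0, 12.0, None, None, 18.0]   # two frames lost in a burst
repaired = conceal(received)
```

A full concealment module would apply this per FAP across the whole parameter set, but the per-trajectory logic is exactly this.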
APPLICATIONS
8.1 EMBODIED AGENTS IN SPOKEN DIALOGUE SYSTEMS
Using facial animation we can create talking heads which deliver their services through mobile phones. Users were able to ask the talking heads on their mobile phones questions about available services; examples of the services are timetables for trains, and accommodation and the location of hotels. The system may also use a graphical interface. Besides providing lip movements to accompany the synthesized voice output, the head was capable of deictic movements: when information (e.g. a timetable) was presented somewhere in the graphical interface, the face would look and turn towards that location on the screen, thereby guiding the user's attention.
8.2 LANGUAGE TRAINING WITH TALKING HEADS

Talking-head technology also allows interested parties to construct applications involving multimodal speech technology. Using the graphical user interface (GUI), users could select different views of the face and tongue.
Figure 8.2.1 Software tool for language training (remote assistance); Talking head animated for mouth.
CONCLUSION
FAP coding provides a method of supplying animated 3-D representations of speakers at very low bandwidths. Although the processing power involved in acquiring FAP information suitable for encoding may be challenging for mobile terminals, trading quality for complexity may produce feasible solutions. Simulations carried out using the GPRS and EDGE channel models revealed that FAP-coded streams are reasonably robust to errors when compared to conventionally coded video. However, certain channel errors produce highly disturbing effects, which indicates the need for efficient error detection and concealment schemes. Investigation of more advanced resynchronization code insertion schemes is also recommended.
REFERENCES
[1] Jochen Schiller, Mobile Communications, 2nd Edition, Addison Wesley, 2003.
[2] www.apple.com/mpeg4
[3] www.vidiator.com
[4] www.visagetechnologies.com