
Contents

1 Introduction  3
   1.1 Overview of this Thesis  4
2 Related Work  5
   2.1 Imitation Learning For Humanoid Robots  5
   2.2 Postural Expression of Emotions  6
   2.3 Conclusion  7
3 Mathematical Foundation  8
   3.1 Introduction  8
   3.2 Dimensionality Reduction  8
      3.2.1 Principal Component Analysis  9
      3.2.2 Locally Linear Embedding  10
      3.2.3 Isometric Feature Mapping  11
      3.2.4 Manifold Sculpting  12
   3.3 Dimension Expansion  14
   3.4 Summary  14
4 Learning Interaction Models  16
   4.1 Overview  16
   4.2 Interaction Data Acquisition  17
   4.3 Dimensionality Reduction  19
   4.4 Learning Interaction Models  21
      4.4.1 Linear Regression  23
      4.4.2 Artificial Neural Net  24
      4.4.3 Echo State Network  26
   4.5 Real-time Human Posture Approximation and Interaction  27
   4.6 Conclusion  28
5 Emotions and Behavior Modifications  29
   5.1 Walt Disney's Principles of Animation  29
      5.1.1 Squash and Stretch  29
      5.1.2 Timing  30
      5.1.3 Anticipation  31
      5.1.4 Exaggeration  32
      5.1.5 Arcs  33
      5.1.6 Secondary Action  34
   5.2 Expressing Basic Emotions  36
      5.2.1 Happiness  36
      5.2.2 Sadness  37
      5.2.3 Anger  38
   5.3 Summary  39
6 Software Architecture  40
   6.1 Overview  40
   6.2 Dimensionality Reduction  41
   6.3 Interaction Learning Algorithm  41
   6.4 Behavior Database  42
   6.5 Visualization  43
   6.6 Emotional and Behavioral Filters  45
   6.7 Additional Components  45
7 Evaluation  46
   7.1 Arm Mirroring  46
   7.2 Yoga  48
   7.3 Defending Oneself  51
   7.4 Conclusion  54
8 Conclusion  56
   8.1 Summary  56
   8.2 Future Work  57
A Appendix  58
   A.1 A more Technical Interaction Learner Example  58
   A.2 An Emotion Filter Example  59
   A.3 DVD Contents  61
Eidesstattliche Erklärung  69

1 Introduction
Current research in the field of robotics can be divided into two main directions: one is the development of task-oriented robots that work for humans in limited environments, the other is the creation of collaborative robots that can coexist with humans in our open and ever-changing environment. Industrial robots rank among the former. In contrast, humanoids and especially pet robots are developed with the intention of placing them in human society and day-to-day life. To do so, humanoids need to be programmed in an economic and time-saving way. Traditionally, each joint angle is manipulated one at a time, which is a tedious and complex task when the desired motion is supposed to be life-like. Additionally, existing software is often too complex for unskilled users, resulting in the need for an expert.

The human learning process stands in contrast to this. Humans learn new skills through imitation [DN02], and this could be of great value to the field of robotics as well, since imitation is a convenient and comparatively simple learning technique that does not require special training. In a human-centered environment this capability is desirable for humanoid robots. When observing human two-person interactions it becomes clear that people are accustomed to working with other people. Many types of communication rely on human form and behavior. Hence, a humanoid robot should take advantage of these communication channels and should have the ability to respond to interactions in order to be included in human society. This would be a great leap forward, because it would lead to systems that are open to multiple interaction partners as well as to systems that are not limited by their environment.

In this thesis a novel technique for learning such interactions will be developed. In classic imitation learning methods one demonstrator is used to show a motion [SK08]. The behavior is then adopted, learned and played back by a humanoid robot. In doing so, no human-robot interaction capabilities are provided, since the learned behavior is simply played back. The focus of this thesis is on learning two-person interactions rather than the imitation of a single motion. The proposed approach uses recorded motion data of two demonstrators to learn a generalized interaction model. This model calculates motor movements for human-robot interactions. The learning method utilizes motion capture data that has been reduced in dimensionality. Well-known as well as state-of-the-art dimension reduction techniques are implemented to create low-dimensional behavior trajectories. The fundamental idea of the new learning technique is to create a continuous mapping from one low-dimensional behavior to another. In doing so, responsive and reactive motor skills can be calculated during a human-robot interaction. This allows the robot to change its posture depending on the user's movement, creating a life-like and believable interaction. The underlying model is learned offline and prior to the interaction, so no additional calculations are necessary during an ongoing interaction.

1.1 Overview of this Thesis


After reviewing recent developments in the field of imitation and interaction learning in chapter 2, disadvantages of current approaches will be pointed out. After that, fundamental mathematical concepts are introduced in chapter 3.

The novel interaction learning mechanism will be presented in chapter 4. Starting with the acquisition of human motion data with a Microsoft Kinect camera, the data basis for the new learning approach is shown. Recorded human motion data can be very high-dimensional since each joint angle is recorded separately. Hence, dimension reduction should be applied to reduce the computational complexity. Different dimensionality reduction techniques are presented in section 4.3 to project the data to fewer dimensions. The learning of a single interaction is emphasized in section 4.4, where the underlying mapping algorithms, namely linear regression, artificial neural nets and echo state networks, are introduced. When a user interacts with the robot, it needs to adopt suitable postures depending on observed user poses. To calculate these, the interaction model is used; the underlying algorithm is explained in section 4.5.

Additional behavioral and emotional modifications that can be added to a learned interaction model are presented in chapter 5. For that, fundamental rules and techniques of classic hand-drawn cartoon animation will be reviewed and their implementation for two-person interaction models is explained. After that, chapter 6 shows the basic software architecture and which libraries were used to implement the learning algorithms.

In chapter 7 three recorded two-person interactions are evaluated. First, the simple arm mirroring example that is used to introduce interaction models is presented in more detail. After that, two experiments will be conducted where complex motor skills are learned in a virtual environment. Additionally, all calculated interaction models are compared with regard to the applied mapping algorithm. Chapter 8 summarizes the proposed two-person interaction learning approach. The contributions to the field of humanoid robotics as well as computer animation are pointed out. Finally, future research directions are presented.

2 Related Work
In this chapter, recent work on imitation learning in the field of humanoid robotics as well as current research on the expression of basic emotions and attitude through body postures will be summarized.

2.1 Imitation Learning For Humanoid Robots


Over the last ten years the complexity of tasks that humanoid robots have to perform has increased steadily, which makes it difficult to program them manually. Since the fundamental idea of humanoid robotics is to place these robots in human social environments, new skills should be learned intuitively, without the need for an expert. Programming by Demonstration (PbD) has received increasing attention, since it involves one of the main principles of human learning: imitation. Imitation, as stated by Thorpe [Tho63], is the ability of humans and higher animals to reproduce a novel, unseen behavior. This skill can be of value for a humanoid robot as well, and this is one of the main reasons for the recent interest in PbD [SK08].

Chalodhorn et al. [CR10] developed a model-free approach to map low-dimensional data, acquired through a motion capture system, to a humanoid robot. The humanoid learns stable motions with 3D eigenposes created from motion capture data combined with its own sensory feedback. It is generally known that human motion data cannot be transferred to a humanoid robot without further optimization. This is due to the variance in statures and is generally known as the correspondence problem [ANDA03]. The approach of Chalodhorn has been proposed as a solution for this issue. The authors of [MGS10] present a technique that allows a robot to learn generic and invariant movements from a human tutor. They state that the underlying interaction is based on a kinematically controlled model of the demonstrator, which is then used as a model-based filter. In order to have their robot ASIMO play back a motion, an extensive generalization step is added so that it can adapt the behavior to its body schema. Another recent trend is to combine different learning methods with human tutor interactions, so that cognitive abilities can be added to a humanoid robot; architectures for this can be found in [MGS05] and [BMS+05].

In [IAM+09] an interaction learning approach is presented that is based on haptic interactions. The focus is on giving a robot the ability to engage in direct physical contact with its interaction partner. To do so, the authors implemented several machine learning techniques for adapting the behavior of the robot during real-time interaction. The important feature is that the humanoid robot learns during an interaction with a human. The major drawback of this approach is the need for a human in order to evaluate the interaction. In contrast to that, a framework is presented in [GHW10] in which a robot gets the ability to recognize people and remember details about past interactions. The authors describe that an interaction memory is built from data about already recognized people and their related interaction data. The implemented storage structure is XML-based and provides information about the starting time, location and duration of an interaction. The approach is limited by the interaction medium, since speech is the only way to start an interaction. Additionally, the person starting the interaction has to face the robot. Most recently, a Pleo robot has been used by [CSG+11] to implement a low-cost platform to treat physical and mental disorders. The robot is trained via demonstration and reinforcement. The dinosaur robot can then play back learned motions to the beat of music. A dance movement in this context is learned by tracking a yellow block held by the user. In doing so, each leg can be trained separately. The authors state that due to the low-cost approach a relatively basic robot has been bought. Consequently, the robot's hardware lacks certain features, like a high-resolution camera with high frame rates, which are crucial for the implemented learning approach.

Recent work on imitation learning has shown great interest in learning a single behavior. One demonstrator is used to train a motion and a robot plays it back after adapting it to its own body composition. In doing so, the motion is learned by imitating the shown movement. The learning of interactions through imitation has not been in the focus of recent research. In contrast, a method will be developed in this thesis in which two persons demonstrate an interaction and imitation learning is used to adapt the behavior of one demonstrator to a humanoid robot, allowing human-robot interactions.

2.2 Postural Expression of Emotions


Emotions can be conveyed through body postures. The authors of [BbK03] collected several postural expressions of emotions by utilizing a motion capture system with 32 markers. These postures were then presented to an audience, who had to classify them into emotion categories. Bianchi and colleagues clearly state that basic emotions can be conveyed by virtual body postures recorded with motion capture devices. This corresponds to previous work by Bull [Bul78], who conducted several studies on the postural expression of emotions in humans from a psychological point of view. For example, he documented that interest can be communicated by slightly leaning forward while having the legs drawn back. Additionally, Atkinson et al. [ADGY04] argued that five emotional states (anger, disgust, fear, happiness, and sadness) can clearly be expressed by body postures, even with varying exaggeration. Walt Disney stated in the early 1930s that emotion and attitude can be expressed in cartoons by using body features. One of his most famous examples is the half-filled flour sack: these drawings show that attitudes can be conveyed even with the most simplistic shapes [Joh95].

With regard to computer animation, emotional additions to virtual humans can be crucial for their liveliness and believability, because they affect people's cognitive processes, perceptions, beliefs and the way they behave [MTKM08]. Recent developments in neuroscience have shown that body postures become more important for conveying an emotional state in cases of incongruent affective displays [GSG+04]. The model-based approach of Garcia et al. [GRGT08] tries to create life-like reactions and emotions in virtual humans. The underlying reaction model consists of a decision tree that is based on a statistical analysis of the reactions of people. An emotion update model is utilized to increase the vividness of a computer animation. The resulting action is calculated with a combination of key-frame interpolation and inverse kinematics. In doing so, different levels of an emotion can be expressed, which, however, is not backed by psychological theories.

The fundamental results in human psychology regarding the expression of postural emotions have proven to be applicable to humanoid robots as well. For the Nao robot, for example, Marzooqi et al. [MC11] have conducted a study in which the software shipped with the robot has been used to express emotions successfully. The study summarizes that anger, happiness and sadness can be conveyed on the humanoid robot. [ET10] also showed that these emotions can be expressed on this robot platform. A more practical application of emotions in humanoid robots has been implemented by [KNY05]; the authors state that emotions are crucial for autism therapy. [Adr07] also reported that bodily postures can be used to emulate empathy in socially assistive robotics.

2.3 Conclusion
The mentioned interaction learning approaches achieve remarkable results for specific robot platforms. Nevertheless, they are seldom transferable to other systems or robots. Moreover, the proposed algorithms and methods do not provide the functionality to manipulate learned behaviors with regard to their visual appeal; that is, a learned movement will be executed the same way it has been learned. Additionally, most of the existing techniques focus on learning a single behavior rather than an interaction involving two persons. The aim of this thesis is to overcome these limitations. For that, a novel approach will be developed for learning two-person interactions by imitating one interaction partner. Additionally, an implementation of well-known character animation methods is presented, based on the fundamental rules of Walt Disney's Principles of Animation [Joh95]. In conjunction with that, emotional additions are presented to control a learned model even further.

3 Mathematical Foundation
Recorded human motion data can be high-dimensional since each joint angle is recorded separately. Hence, dimensionality reduction should be applied. In this chapter fundamental concepts of dimension reduction will be introduced. The algorithms are explained on manifolds commonly used in the literature. Later, a method will be presented to project unseen points of a low-dimensional embedding back into the manifold's original dimension.

3.1 Introduction
The goal of dimension reduction is to decrease the size of a dataset while preserving the information within it. This is usually done by finding hidden structures and the lowest-dimensional embedding space [Ros08, Ben10]. To demonstrate each algorithm, synthetic, non-linear, two-dimensional datasets lying in a three-dimensional space will be used, namely a Swiss roll, an S-shaped curve and a fishbowl (see figure 3.1). The two-dimensional shapes of these datasets are well known and often used in the literature [SMR06, SR04, SR00, M06, TSL00], hence the embeddings obtained by different dimension reducers can be compared. For that, Pearson's correlation and Spearman's coefficient can be used [SMR06].

Figure 3.1: Two-dimensional non-linear manifolds lying in a three-dimensional space. From left to right: the Swiss roll, fishbowl and S-shaped surface.

3.2 Dimensionality Reduction


The focus of this chapter is on four dimension reduction algorithms, namely PCA (section 3.2.1), LLE (section 3.2.2), IsoMap (section 3.2.3) and Manifold Sculpting (section 3.2.4). Within the succeeding sections the basic mathematical foundation of each will be discussed.


3.2.1 Principal Component Analysis


Principal component analysis (PCA), also known as the Hotelling transform or empirical orthogonal function (EOF) analysis, reduces the dimensionality of data based on the covariance matrix Σ of the variables (see equation 3.1). PCA seeks to reduce the dimensions by finding a few orthogonal linear combinations (the principal components) of the original variables that capture the largest variance. The first principal component (PC) is the linear combination with the largest variance. The second PC is the linear combination with the second-largest variance that is orthogonal to the first PC, and so forth. It is customary to standardize the variables, because the variance depends on the scale of the variables [Hot33].

$$ \Sigma = \frac{1}{n} X X^T \tag{3.1} $$

The off-diagonal terms of Σ quantify the covariance between the corresponding features and the diagonal terms capture the variance of the individual features. The idea of PCA is to transform the data so that the covariance terms become zero. Using the spectral decomposition theorem, Σ can be written as

$$ \Sigma = U \Lambda U^T \tag{3.2} $$

where U is an orthogonal matrix containing the eigenvectors and Λ is a diagonal matrix with the ordered eigenvalues. The total variation equals the sum of the eigenvalues of the covariance matrix:

$$ \sum_{i=1}^{p} \mathrm{Var}(PC_i) = \sum_{i=1}^{p} \lambda_i \tag{3.3} $$

Depending on the eigenvalues, the contribution of the first l principal components can be calculated with the following fraction:

$$ \frac{\sum_{i=1}^{l} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \tag{3.4} $$

The computed eigenvectors hold the principal components of the data set.

Figure 3.2: The datasets introduced in section 3.2 reduced in dimensionality with principal component analysis. Color is used to indicate how points in the result correspond to points on the high-dimensional manifold.

Furthermore, the calculated PCs form the basis of the low-dimensional subspace. It is possible to project any point from the original space into this subspace and vice versa. Dimension reduction can be achieved by subtracting the sample mean and then calculating the dot product of the result with each of the l PCs.
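To make the projection and reconstruction steps concrete, the following NumPy sketch (an illustration, not the implementation used in this thesis) reduces a dataset with the first l PCs and maps low-dimensional points back into the original space:

```python
import numpy as np

def pca_fit(X, l):
    """X holds samples in rows; returns the mean and the first l principal components."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)        # covariance matrix (eq. 3.1)
    eigval, eigvec = np.linalg.eigh(cov)        # spectral decomposition (eq. 3.2)
    order = np.argsort(eigval)[::-1]            # sort eigenvalues in descending order
    return mean, eigvec[:, order[:l]]

def reduce(X, mean, pcs):
    """Project points into the l-dimensional subspace."""
    return (X - mean) @ pcs

def reconstruct(Y, mean, pcs):
    """Project low-dimensional points back into the original space."""
    return Y @ pcs.T + mean
```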

3.2.2 Locally Linear Embedding


Locally linear embedding (LLE), as presented in [SR04], is an unsupervised learning algorithm that computes a low-dimensional embedding while preserving neighbor relationships. Essentially, the algorithm calculates a low-dimensional embedding with the property that neighboring high-dimensional points remain neighbors in the low-dimensional space. LLE is based on the assumption that the data is well-sampled and that the underlying manifold is smooth. Within this context, smooth and well-sampled mean that the dataset's curvature is sufficiently sampled, in the sense that each high-dimensional point has at least 2d neighbors. Under this prerequisite a neighborhood on the manifold can be characterized as a linear patch. The LLE algorithm calculates a low-dimensional dataset as follows:

1. Gathering of all neighboring points - Calculate the neighbors of every point x_i within the dataset (possibly with a k-dimensional tree).

2. Calculation of weights for patch creation - Compute the weights W_ij that approximate x_i as a linear combination of its neighbors while minimizing the reconstruction error in equation 3.5.

$$ E(W) = \sum_i \Big| x_i - \sum_j W_{ij} x_j \Big|^2 \tag{3.5} $$

3. Mapping of embedded coordinates - Map to embedded coordinates by determining the vectors y_i from the weights W_ij. This is done by minimizing the quadratic form in equation 3.6 using its bottom non-zero eigenvectors.

$$ \Phi(y) = \sum_i \Big| y_i - \sum_j W_{ij} y_j \Big|^2 \tag{3.6} $$

While minimizing the cost function in equation 3.5 two constraints need to be taken into account. Firstly, each data point x_i is reconstructed only from its neighbors, so W_ij = 0 if x_j is not part of the neighbor set of x_i; secondly, the rows of the weight matrix need to sum to one. The optimal weight matrix W is then found by solving a set of constrained least-squares problems. One drawback of LLE is its sensitivity to noise; even small noise can cause failure in obtaining low-dimensional coordinates [CL11]. The algorithm is also highly sensitive to its two main parameters, the number of neighbors and the regularization parameter.
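For experimentation, scikit-learn provides an LLE implementation that exposes exactly these two parameters; the sketch below is illustrative, and the parameter values are not the settings used in this thesis:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500)           # Swiss roll dataset from section 3.1
lle = LocallyLinearEmbedding(n_components=2,
                             n_neighbors=12,     # number of neighbors
                             reg=1e-3)           # regularization parameter
Y = lle.fit_transform(X)                         # two-dimensional embedding
```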



Figure 3.3: The datasets introduced in section 3.2 reduced in dimensionality with locally linear embedding. Color is used to indicate how points in the result correspond to points on the high-dimensional manifold.

3.2.3 Isometric Feature Mapping


The isometric feature mapping algorithm (IsoMap) is a multidimensional scaling approach generalized to non-linear manifolds. Within IsoMap the dimensionality reduction problem is viewed as a graph problem, which requires the distances between all pairs i and j of the N data points in the high-dimensional space X as input. The output calculated by IsoMap are coordinate vectors in a lower, k-dimensional space Y that represent the intrinsic geometry of the underlying data [TSL00]. The IsoMap algorithm, as stated by Tenenbaum and colleagues [TSL00], is composed of the following three steps:

1. Estimation of the neighborhood graph - First of all the algorithm determines the neighbors on the manifold based on their Euclidean distances d_X(i, j)¹. The neighbor relationships are then represented as a weighted graph G with an edge weight of d_X(i, j).

2. Calculation of the shortest paths in the neighborhood graph - The geodesic distances d_M(i, j) between all points on the manifold are estimated by computing the shortest path lengths d_G(i, j) in the graph G. d_G(i, j) is initialized to d_X(i, j) if i and j are linked by an edge, and to ∞ otherwise. Then, for each value of n = 1, 2, ..., N, the entries d_G(i, j) are replaced by min{d_G(i, j), d_G(i, n) + d_G(n, j)}. Finally, the shortest path lengths are stored in a matrix D_G = {d_G(i, j)}.

3. Construction of the lower-dimensional embedding - Multidimensional scaling is applied to D_G in order to achieve dimensionality reduction. The embedding in the k-dimensional Euclidean space is created by minimizing the cost function over all coordinate vectors Y:

$$ E = \| \tau(D_G) - \tau(D_Y) \|_{L^2} \tag{3.7} $$

D_Y denotes the matrix containing the low-dimensional Euclidean distances between pairs of points, {d_Y(i, j) = ||y_i − y_j||}. Within the context of equation 3.7, the operator τ converts distances to inner products. More precisely, τ is defined by τ(D) = −HSH/2, where S is the matrix of squared distances (S_ij = D_ij²) and H is the centering matrix (H_ij = δ_ij − 1/N) [TSL00].

¹ Within this thesis a k-dimensional tree is used to determine the neighbors of a given point.

Figure 3.4: The datasets introduced in section 3.2 reduced in dimensionality with isometric feature mapping. Color is used to indicate how points in the result correspond to points on the high-dimensional manifold.

The accuracy of IsoMap depends highly on the number of neighbors k_n used to create the weighted graph. This parameter has to be set independently for each problem, as it can have a considerable impact on the calculated results (see figure 3.5). For reasons of simplification an automatic regulator can be used [SMR06].

(a) kn = 8

(b) kn = 24

(c) kn = 256

Figure 3.5: The accuracy of IsoMap depends highly on the neighborhood parameter k_n, which is shown with three different values for the Swiss roll dataset. If the value is too small, discontinuities in the graph can occur, causing the manifold to fragment into disconnected clusters. In contrast, values that are too large will include data points from other branches of the manifold, shortcutting them. This leads to errors in the final embedding [SMR06].
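As an informal illustration of how this parameter enters the computation, scikit-learn's IsoMap can be run on the Swiss roll with the three neighborhood sizes compared above (a sketch, not the thesis implementation):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1500)
for k in (8, 24, 256):                      # the values compared in figure 3.5
    Y = Isomap(n_neighbors=k, n_components=2).fit_transform(X)
    print(k, Y.shape)
```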

3.2.4 Manifold Sculpting


A novel approach presented by Gashler et al. [M06], referred to as Manifold Sculpting (MS), is a non-linear dimensionality reduction (NLDR) algorithm which iteratively transforms data by balancing two opposing heuristics: one that scales information out of unwanted dimensions and one that preserves the local structure of the data.


First of all, the algorithm searches for the k nearest neighbor points N_i of a given point p_i. In the second step the algorithm computes the Euclidean distances δ_ij between p_i and its k nearest neighbors n_ij. Meanwhile, the angle θ_ij between the two line segments p_i n_ij and n_ij m_ij, where m_ij is the most collinear neighbor of n_ij with p_i, is calculated². The algorithm then tries to retain the values of δ_ij and θ_ij during the transformation and projection.

Figure 3.6: The datasets introduced in section 3.2 reduced in dimensionality with Manifold Sculpting. Color is used to indicate how points in the result correspond to points on the high-dimensional manifold.

Before transforming the data, a pre-processing step can be included in order to achieve faster convergence. [M06] describe that a principal component analysis can be applied to move the information in the data to fewer dimensions. The first d principal components are calculated, where d is the number of dimensions that will be preserved during the projection. Then the dimensional axes are rotated to align with these PCs. The next step iteratively transforms the data until the local changes fall below a threshold, which can be set depending on the desired output quality. The dimensions that will not be preserved during the transformation, D_scaled, are scaled down within each step, so their values slowly converge to zero. The preserved dimensions D_preserved are scaled up to keep the average neighbor distance. Then, the neighbor relationships are recovered with an error heuristic: all entries of D_preserved are adjusted with a simple hill-climbing technique in the direction that yields an improvement. Once the transformation is done, D_scaled contains only values close to zero, which are dropped during projection in order to achieve the dimension reduction. Manifold Sculpting is robust to sampling holes and produces high-quality results under the assumption that a high sample rate is used [M06]. However, its computational complexity and the required hardware resources need to be taken into account when analyzing larger datasets.

² The point with an angle closest to π is called the most collinear neighbor.



3.3 Dimension Expansion


When playing back an animation on a robot or avatar, one needs to control all joint angles. For that, a low-dimensional model, e.g. a dataset reduced in dimensionality, needs to be transformed continuously back into its original dimension. The idea behind some dimension reduction approaches is that neighboring points in high-dimensional space remain neighbors in the low-dimensional space. Those relationships can also be exploited during dimension expansion as follows:

1. Search for low-dimensional neighbor points - Firstly, a search in the low-dimensional space for the k nearest neighbors (P_l^1, ..., P_l^k) of a given low-dimensional point P_l is conducted.

2. Creation of a weight matrix - Then, a weight matrix W is created so that P_l is approximated by a linear combination of its k neighbors:

$$ P_l = \sum_{i=1}^{k} W_i P_l^i \tag{3.8} $$

3. Restore the high-dimensional point utilizing W_i - The idea is that neighboring points in low-dimensional space remain neighbors in high-dimensional space. Hence, every neighbor of P_l has an exact high-dimensional representation (P_h^1, ..., P_h^k) that can be identified by its index i (1 ≤ i ≤ k). A high-dimensional point P_h is then found by combining the high-dimensional neighbors P_h^i with the weight matrix W:

$$ P_h = \sum_{i=1}^{k} W_i P_h^i \tag{3.9} $$

This algorithm is applied to all low-dimensional points that do not equal any point within the given dataset. It is obvious that this dimension expansion technique needs both high- and low-dimensional representations of a dataset in order to calculate a high-dimensional representation P_h of an unknown low-dimensional point P_l.
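A minimal NumPy/SciPy sketch of these three steps is given below. It is one possible reading of the algorithm: the weights are obtained from a regularized local least-squares fit, which is an assumption, since the text does not prescribe a particular solver.

```python
import numpy as np
from scipy.spatial import cKDTree

def expand_dimension(p_low, low_pts, high_pts, k=6, reg=1e-3):
    """Approximate the high-dimensional counterpart of an unseen low-dimensional point."""
    # 1. search for the k nearest low-dimensional neighbors
    _, idx = cKDTree(low_pts).query(p_low, k=k)

    # 2. weights W_i so that p_low is approximated by a combination of its neighbors
    #    (regularized least squares, weights normalized to sum to one)
    diff = low_pts[idx] - p_low
    G = diff @ diff.T
    G += reg * np.trace(G) * np.eye(k)
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()

    # 3. apply the same weights to the exact high-dimensional neighbors (eq. 3.9)
    return w @ high_pts[idx]
```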

3.4 Summary
In this chapter important mathematical concepts have been introduced that are crucial for the interaction learning approach developed in this thesis. Several dimensionality reduction techniques utilizing well-known as well as current state-of-the-art algorithms were explained. It has been pointed out that some try to calculate a low-dimensional embedding with linear approximations, while others use approaches originating in graph theory. Figure 3.7 shows the low-dimensional embeddings obtained by the introduced dimension reduction techniques applied to three datasets. The overall precision of the produced results can vary and each concept has its advantages as well as limitations.


The analysis of each algorithm was performed on simple datasets used in the literature, which are in general not transferable to other applications or domains. Hence, no preference for a specific technique was expressed; each algorithm will be analyzed with regard to its applicability to recorded interaction data in the following chapter. Due to the nature of some dimension reduction techniques a projection back into the original dimension is not possible. Therefore, a simple mathematical concept to transform unseen points of a low-dimensional embedding into the high-dimensional space has been presented.
Original Dataset | PCA | LLE | IsoMap | Manifold Sculpting

Figure 3.7: The figure shows all introduced dimension reduction methods applied to three simple datasets used in the literature. As can be seen, the results vary greatly regarding their shape and precision.


4 Learning Interaction Models


4.1 Overview
In this chapter a novel interaction learning method will be introduced that allows humanoid robots as well as virtual humans to interact with people (see figure 4.1). The foundation of this approach is an interaction model which is created from demonstrated two-person interactions.

Figure 4.1: Human motion is recorded utilizing a depth camera. The acquired behaviors are then used to learn a low-dimensional mapping onto a virtual human's or humanoid robot's behavior. For that a two-person interaction model is learned.

The learning technique is based on low-dimensional behavior embeddings that are calculated for both demonstrators. In doing so, a small and yet complete dataset is created from a shown two-person interaction. In order to animate the virtual human or manipulate a robot's joint angles, an algorithm will be introduced that learns how to map one shown movement onto the other while preserving temporal features. This mapping is essential, since it will be used to predict a user's posture and calculate suitable robot or virtual human postures in order to have users interact with them.


Additionally, this model can generalize observed human motion so that various versions of a single action are possible. This allows synthetic humanoids and robots to adapt known behaviors to changing situations. This is especially useful considering that humans tend to have a low repetitive accuracy when performing tasks [TLKS08]. In the following it will be discussed how human motion data can be acquired using a Microsoft Kinect depth camera. Furthermore, the dimension reduction techniques introduced in section 3.2, applied to recorded human behaviors, are explained. One dimension reducer will be emphasized since it is ideal for the interaction learning approach developed in this thesis. Later, the algorithm for extracting a generalized model of two-person interactions will be introduced. The role of dimension reduction will be explained, since the interaction model is based on low-dimensional training data. In conjunction with that, the approximation of human poses during real-time interaction is addressed.

4.2 Interaction Data Acquisition


The interaction learning approach presented here is based on the fact that humans often learn by imitation: they observe others and adopt behaviors [Zen06]. This ability can be valuable for synthetic humanoids as well, giving them a tool to learn how to interact with people. The necessary motor movements can be calculated from observed human joint angles, which is possible due to the similarity in body compositions [Ben10]. Motion capture systems have become increasingly popular for life-like virtual character animation [PB02]. The motion data used in this thesis is recorded with a Microsoft Kinect depth camera utilizing the same-titled software development kit. There are several reasons for this choice; mainly, the consumer-market availability and the low-cost nature of this motion capture device determined the decision.

Figure 4.2: The figure illustrates the recognized joint rotations. Not all axes of rotation are shown, for reasons of clarity and comprehensibility. A complete list of all joint angles can be found in [Ber11].


The underlying framework which extracts human body postures was developed by Berger [Ber11] and supports the continuous recording of up to two persons simultaneously. The recognized joint angles are displayed in figure 4.2. The Microsoft Kinect camera is also used for joint angle extraction during real-time interaction. For that, joint angle values are transformed into the low-dimensional behavior space and used as input data for a learned two-person interaction model in order to react interactively. A trivial imitation example is shown in figure 4.3. Two persons were instructed to mirror each other's arm movement. After the left person (A) received a secret sign he started pulling up his right arm. Shortly after this movement started, the right person (B) noticed the changing shoulder angle and corrected his own left arm pose. When the shoulder angle of the left person reached approximately 45 degrees he was shown another secret sign to lower his arm again. In the next chapters this example will be used to teach a virtual human to mirror the pose of the right person.

Figure 4.3: The figure illustrates steps during a mirrored two-person arm movement. The recording of this behavior consists of 240 frames of 50 ms each.

The precision of the calculated joint angles depends on the frame rate used to record the human behavior. In general, low frame rates lead to small datasets with a lack of accuracy, whereas high frame rates result in larger datasets with increased precision. Furthermore, a higher sampling rate leads to redundant joint angle values for slow motions, whereas lower rates increase the risk of having too few measurement points, resulting in choppy animations. For the behaviors used in this thesis a frame rate between 16 fps and 25 fps has proven to be well suited. In order to map an animation onto a virtual human, the joint angle values need to be transformed from the recorded human space into the coordinate system of the virtual representative. Varying degrees of freedom complicate this calculation. This also applies to humanoid robots and is generally known as the correspondence problem [ANDA03]. To overcome this issue for humanoid robots, multiple genetic algorithms combined with trajectory adaptation are applied [Ber11]. In doing so, acquired movements are transformed into the robot's space, allowing it to adopt human behaviors within a simulation. Due to the low accuracy of the simulation engine in use, not all behaviors can be played back on a real robot. In this section it has been briefly explained how a Microsoft Kinect depth camera can be used to record human motion. Acquired joint angles are stored in a high-dimensional matrix; since each value is stored in one column with one row per time step, this leads to large records for longer motions. In the following section the previously introduced dimensionality reduction algorithms are applied to decrease the size of each behavior.



4.3 Dimensionality Reduction


The human body is an articulated object with a very high number of degrees of freedom (DOF). One problem that arises when recording human motion is that not all measured variables are important for understanding the underlying phenomena. A walking gait, for example, is a one-dimensional manifold embedded in a high-dimensional visual space [EL04]. From an observer's perspective the shape of a walking person deforms over time within its physical constraints. When we consider the silhouette as a point in high-dimensional space it becomes obvious that it moves on a low-dimensional manifold over time. Dimensionality reduction should be applied to strip off the redundant information, producing a more economic representation of the recorded motion. Dimensionality reduction is not only beneficial for computational efficiency but can also improve the accuracy of data analysis [Fod02].
(a) Person A pulling up its right arm

(b) Person B pulling up its left arm

Figure 4.4: An arm movement dataset reduced to two dimensions with (from left to right) PCA, LLE, IsoMap and Manifold Sculpting. Color is used to display the temporal coherence within the dataset: during the execution of the movement a low-dimensional point moves along the curve, from a starting color of dark blue through to an ending color of light gray.

In the following sections the previously introduced arm mirroring dataset (see section 4.2) will be used to compare the dimension reduction techniques introduced in section 3.2 with regard to their applicability to the interaction learning approach presented in this thesis. The main focus is on the practicability and readability of the produced low-dimensional trajectory rather than its precision; a justification for this is given in section 4.4. The first diagram of figure 4.4 is a visualization of the embedding calculated with principal component analysis. Since the eigenvalues are sorted in a descending manner, the first few principal components encode most of the information.


In the domain of robot motion a few PCs, e.g. five to eight, are sufficient to retain up to 97% of the information [Ber09, Ben10]. For that, the first l principal components must have a cumulative proportion greater than 0.97 in order to limit the information loss to 3%.
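As a small illustration of this criterion (a sketch with toy numbers, not the thesis code), l can be chosen directly from the eigenvalue spectrum:

```python
import numpy as np

def num_components(eigenvalues, threshold=0.97):
    """Smallest l whose cumulative variance fraction (eq. 3.4) exceeds the threshold."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratios = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratios, threshold) + 1)

print(num_components([6.0, 1.5, 0.3, 0.15, 0.05]))   # -> 3 for this toy spectrum
```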

(a) Person A pulling up its right arm

(b) Person B pulling up its left arm

Figure 4.5: Principal component analysis applied to an arm movement dataset. Gray regions mark key postures. Since the arm was pulled up and down again, the PCA creates an enclosed trajectory.

Each point within the calculated space has a corresponding posture [Ben10], hence the name posture space. Some of these are shown in figure 4.5. Because of the underlying linearity, postures can also be obtained from points that do not equal points on the recorded trajectory. The second diagram in figure 4.4 is an illustration of the embedding space calculated by locally linear embedding. The measured joint angle values are not noiseless¹, since they were captured using an infrared camera. Hence, LLE failed to obtain low-dimensional coordinates because of its sensitivity to noise [CL11]. The low number of measurement frames amplifies this problem even further. A low-dimensional embedding calculated with isometric feature mapping can be seen in figure 4.4 (second graphic from the right). Similar to LLE, the IsoMap algorithm was not able to create a meaningful low-dimensional space. This is once again due to the low number of measurement points. In contrast to that, the Manifold Sculpting algorithm is able to obtain a correct embedding (see figure 4.6). But MS lacks the ability to transform additional points from high-dimensional space into the low-dimensional embedding.
¹ An evaluation concerning noise when using a Microsoft Kinect camera can be found in [Ber11].


This results in the need for an external transformation tool for unseen points in order to obtain their low-dimensional coordinates.

(a) Person A pulling up its arm

(b) Person B pulling up its arm

Figure 4.6: Manifold Sculpting used to reduce an arm movement dataset in dimensionality. Gray regions mark key postures.

The output of each algorithm varies greatly regarding its visual appearance and continuity. Within the context of this thesis, the smooth trajectory curvatures and the simple mathematical concept of PCA make it the ideal dimension reducer. This also corresponds to the arguments of Chalodhorn et al. [CR10] for using PCA in learning algorithms. Nevertheless, all techniques have their advantages as well as limitations. For the interaction learning approach presented here any dimensionality reduction technique can be used, as long as the embedding is sufficiently smooth in the low-dimensional space.

4.4 Learning Interaction Models


The low-dimensional representations introduced in section 4.3 are a compressed version of a single behavior and equivalent to an array of temporary postures adopted by the human teachers during an interaction. That is, these models are complete records of a shown two-person interaction, packed into two low-dimensional spaces. Each point in such a space corresponds to a posture, so the postures could be adopted by a humanoid robot or virtual human when transformed back into their original dimension. In the example in section 4.3 a dataset has been recorded where one person mirrors the arm movement of another. Now this scenario will be altered in such a way that a virtual human replaces the second person (person B). Within a simulation an avatar will be controlled by the first person in order to interact with the virtual human. The virtual human is then instructed to mirror the arm movement of the avatar.


When a person is instructed to repeat a shown behavior, it is unlikely that the executed behavior equals the one in the recording [Ben10]; postures and movement speeds will most likely vary. Since the virtual human does not know how fast the arm movement will be, it needs to react interactively in order to mimic the behavior correctly. Because of the concurrent recording of both humans, the temporal coherence remains within the datasets. The low-dimensional model of the second person is now assigned to a robot or virtual human (see figure 4.7). During a user interaction a suitable pose from the posture space has to be adopted with the least amount of delay, avoiding unnatural waiting periods.

(a) Person A pulling up its arm

(b) Behavior of person B assigned to a Nao robot

Figure 4.7: The figure illustrates the arm mirroring data of person A and the assigned postures for a virtual Nao robot. Once again, color is used to mark the temporal coherence between both datasets.

The question is when and how the virtual human has to react to an observed human pose. This is done by searching the assigned low-dimensional model for a posture that is suitable for the observed low-dimensional posture of the first person. Thus, the virtual human needs a continuous mapping from one behavior to another. This becomes obvious when analyzing the low-dimensional behavior trajectory: every human pose has a corresponding point in the low-dimensional space, so when repeating an interaction this point moves along the behavior trajectory (see figure 4.8). Due to the low repetitive accuracy of human movements the newly found point will most likely not lie exactly on the trajectory, but rather in the vicinity of it. Humans also vary in size and silhouette, which results in differing low-dimensional points when they adopt the same pose, making a continuous mapping between both recorded behaviors indispensable. An aggregation of both low-dimensional spaces into one is not possible due to the fact that some dimension reduction techniques use non-linear approaches to find the underlying manifolds.




Figure 4.8: The figure illustrates that each point within the first space (left) has a corresponding point in the second space. A mapping between those two has to be calculated by the interaction learning algorithm.

Even when using linear techniques like PCA, the principal components would still vary in their direction, making a single combined space hard to predict. This leads to the conclusion that a mapping between both low-dimensional models has to be found in order to calculate robot poses depending on observed human postures. The combination of these two spaces with a mapping algorithm is called a two-person interaction model. In the following sections three implemented techniques that create these interaction models, utilizing different machine learning approaches, will be introduced. For this, the example introduced above, where a virtual human has to learn to mirror the arm of a user-driven avatar, will be used.

4.4.1 Linear Regression


Linear regression (LR) is a special case of the general concept of regression analysis. One dependent variable Y (the target variable) is described by several independent variables X_i (the predictor variables) with a linear function (see equation 4.1).

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n \tag{4.1} $$

The linear regression analysis determines the values of the coefficients β_i that minimize the sum of the squared residual values for a dataset. The residual value, or deviation, describes the difference between the actual value of the target variable Y and its predicted value Y_p. Figure 4.9 (left) illustrates a mapping learned from the arm movement dataset with linear regression. The green trajectory once again describes the low-dimensional embedding of person B. The mapped curve, created by LR with the joint angles of person A as input data, is highlighted in red.


The Euclidean distance between desired and calculated points in low-dimensional space can be seen in figure 4.9 (right). This distance can be considered an error heuristic for the interaction model: a distance of zero between both points results in an equal posture; if the error is greater than zero the two poses will merely appear similar, and with rising error values the similarity decreases. Two red arm poses in figure 4.9 are extracted from the regions with the largest error. The desired trajectory is highlighted in green, whereas the learned curve is marked in red. Resulting differences in arm poses are indicated for the regions with the largest error values.
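The mapping itself can be estimated with ordinary least squares. The sketch below uses illustrative variable names and random placeholder data rather than the thesis datasets; it fits the coefficients of equation 4.1 for every output dimension at once and evaluates the Euclidean error heuristic:

```python
import numpy as np

def fit_linear_mapping(low_A, low_B):
    """Least-squares estimate of the coefficients in equation 4.1, mapping
    person A's low-dimensional postures onto person B's."""
    X = np.hstack([np.ones((len(low_A), 1)), low_A])   # prepend the beta_0 column
    B, *_ = np.linalg.lstsq(X, low_B, rcond=None)
    return B

def predict(B, pose_A):
    """Map one observed low-dimensional posture of person A to person B."""
    return np.concatenate(([1.0], pose_A)) @ B

# placeholder data: 240 frames of 2-D postures for both persons
low_A, low_B = np.random.rand(240, 2), np.random.rand(240, 2)
B = fit_linear_mapping(low_A, low_B)
errors = np.linalg.norm(low_B - np.hstack([np.ones((240, 1)), low_A]) @ B, axis=1)
```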

Figure 4.9: Left: The arm mirroring data of person A mapped with linear regression onto the behavior of person B. The desired values are marked green; calculated results are highlighted red. Right: The calculation error obtained during LR mapping. Colored circles are used to show how the regions with the highest error correspond to the behavior trajectory.

The advantage of LR is its computational scalability and flexibility compared to classic or recurrent neural nets. However, the disadvantage of this approach lies in the accuracy of the produced results when dealing with complex non-linear datasets (see chapter 7).

4.4.2 Artificial Neural Net


Learning input-output relationships from recorded motion data can be considered the problem of approximating an unknown function, and an artificial neural net (ANN) is known to be well suited for this [MC01]. This learning algorithm is inspired by the structure of biological neural nets, and its adaptive system changes its structure based on internal and external information. In contrast to LR, artificial neural nets can learn from data. This makes it possible to use preceding human postures in order to calculate virtual postures. When only examining a single human pose it is unapparent what the person's movement direction is, and falsely interpreting the direction can result in unnatural robotic behavior. But when using multiple human poses, the person's movement history can be analyzed and the synthetic humanoid's joint angle values can be set accordingly. Figure 4.10 shows how multiple human postures are mapped onto a single virtual human posture using an ANN.


The pose history is also called a sliding window, and its size (the number of postures stored in it) has to be set independently for each recorded behavior.

Figure 4.10: Multiple human postures are mapped onto a single robot/virtual human posture with an artificial neural net. The different layers of the ANN are highlighted blue, green and red.

The ANN consists of three layers: an input, a hidden and an output layer. How many neurons each layer consists of and which connectivity value is used depends on the recorded motion data; these values are set automatically by the software. All points of the human low-dimensional posture space are added to the ANN's input layer. In doing so, the neural net is trained with the supplied dataset. When using the net for prediction, several input points (as many as the size of the sliding window) are combined to produce a single virtual human posture.
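A rough sketch of this sliding-window training scheme is given below, using scikit-learn's MLPRegressor as a stand-in for the thesis's neural net; the window size, layer size and placeholder data are assumptions for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# placeholder trajectories: 240 frames of 2-D postures for persons A and B
low_A, low_B = np.random.rand(240, 2), np.random.rand(240, 2)

def make_windows(src, dst, window=5):
    """Pair `window` consecutive postures of A with the concurrent posture of B."""
    X = np.array([src[i - window:i].ravel() for i in range(window, len(src))])
    return X, dst[window:len(src)]

X, Y = make_windows(low_A, low_B)
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, Y)

# during an interaction: feed the last `window` observed postures of the user
next_pose_B = net.predict(low_A[-5:].ravel()[None, :])
```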

Figure 4.11: The left figure illustrates the mapping learned with an ANN. Colored regions indicate how the areas with the highest mapping error correspond with the overall error diagram on the right.


The error obtained during the transformation is illustrated in figure 4.11. Colored regions indicate how the areas with the highest error correspond to the overall behavior trajectory. As mentioned earlier, the supplied motion data describes a smooth trajectory in high-dimensional space. This characteristic has to remain within the low-dimensional embedding, because a smooth behavior trajectory has proven to be well suited as training data. Noisy input data can lead to overtrained networks that adapt to the noise and do not generalize to unseen input points. Especially for strongly non-linear embeddings overtraining can occur.

4.4.3 Echo State Network


Echo state networks (ESNs) are a pioneering approach in reservoir computing. ESNs are based on the observation that randomly created recurrent neural nets possess certain algebraic properties, and that training a linear readout from them is often sufficient to achieve excellent performance [JMP07, LJ09]. Within an ESN a recurrent neural network, called the dynamic reservoir, is created randomly and remains unchanged during training. It is driven by the input data and stores in its states a non-linear transformation of the input history. This allows ESNs to develop a self-sustained temporal activation dynamic [LJ09]. The desired output signal is then generated as a linear combination of the reservoir states, computed with linear regression using the training data as target output. A mapping learned by an ESN for the arm mirroring example is shown in figure 4.12. During the first few steps the ESN has too few points in its history, resulting in large Euclidean distances between calculated and desired output values. After the input history has been built up the error drops tremendously.
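For illustration, a minimal echo state network can be written in a few lines of NumPy. This is a generic textbook-style sketch; reservoir size, spectral radius and ridge factor are assumptions, not the parameters used in this thesis:

```python
import numpy as np

class EchoStateNetwork:
    """Fixed random reservoir plus a linear readout trained by (ridge) regression."""
    def __init__(self, n_in, n_out, n_res=100, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        self.W = W * spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.n_res, self.x = n_res, np.zeros(n_res)

    def _update(self, u):
        """The reservoir state stores a non-linear transformation of the input history."""
        self.x = np.tanh(self.W_in @ u + self.W @ self.x)
        return self.x

    def fit(self, U, Y, ridge=1e-6):
        S = np.array([self._update(u) for u in U])          # collected reservoir states
        self.W_out = np.linalg.solve(S.T @ S + ridge * np.eye(self.n_res), S.T @ Y)

    def predict(self, u):
        return self._update(u) @ self.W_out
```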

Figure 4.12: The left figure illustrates the mapping learned with an ESN. Once again, colored regions indicate how the areas with the highest mapping error correspond with the overall error (diagram on the right).

ESNs can be more efficient than ANNs [LJ09] since smaller numbers of neurons are used. On the other hand, single parameter changes can be computationally


expensive, because of the resulting size of the update cycles. Only relatively small datasets should be used. An additional downside is the variety of control parameters; in the majority of cases expert knowledge is required in order to set these optimally. As an example, figure 4.13 illustrates three different reservoir sizes and their impact on the learned trajectory.

PC 1

PC 1

PC 1

PC 2

PC 2

PC 2

Figure 4.13: Reservoir sizes and their influence on the smoothness of the mapping. The number of points stored in the ESN can have a considerable impact on the learning result. The shown reservoir sizes are (from left to right): 5, 15 and 25 points.

4.5 Real-time Human Posture Approximation and Interaction


Once the behaviors are recorded and an interaction model has been learned, a virtual human can be used for real-time interaction (see figure 4.14). One human is replaced by a virtual human and the generalizing model is used to have the avatar interact with the remaining interaction partner in virtual reality. Since the interaction will not be executed exactly the same way it has been recorded, the virtual human needs to analyze the person's current posture. The person is once again captured utilizing the Microsoft Kinect depth camera, and the extracted joint angle values are reduced in dimensionality. Since some dimensionality reduction algorithms do not support unseen input data, a general approach has to be used for the transformation. The well-known k-dimensional tree search structure is employed to project the observed high-dimensional human posture into a previously created low-dimensional embedding. The algorithm operates in the same way as the method introduced in section 3.3. The transformed point is then used as an input value for the interaction model in order to calculate the virtual human's posture. During real-time interaction this process takes place several times per second, creating a continuous flow of approximated human postures in low-dimensional space. Especially for ESNs this feature is used to analyze the temporal features of a person's behavior.
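A simplified sketch of this projection step is given below. It mirrors the neighbor-weighting idea of section 3.3 but, as a simplifying assumption, uses inverse-distance weights instead of a least-squares fit:

```python
import numpy as np
from scipy.spatial import cKDTree

def project_to_embedding(pose_high, high_pts, low_pts, k=6):
    """Approximate the low-dimensional coordinates of an observed posture
    from its k nearest recorded postures."""
    dist, idx = cKDTree(high_pts).query(pose_high, k=k)
    w = 1.0 / np.maximum(dist, 1e-9)         # inverse-distance weights
    w /= w.sum()
    return w @ low_pts[idx]                  # input point for the interaction model
```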



[Figure 4.14 diagram components: Interaction Model, Human Posture Approximation, Virtual Pose Estimation, Posture Execution.]

Figure 4.14: For live interactions the person's joint angle values are extracted. These angles are then reduced in dimensionality and used as input points for the learned interaction model. Postures for the virtual human and the robot are extracted from it in order to interact with the user.

4.6 Conclusion
In this chapter a novel interaction learning method has been introduced. Based on recorded two-person interactions, two low-dimensional embeddings are calculated. In order to create life-like interactions, the first behavior is mapped onto the second behavior, which is assigned to a humanoid robot or virtual human. The mapping can be learned with linear regression, artificial neural nets or echo state networks. To compare the algorithms, a Euclidean error heuristic has been proposed. For a live interaction the interaction model is used to predict, from the user's current body posture, the most suitable posture for the robot or virtual human. So far, only a simple arm mirroring example has been used to introduce interaction models. In chapter 7 an evaluation is conducted, more complex interaction scenarios are presented, and all mapping algorithms are compared against each other.


5 Emotions and Behavior Modifications


In this chapter, emotional and behavioral modifications that can be added to a learned two-person interaction model are introduced. First, the implementation of Walt Disney's Principles of Animation within a calculated interaction model is presented; it will be pointed out that some of these principles are already included in the model, while others can be added manually. Then the conveyance of three basic emotions through interaction models for synthetic humanoids is addressed, emphasizing the importance of body postures for expressing emotional states.

5.1 Walt Disney's Principles of Animation


Between the late 1920s and the late 1930s, animations at Walt Disney's studios became more and more life-like and sophisticated. The studio started to create characters that expressed emotions and were visually appealing to the viewer's eye. It was apparent to Walt Disney that each action that was going to happen in a scene had to be unmistakably clear to the audience. For that, animators analyzed human motion in nature and gathered every detail. Eventually, they isolated and named certain animation procedures, and the newly invented practices and fundamental rules became known as The Principles of Animation [Joh95]. Their applicability to 3D computer animation has been pointed out by Lasseter [Las87], who describes that all principles that have proven so well suited for classic 2D animation also apply in the animated three-dimensional world. Since a virtual human or humanoid robot has to be convincing and appealing to the human eye, a subset of these principles has been implemented as animation filters for interaction models. The focus of the following sections is on basic concepts of how these principles can be implemented for virtual humans and humanoid robots at the same time. Since the underlying data for both target platforms is the same, the resulting motions can be exchanged mutually. Because the reproduction of movements on the Nao robot needs further optimization, the examples are based only on evaluated experiments in virtual reality.

5.1.1 Squash and Stretch


The definition of the rigidity and mass of characters, and the distortion of these during an action, is known as Squash and Stretch. Objects stretch and squash during an animation depending on their mass and rigidity. This does not mean that objects necessarily have to deform: an articulated object, like the puppet in figure 5.1 for example, can fold over itself and stretch by extending out fully without deforming [Joh95].

Figure 5.1: Squash and stretch is the most important rule for life-like character animation. The figure shows that a character stretches while jumping and folds over when landing on the ground.

The most important rule for Squash and Stretch is that objects keep a constant volume regardless of whether they are stretched out or pressed together. This principle is also used in animation timing: an object is stretched to reduce the strobing effect in fast movements. This is due to the fact that the human eye stitches single frames together into a smooth animation; when objects in consecutive frames do not overlap, the human brain perceives separate images, destroying the illusion of movement. The natural basis of Squash and Stretch can be observed in nearly all living flesh, regardless of whether it is a bony human arm or a clumsy dog. Each character will show considerable movement during an action. Concerning the interaction learning approach presented in this thesis, the principle of Squash and Stretch is already encoded in the recorded movements. Since the underlying motion data is acquired from humans, the played back recording has a natural and life-like basis. However, each motion dataset contains noise, which would diminish the natural impression of a behavior. Hence, all datasets have to be filtered in high-dimensional space in order to preserve the human movements correctly.
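One simple way to perform this filtering, sketched below under the assumption that each behavior is stored as a matrix of joint angles over time, is a moving-average filter applied per joint angle in the high-dimensional space. The window size and data layout are illustrative choices, not the exact filter used in the thesis.

#include <vector>

// Smooth recorded joint angle trajectories with a centered moving average.
// behavior[t][j] holds joint angle j at time step t; halfWindow would be
// tuned to the capture rate.
std::vector<std::vector<double>> smoothBehavior(
        const std::vector<std::vector<double>>& behavior, int halfWindow = 2) {
    std::vector<std::vector<double>> filtered(behavior);
    const int T = static_cast<int>(behavior.size());
    if (T == 0) return filtered;
    const int J = static_cast<int>(behavior[0].size());

    for (int t = 0; t < T; ++t) {
        for (int j = 0; j < J; ++j) {
            double sum = 0.0;
            int count = 0;
            for (int k = t - halfWindow; k <= t + halfWindow; ++k) {
                if (k < 0 || k >= T) continue;   // shrink the window at the borders
                sum += behavior[k][j];
                ++count;
            }
            filtered[t][j] = sum / count;
        }
    }
    return filtered;
}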

5.1.2 Timing
Walt Disney paid particular attention to the time and speed of an action, since proper timing is crucial for a character's believability [Joh95]. The impression of weight is also mostly defined by an object's speed: a heavy person, for example, would move slowly and lethargically, whereas a light object, like a cannonball, would move very fast, indicating a light mass. The timing of a character also has a great impact on its emotional appearance. In general, a slowly moving human appears to be relaxed, while a fast moving person seems nervous or excited [Las87]. For interaction models the timing can be influenced as well, but only by the user interacting with the virtual human. Since the underlying recordings have been acquired simultaneously, the temporal coherence remains within the data set. If the execution time were altered for only one behavior, the approximated virtual human postures would refer to prior or future human postures, because the poses of the synthetic humanoid are fully managed by the interaction model. This means that if a person interacts slowly, the virtual human moves slowly as well in order to preserve the desired interaction.

5.1.3 Anticipation
In 2D animation an action occurs in three steps: preparation, proper execution and termination. Anticipation describes the preparation of an action. Correct preparation of actions is one of the oldest rules in theatre: the attention of the audience has to be guided so that they clearly understand what is going to happen next. This also applies to people watching a cartoon or a 3D animation. The amount of anticipation added by the animator has a serious impact on the action that follows; for a very fast movement the amount would be much higher than for a slow action. If this is not done correctly, an animation can appear abrupt, unnatural and stiff. As an example one can imagine a man starting to run: he draws back in the opposite direction, "gathering like a spring, aiming at the track" [Joh95], before taking off. Anticipation can also be added to interaction models. For this, a filter has been implemented that changes the beginning of a recorded behavior so that the virtual human appears to build momentum. The anticipation filter analyzes a given behavior and identifies key postures during the first few seconds. After that, the movement direction is calculated. This direction is then inverted and a new starting posture is found by following the behavior trajectory backwards. Visually, this appears as a backward motion of the synthetic humanoid.

Figure 5.2: The upper line shows the first steps of the defend behavior that will be introduced in chapter 7. The second line shows the virtual human repeating this movement with anticipation added.

The newly calculated pose is then used as the starting pose, and the algorithm utilizes a penalized regression spline to interpolate to the extracted key posture, creating a smooth accelerated movement. This calculation is done in high-dimensional space and only affects the first few seconds of a behavior, since the remaining part has to continue unchanged in order to keep the interaction plausible. Figure 5.2 shows how the virtual human moves during the first few time steps while executing the defend behavior. The first line displays the poses without anticipation; the second line shows the synthetic humanoid with anticipation added. As can be seen, the virtual human uses its arms to build momentum: the arms are pulled back at the beginning of the behavior, then snap to the desired protective posture, and the interaction continues unchanged.
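The core of this filter can be summarized in a few lines of code. The sketch below derives a wind-up pose by stepping backwards along the initial movement direction and then blends towards the first key posture; the simple eased linear blend stands in for the penalized regression spline of the actual implementation, and all names and parameters are illustrative assumptions.

#include <vector>

using Pose = std::vector<double>;  // one high-dimensional joint angle vector

// Build an anticipation prefix for a recorded behavior: move against the
// initial movement direction to create a wind-up pose, then blend back to the
// first key posture. The remainder of the behavior stays unchanged.
std::vector<Pose> addAnticipation(const std::vector<Pose>& behavior,
                                  size_t keyIndex,      // index of the first key posture
                                  double windupScale,   // how far to pull back, e.g. 0.5
                                  size_t prefixLength)  // number of generated frames
{
    if (behavior.size() < 2 || keyIndex >= behavior.size()) return behavior;
    if (prefixLength < 2) prefixLength = 2;

    const Pose& start = behavior.front();
    const Pose& key   = behavior[keyIndex];

    // Initial movement direction, inverted to obtain the wind-up pose.
    Pose windup(start.size());
    for (size_t j = 0; j < start.size(); ++j)
        windup[j] = start[j] - windupScale * (key[j] - start[j]);

    // Interpolate from the wind-up pose to the key posture (accelerating blend).
    std::vector<Pose> result;
    for (size_t i = 0; i < prefixLength; ++i) {
        double t = static_cast<double>(i) / (prefixLength - 1);
        t = t * t;  // ease-in so the motion appears to gather momentum
        Pose p(start.size());
        for (size_t j = 0; j < start.size(); ++j)
            p[j] = (1.0 - t) * windup[j] + t * key[j];
        result.push_back(p);
    }

    // The remaining behavior continues unchanged after the key posture.
    result.insert(result.end(), behavior.begin() + keyIndex, behavior.end());
    return result;
}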

5.1.4 Exaggeration
In 2D animation the principle of exaggeration refers to a characteristic of a cartoon where persons are drawn on the edge of realism. Walt Disney instructed his animators to draw sad scenes even sadder or a bright character even brighter. Sometimes a character's physical features were altered, featuring extreme physical manipulations and supernatural or surreal properties. But exaggerating an action does in general not mean that an animation has to become more violent or distorted. Instead, Walt Disney wanted cartoons to be wilder and more extreme in form while remaining true to reality [Joh95]. That is why he also described exaggeration as extreme, even unnatural, but unmistakably clear.

Figure 5.3: The first line shows screenshots of a virtual human playing back the learned defense behavior without additional exaggeration added. The same behavior, more exaggerated, is shown below.

In order to produce exaggerated behaviors with the interaction model approach presented here, a filter based on low-dimensional data has been implemented. Ben Amor [Ben10] pointed out that exaggerated postures can emerge at the edges of low-dimensional posture spaces created with PCA. Since principal component analysis can be used as a dimensionality reducer for interaction models as well, this characteristic can be exploited for creating such behaviors.



Figure 5.4: The PCA space of the defense motion, showing the original behavior curve and the exaggerated trajectory created through magnification. Two key postures are also displayed for both versions of the action.

In order to move the low-dimensional points of a virtual human's pose further towards the edge of the posture space, a magnification filter has been implemented. The magnification factor can be set depending on the desired results. Figure 5.3 shows the original and the exaggerated defense behavior captured at seven time steps. As can be seen in the figure, the virtual human moves its arms higher and crouches lower to the ground. The final pose also features a straightened back and an erect head.
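A minimal sketch of such a magnification filter is given below: each low-dimensional point is pushed away from the centroid of the trajectory by a user-chosen factor, which moves postures towards the edges of the PCA posture space. The centroid-based scaling and the parameter name are illustrative assumptions, not the exact filter of the thesis.

#include <vector>

// Exaggerate a behavior in the low-dimensional PCA space by scaling every
// point away from the trajectory's centroid. factor > 1 moves postures
// towards the edges of the posture space and thereby exaggerates them.
void magnifyTrajectory(std::vector<std::vector<double>>& lowDimPoints, double factor) {
    if (lowDimPoints.empty()) return;
    const size_t dims = lowDimPoints[0].size();

    // Centroid of the low-dimensional trajectory.
    std::vector<double> centroid(dims, 0.0);
    for (const auto& p : lowDimPoints)
        for (size_t d = 0; d < dims; ++d) centroid[d] += p[d];
    for (size_t d = 0; d < dims; ++d) centroid[d] /= lowDimPoints.size();

    // Push each point outwards along its offset from the centroid.
    for (auto& p : lowDimPoints)
        for (size_t d = 0; d < dims; ++d)
            p[d] = centroid[d] + factor * (p[d] - centroid[d]);
}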

5.1.5 Arcs
In the context of 2D animation, arcs describe the change from one extreme position to another. This rule has been introduced by animators to avoid mechanical movements: with some limitations, nearly all living creatures describe an arc of some kind in their actions [Joh95]. Since arcs describe the movement direction of body parts, their characteristics also encode in-between timings, which can kill the essence of an action when not set properly. As computer animation evolved, splines were used to create smooth behaviors, and they have also been utilized for interaction models. Since every pose of the virtual human is based on postures that humans adopted during the recording, the action has a natural basis that can be described with arcs. Different interpolation algorithms can be applied in the user interface to add additional smoothness to an action. For that, the algorithm searches for extremes in the high-dimensional joint angle curves and uses them as spline control points.




[Figure 5.5 plot data omitted: hip roll angle in radians over the first 100 time steps.]

Figure 5.5: The diagram shows a single joint angle (the right hip joint) during the first 100 time steps of the defense behavior with additional spline fitting applied. The red curve describes the original joint angles, whereas the green trajectory shows the newly fitted angles.

A resulting curve can be seen in figure 5.5. Penalized regression splines have proven to be well suited, because their many parametric options allow the most customizability while providing sufficient precision.
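The sketch below illustrates the two steps this filter combines: local extrema of a joint angle sequence are collected as control points, and a smooth curve is interpolated through them between consecutive control points. A Catmull-Rom segment is used here purely for brevity; the thesis implementation offers several spline types including penalized regression splines, and all names are illustrative.

#include <vector>
#include <cstddef>

// Indices of local extrema (turning points) of one joint angle curve,
// including the first and last sample so the whole motion is covered.
std::vector<size_t> findExtrema(const std::vector<double>& angles) {
    std::vector<size_t> extrema;
    extrema.push_back(0);
    for (size_t i = 1; i + 1 < angles.size(); ++i) {
        bool isMax = angles[i] > angles[i - 1] && angles[i] > angles[i + 1];
        bool isMin = angles[i] < angles[i - 1] && angles[i] < angles[i + 1];
        if (isMax || isMin) extrema.push_back(i);
    }
    if (angles.size() > 1) extrema.push_back(angles.size() - 1);
    return extrema;
}

// Catmull-Rom interpolation between control values p1 and p2 with neighbors
// p0 and p3; t runs from 0 to 1. Evaluating this between consecutive extrema
// yields a smooth, arc-like joint angle curve.
double catmullRom(double p0, double p1, double p2, double p3, double t) {
    double t2 = t * t, t3 = t2 * t;
    return 0.5 * ((2.0 * p1) + (-p0 + p2) * t +
                  (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2 +
                  (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t3);
}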

5.1.6 Secondary Action


A secondary action results directly from a primary action and has to be kept subordinate. As an example, one can imagine a person wearing a heavy coat who, in a following scene, starts turning around. The primary action is the person's turning, while the secondary action is the warping movement of the coat; the coat's behavior is directly dependent on the person's action and thereby subsidiary. Secondary actions can also be expressed with facial animations when the main action of the scene is told by the body movements. The difficulty then lies in the movement speed, since a facial expression can simply be missed when the movement is too fast. Thus, the animation has to be staged obviously but still remain secondary. Secondary actions that involve movements of body parts are always recorded, as long as the person showing the behavior displayed some. Additionally, small behaviors can be added to the main animation: a second behavior is integrated into the animation by partly adding high-dimensional data of user-defined joint angles. A bicubic spline is then used to combine both behaviors into a smooth movement. Over time the influence of the secondary action on the movement is decreased until it is no longer visible (a blending sketch is given below). In figure 5.6 a knee raising action has been added to a virtual human's behavior. As already mentioned, the influence of the secondary action on the main action can be set depending on the desired output; in the example a ratio of one to five has been used.
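The following sketch shows the basic blending idea under simplifying assumptions: for a user-selected set of joints, the secondary behavior is mixed into the primary one with a weight that decays over time, so the added motion fades out. The linear decay and plain weighted average stand in for the bicubic spline blending of the actual implementation; names and parameters are illustrative.

#include <vector>
#include <algorithm>

using Pose = std::vector<double>;

// Blend a secondary behavior into a primary one for selected joints.
// influence is the initial mixing ratio (e.g. 0.2 for "one to five"); it
// decays linearly to zero over the length of the secondary action.
std::vector<Pose> addSecondaryAction(const std::vector<Pose>& primary,
                                     const std::vector<Pose>& secondary,
                                     const std::vector<int>& joints,
                                     double influence) {
    std::vector<Pose> result(primary);
    const size_t frames = std::min(primary.size(), secondary.size());

    for (size_t t = 0; t < frames; ++t) {
        // Weight fades out so the secondary action disappears over time.
        double w = influence * (1.0 - static_cast<double>(t) / frames);
        for (int j : joints)
            result[t][j] = (1.0 - w) * primary[t][j] + w * secondary[t][j];
    }
    return result;
}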



Figure 5.6: The first line shows a virtual human's motion. The second row displays the same primary action with another animation added on top. As can be seen, during the secondary action the virtual human raises its knee.

Thus, a fifth of the primary action is changed compared to the original behavior. Since the secondary action involves the use of one leg, which has not been used in the original action, the high-dimensional space changes. Hence, the low-dimensional embedding differs from the previous one. Figure 5.7 shows how the low-dimensional embedding changes when the mentioned action is added.

[Figure 5.7 panels: (a) original action, (b) a secondary action added; both plotted in the PC 1 / PC 2 plane.]

Figure 5.7: How a secondary action changes the low-dimensional embedding of an action. Color is used to indicate how points on the trajectory without a secondary action correspond to the curve with an additional secondary action.



5.2 Expressing Basic Emotions


In his work, Wallbott [Wal98] analyzed human emotions regarding their postural expression. He concluded that basic emotions (joy, sadness, pride, shame, fear, anger, disgust, contempt) can be observed as particular postures or movements. In his findings he also showed that these emotions are expressed the same way regardless of culture. It has been pointed out by various researchers that these emotions can be expressed by humanoid robots, like the Nao robot [BHML10, MC11, MLLM], as well as by virtual humans [GRM04, SR08]. Thus both research fields work towards a common goal: to express artificial emotions realistically and convincingly. In contrast to virtual humans, most humanoid robots do not have facial features, so the expression of emotional states hinges on the question of whether an emotion can be conveyed through body postures or not. Since the intention of this thesis is to combine the research fields of humanoid robotics and computer animation, a framework has been developed to express emotions in both worlds. The recorded behavior of a virtual human or robot is altered; emotional states are therefore, in the context of this thesis, behavioral modifications rather than self-contained expressions, giving the user the illusion of different emotional states. In order to express different emotions or attitudes, the joint angle values of the recorded motions are changed and reassigned to the virtual human or humanoid robot. The filters are implemented to convey the mood of a scene, not to express an emotion at all costs, since the underlying action has to remain the same. Filters can be based on high- or low-dimensional data. Note that modifications of low-dimensional spaces affect all joint angles, whereas high-dimensional filters manipulate the recorded data of one joint at a time. Different joint ranges have to be kept in mind in order to restrain a filter's influence; in other words, the resulting angle values have to remain within the physical constraints of the robot. Additionally, joints can be excluded from further manipulation. This feature is especially useful when behaviors already display certain emotional features: the head angles, for example, should be excluded when the viewing direction is already upwards. The same applies to other joint angles. In order to limit these, the user can set parameters in the software's configuration file. In the example where a synthetic humanoid learns to defend itself (see chapter 7), an additional emotional change can be applied. Based on the work of Wallbott [Wal98], the following three modifications can be added.

5.2.1 Happiness
Wallbott [Wal98] described that happiness can be observed as a collection of various purposeless movements, combined with jumping, dancing for joy and clapping of hands. The body is held erect and the head upright. The animation filter for happiness is implemented with regard to these characteristics. A penalized regression spline is used to flatten the recorded neck angles and limit the virtual human's gaze direction. Additionally, the spine rotations are set to create an upright pose. The second line in figure 5.8 displays the happiness filter added to the defense behavior that will be learned in chapter 7. In contrast, the first line shows the same behavior without additional emotions added.

Figure 5.8: First line: the first few seconds of a behavior without an emotion. Second line: happiness added to the movement.

The figure shows that the head stays upright during the movement and the arms rest wide open at the sides. As soon as the animation starts, the virtual human's movements appear smooth and curved, implying a cheerful mood. The back is also always straight and upright, which increases the impression of happiness of the virtual character even further.

5.2.2 Sadness
According to Wallbott [Wal98] a sad person behaves motionless and passive, with the head hanging on a contracted chest. The emotion filter implementing these characteristics once again utilizes splines for motion smoothing. Since this emotion affects all body joints, the spline is applied to all joint angle values. In order to create the impression of a sad character the following steps were implemented.

Viewing Direction. Firstly, the viewing direction is changed so that the character is gazing at the ground. Additionally, random slow head movements are added to decrease the body's rigidity.

Spine Rotation. Secondly, the character's back is slightly bowed forwards.

Movement Speed. A penalized regression spline is then used to smoothen the movement speed of each joint angle. For that, each hinge joint is analyzed, and extremes and turning points are used as spline control points. In doing so the final movement is only changed slightly, but the execution appears smoother and more lethargic, implying sadness and sorrow.

The second line in figure 5.9 shows the sadness filter applied to a behavior that will be learned in chapter 7. The first line shows the same motion with no additional emotions added. As can be seen, the body posture of the virtual human is slumped, with a hanging head. Slow motions and the lethargic appearance boost the expression of sadness even further.

Figure 5.9: First line: A behavior with no emotion added. Second line: sadness added to the motion.

5.2.3 Anger
In his articles Wallbott [Wal98] showed that a person expressing anger usually exhibits a trembling body with the intention to push or strike violently away. Shaking fists, an erect head and a well expanded chest have also been documented [Wal98]. The anger filter implemented for an interaction model boosts hectic movements of the virtual human: joint angle changes between time steps are increased. Additionally, both feet are planted firmly on the ground. The last line in figure 5.10 displays six time steps of the defense behavior with the anger filter added. As can be seen in the picture, the bent arms of the synthetic humanoid rest at the sides in a protective pose. When the behavior is played back, the movements are fast and hectic, accentuating anger.
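A compact sketch of the delta amplification behind this filter is shown below: differences between consecutive joint angle frames are scaled up, which makes the motion appear more hectic, and the result is clamped against per-joint limits so it stays physically valid. The gain value, the clamping scheme and all names are illustrative assumptions.

#include <vector>
#include <utility>
#include <algorithm>

using Pose = std::vector<double>;

// Amplify frame-to-frame joint angle changes to make a motion appear hectic.
// gain > 1 increases the changes; limits holds per-joint [min, max] pairs so
// the result stays within the robot's physical constraints.
void amplifyMotion(std::vector<Pose>& behavior, double gain,
                   const std::vector<std::pair<double, double>>& limits) {
    if (behavior.size() < 2) return;
    const std::vector<Pose> original(behavior);  // keep the unmodified deltas

    for (size_t t = 1; t < behavior.size(); ++t) {
        for (size_t j = 0; j < behavior[t].size(); ++j) {
            double delta = original[t][j] - original[t - 1][j];
            double value = behavior[t - 1][j] + gain * delta;
            // Keep the exaggerated angle inside the allowed joint range.
            value = std::max(limits[j].first, std::min(limits[j].second, value));
            behavior[t][j] = value;
        }
    }
}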



Figure 5.10: The first few seconds of the defense behavior with no additional emotion added can be seen in the first line. The second line shows the same motion with additional anger added, featuring a fast and hectically moving virtual human.

5.3 Summary
Several fundamental rules of animation applied to interaction models have been presented in this chapter. In doing so, it has been pointed out that some are already included in the models, while others like exaggeration or anticipation can be added manually. The underlying implementation of these rules in the form of filters has been presented. It has been shown that an implementation of Walt Disney's Principles of Animation in the form of low- and high-dimensional filters can be achieved; a distinction between full motion modification and single joint angle manipulation must be made. In order to convey a certain mood in the virtual character or humanoid robot, three basic emotions have been implemented as filters as well, realized as high-dimensional behavioral modifications. The underlying body postures have been derived by analyzing Wallbott's studies [Wal98]. It has been pointed out that, in the field of humanoid robotics, these emotions have already been transferred to a Nao robot [BHML10, MC11, MLLM]; hence, a separate evaluation of them has not been presented here.


6 Software Architecture
6.1 Overview
In the following sections the main software components implementing the interaction learning algorithm will be explained. Additionally, the filters for modifying recorded behaviors are explained in detail. Finally, external components that are not essential for the interaction learning algorithm, but highly recommended, are introduced. Figure 6.1 gives a brief overview of the main software components implementing the two-person interaction model learning approach.

Figure 6.1: Main software components and additional libraries that build the software basis for learning interaction models.

The main software components that are crucial for the interaction learning approach developed in this thesis can be segmented into four parts:

Dimension Reducers - This component implements four main dimensionality reduction techniques, namely PCA, Manifold Sculpting, LLE and IsoMap. Additionally, Breadth First Unfolding and NeuroPCA can be used.

Learning Algorithms - Within this part, different learning algorithms for mapping two low-dimensional posture spaces onto each other are implemented. They include ANNs, ESNs and linear regression.

Filters - Behavior modifications based on low- or high-dimensional data are combined in this software package. It includes classes for basic emotion expression (happiness, sadness, anger) as well as algorithms implementing some of the Principles of Animation.

Trajectory Visualization - This component uses Qwt to display low-dimensional behavior trajectories. Additionally, changes to all points can be made with drag-and-drop.

Most of the main components are compiled into one file. For reasons of portability and reusability the following features have been compiled as shared libraries: the SQL connector, the Kinect client, all interaction learning algorithms and the OGRE visualization. In the following sections each component will be explained briefly regarding its main features. For a more detailed description, a full source code documentation can be found on the attached DVD.

6.2 Dimensionality Reduction


Dimensionality reduction is an important step in the process of interaction learning. The software basis for this has been implemented with algorithms from the Waffles library [Gas11]. A factory design pattern is used to provide the required dimension reduction functionality to the user interface, where each reducer can be selected from a drop-down menu. The source code fragment for this can be seen in listing 6.1. Since the calculation of low-dimensional embeddings can be very time consuming and computationally expensive, a multi-threaded approach is used to balance the work load over multiple processors. As soon as the computation is done, the user interface is notified and user feedback is created.
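One plausible way to offload such a computation, sketched below with std::async from the C++ standard library, is to run the reducer in a background task and invoke a callback when the embedding is ready so the user interface can be updated. The reducer interface and the callback are illustrative assumptions and do not reproduce the actual threading code of the thesis.

#include <future>
#include <functional>
#include <vector>

// Illustrative reducer interface; the real classes come from the
// Waffles-based factory described above.
struct DimensionReducer {
    virtual std::vector<std::vector<double>> transform(
        const std::vector<std::vector<double>>& data, int targetDim) = 0;
    virtual ~DimensionReducer() = default;
};

// Run the (potentially expensive) embedding calculation in a background task
// and invoke a callback on completion, e.g. to post an event to the GUI thread.
std::future<void> reduceAsync(DimensionReducer& reducer,
                              std::vector<std::vector<double>> data,
                              int targetDim,
                              std::function<void(std::vector<std::vector<double>>)> onDone) {
    return std::async(std::launch::async,
                      [&reducer, data = std::move(data), targetDim, onDone]() {
                          auto embedding = reducer.transform(data, targetDim);
                          onDone(std::move(embedding));
                      });
}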

6.3 Interaction Learning Algorithm


Three different learning algorithms have been implemented, namely linear regression, artificial neural nets and echo state networks. The code fragment in listing 6.1 shows how two behaviors can be mapped utilizing an artificial neural net.
// Create a dimension reducer for each behavior
DimensionReducerFactory* dimReducerFactory = new DimensionReducerFactory();
DimensionReducer* stReducer = dimReducerFactory->getReducer(DimensionReducerFactory::PCA);
DimensionReducer* ndReducer = dimReducerFactory->getReducer(DimensionReducerFactory::PCA);
int targetDimension = 6;

// Reduce the dimensionality of each behavior
stReducer->setData(firstPersonBehaviorData);
stReducer->transform(targetDimension);
ndReducer->setData(secondPersonBehaviorData);
ndReducer->transform(targetDimension);

// Create the interaction learner and set the behavior data
InteractionLearnerNeuralNet* learner = new InteractionLearnerNeuralNet();
learner->setStData(stReducer->getTransformedData());
learner->setNdData(ndReducer->getTransformedData());

// Start the learning algorithm
learner->run();

Listing 6.1: A simple example where two behaviors are reduced in dimensionality and mapped with an ANN.

6.4 Behavior Database


In order to calculate a low-dimensional model of a behavior, the required motion data has to be provided. Each behavior is stored in a GClasses::GMatrix object as soon as it is read from the database. This relational database can be accessed via SQL. The client software is also used by [Ber11] to store and load robot behavior data. Code listing 6.2 shows how a connection to the database can be created.
// Create a database connection
SqlConnector* sqlCon = new SqlConnector(ip, dbName, user, credentials);

// Connect to the database
if (!sqlCon->connect())
    return;

// Create storage for a new behavior
GClasses::GMatrix* behavior = new GClasses::GMatrix(26, 0);

// Request a behavior from the database (if existing)
if (!(sqlCon->getBehaviour(name, type, behavior)))
    return;

// Close the connection
if (sqlCon->isConnected())
    sqlCon->disconnect();

Listing 6.2: C++ code to read a behavior from the motion database.

If a connector object is created and the connection is established successfully, the following functions are available:

getBehaviourList - Retrieves a list of all behaviors in the database.

getBehaviour - Requests a behavior from the database. The received data is stored in a GClasses::GMatrix object.

pushData - Pushes a new behavior to the database by providing a unique name, the type of the recorded data (human motion data or robot joint angles) and a GClasses::GMatrix object containing the actual data (see the usage sketch below).

dropBehaviour - Deletes a behavior by simply supplying its name.

In the database each behavior is stored in a separate table with ascending indexes identifying the time steps. Additionally, pelvis positions and rotations are stored for motion visualization purposes.
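To complement the read example in listing 6.2, the following hedged sketch shows how a newly recorded behavior might be written to and later removed from the database using the listed functions. The parameter order of pushData and the example behavior name are assumptions derived from the description above, not taken from the actual header.

// Illustrative only: store a freshly recorded behavior and delete it again.
void storeAndDropBehaviorExample(const char* ip, const char* dbName,
                                 const char* user, const char* credentials) {
    SqlConnector* sqlCon = new SqlConnector(ip, dbName, user, credentials);
    if (!sqlCon->connect())
        return;

    // 26 columns, one row per recorded time step (as in listing 6.2).
    GClasses::GMatrix* recorded = new GClasses::GMatrix(26, 0);
    // ... fill 'recorded' with captured joint angle rows ...

    // Push the behavior under a unique name together with its data type.
    sqlCon->pushData("wave_left_arm", "human", recorded);

    // A behavior can later be deleted again by name.
    sqlCon->dropBehaviour("wave_left_arm");

    if (sqlCon->isConnected())
        sqlCon->disconnect();
}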

6.5 Visualization
In order to visualize low-dimensional data, Qwt and Qt4 extensions have been written. Figure 6.2 shows the main user interface for behavior visualization. The screenshot displays two low-dimensional behavior trajectories reduced in dimensionality with PCA. Each blue square is a posture adopted by the demonstrator during the recording. It can be moved in the low-dimensional posture space by the user; in doing so, additional modifications can be added.

Figure 6.2: A user interface screenshot. Two behaviors have been reduced in dimensionality with PCA. Each blue square is a posture that has been executed by the human demonstrator during a recording; these points can be moved by the user in low-dimensional space.


One has to keep in mind that only two dimensions are modified at a time, since the remaining axes are not visible. The first two posture space axes are displayed as soon as the dimensionality reduction algorithm has calculated a low-dimensional embedding. In order to provide visual feedback for learned movements and to allow interactions with virtual characters, an OGRE-based application has been written. An articulated model is used to visualize the postures encoded by the low-dimensional data. The transmission of data is based on the TCP/IP protocol: each posture that the virtual human has to adopt is sent to the application separately. Which pose has to be adopted next is calculated by the interaction model algorithm; hence a transmission of the complete behavior prior to a user interaction is not possible. Since user postures are transformed into low-dimensional space, a continuous flow of points is created. These points are visualized and can be seen in figure 6.3. The green trajectory describes the low-dimensional behavior curve of the user, whereas the red curve encodes the behavior of the virtual human.

Figure 6.3: A screenshot of an ongoing interaction in low-dimensional space. The green curve is the behavior trajectory of the human, and the red trajectory visualizes the desired virtual human postures in low-dimensional space. Blue rectangles indicate the virtual human's adopted poses. Additionally, the user's current posture in low-dimensional space is highlighted with a green circle.

Since the user's movements are similar to the recorded behavior, the newly created low-dimensional points lie in the range of that trajectory (see the green circle). When the user starts the interaction with the virtual human, the low-dimensional point starts moving along the range of the green trajectory. Simultaneously, postures for the virtual human are calculated and visualized in the same low-dimensional space as blue rectangles; that is, each blue rectangle is a past posture of the virtual human.



6.6 Emotional and Behavioral Filters


Three different filters have been implemented in order to convey emotions in virtual humans and humanoid robots. Each filter operates on high- or low-dimensional data and can be added prior to the execution of the behavior; the same applies to the behavioral filters. The UML diagram in figure 6.4 gives a brief overview of the filter class hierarchy. An abstract factory pattern has been used to provide additional filters depending on user input. Each animation filter utilizes different spline interpolation methods to smoothen or manipulate human motion data.

Figure 6.4: The diagram gives a brief overview of the class hierarchy for all implemented animation and emotion filters.

For this purpose, six spline types were implemented, namely cubic splines (with three different boundary conditions), Akima splines, Catmull-Rom splines and penalized regression splines with least squares fitting.

6.7 Additional Components


Additional components include a Microsoft Kinect client and a Nao robot controller. The former is a light-weight TCP/IP client that connects to the recording software of [Ber11]. It is used to request joint angle data of users in real time. Each transmitted package contains 14 joints with 7 float values each, resulting in a total package size of 98 float values. The reason for this relatively large size becomes clear when the required values are analyzed: each joint has three rotational axes and one confidence variable, which gives an estimate of how reliable the calculated joint angles are. Additionally, the position of the user's joint in three-dimensional space is provided (three further values). These positions are needed to locate the synthetic humanoids in virtual reality depending on the user's position.
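Under the assumption that the values are laid out per joint as three rotation angles, one confidence value and a three-dimensional position, a received package could be represented as follows. The struct and field names are illustrative and do not describe the actual wire format of the recording software.

#include <array>

// One tracked joint as described above: 3 rotation angles, 1 confidence
// estimate and a 3D position, i.e. 7 float values per joint.
struct KinectJoint {
    float rotation[3];   // rotations about the three axes
    float confidence;    // how reliable the angle estimate is
    float position[3];   // joint position in 3D space
};

// A full package: 14 joints * 7 floats = 98 float values in total.
struct KinectPackage {
    std::array<KinectJoint, 14> joints;
};

static_assert(sizeof(KinectJoint) == 7 * sizeof(float),
              "each joint occupies exactly seven float values");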


7 Evaluation
In the following sections different experiments will be conducted, ranging from the already introduced arm mirroring example to complex interaction scenarios. For each experiment an interaction model is created from recorded two-person interactions. Then the interaction is executed once more in virtual reality with a synthetic humanoid: the human in front of a display controls an avatar whose postures are captured, transformed into low-dimensional space and used as input data for the interaction model. The second person is replaced by a virtual human, which turns the recorded human-human interaction into a human-machine interaction. The examples point out the advantages of the interaction learning approach presented here; interaction types that cause failure or unnatural behavior are explained as well.

7.1 Arm Mirroring


The arm mirroring example used to introduce interaction models is a simple behavior where a virtual human or robot has to learn to mimic a user's arm position. Once an interaction model has been learned, it can be used to predict the behavior of virtual humans or robots during live interaction. Figure 7.2 shows how a person moves their arm and the virtual human successfully mimics this behavior based on the observed postures. The low-dimensional posture spaces for this behavior have been introduced in chapter 4.3. The interaction model in use has been created from the recorded behaviors mentioned in section 4.2. Three different mapping algorithms have been used to map the low-dimensional behavior of the user onto the robot's posture space.
[Figure 7.1 plot data omitted: three panels (linear regression, artificial neural net, echo state network) plotting the Euclidean distance against the frame index, up to 250 frames.]
Figure 7.1: The mapping error obtained by each learning algorithm. From left to right: linear regression, artificial neural net, echo state network.



Figure 7.2: Illustrated are different steps during the execution of the arm mirroring behavior. The virtual human in the back of the upper pictures is animated with postures obtained from a previously learned interaction model.

                      Linear Regression   ANN             ESN
Minimal Error         0.0205662179        0.0061785339    0.0051569018
Maximal Error         0.4545927397        0.4928021813    0.6047167014
Mean Error            0.177022329         0.0870798897    0.0957618007
Variance              0.0083494553        0.0046621994    0.0082322489
Standard Deviation    0.0913753537        0.0682803004    0.0907317413

Table 7.1: Comparison of all implemented mapping algorithms regarding their produced output. Color is used in the original table to indicate the largest and smallest values.

47

7.2 Yoga robots body stature in virtual reality. But due to the fact that the underlying simulation diers from real world characteristics, the calculated behaviors cannot be played back on the real robot without causing stability issues.

Figure 7.3: The underlying generalization capabilities of an interaction model can be used to create responsive robots. The picture series shows how the arm mirroring example is transferred to a Nao robot: the robot mimics the behavior of the person in front of it.

The arm mirroring experiment has been conducted several times with different execution speeds. The duration of the whole behavior varied between 2 and 25 seconds; originally, the length of the recorded motion was 10 seconds. The virtual human was able to mirror the arm movement in a visually appealing way if the execution took more than 3 seconds. For all tests with a faster movement speed the virtual human behaved unnaturally and appeared robotic. The reason is that noisy motion data is filtered: since the measurement intervals for this behavior have a length of an eighth of a second, 16 frames are used to calculate the mean user position. That is, the person's movement was simply too fast and was filtered out. Due to the simple nature of the arm mirroring dataset, only a few different poses are stored in the posture space; hence the generalization capability of the interaction model is limited. Human postures that have not been recorded result in similar low-dimensional points. That is, only recorded postures can be adopted by the avatar and its virtual interaction partner, so when the user executes a pose that does not correspond to any low-dimensional point in the behavior, the avatar's pose will not change. This creates a disparity between the user and their avatar. Since the underlying interaction model for virtual humans and humanoid robots is the same, this characteristic also applies to the Nao robot.

7.2 Yoga
Humans often learn new behaviors through imitation: a person is observed and the adopted postures are memorized. When learning from virtual teachers this technique can be applied as well. If a virtual teacher starts a movement, the person in front of the display can imitate these poses. A real instructor would examine the trainee's body posture and have the person correct it if executed unsatisfactorily. Analyzing the person's movements can be done in virtual reality with interaction models as well. The low-dimensional posture space can be searched for the user's current pose; if the space does not contain it, the posture is most likely not executed as in the recording, since the low-dimensional point of the behavior embedding moves along the created trajectory (or in its range). In the following section an example is presented where user postures are observed and an interaction model is used to predict the user's current pose in low-dimensional space. In the previously introduced arm mirroring example, the purpose of an interaction model was to map observed user postures to suitable virtual human poses. In the following this procedure remains the same, but now the intention is to map user postures onto future virtual human poses. The experiment is based on the fact that some low-dimensional points in the posture space cannot be reached when behaviors are executed incorrectly. Only when the user's joint angles match the ones in the recorded behavior can these points be reached; for that, a person has to adopt the posture exactly as in the demonstration. This also applies to the movement direction, since multiple postures are used to calculate a virtual human pose. If a low-dimensional point can be found in the range of the behavior trajectory, the person executed the movement correctly. As soon as this check has been performed, the virtual teacher's posture can be changed accordingly; the required joint angles are obtained from the interaction model. This means that the virtual human will wait as long as the user does not reach the next key pose. If the human executes a posture differently from what the teacher showed, the low-dimensional points will not reach certain areas of the trajectory, and hence the teacher's posture will not change.
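A simple version of this check, sketched below, measures the distance between the user's current low-dimensional point and the next key posture of the recorded trajectory and only advances the teacher when the point comes close enough. The distance threshold and all names are illustrative assumptions, not the exact logic of the thesis implementation.

#include <vector>
#include <cmath>

using Point = std::vector<double>;

static double distance(const Point& a, const Point& b) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

// Advance the virtual teacher only when the user's current low-dimensional
// point is close enough to the currently shown key posture on the recorded
// trajectory. Returns the (possibly updated) index of the key posture to show.
size_t updateTeacher(const std::vector<Point>& keyPostures,
                     size_t currentKey,
                     const Point& userLowDimPoint,
                     double threshold) {
    if (currentKey + 1 >= keyPostures.size()) return currentKey;  // lesson finished
    if (distance(userLowDimPoint, keyPostures[currentKey]) < threshold)
        return currentKey + 1;   // user matched the shown pose, show the next one
    return currentKey;           // otherwise the teacher waits
}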

Figure 7.4: With the interaction model approach humans can learn from virtual teachers. The figure illustrates how a person learns a basic yoga move.

49

7.2 Yoga posture with both arms slightly tilted forward. Then the knee is pulled up with both arms right-angled. Then the bent knee will be stretched out to the front with both arms on top of each other. The last pose will be retained for some seconds till the initial pose is executed once again. Figure 7.4 shows how a user mimics its trainer with correct key poses. Table 7.2 shows how each implemented mapping algorithm performed regarding several error heuristics. It is clear that linear regression cannot be used since the calculated results are not su cient enough. As it can be seen in the table articial neural nets and echo state networks created a mapping with a mean error smaller than 0.2. Both algorithms can calculate a mapping that is close to the original behavior trajectory during the recording. Nevertheless, ANNs exhibit an enlarged error between the fth and twelfth time step. Figure 7.5 shows the result of each algorithm graphically. Linear Regression 0.3085176656 1.646203031 1.0256883764 0.1164872166 0.3413022364 ANN 0.004800291 1.2709131705 0.1429367233 0.0333035631 0.18391944 ESN 0.0241712849 0.5452127648 0.1864564588 0.0147262849 0.1213519051

Minimal Error Maximal Error Mean Error Variance Standard Deviation

Table 7.2: Comparing all mapping algorithms regarding their produced output for the defense behavior. Color is used to indicate largest and smallest values. As it can be seen ANNs and ESNs calculate good results regarding the overall error.

[Figure 7.5 plot data omitted: three panels (linear regression, artificial neural net, echo state network) plotting the Euclidean distance over 100 time steps.]

Figure 7.5: The yoga behavior learned with linear regression, an ANN and an echo state network. The interaction model learned with a neural net used a sliding window size of 10 input points.

The low-dimensional embeddings projected with PCA can be seen in figure 7.6; the illustration also shows key postures during the interaction. The yoga example shows how the introduced two-person interaction model can be used in a training-like scenario. This experiment is well suited since it does not include fast movements. Possible applications lie in the fields of physiotherapy or ergotherapy; reinforcement learning or home training implementations are possible as well.




Figure 7.6: Two embeddings of the yoga teaching behavior calculated with PCA. The left trajectory belongs to the teacher and the right one to the trainee. Since both curves describe similar movements performed by different persons, the key postures are similar as well. Colored regions indicate how key postures correspond to the overall trajectory.

A potential downside of this training method lies in the limited tracking capabilities of the employed Kinect SDK. Since not all body parts, like hands or feet, are included in the recording, a false pose can be adopted by the user without being recognized by the interaction model, since no data has been supplied for these parts. To avoid this issue a more advanced software development kit could be used; this does in general not affect the interaction model learning approach.

7.3 Defending Oneself


In this example a virtual human has to learn how to defend itself against a user-driven avatar. One person was instructed to move a hand in the direction of the second person to push them away; simultaneously, the second person lifted their arms for protection. Shortly after, the first person was told to punch again using the other arm, and the second person lifted their arms once more. Then the first person was instructed to lift their right leg in the direction of the second person; the defender was told to crouch down and lift their arms. Key postures of the recording can be seen in figure 7.7. The recorded movements are reduced in dimensionality and used as the data basis for the interaction learning algorithm. The behavior of the second person is assigned to the virtual human. Once the mapping has been learned, the synthetic humanoid can defend itself against the trained attacks. Figure 7.8 shows a user trying to attack the virtual human with their avatar: the simulation recognizes the person's postures and the virtual human adopts a defensive position.



Figure 7.7: Key postures during the recording of the defense example. The left person was instructed to punch twice and kick once. The right person was instructed to defend themselves by crouching down or pulling up their arms.

Figure 7.8: A synthetic humanoid learns to defend itself using an interaction model. Three basic movements are executed by the attacker and the virtual human protects itself by pulling up its arms or crouching down.

The calculated points from the interaction model have been sent to the virtual human in order to have it adopt a suitable pose. Meanwhile, the points have also been sent to a simulated Nao robot; the output can be seen in figure 7.9.

Figure 7.9: The same experiment that has been conducted with the defending virtual human has also been executed with a simulated Nao robot. The user moved in the same way as in the example above, but was now interacting with the robot.

The interaction model for this example has been learned with all mapping algorithms (see figure 7.10). A comparative overview of the produced results can be seen in table 7.3. As in the previous example, the mapping calculated with linear regression featured a high error rate with a maximal error of 1.6130501833, which implies that LR should not be used for more complex scenarios. Echo state networks showed an even larger error with a very high mean deviation. Another disadvantage of ESNs is the creation process of the net: each time a model is learned a random reservoir is created, so the learning capabilities change even for the same scenario. This makes an echo state network unpredictable. Hence, artificial neural nets should be used for complex behaviors instead.

                      Linear Regression   ANN             ESN
Minimal Error         0.0696543141        0.0158157285    0.0301591201
Maximal Error         1.6130501833        0.5487787249    2.5188975751
Mean Error            0.6002898575        0.1721007023    0.6963446182
Variance              0.1297731082        0.0151932911    0.2514476179
Standard Deviation    0.3602403478        0.1234545006    0.5014455283

Table 7.3: Comparison of all mapping algorithms regarding their produced output for the defense behavior. Color is used in the original table to indicate the largest and smallest values.
[Figure 7.10 plot data omitted: three panels (linear regression, artificial neural net, echo state network) plotting the Euclidean distance over 100 time steps.]

Figure 7.10: The defense behavior learned with linear regression, an ANN and an echo state network. As can be seen, the error varies greatly, and the artificial neural net produces the most accurate mapping results.

The recorded behavior started in an upright position with both arms stretched out. This posture is also reached at the end of the motion; hence PCA creates a closed trajectory, as can be seen in the right plot of figure 7.11. The highlighted regions indicate how key postures correspond to the overall trajectory. The virtual human clearly learned to raise its arms as soon as the avatar moves too close. A differentiation between arm and leg movements has also been learned: the virtual human crouches down and pulls up its arms into a protective position as soon as the avatar's right leg starts moving in its direction. The interaction model that has been learned in this example can be valuable for controlling avatars or synthetic humanoids in virtual reality. Since the executed behaviors are based on human motion data, a more life-like appeal of virtual characters can be produced. These two-person interaction models could also be used for controlling synthetic humanoids in game-like scenarios to increase the reactivity of the virtual characters.


[Figure 7.11 panels: user postures (left) and virtual human postures (right), each plotted in the PC 1 / PC 2 plane.]

Figure 7.11: The left graphic shows the first two PCs of the behavior of the attacking person. The right illustration displays the behavior trajectory of the defending person. Highlighted regions show how key postures correspond to the trajectories.

7.4 Conclusion
In this chapter three examples have been presented in which two-person interaction models have been used to control a virtual human or a humanoid robot. In the first scenario a simple arm mirroring movement has been learned. The interaction model has been created with all three mapping algorithms, since the movement is not very complex; all mapping algorithms achieve nearly the same results with a low average error. During a live interaction this model has been used to control a virtual human's posture as well as the joint angles of a Nao robot. With the learned interaction model the virtual human and the robot were able to mirror the person's arm movement successfully. Even when the person started moving faster or slower, both were able to generalize the observed behavior and set their own joint angle values accordingly. In the second example an interaction model has been used to predict user positions during a yoga lesson. After the model had been learned, the user in front of the TV was instructed to mimic the behavior of the virtual human. The person's body was captured with the Kinect camera, and the newly created low-dimensional posture was used as the interaction model's input point. After each key posture had been reached, the virtual human adopted the next key pose, so the virtual training could continue. Regarding the mapping algorithms, only two techniques have proven to work in this scenario: due to the simple nature of linear regression a mapping with a small error could not be calculated, whereas ANNs and ESNs created mappings with an average error of 0.1429367233 and 0.1864564588, respectively. The user in front of the TV conducted the experiment with various execution speeds. In all cases the virtual human was able to reach the next key pose as soon as the user executed the shown movement correctly. Hence, the interaction model was able to generalize and produce visually appealing results.


During the last example a defend behavior has been learned: the virtual human was supposed to learn how to defend itself against a user-driven avatar. It has been pointed out that only an artificial neural net could be used to calculate the desired mapping. As in the second example, the linear regression method produced an error greater than 0.6. The mapping created with an echo state network has also proven to be unsuited for this example: its average error of 0.6963 was too large compared to the one obtained by an ANN. Additionally, the mapping created by the ESN seemed to be very sensitive to its control variables, and the overall smoothness of the trajectory was not sufficient. The examples have shown that interaction models can be used in a variety of applications, ranging from simple mirroring tasks to computer-game-like scenarios. In order to predict a smooth mapping, artificial neural nets should be used. In general ESNs have a smaller average error, but they tend to have a larger variance, making a smooth mapping hard to predict. In all cases linear regression can only be used for simple and short behaviors. When the shown interactions increase in complexity, the overall precision of each algorithm decreases. In order to counter this issue, the recording rate has to be set higher: more training data is created for the learning algorithm and the generalization capabilities increase. In contrast, large datasets increase the risk of overtraining neural nets. This means that for each scenario all learning algorithms should be taken into account and their produced output should be compared regarding its smoothness and precision.


8 Conclusion
This thesis presented an approach for developing a two-person interaction learning system that is based on recorded human motion data. In section 8.1 the questions that this thesis sought to answer are addressed. After that I will point out new research questions and directions in section 8.2.

8.1 Summary
The aim of this thesis was to develop a learning mechanism for teaching virtual humans and humanoid robots human-like interactions with people. The teaching of new behaviors and interaction methods should be easy and intuitive, avoiding the need for an expert. Additionally, a learned behavior should be modifiable with regard to its visual appearance. In order to create reactive and responsive virtual characters and robots, a novel two-person interaction learning method has been proposed. In contrast to other approaches, this new technique utilizes two demonstrators instead of a single teacher. This provides the necessary data to enable a robot to respond to user interactions even in changing situations. In classic methods only one demonstrator is recorded and a robot imitates its behavior; with the proposed two-person interaction learning approach, a robot learns to react to an interaction partner. Additionally, low-dimensional trajectories created from human motion data are used for model learning, a feature that distinguishes the new approach from others. The underlying data basis is intuitively recorded from shown two-person interactions: two demonstrators have to show the interaction only once in front of a motion capture device. To reduce the amount of data, different dimensionality reduction algorithms have been implemented. In several experiments principal component analysis has proven to be well suited for creating low-dimensional embeddings; the reasons are manifold, but are mostly its low computational complexity and its precision. Once a low-dimensional embedding has been calculated, several machine learning algorithms can be used to map one behavior onto another. In doing so, temporal coherence is encoded by the mapping algorithm; that is, the mapping algorithm is responsible for the selection of virtual human postures depending on the observed user poses. Linear regression, artificial neural nets and echo state networks have been implemented, a choice that is not supported by other interaction learning approaches. This gives users more flexibility when teaching interactions, since the learning technique can be chosen to fit the task. Different experiments have shown that ANNs are best suited for learning the proposed interaction models. Due to the missing generalization capabilities of linear regression, it can only be used for simple and short behaviors. Echo state networks can be used for model creation as well; it has been pointed out, however, that this algorithm is very sensitive to its control parameters, which are hard to set for unskilled users. Once a mapping has been obtained, it can be used for human-robot interactions. The user's postures are captured with a motion capture device and reduced in dimensionality. The calculated low-dimensional points are used as input data for the interaction model, which then predicts a virtual human or robot posture depending on past and current user postures. The liveliness of a character is achieved by using human motion data for training, while the interaction model calculates suitable motor movements. In doing so, vivid and responsive characters and robots can be created that interact with people either in virtual reality or in the real world; the interaction itself is encoded in the interaction model. The second goal of this thesis was to provide users with additional modifications to alter existing virtual human movements. In order to fulfill this requirement, some of the fundamental rules of computer and classic cartoon animation have been implemented as behavioral filters. It has been shown that anticipation and exaggeration can be added to existing motions, whereas animation timing and the principle of Squash and Stretch are already encoded in interaction models. To increase the vividness of character animations even further, emotional additions have been introduced: based upon psychological surveys, filters for interaction models have been implemented. Since interaction models are used for virtual humans and humanoid robots, which in general do not have facial features, the postural expression of emotions had to be addressed. Sadness, anger and happiness are the emotions included in the provided filter set.

8.2 Future Work


Future work in this area could go in a number of different directions. First, support for different or multiple motion capture systems should be considered. The currently used Microsoft Kinect camera offers a wide range of advantages; its low cost and consumer availability are certainly the main benefits. On the other hand, the camera lacks precision and limits the user's interaction radius. Hence, large-scale tracking installations should be used to record more complex behaviors.

Another possible direction is the integration of multiple demonstrations into the creation of interaction models, since humans also learn new skills by generalizing over multiple actions. Including these could increase the precision of the calculated low-dimensional embeddings as well as improve the overall learning rate.

Another promising research direction concerns dimensionality reduction. Since some techniques are highly sensitive to their neighborhood parameter settings, an automatic regulator could be employed [SMR06]. Additionally, a method for automatically selecting a suitable dimensionality reduction technique could be considered.


A Appendix
A.1 A more Technical Interaction Learner Example
The following code is a complete example of learning a basic interaction model from a demonstrated two-person interaction.
#include "InteractionLearnerENN.h"
#include "DimensionReducerFactory.h"
#include "DimensionReducerSettings.h"

int main()
{
    // The connection parameters (ip, dbName, user, credentials) and the
    // behavior name and type are placeholders for the concrete setup.
    SqlConnector* sqlCon = new SqlConnector(ip, dbName, user, credentials);

    if (!sqlCon->connect())
        return 1;

    // Load the recorded two-person behavior (26 joint angles per frame).
    GClasses::GMatrix* behavior = new GClasses::GMatrix(26, 0);
    if (!sqlCon->getBehaviour(name, type, behavior))
        return 1;

    if (sqlCon->isConnected())
        sqlCon->disconnect();

    // Create one PCA-based dimension reducer per demonstrator.
    DimensionReducerSettings settings;
    DimensionReducerFactory* dimReducerFactory =
        new DimensionReducerFactory(settings);
    DimensionReducer* stReducer =
        dimReducerFactory->getReducer(DimensionReducerFactory::PCA);
    DimensionReducer* ndReducer =
        dimReducerFactory->getReducer(DimensionReducerFactory::PCA);
    int targetDimension = 4;

    // firstPersonBehaviorData and secondPersonBehaviorData denote the two
    // demonstrators' halves of the loaded behavior matrix.
    stReducer->setData(firstPersonBehaviorData);
    stReducer->transform(targetDimension);

    ndReducer->setData(secondPersonBehaviorData);
    ndReducer->transform(targetDimension);

    // Map the first person's low-dimensional trajectory onto the second
    // person's trajectory with an artificial neural net.
    InteractionLearnerNeuralNet* learner = new InteractionLearnerNeuralNet();
    learner->setStData(stReducer->getTransformedData());
    learner->setNdData(ndReducer->getTransformedData());
    learner->run();

    return 0;
}

Listing A.1: A simple example where two behaviors are reduced in dimensionality and mapped with an ANN
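Once learner->run() has finished, the trained model can be queried frame by frame during a live interaction. The following sketch outlines this step; the wrapper types MotionCaptureDevice and Robot as well as the methods reduce(), predict() and reconstruct() are illustrative assumptions and are not taken from the actual code base.

// Sketch of a live interaction loop. All device and method names below
// (MotionCaptureDevice, Robot, reduce(), predict(), reconstruct()) are
// hypothetical placeholders.
void interact(InteractionLearnerNeuralNet* learner,
              DimensionReducer* stReducer,
              DimensionReducer* ndReducer,
              MotionCaptureDevice* kinect,
              Robot* nao)
{
    while (kinect->isTracking()) {
        // 1. Capture the user's current joint angles.
        std::vector<double> userPose = kinect->getJointAngles();

        // 2. Project the pose into the first person's low-dimensional space.
        std::vector<double> lowDimPose = stReducer->reduce(userPose);

        // 3. Let the interaction model predict the partner's low-dimensional pose.
        std::vector<double> response = learner->predict(lowDimPose);

        // 4. Expand the prediction back to joint angles and send it to the robot.
        nao->setJointAngles(ndReducer->reconstruct(response));
    }
}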

A.2 An Emotion Filter Example


The following example demonstrates what a sadness filter could look like when implemented for the interaction model approach presented in this thesis. For the sake of simplicity, only the most important functions are shown.
void Sadness::setSpineRotation(GClasses::GMatrix& data)
{
    // Shift both hip pitch angles by a constant offset.
    double offSet = 0.25;
    for (unsigned int i = 0; i < data.rows(); i++) {
        data[i][nao::RHipPitch] = data[i][nao::RHipPitch] + offSet;
        data[i][nao::LHipPitch] = data[i][nao::LHipPitch] + offSet;
    }
}

void Sadness::smoothMotion(GClasses::GMatrix& data)
{
    int dataRows = data.rows();
    int dataCols = data.cols();

    // First find the peaks of every joint trajectory.
    PeakExtractor* peakExtractor = new PeakExtractor();
    std::vector<Peak*> peaksPerCol = peakExtractor->extract(&data);

    for (int k = 0; k < dataCols; k++) {
        // When there are more than 3 peaks, create a spline to smooth
        // the behaviour.
        if (peaksPerCol[k]->getSize() > 3) {
            // Second, modulate the data between the peaks with a spline
            // and apply some modifications.
            double* x = new double[peaksPerCol[k]->getSize()];
            for (int i = 0; i < peaksPerCol[k]->getSize(); i++) {
                x[i] = peaksPerCol[k]->getPeakIndexes()[i] * 0.125;
            }

            SplineApproximation* spline = new SplineApproximation();
            spline->createSpline(SplineApproximation::PenalizedRegression,
                                 x, peaksPerCol[k]->getPeakValues(),
                                 peaksPerCol[k]->getSize(), 0);

            double* currentCol = new double[dataRows];
            data.col(k, currentCol);
            for (int i = 0; i < dataRows; i++) {
                currentCol[i] = spline->getApproximation(i * 0.125);
            }
            data.setCol(k, currentCol);
        }
    }
}

void Sadness::setHeadRotation(GClasses::GMatrix& data)
{
    double headPitchOffset = 0.5;
    double headYawOffset = 0;
    double maxHeadPitch = 0.25;

    // Lower the head by adding a pitch offset, capped at maxHeadPitch.
    double* col = new double[data.rows()];
    data.col(nao::HeadPitch, col);
    for (unsigned int i = 0; i < data.rows(); i++) {
        col[i] = col[i] + headPitchOffset;
        if (col[i] > maxHeadPitch)
            col[i] = maxHeadPitch;
    }
    data.setCol(nao::HeadPitch, col);

    // Overlay a periodic head shake on the yaw angle (amplitude headYawOffset).
    double rows = data.rows();
    double pi = 3.141592653589;
    int framesPerShake = 19;
    int headShakes = rows / framesPerShake;
    data.col(nao::HeadYaw, col);
    int l = 0;
    for (int k = 0; k < headShakes; k++) {
        for (int i = 0; i < framesPerShake; i++) {
            col[l] = headYawOffset *
                     sin(2 * pi * ((double)i) / ((double)framesPerShake));
            l++;
        }
    }

    // Blend the remaining frames with a penalized regression spline.
    int rest = (int)rows % (int)framesPerShake;
    double* x = new double[4];
    double* y = new double[4];
    y[0] = col[headShakes * framesPerShake - 3];
    y[1] = col[headShakes * framesPerShake - 1];
    y[2] = col[(int)rows - (int)(rest / 2)];
    y[3] = col[(int)rows - 1];
    x[0] = 0.125 * (headShakes * framesPerShake - 3);
    x[1] = 0.125 * (headShakes * framesPerShake - 1);
    x[2] = 0.125 * ((int)rows - (int)(rest / 2));
    x[3] = 0.125 * (rows - 1);

    SplineApproximation* spline = new SplineApproximation();
    spline->createSpline(SplineApproximation::PenalizedRegression, x, y, 4, 0);
    for (int i = (int)rows - rest; i < rows; i++) {
        col[i] = spline->getApproximation(i * 0.125);
    }
    data.setCol(nao::HeadYaw, col);
}

void Sadness::setShoulderAngle(GClasses::GMatrix& data)
{
    // Shift both shoulder pitch angles by a constant offset.
    double offSet = 0.25;
    for (unsigned int i = 0; i < data.rows(); i++) {
        data[i][nao::RShoulderPitch] = data[i][nao::RShoulderPitch] + offSet;
        data[i][nao::LShoulderPitch] = data[i][nao::LShoulderPitch] + offSet;
    }
}

Listing A.2: Example functions that were implemented to create the impression of a sad character
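In practice, such a filter would be applied to a predicted behavior before it is sent to the robot or rendered. A possible call sequence is sketched below; the wrapper function makeSad() and the call order are assumptions for illustration, only the four member functions themselves appear in Listing A.2.

// Hypothetical usage of the Sadness filter from Listing A.2; makeSad() and
// the call order are assumptions, not part of the original code base.
void makeSad(GClasses::GMatrix& behavior)
{
    Sadness sadness;
    sadness.setHeadRotation(behavior);   // lowered, periodically shaking head
    sadness.setSpineRotation(behavior);  // shifted hip pitch angles
    sadness.setShoulderAngle(behavior);  // shifted shoulder pitch angles
    sadness.smoothMotion(behavior);      // spline smoothing for slower movement
}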

A.3 DVD Contents


The attached DVD includes the following items:
- source code of the interaction model algorithm as well as of the user interface
- source code documentation, created with Doxygen
- additional libraries (waffles, alglib, fnn, aureservoir)
- a digital version of this thesis


List of Figures
3.1 Two-dimensional non-linear manifolds lying in a three-dimensional space
3.2 Three non-linear datasets reduced in dimensionality with PCA
3.3 Three non-linear datasets reduced in dimensionality with LLE
3.4 Three non-linear datasets reduced in dimensionality with IsoMap
3.5 Comparing the neighborhood parameter values for IsoMap
3.6 Three non-linear datasets reduced in dimensionality with Manifold Sculpting
3.7 Three non-linear datasets reduced in dimensionality with all introduced methods
4.1 Overview of the proposed interaction learning approach
4.2 Illustration of recognized joint rotations
4.3 Recording of a synchronous two-person arm movement
4.4 An arm movement dataset reduced to two dimensions
4.5 Arm mirroring data reduced to two dimensions with PCA
4.6 An arm movement dataset compressed with Manifold Sculpting
4.7 An arm movement dataset assigned to a Nao robot
4.8 Behavior mapping for interaction learning
4.9 Arm mirroring data mapped with LR
4.10 Neural net introduction
4.11 Arm mirroring data mapped with an ANN
4.12 Arm mirroring data mapped with an ESN
4.13 Different reservoir sizes for the arm mirroring data
4.14 Live interaction and posture calculation overview
5.1 Squash and stretch shown on a simple character
5.2 Anticipation added to a synthetic humanoid
5.3 Exaggeration added to a synthetic humanoid
5.4 Exaggeration in posture spaces
5.5 Splines for smoothing a virtual human's movement
5.6 A secondary action added to a virtual human's behavior
5.7 Low-dimensional embeddings change when secondary actions are added
5.8 Happiness added to the defense behavior
5.9 Sadness added to the defense behavior
5.10 Anger added to the defense behavior
6.1 Main software components implementing the interaction model approach
6.2 Manipulating low-dimensional trajectory points
6.3 Motion visualization in low-dimensional space
6.4 A UML class diagram showing all animation and behavior filters
7.1 The obtained mapping error for the arm mirroring example
7.2 Utilizing an interaction model for mirroring an arm movement
7.3 Using an interaction model to interact with a Nao robot
7.4 Utilizing an interaction model for learning yoga
7.5 The yoga mapping learned with three algorithms
7.6 Key postures in the yoga example
7.7 Steps during the recording of the defense example
7.8 An interaction model used to train an avatar defending itself
7.9 The learned defense behavior for a virtual Nao robot
7.10 The defense mapping learned with three algorithms
7.11 Low-dimensional embeddings and key postures for defense learning

Bibliography
[ADGY04] A. P. Atkinson, W. H. Dittrich, A. J. Gemmell, and A. W. Young, Emotion perception from dynamic and static body expressions in point-light and full-light displays, Perception 33 (2004), 717–746.
[Adr07] Adriana Tapus and Maja J. Mataric, Emulating Empathy in Socially Assistive Robotics, AAAI Spring Symposium, 2007.
[ANDA03] Aris Alissandrakis, Chrystopher L. Nehaniv, and Kerstin Dautenhahn, Solving the correspondence problem between dissimilarly embodied robotic arms using the ALICE imitation mechanism, Proceedings of the Second International Symposium on Imitation in Animals and Artifacts, 2003, pp. 79–92.
[BbK03] Nadia Bianchi-Berthouze and Andrea Kleinsmith, A categorical approach to affective gesture recognition, Connection Science 15 (2003), no. 4, 259–269.
[Ben10] Heni Ben Amor, Imitation Learning of Motor Skills for Synthetic Humanoids, Ph.D. thesis, Technische Universität Bergakademie Freiberg, 2010.
[Ber09] Erik Berger, Ein Verfahren zum Imitationslernen durch haptische Mensch-Roboter-Interaktion, 2009.
[Ber11] Erik Berger, Visual Bootstrapping, Master's thesis, Technische Universität Bergakademie Freiberg, 2011.
[BHML10] Aryel Beck, Antoine Hiolle, Alexandre Mazel, and Raymond Losserand, Interpretation of Emotional Body Language Displayed by Robots, Proceedings of the 3rd International Workshop on Affective Interaction in Natural Environments, 2010, pp. 37–42.
[BMS+05] Catherina Burghart, Ralf Mikut, Rainer Stiefelhagen, Tamim Asfour, Hartwig Holzapfel, Peter Steinhaus, and Ruediger Dillmann, A cognitive architecture for a humanoid robot: A first approach, IEEE-RAS International Conference on Humanoid Robots (Humanoids 2005), 2005, pp. 357–362.
[Bul78] Peter Bull, The interpretation of posture through an alternative methodology to role play, British Journal of Social and Clinical Psychology 17 (1978), 1–6.
[CL11] Jing Chen and Yang Liu, Locally linear embedding: a survey, Artificial Intelligence Review 36 (2011), 29–48.
[CR10] Rawichote Chalodhorn and R. Rao, Learning to imitate human actions through eigenposes, From Motor Learning to Interaction Learning in Robots (2010), 357–381.
[CSG+11] Aaron Curtis, Jaeeun Shim, Eugene Gargas, Adhityan Srinivasan, and Ayanna M. Howard, Dance dance Pleo: developing a low-cost learning robotic dance therapy aid, Proceedings of the 10th International Conference on Interaction Design and Children (New York, NY, USA), IDC '11, ACM, 2011, pp. 149–152.
[DN02] Kerstin Dautenhahn and Chrystopher L. Nehaniv, Imitation in animals and artifacts, vol. 7, MIT Press, 2002.
[EL04] Ahmed Elgammal and Chan-Su Lee, Inferring 3d body pose from silhouettes using activity manifold learning, Computer Vision and Pattern Recognition, IEEE Computer Society Conference on 2 (2004), 681–688.
[ET10] M. S. Erden and Adriana Tapus, Postural Expressions of Emotions in a Humanoid Robot for Assistive Applications, ensta-paristech.fr (2010).
[Fod02] I. K. Fodor, A survey of dimension reduction techniques, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory (2002).
[Gas11] Michael S. Gashler, Waffles: A machine learning toolkit, Journal of Machine Learning Research MLOSS 12 (2011), 2383–2387.
[GHW10] Sebastian Gieselmann, Marc Hanheide, and Britta Wrede, Remembering interaction episodes: an unsupervised learning approach for a humanoid robot, Proc. Humanoids Conf. (2010), 566–571.
[GRGT08] Alejandra García-Rojas, Mario Gutiérrez, and Daniel Thalmann, Simulation of individual spontaneous reactive behavior, Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1 (Richland, SC), AAMAS '08, International Foundation for Autonomous Agents and Multiagent Systems, 2008, pp. 143–150.
[GRM04] J. Gratch and Stacy Marsella, Evaluating the modeling and use of emotion in virtual humans, pp. 320–327, IEEE Computer Society, 2004.
[GSG+04] Beatrice De Gelder, Josh Snyder, Doug Greve, George Gerard, and Nouchine Hadjikhani, Fear fosters flight: a mechanism for fear contagion when perceiving emotion expressed by a whole body, Proceedings of the National Academy of Sciences of the United States of America 101 (2004), no. 47, 16701–16706.
[Hot33] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24 (1933), no. 7.
[IAM+09] Shuhei Ikemoto, Heni Ben Amor, Takashi Minato, Hiroshi Ishiguro, and Bernhard Jung, Physical interaction learning: Behavior adaptation in cooperative human-robot tasks involving physical contact, RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication (2009), 504–509.
[JMP07] Herbert Jaeger, Wolfgang Maass, and Jose Principe, Special issue on echo state networks and liquid state machines, Neural Networks 20 (2007), no. 3, 287–289.
[Joh95] Ollie Johnston, The illusion of life, Hyperion, New York, 1995.
[KNY05] Hideki Kozima, Cocoro Nakagawa, and Yuriko Yasuda, Interactive robots for communication-care: A case-study in autism therapy, Robot and Human Interactive Communication 2005 (RO-MAN 2005), IEEE International Workshop on, IEEE, 2005, pp. 341–346.
[Las87] John Lasseter, Principles of traditional animation applied to 3d computer animation, SIGGRAPH Comput. Graph. 21 (1987), 35–44.
[LJ09] Mantas Lukoševičius and Herbert Jaeger, Reservoir computing approaches to recurrent neural network training, Computer Science Review 3 (2009), no. 3, 127–149.
[MC01] D. P. Mandic and J. A. Chambers, Recurrent neural networks for prediction: learning algorithms, architectures and stability, Adaptive and Learning Systems for Signal Processing, Communications, and Control, John Wiley, 2001.
[MC11] Shamma Marzooqi and Jacob W. Crandall, Expressing Emotions Through Robots: A Case Study Using Off-the-Shelf Programming Interfaces, HRI '11 Proceedings of the 6th International Conference on Human-Robot Interaction (2011), 199–200.
[MGS05] Manuel Mühlig, Michael Gienger, and Jochen J. Steil, Human-robot interaction for learning and adaptation of object movements, 2005.
[MGS10] Manuel Mühlig, Michael Gienger, and Jochen J. Steil, Human-Robot Interaction for Learning and Adaptation of Object Movements, Memory (2010), 4901–4907.
[MLLM] Jérôme Monceaux, Alexandre Mazel, et al., Demonstration: First Steps in Emotional Expression of the Humanoid Robot Nao, pp. 235–236.
[MTKM08] Nadia Magnenat-Thalmann, Zerrin Kasap, and Maher Ben Moussa, Communicating with a virtual human or a skin-based robot head, ACM SIGGRAPH ASIA 2008 Courses (New York, NY, USA), SIGGRAPH Asia '08, ACM, 2008, pp. 55:1–55:7.
[M06] Mike Gashler, Dan Ventura, and Tony Martinez, Iterative Non-linear Dimensionality Reduction by Manifold Sculpting, Computer Vision and Image Understanding 104 (2006), no. 2-3, 90–126.
[PB02] Katherine Pullen and Christoph Bregler, Motion capture assisted animation: texturing and synthesis, ACM Trans. Graph. 21 (2002), 501–508.
[Ros08] Bodo Rosenhahn, Reinhard Klette, and Dimitris Metaxas, Human Motion - Understanding, Modeling, Capture and Animation, Springer, 2008.
[SK08] Bruno Siciliano and Oussama Khatib (eds.), Springer Handbook of Robotics, Springer, Berlin, Heidelberg, 2008.
[SMR06] O. Samko, A. D. Marshall, and P. L. Rosin, Selection of the optimal parameter value for the Isomap algorithm, Pattern Recognition Letters 27 (2006), 968–979.
[SR00] L. K. Saul and S. T. Roweis, An introduction to locally linear embedding.
[SR04] Lawrence K. Saul and Sam T. Roweis, Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds, Journal of Machine Learning Research 4 (2004), no. 2, 119–155.
[SR08] Ahmad S. Shaarani and Daniela M. Romano, The intensity of perceived emotions in 3d virtual humans, Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3 (Richland, SC), AAMAS '08, International Foundation for Autonomous Agents and Multiagent Systems, 2008, pp. 1261–1264.
[Tho63] W. H. Thorpe, Learning and instinct in animals, John M. Prather Lecture Series in Biology, Methuen, 1963.
[TLKS08] Kai-Tai Tang, Howard Leung, Taku Komura, and Hubert P. H. Shum, Finding repetitive patterns in 3d human motion captured data, Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication (New York, NY, USA), ICUIMC '08, ACM, 2008, pp. 396–403.
[TSL00] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 290 (2000), no. 5500, 2319–2323.
[Wal98] Harald G. Wallbott, Bodily expression of emotion, European Journal of Social Psychology 28 (1998), no. 6, 879–896.
[Zen06] Thomas R. Zentall, Imitation: definitions, evidence, and mechanisms, Animal Cognition 9 (2006), no. 4, 335–353.


Eidesstattliche Erklärung


I hereby declare that I have written this thesis without unauthorised assistance from third parties and without the use of aids other than those stated; ideas taken directly or indirectly from other sources have been identified as such.

Freiberg, den 24.10.2011 David Vogt
