
Optimal Control of a Double Inverted Pendulum on a Cart


Alexander Bogdanov ∗
Department of Computer Science & Electrical Engineering,
OGI School of Science & Engineering, OHSU
Technical Report CSE-04-006
December 2004

In this report, a number of algorithms for optimal control of a double inverted pendulum on a cart (DIPC) are investigated and compared. Modeling is based on Euler-Lagrange equations derived by specifying a Lagrangian, the difference between the kinetic and potential energy of the DIPC system. This results in a system of nonlinear differential equations consisting of three second-order equations, which is then transformed into the usual form of six first-order ordinary differential equations (ODEs) for control design purposes. Control of a DIPC poses a certain challenge since, unlike a robot, the system is underactuated: there is one controlling force per three degrees of freedom (DOF). This report addresses the problem of optimal control minimizing a quadratic cost functional. Several approaches are tested: the linear quadratic regulator (LQR), state-dependent Riccati equation (SDRE) control, optimal neural network (NN) control, and combinations of the NN with the LQR and the SDRE. Simulations reveal superior performance of the SDRE over the LQR, as well as improvements provided by the NN, which compensates for model inadequacies in the LQR. The limited capability of the NN to approximate functions over a wide range of arguments prevents it from significantly improving the SDRE performance, providing only marginal benefits at larger pendulum deflections.

∗ Senior Research Associate, alexb@cse.ogi.edu

1 Nomenclature

m        mass
li       distance from a pivot joint to the center of mass of the i-th pendulum link
Li       length of the i-th pendulum link
θ0       wheeled cart position
θ1, θ2   pendulum angles
Ii       moment of inertia of the i-th pendulum link w.r.t. its center of mass
g        gravity constant
u        control force
T        kinetic energy
P        potential energy
L        Lagrangian
w        neural network (NN) weights
Nh       number of neurons in the hidden layer
∆t       digital control sampling time

Subscripts 0, 1, 2 used with the aforementioned parameters refer to the cart, the first (bottom) pendulum, and the second (top) pendulum, respectively.

2 Introduction

As a nonlinear underactuated plant, the double inverted pendulum on a cart (DIPC) poses a challenging control problem. It appears to be one of the attractive tools for testing linear and nonlinear control laws [6, 17, 1]. Numerous papers use the DIPC as a testbed; the cited ones are merely examples. Nearly all works on pendulum control concentrate on two problems: swing-up control design and stabilization of the inverted pendulums. In this report, the optimal nonlinear stabilization problem is addressed: stabilize the DIPC while minimizing an accumulative cost functional quadratic in states and controls. For linear systems, this leads to linear feedback control, which is found by solving a Riccati equation [5] and is thus referred to as the linear quadratic regulator (LQR). The DIPC, however, is a highly nonlinear system, and its linearization is far from adequate for control design purposes. Nonetheless, the LQR will be considered as a baseline controller, and other control designs will be tested and compared against it. To solve the nonlinear optimal control problem, we will employ several different approaches: a nonlinear extension to the LQR called SDRE; direct nonlinear optimization using a neural network (NN); and combinations of the direct NN optimization with the LQR and the SDRE.

For a baseline nonlinear control, we will utilize a technique that has shown considerable promise and involves manipulating the system dynamic equations into a pseudo-linear state-dependent coefficient (SDC) form, in which the system matrices are given explicitly as functions of the current state. Treating the system matrices as constant, the approximate solution of the nonlinear state-dependent Riccati equation is obtained for the reformulated pseudo-linear dynamical system in discrete time steps. The solution is then used to calculate a feedback control law that is optimized around the system state estimated at each time step. This technique, referred to as State-Dependent Riccati Equation (SDRE) control, is thus an extension to the LQR, as it solves the LQR problem at each time step.

As a next step, a direct optimization approach will be investigated. For this purpose, a NN will be used in the feedback loop, and a standard calculus of variations approach will be employed to adjust the NN parameters (weights) to optimize the cost functional over a wide
range of initial DIPC states. As a universal function approximator, the NN is thus trained to implement a nonlinear optimal controller.

As the last step, two combinations of feedback NN control with the LQR and the SDRE will be designed. Approximation capabilities of a NN are limited by its size and by the fact that optimization in the space of its weights is non-convex. Thus, only local minima are usually found, and the solution is at best suboptimal. The problem is more severe with wider ranges of NN inputs and outputs. To address this, a combination of the NN with a conventional suboptimal feedback control is designed to simplify the NN input-output mapping and therefore reduce training complexity. If the conventional feedback control were optimal, the optimal NN output would trivially be zero. Simple logic says that a suboptimal conventional control will simplify the NN mapping to some extent: instead of generating optimal controls, the NN is trained to produce corrections to the controls generated by the conventional suboptimal controller. For example, the LQR provides near-optimal control in the vicinity of the equilibrium, since the nonlinear and linearized DIPC dynamics are close near the equilibrium. Thus the NN will only have to correct the LQR output when the linearization accuracy diminishes. The next sections discuss the above concepts in detail and illustrate them with simulations.

3 Modeling

The DIPC system is graphically depicted in Fig. 1.

[Figure 1: Double inverted pendulum on a cart (labels: m0, m1, m2, l1, l2, L1, L2, I1, I2, θ0, θ1, θ2, u, g)]

To derive its equations of motion, one possible way is to use the Lagrange equations:

$$\frac{d}{dt}\frac{\partial L}{\partial\dot\theta} - \frac{\partial L}{\partial\theta} = Q \tag{1}$$

where L = T − P is the Lagrangian and Q is a vector of generalized forces (or moments) acting in the direction of the generalized coordinates θ and not accounted for in the formulation of the kinetic energy T and the potential energy P.

Kinetic and potential energies of the system are given by the sum of the energies of its individual components (a wheeled cart and two pendulums):

$$T = T_0 + T_1 + T_2, \qquad P = P_0 + P_1 + P_2$$

where

$$T_0 = \tfrac{1}{2}m_0\dot\theta_0^2$$

$$T_1 = \tfrac{1}{2}m_1\left[\left(\dot\theta_0 + l_1\dot\theta_1\cos\theta_1\right)^2 + \left(l_1\dot\theta_1\sin\theta_1\right)^2\right] + \tfrac{1}{2}I_1\dot\theta_1^2 = \tfrac{1}{2}m_1\dot\theta_0^2 + \tfrac{1}{2}\left(m_1l_1^2 + I_1\right)\dot\theta_1^2 + m_1l_1\dot\theta_0\dot\theta_1\cos\theta_1$$

$$T_2 = \tfrac{1}{2}m_2\left[\left(\dot\theta_0 + L_1\dot\theta_1\cos\theta_1 + l_2\dot\theta_2\cos\theta_2\right)^2 + \left(L_1\dot\theta_1\sin\theta_1 + l_2\dot\theta_2\sin\theta_2\right)^2\right] + \tfrac{1}{2}I_2\dot\theta_2^2$$
$$= \tfrac{1}{2}m_2\dot\theta_0^2 + \tfrac{1}{2}m_2L_1^2\dot\theta_1^2 + \tfrac{1}{2}\left(m_2l_2^2 + I_2\right)\dot\theta_2^2 + m_2L_1\dot\theta_0\dot\theta_1\cos\theta_1 + m_2l_2\dot\theta_0\dot\theta_2\cos\theta_2 + m_2L_1l_2\dot\theta_1\dot\theta_2\cos(\theta_1-\theta_2)$$

$$P_0 = 0, \qquad P_1 = m_1gl_1\cos\theta_1, \qquad P_2 = m_2g\left(L_1\cos\theta_1 + l_2\cos\theta_2\right)$$

Thus the Lagrangian of the system is given by

$$L = \tfrac{1}{2}\left(m_0+m_1+m_2\right)\dot\theta_0^2 + \tfrac{1}{2}\left(m_1l_1^2 + m_2L_1^2 + I_1\right)\dot\theta_1^2 + \tfrac{1}{2}\left(m_2l_2^2 + I_2\right)\dot\theta_2^2 + \left(m_1l_1 + m_2L_1\right)\cos(\theta_1)\dot\theta_0\dot\theta_1$$
$$+ m_2l_2\cos(\theta_2)\dot\theta_0\dot\theta_2 + m_2L_1l_2\cos(\theta_1-\theta_2)\dot\theta_1\dot\theta_2 - \left(m_1l_1 + m_2L_1\right)g\cos\theta_1 - m_2l_2g\cos\theta_2$$

Differentiating the Lagrangian with respect to θ̇ and θ yields the Lagrange equations (1) as

$$\frac{d}{dt}\frac{\partial L}{\partial\dot\theta_0} - \frac{\partial L}{\partial\theta_0} = u, \qquad \frac{d}{dt}\frac{\partial L}{\partial\dot\theta_1} - \frac{\partial L}{\partial\theta_1} = 0, \qquad \frac{d}{dt}\frac{\partial L}{\partial\dot\theta_2} - \frac{\partial L}{\partial\theta_2} = 0$$

Or explicitly,

$$u = \left(m_0+m_1+m_2\right)\ddot\theta_0 + \left(m_1l_1 + m_2L_1\right)\cos(\theta_1)\ddot\theta_1 + m_2l_2\cos(\theta_2)\ddot\theta_2 - \left(m_1l_1 + m_2L_1\right)\sin(\theta_1)\dot\theta_1^2 - m_2l_2\sin(\theta_2)\dot\theta_2^2$$

$$0 = \left(m_1l_1 + m_2L_1\right)\cos(\theta_1)\ddot\theta_0 + \left(m_1l_1^2 + m_2L_1^2 + I_1\right)\ddot\theta_1 + m_2L_1l_2\cos(\theta_1-\theta_2)\ddot\theta_2 + m_2L_1l_2\sin(\theta_1-\theta_2)\dot\theta_2^2 - \left(m_1l_1 + m_2L_1\right)g\sin\theta_1$$

$$0 = m_2l_2\cos(\theta_2)\ddot\theta_0 + m_2L_1l_2\cos(\theta_1-\theta_2)\ddot\theta_1 + \left(m_2l_2^2 + I_2\right)\ddot\theta_2 - m_2L_1l_2\sin(\theta_1-\theta_2)\dot\theta_1^2 - m_2l_2g\sin\theta_2$$
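The three explicit equations above are linear in the accelerations, so one simulation step only needs a 3×3 linear solve. The NumPy sketch below illustrates that step; it is not the report's simulation code. It assumes the uniform-rod relations li = Li/2, Ii = mi Li²/12 adopted in this section, the Table 1 parameter values, and g = 9.81 m/s² (the report does not state g); the function name `accelerations` is hypothetical.

```python
import numpy as np


def accelerations(theta, dtheta, u, m0=1.5, m1=0.5, m2=0.75, L1=0.5, L2=0.75, g=9.81):
    """Solve the three explicit Lagrange equations for (theta0'', theta1'', theta2'').

    Defaults are the Table 1 values; g = 9.81 m/s^2 is an assumed value.
    Uniform rods are assumed: l_i = L_i / 2 and I_i = m_i L_i^2 / 12.
    """
    l1, l2 = L1 / 2, L2 / 2
    I1, I2 = m1 * L1**2 / 12, m2 * L2**2 / 12
    _, t1, t2 = theta
    _, w1, w2 = dtheta
    c1, c2, c12 = np.cos(t1), np.cos(t2), np.cos(t1 - t2)
    s1, s2, s12 = np.sin(t1), np.sin(t2), np.sin(t1 - t2)
    a = m1 * l1 + m2 * L1
    # Coefficients multiplying the second derivatives, one row per equation
    mass = np.array([
        [m0 + m1 + m2, a * c1, m2 * l2 * c2],
        [a * c1, m1 * l1**2 + m2 * L1**2 + I1, m2 * L1 * l2 * c12],
        [m2 * l2 * c2, m2 * L1 * l2 * c12, m2 * l2**2 + I2],
    ])
    # All remaining terms moved to the right-hand side
    rhs = np.array([
        u + a * s1 * w1**2 + m2 * l2 * s2 * w2**2,
        -m2 * L1 * l2 * s12 * w2**2 + a * g * s1,
        m2 * L1 * l2 * s12 * w1**2 + m2 * l2 * g * s2,
    ])
    return np.linalg.solve(mass, rhs)


# Upright, at rest, no force: an equilibrium
print(accelerations(np.zeros(3), np.zeros(3), 0.0))
# Bottom link tilted by 0.05 rad: gravity drives it further away from upright
print(accelerations(np.array([0.0, 0.05, 0.0]), np.zeros(3), 0.0))
```

The first call returns zero accelerations because the upright configuration is an equilibrium; in the second, the positive θ̈1 reflects the instability of the upright bottom link.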
Lagrange equations for the DIPC system can be written in a more compact matrix form:

$$D(\theta)\ddot\theta + C(\theta,\dot\theta)\dot\theta + G(\theta) = Hu \tag{2}$$

where

$$D(\theta) = \begin{pmatrix} d_1 & d_2\cos\theta_1 & d_3\cos\theta_2 \\ d_2\cos\theta_1 & d_4 & d_5\cos(\theta_1-\theta_2) \\ d_3\cos\theta_2 & d_5\cos(\theta_1-\theta_2) & d_6 \end{pmatrix} \tag{3}$$

$$C(\theta,\dot\theta) = \begin{pmatrix} 0 & -d_2\sin(\theta_1)\dot\theta_1 & -d_3\sin(\theta_2)\dot\theta_2 \\ 0 & 0 & d_5\sin(\theta_1-\theta_2)\dot\theta_2 \\ 0 & -d_5\sin(\theta_1-\theta_2)\dot\theta_1 & 0 \end{pmatrix} \tag{4}$$

$$G(\theta) = \begin{pmatrix} 0 \\ -f_1\sin\theta_1 \\ -f_2\sin\theta_2 \end{pmatrix} \tag{5}$$

$$H = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}^T$$

Assuming that the centers of mass of the pendulums are at the geometrical centers of the links, which are solid rods, we have li = Li/2 and Ii = mi Li²/12. Then for the elements of the matrices D(θ), C(θ, θ̇), and G(θ) we get:

$$d_1 = m_0 + m_1 + m_2$$
$$d_2 = m_1l_1 + m_2L_1 = \left(\tfrac{1}{2}m_1 + m_2\right)L_1$$
$$d_3 = m_2l_2 = \tfrac{1}{2}m_2L_2$$
$$d_4 = m_1l_1^2 + m_2L_1^2 + I_1 = \left(\tfrac{1}{3}m_1 + m_2\right)L_1^2$$
$$d_5 = m_2L_1l_2 = \tfrac{1}{2}m_2L_1L_2$$
$$d_6 = m_2l_2^2 + I_2 = \tfrac{1}{3}m_2L_2^2$$
$$f_1 = \left(m_1l_1 + m_2L_1\right)g = \left(\tfrac{1}{2}m_1 + m_2\right)L_1g$$
$$f_2 = m_2l_2g = \tfrac{1}{2}m_2L_2g$$

Note that the matrix D(θ) is symmetric and nonsingular: it is the positive definite inertia matrix of the kinetic energy, T = ½ θ̇ᵀD(θ)θ̇.

4 Control

To design a control law, the Lagrange equations of motion (2) are reformulated into a 6th-order system of ordinary differential equations. To do this, a state vector x ∈ R⁶ is introduced:

$$x = \begin{pmatrix} \theta & \dot\theta \end{pmatrix}^T$$

Then, dropping the dependencies of the system matrices on the generalized coordinates and their derivatives, the system dynamic equations appear as:

$$\dot x = \begin{pmatrix} 0 & I \\ 0 & -D^{-1}C \end{pmatrix}x + \begin{pmatrix} 0 \\ -D^{-1}G \end{pmatrix} + \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix}u \tag{6}$$

In this report, optimal nonlinear stabilization control design is addressed: stabilize the DIPC while minimizing an accumulative cost functional quadratic in states and controls. The general problem of designing an optimal control law involves minimizing a cost function

$$J_t = \sum_{k=t}^{t_{final}} L_k(x_k, u_k), \tag{7}$$

which represents the accumulated cost of the sequence of states xk and controls uk from the current discrete time t to the final time t_final. For regulation problems, t_final = ∞. Optimization is done with respect to the control sequence, subject to the constraints of the system dynamics (6). In our case,

$$L_k(x_k, u_k) = x_k^TQx_k + u_k^TRu_k \tag{8}$$

corresponds to the standard linear quadratic cost. For linear systems, this leads to linear state-feedback control, the LQR, designed in the next subsection. For nonlinear systems, the optimal control problem generally requires a numerical solution, which can be computationally prohibitive. An analytical approximation to the nonlinear optimal control solution is utilized in the subsection on SDRE control, which represents a nonlinear extension to the LQR and yields superior results. Neural network (NN) capabilities for function approximation are employed to approximate the nonlinear control solution in the subsection on NN control. Finally, combinations of the NN with the LQR and the SDRE are investigated in the subsection following the NN control.

4.1 Linear Quadratic Regulator

The linear quadratic regulator yields an optimal solution to the control problem (7)–(8) when the system dynamics are linear. Since the DIPC is nonlinear, as described by (6), it can be linearized to derive an approximate linear solution to the optimal control problem. Linearization of (6) around x = 0 yields:

$$\dot x = Ax + Bu \tag{9}$$

where

$$A = \begin{pmatrix} 0 & I \\ -D(0)^{-1}\dfrac{\partial G(0)}{\partial\theta} & 0 \end{pmatrix} \tag{10}$$

$$B = \begin{pmatrix} 0 \\ D(0)^{-1}H \end{pmatrix} \tag{11}$$

and the continuous LQR solution is then obtained by:

$$u = -R^{-1}B^TP_cx \equiv -K_cx \tag{12}$$

where Pc is the steady-state solution of the differential Riccati equation. To implement computerized digital control, the dynamic equations (9) are approximately discretized as Φ ≈ e^{A∆t}, Γ ≈ B∆t, and the digital LQR control is then given by

$$u_k = -R^{-1}\Gamma^TPx_k \equiv -Kx_k \tag{13}$$

where P is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation

$$\Phi^T\left[P - P\Gamma\left(R + \Gamma^TP\Gamma\right)^{-1}\Gamma^TP\right]\Phi - P + Q = 0 \tag{14}$$

where Q ∈ R^{6×6} and R ∈ R are positive definite state and control cost matrices. Since the linearization (9)–(11) accurately represents the DIPC system (6) in the vicinity of the equilibrium, the LQR control (12) or (13) will be a locally near-optimal stabilizing control.
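The LQR design steps above (linearize, discretize, solve (14), form the gain) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the report's implementation: it uses the Table 1 values, assumes g = 9.81 m/s², approximates e^{A∆t} with a truncated Taylor series, solves (14) by iterating the Riccati difference equation to its fixed point, and computes the gain in the standard discrete-time form K = (R + ΓᵀPΓ)⁻¹ΓᵀPΦ associated with (14) rather than the shorthand R⁻¹ΓᵀP of (13). All function names are hypothetical.

```python
import numpy as np


def dipc_linear_model(m0=1.5, m1=0.5, m2=0.75, L1=0.5, L2=0.75, g=9.81):
    """A and B from (10)-(11) for uniform rods (l_i = L_i/2, I_i = m_i L_i^2/12)."""
    d1, d2, d3 = m0 + m1 + m2, (m1 / 2 + m2) * L1, m2 * L2 / 2
    d4, d5, d6 = (m1 / 3 + m2) * L1**2, m2 * L1 * L2 / 2, m2 * L2**2 / 3
    f1, f2 = (m1 / 2 + m2) * L1 * g, m2 * L2 * g / 2
    D0 = np.array([[d1, d2, d3], [d2, d4, d5], [d3, d5, d6]])  # D(0), eq. (3)
    dG0 = np.diag([0.0, -f1, -f2])                              # dG/dtheta at theta = 0, from (5)
    A = np.zeros((6, 6))
    A[:3, 3:] = np.eye(3)
    A[3:, :3] = -np.linalg.solve(D0, dG0)
    B = np.zeros((6, 1))
    B[3:, 0] = np.linalg.solve(D0, np.array([1.0, 0.0, 0.0]))  # D(0)^-1 H
    return A, B


def expm_taylor(M, terms=40):
    """e^M by a truncated Taylor series; adequate here because ||A dt|| is small."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ M / n
        result = result + term
    return result


def solve_dare(Phi, Gam, Q, R, max_iter=20000, rtol=1e-12):
    """Iterate the Riccati difference equation until it reaches a fixed point of (14)."""
    P = Q.copy()
    for _ in range(max_iter):
        S = R + Gam.T @ P @ Gam
        P_next = Phi.T @ (P - P @ Gam @ np.linalg.solve(S, Gam.T @ P)) @ Phi + Q
        done = np.max(np.abs(P_next - P)) <= rtol * np.max(np.abs(P_next))
        P = P_next
        if done:
            break
    return P


dt = 0.02                                            # sampling time, Table 1
Q = np.diag([5.0, 50.0, 50.0, 20.0, 700.0, 700.0])   # Table 1
R = np.array([[1.0]])
A, B = dipc_linear_model()
Phi, Gam = expm_taylor(A * dt), B * dt               # Phi ~ e^{A dt}, Gamma ~ B dt
P = solve_dare(Phi, Gam, Q, R)
K = np.linalg.solve(R + Gam.T @ P @ Gam, Gam.T @ P @ Phi)   # u_k = -K x_k
rho = np.max(np.abs(np.linalg.eigvals(Phi - Gam @ K)))      # closed-loop spectral radius
print(K, rho)
```

A spectral radius below one confirms that the gain stabilizes the discretized linear model; applied to the nonlinear plant, uk = −Kxk is only locally near-optimal, as stated above.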
4.2 State-Dependent Riccati Equation Control

An approximate nonlinear analytical solution of the optimal control problem (7)–(8) subject to (6) is given by a technique referred to as state-dependent Riccati equation (SDRE) control. The SDRE approach [3] involves manipulating the dynamic equations

$$\dot x = f(x, u)$$

into a pseudo-linear state-dependent coefficient (SDC) form in which the system matrices are explicit functions of the current state:

$$\dot x = A(x)x + B(x)u \tag{15}$$

A standard LQR problem (Riccati equation) can then be solved at each time step to design the state feedback control law online. For digital implementation, (15) is approximately discretized at each time step into

$$x_{k+1} = \Phi(x_k)x_k + \Gamma(x_k)u_k \tag{16}$$

and the SDRE regulator is then specified similarly to the discrete LQR (compare with (13)) as

$$u_k = -R^{-1}\Gamma^T(x_k)P(x_k)x_k \equiv -K(x_k)x_k \tag{17}$$

where P(xk) is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation (14) using the state-dependent matrices Φ(xk) and Γ(xk), which are treated as constant at each time step. Thus the approach is sometimes considered a nonlinear extension to the LQR.

Proposition 1. The dynamic equations of a double inverted pendulum on a cart can be presented in the SDC form (15).

Proof. From the derived dynamic equations of the DIPC system (6), it is clear that the required SDC form (15) can be obtained if the vector G(θ) can be presented in the SDC form G(θ) = Gsd(θ)θ. Let us construct Gsd(θ) as

$$G_{sd}(\theta) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -f_1\dfrac{\sin\theta_1}{\theta_1} & 0 \\ 0 & 0 & -f_2\dfrac{\sin\theta_2}{\theta_2} \end{pmatrix}$$

The elements of the constructed Gsd(θ) are bounded everywhere (with sin θi/θi taken as 1 at θi = 0), and G(θ) = Gsd(θ)θ as required. Thus the system dynamic equations can be presented in the SDC form as

$$\dot x = \begin{pmatrix} 0 & I \\ -D^{-1}G_{sd} & -D^{-1}C \end{pmatrix}x + \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix}u \tag{18}$$

The derived system equations (18) (compare with the linearized system (9)–(11)) are discretized at each time step into (16), and the control is then computed as given by (17).

Remark. In the neighborhood of the equilibrium x = 0, the system equations in the SDC form (18) turn into the linearized equations (9) used in the LQR design. This can be checked either by direct computation or by noting that, on one hand, linearization yields

$$\dot x = Ax + Bu + O(\|x\|^2)$$

and, on the other hand, the SDC form of the system dynamics is

$$\dot x = A(x)x + B(x)u$$

Since sin θi/θi → 1, D(θ) → D(0), and C(θ, θ̇) → 0 as x → 0, we have A(x) → A and B(x) → B. Therefore, it is natural to expect that the performance of the SDRE regulator will be very close to that of the LQR in the vicinity of the equilibrium, and that the difference will show at larger pendulum deflections. This is illustrated in the Simulation Results section.

[Figure 2: Neural network control diagram (NN regulator uk = NN(xk, w) in feedback with the double inverted pendulum on a cart; stage cost xkᵀQxk + Ruk²)]

4.3 Neural Network Learning Control

Neural network (NN) control is popular for nonlinear systems due to the universal function approximation capabilities of NNs. Neural networks with only one hidden layer and an arbitrarily large number of neurons represent nonlinear mappings, which can be used to approximate any nonlinear function f ∈ C(Rⁿ, Rᵐ) over a compact subset of Rⁿ [7, 4, 12]. In this section, we utilize the function approximation properties of NNs to approximate a solution of the nonlinear optimal control problem (7)–(8) subject to the system dynamics (6), and thus design an optimal NN regulator. The problem can be solved by directly implementing a feedback controller (see Figure 2) as:

$$u_k = NN(x_k, w)$$

Next, an optimal set of weights w is computed to solve the optimization problem. To do this, a standard calculus of variations approach is taken: minimization of a functional subject to the equality constraints x_{k+1} = f(xk, uk). Let λk be a vector of Lagrange multipliers in the augmented cost function

$$H = \sum_{k=t}^{t_{final}}\left\{L(x_k,u_k) + \lambda_k^T\left(x_{k+1} - f(x_k,u_k)\right)\right\} \tag{19}$$

We can now derive the recurrent Euler-Lagrange equations by solving ∂H/∂xk = 0 with respect to the Lagrange multipliers, and then find the optimal set of NN weights w* by solving the optimality condition ∂H/∂w = 0 (numerically, by means of gradient descent). For brevity, the dependence of L(xk, uk) and f(xk, uk) on xk and uk, and of NN(xk, w) on xk and w, is dropped below.
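Proposition 1 and the Remark that follows it in Section 4.2 lend themselves to a quick numerical check. The NumPy sketch below is illustrative only: it uses the Table 1 parameters, assumes g = 9.81 m/s², and its function names (`dcg`, `f_c`, `sdc`) are hypothetical. It builds D, C, G from (3)–(5), evaluates the right-hand side fc(x, u) of (6), forms A(x), B(x) of (18) with Gsd(θ), confirms A(x)x + B(x)u = fc(x, u) at an arbitrary state, and compares A(0) with a finite-difference linearization of fc at the equilibrium.

```python
import numpy as np

# Table 1 values; g = 9.81 m/s^2 is an assumption (g is not listed in the report)
m0, m1, m2, L1, L2, g = 1.5, 0.5, 0.75, 0.5, 0.75, 9.81
d1, d2, d3 = m0 + m1 + m2, (m1 / 2 + m2) * L1, m2 * L2 / 2
d4, d5, d6 = (m1 / 3 + m2) * L1**2, m2 * L1 * L2 / 2, m2 * L2**2 / 3
f1, f2 = (m1 / 2 + m2) * L1 * g, m2 * L2 * g / 2
H = np.array([1.0, 0.0, 0.0])


def dcg(th, dth):
    """D(theta), C(theta, theta_dot) and G(theta) from (3)-(5)."""
    c1, c2 = np.cos(th[1]), np.cos(th[2])
    c12, s12 = np.cos(th[1] - th[2]), np.sin(th[1] - th[2])
    D = np.array([[d1, d2 * c1, d3 * c2], [d2 * c1, d4, d5 * c12], [d3 * c2, d5 * c12, d6]])
    C = np.array([[0.0, -d2 * np.sin(th[1]) * dth[1], -d3 * np.sin(th[2]) * dth[2]],
                  [0.0, 0.0, d5 * s12 * dth[2]],
                  [0.0, -d5 * s12 * dth[1], 0.0]])
    G = np.array([0.0, -f1 * np.sin(th[1]), -f2 * np.sin(th[2])])
    return D, C, G


def f_c(x, u):
    """Right-hand side of (6): theta'' = D^{-1} (H u - C theta' - G)."""
    th, dth = x[:3], x[3:]
    D, C, G = dcg(th, dth)
    return np.concatenate([dth, np.linalg.solve(D, H * u - C @ dth - G)])


def sdc(x):
    """A(x), B(x) of (18) with Gsd(theta) = diag(0, -f1 sin(th1)/th1, -f2 sin(th2)/th2)."""
    th, dth = x[:3], x[3:]
    D, C, _ = dcg(th, dth)
    # np.sinc(t / pi) = sin(t) / t, equal to 1 at t = 0
    Gsd = np.diag([0.0, -f1 * np.sinc(th[1] / np.pi), -f2 * np.sinc(th[2] / np.pi)])
    A = np.zeros((6, 6))
    A[:3, 3:] = np.eye(3)
    A[3:, :3] = -np.linalg.solve(D, Gsd)
    A[3:, 3:] = -np.linalg.solve(D, C)
    B = np.concatenate([np.zeros(3), np.linalg.solve(D, H)])
    return A, B


# Proposition 1: A(x) x + B(x) u reproduces f_c(x, u) away from the equilibrium
x, u = np.array([0.4, 0.3, -0.5, 0.1, 1.2, -0.7]), 2.0
A, B = sdc(x)
sdc_error = np.max(np.abs(A @ x + B * u - f_c(x, u)))

# Remark: at x = 0 the SDC matrix equals the Jacobian of f_c (here by central differences)
eps = 1e-6
A_fd = np.column_stack([(f_c(eps * e, 0.0) - f_c(-eps * e, 0.0)) / (2 * eps) for e in np.eye(6)])
A0, _ = sdc(np.zeros(6))
lin_error = np.max(np.abs(A_fd - A0))
print(sdc_error, lin_error)
```

Both errors come out at round-off or finite-difference level; `np.sinc(t / np.pi)` supplies exactly the bounded extension of sin θi/θi used in the proof.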
The Euler-Lagrange equations are then derived as

$$\lambda_k = \left(\frac{\partial f}{\partial x_k} + \frac{\partial f}{\partial u_k}\frac{\partial NN}{\partial x_k}\right)^T\lambda_{k+1} + \left(\frac{\partial L}{\partial x_k} + \frac{\partial L}{\partial u_k}\frac{\partial NN}{\partial x_k}\right)^T \tag{20}$$

with λ_{t_final} initialized as a zero vector. For L(xk, uk) given by (8), ∂L/∂xk = 2xkᵀQ and ∂L/∂uk = 2ukᵀR. These equations correspond to the adjoint system shown graphically in Figure 3, with the optimality condition

$$\frac{\partial H}{\partial w} = \sum_{k=t}^{t_{final}}\left(\lambda_{k+1}^T\frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k}\right)\frac{\partial NN}{\partial w} = 0. \tag{21}$$

[Figure 3: Adjoint system diagram (λk obtained from λk+1 through the pendulum Jacobian and the NN Jacobian, with inputs 2Qxk and 2Ruk)]

The overall training procedure for the NN can now be summarized as follows:

1. Simulate the system forward in time for t_final time steps (Figure 2). Although, as mentioned, t_final = ∞ for regulation problems, in practice it is set to a sufficiently large number (in our case, t_final = 500).
2. Run the adjoint system backward in time to accumulate the Lagrange multipliers (20) (see Figure 3). System Jacobians are evaluated analytically or by perturbation.
3. Update the weights using gradient descent¹, ∆w = −γ(∂H/∂w)ᵀ.
4. Repeat until convergence or until an acceptable level of cost reduction is achieved.

¹ In practice, we use an adaptive learning rate for each weight in the network, using a procedure similar to delta-bar-delta [8].

Due to the nature of the training process (propagation of the Lagrange multipliers backward through time), this training algorithm is referred to as Back Propagation Through Time (BPTT) [12, 16].

As seen from the Euler-Lagrange equations (20), back propagation of the Lagrange multipliers includes computation of the system and NN Jacobians. The DIPC Jacobians can be computed numerically by perturbation or derived analytically from the dynamic equations (6). Let us show the analytically derived Jacobians ∂fc(x, u)/∂x and ∂fc(x, u)/∂u, where fc(x, u) is a brief notation for the right-hand side of the continuous system (6), i.e., ẋ = fc(x, u):

$$\frac{\partial f_c(x,u)}{\partial x} = \begin{pmatrix} 0 & I \\ -D^{-1}M & -2D^{-1}C \end{pmatrix} \tag{22}$$

$$\frac{\partial f_c(x,u)}{\partial u} = \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} \tag{23}$$

where the matrix M(θ, θ̇) = (M0 M1 M2), and each of its columns (note that ∂D⁻¹/∂θi = −D⁻¹(∂D/∂θi)D⁻¹) is given by

$$M_i = \frac{\partial C}{\partial\theta_i}\dot\theta + \frac{\partial G}{\partial\theta_i} - \frac{\partial D}{\partial\theta_i}D^{-1}\left(C\dot\theta + G - Hu\right) \tag{24}$$

Remark 1. Clearly, the Jacobians (22)–(24) transform into the linear system matrices (10)–(11) at the equilibrium θ = 0, θ̇ = 0.

Remark 2. From (3)–(5) it follows that M0 ≡ 0.

Remark 3. The Jacobians (22)–(24) are derived from the continuous system equations (6). BPTT requires computation of the discrete system Jacobians. Thus, to use the derived matrices in NN training, they should be discretized (e.g., as was done for the LQR and the SDRE).

Computation of the NN Jacobian is easy to perform given that the nonlinear functions of the individual neural elements are y = tanh(z). In this case, a NN with N0 inputs, a single output u, and a single hidden layer with Nh elements is described by

$$u = \sum_{i=1}^{N_h}w_{2i}\tanh\left(\sum_{j=1}^{N_0}w_{1i,j}x_j + w^b_{1i}\right) + w^b_2$$

or, in a more compact form,

$$u = w_2\tanh\left(W_1x + w^b_1\right) + w^b_2$$

where tanh(z) = (tanh(z1), ..., tanh(z_Nh))ᵀ, w2 ∈ R^Nh is a row vector of weights in the NN output, W1 ∈ R^{Nh×N0} is a matrix of weights of the hidden-layer elements, w1b ∈ R^Nh is a vector of bias weights in the hidden layer, and w2b is a bias weight in the NN output. The NN is depicted graphically in Figure 4.

Remark 4. In the case of a NN with Nout outputs (MIMO control problems), w2 becomes a matrix W2 ∈ R^{Nout×Nh}, and w2b becomes a vector w2b ∈ R^Nout.

Noting that ∂tanh(z)/∂z = 1 − tanh²(z), the NN Jacobian with respect to the NN input (state vector x) is given by

$$\frac{\partial NN}{\partial x} = \left(w_2^T\odot\mathrm{dtanh}\left(W_1x + w^b_1\right)\right)^TW_1 \tag{25}$$

where the operator ⊙ is an element-by-element multiplication of two vectors, i.e., x ⊙ y = (x1y1, ..., xnyn)ᵀ, and the vector field dtanh(z) = (1 − tanh²(z1), ..., 1 − tanh²(z_Nh))ᵀ. Again, in the MIMO case, when the NN has Nout outputs, individual rows of W2 take the place of the row vector w2 in (25).
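Equation (25) is easy to get wrong in its transposes, so a small numerical check is useful. The NumPy sketch below implements the single-output tanh network and (25), and compares the latter with central finite differences. The random weights, the test state, and the function names are illustrative assumptions; Nh = 40 follows Table 1, and NumPy's `*` plays the role of ⊙.

```python
import numpy as np


def nn_forward(x, W1, w1b, w2, w2b):
    """Single-output tanh network: u = w2 tanh(W1 x + w1b) + w2b."""
    return float(w2 @ np.tanh(W1 @ x + w1b) + w2b)


def nn_input_jacobian(x, W1, w1b, w2):
    """Equation (25): dNN/dx = (w2^T ⊙ dtanh(W1 x + w1b))^T W1, returned as a length-N0 row."""
    dtanh = 1.0 - np.tanh(W1 @ x + w1b) ** 2   # 1 - tanh^2, element by element
    return (w2 * dtanh) @ W1                    # element-wise product, then through W1


rng = np.random.default_rng(0)
N0, Nh = 6, 40                  # six states; Nh from Table 1
W1 = 0.3 * rng.standard_normal((Nh, N0))
w1b = 0.1 * rng.standard_normal(Nh)
w2 = 0.3 * rng.standard_normal(Nh)
w2b = 0.05
x = 0.2 * rng.standard_normal(N0)

J = nn_input_jacobian(x, W1, w1b, w2)

# Central finite differences as an independent check of (25)
eps = 1e-6
J_fd = np.array([
    (nn_forward(x + eps * e, W1, w1b, w2, w2b) - nn_forward(x - eps * e, W1, w1b, w2, w2b)) / (2 * eps)
    for e in np.eye(N0)
])
print(np.max(np.abs(J - J_fd)))
```

The printed discrepancy is at the finite-difference noise level; the same row vector is what enters (20) as ∂NN/∂xk.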
[Figure 4: Neural network structure (inputs x1, ..., x6; hidden layer with weights W1, w1b and Nh tanh elements; output weights w2, w2b; output u)]

The last essential part of the NN training algorithm is the NN Jacobian with respect to the network weights:

$$
\begin{aligned}
\frac{\partial NN}{\partial W_{1i}} &= \left(w_2^T\odot\mathrm{dtanh}\left(W_1x + w^b_1\right)\right)x_i\\
\frac{\partial NN}{\partial w_2} &= \tanh\left(W_1x + w^b_1\right)\\
\frac{\partial NN}{\partial w^b_1} &= w_2^T\odot\mathrm{dtanh}\left(W_1x + w^b_1\right)\\
\frac{\partial NN}{\partial w^b_2} &= 1
\end{aligned}
\tag{26}
$$

where W1i is the i-th column of W1.

Now we have all the quantities needed to compute the weight updates and the control. Explicitly, from (21) and (26), the recurrent weight update law is given by

$$
\begin{aligned}
W_{1,n+1} &= W_{1,n} - \gamma\sum_{k=t}^{t_{final}}\left\{\left[W_{2,n}^T\left(\lambda_{k+1}^T\frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k}\right)^T\right]\odot\mathrm{dtanh}\left(W_{1,n}x_k + w^b_{1,n}\right)\right\}x_k^T\\
W_{2,n+1} &= W_{2,n} - \gamma\sum_{k=t}^{t_{final}}\left(\lambda_{k+1}^T\frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k}\right)^T\tanh^T\left(W_{1,n}x_k + w^b_{1,n}\right)\\
w^b_{1,n+1} &= w^b_{1,n} - \gamma\sum_{k=t}^{t_{final}}\left[W_{2,n}^T\left(\lambda_{k+1}^T\frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k}\right)^T\right]\odot\mathrm{dtanh}\left(W_{1,n}x_k + w^b_{1,n}\right)\\
w^b_{2,n+1} &= w^b_{2,n} - \gamma\sum_{k=t}^{t_{final}}\left(\lambda_{k+1}^T\frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k}\right)^T
\end{aligned}
\tag{27}
$$

In conclusion to this section, it should be mentioned that the number of elements in the NN hidden layer affects the optimality of the control design. On one hand, the more elements, the better the theoretically achievable function approximation capabilities of the NN. On the other hand, too many elements make gradient descent in the space of NN weights more prone to getting stuck in local minima (recall that this is a non-convex problem), thus preventing good approximation levels from being achieved. Therefore, a reasonable compromise is usually necessary.

[Figure 5: Neural network + LQR control diagram]

4.4 Neural Network Control + LQR/SDRE

In the previous section, a NN was optimized directly to stabilize the DIPC at minimum cost (7)–(8). To achieve this, the NN was trained to generate optimal controls over a wide range of pendulum motion. As illustrated in the Simulation Results section, the NN approximation capabilities, limited by the number of neurons and by the numerical challenges of non-convex minimization, made stabilization achievable only within a range comparable to that of the LQR. One might ask whether it is possible to make the NN training more efficient and faster. Simple logic says that if the DIPC were stable and closer to optimal, the NN would not have much to learn, since its optimal output would be closer to zero. In the trivial case, if the DIPC were optimally stabilized by an internal controller, the optimal NN output would be zero. Now recall that the LQR provides near-optimal control in the vicinity of the equilibrium. Thus, if we include the LQR in the control scheme as shown in Figure 5, it is reasonable to expect better overall performance: the NN will be trained to generate only "corrections" to the LQR controls to provide optimality over the wide range of pendulum motion.

Although an LQR is shown in Figure 5, the SDRE controller can be used in place of the LQR as well. As demonstrated in the Simulation Results section, the SDRE control provides superior performance over the LQR in terms of minimized cost and range of stability. Thus, it is natural to expect better performance of the control design shown in Figure 5 when the SDRE is used in place of the LQR. Since the LQR and the SDRE have a lot in common (in fact, the SDRE is often called a nonlinear extension to the LQR), both cases are discussed in this section, and clarification is provided where the differences between the two call for it.

The problem solved in this section is nearly the same as in the previous section; therefore, almost all the formulas stay the same or have only a few changes. Since the control of the cart is the sum of the NN output and the LQR (SDRE) control (13) (or (17)),

$$u_k = NN(w, x_k) - Kx_k$$

where K ≡ K(xk) in the case of the SDRE.
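The adjoint recursion (20) with the LQR term included as in (28), the optimality condition (21), and the weight Jacobians (26) can be checked end to end on a small problem. The NumPy sketch below is an illustration under loud assumptions: a hypothetical two-state discretized double integrator stands in for the discretized DIPC (so ∂f/∂x = Φ and ∂f/∂u = Γ are constant), K is a hypothetical fixed stabilizing gain playing the LQR role, λ at the final time is zero as in step 2 of the training procedure, and the BPTT gradient is compared with central finite differences before taking one gradient step of the form (27) with a small, normalized step size chosen arbitrarily for the illustration. Setting K to zero recovers the pure NN case of (20).

```python
import numpy as np

# Hypothetical plant: a discretized double integrator (dt = 0.1) standing in for the DIPC
PHI = np.array([[1.0, 0.1], [0.0, 1.0]])      # df/dx
GAM = np.array([0.005, 0.1])                  # df/du
Q = np.diag([10.0, 1.0])
R = 1.0
K = np.array([1.0, 1.5])                      # hypothetical fixed stabilizing gain (the "LQR" part)


def rollout(params, x0, steps):
    """Forward pass with u_k = NN(w, x_k) - K x_k."""
    W1, w1b, w2, w2b = params
    xs, us = [x0], []
    for _ in range(steps):
        x = xs[-1]
        u = w2 @ np.tanh(W1 @ x + w1b) + w2b - K @ x
        us.append(u)
        xs.append(PHI @ x + GAM * u)
    return xs, us


def cost(params, x0, steps):
    xs, us = rollout(params, x0, steps)
    return sum(x @ Q @ x + R * u**2 for x, u in zip(xs[:-1], us))


def bptt_gradient(params, x0, steps):
    """Backward adjoint pass; returns dJ/dW1, dJ/dw1b, dJ/dw2, dJ/dw2b."""
    W1, w1b, w2, w2b = params
    xs, us = rollout(params, x0, steps)
    gW1, gw1b, gw2, gw2b = np.zeros_like(W1), np.zeros_like(w1b), np.zeros_like(w2), 0.0
    lam = np.zeros(2)                             # lambda at the final time is zero
    for k in reversed(range(steps)):
        x, u = xs[k], us[k]
        h = np.tanh(W1 @ x + w1b)
        back = w2 * (1.0 - h**2)                  # w2^T ⊙ dtanh(W1 x + w1b)
        du_dx = back @ W1 - K                     # dNN/dx from (25), minus K
        s = 2.0 * R * u + lam @ GAM               # dL/du_k + lambda_{k+1}^T df/du_k
        gW1 += s * np.outer(back, x)              # (26): column i scaled by x_i
        gw1b += s * back
        gw2 += s * h
        gw2b += s
        lam = (PHI + np.outer(GAM, du_dx)).T @ lam + 2.0 * Q @ x + 2.0 * R * u * du_dx
    return gW1, gw1b, gw2, gw2b


rng = np.random.default_rng(1)
Nh = 5
params = (0.5 * rng.standard_normal((Nh, 2)), 0.1 * rng.standard_normal(Nh),
          0.5 * rng.standard_normal(Nh), 0.0)
x0 = np.array([0.3, -0.1])
steps = 50
grads = bptt_gradient(params, x0, steps)

# Central-difference check on one hidden-layer weight and on the output bias
eps = 1e-6
W1, w1b, w2, w2b = params
E = np.zeros_like(W1)
E[0, 1] = eps
fd_W1 = (cost((W1 + E, w1b, w2, w2b), x0, steps) - cost((W1 - E, w1b, w2, w2b), x0, steps)) / (2 * eps)
fd_w2b = (cost((W1, w1b, w2, w2b + eps), x0, steps) - cost((W1, w1b, w2, w2b - eps), x0, steps)) / (2 * eps)
print(grads[0][0, 1], fd_W1, grads[3], fd_w2b)

# One gradient-descent step in the spirit of (27), with a small normalized step size
norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
gamma = 1e-4 / norm
new_params = tuple(p - gamma * g for p, g in zip(params, grads))
print(cost(params, x0, steps), cost(new_params, x0, steps))
```

In the report the plant Jacobians would instead come from discretized (22)–(23), and the step size from the adaptive per-weight scheme mentioned in the footnote to the training procedure.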
Minimization of the augmented cost function (19), by taking derivatives of H with respect to xk and w, yields slightly modified Euler-Lagrange equations:

$$\lambda_k = \left(\frac{\partial f}{\partial x_k} + \frac{\partial f}{\partial u_k}\left(\frac{\partial NN}{\partial x_k} - K\right)\right)^T\lambda_{k+1} + \left(\frac{\partial L}{\partial x_k} + \frac{\partial L}{\partial u_k}\left(\frac{\partial NN}{\partial x_k} - K\right)\right)^T \tag{28}$$

Note that when the SDRE is used instead of the LQR, ∂(K(xk)xk)/∂xk should be used in place of K in the above equation. These equations correspond to the adjoint system shown graphically in Figure 6 (compare to Figure 3), with the optimality condition (21).

[Figure 6: Adjoint NN + LQR system diagram]

The training procedure for the NN is the same as in the previous section, and all the system Jacobians and the NN Jacobian are computed in the same way. The only new item in this section is the SDRE Jacobian ∂(K(xk)xk)/∂xk, which appears in the Euler-Lagrange equations (28). This Jacobian is computed either numerically, which may be computationally expensive, or approximately as

$$\frac{\partial\left(K(x_k)x_k\right)}{\partial x_k} \approx K(x_k)$$

Experiments in the Simulation Results section illustrate the superior performance of such combined control over pure NN or LQR control. For the SDRE, a noticeable cost reduction is achieved only near critical pendulum deflections (close to the boundaries of the SDRE recovery region).

Remark. The idea of using NNs to adjust the outputs of a conventional controller, to account for differences between the actual system and the model used in the conventional control design, was also employed in a number of works [15, 14, 2, 11]. Calise, Rysdyk and Johnson [2, 11] used a NN controller to supplement an approximate feedback linearization control of a fixed-wing aircraft and a helicopter. The NN provided additional controls to match the response of the vehicle to the reference model, compensating for the approximations in the autopilot design. Wan and Bogdanov [15, 14] designed a model predictive neural control (MPNC) for an autonomous helicopter, where a NN worked in tandem with the SDRE autopilot and provided minimization of the quadratic cost function over a receding horizon. The SDRE control was designed for a simplified nonlinear model, and the NN made it possible to compensate for wind disturbances and higher-order terms unaccounted for in the SDRE design.

4.5 Receding Horizon Neural Network Control

Is it possible to further improve the control scheme designed in the previous section? Recall that the NN was trained numerous times over a wide range of pendulum angles to minimize the cost functional (19). The complexity of the approach is a function of the final time t_final, which determines the length of a training epoch. The solution suggested in this section is to apply a receding horizon framework and train the NN in a limited range of pendulum angles, along a trajectory starting at the current point and having a relatively short duration (horizon) N. This is accomplished by rewriting the cost function (19) as

$$H = \sum_{k=t}^{t+N-1}\left\{L(x_k,u_k) + \lambda_k^T\left(x_{k+1} - f(x_k,u_k)\right)\right\} + V(x_{t+N})$$

where the last term V(x_{t+N}) denotes the cost-to-go from time t + N to time t_final. Since the range of the NN inputs in this case is limited and the horizon length N is short, less time is required to train the NN. However, only local minimization of the cost function (19) is achieved: starting from a significantly different initial condition, the NN will not provide cost minimization, since it was not trained for it. This issue is addressed by periodically retraining the NN: after a period of time called an update interval, the NN is retrained, taking the current state vector as the initial one. The update interval is usually significantly shorter than the horizon, and in classical model predictive control (MPC) it is often only one time step. This technique was applied to a helicopter control problem [15, 14] and is referred to as Model Predictive Neural Control (MPNC).

In practice, the true value of V(x_{t+N}) is unknown and must be approximated. The most common approach is to simply set V(x_{t+N}) = 0; however, this may lead to reduced stability and poor performance for short horizon lengths [9]. Alternatively, we may include a control Lyapunov function (CLF), which guarantees stability if the CLF is an upper bound on the cost-to-go, and results in a region of attraction for the MPC of at least that of the CLF [9]. The cost-to-go V(x_{t+N}) can be approximated using the solution of the SDRE at time t + N:

$$V(x_{t+N}) \approx x_{t+N}^TP(x_{t+N})x_{t+N}$$

This CLF provides the exact cost-to-go for regulation assuming a linear system at the horizon time. A similar formulation was used for nonlinear regulation by Cloutier et al. [13].

All the equations from the previous section apply here as well. The Euler-Lagrange equations (20) are initialized in this case as

$$\lambda_{t+N} = \left(\frac{\partial V(x_{t+N})}{\partial x_{t+N}}\right)^T \approx 2P(x_{t+N})x_{t+N}$$

where the dependence of the SDRE solution P on the state vector x_{t+N} was neglected.
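The terminal cost and the adjoint initialization above both reuse the SDRE solution at the end of the horizon. The NumPy sketch below illustrates that computation under stated assumptions: it freezes the SDC matrices (18) at a hypothetical state x_{t+N}, discretizes them as Φ ≈ e^{A(x)∆t} (truncated Taylor series) and Γ ≈ B(x)∆t, solves (14) by fixed-point iteration, and forms V(x_{t+N}) ≈ xᵀP(x)x, λ_{t+N} ≈ 2P(x)x, and the SDRE gain. Table 1 values are used, g = 9.81 m/s² is assumed, the gain is written in the standard discrete-time form (R + ΓᵀPΓ)⁻¹ΓᵀPΦ rather than the shorthand of (17), and all function names are hypothetical.

```python
import numpy as np

# Table 1 values; g = 9.81 m/s^2 is an assumption (g is not listed in the report)
m0, m1, m2, L1, L2, g, dt = 1.5, 0.5, 0.75, 0.5, 0.75, 9.81, 0.02
d1, d2, d3 = m0 + m1 + m2, (m1 / 2 + m2) * L1, m2 * L2 / 2
d4, d5, d6 = (m1 / 3 + m2) * L1**2, m2 * L1 * L2 / 2, m2 * L2**2 / 3
f1, f2 = (m1 / 2 + m2) * L1 * g, m2 * L2 * g / 2
Q = np.diag([5.0, 50.0, 50.0, 20.0, 700.0, 700.0])
R = np.array([[1.0]])


def sdc_matrices(x):
    """A(x) and B(x) of the SDC form (18); np.sinc(t / pi) equals sin(t) / t."""
    th, dth = x[:3], x[3:]
    c1, c2 = np.cos(th[1]), np.cos(th[2])
    c12, s12 = np.cos(th[1] - th[2]), np.sin(th[1] - th[2])
    D = np.array([[d1, d2 * c1, d3 * c2], [d2 * c1, d4, d5 * c12], [d3 * c2, d5 * c12, d6]])
    C = np.array([[0.0, -d2 * np.sin(th[1]) * dth[1], -d3 * np.sin(th[2]) * dth[2]],
                  [0.0, 0.0, d5 * s12 * dth[2]],
                  [0.0, -d5 * s12 * dth[1], 0.0]])
    Gsd = np.diag([0.0, -f1 * np.sinc(th[1] / np.pi), -f2 * np.sinc(th[2] / np.pi)])
    A = np.zeros((6, 6))
    A[:3, 3:] = np.eye(3)
    A[3:, :3] = -np.linalg.solve(D, Gsd)
    A[3:, 3:] = -np.linalg.solve(D, C)
    B = np.zeros((6, 1))
    B[3:, 0] = np.linalg.solve(D, np.array([1.0, 0.0, 0.0]))
    return A, B


def expm_taylor(M, terms=40):
    """e^M by a truncated Taylor series (adequate for the small matrix A dt)."""
    result, term = np.eye(len(M)), np.eye(len(M))
    for n in range(1, terms):
        term = term @ M / n
        result = result + term
    return result


def solve_dare(Phi, Gam, max_iter=20000, rtol=1e-12):
    """Fixed-point iteration of the Riccati difference equation, converging to (14)."""
    P = Q.copy()
    for _ in range(max_iter):
        S = R + Gam.T @ P @ Gam
        P_next = Phi.T @ (P - P @ Gam @ np.linalg.solve(S, Gam.T @ P)) @ Phi + Q
        done = np.max(np.abs(P_next - P)) <= rtol * np.max(np.abs(P_next))
        P = P_next
        if done:
            break
    return P


def sdre_at(x):
    """Freeze the SDC matrices at x, discretize, and solve (14): returns P(x), K(x), Phi(x), Gamma(x)."""
    A, B = sdc_matrices(x)
    Phi, Gam = expm_taylor(A * dt), B * dt
    P = solve_dare(Phi, Gam)
    K = np.linalg.solve(R + Gam.T @ P @ Gam, Gam.T @ P @ Phi)
    return P, K, Phi, Gam


x_end = np.array([0.0, 0.15, -0.10, 0.0, 0.2, -0.1])   # hypothetical x_{t+N}
P, K, Phi, Gam = sdre_at(x_end)
V_end = x_end @ P @ x_end          # V(x_{t+N}) ~ x^T P(x) x
lam_end = 2.0 * P @ x_end          # lambda_{t+N} ~ 2 P(x) x, with P treated as constant
print(V_end, lam_end)
```

The same `sdre_at` call, evaluated at the current state instead of x_{t+N}, yields the SDRE feedback of (17) used inside the horizon.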
Stability of the MPNC is closely related to that of the traditional MPC. Ideally, in the case of unconstrained optimization, stability is guaranteed provided V(x_{t+N}) is a CLF and is an (incremental) upper bound on the cost-to-go [9]. In this case, the minimum region of attraction of the receding horizon optimal control is determined by the CLF used and the horizon length. The guaranteed region of operation contains that of the CLF controller and may be made as large as desired by increasing the optimization horizon (restricted to the infinite horizon domain) [10]. In our case, the minimum region of attraction of the receding horizon MPNC is determined by the SDRE solution used as the CLF to approximate the terminal cost. In addition, we also restrict the controls to the form given in Section 4.4, and the optimization is performed with respect to the NN weights w. In theory, the universal mapping capability of NNs implies that the stability guarantees are equivalent to those of the traditional MPC framework. However, in practice, stability is affected by the chosen size of the NN (which affects the actual mapping capabilities), as well as by the horizon length N and the update interval length (how often the NN is re-optimized). When the horizon is short, performance is more affected by the chosen CLF. On the other hand, when the horizon is long, performance is limited by the NN properties. An additional factor affecting stability is the specific algorithm used for numerical optimization. Gradient descent, which we use to minimize the receding horizon cost function of Section 4.5, is only guaranteed to converge to a local minimum (the cost function is not guaranteed to be convex with respect to the NN weights), and thus depends on the initial conditions. In addition, convergence is assured only if the learning rate is kept sufficiently small. To summarize these points, stability of the MPNC is guaranteed only under certain restricted ideal conditions. In practice, the designer must select appropriate settings and perform sufficient experiments to assure stability and performance.

4.6 Neural Network Adaptive Control

What if the pendulum parameters are unknown? In this case, the system Jacobians (22)–(23) cannot be computed directly. A solution then is to use an additional NN, called an "emulator", to emulate the pendulum dynamics (see Figures 7 and 8). This NN emulator is trained to recreate the pendulum outputs, given the control input and the current state vector. The regular error back-propagation (BP) algorithm is used if the NN emulator is connected as in Figure 7, and BPTT is employed in the case depicted in Figure 8. The NN regulator training procedure is the same as in the subsection on NN learning control, but instead of the system Jacobians (22)–(23), the NN emulator Jacobians computed as in (25) are used (see Figure 9).

[Figure 7: Neural network adaptive control diagram]

[Figure 8: Neural network adaptive control diagram]

[Figure 9: Adjoint NN adaptive system diagram]

5 Simulation results

To evaluate the control performance, two sets of simulations were conducted. In the first set, the LQR, SDRE, NN, and NN+LQR/SDRE control schemes were tested to stabilize the DIPC when both pendulums were initially deflected from their vertical position. The system parameters used in the simulations are presented in Table 1.

Slight deflections (10 degrees in opposite directions or 20 degrees in the same direction) result in very similar performance of all control approaches, as predicted (see Figs. 10–11).

Larger deflections (15 degrees in opposite directions or 30 degrees in the same direction) reveal differences between the LQR and the SDRE: the latter provides a lower cost (7)–(8) and keeps the pendulum velocities lower (see Figs. 12–13 and Table 2). The NN-based controllers behave similarly to the SDRE.

Note that the LQR could not recover the pendulums starting from a 19-degree initial deflection in opposite directions or a 36-degree deflection in the same direction. Critical cases for the LQR (pendulums starting at 18-degree deflections in opposite directions from the upward position and at 35 degrees in the same direction) are presented in Figs. 14–15. The NN control provides a consistently better cost than the LQR at these larger deflections
but are left currently beyond our scope.
Table 1: Simulation parameters

Parameter Value
m0 1.5 kg 6 Conclusions
m1 0.5 kg
m2 0.75 kg This report demonstrated potential advantage of the
L1 0.5 m SDRE technique over the LQR design in nonlinear opti-
L2 0.75 m mal control problems with an underactuated plant. Re-
∆t 0.02 s gion of pendulum recovery for SDRE appeared to be 55
Q diag(5 50 50 20 700 700) to 91 percent larger than in case of the LQR control.
R 1 Direct optimization via using neural networks yields
Nh 40 results superior to the LQR, but the recovery region is
tf inal 500 about the same as in the LQR case. This happens due to
limited approximation capabilities of the NN and non-
convex numerical optimization challenges.
Combination of the NN control with the LQR (or with
Table 2: Control performance (cost) comparison
the SDRE) provides larger recovery regions and better
overall performance. In this case the NN learns to gener-
Control Deflection of pendulums, deg.
ate corrections to the LQR (SDRE) control to compen-
Opposite direction
sate for suboptimality of the LQR (SDRE).
10 15 18 28
LQR 115.2 328.4 705.0 n/r To enhance this report, it would be valuable to inves-
SDRE 112.9 277.3 437.5 3655.8 tigate taking limited control authority into account in
NN 114.1 323.4 n/r n/r all control designs.
NN+LQR 113.3 275.7 448.3 n/r
NN+SDRE 112.5 276.3 436.6 2753.7
Same direction References
20 30 35 67
LQR 36.7 108.3 325.3 n/r [1] R. W. Brockett and H. Li. A light weight rotary
SDRE 36.0 84.0 118.0 5250.5 double pendulum: maximizing the domain of at-
NN 36.9 85.7 144.8 n/r traction. In Proceedings of the 42nd IEEE Confer-
NN+LQR 37.3 86.2 136.4 n/r ence on Decision and Control, Maui, Hawaii, De-
NN+SDRE 36.0 84.0 118.0 n/r cember 2003.

[2] A. Calise and R. Rysdyk. Nonlinear adaptive flight


control using neural networks. IEEE Control Sys-
Table 3: Regions of pendulums recovery
tems Magazine, 18(6), December 1998.
Control Recovery region, deg.
Max. opposite dir. Max. same dir.
[3] J. R. Cloutier, C. N. D’Souza, and C. P. Mracek.
LQR 18 35 Nonlinear regulation and nonlinear H-infinity con-
SDRE 28 67 trol via the state-dependent Riccati equation tech-
NN 15 38 nique: Part1, theory. In Proceedings of the Interna-
NN+LQR 21 40 tional Conference on Nonlinear Problems in Avia-
NN+SDRE 28 62 tion and Aerospace, Daytona Beach, FL, May 1996.

[4] G. Cybenko. Approximation by superpositions of


tions, but it can’t stabilize the system beyond 15 (38) a sigmoidal function. Mathematics of Control, Sig-
deg. pendulum deflections in the opposite (same) di- nals, and Systems, 2(4), 1989.
rections due to limited approximation capabilities of the
NN. [5] G. F. Franklin, J. D. Powell, and A. Emami-Naeini.
In the second set of experiments, regions of pendulum Feedback control of dynamic systems. Addison-
recovery (i.e. maximum initial pendulum deflections Wesley, 2 edition, 1991.
from which a control law can bring the DIPC back to
the equilibrium) were evaluated. The results are shown [6] K. Furuta, T. Okutani, and H. Sone. Computer
in Table 3. Cases of critical initial pendulums positions control of a double inverted pendulum. Computer
for SDRE (28 deg. deflections the in opposite directions and Electrical Engineering, 5:67–84, 1978.
or 67 deg. in the same direction) are presented for the
reference in Fig. 16–17. Note that no limits were imposed [7] K. Hornik, M. Stinchcombe, and H. White. Multi-
on the control force magnitude in all simulations. The layer feedforward neural networks are universal ap-
purpose was to compare LQR and SDRE without intro- proximators. Neural Networks, 2:359–366, 1989.
ducing model changes which are not directly accounted
for in control design. It should be mentioned however, [8] R. A. Jacobs. Increasing rates of convergence
that limited control authority and ways of taking it into through learning rate adaptation. Neural Networks,
account in the SDRE design were investigated earlier [3] 1(4):295–307, 1988.

9
[9] A. Jadbabaie, J. Yu, and J. Hauser. Stabilizing receding horizon control of nonlinear systems: a control Lyapunov function approach. In Proceedings of the American Control Conference, 1999.
[10] A. Jadbabaie, J. Yu, and J. Hauser. Unconstrained receding horizon control of nonlinear systems. In Proceedings of the IEEE Conference on Decision and Control, 1999.
[11] E. Johnson, A. Calise, R. Rysdyk, and H. El-Shirbiny. Feedback linearization with neural network augmentation applied to X-33 attitude control. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, August 2000.
[12] W. T. Miller, R. S. Sutton, and P. J. Werbos. Neural networks for control. MIT Press, Cambridge, MA, 1990.
[13] M. Sznaier, J. Cloutier, R. Hull, D. Jacques, and C. Mracek. Receding horizon control Lyapunov function approach to suboptimal regulation of nonlinear systems. Journal of Guidance, Control, and Dynamics, 23(3):399–405, May–June 2000.
[14] E. Wan, A. Bogdanov, R. Kieburtz, A. Baptista, M. Carlsson, Y. Zhang, and M. Zulauf. Model predictive neural control for aggressive helicopter maneuvers. In T. Samad and G. Balas, editors, Software Enabled Control: Information Technologies for Dynamical Systems, chapter 10, pages 175–200. IEEE Press, John Wiley & Sons, 2003.
[15] E. A. Wan and A. A. Bogdanov. Model predictive neural control with applications to a 6 DOF helicopter model. In Proceedings of the IEEE American Control Conference, Arlington, VA, June 2001.
[16] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, special issue on neural networks, 2:1550–1560, October 1990.
[17] W. Zhong and H. Rock. Energy and passivity based control of the double inverted pendulum on a cart. In Proceedings of the IEEE International Conference on Control Applications, Mexico City, Mexico, September 2001.
[Plots omitted: bottom pendulum angle θ1 (deg) and velocity dθ1/dt (deg/s), top pendulum angle θ2 (deg) and velocity dθ2/dt (deg/s), cart position θ0 (m) and velocity dθ0/dt (m/s), and control force (N) versus time (s) for the SDRE, LQR, NN, NN+LQR and NN+SDRE controllers.]
Figure 10: 10 deg. opposite direction
Figure 11: 20 deg. same direction
[Plots omitted: bottom pendulum angle θ1 (deg) and velocity dθ1/dt (deg/s), top pendulum angle θ2 (deg) and velocity dθ2/dt (deg/s), cart position θ0 (m) and velocity dθ0/dt (m/s), and control force (N) versus time (s) for the SDRE, LQR, NN, NN+LQR and NN+SDRE controllers.]
Figure 12: 15 deg. opposite direction
Figure 13: 30 deg. same direction
[Plots omitted: bottom pendulum angle θ1 (deg) and velocity dθ1/dt (deg/s), top pendulum angle θ2 (deg) and velocity dθ2/dt (deg/s), cart position θ0 (m) and velocity dθ0/dt (m/s), and control force (N) versus time (s); Figure 14 shows the SDRE, LQR, NN+LQR and NN+SDRE controllers, Figure 15 additionally shows the NN controller.]
Figure 14: 18 deg. opposite direction
Figure 15: 35 deg. same direction
[Plots omitted. Figure 16: bottom pendulum angle θ1 (deg) and velocity dθ1/dt (deg/s), top pendulum angle θ2 (deg) and velocity dθ2/dt (deg/s), cart position θ0 (m) and velocity dθ0/dt (m/s), and control force (N) versus time (s) for the SDRE and NN+SDRE controllers. Figure 17: pendulum angles θ1, θ2 (deg), pendulum velocities dθ1/dt, dθ2/dt (deg/s), cart position θ0 (m), cart velocity dθ0/dt (m/s) and control force (N) versus time (s) for the SDRE controller.]
Figure 16: 28 deg. opposite direction, SDRE vs. NN+SDRE
Figure 17: 67 deg. same direction, SDRE