Vous êtes sur la page 1sur 8

Reinforcement Car Racing with A3C

Se Won Jang Jesik Min Chan Lee


swjang@stanford.edu jesikmin@stanford.edu chanlee@stanford.edu
Stanford University

Abstract performance of entire game-solving deep-rl pipeline. We


hope that the discussion and better understanding of front-
In this paper, we introduce our implementation of the end CNN portion of deep RL pipelines drawn from this
asynchronous advantage critic model and discuss how dif- work on particular 2D game can provide some useful in-
ferent convolutional neural network designs affect its per- sights into more complex real-world challenges such as au-
formance in the OpenAI Gym CarRacing-v0 environment. tonomous vehicles.
We also explore how A3C hyperparameters (learning rate,
number of threads) influence the effectiveness of the CNN 2. Background
feature extractor by visualizing the filter weights and com-
paring the network scores. Our implementation of the 2.1. Simulation Environment
A3C model with carefully designed CNN feature extractor
Our target environment is OpenAI Gym CarRacing-v0
shows an average reward increase of approximately 100.0
Environment, an experimental OpenAI Gym for 2D car rac-
points compared to the vanilla A3C implementation, cur-
ing game that hasnt been solved yet. This particular en-
rently achieving fourth place on the CarRacing-v0 environ-
vironment requires many graphics dependencies including
ment leaderboard.
Box2D and OpenGL. As we implement algorithms such as
A3C that depends on multi-threading, we had to either hack
1. Introduction into OpenAI Gym environment itself or create a wrapper
class that takes care of racing condition so that we can cir-
The OpenAI Gym CarRacing-v0 environment is one of cumvent all the issues.
the very few unsolved environments in the OpenAI Gym There are several rules in this environment. As in Figure
framework. While many recent deep reinforcement algo- 2.1, each frame is given by 96 96 pixels with 3 color
rithms such as DDQN, DDPG, and A3C are reported to per- channels. Reward is computed for every frame, 0.1 ev-
form well in simple environments such as Atari[10][8][9], ery frame and 1000
N for every track tile visited, where N is
the complex and random car racing environment is particu- the total number of tiles in track. Since track is randomly
larly difficult to solve with prior deep reinforcement learn- generated for each game, the total number of tiles in track
ing algorithm due to its complexity and randomness. With also varies, ranging from 230 to 320 in most cases. Episode
random and complicated pixel inputs that are fed into CNN finishes when all tiles are visited or 1000 frames have been
portion, the deep RL networks that follow from the CNN played.
will not learn well. One of the most unique characteristics of the environ-
We try to solve this environment with a carefully- ment is that it has a continuous action space opposed to
designed CNN component and our modification of state- many other games including Atari games that only con-
of-art A3C algorithm with the concept of continuous cer- tain discrete actions that an agent can choose. Each action
tainty. To our knowledge, there has not been any in-depth, consists of 3 continuous values, each value corresponding
extensive exploration of how image processing part, primar- to steering, acceleration, and brake. Steering ranges from
ily composed of convolutional neural network, affects the 1.0 to 1.0, where 1.0 is the leftmost steering angle and
Jaehyun Kim (jakekim@stanford.edu), who is not taking this class but 1.0 is the rightmost steering angle an agent can take. Accel-
CS234, worked with us in devising and implementing A3C with continu- eration and brake ranges from 0 to 1.
ous certainty. We consulted with Guillaume Genthial, our CS234 mentor, Most importantly, the environment defines solving as
for some comments and regarding the direction of the paper. We also ap-
preciate to course staff in CS231A, CS231N, and CS234, including Rishi
getting average reward of 900 over 100 consecutive trials.
Bedi, our CS231N mentor, who gave us very useful feedback throughout We found the standard to be rather strict, because it was
the entire quarter. even hard for an experienced person to consistently score

1
above 900.0, and no model has solved the game yet. niques during last few years, scholars have developed un-
precedented deep reinforcement learning algorithm as well.
Since Mnih et al. redefined the era of deep RL[10], many
people solved many different games (e.g. Atari and OpenAI
Gym) with different sorts of architectures [14]. Recently,
there have been significant efforts in tackling environment
with continuous setting, for example, with continuous ac-
tion space, because such games are highly similar to the
actual world[8][11][9][17].
However, while many deep RL algorithms receive raw
pixels as inputs and therefore, take advantage of CNN heav-
(a) OpenAI Gym CarRacing-v0 Webpage ily to process those image inputs, there have been no exten-
sive exploration of how we can find a proper CNN archi-
tecture, how we can preprocess the inputs for CNNs, and
whether we can apply pretrained CNNs to a target environ-
ment. Some works have rigorously experimented with hy-
perparameters of CNN only[2] and some works have bench-
marked different types of deep RL algorithm[3]. Although
Kawaguchi et al. showed that deep learning needs to be
carefully set to get expected performance[6], there has been
no exploration of CNNs and state-of-art deep RL combined.
Here, we will extensively discuss how our novel modifica-
tion of A3C could give a competitive score by carefully set-
ting CNNs.

3. Method
3.1. Convolutional Neural Network and Image pre-
(b) In-game Screenshot of CarRacing-v0
processing
Figure 1. Our Target Simulated Environment: OpenAI Gym
CarRacing-v0. Figure 1(a) is a screenshot of the main page of As mentioned in the simulation environment section, the
OpenAI Gym CarRacing-v0 experimental environment. Figure CarRacing-v0 environment represents each game state by a
1(b) is a screenshot of the actual game being played by a person. 96 96 3 RGB array. The bottom 12 96 3 pixels
area (see Figure 1(b)) contains the car dashboard displaying
state information such as the velocity, acceleration, gyro-
2.2. Related Work scope and the relative driving wheel angle of the car agent.
There have been recent breakthroughs in convolutional It is important to note that these state information are indi-
neural networks to solve image recognition and com- rectly represented by the 96 96 3 game state array, and
puter vision related task. From the landmark work of is not explicitly accessible anywhere within the framework.
AlexNet[7], which was one of the first works that tackle Since the state frame only contains raw RGB pixel data
ImageNet classification challenge with convolutional neu- and we dont have the exact knowledge on how each state
ral networks, there have been many modifications and information is rendered on the bottom dashboard of the
improvements in the vanilla-version of CNNs such as game state frame, we decided to use Convolutional Neural
VGGNet[16], SqueezeNet[4], and hierarchal CNN or HD- Network as the feature extractor for our A3C model. This
CNN[18]. Many studies focused on implementing deep CNN module behaves as an add-on to the A3C model, ex-
but efficient and stable CNN architecture. However, one tracting visual information from the given state frames and
of the breakthroughs that made many vision task easier is squashing them to an n dimensional real vector. The A3C
the concept of transfer learning, which can improve the model, upon receiving the feature vector, computes the pol-
performance of learning by avoiding much expensive data- icy and expected return of the agent at the given state.
labeling efforts[13]. There have been many applications of Due to a number of reasons further explained in the re-
transfer learning from well-trained networks like VGGNet sults section, instead of training the CNN feature extractor
to various domains including cancer detection, video clas- with raw RGB pixel data, we take a number of steps to pre-
sification, and writing recognition[15][5][1]. process the state frames.
Aided with such remarkable achievements in CNN tech- First, we apply grey-scaling to the original raw pixel

2
frame to reduce the depth of the image from 3 to 1. Then, action our agent chooses is always one of those five actions.
we subtract 127.0 from each pixel so that the values are That is, even though the action space of the environment
zero-centered from range -127 to 128. This is so that the is entirely continuous space, the agent ignores that impor-
model becomes more robust with randomization of the filter tant characteristic just because of the intrinsic architecture
weights. Third, as mentioned in the previous paragraph, we of vanilla A3C model (see Figure 3 for vanilla A3C exam-
remove the dashboard by removing the 1296 pixels area in ple). As we will see in Section 5, DDPG, which is targeted
the bottom of each frame. We additionally crop out 96 6 for continuous action space, is not apt for our target environ-
pixels from both left and right ends of each frame so that ment due to our environments complexity, and vanilla A3C
each frame is a 84 84 square array. Since we cropped out model with cherry-picked discrete actions gives us some-
the bottom dashboard area, we need a way to restore the lost what reasonable results. However, we suggest that the sim-
velocity, acceleration and driving wheel position informa- ple A3C model can still be improved by incorporating con-
tion back into our game state. Therefore, we lastly concate- tinuous nature of softmax probability when choosing opti-
nate 5 consecutive 8484 preprocessed frames to construct mal policy. Instead of simply taking an argmax of softmax
a single state representation. This is the same method used probability vector to be the optimal action, which actually
by Mnih et al.[10] to extract spatial movement information might be extreme when softmax probability of each action
in a number of the Atari game problems. The resulting state is close to each other as we will discuss in the example be-
array has the shape of 84 84 5. low (Figure 4 and 5), by multiplying the softmax probability
This array is then fed to our CNN feature extractor. As with the argmax action, we can smooths out the five discrete
further explained in the results section, we experiment with actions into a totally continuous space. Mathematically, in-
a variety of CNN layers and hyper parameters. The criteria corporating continuous certainty does not harm backprop
that we considered are 1) the performance of the CarRacing procedure as well, since doing so is merely a multiplication
agent as defined by the environment, 2) training time. Since computation.
many RL algorithms, especially A3C models are known to
take a long time to train and converge, we put heavy focus
not only on the performance but also on the training time as
well. Deeper CNN networks, as seen in the recent trends,
may be able to more accurately represent the state as fea-
ture vectors. However, with a large number of parameters
and computational complexity, it would be very hard for us
to distinguish if our model has already converged, or is still
being optimized. This is especially so in the CarRacing-v0
problem that we are dealing with, since it is computation-
ally much heavier than other openAI Gym environments Figure 2. An example of optimal policy output in vanilla A3C
and hence takes longer to execute an action at each step. model. In this contrived example, the softmax probability of the
Since training time is a very important criteria, our im- third action is the greatest, so the policy network chooses the third
plementation of CNN does not have max-pooling, batch action as the optimal action to take from given state.
normalization or dropout layers between the convolutional
layers. Instead, we attempt to replace these layers by 1)
Let us see when our suggestion of certainty multiplica-
zero centering the raw pixel values, 2) xavier weight ini-
tion is particularly helpful. Suppose an agent encountered
tialization with reduced scale, and 3) relatively shallow net-
a corner that gives a softmax probability of, for example,
work. Our results verify our hypothesis. As shown in table
[0.19, 0.19, 0.24, 0.19, 0.19] as demonstrated in Figure 4.
1, shallower and lighter CNN feature extractor showed bet-
The simple A3C model will choose action 3, taking acceler-
ter performance, with our 2-layer CNN model ultimately
ation with magnitude of 1.0 or full acceleration. This might
achieving fourth rank in the OpenAI Gym leaderboard.
work in some cases, but usually, taking full acceleration for
consecutive time frames in this game is not a good strategy.
3.2. Continuous Certainty
If an agent encounters an unfamiliar, sharp corner after con-
We introduce the concept of continuous certainty to secutive full accelerations, the agent or even a well-trained
the vanilla A3C model. It smooths out the discrete action human cannot manipulate the car properly and will most
space that the trained agent can choose from, so that the likely to deviate from the track, which will result in seri-
output of policy network becomes completely continuous. ously bad score at the end.
While we simply cherry-picked five possible actions that the Now, let us observe what happens when an agent ex-
agent would take, it contains stark disadvantages, because tracts more information from softmax probability. As de-
regardless of how well the network is trained, the optimal scribed in Figure 5, by multiplying 0.24 to [0.0, 1.0, 0.0],

3
Figure 3. A bad scenario for optimal policy selection when we use Figure 5. Our Final A3C Model Architecture. A stack of five pre-
vanilla A3C model. The softmax probability of each action is very processed frames are input into the network. The front-end two-
close to each other, and simply taking an argmax will result in one layered CNN extracts image features from the pixel frames and
extreme discrete value. pass them into the policy network and value network. Each net-
work outputs 5 1 softmax vector and 1-dimensional scalar value
estimation respectively. The argmax of softmax vector from the
policy network is then multiplied with the softmax value to give
the optimal action the agent should take.
the agent will take the action of [0.0, 0.24, 0.0] and this will
stabilize the learning as the agent will lessen how much it
accelerates. In other words, instead of taking a full accel-
eration, because the agent is quite uncertain which action 5. Results/Analysis
is definitely better than the other, it becomes more careful 5.1. CNN Architectures
in taking the argmax action. In this particular example, the
certainty of taking the argmax action ([0.0, 1.0, 0.0]) is As mentioned in Section 2, the game agent does not
only 0.24, so the agent decides to take the acceleration of know how each state information (velocity, acceleration,
0.24. One can notice that this is how an actual novice hu- driving wheel position) gets computed and rendered on the
man driver learns how to drive, when he or she is driving bottom dashboard of the game state frame. We hypothe-
the area for the first time. sized that a carefully designed CNN would be able to ex-
tract visual state information from the RGB frames, pro-
viding the A3C RL agent sufficient data to infer critical in-
formation such as the curve angle, velocity, acceleration,
driving wheel position, distance from the road center, etc.
The CNN, upon extracting these information, would embed
them in an n-dimensional state vector, with which the A3C
network in turn computes the policy and the expected re-
ward. Thus, the training process of the two networks (CNN
and A3C) is joint rather than separate.

5.1.1 Limitations of Pretrained CNN


Figure 4. A diagram that shows how a simple additional multi-
plication of continuous probability to the argmax action smooths We initially considered pretraining the CNN to the visual
out the extreme choices of discrete action. It is the same case de- components presented in the environment. However, due to
scribed in Figure 3.2, but multiplying the softmax probability gave a number of limitations, we decided that it would be bet-
us a different, less extreme action.
ter to train the CNN along with the A3C network. First,
we could not use other pretrained CNN from classification
problems. The visual objects presented in each frame could
only be found in this specific environment, and not any-
where else, which means that none of the existing CNNs
4. Experiments had the capability to correctly detect the visual components
of this game environemnt. Second, the number of classes
For evaluation, we basically followed the evaluation rule (objects) to pretrain the network was not definite. As ex-
of OpenAI Gym CarRacing-v0. We tested many different plained in the simluation environment section, this environ-
architectures that will be discussed in the following section ment randomly generates the game map after every episode,
(Section 5, and for each configuration, we ran 100 games to virtually making it impossible to label the frames with a fi-
take average of the score. nite number of classes, let alone the fact that there is no

4
way to guarantee the correctness of the labels generated by 5.1.3 CNN Performance Analysis
automated scripts. Since the number of classes is a critical
hyperparameter for image detection / classification CNNs, We tested our model with multiple CNN architectures of
and due to the fact that they are very hard to change once varying depths, from 2 layers to 7 layers. With the deeper
the network is trained, pretraining the CNN on the game CNNs, we gave filter size of 3 and stride 2 for most layers
environment is bound to be a very complex task. Lastly, so that the network still has a large enough receptive field to
since CNNs are designed to perform well on image classi- detect large critical visual components such as a large curve,
fication / detection tasks, using pretrained CNNs may not curve start indicator, etc. With shallower networks, we had
be well suited for an RL objective, which, in this case, is to to give filters with bigger filter size and stride (8 and 4 for
maximize the reward by completing the course in a timely the 2 layer model) for the same purpose.
manner. Due to the difference in the objective, pretrained Table 1 shows the performance comparison of our mod-
CNN modules may neglect crucial information the the A3C els in the CarRacing-v0 environment. Note that although
RL agent might need to perform better. Due to these rea- we have tested with more CNN architectures, only the mod-
sons, we have deeemed it unreasonable to use a pretrained els that showed reasonable performances are listed on the
network to tackle this problem. table. In Table 1, We can clearly see that the model with
the shallowest CNN architecture performs the best in the
given task. Deeper layer CNNs show converged perfor-
5.1.2 Limitations of Raw RGB Pixel States mance of about 300, while our shallowest model with 2 lay-
ers and wider filter size shows converged performance of
In the methods section, we noted that we made the design 571 (https://gym.openai.com/evaluations/
decision to preprocess the frames and stack 5 consecutive eval_IEdi97CIQeC7ZFKmM9L3dA). Moreover, a
frames to construct a single state. Interestingly, we found close inspection of the episodic results shows that the model
that it is not easy to train the CNN feature extractor to show achieves a very high score of over 700 in approximately
reasonable performance with just the RGB state frames. 25 percent of all evaluation episodes. The reason that the
Upon close analysis of the evaluation videos recorded by mean score is 571.68 is that in three out of one hundred
the OpenAI Gym Monitoring feature, we have noticed that episodes, the car agent achieves a very low score close to
our model trained with pure RGB state frames completely zero. Although the evaluation video was not saved for these
fails to learn to make curves or slow down. This led us to episodes, we were able to reproduce this behavior with the
realize that the 96 96 3 state representation, unlike the same model later. This was due to the randomness in the
1024 1024 3 frame rendered for human players during racing circuit generation of the CarRacing-v0 environment.
interactive gameplay, was too crude to be able to caputre the In the cases where our car agent ends up with a very low
subtle changes in velocity, accleration, driving wheel posi- score, we found out that the frame contains a very sharp
tion, etc. For example, the acceleration bar in the game state 160 to 180 degrees turn in the beginning of the game, and
representation had only 2 to 3 pixels height in average, indi- the frame looks like as if there are two tracks in the game
cating that our CNN had access to only an extremely lossy that you can choose from. The car agent then gets confused
representation of the original 1024 1024 3 state frame. on which road to take, and gets stuck in-between the two
Therefore we had to make a decision to change the architec- roads in the grass zone, resulting in very low scores around
ture of our CNN to take 84 84 5 preprocessed state pixel 0 to 20 points. The result proves our initial hypothesis that
arrays, instead of 96 96 3 raw frames, as mentioned in deeper CNNs would not perform better than the shallower
Section 3.1. ones. This is due to a number of reasons.
To extend the discussion on RGB pixel states, after try- First, as we can see in Figure 1, the state frames returned
ing several schemes as in 5.1.3, we conjectured that ap- by the CarRacing-v0 environment is not complex enough to
plying canny edge detection results into some noisy, low- require a deep CNN architecture. Tracks are colored in grey,
quality edges and hog features also simply loses some im- grass in green, and the car in red. The shape of a frame is
portant information regarding tracks, which do not improve only 84 by 84 (after preprocessing), even smaller than Atari
the pipeline but actually degrade the performance. On the Pong which has the state size of 210 by 160 by 3. The game
other hand, applying Laplacian edge detection improved frame is much smaller than the average image size of the
the average performance by small amount of 20, but the ImageNet examples. Moreover, each frame typically only
standard deviation was twice greater than that of our ini- contains about 5 different colors, with very simple shapes
tial choice. Therefore, for evaluation purpose, we chose a such as straight lines, square patches of grass, etc. This
simple grayscale, meanshift, and crop strategy for image means that we do not require a complex, deep CNN archi-
preprocessing instead of Laplacian edge detector in order tecture to tackle this problem.
to make our performance more consistent and stable during Second, deeper and more complex CNN architectures
evaluation. are harder to train. Since our training objective is not the

5
classification softmax scores, the backpropagation phase is 2 Threads 4 Threads
conducted not after an image and a label is shown, but af- CNN Model #1
ter the agent receives a reward after executing an action. L1-5 (5 layers) 187.03 44.54 169.25 41.87
This means that a deep CNN model that is known to achieve F=16, W=3, S=2
super-human performance in image classification tasks may CNN Model #2
not be the best model for our RL objective. It would be very L1: F=16, W=8, S=3
hard to train the large number of parameters that follow with L2: F=32, W=5, S=2 182.42 31.71 198.23 36.96
the deeper models, and it is very likely that the model would L3: F=32, W=4, S=2
fall into a local minima at an early stage of the training. L4: F=16, W=3, S=2
Third, deeper models take longer to train. Deeper con- CNN Model #3
volutional networks inherently have a larger number of pa- L1: F=16, W=8, S=3
391.26 22.62 370.02 32.14
rameters, and hence, requires a larger set of training exam- L2: F=32, W=3, S=2
ples requires more iterations to converge. This problem be- L3: F=32, W=2, S=1
comes more evident with our A3C model, since the model CNN Model #4
is known to take a lot of time to train. For example, even L1: F=16, W=8, S=4 571.68 19.38 481.65 17.91
our model with the simplest CNN architecture (CNN Model L2: F=32, W=3, S=2
4 of Table 1) took 1 million iterations to converge. Since Table 1. Effects of CNN architecture and the number of threads to
the A3C model utilizes multicore CPUs rather than GPUs, overall performance of the pipeline. The best performance could
the training time increases significantly with increase in the be observed from the two-layered CNN architecture with 2 threads
number of parameters. Moreover, we have found that it is and is highlighted in the table.
very hard to tell whether an A3C model for this environ-
ment has converged or not. As we have described in our
CS234 paper, the model improves performance after sud- Grayscale, Canny Laplacian
den bursts in the losses, and after a long period of extremely HOG
meanshift, Edge Edge
low rewards. We have not yet found the right way to decide Features
and Crop Detector Detector
whether if the model has converged or not. In our imple- 571.68 430.15 590.90 390.28
mentations, we deemed the model converged if the average 19.38 36.71 45.01 29.23
reward does not increase by 10 points for 200,000 iterations.
However, this may not be the right way to determine the Table 2. Effects of different image preprocessing strategies to over-
convergence of the A3C model with CNN. This means that all performance of the entire pipeline. The best performance could
there is a chance that the models with deeper CNN could be observed when we apply Laplacian edge detector for the image
have converged to parameters with higher performance. But preprocessing process but with high standard deviation.
it still does not change the fact that the deeper models took
significantly more time to reach a certain performance, to
the point that we think is quite unreasonable (Over 2 days to interpret, but we assume that it captures more fine details
on Google Cloud 8-core CPU mahines). of the large patches captured by the first layer.

5.2. CNN Filter Activations


In order to check what CNN has actually learned in the
well-performing pipeline, we visualized the CNN weights
for two convolution layers. As stated in the CNN Model 4
of Table 5.1.3, the weights of the first convolutional layer
was 8 8 5 16 and the weights of the second layers was
3 3 10 32. Figure 6. 5th and 11th filter of the first convolution layer.
As shown in Figure 6, most of the filters of the first
convolution layer detect a progression of five frames from
straight line to corner. As shown in Figure 7, some filters
of the first layer detect a progression from corner to straight
track. Considering that those two types of progression are
most important features of track that the agent needs to de-
tect well to give a good performance, we can say that the Figure 7. 13th filter of the first convolution layer.
CNN with optimal hyperparmeters learned something use-
ful. The weights of the second convolution layer was hard

6
Figure 8. 23rd filter of the second convolution layer.

5.3. Different RL Algorithms


While we have focused on how CNN portion of the
pipeline influence the overall performance, another core
component of the pipeline would be deep reinforcement
learning network. We have tried many different deep RL
algorithms including classic DDQN with discrete action
space[10], vanilla DDPG[8], our modification of DDPG
with human-replay buffer, vanilla A3C[9], and our novel
implementation of A3C with continuous certainty. As
shown in Table 9, the classical approach of DDQN did not
perform well. More precisely, the agent trained with DDQN
had hard time after the very first corner. Such behavior was Figure 9. Performance of different deep RL algorithms. We eval-
expected as the map is randomized for each game and with a uated five different models we implemented to solve the task and
linearly decaying exploration epsilon value for DDQN algo- each reported average score is the performance of the best model
rithm, the agent will have hard time learning something use- for each different architecture. A3C model with continuous cer-
tainty recorded 571.68 with standard deviation of 19.38, which
ful, particularly at the early learning stage. DDPG, which
is a competitive record with some of the best scores uploaded on
was originally designed for continuous action space, did
OpenAI Gym so far.
not perform well. Even aided with human-replay buffer,
the performance was not boosted and the optimal action
values that were output from the agent quickly saturated
to some extreme values, for instance, [1.0, 1.0, 1.0]. In cumstances. More importantly, we experimented with many
other words, at least for this particular environment, DDPG different CNN architectures along with various image pre-
quickly fell into the local minimum and could not recover processing techniques that might enhance performance of
back. The agent trained with A3C algorithm gave a very CNN such as gray scaling, mean shifting, cropping, and
decent performance. However, since the vanilla version of edge detection. While there have been no significant fo-
A3C algorithm simply outputs the discrete action, we could cus on CNN component of game-solving deep reinforce-
improve the performance with our idea of continuous cer- ment learning pipeline, as deep reinforcement learning net-
tainty. It gave us the best performance among all the mod- work becomes deeper and more complicated as in the case
els and ranked fourth in the entire leaderboard. We believe of A3C with continuous certainty, our novel modification on
that our concept of reusing softmax probability after tak- A3C with multiple threads, carelessly designed (e.g. unnec-
ing argmax at the end of the output matrix can be widely essarily deep layers or random image preprocessing without
used for any sort of deep neural networks, including CNN verification) CNN portion can degrade the overall perfor-
as well, which require continuous outputs. For instance, mance of the deep RL pipeline.
for video prediction task - a task for predicting future video As a result, our final implementation using a 2-layer
frames from the past frames[12], we may add continuous CNN and 2 threads shows an average reward increase of
certainty at the end of the pipeline to get the continuous 100.0 points compared to the vanilla A3C implementation,
output. currently achieving fourth place on the CarRacing-v0 envi-
ronment leaderboard.
6. Conclusion For future work, we have several ideas to improve our
model presented in this paper. First of all, we think we may
We have presented our exploration of the OpenAI Gym try a deep residual CNN for the CNN component. Accord-
CarRacing-v0 environment where most deep reinforcement ing to our observation so far, we believe doing so will not
learning algorithms perform poorly due to complex and ran- significantly improve or may even harm the performance as
dom nature of the environment. We built on top of the asyn- the CNN becomes harder to train due to a low resolution of
chronous advantage critic (A3C) model with the concept of the inputs, but it still worths trying other network as well
continuous certainty that multiplies argmax action with so that applying state-of-art CNN network is not always ad-
softmax probability to avoid extreme action in uncertain cir- vantageous to deep RL framework. In addition, after having

7
trained for more than 4, 000, 000 iterations, we noticed that ference on Computer Vision and Pattern Recognition, pages
different models at different checkpoints tended to be good 17251732, 2014.
at one task but not as good at another task. For example, [6] K. Kawaguchi. Deep learning without poor local minima.
some overfitted model performed better at the straight por- In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and
tion, while some generalizable model did well in cornering R. Garnett, editors, Advances in Neural Information Pro-
but not as well as the overfitted ones when it comes to the cessing Systems 29, pages 586594. Curran Associates, Inc.,
2016.
straight lane (they would oscillate left and right in straight
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
line, which is an obstacle for getting higher score). Hence,
classification with deep convolutional neural networks. In
we would like to try an ensemble of models that are par- Advances in neural information processing systems, pages
ticularly good at straight lane and models that are good at 10971105, 2012.
cornering. We could also try some other image processing [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez,
to boost performance. Y. Tassa, D. Silver, and D. Wierstra. Continuous con-
Last but not least, we would like to explore the perfor- trol with deep reinforcement learning. arXiv preprint
mance of other models that we did not have enough time to arXiv:1509.02971, 2015.
do so. For instance, we suspect that A3C with LSTMs can [9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap,
enhance the performance significantly. It is true that our T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous
current model attempts to capture the window of frames methods for deep reinforcement learning. In Proceedings of
to reflect the recent history for the next move, but LSTM International Conference on Machine Learning, 2016.
is more state-of-art, reliable, and explicit way for agent to [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves,
learn how to determine its next action from a set of past ac- I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari
with deep reinforcement learning. In NIPS Deep Learning
tions. We would also love to see the performance of simple
Workshop. 2013.
policy gradients and other modifications of DDQN if possi-
[11] G. Neumann. The reinforcement learning toolbox, reinforce-
ble.
ment learning for optimal control tasks. na, 2005.
[12] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-
7. Github Repositories conditional video prediction using deep networks in atari
games. In Advances in Neural Information Processing Sys-
A3C: https://github.com/sjang92/car racing tems, pages 28632871, 2015.
DDPG1: https://github.com/jessemin/racing ddpg [13] S. J. Pan and Q. Yang. A survey on transfer learning.
DDPG2: https://github.com/jessemin/car racing IEEE Transactions on knowledge and data engineering,
DDQN: https://github.com/jakekim1009/hw2 for racing 22(10):13451359, 2010.
[14] J. Peters and S. Schaal. Reinforcement learning of motor
We implemented our DDQN agent based on the code skills with policy gradients. Neural networks, 21(4):682
from CS234 Assignment2. 697, 2008.
[15] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues,
References J. Yao, D. Mollura, and R. M. Summers. Deep convolutional
neural networks for computer-aided detection: Cnn archi-
[1] D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learn- tectures, dataset characteristics and transfer learning. IEEE
ing for latin and chinese characters with deep neural net- transactions on medical imaging, 35(5):12851298, 2016.
works. In Neural Networks (IJCNN), The 2012 International [16] K. Simonyan and A. Zisserman. Very deep convolutional
Joint Conference on, pages 16. IEEE, 2012. networks for large-scale image recognition. arXiv preprint
[2] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up arXiv:1409.1556, 2014.
automatic hyperparameter optimization of deep neural net- [17] K. G. Vamvoudakis and F. L. Lewis. Online actorcritic al-
works by extrapolation of learning curves. In IJCAI, pages gorithm to solve the continuous-time infinite horizon optimal
34603468, 2015. control problem. Automatica, 46(5):878888, 2010.
[3] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. [18] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste,
Benchmarking deep reinforcement learning for continuous W. Di, and Y. Yu. Hd-cnn: hierarchical deep convolutional
control. In Proceedings of the 33rd International Conference neural networks for large scale visual recognition. In Pro-
on Machine Learning (ICML), 2016. ceedings of the IEEE International Conference on Computer
[4] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Vision, pages 27402748, 2015.
Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and 0.5 mb model size. arXiv
preprint arXiv:1602.07360, 2016.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar,
and L. Fei-Fei. Large-scale video classification with convo-
lutional neural networks. In Proceedings of the IEEE con-

Vous aimerez peut-être aussi