Preparation
Download the minirace.py and sprites.py Python files. The class Minirace implements
the racing game simulation. Running sprites.py will create datasets of screenshots
for your first task.
A new racing game can be created like this:
from minirace import Minirace
therace = Minirace(level=1)
Here, level sets the information an RL agent receives from the environment. The car
is 2 × 2 pixels, and cannot leave the field. The track segments are 6 pixels wide, and
have positions from 1 (left) to 5 (right), and the car has 7 different positions (from 0
to 6). The front of the car (in the second row from the bottom, row 1) must remain on
drivable terrain at all times. The rear of the car (in the first row from the bottom, row 0)
is allowed to come off road with no penalty.
At each step during a race, the agent will get a reward of +1. Once the front of the car
comes off road, the episode finishes.
Task 1: Train a CNN to predict a clear road ahead (15 points)
The Python program sprites.py creates a training and a test set of “minirace” scenes,
trainingpix.csv (1024 examples) and testingpix.csv (256 examples). Each
row represents a 16 × 16 screenshot (flattened in row-major order), plus an extra value
of either 0 or 1 that indicates if the car can safely drive straight without going off-road
in the immediate next step (i.e., there are 257 columns).
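A minimal sketch of loading one of these files with NumPy and restoring the 16 × 16 images, assuming the CSV has no header row and that the 0/1 label is the last (257th) column. The function name load_minirace_csv is hypothetical, not part of sprites.py.

```python
import numpy as np

def load_minirace_csv(path):
    """Load a sprites.py CSV into images and labels (hypothetical helper).

    Assumes each row holds 256 pixel values in row-major order followed by
    one 0/1 label in the last column, and that there is no header row.
    """
    data = np.loadtxt(path, delimiter=",", ndmin=2)
    if data.shape[1] != 257:
        raise ValueError(f"expected 257 columns, got {data.shape[1]}")
    # Row-major flattening means a plain reshape restores the 16 x 16 grid;
    # the extra axis of size 1 is the channel dimension most CNN layers expect.
    images = data[:, :256].reshape(-1, 1, 16, 16).astype(np.float32)
    labels = data[:, 256].astype(np.int64)
    return images, labels
```

For example, X_train, y_train = load_minirace_csv("trainingpix.csv") should give arrays of shape (1024, 1, 16, 16) and (1024,) if the file layout matches these assumptions.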
Steps
1. Create the datasets by running the sprites.py code.
2. Create a CNN that predicts whether the car can safely remain in its current
position (i.e., drive straight) without crashing into non-drivable terrain.
(a) Describe (no programming): what is a good loss function for this problem?
(b) Implement and train the CNN on the training set.
(c) Compute the accuracy of your model on the test data set.
• You are free to choose the architecture of your network, but there should be
at least one convolutional layer.
• You can normalise/standardise the data if it helps improve the training.
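For step (c), a minimal sketch of computing test accuracy, assuming the network outputs one probability per image (for example via a sigmoid) that driving straight is safe, and that a 0.5 threshold turns it into a 0/1 prediction. The function name is hypothetical.

```python
import numpy as np

def accuracy_from_probs(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the 0/1 label.

    Assumes `probs` holds one predicted probability of "safe to drive
    straight" per example, in the same order as `labels`.
    """
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels).ravel()
    preds = (probs >= threshold).astype(labels.dtype)
    return float(np.mean(preds == labels))
```

For instance, probabilities [0.9, 0.2, 0.6, 0.4] against labels [1, 0, 0, 0] give three correct predictions out of four, an accuracy of 0.75.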
What to submit:
• A description of your CNN and the training. Calculate the size of each layer,
and include it in the description.
• Include the explanation for the loss function in your description.
• For how long did you train your model (number of epochs, time taken)? What is
the performance on the test set?
• Submit the python code for your solution (either as .py or .ipynb).
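Layer sizes can be worked out with the standard output-size formula for convolution and pooling, floor((n + 2·padding − kernel) / stride) + 1. The sketch below applies it to a hypothetical two-block architecture (not prescribed by the assignment) on the 16 × 16 input; a fully connected layer could then map the final flattened values to a single output.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical architecture: (name, kernel, stride, padding, out_channels).
# out_channels=None marks a pooling layer, which keeps the channel count.
layers = [
    ("conv1", 3, 1, 1, 8),
    ("pool1", 2, 2, 0, None),
    ("conv2", 3, 1, 1, 16),
    ("pool2", 2, 2, 0, None),
]

def layer_sizes(n=16, channels=1, spec=layers):
    """Return (name, channels, height, width, total values) for each layer."""
    sizes = []
    for name, k, s, p, out_c in spec:
        n = conv_output_size(n, k, s, p)
        if out_c is not None:
            channels = out_c
        sizes.append((name, channels, n, n, channels * n * n))
    return sizes

for name, c, h, w, total in layer_sizes():
    print(f"{name}: {c} x {h} x {w} = {total} values")
```

With these choices the feature maps go 8 × 16 × 16, 8 × 8 × 8, 16 × 8 × 8, and 16 × 4 × 4, so 256 values reach the fully connected part.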
Task 2: Train a convolutional autoencoder (10 points)
Create a convolutional autoencoder that compresses the racing game screenshots to a
small number of bytes (the encoder part), and transforms them back to the original
(the decoder part).
Steps
1. Create an undercomplete convolutional autoencoder and train it using the
training data set from the first task.
2. You can choose the architecture of the network and the size of the representation
h = f(x). The goal is to learn a representation that is smaller than the original
and still leads to recognisable reconstructions of the original.
3. (No programming): Explain the difference between an undercomplete and a
denoising autoencoder.
4. (No programming): The input images are 16 × 16 = 256 pixels. What is the size of
your hidden representation h = f(x) (the middle layer size of your autoencoder)?
Include your calculation in your report.
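As an illustration of such a calculation, the sketch below tracks the feature-map size through a hypothetical encoder of two stride-2 convolutions (kernel 3, padding 1, with 8 and then 4 filters). These layer choices are assumptions for the example only. Intermediate layers may hold more values than the input; what matters for an undercomplete autoencoder is that the middle layer h is smaller than 256.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical encoder: (kernel, stride, padding, out_channels) per layer.
encoder = [(3, 2, 1, 8), (3, 2, 1, 4)]

n, c = 16, 1  # 16 x 16 grayscale input
for k, s, p, out_c in encoder:
    n, c = conv_output_size(n, k, s, p), out_c
    print(f"{c} x {n} x {n} = {c * n * n} values")

h_size = c * n * n
print(f"h has {h_size} values, {h_size / 256:.0%} of the 256 input pixels")
```

Here the first layer produces 8 × 8 × 8 = 512 values and the second 4 × 4 × 4 = 64, so h holds 64 values, a quarter of the input size.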
What to submit:
• Submit the python code of your undercomplete autoencoder (either as .py or
.ipynb).
• For your report, write a brief description of the steps you took to create the model
and of its predictions. Include the explanation of undercomplete vs. denoising
autoencoders, and your calculations. How do you measure the quality of your model?
• Include screenshots of 1-2 output images next to the original inputs (e.g., select a
good and a bad example).
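One common way to measure reconstruction quality is the mean squared error between each input and its reconstruction; the same per-image errors can then be used to pick a good and a bad example to show. A minimal NumPy sketch, assuming the originals and reconstructions are arrays of the same shape with one image per entry along the first axis (the function names are hypothetical):

```python
import numpy as np

def reconstruction_errors(originals, reconstructions):
    """Mean squared error per image; arrays shaped (N, 16, 16) or (N, 1, 16, 16)."""
    x = np.asarray(originals, dtype=float).reshape(len(originals), -1)
    r = np.asarray(reconstructions, dtype=float).reshape(len(reconstructions), -1)
    return np.mean((x - r) ** 2, axis=1)

def best_and_worst(errors):
    """Indices of the lowest-error (good) and highest-error (bad) reconstructions."""
    return int(np.argmin(errors)), int(np.argmax(errors))
```

The mean of reconstruction_errors over the test set gives a single quality number, and best_and_worst selects which originals and reconstructions to place side by side.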
Task 3: Create an RL agent for Minirace (level 1) (15 points)
The code in minirace.py provides an environment to create an agent that can be
trained with reinforcement learning (a complete description is given at the end of this sheet).
The following is a description of the environment dynamics:
• The square represents the car; it is 2 pixels wide. The car always appears at the
bottom of the image, and at each step of the simulation the track scrolls down by
one row beneath the car.
• The agent controls the steering of the car by moving it two pixels to the left
or right. The agent can also choose to do nothing, in which case the car drives
straight. The car cannot be moved outside the boundaries.
• The agent will receive a positive reward at each step where the front part of the
car is still on track.
• An episode is finished when the front of the car hits non-drivable terrain.
In the level 1 version of the game, the observed state (the information made available to
the agent after each step) consists of one number: dx. It is the relative position of the
middle of the track right in front of the car (i.e., the piece of track in the third row from
the bottom of the image). When the track turns left in front of the car, this value will be
negative, and when the track turns right, dx is positive. As the track is six pixels wide,
the car can drive either on the left, middle, or right of a piece of track (it does not need
to drive in the middle of the road).
For this task, you should initialise the simulation like this:
therace = Minirace(level=1)
When you run the simulation, step() returns dx (…, −2, −1, 0, 1, 2, …) for the state.
Steps
1. Manually create a policy (no RL) that successfully drives the car, selecting
actions based only on the state information. The minirace.py code contains
a function mypolicy() that you should modify for this task.
2. (No programming) How many different values for dx are possible in theory (if
you ignore that the car may crash)? If you were to create a tabular reinforcement
learning agent, what size is your table for this problem (number of rows and
columns)?
3. Create a (tabular or deep) TD agent that learns to drive. If you decide to use ε-
greedy action selection, set ε = 1 initially, and reduce it during your training to
a minimum of 0.01. Keep your training going until you are either happy with the
result or the performance does not improve. (This means: do not stop just because
ε has reached 0.01; you may want to stop earlier, or you may want to keep going,
just do not reduce ε any further.)
4. When you run your training, reset the environment after every episode. Store the
sum of rewards for each episode. After or during the training, plot the total sum of
rewards per episode. This plot, the Training Reward plot, indicates the extent to which
your agent is learning to improve its cumulative reward. It is your decision when
to stop training. It is not required to submit a perfectly performing agent, but you
should show how it learns.
5. After you decide the training is complete, run 50 test episodes using your
trained policy, but with ε = 0.0 for all 50 episodes. Again, reset the environment
at the beginning of each episode. Calculate the average of the sum of rewards per
episode (call this the Test-Average) and the standard deviation (the
Test-Standard-Deviation). These values indicate how your trained agent perform