Upside down reinforcement learning
Summary
How to make upside down RL: I used `transform: rotateX(180deg)`.
This has the downside of only working in a div. But it’s still something…
But moving on: in this post I’m going to take you through building a simple example of reinforcement learning using upside down RL. Upside down reinforcement learning is detailed in this report, and I found this paper to be useful in implementation. The broad outline is contained in the title of the original report: Don’t Predict Rewards – Just Map Them to Actions.
- Typically, systems predict the returns of actions, and execution proceeds by picking the action with the best predicted return.
- In upside down RL, the system reads in a state and a command and predicts the action to take.
I’ve trained an example of upside down RL on the very easy cartpole problem.
Running it yourself
If you’d like to run it right now, all you need to do is:
- Install via pip: `pip install upside-down-rl`
- Run via python: `python -m "upside_down_rl.cartpole" --render`
- Run tensorboard to monitor the output: `tensorboard serve --logdir=experiments`. This will show the performance of the model improving over time.
You should see a window with the cartpole, and you can watch as it slowly improves.
Alternatively, you can run using docker, as follows:
- Either:
- Clone the repo and cd into the deploy directory, or
- Download the docker compose file and put it into a directory somewhere
- Run
MINIO_ACCESS_KEY=somethingyoumakeup MINIO_SECRET_KEY=somesecretkeyyoulike docker-compose up
This will run 25 runs of the cartpole tool. Model params are saved to minio (an S3-compatible data store), and you can access them by going to http://localhost:80/. The runs will also appear in tensorboard, accessible at http://localhost:6006.
Approach
In general, we want to train an algorithm that takes a state and a command and tells us the action that would fulfil that command from that state. A command is of the form “Achieve reward x in n steps”. To do this, we train a supervised algorithm to predict the action that will (eventually) satisfy the desired command from the given state. To gather commands, states and actions, we use previous play episodes to build a dataset of commands and states, and train a model to predict the actions taken in those runs. Finally, when running the model to gather data, we send commands that ask for progressively higher rewards.
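To make the shape of the problem concrete, here’s a minimal sketch; the names are mine, not the package’s API:

```python
# Minimal sketch of the idea; names are illustrative, not the package's API.
# The learned "behaviour function" is an ordinary classifier: given the
# current state and a command ("achieve reward x in n steps"), predict the
# action that past experience suggests will fulfil that command.
def predict_action(model, state, desired_reward, desired_horizon):
    command = [desired_reward, desired_horizon]
    return model(state, command)  # trained with supervised learning on past plays
```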
Overall algorithm
The core algorithm, as I’ve applied it, is as follows:
- Play the game with random actions many times to initialise the system. Save each play into two buffers:
- One is a FIFO buffer that holds the most recent plays.
- One is a priority queue that keeps the best plays. Plays are compared first on total reward, then on length of play; in both cases, higher is better. (A sketch of both buffers follows this list.)
- For as many epochs as necessary:
- Train a classifier to map from state and command to action, as described below
- Execute plays, using the classifier to pick actions
- Collect executed plays in the same FIFO and priority queue as described in step one
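Here is a minimal sketch of the two buffers. The names are mine, and I’m assuming an episode exposes its rewards and total reward (as the Episode class described in the code tour does):

```python
import heapq
import itertools
from collections import deque

MAX_SIZE = 200  # both buffers were capped at 200 in my cartpole runs

recent_plays = deque(maxlen=MAX_SIZE)  # FIFO: the oldest play is evicted once full
best_plays = []                        # min-heap: the worst kept play sits at index 0
_tiebreak = itertools.count()          # prevents heapq from ever comparing episodes

def save_play(episode):
    """Save a finished play into both buffers."""
    recent_plays.append(episode)
    # Compare on (total reward, play length); higher is better for both.
    item = (episode.total_reward, len(episode.rewards), next(_tiebreak), episode)
    if len(best_plays) < MAX_SIZE:
        heapq.heappush(best_plays, item)
    else:
        heapq.heappushpop(best_plays, item)  # keeps the better of new vs current worst
```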
The classifier takes two inputs:
- The state, as reported by the game being played
- A “command”, comprised of:
- The reward we desire
- And the number of steps to achieve it
The classifier then suggests the action that would achieve that command, given the state.
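Concretely, for cartpole the inputs look something like this, reusing the `predict_action` sketch from earlier (values are illustrative):

```python
state = [0.01, -0.40, 0.03, 0.51]  # cart position, cart velocity, pole angle, pole angular velocity
action = predict_action(model, state, desired_reward=200.0, desired_horizon=150)
# For cartpole, action would be 0 (push left) or 1 (push right)
```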
Training algorithm
This step generates training data from previous episodes in the FIFO and the priority queue. Note that an episode is a start-to-end play of the game in question, so it has multiple states and multiple rewards accumulated over the play. (Sketches of the sampling and of the model follow the list below.)
- For all episodes in each of the buffers
- Sample a starting time between the start and end of the episode
- Sample an ending time. The ending time is:
- With 75% probability, set to the end of the episode
- With 25% probability, set to a random time between the start time and the end time
- Generate the command:
- The Δt is the end time minus the start time
- The reward is the sum of all rewards accumulated between the start time and the end time
- Gather the state input: the game state at the sampled start time
- Train the model based on these X and Y values. My model is as follows:
- A state embedding:
- A Linear layer, from state size (4) to a vector of length 64
- A PReLU layer
- A command embedding:
- A Linear layer, from command size (2) to a vector of length 64
- A PReLU layer
- Then, I take the elementwise product of state and command embeddings
- And feed it through a network with structure:
- Linear 64 → 64
- PReLU
- Linear 64 → num actions (2)
- Models were trained with Adam, for up to 100 epochs, with early stopping (patience 10).
- Loss was cross-entropy loss.
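A sketch of the sampling step, assuming an episode stores per-step observations, actions and rewards (attribute names mirror the Episode class described in the code tour):

```python
import random

def make_train_item(episode):
    """Turn one stored episode into a single training example."""
    last = len(episode.rewards) - 1
    start = random.randint(0, last)
    # 75% of the time the segment runs to the end of the episode,
    # 25% of the time to a random time between start and end.
    end = last if random.random() < 0.75 else random.randint(start, last)
    horizon = end - start                          # the command's Δt
    reward = sum(episode.rewards[start:end + 1])   # reward accumulated over the segment
    x_state = episode.observations[start]          # game state at the sampled start time
    x_command = (reward, horizon)
    y_action = episode.actions[start]              # the action actually taken there
    return x_state, x_command, y_action
```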
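In PyTorch, that architecture looks roughly like this. It’s a reconstruction from my description above; the real class in the repo is CartPoleTorchModel, which may differ in detail:

```python
from torch import nn

class BehaviourModel(nn.Module):
    """State and command are embedded separately, multiplied elementwise,
    then mapped to action logits."""

    def __init__(self, state_size=4, command_size=2, hidden=64, num_actions=2):
        super().__init__()
        self.state_embedding = nn.Sequential(nn.Linear(state_size, hidden), nn.PReLU())
        self.command_embedding = nn.Sequential(nn.Linear(command_size, hidden), nn.PReLU())
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.PReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, command):
        # Elementwise product of the two embeddings, then map to action logits.
        x = self.state_embedding(state) * self.command_embedding(command)
        return self.head(x)  # train with nn.CrossEntropyLoss and torch.optim.Adam
```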
Play algorithm
When playing, the algorithm is simply as follows (a sketch comes after the list):
- Sample a desired reward and horizon based on the top episodes in the priority queue:
- desired horizon is the average of the lengths of top episodes
- desired reward is between the average and the average plus one standard deviation
- While the game is not finished:
- Get an action from the model, by feeding in game state, desired horizon and desired reward
- Enact the action and receive reward
- Decrement the desired horizon by one, and the desired reward by the reward from the last step. Clip both at zero.
- Save each full episode into the FIFO and priority queue.
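A sketch of that loop, reusing `predict_action` from earlier and assuming the classic gym 4-tuple step API:

```python
import random
import statistics

def sample_command(top_episodes):
    """Desired reward and horizon, derived from the best plays in the priority queue."""
    desired_horizon = statistics.mean(len(ep.rewards) for ep in top_episodes)
    rewards = [ep.total_reward for ep in top_episodes]
    mean_r = statistics.mean(rewards)
    # "Between the average and the average plus one standard deviation":
    desired_reward = random.uniform(mean_r, mean_r + statistics.stdev(rewards))
    return desired_reward, desired_horizon

def play_episode(env, model, desired_reward, desired_horizon):
    """One play of the game, tracking the command as it is consumed."""
    state, done = env.reset(), False
    while not done:
        action = predict_action(model, state, desired_reward, desired_horizon)
        state, reward, done, _info = env.step(action)
        desired_reward = max(desired_reward - reward, 0)  # clip at zero
        desired_horizon = max(desired_horizon - 1, 0)
```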
Hyperparameters
For cartpole, my hyperparameters were as follows:
- Both the FIFO and the priority queue had max size 200
- Each run is 100 episodes
- I gather 500 random plays at the beginning, before the first training
- To calculate my desired reward and horizon, I use the top 100 plays in the queue
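The same settings gathered into one place (plain values; the key names are mine, not fields of the UDRLConfig):

```python
CARTPOLE_HYPERPARAMS = {
    "buffer_max_size": 200,        # both the FIFO and the priority queue
    "episodes_per_run": 100,
    "initial_random_plays": 500,   # gathered before the first training
    "top_plays_for_command": 100,  # used to derive desired reward and horizon
}
```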
Code tour
Trying new environments
To try a new environment, I suggest the following steps:
- Clone the code
- Get poetry and install it, then run `poetry install` in the codebase directory.
- Copy `cartpole.py` to a new file
- Modify your new file in two ways:
  - Return a different environment in the `cartpole` function. Consider renaming this function.
  - Modify the `CartPoleTorchModel` to match your chosen environment
- Run the code, and see how it all goes
- Tweak the hyperparameters to improve performance.
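As a hypothetical example, pointing the code at MountainCar instead might look like this (the names mirror my descriptions above; check `cartpole.py` for the real contents):

```python
import gym

def mountain_car():
    # Hypothetical replacement for the cartpole() factory; MountainCar-v0
    # has a 2-dimensional state and 3 discrete actions.
    return gym.make("MountainCar-v0")

# The model then needs matching sizes, e.g. with the architecture sketched earlier:
# model = BehaviourModel(state_size=2, command_size=2, num_actions=3)
```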
Implementation details in udrl.py
- The `ReplayBuffer` class tracks episodes
- The `TrainItem` class represents an item passed to the model to train on
- The `ModelInterface` is the interface a model must support for the `udrl` code to use it:
  - `train` takes a list of `TrainItem`s, trains the model, and returns nothing
  - `run` runs the model; this should return something that can be fed into the environment
  - `random_action` should sample a random action
- The `UDRLConfig` class configures the tool
- An `Episode` represents a full play of the game in question, including:
  - Actions
  - Observations
  - Rewards
  - The total reward for the episode
  - Ident: a random integer between 0 and 999, useful for doing train-test splits
- `_run_episode` will run a single episode
- `_run_multi_episodes` will run many episodes
- `_par_run_episodes` will run many episodes in parallel
- `_random_fill` will do step 1 of the overall algorithm, using the `random_action` function
- `_run_model` will do step 4 of the overall algorithm (running the model and tracking episodes)
  - If configured, this function will write out to tensorboard. Simply pass a `tensorboardX.SummaryWriter` to `UDRLConfig.writer`
  - If configured, this function will render the environment. Simply set `render` to true in the `UDRLConfig`
- `_train_model` will build a training dataset and pass it to the `train` function of the model
- `_save_model` will call the function `save_model`, if provided in the `UDRLConfig`
- `train_udrl` is the main loop. Simply provide:
  - A function that returns an OpenAI gym compatible environment. This needs to be a factory, rather than an instance, so that multiple instances can be spawned for parallel operation
  - A `UDRLConfig` instance
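Putting that together, a minimal invocation might look like the following sketch. `train_udrl` and `UDRLConfig` are the real names from above, but the import path and the exact constructor arguments are my assumptions:

```python
import gym
from tensorboardX import SummaryWriter

# Assumed import path, based on the module layout described above.
from upside_down_rl.udrl import UDRLConfig, train_udrl

def cartpole():
    # A factory, not an instance, so parallel workers can spawn their own envs.
    return gym.make("CartPole-v1")

config = UDRLConfig(
    writer=SummaryWriter(logdir="experiments/cartpole"),  # tensorboard output
    render=True,                                          # watch the window as it trains
)
train_udrl(cartpole, config)
```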
Discussion
I found this approach very easy to implement, which was quite satisfying. The main advantages I see are:
- Simplicity: this is a reasonably simple approach to reinforcement learning
- Directness: it’s one straightforward algorithm, rather than multiple interacting systems as in other approaches
- Commands: the idea of a command is quite interesting, and seems like it would provide a lot of ways to tune and interact with the system.
Downsides:
- Hyperparameters: there are quite a few hyperparameters, and they significantly affect the results.
- I wasn’t able to get the lunar lander gym experiment working. I’m going to look at the hyperparams noted in the paper and see how well they work.
Regardless, it was neat to set the whole thing up, and I plan on playing around more with the command encoding and looking for alternative algorithms that require intrinsically fewer hyperparameters.