To understand training times in reinforcement learning, it's helpful to consider both where training starts from and what the process involves. I spoke more with Ruofan and our Data Scientist, Ross Story, to find out what influences training times and to gather guidelines for keeping them as manageable as possible. Our AI engine will do the work of picking the best algorithm and hyperparameters; your careful selection or engineering of your training environment can provide a simulator that best represents the problem you want to solve. Ross taught me the fundamentals of reinforcement learning so that I would have a clear view of the factors involved in finding or creating a quality simulator.
As with so many machine learning topics, it's helpful to consider an analogy to human learning, which is something that we've all been exposed to. Just like baby humans, AI systems start with a blank slate: typically the AI engine has no foreknowledge of the environment it will be working in. All it knows is the state of the environment, the actions it has available to take, and some notion of reward. By iterating over the simulator or data set and observing the rewards and penalties associated with each combination of state and action, the engine is able to identify the correct combinations. In reinforcement learning, the engine is capable of finding successful "paths": sequences of actions that lead to a desired result. This is not much different from a young child, who first understands that they have a hand, and then that they can move it, followed by the reward of successfully swatting at an object, a process that typically takes a full three months after birth.
An iteration is one frame from a simulator. To clearly define an iteration to the AI engine, we need to specify a schema, an action space, and a reward function. In simulators like the OpenAI Gym series, these define the inbound information, the array of available options, and the "reinforcement" component of each iteration.
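To make those three pieces concrete, here is a minimal sketch of a hypothetical toy simulator (the `ToySim` class and its fields are invented for illustration, not part of any real library). Each call to `step` is one iteration: the engine supplies an action, and the simulator returns the new state and a reward.

```python
class ToySim:
    """A hypothetical, minimal simulator illustrating the three pieces
    of an iteration: a schema (state), an action space, and a reward."""

    def __init__(self):
        self.state = {"position": 0}    # schema: a single scalar field
        self.action_space = [-1, 0, 1]  # move left, stay put, move right

    def step(self, action):
        """One iteration: apply the action, return the new state and reward."""
        self.state["position"] += action
        # reward function: 1 when the goal position is reached, else 0
        reward = 1 if self.state["position"] == 3 else 0
        return self.state, reward

sim = ToySim()
total_reward = 0
for _ in range(3):                      # three iterations, always moving right
    state, reward = sim.step(1)
    total_reward += reward
# after three moves to the right, the car sits at position 3,
# and only the final iteration earned a reward
```

Real Gym environments follow the same shape, with a richer observation, a formally declared action space, and extra return values such as a "done" flag.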
The schema describes the types of information that are provided to the AI engine, which it can use to evaluate the simulator's current state. In some cases, such as arcade-style simulators that use image data as the input, convention indicates a schema based on the pixels in the source image. With other sims, the schema will be a collection of scalars or booleans that represent notions important to reasoning about the problem; an example would be x_position in MountainCar. The state space is the set of all possible combinations of states, so in MountainCar that is the entire range on the x axis that could be occupied by the car, or in the case of an image buffer, every possible combination of pixel values in the frame. For best results, your schema should not include any extraneous data; otherwise, your state space will be unnecessarily large and longer training times can result. When building your own simulator, it's best to keep two factors in mind: include only the state information needed to reason about the problem, and keep the resulting state space as small as you can.
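The cost of extraneous schema fields compounds multiplicatively, which a quick back-of-the-envelope calculation makes clear. This sketch uses hypothetical, discretized field counts (the names and values are invented for illustration) to show how two irrelevant fields can inflate a state space by four orders of magnitude:

```python
# Hypothetical discretized schemas for a MountainCar-like problem.
# Each entry maps a field name to the number of distinct values it can take.
lean_schema = {"x_position": 100, "x_velocity": 50}
bloated_schema = {**lean_schema, "car_color": 16, "frame_count": 1000}

def state_space_size(schema):
    """The state space size is the product of each field's possible values."""
    size = 1
    for values in schema.values():
        size *= values
    return size

print(state_space_size(lean_schema))     # 5,000 combinations
print(state_space_size(bloated_schema))  # 80,000,000 combinations
```

The car's color and a running frame counter tell the engine nothing useful about reaching the flag, yet they multiply the number of states it must distinguish, and with it the training time.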
The action space represents all of the different actions that the AI engine can perform. As much as possible, these actions should have a deterministic effect on the state. The combination of the state space and the action space is called the state transition space. While the state transition space isn't inherently important for building sims, it can become very large and unwieldy for a human to reason about, so it's worth keeping in mind. In an arcade-style sim, the action space generally relates to the joystick or button inputs that a human player would have available. When building your own simulator, try to make each action have a predictable effect on state. In the event that this is not possible, it's best to restrict the variability to only one action, and to normalize the distribution of possible outcomes. A good study of this difficulty is FrozenLake, which adds a high degree of variability to some actions chosen by the AI engine.
Reinforcement learning gets its name from the use of a carrot-and-stick mechanism to teach the right behaviors and policies to the AI engine. To do this, we provide a score to the engine after each iteration. Think of this as a way to score the correctness of the AI engine's result. This score is calculated by the reward function. A minimal interpretation is a boolean function that returns 1 for a correct outcome and 0 otherwise; a higher-resolution interpretation is a reward function that reflects how far the AI engine's decision is from the desired outcome.
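Both interpretations fit in a couple of lines. In this hypothetical sketch (the function names are invented for illustration), the boolean reward treats a near miss and a wild miss identically, while the higher-resolution reward, here simply the negative distance to the target, lets the engine see that one attempt was better than another:

```python
def boolean_reward(outcome, target):
    """Minimal reward: 1 for a correct outcome, 0 otherwise."""
    return 1 if outcome == target else 0

def shaped_reward(outcome, target):
    """Higher-resolution reward: closer outcomes score higher.
    Here, the negative absolute distance to the target."""
    return -abs(outcome - target)

# The shaped reward distinguishes near misses the boolean reward cannot:
print(boolean_reward(2, 3), boolean_reward(0, 3))  # both score 0
print(shaped_reward(2, 3), shaped_reward(0, 3))    # -1 beats -3
```

The extra resolution gives the engine a gradient to follow long before it ever produces an exactly correct outcome.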
The key to this core pillar of reinforcement learning is that the reward function is not the desired outcome for the action, but rather a signal that appraises the actual outcome of the action against the desired outcome. It's also important to remember that you can use a different reward function with the same sim to train different AI engines, which makes your finely-tuned simulator reusable.
A careful simulator author can give the AI engine an advantage by clearly defining the simulator environment and reward function. By using the same simulator with multiple reward functions, multiple concepts can be trained, resulting in a more robust system. At the same time, it's important to consider the schema, and how complicated the resulting action space and state transition space become.
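The one-sim, many-rewards idea can be sketched directly. In this hypothetical example (the toy dynamics and both reward functions are invented for illustration), the same state produced by the same simulator is scored by two different reward functions, each of which would train a different concept:

```python
# One hypothetical sim, two reward functions: the same cart state can train
# a "reach the flag" concept and a separate "keep speed down" concept.
def sim_step(state, action):
    """Toy dynamics: the action changes velocity, velocity changes position."""
    velocity = state["velocity"] + action
    return {"position": state["position"] + velocity, "velocity": velocity}

def reach_flag_reward(state):
    """Concept 1: reward reaching position 5 or beyond."""
    return 1 if state["position"] >= 5 else 0

def low_speed_reward(state):
    """Concept 2: penalize high speed, whatever the position."""
    return -abs(state["velocity"])

state = {"position": 0, "velocity": 0}
for _ in range(3):                         # accelerate for three iterations
    state = sim_step(state, 1)
print(reach_flag_reward(state))            # the goal concept is satisfied...
print(low_speed_reward(state))             # ...but the speed concept is not
```

The simulator code never changes; only the appraisal does, which is exactly what makes a well-built sim reusable across concepts.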
There are some great open source solutions to get started with simulator design. The SimPy package offers a really straightforward way to build your own. Take a look at the classic control environments in OpenAI Gym for other examples, or Gazebo for a more complex environment.
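Before reaching for those packages, it's worth seeing how little is actually required of a custom simulator. This is a minimal sketch in plain Python following the Gym-style reset/step convention (the `SimpleCorridor` environment and its corridor-walking task are invented for illustration):

```python
class SimpleCorridor:
    """A hypothetical custom simulator in the Gym-style reset/step shape:
    the agent walks a corridor of length 5 toward a goal on the right."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (observation, reward, done)."""
        self.position += 1 if action == 1 else -1
        self.position = max(0, self.position)   # a wall blocks the left end
        done = self.position >= self.length
        reward = 1 if done else 0               # terminal reward only
        return self.position, reward, done

env = SimpleCorridor()
obs = env.reset()
done = False
steps = 0
while not done:
    obs, reward, done = env.step(1)             # always move right
    steps += 1
# the goal is reached in 5 steps, with a single reward on the final one
```

The schema here is one integer, the action space has two entries, and the reward is boolean; everything discussed above scales up from exactly this shape.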
We're going to write a simple simulator to implement some of the concepts I've discussed in the past few weeks! Follow Bonsai at @BonsaiAI for more updates, or come visit our forums to join the community.