In this post, I’m going to cover tricks and best practices for how to write the most effective reward functions for reinforcement learning models. If you’re unfamiliar with deep reinforcement learning, you can learn more about it here before jumping into the post below.
Crafting reward functions for reinforcement learning models is not easy. It’s not easy for the same reason that crafting incentive plans for employees is not easy: we get what’s affectionately known as the cobra effect.
Historically, the government of colonial India tried to incentivize people to assist in ridding the area of cobras: if citizens brought in a venomous snake they had killed, the government would pay them. Naturally, people started breeding venomous snakes.
The same thing happens in the real world. People always game the system, and reinforcement learning agents are no different.
Keep this in mind when writing reward functions: You get what you incentivize, not what you intend.
The following examples highlight this well. The goal was to teach a robotic arm to grasp blocks and stack them on top of each other. (We ultimately succeeded; you can see how we did it here).
But we started by simply trying to teach the robotic arm to move the blocks. In an early attempt, the reward function was to move the block as far away as possible from the arm. The engineer writing this reward was thinking “The arm will pick it up, go to full extension, and set the block down, thus moving the block as far as possible.”
Except the arm is in a physics environment, and it learns how that world actually works. So the first thing it learns to do is smack the block really, really hard to get it really, really far away:
Then it learns that it can get the block even farther away by picking it up and throwing it. So it learns to chuck it and get it really, really far away:
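As a sketch of what went wrong (the function, arm geometry, and positions here are illustrative, not our actual code), a “distance from the arm” reward looks something like this:

```python
import math

# Hypothetical sketch of the flawed reward: pay the agent for how far the
# block ends up from the arm's base. All names and positions are made up.
def flawed_reward(block_pos, arm_base=(0.0, 0.0)):
    return math.dist(block_pos, arm_base)

# A careful pick-up, extend, set-down places the block about one arm-length away...
placed = flawed_reward((1.0, 0.0))
# ...but smacking or throwing the block scores far higher.
thrown = flawed_reward((5.0, 2.0))
```

Nothing in this reward says anything about *how* the block gets there, so throwing beats placing every time.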
As you’re crafting your reward functions, which is a major part of the task as you build out reinforcement learning models, be sure to understand what your reward function is actually doing. And be sure that it’s doing what you intended before you start an 8-hour training run.
Having tooling, even tooling you build yourself, that lets you check if your model is doing what you wanted before you start training will save you inordinate amounts of time. At Bonsai, we’ve crafted internal tools that let you drag a block through space and watch the reward function change as it gets dragged through space in a simulated environment so you can gauge if you’re on the right track or not.
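A minimal, homegrown version of that kind of check (the reward function here is a stand-in, not Bonsai’s tooling) just sweeps the block over a grid of positions and prints the reward surface, so you can eyeball whether it slopes toward the goal:

```python
import math

GOAL = (2.0, 2.0)  # illustrative goal position

def reward(block_pos):
    # toy shaped reward: higher (less negative) as the block nears the goal
    return -math.dist(block_pos, GOAL)

# Sweep the block through a small grid and print the reward at each cell.
for y in [0.0, 1.0, 2.0]:
    print(" ".join(f"{reward((x, y)):6.2f}" for x in [0.0, 1.0, 2.0]))
```

If the numbers don’t rise toward the goal cell, fix the reward before burning an eight-hour training run.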
Reward shaping is a big deal. If you have sparse rewards, you don’t get rewarded very often:
If your robotic arm only gets rewarded when it stacks the blocks successfully, then all the time it spends exploring far away, it never gets any feedback. Learning takes much longer this way.
You want to instead shape rewards that give gradual feedback and let the arm know when it’s getting closer. That helps it learn a lot faster:
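As a toy illustration of the difference (block stacking reduced to “get the block to a goal point”; positions and tolerances are illustrative):

```python
import math

GOAL = (0.0, 0.0)
TOLERANCE = 0.05

def sparse_reward(block_pos):
    # feedback only on success -- silent everywhere else
    return 1.0 if math.dist(block_pos, GOAL) < TOLERANCE else 0.0

def shaped_reward(block_pos):
    # a gradient everywhere: each step toward the goal scores a little better
    return -math.dist(block_pos, GOAL)
```

Far from the goal, `sparse_reward` returns 0 for every position, so the agent gets no signal about which direction is better; `shaped_reward` always distinguishes closer from farther.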
A key to making this work is being able to decompose the reward functions in reasonable ways.
When structuring the rewards themselves, how you structure them numerically makes a huge difference. If you use positive rewards, the system will want to accumulate as many of them as possible. This can lead to interesting behavior.
If, for instance, your simulation allows the robotic arm to stack a block, yank it away, stack again and continue getting rewards…then it will do that forever because its entire purpose is to get the highest reward possible.
Generally, positive rewards encourage:

- Accumulating as much reward as possible
- Keeping the episode going (and avoiding terminal states), since ending it cuts off the reward stream
Be careful with positive rewards. Make sure there isn’t a lot of reward available near the terminal state unless reaching it pays a massive step up from merely being close to it.
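Here’s the stack-and-yank exploit in miniature (an illustrative sketch, not our actual setup): a per-event positive reward with no episode termination can be farmed forever.

```python
# +1 every time a block is stacked, and the episode never ends on success.
def step_reward(stacked_this_step):
    return 1.0 if stacked_this_step else 0.0

# Policy A stacks once and stops; policy B stacks, yanks, and restacks 100 times.
honest = sum(step_reward(s) for s in [True])
farmer = sum(step_reward(s) for s in [True, False] * 100)
```

The farmer wins, so either end the episode on success or make the terminal itself a one-time step-function bonus.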
Negative rewards are different. Negative rewards incentivize you to get done as quickly as possible because you’re constantly losing points when you play this game. That’s an important distinction as you build these out.
Generally, negative rewards encourage:

- Reaching a terminal state as quickly as possible, to stop racking up penalties
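A sketch of that dynamic (with made-up numbers): under a constant per-step penalty, the return is strictly better for policies that finish sooner.

```python
def episode_return(num_steps, step_penalty=-1.0):
    # -1 every timestep: the total only stops shrinking when the episode ends
    return num_steps * step_penalty

quick = episode_return(20)     # finishes in 20 steps
dawdler = episode_return(200)  # wanders for 200 steps
```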
When you’re shaping your functions, you need to understand the area you’re playing in. Here you have a region of 2D space where the goal is to get the agent to the red square in the bottom-left corner, but hitting a blue square means you crash. You don’t want your reward function to look like this:
You want it to look like this:
You’re going to go slower, but you won’t crash as much.
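A sketch of the two surfaces (goal and obstacle positions are made up for illustration): the risky version only rewards progress toward the red square, while the safer one also penalizes getting close to a blue square.

```python
import math

GOAL = (0.0, 0.0)                     # the red square, bottom-left
OBSTACLES = [(1.0, 1.0), (2.0, 0.5)]  # the blue squares (illustrative)

def risky_reward(pos):
    # pure goal-seeking: ignores the blue squares entirely
    return -math.dist(pos, GOAL)

def safe_reward(pos):
    # same goal gradient, plus a penalty that grows near any obstacle
    penalty = sum(max(0.0, 1.0 - math.dist(pos, obs)) for obs in OBSTACLES)
    return -math.dist(pos, GOAL) - 5.0 * penalty
```

Right next to an obstacle, the safe reward is much worse than the risky one, steering the agent around the blue squares; away from them, the two surfaces agree.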
Shaping rewards makes a huge difference. Time spent here up front saves far more time than you’ll spend tinkering in simulation later.
Sometimes as you shape the reward functions, you need to think in terms of time as well as in terms of space.
This goes back to the importance of decomposing problems into smaller tasks. In the case of the robotic arm, we wanted it to pick up the block, then orient it, then stack it. Those are three distinct activities.
We know that it doesn’t make sense for the arm to waste time trying to grasp the block while it’s still too far away to reach it. So you can craft the reward functions and activation functions so that the agent doesn’t spend a lot of time in areas where it can’t reasonably hope to achieve its goal.
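A sketch of gating rewards in time (names, thresholds, and geometry are all illustrative): only pay for grasping once the arm is close enough for grasping to matter.

```python
import math

REACH = 0.3  # hypothetical distance within which a grasp can succeed

def reward(arm_pos, block_pos, grasp_closed):
    approach = -math.dist(arm_pos, block_pos)   # always active: get closer
    if math.dist(arm_pos, block_pos) < REACH:   # grasp phase "activates"
        grasp = 1.0 if grasp_closed else 0.0
    else:
        grasp = 0.0  # no credit (or blame) for grasping out of range
    return approach + grasp
```

Closing the gripper far from the block earns nothing extra, so the agent doesn’t waste time on it; once in range, grasping starts to pay.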
It’s fine to have reward functions that combine a few things when they’re relatively simple. As these problems get larger and more complex, which happens quickly with real-world systems, you want to start thinking about techniques like concept networks instead of constantly making your reward functions more and more complex.
Want to learn more about deep reinforcement learning, reward functions and concept networks? Check out the following resources: