By: Marcos Campos & Victor Shnayder
At Bonsai, we are building an AI platform that enables subject matter experts to teach an AI how to solve complex problems in optimization and control using deep reinforcement learning. Using deep reinforcement learning effectively typically requires a great deal of expertise in defining suitable reward functions for your task. This becomes even more challenging when the task requires coordination or sequential planning of different skills and operations.
A key feature of the Bonsai platform is the ability to decompose complex tasks using concept networks. Concepts are distinct aspects of a task that can be trained separately, and then combined using a selector concept. This approach drastically reduces the overall complexity, since the simpler problems can be trained with focused and easier-to-specify reward functions. The selector concept can be quickly learned using a simple reward function.
Today, we’ll tell you how we used this approach to solve a complex robotics task requiring dexterous manipulation: training a simulated robot arm to grasp an object and stack it on another one. DeepMind recently studied a similar task and obtained excellent results (Popov et al., 2017). We applied our decompositional approach, improving training efficiency and flexibility. Here is a video of the end result:
A recent paper by DeepMind (Popov et al., 2017) described a similar grasping and stacking task and made two main contributions. First, by carefully crafting reward functions, the authors taught an AI to correctly sequence the sub-tasks needed to solve the complete problem. Solving the problem with this approach required about 10 million interactions with the simulator. Second, they showed that if key subtasks were learned separately (each took on the order of 1 million interactions with the simulator), and traces from executing these subtasks were used to prime learning of the full task, the full task could be learned in about 1 million interactions with the simulator, a 10x speedup over the baseline that did not use subtasks.
Our approach has its precursors in the Options Framework of Sutton et al. (1999). More recently, T. D. Kulkarni et al. (2016) have shown how a similar approach using deep hierarchical reinforcement learning can be used to learn complex sequences. The main difference from our approach is that in their work the meta-controller is learned at the same time as the basic controllers (sub-tasks), and there are no constraints on when to use each basic controller.
The robotics task starts with a Kinova Jaco arm at a neutral position in a MuJoCo robotics simulator. The arm first moves to a work area to grasp a four-sided geometric prism. Once the prism has been grasped, the arm moves it to an adjacent work area and stacks it on top of a cube. The position and orientation of the prism and the cube can vary around the center point of their respective working areas.
We decompose the task into five subconcepts: reach the object, orient the hand for grasping it, grasp it, move to the stacking area, and stack the object on top of a block. We solve each separately, and learn a meta-controller (or selector) concept to combine them into a complete solution.
The hierarchical decomposition gives us several practical benefits:
The “reach for grasping” (reach) and “move for stacking” (move) concepts are simple motions for which we use a classical motion controller. The Bonsai platform allows us to integrate such controllers using Gears, an interoperability feature we announced in June of this year. The orient, grasp, and stack concepts are neural controllers trained with deep reinforcement learning, using the TRPO-GAE algorithm (Schulman et al., 2015).
Each concept is trained once its precursors in the concept graph have been trained. The system first trains orient, grasp, and stack independently. Once these concepts are trained, the system trains the overall grasp and stack concept.
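The ordering rule above (a concept trains only after its precursors) is a topological sort of the concept graph. Here is a minimal sketch in Python, not the platform's actual scheduler: the dictionary layout and the `training_order` helper are assumptions for illustration, and only the concept names come from this post.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical concept graph: each concept maps to the set of concepts it depends on.
# The concept names follow the post; the data layout is an illustrative assumption.
CONCEPT_GRAPH = {
    "reach": set(),    # classical controller, nothing to train
    "move": set(),     # classical controller, nothing to train
    "orient": set(),
    "grasp": set(),
    "stack": set(),
    "grasp_and_stack": {"reach", "orient", "grasp", "move", "stack"},
}


def training_order(graph):
    """Return the concepts in an order where each one follows all of its precursors."""
    return list(TopologicalSorter(graph).static_order())


order = training_order(CONCEPT_GRAPH)
print(order)  # the selector concept "grasp_and_stack" always comes last
```

Concepts with no dependencies on each other (here orient, grasp, and stack) can appear in any relative order, which is what allows them to be trained in parallel.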
As shown in Figure 3, the selector learns to choose the action recommended by the sub-concept most applicable in the current state. This is a discrete reinforcement learning problem, which we solve with DQN, using progress toward overall task success as the reward (any discrete RL approach could be used). To make this effective, we don’t choose a new sub-concept at each time step. Instead, the selector uses long-running concepts: each subconcept can have pre-conditions for when it can be selected, and a run-until condition that must be met before switching to another task. This gives the designer an easy way to specify constraints like “don’t try to grasp until you’re close to the object”, and “once you start to move, continue that for at least 100 time steps”.
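To make the idea of long-running concepts concrete, here is a small sketch of a selector that only re-selects once the active concept's run-until condition holds, and only among concepts whose pre-conditions are satisfied. The class names, the `distance` state field, the thresholds, and the `choose` stand-in (where a learned policy such as DQN would go) are all hypothetical and not part of the Bonsai platform's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, float]


@dataclass
class SubConcept:
    name: str
    precondition: Callable[[State], bool]    # may this concept be selected in this state?
    run_until: Callable[[State, int], bool]  # (state, steps run so far) -> may we switch away?


class LongRunningSelector:
    """Keeps the active concept running until its run-until condition is met."""

    def __init__(self, concepts: List[SubConcept],
                 choose: Callable[[State, List[str]], str]):
        self.concepts = concepts
        self.choose = choose  # stand-in for a learned policy over eligible concepts
        self.active: Optional[SubConcept] = None
        self.steps_active = 0

    def step(self, state: State) -> str:
        # Stay with the active concept while its run-until condition is not met.
        if self.active is not None and not self.active.run_until(state, self.steps_active):
            self.steps_active += 1
            return self.active.name
        # Otherwise pick a new concept among those whose pre-conditions hold.
        eligible = [c for c in self.concepts if c.precondition(state)]
        if not eligible:
            raise RuntimeError("no sub-concept is eligible in this state")
        name = self.choose(state, [c.name for c in eligible])
        self.active = next(c for c in eligible if c.name == name)
        self.steps_active = 1
        return name


# Illustrative constraints: grasp only when close; keep grasping for at least 2 steps.
reach = SubConcept("reach", lambda s: True, lambda s, n: s["distance"] < 0.05)
grasp = SubConcept("grasp", lambda s: s["distance"] < 0.05, lambda s, n: n >= 2)

selector = LongRunningSelector([reach, grasp], choose=lambda s, names: names[-1])
trace = [selector.step({"distance": d}) for d in (0.20, 0.10, 0.04, 0.04, 0.04)]
print(trace)  # ['reach', 'reach', 'grasp', 'grasp', 'grasp']
```

Because the policy is consulted only at switch points, the selector makes far fewer decisions than there are simulator steps, which is one plausible reason the selection problem is small enough to learn quickly.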
Inkling is Bonsai’s special-purpose programming language used to codify the concepts the system should learn, how to teach them, and the training sources required (e.g., simulations). Collectively, we refer to these techniques as Machine Teaching. The Bonsai Platform can integrate these taught concepts to learn new skills. Read more about Inkling in the Bonsai Docs.
Figure 4 shows the number of samples (environment transitions) required to learn each of the concepts. The grasp and stack (Selector) concept took only about 22K samples to converge, far fewer than any of the other concepts required. Because the other concepts can be trained in parallel or may already be pre-trained, the overall time for solving the full problem using a composition of concepts is significantly reduced. In the ideal case, with pre-trained sub-concepts, this gives a speedup of roughly 500x over DeepMind’s all-at-once solution, and a 45x speedup over their approach of using subtask traces to speed up training (Popov et al., 2017).
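The speedup figures follow directly from the sample counts quoted in this post. The short calculation below uses only those approximate numbers; the variable and function names are ours. It suggests that the all-at-once comparison works out to about 455x, which the post appears to round to 500x.

```python
# Approximate sample counts quoted in the post.
ALL_AT_ONCE_SAMPLES = 10_000_000   # DeepMind: full task learned directly
WITH_SUBTASK_TRACES = 1_000_000    # DeepMind: full task primed with subtask traces
SELECTOR_SAMPLES = 22_000          # our selector, assuming pre-trained sub-concepts


def speedup(baseline_samples, our_samples):
    """Ratio of baseline samples to our samples."""
    return baseline_samples / our_samples


vs_all_at_once = speedup(ALL_AT_ONCE_SAMPLES, SELECTOR_SAMPLES)
vs_traces = speedup(WITH_SUBTASK_TRACES, SELECTOR_SAMPLES)
print(round(vs_all_at_once))  # 455
print(round(vs_traces))       # 45
```

Both ratios count only the selector's samples, so they describe the ideal case in which the sub-concepts are already trained.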
All tasks (including the full task) achieved 100% success on 500 test executions. Parameters for the algorithms and detailed equations for the reward functions are provided in our research paper.
We implemented the reach and move concepts using classical inverse kinematics controllers. These did not require training.
Reach moved the arm from its initial position (always the same) to a staging area from which grasping starts. The staging area for grasping was defined as a fixed point centered above the grasping working area.
Move repositioned the arm from the end position of the grasp task to the staging area for stacking. The staging area for stacking was defined as a fixed point centered above the stacking working area.
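For readers unfamiliar with inverse kinematics, the toy example below shows what such a controller computes: joint angles that place the end of the arm at a fixed target point. It is a deliberately simplified sketch for a planar two-link arm, not the controller we used for the Jaco arm (which has more joints and works in 3D); the link lengths and the staging point are made-up values.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Joint angles (shoulder, elbow) placing the tip of a planar two-link arm at (x, y)."""
    cos_elbow = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow


def forward(shoulder, elbow, l1, l2):
    """Tip position reached for the given joint angles (used to check the IK result)."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


STAGING_POINT = (0.3, 0.4)  # hypothetical fixed point above a work area
angles = two_link_ik(*STAGING_POINT, l1=0.35, l2=0.35)
reached = forward(*angles, l1=0.35, l2=0.35)
print(angles, reached)
```

Because the staging areas are fixed points, a closed-form or iterative solver of this kind is enough for reach and move, and no learning is involved.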
The orient concept was trained using
Here is the training graph and a video of orient training:
The grasp concept (called lift in our paper) was trained using TRPO-GAE, with the endpoints of the orient concept task as starting arm configurations. We collected 100K sample starting points by executing the orient concept with different prism locations and orientations. The grasp concept converged after about 5 million samples using the following reward function:
Here is the training graph and a video of grasp training:
The stacking concept was trained with TRPO-GAE on about 2.5 million samples using the following reward function:
Here is the training graph and a video of stack training:
We used DQN (Mnih et al., 2013) to train the grasp and stack concept. Figure 1 shows a video of an example run of the full task. Figure 11 shows the training graph: the selector learns very quickly (6K training steps, corresponding to about 22K interactions with the simulator) to sequence the different concepts in order to solve the problem.
We used the following reward function:
The problem we chose to tackle is quite difficult. Even after splitting it into simpler subproblems using Concept Networks, there remain design decisions that require careful consideration. Read our arXiv paper to learn more about
If working on a platform to support flexible, usable reinforcement learning and AI is interesting, join us! If you’re interested in using our platform to solve your control or optimization tasks, check out our Getting Started page.
I. Popov et al. Data-efficient Deep Reinforcement Learning for Dexterous Manipulation, 2017. URL https://arxiv.org/pdf/1704.03073.pdf
T. D. Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, 2016. URL https://arxiv.org/pdf/1604.06057.pdf
J. Schulman et al. High-dimensional Continuous Control Using Generalized Advantage Estimation, 2015. arXiv:1506.02438v5 [cs.LG].
V. Mnih et al. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop, 2013. arXiv:1312.5602v1.
R. Sutton et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112, 1999: 181–211.