We've built a state-of-the-art, self-healing artificial intelligence cluster in the AWS public cloud using cutting-edge technology. We can't reveal our secret sauce (otherwise it wouldn't be secret anymore!), but read on to learn how our Docker expert and Senior Software Engineer Jett Jones and the Bonsai Seattle team built a scalable, resilient AI Engine using some of today's most exciting configuration management, monitoring, and networking technologies.
Provisioning resources in AWS is a tedious and error-prone task because of the enormous number of possible options. Several tools can make this process repeatable, but the notion of a “tainted” machine (one that Terraform marks to be destroyed and recreated on the next apply) drew us to HashiCorp's Terraform over similar solutions such as CloudFormation. We use it for machine-level provisioning: launching instances and creating network configurations.
The recently released Docker Swarm manages post-provisioning creation of services and some of the IPC functionality that allows the services to work together. It also ensures that services are launched on the correct machines and provides process management for starting, stopping, and failure recovery.
Once we've built up the stack of machines and Docker Swarm initiates container launches, we use Consul for service naming and discovery, and as a key-value store for configuration information and feature flags. Consul provides service naming through a DNS masquerade service on the conventional DNS port, and also offers an HTTP service for retrieval and storage of keys and values.
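To make the two Consul interfaces concrete, here is a minimal sketch of how a client might address them. It assumes Consul's default HTTP address (`127.0.0.1:8500`) and default `consul` DNS domain; the key `flags/new-ui` and the service name `redis` are hypothetical and not taken from our cluster. A KV read returns a JSON list in which each `Value` is base64-encoded, so the sketch decodes that payload rather than making a live request.

```python
import base64
import json

# Assumption: Consul's default local HTTP address.
CONSUL_HTTP = "http://127.0.0.1:8500"


def kv_url(key: str, base: str = CONSUL_HTTP) -> str:
    """Build the URL for a key in Consul's KV HTTP API (/v1/kv/<key>)."""
    return f"{base}/v1/kv/{key.lstrip('/')}"


def service_dns_name(service: str, domain: str = "consul") -> str:
    """Name that Consul's DNS interface answers for a registered service."""
    return f"{service}.service.{domain}"


def decode_kv_response(body: str) -> dict:
    """Decode a KV GET response: a JSON list whose Value fields are base64."""
    decoded = {}
    for entry in json.loads(body):
        raw = entry["Value"]
        # A key stored with no value comes back with a null Value.
        decoded[entry["Key"]] = (
            None if raw is None else base64.b64decode(raw).decode("utf-8")
        )
    return decoded


# Hypothetical feature flag, shaped like a real KV response body.
sample_body = '[{"Key": "flags/new-ui", "Value": "dHJ1ZQ=="}]'
flags = decode_kv_response(sample_body)
```

In a running cluster the body would come from an HTTP GET against `kv_url("flags/new-ui")`, and a service would be reached by resolving `service_dns_name("redis")` through Consul's DNS interface.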
Registrator from GliderLabs (sponsored by WeaveWorks) audits Docker engine state on our hosts and reports each service back to the Docker Swarm process manager for tracking and identification. This provides the quickest route to service discovery while maintaining the flexibility of on-demand port assignment.
With all of these microservices running around, we need to record their logs and monitor their status. The core of our logging and monitoring solution is the Elastic Stack, consisting of Elasticsearch, Logstash, and Kibana (you may know it by its superseded common name, the ELK Stack). It collects, sorts, collates, and displays the logs for our review. To monitor container status, we rely on SwarmKit's built-in monitoring and health checks to restart any failed services. Finally, our home-built Watchman service monitors container utilization so we can pause idle containers.
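Watchman itself is not public, so the following is only a hypothetical sketch of the kind of idle check such a service could run. It assumes utilization arrives as periodic CPU samples per container and treats a container as idle when its most recent readings all fall below a threshold. The `Sample` type, the 1% threshold, and the three-sample window are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    container: str      # container name or ID
    cpu_percent: float  # CPU utilization when the sample was taken


def idle_containers(
    samples: List[Sample],
    threshold: float = 1.0,  # assumption: below 1% CPU counts as idle
    min_samples: int = 3,    # assumption: require three quiet readings in a row
) -> List[str]:
    """Return containers whose last `min_samples` readings are all below `threshold`."""
    history: Dict[str, List[float]] = {}
    for s in samples:
        history.setdefault(s.container, []).append(s.cpu_percent)
    return sorted(
        name
        for name, readings in history.items()
        if len(readings) >= min_samples
        and all(r < threshold for r in readings[-min_samples:])
    )


# Samples are listed in the order they were taken.
readings = [
    Sample("a", 0.2), Sample("b", 40.0),
    Sample("a", 0.1), Sample("b", 0.5),
    Sample("a", 0.0), Sample("b", 0.3),
    Sample("c", 0.0),  # too few samples to judge yet
]
to_pause = idle_containers(readings)
```

A real service would then act on the result, for example by issuing `docker pause` for each listed container. Requiring several consecutive quiet readings avoids pausing a container during a brief lull.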
We have a surprisingly diverse set of data to manage in our cluster, so we rely upon several databases to keep things sorted. PostgreSQL provides a fast relational data store, so we keep user data and configurations in a pgSQL container. For our training graph data, we find that a document store such as MongoDB is sufficiently robust and performant.
To store training result data as it's generated, we turn to InfluxDB, the industry leader in time-series data management. Finally, Redis rounds out our data-management stack, providing high-speed data operations where they're needed most.
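InfluxDB accepts points in its text-based line protocol: a measurement, optional tags, one or more fields, and a timestamp. As a sketch of what one training result might look like on the wire, the helper below formats a single point. The measurement `training`, the tag `brain`, and the fields `reward` and `episode` are hypothetical names, not our actual schema.

```python
def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point in InfluxDB line protocol (nanosecond timestamp)."""
    if not fields:
        raise ValueError("a point needs at least one field")

    def esc(text: str, chars: str) -> str:
        # Backslash-escape the characters that are special in this position.
        for ch in chars:
            text = text.replace(ch, "\\" + ch)
        return text

    def fmt_field(value) -> str:
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"       # integers carry an 'i' suffix
        if isinstance(value, float):
            return repr(value)
        quoted = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{quoted}"'         # strings are double-quoted

    # Tags are sorted by key; fields keep the order they were given in.
    tag_part = "".join(
        f",{esc(k, ', =')}={esc(str(v), ', =')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{esc(k, ', =')}={fmt_field(v)}" for k, v in fields.items())
    return f"{esc(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


line = to_line(
    "training",
    {"brain": "demo"},
    {"reward": 0.75, "episode": 12},
    1500000000000000000,
)
# line == "training,brain=demo reward=0.75,episode=12i 1500000000000000000"
```

Tags are indexed and suit values you filter on, such as which BRAIN a point belongs to, while fields hold the measured values themselves.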
At the View level, a Flask server is wired to Gunicorn, which acts as a WSGI gateway and proxy and provides connection stability and recycling, among other durability features. Together they serve the web pages that let end users manage their accounts and BRAINs and monitor training sessions. Gunicorn, in turn, is plugged into the Elastic Load Balancer, which forms a network bridge from the public cloud to our cluster.
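The contract between Gunicorn and the application is WSGI: Gunicorn calls a Python callable with the request environment and a `start_response` function. A Flask application object implements that same interface. To stay dependency-free, the sketch below swaps Flask for a bare WSGI callable and invokes it in-process; the `/health` route is a hypothetical example, not one of our endpoints.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Bare WSGI callable; a Flask application object exposes the same interface."""
    if environ.get("PATH_INFO") == "/health":  # hypothetical route
        status, body = "200 OK", b"ok\n"
    else:
        status, body = "404 Not Found", b"not found\n"
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def call(path: str):
    """Invoke the app in-process, roughly as a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


health = call("/health")
```

If this were saved as a module named `web.py` (an assumed name), Gunicorn could serve it with `gunicorn --workers 4 --max-requests 1000 web:app`. The `--max-requests` option restarts each worker after that many requests, which is one form of the recycling mentioned above.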
The “secret sauce” is our training engine, which is a group of containers that manage the long-running socket connections to simulators and provide the BRAIN itself. In addition to driving the simulator state and action cycle, the training engine maintains optional viewports into neural network state and feeds data into the BRAIN details page for the Accuracy Graph. A description of the components of the AI Engine can be found at Under the Hood in the Bonsai documentation.