This is the second in a series of posts discussing issues when using Amazon Web Services at scale. The first was When EC2 Hardware Changes Underneath You….
At PiCloud, we’ve accumulated over 100,000 instance requests on Amazon EC2. While we know of no IaaS provider superior to Amazon, it isn’t perfect. In this post, I’ll be discussing how we’ve built our scaling systems around Amazon EC2, despite frequent data inconsistencies from its API.
Background: The Scaler
As users create jobs, we add them to our job queue until there is a free worker available to do the processing. We are constantly estimating the size of this job queue to scale the number of “worker instances” we have available to perform our customers’ computation. Due to fluctuations in our job queue throughout the day, our scaling system regularly requests and terminates EC2 instances.
Our automated scaling system, or “scaler” as we call it, runs the following algorithm several times a minute:
- Obtain queue size from scheduling system. Infer number of instances (servers) needed.
- Obtain instance state information from EC2 with the DescribeInstances API call.
- Compare the number of instances needed to the number of instances EC2 indicates are running, pending, etc.
- RunInstances if more are needed.
- TerminateInstances that are excessive.
(This is a simplification that doesn’t include our use of the EC2 spot market, our inability to terminate servers running a customer’s jobs, and our optimization to only terminate servers near the end of their chargeable hour. For more information, see our Grand Prize-winning Spotathon Application).
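One iteration of the loop above can be sketched as follows. This is an illustrative sketch, not PiCloud’s actual code: the `scale_once` and `queue_size_to_instances` names, the `CORES_PER_INSTANCE` constant, and the tiny in-memory `FakeEc2` stand-in for the EC2 API are all assumptions made for the example.

```python
CORES_PER_INSTANCE = 8  # assumed worker capacity; not PiCloud's real figure

def queue_size_to_instances(queue_size):
    """Infer how many worker instances the current job queue requires."""
    return -(-queue_size // CORES_PER_INSTANCE)  # ceiling division

def scale_once(ec2, queue_size):
    """One iteration of the scaler: compare need against EC2's view, then act."""
    needed = queue_size_to_instances(queue_size)
    # Instances EC2 reports as alive (running or still pending).
    alive = [i for i in ec2.describe_instances()
             if i["state"] in ("pending", "running")]
    if needed > len(alive):
        ec2.run_instances(count=needed - len(alive))         # RunInstances
    elif needed < len(alive):
        surplus = alive[:len(alive) - needed]
        ec2.terminate_instances([i["id"] for i in surplus])  # TerminateInstances

class FakeEc2:
    """Tiny in-memory stand-in for the EC2 API, for illustration only."""
    def __init__(self):
        self._instances, self._next_id = {}, 0
    def describe_instances(self):
        return list(self._instances.values())
    def run_instances(self, count):
        for _ in range(count):
            iid = "i-%04d" % self._next_id
            self._next_id += 1
            self._instances[iid] = {"id": iid, "state": "pending"}
    def terminate_instances(self, ids):
        for iid in ids:
            self._instances[iid]["state"] = "terminated"
```

Note that this version trusts `describe_instances` completely, which is exactly the assumption the rest of this post picks apart.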
The benefit of the above algorithm is that it allows the scaler to maintain minimal internal state, making it simpler, easier to test, and more robust. Aside from the queue size calculated by our scheduler, the EC2 API essentially tracks our system state.
Relying on EC2 as the Single Version of the Truth for our system state, however, caused us many issues, which we’ll now cover in detail.
DescribeInstances is Eventually Consistent

When the scaler was first launched and we had far fewer servers, all was well. However, over time we noticed two odd behaviors:
- Sometimes, far more servers than needed were being deployed.
- Rarely, but catastrophically, the scaler would terminate every worker server, only to immediately spawn new ones afterward!
Sifting through debugging logs brought the problem to light. After requesting a server, subsequent DescribeInstances responses would not necessarily include the newly pending server. In database terms, the EC2 API is only eventually consistent. The stateless scaler, clueless that it had just requested a server, would keep deploying instances until they finally showed up in the DescribeInstances response.
Worse, the instances that did appear in the DescribeInstances response were not necessarily up to date. At times, after an instance had been terminated, it would still appear as running. The stateless scaler, clueless that it had just terminated a server, would then terminate a different server—and so forth—until EC2 finally concurred that they were terminated.
In the end, we had to introduce some state (a list of instances requested/terminating) to supplement the response from DescribeInstances. While the EC2 API does not provide any upper bound on its eventual consistency (let alone document that the API is eventually consistent), we’ve found that this “override list” only needs to exist for a few minutes.
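Such an override list might be sketched as follows. The class name, the `reconcile` interface, and the 180-second TTL are illustrative assumptions; only the idea, overlaying our recent requests and terminations on top of a possibly stale DescribeInstances response, reflects what we actually do.

```python
import time

OVERRIDE_TTL = 180  # seconds; "a few minutes" proved sufficient in practice

class OverrideList:
    """Overlay recent RunInstances/TerminateInstances actions on top of a
    possibly stale DescribeInstances response (illustrative sketch)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._requested = {}   # instance id -> time we called RunInstances
        self._terminated = {}  # instance id -> time we called TerminateInstances

    def note_requested(self, instance_id):
        self._requested[instance_id] = self._clock()

    def note_terminated(self, instance_id):
        self._terminated[instance_id] = self._clock()

    def reconcile(self, described):
        """Return EC2's instance list, corrected by our recent actions."""
        self._expire()
        alive = {i["id"]: i for i in described
                 if i["state"] in ("pending", "running")}
        for iid in self._terminated:   # we already terminated these, even if
            alive.pop(iid, None)       # EC2 still shows them as running
        for iid in self._requested:    # we already requested these, even if
            alive.setdefault(iid, {"id": iid, "state": "pending"})
        return list(alive.values())

    def _expire(self):
        """Drop overrides older than OVERRIDE_TTL; EC2 has caught up by then."""
        cutoff = self._clock() - OVERRIDE_TTL
        for entries in (self._requested, self._terminated):
            for iid in [i for i, t in entries.items() if t < cutoff]:
                del entries[iid]
```

The scaler then acts on `reconcile(...)`’s output rather than on the raw DescribeInstances response, which prevents both duplicate launches and cascading terminations.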
CreateTags is Eventually Consistent

In keeping with the philosophy of a stateless scaler, all meta information about a given instance is stored as EC2 tags. Our tags indicate the instance’s environment (“test”, “production”, etc.), its role (“worker”, “webserver”, etc.), and so on. Rather than keeping such information in our own database, we let EC2 handle the details.
As we want tags to be set atomically with instance creation, any instance is created with the following API calls:
- RunInstances – create the instance(s)
- CreateTags – tag the just-created, still-pending instance(s)
Things worked for a while. But at some point, we noticed the scaler was crashing with the error:
InvalidInstanceID.NotFound: The instance IDs ... do not exist
And yet the purportedly “not found” instances were clearly showing up in our DescribeInstances.
Given what we’d learned from DescribeInstances, it was no surprise that CreateTags also exhibits eventual consistency. Our solution has been to back off exponentially, giving up after some timeout, whenever the CreateTags request fails with InvalidInstanceID. Again, it may take over a minute after RunInstances for CreateTags to work.
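The back-off itself is simple. In this hedged sketch, `create_tags` stands in for whatever CreateTags call you use, and the `InvalidInstanceID` exception class is a stand-in for the real InvalidInstanceID.NotFound error; the timeout and base delay values are illustrative.

```python
import time

class InvalidInstanceID(Exception):
    """Stand-in for EC2's InvalidInstanceID.NotFound error."""

def tag_with_backoff(create_tags, instance_ids, tags,
                     timeout=120.0, base_delay=1.0, sleep=time.sleep):
    """Retry create_tags with exponential back-off while the new instance
    IDs are not yet visible; give up once `timeout` seconds would accrue."""
    delay, waited = base_delay, 0.0
    while True:
        try:
            create_tags(instance_ids, tags)
            return True
        except InvalidInstanceID:
            if waited + delay > timeout:
                return False  # EC2 never caught up; give up
            sleep(delay)
            waited += delay
            delay *= 2        # exponential back-off
```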
Meta-data is Eventually Consistent

Our deployment scripts rely on EC2 meta-data to learn about the instance’s attributes, which in turn affect application configuration.
One such application we install is Linux’s Logical Volume Manager (LVM). Instances with large amounts of ephemeral storage often have multiple volumes attached to them. LVM allows us to abstract the multiple volumes into a single one.
The LVM installer needs to know the block device mapping (e.g. /dev/sdb) of the ephemeral storage. Such information is only available in the instance meta-data. Unfortunately, we’ve since discovered that requests for block-device-mapping sometimes return an empty string. So once again, we need to back-off and try again. Complicating matters is that even once a given request returns valid data, subsequent requests may again return no data!
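A sketch of this work-around follows. Here `fetch` stands in for an HTTP GET of the block-device-mapping meta-data path; the exact retry policy, requiring several consecutive identical non-empty reads before trusting the value, is an illustrative way to cope with the flapping described above, since a single good read is not enough.

```python
import time

def read_metadata_stably(fetch, required_hits=3, max_attempts=30,
                         delay=1.0, sleep=time.sleep):
    """Poll a meta-data value until it is non-empty and identical across
    `required_hits` consecutive reads, since one good read can still be
    followed by an empty response."""
    value, hits = None, 0
    for _ in range(max_attempts):
        result = fetch()
        if result:                       # non-empty response
            if result == value:
                hits += 1
            else:
                value, hits = result, 1  # new value: restart the count
            if hits >= required_hits:
                return value
        else:
            value, hits = None, 0        # empty response resets our confidence
        sleep(delay)
    raise RuntimeError("meta-data never stabilized")
```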
Instances Cannot Always be Launched
Realtime Cores are our way of letting you dictate to our scaler the exact number of cores you need. The scaler allocates Realtime Cores by issuing an all-or-nothing RunInstances request (e.g. MinCount == MaxCount == 10). While Amazon’s documentation warns that sometimes instances can’t be launched, for months everything worked. Like the C programmer who doesn’t check that the pointer returned by malloc is not NULL, we stopped worrying about what was actually returned.
Sure enough, one day, requests for dozens of Cluster Compute Eight Extra Large Instances (cc2.8xlarge) started failing due to “Insufficient capacity”. We weren’t even requesting instances from a specific Availability Zone (AZ); there wasn’t enough capacity anywhere!
What we thought would never happen turned out to be real, and we had to update our Realtime interface appropriately.
And there was another subtle lesson, too. While we are indifferent to the AZ any worker instance is launched in, RunInstances will never return instances from a heterogeneous mix of AZs; Amazon always places the instances of a batch request in the same AZ. Consequently, we now set MinCount to 1 and keep issuing requests until the correct number of instances is launched.
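The revised strategy can be sketched like so. The `run_instances` callable is hypothetical: it returns the instance IDs actually launched, which may be fewer than `max_count` (or none at all when capacity is short), and `max_requests` is an illustrative safety bound, not a number from our real system.

```python
def allocate(run_instances, needed, max_requests=10):
    """Request `needed` instances via repeated best-effort calls instead of
    one all-or-nothing request, so partial capacity still gets used."""
    launched = []
    for _ in range(max_requests):
        remaining = needed - len(launched)
        if remaining <= 0:
            break
        # MinCount=1: accept however many instances EC2 can grant this round.
        launched.extend(run_instances(min_count=1, max_count=remaining))
    return launched
```

Because each call is independent, different rounds may land in different Availability Zones, which is fine for workers we are AZ-indifferent about.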
This article only touches on some of the EC2 difficulties we’ve encountered. What we initially thought would be a simple, clean scaling management system ended up full of hacks to handle inconsistencies in the Amazon API. As we’ve discovered over the past four years of building PiCloud, creating a robust, large-scale system, even on EC2 where so much infrastructure management is already handled, is far more challenging than it initially appears.
Categories: Battle Stories