PiCloud Wins Grand Prize in Amazon EC2 Spotathon

December 5th, 2012 by Ken Elkabany

Update: AWS has a blog post covering the results.

We’re excited to announce that Amazon Web Services has chosen PiCloud as the first-ever Grand Prize winner of the EC2 Spotathon!

As indicated on the contest page, the judging criteria were as follows: cost savings by using spots; performance benefits due to spots; computational scale achieved by application; and overall elegance and efficacy.

To give the broader community an idea of how PiCloud fared on each of the above criteria, we’ve decided to release our Spotathon application. We hope our readers will be able to gain insight into using spot instances effectively for their own applications.

 

Spotathon Application

15. What is your Spot application, what problem does it solve, and why is it important? For example: If you are representing a company or organization, what does your company do and how does your Spot application fit in?

PiCloud offers a Platform-as-a-Service (PaaS) for high-performance computing, batch processing, and scientific computing applications. We differentiate ourselves from the Amazon Web Services offerings by providing high-level APIs that scientists and engineers with minimal system administration experience can leverage. Our platform has been used in a wide range of industrial and academic applications that take advantage of computational sciences: pharmaceutical (sequence alignment, protein folding), oil & gas (geophysics), finance (risk analysis), quantum computing, machine learning, image/video processing, and many more.

Our popularity with scientists and engineers stems from our ease of use. Most notably, our users do not provision, administer, or tear down servers. Instead, a user submits jobs to us. A job is a unit of computational work, like finding proteins of interest in a genome. It is our responsibility to take these jobs and distribute them across our cluster of machines made available by Amazon.

Because most workloads we receive are batch, we have the flexibility to trade off the number of EC2 servers we rent against the time it takes for a workload to complete. For a set of jobs, we can bring up more servers to increase parallelization, and hence shorten the time it takes for them to complete, at the risk of incurring the heavy cost of idling servers when that batch is completed. Spot Instances play a vital role in addressing this dilemma by enabling us to increase parallelization with lower risk.

To understand why, you’ll need to understand how we determine the number of EC2 servers to rent at any given point in time. We never know in advance how long each job in our system will take to complete. However, statistical analysis on previous jobs of the same type lets us estimate to a degree of confidence how long our current queue of jobs is. With Amazon charging hourly, we aim to group jobs such that each group takes one hour at full instance utilization. We rent as many servers as we have groups.
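
The grouping rule above can be sketched in a few lines. This is only an illustration, not PiCloud's actual estimator: the function name `servers_to_rent`, the per-job runtime estimates, and the 8-core server size are assumptions made for the example.

```python
import math

SECONDS_PER_HOUR = 3600

def servers_to_rent(estimated_runtimes_s, cores_per_server=8):
    """Return how many servers keep each one busy for about one billed hour.

    estimated_runtimes_s: predicted runtime of each queued job, in seconds
    (assumed to come from statistics on earlier jobs of the same type).
    """
    total_core_seconds = sum(estimated_runtimes_s)
    core_seconds_per_server_hour = cores_per_server * SECONDS_PER_HOUR
    # Round up: a partially filled group still needs a server.
    return math.ceil(total_core_seconds / core_seconds_per_server_hour)

# 10,000 jobs estimated at 5 minutes each, on 8-core servers
print(servers_to_rent([300] * 10_000))  # -> 105
```

Rounding up is where the risk described next comes from: the last server, and any server whose jobs finish early, is paid for but partly idle.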

If we underestimate the length of our queue of jobs, our users’ experience suffers. If we overestimate, our servers sit idle, driving up our costs. Spot instances reduce the risk of poor estimation, allowing us to scale up our cluster and finish scientific workloads faster. We estimate that we’ve been able to bring up roughly 50% more servers at the same cost, improving user experience by delivering results 33% faster (with 1.5 times the servers, an evenly parallelized workload takes about 1/1.5, or 67%, of the time). For the thousands of researchers on our platform, who have collectively processed over 100 million jobs, the benefits of spot instances have been immeasurable.

 

16. How have you incorporated Amazon EC2 Spot Instances into your application? Please describe your application architecture, including: how you evaluate the Spot market, how you bid on and manage your Spot Instances, how you handle Spot interruptions, how you integrate them with On Demand or other computing resources (if any), and any third party architecture or software you use.

Many of the instance types we deploy are frequently available at prices as low as one-tenth that of on-demand instances. Thus, leveraging the typical price advantage of spot instances allows PiCloud to simultaneously:

  1. Accept lower server utilization rates, meaning we launch more worker instances to process customer workloads faster.
  2. Provide even more competitive pricing to our users.
  3. Have higher profit margins.

However, the use of spot instances comes with its own set of challenges: price volatility, termination without notification, and slower server provisioning.

To handle these issues, we have designed a sophisticated scaling system that continuously monitors and analyzes the price of spot instances across different availability zones. Merging this analysis and our prediction of job queue size, we are able to predict in real-time the distribution of on-demand and spot instances that optimizes customer experience (minimizing time to complete workloads) and cost to PiCloud (see example in #18 for this in practice). Thus, we are constantly expanding and contracting our pool of spot and on-demand worker instances.

The biggest drawback of spot instances is that they are susceptible to being terminated by AWS at any time due to fluctuating supply and demand. While the termination of an instance that is running a user’s computation is undesirable, we are capable of handling that event. If our infrastructure detects that an instance has terminated, the terminated instance’s workload is restarted on an active instance; users are not charged for the work that was “lost”.
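
As a rough illustration of this restart behavior, the sketch below simulates a queue in which any attempt that loses its instance is requeued, and only the successful attempt is billed. The function name, the `(job, attempt)` bookkeeping, and the billing dictionary are all hypothetical; our real scheduler is considerably more involved.

```python
from collections import deque

def run_with_restarts(jobs, terminated_attempts):
    """Simulate requeueing work lost to a spot termination.

    jobs: iterable of job ids, in submission order.
    terminated_attempts: set of (job_id, attempt_number) pairs whose
        instance is terminated before the job finishes.
    Returns (completion_order, billed_attempts_per_job).
    """
    queue = deque((job, 1) for job in jobs)
    completed, billed = [], {}
    while queue:
        job, attempt = queue.popleft()
        if (job, attempt) in terminated_attempts:
            # Lost work is restarted on another instance and not charged.
            queue.append((job, attempt + 1))
            continue
        completed.append(job)
        billed[job] = 1  # only the attempt that finished is billed
    return completed, billed

order, billed = run_with_restarts(["a", "b", "c"], {("b", 1)})
print(order)   # -> ['a', 'c', 'b']
print(billed)  # -> {'a': 1, 'c': 1, 'b': 1}
```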

Because these restarts are undesirable, we take several actions to mitigate them:

  • Avoiding volatile spots: Our automated scaling system uses the most recent prices, together with historical prices as a gauge of volatility, to predict the expected cost of a spot over the next hour. We use spots only if this cost is significantly below that of on-demand instances; otherwise, we use only on-demand instances.
  • Overbidding: As there is a cost to us and the user if we have to restart, we are willing to pay a bit more than even the on-demand cost to minimize the chance of spot termination. Our scaling system is responsible for safely terminating (i.e. waiting for jobs to complete) expensive workers.
  • Multi AZ: Our worker instances are spread across Availability Zones, minimizing the shock of a spot instance price spike.
  • Placement: We ensure that only jobs with short predicted runtimes are placed on spots. Longer-running jobs are placed on less volatile on-demand instances. In practice, most jobs have runtimes of less than a few minutes, because users typically break down long-running serial computations into smaller jobs to exploit maximum parallelism.
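
To make the first and last of these mitigations concrete, here is a minimal sketch of a per-job market decision. Everything in it is an assumption made for illustration: the cost model (recent average price plus a volatility penalty), the 50% required discount, and the 10-minute runtime cutoff are not our production values.

```python
from statistics import mean, pstdev

def expected_spot_cost(recent_prices, historical_prices, risk_weight=1.0):
    """Hypothetical estimate of next-hour spot cost.

    Uses the recent average price plus a penalty proportional to the
    historical standard deviation, so volatile markets look expensive.
    """
    return mean(recent_prices) + risk_weight * pstdev(historical_prices)

def choose_market(recent_prices, historical_prices, on_demand_price,
                  predicted_runtime_s, max_spot_runtime_s=600,
                  required_discount=0.5):
    """Return "spot" or "on-demand" for a single job."""
    if predicted_runtime_s > max_spot_runtime_s:
        return "on-demand"  # long jobs stay off interruptible instances
    cost = expected_spot_cost(recent_prices, historical_prices)
    # Use spot only when it is well below the on-demand price.
    if cost <= on_demand_price * (1 - required_discount):
        return "spot"
    return "on-demand"

# A calm market and a 2-minute job
print(choose_market([0.07, 0.07], [0.07, 0.07, 0.07, 0.07], 0.66, 120))  # -> spot
# Same recent price, but a history of spikes
print(choose_market([0.07, 0.07], [0.07, 1.5, 0.07, 2.0], 0.66, 120))    # -> on-demand
```

Overbidding and multi-AZ placement are left out here; they would act on the bid price and on which market's price history is passed in.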

Finally, some users use our “Realtime Cores” service to run at higher levels of parallelism than our estimator would provide. In exchange for paying an hourly rate per core, they are given their own “job queue.” For instance, if a user purchases 200 realtime cores, we guarantee that 200 jobs will be processed in parallel. Many users only request this service for several hours a day. Unfortunately, spot instances deploy much more slowly than on-demand instances. This additional boot time prevents us from satisfying a real-time request with newly requested spot instances. Fortunately, many users’ real-time requests are issued periodically, making prediction possible. Spots are often cheap enough that it makes economic sense to launch them ahead of a request we believe will occur in 15 minutes. If we are right, our costs may be reduced by 90%; even if the prediction were wrong 50% of the time, we’d still end up with lower average costs.
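
The economics of acting on a prediction can be checked with a small calculation. The model below is a simplification we are assuming for illustration: every speculative launch costs one spot instance-hour whether or not the request arrives, so the cost per request actually served is the spot price divided by the prediction accuracy.

```python
def speculative_cost_per_request(on_demand_price, spot_discount, accuracy):
    """Average spot spend per real-time request that was correctly predicted.

    on_demand_price: hourly on-demand price of the instance.
    spot_discount: fraction saved by using spot (0.9 means 90% cheaper).
    accuracy: fraction of speculative launches followed by a real request.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    spot_price = on_demand_price * (1 - spot_discount)
    # Each served request also pays for the wasted launches around it.
    return spot_price / accuracy

on_demand = 0.66
speculative = speculative_cost_per_request(on_demand, 0.9, 0.5)
print(speculative < on_demand)  # -> True
```

Under this model, with a 90% discount the speculative approach only stops paying off once accuracy falls to about 10%.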

 

17. What cost savings do you achieve by using Spot Instances in your application? For example: How many instance-hours does your application use, how many are on Spot, and what is the total cost of running your Spot application? What would the total cost be if you were not using Spot Instances? What percent savings do you achieve?

Our platform typically consumes 100,000 instance hours per month, with over 85% on spot instances, resulting in savings of tens of thousands of dollars per month.

The flexibility of our platform allows us to recoup nearly all of the price difference between on-demand and spot instances. For instance, c1.xlarge spot instances are typically 85% cheaper than on-demand, meaning our steady-state costs are reduced by 85%.

Because spot prices are not constant, however, we cannot capture all of the price differential. One loss is switching costs: when moving from an expensive spot to an on-demand instance, or from an on-demand instance to a cheap spot, we suffer lower effective utilization during the switch. Additionally, if an instance is “spot-terminated” while running computation, we must rerun the computation, potentially doubling our costs for that job. In practice, both of these issues are minor and our savings still hover near 65%.

There is a trade-off though between performance and cost savings. We intentionally do not capture some potential savings to increase customer performance. This, along with an example of the savings and performance advantage, is discussed in Q#18.

 

18. What performance benefit(s) does your Spot application achieve by using Spot Instances? Please describe. For example: Are you able to achieve shorter time to results because you can deploy more EC2 instances? If you’re running a simulation, does Spot enable you to execute more computational runs to improve the accuracy of your solution?

As mentioned in #16 and #17, spots let us accept lower utilization over an hourly interval to complete customer workloads faster. A practical example helps explain better:

Definition of core type: Each core we rent out is from a larger instance we’re renting from EC2. Different instances map to different “core types”. As an example, a “c2 core” represents 1 core of a c1.xlarge instance. Each c1.xlarge instance holds 8 (c2) cores.

If we have a user submit 10,000 5 minute c2 core jobs, the entire workload could theoretically be completed in 5 minutes. As we charge by job runtime ($0.13/c2-core-hour), our revenue would be:

10,000 c2 jobs * 5 (minutes/job) * (1 hour / 60 minutes) * ($0.13 / c2-core-hour) = $108

If we launched enough instances to finish the workload over 60 minutes on on-demand instances, our costs would be:

10,000 c2 jobs * 5 (minutes/job) * (1 hour / 60 minutes) * (1 c1.xlarge instance / 8 c2 jobs) * ($0.66 / c1.xlarge-hour) = $69

Under an 85% spot discount, our costs would be merely (using the unrounded on-demand figure of $68.75):

$68.75 * (1 - 0.85) ≈ $10.31

However, sometimes we prefer to increase our costs to give our users higher performance. As an example, we could complete this workload in 10 minutes (+ extra spot provisioning time) by running 5,000 jobs simultaneously. This requires:

5,000 c2 jobs * (1 c1.xlarge instance / 8 c2 jobs) = 625 c1.xlarge instances

At spot rates, this is still pretty cheap: $62. However, this level of performance would be impossible to realize with on-demand instances: It would cost $412, far more than our revenue.
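
The arithmetic in this example can be reproduced with the short script below. It uses only the figures quoted above ($0.13 per c2-core-hour, $0.66 per c1.xlarge-hour, 8 cores per instance, an 85% spot discount) and assumes, as stated earlier, that every launched instance is billed for a full hour. The variable names are ours.

```python
C2_PRICE_PER_CORE_HOUR = 0.13        # what the user pays
C1_XLARGE_ON_DEMAND_PER_HOUR = 0.66  # what we pay on-demand
CORES_PER_C1_XLARGE = 8
SPOT_DISCOUNT = 0.85

jobs, minutes_per_job = 10_000, 5
core_hours = jobs * minutes_per_job / 60

# Revenue depends only on total job runtime.
revenue = core_hours * C2_PRICE_PER_CORE_HOUR

# Steady case: just enough instances to finish in one hour.
steady_on_demand = core_hours / CORES_PER_C1_XLARGE * C1_XLARGE_ON_DEMAND_PER_HOUR
steady_spot = steady_on_demand * (1 - SPOT_DISCOUNT)

# Fast case: 5,000 jobs at once, each instance billed a full hour.
fast_instances = 5_000 // CORES_PER_C1_XLARGE
fast_on_demand = fast_instances * C1_XLARGE_ON_DEMAND_PER_HOUR
fast_spot = fast_on_demand * (1 - SPOT_DISCOUNT)

print(f"revenue ~ ${revenue:.0f}")
print(f"steady: on-demand ~ ${steady_on_demand:.0f}, spot ~ ${steady_spot:.1f}")
print(f"fast ({fast_instances} instances): on-demand ~ ${fast_on_demand:.0f}, spot ~ ${fast_spot:.0f}")
```

The fast case is profitable only on spot: its on-demand cost exceeds the revenue of roughly $108, while its spot cost stays well below it.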

Another source of performance benefit (and a trade-off over cost) is the earlier mentioned realtime prediction. To ensure a positive customer experience, we do not, due to slower provisioning time, request (new) spot instances to satisfy a realtime request; rather, if we have insufficient capacity, we launch on-demand instances (which can later be replaced by spots). However, the low cost of spots allows us to act on realtime request predictions (#16). A correct prediction not only lowers our costs (using spots rather than on-demand), but also ensures the user’s realtime request is satisfied instantly (rather than waiting the 5 minutes it typically takes to deploy our worker instances).

 

19. What computational scale have you been able to achieve with your Spot application? For example: What is the most number of concurrent instances you have been able to run? Does your application run across many regions and instance types? How many instance-hours does it (did it) take to run your application?

Our application makes extensive use of being in a “cloud” environment; we are constantly requesting and terminating instances based on user demand.

As mentioned in Q#15, our workers operate on c1.xlarge, m2.xlarge, t1.micro, and cc2.8xlarge instances. While we operate solely in the US East region, we utilize all availability zones.

Our platform is theoretically unbounded in the number of concurrent instances it supports. Peak customer usage has required provisioning over 1,000 instances.

Categories: Official


2 Responses to “PiCloud Wins Grand Prize in Amazon EC2 Spotathon”

  1. Hristo Hristov says:

    This is very interesting, thanks for sharing. Using spot instances is great for the price, but I have one big worry: What happens if my 5-min job is running on a spot instance, and this instance is terminated? The job is terminated as well. I understand that you work hard to prevent this, but it is possible to happen. With some jobs, it is not ok to start them again, as some processing should not be performed twice. Does this mean that we have to program in a “transactional” way all the jobs? Do we have to handle possible job termination at any time? This will bring a lot more complexity in many cases… I didn’t know that until now, and this can lead to broken state on my side.

  2. Ken Elkabany says:

    @Hristo Due to the precautions we take, spot termination is exceedingly rare. In fact, after checking our logs, it looks like it has not happened in the last 30 days. If you mark your jobs as not restartable (_restartable=False), then our systems will try even harder to keep your job off spots, to avoid any possibility of termination. Moreover, if you do this, the rare job that is terminated will not be run twice, though parts of it will not be run at all. You can judge for yourself whether this behavior is more desirable.

    If it really is that critical that a job not be run twice, then you will need to keep track of what actions your jobs take separately. This is good practice since it’s always possible for a server to fail at any time, whether on AWS or even on a local machine.
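
One way to keep track of actions separately is a small ledger of completed action ids that a restarted job consults before repeating a side effect. The sketch below is a generic illustration, not part of the PiCloud API: `make_ledger`, `run_once`, and the `done` table are hypothetical names, and in real use the ledger would need to live in durable storage outside the worker rather than in memory.

```python
import sqlite3

def make_ledger(path=":memory:"):
    """Open (or create) a ledger of action ids that have already run."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (action_id TEXT PRIMARY KEY)")
    return conn

def run_once(conn, action_id, action):
    """Run action() unless action_id is already recorded; return True if it ran."""
    row = conn.execute(
        "SELECT 1 FROM done WHERE action_id = ?", (action_id,)
    ).fetchone()
    if row:
        return False  # a previous attempt already performed this action
    action()
    conn.execute("INSERT INTO done (action_id) VALUES (?)", (action_id,))
    conn.commit()
    return True

ledger = make_ledger()
calls = []
print(run_once(ledger, "charge-42", lambda: calls.append("charged")))  # -> True
print(run_once(ledger, "charge-42", lambda: calls.append("charged")))  # -> False
print(calls)  # -> ['charged']
```

Note that a failure between `action()` and the insert would still let the action run again on restart; avoiding that requires the side effect and the ledger write to share one transaction, or the action itself to be idempotent.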
