PiCloud has joined Dropbox

November 17th, 2013 by Ken Elkabany

We’re excited to announce that the PiCloud team has joined Dropbox! The PiCloud Platform will continue as an open source project operated by an independent service, Multyvac.

The last few years have been an incredible experience. We saw the potential in a nascent cloud industry, and helped bring it to those whose cause we believed in most: scientists and engineers in industry and academia solving difficult problems in computational biology, chemistry, physics, and many more disciplines.

We’re so proud of how far PiCloud has come, but it’s become clear that the industry we’ve chosen is not ready for the fast-paced growth demanded by a venture-backed business. For this reason, the team has decided to join Dropbox, a company full of bright, ambitious individuals whose values remind us of our own. We’ll be bringing our API-building expertise to the Dropbox Platform, helping ease and accelerate developer access to one of the greatest collections of data in the world.

To our users who dedicated their time and resources to our platform and mission, thank you very much. Your support was our fuel, your critiques our drive, and your success stories our pride. Even though the team is moving on, we’re still committed to addressing your needs, so here’s what’s going to happen:

  • Multyvac will be releasing an API-compatible solution, giving our users an easy migration path. Sign up for Multyvac’s mailing list to stay up-to-date on developments.
  • Multyvac will release their solution before our planned shutdown date on February 25th, which is 100 days from now.

Lastly, thank you to our investors, friends, and families for all your support. We couldn’t have done it without you.

-The PiCloud Team

Bucket Mounts

July 24th, 2013 by Josh Hawn

It’s now easier than ever to make use of your PiCloud Bucket in your jobs thanks to our new Bucket Mount feature. A Bucket Mount exposes your Bucket to your job as a read/write filesystem mounted at /bucket. In this folder, you’ll find everything you’ve stored in your Cloud Bucket, organized as a standard file system hierarchy.

Getting Familiar

Here, I’ve SSHed into a running PiCloud job using our exec-shell feature:

$ picloud exec-shell
Job requested as jid 4052. SSHing in..
Welcome to Ubuntu 11.04 (GNU/Linux 3.2.0-29-virtual x86_64)

 * Documentation:  https://help.ubuntu.com/
$ ls -lF /bucket
total 1
drwxrwxr-x 1 root picloud   0 Jun 12 21:22 folder1/
-rw-rw-r-- 1 root picloud  11 Jun 12 01:06 it_works.txt
-rw-rw-r-- 1 root picloud 806 Jun  6 21:57 pogo.py

PiCloud jobs run as a UNIX user in the group picloud and have permission to create, modify, or read any files or folders in the Bucket. You can access these files just as you would any regular file.

$ cat /bucket/it_works.txt
It Works!

Modifying Files

Modifying a file automatically uploads it to your Bucket when the file is closed or flushed.

$ echo 'Hooray for Bucket Mounts!' > /bucket/hooray.txt
$ cat /bucket/hooray.txt
Hooray for Bucket Mounts!

Renaming Files and Folders

As another example, renaming prefixes (folders) becomes quite easy. Below, all of the Cloud Bucket objects with the prefix folder1/ have been renamed to have the prefix files/.

$ picloud bucket list
filename
folder1/file1.txt
folder1/file2.txt
folder1/file3.txt
hooray.txt
it_works.txt
pogo.py
$ mv /bucket/folder1 /bucket/files
$ picloud bucket list
filename
files/file1.txt
files/file2.txt
files/file3.txt
hooray.txt
it_works.txt
pogo.py

Removing Files

You can also use it to easily delete files.

$ rm -rf /bucket/files
$ picloud bucket list
filename
hooray.txt
it_works.txt
pogo.py

Writing Code

Having your bucket objects already part of the file system can reduce the complexity of your programs. For languages that don’t have a PiCloud driver, you would normally have to start another process (from within your program) to execute picloud bucket get and picloud bucket put, but with your objects mounted at /bucket/ there’s no longer a need for this.

Writing Python

This now lets you simplify your Python code that utilizes your bucket. Code like this will still work, unchanged:

import cloud

def crunch_data():
    cloud.bucket.get('dataset.txt')

    with open('dataset.txt', 'r+') as dataset:
        modify_data(dataset)

    cloud.bucket.put('dataset.txt')

cloud.call(crunch_data)

But it’s no longer necessary for jobs to wrap code with get/put calls:

import cloud

def crunch_data():
    with open('/bucket/dataset.txt', 'r+') as dataset:
        modify_data(dataset)

cloud.call(crunch_data)

If you’re writing Python that needs to run both locally and on PiCloud, you can also use cloud.running_on_cloud() to determine how best to access objects in your bucket.
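
Here’s a minimal sketch of that pattern (dataset.txt is just an example object name):

import cloud

def open_dataset():
    if cloud.running_on_cloud():
        # On PiCloud, the Bucket Mount exposes the object as a regular file.
        return open('/bucket/dataset.txt', 'r+')
    # Locally, download the object first, then open the local copy.
    cloud.bucket.get('dataset.txt')
    return open('dataset.txt', 'r+')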

Performance and Consistency Considerations

Bucket Mounts can be thought of as a simple file system wrapper on top of the picloud bucket get and picloud bucket put functions. In particular:

  • When you open a file for reading, it will first download the entire object from your Bucket (picloud bucket get).
  • When you open a file for writing, it will also download the entire object first (if it exists).
  • When you flush/close a file open for writing, it will upload the entire file to your Bucket (picloud bucket put).
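
One practical consequence, sketched below: since every close of a writable /bucket file triggers a full upload, batching your writes into a single open/close is much cheaper than reopening the file repeatedly:

# re-downloads and re-uploads the whole object on every iteration
for i in range(100):
    with open('/bucket/results.txt', 'a') as f:
        f.write('result %d\n' % i)

# one download (if the object exists) and one upload
with open('/bucket/results.txt', 'a') as f:
    for i in range(100):
        f.write('result %d\n' % i)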

Cloud Desktops

July 18th, 2013 by Josh Hawn

Have you ever needed to run a GUI application on PiCloud? Well, neither have I, but it’s now possible! We’ve created a custom environment that you can use to run desktop applications on PiCloud.

Creating the VNC Server

Use exec-shell to create a new PiCloud job and SSH session:

$ picloud exec-shell --type c2 --cores 2 --env /picloud/Ubuntu-Desktop

This will create a job using our Ubuntu-Desktop environment. It requires at least 2 c2 cores due to the resource requirements of running the desktop environment.

Once the SSH session begins, you can start the VNC server.

$ vncserver -geometry 1280x800

You’ll be asked to set a password for your VNC sessions. The -geometry option specifies the dimensions of the X display in pixels. The default pixel depth is 16 bits, but you can get a broader range of colors with the -depth 24 option.
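
For example, to request a larger display with 24-bit color (the geometry here is just an illustration):

$ vncserver -geometry 1440x900 -depth 24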

Connecting to the VNC server

You can use your favorite VNC client to connect to your desktop job. For this example, I’ll be using the Screen Sharing app built in to Mac OS X. On other platforms, you should be able to use any VNC client application.

To connect, you’ll need to know the hostname and port for the VNC server. The easiest way to find out what hostname and port your VNC server is running on is to use the jobs panel from the accounts page, choose the job id corresponding to your desktop job, and scroll down to the “listening ports” section.

In this case, VNC listens on port 5901 inside a Linux container on the host ec2-50-16-102-78.compute-1.amazonaws.com, and is exposed externally on port 20102.

In my web browser, I just enter vnc://ec2-50-16-102-78.compute-1.amazonaws.com:20102 into the address bar and it starts the Screen Sharing app where I’m prompted for the password I used when starting the VNC server. Other VNC clients should ask for the hostname/port, or you can include host:port together in the same field.

Once the connection is established, you’ll be presented with a basic Ubuntu-2D desktop.

Our installation includes the Mozilla Firefox Web Browser and the LibreOffice productivity suite, but any changes you make to the filesystem are ephemeral and are discarded when the job completes. To save files you’re working on, you’ll want to save them to your Bucket or Volumes.

Like all of our public environments, you can also clone /picloud/Ubuntu-Desktop to customize for your own needs.

Using AUFS and LXC to Manage Dependencies at Cloud Scale

May 1st, 2013 by Ken Park

Environments in a Nutshell

At PiCloud we strive to be the easiest service out-of-the-box, yet versatile enough to be used for the most complex applications. This ideal was especially challenging to realize when expanding our compute platform beyond Python-only modules, to support the whole gamut of non-Python software. How could we extend the same ease-of-use that our cloud library offers to the complex world of software dependencies?

In this blog post, I’ll explore how we’ve harnessed three different technologies to achieve our goal of a highly-scalable software dependency manager. We call it Environments, and for many PaaS users, it’s what sets us apart.

Where We Were

PiCloud originated as a Python-only platform. In that simplified setup, our Dependency Analyzer could identify and ship all pure-Python dependencies at job submission time. Why only pure-Python? Anything that requires compilation cannot be expected to work when shipped to a computer with a different architecture. Unfortunately, for performance purposes, many Python libraries are actually compiled C-extensions, and this meant if you used a Python library with C-extensions, our compute nodes needed to have that library pre-installed. Not surprisingly, it didn’t take long before this limitation became an issue for some of our users.

Where We Wanted To Be

When the time came to revamp our dependency manager, we sought a generic solution that would work for Python and non-Python software alike. We boiled our requirements for the new dependency management down to the following:

  • Simple. It must be no more difficult than what a user would do on their own machine. Importantly, it should not require users to learn a new tool or configuration language.
  • Powerful. There should be minimal limitations on what packages or programs a user can install. Such flexibility necessitates that escalated privileges (for dependency installation purposes) are given in controlled situations.
  • Low Overhead. It must not introduce significant overhead to job processing. Among other things, this means satisfying a user’s dependency must not require booting a new machine or rebooting an existing one.
  • Parallelizable. The value of PiCloud is giving users the freedom to use as many cores as needed to accomplish their tasks. Deploying dependencies should not hinder parallelization across many machines.

How We Got There

Filesystem Virtualization

Fundamentally, satisfying the dependencies of a program means ensuring that the appropriate files can be found in the proper places in the filesystem. We can then rephrase the role of the dependency manager as making sure a job is run in whatever filesystem environment it needs for proper function.

We took inspiration from the Python tool virtualenv, which lets you create, on one machine, multiple, isolated Python environments, each with its own packages and settings. In other words, it virtualizes those portions of the filesystem relevant to Python. Generalizing this concept, what we wanted was a virtualenv for the entire filesystem.
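
For those unfamiliar with virtualenv, the isolation looks like this (the package and version are arbitrary examples):

$ virtualenv envA
$ virtualenv envB
$ envA/bin/pip install numpy==1.6.1
# envA's Python now sees numpy; envB's Python does not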

On *nix machines, the simplest way to virtualize the filesystem is through the chroot program, which allows you to set any point in your filesystem hierarchy as the new filesystem “root”. Unfortunately, chroot lacks mechanisms for isolation and resource management. At the other end of the virtualization spectrum is a virtual machine, which gives you full control and isolation but comes with the huge overhead of starting and running the VM. However, in between the simple chroot utility and full VMs are container technologies that share the host’s kernel but have the capacity to virtualize most everything else. Combined with other recent Linux kernel developments, like cgroups and namespaces, container technologies provide resource management and isolation, respectively. And starting a container is fast, because there is no operating system to boot up – the host’s kernel is already up and running. We ended up choosing LXC (LinuX Container) as our container technology, because it has been mainlined into the Linux kernel.
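
To make that spectrum concrete (paths and names below are hypothetical):

# chroot: re-roots the filesystem, but provides no isolation or resource limits
$ sudo chroot /srv/env42 /bin/bash

# LXC: shares the host kernel, but adds namespace and cgroup isolation
$ sudo lxc-start -n env42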

Our new dependency management system was beginning to take shape. A user would start with an LXC-bootable base filesystem, and install whatever programs and files their jobs need. We would store this modified filesystem, which we’ll call an Environment from this point on. When the user runs a job, we would start an LXC container using their Environment as the filesystem. The next question is, how do we store and send these big Linux filesystems across the network to compute nodes?

Union Mount System

What we realized was that saving the entire Environment was wasteful. In practice, Environments would be mostly the same, with only a relatively small number of files added in or changed. Theoretically, given the base filesystem the Environment was built on, we only need the changes the user made on top of this base to recreate the full Environment. What would enable us to save just this difference, rather than the whole Environment?

The answer is union mounts. A union mount allows you to stack multiple filesystems and present a unioned view at a different mount point. It is the technology behind many Live CDs. A Live CD contains a bootable operating system but is not typically writable. Hence, upon bootup from a Live CD, the system will create a temporary scratch space in RAM, then union mount this RAMFS on top of the filesystem of the CD. After chrooting into this unioned mount, the user is presented with a machine seemingly running off a read-only CD, but with the ability to tinker with and change the operating system files as they try it out. And due to the magic of the union mount, all changes are written to the RAM filesystem, even when modifying files from the CD.

Several union filesystem tools exist, but after some research, we settled on AUFS (Another Union FileSystem), a nifty module created and maintained by Junjiro Okajima. We chose AUFS because it is relatively performant and reliable, and we have been impressed with Junjiro’s responsiveness to support requests.
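
To sketch what an AUFS stack looks like at the mount level (paths are hypothetical; assumes the aufs module is loaded):

$ mkdir /tmp/changes /mnt/environment
$ mount -t aufs -o br=/tmp/changes=rw:/base-env=ro none /mnt/environment
# writes under /mnt/environment land in /tmp/changes; /base-env stays untouched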

So, how much does AUFS help? On PiCloud, the average size of Environment modifications is around 220 MB. Given a base filesystem size of around 2.5 GB (which includes many pre-installed libraries for convenience and performance), this leads to roughly a 12-fold savings in terms of storage and network transfer.

Environment In Action

So, putting all that together, here’s how PiCloud Environments work:

  1. PiCloud user requests creation of a new Environment through our web UI or client CLI.
  2. PiCloud launches a machine the user can log into, which is actually an AUFS stack of an empty filesystem (read-writable) on top of a base Environment (read-only), much like a Live CD.
  3. User logs into the “setup server” and installs dependencies as they would on their own machine.

Environment Creation

  4. When the user is done setting up the Environment and requests to save it, the user’s changes are sanitized for security and saved into our distributed file system.

Environment Saving

  5. User can now submit jobs, specifying that they should run in the custom Environment. In Python, the Environment is specified with the “_env” keyword; in bash, with the “-e” flag.
  6. PiCloud recreates the AUFS stack on our compute nodes, and runs the user’s jobs in LXC containers.

Using Environment

DFS Features That Improve Performance

As mentioned above, AUFS reduces the size of an Environment by roughly 12-fold. In practice, we end up doing much better than that thanks to our distributed file system (DFS), which has two important properties:

  1. Files from an environment are transferred at the block level to the compute node on-demand (only when a program accesses them).
  2. Transferred files are cached on the compute node automatically.

To understand the benefits of these properties, consider the Matlab Compiler Runtime (MCR), which enables the stand-alone execution of compiled Matlab applications. When a user installs the MCR in an Environment, close to a gigabyte of files are added. But, for the typical MCR-compiled application, a small minority (< 10%) of MCR is accessed at runtime, significantly reducing the data transferred. And, if a subsequent job runs on the same compute node, and needs the same MCR Environment, it will be available without any data fetching over the network.

Check Out the New Features

If you’re currently a PiCloud user and have wondered how Environments work, I hope this shed some light on the black box.

If you haven’t already, you should check out our recent updates to the Environment system. We’ve redone the web UI and added client-side CLI support for managing your Environments. Also, we are excited about the new sharing feature that lets users share Environments they’ve created with colleagues and the general public.

Introducing Queues — Creating a Pipeline in the Cloud

April 3rd, 2013 by Ken Elkabany

Queues provide an interface for Dataflow Programming that is built on top of our job system.

While a distributed queue data structure with push, pop, and ack capabilities is provided, the key benefit is the ability to attach a handler to a queue for scalable processing of a queue’s messages. The handler in turn can feed its output messages to other queues.

In other words, you’re probably used to the queue data structure:

Overview of Queue

Our queues link the data structure with a message handler, f, which we call an attachment:

Overview of Queue

f(msg) is any Python function you define that takes in one argument at a time, a message, coming from the input queue. Its return value is pushed into the output queue.

By the end of this post, you’ll be able to:

  • Create a distributed, fault-tolerant pipeline of queues and processors.
  • Scale each component to achieve a throughput of thousands of messages per second.
  • See it all through automatically-generated visualizations.
  • Pay only when you have messages in your pipeline.
  • Do it all, with only Python, and not a single server.

If you’re a developer who just wants to RTFM, see our documentation.

Diving In

Let’s see queues in action. You’ll need to have the latest client installed, released today (4/3).

First, let’s push and pop from a queue in your console to get comfortable:

>>> # import our library
>>> import cloud
>>> q = cloud.queue.get('numbers')
>>> # adds 3 messages to the queue
>>> q.push([1,2,3])
>>> # pops up to 10 messages
>>> q.pop()
[2, 1, 3]

Note that the queue did not dequeue in perfect FIFO order; this is expected. Our queues are designed for high throughput and high parallelism with minimal queue backlog, which makes guaranteed FIFO behavior less relevant.

Attach

Now let’s attach a function that increments all numbers in the input queue:

# declare the input and output queue
input_q = cloud.queue.get('numbers')
output_q = cloud.queue.get('bigger-numbers')

# create handler function
def increment(x):
    return x + 1

# attach the handler to the queue
input_q.attach(increment, output_q)

How did that work? We’re using the same automagic dependency transfer we use in our job system to send your increment function to us along with any dependencies it might have.

Visualization

From the Queues Dashboard, we can see an auto-generated layout of our pipeline based on the attachment we made:

Overview of Queue

Message Processing by Attachment

Let’s increment 1,000 numbers:

# range(1000) is a list of numbers from 0 to 999
input_q.push(range(1000))

In the background, our systems have created a job (visible in your Job Dashboard) that applies the increment function to every number in the numbers queue, and outputs the result to the bigger-numbers queue. If you’re unfamiliar with our job framework, don’t worry, queues abstract away most of the details. But, if you’re interested, see our Primer.

After ~10 seconds, you’ll see that all messages have been processed. On the live throughput chart, a single point represents the average throughput during a 10s window of time; the interpolation lines are cosmetic. Below, the single point at 100 msgs/second indicates that 1,000 messages were processed during that window (1,000 messages / 10 seconds). In actuality, we sustained about 150 msgs/second for 7 seconds.

Overview of Queue

As a sanity check, we can also check the size of the queues:

>>> input_q.count()
0
>>> output_q.count()
1000

Increasing Throughput

What if you want to increase your throughput past 150 msgs/second? Set max_parallel_jobs for the attachment. You can do this from the Queue Dashboard or from the attach call:

# attach the handler to the queue
input_q.attach(increment, output_q, _max_parallel_jobs=5)

Now, assuming there are messages in the queue, you’ll see a throughput of 750 msgs/second (5 jobs × ~150 msgs/second each)!

Overview of Queue

If you click on “view” jobs, you can see a list of the five jobs attached to the queue. For those familiar with our job processing framework, you can now see that our Queue processors are built on top of jobs.

The takeaway: you just increased your throughput 5-fold by changing one number, and without any server management or scaling.

Creating an Image Processing Pipeline

To showcase the power of queues, we’re going to create the following pipeline:

Overview of Queue

This is an auto-generated visualization available in the Queues Dashboard. Rectangles are queues; circles are attachments.

The inputs to this pipeline are URLs, which should be pushed to the img-urls queue. The pipeline downloads the image, and does the following image operations:

  • Resize to 150px (thumbnail)
  • Resize to 400px (medium size)
  • Apply a Sepia-tone filter to the medium size image

For each generated image, an HTTP callback is made to an endpoint of your choice. Note that separating the three image operations into three attachments with different input queues isn’t the most efficient (you’d probably want to combine them into one operation), but it’s done for illustrative purposes.

You can download this pipeline from our repository: basic-example/queue/imgpipeline/pipeline.py

Step 1: Scraping Images

We’re going to use the following function as our attachment:

import os
import Image
import urllib2
from StringIO import StringIO

import cloud

def scrape_to_bucket(target):
    """Downloads image from url, and saves to bucket. *target* should
    be a dict with keys id (image id), and url (location of image).

    Returns a dict with keys id (image id), path (obj key), and
    transforms (empty list)."""

    id = target['id']
    url = target['url']

    # path to save image in bucket
    obj_path = 'imgs/{id}/original.png'.format(id=id)

    # extract extension from url
    ext = os.path.splitext(url)[-1]

    # open connection to image
    u = urllib2.urlopen(url)

    # if image isn't png, convert it to png
    # (os.path.splitext keeps the leading dot, e.g. '.jpg')
    if ext.lower() != '.png':
        i = Image.open(StringIO(u.read()))
        data = StringIO()
        i.save(data, 'png')
        data = data.getvalue()
    else:
        data = u.read()

    u.close()

    # add image to bucket
    cloud.bucket.putf(data, obj_path)

    return {'id': id,
            'path': obj_path,
            'transforms': []}

If you’re unfamiliar with Buckets, just think of them as a key->value object store. We use it here to conveniently retrieve and store objects to and from memory. However, buckets are not necessary, and are completely unrelated to queues. You can modify scrape_to_bucket() so it saves images into your own Amazon S3 account, database, or anywhere else.

Here’s a sample input message we’ll use to demonstrate each operation:

{
 'id': 1,
 'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'
}

The url points to an image of the Obama family:
Obama Family Full Size

Per the source code, the above image will be saved in your bucket. The output message pushed to the thumbnail and medium queues will be:

{
 'id': 1,
 'path': 'imgs/1/original.png',
 'transforms': []
}

You can verify this works by simply running the function on your own machine:

>>> msg = {'id': 1,
...        'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'}
>>> scrape_to_bucket(msg)
{'id': 1, 'path': 'imgs/1/original.png', 'transforms': []}

This is another advantage of queues. Because your function doesn’t need to be modified in any way to be an attachment, you can test it just as easily locally as on the cloud.

Handling Exceptions

What if the message handler throws an Exception? Maybe the URL was temporarily unavailable, but you’d like to retry it in 60 seconds. Using retry_on, max_retries, and retry_delay, you can specify which Exceptions you’d like to retry, the number of times to retry, and the amount of time between each attempt.

import urllib2

q = cloud.queue.get('img-urls')
output_qs = cloud.queue.get('thumbnail'), cloud.queue.get('medium')
bad_urls_q = cloud.queue.get('bad-urls')

q.attach(scrape_to_bucket,
         output_qs,
         retry_on=[urllib2.HTTPError, urllib2.URLError],
         max_retries=3,
         retry_delay=60,
         on_error={Exception: {'queue': bad_urls_q}})

Using the on_error keyword, the bad-urls queue will be sent messages that raised non-retryable Exceptions, as well as messages that failed even after three retries. Error messages generated by on_error include the triggered exception and the associated traceback.

You can confirm that your attachment has been set up as intended with the visualization.

Scrape Attachment

Tweaking Performance with Multi-threading

Because this scraping attachment spends most of its time waiting for network data transfer, and is thus I/O bound, it won’t be effectively utilizing the core it’s running on. The workaround is to run multiple, identical attachments in the job.

To do this, set the readers_per_job keyword to the number of simultaneous threads you want running in a job. The default value is 1. Do not confuse this with max_parallel_jobs, which controls the number of jobs that may be running. For this example, we’ll set the value to 4.

import urllib2

q = cloud.queue.get('img-urls')
output_qs = cloud.queue.get('thumbnail'), cloud.queue.get('medium')
bad_urls_q = cloud.queue.get('bad-urls')

q.attach(scrape_to_bucket,
         output_qs,
         retry_on=[urllib2.HTTPError, urllib2.URLError],
         max_retries=3,
         retry_delay=60,
         on_error={Exception: {'queue': bad_urls_q}},
         readers_per_job=4)

Step 2: Resizing Images

We’re going to attach handlers to the thumbnail and medium queues to resize images to 150px and 400px, respectively. To ease the storage and retrieval of images as PIL Image objects, we’re going to use a custom-defined ImageOperation class.


class ImageOperation(object):
    """Base class for Message Handlers in Image Pipeline.

    Retrieves images from bucket, performs in-memory manipulation
    with PIL object, stores result back in bucket, and then
    outputs message with additional transform listed.

    Override operation() for custom operation."""

    name = 'identity'

    def get_image_from_bucket(self, obj_path):
        """Given *obj_path* in bucket, returns PIL Image object"""

        # get image data as string of raw bytes
        data = cloud.bucket.getf(obj_path).read()

        return Image.open(StringIO(data))

    def put_image_in_bucket(self, img, obj_path):
        """Given PIL image *img*, saves it to *obj_path* in bucket"""

        output_data = StringIO()

        # write raw image bytes to StringIO
        img.save(output_data, 'png')

        # store the image file in your bucket
        cloud.bucket.putf(output_data.getvalue(), obj_path)

    def add_modifier_to_key(self, obj_path):
        """Returns new *obj_path* that includes name of transform"""

        obj_key, obj_ext = os.path.splitext(obj_path)
        obj_path = '{key}.{name}.png'.format(key=obj_key,
                                             name=self.name)
        return obj_path

    def message_handler(self, msg):
        """Entry point for message handling. Do not override."""

        img = self.get_image_from_bucket(msg['path'])

        # apply image operation
        new_img = self.operation(img)

        msg['path'] = self.add_modifier_to_key(msg['path'])
        msg['transforms'].append(self.name)

        self.put_image_in_bucket(new_img, msg['path'])

        return msg

    def operation(self, img):
        """Method to replace for custom operation"""

        return img

Since we support instances as message handlers, we’ll subclass ImageOperation to make two message handlers: ImageThumbnail and ImageMediumSize.

class ImageThumbnail(ImageOperation):

    name = 'thumb'

    def operation(self, img):
        """Returns a thumbnail of the *img*"""

        img.thumbnail((150, 150), Image.ANTIALIAS)
        return img

class ImageMediumSize(ImageOperation):

    name = 'med'

    def operation(self, img):
        """Returns a 400px version of the *img*"""

        img.thumbnail((400, 400), Image.ANTIALIAS)
        return img

Now we’ll attach instances of these classes to their respective input queues.

thumbnail_q = cloud.queue.get('thumbnail')
thumbnail_q.attach(ImageThumbnail(), [callback_q])

medium_q = cloud.queue.get('medium')
medium_q.attach(ImageMediumSize(), [sepia_q, callback_q])

If you pushed the example message of the Obama family to img-urls, then there are already messages ready for the thumbnail and medium queue. Once processed, these two objects will appear in your bucket:

imgs/1/original.thumb.png

Obama Family Thumbnail

imgs/1/original.med.png

Obama Family Medium Size

Step 3: Sepia Tone

The final image operation is a sepia-tone filter after the medium-size downscale operation.

import ImageOps

class ImageSepia(ImageOperation):
    """Applies Sepia Filter.
    Based on: http://effbot.org/zone/pil-sepia.htm"""

    name = 'sepia'

    def __init__(self):
        self.sepia_palette = self.make_linear_ramp()

    @staticmethod
    def make_linear_ramp():
        """Generate a palette in a format acceptable for `putpalette`,
        which expects [r,g,b,r,g,b,...]"""

        ramp = []
        r, g, b = 255, 220, 162 

        for i in range(255):
            ramp.extend((r*i/255, g*i/255, b*i/255))

        return ramp

    def operation(self, img):
        """Returns a version of the *img* with Sepia applied
        for a vintage look."""

        # convert to grayscale
        orig_mode = img.mode
        if orig_mode != "L":
            img = img.convert("L")

        img = ImageOps.autocontrast(img)

        # apply sepia palette
        img.putpalette(self.sepia_palette)

        # convert back to its original mode
        if orig_mode != "L":
            img = img.convert(orig_mode)

        return img

Attaching:

sepia_q = cloud.queue.get('sepia')
sepia_q.attach(ImageSepia(), [callback_q])

Once again, if you pushed the sample message, there should already be a message ready in the sepia queue. The image outputted to your bucket is:

Obama Family Sepia

Step 4: Callback

Each image operation outputs a message to the callback queue. You probably want your callback handler to:

  • Write to your database that the image is ready
  • Make a POST request to your website for instant notification
  • Store the image somewhere else

For simplicity, we’ll have the callback handler set the image object in your bucket as public, so that it’s accessible by anyone. Based on the above examples, the following should be straightforward:

def callback(msg):
    print msg
    cloud.bucket.make_public(msg['path'])

callback_q = cloud.queue.get('callback')
callback_q.attach(callback)

Debugging Attachments

You may have noticed that in callback(msg), we did a print msg. How would you see standard output for an attachment? The same way you would for a job—by clicking on it in the Jobs Dashboard. To know what jobs are running your attachments, click “view” jobs from the Queues Dashboard, which will take you to the Jobs Dashboard filtered for your attachment.

Using this method, you’ll get access to all the information you’re accustomed to with jobs, including a realtime feed of CPU, memory, and disk usage.

Scaling Up

With just a few tweaks, I was able to get a system throughput greater than 150 images per second. I set max_parallel_jobs to 20 for the scraping step (10 readers_per_job, c2 core), and 30 for all image operation steps. Also, I set the image operation steps to use the f2 core for faster processing.

Here’s a screenshot of the Queue Dashboard in action as I was testing (doesn’t show max throughput). Note how the dequeue rate is able to keep up with the enqueue rate, which is precisely what we want.

Queue Dashboard

Pricing

We charge for queues based on the amount of time jobs spend processing messages. Assuming you have a steady stream of messages, the maximum you’ll pay in an hour is:

max_parallel_jobs x cost per core hour

You can find the cost per core hour based on the core type you’ve chosen from our pricing page. If your queue is empty, no jobs will be running, and you won’t pay a thing!
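
As a worked example: at a hypothetical rate of $0.05 per core hour with max_parallel_jobs set to 5, the most you’d pay for an hour of steady traffic is 5 × $0.05 = $0.25.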

Conclusion: Let Us Clean Your Pipes

Letting us manage the full pipeline—the queues, and the processing of messages—has several advantages:

  • No Servers: You won’t have to configure or deploy a single server for storage or processing.
  • Faster Development: It takes only a few lines of code to “use the cloud” for a pipeline.
  • Reliability: Our queues and workers are distributed and replicated across multiple datacenters (AWS availability zones), and even a server failure won’t jeopardize your messages.
  • Scale Effortlessly: Tell us how many cores you want to put to work, and we make it so.
  • Cut Costs: You only pay for processing when there are messages. No idling servers.
  • Monitoring & Analytics: Take advantage of our queue analytics, and the same job monitoring interface that powers our standard service.

If you’re ready to give it a try, sign up now, and get 20 free core hours. Happy coding!

XKCD Hash Breaking

April 2nd, 2013 by Josh Hawn

The PiCloud team has been nerd-sniped by the April Fools XKCD comic Externalities, which pits visitors against each other in a contest of raw computing power (and a little luck). Here at PiCloud we’ve got lots of computing power, but just how much is needed, in expectation, to win their hash-breaking competition?

The probability of matching a certain number of bits in the hashed output follows a binomial distribution. The current leaders of the contest are 389 bits off out of 1024 bits. We can calculate our chance of generating a better hash value like so:

P(better hash) = [C(1024,0) + C(1024,1) + ... + C(1024,388)] / 2^1024 ≈ 1 / (2.28 × 10^14)

This is the fraction of the distribution that is less than 389 bits off, and it’s one in 228 trillion!

Not to be discouraged, we wrote a program to generate random hashes and test how close they are to the goal given by XKCD. Running it on a single f2 core, we were able to generate and check over 160,000 hashes per second. Multiplying the prior probability by this rate results in a 1 in 1.43 billion chance of finding a better hash in 1 second, but we can do much better than that.

How many f2 cores running in parallel would it take, on expectation, to generate a better hash within the next 6 hours? Six hours is 21,600 seconds, so it would take over 66,200 f2 cores (1.43 billion / 21,600) running in parallel over those 6 hours before we can expect to find a better hash. Unfortunately, we can’t get you 66,000 f2 cores, and the cost would be nearly $87,500 on PiCloud.

Feeling Lucky?

There are still over 6 hours left in the competition as of the time of this blog post and we’ve made it easy for you to get hashing using our environment set up with a skein-1024 hashing program. We’ve shared the environment publicly, so you can use it directly with no setup.

Here’s how you can give it a shot (You need our client library installed):

$ picloud exec -e /picloud/xkcd_skein -t f2 skein

This will run indefinitely, printing to standard output the best result it has come across so far. You can monitor the standard output of the job from the Jobs Dashboard to see if you’ve found a better hash.

$ picloud exec -e /picloud/xkcd_skein -t f2 skein 389

This form of invocation will not print any output until it has come across a hash that is closer than 389 bits away from the goal.

$ picloud exec -e /picloud/xkcd_skein -t f2 skein 400 1000000000

This form of invocation will not print any output until it has come across a hash that is closer than 400 bits away from the goal, but will stop hashing after 1 billion attempts.

From the Environments Dashboard, you can also clone our public environment, and modify the program for yourself.

You can run “picloud exec” as many times as you want to maximize your parallel computing power. Watch your bill, and good luck!

Dealing with the Inconsistent EC2 API

February 17th, 2013 by Aaron Staley

This is the second in a series of posts discussing issues when using Amazon Web Services at scale. The first was When EC2 Hardware Changes Underneath You….

At PiCloud, we’ve accumulated over 100,000 instance requests on Amazon EC2. While we know of no IaaS provider superior to Amazon, it isn’t perfect. In this post, I’ll be discussing how we’ve built our scaling systems around Amazon EC2, despite frequent data inconsistencies from its API.

Background: The Scaler

As users create jobs, we add them to our job queue until there is a free worker available to do the processing. We are constantly estimating the size of this job queue to scale the number of “worker instances” we have available to perform our customers’ computation. Due to fluctuations in our job queue throughout the day, our scaling system regularly requests and terminates EC2 instances.

Our automated scaling system, or “scaler” as we call it, runs the following algorithm several times a minute:

  • Obtain queue size from scheduling system. Infer number of instances (servers) needed.
  • Obtain instance state information from EC2 with the DescribeInstances API call.
  • Compare the number of instances needed to the number of instances EC2 indicates are running, pending, etc.
    • RunInstances if more are needed.
    • TerminateInstances that are excessive.

(This is a simplification that doesn’t include our use of the EC2 spot market, our inability to terminate servers running a customer’s jobs, and our optimization to only terminate servers near the end of their chargeable hour. For more information, see our Grand Prize winning Spotathon Application).

The benefit of the above algorithm is that it allows the scaler to maintain minimal internal state, making it simpler, easier to test, and more robust. Aside from the queue size calculated by our scheduler, the EC2 API essentially tracks our system state.
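
Here’s a rough sketch of one iteration of that loop using boto (the AMI, tag filter, and instance type are hypothetical, and the spot-market and termination-timing logic mentioned above is omitted):

import boto.ec2

def reconcile(conn, instances_needed, ami='ami-00000000'):
    # DescribeInstances: what does EC2 think is alive?
    reservations = conn.get_all_instances(
        filters={'tag:role': 'worker',
                 'instance-state-name': ['pending', 'running']})
    workers = [i for r in reservations for i in r.instances]

    delta = instances_needed - len(workers)
    if delta > 0:
        # RunInstances if more are needed
        conn.run_instances(ami, min_count=delta, max_count=delta,
                           instance_type='m2.xlarge')
    elif delta < 0:
        # TerminateInstances that are excessive
        n_excess = -delta
        conn.terminate_instances([i.id for i in workers[:n_excess]])

conn = boto.ec2.connect_to_region('us-east-1')
reconcile(conn, instances_needed=10)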

Relying on EC2 as the Single Version of the Truth for our system state caused us many issues, which we’ll now cover in detail.

DescribeInstances Inconsistency

When the scaler was first launched, and we had far fewer servers, all was well. However, over time we noticed two odd behaviors:

  • Sometimes, far more servers than needed were being deployed.
  • Rarely, but catastrophically, the scaler would terminate every worker server, only to immediately spawn new ones afterward!

Sifting through debugging logs brought the problem to light. After requesting a server, subsequent DescribeInstances responses would not necessarily include the newly pending server. In database terms, the EC2 API is only eventually consistent. The stateless scaler, clueless that it had just requested a server, would keep deploying instances until they finally showed up in the DescribeInstances response.

Worse, the instances that did appear in the DescribeInstances response were not necessarily up to date. At times, after an instance had been terminated, it would still appear as running. The stateless scaler, clueless that it had just terminated a server, would then terminate a different server—and so forth—until EC2 finally concurred that they were terminated.

In the end, we had to introduce some state (a list of instances requested/terminating) to supplement the response from DescribeInstances. While the EC2 API does not provide any upper bound on its eventual consistency (let alone document that the API is eventually consistent), we’ve found that this “override list” only needs to exist for a few minutes.
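
A sketch of what that supplemental state might look like (the few-minute TTL reflects what we’ve observed in practice, not any documented bound):

import time

class InstanceOverrides(object):
    """Supplements DescribeInstances results with our own recent actions."""

    TTL = 300  # seconds; enough, in practice, for EC2 to catch up

    def __init__(self):
        self._overrides = {}  # instance_id -> (state, timestamp)

    def record(self, instance_id, state):
        # e.g. 'pending' right after RunInstances,
        #      'terminated' right after TerminateInstances
        self._overrides[instance_id] = (state, time.time())

    def merge(self, described):
        """*described* maps instance_id -> state per DescribeInstances."""
        merged = dict(described)
        now = time.time()
        for iid, (state, ts) in self._overrides.items():
            if now - ts < self.TTL:
                merged[iid] = state  # trust our recent action over EC2's view
        return merged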

CreateTags Inconsistency

In keeping with the philosophy of a stateless scaler, all meta information about a given instance is stored as EC2 tags. Our tags indicate the instance’s environment (“test”, “production”, etc.), role (“worker”, “webserver”, etc.), etc… Rather than keeping such information in our own database, we let EC2 handle the details.

As we want tags to be set atomically with instance creation, any instance is created with the following API calls:

  • RunInstances – create the instance(s)
  • CreateTags – tag the just returned (pending) created instance(s)

Things worked for a while. But at some point, we noticed the scaler was crashing with the error:

InvalidInstanceID.NotFound: The instance IDs ... do not exist

And yet the purportedly “not found” instances were clearly showing up in our DescribeInstances responses.

Given what we’ve learned from DescribeInstances, CreateTags, not surprisingly, also exhibits eventual consistency. Our solution has been to exponentially back-off, giving up after some timeout, whenever the CreateTags request fails with InvalidInstanceID. Again, it may take over a minute after RunInstances for CreateTags to work.
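
A sketch of that back-off (the error code comes from the failure above; the timeout is our own tuning):

import time
import boto.exception

def create_tags_with_backoff(conn, instance_ids, tags, max_wait=120):
    delay, waited = 1, 0
    while True:
        try:
            conn.create_tags(instance_ids, tags)
            return
        except boto.exception.EC2ResponseError as e:
            # Only retry the eventual-consistency error; re-raise anything else.
            if e.error_code != 'InvalidInstanceID.NotFound' or waited >= max_wait:
                raise
        time.sleep(delay)
        waited += delay
        delay *= 2  # exponential back-off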

Unavailable Meta-Data

Our deployment scripts rely on EC2 meta-data to learn about the instance’s attributes, which in turn affect application configuration.

One such application we install is Linux’s Logical Volume Manager (LVM). Instances with large amounts of ephemeral storage often have multiple volumes attached to them. LVM allows us to abstract the multiple volumes into a single one.

The LVM installer needs to know the block device mapping (e.g. /dev/sdb) of the ephemeral storage. Such information is only available in the instance meta-data. Unfortunately, we’ve since discovered that requests for block-device-mapping sometimes return an empty string. So once again, we need to back off and try again. Complicating matters is that even once a given request returns valid data, subsequent requests may again return no data!

Instances Cannot Always be Launched

Realtime Cores are our way of letting you dictate to our scaler the exact number of cores you need. The scaler allocates Realtime Cores by issuing an all-or-nothing RunInstances request (e.g. MinCount == MaxCount == 10). While Amazon’s documentation warns that sometimes instances can’t be launched, everything worked for months. Like the C programmer who doesn’t check that the pointer returned by malloc is not NULL, we stopped worrying about what was actually returned.

Sure enough, one day, requests for dozens of Cluster Compute Eight Extra Large Instances (cc2.8xlarge) started failing due to “Insufficient capacity”. We weren’t even requesting instances from a specific Availability Zone (AZ); there wasn’t enough capacity anywhere!

What we thought would never happen turned out to be real, and we had to update our Realtime interface appropriately.

And there was another subtle lesson, too. While we are indifferent to the AZ any worker instance is launched in, RunInstances will never return instances from a heterogeneous mix of AZs; Amazon will always place the batch of requested instances in the same AZ. Consequently, we now set MinCount to 1 and keep issuing requests until the correct number of instances is launched.

Conclusion

This article only touches on some of the EC2 difficulties we’ve encountered. What we initially thought would be a simple, clean scaling management system ended up full of hacks to handle inconsistencies in the Amazon API. As we’ve discovered over the past four years creating PiCloud, building a robust, large-scale system, even on EC2, where so much infrastructure management is already handled, ends up far more challenging than it initially appears to be.

2013 Academic Research Grant Recipients

January 16th, 2013 by Daniel Singh

Once again, we are excited to announce the recipients of the $500 (10,000 c1 core hours) Academic Research Program Grant from PiCloud. In response to the overwhelming number of high quality applicants, we decided to award six grants this year!

We’ve included a list of recipients below so you can see some of the great work that will be benefiting from our platform.

If you missed out on this cycle of grants, don’t worry! We plan on having another Grant Program launched in the very near future.

List of Recipients

Tim Althoff
Graduate Student Researcher
German Research Center for Artificial Intelligence (DFKI)

Over time different topics arise in media and society which reflect shifting interests of groups of individuals. These trending topics are also reflected in online media such as Twitter, Google, and Wikipedia. Our research project seeks to elucidate trending topics in online media through analyzing what people are interested in and characterizing those trends as well as making predictions about the impact and lifetime of those trends. These models and predictions are relevant to finance, advertising, journalism, and social media recommenders. Furthermore, we use trending topics to inform machine learning models in visual concept detection that automatically annotate videos with tags.

Robert Lanfear and Brett Calcott
Postdoctoral Research Fellows
Australian National University, Canberra

Biologists are now routinely producing genome-scale datasets, and bioinformaticians have to work hard to keep up. We are developing a recently-released piece of bioinformatics software, PartitionFinder, to run in the cloud. This will allow biologists to select optimal models and partitioning schemes for genome-scale datasets without investing in huge servers. We hope that this will help improve the inferences we make from genome-scale datasets in biology.

Ken Locey
PhD Candidate in the Dept. of Biology
Utah State University

Development of a general approach to explain and predict macroecological patterns and a common pattern in nature, i.e. the uneven distribution of wealth and abundance. The project develops and tests a framework for examining and predicting distributions of wealth and abundance in almost any system.

Massimo Minervini
Ph.D. Candidate in Computer Science and Engineering
IMT Institute for Advanced Studies, Lucca, Italy

Multi-atlas, multi-template anatomical segmentation of brain MRI volumes, when combined with label fusion, is considered a state-of-the-art method for the segmentation of brain anatomy in humans and animal models. However, it is normally a challenging task, and depending on the number of subjects, templates, and atlases involved, the amount of computation required is significant. We want to take advantage of PiCloud’s computational power to speed up this process significantly, in an intelligent and time-optimal fashion.

Thomas Robitaille
Research Group Leader
Max Planck Institute for Astronomy, Heidelberg, Germany

Carrying out mosaicking for the GLIMPSE project, a survey of our Galaxy with the Spitzer Space Telescope, and computing radiative transfer models with a Python-based radiative transfer code that can be used to simulate observations of many astrophysical objects, including, for example, comets, forming stars, and galaxies.

Yannick Wurm
Lecturer in Bioinformatics
Queen Mary, University of London

Ants, bees, wasps and termites live in societies of complexities that rival our own. Past studies largely focused on the behaviors, morphologies and evolutionary histories of such social insects. Thanks to recent improvements in DNA sequencing and analysis technologies, we are beginning to identify the genes involved in social interactions in these species.

When EC2 Hardware Changes Underneath You…

January 8th, 2013 by Aaron Staley

At PiCloud, we’ve accumulated over 100,000 instance requests on Amazon EC2. Our scale has exposed us to many odd behaviors and outright bugs, which we’ll be sharing in a series of blog posts to come. In this post, I’ll share one of the strangest we’ve seen.

The Bug

It started with a customer filing a support ticket about code that had been working flawlessly for months suddenly crashing. Some, but not all, of his jobs were failing with an error that looked something like:

Fatal Python error: Illegal instruction

File "/usr/local/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 1319 in svd
File "/usr/local/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 1546 in pinv

That’s odd, I thought. I had never before seen the Python interpreter use an Illegal Instruction! Naturally, I checked the relevant line that was crashing:

results = lapack_routine(option, m, n, a, m, s, u, m, vt, nvt, work, lwork, iwork, 0)

A call to numpy’s lapack_lite C extension. Great, the robust numpy was crashing out.

More surprising was that only a minority of jobs were failing, even though the customer indicated that all jobs were executing the problematic line. We did notice that the job failures were linked to just a few servers, and those few servers ran none of the customer’s jobs successfully. Unfortunately, our automated scaling systems had already torn those servers down.

Debugging

The first thing I did was Google the error. Most results were unhelpful, but one old, though now solved, bug with Intel’s Math Kernel Library (MKL) seemed notable. MKL would crash with an illegal instruction error when AVX (Advanced Vector Extensions, a 2011 extension to x86) instructions were being executed on CPUs that lacked support. Why notable? We compile numpy and scipy libraries with MKL support to give the best possible multi-threading performance, especially on the hyperthreading & AVX capable f2 core.

Still though, why did only a few servers crash out? Having not much to go on, I launched a hundred High-Memory m2.xlarge EC2 instances (200 m2 cores in PiCloud nomenclature) and reran all the user’s jobs over the nodes. A few jobs, all on the same server, failed.

As I compared the troublesome instance to the sane ones, one difference stood out. The correctly operating m2.xlarge instances were running 2009-era Intel Xeon X5550 CPUs. But the troublesome instance was running a more modern (2012) Xeon E5-2665 CPU. And returning to the MKL bug noted earlier, this new chip supported AVX.

Examining /proc/cpuinfo showed as much; AVX was advertised on the failing instance, but not on the correctly operating ones. To test it out, I compiled some code from stackoverflow with “g++ -mavx”. Sure enough, running the binary produced an Illegal Instruction.

From my perspective as an instance user, the processor was lying, claiming to support AVX but actually crashing when any AVX code would run.

Analysis

Turns out the actual answer was subtle. Per the Intel manual, it is possible for the operating system to disable AVX instructions by disabling the processor’s OSXSAVE feature. By the spec, any application wishing to use AVX first must check if OSXSAVE is enabled.

Amazon seems to have disabled the OSXSAVE feature at the hypervisor layer on their new Xeon E5-2665 based m2.* series of instances. This may just be because their version of the Xen hypervisor that manages these instances lacks support for handling AVX registers in context switching. But even if support does exist in the hypervisor, it makes sense to disable AVX for the m2.* family as long as there are Xeon X5550 based instances. Imagine compiling a program on an m2.xlarge EBS instance, thinking you had AVX support, and then upon stopping/starting the instance, finding that the program crashes, because your instance now runs on older hardware that doesn’t have AVX support! A downside of VM migration is that all your hardware must advertise the least common denominator of capabilities.

Unfortunately, Amazon did not ensure that the Guest OS saw that OSXSAVE was disabled. This led to MKL thinking it had the capabilities to run AVX code, when it actually didn’t.

Ultimately, there was not much to do but:

  1. Given how rare the Xeon E5-2665 instances are, we now simply self-destruct if an m2.*’s /proc/cpuinfo claims that both avx and xsave are enabled
  2. File a support case with Amazon. They have been quite responsive and as I publish this post, it seems that a fix has at least been partially pushed.

So, if you use instances in the m2.* family, be sure to check /proc/cpuinfo. If the instance claims it has both avx and xsave, it is probably lying to you.
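
A quick sketch of that check in Python (it parses only the first flags line and assumes all cores report the same flags):

def claims_avx_and_xsave(path='/proc/cpuinfo'):
    with open(path) as f:
        for line in f:
            if line.startswith('flags'):
                flags = line.split(':', 1)[1].split()
                return 'avx' in flags and 'xsave' in flags
    return False

print claims_avx_and_xsave()  # True on an affected m2.* instance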

Alternatively, if you are doing high performance computation in the cloud, you may just want to pass on the responsibility for such dirty details to us at PiCloud. :)

Introducing the PiCloud Notebook

December 23rd, 2012 by Ken Elkabany

The PiCloud Notebook is a Python console in your browser optimized for data analysis and collaboration. Unlike a traditional console, a notebook lets you intermingle code and descriptive text. The best way to get a feel for how it works is to see it:



We’re big fans of IPython, and those who are familiar with it will immediately recognize the console as an IPython Notebook running on the PiCloud Platform. While we believe that in general developers will continue to code on their local machine for convenience, there are several advantages to using a cloud-based notebook:

  • Your research is accessible from anywhere.
  • You can share your work with collaborators.
  • You can work efficiently with your data stored on PiCloud or AWS.
  • You can get comfortable with the environment your jobs run in.

Getting Started

Log into your account (sign up and get 20 free core hours), and click on the Notebook tab. This brings you to your Notebook Machine.

Think of your Notebook Machine as a computer dedicated to you sitting on the cloud. When you open a notebook, your Notebook Machine is started if it isn’t already. And when you’re done, you should shut down the machine to avoid wasting resources.

By default, a notebook named “Primer” is available, which will walk you through the examples in the introductory section of our documentation called Primer. Click on it to open your first notebook.

Use your PiCloud account password when prompted.

Using a Notebook

Cell

Each box in your notebook is called a cell. Cells can hold code or various types of markup (Markdown, headers, or raw text). Unlike in a traditional Python console, pressing enter in a cell will create a new line, but not execute it. This makes it easy to write multi-line functions and sequences of commands.

Executing Code

When you want to execute a cell, whether it’s code or markup, press shift+enter. Note the number indicating the order of execution on the left side of the cell, “In [X]”. Unlike a traditional console, you can execute and re-execute a cell at any time, and cells are thus not necessarily in order of execution. While it takes some time to get used to, it’s very handy when you’re continuously iterating on your code.

Executing Shell Commands

While the primary use case of the notebook is for writing Python code, it’s also convenient for running shell commands. To execute a shell command, just prefix a command with “!”. For example, you can run “!ls”, “!pwd”, or even the PiCloud command-line interface (CLI), “!picloud”.



Visualization

Your notebook can also display rich media including images, graphs, videos, and more!



What else?

For more tips and tricks, see the following resources:

What is the Notebook Machine Exactly?

Your Notebook Machine is more than a web application. It’s a full-blown Linux system that can be accessed through the notebook console. We’re able to offer this by leveraging the same job abstraction we’ve already devoted so much effort to. In fact, the machine is running as a job. You can see the job id of your notebook machine from the notebook tab.

How is it Implemented?

As mentioned previously, our notebook is an IPython Notebook running on PiCloud. The notebook is run as a job in your account. For those of you familiar with our platform, you may be wondering how you’re able to connect to the notebook web server running in the job. The answer is that we’ve just released a feature that allows jobs to open listening sockets which can accept external connections. This opens up numerous possibilities including using sockets for job to job communication, as well as hosting web applications.
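
For instance, here’s a minimal sketch of a job that accepts external connections using nothing but the standard library (the port is arbitrary; you’d look up the externally exposed host and port from the jobs panel, as with any listening port):

import socket

import cloud

def serve_hello():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('0.0.0.0', 8000))  # arbitrary port inside the job's container
    s.listen(5)
    while True:
        conn, addr = s.accept()
        conn.sendall('hello from a PiCloud job\n')
        conn.close()

jid = cloud.call(serve_hello)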

Using the Notebook to Live the Life of a Job

Another advantage of a notebook is that it allows you to explore the system that a job sees. You can do the following:

  • Peek around the filesystem.
  • Import Python libraries to check availability and version.
  • Run non-Python programs.
  • Verify that your custom environment is working as expected.
  • Verify that your volumes are mounted in the way you intended.
  • Benchmark workloads interactively.

Refer to our documentation on configuring your notebook machine for more information regarding mimicking a job’s usage of volumes, environments, multicore, and core types.

Collaboration

If you want to give collaborators access to use your notebook machine, set a secondary password. Then provide the URL of your notebook to your collaborators. They should use the secondary password when prompted. We don’t want you to hand out your primary account password to others.

You can also send a notebook to a collaborator by downloading it (File -> Download as) and manually sending it.

Suggestions?

We’re really excited about this latest addition to the PiCloud Platform. If you have any ideas, let us know!