Posts Tagged ‘new features’

Introducing Queues — Creating a Pipeline in the Cloud

Wednesday, April 3rd, 2013

Queues provide an interface for Dataflow Programming that is built on top of our job system.

While a distributed queue data structure with push, pop, and ack capabilities is provided, the key benefit is the ability to attach a handler to a queue for scalable processing of a queue’s messages. The handler in turn can feed its output messages to other queues.

In other words, you’re probably used to the queue data structure:

Overview of Queue

Our queues link the data structure with a message handler, f, which we call an attachment:

Overview of Queue

f(msg) is any Python function you define that takes one message at a time from the input queue as its single argument. Its return value is pushed into the output queue.

By the end of this post, you’ll be able to:

  • Create a distributed, fault-tolerant pipeline of queues and processors.
  • Scale each component to achieve a throughput of thousands of messages per second.
  • See it all through automatically-generated visualizations.
  • Pay only when you have messages in your pipeline.
  • Do it all, with only Python, and not a single server.

If you’re a developer who just wants to RTFM, see our documentation.

Diving In

Let’s see queues in action. You’ll need to have the latest client installed, released today (4/3).

First, let’s push and pop from a queue in your console to get comfortable:

>>> # import our library
>>> import cloud
>>> q = cloud.queue.get('numbers')
>>> # adds 3 messages to the queue
>>> q.push([1,2,3])
>>> # pops up to 10 messages
>>> q.pop()
[2, 1, 3]

Note that the queue did not dequeue in perfect FIFO order; this is expected. Our queues are designed for high throughput and high parallelism with minimal queue backlog, which makes guaranteed FIFO behavior less relevant.

Attach

Now let’s attach a function that increments all numbers in the input queue:

# declare the input and output queue
input_q = cloud.queue.get('numbers')
output_q = cloud.queue.get('bigger-numbers')

# create handler function
def increment(x):
    return x + 1

# attach the handler to the queue
input_q.attach(increment, output_q)

How did that work? We’re using the same automagic dependency transfer we use in our job system to send your increment function to us along with any dependencies it might have.

Visualization

From the Queues Dashboard, we can see an auto-generated layout of our pipeline based on the attachment we made:

Pipeline Visualization

Message Processing by Attachment

Let’s increment 1,000 numbers:

# range(1000) is a list of numbers from 0 to 999
input_q.push(range(1000))

In the background, our systems have created a job (visible in your Job Dashboard) that applies the increment function to every number in the numbers queue, and outputs the result to the bigger-numbers queue. If you’re unfamiliar with our job framework, don’t worry, queues abstract away most of the details. But, if you’re interested, see our Primer.

After ~10 seconds, you’ll see that all messages have been processed. On the live throughput chart, a single point represents the average throughput over a 10s window; the interpolation lines are cosmetic. Below, the single point at 100 msgs/second indicates that 1,000 messages were processed during that window. In actuality, we got roughly 150 msgs/second for 7 seconds.

Throughput Chart

As a sanity check, we can also check the size of the queues:

>>> input_q.count()
0
>>> output_q.count()
1000

Increasing Throughput

What if you want to increase your throughput past 150 msgs/second? Set max_parallel_jobs for the attachment. You can do this from the Queue Dashboard or from the attach call:

# attach the handler to the queue
input_q.attach(increment, output_q, _max_parallel_jobs=5)

Now, assuming there are messages in the queue, you’ll see a throughput of 750 msgs/second!

Throughput Chart

If you click on “view” jobs, you can see a list of the five jobs attached to the queue. For those familiar with our job processing framework, you can now see that our Queue processors are built on top of jobs.

The takeaway: you just increased your throughput 5-fold by changing one number, and without any server management or scaling.

Creating an Image Processing Pipeline

To showcase the power of queues, we’re going to create the following pipeline:

Image Pipeline Visualization

This is an auto-generated visualization available in the Queues Dashboard. Rectangles are queues; circles are attachments.

The inputs to this pipeline are URLs, which should be pushed to the img-urls queue. The pipeline downloads each image and performs the following image operations:

  • Resize to 150px (thumbnail)
  • Resize to 400px (medium size)
  • Apply a Sepia-tone filter to the medium size image

For each generated image, an HTTP callback is made to an endpoint of your choice. Note that separating the three image operations into three attachments with different input queues isn’t the most efficient (you’d probably want to combine them into one operation), but it’s done for illustrative purposes.

You can download this pipeline from our repository: basic-example/queue/imgpipeline/pipeline.py

Step 1: Scraping Images

We’re going to use the following function as our attachment:

import os
import Image
import urllib2
from StringIO import StringIO

import cloud

def scrape_to_bucket(target):
    """Downloads image from url, and saves to bucket. *target* should
    be a dict with keys id (image id), and url (location of image).

    Returns a dict with keys id (image id), path (obj key), and
    transforms (empty list)."""

    id = target['id']
    url = target['url']

    # path to save image in bucket
    obj_path = 'imgs/{id}/original.png'.format(id=id)

    # extract extension from url
    ext = os.path.splitext(url)[-1]

    # open connection to image
    u = urllib2.urlopen(url)

    # if the image isn't a png, convert it to png
    if ext.lower() != '.png':
        i = Image.open(StringIO(u.read()))
        data = StringIO()
        i.save(data, 'png')
        data = data.getvalue()
    else:
        data = u.read()

    u.close()

    # add image to bucket
    cloud.bucket.putf(data, obj_path)

    return {'id': id,
            'path': obj_path,
            'transforms': []}

If you’re unfamiliar with Buckets, just think of them as a key->value object store. We use it here to conveniently retrieve and store objects to and from memory. However, buckets are not necessary, and are completely unrelated to queues. You can modify scrape_to_bucket() so it saves images into your own Amazon S3 account, database, or anywhere else.
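
For instance, a variant that writes to your own S3 account using the boto library might look like the sketch below; the bucket name is a placeholder, and it assumes your AWS credentials are already configured for boto.

import boto

def save_to_s3(data, obj_path):
    """Hypothetical replacement for the cloud.bucket.putf() call above:
    store the PNG bytes in your own S3 bucket instead."""
    conn = boto.connect_s3()                     # reads AWS credentials from your boto config
    bucket = conn.get_bucket('my-image-bucket')  # placeholder bucket name
    key = bucket.new_key(obj_path)
    key.set_contents_from_string(data)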

Here’s a sample input message we’ll use to demonstrate each operation:

{
 'id': 1,
 'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'
}

The url points to an image of the Obama family:
Obama Family Full Size

Per the source code, the above image will be saved in your bucket. The output message pushed to the thumbnail and medium queues will be:

{
 'id': 1,
 'path': 'imgs/1/original.png',
 'transforms': []
}

You can verify this works by simply running the function on your own machine:

>>> msg = {'id': 1,
...        'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'}
>>> scrape_to_bucket(msg)
{'id': 1, 'path': 'imgs/1/original.png', 'transforms': []}

This is another advantage of queues. Because your function doesn’t need to be modified in any way to be an attachment, you can test it locally just as easily as on the cloud.

Handling Exceptions

What if the message handler throws an Exception? Maybe the URL was temporarily unavailable, but you’d like to retry it in 60 seconds. Using retry_on, max_retries, and retry_delay, you can specify which Exceptions you’d like to retry, the number of times to retry, and the amount of time between each attempt.

import urllib2

q = cloud.queue.get('img-urls')
output_qs = cloud.queue.get('thumbnail'), cloud.queue.get('medium')
bad_urls_q = cloud.queue.get('bad-urls')

q.attach(scrape_to_bucket,
         output_qs,
         retry_on=[urllib2.HTTPError, urllib2.URLError],
         max_retries=3,
         retry_delay=60,
         on_error={Exception: {'queue': bad_urls_q}})

Using the on_error keyword, the bad-urls queue will be sent messages that raised non-retryable Exceptions, as well as messages that failed even after three retries. Error messages generated by on_error include the triggered exception and the associated traceback.

You can confirm that your attachment has been set up as intended with the visualization.

Scrape Attachment

Tweaking Performance with Multi-threading

Because this scraping attachment spends most of its time waiting for network data transfer, and is thus I/O bound, it won’t effectively utilize the core it’s running on. The workaround is to run multiple, identical readers within the same job.

To do this, set the readers_per_job keyword to the number of simultaneous threads you want running in a job. The default value is 1. Do not confuse this with max_parallel_jobs, which controls the number of jobs that may be running. For this example, we’ll set the value to 4.

import urllib2

q = cloud.queue.get('img-urls')
output_qs = cloud.queue.get('thumbnail'), cloud.queue.get('medium')
bad_urls_q = cloud.queue.get('bad-urls')

q.attach(scrape_to_bucket,
         output_qs,
         retry_on=[urllib2.HTTPError, urllib2.URLError],
         max_retries=3,
         retry_delay=60,
         on_error={Exception: {'queue': bad_urls_q}},
         readers_per_job=4)
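
With the attachment in place, you can kick off the pipeline by pushing the sample message from earlier into the img-urls queue, using the same push call introduced at the top of this post:

# feed the pipeline its first message
q = cloud.queue.get('img-urls')
q.push([{'id': 1,
         'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'}])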

Step 2: Resizing Images

We’re going to attach handlers to the thumbnail and medium queues to resize images to 150px and 400px, respectively. To ease the storage and retrieval of images as PIL Image objects, we’re going to use a custom-defined ImageOperation class.


class ImageOperation(object):
    """Base class for Message Handlers in Image Pipeline.

    Retrieves images from bucket, performs in-memory manipulation
    with PIL object, stores result back in bucket, and then
    outputs message with additional transform listed.

    Override operation() for custom operation."""

    name = 'identity'

    def get_image_from_bucket(self, obj_path):
        """Given *obj_path* in bucket, returns PIL Image object"""

        # get image data as string of raw bytes
        data = cloud.bucket.getf(obj_path).read()

        return Image.open(StringIO(data))

    def put_image_in_bucket(self, img, obj_path):
        """Given PIL image *img*, saves it to *obj_path* in bucket"""

        output_data = StringIO()

        # write raw image bytes to StringIO
        img.save(output_data, 'png')

        # store the image file in your bucket
        cloud.bucket.putf(output_data.getvalue(), obj_path)

    def add_modifier_to_key(self, obj_path):
        """Returns new *obj_path* that includes name of transform"""

        obj_key, obj_ext = os.path.splitext(obj_path)
        obj_path = '{key}.{name}.png'.format(key=obj_key,
                                             name=self.name)
        return obj_path

    def message_handler(self, msg):
        """Entry point for message handling. Do not override."""

        img = self.get_image_from_bucket(msg['path'])

        # apply image operation
        new_img = self.operation(img)

        msg['path'] = self.add_modifier_to_key(msg['path'])
        msg['transforms'].append(self.name)

        self.put_image_in_bucket(new_img, msg['path'])

        return msg

    def operation(self, img):
        """Method to replace for custom operation"""

        return img

Since we support instances as message handlers, we’ll subclass ImageOperation to make two message handlers: ImageThumbnail and ImageMediumSize.

class ImageThumbnail(ImageOperation):

    name = 'thumb'

    def operation(self, img):
        """Returns a thumbnail of the *img*"""

        img.thumbnail((150, 150), Image.ANTIALIAS)
        return img

class ImageMediumSize(ImageOperation):

    name = 'med'

    def operation(self, img):
        """Returns a 400px version of the *img*"""

        img.thumbnail((400, 400), Image.ANTIALIAS)
        return img

Now we’ll attach instances of these classes to their respective input queues.

thumbnail_q = cloud.queue.get('thumbnail')
thumbnail_q.attach(ImageThumbnail(), [callback_q])

medium_q = cloud.queue.get('medium')
medium_q.attach(ImageMediumSize(), [sepia_q, callback_q])

If you pushed the example message of the Obama family to img-urls, then there are already messages waiting in the thumbnail and medium queues. Once processed, these two objects will appear in your bucket:

imgs/1/original.thumb.png

Obama Family Thumbnail

imgs/1/original.med.png

Obama Family Medium Size
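
If you want to check the results yourself, you can pull either object back down with the same getf call the handlers use, and write it to a local file:

# download the generated thumbnail from your bucket
data = cloud.bucket.getf('imgs/1/original.thumb.png').read()
with open('original.thumb.png', 'wb') as f:
    f.write(data)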

Step 3: Sepia Tone

The final image operation is a sepia-tone filter after the medium-size downscale operation.

import ImageOps  # old-style PIL import (matching 'import Image' above); used in operation() below

class ImageSepia(ImageOperation):
    """Applies Sepia Filter.
    Based on: http://effbot.org/zone/pil-sepia.htm"""

    name = 'sepia'

    def __init__(self):
        self.sepia_palette = self.make_linear_ramp()

    @staticmethod
    def make_linear_ramp():
        """Generate a palette in a format acceptable for `putpalette`,
        which expects [r,g,b,r,g,b,...]"""

        ramp = []
        r, g, b = 255, 220, 162 

        for i in range(255):
            ramp.extend((r*i/255, g*i/255, b*i/255))

        return ramp

    def operation(self, img):
        """Returns a version of the *img* with Sepia applied
        for a vintage look."""

        # convert to grayscale
        orig_mode = img.mode
        if orig_mode != "L":
            img = img.convert("L")

        img = ImageOps.autocontrast(img)

        # apply sepia palette
        img.putpalette(self.sepia_palette)

        # convert back to its original mode
        if orig_mode != "L":
            img = img.convert(orig_mode)

        return img

Attaching:

sepia_q = cloud.queue.get('sepia')
sepia_q.attach(ImageSepia(), [callback_q])

Once again, if you pushed the sample message, there should already be a message waiting in the sepia queue. The image output to your bucket is:

Obama Family Sepia

Step 4: Callback

Each image operation outputs a message to the callback queue. You probably want your callback handler to:

  • Write to your database that the image is ready
  • Make a POST request to your website for instant notification
  • Store the image somewhere else

For simplicity, we’ll have the callback handler set the image object in your bucket as public, so that it’s accessible by anyone. Based on the above examples, the following should be straightforward:

def callback(msg):
    print msg
    cloud.bucket.make_public(msg['path'])

callback_q = cloud.queue.get('callback')
callback_q.attach(callback)
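
If you’d rather have the POST-request variant from the list above, the handler might look like the sketch below; the endpoint URL is a placeholder, and it assumes your server accepts a form-encoded POST. You would attach it to the callback queue in place of the simpler callback above.

import urllib
import urllib2

def notify_website(msg):
    """Hypothetical callback: POST the finished image's bucket path
    to an endpoint of your choosing."""
    data = urllib.urlencode({'id': msg['id'],
                             'path': msg['path'],
                             'transforms': ','.join(msg['transforms'])})
    urllib2.urlopen('http://example.com/image-ready', data).read()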

Debugging Attachments

You may have noticed that in callback(msg), we did a print msg. How would you see standard output for an attachment? The same way you would for a job—by clicking on it in the Jobs Dashboard. To know what jobs are running your attachments, click “view” jobs from the Queues Dashboard, which will take you to the Jobs Dashboard filtered for your attachment.

Using this method, you’ll get access to all the information you’re accustomed to with jobs, including a realtime feed of CPU, memory, and disk usage.

Scaling Up

With just a few tweaks, I was able to get a system throughput greater than 150 images per second. I set max_parallel_jobs to 20 for the scraping step (10 readers_per_job, c2 core), and 30 for all image operation steps. Also, I set the image operation steps to use the f2 core for faster processing.

Here’s a screenshot of the Queue Dashboard in action as I was testing (doesn’t show max throughput). Note how the dequeue rate is able to keep up with the enqueue rate, which is precisely what we want.

Queue Dashboard Screenshot

Pricing

We charge for queues based on the amount of time jobs spend processing messages. Assuming you have a steady stream of messages, the maximum you’ll pay in an hour is:

max_parallel_jobs x cost per core hour

You can find the cost per core hour based on the core type you’ve chosen from our pricing page. If your queue is empty, no jobs will be running, and you won’t pay a thing!
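
For example, an attachment capped at max_parallel_jobs=5 running on f2 cores ($0.22/core/hour) would cost at most 5 x $0.22 = $1.10 per hour, and only while there are messages to process.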

Conclusion: Let Us Clean Your Pipes

Letting us manage the full pipeline—the queues, and the processing of messages—has several advantages:

  • No Servers: You won’t have to configure or deploy a single server for storage or processing.
  • Faster Development: It takes only a few lines of code to “use the cloud” for a pipeline.
  • Reliability: Our queues and workers are distributed and replicated across multiple datacenters (AWS availability zones), and even a server failure won’t jeopardize your messages.
  • Scale Effortlessly: Tell us how many cores you want to put to work, and we make it so.
  • Cut Costs: You only pay for processing when there are messages. No idling servers.
  • Monitoring & Analytics: Take advantage of our queue analytics, and the same job monitoring interface that powers our standard service.

If you’re ready to give it a try, sign up now, and get 20 free core hours. Happy coding!

Introducing the PiCloud Notebook

Sunday, December 23rd, 2012

The PiCloud Notebook is a Python console in your browser optimized for data analysis and collaboration. Unlike a traditional console, a notebook lets you intermingle code and descriptive text. The best way to get a feel for how it works is to see it:



We’re big fans of IPython, and those who are familiar with it will immediately recognize the console as an IPython Notebook running on the PiCloud Platform. While we believe that in general developers will continue to code on their local machine for convenience, there are several advantages to using a cloud-based notebook:

  • Your research is accessible from anywhere.
  • You can share your work with collaborators.
  • You can work efficiently with your data stored on PiCloud or AWS.
  • You can get comfortable with the environment your jobs run in.

Getting Started

Log into your account (sign up and get 20 free core hours), and click on the Notebook tab. This brings you to your Notebook Machine.

Think of your Notebook Machine as a computer dedicated to you, sitting on the cloud. When you open a notebook, your Notebook Machine is started if it isn’t already running. And when you’re done, you should shut down the machine to avoid wasting resources.

By default, a notebook named “Primer” is available, which will walk you through the examples in the introductory section of our documentation called Primer. Click on it to open your first notebook.

Use your PiCloud account password when prompted.

Using a Notebook

Cell

Each box in your notebook is called a cell. Cells can hold code, or various types of markup (Markdown, headers, or raw text). Unlike in a traditional Python console, pressing enter in a cell will create a new line, but not execute it. This makes it easy to write multi-line functions and sequences of commands.

Executing Code

When you want to execute a cell, whether it’s code or markup, press shift+enter. Note the number indicating the order of execution on the left side of the cell, “In [X]”. Unlike a traditional console, you can execute and re-execute a cell at any time, so cells are not necessarily displayed in their order of execution. While it takes some time to get used to, it’s very handy when you’re continuously iterating on your code.

Executing Shell Commands

While the primary use case of the notebook is for writing Python code, it’s also convenient for running shell commands. To execute a shell command, just prefix a command with “!”. For example, you can run “!ls”, “!pwd”, or even the PiCloud command-line interface (CLI), “!picloud”.



Visualization

Your notebook can also display rich media including images, graphs, videos, and more!
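
For instance, with inline plotting enabled (e.g., by running %pylab inline in a cell), a quick matplotlib plot renders directly beneath the cell; the snippet below is just an illustrative sketch:

import numpy as np
import matplotlib.pyplot as plt

# plot a sine wave; the figure appears inline below the cell
x = np.linspace(0, 10, 200)
plt.plot(x, np.sin(x))
plt.title('rendered inline in the notebook')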



What else?

For more tips and tricks, see the following resources:

What is the Notebook Machine Exactly?

Your Notebook Machine is more than a web application. It’s a full-blown Linux system that can be accessed through the notebook console. We’re able to offer this by leveraging the same job abstraction we’ve already devoted so much effort to. In fact, the machine is running as a job. You can see the job id of your notebook machine from the notebook tab.

How is it Implemented?

As mentioned previously, our notebook is an IPython Notebook running on PiCloud. The notebook is run as a job in your account. For those of you familiar with our platform, you may be wondering how you’re able to connect to the notebook web server running in the job. The answer is that we’ve just released a feature that allows jobs to open listening sockets which can accept external connections. This opens up numerous possibilities including using sockets for job to job communication, as well as hosting web applications.
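
As a rough illustration of the new sockets capability, a job could open a listening socket along the lines of the sketch below; the port is arbitrary, and how a client discovers the job’s public address isn’t covered here.

import socket
import cloud

def echo_once():
    """Minimal sketch: listen inside a job and echo back one message."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('0.0.0.0', 8000))   # arbitrary port for illustration
    s.listen(1)
    conn, addr = s.accept()
    conn.sendall(conn.recv(1024))
    conn.close()
    s.close()

jid = cloud.call(echo_once)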

Using the Notebook to Live the Life of a Job

Another advantage of a notebook is that it lets you explore the system a job sees. You can do the following:

  • Peek around the filesystem.
  • Import Python libraries to check availability and version.
  • Run non-Python programs.
  • Verify that your custom environment is working as expected.
  • Verify that your volumes are mounted in the way you intended.
  • Benchmark workloads interactively.

Refer to our documentation on configuring your notebook machine for more information regarding mimicking a job’s usage of volumes, environments, multicore, and core types.

Collaboration

If you want to give collaborators access to use your notebook machine, set a secondary password. Then provide the URL of your notebook to your collaborators. They should use the secondary password when prompted. We don’t want you to hand out your primary account password to others.

You can also send a notebook to a collaborator by downloading it (File -> Download as) and manually sending it.

Suggestions?

We’re really excited about this latest addition to the PiCloud Platform. If you have any ideas, let us know!

Real-time Data Feed for Jobs and More!

Tuesday, September 18th, 2012

We’ve just completed a major overhaul of the job dashboard. The overhaul marks a fundamental shift in our belief of what information you should be able to see about your job. We had originally striven for minimalism, revealing only what we thought was necessary: standard output & error, exception traceback, and profile. After all, if software ran bug free, you wouldn’t need anything else. But as our users kept reminding us with support tickets, when things aren’t working, there is exceptional value to under-the-hood data.

This post covers the new visibility we offer into your computation. You can see it in action by running a new job, and viewing it through the Job Dashboard.

Count of Memory Allocation Failures

We now report the number of times a memory allocation was requested, but failed, under the “MemAlloc Fails” column.

If you haven’t run out of memory on PiCloud before, the count may not seem all that important. But up until now, users would generally get a cryptic “None” Exception when they ran out of memory, indicating that their process had unexpectedly died. The reason is that most programs don’t handle out-of-memory errors gracefully, typically segfaulting instead. Now if your program crashes, it’s easy to check for allocation failures, the most likely culprit.

Once you know you’re running out of memory, you can take advantage of our other features. Try using a core type with more RAM, such as our f2 core, or take advantage of our new multicore support.
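
Both remedies are one keyword away, using the _type and _cores arguments covered in the posts below:

# rerun func, your memory-hungry function, on a core type with more RAM
jid = cloud.call(func, _type='f2')

# or pool several cores to pool their RAM
jid = cloud.call(func, _type='f2', _cores=2)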

CPU Usage

A common question is how effectively a job is utilizing a CPU. If your job is supposed to be crunching numbers, but is only using the CPU 30% of its allotted time, then you probably have an unsuspected bottleneck. Before this update, there was really no way to know, other than profiling the job locally.

Now, you can access three different views of a job’s CPU usage to give you maximum visibility.

Total Usage

Runtime refers to wall-clock time. CPU Time is divided into two categories based on whether time is being spent in user or kernel space. With the data above, we can determine that the CPU was being utilized (7613.14+280.85)/8239.6 = 95.8% of the time. But where’d the other 4% go?

Usage over Time

Using the following chart, we can see what happened.

This hypothetical job was at 99%+ utilization for the majority of the time. However, it spent the first 500 seconds loading data from external sources before crunching it. Depending on your job, a graph like this might look ideal, or it might be an impetus to reduce the data loading time.

The chart is generated by taking a snapshot of a job’s CPU Time every 30 seconds.

Realtime Usage

If you select a job from the Job Dashboard, while it is processing, you’ll be able to see a realtime graph of CPU usage in the right hand column. The graph actively samples your running job every few seconds.

This soothes the anxiety some developers feel when their precious algorithms are all grown up and running on the cloud. Never again will they anxiously wonder, “how’s my job doing?”

Memory Usage

We offer a similar set of views for a job’s memory usage, as we do for its CPU usage. While swap usage is shown, no job can currently use swap.

Peak Usage

Peak usage is the most amount of memory that was ever allocated for the job. Peak usage is viewable while the job is running, or after it has completed.

Current Usage

Current usage is the amount of memory currently being used by the job. Current usage can only be seen for a job that is currently being processed.

Usage over Time

Extending the example of the hypothetical job from the CPU Usage section, we can confirm that the first 500 seconds were spent loading data into memory.

Realtime Usage

Not shown for brevity. Just like the Realtime CPU Usage chart, you can see the memory usage of a job, while it is processing.

Python Logging

The Python logging module is popular for its simplicity and extensibility. Until now, users have had to direct their loggers to standard output or error to have their messages retained by PiCloud. We now automatically capture your logger messages and store them separately for later viewing.
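
For example, a job that logs through a module-level logger needs nothing extra for its messages to be retained; this is a minimal sketch, and the logger name is arbitrary:

import logging
import cloud

logger = logging.getLogger('my_job')

def crunch(x):
    logger.info('starting work on %s', x)
    result = x * x
    logger.info('finished with %s', result)
    return result

jid = cloud.call(crunch, 7)

Locally, you would still attach a handler (e.g., with logging.basicConfig()) to see these messages; on PiCloud they are captured for you.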

Pi Log

As your job runs, our system sometimes has comments for you. For example, if you spawn subprocesses, it will notify you that you cannot take advantage of our persistent processes optimization. Before, we used to write these messages to standard error, but that unnecessarily worried some users, and others complained that it polluted their output. Now, we have a separate section specifically for messages from PiCloud.

System Log

Since more users are using PiCloud for non-Python computation, we have decided to expose the Syslog associated with a job. If your job is segfaulting, and there are no indications why, this is the place to look.

In the Works

We plan to offer metrics for disk and network usage.

Suggestions?

Need more visibility? Let us know by leaving a comment!

Introducing Multicore Support

Friday, August 31st, 2012

Up until today, each job run on PiCloud has been limited to using only a single core. For those familiar with the Python global interpreter lock (GIL), this may not seem like a big deal at first. But as our users have let us know, the limitation is acute for the following reasons:

  1. Many performance-focused Python libraries including numpy release the GIL whenever possible, which means that even Python programs can leverage multiple cores.
  2. With the release of Environments, many of our users are running non-Python multithreaded programs. Some of those can use as many cores as we can throw at them.
  3. The most RAM a single core has is 8GB (m1). Up until now, a single job couldn’t break this limit. But now, you can pool multiple cores together to get access to more RAM.

How do I use it?

All you have to do is use the _cores keyword argument.

# uses 4 cores
job_id = cloud.call(func, _type='f2', _cores=4)

# works the same for map jobs
job_ids = cloud.map(func, datapoints, _type='f2', _cores=4)

Each job gets 4 f2 cores of processing power, and 14.8GB (4 cores x 3.7GB per f2 core) of RAM. We use the f2 core because, as the next section shows, the default c1 core does not support the new multicore feature.

How many cores per job?

The number depends on the type of core you select.

Core Type    Supported Multiples
c1           1 core
c2           1, 2, 4, or 8 cores
f2           1, 2, 4, 8, or 16 cores
m1           1 or 2 cores
s1           1 core

How much?

Per our pricing page, a job using a single f2 core would cost $0.22/hour. A job using two f2 cores would cost $0.44/hour. In other words, the cost per core has stayed the same, and there are no additional fees. You’re still charged by the millisecond.

Suggestions?

Multicore not enough for you? Let us know by leaving a comment.

Introducing the f2 Core – Our Fastest Core Yet!

Friday, June 15th, 2012

If you’ve been craving more speed, you’ll want to give our brand new f2 core a spin!

How fast?

The f2 core is rated at a whopping 5.5 compute units, which is 69% more than our previous leader, the m1 core with 3.25 compute units.

How about memory?

f2 cores have access to 3.7 GB of RAM, which is a much-needed middle ground between the 800 MB available on a c2 core and the 8 GB available on an m1 core.

How do I use it?

All you’ll need to do is set the _type keyword.

cloud.call(func, _type='f2')

How much?

At $0.22/core/hour, it’s got 120% more compute units than a c2 core, but only a 69% price premium.

Suggestions?

With our collection of five core types, there’s something for everyone! If you have ideas for another core type, or want to tell us how important a core with a GPU is to you, please leave a comment.

Introducing Scraping-Optimized Cores

Monday, October 10th, 2011

For users who aggregate data from around the web, you’ll find our latest core to be an integral part of your toolbox. s1 cores are comparable in performance to c1 cores with one extra trick up their sleeve: each job running in parallel will have a different IP.

Why is this important?

Using unique IPs is necessary to minimize the automated throttling most sites will impose when seeing fast, repeated access from a single IP.

How do I use it?

If you’re already using our c1 cores, all you’ll need to do is set the _type keyword.

cloud.call(func, _type='s1')

How much?

$0.04/core/hour

Why don’t other cores have individual IPs?

For other core types, such as c2, multiple cores may be running on a single machine that is assigned only a single IP address. When using s1 cores, you’re guaranteed that each core sits on a different machine.

Suggestions?

We’re excited to move the s1 core type out of beta for our customers. If you have any suggestions for other core types you would like to see, please let us know.

Introducing Environments — Run Anything on PiCloud

Monday, September 26th, 2011

Environments mark an important milestone for PiCloud. Whereas Function Publishing makes the computing power of PiCloud accessible to all programming languages, Environments enable you to use any library or binary you need in your computation. The possibilities are limitless, but here are a couple of examples of what you can do with environments:

  1. Install any non-Python software package you need via apt-get or make.
  2. Install any Python module that we do not automatically extract from your machine, typically one that requires compilation or depends on external libraries.

In this post, we’ll show you how to create and use your first environment. We’ll be installing the ObsPy package, which is a Python toolbox for processing seismological data.

Why Environments?

We strive to make moving your computation to the cloud as easy as possible. That’s why our cloud Python package automatically detects and transfers dependencies over to our cloud.

import cloud
from your_expansive_library_of_functions import complex_function
# cloud.call transfers all the modules needed to run complex_function on PiCloud
cloud.call(complex_function)

Unfortunately, automatic dependency transfer only works for pure Python modules. The ObsPy package requires both a .pth file and C-code compilation for proper operation. So the following simple function quickly runs into problems:

def simple_function():
    import obspy
>>> jid = cloud.call(simple_function)
>>> cloud.result(jid)
[Mon Sep 19 16:39:13 2011] - [WARNING] - Cloud: Job 1337 threw exception:
 Could not depickle job
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/cloud/serialization/cloudpickle.py", line 679, in subimport
    __import__(name)
ImportError: No module named obspy

Importing ObsPy fails because it could not be transferred to PiCloud in working form. You might be wondering how you’ve been able to use NumPy, SciPy, and other natively-compiled libraries on PiCloud. The answer is that we have many libraries pre-installed on our systems. Here are the respective links for what we have pre-installed for Python 2.6 and Python 2.7.

Creating a new Environment

Step 1: Go to the Environments tab in the Control Panel.



Step 2: Click “create new environment”.



A popup box will appear. The Base Environment option allows you to choose what distribution of Ubuntu Linux you would like to use as the base filesystem. It’s important to understand why we give you this option. If you use Python 2.7 on your local machine to offload computation to PiCloud, we will run your functions in the Python 2.7 interpreter from the Ubuntu 11.04 (Natty) base. If you use Python 2.6 on your local machine to offload computation to PiCloud, we will run your functions in the Python 2.6 interpreter from the Ubuntu 10.10 (Maverick) base. We are consistent about which interpreter we use since the modules you install in your environment may compile against a specific version of Python. In short, if you’re using Python 2.6 on your machines, but you use the Natty base, or vice versa, you will most likely run into compatibility issues.

The Environment Name is the name you’ll use to reference the environment in your jobs. The Environment Description is for yourself and/or your team to keep track of the purpose and contents of each environment.

Step 3: Click submit.

When you click submit, your environment will appear under the “Environments being configured” tab. You may have to wait a minute or two while we boot and configure a server with the appropriate base environment for you.

For our example, we’ve named the environment seismology_env.

Connecting to your Environment Setup Server

When the server is ready, click the connect link. Note that the instructions are currently tailored towards *nix environments. If you are using Windows and do not have an SSH client, we recommend Tunnelier.



Download the private key we have generated for you. You will use this same private key for all future environment setup servers. SSH requires that only the file’s owner have access to it, which is why we instruct you to run chmod 400 privatekey.pem. Once you’ve done that, SSH into the provided server using the private key via the -i flag, as shown in the instructions.

Getting Around Your Environment

Once you’ve SSH-ed in, you’ll find yourself in a Ubuntu Linux filesystem environment.

picloud@ip-10-46-223-4:~$ ls /
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  sbin  selinux  srv  sys  tmp  usr  var

Your current working directory is /home/picloud:

picloud@ip-10-46-223-4:~$ pwd
/home/picloud

You can verify the distribution of Ubuntu you’re using:

picloud@ip-10-46-223-4:~$ cat /etc/issue
Ubuntu 11.04 \n \l

We give you sudo access so that you have the freedom to install anything anywhere.

# this does not produce an error
picloud@ip-10-46-223-4:~$ sudo touch /root/i_can_be_root

Important: The owner and group for files and directories in your environment do not matter. While you’ll be using the picloud and root user accounts, your jobs will be run with an entirely different user account that will have access to the entire filesystem environment.

Setting Up Your Environment

We’ll use sudo access to install the ObsPy library.

picloud@ip-10-46-223-4:~$ sudo pip install obspy.core obspy.signal
Downloading/unpacking obspy.core
  Downloading obspy.core-0.4.8.zip (186Kb): 186Kb downloaded
  Running setup.py egg_info for package obspy.core

    no previously-included directories found matching 'docs/other/*'
Downloading/unpacking obspy.signal
  Downloading obspy.signal-0.4.9.zip (4.0Mb): 4.0Mb downloaded
  Running setup.py egg_info for package obspy.signal

Requirement already satisfied (use --upgrade to upgrade): numpy>1.0.0 in /usr/local/lib/python2.7/dist-packages (from obspy.core)
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/local/lib/python2.7/dist-packages (from obspy.signal)
Installing collected packages: obspy.core, obspy.signal
  Running setup.py install for obspy.core

    no previously-included directories found matching 'docs/other/*'
    Skipping installation of /usr/local/lib/python2.7/dist-packages/obspy/__init__.py (namespace package)
    Installing /usr/local/lib/python2.7/dist-packages/obspy.core-0.4.8-nspkg.pth
    Installing obspy-runtests script to /usr/local/bin
  Running setup.py install for obspy.signal

    building 'libsignal' extension
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c obspy/signal/src/recstalta.c -o build/temp.linux-x86_64-2.7/obspy/signal/src/recstalta.o
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c obspy/signal/src/xcorr.c -o build/temp.linux-x86_64-2.7/obspy/signal/src/xcorr.o
    ...
I/usr/include/python2.7 -c obspy/signal/src/fft/fftpack_litemodule.c -o build/temp.linux-x86_64-2.7/obspy/signal/src/fft/fftpack_litemodule.o
    gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions build/temp.linux-x86_64-2.7/obspy/signal/src/recstalta.o build/temp.linux-x86_64-2.7/obspy/signal/src/xcorr.o build/temp.linux-x86_64-2.7/obspy/signal/src/coordtrans.o build/temp.linux-x86_64-2.7/obspy/signal/src/pk_mbaer.o build/temp.linux-x86_64-2.7/obspy/signal/src/filt_util.o build/temp.linux-x86_64-2.7/obspy/signal/src/arpicker.o build/temp.linux-x86_64-2.7/obspy/signal/src/bbfk.o build/temp.linux-x86_64-2.7/obspy/signal/src/fft/fftpack.o build/temp.linux-x86_64-2.7/obspy/signal/src/fft/fftpack_litemodule.o -o build/lib.linux-x86_64-2.7/obspy/signal/lib/libsignal.so
    Skipping installation of /usr/local/lib/python2.7/dist-packages/obspy/__init__.py (namespace package)
    Installing /usr/local/lib/python2.7/dist-packages/obspy.signal-0.4.9-nspkg.pth
Successfully installed obspy.core obspy.signal
Cleaning up...

As you can see, installing obspy.signal requires compiling C code with references to the NumPy library. We would not have been able to automatically extract this package from your machine.

Save the Environment

When you click “save” from the Environment Panel, your SSH connection will be closed. The length of time it takes to save your environment depends on how much you’ve installed. Once it’s ready, your new Environment will be listed under the “Your environments” section.

Using Your Environment

To use an environment, pass the _env keyword argument with the name of the environment you want to use. _env is valid for cloud.call, cloud.map, cloud.cron.register, and cloud.rest.publish.

To demonstrate, we will run a beamforming algorithm using the ObsPy library that we just installed. Beamforming is a technique used in seismology for geolocating seismic events. In this case, the event is the demolition of the AGFA skyscraper in Munich. We’ve derived the example from here.

Step 1: Upload the recorded dataset from the demolition to cloud.files

cloud.files.put('agfa.dump')

Step 2: Define the beamforming function.

Note that our version pulls the data from cloud.files directly into memory using the getf function.

import cloud
import pickle, urllib
from obspy.core import UTCDateTime
from obspy.signal.array_analysis import sonic
from obspy.signal import cornFreq2Paz

def beamforming(file_name):
    st = pickle.loads(cloud.files.getf(file_name).read())

    # Instrument correction to 1Hz corner frequency
    paz1hz = cornFreq2Paz(1.0, damp=0.707)
    st.simulate(paz_remove='self', paz_simulate=paz1hz)

    # Execute sonic
    kwargs = dict(
        # slowness grid: X min, X max, Y min, Y max, Slow Step
        sll_x=-3.0, slm_x=3.0, sll_y=-3.0, slm_y=3.0, sl_s=0.03,
        # sliding window properties
        win_len=1.0, win_frac=0.05,
        # frequency properties
        frqlow=1.0, frqhigh=8.0, prewhiten=0,
        # restrict output
        semb_thres=-1e9, vel_thres=-1e9, verbose=True, timestamp='mlabhour',
        stime=UTCDateTime("20080217110515"), etime=UTCDateTime("20080217110545")
    )

    return sonic(st, **kwargs)

Step 3: Run it on PiCloud using the _env keyword.

Since this is a computationally intensive task, we use the c2 core type to take advantage of 2.5 compute units of power.

>>> jid = cloud.call(beamforming, 'agfa.dump', _env='seismology_env', _type='c2')
>>> out = cloud.result(jid)
>>> out
array([[  7.33088462e+05,   6.52313948e-01,   4.77909058e-17,
          1.63009177e+02,   1.12929181e+00],
       [  7.33088462e+05,   6.54728115e-01,   4.92721051e-17,
          1.58838740e+02,   9.97246208e-01],
       [  7.33088462e+05,   6.65887892e-01,   5.40099541e-17,
          1.58552264e+02,   9.02496537e-01],
       ...,
       [  7.33088462e+05,   7.97200561e-01,   2.54941247e-16,
          1.84349488e+01,   1.23328829e+00],
       [  7.33088462e+05,   7.93642402e-01,   2.90117096e-16,
         -1.20947571e+01,   8.59069264e-01],
       [  7.33088462e+05,   8.08987796e-01,   3.06803433e-16,
         -2.65650512e+01,   8.72066511e-01]])

To better visualize the result, we can plot the output:



The following code was used to generate the plot:

import matplotlib.pyplot as plt
labels = 'rel.power abs.power baz slow'.split()

fig = plt.figure()
for i, lab in enumerate(labels):
    ax = fig.add_subplot(4, 1, i + 1)
    ax.scatter(out[:, 0], out[:, i + 1], c=out[:, 1], alpha=0.6,
               edgecolors='none')
    ax.set_ylabel(lab)
    ax.xaxis_date()

fig.autofmt_xdate()
fig.subplots_adjust(top=0.95, right=0.95, bottom=0.2, hspace=0)

Pricing

Creating environments is free. There are no additional charges associated with this feature. From our perspective, environments allow you to run more cycles on PiCloud, which is where we get our payoff.

Conclusion

With Environments, every programming language and software package is now a viable tool to use on PiCloud. While we love Python, we’re excited that it is no longer the “sole language of the PiCloud Platform.” Coupled with our function publishing feature, we envision Python serving as the glue language through which users can orchestrate their computation on our otherwise language-agnostic platform.



Thanks to the select group of users who beta tested this feature for us, and to Ken Park from PiCloud who envisioned and shepherded this project to launch!

Introducing Function Publishing via REST

Wednesday, September 14th, 2011

We’ve been devoting significant time to making PiCloud a useful utility outside of the Python ecosystem. The first feature we have to showcase this is the ability to publish a Python function to a URL. There are a few reasons you might want to do this:

  1. To call Python functions from a programming language other than Python. For example, if you’re integrating the PiCloud platform into a Java codebase, or even into a smartphone application (Android or iPhone).
  2. To use PiCloud from Google AppEngine, since our cloud client library is not supported on GAE.
  3. Because you’re tired of setting up web application projects when what you really need is a scalable RPC system.

In this post, we’ll give you your first taste of publishing functions on the web.

Define your Function

Just like when you offload regular computation to PiCloud, feel free to do anything in your function including importing custom libraries and making external connections.

def add(x, y):
    """This function adds!"""
    return x+y

Publish It

>>> import cloud
>>> cloud.setkey(key, secret_key)
>>> cloud.rest.publish(add, 'addition')
'https://api.picloud.com/r/2/addition'

The first argument, add, is your function. The second argument, addition, is a label so you can reference the function later; it’s also present in the returned URL for clarity. For a list of all other arguments, refer to the cloud.rest module documentation.

Let’s get information about the function we just published by making a GET request to the returned URL. We recommend curl for doing this from a shell. We authenticate requests using basic authentication. In curl, use “-u” as shown below to specify your key as the username and your secret key as the password. Note that we automatically extract the function’s docstring as the description.

$ curl -k -u 'key:secret_key' https://api.picloud.com/r/2/addition/
{"output_encoding": "json", "version": "0.1", "description": "This function adds!", "signature": "addition(x, y)", "uri": "https://api.picloud.com/r/2/addition", "label": "addition"}

You can also see your published functions from your account control panel.

Call the Published Function

Now let’s call the function by using a POST request to the same URL. To specify arguments to the function add, you simply pass them in as JSON encoded POST values. In this case, you would specify the POST values x and y.

$ curl -k -u 'key:secret_key' https://api.picloud.com/r/2/addition/ -d x=1 -d y=1
{'jid': 809730}

Get the Result

There are two ways we can grab the result of this job. The standard way is through your Python console:

>>> import cloud
>>> cloud.setkey(key, secret_key)
>>> cloud.result(809730)
2

The language-agnostic way to do this using our REST API is to query the following URL: https://api.picloud.com/job/{job_id}/result/.

$ curl -k -u 'key:secret_key' https://api.picloud.com/job/809730/result/
{"result": 2}

The difference between these two methods is that cloud.result will block until the result is ready; our REST API will return a “job not done” error, so you’ll have to keep querying until it’s ready.
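
If you’re polling from Python without the client library, the loop might look like the sketch below. It uses basic auth against the result URL shown above, and assumes an unfinished job surfaces as an HTTP error or a JSON body without a "result" key, so adjust it to the actual responses you see:

import base64
import json
import time
import urllib2

def wait_for_result(jid, key, secret_key, interval=2):
    """Hypothetical helper: poll the REST result endpoint until the job finishes."""
    url = 'https://api.picloud.com/job/%d/result/' % jid
    auth = base64.b64encode('%s:%s' % (key, secret_key))
    while True:
        req = urllib2.Request(url, headers={'Authorization': 'Basic ' + auth})
        try:
            body = json.loads(urllib2.urlopen(req).read())
            if 'result' in body:
                return body['result']
        except urllib2.HTTPError:
            pass   # treat errors as "job not done" for this sketch
        time.sleep(interval)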

For a full specification of our API, please see our REST API documentation.

Taking Advantage of JSON Arguments

Since arguments are specified as JSON, you can easily pass in strings, lists, and dictionaries into your published functions. For example, we can concatenate two strings using our addition function:

$ curl -k -u 'key:secret_key' https://api.picloud.com/r/2/addition -d "x=\"Hello, \"" -d "y=\"World\""
{'jid': 809731}
$ curl -k -u 'key:secret_key' https://api.picloud.com/job/809731/result/
{"result": "Hello, World"}

We can also merge two lists using our addition function:

$ curl -k -u 'key:secret_key' https://api.picloud.com/r/2/addition -d "x=[1,2,3]" -d "y=[4,5,6]"
{'jid': 809732}
$ curl -k -u 'key:secret_key' https://api.picloud.com/job/809732/result/
{"result": [1, 2, 3, 4, 5, 6]}

These work, of course, because in Python the addition operator can be applied to strings and lists, not just numbers.

Handling Raw Data

JSON does not natively support binary data. While you can encode the data to base64, and decode it in your function, we offer a more straightforward and efficient method. Binary data can be passed into a published function by using multipart/form-data as a file upload (MIME Content-Disposition sub-header has a filename parameter).

Example

To showcase raw data handling, we’re going to publish a function to create thumbnails. We’ll use this picture of Albert Einstein.



Here’s the function we’ll use to create a thumbnail of an image. We use StringIO so that we can open and save the image in a memory buffer, rather than to a file.

from PIL import Image
from cStringIO import StringIO

def thumbnail(raw_img_data, width=50, height=50, output_format='JPEG'):
    im = Image.open(StringIO(raw_img_data))
    im.thumbnail((width, height))
    out_data = StringIO()
    im.save(out_data, output_format)
    return out_data.getvalue()

import cloud
# be sure to set the output encoding to raw
cloud.rest.publish(thumbnail, 'thumbnail', out_encoding='raw')

Call the function. Use -F in conjunction with the @ symbol to POST an image file as a file upload, which will be treated as raw data by PiCloud. We can adjust the width and height by passing in POST values, or if we omit them, the default value of 50 will be used.

$ curl -k -u 'key:secret_key' -F width=60 -F height=76 -F "raw_img_data=@albert_einstein.jpg" https://api.picloud.com/r/2/thumbnail/
{'jid': 809737}

The content of the result is the binary data representing the thumbnail image. Unlike JSON encoded results, there is no enclosing dictionary. Thus, all you have to do to see the image is pipe the result of the job into a file.

$ curl -k -u 'key:secret_key' https://api.picloud.com/job/809737/result/ > albert_einstein.thumb.jpg

Open the thumbnail in your favorite image program!

Albert Einstein Thumbnail

Conclusion: Take a rest, and then give it a spin!

We’re particularly excited by function publishing because it bridges PiCloud with the world outside of Python, and in doing so, brings all the computing benefits of our standard service. You can publish functions without any care for the amount of hardware running underneath. As your functions get called more frequently, we automatically scale our servers to meet demand. You can also reserve real-time cores if you want to guarantee a certain number of cores at all times. Lastly, you can be confident that your computation is being run on a system built with performance, robustness, and redundancy at its core.

If this technology captivates you, follow us on Twitter, or go above and beyond and join our team!

Introducing High-Memory Cores

Friday, September 2nd, 2011

When we first launched PiCloud, we provided two options for processing power: standard and high cpu. Standard provided 1 compute unit with 300MB of RAM, while high cpu provided 2.5 compute units and 800MB of RAM. But what about tasks that require GBs of memory and an even faster CPU? Enter core types.

You can now select the type of core you want your job to be run on.

  • c1: Replaces our standard option as default.
  • c2: Replaces the high cpu option.
  • m1: Our new high-memory core with 3.25 compute units and 8GB of memory.

For more details, see our updated pricing page.

How to Use It

We’re committed to maintaining an extraordinarily simple API for you. With our old library you would do the following:

cloud.call(func, _high_cpu=True)

With our new library (available here), you do this instead:

cloud.call(func, _type='c2')

Additional Information

In conjunction with these changes, real-time cores are now reserved by type. You can see the new interface in your control panel.

We’ll be releasing more cores as we hear demand for them. Our next core, s1, which is in beta, is a solution for users on our platform who scrape the web. When running jobs in parallel on s1 cores, each job will have its own IP, minimizing throttling effects. However, consecutive jobs may share the same IP address.

Crons in the Cloud!

Tuesday, August 10th, 2010

We’re pleased to announce the addition of crons to the PiCloud platform. A cron is a simple way to schedule a function to run periodically. Times and dates are specified using the standard crontab format. Crons can be triggered as often as every minute, and there’s no limit to the number of functions you can register as crons. You will be billed for the amount of compute time consumed by the function triggered by your cron, just as if you were running the function on PiCloud yourself. We have also added a tab to the web interface for managing crons.

Here’s how to register a cron:

# registers function ping_webserver with the label heartbeat
# this function could be checking whether a webservice is active
cloud.cron.register(ping_webserver, 'heartbeat', '* * * * *') # runs every minute

When you no longer need a cron, you can deregister it via our web interface or using the following:

# deregister the cron registered above with the label heartbeat
cloud.cron.deregister('heartbeat')

Here’s a function that runs once a day at noon.

# 19 is the 19th GMT hour, which translates to 12pm PDT (GMT -7)
cloud.cron.register(sudo_make_me_a_sandwich, 'lunch', '0 19 * * *')

That’s all it takes! See our documentation for the full cron specification.