Archive for the ‘What’s New’ Category

Bucket Mounts

Wednesday, July 24th, 2013

It’s now easier than ever to make use of your PiCloud Bucket in your jobs thanks to our new Bucket Mount feature. A Bucket Mount exposes your Bucket to your job as a read/write filesystem mounted at /bucket. In this folder, you’ll find everything you’ve stored in your Cloud Bucket, organized as a standard file system hierarchy.

Getting Familiar

Here, I’ve SSHed into a running PiCloud job using our exec-shell feature:

$ picloud exec-shell
Job requested as jid 4052. SSHing in..
Welcome to Ubuntu 11.04 (GNU/Linux 3.2.0-29-virtual x86_64)

 * Documentation:  https://help.ubuntu.com/
$ ls -lF /bucket
total 1
drwxrwxr-x 1 root picloud   0 Jun 12 21:22 folder1/
-rw-rw-r-- 1 root picloud  11 Jun 12 01:06 it_works.txt
-rw-rw-r-- 1 root picloud 806 Jun  6 21:57 pogo.py

PiCloud jobs run as a UNIX user in the group picloud and have permission to create, modify, or read any file or folder in the Bucket. You can access them just as you would any regular file.

$ cat /bucket/it_works.txt
It Works!

Modifying Files

Modifying a file automatically uploads it to your Bucket when the file is closed or flushed.

$ echo 'Hooray for Bucket Mounts!' > /bucket/hooray.txt
$ cat /bucket/hooray.txt
Hooray for Bucket Mounts!

Renaming Files and Folders

As another example, renaming prefixes (folders) becomes quite easy. Below, all of the Cloud Bucket objects with the prefix folder1/ have been renamed to have the prefix files/.

$ picloud bucket list
filename
folder1/file1.txt
folder1/file2.txt
folder1/file3.txt
hooray.txt
it_works.txt
pogo.py
$ mv /bucket/folder1 /bucket/files
$ picloud bucket list
filename
files/file1.txt
files/file2.txt
files/file3.txt
hooray.txt
it_works.txt
pogo.py

Removing Files

You can also use the mount to delete files.

$ rm -rf /bucket/files
$ picloud bucket list
filename
hooray.txt
it_works.txt
pogo.py

Writing Code

Having your bucket objects available as part of the file system can reduce the complexity of your programs. For languages that don’t have a PiCloud driver, you would normally have to spawn a separate process (from within your program) to run picloud bucket get and picloud bucket put; with your objects mounted at /bucket/, that is no longer necessary.

Writing Python

This now lets you simplify your Python code that utilizes your bucket. Code like this will still work, unchanged:

import cloud

def crunch_data():
    cloud.bucket.get('dataset.txt')

    with open('dataset.txt', 'r+') as dataset:
        modify_data(dataset)

    cloud.bucket.put('dataset.txt')

cloud.call(crunch_data)

But it’s no longer necessary for jobs to wrap code with get/put calls:

import cloud

def crunch_data():
    with open('/bucket/dataset.txt', 'r+') as dataset:
        modify_data(dataset)

cloud.call(crunch_data)

If you’re writing Python that needs to run both locally and on PiCloud, you can also use cloud.running_on_cloud() to determine how best to access objects in your bucket.
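
For example, here is a minimal sketch of that pattern (the open_dataset helper is ours, not part of the cloud library):

import cloud

def open_dataset():
    """Open dataset.txt from the right place, locally or on PiCloud."""
    if cloud.running_on_cloud():
        # jobs see the Bucket mounted at /bucket
        return open('/bucket/dataset.txt', 'r+')
    else:
        # locally, download the object into the working directory first
        cloud.bucket.get('dataset.txt')
        return open('dataset.txt', 'r+')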

Performance and Consistency Considerations

Bucket Mounts can be thought of as a simple file system wrapper on top of the picloud bucket get and picloud bucket put functions. In particular:

  • When you open a file for reading, it will first download the entire object from your Bucket (picloud bucket get).
  • When you open a file for writing, it will also download the entire object first (if it exists).
  • When you flush/close a file open for writing, it will upload the entire file to your Bucket (picloud bucket put).
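
As a concrete illustration of the flush/close behavior described above (a sketch; results.csv is a made-up object name, and the code assumes it is running inside a job):

# nothing is uploaded while the file is open and unflushed
with open('/bucket/results.csv', 'w') as f:
    f.write('id,value\n')
    f.write('1,42\n')
    f.flush()   # flush uploads the entire file as written so far
# closing the file (leaving the with-block) uploads the final contents,
# just like a `picloud bucket put`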

Cloud Desktops

Thursday, July 18th, 2013

Have you ever needed to run a GUI application on PiCloud? Well, neither have I, but it’s now possible! We’ve created a custom environment that you can use to run desktop applications on PiCloud.

Creating the VNC Server

Use exec-shell to create a new PiCloud job and SSH session:

$ picloud exec-shell --type c2 --cores 2 --env /picloud/Ubuntu-Desktop

This will create a job using our Ubuntu-Desktop environment. It requires at least 2 c2 cores due to the resource requirements of running the desktop environment.

Once the ssh session begins you can start the VNC server.

$ vncserver -geometry 1280x800

You’ll be asked to set a password for your VNC sessions. The -geometry option sets the dimensions of the X display in pixels. The default pixel depth is 16 bits, but you can get a broader range of colors by passing the -depth 24 option.

Connecting to the VNC server

You can use your favorite VNC client to connect to your desktop job. For this example, I’ll be using the Screen Sharing app built in to Mac OS X. On other platforms, you should be able to use any VNC client application.

To connect, you’ll need to know the hostname and port for the VNC server. The easiest way to find out what hostname and port your VNC server is running on is to use the jobs panel from the accounts page, choose the job id corresponding to your desktop job, and scroll down to the “listening ports” section.

In this case, VNC listens on port 5901 inside a Linux container on the host ec2-50-16-102-78.compute-1.amazonaws.com, and is exposed externally on port 20102.

In my web browser, I enter vnc://ec2-50-16-102-78.compute-1.amazonaws.com:20102 into the address bar, which launches the Screen Sharing app and prompts me for the password I set when starting the VNC server. Other VNC clients will ask for the hostname and port separately, or let you enter host:port in a single field.

Once the connection is established you’ll be presented with a basic Ubuntu-2D desktop.

Our installation includes the Mozilla Firefox Web Browser and the LibreOffice productivity suite, but any changes you make to the filesystem are ephemeral and are discarded when the job completes. To save files you’re working on, you’ll want to save them to your Bucket or Volumes.

Like all of our public environments, you can also clone /picloud/Ubuntu-Desktop to customize for your own needs.

Introducing Queues — Creating a Pipeline in the Cloud

Wednesday, April 3rd, 2013

Queues provide an interface for Dataflow Programming that is built on top of our job system.

While a distributed queue data structure with push, pop, and ack capabilities is provided, the key benefit is the ability to attach a handler to a queue for scalable processing of a queue’s messages. The handler in turn can feed its output messages to other queues.

In other words, you’re probably used to the queue data structure:

Overview of Queue

Our queues link the data structure with a message handler, f, which we call an attachment:

Overview of Queue

f(msg) is any Python function you define that takes one argument at a time: a message from the input queue. Its return value is pushed into the output queue.

By the end of this post, you’ll be able to:

  • Create a distributed, fault-tolerant pipeline of queues and processors.
  • Scale each component to achieve a throughput of thousands of messages per second.
  • See it all through automatically-generated visualizations.
  • Pay only when you have messages in your pipeline.
  • Do it all, with only Python, and not a single server.

If you’re a developer who just wants to RTFM, see our documentation.

Diving In

Let’s see queues in action. You’ll need to have the latest client installed, released today (4/3).

First, let’s push and pop from a queue in your console to get comfortable:

>>> # import our library
>>> import cloud
>>> q = cloud.queue.get('numbers')
>>> # adds 3 messages to the queue
>>> q.push([1,2,3])
>>> # pops up to 10 messages
>>> q.pop()
[2, 1, 3]

Note that the queue did not dequeue in perfect FIFO order; this is expected. Our queues are designed for high throughput and high parallelism with minimal queue backlog, which makes guaranteed FIFO ordering less relevant.

Attach

Now let’s attach a function that increments all numbers in the input queue:

# declare the input and output queue
input_q = cloud.queue.get('numbers')
output_q = cloud.queue.get('bigger-numbers')

# create handler function
def increment(x):
    return x + 1

# attach the handler to the queue
input_q.attach(increment, output_q)

How did that work? We’re using the same automagic dependency transfer we use in our job system to send your increment function to us along with any dependencies it might have.

Visualization

From the Queues Dashboard, we can see an auto-generated layout of our pipeline based on the attachment we made:

Overview of Queue

Message Processing by Attachment

Let’s increment 1,000 numbers:

# range(1000) is a list of numbers from 0 to 999
input_q.push(range(1000))

In the background, our systems have created a job (visible in your Job Dashboard) that applies the increment function to every number in the numbers queue, and outputs the result to the bigger-numbers queue. If you’re unfamiliar with our job framework, don’t worry, queues abstract away most of the details. But, if you’re interested, see our Primer.

After ~10 seconds, you’ll see that all messages have been processed. On the live throughput chart, a single point represents the average throughput during a 10-second window; the interpolation lines are cosmetic. Below, the single point at 100 msgs/second means that 1,000 messages were processed during that window. In actuality, we got about 150 msgs/second for 7 seconds.

Overview of Queue

As a sanity check, we can also check the size of the queues:

>>> input_q.count()
0
>>> output_q.count()
1000

Increasing Throughput

What if you want to increase your throughput past 150 msgs/second? Set max_parallel_jobs for the attachment. You can do this from the Queue Dashboard or from the attach call:

# attach the handler to the queue
input_q.attach(increment, output_q, _max_parallel_jobs=5)

Now, assuming there are messages in the queue, you’ll see a throughput of 750 msgs/second!

Overview of Queue

If you click on “view” jobs, you can see a list of the five jobs attached to the queue. For those familiar with our job processing framework, you can now see that our Queue processors are built on top of jobs.

The takeaway: you just increased your throughput 5-fold by changing one number, and without any server management or scaling.

Creating an Image Processing Pipeline

To showcase the power of queues, we’re going to create the following pipeline:

Overview of Queue

This is an auto-generated visualization available in the Queues Dashboard. Rectangles are queues; circles are attachments.

The inputs to this pipeline are URLs, which should be pushed to the img-urls queue. The pipeline downloads the image, and does the following image operations:

  • Resize to 150px (thumbnail)
  • Resize to 400px (medium size)
  • Apply a Sepia-tone filter to the medium size image

For each generated image, an HTTP callback is made to an endpoint of your choice. Note that separating the three image operations into three attachments with different input queues isn’t the most efficient (you’d probably want to combine them into one operation), but it’s done for illustrative purposes.

You can download this pipeline from our repository: basic-example/queue/imgpipeline/pipeline.py

Step 1: Scraping Images

We’re going to use the following function as our attachment:

import os
import Image
import urllib2
from StringIO import StringIO

import cloud

def scrape_to_bucket(target):
    """Downloads image from url, and saves to bucket. *target* should
    be a dict with keys id (image id), and url (location of image).

    Returns a dict with keys id (image id), path (obj key), and
    transforms (empty list)."""

    id = target['id']
    url = target['url']

    # path to save image in bucket
    obj_path = 'imgs/{id}/original.png'.format(id=id)

    # extract extension from url
    ext = os.path.splitext(url)[-1]

    # open connection to image
    u = urllib2.urlopen(url)

    # if the image isn't a PNG, convert it to PNG
    if ext.lower() != '.png':
        i = Image.open(StringIO(u.read()))
        data = StringIO()
        i.save(data, 'png')
        data = data.getvalue()
    else:
        data = u.read()

    u.close()

    # add image to bucket
    cloud.bucket.putf(data, obj_path)

    return {'id': id,
            'path': obj_path,
            'transforms': []}

If you’re unfamiliar with Buckets, just think of them as a key-value object store. We use one here to conveniently store and retrieve objects to and from memory. However, Buckets are not required and are completely unrelated to Queues: you can modify scrape_to_bucket() so it saves images to your own Amazon S3 account, a database, or anywhere else.
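
As a rough sketch of that last point, a handler could write to a personal S3 bucket with the boto library instead of cloud.bucket.putf; the bucket name and credentials below are placeholders, and boto itself is an assumption rather than something the pipeline requires:

import boto
from boto.s3.key import Key

def save_to_my_s3(data, obj_path):
    """Store raw PNG bytes under *obj_path* in a personal S3 bucket.

    Drop-in replacement for the cloud.bucket.putf() call in
    scrape_to_bucket(); credentials and bucket name are placeholders."""

    conn = boto.connect_s3('MY_ACCESS_KEY', 'MY_SECRET_KEY')
    bucket = conn.get_bucket('my-image-bucket')

    key = Key(bucket)
    key.key = obj_path
    key.set_contents_from_string(data)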

Here’s a sample input message we’ll use to demonstrate each operation:

{
 'id': 1,
 'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'
}

The url points to an image of the Obama family:
Obama Family Full Size

Per the source code, the above image will be saved in your bucket. The output message pushed to the thumbnail and medium queues will be:

{
 'id': 1,
 'path': 'imgs/1/original.png',
 'transforms': []
}

You can verify this works by simply running the function on your own machine:

>>> msg = {'id': 1,
'url': 'http://s3.amazonaws.com/pi-user-buckets/vFvZxWVSiHeeB20rAZwnS66OLRjeU8MU4Igf2Kyl/blog/Obama_family_portrait.jpg'}
>>> scrape_to_bucket(msg)
{'id': 1, 'path': 'imgs/1/original.png', 'transforms': []}

This is another advantage of queues: because your function doesn’t need to be modified in any way to be an attachment, you can test it locally just as easily as you can run it on the cloud.

Handling Exceptions

What if the message handler throws an Exception? Maybe the URL was temporarily unavailable, but you’d like to retry it in 60 seconds. Using retry_on, max_retries, and retry_delay, you can specify which Exceptions you’d like to retry, the number of times to retry, and the amount of time between each attempt.

import urllib2

q = cloud.queue.get('img-urls')
output_qs = cloud.queue.get('thumbnail'), cloud.queue.get('medium')
bad_urls_q = cloud.queue.get('bad-urls')

q.attach(scrape_to_bucket,
         output_qs,
         retry_on=[urllib2.HTTPError, urllib2.URLError],
         max_retries=3,
         retry_delay=60,
         on_error={Exception: {'queue': bad_urls_q}})

Using the on_error keyword, the bad-urls queue will receive messages that raised non-retryable exceptions, as well as messages that failed even after three retries. Error messages generated by on_error include the triggered exception and the associated traceback.
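
To see what ended up in bad-urls, you can pop from it like any other queue. This is only a sketch: the exact structure of the error messages isn’t documented here, so print one to inspect its actual fields.

bad_urls_q = cloud.queue.get('bad-urls')

# each popped item should carry the failed message along with the
# triggered exception and traceback described above
for error_msg in bad_urls_q.pop():
    print error_msg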

You can confirm that your attachment has been set up as intended with the visualization.

Scrape Attachment

Tweaking Performance with Multi-threading

Because this scraping attachment spends most of its time waiting on network transfers, it is I/O bound and won’t make effective use of the core it’s running on. The workaround is to run multiple identical readers within each job.

To do this, set the readers_per_job keyword to the number of simultaneous reader threads you want running in each job. The default value is 1. Do not confuse this with max_parallel_jobs, which controls the number of jobs that may be running. For this example, we’ll set the value to 4.

import urllib2

q = cloud.queue.get('img-urls')
output_qs = cloud.queue.get('thumbnail'), cloud.queue.get('medium')
bad_urls_q = cloud.queue.get('bad-urls')

q.attach(scrape_to_bucket,
         output_qs,
         retry_on=[urllib2.HTTPError, urllib2.URLError],
         max_retries=3,
         retry_delay=60,
         on_error={Exception: {'queue': bad_urls_q}},
         readers_per_job=4)

Step 2: Resizing Images

We’re going to attach handlers to the thumbnail and medium queues to resize images to 150px and 400px, respectively. To ease the storage and retrieval of images as PIL Image objects, we’ll use a custom ImageOperation base class.


class ImageOperation(object):
    """Base class for Message Handlers in Image Pipeline.

    Retrieves images from bucket, performs in-memory manipulation
    with PIL object, stores result back in bucket, and then
    outputs message with additional transform listed.

    Override operation() for custom operation."""

    name = 'identity'

    def get_image_from_bucket(self, obj_path):
        """Given *obj_path* in bucket, returns PIL Image object"""

        # get image data as string of raw bytes
        data = cloud.bucket.getf(obj_path).read()

        return Image.open(StringIO(data))

    def put_image_in_bucket(self, img, obj_path):
        """Given PIL image *img*, saves it to *obj_path* in bucket"""

        output_data = StringIO()

        # write raw image bytes to StringIO
        img.save(output_data, 'png')

        # store the image file in your bucket
        cloud.bucket.putf(output_data.getvalue(), obj_path)

    def add_modifier_to_key(self, obj_path):
        """Returns new *obj_path* that includes name of transform"""

        obj_key, obj_ext = os.path.splitext(obj_path)
        obj_path = '{key}.{name}.png'.format(key=obj_key,
                                             name=self.name)
        return obj_path

    def message_handler(self, msg):
        """Entry point for message handling. Do not override."""

        img = self.get_image_from_bucket(msg['path'])

        # apply image operation
        new_img = self.operation(img)

        msg['path'] = self.add_modifier_to_key(msg['path'])
        msg['transforms'].append(self.name)

        self.put_image_in_bucket(new_img, msg['path'])

        return msg

    def operation(self, img):
        """Method to replace for custom operation"""

        return img

Since we support instances as message handlers, we’ll subclass ImageOperation to make two message handlers: ImageThumbnail and ImageMediumSize.

class ImageThumbnail(ImageOperation):

    name = 'thumb'

    def operation(self, img):
        """Returns a thumbnail of the *img*"""

        img.thumbnail((150, 150), Image.ANTIALIAS)
        return img

class ImageMediumSize(ImageOperation):

    name = 'med'

    def operation(self, img):
        """Returns a 400px version of the *img*"""

        img.thumbnail((400, 400), Image.ANTIALIAS)
        return img

Now we’ll attach instances of these classes to their respective input queues.

thumbnail_q = cloud.queue.get('thumbnail')
thumbnail_q.attach(ImageThumbnail(), [callback_q])

medium_q = cloud.queue.get('medium')
medium_q.attach(ImageMediumSize(), [sepia_q, callback_q])

If you pushed the example message of the Obama family to img-urls, then there are already messages ready for the thumbnail and medium queue. Once processed, these two objects will appear in your bucket:

imgs/1/original.thumb.png

Obama Family Thumbnail

imgs/1/original.med.png

Obama Family Medium Size

Step 3: Sepia Tone

The final image operation is a sepia-tone filter after the medium-size downscale operation.

import ImageOps  # PIL's ImageOps module; provides autocontrast() used below

class ImageSepia(ImageOperation):
    """Applies Sepia Filter.
    Based on: http://effbot.org/zone/pil-sepia.htm"""

    name = 'sepia'

    def __init__(self):
        self.sepia_palette = self.make_linear_ramp()

    @staticmethod
    def make_linear_ramp():
        """Generate a palette in a format acceptable for `putpalette`,
        which expects [r,g,b,r,g,b,...]"""

        ramp = []
        r, g, b = 255, 220, 162 

        for i in range(255):
            ramp.extend((r*i/255, g*i/255, b*i/255))

        return ramp

    def operation(self, img):
        """Returns a version of the *img* with Sepia applied
        for a vintage look."""

        # convert to grayscale
        orig_mode = img.mode
        if orig_mode != "L":
            img = img.convert("L")

        img = ImageOps.autocontrast(img)

        # apply sepia palette
        img.putpalette(self.sepia_palette)

        # convert back to its original mode
        if orig_mode != "L":
            img = img.convert(orig_mode)

        return img

Attaching:

sepia_q = cloud.queue.get('sepia')
sepia_q.attach(ImageSepia(), [callback_q])

Once again, if you pushed the sample message, there should already be a message waiting in the sepia queue. The image written to your bucket is:

Obama Family Sepia

Step 4: Callback

Each image operation outputs a message to the callback queue. You probably want your callback handler to:

  • Write to your database that the image is ready
  • Make a POST request to your website for instant notification
  • Store the image somewhere else

For simplicity, we’ll have the callback handler set the image object in your bucket as public, so that it’s accessible by anyone. Based on the above examples, the following should be straightforward:

def callback(msg):
    print msg
    cloud.bucket.make_public(msg['path'])

callback_q = cloud.queue.get('callback')
callback_q.attach(callback)

Debugging Attachments

You may have noticed that in callback(msg), we did a print msg. How would you see standard output for an attachment? The same way you would for a job—by clicking on it in the Jobs Dashboard. To know what jobs are running your attachments, click “view” jobs from the Queues Dashboard, which will take you to the Jobs Dashboard filtered for your attachment.

Using this method, you’ll get access to all the information you’re accustomed to with jobs, including a realtime feed of CPU, memory, and disk usage.

Scaling Up

With just a few tweaks, I was able to get a system throughput greater than 150 images per second. I set max_parallel_jobs to 20 for the scraping step (10 readers_per_job, c2 core), and 30 for all image operation steps. Also, I set the image operation steps to use the f2 core for faster processing.
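
In code, the tweaks look roughly like this. This is a sketch: the keyword names follow the earlier examples in this post, and selecting the f2 core for the image-operation attachments is configured separately (not shown here).

# scraping step: 20 parallel jobs, 10 reader threads each
q.attach(scrape_to_bucket, output_qs,
         _max_parallel_jobs=20,
         readers_per_job=10)

# image operation steps: 30 parallel jobs each
thumbnail_q.attach(ImageThumbnail(), [callback_q], _max_parallel_jobs=30)
medium_q.attach(ImageMediumSize(), [sepia_q, callback_q], _max_parallel_jobs=30)
sepia_q.attach(ImageSepia(), [callback_q], _max_parallel_jobs=30)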

Here’s a screenshot of the Queue Dashboard in action as I was testing (doesn’t show max throughput). Note how the dequeue rate is able to keep up with the enqueue rate, which is precisely what we want.

Queue Dashboard throughput chart

Pricing

We charge for queues based on the amount of time jobs spend processing messages. Assuming you have a steady stream of messages, the maximum you’ll pay in an hour is:

max_parallel_jobs x cost per core hour

You can find the cost per core hour based on the core type you’ve chosen from our pricing page. If your queue is empty, no jobs will be running, and you won’t pay a thing!
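
As a worked example, at the c1 rate implied elsewhere in this archive ($500 for 10,000 c1 core hours, i.e. $0.05 per core hour), an attachment capped at max_parallel_jobs = 5 would cost at most 5 x $0.05 = $0.25 per hour, and only during hours in which messages are actually being processed.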

Conclusion: Let Us Clean Your Pipes

Letting us manage the full pipeline—the queues, and the processing of messages—has several advantages:

  • No Servers: You won’t have to configure or deploy a single server for storage or processing.
  • Faster Development: It takes only a few lines of code to “use the cloud” for a pipeline.
  • Reliability: Our queues and workers are distributed and replicated across multiple datacenters (AWS availability zones), and even a server failure won’t jeopardize your messages.
  • Scale Effortlessly: Tell us how many cores you want to put to work, and we make it so.
  • Cut Costs: You only pay for processing when there are messages. No idling servers.
  • Monitoring & Analytics: Take advantage of our queue analytics, and the same job monitoring interface that powers our standard service.

If you’re ready to give it a try, sign up now, and get 20 free core hours. Happy coding!

Introducing the PiCloud Notebook

Sunday, December 23rd, 2012

The PiCloud Notebook is a Python console in your browser optimized for data analysis and collaboration. Unlike a traditional console, a notebook lets you intermingle code and descriptive text. The best way to get a feel for how it works is to see it:



We’re big fans of IPython, and those who are familiar with it will immediately recognize the console as an IPython Notebook running on the PiCloud Platform. While we believe that in general developers will continue to code on their local machine for convenience, there are several advantages to using a cloud-based notebook:

  • Your research is accessible from anywhere.
  • You can share your work with collaborators.
  • You can work efficiently with your data stored on PiCloud or AWS.
  • You can get comfortable with the environment your jobs run in.

Getting Started

Log into your account (sign up and get 20 free core hours), and click on the Notebook tab. This brings you to your Notebook Machine.

Think of your Notebook Machine as a computer dedicated to you, sitting on the cloud. When you open a notebook, your Notebook Machine is started if it isn’t already. When you’re done, you should shut down the machine to avoid wasting resources.

By default, a notebook named “Primer” is available, which will walk you through the examples in the introductory section of our documentation called Primer. Click on it to open your first notebook.

Use your PiCloud account password when prompted.

Using a Notebook

Cell

Each box in your notebook is called a cell. Cells can hold code or various types of markup (Markdown, headers, or raw text). Unlike in a traditional Python console, pressing enter in a cell creates a new line rather than executing it, which makes it easy to write multi-line functions and sequences of commands.

Executing Code

When you want to execute a cell, whether it holds code or markup, press shift+enter. Note the number on the left side of the cell, "In [X]", which indicates the order of execution. Unlike in a traditional console, you can execute and re-execute a cell at any time, so cells are not necessarily shown in execution order. While this takes some getting used to, it’s very handy when you’re continuously iterating on your code.

Executing Shell Commands

While the primary use case of the notebook is for writing Python code, it’s also convenient for running shell commands. To execute a shell command, just prefix a command with “!”. For example, you can run “!ls”, “!pwd”, or even the PiCloud command-line interface (CLI), “!picloud”.
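
For example, a single cell can mix several of these commands (the output will vary with your notebook machine):

!pwd
!ls
!picloud bucket list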



Visualization

Your notebook can also display rich media including images, graphs, videos, and more!



What else?

For more tips and tricks, see the following resources:

What is the Notebook Machine Exactly?

Your Notebook Machine is more than a web application. It’s a full-blown Linux system that can be accessed through the notebook console. We’re able to offer this by leveraging the same job abstraction we’ve already devoted so much effort to. In fact, the machine is running as a job. You can see the job id of your notebook machine from the notebook tab.

How is it Implemented?

As mentioned previously, our notebook is an IPython Notebook running on PiCloud, and it runs as a job in your account. Those of you familiar with our platform may be wondering how you’re able to connect to the notebook web server running inside the job. The answer is that we’ve just released a feature that allows jobs to open listening sockets that accept external connections. This opens up numerous possibilities, including sockets for job-to-job communication and hosting web applications.
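
As a rough sketch of what a listening socket in a job looks like (the port and the message here are arbitrary; the externally visible host and port are the ones reported in the job’s "listening ports" section):

import socket
import cloud

def tiny_server():
    # bind a listening socket inside the job; PiCloud maps it to an
    # externally reachable host and port visible in the Jobs Dashboard
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('0.0.0.0', 8000))   # arbitrary port for this sketch
    s.listen(1)
    conn, addr = s.accept()
    conn.sendall('hello from a PiCloud job\n')
    conn.close()

jid = cloud.call(tiny_server)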

Using the Notebook to Live the Life of a Job

Another advantage of a notebook is that it lets you explore the system a job sees. You can do the following:

  • Peek around the filesystem.
  • Import Python libraries to check availability and version.
  • Run non-Python programs.
  • Verify that your custom environment is working as expected.
  • Verify that your volumes are mounted in the way you intended.
  • Benchmark workloads interactively.

Refer to our documentation on configuring your notebook machine for more information regarding mimicking a job’s usage of volumes, environments, multicore, and core types.

Collaboration

If you want to give collaborators access to use your notebook machine, set a secondary password. Then provide the URL of your notebook to your collaborators. They should use the secondary password when prompted. We don’t want you to hand out your primary account password to others.

You can also send a notebook to a collaborator by downloading it (File -> Download as) and manually sending it.

Suggestions?

We’re really excited about this latest addition to the PiCloud Platform. If you have any ideas, let us know!

New Base Environment — Ubuntu Precise

Wednesday, October 24th, 2012

Since their introduction, Environments have become a staple feature of the PiCloud Platform, enabling users to deploy custom libraries and binaries. Environments were our first step outside of the Python box.

Up until now, we’ve offered two base environments that you could customize, or use directly:

  1. Ubuntu 10.10 Maverick for Python 2.6
  2. Ubuntu 11.04 Natty for Python 2.7

Enter Ubuntu Precise 12.04

Our latest environment is pre-configured with many of the latest libraries, making it easier than ever to move your computation to the cloud. Here are some of the notable packages:

  • NumPy 1.6.2
  • SciPy 0.11
  • Pandas 0.9.0
  • Scikits Learn 0.8.1
  • OpenCV 2.4.2
  • Java 7
  • R 2.14.1
  • Ruby 1.9.1
  • PHP 5.3.10

Click here for a list of all contents.

How do I use the new Base?

To maintain backwards compatibility with users dependent on our Natty base, by default all Python 2.7 jobs still use Natty. To use Precise, specify the environment of a job as ‘base/precise’. In Python:

cloud.call(f, _env='base/precise')

In the shell:

$ picloud exec -e base/precise program

Of course, you can create an environment based off of the Precise base, and use that too.

Need Another Base?

If Precise isn’t enough for you, let us know what other distribution of Linux you’d like to see us support in the comments.

Real-time Data Feed for Jobs and More!

Tuesday, September 18th, 2012

We’ve just completed a major overhaul of the job dashboard. The overhaul marks a fundamental shift in our belief about what information you should be able to see about your jobs. We had originally striven for minimalism, revealing only what we thought was necessary: standard output & error, exception traceback, and profile. After all, if software ran bug-free, you wouldn’t need anything else. But as our users kept reminding us with support tickets, when things aren’t working, there is exceptional value in under-the-hood data.

This post covers the new visibility we offer into your computation. You can see it in action by running a new job, and viewing it through the Job Dashboard.

Count of Memory Allocation Failures

We now report the number of times a memory allocation was requested, but failed, under the “MemAlloc Fails” column.

If you haven’t run out of memory on PiCloud before, the count may not seem all that important. But until now, users would generally get a cryptic “None” exception when they ran out of memory, indicating that their process had unexpectedly died. The reason is that most programs don’t handle out-of-memory errors gracefully, typically segfaulting instead. Now if your program crashes, it’s easy to check for allocation failures, the most likely culprit.

Once you know you’re running out of memory, you can take advantage of our other features. Try using a core type with more RAM, such as our f2 core, or take advantage of our new multicore support.
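
Both of those options are a keyword away; these calls reuse the _type and _cores keywords covered in the multicore and f2 posts further down this page:

# rerun the job on a core type with more RAM, or pool several cores together
jid = cloud.call(func, _type='f2')            # 3.7 GB on a single f2 core
jid = cloud.call(func, _type='f2', _cores=4)  # ~14.8 GB pooled across 4 f2 cores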

CPU Usage

A common question is how effectively a job is utilizing its CPU. If your job is supposed to be crunching numbers but is only using the CPU 30% of its allotted time, you probably have an unsuspected bottleneck. Before this update, there was really no way to know, other than profiling the job locally.

Now, you can access three different views of a job’s CPU usage to give you maximum visibility.

Total Usage

Runtime refers to wall-clock time. CPU Time is divided into two categories based on whether time is being spent in user or kernel space. With the data above, we can determine that the CPU was being utilized (7613.14+280.85)/8239.6 = 95.8% of the time. But where’d the other 4% go?

Usage over Time

Using the following chart, we can see what happened.

This hypothetical job was at 99%+ utilization for the majority of the time. However, it spent the first 500 seconds loading data from external sources before crunching it. Depending on your job, a graph like this might look ideal, or it might be an impetus to reduce the data loading time.

The chart is generated by taking a snapshot of a job’s CPU Time every 30 seconds.

Realtime Usage

If you select a job from the Job Dashboard while it is processing, you’ll see a realtime graph of CPU usage in the right-hand column. The graph actively samples your running job every few seconds.

This soothes the anxiety some developers feel when their precious algorithms are all grown up and running on the cloud. Never again will they anxiously wonder, “how’s my job doing?”

Memory Usage

We offer a similar set of views for a job’s memory usage as we do for its CPU usage. While swap usage is shown, no job can currently use swap.

Peak Usage

Peak usage is the largest amount of memory ever allocated by the job. It is viewable while the job is running and after it has completed.

Current Usage

Current usage is the amount of memory currently being used by the job. Current usage can only be seen for a job that is currently being processed.

Usage over Time

Extending the example of the hypothetical job from the CPU Usage section, we can confirm that the first 500 seconds were spent loading data into memory.

Realtime Usage

Not shown for brevity. Just like the realtime CPU usage chart, you can watch a job’s memory usage while it is processing.

Python Logging

The Python logging module is popular for its simplicity and extensibility. Until now, users had to direct their loggers to standard output or error to have their messages retained by PiCloud. We now capture your logger messages automatically and store them separately for later viewing.
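
For example, a job like the following (the logger name is arbitrary) will have its log records captured without any extra handlers pointed at standard output:

import logging
import cloud

logger = logging.getLogger('my_analysis')   # arbitrary logger name
logger.setLevel(logging.INFO)

def crunch():
    logger.info('starting the crunch')
    # ... do work ...
    logger.info('finished')

cloud.call(crunch)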

Pi Log

As your job runs, our system sometimes has comments for you. For example, if you spawn subprocesses, it will notify you that you cannot take advantage of our persistent processes optimization. Before, we used to write these messages to standard error, but that unnecessarily worried some users, and others complained that it polluted their output. Now, we have a separate section specifically for messages from PiCloud.

System Log

Since more users are using PiCloud for non-Python computation, we have decided to expose the syslog associated with a job. If your job is segfaulting and there is no indication why, this is the place to look.

In the Works

We plan to offer metrics for disk and network usage.

Suggestions?

Need more visibility? Let us know by leaving a comment!

Introducing Multicore Support

Friday, August 31st, 2012

Up until today, each job run on PiCloud has been limited to using only a single core. For those familiar with the Python global interpreter lock (GIL), this may not seem like a big deal at first. But as our users have let us know, the limitation is acute for the following reasons:

  1. Many performance-focused Python libraries including numpy release the GIL whenever possible, which means that even Python programs can leverage multiple cores.
  2. With the release of Environments, many of our users are running non-Python multithreaded programs. Some of those can use as many cores as we can throw at them.
  3. The most RAM a single core has is 8GB (m1). Up until now, a single job couldn’t break this limit. But now, you can pool multiple cores together to get access to more RAM.

How do I use it?

All you have to do is use the _cores keyword argument.

# uses 4 cores
job_id = cloud.call(func, _type='f2', _cores=4)

# works the same for map jobs
job_ids = cloud.map(func, datapoints, _type='f2', _cores=4)

Each job gets 4 f2 cores of processing power and 14.8GB of RAM (4 cores x 3.7GB per f2 core). We use the f2 core because, as the next section shows, the default c1 core does not support the new multicore feature.

How many cores per job?

The number depends on the type of core you select.

Core Type   Supported Multiples
c1          1 core
c2          1, 2, 4, or 8 cores
f2          1, 2, 4, 8, or 16 cores
m1          1 or 2 cores
s1          1 core

How much?

Per our pricing page, a job using a single f2 core would cost $0.22/hour. A job using two f2 cores would cost $0.44/hour. In other words, the cost per core has stayed the same, and there are no additional fees. You’re still charged by the millisecond.

Suggestions?

Multicore not enough for you? Let us know by leaving a comment.

Introducing the f2 Core – Our Fastest Core Yet!

Friday, June 15th, 2012

If you’ve been craving more speed, you’ll want to give our brand new f2 core a spin!

How fast?

The f2 core is rated at a whopping 5.5 compute units, which is 69% more than our previous leader, the m1 core with 3.25 compute units.

How about memory?

f2 cores have access to 3.7 GB of RAM, a much-needed middle ground between the 800 MB available on a c2 core and the 8 GB available on an m1 core.

How do I use it?

All you’ll need to do is set the _type keyword.

cloud.call(func, _type='f2')

How much?

At $0.22/core/hour, it’s got 120% more compute units than a c2 core, but only a 69% price premium.

Suggestions?

With our collection of five core types, there’s something for everyone! If you have ideas for another core type, or want to tell us how important a core with a GPU is to you, please leave a comment.

20 Free Core Hours Every Month!

Monday, October 24th, 2011

Starting this month, all users will get 20 c1 core hours worth of credits each and every month.

  • If you ran out of your original 5 core hour credits, you can come back and play around some more!
  • If you have minimal computing needs, this means that you can now use PiCloud regularly without even having to enter a credit card.

Looking for more? Don’t forget, we’re giving away $500 worth of credits as part of our Academic Research Program. Applications are due this Thursday, October 27th.

Introducing Free Core Hours for Academic Research

Thursday, October 20th, 2011

From the beginning, we’ve prided ourselves on bringing the cloud to scientists and engineers who don’t have access to a major compute cluster or lack the system administration know-how to operate one. And we’re indebted to the research groups around the world who quickly understood our value proposition, and enthusiastically adopted our platform. We could not have gotten to where we are today without them. To return the favor, today we’re introducing the PiCloud Academic Research Program to grant $500 (10,000 c1 core hours) free to two researchers.

Since this is the first time we’re doing this, we want to keep it simple. If you wish to apply, send an e-mail to research-funding@picloud.com by Thursday, October 27th with the following:

  1. Full name
  2. Organization or educational institution
  3. Your position
  4. Short biography
  5. A summary of your research field and project. Feel free to include conference papers, publications, and links to project websites. Please emphasize how PiCloud’s compute power will facilitate your research.

We will be awarding two submissions with free core hours. Winning researchers will also have an opportunity to get larger exposure for their projects on our blog and website. We’re looking forward to your submissions!