Archive for the ‘How To’ Category

Cloud Desktops

Thursday, July 18th, 2013

Have you ever needed to run a GUI application on PiCloud? Well, neither have I, but it’s now possible! We’ve created a custom environment that you can use to run desktop applications on PiCloud.

Creating the VNC Server

Use exec-shell to create a new PiCloud job and SSH session:

$ picloud exec-shell --type c2 --cores 2 --env /picloud/Ubuntu-Desktop

This will create a job using our Ubuntu-Desktop environment. It requires at least 2 c2 cores due to the resource requirements of running the desktop environment.

Once the SSH session begins, you can start the VNC server.

$ vncserver -geometry 1280x800

You’ll be asked to set a password for your VNC sessions. The -geometry option sets the dimensions of the X display in pixels. The default pixel depth is 16 bits; for a broader range of colors, add the -depth 24 option.
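
For example, to start a larger display with 24-bit color, you might run the following (the geometry value here is just illustrative):

$ vncserver -geometry 1440x900 -depth 24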

Connecting to the VNC server

You can use your favorite VNC client to connect to your desktop job. For this example, I’ll be using the Screen Sharing app built into Mac OS X; on other platforms, any VNC client application should work.

To connect, you’ll need to know the hostname and port for the VNC server. The easiest way to find them is to open the jobs panel on the accounts page, choose the job id corresponding to your desktop job, and scroll down to the “listening ports” section.

In this case, VNC listens on port 5901 inside a Linux container on the host ec2-50-16-102-78.compute-1.amazonaws.com, and that port is exposed externally as port 20102.

In my web browser, I just enter vnc://ec2-50-16-102-78.compute-1.amazonaws.com:20102 into the address bar and it starts the Screen Sharing app where I’m prompted for the password I used when starting the VNC server. Other VNC clients should ask for the hostname/port, or you can include host:port together in the same field.
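
If you prefer a standalone client, a viewer such as TigerVNC’s vncviewer (assuming you have it installed) accepts an explicit port after a double colon. Using the hostname and port from the example above:

$ vncviewer ec2-50-16-102-78.compute-1.amazonaws.com::20102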

Once the connection is established you’ll be presented with a basic Ubuntu-2D desktop.

Our installation includes the Mozilla Firefox Web Browser and the LibreOffice productivity suite, but any changes you make to the filesystem are ephemeral and are discarded when the job completes. To save files you’re working on, you’ll want to save them to your Bucket or Volumes.
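
For instance, from a terminal inside the desktop session you could push a document to your Bucket with the Python cloud client before the job ends. This is a hedged sketch: it assumes the client is installed and configured in the environment, uses the cloud.bucket.put call from recent client releases (older releases expose the similar cloud.files.put shown later on this page), and the file name is just an example.

>>> import cloud
>>> cloud.bucket.put('Documents/report.odt')  # persists the file beyond the job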

Like all of our public environments, you can also clone /picloud/Ubuntu-Desktop to customize for your own needs.

How to Get More Parallel Computing Power

Wednesday, November 16th, 2011

We often get asked how much parallel computing power each user on our platform gets. The short answer is that it’s variable and left up to our algorithms by default, but if you need to, it’s easy to specify that you want exactly N cores using our Realtime Cores feature: a couple of clicks on our web dashboard, or one line of code.

Standard Service (w/o Realtime)

By default, your jobs take advantage of cores that are shared by all of our users. We do not have a set number of total cores in our system, and there is no set number of cores per user. Instead, PiCloud continually and automatically estimates the workload we have now and will have in the next hour. There are two components to this estimation:

  1. Periodic Jobs: A large class of users runs jobs periodically, whether it’s every minute, hour, or day. Over time our system learns the amount of load each of these users contributes, and increases server capacity in anticipation of periodic jobs.
  2. Aperiodic Jobs: For users who do not have a predictable pattern, we scale our worker nodes once their jobs have been added to our queue. The number of workers we bring up depends on how long we estimate their jobs will take.

How do we estimate the runtime of a job?

We estimate the runtime of enqueued jobs by the average and variance of runtimes of similar completed jobs; a similar job is any job by the user that executed the same Python function with the same PiCloud-specific reserved keyword arguments. Keyword arguments such as _type are important since they could affect the speed at which a job is processed.
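
As a rough illustration of the statistics involved (this is a hypothetical sketch, not our production scheduler, and the completed-job record format is assumed), the snippet below computes the mean and variance of runtimes over past jobs that match on function and reserved keyword arguments:

def estimate_runtime(completed_jobs, func_name, reserved_kwargs):
    """Estimate a new job's runtime from similar finished jobs.

    A job is 'similar' if it ran the same Python function with the same
    PiCloud-specific reserved keyword arguments (e.g. _type)."""
    runtimes = [j['runtime'] for j in completed_jobs
                if j['function'] == func_name
                and j['reserved_kwargs'] == reserved_kwargs]
    if not runtimes:
        return None, None  # no history yet; fall back to a default estimate
    avg = sum(runtimes) / float(len(runtimes))
    var = sum((r - avg) ** 2 for r in runtimes) / float(len(runtimes))
    return avg, var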

How does your workload affect the number of workers we bring up?

Since we use Amazon Web Service’s servers, we are charged for every hour we have a server up. To manage our costs, we try to make sure our servers spend as much time as possible processing jobs, rather than sitting idle. Here’s a rule of thumb for how this affects you: If each of your jobs takes an hour to process, PiCloud will automatically scale such that all of your jobs are running simultaneously in parallel. We’re happy to bring up as many as you need since you’re keeping our servers busy for the full hour increment that we rent from Amazon. If each of your jobs takes a half hour, we’ll run about half of your jobs in parallel at a time, so that each core will be busy for the full hour running two jobs each. In practice, our scaling algorithm is more liberal than this, but this should give you a good conservative ballpark.
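
To make that rule of thumb concrete, here is a hypothetical sketch of the conservative version (our real algorithm is more liberal): the fraction of your jobs running in parallel tracks the fraction of an hour each job is expected to take.

import math

def conservative_worker_count(num_jobs, est_runtime_hours):
    # jobs of an hour or longer each get their own core;
    # shorter jobs share cores so every core stays busy for its full rented hour
    parallel_fraction = min(est_runtime_hours, 1.0)
    return int(math.ceil(num_jobs * parallel_fraction))

With 100 one-hour jobs this gives 100 cores; with 100 half-hour jobs, 50 cores, each core processing two jobs back to back.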

How much compute power should you expect?

If you’re just running a small batch of short jobs, you’ll generally get less than 10 cores. If you’re running a large number of long-running jobs, then using the above algorithm, we could potentially bring up hundreds of cores for you.

Realtime Service

The model of our standard service is ideal for users trying out PiCloud for the first time, users with relaxed response-time requirements, and users with large batch jobs. Our realtime service is for our other class of users who need N cores to be ready at a moment’s notice, even if that means those cores will be sitting idle at times.

How it Works

You tell us the type of core you want, and the number you want, and we’ll bring them online just for you. Your jobs will be placed in both your own realtime queue and our general queue. So if you have reserved 80 cores, and you submit 200 jobs, at least 80 jobs will immediately be dequeued and begin processing on your realtime cores. Potentially more than 80 may run simultaneously, since you still have access to the cores available in our standard service.

How we (want to) Charge

Right now, we charge a fixed cost for every core hour (rounded up). If you reserve realtime cores, and you run no jobs, you’ll be charged that amount. If you run jobs, you’ll be charged both the cost of the realtime cores and the cost of our standard service for the number of compute hours your jobs run for. For example, if you reserve 10 c1 cores for an hour, and run 1 compute hour worth of jobs on c1 cores during that time, you’ll pay $0.15 + $0.05 = $0.20 total (see our pricing page). The problem we have with this is twofold:

  1. The fixed cost of each realtime reservation needs to be sufficiently high to cover our expenses in the event a user does not run any jobs. Charging our elastic rate on top of that is unfairly expensive.
  2. If our standard service scaled a user to 100 cores, and the user then reserved 30 cores, they would not see any additional compute power, while paying substantially more. While it’s true that they would be guaranteed that compute power without any risk of crowding from other users, we don’t believe those users would be getting the best bang for their buck.

We love users who use realtime because they offer our automatically-scaling system a level of predictability; we don’t want to disincentivize this behavior. So we’re going to change our policy as follows: Each realtime core, rather than being an hourly charge, will instead be an hourly minimum. Hypothetically, if you reserved a realtime core that has a minimum of $0.10/hour, then after an hour you would be charged the greater of $0.10 or the bill you accrued from using our standard service. If you don’t run any jobs, that’s $0.10. If you run some jobs, but the standard service bill doesn’t exceed $0.10, then you pay $0.10. If you run enough jobs to exceed $0.10, say $0.15, then you pay just $0.15. We’ll let you all know when we’ve made this change to our pricing!
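
In other words, under the proposed policy the hourly charge for a realtime core is simply the larger of the minimum and your standard-service usage. A one-line sketch of the math:

def realtime_hourly_charge(hourly_minimum, standard_usage_bill):
    # proposed hourly-minimum policy: pay whichever is greater
    return max(hourly_minimum, standard_usage_bill)

# realtime_hourly_charge(0.10, 0.00) -> 0.10
# realtime_hourly_charge(0.10, 0.15) -> 0.15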

How to Reserve Realtime Cores

Go to the realtime tab on our web dashboard.

Select the type of core and the number of cores you want us to keep available for you, and then hit submit.

When the cores are ready, you’ll receive an email and the “start time” will be set.

When you run jobs, make sure that you set the _type keyword to the core type that you have reserved. In this example, we reserved 160 c2 cores.

cloud.call(f, _type='c2')

Alternatively, reserving realtime cores will be possible through the Python console using the cloud library in its next release:

cloud.realtime.request('c2', 160)

How long does it take to allocate Realtime Cores?

Allocation time has been a common issue raised by users, and we’ve been working hard to make it faster and more consistent. With our latest design, it should take no more than 15 minutes for cores to be provisioned!

How many cores can you allocate?

We have users who provision hundreds and thousands of cores using our realtime service. If you’re worried you might break our backs, file a support ticket and let us know what you’re up to.

Conclusion

The takeaway we want to leave with you is as follows: The PiCloud platform will not only get you up and running in no time with no configuration, but will also be able to meet your exact computational requirements if necessary—even if you need 1000 cores in 15 minutes. It’s not a question of if, but how, and we’re working to make how as easy as possible.

Running a Twitter Bot with Cloud-based Crons

Thursday, August 12th, 2010

Two days ago, we released our latest feature: Crons. In short, you’re able to register a function to be run periodically on PiCloud. Today, we wanted to give you an example of crons in action by making an automatic retweeter bot.

We’ve set up a Twitter account, @picloudrt, that will automatically retweet any message that includes the word “picloud” within a minute of its posting. We accomplish this by running a cron every minute that uses the Twitter API to search for new “picloud” tweets. As with our previous video encoding tutorial, we’ll first demonstrate how to run the retweeter locally, and then show how to move it to the cloud.

What can I do with a retweeting bot?

Your retweeting bot has several uses. First, you can follow it to get a consolidated view of what people are saying about your company or product. Second, you can use it to keep a comprehensive history of tweets, rather than the temporally finite (~1.5 weeks) results provided by Twitter Search. Third, you can augment the code to filter tweets and report findings as you see fit. For example, rather than re-tweeting, you can use smtplib to alert yourself of new tweets via e-mail. This will get you functionality similar to that of a service like TweetBeep.
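
For example, an alert-by-email variant could swap the retweet call for something like the hypothetical helper below. The SMTP host and addresses are placeholders you would replace with your own.

import smtplib
from email.mime.text import MIMEText

def email_alert(tweet_text, to_addr='you@example.com'):
    # build a plain-text message containing the tweet
    msg = MIMEText(tweet_text)
    msg['Subject'] = 'New picloud tweet'
    msg['From'] = 'alerts@example.com'
    msg['To'] = to_addr

    # send it through your own SMTP server (placeholder hostname)
    server = smtplib.SMTP('smtp.example.com')
    server.sendmail(msg['From'], [to_addr], msg.as_string())
    server.quit()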

Prerequisite Libraries

cloud – Version >= 2.0.0 of our library. You’ll need to sign up (it’s free!) to download it.
tweepy – Twitter API for Python.

Local Version

We have a function called retweeter(), which does the following:

  1. Uses the tweepy library to search Twitter for posts.
  2. Determines which tweets are new, and not from the retweeting account, picloudrt.
  3. Uses tweepy to retweet.

The details are best understood by examining the comments embedded in the code below.

Updated: retweeter() has been improved. The old method assumed that Twitter search would instantaneously display new tweets, when in fact the tweets can be delayed by over 10 seconds. Rather than using specific time intervals, we now use the last retweet as a marker for determining what tweets are new.

import tweepy
import datetime

username = 'picloudrt' # put your twitter handle here
password = 'XXXX' # put your password here
keyword = 'picloud' # the word we're tracking

def retweeter():
    """Searches for the picloud key term on twitter and retweets
    any new tweets tweeted since our last retweet."""

    # create api object (authentication needed for retweeting)
    auth = tweepy.BasicAuthHandler(username, password)
    api = tweepy.API(auth)

    # find the most recent tweet we've retweeted, so that
    # when we search for the latest tweets, we know only
    # to retweet messages that were created_after
    retweets = api.retweeted_by_me()
    if retweets:
        created_after = retweets[0].retweeted_status.created_at
    else:
        # if we've never retweeted before, then we're going to
        # retweet all msgs created after the 20th century, ie. all of them
        created_after = datetime.datetime(year=2000, month=1, day=1)

    # grab all tweets that include our keyword (default: picloud)
    tweets = api.search(keyword)
    # reverse them to get the oldest first
    tweets.reverse()
    for tweet in tweets:
        # if the tweet is new, and was not made from our account, retweet it
        if tweet.created_at > created_after and tweet.from_user != username:
            api.retweet(tweet.id)

To run the function on your local machine, you simply call it, retweeter().

Cloud Version

To run retweeter() periodically on PiCloud, you register it as a cron:

import cloud
cloud.cron.register(retweeter, 'picloud_retweeter', '* * * * *')

That’s it! Note that while you had to install tweepy locally, you did not have to install it on PiCloud. Our cloud library automatically extracts dependencies, such as tweepy, from your machine, and deploys them on PiCloud.

We labeled the newly registered cron ‘picloud_retweeter’; labels make it easy to refer to the cron in the web interface and in other functions, e.g. cloud.cron.deregister(). The expression ‘* * * * *’ is the UNIX crontab way of saying that retweeter() should be run every minute. You can find more details on specifying the periodicity in the UNIX man page for crontab.
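
If once a minute is more than you need, the same call with a different crontab expression works just as well; for instance (the label here is just an example):

cloud.cron.register(retweeter, 'hourly_retweeter', '0 * * * *') # run at the top of every hour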

Here’s what the cron dashboard now shows:


Cron Dashboard

If you click on “View Jobs,” you’ll be taken to our jobs dashboard, which will automatically filter for jobs created by the cron. Jobs spawned by a cron are labeled with the cron’s label prefixed with ‘cron_’. In this case, the jobs are labeled as ‘cron_picloud_retweeter’.


Cron Dashboard

As you can see, your cron is creating a new job every minute. For testing purposes, you can manually run a cron at any time using the cron dashboard. You can also remove the cron using the dashboard, or using our library.

cloud.cron.deregister('picloud_retweeter')

Cost

Assuming that your custom twitter bot takes about a second to scrape and process data, we can estimate your monthly cost. The function will be run approximately (60 minutes) * (24 hours) = 1440 times a day, for a total of (1440 times a day)*(30 days)=43,200 times per month. If it takes one second to execute each time, that’s 43,200 compute seconds, or 43,200/(3,600 seconds per hour) = 12 compute hours. The total cost is therefore (12 compute hours) * ($0.05 per compute hour) = $0.60. Compare that with the $5-$20/month charge for some Twitter alert services, or the $20-$68 price of bringing up an instance from Amazon or Rackspace, directly.
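
Here’s the same arithmetic as a quick sketch, if you want to plug in your own numbers:

runs_per_month = 60 * 24 * 30          # one run per minute, 30 days
seconds_per_run = 1                    # assumed processing time per run
compute_hours = runs_per_month * seconds_per_run / 3600.0
monthly_cost = compute_hours * 0.05    # $0.05 per compute hour
# -> 12.0 compute hours, $0.60 per month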

Conclusion: Why PiCloud should be your go-to Cron Artist

Easy: All you need is one line of code: cloud.cron.register().
Fire-and-forget: Once you’ve registered a cron, we’ll make sure it works until the end of time.
Scalable: If you have a lot of crons, we’ll automatically distribute them across multiple machines in our cluster.
Monitoring: Ever wonder how your cron is doing, but can’t find the logs? Would you like to know exactly when your script stopped working? Just check your job dashboard to see a full history of your computation for easy debugging.
Inexpensive: For basic usage, you could be charged less than a dollar per month!

How to encode all of your videos, quickly and cheaply!

Wednesday, July 21st, 2010

Update: This post is outdated. Please refer to our most recent documentation example on video encoding.

With the ubiquity of video on the web, it’s important that services be able to encode their videos in a variety of formats to maximize their viewership. Specific formats are necessary for displaying content on certain media, most notably flv for Flash videos and mp4 for the iPhone. Video encoding is a time-consuming and computationally intensive task, which makes the computing power of the cloud ideal for the job. This post will cover how to use PiCloud to offload encoding to the cloud using our cloud library and ffmpeg, a popular video encoding tool. With just a couple lines of code, you’ll be able to leverage the compute power of hundreds of cores on Amazon Web Services without touching a single server, at a fraction (3%-20%) of the cost of encoding.com.

Source Video

You can use any avi file as the “source video.” If you want to follow this post to the letter, you can download what we used: rickroll.avi. Use the “Save file to your PC” link (BEWARE: The “Download Now” graphics are ads).

ffmpeg Basics

ffmpeg provides a command-line interface for manipulating videos. Since it’s not our purpose to teach ffmpeg in this post, here are the two command strings we’ll be using:

1. Converting to flv: 'ffmpeg -i source_video.avi -y -b 200000 -r 25 -s 320x240 -ab 56 -ar 44100 -f flv output_video.flv'
2. Converting to mp4: 'ffmpeg -i source_video.avi -y -b 200000 -r 25 -s 320x240 -acodec aac -ab 128kb -vcodec mpeg4 -b 1200kb -mbd 2 -flags +4mv -cmp 2 -subcmp 2 -s 320x180 output_video.mp4'

For more useful commands, check out the 19 ffmpeg commands for all needs.

Example 1: Encoding a video locally

Assuming you have ffmpeg installed, the function below, ffmpeg_exec(), will encode a specified source video on your local machine.

from subprocess import Popen, PIPE

encoding_cmd_strings = {
'flv': 'ffmpeg -i {0} -y -b 200000 -r 25 -s 320x240 -ab 56 -ar 44100 -f flv {1}',
'mp4': 'ffmpeg -i {0} -y -b 200000 -r 25 -s 320x240 -acodec aac -ab 128kb -vcodec mpeg4 -b 1200kb -mbd 2 -flags +4mv -cmp 2 -subcmp 2 -s 320x180 {1}'
}

def ffmpeg_exec(source, target, encoding):
    """Uses a shell call to ffmpeg to convert a video
    to the desired encoding"""

    # Popen calls the ffmpeg process, and collects the standard out/error
    p = Popen(encoding_cmd_strings[encoding].format(source, target),
                  stdout=PIPE,
                  stderr=PIPE,
                  shell=True)
    stdout, stderr = p.communicate(input=None)

    # return these for debugging purposes
    return stdout, stderr

Running the function ffmpeg_exec('rickroll.avi', 'rickroll.flv', 'flv') produces a flash video of the rickroll.avi source file. Likewise, ffmpeg_exec('rickroll.avi', 'rickroll.mp4', 'mp4') produces an mpeg4 encoding.

Example 2: Retrieving a file from the cloud, encoding it locally, and then putting it on the cloud.

We’ll define a function, convert_video(), to download the source video, encode it using ffmpeg_exec(), and then put the encoded file on the cloud. For convenience, we’ll use the cloud.files module to get and put your video files, but you could use other storage locations such as Amazon S3 (via boto), a database, or even a website.

If you have downloaded the rick roll video, you can store it on the cloud from the Python console:

>>> import cloud
>>> cloud.files.put('rickroll.avi')

convert_video() uses cloud.files.get() to retrieve the source video that we’ve stored on the cloud, encodes it, and then puts the encoded file on the cloud with cloud.files.put().

import os
import cloud

def convert_video(source, encoding):
    """Gets the source file, converts it to the specified encoding,
    and puts it on the cloud"""

    # automatically generate target name, ie. video.avi -> video.flv
    basename, ext = os.path.splitext(source)
    target = '%s.%s' % (basename, encoding)

    # gets the source file from the cloud and saves it to the
    # current directory with the same name
    cloud.files.get(source, source)

    # execute ffmpeg (Example 1)
    ret = ffmpeg_exec(source, target, encoding)

    # store output file on the cloud
    cloud.files.put(target)

    return ret

You can verify that convert_video('rickroll.avi', 'flv') adds ‘rickroll.flv’ to your cloud files collection.

>>> convert_video('rickroll.avi', 'flv')
>>> cloud.files.list()
['rickroll.avi', 'rickroll.flv']

Example 3: Encoding a video with PiCloud

Now that we’ve created the functions to encode a video locally, we want to move the computation to the cloud. We’ll use our cloud library to do this. The most basic function in the library is cloud.call(), which takes a function as its argument, and returns a job id (an integer). cloud.call() inspects the execution state of the Python interpreter and copies everything it needs to execute the given function on PiCloud’s cluster. The only change we’ll need to make is the following: instead of calling convert_video() directly, we’ll pass the function into cloud.call().

# executes convert_video('rickroll.avi', 'flv') on the cloud
# _high_cpu mode dedicates 2.5 compute units to the task (2.5-3.0GHz core)
jid = cloud.call(convert_video, 'rickroll.avi', 'flv', _high_cpu=True)

The function is now running on PiCloud. You can check the jobs panel in the web interface to see its status.


Alternatively, you can use cloud.status(jid) to see when the function is done.

>>> cloud.status(jid)
'processing'
>>> cloud.status(jid)   # after some time has passed
'done'

If you check the result of the function using cloud.result() (blocks until completion), you’ll get this:


FFmpeg version SVN-r22379, Copyright (c) 2000-2010 the FFmpeg developers
  built on Mar  9 2010 12:45:06 with gcc 4.4.1
  libavutil     50.11. 0 / 50.11. 0
  libavcodec    52.58. 0 / 52.58. 0
  libavformat   52.55. 0 / 52.55. 0
  libavdevice   52. 2. 0 / 52. 2. 0
  libswscale     0.10. 0 /  0.10. 0
Input #0, avi, from 'rickroll.avi':
  Duration: 00:03:34.96, start: 0.000000, bitrate: 2108 kb/s
    Stream #0.0: Video: mpeg4, yuv420p, 704x544 [PAR 1:1 DAR 22:17], 25 tbr,
 25 tbn, 25 tbc
    Stream #0.1: Audio: mp3, 48000 Hz, 2 channels, s16, 128 kb/s
Output #0, flv, to 'rickroll.flv':
  Metadata:
    encoder         : Lavf52.55.0
    Stream #0.0: Video: flv, yuv420p, 320x240 [PAR 33:34 DAR 22:17],
q=2-31, 200 kb/s, 1k tbn, 25 tbc
    Stream #0.1: Audio: libmp3lame, 44100 Hz, 2 channels, s16, 0 kb/s
Stream mapping:
  Stream #0.0 -> #0.0
  Stream #0.1 -> #0.1
Press [q] to stop encoding
[mp3 @ 0x1adfe70]incomplete frame   8785kB time=211.24 bitrate= 340.7kbits/s
frame= 5374 fps=177 q=2.0 Lsize=    8877kB time=214.96 bitrate= 338.3kbits/s
video:5305kB audio:3359kB global headers:0kB muxing overhead 2.456632%

Congrats! You’re now officially encoding on the cloud.

Example 4: Leveraging Parallelism to Batch Process a Large Video Collection

While encoding a dozen hours of videos using the above functions may be tractable on a single machine, encoding an entire library composed of thousands of hours is not. This is where the elasticity of the cloud shines. Using PiCloud, you can easily leverage the parallel computing power of hundreds of cores on Amazon. Instead of using cloud.call to run a function once in the cloud, use cloud.map to run the same encoding function on all videos.

To encode all videos in both flv and mp4 locally, we can do the following:

# this list can contain as many source files as you want
source_names = ['rickroll.avi', 'source1.avi', 'source2.avi']
source_args = 2*source_names
# one encoding per entry in source_args: flv for the first copy, mp4 for the second
encoding_args = ['flv']*len(source_names) + ['mp4']*len(source_names)

# expands to: map(convert_video, ['rickroll.avi', 'source1.avi', 'source2.avi', 'rickroll.avi', 'source1.avi', 'source2.avi'], ['flv', 'flv', 'flv', 'mp4', 'mp4', 'mp4'])
map(convert_video, source_args, encoding_args)

To move the work to PiCloud, change the map function to the cloud.map function:

jids = cloud.map(convert_video, source_args, encoding_args, _high_cpu=True)

That’s all it takes to offload your encoding to our cluster! We’ll automatically scale up the number of Amazon EC2 instances in our cluster depending on how much workload you give us (we estimate this on the fly). Here’s a graph demonstrating the speed gains from this one-line change:


The local machine is equivalent to a single 2.5GHz Core i7 Intel processor. If you’re still thinking, “but I need to process videos even faster,” then check out our real time compute units feature.

How much did that cost?

According to my account, encoding 30 3-minute videos, which took about 120 seconds of wall-clock time, cost me $0.073. Each video took about 70 seconds to get, encode, and save, for a total of 30*70=2100 seconds of compute time. Since I was using high cpu mode (2.5 compute units), that works out to (2100 seconds)/(3600 seconds per hour)*(2.5 compute units)=1.46 compute unit hours. At the rate of $0.05 per compute unit hour, the total cost was 1.46 * $0.05 = $0.073.
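
The same calculation as a sketch you can adapt to your own batch:

videos = 30
seconds_per_video = 70                 # get + encode + save
compute_units = 2.5                    # _high_cpu mode
compute_unit_hours = videos * seconds_per_video / 3600.0 * compute_units
cost = compute_unit_hours * 0.05       # $0.05 per compute unit per hour
# -> about 1.46 compute unit hours, roughly $0.073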

With encoding.com, the same task would cost $2.97 at their cheapest high-volume tier. This was derived from $1.80/GB * (55 MB/Rick Roll) * (30 Rick Rolls). That makes PiCloud less than 3% the cost of encoding.com! To be fair, if you aren’t storing your videos on Amazon, you’ll have to pay bandwidth costs, which will be (1.65GB Data In)*($0.15/GB) + (1.65GB Data Out)*($0.16/GB) = $0.512. PiCloud’s total cost would be $0.512+$0.073=$0.585, which is still only 20% of the cost of encoding.com. Extra point for PiCloud: we didn’t include the amount you’d have to pay for bandwidth to send and receive video files with encoding.com. Needless to say, they do have a full video encoding service with a wide range of options and customer support, whereas we’re showing you a building block that could be used to replicate their service. But this does give you an idea of the premium they are charging for their service.

Summary (TL;DR)

  • ffmpeg is a tool for encoding videos, and is available on PiCloud.
  • PiCloud offers the cloud.files module, a simple file storage service, as an easy way to get and put files on the cloud. Using cloud.files is completely optional (use whatever other data store you want), but it’s there when you need it.
  • Getting on the cloud with PiCloud is easy!
    • Passing convert_video() into cloud.call() is all you need to do to offload your encoding to the cloud.
    • If you want to encode a lot of videos, use cloud.map() instead of map(), and all of it will be pushed to the cloud for processing.
  • We’re inexpensive!

Take it from here, Rick!