Posts Tagged ‘s1’

Determining the Life Cycles of Natural Disasters with PiCloud

Tuesday, October 11th, 2011

This is the second in a series of guest blog posts that will be showcasing successful uses of the PiCloud Platform. The first was Building a Biological Database and Doing Comparative Genomics in the Cloud.

Background Information

We are developing various models that help define the life cycles of natural disasters. In order to develop these models, we need to process large amounts of data; the data comes from social networks such as Twitter and official government sources.

As you can imagine, computing and comparing data across different sources is difficult, not to mention the scale of the data that needs to be processed at a moment’s notice. In addition, the research project is run by social scientists, so both budget and technical personnel are limited; I am one of the few technical people working on it.

My objective is to develop a software system that helps discover and detect the life cycles of natural disasters, all within the constraints mentioned above.

The general process of my software system is as follows:

  1. Scrape data and store it in a MongoDB cluster hosted on Amazon EC2.
  2. Retrieve the data and perform analysis on it, such as natural language processing.
  3. Store the results of step 2.
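
As a rough illustration of the three steps above (not the project's actual code), the sketch below uses plain Python lists in place of the MongoDB collections. The names fetch_posts, analyze, run_pipeline and DISASTER_TERMS are hypothetical, and the keyword check only stands in for real natural language processing.

```python
# Hypothetical stand-ins: plain lists replace the MongoDB collections and
# fetch_posts() replaces the real scraper.
DISASTER_TERMS = {"flood", "earthquake", "wildfire"}  # illustrative keywords only


def fetch_posts():
    """Step 1 stand-in: return scraped posts as dictionaries."""
    return [
        {"id": 1, "text": "Flood warning issued for the river valley"},
        {"id": 2, "text": "Great coffee this morning"},
    ]


def analyze(post):
    """Step 2 stand-in: list the disaster terms a post mentions."""
    words = {w.strip(".,!?").lower() for w in post["text"].split()}
    return {"id": post["id"], "terms": sorted(words & DISASTER_TERMS)}


def run_pipeline(raw_store, result_store):
    """Run the three steps against two list-like stores."""
    raw_store.extend(fetch_posts())                     # step 1: scrape and store
    result_store.extend(analyze(p) for p in raw_store)  # steps 2 and 3: analyze, store
    return result_store


raw_posts, results = [], []
run_pipeline(raw_posts, results)
```

In a real deployment each list would be a database collection, and the analysis step is the part that would be farmed out to PiCloud.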

I am currently running my application on Ubuntu 10.04 and using MongoDB as my main data store. Python is my primary programming language. I’ll be sharing two ways in which PiCloud helped me cut down development time and costs tremendously.

How PiCloud Saved My Day with the s1 Core Type

As mentioned earlier, I am required to scrape data using Twitter’s API. I originally used PiCloud’s c1 core type; it worked for a while, but I quickly ran into throttling issues. Fortunately, PiCloud announced the s1 core type just a day before I hit them.

PiCloud’s s1 core type is optimized for mining and scraping operations: because each core uses a different IP address, the effect of per-IP throttling is minimal. Had I needed to build a similar architecture myself, I might not have been able to complete the alpha version of the software system.

I now rely on the s1 core type, which helps me scrape and mine various data sources, in particular Twitter and other news sources.

Switching to the s1 core type is extremely easy; all we need to do is change the _type keyword:

# from
_type = 'c1'
# to
_type = 's1'

Thanks to PiCloud’s s1 core type, I can continue to collect data with minimal worries.
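
For context, here is a minimal sketch of where that keyword could sit in a scraping job. It assumes PiCloud's cloud.map and cloud.result calls behave as I recall them (submit one job per item, then collect the results); scrape_all, fake_fetch and the example URLs are hypothetical names used only for illustration. The client module is passed in as an argument so that the same function also runs locally without PiCloud.

```python
def scrape_all(fetch, urls, cloud_module=None, core_type="s1"):
    """Run fetch(url) for every URL, on PiCloud if a client module is given."""
    if cloud_module is None:
        # No PiCloud client supplied: run sequentially on this machine.
        return [fetch(url) for url in urls]
    # Assumption: cloud.map submits one job per URL and returns job ids,
    # and cloud.result blocks until those jobs finish and returns their values.
    job_ids = cloud_module.map(fetch, urls, _type=core_type)
    return list(cloud_module.result(job_ids))


def fake_fetch(url):
    """Placeholder for a real request to Twitter or a news site."""
    return "fetched " + url


# Local run, no PiCloud involved.
pages = scrape_all(fake_fetch, ["http://example.com/a", "http://example.com/b"])
```

With the PiCloud client installed, the call would look like scrape_all(fetch, urls, cloud_module=cloud) after an import cloud, and switching between c1 and s1 is then a one-argument change.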

How PiCloud’s Serverless Architecture Saves Development Time

As I am storing my data on Amazon EC2 in a MongoDB cluster, I need a way for PiCloud to ‘speak’ with my database. My initial application had a REST interface that allowed PiCloud to ‘speak’ with MongoDB; each REST endpoint represents a cost in terms of development time and ongoing maintenance.

On average, I need to spend 20 minutes or more to develop a REST endpoint for a specific use. Now, using PiCloud, I do not need to develop or maintain these REST endpoints at all. PiCloud’s serverless paradigm means all I need to do is allow PiCloud to read my database directly, and calculations can be performed immediately. Here’s how it works:

We typically connect to MongoDB on localhost by doing the following:

from pymongo import Connection
connection = Connection('localhost')

We can run the same code on PiCloud by simply changing localhost to the external IP address of the Mongo server, so that PiCloud knows how to reach it:

from pymongo import Connection
# 'xxx.xxx.xxx.xxx' is the IP address and 12345 is the port number
connection = Connection('xxx.xxx.xxx.xxx', 12345)

If we want the same code to work both locally and on PiCloud, we use cloud.running_on_cloud() to check how we should connect to MongoDB.

import cloud
from pymongo import Connection
if cloud.running_on_cloud():
    # 'xxx.xxx.xxx.xxx' is the IP address and 12345 is the port number
    connection = Connection('xxx.xxx.xxx.xxx', 12345)
else:
    # Running locally: MongoDB is on this machine
    connection = Connection('localhost')

This means that I can leverage hundreds of cores automatically without any server management or configuration on my part. All I need to do is to write Python code and PiCloud takes care of the rest.
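
If several scripts need the same local-versus-cloud branch, one option is to keep the decision in a small helper. The sketch below is only an illustration: mongo_address is a hypothetical name, and 27017 is MongoDB's default port, assumed here for the local case.

```python
def mongo_address(on_cloud, remote_host, remote_port=27017):
    """Return the (host, port) pair to pass to pymongo's Connection.

    on_cloud    -- True when the code is executing on PiCloud
    remote_host -- external IP address or hostname of the Mongo server
    remote_port -- port the Mongo server listens on (27017 is MongoDB's default)
    """
    if on_cloud:
        return (remote_host, remote_port)
    # Local development: assume MongoDB on this machine, default port.
    return ("localhost", 27017)
```

The result can be unpacked straight into the constructor, for example Connection(*mongo_address(cloud.running_on_cloud(), host, port)), where host and port are the external address of the Mongo server.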

I am also using PiCloud’s other core types, including c1, c2 and m1, depending on the amount of computing required. As I need to compute data frequency, keyword frequency and many other metrics at a moment’s notice, PiCloud’s serverless paradigm fits perfectly: I do not need to spin up an entire EC2 cluster, or even a single instance, just to perform my calculations.
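
To make the keyword frequency metric concrete, here is a small sketch of one way it might be computed. The function name keyword_frequency, the sample texts and the simple regular-expression tokenizer are all assumptions for illustration; the actual project may define the metric differently.

```python
import re
from collections import Counter


def keyword_frequency(texts, keywords):
    """Count how often each keyword appears across a collection of texts."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for text in texts:
        # Very rough tokenizer: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in wanted:
                counts[word] += 1
    return counts


sample = ["Flood waters rising, flood sirens on", "Earthquake felt downtown"]
freq = keyword_frequency(sample, ["flood", "earthquake", "fire"])
```

Because each text is counted independently, a function like this is a natural candidate for running over batches of documents in parallel and summing the resulting counters.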

Summary

PiCloud has taken care of my server management and configuration needs. The ability to call up computing power without spinning up new EC2 instances is a huge time saver. It has saved me a tremendous amount of time and effort, allowing me to focus on what is truly important: the algorithms and the calculations on the data.


Eugene enjoys solving difficult problems creatively by building web applications using Python/Django and JavaScript/jQuery. He also enjoys doing research in the areas of social computing, social media analysis, recommendation algorithms, link analysis, data visualization, data mining, information retrieval, business intelligence, and intelligent user interfaces. You can find him at http://www.liangeugene.com.

Introducing Scraping-Optimized Cores

Monday, October 10th, 2011

If you aggregate data from around the web, you’ll find our latest core type to be an integral part of your toolbox. s1 cores are comparable in performance to c1 cores, with one extra trick up their sleeve: each job running in parallel will have a different IP.

Why is this important?

Using unique IPs is necessary to minimize the automated throttling most sites will impose when seeing fast, repeated access from a single IP.

How do I use it?

If you’re already using our c1 cores, all you’ll need to do is set the _type keyword.

cloud.call(func, _type='s1')

How much?

$0.04/core/hour

Why don’t other cores have individual IPs?

For other core types, such as c2, multiple cores may be running on a single machine that is assigned only a single IP address. When using s1 cores, you’re guaranteed that each core sits on a different machine.

Suggestions?

We’re excited to move the s1 core type out of beta for our customers. If you have any suggestions for other core types you would like to see, please let us know.