Determining the Life Cycles of Natural Disasters with PiCloud

October 11th, 2011 by Liang Yuxian Eugene, Research Assistant at National Cheng-Chi University

This is the second in a series of guest blog posts that will be showcasing successful uses of the PiCloud Platform. The first was Building a Biological Database and Doing Comparative Genomics in the Cloud.

Background Information

We are developing various models that help define the life cycles of natural disasters. In order to develop these models, we need to process large amounts of data; the data comes from social networks such as Twitter and official government sources.

As you can imagine, computing and comparing data across different sources can be difficult, not to mention that the data must be processed at a moment’s notice and at scale. In addition, the research project is run by social scientists, so we are constrained in both budget and technical personnel; I am one of the few technical staff working on this project.

My objective is to develop a software system that helps discover and detect the life cycles of natural disasters, all within the constraints mentioned above.

The general process of my software system is as follows:

  1. Scrape data and store it in a MongoDB cluster hosted on Amazon EC2.
  2. Retrieve the data and perform analysis on it, such as natural language processing.
  3. Store the results of step 2.
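
To make the pipeline concrete, here is a minimal sketch of those three steps on PiCloud, assuming hypothetical source URLs and a trivial word count standing in for the real scraping and NLP code:

import cloud  # PiCloud client library

def process_source(url):
    import urllib2
    raw = urllib2.urlopen(url).read()   # step 1: scrape the source
    word_count = len(raw.split())       # step 2: stand-in for the real NLP analysis
    return (url, word_count)            # step 3: in practice, written back to MongoDB

urls = ['http://example.com/feed1', 'http://example.com/feed2']  # hypothetical sources
jids = cloud.map(process_source, urls)  # one PiCloud job per source
results = cloud.result(jids)            # blocks until every job finishes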

I am currently running my application on Ubuntu 10.04 and using MongoDB as my main data store. Python is my primary programming language. I’ll be sharing two ways in which PiCloud helped me cut down development time and costs tremendously.

How PiCloud Saved My Day with the s1 Core Type

As mentioned earlier, I am required to scrape data using Twitter’s API. I originally used PiCloud’s c1 core type; it worked for a while, but I quickly ran into throttling issues. Fortunately, PiCloud announced the s1 core type just a day before I hit those limits.

PiCloud’s s1 core type is optimized for mining and scraping operations: since each core uses a different IP address, the effects of IP-based throttling are minimal. Had I attempted to replicate a similar architecture myself to perform these mining operations, I might never have completed the alpha version of the software system.

I now leverage the s1 core type to scrape and mine various data sources, in particular Twitter and other news sources.

Switching to the s1 core type is extremely easy; all we needed to do was change the _type keyword:

# from
_type = 'c1'
# to
_type = 's1'
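
For context, here is roughly how that keyword fits into a job submission; the fetch function and the search URL below are illustrative, not my production scraper:

import cloud

def fetch(url):
    import urllib2
    return urllib2.urlopen(url).read()

# _type selects the core type; each s1 job runs with its own IP address
jid = cloud.call(fetch, 'http://search.twitter.com/search.json?q=typhoon', _type='s1')
data = cloud.result(jid)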

Thanks to PiCloud’s s1 core type, I can continue to collect data with minimal worries.

How PiCloud’s Serverless Architecture Saves Development Time

As I store my data in a MongoDB cluster hosted on Amazon EC2, I need a way for PiCloud to ‘speak’ with my database. My initial application exposed a REST interface that allowed PiCloud to ‘speak’ with MongoDB, but each REST endpoint represents a cost in development time and ongoing maintenance.

On average, I would spend 20 minutes or more developing a REST endpoint for a specific use. With PiCloud, I no longer need to develop or maintain these endpoints at all. PiCloud’s serverless paradigm means all I need to do is let PiCloud read my database directly, and calculations can be performed immediately. Here’s how it works:

We typically connect to MongoDB on a localhost by doing the following:

from pymongo import Connection
connection = Connection('localhost')

However, we can run the same code on PiCloud by simply changing 'localhost' to the external IP address of the Mongo server, so that PiCloud knows how to reach it:

from pymongo import Connection
# 'xxx.xxx.xxx.xxx' is the IP address and 12345 is the port number
connection = Connection('xxx.xxx.xxx.xxx', 12345)

If we want the same code to work both locally and on PiCloud, we use cloud.running_on_cloud() to check how we should connect to MongoDB:

import cloud
from pymongo import Connection

if cloud.running_on_cloud():
    # 'xxx.xxx.xxx.xxx' is the IP address and 12345 is the port number
    connection = Connection('xxx.xxx.xxx.xxx', 12345)
else:
    connection = Connection('localhost')
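
Putting it together, a function that queries MongoDB can then be shipped to PiCloud unchanged. A minimal sketch, assuming a hypothetical ‘disasters’ database with a ‘tweets’ collection:

import cloud
from pymongo import Connection

def count_tweets():
    if cloud.running_on_cloud():
        connection = Connection('xxx.xxx.xxx.xxx', 12345)  # Mongo server on EC2
    else:
        connection = Connection('localhost')
    db = connection['disasters']   # hypothetical database name
    return db['tweets'].count()    # hypothetical collection name

jid = cloud.call(count_tweets)  # the same code path works locally and on PiCloud
total = cloud.result(jid)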

This means that I can leverage hundreds of cores automatically without any server management or configuration on my part. All I need to do is to write Python code and PiCloud takes care of the rest.

I am also leveraging PiCloud’s other core types, which include the c1, c2, and m1 core types, depending on the amount of computation required. As I need to compute data frequency, keyword frequency, and many other metrics at a moment’s notice, PiCloud’s serverless paradigm fits perfectly: I do not need to spin up an entire EC2 cluster, or even a single instance, just to perform my calculations.
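
As an illustration, a heavier computation such as keyword frequency can simply be pushed to a faster core type; the function below is a simplified stand-in for my actual metrics code:

import cloud
from collections import defaultdict

def keyword_freq(texts):
    # count how often each word appears across a batch of documents
    counts = defaultdict(int)
    for text in texts:
        for word in text.split():
            counts[word] += 1
    return dict(counts)

texts = ['flood warning issued', 'flood waters rising fast']  # hypothetical sample
jid = cloud.call(keyword_freq, texts, _type='c2')  # c2: more compute per core
freqs = cloud.result(jid)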

Summary

PiCloud has taken care of my server management and configuration needs. The ability to call up computing power without spinning up new EC2 instances is a huge time saver. It has saved me a tremendous amount of time and effort, allowing me to focus on what is truly important: algorithms and data analysis.


Eugene enjoys solving difficult problems creatively in the form of building web applications using Python/Django and JavaScript/jQuery. He also enjoys doing research related to the areas of social computing, social media analysis, recommendation algorithms, link analysis, data visualization, data mining, information retrieval, business intelligence, and intelligent user interfaces. You can find him at http://www.liangeugene.com.


Categories: Success Story


One Response to “Determining the Life Cycles of Natural Disasters with PiCloud”

  1. Eugene says:

    Hey, just want to give an update on this research project. The initial findings have been published at First Monday (http://firstmonday.org).

    The URL for the publication is: http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3937
