Building a Biological Database and Doing Comparative Genomics in the Cloud

September 27th, 2011 by Joemar Taganna, PhD candidate at the Flanders Institute for Biotechnology (VIB)

This is the first in a series of guest blog posts showcasing successful uses of the PiCloud Platform.

Cloud computing is a revolutionary technology. It spares people who would rather spend their time and effort directly on high-level innovations from having to buy, set up and maintain powerful computers.

In our particular use case, we are using cloud computing for bioinformatics, specifically for building a database and developing a pipeline for comparative analysis of a subset of genes across different E. coli genomes. I work in a lab where bioinformatics is not a primary research interest, so I don’t have immediate access to an in-house computer cluster for computationally intensive tasks. This led me to explore cloud computing, and since I write scripts in Python, a Google search for “cloud computing” and “Python” led me to PiCloud.

After experimenting with Amazon EC2 myself and trying to figure out how to build a smooth interface to it from my pipeline, it was a relief to find a Python library that does exactly this, and does it in a distinctly Pythonic way: simple and elegant. With PiCloud’s library, you can deploy your computations to the cloud, in a massively parallel mode if need be, with minimal modifications to your Python code.

At the moment, I mainly use PiCloud on two fronts. First, I use it for some critical functions in a web-based biological database of our genes and proteins of interest. The database runs on Google App Engine (itself another breed of cloud computing service). App Engine is great, but it has limitations that prevent biological databases from running on its platform alone: I needed NumPy, matplotlib and Biopython, among other Python packages, as well as command-line programs written in C and C++ (e.g. ClustalW, EMBOSS and HMMER), none of which I could install or use on App Engine. Second, I use it in the pipeline development side of my work, which I do on my laptop. While I do have all these packages and programs installed there, certain computations take a long time and can slow down my system. I now submit these tasks to the cloud, where I can choose to run them a lot faster.

Python packages and/or programs written in other languages that you need can be used within PiCloud with their Environments feature. Configuring an environment is essentially setting up a filesystem to include all of the packages and programs you need. You configure a filesystem by SSH-ing into a Linux virtual machine, where you can install packages and programs through the command line as you would on any machine. After you log off of the machine, you save the environment, and then you’re ready for action. As long as you’ve installed all the things you need, you never have to do the configuration again. However, if you need to install more, you just connect again to the VM through SSH and modify it…it’s that easy.

Running functions on PiCloud is just as easy as calling:

>>> cloud.call(your_function, 'argument')

and if you need to use your preconfigured environment you just add the _env keyword argument:

>>> cloud.call(your_function, 'argument', _env='your_saved_environment')
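The submit-then-fetch pattern behind such a call can be tried locally without a PiCloud account. The sketch below is only a stand-in built on Python’s standard `concurrent.futures` module; it is not PiCloud’s API, and `gc_content` and the sample sequences are hypothetical examples chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical task)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up input data for the demonstration.
sequences = ["ATGCGC", "ATATAT", "GGGCCC"]

# Local stand-in for submitting jobs: submit() returns a handle right away,
# much like a job ID, and the result is collected later from that handle.
with ThreadPoolExecutor(max_workers=3) as pool:
    handles = [pool.submit(gc_content, seq) for seq in sequences]
    results = [handle.result() for handle in handles]

print(results)
```

The function itself is unchanged by being submitted this way, which is the sense in which only minimal modifications to existing code are needed.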

For our database that runs on Google App Engine, invoking cloud.call directly won’t work because, just like the other Python packages mentioned, the PiCloud library can’t be installed there. Luckily, PiCloud has a way to talk with any web app through its REST API. For this, you need to register the functions that you would like to call through REST:

>>> cloud.rest.publish(your_function, 'name_of_your_function')

This gives you the URL to which you send HTTP GET or POST requests to invoke the function; each request returns a job ID as its response. The job’s status and results are then fetched with GET requests to another set of URLs that take the job ID as a parameter. (Please refer to the docs for more details.)
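The invoke, poll, fetch sequence described here can be sketched in a few lines. Everything specific in this sketch is an assumption made for illustration: the URL paths, the JSON field names (`jid`, `status`, `result`) and the in-memory fake service are hypothetical, not PiCloud’s documented REST interface. The HTTP transport is passed in as a `fetch` function so the flow can be run without a network.

```python
import json


def invoke(fetch, function_url, params):
    """POST arguments to a published function's URL and return the job ID."""
    response = fetch("POST", function_url, params)
    return json.loads(response)["jid"]  # "jid" is an assumed field name


def wait_for_result(fetch, base_url, jid, max_polls=10):
    """Poll the job's status URL, then GET its result once the job is done."""
    for _ in range(max_polls):
        reply = fetch("GET", f"{base_url}/job/{jid}/status", None)
        status = json.loads(reply)["status"]
        if status == "done":
            reply = fetch("GET", f"{base_url}/job/{jid}/result", None)
            return json.loads(reply)["result"]
        if status == "error":
            raise RuntimeError(f"job {jid} failed")
        # A real client would sleep between polls; omitted to keep the demo fast.
    raise TimeoutError(f"job {jid} not finished after {max_polls} polls")


def make_fake_service():
    """In-memory stand-in for the remote service, for demonstration only."""
    state = {"polls": 0}

    def fetch(method, url, params):
        if method == "POST":
            return json.dumps({"jid": 42})
        if url.endswith("/status"):
            state["polls"] += 1
            done = state["polls"] >= 2  # report "done" on the second poll
            return json.dumps({"status": "done" if done else "processing"})
        if url.endswith("/result"):
            return json.dumps({"result": "ALIGNED"})
        raise ValueError(f"unexpected URL: {url}")

    return fetch


fetch = make_fake_service()
jid = invoke(fetch, "https://example.invalid/published/align",
             {"seq_a": "MKT", "seq_b": "MKS"})
result = wait_for_result(fetch, "https://example.invalid", jid)
print(jid, result)
```

In a web app, `fetch` would be replaced by a function that performs the real HTTP request with the account’s credentials, using the URLs given in the docs.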

In our database, we use the REST API for generating operon maps and for fetching sequences from databases, both through functions that depend on Biopython. We also have REST functions for doing sequence alignment (using ClustalW or EMBOSS) in the cloud on inputs received from the web app. Because PiCloud’s REST API and Google App Engine are both cloud-based, we have a working prototype of how to leverage cloud computing to build biological databases that can potentially scale to future spikes in usage and traffic without a major rewrite of the code.

“this would normally take 20 to 25 hours… I was able to run it in PiCloud in only 50 minutes”

In my latest run of our offline bioinformatics pipeline, I analysed more than 100 E. coli genomes, in which we found around 1400 of our proteins of interest in total. I needed to do sequence alignment of all the pairwise combinations of those 1400 proteins, which is around a million protein pairs. On my laptop (2.26 GHz Intel Core 2 Duo with 4GB RAM), this would normally take 20 to 25 hours to complete without parallelization. I was able to run it in PiCloud in only 50 minutes, and I could have run it faster with more optimized code. All steps in the pipeline can be done from a web app, again through the REST API. The plan for the final “publicly consumable” form of the program is therefore to make it completely web-based, running entirely in the cloud.
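The figures in this paragraph can be checked with a short calculation. The sketch below counts the protein pairs, shows one possible way to split them into batches for parallel jobs, and works out the speedup implied by the reported timings. The batch size of 10,000 and the `protein_…` identifiers are assumptions for illustration; the post does not say how the work was actually divided.

```python
from itertools import combinations, islice
from math import comb

n_proteins = 1400

# Number of unordered pairs: 1400 * 1399 / 2
n_pairs = comb(n_proteins, 2)


def batched_pairs(items, batch_size):
    """Yield lists of up to batch_size unordered pairs drawn from items."""
    pairs = combinations(items, 2)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical identifiers and batch size, chosen only for this example.
protein_ids = [f"protein_{i}" for i in range(n_proteins)]
n_batches = sum(1 for _ in batched_pairs(protein_ids, 10_000))

# Speedup implied by 20 to 25 hours on the laptop versus 50 minutes in the cloud.
speedup_low = 20 * 60 / 50
speedup_high = 25 * 60 / 50

print(n_pairs, n_batches, speedup_low, speedup_high)
# -> 979300 98 24.0 30.0
```

So "around a million" is 979,300 pairs, and the reported run times correspond to a speedup of roughly 24 to 30 times.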

With easy access to the cloud, the possibilities are unlimited. Ordinary biologists can now do computationally demanding tasks without having to delve deep into server and database management, and without the complexities and hassles of running jobs on grids and clusters.

Joemar Taganna is a PhD candidate at the Flanders Institute for Biotechnology (VIB) in Belgium. He works on a collaborative project between the Laboratory of Structural and Molecular Microbiology at Vrije Universiteit Brussel and the Laboratory of Medical Biotechnology at Universiteit Gent. The project aims to build a holistic view of the major adhesive protein arsenal of various E. coli pathotypes in order to elucidate pathogen-specific targets for anti-adhesive therapeutics.
