Using AUFS and LXC to Manage Dependencies at Cloud Scale

May 1st, 2013 by Ken Park

Environments in a Nutshell

At PiCloud we strive to be the easiest service out-of-the-box, yet versatile enough to be used for the most complex applications. This ideal was especially challenging to realize when expanding our compute platform beyond Python-only modules to support the whole gamut of non-Python software. How could we extend the same ease-of-use that our cloud library offers to the complex world of software dependencies?

In this blog post, I’ll explore how we’ve harnessed three different technologies to achieve our goal of a highly-scalable software dependency manager. We call it Environments, and for many PaaS users, it’s what sets us apart.

Where We Were

PiCloud originated as a Python-only platform. In that simplified setup, our Dependency Analyzer could identify and ship all pure-Python dependencies at job submission time. Why only pure Python? Anything that requires compilation cannot be expected to work when shipped to a machine with a different architecture. Unfortunately, for performance reasons, many Python libraries are actually compiled C extensions, which meant that if you used one of them, our compute nodes needed to have that library pre-installed. Not surprisingly, it didn’t take long before this limitation became an issue for some of our users.

Where We Wanted To Be

When the time came to revamp our dependency manager, we sought a generic solution that would work for Python and non-Python software alike. We boiled our requirements for the new dependency manager down to the following:

  • Simple. It must be no more difficult than what a user would do on their own machine. Importantly, it should not require users to learn a new tool or configuration language.
  • Powerful. There should be minimal limitations on what packages or programs a user can install. Such flexibility requires granting escalated privileges (for dependency installation purposes) in controlled situations.
  • Low overhead. It must not introduce significant overhead to job processing. Among other things, this means satisfying a user’s dependencies must not require booting a new machine or rebooting an existing one.
  • Parallelizable. The value of PiCloud is giving users the freedom to use as many cores as needed to accomplish their tasks. Deploying dependencies should not hinder parallelization across many machines.

How We Got There

Filesystem Virtualization

Fundamentally, satisfying the dependencies of a program means ensuring that the appropriate files can be found in the proper places in the filesystem. We can then rephrase the role of the dependency manager as making sure a job is run in whatever filesystem environment it needs for proper function.

We took inspiration from the Python tool virtualenv, which lets you create, on one machine, multiple, isolated Python environments, each with its own packages and settings. In other words, it virtualizes those portions of the filesystem relevant to Python. Generalizing this concept, what we wanted was a virtualenv for the entire filesystem.
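To make the analogy concrete, here is roughly what virtualenv gives you on a single machine. The environment name is arbitrary, and the sketch uses `python3 -m venv`, the modern stdlib equivalent of the 2013-era virtualenv tool:

```shell
# Create an isolated Python environment in ./myenv
# (with the 2013-era tool this would be `virtualenv ./myenv`).
python3 -m venv ./myenv

# Activate it: python and pip now resolve inside ./myenv.
. ./myenv/bin/activate

# Any `pip install <package>` now lands in ./myenv's
# site-packages, not in the system-wide one.

deactivate
```

Everything installed while the environment is active lives under `./myenv`; the rest of the filesystem is untouched. Our goal was the same property, but for the entire filesystem rather than just Python’s corner of it.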

On *nix machines, the simplest way to virtualize the filesystem is the chroot program, which lets you designate any point in your filesystem hierarchy as the new filesystem “root”. Unfortunately, chroot lacks mechanisms for isolation and resource management. At the other end of the virtualization spectrum is a virtual machine, which gives you full control and isolation but comes with the huge overhead of starting and running the VM. In between the simple chroot utility and full VMs sit container technologies, which share the host’s kernel but can virtualize almost everything else. Built on recent Linux kernel developments, namely namespaces and cgroups, containers provide isolation and resource management, respectively. And starting a container is fast, because there is no operating system to boot up: the host’s kernel is already up and running. We chose LXC (LinuX Containers) as our container technology, because the kernel support it relies on has been mainlined into Linux.
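From the shell, the two ends of that spectrum look roughly like this (the paths and names are hypothetical, both commands require root, and the LXC invocation uses the 0.x-era tools current when this was written):

```shell
# Plain chroot: swap the filesystem root, with no isolation
# or resource management.
sudo chroot /srv/envs/base /bin/bash

# An LXC application container: the same filesystem swap, plus
# namespace isolation and cgroup resource control, and no OS boot.
sudo lxc-execute -n job-42 -s lxc.rootfs=/srv/envs/base -- /bin/my_program
```

The container starts in milliseconds rather than the minutes a VM boot can take, which is what makes per-job containerization feasible.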

Our new dependency management system was beginning to take shape. A user would start with an LXC-bootable base filesystem, and install whatever programs and files their jobs need. We would store this modified filesystem, which we’ll call an Environment from this point on. When the user runs a job, we would start an LXC container using their Environment as the filesystem. The next question is, how do we store and send these big Linux filesystems across the network to compute nodes?

Union Mount System

What we realized was that saving the entire Environment was wasteful. In practice, Environments would be mostly the same, with only a relatively small number of files added in or changed. Theoretically, given the base filesystem the Environment was built on, we only need the changes the user made on top of this base to recreate the full Environment. What would enable us to save just this difference, rather than the whole Environment?

The answer is union mounts. A union mount allows you to stack multiple filesystems and present a unioned view at a different mount point. It is the technology behind many Live CDs. A Live CD contains a bootable operating system but is typically not writable. Hence, upon booting from a Live CD, the system creates a temporary scratch space in RAM, then union mounts this RAM filesystem on top of the CD’s filesystem. After chrooting into the unioned mount, the user is presented with a machine seemingly running off a read-only CD, yet is free to tinker with and change operating system files while trying it out. Thanks to the magic of the union mount, all changes are written to the RAM filesystem, even when modifying files that came from the CD.
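A minimal sketch of that trick with AUFS looks like this (the branch paths are hypothetical, and it requires root plus a kernel with the aufs module):

```shell
# Stack a writable scratch branch (rw) over a read-only base branch (ro).
mkdir -p /tmp/scratch /mnt/union
sudo mount -t aufs -o br=/tmp/scratch=rw:/mnt/base=ro none /mnt/union

# The union looks like one ordinary writable filesystem, but every
# change lands in the scratch branch; /mnt/base is never modified.
sudo touch /mnt/union/new-file
ls /tmp/scratch/new-file
```

Saving an Environment then reduces to archiving the scratch branch alone, since the base branch is already stored.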

Several union filesystem tools exist, but after some research, we settled on AUFS (Another Union FileSystem), a nifty module created and maintained by Junjiro Okajima. We chose AUFS because it is relatively performant and reliable, and we have been impressed with Junjiro’s responsiveness to support requests.

So, how much does AUFS help? On PiCloud, the average size of Environment modifications is around 220 MB. Given a base filesystem size of around 2.5 GB (which includes many pre-installed libraries for convenience and performance), this leads to roughly a 12-fold savings in terms of storage and network transfer.

Environment In Action

So, putting all that together, here’s how PiCloud Environments work:

  1. A PiCloud user requests the creation of a new Environment through our web UI or client CLI.
  2. PiCloud launches a machine the user can log into, which is actually an AUFS stack of an empty filesystem (read-writable) on top of a base Environment (read-only), much like a Live CD.
  3. The user logs into this “setup server” and installs dependencies as they would on their own machine.

Environment Creation

  4. When the user is done setting up the Environment and requests to save it, their changes are sanitized for security and saved into our distributed file system.

Environment Saving

  5. The user can now submit jobs specifying that they should run in the custom Environment. In Python, the Environment is specified with the “_env” keyword; in bash, with the “-e” flag.
  6. PiCloud recreates the AUFS stack on our compute nodes and runs the user’s jobs in LXC containers.

Using Environment
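From the shell, running a job in a custom Environment might look like this (the Environment name and program are hypothetical, and the command shape assumes our client CLI’s exec command):

```shell
# Execute a program on PiCloud inside the Environment "my_env",
# selected via the -e flag.
picloud exec -e my_env ./my_program input.csv
```

The Python equivalent passes the same Environment name through the `_env` keyword, e.g. `cloud.call(my_func, _env='my_env')`.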

DFS Features That Improve Performance

As mentioned above, AUFS reduces the size of an Environment by roughly 12-fold. In practice, we end up doing much better than that thanks to our distributed file system (DFS), which has two important properties:

  1. Files from an Environment are transferred to the compute node at the block level, on demand (only when a program accesses them).
  2. Transferred files are cached on the compute node automatically.

To understand the benefits of these properties, consider the Matlab Compiler Runtime (MCR), which enables the stand-alone execution of compiled Matlab applications. When a user installs the MCR in an Environment, close to a gigabyte of files is added. But for the typical MCR-compiled application, only a small minority (less than 10%) of the MCR is accessed at runtime, significantly reducing the data transferred. And if a subsequent job runs on the same compute node and needs the same MCR Environment, it will already be available, without any data fetching over the network.

Check Out the New Features

If you’re currently a PiCloud user and have wondered how Environments work, I hope this has shed some light into the black box.

If you haven’t already, you should check out our recent updates to the Environment system. We’ve redone the web UI and added client-side CLI support for managing your Environments. Also, we are excited about the new sharing feature that lets users share Environments they’ve created with colleagues and the general public.


Categories: How It Works
