At PiCloud, we’ve accumulated over 100,000 instance requests on Amazon EC2. Our scale has exposed us to many odd behaviors and outright bugs, which we’ll be sharing in a series of blog posts to come. In this post, I’ll share one of the strangest we’ve seen.
It started with a customer filing a support ticket about code that had been working flawlessly for months suddenly crashing. Some, but not all, of his jobs were failing with an error that looked something like:
Fatal Python error: Illegal instruction
File "/usr/local/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 1319 in svd
File "/usr/local/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 1546 in pinv
That’s odd, I thought. I had never before seen the Python interpreter die with an illegal instruction! Naturally, I checked the relevant line that was crashing:
results = lapack_routine(option, m, n, a, m, s, u, m, vt, nvt, work, lwork, iwork, 0)
A call into numpy’s lapack_lite extension. Great: the normally robust numpy was the one crashing.
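For reference, the code path the customer’s jobs were exercising is easy to reproduce with a few lines of numpy; the matrix below is an arbitrary stand-in, not the customer’s actual data:

```python
import numpy as np

# Any non-square matrix exercises the SVD-based pseudo-inverse path.
a = np.arange(12.0).reshape(4, 3)

# pinv calls svd internally; both bottom out in the lapack_lite
# extension (or MKL, when numpy is built against it).
u, s, vt = np.linalg.svd(a, full_matrices=False)
p = np.linalg.pinv(a)

print(p.shape)  # the pseudo-inverse has the transposed shape: (3, 4)
```

On a machine with the bug described below, either call could die mid-routine with an illegal instruction rather than raise a Python exception.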
More surprising, only a minority of jobs were failing, even though the customer indicated that every job executed the problematic line. We did notice that the failures were confined to just a few servers, and that those servers ran none of the customer’s jobs successfully. Unfortunately, our automated scaling systems had already torn those servers down.
The first thing I did was Google the error. Most results were unhelpful, but one old, since-fixed bug in Intel’s Math Kernel Library (MKL) stood out: MKL would crash with an illegal-instruction error when AVX (Advanced Vector Extensions, a 2011 extension to x86) instructions were executed on CPUs that lacked support for them. Why notable? We compile the numpy and scipy libraries with MKL support to get the best possible multi-threading performance, especially on our hyperthreading- and AVX-capable f2 core.
Still, why did only a few servers crash? With not much to go on, I launched a hundred High-Memory m2.xlarge EC2 instances (200 m2 cores in PiCloud nomenclature) and reran all the user’s jobs across them. A few jobs, all on the same server, failed.
As I compared the troublesome instance to the sane ones, one difference stood out. The correctly operating m2.xlarge instances were running 2009-era Intel Xeon X5550 CPUs, but the troublesome instance was running a more modern (2012) Xeon E5-2665 CPU. And, returning to the MKL bug noted earlier, this newer chip supported AVX.
Examining /proc/cpuinfo showed as much: AVX was advertised on the failing instance, but not on the healthy ones. To test it out, I compiled some AVX code from Stack Overflow with ‘g++ -mavx’. Sure enough, running the binary produced an Illegal Instruction.
From my perspective as an instance user, the processor was lying: it claimed to support AVX, but crashed whenever AVX code actually ran.
It turns out the actual answer was subtle. Per the Intel manual, the operating system can disable AVX instructions by leaving the processor’s OSXSAVE feature disabled. By the spec, any application wishing to use AVX must first check that OSXSAVE is enabled.
Amazon seems to have disabled the OSXSAVE feature at the hypervisor layer on their new Xeon E5-2665 based m2.* series of instances. This may just be because their version of the Xen hypervisor that manages these instances lacks support for handling AVX registers in context switching. But even if support does exist in the hypervisor, it makes sense to disable AVX for the m2.* family as long as Xeon X5550 based instances remain in the fleet. Imagine compiling a program on an m2.xlarge EBS instance, thinking you had AVX support, and then, upon stopping/starting the instance, finding that the program crashes because your instance now runs on older hardware that doesn’t support AVX! A downside of VM migration is that all your hardware must advertise the least common denominator of capabilities.
Unfortunately, Amazon did not ensure that the guest OS saw OSXSAVE as disabled. This led MKL to believe it could run AVX code when it actually couldn’t.
Ultimately, there was not much to do but:
- Given how rare the Xeon E5-2665 instances are, our software now simply self-destructs if an m2.* instance’s /proc/cpuinfo claims that both avx and xsave are enabled
- File a support case with Amazon. They have been quite responsive, and as I publish this post, it seems a fix has been at least partially pushed.
So, if you use instances in the m2.* family, be sure to check /proc/cpuinfo. If the instance claims it has both avx and xsave, it is probably lying to you.
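That check is easy to script. Here is a minimal sketch (Linux-only, since it parses /proc/cpuinfo; the flag names are as the kernel reports them):

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo contents."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def claims_avx(cpuinfo_text):
    """True if the instance advertises both avx and xsave.
    On an m2.* instance, treat this combination as suspect."""
    flags = cpu_flags(cpuinfo_text)
    return "avx" in flags and "xsave" in flags

# Usage on a live machine:
# with open("/proc/cpuinfo") as f:
#     if claims_avx(f.read()):
#         ...  # bail out, as our m2.* instances now do
```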
Alternatively, if you are doing high performance computation in the cloud, you may just want to pass on the responsibility for such dirty details to us at PiCloud.
Categories: Battle Stories