Guest Post: “Fault tolerance a new key feature for virtualization”

Below is an article originally published on the guest author’s blog. Who’s the author, you ask?

Kevin Lawton! Bio: pioneer in x86 virtualization, serial entrepreneur, business and technology visionary, prolific idea creator, news and business book junkie. Founding team member in a microprocessor startup, the author and lead for two Open Source projects, a public speaker, and at the forefront of what is now a multi-billion dollar x86 virtualization industry. I have a degree in computer science and started my career at MIT Lincoln Laboratory.


Fault tolerance a new key feature for virtualization

VM migration has been a key feature and enabling technology which has differentiated VMware from Microsoft’s Hyper-V. Though as you may know, Windows Server 2008 R2 is slated for broad availability on or before October 22, 2009 (also the Windows 7 GA date), and Hyper-V will then support VM migration. So you may be wondering, what key new high-tech features will constitute the next battleground for differentiation amongst the virtualization players?

Five-Nines (99.999%) Meets Commodity Hardware

One such key feature is very likely to be fault tolerance (FT) — the ability for a running VM to suffer hardware failure on one machine, and to be restarted on another machine without losing any state. This is not just HA (High Availability), it’s CA (Continuous Availability)! And I believe it’ll be part of the cover-charge that virtualization vendors (VMware, Citrix/XenSource, Microsoft, et al) and providers such as Amazon will have to offer to stay competitive. When I talk about fault tolerance, I don’t mean using special/exotic hardware solutions — I’m talking about software-only solutions which handle fault tolerance in the hypervisor and/or other parts of the software stack.
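To put the "five-nines" figure in perspective, here is a minimal sketch of the downtime budget that an availability target implies. The function name and the non-leap-year assumption are mine, purely for illustration; they are not taken from any vendor's SLA definition.

```python
# Downtime budget implied by an availability target ("N nines").
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

At 99.999% the budget works out to roughly five and a quarter minutes of downtime per year, which is why a restart-based HA scheme that takes minutes to recover can spend most of that budget on a single failure.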

Here’s a quick summary of where the various key vendors stand on fault tolerance. Keep an eye on this space, because the VM migration battle is nearly over now.

VMware’s product line now offers Fault Tolerance, which they conceptually introduced at VMworld 2008. This was perhaps the biggest wow-factor feature VMware talked about at that VMworld. FT is not supported in the VMware Essentials, Essentials Plus or vSphere Standard editions; it’s available only in the more advanced (and more expensive) editions.

In the Xen camp, there are two distinct FT efforts, Kemari and Remus. Integration and porting to Xen 4.0 are on the roadmap. If and when that occurs, the Xen ecosystem will benefit. After battle-testing, it’s easy to conceive of Amazon offering FT as a premium service. It does after all chew through more network capacity, and will necessitate extra high-level logic on their part. There’s also a commercial FT solution for XenServer from Marathon, called everRun VM.

Microsoft appears to be leveraging a partnership with Marathon for their initial virtualization FT solution. This is probably smart given it allows Microsoft a way to quickly compete on fault tolerance, with a partner that’s been doing FT for a living. One would imagine this option will come at a premium though, perhaps a revenue opportunity for Microsoft for big-money customers, with an associated disadvantage vis-à-vis similar features based on free Xen technology and massive scale virtualization (clouds). That may make Marathon a strategic M&A target.

Licensing Issues, Part II

Just when you thought software-in-a-VM issues were mostly resolved, the same questions may be raised again for FT, given there is effectively a shadow copy of any given FT-protected VM. It’s not hard to imagine Microsoft aggressively taking advantage of this situation, given they live at both virtualization/OS and application layers of the stack.

Networking is Key

Fault tolerance of VMs is yet another consumer and driver of high bandwidth, low latency networking. The value in the data center is trending from the compute hardware to the networking. FT is another way-point in the evolution of that trend, allowing continuous availability on commodity hardware. You probably won’t run it on all your workloads (they will run with a performance penalty), but you might start out with the most critical stateful workloads. If you want to do this on any scale, or with flexibility, architect with lots of networking capabilities. For zero-sum IT budgets, this would mean cheaper hardware and better networking, something that might be a little bitter-sweet for Cisco, given its entrance into the server market.
