Before planning any project, the first thing I like to do is ask: what outcome are we trying to achieve here? A case in point is virtualization, the holy grail of IT du jour. A virtualized infrastructure is not a goal in and of itself; rather, virtualization is merely one of many tools that may be employed to achieve the same business goals.
One such goal is consolidation, which is itself only a means to an end. In my experience this is the most common use case. In operating a decent-sized datacentre there are three real constraints: power, cooling, and floor loading. If you run out of any of these, then you have a decision to make: lease more space in a hosting facility if you can (and deal with the overhead of having components of an application spread across many networks, e.g. latency, security), or relocate (risky and expensive to do with no downtime), or increase the density of the equipment you have: more computation and storage capacity within the same wattage and lbs/square foot. Virtualization can often be useful for this; I have personally compressed entire 42U racks down into 6U using VMware, which itself has costs attached. One reason that this is possible is that system loads vary throughout the day; one system may be busy during trading hours, another only overnight, but both must be sized for their peak loads. It makes sense to share the hardware in that situation, but that is true whether you do it with VMs or just by running ordinary Unix processes at different times of day. The inescapable fact is that, to get a business benefit, you must be doing at least the same workload on sufficiently less hardware that the savings achieved exceed the cost of the consolidation project. That is easy to quantify if the alternative is “a new datacentre” but much less so otherwise.
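To make the peak-load argument concrete, here is a minimal Python sketch. Every number in it is hypothetical (the hourly load profiles, the project cost, the annual saving), and the function names are mine, not from any tool. It only illustrates why two systems with peaks at different times of day need less shared capacity than the sum of their separate peaks, and the break-even test a consolidation project has to pass.

```python
# All figures are hypothetical, chosen only to illustrate the argument.

def sum_of_peaks(*hourly_loads):
    """Capacity needed when each system is sized separately for its own peak."""
    return sum(max(load) for load in hourly_loads)


def consolidated_peak(*hourly_loads):
    """Peak of the combined load when the systems share one pool of hardware."""
    return max(sum(hour) for hour in zip(*hourly_loads))


def pays_off(project_cost, annual_saving, years):
    """True if the savings over the period exceed the one-off project cost."""
    return annual_saving * years > project_cost


# Hourly demand in CPU cores, midnight to midnight (24 entries each).
trading = [1] * 8 + [8] * 9 + [1] * 7     # busy 08:00 to 17:00
overnight = [6] * 6 + [1] * 16 + [6] * 2  # busy 22:00 to 06:00

separate = sum_of_peaks(trading, overnight)
shared = consolidated_peak(trading, overnight)

print(f"sized separately: {separate} cores")
print(f"sized together:   {shared} cores")
print(f"worth doing over 3 years: {pays_off(100, 40, 3)}")
```

Nothing in this calculation cares whether the sharing is done with VMs or with ordinary processes on one OS; the saving comes from the load profiles, not from the hypervisor.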
This is one of the things that is most bemusing about the oft-repeated claim that programmers are expensive but hardware is cheap. If it’s a choice between 1 server or 2, sure, go ahead and buy another server. If it’s a choice between 1 server and 10, then the balance starts to shift, even if those servers are “virtual”. One way I have seen SOA implemented is one-service-per-VM, which harks back in principle to the way mainframes work, except built out of components not really designed for it. So, one full, general-purpose OS instance and one middleware instance per service, and an application may be composed of many services. If those are J2EE then you can consume a 64G physical host running a dozen VMs before you’ve even noticed (and each one will also need space on a SAN). Sure, that fits in 1U or 2U of rack, 4U at most. But doing the same workload as a dozen native-code† processes on one older machine running the traditional one OS also fits into that space! And if you took that existing system and P2V’d it, still as one-process-per-service, perhaps now you could run a dozen similar complete apps in your new 2U box. So where is the saving? Everything we’ve gained in this scenario, we’ve given back, and we are right back where we started, except with another layer of abstraction to manage and pay for, as per Wheeler’s Corollary.
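The memory arithmetic behind that can be sketched in a few lines of Python. The per-layer footprints below are invented round numbers, not measurements, and the two function names are hypothetical; the point is only that a cost paid once per host becomes a cost paid once per service when every service gets its own VM.

```python
# Hypothetical footprints in GiB; substitute your own measurements.
OS_GIB = 2          # one general-purpose OS instance
MIDDLEWARE_GIB = 2  # one middleware (app server) instance
SERVICE_GIB = 1     # the service's own code and working set


def one_service_per_vm(services):
    """Each service carries its own OS and its own middleware instance."""
    return services * (OS_GIB + MIDDLEWARE_GIB + SERVICE_GIB)


def shared_host(services):
    """One OS and one middleware instance shared by all the services."""
    return OS_GIB + MIDDLEWARE_GIB + services * SERVICE_GIB


per_vm = one_service_per_vm(12)
shared = shared_host(12)

print(f"a dozen services, one per VM:  {per_vm} GiB")
print(f"a dozen services, one host OS: {shared} GiB")
```

With these assumed figures a dozen services come close to filling a 64G host when each has its own VM, while the shared arrangement leaves most of the machine free.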
A second business goal is that of increased manageability, which is also a means to an end: doing more stuff with the same number of people, which may be measured by the DBA:DB ratio I have mentioned before, or by sysadmins:systems. It’s always been difficult to measure ratios like that for programmers, hence their blithe assertions like the one above. But I am not talking at the level of individuals here, but about teams or entire organizations (not least because a team of 1 or 2 senior engineers writing tools for and mentoring 5 or 10 juniors, who in turn do the hands-on work, can be incredibly effective). Virtualization is one means to bundle up a set of related objects into a single package that can be managed as a whole. Boot up a VM image in a hypervisor, its OS in turn starting processes in the correct order, and hey presto! But there is one very obvious flaw with this strategy: if you have gone down the route of one-service-per-VM, and your application comprises many services, then once again you are back where you started, except the “thing” you manage is not individual processes now, it’s entire VMs! And the finer-grained services become, the worse this gets, and it will, because of the pressure on developers to make the services more reusable and composable. It is no coincidence that in the old days this was called “remote procedure call”: a service tends naturally towards the same complexity as a subroutine in a monolithic program. This leads inexorably to the situation where an entire VM exists just to run a hundred lines of actual application code!
Instead of installing or patching an application, you must build an entire new VM image. There may be a case for doing this if you have a lot of sysadmins experienced in supporting an OS but not in an app server or the actual application. However, in my experience, this isn’t as useful as it sounds: it’s using a sledgehammer (top, say) to crack a nut (such as a race condition). Much better to train some of these same people in advanced tools such as DTrace and Flight Recorder, moving them “up” a layer in the stack, and some in the hypervisor itself, moving them “down” a layer, with the overlap in the middle for general OS work. And one thing that is very important to consider is that it is still physical hardware with physical limitations: you still have to deal with issues such as contention on SAN storage. In fact it is worse now, because each VM is an OS that now has to run its swap and housekeeping tasks on the storage. If you are not careful, you can make that an order of magnitude harder to manage, and again, where is the real saving?
There is an alternative way to bundle related resources together and manage them as a whole without any of this overhead: a cluster service group. You can distribute services around a cluster and add new cluster nodes on the fly without a VM in sight. And you get your HA features “for free”! In a virtualized setup, HA would be an additional consideration: you would need one HA mechanism for VMs as a whole, and another mechanism to restart processes inside the VM. Also, in the case of J2EE the question must be asked: why are we running a VM inside another VM? Especially when a JVM can now run directly on the hypervisor. The stack at this point looks like:
App → middleware → JVM → general purpose OS (VM) → hypervisor → hardware
The items in italics must be examined closely to see if they are really necessary – especially if this stack is duplicated for every service the application provides.
The third consideration is security. This one to my mind is the most interesting. Package up the bare minimum for a self-contained application, harden it, and offer the VM as your service endpoint. If it is compromised, the attacker is trapped within the VM, assuming of course that the same attack doesn’t work on all VMs! To be really sure, you might want to mix Xen, VMware and Hyper-V, with the associated costs and overheads. The difficulty here is that there is a limit to the useful functionality that can be offered without a route from the VM into the rest of the infrastructure. For example, if the application connects to the database, and the credentials for the database are in its configuration files, then an attacker can now masquerade as the application from within the VM. It would be possible to automate building a VM with “today’s website” or similar in it, every midnight, then just running that. An attacker could compromise it and deface your content, but not jump from there into your corporate systems. The temptation to peek at the web server logs during the day would have to be resisted until the VM was safely offline; then the logs could be extracted directly from its storage without even starting it up! Again the question is whether the management overhead and complexity of this solution are worth it when chroot() already exists, along with other methods such as WPARs, Containers, etc. Or BSD’s append-only filesystem for logs, and exporting apps and data storage read-only from SAN…
These are just a few things to consider before deciding whether virtualization is the right choice. For running a development environment on a desktop, simulating Production, it is invaluable. But in the case of Production servers, it simply does not offer any capabilities that didn’t already exist‡. It introduces complexity, expense, potentially a larger attack surface, and may end up using more hardware, not less. And time spent reinventing the wheel is time that could be spent adding revenue-generating features, or playing Angry Birds!
† Remember that my idea of native code is OCaml or Haskell; Java programmers are at a disadvantage here.
‡ What about AWS, Azure, et al? I am talking about virtualizing your own infrastructure for the above use cases, not outsourcing.