In recent weeks I’ve been getting to grips with Solaris zones; they’re one of the features that shipped in Solaris 10 that are supposed to make Sol 10 Better, Stronger, Faster than the free Unix clones (particularly Linux) that have been eating commercial Unix vendors out of house and home.
If you start reading up on zones in a casual way, you could be forgiven for thinking they’re like VMware, only better. This is pretty much how Sun sell them, and definitely how Sun fans push them; going by what’s readily available on Google, you’d think people should be panic selling their VMware shares because zones are so good, so much easier and more efficient than hypervisor-based virtualisation technology, that VMware will be out of business Real Soon Now.
Reality is a little different.
As you might guess, my hands-on, in-anger use of zones has shown that they are a little less of an unlimited good than one might imagine from becoming casually acquainted with them. In this entry I’m going to look at my experience with one of the worst bits of zones: rcapd.
Zones allow you to manage resources in a number of ways. The different resources in a system have different mechanisms for this; CPU is the best managed, with kernel-level management of different schedulers, workload targets, and the ability to (imperfectly) isolate zones down to particular processor threads (and, more recently, fractions of threads). It’s got some big caveats around how it all works (which I’ll go into in a subsequent post). IO is easy. It doesn’t have any kind of resource management at a zone level. Believe it or not, that’s better than memory, which is managed by a shit-tastic scheme involving a userland daemon to enforce your limits.
Let’s take a step back for a moment. When people explain zones, they typically use VMware as a metaphor, because people understand VMware: a hypervisor that makes an operating system think it’s running on its own unique computer, even though it’s not. Virtual resources are mapped to real ones by the hypervisor, which sits atop a host operating system.
This is actually a terrible analogy for zones, because zones are nothing like this. Zones are like a Unix chroot jail on drugs (the good ones). You create a copy of the Solaris system binaries, set up local copies of /etc and so on. People logging into the zone see something that looks like a standalone server. But like a chroot jail (and unlike VMware), everything runs in a single, global kernel context. Your non-global zones are really just a collection of processes that some clever resource management makes look like a standalone server. It’s like VMware in the way a chrooted FTP server is, which is to say, not at all.
Since zones all run in the same global kernel, memory use in one zone can affect memory use in other zones. If you have a machine with 16 GB of RAM, and an errant application in one zone grabs all of it, then not only will that zone start paging to disk, but the whole system will. Unlike a true virtualisation solution, one zone’s memory misbehaviour really can crap on every zone in the system. This makes memory management a must, right? Enter rcapd.
What rcapd implies is simple enough: you can set zone configuration parameters for capping physical and swap use by the zone. One might think that processes in the zone will get that amount (say, 4 GB of physical RAM) and then start swapping, just as if they were on a real, standalone machine with 4 GB of RAM.
This is where it starts getting complicated, messy, and Bad and Wrong.
A naive person might assume that these limits are enforced in some reasonable fashion by the kernel: that programs calling system routines that tell them how much physical memory is installed will be told 4 GB rather than 16 GB (in our example), and that memory use within the zone will be expressed as a proportion of that total. They’d be wrong. If you look at the free memory, you’ll see 16 GB. If you start a Java process in server mode without explicit memory settings, it’ll try grabbing memory based on the system total, not the zone total, which will often cause some very bad behaviour.
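To make the misreporting concrete, here is a minimal Python sketch of how a program commonly asks the operating system how much physical memory is installed. It uses the standard os.sysconf call; the helper name is my own. Going by the behaviour described above, a program doing this inside a capped zone would be told about the whole machine, not the cap. I have not checked how every Solaris release answers, so treat that as the post's observation rather than a guarantee.

```python
import os


def reported_physical_memory_bytes():
    """Return the physical memory size the OS reports to this process.

    Hypothetical helper name. Nothing here knows about zone caps: it
    simply returns whatever the kernel chooses to report.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per page
    phys_pages = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * phys_pages


if __name__ == "__main__":
    gib = reported_physical_memory_bytes() / 2**30
    print(f"OS reports {gib:.1f} GiB of physical memory")
```

Anything that sizes itself from this figure (heaps, caches, worker pools) will be sizing itself for the whole box.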
So the misreporting is a pain in the arse and occasionally causes real problems. It’s nothing compared to what happens when you start to load the system up.
rcapd is an ambulance at the bottom of the cliff: it samples memory use periodically, adds up the processes making up a zone, and then, if those processes have exceeded the memory cap set, forces them to page.
In the 5-second default sample window, of course, a runaway process or group thereof can fuck over the system pretty badly. rcapd doesn’t fix that - it simply comes along later and tries to force them to page to disk. If processes are continually exceeding their cap, rcapd will be continually paging them to disk. Since rcapd runs in the global zone, and the kernel routines it invokes live in a global context, a zone that continually exceeds its memory cap can (and does!) shit up the whole system. rcapd can thrash a whole processor thread quite comfortably, and because the kernel is shared between all zones, the whole system will start slowing down if it spends huge amounts of time paging things out.
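The sample-then-page cycle described above can be sketched as a toy model in Python. This is not rcapd's code: the Zone class, the function names and the steady growth rate are all assumptions made for illustration, and "paging out" is reduced to subtracting the excess. The only figure taken from the post is the 5 second sample interval.

```python
from dataclasses import dataclass

SAMPLE_INTERVAL_S = 5  # default sample window mentioned in the post


@dataclass
class Zone:
    """Hypothetical stand-in for a zone: a cap plus per-process RSS (MB)."""
    name: str
    cap_mb: int
    rss_mb: list


def excess_mb(zone):
    """How far the summed RSS is over the cap (0 if at or under it)."""
    return max(0, sum(zone.rss_mb) - zone.cap_mb)


def simulate(zone, growth_mb_per_s, samples):
    """Model a zone whose first process grows steadily between samples.

    Returns the overshoot observed at each sample, i.e. how far past
    the cap the zone got before anything reacted.
    """
    overshoots = []
    for _ in range(samples):
        # Nothing polices the zone during the interval, so it just grows.
        zone.rss_mb[0] += growth_mb_per_s * SAMPLE_INTERVAL_S
        over = excess_mb(zone)
        overshoots.append(over)
        # Enforcement happens after the fact: page out the excess.
        zone.rss_mb[0] -= over
    return overshoots


if __name__ == "__main__":
    zone = Zone("db", cap_mb=4096, rss_mb=[3000, 1000])
    print(simulate(zone, growth_mb_per_s=100, samples=3))
```

In this model the zone is over its cap at every single sample, so the enforcer has paging work to do every interval and never gets ahead of the growth.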
Now, a little thought will make something obvious: if rcapd going berserk when a zone persistently exceeds its memory cap can cripple system performance (and it can and does, in my recently acquired experience), then you need to make sure zone caps are set high enough that this never happens in day-to-day use.
So, configuring caps. If you’re like me, you started using *ix at a point where pretty much any *ix you run into has the ability to share memory between processes’ executable segments, conserving memory globally. Solaris, of course, does. A quick use of pmap will show you that the typical group of processes on a Solaris system uses only a fraction of the real memory RSS reports through top or prstat.
Pity rcapd doesn’t understand this. In fact, when rcapd is assessing a zone’s memory use, it adds up all the RSS for processes, and patrols based on that. It’s like Unix 20 or more years ago: you don’t have shared memory. If you have a bunch of big, fat Oracle executables reporting RSS of 300 MB each, then the fact they may be sharing 200+ MB of that memory between them is ignored.
Imagine a zone with a cap of 2 GB running 8 Oracle processes: 8 x 300 MB = 2.4 GB of use as far as rcapd is concerned, so it’ll start swapping those processes, even if the actual use in the VMM is only 1 GB. Your system will be in crisis mode, continually swapping even though you haven’t actually exceeded the physical memory cap you set. If you want to avoid this, you need to set the cap to at least 2.5 GB in this example.
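The arithmetic in that example is easy to check. The Python sketch below compares the two ways of counting, under simplifying assumptions of my own: every process shares the same 200 MB, shared pages are counted exactly once, and 1 GB is treated as 1000 MB to match the round numbers above. The function names are made up for illustration.

```python
def rss_sum_mb(n_procs, rss_mb):
    """Naive accounting: add up every process's RSS, shared pages and all."""
    return n_procs * rss_mb


def shared_aware_mb(n_procs, rss_mb, shared_mb):
    """Count the shared pages once, plus each process's private pages."""
    private_mb = rss_mb - shared_mb
    return shared_mb + n_procs * private_mb


CAP_MB = 2000  # the 2 GB cap from the example, using 1 GB = 1000 MB

if __name__ == "__main__":
    naive = rss_sum_mb(8, 300)
    real = shared_aware_mb(8, 300, 200)
    print(f"summed RSS:   {naive} MB, over cap: {naive > CAP_MB}")
    print(f"shared-aware: {real} MB, over cap: {real > CAP_MB}")
```

Summed RSS comes out at 2400 MB, over the cap, while counting the shared 200 MB once gives 1000 MB, comfortably under it.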
But this means rcapd is managing memory like it’s 1982. And when something does go wrong, as noted before, it can bring the whole system to its knees. Which is why the best thing to do, as far as I can see, is simply to turn the fucker off. So that’s what we did.