Here’s a thought experiment. Suppose each server uses X watts idle and X+Y watts busy. If you have N programs currently running on N servers that are each Z% idle, then in the best case you really only need D = ((100-Z)/100)*N servers. So in a perfect world you have D machines busy all the time, for D*(X+Y) watts, instead of N*(Z/100)*X + N*((100-Z)/100)*(X+Y) watts.
Something like this:

| Symbol | Quantity | Value |
|--------|----------|-------|
| N | servers | 200 |
| Z | idle rate (%) | 70 |
| D | servers needed (best case) | 60 |
| X | idle watts per server | 150 |
| X+Y | busy watts per server | 800 |
| | Current power use (W) | 69000 |
| | Consolidated power use (W) | 48000 |
| | Savings (W) | 21000 |
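If you'd rather check the arithmetic in code than in a spreadsheet, here is a minimal Python version of the same model; the figures are just the illustrative numbers from the table, not measurements.

```python
# Back-of-the-envelope consolidation model from the table above.
# All numbers are the made-up illustrative figures, not measurements.

N = 200      # servers
Z = 70       # percent of time each server sits idle
X = 150      # idle watts per server
BUSY = 800   # busy watts per server (X + Y)

D = (100 - Z) / 100 * N                 # servers needed in the best case
current = N * (Z / 100) * X + N * ((100 - Z) / 100) * BUSY
consolidated = D * BUSY                 # D machines running flat out

print(D, current, consolidated, current - consolidated)
# 60.0 69000.0 48000.0 21000.0
```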
OK, that rocks: 21 kW saved. But can you really get that savings? If program 1 and program 2 need to run at the same times, you can't save anything on them[1], and the virtualization itself adds cycles that are pure overhead and increases memory use, which burns more power. Suppose we now add K% as the overhead measure, so the real need is R = D + D*K/100. If K = 20, we cut the savings roughly in half. But 20 seems wildly overoptimistic to me, so put it at 50. We still only need 90 servers instead of 200, but they are running flat out, so we actually increase power use.
| Symbol | Quantity | Value |
|--------|----------|-------|
| K | overhead (%) | 50 |
| R | servers needed with overhead | 90 |
| | Consolidated power use with overhead (W) | 72000 |
| | Savings (W) | -3000 |
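Extending the sketch above with the overhead term (again, nothing here is measured; K = 50 is just the pessimistic guess from the text):

```python
# Same toy model, now with virtualization overhead K (percent).

N, Z, X, BUSY = 200, 70, 150, 800
K = 50                                  # assumed overhead percentage

D = (100 - Z) / 100 * N                 # 60 servers in the ideal case
R = D + D * K / 100                     # 90 servers once overhead is added
current = N * (Z / 100) * X + N * ((100 - Z) / 100) * BUSY
consolidated = R * BUSY

print(R, consolidated, current - consolidated)
# 90.0 72000.0 -3000.0
```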
That does not rock. But here's the kicker: if we reduce idle power to 25 watts, maybe by turning off idle machines, then at a 50% overhead rate virtualization increases power use by roughly 20 kW. If we can reduce idle power to 1 watt, then even at 10% overhead virtualization increases power use. So what are the real numbers? I have not seen any published studies of actual data centers; maybe there are some.
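For what it's worth, sweeping the same toy model over the idle-power and overhead values mentioned above shows where the 20 kW figure and the 1-watt case come from:

```python
# Crossover check: how the savings change as idle power drops and as the
# assumed virtualization overhead grows. Values are the ones discussed in
# the text, not measurements.

N, Z, BUSY = 200, 70, 800
D = (100 - Z) / 100 * N

for X in (150, 25, 1):          # idle watts per server
    for K in (10, 20, 50):      # assumed overhead percentage
        current = N * (Z / 100) * X + N * ((100 - Z) / 100) * BUSY
        consolidated = (D + D * K / 100) * BUSY
        print(X, K, round(current - consolidated))
# e.g. X=25, K=50 -> -20500 (the ~20 kW increase)
#      X=1,  K=10 -> -4660  (still an increase at 10% overhead)
```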
If you want a copy of the xls, send me an email. Maybe I set it up wrong.
http://www.hpl.hp.com/personal/Lucy_Cherkasova/papers/final-perf-study-usenix.pdf
This claims lower levels of overhead:
http://www.engineyard.com/blog/2009/10-years-of-virtual-machine-performance-semi-demystified/
but I'm unconvinced. In particular, I'm curious about memory usage.
This is funny:

> The benchmark itself reported its elapsed time by calling a function to find the system time at both the beginning and end of the benchmark. The elapsed time reported by the benchmark was less than the wall-clock elapsed time. What we hypothesize is that, due to the unrelenting CPU consumption by the benchmarks, the virtualization layer was unable to update its clock with the virtual CPU clock ticks. This phenomenon is mentioned in [10] and [11] but we feel that this type of CPU workload severely exaggerates the situation.
See the last letter in this.
While it is obvious that load is critical in any analysis, it may not be obvious how, for example, memory usage can depend on scheduling. If VM1 and VM2 run serially, peak memory usage is the max of the two; if they overlap, it is the sum, unless it's OK to slow them both down with more VM operations, which of course increases the time to complete, and that time is capacity no longer available for a third VM, and so on.
It's also important to understand whether overhead is per VM. Suppose we have 10 VMs, all roughly the same size, with 2% overhead per VM. Is this overhead additive? Clearly CPU time is additive, and so is I/O time; memory pressure is fuzzier and depends on how many VMs we run at any one time.
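A toy calculation makes both of these points concrete; the VM sizes below are invented for illustration, and the 2% figure is the per-VM overhead from the paragraph above.

```python
# Toy accounting: CPU (and I/O) overhead adds up regardless of scheduling,
# but peak memory depends on which VMs run at the same time.

vms = [
    {"name": "VM1", "cpu_s": 300, "mem_gb": 8},   # invented sizes
    {"name": "VM2", "cpu_s": 300, "mem_gb": 6},
]
per_vm_cpu_overhead = 0.02                         # 2% per VM, as above

total_cpu = sum(vm["cpu_s"] * (1 + per_vm_cpu_overhead) for vm in vms)

peak_mem_serial = max(vm["mem_gb"] for vm in vms)  # run one after the other
peak_mem_overlap = sum(vm["mem_gb"] for vm in vms) # run at the same time

print(total_cpu, peak_mem_serial, peak_mem_overlap)   # 612.0 8 14
```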
Notes
[1] If you have multiple cores, which you do, then as long as there are enough cores to run VM1 and VM2 in parallel, no problem. This brings up the question of relative power use, say of two dual-core machines versus one 4-core machine, or other multiples. Note that it's hard to get power savings by turning off 4 of the 8 cores of an 8-core machine, but possibly easy to power down one of two 4-core machines.