This article from Ars Technica discusses a talk given over the summer by Merrill Lynch’s chief technology architect, Jeffrey Birnbaum, on “stateless cloud computing” – most concretely, on distributed file systems.
Birnbaum believes that one of the key foundational elements of a stateless computing environment is a networked storage system that enables ubiquitous availability of software. The file paths of the individual applications should be based on clearly defined nomenclature, much like the domain of a web site. All application dependencies should be accessible through the network filesystem, and version numbers should be expressed with the path nomenclature.
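To make the path idea concrete, here is a minimal Python sketch of what such a nomenclature might look like. The /apps mount point, the reversed-domain layout, and the names example.com and pricer are all hypothetical; the article does not give an actual format.

```python
from pathlib import PurePosixPath

# Assumed mount point of the network filesystem (hypothetical).
ROOT = PurePosixPath("/apps")


def versioned_path(domain: str, app: str, version: str, relative: str) -> PurePosixPath:
    """Build a path that names the owner, the application, and the exact version.

    Assumed layout: /apps/<reversed domain>/<app>/<version>/<relative file>
    """
    # Reverse the domain so software from one owner sorts together
    # (com/example rather than example.com).
    owner = PurePosixPath(*reversed(domain.split(".")))
    return ROOT / owner / app / version / relative


# Two versions of the same (made-up) application live side by side.
old_binary = versioned_path("example.com", "pricer", "2.4.0", "bin/pricer")
new_binary = versioned_path("example.com", "pricer", "2.4.1", "bin/pricer")
```

Under this layout the two binaries differ only in the version component, so both can be installed at once and a dependency can name the exact one it wants.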
Big distributed file system – sure. But why should version numbers be expressed with the path nomenclature (a Plan 9 idea, btw)? Now we go on to the ancient problem of caching distributed data.
The obvious challenge posed by rolling out worldwide network storage infrastructure is scalability. If everyone in a global organization is depending on a network storage solution, then it needs to be fast and consistently reliable. The solution that Birnbaum proposes is regional mirroring and caching. The storage system would be universally synchronized between mirrors that have all the data. Caching can also be used at individual facilities to further improve performance. To achieve this kind of global scalability, he says, the best approach is similar to that of Akamai.
Even with a file system that is not globally distributed, the problem of shared access is non-trivial. A global file system makes things quite challenging. Suppose we have a file recording trades, and the Singapore, London, NY, and Espanola main offices are all reading and writing it at the same time. Caching and cache coherency are an utter nightmare. Akamai, like Google, solves the problem of massive amounts of distributed data by focusing on “delivery”, otherwise known as “read only content” or “many readers, one writer”, with no requirement for true synchronization. But the ML problem is more difficult even if we ignore multiple writers because, presumably, you want Singapore to actually see every trade made in Espanola, whereas for Akamai it’s ok if the cache is not fresh. Handling multiple readers and multiple writers is a harder problem still.
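The stale-read problem is easy to show even with a single writer. The following is a toy Python model, not a description of how Akamai or any real system works; the class names and the sample trade are made up for illustration.

```python
class Origin:
    """The authoritative copy of the trade file."""

    def __init__(self):
        self.trades = []

    def append(self, trade):
        self.trades.append(trade)

    def read(self):
        # Return a copy so callers cannot mutate the authoritative list.
        return list(self.trades)


class RegionalCache:
    """A read cache that only sees new data when it is explicitly refreshed."""

    def __init__(self, origin):
        self.origin = origin
        self.snapshot = origin.read()

    def read(self):
        # Served locally: fast, but possibly out of date.
        return list(self.snapshot)

    def refresh(self):
        self.snapshot = self.origin.read()


origin = Origin()
singapore = RegionalCache(origin)

origin.append("Espanola: BUY 100 XYZ")  # a trade is written at the origin

stale = singapore.read()   # Singapore still sees the old, empty snapshot
singapore.refresh()
fresh = singapore.read()   # only now does the trade appear
```

For delivered content, the window between the write and the refresh is harmless. For a trade file, it is exactly the window in which Singapore is acting on wrong data, and this sketch does not even attempt concurrent writers.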
These concepts don’t cover a whole lot of new ground yet. Much of this was already possible with conventional thin-client systems. The point at which it becomes immensely valuable, according to Birnbaum, is when all of these technologies are used together with virtualization to abstract the processes away from the hardware. Once this is done, individual operations can seamlessly float around data centers and balance out in a manner that offers a more optimal level of resource utilization.
And this seems to me to gloss over an even harder problem. Imagine a serious Oracle application “seamlessly floating” from one set of machines in one data center to another set somewhere else. I can’t imagine how that works. Imagining little jobs floating is easier, but is that really an interesting problem? And this brings us to the most interesting claim:
He claims that 61 percent of a company’s enterprise server capacity goes completely unused and proposes an automated load balancing solution—
SIXTY ONE PERCENT! Think of the power use.
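For a rough sense of scale, here is some back-of-the-envelope arithmetic in Python. It assumes the 61 percent figure means average utilization is 39 percent, that load could be packed perfectly, and that a fleet of 1,000 servers would be run at 80 percent utilization after consolidation. The fleet size and the target are my own illustrative numbers, not figures from the talk.

```python
import math


def servers_needed(current_servers: int, unused_fraction: float,
                   target_utilization: float) -> int:
    """Servers required to carry today's load at a chosen target utilization.

    Assumes load is perfectly divisible and movable, which real workloads are not.
    """
    used_capacity = current_servers * (1.0 - unused_fraction)
    return math.ceil(used_capacity / target_utilization)


# Hypothetical fleet of 1,000 servers, 61% unused, consolidated to 80% utilization.
needed = servers_needed(1000, 0.61, 0.80)
idle_candidates = 1000 - needed
```

With those assumptions, 488 servers would carry the load and 512 could be switched off. The calculation says nothing about watts, since that depends on how much power the hardware draws when idle.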