Alibaba Cloud claims new DB manager beats rival hyperscalers • The Register

Alibaba Cloud has revealed a cluster manager it says allows it to run databases more efficiently than its hyperscale rivals.
The Chinese cloud champ revealed its tech in a paper [PDF] titled “Eigen+: Memory Over-Subscription for Alibaba Cloud Databases” that it presented at the recent SIGMOD/PODS conference, an Association for Computing Machinery (ACM) event dedicated to database research.
The paper opens with an observation that hyperscalers often assign more memory to VMs than is physically available – a technique called “memory oversubscription” that’s used because virtual machines don’t always use all the RAM allocated to them. Memory oversubscription therefore makes it possible to run more guest machines on each host.
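To put numbers on that idea, here is a minimal Python sketch of the arithmetic. The host size, the guest allocations, and the function name are invented for illustration; none of them come from Alibaba Cloud's paper.

```python
def oversubscription_ratio(host_ram_gb, vm_allocations_gb):
    """Return memory promised to guests divided by physical memory on the host."""
    if host_ram_gb <= 0:
        raise ValueError("host_ram_gb must be positive")
    return sum(vm_allocations_gb) / host_ram_gb


# Hypothetical host: 256 GB of physical RAM, ten guests promised 32 GB each.
ratio = oversubscription_ratio(256, [32] * 10)
print(f"{ratio:.2f}")  # 320 / 256 = 1.25
```

Any ratio above 1.0 means the host has promised more RAM than it has, which only works while guests leave part of their allocation idle.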
If that sounds a bit fraught, you’re not alone: Alibaba Cloud’s researchers worry that memory oversubscription “increases the risk of Out of Memory (OOM) errors, potentially compromising service availability and violating Service Level Objectives.”
Users of memory oversubscription try to avoid such incidents in two ways. One is using historical data to predict future memory usage. The other is bin packing algorithms – an optimization technique used to figure out how to pack differently sized objects into bins of fixed size. Think of them as Tetris, but for fitting workloads into a pool of compute resources.
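For readers who have never met bin packing, the sketch below uses first-fit decreasing, a textbook heuristic. It is an illustration only: the article does not say which algorithm any cloud actually runs, and the workload sizes and the 64 GB host capacity are assumptions. In practice the demands would be the predicted memory figures derived from historical data.

```python
def first_fit_decreasing(demands_gb, host_capacity_gb):
    """Place each workload on the first host with room, largest workload first."""
    hosts = []  # each host is the list of demands placed on it
    for demand in sorted(demands_gb, reverse=True):
        if demand > host_capacity_gb:
            raise ValueError(f"{demand} GB cannot fit on any host")
        for host in hosts:
            if sum(host) + demand <= host_capacity_gb:
                host.append(demand)
                break
        else:
            # No existing host had room, so open a new one.
            hosts.append([demand])
    return hosts


# Hypothetical predicted memory demands (GB) packed onto 64 GB hosts.
placement = first_fit_decreasing([48, 16, 32, 8, 24, 40, 12], 64)
print(placement)  # [[48, 16], [40, 24], [32, 12, 8]]
```

The weakness the researchers point to is visible here: two of the three hosts are packed to exactly 64 GB, so a prediction that is off by even a gigabyte leaves no headroom.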
Alibaba Cloud thinks the combination of historical data and bin packing “often fall short in providing precise predictions, particularly in high-utilization environments where slight forecast errors can result in critical failures.”
The company offered that conclusion based on analysis of its own operations.
Which is a bit awkward because this paper is a sequel of sorts to Alibaba Cloud’s 2023 paper describing the first version of the Eigen cluster manager.
This time around Alibaba Cloud thinks it’s found an even better way to cram more database VMs into its servers, by starting with the Pareto Principle – aka the 80/20 rule – that most problems come from a small number of causes. In the case of cloudy databases running on Alibaba Cloud, that means “database instances with memory utilization changes exceeding five percent within a week constitute no more than five percent of all instances, yet these instances lead to more than 90 percent of OOM errors.”
Eigen+, Alibaba Cloud’s new cluster manager, therefore profiles all database instances to detect those with transient memory use and prevents them from using memory oversubscription. Eigen+ also models the impact of oversubscription and can initiate live migration of database workloads to reduce the likelihood of OOM errors across its server fleet.
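The classification step can be sketched roughly as follows. This is a guess at the shape of the idea, not Alibaba Cloud's method: the article does not describe how Eigen+ profiles instances, so the "swing" measure (highest minus lowest daily utilization over a week), the instance names, and the sample figures are all assumptions. Only the five percent threshold comes from the paper's quoted finding.

```python
def split_by_stability(weekly_utilization_pct, threshold_pct=5):
    """Partition instances into (stable, volatile) by weekly utilization swing.

    weekly_utilization_pct maps an instance name to a non-empty list of
    memory utilization samples, expressed in percent.
    """
    stable, volatile = [], []
    for name, samples in weekly_utilization_pct.items():
        swing = max(samples) - min(samples)  # assumed volatility measure
        (volatile if swing > threshold_pct else stable).append(name)
    return stable, volatile


# Hypothetical daily memory utilization (percent) for three instances.
samples = {
    "db-a": [61, 62, 60, 63, 62, 61, 62],
    "db-b": [40, 42, 55, 71, 48, 44, 41],
    "db-c": [80, 80, 81, 79, 80, 82, 80],
}
stable, volatile = split_by_stability(samples)
print(stable)    # ['db-a', 'db-c']
print(volatile)  # ['db-b']
```

Under this reading, only the stable instances would be candidates for oversubscribed hosts, while volatile ones such as the hypothetical db-b would keep their full allocation.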
Alibaba Cloud’s paper claims that applying Eigen+ to VMs running MySQL allowed it to eliminate OOM errors and improved memory allocation by 36 percent, meaning the Chinese cloud can use less memory to host more database VMs.
The paper asserts that Eigen+’s classification of dangerous DBs is something its cloudy rivals AWS, Google, and Microsoft don’t do, and that its cluster management capabilities represent advances on tools such as Google’s Borg, Kubernetes, and Mesos.
Of course they would say that – but the paper said it well enough that the ACM thought it worthy of a slot at SIGMOD/PODS. ®