OpenEvolve AI coding agent built a better algorithm
Computer scientists at UC Berkeley say that AI models show promise as a way to discover and optimize algorithms.
In a preprint paper titled “Barbarians at the Gate: How AI is Upending Systems Research,” 17 UC Berkeley researchers describe how they employed OpenEvolve, an open source implementation of Google DeepMind’s AlphaEvolve, to improve a load balancing algorithm so that it significantly outperforms prior human designs.
Specifically, the authors claim to have used OpenEvolve to achieve a 5x speedup for an Expert Parallelism Load Balancer (EPLB) algorithm, which mixture-of-experts large language models use to spread specialized expert modules evenly across GPUs – part of an efficiency mechanism in which each token activates only a subset of the model’s parameters.
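To make the setup concrete, here is a minimal sketch – not DeepSeek’s EPLB, and with purely illustrative sizes and names – of the top-k mixture-of-experts routing that creates the load-balancing problem EPLB addresses:

```python
# Minimal illustrative sketch (not DeepSeek's EPLB): top-k mixture-of-experts
# routing, where each token activates only a few experts. Sizes are assumptions.
import torch

num_tokens, hidden, num_experts, top_k = 8, 16, 4, 2

tokens = torch.randn(num_tokens, hidden)
gate = torch.nn.Linear(hidden, num_experts)          # router scoring each expert per token
scores = gate(tokens).softmax(dim=-1)                # shape: (num_tokens, num_experts)
weights, expert_ids = scores.topk(top_k, dim=-1)     # each token activates its top-k experts

# Tokens per expert: if a few experts are much "hotter" than the rest, the GPUs
# hosting them become the bottleneck -- the imbalance a rebalancer like EPLB
# is meant to smooth out.
load = torch.bincount(expert_ids.flatten(), minlength=num_experts)
print(load.tolist())
```

When a handful of experts attract most of the tokens, the GPUs hosting them become hotspots, and rebalancing how experts are placed across GPUs is what EPLB is for.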
The authors say that AI-Driven Research for Systems (ADRS), through which an AI model iteratively generates, evaluates, and refines solutions, promises to transform systems research.
“As AI assumes a central role in algorithm design, we argue that human researchers will increasingly focus on problem formulation and strategic guidance,” they state in their paper. “Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.”
Google in May talked up AlphaEvolve, an “evolutionary coding agent” that improved the efficiency of Google’s data center orchestration, optimized matrix multiplication operations in its Tensor Processing Unit hardware, and sped up its FlashAttention kernel implementation in Transformer-based AI models.
As if to further underscore the potential of machine learning as an algorithmic discovery mechanism, a paper published this week in Nature from Google DeepMind researchers describes “an autonomous method for discovering [reinforcement learning] rules solely through the experience of many generations of agents interacting with various environments.” To date, the DeepMind eggheads claim, automated approaches have failed to outperform human-designed reinforcement learning systems.
The UC Berkeley crew has now demonstrated the value of AI-based optimization by having OpenEvolve work out a more efficient approach to load balancing across the GPUs handling LLM inference.
The researchers started with DeepSeek’s open source EPLB implementation, which they note is slow because it’s written in Python and relies on a for loop to run a linear search for the best GPU to take each expert module’s workload. On average, the DeepSeek version took about 540 ms to rebalance the expert modules across GPUs.
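The pattern the researchers describe looks roughly like the following sketch – a pure-Python greedy packer with a linear search over GPUs for every expert, shown here only as an illustration, not DeepSeek’s actual code:

```python
# Illustrative sketch of the pattern described above, not DeepSeek's actual code:
# for every expert, linearly scan all GPUs for the least-loaded one. The
# Python-level nested looping is what makes this style of rebalancer slow at scale.
def greedy_rebalance(expert_loads, num_gpus):
    gpu_load = [0.0] * num_gpus
    placement = {}
    # Place the heaviest experts first (classic greedy bin packing).
    for expert, load in sorted(enumerate(expert_loads), key=lambda kv: -kv[1]):
        best_gpu = 0
        for gpu in range(num_gpus):                  # linear search for the best GPU
            if gpu_load[gpu] < gpu_load[best_gpu]:
                best_gpu = gpu
        gpu_load[best_gpu] += load
        placement[expert] = best_gpu
    return placement, gpu_load

placement, gpu_load = greedy_rebalance([9.0, 7.0, 4.0, 4.0, 2.0, 1.0], num_gpus=3)
print(placement, gpu_load)
```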
They also looked at a non-public EPLB implementation from an unidentified frontier lab that handled rebalancing in 19.6 ms.
OpenEvolve, using a mix of 80 percent Gemini 2.5 Flash and 20 percent Gemini 2.5 Flash Lite, took less than $10 and five hours to come up with a more efficient approach to packing the expert modules onto GPUs – it replaced the Python loops with vectorized tensor operations and implemented a zig-zag partitioning scheme, achieving a runtime of just 3.7 ms.
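The paper’s description suggests something along these lines – a hedged sketch of a vectorized zig-zag (snake-order) partitioner in PyTorch, not the code OpenEvolve actually generated:

```python
# Hedged sketch of a zig-zag partitioner, assuming "zig-zag" means snake-order
# assignment of experts sorted by load. Not the OpenEvolve-generated code.
import torch

def zigzag_partition(expert_loads: torch.Tensor, num_gpus: int) -> torch.Tensor:
    """Assign experts to GPUs in zig-zag order by descending load, with no Python loops."""
    order = torch.argsort(expert_loads, descending=True)   # heaviest experts first
    ranks = torch.arange(order.numel())
    fwd = ranks % num_gpus                                  # 0, 1, ..., G-1, 0, 1, ...
    rev = (num_gpus - 1) - fwd                              # G-1, ..., 1, 0, ...
    zigzag = torch.where((ranks // num_gpus) % 2 == 0, fwd, rev)
    assignment = torch.empty_like(order)
    assignment[order] = zigzag                              # expert i -> GPU assignment[i]
    return assignment

loads = torch.tensor([9.0, 7.0, 4.0, 4.0, 2.0, 1.0])
assignment = zigzag_partition(loads, num_gpus=3)
# Per-GPU load, also computed without Python loops:
print(assignment.tolist(), torch.zeros(3).scatter_add_(0, assignment, loads).tolist())
```

Alternating the assignment direction on each pass pairs heavy experts with light ones, which keeps per-GPU loads close together while the whole routine stays inside fast tensor operations.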
That’s a 5.0x speedup over the undisclosed reference implementation and a 146x speedup over DeepSeek’s implementation.
Another case study described in the UC Berkeley paper reports that, through the use of OpenEvolve, the authors achieved a threefold speedup for relational analytics workloads in which SQL queries invoke LLM inference on each row.
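The workload shape being optimized looks roughly like the sketch below – a relational scan that calls a model once per row. This only illustrates the pattern; `call_llm` is a hypothetical stand-in, and neither the paper’s actual queries nor the optimization OpenEvolve found are shown here:

```python
# Illustration of the workload shape only: per-row LLM inference inside a
# relational query. `call_llm` is a hypothetical stub, not a real API.
def call_llm(prompt: str) -> str:
    return "positive" if "great" in prompt else "negative"   # stub for illustration

rows = [
    {"id": 1, "review": "great product, works as advertised"},
    {"id": 2, "review": "broke after two days"},
]

# Roughly equivalent to: SELECT id, LLM('classify sentiment: ' || review) FROM reviews
results = [(r["id"], call_llm(f"classify sentiment: {r['review']}")) for r in rows]
print(results)
```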
Asked whether OpenEvolve’s “reasoning” consists of just connecting dots that people missed in available data or shows evidence of a novel approach, co-author Audrey Cheng, PhD candidate at UC Berkeley, told The Register in an email, “I think these are hard questions to answer definitively (as they come down to whether LLMs are actually ‘thinking’ or just doing sophisticated probability calculations).
“LLMs definitely benefit from being trained on a much larger corpus of literature than any individual human researcher can comprehend, and this gives them advantages in discovering new ways to apply ideas from other domains.
“Currently in systems/database performance research, we consider algorithms as ‘novel’ if they show significant improvements in some way, even if they borrow ideas from other fields (as an example, see my paper applying fair sharing ideas from networking/operating systems to databases). So based on this criterion, yes, the developments would be considered novel by research standards.”
Asked whether OpenEvolve is simply brute-forcing novelty from known data or is being “creative,” Cheng said that too is a difficult question.
“I think one way to look at this is to think about how humans come up with ideas now,” Cheng said. “As researchers, we know that we ‘stand on the shoulders of giants.’ Only by deeply understanding the ideas of others can we come up with ‘novel’ solutions. The creative process requires known data. OpenEvolve uses this data and applies it to new problems (and may come up with unexpected solutions as well). So, I would say ADRS frameworks are creative.”
Cheng said she believes the potential impact of ADRS is huge.
“We focus on systems performance problems because AI can already beat human-expert solutions here,” she explained. “Performance problems are generally easier to verify, and we’ve already seen some initial adoption in industry (see Datadog’s recent blog post as an example). I expect that most companies running systems at scale will eventually use some form of ADRS for performance tuning.”
And once researchers figure out how to do verification for other problems like security and fault tolerance, Cheng expects ADRS to be able to come up with more novel solutions.
“The current bottleneck is having a robust evaluation and validation framework,” she explained. “If that is in place, I imagine ADRS can apply widely to all kinds of systems problems (and also beyond computer science).” ®


