Feature IT consultancy Gartner predicts that more than 40 percent of agentic AI projects will be cancelled by the end of 2027 due to rising costs, unclear business value, or insufficient risk controls.
That implies something like 60 percent of agentic AI projects would be retained, which is actually remarkable given that the rate of successful task completion for AI agents, as measured by researchers at Carnegie Mellon University (CMU) and at Salesforce, is only about 30 to 35 percent for multi-step tasks.
To further muddy the math, Gartner contends that most of the purported agentic AI vendors offer products or services that don’t actually qualify as agentic AI.
AI agents use a machine learning model that’s been connected to various services and applications to automate tasks or business processes. Think of them as AI models in an iterative loop trying to respond to input using applications and API services.
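In code, that loop is simpler than the hype suggests. Here's a rough sketch, with a stubbed-out call_model() standing in for a real LLM API; none of the names belong to any actual framework:

```python
# A minimal sketch of the agent loop described above. call_model() is a
# stub standing in for a real LLM API; everything here is illustrative,
# not any vendor's actual interface.

def search_email(query: str) -> str:
    """Stand-in for a real mail-API call."""
    return f"3 messages matching {query!r}"

TOOLS = {"search_email": search_email}

def call_model(history: list[dict]) -> dict:
    """Stub LLM: asks for one tool call, then declares the task done."""
    if any(m["role"] == "tool" for m in history):
        return {"type": "final", "content": "Done: flagged 3 hype-laden emails."}
    return {"type": "tool", "tool": "search_email", "arguments": "exaggerated AI claims"}

def run_agent(task: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)           # model picks the next action
        if reply["type"] == "final":
            return reply["content"]           # model judges the task complete
        result = TOOLS[reply["tool"]](reply["arguments"])
        history.append({"role": "tool", "content": result})  # feed result back in
    return "Gave up: step budget exhausted."

print(run_agent("Find all the emails that make exaggerated claims about AI"))
```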
The idea is that given a task like, “Find all the emails I’ve received that make exaggerated claims about AI and see whether the senders have ties to cryptocurrency firms,” an AI model authorized to read a mail client’s display screen and to access message data would be able to interpret and carry out the natural language directive more efficiently than a programmatic script or a human employee.
The AI agent, in theory, would be able to formulate its own definition of “exaggerated claims” while a human programmer might find the text parsing and analysis challenging. One might be tempted just to test for the presence of the term “AI” in the body of scanned email messages. A human employee presumably could identify the AI hype in a given inbox but would probably take longer than a computer-driven solution.
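For comparison, the naive scripted approach might look like the sketch below (the messages are invented). It flags hype and sober reporting alike, which is exactly the brittleness an agent is supposed to avoid:

```python
# The naive scripted approach: flag any message containing "AI".
# It catches hype and sober reporting alike, which is the point:
# "exaggerated" is a judgment call a keyword match can't make.

emails = [
    {"sender": "pr@coinhype.example", "body": "Our AI agent guarantees 1000x returns!"},
    {"sender": "editor@example.com", "body": "AI agents fail most multi-step tasks."},
]

flagged = [e for e in emails if "ai" in e["body"].lower()]
print([e["sender"] for e in flagged])  # both match; false positive included
```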
The notion of software that just accepts orders and executes them efficiently, correctly, affordably, and without fuss shows up again and again in science fiction. When Captain Picard says in Star Trek: The Next Generation, “Tea, Earl Grey, hot,” that’s agentic AI, translating the voice command and passing the order to the food replicator. When astronaut Dave Bowman tells the HAL 9000 computer, “Open the pod bay doors, HAL,” that’s agentic AI too.
Makers of AI tools like Anthropic tend to suggest more down-to-earth applications, such as AI-based customer service agents that can take calls and handle certain tasks like issuing refunds or referring complicated calls to a live agent.
It’s an appealing idea, if you overlook the copyright, labor, bias, and environmental issues associated with the AI business. Also, as Meredith Whittaker, president of the Signal Foundation, observed at SXSW earlier this year, “There’s a profound issue with security and privacy that is haunting this sort of hype around agents…” Specifically, agents need access to sensitive data to act on a person’s behalf, and that access imperils personal and corporate security and privacy expectations.
But agents that exhibit the competence of Iron Man’s JARVIS remain largely science fiction when it comes to actual office work.
According to Gartner, many agents are fiction without the science. “Many vendors are contributing to the hype by engaging in ‘agent washing’ – the rebranding of existing products, such as AI assistants, robotic process automation (RPA) and chatbots, without substantial agentic capabilities,” the firm says. “Gartner estimates only about 130 of the thousands of agentic AI vendors are real.”
Testing agents at the office
For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.
They call it TheAgentCompany, a simulation environment designed to mimic a small software firm and its business operations. They built it to help clarify the debate between AI believers, who argue that the majority of human labor can be automated, and AI skeptics, who see such claims as part of a gigantic AI grift.
The gap between these two positions, they argue in a paper [PDF] detailing their project, is due to the lack of a way to test how agents handle common workplace activities. Hence the need for a benchmark, which suggests AI agents have a way to go before they’re truly useful.
Using two agent frameworks – OpenHands CodeAct and OWL-Roleplay – the CMU boffins put the following models through their paces and evaluated them on task success rate. The results were underwhelming.
- Gemini-2.5-Pro (30.3 percent)
- Claude-3.7-Sonnet (26.3 percent)
- Claude-3.5-Sonnet (24 percent)
- Gemini-2.0-Flash (11.4 percent)
- GPT-4o (8.6 percent)
- o3-mini (4.0 percent)
- Gemini-1.5-Pro (3.4 percent)
- Amazon-Nova-Pro-v1 (1.7 percent)
- Llama-3.1-405b (7.4 percent)
- Llama-3.3-70b (6.9 percent)
- Qwen-2.5-72b (5.7 percent)
- Llama-3.1-70b (1.7 percent)
- Qwen-2-72b (1.1 percent)
“We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks,” the authors state in their paper.
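The paper scores tasks against checkpoints, so an agent that gets partway through a task earns something. As a rough illustration, assuming a 50/50 split between partial progress and full completion (the paper's exact weighting may differ), the arithmetic looks like this:

```python
# Hedged sketch of checkpoint-based partial credit, in the spirit of the
# paper's metric. Each task is split into checkpoints worth points; the
# 50/50 weighting between partial progress and full completion is an
# assumption for illustration, not necessarily the paper's exact formula.

def task_score(points_earned: int, points_total: int) -> float:
    fully_done = points_earned == points_total
    return 0.5 * (points_earned / points_total) + 0.5 * float(fully_done)

print(task_score(4, 4))  # 1.0  -> full completion
print(task_score(2, 4))  # 0.25 -> partial credit only
```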
The researchers observed various failures during testing. These included agents neglecting to message a colleague as directed, failing to handle certain UI elements like popups while browsing, and engaging in deception. In one case, when an agent couldn’t find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided “to create a shortcut solution by renaming another user to the name of the intended user.”
The CMU authors – Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig – have published their code to GitHub.
Graham Neubig, an associate professor at CMU’s Language Technologies Institute and one of the paper’s co-authors, told The Register in a phone interview that the impetus for TheAgentCompany was a paper from researchers at OpenAI and the Wharton School of the University of Pennsylvania about all of the jobs that theoretically could be automated.
“Basically their methodology was that they asked ChatGPT whether the job could be automated,” he explained. “They also asked people whether the job could be automated and then they said ChatGPT and people agreed some portion of the time.”
Neubig, who also works at a startup building coding agents, said he was skeptical, so he wanted to create a benchmark to test how well AI models handle knowledge work tasks. After around eight months of work, the team released TheAgentCompany.
Initially, a software agent was able to completely finish about 24 percent of tasks involving web browsing, coding, and related work.
“Recently, we tried a newer version of an agent and it got 34 percent,” he said. “So it increased from like one quarter to one third. And that’s after about six months. One thing that’s been a little bit disappointing to me is this benchmark hasn’t been picked up by the big frontier labs. Maybe it’s too hard and it makes them look bad.”
Neubig said he expects agents will become more capable in time but added that even imperfect agents can be useful, at least in the context of coding agents – a partial code suggestion can be filled out and improved.
For agents dealing with more general office tasks, the situation is different. “It’s very easy to sandbox code and not have it affect anything outside of the sandbox,” he said. “Whereas, if an agent is processing emails on your company email server… it could send the email to the wrong people.”
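The sandboxing point is easy to see in code: a throwaway subprocess with a timeout contains a coding agent's mistakes, as in the sketch below, while there's no equivalent undo button for a mis-sent email.

```python
# Minimal sketch of why code is easy to sandbox: run the agent's output
# in a throwaway subprocess with a timeout, so a mistake dies with the
# process. Production setups would add containers plus filesystem and
# network isolation on top.

import subprocess, sys, tempfile

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-I", path],   # -I: isolated mode, ignores user env
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout or result.stderr

print(run_sandboxed("print(sum(range(10)))"))  # -> 45
```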
That said, Neubig sees the adoption of the Model Context Protocol (MCP) as a positive development for agents because it makes more systems programmatically accessible.
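MCP, for the unfamiliar, is an open protocol that lets a model discover and call tools exposed by a server. A minimal server built with the protocol's Python SDK looks roughly like the following; the lookup_customer tool is invented for illustration:

```python
# Sketch of an MCP server exposing one tool, using the mcp Python SDK
# (pip install mcp). The lookup_customer name is made up for the example;
# the point is that any internal system wrapped this way becomes callable
# by an agent over a standard protocol.

from mcp.server.fastmcp import FastMCP

server = FastMCP("crm-tools")

@server.tool()
def lookup_customer(email: str) -> str:
    """Return CRM notes for a customer (stubbed for the sketch)."""
    return f"No open tickets for {email}"

if __name__ == "__main__":
    server.run()  # defaults to stdio transport, which MCP clients expect
```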
Meanwhile, researchers from Salesforce – Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, and Chien-Sheng Wu – have proposed a benchmark of their own that’s tuned for Customer Relationship Management (CRM).
The benchmark, dubbed CRMArena-Pro, consists of “nineteen expert-validated tasks across sales, service, and ‘configure, price, and quote’ processes, for both Business-to-Business and Business-to-Customer scenarios,” and covers both single-turn interactions (a prompt and a response) and multi-turn interactions (a series of prompts and responses in which context is maintained throughout the conversation).
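The distinction matters for the scores that follow. In a single-turn test the model gets one prompt and must answer; in a multi-turn test every exchange is appended to the context, and later answers have to stay consistent with earlier ones. A schematic sketch, with a hypothetical llm() standing in for any chat API:

```python
# Illustrative difference between single-turn and multi-turn evaluation,
# with a hypothetical llm() standing in for any chat-completion API.

def llm(messages: list[dict]) -> str:
    return f"(reply based on {len(messages)} messages of context)"

# Single-turn: one prompt, one response, no carried state.
print(llm([{"role": "user", "content": "Quote 40 seats of Product X."}]))

# Multi-turn: each exchange is appended, so later answers must stay
# consistent with everything said earlier, which is the harder setting.
history = [{"role": "user", "content": "Quote 40 seats of Product X."}]
history.append({"role": "assistant", "content": llm(history)})
history.append({"role": "user", "content": "Apply our volume discount."})
print(llm(history))
```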
“Our results reveal that even leading LLM agents achieve modest overall success rates on CRMArena-Pro, typically around 58 percent in single-turn scenarios, with performance significantly degrading to approximately 35 percent in multi-turn settings,” the Salesforce computer scientists state.
“Our findings indicate that LLM agents are generally not well-equipped with many of the skills essential for complex work tasks; Workflow Execution stands out as a notable exception, however, where strong agents like gemini-2.5-pro achieve success rates higher than 83 percent.”
They add that all of the models evaluated “demonstrate near-zero confidentiality awareness.” That’s going to make AI agents a tough sell in corporate IT environments.
The findings from CMU and Salesforce more or less align with Gartner’s assessment of the present state of agentic AI.
“Most agentic AI propositions lack significant value or return on investment (ROI), as current models don’t have the maturity and agency to autonomously achieve complex business goals or follow nuanced instructions over time,” said Anushree Verma, senior director analyst, in a statement. “Many use cases positioned as agentic today don’t require agentic implementations.”
That said, Gartner still expects that by 2028 about 15 percent of daily work decisions will be made autonomously by AI agents, up from 0 percent last year. Also, the firm sees 33 percent of enterprise software applications including agentic AI by that time. ®