Anthropic launches smarter Claude models with computer skills

Anthropic has announced upgrades to its Claude AI models, including an enhanced Claude 3.5 Sonnet and a new Claude 3.5 Haiku. The company is also introducing a new feature that allows the model to interact directly with computer interfaces.

The updated Claude 3.5 Sonnet shows significant improvements in programming tasks. Its score on the SWE-bench Verified benchmark rose from 33.4% to 49.0%, which Anthropic says beats all publicly available models, including specialized programming systems.

Sonnet also improved on TAU-bench, a test of agentic tool use. In the retail domain, its score rose from 62.6% to 69.2%, while in the more challenging aviation domain, it rose from 36.0% to 46.0%.
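To put these jumps side by side, the short Python sketch below computes the gain in percentage points and the relative gain for each benchmark. It uses only the scores quoted above; the dictionary keys and the `gains` helper are illustrative names, not part of any Anthropic tooling.

```python
# Scores quoted in the article, in percent:
# (previous Claude 3.5 Sonnet, upgraded Claude 3.5 Sonnet)
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench, retail": (62.6, 69.2),
    "TAU-bench, aviation": (36.0, 46.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to one decimal."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


for name, (before, after) in scores.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points} points ({relative}% relative)")
```

The percentage-point gain is the plain difference between the two scores, while the relative gain expresses that difference as a share of the earlier score, which is why the two figures can diverge noticeably on benchmarks with a low starting score.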

Table: Comparison of AI models across various benchmarks. Claude 3.5 Sonnet (New) leads in several categories, including GPQA, MMLU, HumanEval, and AIME 2024.
The new Sonnet makes its biggest leaps in reasoning and agentic tool use. | Image: Anthropic

New Haiku model outperforms previous flagship

Anthropic is also introducing a new Claude 3.5 Haiku model. The company claims that this model outperforms the previous top-of-the-line Claude 3 Opus on many benchmarks, while keeping speed and cost similar to those of the previous Claude 3 Haiku. Notably, Anthropic did not mention any plans for a new Opus model in this announcement.

Comparison table: AI model performance in various benchmarks, Claude 3.5 Sonnet (new) leading in several categories.
The new Claude 3.5 Sonnet model shows improved performance, especially in logical reasoning, mathematical problem-solving, and programming tasks. On the general language comprehension benchmark MMLU, it is only slightly ahead of the previous Claude 3.5 Sonnet. | Image: Anthropic

The new Claude 3.5 Haiku demonstrates impressive capabilities relative to its speed and cost in programming tasks. It scores 40.6% on the SWE-bench Verified test, which Anthropic says exceeds the performance of many agents based on “publicly available state-of-the-art models,” including GPT-4o.

Regarding knowledge cutoff dates, Claude 3.5 Sonnet is current through April 2024, while the new Haiku model has information up to July 2024. Anthropic plans to release Haiku later this month.

AI-driven computer interaction

Anthropic describes its new “computer use” feature as a significant innovation. Rather than developing specific tools for individual tasks, the company is taking a broader approach by teaching Claude general computer skills. This allows the AI to use various standard tools and software programs originally designed for human use.

Anthropic has developed an API that enables Claude to perceive and interact with computer interfaces. Developers can integrate this API to allow Claude to translate instructions like “Use data from my computer and the internet to fill out this form” into actual computer commands.

The system can move the mouse pointer, click on screen elements, and enter information using a virtual keyboard. In the OSWorld benchmark, which assesses AI models’ ability to use computers in a human-like manner, Claude 3.5 Sonnet scored 14.9% in the “screenshots only” category. While this is significantly higher than the next best AI system at 7.8%, it still falls far short of human capabilities.
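As a rough illustration of the idea, the Python sketch below shows how an application might apply a sequence of model-issued actions (move the pointer, click, type) to a desktop. Everything here is an assumption made for illustration: the `FakeDesktop` class, the `run_actions` function, and the action dictionary format are hypothetical and do not reproduce Anthropic's actual API. A real integration would also send screenshots back to the model after each step.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real desktop; it only records what was requested."""
    pointer: tuple[int, int] = (0, 0)
    clicks: list[tuple[int, int]] = field(default_factory=list)
    typed: list[str] = field(default_factory=list)

    def apply(self, action: dict) -> None:
        """Apply one action. The action format is an assumption, not a real API schema."""
        kind = action["type"]
        if kind == "mouse_move":
            self.pointer = (action["x"], action["y"])
        elif kind == "click":
            # A click lands wherever the pointer currently is.
            self.clicks.append(self.pointer)
        elif kind == "type":
            self.typed.append(action["text"])
        else:
            raise ValueError(f"unsupported action: {kind}")


def run_actions(desktop: FakeDesktop, actions: list[dict]) -> FakeDesktop:
    """Apply the actions in order and return the desktop for inspection."""
    for action in actions:
        desktop.apply(action)
    return desktop


# Example: the kind of steps a model might emit to fill in one form field.
desktop = run_actions(
    FakeDesktop(),
    [
        {"type": "mouse_move", "x": 120, "y": 340},
        {"type": "click"},
        {"type": "type", "text": "Jane Doe"},
    ],
)
```

Rejecting unknown action types outright, rather than ignoring them, is one way to keep such a loop predictable while the model's computer skills remain error-prone.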

Anthropic recognizes that Claude’s current computer interaction skills are imperfect. Some actions that humans find effortless, such as scrolling, dragging, or zooming, are still challenging for Claude. The company recommends that developers start with low-risk tasks when implementing this feature.
