Anthropic Claude 4 Release: A New Benchmark in AI

Bang-x · Published on May 26, 2025

Anthropic officially released the Claude 4 series models, including Claude Opus 4 and Claude Sonnet 4, on May 22, 2025, at 9 AM Eastern Time (early morning on May 23, 2025, Beijing Time). This release marks Anthropic's latest breakthrough in the AI field, particularly in programming capabilities, reasoning depth, and Agent development. Anthropic announced that the models are now available on all relevant product platforms, including Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

Core Features and Technical Characteristics of Claude 4

1. New Model Series: Opus 4 and Sonnet 4

Claude Opus 4: Positioned as the most powerful and intelligent model, designed for complex reasoning, top-tier programming, and AI Agent workflows. Excels in handling highly complex tasks.
Claude Sonnet 4: Achieves a balance between performance and efficiency, with significant improvements over the previous Sonnet 3.7, suitable for most everyday AI application scenarios and high-throughput usage.

2. Hybrid Reasoning Modes

Both models adopt a hybrid system design, offering two operating modes:

Instant Response Mode: Provides answers within seconds, suitable for routine tasks.
Extended Thinking Mode: Takes more time for thinking and planning, suitable for complex problems and multi-step tasks. This is a beta feature that allows the model to switch between thinking and using tools.

3. Enhanced Agent Infrastructure Capabilities

Improved Memory Capabilities: Especially Opus 4, which can create and maintain "memory files" to store key information, maintaining coherence and focus in long-duration tasks.
Stronger Instruction Following: Significant improvements in handling complex, lengthy system prompts, with more precise understanding of user intent.
Reduced Reward Hacking: The model's tendency to take shortcuts to achieve goals has been reduced by over 80%, improving the reliability of outputs.
Parallel Tool Calling: Claude can now call multiple tools simultaneously, increasing efficiency.

4. Breakthrough API Features

Code Execution Tool: Allows Claude to run Python code in a sandbox environment, enabling data analysis, chart generation, and more.
MCP Connector: Supports seamless integration with any remote Model Context Protocol (MCP) server, simplifying the construction of tool-enabled Agents.
File API: Simplifies document management, allowing file uploads and references across multiple conversations, integrated with code execution tools.
Extended Prompt Cache: Maintains context for up to an hour, providing cost-effective memory management for long interactions.

5. Multimodal Capabilities

Claude 4 series models possess powerful multimodal capabilities, able to process and understand visual information such as images and charts, and combine them with text for reasoning and generation.

6. AI Safety Level 3 (ASL-3)

Claude 4 Opus is Anthropic's first model deployed under the ASL-3 standard, adopting additional safety measures, particularly in CBRN (Chemical, Biological, Radiological, and Nuclear) related knowledge and capabilities, tending toward caution.

Comparison of Claude 4 with Claude 3 Series and Other Competitors

Claude 4 series performs excellently in multiple benchmark tests, especially in programming and reasoning capabilities, surpassing the previous Claude 3 series and some competitors.

1. Comparison with Claude 3 Series

Performance Improvement: Claude 4 series shows significant improvements compared to Claude 3.7 models, especially in reasoning capabilities, tool usage accuracy, and overall intelligence level. Opus 4's accuracy in answering challenging open-ended questions has doubled compared to Claude 2.1.
New Features: Claude 4 introduces extended thinking, enhanced memory, code execution tools, MCP connectors, File API, and other new features not available in the Claude 3 series.
Instruction Following: Claude 4 performs better in following complex multi-step instructions, reducing issues of "excessive enthusiasm" or "unnecessary rejection" present in previous generation models.
Context Window: Both Claude 4 and Claude 3 series offer a 200K token context window, but Claude 4 has the capability to process inputs exceeding 1 million tokens (for users with specific needs).

2. Comparison with Other AI Competitors (GPT-4o, Gemini 2.5 Pro)

The following table summarizes Claude 4's performance against major competitors in key benchmark tests:

Benchmark Test	Claude Opus 4	Claude Sonnet 4	GPT-4o	Gemini 2.5 Pro
SWE-bench (Programming)	72.50%	72.70%	55.30%	50.10%
Terminal-bench (Command Line)	43.20%	41.80%	28.40%	25.70%
GPQA Diamond (Scientific Reasoning)	74.90%	70.00%	65.20%	62.80%
MMLU (Comprehensive Knowledge)	87.40%	85.40%	82.10%	79.60%
MMMU (Multimodal Understanding)	73.70%	72.60%	69.80%	67.20%
AIME (Mathematical Reasoning)	33.90%	33.10%	29.20%	26.80%

Programming Capabilities: Claude 4 series shows clear advantages in programming-related tests such as SWE-bench and Terminal-bench, with scores far exceeding GPT-4o and Gemini 2.5 Pro, considered to be the strongest programming models currently.
Reasoning Capabilities: In scientific reasoning (GPQA) and comprehensive knowledge (MMLU) tests, Claude Opus 4 leads the way.
Multimodal Understanding: In multimodal understanding (MMMU), Claude 4 series also demonstrates leading advantages.
Mathematical Abilities: In mathematical reasoning (AIME), Claude 4 series similarly leads ahead of competitors.
Speed and Cost: Claude Sonnet 4 maintains high performance while being faster and more cost-effective, suitable for everyday tasks. Opus 4 offers the strongest performance but with relatively slower response times and higher costs. GPT-4o and Claude 3.5 Sonnet are significantly outperformed by Kimi k1.5 in mathematics, code, visual multimodal, and general capabilities in short-CoT mode (with improvements up to 550%), indicating fierce market competition with different models focusing on different aspects.
Integration and Ecosystem: Claude 4 is available through Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, and is building a more comprehensive developer ecosystem, including development tool integration and enterprise-level solutions. Both GPT-4 and Claude 2 perform well in integration, with Claude possibly having a slight advantage in integration with applications like Slack and Zoom.

Summary

Anthropic's Claude 4 series models were released on May 22/23, 2025, bringing two powerful models: Opus 4 and Sonnet 4. Their core highlights include significantly enhanced programming and reasoning capabilities, innovative Agent infrastructure features (such as extended thinking, enhanced memory, code execution tools, etc.), and multimodal understanding capabilities. In multiple authoritative benchmark tests, the Claude 4 series, especially Sonnet 4 and Opus 4, have set new industry benchmarks in programming and reasoning, surpassing competitors like GPT-4o and Gemini 2.5 Pro. These advancements make Claude 4 a powerful tool for developers building complex AI applications and Agents.