Claude and GPT-5.4 Cracked

Two of the biggest names in AI coding tools are going head-to-head. Researchers recently put Claude Code and GPT-5.4 through a series of tough benchmarks. The results showed that neither model won every category.

On SWE-bench, a popular coding test, GPT-5.4 scored 57.7%. Claude came in at 52.7%. GPT-5.4 also led on Terminal Bench, scoring 75.1% compared to Claude’s 65.4%. On the harder SWE-bench Pro test, GPT-5.4 again came out ahead with 57.7%, while Claude Sonnet 4.6 scored around 47%.

GPT-5.4 outscored Claude on SWE-bench, Terminal Bench, and SWE-bench Pro, sometimes by a significant margin.

But Claude is not losing across the board. On GPQA Diamond, a reasoning-heavy test, Claude scored 87.4% versus GPT-5.4’s 83.9%, a 3.5-point lead for Claude.

Speed tells a mixed story. Claude Sonnet 4.6 starts responding faster, with a time-to-first-token of about 1.2 seconds. GPT-5.4 takes 2 to 3 seconds. However, GPT-5.4 pushes out 80 tokens per second overall, while Claude runs at about 55.
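To see where the faster start stops mattering, here is a rough sketch that models total response time as time-to-first-token plus output length divided by throughput. That model is an assumption, not something the benchmarks report: it treats the quoted tokens-per-second figures as pure streaming rates, and it uses 2.5 seconds as an assumed midpoint of the 2 to 3 second range for GPT-5.4. The function names are made up for illustration.

```python
def total_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimated wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def break_even_tokens(ttft_fast_start: float, tps_fast_start: float,
                      ttft_fast_stream: float, tps_fast_stream: float) -> float:
    """Output length at which both models finish at the same moment.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    """
    return (ttft_fast_stream - ttft_fast_start) / (
        1.0 / tps_fast_start - 1.0 / tps_fast_stream
    )


# Figures quoted in the article; 2.5 s is an assumed midpoint of "2 to 3 seconds".
CLAUDE_TTFT, CLAUDE_TPS = 1.2, 55.0
GPT_TTFT, GPT_TPS = 2.5, 80.0

if __name__ == "__main__":
    n_star = break_even_tokens(CLAUDE_TTFT, CLAUDE_TPS, GPT_TTFT, GPT_TPS)
    print(f"break-even output length: about {n_star:.0f} tokens")
    for n in (100, 500, 2000):
        claude = total_seconds(CLAUDE_TTFT, CLAUDE_TPS, n)
        gpt = total_seconds(GPT_TTFT, GPT_TPS, n)
        print(f"{n:>5} tokens: Claude {claude:.2f}s, GPT-5.4 {gpt:.2f}s")
```

Under these assumptions the crossover lands at roughly 230 output tokens: shorter replies finish sooner on Claude Sonnet 4.6, longer ones on GPT-5.4. Moving the GPT-5.4 time-to-first-token between 2 and 3 seconds shifts that point from about 140 to about 320 tokens.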

Code quality also differs. GPT-5.4 tends to write lean, fast code. Claude’s output is cleaner and easier to read over time. GPT-5.4 handles well-defined tasks with strong execution. Claude does better at planning, interpretation, and creative judgment.

On tool use, GPT-5.4 supports web search, file search, code interpretation, and computer use. Its OSWorld-Verified score hit 75%, which actually beats average human performance at 72.4%. Claude Opus 4.7 makes fewer tool calls but reasons things out before acting, scoring 77.3% on MCP-Atlas.

Pricing also differs. GPT-5.4 costs $2.50 per million input tokens, while Claude Sonnet 4.6 runs $3.00 per million. GPT-5.5 also uses 72% fewer output tokens than Claude Opus 4.7 on similar tasks. Caching changes the picture, though. Claude’s caching discount reaches 90% off repeated content, compared with a 50% caching reduction for GPT-5.4, so for teams running repetitive workflows at scale Claude can end up significantly more economical despite the higher list price.
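The caching claim is easier to judge with some arithmetic. The sketch below computes a blended input price when a given share of the prompt is served from cache, using only the list prices and discounts quoted above. It is a simplification under stated assumptions: it ignores output-token prices, any surcharge for writing to the cache, and minimum cacheable sizes, and the function names are hypothetical.

```python
def input_cost_per_million(base_price: float, cache_discount: float,
                           cached_fraction: float) -> float:
    """Blended input price (USD per million tokens) for a partly cached prompt.

    Assumes cached tokens are billed at base_price * (1 - cache_discount)
    and the rest at full price. Cache-write fees are not modeled.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return base_price * (1.0 - cache_discount * cached_fraction)


def break_even_cached_fraction(price_a: float, discount_a: float,
                               price_b: float, discount_b: float) -> float:
    """Cached share at which both blended prices are equal.

    Solves price_a * (1 - discount_a * f) == price_b * (1 - discount_b * f) for f.
    """
    return (price_b - price_a) / (price_b * discount_b - price_a * discount_a)


# Figures quoted in the article.
GPT_PRICE, GPT_DISCOUNT = 2.50, 0.50
CLAUDE_PRICE, CLAUDE_DISCOUNT = 3.00, 0.90

if __name__ == "__main__":
    f_star = break_even_cached_fraction(GPT_PRICE, GPT_DISCOUNT,
                                        CLAUDE_PRICE, CLAUDE_DISCOUNT)
    print(f"break-even cached share: {f_star:.1%}")
    for f in (0.0, 0.25, 0.5, 0.9):
        gpt = input_cost_per_million(GPT_PRICE, GPT_DISCOUNT, f)
        claude = input_cost_per_million(CLAUDE_PRICE, CLAUDE_DISCOUNT, f)
        print(f"cached {f:.0%}: GPT-5.4 ${gpt:.2f}, Claude Sonnet 4.6 ${claude:.2f}")
```

On these numbers Claude Sonnet 4.6 becomes the cheaper option for input once a little over a third of the prompt is repeated content. Below that share, GPT-5.4's lower list price wins.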

In the overall BenchLM score, GPT-5.4 leads with 84 points versus Claude’s 80. Still, researchers note there’s no clear universal winner. Both models scored above 94% on math tasks, reflecting how competitive the two have become at core quantitative reasoning.

GPT-5.4 handles hard novel problems and autonomous coding well. Claude holds its own on long reasoning tasks and large engineering projects. The gap between them is real but narrow.
