Claude and GPT-5.4 Cracked

Two of the biggest names in AI coding tools are going head-to-head. Researchers recently put Claude Code and GPT-5.4 through a series of tough benchmarks. The results showed that neither model won every category.

On SWE-bench, a popular coding test, GPT-5.4 scored 57.7%. Claude came in at 52.7%. GPT-5.4 also led on Terminal Bench, scoring 75.1% compared to Claude’s 65.4%. On the harder SWE-bench Pro test, GPT-5.4 again came out ahead with 57.7%, while Claude Sonnet 4.6 scored around 47%.

GPT-5.4 outscored Claude on SWE-bench, Terminal Bench, and SWE-bench Pro — sometimes by a significant margin.

But Claude’s not losing across the board. On GPQA Diamond, a reasoning-heavy test, Claude scored 87.4% versus GPT-5.4’s 83.9%. That’s a meaningful gap in favor of Claude.

Speed tells a mixed story. Claude Sonnet 4.6 starts responding faster, with a time-to-first-token of about 1.2 seconds. GPT-5.4 takes 2 to 3 seconds. However, GPT-5.4 pushes out 80 tokens per second overall, while Claude runs at about 55.
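To see where the two speed profiles cross over, a rough back-of-the-envelope model treats total response time as time-to-first-token plus output length divided by throughput. The sketch below plugs in the figures quoted above. The linear model, the function names, and the 2.5-second midpoint used for GPT-5.4 are assumptions for illustration, not measurements.

```python
# Rough latency model: total time = time-to-first-token + tokens / throughput.
# The figures come from the article; the linear model itself is an assumption.

def total_seconds(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Estimated wall-clock time to stream n_tokens of output."""
    return ttft_s + n_tokens / tokens_per_s


def break_even_tokens(ttft_a: float, tps_a: float, ttft_b: float, tps_b: float) -> float:
    """Output length at which model A and model B finish at the same time."""
    return (ttft_b - ttft_a) / (1 / tps_a - 1 / tps_b)


CLAUDE_TTFT, CLAUDE_TPS = 1.2, 55.0  # Claude Sonnet 4.6
GPT_TTFT, GPT_TPS = 2.5, 80.0        # GPT-5.4, midpoint of the 2-3 s range (assumed)

if __name__ == "__main__":
    n = break_even_tokens(CLAUDE_TTFT, CLAUDE_TPS, GPT_TTFT, GPT_TPS)
    print(f"Break-even at roughly {n:.0f} output tokens")
    for tokens in (100, 500, 2000):
        c = total_seconds(CLAUDE_TTFT, CLAUDE_TPS, tokens)
        g = total_seconds(GPT_TTFT, GPT_TPS, tokens)
        print(f"{tokens:>5} tokens: Claude {c:5.1f} s, GPT-5.4 {g:5.1f} s")
```

Under these assumptions, Claude finishes first on replies shorter than roughly 230 tokens, and GPT-5.4 pulls ahead on longer outputs. Moving GPT-5.4's time-to-first-token across the quoted 2 to 3 second range shifts that crossover between about 140 and 320 tokens.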

Code quality also differs. GPT-5.4 tends to write lean, fast code. Claude’s output is cleaner and easier to read and maintain over time. GPT-5.4 handles well-defined tasks with strong execution. Claude does better at planning, interpretation, and creative judgment.

On tool use, GPT-5.4 supports web search, file search, code interpretation, and computer use. Its OSWorld-Verified score hit 75%, which actually beats average human performance at 72.4%. Claude Opus 4.7 makes fewer tool calls but reasons things out before acting, scoring 77.3% on MCP-Atlas.

Pricing also differs. GPT-5.4 costs $2.50 per million input tokens, while Claude Sonnet 4.6 runs $3.00 per million. GPT-5.5 also uses 72% fewer output tokens than Claude Opus 4.7 on similar tasks. Claude, however, offers strong caching options for large amounts of text. For teams running repetitive workflows at scale, Claude’s caching discount reaches 90% off repeated content, compared with GPT-5.4’s 50% caching reduction, which can make Claude the more economical choice when most of the input is repeated.
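How much repetition it takes for the caching discount to offset Claude's higher list price can be estimated with a blended input price. The sketch below is a simplified model built from the prices and discounts quoted above. It assumes the discount applies only to tokens read from cache and ignores cache-write fees and output-token pricing, neither of which is quoted here. The function names are hypothetical.

```python
# Blended input price per million tokens when a fraction of the input is served
# from cache. Prices and discounts come from the article. Applying the discount
# only to cached reads, and ignoring cache-write fees and output tokens, are
# simplifying assumptions.

def blended_input_price(base_per_m: float, cache_discount: float, cached_fraction: float) -> float:
    """Effective dollars per million input tokens for a given cache-hit fraction."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_price = base_per_m * (1.0 - cache_discount)
    return cached_fraction * cached_price + (1.0 - cached_fraction) * base_per_m


def break_even_cached_fraction(base_a: float, disc_a: float, base_b: float, disc_b: float) -> float:
    """Cache-hit fraction at which providers A and B cost the same per input token."""
    return (base_b - base_a) / (base_b * disc_b - base_a * disc_a)


GPT_BASE, GPT_DISCOUNT = 2.50, 0.50        # GPT-5.4
CLAUDE_BASE, CLAUDE_DISCOUNT = 3.00, 0.90  # Claude Sonnet 4.6

if __name__ == "__main__":
    f = break_even_cached_fraction(GPT_BASE, GPT_DISCOUNT, CLAUDE_BASE, CLAUDE_DISCOUNT)
    print(f"Break-even cache-hit fraction: {f:.1%}")
    for frac in (0.0, 0.25, 0.5, 0.9):
        g = blended_input_price(GPT_BASE, GPT_DISCOUNT, frac)
        c = blended_input_price(CLAUDE_BASE, CLAUDE_DISCOUNT, frac)
        print(f"{frac:4.0%} cached: GPT-5.4 ${g:.2f}/M, Claude ${c:.2f}/M")
```

With these simplifications, Claude Sonnet 4.6's input cost drops below GPT-5.4's once roughly 35% of input tokens are cache hits. Below that share, GPT-5.4's lower list price wins.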

In the overall BenchLM score, GPT-5.4 leads with 84 points versus Claude’s 80. Still, researchers note there’s no clear universal winner. Both models scored above 94% on math tasks, reflecting how competitive the two have become at core quantitative reasoning.

GPT-5.4 handles hard novel problems and autonomous coding well. Claude holds its own on long reasoning tasks and large engineering projects. The gap between them is real but narrow.
