Claude vs. GPT-5.4: Head-to-Head on the Benchmarks

Two of the biggest names in AI coding tools are going head-to-head. Researchers recently put Claude Code and GPT-5.4 through a series of tough benchmarks. The results showed that neither model won every category.

On SWE-bench, a popular coding test, GPT-5.4 scored 57.7%. Claude came in at 52.7%. GPT-5.4 also led on Terminal Bench, scoring 75.1% compared to Claude’s 65.4%. On the harder SWE-bench Pro test, GPT-5.4 again came out ahead with 57.7%, while Claude Sonnet 4.6 scored around 47%.


But Claude’s not losing across the board. On GPQA Diamond, a reasoning-heavy test, Claude scored 87.4% versus GPT-5.4’s 83.9%. That’s a meaningful gap in favor of Claude.

Speed tells a mixed story. Claude Sonnet 4.6 starts responding faster, with a time-to-first-token of about 1.2 seconds. GPT-5.4 takes 2 to 3 seconds. However, GPT-5.4 pushes out 80 tokens per second overall, while Claude runs at about 55.
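The trade-off between time-to-first-token and throughput depends on response length. A back-of-envelope sketch using the figures above (taking GPT-5.4's TTFT as 2.5 seconds, the midpoint of the quoted 2-3 second range, and assuming perfectly steady decoding — both simplifying assumptions):

```python
def total_latency(ttft_s, tokens_per_s, n_tokens):
    """Wall-clock time to stream a full response of n_tokens."""
    return ttft_s + n_tokens / tokens_per_s

# Figures from the benchmarks above; GPT-5.4 TTFT is a midpoint assumption.
claude = lambda n: total_latency(1.2, 55, n)
gpt54 = lambda n: total_latency(2.5, 80, n)

for n in (100, 250, 1000):
    print(f"{n:5d} tokens: Claude {claude(n):5.1f}s, GPT-5.4 {gpt54(n):5.1f}s")
```

Under these assumptions, Claude finishes short responses first, but GPT-5.4's higher throughput overtakes it somewhere past the low hundreds of tokens — so "which feels faster" depends on whether your workload is chatty or long-form.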

Code quality also differs. GPT-5.4 tends to write lean, fast code. Claude's output is cleaner and easier to maintain over time. GPT-5.4 handles well-defined tasks with strong execution. Claude does better at planning, interpretation, and creative judgment.

On tool use, GPT-5.4 supports web search, file search, code interpretation, and computer use. Its OSWorld-Verified score hit 75%, which actually beats average human performance at 72.4%. Claude Opus 4.7 makes fewer tool calls but reasons things out before acting, scoring 77.3% on MCP-Atlas.

Pricing also differs. GPT-5.4 costs $2.50 per million input tokens, while Claude Sonnet 4.6 runs $3.00 per million. GPT-5.4 also uses 72% fewer output tokens than Claude Opus 4.7 on similar tasks. Caching cuts the other way: Claude's discount reaches 90% off repeated content versus GPT-5.4's 50% reduction, which can make Claude significantly more economical for teams running repetitive workflows at scale.
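Whether the caching discount outweighs the lower base price depends on how much of your input repeats. A rough sketch using the prices and discounts above, where the cache hit rate (the share of input tokens served from cache) is a hypothetical workload parameter:

```python
def cost_per_million_input(base_price, cache_discount, cache_hit_rate):
    """Blended $/M input tokens when cached tokens get a discount."""
    cached = base_price * (1 - cache_discount) * cache_hit_rate
    fresh = base_price * (1 - cache_hit_rate)
    return cached + fresh

hit = 0.8  # hypothetical: 80% of input tokens repeat across requests
claude_cost = cost_per_million_input(3.00, 0.90, hit)  # $3.00/M, 90% off cached
gpt54_cost = cost_per_million_input(2.50, 0.50, hit)   # $2.50/M, 50% off cached

print(f"Claude Sonnet 4.6: ${claude_cost:.2f}/M input")
print(f"GPT-5.4:           ${gpt54_cost:.2f}/M input")
```

At an 80% hit rate the blended prices flip in Claude's favor; at a 0% hit rate GPT-5.4's lower sticker price wins, so the break-even point is a property of the workload, not the model.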

In the overall BenchLM score, GPT-5.4 leads with 84 points versus Claude’s 80. Still, researchers note there’s no clear universal winner. Both models scored above 94% on math tasks, reflecting how competitive the two have become at core quantitative reasoning.

GPT-5.4 handles hard novel problems and autonomous coding well. Claude holds its own on long reasoning tasks and large engineering projects. The gap between them is real but narrow.

