Two of the biggest names in AI coding tools are going head-to-head. Researchers recently put Claude Code and GPT-5.4 through a series of tough benchmarks. The results showed that neither model won every category.
On SWE-bench, a popular coding benchmark, GPT-5.4 scored 57.7% to Claude's 52.7%. GPT-5.4 also led on Terminal Bench, 75.1% versus 65.4%, and on the harder SWE-bench Pro test it again came out ahead at 57.7%, while Claude Sonnet 4.6 landed around 47%.
In short, GPT-5.4 outscored Claude on SWE-bench, Terminal Bench, and SWE-bench Pro, sometimes by a wide margin.
But Claude's not losing across the board. On GPQA Diamond, a reasoning-heavy test, Claude scored 87.4% versus GPT-5.4's 83.9%, a 3.5-point gap in Claude's favor.
Speed tells a mixed story. Claude Sonnet 4.6 starts responding faster, with a time-to-first-token of about 1.2 seconds versus GPT-5.4's 2 to 3 seconds. Once generating, though, GPT-5.4 sustains about 80 tokens per second to Claude's roughly 55.
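Those two numbers pull in opposite directions, and which one matters depends on response length. A back-of-the-envelope sketch, using the figures above (taking 2.5 s as the midpoint of GPT-5.4's reported 2-3 second TTFT range; the response lengths are arbitrary examples, not from the benchmarks):

```python
def total_time(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Rough wall-clock time for a response: time-to-first-token
    plus steady-state generation time."""
    return ttft_s + n_tokens / tokens_per_s

# Short reply (100 tokens): the faster first token wins.
claude_short = total_time(1.2, 55, 100)  # ~3.0 s
gpt_short = total_time(2.5, 80, 100)     # ~3.8 s

# Long reply (1000 tokens): the higher throughput wins.
claude_long = total_time(1.2, 55, 1000)  # ~19.4 s
gpt_long = total_time(2.5, 80, 1000)     # ~15.0 s
```

Under these assumptions the crossover sits at a few hundred tokens: Claude feels snappier for short, interactive exchanges, while GPT-5.4 finishes long generations sooner.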
Code quality also differs. GPT-5.4 tends to write lean, fast code, while Claude's output is cleaner and easier to maintain over time. GPT-5.4 handles well-defined tasks with strong execution; Claude does better at planning, interpretation, and creative judgment.
On tool use, GPT-5.4 supports web search, file search, code interpretation, and computer use. Its OSWorld-Verified score hit 75%, above the 72.4% average human performance. Claude Opus 4.7 makes fewer tool calls, reasoning things out before acting, and scored 77.3% on MCP-Atlas.
Pricing also differs. GPT-5.4 costs $2.50 per million input tokens to Claude Sonnet 4.6's $3.00, and GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on similar tasks. Claude counters with aggressive caching for large amounts of repeated text: a 90% discount on cached content versus GPT-5.4's 50% caching reduction, which can make Claude significantly more economical for teams running repetitive workflows at scale.
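The caching discounts can flip the headline price ranking. A simplified cost sketch: the per-million-token prices and discount rates come from the figures above, but the 80% cache-hit fraction is a made-up example, and real billing (cache-write surcharges, output tokens) is more involved.

```python
def input_cost(tokens_m: float, price_per_m: float,
               cached_frac: float = 0.0, cache_discount: float = 0.0) -> float:
    """Input cost in dollars for `tokens_m` million tokens, where a
    `cached_frac` share is billed at (1 - cache_discount) of list price."""
    fresh = tokens_m * (1 - cached_frac) * price_per_m
    cached = tokens_m * cached_frac * price_per_m * (1 - cache_discount)
    return fresh + cached

# 10M input tokens, 80% hitting the cache (hypothetical workload).
gpt = input_cost(10, 2.50, cached_frac=0.8, cache_discount=0.50)    # $15.00
claude = input_cost(10, 3.00, cached_frac=0.8, cache_discount=0.90)  # $8.40
```

At a high cache-hit rate, Claude's steeper discount more than offsets its higher list price in this toy model; with little repeated content, GPT-5.4's lower base rate wins.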
In the overall BenchLM score, GPT-5.4 leads with 84 points versus Claude’s 80. Still, researchers note there’s no clear universal winner. Both models scored above 94% on math tasks, reflecting how competitive the two have become at core quantitative reasoning.
GPT-5.4 handles hard novel problems and autonomous coding well. Claude holds its own on long reasoning tasks and large engineering projects. The gap between them is real but narrow.