When the Research Harness Pitted Claude Code Against GPT-5.4, Both Models Cracked

Two of the biggest names in AI coding tools are going head-to-head. Researchers recently put Claude Code and GPT-5.4 through a series of tough benchmarks. The results showed that neither model won every category.

On SWE-bench, a popular coding test, GPT-5.4 scored 57.7%. Claude came in at 52.7%. GPT-5.4 also led on Terminal Bench, scoring 75.1% compared to Claude’s 65.4%. On the harder SWE-bench Pro test, GPT-5.4 again came out ahead with 57.7%, while Claude Sonnet 4.6 scored around 47%.

GPT-5.4 outscored Claude on SWE-bench, Terminal Bench, and SWE-bench Pro, with leads ranging from about 5 to 11 percentage points.

But Claude’s not losing across the board. On GPQA Diamond, a reasoning-heavy test, Claude scored 87.4% versus GPT-5.4’s 83.9%. That’s a 3.5-point gap in Claude’s favor.

Speed tells a mixed story. Claude Sonnet 4.6 starts responding faster, with a time-to-first-token of about 1.2 seconds. GPT-5.4 takes 2 to 3 seconds. However, GPT-5.4 pushes out 80 tokens per second overall, while Claude runs at about 55.
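To see what those speed figures mean in practice, here is a rough sketch that estimates total response time as time-to-first-token plus output length divided by throughput. It assumes the tokens-per-second figures describe steady streaming after the first token, which the benchmarks quoted here do not spell out, and the function names are illustrative only.

```python
def response_time(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Estimated seconds to receive n_tokens: startup delay plus streaming time."""
    return ttft_s + n_tokens / tokens_per_s


def crossover_tokens(ttft_a: float, tps_a: float, ttft_b: float, tps_b: float) -> float:
    """Output length at which slower-starting model B catches up with model A.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    Assumes B starts later (ttft_b > ttft_a) but streams faster (tps_b > tps_a).
    """
    return (ttft_b - ttft_a) / (1 / tps_a - 1 / tps_b)


# Figures quoted in the article; the simple linear model is an assumption.
CLAUDE_TTFT, CLAUDE_TPS = 1.2, 55.0
GPT_TPS = 80.0

for gpt_ttft in (2.0, 3.0):  # the article gives a 2 to 3 second range
    n = crossover_tokens(CLAUDE_TTFT, CLAUDE_TPS, gpt_ttft, GPT_TPS)
    print(f"GPT-5.4 TTFT {gpt_ttft:.1f}s: break-even near {n:.0f} output tokens")
```

Under these assumptions the break-even lands at roughly 141 to 317 output tokens. Shorter replies would favor Claude's faster start, and longer ones would favor GPT-5.4's higher throughput.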

Code quality also differs. GPT-5.4 tends to write lean, fast code, and it executes well-defined tasks reliably. Claude’s output is cleaner and easier to read and maintain over time, and it does better at planning, interpretation, and creative judgment.

On tool use, GPT-5.4 supports web search, file search, code interpretation, and computer use. Its OSWorld-Verified score hit 75%, which actually beats average human performance at 72.4%. Claude Opus 4.7 makes fewer tool calls but reasons things out before acting, scoring 77.3% on MCP-Atlas.

Pricing also differs. GPT-5.4 costs $2.50 per million input tokens, while Claude Sonnet 4.6 runs $3.00 per million. Claude offsets some of that with deeper caching discounts: up to 90% off repeated content, compared with GPT-5.4’s 50% caching reduction. For teams running repetitive workflows at scale, that can make Claude the more economical choice on input despite its higher list price. Output volume matters too: GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on similar tasks.
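The caching claim can be checked with some back-of-the-envelope arithmetic. The sketch below blends the list price with the cached price according to the share of input tokens served from cache. It is a simplification: it ignores output-token pricing, any surcharge for writing to the cache, and cache expiry, none of which the figures above cover. The function names are hypothetical.

```python
def effective_input_price(base_per_m: float, cache_discount: float, hit_fraction: float) -> float:
    """Blended dollars per million input tokens.

    hit_fraction is the share of input tokens read from cache (0.0 to 1.0);
    those tokens are billed at base_per_m * (1 - cache_discount).
    """
    return base_per_m * (1 - cache_discount * hit_fraction)


def break_even_hit_fraction(base_a: float, disc_a: float, base_b: float, disc_b: float) -> float:
    """Cache-hit share at which provider A's blended price equals provider B's.

    Solves base_a * (1 - disc_a * f) == base_b * (1 - disc_b * f) for f.
    """
    return (base_a - base_b) / (base_a * disc_a - base_b * disc_b)


# Prices and discounts as quoted in the article.
CLAUDE = (3.00, 0.90)  # Claude Sonnet 4.6: $/M input, cache discount
GPT = (2.50, 0.50)     # GPT-5.4: $/M input, cache discount

f = break_even_hit_fraction(*CLAUDE, *GPT)
print(f"Claude is cheaper on input once cache hits exceed {f:.1%}")

for hit in (0.0, 0.5, 1.0):
    c = effective_input_price(*CLAUDE, hit)
    g = effective_input_price(*GPT, hit)
    print(f"hit rate {hit:.0%}: Claude ${c:.2f}/M, GPT-5.4 ${g:.2f}/M")
```

On these numbers the break-even sits at about a 34.5% cache-hit rate. Below that, GPT-5.4's lower list price wins on input cost; above it, Claude's deeper discount does, falling to $0.30 per million against $1.25 when everything is cached.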

In the overall BenchLM score, GPT-5.4 leads with 84 points versus Claude’s 80. Still, researchers note there’s no clear universal winner. Both models scored above 94% on math tasks, reflecting how competitive the two have become at core quantitative reasoning.

GPT-5.4 handles hard novel problems and autonomous coding well. Claude holds its own on long reasoning tasks and large engineering projects. The gap between them is real but narrow.