AI Research Agents Excel

Google’s Gemini 3.1 has arrived, and it’s pushing AI research agents to new heights. The new model’s scores on key AI benchmarks show major jumps over its predecessor, Gemini 3 Pro, and the numbers explain why experts are taking notice.

On the ARC-AGI-2 benchmark, which tests abstract reasoning, Gemini 3.1 scored 77.1%. That’s more than double Gemini 3 Pro’s 31.1%. The model also hit the highest score ever recorded on the GPQA Diamond test, which measures graduate-level science knowledge.

Gemini 3.1 also made big strides in agentic tasks, jobs where an AI works on its own to complete multi-step goals. On the APEX-Agents benchmark it reached 33.5%, compared to 18.4% for Gemini 3 Pro, an 82% relative improvement. For web research tasks measured by BrowseComp, it scored 85.9% versus 59.2% for the older model.

One of the model’s standout features is its ability to reduce hallucinations. That’s when AI makes up wrong answers instead of admitting it doesn’t know something. Gemini 3.1 cut its hallucination rate by 38 percentage points compared to Gemini 3 Pro Preview. It’s now much better at saying “I don’t know” rather than guessing incorrectly.

In coding tasks, the model’s performance was also strong. It hit 80.6% on SWE-Bench Verified, a test for fixing real software bugs. It also scored a 2887 Elo rating on LiveCodeBench Pro, which measures competitive coding skills.

For businesses, Google launched Deep Research Max integration. It turns the model into a research tool that can gather data from multiple sources, check facts, and produce detailed, cited reports. Companies in finance, life sciences, and market research are among those expected to benefit. With the global AI market projected to grow from $391 billion in 2025 to $1.81 trillion by 2030, tools like this are arriving at a critical moment for enterprise adoption.

The tool works through a single API call and can blend proprietary data with open web sources.

Gemini 3.1’s 1 million token context window also helps it handle huge amounts of information at once. That makes it useful for tasks like literature reviews and hypothesis generation in research settings. The model also supports multimodal input, allowing it to process text, images, audio, and video within the same workflow.
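Multimodal requests in the Gemini REST API are expressed as a list of "parts" mixed within one message, text alongside inline media. The snippet below is a minimal sketch of such a request body, assuming the public Generative Language API's `parts`/`inline_data` convention; the PNG bytes are a placeholder, and no request is actually sent.

```python
import base64
import json

# Placeholder image bytes stand in for a real figure or chart.
fake_png = base64.b64encode(b"\x89PNG placeholder").decode("ascii")

# One message containing a text part and an inline image part,
# following the Gemini REST API's multimodal "parts" structure.
body = {
    "contents": [{
        "parts": [
            {"text": "Describe this figure and summarize its key claim."},
            {"inline_data": {"mime_type": "image/png", "data": fake_png}},
        ]
    }]
}

print(json.dumps(body, indent=2))
```

Audio and video are attached the same way, either inline for small files or via a file-upload reference for larger ones.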

Developers can access the model directly through Google AI Studio using the gemini-3.1-pro-preview identifier, giving teams a straightforward path to integrate these research capabilities into their own applications.
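As a rough sketch of what that integration looks like, the snippet below assembles a text-generation request for the `gemini-3.1-pro-preview` identifier using the Generative Language API's REST payload shape. It only builds and inspects the request; actually sending it requires an API key from Google AI Studio, which is omitted here.

```python
import json

MODEL = "gemini-3.1-pro-preview"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)

def build_request(prompt: str) -> dict:
    """Assemble the JSON body for a single-turn text request."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

body = build_request("Summarize recent results on agentic AI benchmarks.")
print(ENDPOINT)
print(json.dumps(body, indent=2))
```

The same payload works through Google's official client SDKs, which wrap this endpoint and handle authentication and streaming.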
