AI Research Agents Excel

Google’s Gemini 3.1 has arrived, and it’s pushing AI research agents to new heights. The new model’s scores on key AI benchmarks show major jumps over its predecessor, Gemini 3 Pro. Experts are taking notice of what the numbers reveal.

On the ARC-AGI-2 benchmark, which tests abstract reasoning, Gemini 3.1 scored 77.1%. That’s more than double Gemini 3 Pro’s 31.1%. The model also hit the highest score ever recorded on the GPQA Diamond test, which measures graduate-level science knowledge.

Gemini 3.1 also made big strides in agentic tasks. These are jobs where an AI works on its own to complete multi-step goals. On the APEX-Agents benchmark, it scored 33.5%, compared to 18.4% for Gemini 3 Pro. That’s an 82% relative improvement. For web research tasks measured by BrowseComp, it scored 85.9% versus 59.2% for the older model.
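The "82% relative improvement" follows directly from the two APEX-Agents scores above. A quick check of the arithmetic, distinguishing the absolute gain (in percentage points) from the relative gain:

```python
# Scores (%) from the article: Gemini 3 Pro vs. Gemini 3.1 on APEX-Agents.
old_score, new_score = 18.4, 33.5

absolute_gain = new_score - old_score                 # in percentage points
relative_gain = (new_score - old_score) / old_score   # as a fraction of the old score

print(f"{absolute_gain:.1f} pp absolute, {relative_gain:.0%} relative")
# → 15.1 pp absolute, 82% relative
```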

One of the model’s standout features is its ability to reduce hallucinations. That’s when AI makes up wrong answers instead of admitting it doesn’t know something. Gemini 3.1 cut its hallucination rate by 38 percentage points compared to Gemini 3 Pro Preview. It’s now much better at saying “I don’t know” rather than guessing incorrectly.

In coding tasks, the model’s performance was also strong. It hit 80.6% on SWE-Bench Verified, a test for fixing real software bugs. It also scored a 2887 Elo rating on LiveCodeBench Pro, which measures competitive coding skills.

For businesses, Google launched Deep Research Max integration. It turns the model into a research tool that can gather data from multiple sources, check facts, and produce detailed, cited reports. Companies in finance, life sciences, and market research are among those expected to benefit. With the global AI market projected to grow from $391 billion in 2025 to $1.81 trillion by 2030, tools like this are arriving at a critical moment for enterprise adoption.

The tool works through a single API call and can blend proprietary data with open web sources.

Gemini 3.1’s 1 million token context window also helps it handle huge amounts of information at once. That makes it useful for tasks like literature reviews and hypothesis generation in research settings. The model also supports multimodal input, allowing it to process text, images, audio, and video within the same workflow.

Developers can access the model directly through Google AI Studio using the gemini-3.1-pro-preview identifier, giving teams a straightforward path to integrate these research capabilities into their own applications.
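A minimal sketch of what that single API call might look like, assuming the public Gemini REST API's generateContent method. The model identifier is the one named above; the endpoint shape follows Google's published API pattern, and the prompt and `build_request` helper are illustrative. Actually sending the request requires an API key and is omitted here.

```python
# Model id from the article; endpoint shape per the public Gemini REST API.
MODEL_ID = "gemini-3.1-pro-preview"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL_ID}:generateContent"
)

def build_request(prompt: str) -> dict:
    """Assemble a minimal generateContent request body (illustrative helper)."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

print(ENDPOINT)
print(build_request("Summarize recent agentic-benchmark results, with citations."))
```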
