Google’s Gemini 3: Progress or Just Hype?


Google’s latest AI model, Gemini 3, is making headlines with benchmark scores that suggest significant advances in artificial intelligence capabilities. While these results may buoy confidence in the field, real-world performance and reliability remain open questions.

Benchmark Scores and Their Limitations

Google claims Gemini 3 exhibits “PhD-level reasoning,” citing its performance on tests like Humanity’s Last Exam—a rigorous assessment of graduate-level knowledge across math, science, and humanities. The model scored 37.5%, surpassing OpenAI’s GPT-5 (26.5%). However, experts caution against overinterpreting these scores. Luc Rocher at the University of Oxford notes that improving from 80% to 90% on a benchmark doesn’t necessarily equate to a meaningful leap in genuine reasoning ability.

Benchmark tests, which often rely on multiple-choice or single-answer formats, may not accurately reflect real-world problem-solving skills. Rocher points out that doctors and lawyers don’t assess clients with multiple-choice questions; their expertise requires nuanced evaluation. There is also the concern that models may be “cheating” by simply regurgitating information from their training data.
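
To make the limitation concrete, here is a minimal sketch of how an exact-match, multiple-choice benchmark is typically scored. The questions, answer key, and grade() helper are hypothetical illustrations, not the evaluation code of Humanity’s Last Exam or any Google benchmark; the point is that a single percentage collapses everything about how an answer was reached into whether the final letter matches.

```python
# Illustrative sketch of exact-match benchmark scoring. The answers and
# the grade() helper are hypothetical, not any real benchmark's code.

def grade(model_answers: list[str], answer_key: list[str]) -> float:
    """Exact-match accuracy: the fraction of answers identical to the key."""
    correct = sum(a == k for a, k in zip(model_answers, answer_key))
    return correct / len(answer_key)

# The metric records only the final letter chosen, so a lucky guess and a
# carefully reasoned answer are indistinguishable in the score.
model_answers = ["B", "C", "A", "D"]
answer_key    = ["B", "C", "A", "A"]
print(f"Score: {grade(model_answers, answer_key):.1%}")  # Score: 75.0%
```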

Hallucinations and Reliability Concerns

Despite advancements in performance metrics, Gemini 3 continues to exhibit a troubling flaw common to large language models: factual inaccuracies and hallucinations. Google acknowledges this, stating that the model will still produce false or misleading information at rates comparable to other leading AI systems. This is particularly concerning because a single significant error can erode trust in the technology. Artur d’Avila Garcez at City St George’s, University of London, underscores that reliability is paramount—a catastrophic hallucination could undermine the entire system.

Real-World Applications and Future Outlook

Google positions Gemini 3 as an improvement for tasks like software development, email organization, and document analysis. The company also plans to enhance Google Search with AI-generated graphics and simulations. However, the most significant gains may lie in agentic coding, in which an AI agent writes, runs, and revises code with minimal human input. Adam Mahdi at the University of Oxford suggests that Gemini 3 Pro will excel in complex workflows rather than everyday conversational tasks.
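
As a rough illustration of what agentic coding involves, the sketch below shows the generate-run-revise loop such systems typically follow. Everything here is a hypothetical stand-in: generate_code() is a placeholder for a model call, not a real Gemini API, and the harness simply runs pytest and feeds failures back as context.

```python
import subprocess

def generate_code(task: str, feedback: str) -> str:
    """Hypothetical stand-in for a call to a code-generating model."""
    return "# model-generated code for: " + task + "\n"

def apply_patch(code: str, path: str = "solution.py") -> None:
    """Write the model's proposal to disk (simplified: whole-file rewrite)."""
    with open(path, "w") as f:
        f.write(code)

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and capture its output."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def agentic_loop(task: str, max_attempts: int = 5) -> bool:
    """Propose code, run the tests, retry with the failure log as feedback."""
    feedback = ""
    for _ in range(max_attempts):
        apply_patch(generate_code(task, feedback))  # model writes the code
        passed, feedback = run_tests()              # tests judge the result
        if passed:
            return True                             # loop ran autonomously
    return False                                    # hand back to a human
```

Notably, success in such a loop is judged by an external check (the test suite) rather than the model’s own confidence, which is one reason agentic workflows can tolerate a model that is not perfectly reliable.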

Initial user feedback pairs praise for Gemini 3’s coding and reasoning abilities with reports of failures on simple visual-reasoning tests. The true test will be how effectively people integrate the model into their workflows and whether its reliability justifies the massive investments in AI infrastructure.

The ultimate measure of success for Gemini 3 and similar AI models isn’t just benchmark scores, but their practical value and trustworthiness in real-world applications.

The AI arms race continues, but until hallucinations are reliably addressed, the promise of truly intelligent systems remains unfulfilled.