New Gemini model levels up image understanding

Jan 28, 2026, 6:11pm UTC

AI models have long prioritized text over images. Google's new agentic model changes that.

Agentic Vision in Gemini 3, unveiled Tuesday, combines visual reasoning with code execution to actively understand images. Google explains that AI models like Gemini typically take a single static glance at the world and, if they miss a detail, compensate with a guess. Agentic Vision in Gemini 3 instead “treats vision as an active investigation,” according to the tech giant.

The results speak for themselves: Gemini 3 Flash with code execution performs up to 10% better than Gemini 3 Flash alone across most vision benchmarks, including MMMU Pro, Visual Probe, and OfficeQA.

Here’s how it works:

  • Zooming in: Instead of taking a single glance at an object and missing details, Gemini 3 Flash is trained to zoom in when it detects fine-grained details (see the first sketch after this list).
  • Annotating images: With Agentic Vision, the model goes beyond simply describing an image: it can execute code that draws directly on the image to ground its reasoning. For example, Google includes a sample prompt in which a user asks Gemini how many fingers are in an image of a hand. Agentic Vision uses Python to draw a box over every finger it identifies, assigns each a number, and produces an accurate final answer (see the second sketch after this list).
  • Plotting and visual math: While standard LLMs typically hallucinate during multi-step visual arithmetic, according to Google, Agentic Vision can “parse through high-density data tables and execute Python code to visualize the findings.” This means it can analyze a data table and convert it into other formats, such as bar charts and graphs (see the third sketch after this list).
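
The first sketch illustrates the zooming step. Google does not specify how the zoom is implemented, so treat this as an assumption: the kind of crop-and-upscale code an agentic model could generate to get a closer look at a region of interest. The file name and coordinates are hypothetical.

```python
# Hypothetical crop-and-zoom step: isolate a region of interest and
# upscale it so fine details survive a second inspection pass.
from PIL import Image

img = Image.open("label_photo.png")  # hypothetical input image

# Suppose the model decides the fine print lives in this region,
# given as (left, upper, right, lower) pixel coordinates.
region = (120, 340, 480, 420)  # hypothetical coordinates

crop = img.crop(region)
# Upscale 4x so small text becomes legible on re-inspection.
zoomed = crop.resize((crop.width * 4, crop.height * 4),
                     Image.Resampling.LANCZOS)
zoomed.save("label_zoomed.png")
```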
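
The second sketch shows the annotation step on Google's finger-counting example. The boxes here are made-up coordinates standing in for whatever the model actually detects; only the draw-then-count pattern comes from Google's description.

```python
# Hypothetical annotation step: draw a numbered box over each detected
# finger, then count the boxes to ground the final answer.
from PIL import Image, ImageDraw

img = Image.open("hand.png")  # hypothetical input image
draw = ImageDraw.Draw(img)

# Hypothetical (left, upper, right, lower) boxes, one per detected finger.
finger_boxes = [
    (40, 30, 90, 160),
    (100, 10, 150, 150),
    (160, 5, 210, 145),
    (220, 15, 270, 155),
    (280, 60, 330, 170),
]

for i, box in enumerate(finger_boxes, start=1):
    draw.rectangle(box, outline="red", width=3)           # box the finger
    draw.text((box[0], box[1] - 14), str(i), fill="red")  # label it

img.save("hand_annotated.png")
print(f"Fingers counted: {len(finger_boxes)}")
```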
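
The third sketch shows the table-to-chart step: tabulate extracted values and plot them rather than reasoning over raw numbers. The figures are illustrative only.

```python
# Hypothetical table-to-chart step: load extracted values into a small
# DataFrame and render them as a bar chart with matplotlib.
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for a table the model has parsed from an image or document.
table = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
    "revenue": [4.2, 5.1, 4.8, 6.3],  # illustrative values, in $M
})

plt.bar(table["quarter"], table["revenue"], color="steelblue")
plt.ylabel("Revenue ($M)")
plt.title("Quarterly revenue")
plt.savefig("revenue_chart.png")
```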

In practice, this means Google's model can more accurately count the objects in a picture or read the small-print text on an object, results that are useful on their own or as context for broader questions and bigger tasks.

Agentic Vision is currently available through the Gemini API in Google AI Studio and Vertex AI. It is also rolling out to the Gemini app, where non-developers can access it by selecting “Thinking” from the model drop-down.
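
For developers, here is a minimal sketch of what calling the model through the Gemini API with code execution enabled could look like, using the google-genai Python SDK. The model ID "gemini-3-flash" is an assumption based on the naming above, not a confirmed identifier; check Google's documentation for the exact string.

```python
# Minimal sketch: send an image plus a question to Gemini with the
# code-execution tool enabled, via the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

with open("hand.png", "rb") as f:  # hypothetical input image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed model ID, not confirmed
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "How many fingers are visible in this image?",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```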

Our Deeper View

Over the past year, AI companies have raced to improve image and video generation. OpenAI's Sora and Google's Imagen 3 and Veo produce strikingly realistic media, pushing the technology forward dramatically. But this progress has focused almost entirely on creating new content. Accurate image analysis is equally important, if not more so. Users need AI assistance with visual tasks far more often than they need to generate new images, making analysis capabilities critical for everyday applications.