An introduction to Google's Gemini 3
The world of artificial intelligence is moving at a breakneck pace, with new models and updates being released constantly. Each new iteration promises to be more capable, more intelligent, and more useful than the last. Recently, Google dropped a significant update that has the developer community buzzing: Gemini 3.
The benchmark results alone suggest a monumental leap forward, with Gemini 3 Pro outperforming established rivals like GPT-5.1 and Claude Sonnet 4.5 across a vast array of tasks. But do these numbers translate into real-world performance?
In this article, we will go beyond the benchmarks and put Gemini 3 to the test.
Gemini 3: a new contender in the AI arena
The release of Gemini 3 was met with bold claims, and the initial data provided by Google certainly backs them up. It appears to be a significant step forward, not just a small incremental improvement. Let's take a closer look at what makes this model so special right out of the gate.
Unpacking the benchmark results
Benchmarks are standardized tests designed to measure and compare the performance of different systems. In the context of AI, these benchmarks test everything from academic reasoning and scientific knowledge to complex coding and multimodal understanding. When we look at how Gemini 3 Pro stacks up against its competitors, the results are staggering.
Google's published benchmark table paints a clear picture: Gemini 3 Pro establishes a new state-of-the-art across the board. Here are some key takeaways:
- Academic and Scientific Reasoning: In benchmarks like "Humanity's Last Exam" and "GPQA Diamond," Gemini 3 Pro achieves scores of 45.8% and 91.9% respectively, significantly higher than its predecessors and competitors. This demonstrates a profound ability to understand and reason about complex, knowledge-intensive subjects.
- Visual and Multimodal Understanding: In tasks like "ARC-AGI-2" (visual reasoning puzzles) and "Video-MMMU" (knowledge acquisition from videos), Gemini 3 Pro scores 31.1% and 87.6%. These are massive leaps, especially in the ARC-AGI-2 benchmark, where the next best model, GPT-5.1, scored only 17.6%. This indicates a superior ability to interpret and reason about visual information.
- Coding and Agentic Tasks: For developers, this is where it gets truly exciting. In "LiveCodeBench Pro" (competitive coding problems) and "Terminal-Bench 2.0" (agentic terminal/coding), Gemini 3 Pro sets new records. The only area where it doesn't take the top spot is the "SWE-Bench Verified" benchmark for agentic coding, where it falls short of Anthropic's model by a mere 1%.
This overwhelming dominance across such a diverse set of benchmarks suggests that Gemini 3 is not just better at one thing; it's a more fundamentally capable and versatile model.
The promise of "Deep Think" mode
Adding another layer to its impressive capabilities, Gemini 3 introduces a feature called "Deep Think" mode. While the specifics are still emerging, this mode is designed for tasks that require more profound, complex reasoning. When engaged, it performs even better than the standard Gemini 3 Pro model, as evidenced by its separate, higher score on benchmarks like ARC-AGI-2. This suggests a mode that allows the AI to dedicate more computational resources or a different reasoning pathway to tackle exceptionally difficult problems, pushing the boundaries of what's possible even further.
Practical test 1: building a 3D Minecraft clone
Benchmarks are one thing, but the true test of a model's utility for developers is how it performs on a real project. To evaluate this, we'll task Gemini 3 with a complex, single-prompt challenge: creating a 3D Minecraft clone using Three.js, a popular JavaScript library for creating 3D graphics in a web browser.
Setting up the environment: introducing Google Antigravity
For this test, we are using a specialized, next-generation IDE developed by Google called Antigravity. This AI-native Integrated Development Environment is designed from the ground up to work seamlessly with large language models. It features an "Agent Manager" where you can start conversations, define workspaces, and interact with the AI in a project-based context. The interface shows a clear, turn-by-turn log of the AI's thought process, the files it's editing, and progress updates, providing transparency and control over the development process. For our test, we'll be using the "Gemini 3 Pro (High)" setting within Antigravity to ensure we're getting the best performance.
Crafting the perfect prompt
The quality of an AI's output is directly proportional to the quality of the input prompt. For a complex project like a Minecraft clone, a detailed, well-structured prompt is essential. We didn't just ask for a "Minecraft clone"; we provided a comprehensive list of requirements.
Here is a breakdown of the key sections of our prompt:
- Core Features: This section set the high-level goals. We asked for a procedural tree generation approach, smooth terrain generation, and basic gameplay mechanics like movement, interaction, and terrain manipulation.
- Project Structure: We specified the technical foundation: a single HTML file with a full-page canvas layout, basic 3D controls, and styles for a full-screen 3D experience.
- Textures: We requested that textures be generated programmatically using the Canvas API rather than using image files (a minimal sketch of this approach appears after this list). We asked for unique block patterns for grass, dirt, stone, wood, and leaves. This is a more complex request that tests the AI's ability to generate visual assets with code.
- Terrain Generation: We asked for a multi-octave Simplex noise generator to create varied terrain with rolling hills, plateaus, and valleys. We also specified block placement logic for different layers (grass, dirt, stone).
- Procedural Trees: This section detailed the tree generation system, including trunk block placement, canopy leaf distribution, and custom wood/leaf textures.
- Player Controls & Interactions: We defined the control scheme (WASD for movement, mouse for looking, space for jumping) and interaction mechanics (left-click for breaking blocks, right-click for placing blocks).
- Atmosphere: To add polish, we requested a gradient skybox system, a day/night cycle, and atmospheric depth effects.
- Technology Stack: Finally, we explicitly stated the tech stack: JavaScript, Three.js using ES Modules, and CSS.
This level of detail gives the AI a clear blueprint, minimizing ambiguity and increasing the likelihood of a successful outcome.
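To make the textures request concrete, here is a minimal sketch of how a block texture can be drawn with the Canvas API and handed to Three.js. The function name, colors, and pixel logic are illustrative assumptions, not the code Gemini actually produced:

```javascript
import * as THREE from 'three';

// Minimal sketch of a Canvas-generated block texture (illustrative only —
// the names and pixel values here are not taken from the agent's output).
function createGrassTexture(size = 16) {
  const canvas = document.createElement('canvas');
  canvas.width = canvas.height = size;
  const ctx = canvas.getContext('2d');

  // Base green fill, then randomly darkened pixels for a speckled, pixelated look.
  ctx.fillStyle = '#4caf50';
  ctx.fillRect(0, 0, size, size);
  for (let x = 0; x < size; x++) {
    for (let y = 0; y < size; y++) {
      if (Math.random() < 0.3) {
        ctx.fillStyle = Math.random() < 0.5 ? '#43a047' : '#388e3c';
        ctx.fillRect(x, y, 1, 1);
      }
    }
  }

  // Wrap the canvas in a Three.js texture; NearestFilter keeps the blocky aesthetic.
  const texture = new THREE.CanvasTexture(canvas);
  texture.magFilter = THREE.NearestFilter;
  return texture;
}

// Usage: new THREE.MeshLambertMaterial({ map: createGrassTexture() });
```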
The generation process: step-by-step
Once the prompt was submitted, the Antigravity IDE showed Gemini 3's agent getting to work.
- Planning: The AI first analyzed the prompt and created a high-level task and a detailed implementation plan. This plan covered the architecture, file structure, and core systems.
- Project Setup: The agent initialized the codebase, creating the necessary index.html, style.css, and main.js files.
- Iterative Implementation: The agent then began implementing the features in logical order. It started with procedural texture generation using the Canvas API, then moved on to world generation, implementing the noise generator and chunk system (a sketch of the multi-octave noise approach follows this list). During this process, it created additional JavaScript files like noise.js, world.js, and chunk.js to keep the code organized.
- Continuation: The initial generation was extensive and hit a context limit or a timeout within the IDE. A simple follow-up prompt, "please continue," was all that was needed for the agent to pick up right where it left off and complete the project.
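As a rough idea of what "multi-octave noise" means in practice, here is a minimal sketch of a fractal height function. It assumes a basic noise2D(x, z) function returning values in [-1, 1] (for example, from the noise.js module the agent created, whose internals we haven't reproduced); the constants are illustrative, not Gemini's actual parameters:

```javascript
// Sketch of multi-octave (fractal) noise for terrain height. noise2D(x, z)
// is an assumed 2D Simplex noise function returning values in [-1, 1].
function terrainHeight(x, z, noise2D, octaves = 4) {
  let height = 0;
  let amplitude = 1;
  let frequency = 0.01;
  let maxAmplitude = 0;

  // Each octave adds finer detail at lower amplitude: low frequencies give
  // rolling hills, high frequencies add small bumps.
  for (let i = 0; i < octaves; i++) {
    height += noise2D(x * frequency, z * frequency) * amplitude;
    maxAmplitude += amplitude;
    amplitude *= 0.5; // persistence
    frequency *= 2;   // lacunarity
  }

  // Normalize to [-1, 1], then scale and offset into a block height range.
  return Math.floor((height / maxAmplitude) * 16) + 32;
}
```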
Analyzing the result: a functional Minecraft clone
The final output was remarkably impressive for a project generated almost entirely from a single prompt.
Here's an evaluation of the generated game:
What Worked Well:
- World Generation: The game successfully generated an infinite, procedurally generated world with rolling hills and terrain variation, just as requested.
- Procedural Trees: Trees were scattered across the landscape, complete with distinct trunk and leaf blocks.
- Textures: The block textures were generated using the Canvas API, creating the classic pixelated look without any external image files.
- UI: A basic UI was included at the bottom of the screen, allowing the player to select different block types (Grass, Dirt, Stone, Wood, Leaves).
Areas for Improvement:
- Player Movement: The movement was functional but extremely fast and floaty, with an exaggerated jump that allowed the player to leap across the map. This would need to be fine-tuned.
- Block Interaction: The core mechanics of breaking and placing blocks were not implemented in the initial build. While the UI for selecting blocks was present, the player couldn't interact with the world.
Despite these shortcomings, creating a visually coherent and procedurally generated 3D world from a single prompt is a massive achievement. It's a solid foundation that a developer could easily build upon.
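To illustrate how that foundation could be extended, here is a minimal sketch of the missing block interaction using Three.js's Raycaster. The camera, scene, and the removeBlock/addBlock helpers are assumptions about the generated project, not code Gemini produced:

```javascript
import * as THREE from 'three';

// Sketch of break/place mechanics via raycasting from the screen center.
// `camera`, `scene`, `removeBlock`, and `addBlock` are assumed placeholders.
const raycaster = new THREE.Raycaster();
const center = new THREE.Vector2(0, 0); // crosshair at the center of the screen

window.addEventListener('contextmenu', (e) => e.preventDefault()); // allow right-click placing

window.addEventListener('mousedown', (event) => {
  raycaster.setFromCamera(center, camera);
  const hits = raycaster.intersectObjects(scene.children, true);
  if (hits.length === 0) return;

  const hit = hits[0];
  if (event.button === 0) {
    // Left-click: break the block that was hit.
    removeBlock(hit.object);
  } else if (event.button === 2) {
    // Right-click: place a block in the adjacent cell, offset along the face normal.
    const position = hit.point.clone().add(hit.face.normal.clone().multiplyScalar(0.5));
    addBlock(position.floor());
  }
});
```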
Comparative analysis: Gemini 3 vs. Claude Sonnet 4.5
To put Gemini 3's performance into perspective, the same detailed prompt was given to another powerful model, Anthropic's Claude Sonnet 4.5.
Claude's attempt at Minecraft
Claude also produced a functional Minecraft clone, but with a different set of strengths and weaknesses.
What Worked Well:
- Block Interaction: Unlike the Gemini version, Claude's clone successfully implemented the ability to break blocks with a left-click.
- World Generation: It also created a procedurally generated world with trees.
Areas for Improvement:
- Player Movement: The movement in Claude's version was the opposite of Gemini's: it was incredibly slow and sluggish.
- Visuals: The overall aesthetic was less refined. The colors were more muted, and the world felt less dynamic. The textures and lighting were not as appealing as in the Gemini version.
Head-to-head: which AI built a better game?
Comparing the two, Gemini 3 won the single-prompt test. While Claude implemented a key gameplay feature (block breaking) that Gemini missed, Gemini's output was a far better starting point. The world generation was more sophisticated, the procedural textures were more impressive, and the overall visual fidelity was much higher. The core engine and visual foundation from Gemini were superior, even if it missed one of the interaction features. This suggests Gemini has a better grasp of complex graphical and logical systems like procedural generation.
Gemini 3's superpower: spatial logic and UI design
The Minecraft test highlights one of Gemini 3's standout abilities: understanding spatial logic and implementing 3D applications. This is a skill that many other models struggle with, but Gemini seems to excel at it.
Community showcase: 3D applications and beyond
Since its release, developers on platforms like Twitter have been pushing Gemini 3 to its limits, and the results further confirm its strength in this area. We've seen incredible examples of complex applications built in a single shot:
- A fully functional 3D LEGO editor, complete with a user interface for selecting brick colors and placing them on a baseplate.
- An interactive 3D simulation of a nuclear power plant, allowing users to walk through and learn about the different stages of power generation.
- Numerous other Minecraft-style games, some with far more advanced graphics and features than our initial test.
This consistent success across various 3D tasks indicates a deep, native understanding of spatial reasoning and graphical programming.
Dominating the design arena
Gemini's prowess isn't limited to 3D. It also shows exceptional talent in 2D UI and UX design. A great way to objectively measure this is with Design Arena, a platform where users can submit a design prompt (e.g., "Create a landing page for an education startup") and blindly vote on the results generated by different AI models.
The platform uses an Elo rating system, similar to chess, to rank the models based on user preferences. The leaderboards consistently show Gemini 3 at the very top for website design, with a significantly higher rating than all other models. This means that when users are shown designs side-by-side without knowing which AI made them, they prefer Gemini's output most of the time. It has a better sense of layout, typography, color theory, and modern design principles.
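For readers unfamiliar with Elo, here is a minimal sketch of the standard rating update that such leaderboards are built on; the K-factor and ratings below are illustrative numbers, not Design Arena's actual parameters:

```javascript
// Standard Elo update: each blind vote shifts the winner's rating up and the
// loser's down, by more when the result was unexpected.
function eloUpdate(ratingA, ratingB, scoreA, k = 32) {
  // Expected score for A against B on the usual 400-point logistic scale.
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  // scoreA is 1 if A's design won the vote, 0 if it lost, 0.5 for a tie.
  return ratingA + k * (scoreA - expectedA);
}

// Example: a 1200-rated model beating a 1300-rated one gains about 20 points.
console.log(eloUpdate(1200, 1300, 1));
```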
Practical test 2: designing a streaming platform UI
To verify the Design Arena results, we conducted our own UI generation test. We asked Gemini 3 to design and code a landing page for a fictional streaming platform.
The prompt: building a Netflix-style landing page
The prompt was straightforward: "Design a streaming platform landing page with hero/banners, category rows, floating cards, and mock video thumbnails for an imaginary service."
Gemini 3's UI output: a familiar design
The result was a polished and professional-looking landing page that is immediately recognizable as a Netflix-inspired design.
Key Features:
- Hero Section: A large, compelling hero image for a featured movie ("Interstellar Horizon") with a title, tagline, and "Play" and "More Info" buttons.
- Sticky Header: A navigation bar that sticks to the top of the page as the user scrolls.
- Carousels: Multiple horizontal carousels for different categories like "Trending Now," "New Releases," and "Sci-Fi."
- Hover Effects: When hovering over a movie thumbnail, a card with more information (title, match percentage, runtime) appears, mimicking the Netflix user experience.
- Image Generation: All the placeholder movie posters and hero images were also generated by Gemini 3, demonstrating its multimodal capabilities.
The design is clean, intuitive, and functionally complete. It correctly interpreted the implicit request for a "Netflix clone" and executed it flawlessly.
The Claude comparison: a gradient-heavy alternative
Running the same prompt with Claude Sonnet 4.5 produced a functional but aesthetically different result. The layout was similar, but the design fell into a common AI trap: an overuse of bright, colorful gradients for all the thumbnails. While functional, it lacked the photorealistic and professional feel of Gemini's version. Gemini 3's design choices were more subtle and aligned with modern web design trends, resulting in a more believable and high-quality final product.
Under the hood: benchmarks, pricing, and performance
Beyond our hands-on tests, third-party benchmarks and pricing analysis provide a more complete picture of Gemini 3's standing.
Decoding third-party benchmarks
Platforms like Artificial Analysis provide an aggregated "Intelligence Index" based on performance across numerous evaluations. Their findings corroborate what we've seen:
- Artificial Analysis Intelligence Index: Gemini 3 Pro Preview ranks as the most intelligent model overall, closely followed by GPT-5.1.
- Artificial Analysis Coding Index: Again, Gemini 3 leads the pack in coding-specific benchmarks, reaffirming its strength as a tool for developers.
The ARC-AGI-2 leaderboard: a glimpse into the future?
The ARC-AGI-2 benchmark is particularly noteworthy. It's designed to be a challenging test of abstract reasoning, considered by some to be a measure of progress toward Artificial General Intelligence (AGI).
On this leaderboard, Gemini 3 isn't just slightly better; it's in a class of its own. Its score is miles above the rest of the competition. When using the "Deep Think" variant, it achieves a score of 45%, a huge leap towards solving these complex reasoning puzzles. This exceptional performance suggests a more advanced underlying reasoning capability.
Is it worth the price? Gemini 3 pricing explained
Powerful performance often comes at a high cost, so how does Gemini 3 stack up? The pricing is competitive and positions it as a compelling option.
- Input Tokens: $2.00 per million tokens (for contexts up to 200k) and $4.00 per million tokens (for larger contexts).
- Output Tokens: $12.00 per million tokens (for contexts up to 200k) and $18.00 per million tokens (for larger contexts).
This places it in a middle ground. It's generally cheaper than Claude's top-tier models but slightly more expensive than OpenAI's GPT-5.1. Importantly, it maintains a massive 1 million token context window, making it suitable for applications that need to process large amounts of information.
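As a rough illustration using the sub-200k rates above, a request with 100,000 input tokens and 5,000 output tokens would cost about 0.1 × $2.00 + 0.005 × $12.00 ≈ $0.26.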
Intelligence vs. price: finding the sweet spot
When you plot intelligence against price, you can see where each model offers the best value.
The chart shows that while models like Grok 4 Fast might be cheaper, they are significantly less capable. Gemini 3 Pro sits comfortably as the most intelligent model currently available, with a price point that, while not the absolute cheapest, is very reasonable for its state-of-the-art performance. This makes it a highly attractive option for developers who need maximum capability without the premium cost of models like Claude Opus.
Final thoughts
Gemini 3 is not just another update; it's a statement. Our practical tests confirm that its benchmark dominance translates into real-world excellence. It demonstrates a remarkable ability to handle complex, multi-step coding tasks like building a 3D game from scratch and exhibits a sophisticated eye for modern UI/UX design that surpasses its competitors.
Its standout performance on challenging reasoning benchmarks like ARC-AGI-2 hints at a more profound cognitive architecture, pushing us closer to more generally capable AI systems. While its pricing isn't the absolute lowest on the market, it offers an unparalleled intelligence-to-cost ratio, making it the new go-to model for developers seeking the highest level of performance.
The AI race is far from over, but with the release of Gemini 3, Google has gone from being a contender to being the one to beat. For developers, this is fantastic news. We now have an incredibly powerful new tool at our disposal, ready to help us build more complex, more intelligent, and more beautiful applications than ever before.