Comparing Qwen 3.5 and Claude Sonnet 4.5 for Coding Tasks

Stanley Ulili
Updated on March 2, 2026

Alibaba has introduced the Qwen 3.5 Medium Model Series, an open-source release that claims benchmark performance on par with, or even exceeding, top-tier models like Claude Sonnet 4.5. As AI coding assistants rapidly improve, the idea of running a powerful model locally on your own hardware is incredibly appealing. Qwen 3.5 promises that you can run an advanced coding assistant on a modern MacBook Pro, giving you free, private, and unrestricted access without relying on external APIs.

But benchmark numbers do not always reflect real-world results.

In this article, you’ll see a direct head-to-head comparison between Qwen 3.5 and Claude Sonnet 4.5. Both models are tested through increasingly complex coding challenges, with close evaluation of their code quality, debugging ability, reasoning process, and overall reliability. By the end, you’ll have a clear understanding of how they truly compare and whether the open-source promise matches practical performance.

Setting the stage: the contenders and our testing environment

Before diving into the coding challenges, it's worth understanding the models being tested and the environment in which they're evaluated, so that the comparison is fair and transparent.

Understanding the models

The two models at the center of this showdown represent two different philosophies in the AI world: the open-source challenger and the established, closed-source incumbent.

Qwen 3.5: the open-source challenger

Developed by Alibaba Cloud, the Qwen series of models has quickly gained a reputation for its impressive capabilities. The Qwen 3.5 Medium series, specifically the 35-billion parameter model (Qwen3.5-35B-A3B), is built on a hybrid architecture that likely incorporates a Mixture-of-Experts (MoE) design. This allows the model to have a large number of total parameters (35 billion) while only activating a fraction of them during inference (around 3 billion). This is the key to its "more intelligence, less compute" philosophy, making it feasible to run on local machines with sufficient memory.
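The "more intelligence, less compute" idea can be illustrated with a toy top-k gating function: each token's gate scores pick a small subset of experts, and only those experts' parameters run. This is a conceptual sketch of MoE routing in general, not Qwen's actual architecture; the expert count and scores below are made up.

```javascript
// Conceptual sketch of Mixture-of-Experts (MoE) routing: only the top-k
// experts (by gate score) are activated per token, so the active parameter
// count stays far below the total. Illustrative only, not Qwen's real code.

function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the top-k experts for a token and renormalize their weights;
// only these experts' parameters would be used during inference.
function routeTopK(gateLogits, k) {
  const probs = softmax(gateLogits);
  const ranked = probs
    .map((p, i) => ({ expert: i, weight: p }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, k);
  const total = ranked.reduce((acc, r) => acc + r.weight, 0);
  return ranked.map((r) => ({ expert: r.expert, weight: r.weight / total }));
}

// Example: 8 experts, route each token to the top 2.
const routing = routeTopK([1.2, 0.1, 3.4, 0.5, 2.8, 0.0, 1.1, 0.9], 2);
console.log(routing.map((r) => r.expert)); // the two highest-scoring experts
```

With 8 experts and k = 2, only a quarter of the expert parameters run per token, which is the same proportional trick that lets a 35B-total model activate only around 3B parameters.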

What truly turned heads were the benchmark charts released by Alibaba. These charts showed Qwen 3.5 outperforming or matching Claude Sonnet 4.5 and even GPT-4 in various tasks, including document understanding and video reasoning. This bold claim is what we are here to investigate in the context of coding.

A benchmark chart comparing the performance of various Qwen 3.5 models against competitors like Claude Sonnet 4.5 and GPT-5 mini across different tasks.

Claude Sonnet 4.5: the established incumbent

From the AI safety and research company Anthropic, Claude Sonnet 4.5 is positioned as the ideal balance between high intelligence and speed, making it a workhorse for many enterprise and developer use cases. Unlike Qwen, Sonnet 4.5 is a closed-source model accessible only via an API.

It has earned a stellar reputation for its long context window, strong reasoning abilities, and sophisticated coding skills. It forms the backbone of tools like Claude Code, a terminal-based coding assistant that provides a premium, integrated development experience. While it isn't free and requires an internet connection, its reliability and quality of output have made it a favorite among many developers.

Our testing methodology

To create a level playing field, the testing environment has been carefully configured. The goal is to evaluate the raw coding capability of each model on its own merits.

The tools of the trade

Initially, the plan was to run Qwen 3.5 locally. However, even with its efficient architecture, the model requires a significant amount of unified memory (RAM). The test machine, a 13-inch M1 MacBook Pro with 16 GB of memory, proved insufficient for smooth and proper inference.

Therefore, for this test, the Qwen 3.5 (35B) model is accessed via OpenRouter, a platform that provides unified access to a wide range of AI models. It is connected to OpenCode, an open-source AI coding agent. Any special "skills" or augmentations within OpenCode are disabled for a fair test of the base model's abilities.

For the other contender, Claude Sonnet 4.5 will run within Claude Code, its native environment. To ensure a fair comparison, it runs in "clean mode." This means the model will not have access to any custom skills, plugins, or specialized tools (MCP tools). This isolates the model's performance to its inherent capabilities, just as with Qwen.

The three-tiered challenge

Both models will face three distinct coding tasks, each designed to test different aspects of their programming prowess:

The Easy Task: Build a complete to-do list application from scratch using React and Vite. This tests fundamental app structure, state management, basic styling, and component creation.

The Medium Task: Create an interactive 3D solar system explorer using React, Vite, and the Three.js library. This challenge assesses the model's ability to work with complex external libraries, handle 3D graphics, and implement interactive controls.

The Hard Task: Modify an existing, unfamiliar open-source codebase. The models must add a new feature (a tweet screenshot generator) to a Twitter/X video downloader application. This is the ultimate test of code comprehension, refactoring, and debugging in a real-world scenario.

The easy challenge: building a to-do list app

The first challenge is a classic developer task: create a fully functional to-do list. The prompt given to both models was to build the application from the ground up using the React framework and the Vite build tool.

Sonnet 4.5's approach: clean and functional

Sonnet 4.5 tackled the task with impressive competence. In a single go, it produced a complete and polished application.

The app featured a clean, dark-themed UI with a pleasant "AI purple" accent color. It included all the essential functionalities: an input field to add new tasks, the ability to mark tasks as complete, and buttons to filter between all, active, and completed todos. It even included a "Clear completed" button.

A key feature was its use of the browser's localStorage. This meant that any to-dos added to the list would persist even after the page was refreshed, providing a better user experience. A look at the code revealed a well-structured React component using standard useState and useEffect hooks. The code was clean, readable, and followed best practices, resulting in a single App.jsx file that was easy to understand.
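The persistence pattern described here, useState holding the list and useEffect writing it back, boils down to a load/save pair around localStorage. A minimal sketch follows; the storage key, data shape, and helper names are illustrative, not Sonnet's actual code.

```javascript
// Sketch of the localStorage persistence pattern: load on startup,
// save on every change. Key and data shape are illustrative.
const STORAGE_KEY = "todos";

// Load todos, falling back to an empty list if nothing is stored
// or the stored JSON is corrupt.
function loadTodos(storage) {
  try {
    const raw = storage.getItem(STORAGE_KEY);
    return raw ? JSON.parse(raw) : [];
  } catch {
    return [];
  }
}

function saveTodos(storage, todos) {
  storage.setItem(STORAGE_KEY, JSON.stringify(todos));
}

// In a React component, this pairs with the hooks like so:
//
//   const [todos, setTodos] = useState(() => loadTodos(localStorage));
//   useEffect(() => saveTodos(localStorage, todos), [todos]);

// Demo with an in-memory stand-in for localStorage:
const fakeStorage = (() => {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => data.set(k, String(v)),
  };
})();

saveTodos(fakeStorage, [{ id: 1, text: "Write tests", done: false }]);
console.log(loadTodos(fakeStorage)); // the saved todo round-trips
```

Passing the storage object in as a parameter, rather than touching the global `localStorage` directly, also makes the pattern trivially testable outside a browser.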

The finished, functional to-do list application created by Claude Sonnet 4.5, showcasing its clean dark theme and purple accents.

Qwen 3.5's performance: a tale of two attempts

Qwen's performance on this task was more complicated and revealing.

The first attempt: a surprising overachievement

Initially, Qwen 3.5 appeared to blow Sonnet out of the water. It produced a to-do list that was far more feature-rich. It included options to assign categories (Personal, Work, Shopping, etc.), set priority levels (Low, Medium, High), and even add a due date using a date picker.

From a code architecture perspective, it also made the intelligent decision to abstract the TodoItem into its own separate component, a practice that is highly encouraged for better maintainability and scalability. This result was, at first glance, a stunning victory.

However, this impressive output was due to a "superpowers skill" being enabled by default in the OpenCode environment. This skill is an external tool that augments the base model's capabilities, essentially giving it a significant and unfair advantage.

The second attempt: the unassisted reality

To conduct a fair test, the challenge was run again with all skills disabled. The result was dramatically different. The unassisted Qwen 3.5 produced a very basic and visually broken to-do list. The styling was minimal, and the functionality was a significant step down from both Sonnet's version and its own "assisted" version.

Verdict for round 1: Sonnet takes the lead

While Qwen's first result was impressive, it was artificially inflated by an external tool. When tested on its own merits, it failed to produce a functional and polished application. Claude Sonnet 4.5 is the clear winner of the easy challenge, delivering a high-quality, fully featured, and persistent to-do list app without any special assistance.

The medium challenge: an interactive solar system

For the second round, complexity increased significantly. The models were tasked with building an interactive 3D solar system explorer using React, Vite, and the popular 3D graphics library, Three.js.

Sonnet 4.5's masterpiece: a journey through space

Sonnet 4.5 once again demonstrated its superior capabilities. It successfully generated a visually stunning and fully interactive 3D application.

The application rendered the sun and several planets orbiting it. The user experience was flawless; you could use the mouse to pan the camera, rotate the view around the sun, and scroll to zoom in and out. Clicking on a planet or the sun would focus the camera on it and display a sidebar with relevant information, such as its radius, orbital period, and average temperature.
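The orbital motion described above typically reduces to simple circular-orbit math inside the render loop: each frame, every planet's angle advances by its angular speed, and its mesh is repositioned on a circle around the sun. A minimal sketch of that math (radii and speeds here are made up; the real app's data will differ):

```javascript
// Sketch of circular-orbit math a Three.js animation loop might use to
// place each planet. Radii and angular speeds are illustrative only.
function orbitalPosition(radius, angularSpeed, elapsedSeconds) {
  const angle = angularSpeed * elapsedSeconds;
  return {
    x: radius * Math.cos(angle),
    y: 0, // keep orbits in a flat ecliptic plane
    z: radius * Math.sin(angle),
  };
}

// In the render loop, each planet's Three.js mesh would be updated with:
//   planet.mesh.position.set(pos.x, pos.y, pos.z);
const planets = [
  { name: "Mercury", radius: 6, angularSpeed: 1.6 },
  { name: "Earth", radius: 12, angularSpeed: 1.0 },
];

const elapsed = Math.PI; // some elapsed time in seconds
for (const p of planets) {
  const pos = orbitalPosition(p.radius, p.angularSpeed, elapsed);
  console.log(p.name, pos.x.toFixed(2), pos.z.toFixed(2));
}
```

Camera panning, zooming, and click-to-focus are usually layered on top of this with Three.js's OrbitControls and raycasting, which is likely how the sidebar interaction was wired up.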

The only small issue was that it didn't include all the planets of the solar system. However, given the complexity of the task, this was a minor omission in an otherwise masterful execution.

The stunning and interactive 3D solar system created by Claude Sonnet 4.5, showing the sun, orbiting planets, and an information sidebar.

Qwen 3.5's crash landing: a blank canvas of errors

Qwen 3.5's attempt at this medium-difficulty task was a complete failure.

The model produced a project that, when run, displayed nothing but a blank page. An inspection of the browser's developer console revealed a critical error that prevented the application from rendering. Despite being provided with the error message multiple times, Qwen was unable to diagnose or fix its own mistake. The entire development process was cumbersome; the model would frequently "go to sleep" and had to be re-prompted to continue, and it struggled to maintain context.

Furthermore, the project structure it created was chaotic. It generated a redundant, unused node_modules directory and package.json file in the root directory, while the actual, working project was nested inside a subdirectory. This indicates a lack of coherent planning and execution.

A view of the messy and redundant file structure created by Qwen 3.5 for the failed solar system project.

Verdict for round 2: a decisive win for Sonnet

This round wasn't even close. Claude Sonnet 4.5 wins overwhelmingly. It created a complex, functional, and impressive 3D application, while Qwen 3.5 failed to produce anything that worked, struggled with the process, and created a messy project structure.

The hard challenge: modifying existing code

The final and most difficult test measures a skill crucial for professional developers: understanding, modifying, and debugging an existing, unfamiliar codebase. The models were tasked with adding a new screenshot feature to x-dl, an open-source Twitter/X video downloader.

Sonnet 4.5 as a collaborator: seamless integration

Sonnet 4.5 performed like a seasoned software engineer. It correctly understood the existing codebase and successfully added the requested functionality.

It created a new /screen route and page within the application. This new page featured an input for a tweet URL and various customization options, such as changing the screenshot's background color and padding.

The initial implementation had a bug (it would time out when trying to capture the screenshot). When presented with the error log, Sonnet not only understood the problem but also implemented a robust fix. It correctly identified the issue and resolved it, showcasing a deep understanding of asynchronous operations and web page rendering.

The result was a perfectly working feature that could take any tweet URL and generate a clean, downloadable PNG image of that tweet.

The final, working tweet screenshot feature built by Claude Sonnet 4.5, showing a captured tweet ready for download.

Qwen 3.5's struggle: a tale of timeouts and errors

Qwen 3.5 struggled immensely with this complex task, ultimately failing to deliver a working solution.

It managed to create a new /screen page, but with minor UI flaws from the start, such as incorrect button text ("Extract Video" instead of "Capture"). It encountered the same timeout error that Sonnet initially did. However, Qwen's attempt to fix it was superficial. Instead of addressing the root cause, it simply increased the timeout duration from 30 seconds to 60 seconds. This is a common but ineffective fix that doesn't solve the underlying problem. As a result, the feature remained broken and would still time out.
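Qwen's "just raise the timeout" fix illustrates a general anti-pattern: if the awaited operation never actually completes, a longer timeout only delays the same failure. A generic sketch of the dynamic (not code from x-dl):

```javascript
// Why bumping a timeout doesn't fix the root cause: if the awaited work
// never completes, a bigger timeout just fails more slowly.
// Generic illustration, not the x-dl codebase.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// A capture that hangs forever (e.g. waiting for a render signal that
// never fires) fails at 30s, 60s, or any value you pick:
const neverResolves = new Promise(() => {});

withTimeout(neverResolves, 100, "screenshot capture").catch((err) =>
  console.log(err.message)
);

// The real fix is to make the awaited condition actually complete, e.g.
// waiting for a concrete, observable render event before capturing,
// rather than enlarging the deadline around a hung operation.
```

This is presumably the distinction Sonnet grasped and Qwen missed: the former changed what was being awaited, while the latter only changed how long to wait for it.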

Verdict for round 3: Sonnet completes the clean sweep

Claude Sonnet 4.5 wins the final round. Its ability to comprehend, modify, and, most importantly, effectively debug an existing codebase was far superior to Qwen's. This test highlights a critical gap in practical, real-world coding ability between the two models.

Final analysis: why benchmarks aren't the whole story

Across three distinct challenges of varying difficulty, Claude Sonnet 4.5 was the clear and undisputed winner. So, why do the benchmarks for Qwen 3.5 tell such a different story?

The benchmark illusion

The discrepancy between benchmark scores and real-world performance likely comes down to benchmark contamination or overfitting to the test. There's a high probability that the Qwen 3.5 model was specifically post-trained on the datasets used in popular benchmarks like SWE-bench. This is akin to a student memorizing the answers for a specific exam. They can ace that one test, but they lack the fundamental understanding to solve new, unseen problems.

In contrast, models like Sonnet 4.5 are trained on a much broader and more diverse dataset. This fosters a more generalized and robust reasoning ability, allowing it to excel at novel tasks that it has never explicitly seen before, which is exactly what real-world software development entails.

The full picture: parameters and architecture

It's also important to consider the underlying models. While Qwen 3.5 has 35 billion parameters, its MoE architecture means only a small fraction are used for any given task. While estimates for Anthropic's models are not public, it is widely believed that Sonnet 4.5 has a significantly larger parameter count (potentially over 70B for the underlying model) and a denser architecture. This greater computational depth and broader training regimen contribute directly to its superior performance in complex, nuanced tasks like coding and debugging.

Final thoughts

After three rigorous coding rounds, Claude Sonnet 4.5 stands out as the clear winner. Across simple and complex tasks alike, it delivered stronger coding performance, sharper debugging, and a more consistent development experience. The claim that Qwen 3.5 matches Sonnet 4.5 in real-world coding scenarios does not hold up under practical testing.

This comparison highlights an important lesson: benchmarks are not the full picture. They can provide useful signals, but they do not replace hands-on evaluation. Models optimized for leaderboard performance do not always translate into reliable day-to-day tools.

That said, progress in open-source models like Qwen 3.5 remains impressive. Running a model of this scale locally is a significant achievement that advances the ecosystem as a whole. It may not dethrone Sonnet 4.5, but it is still a capable and valuable tool. The takeaway is simple: use the tools available to you, but always validate their performance against the real demands of your projects.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.