Running a Local AI Coding Agent with an RTX 4060 8GB — Ollama + Open-source LLM Practical Benchmark

Ollama + Open-source LLM Practical Benchmark — From Security Analysis to Blog Writing and Tool Calling

Introduction

What if you could run a coding agent on a local GPU when cloud AI service costs become a burden? We tested three open-source LLMs with practical tasks on the most common mid-range GPU, the RTX 4060 8GB. Using Ollama v0.30.6, we compared Qwen 3.5 4B, Qwen 3.6 35B, and Gemma4 12B under identical conditions.

To cut to the chase — coding assistance is possible. Writing is not. And the boundary between them is surprisingly clear.

1. Test Environment

Item	Specification
GPU	NVIDIA RTX 4060 8GB GDDR6
CPU	Intel i7-14700K (20C/28T)
RAM	32 GB
OS	Windows 11
Runtime	Ollama v0.30.6

Three test models:

Model	Parameters	Quantization	VRAM
Qwen 3.5 4B	4.2B	Q4_K_M (4-bit)	3,173 MB
Qwen 3.6-35B-A3B	34.7B (Active 3B)	IQ2_M (2-bit)	6,091 MB
Gemma4 12B	11.9B	Q4_K_M (4-bit)	5,743 MB

Although Qwen 3.6 is a 35B model, its MoE (Mixture of Experts) architecture means active parameters are around 3B. To fit it into 8GB VRAM, extreme 2-bit quantization was necessary.

2. Security Vulnerability Analysis — Coding Works

We intentionally inserted five security vulnerabilities into an Express.js code snippet and asked each model to "find all of them and provide corrected code."

Model	Found	Time	tok/s	Characteristics
Qwen 3.5	3.5/5	22 sec	69.5	Fastest. Mentioned token forgery only in remarks
Qwen 3.6	4/5	33 sec	55	Highest quality. Perfect JSON structure
Gemma4	4/5	154 sec	17.3	Slowest but provided integrated fix for all

Two SQL Injections, Path Traversal, and token forgery were accurately identified by all three models. The attack scenarios (' OR '1'='1, ../../etc/passwd) were specific, and the correction codes (parameter binding, JWT conversion, path.resolve validation) were production-ready.

However, all three models missed the fifth vulnerability: "next() is called without a token, allowing authentication bypass." This is a subtle bug that requires tracing the entire code flow, demonstrating the limitations of models 12B and below.

3. Thinking Mode — Smart but Overambitious

Qwen 3.5 and 3.6 are reasoning models, so they generate a "thinking" process before every response. This is where the problem lies.

With a context of 4096 tokens, the model would only think and then stop without outputting the actual answer. This was resolved by increasing it to 8192. At 16384 tokens, it accurately solved a short coding problem (optimizing O(n²) to O(n)) in 1.4 seconds with 77 tokens.

The core formula is this:

If given enough "paper" (context) for thinking, it's accurate and fast.
If there isn't enough "paper," it just thinks until time runs out.

We tried to disable thinking with the /no_think tag, but it didn't work in the GGUF environment. The --nothink flag in Ollama CLI also returned only an empty response.

4. Blog Writing — Here's the Wall

We asked the same models to write a Korean blog post using today's benchmark data. The results were disastrous.

Gemma4: Responded in Chinese to a Korean prompt. The content was also a general Ollama guide, not based on the test data.
Qwen 3.6: Wrote in Korean but deviated from the topic into an Express.js code tutorial.
Qwen 3.5: Also in Chinese. Furthermore, it fell into a loop, repeatedly outputting the same content.

We tried strengthening the prompts: "Must be in Korean only," "Use the data below exactly as provided," "Do not add any information not given." The results improved slightly but were essentially the same. Qwen 3.6's 392-token memo was the best outcome.

Why did models that scored 4/5 in coding completely fail at blog writing? There are two main reasons:

2-bit quantization degraded language distinction. The probability of Korean and Chinese tokens became similar, leading to confusion.
"Problems with a correct answer" and "creative writing" are different capabilities. Code analysis is closer to pattern matching, but reading given data and structuring it into a coherent article requires much higher instruction adherence.

5. Tool Calling — Only Gemma4 Works

Checking the capabilities field of Ollama models reveals:

Gemma4 12B: completion, tools, thinking, vision
Qwen 3.5/3.6: completion only

When asked to "write and execute a Python script to validate YAML," Gemma4 generated a 58-line script in 96 seconds and successfully performed a run_python tool_call. It automatically generated four test fixtures and validated parsing with yaml.safe_load, demonstrating a high level of completeness.

However, this was successful only in a single turn. A multi-turn agent loop involving execution → error → correction → re-execution would put pressure on the 8192-token context due to cumulative context, making stable operation difficult.

6. Conclusion — Intern, Yes; Senior, No

Area	Feasibility	Notes
Single function bug fix	✅	1.4~22 sec
Code review (single file)	✅	4/5 accuracy
YAML/JSON validation	✅	15~25 sec
Tool calling (Gemma4)	✅	Single turn only
Korean blog writing	❌	Language confusion + topic deviation
Multifile debugging	❌	Insufficient context
Complex rule adherence	❌	Instruction adherence limitations

"One file, one function, clear instructions" — within this scope, local LLMs can be used as a zero-cost coding intern. If a higher-level agent (Opus, Sonnet) breaks down tasks and reviews the results, practical deployment is possible.

However, the current limitation is not the GPU but quantization. To squeeze a 35B model into 8GB VRAM, it must be cut down to 2-bit, which leads to repetitive collapse, language confusion, and inability to create. If the same model could run in 4-bit on a 24GB GPU, even the blog writing that failed today might become possible.

The RTX 4060 was sufficient for validating the PoC. It was a day that quantitatively proved the situation of having the brain but not the body to match.