Ollama + Open-source LLM Practical Benchmark — From Security Analysis to Blog Writing and Tool Calling
Introduction
What if you could run a coding agent on a local GPU when cloud AI service costs become a burden? We tested three open-source LLMs with practical tasks on the most common mid-range GPU, the RTX 4060 8GB. Using Ollama v0.30.6, we compared Qwen 3.5 4B, Qwen 3.6 35B, and Gemma4 12B under identical conditions.
To cut to the chase — coding assistance is possible. Writing is not. And the boundary between them is surprisingly clear.
1. Test Environment
| Item | Specification |
|---|---|
| GPU | NVIDIA RTX 4060 8GB GDDR6 |
| CPU | Intel i7-14700K (20C/28T) |
| RAM | 32 GB |
| OS | Windows 11 |
| Runtime | Ollama v0.30.6 |
Three test models:
| Model | Parameters | Quantization | VRAM |
|---|---|---|---|
| Qwen 3.5 4B | 4.2B | Q4_K_M (4-bit) | 3,173 MB |
| Qwen 3.6-35B-A3B | 34.7B (Active 3B) | IQ2_M (2-bit) | 6,091 MB |
| Gemma4 12B | 11.9B | Q4_K_M (4-bit) | 5,743 MB |
Although Qwen 3.6 is a 35B model, its MoE (Mixture of Experts) architecture means active parameters are around 3B. To fit it into 8GB VRAM, extreme 2-bit quantization was necessary.
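The arithmetic behind that choice can be sketched roughly. The snippet below is a weights-only estimate that assumes every parameter is stored at the nominal bit width (2 or 4 bits); real GGUF quantization types use slightly more bits per weight, and the KV cache and runtime buffers come on top. The helper name `weight_size_gib` is made up for illustration. Note that an MoE model has to store all of its experts, so memory follows the total parameter count, not the roughly 3B active ones. The measured VRAM column in the table can differ from this estimate, for example if the runtime keeps part of the model in system RAM.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the model weights alone, in GiB (no KV cache, no overhead)."""
    return n_params * bits_per_weight / 8 / 2**30


# Parameter counts are taken from the table above; bit widths are nominal.
for name, params, bits in [
    ("Qwen 3.5 4B @ 4-bit", 4.2e9, 4),
    ("Qwen 3.6-35B-A3B @ 2-bit", 34.7e9, 2),
    ("Qwen 3.6-35B-A3B @ 4-bit", 34.7e9, 4),
    ("Gemma4 12B @ 4-bit", 11.9e9, 4),
]:
    print(f"{name}: ~{weight_size_gib(params, bits):.1f} GiB of weights")
```

Under these assumptions the 35B model needs about twice the card's 8GB at 4-bit, which is why only a 2-bit variant was a candidate at all.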
2. Security Vulnerability Analysis — Coding Works
We intentionally inserted five security vulnerabilities into an Express.js code snippet and asked each model to "find all of them and provide corrected code."
| Model | Found | Time | tok/s | Characteristics |
|---|---|---|---|---|
| Qwen 3.5 | 3.5/5 | 22 sec | 69.5 | Fastest. Mentioned token forgery only in remarks |
| Qwen 3.6 | 4/5 | 33 sec | 55 | Highest quality. Perfect JSON structure |
| Gemma4 | 4/5 | 154 sec | 17.3 | Slowest but provided integrated fix for all |
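For readers who want to reproduce a tok/s figure, here is a minimal sketch. It assumes the throughput is derived from the `eval_count` and `eval_duration` fields that Ollama's generate API returns (the duration is reported in nanoseconds); the article does not state how its numbers were actually measured, and the sample values below are invented to illustrate the calculation.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from Ollama-style response metrics."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical response metrics, not taken from the benchmark runs.
sample = {"eval_count": 1529, "eval_duration": 22_000_000_000}
print(round(tokens_per_second(sample["eval_count"], sample["eval_duration"]), 1))
```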
All three models accurately identified the two SQL injections and the path traversal. Qwen 3.6 and Gemma4 also flagged the token forgery, while Qwen 3.5 mentioned it only in its remarks (hence its 3.5/5). The attack scenarios (' OR '1'='1, ../../etc/passwd) were specific, and the suggested fixes (parameter binding, switching to JWT, path.resolve validation) were production-ready.
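To make two of these fixes concrete, here is a small Python analogue. The benchmark itself used Express.js and its code is not reproduced here, so the table, function names, and directory below are hypothetical. The sketch shows parameter binding with `sqlite3` and a resolve-then-check guard against path traversal.

```python
import sqlite3
from pathlib import Path


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'"
    ).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameter binding: the driver treats `name` strictly as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def resolve_inside(base_dir: str, user_path: str) -> Path:
    # Resolve first, then verify the result is still under the base directory.
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    if not target.is_relative_to(base):  # requires Python 3.9+
        raise ValueError("path escapes the base directory")
    return target


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(len(find_user_safe(conn, payload)))    # no row matches the literal string
```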
However, all three models missed the fifth vulnerability: next() is called even when no token is present, which allows an authentication bypass. Catching this subtle bug requires tracing the entire code flow rather than spotting a local pattern, and it shows the limits of models small enough to fit in 8GB of VRAM.
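The original Express.js middleware is not shown in this article, so the following is only a guess at the general shape of such a bug, written as a Python analogue. `VALID_TOKENS`, `auth_buggy`, and `auth_fixed` are hypothetical names. The point is that the flawed check rejects a wrong token but never handles the case where the token is missing.

```python
from typing import Callable, Optional

VALID_TOKENS = {"secret-token"}  # hypothetical token store


def auth_buggy(token: Optional[str], next_handler: Callable[[], str]) -> str:
    # Only a present-but-invalid token is rejected.
    if token is not None and token not in VALID_TOKENS:
        return "401 Unauthorized"
    # Bug: a request with no token at all falls through to the handler.
    return next_handler()


def auth_fixed(token: Optional[str], next_handler: Callable[[], str]) -> str:
    # Reject missing and invalid tokens alike before continuing.
    if token is None or token not in VALID_TOKENS:
        return "401 Unauthorized"
    return next_handler()


def protected() -> str:
    return "200 OK"


print(auth_buggy(None, protected))
print(auth_fixed(None, protected))
```

Each branch looks reasonable in isolation; the flaw only shows up when the no-token path is followed to the end, which matches the kind of whole-flow reasoning the models failed at.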
3. Thinking Mode — Smart but Overambitious
Qwen 3.5 and 3.6 are reasoning models, so they generate a "thinking" process before every response. This is where the problem lies.
With a 4096-token context, the model spent the entire budget thinking and stopped before producing an actual answer. Raising the context to 8192 tokens resolved this. At 16384 tokens, it accurately solved a short coding problem (optimizing O(n²) to O(n)) in 1.4 seconds, using 77 tokens.
The pattern is simple:
- Given enough "paper" (context) to think on, the model is accurate and fast.
- Without enough "paper," it keeps thinking until the context is exhausted.
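One way to give the model more "paper" is to set the context length per request. The sketch below assumes Ollama's REST API on its default local port and uses the `num_ctx` entry under `options`; the model tag you pass in depends on what is installed locally, and the article does not say whether its runs set the context this way or through another mechanism.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Request body for a non-streaming generation with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Send the request to a locally running Ollama server and return the parsed JSON."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A larger `num_ctx` also enlarges the KV cache, so on an 8GB card the context size and the model size compete for the same memory.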
We tried to disable thinking with the /no_think tag, but it didn't work in the GGUF environment. The --nothink flag in Ollama CLI also returned only an empty response.
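Since thinking could not be switched off in these tests, a caller can at least detect when a response ended mid-thought and retry with a larger context. This is a minimal sketch that assumes the reasoning appears inline between `<think>` and `</think>` tags in the raw output; depending on the runtime version, the reasoning may instead arrive in a separate response field, in which case this check is unnecessary.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple:
    """Return (answer, finished_thinking) for a raw model output."""
    if "<think>" in raw and "</think>" not in raw:
        # The budget ran out before the model closed its reasoning block.
        return "", False
    answer = THINK_BLOCK.sub("", raw).strip()
    return answer, True


print(split_thinking("<think>compare pairs... use a set</think>Use a set lookup."))
print(split_thinking("<think>let me consider every case"))
```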
4. Blog Writing — Here's the Wall
We asked the same models to write a Korean blog post using today's benchmark data. The results were disastrous.
- Gemma4: Responded in Chinese to a Korean prompt. The content was also a general Ollama guide, not based on the test data.
- Qwen 3.6: Wrote in Korean but deviated from the topic into an Express.js code tutorial.
- Qwen 3.5: Also in Chinese. Furthermore, it fell into a loop, repeatedly outputting the same content.
We tried strengthening the prompts: "Must be in Korean only," "Use the data below exactly as provided," "Do not add any information not given." The results improved slightly but were essentially the same. Qwen 3.6's 392-token memo was the best outcome.
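Because a stricter prompt did not reliably fix the output language, an automated check on the result is a useful safety net. The sketch below counts Hangul syllables versus CJK ideographs by Unicode range; the 0.8 threshold is an arbitrary assumption, and the function names are illustrative.

```python
def script_shares(text: str) -> tuple:
    """Return (hangul_share, han_share) among Hangul and Han characters."""
    hangul = sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")  # Hangul syllables
    han = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")     # CJK unified ideographs
    total = hangul + han
    if total == 0:
        return 0.0, 0.0
    return hangul / total, han / total


def looks_korean(text: str, threshold: float = 0.8) -> bool:
    """True when Hangul clearly dominates the East Asian characters in the text."""
    hangul_share, _ = script_shares(text)
    return hangul_share >= threshold


print(looks_korean("로컬 모델로 벤치마크를 실행했습니다"))
print(looks_korean("我们在本地运行了基准测试"))
```

A wrapper can reject and regenerate any draft that fails this check, which is cheaper than having a human notice the wrong language.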
Why did models that scored 4/5 in coding completely fail at blog writing? There are two main reasons:
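- Quantization likely degraded language distinction. When the probabilities of Korean and Chinese tokens become similar, the model confuses the two. Note that the two models that answered in Chinese (Gemma4 and Qwen 3.5) ran at 4-bit, so 2-bit quantization alone does not explain the confusion.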
- "Problems with a correct answer" and "creative writing" are different capabilities. Code analysis is closer to pattern matching, but reading given data and structuring it into a coherent article requires much higher instruction adherence.
5. Tool Calling — Only Gemma4 Works
Checking the capabilities field of Ollama models reveals:
- Gemma4 12B: completion, tools, thinking, vision
- Qwen 3.5/3.6: completion only
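This check can be scripted. The sketch below assumes the `capabilities` list returned by Ollama's `/api/show` endpoint on the default local port; the helper names are illustrative, and the sample dictionaries simply mirror the values listed above.

```python
import json
import urllib.request


def supports_tools(show_response: dict) -> bool:
    """True when the model metadata advertises tool calling."""
    return "tools" in show_response.get("capabilities", [])


def fetch_model_info(model: str, host: str = "http://localhost:11434") -> dict:
    """Query a locally running Ollama server for model metadata."""
    body = json.dumps({"model": model}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/show", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Sample metadata mirroring the capabilities reported in this section.
gemma = {"capabilities": ["completion", "tools", "thinking", "vision"]}
qwen = {"capabilities": ["completion"]}
print(supports_tools(gemma), supports_tools(qwen))
```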
When asked to "write and execute a Python script to validate YAML," Gemma4 generated a 58-line script in 96 seconds and successfully performed a run_python tool_call. It automatically generated four test fixtures and validated parsing with yaml.safe_load, demonstrating a high level of completeness.
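The host side of such a tool call can be sketched as follows. The tool name `run_python` comes from the test above, but its schema is not given in the article, so the single `code` parameter is an assumption. The definition follows the function-style tool format that Ollama's chat API accepts, and the dispatcher expects the `function.name` and `function.arguments` layout of a returned tool call.

```python
import subprocess
import sys

# Tool definition to pass in the `tools` list of a chat request (assumed schema).
RUN_PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a Python script and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string", "description": "Script source"}},
            "required": ["code"],
        },
    },
}


def run_python(code: str, timeout: float = 30.0) -> str:
    """Run the script in a child interpreter; return stdout, or stderr on failure."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout if result.returncode == 0 else result.stderr


def dispatch(tool_call: dict) -> str:
    """Route one model-issued tool call to the matching local function."""
    fn = tool_call["function"]
    if fn["name"] != "run_python":
        raise ValueError(f"unknown tool: {fn['name']}")
    return run_python(fn["arguments"]["code"])


call = {"function": {"name": "run_python", "arguments": {"code": "print(1 + 1)"}}}
print(dispatch(call).strip())
```

Executing model-written code directly on the host is risky; a real setup would want a sandbox or container around `run_python`.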
However, this succeeded only in a single turn. In a multi-turn agent loop (execute, hit an error, fix, re-execute), every turn adds to the conversation history, which would quickly strain the 8192-token context and make stable operation difficult.
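A back-of-the-envelope model shows how quickly the history fills the window. The per-turn and system-prompt token counts below are invented for illustration, not measured in this benchmark; the function simply accumulates a fixed cost per turn until the limit is exceeded.

```python
from typing import Optional


def turns_until_overflow(
    context_limit: int, system_tokens: int, tokens_per_turn: int, max_turns: int = 100
) -> Optional[int]:
    """Return the first turn whose cumulative history exceeds the context limit."""
    used = system_tokens
    for turn in range(1, max_turns + 1):
        used += tokens_per_turn  # script + tool output + thinking for this turn
        if used > context_limit:
            return turn
    return None


# Hypothetical costs: 500-token system prompt, 2000 tokens per agent turn.
print(turns_until_overflow(8192, 500, 2000))
print(turns_until_overflow(16384, 500, 2000))
```

Under these assumed numbers the loop overflows on the fourth turn at 8192 tokens, so an agent would need to summarize or drop old turns well before that.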
6. Conclusion — Intern, Yes; Senior, No
| Area | Feasibility | Notes |
|---|---|---|
| Single function bug fix | ✅ | 1.4 to 22 sec |
| Code review (single file) | ✅ | 4/5 accuracy |
| YAML/JSON validation | ✅ | 15 to 25 sec |
| Tool calling (Gemma4) | ✅ | Single turn only |
| Korean blog writing | ❌ | Language confusion + topic deviation |
| Multifile debugging | ❌ | Insufficient context |
| Complex rule adherence | ❌ | Instruction adherence limitations |
"One file, one function, clear instructions" — within this scope, local LLMs can be used as a zero-cost coding intern. If a higher-level agent (Opus, Sonnet) breaks down tasks and reviews the results, practical deployment is possible.
However, the current limitation is not GPU speed but the quantization that limited VRAM forces. Squeezing a 35B model into 8GB of VRAM means cutting it down to 2-bit, which leads to repetition loops, language confusion, and weak free-form writing. If the same model could run at 4-bit on a 24GB GPU, even the blog writing that failed today might become possible.
The RTX 4060 was sufficient for a proof of concept. The day's results put numbers on a familiar situation: the brain is there, but the body cannot keep up.
