Claude Harness: How Far Can It Go? — It's About the Skeleton, Not Just the Model

New models are released every month. One came out yesterday. Benchmarks are rising, and they're said to handle longer, more complex tasks better. That's welcome news. But after working with such a model for a few days, the question that lingers in my mind is surprisingly different. It's not "How smart is this model?" but rather, "How much can I trust and rely on it?"

These two questions are entirely different, and the answer to the second isn't found within the model itself. It lies in the structure that wraps the model, usually called a harness or an orchestration framework. Depending on the harness around it, the same model can be a runaway toy or a worker you can rely on. What looks plausible in a demo is a long way from what runs reliably in production.

This is why competition among the big tech players is shifting from model parameters toward harness design. The model is a component; the harness is the skeleton. Components are replaced monthly, but a well-built skeleton lasts a long time. So, instead of chasing new models, we pushed a skeleton to its limits. Here are the boundaries we found.

First Boundary — Web and Local are Asymmetrical

The first thing we encounter is space. The latest features almost always become available in the cloud first. New workflows, new automations, new connections all originate there. Yet, what we truly want to control — our files, our servers, our environment — resides locally.

Here's the crucial fact: this wall is solid in only one direction. Cloud agents cannot arbitrarily reach into my local environment, which is as it should be for security reasons. Local agents, by contrast, can call cloud APIs as much as they want. This asymmetry answers the design question. If you put the center of gravity in the cloud, you hit the wall. If you put it locally and treat new cloud technologies as "callable components," the wall disappears. The structure is not "the cloud directs my work" but "the cloud sends signals, and the local environment executes." You may have to forgo flashy real-time collaboration demos, but in return you gain control. In most practical scenarios, that trade is a net gain.
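As a minimal sketch of "the cloud sends signals, and the local environment executes," the snippet below treats anything arriving from the cloud as an untrusted request and lets the local side decide what actually runs. The action names (rebuild_index, rotate_logs) and the JSON signal shape are assumptions made for illustration, not part of any real product.

```python
import json
from typing import Callable, Dict

# Hypothetical local actions. Only what is listed here can ever run locally.
def rebuild_index(args: dict) -> str:
    return f"rebuilt index for {args.get('path', '.')}"

def rotate_logs(args: dict) -> str:
    return f"rotated logs, keeping {args.get('keep', 7)}"

LOCAL_ACTIONS: Dict[str, Callable[[dict], str]] = {
    "rebuild_index": rebuild_index,
    "rotate_logs": rotate_logs,
}

def handle_signal(raw_signal: str) -> str:
    """Execute a cloud-originated signal only if it names an allowlisted action."""
    try:
        signal = json.loads(raw_signal)
    except json.JSONDecodeError:
        return "rejected: not valid JSON"
    if not isinstance(signal, dict):
        return "rejected: signal must be a JSON object"
    name = signal.get("action")
    if not isinstance(name, str) or name not in LOCAL_ACTIONS:
        return f"rejected: unknown action {name!r}"
    args = signal.get("args", {})
    if not isinstance(args, dict):
        return "rejected: args must be a JSON object"
    # The local side owns execution; the cloud only chose among local options.
    return LOCAL_ACTIONS[name](args)
```

The cloud never receives a shell or a file handle in this arrangement. It can only name one of the actions the local side has chosen to expose.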

Second Boundary — Don't Entrust Everything to the Model

When you first build a harness, you get ambitious. With a smart model, you want to entrust every step to it. But this is expensive, slow, and most importantly, unstable.

The principle is simple: pin down patterned tasks in code, and assign only the tasks that require judgment to the model. We jokingly call this a "vending machine": deterministic wiring that produces a fixed output for a given input, with no judgment involved. Tasks like format conversion, schema validation, and channel branching are always the same. If you ask the model to do them, it burns tokens and occasionally does something unexpected. If you hardcode them, they run for free, indefinitely, and consistently.
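A small sketch of such a vending machine, covering the three examples named above (schema validation, channel branching, format conversion). The field names, severity levels, and channel names are hypothetical; the point is only that no model is consulted anywhere in this path.

```python
import json
from typing import Tuple

# Hypothetical schema and routing table, fixed in code.
REQUIRED_FIELDS = {"title": str, "severity": str, "body": str}
CHANNEL_BY_SEVERITY = {"critical": "pager", "warning": "chat", "info": "digest"}

def validate(record: object) -> dict:
    """Schema validation: fail loudly instead of guessing."""
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} must be {expected.__name__}")
    return record

def pick_channel(record: dict) -> str:
    """Channel branching: unknown severities fall back to the digest."""
    return CHANNEL_BY_SEVERITY.get(record["severity"], "digest")

def to_message(record: dict) -> str:
    """Format conversion: same input, same text, every time."""
    return f"[{record['severity'].upper()}] {record['title']}\n{record['body']}"

def vending_machine(raw_json: str) -> Tuple[str, str]:
    record = validate(json.loads(raw_json))
    return pick_channel(record), to_message(record)
```

Given the same JSON string, this always returns the same (channel, message) pair, which is the property a model call cannot guarantee.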

Interestingly, this applies even to tool calls. If "this tool for this step" is predetermined, it is much more robust for the code to call the tool directly than for the model to choose it. The model doesn't even need to know whether the tool exists. It simply outputs text, and the code takes that output and executes the designated tool. The uncertainty about which tool runs drops to zero.
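The following sketch shows one way this could look, assuming a step that always ends by saving a note. fake_model is a stand-in for a real model call and save_note is a hypothetical tool; neither comes from an actual SDK.

```python
from typing import Callable, List

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call. It returns plain text and nothing else."""
    return "Summary: " + prompt.splitlines()[-1][:60]

def save_note(text: str, store: List[str]) -> int:
    """Hypothetical tool: append a note and return its index."""
    store.append(text)
    return len(store) - 1

def summarize_and_save(document: str, model: Callable[[str], str], store: List[str]) -> int:
    # The model is asked only for text. It is given no tool list at all.
    summary = model("Summarize the following in one sentence:\n" + document)
    # The tool for this step is fixed here in code, not chosen by the model.
    return save_note(summary.strip(), store)
```

Because the tool choice lives in summarize_and_save, swapping the model changes the wording of the summary but never which tool gets called.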

This isn't about eliminating judgment; it's about elevating judgment. A vending machine is a spinal reflex, not a brain. What to do is decided higher up, and the vending machine merely executes. And the smaller the vending machine, the more robust it is. A single giant vending machine is prone to breaking, but a structure linking multiple small vending machines like LEGO bricks rarely collapses. The more you minimize expensive judgment and maximize inexpensive execution, the faster, cheaper, and more stable your harness becomes.
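To illustrate linking small vending machines like LEGO bricks, here is a sketch in which each brick is a tiny named function and the chain reports exactly which brick failed. The stage names and the clamp range are made up for the example.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]

class StageError(Exception):
    """Raised when one brick in the chain fails; the message names the brick."""

def run_chain(stages: List[Stage], value: Any) -> Any:
    """Feed the value through each small stage in order."""
    for name, fn in stages:
        try:
            value = fn(value)
        except Exception as exc:
            raise StageError(f"stage {name!r} failed: {exc}") from exc
    return value

# Three tiny bricks: trim whitespace, parse an integer, clamp to 0..100.
CHAIN: List[Stage] = [
    ("strip", str.strip),
    ("parse_int", int),
    ("clamp", lambda n: max(0, min(n, 100))),
]
```

When a brick breaks, only that brick needs replacing, and the error already says which one it was.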

Third Boundary — The True Place for Small Local Models

Since cost has come up, I'll share our experience running small local models ourselves. We loaded several open-source lightweight models onto a single consumer GPU and gave them the same tasks.

The conclusion was clear: Long-form creative tasks are still out of reach. If asked to write an essay, they'd mix languages, stray off-topic, and collapse into infinite repetition of the same sentences. Entrusting a small model with the "protagonist" role leads to disappointment.

However, the same models were useful for other tasks. For validation tasks like finding defects in code, they scored close to larger models, and for short, structured problems, they provided answers in under a second. In other words, the place for small models is not generation, but rather verification, classification, and short, structured tasks. They shine not as leads, but as supporting actors — as validators, classifiers, or components of a vending machine. Attempts to move a heavy harness entirely to a small local environment failed because the context was overwhelmed by system prompts and tool definitions alone. The lesson is clear: the answer is not "complete localization" but rather "lightweight vending machines + local model nodes." Heavy tasks go to the cloud, light tasks to local. We return to the first boundary.
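A routing rule along these lines might be sketched as follows. The task kinds mirror the categories above, while the 4096-token budget and the Task fields are assumptions chosen for illustration rather than measured values.

```python
from dataclasses import dataclass

# Task kinds the article found small local models to be good at.
LOCAL_KINDS = {"verify", "classify", "short_structured"}
# Hypothetical context budget for the local model, in tokens.
LOCAL_CONTEXT_BUDGET = 4096

@dataclass
class Task:
    kind: str
    prompt_tokens: int
    overhead_tokens: int = 0  # system prompt plus tool definitions

def route(task: Task) -> str:
    """Return "local" for light, structured work that fits; otherwise "cloud"."""
    if task.kind not in LOCAL_KINDS:
        return "cloud"  # generation and other heavy work
    if task.prompt_tokens + task.overhead_tokens > LOCAL_CONTEXT_BUDGET:
        return "cloud"  # overhead alone can swamp a small context
    return "local"
```

The second check reflects the failure mode described above: a task of the right kind still goes to the cloud if the system prompt and tool definitions leave no room for it locally.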

Fourth Boundary — Even the Strongest Model Must Be Verified

The last boundary is the most important. We ran a large analytical task with the most powerful, latest model. The results were impressive. However, we didn't use those results at face value. We deliberately had different agents in different environments cross-verify them.

There's one reason: A single agent cannot filter out its own hallucinations. It's like not being able to see your own blind spot. And the newer and more powerful a model is, the greater the risk of it being plausibly wrong. Strength and reliability are different axes. High benchmarks don't automatically equate to trustworthiness.

Indeed, the initial analysis contained both false positives and false negatives. Something reported as "absent" in one environment was plainly present in another. The agent saw only part of the picture and missed the overall context. Cross-verification caught this immediately. Had we blindly trusted that initial report, we would have escalated a non-issue, mistaking something perfectly fine for a problem.

Therefore, a good harness must include a verification loop. Separate the generating agent from the verifying agent, and make the verification layers thicker as the stakes rise: the more money is involved, or the harder the action is to undo, the more independent checks it should pass. One caveat: similar models share the same blind spots, so cross-verification is effective only when performed by models from different families. If two identical models verify each other, they'll both be wrong in the same places.
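One possible shape for such a loop is sketched below. The risk tiers and the number of checks per tier are hypothetical, and each verifier's check function stands in for a call to a separate model or environment.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verifier:
    name: str
    family: str                   # model family, e.g. vendor or lineage
    check: Callable[[str], bool]  # stand-in for a real verification call

# Hypothetical tiers: the harder a decision is to undo, the more checks it needs.
REQUIRED_CHECKS = {"reversible": 1, "costly": 2, "irreversible": 3}

def cross_verify(claim: str, generator_family: str, risk: str,
                 verifiers: List[Verifier]) -> bool:
    """Approve only if enough verifiers from distinct, different families agree."""
    needed = REQUIRED_CHECKS[risk]
    seen_families = {generator_family}
    approvals = 0
    for v in verifiers:
        if v.family in seen_families:
            continue  # same family as the generator or an earlier verifier
        seen_families.add(v.family)
        if not v.check(claim):
            return False  # any independent rejection blocks the claim
        approvals += 1
        if approvals == needed:
            return True
    return False  # fail closed: not enough independent verifiers available
```

Note the fail-closed default: if the harness cannot find enough verifiers from different families, the claim is not approved, rather than being waved through on a thinner check.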

So, How Far Can It Go?

Back to the initial question: How far can a Claude harness go?

The answer is paradoxical: The "how far" of a harness is determined by its structure, not the model's intelligence. There's no need to wait for smarter models. Even with current models, a well-built structure can take you surprisingly far. Conversely, without a proper structure, no matter how powerful the model, it will run wild and cause problems.

The core principle of that structure can be summarized in one sentence:

Explore with agents, and once a pattern is established, solidify it with code.

Agents are not permanent workers but scouts. Their value lies not in endlessly repeating tasks, but in revealing the patterns in the work. Once a pattern is observed, that part is solidified in code, and the agent is freed to move on to the next unknown territory. This way, expensive intelligence always stays on the front lines, while inexpensive, stable code holds the ground behind it.
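The scout-then-solidify cycle can be sketched as a toy harness that counts the agent's decisions and stops calling the agent once a pattern is stable. In practice the "solidify" step would more likely be a person reviewing the pattern and writing real code; the automatic promotion and the threshold of 3 here are simplifying assumptions.

```python
from collections import Counter, defaultdict
from typing import Callable, DefaultDict, Dict

PROMOTE_AFTER = 3  # hypothetical: identical decisions needed before promotion

class Harness:
    def __init__(self, agent: Callable[[str], str]):
        self.agent = agent                # expensive judgment (a model call)
        self.rules: Dict[str, str] = {}   # solidified patterns (cheap lookup)
        self.history: DefaultDict[str, Counter] = defaultdict(Counter)
        self.agent_calls = 0

    def handle(self, signature: str) -> str:
        """Use a solidified rule if one exists; otherwise send the scout."""
        if signature in self.rules:
            return self.rules[signature]
        self.agent_calls += 1
        action = self.agent(signature)
        seen = self.history[signature]
        seen[action] += 1
        # Promote only when the agent has been consistent for this input.
        if len(seen) == 1 and seen[action] >= PROMOTE_AFTER:
            self.rules[signature] = action
        return action
```

After the third consistent answer for the same input, later calls are served from the rule table and the agent is no longer consulted for that input.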

The model is a component, and the harness is the skeleton. Components are replaced monthly. A new one came out yesterday, and another will come out next month. To remain unshaken each time, don't chase models; build the skeleton. If the skeleton is solid, each new component will simply make it better. A harness goes as far as its structure allows, not just as far as the model is smart.