/ project

self-hosted coding agent

A single-file terminal coding agent on a local model — built to make the harness self-describing enough that the model can recover from its own mistakes.

stack Python · Ollama · qwen3-coder:30b · DeepSeek (fallback) · ripgrep · OpenAI client · Tailscale

scope solo build, single file (agent.py)

status v1.2

ProblemI wanted to understand how agentic coding tools really work under the hood, and to run one entirely on my own hardware against open models — local-first, with no per-token cost in the common case and only an optional hosted fallback for when the local box is unreachable.

What it doesA single-file agent (agent.py) drives a local Qwen model through a six-tool loop: read a file, create a file, edit with fenced search/replace blocks, search the tree with ripgrep, run bash (the model itself flags risky commands for approval), and signal task_complete. It streams the model's output and feeds tool results back, ending only when the agent calls task_complete with file-path evidence the harness verifies — a self-describing harness that states the working directory and tool invariants and returns actionable errors, so the model can recover from its own mistakes.

OutcomeA genuinely usable, local-first coding agent in a single file — and a much sharper feel for tool-use design, harness legibility, evidence-gated termination, and where agentic systems break.

/ architecture

Architecture

qwen-code is a minimal coding agent in one file (agent.py): a loop driving a local Qwen model through six tools, with the OpenAI Python client pointed at Ollama's OpenAI-compatible endpoint and native tool calling. It runs local-first and falls back to a hosted DeepSeek endpoint only when the local box fails a startup health check.

  [task or REPL prompt]   (/model, /cost slash commands)
        │
        ▼
  [agent loop · 25 iterations local / 50 fallback]
        │  stream model text → collect tool calls → execute → feed results back
        │  ends ONLY when task_complete validates its evidence
        ▼
  tools:
    • read_file       (line-numbered, display-only)
    • write_file      (refuses overwrite — forces edits through replace_in_file)
    • replace_in_file (fenced SEARCH/REPLACE blocks; all-or-nothing; must match once)
    • search          (ripgrep across the tree; ≤ 100 matches)
    • bash            (fresh subprocess; cwd = working dir; model flags risky cmds)
    • task_complete   (summary + files_changed + harness-verified evidence)
        │
        ▼
  model: qwen3-coder:30b  (Ollama, local; reachable over Tailscale)
          └─ fallback: deepseek-v4-flash  (hosted; only if local is unreachable)

Each iteration streams the model's output, executes any tool calls, and feeds results back as tool messages. The loop ends only when the agent calls task_complete and its evidence validates — plain text with no tool call is nudged, not accepted — or when the iteration cap (25 local, 50 fallback) is hit. All file paths resolve through a single _resolve_path() and cannot escape the working directory.

/ technical decisions

Technical decisions

A self-describing harness

the system prompt tells the model its exact absolute working directory, that file tools resolve against it, and that bash runs in a fresh subprocess each call so cd won't persist. Most agent failures are the model misunderstanding its own environment; stating the invariants explicitly removes a whole class of them.

Evidence-gated termination

the loop ends only when the model calls task_complete with the files it changed — and the harness verifies every cited path was actually read or searched this session. It's a direct guard against the failure mode where a model declares a success it can't support, or invents a bug to look busy: an unverifiable citation fails the call, and the model has to either drop the claim or go read the file before it can finish.

Tool feedback designed for self-correction

replace_in_file failures are actionable, not opaque — zero matches returns the file's first 20 lines; multiple matches returns each match's line number with surrounding context and a hint to add more. The agent can fix its own mistake on the next turn instead of flailing.

Fenced search/replace edits over re-emitting files

edits go through replace_in_file as fenced SEARCH/REPLACE blocks (the Aider/Cline format): each block's SEARCH must match exactly once, multiple blocks apply in one call, and the whole thing is all-or-nothing. This sidesteps the JSON-string-escaping failures of the old str_replace tool and keeps changes precise and reviewable; write_file refuses to overwrite, so the model can't clobber a file by re-emitting it whole.

Display-only line numbers

read_file prefixes each line with a number, and the prompt is explicit that these must never appear in replace_in_file SEARCH text — a subtle but common source of failed edits.

Local-first with a hosted fallback

a 3-second startup health check on the local Ollama endpoint decides the model; if it's unreachable the agent announces a fallback to a hosted DeepSeek endpoint instead of just dying. /model switches mid-session (history preserved) and /cost tracks per-model token use. The local box stays the default — the fallback exists so an unreachable homelab doesn't end the session, not to make the tool API-dependent.

Path-safety hardening

_resolve_path follows symlinks via Path.resolve() and verifies the real path is a descendant of the working directory, rejecting both .. traversal and symlinks that point outside.

/ what broke

What broke / what I learned

The interesting work early on wasn't adding features — it was making the harness legible to the model. The recurring failure mode was the agent misreading its own context: not knowing where it was, assuming cd persisted across bash calls, or getting a bare "no match" from an edit and having nothing to recover with. The fix in almost every case was better feedback, not a smarter model: tell it the working directory, make errors carry the resolved path and a hint, return enough context on a failed edit that the next attempt succeeds. Building it in one file made those failure modes impossible to hide from.

The later versions pushed on a different failure mode: the model declaring victory. Swapping str_replace for fenced search/replace blocks cut the edit-escaping failures, but the sharper lesson came with task_complete — left to end on its own, the model would sometimes announce a fix it hadn't made, or invent a bug just to have something to solve. Making termination a tool call whose evidence the harness checks against the files actually read turned "trust the model's word" into "verify its citations," and the fabrications stopped. The hosted fallback came last, so an unreachable homelab degrades gracefully instead of ending the run.

Still deliberately minimal at v1.2. Repo-map / tree-sitter indexing, sandboxing, diff-preview approval, automatic in-session failover (the health check is startup-only), dollar-cost estimation, other providers, MCP, and subagents are out of scope for now.

Code

Finding and fixing a bug across a multi-file project: search → read → edit → verify → evidence-gated task_complete (sped up; the local 30b runs in real time).

← back to work