Harbor Agentic Coding Case Study
This case study shows how a platform team can use outcome-based evaluation before changing routing weights. A routing decision is an operational claim: the deployed model group should complete the workload at the expected quality, cost, latency, and reliability envelope. Harbor is one way to test that claim with repeatable agent tasks instead of anecdotal "felt better" or "felt worse" reports.
The model group names shown here are deployment examples from the Metrum engineering environment. GenAI Smart Router does not require names such as big-coder, and customer deployments can expose different model groups, providers, and policies.
For a benchmark tailored to your workloads, contact contact@metrum.ai.
What Harbor Is
Harbor is an open-source framework for evaluating AI agents in sandboxed, reproducible environments. Its documentation and GitHub repository describe running agents against tasks and benchmarks, capturing trajectories and artifacts, and scoring results with verifiers.
Harbor is relevant to Smart Router because it can hold the task, agent, agent version, prompts, tools, seeds, and scoring constant while changing only the model endpoint or router model group. That lets a team compare:
- final task outcome, such as reward or pass/fail;
- cost, token usage, cache behavior, and latency;
- selected upstream provider/model families, retries, and fallbacks;
- client compatibility for agent traffic such as OpenAI Responses, Anthropic Messages, or OpenAI Chat.
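The controlled comparison described above can be sketched as a small configuration check. This is an illustration only, not Harbor's API: `TrialConfig`, its field names, and the sample values such as `"fixed"` and `"task-default"` are hypothetical and simply mirror the variables listed in the text. The check confirms that two arms differ in the model value and nothing else.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TrialConfig:
    """Hypothetical description of one evaluation arm (not a Harbor type)."""
    task_id: str
    agent: str
    agent_version: str
    seed_policy: str
    verifier: str
    model: str  # router model group or fixed model ID


def differing_fields(a: TrialConfig, b: TrialConfig) -> list[str]:
    """Return the names of fields whose values differ between two arms."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def is_controlled_comparison(a: TrialConfig, b: TrialConfig) -> bool:
    """True only when the model is the single variable that changed."""
    return differing_fields(a, b) == ["model"]


routed = TrialConfig(
    task_id="aider/polyglot_python_two-bucket",
    agent="Codex CLI",
    agent_version="0.142.0",
    seed_policy="fixed",        # illustrative value
    verifier="task-default",    # illustrative value
    model="big-coder",
)
# The fixed-model arm keeps every field except the model placeholder.
fixed = replace(routed, model="<fixed-model-id>")
```

If `is_controlled_comparison(routed, fixed)` returns `False`, the list from `differing_fields` names the extra variables that would confound the comparison.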
Harbor's public docs describe support for multiple coding-agent clients, including Codex CLI, Claude Code, Aider, OpenCode, Cursor CLI, OpenHands, Goose, Gemini CLI, Cline, and Mini SWE Agent. That public list was checked on June 30, 2026. It is included for context only: it does not mean every listed client has been validated against a particular Smart Router deployment.
Terminal-Bench Connection
Terminal-Bench is a public benchmark collection for terminal and coding-agent tasks. The Terminal-Bench paper describes a benchmark built around realistic terminal tasks and reproducible evaluation; Terminal-Bench tasks use the Harbor task format and harness. Terminal-Bench is useful public evaluation material, but it is not a universal proxy for customer success. A customer support workflow, browser-control task, OCR pipeline, or repository-specific coding job still needs its own workload-appropriate verifier.
Current Production Run
The current published run was captured from the production engineering endpoint on June 29, 2026, after deploying router build 762592b.
| Field | Value |
|---|---|
| Router build | 762592b |
| Harbor task | aider/polyglot_python_two-bucket |
| Model group | big-coder |
| Harbor CLI | 0.13.2 |
| Codex CLI | 0.142.0 |
| Claude Code CLI | 2.1.186 |
| Caller token model | authorized evaluation caller |
The task asks the agent to implement two_bucket.py so that it passes the verifier's checks for bucket-measuring behavior and the required ValueError handling.
Measurement Caveat
The table below is a production smoke/evaluation snapshot, not a statistically significant benchmark. A single deterministic Harbor task is useful for catching regressions and proving whether a route can complete a specific workload, but promotion decisions should use repeated runs, fixed-model controls, and workload-specific pass-rate thresholds.
This result applies only to the specific task, agent versions, router build, model group, and run window shown here. It does not establish a universal model ranking, and it should not be read as evidence that Harbor, Terminal-Bench, or one fixed task predicts every customer workload. When a task is run repeatedly, report the pass rate with a confidence interval rather than a single outcome.
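To make the caveat concrete, the sketch below computes a Wilson score interval for a pass rate. The Wilson interval is one common choice for small samples; the source does not prescribe a particular method, and the function name and the run counts used here are illustrative.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# One pass out of one run, as in a single-trial snapshot.
low, high = wilson_interval(1, 1)
# A hypothetical twenty passes out of twenty runs narrows the interval.
low20, high20 = wilson_interval(20, 20)
```

With one pass in one run the lower bound is about 0.21, so a single success is compatible with a true pass rate well below any reasonable promotion threshold. Twenty passes in twenty runs raise the lower bound to about 0.84.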
Results
| Agent | Group | Status | Reward | Errors | Elapsed | Harbor input | Harbor cache | Harbor output | Interpretation |
|---|---|---|---|---|---|---|---|---|---|
| Codex CLI | big-coder | ok | 1 | 0 | 68 s | 48,470 | 31,744 | 2,598 | Passed verifier |
| Claude Code CLI | big-coder | failed | 0 | 0 | 348 s | 278,176 | 276,886 | 22,530 | Agent completed without router exceptions but left the starter implementation unchanged |
The Codex path passed the task. The Claude Code path failed the verifier even though the request path completed without Harbor exceptions. The failed artifact still contained the starter stub, whose body was only a pass statement, and the trajectory showed repeated file-read calls rather than a successful edit. This is why model-group validation should use outcome checks rather than HTTP status alone.
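The distinction between a healthy request path and a passing outcome can be sketched as a small classifier. The function and the verdict labels are hypothetical; the input values come from the results table.

```python
def trial_verdict(errors: int, reward: float, pass_reward: float = 1.0) -> str:
    """Separate transport or harness failures from verifier failures."""
    if errors > 0:
        return "infrastructure_error"
    if reward >= pass_reward:
        return "passed"
    # No errors were raised, yet the verifier rejected the result.
    return "completed_but_failed_verifier"


# Values taken from the results table.
codex_verdict = trial_verdict(errors=0, reward=1)
claude_code_verdict = trial_verdict(errors=0, reward=0)
```

A check based only on the error count would treat both trials as healthy. Only the reward separates them.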
[Charts: Verifier Reward (outcome by agent); Elapsed Time (runtime by agent); Token Demand (Harbor-reported tokens by agent)]
Token totals are from the Harbor job summaries for the current production run. Cache-read tokens are included in total token demand because they affect context pressure and agent loop behavior.
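As a quick worked comparison, the snippet below divides each Claude Code figure by the matching Codex figure from the results table. It deliberately avoids summing the columns, because the source does not say whether the Harbor input count already includes cache-read tokens.

```python
# Figures copied from the results table.
codex = {"elapsed_s": 68, "input": 48_470, "cache": 31_744, "output": 2_598}
claude_code = {"elapsed_s": 348, "input": 278_176, "cache": 276_886, "output": 22_530}

# Ratio of the failed trial to the passing trial, per column.
ratios = {key: round(claude_code[key] / codex[key], 2) for key in codex}
for key, value in ratios.items():
    print(f"{key}: {value}x")
```

On every column the failed trial consumed roughly five to nine times what the passing trial did, which matches the repeated file-read loop seen in the trajectory.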
What This Proves
- The big-coder route was reachable from an authorized evaluation caller.
- Codex CLI completed the Harbor task through the router on the deployed build.
- Claude Code reached the router and upstream path, but the selected model behavior did not complete this workload.
- The route should not be promoted as fully validated for Claude Code on this task until a follow-up run passes with reward 1.
- The evidence is about task outcome plus operational behavior, not cost alone.
Upstream Reporting
Public case studies should avoid exposing private routing internals as if they were a product contract. For this run, upstreams are reported by anonymized route family:
| Route family | Role in the run |
|---|---|
| Provider A, OpenAI Responses-compatible | Served the passing Codex trial |
| Provider B, Anthropic Messages-compatible | Served the failed Claude Code trial |
Customer deployments can choose different provider families, hosted models, private upstreams, and weights. The stable product contract is the model group plus validation evidence, not a fixed upstream model name.
Router-Versus-Fixed Pattern
Use the same dataset/task, agent, agent version, seed policy, token caps, timeouts, and verifier for both arms. Change only the model endpoint or model value.
```bash
# Install Harbor in an isolated tool environment.
uv tool install harbor

# A: routed model group through GenAI Smart Router.
export ROUTER_BASE_URL="https://router.example.com/v1"
export ROUTER_TOKEN="rtr_metrum_<user>_<project>_<env>_<key>_<secret>"

# Dataset run, for example terminal-bench@2.0.
harbor run -d <dataset-id> \
  --agent <agent> \
  --model <router-model-group>

# Single-task run, for example aider/polyglot_python_two-bucket.
harbor run -t <task-id> \
  --agent <agent> \
  --model <router-model-group>

# B: fixed-model control with the same dataset/task, agent, versions, and seed policy.
harbor run -d <dataset-id> \
  --agent <agent> \
  --model <fixed-model-id>

harbor run -t <task-id> \
  --agent <agent> \
  --model <fixed-model-id>
```
The exact environment variables, model string, concurrency, attempt count, and seed controls depend on the agent adapter, Harbor version, and dataset. Use the same Harbor-supported seed and attempt policy in both arms. Use placeholder router endpoints and tokens in shared examples, and use /v1/models with the evaluation caller token to discover allowed router model groups.
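Model-group discovery through /v1/models might look like the sketch below. It assumes, without confirmation from the source, that the router accepts a bearer token in the Authorization header and returns an OpenAI-style listing of the form `{"data": [{"id": ...}]}`. All function names are hypothetical.

```python
import json
import urllib.request


def build_models_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for <base_url>/models with bearer authentication (assumed scheme)."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def parse_model_ids(payload: dict) -> list[str]:
    """Extract sorted IDs from an OpenAI-style listing (assumed response shape)."""
    return sorted(item["id"] for item in payload.get("data", []))


def list_model_groups(base_url: str, token: str) -> list[str]:
    """Fetch and parse the listing; this function performs a network call."""
    request = build_models_request(base_url, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_model_ids(json.load(response))
```

Calling `list_model_groups(os.environ["ROUTER_BASE_URL"], os.environ["ROUTER_TOKEN"])` after `import os` would return the groups visible to that caller, provided the assumptions about authentication and response shape hold for the deployment.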
Record router build, safe config summary or checksum, task ID, agent/client versions, run timestamps, request IDs when available, reward/pass-fail result, errors, elapsed time, token demand, cache usage, selected route families, fallback behavior, and task artifacts. Promote or reweight only if the target pass-rate and operational criteria are met.
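One way to keep this evidence consistent across runs is a small record type plus a promotion gate. The sketch below is an assumption-laden illustration: `RunEvidence` covers only a subset of the fields listed above, its field names are not a router or Harbor schema, and the 0.9 pass-rate threshold in the usage note is a placeholder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunEvidence:
    """Hypothetical evidence record; field names are illustrative only."""
    router_build: str
    task_id: str
    agent: str
    agent_version: str
    model_group: str
    reward: float
    errors: int
    elapsed_s: float


def meets_criteria(runs: list[RunEvidence], min_pass_rate: float, max_errors: int = 0) -> bool:
    """Gate promotion on pass rate and total error count across repeated runs."""
    if not runs:
        return False
    passes = sum(1 for run in runs if run.reward >= 1)
    total_errors = sum(run.errors for run in runs)
    return passes / len(runs) >= min_pass_rate and total_errors <= max_errors


task = "aider/polyglot_python_two-bucket"
codex_run = RunEvidence("762592b", task, "Codex CLI", "0.142.0", "big-coder",
                        reward=1, errors=0, elapsed_s=68)
claude_run = RunEvidence("762592b", task, "Claude Code CLI", "2.1.186", "big-coder",
                         reward=0, errors=0, elapsed_s=348)

# One JSON line per run keeps the evidence easy to archive and diff.
evidence_line = json.dumps(asdict(codex_run))
```

With a placeholder threshold such as `meets_criteria([claude_run], min_pass_rate=0.9)`, the Claude Code trial fails the gate even though it recorded zero errors.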
Bring Your Own Verifier
Harbor is one evaluation option, not a product dependency or the only accepted proof method. Teams can also use unit tests, OCR target checks, extraction goldens, product acceptance tests, browser-control tasks, tool-call correctness checks, human-reviewed rubrics, or any workload-specific verifier that produces repeatable evidence. The important rule is model-group sufficiency: prove that the cheaper or more flexible route still completes the customer's workload before making it the default.
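As one example of a workload-specific verifier, the sketch below scores an extraction result against a golden record. It is a generic illustration with no dependency on Harbor; the `Verifier` alias, the function names, and the invoice fields are all made up.

```python
from typing import Callable, Mapping

# A verifier maps extracted fields to a reward of 0.0 or 1.0.
Verifier = Callable[[Mapping[str, str]], float]


def golden_verifier(golden: Mapping[str, str]) -> Verifier:
    """Build a verifier that returns 1.0 only when every golden field matches exactly."""
    def verify(extracted: Mapping[str, str]) -> float:
        matched = all(extracted.get(key) == value for key, value in golden.items())
        return 1.0 if matched else 0.0
    return verify


# Hypothetical invoice-extraction golden; field names are illustrative.
verify_invoice = golden_verifier({"invoice_id": "INV-001", "total": "42.00"})
```

Because the verifier returns the same reward for the same input, it can be applied unchanged to both the routed arm and the fixed-model arm.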
For practical context beyond the primary Harbor and Terminal-Bench sources, see Tessl's Harbor introduction and LangChain's Terminal-Bench/Harbor harness engineering example.
Related Smart Router Docs
For a general router-versus-fixed-model evidence template, including non-Harbor workloads and rollout decisions, see Prove Router Quality. For model-group contracts and promotion criteria, see Model Group Quality Criteria. For coding-agent setup and compatibility checks, see Coding-Agent Client Matrix. For report evidence, see Usage Reporting and Admin Browser Reports.