Prove Router Quality

Use this playbook when a team asks whether a routed model group works as well as a fixed model, a previous routing policy, or another provider mix. A routing decision is a claim. Claims need measurable, repeatable evidence.

The right answer is not always "the router wins." The right answer is knowing which model group or fixed model is good enough for which workload, and within what cost, latency, and reliability envelope. If the evidence says a workload needs a stronger target, the deployment can promote that target, add a workload-specific group, change weights, or route only that workload differently.

Executive Framing

Anecdotes are useful bug reports. By themselves, they are not sufficient evidence for changing production routing policy.

Compare a router group against a fixed model or previous-policy control with the workload, client, agent, tools, prompt shape, seed policy, versions, token caps, timeouts, and scoring held constant. Change only the model selection: routed group versus fixed model, or new policy versus previous policy.
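One way to enforce the single-variable rule is to diff the two run configurations before launching either arm. The sketch below is a minimal illustration under stated assumptions: the field names (`model`, `agent`, `max_tokens`, `timeout_s`, `seed_policy`) are hypothetical and do not come from any GenAI Smart Router or Harbor schema.

```python
_MISSING = object()  # distinguishes "key absent" from "key set to None"


def differing_fields(arm_a: dict, arm_b: dict) -> set:
    """Return the keys whose values differ between two run configurations."""
    keys = arm_a.keys() | arm_b.keys()
    return {k for k in keys if arm_a.get(k, _MISSING) != arm_b.get(k, _MISSING)}


def is_controlled_comparison(arm_a: dict, arm_b: dict, allowed=frozenset({"model"})) -> bool:
    """True when the arms differ, and differ only in the allowed fields."""
    diff = differing_fields(arm_a, arm_b)
    return bool(diff) and diff <= allowed


# Hypothetical run configurations; every field name here is an assumption.
routed = {
    "model": "<router-model-group>",
    "agent": "<agent>",
    "max_tokens": 4096,
    "timeout_s": 600,
    "seed_policy": "fixed",
}
fixed = dict(routed, model="<fixed-model-id>")
drifted = dict(fixed, timeout_s=900)  # an accidental second change

print(is_controlled_comparison(routed, fixed))    # only the model differs
print(sorted(differing_fields(routed, drifted)))  # lists every field that drifted
```

Running a check like this in the evaluation script, and refusing to start when it fails, keeps an accidental timeout or token-cap change from being attributed to the router.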

Teams own their routing destiny. GenAI Smart Router makes routing policy controllable, measurable, and reversible; it does not guarantee every workload improves automatically. Use evidence to decide whether to promote a routed group, keep a fixed model, split the workload, or collect more data before rollout.

Admissible Evidence

Anecdote / weak signal	Evidence / strong signal
"it felt dumber today"	fixed task set, pinned versions, repeated runs
one screenshot	saved test cases with expected outcomes
one cherry-picked failure	distribution across tasks/seeds/users
no config/version record	config version or safe routing/config summary
no control	fixed-model baseline or previous-policy baseline
no metric	pre-registered pass rate, cost, latency, or business metric

Smoke tests prove compatibility for a narrow request shape. Offline evaluations prove outcome quality on a fixed dataset. Shadow tests show how a candidate policy behaves on production-like traffic without changing user experience. A/B tests compare production cohorts when the risk is acceptable and the business metric is defined before the run.

Router Vs Fixed-Model Template

Use this template before running the comparison:

Field	What to record
Hypothesis	Example: `<router-group>` matches `<fixed-model>` pass rate while reducing cost or latency.
Workload/task set	Dataset, task IDs, product flow, or acceptance-test suite.
Metric and success threshold	Pass rate, reward, resolution rate, extraction accuracy, business metric, cost, latency, or reliability threshold.
Baseline/control model	Fixed model ID or previous routing policy. Mark IDs as deployment examples when not validated for the current deployment.
Candidate router group or policy	Deployment-defined model group, target weights, scripted policy label, config version, or safe routing/config summary.
Client/agent version	SDK, CLI, application build, agent version, and evaluator version.
Tool/image/structured-output/reasoning settings	API shape, tools, modalities, schemas, reasoning/thinking controls, and eligibility expectations.
Seeds/attempts	Number of runs, seed policy, retry policy, and whether tasks are independent.
Token caps/timeouts	`max_tokens`, `max_output_tokens`, request timeout, per-attempt timeout, and agent timeout.
Routing config version or safe routing/config summary	Enough safe config detail to rerun without exposing provider keys, router tokens, token hashes, private URLs, or full production config.
Provider/model entitlement state	Whether the account was entitled to every compared model during the run.
Statistical comparison method	Distribution table, confidence interval, paired test, bootstrap, or other method appropriate for the metric and sample design.
Rollout decision	Promote, hold, split, rollback, or collect more evidence.

Where statistics are used, keep the claim precise. Confidence intervals and p-values depend on task independence, sample design, metric choice, and whether the comparison is paired.
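As one illustration of a paired method, the sketch below computes a percentile bootstrap interval for the pass-rate difference between two arms. It assumes tasks are independent, that both arms ran the same tasks in the same order, and that each task has already been reduced to one 0/1 outcome per arm (for example by aggregating seeds first). The function name and defaults are illustrative, and other methods may fit a given sample design better.

```python
import random


def paired_bootstrap_diff(arm_a, arm_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(arm_a - arm_b) over paired tasks.

    arm_a, arm_b: per-task outcomes (0 or 1), aligned by task.
    Returns (observed_difference, ci_low, ci_high).
    """
    if len(arm_a) != len(arm_b) or not arm_a:
        raise ValueError("arms must be non-empty and aligned task by task")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    diffs = [a - b for a, b in zip(arm_a, arm_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Resample whole tasks with replacement, keeping each pair together.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high


# Hypothetical per-task pass/fail outcomes for a routed arm and a fixed-model arm.
routed_outcomes = [1, 1, 0, 1, 1, 0, 1, 1]
fixed_outcomes = [1, 0, 0, 1, 1, 1, 1, 1]
print(paired_bootstrap_diff(routed_outcomes, fixed_outcomes))
```

Resampling the per-task differences, rather than each arm separately, is what makes the comparison paired. If tasks are not independent, for example several variants of one prompt, resample at the level of the independent unit instead.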

Harbor Coding-Agent Example

Harbor is a useful coding-agent evaluation harness because it runs the agent loop and checks the produced artifact with a verifier. It is not required: any objective verifier that matches the workload can be used. See the Harbor Case Study for source-dated Harbor and Terminal-Bench context plus a worked production snapshot.

The model IDs below are placeholders. Replace them with model groups and fixed-model IDs validated for your deployment. The only intended difference between arms A and B is the `--model` value, that is, the route selection.

# Install Harbor in an isolated tool environment.
uv tool install harbor

# A: routed group
harbor run -d <dataset-or-task> \
  --agent <agent> \
  --model <router-model-group>

# B: fixed model control
harbor run -d <dataset-or-task> \
  --agent <agent> \
  --model <fixed-model-id>

Use the same Harbor-supported attempt and seed policy for both arms. Report the task-level reward or pass/fail result, agent errors, elapsed time, selected provider/model, retries, fallbacks, token counts, request-time cost, and throughput for both arms. Join the Harbor run window to router usage reports by timestamp, caller/project, client, model group, request ID, or run label.
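The join described above can be sketched as follows, assuming each arm was launched with a distinct run label that also appears on the router usage rows. All field names (`run_label`, `task_id`, `input_tokens`, `output_tokens`, `fallback`) are hypothetical; they are not taken from Harbor output or from the router's usage report schema, so adapt them to whatever your reports contain.

```python
from collections import defaultdict


def summarize_usage_by_run(usage_rows):
    """Aggregate usage rows into per-run totals, keyed by run label."""
    totals = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0, "fallbacks": 0}
    )
    for row in usage_rows:
        t = totals[row["run_label"]]
        t["requests"] += 1
        t["input_tokens"] += row["input_tokens"]
        t["output_tokens"] += row["output_tokens"]
        t["fallbacks"] += 1 if row.get("fallback") else 0
    return dict(totals)


def attach_usage(task_results, usage_rows):
    """Attach per-run usage totals to each task result.

    Returns (joined_rows, unmatched_task_ids) so that missing usage data
    is reported rather than silently dropped.
    """
    usage = summarize_usage_by_run(usage_rows)
    joined, unmatched = [], []
    for result in task_results:
        summary = usage.get(result["run_label"])
        if summary is None:
            unmatched.append(result["task_id"])
        joined.append({**result, "usage": summary})
    return joined, unmatched
```

An agent loop usually issues many requests per task, which is why this sketch aggregates by run label instead of matching one request ID to one task. Where request IDs are recorded per task, a request-level join gives finer attribution.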

Outcome Gate Artifact

For production route changes, convert the run matrix and workload results into an explicit gate artifact before promotion. The repository includes a generic gate summarizer that can run against Harbor `results.tsv` files or JSON result rows, and can merge safe usage-report rows when available:

python3 scripts/evaluate_workload_gate.py \
  --matrix examples/harbor-algotune-pca/workload_gate_matrix.json \
  --results examples/harbor-algotune-pca/runs/<CASE_ID>/results.tsv \
  --usage-json examples/harbor-algotune-pca/reports/<CASE_ID>/usage-rows.json \
  --out-json examples/harbor-algotune-pca/reports/<CASE_ID>/workload-gate.json \
  --out-md examples/harbor-algotune-pca/reports/<CASE_ID>/workload-gate.md

The matrix should declare the task set, reward/verifier, clients, model groups, attempts/seeds, fixed-model or previous-policy controls, pass-rate and reward thresholds, p95 latency ceiling, cost-per-success ceiling, error and fallback ceilings, and rollback criteria. A mock fixture self-test is available for local or CI environments where live Harbor is not installed:

python3 scripts/evaluate_workload_gate_test.py

The gate report is safe to share when populated from safe result and usage rows: it includes task/run counts, pass rate with confidence interval, reward, cost, latency, fallback/error rates, selected upstream distribution, and request IDs. It intentionally excludes raw router tokens, token hashes, provider keys, prompts, images, tool outputs, and full deployment config.
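To make the gate idea concrete, the sketch below applies a handful of declared thresholds to per-run rows. It is an illustration only and does not reproduce `scripts/evaluate_workload_gate.py` or its matrix schema: the row fields (`passed`, `cost`, `latency_s`, `fallback`) and threshold keys are hypothetical names chosen for this example.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def evaluate_gate(rows, thresholds):
    """Compare observed metrics with declared thresholds.

    Returns a dict with the observed values, the names of failed checks,
    and a boolean `promote` that is True only when every check passes.
    """
    if not rows:
        raise ValueError("no result rows to evaluate")
    n = len(rows)
    successes = sum(1 for r in rows if r["passed"])
    total_cost = sum(r["cost"] for r in rows)
    observed = {
        "pass_rate": successes / n,
        "p95_latency_s": p95([r["latency_s"] for r in rows]),
        "cost_per_success": total_cost / successes if successes else math.inf,
        "fallback_rate": sum(1 for r in rows if r["fallback"]) / n,
    }
    failures = []
    if observed["pass_rate"] < thresholds["min_pass_rate"]:
        failures.append("pass_rate")
    if observed["p95_latency_s"] > thresholds["max_p95_latency_s"]:
        failures.append("p95_latency_s")
    if observed["cost_per_success"] > thresholds["max_cost_per_success"]:
        failures.append("cost_per_success")
    if observed["fallback_rate"] > thresholds["max_fallback_rate"]:
        failures.append("fallback_rate")
    return {"promote": not failures, "failures": failures, "observed": observed}
```

Declaring the thresholds before the run, and recording which checks failed instead of a single yes or no, keeps the promotion decision traceable to specific evidence.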

Non-Harbor Examples

Workload	Objective evidence
Support-chat answer rubric	Score a fixed set of tickets with expected facts, forbidden claims, tone requirements, and escalation rules.
Extraction accuracy on a golden dataset	Compare exact-match, field-level F1, invalid JSON rate, and cost per accepted record.
OCR target answer validation	Ask for a specific receipt merchant, invoice total, or form field and compare against the expected answer.
Browser-control task success	Run a fixed browser task and score whether the final page state, form value, or downloaded artifact is correct.
OpenAI Chat tool-call correctness	Assert the selected tool name, arguments, forced/auto `tool_choice` behavior, and final answer.
Responses function-call continuation	Assert function-call output is accepted and the continuation reaches the expected final answer.
Anthropic Messages client-tool task	Assert client tool calls have the expected shape and the final tool-result continuation succeeds.
Internal app acceptance tests	Run product tests that already represent user success, then compare route, cost, latency, and errors.
Production shadow or A/B cohort	Use when appropriate governance exists; pre-register the cohort, metric, guardrails, and rollback rule.
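As an example of turning one of these rows into an objective scorer, the sketch below computes exact-match rate, micro-averaged field-level F1, and invalid JSON rate for an extraction workload. It assumes each model output is expected to be a flat JSON object and that golden records are flat dictionaries; the function name and scoring conventions (a wrong value counts as both a false positive and a false negative) are choices made for this illustration.

```python
import json

_MISSING = object()  # marks a field that is absent from a record


def score_extraction(raw_outputs, expected_records):
    """Score raw model outputs against golden records, aligned by position."""
    tp = fp = fn = invalid = exact = 0
    for raw, expected in zip(raw_outputs, expected_records):
        try:
            predicted = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            predicted = None
        if not isinstance(predicted, dict):
            invalid += 1
            fn += len(expected)  # every expected field was missed
            continue
        if predicted == expected:
            exact += 1
        for field, value in expected.items():
            if predicted.get(field, _MISSING) == value:
                tp += 1
            else:
                fn += 1
        # Predicted fields that are wrong or not expected at all.
        fp += sum(1 for f, v in predicted.items() if expected.get(f, _MISSING) != v)

    n = len(expected_records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "exact_match_rate": exact / n if n else 0.0,
        "field_f1": f1,
        "invalid_json_rate": invalid / n if n else 0.0,
    }
```

Run the same scorer, unchanged, over the routed arm and the control arm so that any difference in the scores comes from model selection and not from the rubric.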

Metrics To Report

Include enough data for a skeptical reviewer to rerun or challenge the result:

Metric class	Required evidence
Primary outcome	Pass rate, reward score, resolution rate, acceptance-test result, or business success metric.
Cost	Actual request-time upstream input and output token usage, image cost fields where relevant, cache cost where relevant, and comparison against the baseline.
Latency	Downstream latency, upstream duration, TTFB where available, output tokens/sec, and total tokens/sec.
Reliability	Retries, fallbacks, provider errors, `no-eligible-target`, timeout, cancellation, and agent/runtime errors.
Compatibility	API shape, tool dialect, modality, structured-output behavior, reasoning/thinking controls, and cap forwarding.
Distribution	Task-level and seed-level results, not only one aggregate.
Uncertainty	Confidence interval or uncertainty estimate when sample size and metric design allow it.

Reports should sum stored request-time cost values. Do not reprice historical actuals from current provider config.
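A minimal sketch of that rule: totals come only from the cost value stored with each request at the time it ran, and no token count is ever multiplied by a current price. The field names `stored_cost` and `passed` are hypothetical, and `Decimal` is used here simply to avoid floating-point drift when summing many small amounts.

```python
from decimal import Decimal


def cost_summary(rows):
    """Sum stored request-time cost and derive cost per successful outcome."""
    total = sum((Decimal(str(r["stored_cost"])) for r in rows), Decimal("0"))
    successes = sum(1 for r in rows if r["passed"])
    return {
        "total_cost": total,
        "successes": successes,
        # None when there were no successes, so the report shows a gap, not a zero.
        "cost_per_success": (total / successes) if successes else None,
    }


def cost_delta(candidate_rows, baseline_rows):
    """Candidate total minus baseline total; negative means the candidate cost less."""
    return cost_summary(candidate_rows)["total_cost"] - cost_summary(baseline_rows)["total_cost"]
```

Because provider prices change over time, recomputing cost from token counts and today's configuration would make an old run look cheaper or more expensive than it actually was.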

Decision Matrix

Evidence result	Decision
Router group beats fixed model	Promote or increase access, then continue monitoring outcome, cost, latency, and error budgets.
Router group ties fixed model at lower cost or better latency	Promote cautiously, retain rollback criteria, and monitor production distribution.
Router group underperforms	Keep the fixed model, create a workload-specific group, increase the weight of the stronger target, or adjust capability filters.
Mixed results	Split by task category, model group, request shape, user/project, or policy label.
Evidence inconclusive	Gather more tasks/seeds, improve the rubric, or use shadow mode before production rollout.
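The matrix can also be written down as a small decision function so the rule is fixed before results arrive. The sketch below is one possible encoding, not a prescribed policy. It assumes the evidence is summarized as confidence intervals on the outcome difference (router minus control), overall and per task category, and that the team has chosen a tie margin in advance. The labels, the `tie_margin` parameter, and the handling of a tie with no cost or latency benefit (treated here as "hold") are assumptions of this example.

```python
def classify(ci_low, ci_high, tie_margin):
    """Classify a confidence interval on (router - control) outcome difference."""
    if ci_low > 0:
        return "beats"
    if ci_high < 0:
        return "underperforms"
    if -tie_margin <= ci_low and ci_high <= tie_margin:
        return "ties"
    return "inconclusive"


def decide(overall_ci, category_cis, tie_margin, cheaper_or_faster):
    """Map evidence to a rollout decision following the matrix above.

    overall_ci: (low, high) for the whole task set.
    category_cis: {category: (low, high)}; may be empty.
    cheaper_or_faster: True when the router arm met its cost or latency goal.
    """
    labels = {classify(low, high, tie_margin) for low, high in category_cis.values()}
    if "beats" in labels and "underperforms" in labels:
        return "split"  # mixed results: route categories differently
    verdict = classify(*overall_ci, tie_margin)
    if verdict == "beats":
        return "promote"
    if verdict == "ties":
        # Assumption: a tie with no cost or latency gain does not justify a change.
        return "promote-cautiously" if cheaper_or_faster else "hold"
    if verdict == "underperforms":
        return "keep-fixed-or-retune"
    return "collect-more-evidence"
```

For example, `decide((-0.02, 0.02), {}, 0.03, True)` falls in the tie band with a cost or latency gain and returns `"promote-cautiously"`.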

Prove It, Do Not Feel It: Auditing a Smart Router Against a Fixed Model

The visual is maintainable Mermaid markup that ships with the page. The audit rule is simple: publish enough safe evidence for another reviewer to understand what changed, what stayed constant, how outcomes were scored, and why the rollout decision follows from the data.

Safe Proof Package

Share safe artifacts: task IDs, expected outcomes, scorer version, client/agent version, model group, fixed-model control, config version or safe routing/config summary, timestamps, request IDs, selected provider/model, token counts, cost, latency, fallback/error counts, and anonymized aggregate tables.

Do not share provider keys, bearer tokens, token hashes, raw prompts, raw images, raw tool outputs, private repository contents, private hostnames, or full production config unless a governed support path explicitly permits it.
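One way to apply these two lists mechanically is an allowlist filter run over every row before it leaves the team. The sketch below is a minimal illustration; the field names are hypothetical stand-ins for the safe artifacts listed above, not a schema from the router or the gate script.

```python
# Hypothetical field names standing in for the safe artifacts listed above.
SAFE_FIELDS = frozenset({
    "task_id", "expected_outcome", "scorer_version", "client_version",
    "model_group", "fixed_model_control", "config_version", "timestamp",
    "request_id", "selected_provider", "selected_model",
    "input_tokens", "output_tokens", "cost", "latency_ms",
    "fallback_count", "error_count",
})


def to_safe_row(row):
    """Keep only allowlisted fields; anything unrecognized is dropped."""
    return {k: v for k, v in row.items() if k in SAFE_FIELDS}


def dropped_fields(row):
    """Names of fields that would be removed, useful for a pre-share review."""
    return sorted(k for k in row if k not in SAFE_FIELDS)
```

An allowlist is used instead of a denylist so that a newly added sensitive field is excluded by default and does not leak until someone remembers to block it.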
