
Model Group Quality Criteria

Model groups are the caller-facing quality and cost contracts in GenAI Smart Router. A group name should mean something operational: what work it is intended to handle, which clients can use it, which modalities and tools it supports, and what outcome it must preserve while the router optimizes cost, latency, and provider mix behind the scenes.

This is the key product principle: not every task needs the most expensive model. Expensive targets should be reserved for workloads that require them. Simpler text, extraction, summarization, and routine coding work can often be served by lower-cost routes when validation shows the group still meets its objective.

For strategy selection, start with Routing Strategy Decision Tree. For the full customer-controlled routing contract, ownership controls, and proof workflow, see Customer-Controlled Routing. For a repeatable router-versus-fixed-model proof plan, see Prove Router Quality. For one coding-agent evidence pattern, see the Harbor Case Study.

Group Contract Fields

Define these fields for every model group in a deployment:

| Field | Description |
| --- | --- |
| Group name | Deployment-defined caller-facing name |
| Intended users | Teams, apps, agents, or environments allowed to use the group |
| Intended workloads | The tasks the group is expected to complete |
| API shapes | OpenAI Chat, OpenAI Responses, Anthropic Messages |
| Modalities | Text, image/VLM, or other configured modalities |
| Tool dialects | OpenAI Chat tools, Responses function tools, Anthropic client tools |
| Quality target | Objective success threshold for this group |
| Cost target | Cost/request, daily budget, or savings target against a baseline |
| Latency target | p50/p95 latency or token-throughput target |
| Reliability target | Timeout, fallback, and provider error thresholds |
| Validation harness | Harbor, workload-specific tests, or another objective harness |
| Promotion criteria | When to increase target weight or caller access |
| Rollback criteria | When to reduce weight, disable a target, or isolate the group |
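
As a way to keep these fields reviewable, the contract can be written down as a typed record. The sketch below is illustrative only: the class name, attribute names, and example values are assumptions, not the router's actual configuration schema.

```python
from dataclasses import dataclass


# Hypothetical schema: attribute names mirror the contract fields above,
# but the router's real configuration format may differ.
@dataclass(frozen=True)
class GroupContract:
    group_name: str
    intended_users: tuple[str, ...]
    intended_workloads: tuple[str, ...]
    api_shapes: tuple[str, ...]
    modalities: tuple[str, ...]
    tool_dialects: tuple[str, ...]
    quality_target: str
    cost_target: str
    latency_target: str
    reliability_target: str
    validation_harness: str
    promotion_criteria: str
    rollback_criteria: str


def missing_fields(contract: GroupContract) -> list[str]:
    """Return the names of contract fields that were left empty."""
    return [name for name, value in vars(contract).items() if not value]


# Illustrative values only; every name and threshold here is made up.
example = GroupContract(
    group_name="coding-agent",
    intended_users=("platform-team",),
    intended_workloads=("multi-step tool tasks",),
    api_shapes=("OpenAI Responses", "Anthropic Messages"),
    modalities=("text",),
    tool_dialects=("Responses function tools", "Anthropic client tools"),
    quality_target="verifier pass rate meets the agreed threshold",
    cost_target="cost/request below the fixed-model baseline",
    latency_target="p95 latency within the agreed bound",
    reliability_target="fallback and timeout rates within target",
    validation_harness="Harbor",
    promotion_criteria="all targets met on the validation run",
    rollback_criteria="",  # deliberately empty to show the check
)

print(missing_fields(example))  # ['rollback_criteria']
```

A check like `missing_fields` could run in review or CI so that a group is not promoted while part of its contract is still undefined.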

Example Group Contracts

| Group Type | Intended Workloads | Example Success Criteria |
| --- | --- | --- |
| Low-cost general group | short chat, summarization, extraction, simple code edits | extraction accuracy threshold, simple unit tests pass, latency and cost target met |
| Balanced developer group | day-to-day coding, refactors, tool use, occasional image context | coding task tests pass, tool-call file assertion passes, image-bearing requests select VLM-capable targets |
| Coding-agent group | Codex CLI, Claude Code, multi-step tool tasks | Harbor reward score meets threshold, created files pass verifier, fallback and timeout rates stay within target |
| VLM-capable group | OCR, screenshot reasoning, browser-control context | image smoke passes, OCR target accuracy meets threshold, image cost fields populate |
| Reasoning-capable group | explicit OpenAI reasoning or Anthropic thinking requests | direct and router reasoning smokes pass, requested reasoning controls are forwarded or safely translated, no-compatible-target requests fail before upstream |
| Private-model group | internal vLLM/SGLang or private GPU workloads | direct upstream and router smokes pass, private endpoint remains hidden from callers, chargeback values populate |

These are examples only. The deployment chooses group names and contracts that match its teams, applications, and governance model.

For the reasoning metadata, caller examples, and negative test behind a reasoning-capable group, see Reasoning Routing.
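
Two of the example criteria above, image-bearing requests selecting VLM-capable targets and no-compatible-target requests failing before upstream, amount to capability filtering. The following is a minimal sketch of that idea, not the router's implementation; `Target`, `NoCompatibleTarget`, and `compatible_targets` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str                         # hypothetical target label
    modalities: frozenset             # e.g. frozenset({"text", "image"})
    supports_reasoning: bool = False


class NoCompatibleTarget(Exception):
    """Raised before any upstream call when no target fits the request."""


def compatible_targets(targets, needs_image=False, needs_reasoning=False):
    """Keep only targets that support every capability the request needs."""
    matches = [
        t for t in targets
        if (not needs_image or "image" in t.modalities)
        and (not needs_reasoning or t.supports_reasoning)
    ]
    if not matches:
        # Fail here so the request never reaches an upstream provider.
        raise NoCompatibleTarget("no target supports the requested capabilities")
    return matches


# Illustrative pool; the names are made up.
pool = [
    Target("text-small", frozenset({"text"})),
    Target("vlm-large", frozenset({"text", "image"}), supports_reasoning=True),
]

print([t.name for t in compatible_targets(pool, needs_image=True)])  # ['vlm-large']
```

A negative test for a reasoning-capable group would call the filter with a pool that lacks reasoning support and assert that `NoCompatibleTarget` is raised.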

Agentic Quality Validation

Harbor-style validation is useful because it tests the whole agent loop, not only a single completion. For a coding-agent group, run tasks that require the agent to inspect files, call tools, edit artifacts, and pass an external verifier. Harbor is optional; a deployment can use unit tests, product acceptance tests, browser-control checks, OCR goldens, tool-call assertions, or another workload-specific verifier when those better represent the group contract.

Track:

  • task reward score or pass/fail result;
  • agent runtime errors;
  • selected provider/model;
  • fallback count;
  • input, output, cache, and image token counts;
  • elapsed time and token throughput;
  • request-time router cost and upstream-reported billed cost;
  • generated artifact correctness.
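
One way to make the tracked values comparable across runs is to store one record per task and aggregate them. This sketch assumes a hypothetical `RunRecord` shape; it omits cache and image token counts for brevity, and the numbers are illustrative, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical per-task record; field names are assumptions.
    task_id: str
    passed: bool
    runtime_error: bool
    provider_model: str
    fallback_count: int
    input_tokens: int
    output_tokens: int
    elapsed_s: float
    router_cost: float   # request-time router estimate
    billed_cost: float   # upstream-reported billed cost


def summarize(records):
    """Aggregate per-task records into group-level metrics."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    elapsed = sum(r.elapsed_s for r in records)
    return {
        "pass_rate": sum(r.passed for r in records) / n,
        "error_rate": sum(r.runtime_error for r in records) / n,
        "fallbacks_per_task": sum(r.fallback_count for r in records) / n,
        "output_tokens_per_s": (
            sum(r.output_tokens for r in records) / elapsed if elapsed else 0.0
        ),
        "router_cost": sum(r.router_cost for r in records),
        "billed_cost": sum(r.billed_cost for r in records),
    }


# Illustrative records only.
records = [
    RunRecord("task-1", True, False, "provider-a/model-x", 0, 1200, 300, 10.0, 0.25, 0.25),
    RunRecord("task-2", False, True, "provider-b/model-y", 2, 900, 100, 10.0, 0.5, 0.75),
]
summary = summarize(records)
print(summary["pass_rate"], summary["output_tokens_per_s"])  # 0.5 20.0
```

Keeping the router estimate and the billed cost as separate fields makes any drift between the two visible in the summary.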

Promotion is justified when the group maintains the required outcome while meeting cost, latency, and reliability targets. A cheaper provider mix should be adopted when it passes the same objective criteria. A stronger target should remain available for workloads that need it, but it does not need to handle every request.
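
The promotion rule can be read as a conjunction: the outcome target and every cost, latency, and reliability target must hold at once. A minimal sketch under that reading follows; the class names, field names, and thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupTargets:
    # Hypothetical thresholds taken from a group contract.
    min_pass_rate: float
    max_cost_per_request: float
    max_p95_latency_s: float
    max_fallback_rate: float


@dataclass(frozen=True)
class Observed:
    pass_rate: float
    cost_per_request: float
    p95_latency_s: float
    fallback_rate: float


def failed_targets(observed: Observed, targets: GroupTargets) -> list[str]:
    """List which contract targets the observed run did not meet."""
    failures = []
    if observed.pass_rate < targets.min_pass_rate:
        failures.append("quality")
    if observed.cost_per_request > targets.max_cost_per_request:
        failures.append("cost")
    if observed.p95_latency_s > targets.max_p95_latency_s:
        failures.append("latency")
    if observed.fallback_rate > targets.max_fallback_rate:
        failures.append("reliability")
    return failures


def should_promote(observed: Observed, targets: GroupTargets) -> bool:
    """Promote only when no target is missed."""
    return not failed_targets(observed, targets)


# Illustrative numbers only.
targets = GroupTargets(0.9, 0.5, 30.0, 0.05)
cheaper_mix = Observed(pass_rate=0.95, cost_per_request=0.25,
                       p95_latency_s=20.0, fallback_rate=0.0)
slow_mix = Observed(pass_rate=0.95, cost_per_request=0.25,
                    p95_latency_s=45.0, fallback_rate=0.0)

print(should_promote(cheaper_mix, targets))  # True
print(failed_targets(slow_mix, targets))     # ['latency']
```

Returning the list of missed targets, rather than only a boolean, gives the reviewer the reason a cheaper mix was held back.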

Use the Prove Router Quality decision matrix when deciding whether to promote a group, keep a fixed model, split a workload, or collect more evidence.

Release Gates

Before a group receives broad caller access:

  • direct upstream smokes pass for each active target;
  • router-level smokes pass for each required API shape;
  • tool requests are validated with real tool calls when tools are advertised;
  • image requests are validated with realistic VLM budgets when image modality is advertised;
  • reasoning or thinking requests are validated for each advertised API shape and target control mode;
  • capped request behavior is tested for the caller API's output cap field, including OpenAI Chat max_completion_tokens;
  • usage rows include selected provider/model, token counts, status, latency, cache behavior, and cost fields;
  • Harbor or workload-specific validation meets the group success criteria;
  • a gate artifact records the run matrix, verifier/reward, client matrix, pass/fail thresholds, cost and latency results, selected upstream distribution, request IDs, and promotion or rollback decision;
  • rollback criteria are documented.
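
The checklist above mixes gates that always apply with gates that apply only when a capability is advertised. The sketch below shows one way to evaluate it; the gate identifiers and the `blocking_gates` helper are hypothetical and do not describe the output of the gate scripts.

```python
# Hypothetical gate identifiers derived from the checklist above.
ALWAYS_REQUIRED = (
    "direct_upstream_smokes",
    "router_smokes",
    "output_cap_behavior",
    "usage_rows_complete",
    "workload_validation",
    "gate_artifact_recorded",
    "rollback_criteria_documented",
)

# Gates that apply only when the group advertises the capability.
CONDITIONAL = {
    "tools": "tool_calls_validated",
    "image": "image_requests_validated",
    "reasoning": "reasoning_requests_validated",
}


def blocking_gates(results: dict, advertised: set) -> list:
    """Return gates that still block broad caller access.

    A gate with no recorded result is treated as failing.
    """
    required = list(ALWAYS_REQUIRED)
    required += [gate for cap, gate in CONDITIONAL.items() if cap in advertised]
    return [gate for gate in required if not results.get(gate, False)]


# Illustrative run: every unconditional gate passed, tool validation is missing.
results = {gate: True for gate in ALWAYS_REQUIRED}
print(blocking_gates(results, advertised={"tools"}))  # ['tool_calls_validated']
print(blocking_gates(results, advertised=set()))      # []
```

Treating a missing result as a failure keeps a group from being released because a check was skipped rather than passed.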

For local or CI checks where live Harbor is unavailable, use the mock fixture gate:

python3 scripts/evaluate_workload_gate_test.py

For a Harbor or workload run, generate a gate summary after the run results and safe usage rows are available:

python3 scripts/evaluate_workload_gate.py \
  --matrix examples/harbor-algotune-pca/workload_gate_matrix.json \
  --results examples/harbor-algotune-pca/runs/<CASE_ID>/results.tsv \
  --usage-json examples/harbor-algotune-pca/reports/<CASE_ID>/usage-rows.json \
  --out-md examples/harbor-algotune-pca/reports/<CASE_ID>/workload-gate.md

Ongoing Governance

Review model groups after provider price changes, model entitlement changes, provider instability, new client requirements, new modalities, or observed quality regressions. The router makes model substitution operationally easier, but the group contract determines whether a substitution is acceptable.
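
As one example of such a review, a provider price change can be checked against the group's cost target by recomputing the weighted cost per request across targets. This is a simplified sketch: it assumes traffic splits in proportion to target weight, and the target names, weights, and costs are illustrative units rather than real prices.

```python
def blended_cost_per_request(weights: dict, cost_per_request: dict) -> float:
    """Weighted average cost per request across a group's targets.

    Assumes requests are distributed in proportion to target weight.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total target weight must be positive")
    return sum(w * cost_per_request[name] for name, w in weights.items()) / total_weight


# Illustrative weights and costs (arbitrary cost units, not real prices).
weights = {"low-cost": 3.0, "strong": 1.0}
before = {"low-cost": 0.25, "strong": 1.0}
after = {"low-cost": 0.5, "strong": 1.0}   # low-cost provider raises its price

cost_target = 0.5
print(blended_cost_per_request(weights, before))                 # 0.4375
print(blended_cost_per_request(weights, after))                  # 0.625
print(blended_cost_per_request(weights, after) <= cost_target)   # False
```

In this example the price change pushes the group over its cost target, which is the kind of signal that should trigger a contract review rather than a silent substitution.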