Evaluate GenAI Smart Router
Use this checklist to evaluate GenAI Smart Router with a real deployment or a Metrum-managed evaluation instance. The goal is to verify client compatibility, governance, routing behavior, cost evidence, and operational trust signals before rollout. If the first question is whether the product fits a common enterprise concern, start with Enterprise FAQ. If the first question is where the router should run, start with Enterprise Deployment Patterns.
Need an evaluation environment or help choosing proof points? Contact contact@metrum.ai.
When the question is whether a routed group is as good as a fixed model or previous policy, use Prove Router Quality to run an evidence-first comparison instead of relying on anecdotes.
What To Prove First
| Question | Proof point |
|---|---|
| Can existing clients integrate with minimal change? | Run OpenAI Chat, OpenAI Responses, Anthropic Messages, or SDK/CLI traffic by changing base URL, token, and model group. |
| Are provider keys and private endpoints hidden? | Confirm clients receive only the router endpoint, router token, and model group names. |
| Can the platform control who uses what? | Call /v1/models with different tokens and verify allow-list filtering. |
| Are tool and image requests routed safely? | Run one tool call and one image request through a group with validated compatible targets. |
| Does the router explain spend and performance? | Generate or review a report with provider/model, tokens, cost, latency, throughput, cache, attempts, and fallback fields. |
| Can model groups be tuned by outcome? | Run Harbor or another verifier and compare reward/pass status against cost and latency. |
Commercial Evaluation Flow
Commercial evaluations can run on a Metrum-managed hosted service, an enterprise/on-prem licensed package, or a private customer-cloud deployment. The commercial path is:
- request an evaluation through contact@metrum.ai;
- receive a router endpoint and token, or a deployment package plus signed JSON license for customer-controlled infrastructure;
- validate production-like workloads and clients;
- inspect savings, performance, quota, provider mix, security access, and retention/rollup evidence;
- choose enterprise self-hosted, private managed, marketplace/private-offer, renewal, top-up, or other contracted commercial access.
Self-service portal checkout and automated license download are planned only for approved evaluation, pilot, renewal, or top-up packages and are not shipped until the licensing portal and Stripe fulfillment work is enabled. Public docs therefore use placeholders and contact links instead of publishing private hosted endpoints or internal deployment operations. Production enterprise access uses a signed license file with capability + time + volume entitlements combined with an annual, private managed, marketplace/private-offer, or volume-prepurchase commercial path.
Thirty-Minute Evaluation Path
- Receive a router base URL, a caller token, and one or more allowed model groups.
- Run
/v1/modelsand verify only allowed groups appear. - Run an OpenAI-compatible chat smoke from the Hosted Quickstart.
- Run the client that matters most: Codex CLI, Claude Code CLI, or an OpenAI-compatible SDK.
- Run one request-shape test: image input, tool call, quota/rate-limit behavior, or a deployment-specific policy route.
- Ask the administrator for the usage/report excerpt for the evaluation window.
- Confirm selected upstream provider/model, token counts, cost fields, latency, fallback behavior, and request status are visible without storing raw prompts or raw images.
This path should prove both caller-visible behavior and operator controls. The caller sees stable model-group names, compatibility with its chosen API shape, allow-list enforcement, and clear errors such as model-not-allowed or no-eligible-target. The operator sees request-time evidence for target selection, cost, latency, attempts, fallback, and quota impact.
Expected First Outputs
| Step | Successful result |
|---|---|
/v1/models | JSON list of model group IDs allowed for the token. |
| Chat smoke | Assistant returns the requested short text. |
| Codex or Claude Code smoke | CLI returns the requested text or creates the requested disposable smoke file. |
| Image smoke | Model reads the image through the same router group or returns no-eligible-target when the group has no validated image target. |
| Usage report | Report groups requests by caller, project, model group, provider/model, status, latency, tokens, cache, and cost. |
Proof Package To Request
- One text request through
/v1/chat/completions. - One Responses request or Codex CLI task.
- One Anthropic Messages request or Claude Code task.
- One tool-call request using the API shape your client sends.
- One image request through the same group users would normally call.
- A
/v1/modelsallow-list check for the evaluation token. - A usage/report excerpt for the test window.
- A safe license-status summary when evaluating an enterprise or on-prem licensed deployment.
- A security summary covering provider-key handling, diagnostics redaction, metrics-admin isolation, and private-upstream network controls.
- A retention and rollup summary explaining raw operational rows, finalized daily rollups, legal holds, archived exports, and dry-run-only purge status.
- A rollback plan for disabling a target or reducing its weight.
Outcome-Based Model Validation
Model routing should be evaluated by task success, not only by model reputation or token price. Harbor is one useful coding-agent harness, but any verifier that reflects real work can be used:
- unit tests for coding tasks;
- extraction accuracy checks;
- OCR target answers;
- tool-call correctness assertions;
- browser-control tasks;
- internal golden datasets;
- product acceptance tests.
The target state is a model group that preserves the required task outcome while improving cost, latency, reliability, or provider optionality.
For router-versus-fixed-model comparisons, pre-register the task set, control, metrics, seed policy, config version or safe routing/config summary, and rollout decision rule using the Prove Router Quality template.
Common Evaluation Failures
| Symptom | Likely meaning | Next step |
|---|---|---|
401 or 403 before any model call | Token or endpoint mismatch | Verify base URL and router token. |
Expected group missing from /v1/models | Caller token is not allowed to use it | Ask the administrator to update access if intended. |
no-eligible-target for tools or images | Group exists, but no target satisfies that request shape | Use a compatible group or add a validated target. |
| Timeout on a long agent task | Upstream or per-attempt timeout is too small for the workload | Inspect request attempts and adjust target mix or timeout. |
| A cheaper group fails the verifier | The group does not satisfy that workload contract | Keep it for simpler tasks or promote a stronger target for that workload. |
See Error Reference for structured error types.