Metrum GenAI Smart Router
Metrum GenAI Smart Router is the governed gateway layer for enterprise LLM, VLM, and AI agent traffic. Applications, developer tools, and coding agents call one stable OpenAI-compatible or Anthropic-compatible endpoint while provider credentials, model selection, routing policy, quotas, cache rules, telemetry, and request-time cost accounting stay server-side.
The product is strongest when platform teams need to move quickly across model providers without asking every application or agent user to keep switching endpoints, raw model IDs, or provider keys. A deployment can route ordinary chat, coding-agent tool calls, image/VLM requests, private GPU endpoints, and cost-sensitive production traffic through the same control point.
These docs are built into the hosted GenAI Smart Router server delivered for your deployment. Examples that show the router base URL use this browser origin, so on this deployment they render as https://your-router.example.com and https://your-router.example.com/v1.
Interested in deploying GenAI Smart Router for your organization? Contact contact@metrum.ai.
Why Teams Deploy It
AI applications rarely standardize on one model forever. Model quality, latency, price, availability, context windows, tool behavior, image support, and account entitlements change over time. Hard-coding provider endpoints and provider keys into every client creates migration cost and weakens governance.
GenAI Smart Router lets platform teams define deployment-owned model groups as quality and cost contracts. Each group has intended workloads, allowed clients, API shapes, modalities, success criteria, quotas, and an upstream provider/model mix that can evolve behind one stable caller-facing name.
| Outcome | What changes |
|---|---|
| Keep clients stable while providers change | Apps, SDKs, Codex CLI, Claude Code, and agent frameworks keep calling one router endpoint and one allowed model group. |
| Lower cost without losing task success | Route simple work to lower-cost targets and reserve stronger targets for requests that need them, using outcome evaluation instead of guesswork. |
| Govern access centrally | Caller tokens enforce model-group allow lists, RPM/TPM limits, concurrency caps, traffic shaping, budgets, and metrics-admin separation before any provider call. |
| Route by real request requirements | Tool-bearing, image-bearing, capped, and dialect-specific requests are filtered to targets validated for that shape. |
| Use private and hosted models together | vLLM, SGLang, Baseten-style, OpenRouter-style, and direct provider targets can participate in one deployment policy while private endpoints stay hidden. |
| Make spend and performance explainable | Usage rows and reports preserve selected provider/model, token counts, image fields, cache behavior, latency, attempts, fallbacks, and request-time cost. |
How It Works
The caller asks for a model group, not a raw provider model. The router checks the caller token, filters targets that cannot satisfy the request shape, selects an eligible target with the configured policy, injects the provider credential server-side, returns the response in the caller's expected API shape, and records usage for governance and triage.
Product Strengths
- Multi-dialect gateway: OpenAI Chat Completions, OpenAI Responses, and Anthropic Messages surfaces for ordinary apps, Codex CLI, Claude Code, and compatible agent frameworks.
- Model-group contracts: deployment-defined names encode workload, cost, quality, API shape, modality, and validation expectations.
- Validated eligibility filtering: text, image/VLM, tool-call, dialect, and max-token-cap metadata prevent unsafe target selection.
- Programmable policy: static, weighted, failover, dynamic-score, TypeScript-scripted, and external-policy routing patterns.
- Cost governance: per-key access and budgets plus request-time price/cost storage, including image cost fields and upstream-reported billed cost when available.
- Operational visibility: request logs, relational usage data, diagnostics child tables, optional governed content-capture tables, report examples, downstream and upstream latency/throughput views, and Prometheus telemetry restricted to metrics-admin tokens.
- Private upstream control: internally hosted OpenAI-compatible inference services can sit behind the same caller API as external providers.
- Outcome validation: Harbor or any workload-appropriate verifier can compare success, cost, token volume, latency, fallbacks, and provider mix before promotion.
Model Groups
Callers request stable model groups defined by the deployment. Names such as default, fast, small, medium, high, big-coder, or vision appear in some examples because they are used by one reference or hosted deployment; GenAI Smart Router does not require those names.
Not every task needs the most expensive model. A well-designed deployment uses objective evaluation to keep each model group successful for its intended workload while routing simpler work to lower-cost targets and reserving stronger targets for requests that need them.
See Concepts And Glossary for the terms used throughout these docs, and Model Group Quality Criteria for the contract fields to define before rollout.
Evaluate The Product
For a first evaluation, start with the Enterprise FAQ and Commercial Evaluation Path, then choose whether the proof should run on a Metrum-managed hosted endpoint, an enterprise/on-prem licensed package, or a private customer-cloud deployment. Then run one path through the gateway and inspect both the caller result and the operator evidence:
- Call
/v1/modelswith the router token to discover allowed groups. - Run an OpenAI-compatible chat request from the Hosted Quickstart.
- Run the relevant agent workflow: Codex CLI, Claude Code CLI, or an OpenAI-compatible SDK.
- Exercise one request-shape control: tool call, image input, quota, or a deployment-specific model group.
- Review usage, latency, selected provider/model, fallback behavior, and cost using Report Examples.
Evaluate GenAI Smart Router gives a complete technical checklist and proof points to request. Self-service portal checkout is planned only for approved evaluation, pilot, renewal, or top-up packages and is not shipped until the licensing portal and Stripe fulfillment work is enabled; for hosted evaluations, private managed deployments, enterprise licenses, marketplace/private-offer procurement, and volume-prepurchase access, contact contact@metrum.ai. See Choose a Deployment Path.
The expected guarantee is concrete: a caller can discover the groups its token is allowed to use, send the same client request shape it will use in production, and receive either a compatible response from a validated target or a clear router error before an unsafe upstream call is attempted. Operators should be able to show which provider/model served the request, why fallback did or did not happen, and what cost and latency were recorded for that request window.
Proof Points
The docs include an outcome-based Harbor coding-agent case study that compares Codex CLI and Claude Code CLI across router model groups. It records the task goal, verifier reward score, models used, tokenomics, cache behavior, latency, throughput, fallback behavior, and provider/model usage. The same pattern can be applied to extraction accuracy checks, OCR targets, unit tests, browser-control tasks, golden datasets, or product acceptance tests.
- Read the Harbor case study
- Review product capabilities
- Review cost governance
- Review security and trust posture
- Compare gateway fit
Where To Go Next
| Role | Start here |
|---|---|
| Buyer or evaluator | Enterprise FAQ, Evaluation And Case Studies, Commercial Evaluation Path, Solution Brief, Security And Trust |
| Application developer | Hosted Quickstart, API Compatibility, Available Models And Access, Troubleshooting |
| Coding-agent user | Agents, Tools, And Vision, Codex CLI, Claude Code CLI, Structured Outputs |
| Platform administrator | Installation, Licensing, Router Configuration, Routing, Providers And Models, Usage, Cost, And Reports |
API Surfaces
The router supports the common LLM API surfaces used by modern tools:
| Surface | Typical clients |
|---|---|
/v1/chat/completions | OpenAI-compatible chat clients |
/v1/responses | Codex CLI and Responses-compatible clients |
/v1/messages | Claude Code and Anthropic-compatible clients |
/v1/models | Model discovery filtered by caller token allow list |
/v1/usage | Caller quota/usage lookup |
/metrics | Prometheus telemetry for caller subjects authorized for metrics read |
Endpoint Example
export ROUTER_BASE_URL="https://your-router.example.com"
export ROUTER_TOKEN="rtr_metrum_<user>_<project>_<env>_<key>_<secret>"
export ROUTER_MODEL="<allowed-model-group>"
Router-issued tokens are customer-specific. For deployment access or an evaluation environment, contact contact@metrum.ai.