Enterprise Deployment Patterns
GenAI Smart Router is deployment-owned infrastructure. It is not a one-size-fits-all shared public API: enterprises can run it as self-hosted infrastructure, a private managed dedicated deployment, or an evaluation-hosted endpoint while they prove client compatibility and model-group quality.
Model groups, provider credentials, private upstreams, caller access, telemetry retention, and routing strategy are deployment-defined. The product principle is that the customer or operating team owns its routing destiny. The router supplies the control surface, compatibility layer, evidence, and enforcement points so teams can evolve provider/model choices without rewriting every client.
From Operations, you might be looking for install paths: see Docker Compose, Kubernetes, or Binary Installation.
For installation mechanics and deployment-shape selection, see Installation. For licensing and commercial paths, see Choose a Deployment Path.
Pattern A: Metrum-Managed Evaluation Endpoint
Use a Metrum-managed evaluation endpoint when the fastest proof is more valuable than standing up customer infrastructure first.
Best fit:
- quick OpenAI Chat, OpenAI Responses, Codex CLI, Anthropic Messages, or Claude Code compatibility checks;
- model-group quality proof against real workload samples;
- spend, latency, provider/model mix, and fallback report evidence;
- pilot handoff before a self-hosted or private managed production deployment.
The evaluator receives a deployment-specific base URL, router token, and allowed model groups. Any example group names are hosted/reference examples only; /v1/models is the caller-facing source of truth for the groups allowed to that token.
Acceptance evidence to request:
/v1/modelsoutput for the evaluation token;- one
/v1/chat/completionssmoke; - one OpenAI Responses or Codex CLI smoke;
- one Anthropic Messages or Claude Code smoke when that client matters;
- one report excerpt showing provider/model, latency, tokens, cost, status, attempts, and fallback behavior;
- one security and retention summary covering provider-key handling, diagnostics redaction, metrics-admin isolation, and content-retention policy.
This pattern is for evaluation or a contracted private managed service. It is not a public shared multitenant inference product.
Pattern B: Enterprise Self-Hosted Central Gateway
Use one central router deployment per enterprise environment, VPC, or network trust boundary when the platform team owns GenAI access for many applications.
Apps call the router instead of provider APIs. The central platform team owns provider credentials, caller tokens, model-group access, metrics-admin isolation, retention policy, and the license file. Teams request allowed groups, and the platform tunes targets, weights, fallbacks, and validation metadata behind those groups.
This pattern works well when provider keys must stay server-side, private upstreams must remain on internal networks, and finance or platform operations need chargeback-style reports across teams.
Pattern C: Per-Environment Routers
Use separate dev, staging, and production routers or configs when provider activation and weight changes need a promotion path.
Keep test provider keys separate from production BYOK credentials where policy requires it. Use separate usage databases or state stores when the reporting, retention, or license envelope differs by environment. Validate new providers, model IDs, weights, tool metadata, image metadata, and routing scripts in staging before promotion.
Rollout and rollback flow:
- Update staging config and run
/readyz,/v1/models, Chat, Responses, Messages, tool, image, report, and license smokes that match the change. - Capture provider/model selection, status, latency, usage, and request IDs for the test window.
- Promote the reviewed config or package to production with a timestamped backup.
- Repeat the same production smokes.
- Roll back by restoring the previous package/config/license input and rerunning the failed smoke.
Pattern D: Per-Team Or Per-Business-Unit Routers
Use separate router instances for teams that need independent provider keys, cost centers, retention policy, private upstreams, release cadence, or regional controls.
Caller users, projects, environments, and API keys map to reporting and access. A single team router can expose multiple model groups for that team's workloads, and reports can still separate usage by caller, project, client, model group, provider/model, latency, cost, and status.
Model group names are deployment-defined. Do not bake example names such as default, fast, high, big-coder, or vision into application logic as product constants. Clients should discover allowed groups with /v1/models for their token.
Pattern E: Hierarchical Or Federated Routers
Hierarchical and federated topologies are supported conceptually through standard API boundaries, even when there is not one turnkey config that defines every enterprise variant. A team router can call a central enterprise router as an upstream OpenAI-compatible service, or multiple team routers can sit behind central ingress and governance.
Useful cases:
- central enterprise policy with team-local model-group strategy;
- regional or data-residency routers that forward only eligible traffic;
- a private GPU router exposed as an upstream to a central router;
- migration from a Metrum-managed pilot to customer-owned production;
- blue/green or canary router instances.
Responsibility boundaries:
- store upstream-router auth tokens only in the downstream router's protected environment or secret manager;
- define model-group access at each hop, because the app's token and the downstream router's upstream token are separate trust decisions;
- propagate or record request IDs so reports can be correlated across hops without storing prompts or responses;
- budget timeouts across the full path so one hop does not consume all caller patience;
- avoid prompt, image, response, or tool-output capture unless a governed content-capture policy explicitly enables it;
- keep
/metricsrestricted to metrics-admin subjects and do not add tenant data labels to unauthenticated or ordinary-caller endpoints.
For hierarchical production readiness, test authentication, model access, request ID correlation, timeout budgets, failure behavior, and reporting at every hop before shifting real traffic.
Pattern F: Private Managed Dedicated Deployment
Use a private managed dedicated deployment when one customer wants a dedicated router instance operated for them instead of running the service themselves.
One customer or contracted customer environment maps to one dedicated deployment. Provider-cost handling, BYOK scope, network isolation, reporting, retention, and acceptance tests are defined in the managed-service plan. Public docs intentionally keep private operational hostnames, SSH procedures, token files, and production-only runbooks out of this page.
See Choose a Deployment Path, Enterprise Private Managed, and License-Protected Deployments.
Pattern Selection Table
| Concern | Recommended pattern | Proof to request | Operational owner |
|---|---|---|---|
| Data residency | Per-environment, per-region, or federated routers | Region-specific routing policy, provider/upstream inventory, report scope, and failure test | Platform plus regional compliance owner |
| BYOK | Self-hosted central gateway or private managed dedicated deployment | Provider-key custody summary and one caller smoke that never exposes upstream keys | Customer platform or managed-service operator |
| Private upstreams | Self-hosted central gateway, per-team router, or federated private GPU router | Direct upstream smoke, router-level smoke, network boundary summary, and rollback plan | Platform or owning ML infrastructure team |
| Cost attribution | Central gateway or per-team routers | Usage/report excerpt grouped by caller, project, model group, provider/model, tokens, latency, and cost | Platform FinOps or team operations |
| Evaluation speed | Metrum-managed evaluation endpoint | /v1/models, Chat, Responses/Codex, Messages/Claude Code, report excerpt, and retention summary | Metrum evaluation operator plus customer evaluator |
| Central governance | Enterprise self-hosted central gateway | Caller allow-list test, metrics-admin isolation, report-admin authorization, and license status | Enterprise platform team |
| Team autonomy | Per-team routers or hierarchical routers | Team-local model group config, team report excerpt, and central policy compatibility smoke | Team platform owner with central governance review |
| DR/region | Per-environment or federated routers | Backup/restore, regional failover, timeout, and rollback tests | Platform SRE |
| Compliance/audit | Self-hosted central gateway, per-environment routers, or private managed dedicated deployment | Security assessment, retention summary, admin/report authorization test, and sanitized diagnostics review | Security, compliance, and platform operations |
| Heavy coding-agent workloads | Central, per-team, or hierarchical routers with validated agent groups | Codex/Claude Code file-edit smoke, tool-call evidence, model-group quality contract, latency/cost report | Developer platform or AI engineering team |
What To Test Before Production
Run the smokes that match the selected pattern and the API shapes clients will actually use.
Core production checklist:
/readyzand/version;/v1/modelswith each caller class;- OpenAI Chat text request;
- OpenAI Responses request and Codex CLI smoke when Responses clients are in scope;
- Anthropic Messages request and Claude Code smoke when Messages clients are in scope;
- OpenAI Chat, Responses, or Anthropic tool-call smoke for every claimed tool dialect;
- image/VLM smoke for groups that accept image input;
- usage/report excerpt for the test window;
- license status and one licensed-feature smoke for licensed deployments;
- backup/restore of config, state, license state, and usage DB according to the deployment policy;
- rollback to the previous package or config and rerun of the failed smoke.
For hierarchical deployments, also test:
- request ID propagation or report-correlation fields across hops;
- timeout budget across app, team router, enterprise router, and upstream provider;
- authentication at both router hops;
- model-group access at both hops;
- failure behavior when the upstream router returns
401,403,429, timeout,no-eligible-target, or provider failure.