Policy-Driven Model Routing: Selecting the Right LLM Per Request

The first LLM integration is usually simple: you pick one provider, one model, one API key, and ship. The second and third are where things start to hurt.

Suddenly you have a “cheap” model for bulk tasks, a “smart” model for critical flows, maybe a vision model, maybe a provider change for one region. Routing logic leaks into handlers, feature flags, and tests. A simple “call the model” turns into nested if tenant == ... and feature == ... branches scattered across the codebase.

This post is about treating model choice as a first-class concern.

We’ll walk through how to build a policy-driven routing layer—an AI gateway that picks the right LLM per request based on feature, tenant tier, risk profile, latency budget, and cost constraints. Along the way, we’ll see how to keep it:

  • Config-driven, not hardcoded
  • Safe, with fallbacks and experiments
  • Observable, so every routing decision is explainable

And we’ll ground it in a real product context: Hoomanely’s AI stack powering pet wellness experiences.


The Problem: One “Default Model” Doesn’t Scale

At small scale, a single default model is fine. But as your AI surface area grows, three constraints collide:

Cost

    • Drafting emails and giving safety-critical health advice should not be served by models billed at the same $/1K-token rate.
    • Some tenants can pay for premium quality; some can’t.

Latency

    • A 2-second multi-turn assistant is acceptable.
    • A 2-second autocomplete in a mobile app is not.

Risk profile

    • A misphrased marketing tagline is annoying.
    • A misworded escalation or health recommendation can be damaging.

If you route everything through a single “best” model, you either:

  • Overpay, because you’re using an over-powered model for trivial tasks, or
  • Underperform, because you’re using a cheap model where you really needed reliability, context handling, or better safety tooling.

This starts as a quick patch and ends as if–else spaghetti across services, with no single place to answer:

  • Why did this request hit that model?
  • What’s the cost distribution per feature and tenant?
  • Can we turn on a canary for one segment?

We need a routing layer that centralizes these decisions.


Concept: What Is Policy-Driven Model Routing?

Policy-driven model routing means:

Given a request with known attributes (feature, user, risk, budget, etc.), choose which model to call using declarative policies instead of ad-hoc code.

You define policies like:

  • “For free-tier users on bulk summarization, use cheap-4k up to 512 tokens.”
  • “For enterprise incident reports, always use premium-32k and enable safety filters.”
  • “During this canary experiment, 10% of chat_support traffic goes to candidate-8b.”

The routing layer becomes an AI gateway:

  1. Receives a typed, normalized inference request
  2. Evaluates policies over request attributes
  3. Picks a model (or chain of models)
  4. Executes with fallbacks and logs why that decision was made

This gives you:

  • Separation of concerns – app code says what task it wants, policies decide which model.
  • Easier evolution – changing routing rules is a config change, not a deploy.
  • Better observability – you can answer “what policy fired?” in one place.

Architecture: A Python-Based AI Gateway with Routing

Let’s start from a high-level data flow. Imagine your stack:

  • Mobile/web apps
  • Backend APIs (Python / Node / Go, etc.)
  • AI gateway service (Python-based)
  • Multiple LLM providers (Bedrock, OpenAI, Anthropic, internal models, etc.)

The AI gateway owns:

  1. Request normalization
  2. Policy evaluation
  3. Model client abstraction
  4. Fallbacks & experiments
  5. Telemetry & audit log

At a high level:

The app sends a logical request to the AI gateway:

{
  "feature": "session_summary",
  "tenant_id": "acme_inc",
  "user_tier": "pro",
  "risk_level": "medium",
  "input_size": 2300,
  "timeout_ms": 2500,
  "payload": { "...": "..." }
}

The gateway normalizes it and enriches it with metadata:

    • Tenant config
    • User tier
    • Region, platform
    • Historical signals (e.g., spend YTD)

The policy engine runs:

    • Evaluates rules and picks a model config:
      • model_id = "vendor_x::smart-16k"
      • max_tokens = 1024
      • safety_preset = "standard"

The model client executes:

    • Calls the provider
    • Handles retries / timeouts / fallbacks

The router logs the decision:

    • Policy ID, matched conditions, chosen model
    • Latency, tokens, success/failure

Policy Design: Encoding Product Decisions as Config

The heart of the system is the policy model: how you declare rules and resolve conflicts.

A practical approach is a tiered policy evaluation:

  1. Feature-level policy – what’s the default behavior for this feature?
  2. Tenant / tier overrides – what changes for paid plans or specific customers?
  3. Risk & SLA constraints – what if latency or safety must be prioritized?
  4. Experiment flags – A/B tests, canaries, or traffic splits.

A simple YAML/JSON structure might look like:

policies:
  - id: "feature.chat.default"
    when:
      feature: "chat_assistant"
    use_model: "vendor_a::chat-8k"
    max_tokens: 1024

  - id: "feature.chat.enterprise_upgrade"
    when:
      feature: "chat_assistant"
      tenant_tier: "enterprise"
    use_model: "vendor_b::chat-premium-32k"
    max_tokens: 2048

  - id: "feature.summary.low_risk_bulk"
    when:
      feature: "bulk_summary"
      risk_level: "low"
    use_model: "vendor_c::cheap-4k"
    max_tokens: 512

You can extend this with conditional expressions:

  - id: "latency_sensitive_fast_path"
    when:
      feature: "inline_assist"
      max_latency_ms: "<=1500"
    prefer_models:
      - "vendor_a::fast-4k"
      - "vendor_a::fast-1k"

The engine should:

  • Match all applicable policies
  • Resolve with priority or specificity rules (e.g., most specific when wins)
  • Attach metadata (like max tokens, safety preset, temperature, etc.)

Tenants, Tiers, and SLAs

In multi-tenant products, routing is a powerful way to encode SLAs:

  • Free tier: low-cost model, stricter rate limits, lower max tokens.
  • Pro tier: balanced model, moderate context window.
  • Enterprise: premium models, longer context, higher timeouts, better fallbacks.

Those decisions stay outside your app handlers; the handler just says:

router.infer(
    feature="coach_answer",
    tenant_id=current_tenant.id,
    payload=...
)

At Hoomanely, we build AI features around pet health, nutrition, and daily routines. Downstream, this includes:

  • Light-weight explanations (“What does this step count mean today?”)
  • Deeper, multi-signal insights (“How is activity + food + temperature trending?”)
  • User-facing guidance (“What should I adjust in my dog’s feeding schedule?”)

Here, policy-driven routing helps us:

  • Use cost-efficient models for bulk summarization of sensor events.
  • Reserve stronger models for flows where wording and nuance matter more (e.g., health-related nudges, long-form explanations).
  • Gradually introduce new model families (e.g., better at time-series reasoning) via canaries for specific segments before rolling out broadly.

We rarely want to hardcode “use Model X for EverBowl and Model Y for EverSense”. Instead, we route by feature category (e.g. “daily insight”, “trend explanation”, “setup assistant”) and risk level, so new devices and features can plug in without rewriting routing logic.


Implementing the Policy Engine (Python Sketch)

You don’t need a full rule engine to get started. A structured, testable approach is:

Request Envelope

Create a typed request object that every caller uses:

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class InferenceRequest:
    feature: str
    tenant_id: str
    user_tier: str
    risk_level: str
    input_size: int
    timeout_ms: int
    payload: Dict[str, Any]
    metadata: Optional[Dict[str, Any]] = None

All routing decisions must be derivable from this plus static config.

Policy Representation

Load your YAML/JSON into normalized in-memory rules:

@dataclass
class Policy:
    id: str
    conditions: Dict[str, Any]   # e.g. {"feature": "chat_assistant", "user_tier": "enterprise"}
    config: Dict[str, Any]       # e.g. {"use_model": "vendor_a::chat-8k"}
    priority: int = 0
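
If the policies live in YAML, the loader can stay small. A minimal sketch, assuming PyYAML and the file layout shown earlier (load_policies and the optional priority key are our own additions):

import yaml  # PyYAML; assumed to be available

def load_policies(path: str) -> list:
    """Parse the policy file into Policy objects, separating match conditions from model config."""
    with open(path) as f:
        raw = yaml.safe_load(f) or {}

    policies = []
    for entry in raw.get("policies", []):
        # Everything that isn't the id, the match conditions, or the priority is model config.
        config = {k: v for k, v in entry.items() if k not in ("id", "when", "priority")}
        policies.append(Policy(
            id=entry["id"],
            conditions=entry.get("when", {}),
            config=config,
            priority=entry.get("priority", 0),
        ))
    return policies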

Matching Logic

Simple matching can be “all key-value pairs equal” plus some helpers for ranges:

def matches(policy: Policy, req: InferenceRequest) -> bool:
    for key, expected in policy.conditions.items():
        actual = getattr(req, key, None)
        if isinstance(expected, dict) and "lte" in expected:
            # Range condition, e.g. {"input_size": {"lte": 2048}}
            if actual is None or not (actual <= expected["lte"]):
                return False
        elif actual != expected:
            return False
    return True

Then evaluation:

def select_policy(policies, req: InferenceRequest) -> Policy:
    candidates = [p for p in policies if matches(p, req)]
    if not candidates:
        raise RuntimeError("No matching policy")
    # pick highest priority, then most specific (more conditions)
    candidates.sort(key=lambda p: (p.priority, len(p.conditions)), reverse=True)
    return candidates[0]

The output of the policy engine is a model invocation plan, not just a string:

from typing import List

@dataclass
class ModelPlan:
    model_id: str
    max_tokens: int
    temperature: float
    safety_preset: str
    fallbacks: List[str]  # ordered model_ids to try if the primary fails

This keeps everything explicit and testable.
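
One way to produce that plan from a matched policy, with illustrative defaults (build_plan is a sketch, not a fixed API):

def build_plan(policy: Policy) -> ModelPlan:
    """Turn a matched policy's config into an explicit invocation plan."""
    cfg = policy.config
    return ModelPlan(
        model_id=cfg["use_model"],
        max_tokens=cfg.get("max_tokens", 1024),      # illustrative defaults, not prescriptive
        temperature=cfg.get("temperature", 0.2),
        safety_preset=cfg.get("safety_preset", "standard"),
        fallbacks=cfg.get("fallbacks", []),
    )

A router's infer() then reduces to select_policy, build_plan, and execute, each of which can be unit-tested in isolation.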


Fallbacks, Canaries, and A/B Tests

Routing isn’t just “pick one model”. It’s also about what happens when things go wrong and how to introduce change safely.

Fallback Strategies

Common fallbacks:

  • Same provider, smaller context – if premium-32k times out, retry with smart-8k.
  • Different provider, similar spec – if vendor X has an outage, switch to vendor Y.
  • Heuristic fallback – degrade to simpler behavior (e.g., shorter summary) instead of full feature.

Policies can include fallback chains:

  - id: "session_summary_pro"
    when:
      feature: "session_summary"
      user_tier: "pro"
    use_model: "vendor_a::smart-16k"
    fallbacks:
      - "vendor_a::smart-8k"
      - "vendor_b::chat-16k"

The router’s execution path:

  1. Try smart-16k with given timeout.
  2. If timeout or specific error class, try smart-8k.
  3. If provider unavailable, try vendor_b::chat-16k.
  4. Log which step succeeded.
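
A minimal sketch of that loop, assuming each model_id maps to a provider client exposing a generate(...) call and raising a ModelCallError on timeouts or outages (all of these names are illustrative):

import time

class ModelCallError(Exception):
    """Illustrative error type a provider client might raise on timeouts or outages."""

def execute_with_fallbacks(plan: ModelPlan, req: InferenceRequest, clients: dict):
    """Try the primary model, then each fallback in order; report which step succeeded."""
    attempts = [plan.model_id] + list(plan.fallbacks)
    last_error = None
    for step, model_id in enumerate(attempts):
        client = clients[model_id]  # assumed mapping of model_id -> provider client
        started = time.monotonic()
        try:
            response = client.generate(          # assumed client interface
                req.payload,
                max_tokens=plan.max_tokens,
                timeout_ms=req.timeout_ms,
            )
            latency_ms = int((time.monotonic() - started) * 1000)
            return response, {"model_id": model_id, "fallback_step": step, "latency_ms": latency_ms}
        except ModelCallError as exc:
            last_error = exc  # fall through to the next model in the chain
    raise RuntimeError(f"All models failed for feature {req.feature!r}") from last_error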

Canaries and A/B Testing

You’ll often want to compare two models on real traffic:

  • New, cheaper model vs current expensive one
  • New provider vs existing one
  • Different safety tuning or system prompts

Introduce a traffic split policy:

  - id: "chat_candidate_canary"
    when:
      feature: "chat_assistant"
      user_tier: "pro"
    experiment:
      name: "chat_model_v2_canary"
      variants:
        - model_id: "vendor_a::chat-8k"
          weight: 0.8
        - model_id: "vendor_c::candidate-12b"
          weight: 0.2

The router:

  • Deterministically assigns a user or session to a variant (e.g., hash of tenant_id + user_id).
  • Logs variant name, chosen model, and metrics.
  • Lets you compare cost, latency, and user outcomes.

Because this is config, not code, product and infra teams can collaborate without stepping on each other.
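
The assignment itself can be a small pure function. A sketch, hashing the experiment name plus tenant and user IDs into a stable bucket (choose_variant is our own name; the variants mirror the YAML above):

import hashlib

def choose_variant(experiment_name: str, tenant_id: str, user_id: str, variants: list) -> dict:
    """Deterministically map a (tenant, user) pair to a weighted variant."""
    key = f"{experiment_name}:{tenant_id}:{user_id}".encode()
    # Stable hash -> float in [0, 1]; the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant in variants:
        cumulative += variant["weight"]
        if bucket < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding

# e.g. choose_variant("chat_model_v2_canary", "acme_inc", "user_42",
#                     [{"model_id": "vendor_a::chat-8k", "weight": 0.8},
#                      {"model_id": "vendor_c::candidate-12b", "weight": 0.2}])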


Observability: Making Routing Decisions Explainable

Routing is only useful if you can inspect it when something breaks.

At minimum, every request should emit a structured decision log:

{
  "trace_id": "abc-123",
  "feature": "session_summary",
  "tenant_id": "acme_inc",
  "user_tier": "pro",
  "policy_id": "session_summary_pro",
  "chosen_model": "vendor_a::smart-16k",
  "fallback_used": false,
  "experiment": null,
  "latency_ms": 842,
  "prompt_tokens": 1800,
  "completion_tokens": 400,
  "cost_usd": 0.011,
  "timestamp": "2025-11-26T13:05:45Z"
}
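
Emitting this can be a thin wrapper over standard logging. A sketch, reusing the request, policy, and plan types from earlier and assuming an outcome dict like the one returned by the fallback loop above (log_decision and the logger name are ours):

import json
import logging

decision_logger = logging.getLogger("routing_decision")

def log_decision(req: InferenceRequest, policy: Policy, plan: ModelPlan, outcome: dict) -> None:
    """Emit one structured line per request so dashboards can aggregate routing decisions."""
    decision_logger.info(json.dumps({
        "trace_id": (req.metadata or {}).get("trace_id"),
        "feature": req.feature,
        "tenant_id": req.tenant_id,
        "user_tier": req.user_tier,
        "policy_id": policy.id,
        "chosen_model": outcome.get("model_id", plan.model_id),
        "fallback_used": outcome.get("fallback_step", 0) > 0,
        "latency_ms": outcome.get("latency_ms"),
    }))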

From this, you can build dashboards answering:

  • Which policies fire most often?
  • How is traffic distributed across models?
  • Where do fallbacks trigger?
  • Which experiments are succeeding?

In more advanced setups, you can:

  • Capture input shape metrics (e.g., input length, language, device type) to see where a model struggles.
  • Attach user outcomes (clicks, likes, follow-up actions) for offline analysis.
  • Run offline replays: “How would this week’s traffic look if routed to Model Z?”

Avoiding Anti-Patterns

Policy Explosion

If every small nuance becomes its own policy, you’ll drown in config.
Guardrails:

  • Start with feature-level policies.
  • Introduce tenant or tier overrides only when truly needed.
  • Define defaults (e.g., default latency budgets per feature).

Hidden Routing in App Code

Once you introduce the router, don’t keep sprinkling extra if conditions in handlers:

Bad:

if feature == "session_summary" and user_tier == "enterprise":
    router.infer(..., model_override="premium")

Better:

router.infer(feature="session_summary", tenant_id=tenant.id, payload=...)

Use overrides sparingly, and always in a structured way (e.g., debug_force_model for QA-only workflows).
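
One hedged way to keep that structured is a thin wrapper that only honors the override in QA contexts (the parameter names and the is_qa_session gate are illustrative):

def infer_for_request(router, feature, tenant_id, payload,
                      debug_force_model=None, is_qa_session=False):
    """Only honor a forced model in QA sessions; production traffic always goes through policies."""
    override = debug_force_model if (debug_force_model and is_qa_session) else None
    return router.infer(
        feature=feature,
        tenant_id=tenant_id,
        payload=payload,
        model_override=override,  # the router should log any override so it stays auditable
    )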

No Tests for Policies

Policies are part of your behavior contract. Treat them like code:

  • Unit tests for policy matching (given request → expect model plan).
  • Snapshot tests for “golden” scenarios.
  • Validations on startup (no dangling model IDs, no overlapping rules without priority).

A simple testing pattern:

def test_enterprise_chat_uses_premium():
    req = InferenceRequest(
        feature="chat_assistant",
        tenant_id="bigco",
        user_tier="enterprise",
        risk_level="medium",
        input_size=800,
        timeout_ms=3000,
        payload={}
    )
    policy = select_policy(ALL_POLICIES, req)
    assert policy.config["use_model"] == "vendor_b::chat-premium-32k"

This keeps your routing layer from silently drifting.
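
The startup validations from the list above can also stay small. A sketch, assuming known_model_ids is the set of model IDs your provider clients actually register:

def validate_policies(policies, known_model_ids: set) -> None:
    """Fail fast at startup on dangling model IDs or overlapping rules without a priority."""
    seen = {}
    for p in policies:
        model = p.config.get("use_model")
        if model and model not in known_model_ids:
            raise ValueError(f"Policy {p.id} references unknown model {model!r}")
        for fb in p.config.get("fallbacks", []):
            if fb not in known_model_ids:
                raise ValueError(f"Policy {p.id} has unknown fallback model {fb!r}")
        # Two policies with identical conditions must differ in priority to be resolvable.
        key = repr(sorted(p.conditions.items()))
        if key in seen and seen[key].priority == p.priority:
            raise ValueError(f"Policies {seen[key].id} and {p.id} overlap without distinct priorities")
        seen[key] = p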

Adopting This Incrementally

  1. Wrap existing calls behind a router interface

    • router.infer(...) that currently just calls your old default model.
    • No behavior change yet.

  2. Introduce a simple policy file

    • Start with 1–2 features (e.g., bulk summarization and chat).
    • Add policies that reproduce existing behavior.

  3. Add one new routing dimension

    • For example: user tier → use different models for free vs. enterprise.
    • Roll out to a small subset of tenants.

  4. Add fallbacks and experiments

    • Configure fallback chains for the most critical features.
    • Start one canary experiment with clear monitoring.

  5. Gradually migrate remaining features

    • Over time, require all new features to define a feature code and routing policy before they ship.

Within a few iterations, you’ll find that “Which model should we use?” becomes a conversation about policies and SLAs, not copy-pasted API keys.


Takeaways

Policy-driven model routing is less about fancy rule engines and more about good boundaries:

  • Treat model choice as a platform concern, not a local optimization.
  • Express routing logic as config and policies, not scattered conditionals.
  • Support fallbacks, canaries, and A/B tests as first-class features.
  • Invest early in observability, so every decision is explainable.
  • Tie routing to product decisions—features, tiers, risk profiles—so your AI stack naturally reflects your business and UX priorities.

When AI touches multiple surfaces, from device-driven insights to in-app coaching, this approach keeps the system adaptable. New models and providers can be introduced safely, different experiences can coexist under one gateway, and you can reason about cost, latency, and safety with clarity.
