The Anatomy of an LLM Guardrail: What Runs on the Input, What Runs on the Output

The Anatomy of an LLM Guardrail: What Runs on the Input, What Runs on the Output

Not every safety check belongs in the same place. Some only make sense on the user’s question, some only on the model’s answer, and some on both. Here is how we structure the guardrails on our veterinary chatbot — and why the input/output split matters.

A guardrail isn’t one thing — it’s two moments

When people say “we put a guardrail on the model,” they usually picture a single gate. In practice a guardrail fires at two distinct moments in a request, and the checks that run at each are not the same. The input pass sees only the user’s message and runs before the model generates anything; the output pass sees the model’s generated response and runs after generation, before the text reaches the user.

This matters because a check can only run where its data exists. You can’t grounding-check an answer that hasn’t been written yet, and you can’t detect a prompt injection in a reply the model authored. So each policy lives on input, output, or both — and getting that placement right is most of the design.

The policies we apply

Our chatbot uses managed cloud guardrails, and we enable five policy families. Each one runs on the input, the output, or both, and getting that placement right is the whole exercise.

Policy Runs on Action Why there
Prompt-attack filter Input only block Jailbreaks and injection live in the user's message; nothing to detect on output.
Content filters (sexual, violence, hate, insults, misconduct) Input + output block Unsafe content can arrive in a question or be produced in an answer; both screened at high strength.
Denied topics (human-medical, drug-seeking, at-home surgery, harm-to-animals, off-topic, prompt-disclosure) Input + output block A forbidden subject can show up in either the ask or the reply.
Word filters (custom blocklist + managed profanity) Input + output block Specific banned phrases caught coming and going.
PII: email, phone, address, name Input + output redact Mask contact details in the message and in anything the model emits.
PII: card numbers, SSNs, passwords Input + output block High-risk identifiers are blocked outright rather than masked.
Contextual grounding (groundedness + relevance) Output only block Compares the answer to the source material; cannot run before the answer exists.

Two of these are worth calling out because they are the clearest examples of the input/output split.

Input-specific: prompt-attack detection

The prompt-attack filter is set to high on input and off on output. That is deliberate: a prompt injection (“ignore your instructions and…”) is something a user does to the model. The model’s reply is not going to attack anyone, so running this filter on the output would only cost latency and false positives. This check belongs exclusively on the input.

Output-specific: contextual grounding

Grounding is the mirror image. It asks whether the answer the model just produced is actually supported by the reference material we gave it, and whether it is relevant to what was asked. Both inputs to that question — the answer and the source — only exist after generation, so grounding runs as a separate output-only step where we tag three pieces of content for the guardrail: the source material, the original question, and the model’s answer. This check is meaningless on the input side; there is nothing to ground yet.

Two actions: block vs. redact

Most policies, when they fire, block — the entire message is replaced with a safe refusal (“I can only help with questions about your dog…”). Block is binary and easy to reason about. But the PII policy on names, emails, phones, and addresses uses a different action: anonymize, or redact. Here the message is allowed through but the sensitive span is masked in place. That is a fundamentally different operation. Blocking is a decision about the whole message — keep it or swap it. Redacting is an edit inside the message — you have to find the span and rewrite it. That distinction has real consequences once you stream.

How the two passes are actually wired

We enforce all of this with two mechanisms, not one. First, an attached guardrail on the model call itself: when we invoke the model, we pass the guardrail along, and the platform automatically evaluates the input before generation and the output after generation against the content, topic, word, and PII policies. One attachment, both directions. Second, a separate grounding call after generation: because grounding needs the answer plus the source tagged in a specific way, we run it as an explicit second step on the output only, and if it fails we replace the answer with a safe fallback (“I don’t have enough information to answer that”).

So the full lifecycle is: input screened, model generates, output screened for content, topic, word, and PII, output grounding-checked, then delivered.

Why streaming forces the question

Streaming responses token by token, for the typing effect, is where the input/output distinction becomes unavoidable, because the two sides behave completely differently. Input checks can run before a single token is emitted: the user’s message is fully known up front, so we screen it before opening the stream, and if it is blocked the user gets a clean refusal and the model is never even called. Cheap and deterministic.

Output checks, on the other hand, collide with streaming. The model’s response does not exist until it is generated, and once you have streamed a token to the user you can’t take it back. That is fine for a block — you can stop and swap the final message — but it is a real problem for redaction and grounding. To mask an email you must recognize the whole span before it is shown; token by token, you would leak the raw email before the guardrail catches it. And grounding needs the complete answer to score it, so a strict grounding gate means buffering the response rather than streaming it live.

The practical fix is to make the output guardrail synchronous — buffer each chunk, let the guardrail evaluate and redact it, then release it — at the cost of some latency. The looser alternative, stream first and check after, means unsafe or unredacted content can flash in front of the user before it is caught. The rule of thumb: input guardrails are “free” to run early and hard; output guardrails, especially redaction and grounding, force a latency-versus-safety trade-off the moment you stream.

Design principles we settled on

Place each check where its data lives. Prompt-attack on input only; grounding on output only; everything that can appear in either direction on both. Don’t pay for checks that can’t possibly fire.

Prefer blocking for categories, redaction for spans. Block when the whole message is the problem — a forbidden topic, an injection. Redact when only a fragment is — a name, an email — and the rest of the message is useful.

Screen the input before you spend a generation. A blocked question shouldn’t cost a model call. Gating on input is cheaper and more predictable than catching it mid-stream.

Treat the output guardrail as the expensive one. It is the side that fights with streaming, adds latency, and has to handle in-place edits. Budget for it accordingly, and decide consciously whether a given response path needs synchronous (safe) or asynchronous (fast) output checking.

Match the action to the risk. Anonymize what’s recoverable, such as contact details. Block what isn’t, such as a payment card, SSN, or password. One policy family, two actions, chosen per entity.

In one sentence

Guardrails aren’t a single wall around the model — they’re a set of checks placed at two moments in the request, and the art is knowing which checks belong on the user’s question, which belong on the model’s answer, and which belong on both.

Read more