Observability for AI Systems: How We Solved LLM Performance in Production
Pet parents face a unique challenge: their companions can't tell them what's wrong. Is that whimper a sign of anxiety or something more serious? Can Bella eat blueberries? Why is Max suddenly refusing his favourite food? These questions keep pet parents awake at night, scrolling through conflicting advice on forums and second-guessing every decision.
This is where the Hoomanely AI chat assistant, our AI-powered pet care companion, steps in. By combining veterinary knowledge, nutritional databases, and behavioural insights, it provides instant, personalised guidance to worried pet parents at 2 AM when the vet's office is closed. The promise is compelling: expert-level advice, available 24/7, tailored to your pet's specific needs.
But here's the catch—when you deploy an LLM to answer critical questions about pet health, you've created a new problem: you have no idea what it's actually saying to your users.
The Black Box Problem
Large Language Models are powerful, but they're also unpredictable. Ask the same question twice, and you might get two different answers. Change the temperature parameter slightly, and the tone shifts from cautious to overly confident. Feed it incomplete context, and it might hallucinate facts that sound authoritative but are dangerously wrong.
For a pet care AI, the stakes are real. An incorrect answer about toxic foods could lead to an emergency vet visit. Overly confident advice might delay treatment for a serious condition. Inconsistent responses erode trust—if your AI says one thing today and something else tomorrow, users will stop relying on it.
Traditional monitoring doesn't help here. You can track API uptime, response times, and error rates, but none of that tells you whether your LLM just told someone that chocolate is safe for dogs. You need visibility into what the model is actually doing—the prompts it receives, the responses it generates, the context it uses, and the resources it consumes.

Approaches to LLM Observability
Teams building production LLM systems have developed several approaches to this problem, each with different tradeoffs:
Manual Spot-Checking: Some teams randomly sample conversations and review them manually. This is cheap to implement—you just need a database query and someone's time—but it's slow, doesn't scale, and you'll miss most issues. It's like trying to find a needle in a haystack by examining one piece of straw at a time.
User Feedback Systems: Thumbs up/down buttons and feedback forms let users flag problems. This catches the most egregious errors but suffers from selection bias—most users don't leave feedback unless something goes really wrong. You'll never know about the subtle quality degradations or the users who silently left.
Third-Party Observability Platforms: Services like LangSmith, Weights & Biases, or Helicone offer sophisticated LLM monitoring out of the box. They're powerful but add external dependencies, costs, and sometimes latency. For some teams, that's worth it. For others, especially those with strict data privacy requirements or cost constraints, it's not feasible.
Custom Logging Infrastructure: Building your own system gives you complete control over what's tracked, where data lives, and how it's analysed. The downside is obvious—you're building and maintaining infrastructure instead of buying it. But if you're already investing heavily in your AI systems, and you need observability tailored to your specific use cases, this approach makes sense.
Building Our Log Dashboard
At Hoomanely, we chose to build a custom observability system. Not because we thought we'd build something better than dedicated platforms, but because we needed something that fit our specific workflow, integrated with our existing tools, and gave us the flexibility to track metrics unique to pet care interactions.
The core insight was simple: treat every LLM call as a structured event that you can query, analyse, and learn from.
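In practice, this can start as a thin wrapper around every model call that appends one JSON line per invocation. The sketch below is illustrative rather than our production code; the tracked_call helper, the field names, and the llm_events.jsonl file are assumptions made for the example.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("llm_events.jsonl")  # hypothetical append-only event log


def log_event(event: dict) -> None:
    """Append one structured LLM event as a single JSON line."""
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")


def tracked_call(use_case: str, prompt: str, llm_fn, **params) -> str:
    """Wrap any LLM call so it always produces a queryable event."""
    start = time.monotonic()
    response = llm_fn(prompt, **params)  # your actual model call goes here
    latency_ms = int((time.monotonic() - start) * 1000)
    log_event({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "use_case": use_case,   # e.g. "qna", "food_analysis", "pet_insight"
        "params": params,       # model, temperature, max_tokens, ...
        "prompt": prompt,
        "response": response,
        "latency_ms": latency_ms,
    })
    return response
```

Because each event is a flat, self-describing record, anything that can read JSON lines can query it later: a notebook, a cron job, or the dashboard itself.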
What We Track
Every time an LLM is invoked to answer a pet parent's question, analyse a food label photo, or generate a pet insight, we log the event with key metadata (a sketch of the resulting record follows this list):
Model Configuration: Which model was used, what temperature setting, max tokens allowed, and any special parameters. This helps us correlate quality issues with specific configurations.
Prompt Engineering: We tag each call with its use case type (QnA, food analysis, pet insight, etc.) and log what context was passed to the model. Did we include the pet's medical history? Breed information? Previous conversation turns? Knowing what the model had access to is crucial for debugging hallucinations.
Resource Consumption: Token counts (prompt + completion) and their associated costs. LLM usage can get expensive fast—we've seen single conversations consume thousands of tokens when users ask follow-up questions. Tracking this lets us optimise prompts and catch runaway generation.
Performance Metrics: End-to-end latency measured in milliseconds. We track both the time to first token and total completion time. Pet parents won't wait 10 seconds for an answer, so we need to catch performance regressions immediately.
Response Metadata: The actual completion text, any tool calls made, and whether the response used retrieval (RAG). This is the ground truth: what did we actually tell the user?
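Put together, the categories above map onto a record roughly like the following. The field names, the example model name, and the pricing helper are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class LLMEvent:
    # Model configuration
    model: str                  # e.g. "gpt-4o-mini" (illustrative)
    temperature: float
    max_tokens: int

    # Prompt engineering
    use_case: str               # "qna", "food_analysis", "pet_insight", ...
    context_flags: dict         # e.g. {"medical_history": True, "breed": True}

    # Resource consumption
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float = 0.0

    # Performance metrics
    time_to_first_token_ms: int = 0
    total_latency_ms: int = 0

    # Response metadata
    completion_text: str = ""
    tool_calls: list = field(default_factory=list)
    used_retrieval: bool = False


def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Rough per-call cost estimate; real prices vary by model and provider."""
    return (prompt_tokens / 1000) * prompt_price_per_1k \
         + (completion_tokens / 1000) * completion_price_per_1k
```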
Why This Matters
Having this data centralised in a dashboard changes how we work. Instead of guessing why users report inconsistent answers, we can query logs by use case and see exactly what prompts were used and what context was available. When a pet parent says "your AI gave me bad advice," we can pull up that exact conversation, see the prompt, examine the context, and understand what went wrong.
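Assuming events are stored as JSON lines like the wrapper sketch earlier, pulling up a specific use case or conversation takes only a few lines of Python; the conversation_id field here is an assumed addition to that sketch, not something it already logged.

```python
import json
from pathlib import Path


def load_events(path: str = "llm_events.jsonl") -> list:
    """Read all logged LLM events (assumed JSON-lines format)."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def events_for(events: list, use_case: str = None, conversation_id: str = None) -> list:
    """Filter events by use case and/or conversation to replay what the model saw."""
    return [
        e for e in events
        if (use_case is None or e.get("use_case") == use_case)
        and (conversation_id is None or e.get("conversation_id") == conversation_id)
    ]


# Example: inspect every food-analysis call, prompt and response side by side.
for e in events_for(load_events(), use_case="food_analysis"):
    print(e["params"], e["latency_ms"], "ms")
    print("PROMPT:  ", e["prompt"][:200])
    print("RESPONSE:", e["response"][:200])
```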
Here are a few specific ways this data drives improvement:
Hallucination Detection: If the LLM mentions something we've never documented, or contradicts established guidelines, we flag that response for review.
Prompt Optimisation: Tracking performance by prompt type lets us spot anomalies and tune the prompt for that specific use case, as sketched below.
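One simple way to surface those anomalies, again assuming the JSON-lines events from the earlier sketches, is a per-use-case baseline with an outlier threshold. Production setups usually lean on dashboards and percentile alerts instead, but the underlying idea is the same.

```python
import statistics
from collections import defaultdict


def flag_anomalies(events: list, key: str = "latency_ms", z_threshold: float = 3.0) -> list:
    """Flag events whose metric sits far above the norm for their use case."""
    by_use_case = defaultdict(list)
    for e in events:
        if key in e and "use_case" in e:
            by_use_case[e["use_case"]].append(e)

    flagged = []
    for group in by_use_case.values():
        values = [e[key] for e in group]
        if len(values) < 2:
            continue
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        if stdev == 0:
            continue
        flagged.extend(e for e in group if (e[key] - mean) / stdev > z_threshold)
    return flagged


# Example: which calls were unusually slow or token-hungry for their use case?
# slow = flag_anomalies(load_events(), key="latency_ms")
# heavy = flag_anomalies(load_events(), key="completion_tokens")
```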
Key Takeaways
If you're building AI-powered products, especially in domains where accuracy matters, invest in observability early. You don't need a perfect system from day one, but you need something that lets you see what's happening in production.
Start with structured logging. Make every LLM call generate a queryable event with relevant metadata. This costs almost nothing to implement and gives you a foundation to build on.
Track what matters to your use case. Generic metrics like "average response time" are fine, but you need domain-specific signals too. For us, that's measuring retrieval quality and context completeness. For you, it might be something else entirely.
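To make that concrete, here is one way a domain-specific signal could be computed; it assumes each event carries the context_flags field from the schema sketch above, which may not match your own logging.

```python
def context_completeness(events: list, required: tuple = ("medical_history", "breed")) -> float:
    """Share of calls that had every required piece of pet context available.

    Assumes each event carries a context_flags dict such as
    {"medical_history": True, "breed": False}; swap in whatever
    signals matter for your own domain.
    """
    relevant = [e for e in events if "context_flags" in e]
    if not relevant:
        return 0.0
    complete = sum(
        1 for e in relevant
        if all(e["context_flags"].get(flag) for flag in required)
    )
    return complete / len(relevant)
```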
Finally, remember that observability is a means to an end. The goal isn't to collect data—it's to build better, more reliable AI systems that users can trust. Every log entry, every tracked metric, every analysed anomaly should ultimately lead to improvements in how your AI serves your users.