AI Observability: When Everything Looks Healthy But AI Is Wrong
Your AI app can fail… even when nothing is technically broken.
The server is healthy. The API is fast. The dashboards are all green.
But your chatbot still gives a completely wrong answer to a customer.
And nobody notices.
That's the weird thing about GenAI systems.
In traditional software, failures are obvious:
- APIs fail
- Databases crash
- Latency spikes
But LLM apps fail differently.
They can be fully online, fast, and technically "working"… and still be wrong.
What is AI Observability?
AI observability helps teams understand why an AI system behaved the way it did, not just whether it was up.
Consider a healthcare assistant. A patient asks:
Do I need to fast before tomorrow's blood test?
The AI replies:
No fasting required.
But fasting was mandatory.
Nothing crashed. No alerts fired. Everything looked normal.
Yet the answer was still wrong.
A key concept is a Trace.
A trace is the full step-by-step journey of an AI request — from the user question, to retrieval, to model response, to the final answer.
| Step | What happens |
|---|---|
| User sends question | Input captured |
| Retrieval runs | Relevant chunks fetched from vector store |
| Prompt assembled | Context + question combined |
| LLM responds | Model generates answer |
| Answer returned | User receives response |
Every step is logged. Every step can be inspected.
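A minimal sketch of what such a trace record might look like in code, assuming a plain in-memory structure. The `Trace` and `Step` classes and their field names are hypothetical, chosen for illustration rather than taken from any tracing library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Step:
    """One stage of a request, e.g. retrieval or the LLM call."""
    name: str
    detail: dict[str, Any]


@dataclass
class Trace:
    """All steps recorded for a single user request (hypothetical structure)."""
    question: str
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    steps: list[Step] = field(default_factory=list)

    def log(self, name: str, **detail: Any) -> None:
        """Append a step so it can be inspected later."""
        self.steps.append(Step(name, detail))


# Record the five steps from the table above for one request.
trace = Trace(question="Do I need to fast before tomorrow's blood test?")
trace.log("input", text=trace.question)
trace.log("retrieval", chunks=["CBC blood panel - no fasting",
                               "General wellness - no fasting"])
trace.log("prompt", parts=["system", "chunks", "question"])
trace.log("llm", output="No fasting required.")
trace.log("answer", text="No fasting required.")

for step in trace.steps:
    print(step.name, step.detail)
```

In practice a tracing library would create and export these records for you; the point is only that each step keeps enough detail to be inspected after the fact.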
Quality in AI Means
Unlike traditional software where quality = uptime + latency, AI quality means:
- Useful answers — does it actually help the user?
- Grounded responses — is the answer based on retrieved context?
- Fewer hallucinations — does the model invent facts?
- Safe outputs — does it avoid harmful content?
- Consistent behavior — does quality hold after prompt changes?
These cannot be measured with CPU graphs or error rates. They require a completely different observability stack.
The Traditional vs AI Observability Gap
Traditional monitoring tells you that the server is healthy, the API is fast, the error rate is near zero, and every dashboard is green.
But it cannot tell you this:
Assistant responded: "You don't need to fast before your blood test." — Completely wrong. Fasting was mandatory.
Notice the gap. Traditional monitoring could not catch that failure.
The Building Blocks of AI Observability
Traces
Full journey logs. Every request traced end-to-end.
Trace 2025-12-01T09:32:00
  Input: "Do I need to fast before my blood test?"
  Retrieval: 2 chunks fetched
    chunk_1: "CBC blood panel - no fasting"
    chunk_2: "General wellness - no fasting" <- Wrong chunk retrieved
    (missing): "Glucose test - fasting required" <- Relevant chunk, not retrieved
  Prompt: system + 2 chunks + user question
  Output: "No fasting required."
The trace shows that retrieval missed the one chunk that mattered, the glucose test guidance. The model then answered faithfully from the wrong documents.
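One way to catch this class of failure automatically is to check, for questions where the right source document is known, whether retrieval actually returned it. A minimal sketch, assuming each chunk carries a document ID; the IDs and the `retrieval_recall` helper are hypothetical.

```python
def retrieval_recall(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of the expected documents that retrieval actually returned."""
    if not expected_ids:
        return 1.0  # nothing was required, so nothing was missed
    retrieved = set(retrieved_ids)
    hits = sum(1 for doc_id in expected_ids if doc_id in retrieved)
    return hits / len(expected_ids)


# Hypothetical IDs for the chunks in the trace above.
retrieved = ["cbc-panel", "general-wellness"]
expected = ["glucose-test"]  # the document that states fasting is required

recall = retrieval_recall(retrieved, expected)
print(f"retrieval recall: {recall:.2f}")
if recall < 1.0:
    missing = sorted(set(expected) - set(retrieved))
    print("missing documents:", missing)
```

This only works for questions with a labeled "right" document, so it fits a curated test set better than live traffic.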
Evaluators
Automated checkers that score responses.
response = ai_app.answer(question)
# Is the answer based on what was actually retrieved?
groundedness_score = evaluate_groundedness(response, retrieved_context)
# Does the answer actually address the question?
relevance_score = evaluate_relevance(response, question)
if groundedness_score < 0.7:
    flag_for_review(response)
Common evaluators:
| Evaluator | What it checks |
|---|---|
| Groundedness | Is the answer based on retrieved context? |
| Relevance | Does it address the actual question? |
| Toxicity | Is the content harmful? |
| Coherence | Is the response logically structured? |
| Similarity | Does it match expected answers? |
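To make the idea of an evaluator concrete, here is a deliberately naive groundedness check based on word overlap. It is a stand-in for illustration only: production evaluators typically use an LLM or a trained model as the judge. The `naive_groundedness` function and the sample texts are assumptions, and the threshold reuses the 0.7 from the snippet above.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_groundedness(response: str, context_chunks: list[str]) -> float:
    """Share of response words that also appear somewhere in the context."""
    response_words = _words(response)
    if not response_words:
        return 0.0
    context_words = _words(" ".join(context_chunks))
    return len(response_words & context_words) / len(response_words)


# Illustrative context and responses, not real clinical guidance.
context = ["Glucose test: fasting is required before the test."]

grounded = naive_groundedness("Fasting is required before the test.", context)
ungrounded = naive_groundedness("You can eat normally that morning.", context)

for label, score in [("grounded", grounded), ("ungrounded", ungrounded)]:
    status = "ok" if score >= 0.7 else "flag for review"
    print(f"{label}: {score:.2f} -> {status}")
```

Word overlap cannot tell "fasting is required" from "no fasting is required", which is one reason real evaluators rely on a model as the judge rather than string matching.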
Feedback Loops
Human review and automated scoring that feeds back into the system.
User gives thumbs down on AI response
↓
Feedback logged to observability system
↓
Team investigates: Wrong retrieval chunk was used
↓
Documents re-chunked and re-embedded into the vector store
↓
Quality improves on similar questions
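The first two steps of this loop can be sketched as a small feedback log keyed by trace ID, so that a thumbs-down always leads back to the full trace. The `Feedback` and `FeedbackLog` classes are hypothetical; a real system would persist this in a database or in the observability backend.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    trace_id: str
    thumbs_up: bool
    comment: str = ""


class FeedbackLog:
    """In-memory store that links user feedback to trace IDs."""

    def __init__(self) -> None:
        self._by_trace: dict[str, list[Feedback]] = defaultdict(list)

    def record(self, feedback: Feedback) -> None:
        self._by_trace[feedback.trace_id].append(feedback)

    def needs_review(self) -> list[str]:
        """Trace IDs that received at least one thumbs-down, in sorted order."""
        return sorted(
            trace_id
            for trace_id, items in self._by_trace.items()
            if any(not item.thumbs_up for item in items)
        )


log = FeedbackLog()
log.record(Feedback("trace-001", thumbs_up=True))
log.record(Feedback("trace-002", thumbs_up=False, comment="Told me not to fast"))
log.record(Feedback("trace-003", thumbs_up=True))

print(log.needs_review())
```

Keying feedback by trace ID is the design choice that matters here: without it, a thumbs-down tells you something went wrong but not where to look.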
Azure AI Foundry Observability
Azure AI Foundry, Microsoft's platform for building AI apps, ships with observability features built in.
| Capability | What it does |
|---|---|
| Tracing | Full end-to-end request traces |
| Evaluation | Built-in quality evaluators |
| Monitoring | Real-time dashboards for AI quality |
| Risk & Safety | Detects harmful content |
| Prompt Flow | Visual pipeline with trace steps visible |
Setting Up Tracing
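from azure.ai.inference.tracing import AIInferenceInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for local debugging; swap in your production exporter.
exporter = ConsoleSpanExporter()

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(exporter))
# Register the provider globally so the instrumentor's spans are exported through it.
trace.set_tracer_provider(tracer_provider)

# enable_content_recording=True stores prompt and response text in spans;
# review this before enabling it on sensitive data such as patient questions.
AIInferenceInstrumentor().instrument(enable_content_recording=True)
Calls made through the azure-ai-inference client are now traced automatically.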
Running Evaluations
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

# azure_openai_config (the judge model) and project (your Foundry project)
# are assumed to be defined elsewhere.
evaluators = {
    "groundedness": GroundednessEvaluator(model_config=azure_openai_config),
    "relevance": RelevanceEvaluator(model_config=azure_openai_config),
}
results = evaluate(
    data="test_dataset.jsonl",
    evaluators=evaluators,
    azure_ai_project=project,
    output_path="./evaluation_results.json",
)
# Aggregate scores are returned under the "metrics" key.
for name, value in results["metrics"].items():
    print(f"{name}: {value}")
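The `test_dataset.jsonl` file holds one JSON object per line. A minimal sketch of building it with the standard library, assuming the evaluators read `query`, `context`, and `response` fields. Those column names are an assumption here, so check the input schema of the evaluator version you use. The `to_jsonl` helper is hypothetical and the sample row is illustrative.

```python
import json


def to_jsonl(rows: list[dict[str, str]]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


# Assumed column names: query, context, response.
rows = [
    {
        "query": "Do I need to fast before my blood test?",
        "context": "Glucose test - fasting required",
        "response": "No fasting required.",
    },
]

jsonl_text = to_jsonl(rows)
print(jsonl_text, end="")

# To produce the file used above:
# with open("test_dataset.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```

Building the dataset from real production traces, rather than hand-written examples, keeps the evaluation representative of what users actually ask.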
The Continuous Improvement Loop
Production AI App
↓
Traces + Scores collected automatically
↓
Team reviews low-scoring traces weekly
↓
Root cause: Retrieval? Prompt? Model config?
↓
Fix deployed → new eval run
↓
Compare before/after scores
↓
Repeat
This loop is what separates teams that improve AI quality from teams that just hope it works.
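The compare step can be as simple as diffing mean scores between two evaluation runs. A minimal sketch, assuming each run is a mapping from metric name to per-example scores; the `compare_runs` helper and the numbers are made up for illustration.

```python
from statistics import mean


def compare_runs(
    before: dict[str, list[float]], after: dict[str, list[float]]
) -> dict[str, float]:
    """Change in mean score (after minus before) for metrics present in both runs."""
    return {
        metric: round(mean(after[metric]) - mean(before[metric]), 4)
        for metric in before
        if metric in after
    }


# Illustrative scores only, not real results.
before = {"groundedness": [0.5, 0.75, 1.0], "relevance": [0.5, 1.0]}
after = {"groundedness": [0.75, 1.0, 1.0], "relevance": [0.5, 0.5]}

for metric, delta in compare_runs(before, after).items():
    verdict = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{metric}: {delta:+.2f} ({verdict})")
```

Running both evaluations on the same dataset matters: a score change only means something when the questions are held fixed.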
What Good AI Observability Looks Like
The dashboard you actually want:
AI Quality Dashboard — Last 7 Days
Groundedness Score: 0.84 (+0.06)
Relevance Score: 0.79 (+0.03)
Hallucination Rate: 3.2% (-1.1%)
User Thumbs Up: 78% (+5%)
Flagged Responses: 12 (-8)
Low-Score Traces This Week:
- Query: "fasting before blood test" groundedness: 0.32
- Query: "medication interactions" groundedness: 0.41
- Query: "post-surgery diet" groundedness: 0.58
Numbers trending in the right direction. Worst traces surfaced for manual review. Root causes visible.
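Headline numbers like these are simple aggregates over scored traces. A minimal sketch, assuming each trace has already been given a groundedness score and optional user feedback; the `ScoredTrace` fields, the 0.7 flagging threshold, and the sample data are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ScoredTrace:
    query: str
    groundedness: float
    thumbs_up: Optional[bool] = None  # None means the user left no feedback


def summarize(traces: list[ScoredTrace], flag_below: float = 0.7) -> dict[str, float]:
    """Aggregate per-trace scores into dashboard-style headline numbers."""
    rated = [t for t in traces if t.thumbs_up is not None]
    return {
        "groundedness_mean": mean(t.groundedness for t in traces),
        "flagged": sum(1 for t in traces if t.groundedness < flag_below),
        "thumbs_up_rate": (
            sum(1 for t in rated if t.thumbs_up) / len(rated) if rated else 0.0
        ),
    }


# Sample data for illustration only.
traces = [
    ScoredTrace("fasting before blood test", 0.32, thumbs_up=False),
    ScoredTrace("clinic opening hours", 0.95, thumbs_up=True),
    ScoredTrace("post-surgery diet", 0.58),
    ScoredTrace("appointment rescheduling", 0.91, thumbs_up=True),
]

summary = summarize(traces)
print(summary)
```

The thumbs-up rate is computed only over traces that received feedback, so silence from users is not counted as approval or disapproval.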
Quick Reference
| Concept | One-line definition |
|---|---|
| Trace | Full log of one AI request, step by step |
| Evaluator | Automated scorer for response quality |
| Groundedness | Measure of how well the answer is supported by retrieved context |
| Hallucination | When the model invents facts not in the context |
| Feedback loop | System for catching errors and improving over time |
| AI Foundry | Microsoft's platform for building and observing AI apps |
What to Do This Week
If you have an LLM app in production, add tracing now. Even basic OpenTelemetry tracing is better than nothing.
Run a groundedness evaluation on 50 recent queries. You will likely find several, perhaps 5–10, where the answer is not supported by the retrieved context.
Set up a weekly review of the lowest-scoring 10% of traces. That review alone will surface the biggest quality issues.
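Picking the review set is a few lines of code once traces have scores. A minimal sketch, assuming scores are available as (trace ID, groundedness) pairs; the `bottom_fraction` helper and the generated sample scores are hypothetical.

```python
import math


def bottom_fraction(
    scored: list[tuple[str, float]], fraction: float = 0.10
) -> list[tuple[str, float]]:
    """Return the lowest-scoring share of traces, worst first (at least one)."""
    if not scored:
        return []
    count = max(1, math.ceil(len(scored) * fraction))
    return sorted(scored, key=lambda item: item[1])[:count]


# 50 synthetic traces with scores 0.50, 0.51, ..., 0.99 for demonstration.
scored = [(f"trace-{i:03d}", (50 + i) / 100) for i in range(50)]

for trace_id, score in bottom_fraction(scored):
    print(trace_id, f"{score:.2f}")
```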
AI observability is not a nice-to-have.
It's the difference between knowing your AI works… and hoping it does.