In production, monitor the failure modes that affect correctness, safety, and cost. Common correctness failures include hallucinations (confidently wrong answers), instruction drift (ignoring your format rules), and context misuse (answering from prior conversation when it should rely on the provided sources). Operational failures include latency spikes (especially with long contexts), token blowups (unexpectedly long outputs that inflate cost and latency), and rate-limit errors. For agentic workflows, also watch for loops: repeated tool calls that make no progress, and “thrashing,” where the model makes unrelated edits.
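
To make the loop check concrete, here is a minimal sketch of one way to flag repeated tool calls that make no progress. The `ToolCall` shape and the repeat threshold are illustrative assumptions, not part of any particular agent framework.

```python
# Minimal sketch of a loop detector for agentic workflows.
# The ToolCall shape and max_repeats threshold are illustrative assumptions,
# not part of any specific agent framework.
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: str  # canonicalized JSON string of the call's arguments


def detect_loop(history: list[ToolCall], max_repeats: int = 3) -> bool:
    """Flag a run when the same tool is called with the same arguments
    more than `max_repeats` times, i.e. repeated calls without progress."""
    counts = Counter((call.name, call.arguments) for call in history)
    return any(count > max_repeats for count in counts.values())


# Example: four identical searches trip the detector.
calls = [ToolCall("search_docs", '{"query": "refund policy"}')] * 4
if detect_loop(calls):
    print("loop detected: abort the run and surface partial results")
```

A detector this simple misses loops that vary their arguments slightly, but it is cheap to run on every turn and catches the most common failure: the agent retrying the same call and hoping for a different result.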
Instrument your system with both model-level and system-level signals. Track: input/output token counts per request, time-to-first-token, total latency, error codes, tool-call frequency, and retry rates. Add quality checks: schema validation pass rates for structured outputs, citation coverage (did it cite retrieved sources when required), and “unknown” rate (did it appropriately say it doesn’t know). Set alerts on sudden shifts in these metrics because they often indicate prompt regressions, retrieval degradation, or upstream capacity issues.
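
As one way to wire up these signals, the sketch below logs a per-request record with token counts, latency, a schema-validation flag, and a rough “unknown” check. It assumes an OpenAI-style `usage` object and a JSON-returning model client; the field names and helpers are illustrative assumptions, not a specific vendor’s API.

```python
# Sketch of per-request instrumentation. The response shape (usage, text),
# the record schema, and validate_schema are illustrative assumptions.
import json
import time


def validate_schema(output: str, required_keys: set[str]) -> bool:
    """Cheap structured-output check: parses JSON and verifies required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def log_request(call_model, prompt: str, required_keys: set[str]) -> dict:
    start = time.monotonic()
    response = call_model(prompt)          # your model client goes here
    latency = time.monotonic() - start
    record = {
        "input_tokens": response["usage"]["prompt_tokens"],
        "output_tokens": response["usage"]["completion_tokens"],
        "latency_s": round(latency, 3),
        "schema_valid": validate_schema(response["text"], required_keys),
        "said_unknown": "i don't know" in response["text"].lower(),
    }
    print(json.dumps(record))              # ship to your metrics pipeline instead
    return record
```

Aggregating these records over time is what makes the alerting useful: a drop in `schema_valid` or a jump in `output_tokens` is usually visible well before users complain.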
Finally, treat retrieval quality as a core production metric. If you use Milvus or managed Zilliz Cloud, monitor vector search latency, top-k similarity score distributions, and how often retrieved chunks actually contain the answer. Many “model failures” are really retrieval failures: wrong chunking, missing metadata filters, or stale indexes. When you log retrieved chunk IDs alongside the model output, you can debug issues quickly and improve your system without guessing what the model “meant.”
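
A minimal sketch of that logging pattern with pymilvus’s `MilvusClient` is shown below; the collection name, output field, and URI are assumptions about your setup, and the same idea applies to Zilliz Cloud with its endpoint and token.

```python
# Sketch of logging retrieval alongside generation with pymilvus.
# The collection name "docs", the "text" output field, and the local URI
# are illustrative assumptions about your deployment.
import json
import time

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # or your Zilliz Cloud URI + token


def search_and_log(query_vector: list[float], top_k: int = 5) -> list[dict]:
    start = time.monotonic()
    results = client.search(
        collection_name="docs",       # assumed collection
        data=[query_vector],
        limit=top_k,
        output_fields=["text"],       # assumed scalar field holding the chunk text
    )
    latency = time.monotonic() - start
    hits = results[0]                 # results for the single query vector
    log_record = {
        "retrieval_latency_s": round(latency, 3),
        "chunk_ids": [hit["id"] for hit in hits],
        "scores": [hit["distance"] for hit in hits],
    }
    print(json.dumps(log_record))     # store this next to the model's output
    return hits
```

Keeping the chunk IDs and scores in the same record as the model’s answer lets you join retrieval and generation after the fact: when an answer is wrong, you can see immediately whether the right chunk was never retrieved, retrieved with a weak score, or retrieved and then ignored.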
