ML System Design for Enterprise Support AI
There's a particular kind of overconfidence that hits teams around week three of building a support AI. The demo works. Ticket deflection is up. The CEO has seen the prototype. And someone says, with genuine conviction, "we just need to productionize it."
What comes next is a semester's worth of hard lessons about the gap between a proof of concept and a system that handles fifty thousand tickets a day without hallucinating a refund policy that expired two years ago.
This post is about that gap. It's organized around thirteen ML system design problems that come up when building a production enterprise support agent, something resembling what Decagon, Intercom Fin, or an in-house platform team would actually face. For each problem I've tried to go one layer deeper than the standard tutorial: past the architecture diagram and into the failure modes, the messy data realities, and the non-obvious trade-offs that only surface under real load.
Each section is written to stand alone, so feel free to jump to whichever problem is most relevant to where you are.
A note on scope: Examples are grounded in customer support throughout: ticket routing, response generation, knowledge base retrieval, agent escalation. The techniques generalize, but the framing here is deliberate. Support AI has a particular combination of constraints (high volume, adversarial queries, brand-sensitive tone, heterogeneous knowledge sources) that stress-tests ML systems in useful ways.
Fine-Tuning a Small LLM for Query Rewriting Before RAG Retrieval
The problem with using a user's raw query as the retrieval signal is that users don't write retrieval queries. They write things like "my thing keeps doing that annoying error" or "I already tried resetting it." A support agent that retrieves against this verbatim will find almost nothing useful. Query rewriting is the step that transforms what the user said into what the knowledge base was written to answer.
Data
The honest answer is that you almost certainly don't have labeled (raw query → good retrieval query) pairs at the start. You have to construct them.
The best source of signal is your existing support ticket history. For each resolved ticket, the final KB article that the human agent consulted is a weak supervision label: you can work backwards from the article title and first paragraph to construct what a good retrieval query would look like. This isn't clean: human agents often used institutional knowledge rather than the KB, so many resolved tickets have no KB reference at all. Filter aggressively and keep only tickets where the resolution was traceable to a specific document.
A second source is your search logs if you have an internal KB search tool for agents. The queries agents actually typed to find articles are genuinely useful. They're written by domain experts and tend to encode the right vocabulary. I've found these are often 5-10x better as training targets than anything synthetically generated.
The trickier cleaning step is deduplication. Support queries cluster heavily around a few core issues, and if you don't deduplicate carefully you'll fine-tune a model that's overfit to your three most common product failures. Cluster your queries at roughly the semantic level using something like FAISS + a sentence transformer, then subsample each cluster to prevent this.
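The cluster-then-subsample step can be sketched without any heavy dependencies. This is a minimal greedy leader-clustering sketch: the `embed` function here is a hashed-trigram stand-in for a real sentence-transformer (swap in `model.encode(text)` in practice), and the similarity threshold and per-cluster cap are illustrative values, not tuned ones.

```python
import hashlib
import math
import random
from collections import defaultdict

def embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for a real sentence-transformer embedding: hash character
    # trigrams into a fixed-size vector. Replace with model.encode(text).
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_and_subsample(queries, threshold=0.8, per_cluster=2, seed=0):
    # Greedy leader clustering: assign each query to the first cluster
    # whose leader embedding is within the similarity threshold,
    # otherwise start a new cluster.
    leaders, clusters = [], defaultdict(list)
    for q in queries:
        e = embed(q)
        for idx, leader in enumerate(leaders):
            if cosine(e, leader) >= threshold:
                clusters[idx].append(q)
                break
        else:
            leaders.append(e)
            clusters[len(leaders) - 1].append(q)
    # Subsample each cluster so no single failure mode dominates training.
    rng = random.Random(seed)
    kept = []
    for members in clusters.values():
        kept.extend(rng.sample(members, min(per_cluster, len(members))))
    return kept
```

At production scale you would replace the linear scan with a FAISS index, but the shape of the pipeline (embed, cluster, cap each cluster) stays the same.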
For a complete training set: aim for 10–50k (raw query, canonical rewrite) pairs. If you're below 5k, synthetic augmentation becomes load-bearing (see section 5).
Training Strategy
A 1–3B parameter model is the right size here. You're not asking the model to reason; you're asking it to perform a constrained transformation. Phi-3 Mini, Qwen2.5-1.5B, or Gemma-2-2B all work well. Fine-tuning a 70B model for this task is almost always a cost-performance mistake.
The fine-tuning objective is straightforward cross-entropy over the rewritten query given the input, but a few details matter:
Instruction format consistency: Wrap the raw query in a consistent template (Rewrite the following customer query as a concise search query for a product knowledge base: {query}) and never deviate during training. Query rewriting models are remarkably sensitive to prompt drift. A model trained on one template will degrade noticeably if the production template changes.
Length penalty: A common failure mode is model collapse toward verbose rewrites ("Please search the knowledge base for information about the following customer issue regarding..."). Add a length regularization term or just filter your training data to remove rewrites that are longer than the original query plus 20%.
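The two preprocessing rules above (one fixed template, a hard +20% length filter) are simple enough to enforce in a few lines. A minimal sketch, with the template paraphrasing the one given in the text:

```python
TEMPLATE = (
    "Rewrite the following customer query as a concise search query "
    "for a product knowledge base: {query}"
)

def build_training_pairs(raw_pairs):
    """Wrap inputs in the fixed template and drop verbose rewrites.

    raw_pairs: iterable of (raw_query, rewrite) tuples. Rewrites longer
    than the original query plus 20% (measured in whitespace tokens)
    are filtered out to discourage verbose collapse.
    """
    kept = []
    for raw, rewrite in raw_pairs:
        if len(rewrite.split()) > 1.2 * len(raw.split()):
            continue  # verbose rewrite: skip rather than train on it
        kept.append({"prompt": TEMPLATE.format(query=raw), "target": rewrite})
    return kept
```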
Serving: This model sits on the critical path of every retrieval call, so latency matters. Use 4-bit quantization (AWQ or GPTQ) and measure carefully: you typically retain 80–90% of full-precision quality with significant latency reduction. At scale, a synchronous 200ms query rewriter can become your pipeline's bottleneck faster than you'd expect.
The non-obvious failure mode: query rewriting can actively hurt retrieval for short, precise queries. If a user types "password reset link expiry" verbatim, the raw query is already a good retrieval signal and rewriting it introduces noise. Consider training a router that decides whether to rewrite based on query length and vocabulary overlap with your KB corpus.
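A first version of that router doesn't need a trained model at all. Here is a heuristic sketch based on the two signals mentioned above (query length and vocabulary overlap with the KB corpus); the thresholds are illustrative starting points, not tuned values.

```python
def should_rewrite(query: str, kb_vocab: set[str],
                   max_tokens: int = 4, min_overlap: float = 0.75) -> bool:
    """Heuristic router: skip rewriting for short queries that already
    use the knowledge base's vocabulary. max_tokens and min_overlap
    are assumptions to tune against your own retrieval evals."""
    tokens = [t for t in query.lower().split() if t.isalpha()]
    if not tokens:
        return True  # nothing to match against; let the rewriter try
    overlap = sum(1 for t in tokens if t in kb_vocab) / len(tokens)
    # Short, KB-aligned queries are already good retrieval signals.
    if len(tokens) <= max_tokens and overlap >= min_overlap:
        return False
    return True
```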
Evaluation Framework
Offline: The core metric is retrieval recall: does the rewritten query surface the correct KB article in the top-5 results, compared to the raw query? Build a held-out evaluation set of 500–1000 queries with known ground-truth documents and measure recall@1, recall@5 before and after rewriting. A rewriter that doesn't improve recall@5 by at least 10 points relative to no rewriting probably isn't worth the latency cost.
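The before/after comparison can be captured in a small harness. In this sketch, `retrieve` and `rewriter` are injected callables standing in for your retriever and rewrite model:

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose gold document appears in the top-k results.

    eval_set: list of (query, gold_doc_id) pairs.
    retrieve: callable(query) -> ranked list of doc ids.
    """
    hits = sum(1 for query, gold in eval_set if gold in retrieve(query)[:k])
    return hits / len(eval_set)

def rewrite_lift(eval_set, retrieve, rewriter, k=5):
    # Compare recall@k on raw queries vs. on rewritten queries.
    raw = recall_at_k(eval_set, retrieve, k)
    rewritten = recall_at_k(
        [(rewriter(q), gold) for q, gold in eval_set], retrieve, k)
    return raw, rewritten
```

Run it on the same 500–1000 query eval set for recall@1 and recall@5, and keep the rewriter only if the lift clears the ~10-point bar described above.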
Online: Track retrieval failure rate in production, specifically the fraction of queries where no retrieved document has relevance score above threshold. This is your primary canary. Also monitor rewrite latency p50/p99; the 99th percentile matters because a user left waiting while the system works through a slow rewrite is exactly the kind of UX friction that drives escalation.
Monitoring: Model drift on query rewriting tends to be slow but cumulative. When your product launches a new feature, the language around it doesn't appear in the rewriter's training distribution. Watch for systematic retrieval failures on new product vocabulary. Set up an alert when any KB article's traffic drops by more than 40% week-over-week, which is often the first signal that queries about that topic are being rewritten in ways that miss it.
Building an SFT Pipeline for a Customer Support Agent from Historical Ticket Data
Historical ticket data is seductive. There's a lot of it. It contains real user queries, real resolutions, real product knowledge. And most of it is completely unusable as SFT training data.
The problem is that human support agents are inconsistent, often wrong, and writing for themselves rather than for a downstream ML model. They use internal shorthand, reference ticket IDs that aren't in the training data, write "checked account, resolved" with no context for what was checked, and occasionally give users incorrect information that the product team quietly corrected later. Training an LLM on this data uncritically is one of the most reliable ways to bake in your worst agents' habits rather than your best agents' judgment.
Data
The data pipeline has two stages: survival filtering and quality scoring.
Survival filtering removes the structurally unusable data:
- Tickets with resolution notes shorter than 50 tokens (almost always "escalated" or "resolved - see notes" with no substantive content)
- Tickets where the resolution happened across more than one conversation thread (the context is fragmented and reconstructing it is more work than it's worth)
- Tickets closed as "spam" or "test"
- Any ticket where the agent's response was later corrected by a supervisor (check your ticketing system's edit history)
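The four survival rules above translate directly into a filter function. This is a sketch over a hypothetical ticket schema (`resolution_note`, `thread_ids`, `status`, `was_corrected` are illustrative field names); map them onto whatever your ticketing system actually exports.

```python
def survives(ticket: dict, min_resolution_tokens: int = 50) -> bool:
    """Survival filter for SFT candidates. Field names are illustrative,
    not a real ticketing API."""
    if len(ticket.get("resolution_note", "").split()) < min_resolution_tokens:
        return False  # "resolved - see notes" style stubs
    if len(ticket.get("thread_ids", [])) > 1:
        return False  # resolution fragmented across threads
    if ticket.get("status") in {"spam", "test"}:
        return False
    if ticket.get("was_corrected", False):
        return False  # a supervisor later edited the agent's answer
    return True

def survival_filter(tickets):
    return [t for t in tickets if survives(t)]
```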
After survival filtering, expect to lose 40–60% of your data. This isn't a disaster. It's the filter working correctly.
Quality scoring ranks the surviving data. Train a small quality classifier, or just use an LLM with a well-crafted rubric, to score each (query, response) pair on: correctness, completeness, tone, and whether the response actually resolves the stated issue without requiring follow-up. Keep the top 30%. I've found that the top 30% of ticket data produces a significantly better SFT model than training on everything, even accounting for the reduced volume.
The right training format is a multi-turn dialogue: system prompt (your agent persona, available tools, escalation criteria), then the full conversation thread as (user, assistant) turns. Don't collapse multi-turn tickets into a single (query, response) pair. The model needs to learn conversation management, not just response generation.
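Concretely, the conversion looks something like the following sketch. The system prompt content is a placeholder for your own persona and escalation criteria:

```python
SYSTEM_PROMPT = (
    "You are a support agent for Acme. Use only the provided knowledge "
    "base. Escalate billing disputes over $500 to a human."
)  # illustrative persona and escalation criteria, not a real policy

def ticket_to_sft_example(thread):
    """Convert a ticket thread into a multi-turn chat-format example.

    thread: list of (speaker, text) with speaker in {"user", "agent"}.
    Keeps the full conversation rather than collapsing it to a single
    (query, response) pair, so the model learns turn management.
    """
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for speaker, text in thread:
        role = "user" if speaker == "user" else "assistant"
        messages.append({"role": role, "content": text})
    return {"messages": messages}
```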
Non-obvious data problem: survivorship bias. Your historical data only contains tickets that reached your human support team. This means it's systematically missing the simple issues your existing automation already handles, and systematically overrepresenting edge cases, angry users, and complex multi-product issues. A model trained only on this data will be tuned for hard cases and may actually perform worse on common cases.
Training Strategy
For a support agent, a 7–13B model is usually the right operating range. Smaller than 7B and the instruction-following is unreliable under adversarial inputs (users trying to manipulate the agent into giving unauthorized discounts, for example). Larger than 13B and the serving cost becomes hard to justify unless your ticket volume is enormous.
LoRA fine-tuning with rank 16–64 is the standard approach. Higher ranks give you more capacity for domain-specific knowledge but also increase the risk of overfitting to your worst agents. I'd start at rank 32 and ablate. Use full precision for the embedding and LM head while keeping the rest in 4-bit (QLoRA), which preserves embedding fidelity during quantized training.
The most important hyperparameter decision is the learning rate schedule. Linear decay with warmup works fine for the first fine-tune, but if you're running continual training cycles (which you should be; see section 6), cosine decay with restarts helps prevent the model from collapsing to narrow modes that only represent your most recent data.
For loss function, standard cross-entropy on assistant turns only (mask the user turns). The temptation to add a supervised contrastive loss on quality-scored pairs here is understandable but usually not worth the complexity until you've already hit the ceiling on standard SFT, and most teams haven't.
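Masking the user turns comes down to building a labels sequence that mirrors the input ids but replaces non-assistant tokens with the cross-entropy ignore index (-100 is the conventional value in PyTorch). A minimal sketch over pre-tokenized turns:

```python
IGNORE_INDEX = -100  # conventional cross-entropy ignore index

def build_labels(segments):
    """Mask everything except assistant turns in the loss.

    segments: list of (role, token_ids) in conversation order, already
    tokenized. Returns (input_ids, labels) where labels copy the token
    ids on assistant turns and are IGNORE_INDEX elsewhere, so the loss
    only trains on what the agent says.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels
```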
Evaluation Framework
Offline: Held-out test set of 200–500 tickets, scored by a combination of automatic metrics (ROUGE-L against gold response, BERTScore against KB ground truth) and LLM-as-judge. The LLM-as-judge rubric matters a lot here. Be specific. "Evaluate this response for: (1) factual accuracy against the provided KB articles, (2) tone appropriateness, (3) completeness, (4) whether it would resolve the ticket without further follow-up." Vague rubrics produce noisy judge outputs that aren't useful for iteration.
Online: Track ticket escalation rate and re-open rate by query category. Escalation rate is your primary quality signal: if the agent's responses are good, fewer tickets will need human intervention. Re-open rate captures a different failure mode: responses that technically close the ticket but don't actually solve the problem, so the user comes back.
Monitoring: Human agent quality is not static. If you retrain on data from a period when your support team was understaffed and burning out, you'll distill that burnout into your model. Watch your training data distribution for CSAT-weighted quality scores over time and consider weighting recent high-quality tickets more heavily in future training runs.
Designing a DPO/RLHF Alignment Pipeline for Tone, Brand Voice, and Refusal Behaviour
Getting the support agent to say the right thing is hard enough. Getting it to say things the right way, in your brand voice, with appropriate empathy, declining to make unauthorized commitments, is a different category of problem, and one that SFT alone almost never solves completely.
SFT teaches the model what responses look like. Preference alignment teaches it which responses are better. The distinction matters because tone and voice are preference problems, not demonstration problems: you can show the model two responses and tell it which one sounds more like your brand, but you can't easily generate a corpus of demonstrations that captures every subtle aspect of voice without spending an enormous amount on annotation.
Data
The preference dataset needs to contain (prompt, chosen, rejected) triples where chosen and rejected differ specifically on the dimension you're aligning. This sounds obvious but it's easy to construct a dataset that mixes multiple alignment dimensions, which produces a reward model that learns the correlations between them rather than their independent effects.
Build separate preference datasets for:
- Tone and empathy: Same factual content, different warmth/formality level
- Brand voice: Same intent, different vocabulary and phrasing choices
- Refusal behavior: Correct refusals vs. over-refusals (refusing to help with legitimate requests) and under-refusals (complying with requests for unauthorized discounts, PII disclosure, etc.)
For customer support, the refusal data is the hardest to get right. You need real examples of users trying to manipulate the agent: asking it to override pricing rules, requesting information about other accounts, trying to get the model to agree to SLA terms that don't exist. Annotating these edge cases correctly requires domain experts who understand your actual policies, not just general annotation guidelines.
Minimum viable dataset size for DPO: 1,000–5,000 preference pairs per alignment dimension. Below 500 pairs on any single dimension and the signal-to-noise ratio is usually too low to see clean improvements.
Training Strategy
DPO is the right starting point for most teams, not RLHF with PPO. The practical reason isn't just complexity. Training a separate reward model on your support domain introduces a second model that can fail, overfit, and require its own maintenance. DPO collapses the reward model into the policy update and is much more stable.
The DPO objective:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

The $\beta$ parameter controls how strongly the model diverges from the reference policy $\pi_{\text{ref}}$ (your SFT-fine-tuned model). The right value is context-dependent: for tone alignment where you want conservative nudges, a $\beta$ at the low end (around 0.1) is usually safe. For refusal behavior where you need hard constraints, pushing $\beta$ higher (0.5–1.0) gives you stronger signal but increases the risk of policy degradation.
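For intuition about how the per-example loss behaves, here is a toy numeric sketch computed from summed sequence log-probabilities (in a real trainer these come from the model's logits; here they are just floats):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    Each argument is log pi(y|x) for the chosen/rejected response under
    the policy being trained or the frozen reference policy. beta scales
    the implicit reward margin.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written stably as log1p(exp(-logits))
    return math.log1p(math.exp(-logits))
```

When the policy and reference agree, the loss sits at log 2; as the policy learns to rank chosen above rejected more strongly than the reference does, the loss falls below it.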
The failure mode nobody talks about: DPO on brand voice tends to shift the model's vocabulary but also subtly shifts its factual behavior. If your "chosen" responses happen to be written by your most knowledgeable agents (which they often are, because knowledgeable agents also write better), the model will associate good writing with specific knowledge patterns. This creates a version of Goodhart's Law: the model learns to write in a way that sounds like your best agents but may actually be confabulating the specific facts they knew. Always run a factual accuracy eval after tone alignment, not just tone quality.
For refusal alignment specifically, the chosen/rejected framing can be unintuitive. The correct framing for an over-refusal is: chosen = correctly helpful response, rejected = unnecessarily unhelpful response. Many teams accidentally flip this because the word "refused" feels like it should be the rejected sample.
Evaluation Framework
Offline: LLM-as-judge with dimension-specific rubrics. A unified "alignment score" is tempting but obscures the individual alignment failures. Score tone, voice, and refusal behavior separately. For refusal specifically, construct a red-team set of manipulative prompts and score refusal accuracy. You want to see both false positive rate (over-refusals) and false negative rate (under-refusals) because both are costly.
Online: Brand voice is harder to measure online than safety. The cleanest proxy is CSAT for responses where the factual content was controlled (i.e., the correct answer was in the KB and was retrieved). If CSAT is lower for aligned vs. unaligned responses when the facts are the same, your alignment is actively degrading user experience.
For refusal behavior, track escalation-on-refusal rate: the fraction of times the agent refuses a request and the user immediately escalates to a human. A high rate suggests over-refusal on legitimate requests. Track alongside resolution rate for requests that were not refused. An agent that never refuses looks good on resolution rate until you discover it's been agreeing to refunds it shouldn't.
GRPO for Fine-Tuning Reasoning and Instruction-Following Behaviour
Group Relative Policy Optimisation (GRPO) deserves a section of its own because it's one of the few recent training innovations that genuinely changes what small models are capable of, rather than just making them faster or cheaper to train. It's also frequently misunderstood in ways that lead to broken training runs.
How GRPO Differs from PPO and DPO
PPO (Proximal Policy Optimization) is an actor-critic method: it trains both a policy (the model you want) and a value function (a critic that estimates how good a state is), uses the value function to compute advantage estimates, and clips the policy update to stay near the current policy. The critic is load-bearing. Without it, you get high-variance gradient estimates. But the critic also adds cost: an additional model of comparable size to the policy, memory for its activations, and a separate update loop. PPO is powerful but operationally heavy.
DPO removes the reward model entirely by reformulating preference optimization as a direct supervised loss on the policy, using the reference policy as an implicit reward signal. Elegant, but this means DPO can only use preference pairs. It can't learn from verifiable outcomes like "this answer is mathematically correct." That's a significant limitation for reasoning tasks.
GRPO's key insight is: keep the verifiable-outcome reward from RL, but eliminate the critic by computing baselines from group statistics. For each prompt, you sample a group of $G$ responses from the current policy, compute a scalar reward $r_i$ for each, and use the group mean as the baseline:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$
The policy update then maximizes the advantage-weighted log-likelihood of chosen actions, with a KL penalty to the reference policy. No critic, no value function training, no separate model. The gradient variance is controlled by the group size: larger groups give lower variance at higher sampling cost.
(This is the simplified form. The full GRPO objective, as described in the DeepSeek-Math paper, applies per-token clipped probability ratios similar to PPO's clipped surrogate, which bounds the update magnitude per token. Without the clipping, the above reduces to vanilla REINFORCE with a group baseline.)
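The group-baseline computation itself is a few lines; a minimal sketch (with a small epsilon to guard the degenerate all-equal-rewards case):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's rollout group: subtract
    the group mean and divide by the group std. No critic network is
    involved; the group itself is the baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Note the failure mode visible right in the arithmetic: if every rollout earns the same reward, every advantage is zero and the prompt contributes no gradient, which is exactly why informative (high-variance) group rewards matter.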
Why GRPO Suits Chinese Base Models (Qwen, DeepSeek)
The short answer is that Qwen2.5 and DeepSeek-R1 were trained with GRPO explicitly in mind, and their base models have particular properties that work well with group-sampled rollouts: diverse decoding behavior at moderate temperatures (which makes the group-sampled rewards informative rather than degenerate), strong base instruction following (which means the SFT cold start is shorter), and pre-training that included code and math (which means verifiable rewards for these domains are dense).
The longer answer is that GRPO's advantage normalization works best when group rewards are informative, meaning when different rollouts get meaningfully different rewards. If your model already does well on a task, all rollouts in the group get similar rewards, the normalized advantages are near zero, and the gradient is uninformative. The practical motivation for GRPO at DeepSeek was operational: eliminating the critic model cuts memory requirements roughly in half, making RL feasible at larger model scales. The fact that these base models also tend to show diverse rollout behavior at moderate temperatures (likely due to code and math-heavy pre-training rather than multilingual content specifically) makes group-sampled advantages informative rather than degenerate.
Constructing Group-Sampled Rollouts for a Support Agent
For a customer support context, the typical GRPO setup samples a group of $G$ rollouts per prompt at a moderately high sampling temperature, so the group contains genuinely different candidate responses. The prompt is a customer query plus retrieved KB articles. Each rollout is a complete agent response.
The reward function is where the real design work happens. For support, I'd compose three reward signals:
Faithfulness reward (0–1): Does the response make claims that are supported by the retrieved documents? Use an NLI model (a fine-tuned DeBERTa or a small LLM) to check each factual claim in the response against the KB articles. This is the most important signal. Hallucinated responses that confidently state wrong policies cause real business harm.
Resolution reward (binary, lagged): For training environments where you can simulate resolution, use a classifier trained on historical tickets. In production, this reward is delayed (you learn if the ticket actually resolved 24–48 hours later), which creates credit assignment challenges (see section 12).
Format compliance reward (binary): Does the response follow the required format? Did it include a ticket reference when one was available? Did it avoid phrases on your prohibition list? This is cheap to compute and important because LLMs at non-zero temperature will occasionally violate constraints that seemed robustly learned.
The composite reward: $$r = w_{\text{faith}} \, r_{\text{faith}} + w_{\text{res}} \, r_{\text{res}} + w_{\text{fmt}} \, r_{\text{fmt}},$$ with the faithfulness weight carrying most of the mass.
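A sketch of the combination step, with illustrative weights (the only firm guidance from the discussion above is that faithfulness should dominate, since hallucinated policy claims are the costliest failure):

```python
def composite_reward(faithfulness: float, resolved: bool, format_ok: bool,
                     w_faith=0.6, w_res=0.3, w_fmt=0.1) -> float:
    """Weighted composite of the three reward signals.

    faithfulness is a score in [0, 1] from the NLI check; resolution and
    format compliance are binary. The weights are illustrative defaults,
    not tuned values.
    """
    assert 0.0 <= faithfulness <= 1.0
    return (w_faith * faithfulness
            + w_res * float(resolved)
            + w_fmt * float(format_ok))
```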
Failure Modes
Reward hacking is the first thing that goes wrong. The faithfulness reward can be gamed by responses that parrot verbatim phrases from the retrieved documents regardless of relevance. Add a novelty penalty for responses with high n-gram overlap with the context, or use a faithfulness metric that distinguishes entailment from verbatim copying.
Response collapse happens when the model's rollout diversity collapses, usually because the reward landscape has a high-reward "attractor" that most rollouts converge to. The symptom is all rollouts in a group becoming nearly identical, which makes the normalized advantage near zero for everything and kills the learning signal. Monitor rollout diversity (average pairwise BLEU or embedding cosine similarity within a group) during training. If it drops below a threshold, increase the sampling temperature or reduce the KL coefficient to allow more exploration.
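The diversity monitor can be as simple as mean pairwise n-gram overlap across the group. This sketch uses Jaccard overlap of word bigrams as a cheap stand-in for pairwise BLEU; the collapse threshold is an assumption to tune.

```python
from itertools import combinations

def ngram_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of word n-grams between two rollouts."""
    def grams(s):
        toks = s.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def group_diversity_alert(rollouts, max_mean_overlap=0.8) -> bool:
    """True if the rollout group has collapsed (mean pairwise overlap
    above threshold) and it's time to raise temperature or lower the
    KL coefficient. The 0.8 threshold is illustrative."""
    pairs = list(combinations(rollouts, 2))
    if not pairs:
        return False
    mean = sum(ngram_overlap(a, b) for a, b in pairs) / len(pairs)
    return mean > max_mean_overlap
```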
KL explosion usually indicates that your reference policy is too far from your current policy, often because you've over-trained on SFT before starting GRPO. The KL penalty that was calibrated for the early training phase becomes insufficient as the policy drifts. Linearly increase the KL coefficient over training, or implement adaptive KL control as in the original PPO paper.
Evaluation Framework
Offline: For instruction-following, use benchmarks like IFEval (instruction following evaluation) against a held-out support instruction set. For reasoning, construct a set of policy interpretation questions (cases where the answer requires reasoning through multiple KB documents) and grade against expert-annotated gold answers.
Online: Track the fraction of responses requiring human correction by query complexity tier. GRPO-trained models should show the largest improvements on the hardest tier (multi-step reasoning, contradictory policy resolution), not on the simple factual queries.
Synthetic Data Generation for Low-Resource Fine-Tuning Tasks
There are tasks in your support pipeline for which you have almost no real training data. The newly launched enterprise tier has been live for six weeks. A partner integration shipped two months ago. Your escalation classifier needs examples of specific failure modes that thankfully haven't happened yet.
Synthetic data is not a free lunch, and it's not a replacement for real data. But used carefully it can turn a 500-example dataset that barely trains a good model into a 5,000-example dataset that does. The key is understanding what the teacher model does and doesn't know.
Data
Teacher model rewrites are the simplest and usually most effective approach. You have a seed corpus of real examples (even a small one). You prompt a strong LLM (GPT-4o, Claude 3.5 Sonnet, or a large open-source model) with a few-shot template and ask it to rewrite each example while preserving the intent and varying the phrasing, formality, and user vocabulary. This works well for query rewriting augmentation, paraphrase generation for the retrieval system, and SFT data expansion.
The system prompt matters enormously. Vague instructions ("rewrite this in different words") produce diverse but meaningless variation. Specific instructions ("rewrite this as a frustrated user who has already tried the standard troubleshooting steps and is considering cancellation") produce variation that targets real distribution coverage gaps.
Backtranslation (translating to an intermediate language and back) is underused in English-language support settings but surprisingly effective for generating lexical diversity. Round-tripping through Spanish or French with a modern MT model produces fluent English paraphrases with different vocabulary patterns that can expand retrieval coverage at essentially zero cost. The non-obvious problem: backtranslation through certain language pairs introduces systematic phrase patterns. Chinese→English round-trips often produce more formal phrasing; German→English tends toward longer compound constructions. Monitor for these artifacts.
Self-instruct style pipelines are the right approach when you need to generate entirely new examples rather than paraphrase existing ones. The basic loop: prompt a teacher model with a domain description and a few seed examples, generate new (instruction, response) pairs, filter using a quality classifier, add to the training set. The failure mode is that self-instruct pipelines are prone to topic drift: after enough iterations, the generated examples become systematically different from real user queries. Counter this by re-anchoring to your real data distribution every N generations (e.g., include recent real queries in the generation prompt).
Training Strategy
The critical question is whether to mix synthetic and real data or train on them sequentially. In my experience, mixing usually works better than sequential training. Sequential training (real data first, then synthetic) can cause the model to partially overwrite what it learned from real data. Mixing, with the real data weighted 3–5x higher, tends to produce more stable training.
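One simple way to implement the 3–5x upweighting is to duplicate real examples in the training mix rather than weight the loss, which keeps the dataloader standard. A minimal sketch:

```python
import random

def mix_datasets(real, synthetic, real_weight=4, seed=0):
    """Build a shuffled training mix with real examples upweighted.

    Rather than training sequentially (real then synthetic), duplicate
    each real example real_weight times so real data dominates every
    batch. real_weight in the 3-5 range matches the guidance above.
    """
    mixed = list(real) * real_weight + list(synthetic)
    random.Random(seed).shuffle(mixed)
    return mixed
```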
For self-instruct generated data specifically, add a confidence filter: use the teacher model itself to score each generated (instruction, response) pair and discard the bottom 20%. Teacher models are good at recognizing low-quality outputs even when they occasionally produce them.
Non-obvious failure mode: Synthetic data amplifies teacher model biases. If your teacher model has a systematic tendency to frame refusals in a specific way, or to use certain explanatory patterns, your fine-tuned model will learn those patterns. This is especially problematic for refusal behavior. A model trained to refuse like GPT-4 will sound like GPT-4 refusing, which may not match your brand voice at all.
Evaluation Framework
Offline: Compare model performance with and without synthetic augmentation on a held-out real-data test set. The synthetic data should improve performance on the real test set. If it doesn't, it's adding noise. Also check whether the model's outputs on synthetic-origin prompts differ systematically from real-origin prompts in ways you don't want (vocabulary, format, tone).
Online: Track performance on recently deployed features specifically. Synthetic data should help most with knowledge gaps from new products. If escalation rate for new feature tickets is still high after synthetic augmentation, the teacher model's knowledge about your new features (usually from public documentation, not internal specs) wasn't sufficient.
Continual Learning and Hot-Fix Adapter System for a Deployed LLM Using LoRA
Your product ships a new pricing tier. A regulation changes your refund policy. A partner integration launches. In each case, your deployed model has the wrong information, and you need to update it without a full retraining cycle, without touching your production model weights, and without introducing regressions on everything it currently handles correctly.
This is the problem LoRA-based adapter patching is actually good at, but it requires thinking about adapters as a system rather than just a fine-tuning technique.
Data
Hot-fix adapter training requires a small, high-quality dataset that is deliberately narrow: the specific policy changes, new product information, or query patterns you need to address. Typical size: 100–1,000 examples per patch. This sounds small but is intentional. A focused adapter trained on narrow data patches the specific behavior you need without disturbing the base model's broader capabilities.
For each hot-fix, construct three types of examples:
- Positive examples: Queries about the changed policy/feature with correct new responses
- Negative rejection examples: Queries about the old behavior, with responses that acknowledge it has changed (not just ignore the question)
- Stability anchors: A random sample of existing high-quality (query, response) pairs that are unrelated to the change, included to detect regression
The stability anchors are the non-obvious but essential component. Without them, you won't know if your hot-fix adapter is damaging performance on adjacent topics until you've deployed it and measured the regression in production.
Training Strategy
The adapter architecture question is whether to stack adapters (one base model, multiple LoRA adapters applied simultaneously) or switch adapters (select one adapter per query at inference time). Stacking is simpler but causes interference between adapters as you accumulate patches. Switching avoids interference but requires a router that decides which adapter(s) are relevant to each query.
For most deployments, a hybrid approach works best: maintain a small library of LoRA adapters by knowledge domain (pricing, integrations, policy, troubleshooting), and at inference time select the top-1 or top-2 relevant adapters using a fast embedding-based router. The base model provides the general instruction-following capability; the domain adapters inject the current knowledge.
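The embedding-based router reduces to nearest-centroid lookup. A sketch, assuming each adapter has a centroid embedding (e.g., the mean embedding of its domain's training queries); the `min_sim` floor, which lets clearly off-domain queries fall through to the bare base model, is an assumption worth keeping.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route_adapters(query_emb, adapter_centroids, top_k=2, min_sim=0.3):
    """Pick the top-k domain adapters for a query.

    adapter_centroids: dict of adapter name -> centroid embedding.
    Returns up to top_k adapter names whose similarity clears min_sim;
    an empty list means serve the unpatched base model. Thresholds are
    illustrative.
    """
    scored = sorted(
        ((cosine(query_emb, c), name) for name, c in adapter_centroids.items()),
        reverse=True)
    return [name for sim, name in scored[:top_k] if sim >= min_sim]
```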
Catastrophic forgetting in the adapter context is more subtle than in full fine-tuning. A LoRA adapter trained on new pricing information won't forget how to write English, but it may shift the model's behavior on pricing-adjacent topics in unintended ways. The practical mitigation is L2 regularization toward the original adapter initialization, which penalizes large parameter shifts. (This is motivated by the same intuition as EWC, but for low-rank adapters with only a few thousand parameters, the Fisher information computation of full EWC is rarely justified over simple L2.)
Rank selection: Use lower rank for hot-fix adapters (4–8) than for the initial fine-tune (16–64). Hot-fixes should be small perturbations, not large behavioral shifts. If you find you need high-rank adapters to express a policy change, that's a signal that the change is too large for an adapter and needs a full fine-tuning cycle.
Deployment strategy: Version your adapters semantically (e.g., adapter-pricing-v12-2026-03-15) and keep a 30-day rollback window. The most common failure mode isn't the adapter being wrong. It's the adapter being correct but conflicting with cached responses or downstream classifiers that were calibrated on the old behavior.
Evaluation Framework
Offline: Test the patched model on (1) the specific cases the patch was designed to fix, (2) the stability anchor set to detect regression, and (3) a held-out set of adversarial prompts that probe the boundary of the patch. A pricing policy change should improve responses about the new tier and not change responses about features that have nothing to do with pricing.
Online: For hot-fix adapters specifically, A/B testing is often too slow, since the whole point is rapid deployment. Instead, use a shadow mode: route a small fraction of live traffic to the patched model, compare against the base model, and monitor for deviation in CSAT, escalation rate, and refusal rate. Set tight automated guardrails: if the patched model's escalation rate increases by more than 5% relative to baseline within the first 24 hours, roll back automatically.
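The automated guardrail described above is a few lines of logic. A minimal sketch, using the 5% relative-increase threshold from the text (the function name and signature are illustrative):

```python
def should_roll_back(base_escalation_rate, patched_escalation_rate,
                     max_relative_increase=0.05):
    """Shadow-mode guardrail for hot-fix adapters: trigger rollback if the
    patched model's escalation rate exceeds the baseline by more than the
    allowed relative increase within the monitoring window."""
    if base_escalation_rate == 0:
        return patched_escalation_rate > 0
    relative_increase = (
        (patched_escalation_rate - base_escalation_rate) / base_escalation_rate
    )
    return relative_increase > max_relative_increase
```

In production this check would run on a schedule over the first 24 hours of shadow traffic, with the rollback itself handled by your deployment tooling.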
Monitoring: Track adapter stack size over time. Every adapter you add is technical debt in your inference stack. Teams that don't periodically reconcile their adapter library into a new full fine-tune will eventually end up with eight stacked adapters from different epochs of their product history, and the adapter interactions will produce behaviors that are impossible to debug.
Knowledge Base Ingestion and Chunking Pipeline for RAG Across Heterogeneous Sources
Your knowledge base is not a homogeneous corpus of clean markdown documents. It's PDFs with two-column layouts, Zendesk tickets with nested quotes and time-stamped agent notes, Confluence pages with inconsistent heading hierarchies, and Slack threads where the resolution is buried in message 47 of a 200-message thread.
Each of these formats fails differently in a naive chunking pipeline, and the failures compound because the retrieval system doesn't know why a chunk is garbled. It just retrieves it.
Data
PDF ingestion is harder than it looks. The naïve approach (pdfplumber or PyMuPDF for text extraction) works for simple single-column PDFs, but fails on:
- Multi-column layouts, where the extracted text order is wrong (first half of column 1, then first half of column 2, then second half of column 1)
- Tables, where the cell ordering is often nonsensical as plain text
- Scanned PDFs, where you need OCR and layout analysis
For support documentation specifically, tables are the highest-risk format: warranty terms, pricing tiers, SLA parameters, feature comparison matrices. A table that extracts as scrambled text is worse than no table at all because the retrieval system will find it and surface it confidently. Use a layout-aware PDF parser (Marker, Docling, or LayoutLMv3-based) for any PDF that might contain tables. It's slower and more complex, but the alternative is silently wrong retrieval.
Zendesk ticket ingestion requires deciding what to index. The full ticket thread (user messages + agent responses + internal notes) is rarely the right unit. Better: index the agent's final resolution note separately, indexed with the user's problem description as context. This way retrieval surfaces "how to resolve X" rather than "what a conversation about X looked like."
Confluence requires navigating the hierarchy. A page's content often only makes sense in the context of its parent page. Chunk at the section level (H2/H3) but prepend a breadcrumb: [Product Documentation > Account Management > Billing > ]. This costs tokens but dramatically improves retrieval precision for questions about nested policy areas.
Slack threads: index sparingly. The signal-to-noise ratio is low, and institutional knowledge in Slack tends to have a short shelf life. Index only threads that have been explicitly bookmarked or where an admin has tagged a message as "resolved" or "answer." Raw Slack ingestion is how you end up retrieving advice from a 2023 conversation about a product feature that was deprecated in 2024.
Training Strategy
The chunking strategy is actually a design decision with a clear trade-off surface:
Semantic chunking (splitting at natural semantic boundaries using an embedding model to detect topic shifts) produces coherent chunks but is expensive and slow. It's worth it for high-value documentation (product guides, policy documents) but overkill for FAQ pages.
Fixed-size chunking (512 tokens, 128-token overlap) is fast, predictable, and works adequately for uniform prose. The failure mode is split sentences or concepts that span the chunk boundary. The 128-token overlap handles most of these, at the cost of duplicate retrieval.
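Fixed-size chunking with overlap is simple enough to show in full. A minimal sketch operating on a pre-tokenized document (the function name is illustrative; tokenization itself is assumed to happen upstream):

```python
def chunk_tokens(tokens, size=512, overlap=128):
    """Fixed-size chunking with overlap: each chunk starts
    (size - overlap) tokens after the previous one, so content near a
    boundary appears in two chunks."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # final chunk reached the end
            break
    return chunks
```

The duplicate-retrieval cost mentioned above is visible directly: the last 128 tokens of each chunk reappear as the first 128 tokens of the next.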
Hierarchical chunking maintains both paragraph-level and section-level embeddings, retrieving at paragraph level but expanding context to section level when the LLM generates its response. This is the right architecture for heterogeneous sources. You get precision from small chunks, coherence from large context windows.
For document freshness, embed the last_modified timestamp in the chunk metadata and use it as a retrieval filter for policy documents. A policy document that hasn't been touched in 18 months should be flagged for review before being served to users (see section 9).
Evaluation Framework
Offline: Chunking quality is best evaluated by retrieval recall on a gold set of (question, expected document section) pairs built by domain experts. If section-level recall@5 is high but the retrieved text is garbled (common with table extraction failures), your chunking pipeline has a format-specific problem that retrieval metrics won't surface directly. Add a readability check to your eval pipeline.
Online: Track chunk retrieval diversity, specifically the fraction of your KB chunks that appear at least once in the top-5 retrieval results per day. If this number is low (below 20–30%), you have a retrieval concentration problem where a small set of chunks dominates all queries. This could mean your embedding model is biased toward certain document types or that your chunking has created many low-quality chunks that never win a retrieval competition.
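The diversity metric above reduces to a set-union computation over the day's retrieval logs. A minimal sketch (data shapes are my assumption; adapt to however your retrieval events are logged):

```python
def chunk_coverage(daily_top5_results, total_chunks):
    """Fraction of KB chunks that appeared at least once in any top-5
    retrieval result during the period.

    daily_top5_results: iterable of lists of chunk IDs, one list per query.
    total_chunks: total number of chunks in the index."""
    seen = set()
    for top5 in daily_top5_results:
        seen.update(top5)
    return len(seen) / total_chunks
```

A value persistently below the 20–30% range mentioned above is the trigger to investigate retrieval concentration.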
Hybrid Retrieval System Design: BM25 + Dense Embeddings + Cross-Encoder Reranker
The practical case for hybrid retrieval is simple: dense embeddings and BM25 fail in complementary ways. Dense embeddings miss exact matches ("error code E-2041") but generalize well to semantic similarity. BM25 nails exact matches but struggles with paraphrase and semantic variants. A hybrid system captures both failure modes.
The case for adding a cross-encoder reranker is subtler: both BM25 and dense retrieval score documents independently of each other, so they can't model the relative quality of two documents against the same query. A cross-encoder takes (query, document) pairs and produces a joint relevance score, which tends to be significantly more accurate for the final ranking but is too slow to run at full corpus scale.
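The text doesn't prescribe a method for merging the BM25 and dense result lists before reranking; reciprocal rank fusion (RRF) is one common, score-scale-free choice. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked doc-ID lists (e.g. one from BM25, one from dense
    retrieval) by summing 1/(k + rank) per list. k=60 is the constant
    from the original RRF paper; it damps the dominance of rank-1 hits."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF's appeal here is that it never compares BM25 scores to cosine similarities directly, which sidesteps the score-normalization problem entirely; the fused top-k then goes to the cross-encoder.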
Data
The main data requirement for building this system is a relevance dataset: (query, relevant document, irrelevant document) triples. Constructing these is the expensive part.
For the dense embedding model, if you're fine-tuning rather than using off-the-shelf (sentence-transformers, E5, BGE), you need at least 10–50k positive (query, document) pairs. Positive pairs can be constructed from your ticket history: resolved tickets plus the KB articles agents consulted. Hard negatives (documents that look relevant but aren't) are critical for training a discriminative model. The standard approach is to use BM25 retrieval to find near-matches that the embedding model should learn to distinguish.
For the cross-encoder, you need genuine relevance labels (0/1 or graded). These are expensive to generate from scratch. A practical shortcut: use the top-20 results from your dense retrieval as the candidate set and collect binary relevance labels only for these candidates. This is biased (you can't measure recall for documents outside the candidate set) but is usually sufficient for fine-tuning a reranker.
Training Strategy
BM25 requires no training, just indexing. But it requires careful field design. For support KB articles, consider separate BM25 fields with different weights for: title (highest weight), headings/subheadings, body text, and tags/metadata. Standard BM25 treats all text equally, so the query "password reset" matches a document titled "Password Reset Guide" no more strongly than one that mentions password resetting in passing.
Dense embedding fine-tuning with contrastive learning (specifically InfoNCE loss with in-batch negatives) is standard. The non-obvious detail is the batch composition: you want hard negatives in the same batch as positives, which requires careful batch sampling. Using random in-batch negatives alone often produces an embedding model that separates random irrelevant documents from relevant ones easily, but fails on the hard cases where two documents are both superficially relevant.
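The InfoNCE objective with in-batch negatives is compact enough to sketch. This NumPy version is simplified (it shows only the loss for a batch of matched pairs and omits the hard-negative batch sampling discussed above; the 0.05 temperature is a commonly used value, not one from the text):

```python
import numpy as np

def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """InfoNCE with in-batch negatives: row i of query_embs should match
    row i of doc_embs; every other document in the batch serves as a
    negative. Embeddings are assumed L2-normalized."""
    logits = (query_embs @ doc_embs.T) / temperature       # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # NLL of true pairs
```

With only random in-batch negatives, the off-diagonal entries are easy to push down, which is exactly the failure mode described above; hard-negative sampling makes some of those off-diagonal documents genuinely confusable.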
Cross-encoder: A fine-tuned cross-encoder/ms-marco-MiniLM-L-12-v2 is a strong baseline. For support-specific fine-tuning, the most valuable training signal is cases where the dense retriever returned a semantically similar but factually wrong document as the top result. These are the cases where a good reranker earns its latency cost.
When each component matters most:
- BM25 is essential for product-specific terminology, error codes, and model numbers
- Dense embeddings handle user paraphrases and semantic queries ("why does my screen keep going black" → "display power management settings")
- The cross-encoder matters most when the top-k retrieved set contains multiple highly relevant documents that need to be ranked by specificity
Where it breaks: Hybrid retrieval with a cross-encoder can have latency profiles that are hard to bound. BM25 retrieval takes ~5ms, dense retrieval ~50ms, and the cross-encoder on top-20 candidates can take 200–500ms at CPU speeds. On the critical path of a synchronous user interaction, this is potentially unacceptable. Consider running the cross-encoder asynchronously and serving the dense retrieval result while the reranker is computing, upgrading the response if the reranker returns a different top-1 before the LLM finishes generating.
Evaluation Framework
Offline: NDCG@10, MRR@10, and recall@100 on your relevance dataset. Evaluate each component independently (BM25 alone, dense alone, hybrid, hybrid+reranker) to understand the additive value. In my experience, the hybrid step (BM25 + dense) typically gives a 10–15% relative improvement in recall@10 over dense alone; the reranker adds another 5–10% improvement in NDCG@5 at the cost of latency.
Online: Retrieval precision at the query level, meaning the fraction of queries where the top-1 retrieved document is marked as relevant by the LLM's faithfulness check. Track this separately for exact-match queries (error codes, product names) and semantic queries. A regression in exact-match precision after a model update is usually a sign that the embedding model update overfit to semantic similarity at the expense of lexical matching.
Detecting and Fixing Knowledge Base Rot
Knowledge bases degrade. Policies change and old documents don't get updated. Products are deprecated and their documentation remains indexed. A feature gets renamed and the KB still uses the old name. Over time, your retrieval system faithfully surfaces increasingly wrong information, and your LLM, having been instructed to stay grounded in retrieved content, faithfully repeats it.
This is knowledge base rot. It's insidious because it's gradual, it's domain-specific (some parts of the KB stay current while others drift), and it's invisible to users until they get genuinely wrong advice.
Data
The first step is an audit. Run your full KB corpus through three automated checks:
Temporal staleness: For each document, compare the last_modified timestamp against the frequency of queries that retrieved it in the last 90 days. Documents with high retrieval frequency but stale modification dates are high-risk. Anything not modified in 12 months that handles more than 0.5% of query volume should go on a review queue.
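The staleness check is a join between modification dates and retrieval logs. A minimal sketch using the 12-month / 0.5% thresholds from the text (the dict schema is my assumption; adapt to your document store):

```python
from datetime import datetime, timedelta

def stale_high_traffic_docs(docs, total_queries, now,
                            max_age_days=365, min_traffic_share=0.005):
    """Flag documents for the review queue: not modified within
    max_age_days yet serving more than min_traffic_share of query volume.

    docs: list of dicts with 'id', 'last_modified' (datetime),
          and 'retrievals_90d' (retrieval count over the window)."""
    flagged = []
    for doc in docs:
        age = now - doc["last_modified"]
        share = doc["retrievals_90d"] / total_queries
        if age > timedelta(days=max_age_days) and share > min_traffic_share:
            flagged.append(doc["id"])
    return flagged
```

Running this weekly and tracking the size of the flagged set gives you a cheap leading indicator of rot before faithfulness scores start declining.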
Cross-document contradiction detection: Use an LLM to identify contradictions between documents on the same topic. Prompt: "Given these two documents about [topic], identify any factual claims that contradict each other." This is expensive at corpus scale but you don't need to run it on all pairs. Cluster your documents by topic first and only run contradiction detection within clusters.
Retrieval precision decay: Track the per-document faithfulness score over time. If a document was faithful 90% of the time six months ago and is now 60%, something changed: either the document or the underlying facts it describes. A declining faithfulness trend on a high-traffic document is an early warning signal.
Training Strategy
Fixing KB rot is not primarily a model training problem. It's a data maintenance problem. But it has training implications:
Temporal weighting in retrieval: Add a recency bias to your retrieval scoring. For policy-sensitive topics, a document modified 30 days ago should score higher than a semantically similar document modified 18 months ago, all else equal. The implementation is a multiplicative decay factor on retrieval scores:
score_adjusted = score_raw · exp(−λ · Δt)

where Δt is the document's age since last_modified and λ controls the decay rate (a half-life of h days corresponds to λ = ln(2)/h). For policy documents, a half-life of 90 days is reasonable. For evergreen content (installation guides, API documentation), use a longer half-life or no decay.
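A minimal sketch of this recency decay, parameterized by half-life (which is easier to reason about than a raw decay rate; the helper name is illustrative):

```python
import math

def recency_adjusted_score(raw_score, age_days, half_life_days=90):
    """Multiplicative recency decay on a retrieval score: a document
    exactly one half-life old scores half its raw relevance. Pass
    half_life_days=None for evergreen content to disable decay."""
    if half_life_days is None:
        return raw_score
    decay_rate = math.log(2) / half_life_days
    return raw_score * math.exp(-decay_rate * age_days)
```

Applying the decay at scoring time rather than at index time means a document's effective rank drifts down continuously without any reindexing.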
Conflict resolution: When contradiction detection surfaces two documents that disagree, you need a process to resolve the conflict, not just to flag it. Build a human review queue, prioritize by query volume, and track time-to-resolution. The ML system detects the rot; human domain experts fix it.
Non-obvious failure mode: Fixing KB rot by deleting old documents can hurt retrieval coverage if the old documents covered edge cases that your new documents don't. Before deleting any document, run a retrieval coverage analysis: what queries did this document contribute to the top-5 that no other document now answers? If the answer is "none," delete safely. If it's "many," you have a content gap to fill before deleting.
Evaluation Framework
Offline: Track your corpus-level contradiction rate (fraction of document pairs with detected contradictions) as a long-term metric. A rising contradiction rate is a lagging indicator of insufficient KB maintenance cadence. Also track the freshness distribution of your top-100 most-retrieved documents.
Online: Monitor the rate of user-reported incorrect information. Not all wrong responses are KB rot. Some are LLM hallucinations, some are retrieval failures, but KB rot has a distinctive pattern: responses that cite specific KB documents and report specific information that has changed. If you're doing source attribution (you should be), track the rate of negative CSAT for responses citing specific documents, and use this as a per-document quality signal.
Multi-Model Routing: Small Fine-Tuned vs. Large General Model
There's a clean story about LLM routing: use a cheap small model for easy queries, escalate to an expensive large model for hard ones. This story is true but incomplete. In practice, routing decisions interact with your data flywheel, your SLA commitments, and your cost structure in ways that make a naïve confidence-based router fail more often than it should.
Data
The routing training problem is: given a query, predict whether the small model will produce an acceptable response. This is a binary classification problem, but what makes it hard is that you don't know the small model's quality until you've already run it, which defeats the point.
The resolution is to build a difficulty estimator that predicts quality from the input without running inference. Features that are predictive:
- Query complexity proxies: sentence count, number of clauses, presence of negation or qualification ("except when...", "unless I...")
- Knowledge boundary signals: does the query require integration across multiple KB documents? Does it involve a recently updated policy? Does it contain product identifiers that appear in your KB?
- Historical difficulty: for semantically similar queries in your query history, what was the small model's resolution rate?
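The query-complexity proxies in the first bullet are cheap string features. A minimal sketch (the specific regexes and feature names are illustrative, not a vetted feature set):

```python
import re

def difficulty_features(query):
    """Cheap complexity proxies computed without model inference:
    sentence count, clause count, and negation/qualification markers."""
    sentences = [s for s in re.split(r"[.!?]+", query) if s.strip()]
    clauses = [c for c in
               re.split(r"[,;]| but | because | unless | except ", query)
               if c.strip()]
    negations = re.findall(
        r"\b(not|never|unless|except|won't|can't|doesn't)\b", query.lower())
    return {
        "sentence_count": len(sentences),
        "clause_count": len(clauses),
        "negation_count": len(negations),
    }
```

In a real router these features would be concatenated with a sentence-transformer embedding and the knowledge-boundary signals before being fed to the classifier.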
Construct a training set by running a batch of historical queries through both models, collecting human or LLM-as-judge quality labels, and training a small router model (a fine-tuned classifier on top of a sentence transformer) to predict when the small model's output is unacceptable.
Training Strategy
The router should output a confidence score, not a hard binary decision. At inference time, route to the large model if confidence drops below a threshold, and calibrate that threshold against your cost and quality targets. A threshold that routes 20% of queries to the large model should be evaluated differently than one that routes 5%.
Confidence calibration is the hard part. LLMs are notoriously overconfident. A small model's own confidence score (log-probability of its response under the policy) is a weak predictor of actual quality. Models assign high probability to fluent but wrong responses. The router should use external features, not the small model's self-reported confidence.
Latency budgets complicate routing. If you route to the large model, you're not just paying more per token. You're potentially violating a latency SLA. A query that arrives during a peak traffic period might need to accept a worse-quality response from the small model because the large model's queue is backed up. Build latency-aware routing: factor the current large model queue depth into the routing decision.
Cost modeling: Route 100% of traffic through the large model and you have a simple cost structure (expensive, uniform). Route 100% through the small model and you have a simple cost structure (cheap, inconsistent). The hybrid system has an optimization loop: as you improve the small model through continual fine-tuning, the routing threshold can be moved to route more traffic there. Track the fraction routed to each model as a monthly metric, not just as an operational parameter.
Non-obvious failure mode: Multi-model routing creates a silent feedback loop problem. If the router consistently routes hard queries to the large model and easy queries to the small model, your production data for fine-tuning the small model will be disproportionately easy queries. The small model never gets trained on hard cases because you always route those to the large model. Over time, the distribution gap between what the small model is trained on and what it needs to handle widens. One mitigation is to route a small sample of hard queries to the small model in shadow mode (serving the large model's response to the user) so hard examples keep entering the small model's training data.
Evaluation Framework
Offline: Evaluate routing decision quality on a held-out set where you have quality labels for both models. Measure: routing accuracy (did the router correctly predict which model produces the better response?), cost at quality threshold (what fraction of queries require the large model to meet a target quality level?), and false escalation rate (queries routed to the large model where the small model would have been fine).
Online: Track quality metrics and cost per query category. The goal is Pareto improvement: better quality on hard queries, lower cost on easy queries, no regression overall. If you see regression in quality on any query category after deploying routing, the router is misclassifying those queries.
Evaluation Framework for a RAG Pipeline End-to-End
RAG pipelines fail in several distinct ways: the retrieval misses the relevant document, the retrieved document is outdated, the LLM ignores the retrieved document and hallucinates, the LLM correctly summarizes the retrieved document but the document itself is wrong. Aggregate quality metrics won't tell you which failure mode is happening, which makes them nearly useless for diagnosis.
The evaluation framework needs to be layered.
Data
Build a permanent golden evaluation set: 500–1000 queries with ground-truth answers, ground-truth source documents, and (optionally) intermediate labels for whether the correct document was retrieved. This set should be:
- Representative of your production query distribution (not just hand-crafted edge cases)
- Updated quarterly with new queries and new documents as your KB grows
- Versioned with your model versions so you can trace quality changes over time
The hardest part of building this set is the source attribution labels. For each query, you need to know which KB document contains the authoritative answer. Domain experts need to do this, as there's no good automated method for initial label construction. Budget for it.
Training Strategy
The evaluation pipeline itself is a system with multiple models in it. The components you'll build:
Retrieval metrics: Recall@k (fraction of test queries where the gold document is in the top-k retrieved) and MRR (mean reciprocal rank of the gold document). These measure whether the right information was surfaced, independent of what the LLM does with it.
Faithfulness metric: Given the response and the retrieved documents, does the response make only claims supported by the documents? Use an NLI model or LLM judge with the prompt: "For each factual claim in the response, is it directly supported by, contradicted by, or absent from the provided documents?" Score as proportion of claims that are supported. This measures hallucination.
Groundedness metric: Similar to faithfulness but focuses on whether the response correctly attributes information to the right source. Relevant when you're doing multi-document retrieval and the response correctly summarizes the content but from the wrong document.
Answer correctness: Given the response and the gold answer, is the response correct? This requires either human evaluation or a strong LLM judge with domain knowledge. It's the most informative metric but the most expensive.
Final response quality (LLM-as-judge): A holistic evaluation on completeness, tone, actionability, and absence of contradictions. Use this for trend monitoring but not for diagnostic purposes, as it's too aggregate to tell you why quality changed.
Evaluation Framework
The key insight is to evaluate each stage independently before evaluating the pipeline end-to-end. A drop in end-to-end quality could come from retrieval, from the LLM, or from the KB itself, and each has a different fix.
Set up a staged evaluation pipeline:
- Retrieval-only eval: recall@5, MRR
- Retrieval + LLM with perfect context (inject gold document): measures LLM quality ceiling when retrieval is correct
- Full pipeline: end-to-end quality
The gap between (2) and (3) is the retrieval contribution to quality degradation. The gap between (1) and (2) is the LLM's ability to use retrieved context. These separate diagnostics are worth the infrastructure cost.
Online RAG metrics: Production faithfulness (LLM judge running on a 5% sample of live traffic), escalation rate by query type (a proxy for response quality), and source diversity (how many distinct KB sections are cited per day, where a drop signals retrieval concentration).
Data Flywheel Design: Using Production Signals as Training Signal
The central promise of the data flywheel is that every interaction your agent has produces signal that can improve the next version. In practice, the signal is delayed, noisy, biased, and sometimes adversarially poisoned. Getting a flywheel to actually spin, to measurably improve model quality with each iteration, requires careful engineering at every step.
Data
CSAT (customer satisfaction score) is the primary signal, but it has three problems: low response rate (typically 10–30% of interactions get rated), response bias (angry users and very happy users rate more than neutral users), and attribution noise (a user rating a ticket 1/5 may be expressing frustration with the overall support experience, not with this specific response).
Escalation rate is a cleaner signal: binary, unambiguous, and available for 100% of interactions. But it's a coarse signal that only captures the worst failures, not the gradient between "good" and "excellent."
Resolution rate is the most meaningful but the most delayed: did the ticket reopen within 72 hours? This gives you a relatively clean binary label about whether the issue was actually resolved, but the 72-hour delay makes the training loop slower.
Label noise handling: Don't discard noisy signals. Weight them. A ticket with a 1/5 CSAT and an immediate reopen is a high-confidence negative signal. A ticket with a 3/5 CSAT and no reopen is ambiguous. A ticket with a 5/5 CSAT is a weak positive (could be a polite user who still wasn't helped). Design a composite label that combines CSAT, escalation, and resolution into a single weighted quality score with calibrated uncertainty.
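The composite label can be sketched as a small scoring function. The specific weights below are illustrative placeholders, not calibrated values from the text; a real deployment would fit them against resolution outcomes:

```python
def composite_quality_label(csat, escalated, reopened_72h):
    """Combine CSAT (1-5 or None), escalation, and 72h-reopen signals
    into a quality score in [0, 1] plus a confidence weight for use as
    a training-example weight. Weights are illustrative."""
    score, confidence = 0.5, 0.2            # neutral prior, low confidence
    if escalated or reopened_72h:
        score, confidence = 0.0, 0.8        # strong negative evidence
    if csat is not None:
        csat_score = (csat - 1) / 4.0       # map 1-5 onto [0, 1]
        # Blend CSAT in; corroborating signals raise confidence.
        score = (score * confidence + csat_score * 0.5) / (confidence + 0.5)
        confidence = min(1.0, confidence + 0.4)
    return {"score": round(score, 3), "confidence": round(confidence, 3)}
```

Note how the examples from the text fall out: 1/5 CSAT plus a reopen yields a high-confidence negative, a lone 5/5 CSAT yields a positive score at moderate confidence, and 3/5 with no reopen stays near neutral.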
Feedback delay: The resolution signal arrives 72 hours after the interaction, which creates a temporal gap in your training data. A model deployed today won't see resolution labels for its interactions until 3 days from now. This matters for online learning approaches, but for batch retraining on a weekly or monthly cadence, it's manageable.
Training Strategy
The flywheel architecture has three components: label pipeline (converting raw production signals into quality labels), filtering pipeline (removing low-confidence or adversarially manipulated labels), and training pipeline (incorporating the labeled data into the next model version).
The filtering pipeline is where most teams underinvest. Users who are aware of your support AI will occasionally try to influence its training by systematically rating good responses poorly (to degrade service to competitors, or to test the system). Monitor for anomalies in rating patterns: users who give 1/5 to responses that match high-quality KB answers, or rating clusters that don't match your resolution data.
Curriculum design: Don't uniformly sample from your production data for training. Sample more heavily from:
- Low-confidence predictions (cases where the model was uncertain)
- Negative-label examples (escalations, low CSAT, reopens)
- Novel query clusters (queries unlike anything in the current training set)
Uniform sampling wastes capacity on the easy cases the model already handles well.
Non-obvious failure mode: The flywheel can converge to a local optimum rather than a global one. If your initial model is bad at handling a specific topic, all interactions on that topic get poor ratings, the model gets trained to avoid that topic (or produce safer-but-less-helpful responses on it), and the poor performance becomes self-reinforcing. Monitor quality by topic category, not just overall, to detect these local optima.
Evaluation Framework
Track flywheel health with a set of invariant quality probes: a fixed set of test queries whose correct answers you know, evaluated every time a new model version is trained. The invariant probes test both improvement (the specific failure modes you collected training data on) and non-regression (adjacent topics you didn't collect data on). If invariant probe quality drops after a training cycle, the flywheel is generating noise, not signal.
A/B Testing Framework for LLM Changes in a Stateful, Non-Deterministic System
A/B testing LLM changes is significantly harder than A/B testing, say, a recommendation model. The reasons stack up: LLMs are non-deterministic (two identical prompts produce different outputs), support interactions are stateful (a user's experience depends on all prior turns), your primary metrics are delayed (CSAT comes 24–72 hours after the interaction), and the system's behavior is fundamentally multivariate (the response depends on the query, the retrieved context, the model, and the conversation history simultaneously).
Getting an A/B test result you can trust requires careful design at each step.
Data
The A/B test data pipeline needs to capture the full context of each interaction: not just the query and response, but the retrieved documents, the conversation history, the model version, and all available outcome signals. You'll need this for debugging and for variance reduction.
Holdout design: Random user-level assignment (all interactions from a given user go to either variant A or B) is better than query-level assignment for stateful interactions. Query-level assignment is biased because a user who has a bad experience in conversation turn 3 will produce correlated bad ratings in turns 4 and 5, regardless of which variant they're in.
The holdout should be permanent for a subset of traffic: a 5–10% holdout group that never receives any new model update. This gives you a long-run baseline that accounts for seasonal effects and external factors. Without a permanent holdout, it's easy to confuse "model improvement" with "product launched a new feature and users are happier generally."
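User-level assignment with a permanent holdout is typically implemented as deterministic hashing, so every interaction from the same user lands in the same bucket without any assignment storage. A minimal sketch (bucket fractions are illustrative; the 5% holdout matches the low end of the range above):

```python
import hashlib

def assign_variant(user_id, experiment_name, holdout_fraction=0.05,
                   treatment_fraction=0.50):
    """Deterministic user-level bucketing. The holdout slice is carved
    out first and never receives new model updates; the remainder is
    split between treatment and control."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF     # stable uniform in [0, 1]
    if bucket < holdout_fraction:
        return "holdout"
    if bucket < holdout_fraction + (1 - holdout_fraction) * treatment_fraction:
        return "treatment"
    return "control"
```

Salting the hash with the experiment name prevents the same users from always landing in treatment across successive experiments, which would otherwise correlate your tests.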
Training Strategy
Minimum detectable effect sizing: Before running a test, calculate the sample size required to detect your minimum relevant effect. For escalation rate, if your baseline is 15% and you want to detect a 1-point improvement (to 14%), you need roughly 15–20k interactions per variant assuming 80% power and α=0.05. At 10k interactions per day, that's 3–4 days of testing. For CSAT rate with 20% response rate, the effective sample is much smaller and the required test duration is much longer.
Be honest about this calculation. I've seen teams call a test significant after 500 interactions when the required sample was 15,000. The result is usually noise that doesn't replicate.
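The sample-size figure above can be checked with the standard two-proportion z-test approximation, which needs only the standard library:

```python
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size to detect a difference between two
    proportions with a two-sided z-test (standard approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2))

# 15% vs 14% escalation rate at 80% power, alpha=0.05:
# about 19,500 interactions per variant
```

Hard-coding this calculation into the experiment setup tooling, rather than leaving it as a manual step, is the simplest defense against the 500-interaction "significant" result.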
Guardrail metrics: Alongside your primary success metric, define metrics that should not get worse. Escalation rate is a typical guardrail when CSAT is the primary metric (you don't want a model that improves CSAT by being over-helpful in ways that create escalations). Hard-coding these as stopping conditions in your test framework prevents the temptation to rationalize degradation.
Variance reduction: The non-determinism of LLMs inflates variance in your test metrics. Control for it using CUPED (Controlled Experiment Using Pre-Experiment Data) by regressing your outcome variable on pre-experiment covariates (historical CSAT for this user, historical resolution rate for this query category) and testing on the residuals. This can reduce variance by 30–50% and effectively doubles your statistical power.
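The CUPED adjustment itself is a one-line regression. A minimal sketch for a single pre-experiment covariate (the helper name is illustrative):

```python
import numpy as np

def cuped_adjust(outcomes, covariates):
    """CUPED variance reduction: estimate theta = cov(Y, X) / var(X)
    on a pre-experiment covariate X (e.g. the user's historical CSAT)
    and return the residualized outcomes Y - theta * (X - mean(X)).
    The adjusted outcomes keep the same mean but lower variance."""
    outcomes = np.asarray(outcomes, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    theta = np.cov(outcomes, covariates)[0, 1] / np.var(covariates, ddof=1)
    return outcomes - theta * (covariates - covariates.mean())
```

Because the covariate is measured before the experiment starts, it's independent of treatment assignment, so the adjustment reduces variance without biasing the treatment effect estimate.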
Non-obvious failure mode: Multi-armed bandit approaches sound appealing for LLM A/B tests (automatically shift traffic to better variants) but have a critical failure mode for non-deterministic systems: the bandit algorithm can misattribute noise as signal and prematurely shift traffic to a variant that happened to get lucky in the first few hundred interactions. For LLM changes where the effect size is small and the noise is high, run a fixed holdout experiment with a predetermined duration rather than an adaptive bandit.
Evaluation Framework
Pre-experiment checklist: Before starting, verify that the variants are correctly assigned (run an A/A test on a small slice first), that outcome logging is capturing the right signals, and that you've calculated the required sample size.
During the experiment: Monitor guardrail metrics daily. If a guardrail metric degrades beyond a pre-specified threshold, stop the test automatically. Don't wait for the pre-specified end date if a guardrail is clearly failing.
Post-experiment analysis: Report a confidence interval, not just a p-value. "We observed a 0.8 ± 0.3 point improvement in CSAT (95% CI)" is more useful than "the test was significant at p=0.04." Segment the results by query category, user tenure, and resolution complexity. The average effect often masks large heterogeneity, and knowing which user segments drive the improvement guides your next training iteration.
Long-run validity: LLM changes interact with each other in ways that A/B testing of individual changes doesn't capture. A tone alignment update might be positive in isolation and negative in combination with a query rewriting update from the prior month. Run periodic full-system evaluations (all current changes vs. a clean baseline from 6 months ago) to measure the cumulative effect of your improvement cadence.
Closing Thoughts
The throughline across all thirteen of these problems is the same: production ML systems fail in the spaces between the components, not in the components themselves. The retrieval is fine; the KB is stale. The model is good; the router sends it the wrong queries. The A/B test is significant; the winning variant has a guardrail violation you didn't check.
Building a production support AI that actually works requires treating every handoff as an explicit design decision rather than an implementation detail. This applies between retrieval and generation, between training and evaluation, and between data collection and model update. The systems that hold up under real load are the ones where someone thought carefully about what happens when each component does its job correctly but their combination doesn't.
The good news is that most of these failure modes are predictable and preventable. The teams that succeed aren't the ones with the best models. They're the ones that built the evaluation and monitoring infrastructure first, and used the model as one component of a system they could actually debug.
If there are failure modes I missed or approaches that have worked better for you in practice, I'd like to hear about it: [email protected]