LLM-Embedded Pipeline for Production Data Enrichment
The team had a working LLM enrichment in development. Putting it in the production data path was a different problem.
The platform's data pipeline ingested large volumes of unstructured records (product descriptions, user-submitted content, third-party feeds) that had to be classified, enriched, and converted into structured form before downstream systems could use them. The team had built a working LLM enrichment in development and needed to move it into the production data path.
The development version had worked at small scale and at the team's tolerance for slow iteration. Production was different. The data volume was orders of magnitude higher. Latency budgets per record were tight. LLM cost at production volume was a real economic constraint. And the team had no good answer for what should happen when the LLM produced wrong outputs — which it did, regularly enough to matter.
The leadership had committed to the LLM-enrichment approach as the right strategic direction. The engagement was the engineering work to make it production-grade.
The result was a production LLM-embedded pipeline with the operational properties the team's prototype had lacked.
The LLM call was wrapped in cost gating. Records were routed to model tiers based on the value of getting the enrichment right for that record. High-value records went to the strongest model; bulk records went to a cheaper, smaller model that handled the common cases adequately. Records where the cheap-model output had low confidence were escalated to the strong model; records where the cheap model was confident bypassed the expensive call.
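The routing described above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the `Record.high_value` flag, the `Model` callable signature, the 0.8 confidence threshold, and the stub models are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical model interface: takes record text, returns (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Record:
    text: str
    high_value: bool  # assumed flag; the source does not say how record value is scored


def enrich(record: Record, cheap: Model, strong: Model,
           min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (label, tier_used) for one record."""
    if record.high_value:
        # High-value records go straight to the strongest model.
        label, _ = strong(record.text)
        return label, "strong"
    label, confidence = cheap(record.text)
    if confidence >= min_confidence:
        # Confident cheap-model output bypasses the expensive call.
        return label, "cheap"
    # Low-confidence output is escalated to the strong model.
    label, _ = strong(record.text)
    return label, "strong-escalated"


# Stub models standing in for real LLM calls (illustrative only).
def cheap_stub(text: str) -> Tuple[str, float]:
    return ("shoes", 0.95) if "sneaker" in text else ("unknown", 0.30)


def strong_stub(text: str) -> Tuple[str, float]:
    return ("footwear", 0.99)
```

In practice the threshold would be tuned against the cost of a wrong enrichment versus the cost of the escalated call.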
Caching was implemented at the input level. Records with identical or near-identical inputs returned cached outputs rather than producing duplicate LLM calls. The cache layer absorbed a substantial portion of the production call volume.
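One way to picture input-level caching is sketched below. It assumes that "near-identical" means inputs that differ only in case and whitespace, which is a simplification chosen for the example; the source does not say how near-duplicates were detected. The `EnrichmentCache` class and the in-memory dictionary store are hypothetical.

```python
import hashlib
import re
from typing import Callable, Dict


def cache_key(text: str) -> str:
    """Normalize case and whitespace so trivially different inputs share a key."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class EnrichmentCache:
    """In-memory cache in front of an LLM call (a real system would likely use a shared store)."""

    def __init__(self, call_llm: Callable[[str], str]) -> None:
        self._call_llm = call_llm
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> str:
        key = cache_key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for one LLM call and remember the result.
        self.misses += 1
        result = self._call_llm(text)
        self._store[key] = result
        return result


# Stub LLM that records how many times it was actually invoked.
llm_calls = []


def fake_llm(text: str) -> str:
    llm_calls.append(text)
    return "category:shoes"


cache = EnrichmentCache(fake_llm)
first = cache.get("Red  Sneaker")
second = cache.get(" red sneaker ")  # normalizes to the same key, so no second call
```

The hit and miss counters make it possible to measure how much call volume the cache absorbs.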
Validation was implemented on the LLM output. Each output was checked against domain-specific structural and semantic rules — not as a quality assessment in the abstract, but against the specific properties downstream systems required. Outputs that failed validation were retried, escalated to a stronger model, or quarantined for human review depending on the failure category.
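As a rough sketch of validation that maps failure categories to handling, consider the following. The output schema (`category`, `price`), the allowed category set, and the specific mapping from failure type to action are assumptions for illustration; the source states only that failures were retried, escalated, or quarantined depending on category.

```python
import json
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    RETRY = "retry"
    ESCALATE = "escalate"
    QUARANTINE = "quarantine"


# Hypothetical set of categories the downstream systems accept.
ALLOWED_CATEGORIES = {"apparel", "electronics", "home"}


def validate(raw_output: str) -> Action:
    """Check one LLM output against structural and semantic rules."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Malformed output: assumed here to be transient, so retry.
        return Action.RETRY
    if not isinstance(data, dict) or "category" not in data or "price" not in data:
        # Structurally incomplete: assumed to need a stronger model.
        return Action.ESCALATE
    if data["category"] not in ALLOWED_CATEGORIES:
        # Semantic rule: category must be one downstream systems know.
        return Action.ESCALATE
    price = data["price"]
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        # Implausible value: assumed to warrant human review.
        return Action.QUARANTINE
    return Action.ACCEPT
```

The point of the design is that each rule reflects a property a downstream consumer depends on, so a failed check always has a concrete next step.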
Latency budgeting was instrumented per pipeline stage. The LLM call had a defined timeout; records that exceeded the timeout were routed to fallback handling rather than blocking the pipeline. The pipeline's overall latency held even when individual LLM calls degraded.
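The timeout-and-fallback behavior might look something like this sketch, which uses `asyncio.wait_for` to bound a single call. The function name `enrich_with_timeout`, the stub LLM coroutines, and the "deferred" fallback value are hypothetical; the source does not describe what fallback handling did.

```python
import asyncio
from typing import Awaitable, Callable


async def enrich_with_timeout(
    text: str,
    call_llm: Callable[[str], Awaitable[str]],
    timeout_s: float,
    fallback: Callable[[str], str],
) -> str:
    """Run one LLM call under a deadline; route to fallback if it is exceeded."""
    try:
        return await asyncio.wait_for(call_llm(text), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The slow call is cancelled so it does not block the pipeline.
        return fallback(text)


# Stub coroutines standing in for a healthy and a degraded LLM call.
async def fast_llm(text: str) -> str:
    return "enriched"


async def slow_llm(text: str) -> str:
    await asyncio.sleep(0.2)
    return "enriched"


def defer(text: str) -> str:
    # Assumed fallback: mark the record for later processing.
    return "deferred"


fast_result = asyncio.run(enrich_with_timeout("record", fast_llm, 0.05, defer))
slow_result = asyncio.run(enrich_with_timeout("record", slow_llm, 0.05, defer))
```

Because the deadline is enforced per call, a degraded provider costs each record at most the timeout rather than stalling the stage.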
Observability was built around the LLM behavior. Output quality drift, latency drift, and cost drift were each monitored separately. Drift surfaced through alerts before becoming structural problems.
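To illustrate monitoring each signal separately, the sketch below keeps one rolling-window monitor per metric and flags a metric when its recent mean deviates from a baseline by more than a relative tolerance. The `DriftMonitor` class, the metric names, the baselines, and the tolerances are all assumptions; the source does not describe the detection method used.

```python
from collections import deque
from statistics import mean
from typing import Deque, List


class DriftMonitor:
    """Rolling-window monitor for one metric (baseline must be non-zero)."""

    def __init__(self, name: str, baseline: float, tolerance: float,
                 window: int = 100) -> None:
        self.name = name
        self.baseline = baseline
        self.tolerance = tolerance  # allowed relative deviation, e.g. 0.2 for 20%
        self._values: Deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> None:
        self._values.append(value)

    def drifted(self) -> bool:
        if not self._values:
            return False
        deviation = abs(mean(self._values) - self.baseline) / abs(self.baseline)
        return deviation > self.tolerance


def alerts(monitors: List[DriftMonitor]) -> List[str]:
    """Return the names of metrics currently outside tolerance."""
    return [m.name for m in monitors if m.drifted()]


# Separate monitors for quality and latency (hypothetical baselines).
quality = DriftMonitor("validation_pass_rate", baseline=0.95, tolerance=0.05)
latency = DriftMonitor("latency_s", baseline=1.0, tolerance=0.2)

for value in [0.80, 0.82, 0.78]:   # pass rate has dropped well below baseline
    quality.observe(value)
for value in [1.0, 1.1, 0.9]:      # latency is hovering around baseline
    latency.observe(value)

firing = alerts([quality, latency])
```

Keeping the monitors independent means a quality regression raises an alert even while latency and cost look healthy.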
Failover behavior was specified for the LLM provider. Provider outages or degradation routed traffic to a secondary provider with equivalent capability. The pipeline continued operating through provider incidents.
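A simple form of provider failover is sketched here: providers are tried in priority order and the first successful response wins. The `ProviderError` exception, the `(name, callable)` provider shape, and the stub providers are hypothetical; a production version would typically also add health checks and backoff, which are omitted.

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a provider is down or degraded."""


# Each provider is a (name, call) pair; call takes record text and returns the enrichment.
Provider = Tuple[str, Callable[[str], str]]


def call_with_failover(text: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, result) from the first that succeeds."""
    last_error: Optional[ProviderError] = None
    for name, call in providers:
        try:
            return name, call(text)
        except ProviderError as exc:
            # Remember the failure and move on to the next provider.
            last_error = exc
    raise ProviderError("all providers failed") from last_error


# Stub providers simulating an outage and a healthy service.
def provider_down(text: str) -> str:
    raise ProviderError("503 from provider")


def provider_up(text: str) -> str:
    return "enriched"
```

Returning the provider name alongside the result lets the pipeline record how much traffic the secondary is carrying during an incident.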
The LLM enrichment moved into production at the platform's full data volume. Cost held within the budget the team had been given. Output quality met the bar downstream systems required. The pipeline operated reliably across LLM provider behavior, including incidents that failover to the secondary provider absorbed.