AI & Cloud Infrastructure

RAG for Manufacturing: Grounding LLMs in Technical Docs

By Technspire Team
March 21, 2026

A generic LLM answering a manufacturing technician's question is a liability. The model may sound confident while hallucinating a torque spec, inventing a part number, or conflating procedures from unrelated vehicle platforms. Retrieval-augmented generation, grounded in the organisation's actual documentation, sharply reduces that failure mode. This walk-through covers the production architecture that manufacturing teams can defend in an audit: Azure AI Search as the retrieval layer, Azure OpenAI as the generator, citations on every answer, and evaluation discipline that catches regressions.

What Makes Manufacturing RAG Different

Three properties distinguish manufacturing RAG from consumer chatbot patterns:

  • Consequential answers. A wrong torque value, a wrong lubricant, a wrong replacement part — all translate to recalls, safety incidents, or warranty claims. Hallucination is not a UX issue; it is a legal one.
  • Applicability scoping. The same question has different correct answers depending on vehicle platform, model year, market, revision, and customer configuration. Retrieval must filter correctly before generation runs.
  • Citation mandate. Technicians must be able to verify the answer against the source. The generator never returns text without a citation to the retrieved document.

The Reference Architecture

Seven components, wired together:

  1. Query preprocessing. A small LLM rewrites the user's query into a normalised form, expands abbreviations, and extracts applicability context (platform, model year, etc.) where inferable.
  2. Hybrid retrieval. Azure AI Search hybrid query combining keyword, vector, and scoring-profile boosts. Applicability filters applied as structured constraints.
  3. Semantic re-ranking. Top 50 retrieved chunks re-scored using semantic ranker; top 5–10 fed to the generator.
  4. Context assembly. Selected chunks formatted with document titles, revision identifiers, and section headings. Typically 3,000–5,000 tokens of context.
  5. Generation. Azure OpenAI with a system prompt instructing grounded answers, explicit citation format, refusal for unsupported questions.
  6. Citation post-processing. Parse the model's citations and link them to the retrieved document IDs so the UI can render inline footnotes.
  7. Logging and evaluation. Every query, retrieval set, generated answer, and citation logged for offline evaluation.
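Wired together, the seven stages reduce to a thin orchestration layer. The sketch below is a minimal outline, not a drop-in implementation: the stage callables (rewrite_query, retrieve, rerank, generate, log) are hypothetical placeholders for the components above.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    title: str
    revision: str
    content: str

def answer_query(user_query, user_context,
                 rewrite_query, retrieve, rerank, generate, log):
    """End-to-end RAG pipeline; each callable is one stage from the list above."""
    # 1. Query preprocessing: normalise and extract applicability context.
    rewritten, applicability = rewrite_query(user_query, user_context)
    # 2-3. Hybrid retrieval, then semantic re-ranking; keep the top 8.
    candidates = retrieve(rewritten, applicability, top=50)
    chunks = rerank(rewritten, candidates)[:8]
    # 4. Context assembly: source ID, title, and revision header per chunk.
    context = "\n\n".join(
        f"[{c.doc_id}-rev-{c.revision}] {c.title}\n{c.content}" for c in chunks
    )
    # 5-6. Grounded generation; citation parsing happens downstream.
    answer = generate(context=context, question=rewritten)
    # 7. Log everything for offline evaluation.
    log(user_query, rewritten, [c.doc_id for c in chunks], answer)
    return answer, chunks
```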

The System Prompt That Keeps It Honest

You are an assistant for technicians working with {vehicle platform} documentation.

Answer strictly from the sources provided below. If the sources do not contain
the answer, say so explicitly and do not guess.

Every statement of fact must end with a citation in square brackets with the
source ID, e.g. [DOC-TWG-2341-rev-C]. Use the exact IDs from the sources.

When citing torque specifications, fastener sizes, part numbers, or diagnostic
trouble codes, quote the source verbatim. Do not paraphrase numeric values.

If the sources contain conflicting information (for example, one revision
superseding another), use only the most recent revision and note the conflict.

If a question is outside the scope of the sources or outside your role, reply:
"That question is outside the scope of this assistant. Please consult [...]."

Sources:
{retrieved_chunks}

Question: {user_query}
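Filling the {retrieved_chunks} and {user_query} slots is mechanical but worth getting right: the source IDs rendered into the prompt must match exactly what the citation post-processor later looks for. A minimal sketch, assuming chunks are dicts shaped like the search response's select fields:

```python
def build_messages(system_template, chunks, user_query):
    """Render retrieved chunks into the Sources block and return chat messages.

    Each chunk is prefixed with the exact source ID ({documentId}-rev-{revision})
    that the system prompt instructs the model to cite.
    """
    sources = "\n\n".join(
        f"[{c['documentId']}-rev-{c['revision']}] {c['title']}\n{c['content']}"
        for c in chunks
    )
    system = system_template.replace("{retrieved_chunks}", sources)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]
```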

The Retrieval Query

// RAG retrieval query with applicability filters
POST https://{svc}.search.windows.net/indexes/mfg-docs/docs/search?api-version=2024-07-01

{
  "search": "{rewritten_query}",
  "queryType": "semantic",
  "semanticConfiguration": "mfg-semantic",
  "answers": "extractive|count-3",
  "captions": "extractive",
  "vectorQueries": [{ "kind": "vector", "vector": [/* ... */], "fields": "contentVector", "k": 50 }],
  "filter": "status eq 'effective' and platform/any(p: p eq '{platform}') and modelYearFrom le {year} and modelYearTo ge {year} and markets/any(m: m eq '{market}')",
  "top": 8,
  "select": "id,documentId,title,content,revision,sourcePath"
}

The answers and captions features instruct the semantic ranker to extract candidate answer spans directly. These are useful for UI display and as additional signal for the generator.
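The same query can be issued from the azure-search-documents Python SDK (11.4 or later). The filter builder below mirrors the OData expression in the JSON request; the field names (platform, modelYearFrom, markets) are this article's example schema, and the applicability values must come from user context, never from free text the user typed:

```python
def build_applicability_filter(platform: str, year: int, market: str) -> str:
    """OData filter mirroring the JSON request above."""
    return (
        f"status eq 'effective' "
        f"and platform/any(p: p eq '{platform}') "
        f"and modelYearFrom le {year} and modelYearTo ge {year} "
        f"and markets/any(m: m eq '{market}')"
    )

def run_search(client, query: str, vector, platform: str, year: int, market: str):
    """Hybrid + semantic query via the SDK; client is a SearchClient instance."""
    # SDK import kept local so the pure filter builder above has no dependencies.
    from azure.search.documents.models import VectorizedQuery

    return client.search(
        search_text=query,
        query_type="semantic",
        semantic_configuration_name="mfg-semantic",
        vector_queries=[VectorizedQuery(vector=vector,
                                        k_nearest_neighbors=50,
                                        fields="contentVector")],
        filter=build_applicability_filter(platform, year, market),
        top=8,
        select=["id", "documentId", "title", "content", "revision", "sourcePath"],
    )
```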

Agentic RAG for Multi-Step Queries

Some queries need more than one retrieval. "Which diagnostic procedures changed on the B4204T35 engine between 2022 and 2024 due to the P0420 recall?" implies two lookups: find the recall, find the affected procedures. Agentic retrieval lets the LLM drive the sequence.

Recent preview versions of the Azure AI Search API add a knowledgeAgents endpoint for agentic retrieval. The LLM receives the user query, decides which sub-queries to issue, and composes the final answer from the intermediate results. This replaces much of the custom orchestration that earlier RAG patterns required. Trade-off: more LLM calls per user query, higher cost. Reserve it for genuinely multi-hop cases.
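Whichever API drives it, the control flow is a loop: plan sub-queries, retrieve for each, then compose. A minimal sketch with hypothetical plan_subqueries, retrieve, and generate callables standing in for the real components:

```python
def agentic_answer(question, plan_subqueries, retrieve, generate, max_hops=3):
    """Multi-hop retrieval: the planner proposes sub-queries, each is retrieved
    separately, and the generator composes the final answer from all evidence.
    max_hops caps LLM/retrieval calls, since each hop adds cost."""
    evidence = []
    for sub_query in plan_subqueries(question)[:max_hops]:
        evidence.extend(retrieve(sub_query))
    return generate(question, evidence)
```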

Evaluating the Pipeline

RAG evaluation has two layers:

  • Retrieval evaluation. Given a query, did the retriever find the correct source documents in the top 10? Recall@10, MRR, NDCG. Measured against a labelled set.
  • Generation evaluation. Given the retrieved sources, was the generated answer correct, cited properly, and free of hallucination? Measured via LLM-as-judge (DeepEval, Promptfoo) and spot-checked by domain experts.

The two-layer structure matters because they fail independently. A correct retrieval can be turned into a wrong answer by a sloppy generator prompt. A bad retrieval guarantees a bad answer no matter how well the generator behaves.
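The retrieval-layer metrics are simple enough to implement directly. A sketch of Recall@k and MRR over a labelled set, where each labelled query pairs the set of relevant document IDs with the ranked list the retriever returned:

```python
def recall_at_k(relevant, ranked, k=10):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(labelled_queries):
    """Mean reciprocal rank of the first relevant hit per query.

    labelled_queries: list of (relevant_set, ranked_list) pairs."""
    if not labelled_queries:
        return 0.0
    total = 0.0
    for relevant, ranked in labelled_queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(labelled_queries)
```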

Hallucination Defenses

  • Refusal prompting. The system prompt explicitly instructs refusal when sources are insufficient. Measure refusal rate in evaluation; too low means the model is filling gaps.
  • Citation validation. Every claim in the answer must cite a document. Post-process the generator output to verify citations correspond to retrieved documents. Flag any claim without a citation.
  • Numeric verbatim rule. Ask the model to quote numeric values verbatim. Then verify numbers in the generated answer appear in the cited source. Mismatches are hallucination indicators.
  • Human-in-the-loop for consequential actions. For torque specs, diagnostic procedures, and similar safety-critical content, the UI displays the generated answer alongside the source document and requires the technician to acknowledge.
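The citation and numeric checks can run as one post-processing pass. A sketch, assuming citation IDs follow the [DOC-...-rev-X] format from the system prompt; sources maps each retrieved source ID to its text:

```python
import re

CITATION = re.compile(r"\[([A-Z]+-[A-Z0-9-]+-rev-[A-Z0-9]+)\]")
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def validate_answer(answer: str, sources: dict) -> list:
    """Return a list of flags; an empty list means both checks passed."""
    flags = []
    cited = CITATION.findall(answer)
    if not cited:
        flags.append("no-citation")
    for cid in cited:
        if cid not in sources:
            flags.append(f"unknown-citation:{cid}")
    # Numeric verbatim rule: every number in the answer (citations stripped,
    # so digits inside IDs don't count) must appear in some cited source.
    cited_text = " ".join(sources.get(c, "") for c in cited)
    source_numbers = NUMBER.findall(cited_text)
    for num in NUMBER.findall(CITATION.sub("", answer)):
        if num not in source_numbers:
            flags.append(f"unverified-number:{num}")
    return flags
```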

Cost Profile

A single RAG query typically costs:

  • Query rewrite (small model, ~300 tokens): ~0.001 USD
  • Embedding for vector query (text-embedding-3-large): ~0.0002 USD
  • Azure AI Search hybrid + semantic: included in service tier, marginal per-query
  • Generator (GPT-5 with 4K context, 500 output tokens): ~0.015–0.025 USD
  • Typical total: 0.02–0.03 USD per query

At 10,000 queries per day per large OEM, this is ~200–300 USD/day in LLM cost, plus the fixed search service tier. Prompt caching on the system prompt (which is stable) typically reduces generator cost by 30–50%.
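A back-of-envelope estimator using the per-query figures above (assumed mid-range values, not quoted prices):

```python
def daily_llm_cost(queries_per_day, rewrite=0.001, embed=0.0002,
                   generate=0.020, cache_discount=0.0):
    """Rough per-day LLM spend in USD, using the per-query figures above.

    cache_discount applies to the generator share only, e.g. 0.4 for a
    40% prompt-cache saving on the stable system prompt."""
    per_query = rewrite + embed + generate * (1 - cache_discount)
    return per_query * queries_per_day
```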

Where Manufacturing RAG Fails in Practice

  • Retrieval scope too wide. Applicability filters not applied, so retrieval returns documents from the wrong platform. Easy to catch in evaluation if you have labelled queries per platform.
  • Chunk boundaries cutting through procedures. A torque spec table gets split across two chunks. The retriever returns only half. The generator makes up the other half. Structure-aware chunking prevents this.
  • Stale index. A revision ships but the indexer has not picked it up yet. The generator answers from the old revision. Indexer schedules aligned with publication cadence are essential.
  • Prompt drift. Someone tweaks the system prompt to "be friendlier" and refusal rate collapses. Evaluation suite on every prompt change catches this.
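Structure-aware chunking need not be elaborate: splitting on headings so each procedure (and any table it contains) stays in one chunk already avoids the split-table failure above. A sketch for documents converted to markdown, where over-long sections are flagged for manual review rather than cut mid-table:

```python
def chunk_by_heading(markdown_text, max_chars=4000):
    """Split on markdown headings; return (chunks, indices of oversized chunks)."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    # Flag rather than hard-split: a blind cut could bisect a torque table.
    oversized = [i for i, c in enumerate(chunks) if len(c) > max_chars]
    return chunks, oversized
```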

Production-Ready Checklist

  • Applicability filters applied automatically from user context.
  • Hybrid retrieval with semantic re-ranking on the top 50 candidates.
  • System prompt that mandates citations and refusal on uncovered questions.
  • Post-processing that validates every numeric value against source text.
  • Evaluation set with 100–300 labelled queries covering main use cases.
  • Generation quality evaluated via LLM-as-judge with domain expert spot-checks.
  • Observability: every query, retrieval set, answer, and citation logged.
  • Prompt caching on the stable portion of the system prompt for cost control.
  • UI displays source alongside answer; no bare answer without linked source.

The Discipline That Distinguishes Production

RAG for manufacturing is not a novel technology stack. Azure AI Search has been indexing documents for years; Azure OpenAI has been generating text for several. The discipline is treating the combined system as safety-critical software: labelled evaluation, regression testing on every change, citation as a hard requirement, and a UI that makes the source inspectable. Teams that treat RAG as "chat over documents" ship prototypes. Teams that treat it as a controlled retrieval and reasoning system ship products the compliance officer signs off on.
