Indexing Manufacturing Documents: OCR, Skillsets, Ranking
A typical manufacturing corpus is not a uniform pile of text. It is scanned work instructions from 1998, CAD drawings exported to PDF, Excel BOMs with cell-level semantics, quality audit reports, and supplier specs in three languages. Turning that into a searchable index that returns the right chunk on a query like "M12 flange bolt torque spec for assembly line 4" takes more than a default indexer configuration.
The Three Enrichment Stages
Every document passing through the indexer goes through three stages: text extraction, enrichment, and chunking + embedding. Each stage has choices that compound: bad OCR feeds bad embeddings, which feed bad retrieval. Get the first stage right and the rest is tuning.
Stage 1: Text Extraction
Azure AI Search ships with a built-in document cracking layer that handles Office formats, PDFs, plain text, HTML, and JSON. For scanned documents and image-heavy PDFs, add #Microsoft.Skills.Vision.OcrSkill or #Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill for structured extraction.
The Document Intelligence Layout skill produces substantially better results on manufacturing content because it preserves document structure: tables become tables with row and column semantics, headings retain hierarchy, figures get captioned. On a typical quality inspection report, layout-aware extraction recovers 15–25% more queryable text than raw OCR.
// Skillset fragment — Document Intelligence layout skill
{
"@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
"name": "di-layout",
"description": "Layout-aware extraction for scanned mfg documents",
"context": "/document",
"outputMode": "oneToMany",
"markdownHeaderDepth": "h2",
"inputs": [{ "name": "file_data", "source": "/document/file_data" }],
"outputs": [
{ "name": "markdown_document", "targetName": "markdown" },
{ "name": "figures", "targetName": "figures" }
]
}
Stage 2: Enrichment
Extracted text feeds a chain of enrichments. Five are useful for manufacturing:
- Language detection. Multi-language corpora need language tagging to pick the right analyzer and to filter search results per locale.
- Key-phrase extraction. Surfaces the non-stopword domain terms. Feeds facet and filter fields.
- Entity recognition with custom entity types. Part numbers, material codes, and supplier names recognised via a custom entity lookup skill pointed at a dictionary of the organisation's vocabulary.
- Translation. For multi-plant organisations, translating non-English documents into English provides a fallback retrieval path when the user's query language differs from the source.
- PII detection and redaction. Where quality records include employee names or supplier contact details, a PII skill blanks them before indexing.
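The custom entity lookup from the list above can be declared as a skillset fragment. A sketch with an inline dictionary (the part numbers and aliases are illustrative; a production setup would point entitiesDefinitionUri at a maintained dictionary file instead):

```json
{
  "@odata.type": "#Microsoft.Skills.Text.CustomEntityLookupSkill",
  "name": "part-number-lookup",
  "context": "/document",
  "inlineEntitiesDefinition": [
    { "name": "M12-FLG-BOLT", "aliases": [{ "text": "M12 flange bolt", "caseSensitive": false }] },
    { "name": "AL-6061-T6", "aliases": [{ "text": "6061-T6 aluminium", "caseSensitive": false }] }
  ],
  "inputs": [{ "name": "text", "source": "/document/content" }],
  "outputs": [{ "name": "entities", "targetName": "customEntities" }]
}
```

Matched entities land in a collection field that the index can expose as a facet or filter.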
Stage 3: Chunking and Embedding
Chunking is the single highest-impact retrieval-quality decision. A single large document stored as one chunk returns shallow results because the relevant passage is diluted by surrounding irrelevant text. Chunks that are too small lose context. The sweet spot for manufacturing documents sits at 500–1,000 tokens with 10–15% overlap.
Two chunking strategies work well in practice:
- Fixed-size with overlap via #Microsoft.Skills.Text.SplitSkill. Simple, fast, good default. 700 tokens, 100-token overlap.
- Structure-aware chunking. Exploits the markdown or section hierarchy from Document Intelligence Layout so that chunk boundaries align with section boundaries. Preserves semantic coherence at the cost of variable chunk size.
// Split skill configuration
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"name": "chunker",
"textSplitMode": "pages",
"maximumPageLength": 2000,
"pageOverlapLength": 200,
"unit": "characters",
"context": "/document",
"inputs": [{ "name": "text", "source": "/document/markdown" }],
"outputs": [{ "name": "textItems", "targetName": "chunks" }]
}
Each chunk then flows into the integrated Azure OpenAI embedding skill, producing a dense vector stored alongside the chunk's text in the index.
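That embedding step is itself a skill in the chain. A sketch, assuming a text-embedding-3-large deployment (the resource URI, deployment name, and dimension count are placeholders for your own):

```json
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "embedder",
  "context": "/document/chunks/*",
  "resourceUri": "https://my-openai.openai.azure.com",
  "deploymentId": "text-embedding-3-large",
  "modelName": "text-embedding-3-large",
  "dimensions": 1024,
  "inputs": [{ "name": "text", "source": "/document/chunks/*" }],
  "outputs": [{ "name": "embedding", "targetName": "vector" }]
}
```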
Semantic Ranking on the Retrieval Side
Retrieval is a two-phase process: an initial hybrid query pulls the top 50 candidates, then the semantic ranker re-scores them. The ranker uses a deep learning model trained on passage relevance and boosts candidates that answer the query directly versus ones that merely mention the terms.
Semantic configurations are declared at the index level and referenced per query. Prioritise title and structured metadata fields when present; they carry high signal in manufacturing contexts because engineers name documents precisely.
// Semantic configuration at index level
{
"semantic": {
"configurations": [{
"name": "mfg-semantic",
"prioritizedFields": {
"titleField": { "fieldName": "title" },
"prioritizedContentFields": [{ "fieldName": "content" }],
"prioritizedKeywordsFields": [
{ "fieldName": "partNumbers" },
{ "fieldName": "materialCodes" }
]
}
}]
}
}
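At query time the configuration is referenced per request. A sketch of a hybrid query body with semantic re-ranking, assuming integrated vectorization and a vector field named contentVector (both are assumptions about the index schema):

```json
{
  "search": "M12 flange bolt torque spec for assembly line 4",
  "queryType": "semantic",
  "semanticConfiguration": "mfg-semantic",
  "vectorQueries": [
    {
      "kind": "text",
      "text": "M12 flange bolt torque spec for assembly line 4",
      "fields": "contentVector",
      "k": 50
    }
  ],
  "top": 10
}
```

The keyword and vector legs each contribute candidates; the semantic ranker then re-scores the fused top 50 and the caller takes the top 10.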
Measuring Retrieval Quality
Without measurement the index tunes itself on intuition. Build an evaluation set of 100–300 queries with known-correct document IDs. Rerun it on every skillset change. Track three numbers: recall@10, MRR (mean reciprocal rank), and NDCG@10.
A labelled eval set takes about one engineering week to produce from scratch. It pays back on the first tuning decision that would otherwise have been guesswork.
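The three metrics are simple to compute over binary relevance labels. A minimal sketch (the function names and data shapes are ours, not from any SDK):

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant doc IDs that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant hit.
    `queries` is a list of (ranked_ids, relevant_ids) pairs."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Run these per query over the eval set and average; log the three numbers alongside the skillset version that produced them.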
Handling Multi-Language Corpora
European manufacturers often maintain documents in Swedish, German, Finnish, French, or Italian alongside English. Azure AI Search supports per-language analyzers that tokenise with language-appropriate rules. Set the analyzer based on the language detection skill's output:
// Per-chunk language-aware content field
{
"name": "content_sv",
"type": "Edm.String",
"searchable": true,
"analyzer": "sv.microsoft"
},
{
"name": "content_de",
"type": "Edm.String",
"searchable": true,
"analyzer": "de.microsoft"
}
For queries, pick the field matching the query's detected language. Or, simpler, translate non-English documents to English during indexing, store both originals and translations, and query the English field by default while preserving the originals for display and citation.
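The routing logic is small. A sketch, assuming the per-language fields above plus a default English `content` field (the field map and function name are illustrative):

```python
# Per-language content fields from the index schema; "content" is the
# assumed default English field.
LANGUAGE_FIELDS = {"sv": "content_sv", "de": "content_de"}

def fields_for_query(language_code):
    """Pick searchFields for a query: the language-specific field when one
    exists, always falling back to the English content field."""
    primary = LANGUAGE_FIELDS.get(language_code)
    return [primary, "content"] if primary else ["content"]
```

The returned list goes into the query's searchFields parameter, so a Swedish query hits the Swedish-analyzed field first while English translations remain a fallback.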
Incremental Updates and Deletion
Manufacturing corpora evolve constantly: revisions published, documents superseded, obsolete drawings removed. Indexers support change detection via high-water-mark policies on Blob Storage (uses lastModified) or via SQL change tracking. Deletion detection uses a soft-delete policy keyed on a metadata flag. Together, these keep the index synchronized without requiring a full rebuild on each cycle.
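A sketch of the deletion side, declared on a blob data source (the metadata property name IsDeleted is illustrative):

```json
{
  "name": "mfg-blobs",
  "type": "azureblob",
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
    "softDeleteColumnName": "IsDeleted",
    "softDeleteMarkerValue": "true"
  }
}
```

Documents whose metadata flag matches the marker value are removed from the index on the next indexer run instead of lingering as stale hits.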
Operational Playbook
- Deploy the skillset behind an alias-versioned index. Bump the version on schema-breaking changes.
- Run a labelled eval set on every skillset change. Fail the deploy if recall@10 drops more than 5%.
- Monitor indexer success rates in Application Insights. Alert on error rate above 2%.
- Dashboard embedding skill throughput. If it falls below your SLA, scale the Azure OpenAI deployment or batch the backfill over weekends.
- Periodic sampling: pick 20 random recent documents, run the full pipeline in a staging index, and manually review the extracted, chunked, embedded output. Catches silent regressions.
What Good Looks Like
A production manufacturing search index, tuned, typically hits recall@10 above 85% on a labelled eval set of domain queries, MRR above 0.6, and p95 query latency under 600 ms including semantic re-ranking. These are achievable numbers on Standard S2 or S3 with a well-designed schema. The difference between those numbers and a default-configured index is usually a factor of two or three on every metric.
The gap closes fast once the three stages are treated as first-class engineering work. Text extraction picks the right skill for each format. Enrichment adds the signals the retrieval layer needs. Chunking and embedding respect the corpus structure. Semantic ranking catches what keyword-plus-vector missed. None of these steps is difficult; each requires attention.