Azure AI Search Skillsets: OCR and Entity Extraction

Skillsets are where Azure AI Search earns its keep on industrial corpora. A default indexer extracts text from Word and PDF; a well-designed skillset extracts text from scanned paper, recognises part numbers as structured entities, identifies the document's language, pulls out tables as queryable data, and redacts personal information. These are the skills that matter most.

The Built-In Skills That Carry Most of the Load

OcrSkill and DocumentIntelligenceLayoutSkill. Extract text from images and structure from documents.
LanguageDetectionSkill. Identifies the primary language of each document.
TextTranslationSkill. Translates non-English text for cross-lingual retrieval.
EntityRecognitionSkill and CustomEntityLookupSkill. Pull generic and domain-specific entities into structured fields.
KeyPhraseExtractionSkill. Surfaces the salient terms in each document.
PIIDetectionSkill. Detects and optionally redacts personal identifiers.
SplitSkill. Chunks text for embedding.
AzureOpenAIEmbeddingSkill. Generates dense vectors per chunk.
Custom WebApiSkill. Calls user-hosted logic for domain-specific processing.

Document Intelligence Layout for Industrial Documents

Industrial documents (quality procedures, work instructions, supplier specs) frequently arrive as PDFs with embedded images, tables, and mixed layouts. The DocumentIntelligenceLayoutSkill preserves structure that raw OCR discards:

Tables are preserved as structured tables with row and column coordinates.
Markdown-formatted output with heading hierarchy.
Figure detection with captions and page coordinates.
Handwriting recognition on annotated drawings.

// Layout skill configuration
{
  "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
  "name": "layout",
  "description": "Structured extraction for industrial PDFs and scans",
  "context": "/document",
  "outputMode": "oneToMany",
  "markdownHeaderDepth": "h3",
  "inputs": [{ "name": "file_data", "source": "/document/file_data" }],
  "outputs": [
    { "name": "markdown_document", "targetName": "markdown" },
    { "name": "figures",           "targetName": "figures" },
    { "name": "tables",            "targetName": "tables" }
  ]
}
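
To make the effect of a header-depth setting such as "h3" concrete, the following Python sketch splits Markdown into sections at headings of depth 3 or shallower and leaves deeper headings inside their parent section. It illustrates the idea only and is not the skill's implementation; the function name split_markdown_sections and the sample text are assumptions made for this example.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_sections(markdown: str, max_depth: int = 3) -> list:
    """Split Markdown into sections at headings of depth <= max_depth."""
    sections = []
    current = {"level": 0, "title": "", "lines": []}

    def flush(section):
        # Keep a section only if it has a title or some non-blank content.
        if section["title"] or any(line.strip() for line in section["lines"]):
            sections.append({
                "level": section["level"],
                "title": section["title"],
                "content": "\n".join(section["lines"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_depth:
            flush(current)
            current = {"level": len(match.group(1)),
                       "title": match.group(2).strip(),
                       "lines": []}
        else:
            # Body text, tables, and headings deeper than max_depth stay here.
            current["lines"].append(line)
    flush(current)
    return sections

# Invented sample document, for illustration only.
sample = """# Work Instruction WI-100
Scope text.
## Torque Table
| Bolt | Nm |
|------|----|
| M8   | 25 |
#### Note
Deeper headings stay inside the parent section.
"""

sections = split_markdown_sections(sample)
for s in sections:
    print(s["level"], s["title"])
```

Each resulting section can then be chunked and embedded on its own, which keeps a table together with the heading that explains it.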

Custom Entity Lookup for Domain Vocabulary

Generic entity recognition catches people, organisations, and places. It misses the manufacturing vocabulary that matters: part numbers, material codes, supplier IDs, tool identifiers, engine codes. The CustomEntityLookupSkill takes a user-provided dictionary and flags matches as structured entities.

// Custom entity lookup with domain vocabulary
{
  "@odata.type": "#Microsoft.Skills.Text.CustomEntityLookupSkill",
  "name": "mfg-entities",
  "context": "/document",
  "entitiesDefinitionUri": "https://technspirestorage.blob.core.windows.net/mfg-lexicon/entities.json",
  "inputs": [{ "name": "text", "source": "/document/content" }],
  "outputs": [{ "name": "entities", "targetName": "mfgEntities" }]
}

// entities.json excerpt
[
  { "name": "PN-5428-B",  "type": "PartNumber", "id": "PN-5428-B" },
  { "name": "EN-AC-42100","type": "MaterialCode", "id": "EN-AC-42100" },
  { "name": "B4204T35",   "type": "EngineCode", "id": "B4204T35" }
]

Once mapped to index fields marked filterable and facetable, the extracted entities can drive filters and facets without further processing. For a query such as "documents discussing engine B4204T35 with failures on part PN-5428-B", the filter is a straightforward intersection of two fields.
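
The two-field intersection can be sketched as an OData filter string built in Python. The field names engineCodes and partNumbers are hypothetical; the sketch assumes both are string collection fields in the index that have been marked filterable.

```python
def odata_quote(value: str) -> str:
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"

def any_equals(field: str, value: str) -> str:
    """Match documents whose collection field contains the given value."""
    return f"{field}/any(x: x eq {odata_quote(value)})"

def build_filter(engine_code: str, part_number: str) -> str:
    # engineCodes and partNumbers are assumed field names, not from a real index.
    return " and ".join([
        any_equals("engineCodes", engine_code),
        any_equals("partNumbers", part_number),
    ])

filter_expr = build_filter("B4204T35", "PN-5428-B")
print(filter_expr)
```

The resulting string is passed as the filter parameter of a search request, alongside whatever free-text or vector query describes the failure itself.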

Language Detection and Translation

Multi-plant and multi-market organisations accumulate documents in the local language of each plant or customer. Three patterns work:

Store native, index with language-specific analyzer. Content stays in the original language; the analyzer handles tokenisation correctly for that language.
Translate to English, store both. Non-English documents are translated; both original and translation populate separate fields. English-language searches retrieve from the translated field, local searches from the original.
Multilingual embeddings. Use a multilingual embedding model (e.g. text-embedding-3-large with mixed-language input). Vector search bridges languages without explicit translation.

// Language detection feeding translation (skillset fragment)
{
  "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
  "context": "/document",
  "inputs": [{ "name": "text", "source": "/document/content" }],
  "outputs": [{ "name": "languageCode", "targetName": "detectedLanguage" }]
},
{
  "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
  "context": "/document",
  "defaultToLanguageCode": "en",
  "inputs": [
    { "name": "text",             "source": "/document/content" },
    { "name": "fromLanguageCode", "source": "/document/detectedLanguage" }
  ],
  "outputs": [{ "name": "translatedText", "targetName": "contentEn" }]
}
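
For the first pattern (store native, index with a language-specific analyzer), one common arrangement is a separate content field per language. The Python sketch below builds such field definitions. The mapping is deliberately incomplete, the content_<language> naming scheme is an assumption, and any language missing from the mapping falls back to the default analyzer.

```python
# Assumed mapping from detected language code to a built-in analyzer name.
ANALYZER_BY_LANGUAGE = {
    "en": "en.microsoft",
    "de": "de.microsoft",
    "fr": "fr.microsoft",
    "sv": "sv.microsoft",
}

def content_field(language_code: str) -> dict:
    """Build an index field definition for text stored in its native language."""
    code = language_code.lower()
    return {
        # Hyphens are replaced so that codes like pt-BR yield a valid field name.
        "name": "content_" + code.replace("-", "_"),
        "type": "Edm.String",
        "searchable": True,
        # Languages missing from the mapping fall back to the default analyzer.
        "analyzer": ANALYZER_BY_LANGUAGE.get(code, "standard.lucene"),
    }

fields = [content_field(code) for code in ("de", "sv", "pt-BR")]
for field in fields:
    print(field["name"], field["analyzer"])
```

At query time the caller restricts the search to the field matching the user's language, so each query is tokenised with the same analyzer that indexed the text.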

PII Detection for Quality Records and Incident Reports

Quality records routinely name employees, witnesses, and supplier contacts. Incident reports may include home addresses, personal phone numbers, or health information. GDPR and similar regimes require either redaction or a justified lawful basis for retaining identifiable data in searchable form. The PIIDetectionSkill handles both detection (for downstream processing) and masking (for search index storage).

{
  "@odata.type": "#Microsoft.Skills.Text.PIIDetectionSkill",
  "name": "pii",
  "context": "/document",
  "piiCategories": ["Person","PhoneNumber","Email","Address","CreditCardNumber"],
  "domain": "phi",
  "maskingMode": "replace",
  "maskingCharacter": "*",
  "inputs": [{ "name": "text", "source": "/document/content" }],
  "outputs": [
    { "name": "piiEntities",  "targetName": "piiEntities" },
    { "name": "maskedText",   "targetName": "contentRedacted" }
  ]
}
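
To show what replace-style masking amounts to, the Python sketch below overwrites each detected span with mask characters of the same length, using the offset and length carried by each detected entity. It is an illustration of the effect, not the service's implementation; the sample sentence and names are invented.

```python
def mask_text(text: str, entities: list, mask_char: str = "*") -> str:
    """Replace each detected span with mask characters of the same length."""
    chars = list(text)
    for entity in entities:
        start = entity["offset"]
        end = start + entity["length"]
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)

# Invented sample text and detections, for illustration only.
text = "Reported by Anna Berg, phone 555-0100."
entities = [
    {"text": "Anna Berg", "type": "Person",      "offset": 12, "length": 9},
    {"text": "555-0100",  "type": "PhoneNumber", "offset": 29, "length": 8},
]

masked = mask_text(text, entities)
print(masked)
```

Because the masked text keeps its original length, character offsets recorded elsewhere (chunk boundaries, entity positions) remain valid against the redacted field.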

Custom Skills for Domain-Specific Processing

The WebApiSkill calls a user-hosted endpoint with document content and receives enrichments. Typical industrial uses:

Part-number normalisation. Different plants write the same part as PN 5428 B, PN5428B, or 5428B. A normaliser service returns the canonical form plus current revision.
BOM extraction. A specialised service parses bill-of-materials tables into structured child-part lists.
Supersession lookup. Given a part number, the service returns whether it is active, superseded (by what), or withdrawn.
Regulatory tag mapping. Documents in scope of IATF 16949, ISO 13485, or AS9100 get tagged via a lookup against an internal registry.

// WebApi skill calling a user-hosted normalizer
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "part-normalizer",
  "uri": "https://mfg-enrich.azurewebsites.net/api/normalize-parts",
  "httpMethod": "POST",
  "timeout": "PT30S",
  "batchSize": 5,
  "degreeOfParallelism": 3,
  "context": "/document",
  "inputs": [{ "name": "partNumbers", "source": "/document/mfgEntities/*/matches/*/text" }],
  "outputs": [{ "name": "normalizedParts", "targetName": "parts" }]
}
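
On the receiving side, the endpoint gets a JSON body with a values array, one record per document in the batch, and must return a values array whose recordId entries match. The Python sketch below shows a handler for the normaliser, kept framework-free so it can sit behind any HTTP host. The part-number pattern (optional PN prefix, four digits, one revision letter) is an assumption based on the example spellings above, and the revision lookup is left out.

```python
import re
from typing import Optional

# Assumed part-number shape: optional "PN", four digits, one letter.
PART_PATTERN = re.compile(r"^(?:PN)?[\s-]*(\d{4})[\s-]*([A-Z])$", re.IGNORECASE)

def normalize_part(raw: str) -> Optional[str]:
    """Return the canonical PN-dddd-L form, or None if the text does not fit."""
    match = PART_PATTERN.match(raw.strip())
    if not match:
        return None
    return f"PN-{match.group(1)}-{match.group(2).upper()}"

def handle_skill_request(body: dict) -> dict:
    """Process one custom-skill batch and build the response payload."""
    results = []
    for record in body.get("values", []):
        raw_parts = record.get("data", {}).get("partNumbers") or []
        normalized, warnings = [], []
        for raw in raw_parts:
            canonical = normalize_part(raw)
            if canonical is None:
                warnings.append({"message": f"Unrecognised part number: {raw}"})
            elif canonical not in normalized:
                normalized.append(canonical)  # de-duplicate, keep first-seen order
        results.append({
            "recordId": record["recordId"],  # must echo the incoming recordId
            "data": {"normalizedParts": normalized},
            "errors": [],
            "warnings": warnings,
        })
    return {"values": results}

request = {"values": [{
    "recordId": "1",
    "data": {"partNumbers": ["PN 5428 B", "PN5428B", "5428B", "B4204T35"]},
}]}
response = handle_skill_request(request)
print(response)
```

Returning a warning for unmatched input, rather than an error, lets the rest of the document's enrichment proceed while still surfacing the problem in the indexer's execution history.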

Skillset Composition and Debugging

Skills execute in dependency order, inferred from each skill's input sources and output targets. An OCR skill producing content can feed a language detection skill, which in turn informs translation. The full graph for a production industrial skillset runs 8–12 skills. For debugging, use Debug Sessions in the Azure portal, which capture the enriched document tree for a single document so that intermediate skill outputs can be inspected.
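
The ordering principle can be sketched as a topological sort over enrichment paths: a skill depends on whichever skill produces the node its input reads from. The service resolves this order itself; the Python sketch below only illustrates the principle, and the skill names and paths are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical skills, listed out of order on purpose.
skills = {
    "embed":     {"inputs": ["/document/chunks/*"],
                  "outputs": ["/document/chunks/*/vector"]},
    "translate": {"inputs": ["/document/ocrText", "/document/detectedLanguage"],
                  "outputs": ["/document/contentEn"]},
    "ocr":       {"inputs": ["/document/normalized_images/*"],
                  "outputs": ["/document/ocrText"]},
    "split":     {"inputs": ["/document/contentEn"],
                  "outputs": ["/document/chunks"]},
    "language":  {"inputs": ["/document/ocrText"],
                  "outputs": ["/document/detectedLanguage"]},
}

def execution_order(skills: dict) -> list:
    """Order skills so that every producer runs before its consumers."""
    producers = {out: name for name, s in skills.items() for out in s["outputs"]}
    graph = {}
    for name, skill in skills.items():
        deps = set()
        for source in skill["inputs"]:
            for out, producer in producers.items():
                # A source depends on a producer if it reads that node or a child of it.
                reads_node = source == out or source.startswith(out + "/")
                if reads_node and producer != name:
                    deps.add(producer)
        graph[name] = deps  # node -> set of predecessors
    return list(TopologicalSorter(graph).static_order())

order = execution_order(skills)
print(order)
```

A skill whose input path matches no producer and no node created by document cracking never receives data, which is a frequent cause of silently empty fields and worth checking first when an output is missing.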

Cost and Throughput

Skill-metered costs scale with document count and complexity:

OCR and Layout skill: metered per page, typically 1–2 USD per 1,000 pages.
Cognitive Services skills (entity, language, translation, PII): a small free allowance (20 transactions per indexer per day) without an attached billing resource; paid thereafter at Azure AI services (formerly Cognitive Services) rates.
Embedding skill: metered at Azure OpenAI rates.
Custom WebApi skill: metered by your own infrastructure.

For a 500,000-document industrial corpus with average 15-page documents, the one-time ingestion cost is typically 15,000–30,000 USD in skills and embeddings. Ongoing cost for incremental updates (2,000 documents per month, average 20 pages) is ~100–200 USD/month.
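
The page-metered part of these figures can be checked with a few lines of Python, using only the numbers quoted above. The function name is an assumption; the rates are the 1–2 USD per 1,000 pages given for OCR and Layout.

```python
def page_metered_cost(documents: int, pages_per_document: int,
                      usd_per_1000_pages: float) -> tuple:
    """Return (total pages, cost in USD) for page-metered skills."""
    pages = documents * pages_per_document
    return pages, pages / 1000 * usd_per_1000_pages

# One-time ingestion: 500,000 documents at 15 pages each.
initial_pages, initial_low = page_metered_cost(500_000, 15, 1.0)
_, initial_high = page_metered_cost(500_000, 15, 2.0)

# Monthly increment: 2,000 documents at 20 pages each.
monthly_pages, monthly_low = page_metered_cost(2_000, 20, 1.0)
_, monthly_high = page_metered_cost(2_000, 20, 2.0)

print(f"Initial: {initial_pages:,} pages, {initial_low:,.0f} to {initial_high:,.0f} USD")
print(f"Monthly: {monthly_pages:,} pages, {monthly_low:,.0f} to {monthly_high:,.0f} USD")
```

On these numbers, OCR and Layout account for 7,500 to 15,000 USD of the one-time total and 40 to 80 USD of the monthly figure; the remainder is the other skills and embeddings.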

Operational Patterns

Start with built-in skills. Only add custom WebApi skills when a domain-specific need cannot be met by standard ones. Each custom skill is an operational dependency.
Version the skillset. Skillset definitions change over time; keep them in source control, deploy via IaC.
Run the eval set on every skillset change. Retrieval quality is sensitive to skillset changes in ways that are not obvious until measured.
Monitor cognitive service throttling. Entity, language, and translation skills can throttle under high-concurrency indexer runs. Adjust batchSize and degreeOfParallelism to stay within provisioned throughput.
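
As a minimal sketch of the eval-set habit, the Python below computes mean recall@k over a small labelled query set, so that the same number can be compared before and after a skillset change. The metric choice, the query strings, and the document IDs are assumptions for illustration; a real eval set would come from domain experts.

```python
def recall_at_k(results: dict, relevant: dict, k: int = 5) -> float:
    """Mean fraction of relevant documents found in the top k results per query."""
    scores = []
    for query, expected in relevant.items():
        if not expected:
            continue  # skip queries with no labelled relevant documents
        top_k = set(results.get(query, [])[:k])
        scores.append(len(top_k & expected) / len(expected))
    return sum(scores) / len(scores) if scores else 0.0

# Invented labels and rankings, for illustration only.
relevant = {
    "torque spec PN-5428-B": {"d1", "d2"},
    "B4204T35 coolant leak": {"d4"},
}
results_after_change = {
    "torque spec PN-5428-B": ["d1", "d3", "d2"],
    "B4204T35 coolant leak": ["d4"],
}

score = recall_at_k(results_after_change, relevant, k=2)
print(f"recall@2 = {score:.2f}")
```

Storing the score alongside the skillset version in source control makes regressions visible in the same review that introduces them.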

When to Build, When to Buy

The built-in skills cover roughly 80% of industrial indexing needs. The remaining 20% (domain-specific entity normalisation, BOM parsing, revision lookup, regulatory tagging) is typically custom. Resist the temptation to build custom skills for problems the standard ones already solve. Reserve custom development for genuine domain logic that your organisation uniquely understands.

A well-composed skillset turns an industrial corpus into a retrieval layer that respects the domain's structure. The index is not merely text; it is text plus entities plus languages plus structured tables plus redaction. Queries operate on all of that. Done well, the difference in retrieval quality is the difference between "we have search" and "we have a knowledge platform."