AI & Cloud Infrastructure

Building Reliable Agent Tools: Schemas, Idempotency, Recovery

By Technspire Team · April 21, 2026

Tools are the agent's interface to reality. The model decides which tool to call, with what arguments, and how to interpret the result. A tool that returns ambiguous errors makes the agent retry forever. A tool with overlapping responsibilities makes the agent pick wrong. A tool without idempotency means the agent's mid-loop crash leaves the world in a half-changed state. Most agent-stack writeups focus on prompts and frameworks; the part that actually decides whether agents work in production is the tool layer underneath.

Tool Granularity

The first design choice is granularity. Too coarse, and the model wastes tokens describing nuance the tool cannot use. Too fine, and the model has to call ten tools to accomplish one task, multiplying latency, cost, and decision points where it can pick wrong.

The pragmatic rule: a tool should map to one user-comprehensible action. create_invoice is a tool. create_invoice_header, add_invoice_line, and finalize_invoice are usually not separate tools, even though the underlying API may have those endpoints. Compose at the tool boundary; do not expose the underlying granularity to the model.
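A sketch of composing at the tool boundary. The fine-grained endpoint names here (createInvoiceHeader, addInvoiceLine) are hypothetical stand-ins for whatever the underlying API exposes; the point is that the model only ever sees the single composed tool:

```typescript
// Hypothetical fine-grained endpoints, composed behind one tool the model sees.
type Line = { description: string; quantity: number; unitAmount: number };

// Stand-ins for the underlying API (assumed names, not a real SDK).
function createInvoiceHeader(customerId: string): string {
  return `INV-${customerId}`; // returns a draft invoice id
}
const storedLines: Line[] = [];
function addInvoiceLine(invoiceId: string, line: Line): void {
  storedLines.push(line);
}

// The single tool exposed to the model: one user-comprehensible action.
function createInvoice(customerId: string, lines: Line[]): string {
  const invoiceId = createInvoiceHeader(customerId);
  for (const line of lines) addInvoiceLine(invoiceId, line);
  return invoiceId;
}
```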

Counterexample: when a workflow has natural decision points where the agent should choose between paths, splitting tools is correct. search_customers and create_customer are separate because the agent's job is precisely to decide which applies.

Schema Design

Frontier models call tools reliably when the schemas are well designed. Three principles matter.

Describe semantics, not types. A field called amount: number is useless. A field called amount with description "Net amount in the order's currency, before tax. Must be non-negative." gives the model what it needs to use it correctly.

Constrain enums where possible. An order_status field that accepts any string lets the model invent values. The same field with enum: ["pending", "confirmed", "shipped", "cancelled"] guarantees a valid value at decode time.

Make required fields actually required. Optional fields are fine. Optional fields the agent must pass anyway are not. If the operation cannot succeed without a customer ID, mark customerId required and let the schema validator reject malformed calls before they hit your business logic.

// A well-shaped tool definition (Anthropic format, equivalent in OpenAI)
{
  name: 'create_invoice',
  description: 'Create a draft invoice for a customer. The invoice is not sent until finalize_invoice is called.',
  input_schema: {
    type: 'object',
    properties: {
      customerId: {
        type: 'string',
        description: 'Customer identifier, format CUST-NNNNNN. Must exist (use search_customers first if unsure).'
      },
      currency: {
        type: 'string',
        enum: ['SEK', 'EUR', 'USD'],
        description: 'ISO-4217 currency code. Must match the customer record.'
      },
      lineItems: {
        type: 'array',
        minItems: 1,
        items: {
          type: 'object',
          properties: {
            description: { type: 'string', minLength: 1, maxLength: 200 },
            quantity:    { type: 'number', minimum: 0.01 },
            unitAmount:  { type: 'number', minimum: 0 }
          },
          required: ['description', 'quantity', 'unitAmount']
        }
      },
      idempotencyKey: {
        type: 'string',
        description: 'Caller-generated UUID. Same key returns the same invoice without creating a duplicate.'
      }
    },
    required: ['customerId', 'currency', 'lineItems', 'idempotencyKey']
  }
}

Idempotency: The Unit-of-Work Principle

An agent loop crashes mid-execution. The orchestrator restarts. Without idempotency, the agent re-issues a tool call that already succeeded, creating a duplicate invoice, double-charging a customer, or sending a notification twice. With idempotency, the second call returns the result of the first.

Make every state-changing tool require an idempotency key. The agent passes a UUID for each logical operation. The implementation stores the result keyed on that UUID for some retention window (24 hours is typical). Subsequent calls with the same key return the cached result.

Idempotency keys also defend against the model itself. Frontier models occasionally call the same tool twice in adjacent turns, because of a context window edit, a misread of intermediate state, or a streaming retry. Idempotency turns this from an incident into a no-op.
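The storage side can be sketched in a few lines. This is a minimal in-memory version, assuming the 24-hour retention window from above; a production implementation would use a shared store such as Redis with a TTL per key:

```typescript
// Minimal in-memory idempotency store (sketch; production would use a
// shared store with a TTL on each key).
type InvoiceResult = { invoiceId: string };
const cache = new Map<string, { result: InvoiceResult; storedAt: number }>();
const RETENTION_MS = 24 * 60 * 60 * 1000; // 24-hour retention window

function withIdempotency(key: string, operation: () => InvoiceResult): InvoiceResult {
  const hit = cache.get(key);
  if (hit && Date.now() - hit.storedAt < RETENTION_MS) {
    return hit.result; // second call returns the first call's result
  }
  const result = operation();
  cache.set(key, { result, storedAt: Date.now() });
  return result;
}
```

A repeated call with the same key returns the cached invoice and never re-runs the operation, which is exactly the no-op behaviour that absorbs a restarted loop or a duplicated model turn.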

Error Responses the Model Can Act On

A 500 status code with no body teaches the model nothing. The agent retries blindly until the step budget runs out. A structured error response with a code, a human-readable message, and ideally a recovery hint lets the model recover.

// Error response shape that the model can reason over
{
  "ok": false,
  "error": {
    "code": "CUSTOMER_NOT_FOUND",
    "message": "No customer with id CUST-000111 exists.",
    "hint": "Call search_customers with the customer's name or email to find the correct ID, then retry.",
    "retryable": false
  }
}

// vs the unhelpful version
{
  "error": "Internal server error"
}

The codes that matter most:

  • VALIDATION_FAILED with field-level details. The agent corrects the call.
  • NOT_FOUND with a hint about how to look up the missing entity. The agent searches first.
  • CONFLICT with the conflicting state. The agent reconciles.
  • FORBIDDEN with the missing permission named explicitly. The agent escalates to a human-in-the-loop step or stops.
  • RATE_LIMITED with a retry-after value. The agent waits and retries.

Read-Then-Write Pattern

The model is more reliable when it grounds writes in a prior read. The pattern: tool A returns the current state, tool B accepts the state's revision identifier and refuses to write if it has changed.

This catches the classic agent failure where the model decides to update something based on stale context. Optimistic concurrency control on the tool layer removes the failure mode at the cost of one extra retry-with-fresh-read in the rare conflict case.

// Read includes a version
get_order returns:  { orderId: "...", status: "pending", version: 7, ... }

// Write requires the version
update_order_status({
  orderId: "...",
  newStatus: "confirmed",
  expectedVersion: 7    // refuses if current version is not 7
})
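The server side of this check is small. A sketch with an in-memory order record, reusing the structured-error shape from earlier (the field names are illustrative):

```typescript
// Server-side optimistic concurrency check (sketch, in-memory state).
type Order = { orderId: string; status: string; version: number };
const orders = new Map<string, Order>([
  ["ORD-1", { orderId: "ORD-1", status: "pending", version: 7 }],
]);

function updateOrderStatus(orderId: string, newStatus: string, expectedVersion: number) {
  const order = orders.get(orderId);
  if (!order) {
    return { ok: false as const, error: { code: "NOT_FOUND", retryable: false } };
  }
  if (order.version !== expectedVersion) {
    // Stale read: refuse the write and tell the model how to recover.
    return {
      ok: false as const,
      error: {
        code: "CONFLICT",
        currentVersion: order.version,
        hint: "Call get_order again and retry with the fresh version.",
        retryable: true,
      },
    };
  }
  order.status = newStatus;
  order.version += 1;
  return { ok: true as const, version: order.version };
}
```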

Dry-Run Before High-Stakes Writes

For writes that are expensive, public, or irreversible, expose a dry_run mode that returns what the operation would do without doing it. The system prompt instructs the agent to dry-run high-stakes operations and confirm with the user before committing.

Implementation is cheap: the same code path with a flag that skips the final commit step. The model can compose the verification into its own reasoning, presenting "I am about to send 47 emails to customers in the EU region; here is the list" before actually doing it.
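A sketch of the flag, with a hypothetical sendCampaign tool and made-up recipient data. One code path; the dry run reports what would happen and stops short of delivery:

```typescript
// One code path, one flag that skips the commit (sketch; data is illustrative).
type SendResult = { wouldSend: string[]; sent: boolean };
const euCustomers = ["a@example.com", "b@example.com"]; // assumed customer data

function sendCampaign(region: string, dryRun: boolean): SendResult {
  const recipients = region === "EU" ? euCustomers : [];
  if (dryRun) {
    return { wouldSend: recipients, sent: false }; // report, do not act
  }
  // ...actual delivery would happen here...
  return { wouldSend: recipients, sent: true };
}
```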

Compensating Actions

Some operations cannot be made idempotent at the data layer. Sending an email cannot be retried with idempotency in the strict sense; the email server has already delivered the message. The defence is a compensating tool: a way to undo the operation. cancel_invoice alongside finalize_invoice. recall_email alongside send_email where the underlying provider supports it.

Compensating tools convert "the agent did something irreversible during a confused turn" into "the agent did something it can undo, and a follow-up turn can recognise the mistake and call the compensating tool."
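One way to wire this into the orchestrator is a small undo journal: each committed action is recorded together with its compensating call, and a rollback unwinds them in reverse order. This is a saga-style sketch, not a prescription:

```typescript
// A minimal undo journal (saga-style sketch): record each committed
// action with its compensating call, unwind in reverse on rollback.
type Action = { tool: string; compensate: () => void };
const journal: Action[] = [];

function record(tool: string, compensate: () => void): void {
  journal.push({ tool, compensate });
}

function rollback(): void {
  while (journal.length > 0) {
    journal.pop()!.compensate(); // most recent action is undone first
  }
}
```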

Versioning Tools Without Breaking Running Agents

Tool schemas evolve. Adding a new optional field is safe. Renaming a field, adding a required field, or changing the meaning of an enum value is breaking. The pattern: version the tool name, not the schema. create_invoice stays stable; create_invoice_v2 appears alongside it for the new behaviour. Deprecate the old one over a documented window.

Inside the tool implementation, the v1 wrapper translates calls into the v2 logic. Behavioural drift between the two is a regression worth catching in evaluation.
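A sketch of the wrapper, assuming a hypothetical v2 change where amount was renamed to netAmount and a required currency was added. The v1 tool keeps its old schema and delegates:

```typescript
// v1 stays callable; its wrapper translates into the v2 logic (sketch).
// Assumed change: v2 renamed `amount` to `netAmount` and requires a currency.
type V1Args = { customerId: string; amount: number };
type V2Args = { customerId: string; netAmount: number; currency: string };

function createInvoiceV2(args: V2Args): string {
  return `${args.customerId}:${args.netAmount}:${args.currency}`;
}

// create_invoice (v1) delegates, making v1's implicit default explicit.
function createInvoiceV1(args: V1Args): string {
  return createInvoiceV2({
    customerId: args.customerId,
    netAmount: args.amount,
    currency: "SEK", // what v1 always assumed
  });
}
```

Because both versions end in the same code path, a behavioural diff between them is a single comparison in the evaluation suite.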

Tool Documentation as a Production Asset

The tool description is what the model reads when deciding whether to call it. Treat it as production-grade documentation. The tone, the level of detail, the inclusion of "use this when" and "do not use this when" hints all change the model's choice rate.

A pattern that works on most frontier models: the description starts with a sentence about when to use the tool, then a sentence about its preconditions, then a sentence about its postconditions, then a sentence about what it does not do.

description: 'Create a draft invoice for a customer.
Use this when the user wants to start an invoice or you have all
the line items and the customer is identified.
Preconditions: customerId must exist (use search_customers if unsure)
and currency must match the customer record.
Postconditions: a draft invoice exists with the given line items;
it is not sent until finalize_invoice is called.
This tool does not send the invoice or charge the customer.'

Anti-Patterns

  • Tools that do too much. A process_order tool that creates a customer if missing, validates inventory, charges the card, and sends a receipt is six tools wearing a costume. The agent cannot reason about partial failure.
  • Free-form arguments. A tool that accepts a single query: string argument and parses it server-side defeats the schema. The model has nowhere to be precise.
  • Different tools that do the same thing slightly differently. The model picks the wrong one. Consolidate.
  • Mutating shared state without a versioned read. Stale-context mistakes go undetected.
  • Throwing exceptions instead of returning structured errors. The runtime sees the exception, not the model. The model gets a generic "tool failed" message and cannot adapt.

An Evaluation Habit

Tool quality compounds. A tool that fails 5% of the time in a five-step agent loop fails 23% of the time end-to-end. The remediation lives in the tool layer, not in better prompting. Build a labelled evaluation set of tasks where the correct tool sequence is known, run it on every tool change, and treat tool-call accuracy as a first-class metric alongside answer quality.
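The compounding arithmetic is worth making explicit: a 5% per-call failure rate over five independent steps gives an end-to-end failure rate of 1 − 0.95⁵ ≈ 22.6%, the "23%" above.

```typescript
// End-to-end failure when each of `steps` calls independently fails
// with probability `perStepFailure`.
function endToEndFailure(perStepFailure: number, steps: number): number {
  return 1 - Math.pow(1 - perStepFailure, steps);
}
```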

When a regression appears, the cause is usually a tool description that drifted, a schema that grew an inconvenient enum, or an error response that stopped surfacing the recovery hint. The fix is in the tool, not in the agent.

Where Production Agents Win

Teams that ship reliable agents in 2026 spend the majority of their engineering time on the tool layer. Schema design, idempotency, error response shape, versioning, evaluation. The visible artifact is an agent that does the right thing. The invisible foundation is a set of tools, each of which is genuinely well-designed software, treated with the same engineering discipline you would apply to a public API. Treat tools as the API your most demanding consumer ever uses, because that is what they are.