Technical Guide

AI Decisions You Can Defend: Auditability, Traceability, and Proof in Production

How to build AI systems with full decision traceability. Structured audit events, HMAC receipts, session-scoped decision chains, human approval records, and retention architecture.

March 5, 2026 · 18 min read · Oronts Engineering Team

"What Did the AI Do, and Can You Prove It?"

This question comes up in every enterprise AI deployment. Not from engineers. From legal, compliance, procurement, and the board. The answer they need is not "we used GPT-4" or "the model was fine-tuned on our data." They need specifics: what data went in, which model processed it, what tools were called, what human approved the action, and whether the record can be verified after the fact.

Most AI systems can't answer this question. They log prompts and responses (if anything), but those logs don't tell you the decision chain. They don't tell you why the system chose option A over option B. They don't tell you who approved a high-value action. And they definitely don't provide tamper-evident proof that the record hasn't been modified since the decision was made.

We built decision traceability into multiple production AI systems. This article covers the architecture patterns that make AI decisions defensible. Not theoretically defensible. Provably defensible, with cryptographic receipts and immutable records.

For broader context, our AI governance guide and our human-in-the-loop guide cover related patterns. This article focuses on the proof layer: what to log, how to structure it, and how to make it verifiable.

What Decision Traceability Actually Means

Decision traceability is not logging. Logging tells you what happened. Traceability tells you why it happened, who authorized it, and whether the record is trustworthy.

| Capability | Standard Logging | Decision Traceability |
|---|---|---|
| What happened | Prompt and response text | Structured decision event with typed fields |
| Which model | Maybe in headers | Explicit: model ID, version, provider, temperature |
| What data was used | Raw prompt (contains PII) | Token IDs referencing session mapping (no PII) |
| What tools were called | Maybe in debug logs | Structured tool call chain with inputs and outputs |
| Who approved it | Not tracked | Approval record: who, when, what they saw, what they decided |
| Can you verify it | No (logs can be edited) | HMAC receipt: tamper-evident, cryptographically signed |
| Retention | Whatever your log aggregator keeps | Policy-based: 90 days operational, 7 years archive |

The difference matters when a customer disputes an AI-generated recommendation, when a regulator asks how a decision was made, or when an internal audit needs to verify that the AI system followed policy.

The Decision Event Schema

Every AI decision generates a structured event. Not a log line. A typed record with explicit fields for every dimension of the decision.

interface AiDecisionEvent {
    // Identity
    event_id: string;              // UUID, unique per event
    event_type: string;            // "transform", "rehydrate", "tool_call", "agent_action", "approval"
    timestamp: string;             // ISO 8601 UTC

    // Actor
    actor_type: string;            // "agent" | "human" | "system" | "scheduler"
    actor_id: string;              // agent thread ID, user ID, or system component name

    // Context
    tenant_id: string;             // multi-tenant scoping
    session_id: string;            // groups events within a session
    correlation_id: string;        // links related events across services
    channel_id?: string;           // which channel (web, api, widget)

    // Model
    model_provider?: string;       // "openai" | "anthropic" | "local"
    model_id?: string;             // "gpt-4o" | "claude-sonnet-4-20250514"
    model_version?: string;        // deployment version or checkpoint

    // Decision
    action: string;                // what was done: "generate_response", "call_tool", "approve_order"
    input_summary: object;         // structured summary (NO raw PII, only token IDs and types)
    output_summary: object;        // structured summary of the result
    decision_rationale?: string;   // why this action was taken (from agent reasoning)

    // Policy
    policy_id?: string;            // which policy was evaluated
    policy_result?: string;        // "allowed" | "denied" | "escalated"
    policy_conditions?: object;    // which conditions were checked

    // Approval (if HITL)
    approval_required: boolean;
    approval_status?: string;      // "pending" | "approved" | "rejected"
    approved_by?: string;          // user ID of approver
    approved_at?: string;          // when approval was given
    approval_context?: object;     // what the approver saw when deciding

    // Integrity
    receipt_hmac?: string;         // HMAC-SHA256 of the event payload
    previous_event_id?: string;    // chain link to previous event in session
}

The key design decisions:

No raw PII in events. The input_summary contains token IDs (p_001, e_001) and entity types, never raw values. This means your audit storage doesn't become a GDPR-regulated system. See our GDPR compliance guide for the full architecture.

Explicit model identification. Not just "we used an LLM." The specific provider, model ID, and version are recorded. When a model is updated or swapped, you can trace which decisions used which version.

Chain linking. The previous_event_id field creates a linked chain of events within a session. Event 3 points to event 2, which points to event 1. The chain proves the sequence of decisions and that no events were inserted or removed after the fact.
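As a sketch, appending a chain-linked, signed event might look like the following. The `ChainedEvent` shape and `appendToChain` helper are illustrative (a trimmed-down version of the schema above), not a prescribed API:

```typescript
import { randomUUID, createHmac } from 'node:crypto';

// Trimmed-down event shape for illustration; see AiDecisionEvent above.
interface ChainedEvent {
    event_id: string;
    session_id: string;
    timestamp: string;
    action: string;
    input_summary: object;
    previous_event_id?: string;
    receipt_hmac?: string;
}

// Append a new event to a session chain: link it to the previous event's ID,
// then sign the payload so later modifications are detectable.
function appendToChain(
    chain: ChainedEvent[],
    sessionId: string,
    action: string,
    inputSummary: object,
    tenantSecret: string
): ChainedEvent {
    const previous = chain[chain.length - 1];
    const event: ChainedEvent = {
        event_id: randomUUID(),
        session_id: sessionId,
        timestamp: new Date().toISOString(),
        action,
        input_summary: inputSummary,
        previous_event_id: previous?.event_id,
    };
    // Sign before storing; the HMAC covers everything except itself.
    const canonical = JSON.stringify(event, Object.keys(event).sort());
    event.receipt_hmac = createHmac('sha256', tenantSecret)
        .update(canonical)
        .digest('hex');
    chain.push(event);
    return event;
}
```

The first event in a session has no `previous_event_id`; every subsequent event carries the ID of its predecessor, which is what makes gaps and insertions detectable.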

Session-Scoped Decision Chains

A single AI interaction often involves multiple decisions. A customer support agent might: read the ticket (event 1), look up customer info (event 2), check billing (event 3), draft a response (event 4), and send the email (event 5). Each step is a decision event. Together, they form a decision chain.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  SESSION: sess_abc123                  β”‚
β”‚                                                       β”‚
β”‚  Event 1         Event 2         Event 3              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚  β”‚ READ     │──▢│ LOOKUP   │──▢│ CHECK    β”‚         β”‚
β”‚  β”‚ TICKET   β”‚   β”‚ CUSTOMER β”‚   β”‚ BILLING  β”‚         β”‚
β”‚  β”‚          β”‚   β”‚          β”‚   β”‚          β”‚         β”‚
β”‚  β”‚ model:   β”‚   β”‚ tool:    β”‚   β”‚ tool:    β”‚         β”‚
β”‚  β”‚ claude   β”‚   β”‚ crm_api  β”‚   β”‚ billing  β”‚         β”‚
β”‚  β”‚          β”‚   β”‚          β”‚   β”‚ _api     β”‚         β”‚
β”‚  β”‚ tokens:  β”‚   β”‚ tokens:  β”‚   β”‚ tokens:  β”‚         β”‚
β”‚  β”‚ p_001    β”‚   β”‚ cid_001  β”‚   β”‚ o_001    β”‚         β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β”‚       β”‚              β”‚              β”‚                 β”‚
β”‚       β–Ό              β–Ό              β–Ό                 β”‚
β”‚  Event 4         Event 5                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚  β”‚ DRAFT    │──▢│ SEND     β”‚                         β”‚
β”‚  β”‚ RESPONSE β”‚   β”‚ EMAIL    β”‚                         β”‚
β”‚  β”‚          β”‚   β”‚          β”‚                         β”‚
β”‚  β”‚ model:   β”‚   β”‚ channel: β”‚                         β”‚
β”‚  β”‚ claude   β”‚   β”‚ email    β”‚                         β”‚
β”‚  β”‚          β”‚   β”‚          β”‚                         β”‚
β”‚  β”‚ policy:  β”‚   β”‚ restore: β”‚                         β”‚
β”‚  β”‚ support  β”‚   β”‚ formattedβ”‚                         β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β”‚                                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Each event references the previous one. The chain is verifiable: if someone deletes event 3, the chain from event 4 back to event 2 has a gap. If someone inserts a fake event between 2 and 3, the chain links don't match.

Querying the Chain

When an auditor asks "what happened with session X?", you query by session_id and reconstruct the chain:

// Reconstruct a session's decision chain and verify its links.
// eventStore: any store exposing findBySessionId (injected, not `this`-bound).
async function getDecisionChain(
    eventStore: { findBySessionId(id: string, opts: object): Promise<AiDecisionEvent[]> },
    sessionId: string
): Promise<AiDecisionEvent[]> {
    const events = await eventStore.findBySessionId(sessionId, {
        orderBy: 'timestamp',
        direction: 'ASC',
    });

    // Verify chain integrity
    for (let i = 1; i < events.length; i++) {
        if (events[i].previous_event_id !== events[i - 1].event_id) {
            throw new ChainIntegrityError(
                `Chain broken at event ${events[i].event_id}: ` +
                `expected previous ${events[i - 1].event_id}, ` +
                `got ${events[i].previous_event_id}`
            );
        }
    }

    return events;
}

For how we handle similar chain verification in commerce transactions, see our agentic commerce guide, which uses HMAC receipts for the same purpose.

HMAC Receipts: Tamper-Evident Proof

Decision events stored in a database can be modified. An HMAC receipt proves that the event data has not changed since it was created.

import crypto from 'node:crypto';

function signDecisionEvent(event: AiDecisionEvent, tenantSecret: string): string {
    // Canonical form: sorted keys, deterministic JSON.
    // Note: an array replacer filters keys at every nesting level, so nested
    // summary objects need their keys included too (or canonicalize recursively).
    const canonical = JSON.stringify(event, Object.keys(event).sort());
    return crypto.createHmac('sha256', tenantSecret).update(canonical).digest('hex');
}

function verifyDecisionEvent(event: AiDecisionEvent, storedHmac: string, tenantSecret: string): boolean {
    const recomputed = signDecisionEvent(event, tenantSecret);
    return crypto.timingSafeEqual(
        Buffer.from(recomputed, 'hex'),
        Buffer.from(storedHmac, 'hex')
    );
}

Every decision event is signed at creation time. The HMAC is stored alongside the event. To verify, recompute the HMAC from the current event data and compare. If a single field was modified after signing, the HMAC won't match.

| Property | Value |
|---|---|
| Algorithm | HMAC-SHA256 |
| Key | Per-tenant secret (rotated annually) |
| Canonicalization | JSON.stringify(payload, Object.keys(payload).sort()) |
| Output | Hex-encoded string (64 characters) |
| Comparison | Timing-safe (crypto.timingSafeEqual) |

The per-tenant secret means one tenant's receipts can't be verified using another tenant's key. Key rotation includes a 24-hour overlap period where both old and new keys are accepted for verification.
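During the overlap window, verification can simply try the current secret first and fall back to the previous one. A minimal sketch, assuming the same canonicalization as above (`verifyWithRotation` and `hmacOf` are illustrative names, not part of any stated API):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Same canonicalization as signDecisionEvent: sorted keys, deterministic JSON.
function hmacOf(payload: object, secret: string): string {
    const canonical = JSON.stringify(payload, Object.keys(payload).sort());
    return createHmac('sha256', secret).update(canonical).digest('hex');
}

// During the 24-hour rotation overlap, accept receipts signed with either
// the current or the previous tenant secret.
function verifyWithRotation(
    event: object,
    storedHmac: string,
    currentSecret: string,
    previousSecret?: string
): boolean {
    const candidates = [currentSecret, previousSecret].filter(
        (s): s is string => s !== undefined
    );
    return candidates.some((secret) => {
        const recomputed = hmacOf(event, secret);
        // Both buffers are 32 bytes (64 hex chars), so lengths always match.
        return timingSafeEqual(
            Buffer.from(recomputed, 'hex'),
            Buffer.from(storedHmac, 'hex')
        );
    });
}
```

Once the overlap window closes, verification is called without the previous secret and old-key receipts stop validating, which is why re-signing archived events during rotation (or retaining the old key for archive verification) needs to be part of the rotation runbook.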

Human Approval Records

When a decision requires human approval (high-value transactions, sensitive data access, policy exceptions), the approval itself is a decision event with specific fields:

interface ApprovalEvent extends AiDecisionEvent {
    event_type: 'approval';
    approval_required: true;

    // What the human saw when deciding
    approval_context: {
        original_request: string;      // summary of what was requested
        estimated_impact: string;      // "Order for 2,500 EUR from supplier Alpha"
        policy_triggered: string;      // "require_human_approval_above: 500"
        agent_recommendation: string;  // what the agent suggested
        risk_flags: string[];          // any warnings surfaced to the approver
    };

    // What the human decided
    approved_by: string;               // user ID
    approved_at: string;               // ISO 8601
    approval_status: 'approved' | 'rejected';
    rejection_reason?: string;         // if rejected, why
    approval_duration_ms: number;      // how long the human took to decide
}

The approval_context field is critical. It records what information was presented to the human when they made their decision. This prevents the argument "I approved it but I didn't know X." The record shows exactly what the approver saw.

approval_duration_ms is also useful for audit. If an approver consistently approves in under 2 seconds, that suggests rubber-stamping rather than genuine review. Compliance teams use this metric to evaluate whether human oversight is meaningful.
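A simple heuristic over approval records can surface this pattern. The sketch below flags approvers whose median decision time falls under a threshold; the `ApprovalRecord` shape and `flagRubberStampers` function are illustrative, and a low median is a signal for human follow-up, not proof of rubber-stamping:

```typescript
interface ApprovalRecord {
    approved_by: string;
    approval_status: 'approved' | 'rejected';
    approval_duration_ms: number;
}

// Flag approvers whose median approval time is below thresholdMs.
// Median (not mean) so one long, genuine review doesn't mask a habit
// of reflexive clicking.
function flagRubberStampers(
    approvals: ApprovalRecord[],
    thresholdMs = 2000
): string[] {
    const byApprover = new Map<string, number[]>();
    for (const a of approvals) {
        if (a.approval_status !== 'approved') continue;
        const durations = byApprover.get(a.approved_by) ?? [];
        durations.push(a.approval_duration_ms);
        byApprover.set(a.approved_by, durations);
    }
    const flagged: string[] = [];
    for (const [approver, durations] of byApprover) {
        const sorted = [...durations].sort((a, b) => a - b);
        const median = sorted[Math.floor(sorted.length / 2)];
        if (median < thresholdMs) flagged.push(approver);
    }
    return flagged;
}
```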

What NOT to Log

Decision traceability requires discipline about what goes into the audit trail.

Do log:

  • Token IDs and entity types (e.g., "entity p_001 of type person was detected")
  • Model identifiers and versions
  • Tool call names and structured parameters
  • Policy evaluation results
  • Approval records with context
  • Timing information (latencies, durations)
  • Error codes and failure reasons

Do NOT log:

  • Raw PII (names, emails, phone numbers, addresses)
  • Full prompt text (contains PII and is enormous)
  • Full model responses (same problems)
  • Authentication credentials or API keys
  • Internal system passwords or connection strings

// Good: structured, PII-free
{
    event_type: "transform",
    action: "detect_and_tokenize",
    input_summary: {
        entities_detected: 3,
        entity_types: ["person", "email", "customer_id"],
        token_ids: ["p_001", "e_001", "cid_001"],
        detection_confidence: [0.95, 1.0, 0.99],
    },
    policy_id: "german-support",
    model_id: "ner-spacy-de",
    duration_ms: 12,
}

// Bad: contains PII, useless for structured queries
{
    event_type: "transform",
    action: "process_input",
    input: "Hallo, ich bin Sara Mustermann, meine Kundennummer ist 948221...",
    output: "Hallo, ich bin {{person:p_001}}...",
}

The good example is queryable ("show me all events where detection confidence was below 0.8"), filterable ("show me all events for policy german-support"), and PII-free. The bad example is a blob of text that becomes a GDPR liability.

For the full architecture of PII-safe logging in AI systems, see our AI observability guide.

Retention Architecture

Decision events have different retention requirements depending on their regulatory context:

| Tier | Storage | Retention | Queryable | Use Case |
|---|---|---|---|---|
| Hot | Database (PostgreSQL / DynamoDB) | 90 days | Full SQL/query | Debugging, ops dashboards, real-time monitoring |
| Warm | Object storage (S3) | 2 years | By session_id, date range | Internal audits, customer disputes, compliance reviews |
| Cold | Object storage with write-once locks | 7 years | By session_id only | Regulatory audits, legal holds, financial compliance |

The cold tier uses object storage with compliance-mode locks. Once written, records cannot be modified or deleted until the retention period expires. This is not just access control. The storage system physically prevents deletion, even by administrators.

Decision Event Created
  β”‚
  β”œβ”€β”€β–Ά Hot Tier (database): immediate write, queryable
  β”‚
  β”œβ”€β”€β–Ά Warm Tier (object storage): batched daily export
  β”‚
  └──▢ Cold Tier (locked object storage): stream from database
       via change data capture, write-once, 7-year lock

The streaming from database to cold storage happens through change data capture (database streams or WAL shipping). Events are written to the immutable archive within minutes of creation. There is no batch job that runs daily and might miss events. The stream is continuous.

Correlation Across Services

In a distributed AI system, a single user request might touch multiple services: an API gateway, a data protection runtime, an LLM provider, a tool server, and an audit service. The correlation_id ties all decision events from all services together.

// API gateway generates correlation_id
const correlationId = generateUUID();

// Every downstream service receives it
const response = await dataProtection.transform(input, {
    headers: { 'X-Correlation-Id': correlationId },
});

// Every decision event includes it
const event: AiDecisionEvent = {
    correlation_id: correlationId,
    // ...
};

When debugging or auditing, query by correlation_id to get the complete picture across all services. This is the same pattern used in distributed tracing, but applied specifically to decision events rather than performance traces.
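Within a single Node service, `AsyncLocalStorage` avoids threading the correlation ID through every function signature: set it once at the request boundary, read it wherever a decision event is emitted. A minimal sketch (helper names are illustrative):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

const correlationStore = new AsyncLocalStorage<string>();

// Wrap each incoming request: reuse the inbound X-Correlation-Id header
// if present, otherwise mint a new ID at the edge.
function withCorrelation<T>(inboundId: string | undefined, fn: () => T): T {
    return correlationStore.run(inboundId ?? randomUUID(), fn);
}

// Anywhere in the request's async call tree, read the current ID
// when constructing a decision event.
function currentCorrelationId(): string | undefined {
    return correlationStore.getStore();
}
```

Event-construction code then calls `currentCorrelationId()` to populate `correlation_id` without any explicit parameter passing, and the value survives across `await` boundaries within the wrapped request.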

Practical Implementation

Storage Choice

| Requirement | PostgreSQL | DynamoDB | Event Store (e.g., EventStoreDB) |
|---|---|---|---|
| Structured queries | Excellent | Limited (key-value) | Limited (stream-based) |
| Write throughput | Good (with connection pooling) | Excellent (auto-scaling) | Excellent |
| Chain integrity | Application-level | Application-level | Built-in (append-only streams) |
| Retention policies | Application-level | TTL on items | Built-in |
| Cost at scale | Fixed (server-based) | Pay-per-request | Fixed |

For most implementations, PostgreSQL is the right choice for the hot tier. It's queryable, transactional, and your team already knows it. DynamoDB works well if you're on AWS and need auto-scaling write throughput. A dedicated event store is overkill unless you have thousands of decision events per second.

Query Patterns

The most common queries against the decision event store:

-- All decisions in a session (reconstruct the chain)
SELECT * FROM ai_decision_events
WHERE session_id = $1
ORDER BY timestamp ASC;

-- All decisions by a specific agent in the last 24 hours
SELECT * FROM ai_decision_events
WHERE actor_type = 'agent' AND actor_id = $1
AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY timestamp DESC;

-- All denied policy evaluations (find misconfigured policies)
SELECT * FROM ai_decision_events
WHERE policy_result = 'denied'
AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;

-- All human approvals with short review time (rubber-stamping detection)
SELECT * FROM ai_decision_events
WHERE event_type = 'approval'
AND approval_status = 'approved'
AND approval_duration_ms < 3000
AND timestamp > NOW() - INTERVAL '30 days';

-- Verify chain integrity for a session
SELECT e1.event_id, e1.previous_event_id,
       CASE WHEN e2.event_id IS NULL AND e1.previous_event_id IS NOT NULL
            THEN 'BROKEN' ELSE 'OK' END as chain_status
FROM ai_decision_events e1
LEFT JOIN ai_decision_events e2 ON e1.previous_event_id = e2.event_id
WHERE e1.session_id = $1;

Indexing

CREATE INDEX idx_session ON ai_decision_events (session_id, timestamp);
CREATE INDEX idx_actor ON ai_decision_events (actor_type, actor_id, timestamp);
CREATE INDEX idx_correlation ON ai_decision_events (correlation_id);
CREATE INDEX idx_policy_result ON ai_decision_events (policy_result, timestamp);
CREATE INDEX idx_approval ON ai_decision_events (event_type, approval_status, timestamp)
    WHERE event_type = 'approval';

Common Pitfalls

  1. Logging raw prompts as audit trail. Prompts contain PII. Your audit storage becomes GDPR-regulated. Use structured events with token IDs instead.

  2. No chain linking between events. Without previous_event_id, you can't prove the sequence of decisions. Events can be inserted, deleted, or reordered without detection.

  3. No HMAC signing. Database records can be modified. Without cryptographic receipts, the audit trail is not tamper-evident. "Trust us, we didn't edit the logs" is not defensible.

  4. Same retention for everything. Debugging data needs 90 days. Compliance data needs 7 years. Mixing them wastes money (keeping debug data too long) or creates risk (deleting compliance data too early).

  5. No approval context. Recording that "user X approved action Y" is not enough. Record what information the approver saw when deciding. Without context, the approval is meaningless for audit.

  6. Rubber-stamp detection missing. If human oversight is a compliance requirement, you need to verify that humans are actually reviewing, not just clicking "approve" reflexively. Track approval_duration_ms.

  7. No correlation across services. If your AI system spans multiple services, events from each service are isolated. Without a correlation_id, you can't reconstruct the full decision chain.

  8. Mutable cold storage. If your long-term archive can be edited or deleted by administrators, it's not an audit trail. Use write-once storage with compliance-mode locks.

Key Takeaways

  • "We used GPT-4" is not a defensible answer. Record the specific model, version, provider, input tokens, output tokens, tools called, policies evaluated, and humans who approved. Every dimension of the decision.

  • Structured events, not log lines. Typed fields enable structured queries, dashboards, anomaly detection, and compliance reports. Free-text logs enable nothing except grep.

  • No PII in decision events. Use token IDs from your data protection layer. The audit trail must not itself become a data protection liability.

  • Chain linking proves sequence. Each event points to its predecessor. Gaps and insertions are detectable. Combined with HMAC signing, the chain is tamper-evident.

  • HMAC receipts provide cryptographic proof. Per-tenant signing keys, canonical JSON serialization, timing-safe comparison. Any modification to any field invalidates the receipt.

  • Human approval records must include context. What did the approver see? How long did they take? Without this, "human oversight" is a checkbox, not a control.

  • Three-tier retention matches regulatory reality. Hot for debugging (90 days), warm for audits (2 years), cold with write-once locks for regulators (7 years).

We apply these patterns across our AI systems, from data protection runtimes to agentic commerce platforms. If you're building AI systems that need to satisfy enterprise compliance requirements, talk to our team or request a quote. You can explore our AI services, our trust and compliance approach, and our guides on AI systems architecture and AI failure modes for more context.

Topics covered

AI auditabilityAI traceabilityAI decision loggingAI compliance auditAI decision proofAI governance productionLLM audit trailAI accountability

Ready to build production AI systems?

Our team specializes in building production-ready AI systems. Let's discuss how we can help transform your enterprise with cutting-edge technology.

Start a conversation