Technical Guide

From AI Prototype to Production: The 15 Things That Change Completely

What changes when you move an AI system from prototype to production. Auth, cost tracking, PII handling, fallback models, monitoring, compliance, and the team structure shift.

April 15, 2026 · 16 min read · Oronts Engineering Team

The Prototype Delusion

Every AI prototype works. You call an API, pass a prompt, get a response. The demo impresses stakeholders. The team estimates two sprints to production.

Twelve months later, the system is still not in production. Not because the AI doesn't work. Because everything around it doesn't work: auth, rate limiting, cost tracking, PII handling, fallback models, retry logic, version management, monitoring, alerting, compliance documentation, multi-region deployment, disaster recovery, and operational runbooks.

The AI model is maybe 10% of a production AI system. The other 90% is engineering. This article covers the 15 things that change completely when you move from prototype to production.

For specific patterns, see our guides on AI GDPR compliance, AI observability, and AI decision traceability.

The 15 Things

1. Authentication and Multi-Tenancy

Prototype: one API key hardcoded in the environment.

Production: JWT tokens with tenant scoping, API key management with rotation, role-based access control, rate limiting per tenant, usage tracking per tenant.

Every AI request must carry tenant identity. Every response must be scoped. Every cost must be attributed. See our multi-tenant design guide for the full architecture.
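A minimal sketch of what tenant scoping looks like in code. The types and the `ai:invoke` role name are illustrative assumptions; in a real system the JWT would already be verified and decoded by a library such as jose before this check runs.

```typescript
// Sketch: tenant scoping check for every AI request (hypothetical types).
// Assumes the token has already been cryptographically verified upstream.
interface TokenClaims {
    sub: string;        // user id
    tenantId: string;   // tenant the token is scoped to
    roles: string[];    // e.g. ['ai:invoke']
}

interface AiRequest {
    tenantId: string;   // tenant the request claims to act for
    prompt: string;
}

class AuthError extends Error {}

// Reject any request whose tenant does not match the token's scope,
// or whose token lacks the role required to call the model.
function authorize(claims: TokenClaims, req: AiRequest): void {
    if (claims.tenantId !== req.tenantId) {
        throw new AuthError('tenant mismatch');
    }
    if (!claims.roles.includes('ai:invoke')) {
        throw new AuthError('missing ai:invoke role');
    }
}
```

Running this check on every request is also what makes per-tenant cost attribution possible later: the tenant identity is established before any model call happens.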

2. Cost Management

Prototype: $50/month on the credit card.

Production: $5,000-50,000/month across multiple providers, models, and use cases. Without cost tracking per tenant, per model, per use case, you can't price your product, identify waste, or forecast spend.

// Track cost per request
const cost = calculateCost({
    provider: 'openai',
    model: 'gpt-4o',
    promptTokens: response.usage.prompt_tokens,
    completionTokens: response.usage.completion_tokens,
});

await costTracker.record({
    tenantId: ctx.tenantId,
    model: 'gpt-4o',
    useCase: 'customer-support',
    costUsd: cost,
    timestamp: new Date(),
});

3. PII Handling

Prototype: raw customer data in every prompt.

Production: semantic tokenization, trust boundaries, policy-driven restore, audit trails without PII, GDPR compliance. See our GDPR compliance guide and data leakage prevention guide for the full architecture.

4. Fallback Models

Prototype: one model, one provider. If it's down, the system is down.

Production: primary model with automatic fallback to secondary. Different models for different tasks (fast model for classification, accurate model for generation). Provider-level redundancy.

async function generateWithFallback(prompt: string, options: GenerateOptions): Promise<string> {
    const providers = [
        { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
        { provider: 'openai', model: 'gpt-4o' },
        { provider: 'local', model: 'llama-3.1-70b' },
    ];

    for (const config of providers) {
        try {
            return await llmClient.generate(prompt, config);
        } catch (error) {
            logger.warn('Provider failed, trying fallback', {
                provider: config.provider,
                // `error` is `unknown` under strict TypeScript; narrow before use
                error: error instanceof Error ? error.message : String(error),
            });
        }
    }
    throw new Error('All providers failed');
}

5. Rate Limiting

Prototype: no limits.

Production: per-tenant rate limits, per-model rate limits, global rate limits. Without them, one tenant's batch job saturates the API and every other tenant gets timeouts.
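One common way to implement the per-tenant limit is a token bucket. This is a minimal in-memory sketch; a production deployment would back the buckets with Redis or similar so the limits hold across multiple API instances.

```typescript
// Sketch: per-tenant token-bucket rate limiter (in-memory, single node).
interface Bucket { tokens: number; lastRefill: number; }

class TenantRateLimiter {
    private buckets = new Map<string, Bucket>();

    constructor(
        private capacity: number,      // burst size per tenant
        private refillPerSec: number,  // sustained requests/sec per tenant
    ) {}

    tryAcquire(tenantId: string, now = Date.now()): boolean {
        let b = this.buckets.get(tenantId);
        if (!b) {
            b = { tokens: this.capacity, lastRefill: now };
            this.buckets.set(tenantId, b);
        }
        // Refill proportionally to elapsed time, capped at capacity.
        const elapsedSec = (now - b.lastRefill) / 1000;
        b.tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSec);
        b.lastRefill = now;
        if (b.tokens >= 1) {
            b.tokens -= 1;
            return true;
        }
        return false; // caller returns 429 to this tenant only
    }
}
```

Because each tenant has its own bucket, one tenant's batch job exhausts its own budget without starving the others.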

6. Retry Logic

Prototype: if it fails, try again manually.

Production: exponential backoff with jitter for transient failures. Circuit breaker for provider outages. No retry for validation errors. Different strategies for different failure types.
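The backoff-with-jitter strategy can be sketched in a few lines. The `isTransient` predicate is an assumption you must map to your provider's actual error types (429s and 5xx are typically transient; 400-level validation errors are not).

```typescript
// Sketch: exponential backoff with full jitter. Transient errors are
// retried; everything else is rethrown immediately.
async function withRetry<T>(
    fn: () => Promise<T>,
    isTransient: (err: unknown) => boolean,
    maxAttempts = 4,
    baseDelayMs = 250,
): Promise<T> {
    for (let attempt = 1; ; attempt++) {
        try {
            return await fn();
        } catch (err) {
            if (!isTransient(err) || attempt >= maxAttempts) throw err;
            // Full jitter: random delay in [0, base * 2^(attempt-1)].
            const delay = Math.random() * baseDelayMs * 2 ** (attempt - 1);
            await new Promise((resolve) => setTimeout(resolve, delay));
        }
    }
}
```

The jitter matters: without it, many clients that failed at the same moment all retry at the same moment, re-saturating the provider.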

7. Version Management

Prototype: latest model, latest prompt.

Production: pinned model versions, versioned prompts, A/B testing between prompt versions, rollback capability, evaluation suites that run before deploying a new prompt version.
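A versioned prompt store can be very simple. The shape below is illustrative, not a specific product's API: deployments reference an explicit active version, so a bad prompt change is rolled back by repointing, not redeploying.

```typescript
// Sketch: a minimal versioned prompt registry (names are illustrative).
interface PromptVersion { version: string; template: string; }

class PromptRegistry {
    private versions = new Map<string, PromptVersion[]>();
    private active = new Map<string, string>(); // promptId -> active version

    register(promptId: string, v: PromptVersion): void {
        const list = this.versions.get(promptId) ?? [];
        list.push(v);
        this.versions.set(promptId, list);
    }

    // Activation is the deploy step: run your evaluation suite first.
    activate(promptId: string, version: string): void {
        const list = this.versions.get(promptId) ?? [];
        if (!list.some((v) => v.version === version)) {
            throw new Error(`unknown version ${version} for ${promptId}`);
        }
        this.active.set(promptId, version);
    }

    get(promptId: string): PromptVersion {
        const version = this.active.get(promptId);
        const found = (this.versions.get(promptId) ?? [])
            .find((v) => v.version === version);
        if (!found) throw new Error(`no active version for ${promptId}`);
        return found;
    }
}
```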

8. Monitoring and Alerting

Prototype: check the console.

Production: latency percentiles (p50, p95, p99), error rates by provider, token usage trends, cost per day/week/month, hallucination detection, quality scoring, alert on anomalies.

// Metrics to track
const metrics = {
    latency_ms: response.latencyMs,
    tokens_prompt: response.usage.promptTokens,
    tokens_completion: response.usage.completionTokens,
    cost_usd: response.cost,
    model: response.model,
    provider: response.provider,
    status: response.error ? 'error' : 'success',
    finish_reason: response.finishReason,
    tenant_id: ctx.tenantId,
};

await metricsCollector.record('llm_request', metrics);

See our AI observability guide and OpenTelemetry guide for implementation patterns.

9. Compliance Documentation

Prototype: "we use GPT-4."

Production: data processing records (GDPR Art. 30), data protection impact assessment (Art. 35), model cards documenting capabilities and limitations, audit trails for every decision, human approval records for high-value actions.

See our AI decision traceability guide for the audit architecture.

10. Caching

Prototype: every request hits the LLM.

Production: semantic caching for similar queries, response caching for identical queries, embedding caching for repeated documents. Caching reduces cost and latency by 30-60% for typical workloads.
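The core of a semantic cache is a similarity lookup over embeddings. The sketch below assumes embeddings are computed elsewhere (the embedding call is your model provider's concern); the cache itself is just nearest-neighbor search with a similarity threshold. A real deployment would use a vector store rather than a linear scan.

```typescript
// Sketch: a semantic cache keyed on embedding cosine similarity.
interface CacheEntry { embedding: number[]; response: string; }

function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
    private entries: CacheEntry[] = [];

    constructor(private threshold = 0.92) {} // tune per use case

    lookup(embedding: number[]): string | null {
        let best: CacheEntry | null = null;
        let bestSim = -1;
        for (const e of this.entries) {
            const sim = cosine(embedding, e.embedding);
            if (sim > bestSim) { bestSim = sim; best = e; }
        }
        return best !== null && bestSim >= this.threshold ? best.response : null;
    }

    store(embedding: number[], response: string): void {
        this.entries.push({ embedding, response });
    }
}
```

The threshold is the key tuning knob: too low and users get answers to questions they didn't ask; too high and the cache never hits.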

11. Input Validation

Prototype: trust the user input.

Production: prompt injection detection, input length limits, content filtering, language detection, intent classification before expensive LLM calls.
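Cheap checks should run before any model call. The injection patterns below are purely illustrative; real prompt-injection detection needs a dedicated classifier, not a regex list, but even a naive gate rejects the obvious cases at near-zero cost.

```typescript
// Sketch: pre-LLM input checks. Pattern list is illustrative only.
const MAX_INPUT_CHARS = 8000;

const INJECTION_PATTERNS = [
    /ignore (all )?previous instructions/i,
    /disregard your system prompt/i,
    /you are now (a|an) /i,
];

interface ValidationResult { ok: boolean; reason?: string; }

function validateInput(input: string): ValidationResult {
    if (input.trim().length === 0) return { ok: false, reason: 'empty' };
    if (input.length > MAX_INPUT_CHARS) return { ok: false, reason: 'too_long' };
    for (const pattern of INJECTION_PATTERNS) {
        if (pattern.test(input)) return { ok: false, reason: 'possible_injection' };
    }
    return { ok: true };
}
```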

12. Output Validation

Prototype: trust the model output.

Production: output guard for hallucinated PII, citation verification against retrieved context, structured output parsing with schema validation, confidence scoring, fallback responses for low-quality output.
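Structured output parsing is the simplest of these guards to show in code. The `Ticket` shape is an example, not part of any real API; libraries like zod make the same checks declarative, but the principle is the same: never hand model output downstream without validating it.

```typescript
// Sketch: parse and validate structured model output before trusting it.
interface Ticket { category: string; priority: 'low' | 'medium' | 'high'; }

function parseTicket(raw: string): Ticket | null {
    let data: unknown;
    try {
        data = JSON.parse(raw);
    } catch {
        return null; // model produced non-JSON output
    }
    if (typeof data !== 'object' || data === null) return null;
    const obj = data as Record<string, unknown>;
    if (typeof obj.category !== 'string') return null;
    if (obj.priority !== 'low' && obj.priority !== 'medium' && obj.priority !== 'high') {
        return null; // model invented an enum value
    }
    return { category: obj.category, priority: obj.priority };
}
```

A `null` here routes to your fallback path: retry with a stricter prompt, or degrade to a safe default response.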

13. Streaming

Prototype: wait for the full response, display it.

Production: stream tokens to the user as they're generated. First token appears in 200-500ms even though the full response takes 2-5 seconds. Streaming changes the perceived latency dramatically.
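The shape of the streaming path, sketched with the provider stream modeled as an `AsyncIterable<string>`. In a real system `send` would write to an SSE or WebSocket connection; accumulating the full response lets you still run logging and output validation after the stream ends.

```typescript
// Sketch: forward streamed tokens to the client as they arrive,
// while accumulating the full response for post-hoc checks.
async function streamToClient(
    providerStream: AsyncIterable<string>,
    send: (token: string) => void,
): Promise<string> {
    let full = '';
    for await (const token of providerStream) {
        full += token;   // keep the complete text for logging/validation
        send(token);     // the user sees output immediately
    }
    return full;
}
```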

14. Multi-Region

Prototype: one region, one deployment.

Production: data residency requirements (EU data stays in EU), latency optimization (serve from closest region), disaster recovery (failover to secondary region).

15. The Team Change

Prototype: one ML engineer or full-stack developer.

Production: you need people who understand ops, infrastructure, compliance, cost management, and monitoring. The ML/AI expertise is necessary but not sufficient. The team needs:

| Role | Prototype | Production |
|---|---|---|
| AI/ML engineer | Builds the model integration | Maintains prompts, evaluations, model selection |
| Backend engineer | N/A | Builds the infrastructure: auth, caching, rate limiting |
| DevOps/SRE | N/A | Monitoring, deployment, incident response |
| Compliance/Legal | N/A | GDPR documentation, model governance |
| Product | Evaluates the demo | Defines quality metrics, user feedback loops |

The Production Readiness Checklist

Before going live, verify:

| Category | Check | Status |
|---|---|---|
| Auth | JWT/API key auth on every endpoint | ☐ |
| Auth | Tenant scoping on every request | ☐ |
| Auth | Rate limiting per tenant | ☐ |
| Cost | Cost tracking per request | ☐ |
| Cost | Cost alerts (daily/weekly thresholds) | ☐ |
| Cost | Budget caps per tenant | ☐ |
| PII | Semantic tokenization before LLM | ☐ |
| PII | No PII in logs | ☐ |
| PII | Output guard for hallucinated PII | ☐ |
| Reliability | Fallback model configured | ☐ |
| Reliability | Retry with exponential backoff | ☐ |
| Reliability | Circuit breaker for provider outages | ☐ |
| Monitoring | Latency, error rate, token usage metrics | ☐ |
| Monitoring | Alerts on anomalies | ☐ |
| Monitoring | Cost dashboard | ☐ |
| Compliance | GDPR Art. 30 processing record | ☐ |
| Compliance | Decision audit trail | ☐ |
| Compliance | Model version tracking | ☐ |
| Cache | Semantic cache for similar queries | ☐ |
| Validation | Input length and content validation | ☐ |
| Validation | Output schema validation | ☐ |
| Streaming | Token streaming to client | ☐ |

Common Pitfalls

  1. Estimating production timeline from prototype timeline. The prototype took 2 weeks. Production takes 6-12 months. The AI is 10% of the work.

  2. No cost tracking from day one. By the time you notice costs are out of control, you've already overspent. Track from the first production request.

  3. Single provider dependency. If OpenAI is down, your system is down. Configure fallback providers.

  4. No input validation. Prompt injection is a real attack vector. Validate and sanitize inputs before they reach the prompt.

  5. Treating compliance as an afterthought. Legal will block your launch if GDPR documentation isn't ready. Start compliance work in parallel with engineering.

  6. No semantic caching. Similar questions from different users trigger the full pipeline every time. A semantic cache reduces costs significantly.

  7. Monolithic deployment. Separate your API server from your worker processes. A long-running AI generation should not block HTTP request handling.

  8. No evaluation suite. Changing a prompt can degrade quality in ways you don't notice until users complain. Run evaluations before deploying prompt changes.

Key Takeaways

  • The AI model is 10% of a production system. Auth, cost tracking, PII handling, monitoring, compliance, caching, and reliability engineering are the other 90%.

  • Cost management is not optional. Track per-request, per-tenant, per-model. Alert on thresholds. Set budget caps. Costs scale faster than you expect.

  • Fallback providers prevent outages. No single LLM provider has 100% uptime. Configure automatic failover.

  • Compliance starts at day one, not at launch. GDPR documentation, audit trails, and model governance take time. Parallelizing with engineering saves months.

  • The team changes. A prototype needs an AI engineer. Production needs ops, infrastructure, compliance, and product people in addition.

We help teams make this transition as part of our AI services practice. From prototype architecture review to full production deployment, talk to our team or request a quote. See also our methodology page for how we approach AI projects.

Topics covered

AI production · LLM production deployment · AI scaling · AI infrastructure production · AI demo to production · AI ops · LLM monitoring · AI cost management

Ready to build production AI systems?

Our team specializes in building production-ready AI systems. Let's discuss how we can help transform your enterprise with cutting-edge technology.

Start a conversation