Technical Guide

From AI Prototype to Production: The 15 Things That Change Completely

What changes when you move an AI system from prototype to production. Auth, cost tracking, PII handling, fallback models, monitoring, compliance, and the team structure shift.

April 15, 2026 · 16 min read · Oronts Engineering Team

The Prototype Delusion

Every AI prototype works. You call an API, pass a prompt, get a response. The demo impresses stakeholders. The team estimates two sprints to production.

Twelve months later, the system is still not in production. Not because the AI doesn't work. Because everything around it doesn't work: auth, rate limiting, cost tracking, PII handling, fallback models, retry logic, version management, monitoring, alerting, compliance documentation, multi-region deployment, disaster recovery, and operational runbooks.

The AI model is maybe 10% of a production AI system. The other 90% is engineering. This article covers the 15 things that change completely when you move from prototype to production.

For specific patterns, see our guides on AI GDPR compliance, AI observability, and AI decision traceability.

The 15 Things

1. Authentication and Multi-Tenancy

Prototype: one API key hardcoded in the environment.

Production: JWT tokens with tenant scoping, API key management with rotation, role-based access control, rate limiting per tenant, usage tracking per tenant.

Every AI request must carry tenant identity. Every response must be scoped. Every cost must be attributed. See our multi-tenant design guide for the full architecture.
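A minimal sketch of what tenant scoping looks like in code. The types and the `ai:invoke` role name are illustrative assumptions; in a real system the JWT would already be verified and decoded by a library such as jose before this check runs.

```typescript
// Sketch: tenant scoping check for every AI request (hypothetical types).
// Assumes the token has already been cryptographically verified upstream.
interface TokenClaims {
    sub: string;        // user id
    tenantId: string;   // tenant the token is scoped to
    roles: string[];    // e.g. ['ai:invoke']
}

interface AiRequest {
    tenantId: string;   // tenant the request claims to act for
    prompt: string;
}

class AuthError extends Error {}

// Reject any request whose tenant does not match the token's scope,
// or whose token lacks the role required to call the model.
function authorize(claims: TokenClaims, req: AiRequest): void {
    if (claims.tenantId !== req.tenantId) {
        throw new AuthError('tenant mismatch');
    }
    if (!claims.roles.includes('ai:invoke')) {
        throw new AuthError('missing ai:invoke role');
    }
}
```

Running this check on every request is also what makes per-tenant cost attribution possible later: the tenant identity is established before any model call happens.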

2. Cost Management

Prototype: $50/month on the credit card.

Production: $5,000-50,000/month across multiple providers, models, and use cases. Without cost tracking per tenant, per model, per use case, you can't price your product, identify waste, or forecast spend.

// Track cost per request
const cost = calculateCost({
    provider: 'openai',
    model: 'gpt-4o',
    promptTokens: response.usage.prompt_tokens,
    completionTokens: response.usage.completion_tokens,
});

await costTracker.record({
    tenantId: ctx.tenantId,
    model: 'gpt-4o',
    useCase: 'customer-support',
    costUsd: cost,
    timestamp: new Date(),
});

3. PII Handling

Prototype: raw customer data in every prompt.

Production: semantic tokenization, trust boundaries, policy-driven restore, audit trails without PII, GDPR compliance. See our GDPR compliance guide and data leakage prevention guide for the full architecture.

4. Fallback Models

Prototype: one model, one provider. If it's down, the system is down.

Production: primary model with automatic fallback to secondary. Different models for different tasks (fast model for classification, accurate model for generation). Provider-level redundancy.

async function generateWithFallback(prompt: string, options: GenerateOptions): Promise<string> {
    const providers = [
        { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
        { provider: 'openai', model: 'gpt-4o' },
        { provider: 'local', model: 'llama-3.1-70b' },
    ];

    for (const config of providers) {
        try {
            return await llmClient.generate(prompt, config);
        } catch (error) {
            logger.warn('Provider failed, trying fallback', {
                provider: config.provider,
                // `error` is `unknown` under strict TypeScript; narrow before use
                error: error instanceof Error ? error.message : String(error),
            });
        }
    }
    throw new Error('All providers failed');
}

5. Rate Limiting

Prototype: no limits.

Production: per-tenant rate limits, per-model rate limits, global rate limits. Without them, one tenant's batch job saturates the API and every other tenant gets timeouts.
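One common way to implement the per-tenant limit is a token bucket. This is a minimal in-memory sketch; a production deployment would back the buckets with Redis or similar so the limits hold across multiple API instances.

```typescript
// Sketch: per-tenant token-bucket rate limiter (in-memory, single node).
interface Bucket { tokens: number; lastRefill: number; }

class TenantRateLimiter {
    private buckets = new Map<string, Bucket>();

    constructor(
        private capacity: number,      // burst size per tenant
        private refillPerSec: number,  // sustained requests/sec per tenant
    ) {}

    tryAcquire(tenantId: string, now = Date.now()): boolean {
        let b = this.buckets.get(tenantId);
        if (!b) {
            b = { tokens: this.capacity, lastRefill: now };
            this.buckets.set(tenantId, b);
        }
        // Refill proportionally to elapsed time, capped at capacity.
        const elapsedSec = (now - b.lastRefill) / 1000;
        b.tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSec);
        b.lastRefill = now;
        if (b.tokens >= 1) {
            b.tokens -= 1;
            return true;
        }
        return false; // caller returns 429 to this tenant only
    }
}
```

Because each tenant has its own bucket, one tenant's batch job exhausts its own budget without starving the others.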

6. Retry Logic

Prototype: if it fails, try again manually.

Production: exponential backoff with jitter for transient failures. Circuit breaker for provider outages. No retry for validation errors. Different strategies for different failure types.
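The backoff-with-jitter strategy can be sketched in a few lines. The `isTransient` predicate is an assumption you must map to your provider's actual error types (429s and 5xx are typically transient; 400-level validation errors are not).

```typescript
// Sketch: exponential backoff with full jitter. Transient errors are
// retried; everything else is rethrown immediately.
async function withRetry<T>(
    fn: () => Promise<T>,
    isTransient: (err: unknown) => boolean,
    maxAttempts = 4,
    baseDelayMs = 250,
): Promise<T> {
    for (let attempt = 1; ; attempt++) {
        try {
            return await fn();
        } catch (err) {
            if (!isTransient(err) || attempt >= maxAttempts) throw err;
            // Full jitter: random delay in [0, base * 2^(attempt-1)].
            const delay = Math.random() * baseDelayMs * 2 ** (attempt - 1);
            await new Promise((resolve) => setTimeout(resolve, delay));
        }
    }
}
```

The jitter matters: without it, many clients that failed at the same moment all retry at the same moment, re-saturating the provider.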

7. Version Management

Prototype: latest model, latest prompt.

Production: pinned model versions, versioned prompts, A/B testing between prompt versions, rollback capability, evaluation suites that run before deploying a new prompt version.
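A versioned prompt store can be very simple. The shape below is illustrative, not a specific product's API: deployments reference an explicit active version, so a bad prompt change is rolled back by repointing, not redeploying.

```typescript
// Sketch: a minimal versioned prompt registry (names are illustrative).
interface PromptVersion { version: string; template: string; }

class PromptRegistry {
    private versions = new Map<string, PromptVersion[]>();
    private active = new Map<string, string>(); // promptId -> active version

    register(promptId: string, v: PromptVersion): void {
        const list = this.versions.get(promptId) ?? [];
        list.push(v);
        this.versions.set(promptId, list);
    }

    // Activation is the deploy step: run your evaluation suite first.
    activate(promptId: string, version: string): void {
        const list = this.versions.get(promptId) ?? [];
        if (!list.some((v) => v.version === version)) {
            throw new Error(`unknown version ${version} for ${promptId}`);
        }
        this.active.set(promptId, version);
    }

    get(promptId: string): PromptVersion {
        const version = this.active.get(promptId);
        const found = (this.versions.get(promptId) ?? [])
            .find((v) => v.version === version);
        if (!found) throw new Error(`no active version for ${promptId}`);
        return found;
    }
}
```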

8. Monitoring and Alerting

Prototype: check the console.

Production: latency percentiles (p50, p95, p99), error rates by provider, token usage trends, cost per day/week/month, hallucination detection, quality scoring, alert on anomalies.

// Metrics to track
const metrics = {
    latency_ms: response.latencyMs,
    tokens_prompt: response.usage.promptTokens,
    tokens_completion: response.usage.completionTokens,
    cost_usd: response.cost,
    model: response.model,
    provider: response.provider,
    status: response.error ? 'error' : 'success',
    finish_reason: response.finishReason,
    tenant_id: ctx.tenantId,
};

await metricsCollector.record('llm_request', metrics);

See our AI observability guide and OpenTelemetry guide for implementation patterns.

9. Compliance Documentation

Prototype: "we use GPT-4."

Production: data processing records (GDPR Art. 30), data protection impact assessment (Art. 35), model cards documenting capabilities and limitations, audit trails for every decision, human approval records for high-value actions.

See our AI decision traceability guide for the audit architecture.

10. Caching

Prototype: every request hits the LLM.

Production: semantic caching for similar queries, response caching for identical queries, embedding caching for repeated documents. Caching reduces cost and latency by 30-60% for typical workloads.
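The core of a semantic cache is a similarity lookup over embeddings. The sketch below assumes embeddings are computed elsewhere (the embedding call is your model provider's concern); the cache itself is just nearest-neighbor search with a similarity threshold. A real deployment would use a vector store rather than a linear scan.

```typescript
// Sketch: a semantic cache keyed on embedding cosine similarity.
interface CacheEntry { embedding: number[]; response: string; }

function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
    private entries: CacheEntry[] = [];

    constructor(private threshold = 0.92) {} // tune per use case

    lookup(embedding: number[]): string | null {
        let best: CacheEntry | null = null;
        let bestSim = -1;
        for (const e of this.entries) {
            const sim = cosine(embedding, e.embedding);
            if (sim > bestSim) { bestSim = sim; best = e; }
        }
        return best !== null && bestSim >= this.threshold ? best.response : null;
    }

    store(embedding: number[], response: string): void {
        this.entries.push({ embedding, response });
    }
}
```

The threshold is the key tuning knob: too low and users get answers to questions they didn't ask; too high and the cache never hits.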

11. Input Validation

Prototype: trust the user input.

Production: prompt injection detection, input length limits, content filtering, language detection, intent classification before expensive LLM calls.
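Cheap checks should run before any model call. The injection patterns below are purely illustrative; real prompt-injection detection needs a dedicated classifier, not a regex list, but even a naive gate rejects the obvious cases at near-zero cost.

```typescript
// Sketch: pre-LLM input checks. Pattern list is illustrative only.
const MAX_INPUT_CHARS = 8000;

const INJECTION_PATTERNS = [
    /ignore (all )?previous instructions/i,
    /disregard your system prompt/i,
    /you are now (a|an) /i,
];

interface ValidationResult { ok: boolean; reason?: string; }

function validateInput(input: string): ValidationResult {
    if (input.trim().length === 0) return { ok: false, reason: 'empty' };
    if (input.length > MAX_INPUT_CHARS) return { ok: false, reason: 'too_long' };
    for (const pattern of INJECTION_PATTERNS) {
        if (pattern.test(input)) return { ok: false, reason: 'possible_injection' };
    }
    return { ok: true };
}
```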

12. Output Validation

Prototype: trust the model output.

Production: output guard for hallucinated PII, citation verification against retrieved context, structured output parsing with schema validation, confidence scoring, fallback responses for low-quality output.
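Structured output parsing is the simplest of these guards to show in code. The `Ticket` shape is an example, not part of any real API; libraries like zod make the same checks declarative, but the principle is the same: never hand model output downstream without validating it.

```typescript
// Sketch: parse and validate structured model output before trusting it.
interface Ticket { category: string; priority: 'low' | 'medium' | 'high'; }

function parseTicket(raw: string): Ticket | null {
    let data: unknown;
    try {
        data = JSON.parse(raw);
    } catch {
        return null; // model produced non-JSON output
    }
    if (typeof data !== 'object' || data === null) return null;
    const obj = data as Record<string, unknown>;
    if (typeof obj.category !== 'string') return null;
    if (obj.priority !== 'low' && obj.priority !== 'medium' && obj.priority !== 'high') {
        return null; // model invented an enum value
    }
    return { category: obj.category, priority: obj.priority };
}
```

A `null` here routes to your fallback path: retry with a stricter prompt, or degrade to a safe default response.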

13. Streaming

Prototype: wait for the full response, display it.

Production: stream tokens to the user as they're generated. First token appears in 200-500ms even though the full response takes 2-5 seconds. Streaming changes the perceived latency dramatically.
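The shape of the streaming path, sketched with the provider stream modeled as an `AsyncIterable<string>`. In a real system `send` would write to an SSE or WebSocket connection; accumulating the full response lets you still run logging and output validation after the stream ends.

```typescript
// Sketch: forward streamed tokens to the client as they arrive,
// while accumulating the full response for post-hoc checks.
async function streamToClient(
    providerStream: AsyncIterable<string>,
    send: (token: string) => void,
): Promise<string> {
    let full = '';
    for await (const token of providerStream) {
        full += token;   // keep the complete text for logging/validation
        send(token);     // the user sees output immediately
    }
    return full;
}
```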

14. Multi-Region

Prototype: one region, one deployment.

Production: data residency requirements (EU data stays in EU), latency optimization (serve from closest region), disaster recovery (failover to secondary region).

15. The Team Change

Prototype: one ML engineer or full-stack developer.

Production: you need people who understand ops, infrastructure, compliance, cost management, and monitoring. The ML/AI expertise is necessary but not sufficient. The team needs:

| Role | Prototype | Production |
|---|---|---|
| AI/ML engineer | Builds the model integration | Maintains prompts, evaluations, model selection |
| Backend engineer | N/A | Builds the infrastructure: auth, caching, rate limiting |
| DevOps/SRE | N/A | Monitoring, deployment, incident response |
| Compliance/Legal | N/A | GDPR documentation, model governance |
| Product | Evaluates the demo | Defines quality metrics, user feedback loops |

The Production Readiness Checklist

Before going live, verify:

| Category | Check | Status |
|---|---|---|
| Auth | JWT/API key auth on every endpoint | ☐ |
| Auth | Tenant scoping on every request | ☐ |
| Auth | Rate limiting per tenant | ☐ |
| Cost | Cost tracking per request | ☐ |
| Cost | Cost alerts (daily/weekly thresholds) | ☐ |
| Cost | Budget caps per tenant | ☐ |
| PII | Semantic tokenization before LLM | ☐ |
| PII | No PII in logs | ☐ |
| PII | Output guard for hallucinated PII | ☐ |
| Reliability | Fallback model configured | ☐ |
| Reliability | Retry with exponential backoff | ☐ |
| Reliability | Circuit breaker for provider outages | ☐ |
| Monitoring | Latency, error rate, token usage metrics | ☐ |
| Monitoring | Alerts on anomalies | ☐ |
| Monitoring | Cost dashboard | ☐ |
| Compliance | GDPR Art. 30 processing record | ☐ |
| Compliance | Decision audit trail | ☐ |
| Compliance | Model version tracking | ☐ |
| Cache | Semantic cache for similar queries | ☐ |
| Validation | Input length and content validation | ☐ |
| Validation | Output schema validation | ☐ |
| Streaming | Token streaming to client | ☐ |

Common Pitfalls

  1. Estimating production timeline from prototype timeline. The prototype took 2 weeks. Production takes 6-12 months. The AI is 10% of the work.

  2. No cost tracking from day one. By the time you notice costs are out of control, you've already overspent. Track from the first production request.

  3. Single provider dependency. If OpenAI is down, your system is down. Configure fallback providers.

  4. No input validation. Prompt injection is a real attack vector. Validate and sanitize inputs before they reach the prompt.

  5. Treating compliance as an afterthought. Legal will block your launch if GDPR documentation isn't ready. Start compliance work in parallel with engineering.

  6. No semantic caching. Similar questions from different users trigger the full pipeline every time. A semantic cache reduces costs significantly.

  7. Monolithic deployment. Separate your API server from your worker processes. A long-running AI generation should not block HTTP request handling.

  8. No evaluation suite. Changing a prompt can degrade quality in ways you don't notice until users complain. Run evaluations before deploying prompt changes.

Key Takeaways

  • The AI model is 10% of a production system. Auth, cost tracking, PII handling, monitoring, compliance, caching, and reliability engineering are the other 90%.

  • Cost management is not optional. Track per-request, per-tenant, per-model. Alert on thresholds. Set budget caps. Costs scale faster than you expect.

  • Fallback providers prevent outages. No single LLM provider has 100% uptime. Configure automatic failover.

  • Compliance starts at day one, not at launch. GDPR documentation, audit trails, and model governance take time. Parallelizing with engineering saves months.

  • The team changes. A prototype needs an AI engineer. Production needs ops, infrastructure, compliance, and product people in addition.

We help teams make this transition as part of our AI services practice. From prototype architecture review to full production deployment, talk to our team or request a quote. See also our methodology page for how we approach AI projects.

Topics covered

AI production · LLM production deployment · AI scaling · AI infrastructure production · AI demo to production · AI ops · LLM monitoring · AI cost management

Ready to build production AI systems?

Our team specializes in building production-ready AI systems. Let's discuss how we can help transform your enterprise with cutting-edge technology.

Start a conversation