Each layer adds management; you pick where to start. Most production AI apps build on layer 2 (model API) and add layer 3 (RAG); enterprise reaches into layer 4.
Azure AI Foundry — OpenAI under MS perimeter + Llama + Mistral + Cohere + Phi
Same models, different compliance perimeter, different egress & latency profile
Frontier vs first-party tradeoff
Anthropic via api.anthropic.com versus via Bedrock: same model, same intelligence. Different: pricing, region, compliance (HIPAA only via Bedrock until recently), authentication (API key vs IAM role), latency to your VPC. For most enterprise workloads, the hyperscaler route is mandatory.
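A minimal sketch of the same Claude call through both routes, assuming the anthropic and boto3 Python SDKs; the model names and IDs are illustrative.

# Same model, two routes: auth and endpoint differ, the request shape barely does.
import anthropic
import boto3

# Route 1: api.anthropic.com, API-key auth
direct = anthropic.Anthropic(api_key="sk-ant-...")            # key from the Anthropic console
resp = direct.messages.create(
    model="claude-sonnet-4-6",                                 # illustrative model name
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise this contract."}],
)

# Route 2: AWS Bedrock, IAM / SigV4 auth, region pinned next to your VPC
bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")
resp = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-6-v1:0",                # illustrative Bedrock model ID
    messages=[{"role": "user", "content": [{"text": "Summarise this contract."}]}],
)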
04
Specialist Inference Providers
Open-weights models (Llama, Mistral, Qwen, DeepSeek) hosted by specialists. Same weights as you could self-host (deck reference: Local LLM Hosting) but billed per token.
High-throughput open-weights hosts
Together AI — broadest Llama / Mixtral / DeepSeek / Qwen catalogue, fine-tuning
Fireworks AI — best-in-class throughput per dollar, FireFunction-tuned models
DeepInfra — cheapest end of the market
Replicate — model-by-model deploy, popular for image/video
Latency-sensitive use cases: real-time agents, voice, code completion
Image / video / audio specialists
FAL — fast image diffusion API; Flux models
Black Forest Labs — Flux family, frontier image
ElevenLabs · Cartesia · Resemble — TTS
Deepgram · AssemblyAI — STT
Runway · Pika · Luma — video
Why specialists matter
Price floor — DeepInfra Llama 70B at ~$0.50/M tokens vs Anthropic Sonnet at ~$3/M
Latency floor — Groq's 700 tok/s vs OpenAI's ~80 tok/s
Compliance floor — open weights you can self-host as a backup
Innovation pace — new models on Together / Fireworks days after release
Caveat — specialist routing
Specialists may share inference fleets; the "same model" can have measurably different output distributions across providers. Pin a provider per workload; benchmark before switching.
05
Pricing — Per-Token Economics
Model | Context | Input $/M tok | Output $/M tok | Notes
Claude Opus 4.7 | 200k | $15 | $75 | 5× discount with 1h prompt-cache hit
Claude Sonnet 4.6 | 200k–1M | $3 | $15 | most popular workhorse
Claude Haiku 4.5 | 200k | $1 | $5 | cheap-and-fast tier
GPT-5 | 400k | $2.50 | $10 | frontier OpenAI
GPT-5-mini | 400k | $0.25 | $2 | cheapest flagship-class
OpenAI o3 | 200k | $60 | $240 | reasoning model; thinks for tokens you pay for
Gemini 2.5 Pro | 1M+ | $1.25 (≤200k) · $2.50 (>200k) | $10–15 | caching; long-context tiered pricing
Gemini 2.5 Flash | 1M | $0.30 | $2.50 | cheap workhorse
Mistral Large 2 | 128k | $2 | $6 | EU-resident
Llama 3.3 70B (Together) | 128k | $0.88 | $0.88 | open-weights pricing
DeepSeek-V3 (DeepInfra) | 64k | $0.27 | $1.10 | provider-hosted in PRC
What dominates the bill
Output tokens are 2–5× the price of input — keep output concise
Long contexts (RAG-stuffing) — every retrieval round hits input pricing
Reasoning models — invisible "thinking" tokens are billed too
Agent loops — every tool-call round is a full prompt + response (worked numbers after this list)
OpenAI prompt caching — automatic, 50% discount on hits
Gemini context caching — pay storage by minute; cheaper for stable system prompts
Caching turns expensive 100k-token system prompts from 100% to 10% cost
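A back-of-envelope sketch in Python of how these dominators interact, using Sonnet-class list prices from the table; the token counts are invented, the shape of the arithmetic is the point.

# Rough cost of a 5-round agent loop at Sonnet-class prices ($3/M in, $15/M out).
IN_PRICE, OUT_PRICE = 3 / 1e6, 15 / 1e6        # $ per token
CACHED_IN_PRICE = 0.30 / 1e6                   # 90%-off cache-read rate on the stable prefix

system_prompt, rag_context, per_turn_out = 20_000, 30_000, 800   # tokens (invented)
rounds = 5                                     # every tool-call round re-sends the whole prompt

uncached = sum(
    (system_prompt + rag_context + i * per_turn_out) * IN_PRICE + per_turn_out * OUT_PRICE
    for i in range(rounds)
)
cached = sum(                                  # ignores the one-time 1.25x cache-write premium
    (system_prompt + rag_context) * CACHED_IN_PRICE
    + i * per_turn_out * IN_PRICE
    + per_turn_out * OUT_PRICE
    for i in range(rounds)
)
print(f"uncached ≈ ${uncached:.2f} per task, cached prefix ≈ ${cached:.2f}")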
06
Latency — TTFT vs Throughput
"How fast is the model?" decomposes into two numbers: TTFT (time to first token — how long the user waits to see anything) and tok/s (steady-state output rate).
Anthropic: api.anthropic.com is US-only; via Bedrock you get eu-central-1, eu-west-2, ap-northeast-1, etc.
Latency to model often dwarfs latency from model — pick the region near your users
Vertex AI lets you pin model region; Azure OpenAI mandates it (and inherits Azure's region list)
07
Prompt Caching — The 10× Cost Lever
Every conversation, agent loop, and RAG response sends the same system prompt + tool schema again and again. Prompt caching makes the provider remember the prefix; you pay for new tokens, not for the cached prefix.
Anthropic example
POST /v1/messages
{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "You are an expert coding assistant…(50KB)…",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "…"}]
}
# First call: full input price (e.g. $3/M)
# Cached calls: $0.30/M input — 90% off — for 5 min
# 1h caching: $0.50/M input — 80% off — opt-in
# Cache write: 1.25× of base input price (one-time)
When it pays
Long system prompts repeated across users
RAG context that's stable for minutes (e.g. a doc the user is asking 5 questions about)
Multi-turn agents — entire previous turn history can cache
Tool schemas (often big; rarely change)
Provider variants
OpenAI — automatic for prompts ≥ 1024 tokens; 50% off; no opt-in
Together / Fireworks — provider-side KV cache; no API surface yet
Cache placement matters
Put stable content first: system prompt → tool schemas → RAG docs → user message
Set cache_control on the last stable item — everything before is cached (see the sketch after this list)
Re-send the cached blocks every call; cache survives by being re-referenced
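A minimal sketch of that ordering, assuming Anthropic's cache_control semantics (everything up to and including the marked block is cached); the uppercase names are placeholders for your own content.

# Stable prefix first, cache_control on the last stable block, volatile user input last.
request = {
    "model": "claude-sonnet-4-6",
    "tools": TOOL_SCHEMAS,                                       # big, rarely changes
    "system": [
        {"type": "text", "text": SYSTEM_PROMPT},                 # stable across users
        {"type": "text", "text": RAG_DOCS,                       # stable for this session
         "cache_control": {"type": "ephemeral"}},                # end of the cached prefix
    ],
    "messages": [{"role": "user", "content": USER_QUESTION}],    # never cached
}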
Where caching fails
Personalised system prompts (per-user tone tweaks); shuffled doc order in RAG; templating that re-renders the same content with whitespace drift. Keep the prefix byte-for-byte stable.
LLM-judge — another model rates output (Claude / GPT) against a rubric; cheap but noisy (sketch below)
Human review — sampling for quality drift; expensive but ground-truth
"Vibes-based deploys"
The most common reason ML features regress — no eval suite, no regression test, no comparison run before deploying a prompt change. Build evals before you build features.
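A minimal regression-eval sketch with an LLM judge scoring against a rubric; call_model is a hypothetical wrapper around your SDK, the judge prompt is illustrative, and judge scores are noisy, so keep sampling for human review.

# Run a fixed prompt set through the candidate model; a judge model scores each output 1-5.
EVAL_SET = [
    {"prompt": "Summarise: …", "rubric": "Faithful, ≤3 sentences, no invented facts."},
    # …dozens more, versioned alongside the prompts they protect
]

def judge(output: str, rubric: str) -> int:
    verdict = call_model(                      # hypothetical wrapper around your judge model
        model="claude-haiku-4-5",
        prompt=f"Rubric: {rubric}\n\nOutput:\n{output}\n\nScore 1-5. Reply with the number only.",
    )
    return int(verdict.strip())

def regression_run(candidate_model: str) -> float:
    scores = [
        judge(call_model(model=candidate_model, prompt=case["prompt"]), case["rubric"])
        for case in EVAL_SET
    ]
    return sum(scores) / len(scores)           # compare against the score of the model you're replacing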
12
Guardrails & AI Gateways
A guardrail intercepts requests + responses, applying policy: redact PII, block prompt injection, refuse policy-violating output, log everything. An AI gateway is that-and-more: routing, fallback, caching, key management.
One key for many providers — abstract the provider behind a stable API
Fallback — Anthropic 5xx → fall through to OpenAI / Bedrock copy
Semantic cache — repeat questions answered from cache
Cost-aware routing — easy to Haiku, hard to Opus
Per-user rate limit + budget — stop runaways
Single audit log — compliance loves it
Where to put it
In front of every LLM call from your app. Don't hard-code provider URLs in services; route everything through the gateway. When a model deprecates or a provider goes down, you don't redeploy services.
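A sketch of the fallback behaviour the gateway gives you; the provider clients are hypothetical wrappers, and in production this logic lives in the gateway, not copy-pasted into every service.

# Try providers in order; fall through on failures so one outage doesn't take every AI feature down.
PROVIDERS = [anthropic_client, bedrock_client, openai_client]    # hypothetical per-vendor wrappers

class AllProvidersDown(Exception):
    pass

def complete(prompt: str, max_tokens: int = 512) -> str:
    for client in PROVIDERS:
        try:
            return client.complete(prompt, max_tokens=max_tokens)
        except Exception:                      # in practice: only the vendor's 5xx / timeout errors
            continue                           # fall through to the next provider's copy of the model
    raise AllProvidersDown("every configured provider failed")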
The gateway is a critical path
Your AI gateway is now in the hot path of every user request. Self-host it on CaaS (Cloud Run / Container Apps) or pick a vendor with a 99.99% SLA. Don't put it on a single Lambda.
13
MCP Server Hosting & Gateways
Model Context Protocol — Anthropic's open standard (2024) for letting LLMs use external tools, resources, and prompts. The 2025–26 wave of MCP-as-a-service hosts MCP servers and brokers OAuth.
Vercel MCP — push to Vercel, get a remote MCP endpoint
Mintlify · Stainless — auto-generate MCP servers from API specs
MCP gateways
Docker MCP Gateway — local + remote, runs MCP servers as containers behind one endpoint
Supergateway — SSE-bridge for stdio MCP servers
MCP Inspector — Anthropic's debugger
OAuth for MCP
The 2025 MCP authorisation profile (deep dive deck) specifies how an MCP client (LLM agent) authenticates the user against the MCP server's resource — Resource Indicators (RFC 8707), DPoP-bound tokens, audience-scoped consent. Most managed MCP hosts handle this for you.
If your MCP server passes the user's token to a downstream API without checking the audience, an attacker can replay it elsewhere. Mandatory: Resource Indicators on every issued token, audience-check on every consumed token. The MCP profile is explicit about this; many implementations get it wrong.
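A minimal audience check, assuming PyJWT and a known issuer key; the audience value is illustrative, and a real MCP deployment also verifies DPoP binding, expiry, and issuer.

import jwt  # PyJWT

MCP_SERVER_AUDIENCE = "https://mcp.example.com"    # the resource this server represents (illustrative)

def verify_incoming_token(token: str, issuer_public_key: str) -> dict:
    # Rejects tokens minted for any other resource, which blocks cross-service replay.
    return jwt.decode(
        token,
        issuer_public_key,
        algorithms=["RS256"],
        audience=MCP_SERVER_AUDIENCE,              # raises InvalidAudienceError on mismatch
    )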
14
LLM-Specific Security
Cloud security (deck 05) plus a layer of new shapes: the model itself is part of the attack surface, prompts are untrusted user input, and outputs can end up executed as code.
OWASP LLM Top 10 (2025)
LLM01 Prompt injection
LLM02 Sensitive info disclosure
LLM03 Supply chain (model + deps)
LLM04 Data & model poisoning
LLM05 Improper output handling (XSS, RCE in tools)
LLM06 Excessive agency (agent runs amok)
LLM07 System prompt leakage
LLM08 Vector / embedding weaknesses
LLM09 Misinformation / hallucination
LLM10 Unbounded consumption (cost / DoS)
Concrete controls
Treat all model output as untrusted — escape before HTML, parse before exec (sketch after this list)
Tool sandboxes — code interpreter in Firecracker / gVisor
Read every provider's data-processing addendum (DPA) before relying on its no-training and data-handling terms
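A sketch of the first control: escape before rendering, validate against an allow-list before executing; the tool-call shape is illustrative.

import html
import json

def render_model_output(text: str) -> str:
    # Never hand raw model output to the browser; that is stored XSS via prompt injection.
    return html.escape(text)

ALLOWED_TOOLS = {"search_docs", "create_ticket"}   # illustrative allow-list

def parse_tool_call(raw: str) -> dict:
    # Parse, then validate against the allow-list before anything executes.
    call = json.loads(raw)
    if call.get("name") not in ALLOWED_TOOLS:
        raise ValueError(f"tool {call.get('name')!r} not allowed")
    return call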
"Free" models = your data is the price
Some free / consumer-tier offerings train on inputs. For any product handling user data, use the paid API explicitly with the no-training DPA. Check the provider's setting; defaults vary.
Anthropic Privatemode — confidential-compute LLM offering for HIPAA / FedRAMP
Don't roll-your-own model card
Vendors publish model cards (Anthropic, OpenAI, Meta) — re-use them, point to them. Your own card describes your application's use of the model: scope, limits, evaluations, who's accountable.
16
Cost Engineering for LLMaaS
The five biggest levers
Prompt caching — 50–90% off input on cached prefixes
Model routing — Haiku/4o-mini for easy, Opus/o3 for hard; classify cheaply, route smart (sketch after this list)
Output limits — max_tokens; constrain to JSON; ask for shorter answers
RAG over context-stuffing — search rather than send the whole corpus
Batch APIs — Anthropic / OpenAI batch are 50% off if you can wait 24h
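A sketch of complexity-based routing: a cheap model labels the request, the frontier model only sees the hard ones. call_model is a hypothetical wrapper and the model names are illustrative.

# Classify cheaply, route smart: the classifier call costs a fraction of a cent.
CHEAP, FRONTIER = "claude-haiku-4-5", "claude-opus-4-7"    # illustrative model names

def route(prompt: str) -> str:
    label = call_model(                        # hypothetical thin wrapper over the SDK, returns text
        model=CHEAP,
        prompt=f"Label this request EASY or HARD. Reply with one word.\n\n{prompt}",
        max_tokens=3,
    ).strip().upper()
    return call_model(model=FRONTIER if label == "HARD" else CHEAP, prompt=prompt)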
Per-tenant accounting
Tag every model call with tenant_id, feature, user (sketch after this list)
Aggregate to a real-time "cost-per-tenant" metric
Surface to the customer (transparency = trust)
Charge for it (deck 04: metering)
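A sketch of the tagging side: every call emits a usage record you can aggregate into cost-per-tenant. The price map and emit_metric sink are illustrative.

import time

PRICES = {"claude-sonnet-4-6": (3e-6, 15e-6)}  # $ per input / output token, per model (illustrative)

def record_usage(tenant_id: str, feature: str, user: str, model: str,
                 input_tokens: int, output_tokens: int) -> None:
    in_price, out_price = PRICES[model]
    emit_metric(                               # hypothetical sink: metrics pipeline or warehouse table
        "llm.cost_usd",
        value=input_tokens * in_price + output_tokens * out_price,
        tags={"tenant_id": tenant_id, "feature": feature, "user": user,
              "model": model, "ts": int(time.time())},
    )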
Common runaway shapes
Agent loops without max-iteration ceiling
Streaming that's never stopped on disconnect
Embedding the entire user corpus on every signup ("just-in-case")
Logging full prompt+response into Datadog / Sentry — log volume can cost more than the API call itself
Testing in prod with a model you forgot to switch back from o3
Hard limits, by default
Per-org spend ceiling (provider has these — set them)
Per-user / per-tenant rate limit at gateway
Per-request max_tokens + max_iterations for agents (sketch after this list)
Budget alarms at 50% / 80% / 100% of monthly target
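A sketch of the agent-side ceilings: an iteration cap plus a per-request token budget. call_model, run_tool, and the reply object are hypothetical.

MAX_ITERATIONS = 8
MAX_REQUEST_TOKENS = 200_000                   # combined input + output budget for one agent run

def run_agent(task: str) -> str:
    history, spent = [{"role": "user", "content": task}], 0
    for _ in range(MAX_ITERATIONS):            # hard ceiling on tool-call rounds
        reply = call_model(messages=history, max_tokens=1_000)        # hypothetical wrapper
        spent += reply.input_tokens + reply.output_tokens
        if spent > MAX_REQUEST_TOKENS:
            raise RuntimeError("per-request token budget exceeded")
        if reply.tool_call is None:
            return reply.text                  # model finished without asking for a tool
        history += [reply.as_message(), run_tool(reply.tool_call)]    # hypothetical helpers
    raise RuntimeError("agent hit the max-iteration ceiling")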
"AI costs are infinite"
Only if you let them. The same FinOps practices that work for cloud bills (deck 02) work here — tag, alarm, optimise. The difference is a 10× cost lever (caching) usually exists.
17
LLMaaS Anti-Patterns
"Hardcode openai.com in every service"
One provider 5xx → 100% of your AI features down. Use an AI gateway from day one.
"One frontier model for everything"
You pay frontier prices on classification tasks a Haiku could do for 1/15 the cost. Model-route by task complexity.
"Trust the model output as data"
If the model returns <script>…</script> and your app renders it raw, you have a stored XSS via prompt injection. Always escape; always validate.
"No evals, just vibes"
Two weeks after launch, "is the new model better?" has no answer. Eval suite first.
"Customer data in free-tier endpoints"
Free-tier endpoints may train on your customers' data. For any production app, paid API + DPA + data-residency review.
"Agents with full repo write access"
Excessive agency. Constrain tools to the smallest scope; gate writes behind human approval; sandbox every tool call.
"RAG = vector DB"
Retrieval is hybrid (keyword + vector + reranker). Pure-vector misses on code, names, IDs, exact strings. Add BM25 from day one (merge sketch below).
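A sketch of the merge step in hybrid retrieval, reciprocal rank fusion over a keyword (BM25) result list and a vector result list; bm25_search and vector_search are assumed helpers.

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each list is doc IDs ranked best-first; RRF rewards docs that rank well in either list.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 10) -> list[str]:
    keyword_hits = bm25_search(query, top_k=50)     # assumed keyword / BM25 index
    vector_hits = vector_search(query, top_k=50)    # assumed embedding index
    return reciprocal_rank_fusion([keyword_hits, vector_hits])[:top_k]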
"Fine-tuning before evals"
Without an eval suite, you don't know whether your tune helped or hurt. Build the regression suite first.
18
Summary & Series Close
Three takeaways
LLMaaS is a stack — inference, retrieval, agents, evals, governance. Build each layer on managed services unless you have a real reason to self-host.
Cost-control levers exist (caching, routing, batch). Use them; default behaviour is expensive.
The new security shapes — prompt injection, excessive agency, supply-chain, data leakage — are real and concrete. Treat them with the same rigour as cloud security in deck 05.
"LLM-as-a-Service is the cloud's newest layer — and it inherits every prior layer's discipline: identity, security, observability, cost — plus a few new ones unique to running models you didn't write."