CLOUD SERVICE MODELS · PART 6 OF 6

LLM-as-a-Service

Providers · pricing · RAG · agents · evals · governance — the managed-AI layer of the cloud
OpenAI · Anthropic · Google · Bedrock · Vertex · Azure OpenAI · Together · Groq · Fireworks · Pinecone · Weaviate · Turbopuffer · LangSmith · Braintrust · Langfuse
📝 Prompt 🔎 Retrieve 🧠 Model 🛠 Tool / agent 📊 Eval

Managed counterparts to every part of the LLM stack. Companion to the Local LLM Hosting sub-hub for the self-hosted side.

Inference  ·  RAG  ·  Agents  ·  Evals  ·  Governance
01

Topics

Inference layer

  • The provider landscape — frontier & specialist
  • Per-token pricing economics
  • Latency: TTFT, throughput, regional routing
  • Prompt caching — 10× cost reduction

Retrieval layer

  • Embeddings-as-a-Service
  • Vector DB / hybrid search providers
  • Hosted RAG end-to-end

Agent & ops layer

  • Agents-as-a-Service (Bedrock, Vertex, Assistants)
  • Fine-tuning APIs
  • Evals / observability / guardrails
  • MCP server hosting

Governance & cost

  • LLM-specific security — prompt injection, PII, BYOK
  • Compliance (EU AI Act, ISO 42001, NIST AI RMF)
  • Cost engineering — caching, routing, fallbacks
  • Anti-patterns and a final summary
02

The LLMaaS Landscape — Five Sub-Layers

5. Application — your SaaS · your users · your prompts
4. Agents · Evals · Guardrails — Bedrock Agents · Vertex AI Agent Builder · OpenAI Assistants · LangSmith · Braintrust · Langfuse · Lakera
3. Retrieval / RAG — Pinecone · Weaviate · Turbopuffer · Vespa Cloud · MongoDB Atlas Vector · Cohere Rerank
2. Inference (the model API) — OpenAI · Anthropic · Google · Mistral · Bedrock · Vertex · Azure OpenAI · Together · Groq · Fireworks · DeepInfra · Replicate
1. Compute (GPUs / TPUs / accelerators) — NVIDIA Hopper / Blackwell · Google TPU v5p/v6 · AWS Trainium 2 · Groq LPU · SambaNova RDU · Cerebras WSE

Each layer adds management; you pick where to start. Most production AI apps build on layer 2 (model API) and add layer 3 (RAG); enterprise reaches into layer 4.

03

Frontier Providers

| Provider | Flagship (2025-26) | Strengths | Gotchas |
|---|---|---|---|
| Anthropic | Claude Opus 4.7 · Sonnet 4.6 · Haiku 4.5 | Long-context discipline, agentic tasks, prompt caching, extended thinking, MCP-native | Tighter rate limits than OpenAI on hobby tier; first-party endpoint or via Bedrock / Vertex |
| OpenAI | GPT-5 · o3 · GPT-5-mini | Largest tooling ecosystem, broadest function-calling, Realtime & Voice APIs, Assistants v2 | Frequent model-deprecation churn; data-residency only via Azure OpenAI |
| Google | Gemini 2.5 Pro · Flash · Nano Banana | 1M-token context, native video, deep Vertex integration, $0 egress between Google services | API surface fragments across AI Studio, Vertex, Firebase |
| Mistral AI | Mistral Large 2 · Codestral 25 · Pixtral | EU-resident, partial open-weights, fine-tuning on the API | Context windows shorter than rivals at flagship tier |
| xAI | Grok 4 | Real-time X data, 256k context | Less enterprise tooling, narrower compliance posture |
| DeepSeek | DeepSeek-V4 · R1.5 (reasoning) | Extremely cheap; OSS weights of equivalents | Provider hosted in PRC; many enterprises self-host the weights via specialists |

Hyperscaler model gardens

  • AWS Bedrock — Anthropic, Meta Llama, Mistral, Cohere, Stability, Amazon Nova, custom imports
  • GCP Vertex AI — Gemini + Anthropic + Llama + Mistral + JAX-based custom
  • Azure AI Foundry — OpenAI under MS perimeter + Llama + Mistral + Cohere + Phi
  • Same models, different compliance perimeter, different egress & latency profile

Frontier vs first-party tradeoff

Anthropic via api.anthropic.com versus via Bedrock: same model, same intelligence. What differs: pricing, region, compliance (HIPAA only via Bedrock until recently), authentication (API key vs IAM role), and latency to your VPC. For most enterprise workloads, the hyperscaler route is effectively mandatory.

04

Specialist Inference Providers

Open-weights models (Llama, Mistral, Qwen, DeepSeek) hosted by specialists. Same weights as you could self-host (deck reference: Local LLM Hosting) but billed per token.

High-throughput open-weights hosts

  • Together AI — broadest Llama / Mixtral / DeepSeek / Qwen catalogue, fine-tuning
  • Fireworks AI — best-in-class throughput per dollar, FireFunction-tuned models
  • DeepInfra — cheapest end of the market
  • Replicate — model-by-model deploy, popular for image/video
  • RunPod Serverless · Modal — bring-your-own-model serverless GPU

Custom-silicon, ultra-low-latency

  • Groq — LPU; 700+ tokens/sec on Llama 70B; 100ms TTFT
  • Cerebras Inference — WSE wafer-scale; world's fastest, often > 1000 tok/s
  • SambaNova Cloud — RDU; flagship "Cloud-1" for Llama 3.x & DeepSeek
  • Latency-sensitive use cases: real-time agents, voice, code completion

Image / video / audio specialists

  • FAL — fast image diffusion API; Flux models
  • Black Forest Labs — Flux family, frontier image
  • ElevenLabs · Cartesia · Resemble — TTS
  • Deepgram · AssemblyAI — STT
  • Runway · Pika · Luma — video

Why specialists matter

  • Price floor — DeepInfra Llama 70B at ~$0.50/M tokens vs Anthropic Sonnet at ~$3/M
  • Latency floor — Groq's 700 tok/s vs OpenAI's ~80 tok/s
  • Compliance floor — open weights you can self-host as a backup
  • Innovation pace — new models on Together / Fireworks days after release

Caveat — specialist routing

Specialists may share inference fleets; the "same model" can have measurably different output distributions across providers. Pin a provider per workload; benchmark before switching.

05

Pricing — Per-Token Economics

| Model | Context | Input $/M tok | Output $/M tok | Notes |
|---|---|---|---|---|
| Claude Opus 4.7 | 200k | $15 | $75 | 5× discount with 1h prompt-cache hit |
| Claude Sonnet 4.6 | 200k–1M | $3 | $15 | most popular workhorse |
| Claude Haiku 4.5 | 200k | $1 | $5 | cheap-and-fast tier |
| GPT-5 | 400k | $2.50 | $10 | frontier OpenAI |
| GPT-5-mini | 400k | $0.25 | $2 | cheapest flagship-class |
| OpenAI o3 | 200k | $60 | $240 | reasoning model; thinks for tokens you pay for |
| Gemini 2.5 Pro | 1M+ | $1.25 (≤200k) · $2.50 (>200k) | $10–15 | caching; long-context tiered pricing |
| Gemini 2.5 Flash | 1M | $0.30 | $2.50 | cheap workhorse |
| Mistral Large 2 | 128k | $2 | $6 | EU-resident |
| Llama 3.3 70B (Together) | 128k | $0.88 | $0.88 | open-weights pricing |
| DeepSeek-V3 (DeepInfra) | 64k | $0.27 | $1.10 | provider-hosted in PRC |

What dominates the bill

  • Output tokens are 2–5× the price of input — keep output concise
  • Long contexts (RAG-stuffing) — every retrieval round hits input pricing
  • Reasoning models — invisible "thinking" tokens are billed too
  • Agent loops — every tool-call round is a full prompt + response
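
Back-of-envelope arithmetic before launch catches most of these. A minimal sketch of a per-request cost model, with prices illustrative and drawn from the table above:

# Rough per-request cost model — prices are illustrative $/M tokens from the table above.
def request_cost(input_tokens, output_tokens, in_price, out_price,
                 cached_frac=0.0, cache_discount=0.9):
    """Cost of one call, optionally with a fraction of input served from prompt cache."""
    cached = input_tokens * cached_frac
    fresh = input_tokens - cached
    return (fresh * in_price
            + cached * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1e6

# Sonnet-class pricing ($3 in / $15 out), 20k-token RAG prompt, 800-token answer:
print(request_cost(20_000, 800, 3, 15))                   # ~$0.072 uncached
print(request_cost(20_000, 800, 3, 15, cached_frac=0.8))  # ~$0.029 with 80% of prefix cached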

Caching tiers

  • Anthropic prompt caching — 5min default, 1h opt-in; 90% input price discount on hits
  • OpenAI prompt caching — automatic, 50% discount on hits
  • Gemini context caching — pay storage by minute; cheaper for stable system prompts
  • Caching turns expensive 100k-token system prompts from 100% to 10% cost
06

Latency — TTFT vs Throughput

"How fast is the model?" decomposes into two numbers: TTFT (time to first token — how long the user waits to see anything) and tok/s (steady-state output rate).

Typical numbers (2025-26)

| Provider / model | TTFT | Tok/s |
|---|---|---|
| Cerebras (Llama 70B) | ~150 ms | ~2,000 |
| Groq (Llama 70B) | ~170 ms | ~700 |
| SambaNova (Llama) | ~250 ms | ~600 |
| Anthropic Haiku 4.5 | ~250 ms | ~120 |
| GPT-5-mini | ~300 ms | ~85 |
| Anthropic Sonnet 4.6 | ~500 ms | ~75 |
| Anthropic Opus 4.7 | ~800 ms | ~50 |
| Reasoning (o3, Sonnet thinking) | 1–10 s+ | varies; thinking before output |

When latency matters most

  • Voice agents — sub-200ms turn-taking; needs Cerebras / Groq + Realtime API
  • Code completion — completions feel laggy above ~400 ms, and every extra 200 ms makes it worse
  • Chat UX — < 1.5 s TTFT keeps users engaged
  • Agents — N tool-calls × per-call latency; latency multiplies

Streaming, every time
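
Stream tokens as they generate; perceived latency collapses to TTFT instead of total generation time. A minimal sketch with the OpenAI Python SDK (model id illustrative, taken from this deck's tables):

# Minimal streaming sketch — OpenAI Python SDK; model id illustrative.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Summarise our refund policy."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # the user starts reading at TTFT, not at completion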

Regional routing

  • Anthropic: api.anthropic.com is US-only; via Bedrock you get eu-central-1, eu-west-2, ap-northeast-1, etc.
  • Latency to model often dwarfs latency from model — pick the region near your users
  • Vertex AI lets you pin model region; Azure OpenAI mandates it (and inherits Azure's region list)
07

Prompt Caching — The 10× Cost Lever

Every conversation, agent loop, and RAG response sends the same system prompt + tool schema again and again. Prompt caching makes the provider remember the prefix; you pay for new tokens, not for the cached prefix.

Anthropic example

POST /v1/messages
{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "You are an expert coding assistant…(50KB)…",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role":"user","content":"…"}]
}

# First call:    full input price (e.g. $3/M)
# Cached calls:  $0.30/M input — 90% off — for 5 min
# 1h caching:    $0.50/M input — 80% off — opt-in
# Cache write:   1.25× of base input price (one-time)

When it pays

  • Long system prompts repeated across users
  • RAG context that's stable for minutes (e.g. a doc the user is asking 5 questions about)
  • Multi-turn agents — entire previous turn history can cache
  • Tool schemas (often big; rarely change)

Provider variants

  • OpenAI — automatic for prompts ≥ 1024 tokens; 50% off; no opt-in
  • Gemini context caching — explicit cache create; pay storage per hour
  • DeepSeek — automatic, 90% off cached input
  • Together / Fireworks — provider-side KV cache; no API surface yet

Cache placement matters

  • Put stable content first: system prompt → tool schemas → RAG docs → user message
  • Set cache_control on the last stable item — everything before is cached
  • Re-send the cached blocks every call; cache survives by being re-referenced
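
In SDK terms, the placement rule looks like this — a minimal sketch with the Anthropic Python SDK (tool and document content illustrative; model id as this deck names it):

# Stable-first ordering: tools → system → RAG docs, cache_control on the last stable block.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{                              # big, rarely-changing schema — prime cache material
        "name": "search_orders",          # illustrative tool
        "description": "Look up an order by id.",
        "input_schema": {"type": "object",
                         "properties": {"order_id": {"type": "string"}}},
    }],
    system=[
        {"type": "text", "text": "You are a support assistant…"},       # stable prefix
        {"type": "text", "text": "<per-session RAG docs, stable for minutes>",
         "cache_control": {"type": "ephemeral"}},                       # everything above caches
    ],
    messages=[{"role": "user", "content": "Where is order 1234?"}],     # volatile suffix
)

On a cache hit, everything up to and including the marked block is billed at the discounted rate; only the user message pays full input price.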

Where caching fails

Personalised system prompts (per-user tone tweaks); shuffled doc order in RAG; templating that re-renders the same content with whitespace drift. Keep the prefix byte-for-byte stable.

08

Embeddings & Vector DBs — RAG-as-a-Service

Embedding APIs

| Provider | Model | Dim | $/M tok |
|---|---|---|---|
| OpenAI | text-embedding-3-large | 3072 | $0.13 |
| OpenAI | text-embedding-3-small | 1536 | $0.02 |
| Voyage AI | voyage-3-large | 1024 | $0.18 |
| Cohere | embed-v4 | variable | $0.12 |
| Google | text-embedding-005 | 768 | $0.025 |
| Mistral | mistral-embed | 1024 | $0.10 |
| Jina | v4 (cloud), multilingual | 2048 | $0.05 |
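
The call itself is one line in any SDK — a minimal sketch with the OpenAI SDK and a model from the table (inputs illustrative):

# Minimal embedding call — OpenAI SDK; output goes straight into your vector DB.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I rotate my API key?", "Key rotation procedure"],
)
vectors = [d.embedding for d in resp.data]  # 1536-dim float lists, one per input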

Vector DB providers

  • Pinecone — managed, serverless tier, hybrid (BM25 + dense), generous free tier
  • Weaviate Cloud — OSS lineage, hybrid, generative modules built-in
  • Turbopuffer — object-storage-backed, cheap at scale
  • Vespa Cloud — search engine + vectors; for "real" search workloads
  • Qdrant Cloud · Milvus / Zilliz — open-source-rooted
  • Postgres with pgvector — Neon, Supabase, Crunchy; cheaper at small scale

Hybrid search wins

  • Pure-vector misses keyword/code matches
  • BM25 alone misses semantic similarity
  • Hybrid (RRF fusion or learned reranker) is now table-stakes — Pinecone, Weaviate, Vespa, and Elastic all support it natively
  • Add a reranker (Cohere Rerank, Jina Rerank, Voyage rerank) for the last 10% of relevance
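
RRF itself is tiny — a sketch of the fusion step (k = 60 is the conventional constant; doc ids illustrative):

# Reciprocal Rank Fusion: merge a BM25 ranking and a dense-vector ranking into one list.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns doc ids sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["doc3", "doc1", "doc7"],   # BM25 order
             ["doc1", "doc9", "doc3"]])  # dense-vector order
# → doc1 and doc3 rise to the top; hand the top-N to a reranker for the final 10%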

Hosted RAG end-to-end

  • Bedrock Knowledge Bases — S3 ingest → Bedrock embeddings → OpenSearch / Pinecone / Aurora pgvector → RetrieveAndGenerate API
  • Vertex AI Search — Google's managed RAG layer
  • Azure AI Search + Foundry — full RAG stack on Azure
  • Claude file search / OpenAI file search — provider-native RAG tools, no DB to operate

RAG companion deck

Deeper architecture lives in the RAG & Retrieval Systems sub-hub.

09

Agents-as-a-Service

The provider hosts the agent loop — tool calls, memory, retries, observability — so you give it a task and a set of tools, and it runs the loop.

Hyperscaler offerings

  • AWS Bedrock Agents + AgentCore — declarative agent, code-interpreter, knowledge-base, action groups (Lambda)
  • Vertex AI Agent Builder — Google's; integrates Vertex Search, gen-app builder
  • Azure AI Foundry Agents — Azure-side, builds on Assistants v2

Provider-side agents

  • OpenAI Agent Kit + Responses API — built-in tools (web, file, code interpreter, computer use)
  • Anthropic Computer Use · Code Execution · Web Search — first-party tools, MCP client built-in
  • Google AI Studio Agents — Gemini + tools

DIY framework + managed runtime

  • LangGraph Cloud — host LangGraph graphs, persistent state, checkpoints
  • CrewAI Enterprise — multi-agent teams
  • AutoGen Studio — Microsoft's
  • LlamaIndex Workflows — declarative multi-step
  • See Agents & Orchestration sub-hub

When to outsource the loop

  • Standard "answer-then-tool-then-answer" agents → provider-side (cheaper, better-cached)
  • Complex graphs with checkpoints, human-in-the-loop, observability needs → LangGraph Cloud
  • Tightly-coupled to your domain logic → DIY in your CaaS / FaaS

The agent cost trap

An agent loop hides token cost: 10 tool-call rounds × full-prompt resends × output can bill 50× a single completion. Cache aggressively, set max-iteration ceilings, alarm on "agent ran > 30 steps".
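
The compounding is easy to underestimate — a sketch of how input tokens grow when every round resends the full history (numbers illustrative):

# Why a 10-round agent loop bills an order of magnitude more than one completion.
system_and_tools = 8_000   # stable prefix tokens (illustrative)
per_round_growth = 2_000   # each round appends tool call + tool result + assistant turn
rounds = 10

total_input = sum(system_and_tools + r * per_round_growth for r in range(rounds))
print(total_input)         # 170,000 input tokens vs 8,000 for a single call — before output
# Prompt-caching the stable prefix claws most of this back — pair the two levers.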

10

Fine-Tuning APIs

What's offered

| Provider | Type | Models |
|---|---|---|
| OpenAI | SFT, DPO, vision SFT | 4o, 4o-mini, GPT-5-mini |
| Anthropic (via Bedrock) | SFT | Haiku 3.5; Sonnet 4 GA in 2025 |
| Google Vertex | SFT, RLHF, distillation | Gemini Flash; Gemma open-weights |
| Mistral | SFT, LoRA | Mistral Large & smaller |
| Together / Fireworks | SFT, LoRA, DPO | any open-weight on platform |
| AWS Bedrock | SFT, continued pretrain | Llama, Titan, Nova |
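
Whatever the provider, chat-SFT data is usually the same shape — a minimal sketch of an OpenAI-style JSONL training file (example rows illustrative):

# Shape of a chat-SFT training file: one JSON object with a "messages" list per line.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "Answer in our support-ticket JSON schema."},
        {"role": "user", "content": "My invoice is wrong."},
        {"role": "assistant", "content": '{"intent": "billing_dispute", "priority": "high"}'},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # you want at least a few hundred rows like this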

When to fine-tune

  • Distillation — small model gets close to a frontier model on your task
  • Tone / style / format that prompting can't reliably get
  • Domain vocabulary the base model handles poorly
  • Output schema reliability (e.g. JSON adherence) when constrained decoding isn't enough

When not to fine-tune

  • "Knowledge update" — RAG is faster, cheaper, more current
  • Prompt engineering plus eval cycle hasn't been exhausted
  • You don't have ≥ a few hundred high-quality examples
  • The base model improves quarterly; your tune ages out

Cost shape

  • Training: $/1M training tokens (rough — OpenAI 4o: $25/M training tokens)
  • Inference: tuned models cost more per-token (~1.5–2× base) and can't share caches with base
  • Hosting: some providers charge a per-hour or per-month "deploy" fee

Fine-tuning is rarely the first lever

Most production wins come from prompt engineering + retrieval + good evals. Fine-tune only when those hit a ceiling.

11

Evals & Observability — As-a-Service

LLM observability platforms

  • LangSmith — LangChain native; traces, datasets, evals
  • Langfuse — OSS-rooted; self-host or hosted
  • Helicone — proxy-based, simple drop-in
  • Braintrust — eval-first, used by Anthropic / Stripe / Notion
  • Arize Phoenix / AX — eval + drift monitoring
  • Weights & Biases Weave — extends W&B to LLMs
  • Datadog LLM Observability — if already on Datadog

What you actually want

  • Trace every call: prompt, response, tokens, latency, cost, model
  • Replay a request with a different prompt or model — "what if?"
  • Tag with tenant, feature, user-id (deck 04: per-tenant SLOs)
  • Dataset of hard cases to regression-test against

Eval-as-a-service

  • Braintrust — eval scripts as code, regression suites in CI
  • Patronus AI — managed evals + safety tests
  • Vellum — visual prompt-engineering + evals
  • PromptLayer · Humanloop — prompt versioning + evals
  • OpenAI Evals · Anthropic Workbench — provider-native

Three eval types you need

  1. Code-graded — JSON schema, exact match, regex; cheap, deterministic
  2. LLM-judge — another model rates output (Claude / GPT) on rubric; cheap-but-noisy
  3. Human review — sampling for quality drift; expensive but ground-truth
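
Type 1 costs almost nothing to stand up — a minimal code-graded sketch (the dataset and the call_model() under test are hypothetical):

# Code-graded eval: deterministic checks against a small dataset of hard cases.
import json

DATASET = [  # regression set of hard cases (illustrative)
    {"input": "Invoice #992, due 2026-03-01, total $1,200", "expect_total": 1200.0},
]

def grade(output_json: str, case: dict) -> bool:
    try:
        parsed = json.loads(output_json)                    # parse / schema check
    except json.JSONDecodeError:
        return False
    return parsed.get("total") == case["expect_total"]      # exact-match check

# for case in DATASET:
#     assert grade(call_model(case["input"]), case)         # run in CI on every prompt change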

"Vibes-based deploys"

The most common reason ML features regress — no eval suite, no regression test, no comparison run before deploying a prompt change. Build evals before you build features.

12

Guardrails & AI Gateways

A guardrail intercepts requests + responses, applying policy: redact PII, block prompt injection, refuse policy-violating output, log everything. An AI gateway is that-and-more: routing, fallback, caching, key management.

Guardrail products

  • AWS Bedrock Guardrails — content filters, denied topics, PII redaction, hallucination grounding
  • Azure AI Content Safety — moderation, prompt-shield, groundedness
  • Vertex AI safety filters
  • Lakera Guard — prompt-injection focus
  • Protect AI Layer · HiddenLayer — model security
  • NeMo Guardrails (NVIDIA, OSS) — Colang DSL
  • Guardrails AI (OSS) — input/output validation library + cloud

AI gateways

  • Cloudflare AI Gateway — caching, fallback, rate-limit, observability across providers
  • Portkey — multi-provider gateway, retries, semantic cache
  • Helicone — proxy + observability
  • LiteLLM Proxy — OSS multi-provider router
  • Kong AI Gateway · F5 AI Gateway — enterprise

What a gateway buys you

  • One key for many providers — abstract the provider behind a stable API
  • Fallback — Anthropic 5xx → fall through to OpenAI / Bedrock copy
  • Semantic cache — repeat questions answered from cache
  • Cost-aware routing — easy to Haiku, hard to Opus
  • Per-user rate limit + budget — stop runaways
  • Single audit log — compliance loves it
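
The fallback logic is worth understanding even if you buy it — a hand-rolled sketch of the core loop (the provider callables are hypothetical stand-ins for real SDK wrappers):

# Fallback routing — the core of what a gateway automates, plus caching, budgets, audit logs.
class TransientProviderError(Exception):
    """5xx, timeout, or rate-limit from a provider (illustrative)."""

def complete_with_fallback(prompt: str, providers) -> str:
    """providers: ordered list of callables wrapping real SDK calls (hypothetical)."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)               # first healthy provider wins
        except TransientProviderError as e:
            last_error = e                    # fall through to the next provider
    raise last_error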

Where to put it

In front of every LLM call from your app. Don't hard-code provider URLs in services; route everything through the gateway. When a model deprecates or a provider goes down, you don't redeploy services.

The gateway is a critical path

Your AI gateway is now in the hot path of every user request. Self-host it on CaaS (Cloud Run / Container Apps) or pick a vendor with a 99.99% SLA. Don't put it on a single Lambda.

13

MCP Server Hosting & Gateways

Model Context Protocol — Anthropic's open standard (2024) for letting LLMs use external tools, resources, and prompts. The 2025–26 wave of MCP-as-a-service hosts MCP servers and brokers OAuth.

MCP hosting providers

  • Cloudflare Workers + Agents SDK — first-class remote MCP hosting
  • Anthropic Connectors — enterprise-managed MCP connections in Claude
  • Pipedream MCP · Composio MCP · Smithery — multi-tenant MCP marketplaces
  • Vercel MCP — push to Vercel, get a remote MCP endpoint
  • Mintlify · Stainless — auto-generate MCP servers from API specs

MCP gateways

  • Docker MCP Gateway — local + remote, runs MCP servers as containers behind one endpoint
  • Supergateway — SSE-bridge for stdio MCP servers
  • MCP Inspector — Anthropic's debugger

OAuth for MCP

The 2025 MCP authorisation profile (deep dive deck) specifies how an MCP client (LLM agent) authenticates the user against the MCP server's resource — Resource Indicators (RFC 8707), DPoP-bound tokens, audience-scoped consent. Most managed MCP hosts handle this for you.

Confused-deputy in MCP

If your MCP server passes the user's token to a downstream API without checking the audience, an attacker can replay it elsewhere. Mandatory: Resource Indicators on every issued token, audience-check on every consumed token. The MCP profile is explicit about this; many implementations get it wrong.
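
The consuming side of that check is one call with a JWT library — a sketch with PyJWT (key and resource URL illustrative):

# Reject tokens not minted for *this* MCP server — the anti-confused-deputy check.
import jwt  # PyJWT

def verify_mcp_token(token: str, signing_key) -> dict:
    return jwt.decode(
        token,
        signing_key,
        algorithms=["RS256"],
        audience="https://mcp.example.com",  # illustrative resource identifier (RFC 8707)
    )
    # Raises jwt.InvalidAudienceError if the token was issued for some other API —
    # exactly the replay this slide warns about.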

14

LLM-Specific Security

Cloud security (deck 05) plus a layer of new shapes: the model itself is part of the attack surface, prompts are user input, and outputs can trigger code execution.

OWASP LLM Top 10 (2025)

  1. LLM01 Prompt injection
  2. LLM02 Sensitive info disclosure
  3. LLM03 Supply chain (model + deps)
  4. LLM04 Data & model poisoning
  5. LLM05 Improper output handling (XSS, RCE in tools)
  6. LLM06 Excessive agency (agent runs amok)
  7. LLM07 System prompt leakage
  8. LLM08 Vector / embedding weaknesses
  9. LLM09 Misinformation / hallucination
  10. LLM10 Unbounded consumption (cost / DoS)

Concrete controls

  • Treat all model output as untrusted — escape before HTML, parse before exec
  • Tool sandboxes — code interpreter in Firecracker / gVisor
  • Constrained generation — JSON-schema-validated decoding
  • PII redaction at gateway — Macie / Presidio / Lakera
  • No PII in system prompts that get cached or logged
  • Per-tenant rate & budget limits on model calls
  • Egress lockdown — what URLs can the agent's browser tool hit?
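
Two of these controls fit in a few lines of stdlib — escape before rendering, parse and validate before acting (the order_id field is an illustrative schema check):

# Treat model output as untrusted input: escape for HTML, validate structure before use.
import html
import json

def render_answer(model_output: str) -> str:
    return html.escape(model_output)    # <script> from prompt injection renders inert

def parse_tool_args(model_output: str) -> dict:
    args = json.loads(model_output)     # parse, never eval
    if not isinstance(args.get("order_id"), str):   # minimal schema check (illustrative)
        raise ValueError("schema violation")
    return args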

Provider data-handling promises

  • Anthropic API — no training on customer data; 30-day default retention; Zero Data Retention available
  • OpenAI Enterprise / API — no training; 30-day retention; ZDR via API
  • Bedrock / Vertex / Azure OpenAI — no training, customer-managed keys, in-region
  • Read every provider's data-processing addendum (DPA) before relying on this

"Free" models = your data is the price

Some free / consumer-tier offerings train on inputs. For any product handling user data, use the paid API explicitly with the no-training DPA. Check the provider's setting; defaults vary.

15

Compliance for AI Services

EU AI Act (in force 2024–2027)

  • Risk-tiered: prohibited / high-risk / limited / minimal
  • High-risk (HR, education, infrastructure, biometrics): conformity assessment, registration, post-market monitoring
  • GPAI obligations on providers (model cards, energy reporting, copyright policy)
  • Transparency: AI-generated content must be labelled; users must know when they talk to AI
  • Penalties up to 7% of global turnover
  • Phased: prohibited practices (Feb 2025), GPAI (Aug 2025), high-risk (Aug 2026)

Other frameworks

  • ISO/IEC 42001 — AI management system; first ISO standard for AI
  • NIST AI RMF — voluntary US framework, common reference
  • SOC 2 + AI controls — auditors now ask
  • UK AI Safety Institute guidance, voluntary
  • US EOs & states — Colorado AI Act (2026), CA SB 1047 follow-on attempts

Practical checklist for an AI feature

  • Document model + version + provider + region per feature (model card)
  • Disclose AI use in UI & ToS
  • Persist eval results per release
  • Provide opt-out / human-review for sensitive decisions
  • BYOK / data residency option for regulated tenants
  • Tamper-evident audit log of every model call

BYOK and customer data isolation

  • Bedrock — supports customer KMS keys for data; not for the model weights
  • Vertex / Azure OpenAI — equivalent
  • AWS Bedrock Confidential Inference — confidential VMs / GPUs (preview)
  • Anthropic Privatemode — confidential-compute LLM offering for HIPAA / FedRAMP

Don't roll your own model card

Vendors publish model cards (Anthropic, OpenAI, Meta) — re-use them, point to them. Your own card describes your application's use of the model: scope, limits, evaluations, who's accountable.

16

Cost Engineering for LLMaaS

The five biggest levers

  1. Prompt caching — 50–90% off input on cached prefixes
  2. Model routing — Haiku/4o-mini for easy, Opus/o3 for hard; classify cheaply, route smart
  3. Output limits — set max_tokens; constrain to JSON; ask for shorter answers
  4. RAG over context-stuffing — search rather than send the whole corpus
  5. Batch APIs — Anthropic / OpenAI batch are 50% off if you can wait 24h
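
Lever 2 in miniature — triage cheaply, escalate only the hard cases (the classifier here is a keyword stub to keep the sketch self-contained; in production it's a call to the cheap model):

# Cost-aware routing: the frontier model only sees requests the cheap tier can't handle.
CHEAP, FRONTIER = "claude-haiku-4-5", "claude-opus-4-7"  # model ids as this deck names them

def classify(prompt: str) -> str:
    """Stub triage; replace with a cheap-model call in production."""
    return "hard" if len(prompt) > 2_000 or "step by step" in prompt.lower() else "easy"

def pick_model(prompt: str) -> str:
    return FRONTIER if classify(prompt) == "hard" else CHEAP

print(pick_model("Classify this ticket: 'refund not received'"))  # → claude-haiku-4-5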

Per-tenant accounting

  • Tag every model call with tenant_id, feature, user
  • Aggregate to a real-time "cost-per-tenant" metric
  • Surface to the customer (transparency = trust)
  • Charge for it (deck 04: metering)

Common runaway shapes

  • Agent loops without max-iteration ceiling
  • Streaming that's never stopped on disconnect
  • Embedding the entire user corpus on every signup ("just-in-case")
  • Logging full prompt + response into Datadog / Sentry — log ingestion can cost more than the API call
  • Testing in prod with a model you forgot to switch back from o3

Hard limits, by default

  • Per-org spend ceiling (provider has these — set them)
  • Per-user / per-tenant rate limit at gateway
  • Per-request max_tokens + max_iterations for agents
  • Budget alarms at 50% / 80% / 100% of monthly target

"AI costs are infinite"

Only if you let them. The same FinOps practices that work for cloud bills (deck 02) work here — tag, alarm, optimise. The difference is that a 10× cost lever (caching) usually exists.

17

LLMaaS Anti-Patterns

"Hardcode openai.com in every service"

One provider 5xx → 100% of your AI features down. Use an AI gateway from day one.

"One frontier model for everything"

You pay frontier prices on classification tasks a Haiku could do for 1/15 the cost. Model-route by task complexity.

"Trust the model output as data"

If the model returns <script>…</script> and your app renders it raw, you have a stored XSS via prompt injection. Always escape; always validate.

"No evals, just vibes"

Two weeks after launch, "is the new model better?" has no answer. Eval suite first.

"Customer data in free-tier endpoints"

Your free tier might train on user data. For any production app, paid API + DPA + data-residency review.

"Agents with full repo write access"

Excessive agency. Constrain tools to the smallest scope; gate writes behind human approval; sandbox every tool call.

"RAG = vector DB"

Retrieval is hybrid (keyword + vector + reranker). Pure-vector misses on code, names, IDs, exact strings. Add BM25 from day one.

"Fine-tuning before evals"

Without an eval suite, you don't know whether your tune helped or hurt. Build the regression suite first.

18

Summary & Series Close

Three takeaways

  1. LLMaaS is a stack — inference, retrieval, agents, evals, governance. Build each layer on managed services unless you have a real reason to self-host.
  2. Cost-control levers exist (caching, routing, batch). Use them; default behaviour is expensive.
  3. The new security shapes — prompt injection, excessive agency, supply-chain, data leakage — are real and concrete. Treat them with the same rigour as cloud security in deck 05.

Series recap

  • 01 Service Models — the *aaS taxonomy
  • 02 IaaS Foundations — VMs, VPC, storage
  • 03 PaaS / FaaS / CaaS — managed compute
  • 04 SaaS Architecture — multi-tenancy, B2B identity
  • 05 Cloud Security — IAM, secrets, network, compliance
  • 06 LLM-as-a-Service — this deck

Companion decks & hubs

One sentence

"LLM-as-a-Service is the cloud's newest layer — and it inherits every prior layer's discipline: identity, security, observability, cost — plus a few new ones unique to running models you didn't write."