Each layer adds management; you pick where to start. Most production AI apps build on layer 2 (model API) and add layer 3 (RAG); enterprise reaches into layer 4.
Azure AI Foundry — OpenAI under MS perimeter + Llama + Mistral + Cohere + Phi
Same models, different compliance perimeter, different egress & latency profile
Frontier vs first-party tradeoff
Anthropic via api.anthropic.com versus via Bedrock: same model, same intelligence. Different: pricing, region, compliance (HIPAA only via Bedrock until recently), authentication (API key vs IAM role), latency to your VPC. For most enterprise workloads, the hyperscaler route is mandatory.
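A minimal sketch of the same Claude call through both routes, assuming the anthropic and boto3 Python SDKs; the model names and IDs are illustrative.

# Same model, two routes: auth and endpoint differ, the request shape barely does.
import anthropic
import boto3

# Route 1: api.anthropic.com, API-key auth
direct = anthropic.Anthropic(api_key="sk-ant-...")            # key from the Anthropic console
resp = direct.messages.create(
    model="claude-sonnet-4-6",                                 # illustrative model name
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise this contract."}],
)

# Route 2: AWS Bedrock, IAM / SigV4 auth, region pinned next to your VPC
bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")
resp = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-6-v1:0",                # illustrative Bedrock model ID
    messages=[{"role": "user", "content": [{"text": "Summarise this contract."}]}],
)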
04
Specialist Inference Providers
Open-weights models (Llama, Mistral, Qwen, DeepSeek) hosted by specialists. Same weights as you could self-host (deck reference: Local LLM Hosting) but billed per token.
High-throughput open-weights hosts
Together AI — broadest Llama / Mixtral / DeepSeek / Qwen catalogue, fine-tuning
Fireworks AI — best-in-class throughput per dollar, FireFunction-tuned models
DeepInfra — cheapest end of the market
Replicate — model-by-model deploy, popular for image/video
Latency-sensitive use cases: real-time agents, voice, code completion
Image / video / audio specialists
FAL — fast image diffusion API; Flux models
Black Forest Labs — Flux family, frontier image
ElevenLabs · Cartesia · Resemble — TTS
Deepgram · AssemblyAI — STT
Runway · Pika · Luma — video
Why specialists matter
Price floor — DeepInfra Llama 70B at ~$0.50/M tokens vs Anthropic Sonnet at ~$3/M
Latency floor — Groq's 700 tok/s vs OpenAI's ~80 tok/s
Compliance floor — open weights you can self-host as a backup
Innovation pace — new models on Together / Fireworks days after release
Caveat — specialist routing
Specialists may share inference fleets; the "same model" can have measurably different output distributions across providers. Pin a provider per workload; benchmark before switching.
05
Pricing — Per-Token Economics
Model | Context | Input $/M tok | Output $/M tok | Notes
Claude Opus 4.7 | 200k | $15 | $75 | 5× discount with 1h prompt-cache hit
Claude Sonnet 4.6 | 200k–1M | $3 | $15 | most popular workhorse
Claude Haiku 4.5 | 200k | $1 | $5 | cheap-and-fast tier
GPT-5 | 400k | $2.50 | $10 | frontier OpenAI
GPT-5-mini | 400k | $0.25 | $2 | cheapest flagship-class
OpenAI o3 | 200k | $60 | $240 | reasoning model; thinks for tokens you pay for
Gemini 2.5 Pro | 1M+ | $1.25 (≤200k) · $2.50 (>200k) | $10–15 | caching; long-context tiered pricing
Gemini 2.5 Flash | 1M | $0.30 | $2.50 | cheap workhorse
Mistral Large 2 | 128k | $2 | $6 | EU-resident
Llama 3.3 70B (Together) | 128k | $0.88 | $0.88 | open-weights pricing
DeepSeek-V3 (DeepInfra) | 64k | $0.27 | $1.10 | provider-hosted in PRC
What dominates the bill
Output tokens are 2–5× the price of input — keep output concise
Long contexts (RAG-stuffing) — every retrieval round hits input pricing
Reasoning models — invisible "thinking" tokens are billed too
Agent loops — every tool-call round is a full prompt + response (worked numbers after this list)
OpenAI prompt caching — automatic, 50% discount on hits
Gemini context caching — pay storage by minute; cheaper for stable system prompts
Caching turns expensive 100k-token system prompts from 100% to 10% cost
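A back-of-envelope sketch in Python of how these dominators interact, using Sonnet-class list prices from the table; the token counts are invented, the shape of the arithmetic is the point.

# Rough cost of a 5-round agent loop at Sonnet-class prices ($3/M in, $15/M out).
IN_PRICE, OUT_PRICE = 3 / 1e6, 15 / 1e6        # $ per token
CACHED_IN_PRICE = 0.30 / 1e6                   # 90%-off cache-read rate on the stable prefix

system_prompt, rag_context, per_turn_out = 20_000, 30_000, 800   # tokens (invented)
rounds = 5                                     # every tool-call round re-sends the whole prompt

uncached = sum(
    (system_prompt + rag_context + i * per_turn_out) * IN_PRICE + per_turn_out * OUT_PRICE
    for i in range(rounds)
)
cached = sum(                                  # ignores the one-time 1.25x cache-write premium
    (system_prompt + rag_context) * CACHED_IN_PRICE
    + i * per_turn_out * IN_PRICE
    + per_turn_out * OUT_PRICE
    for i in range(rounds)
)
print(f"uncached ≈ ${uncached:.2f} per task, cached prefix ≈ ${cached:.2f}")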
06
Latency — TTFT vs Throughput
"How fast is the model?" decomposes into two numbers: TTFT (time to first token — how long the user waits to see anything) and tok/s (steady-state output rate).
Anthropic: api.anthropic.com is US-only; via Bedrock you get eu-central-1, eu-west-2, ap-northeast-1, etc.
Latency to model often dwarfs latency from model — pick the region near your users
Vertex AI lets you pin model region; Azure OpenAI mandates it (and inherits Azure's region list)
07
Prompt Caching — The 10× Cost Lever
Every conversation, agent loop, and RAG response sends the same system prompt + tool schema again and again. Prompt caching makes the provider remember the prefix; you pay for new tokens, not for the cached prefix.
Anthropic example
POST /v1/messages
{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "You are an expert coding assistant…(50KB)…",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [{"role": "user", "content": "…"}]
}
# First call: full input price (e.g. $3/M)
# Cached calls: $0.30/M input — 90% off — for 5 min
# 1h caching: $0.50/M input — 80% off — opt-in
# Cache write: 1.25× of base input price (one-time)
When it pays
Long system prompts repeated across users
RAG context that's stable for minutes (e.g. a doc the user is asking 5 questions about)
Multi-turn agents — entire previous turn history can cache
Tool schemas (often big; rarely change)
Provider variants
OpenAI — automatic for prompts ≥ 1024 tokens; 50% off; no opt-in
Together / Fireworks — provider-side KV cache; no API surface yet
Cache placement matters
Put stable content first: system prompt → tool schemas → RAG docs → user message
Set cache_control on the last stable item — everything before is cached (see the sketch after this list)
Re-send the cached blocks every call; cache survives by being re-referenced
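A minimal sketch of that ordering, assuming Anthropic's cache_control semantics (everything up to and including the marked block is cached); the uppercase names are placeholders for your own content.

# Stable prefix first, cache_control on the last stable block, volatile user input last.
request = {
    "model": "claude-sonnet-4-6",
    "tools": TOOL_SCHEMAS,                                       # big, rarely changes
    "system": [
        {"type": "text", "text": SYSTEM_PROMPT},                 # stable across users
        {"type": "text", "text": RAG_DOCS,                       # stable for this session
         "cache_control": {"type": "ephemeral"}},                # end of the cached prefix
    ],
    "messages": [{"role": "user", "content": USER_QUESTION}],    # never cached
}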
Where caching fails
Personalised system prompts (per-user tone tweaks); shuffled doc order in RAG; templating that re-renders the same content with whitespace drift. Keep the prefix byte-for-byte stable.
LLM-judge — another model rates output (Claude / GPT) against a rubric; cheap but noisy (sketch below)
Human review — sampling for quality drift; expensive but ground-truth
"Vibes-based deploys"
The most common reason ML features regress — no eval suite, no regression test, no comparison run before deploying a prompt change. Build evals before you build features.
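A minimal regression-eval sketch with an LLM judge scoring against a rubric; call_model is a hypothetical wrapper around your SDK, the judge prompt is illustrative, and judge scores are noisy, so keep sampling for human review.

# Run a fixed prompt set through the candidate model; a judge model scores each output 1-5.
EVAL_SET = [
    {"prompt": "Summarise: …", "rubric": "Faithful, ≤3 sentences, no invented facts."},
    # …dozens more, versioned alongside the prompts they protect
]

def judge(output: str, rubric: str) -> int:
    verdict = call_model(                      # hypothetical wrapper around your judge model
        model="claude-haiku-4-5",
        prompt=f"Rubric: {rubric}\n\nOutput:\n{output}\n\nScore 1-5. Reply with the number only.",
    )
    return int(verdict.strip())

def regression_run(candidate_model: str) -> float:
    scores = [
        judge(call_model(model=candidate_model, prompt=case["prompt"]), case["rubric"])
        for case in EVAL_SET
    ]
    return sum(scores) / len(scores)           # compare against the score of the model you're replacing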
12
Guardrails & AI Gateways
A guardrail intercepts requests + responses, applying policy: redact PII, block prompt injection, refuse policy-violating output, log everything. An AI gateway is that-and-more: routing, fallback, caching, key management.
One key for many providers — abstract the provider behind a stable API
Fallback — Anthropic 5xx → fall through to OpenAI / Bedrock copy
Semantic cache — repeat questions answered from cache
Cost-aware routing — easy to Haiku, hard to Opus
Per-user rate limit + budget — stop runaways
Single audit log — compliance loves it
Where to put it
In front of every LLM call from your app. Don't hard-code provider URLs in services; route everything through the gateway. When a model deprecates or a provider goes down, you don't redeploy services.
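A sketch of the fallback behaviour the gateway gives you; the provider clients are hypothetical wrappers, and in production this logic lives in the gateway, not copy-pasted into every service.

# Try providers in order; fall through on failures so one outage doesn't take every AI feature down.
PROVIDERS = [anthropic_client, bedrock_client, openai_client]    # hypothetical per-vendor wrappers

class AllProvidersDown(Exception):
    pass

def complete(prompt: str, max_tokens: int = 512) -> str:
    for client in PROVIDERS:
        try:
            return client.complete(prompt, max_tokens=max_tokens)
        except Exception:                      # in practice: only the vendor's 5xx / timeout errors
            continue                           # fall through to the next provider's copy of the model
    raise AllProvidersDown("every configured provider failed")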
The gateway is a critical path
Your AI gateway is now in the hot path of every user request. Self-host it on CaaS (Cloud Run / Container Apps) or pick a vendor with a 99.99% SLA. Don't put it on a single Lambda.
13
MCP Server Hosting & Gateways
Model Context Protocol — Anthropic's open standard (2024) for letting LLMs use external tools, resources, and prompts. The 2025–26 wave of MCP-as-a-service hosts MCP servers and brokers OAuth.
Vercel MCP — push to Vercel, get a remote MCP endpoint
Mintlify · Stainless — auto-generate MCP servers from API specs
MCP gateways
Docker MCP Gateway — local + remote, runs MCP servers as containers behind one endpoint
Supergateway — SSE-bridge for stdio MCP servers
MCP Inspector — Anthropic's debugger
OAuth for MCP
The 2025 MCP authorisation profile (deep dive deck) specifies how an MCP client (LLM agent) authenticates the user against the MCP server's resource — Resource Indicators (RFC 8707), DPoP-bound tokens, audience-scoped consent. Most managed MCP hosts handle this for you.
If your MCP server passes the user's token to a downstream API without checking the audience, an attacker can replay it elsewhere. Mandatory: Resource Indicators on every issued token, audience-check on every consumed token. The MCP profile is explicit about this; many implementations get it wrong.
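A minimal audience check, assuming PyJWT and a known issuer key; the audience value is illustrative, and a real MCP deployment also verifies DPoP binding, expiry, and issuer.

import jwt  # PyJWT

MCP_SERVER_AUDIENCE = "https://mcp.example.com"    # the resource this server represents (illustrative)

def verify_incoming_token(token: str, issuer_public_key: str) -> dict:
    # Rejects tokens minted for any other resource, which blocks cross-service replay.
    return jwt.decode(
        token,
        issuer_public_key,
        algorithms=["RS256"],
        audience=MCP_SERVER_AUDIENCE,              # raises InvalidAudienceError on mismatch
    )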
14
LLM-Specific Security
Cloud security (deck 05) plus a layer of new shapes: the model itself is part of the attack surface, prompts are untrusted user input, and outputs can end up executed as code.
OWASP LLM Top 10 (2025)
LLM01 Prompt injection
LLM02 Sensitive info disclosure
LLM03 Supply chain (model + deps)
LLM04 Data & model poisoning
LLM05 Improper output handling (XSS, RCE in tools)
LLM06 Excessive agency (agent runs amok)
LLM07 System prompt leakage
LLM08 Vector / embedding weaknesses
LLM09 Misinformation / hallucination
LLM10 Unbounded consumption (cost / DoS)
Concrete controls
Treat all model output as untrusted — escape before HTML, parse before exec (sketch after this list)
Tool sandboxes — code interpreter in Firecracker / gVisor
Read every provider's data-processing addendum (DPA) before relying on its no-training and data-handling terms
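A sketch of the first control: escape before rendering, validate against an allow-list before executing; the tool-call shape is illustrative.

import html
import json

def render_model_output(text: str) -> str:
    # Never hand raw model output to the browser; that is stored XSS via prompt injection.
    return html.escape(text)

ALLOWED_TOOLS = {"search_docs", "create_ticket"}   # illustrative allow-list

def parse_tool_call(raw: str) -> dict:
    # Parse, then validate against the allow-list before anything executes.
    call = json.loads(raw)
    if call.get("name") not in ALLOWED_TOOLS:
        raise ValueError(f"tool {call.get('name')!r} not allowed")
    return call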
"Free" models = your data is the price
Some free / consumer-tier offerings train on inputs. For any product handling user data, use the paid API explicitly with the no-training DPA. Check the provider's setting; defaults vary.
Anthropic Privatemode — confidential-compute LLM offering for HIPAA / FedRAMP
Don't roll-your-own model card
Vendors publish model cards (Anthropic, OpenAI, Meta) — re-use them, point to them. Your own card describes your application's use of the model: scope, limits, evaluations, who's accountable.
16
Cost Engineering for LLMaaS
The five biggest levers
Prompt caching — 50–90% off input on cached prefixes
Model routing — Haiku/4o-mini for easy, Opus/o3 for hard; classify cheaply, route smart (sketch after this list)
Output limits — max_tokens; constrain to JSON; ask for shorter answers
RAG over context-stuffing — search rather than send the whole corpus
Batch APIs — Anthropic / OpenAI batch are 50% off if you can wait 24h
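A sketch of complexity-based routing: a cheap model labels the request, the frontier model only sees the hard ones. call_model is a hypothetical wrapper and the model names are illustrative.

# Classify cheaply, route smart: the classifier call costs a fraction of a cent.
CHEAP, FRONTIER = "claude-haiku-4-5", "claude-opus-4-7"    # illustrative model names

def route(prompt: str) -> str:
    label = call_model(                        # hypothetical thin wrapper over the SDK, returns text
        model=CHEAP,
        prompt=f"Label this request EASY or HARD. Reply with one word.\n\n{prompt}",
        max_tokens=3,
    ).strip().upper()
    return call_model(model=FRONTIER if label == "HARD" else CHEAP, prompt=prompt)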
Per-tenant accounting
Tag every model call with tenant_id, feature, user (sketch after this list)
Aggregate to a real-time "cost-per-tenant" metric
Surface to the customer (transparency = trust)
Charge for it (deck 04: metering)
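A sketch of the tagging side: every call emits a usage record you can aggregate into cost-per-tenant. The price map and emit_metric sink are illustrative.

import time

PRICES = {"claude-sonnet-4-6": (3e-6, 15e-6)}  # $ per input / output token, per model (illustrative)

def record_usage(tenant_id: str, feature: str, user: str, model: str,
                 input_tokens: int, output_tokens: int) -> None:
    in_price, out_price = PRICES[model]
    emit_metric(                               # hypothetical sink: metrics pipeline or warehouse table
        "llm.cost_usd",
        value=input_tokens * in_price + output_tokens * out_price,
        tags={"tenant_id": tenant_id, "feature": feature, "user": user,
              "model": model, "ts": int(time.time())},
    )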
Common runaway shapes
Agent loops without max-iteration ceiling
Streaming that's never stopped on disconnect
Embedding the entire user corpus on every signup ("just-in-case")
Logging full prompt+response into Datadog / Sentry — log volume can cost more than the API call itself
Testing in prod with a model you forgot to switch back from o3
Hard limits, by default
Per-org spend ceiling (provider has these — set them)
Per-user / per-tenant rate limit at gateway
Per-request max_tokens + max_iterations for agents (sketch after this list)
Budget alarms at 50% / 80% / 100% of monthly target
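A sketch of the agent-side ceilings: an iteration cap plus a per-request token budget. call_model, run_tool, and the reply object are hypothetical.

MAX_ITERATIONS = 8
MAX_REQUEST_TOKENS = 200_000                   # combined input + output budget for one agent run

def run_agent(task: str) -> str:
    history, spent = [{"role": "user", "content": task}], 0
    for _ in range(MAX_ITERATIONS):            # hard ceiling on tool-call rounds
        reply = call_model(messages=history, max_tokens=1_000)        # hypothetical wrapper
        spent += reply.input_tokens + reply.output_tokens
        if spent > MAX_REQUEST_TOKENS:
            raise RuntimeError("per-request token budget exceeded")
        if reply.tool_call is None:
            return reply.text                  # model finished without asking for a tool
        history += [reply.as_message(), run_tool(reply.tool_call)]    # hypothetical helpers
    raise RuntimeError("agent hit the max-iteration ceiling")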
"AI costs are infinite"
Only if you let them. The same FinOps practices that work for cloud bills (deck 02) work here — tag, alarm, optimise. The difference is a 10× cost lever (caching) usually exists.
17
LLMaaS Anti-Patterns
"Hardcode openai.com in every service"
One provider 5xx → 100% of your AI features down. Use an AI gateway from day one.
"One frontier model for everything"
You pay frontier prices on classification tasks a Haiku could do for 1/15 the cost. Model-route by task complexity.
"Trust the model output as data"
If the model returns <script>…</script> and your app renders it raw, you have a stored XSS via prompt injection. Always escape; always validate.
"No evals, just vibes"
Two weeks after launch, "is the new model better?" has no answer. Eval suite first.
"Customer data in free-tier endpoints"
Free-tier endpoints may train on your customers' data. For any production app, paid API + DPA + data-residency review.
"Agents with full repo write access"
Excessive agency. Constrain tools to the smallest scope; gate writes behind human approval; sandbox every tool call.
"RAG = vector DB"
Retrieval is hybrid (keyword + vector + reranker). Pure-vector misses on code, names, IDs, exact strings. Add BM25 from day one (merge sketch below).
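A sketch of the merge step in hybrid retrieval, reciprocal rank fusion over a keyword (BM25) result list and a vector result list; bm25_search and vector_search are assumed helpers.

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each list is doc IDs ranked best-first; RRF rewards docs that rank well in either list.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 10) -> list[str]:
    keyword_hits = bm25_search(query, top_k=50)     # assumed keyword / BM25 index
    vector_hits = vector_search(query, top_k=50)    # assumed embedding index
    return reciprocal_rank_fusion([keyword_hits, vector_hits])[:top_k]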
"Fine-tuning before evals"
Without an eval suite, you don't know whether your tune helped or hurt. Build the regression suite first.
18
Summary & Series Close
Three takeaways
LLMaaS is a stack — inference, retrieval, agents, evals, governance. Build each layer on managed services unless you have a real reason to self-host.
Cost-control levers exist (caching, routing, batch). Use them; default behaviour is expensive.
The new security shapes — prompt injection, excessive agency, supply-chain, data leakage — are real and concrete. Treat them with the same rigour as cloud security in deck 05.
"LLM-as-a-Service is the cloud's newest layer — and it inherits every prior layer's discipline: identity, security, observability, cost — plus a few new ones unique to running models you didn't write."