The architectural side of SaaS — the patterns that decide whether your platform scales to ten or ten-thousand tenants without rewrites. Companion to Monetising & Distributing Software (the business side).
SaaS isn't a particular tech stack — it's the discipline of running one application for many customers at once, where each customer (tenant) believes the app exists for them alone, while underneath it's shared.
SaaS still uses the same primitives: VPCs, containers, queues, databases. The art is partitioning them by tenant — sometimes logically, sometimes physically.
Product-Led Growth — frictionless self-serve signup; viral inside companies. Sales-led — pilot, security review, MSA. Most successful SaaS today is hybrid: PLG funnels feed sales for enterprise tier.
Highest density. One running application, one (sharded) database, one deploy pipeline. Every row carries a tenant_id; every query filters on it. Used by Slack at scale, Notion at scale, Linear from day one.
CREATE TABLE projects (
id uuid PRIMARY KEY,
tenant_id uuid NOT NULL REFERENCES tenants(id),
name text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
-- Index every tenant-scoped query on (tenant_id, ...)
CREATE INDEX projects_by_tenant
ON projects (tenant_id, created_at DESC);
Every query must include WHERE tenant_id = $1. Forgetting once = cross-tenant leak.
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON projects
USING (tenant_id = current_setting('app.tenant_id')::uuid);
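-- In your middleware, inside each request's transaction.
-- SET LOCAL resets at commit/rollback, so a pooled connection never carries a tenant over.
SET LOCAL app.tenant_id = '...uuid of current request...';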
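RLS turns a forgotten filter from a guaranteed leak into a near-impossible one: the DB applies the tenant predicate to every query and errors out when no tenant is set. Table owners bypass RLS unless the table has FORCE ROW LEVEL SECURITY, so connect as a non-owner role.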
Shard tenants across multiple database clusters by hash(tenant_id) % N. Citus and AWS Aurora Limitless for Postgres; Vitess and PlanetScale from the MySQL side. Most large SaaS are sharded pool, not single-DB pool.
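A minimal sketch of the hash(tenant_id) % N routing above. The function name `shard_for` and the choice of SHA-256 are assumptions for illustration; any stable hash would do.

```python
import hashlib


def shard_for(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a stable digest: Python's built-in hash() of a str is salted per
    # process, so it could route the same tenant differently after a restart.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo routing moves most tenants when N changes, so a tenant-to-shard lookup table is a common alternative once resharding becomes likely.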
Compromise: shared compute, isolated data. Every tenant gets their own schema (Postgres) or own database in a shared Postgres cluster. The cluster is shared; the data is fully partitioned.
CREATE SCHEMA tenant_acme;
CREATE TABLE tenant_acme.projects (id uuid PRIMARY KEY, ...);
-- Per request, in middleware:
SET search_path = tenant_acme, public;
Same DDL applied to every schema. Migrations replay across thousands of schemas.
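Replaying one migration across every tenant schema might look like the sketch below. The `{schema}` placeholder convention and both function names are hypothetical; the generator only builds the statements, and running them (ideally one transaction per schema) is left to the caller.

```python
from typing import Iterable, Iterator


def quote_ident(name: str) -> str:
    """Quote a Postgres identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def migration_statements(schemas: Iterable[str], ddl_template: str) -> Iterator[str]:
    """Yield one DDL statement per tenant schema.

    ddl_template must contain the placeholder {schema} and no other braces.
    """
    for schema in schemas:
        yield ddl_template.format(schema=quote_ident(schema))
```

Usage: `list(migration_statements(["tenant_acme"], "ALTER TABLE {schema}.projects ADD COLUMN archived boolean"))`.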
pg_dump --schema=…
Most SaaS that start at bridge end up moving to sharded pool (small tenants) + silo (whales): the hybrid model.
Each customer gets their own application instance, their own database, often their own VPC. Highest isolation, highest cost. Used for top-tier enterprise tenants and regulated industries.
A whole new product category — companies whose entire offering is silo-only because of regulation: Privatemode AI, healthcare-specific platforms, defence-tech. Charges 3–10× the pooled equivalent.
Customer brings the AWS / GCP account; you run your software in it via cross-account roles. HashiCorp HCP, Confluent Cloud Networking, Databricks, Snowflake on private deploys all do this.
Your support team can't reproduce a bug without the customer's logs. Your release cadence slows — every silo upgrade is a maintenance window. Your on-call has N customers' alerts, not one. Charge accordingly.
"Tenant isolation" is not a checkbox; it's layered enforcement — at app, DB, network, and operations.
A query missing WHERE tenant_id = ? is a build failure.
set local role tenant_acme
An ORM "convenience" that lets you skip WHERE tenant_id "just for admin queries" and ships to prod via a feature flag. RLS or perish.
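Failing the build on an unscoped query could start as a check like this one. It is deliberately naive text matching, and `TENANT_TABLES` is a hypothetical list; a production lint would parse the SQL or hook into the ORM.

```python
import re

TENANT_TABLES = {"projects"}  # hypothetical list of tenant-scoped tables


def missing_tenant_filter(sql: str) -> bool:
    """Return True when a query touches a tenant table without filtering on tenant_id.

    Naive text matching only: it does not understand joins, subqueries or aliases.
    """
    lowered = sql.lower()
    touches_tenant_table = any(
        re.search(rf"\b{re.escape(table)}\b", lowered) for table in TENANT_TABLES
    )
    has_filter = re.search(r"\bwhere\b.*\btenant_id\s*=", lowered, re.DOTALL) is not None
    return touches_tenant_table and not has_filter
```

Run it over every query string in CI and fail on the first `True`. It complements RLS rather than replacing it.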
In B2C, every user has a Google login. In B2B, every customer brings their own identity provider — Okta, Entra, Google Workspace, OneLogin, JumpCloud — and your app must federate to all of them at once.
| Need | Standard | What it does |
|---|---|---|
| Login | SAML 2.0 or OIDC | Browser SSO from customer's IdP |
| User provisioning | SCIM 2.0 | Customer's IdP creates / updates / deactivates users in your app |
| Just-in-time provisioning | via SAML/OIDC claims | Create user on first login from claims |
| Group sync | SCIM groups | Map IdP groups to your app's roles |
| Audit / SOC 2 | Audit logs | Customer can answer "who did what" |
# User hits app.your-saas.com/login
# We detect their email domain → tenant → IdP
GET /sso?email=alice@acme.com
→ 302 to acme.okta.com/saml/yourapp
# user logs in at Okta
# Okta POSTs SAML assertion back
POST /sso/callback
<SAMLResponse>...</SAMLResponse>
→ we verify signature against Okta cert
→ extract email, name, groups
→ create / update user, set tenant
→ set session cookie
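The first step of that flow, email domain to tenant to IdP, is a lookup. A sketch with a hypothetical in-memory registry (`CONNECTIONS`, `SsoConnection`, and `resolve_sso` are illustrative names, not part of any SDK):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SsoConnection:
    tenant_id: str
    idp_sso_url: str


# Hypothetical registry; in practice this would be a table keyed by verified domain.
CONNECTIONS = {
    "acme.com": SsoConnection("tnt_acme", "https://acme.okta.com/saml/yourapp"),
}


def resolve_sso(email: str) -> Optional[SsoConnection]:
    """Return the SSO connection for an email's domain, or None if the domain is unknown."""
    _, sep, domain = email.rpartition("@")
    if not sep or not domain:
        return None
    return CONNECTIONS.get(domain.lower())
```

Unknown domains return `None`, which the login page can treat as "fall back to password or magic link".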
# Okta admin assigns user to your app
POST /scim/v2/Users HTTP/1.1
Authorization: Bearer <tenant-scoped-token>
{
"userName":"alice@acme.com",
"name":{"givenName":"Alice","familyName":"Brown"},
"emails":[{"value":"alice@acme.com","primary":true}],
"active": true
}
# Later — Alice leaves Acme; Okta deprovisions
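PATCH /scim/v2/Users/<id>
{ "schemas":["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
"Operations":[{"op":"replace","path":"active","value":false}] }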
# → your app immediately invalidates her sessions
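What "immediately invalidates her sessions" means on your side, sketched with an in-memory stand-in. `SessionStore` and its methods are hypothetical; a real store would be Redis or a sessions table.

```python
from typing import Dict, Set


class SessionStore:
    """In-memory stand-in for a session store keyed by user id."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Set[str]] = {}
        self._active: Dict[str, bool] = {}

    def login(self, user_id: str, session_id: str) -> None:
        # Users are treated as active until SCIM says otherwise.
        if not self._active.get(user_id, True):
            raise PermissionError("user is deactivated")
        self._sessions.setdefault(user_id, set()).add(session_id)

    def apply_scim_active(self, user_id: str, active: bool) -> int:
        """Record the SCIM 'active' flag; on deactivation drop every session.

        Returns the number of sessions revoked.
        """
        self._active[user_id] = active
        if active:
            return 0
        return len(self._sessions.pop(user_id, set()))

    def is_valid(self, user_id: str, session_id: str) -> bool:
        return session_id in self._sessions.get(user_id, set())
```

The point of the design is that deactivation both revokes existing sessions and blocks new logins.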
From the moment you sell to companies of more than ~50 people, you'll hear: "no SSO no sale". Build SAML/OIDC and SCIM before your first $50k contract.
Building SAML, OIDC, SCIM, MFA, magic links, password reset, brute-force protection, audit logs is months of work and a permanent maintenance burden. Use a platform.
| Provider | Best fit | SAML/OIDC | SCIM | Pricing notes |
|---|---|---|---|---|
| Auth0 (Okta) | Mature B2C + B2B | ✔ | extra fee | per-MAU; cheap small, expensive at scale |
| WorkOS | B2B SaaS, "SSO as one API" | ✔ (every IdP normalised) | ✔ | flat $125/connection/month, first 1M users free |
| Stytch | Passwordless-first, dev-native | ✔ | ✔ (B2B SDK) | per-MAU; B2B SDK is the standout |
| Clerk | Frontend-led, React-shaped | ✔ (Pro / Enterprise) | ✔ (Enterprise) | per-MAU; great UI components out of the box |
| Frontegg | B2B with built-in admin portal | ✔ | ✔ | flat by tier |
| AWS Cognito | If already deeply on AWS | ✔ (limits) | ✘ | cheap; UX is the catch |
| Microsoft Entra External ID | Customer-facing apps for Azure shops | ✔ | ✔ | per-MAU, generous free tier |
| Keycloak / ZITADEL / Authentik | Self-hosted, EU-resident, OSS | ✔ | varies | free + ops cost; see OAuth for MCP |
The full provider tour, with self-hosted options and the OAuth specification trail, is in the Introduction to OAuth and OAuth for MCP decks.
SaaS billing is a write-mostly time-series workload. Every metered action emits an event; the billing engine aggregates and applies pricing rules.
{
"event_id": "evt_01HRX2...",
"tenant_id": "tnt_acme",
"user_id": "usr_alice",
"metric": "tokens_used",
"value": 1284,
"timestamp": "2026-05-04T10:33:21Z",
"idempotency": "req_01HRX2...",
"metadata": { "model":"gpt-5", "feature":"chat" }
}
Idempotency key prevents double-billing on retry. Stored append-only — never updated or deleted (audit, replay).
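A sketch of append-only ingestion that drops retries by idempotency key. The field names follow the event above; the `EventLog` class itself is hypothetical and keeps everything in memory.

```python
from typing import Dict, List


class EventLog:
    """Append-only usage log that ignores retries carrying a known idempotency key."""

    def __init__(self) -> None:
        self._events: List[dict] = []
        self._seen: Dict[str, int] = {}  # idempotency key -> index in _events

    def append(self, event: dict) -> bool:
        """Store the event; return False if this idempotency key was already recorded."""
        key = event["idempotency"]
        if key in self._seen:
            return False
        self._seen[key] = len(self._events)
        # Shallow copy: later changes to the caller's top-level fields do not alter the log.
        self._events.append(dict(event))
        return True

    def total(self, tenant_id: str, metric: str) -> int:
        """Sum the recorded values for one tenant and metric."""
        return sum(
            e["value"]
            for e in self._events
            if e["tenant_id"] == tenant_id and e["metric"] == metric
        )
```

In a database the same guarantee usually comes from a unique index on the idempotency key.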
plan = "growth"  # tier
rules = [
    {"metric": "seats", "flat": 12.00},  # per-seat
    {"metric": "tokens_used",
     "tiered": [
         {"up_to": 1_000_000, "unit": 0.0},  # included
         {"up_to": 10_000_000, "unit": 0.000004},  # overage
         {"up_to": None, "unit": 0.0000035},  # no upper bound
     ]},
]
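How the token tiers turn into a charge, assuming graduated pricing (each unit is billed at the rate of the tier it falls in). The rules do not say whether tiers are graduated or volume-based, so that reading is an assumption; `graduated_charge` and `TOKEN_TIERS` are illustrative names.

```python
from decimal import Decimal
from typing import List, Optional, Tuple

# (upper bound of the tier in units, or None for unbounded; price per unit)
Tier = Tuple[Optional[int], Decimal]

TOKEN_TIERS: List[Tier] = [
    (1_000_000, Decimal("0")),
    (10_000_000, Decimal("0.000004")),
    (None, Decimal("0.0000035")),
]


def graduated_charge(usage: int, tiers: List[Tier]) -> Decimal:
    """Price usage tier by tier; Decimal avoids float rounding on money."""
    charge = Decimal("0")
    lower = 0
    for up_to, unit_price in tiers:
        if usage <= lower:
            break
        upper = usage if up_to is None else min(usage, up_to)
        charge += (upper - lower) * unit_price
        if up_to is None:
            break
        lower = up_to
    return charge
```

For 12,000,000 tokens this charges 0 for the first million, 9M × 0.000004 = 36 for the next nine, and 2M × 0.0000035 = 7 for the rest, 43 in total.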
| Option | What it gives you |
|---|---|
| Stripe Billing + Meters | Subscriptions, prorations, invoicing; usage meters since 2024 |
| Lago (OSS) | Self-hosted metering + pricing engine; SQL-friendly |
| Orb | Spec-the-pricing-as-code, reconciliation-first |
| Metronome | Used by OpenAI, Anthropic, Anysphere — usage at AIaaS scale |
| m3ter | UK-based, enterprise-billing depth |
Did our metering stream record everything Stripe billed for? For SOC 1 / SOX-aligned customers, you need a daily reconciliation report and a way to credit-note discrepancies. Lago / Orb / Metronome do this; rolling your own usually skips it and breaks at audit.
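The core of a daily reconciliation report is a per-tenant diff between what was metered and what was billed. A minimal sketch; `reconcile` and its two input dicts (tenant id to units) are hypothetical shapes.

```python
from typing import Dict


def reconcile(metered: Dict[str, int], billed: Dict[str, int]) -> Dict[str, int]:
    """Return billed-minus-metered per tenant, listing only tenants that disagree.

    Positive: billed for more than was metered (credit-note candidate).
    Negative: usage was metered but not billed.
    """
    diffs: Dict[str, int] = {}
    for tenant_id in sorted(set(metered) | set(billed)):
        delta = billed.get(tenant_id, 0) - metered.get(tenant_id, 0)
        if delta != 0:
            diffs[tenant_id] = delta
    return diffs
```

An empty result is the report you want to see every day.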
"Our p99 is 250 ms" is meaningless to a SaaS — what matters is "Customer Acme's p99 is 250 ms". Build observability that sees tenants as first-class.
// OpenTelemetry, every span:
span.setAttribute("tenant.id", req.tenant.id);
span.setAttribute("tenant.tier", req.tenant.tier);
span.setAttribute("tenant.region", req.tenant.region);
// Then in Honeycomb / Tempo / Datadog:
//   GROUP BY tenant.id → per-tenant dashboards
| Tier | p99 latency | Availability |
|---|---|---|
| Free | 1.5 s | 99.5% |
| Growth | 500 ms | 99.9% |
| Enterprise | 250 ms | 99.95% |
| Enterprise+ (silo) | 250 ms | 99.99% |
Each tier monitored, alerted, and reported to that tier's customers (status page).
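Checking each tenant's own p99 against its tier target could be sketched as follows. The targets come from the table; the nearest-rank percentile method and the function names are assumptions.

```python
from typing import Dict, List

# p99 targets in milliseconds, taken from the tier table.
P99_TARGET_MS = {"free": 1500, "growth": 500, "enterprise": 250}


def p99(samples_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample list."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def slo_breaches(latencies: Dict[str, List[float]], tiers: Dict[str, str]) -> List[str]:
    """Return tenant ids whose own p99 exceeds the target of their tier."""
    return sorted(
        tenant
        for tenant, samples in latencies.items()
        if p99(samples) > P99_TARGET_MS[tiers[tenant]]
    )
```

A fleet-wide p99 can look healthy while one enterprise tenant is in breach, which is why the check runs per tenant.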
X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After
Noisy neighbours are the pool architecture's biggest existential risk. Mitigate with per-tenant rate limits, per-tenant query budgets, and circuit breakers on long-running ops.
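A per-tenant limiter that emits those three headers, sketched as a fixed-window counter. The algorithm choice and the `TenantRateLimiter` name are assumptions (token bucket and sliding window are equally common); state is in memory and the clock is passed in so the logic stays testable.

```python
from typing import Dict, Tuple


class TenantRateLimiter:
    """Fixed-window request counter kept separately for each tenant."""

    def __init__(self, limit: int, window_s: int) -> None:
        self.limit = limit
        self.window_s = window_s
        self._windows: Dict[str, Tuple[int, int]] = {}  # tenant -> (window start, count)

    def check(self, tenant_id: str, now_s: int) -> Tuple[bool, Dict[str, str]]:
        """Return (allowed, response headers) for one request at time now_s."""
        start = now_s - (now_s % self.window_s)
        prev_start, count = self._windows.get(tenant_id, (start, 0))
        if prev_start != start:
            count = 0  # a new window has begun
        allowed = count < self.limit
        if allowed:
            count += 1
        self._windows[tenant_id] = (start, count)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - count),
        }
        if not allowed:
            # Seconds until the current window ends.
            headers["Retry-After"] = str(start + self.window_s - now_s)
        return allowed, headers
```

Because counters are keyed by tenant, one tenant exhausting its budget leaves every other tenant's budget untouched.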
European customers want their data in EU regions, US-Federal customers want US-Gov, Asia customers want APAC. The earlier you think about residency, the cheaper it is to add.
At signup, ask: where do you want your data? The answer determines:
"Customer data is in EU" but billing IDs, audit logs, support chats route through the US. Auditors will catch this. Either keep all of it regional, or document and disclose what doesn't.
For B2B customers, audit logs are not a bonus — they are part of the product. SOC 2 and ISO 27001 customers will demand them; without them, you fail the security review.
{
"id":"aud_01HRX...",
"tenant_id":"tnt_acme",
"actor":{"type":"user","id":"usr_alice","ip":"203.0.113.5","ua":"..."},
"action":"document.share",
"resource":{"type":"document","id":"doc_42","name":"Q1 plan"},
"context":{"to":"bob@vendor.com","permission":"comment"},
"session_id":"ses_...",
"request_id":"req_01HRX...",
"ts":"2026-05-04T10:33:21Z"
}
For HIPAA / FedRAMP / SOX, auditors want logs that can't be silently rewritten. Append-only storage + hash-chain (each row hashes the previous) + immutability (S3 Object Lock / Glacier Vault Lock).
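The hash-chain part can be sketched in a few lines: each entry's hash covers its own content plus the previous entry's hash. `GENESIS`, `append_entry`, and `verify` are illustrative names, and the canonical JSON encoding is an assumption.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: List[dict], record: dict) -> None:
    """Append a record whose hash covers both its content and the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})


def verify(chain: List[dict]) -> bool:
    """Recompute every link; an edited, reordered, or mid-chain-deleted entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain alone does not detect truncation of the newest entries, which is one reason to pair it with immutable storage or to publish the latest hash somewhere outside the log.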
Compliance is what distinguishes "SaaS for engineers" from "SaaS for the regulated world". Each framework asks the same things in slightly different language. (Deck 05 covers the cloud-side controls.)
| Framework | Trigger | What customers will ask for | Year-1 cost (rough) |
|---|---|---|---|
| SOC 2 Type II | B2B mid-market, $50k+ contracts | Audit report, security questionnaire | $30–80k (auditor) + $30k tooling (Vanta/Drata/Secureframe) |
| ISO 27001 | European customers, government | Certificate, SoA, ISMS | $25–60k |
| HIPAA (BAA) | Health data — patients, devices, providers | BAA signed, encryption, audit logs, isolation | $10–30k tooling, BAA from cloud provider |
| GDPR / DPF | Any EU personal data | DPA, sub-processor list, data residency option, deletion / export endpoints | Mostly engineering work, no auditor |
| PCI DSS | Storing card numbers | Tokenisation (Stripe handles for most) | Avoid — let Stripe / Adyen handle |
| FedRAMP | US Federal customers | FedRAMP Moderate / High, GovCloud / Azure Gov | $1M+ — only if there's revenue waiting |
| EU AI Act | Selling AI features in EU | Risk classification, transparency, model cards | Mostly process; ramp through 2025–2027 |
Earlier than you think. Once your sales pipeline has a $50k deal, security review will demand it. Vanta + 6 months ≪ blowing the deal.
From signup to GDPR-delete, every tenant moves through a state machine. Build it explicitly; don't let it evolve as a tangle of active booleans.
Time-bound, feature-gated. Convert via Stripe trial-end + 24h grace. Don't auto-charge silently — confirmation email.
Read-only after 7 days; full pause after 30. Dunning emails (Stripe / Lago / Recurly handle this).
GDPR right-to-erasure: 30-day soft-delete (recoverable), then hard delete (overwritten in DB, removed from backups via lifecycle, removed from search indexes). Document the SLA.
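An explicit lifecycle state machine might start as a transition table like this one. The state names and the allowed moves are assumptions pieced together from the stages described here (trial, dunning, read-only, pause, soft and hard delete); adjust them to your own billing rules.

```python
from enum import Enum


class TenantState(Enum):
    TRIAL = "trial"
    ACTIVE = "active"
    PAST_DUE = "past_due"
    READ_ONLY = "read_only"
    PAUSED = "paused"
    SOFT_DELETED = "soft_deleted"
    HARD_DELETED = "hard_deleted"


S = TenantState

# Allowed transitions; anything not listed is rejected.
TRANSITIONS = {
    S.TRIAL: {S.ACTIVE, S.SOFT_DELETED},
    S.ACTIVE: {S.PAST_DUE, S.SOFT_DELETED},
    S.PAST_DUE: {S.ACTIVE, S.READ_ONLY, S.SOFT_DELETED},
    S.READ_ONLY: {S.ACTIVE, S.PAUSED, S.SOFT_DELETED},
    S.PAUSED: {S.ACTIVE, S.SOFT_DELETED},
    S.SOFT_DELETED: {S.ACTIVE, S.HARD_DELETED},  # recoverable during the soft-delete window
    S.HARD_DELETED: set(),  # terminal
}


def transition(current: TenantState, target: TenantState) -> TenantState:
    """Return the new state, or raise if the move is not in the transition table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the table in one place means a question like "can a paused tenant be hard-deleted directly?" has exactly one answer, and it lives in code.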
Past a certain size, "one big pool" stops scaling. The pattern: divide tenants into cells, each a fully independent slice of the platform. Used by Slack, Stripe, AWS itself.
Don't build cells on day one. Build pool, with a tenant_id everywhere and a routing-aware data layer. When the database is bumping limits, slice by cell. The routing abstraction is the first thing to build right.
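The routing abstraction can be very small on day one: a directory that answers "which cell owns this tenant?" and returns the single pool for everyone. `CellRouter` is a hypothetical name and the in-memory dict stands in for a replicated lookup table.

```python
from typing import Dict


class CellRouter:
    """Explicit tenant-to-cell directory; one tenant can move without touching the others."""

    def __init__(self, default_cell: str) -> None:
        self._default = default_cell
        self._assignments: Dict[str, str] = {}

    def assign(self, tenant_id: str, cell: str) -> None:
        """Pin a tenant to a cell, for example after migrating its data."""
        self._assignments[tenant_id] = cell

    def cell_for(self, tenant_id: str) -> str:
        """Return the tenant's cell, falling back to the default pool."""
        return self._assignments.get(tenant_id, self._default)
```

If every data access already goes through `cell_for`, slicing into cells later is a data migration rather than a rewrite.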
If tenants need to talk to other tenants (Slack Connect, Notion guests), the routing-via-tenant model breaks. Solve at the application layer with explicit cross-cell contracts.
The first cross-tenant leak is a career event. Lint, RLS, code review every query. No exceptions for "internal-only" admin pages.
SAML is XML, signed XML, with edge cases dating to 2005. Many widely used SAML libraries have shipped an XML signature wrapping CVE at some point. Use WorkOS/Auth0; don't roll it.
You won't. Metering plumbing should be live before billing rules; events flowing six months early is fine. Wiring it after launch into a working app means missing a quarter of usage.
Performance footgun (write amp), eventually deletion-resistance footgun. Ship to a separate store from day one.
Customer A's CSV export takes down customer B's logins. Per-tenant rate limits on every expensive operation.
You can. You won't ship product. Vanta / Drata / Secureframe — pick one, $30k/yr, 70% time saved.
Multi-region is an architecture, not a feature. Latency, replication lag, partition tolerance, conflict resolution — all surface immediately. Plan for it from the data model upward.
The data is still in the DB, the backups, the analytics warehouse, the search index, the audit log, the LLM context cache… Build delete as a fan-out job that touches every stage.
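A sketch of delete as a fan-out job: one handler per store, every handler attempted, and a per-store result kept for the record. `fan_out_delete` and the store names in the usage line are hypothetical.

```python
from typing import Callable, Dict


def fan_out_delete(tenant_id: str, stores: Dict[str, Callable[[str], None]]) -> Dict[str, str]:
    """Run every store's delete handler and return a per-store status.

    A failure in one store is recorded and does not stop the others.
    """
    results: Dict[str, str] = {}
    for name, delete in stores.items():
        try:
            delete(tenant_id)
            results[name] = "deleted"
        except Exception as exc:  # keep going so one outage cannot hide the rest
            results[name] = f"failed: {exc}"
    return results
```

Usage: `fan_out_delete("tnt_acme", {"primary_db": drop_rows, "search_index": purge_index})`, then retry any store whose status starts with `failed` until the whole map reads `deleted`.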
"SaaS is the discipline of running one product for many strangers — every architectural choice you make ought to make that easier, not harder, six years from now."