Block (EBS / PD / Managed Disks) — the disk metaphor
Object (S3 / GCS / Blob) — the bucket metaphor
File (EFS / Filestore / Azure Files) — when you really want NFS
Storage class lifecycle & tiering
Operability
Region & AZ design — what fails, when
IAM, STS, instance profiles
Provisioning — Terraform / Pulumi / CDK
Cost watch-points & anti-patterns
02
The VM — IaaS's One Primitive
An IaaS VM is a virtualised x86_64 / ARM64 server with: a chosen shape (vCPU + RAM + ephemeral disk), a network interface in a VPC subnet, a boot disk, an IAM identity, and a region + AZ. Everything else in IaaS exists to network, store, or govern this primitive.
Azure Cobalt 100 — Microsoft's first ARM silicon (2024 GA)
Build pipelines: linux/arm64 images now table-stakes; Docker buildx / GitHub Actions cover it cleanly
Reading the shape
# AWS
m7i.2xlarge → m7 generation, Intel (i), 2xlarge = 2× the base size (8 vCPU, 32 GB)
c7g.large → c7 generation, Graviton (g), large (2 vCPU, 4 GB)
# GCP
n2-standard-4 → n2 family, standard memory ratio, 4 vCPU (16 GB)
c4a-highcpu-8 → c4a family, Axion (a), high-CPU ratio, 8 vCPU (16 GB)
# Azure
Standard_D4s_v5 → D family, 4 vCPU (16 GB), premium-storage capable (s), v5 generation
03
Spot & Preemptible — Pay 60–90% Less
Spot instances are surplus capacity, sold at a steep discount with one catch: the cloud can reclaim them with minutes (or seconds) of notice. Used right, they slash costs; used wrong, they take down production.
Three flavours
AWS: Spot · 2 min eviction notice · 50–90% discount
GCP: Spot VM (was Preemptible) · 30 s eviction notice · ~70% discount
Azure: Spot · 30 s eviction notice · ~80% discount
When spot is right
Stateless web tier behind a load balancer; nodes come and go
Batch jobs that checkpoint (ML training, video transcoding, ETL)
CI runners, dev environments, ephemeral test fleets
Eviction handlers — drain endpoint, persist state in 30s, rejoin elsewhere
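Where the workload fits, the request is a few lines of Terraform. A minimal sketch (the AMI ID, max price, and names are placeholder assumptions, not recommendations):

# Hypothetical stateless worker bought on the spot market
resource "aws_instance" "spot_worker" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "c7g.large"

  instance_market_options {
    market_type = "spot"
    spot_options {
      max_price                      = "0.04"        # ceiling on the hourly price
      instance_interruption_behavior = "terminate"   # reclaim = terminate, not stop
    }
  }

  tags = { Role = "stateless-worker" }
}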
Bare metal & dedicated hosts
At the other end of the spectrum: EC2 i4i.metal, GCP sole-tenant nodes, Azure Dedicated Hosts. No hypervisor overhead, BYOL eligibility, full physical isolation. Used for HPC, BYO virtualisation, and licensing-bound workloads (Oracle, core-licensed Windows Server).
04
The VPC — Your Virtual Data Centre
A VPC (Virtual Private Cloud) is a software-defined network you own inside the provider's substrate. CIDR range, subnets, routes, security — yours; the underlying physical fabric — theirs.
Three-tier subnet pattern
Public — load balancers + NAT only. Private — app tier; outbound via NAT, inbound only from the ALB. Data — DBs + caches; no internet path in either direction.
Two AZs minimum
Every tier replicated across at least two AZs. An AZ failure (an entire data centre going dark) should leave you running, not down. Cross-AZ traffic is billed — but cheaper than downtime.
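A minimal Terraform sketch of the public tier of that layout, two AZs and one subnet each; the private and data tiers repeat the shape without public IPs (CIDRs and AZ names are illustrative assumptions):

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Public tier: one subnet per AZ, public IPs on launch
resource "aws_subnet" "public" {
  for_each = { "us-east-1a" = "10.0.0.0/24", "us-east-1b" = "10.0.1.0/24" }  # placeholders

  vpc_id                  = aws_vpc.main.id
  availability_zone       = each.key
  cidr_block              = each.value
  map_public_ip_on_launch = true   # only the public tier gets public IPs
}

# Private tier: same shape, no public IPs (the data tier is identical, minus any NAT route)
resource "aws_subnet" "private" {
  for_each = { "us-east-1a" = "10.0.10.0/24", "us-east-1b" = "10.0.11.0/24" }

  vpc_id            = aws_vpc.main.id
  availability_zone = each.key
  cidr_block        = each.value
}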
05
Routing — IGW · NAT · Endpoints
Routes inside a VPC are explicit: each subnet has a route table, and every destination CIDR needs a target. Beyond the implicit local route for the VPC's own CIDR, there are no defaults: if you didn't write a route for 0.0.0.0/0, the subnet has no path to the internet.
Three ways out
Internet Gateway (IGW) — bidirectional. Subnet that routes 0.0.0.0/0 → IGW + has a public IP = "public subnet".
NAT Gateway — outbound-only. Private instances reach the internet through it; the internet can't reach them.
VPC Endpoints / PrivateLink — talk to AWS services (S3, DynamoDB, Bedrock, …) over the AWS backbone, without leaving the VPC.
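A sketch of how the first two exits appear as routes, building on the subnet sketch above (Terraform; names are placeholders and the subnet-to-route-table associations are omitted for brevity):

# Public subnets: default route to the internet gateway (bidirectional)
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

# Private subnets: default route to a NAT gateway (outbound only)
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public["us-east-1a"].id   # NAT lives in a public subnet
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}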
NAT gateway — the fee meter
~$0.045/hour per NAT GW plus $0.045/GB processed (us-east-1)
One per AZ for HA → ~$32/month idle × AZs
Heavy data fetch (Docker pulls, package installs) routes through NAT — surprise bills
Mitigation: VPC endpoints for ECR, S3; pull-through caches; amazon-vpc-cni for Kubernetes pods to use ENIs directly
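The S3 part of that mitigation is a single resource: a gateway endpoint so object traffic bypasses the NAT meter entirely (sketch; the region string and route-table reference reuse the examples above):

# Gateway endpoint: S3 traffic rides the AWS backbone, not the NAT gateway
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"   # match your region
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}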
Cross-VPC connectivity
VPC Peering — point-to-point, non-transitive, 1:1 between VPCs
Transit Gateway — hub-and-spoke router; many VPCs, one route table
PrivateLink — expose one service from your VPC to others, without exposing the VPC
Cloud WAN (AWS) / NCC (GCP) — global SD-WAN-as-a-Service
On-prem ↔ cloud
Site-to-site VPN — IPsec over the public internet; cheapest, ~1 Gbps each
Direct Connect / Cloud Interconnect / ExpressRoute — private fibre into the cloud's edge; up to 100 Gbps; the only sensible choice for hybrid at scale
SD-WAN partners — Megaport, Equinix Fabric — multi-cloud across one fibre
06
Security Groups vs NACLs
One stateful mechanism (security groups) and one stateless one (network ACLs). Confuse them and you ship vulnerabilities or break production at 3 am.
Security Groups (the daily-driver)
Attached to network interfaces (instances, ENIs, RDS, Lambda VPC ENIs)
Stateful — return traffic for an established flow is automatically allowed
Allow rules only — there is no deny
Default rule: deny all inbound, allow all outbound
Source can be a CIDR or another SG — chaining ("web SG can talk to db SG") is the right pattern
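The chaining pattern sketched in Terraform (tier names and the Postgres port are assumptions):

# App-tier SG: referenced by the db SG below, no CIDRs involved
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = aws_vpc.main.id
}

# DB-tier SG: Postgres allowed only from instances carrying the app SG
resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]   # source is an SG, not a CIDR
  }
}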
Network ACLs (the perimeter)
Attached to subnets
Stateless — must explicitly allow return traffic, both directions
Allow and deny rules; numbered, evaluated in order
Default NACL: allow all (deliberately permissive); custom NACLs start deny-all
Used as a coarse "blast radius" backstop — rare in day-to-day, vital for compliance perimeters
Defence in depth
NACL: "no traffic from outside the corporate IP space"
SG: "ALB ↔ app, app ↔ db, nothing else"
Host firewall: distro defaults (often nothing — that's fine when SGs are tight)
Application: TLS, mTLS, app-layer auth
The classic SG anti-pattern
0.0.0.0/0 on port 22 "for now". That gets scanned and brute-forced inside an hour. Use SSM Session Manager / IAP tunnels — see deck 05.
07
Storage — Block / Object / File
Cloud storage is three almost-unrelated services trading off durability, latency, throughput, and price. Use the wrong one and you pay 100× — in either direction.
Block: EBS (gp3, io2, st1) on AWS · Persistent Disk (pd-ssd, pd-balanced) on GCP · Managed Disks (Premium / Ultra) on Azure
< 1 ms latency · up to 4 GB/s per volume (io2 Block Express) · priced per GB-month + IOPS + throughput · best for VM root and DB volumes
Object: S3 (Standard / IA / Glacier) on AWS · GCS (Standard / Nearline / Archive) on GCP · Blob (Hot / Cool / Archive) on Azure
10–100 ms TTFB · massively parallel throughput · priced per GB-month + GET/PUT + egress · best for backups, media, data lakes
File: EFS (NFSv4) / FSx (Lustre, ONTAP, OpenZFS) on AWS · Filestore (NFS), Cloud Storage FUSE on GCP · Azure Files (SMB / NFS) on Azure
1–5 ms latency · up to tens of GB/s (FSx for Lustre) · priced per GB-month + IOPS · best for shared home dirs, lift-and-shift NFS
Block — disk-shaped
One VM mounts one volume; it shows up as /dev/nvme1n1. Snapshots land in object storage. A volume attaches to a single VM at a time (multi-attach exists, but only with a cluster-aware filesystem).
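A sketch of that one-to-one relationship in Terraform (size, AZ, device name, and the instance reference are placeholders):

resource "aws_ebs_volume" "data" {
  availability_zone = "us-east-1a"   # must match the instance's AZ
  size              = 100            # GiB
  type              = "gp3"
}

resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"               # appears in the guest as an NVMe device
  volume_id   = aws_ebs_volume.data.id
  instance_id = aws_instance.app.id      # hypothetical instance; exactly one at a time
}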
Object — bucket-shaped
HTTP-addressed key/value over flat namespace. 11-nines durability. The right answer for >90% of "files" you'd otherwise put in a filesystem.
File — NFS-shaped
POSIX shared filesystem. Mostly used for lift-and-shift; almost always more expensive than redesigning around object storage.
08
Storage Class Lifecycle — Tiering Down
Object storage has a price/access trade-off baked in: hot tiers cost more per GB but are cheap to read; cold tiers are nearly free at rest but expensive (and slow) to retrieve. Lifecycle rules automate the descent.
S3 storage classes (per GB-month, us-east-1, ~)
Standard: storage $0.023 · GET $0.0004 / 1k · no minimum duration
Standard-IA: storage $0.0125 · GET $0.001 / 1k · 30 d minimum
Intelligent-Tiering: storage auto · GET auto · no minimum duration
Glacier Instant Retrieval: storage $0.004 · GET $0.01 / 1k · 90 d minimum
Glacier Flexible: storage $0.0036 · plus retrieval fees · 90 d minimum
Glacier Deep Archive: storage $0.00099 · plus retrieval fees · 180 d minimum
A typical rule
# Standard → IA after 30d → Glacier after 90d → expire after 7 years
<LifecycleConfiguration>
  <Rule>
    <Filter><Prefix>logs/</Prefix></Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>2555</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
Cloudflare R2 — the disruptor
S3-compatible API, $0.015/GB-month, zero egress fees
No regional choice — Cloudflare picks the closest of its 300+ locations
"Data egress is the price of cloud lock-in" was their explicit pitch
Forced AWS to introduce free egress on full account exits (early 2024)
Cost gotchas
Small-object tax — every object transitioned to Glacier pays a per-object transition request plus roughly 40 KB of metadata overhead; for 1 KB objects the "saving" is negative
Replication doubles cost — CRR / SRR / Object Replication; useful for compliance, not cost
Incomplete multipart uploads — abandoned parts stick around (and bill) indefinitely; add an AbortIncompleteMultipartUpload lifecycle rule
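That cleanup rule sketched in Terraform (the bucket reference and the 7-day window are assumptions):

resource "aws_s3_bucket_lifecycle_configuration" "cleanup" {
  bucket = aws_s3_bucket.logs.id   # hypothetical bucket

  rule {
    id     = "abort-stale-multipart-uploads"
    status = "Enabled"
    filter {}   # applies to the whole bucket

    abort_incomplete_multipart_upload {
      days_after_initiation = 7   # reap abandoned parts after a week
    }
  }
}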
09
Regions & Availability Zones
A region is a geographic cluster of availability zones (AZs); each AZ is one or more physically separate data centres on independent power and network. Designing for failure means designing across AZs first, regions second.
Choose your region for
Latency to users — pick the nearest one with the services you need
Service availability — not every region has every service or SKU
Cost — us-east-1 is the cheapest baseline; eu-west-1 ~5–10% more; ap-east-1 ~30% more
Multi-region patterns
Active-passive — primary region serves; DR site sits warm. RPO minutes, RTO hours.
Active-active read — global reads close to user, writes go to primary (Aurora Global, Spanner)
Truly global — Spanner, DynamoDB Global Tables, Cosmos DB; conflict-free or last-writer-wins
Service-by-service — ML batch in cheap region, latency-sensitive front-end at the edge
10
Cloud Identity Primitives
Every IaaS resource has an identity, every API call is authenticated and authorised, every audit log records who did what. This is the cloud's most powerful feature — and the hardest to use right. (Deep dive in deck 05.)
Things that have identity
Humans — federated via SSO (rarely as IAM users)
Workloads — instance roles, service accounts, K8s service accounts
Policies scope what those identities may do: allow only POST/GET, only on one prefix, only over TLS, only for a tagged principal. Conditions are where the real expressiveness lives.
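A sketch of such a conditioned policy in Terraform (bucket, prefix, and tag are hypothetical):

resource "aws_iam_policy" "uploader" {
  name = "reports-uploader"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:PutObject", "s3:GetObject"]         # only these two calls
      Resource = "arn:aws:s3:::example-bucket/reports/*"   # only this prefix
      Condition = {
        Bool         = { "aws:SecureTransport" = "true" }          # only over TLS
        StringEquals = { "aws:PrincipalTag/team" = "reporting" }   # only tagged principals
      }
    }]
  })
}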
No long-lived keys, ever
Workload Identity (GCP) / IRSA (EKS) / Pod Identity (EKS, 2023) / Managed Identity (Azure) / OIDC federation (GitHub Actions, GitLab) all give you short-lived STS credentials, no static keys to leak.
11
Observability Primitives
You don't have a server room to walk into; the only window onto your fleet is what the platform exposes plus what you instrument. Plan observability first, not last.
Metrics
CloudWatch / Cloud Monitoring / Azure Monitor — built-in, basic but expensive at scale
Prometheus / Mimir / Cortex — OSS standard for high-cardinality metrics
Datadog / New Relic / Grafana Cloud — managed, pricey but turn-key
Logs
CloudWatch Logs / Cloud Logging / Log Analytics
Loki / OpenSearch — OSS aggregators
Datadog Logs / Splunk — eats your budget if you ship every line; sample & index selectively
Traces
OpenTelemetry — the cross-vendor standard; instrument once
X-Ray / Cloud Trace / App Insights — provider-native, often weaker than OTel-based tools
Tempo / Jaeger / Zipkin — OSS
The four golden signals
Latency · traffic · errors · saturation. Per-service. Per-tenant for SaaS (deck 04). On the same dashboard as the cloud's platform metrics (CPU credits, NAT GW throughput, NAT errors, ALB 5xx) so you can correlate.
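One way to get a platform signal onto the same alerting path, sketched in Terraform as a CloudWatch alarm on ALB 5xx (the load-balancer reference, threshold, and SNS topic are assumptions):

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-elevated-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_ELB_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 50                        # >50 errors/min, five minutes running
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix         # hypothetical ALB
  }

  alarm_actions = [aws_sns_topic.oncall.arn]      # hypothetical SNS topic
}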
The cloud-bill-as-observability anti-pattern
You'll hear "we know there's a problem because the bill spiked". That's too late. Set rate-of-change alerts on your bill (Cost Anomaly Detection, GCP Budget alerts) and normal SRE alerts on golden signals.
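As a floor, a forecast-based budget alert is one Terraform resource (the amount and email are placeholders; anomaly detection layers the rate-of-change logic on top):

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cloud-spend"
  budget_type  = "COST"
  limit_amount = "10000"       # USD, placeholder
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80                 # alert at 80% of the budget
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"       # fire on projected overrun, not after it
    subscriber_email_addresses = ["oncall@example.com"]
  }
}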
12
Provisioning — Infrastructure as Code
Click-ops your way to one VPC and you've spent an afternoon. Click-ops your way to fifty across three accounts and you've made a career-limiting error. IaC is non-negotiable past the toy stage.
The four serious tools
Terraform / OpenTofu — declarative HCL, multi-cloud, the de-facto standard. OpenTofu is the post-license-change OSS fork.
Pulumi — same model, real programming languages (TS, Python, Go, .NET)
AWS CDK / CDKTF — TypeScript / Python that compiles to CloudFormation or Terraform
CloudFormation / ARM / Bicep / Deployment Manager — vendor-native
State remote & locked — S3 + DynamoDB, GCS, or Azure Storage; locking stops two engineers stomping on each other's applies
Plan in CI, apply behind PR review — Atlantis, Terraform Cloud, Spacelift, Env0
No drift — alert when reality ≠ state. Detect with terraform plan -detailed-exitcode
Modules — one VPC module, one EKS module — not 12 copies
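The remote-state-and-locking item, sketched as a backend block (bucket, key, and table names are placeholders):

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"        # versioned, encrypted bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"             # lock table: one writer at a time
    encrypt        = true
  }
}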
GitOps for IaC
The path forward: PR opens a plan; merge to main triggers an apply via a CI runner with OIDC-federated cloud creds. No long-lived keys on developer laptops, no manual terraform apply in production. CI/CD deck covers the build side.
Don't import the world
If something was clicked first, write the resource block, then terraform import. Don't try to model existing chaos perfectly — most "drift" is unmanaged config you should probably leave outside IaC anyway.
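That flow for a clicked-up bucket, sketched (names are hypothetical); Terraform 1.5+ can also express the adoption declaratively:

# 1. Write the resource block to match what already exists
resource "aws_s3_bucket" "legacy_assets" {
  bucket = "acme-legacy-assets"
}

# 2. Adopt it into state without changing it:
#    terraform import aws_s3_bucket.legacy_assets acme-legacy-assets
# ...or, with Terraform >= 1.5, as code:
import {
  to = aws_s3_bucket.legacy_assets
  id = "acme-legacy-assets"
}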
13
What Costs Money on IaaS
Compute
VM hours — biggest line on most bills
Reserved / Savings Plans / CUDs — 30–72% off for commitment
Spot — 50–90% off, eviction risk
Idle — running VMs at 02:00 with no users
Storage
Block — provisioned size + IOPS + throughput, even if 90% empty
14
When to Stay on IaaS (and When to Climb)
Stay on IaaS for
Stateful workloads with no managed equivalent — legacy ERPs, MPI / HPC clusters, niche databases (self-hosted CockroachDB, Vitess on bare metal)
Compliance demanding "we patch the OS" — some frameworks still want you operating below the hypervisor
Cost ceilings on steady-state large workloads — 3-year RIs / CUDs make IaaS substantially cheaper than PaaS at scale
BYO-licensing — Oracle, Windows Server core licensing, SAP HANA
Climb to CaaS / PaaS / FaaS when…
Your workload is stateless HTTP / gRPC
You don't need OS-level control (the truth: 90% don't)
You'd rather pay $0.000004/req than $0.05/hr for an idle VM
You don't have a platform team and won't have one soon
Climb to SaaS for
Anything generic — email, helpdesk, CRM, BI dashboards, observability, identity
Anything where the differentiation isn't yours — auth, signup flows, e-commerce checkout
Decision rule
Default to PaaS / CaaS. Drop to IaaS only when you've identified exactly which provider abstraction is the problem, and the operational tax on the higher layer is documented and bigger than the IaaS one.
15
IaaS Anti-Patterns
"Pets, not cattle"
Each VM has a name (irene-prod-3), an SSH-tweaked config, and is irreplaceable because no one wrote down what's on it. Use AMIs / images / Packer; treat instances as disposable.
"One VPC per developer"
Hundred-VPC sprawl, 50 forgotten NAT gateways, no shared services. Use one shared dev VPC, prod-isolation only for prod.
"Public IPs everywhere"
SSH on 0.0.0.0/0, RDS in a public subnet "for now", DBs reachable from the internet. Use IAP / Session Manager / bastion + private subnets only.
"Static IAM access keys"
Long-lived AKIA… keys checked into a developer's .aws/credentials, then leaked to GitHub. Use SSO + STS.
"Mutually-exclusive subnets"
Twelve subnets per AZ, each a tier, each only used by one app. Subnets are cheap; reuse them within tiers, scope with security groups.
"All-AZ disaster planning"
A single-AZ MVP that you forgot to widen. The first real outage is the AZ outage.
"Snowflake security groups"
Hand-edited, undocumented SGs that nobody dares change. Let them drift once and you'll never untangle them. Keep SGs in IaC, with meaningful names.
"Backups but no restore drill"
Backups run nightly; nobody has restored. Run a disaster-recovery drill at least quarterly.
16
Summary
Three takeaways
IaaS is the floor — VMs, VPC, storage, IAM. Everything above it is built from this.
The operational tax of IaaS is real. Move up the stack unless you have a specific reason not to.
IaC, identity, and observability are not optional. The cloud is API-only — if you can't trace, audit, and reproduce, you're gambling.
Next in the series
03 PaaS / FaaS / CaaS — the layer you'll actually use
"IaaS is the cloud's most powerful and most punishing layer — full control of the compute primitive, in exchange for taking on every problem above the hypervisor."