CLOUD SERVICE MODELS · PART 2 OF 6

IaaS
Foundations

Compute · Networking · Storage · Regions · Identity — the floor of cloud
VMs · VPC · Block / Object / File · Region · AZ · IAM
🖥 Compute + 🌐 Network + 💾 Storage + 🔐 Identity = 🧱 a system

What you actually rent at the IaaS layer, the operational tax that comes with it, and when IaaS still wins versus climbing to PaaS or CaaS.

Compute  ·  Network  ·  Storage  ·  Identity  ·  Cost
01

Topics

Compute

  • The VM as a primitive — instance families & shapes
  • Spot / preemptible — the economics of "you can lose it"
  • Bare metal & dedicated hosts — when virtualisation hurts
  • Burstable, GPU, ARM (Graviton, Axion) instances

Networking

  • The VPC — virtual data centre
  • Subnets, routes, NAT, internet gateways
  • Security groups vs NACLs
  • PrivateLink, peering, Transit Gateway, hybrid VPN/Direct Connect

Storage

  • Block (EBS / PD / Managed Disks) — the disk metaphor
  • Object (S3 / GCS / Blob) — the bucket metaphor
  • File (EFS / Filestore / Azure Files) — when you really want NFS
  • Storage class lifecycle & tiering

Operability

  • Region & AZ design — what fails, when
  • IAM, STS, instance profiles
  • Provisioning — Terraform / Pulumi / CDK
  • Cost watch-points & anti-patterns
02

The VM — IaaS's One Primitive

An IaaS VM is a virtualised x86_64 / ARM64 server with: a chosen shape (vCPU + RAM + ephemeral disk), a network interface in a VPC subnet, a boot disk, an IAM identity, and a region + AZ. Everything else in IaaS exists to network, store, or govern this primitive.

The hypervisor underneath

  • AWS Nitro — KVM-based, custom ASICs offload network & storage; bare-metal-fast for guests
  • GCP — KVM with custom Andromeda networking
  • Azure — Hyper-V; Azure Boost ASIC offload for network & storage (2023+)
  • Firecracker — Amazon's micro-VM; powers Lambda & Fargate; Fly.io adopted it

Instance families (AWS naming, others mirror)

  • m — general purpose (m7i, m7g)
  • c — compute-optimised
  • r / x — memory-optimised (r-class, x-class for in-memory DBs)
  • i / d — storage-optimised, local NVMe
  • g / p / inf / trn — GPU / accelerator
  • t — burstable / cheap dev

ARM is no longer optional

  • AWS Graviton (Neoverse N1 → V2): ~20% cheaper, ~40% better perf-per-watt vs x86 equivalents
  • GCP Axion (Neoverse V2): the C4A series — generally ~30% better price-perf; the older T2A ran Ampere Altra, not Axion
  • Azure Cobalt 100 — Microsoft's first ARM silicon (2024 GA)
  • Build pipelines: linux/arm64 images now table-stakes; Docker buildx / GitHub Actions cover it cleanly

Reading the shape

# AWS
m7i.2xlarge  → m7 family, Intel (i), 2× the base size (8 vCPU, 32 GB)
c7g.large    → c7 family, Graviton (g), large (2 vCPU, 4 GB)

# GCP
n2-standard-4 → n2 family, standard mem ratio, 4 vCPU
c4a-highcpu-8 → c4a family, Axion (a), high-CPU memory ratio, 8 vCPU

# Azure
Standard_D4s_v5 → 4 vCPU, premium disk (s), v5 generation
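The decoding above is mechanical enough to script. A minimal sketch for AWS names only (the helper name, the regex, and the size table are my own — real shapes come from the EC2 API, and the vCPU counts below cover just the common sizes):

```python
import re

# Hypothetical helper: decode an AWS instance type like "m7i.2xlarge"
# into family letter, generation, attribute letters, and size.
SIZE_VCPUS = {"large": 2, "xlarge": 4, "2xlarge": 8, "4xlarge": 16}  # large = 2 vCPU baseline

def parse_aws_shape(name: str) -> dict:
    prefix, size = name.split(".")
    m = re.fullmatch(r"([a-z]+)(\d+)([a-z-]*)", prefix)
    family, generation, attrs = m.group(1), int(m.group(2)), m.group(3)
    return {
        "family": family,          # m = general, c = compute, r = memory, ...
        "generation": generation,  # 7 = current-ish
        "attributes": attrs,       # g = Graviton, i = Intel, a = AMD, ...
        "size": size,
        "vcpus": SIZE_VCPUS.get(size),
    }

print(parse_aws_shape("c7g.large"))
# {'family': 'c', 'generation': 7, 'attributes': 'g', 'size': 'large', 'vcpus': 2}
```

GCP and Azure names follow different grammars (dashes, underscores), so each provider needs its own parser.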
03

Spot & Preemptible — Pay 60–90% Less

Spot instances are surplus capacity, sold at a steep discount with one catch: the cloud can reclaim them on minutes' — or, on GCP and Azure, seconds' — notice. Used right, they slash costs; used wrong, they take down production.

Three flavours

Cloud   Name                        Eviction notice   Discount
AWS     Spot                        2 min             50–90%
GCP     Spot VM (was Preemptible)   30 s              ~70%
Azure   Spot                        30 s              ~80%
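The mixed on-demand/spot pattern below boils down to simple arithmetic. A sketch with made-up prices (the $0.40/hr rate and 70% discount are illustrative, not quotes):

```python
# Illustrative maths only — prices here are example numbers, not quotes.
def blended_hourly_cost(nodes: int, on_demand_frac: float,
                        od_price: float, spot_discount: float) -> float:
    """Hourly cost of a fleet with an on-demand baseline plus spot burst."""
    od_nodes = round(nodes * on_demand_frac)
    spot_nodes = nodes - od_nodes
    spot_price = od_price * (1 - spot_discount)
    return od_nodes * od_price + spot_nodes * spot_price

# 20 nodes at a hypothetical $0.40/hr on-demand, 70% spot discount,
# keeping 25% of capacity on-demand as a safety baseline:
cost = blended_hourly_cost(20, 0.25, 0.40, 0.70)
all_od = 20 * 0.40
print(f"${cost:.2f}/hr vs ${all_od:.2f}/hr all on-demand")
# $3.80/hr vs $8.00/hr all on-demand
```

Even a conservative 25% on-demand floor cuts the fleet bill by more than half here — the baseline is what keeps an eviction wave from taking you to zero capacity.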

When spot is right

  • Stateless web tier behind a load balancer; nodes come and go
  • Batch jobs that checkpoint (ML training, video transcoding, ETL)
  • CI runners, dev environments, ephemeral test fleets
  • Kubernetes node pools tagged "tolerates eviction"

When spot will hurt you

  • Stateful primary databases — eviction = downtime + recovery
  • Long single-node training (no checkpointing) — burned hours
  • Whole fleet on one instance shape in one AZ — capacity drains together

Patterns that work

  • Diversified pools — multiple shapes / AZs; orchestrator picks whichever has capacity (AWS Spot Fleet, EKS Karpenter)
  • Mixed on-demand / spot — small on-demand baseline + spot burst
  • Eviction handlers — drain endpoint, persist state in 30s, rejoin elsewhere
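The checkpointing pattern behind the batch-job and eviction-handler bullets can be sketched in a few lines. `checkpoint_path` and `process` are placeholders — in production the checkpoint would live in object storage, not on the instance's own disk:

```python
import json, os, tempfile

# Sketch of the checkpoint pattern: a batch job persists progress so a spot
# eviction only costs the current item, not the whole run.
def run_batch(items, checkpoint_path, process):
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]      # resume where the last run stopped
    for i in range(done, len(items)):
        process(items[i])
        with open(checkpoint_path, "w") as f:
            json.dump({"done": i + 1}, f)    # checkpoint after every item
    return len(items) - done                 # items processed this run
```

An eviction is just the process dying mid-loop; the replacement node calls `run_batch` again and picks up from the last checkpoint.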

Bare metal & dedicated hosts

Other end of the spectrum: EC2 i4i.metal, GCP sole-tenant nodes, Azure dedicated hosts. No hypervisor overhead, BYOL licensing eligibility, full physical isolation. Used for HPC, BYO virtualisation, and licensing-bound workloads (Oracle, Windows Server core-licensed).

04

The VPC — Your Virtual Data Centre

A VPC (Virtual Private Cloud) is a software-defined network you own inside the provider's substrate. CIDR range, subnets, routes, security — yours; the underlying physical fabric — theirs.

[Diagram] VPC 10.0.0.0/16, fronted by an Internet Gateway, spanning AZ-a and AZ-b. Per AZ: a public subnet (10.0.1.0/24 / 10.0.2.0/24) holding ALB + NAT GW; a private subnet (10.0.10.0/24 / 10.0.11.0/24) holding app VMs; a data subnet (10.0.20.0/24 / 10.0.21.0/24, fully private) holding RDS (primary in AZ-a, replica in AZ-b) and ElastiCache.

Three-tier subnet pattern

Public — load balancers + NAT only. Private — app tier; outbound via NAT, inbound only from ALB. Data — DBs + caches; no internet at all, both ways.

Two AZs minimum

Every tier replicated across at least two AZs. An AZ failure (an entire data centre going dark) should leave you running, not down. Cross-AZ traffic is billed — but cheaper than downtime.
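The subnet plan above is just arithmetic on the VPC CIDR. A sketch with the stdlib `ipaddress` module, using the same /24 offsets as the diagram (the offsets themselves are a convention, not a requirement):

```python
import ipaddress

# Carve the 10.0.0.0/16 VPC into /24 subnets: one public, private, and
# data subnet per AZ, with offsets matching the three-tier pattern above.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))    # subnets[n] == 10.0.n.0/24

plan = {}
for i, az in enumerate(["a", "b"]):
    plan[f"public-{az}"]  = subnets[1 + i]    # 10.0.1.0/24, 10.0.2.0/24
    plan[f"private-{az}"] = subnets[10 + i]   # 10.0.10.0/24, 10.0.11.0/24
    plan[f"data-{az}"]    = subnets[20 + i]   # 10.0.20.0/24, 10.0.21.0/24

for name, net in plan.items():
    print(f"{name:10s} {net}")
```

Leaving gaps between the tier ranges (1–2, 10–11, 20–21) is deliberate: it gives you room to add AZ-c later without renumbering.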

05

Routing — IGW · NAT · Endpoints

Routes inside a VPC are explicit: each subnet has a route table, and every destination CIDR has a target. There is no "default" — if you didn't write a route to 0.0.0.0/0, the subnet has no internet.

Three ways out

  • Internet Gateway (IGW) — bidirectional. Subnet that routes 0.0.0.0/0 → IGW + has a public IP = "public subnet".
  • NAT Gateway — outbound-only. Private instances reach the internet through it; the internet can't reach them.
  • VPC Endpoints / PrivateLink — talk to AWS services (S3, DynamoDB, Bedrock, …) over the AWS backbone, without leaving the VPC.

NAT gateway — the fee meter

  • ~$0.045/hour per NAT GW plus $0.045/GB processed (us-east-1)
  • One per AZ for HA → ~$32/month idle × AZs
  • Heavy data fetch (Docker pulls, package installs) routes through NAT — surprise bills
  • Mitigation: VPC endpoints for ECR, S3; pull-through caches; amazon-vpc-cni for Kubernetes pods to use ENIs directly
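The "fee meter" maths is worth making concrete. A back-of-envelope sketch using the us-east-1 rates quoted above (rates vary by region; check current pricing):

```python
# Back-of-envelope NAT Gateway cost using the us-east-1 rates quoted above
# ($0.045/hr + $0.045/GB processed); adjust both for your region.
def nat_monthly_cost(az_count: int, gb_processed: float,
                     hourly=0.045, per_gb=0.045, hours=730) -> float:
    return az_count * hourly * hours + gb_processed * per_gb

# 3 AZs for HA, 2 TB/month of image pulls routed through NAT:
print(f"${nat_monthly_cost(3, 2048):.2f}/month")
# $190.71/month
```

Note the split: ~$99 of that is idle hourly cost before a single byte moves, which is why VPC endpoints for S3/ECR (free or near-free for gateway endpoints) pay for themselves quickly.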

Cross-VPC connectivity

  • VPC Peering — point-to-point, non-transitive, 1:1 between VPCs
  • Transit Gateway — hub-and-spoke router; many VPCs, one route table
  • PrivateLink — expose one service from your VPC to others, without exposing the VPC
  • Cloud WAN (AWS) / NCC (GCP) — global SD-WAN-as-a-Service

On-prem ↔ cloud

  • Site-to-site VPN — IPsec over the public internet; cheapest, ~1 Gbps each
  • Direct Connect / Cloud Interconnect / ExpressRoute — private fibre into the cloud's edge; up to 100 Gbps; the only sensible choice for hybrid at scale
  • SD-WAN partners — Megaport, Equinix Fabric — multi-cloud across one fibre
06

Security Groups vs NACLs

One stateful mechanism and one stateless one. Confuse them and you ship vulnerabilities or break production at 3am.

Security Groups (the daily-driver)

  • Attached to network interfaces (instances, ENIs, RDS, Lambda VPC ENIs)
  • Stateful — return traffic for an established flow is automatically allowed
  • Allow rules only — there is no deny
  • Default rule: deny all inbound, allow all outbound
  • Source can be a CIDR or another SG — chaining ("web SG can talk to db SG") is the right pattern

Network ACLs (the perimeter)

  • Attached to subnets
  • Stateless — must explicitly allow return traffic, both directions
  • Allow and deny rules; numbered, evaluated in order
  • Default: allow all (deliberately permissive)
  • Used as a coarse "blast radius" backstop — rare in day-to-day, vital for compliance perimeters

Defence in depth

  • NACL: "no traffic from outside the corporate IP space"
  • SG: "ALB ↔ app, app ↔ db, nothing else"
  • Host firewall: distro defaults (often nothing — that's fine when SGs are tight)
  • Application: TLS, mTLS, app-layer auth

The classic SG anti-pattern

0.0.0.0/0 on port 22 "for now". That gets scanned and brute-forced inside an hour. Use SSM Session Manager / IAP tunnels — see deck 05.

07

Storage — Block / Object / File

Cloud storage is three almost-unrelated services trading off durability, latency, throughput, and price. Use the wrong one and you can overpay 100× — in storage cost or in access cost.

Block — EBS (gp3, io2, st1) · Persistent Disk (pd-ssd, pd-balanced) · Managed Disks (Premium / Ultra)
  < 1 ms latency · up to 16 GB/s (io2 Block Express) · priced $/GB-month + IOPS + throughput · best for VM root disks, DB volumes

Object — S3 (Standard / IA / Glacier) · GCS (Standard / Nearline / Archive) · Blob (Hot / Cool / Archive)
  10–100 ms TTFB · massively parallel throughput · priced $/GB-month + GET/PUT + egress · best for backups, media, data lakes

File — EFS (NFSv4) / FSx (Lustre, ONTAP, OpenZFS) · Filestore (NFS), Cloud Storage FUSE · Azure Files (SMB / NFS)
  1–5 ms latency · tens of GB/s (FSx Lustre) · priced $/GB-month + IOPS · best for shared home dirs, lift-and-shift NFS

Block — disk-shaped

One VM mounts one volume. Looks like /dev/nvme1n1. Snapshots go to object storage; a volume is never shared with a second VM (barring io2 multi-attach with a clustered filesystem).

Object — bucket-shaped

HTTP-addressed key/value over flat namespace. 11-nines durability. The right answer for >90% of "files" you'd otherwise put in a filesystem.

File — NFS-shaped

POSIX shared filesystem. What most lift-and-shift workloads reach for; almost always more expensive than redesigning for object storage.

08

Storage Class Lifecycle — Tiering Down

Object storage has a price/access trade-off baked in: hot tiers cost more per GB but are cheap to read; cold tiers are nearly free at rest but expensive (and slow) to retrieve. Lifecycle rules automate the descent.

S3 storage classes (per GB-month, us-east-1, ~)

Class                       Storage $/GB   GET              Min duration
Standard                    $0.023         $0.0004 / 1k     —
Standard-IA                 $0.0125        $0.001 / 1k      30 d
Intelligent-Tiering         auto           auto             —
Glacier Instant Retrieval   $0.004         $0.01 / 1k       90 d
Glacier Flexible            $0.0036        + retrieval      90 d
Glacier Deep Archive        $0.00099       + retrieval      180 d
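The Standard vs Standard-IA decision is a one-line break-even calculation. A sketch using the prices above, plus IA's ~$0.01/GB retrieval surcharge (an assumption not shown in the table; check current pricing):

```python
# Break-even sketch: IA saves $0.0105/GB-month on storage but adds ~$0.01/GB
# on retrieval, so IA wins while you re-read less than ~105% of the data/month.
def monthly_cost_per_gb(storage_price, reads_per_gb_per_month, retrieval_per_gb=0.0):
    return storage_price + reads_per_gb_per_month * retrieval_per_gb

# Data read at half its own size per month:
std = monthly_cost_per_gb(0.023, reads_per_gb_per_month=0.5)
ia  = monthly_cost_per_gb(0.0125, reads_per_gb_per_month=0.5, retrieval_per_gb=0.01)
print(f"Standard ${std:.4f}/GB-mo vs IA ${ia:.4f}/GB-mo")
```

Intelligent-Tiering exists precisely because nobody knows their `reads_per_gb_per_month` in advance — it measures access and moves objects for you, for a small monitoring fee.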

A typical rule

# Standard → IA after 30d → Glacier after 90d → expire after 7 years
<LifecycleConfiguration>
  <Rule>
    <Filter><Prefix>logs/</Prefix></Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>2555</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>

Cloudflare R2 — the disruptor

  • S3-compatible API, $0.015/GB-month, zero egress fees
  • No regional choice — Cloudflare picks the closest of its 300+ locations
  • "Data egress is the price of cloud lock-in" was their explicit pitch
  • Forced AWS to introduce free egress on full account exits (early 2024)

Cost gotchas

  • Small-object tax — each Glacier object carries ~40 KB of index/metadata overhead plus a per-object transition fee; archiving millions of 1 KB objects can cost more than leaving them in Standard
  • Replication doubles cost — CRR / SRR / Object Replication; useful for compliance, not cost
  • Versioning + delete markers — "deleted" objects keep accruing storage cost forever unless lifecycled
  • Multi-part upload incompletes — abandoned uploads stick around indefinitely; add a cleanup rule
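The small-object tax is easy to demonstrate. A simplified sketch (it bills all ~40 KB of per-object overhead at the Glacier rate, and the transition fee is an assumed figure — check current pricing):

```python
# Why tiny objects don't belong in Glacier: each archived object carries
# ~40 KB of index/metadata overhead, plus a per-object transition request fee.
# Simplification: all overhead billed at the Glacier rate; fee is an assumption.
def glacier_saving_per_object(size_kb, std=0.023, glacier=0.0036,
                              overhead_kb=40, transition_fee=0.00005):
    gb = size_kb / (1024 * 1024)
    overhead_gb = overhead_kb / (1024 * 1024)
    monthly_saving = gb * (std - glacier) - overhead_gb * glacier
    return monthly_saving, transition_fee

saving_1kb, fee = glacier_saving_per_object(1)
print(f"1 KB object: monthly saving ${saving_1kb:.10f}, one-off fee ${fee}")
```

For a 1 KB object the "saving" is negative before the transition fee is even counted — the overhead alone costs more than the object. Aggregate small objects into tarballs before archiving.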
09

Regions & Availability Zones

A region is a geographic cluster of availability zones (AZs); each AZ is one or more physically separate data centres on independent power and network. Designing for failure means designing across AZs first, regions second.

[Diagram] Region eu-west-2 (London): AZ-a (eu-west-2a) = DC1 + DC2, ~10 km from its peers, independent power & cooling; AZ-b = DC3 + DC4, < 2 ms RTT to peers; AZ-c = DC5. Note that "AZ" is a logical name: AWS uses zone IDs (use1-az1), so two accounts' "us-east-1a" can map to different physical zones.

Choose your region for

  • Latency to users — pick the nearest one with the services you need
  • Compliance / residency — UK, eu-central-1, gov-cloud
  • Service availability — not every region has every service or SKU
  • Cost — us-east-1 is the cheapest baseline; eu-west-1 ~5–10% more; ap-east-1 ~30% more

Multi-region patterns

  • Active-passive — primary region serves; DR site sits warm. RPO minutes, RTO hours.
  • Active-active read — global reads close to user, writes go to primary (Aurora Global, Spanner)
  • Truly global — Spanner, DynamoDB Global Tables, Cosmos DB; conflict-free or last-writer-wins
  • Service-by-service — ML batch in cheap region, latency-sensitive front-end at the edge
10

Cloud Identity Primitives

Every IaaS resource has an identity, every API call is authenticated and authorised, every audit log records who did what. This is the cloud's most powerful feature — and the hardest to use right. (Deep dive in deck 05.)

Things that have identity

  • Humans — federated via SSO (rarely as IAM users)
  • Workloads — instance roles, service accounts, K8s service accounts
  • CI/CD — OIDC-federated GitHub Actions runners, GitLab runners, Buildkite agents
  • External services — cross-account roles, AssumeRole-with-WebIdentity

The four building blocks (AWS naming)

  • Identity — user, role, service account
  • Policy — JSON document of allow/deny statements
  • Resource — what the policy applies to
  • STS — short-lived credentials, role assumption, AssumeRole

An IAM policy

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:PutObject"
    ],
    "Resource": "arn:aws:s3:::data-prod/uploads/*",
    "Condition": {
      "StringEquals": {
        "aws:PrincipalTag/team": "ingest"
      },
      "Bool": { "aws:SecureTransport": "true" }
    }
  }]
}

Allow only GetObject/PutObject, only on one prefix, only over TLS, only for a tagged principal. Conditions are where the real expressiveness lives.
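The "every condition must hold" semantics can be sketched with a toy evaluator for just the two operators used above (real IAM evaluation has dozens of operators, explicit denies, and policy-type precedence — this is illustration only):

```python
# Toy evaluator for StringEquals and Bool conditions — enough to see that a
# request must satisfy *every* condition block for the Allow to apply.
# Real IAM evaluation is far richer (explicit deny, many operators, SCPs...).
def conditions_match(conditions: dict, request_context: dict) -> bool:
    for operator, pairs in conditions.items():
        for key, expected in pairs.items():
            actual = request_context.get(key)
            if operator == "StringEquals" and actual != expected:
                return False
            if operator == "Bool" and str(actual).lower() != expected:
                return False
    return True

policy_conditions = {
    "StringEquals": {"aws:PrincipalTag/team": "ingest"},
    "Bool": {"aws:SecureTransport": "true"},
}
print(conditions_match(policy_conditions,
      {"aws:PrincipalTag/team": "ingest", "aws:SecureTransport": True}))
# True — tagged principal, over TLS
```

Change either context value — wrong team tag, or plain HTTP — and the whole statement stops matching; there is no partial credit.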

No long-lived keys, ever

Workload Identity (GCP) / IRSA (EKS) / Pod Identity (EKS, 2023) / Managed Identity (Azure) / OIDC federation (GitHub Actions, GitLab) all give you short-lived STS credentials, no static keys to leak.

11

Observability Primitives

You don't have a server room to walk into; the only window onto your fleet is what the platform exposes plus what you instrument. Plan observability first, not last.

Metrics

  • CloudWatch / Cloud Monitoring / Azure Monitor — built-in, basic but expensive at scale
  • Prometheus / Mimir / Cortex — OSS standard for high-cardinality metrics
  • Datadog / New Relic / Grafana Cloud — managed, pricey but turn-key

Logs

  • CloudWatch Logs / Cloud Logging / Log Analytics
  • Loki / OpenSearch — OSS aggregators
  • Datadog Logs / Splunk — eats your budget if you ship every line; sample & index selectively

Traces

  • OpenTelemetry — the cross-vendor standard; instrument once
  • X-Ray / Cloud Trace / App Insights — provider-native, often weaker than OTel-based tools
  • Tempo / Jaeger / Zipkin — OSS

The four golden signals

Latency · traffic · errors · saturation. Per-service. Per-tenant for SaaS (deck 04). On the same dashboard as the cloud's platform metrics (CPU credits, NAT GW throughput, NAT errors, ALB 5xx) so you can correlate.

The cloud-bill-as-observability anti-pattern

You'll hear "we know there's a problem because the bill spiked". That's too late. Set rate-of-change alerts on your bill (Cost Anomaly Detection, GCP Budget alerts) and normal SRE alerts on golden signals.
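A rate-of-change alert of the kind Cost Anomaly Detection automates is a few lines of arithmetic. A minimal sketch (window and threshold factor are arbitrary choices, and real detectors handle seasonality this doesn't):

```python
# Flag any day whose spend exceeds the trailing-window mean by a chosen factor.
# Real anomaly detection also models weekly seasonality; this is the bare idea.
def spend_anomalies(daily_spend, window=7, factor=1.5):
    alerts = []
    for i in range(window, len(daily_spend)):
        baseline = sum(daily_spend[i - window:i]) / window
        if daily_spend[i] > factor * baseline:
            alerts.append((i, daily_spend[i], round(baseline, 2)))
    return alerts

spend = [100, 102, 98, 101, 99, 103, 100, 310]   # day 7: NAT GW left pulling images
print(spend_anomalies(spend))
# [(7, 310, 100.43)]
```

The point of the trailing baseline is that it catches the *jump* the day it happens, where a monthly budget alert fires weeks later.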

12

Provisioning — Infrastructure as Code

Click-ops your way to one VPC and you've spent an afternoon. Click-ops your way to fifty across three accounts and you've made a career-limiting error. IaC is non-negotiable past the toy stage.

The four serious tools

  • Terraform / OpenTofu — declarative HCL, multi-cloud, the de-facto standard. OpenTofu is the post-license-change OSS fork.
  • Pulumi — same model, real programming languages (TS, Python, Go, .NET)
  • AWS CDK / CDKTF — TypeScript / Python that compiles to CloudFormation or Terraform
  • CloudFormation / ARM / Bicep / Deployment Manager — vendor-native

A small Terraform module

variable "azs" {
  type = map(string)  # e.g. { a = "eu-west-2a", b = "eu-west-2b" }
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  tags = { Name = "prod" }
}

resource "aws_subnet" "private" {
  for_each          = var.azs
  vpc_id            = aws_vpc.main.id
  availability_zone = each.value
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, 10 + index(keys(var.azs), each.key))
  tags = { Tier = "private", Name = "prod-${each.key}" }
}

Ops practices that matter

  • State remote & locked — S3 + DynamoDB, GCS, Azure Storage; lock prevents two engineers stomping each other
  • Plan in CI, apply behind PR review — Atlantis, Terraform Cloud, Spacelift, Env0
  • No drift — alert when reality ≠ state. Detect with terraform plan -detailed-exitcode
  • Modules — one VPC module, one EKS module — not 12 copies

GitOps for IaC

The path forward: PR opens a plan; merge to main triggers an apply via a CI runner with OIDC-federated cloud creds. No long-lived keys on developer laptops, no manual terraform apply in production. CI/CD deck covers the build side.

Don't import the world

If something was clicked first, write the resource block, then terraform import. Don't try to model existing chaos perfectly — most "drift" is unmanaged config you should probably leave outside IaC anyway.

13

What Costs Money on IaaS

Compute

  • VM hours — biggest line on most bills
  • Reserved / Savings Plans / CUDs — 30–72% off for commitment
  • Spot — 50–90% off, eviction risk
  • Idle — running VMs at 02:00 with no users

Storage

  • Block — provisioned size + IOPS + throughput, even if 90% empty
  • Object — storage + GETs/PUTs + retrieval; small-object tax
  • Snapshots / images — accumulate without lifecycle
  • Versioned objects — silent duplication

Network

  • Egress — bytes leaving the cloud (~$0.05–0.12/GB)
  • Cross-AZ — ~$0.01–0.02/GB each way
  • NAT Gateway — $0.045/hr + $0.045/GB processed
  • Public IPv4 — chargeable since Feb 2024 (~$3.65/month each)
  • Load balancers — fixed hourly + LCU/processed-bytes

Surprise generators

  • S3 versioning + no lifecycle = silent doubling each year
  • RDS Multi-AZ for "dev" = ~2× the cost forever
  • Forgotten test EKS clusters at $73/month each, ten of them
  • NAT GW pulling images on every Lambda cold start
  • CloudWatch Logs at $0.50/GB ingested for DEBUG-level chatty services

FinOps essentials

Tag everything (cost-center, env, service); split bills with OpenCost; review weekly with engineers, not finance.

14

When IaaS Still Wins (and When to Climb)

IaaS still wins for

  • Specialised kernels / drivers — eBPF, real-time, GPU drivers, RDMA, custom networking, FUSE filesystems
  • Stateful workloads with no managed equivalent — legacy ERPs, MPI / HPC clusters, niche databases (CockroachDB self-hosted, Vitess on bare metal)
  • Compliance demanding "we patch the OS" — some frameworks still want you operating below the hypervisor
  • Cost ceilings on steady-state large workloads — 3-year RIs / CUDs make IaaS substantially cheaper than PaaS at scale
  • BYO-licensing — Oracle, Windows Server core licensing, SAP HANA

Climb to CaaS / PaaS / FaaS when…

  • Your workload is stateless HTTP / gRPC
  • You don't need OS-level control (the truth: 90% don't)
  • You'd rather pay per-request (fractions of a cent) than $0.05/hr for an idle VM
  • You don't have a platform team and won't have one soon

Climb to SaaS for

  • Anything generic — email, helpdesk, CRM, BI dashboards, observability, identity
  • Anything where the differentiation isn't yours — auth, signup flows, e-commerce checkout

Decision rule

Default to PaaS / CaaS. Drop to IaaS only when you've identified exactly which provider abstraction is the problem, and the operational tax on the higher layer is documented and bigger than the IaaS one.

15

IaaS Anti-Patterns

"Pets, not cattle"

Each VM has a name (irene-prod-3), an SSH-tweaked config, and is irreplaceable because no one wrote down what's on it. Use AMIs / images / Packer; treat instances as disposable.

"One VPC per developer"

Hundred-VPC sprawl, 50 forgotten NAT gateways, no shared services. Use one shared dev VPC, prod-isolation only for prod.

"Public IPs everywhere"

SSH on 0.0.0.0/0, RDS in a public subnet "for now", DBs reachable from the internet. Use IAP / Session Manager / bastion + private subnets only.

"Static IAM access keys"

Long-lived AKIA… keys checked into a developer's .aws/credentials, then leaked to GitHub. Use SSO + STS.

"Mutually-exclusive subnets"

Twelve subnets per AZ, each a tier, each only used by one app. Subnets are cheap; reuse them within tiers, scope with security groups.

"Single-AZ forever"

A single-AZ MVP that you never widened. The first real outage is the AZ outage.

"Snowflake security groups"

Hand-edited, undocumented SGs that nobody dares change. Drift them once, you'll never untangle them. SGs in IaC, with names.

"Backups but no restore drill"

Backups run nightly; nobody has restored. Run a disaster-recovery drill at least quarterly.

16

Summary

Three takeaways

  1. IaaS is the floor — VMs, VPC, storage, IAM. Everything above it is built from this.
  2. The operational tax of IaaS is real. Move up the stack unless you have a specific reason not to.
  3. IaC, identity, and observability are not optional. The cloud is API-only — if you can't trace, audit, and reproduce, you're gambling.

Next in the series

  • 03 PaaS / FaaS / CaaS — the layer you'll actually use
  • 04 SaaS Architecture — multi-tenancy, B2B identity
  • 05 Cloud Security — IAM in depth, secrets, network, compliance
  • 06 LLM-as-a-Service

Companion decks

One sentence

"IaaS is the cloud's most powerful and most punishing layer — full control of the compute primitive, in exchange for taking on every problem above the hypervisor."