Block (EBS / PD / Managed Disks) — the disk metaphor
Object (S3 / GCS / Blob) — the bucket metaphor
File (EFS / Filestore / Azure Files) — when you really want NFS
Storage class lifecycle & tiering
Operability
Region & AZ design — what fails, when
IAM, STS, instance profiles
Provisioning — Terraform / Pulumi / CDK
Cost watch-points & anti-patterns
02
The VM — IaaS's One Primitive
An IaaS VM is a virtualised x86_64 / ARM64 server with: a chosen shape (vCPU + RAM + ephemeral disk), a network interface in a VPC subnet, a boot disk, an IAM identity, and a region + AZ. Everything else in IaaS exists to network, store, or govern this primitive.
Azure Cobalt 100 — Microsoft's first ARM silicon (2024 GA)
Build pipelines: linux/arm64 images now table-stakes; Docker buildx / GitHub Actions cover it cleanly
Reading the shape
# AWS
m7i.2xlarge → m7 generation, Intel (i), 2xlarge = 2× the base size (8 vCPU, 32 GB)
c7g.large → c7 generation, Graviton (g), large (2 vCPU, 4 GB)
# GCP
n2-standard-4 → n2 family, standard memory ratio, 4 vCPU (16 GB)
c4a-highcpu-8 → c4a family, Axion (a), high-CPU ratio, 8 vCPU (16 GB)
# Azure
Standard_D4s_v5 → D family, 4 vCPU (16 GB), premium-storage capable (s), v5 generation
03
Spot & Preemptible — Pay 60–90% Less
Spot instances are surplus capacity, sold at a steep discount with one catch: the cloud can reclaim them with minutes (or seconds) of notice. Used right, they slash costs; used wrong, they take down production.
Three flavours
AWS: Spot · 2 min eviction notice · 50–90% discount
GCP: Spot VM (was Preemptible) · 30 s eviction notice · ~70% discount
Azure: Spot · 30 s eviction notice · ~80% discount
When spot is right
Stateless web tier behind a load balancer; nodes come and go
Batch jobs that checkpoint (ML training, video transcoding, ETL)
CI runners, dev environments, ephemeral test fleets
Eviction handlers — drain endpoint, persist state in 30s, rejoin elsewhere
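Where the workload fits, the request is a few lines of Terraform. A minimal sketch (the AMI ID, max price, and names are placeholder assumptions, not recommendations):

# Hypothetical stateless worker bought on the spot market
resource "aws_instance" "spot_worker" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "c7g.large"

  instance_market_options {
    market_type = "spot"
    spot_options {
      max_price                      = "0.04"        # ceiling on the hourly price
      instance_interruption_behavior = "terminate"   # reclaim = terminate, not stop
    }
  }

  tags = { Role = "stateless-worker" }
}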
Bare metal & dedicated hosts
At the other end of the spectrum: EC2 i4i.metal, GCP sole-tenant nodes, Azure Dedicated Hosts. No hypervisor overhead, BYOL eligibility, full physical isolation. Used for HPC, BYO virtualisation, and licensing-bound workloads (Oracle, core-licensed Windows Server).
04
The VPC — Your Virtual Data Centre
A VPC (Virtual Private Cloud) is a software-defined network you own inside the provider's substrate. CIDR range, subnets, routes, security — yours; the underlying physical fabric — theirs.
Three-tier subnet pattern
Public — load balancers + NAT only. Private — app tier; outbound via NAT, inbound only from the ALB. Data — DBs + caches; no internet path in either direction.
Two AZs minimum
Every tier replicated across at least two AZs. An AZ failure (an entire data centre going dark) should leave you running, not down. Cross-AZ traffic is billed — but cheaper than downtime.
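A minimal Terraform sketch of the public tier of that layout, two AZs and one subnet each; the private and data tiers repeat the shape without public IPs (CIDRs and AZ names are illustrative assumptions):

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Public tier: one subnet per AZ, public IPs on launch
resource "aws_subnet" "public" {
  for_each = { "us-east-1a" = "10.0.0.0/24", "us-east-1b" = "10.0.1.0/24" }  # placeholders

  vpc_id                  = aws_vpc.main.id
  availability_zone       = each.key
  cidr_block              = each.value
  map_public_ip_on_launch = true   # only the public tier gets public IPs
}

# Private tier: same shape, no public IPs (the data tier is identical, minus any NAT route)
resource "aws_subnet" "private" {
  for_each = { "us-east-1a" = "10.0.10.0/24", "us-east-1b" = "10.0.11.0/24" }

  vpc_id            = aws_vpc.main.id
  availability_zone = each.key
  cidr_block        = each.value
}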
05
Routing — IGW · NAT · Endpoints
Routes inside a VPC are explicit: each subnet has a route table, and every destination CIDR needs a target. Beyond the implicit local route for the VPC's own CIDR, there are no defaults: if you didn't write a route for 0.0.0.0/0, the subnet has no path to the internet.
Three ways out
Internet Gateway (IGW) — bidirectional. Subnet that routes 0.0.0.0/0 → IGW + has a public IP = "public subnet".
NAT Gateway — outbound-only. Private instances reach the internet through it; the internet can't reach them.
VPC Endpoints / PrivateLink — talk to AWS services (S3, DynamoDB, Bedrock, …) over the AWS backbone, without leaving the VPC.
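A sketch of how the first two exits appear as routes, building on the subnet sketch above (Terraform; names are placeholders and the subnet-to-route-table associations are omitted for brevity):

# Public subnets: default route to the internet gateway (bidirectional)
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

# Private subnets: default route to a NAT gateway (outbound only)
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public["us-east-1a"].id   # NAT lives in a public subnet
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}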
NAT gateway — the fee meter
~$0.045/hour per NAT GW plus $0.045/GB processed (us-east-1)
One per AZ for HA → ~$32/month idle × AZs
Heavy data fetch (Docker pulls, package installs) routes through NAT — surprise bills
Mitigation: VPC endpoints for ECR, S3; pull-through caches; amazon-vpc-cni for Kubernetes pods to use ENIs directly
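The S3 part of that mitigation is a single resource: a gateway endpoint so object traffic bypasses the NAT meter entirely (sketch; the region string and route-table reference reuse the examples above):

# Gateway endpoint: S3 traffic rides the AWS backbone, not the NAT gateway
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"   # match your region
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}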
Cross-VPC connectivity
VPC Peering — point-to-point, non-transitive, 1:1 between VPCs
Transit Gateway — hub-and-spoke router; many VPCs, one route table
PrivateLink — expose one service from your VPC to others, without exposing the VPC
Cloud WAN (AWS) / NCC (GCP) — global SD-WAN-as-a-Service
On-prem ↔ cloud
Site-to-site VPN — IPsec over the public internet; cheapest, ~1 Gbps each
Direct Connect / Cloud Interconnect / ExpressRoute — private fibre into the cloud's edge; up to 100 Gbps; the only sensible choice for hybrid at scale
SD-WAN partners — Megaport, Equinix Fabric — multi-cloud across one fibre
06
Security Groups vs NACLs
One stateful mechanism (security groups) and one stateless one (network ACLs). Confuse them and you ship vulnerabilities or break production at 3 am.
Security Groups (the daily-driver)
Attached to network interfaces (instances, ENIs, RDS, Lambda VPC ENIs)
Stateful — return traffic for an established flow is automatically allowed
Allow rules only — there is no deny
Default rule: deny all inbound, allow all outbound
Source can be a CIDR or another SG — chaining ("web SG can talk to db SG") is the right pattern
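The chaining pattern sketched in Terraform (tier names and the Postgres port are assumptions):

# App-tier SG: referenced by the db SG below, no CIDRs involved
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = aws_vpc.main.id
}

# DB-tier SG: Postgres allowed only from instances carrying the app SG
resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]   # source is an SG, not a CIDR
  }
}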
Network ACLs (the perimeter)
Attached to subnets
Stateless — must explicitly allow return traffic, both directions
Allow and deny rules; numbered, evaluated in order
Default NACL: allow all (deliberately permissive); custom NACLs start deny-all
Used as a coarse "blast radius" backstop — rare in day-to-day, vital for compliance perimeters
Defence in depth
NACL: "no traffic from outside the corporate IP space"
SG: "ALB ↔ app, app ↔ db, nothing else"
Host firewall: distro defaults (often nothing — that's fine when SGs are tight)
Application: TLS, mTLS, app-layer auth
The classic SG anti-pattern
0.0.0.0/0 on port 22 "for now". That gets scanned and brute-forced inside an hour. Use SSM Session Manager / IAP tunnels — see deck 05.
07
Storage — Block / Object / File
Cloud storage is three almost-unrelated services trading off durability, latency, throughput, and price. Use the wrong one and you pay 100× — in either direction.
Block: EBS (gp3, io2, st1) on AWS · Persistent Disk (pd-ssd, pd-balanced) on GCP · Managed Disks (Premium / Ultra) on Azure
< 1 ms latency · up to 4 GB/s per volume (io2 Block Express) · priced per GB-month + IOPS + throughput · best for VM root and DB volumes
Object: S3 (Standard / IA / Glacier) on AWS · GCS (Standard / Nearline / Archive) on GCP · Blob (Hot / Cool / Archive) on Azure
10–100 ms TTFB · massively parallel throughput · priced per GB-month + GET/PUT + egress · best for backups, media, data lakes
File: EFS (NFSv4) / FSx (Lustre, ONTAP, OpenZFS) on AWS · Filestore (NFS), Cloud Storage FUSE on GCP · Azure Files (SMB / NFS) on Azure
1–5 ms latency · up to tens of GB/s (FSx for Lustre) · priced per GB-month + IOPS · best for shared home dirs, lift-and-shift NFS
Block — disk-shaped
One VM mounts one volume; it shows up as /dev/nvme1n1. Snapshots land in object storage. A volume attaches to a single VM at a time (multi-attach exists, but only with a cluster-aware filesystem).
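A sketch of that one-to-one relationship in Terraform (size, AZ, device name, and the instance reference are placeholders):

resource "aws_ebs_volume" "data" {
  availability_zone = "us-east-1a"   # must match the instance's AZ
  size              = 100            # GiB
  type              = "gp3"
}

resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"               # appears in the guest as an NVMe device
  volume_id   = aws_ebs_volume.data.id
  instance_id = aws_instance.app.id      # hypothetical instance; exactly one at a time
}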
Object — bucket-shaped
HTTP-addressed key/value over flat namespace. 11-nines durability. The right answer for >90% of "files" you'd otherwise put in a filesystem.
File — NFS-shaped
POSIX shared filesystem. Mostly used for lift-and-shift; almost always more expensive than redesigning around object storage.
08
Storage Class Lifecycle — Tiering Down
Object storage has a price/access trade-off baked in: hot tiers cost more per GB but are cheap to read; cold tiers are nearly free at rest but expensive (and slow) to retrieve. Lifecycle rules automate the descent.
S3 storage classes (per GB-month, us-east-1, ~)
Standard: storage $0.023 · GET $0.0004 / 1k · no minimum duration
Standard-IA: storage $0.0125 · GET $0.001 / 1k · 30 d minimum
Intelligent-Tiering: storage auto · GET auto · no minimum duration
Glacier Instant Retrieval: storage $0.004 · GET $0.01 / 1k · 90 d minimum
Glacier Flexible: storage $0.0036 · plus retrieval fees · 90 d minimum
Glacier Deep Archive: storage $0.00099 · plus retrieval fees · 180 d minimum
A typical rule
# Standard → IA after 30d → Glacier after 90d → expire after 7 years
<LifecycleConfiguration>
  <Rule>
    <Filter><Prefix>logs/</Prefix></Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>2555</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
Cloudflare R2 — the disruptor
S3-compatible API, $0.015/GB-month, zero egress fees
No regional choice — Cloudflare picks the closest of its 300+ locations
"Data egress is the price of cloud lock-in" was their explicit pitch
Forced AWS to introduce free egress on full account exits (early 2024)
Cost gotchas
Small-object tax — every object transitioned to Glacier pays a per-object transition request plus roughly 40 KB of metadata overhead; for 1 KB objects the "saving" is negative
Replication doubles cost — CRR / SRR / Object Replication; useful for compliance, not cost
Incomplete multipart uploads — abandoned parts stick around (and bill) indefinitely; add an AbortIncompleteMultipartUpload lifecycle rule
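That cleanup rule sketched in Terraform (the bucket reference and the 7-day window are assumptions):

resource "aws_s3_bucket_lifecycle_configuration" "cleanup" {
  bucket = aws_s3_bucket.logs.id   # hypothetical bucket

  rule {
    id     = "abort-stale-multipart-uploads"
    status = "Enabled"
    filter {}   # applies to the whole bucket

    abort_incomplete_multipart_upload {
      days_after_initiation = 7   # reap abandoned parts after a week
    }
  }
}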
09
Regions & Availability Zones
A region is a geographic cluster of availability zones (AZs); each AZ is one or more physically separate data centres on independent power and network. Designing for failure means designing across AZs first, regions second.
Choose your region for
Latency to users — pick the nearest one with the services you need
Service availability — not every region has every service or SKU
Cost — us-east-1 is the cheapest baseline; eu-west-1 ~5–10% more; ap-east-1 ~30% more
Multi-region patterns
Active-passive — primary region serves; DR site sits warm. RPO minutes, RTO hours.
Active-active read — global reads close to user, writes go to primary (Aurora Global, Spanner)
Truly global — Spanner, DynamoDB Global Tables, Cosmos DB; conflict-free or last-writer-wins
Service-by-service — ML batch in cheap region, latency-sensitive front-end at the edge
10
Cloud Identity Primitives
Every IaaS resource has an identity, every API call is authenticated and authorised, every audit log records who did what. This is the cloud's most powerful feature — and the hardest to use right. (Deep dive in deck 05.)
Things that have identity
Humans — federated via SSO (rarely as IAM users)
Workloads — instance roles, service accounts, K8s service accounts
Policies scope what those identities may do: allow only POST/GET, only on one prefix, only over TLS, only for a tagged principal. Conditions are where the real expressiveness lives.
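A sketch of such a conditioned policy in Terraform (bucket, prefix, and tag are hypothetical):

resource "aws_iam_policy" "uploader" {
  name = "reports-uploader"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:PutObject", "s3:GetObject"]         # only these two calls
      Resource = "arn:aws:s3:::example-bucket/reports/*"   # only this prefix
      Condition = {
        Bool         = { "aws:SecureTransport" = "true" }          # only over TLS
        StringEquals = { "aws:PrincipalTag/team" = "reporting" }   # only tagged principals
      }
    }]
  })
}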
No long-lived keys, ever
Workload Identity (GCP) / IRSA (EKS) / Pod Identity (EKS, 2023) / Managed Identity (Azure) / OIDC federation (GitHub Actions, GitLab) all give you short-lived STS credentials, no static keys to leak.
11
Observability Primitives
You don't have a server room to walk into; the only window onto your fleet is what the platform exposes plus what you instrument. Plan observability first, not last.
Metrics
CloudWatch / Cloud Monitoring / Azure Monitor — built-in, basic but expensive at scale
Prometheus / Mimir / Cortex — OSS standard for high-cardinality metrics
Datadog / New Relic / Grafana Cloud — managed, pricey but turn-key
Logs
CloudWatch Logs / Cloud Logging / Log Analytics
Loki / OpenSearch — OSS aggregators
Datadog Logs / Splunk — eats your budget if you ship every line; sample & index selectively
Traces
OpenTelemetry — the cross-vendor standard; instrument once
X-Ray / Cloud Trace / App Insights — provider-native, often weaker than OTel-based tools
Tempo / Jaeger / Zipkin — OSS
The four golden signals
Latency · traffic · errors · saturation. Per-service. Per-tenant for SaaS (deck 04). On the same dashboard as the cloud's platform metrics (CPU credits, NAT GW throughput, NAT errors, ALB 5xx) so you can correlate.
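One way to get a platform signal onto the same alerting path, sketched in Terraform as a CloudWatch alarm on ALB 5xx (the load-balancer reference, threshold, and SNS topic are assumptions):

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-elevated-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_ELB_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 50                        # >50 errors/min, five minutes running
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix         # hypothetical ALB
  }

  alarm_actions = [aws_sns_topic.oncall.arn]      # hypothetical SNS topic
}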
The cloud-bill-as-observability anti-pattern
You'll hear "we know there's a problem because the bill spiked". That's too late. Set rate-of-change alerts on your bill (Cost Anomaly Detection, GCP Budget alerts) and normal SRE alerts on golden signals.
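As a floor, a forecast-based budget alert is one Terraform resource (the amount and email are placeholders; anomaly detection layers the rate-of-change logic on top):

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cloud-spend"
  budget_type  = "COST"
  limit_amount = "10000"       # USD, placeholder
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80                 # alert at 80% of the budget
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"       # fire on projected overrun, not after it
    subscriber_email_addresses = ["oncall@example.com"]
  }
}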
12
Provisioning — Infrastructure as Code
Click-ops your way to one VPC and you've spent an afternoon. Click-ops your way to fifty across three accounts and you've made a career-limiting error. IaC is non-negotiable past the toy stage.
The four serious tools
Terraform / OpenTofu — declarative HCL, multi-cloud, the de-facto standard. OpenTofu is the post-license-change OSS fork.
Pulumi — same model, real programming languages (TS, Python, Go, .NET)
AWS CDK / CDKTF — TypeScript / Python that compiles to CloudFormation or Terraform
CloudFormation / ARM / Bicep / Deployment Manager — vendor-native
State remote & locked — S3 + DynamoDB, GCS, or Azure Storage; locking stops two engineers stomping on each other's applies
Plan in CI, apply behind PR review — Atlantis, Terraform Cloud, Spacelift, Env0
No drift — alert when reality ≠ state. Detect with terraform plan -detailed-exitcode
Modules — one VPC module, one EKS module — not 12 copies
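The remote-state-and-locking item, sketched as a backend block (bucket, key, and table names are placeholders):

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"        # versioned, encrypted bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"             # lock table: one writer at a time
    encrypt        = true
  }
}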
GitOps for IaC
The path forward: PR opens a plan; merge to main triggers an apply via a CI runner with OIDC-federated cloud creds. No long-lived keys on developer laptops, no manual terraform apply in production. CI/CD deck covers the build side.
Don't import the world
If something was clicked first, write the resource block, then terraform import. Don't try to model existing chaos perfectly — most "drift" is unmanaged config you should probably leave outside IaC anyway.
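That flow for a clicked-up bucket, sketched (names are hypothetical); Terraform 1.5+ can also express the adoption declaratively:

# 1. Write the resource block to match what already exists
resource "aws_s3_bucket" "legacy_assets" {
  bucket = "acme-legacy-assets"
}

# 2. Adopt it into state without changing it:
#    terraform import aws_s3_bucket.legacy_assets acme-legacy-assets
# ...or, with Terraform >= 1.5, as code:
import {
  to = aws_s3_bucket.legacy_assets
  id = "acme-legacy-assets"
}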
13
What Costs Money on IaaS
Compute
VM hours — biggest line on most bills
Reserved / Savings Plans / CUDs — 30–72% off for commitment
Spot — 50–90% off, eviction risk
Idle — running VMs at 02:00 with no users
Storage
Block — provisioned size + IOPS + throughput, even if 90% empty
14
When to Stay on IaaS (and When to Climb)
Stay on IaaS for
Stateful workloads with no managed equivalent — legacy ERPs, MPI / HPC clusters, niche databases (self-hosted CockroachDB, Vitess on bare metal)
Compliance demanding "we patch the OS" — some frameworks still want you operating below the hypervisor
Cost ceilings on steady-state large workloads — 3-year RIs / CUDs make IaaS substantially cheaper than PaaS at scale
BYO-licensing — Oracle, Windows Server core licensing, SAP HANA
Climb to CaaS / PaaS / FaaS when…
Your workload is stateless HTTP / gRPC
You don't need OS-level control (the truth: 90% don't)
You'd rather pay $0.000004/req than $0.05/hr for an idle VM
You don't have a platform team and won't have one soon
Climb to SaaS for
Anything generic — email, helpdesk, CRM, BI dashboards, observability, identity
Anything where the differentiation isn't yours — auth, signup flows, e-commerce checkout
Decision rule
Default to PaaS / CaaS. Drop to IaaS only when you've identified exactly which provider abstraction is the problem, and the operational tax on the higher layer is documented and bigger than the IaaS one.
15
IaaS Anti-Patterns
"Pets, not cattle"
Each VM has a name (irene-prod-3), an SSH-tweaked config, and is irreplaceable because no one wrote down what's on it. Use AMIs / images / Packer; treat instances as disposable.
"One VPC per developer"
Hundred-VPC sprawl, 50 forgotten NAT gateways, no shared services. Use one shared dev VPC, prod-isolation only for prod.
"Public IPs everywhere"
SSH on 0.0.0.0/0, RDS in a public subnet "for now", DBs reachable from the internet. Use IAP / Session Manager / bastion + private subnets only.
"Static IAM access keys"
Long-lived AKIA… keys checked into a developer's .aws/credentials, then leaked to GitHub. Use SSO + STS.
"Mutually-exclusive subnets"
Twelve subnets per AZ, each a tier, each only used by one app. Subnets are cheap; reuse them within tiers, scope with security groups.
"All-AZ disaster planning"
A single-AZ MVP that you forgot to widen. The first real outage is the AZ outage.
"Snowflake security groups"
Hand-edited, undocumented SGs that nobody dares change. Let them drift once and you'll never untangle them. Keep SGs in IaC, with meaningful names.
"Backups but no restore drill"
Backups run nightly; nobody has restored. Run a disaster-recovery drill at least quarterly.
16
Summary
Three takeaways
IaaS is the floor — VMs, VPC, storage, IAM. Everything above it is built from this.
The operational tax of IaaS is real. Move up the stack unless you have a specific reason not to.
IaC, identity, and observability are not optional. The cloud is API-only — if you can't trace, audit, and reproduce, you're gambling.
Next in the series
03 PaaS / FaaS / CaaS — the layer you'll actually use
"IaaS is the cloud's most powerful and most punishing layer — full control of the compute primitive, in exchange for taking on every problem above the hypervisor."