ARM AMBA · PRESENTATION 03

AXI Deep Dive

Channels · Bursts · IDs · Ordering · QoS · Lite & Stream
AXI3 · AXI4 · AXI4-Lite · AXI4-Stream · AXI5 · 5 channels · VALID/READY · ID-based OOO
02

What AXI Solves

  • AHB couples address and data phases. One slow transaction stalls the entire bus.
  • AXI fully decouples five channels, each with its own VALID/READY handshake.
  • Transactions carry IDs — the same master can issue multiple outstanding transactions, and the slave (e.g. DRAM controller) can reorder and return them in any order as long as ordering rules per-ID are preserved.
  • This is arguably the most important AMBA innovation since the original 1996 release: it lets a modern CPU keep many loads outstanding while DMA writes stream concurrently.

The five channels

  • AW — Write Address (AWID, AWADDR, AWLEN, AWSIZE, AWBURST, AWLOCK, AWCACHE, AWPROT, AWQOS)
  • W — Write Data (WDATA, WSTRB, WLAST)
  • B — Write Response (BID, BRESP)
  • AR — Read Address (ARID, …)
  • R — Read Data (RID, RDATA, RRESP, RLAST)
Every channel has exactly the same handshake: source drives VALID, sink drives READY, transfer happens on the rising edge when both are high.
03

VALID / READY — The One Handshake

  • Source drives VALID and the payload; sink drives READY.
  • Transfer happens on the rising ACLK edge when both are asserted.
  • Source must not wait for READY before asserting VALID (no combinational dependency) — this prevents deadlock and keeps static timing clean.
  • Sink may wait for VALID before asserting READY.
  • Once VALID is high, it stays high until the handshake; payload must not change.
// Source side: safe shape
always_ff @(posedge ACLK) begin
  if (!ARESETn) begin
    VALID <= 1'b0;
    data  <= '0;
  end else if (!VALID || READY) begin
    // Idle, or the current beat was just accepted:
    // load the next beat if one is available, otherwise go idle.
    VALID <= have_data;
    if (have_data) data <= new_data;
  end
  // Otherwise VALID is high and READY is low: hold VALID and data stable.
end

Sink-side back-pressure is just: assert READY whenever it has buffer room.
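
The source-side rules on this slide can also be written as assertions. The block below is a minimal sketch of a checker for one VALID/READY channel; the module name and port names (valid, ready, data) are hypothetical and it is not taken from any vendor's protocol checker.

```systemverilog
// Hypothetical checker for one VALID/READY channel (all names are illustrative).
module vr_rules_chk #(parameter int W = 32) (
  input logic         clk,
  input logic         rstn,
  input logic         valid,
  input logic         ready,
  input logic [W-1:0] data
);
  // A stalled beat (VALID high, READY low) must still be offered next cycle.
  a_valid_held: assert property (
    @(posedge clk) disable iff (!rstn) (valid && !ready) |=> valid);

  // The payload of a stalled beat must not change.
  a_data_held: assert property (
    @(posedge clk) disable iff (!rstn) (valid && !ready) |=> $stable(data));
endmodule
```

One instance per channel is enough, for example bound to ARVALID/ARREADY with the AR payload concatenated into data.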

04

A Read Transaction Walk-through

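[Waveform: burst read of 4 words (ARLEN=3, INCR). Signals shown: ACLK, ARVALID, ARREADY, RVALID, RREADY, RDATA, RLAST. The AR handshake takes one cycle; RDATA then carries D0, D1, D2, a wait cycle, and D3.]
The slave inserts the wait before D3 by deasserting RVALID. RLAST is asserted only on the final beat (D3).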
05

Burst Types — FIXED / INCR / WRAP

INCR (AWBURST = 2'b01)
Most common. Address increments by 2^AxSIZE each beat.
  • Length: 1–16 beats (AXI3); 1–256 beats (AXI4, provided not crossing 4 KB boundary).
  • Typical use: DMA memcpy, NIC packet read/write, CPU cache line fill (for example 4 beats × 16 B on a 128-bit bus; many CPUs issue line fills as WRAP bursts instead).
  • Example: ARLEN=3, ARSIZE=2 (4-byte beats), ARADDR=0x1000 → 0x1000, 0x1004, 0x1008, 0x100C.
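
How the address evolves for each burst type can be sketched as a small function. This is a minimal sketch, assuming 32-bit addresses, a size-aligned start address for WRAP, and a legal WRAP length (2, 4, 8 or 16 beats); the package name axi_burst_pkg and function name beat_addr are made up for illustration.

```systemverilog
package axi_burst_pkg;
  typedef enum logic [1:0] {FIXED = 2'b00, INCR = 2'b01, WRAP = 2'b10} burst_e;

  // Address of beat n (0-based) of a burst that starts at 'start'.
  function automatic logic [31:0] beat_addr(
    input logic [31:0] start,
    input logic [7:0]  len,     // AxLEN  = number of beats - 1
    input logic [2:0]  size,    // AxSIZE = log2(bytes per beat)
    input burst_e      burst,
    input int unsigned n
  );
    logic [31:0] nbytes  = 32'd1 << size;            // bytes per beat
    logic [31:0] total   = nbytes * (len + 32'd1);   // bytes in the whole burst
    logic [31:0] aligned = start & ~(nbytes - 1);    // start rounded down to the beat size
    logic [31:0] wrap_lo = start & ~(total - 1);     // lower wrap boundary (total is a power of two)
    case (burst)
      FIXED:   return start;                                   // same address every beat
      INCR:    return (n == 0) ? start : aligned + n * nbytes; // only beat 0 may be unaligned
      WRAP:    return wrap_lo + ((start - wrap_lo + n * nbytes) & (total - 1));
      default: return start;                                   // 2'b11 is reserved
    endcase
  endfunction
endpackage
```

With the INCR example from this slide (ARLEN=3, ARSIZE=2, ARADDR=0x1000) the function returns 0x1000, 0x1004, 0x1008, 0x100C for n = 0..3. A WRAP burst of the same shape starting at 0x1008 visits 0x1008, 0x100C, 0x1000, 0x1004.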
06

Burst Lengths & Alignment

AXI version    INCR max     WRAP options
AXI3           16 beats     2/4/8/16
AXI4           256 beats    2/4/8/16
AXI4-Lite      1 beat only
AXI4-Stream    no length concept

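4 KB rule: a single burst must not cross a 4 KB address boundary. The specification's reason is that 4 KB is the smallest address range a slave can occupy, so a burst can never straddle two slaves. 4 KB is also the smallest Arm MMU page size, so a burst never spans two pages with different permissions.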
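
A master or a protocol checker can test the 4 KB rule with a few lines of address arithmetic. The sketch below assumes 32-bit addresses and an INCR burst; the package and function names are hypothetical.

```systemverilog
package axi_4k_pkg;
  // Returns 1 when an INCR burst would touch bytes on both sides of a 4 KB boundary.
  function automatic bit crosses_4k(
    input logic [31:0] addr,   // AxADDR
    input logic [7:0]  len,    // AxLEN  = number of beats - 1
    input logic [2:0]  size    // AxSIZE = log2(bytes per beat)
  );
    logic [31:0] nbytes  = 32'd1 << size;
    logic [31:0] aligned = addr & ~(nbytes - 1);
    logic [31:0] last    = aligned + nbytes * (len + 32'd1) - 1;  // last byte touched
    return addr[31:12] != last[31:12];   // different 4 KB page numbers
  endfunction
endpackage
```

For example, an 8-beat burst of 4-byte beats starting at 0x0FF0 ends at byte 0x100F and is illegal, while the same burst starting at 0x0FE0 ends at 0x0FFF and is fine.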

AxSIZE encoding

AxSIZE    Beat width
0         8 bits (byte)
1         16 bits (half)
2         32 bits (word)
3         64 bits
4         128 bits
5         256 bits
6         512 bits
7         1024 bits

2^AxSIZE bytes must not exceed the data-bus width. Narrow beats on a wide bus leave lanes unused: WSTRB marks the active lanes on writes, and on reads the master picks its bytes out of the lanes implied by the address.

07

AXI IDs — The Out-of-Order Engine

  • Every transaction carries an AxID (AWID or ARID). Responses come back tagged with the same ID (BID or RID).
  • Ordering rules (AXI4):
    • Transactions with the same ID: responses return in order.
    • Transactions with different IDs: responses may return in any order.
  • This lets a DRAM controller satisfy ID=7 (which hit a page) before ID=3 (which missed and had to open a new row).
  • A master that wants strong ordering just uses one ID; a master that wants bandwidth uses many.

Worked example — 4 outstanding reads

Master issues:
  ARID=3 AR0
  ARID=5 AR1
  ARID=3 AR2
  ARID=5 AR3

Slave can return:
  RID=5 R1  (ahead — page hit)
  RID=5 R3  (after R1, same ID)
  RID=3 R0  (ID=3 first outstanding)
  RID=3 R2  (after R0, same ID)
A single-ID system loses the advantage: all responses must stay in order, so the slowest transaction stalls the rest. Modern DRAM and NIC traffic needs many IDs.
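
The per-ID ordering rule can be expressed as a small simulation-only check. This is a minimal sketch under the assumption that each issued transaction is identified by its issue index; the package name, function name, and argument layout are hypothetical.

```systemverilog
package axi_id_order_pkg;
  // req_id[i] : ID of the i-th issued transaction.
  // rsp[j]    : index (into req_id) of the transaction whose response arrives j-th.
  // Returns 1 when every ID's responses arrive in issue order.
  function automatic bit legal_order(input int req_id[$], input int rsp[$]);
    int pending [int][$];   // per-ID queue of outstanding issue indices
    int id;
    foreach (req_id[i]) begin
      if (!pending.exists(req_id[i])) pending[req_id[i]] = {};
      pending[req_id[i]].push_back(i);
    end
    foreach (rsp[j]) begin
      id = req_id[rsp[j]];
      // The response must belong to the oldest outstanding transaction of its ID.
      if (pending[id].size() == 0 || pending[id][0] != rsp[j]) return 0;
      void'(pending[id].pop_front());
    end
    return 1;
  endfunction
endpackage
```

For the worked example above, req_id = {3, 5, 3, 5} with response order {1, 3, 0, 2} (R1, R3, R0, R2) is legal. Response order {2, 0, 1, 3} is not, because R2 would overtake R0 on ID 3.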
08

Write Interleaving — and Why It's Gone

  • AXI3 allowed write-data interleaving: a master could send W-beats belonging to two different write IDs alternately on the W channel.
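  • The amount of interleaving was bounded by the slave's write interleaving depth, a static property of the interface rather than a signal. AXI3 W beats carry a WID so the slave can tell which write each beat belongs to.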
  • AXI4 removed it. Write data must now be sent strictly in the order of the AW transactions, and all beats of one AW are sent before the first beat of the next.

Why? Because nobody ever used it. Masters either couldn't buffer the second write's data, or preferred to serialise writes to avoid corner cases. Verification shops hated it — the state space exploded for zero benefit.

Consequence today

Write-side ordering on AXI4: AW order defines W order. Read-side ordering per-ID only. Most masters can still issue AWs with different IDs and rely on the slave to reorder responses (B channel) — that was the more important part anyway.

If you see AXI3-derived RTL with a write interleaving depth greater than 1 (and a WID port), it's a legacy path. AXI4 and AXI5 have no WID, which is equivalent to a fixed depth of 1.
09

Write Strobes — WSTRB

  • WSTRB[N-1:0] has one bit per byte lane of WDATA. A bit of 1 means "write this byte"; 0 means "ignore it".
  • Lets partial writes happen without separate transactions — useful for byte and half-word stores on a 64-bit bus.
  • Strobe-to-lane mapping is fixed: WSTRB[n] covers WDATA[8n+7:8n], so bit 0 always corresponds to byte lane 0 (bits 7:0 of WDATA). AXI is byte-invariant, so this does not change with endianness.
  • The slave must honour WSTRB exactly — setting it wrong corrupts neighbouring bytes.

Sparse writes

// Write 0xAA to byte 3 of a word
// on a 64-bit (8-byte) bus
AWADDR = 0x1000
AWSIZE = 3 (64-bit)
WDATA  = 0x0000_0000_AA00_0000
WSTRB  = 8'b0000_1000
// Byte 3 = 0xAA; others untouched
AXI4-Lite note: WSTRB is still present. Masters must drive correct strobes; a Lite slave may honour them, ignore them and treat every write as full width, or return an error for strobe patterns it does not support.
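
For a narrow or unaligned write, the active byte lanes follow from the low address bits and AxSIZE. The sketch below assumes a 64-bit (8-lane) bus and covers only the first beat of a burst; the package and function names are hypothetical.

```systemverilog
package axi_wstrb_pkg;
  // Strobes for the first beat of a write on a 64-bit (8-lane) bus.
  function automatic logic [7:0] first_beat_wstrb(
    input logic [31:0] addr,   // AWADDR
    input logic [2:0]  size    // AWSIZE, assumed <= 3 (at most 8 bytes per beat)
  );
    int unsigned nbytes = 1 << size;                                   // bytes per beat
    int unsigned lo     = addr[2:0];                                   // first active lane
    int unsigned hi     = (addr[2:0] & ~(nbytes - 1)) + nbytes - 1;    // last active lane
    logic [7:0]  strb   = '0;
    for (int unsigned i = lo; i <= hi; i++) strb[i] = 1'b1;
    return strb;
  endfunction
endpackage
```

A single-byte write to 0x1003 (AWSIZE=0) yields 8'b0000_1000, the same lane the sparse-write example selects by hand. A half-word write to 0x1002 (AWSIZE=1) yields 8'b0000_1100.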
10

Responses — BRESP & RRESP

RRESP/BRESP     Meaning
2'b00 OKAY      Normal success (non-exclusive)
2'b01 EXOKAY    Exclusive access succeeded (store visible atomically)
2'b10 SLVERR    Slave error (bad address, protected region, internal error)
2'b11 DECERR    Decode error (no slave selected for this address)

Exclusive access

  • Master drives ARLOCK/AWLOCK = 1 to mark an exclusive pair.
  • A matching read + write must target the same address, same ID, same size.
  • If no other master wrote that location between the read and write → EXOKAY (write committed).
  • If another write invalidated the reservation → OKAY (write failed silently; master must retry).
  • Implements LDREX/STREX, LDAXR/STLXR semantics on Arm.
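
The EXOKAY/OKAY decision can be sketched as a one-entry exclusive monitor in the slave. This is a simplified illustration, not a complete monitor: it tracks a single reservation, compares only ID and address, and every port name is hypothetical.

```systemverilog
// Hypothetical single-reservation exclusive monitor (all port names are illustrative).
module excl_monitor #(parameter int AW = 32, parameter int IDW = 4) (
  input  logic           clk, rstn,
  input  logic           rd_excl,    // an exclusive read is accepted this cycle
  input  logic [IDW-1:0] rd_id,
  input  logic [AW-1:0]  rd_addr,
  input  logic           wr,         // a write is being resolved this cycle
  input  logic           wr_excl,    // that write is exclusive
  input  logic [IDW-1:0] wr_id,
  input  logic [AW-1:0]  wr_addr,
  output logic           wr_commit,  // 1 = the write may update memory
  output logic [1:0]     bresp       // 2'b01 EXOKAY, 2'b00 OKAY
);
  logic           res_v;             // a reservation is held
  logic [IDW-1:0] res_id;
  logic [AW-1:0]  res_addr;
  logic           hit;

  assign hit       = res_v && (res_id == wr_id) && (res_addr == wr_addr);
  assign wr_commit = wr && (!wr_excl || hit);   // a failed exclusive write is dropped
  assign bresp     = (wr && wr_excl && hit) ? 2'b01 : 2'b00;

  always_ff @(posedge clk or negedge rstn)
    if (!rstn) begin
      res_v <= 1'b0; res_id <= '0; res_addr <= '0;
    end else begin
      // Any committed write to the reserved address ends the reservation.
      if (wr_commit && wr_addr == res_addr) res_v <= 1'b0;
      // A new exclusive read (re)arms the monitor.
      if (rd_excl) begin
        res_v <= 1'b1; res_id <= rd_id; res_addr <= rd_addr;
      end
    end
endmodule
```

A real monitor typically keeps one reservation per master ID and may track an address range rather than a single address.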
11

AxCACHE — Memory Attributes

  • AxCACHE[3:0] tells the bus what memory type the transaction targets. Crucial for cache maintenance and correctness.
  • Bits:
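    • [0] Bufferable — the interconnect or any component may delay the transaction on its way to the final destination (for writes, an intermediate point may return the response)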
    • [1] Cacheable / Modifiable — the transaction can be merged, split, or buffered by the system cache
    • [2] Read-Allocate — cache should allocate on read miss
    • [3] Write-Allocate — cache should allocate on write miss
  • AXI4 renamed some bits (Modifiable / Allocate) and clarified semantics for system-cache-aware slaves.

Common patterns

AxCACHE    Meaning
4'b0000    Strongly-Ordered device (Device Non-bufferable)
4'b0001    Device, bufferable
4'b0010    Normal Non-cacheable Non-bufferable
4'b0011    Normal Non-cacheable Bufferable
4'b1111    Normal Cacheable Write-back, Write-allocate, Read-allocate

On Armv8 CPUs, the memory attributes from the MMU page tables determine the AxCACHE value the CPU drives; the exact mapping is defined by the CPU implementation.

12

AxPROT — Security & Privilege

  • AxPROT[2:0] = 3 attribute bits on every transaction:
    • [0] Privileged — 1 = privileged CPU mode
    • [1] Non-Secure — 1 = NS, 0 = Secure
    • [2] Instruction — 1 = instruction fetch, 0 = data
  • Bus filters (TrustZone AXI filters, SMMU fault modes) gate traffic based on AxPROT.
  • AXI5 adds an AxNSE signal for RME (Realm Management Extension). Together with AxPROT[1] it encodes four physical address spaces: Root, Realm, Secure, and Non-secure.

Hardware attribution

AxPROT is produced at the CPU (based on the MMU / EL) and must propagate unaltered through bridges, crossbars, and async FIFOs to the slave. Losing it is a security bug.

A common SoC design pattern: place an AXI firewall between a Non-Secure master and a Secure slave. It inspects AxPROT[1] and responds DECERR / SLVERR for disallowed combinations.
13

AxQOS & AxREGION

AxQOS[3:0] — Quality of Service

  • 4-bit hint: higher value = higher priority.
  • Implementation-defined — the spec does not dictate behaviour.
  • Typical use: memory controllers prioritise display-refresh or real-time DMA over CPU loads.

AxREGION[3:0]

  • 4-bit region identifier, set by the interconnect, not the master.
  • Lets a slave expose up to 16 distinct logical regions from one physical port.
  • E.g. a DRAM controller exposing banked regions without needing 16 separate slave ports.

Newer AMBA5 AxMMUVALID / AxMMUSIDV

AMBA 5 AXI adds SMMU-side-band signals so a master's StreamID can travel with the transaction — crucial for virtualised I/O and IOMMUs:

  • AWMMUVALID, AWMMUSID, AWMMUSSID
  • ARMMUVALID, ARMMUSID, ARMMUSSID
Simple crossbars often just propagate AxQOS to downstream slaves and leave prioritisation to the memory controller; QoS-aware interconnects can also use it in their own arbiters.
14

AxUSER — the Escape Hatch

  • Per-channel USER signals (AWUSER, WUSER, BUSER, ARUSER, RUSER): optional, with an implementation-defined width.
  • Intended for SoC-specific sideband: PMU hints, parity, ECC side-channel, IP licensee IDs, SMMU stream hints.
  • Master and slave must agree on the semantics — the spec does nothing here except carry bits.
  • AMBA5 formalises more attributes so the USER channel isn't the only sideband: AxMMUSID, MPAM PARTID, MTE tags, etc.

Cache stashing (AMBA 5)

AXI5 stash signals (AWSTASHNID and AWSTASHLPID, each with an enable) let a master hint that write data should be installed into a target cache, not just delivered to the slave. Used for low-latency NIC → CPU producer-consumer traffic.

MPAM PARTID: AxMPAM fields let a hypervisor tag transactions with a partition ID; the memory controller enforces per-partition bandwidth/cache quotas.
15

AXI4-Lite

  • Stripped-down AXI4 for control/status register access:
    • Burst length = 1 (no AxLEN)
    • No AxID (or implied ID=0)
    • No AxLOCK, AxCACHE, AxBURST, AxQOS
    • No AxSIZE: every access uses the full data-bus width (32 or 64 bits)
    • WSTRB is kept, so byte-lane writes are still possible
  • Still keeps the five channels and the VALID/READY handshake — so it composes cleanly with full AXI via a gateway.
  • Used by virtually every register-mapped IP block a CPU talks to today.

Why AXI4-Lite beat APB for IP

Between ~2012 and ~2018 IP vendors (especially on FPGA, via Xilinx) switched register interfaces from APB to AXI4-Lite. Reasons:

  • No bridge needed when the host bus is AXI.
  • Independent channels allow pipelining: a pipelined AXI4-Lite slave can accept an access every cycle, while every APB transfer takes at least two cycles (setup, then access).
  • Same VIP / same formal assertions / same training as full AXI.
MCUs still use APB — area matters there. Servers/FPGAs use AXI4-Lite — integration matters more.
16

AXI4-Stream

  • A pure data pipe. No address, no response. Just data flowing between a Source ("Master") and Sink ("Slave").
  • Signals:
    • TDATA — data (any width)
    • TVALID, TREADY — VALID/READY handshake
    • TLAST — end of packet
    • TKEEP — per-byte "this byte is valid"
    • TSTRB — per-byte "this byte is a data byte (vs a position byte)"; only meaningful where TKEEP is high
    • TID, TDEST, TUSER — routing & metadata
  • The de-facto FPGA streaming interface. Most Xilinx video, Ethernet, and DSP IP blocks expose it.

Why AXI4-Stream caught on

DSP pipelines don't address memory — they're just chains of "process one beat, hand it to the next stage". AXI4-Stream is that interface standardised.

Because it's so light, you can clock-gate entire stages based on TVALID activity.

Deadlock hazard: rings of AXI4-Stream without back-pressure credit protocols can deadlock. Always break rings with DMA buffers.
17

AXI5 — AMBA 5 Evolution

Significant additions

  • Atomic transactions — AtomicStore, AtomicLoad, AtomicSwap, AtomicCompare — offloaded from the CPU to the memory system (for LSE instructions on Armv8.1+).
  • Cache stashing — AWSTASH* hints to push write data into a target cache (NIC → L3).
  • Non-secure access identifiers — AxNSAID tags Non-secure transactions so a memory protection unit can restrict protected data to specific agents.
  • MTE (Memory Tagging Extension) support — allocation tags travel with the data they protect.
  • Cache maintenance and data movement — CMOs (Cache Maintenance Operations), deallocating reads, and read data chunking (ARCHUNKEN).

AXI5-Lite, AXI5-Stream

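Lite and Stream got AMBA 5 refreshes too: AXI5-Stream adds wake-up signalling and parity protection, and AXI5-Lite relaxes some AXI4-Lite restrictions (for example it permits ID signals) while aligning optional sideband signals with the rest of AMBA 5.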

Atomic offload is a big deal: instead of exclusive load/store retry loops that can spin for many cycles under contention, the operation executes at or near the memory (in the interconnect, a system cache, or the memory controller) in one round trip, with no retry.
18

AXI Crossbar — What Happens in the Middle

  • Routes masters to slaves based on AxADDR decoding.
  • Each slave port has its own arbiter across masters; each master port has its own response multiplexer across slaves.
  • ID re-mapping: the crossbar prepends a master-index to AxID so slaves see unique IDs, and strips it on the way back.
  • Register slices are typically inserted per stage for timing closure; as a rough guide, a 4×4 AXI crossbar at 2 GHz might add 3–5 cycles of latency per direction.
  • Must preserve ordering per source-ID: the same master's same-ID transactions come back in order.
[Diagram: AXI 3×3 crossbar. Masters: CPU, GPU, DMA. Crossbar: address decode + per-slave arbiter + per-master response mux. Slaves: DDR, SRAM, Peri. 3 masters × 3 slaves gives 9 directed paths.]
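
The ID re-mapping step can be sketched with three small functions. The parameter values below (3 masters, 4-bit master-side IDs) are assumptions chosen to match a 3-master crossbar, and all names are hypothetical.

```systemverilog
package axi_xbar_id_pkg;
  localparam int N_MASTERS = 3;                   // assumed number of master ports
  localparam int M_IDW     = 4;                   // assumed master-side ID width
  localparam int IDX_W     = $clog2(N_MASTERS);   // extra ID bits added by the crossbar
  localparam int S_IDW     = M_IDW + IDX_W;       // ID width seen by the slaves

  // Request path: tag the master's ID with the index of the port it arrived on.
  function automatic logic [S_IDW-1:0] to_slave_id(
    input logic [IDX_W-1:0] master_idx,
    input logic [M_IDW-1:0] id
  );
    return {master_idx, id};
  endfunction

  // Response path: the upper bits select the master to route back to...
  function automatic logic [IDX_W-1:0] rsp_master(input logic [S_IDW-1:0] sid);
    return sid[S_IDW-1:M_IDW];
  endfunction

  // ...and the lower bits are the ID that master originally issued.
  function automatic logic [M_IDW-1:0] rsp_id(input logic [S_IDW-1:0] sid);
    return sid[M_IDW-1:0];
  endfunction
endpackage
```

With these assumptions a slave sees 6-bit IDs: ID 5 from master 2 becomes 6'b10_0101, and the response is routed back to master 2 carrying ID 5. Any slave behind such a crossbar has to carry all S_IDW bits from AxID through to BID/RID.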
19

Common AXI Pitfalls

Pitfall: VALID → READY dependency

Source must not wait for READY before asserting VALID. Creating that combinational loop deadlocks the bus and breaks STA.

Pitfall: Changing payload while VALID is high

Once VALID is asserted, the payload (ADDR, DATA, etc.) must not change until the handshake completes, i.e. a rising ACLK edge with VALID and READY both high.

Pitfall: RLAST / WLAST missed or premature

RLAST must be high on exactly the last beat. Generating it too early or too late confuses the master's burst counter.

Pitfall: Burst crosses 4 KB boundary

Protocol-illegal. Split at the master or have the slave error out.

Pitfall: Write response before all W-beats

Slave must accept the AW and the last W beat (WLAST) before issuing B. Some slaves combine this with an internal write buffer; you still must respect the ordering.

Pitfall: Deadlock rings

If Master A is waiting on Slave X's B channel while A holds up Slave Y's AR queue that X in turn waits on, the bus deadlocks. Topology and depth of outstanding transactions must be analysed.

Pitfall: ID width explosion

Crossbars widen AxID by ceil(log2(N_masters)) bits. If a downstream slave hard-codes the narrower ID width, the routing bits are truncated and responses cannot be steered back to the right master.

Pitfall: AxCACHE mismatch vs MMU attributes

CPU's MMU attributes must match what the CPU drives on AxCACHE, or coherency contracts break.

20

Protocol Comparison — AXI3 vs AXI4 vs AXI5

Feature             AXI3              AXI4              AXI5
INCR burst len      1–16              1–256             1–256
Write interleave    yes               no                no
QoS / Region        no                yes               yes
USER sideband       no                yes               yes
Atomics             Exclusive only    Exclusive only    Atomic*
Cache stashing      no                no                yes
RME / MTE hooks     no                no                yes

Which will you see in practice?

  • AXI3: legacy IP, FPGA interfaces from the Zynq-7000 era.
  • AXI4: dominant today — Cortex-A53/A72/A76, Mali GPU, most ASIC design.
  • AXI5: Neoverse N2/V2, Cortex-M85, new NPU/accelerator designs requiring atomics/MTE.
21

Minimal Hardware — SystemVerilog Sketches

AXI4-Stream loopback — literally wires

No registers at all: just pass VALID/DATA/LAST forward and READY backward. Zero latency, zero area. A perfect DV sanity-check fabric for any Stream DUT.

module axis_loopback #(parameter W = 64) (
  // slave (sink-side) port
  input  logic           s_tvalid,
  output logic           s_tready,
  input  logic [W-1:0]   s_tdata,
  input  logic           s_tlast,
  // master (source-side) port
  output logic           m_tvalid,
  input  logic           m_tready,
  output logic [W-1:0]   m_tdata,
  output logic           m_tlast
);
  assign m_tvalid = s_tvalid;
  assign m_tdata  = s_tdata;
  assign m_tlast  = s_tlast;
  assign s_tready = m_tready;   // back-pressure pass-through
endmodule

Because AXI4-Stream has no address, no ID, and no response channel, a pure combinational pass-through is legal & protocol-compliant.

Single-register pipeline stage (skid buffer)

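A skid buffer re-times a VALID/READY channel without breaking the protocol's "VALID must not depend combinationally on READY" rule. The version below registers only the READY path: s_ready comes straight from a flop, while VALID and data pass through combinationally unless the skid register is holding a stalled beat. Drop it into any AXI channel where READY is the critical path.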

module axi_skid #(parameter W = 32) (
  input  logic         clk, rstn,
  input  logic         s_valid, output logic s_ready,
  input  logic [W-1:0] s_data,
  output logic         m_valid, input  logic m_ready,
  output logic [W-1:0] m_data
);
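  logic [W-1:0] skid_q;   // one beat that upstream sent while downstream was stalled
  logic         skid_v;   // 1 = skid_q holds a beat that has not been delivered yet

  always_ff @(posedge clk or negedge rstn)
    if (!rstn) {skid_v, skid_q} <= '0;
    // upstream beat accepted but downstream not ready: park it
    else if (s_valid && s_ready && !m_ready)
           {skid_v, skid_q} <= {1'b1, s_data};
    // downstream took the parked beat (or nothing was parked): skid is empty
    else if ( m_ready) skid_v <= 1'b0;

  assign m_valid = s_valid | skid_v;           // never depends on m_ready
  assign m_data  = skid_v ? skid_q : s_data;   // a parked beat goes out first
  assign s_ready = !skid_v;                    // driven by a flop: cuts the READY path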
endmodule

Variants of this small cell sit in most AXI crossbars, bridges, and clock-domain-crossing FIFO front-ends.

22

Minimal AXI4-Lite Slave (one register + loopback)

module axil_loopback_reg (
  input  logic        ACLK, ARESETn,
  // AW / W / B
  input  logic        AWVALID, input logic [31:0] AWADDR,
  output logic        AWREADY,
  input  logic        WVALID,  input logic [31:0] WDATA,
  input  logic [3:0]  WSTRB,   output logic WREADY,
  output logic        BVALID,  output logic [1:0] BRESP,
  input  logic        BREADY,
  // AR / R
  input  logic        ARVALID, input logic [31:0] ARADDR,
  output logic        ARREADY,
  output logic        RVALID,  output logic [31:0] RDATA,
  output logic [1:0]  RRESP,   input  logic RREADY
);
  logic [31:0] reg_q, rdata_q;
  logic        aw_hs, w_hs, b_pend, r_pend;

  // Accept AW and W only in the same cycle, so neither half of a write is dropped.
  // A sink may make READY depend on VALID; only the source-side reverse is forbidden.
  assign AWREADY =  !b_pend && WVALID;
  assign WREADY  =  !b_pend && AWVALID;
  assign aw_hs   =  AWVALID && AWREADY;
  assign w_hs    =  WVALID  && WREADY;

  always_ff @(posedge ACLK or negedge ARESETn)
    if (!ARESETn) {reg_q, rdata_q, b_pend, r_pend} <= '0;
    else begin
      // write: byte-strobed update, then hold BVALID until BREADY
      if (aw_hs && w_hs) begin
        for (int i = 0; i < 4; i++)
          if (WSTRB[i]) reg_q[i*8 +: 8] <= WDATA[i*8 +: 8];
        b_pend <= 1'b1;
      end else if (BVALID && BREADY) b_pend <= 1'b0;

      // read: capture the register so RDATA stays stable while RVALID is high
      if (ARVALID && ARREADY) begin
        rdata_q <= reg_q;
        r_pend  <= 1'b1;
      end else if (RVALID && RREADY) r_pend <= 1'b0;
    end

  assign ARREADY = !r_pend;
  assign BVALID  =  b_pend;  assign BRESP = 2'b00;
  assign RVALID  =  r_pend;  assign RDATA = rdata_q;  assign RRESP = 2'b00;
endmodule

What "minimal" gets you

  • One 32-bit register.
  • One write pending flop, one read pending flop.
  • Byte-strobed writes (WSTRB).
  • Always OKAY response.
  • ID-less (AXI4-Lite).
  • No bursts, no exclusive, no QoS.

Loopback trick

Read returns whatever was last written — so this slave doubles as a write-read loopback for DV. Point any master's AXI test at it and bounce data off. No memory model needed.

For full AXI4 (with AxID, bursts, outstanding): add a small FIFO per channel, widen AWID/ARID through to BID/RID, and decode AWLEN. The minimal full-AXI4 slave is ~100 lines.
23

Interview-Ready Takeaways

  • "Why five channels?" → Independent backpressure for AW, W, B, AR, R. Lets the memory controller overlap reads and writes while respecting per-ID ordering.
  • "What does AxID buy you?" → Out-of-order responses across IDs; in-order within the same ID. Maps neatly to a CPU's load queue and a DRAM controller's reorder buffer.
  • "Why isn't WLAST redundant?" → Because the slave might not know AWLEN in time (gated by arbitration/clock crossings). WLAST is the single bit that authoritatively ends the burst.
  • "Why the 4 KB rule?" → 4 KB is the minimum address range of a slave, so a burst can never straddle two slaves; it also keeps a burst inside one MMU page, so it cannot span two pages with different permissions.
  • "What's the difference between AXI4-Lite and APB?" → Channel count (5 vs effectively 2), pipelining, and ecosystem. AXI4-Lite wins on integration, APB wins on area.
  • "When do you use AXI4-Stream?" → DSP/video/packet pipelines where there's no address — just a directed flow of data with TLAST packet boundaries.
  • "Why did AXI4 drop write interleaving?" → Nobody implemented it usefully; verification was expensive; AXI's response reordering already gave you the bandwidth you wanted.
  • "When do you use AXI5 atomics vs Exclusive?" → Atomics for contended counters and flags (one round trip, no retry); Exclusive for read-modify-write sequences the atomic opcodes cannot express, and for CPUs or slaves that predate AXI5 atomics.
24

References

Arm Ltd. AMBA AXI and ACE Protocol Specification (IHI 0022): AXI3, AXI4, AXI5
Arm Ltd. AMBA AXI-Stream Protocol Specification (IHI 0051)
Arm Ltd. AMBA AXI4-Lite FAQ and errata (Arm Developer website)
Arm Ltd. Neoverse N2 / V2 Technical Reference Manual: AXI5 port definitions, atomic transactions
Xilinx UG761 / UG1037. AXI Reference Guide: practical AXI4 / AXI4-Lite / AXI4-Stream examples
Xilinx PG059. AXI Interconnect v2.1: crossbar architecture reference
Siemens EDA / Cadence / Synopsys. AXI protocol checker docs & coverage models
Arm ABVIP. AMBA Protocol Verification IP (AXI, AXI-Stream): formal property libraries
Clifford Wolf, Dan Gisselquist. "Building AXI Infrastructure" blog series: pragmatic protocol-checker implementation notes
Wikipedia. "Advanced eXtensible Interface": well-sourced cross-references

Educational use.