AMD’s MI350 series, built on the CDNA 4 architecture, marks a significant leap in datacenter GPU compute. This post breaks down its architecture, programming model, hardware features, and what makes it distinct, both from its predecessor MI300X and from NVIDIA’s competing Blackwell lineup.


1. CDNA vs RDNA: Two Divergent Paths from GCN

AMD maintains two GPU architecture lines, both descended from GCN (Graphics Core Next) but optimized for very different workloads:

|  | CDNA (Datacenter) | RDNA (Consumer) |
|---|---|---|
| Target | HPC, AI/ML training & inference | Gaming, desktop graphics |
| Wavefront | Wave64 (64 threads) | Wave32 (32 threads, dual-issue Wave64) |
| Matrix Cores | Yes, dedicated | Minimal |
| Graphics HW | Stripped (no rasterization) | Full pipeline + ray tracing |
| FP64 | Full rate or 1:2 | 1:16 (vestigial) |
| Precision formats | FP64, FP32, BF16, FP16, FP8, FP6, FP4, INT8 | FP32, FP16 primarily |
| Memory | HBM with ECC | GDDR6, no ECC |
| Multi-GPU | Infinity Fabric scale-out | Not designed for it |

CDNA keeps Wave64 for throughput-oriented compute and strips the graphics pipeline entirely, dedicating die area to Matrix Cores and memory controllers. RDNA moved to Wave32 for lower-latency gaming and added Infinity Cache, a design choice irrelevant to datacenter workloads.

The MI350 is pure CDNA 4: no graphics heritage remains.


2. MI350 Specs at a Glance

| Spec | MI350X | MI355X | MI300X (CDNA 3) |
|---|---|---|---|
| Architecture | CDNA 4 | CDNA 4 | CDNA 3 |
| Process | TSMC N3 (3nm) | TSMC N3 (3nm) | TSMC 5nm/6nm |
| Compute Units | 256 | 256 | 304 |
| HBM | 288 GB HBM3E | 288 GB HBM3E | 192 GB HBM3 |
| Memory BW | 8 TB/s | 8 TB/s | 5.3 TB/s |
| FP16 (dense) | 2.39 PFLOPS | 2.52 PFLOPS | 1.31 PFLOPS |
| FP4 (dense) | — | 9.2 PFLOPS | N/A |
| FP16 w/ sparsity | 4.61 PFLOPS | 5.03 PFLOPS | 2.61 PFLOPS |
| TDP | 1000W | 1400W | 750W |

Fewer CUs than MI300X (256 vs 304), but substantially higher per-CU compute density thanks to the 3nm process shrink and architectural improvements. Memory jumps 50% in capacity and 51% in bandwidth.
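That per-CU uplift falls straight out of the table's figures. A back-of-the-envelope check (using the dense PFLOPS numbers and CU counts quoted above):

```python
# Back-of-the-envelope: per-CU compute density, using the dense
# matrix PFLOPS figures and CU counts from the spec table above.
def per_cu_tflops(pflops: float, cus: int) -> float:
    """Dense TFLOPS delivered per compute unit."""
    return pflops * 1000 / cus

mi350x = per_cu_tflops(2.39, 256)   # ~9.3 TFLOPS per CU
mi300x = per_cu_tflops(1.31, 304)   # ~4.3 TFLOPS per CU
print(f"MI350X: {mi350x:.1f} TFLOPS/CU, MI300X: {mi300x:.1f} TFLOPS/CU, "
      f"uplift: {mi350x / mi300x:.1f}x")
```

So each MI350X CU delivers roughly 2.2x the dense throughput of an MI300X CU, which is how 256 CUs outrun 304.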


3. Chiplet Architecture

Like MI300X, the MI350 uses a multi-die chiplet design:

  • Accelerator Complex Dies (XCDs), each containing CUs, shared L2 cache, and Async Compute Engines (ACEs)
  • I/O dies, handling Infinity Fabric, PCIe/CXL, and memory controllers
  • HBM3E stacks β€” vertically integrated with the package

All chiplets communicate via AMD Infinity Fabric, a coherent interconnect that provides:

  • Low-latency data transit between XCDs
  • Unified memory addressing across all chiplets
  • Multi-GPU scale-out for distributed training

This is architecturally similar to AMD’s EPYC CPU chiplet strategy, a design philosophy shared across the company’s server lineup.


4. What’s New in CDNA 4

Native FP4 and FP6

The headline feature. MI350 is AMD’s first architecture with native FP4/FP6 compute. MI355X hits 9.2 PFLOPS at FP4, which is what drives AMD’s claim of up to 35x inference performance over MI300X (which lacked FP4 entirely).

FP4 is particularly relevant for:

  • Quantized LLM inference (GPTQ, AWQ, etc.)
  • High-throughput serving where precision can be traded for speed
  • Prefill-heavy workloads in agentic systems

3nm Process

Moving from 5nm/6nm to TSMC N3 delivers:

  • Higher transistor density → more compute per die
  • Better power efficiency per FLOP (though absolute TDP rises to 1000-1400W)
  • A per-CU performance uplift despite fewer total CUs

Enhanced Matrix Cores

CDNA 3 already delivered 3x FP16/BF16 and 6.8x INT8 improvement over CDNA 2. CDNA 4 pushes this further with wider matrix units and support for the new low-precision formats.

288 GB HBM3E at 8 TB/s

For LLM workloads, memory capacity is often the binding constraint. 288 GB allows:

  • A 70B parameter model in FP4 (~35 GB) with massive KV cache headroom
  • A 70B model in FP16 (~140 GB) without tensor parallelism
  • Multiple smaller models co-located on a single GPU

8 TB/s bandwidth addresses the memory-bound nature of autoregressive decoding.
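
The claims above are simple arithmetic. A sketch of the sizing math, assuming weights dominate and ignoring activations and KV cache in the bandwidth bound:

```python
# Rough LLM sizing math behind the claims above.
def model_gb(params_b: float, bits: int) -> float:
    """Weight footprint in GB for `params_b` billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9

def decode_ceiling_tok_s(weights_gb: float, bw_tb_s: float) -> float:
    """Upper bound on single-stream decode throughput: autoregressive
    decoding reads every weight once per generated token."""
    return bw_tb_s * 1e12 / (weights_gb * 1e9)

fp4_gb = model_gb(70, 4)    # 35 GB
fp16_gb = model_gb(70, 16)  # 140 GB
print(f"70B @ FP4:  {fp4_gb:.0f} GB -> "
      f"{decode_ceiling_tok_s(fp4_gb, 8):.0f} tok/s ceiling at 8 TB/s")
print(f"70B @ FP16: {fp16_gb:.0f} GB -> "
      f"{decode_ceiling_tok_s(fp16_gb, 8):.0f} tok/s ceiling at 8 TB/s")
```

This is why bandwidth and quantization compound: FP4 quarters the bytes moved per token, quadrupling the decode ceiling before any compute improvement enters the picture.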


5. Programming Model: ROCm & HIP

AMD’s software stack is ROCm (Radeon Open Compute), built on open-source foundations:

HIP (Heterogeneous-compute Interface for Portability)

HIP is syntactically near-identical to CUDA:

// CUDA
__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
cudaMalloc(&d_a, size);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

// HIP: nearly identical
__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
hipMalloc(&d_a, size);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

The key difference: HIP compiles to both AMD (via Clang + LLVM AMDGPU backend) and NVIDIA (via NVCC) targets. Write once, run on both.

HIPIFY automates CUDA β†’ HIP source translation for existing codebases.
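
The real tools are hipify-perl and hipify-clang; purely to show how mechanical the mapping is, here is a toy renamer over a handful of API names (the names in the dictionary are real CUDA/HIP functions, but this sketch is no substitute for HIPIFY, which works on the full AST):

```python
import re

# Toy CUDA -> HIP renamer, illustrating what HIPIFY automates at scale.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    # Longest names first so cudaMemcpyHostToDevice wins over cudaMemcpy.
    names = sorted(CUDA_TO_HIP, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(names) + r")\b")
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)

cuda_src = "cudaMalloc(&d_a, size);\ncudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);"
print(toy_hipify(cuda_src))
```

For most CUDA code the port really is this shallow; the hard residue is inline PTX, warp-size-32 assumptions, and libraries without ROCm equivalents.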

Key Libraries

| AMD (ROCm) | NVIDIA (CUDA) | Function |
|---|---|---|
| rocBLAS | cuBLAS | Linear algebra |
| MIOpen | cuDNN | DL primitives |
| RCCL | NCCL | Collective comms |
| rocFFT | cuFFT | FFT |
| rocRAND | cuRAND | RNG |
| hipSOLVER | cuSOLVER | LAPACK solvers |

Framework Support

PyTorch, TensorFlow, JAX, ONNX Runtime, vLLM, and llama.cpp all have ROCm backends. PyTorch’s ROCm support is the most mature: most operations work out of the box.
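
One way to check which backend a given PyTorch build targets: on ROCm wheels, `torch.version.hip` carries the ROCm version string, while on CUDA builds it is `None`. A hedged, best-effort probe (it degrades gracefully if torch isn't installed):

```python
def torch_backend() -> str:
    """Best-effort report of the installed PyTorch build's GPU backend."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    hip = getattr(torch.version, "hip", None)
    if hip:
        # ROCm wheels deliberately keep the familiar torch.cuda.* API,
        # so existing CUDA-targeted Python code runs unchanged.
        return f"ROCm/HIP {hip} (devices visible: {torch.cuda.is_available()})"
    if torch.version.cuda:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

print(torch_backend())
```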

Software Stack

┌─────────────────────────────────────────┐
│  PyTorch / TensorFlow / JAX / vLLM      │
├─────────────────────────────────────────┤
│  MIOpen · rocBLAS · RCCL · rocFFT       │
├─────────────────────────────────────────┤
│  HIP Runtime · OpenCL · OpenMP          │
├─────────────────────────────────────────┤
│  ROCclr (Common Language Runtime)       │
├─────────────────────────────────────────┤
│  ROCr (Runtime) · ROCt (Thunk)          │
├─────────────────────────────────────────┤
│  ROCk (Kernel Driver - amdgpu)          │
└─────────────────────────────────────────┘

Nearly the entire stack (firmware excepted) is open source under permissive MIT-style licenses, a meaningful differentiator from NVIDIA’s proprietary CUDA ecosystem.


6. MI350 vs NVIDIA Blackwell (B200)

|  | MI350X | MI355X | B200 |
|---|---|---|---|
| Process | TSMC 3nm | TSMC 3nm | TSMC 4NP |
| Memory | 288 GB HBM3E | 288 GB HBM3E | 192 GB HBM3E |
| Memory BW | 8 TB/s | 8 TB/s | ~8 TB/s |
| FP4 | High | 9.2 PFLOPS | ~10 PFLOPS |
| TDP | 1000W | 1400W | ~1000W |
| Interconnect | Infinity Fabric | Infinity Fabric | NVLink 5 (1.8 TB/s) |
| Software | ROCm (open) | ROCm (open) | CUDA (proprietary) |
On paper, MI355X matches B200 in FP4 throughput but at 40% higher power. MI350X is the closer competitor in the same 1000W envelope but with lower FP4 throughput.
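
The power trade-off in numbers, using the figures quoted above (the B200 values are approximate):

```python
# FP4 throughput per watt, from the comparison table's (approximate) figures.
def tflops_per_watt(pflops: float, tdp_w: float) -> float:
    return pflops * 1000 / tdp_w

mi355x = tflops_per_watt(9.2, 1400)   # ~6.6 TFLOPS/W
b200 = tflops_per_watt(10.0, 1000)    # ~10 TFLOPS/W, on approximate specs
print(f"MI355X: {mi355x:.1f} TFLOPS/W vs B200: ~{b200:.1f} TFLOPS/W at FP4")
```

At datacenter scale, that per-watt gap translates directly into cooling and power-delivery cost, which is why the 1000W MI350X exists alongside the 1400W MI355X.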

NVIDIA’s advantage remains in:

  • NVLink 5 bandwidth (1.8 TB/s GPU-to-GPU) for multi-GPU scaling
  • CUDA ecosystem maturity: libraries, tooling, profiling, community
  • Tensor Core specialization with Transformer Engine and FP8 training recipes

AMD’s advantages:

  • Open-source stack: full visibility into compiler, runtime, and libraries
  • Memory capacity: 288 GB of HBM3E versus B200’s 192 GB, at comparable bandwidth
  • Price/performance: historically more competitive on $/FLOP
  • HIPIFY migration path: lower barrier for porting CUDA codebases

7. What Sets MI350 Apart

For inference at scale, the MI350’s combination of native FP4, 288 GB HBM3E, and 8 TB/s bandwidth creates a compelling package. The 35x inference claim over MI300X (for FP4 workloads) is real: MI300X simply had no FP4 hardware.

For training, the story is more nuanced. Raw FLOPS are competitive, but NVIDIA’s interconnect (NVLink 5 + NVSwitch) remains ahead for large-scale distributed training where GPU-to-GPU bandwidth is the bottleneck.

For the open-source ecosystem, ROCm’s MIT licensing means you can inspect, modify, and optimize every layer of the stack. For teams building custom kernels or debugging performance at the hardware level, this transparency matters.

The real question for MI350 adoption isn’t hardware specs; it’s software maturity. The gap is closing. PyTorch on ROCm is production-ready. vLLM runs on AMD. But the long tail of custom CUDA kernels, Flash Attention variants, and training recipes still favors NVIDIA. Each generation of ROCm narrows this gap, and MI350 paired with ROCm 6.x is the strongest AMD has been.


References