AMD’s MI350 series, built on the CDNA 4 architecture, marks a significant leap in datacenter GPU compute. This post breaks down its architecture, programming model, hardware features, and what makes it distinct, both from its predecessor MI300X and from NVIDIA’s competing Blackwell lineup.


1. CDNA vs RDNA: Two Divergent Paths from GCN

AMD maintains two GPU architecture lines, both descended from GCN (Graphics Core Next) but optimized for very different workloads:

|  | CDNA (Datacenter) | RDNA (Consumer) |
|---|---|---|
| Target | HPC, AI/ML training & inference | Gaming, desktop graphics |
| Wavefront | Wave64 (64 threads) | Wave32 (32 threads, dual-issue Wave64) |
| Matrix Cores | Yes, dedicated | Minimal |
| Graphics HW | Stripped (no rasterization) | Full pipeline + ray tracing |
| FP64 | Full rate or 1:2 | 1:16 (vestigial) |
| Precision formats | FP64, FP32, BF16, FP16, FP8, FP6, FP4, INT8 | FP32, FP16 primarily |
| Memory | HBM with ECC | GDDR6, no ECC |
| Multi-GPU | Infinity Fabric scale-out | Not designed for it |

CDNA keeps Wave64 for throughput-oriented compute and strips the graphics pipeline entirely, dedicating die area to Matrix Cores and memory controllers. RDNA moved to Wave32 for lower-latency gaming and added Infinity Cache, a design choice irrelevant to datacenter workloads.

The MI350 is pure CDNA 4: no graphics heritage remains.


2. MI350 Specs at a Glance

| Spec | MI350X | MI355X | MI300X (CDNA 3) |
|---|---|---|---|
| Architecture | CDNA 4 | CDNA 4 | CDNA 3 |
| Process | TSMC N3 (3nm) | TSMC N3 (3nm) | TSMC 5nm/6nm |
| Compute Units | 256 | 256 | 304 |
| HBM | 288 GB HBM3E | 288 GB HBM3E | 192 GB HBM3 |
| Memory BW | 8 TB/s | 8 TB/s | 5.3 TB/s |
| FP16 (dense) | 2.39 PFLOPS | 2.52 PFLOPS | 1.31 PFLOPS |
| FP4 (dense) | — | 9.2 PFLOPS | N/A |
| FP16 w/ sparsity | 4.61 PFLOPS | 5.03 PFLOPS | 2.61 PFLOPS |
| TDP | 1000W | 1400W | 750W |

Fewer CUs than MI300X (256 vs 304), but substantially higher per-CU compute density thanks to the 3nm process shrink and architectural improvements. Memory jumps 50% in capacity and 51% in bandwidth.
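That per-CU uplift falls straight out of the table's figures. A back-of-the-envelope check (using the dense PFLOPS numbers and CU counts quoted above):

```python
# Back-of-the-envelope: per-CU compute density, using the dense
# matrix PFLOPS figures and CU counts from the spec table above.
def per_cu_tflops(pflops: float, cus: int) -> float:
    """Dense TFLOPS delivered per compute unit."""
    return pflops * 1000 / cus

mi350x = per_cu_tflops(2.39, 256)   # ~9.3 TFLOPS per CU
mi300x = per_cu_tflops(1.31, 304)   # ~4.3 TFLOPS per CU
print(f"MI350X: {mi350x:.1f} TFLOPS/CU, MI300X: {mi300x:.1f} TFLOPS/CU, "
      f"uplift: {mi350x / mi300x:.1f}x")
```

So each MI350X CU delivers roughly 2.2x the dense throughput of an MI300X CU, which is how 256 CUs outrun 304.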


3. Chiplet Architecture

Like MI300X, the MI350 uses a multi-die chiplet design:

  • Accelerator Complex Dies (XCDs), each containing CUs, shared L2 cache, and Async Compute Engines (ACEs)
  • I/O dies, handling Infinity Fabric, PCIe/CXL, and memory controllers
  • HBM3E stacks β€” vertically integrated with the package

All chiplets communicate via AMD Infinity Fabric, a coherent interconnect that provides:

  • Low-latency data transit between XCDs
  • Unified memory addressing across all chiplets
  • Multi-GPU scale-out for distributed training

This is architecturally similar to AMD’s EPYC CPU chiplet strategy, a design philosophy shared across the company’s server lineup.


4. What’s New in CDNA 4

Native FP4 and FP6

The headline feature. MI350 is AMD’s first architecture with native FP4/FP6 compute. MI355X hits 9.2 PFLOPS at FP4, which is what drives AMD’s claim of up to 35x inference performance over MI300X (which lacked FP4 entirely).

FP4 is particularly relevant for:

  • Quantized LLM inference (GPTQ, AWQ, etc.)
  • High-throughput serving where precision can be traded for speed
  • Prefill-heavy workloads in agentic systems

3nm Process

Moving from 5nm/6nm to TSMC N3 delivers:

  • Higher transistor density → more compute per die
  • Better power efficiency per FLOP (though absolute TDP rises to 1000-1400W)
  • A per-CU performance uplift despite fewer total CUs

Enhanced Matrix Cores

CDNA 3 already delivered 3x FP16/BF16 and 6.8x INT8 improvement over CDNA 2. CDNA 4 pushes this further with wider matrix units and support for the new low-precision formats.

288 GB HBM3E at 8 TB/s

For LLM workloads, memory capacity is often the binding constraint. 288 GB allows:

  • A 70B parameter model in FP4 (~35 GB) with massive KV cache headroom
  • A 70B model in FP16 (~140 GB) without tensor parallelism
  • Multiple smaller models co-located on a single GPU

8 TB/s bandwidth addresses the memory-bound nature of autoregressive decoding.
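
The claims above are simple arithmetic. A sketch of the sizing math, assuming weights dominate and ignoring activations and KV cache in the bandwidth bound:

```python
# Rough LLM sizing math behind the claims above.
def model_gb(params_b: float, bits: int) -> float:
    """Weight footprint in GB for `params_b` billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9

def decode_ceiling_tok_s(weights_gb: float, bw_tb_s: float) -> float:
    """Upper bound on single-stream decode throughput: autoregressive
    decoding reads every weight once per generated token."""
    return bw_tb_s * 1e12 / (weights_gb * 1e9)

fp4_gb = model_gb(70, 4)    # 35 GB
fp16_gb = model_gb(70, 16)  # 140 GB
print(f"70B @ FP4:  {fp4_gb:.0f} GB -> "
      f"{decode_ceiling_tok_s(fp4_gb, 8):.0f} tok/s ceiling at 8 TB/s")
print(f"70B @ FP16: {fp16_gb:.0f} GB -> "
      f"{decode_ceiling_tok_s(fp16_gb, 8):.0f} tok/s ceiling at 8 TB/s")
```

This is why bandwidth and quantization compound: FP4 quarters the bytes moved per token, quadrupling the decode ceiling before any compute improvement enters the picture.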


5. Programming Model: ROCm & HIP

AMD’s software stack is ROCm (Radeon Open Compute), built on open-source foundations:

HIP (Heterogeneous-compute Interface for Portability)

HIP is syntactically near-identical to CUDA:

// CUDA
__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
cudaMalloc(&d_a, size);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

// HIP: nearly identical
__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
hipMalloc(&d_a, size);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

The key difference: HIP compiles to both AMD (via Clang + LLVM AMDGPU backend) and NVIDIA (via NVCC) targets. Write once, run on both.

HIPIFY automates CUDA β†’ HIP source translation for existing codebases.
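
The real tools are hipify-perl and hipify-clang; purely to show how mechanical the mapping is, here is a toy renamer over a handful of API names (the names in the dictionary are real CUDA/HIP functions, but this sketch is no substitute for HIPIFY, which works on the full AST):

```python
import re

# Toy CUDA -> HIP renamer, illustrating what HIPIFY automates at scale.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    # Longest names first so cudaMemcpyHostToDevice wins over cudaMemcpy.
    names = sorted(CUDA_TO_HIP, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(names) + r")\b")
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)

cuda_src = "cudaMalloc(&d_a, size);\ncudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);"
print(toy_hipify(cuda_src))
```

For most CUDA code the port really is this shallow; the hard residue is inline PTX, warp-size-32 assumptions, and libraries without ROCm equivalents.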

Key Libraries

| AMD (ROCm) | NVIDIA (CUDA) | Function |
|---|---|---|
| rocBLAS | cuBLAS | Linear algebra |
| MIOpen | cuDNN | DL primitives |
| RCCL | NCCL | Collective comms |
| rocFFT | cuFFT | FFT |
| rocRAND | cuRAND | RNG |
| hipSOLVER | cuSOLVER | LAPACK solvers |

Framework Support

PyTorch, TensorFlow, JAX, ONNX Runtime, vLLM, and llama.cpp all have ROCm backends. PyTorch’s ROCm support is the most mature: most operations work out of the box.
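
One way to check which backend a given PyTorch build targets: on ROCm wheels, `torch.version.hip` carries the ROCm version string, while on CUDA builds it is `None`. A hedged, best-effort probe (it degrades gracefully if torch isn't installed):

```python
def torch_backend() -> str:
    """Best-effort report of the installed PyTorch build's GPU backend."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    hip = getattr(torch.version, "hip", None)
    if hip:
        # ROCm wheels deliberately keep the familiar torch.cuda.* API,
        # so existing CUDA-targeted Python code runs unchanged.
        return f"ROCm/HIP {hip} (devices visible: {torch.cuda.is_available()})"
    if torch.version.cuda:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

print(torch_backend())
```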

Software Stack

┌─────────────────────────────────────────┐
│  PyTorch / TensorFlow / JAX / vLLM      │
├─────────────────────────────────────────┤
│  MIOpen · rocBLAS · RCCL · rocFFT       │
├─────────────────────────────────────────┤
│  HIP Runtime · OpenCL · OpenMP          │
├─────────────────────────────────────────┤
│  ROCclr (Common Language Runtime)       │
├─────────────────────────────────────────┤
│  ROCr (Runtime) · ROCt (Thunk)          │
├─────────────────────────────────────────┤
│  ROCk (Kernel Driver - amdgpu)          │
└─────────────────────────────────────────┘

Nearly the entire stack (firmware excepted) is open source under permissive MIT-style licenses, a meaningful differentiator from NVIDIA’s proprietary CUDA ecosystem.


6. MI350 vs NVIDIA Blackwell (B200)

|  | MI350X | MI355X | B200 |
|---|---|---|---|
| Process | TSMC 3nm | TSMC 3nm | TSMC 4NP |
| Memory | 288 GB HBM3E | 288 GB HBM3E | 192 GB HBM3E |
| Memory BW | 8 TB/s | 8 TB/s | ~8 TB/s |
| FP4 | High | 9.2 PFLOPS | ~10 PFLOPS |
| TDP | 1000W | 1400W | ~1000W |
| Interconnect | Infinity Fabric | Infinity Fabric | NVLink 5 (1.8 TB/s) |
| Software | ROCm (open) | ROCm (open) | CUDA (proprietary) |
On paper, MI355X matches B200 in FP4 throughput but at 40% higher power. MI350X is the closer competitor in the same 1000W envelope but with lower FP4 throughput.
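
The power trade-off in numbers, using the figures quoted above (the B200 values are approximate):

```python
# FP4 throughput per watt, from the comparison table's (approximate) figures.
def tflops_per_watt(pflops: float, tdp_w: float) -> float:
    return pflops * 1000 / tdp_w

mi355x = tflops_per_watt(9.2, 1400)   # ~6.6 TFLOPS/W
b200 = tflops_per_watt(10.0, 1000)    # ~10 TFLOPS/W, on approximate specs
print(f"MI355X: {mi355x:.1f} TFLOPS/W vs B200: ~{b200:.1f} TFLOPS/W at FP4")
```

At datacenter scale, that per-watt gap translates directly into cooling and power-delivery cost, which is why the 1000W MI350X exists alongside the 1400W MI355X.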

NVIDIA’s advantage remains in:

  • NVLink 5 bandwidth (1.8 TB/s GPU-to-GPU) for multi-GPU scaling
  • CUDA ecosystem maturity: libraries, tooling, profiling, community
  • Tensor Core specialization with Transformer Engine and FP8 training recipes

AMD’s advantages:

  • Open-source stack: full visibility into compiler, runtime, and libraries
  • Memory capacity: 288 GB of HBM3E versus B200’s 192 GB, at comparable bandwidth
  • Price/performance: historically more competitive on $/FLOP
  • HIPIFY migration path: lower barrier for porting CUDA codebases

7. What Sets MI350 Apart

For inference at scale, the MI350’s combination of native FP4, 288 GB HBM3E, and 8 TB/s bandwidth creates a compelling package. The 35x inference claim over MI300X (for FP4 workloads) is real: MI300X simply had no FP4 hardware.

For training, the story is more nuanced. Raw FLOPS are competitive, but NVIDIA’s interconnect (NVLink 5 + NVSwitch) remains ahead for large-scale distributed training where GPU-to-GPU bandwidth is the bottleneck.

For the open-source ecosystem, ROCm’s MIT licensing means you can inspect, modify, and optimize every layer of the stack. For teams building custom kernels or debugging performance at the hardware level, this transparency matters.

The real question for MI350 adoption isn’t hardware specs; it’s software maturity. The gap is closing. PyTorch on ROCm is production-ready. vLLM runs on AMD. But the long tail of custom CUDA kernels, Flash Attention variants, and training recipes still favors NVIDIA. Each generation of ROCm narrows this gap, and MI350 paired with ROCm 6.x is the strongest AMD has been.


References