AMD's MI350 series, built on the CDNA 4 architecture, marks a significant leap in datacenter GPU compute. This post breaks down its architecture, programming model, hardware features, and what makes it distinct from both its predecessor MI300X and NVIDIA's competing Blackwell lineup.
1. CDNA vs RDNA: Two Divergent Paths from GCN
AMD maintains two GPU architecture lines, both descended from GCN (Graphics Core Next) but optimized for very different workloads:
| | CDNA (Datacenter) | RDNA (Consumer) |
|---|---|---|
| Target | HPC, AI/ML training & inference | Gaming, desktop graphics |
| Wavefront | Wave64 (64 threads) | Wave32 (32 threads, dual-issue Wave64) |
| Matrix Cores | Yes, dedicated | Minimal |
| Graphics HW | Stripped (no rasterization) | Full pipeline + ray tracing |
| FP64 | Full rate or 1:2 | 1:16 (vestigial) |
| Precision formats | FP64, FP32, BF16, FP16, FP8, FP6, FP4, INT8 | FP32, FP16 primarily |
| Memory | HBM with ECC | GDDR6/6X, no ECC |
| Multi-GPU | Infinity Fabric scale-out | Not designed for it |
CDNA keeps Wave64 for throughput-oriented compute and strips the graphics pipeline entirely, dedicating die area to Matrix Cores and memory controllers. RDNA moved to Wave32 for lower-latency gaming and added Infinity Cache, a design choice irrelevant to datacenter workloads.
The MI350 is pure CDNA 4: no graphics heritage remains.
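The wavefront width is the unit the hardware schedules: a workgroup is split into ceil(size / wave_size) wavefronts. A tiny illustrative sketch of that arithmetic (plain Python, not tied to any AMD API):

```python
# Illustrative only: how one workgroup decomposes into hardware
# wavefronts under CDNA's wave64 vs RDNA's wave32 execution model.
def wavefronts_per_workgroup(workgroup_size: int, wave_size: int) -> int:
    """Ceiling division: wavefronts needed to cover the workgroup."""
    return -(-workgroup_size // wave_size)

# A 256-thread workgroup:
print(wavefronts_per_workgroup(256, 64))  # CDNA (wave64) -> 4
print(wavefronts_per_workgroup(256, 32))  # RDNA (wave32) -> 8
```

Fewer, wider wavefronts suit throughput workloads with abundant parallelism; narrower waves reduce divergence penalties in latency-sensitive graphics code.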
2. MI350 Specs at a Glance
| Spec | MI350X | MI355X | MI300X (CDNA 3) |
|---|---|---|---|
| Architecture | CDNA 4 | CDNA 4 | CDNA 3 |
| Process | TSMC N3 (3nm) | TSMC N3 (3nm) | TSMC 5nm/6nm |
| Compute Units | 256 | 256 | 304 |
| HBM | 288 GB HBM3E | 288 GB HBM3E | 192 GB HBM3 |
| Memory BW | 8 TB/s | 8 TB/s | 5.3 TB/s |
| FP16 (dense) | 2.39 PFLOPS | 2.52 PFLOPS | 1.31 PFLOPS |
| FP4 (dense) | – | 9.2 PFLOPS | N/A |
| FP16 w/ Sparsity | 4.61 PFLOPS | 5.03 PFLOPS | 2.61 PFLOPS |
| TDP | 1000W | 1400W | 750W |
Fewer CUs than MI300X (256 vs 304), but substantially higher per-CU compute density thanks to the 3nm process shrink and architectural improvements. Memory jumps 50% in capacity and 51% in bandwidth.
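That per-CU density claim can be sanity-checked against the table's own dense-throughput and CU figures (a back-of-envelope Python check):

```python
# Back-of-envelope per-CU compute density from the spec table above:
# dense matrix throughput (in TFLOPS) divided by compute-unit count.
mi350x_tflops_per_cu = 2.39e3 / 256   # 2.39 PFLOPS over 256 CUs
mi300x_tflops_per_cu = 1.31e3 / 304   # 1.31 PFLOPS over 304 CUs
uplift = mi350x_tflops_per_cu / mi300x_tflops_per_cu

print(f"MI350X: {mi350x_tflops_per_cu:.1f} TFLOPS/CU")  # ~9.3
print(f"MI300X: {mi300x_tflops_per_cu:.1f} TFLOPS/CU")  # ~4.3
print(f"Per-CU uplift: {uplift:.2f}x")                  # ~2.17x
```

Roughly a 2.2x jump per CU, which is how the MI350 outruns MI300X despite shedding 48 CUs.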
3. Chiplet Architecture
Like MI300X, the MI350 uses a multi-die chiplet design:
- Accelerator Complex Dies (XCDs), each containing CUs, shared L2 cache, and Async Compute Engines (ACEs)
- I/O dies, handling Infinity Fabric, PCIe/CXL, and memory controllers
- HBM3E stacks, integrated directly on the package
All chiplets communicate via AMD Infinity Fabric, a coherent interconnect that provides:
- Low-latency data transit between XCDs
- Unified memory addressing across all chiplets
- Multi-GPU scale-out for distributed training
This is architecturally similar to AMD's EPYC CPU chiplet strategy, a shared design philosophy across their server product line.
4. Whatβs New in CDNA 4
Native FP4 and FP6
The headline feature. MI350 is AMD's first architecture with native FP4/FP6 compute. MI355X hits 9.2 PFLOPS at FP4; this is what drives AMD's claim of 35x inference performance over MI300X (which lacked FP4 entirely).
FP4 is particularly relevant for:
- Quantized LLM inference (GPTQ, AWQ, etc.)
- High-throughput serving where precision can be traded for speed
- Prefill-heavy workloads in agentic systems
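To make the precision trade-off concrete, here is a minimal round-to-nearest sketch of the FP4 (E2M1) value grid in Python. The representable values follow the OCP microscaling FP4 definition; this illustrates the format only, not AMD's hardware datapath, and real quantizers (GPTQ, AWQ, MXFP4) add per-group scale factors on top:

```python
# Round-to-nearest quantization onto the FP4 (E2M1) value grid.
# E2M1 represents +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}; out-of-range
# magnitudes saturate to +/-6.
FP4_GRID = sorted({s * m for s in (-1.0, 1.0)
                   for m in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)})

def quantize_fp4(x: float) -> float:
    """Snap x to the nearest representable E2M1 value."""
    return min(FP4_GRID, key=lambda v: abs(v - x))

print(quantize_fp4(0.7))    # -> 0.5
print(quantize_fp4(2.6))    # -> 3.0
print(quantize_fp4(100.0))  # -> 6.0 (saturation)
```

Only 15 distinct values exist, which is why FP4 works for weight storage (with per-block scales) but not for accumulation; the hardware accumulates in higher precision.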
3nm Process
Moving from 5nm/6nm to TSMC N3 delivers:
- Higher transistor density, so more compute per die
- Better power efficiency per FLOP (though absolute TDP rises to 1000-1400W)
- The per-CU performance uplift that compensates for the lower total CU count
Enhanced Matrix Cores
CDNA 3 already delivered 3x FP16/BF16 and 6.8x INT8 improvement over CDNA 2. CDNA 4 pushes this further with wider matrix units and support for the new low-precision formats.
288 GB HBM3E at 8 TB/s
For LLM workloads, memory capacity is often the binding constraint. 288 GB allows:
- A 70B parameter model in FP4 (~35 GB) with massive KV cache headroom
- A 70B model in FP16 (~140 GB) without tensor parallelism
- Multiple smaller models co-located on a single GPU
8 TB/s bandwidth addresses the memory-bound nature of autoregressive decoding.
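The bullet points above are straightforward arithmetic; a hedged sketch (decimal GB, weights only: KV cache, activations, and the gap between peak and attainable bandwidth are all ignored):

```python
# Rough footprint and decode-rate arithmetic behind the figures above.
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def decode_tokens_per_s_ceiling(model_gb: float, bw_tb_s: float) -> float:
    """Upper bound on single-stream decode rate: each generated token
    must stream every weight from HBM once (KV-cache traffic ignored)."""
    return bw_tb_s * 1e12 / (model_gb * 1e9)

print(weights_gb(70, 4))    # 35.0  -> a 70B model in FP4
print(weights_gb(70, 16))   # 140.0 -> a 70B model in FP16, still < 288 GB
print(round(decode_tokens_per_s_ceiling(35.0, 8.0)))  # ~229 tokens/s
```

The last line shows why bandwidth, not FLOPS, bounds autoregressive decoding: at 8 TB/s a 35 GB model can be streamed at most ~229 times per second, regardless of compute throughput.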
5. Programming Model: ROCm & HIP
AMD's software stack is ROCm (Radeon Open Compute), built on open-source foundations:
HIP (Heterogeneous-compute Interface for Portability)
HIP is syntactically near-identical to CUDA:
```cpp
// CUDA
__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

cudaMalloc(&d_a, size);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```

```cpp
// HIP: nearly identical
__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

hipMalloc(&d_a, size);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```
The key difference: HIP compiles to both AMD (via Clang + LLVM AMDGPU backend) and NVIDIA (via NVCC) targets. Write once, run on both.
HIPIFY automates CUDA-to-HIP source translation for existing codebases.
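Much of what HIPIFY does is mechanical API renaming (the real tools, hipify-clang and hipify-perl, perform proper source-level translation including headers and types). A toy Python illustration of the idea, with a small hand-picked mapping table:

```python
# Toy sketch of CUDA -> HIP renaming. Illustrative only; the real
# HIPIFY tools handle far more (headers, types, driver API, etc.).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    """Apply the mapping table to a source string."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

print(toy_hipify("cudaMalloc(&d_a, size); cudaFree(d_a);"))
# -> hipMalloc(&d_a, size); hipFree(d_a);
```

The one-to-one runtime API correspondence is what makes migrating CUDA codebases largely mechanical; kernels themselves usually compile unchanged.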
Key Libraries
| AMD (ROCm) | NVIDIA (CUDA) | Function |
|---|---|---|
| rocBLAS | cuBLAS | Linear algebra |
| MIOpen | cuDNN | DL primitives |
| RCCL | NCCL | Collective comms |
| rocFFT | cuFFT | FFT |
| rocRAND | cuRAND | RNG |
| hipSOLVER | cuSOLVER | LAPACK solvers |
Framework Support
PyTorch, TensorFlow, JAX, ONNX Runtime, vLLM, and llama.cpp all have ROCm backends. PyTorch's ROCm support is the most mature; most operations work out of the box.
Software Stack
```
+-------------------------------------------+
|    PyTorch / TensorFlow / JAX / vLLM      |
+-------------------------------------------+
|    MIOpen · rocBLAS · RCCL · rocFFT       |
+-------------------------------------------+
|    HIP Runtime · OpenCL · OpenMP          |
+-------------------------------------------+
|    ROCclr (Common Language Runtime)       |
+-------------------------------------------+
|    ROCr (Runtime) · ROCt (Thunk)          |
+-------------------------------------------+
|    ROCk (Kernel Driver - amdgpu)          |
+-------------------------------------------+
```
The entire stack (except firmware) is MIT-licensed, a meaningful differentiator from NVIDIA's proprietary CUDA ecosystem.
6. MI350 vs NVIDIA Blackwell (B200)
| | MI350X | MI355X | B200 |
|---|---|---|---|
| Process | TSMC 3nm | TSMC 3nm | TSMC 4NP |
| Memory | 288 GB HBM3E | 288 GB HBM3E | 288 GB HBM3E |
| Memory BW | 8 TB/s | 8 TB/s | ~8 TB/s |
| FP4 | High | 9.2 PFLOPS | ~10 PFLOPS |
| TDP | 1000W | 1400W | ~1000W |
| Interconnect | Infinity Fabric | Infinity Fabric | NVLink 5 (1.8 TB/s) |
| Software | ROCm (open) | ROCm (open) | CUDA (proprietary) |
On paper, MI355X matches B200 in FP4 throughput but at 40% higher power. MI350X is the closer competitor in the same 1000W envelope but with lower FP4 throughput.
NVIDIA's advantage remains in:
- NVLink 5 bandwidth (1.8 TB/s GPU-to-GPU) for multi-GPU scaling
- CUDA ecosystem maturity: libraries, tooling, profiling, community
- Tensor Core specialization with Transformer Engine and FP8 training recipes
AMD's advantages:
- Open-source stack: full visibility into compiler, runtime, and libraries
- Competitive memory specs: matching HBM capacity and bandwidth
- Price/performance: historically more competitive on $/FLOP
- HIPIFY migration path: lower barrier for porting existing CUDA codebases
7. What Sets MI350 Apart
For inference at scale, the MI350's combination of native FP4, 288 GB HBM3E, and 8 TB/s bandwidth creates a compelling package. The 35x inference claim over MI300X (for FP4 workloads) is real: MI300X simply had no FP4 hardware.
For training, the story is more nuanced. Raw FLOPS are competitive, but NVIDIA's interconnect (NVLink 5 + NVSwitch) remains ahead for large-scale distributed training where GPU-to-GPU bandwidth is the bottleneck.
For the open-source ecosystem, ROCm's MIT licensing means you can inspect, modify, and optimize every layer of the stack. For teams building custom kernels or debugging performance at the hardware level, this transparency matters.
The real question for MI350 adoption isn't hardware specs; it's software maturity. The gap is closing. PyTorch on ROCm is production-ready. vLLM runs on AMD. But the long tail of custom CUDA kernels, Flash Attention variants, and training recipes still favors NVIDIA. Each generation of ROCm narrows this gap, and MI350 paired with ROCm 6.x is the strongest position AMD has yet been in.