The era of high-performance computing has been defined by a single name: CUDA.
NVIDIA's platform unlocked the power of GPUs, becoming the de facto standard.
For over a decade, to program a GPU meant to program in CUDA.
This dominance, however, has created a cage, locking progress into a single vendor.
But today, in mid-2025, things are changing.
The computing world is now undergoing a radical transformation towards heterogeneity.
We are seeing a proliferation of specialized hardware:
Intel Gaudi Series:
Intel's Gaudi processors are designed specifically for deep learning training and inference, offering a competitive alternative to NVIDIA's GPUs.
AMD Instinct MI Series:
AMD's MI series of GPUs is designed for high-performance computing and AI workloads, providing an alternative to NVIDIA's data center GPUs.
Groq Tensor Streaming Processor (TSP):
Groq's TSP architecture is designed for low-latency inference and high throughput, particularly for large language models.
Google TPUs (Tensor Processing Units):
Google's TPUs are custom-designed chips optimized for machine learning workloads, particularly in Google's cloud infrastructure.
AWS Trainium:
AWS Trainium is a chip designed for machine learning training, offering high performance and cost-effectiveness.
And more startups building custom silicon appear every day.
This new, diverse landscape demands a new programming philosophy.
The answer is not just another competitor to CUDA; it is a fundamental paradigm shift.
This is a revolution in how we design, optimize, and deploy software for any hardware.
This article will deeply explore the architectural chasm between CUDA and MLIR.
The impact of this shift extends to critical emerging domains such as Generative AI, Quantum Computing, and even Blockchain.
We will also look to the future, covering mining ASICs, Neuromorphic Computing, and specialized hardware for sparse data streams that GPUs handle poorly.
This is the story of the end of an era and the dawn of a new one.
To grasp the magnitude of this shift, we must first understand the four key players.
CUDA stands for Compute Unified Device Architecture.
It is NVIDIA's parallel computing platform and programming model.
It allows developers to write C++-like code, called kernels, that run on NVIDIA GPUs.
Its ecosystem of libraries is mature and unmatched.
It provides direct, low-level control over the hardware, enabling peak performance for experts.
Its long history has built a massive community with vast documentation and support.
Vendor Lock-In: CUDA code runs only on NVIDIA GPUs.
This shackles developers and entire industries to a single, expensive hardware supplier.
It stifles competition and limits the freedom to choose the best hardware for the job.
Researchers prototype in a high-level language like Python for its simplicity and speed of iteration.
But for production, performance-critical code must be completely rewritten in low-level C++/CUDA.
This creates a painful and costly disconnect, slowing the path from research to deployment.
CUDA is powerful but notoriously complex and verbose.
The developer is forced to act as a manual memory manager, transferring data between the CPU (host) and the GPU (device).
The developer must also be a hardware scheduler, managing thread blocks, grids, and synchronization.
This complexity is a steep learning curve and a frequent source of subtle bugs.
The LLVM Project is a collection of modular and reusable compiler technologies.
Its core is the LLVM Intermediate Representation (IR), a low-level, assembly-like language.
LLVM became the standard for modern compiler backends, especially for CPUs.
A compiler frontend (like Clang for C++) translates source code into LLVM IR.
The LLVM backend then optimizes this IR and converts it into machine code for a specific CPU.
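To make this pipeline concrete, here is a tiny example (illustrative; the IR shown is simplified, with attributes and metadata omitted):
(C++)
// add.cpp
extern "C" int add(int a, int b) { return a + b; }

// Frontend and backend in action:
//   clang++ -O2 -S -emit-llvm add.cpp -o add.ll   (emit textual LLVM IR)
//   clang++ -O2 -c add.cpp -o add.o               (emit machine code)
//
// The emitted LLVM IR looks roughly like this:
//   define i32 @add(i32 %a, i32 %b) {
//     %sum = add nsw i32 %a, %b
//     ret i32 %sum
//   }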
This modularity was revolutionary for its time.
However, LLVM was designed for a CPU-centric world.
Its IR is too low-level for the new world of heterogeneous hardware.
It loses crucial high-level information from the source code, a problem known as the "semantic gap."
For example, when compiling a TensorFlow model, the knowledge that an operation is a convolution is lost.
LLVM IR only sees a generic collection of loops and arithmetic instructions.
This prevents the compiler from performing powerful, domain-specific optimizations.
It no longer understands the programmer's high-level intent.
This is the essence of the “semantic gap problem.”
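To see the gap concretely, here is roughly what a 2D convolution looks like once it has been lowered to the level at which LLVM operates: just loops, loads, multiplies, and adds (a simplified C++ sketch, not actual compiler output):
(C++)
// A "lowered" 2D convolution: single channel, no padding, stride 1.
// Nothing here says "convolution"; an LLVM-based backend sees only generic
// loops and arithmetic, so convolution-specific optimizations are out of reach.
void conv2d_lowered(const float* input, const float* kernel, float* output,
                    int H, int W, int KH, int KW) {
    const int OH = H - KH + 1;
    const int OW = W - KW + 1;
    for (int oy = 0; oy < OH; ++oy) {
        for (int ox = 0; ox < OW; ++ox) {
            float acc = 0.0f;
            for (int ky = 0; ky < KH; ++ky) {
                for (int kx = 0; kx < KW; ++kx) {
                    acc += input[(oy + ky) * W + (ox + kx)] * kernel[ky * KW + kx];
                }
            }
            output[oy * OW + ox] = acc;
        }
    }
}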
And this is precisely the problem that MLIR solves.
MLIR was born at Google from the need to compile TensorFlow for CPUs, GPUs, and their TPUs.
They realized LLVM's single, low-level IR was not enough.
MLIR's breakthrough is a unified infrastructure for defining and composing multiple IRs.
These composable IRs are called dialects.
MLIR is like a universal translator, fluent in everything from high-level concepts to low-level machine details.
A high-level dialect can represent domain-specific concepts directly.
For example, a "TensorFlow dialect" has an operation for tf.conv2d.
A "Linear Algebra dialect" has an operation for linalg.matmul.
This retains the critical semantic information that LLVM discards.
This enables a powerful compiler strategy called progressive lowering.
This process preserves high-level context for as long as possible.
This enables vastly superior optimizations for any hardware target.
MLIR is the missing link between high-level languages and diverse silicon.
If MLIR is the powerful, complex engine, Mojo is the sleek, intuitive user interface.
Mojo was created by Chris Lattner, the original architect of LLVM and the Swift language.
It is designed from first principles to be the perfect language for the MLIR era.
In this respect, it is arguably the most technologically advanced language available today.
Even Rust is built on LLVM and inherits LLVM's shortcomings.
Mojo is the only major programming language today based on MLIR.
Mojo elegantly solves the "two-language problem."
Theory is one thing; practice is another.
The following complete, working code examples will demonstrate the profound differences between the two paradigms.
This is the "Hello, World!" of high-performance computing, and it clearly reveals the core philosophy of each platform.
This is a complete, compilable CUDA program for matrix multiplication.
(CUDA C++)
// Filename: matmul.cu
// To compile: nvcc matmul.cu -o matmul_cuda
#include <iostream>
#include <vector>
#include <cuda_runtime.h>
// Helper to check for CUDA errors
#define CUDA_CHECK(err) { \
    cudaError_t err_code = err; \
    if (err_code != cudaSuccess) { \
        std::cerr << "CUDA Error: " << cudaGetErrorString(err_code) << " at line " << __LINE__ << std::endl; \
        exit(EXIT_FAILURE); \
    } \
}

// CUDA Kernel for Matrix Multiplication (Device Code)
__global__ void matrixMulKernel(float* C, const float* A, const float* B, int N) {
    // Calculate the global row and column index of the element
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check to avoid accessing out-of-bounds memory
    if (row < N && col < N) {
        float p_value = 0.0f;
        // Each thread computes one element of the result matrix C
        for (int k = 0; k < N; ++k) {
            p_value += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = p_value;
    }
}

// Main function (Host Code)
int main() {
    const int N = 256;
    const int size = N * N * sizeof(float);

    // Step 1. Allocate host memory
    std::vector<float> h_A(N * N);
    std::vector<float> h_B(N * N);
    std::vector<float> h_C(N * N);

    // Initialize host matrices
    for (int i = 0; i < N * N; ++i) {
        h_A[i] = static_cast<float>(rand()) / RAND_MAX;
        h_B[i] = static_cast<float>(rand()) / RAND_MAX;
    }

    // Step 2. Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc((void**)&d_A, size));
    CUDA_CHECK(cudaMalloc((void**)&d_B, size));
    CUDA_CHECK(cudaMalloc((void**)&d_C, size));

    // Step 3. Copy matrices from host to device
    std::cout << "Copying data from host to device..." << std::endl;
    CUDA_CHECK(cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice));

    // Step 4. Define kernel launch configuration
    // Use 16x16 threads per block, a common choice
    dim3 threadsPerBlock(16, 16);
    // Calculate the number of blocks needed in each dimension
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

    // Step 5. Launch the kernel on the device
    std::cout << "Launching kernel..." << std::endl;
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_C, d_A, d_B, N);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize()); // Wait for the kernel to finish

    // Step 6. Copy the result matrix back from device to host
    std::cout << "Copying result from device to host..." << std::endl;
    CUDA_CHECK(cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost));

    // Step 7. Free device memory
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));

    std::cout << "CUDA Matrix Multiplication finished successfully." << std::endl;
    // (Optional: Add verification step here)
    return 0;
}
The code is dominated by boilerplate and low-level management.
Steps 1, 2, 3, 6, and 7 are purely for managing memory across the CPU/GPU boundary.
This is tedious, error-prone, and obscures the core algorithm.
This code is fundamentally and permanently tied to NVIDIA's hardware architecture.
The actual algorithm—three nested loops—is a tiny fraction of the total code.
The programmer's mental overhead is spent on hardware management, not on the problem itself.
This Mojo version achieves the same result with breathtaking simplicity and power.
(Mojo)
# Filename: matmul.mojo
# To run: mojo matmul.mojo
from memory import DType, Tensor
from random import rand
from time import now

fn matmul_naive(C: Tensor[DType.float32], A: Tensor[DType.float32], B: Tensor[DType.float32]):
    """A naive, high-level implementation of matrix multiplication."""
    let N = A.dim(0)
    let M = A.dim(1)
    let P = B.dim(1)

    for i in range(N):
        for j in range(P):
            var sum: Float32 = 0.0
            for k in range(M):
                sum += A.load(i, k) * B.load(k, j)
            C.store(i, j, sum)

fn main():
    let N = 256

    # 1. Allocate and initialize tensors.
    # Mojo's Tensor handles memory allocation automatically.
    # The compiler will place it in the most appropriate memory space.
    var A = Tensor[DType.float32](N, N)
    var B = Tensor[DType.float32](N, N)
    var C = Tensor[DType.float32](N, N)

    for i in range(N):
        for j in range(N):
            A.store(i, j, rand[DType.float32]())
            B.store(i, j, rand[DType.float32]())

    print("Starting Mojo Matrix Multiplication...")
    let start_time = now()

    # 2. Call the function.
    # The MLIR-based compiler optimizes this high-level code.
    # It can automatically tile, vectorize, and parallelize this code
    # for the target hardware (CPU, GPU, etc.).
    matmul_naive(C, A, B)

    let end_time = now()
    let duration_ms = (end_time - start_time) / 1_000_000.0

    print("Mojo Matrix Multiplication finished successfully.")
    print("Execution time:", duration_ms, "ms")
    # (Optional: Print a corner of the result matrix to verify)
    print("Result C[0,0]:", C.load(0, 0))
And that is all!
MLIR-based Mojo is undeniably set to replace LLVM-based CUDA, and developers will enjoy the change!
The "attention" mechanism is the heart of models like GPT-4 and is a major computational bottleneck.
Optimizing it is critical.
FlashAttention is a landmark algorithm that manually and expertly orchestrates data movement between the GPU's slow main memory (HBM) and its fast on-chip memory (SRAM) to reduce bottlenecks.
The real code is thousands of lines long and incredibly complex.
The links to the components of the full algorithm implementation are given below:
https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/src/flash_fwd_kernel.h
https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/flash_api.cpp
Together, they are almost 3000 lines long.
The repository contains thousands of files.
The learning curve and the onboarding curve are both steep.
A simplified version (AI-generated) is given below:
(CUDA C++)
// This is a simplified conceptual view of a FlashAttention-style CUDA kernel.
// The actual implementation is far more complex.
template<typename Kernel_traits>
__global__ void flash_attention_fwd_kernel(Flash_fwd_params params) {
    // 1. Incredibly complex setup code
    // Calculates dozens of pointers and indices for HBM and shared memory (SRAM)
    const int block_row_idx = blockIdx.x;
    const int head_idx = blockIdx.y;
    // ... many more calculations ...

    // 2. Explicitly allocate shared memory tiles for Q, K, V
    // The developer must manage this limited resource manually.
    extern __shared__ char smem[];
    float* sQ = (float*)smem;
    float* sK = sQ + kTileM * kTileK;
    float* sV = sK + kTileN * kTileK;

    // 3. Main loop over the sequence, manually loading blocks
    for (int k_block_idx = 0; k_block_idx < params.k_num_blocks; ++k_block_idx) {
        // Manually orchestrate asynchronous loads from HBM into SRAM
        // to hide memory latency. This is extremely difficult to get right.
        load_qkv_block_from_hbm(params, ...);
        __syncthreads(); // Hard synchronization barrier

        // Manually perform matrix multiplication in fast SRAM
        compute_sram_matmul(sQ, sK, ...);

        // Recompute softmax "online" to avoid writing the huge intermediate
        // attention score matrix back to slow HBM. This is the core trick.
        compute_online_softmax(...);
        __syncthreads();

        // Update the output block
        update_output_block(sV, ...);
    }

    // 4. Manually write the final output block back to HBM
    store_output_to_hbm(params, ...);
}
The Mojo version expresses the same algorithmic idea (tiling, online softmax) at a high level, delegating the hardware orchestration to the MLIR compiler.
(Mojo)
from memory import DType, Tensor
from algorithm import parallelize

struct AttentionParams:
    var is_causal: Bool
    # ... other model parameters

# This function is a high-level, portable description of the FlashAttention algorithm.
fn flash_attention[T: DType](Q: Tensor[T], K: Tensor[T], V: Tensor[T], params: AttentionParams) -> Tensor[T]:
    # Define problem dimensions from input tensors
    let num_batches = Q.dim(0)
    let num_heads = Q.dim(2)
    let seqlen_q = Q.dim(1)
    let seqlen_k = K.dim(1)

    # Define tunable tiling parameters. The compiler can use these as hints.
    alias BLOCK_M: Int = 128
    alias BLOCK_N: Int = 64

    # The output tensor
    var O = Tensor[T](Q.dims)

    # The @parallelize decorator tells the compiler to map this function
    # over the available hardware parallelism (e.g., CUDA thread blocks or CPU cores).
    @parallelize(num_batches * num_heads)
    fn compute_head(batch_idx: Int, head_idx: Int):
        # Define per-worker accumulators. The compiler will map these
        # to the fastest available memory (e.g., registers or SRAM).
        var o_i = Tensor[T](seqlen_q, V.dim(3))
        var l_i = Tensor[T](seqlen_q)  # Stores the denominator of the softmax
        var m_i = Tensor[T](seqlen_q)  # Stores the max of each row for stable softmax
        o_i.zero()
        l_i.fill(0.0)
        m_i.fill(-50000.0)  # A large negative number, effectively negative infinity

        # Loop over blocks of the Key/Value sequence
        for j in range(0, seqlen_k, BLOCK_N):
            # 1. Load tiles of K and V.
            # The compiler is responsible for generating the optimal code
            # to move this data from main memory to fast memory.
            let k_j = K.load_tile[BLOCK_N](batch_idx, j, head_idx)
            let v_j = V.load_tile[BLOCK_N](batch_idx, j, head_idx)

            # Loop over blocks of the Query sequence
            for i in range(0, seqlen_q, BLOCK_M):
                # 2. Load tile of Q.
                let q_i = Q.load_tile[BLOCK_M](batch_idx, i, head_idx)

                # 3. Compute attention scores for the tile. This is a simple matmul.
                let s_ij = q_i @ k_j.transpose()

                # Causal masking for decoder models like GPT
                if params.is_causal:
                    # Algorithmic logic, no hardware specifics
                    apply_causal_mask(s_ij, i, j)

                # 4. Perform the "online softmax" update.
                # This is pure mathematical logic, not memory management.
                let m_ij = row_max(s_ij)
                let p_ij = exp(s_ij - m_ij)
                let l_ij = row_sum(p_ij)

                let m_new = max(m_i, m_ij)
                let l_new = exp(m_i - m_new) * l_i + exp(m_ij - m_new) * l_ij

                # Update output tile
                o_i = (l_i / l_new * exp(m_i - m_new)) * o_i + (exp(m_ij - m_new) / l_new) * (p_ij @ v_j)

                # Update softmax stats
                l_i = l_new
                m_i = m_new

        # 5. Store the final output. The compiler manages the write-back.
        O.store_tile(batch_idx, head_idx, o_i)

    compute_head()
    return O
One file.
Less than 100 LOC.
No brain-racking dependencies.
Of course, this is just the algorithm, but in the repository, the same algorithm took nearly 3000 LOC with CUDA!
So now you understand the difference.
This approach breaks the NVIDIA monopoly on high-performance AI and will drive down costs.
The limitations of the CUDA model become even more apparent when we look beyond traditional dense workloads to the future of computing.
MLIR/Mojo is designed for this future.
Proof-of-Work blockchains like Bitcoin require immense hashing power.
The goal is to find a "nonce" that, when hashed with other data, produces a result below a certain target.
This is a brute-force search, perfect for parallel hardware.
Initially, miners used CPUs, then GPUs for their superior parallelism.
The CUDA code for a SHA-256 miner is low-level, focused on bitwise and integer operations.
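Conceptually, a mining kernel is little more than the following (a hypothetical sketch: toy_hash stands in for a real double-SHA-256 device function and is not a real hash):
(CUDA C++)
// Stand-in for double SHA-256 (NOT a real hash). A real miner implements the
// full SHA-256 compression function here, using only 32-bit integer and
// bitwise operations (rotates, xors, adds).
__device__ unsigned long long toy_hash(const unsigned char* header, int len,
                                       unsigned long long nonce) {
    unsigned long long h = 0x9E3779B97F4A7C15ULL ^ nonce;
    for (int i = 0; i < len; ++i) {
        h ^= header[i];
        h = (h << 13) | (h >> 51);        // rotate
        h *= 0xBF58476D1CE4E5B9ULL;
    }
    return h;
}

// Each thread tests one candidate nonce: a pure brute-force, embarrassingly
// parallel integer workload.
__global__ void search_nonce(const unsigned char* header, int header_len,
                             unsigned long long nonce_base,
                             unsigned long long target,
                             unsigned long long* found_nonce) {
    unsigned long long nonce =
        nonce_base + blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    if (toy_hash(header, header_len, nonce) < target) {
        atomicExch(found_nonce, nonce);   // report a winner (benign race)
    }
}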
However, for a stable, unchanging algorithm like SHA-256, the ultimate hardware is an ASIC.
An ASIC (Application-Specific Integrated Circuit) is a chip designed for one single purpose - to implement an algorithm in hardware.
An SHA-256 ASIC has the hashing logic literally baked into the silicon.
It is thousands of times more power-efficient than a GPU for that one task.
This is where the CUDA story ends, but the MLIR/Mojo story gets even more interesting.
A key step in designing such a chip is High-Level Synthesis (HLS).
HLS tools convert a high-level description of an algorithm into a low-level hardware description language (like Verilog or VHDL) used to fabricate the chip.
MLIR, through projects like CIRCT (Circuit IR for Compilers and Tools), is designed to be the backbone of next-generation HLS.
This unifies the entire stack, from high-level software to custom silicon design.
It allows for rapid prototyping and deployment of new algorithms onto the most efficient hardware possible, be it a GPU or a brand new ASIC.
CUDA has no answer to this.
It is a software-only solution for a single vendor's programmable hardware.
NVIDIA GPUs are masters of SIMT: Single Instruction, Multiple Thread.
This means they are incredibly efficient when thousands of threads are all executing the same instruction on different data (e.g., a matrix multiplication).
However, they are very inefficient at workloads with heavy branching or irregular data access.
This is because of "thread divergence."
If threads in a group (a "warp") take different branches of an if/else statement, the hardware must execute both paths serially, with threads in the inactive path simply turned off.
This kills performance for many important problems.
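Here is a minimal, illustrative CUDA kernel showing the problem (not taken from any real codebase):
(CUDA C++)
__global__ void divergent_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads within the same 32-thread warp can take different branches here.
        // The warp then executes BOTH paths one after the other, masking off the
        // inactive threads, so throughput roughly halves - and it gets worse as
        // branching gets deeper and more data-dependent.
        if (in[i] > 0.0f) {
            out[i] = expf(in[i]);        // path A
        } else {
            out[i] = -sqrtf(-in[i]);     // path B
        }
    }
}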
Neuromorphic computing is a brain-inspired computing paradigm.
Neuromorphic chips, like Intel's Loihi, are not based on clocks and dense matrix math.
They are event-driven.
"Neurons" fire a "spike" only when their input potential crosses a threshold.
These spikes travel to other "synapses," which may then cause other neurons to fire.
This is an extremely sparse, branch-heavy, and asynchronous process.
Trying to simulate this on a GPU is horrifically inefficient due to constant thread divergence.
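A simplified sketch of one timestep of spike propagation shows why (hypothetical code; the data layout is illustrative, not from any real neuromorphic framework):
(CUDA C++)
struct SynapseList {
    int*   targets;   // indices of downstream neurons
    float* weights;   // synaptic weights
    int    count;     // fan-out of this neuron
};

__global__ void propagate_spikes(const int* spiked_neurons, int num_spikes,
                                 const SynapseList* synapses, float* potentials) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_spikes) return;          // when activity is sparse, most threads idle
    const SynapseList syn = synapses[spiked_neurons[s]];
    // Data-dependent loop: every neuron has a different fan-out, so threads in a
    // warp finish at different times (divergence), and the scattered atomicAdd
    // writes defeat memory coalescing.
    for (int k = 0; k < syn.count; ++k) {
        atomicAdd(&potentials[syn.targets[k]], syn.weights[k]);
    }
}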
MLIR is a natural fit here: a dedicated dialect can represent spiking neurons and events directly.
This enables a portable, high-level programming model for a completely non-traditional form of computing.
The CUDA model is not relevant in this domain.
Many real-world problems involve sparse data: social networks, recommendation engines, and scientific simulations.
Representing these as dense matrices is wasteful.
Processing them on GPUs leads to irregular memory access patterns, which defeats the GPU's memory coalescing optimizations and cripples performance.
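A standard CSR sparse matrix-vector multiply makes the problem visible (an illustrative kernel, not tuned):
(CUDA C++)
// y = A * x with A stored in CSR format (row_ptr, col_idx, vals).
__global__ void spmv_csr(int num_rows, const int* row_ptr, const int* col_idx,
                         const float* vals, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    // The nonzero count varies per row, so neighboring threads do unequal work,
    // and col_idx produces scattered, uncoalesced reads from x.
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
        sum += vals[j] * x[col_idx[j]];
    }
    y[row] = sum;
}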
Again, MLIR provides the answer: its sparse tensor support lets the compiler reason about sparsity directly.
This allows a high-level algorithm written in Mojo to be efficiently compiled for sparse data on any hardware.
This is something that is extremely difficult today.
And next to impossible with CUDA.
Simulating a quantum computer on a classical computer is essential for developing and testing quantum algorithms.
The most common method is state vector simulation.
The state of an N-qubit quantum system is represented by a vector of 2^N complex numbers.
For just 50 qubits, this vector has 2^50 (over a quadrillion) elements, requiring petabytes of memory.
A quantum algorithm is a sequence of "gates."
Each gate is equivalent to multiplying the massive state vector by a very large, very sparse matrix.
This is a workload that is both computationally intensive and memory-bandwidth bound.
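At the lowest level, a state vector simulator spends its time on updates like the one below: applying a 2x2 gate to every pair of amplitudes that differ only in the target qubit (an illustrative kernel showing the textbook update, not cuQuantum's actual code):
(CUDA C++)
#include <cuComplex.h>

// Apply a single-qubit gate [[a, b], [c, d]] to qubit q of an n-qubit state
// vector of 2^n complex amplitudes. Each thread updates one amplitude pair;
// num_pairs = 2^(n-1).
__global__ void apply_single_qubit_gate(cuFloatComplex* state, long long num_pairs,
                                        int q, cuFloatComplex a, cuFloatComplex b,
                                        cuFloatComplex c, cuFloatComplex d) {
    long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (t >= num_pairs) return;
    long long low = t & ((1LL << q) - 1);           // bits below qubit q
    long long i0  = ((t >> q) << (q + 1)) | low;    // index with qubit q = 0
    long long i1  = i0 | (1LL << q);                // index with qubit q = 1
    cuFloatComplex psi0 = state[i0];
    cuFloatComplex psi1 = state[i1];
    // Memory-bandwidth bound: two strided loads and two stores per 2x2 multiply.
    state[i0] = cuCaddf(cuCmulf(a, psi0), cuCmulf(b, psi1));
    state[i1] = cuCaddf(cuCmulf(c, psi0), cuCmulf(d, psi1));
}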
NVIDIA has invested heavily here with its cuQuantum library, a high-performance CUDA-based solution.
cuQuantum is very fast on NVIDIA GPUs, but it carries the classic CUDA limitations: it runs only on NVIDIA hardware, and the compiler never sees the high-level structure of the quantum circuit.
The MLIR approach enables a much higher level of intelligence in the compiler.
For instance, the compiler would know that applying a Hadamard gate (H) twice in a row is an identity operation and can be completely eliminated.
It would know that certain sequences of gates can be "fused" into a single, more efficient gate.
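Concretely, using the standard definition of the Hadamard gate:
H = (1/√2) · [[1, 1], [1, -1]],  so  H · H = I.
Two consecutive H gates on the same qubit cancel out completely, and a compiler that still sees the circuit, rather than anonymous matrices, can simply delete both.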
This entire class of optimization is invisible to the CUDA compiler, which, because of its LLVM foundation, sees only generic matrix arithmetic.
After performing these high-level algebraic simplifications, the MLIR compiler would then lower the simplified circuit into an optimized sequence of sparse matrix operations for the target hardware.
Because this is all built on MLIR, the same high-level quantum circuit written in Mojo could be compiled to run on an NVIDIA GPU, an AMD GPU, or a CPU cluster.
This provides both higher performance (due to smarter optimization) and complete hardware freedom.
Nvidia is investing heavily in quantum simulation hardware and the software stack.
But its CUDA-Q platform is still LLVM-based.
MLIR-based Mojo offers not just more advanced optimization; it also offers a simpler programming model.
The transition from CUDA to an MLIR-based world will be gradual, but it is inevitable.
It is a fundamental shift from a closed, hardware-centric model to an open, software-defined future.
But will it be the winner in the long run?
I believe it will, and developers will be happier with Mojo than CUDA.
CUDA built the impressive palace of today's high-performance computing.
But it is a cage.
MLIR and Mojo are handing every developer the key to unlock it and build the future on any foundation they choose.
And that foundation is destined to be MLIR and Mojo.
The simplest reason is the budget: hardware freedom drives down costs.
Which is why, unless NVIDIA pivots soon and embraces MLIR as well, this will be the end of its dominance.