A ground-up performance study of matrix multiplication across single-core CPU, multi-core CPU with OpenMP, and NVIDIA GPU with CUDA. Three architectures. Three phases. One question: how much does parallelism actually matter?
As part of a short-term research internship under Prof. Dr. Unnikrishnan Cheramangalath at IIT Palakkad's Department of Computer Science & Engineering, this project was a structured study of computing architecture performance — specifically how matrix multiplication, the core operation of modern AI and scientific computing, scales across fundamentally different hardware paradigms.
Prof. Cheramangalath's research spans systems software and parallel computing. Matrix multiplication is the canonical benchmark for this domain — simple enough to implement from scratch, complex enough to reveal deep truths about memory hierarchies, parallelism, and the gap between CPU and GPU architectures.
The internship was structured in three deliberate phases — single-core baseline, OpenMP multicore, then CUDA GPU — each phase building on the last. The goal: understand not just the numbers, but WHY each architecture performs the way it does.
Every neural network layer. Every image transformation. Every recommendation algorithm. They all reduce to C = A × B at the hardware level. Understanding how different architectures handle this operation — and why — is fundamental systems knowledge.
Naive complexity for n×n matrices
Architectures benchmarked
Data types: int · float · double
The Baseline
The foundation. A templated C++ implementation with deliberate cache optimization — using i–k–j loop ordering instead of the naive i–j–k. This matters: the inner loop accesses memory sequentially, staying in cache, dramatically reducing cache misses even before any parallelism.
// Cache-friendly: i-k-j ordering
for (int i = 0; i < m; i++)
for (int k = 0; k < n; k++)
for (int j = 0; j < p; j++)
C[i*p+j] += A[i*n+k] * B[k*p+j];
L1 → L2 → L3 → RAM
i-k-j keeps data in L1
FASTER
OpenMP Parallelism
Add one pragma, multiply throughput by your core count. OpenMP parallelizes the outer loop — each CPU core independently computes its assigned rows of the output matrix. No inter-thread communication needed: the algorithm is embarrassingly parallel. The same cache-friendly loop ordering is preserved.
// Same algorithm — now parallel
#pragma omp parallel for
for (int i = 0; i < m; i++)
for (int k = 0; k < n; k++)
for (int j = 0; j < p; j++)
C[i*p+j] += A[i*n+k] * B[k*p+j];
Massive Parallelism
Where single-core does one multiply at a time and multi-core does N — CUDA does thousands simultaneously. Each output element C[i][j] is computed by a dedicated CUDA thread. The GPU launches a grid of thread blocks, mapping the 2D output matrix directly to the 2D thread grid.
__global__ void matmul_kernel(
T* A, T* B, T* C, int m, int n, int p) {
int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;
if (i < m && j < p) {
T sum = 0;
for (int k=0; k<n; k++)
sum += A[i*n+k] * B[k*p+j];
C[i*p+j] = sum;
}
}
Simultaneously computing output elements
| Architecture | Parallelism | Memory Model | Best For |
|---|---|---|---|
| Single-Core | 1 thread | L1/L2 cache | Small matrices |
| OpenMP | N threads | Shared L3 | Medium matrices |
| CUDA GPU | 1000s | VRAM + HBM | Large matrices |
Actual speedup depends on matrix size, data type, and hardware. Large matrices (512×512 and above) show maximum GPU advantage.
Fastest. No floating point unit needed. Integer overflow risk at large sizes.
32-bit. GPU native. Best throughput on CUDA. FP32 is the AI training standard.
64-bit. Most precise. 2× memory footprint. Slower on GPU than float.
Naive matrix multiplication uses i–j–k loop order. The problem: the inner loop accesses B[k][j] — jumping across rows in memory. Every access is a cache miss.
MatMul uses i–k–j ordering. The inner loop now accesses B[k][j] sequentially — staying in the same cache line. The difference? Orders of magnitude on large matrices.
for i → for j → for k
B[k*p+j] JUMPS
for i → for k → for j
B[k*p+j] SEQUENTIAL
cpu-gpu-matmul-benchmark/
├── phase1-matmul-singlecore/ ← C++ baseline
│ └── main.cpp
├── phase2-matmul-multicore/ ← OpenMP parallel
│ └── main.cpp
├── phase3-matmul-gpu/ ← CUDA massive
│ └── main.cu
├── README.md
└── LICENSE (MIT)
C++17 · Single Thread · Cache-Optimized
OpenMP · Multi-Thread · Same Algorithm
CUDA · GPU Kernel · Thread-Per-Element
Key insights from the internship under Dr. Unnikrishnan Cheramangalath at IIT Palakkad:
Changing loop order from i-j-k to i-k-j — same algorithm, same result — can improve performance by 10× on large matrices before touching parallelism.
MatMul is a perfect parallel problem: no thread needs another thread's output. This makes it ideal for both OpenMP and CUDA — and explains why GPUs dominate AI.
One templated createMatrix<T>() and matmul<T>() handles int, float, and double without duplication. C++ templates are the right tool for performance code.
For small matrices, GPU launch overhead dominates. The crossover point — where GPU wins — depends on matrix size and data type. Benchmarking reveals this precisely.
Core language — templates, cache-optimized loops
NVIDIA GPU kernel — thread-per-element parallelism
CPU multicore — one pragma, full core utilization
Benchmarking analysis — charts, timing, comparison
Compilers — -O3 optimization flag on all builds
Open source — fork it, extend it, learn from it
ORIGINALLY DEVELOPED AS A RESEARCH INTERNSHIP PROJECT AT IIT PALAKKAD
Fork it. Run the benchmarks on your hardware. Add Tensor Core support. Compare your CPU to mine. That's the point.