MatMul Benchmark — IIT Palakkad Research

01 / Research Context

IIT Palakkad.
Where It Happened.

INDIAN INSTITUTE OF TECHNOLOGY

PALAKKAD

Computer Science & Engineering

// RESEARCH SUPERVISOR

Dr. Unnikrishnan Cheramangalath

Professor, CSE · IIT Palakkad

SHORT-TERM RESEARCH INTERNSHIP

The Assignment

As part of a short-term research internship under Prof. Dr. Unnikrishnan Cheramangalath at IIT Palakkad's Department of Computer Science & Engineering, this project was a structured study of computing architecture performance — specifically how matrix multiplication, the core operation of modern AI and scientific computing, scales across fundamentally different hardware paradigms.

Why Matrix Multiplication

Prof. Cheramangalath's research spans systems software and parallel computing. Matrix multiplication is the canonical benchmark for this domain — simple enough to implement from scratch, complex enough to reveal deep truths about memory hierarchies, parallelism, and the gap between CPU and GPU architectures.

The Three-Phase Structure

The internship was structured in three deliberate phases — single-core baseline, OpenMP multicore, then CUDA GPU — each phase building on the last. The goal: understand not just the numbers, but WHY each architecture performs the way it does.

02 / The Problem

Matrix multiplication is the
backbone of modern AI.

Every neural network layer. Every image transformation. Every recommendation algorithm. They all reduce to C = A × B at the hardware level. Understanding how different architectures handle this operation — and why — is fundamental systems knowledge.

O(n³)

Naive complexity for n×n matrices

3

Architectures benchmarked

3

Data types: int · float · double

03 / Three Phases

One Algorithm.
Three Architectures.

PHASE 1

Single-Core CPU

The Baseline

The foundation. A templated C++ implementation with deliberate cache optimization — using i–k–j loop ordering instead of the naive i–j–k. This matters: the inner loop accesses memory sequentially, staying in cache, dramatically reducing cache misses even before any parallelism.

// Cache-friendly: i-k-j ordering
// A is m×n, B is n×p, C is m×p (row-major); C must start zero-filled
for (int i = 0; i < m; i++)
  for (int k = 0; k < n; k++)
    for (int j = 0; j < p; j++)
      C[i*p+j] += A[i*n+k] * B[k*p+j];

g++ -O3 -std=c++17 main.cpp -o benchmark

[Diagram: 1 core, memory hierarchy L1 → L2 → L3 → RAM; i-k-j keeps data in L1]

[Diagram: N cores in parallel, ~N× faster]

PHASE 2

Multi-Core CPU

OpenMP Parallelism

Add one pragma and throughput can scale by up to roughly your core count. OpenMP parallelizes the outer loop: each CPU core independently computes its assigned rows of the output matrix. No inter-thread communication is needed because no two threads write the same element of C; the algorithm is embarrassingly parallel. The same cache-friendly loop ordering is preserved.

// Same algorithm — now parallel
#pragma omp parallel for
for (int i = 0; i < m; i++)
  for (int k = 0; k < n; k++)
    for (int j = 0; j < p; j++)
      C[i*p+j] += A[i*n+k] * B[k*p+j];

g++ -O3 -fopenmp -std=c++17 main.cpp -o benchmark
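
The loop above can be wrapped into a self-contained function, sketched below. This is an illustration rather than the repo's code: the name matmul_omp and the use of row-major std::vector storage are assumptions. Each thread writes only the rows of C assigned to it, which is why no locking is needed. Built without -fopenmp, the pragma is ignored and the function runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repo): C = A * B for row-major matrices.
// A is m x n, B is n x p; the returned C is m x p.
template <typename T>
std::vector<T> matmul_omp(const std::vector<T>& A, const std::vector<T>& B,
                          int m, int n, int p) {
  assert(A.size() == static_cast<std::size_t>(m) * n);
  assert(B.size() == static_cast<std::size_t>(n) * p);
  std::vector<T> C(static_cast<std::size_t>(m) * p, T{});  // zero-initialized

  // Rows of C are split across threads; each row is written by one thread only.
  #pragma omp parallel for
  for (int i = 0; i < m; i++)
    for (int k = 0; k < n; k++) {
      const T a = A[i * n + k];  // reused across the whole inner loop
      for (int j = 0; j < p; j++)
        C[i * p + j] += a * B[k * p + j];
    }
  return C;
}
```

Calling matmul_omp(A, B, 2, 3, 2) on a 2×3 and a 3×2 matrix returns the 2×2 product as a flat vector of four elements.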

PHASE 3

NVIDIA GPU — CUDA

Massive Parallelism

Where single-core does one multiply at a time and multi-core does N — CUDA does thousands simultaneously. Each output element C[i][j] is computed by a dedicated CUDA thread. The GPU launches a grid of thread blocks, mapping the 2D output matrix directly to the 2D thread grid.

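template <typename T>
__global__ void matmul_kernel(
    const T* A, const T* B, T* C, int m, int n, int p) {
  // One thread per output element: i is the row of C, j is the column
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < m && j < p) {  // guard threads that fall outside the matrix
    T sum = 0;
    for (int k=0; k<n; k++)
      sum += A[i*n+k] * B[k*p+j];
    C[i*p+j] = sum;
  }
}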

nvcc -O3 main.cu -o benchmark
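
The source does not show the launch configuration, so the following is only a sketch of how a 2D grid of blocks can cover an m×p output. It is plain C++ that emulates the kernel's index arithmetic serially on the CPU; the names ceil_div and emulate_grid_matmul and the default block size of 16 are assumptions, not values taken from the repo. The grid size uses ceiling division, so the last blocks may hang over the matrix edge, which is what the bounds check handles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ceiling division: number of blocks needed to cover `total` elements.
inline int ceil_div(int total, int block) { return (total + block - 1) / block; }

// Hypothetical CPU emulation of the kernel's thread-to-element mapping.
// The loops over (by, bx, ty, tx) stand in for blockIdx and threadIdx.
template <typename T>
std::vector<T> emulate_grid_matmul(const std::vector<T>& A,
                                   const std::vector<T>& B,
                                   int m, int n, int p, int block = 16) {
  assert(block > 0);
  std::vector<T> C(static_cast<std::size_t>(m) * p, T{});
  const int grid_y = ceil_div(m, block);  // blocks along rows of C
  const int grid_x = ceil_div(p, block);  // blocks along columns of C
  for (int by = 0; by < grid_y; by++)
    for (int bx = 0; bx < grid_x; bx++)
      for (int ty = 0; ty < block; ty++)
        for (int tx = 0; tx < block; tx++) {
          const int i = by * block + ty;  // blockIdx.y * blockDim.y + threadIdx.y
          const int j = bx * block + tx;  // blockIdx.x * blockDim.x + threadIdx.x
          if (i < m && j < p) {           // threads past the edge do nothing
            T sum = T{};
            for (int k = 0; k < n; k++) sum += A[i * n + k] * B[k * p + j];
            C[i * p + j] = sum;
          }
        }
  return C;
}
```

On a real GPU the four outer loops disappear: the hardware runs all (block, thread) combinations concurrently, and only the k loop remains inside each thread.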

[Diagram: thousands of threads simultaneously computing output elements; speedup ~1000×]

04 / Architecture Comparison

Three Ways to
Multiply.

Architecture	Parallelism	Memory Model	Best For
Single-Core	1 thread	L1/L2 cache	Small matrices
OpenMP	N threads	Shared L3	Medium matrices
CUDA GPU	1000s of threads	Device VRAM (GDDR or HBM)	Large matrices

// RELATIVE SPEEDUP (Conceptual Visualization)

Single-Core CPU: 1.0× (Baseline)

BASELINE

Cache-optimized i-k-j implementation

Multi-Core CPU (OpenMP): ~4–8× Faster

MULTI-THREAD

Scales close to linearly with available CPU cores, until memory bandwidth becomes the limit

NVIDIA GPU (CUDA): ~100–1000× Faster

MASSIVE PARALLEL

Dominated by large matrix throughput (n ≥ 512)

Actual speedup depends on matrix size, data type, and hardware. Large matrices (512×512 and above) show maximum GPU advantage.

int

Fastest. No floating point unit needed. Accumulated sums can overflow the integer range as n or the element values grow.

float

32-bit. GPU native. Best throughput on CUDA. FP32 is the AI training standard.

double

64-bit. Most precise. 2× memory footprint. Slower on GPU than float.
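
The integer overflow risk can be made concrete with a small sketch. Both helpers below are hypothetical and not part of the repo: the first estimates how large the inner dimension n may grow before a 32-bit accumulator could overflow, assuming every element has absolute value at most max_abs (itself small enough that max_abs squared fits in 64 bits); the second shows the usual workaround of accumulating in a wider type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Largest inner dimension n for which a 32-bit accumulator cannot overflow,
// assuming every element of A and B has absolute value at most max_abs (> 0).
inline std::int64_t max_safe_inner_dim(std::int64_t max_abs) {
  assert(max_abs > 0);
  const std::int64_t limit = std::numeric_limits<std::int32_t>::max();
  return limit / (max_abs * max_abs);  // each product is at most max_abs^2
}

// Dot product of int32 values accumulated in 64 bits to avoid wraparound.
inline std::int64_t dot_wide(const std::vector<std::int32_t>& a,
                             const std::vector<std::int32_t>& b) {
  assert(a.size() == b.size());
  std::int64_t sum = 0;
  for (std::size_t k = 0; k < a.size(); k++)
    sum += static_cast<std::int64_t>(a[k]) * b[k];
  return sum;
}
```

With elements bounded by 1000, for example, max_safe_inner_dim(1000) gives 2147, so a 32-bit accumulator is only guaranteed safe up to n = 2147 under that bound.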

05 / Why Cache Ordering Matters

The Hidden
Performance Killer.

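Naive matrix multiplication uses i–j–k loop order. The problem: the inner loop walks k, so it reads B[k][j] down a column, jumping a full row (p elements) in memory on every step. For large matrices, nearly every such access is a cache miss.

MatMul uses i–k–j ordering. The inner loop now walks j, so it reads B[k][j] and updates C[i][j] sequentially along a row, using each cache line fully before moving on, while A[i][k] stays fixed. The difference can be around 10× on large matrices.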

// SLOW i-j-k

for i → for j → for k
B[k*p+j] JUMPS → CACHE MISS

// FAST i-k-j

for i → for k → for j
B[k*p+j] SEQUENTIAL → CACHE HIT
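
The two orderings can be put side by side, as in the sketch below. The function names matmul_ijk and matmul_ikj are illustrative, not taken from the repo, and row-major std::vector storage is assumed. Both compute the same product; only the memory access pattern of the inner loop differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive order: the inner loop walks k, so B is read with stride p (column-wise).
template <typename T>
std::vector<T> matmul_ijk(const std::vector<T>& A, const std::vector<T>& B,
                          int m, int n, int p) {
  assert(A.size() == static_cast<std::size_t>(m) * n);
  assert(B.size() == static_cast<std::size_t>(n) * p);
  std::vector<T> C(static_cast<std::size_t>(m) * p, T{});
  for (int i = 0; i < m; i++)
    for (int j = 0; j < p; j++) {
      T sum = T{};
      for (int k = 0; k < n; k++) sum += A[i * n + k] * B[k * p + j];
      C[i * p + j] = sum;
    }
  return C;
}

// Reordered: the inner loop walks j, so B and C are accessed contiguously.
template <typename T>
std::vector<T> matmul_ikj(const std::vector<T>& A, const std::vector<T>& B,
                          int m, int n, int p) {
  assert(A.size() == static_cast<std::size_t>(m) * n);
  assert(B.size() == static_cast<std::size_t>(n) * p);
  std::vector<T> C(static_cast<std::size_t>(m) * p, T{});
  for (int i = 0; i < m; i++)
    for (int k = 0; k < n; k++)
      for (int j = 0; j < p; j++)
        C[i * p + j] += A[i * n + k] * B[k * p + j];
  return C;
}
```

For integer inputs the two functions return identical results. For float and double the results can differ in the last bits, since floating-point rounding depends on how each sum is accumulated.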

06 / Structure

Three Phases.
One Repo.

cpu-gpu-matmul-benchmark/
├── phase1-matmul-singlecore/  ← C++ baseline
│   └── main.cpp               
├── phase2-matmul-multicore/   ← OpenMP parallel
│   └── main.cpp               
├── phase3-matmul-gpu/         ← CUDA massive
│   └── main.cu                
├── README.md                             
└── LICENSE (MIT)

PHASE 1

C++17 · Single Thread · Cache-Optimized

PHASE 2

OpenMP · Multi-Thread · Same Algorithm

PHASE 3

CUDA · GPU Kernel · Thread-Per-Element

07 / Engineering Insights

What This
Taught Me.

Key insights from the internship under Dr. Unnikrishnan Cheramangalath at IIT Palakkad:

Cache is Everything

Changing loop order from i-j-k to i-k-j — same algorithm, same result — can improve performance by 10× on large matrices before touching parallelism.

Embarrassingly Parallel

MatMul is a perfect parallel problem: no thread needs another thread's output. This makes it ideal for both OpenMP and CUDA — and explains why GPUs dominate AI.

Templates Beat Duplication

One templated createMatrix<T>() and matmul<T>() handles int, float, and double without duplication. C++ templates are the right tool for performance code.
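
The source names createMatrix<T>() but does not show it, so the version below is a guess at what such a helper could look like, with an assumed signature (rows, cols, seed) and an assumed value range. It uses C++17's if constexpr to pick an integer or a real distribution at compile time, so one definition serves int, float, and double.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <type_traits>
#include <vector>

// Assumed signature; the source names createMatrix<T>() but does not show it.
// Returns a rows x cols row-major matrix of pseudo-random values:
// integers in [0, 9] for integral T, reals drawn from [0, 9) otherwise.
template <typename T>
std::vector<T> createMatrix(int rows, int cols, unsigned seed = 42) {
  assert(rows >= 0 && cols >= 0);
  std::mt19937 gen(seed);
  std::vector<T> M(static_cast<std::size_t>(rows) * cols);
  if constexpr (std::is_integral_v<T>) {
    std::uniform_int_distribution<T> dist(0, 9);         // int path
    for (auto& x : M) x = dist(gen);
  } else {
    std::uniform_real_distribution<T> dist(T(0), T(9));  // float / double path
    for (auto& x : M) x = dist(gen);
  }
  return M;
}
```

Usage is the same for each data type, for example createMatrix<int>(2, 3), createMatrix<float>(1, 5), or createMatrix<double>(4, 4). A fixed seed keeps benchmark inputs reproducible across runs.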

The GPU Isn't Always Faster

For small matrices, GPU launch overhead and host-to-device memory transfers dominate. The crossover point, where the GPU starts to win, depends on matrix size, data type, and hardware. Benchmarking reveals this precisely.
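
A rough cost model helps explain why a crossover exists. It assumes square n×n matrices, counts each multiply-add as two operations, and counts each matrix as moved once; real hardware is more complicated, so this is only a back-of-the-envelope sketch.

```latex
\begin{align*}
W(n) &= 2n^3 && \text{arithmetic operations (one multiply and one add per inner-loop step)} \\
D(n) &= 3n^2 && \text{elements moved (read } A \text{ and } B \text{, write } C\text{)} \\
\frac{W(n)}{D(n)} &= \frac{2n}{3} && \text{operations per element moved}
\end{align*}
```

At n = 64 this ratio is about 43; at n = 1024 it is about 683. Compute grows as n³ while data movement grows only as n², so fixed launch and transfer costs are amortized better as n increases.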

08 / What's Next

Still
Building.

✓ Phase 1: Single-core CPU · Base implementation with cache optimization

✓ Phase 2: OpenMP multicore · Multi-thread scaling analysis

✓ Phase 3: CUDA GPU kernel · Massive parallelism on NVIDIA hardware

Performance timing with std::chrono · High-precision latency measurement in all phases

Automated comparison in Jupyter · End-to-end benchmarking and plotting pipeline

Tensor Core acceleration · Using NVIDIA's dedicated matrix hardware (WMMA)
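
For the planned std::chrono timing, a minimal sketch might look like the following. The helpers time_ms and best_of_ms are hypothetical names, not from the repo. std::chrono::steady_clock is used because it is monotonic, and the best of several runs is reported to reduce noise from other processes.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs fn once, returns elapsed wall time in milliseconds.
template <typename F>
double time_ms(F&& fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<F>(fn)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: runs fn `repeats` times and returns the fastest run.
template <typename F>
double best_of_ms(F&& fn, int repeats) {
  assert(repeats > 0);
  double best = time_ms(fn);
  for (int r = 1; r < repeats; r++) {
    const double t = time_ms(fn);
    if (t < best) best = t;
  }
  return best;
}
```

When timing the CUDA phase from the host this way, the timed callable would need to wait for the kernel to finish (for example with cudaDeviceSynchronize()), because kernel launches return before the GPU work completes.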

09 / Tech Stack

Built
With.

C++17

Core language — templates, cache-optimized loops

CUDA

NVIDIA GPU kernel — thread-per-element parallelism

OpenMP

CPU multicore — one pragma, full core utilization

Jupyter

Benchmarking analysis — charts, timing, comparison

NVCC

Compilers: g++ for the CPU phases and nvcc for CUDA, with the -O3 optimization flag on all builds

MIT License

Open source — fork it, extend it, learn from it

It's
Open Source.

ORIGINALLY DEVELOPED AS A RESEARCH INTERNSHIP PROJECT AT IIT PALAKKAD

Fork it. Run the benchmarks on your hardware. Add Tensor Core support. Compare your CPU to mine. That's the point.


Languages: Jupyter Notebook 58.2% · C++ 30.6% · CUDA 11.2%

MatMul.
From One Core
to Thousands.
