🎓 IIT PALAKKAD — RESEARCH INTERNSHIP Under Dr. Unnikrishnan Cheramangalath · CSE Dept
SYSTEMS PROGRAMMING · C++ · CUDA · OPEN SOURCE

MatMul.
From One Core
to Thousands.

A ground-up performance study of matrix multiplication across single-core CPU, multi-core CPU with OpenMP, and NVIDIA GPU with CUDA. Three architectures. Three phases. One question: how much does parallelism actually matter?

C++17 · CUDA · OPENMP · JUPYTER · MIT LICENSE

// RESEARCH CONTEXT

Institution: IIT Palakkad
Department: CSE Dept.
Supervisor: Dr. Unnikrishnan C.
Type: Research Internship
Focus: HPC & Systems
Matrix A × Matrix B = Result C
CPU ×1 · CPU ×N · GPU ×1000s
Single Core · OpenMP · CUDA · C++17 · Cache Locality · Row-Major · Template Functions · Parallelism · Benchmarking · i–k–j Loop · NVIDIA GPU · Tensor Cores · HPC
01 / Research Context

IIT Palakkad.
Where It Happened.

INDIAN INSTITUTE OF TECHNOLOGY

PALAKKAD

Computer Science & Engineering

// RESEARCH SUPERVISOR
Dr. Unnikrishnan Cheramangalath

Professor, CSE · IIT Palakkad

SHORT-TERM RESEARCH INTERNSHIP

The Assignment

As part of a short-term research internship under Prof. Dr. Unnikrishnan Cheramangalath at IIT Palakkad's Department of Computer Science & Engineering, this project was a structured study of computing architecture performance — specifically how matrix multiplication, the core operation of modern AI and scientific computing, scales across fundamentally different hardware paradigms.

Why Matrix Multiplication

Prof. Cheramangalath's research spans systems software and parallel computing. Matrix multiplication is the canonical benchmark for this domain — simple enough to implement from scratch, complex enough to reveal deep truths about memory hierarchies, parallelism, and the gap between CPU and GPU architectures.

The Three-Phase Structure

The internship was structured in three deliberate phases — single-core baseline, OpenMP multicore, then CUDA GPU — each phase building on the last. The goal: understand not just the numbers, but WHY each architecture performs the way it does.

02 / The Problem

Matrix multiplication is the
backbone of modern AI.

Every neural network layer. Every image transformation. Every recommendation algorithm. They all reduce to C = A × B at the hardware level. Understanding how different architectures handle this operation — and why — is fundamental systems knowledge.

O(n³)

Naive complexity for n×n matrices

3

Architectures benchmarked

3

Data types: int · float · double

03 / Three Phases

One Algorithm.
Three Architectures.

PHASE 1

Single-Core CPU

The Baseline

The foundation. A templated C++ implementation with deliberate cache optimization — using i–k–j loop ordering instead of the naive i–j–k. This matters: the inner loop accesses memory sequentially, staying in cache, dramatically reducing cache misses even before any parallelism.

// Cache-friendly: i-k-j ordering (row-major storage; C must start zeroed)
for (int i = 0; i < m; i++)
  for (int k = 0; k < n; k++)
    for (int j = 0; j < p; j++)
      C[i*p+j] += A[i*n+k] * B[k*p+j];
g++ -O3 -std=c++17 main.cpp -o benchmark
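
The loop above can be wrapped into a self-contained templated function. This is a minimal sketch, assuming matrices are stored as flat row-major std::vector<T>; the name matmul_ikj and its signature are illustrative and not taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: row-major matmul with i-k-j ordering.
// A is m×n, B is n×p, the result is m×p. Name and signature are illustrative.
template <typename T>
std::vector<T> matmul_ikj(const std::vector<T>& A, const std::vector<T>& B,
                          std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    std::vector<T> C(m * p, T{});  // zero-initialized before accumulation
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const T a = A[i * n + k];  // constant for the whole inner loop
            for (std::size_t j = 0; j < p; ++j) {
                C[i * p + j] += a * B[k * p + j];  // unit stride over B and C
            }
        }
    }
    return C;
}
```

Calling matmul_ikj(A, B, 2, 3, 2) with A = {1,2,3,4,5,6} and B = {7,8,9,10,11,12} yields {58, 64, 139, 154}.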
1 CORE

L1 → L2 → L3 → RAM

i-k-j keeps data in L1

N CORES (Parallel): ~N× FASTER

PHASE 2

Multi-Core CPU

OpenMP Parallelism

Add one pragma and throughput scales with your core count, up to the limit set by memory bandwidth. OpenMP parallelizes the outer loop: each CPU core independently computes its assigned rows of the output matrix. Threads write to disjoint rows of C, so no inter-thread communication or locking is needed; the algorithm is embarrassingly parallel. The same cache-friendly loop ordering is preserved.

// Same algorithm, now parallel: each thread owns whole rows of C, so writes never collide
#pragma omp parallel for
for (int i = 0; i < m; i++)
  for (int k = 0; k < n; k++)
    for (int j = 0; j < p; j++)
      C[i*p+j] += A[i*n+k] * B[k*p+j];
g++ -O3 -fopenmp -std=c++17 main.cpp -o benchmark
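
A minimal sketch of the row-parallel version as a complete function, assuming double elements and flat row-major storage. The name matmul_rows_parallel and the schedule(static) clause are choices made for this illustration, not details from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-parallel matmul: iterations of the i loop are split across threads.
// Each thread writes only its own rows of C, so there is no data race.
// If compiled without -fopenmp the pragma is ignored and the loop runs serially
// with the same result.
std::vector<double> matmul_rows_parallel(const std::vector<double>& A,
                                         const std::vector<double>& B,
                                         int m, int n, int p) {
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(B.size() == static_cast<std::size_t>(n) * p);
    std::vector<double> C(static_cast<std::size_t>(m) * p, 0.0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        for (int k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (int j = 0; j < p; ++j) {
                C[i * p + j] += a * B[k * p + j];
            }
        }
    }
    return C;
}
```

Because every output element is still accumulated in ascending k order by a single thread, the parallel result should match the serial one exactly.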
PHASE 3

NVIDIA GPU — CUDA

Massive Parallelism

Where single-core does one multiply at a time and multi-core does N — CUDA does thousands simultaneously. Each output element C[i][j] is computed by a dedicated CUDA thread. The GPU launches a grid of thread blocks, mapping the 2D output matrix directly to the 2D thread grid.

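template <typename T>
__global__ void matmul_kernel(
    const T* A, const T* B, T* C, int m, int n, int p) {
  // One thread per output element C[i][j]
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
  if (i < m && j < p) {  // guard threads that fall past the matrix edge
    T sum = 0;
    for (int k = 0; k < n; k++)
      sum += A[i*n+k] * B[k*p+j];
    C[i*p+j] = sum;
  }
}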
nvcc -O3 main.cu -o benchmark
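
The mapping from the 2D thread grid to output elements can be checked on the host without a GPU. The sketch below is plain C++, not CUDA: it emulates the kernel's index arithmetic and counts how many emulated threads would write each element of C. Dim2, blocks_for and count_writes are hypothetical names for this illustration; in real CUDA code the grid and block sizes would typically be dim3 values passed at kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim2 { int x; int y; };  // stand-in for a 2D block or grid size

// Ceil-division: number of blocks needed to cover `extent` elements.
inline int blocks_for(int extent, int block) { return (extent + block - 1) / block; }

// Emulates blockIdx/threadIdx arithmetic for an m×p output matrix.
// Returns, for each element of C, how many emulated threads would write it.
std::vector<int> count_writes(int m, int p, Dim2 block) {
    assert(m > 0 && p > 0 && block.x > 0 && block.y > 0);
    const Dim2 grid{blocks_for(p, block.x), blocks_for(m, block.y)};
    std::vector<int> writes(static_cast<std::size_t>(m) * p, 0);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    const int i = by * block.y + ty;  // row, as in the kernel
                    const int j = bx * block.x + tx;  // column, as in the kernel
                    if (i < m && j < p) ++writes[i * p + j];  // same bounds check
                }
    return writes;
}
```

When the matrix dimensions are not multiples of the block size, the rounded-up grid launches some surplus threads; the bounds check discards them, so every element is still written exactly once.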
THOUSANDS OF THREADS

Simultaneously computing output elements

SPEEDUP: ~100–1000× (conceptual)
04 / Architecture Comparison

Three Ways to
Multiply.

Architecture    Parallelism         Memory Model        Best For
Single-Core     1 thread            L1/L2 cache         Small matrices
OpenMP          N threads           Shared L3           Medium matrices
CUDA GPU        1000s of threads    VRAM (GDDR/HBM)     Large matrices

// RELATIVE SPEEDUP (Conceptual Visualization)

Single-Core CPU: 1.0× (Baseline)
BASELINE

Cache-optimized i-k-j implementation

Multi-Core CPU (OpenMP): ~4–8× Faster
MULTI-THREAD

Scales roughly linearly with available CPU cores until memory bandwidth becomes the limit

NVIDIA GPU (CUDA): ~100–1000× Faster
MASSIVE PARALLEL

Advantage is largest on large matrices (n ≥ 512)

Actual speedup depends on matrix size, data type, and hardware. Large matrices (512×512 and above) show maximum GPU advantage.

int

Fast. No floating-point unit needed. Accumulated sums risk integer overflow at large sizes.

float

32-bit. GPU native. Best throughput on CUDA. FP32 is the AI training standard.

double

64-bit. Most precise. 2× the memory footprint of float. Slower on GPU than float.
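
The precision difference between float and double shows up in long accumulations, which is exactly what the matmul inner loop performs. The sketch below adds the same product n times in type T and reports the absolute error against a reference computed in double. The function name accumulation_error and the choice of 0.1 as the input value are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Adds the product a*b to a running sum n times in type T, as a matmul inner
// loop would, and returns the absolute error versus n * a * b in double.
template <typename T>
double accumulation_error(std::size_t n) {
    const T a = static_cast<T>(0.1f);  // identical input value for every T
    const T b = static_cast<T>(1.0f);
    T sum = T{};
    for (std::size_t k = 0; k < n; ++k) {
        sum += a * b;  // rounding error accumulates here
    }
    const double exact = static_cast<double>(n) * static_cast<double>(0.1f);
    return std::fabs(static_cast<double>(sum) - exact);
}
```

Comparing accumulation_error<float>(n) with accumulation_error<double>(n) for a large n, such as ten million, makes the gap visible; the exact figures depend on compiler flags and are best measured rather than assumed.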

05 / Why Cache Ordering Matters

The Hidden
Performance Killer.

Naive matrix multiplication uses i–j–k loop order. The problem: the inner loop runs over k, so B[k][j] walks down a column and jumps a full row (p elements) in memory on every step. For large matrices, nearly every access to B is a cache miss.

MatMul uses i–k–j ordering. The inner loop now runs over j, so B[k][j] and C[i][j] are accessed sequentially and consecutive elements share a cache line. The difference can reach roughly 10× on large matrices.

// SLOW  i-j-k  (for i → for j → for k):  B[k*p+j] JUMPS       → CACHE MISS
// FAST  i-k-j  (for i → for k → for j):  B[k*p+j] SEQUENTIAL  → CACHE HIT
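
The effect of the two orderings can be made concrete by counting how many distinct cache lines of B a single inner loop touches. This is a simplified sketch, assuming a 64-byte cache line (common on x86-64 but hardware-dependent), a matrix B that starts on a line boundary, and row-major n×p storage. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLineBytes = 64;  // assumed cache-line size

// i-j-k: the inner index is k, so B[k*p + j] advances by p elements per step.
// Returns the number of distinct cache lines of B touched by one inner loop.
template <typename T>
std::size_t lines_touched_ijk(std::size_t n, std::size_t p, std::size_t j) {
    std::size_t count = 0;
    std::size_t last = static_cast<std::size_t>(-1);
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t line = (k * p + j) * sizeof(T) / kLineBytes;
        if (line != last) { ++count; last = line; }  // addresses only increase
    }
    return count;
}

// i-k-j: the inner index is j, so B[k*p + j] advances by one element per step.
template <typename T>
std::size_t lines_touched_ikj(std::size_t p, std::size_t k) {
    std::size_t count = 0;
    std::size_t last = static_cast<std::size_t>(-1);
    for (std::size_t j = 0; j < p; ++j) {
        const std::size_t line = (k * p + j) * sizeof(T) / kLineBytes;
        if (line != last) { ++count; last = line; }
    }
    return count;
}
```

For a 64×64 matrix of doubles, one i-j-k inner loop touches 64 different lines (one per access), while one i-k-j inner loop touches 8 (eight doubles per line). Real hardware adds prefetching and associativity effects that this count ignores.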
06 / Structure

Three Phases.
One Repo.

cpu-gpu-matmul-benchmark/
├── phase1-matmul-singlecore/  ← C++ baseline
│   └── main.cpp               
├── phase2-matmul-multicore/   ← OpenMP parallel
│   └── main.cpp               
├── phase3-matmul-gpu/         ← CUDA massive
│   └── main.cu                
├── README.md                             
└── LICENSE (MIT)
PHASE 1

C++17 · Single Thread · Cache-Optimized

PHASE 2

OpenMP · Multi-Thread · Same Algorithm

PHASE 3

CUDA · GPU Kernel · Thread-Per-Element

07 / Engineering Insights

What This
Taught Me.

Key insights from the internship under Dr. Unnikrishnan Cheramangalath at IIT Palakkad:

Cache is Everything

Changing loop order from i-j-k to i-k-j — same algorithm, same result — can improve performance by 10× on large matrices before touching parallelism.

Embarrassingly Parallel

MatMul is a perfect parallel problem: no thread needs another thread's output. This makes it ideal for both OpenMP and CUDA — and explains why GPUs dominate AI.

Templates Beat Duplication

One templated createMatrix<T>() and matmul<T>() handles int, float, and double without duplication. C++ templates are the right tool for performance code.
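
A sketch of how a single templated factory might serve all three element types. The source names createMatrix<T>() but does not show it, so the parameters, the seeded std::mt19937 generator and the value ranges below are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical signature: the project's createMatrix<T>() may differ.
// Returns a rows×cols row-major matrix filled with pseudo-random values.
template <typename T>
std::vector<T> createMatrix(std::size_t rows, std::size_t cols, unsigned seed = 42) {
    static_assert(std::is_arithmetic_v<T>, "T must be an arithmetic type");
    std::mt19937 gen(seed);  // fixed seed keeps benchmark inputs reproducible
    std::vector<T> M(rows * cols);
    if constexpr (std::is_integral_v<T>) {
        // Small integers keep accumulated products far from overflow.
        std::uniform_int_distribution<T> dist(0, 9);
        for (auto& x : M) x = dist(gen);
    } else {
        std::uniform_real_distribution<T> dist(T(0), T(1));
        for (auto& x : M) x = dist(gen);
    }
    return M;
}
```

The if constexpr branch (C++17) selects the distribution at compile time, so createMatrix<int>, createMatrix<float> and createMatrix<double> share one definition without runtime dispatch.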

The GPU Isn't Always Faster

For small matrices, GPU launch overhead dominates. The crossover point — where GPU wins — depends on matrix size and data type. Benchmarking reveals this precisely.
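
The crossover idea can be illustrated with a toy cost model: CPU time is work divided by throughput, and GPU time is the same plus a fixed overhead. Everything here is an assumption for illustration. The struct, its fields and the constant-overhead simplification are hypothetical, and real transfer costs grow with matrix size, so measured crossovers will differ.

```cpp
#include <cassert>
#include <cmath>

// Toy cost model. Every field is a placeholder to be replaced by measurements.
struct CostModel {
    double cpu_gflops;      // sustained CPU throughput (assumed)
    double gpu_gflops;      // sustained GPU throughput (assumed)
    double gpu_overhead_s;  // fixed launch + transfer cost per call (assumed)
};

// An n×n by n×n product performs n³ multiplies and n³ adds.
inline double flop_count(double n) { return 2.0 * n * n * n; }

inline double cpu_seconds(const CostModel& c, double n) {
    return flop_count(n) / (c.cpu_gflops * 1e9);
}

inline double gpu_seconds(const CostModel& c, double n) {
    return c.gpu_overhead_s + flop_count(n) / (c.gpu_gflops * 1e9);
}

// Real-valued n at which the two modelled times are equal.
// Returns -1 if the modelled GPU is never faster.
inline double crossover_n(const CostModel& c) {
    if (c.gpu_gflops <= c.cpu_gflops) return -1.0;
    // overhead = 2n³ · (1/R_cpu − 1/R_gpu), solved for n.
    const double gain_per_flop =
        1.0 / (c.cpu_gflops * 1e9) - 1.0 / (c.gpu_gflops * 1e9);
    return std::cbrt(c.gpu_overhead_s / (2.0 * gain_per_flop));
}
```

Below the returned n the fixed overhead dominates and the modelled CPU wins; above it the GPU's higher throughput takes over. Plugging in measured throughput and overhead from your own hardware gives a rough first estimate to compare against the benchmark.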

08 / What's Next

Still
Building.

Phase 1: Single-core CPU · Base implementation with cache optimization
Phase 2: OpenMP multicore · Multi-thread scaling analysis
Phase 3: CUDA GPU kernel · Massive parallelism on NVIDIA hardware
Performance timing with std::chrono · High-precision latency measurement in all phases
Automated comparison in Jupyter · End-to-end benchmarking and plotting pipeline
Tensor Core acceleration · Using NVIDIA's dedicated matrix hardware (WMMA)
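
For the std::chrono timing item, a small helper along these lines is one possible approach. It is a sketch, not the project's timing code: best_seconds and gflops are hypothetical names, and taking the best of several repeats is an assumed policy for reducing noise.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs `work` `repeats` times and returns the shortest wall-clock time in seconds.
// steady_clock is monotonic, so it is not affected by system clock adjustments.
template <typename F>
double best_seconds(F&& work, int repeats = 5) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Throughput of an (m×n)·(n×p) product: 2·m·n·p operations over `seconds`.
inline double gflops(double m, double n, double p, double seconds) {
    return 2.0 * m * n * p / seconds / 1e9;
}
```

Wrapping a multiply in a lambda and passing it to best_seconds gives a time that gflops can convert into a size-independent throughput figure, which makes runs at different matrix sizes easier to compare.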
09 / Tech Stack

Built
With.

C++17

Core language — templates, cache-optimized loops

CUDA

NVIDIA GPU kernel — thread-per-element parallelism

OpenMP

CPU multicore — one pragma, full core utilization

Jupyter

Benchmarking analysis — charts, timing, comparison

NVCC

Compilers — -O3 optimization flag on all builds

MIT License

Open source — fork it, extend it, learn from it

It's
Open Source.

ORIGINALLY DEVELOPED AS A RESEARCH INTERNSHIP PROJECT AT IIT PALAKKAD

Fork it. Run the benchmarks on your hardware. Add Tensor Core support. Compare your CPU to mine. That's the point.

Jupyter Notebook 58.2% · C++ 30.6% · CUDA 11.2%