
CPU to GPU: Practical Workload Conversion in ML Pipelines

7 min read
ml-infra · gpu · optimization · svd · experiment-tooling

During the task-geometry experiment, we hit a wall: spectral feature extraction took over 6 hours. The analysis code used numpy's CPU-bound np.linalg.svd across 38 adapters × 56 layers. Standard approach for linear algebra in Python, but it leaves GPU hardware completely unused. The GPU was provisioned for inference phases; the analysis pipeline just defaulted to CPU.

After optimizing, that phase runs in ~25 minutes. Here's what worked.

The Workload

The experiment analyzes LoRA adapter weight matrices. For each adapter, we decompose 56 weight delta matrices via SVD to extract spectral features.

That's 2,128 SVD calls for the main feature pass, plus another 2,632 truncated SVDs for expanded features. Each matrix is modest (about 3072×768), but doing them one by one in numpy adds up fast.

Technique 1: Batched GPU SVD (the big win)

The key insight was that all matrices have the same shape. Instead of 56 individual SVD calls per adapter, we can stack same-shape matrices into a single tensor and issue one batched torch.linalg.svd call.

from collections import defaultdict

import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Group matrices by shape so each batch stacks into one uniform tensor
shape_groups = defaultdict(list)
for i, m in enumerate(matrices):
    shape_groups[m.shape].append((i, m))

# One host-to-device transfer and one kernel launch per shape group
singular_values = [None] * len(matrices)
for shape, group in shape_groups.items():
    indices, mats = zip(*group)
    batch = torch.from_numpy(np.stack(mats)).float().to(device)
    S_batch = torch.linalg.svdvals(batch)  # (batch, min(rows, cols))
    for idx, S in zip(indices, S_batch.cpu().numpy()):
        singular_values[idx] = S

A batch of 28 same-shape layers in float32 is ~250MB. Fits on any modern GPU. The L4's 24GB VRAM handles an entire adapter in one shot.

Most of the speedup came from here. Numpy's np.linalg.svd calls LAPACK under the hood, which is well optimized for a single matrix, but you pay Python loop overhead and memory traffic for every call. The batched GPU version pays one host-to-device transfer and one kernel launch per shape group, then the GPU's parallel cores process all matrices at once.

We built three entry points for different use cases:

  • batched_svdvals: singular values only (fastest, no U/Vh). For when you only need the spectrum.
  • batched_svd: full U, S, Vh. For alignment calculations that need singular vectors.
  • batched_svd_lowrank: truncated top-k via torch.svd_lowrank (randomized). For when you only need the top few singular vectors.

The truncated variant is worth calling out. scipy.sparse.linalg.svds(k=3) on CPU was the original approach. Correct, but slow because it still operates on dense matrices. torch.svd_lowrank uses a randomized method that's faster for small k on GPU, and the batched version computes top-k SVs for all layers in one call.
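A minimal sketch of what the batched truncated call looks like (toy shapes and variable names here are illustrative, not the utility's actual signature):

```python
import torch

# Stand-in for a stack of same-shape weight delta matrices
batch = torch.randn(8, 64, 16)

# Randomized truncated SVD: top-k triplets for every matrix in one call.
# torch.svd_lowrank broadcasts over leading batch dimensions.
U, S, V = torch.svd_lowrank(batch, q=3, niter=4)
# U: (8, 64, 3), S: (8, 3), V: (8, 16, 3)
```

Bumping niter trades a little speed for accuracy of the randomized approximation; the default is 2.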

Technique 2: Shape-Aware Batching (free parallelism)

The batching strategy generalizes. Any time you have N matrices of the same shape and need the same decomposition, stack and call the batched variant. The GPU doesn't care whether the matrices come from the same adapter or different ones. It just sees a (batch, rows, cols) tensor.

For our pipeline, this meant 2 GPU kernel launches per adapter instead of 56 CPU calls. The same pattern applies to any linear algebra operation that has a batched torch equivalent: SVD, eigendecomposition, matrix multiplication, Cholesky.
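The pattern works even without torch: numpy's linalg routines also broadcast over leading dimensions, so a toy-scale sanity check of the stack-and-batch idea looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten same-shape matrices, possibly from different adapters
mats = [rng.standard_normal((32, 8)) for _ in range(10)]
batch = np.stack(mats)                      # (10, 32, 8)

# One call decomposes all ten: np.linalg.svd broadcasts over the batch dim
S = np.linalg.svd(batch, compute_uv=False)  # (10, 8), sorted descending
```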

Technique 3: Device-Transparent API with Graceful Fallback

We designed the shared utility to be device-transparent. Callers pass device="auto" and the utility resolves to the best available backend:

  1. CUDA available → batched torch on GPU
  2. Torch available, no GPU → batched torch on CPU (still faster than numpy for large batches due to better threading)
  3. No torch → sequential numpy (original behavior)

This matters for portability. The same scripts run locally on a laptop (numpy fallback), on a CPU-only cloud VM (torch CPU), or on a GPU instance (full acceleration). No code changes, no conditional imports at the call site.
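A sketch of how such a routing wrapper can look (hypothetical helper names; the real implementation lives in commons/svd_utils.py):

```python
import numpy as np

def resolve_backend(device="auto"):
    # Hypothetical resolver mirroring the device="auto" fallback chain
    if device != "auto":
        return device
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "numpy"

def batched_svdvals(mats, device="auto"):
    backend = resolve_backend(device)
    if backend == "numpy":
        # Sequential numpy fallback: the original behavior
        return np.stack([np.linalg.svd(m, compute_uv=False) for m in mats])
    import torch
    batch = torch.from_numpy(np.stack(mats)).float().to(backend)
    return torch.linalg.svdvals(batch).cpu().numpy()
```

Callers never import torch themselves; the wrapper degrades gracefully wherever it runs.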

Technique 4: ProcessPoolExecutor for CPU Fallback

When there's no GPU, we fall back to multiprocessing. Each adapter's SVD is independent, so we fan out across CPU cores with ProcessPoolExecutor. The implementation is a thin wrapper:

import sys
from concurrent.futures import ProcessPoolExecutor

def _pool_init(sys_paths):
    # Runs once per worker process: make experiment modules importable
    if sys_paths:
        sys.path[:0] = list(sys_paths)

def parallel_map(worker_fn, args_list, num_workers, sys_paths=None):
    if num_workers <= 1:
        return [worker_fn(a) for a in args_list]

    with ProcessPoolExecutor(
        max_workers=num_workers,
        initializer=_pool_init,
        initargs=(sys_paths,),
    ) as pool:
        return list(pool.map(worker_fn, args_list))

One subtle detail: worker processes need sys.path set up correctly for experiment imports. The _pool_init initializer handles this, injecting paths before the worker starts. Without it, workers fail with ModuleNotFoundError on the first import.

On the 8-vCPU L4 instance, CPU parallelism alone gave ~6x throughput (not 8x, SVD is memory-bandwidth bound and cores share the bus). Good enough as a fallback when GPU instances aren't available.

Technique 5: Dual PCA for Memory-Constrained Environments

A different pattern showed up in Phase 4, where we needed PCA on weight deltas. The data is 18 adapters × 352 million parameters. That matrix doesn't fit in the 32GB RAM of an L4 instance.

The standard approach (load everything, compute covariance, eigendecompose) hit OOM immediately. The fix was the dual PCA method: when samples are far fewer than features, eigendecompose the n×n Gram matrix instead of the d×d covariance matrix.

We streamed weight deltas to a disk-backed numpy memmap, then computed the 18×18 Gram matrix row by row. Each row only needs two 352M vectors in memory at a time:

# mmap: (n, d) disk-backed np.memmap of weight deltas; mean: (d,) feature mean
G = np.zeros((n, n), dtype=np.float64)
for i in range(n):
    xi = mmap[i].astype(np.float64) - mean
    G[i, i] = np.dot(xi, xi)  # diagonal entry
    for j in range(i + 1, n):
        xj = mmap[j].astype(np.float64) - mean
        G[i, j] = G[j, i] = np.dot(xi, xj)

Peak memory: ~5.6GB (two 352M float64 vectors in memory at once, ~2.8GB each) instead of ~25GB (the full matrix). The eigendecomposition of the 18×18 Gram matrix is instant. Total time: ~15 minutes on CPU, which is acceptable for a step that runs once.
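The Gram route needs one more step not shown above: mapping each n-dimensional eigenvector back to a feature-space principal direction via v_k = Xᵀu_k / √λ_k. A toy-scale sketch that checks the dual route against a direct SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 500                       # stand-in for 18 samples x 352M features
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                 # centering drops one rank: keep n-1 comps

# Dual PCA: eigendecompose the small n x n Gram matrix, not the d x d covariance
G = X @ X.T
evals, U = np.linalg.eigh(G)        # ascending order
k = n - 1
evals = np.clip(evals[::-1][:k], 0, None)
U = U[:, ::-1][:, :k]

# Map back to feature space: v_k = X.T @ u_k / sqrt(lambda_k)
V = (X.T @ U) / np.sqrt(evals)

# Spectrum matches a direct SVD of X
S = np.linalg.svd(X, compute_uv=False)
assert np.allclose(np.sqrt(evals), S[:k])
```

At full scale the X.T @ U product is itself streamed over the memmap, one row of X at a time.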

This is the inverse of the GPU batching pattern. Instead of "throw more compute at it," restructure the math to fit the hardware. GPU batching solves compute-bound problems. Dual PCA solves memory-bound problems.

What Didn't Help

ProcessPoolExecutor for the Gram matrix computation. The loop looks parallelizable (each row is independent), but each dot product reads two 352M-element vectors from a shared memmap. Spawning processes means each one independently page-faults through the same file. You hit disk/memory bus contention and get diminishing returns. The real solution is GPU: load the 18×352M matrix into A100 VRAM (25GB fits in 80GB) and call torch.mm(X, X.T). Obvious next step.

Over-parallelizing small workloads. Process pool overhead (spawning, pickling, collecting) is non-trivial. For workloads under ~10 seconds total, single-threaded is faster. We gate on num_workers <= 1 to skip it.

The Reusable Parts

We extracted the optimization code into a shared commons/ library that any experiment can import:

  • commons/svd_utils.py: batched GPU/CPU SVD with device routing and shape grouping
  • commons/parallel.py: ProcessPoolExecutor wrapper with sys.path initialization

Not experiment-specific. Any pipeline that decomposes weight matrices (pruning analysis, lottery ticket experiments, representation similarity) can use the same utility. Code written against the device="auto" API picks up GPU acceleration automatically when run on GPU instances.

Summary

| Technique | Speedup | When to use |
|---|---|---|
| Batched GPU SVD | ~14x | Many same-shape matrices, GPU available |
| Shape-aware batching | Free | Group by shape, one kernel per group |
| Device-transparent API | 1x (portability) | Run anywhere without conditionals |
| ProcessPoolExecutor | ~6x on 8 cores | CPU-only fallback, embarrassingly parallel workloads |
| Dual PCA (Gram method) | N/A (enables) | Samples far fewer than features, data doesn't fit in RAM |
| Deterministic result caching | Skip on resume | Pure functions with stable inputs |

Before reaching for distributed compute or bigger machines, check whether the workload can be restructured to match the hardware you already have. Batching for GPU, streaming for memory, caching for restarts.