
Why CUDA kernels silently corrupt memory and how to catch the bug

A practical guide to debugging silent memory corruption in CUDA kernels, with compute-sanitizer workflows and a look at Rust-on-GPU tooling.

Alan West
Authon Team

The 2 AM kernel debugging session nobody warned you about

Last month I was helping a friend debug a training pipeline that worked perfectly on his 4090 dev box but produced garbage loss values once it hit the cluster. No segfault. No cudaErrorInvalidValue. Just wrong numbers, intermittently.

Six hours in, we found it: a kernel was writing one element past a shared memory buffer when blockDim.x happened to fall in a specific range. On the dev card the stomped bytes were padding. On the cluster they were the next thread block's working set.

If you've written more than a few hundred lines of CUDA, you've hit some flavor of this. Let's walk through why GPU kernels go silently wrong, how to actually catch the bug, and what newer tooling — including some interesting Rust-based experiments — is trying to do about the underlying problem.

Root cause: the kernel boundary breaks your safety net

When you write host C++, you have a stack of tools that catch memory bugs early. ASan. Valgrind. The OS handing you a SIGSEGV when you read past a page. As soon as you cross into device code, most of that disappears.

Here's a reduction kernel I see in code reviews constantly:

cuda
__global__ void reduce_sum(const float* __restrict__ in,
                           float* __restrict__ out,
                           int n) {
    __shared__ float scratch[256];   // assumes blockDim.x <= 256
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Bug 1: no bounds check between tid and the scratch array size
    scratch[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Bug 2: stride loop only works if blockDim.x is a power of 2
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) scratch[tid] += scratch[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = scratch[0];
}

Launch it with blockDim.x = 300 and threads 256–299 write past scratch. There's no trap. No exception. The thread block keeps running. The "result" gets written. Your loss curve quietly bends.

Three things make this class of bug so awful:

  • The SIMT execution model means the 32 threads of a warp execute in lockstep, so one wrong indexing expression becomes a warp's worth of bad accesses before anything observable happens.
  • Shared memory has no guard pages. There's no MMU between thread blocks and their scratch.
  • Most CUDA error returns are about the host-side call (launch config, allocation), not about whatever the device actually did with the memory.

Step 1: stop debugging without compute-sanitizer

The first thing I do when a kernel produces "wrong numbers" is rerun with compute-sanitizer. It ships with the CUDA toolkit and replaces the old cuda-memcheck. Docs live at docs.nvidia.com/cuda/compute-sanitizer.

bash
# memcheck catches out-of-bounds and misaligned accesses
compute-sanitizer --tool=memcheck ./my_app

# racecheck flags shared-memory data races
compute-sanitizer --tool=racecheck ./my_app

# initcheck finds reads of uninitialized device memory
compute-sanitizer --tool=initcheck ./my_app

For the reduction kernel above, memcheck will report the offending thread and block indices, and if you compile with -lineinfo it will point at the exact source line. That one command would have saved my friend's afternoon.

The catch: compute-sanitizer slows things down 5x–50x depending on tool and kernel. You will not be running it on a full training run in CI. Use it on the smallest failing reproducer you can build.
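
For the kernel above, that reproducer can be as small as the snippet below. The sizes and the 300-thread block are arbitrary choices of mine, and it assumes the buggy reduce_sum from earlier is pasted into the same file; the point it illustrates is that the usual host-side error checks can come back clean while the kernel stomps shared memory, which is exactly why you hand the binary to the sanitizer.

cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Declaration of the buggy reduce_sum kernel from earlier; paste its definition above.
__global__ void reduce_sum(const float* in, float* out, int n);

int main() {
    const int n = 1000;      // small, arbitrary input size
    const int block = 300;   // > 256, so threads 256-299 overrun the scratch array
    const int grid = (n + block - 1) / block;

    std::vector<float> h_in(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduce_sum<<<grid, block>>>(d_in, d_out, n);

    // Both calls can come back cudaSuccess even though the kernel wrote past
    // its shared-memory buffer; the violation only shows up under memcheck.
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("sync:   %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Build it with nvcc -lineinfo so the memcheck report includes source lines, then run it under compute-sanitizer --tool=memcheck.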

Step 2: write kernels that can't get into this state

Once you've found the bug, the next question is how to stop writing the bug in the first place. A few habits that have saved me real time:

  • Pass the scratch capacity as a template/kernel parameter instead of hardcoding it as a literal in the __shared__ declaration. Then assert blockDim.x <= scratch_capacity at the top of the kernel (see the sketch after this list).
  • Never assume blockDim.x is a power of two. Either pad the input or use a loop bound derived from the actual block dimension.
  • Wrap raw __shared__ arrays in a tiny accessor that does a bounds check in debug builds and compiles down to a raw store in release.
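
To make the first two habits concrete, here's one way the reduction from earlier could be restructured. This is my sketch, not the only correct shape: reduce_sum_checked, the BLOCK_SIZE parameter, and the next-power-of-two stride trick are illustrative choices, but they show the scratch capacity and the loop bound both flowing from the launch configuration instead of a hardcoded 256.

cuda
#include <cassert>

// BLOCK_SIZE is a compile-time capacity; the device-side assert refuses
// oversized launches (in debug builds), and the stride loop no longer
// assumes a power-of-two block dimension.
template <int BLOCK_SIZE>
__global__ void reduce_sum_checked(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int n) {
    __shared__ float scratch[BLOCK_SIZE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Habit 1: the capacity has a name, and we check against it before writing.
    assert((int)blockDim.x <= BLOCK_SIZE);

    scratch[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Habit 2: derive the stride from the actual block dimension. Start at the
    // next power of two >= blockDim.x and guard the partner index, so nothing
    // past the live elements is ever read.
    int stride = 1;
    while (stride < (int)blockDim.x) stride <<= 1;
    for (int s = stride / 2; s > 0; s >>= 1) {
        if (tid < s && tid + s < (int)blockDim.x) scratch[tid] += scratch[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = scratch[0];
}

// Typical launch: reduce_sum_checked<256><<<grid, 256>>>(d_in, d_out, n);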

Here's the accessor pattern I default to:

cuda
template <typename T, int N>
struct SharedView {
    T* data;
    __device__ T& operator[](int i) {
#ifndef NDEBUG
        // Trap the offending thread so compute-sanitizer prints a clean trace
        if (i < 0 || i >= N) __trap();
#endif
        return data[i];
    }
};

It's not pretty, and it does nothing in release builds — but combined with compute-sanitizer you get a clean trap at the exact bad index instead of mysterious downstream garbage.
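
For completeness, here's roughly how the accessor slots into the reduction. The reduce_sum_viewed and shared_mem names are just illustrative; the raw __shared__ array still owns the storage, and the view only adds the debug-time check.

cuda
template <int BLOCK_SIZE>
__global__ void reduce_sum_viewed(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int n) {
    __shared__ float shared_mem[BLOCK_SIZE];
    SharedView<float, BLOCK_SIZE> scratch{shared_mem};

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // In a debug build, an out-of-range tid traps here instead of silently
    // stomping whatever lives after the array.
    scratch[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();
    // ...the rest of the reduction, indexing through `scratch` the same way.
}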

Step 3: consider moving the safety to the compiler

Bounds-checked accessors are a workaround for a deeper problem: C++ lets you express the wrong thing in the first place. This is where the recent wave of Rust-on-the-GPU work gets interesting.

The community Rust-CUDA project has been pushing on this for years, letting you write kernels in Rust and lower them to PTX through a custom rustc backend. The historical trade-off has been ecosystem maturity — you give up a lot of CUDA C++ libraries to get borrow checking on the device.

More recently, there's a project page at nvlabs.github.io/cuda-oxide describing a Rust-to-CUDA compiler effort called CUDA-oxide. I haven't tested it thoroughly yet, and I'd encourage you to read the project page directly rather than take any blog's summary as gospel — including this one. Given how fast this space is moving, I'm not going to claim feature parity, performance numbers, or production readiness based on what I've read so far.

What I will say is the underlying idea is the right one: the easiest way to stop writing the buggy reduction above is to use a language where the compiler refuses to compile it. A Rust kernel expresses the scratch buffer as a typed slice with a known length, and any indexed access is either bounds-checked at runtime or proven safe statically.

The general shape, in Rust-on-GPU style pseudocode:

rust
#[kernel]
pub unsafe fn reduce_sum(input: &[f32], output: &mut [f32]) {
    // Length-carrying slice — out-of-bounds indexing panics in debug
    // and certain misuses are refused at compile time.
    let scratch = shared_array::<f32, 256>();
    let tid = thread::index_x() as usize;

    // .get() returns Option<&T>, forcing you to handle the OOB case
    let v = input.get(tid).copied().unwrap_or(0.0);
    if tid < scratch.len() {
        scratch[tid] = v;
    }
    sync_threads();
    // ...reduction with length-aware iteration
}

You still have to think about warps, divergence, and shared memory layout — Rust does not magically make GPU programming easy. But the specific class of bug I lost a Saturday to becomes a compile error or a clean panic instead of a silent corruption.

Prevention checklist

If you want to ship CUDA code that doesn't bite you at 2 AM, the boring stuff helps most:

  • Run the smallest failing repro under compute-sanitizer --tool=memcheck and --tool=racecheck before you call a bug "fixed".
  • Keep a release-vs-debug split for your device code and gate the trap-on-OOB accessors behind #ifndef NDEBUG.
  • Treat any kernel that hardcodes a shared-memory size as a future incident.
  • Add a numeric regression test that compares against a CPU reference on a small input (sketched below). Silent corruption is much easier to catch when you have ground truth sitting next to the kernel output.
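
A sketch of that last test, assuming the reduce_sum kernel from earlier is compiled into the same file; the block size, input values, and tolerance are arbitrary choices, and whatever test runner you already have will do.

cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Declaration of the reduce_sum kernel under test; its definition lives with
// the rest of the device code.
__global__ void reduce_sum(const float* in, float* out, int n);

// Compares the kernel's per-block partial sums against a CPU reference.
bool check_reduce_sum(int n) {
    const int block = 256;   // assumed to match the kernel's scratch capacity
    const int grid = (n + block - 1) / block;

    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0.001f * (i % 97);  // small, deterministic values

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduce_sum<<<grid, block>>>(d_in, d_out, n);

    std::vector<float> h_out(grid);
    cudaMemcpy(h_out.data(), d_out, grid * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // CPU ground truth: sum each block's slice of the input on the host.
    bool ok = true;
    for (int b = 0; b < grid; ++b) {
        float ref = 0.0f;
        for (int i = b * block; i < std::min(n, (b + 1) * block); ++i) ref += h_in[i];
        if (std::fabs(ref - h_out[b]) > 1e-3f) {
            printf("block %d: got %f, expected %f\n", b, h_out[b], ref);
            ok = false;
        }
    }
    return ok;
}

Run it over a handful of input sizes, including ones that don't divide evenly by the block size, so the tail block gets exercised too.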

I'm cautiously optimistic about the Rust-on-GPU direction, but I wouldn't rewrite a working CUDA codebase just because there's a shiny new compiler on Hacker News. The bigger win today is taking the safety tooling you already have and actually running it on your kernels before they hit the cluster.
