
Execution Model

This page targets general developers and favors intuitiveness. You can find a more formal, rigorous description in the Neutrino Design section of our OSDI '25 paper.

Motivating Example

neutrino probes are executed under a simple yet robust model. Let's consider a motivating example (independent of any GPU specifics):

original.c
// use a function because GPU kernels are functions
int add(int a, int b) {
    return a + b;
}

int result = add(1, 2); // shall be 3

If we want to observe the program, for example the value of a at runtime, one interesting way is to "fuse" another function (what we call a probe) into it, like the following example:

probe.c
// use a function because GPU kernels are functions
int probed_add(int a, int b, int* c) {
    *c = a;       // add our probed code
    return a + b; // remain the original code
}

// and change the caller a bit
int* c = (int*) malloc(sizeof(int)); 
int result = probed_add(1, 2, c); // shall still be 3
Alternatively, we can keep the call site tidy by wrapping the allocation and the probed call in a launcher:

probe.c
// use a function because GPU kernels are functions
int probed_add(int a, int b, int* c) {
    *c = a;       // add our probed code
    return a + b; // remain the original code
}

int launch(int (*binop)(int, int, int*), int lhs, int rhs) {
    int* c = (int*) malloc(sizeof(int));
    int result = binop(lhs, rhs, c);
    // ... read *c (the probe's reading) before freeing it ...
    free(c);
    return result;
}

int result = launch(probed_add, 1, 2);

This idea is similar to kernel fusion, which shows that two independent functions can be safely merged into one if:

  • The two merged functions have independent parameters and return values.
  • The two merged functions have independent variable names (register spaces).
  • The two merged functions have independent instructions, which is naturally guaranteed as instructions within a thread are executed one by one on CPU/GPU/NPU.

By ensuring these conditions, this programming model is robustly correct, as the compiler/assembler helps protect correctness, and it is nearly zero-overhead because there is no context switch, stack frame creation, etc.
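As an illustration, here is a minimal C sketch (with hypothetical functions f and g, not neutrino's machinery) of why fusion under these conditions is safe: merging is just concatenating the bodies, and independence keeps each result unchanged.

fusion.c
// two independent functions: disjoint parameters, returns, and locals
int f(int x)     { return x * 2; }
float g(float y) { return y + 1.0f; }

// their fusion: the instruction streams interleave safely because
// neither body reads or writes anything the other touches
void fused(int x, float y, int* f_out, float* g_out) {
    int fr   = x * 2;      // f's body, untouched
    float gr = y + 1.0f;   // g's body, untouched
    *f_out = fr;
    *g_out = gr;
}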

The only difference between a neutrino probe and ordinary kernel fusion is that the probe's code depends on the client program (e.g., *c = a;), which still guarantees safety as long as the dependence is read-only.

Another analogy is a daemon that runs in the background and is invoked whenever the foreground code's runtime values need to be read.

Some Justification

However, though the model is simple, its requirement is strict: the program must be "static", meaning

  1. Code cannot modify itself at runtime (unlike monkey patching or a simple eval() in JavaScript/Python).
  2. All code must be known in advance for processing, i.e., the program cannot load unknown code (like dlopen).

Violating either requirement makes the probe trace incorrectly, which is why this model has been largely forgotten by CPU performance engineers, who now use processor trace or interrupts for better coverage.

However, neutrino identifies that this model works well on GPUs because:

  1. GPU code cannot modify itself.
  2. GPU code must be explicitly loaded into the driver.

These two properties led us to choose this simple programming model for neutrino, a GPU kernel profiler.

Virtualization and Contexted Registers

Now we present a more technical discussion of the virtualization of neutrino's probes with respect to the original/foreground computing task. Following operating-system principles, virtualization comes from separation in both time and resources:

  • Time Separation: neutrino's time separation originates from the SIMT execution model of GPGPUs, where parallelism happens among threads while execution within a thread is generally sequential, with one instruction per cycle. Since we instrument the probe instructions directly into the original program, their time separation from the original program is guaranteed.
  • Resource Separation: GPU threads also have thread-private registers as their primary resource. neutrino virtualizes probe registers by separating them into an independent register group, and does the same for other resources such as GMEM. Thus, neutrino probes avoid affecting the original program's resources and execution flow.

Logical and Physical Registers

It is worth noting that the neutrino probe register group is declared logically at the assembly level rather than physically. Logical registers are mapped onto physical registers by the assembler during register allocation, with the independence between probes and the original program preserved by dependency-tracking algorithms.
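As a rough illustration, here is a minimal CUDA inline-PTX sketch (hypothetical, not neutrino's actual instrumentation) of a logical register group: the probe's %p_r0 lives in its own scope, so it can never alias the compiler's registers, and ptxas maps every logical register onto a physical one during allocation.

probe_registers.cu
__global__ void probed(int* out, const int* in) {
    int a = in[threadIdx.x];
    int b;
    // probe-like fragment declaring its own logical register %p_r0;
    // ptxas later assigns it a physical register, independent of a and b
    asm volatile("{\n\t"
                 ".reg .u32 %%p_r0;\n\t"
                 "mov.u32 %%p_r0, %1;\n\t"
                 "mov.u32 %0, %%p_r0;\n\t"
                 "}"
                 : "=r"(b) : "r"(a));
    out[threadIdx.x] = a + b;
}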

Persistence and eBPF-like Map

For CPU profiling tools, saving readings is simple; at the very least you can printf to the shell. However, anyone who has used CUDA/HIP printf will notice the challenges of persistence on GPUs:

  1. Race conditions: a GPU can have >10,000 threads, and saving results without concurrency control easily leads to corruption. But concurrency control, such as atomics, is also heavy on GPUs...
  2. Metadata-heavy: we usually want pid/tid (8 bytes) included for analysis, but on GPUs the equivalent takes 24 bytes (blockIdx/threadIdx) or more (if blockDim/gridDim are also needed).

Inspired by lock-free per-CPU eBPF maps and the event buffer of HIPAnalyzer, neutrino presents an eBPF-like per-thread/warp Map. The main ideas are:

  1. Let each thread/warp have its own map, rather than sharing a global map, so no lock/atomic is needed.
  2. Formulate the maps in an ndarray layout, so the metadata (threadIdx/blockIdx) is implicitly inferred from the array structure rather than stored directly.
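To make the layout concrete, here is a minimal CUDA sketch (hypothetical names, not neutrino's actual Map API): each thread owns exactly one slot of a grid-shaped buffer, so writes need no atomics, and the owning blockIdx/threadIdx are recovered from the slot's position instead of being stored.

per_thread_map.cu
// probe reading: per-thread elapsed clocks, one exclusive slot per thread
__global__ void probed_kernel(unsigned long long* map) {
    unsigned long long start = clock64();
    // ... original kernel body ...
    size_t slot = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    map[slot] = clock64() - start; // no lock/atomic: the slot is thread-private
}

// host side: allocate gridDim.x * blockDim.x entries; the ndarray shape
// [gridDim.x][blockDim.x] itself encodes the metadata, so entry i belongs
// to blockIdx i / blockDim.x and threadIdx i % blockDim.x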