
How Neutrino differs from...

It's our honor to stand on the shoulders of these giants!

eBPF

eBPF and its dev tools (e.g., bcc, bpftrace) form the de-facto standard that NEUTRINO tries to follow, so you will find similar definitions of probe and map, as well as helper.

The key difference is the platforms and mechanisms:

  • eBPF runs in a "sequential" CPU sandbox protected by the Linux kernel, which provides helpers, maps, trace pipes, etc. Entry is implemented by instrumenting a mode switch (e.g., int3) with the eBPF program as the callback.
  • NEUTRINO runs on "parallel" GPUs without the Linux kernel and its support. NEUTRINO's helpers and maps are self-implemented as normal GPU code, and NEUTRINO directly instruments the probes into the parallel assembly, i.e., entering/exiting probes is seamless.
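To make the GPU side concrete, here is a minimal sketch (illustrative only, not NEUTRINO's actual API) of what "self-implemented as normal GPU code" reduces to: the map is just device memory, and the helper is ordinary device code compiled alongside the probed kernel.

    // Illustrative sketch only (not NEUTRINO's API): a "map" is device memory
    // and a "helper" is plain device code.
    __device__ unsigned long long probe_map[1024];           // hypothetical counter map

    __device__ __forceinline__ void map_increment(int key)   // hypothetical helper
    {
        atomicAdd(&probe_map[key & 1023], 1ULL);             // update the map slot atomically
    }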

GPU Software Profilers

Most pure-software profilers integrated into frameworks, such as the PyTorch Profiler and the JAX Profiler, are kernel-exclusive. Though they are efficient at inspecting the launch timeline and profiling host-device interaction, their finest granularity is the kernel as a whole, so they can only capture:

  • Kernel time and speed from device-side timers such as CUDA events (see the sketch below).
  • Memory management events, such as alloc/free or internal memory-pool activity.
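For concreteness, the finest signal such tools expose is a whole-kernel timing like the following self-contained sketch (the kernel and names are illustrative, not any profiler's internals):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *x) { x[threadIdx.x] += 1.0f; }

    int main() {
        float *d;
        cudaMalloc(&d, 256 * sizeof(float));
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        work<<<1, 256>>>(d);                       // the kernel itself stays a black box
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);    // one number for the whole kernel
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }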

Instead, NEUTRINO focuses on intra-kernel profiling at the instruction level.

GPU Hardware Profilers

Most widely used GPU kernel profilers from vendors, such as Nsight Compute and ROCprofiler, are hardware-dependent: their profiling features require corresponding hardware support, mostly:

  1. Performance Monitors: hardware counters recording events such as cache hit/miss.
  2. Program Counter Sampler: samples the program counter offset (corresponding to a specific instruction).

These features are unique and helpful but:

  • Not adaptable to new hardware, e.g., asynchronous tensor cores that make the utilization metric less reliable.
  • Hard to customize, e.g., profiling only a user-specified part of the program rather than the whole kernel.
  • Sampling-based: choosing the sampling frequency is a trade-off between overhead and accuracy.

Instead, NEUTRINO limits the profiling targets to only the desired tracepoints, achieving both fine-grained event tracing and low system overhead.

GPU Micro-Benchmarking

Micro-benchmarking tries to reveal architecture designs via idealized workloads (only the instructions of interest, with nothing else, to reduce disturbance). For example, to micro-benchmark the cycle count of the mma instruction, a micro-benchmarker will preload all operands into registers so that only the mma instruction is executed, instead of ld/st from global memory as in real workloads.
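The following minimal sketch shows this style (illustrative only: it times a dependent FMA chain with clock64; a real mma micro-benchmark would rely on inline PTX and far more careful unrolling and dependency control):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fma_chain(long long *out, float a, float b) {
        float acc = a;
        long long start = clock64();
        #pragma unroll
        for (int i = 0; i < 1024; ++i)
            acc = fmaf(acc, b, a);             // only the instruction of interest; operands stay in registers
        long long stop = clock64();
        out[0] = stop - start;                 // cycles for 1024 dependent FMAs
        out[1] = (long long)acc;               // keep acc live so the loop is not optimized away
    }

    int main() {
        long long *d, h[2];
        cudaMalloc(&d, 2 * sizeof(long long));
        fma_chain<<<1, 1>>>(d, 1.0f, 1.000001f);
        cudaMemcpy(h, d, 2 * sizeof(long long), cudaMemcpyDeviceToHost);
        printf("~%.2f cycles per FMA\n", (double)h[0] / 1024.0);
        cudaFree(d);
        return 0;
    }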

Instead, NEUTRINO aims to measure the performance of real workloads, rather than idealized workloads.

GPU Simulators

Simulators, such as GPGPU-Sim, emulate GPU execution at the instruction/cycle level on a CPU. They are mainly designed for computer architects to test new hardware ideas in the design phase, rather than for profiling real workloads.

The major problems of these simulators are the speed of running simulations (a single run might take several days) and the lag in support for new hardware features and instructions (which might take several years).

Moreover, from a performance-engineering perspective, various runtime dynamics, e.g., the timing of an instruction, may not be accurately reproduced by simulators.

Instead, NEUTRINO aims to profile (and run) the workload on the real GPUs.

GPU Instrumentation Tools

Fundamentally, NEUTRINO is a GPU instrumentation tool that injects snippets into the client program for analysis. There have been many previous explorations in this direction, following two main approaches:

Compiler-based

Representatives are Ocelot, HIPAnalyzer, CUDAAdvisor, CUDAFlux, and KPerfIR. The intuition is to implement the code injection as a compiler pass (or extension); a conceptual sketch follows the list below. Their pros and cons are two sides of the same coin:

  1. Compiler Infrastructure:
    • Pros: They enjoy additional information from the compiler (e.g., MLIR carries rich shape/tile information).
    • Cons: Their implementations are bound to specific compilers, lacking generalizability. For instance, KPerfIR only supports Triton, but in practice code comes from multiple sources; for example, PyTorch's kernels come (at least) from cuBLAS, ATen, and Triton.
  2. Source Code:
    • Pros: They don't need to implement complex infrastructure to retrieve the code (such as NEUTRINO's Hook Driver), as compilers already have the code in hand.
    • Cons: They are bound to the source code. What if the code base is large, like ATen? What if the code is hard to locate, such as a C++ template of a template of a template...?
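
To make "code injection" concrete, here is a hedged sketch of what a compiler-based approach conceptually emits; the counter and helper names are hypothetical and not the output of any of the tools above:

    // Conceptual illustration: the pass rewrites the kernel so that every
    // global load is preceded by a call to an instrumentation helper.
    __device__ unsigned long long g_global_loads;            // hypothetical counter

    __device__ __forceinline__ void prof_count_load() {      // hypothetical injected helper
        atomicAdd(&g_global_loads, 1ULL);
    }

    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            prof_count_load();                               // injected for the load of a[i]
            prof_count_load();                               // injected for the load of b[i]
            c[i] = a[i] + b[i];                              // original statement
        }
    }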

Runtime-based

NEUTRINO is one of the runtime-based approaches; previous works include SASSI, NVBit, and GTPin.

The major difference is that previous works operate directly on machine code, which lacks sufficient virtualization, so they mostly rely on the protection of the stack by injecting pure device functions, which prohibits cooperation between probes for advanced usages. A motivating example is timing an instruction (subtracting two clock readings): this is hard for previous works because the start time lives on the injected function's stack, is cleared on return, and thus is not visible in the end timer's context.
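
To make the motivating example concrete, here is a hedged sketch (conceptual, not NEUTRINO's probe syntax): when entry and exit probes share the kernel's own context, the start timestamp stays in a local variable that is still visible to the exit probe.

    __device__ unsigned long long probe_elapsed[1024];        // hypothetical result map

    __global__ void traced_kernel(float *x) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        long long t0 = clock64();                             // entry probe: read start time
        x[tid] = x[tid] * 2.0f + 1.0f;                        // the traced instruction(s)
        long long dt = clock64() - t0;                        // exit probe: t0 is still in scope

        probe_elapsed[tid & 1023] = (unsigned long long)dt;   // record into a map-like buffer
    }

With the older approach of injecting a pure device function, t0 would live on that function's stack and be gone by the time the exit probe runs.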

Instead, NEUTRINO takes the parallel assembly as an abstraction layer, providing enough virtualization to let probes share the same context safely without stacks.