
Profiling GPU Kernels à la eBPF for Linux!

Programmable: Customize profiling via a Python DSL or handcrafted assembly!

Fine-Grained: Operates on assembly for the finest instruction-level control!

Versatile: Covers both value- and time-profiling for deep understanding!

Compatible: Supports CUDA/ROCm, PyTorch/Triton/CUTLASS, and more!

* to appear in OSDI'25

How Does Neutrino Work?

On the OS side, Neutrino captures the GPU workload at runtime by hooking the drivers, and launches the probe engine to probe the GPU assembly.

On the GPU side, the probed program seamlessly jumps into probes during execution, with logically separated registers and memory.

How to Use Neutrino?

Neutrino exposes a CLI similar to strace/valgrind, with probes specified via the -p option.
Neutrino accepts probes written in a high-level Python DSL or in low-level PTX/GCN assembly.
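
For example, a probed run might look like the following, assuming the CLI is invoked as neutrino; the probe file and workload script here are placeholders, only the -p option comes from above:

neutrino -p probe.py python workload.py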

The Python DSL has an interface similar to Triton:

import neutrino.language as nl # API borrowed from Triton :)

gstart : nl.u64 = 0
gend : nl.u64 = 0
elapsed: nl.u64 = 0

@nl.probe(pos="kernel", level="warp") # broadcast to warp leader
def block_start():
    gstart = nl.time()

@nl.probe(pos="kernel", after=True, level="warp", size=16)
def block_sched():
    gend = nl.time()
    elapsed = gend - gstart
    nl.save(gstart, dtype=nl.u64)
    nl.save((elapsed, nl.smid()), dtype=nl.u32) # auto casted
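
Note that the warp leader saves 16 bytes in total here: one u64 (8 bytes) plus two u32 values (4 bytes each), matching size=16 in the second probe.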

Low-level assembly probes are wrapped in TOML:

analyze_hook = "block_sched.py"

[block_sched]
position = "kernel"
datamodel = "warp:16" # every warp saves 16 bytes
before = """.reg .b64 %lstart; // local start time (unit: cycle)
.reg .b64 %lend; // local end time (unit: cycle)
.reg .b64 %elapsed; // thread elapsed time in u64
.reg .b32 %elapse; // thread elapsed time in u32
mov.u64 %lstart, %clock64;"""
# the following operations are done only by the leader thread
after = """mov.u64 %lend, %clock64;
sub.u64 %elapsed, %lend, %lstart;
cvt.u32.u64 %elapse, %elapsed; // convert to u32
SAVE.u64 {%lstart}; // store start in u64 for alignment
SAVE.u32 {%elapse, %smid}; // store elapsed time and SM id"""
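
The analyze_hook above points to a Python script that post-processes the saved records. Below is a minimal sketch, assuming the 16-byte per-warp records (u64 start, u32 elapsed, u32 SM id) are handed to it as raw bytes; the actual hook interface may differ:

import struct

def analyze(raw: bytes):
    # each per-warp record is 16 bytes: u64 start clock, u32 elapsed cycles, u32 SM id
    for off in range(0, len(raw), 16):
        start, elapsed, smid = struct.unpack_from("<QII", raw, off)
        print(f"warp on SM {smid}: start={start}, elapsed={elapsed} cycles")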

Supported Platforms / Workloads

Neutrino supports a wide range of hardware and software; hopefully your workload is covered!

Hardware            Support
NVIDIA/CUDA/PTX ✅ Supported
AMD/ROCm/GCNAsm 🏗️ Supported on CDNA
Intel/OneAPI/VISA 🚀 Planning
Apple/Metal/AIR 🚀 Planning

Software            Support
PyTorch ✅ Supported
Triton ✅ Supported
CUTLASS ✅ Supported (with build option)
JAX ✅ Supported (with environment variable)

More Platforms to Come! Contact us if you're interested!