
Profiling GPU Kernels à la eBPF for Linux!

Programmable: Customize profiling via a Python DSL or handcrafted assembly!

Fine-Grained: Operates on assembly for the finest instruction-level control!

Versatile: Covers both value- and time-profiling for deep understanding!

Compatible: Supports CUDA/ROCm, PyTorch/Triton/CUTLASS, and more!

* to appear in OSDI'25

How Does Neutrino Work?

On the OS side, Neutrino captures the GPU workload at runtime by hooking the drivers, and launches the probe engine to probe the GPU assembly.

On the GPU side, the probed program seamlessly jumps into probes during execution, with logically separated registers and memory.

How to Use Neutrino?

Neutrino exposes a CLI similar to strace/valgrind, with probes specified via the -p option.
Neutrino accepts probes written in a high-level Python DSL or in low-level PTX/GCN assembly.
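
For example, a probed run might look like the following, assuming the CLI is invoked as neutrino; the probe file and workload script here are placeholders, only the -p option comes from above:

neutrino -p probe.py python workload.py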

The Python DSL has an interface similar to Triton:

import neutrino.language as nl # API borrowed from Triton :)

gstart : nl.u64 = 0
gend : nl.u64 = 0
elapsed: nl.u64 = 0

@nl.probe(pos="kernel", level="warp") # broadcast to warp leader
def block_start():
    gstart = nl.time()

@nl.probe(pos="kernel", after=True, level="warp", size=16)
def block_sched():
    gend = nl.time()
    elapsed = gend - gstart
    nl.save(gstart, dtype=nl.u64)
    nl.save((elapsed, nl.smid()), dtype=nl.u32) # auto casted
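
Note that the warp leader saves 16 bytes in total here: one u64 (8 bytes) plus two u32 values (4 bytes each), matching size=16 in the second probe.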

Low-level assembly probes are wrapped in TOML:

analyze_hook = "block_sched.py"

[block_sched]
position = "kernel"
datamodel = "warp:16" # every warp saves 16 bytes
before = """.reg .b64 %lstart; // local start time (unit: cycle)
.reg .b64 %lend; // local end time (unit: cycle)
.reg .b64 %elapsed; // thread elapsed time in u64
.reg .b32 %elapse; // thread elapsed time in u32
mov.u64 %lstart, %clock64;"""
# the following operations are done only by the leader thread
after = """mov.u64 %lend, %clock64;
sub.u64 %elapsed, %lend, %lstart;
cvt.u32.u64 %elapse, %elapsed; // convert to u32
SAVE.u64 {%lstart}; // store start in u64 for alignment
SAVE.u32 {%elapse, %smid}; // store elapsed time and SM id"""
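
The analyze_hook above points to a Python script that post-processes the saved records. Below is a minimal sketch, assuming the 16-byte per-warp records (u64 start, u32 elapsed, u32 SM id) are handed to it as raw bytes; the actual hook interface may differ:

import struct

def analyze(raw: bytes):
    # each per-warp record is 16 bytes: u64 start clock, u32 elapsed cycles, u32 SM id
    for off in range(0, len(raw), 16):
        start, elapsed, smid = struct.unpack_from("<QII", raw, off)
        print(f"warp on SM {smid}: start={start}, elapsed={elapsed} cycles")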

Supported Platforms / Workloads

Neutrino supports a wide range of hardware and software; hopefully your workload is covered!

Hardware            Support
NVIDIA/CUDA/PTX ✅ Supported
AMD/ROCm/GCNAsm 🏗️ Supported on CDNA
Intel/OneAPI/VISA 🚀 Planning
Apple/Metal/AIR 🚀 Planning

Software            Support
PyTorch ✅ Supported
Triton ✅ Supported
CUTLASS ✅ Supported (with build option)
JAX ✅ Supported (with environment variable)

More Platforms to Come! Contact us if you're interested!