Introduction

Neutrino is a GPU assembly probing tool that provides an eBPF-like programming interface for fine-grained, versatile, and programmable runtime profiling of GPU kernels.

"Hello World" Example

Designed for simplicity, Neutrino exposes a simple CLI and is compatible with most frameworks, such as PyTorch and Triton.

Here is a simple Hello World for Neutrino, using FlashAttn-v2 implemented in Triton as the example workload:

# wget https://triton-lang.org/main/_downloads/54a35f6ec55f9746935b9566fb6bb1df/06-fused-attention.py
neutrino -p dmat python 06-fused-attention.py # can be any workload involving GPU

Then you can visualize the memory access pattern (DMAT Plot) of your GPU kernel.

Beyond the memory visualization, Neutrino has several critical features:

Programmability

Neutrino defines a programmable interface rather than a fixed set of pre-built tools. Users can easily build their own probes with our tracing language or by handcrafting assembly, as in the following source code behind -p dmat:

import neutrino
import neutrino.language as nl

CALLBACK = "dmat.py"

# declare shared registers across probes
start  : nl.u64 = 0 # starting clock
aligned: nl.u64 = 0 # clock elapsed since start

# define probes with @neutrino.probe decorator
@neutrino.probe(pos="kernel", level="thread")
def thread_start():
    start = nl.clock()

@neutrino.probe(pos="ld.global/st.global/cp.async.ca/cp.async.cg", size=16, count="dynamic") 
def dmat():
    aligned = nl.clock() - start
    nl.save((aligned, nl.addr), dtype=nl.u32)

Moreover, Neutrino allows cooperative probes by using registers as temporary storage shared between probes. This enables more complicated and more flexible profiling tasks, because probes at different tracepoints and times can be customized to work together.
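
To make the cooperation between probes concrete, here is a plain-Python emulation of the idea. It does not use Neutrino and does not describe how Neutrino is implemented: `ThreadState`, `on_thread_start`, `on_memory_access`, `emulate`, and the clock and address values are all hypothetical, introduced only for illustration. One handler writes a per-thread register at kernel entry, and a later handler reads it to compute a relative timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Hypothetical stand-in for the registers one thread shares across probes."""
    start: int = 0                               # written by the kernel-entry handler
    records: list = field(default_factory=list)  # stand-in for saved trace records


def on_thread_start(state: ThreadState, clock: int) -> None:
    # Plays the role of thread_start(): remember the clock at kernel entry.
    state.start = clock


def on_memory_access(state: ThreadState, clock: int, addr: int) -> None:
    # Plays the role of dmat(): reuse the register written earlier
    # to turn an absolute clock into a time relative to kernel entry.
    aligned = clock - state.start
    state.records.append((aligned, addr))


def emulate(events):
    """Replay events for one thread: the first item is the entry clock,
    the rest are (clock, addr) memory accesses."""
    state = ThreadState()
    entry_clock, accesses = events[0], events[1:]
    on_thread_start(state, entry_clock)
    for clock, addr in accesses:
        on_memory_access(state, clock, addr)
    return state.records


# Made-up clock values and addresses, for illustration only.
trace = emulate([1000, (1010, 0x7F00), (1042, 0x7F10), (1100, 0x8000)])
print(trace)
```

The point of the sketch is only that the second handler needs no extra storage or lookup: the value it depends on is already held in a register that the first handler wrote.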

Fine-Granularity

As demonstrated in the source code above, Neutrino works directly on instructions, the lowest software level, to offer the finest profiling granularity, which can be effectively mapped to particular hardware units such as tensor cores and memory I/O units.

Versatility

Neutrino profiles GPU kernel runtime from two perspectives: value, i.e., capturing runtime values such as memory addresses, and time, i.e., recording event timestamps and performing advanced intra-kernel benchmarking by differencing timestamps. By spanning these two dimensions, Neutrino supports versatile profiling tasks, including memory access and GPU scheduling.
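
As a rough illustration of the two dimensions, the sketch below post-processes a list of (timestamp, address) records in plain Python. The record layout, the 128-byte line size, and the helper names `address_footprint` and `inter_event_gaps` are assumptions made for this example, not Neutrino's actual trace format or API.

```python
def address_footprint(records, line_bytes=128):
    """Value view: count accesses per memory line of `line_bytes` bytes.

    The 128-byte default is an assumption chosen for illustration.
    """
    counts = {}
    for _, addr in records:
        line = addr // line_bytes
        counts[line] = counts.get(line, 0) + 1
    return counts


def inter_event_gaps(records):
    """Time view: difference consecutive timestamps to get the
    elapsed time between neighbouring events."""
    stamps = sorted(t for t, _ in records)
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Hypothetical records: (clock cycles since kernel start, byte address)
records = [(10, 0x1000), (42, 0x1010), (100, 0x1080), (130, 0x1004)]

print(address_footprint(records))
print(inter_event_gaps(records))
```

The same records feed both views: grouping by address answers which memory is touched and how often, while differencing timestamps answers how long the intervals between events are.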

More Examples

We host more examples on Colab for you to try!

Next Step

Follow the installation guide to set up Neutrino on your GPU-enabled machine.
Check the probing guide to write your first probe in the DSL or in assembly!
See DMAT Plot to learn how to interpret this powerful visualization!
Check our system design for more internal details!
If you're interested, don't forget to check our roadmap to see how you can help make Neutrino better!

Discussion Group

Coming Soon!