
Tracing DSL and Compiler

Please read the Probe Writing Guide first.

You can find the code in neutrino/language/.

The Neutrino DSL and Compiler are organized as follows:

neutrino/language/
__init__.py # Language Primitive Definition
compiler.py # Compiler entry
frontend.py # Python ast based frontend
ptx.py      # NVIDIA PTX Backend
gcn.py      # AMD GCN Asm Backend

Similar to (and inspired by) other deep learning compilers such as Triton, TileLang, and CUTE, Neutrino DSL is a high-level description for probes that will be lowered to hardware-specific assembly for the probe engine to attach and the hook driver to execute.

Program Model

Thread/Warp-Oriented

However, the Neutrino DSL is highly specialized for observability and differs from the computing-oriented, block-oriented solutions above. In short, Neutrino adopts a traditional, thread/warp-oriented execution model. Take the DMAT probe as an example:

dmat.py
from neutrino import probe, Map
import neutrino.language as nl

CALLBACK = "dmat_callback.py"

@Map(level="thread", type="array", size=16, cap="dynamic")
class DMAT:
    clock: nl.u64
    addr:  nl.u64

start: nl.u64 = 0
mem_clock: nl.u64 = 0

# define probes with decorator
@probe(pos="kernel", level="thread", before=True)
def thread_start():
    start = nl.clock()

@probe(pos="ld.global:st.global:cp.async.cg:cp.async.ca", level="thread")
def memory_access():
    mem_clock = nl.clock() - start
    DMAT.save(mem_clock, nl.addr)

In the Neutrino Tracing DSL, everything is thread-local, i.e., Neutrino handles multi-thread concurrency under the hood.

For example, every thread has its own registers (contexted registers such as start and mem_clock above, plus other "unnamed" registers produced during compilation). These registers are private to the thread and can be referenced nearly for free (about one clock cycle).

It is worth noting that there are two types of registers in Neutrino, illustrated by the sketch after this list:

  1. Global: shared among probes; their values persist and support cooperation across probes.
  2. Local: used temporarily within a probe; their values are not reliable outside that probe.
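For instance, in the minimal sketch below (a made-up probe file, not one shipped with Neutrino; the names total, tmp, init, and count_loads are purely illustrative), total is a global register shared by both probes, while tmp is local to the second probe:

from neutrino import probe
import neutrino.language as nl

total: nl.u64 = 0    # global register: declared at module level, shared by all probes

@probe(level="thread", pos="kernel", before=True)
def init():
    total = 0        # reinitialize the global register before the kernel body runs

@probe(level="thread", pos="ld.global")
def count_loads():
    tmp = total + 1  # local register: its value is not reliable outside this probe
    total = tmp      # global register: its value carries over to the next tracepoint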

We may be able to support register synchronization; please raise a GitHub Issue if you need it.

Multi-Occurrence

Another difference is that probes might be attached to multiple tracepoints based on matching. Take the following counter as an example:

gmem_bytes.py
from neutrino import probe, Map
import neutrino.language as nl

CALLBACK = "gmem_bytes_analysis.py"

@Map(level="thread", type="array", size=8, cap=1)
class GMEMBytes:
    sync_bytes: nl.u32
    async_bytes: nl.u32

sync_bytes:  nl.u64 = 0
async_bytes: nl.u64 = 0

# reset the per-thread counters before the kernel body runs
@probe(level="thread", pos="kernel", before=True)
def init():
    sync_bytes = 0
    async_bytes = 0

# executed at every matched synchronous global load/store
@probe(level="thread", pos="ld.global:st.global")
def record_sync():
    sync_bytes += nl.bytes

# executed at every matched asynchronous copy
@probe(level="thread", pos="cp.async.ca:cp.async.cg")
def record_async():
    async_bytes += nl.bytes

# persist the accumulated counters to the map
@probe(level="thread", pos="kernel")
def save():
    GMEMBytes.save(sync_bytes, async_bytes)

Probes with instructions as pos (record_sync and record_async in the above example) will be attached to, and executed at, every occurrence of the matched tracepoint. This is useful because it lets us build a counter across probes, but it can also be harmful, as the contexted registers might carry unexpected values.

It is important for you to:

  1. Understand the relationships between probes, e.g., the order in which instructions occur.
  2. Reinitialize the contexted registers if needed.

Neutrino Helper Reference

Currently we provide the following helpers:

Function Helpers

  • neutrino.language.clock: CU-local, GHz-resolution clock counter, not synchronized across CUs.
  • neutrino.language.time: GPU-local, MHz-resolution timer, can be synchronized with the CPU timer.
  • neutrino.language.cuid: get the id of the compute unit the current thread is dispatched to.

Field Helpers

  • nl.addr: address of the memory accessed by the instruction; only applies to memory instructions.
  • nl.bytes: number of bytes accessed by the instruction; only applies to memory instructions.
  • nl.out: output operand
  • nl.in1: 1st input operand
  • nl.in2: 2nd input operand
  • nl.in3: 3rd input operand

The major difference is that function helpers are handled in the DSL compiler, while field helpers are resolved by the probe engine, as in the sketch below.
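As an illustration, the hypothetical probe below combines both kinds (the map name AccessLog and probe name log_access are made up for this sketch): nl.time() and nl.cuid() are expanded by the DSL compiler, while nl.addr is filled in by the probe engine at each matched memory instruction.

from neutrino import probe, Map
import neutrino.language as nl

@Map(level="thread", type="array", size=24, cap="dynamic")
class AccessLog:
    t:    nl.u64
    cu:   nl.u64
    addr: nl.u64

@probe(level="thread", pos="ld.global:st.global")
def log_access():
    t = nl.time()                    # function helper: handled by the DSL compiler
    cu = nl.cuid()                   # function helper: compute unit id
    AccessLog.save(t, cu, nl.addr)   # nl.addr: field helper, filled by the probe engine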

Compilation

The tracing DSL will be compiled to assembly in two steps:

  1. Frontend: parsing and flattening into Neutrino IR via the Python ast module.
  2. Backend: assembly code generation by translating the IR into the target assembly (PTX or GCN Asm).

Supported Syntax

For probe snippets, we support the following Python syntax (see the small example after this list):

  • Assign: assign values to variable
  • UnaryOp: unary operators like - (sign flip)
  • BinOp: binary operators like +-*/
  • Constant: use constant numbers
  • Attribute: limited to Neutrino helpers (such as nl.addr)
  • Call: call functions, limited to helpers or map.save
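To make the mapping concrete, each line of this tiny, made-up probe body exercises one of the supported nodes (the probe name demo and the variables x, y, z, a, t are illustrative):

from neutrino import probe
import neutrino.language as nl

@probe(level="thread", pos="ld.global")
def demo():
    x = 1            # Assign + Constant
    y = -x           # UnaryOp (sign flip)
    z = x + y        # BinOp
    a = nl.addr      # Attribute (Neutrino helper)
    t = nl.clock()   # Call (helper)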

The current frontend is still limited, and we are working on the following features:

  • If: to support sampling
  • While
  • Compare

Moreover, the current type system is u64-only; we are working on type inference to support u32, u1 (bool), and more types.

Neutrino IR ISA

The design is inspired by the eBPF ISA, but we separate outputs from inputs, as most target GPU ISAs do. For example, add in the eBPF ISA is add dst, src; aka dst += src, but in Neutrino IR it is add out, in1, in2 aka out = in1 + in2 (you can also write add dst, dst, src for eBPF-like semantics).

Currently, as we work in Python, we encode every instruction as a list[str] whose first item is the instruction name, followed by the operands, i.e., there is no binary format.
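As a rough illustration of the idea (not the real frontend, which lives in frontend.py), the toy function below uses the ast module to lower a single subtraction assignment into one such list[str] instruction:

import ast

def lower_assign(node: ast.Assign) -> list[list[str]]:
    """Toy lowering: turn `x = a - b` into [["sub", "x", "a", "b"]]."""
    out = node.targets[0].id
    value = node.value
    if isinstance(value, ast.BinOp) and isinstance(value.op, ast.Sub):
        return [["sub", out, value.left.id, value.right.id]]
    raise NotImplementedError("only `x = a - b` is handled in this toy sketch")

tree = ast.parse("mem_clock = now - start")
print(lower_assign(tree.body[0]))   # [['sub', 'mem_clock', 'now', 'start']]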

Special Operands

In addition to standard registers, we plan to support the following operands for better value profiling:

  • dst: will be replaced by the destination operand
  • src: will be replaced by the source operand
  • out: will be replaced by the output (mostly the 1st operand)
  • in1: will be replaced by the 1st input
  • in2: will be replaced by the 2nd input
  • in3: will be replaced by the 3rd input
  • bytes: will be replaced by the instruction width (ld/st/cp instructions only)

ALU Instructions

  • add, out, in1, in2: out = in1 + in2
  • sub, out, in1, in2: out = in1 - in2
  • mul, out, in1, in2: out = in1 * in2
  • div, out, in1, in2: out = in1 / in2
  • mod, out, in1, in2: out = in1 % in2
  • lsh, out, in1, in2: out = in1 << in2
  • rsh, out, in1, in2: out = in1 >> in2
  • and, out, in1, in2: out = in1 & in2
  • or, out, in1, in2: out = in1 | in2
  • xor, out, in1, in2: out = in1 ^ in2
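For example, the counter update sync_bytes += nl.bytes from gmem_bytes.py above could plausibly lower to a single add that reuses its destination as an input and uses the bytes special operand (the register name here is illustrative; the real compiler may use unnamed registers):

["add", "sync_bytes", "sync_bytes", "bytes"]   # sync_bytes = sync_bytes + <instruction width>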

TODO

Support add32-style 32-bit ALU instructions.

Memory Instructions

Rather than ldw/stw instructions that take the map address as an operand, Neutrino adopts a formal memory save operation, save, which takes the map name (whose definition can be referenced) as its operand, followed by the contents:

  • save, map, contents: save contents to map

The number of bytes written by each save is determined by the map definition.
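Likewise, the GMEMBytes.save(sync_bytes, async_bytes) call from gmem_bytes.py above could plausibly be encoded as (again, the operand names are only illustrative):

["save", "GMEMBytes", "sync_bytes", "async_bytes"]   # write both counters into the GMEMBytes map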

Other Instructions

We support several other kinds of instructions for profiling use:

  • mov, out, in: out = in
  • clock, out: out = current clock
  • time, out: out = current time
  • cuid, out: out = compute unit id (via smid on NVIDIA PTX)

We may also add support for threadIdx and blockIdx.

Branch Instructions

Currently we do not support branch instructions (like the early stages of eBPF), as the existing security verifier is not complete enough to guarantee safe branching.