Tracing DSL and Compiler
Please read the Probe Writing Guide first.
You can find the code in neutrino/language/.
The Neutrino DSL and Compiler are organized as follows:
```
__init__.py  # Language Primitive Definition
compiler.py  # Compiler entry
frontend.py  # Python ast based frontend
ptx.py       # NVIDIA PTX Backend
gcn.py       # AMD GCN Asm Backend
```
Similar to (and inspired by) other deep learning compilers such as Triton, TileLang, and CUTE, Neutrino DSL is a high-level description for probes that will be lowered to hardware-specific assembly for the probe engine to attach and the hook driver to execute.
Program Model
Thread/Warp-Oriented
However, the Neutrino DSL is highly specialized for observability and differs from the compute-oriented, block-oriented solutions above. In short, Neutrino takes a traditional, thread/warp-oriented execution model. Take the DMAT probe as an example:
```python
from neutrino import probe, Map
import neutrino.language as nl

CALLBACK = "dmat_callback.py"

@Map(level="thread", type="array", size=16, cap="dynamic")
class DMAT:
    clock: nl.u64
    addr: nl.u64

start: nl.u64 = 0
mem_clock: nl.u64 = 0

# define probes with decorator
@probe(pos="kernel", level="thread", before=True)
def thread_start():
    start = nl.clock()

@probe(pos="ld.global:st.global:cp.async.cg:cp.async.ca", level="thread")
def memory_access():
    mem_clock = nl.clock() - start
    DMAT.save(mem_clock, nl.addr)
```
In the neutrino tracing DSL, everything is thread-local, i.e., neutrino handles multi-thread concurrency under the hood. For example, every thread has its own registers (the contexted `start` and `mem_clock` above, plus other "unnamed" registers produced by compilation). These registers are private to the thread and can be referenced nearly for free (about 1 clock cycle).
It is worth noting that there are two types of registers in the tracing DSL:
- Global: shared among probes; their values enable cooperation across probes.
- Local: used temporarily within a probe; their values are not trustworthy outside that probe.
We may be able to support register synchronization; please raise a GitHub Issue if you need it.
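A minimal sketch of this distinction, in the style of the examples above (`TotalBytes` is a hypothetical Map and all names are illustrative):
```python
total: nl.u64 = 0                  # global (contexted) register shared across probes

@probe(level="thread", pos="ld.global:st.global")
def accumulate():
    tmp = nl.bytes                 # local register: not trustworthy outside this probe
    total += tmp                   # global register: cooperates with the probe below

@probe(level="thread", pos="kernel")
def flush():
    TotalBytes.save(total)         # TotalBytes is a hypothetical Map declared elsewhere
```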
Multi-Occurrence
Another difference is that probes might be attached to multiple tracepoints based on matching. Take the following counter as an example:
```python
from neutrino import probe, Map
import neutrino.language as nl

CALLBACK = "gmem_bytes_analysis.py"

@Map(level="thread", type="array", size=8, cap=1)
class GMEMBytes:
    sync_bytes: nl.u32
    async_bytes: nl.u32

sync_bytes: nl.u64 = 0
async_bytes: nl.u64 = 0

@probe(level="thread", pos="kernel", before=True)
def init():
    sync_bytes = 0
    async_bytes = 0

@probe(level="thread", pos="ld.global:st.global")
def record_sync():
    sync_bytes += nl.bytes

@probe(level="thread", pos="cp.async.ca:cp.async.cg")
def record_async():
    async_bytes += nl.bytes

@probe(level="thread", pos="kernel")
def save():
    GMEMBytes.save(sync_bytes, async_bytes)
```
Probes whose `pos` is an instruction pattern (`record_sync` and `record_async` in the example above) are attached to, and executed at, every occurrence of the matching tracepoint. This is useful, e.g., it lets us build a counter across probes as above, but it can also be harmful because contexted registers might carry unexpected values between occurrences. It is important to:
- Understand the relationship between probes, e.g., the order of instructions.
- Reinitialize contexted registers if needed.
Neutrino Helper Reference
Currently we provide the following helpers:
Function Helpers
- `neutrino.language.clock`: CU-local GHz clock counter, not synchronized across CUs.
- `neutrino.language.time`: GPU-local MHz timer, can be synchronized with the CPU timer.
- `neutrino.language.cuid`: returns the id of the compute unit the current thread is dispatched to.
Field Helpers
- `nl.addr`: address of the memory accessed by the instruction; only applies to memory instructions.
- `nl.bytes`: number of bytes accessed by the instruction; only applies to memory instructions.
- `nl.out`: output operand.
- `nl.in1`: 1st input operand.
- `nl.in2`: 2nd input operand.
- `nl.in3`: 3rd input operand.
The major difference is that function helpers are handled by the DSL compiler, while field helpers are handled by the probe engine.
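For example, a probe can combine the timer and compute-unit helpers to record where and when each thread starts (a hedged sketch; `SchedMap` is a hypothetical Map with two u64 fields, declared like the Maps above):
```python
@probe(level="thread", pos="kernel", before=True)
def where_and_when():
    # nl.time() and nl.cuid() are function helpers resolved by the DSL compiler
    SchedMap.save(nl.time(), nl.cuid())
```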
Compilation
The tracing DSL is compiled to assembly in two steps:
- Frontend: parsing and flattening to Neutrino IR via the Python `ast` module (see the illustrative sketch after this list).
- Backend: assembly code generation by translating the IR to the target assembly.
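The sketch below is an illustration only (not the actual frontend code): it uses the standard `ast` module to parse a probe body and list the node types the frontend has to lower.
```python
import ast, textwrap

src = textwrap.dedent("""
    def memory_access():
        mem_clock = nl.clock() - start
        DMAT.save(mem_clock, nl.addr)
""")

# Walk the parsed tree and report the node kinds the frontend must handle.
for node in ast.walk(ast.parse(src)):
    if isinstance(node, (ast.Assign, ast.BinOp, ast.Call, ast.Attribute)):
        print(type(node).__name__)
```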
Supported Syntax
For probe snippets, we support the following Python syntax (a combined example follows the list):
- `Assign`: assign values to variables
- `UnaryOp`: unary operators like `-` (sign flip)
- `BinOp`: binary operators like `+ - * /`
- `Constant`: use constant numbers
- `Attribute`: limited to Neutrino helpers (such as `nl.addr`)
- `Call`: call functions, limited to helpers or `map.save`
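A minimal probe touching each supported node type (a sketch only; `DemoMap` and `start` are assumed to be declared in the style of the earlier examples):
```python
@probe(level="thread", pos="ld.global")
def syntax_demo():
    delta = nl.clock() - start      # Call (helper), BinOp, Assign
    flipped = -delta                # UnaryOp
    scaled = delta * 2 + 1          # BinOp with Constant
    DemoMap.save(scaled, nl.addr)   # Call (map.save) and Attribute (nl.addr)
```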
The current frontend is still limited, and we are working on the following features:
- `If`: to support sampling
- `While`
- `Compare`
Moreover, the current type system is `u64`-only; we are working on type inference to support `u32`, `u1` (bool), and more types.
Neutrino IR ISA
The design is inspired by the eBPF ISA, but we separate output and input operands because most target GPU ISAs do. For example, `add` in the eBPF ISA is `add dst, src`, i.e., `dst += src`, but in Neutrino IR it is `add out, in1, in2`, i.e., `out = in1 + in2` (you can still write `add dst, dst, src` for eBPF-like semantics).
Currently, as we work in Python, every instruction is encoded as a `list[str]` whose first item is the instruction name, followed by its operands, i.e., there is no binary format.
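For instance, a hand-written probe in this encoding could look like the following sketch (register names and the map name `SomeMap` are illustrative, not compiler output; `bytes` is the special operand described in the next table):
```python
probe_ir = [
    ["clock", "R1"],                  # R1 = current clock
    ["sub",   "R2", "R1", "R0"],      # R2 = R1 - R0 (e.g., elapsed cycles)
    ["mov",   "R3", "bytes"],         # R3 = width of the probed instruction
    ["save",  "SomeMap", "R2", "R3"], # persist the contents to a hypothetical map
]
```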
Special Operands
Besides standard registers, we plan to support the following operands for better value profiling:
| Special Operand | Description | NVIDIA PTX | AMD GCNAsm |
|---|---|---|---|
| dst | will be replaced by destination | ✅ | ✅ |
| src | will be replaced by source | ✅ | ✅ |
| out | will be replaced by output (mostly 1st operand) | ✅ | ✅ |
| in1 | will be replaced by 1st input | ✅ | ✅ |
| in2 | will be replaced by 2nd input | ✅ | ✅ |
| in3 | will be replaced by 3rd input | ✅ | ✅ |
| bytes | will be replaced by inst width | ✅ (only ld/st/cp) | |
ALU Instructions
| Instruction | Description | NVIDIA PTX | AMD GCNAsm |
|---|---|---|---|
| add, out, in1, in2 | out = in1 + in2 | ✅ | ✅ |
| sub, out, in1, in2 | out = in1 - in2 | ✅ | ✅ |
| mul, out, in1, in2 | out = in1 * in2 | ✅ | |
| div, out, in1, in2 | out = in1 / in2 | ✅ | |
| mod, out, in1, in2 | out = in1 % in2 | ✅ | |
| lsh, out, in1, in2 | out = in1 << in2 | ✅ | |
| rsh, out, in1, in2 | out = in1 >> in2 | ✅ | |
| and, out, in1, in2 | out = in1 and in2 | | |
| or, out, in1, in2 | out = in1 or in2 | | |
| xor, out, in1, in2 | out = in1 ^ in2 | | |
TODO: Support `add32`-style 32-bit ALU instructions.
Memory Instructions
Rather than `ldw`/`stw` instructions that take the map address as an operand, neutrino uses a formal memory save operation, `save`, whose operands are the map name (so the map definition can be referenced) followed by the contents to save:
| Instruction | Description | NVIDIA PTX | AMD GCNAsm |
|---|---|---|---|
| save, map, contents | save contents to map | ✅ | ✅ |
The length of the save is determined by the map definition.
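For example (a sketch, not compiler output), the DSL call `GMEMBytes.save(sync_bytes, async_bytes)` from the counter above could lower to a single IR instruction, with the register names illustrative:
```python
["save", "GMEMBytes", "R4", "R5"]  # R4/R5 hold sync_bytes/async_bytes
```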
Other Instructions
We support several other kinds of instructions for profiling:
| Instruction | Description | NVIDIA PTX | AMD GCNAsm |
|---|---|---|---|
| mov, out, in | out = in | ✅ | ✅ |
| clock, out | out = current clock | ✅ | |
| time, out | out = current time | ✅ | |
| cuid, out | out = compute unit id | ✅ (smid) | |
We may also add support for `threadIdx` and `blockIdx`.
Branch Instructions
Currently we do not support branch instructions (like the early stage of eBPF), as the existing security verifier is not yet complete enough to guarantee safe branching.