Probe Engine
You can find the code in neutrino/probe/.
Probe Engine is organized as:
engine.py # Common Definition and Utilities
cuda.py # CUDA PTX Impl
hip.py # HIP GCNAsm Impl
Probe Engine is designed to be a CLI tool accepting two params: workdir
(having original.bin
inside) and kernel_name
(raw name of the kernel).
This CLI tool will be automatically invoked by the hook driver or manually (for troubleshooting) via:
python cuda.py/hip.py <work_dir> <kernel_name>
workdir
shall have original.bin
inside containing machine code of the original kernel, and the probe engine will transform it into:
original.bin # Original code dumped by Hook Driver
original.asm # objdumped original machine code
pruned.bin # assembled pruned assembly
pruned.asm # pruned assembly
probed.bin # assembled probed assembly
probed.asm # probed assembly
process.log # human-readable log for troubleshooting
kernel.info # machine-readable log for hook driver
The probe engine is organized in the following four steps:
Objdump
Neutrino does not disassemble code!
Probe engine started by demangling the assembly from machine code by the binary utilities (objdump) provided by vendors:
- CUDA/PTX: by
cuobjdump
. - ROCM/GCN: by
roc-obj-extract
. - Plain Text: Nothing to do.
Files created: original.asm
Prune
A practical issue of objdumped assembly is that they could be LARGE (MBs),
particularly if you use CUDA/HIP C++ that compiles all templates into one .asm
, like PyTorch aten
.
Then we use the kernel_name
to prune the many-kernel assembly into single-kernel assembly.
It is worth noting that:
- Probe engine uses a best-matching algorithm since C++ template mangled name may have non-ASCII letters that cannot be passed correctly from hook driver in C.
- Probe engine keeps the global definitions and device functions (
__device__
) that might be used by the kernel externally.
Files created: pruned.asm
.
Probe
Probe part takes three steps:
- Plan Maps: directing each thread/warp to its map.
- Match Tracepoints: match tracepoints to specific instruction (line in asm)
- Parse and Fill Helpers: parse matched instructions into tokens and fills the helpers
- Injecting: Inject the probes, map addresses, map planning.
Plan Maps
Neutriono map
are thread/warp-local and it is organizead in ndarray
-like layout.
Map planning directs each thread/warp to its map
.
Its mechanism is inspired by the following CUDA/HIP C++:
#define NO_BYTES ... // filled by map definition, 8*1
__global__ void plan_map(void* buff) {
int thread_idx = (blockDim.y * threadIdx.z + \
threadIdx.y) * blockDim.x + threadIdx.x;
int block_idx = (gridDim.y * blockIdx.z + \
blockIdx.y) * gridDim.x + blockIdx.x;
int block_size = blockDim.x * blockDim.y * blockDim.z;
int buf_idx = block_idx * block_size + thread_idx;
void* buf_loc = buff + buf_idx * NO_BYTES;
}
In practice, the probe engine uses the following template of assemblies:
.reg .b32 %loc<7>; // applies to all map
mad.lo.s32 %loc7, %ntid.y, %tid.z, %tid.y;
mad.lo.s32 %loc6, %loc7, %ntid.x, %tid.x;
mad.lo.s32 %loc5, %nctaid.y, %ctaid.z, %ctaid.y;
mad.lo.s32 %loc4, %loc5, %nctaid.x, %ctaid.x;
mul.lo.s32 %loc3, %ntid.x, %ntid.y;
mul.lo.s32 %loc2, %loc3, %ntid.z;
mad.lo.s32 %loc1, %loc2, %loc4, %loc5;
// following is unique for each map
.reg .b64 %map_{name}<5>;
mul.wide.s32 %map_{name}4, %loc1, {no_bytes};
ld.param.u64 %map_{name}3, [param_{name}];
cvta.to.global.u64 %map_{name}2, %map_{name}3;
add.s64 %map_{name}1, %map_{name}2, %map_{name}4;
Match Tracepoints
Probe engine then breaks the function body into lines of instructions, and matches the tracepoints by:
pos="inst"
: use a naiveif pos in line:
the tracepoint to the line.pos="kernel"
: use the first{
forbefore=True
andret;
forbefore=False
.- ...
Parse and Fill Helpers
For each matched instruction according to tracepoints, we throughfully parse it into tokens.
For instance, the instruction ld.global.u64 %rd1, [%rd2]; // %rd1 = *%rd2
will be parsed into
- OpCode:
ld.global.u64
- Out operand:
%rd1
- In1 operand:
%rd2
Then the probe engine will fill the helpers (OUT
, IN1
) by the real register (%rd1
, %rd2
in this example).
In Python, it's simply a string.replace()
.
Injection
Finally, the probe engine injects:
- snippets to
tracepoint
according tobefore=True/False
. - Map plannaing to the beginning of kernel body.
- Map addresses to the end of kernel parameters.
Files created: probed.asm
.
Reassemble
The last stage is to reassemble the pruned and probed asm into machine code. Neutrino uses the mature toolchain provided by vendors:
- CUDA/PTX:
ptxas
- ROCM/GCN:
clang -cc1as -triple amdgcn-amd-amdhsa
Files created: pruned.bin
, probed.bin
.