logoNeutrino

Probe Engine

You can find the code in neutrino/probe/.

Probe Engine is organized as:

neutrino/src/
engine.py   # Common Definition and Utilities
cuda.py     # CUDA PTX Impl
hip.py      # HIP GCNAsm Impl

Probe Engine is designed to be a CLI tool accepting two params: workdir (having original.bin inside) and kernel_name (raw name of the kernel). This CLI tool will be automatically invoked by the hook driver or manually (for troubleshooting) via:

python cuda.py/hip.py <work_dir> <kernel_name> 

workdir shall have original.bin inside containing machine code of the original kernel, and the probe engine will transform it into:

original.bin # Original code dumped by Hook Driver 
original.asm # objdumped original machine code
pruned.bin   # assembled pruned assembly
pruned.asm   # pruned assembly
probed.bin   # assembled probed assembly
probed.asm   # probed assembly 
process.log  # human-readable log for troubleshooting
kernel.info  # machine-readable log for hook driver 

The probe engine is organized in the following four steps:

Objdump

Neutrino does not disassemble code!

Probe engine started by demangling the assembly from machine code by the binary utilities (objdump) provided by vendors:

  • CUDA/PTX: by cuobjdump.
  • ROCM/GCN: by roc-obj-extract.
  • Plain Text: Nothing to do.

Files created: original.asm

Prune

A practical issue of objdumped assembly is that they could be LARGE (MBs), particularly if you use CUDA/HIP C++ that compiles all templates into one .asm, like PyTorch aten. Then we use the kernel_name to prune the many-kernel assembly into single-kernel assembly. It is worth noting that:

  • Probe engine uses a best-matching algorithm since C++ template mangled name may have non-ASCII letters that cannot be passed correctly from hook driver in C.
  • Probe engine keeps the global definitions and device functions (__device__) that might be used by the kernel externally.

Files created: pruned.asm.

Probe

Probe part takes three steps:

  1. Plan Maps: directing each thread/warp to its map.
  2. Match Tracepoints: match tracepoints to specific instruction (line in asm)
  3. Parse and Fill Helpers: parse matched instructions into tokens and fills the helpers
  4. Injecting: Inject the probes, map addresses, map planning.

Plan Maps

Neutriono map are thread/warp-local and it is organizead in ndarray-like layout. Map planning directs each thread/warp to its map. Its mechanism is inspired by the following CUDA/HIP C++:

#define NO_BYTES ... // filled by map definition, 8*1
__global__ void plan_map(void* buff) {
  int thread_idx = (blockDim.y * threadIdx.z + \
  threadIdx.y) * blockDim.x + threadIdx.x;
  int block_idx = (gridDim.y * blockIdx.z + \
  blockIdx.y) * gridDim.x + blockIdx.x;
  int block_size = blockDim.x * blockDim.y * blockDim.z;
  int buf_idx = block_idx * block_size + thread_idx;
  void* buf_loc = buff + buf_idx * NO_BYTES;
}

In practice, the probe engine uses the following template of assemblies:

.reg .b32 %loc<7>; // applies to all map
mad.lo.s32 %loc7, %ntid.y, %tid.z, %tid.y;
mad.lo.s32 %loc6, %loc7, %ntid.x, %tid.x;
mad.lo.s32 %loc5, %nctaid.y, %ctaid.z, %ctaid.y;
mad.lo.s32 %loc4, %loc5, %nctaid.x, %ctaid.x;
mul.lo.s32 %loc3, %ntid.x, %ntid.y;
mul.lo.s32 %loc2, %loc3, %ntid.z;
mad.lo.s32 %loc1, %loc2, %loc4, %loc5;
// following is unique for each map
.reg .b64 %map_{name}<5>;
mul.wide.s32 %map_{name}4, %loc1, {no_bytes};
ld.param.u64 %map_{name}3, [param_{name}];
cvta.to.global.u64 %map_{name}2, %map_{name}3;
add.s64 %map_{name}1, %map_{name}2, %map_{name}4;

Match Tracepoints

Probe engine then breaks the function body into lines of instructions, and matches the tracepoints by:

  • pos="inst": use a naive if pos in line: the tracepoint to the line.
  • pos="kernel": use the first { for before=True and ret; for before=False.
  • ...

Parse and Fill Helpers

For each matched instruction according to tracepoints, we throughfully parse it into tokens. For instance, the instruction ld.global.u64 %rd1, [%rd2]; // %rd1 = *%rd2 will be parsed into

  • OpCode: ld.global.u64
  • Out operand: %rd1
  • In1 operand: %rd2

Then the probe engine will fill the helpers (OUT, IN1) by the real register (%rd1, %rd2 in this example). In Python, it's simply a string.replace().

Injection

Finally, the probe engine injects:

  • snippets to tracepoint according to before=True/False.
  • Map plannaing to the beginning of kernel body.
  • Map addresses to the end of kernel parameters.

Files created: probed.asm.

Reassemble

The last stage is to reassemble the pruned and probed asm into machine code. Neutrino uses the mature toolchain provided by vendors:

  • CUDA/PTX: ptxas
  • ROCM/GCN: clang -cc1as -triple amdgcn-amd-amdhsa

Files created: pruned.bin, probed.bin.