Trace and Analysis
This page was converted from a Jupyter Notebook; we are still optimizing its layout.
Now that probing has finished, we can explore the trace structure and perform analysis. Neutrino's trace is not a single flat file; it is organized into a small "file system":
Structure Overview
First, traces are placed under NEUTRINO_TRACEDIR (configurable via --trace), which defaults to ./trace. Listing it with ls shows the recorded runs:
import os
os.listdir("./trace")
['Apr25_114624_1921397']
Each directory corresponds to the traces of one run and is named by the time and the pid. For example, Apr24_231539_1860576 means the run:
- started at April 24, 23:15:39 in your local timezone
- belonged to the process with PID 1860576

The pid is included because of Neutrino's process-independent design: if your program uses multiple processes, there will be multiple trace directories, distinguished by their pids.
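Note that os.listdir gives no ordering guarantee, so the "latest" run is best picked by modification time. A minimal sketch (standard library only, nothing Neutrino-specific):
import os
trace_root = "./trace"  # or wherever NEUTRINO_TRACEDIR points if you changed it
runs = [d for d in os.listdir(trace_root) if os.path.isdir(os.path.join(trace_root, d))]
latest = max(runs, key=lambda d: os.path.getmtime(os.path.join(trace_root, d)))  # newest run dir
print(latest)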
Now let's explore the content of one trace, preferably the latest one.
# CITE https://stackoverflow.com/questions/9727673/list-directory-tree-structure-in-python
# !sudo apt install tree  # run this if your system doesn't have tree
tracedir = os.listdir("./trace")[-1]  # listdir order is arbitrary; see the sketch above to pick the newest run
!tree "./trace/{tracedir}"  # an undocumented trick
./trace/Apr25_114624_1921397
├── event.log
├── kernel
│ └── 0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6
│ ├── kernel.info
│ ├── original.bin
│ ├── original.ptx
│ ├── probed.bin
│ ├── probed.ptx
│ ├── process.log
│ ├── pruned.bin
│ └── pruned.ptx
├── probe.toml
└── result
└── 0.660526.bin
3 directories, 11 files
Here you can find:
- probe.toml: a copy of the probe code shown above, useful for repeating the experiment.
- event.log: an important log of all the events captured by the Hook Driver (Sec. 4.1).
- kernel: a folder containing all kernels captured and processed by Neutrino.
- result: a folder containing all dumped traces.
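For convenience, here is a small sketch that collects these paths for one run so later cells can be scripted (plain os.path, nothing Neutrino-specific):
import os
run_dir = os.path.join("./trace", os.listdir("./trace")[-1])  # one run directory
layout = {
    "probe":   os.path.join(run_dir, "probe.toml"),
    "events":  os.path.join(run_dir, "event.log"),
    "kernels": sorted(os.listdir(os.path.join(run_dir, "kernel"))),   # e.g. 0_bf52...
    "results": sorted(os.listdir(os.path.join(run_dir, "result"))),   # e.g. 0.660526.bin
}
print(layout)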
event.log from hook driver
Now let's take a deep dive into the event.log:
!cat "./trace/{tracedir}/event.log"
[pid] 1921397
[cmd] 26 python block_sched/raw.py
[info] dl 0x27c4350
[info] init success
[mem] cuMemAlloc_v2 0 dptr 7f3234000000 bytesize 33554432
[mod] cuLibraryLoadData 0 lib 0x8ba3040 code 0x7f33949cba28 type warpped_fatbin size 4962432
[mod] cuLibraryGetModule 0 mod 0x871e410 lib 0x8ba3040
[mod] cuModuleGetFunction func 0x90c68e0 mod 0x871e410 name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
[exec] funcmap-not-find 0x90c68e0
[jit] find 0x90c68e0 name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ bin 0x7f32407e4010 size 4962432
[jit] rename _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ 0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6
[jit] mkdir ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6
[jit] write ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/original.bin
[jit] subproc /home/root/anaconda3/envs/testenv/bin/python /home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/probe/cuda.py ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
[jit] python succeed
[jit] read ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/kernel.info name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ n_param 3 n_probe 1 trace_hook /home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/tools/block_sched.py
[jit] finish 0x90c68e0 name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ n_param 3
[exec] funcmap-find 0x90c68e0 success
[exec] 1745552787265925120 param 1000000 10000000000 7f3234000000
[exec] grid 32768 1 1 block 128 1 1 shared 0
[exec] grid 32768 warp 4 probe 16 total 2097152
[exec] probe-mem 2097152 (bytes)
[exec] succeed 0
[exec] save ./trace/Apr25_114624_1921397/result/0.660526.bin size 2097200
[exec] prologue 211.571716 kernel 0.059392 epilogue 2.958016 ratio 3613.098145
[analyze] subproc /home/root/anaconda3/envs/testenv/bin/python /home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/tools/block_sched.py ./trace/Apr25_114624_1921397/result/0.660526.bin
[analyze] succeed
Let's break it down line by line. Note that pointer values change from run to run; those below are just for illustration:
- [pid] 1860576 records the pid of this process.
- [cmd] 26 python block_sched/raw.py records the command line, useful for classifying traces.
- [info] dl 0x26aa350 and [info] init success record the status of the Hook Driver initialization, for internal checking.
- [mem] records memory operations, useful for checking illegal memory accesses. For example, here it states that the driver successfully (return code 0) allocated 33554432 bytes of GMEM at pointer 0x7fdf78000000.
- [mod] records module operations, useful for checking GPU code. For example, here it states that a function named _ZN2at6native29vectorized_elementwise_kernel... was first loaded via cuLibraryLoadData from a fatbinary and lowered via cuLibraryGetModule and cuModuleGetFunction.
- [exec] funcmap-not-find states that the internal function storage did not find a JIT record. This happens for every first-seen kernel and triggers JIT probing.
- [jit] records the interaction with the probe engine. First the kernel code is found in the binary storage and renamed to its SHA1. Next, a folder is created under the kernel dir and the code is written to original.bin. Then a subprocess is forked to launch the probe engine, and the driver waits for its status.
- [exec] records the execution of the probed kernel.
- [analyze] records the launch of the analyze hook (as a subprocess on the saved trace) and whether it succeeded.
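If you want to post-process event.log in scripts rather than by eye, a minimal sketch that groups lines by their leading [tag] (the file is plain text, one [tag] message per line as shown above):
import re
from collections import defaultdict

def load_events(path: str) -> dict:
    """Group event.log lines by their leading [tag], e.g. mem, mod, jit, exec."""
    events = defaultdict(list)
    tag_re = re.compile(r"^\[(\w+)\]\s*(.*)$")
    with open(path) as f:
        for line in f:
            m = tag_re.match(line.strip())
            if m:
                events[m.group(1)].append(m.group(2))
    return events

events = load_events(f"./trace/{tracedir}/event.log")
print({tag: len(msgs) for tag, msgs in events.items()})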
kernel/ by probing engine
Each directory under kernel/ corresponds to one kernel and is named by the SHA1 of the kernel name, with an index prefix for referencing.
Under each kernel dir there is an original.bin dumped by the hook driver; all the remaining files are created by the probe engine.
The probe engine performs the following steps:
- objdump original.bin to extract the assembly into original.ptx. Warning: original.ptx can be LARGE because many kernels are fused into one file.
- Prune the asm by the kernel name (provided as sys.argv[-1], see the [jit] subproc line above) and save it to pruned.ptx for checking.
- Probe the asm with the probe given in the NEUTRINO_PROBE environment variable and save it to probed.ptx.
- Assemble pruned.ptx and probed.ptx into .bin files of binary machine code (you can redo this step by hand, see the sketch after this list).
- Write kernel.info so that the hook driver can read back enough metadata for [exec].
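The exact assembler invocation is recorded in process.log below, so if you ever hand-edit probed.ptx you can re-assemble it the same way. A sketch, assuming ptxas is on PATH and sm_80 as in this run:
import os, subprocess
kdir = os.path.join("./trace", tracedir, "kernel",
                    os.listdir(os.path.join("./trace", tracedir, "kernel"))[0])
# Same flags as the ptxas line Neutrino records in process.log; adjust -arch to your GPU.
subprocess.run(["ptxas", "-arch=sm_80", "-m64", "--verbose",
                os.path.join(kdir, "probed.ptx"), "-o", os.path.join(kdir, "probed.bin")],
               check=True)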
Moreover, the probe engine writes all of its logs into process.log; let's take a look at it first.
kernel_dir = os.listdir(f"./trace/{tracedir}/kernel")[0] # will be 0_xxx
!cat "./trace/{tracedir}/kernel/{kernel_dir}/process.log"
['']
[decompile] via cuobjdump -ptx
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
ptxas -arch=sm_80 -m64 --verbose ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/probed.ptx -o ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/probed.bin
ptxas info : 367 bytes gmem, 200 bytes cmem[4]
ptxas info : Compiling entry function '_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_' for 'sm_80'
ptxas info : Function properties for _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 376 bytes cmem[0]
ptxas -arch=sm_80 -m64 --verbose ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/pruned.ptx -o ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/pruned.bin
ptxas info : 367 bytes gmem, 200 bytes cmem[4]
ptxas info : Compiling entry function '_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_' for 'sm_80'
ptxas info : Function properties for _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 368 bytes cmem[0]
# Neutrino Auto-Generated Code for Trace Reading
import struct
from typing import NamedTuple, List, Tuple
from neutrino import TraceHeader, TraceSection
class block_sched(NamedTuple):
    lstart: int
    elapse: int
    smid: int

def parse(path: str) -> Tuple[TraceHeader, List[TraceSection], List[List[block_sched]]]:
    with open(path, "rb") as f:
        gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, numProbes = struct.unpack("iiiiiiii", f.read(32))
        header: TraceHeader = TraceHeader(gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, numProbes)
        assert header.numProbes == 1 # currently only one saving probe is supported
        sections: List[TraceSection] = []
        for _ in range(header.numProbes):
            size, offset = struct.unpack("QQ", f.read(16))
            sections.append(TraceSection(size, offset))
        gridSize = header.gridDimX * header.gridDimY * header.gridDimZ
        blockSize = header.blockDimX * header.blockDimY * header.blockDimZ
        records: List[List[block_sched]] = []
        for i in range(gridSize):
            records.append([])
            for j in range(blockSize//32):
                lstart, elapse, smid = struct.unpack("QII", f.read(16))
                records[i].append(block_sched(lstart, elapse, smid))
        return header, sections, records
Let's break it down from top to bottom:
- The first line is the kernel filtering information (set via --filter / --kernel). It's empty here because we didn't set anything.
- The second line is the objdump information, stating the command used to dump the code.
- The third line is the kernel name being processed.
- Then come the commands and logs of the assembler (ptxas for NVIDIA). Pay special attention to the last line of each run, which states the number of registers used (for producing Table 2) and the constant memory cmem[0] (used for kernel parameters; here Neutrino uses 8 bytes more for a 64-bit pointer).
- Finally comes the auto-generated Python code for trace reading conforming to Sec. 4.4, helpful for building trace analysis tools and presented in detail when answering the 3rd question.
When the probe engine fails, it also prints the traceback here for analysis, like this:
['']
[decompile] via cuobjdump -ptx
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
Traceback (most recent call last):
File "/home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/probe/cuda.py", line 747, in <module>
probed_ptx, probe_mem_sizes, trace_reading_code = probing(entry_section, probes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/probe/cuda.py", line 507, in probing
in3: str = operands[3] if len(operands) >= 4 else None
^^^
UnboundLocalError: cannot access local variable 'out' where it is not associated with a value
We built the probe engine in Python so you can easily debug it and extend it with new functionality if the current features cannot fulfill your needs.
Moreover, process.log is meant for human reading; the log that the hook driver reads back is kernel.info.
!cat "./trace/{tracedir}/kernel/{kernel_dir}/kernel.info"
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
3
1
1,16
/home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/tools/block_sched.py
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0,u32
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1[2],align
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2[8],align
From top to bottom, kernel.info contains:
- The first line is the kernel name, like _ZN2at6native29vectorized_elementwise_kernel...
- The second line is the number of original kernel parameters, 3 here.
- The third line is the number of probes saving records, 1 here. It is followed by n (= 1 here) lines containing the datamodel: the level (0 := thread / 1 := warp) and the number of bytes saved per thread/warp, 16 here.
- Then there is a line containing the optional analyze hook, block_sched.py here.
- The remaining lines are the declarations of the original kernel parameters.

These fields are parsed by the hook driver for execution ([exec]).
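Since the file is line-oriented, it is easy to read back yourself. A minimal parser sketch following the layout above (the dictionary keys are my own naming, not a Neutrino API):
def read_kernel_info(path: str) -> dict:
    """Parse kernel.info following the line layout described above."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    n_probe = int(lines[2])
    datamodels = []
    for level, nbytes in (lines[3 + i].split(",") for i in range(n_probe)):
        datamodels.append(("warp" if level == "1" else "thread", int(nbytes)))
    return {
        "name": lines[0],                  # mangled kernel name
        "n_param": int(lines[1]),          # number of original kernel parameters
        "datamodels": datamodels,          # (level, bytes per thread/warp) per saving probe
        "hook": lines[3 + n_probe],        # optional analyze hook path
        "params": lines[4 + n_probe:],     # original parameter declarations
    }

print(read_kernel_info(f"./trace/{tracedir}/kernel/{kernel_dir}/kernel.info"))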
Finally, we can take a look at the difference between pruned.ptx and probed.ptx:
!cat "./trace/{tracedir}/kernel/{kernel_dir}/pruned.ptx"
.version 8.4
.target sm_80
.address_size 64
//
.extern .func __assertfail
(
.param .b64 __assertfail_param_0,
.param .b64 __assertfail_param_1,
.param .b32 __assertfail_param_2,
.param .b64 __assertfail_param_3,
.param .b64 __assertfail_param_4
)
;
.global .align 1 .b8 __unnamed_1[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_2[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_3[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_4[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_5[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_6[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_7[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_8[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_9[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_10[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_11[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_12[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_13[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_14[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_15[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_16[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_17[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_18[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_19[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_20[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9726thrust20THRUST_200302_800_NS6system6detail10sequential3seqE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std3__48in_placeE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std6ranges3__45__cpo4swapE[1];
.global .align 1 .b8 $str$5[6] = {102, 97, 108, 115, 101};
.global .align 1 .b8 $str$6[58] = {47, 104, 111, 109, 101, 47, 104, 117, 97, 110, 103, 115, 48, 47, 119, 111, 114, 107, 100, 105, 114, 47, 100, 105, 115, 116, 47, 112, 121, 116, 111, 114, 99, 104, 47, 99, 49, 48, 47, 99, 111, 114, 101, 47, 68, 121, 110, 97, 109, 105, 99, 67, 97, 115, 116, 46, 104};
.visible .entry _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_(
.param .u32 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0,
.param .align 2 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1[2],
.param .align 8 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2[8]
)
.maxntid 128, 1, 1
{
.reg .pred %p<6>;
.reg .b16 %rs<2>;
.reg .b32 %r<19>;
.reg .b64 %rd<15>;
ld.param.u16 %rs1, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1];
ld.param.u64 %rd2, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2];
ld.param.u32 %r4, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0];
mov.u32 %r5, %ctaid.x;
shl.b32 %r1, %r5, 9;
sub.s32 %r2, %r4, %r1;
setp.lt.s32 %p1, %r2, 512;
cvta.to.global.u64 %rd1, %rd2;
@%p1 bra $L__BB66_2;
bra.uni $L__BB66_1;
$L__BB66_2:
mov.u32 %r7, %tid.x;
setp.ge.s32 %p2, %r7, %r2;
add.s32 %r3, %r7, 384;
@%p2 bra $L__BB66_7;
add.s32 %r9, %r7, 128;
setp.ge.s32 %p3, %r9, %r2;
add.s32 %r10, %r7, %r1;
mul.wide.u32 %rd7, %r10, 2;
add.s64 %rd8, %rd1, %rd7;
st.global.u16 [%rd8], %rs1;
@%p3 bra $L__BB66_7;
add.s32 %r12, %r7, 256;
setp.ge.s32 %p4, %r12, %r2;
add.s32 %r14, %r10, 128;
mul.wide.u32 %rd9, %r14, 2;
add.s64 %rd10, %rd1, %rd9;
st.global.u16 [%rd10], %rs1;
@%p4 bra $L__BB66_7;
add.s32 %r17, %r10, 256;
mul.wide.u32 %rd11, %r17, 2;
add.s64 %rd12, %rd1, %rd11;
st.global.u16 [%rd12], %rs1;
setp.ge.s32 %p5, %r3, %r2;
@%p5 bra $L__BB66_7;
add.s32 %r18, %r3, %r1;
mul.wide.u32 %rd13, %r18, 2;
add.s64 %rd14, %rd1, %rd13;
st.global.u16 [%rd14], %rs1;
bra.uni $L__BB66_7;
$L__BB66_1:
mul.wide.s32 %rd3, %r1, 2;
add.s64 %rd4, %rd1, %rd3;
mov.u32 %r6, %tid.x;
mul.wide.s32 %rd5, %r6, 8;
add.s64 %rd6, %rd4, %rd5;
st.global.v4.u16 [%rd6], {%rs1, %rs1, %rs1, %rs1};
$L__BB66_7:
ret;
}
!cat "./trace/{tracedir}/kernel/{kernel_dir}/probed.ptx"
.version 8.4
.target sm_80
.address_size 64
//
.extern .func __assertfail
(
.param .b64 __assertfail_param_0,
.param .b64 __assertfail_param_1,
.param .b32 __assertfail_param_2,
.param .b64 __assertfail_param_3,
.param .b64 __assertfail_param_4
)
;
.global .align 1 .b8 __unnamed_1[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_2[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_3[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_4[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_5[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_6[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_7[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_8[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_9[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_10[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_11[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_12[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_13[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_14[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_15[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_16[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_17[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_18[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_19[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_20[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9726thrust20THRUST_200302_800_NS6system6detail10sequential3seqE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std3__48in_placeE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std6ranges3__45__cpo4swapE[1];
.global .align 1 .b8 $str$5[6] = {102, 97, 108, 115, 101};
.global .align 1 .b8 $str$6[58] = {47, 104, 111, 109, 101, 47, 104, 117, 97, 110, 103, 115, 48, 47, 119, 111, 114, 107, 100, 105, 114, 47, 100, 105, 115, 116, 47, 112, 121, 116, 111, 114, 99, 104, 47, 99, 49, 48, 47, 99, 111, 114, 101, 47, 68, 121, 110, 97, 109, 105, 99, 67, 97, 115, 116, 46, 104};
.visible .entry _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_(
.param .u32 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0,
.param .align 2 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1[2],
.param .align 8 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2[8],
.param .u64 param_block_sched
)
.maxntid 128, 1, 1
{
.reg .b64 %lstart; // local start time (unit: cycle)
.reg .b64 %lend; // local end time (unit: cycle)
.reg .b64 %elapsed; // thread elapsed time in u64
.reg .b32 %elapse; // thread elapsed time in u32
mov.u64 %lstart, %clock64;
// begin buffer calculation
.reg .b32 %warpbuf<21>; // b32 reg to record access, will be optimized by ptxas
.reg .pred %leader; // predicate register
.reg .pred %joint_pred; // used to store AND result of %leader and instruction operand
mov.u32 %warpbuf2, %laneid; // read lane id
setp.eq.u32 %leader, %warpbuf2, 0; // check if thread is warp leader
@%leader mov.u32 %warpbuf3, %nwarpid; // warpDim := number of warp in current group
@%leader mov.u32 %warpbuf4, %tid.x; // threadIdx.x
@%leader mov.u32 %warpbuf5, %tid.y; // threadIdx.y
@%leader mov.u32 %warpbuf6, %tid.z; // threadIdx.z
@%leader mov.u32 %warpbuf7, %ntid.x; // blockDim.x
@%leader mov.u32 %warpbuf8, %ntid.y; // blockDim.y
@%leader mov.u32 %warpbuf18, %ntid.z; // blockDim.z
@%leader mov.u32 %warpbuf9, %ctaid.x; // blockIdx.x
@%leader mov.u32 %warpbuf10, %ctaid.y; // blockIdx.y
@%leader mov.u32 %warpbuf11, %ctaid.z; // blockIdx.z
@%leader mov.u32 %warpbuf12, %nctaid.x; // gridDim.x
@%leader mov.u32 %warpbuf13, %nctaid.y; // gridDim.y
@%leader mad.lo.s32 %warpbuf14, %warpbuf8, %warpbuf6, %warpbuf5; // blockDim.y * threadIdx.z + threadIdx.y
@%leader mad.lo.s32 %warpbuf15, %warpbuf14, %warpbuf7, %warpbuf4; // thread_idx = (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x
@%leader div.s32 %warpbuf15, %warpbuf15, 32; // get persistent warpid instead of dynamic %warpid
@%leader mad.lo.s32 %warpbuf16, %warpbuf13, %warpbuf11, %warpbuf10; // gridDim.y * blockIdx.z + blockIdx.y
@%leader mad.lo.s32 %warpbuf17, %warpbuf16, %warpbuf12, %warpbuf9; // block_idx = (gridDim.y * blockIdx.z + blockIdx.y) * gridDim.x + blockIdx.x
@%leader mul.lo.s32 %warpbuf19, %warpbuf7, %warpbuf8;
@%leader mul.lo.s32 %warpbuf20, %warpbuf19, %warpbuf18;
@%leader div.s32 %warpbuf20, %warpbuf20, 32;
@%leader mad.lo.s32 %warpbuf1, %warpbuf17, %warpbuf20, %warpbuf15; // buf_idx = block_idx * warpSize + warpIdx
// end buffer calculation
// begin block_sched buffer
.reg .b64 %buf_block_sched<5>; // register group defn
@%leader mul.wide.s32 %buf_block_sched4, %warpbuf1, 16; // get buffer location, no_bytes is per thread
@%leader ld.param.u64 %buf_block_sched3, [param_block_sched]; // load address from .param state space
@%leader cvta.to.global.u64 %buf_block_sched2, %buf_block_sched3; // convert address to .global state space
@%leader add.s64 %buf_block_sched1, %buf_block_sched2, %buf_block_sched4; // offset to get final thread-specific address
// end block_sched buffer
.reg .pred %p<6>;
.reg .b16 %rs<2>;
.reg .b32 %r<19>;
.reg .b64 %rd<15>;
ld.param.u16 %rs1, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1];
ld.param.u64 %rd2, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2];
ld.param.u32 %r4, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0];
mov.u32 %r5, %ctaid.x;
shl.b32 %r1, %r5, 9;
sub.s32 %r2, %r4, %r1;
setp.lt.s32 %p1, %r2, 512;
cvta.to.global.u64 %rd1, %rd2;
@%p1 bra $L__BB66_2;
bra.uni $L__BB66_1;
$L__BB66_2:
mov.u32 %r7, %tid.x;
setp.ge.s32 %p2, %r7, %r2;
add.s32 %r3, %r7, 384;
@%p2 bra $L__BB66_7;
add.s32 %r9, %r7, 128;
setp.ge.s32 %p3, %r9, %r2;
add.s32 %r10, %r7, %r1;
mul.wide.u32 %rd7, %r10, 2;
add.s64 %rd8, %rd1, %rd7;
st.global.u16 [%rd8], %rs1;
@%p3 bra $L__BB66_7;
add.s32 %r12, %r7, 256;
setp.ge.s32 %p4, %r12, %r2;
add.s32 %r14, %r10, 128;
mul.wide.u32 %rd9, %r14, 2;
add.s64 %rd10, %rd1, %rd9;
st.global.u16 [%rd10], %rs1;
@%p4 bra $L__BB66_7;
add.s32 %r17, %r10, 256;
mul.wide.u32 %rd11, %r17, 2;
add.s64 %rd12, %rd1, %rd11;
st.global.u16 [%rd12], %rs1;
setp.ge.s32 %p5, %r3, %r2;
@%p5 bra $L__BB66_7;
add.s32 %r18, %r3, %r1;
mul.wide.u32 %rd13, %r18, 2;
add.s64 %rd14, %rd1, %rd13;
st.global.u16 [%rd14], %rs1;
bra.uni $L__BB66_7;
$L__BB66_1:
mul.wide.s32 %rd3, %r1, 2;
add.s64 %rd4, %rd1, %rd3;
mov.u32 %r6, %tid.x;
mul.wide.s32 %rd5, %r6, 8;
add.s64 %rd6, %rd4, %rd5;
st.global.v4.u16 [%rd6], {%rs1, %rs1, %rs1, %rs1};
$L__BB66_7:
@%leader mov.u64 %lend, %clock64;
@%leader sub.u64 %elapsed, %lend, %lstart;
@%leader cvt.u32.u64 %elapse, %elapsed; // convert to u32
@%leader st.global.u64 [%buf_block_sched1], %lstart;
@%leader add.s64 %buf_block_sched1, %buf_block_sched1, 8;
@%leader st.global.v2.u32 [%buf_block_sched1], { %elapse, %smid };
@%leader add.s64 %buf_block_sched1, %buf_block_sched1, 8;
ret;
}
Here we can see that the global definitions are kept and the probe engine makes the following modifications:
- A parameter .param .u64 param_block_sched is added.
- Probes are added at the corresponding places, i.e., kernel start and end.
- Buffer calculations are inserted in the // begin buffer calculation and // begin block_sched buffer regions.
- A callback is set up for runtime analysis.
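A quick way to see exactly what was injected is a textual diff of the two files; a sketch with difflib from the standard library:
import difflib
base = f"./trace/{tracedir}/kernel/{kernel_dir}"
with open(f"{base}/pruned.ptx") as f:
    pruned = f.readlines()
with open(f"{base}/probed.ptx") as f:
    probed = f.readlines()
# The injected parameter, clock reads, buffer math and trace stores show up as "+" lines.
print("".join(difflib.unified_diff(pruned, probed, fromfile="pruned.ptx", tofile="probed.ptx")))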
At this point, the last question is how the line No.block:32768 Running:602241 Scheduling:87713(cycle) is analyzed and printed.
First, Neutrino saves all raw traces in the result directory.
Raw traces are raw binary files, named in order by the time since the driver started.
!ls "./trace/{tracedir}/result/"
0.660526.bin
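The file size can be sanity-checked against the datamodel: with warp:16, each of the 32768 blocks has 128/32 = 4 warps writing 16 bytes, plus a 32-byte header and one 16-byte section descriptor, which matches the size 2097200 reported in event.log above:
grid_size, warps_per_block, bytes_per_warp = 32768, 128 // 32, 16   # from this run's launch
header_bytes = 32 + 16 * 1                                          # fixed header + one section descriptor
print(header_bytes + grid_size * warps_per_block * bytes_per_warp)  # -> 2097200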
To convert the raw binaries into useful analysis results, Neutrino provides an analyze_hook that users register in probe.toml.
For example, here we register block_sched.py as the analyze hook. Relative paths are resolved against the tools/ directory of the installation folder.
!cat "./trace/{tracedir}/probe.toml"
author = "Neutrino Team"
description = "Record start timestamp, elapsed time and smid of each warp\ncan be reduced to NSight Occupancy and launch__waves_per_multiprocessor\nThis is the same as block_sched but use local timer for better accuracy\nwhile sacrifies the auto alignment from gloabl timer"
analyze_hook = "block_sched.py"
[block_sched]
position = "kernel"
datamodel = "warp:16"
before = ".reg .b64 %lstart; // local start time (unit: cycle)\n.reg .b64 %lend; // local end time (unit: cycle)\n.reg .b64 %elapsed; // thread elapsed time in u64\n.reg .b32 %elapse; // thread elapsed time in u32\nmov.u64 %lstart, %clock64;"
after = "mov.u64 %lend, %clock64;\nsub.u64 %elapsed, %lend, %lstart; \ncvt.u32.u64 %elapse, %elapsed; // convert to u32\nSAVE.u64 {%lstart}; // store start in u64 for alignment\nSAVE.u32 {%elapse, %smid}; // store elapased time and core id"
# Real code of block_sched.py
!cat {os.path.join(os.path.dirname(neutrino.__file__), "tools", "block_sched.py")}
# Neutrino Generated Code for Reading Trace
import struct
from typing import NamedTuple, List, Tuple
from neutrino import TraceHeader, TraceSection
class block_sched(NamedTuple):
    lstart: int
    elapse: int
    smid: int

def parse(path: str):
    with open(path, "rb") as f:
        header: TraceHeader = TraceHeader(*struct.unpack("iiiiiiii", f.read(32)))
        sections: List[TraceSection] = []
        for _ in range(header.numProbes):
            size, offset = struct.unpack("QQ", f.read(16))
            sections.append(TraceSection(size, offset))
        gridSize = header.gridDimX * header.gridDimY * header.gridDimZ
        blockSize = header.blockDimX * header.blockDimY * header.blockDimZ
        records: List[List[block_sched]] = []
        for i in range(gridSize):
            records.append([])
            for j in range(blockSize // 32):
                lstart, elapse, smid = struct.unpack("QII", f.read(16))
                records[i].append(block_sched(lstart, elapse, smid))
        return header, sections, records
# END OF GENERATED CODE

import sys
import numpy as np

header, sections, records = parse(sys.argv[1]) # filled by path to trace

unique_sms = set()
for block in records:
    unique_sms.add(block[0].smid)

sm_timelines = []
for _ in range(len(unique_sms)):
    sm_timelines.append([])
sched_times = [0.0] * len(unique_sms)
work_times = [0.0] * len(unique_sms)

for cur in records:
    # print(sm_timelines[cur[0].smid])
    sched_out = False
    smid = cur[0].smid
    if len(sm_timelines[smid]) > 0:
        for block in sm_timelines[smid]:
            if block.lstart + block.elapse <= cur[0].lstart:
                # if cur[0].lstart - (block.lstart + block.elapse) < 100000:
                #     print(cur[0], block)
                sched_times[smid] += cur[0].lstart - (block.lstart + block.elapse)
                sm_timelines[smid].remove(block)
                sm_timelines[smid].append(cur[0])
                work_times[smid] += cur[0].elapse
                sched_out = True
                break
        if not sched_out:
            sm_timelines[smid].append(cur[0])
            work_times[smid] += cur[0].elapse
            break
    else:
        sm_timelines[smid].append(cur[0])
        work_times[smid] += cur[0].elapse

print(f"No.block:{header.gridDimX * header.gridDimY * header.gridDimZ} Running:{int(np.array(work_times).mean())} Scheduling:{int(np.array(sched_times).mean())}(cycle)")
There are two parts to this code, separated by # END OF GENERATED CODE:
- The first part is the trace reading code auto-generated by Neutrino (see process.log above), used to read .bin traces into Python objects familiar to developers.
- The second part is the analysis code written by developers on top of parse. Here we simulate a FIFO scheduler over the collected traces and compute and print the scheduling times.
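The same generated parse() can back any custom analysis you like. As a small sketch (not shipped with Neutrino) that reuses the generated code above, counting blocks per SM and the mean block duration:
from collections import Counter
import glob
# assumes parse() from the generated code above is defined in this session
header, sections, records = parse(glob.glob(f"./trace/{tracedir}/result/*.bin")[0])
blocks_per_sm = Counter(block[0].smid for block in records)             # first warp carries the block's SM id
mean_elapse = sum(block[0].elapse for block in records) / len(records)  # cycles, first warp of each block
print(f"{len(blocks_per_sm)} SMs used, busiest: {blocks_per_sm.most_common(1)}")
print(f"mean block duration: {mean_elapse:.0f} cycles")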