Trace and Analysis
This page was converted from a Jupyter Notebook; we are still optimizing its layout.
Now that probing has finished, we can explore the trace structure and perform analysis. Neutrino's trace is not a single flat file; it is organized into a small "file system":
Structure Overview
First, traces are placed under NEUTRINO_TRACEDIR (configurable via --trace), which defaults to ./trace. Listing it with ls shows the recorded runs:
import os
os.listdir("./trace")
['Apr25_114624_1921397']
Each directory corresponds to the traces of one run and is named by the time and the pid. For example, Apr24_231539_1860576 means the run:
- started at April 24, 23:15:39 in your local timezone
- belonged to the process with PID 1860576

The pid is included because of Neutrino's process-independent design: if your program uses multiple processes, there will be multiple trace directories, distinguished by their pids.
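Note that os.listdir gives no ordering guarantee, so the "latest" run is best picked by modification time. A minimal sketch (standard library only, nothing Neutrino-specific):
import os
trace_root = "./trace"  # or wherever NEUTRINO_TRACEDIR points if you changed it
runs = [d for d in os.listdir(trace_root) if os.path.isdir(os.path.join(trace_root, d))]
latest = max(runs, key=lambda d: os.path.getmtime(os.path.join(trace_root, d)))  # newest run dir
print(latest)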
Now let's explore the content of one trace, preferably the latest one.
# CITE https://stackoverflow.com/questions/9727673/list-directory-tree-structure-in-python
# !sudo apt install tree  # run this if your system doesn't have tree
tracedir = os.listdir("./trace")[-1]  # listdir order is arbitrary; see the sketch above to pick the newest run
!tree "./trace/{tracedir}"  # an undocumented trick
./trace/Apr25_114624_1921397
├── event.log
├── kernel
│ └── 0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6
│ ├── kernel.info
│ ├── original.bin
│ ├── original.ptx
│ ├── probed.bin
│ ├── probed.ptx
│ ├── process.log
│ ├── pruned.bin
│ └── pruned.ptx
├── probe.toml
└── result
└── 0.660526.bin
3 directories, 11 files
Here you can find:
- probe.toml: a copy of the probe code shown above, useful for repeating the experiment.
- event.log: an important log of all the events captured by the Hook Driver (Sec. 4.1).
- kernel: a folder containing all kernels captured and processed by Neutrino.
- result: a folder containing all dumped traces.
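For convenience, here is a small sketch that collects these paths for one run so later cells can be scripted (plain os.path, nothing Neutrino-specific):
import os
run_dir = os.path.join("./trace", os.listdir("./trace")[-1])  # one run directory
layout = {
    "probe":   os.path.join(run_dir, "probe.toml"),
    "events":  os.path.join(run_dir, "event.log"),
    "kernels": sorted(os.listdir(os.path.join(run_dir, "kernel"))),   # e.g. 0_bf52...
    "results": sorted(os.listdir(os.path.join(run_dir, "result"))),   # e.g. 0.660526.bin
}
print(layout)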
event.log from hook driver
Now let's take a deep dive into the event.log:
!cat "./trace/{tracedir}/event.log"
[pid] 1921397
[cmd] 26 python block_sched/raw.py
[info] dl 0x27c4350
[info] init success
[mem] cuMemAlloc_v2 0 dptr 7f3234000000 bytesize 33554432
[mod] cuLibraryLoadData 0 lib 0x8ba3040 code 0x7f33949cba28 type warpped_fatbin size 4962432
[mod] cuLibraryGetModule 0 mod 0x871e410 lib 0x8ba3040
[mod] cuModuleGetFunction func 0x90c68e0 mod 0x871e410 name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
[exec] funcmap-not-find 0x90c68e0
[jit] find 0x90c68e0 name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ bin 0x7f32407e4010 size 4962432
[jit] rename _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ 0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6
[jit] mkdir ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6
[jit] write ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/original.bin
[jit] subproc /home/root/anaconda3/envs/testenv/bin/python /home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/probe/cuda.py ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
[jit] python succeed
[jit] read ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/kernel.info name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ n_param 3 n_probe 1 trace_hook /home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/tools/block_sched.py
[jit] finish 0x90c68e0 name _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_ n_param 3
[exec] funcmap-find 0x90c68e0 success
[exec] 1745552787265925120 param 1000000 10000000000 7f3234000000
[exec] grid 32768 1 1 block 128 1 1 shared 0
[exec] grid 32768 warp 4 probe 16 total 2097152
[exec] probe-mem 2097152 (bytes)
[exec] succeed 0
[exec] save ./trace/Apr25_114624_1921397/result/0.660526.bin size 2097200
[exec] prologue 211.571716 kernel 0.059392 epilogue 2.958016 ratio 3613.098145
[analyze] subproc /home/root/anaconda3/envs/testenv/bin/python /home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/tools/block_sched.py ./trace/Apr25_114624_1921397/result/0.660526.bin
[analyze] succeed
Let's break it down line by line. Note that pointer values change from run to run; those below are just for illustration:
- [pid] 1860576 records the pid of this process.
- [cmd] 26 python block_sched/raw.py records the command line, useful for classifying traces.
- [info] dl 0x26aa350 and [info] init success record the status of the Hook Driver initialization, for internal checking.
- [mem] records memory operations, useful for checking illegal memory accesses. For example, here it states that the driver successfully (return code 0) allocated 33554432 bytes of GMEM at pointer 0x7fdf78000000.
- [mod] records module operations, useful for checking GPU code. For example, here it states that a function named _ZN2at6native29vectorized_elementwise_kernel... was first loaded via cuLibraryLoadData from a fatbinary and lowered via cuLibraryGetModule and cuModuleGetFunction.
- [exec] funcmap-not-find states that the internal function storage did not find a JIT record. This happens for every first-seen kernel and triggers JIT probing.
- [jit] records the interaction with the probe engine. First the kernel code is found in the binary storage and renamed to its SHA1. Next, a folder is created under the kernel dir and the code is written to original.bin. Then a subprocess is forked to launch the probe engine, and the driver waits for its status.
- [exec] records the execution of the probed kernel.
- [analyze] records the launch of the analyze hook (as a subprocess on the saved trace) and whether it succeeded.
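If you want to post-process event.log in scripts rather than by eye, a minimal sketch that groups lines by their leading [tag] (the file is plain text, one [tag] message per line as shown above):
import re
from collections import defaultdict

def load_events(path: str) -> dict:
    """Group event.log lines by their leading [tag], e.g. mem, mod, jit, exec."""
    events = defaultdict(list)
    tag_re = re.compile(r"^\[(\w+)\]\s*(.*)$")
    with open(path) as f:
        for line in f:
            m = tag_re.match(line.strip())
            if m:
                events[m.group(1)].append(m.group(2))
    return events

events = load_events(f"./trace/{tracedir}/event.log")
print({tag: len(msgs) for tag, msgs in events.items()})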
kernel/ by probing engine
Each directory under kernel/ corresponds to one kernel and is named by the SHA1 of the kernel name, with an index prefix for referencing.
Under each kernel dir there is an original.bin dumped by the hook driver; all the remaining files are created by the probe engine.
The probe engine performs the following steps:
- objdump original.bin to extract the assembly into original.ptx. Warning: original.ptx can be LARGE because many kernels are fused into one file.
- Prune the asm by the kernel name (provided as sys.argv[-1], see the [jit] subproc line above) and save it to pruned.ptx for checking.
- Probe the asm with the probe given in the NEUTRINO_PROBE environment variable and save it to probed.ptx.
- Assemble pruned.ptx and probed.ptx into .bin files of binary machine code (you can redo this step by hand, see the sketch after this list).
- Write kernel.info so that the hook driver can read back enough metadata for [exec].
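The exact assembler invocation is recorded in process.log below, so if you ever hand-edit probed.ptx you can re-assemble it the same way. A sketch, assuming ptxas is on PATH and sm_80 as in this run:
import os, subprocess
kdir = os.path.join("./trace", tracedir, "kernel",
                    os.listdir(os.path.join("./trace", tracedir, "kernel"))[0])
# Same flags as the ptxas line Neutrino records in process.log; adjust -arch to your GPU.
subprocess.run(["ptxas", "-arch=sm_80", "-m64", "--verbose",
                os.path.join(kdir, "probed.ptx"), "-o", os.path.join(kdir, "probed.bin")],
               check=True)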
Moreover, the probe engine writes all of its logs into process.log; let's take a look at it first.
kernel_dir = os.listdir(f"./trace/{tracedir}/kernel")[0] # will be 0_xxx
!cat "./trace/{tracedir}/kernel/{kernel_dir}/process.log"
['']
[decompile] via cuobjdump -ptx
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
ptxas -arch=sm_80 -m64 --verbose ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/probed.ptx -o ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/probed.bin
ptxas info : 367 bytes gmem, 200 bytes cmem[4]
ptxas info : Compiling entry function '_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_' for 'sm_80'
ptxas info : Function properties for _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 376 bytes cmem[0]
ptxas -arch=sm_80 -m64 --verbose ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/pruned.ptx -o ./trace/Apr25_114624_1921397/kernel/0_bf524e8c48fdebc3962c3b36823d8d68b093ebb6/pruned.bin
ptxas info : 367 bytes gmem, 200 bytes cmem[4]
ptxas info : Compiling entry function '_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_' for 'sm_80'
ptxas info : Function properties for _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 368 bytes cmem[0]
# Neutrino Auto-Generated Code for Trace Reading
import struct
from typing import NamedTuple, List, Tuple
from neutrino import TraceHeader, TraceSection
class block_sched(NamedTuple):
    lstart: int
    elapse: int
    smid: int

def parse(path: str) -> Tuple[TraceHeader, List[TraceSection], List[List[block_sched]]]:
    with open(path, "rb") as f:
        gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, numProbes = struct.unpack("iiiiiiii", f.read(32))
        header: TraceHeader = TraceHeader(gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, numProbes)
        assert header.numProbes == 1 # currently only one saving probe is supported
        sections: List[TraceSection] = []
        for _ in range(header.numProbes):
            size, offset = struct.unpack("QQ", f.read(16))
            sections.append(TraceSection(size, offset))
        gridSize = header.gridDimX * header.gridDimY * header.gridDimZ
        blockSize = header.blockDimX * header.blockDimY * header.blockDimZ
        records: List[List[block_sched]] = []
        for i in range(gridSize):
            records.append([])
            for j in range(blockSize//32):
                lstart, elapse, smid = struct.unpack("QII", f.read(16))
                records[i].append(block_sched(lstart, elapse, smid))
        return header, sections, records
Let's break it down from top to bottom:
- The first line is the kernel filtering information (set via --filter / --kernel). It's empty here because we didn't set anything.
- The second line is the objdump information, stating the command used to dump the code.
- The third line is the kernel name being processed.
- Then come the commands and logs of the assembler (ptxas for NVIDIA). Pay special attention to the last line of each run, which states the number of registers used (for producing Table 2) and the constant memory cmem[0] (used for kernel parameters; here Neutrino uses 8 bytes more for a 64-bit pointer).
- Finally comes the auto-generated Python code for trace reading conforming to Sec. 4.4, helpful for building trace analysis tools and presented in detail when answering the 3rd question.
When the probe engine fails, it also prints the traceback here for analysis, like this:
['']
[decompile] via cuobjdump -ptx
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
Traceback (most recent call last):
File "/home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/probe/cuda.py", line 747, in <module>
probed_ptx, probe_mem_sizes, trace_reading_code = probing(entry_section, probes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/probe/cuda.py", line 507, in probing
in3: str = operands[3] if len(operands) >= 4 else None
^^^
UnboundLocalError: cannot access local variable 'out' where it is not associated with a value
We built the probe engine in Python so you can easily debug it and extend it with new functionality if the current features cannot fulfill your needs.
Moreover, process.log is meant for human reading; the log that the hook driver reads back is kernel.info.
!cat "./trace/{tracedir}/kernel/{kernel_dir}/kernel.info"
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_
3
1
1,16
/home/root/anaconda3/envs/testenv/lib/python3.11/site-packages/neutrino/tools/block_sched.py
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0,u32
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1[2],align
_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2[8],align
From top to bottom, kernel.info contains:
- The first line is the kernel name, like _ZN2at6native29vectorized_elementwise_kernel...
- The second line is the number of original kernel parameters, 3 here.
- The third line is the number of probes saving records, 1 here. It is followed by n (= 1 here) lines containing the datamodel: the level (0 := thread / 1 := warp) and the number of bytes saved per thread/warp, 16 here.
- Then there is a line containing the optional analyze hook, block_sched.py here.
- The remaining lines are the declarations of the original kernel parameters.

These fields are parsed by the hook driver for execution ([exec]).
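Since the file is line-oriented, it is easy to read back yourself. A minimal parser sketch following the layout above (the dictionary keys are my own naming, not a Neutrino API):
def read_kernel_info(path: str) -> dict:
    """Parse kernel.info following the line layout described above."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    n_probe = int(lines[2])
    datamodels = []
    for level, nbytes in (lines[3 + i].split(",") for i in range(n_probe)):
        datamodels.append(("warp" if level == "1" else "thread", int(nbytes)))
    return {
        "name": lines[0],                  # mangled kernel name
        "n_param": int(lines[1]),          # number of original kernel parameters
        "datamodels": datamodels,          # (level, bytes per thread/warp) per saving probe
        "hook": lines[3 + n_probe],        # optional analyze hook path
        "params": lines[4 + n_probe:],     # original parameter declarations
    }

print(read_kernel_info(f"./trace/{tracedir}/kernel/{kernel_dir}/kernel.info"))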
Finally, we can take a look at the difference between pruned.ptx and probed.ptx:
!cat "./trace/{tracedir}/kernel/{kernel_dir}/pruned.ptx"
.version 8.4
.target sm_80
.address_size 64
//
.extern .func __assertfail
(
.param .b64 __assertfail_param_0,
.param .b64 __assertfail_param_1,
.param .b32 __assertfail_param_2,
.param .b64 __assertfail_param_3,
.param .b64 __assertfail_param_4
)
;
.global .align 1 .b8 __unnamed_1[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_2[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_3[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_4[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_5[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_6[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_7[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_8[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_9[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_10[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_11[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_12[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_13[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_14[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_15[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_16[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_17[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_18[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_19[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_20[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9726thrust20THRUST_200302_800_NS6system6detail10sequential3seqE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std3__48in_placeE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std6ranges3__45__cpo4swapE[1];
.global .align 1 .b8 $str$5[6] = {102, 97, 108, 115, 101};
.global .align 1 .b8 $str$6[58] = {47, 104, 111, 109, 101, 47, 104, 117, 97, 110, 103, 115, 48, 47, 119, 111, 114, 107, 100, 105, 114, 47, 100, 105, 115, 116, 47, 112, 121, 116, 111, 114, 99, 104, 47, 99, 49, 48, 47, 99, 111, 114, 101, 47, 68, 121, 110, 97, 109, 105, 99, 67, 97, 115, 116, 46, 104};
.visible .entry _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_(
.param .u32 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0,
.param .align 2 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1[2],
.param .align 8 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2[8]
)
.maxntid 128, 1, 1
{
.reg .pred %p<6>;
.reg .b16 %rs<2>;
.reg .b32 %r<19>;
.reg .b64 %rd<15>;
ld.param.u16 %rs1, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1];
ld.param.u64 %rd2, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2];
ld.param.u32 %r4, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0];
mov.u32 %r5, %ctaid.x;
shl.b32 %r1, %r5, 9;
sub.s32 %r2, %r4, %r1;
setp.lt.s32 %p1, %r2, 512;
cvta.to.global.u64 %rd1, %rd2;
@%p1 bra $L__BB66_2;
bra.uni $L__BB66_1;
$L__BB66_2:
mov.u32 %r7, %tid.x;
setp.ge.s32 %p2, %r7, %r2;
add.s32 %r3, %r7, 384;
@%p2 bra $L__BB66_7;
add.s32 %r9, %r7, 128;
setp.ge.s32 %p3, %r9, %r2;
add.s32 %r10, %r7, %r1;
mul.wide.u32 %rd7, %r10, 2;
add.s64 %rd8, %rd1, %rd7;
st.global.u16 [%rd8], %rs1;
@%p3 bra $L__BB66_7;
add.s32 %r12, %r7, 256;
setp.ge.s32 %p4, %r12, %r2;
add.s32 %r14, %r10, 128;
mul.wide.u32 %rd9, %r14, 2;
add.s64 %rd10, %rd1, %rd9;
st.global.u16 [%rd10], %rs1;
@%p4 bra $L__BB66_7;
add.s32 %r17, %r10, 256;
mul.wide.u32 %rd11, %r17, 2;
add.s64 %rd12, %rd1, %rd11;
st.global.u16 [%rd12], %rs1;
setp.ge.s32 %p5, %r3, %r2;
@%p5 bra $L__BB66_7;
add.s32 %r18, %r3, %r1;
mul.wide.u32 %rd13, %r18, 2;
add.s64 %rd14, %rd1, %rd13;
st.global.u16 [%rd14], %rs1;
bra.uni $L__BB66_7;
$L__BB66_1:
mul.wide.s32 %rd3, %r1, 2;
add.s64 %rd4, %rd1, %rd3;
mov.u32 %r6, %tid.x;
mul.wide.s32 %rd5, %r6, 8;
add.s64 %rd6, %rd4, %rd5;
st.global.v4.u16 [%rd6], {%rs1, %rs1, %rs1, %rs1};
$L__BB66_7:
ret;
}
!cat "./trace/{tracedir}/kernel/{kernel_dir}/probed.ptx"
.version 8.4
.target sm_80
.address_size 64
//
.extern .func __assertfail
(
.param .b64 __assertfail_param_0,
.param .b64 __assertfail_param_1,
.param .b32 __assertfail_param_2,
.param .b64 __assertfail_param_3,
.param .b64 __assertfail_param_4
)
;
.global .align 1 .b8 __unnamed_1[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_2[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_3[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_4[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_5[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_6[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_7[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_8[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_9[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_10[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_11[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_12[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_13[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_14[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_15[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_16[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_17[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_18[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_19[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 __unnamed_20[15] = {99, 97, 115, 116, 95, 97, 110, 100, 95, 115, 116, 111, 114, 101};
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9726thrust20THRUST_200302_800_NS6system6detail10sequential3seqE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std3__48in_placeE[1];
.global .align 1 .b8 _ZN44_INTERNAL_14555b67_13_FillKernel_cu_e561d9724cuda3std6ranges3__45__cpo4swapE[1];
.global .align 1 .b8 $str$5[6] = {102, 97, 108, 115, 101};
.global .align 1 .b8 $str$6[58] = {47, 104, 111, 109, 101, 47, 104, 117, 97, 110, 103, 115, 48, 47, 119, 111, 114, 107, 100, 105, 114, 47, 100, 105, 115, 116, 47, 112, 121, 116, 111, 114, 99, 104, 47, 99, 49, 48, 47, 99, 111, 114, 101, 47, 68, 121, 110, 97, 109, 105, 99, 67, 97, 115, 116, 46, 104};
.visible .entry _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1_(
.param .u32 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0,
.param .align 2 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1[2],
.param .align 8 .b8 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2[8],
.param .u64 param_block_sched
)
.maxntid 128, 1, 1
{
.reg .b64 %lstart; // local start time (unit: cycle)
.reg .b64 %lend; // local end time (unit: cycle)
.reg .b64 %elapsed; // thread elapsed time in u64
.reg .b32 %elapse; // thread elapsed time in u32
mov.u64 %lstart, %clock64;
// begin buffer calculation
.reg .b32 %warpbuf<21>; // b32 reg to record access, will be optimized by ptxas
.reg .pred %leader; // predicate register
.reg .pred %joint_pred; // used to store AND result of %leader and instruction operand
mov.u32 %warpbuf2, %laneid; // read lane id
setp.eq.u32 %leader, %warpbuf2, 0; // check if thread is warp leader
@%leader mov.u32 %warpbuf3, %nwarpid; // warpDim := number of warp in current group
@%leader mov.u32 %warpbuf4, %tid.x; // threadIdx.x
@%leader mov.u32 %warpbuf5, %tid.y; // threadIdx.y
@%leader mov.u32 %warpbuf6, %tid.z; // threadIdx.z
@%leader mov.u32 %warpbuf7, %ntid.x; // blockDim.x
@%leader mov.u32 %warpbuf8, %ntid.y; // blockDim.y
@%leader mov.u32 %warpbuf18, %ntid.z; // blockDim.z
@%leader mov.u32 %warpbuf9, %ctaid.x; // blockIdx.x
@%leader mov.u32 %warpbuf10, %ctaid.y; // blockIdx.y
@%leader mov.u32 %warpbuf11, %ctaid.z; // blockIdx.z
@%leader mov.u32 %warpbuf12, %nctaid.x; // gridDim.x
@%leader mov.u32 %warpbuf13, %nctaid.y; // gridDim.y
@%leader mad.lo.s32 %warpbuf14, %warpbuf8, %warpbuf6, %warpbuf5; // blockDim.y * threadIdx.z + threadIdx.y
@%leader mad.lo.s32 %warpbuf15, %warpbuf14, %warpbuf7, %warpbuf4; // thread_idx = (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x
@%leader div.s32 %warpbuf15, %warpbuf15, 32; // get persistent warpid instead of dynamic %warpid
@%leader mad.lo.s32 %warpbuf16, %warpbuf13, %warpbuf11, %warpbuf10; // gridDim.y * blockIdx.z + blockIdx.y
@%leader mad.lo.s32 %warpbuf17, %warpbuf16, %warpbuf12, %warpbuf9; // block_idx = (gridDim.y * blockIdx.z + blockIdx.y) * gridDim.x + blockIdx.x
@%leader mul.lo.s32 %warpbuf19, %warpbuf7, %warpbuf8;
@%leader mul.lo.s32 %warpbuf20, %warpbuf19, %warpbuf18;
@%leader div.s32 %warpbuf20, %warpbuf20, 32;
@%leader mad.lo.s32 %warpbuf1, %warpbuf17, %warpbuf20, %warpbuf15; // buf_idx = block_idx * warpSize + warpIdx
// end buffer calculation
// begin block_sched buffer
.reg .b64 %buf_block_sched<5>; // register group defn
@%leader mul.wide.s32 %buf_block_sched4, %warpbuf1, 16; // get buffer location, no_bytes is per thread
@%leader ld.param.u64 %buf_block_sched3, [param_block_sched]; // load address from .param state space
@%leader cvta.to.global.u64 %buf_block_sched2, %buf_block_sched3; // convert address to .global state space
@%leader add.s64 %buf_block_sched1, %buf_block_sched2, %buf_block_sched4; // offset to get final thread-specific address
// end block_sched buffer
.reg .pred %p<6>;
.reg .b16 %rs<2>;
.reg .b32 %r<19>;
.reg .b64 %rd<15>;
ld.param.u16 %rs1, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_1];
ld.param.u64 %rd2, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_2];
ld.param.u32 %r4, [_ZN2at6native29vectorized_elementwise_kernelILi4ENS0_11FillFunctorIN3c104HalfEEENS_6detail5ArrayIPcLi1EEEEEviT0_T1__param_0];
mov.u32 %r5, %ctaid.x;
shl.b32 %r1, %r5, 9;
sub.s32 %r2, %r4, %r1;
setp.lt.s32 %p1, %r2, 512;
cvta.to.global.u64 %rd1, %rd2;
@%p1 bra $L__BB66_2;
bra.uni $L__BB66_1;
$L__BB66_2:
mov.u32 %r7, %tid.x;
setp.ge.s32 %p2, %r7, %r2;
add.s32 %r3, %r7, 384;
@%p2 bra $L__BB66_7;
add.s32 %r9, %r7, 128;
setp.ge.s32 %p3, %r9, %r2;
add.s32 %r10, %r7, %r1;
mul.wide.u32 %rd7, %r10, 2;
add.s64 %rd8, %rd1, %rd7;
st.global.u16 [%rd8], %rs1;
@%p3 bra $L__BB66_7;
add.s32 %r12, %r7, 256;
setp.ge.s32 %p4, %r12, %r2;
add.s32 %r14, %r10, 128;
mul.wide.u32 %rd9, %r14, 2;
add.s64 %rd10, %rd1, %rd9;
st.global.u16 [%rd10], %rs1;
@%p4 bra $L__BB66_7;
add.s32 %r17, %r10, 256;
mul.wide.u32 %rd11, %r17, 2;
add.s64 %rd12, %rd1, %rd11;
st.global.u16 [%rd12], %rs1;
setp.ge.s32 %p5, %r3, %r2;
@%p5 bra $L__BB66_7;
add.s32 %r18, %r3, %r1;
mul.wide.u32 %rd13, %r18, 2;
add.s64 %rd14, %rd1, %rd13;
st.global.u16 [%rd14], %rs1;
bra.uni $L__BB66_7;
$L__BB66_1:
mul.wide.s32 %rd3, %r1, 2;
add.s64 %rd4, %rd1, %rd3;
mov.u32 %r6, %tid.x;
mul.wide.s32 %rd5, %r6, 8;
add.s64 %rd6, %rd4, %rd5;
st.global.v4.u16 [%rd6], {%rs1, %rs1, %rs1, %rs1};
$L__BB66_7:
@%leader mov.u64 %lend, %clock64;
@%leader sub.u64 %elapsed, %lend, %lstart;
@%leader cvt.u32.u64 %elapse, %elapsed; // convert to u32
@%leader st.global.u64 [%buf_block_sched1], %lstart;
@%leader add.s64 %buf_block_sched1, %buf_block_sched1, 8;
@%leader st.global.v2.u32 [%buf_block_sched1], { %elapse, %smid };
@%leader add.s64 %buf_block_sched1, %buf_block_sched1, 8;
ret;
}
Here we can see that the global definitions are kept and the probe engine makes the following modifications:
- A parameter .param .u64 param_block_sched is added.
- Probes are added at the corresponding places, i.e., kernel start and end.
- Buffer calculations are inserted in the // begin buffer calculation and // begin block_sched buffer regions.
- A callback is set up for runtime analysis.
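A quick way to see exactly what was injected is a textual diff of the two files; a sketch with difflib from the standard library:
import difflib
base = f"./trace/{tracedir}/kernel/{kernel_dir}"
with open(f"{base}/pruned.ptx") as f:
    pruned = f.readlines()
with open(f"{base}/probed.ptx") as f:
    probed = f.readlines()
# The injected parameter, clock reads, buffer math and trace stores show up as "+" lines.
print("".join(difflib.unified_diff(pruned, probed, fromfile="pruned.ptx", tofile="probed.ptx")))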
At this point, the last question is how the line No.block:32768 Running:602241 Scheduling:87713(cycle) is analyzed and printed.
First, Neutrino saves all raw traces in the result directory.
Raw traces are raw binary files, named in order by the time since the driver started.
!ls "./trace/{tracedir}/result/"
0.660526.bin
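The file size can be sanity-checked against the datamodel: with warp:16, each of the 32768 blocks has 128/32 = 4 warps writing 16 bytes, plus a 32-byte header and one 16-byte section descriptor, which matches the size 2097200 reported in event.log above:
grid_size, warps_per_block, bytes_per_warp = 32768, 128 // 32, 16   # from this run's launch
header_bytes = 32 + 16 * 1                                          # fixed header + one section descriptor
print(header_bytes + grid_size * warps_per_block * bytes_per_warp)  # -> 2097200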
To convert the raw binaries into useful analysis results, Neutrino provides an analyze_hook that users register in probe.toml.
For example, here we register block_sched.py as the analyze hook. Relative paths are resolved against the tools/ directory of the installation folder.
!cat "./trace/{tracedir}/probe.toml"
author = "Neutrino Team"
description = "Record start timestamp, elapsed time and smid of each warp\ncan be reduced to NSight Occupancy and launch__waves_per_multiprocessor\nThis is the same as block_sched but use local timer for better accuracy\nwhile sacrifies the auto alignment from gloabl timer"
analyze_hook = "block_sched.py"
[block_sched]
position = "kernel"
datamodel = "warp:16"
before = ".reg .b64 %lstart; // local start time (unit: cycle)\n.reg .b64 %lend; // local end time (unit: cycle)\n.reg .b64 %elapsed; // thread elapsed time in u64\n.reg .b32 %elapse; // thread elapsed time in u32\nmov.u64 %lstart, %clock64;"
after = "mov.u64 %lend, %clock64;\nsub.u64 %elapsed, %lend, %lstart; \ncvt.u32.u64 %elapse, %elapsed; // convert to u32\nSAVE.u64 {%lstart}; // store start in u64 for alignment\nSAVE.u32 {%elapse, %smid}; // store elapased time and core id"
# Real code of block_sched.py
!cat {os.path.join(os.path.dirname(neutrino.__file__), "tools", "block_sched.py")}
# Neutrino Generated Code for Reading Trace
import struct
from typing import NamedTuple, List, Tuple
from neutrino import TraceHeader, TraceSection
class block_sched(NamedTuple):
    lstart: int
    elapse: int
    smid: int

def parse(path: str):
    with open(path, "rb") as f:
        header: TraceHeader = TraceHeader(*struct.unpack("iiiiiiii", f.read(32)))
        sections: List[TraceSection] = []
        for _ in range(header.numProbes):
            size, offset = struct.unpack("QQ", f.read(16))
            sections.append(TraceSection(size, offset))
        gridSize = header.gridDimX * header.gridDimY * header.gridDimZ
        blockSize = header.blockDimX * header.blockDimY * header.blockDimZ
        records: List[List[block_sched]] = []
        for i in range(gridSize):
            records.append([])
            for j in range(blockSize // 32):
                lstart, elapse, smid = struct.unpack("QII", f.read(16))
                records[i].append(block_sched(lstart, elapse, smid))
        return header, sections, records
# END OF GENERATED CODE

import sys
import numpy as np

header, sections, records = parse(sys.argv[1]) # filled by path to trace

unique_sms = set()
for block in records:
    unique_sms.add(block[0].smid)

sm_timelines = []
for _ in range(len(unique_sms)):
    sm_timelines.append([])
sched_times = [0.0] * len(unique_sms)
work_times = [0.0] * len(unique_sms)

for cur in records:
    # print(sm_timelines[cur[0].smid])
    sched_out = False
    smid = cur[0].smid
    if len(sm_timelines[smid]) > 0:
        for block in sm_timelines[smid]:
            if block.lstart + block.elapse <= cur[0].lstart:
                # if cur[0].lstart - (block.lstart + block.elapse) < 100000:
                #     print(cur[0], block)
                sched_times[smid] += cur[0].lstart - (block.lstart + block.elapse)
                sm_timelines[smid].remove(block)
                sm_timelines[smid].append(cur[0])
                work_times[smid] += cur[0].elapse
                sched_out = True
                break
        if not sched_out:
            sm_timelines[smid].append(cur[0])
            work_times[smid] += cur[0].elapse
            break
    else:
        sm_timelines[smid].append(cur[0])
        work_times[smid] += cur[0].elapse

print(f"No.block:{header.gridDimX * header.gridDimY * header.gridDimZ} Running:{int(np.array(work_times).mean())} Scheduling:{int(np.array(sched_times).mean())}(cycle)")
There are two parts to this code, separated by # END OF GENERATED CODE:
- The first part is the trace reading code auto-generated by Neutrino (see process.log above), used to read .bin traces into Python objects familiar to developers.
- The second part is the analysis code written by developers on top of parse. Here we simulate a FIFO scheduler over the collected traces and compute and print the scheduling times.
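The same generated parse() can back any custom analysis you like. As a small sketch (not shipped with Neutrino) that reuses the generated code above, counting blocks per SM and the mean block duration:
from collections import Counter
import glob
# assumes parse() from the generated code above is defined in this session
header, sections, records = parse(glob.glob(f"./trace/{tracedir}/result/*.bin")[0])
blocks_per_sm = Counter(block[0].smid for block in records)             # first warp carries the block's SM id
mean_elapse = sum(block[0].elapse for block in records) / len(records)  # cycles, first warp of each block
print(f"{len(blocks_per_sm)} SMs used, busiest: {blocks_per_sm.most_common(1)}")
print(f"mean block duration: {mean_elapse:.0f} cycles")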