Hook Driver
You can find the code in neutrino/src/.
The hook driver is organized as follows:
common.h # platform-agnostic definitions (GNU-only)
preload.c # injector via LD_PRELOAD
cuda.c # CUDA implementation (NVIDIA)
hip.c # ROCm implementation (AMD)
parse.py # generates the unhooked APIs (NVIDIA/AMD)
sha1.h # third-party header-only library
uthash.h # third-party header-only library
Mechanism
The hook driver is based on a known trick of the ELF format (I learned it from Valgrind): symbols (exported functions) are looked up by their signature (name, parameters, and returns). Therefore, as long as one provides a function with the same signature (available in documentation or headers such as cuda.h or hip_runtime.h), loaders (ld.so at load time or libdl via dlopen at runtime) will treat it as valid.
To make the system work, we can dlsym to get a pointer to the real function and call it, like the following:
#include <dlfcn.h> // dlopen, dlsym

#define REAL_DRIVER ... // path to real driver
static void* dlib = NULL; // dlopen handle

CUresult cuInit(unsigned Flags) { // same signature as the real cuInit
    if (!dlib) { dlib = dlopen(REAL_DRIVER, RTLD_LAZY); }
    CUresult (*real)(unsigned) = (CUresult (*)(unsigned))dlsym(dlib, "cuInit");
    // insert code here := eBPF uprobe
    CUresult ret = real(Flags);
    // insert code here := eBPF uretprobe
    return ret;
}
By doing so, we emulate a "symbolic link" at low cost (one additional function call if we cache real rather than calling dlsym every time) but expose a programmable interface for adding arbitrary logic in C.
Compared with other approaches such as eBPF uprobes, this is safer since it runs purely in user space.
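For the cached variant mentioned above, a minimal sketch (the real_cuInit naming is just illustrative):

// cached variant: resolve the real cuInit only once, then reuse the pointer
static CUresult (*real_cuInit)(unsigned) = NULL;

CUresult cuInit(unsigned Flags) {
    if (!dlib) { dlib = dlopen(REAL_DRIVER, RTLD_LAZY); }
    if (!real_cuInit) // later calls skip the dlsym lookup
        real_cuInit = (CUresult (*)(unsigned))dlsym(dlib, "cuInit");
    return real_cuInit(Flags);
}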
To enforce the usage of hooked symbols, we can use environment variables such as LD_PRELOAD and LD_LIBRARY_PATH, which change the loader's preference and make our hooked symbols the ones actually used.
Hooked APIs
Code in neutrino/src/cuda.c and neutrino/src/hip.c.
neutrino mainly hooks Module Management APIs (cuda/hip docs) and Execution Control APIs (cuda/hip docs).
Module APIs
CUresult cuModuleLoadData(CUmodule* module, const void* image);
CUresult cuLibraryLoadData(CUlibrary* library, const void* code, CUjit_option* jitOptions, void** jitOptionsValues, unsigned int numJitOptions, CUlibraryOption* libraryOptions, void** libraryOptionValues, unsigned int numLibraryOptions);
CUresult cuModuleLoadDataEx(CUmodule* module, const void* image, unsigned int numOptions, CUjit_option* options, void** optionValues);
CUresult cuModuleLoadFatBinary(CUmodule* module, const void* fatCubin);
CUresult cuModuleGetFunction(CUfunction* hfunc, CUmodule hmod, const char* name);
CUresult cuKernelGetFunction(CUfunction* pFunc, CUkernel kernel);
CUresult cuLibraryGetKernel(CUkernel* pKernel, CUlibrary library, const char* name);
CUresult cuLibraryGetModule(CUmodule* pMod, CUlibrary library);
hipError_t hipModuleLoadData(hipModule_t* module, const void* image);
hipError_t hipModuleLoadDataEx(hipModule_t* module, const void* image, unsigned int numOptions, hipJitOption* options, void** optionValues);
hipError_t hipModuleGetFunction(hipFunction_t* function, hipModule_t module, const char* kname);
These APIs handle the loading of GPU code from disk (embedded in the host ELF) into the GPU runtime (cu/hipModuleLoadData), and the lowering from a module to a particular kernel (cu/hipModuleGetFunction).
For these APIs, neutrino interacts with the internal storage binmap to record the loaded code and the mapping from kernel to code, as sketched below.
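For illustration, here is a minimal sketch of how a hooked cuModuleLoadData could record the image into binmap; real_cuModuleLoadData follows the generated-pointer convention described under Unhooked API, and get_managed_code_size/binmap_set are the common.h helpers documented below (the real cuda.c logic differs in detail):

// sketch: forward to the real API, then record the module image in binmap
CUresult cuModuleLoadData(CUmodule* module, const void* image) {
    CUresult ret = real_cuModuleLoadData(module, image);
    if (ret == CUDA_SUCCESS) {
        // measure the image (ELF / fatbinary / text, see GPU Code Parsing)
        unsigned long long size = get_managed_code_size(image);
        // copy it into protected storage, keyed by the new CUmodule handle
        binmap_set((void*)*module, (void*)image, size, NULL);
    }
    return ret;
}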
Execution API
CUresult cuLaunchKernel(CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra);
hipError_t hipModuleLaunchKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams, void** extra);
These APIs launch the kernel on the GPU. neutrino hooks them to expose a complex pipeline (a condensed sketch follows the list):
- Search the funcmap for the probed kernel and maps.
- Allocate the memory (both device and host) according to maps.
- Launch the probed kernel (a call to the real cu/hipLaunchKernel) and synchronize for its end.
- Copy back the device maps and save them to disk.
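A condensed sketch of this pipeline, assuming hypothetical helpers funcmap_lookup, alloc_maps, and save_maps that wrap the steps above, and the real_cuLaunchKernel pointer from the generated wrappers; funcmap_item_t is defined in Kernel Storage below (the real logic in cuda.c is more involved):

CUresult cuLaunchKernel(CUfunction f,
        unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
        unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
        unsigned int sharedMemBytes, CUstream hStream,
        void** kernelParams, void** extra) {
    funcmap_item_t* item = funcmap_lookup(f);        // 1. search the funcmap
    if (item == NULL || !item->succeed)              //    no probed version: fall back
        return real_cuLaunchKernel(f, gridDimX, gridDimY, gridDimZ,
                blockDimX, blockDimY, blockDimZ,
                sharedMemBytes, hStream, kernelParams, extra);
    void** params = alloc_maps(item, kernelParams);  // 2. allocate device/host maps
    CUresult ret = real_cuLaunchKernel((CUfunction)item->probed, // 3. launch probed kernel
            gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ,
            sharedMemBytes, hStream, params, extra);
    cuStreamSynchronize(hStream);                    //    synchronize for its end
    save_maps(item, params);                         // 4. copy device maps back, save to disk
    return ret;
}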
Standardized Functionalities
To build a unified architecture across platforms and enhance reusability, we standardize most APIs into common.h:
- Configuration reading: reading environment variables, see Command-Line Interface.
- GPU code parsing
- Trace folder management: see Trace Structure.
- Binary and kernel (function) storage management
- [TODO] Saving traces: see Trace Structure.
- [TODO] Interacting with probe engine
GPU Code Parsing
A practical issue with code loaded via cuModuleLoad is that it arrives as a void *, whose size is unpredictable from the C perspective. But after checking the magic number, this code (void *) is actually ELF, fatbinary, or plain text. Therefore, its size is predictable by parsing the header:
- ELF: parse Elf64_Ehdr from elf.h.
- Fatbinary: header->size + sizeof(fatBinaryHeader).
- Text: search for the terminating \0.
Usage: unsigned long long get_managed_code_size(const void* code).
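A simplified sketch of this dispatch; the fatBinaryHeader layout and magic value are assumptions based on publicly known fatbin headers, and the ELF size formula is the usual end-of-section-header-table approximation (the real common.h implementation handles more corner cases):

#include <elf.h>
#include <string.h>

#define FATBIN_MAGIC 0xBA55ED50u // assumed fatbinary magic number
typedef struct {
    unsigned int magic;
    unsigned short version;
    unsigned short headerSize;
    unsigned long long size; // payload size, excluding this header
} fatBinaryHeader;

unsigned long long get_managed_code_size(const void* code) {
    const unsigned char* p = (const unsigned char*)code;
    if (memcmp(p, ELFMAG, SELFMAG) == 0) { // ELF: size up to end of section headers
        const Elf64_Ehdr* ehdr = (const Elf64_Ehdr*)code;
        return ehdr->e_shoff + (unsigned long long)ehdr->e_shnum * ehdr->e_shentsize;
    }
    const fatBinaryHeader* fh = (const fatBinaryHeader*)code;
    if (fh->magic == FATBIN_MAGIC)        // fatbinary: header + payload
        return fh->size + sizeof(fatBinaryHeader);
    return strlen((const char*)code) + 1; // plain text: NUL-terminated
}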
Binary Storage
Another issue with a user-given void * is that it might be garbage-collected (particularly given the popularity of Python), leading to a core dump when we reference it later. Thus, we memcpy the code into an internal (hash-based) storage managed by uthash:
typedef struct {
void* key; // could be CUlibrary, CUmodule, CUfunction or HIP equivalent
void* code; // the binary code, protected
char* name; // name of function
unsigned long long size; // size of bin
UT_hash_handle hh; // uthash handles
} binmap_item;
key here is the CUlibrary, CUmodule, CUfunction, or HIP equivalent. In practice, these are pointers (void*) to the corresponding structs on the device.
Usage: the storage will be initialized automatically by the hook driver. Please use the following APIs (returning 0 on success, -1 on failure):
- int binmap_set(void* key, void* code, unsigned long long size, char* name): add a new item.
- int binmap_update_key(void* old_key, void* new_key): update the key of an item.
- int binmap_update_name_key(void* old_key, void* new_key, char* name): update the name of an item (re-keying it from old_key to new_key).
- int binmap_get(void* key, size_t* size, char** name, void** code): get the size, name, and code (protected).
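A hedged usage example (the kernel name "my_kernel" is hypothetical; error handling elided):

#include <stdio.h>

void binmap_example(CUmodule mod, CUfunction func, const void* image) {
    unsigned long long size = get_managed_code_size(image);
    binmap_set((void*)mod, (void*)image, size, NULL);   // record under the module key
    // once the kernel handle is resolved, re-key the item and attach the name
    binmap_update_name_key((void*)mod, (void*)func, "my_kernel");
    size_t got; char* name; void* code;
    if (binmap_get((void*)func, &got, &name, &code) == 0)
        printf("%s: %zu bytes of protected code\n", name, got);
}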
Kernel Storage
To avoid repeating the costly probing (~300 ms), we build another internal storage to cache the probing result of each kernel, covering both successes and failures:
// function map items, used as JIT code cache to avoid re-compilation
typedef struct {
void* original; // original CUfunction/HIPfunction
char* name; // name of function; can be NULL if unavailable
int n_param; // number of parameters, obtained from parsing
int n_probe; // number of probes that would dump memory
int* probe_sizes; // sizes of probe memory, order matches
int* probe_types; // types of probes
bool succeed; // specify JIT status -> if failed, always goto backup
void* probed; // probed CUfunction/HIPfunction
void* pruned; // pruned CUfunction/HIPfunction, for benchmark only
void* countd; // counting CUfunction/HIPfunction, for DYNAMIC=TRUE only
char* callback; // callback to analyze the trace
UT_hash_handle hh; // reserved by uthash
} funcmap_item_t;
The original kernel (void* original) is used as the key of the hashmap.
Usage: the storage will be initialized automatically by the hook driver, with the following exposed APIs:
- int funcmap_set(void* original, bool succeed...): add a record.
- int funcmap_get(void* original, bool* succeed...): get a record; everything is valid only if succeed = true.
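Since the full signatures are elided above, here is a minimal sketch of the underlying mechanism using uthash directly; the actual wrappers in common.h are equivalent in spirit:

#include <stddef.h>
#include "uthash.h"

static funcmap_item_t* funcmap = NULL; // hash head, keyed by `original`

static funcmap_item_t* funcmap_lookup(void* original) {
    funcmap_item_t* item = NULL;
    HASH_FIND_PTR(funcmap, &original, item); // look up by the original kernel pointer
    return item;
}

static void funcmap_insert(funcmap_item_t* item) {
    HASH_ADD_PTR(funcmap, original, item);   // hash on the `original` field
}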
Interacting with Probe Engine
We are still standardizing this part; you can check the current implementation in cuda.c or hip.c.
Interaction with the probe engine happens only if funcmap_get returns -1, i.e., no record is found. It is roughly a fork-exec (a condensed sketch follows the list):
- binmap_get to search the binary map for the code and name.
- mkdir within trace/kernel, with the directory name being the sha1 of the name.
- fwrite the code to original.bin.
- fork:
  - child: execlp the probe engine: python <PROBE_ENGINE> <workdir> <kernel_name>.
  - parent: waitpid for the probe engine (status is its return code).
- fread the kernel.info for metadata.
- Load the probed kernel back via cu/hipModuleLoadData and cu/hipModuleGetFunction.
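A condensed sketch of the fork-exec step; run_probe_engine, its parameters, and the path handling are illustrative, and the real flow in cuda.c/hip.c also computes the sha1 directory name:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_probe_engine(const char* probe_engine, const char* workdir,
                            const char* kernel_name,
                            const void* code, unsigned long long size) {
    mkdir(workdir, 0755);                     // trace/kernel/<sha1-of-name>
    char path[4096];
    snprintf(path, sizeof(path), "%s/original.bin", workdir);
    FILE* f = fopen(path, "wb");
    fwrite(code, 1, size, f);                 // dump the binary for the engine
    fclose(f);
    pid_t pid = fork();
    if (pid == 0) {                           // child: exec the probe engine
        execlp("python", "python", probe_engine, workdir, kernel_name, (char*)NULL);
        _exit(127);                           // reached only if exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);                 // parent: status is its return code
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}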
Unhooked API
One drawback of such "symbolic links" is that we have to expose every API, even those we are not interested in, such as cuMemAlloc. To avoid this huge workload, neutrino auto-generates them by parsing the local environment, i.e., headers (cuda.h/hip_runtime.h) and drivers (libcuda.so/libamdhip64.so). For each parsed symbol (func_name, ...), we generate a pointer to the real function using the following template:
CUresult (*real_{func_name})({param_list}) = NULL;
hipError_t (*real_{func_name})({param_list}) = NULL;
and the wrapper body using the following templates:
CUresult {func_name}({param_list}) {{
if (shared_lib == NULL) {{ init(); }}
CUresult err = real_{func_name}({param_val_list}); // call the real
if (VERBOSE) {{
fprintf(event_log, "[info] {func_name} %d\\n", err);
fflush(event_log); // block until output written for debugging
}}
return err;
}}
hipError_t {func_name}({param_list}) {{
if (shared_lib == NULL) {{ init(); }}
hipError_t err = real_{func_name}({param_val_list}); // call the real
if (VERBOSE) {{
fprintf(event_log, "[info] {func_name} %d\\n", err);
fflush(event_log); // block until output written for debugging
}}
return err;
}}
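For instance, instantiating the CUDA template for cuMemAlloc would yield roughly the following (shared_lib, init, VERBOSE, and event_log are the driver-wide globals referenced by the templates):

CUresult (*real_cuMemAlloc)(CUdeviceptr*, size_t) = NULL;

CUresult cuMemAlloc(CUdeviceptr* dptr, size_t bytesize) {
    if (shared_lib == NULL) { init(); }
    CUresult err = real_cuMemAlloc(dptr, bytesize); // call the real
    if (VERBOSE) {
        fprintf(event_log, "[info] cuMemAlloc %d\n", err);
        fflush(event_log); // block until output written for debugging
    }
    return err;
}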
Moreover, the shared library usually includes more suffixed/versioned symbols for versatile usage (such as _ptsz/_ptds for CUDA), e.g., to distinguish API versions. As our implementation is agnostic to the API's logic, we reuse the matched param_list with the updated func_name for these versioned symbols, allowing better compatibility.
Usage: parse.py will be called automatically by the builder (build.py).
Preload and Filter
For dynamically loaded libraries, it is not enough to simply set LD_PRELOAD=hook_driver/libcuda.so, since some programs use dlopen directly to locate the library and fetch function pointers, instead of using symbols resolved by the linker. But there is one function that cannot be dlopen-ed: dlopen itself, which must be handled by the linker. The Neutrino hook driver therefore injects its own dlopen to provide redirection via name matching.
#include <dlfcn.h>    // dlsym, RTLD_NEXT
#include <execinfo.h> // backtrace, backtrace_symbols
#include <stdlib.h>   // free
#include <string.h>   // strstr

#define REAL_DRIVER ... // filled in make
#define HOOK_DRIVER ... // filled in make
#define STACK_TRACE_SIZE 32 // depth of the backtrace buffer (illustrative)

static void* (*real_dlopen)(const char*, int) = NULL;

void* dlopen(const char* filename, int flags) {
    // RTLD_NEXT -> the real dlopen of libc
    if (!real_dlopen)
        real_dlopen = (void* (*)(const char*, int))dlsym(RTLD_NEXT, "dlopen");
    if (filename != NULL && strstr(filename, "libcuda.so.1") != NULL) {
        void* tmp[STACK_TRACE_SIZE];
        int size = backtrace(tmp, STACK_TRACE_SIZE);
        char** syms = backtrace_symbols(tmp, size);
        for (int i = 0; i < size; i++) {
            // caller is filtered out, go with REAL_DRIVER
            if (strstr(syms[i], "cublas") != NULL) {
                free(syms);
                return real_dlopen(REAL_DRIVER, flags);
            }
        }
        free(syms);
        // not filtered out, go with HOOK_DRIVER
        return real_dlopen(HOOK_DRIVER, flags);
    }
    // other libraries, let them go to the real dlopen
    return real_dlopen(filename, flags);
}
Moreover, to follow the vendors' EULAs, we use the backtrace to detect the caller and filter out proprietary products, keeping Neutrino legally safe.
Usage: the compiled preload.c will be injected automatically by the CLI.