Hook Driver
You can find the code in neutrino/src/.
The hook driver is organized as follows:
common.h # platform-agnostic definitions (GNU-only)
preload.c # injector via LD_PRELOAD
cuda.c # CUDA implementation (NVIDIA)
hip.c # ROCm implementation (AMD)
parse.py # generates the unhooked APIs (NVIDIA/AMD)
sha1.h # third-party header-only library
uthash.h # third-party header-only library
Mechanism
The hook driver is based on a known trick of the ELF format (I learned it from Valgrind): symbols (exported functions) are looked up by their signature (name, parameters, and returns). Therefore, as long as one provides a function with the same signature (available in documentation or headers such as cuda.h or hip_runtime.h), loaders (ld.so at load time or libdl via dlopen at runtime) will treat it as valid.
To make the system work, we can dlsym to get a pointer to the real function and call it, like the following:
#include <dlfcn.h> // dlopen, dlsym

#define REAL_DRIVER ... // path to real driver
static void* dlib = NULL; // dlopen handle

CUresult cuInit(unsigned Flags) { // same signature as the real cuInit
    if (!dlib) { dlib = dlopen(REAL_DRIVER, RTLD_LAZY); }
    CUresult (*real)(unsigned) = (CUresult (*)(unsigned))dlsym(dlib, "cuInit");
    // insert code here := eBPF uprobe
    CUresult ret = real(Flags);
    // insert code here := eBPF uretprobe
    return ret;
}
By doing so, we emulate a "symbolic link" at low cost (one additional function call if we cache real rather than calling dlsym every time) but expose a programmable interface for adding arbitrary logic in C.
Compared with other approaches such as eBPF uprobes, this is safer since it runs purely in user space.
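For the cached variant mentioned above, a minimal sketch (the real_cuInit naming is just illustrative):

// cached variant: resolve the real cuInit only once, then reuse the pointer
static CUresult (*real_cuInit)(unsigned) = NULL;

CUresult cuInit(unsigned Flags) {
    if (!dlib) { dlib = dlopen(REAL_DRIVER, RTLD_LAZY); }
    if (!real_cuInit) // later calls skip the dlsym lookup
        real_cuInit = (CUresult (*)(unsigned))dlsym(dlib, "cuInit");
    return real_cuInit(Flags);
}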
To enforce the usage of hooked symbols, we can use environment variables such as LD_PRELOAD and LD_LIBRARY_PATH, which change the loader's preference and make our hooked symbols the ones actually used.
Hooked APIs
Code in neutrino/src/cuda.c and neutrino/src/hip.c.
neutrino mainly hooks Module Management APIs (cuda/hip docs) and Execution Control APIs (cuda/hip docs).
Module APIs
CUresult cuModuleLoadData(CUmodule* module, const void* image);
CUresult cuLibraryLoadData(CUlibrary* library, const void* code, CUjit_option* jitOptions, void** jitOptionsValues, unsigned int numJitOptions, CUlibraryOption* libraryOptions, void** libraryOptionValues, unsigned int numLibraryOptions);
CUresult cuModuleLoadDataEx(CUmodule* module, const void* image, unsigned int numOptions, CUjit_option* options, void** optionValues);
CUresult cuModuleLoadFatBinary(CUmodule* module, const void* fatCubin);
CUresult cuModuleGetFunction(CUfunction* hfunc, CUmodule hmod, const char* name);
CUresult cuKernelGetFunction(CUfunction* pFunc, CUkernel kernel);
CUresult cuLibraryGetKernel(CUkernel* pKernel, CUlibrary library, const char* name);
CUresult cuLibraryGetModule(CUmodule* pMod, CUlibrary library);
hipError_t hipModuleLoadData(hipModule_t* module, const void* image);
hipError_t hipModuleLoadDataEx(hipModule_t* module, const void* image, unsigned int numOptions, hipJitOption* options, void** optionValues);
hipError_t hipModuleGetFunction(hipFunction_t* function, hipModule_t module, const char* kname);
These APIs handle the loading of GPU code from disk (embedded in the host ELF) into the GPU runtime (cu/hipModuleLoadData), and the lowering from a module to a particular kernel (cu/hipModuleGetFunction).
For these APIs, neutrino interacts with the internal storage binmap to record the loaded code and the mapping from kernel to code, as sketched below.
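For illustration, here is a minimal sketch of how a hooked cuModuleLoadData could record the image into binmap; real_cuModuleLoadData follows the generated-pointer convention described under Unhooked API, and get_managed_code_size/binmap_set are the common.h helpers documented below (the real cuda.c logic differs in detail):

// sketch: forward to the real API, then record the module image in binmap
CUresult cuModuleLoadData(CUmodule* module, const void* image) {
    CUresult ret = real_cuModuleLoadData(module, image);
    if (ret == CUDA_SUCCESS) {
        // measure the image (ELF / fatbinary / text, see GPU Code Parsing)
        unsigned long long size = get_managed_code_size(image);
        // copy it into protected storage, keyed by the new CUmodule handle
        binmap_set((void*)*module, (void*)image, size, NULL);
    }
    return ret;
}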
Execution API
CUresult cuLaunchKernel(CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra);
hipError_t hipModuleLaunchKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams, void** extra);
These APIs launch the kernel on the GPU. neutrino hooks them to expose a complex pipeline (a condensed sketch follows the list):
- Search the funcmap for the probed kernel and maps.
- Allocate the memory (both device and host) according to maps.
- Launch the probed kernel (a call to the real cu/hipLaunchKernel) and synchronize for its end.
- Copy back the device maps and save them to disk.
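A condensed sketch of this pipeline, assuming hypothetical helpers funcmap_lookup, alloc_maps, and save_maps that wrap the steps above, and the real_cuLaunchKernel pointer from the generated wrappers; funcmap_item_t is defined in Kernel Storage below (the real logic in cuda.c is more involved):

CUresult cuLaunchKernel(CUfunction f,
        unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
        unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
        unsigned int sharedMemBytes, CUstream hStream,
        void** kernelParams, void** extra) {
    funcmap_item_t* item = funcmap_lookup(f);        // 1. search the funcmap
    if (item == NULL || !item->succeed)              //    no probed version: fall back
        return real_cuLaunchKernel(f, gridDimX, gridDimY, gridDimZ,
                blockDimX, blockDimY, blockDimZ,
                sharedMemBytes, hStream, kernelParams, extra);
    void** params = alloc_maps(item, kernelParams);  // 2. allocate device/host maps
    CUresult ret = real_cuLaunchKernel((CUfunction)item->probed, // 3. launch probed kernel
            gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ,
            sharedMemBytes, hStream, params, extra);
    cuStreamSynchronize(hStream);                    //    synchronize for its end
    save_maps(item, params);                         // 4. copy device maps back, save to disk
    return ret;
}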
Standardized Functionalities
To build a unified architecture across platforms and enhance reusability, we standardize most APIs into common.h:
- Configuration reading: reading environment variables, see Command-Line Interface.
- GPU code parsing
- Trace folder management: see Trace Structure.
- Binary and kernel (function) storage management
- [TODO] Saving traces: see Trace Structure.
- [TODO] Interacting with probe engine
GPU Code Parsing
A practical issue with code loaded via cuModuleLoad is that it arrives as a void *, whose size is unpredictable from the C perspective. But after checking the magic number, this code (void *) is actually ELF, fatbinary, or plain text. Therefore, its size is predictable by parsing the header:
- ELF: parse Elf64_Ehdr from elf.h.
- Fatbinary: header->size + sizeof(fatBinaryHeader).
- Text: search for the terminating \0.
Usage: unsigned long long get_managed_code_size(const void* code).
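A simplified sketch of this dispatch; the fatBinaryHeader layout and magic value are assumptions based on publicly known fatbin headers, and the ELF size formula is the usual end-of-section-header-table approximation (the real common.h implementation handles more corner cases):

#include <elf.h>
#include <string.h>

#define FATBIN_MAGIC 0xBA55ED50u // assumed fatbinary magic number
typedef struct {
    unsigned int magic;
    unsigned short version;
    unsigned short headerSize;
    unsigned long long size; // payload size, excluding this header
} fatBinaryHeader;

unsigned long long get_managed_code_size(const void* code) {
    const unsigned char* p = (const unsigned char*)code;
    if (memcmp(p, ELFMAG, SELFMAG) == 0) { // ELF: size up to end of section headers
        const Elf64_Ehdr* ehdr = (const Elf64_Ehdr*)code;
        return ehdr->e_shoff + (unsigned long long)ehdr->e_shnum * ehdr->e_shentsize;
    }
    const fatBinaryHeader* fh = (const fatBinaryHeader*)code;
    if (fh->magic == FATBIN_MAGIC)        // fatbinary: header + payload
        return fh->size + sizeof(fatBinaryHeader);
    return strlen((const char*)code) + 1; // plain text: NUL-terminated
}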
Binary Storage
Another issue with a user-given void * is that it might be garbage-collected (particularly given the popularity of Python), leading to a core dump when we reference it later. Thus, we memcpy the code into an internal (hash-based) storage managed by uthash:
typedef struct {
void* key; // could be CUlibrary, CUmodule, CUfunction or HIP equivalent
void* code; // the binary code, protected
char* name; // name of function
unsigned long long size; // size of bin
UT_hash_handle hh; // uthash handles
} binmap_item;
key here is the CUlibrary, CUmodule, CUfunction, or HIP equivalent. In practice, these are pointers (void*) to the corresponding structs on the device.
Usage: the storage will be initialized automatically by the hook driver. Please use the following APIs (returning 0 on success, -1 on failure):
- int binmap_set(void* key, void* code, unsigned long long size, char* name): add a new item.
- int binmap_update_key(void* old_key, void* new_key): update the key of an item.
- int binmap_update_name_key(void* old_key, void* new_key, char* name): update the name of an item (re-keying it from old_key to new_key).
- int binmap_get(void* key, size_t* size, char** name, void** code): get the size, name, and code (protected).
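A hedged usage example (the kernel name "my_kernel" is hypothetical; error handling elided):

#include <stdio.h>

void binmap_example(CUmodule mod, CUfunction func, const void* image) {
    unsigned long long size = get_managed_code_size(image);
    binmap_set((void*)mod, (void*)image, size, NULL);   // record under the module key
    // once the kernel handle is resolved, re-key the item and attach the name
    binmap_update_name_key((void*)mod, (void*)func, "my_kernel");
    size_t got; char* name; void* code;
    if (binmap_get((void*)func, &got, &name, &code) == 0)
        printf("%s: %zu bytes of protected code\n", name, got);
}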
Kernel Storage
To avoid repeating the costly probing (~300 ms), we build another internal storage to cache the probing result of each kernel, covering both successes and failures:
// function map items, used as JIT code cache to avoid re-compilation
typedef struct {
void* original; // original CUfunction/HIPfunction
char* name; // name of function; can be NULL if unavailable
int n_param; // number of parameters, obtained from parsing
int n_probe; // number of probes that would dump memory
int* probe_sizes; // sizes of probe memory, order matches
int* probe_types; // types of probes
bool succeed; // specify JIT status -> if failed, always goto backup
void* probed; // probed CUfunction/HIPfunction
void* pruned; // pruned CUfunction/HIPfunction, for benchmark only
void* countd; // counting CUfunction/HIPfunction, for DYNAMIC=TRUE only
char* callback; // callback to analyze the trace
UT_hash_handle hh; // reserved by uthash
} funcmap_item_t;
The original kernel (void* original) is used as the key of the hashmap.
Usage: the storage will be initialized automatically by the hook driver, with the following exposed APIs:
- int funcmap_set(void* original, bool succeed...): add a record.
- int funcmap_get(void* original, bool* succeed...): get a record; everything is valid only if succeed = true.
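Since the full signatures are elided above, here is a minimal sketch of the underlying mechanism using uthash directly; the actual wrappers in common.h are equivalent in spirit:

#include <stddef.h>
#include "uthash.h"

static funcmap_item_t* funcmap = NULL; // hash head, keyed by `original`

static funcmap_item_t* funcmap_lookup(void* original) {
    funcmap_item_t* item = NULL;
    HASH_FIND_PTR(funcmap, &original, item); // look up by the original kernel pointer
    return item;
}

static void funcmap_insert(funcmap_item_t* item) {
    HASH_ADD_PTR(funcmap, original, item);   // hash on the `original` field
}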
Interacting with Probe Engine
We are still standardizing this part; you can check the current implementation in cuda.c or hip.c.
Interaction with the probe engine happens only if funcmap_get returns -1, i.e., no record is found. It is roughly a fork-exec (a condensed sketch follows the list):
- binmap_get to search the binary map for the code and name.
- mkdir within trace/kernel, with the directory name being the sha1 of the name.
- fwrite the code to original.bin.
- fork:
  - child: execlp the probe engine: python <PROBE_ENGINE> <workdir> <kernel_name>.
  - parent: waitpid for the probe engine (status is its return code).
- fread the kernel.info for metadata.
- Load the probed kernel back via cu/hipModuleLoadData and cu/hipModuleGetFunction.
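A condensed sketch of the fork-exec step; run_probe_engine, its parameters, and the path handling are illustrative, and the real flow in cuda.c/hip.c also computes the sha1 directory name:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_probe_engine(const char* probe_engine, const char* workdir,
                            const char* kernel_name,
                            const void* code, unsigned long long size) {
    mkdir(workdir, 0755);                     // trace/kernel/<sha1-of-name>
    char path[4096];
    snprintf(path, sizeof(path), "%s/original.bin", workdir);
    FILE* f = fopen(path, "wb");
    fwrite(code, 1, size, f);                 // dump the binary for the engine
    fclose(f);
    pid_t pid = fork();
    if (pid == 0) {                           // child: exec the probe engine
        execlp("python", "python", probe_engine, workdir, kernel_name, (char*)NULL);
        _exit(127);                           // reached only if exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);                 // parent: status is its return code
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}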
Unhooked API
One drawback of such "symbolic links" is that we have to expose every API, even those we are not interested in, such as cuMemAlloc. To avoid this huge workload, neutrino auto-generates them by parsing the local environment, i.e., headers (cuda.h/hip_runtime.h) and drivers (libcuda.so/libamdhip64.so). For each parsed symbol (func_name, ...), we generate a pointer to the real function using the following template:
CUresult (*real_{func_name})({param_list}) = NULL;
hipError_t (*real_{func_name})({param_list}) = NULL;
and the wrapper body using the following templates:
CUresult {func_name}({param_list}) {{
if (shared_lib == NULL) {{ init(); }}
CUresult err = real_{func_name}({param_val_list}); // call the real
if (VERBOSE) {{
fprintf(event_log, "[info] {func_name} %d\\n", err);
fflush(event_log); // block until output written for debugging
}}
return err;
}}
hipError_t {func_name}({param_list}) {{
if (shared_lib == NULL) {{ init(); }}
hipError_t err = real_{func_name}({param_val_list}); // call the real
if (VERBOSE) {{
fprintf(event_log, "[info] {func_name} %d\\n", err);
fflush(event_log); // block until output written for debugging
}}
return err;
}}
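For instance, instantiating the CUDA template for cuMemAlloc would yield roughly the following (shared_lib, init, VERBOSE, and event_log are the driver-wide globals referenced by the templates):

CUresult (*real_cuMemAlloc)(CUdeviceptr*, size_t) = NULL;

CUresult cuMemAlloc(CUdeviceptr* dptr, size_t bytesize) {
    if (shared_lib == NULL) { init(); }
    CUresult err = real_cuMemAlloc(dptr, bytesize); // call the real
    if (VERBOSE) {
        fprintf(event_log, "[info] cuMemAlloc %d\n", err);
        fflush(event_log); // block until output written for debugging
    }
    return err;
}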
Moreover, the shared library usually includes more suffixed/versioned symbols for versatile usage (such as _ptsz/_ptds for CUDA), e.g., to distinguish API versions. As our implementation is agnostic to the API's logic, we reuse the matched param_list with the updated func_name for these versioned symbols, allowing better compatibility.
Usage: parse.py will be called automatically by the builder (build.py).
Preload and Filter
For dynamically loaded libraries, it is not enough to simply set LD_PRELOAD=hook_driver/libcuda.so, since some programs use dlopen directly to locate the library and fetch function pointers, instead of using symbols resolved by the linker. But there is one function that cannot be dlopen-ed: dlopen itself, which must be handled by the linker. The Neutrino hook driver therefore injects its own dlopen to provide redirection via name matching.
#include <dlfcn.h>    // dlsym, RTLD_NEXT
#include <execinfo.h> // backtrace, backtrace_symbols
#include <stdlib.h>   // free
#include <string.h>   // strstr

#define REAL_DRIVER ... // filled in make
#define HOOK_DRIVER ... // filled in make
#define STACK_TRACE_SIZE 32 // depth of the backtrace buffer (illustrative)

static void* (*real_dlopen)(const char*, int) = NULL;

void* dlopen(const char* filename, int flags) {
    // RTLD_NEXT -> the real dlopen of libc
    if (!real_dlopen)
        real_dlopen = (void* (*)(const char*, int))dlsym(RTLD_NEXT, "dlopen");
    if (filename != NULL && strstr(filename, "libcuda.so.1") != NULL) {
        void* tmp[STACK_TRACE_SIZE];
        int size = backtrace(tmp, STACK_TRACE_SIZE);
        char** syms = backtrace_symbols(tmp, size);
        for (int i = 0; i < size; i++) {
            // caller is filtered out, go with REAL_DRIVER
            if (strstr(syms[i], "cublas") != NULL) {
                free(syms);
                return real_dlopen(REAL_DRIVER, flags);
            }
        }
        free(syms);
        // not filtered out, go with HOOK_DRIVER
        return real_dlopen(HOOK_DRIVER, flags);
    }
    // other libraries, let them go to the real dlopen
    return real_dlopen(filename, flags);
}
Moreover, to follow the vendors' EULAs, we use the backtrace to detect the caller and filter out proprietary products, keeping Neutrino legally safe.
Usage: the compiled preload.c will be injected automatically by the CLI.