# NtHookEngine x64 Bug Notes

Findings from a debugging session against the test harness in this folder.

## Summary

NtHookEngine works correctly on x86 but produces broken hooks on x64 due to two distinct issues:

1. **Pointer truncation in the consumer-side hook plumbing** (`InstallHook` / `DoHook` / `ADDHOOK`). 32-bit `int*` thunks silently chopped 8-byte trampoline addresses down to 4 bytes.
2. **RIP-relative instructions are not relocated when copied to the trampoline.** Any hooked function whose first ~12 bytes contain a RIP-relative load (very common on modern Windows due to `/GS` stack cookies) executes a garbage memory access in the trampoline.

Bug #1 is fixed. Bug #2 is a structural limitation of the engine.

## Bug #1: int* thunk truncation

### Symptom

The `Real_X` function pointer ends up holding only the low 32 bits of the trampoline address. When the detour calls `Real_X(...)`, control jumps to `0x00000000XXXXXXXX` (or worse, sign-extended `0xFFFFFFFFXXXXXXXX`) and crashes, usually inside `CountPreAlignBytes` during the *next* hook install when it dereferences a truncated address.

Visible signature: an access violation reading from an address like `0xFFFFFFFFD8E6EED0` where the real address should have been `0x00007FFDD8E6EED0`. Note the high bits got `FFFFFFFF` because the low dword's bit 31 was set and got sign-extended.

### Root cause

```cpp
// Old:
bool InstallHook(void* real, void* hook, int* thunk, char* name, hookType ht){
    if(HookFunction((ULONG_PTR)real, (ULONG_PTR)hook, name, ht)){
        *thunk = (int)GetOriginalFunction((ULONG_PTR)hook);  // ← truncates 8 bytes to 4
        return true;
    }
    return false;
}

#define ADDHOOK(name) DoHook(My_##name, (int*)&Real_##name, #name);
//                                      ↑ casts pointer-to-pointer to int*
//                                        which masks the truncation from the compiler
```

`Real_X` is declared as a function pointer (8 bytes on x64). `&Real_X` is a pointer-to-function-pointer (also 8 bytes on x64). Casting it to `int*` disguises the size mismatch from the compiler.
Then `*thunk = (int)...` writes only 4 bytes through the `int*`, leaving the upper 4 bytes of `Real_X` unchanged (typically zero from initialization).

This worked accidentally on x86 because every pointer was 4 bytes. On x64 it was always wrong, but only became reliably visible when the trampoline buffer ended up at a high address, which depends on where `VirtualAlloc(NULL, ...)` chose to place it on this run. In the test harness `/BASE:0x8000000` forced a low load address and made things "work" until `Real_GetProcAddress` was called (which returned a high-addressed system DLL function).

### Fix

```cpp
// New:
bool InstallHook(void* real, void* hook, ULONG_PTR* thunk, char* name, hookType ht){
    if(HookFunction((ULONG_PTR)real, (ULONG_PTR)hook, name, ht)){
        *thunk = GetOriginalFunction((ULONG_PTR)hook);  // no cast, both are ULONG_PTR
        return true;
    }
    return false;
}

#define ADDHOOK(name) DoHook(My_##name, (ULONG_PTR*)&Real_##name, #name);
```

`ULONG_PTR` is 8 bytes on x64 and 4 bytes on x86, so it matches pointer size on both. The same change applies to `DoHook`'s signature and any manual `InstallHook` call sites.

### Related: function pointer return-type truncation

The test harness also had:

```cpp
// Wrong:
int (__stdcall *Real_GetProcAddress)(HMODULE, LPCSTR) = NULL;
int __stdcall My_GetProcAddress(HMODULE, LPCSTR) { ... }
```

`int` is 4 bytes regardless of arch. Real `GetProcAddress` returns 8-byte `FARPROC` on x64. Calling through a function pointer with the wrong return type causes the compiler to emit truncation-and-sign-extend on the return value. Production `main.h` had this correctly declared as `FARPROC` already; the test harness was a stale copy.

```cpp
// Right:
FARPROC (__stdcall *Real_GetProcAddress)(HMODULE, LPCSTR) = NULL;
FARPROC __stdcall My_GetProcAddress(HMODULE, LPCSTR) { ... }
```

Lesson: any `Real_X` function pointer whose real Windows API returns something pointer-sized (HANDLE, HMODULE, FARPROC, LPVOID, SOCKET, HINTERNET, LPSTR, etc.)
must be declared with that exact return type. Declaring it as `int` was a common pre-x64 shortcut that silently breaks on 64-bit.

## Bug #2: RIP-relative instructions not relocated

### Symptom

Hook installs successfully. The first call to the hooked function trips `STATUS_STACK_BUFFER_OVERRUN` (C0000409) inside the function itself, not in the trampoline. C0000409 is the status raised by `__report_gsfailure`, which means the `/GS` stack cookie check failed.

### Root cause

NtHookEngine's `CreateBridge` copies the original API's prolog instructions to the trampoline byte-for-byte, then appends a tail jump back to the remainder of the API. It does not inspect or rewrite the copied instructions.

x64 system DLLs compiled with `/GS` typically begin with something like:

```
push rbx                        ; 2 bytes
sub  rsp, 40                    ; 4 bytes
mov  rax, [rip + cookie_disp]   ; 7 bytes, RIP-relative load of __security_cookie
xor  rax, rsp
mov  [rsp+38], rax              ; store cookie on stack
... function body ...
mov  rcx, [rsp+38]
xor  rcx, rsp
call __security_check_cookie    ; verify cookie unchanged; calls __report_gsfailure on mismatch
```

When the engine copies the first 12+ bytes to the trampoline, the `mov rax, [rip + cookie_disp]` instruction is copied byte-for-byte. But the `disp` operand is interpreted relative to the *current* RIP at execution. In the trampoline, RIP is at a different address than in the original, so the load reads from a completely different memory location, which almost always holds garbage.

The function then computes `xor rax, rsp` with garbage `rax`, stores a fake cookie, runs to completion, and on exit `__security_check_cookie` finds the saved cookie doesn't match the global one. `__report_gsfailure` fires.
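To make the mismatch concrete, here is a minimal sketch of how a RIP-relative operand resolves, assuming a plain `disp32` operand with no other address components. The function names are illustrative only and do not exist in NtHookEngine.

```cpp
#include <cstdint>

// Effective address of a RIP-relative operand. By the time the displacement
// is applied, RIP already points at the *next* instruction, hence + insn_len.
constexpr std::uint64_t rip_relative_target(std::uint64_t insn_addr,
                                            std::uint32_t insn_len,
                                            std::int32_t disp) {
    // Sign-extend disp to 64 bits; unsigned wraparound handles negative values.
    return insn_addr + insn_len +
           static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}

// How far the copied instruction's load lands from the intended address
// when the displacement bytes are left unchanged in the trampoline.
constexpr std::int64_t copied_load_error(std::uint64_t orig_insn,
                                         std::uint64_t tramp_insn,
                                         std::uint32_t insn_len,
                                         std::int32_t disp) {
    return static_cast<std::int64_t>(
        rip_relative_target(tramp_insn, insn_len, disp) -
        rip_relative_target(orig_insn, insn_len, disp));
}
```

The error does not depend on the displacement or the instruction length: the copied load misses its target by exactly the distance between the trampoline copy and the original instruction.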
Confirmed by inspection of the trampoline:

```
trampoline+00: 40 53                   push rbx                  ; OK, no operands
trampoline+02: 48 83 EC 40             sub rsp, 40               ; OK
trampoline+06: 48 8B 05 53 50 25 00    mov rax, [rip+0x255053]   ; ← BUG: disp not relocated
trampoline+0D: FF 25 00 00 00 00       jmp qword ptr [rip+0]     ; tail jump (correct, ht_jmpderef)
trampoline+13: <8-byte target address back to original API>
```

The `0x00255053` displacement was correct for the original API's RIP. At the trampoline's RIP it points to unmapped or unrelated memory.

### Engineering implications

A correct fix requires:

1. **Detecting RIP-relative operands** in each instruction being copied. diStorm provides operand encoding info. It needs to be checked for a ModR/M byte with `mod = 00b` and `r/m = 101b` (which means `[disp32]`, interpreted as RIP-relative in 64-bit mode) and similar patterns for indirect calls/jumps.
2. **Rewriting the displacement** to point to the same effective address from the trampoline's location:

   ```
   new_disp = (original_addr + instruction_size + old_disp) - (trampoline_addr + instruction_size)
   ```

3. **Ensuring the trampoline is within ±2GB of the targets it references.** The new displacement must fit in a signed 32-bit field. NtHookEngine currently allocates the bridge buffer wherever `VirtualAlloc(NULL, ...)` decides, which is typically far from system DLLs. A correct implementation needs to allocate near each hooked module (using `MEM_TOP_DOWN` with address hints, or scanning for free regions in range).
4. **Handling instructions where rewriting would require widening** (e.g. a short conditional branch within the prolog whose new target is too far for the original encoding). These need to be rewritten as longer equivalents, which changes instruction sizes and cascades into other displacement fixups.

### Status

Not fixed in this engine. Production code currently works around this by disabling hooks for any function whose prolog is observed to fail, which is fragile (a Windows update can change a previously-working prolog).
The proper fix is enough engineering work that swapping the engine for a maintained library that handles this (MinHook, PolyHook, Detours) is the pragmatic path forward.

## Other x64 fixes applied during this investigation

For reference, these were also fixed in NtHookEngine.cpp during the same session and are unrelated to the consumer-side bugs above:

- `abs_jump_required(UINT, UINT)` truncated 64-bit pointers. Changed to `ULONG_PTR` and a signed 64-bit delta computation.
- Bridge tail jump used `ht_jmp` (`mov rax, imm64; jmp rax`), which clobbered RAX mid-prolog. Changed to `ht_jmpderef` (`jmp qword ptr [rip+0]` followed by the 8-byte target address), which preserves all registers.
- `ht_pushret` on x64 used `is32BitSafe` (high 32 bits zero), but `push imm32` sign-extends, so bit 31 must also be clear. Added `is32BitPushSafe`.
- `OverWriteScratchPad` restored `VirtualProtect` on the post-loop `pCur` rather than the pre-loop start. Fixed by saving the start address.
- `#else ifdef _M_AMD64` is not a valid preprocessor directive. Changed to `#elif defined(_M_AMD64)`.

These were all real bugs but mostly latent (they worked by accident in common configurations). The two consumer-side issues (Bug #1 above) and the RIP-relative relocation gap (Bug #2) are the ones that produce the consistent "x64 doesn't work" symptom.

## How the test harness reproduces this

`NtHookEngine_Test.cpp` hooks GetProcAddress, ExitProcess, and WinExec from inside its own process. With Bug #1 unfixed and `/BASE:0x8000000` set:

1. First hook (GetProcAddress) installs successfully. Trampoline address is in low memory so truncation is lossless.
2. Setup of next hook calls `Real_GetProcAddress(hKernelBase, "ExitProcess")` which returns a high system-DLL address (`0x00007FFD...`). Truncation chops to `0xD8E6EED0`, sign-extends to `0xFFFFFFFFD8E6EED0`.
3. Engine tries to hook that bogus address, calls `CountPreAlignBytes` which dereferences `[address-1]`, access violation.

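The truncate-then-sign-extend arithmetic in step 2 can be reproduced in isolation. The sketch below models what happens to a 64-bit address returned through a function pointer declared with an `int` return type; the helper names are made up for illustration and are not part of the harness.

```cpp
#include <cstdint>

// Models a 64-bit pointer value passing through an `int` return type:
// keep the low 32 bits, reinterpret them as signed, then sign-extend
// back to 64 bits when the caller widens the result.
constexpr std::uint64_t through_int_return(std::uint64_t addr) {
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(
            static_cast<std::int32_t>(
                static_cast<std::uint32_t>(addr))));
}

// True when the round trip is lossless, i.e. the address is below 2 GB.
// This is the "works by accident" case for low-memory addresses.
constexpr bool survives_int_return(std::uint64_t addr) {
    return through_int_return(addr) == addr;
}
```

Feeding in `0x00007FFDD8E6EED0` yields `0xFFFFFFFFD8E6EED0`, the bogus address from the access violation, while an address just above the `/BASE:0x8000000` load address survives unchanged.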
Once `int*` is changed to `ULONG_PTR*` and the return type of `Real_GetProcAddress` is changed to `FARPROC`, the harness runs to completion: all three hooks fire and Calc launches. Bug #2 only manifests if a hooked function happens to have a RIP-relative instruction in its prolog. On this Windows version GetProcAddress, WinExec, and ExitProcess don't, but GetSystemTime and many others do.
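For prologs that do contain a RIP-relative instruction, the detection and displacement rewrite described under Bug #2 could look roughly like the sketch below. This is not code from NtHookEngine: both helpers are hypothetical, and they assume the decoder has already located the ModR/M byte, the `disp32` value, and the instruction length.

```cpp
#include <cstdint>
#include <limits>

// In 64-bit mode, a ModR/M byte with mod == 00b and r/m == 101b selects
// [RIP + disp32]. The middle three bits (reg) are ignored by the mask.
constexpr bool modrm_is_rip_relative(std::uint8_t modrm) {
    return (modrm & 0xC7) == 0x05;
}

// Computes the displacement that makes the trampoline copy reference the
// same effective address as the original instruction. Returns false when
// the result does not fit in a signed 32-bit field, meaning the trampoline
// is more than about 2 GB away from the target.
// Assumes user-mode addresses, so the int64 arithmetic cannot overflow.
inline bool relocate_disp32(std::uint64_t orig_insn,
                            std::uint64_t tramp_insn,
                            std::uint32_t insn_len,
                            std::int32_t old_disp,
                            std::int32_t& new_disp) {
    const std::int64_t target =
        static_cast<std::int64_t>(orig_insn + insn_len) + old_disp;
    const std::int64_t delta =
        target - static_cast<std::int64_t>(tramp_insn + insn_len);
    if (delta < std::numeric_limits<std::int32_t>::min() ||
        delta > std::numeric_limits<std::int32_t>::max()) {
        return false;  // out of range: trampoline must be allocated closer
    }
    new_disp = static_cast<std::int32_t>(delta);
    return true;
}
```

For the `48 8B 05 ...` instruction in the trampoline dump, the ModR/M byte is `0x05`, so the check fires. A `false` return from `relocate_disp32` corresponds to the ±2GB constraint: the only remedies are a nearer trampoline allocation or a rewritten instruction sequence.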