ThreadStackSpoofer is a PoC implementation of an advanced in-memory evasion technique that spoofs a thread’s call stack. The technique makes it possible to bypass thread-based memory examination rules and to better hide shellcode while it resides in process memory.
This is an example implementation of the Thread Stack Spoofing technique, aiming to evade Malware Analysts, AVs and EDRs looking for references to the shellcode’s frames in an examined thread’s call stack. The idea is to hide references to the shellcode on the thread’s call stack, thus masquerading allocations containing the malware’s code.
Together with my ShellcodeFluctuation, this implementation gives the Offensive Security community sample implementations that catch up with what commercial C2 products offer, so that our Red Team tooling does no worse. 💪
The current implementation differs heavily from what was originally published. This is because I realised there is a way simpler approach to terminate the walk of a thread’s call stack and hide the shellcode’s related frames: simply writing 0 to the return address of the first frame we control:
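void WINAPI MySleep(DWORD _dwMilliseconds)
{
    […]
    // Locate this frame's return address slot and remember its value.
    auto overwrite = (PULONG_PTR)_AddressOfReturnAddress();
    const auto origReturnAddress = *overwrite;
    // A zero return address ends the stack walk at this frame.
    *overwrite = 0;
    […]
    // Put the original return address back before returning.
    *overwrite = origReturnAddress;
}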
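Why a zeroed return address hides the deeper frames can be illustrated with a toy model. The sketch below is a portable simulation, not the PoC’s code: `Frame`, `walkStack` and `walkWhileSpoofed` are hypothetical names, and it assumes a walker that stops at the first null return address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one stack frame: only the saved return address matters here.
struct Frame {
    std::uint64_t returnAddress;
};

// Walks frames from the innermost outwards and collects return addresses,
// stopping at the first frame whose return address is 0 (assumed walker behaviour).
std::vector<std::uint64_t> walkStack(const std::vector<Frame>& frames) {
    std::vector<std::uint64_t> trace;
    for (const Frame& f : frames) {
        if (f.returnAddress == 0) break;
        trace.push_back(f.returnAddress);
    }
    return trace;
}

// Mimics the MySleep pattern: zero one return address, let the walker run, restore it.
std::vector<std::uint64_t> walkWhileSpoofed(std::vector<Frame>& frames, std::size_t index) {
    assert(index < frames.size());
    const std::uint64_t orig = frames[index].returnAddress;
    frames[index].returnAddress = 0;                      // the walk ends at this frame
    std::vector<std::uint64_t> trace = walkStack(frames);
    frames[index].returnAddress = orig;                   // execution can resume normally
    return trace;
}
```

With four frames and the third one zeroed, the walker in this model reports only the first two return addresses, and the original value is back in place afterwards.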
The previous implementation, utilising StackWalk64, can be accessed in this commit c250724.
This implementation is much more stable and works nicely on both Debug and Release under two architectures – x64 and x86.
This is what a call stack may look like when it is NOT spoofed:
And this is what it looks like when thread stack spoofing is enabled:
Above we can see that the last frame on our call stack is our MySleep callback. One may wonder whether this immediately creates opportunities for new IOCs. Hunting rules can look for threads whose call stacks do not unwind into the expected thread entry points located within system libraries.
Although the call stack of the spoofed thread may look rather odd at first, a brief examination of my system showed that there are other threads not unwinding to such entry points as well:
The above screenshot shows a thread of unmodified Total Commander x64. As we can see, its call stack pretty much resembles our own in terms of initial call stack frames.
Why should we care about carefully faking our call stack when there are processes exhibiting traits that we can simply mimic?
The rough algorithm is as follows:
1. Load dbghelp.dll, call SymInitialize.
2. Hook kernel32!Sleep, pointing it back to our callback.
3. Inject and launch the shellcode via VirtualAlloc + memcpy + CreateThread. The thread should start from our runShellcode function to avoid having the thread’s StartAddress point somewhere unexpected and anomalous (such as ntdll!RtlUserThreadStart+0x21).
4. When the Beacon goes to sleep, our MySleep callback gets invoked.
5. The return address is overwritten with 0, which effectively should finish the call stack.
6. A call to ::SleepEx is made to let the Beacon sleep while waiting for further communication.
Function return addresses are scattered all around the thread’s stack memory area, pointed to by the RBP/EBP register. In order to find them on the stack, we first need to collect frame pointers, then dereference them for overwriting:
*(PULONG_PTR)(frameAddr + sizeof(void*)) = Fake_Return_Address;
The initial implementation of ThreadStackSpoofer did that in the walkCallStack and spoofCallStack functions; however, the current implementation shows that these efforts are not required to maintain a stealthy call stack.
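The two steps of that earlier approach, collecting frame pointers and then overwriting the slot next to each of them, can be sketched on a toy stack. This is a portable simulation under stated assumptions, not the code of walkCallStack or spoofCallStack: the `Stack` layout, slot indices in place of real addresses, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stack: each frame occupies two consecutive slots.
//   stack[fp]     = saved frame pointer of the caller (a slot index, 0 ends the chain)
//   stack[fp + 1] = return address (the analogue of frameAddr + sizeof(void*))
using Stack = std::vector<std::uint64_t>;

// Step 1: follow the saved-frame-pointer chain and collect every frame pointer.
// The size guard stops the loop if a corrupted chain forms a cycle.
std::vector<std::size_t> collectFramePointers(const Stack& stack, std::size_t fp) {
    std::vector<std::size_t> frames;
    while (fp != 0 && fp + 1 < stack.size() && frames.size() < stack.size()) {
        frames.push_back(fp);
        fp = static_cast<std::size_t>(stack[fp]);
    }
    return frames;
}

// Step 2: dereference each collected frame pointer and overwrite its return address.
void overwriteReturnAddresses(Stack& stack, const std::vector<std::size_t>& frames,
                              std::uint64_t fakeReturnAddress) {
    for (std::size_t fp : frames) {
        assert(fp + 1 < stack.size());
        stack[fp + 1] = fakeReturnAddress;
    }
}
```

Only the return address slots change in this model; the saved frame pointers are left intact so the chain can still be followed.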
Example Run
Usage:
C:\> ThreadStackSpoofer.exe <shellcode> <spoof>
Where:
<shellcode> is a path to the shellcode file
<spoof> when 1 or true will enable thread stack spoofing and anything else disables it.
Example run that spoofs beacon’s thread call stack:
PS D:\dev2\ThreadStackSpoofer> .\x64\Release\ThreadStackSpoofer.exe .\tests\beacon64.bin 1
[.] Reading shellcode bytes…
[.] Hooking kernel32!Sleep…
[.] Injecting shellcode…
[+] Shellcode is now running.
[>] Original return address: 0x1926747bd51. Finishing call stack…
===> MySleep(5000)
[<] Restoring original return address…
[>] Original return address: 0x1926747bd51. Finishing call stack…
===> MySleep(5000)
[<] Restoring original return address…
[>] Original return address: 0x1926747bd51. Finishing call stack…
How Do I Use It?
Look at the code and its implementation, understand the concept and re-implement it within your own Shellcode Loaders that you utilise to deliver your Red Team engagements. This is yet another technique for advanced in-memory evasion that increases your Team’s chances of not getting caught by Anti-Viruses, EDRs and Malware Analysts taking a look at your implants.
While developing your advanced shellcode loader, you might also want to implement:
- Evasion of scanners such as BeaconEye
- Switching the implant’s memory pages to RW (from RX/RWX) and encrypting their contents, using the Shellcode Fluctuation technique, right before sleeping (that could evade scanners such as Moneta or pe-sieve)
Actually this is not (yet) a true stack spoofing
As it’s been pointed out to me, the technique here does not yet truly live up to its name as a stack spoofer. Since we’re merely overwriting return addresses on the thread’s stack, we’re not spoofing the remaining areas of the stack itself. Moreover, we’re leaving our call stack non-unwindable, making it look anomalous, since the system will not be able to properly walk the entire chain of call stack frames.
Although I’m aware of these shortcomings, for the moment I’ve left it as is, since I cared mostly about evading automated scanners that could iterate over processes, enumerate their threads, walk those threads’ stacks and pick up on any return address pointing back to non-image memory (such as MEM_PRIVATE, the type of memory allocated dynamically by VirtualAlloc and friends). A focused malware analyst would immediately spot the oddity and consider the thread rather unusual, hunting down our implant. I’m more than sure about it. Yet I don’t believe that today’s automated scanners such as AV/EDR have heuristics implemented that would actually walk each thread’s stack to verify whether it is unwindable ¯\_(ツ)_/¯.
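The kind of automated check described above can be modelled in a few lines. This is a portable sketch under stated assumptions, not any vendor’s actual rule: `Region`, `RegionType`, `isImageBacked` and `countNonImageFrames` are hypothetical names, the region table stands in for what a scanner would query from the OS, and the walk is assumed to stop at a null return address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class RegionType { Image, Private };

// Hypothetical description of one memory region of the examined process.
struct Region {
    std::uint64_t base;
    std::uint64_t size;
    RegionType type;
};

// Returns true when 'address' lies inside a region backed by an image.
bool isImageBacked(const std::vector<Region>& regions, std::uint64_t address) {
    for (const Region& r : regions) {
        if (address >= r.base && address - r.base < r.size)
            return r.type == RegionType::Image;
    }
    return false; // unknown memory is treated as not image-backed
}

// Counts return addresses that point outside image-backed memory.
std::size_t countNonImageFrames(const std::vector<Region>& regions,
                                const std::vector<std::uint64_t>& returnAddresses) {
    assert(!regions.empty());
    std::size_t hits = 0;
    for (std::uint64_t address : returnAddresses) {
        if (address == 0) break; // the walk ends here; deeper frames are never seen
        if (!isImageBacked(regions, address)) ++hits;
    }
    return hits;
}
```

In this model a trace that reaches a return address in private memory is flagged, while the same trace cut short by a zero never exposes that address to the rule.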
Surely this project (and the commercial implementations found in C2 frameworks) gives AV & EDR vendors arguments to consider implementing appropriate heuristics covering such a novel evasion technique.
In order to improve this technique, one can aim for a true Thread Stack Spoofer by inserting carefully crafted fake stack frames established in a reverse-unwinding process. Read more on this idea below.
Implementing a true Thread Stack Spoofer
An hours-long conversation with namazso taught me that in order to aim for a proper thread stack spoofer we would need to reverse the x64 call stack unwinding process. Firstly, one needs to carefully study the stack unwinding process explained in (a) linked below. When the system traverses a thread’s call stack on the x64 architecture, it will not simply rely on return addresses scattered around the thread’s stack, but rather it:
- consults the RUNTIME_FUNCTION, UNWIND_INFO and UNWIND_CODE structures. These structures describe the function’s beginning address, its ending address, and all the code sequences that modify RBP or RSP.
- processes the UNWIND_CODEs that the examined function exhibits to precisely compute the location of that frame’s return address and stack pointer value.
In order to interfere with this process we would need to reverse it by having our own reversed form of RtlVirtualUnwind. We would need to iterate over the functions defined in a module (say, kernel32), scan each function’s UNWIND_CODE codes and closely emulate them backwards (as compared to RtlVirtualUnwind and, more precisely, RtlpUnwindPrologue) in order to find the locations on the stack where our fake return addresses should go.
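How unwind codes pin down the return address slot can be shown with a reduced model. The sketch below is a simplification under stated assumptions, not a reimplementation of RtlVirtualUnwind: `SimpleUnwindCode`, `frameSize` and `returnAddressSlot` are hypothetical names, only four operations are modelled, UWOP_ALLOC_LARGE carries its size in bytes directly, and frame-pointer based functions are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Operation code values as documented for x64 unwind data.
enum UnwindOp : std::uint8_t {
    UWOP_PUSH_NONVOL = 0, // push of a nonvolatile register: 8 bytes
    UWOP_ALLOC_LARGE = 1, // large stack allocation
    UWOP_ALLOC_SMALL = 2, // small stack allocation: OpInfo * 8 + 8 bytes
    UWOP_SET_FPREG   = 3  // establishes the frame pointer, no size change in this model
};

// Simplified unwind code. 'info' is the register number for pushes, OpInfo for
// UWOP_ALLOC_SMALL, and (as a simplification) the byte size for UWOP_ALLOC_LARGE.
struct SimpleUnwindCode {
    UnwindOp op;
    std::uint32_t info;
};

// Total number of bytes the prolog moved RSP by, according to the unwind codes.
std::uint64_t frameSize(const std::vector<SimpleUnwindCode>& codes) {
    std::uint64_t size = 0;
    for (const SimpleUnwindCode& c : codes) {
        switch (c.op) {
        case UWOP_PUSH_NONVOL: size += 8; break;
        case UWOP_ALLOC_SMALL: assert(c.info <= 15); size += c.info * 8ULL + 8; break;
        case UWOP_ALLOC_LARGE: size += c.info; break;
        case UWOP_SET_FPREG:   break;
        }
    }
    return size;
}

// Given RSP after the prolog has run, the return address sits just above the frame.
std::uint64_t returnAddressSlot(std::uint64_t rspAfterProlog,
                                const std::vector<SimpleUnwindCode>& codes) {
    return rspAfterProlog + frameSize(codes);
}
```

Running the same arithmetic in the opposite direction, from a chosen return address slot down to the RSP a function would have after its prolog, is the backwards emulation the paragraph above refers to.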
namazso mentions the necessity of introducing 3 fake stack frames to nicely stitch the call stack:
1. A desynch gadget frame that unwinds differently from the caller of our MySleep (having a different UWOP, Unwind Operation code). We find it by looking through all functions of a module, looking through their UWOPs and calculating how big the fake frame should be. This frame must have UWOPs different from those of our MySleep’s caller.
2. A gadget frame that restores RBP from the stack, basically through the UWOP_PUSH_NONVOL code.
3. A gadget frame that restores RSP from RBP through the UWOP_SET_FPREG code.
The restored RSP must be set to the RSP taken from wherever control flow entered our MySleep, so that all our frames become hidden as a result of the third gadget unwinding there.
In order to begin the process, one can iterate over the executable’s .pdata by dereferencing the IMAGE_DIRECTORY_ENTRY_EXCEPTION data directory entry. Consider the example below:
// kernel32 is the module searched for candidate gadget functions.
ULONG_PTR imageBase = (ULONG_PTR)GetModuleHandleA("kernel32");
PIMAGE_NT_HEADERS64 pNthdrs = PIMAGE_NT_HEADERS64(imageBase + PIMAGE_DOS_HEADER(imageBase)->e_lfanew);

// The exception directory (.pdata) is an array of RUNTIME_FUNCTION entries.
auto excdir = pNthdrs->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXCEPTION];
if (excdir.Size == 0 || excdir.VirtualAddress == 0)
    return;

auto begin = PRUNTIME_FUNCTION(excdir.VirtualAddress + imageBase);
auto end = PRUNTIME_FUNCTION(excdir.VirtualAddress + imageBase + excdir.Size);

// Look up the unwind data of the function containing the address held in 'caller'.
UNWIND_HISTORY_TABLE mshist = { 0 };
DWORD64 imageBase2 = 0;
PRUNTIME_FUNCTION currFrame = RtlLookupFunctionEntry(
    (DWORD64)caller,
    &imageBase2,
    &mshist
);
if (currFrame == NULL)
    return;

// UnwindData is an RVA relative to the module that owns the function (imageBase2),
// which is not necessarily kernel32.
UNWIND_INFO *mySleep = (UNWIND_INFO*)(currFrame->UnwindData + imageBase2);
UNWIND_CODE myFrameUwop = (UNWIND_CODE)(mySleep->UnwindCodes[0]);
log("1. MySleep RIP UWOP: ", myFrameUwop.UnwindOpcode);

// Scan every function of kernel32 and compare its first unwind operation with ours.
for (PRUNTIME_FUNCTION it = begin; it < end; ++it)
{
    UNWIND_INFO* unwindData = (UNWIND_INFO*)(it->UnwindData + imageBase);
    UNWIND_CODE frameUwop = (UNWIND_CODE)(unwindData->UnwindCodes[0]);
    if (frameUwop.UnwindOpcode != myFrameUwop.UnwindOpcode)
    {
        // Found candidate function for a desynch gadget frame
    }
}
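The selection criteria for the three gadget frames can be expressed as predicates over a function’s unwind codes. The sketch below is a portable model and an assumption about how such a filter might look, not part of this PoC: `FunctionModel`, `UnwindCodeModel` and the predicate names are hypothetical, functions without unwind codes are simply skipped, and the frame register used by UWOP_SET_FPREG is not checked.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Values as documented for x64 unwind data; RBP is register number 5.
constexpr std::uint8_t kUwopPushNonvol = 0;
constexpr std::uint8_t kUwopSetFpreg   = 3;
constexpr std::uint8_t kRegRbp         = 5;

struct UnwindCodeModel {
    std::uint8_t opcode;
    std::uint8_t opInfo;
};

struct FunctionModel {
    std::string name;
    std::vector<UnwindCodeModel> codes;
};

// First unwind operation of a function; the caller must ensure there is one.
std::uint8_t firstOpcode(const FunctionModel& f) {
    assert(!f.codes.empty());
    return f.codes.front().opcode;
}

// Frame 1: unwinds differently from MySleep's caller (different first UWOP).
bool isDesyncCandidate(const FunctionModel& f, std::uint8_t callerFirstOpcode) {
    return !f.codes.empty() && firstOpcode(f) != callerFirstOpcode;
}

// Frame 2: unwinding restores RBP from the stack (UWOP_PUSH_NONVOL of RBP).
bool restoresRbpFromStack(const FunctionModel& f) {
    for (const UnwindCodeModel& c : f.codes)
        if (c.opcode == kUwopPushNonvol && c.opInfo == kRegRbp) return true;
    return false;
}

// Frame 3: unwinding restores RSP from the frame register (UWOP_SET_FPREG).
bool restoresRspFromFrameRegister(const FunctionModel& f) {
    for (const UnwindCodeModel& c : f.codes)
        if (c.opcode == kUwopSetFpreg) return true;
    return false;
}
```

A real implementation would populate such records from each RUNTIME_FUNCTION entry in .pdata and would still need to compute the size of every fake frame.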
The process is a bit convoluted, yet it boils down to reversing the thread’s call stack unwinding process by substituting arbitrary stack frames with other, carefully selected ones, in a ROP-like approach.
This PoC does not replicate this algorithm, because my current understanding allows me to accept a call stack finishing on an EXE-based stack frame, and I don’t want to overcomplicate either my shellcode loaders or this PoC. I’m leaving the exercise of implementing this and sharing it publicly to a keen reader. Or maybe I’ll sit down and have a try at doing this myself given some more spare time 🙂