Before delving into the technical implementation of this project, let's understand the difference between staged and stageless payloads. A stageless payload contains all of its required dependencies and is a standalone payload. In the context of C2 communication, once a stageless payload is executed, it connects back to the C2 server for instructions. A staged payload, on the other hand, is a small stager that, when executed, retrieves the main payload over the network and runs it.
There are many advantages to using staged payloads. Stagers are typically lightweight, which makes them practical when the dropper's size is a constraint. Additionally, if the stager gets blocked, the actual payload (the second stage) is not exposed. However, Blue Teams can analyse the stager to determine the endpoint hosting the second stage and attempt to retrieve it. When building C2 infrastructure, various measures can be taken to block unauthorised connection attempts, such as comparing the request's User-Agent string against a whitelist, checking cookie values, or only accepting requests from whitelisted IP ranges. However, an unauthorised client that satisfies these rules, for example by replaying the values recovered from the stager, can still obtain the main payload, which can then be analysed to gain information about the implant. This project aims to build a proof of concept (PoC) of a stager shellcode that sends an authentication token with its request, which the server validates before sending the second stage. The number of times the payload can authenticate is configurable, and once the token expires, requests are blocked even if they originate from the stager.
To execute our main payload in the process, we need memory with execute permissions. If we allocate a Read-Execute or Read-Write-Execute (RX/RWX) region, it becomes an unbacked memory region with suspicious permissions, and such regions are heavily monitored by defensive solutions. An unbacked memory region is one that has no corresponding file on disk and exists only in memory. Hence, we will execute our final payload using Module Stomping, also known as DLL Hollowing. This technique loads a legitimate DLL that the process doesn't require and overwrites the DLL's code with our payload.
There are two components to this project: the first generates the stager shellcode, and the second is a server that validates requests before sending the second stage.
The code for this blog is available at:
Building Stager Shellcode
Shellcode is position-independent machine code: a sequence of instructions that the processor can execute directly, without relying on a loader. Shellcode can be written by hand in assembly, but that's a very tedious task and not viable for large projects. Instead, we will write our program in a high-level language such as C, compile it into an executable, and extract the machine code. These instructions are stored in the .text section of the executable. Since we will extract only the instructions, the compiled executable must not depend on other sections such as .data, .rdata, or .bss, which means the program cannot use global variables or string literals.
In Windows, the operating system's functionality is exposed through DLLs as Win32 APIs. For example, to display a message box, the MessageBoxA API exported by User32.dll is used. The DLLs and APIs an executable imports are listed in its import table. At load time, the Windows loader loads these DLLs and writes the resolved API addresses into the Import Address Table (IAT), and calls to these APIs go through the addresses stored there. Shellcode, on the other hand, has no import table: it must load the required DLLs itself and iterate over each DLL's exported functions to find the addresses of the APIs it needs.
The following functions are used to enumerate modules loaded by a process and to get the address of APIs exported by the modules. In essence, they are just custom implementations of the GetModuleHandle and GetProcAddress WinAPIs.
Using these functions, we can locate the KERNEL32.DLL module in memory and resolve the address of LoadLibraryA, which can then load additional DLLs into the process. As our program downloads the second stage over HTTP, we need to load WinInet.dll and resolve the addresses of the InternetOpenA, InternetConnectA, HttpOpenRequestA, HttpSendRequestA, HttpQueryInfoA, InternetQueryDataAvailable, InternetReadFile, and InternetCloseHandle Win32 APIs. To map the contents of our sacrificial module into memory, we will use the NtCreateSection and NtMapViewOfSection NTAPIs exported by ntdll.dll. For allocating and freeing heap memory, we will use the calloc and free functions from msvcrt.dll. From KERNEL32.DLL, we will use the CreateFileW, VirtualProtect, CreateThread, and WaitForSingleObject WinAPIs. Both GetModule and GetProcAddrHash compare hashes of module and function names rather than the names themselves, so no plaintext API strings need to be embedded in the shellcode. The djb2 hash is calculated for these names using the following code.
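For reference, here is a minimal Python sketch of the classic djb2 variant (seed 5381, multiply by 33, add each byte, truncated to 32 bits). The author's implementation may differ in seed, in case normalisation of module names, or in hash width; whichever variant is used, the stager and the hash generator must match exactly for the comparisons to succeed.

```python
def djb2(data: bytes) -> int:
    """Classic djb2: start at 5381, then hash = hash * 33 + byte, kept to 32 bits."""
    h = 5381
    for byte in data:
        # (h << 5) + h is h * 33; masking emulates a 32-bit unsigned integer.
        h = ((h << 5) + h + byte) & 0xFFFFFFFF
    return h


if __name__ == "__main__":
    # Print the hashes a generator script might embed as constants in the C source.
    for name in (b"KERNEL32.DLL", b"LoadLibraryA", b"InternetOpenA"):
        print(f"{name.decode():<16} 0x{djb2(name):08X}")
```

Only the resulting integers end up in the shellcode, which is what lets GetModule and GetProcAddrHash work without any string literals.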
In this part of our program, the second-stage payload is downloaded by sending a GET request to our web server. The authentication token is sent in the WWW-Authenticate header. Note that in standard HTTP authentication, WWW-Authenticate is a response header that the server uses to issue a challenge, while the client sends its credentials (for example, for Basic authentication) in the Authorization header; here, the header is simply repurposed to carry the token. The server validates the URI and the auth token before sending the second-stage payload. The functions required for these operations are resolved dynamically using the GetModule and GetProcAddrHash functions.
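The server-side check can be sketched as a small limited-use token store. This is a hypothetical illustration rather than the project's code: the class name TokenStore, the `/update` path, and the use of `secrets.token_hex` for a 128-bit token are all assumptions. `max_uses` plays the role of the token-validity setting.

```python
import hmac
import secrets
from typing import Optional


class TokenStore:
    """Hypothetical limited-use token: one random value, valid for max_uses requests."""

    def __init__(self, path: str, max_uses: int) -> None:
        if max_uses < 1:
            raise ValueError("max_uses must be at least 1")
        self.path = path                    # URI the stager is expected to request
        self.token = secrets.token_hex(16)  # fresh 128-bit token on every run
        self.remaining = max_uses

    def validate(self, path: str, presented: Optional[str]) -> bool:
        """Consume one use and return True only if the URI and token both match."""
        if path != self.path or presented is None or self.remaining <= 0:
            return False
        # Constant-time comparison so the token cannot be guessed via timing.
        if not hmac.compare_digest(presented.encode(), self.token.encode()):
            return False
        self.remaining -= 1
        return True


store = TokenStore("/update", max_uses=2)
print(store.validate("/update", store.token))  # True, one use left
print(store.validate("/update", "bogus"))      # False, a wrong token does not consume a use
print(store.validate("/update", store.token))  # True, no uses left
print(store.validate("/update", store.token))  # False, token expired
```

Counting a use only after both checks pass means that probing requests with a wrong path or token do not burn the stager's remaining uses; whether the original script behaves the same way is not stated.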
The token used for authentication is generated dynamically by the Python server at runtime and is unique for each execution. In the downloaded response, the first four bytes contain the total size of the shellcode; the stager uses this value to allocate a buffer on the heap and writes the payload into it.
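The size-prefixed response can be illustrated with Python's struct module. The 4-byte unsigned little-endian layout is an assumption, since the byte order is not stated; little-endian would let an x64 stager read the prefix directly as a DWORD.

```python
import struct

HEADER = struct.Struct("<I")  # assumption: 4-byte unsigned little-endian length


def pack_stage(payload: bytes) -> bytes:
    """Server side: prefix the payload with its length."""
    return HEADER.pack(len(payload)) + payload


def unpack_stage(data: bytes) -> bytes:
    """Client side: read the length, then return exactly that many bytes."""
    if len(data) < HEADER.size:
        raise ValueError("response is shorter than the length prefix")
    (size,) = HEADER.unpack_from(data, 0)
    body = data[HEADER.size:HEADER.size + size]
    if len(body) != size:
        raise ValueError(f"expected {size} payload bytes, received {len(body)}")
    return body


framed = pack_stage(b"second stage")
print(framed[:4].hex())      # 0c000000
print(unpack_stage(framed))  # b'second stage'
```

Reading the size first lets the stager allocate the whole buffer with a single calloc call before reading the rest of the response into it.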
After downloading the second-stage payload, we load a sacrificial DLL and replace its original code with our shellcode. By stomping a DLL, we avoid creating any unbacked or additional RX/RWX regions and instead reuse the RX region of the DLL's .text section. The DLL is mapped into memory using the CreateFileW, NtCreateSection, and NtMapViewOfSection APIs.
Windows uses Control Flow Guard (CFG) as an exploit mitigation. When CFG is enabled, indirect calls are checked against a set of valid call targets, and the program is terminated if the target is not valid. By manually mapping our DLL into memory, its .text section is not subjected to CFG checks. If we instead used LoadLibrary to load the DLL, the Windows loader would enable CFG checks for the DLL's code, and these would need to be bypassed to execute our payload. We go with manual mapping, as it is the simpler approach. After mapping the DLL into memory, we identify the address of the DLL's entry point, which is where our second-stage payload will be written.
The .text section of our sacrificial DLL has Read and Execute permissions. After obtaining the DLL's entry point, the memory protection is changed to allow write access, and our payload is written at that location. To evade signature-based network detections, the payload is XOR-encrypted in transit, so the decryption routine is called after the shellcode has been written to its final location. The original RX permissions are then restored on this region, and the shellcode is executed in a new thread.
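The XOR scheme is not detailed, so the following sketch assumes a simple repeating-key XOR with a hypothetical key. Because XOR is its own inverse, the server's encoding step and the stager's decryption routine can apply the identical operation.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR; the same call both encodes and decodes."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


key = bytes.fromhex("5a17c3")  # hypothetical key for illustration
encoded = xor_bytes(b"second stage", key)
assert xor_bytes(encoded, key) == b"second stage"
```

XOR only hides static byte patterns from network signatures; anyone who recovers the key from the stager can reverse it, so it is best thought of as obfuscation rather than encryption.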
Now that our stager's functionality has been implemented, it needs to be compiled so that the final shellcode can be extracted. On x64 Windows, the calling convention requires the stack to be 16-byte aligned at function calls, and NTAPIs can crash if it is not (for example, functions that use aligned SSE instructions such as movaps fault on a misaligned stack). Since we don't know the state of the stack when our shellcode starts executing, it is our responsibility to align it before calling our DownloadExec function. Hence, we use a small assembly stub that aligns the stack before calling the function. This assembly code has been borrowed from Chetan Nayak's blog (linked in the references section).
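A common way to realign the stack in such a stub is to save RSP in a non-volatile register, clear the low four bits of RSP (rounding it down to a multiple of 16), call the target function, and then restore the saved value before returning. The borrowed stub's exact instructions may differ; the Python below only illustrates the masking arithmetic.

```python
MASK_16 = 0xFFFFFFFFFFFFFFF0  # 64-bit mask; same bit pattern as the immediate -16 in `and rsp, -16`


def align_down_16(rsp: int) -> int:
    """Round a 64-bit stack pointer down to the nearest 16-byte boundary."""
    return rsp & MASK_16


for rsp in (0x7FFE1238, 0x7FFE1230):
    print(hex(rsp), "->", hex(align_down_16(rsp)))
```

Rounding down rather than up is what keeps this safe: the stack grows toward lower addresses, so the adjustment only moves RSP into unused stack space.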
Now that the first part is done, we need to implement a backend that automates the process of generating the stager shellcode and starting a web server. A Python script is written to take care of this.
Usage
The script takes the following arguments:
-f: Path to the second-stage shellcode.
-H: IP address or domain the stager connects back to.
-s: Port number. By default, 80 is used.
-d: DLL to use for hollowing. By default, C:\Windows\System32\chakra.dll is used.
-t: Number of times the authentication token is valid. For example, if this parameter is set to 2, the token expires after the payload has been detonated twice.
-x: Output format of the stager shellcode.
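These options map naturally onto argparse. In this sketch, the destination names, the choice to make -f and -H required, and the absence of defaults for -t and -x are assumptions, since only the -s and -d defaults are documented.

```python
import argparse

DEFAULT_DLL = r"C:\Windows\System32\chakra.dll"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Generate stager shellcode and serve the second stage (sketch)."
    )
    parser.add_argument("-f", dest="file", required=True, help="path to the second-stage shellcode")
    parser.add_argument("-H", dest="host", required=True, help="IP address or domain the stager connects back to")
    parser.add_argument("-s", dest="port", type=int, default=80, help="port number (default: 80)")
    parser.add_argument("-d", dest="dll", default=DEFAULT_DLL, help="DLL to use for hollowing")
    # Assumption: no documented defaults for the token count or output format.
    parser.add_argument("-t", dest="tokens", type=int, default=None, help="number of times the token is valid")
    parser.add_argument("-x", dest="fmt", default=None, help="output format of the stager shellcode")
    return parser


args = build_parser().parse_args(["-f", "payload.bin", "-H", "192.168.1.10", "-t", "2"])
print(args.port, args.dll, args.tokens)
```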
Demo
We will use a simple shellcode injector to test our stager against Microsoft Defender. The stager will download and execute a Havoc payload.