Classifying Malware Packers Using Machine Learning

The recent rise in popularity of AI reignited my interest in machine learning. It inspired me to dig into how it can be applied to malware analysis and, more specifically, how to better detect malware packers, since almost every malware sample nowadays uses one.

My research and experiments eventually led me to make a web app, which I call the VGL4NT Malware Packer Classifier (https://packers.vgl4nt.com/).

classifying-malware-packers-using-machine-learning-01

(For those curious, V.G.L.4.N.T. is a play on "Vigilant" and stands for "Visual Guided Learning 4 Neutralizing Threats".)

Current State of Packer Detection

Traditional packer detection tools like DiE (Detect It Easy) and YARA rules depend on known signatures and patterns to identify packers. These tools scan a file for specific indicators, like unique sequences of bytes or strings. While effective in many cases, they have drawbacks: detection can fail when a packer is modified or when the expected bytes or strings are altered.

By using machine learning, the VGL4NT Malware Packer Classifier can take minute differences into account and still detect the packer used.

How it works

  • The uploaded executable file's bytes are converted into grayscale values, creating an image.
  • The grayscale image is then fed into an image machine-learning model I trained from scratch.
  • It returns a list of percentages indicating how similar the file is to each known packer.

classifying-malware-packers-using-machine-learning-01

The approach above is nothing new and is based on this academic paper. The difference is that the paper has a tool that classifies malware families, while mine classifies the packers used.

Most of the magic happens in the model itself. I've trained it on several packed malware samples and measured its accuracy over multiple iterations. The latest version of this model has an accuracy of 94%, which is calculated by comparing the model's predictions to the actual packer labels in a dataset that the model hasn't seen before (the test dataset).

Current limitations

The app works for the most part, but it has its limitations. For example, users can only upload executable files (EXE, DLL, ELF, BIN, etc.) with a maximum size of 10MB.

Furthermore, due to the cost of GPU resources during training, only the following packer tools can be classified:

  • aspack
  • alienyze
  • amber
  • mew
  • mpress
  • nspack
  • pecompact
  • petite
  • themida
  • upx
  • others (Everything else)

The list of packer tools above was chosen based on available real-world malware samples that I have encountered or studied.

Future Plans and Updates

If this project gains enough interest, then I plan to add more improvements, such as:

  • Increasing GPU resources so the model can classify more categories.
  • Improving the training method by handpicking the most important parts of the executable and feeding only those to the model.
  • Offering an API for integration with existing tools and processes.

Of course, this project would improve a lot with the community's help. I encourage users to provide feedback, report issues, or request new features. Feel free to send me your thoughts by email at karlo@accidentalrebel.com, or on Twitter at @accidentalrebel.

Adding Automation to Blue-Jupyter Malware Notebook

I came across the Blue-Jupyter project on GitHub while researching Jupyter notebooks. This short demo video got me excited, so I cloned the project and added some improvements that automate many things when I am looking for malware to investigate.

What are Jupyter Notebooks?

For readers who may be unfamiliar, Jupyter Notebooks are a web-based tool that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. They are a popular tool among data scientists and researchers but have also been adapted for use in other fields, such as cybersecurity.

My Additions to the Blue-Jupyter

Many of the changes I've made are focused on automating the process of quickly looking for interesting new samples to investigate.

One addition to the notebook is the automated downloading of samples from Malware Bazaar. This can download a maximum of 100 samples in one run. Additional information is listed to highlight some interesting points about each sample, like the malware signature. It can also skip samples that have already been downloaded to save bandwidth.

adding-automation-to-blue-jupyter-malware-notebook-01

The second significant addition is the automated generation of Capa results for each downloaded sample. This makes it easy to see which malware has a particular capability so I can quickly see which ones are interesting enough to investigate further.

adding-automation-to-blue-jupyter-malware-notebook-02

I also added minor improvements like error handling, additional logging for troubleshooting, and some cleanup code just in case I want to start fresh.

Check it out

If you are interested in checking it out, you can view my fork of the repository here. I did not open a pull request against the original repository because I've changed a lot of things that the original owner might not want. Of course, I encourage everyone to fork what I made and make it their own. That's the beauty of Jupyter notebooks, anyway.

Malware sandbox evasion in x64 assembly by checking ram size - Part 2

In the previous post, I explored a sandbox evasion technique that uses GetPhysicallyInstalledSystemMemory to check the size of the RAM of the machine. The idea behind this technique (MBC Technique ID: B0009.014) is that a machine reporting less than 4GB of RAM is probably a sandbox (sandboxes are often given little memory to reduce costs). This information can then be combined with other sandbox evasion techniques to confirm.

For part 2 of this series, I'll be talking about an alternative Windows API function called GlobalMemoryStatusEx. This function is as straightforward as the first one but requires passing a pointer to a C struct. This is significant because I'll be converting working C code to x64 assembly so we can fully understand how it works under the hood.

Using GlobalMemoryStatusEx

Here is an example of an implementation of GlobalMemoryStatusEx in C that we'll later be converting to x64 assembly.

#include <stdio.h>
#include <windows.h>

int main(void)
{
    MEMORYSTATUSEX statex;
    statex.dwLength = sizeof (statex);
    GlobalMemoryStatusEx (&statex);
    printf ("Memory size: %*I64d", 7, statex.ullTotalPhys/1024);
}

You will see that GlobalMemoryStatusEx expects a pointer to a MEMORYSTATUSEX object as its first parameter. We need to allocate statex on the stack. Before we can do that, however, we need to know how much space to reserve.

Getting the size of the struct

Finding out the size of a structure in C is easy with the sizeof operator. However, we can't use this in assembly, so we have to determine it manually by adding up the sizes of each member of the struct (and accounting for any alignment padding between them).

Consider the example struct definition below:

struct TestStruct {
    char member1;
    int member2;
    float member3;
};

Looking at this table of the fundamental types and their sizes, we can determine the size of each member:

  • member1 is of type char which has a size of 1 byte
  • member2 is of type int which is 4 bytes
  • member3 is of type float which also is 4 bytes

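Adding all of these sizes gives 9 bytes of member data. Note that sizeof(struct TestStruct) is usually larger (12 bytes on common compilers) because 3 bytes of padding are inserted after member1 so that member2 is aligned to 4 bytes. The manual sum only matches the real size when the layout needs no padding.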

Now to apply the same computation to our MEMORYSTATUSEX struct. Here is the definition of the struct according to MSDN:

typedef struct _MEMORYSTATUSEX {
  DWORD     dwLength;
  DWORD     dwMemoryLoad;
  DWORDLONG ullTotalPhys;
  DWORDLONG ullAvailPhys;
  DWORDLONG ullTotalPageFile;
  DWORDLONG ullAvailPageFile;
  DWORDLONG ullTotalVirtual;
  DWORDLONG ullAvailVirtual;
  DWORDLONG ullAvailExtendedVirtual;
} MEMORYSTATUSEX, *LPMEMORYSTATUSEX;

The types that we have are DWORD and DWORDLONG (which are just Windows' own names for unsigned long and unsigned __int64):

  • DWORD or unsigned long has a size of 4 bytes
  • DWORDLONG or unsigned __int64 has a size of 8 bytes

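So adding the two DWORDs and seven DWORDLONGs (2 × 4 + 7 × 8) results in MEMORYSTATUSEX having a total size of 64 bytes. No padding is needed here because the two DWORDs together fill the first 8 bytes, which leaves every DWORDLONG aligned to 8 bytes.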

Initializing statex

Now that we know the total size, we can reserve this amount of space on the stack.

    sub rsp, 0x40   ; Reserve space for the struct on the stack
                    ; MEMORYSTATUSEX is 64 bytes (0x40) in size

Before we can call GlobalMemoryStatusEx, however, MSDN states that the dwLength member must be set first. This can be done by writing the struct's size, 64 (0x40), to the corresponding memory location on the stack.

    mov dword [rsp], 0x40   ; Assign 0x40 to dwLength (a 4-byte DWORD)
    lea rcx, [rsp]          ; Argument 1; pointer to the struct

With this we can finally call our function:

    sub rsp, 32     ; Reserve shadow space
    call    GlobalMemoryStatusEx
    add rsp, 32     ; Release shadow space

Using the result

If successful, the function GlobalMemoryStatusEx populates the memory location we passed to it, as shown below:

malware-sandbox-evasion-in-x64-assembly-by-checking-ram-size-part-2-01

The struct member ullTotalPhys now has the memory size that we need. And because our stack pointer still points to the beginning of the struct, we can get this value by adding an offset to rsp.

    mov rax, [rsp+0x8]  ; Retrieve value of ullTotalPhys from stack

We offset by 0x8 because the first 8 bytes are occupied by dwLength and dwMemoryLoad (4 bytes each).

Displaying the result

As seen above, the value returned by GlobalMemoryStatusEx is in bytes. To be consistent with our example from the previous post, we need to convert this value to kilobytes by dividing it by 1024.

    mov rcx, 1024
    xor rdx, rdx    ; Clear rdx; div divides rdx:rax, so the upper half must be zero
    div rcx         ; Divide by 1024 to convert to KB; quotient goes to rax

The result of the above operation is saved to rax which we can then move to rdx so we can pass it as the second argument to printf.

    mov rdx, rax    ; Argument 2; Result of ullTotalPhys / 1024
    lea rcx, [msg_memory_size]  ; Argument 1; Format string
    sub rsp, 32     ; Reserve shadow space
    call    printf
    add rsp, 32     ; Release shadow space

With this, we can now finally display the result on the console:

malware-sandbox-evasion-in-x64-assembly-by-checking-ram-size-part-2-02

Here is the full source code for reference:

    bits 64
    default rel

segment .data
    msg_memory_size db  "Memory size: %lld", 0xd, 0xa, 0

segment .text
    global main
    extern ExitProcess
    extern GlobalMemoryStatusEx
    extern printf

main:
    push    rbp
    mov     rbp, rsp

    sub rsp, 0x40   ; Reserve space for the struct on the stack
                    ; MEMORYSTATUSEX is 64 bytes (0x40) in size

    mov dword [rsp], 0x40   ; Assign 0x40 to dwLength (a 4-byte DWORD)
    lea rcx, [rsp]          ; Argument 1; pointer to the struct

    sub rsp, 32     ; Reserve shadow space
    call    GlobalMemoryStatusEx
    add rsp, 32     ; Release shadow space

    mov rax, [rsp+0x8]  ; Retrieve value of ullTotalPhys from stack
    mov rcx, 1024
    xor     rdx, rdx    ; Clear rdx; div divides rdx:rax, so the upper half must be zero
    div rcx     ; Divide by 1024 to convert to KB; quotient goes to rax

    mov rdx, rax    ; Argument 2; Result of ullTotalPhys / 1024
    lea rcx, [msg_memory_size]  ; Argument 1; Format string
    sub rsp, 32     ; Reserve shadow space
    call    printf
    add rsp, 32     ; Release shadow space

    add rsp, 0x40   ; Release space of struct from stack

    xor     rcx, rcx    ; Argument 1; exit code 0 (first argument goes in rcx)
    call    ExitProcess

Conclusion

Over the past two blog posts, we've learned how to use GlobalMemoryStatusEx and GetPhysicallyInstalledSystemMemory to determine the size of the RAM of a machine. We've also learned how to reserve space on the stack for a struct and pass a pointer to it as a function argument in x64 assembly.

In future posts, I plan to continue exploring malware behavior and techniques while also teaching x64 assembly, so that we can improve at both writing and reverse engineering malware.

Until then, you can view the C and assembly code along with the build scripts for this evasion technique in this repository here.

Feel free to reach out to me on Twitter or LinkedIn for any questions or comments.