Saturday, October 15, 2022

Looking at process relationships from malware sandbox execution data


This blog post discusses looking at process relationships, specifically from malware sandbox execution data. One of the essential functions of malware sandbox is to gather and display process execution, for example if winword launches powershell, you'd want to know that.

One issue that I run into while doing research is that many of the public/free malware sandboxes don't allow me to search based on process relationships. For example, if I have a sample on an endpoint that executed whoami, nslookup, systeminfo, i would like to be able to search sandbox reports to see which malware families or samples do that.

The other thing I'm interested in as a researcher is trends a long for initial access execution, for specific malware families or in general. One of the twitter accounts I follow is and they post information about how malware such as qakbot is doing command/process execution on a system. 

I find the information interesting and obviously, the threat actors have made changes over time. Maybe the threat actors are using new LOLBINs more than before.

The final thing about process relationship data that can be useful is just looking for new things or rare executions. If you're collecting the data, you can do searches to look for rare executions.

All this research should be helpful with detection engineering too or with emulation, if you're trying to match a specific threat.

POC Implementation:

As a proof-of-concept, I decided to implement a searchable database that lets me collect data from malware sandbox report and lets me search for parent-child process relationships. 

I acquired my data from Hybrid-Analysis Public Feed, which gives you JSON file with around 250 recent malware analysis results. I also got data from Zero2Auto CAPE sandbox ( Thanks for letting me use the data!)

I initially looked at graph databases but asking graph database questions/doing queries seemed annoying to me so I didn't look into them too much.

The second thing I tried was to join data from process execution in Python manually, which was a horrible idea. The code turned out horrible and dataset wasn't fun to work with. (

CAPE and Hybrid-Analysis both record process execution data differently but one thing they have in common is a process list json object. Each process object has process metadata and parent process id and obviously the process id. 

I decided to use duckdb to analyze the data. (Usually I'd use sqlite but wanted to try out duckdb and it worked fine)

I created a table with:

  • report id - specific execution task/detonation in the sandbox
  • process id
  • parent process id
  • process name
  • process path
  • process command line

Then I loaded the results from CAPE or Hybrid-Analysis to the table. I'm loading the same type of data but parsing their json reports is obviously different.

Finally, I created a view with join, where I ensure that report id is the same and parent process id and process id's match.

The resulting view contains:

  • Report ID
  • Parent Process ID
  • Parent Parent Process ID
  • Parent Name
  • Parent Path
  • Parent Command Line
  • Process ID
  • Process Name
  • Process Path
  • Process Command Line

Gathering data from the sandbox reports and putting it in the database allows me to ask questions like these:

  • what process launched ping? select parent_name, proc_commandline from joined_proc_list where proc_commandline ilike '%ping.exe%';
  • what process launched powershell with command line to add Defender exclusion? select parent_name, proc_commandline from joined_proc_list where proc_commandline ilike '%add-mppreference%';
  • what processes launch wscript? select parent_name, count(*) as count from joined_proc_list where proc_commandline ilike '%wscript%' group by parent_name
  • what does cmd.exe launch from the appdata folder? select parent_name, proc_name from joined_proc_list where parent_name ilike '%cmd.exe%' AND proc_path ilike '%appdata%';

The results look kinda like this:

If you have large enough dataset, you can extract more info like malware or campaign name and etc and keep track of the trends.

Other solutions:

If you already are doing malware execution in your sandbox, you can check if you are able to search based on process relationships. 

You could also have a backend database that you can query, for example MongoDB or Elasticsearch, although I personally don't know about join capabilities of those databases.

Alternatively, if your sandbox supports either pushing data out to splunk or elasticsearch or any other place, you could try to work with that data. You can also maybe intercept that data and send it to a webhook or lambda for additional processing.

If you have a system that supports pulling data, maybe through an API, that's also a solution. Maybe have a script that pulls reports, parses data, and processes it.

You can store processed data in whatever database you feel comfortable utilizing. I would personally use Clickhouse or Postgresql if I was doing this. 



Also check out Grapl -