AVCLASS++ is a labeling tool for creating a malware dataset. Addressing malware threats requires constant efforts to create and maintain a dataset.
Especially, labeling malware samples is a vital part of shepherding a dataset. AVCLASS, a tool developed for this purpose, takes as input VirusTotal reports and returns labels that aggregates scan results of multiple anti-viruses. And now, AVCLASS++ is shipped with the brand-new capacities!
In a nutshell, AVCLASS++ enables the following operation:
Features
AVCLASS++ is developed for freeing you from the task of worrying about what families malware samples are. The salient features of AVCLASS++ are as follows:
Also Read – Pylane : An Python VM Injector With Debug Tools, Based On GDB
Step Forward
The following limitation was pointed out in the original AVCLASS paper:
The main limitation of AVClass is that its output depends on the input AV labels. It tries to compensate for the noise on those labels, but cannot identify the family of a sample if AV engines do not provide non-generic family names to that sample. In particular, it cannot label samples if at least 2 AV engines do not agree on a non-generic family name. Results on 8 million samples showed that AVClass could label 81% of the samples. In other words, it could not label 19% of the samples because their labels contained only generic tokens.
We have organized such pitfalls into two factors.
Because of them, we are forced to make a lot of manual, tedious intervention in malware labeling; otherwise, we need to drop samples with inconsistent labels from the dataset; since there was no alternative.
For the reason, AVCLASS++ is designed to address these drawbacks by arming with the following:
Besides, unlike AVCLASS, AVCLASS++ is Python 3 compatible!
How To Use?
Installation
git clone git@github.com:malrev/avclassplusplus.git
./setup.sh
Labeling
The labeler takes as input a JSON file with the AV labels of malware samples (-vt
or -lb
switches), a file with generic tokens (-gen
switch), and a file with aliases (-alias
switch). It outputs the most likely family name for each sample. If you do not provide alias or generic tokens files, the default ones in the data folder are used.
python avclass_labeler.py -lb data/malheurReference_lb.json -v >malheurReference.labels
The above command labels the samples whose AV labels are in the data/malheurReference_lb.json
file. It prints the results to stdout, which we redirect to the malheurReference.labels
file. The output looks like this:
aca2d12934935b070df8f50e06a20539 adrotator 67d15459e1f85898851148511c86d88d adultbrowser
which means sample aca2d12934935b070df8f50e06a20539 is most likely from the adrotator family and 67d15459e1f85898851148511c86d88d from the adultbrowser family.
The verbose (-v
) switch makes it output an extra malheurReference_lb.verbose
file with all families extracted for each sample ranked by the number of AV engines that use that family. The file looks like this:
aca2d12934935b070df8f50e06a20539 [(‘adrotator’, 8), (‘zlob’, 2)] ee90a64fcfaa54a314a7b5bfe9b57357 [(‘swizzor’, 19)] f465a2c1b852373c72a1ccd161fbe94c SINGLETON:f465a2c1b852373c72a1ccd161fbe94c
which means that for sample aca2d12934935b070df8f50e06a20539 there are 8 AV engines assigning adrotator as the family and another 2 assigning zlob. Thus, adrotator is the most likely family. On the other hand, for ee90a64fcfaa54a314a7b5bfe9b57357 there are 19 AV engines assigning swizzor as family, and no other family was found. The last line means that for sample f465a2c1b852373c72a1ccd161fbe94c no family name was found in the AV labels. Thus, the sample is placed by himself in a singleton cluster with the name of the cluster being the sample’s hash.
Note that the sum of the number of AV engines may not equal the number of AV engines with a label for that sample in the input file because the labels of some AV engines may only include generic tokens that are removed by AVCLASS++. In such a case, the propagater described later comes to rescue.
Input JSON Format
AVCLASS++ supports two input JSON formats:
-vt
file), where each line in file should be the full JSON of a VirusTotal report as fetched through the VirusTotal API. -lb
file), where each line in file should be a JSON with (at least) these fields: {md5, sha1, sha256, scan_date, av_labels}
. There is an example of such input file in data/malheurReference_lb.json
This option works well if you want to use label candidates from a source other than VirusTotal or from a self-made engine. AVCLASS++ can handle multiple input files putting the results in the same output files (if you want results in separate files, process each input file separately).
You can provide the -vt
and -lb
input options multiple times.
python avclass_labeler.py -vt <file1>-vt <file2> > all.labels
python avclass_labeler.py -lb <file1>-lb <file2> > all.labels
There are also -vtdir
and -lbdir
options that can be used to provide an input directory where all files are VT (-vtdir
) or simplified (-lbdir
) JSON reports.
python avclass_labeler.py -vtdir <directory> > all.labels
You can also combine -vt
with -vtdir
and -lb
with -lbdir
, but you cannot combine input files of different format. Thus, this command works:
python avclass_labeler.py -vt <file>-vtdir <directory> > all.labels
But, this one throws an error:
python avclass_labeler.py -vt <file1>-lb <file2> >all.labels
At this point you have read the most important information on how to use AVCLASS++. The following sections describe optional steps.
Label Propagation
When a sample has just been uploaded to VirusTotal, the original AVCLASS often gives you a SINGLETON label because of the lack of AVs signatures. In such a case, we usually try to disassemble and execute the sample, compare the results to past ones, and then give it the appropriate label.
Therefore, We introduce a function that automates this task. AVCLASS++ retrieves and compares byte histogram, printable strings, file size, PE headers, sections, imports, exports, and so on from the given executable files. Then, it gives the label to SINGLETONs from similar samples. An affinity matrix is derived to compute the similarities here. For label propagation, literally the label propagation algorithm [3] is used.
To use this function, run the following command:
python avclass_propagator.py -labels <file1>-sampledir <directory> -results <file2>
The input file passed with -labels
must be created in advance by avclass_labeler.py
in advance. The directory passed with -sampledir
must contain samples with the hash values contained in the labels file. The option -results
is optional. By default, the propagator creates _pr.labels
file based on a .labels
file passed as an argument. AVCLASS++ overwrites only SINGLETON labels with predicted labels by default. You can overwrite all original labels with predicted labels by enabling the -force
option. In addition, you can automatically optimize hyperparameter values by enabling -opt
.
python avclass_propagator.py -labels input.labels -sampledir samples -results output.labels -opt
This feature is contrary to the original AVCLASS manner of “does not require executables”, but it is really helpful in practice.
DGA Detection
AVs such as BitDefender, AegisLab, Emsisoft, eScan, GData, Ad-Aware, MAX, K7Antivirus, K7GW, Cybereason, and Cyren will output pseudo-randomly generated labels in a similar vein as DGA of malware. You can see an example at VirusTotal: f315be41d9765d69ad60f0b4d29e4300. This leads the original AVCLASS would be confused.
Therefore, we present a function that removes the label that seems to be generated by DGA. To this end, we employ the following heuristics [5]:
The key insight of these scores is that the appropriate label contains a string that is meaningful and easy to pronounce. AVCLASS++ calculates a harmonic mean of these scores and determine if the label is generated by DGA based on a threshold. This function is enabled by default now, but you can configure it with:
python avclass_labeler.py -vtdir <directory> -dgadetect <dictionary><n><threshold> >all.labels
An example is below:
python avclass_labeler.py -vtdir -dgadetect data/top10000en.txt 4 2 > all.labels
Family Ranking
AVCLASS++ has a -fam
switch to output a file with a ranking of the families assigned to the input samples.
python avclass_labeler.py -lb data/malheurReference_lb.json -v -fam > malheurReference.labels
This will produce a file called malheurReference_lb.families
with two columns:
virut 441
allaple 301
podnuha 300
The file indicates that 441 samples were classified in the virut family, 301 as allaple, and 300 as podnuha.
This switch is very similar to using the following shell command:
cut -f 2 malheurReference.labels | sort | uniq -c | sort -nr
The main difference is that using the -fam
switch all SINGLETON samples, i.e., those for which no label was found, are grouped into a fake SINGLETONS family, while the shell command would leave each singleton as a separate family.
PUP Classification
AVCLASS++ also has a -pup
switch to classify a sample as Potentially Unwanted Program (PUP) or malware. This classification looks for PUP-related keywords (e.g., pup, pua, unwanted, adware) in the AV labels.
python avclass_labeler.py -lb data/malheurReference_lb.json -v -pup > malheurReference.labels
With the -pup
switch the output of the malheurReference.labels
file looks like this:
aca2d12934935b070df8f50e06a20539 adrotator 1 67d15459e1f85898851148511c86d88d adultbrowser 0
The digit at the end is a Boolean flag that indicates sample aca2d12934935b070df8f50e06a20539 is (likely) PUP, but sample 67d15459e1f85898851148511c86d88d is (likely) not. This enables us to focus on PUP research [2] or non-PUP research!
The PUP classification tends to be conservative, i.e., if it says the sample is PUP, it most likely is. But, if it says that it is not PUP, it could still be PUP if the AV labels do not contain PUP-related keywords. Note that it is possible that some samples from a family get the PUP flag while other samples from the same family do not because the PUP-related keywords may not appear in the labels of all samples from the same family. To address this issue, you can combine the -pup
switch with the -fam
switch. This combination will add into the families file the classification of the family as malware or PUP, based on a majority vote among the samples in a family.
python avclass_labeler.py -lb data/malheurReference_lb.json -v -pup -fam > malheurReference.labels
This will produce a file called malheurReference_lb.families with five columns:
# Family Total Malware PUP FamType
virut 441 441 0 malware
magiccasino 173 0 173 pup
ejik 168 124 44 malware
For virut, the numbers indicate all the 441 virut samples are classified as malware, and thus the last column states that virut is a malware family. For magiccasino, all 173 samples are labeled as PUP, thus the family is PUP. For ejik, out of the 168 samples, 124 are labeled as malware and 44 as PUP, so the family is classified as malware.
Ground Truth Evaluation
If you have ground truth for some malware samples, i.e., you know the true family for those samples, you can evaluate the accuracy of the labeling output by AVCLASS++ on those samples with respect to that ground truth. The evaluation metrics used are precision, recall, and F1 measure.
python avclass_labeler.py -lb data/malheurReference_lb.json -v -gt data/malheurReference_gt.tsv -eval > data/malheurReference.labels
The output includes these lines:
Calculating precision and recall
3131 out of 3131
Precision: 90.81 Recall: 93.95 F1-Measure: 92.35
The last line corresponds to the accuracy metrics obtained by comparing AVClass results with the provided ground truth.
Each line in the data/malheurReference_gt.tsv
file has two tab-separated columns:
0058780b175c3ce5e244f595951f611b8a24bee2 CASINO
This sample 0058780b175c3ce5e244f595951f611b8a24bee2 is known to be of the CASINO family. Each sample in the input file should also appear in the ground truth file. Note that the particular label assigned to each family does not matter. What matters is that all samples in the same family are assigned the same family name (i.e., the same string in the second column)
The ground truth can be obtained from publicly available malware datasets. The one in data/malheurReference_gt.tsv
comes from the Malheur dataset. There are other public datasets with ground truth such as Drebin and Malicia.
Preparation
Generic Token Detection
The labeling takes as input a file with generic tokens that should be ignored in the AV labels, e.g., trojan, virus, generic, linux. By default, the labeling uses the data/default.generics
generic tokens file. You can edit that file to add additional generic tokens you feel we are missing.
In the original AVCLASS paper [1] presents an automatic approach to identify generic tokens, which requires ground truth, i.e., it requires knowing the true family for each input sample. Not only that, but the ground truth should be large, i.e., contain at least one hundred thousand samples. In the evaluation, AVCLASS identified generic tokens using as ground truth the concatenation of all datasets for which we had ground truth. This requirement of a large ground truth dataset is why we expect most users will skip this step and simply use our provided default file.
If you want to test generic token detection you can do:
python avclass_generic_detect.py -lb data/malheurReference_lb.json -gt data/malheurReference_gt.tsv -tgen 10 > malheurReference.gen
Each line in the data/malheurReference_gt.tsv file has two tab-separated columns:
0058780b175c3ce5e244f595951f611b8a24bee2 CASINO
which indicates that sample 0058780b175c3ce5e244f595951f611b8a24bee2 is known to be of the CASINO family.
The -tgen 10
switch is a threshold for the minimum number of families where a token has to be observed to be considered generic. If the switch is ommitted, the default threshold of 8 is used.
The above command outputs two files: malheurReference.gen
and malheurReference_lb.gen
. Each of them has 2 columns: token and number of families where the token was observed. File malheurReference.gen
is the final output with the detected generic tokens for which the number of families is above the given threshold. The file malheurReference_lb.gen
has this information for all tokens. Thus, malheurReference.gen
is a subset of malheurReference_lb.gen
.
However, note that in the above command you are trying to identify generic tokens from a small dataset since Drebin only contains 3K labeled samples. Thus, malheurReference.gen
only contains 25 identified generic tokens. Using those 25 generic tokens will produce significantly worse results than using the generic tokens in data/default.generics
.
Alias Detection
Different vendors may assign different names (i.e., aliases) for the same family. For example, some vendors may use zeus and others zbot as aliases for the same malware family. The labeling takes as input a file with aliases that should be merged. By default, the labeling uses the data/default.aliases aliases file. You can edit that file to add additional aliases you feel we are missing.
In the original AVCLASS paper [1] describes an automatic approach to identify aliases. Note that the alias detection approach requires as input the AV labels for large set of samples, e.g., several million samples. In contrast with the generic token detection, the input samples for alias detection do not need to be labeled, i.e., no need to know their family. In the evaluation, AVCLASS identified aliases using as input the largest of unlabeled datasets, which contained nearly 8M samples. This requirement of a large input dataset is why we expect most users will skip this step and simply use our provided default file.
If you want to test alias detection you can do:
python avclass_alias_detect.py -lb data/malheurReference_lb.json -nalias 100 -talias 0.98 > malheurReference.aliases
The -nalias
threshold provides the minimum number of samples two tokens need to be observed in to be considered aliases. If the switch is not provided the default is 20.
The -talias
threshold provides the minimum fraction of times that the samples appear together. If the switch is not provided the default is 0.94 (94%).
The above command outputs two files: malheurReference.aliases
and malheurReference_lb.alias
. Each of them has 6 columns:
These were observed over the number of input samples where t1 was observed.
File malheurReference.aliases
is the final output with the detected aliases that satisfy the -nalias and -talias thresholds. The file malheurReference_lb.alias
has this information for all tokens. Thus, malheurReference.aliases
is a subset of malheurReference_lb.alias
.
However, note that in the above command you are trying to identify aliases from a small dataset since Drebin only contains 3K samples. Thus, malheurReference.aliases
only contains 6 identified aliases. Using those 6 aliases will produce significantly worse results than using the aliases in data/default.aliases
. As mentioned, to improve the identified aliases you should provide as input several million samples.
Acknowledgment
We deeply respect original authors of AVCLASS. Reference with love:
The techniques introduced in AVCLASS++ are based on the following papers:
garak checks if an LLM can be made to fail in a way we don't…
Vermilion is a simple and lightweight CLI tool designed for rapid collection, and optional exfiltration…
ADCFFS is a PowerShell script that can be used to exploit the AD CS container…
Tartufo will, by default, scan the entire history of a git repository for any text…
Loco is strongly inspired by Rails. If you know Rails and Rust, you'll feel at…
A data hoarder’s dream come true: bundle any web page into a single HTML file.…