Crawlector (the name Crawlector is a combination of Crawler & Detector) is a threat-hunting framework designed for scanning websites for malicious objects.
Note-1: The framework was first presented at the No Hat conference in Bergamo, Italy on October 22nd, 2022 (Slides, YouTube Recording). Also, it was presented for the second time at the AVAR conference, in Singapore, on December 2nd, 2022.
Note-2: The accompanying tool EKFiddle2Yara (a tool that converts EKFiddle rules into Yara rules), mentioned in the talk, was also released at both conferences.
Note-3: Version 2.0 (Photoid Build:180923), a milestone release, was released on September 18, 2023.
Note-4: Version 2.1 (Universe-647 Build:031023) was released on October 03, 2023. A major addition is the Slack Alert Notification feature.
Note-5: Version 2.2 (Hallstatt Build:051123) was released on November 05, 2023. A major addition is the Slack Remote Control feature.
The URLHaus feature checks every scanned page for malicious URLs. The framework can either query the list of malicious URLs from the URLHaus server (configuration: url_list_web) or read it from a file on disk (configuration: url_list_file); if the latter is specified, it takes precedence over the former.
It works by searching the content of every page for all URL entries in url_list_web or url_list_file, checking for all occurrences.
Additionally, upon a match, and if the configuration option check_url_api is set to true, Crawlector will send a POST request to the API URL set in the url_api configuration option, which returns a JSON object with extra information about a matching URL.
Such information includes urlh_status (ex., online, offline, unknown), urlh_threat (ex., malware_download), urlh_tags (ex., elf, Mozi), and urlh_reference (ex., https://urlhaus.abuse.ch/url/1116455/).
This information will be included in the log file cl_mlog_<current_date><current_time><(pm|am)>.csv (check below), only if check_url_api is set to true.
Otherwise, the log file will include the columns urlh_url (list of matching malicious URLs) and urlh_hit (number of occurrences for every matching malicious URL), conditional on whether check_url is set to true.
The URLHaus feature can be disabled in its entirety by setting the configuration option check_url to false.
It is important to note that this feature could slow scanning, considering the huge number of malicious URLs (~ 130 million entries at the time of this writing) that need to be checked, and the time it takes to get extra information from the URLHaus server (if the option check_url_api is set to true).
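The matching step described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not Crawlector's actual implementation: the function names `pick_url_source` and `count_url_hits` are hypothetical, and plain substring counting is assumed as the matching strategy. The returned mapping mirrors the urlh_url and urlh_hit columns.

```python
def pick_url_source(url_list_web: str, url_list_file: str) -> tuple:
    """Choose where the malicious URL list comes from.

    Mirrors the documented precedence: url_list_file wins over
    url_list_web when both are set. (Hypothetical helper.)
    """
    if url_list_file:
        return ("file", url_list_file)
    return ("web", url_list_web)


def count_url_hits(page_content: str, malicious_urls: list) -> dict:
    """Return {matching_url: number_of_occurrences} for one page.

    Assumes plain, non-overlapping substring matching.
    """
    hits = {}
    for url in malicious_urls:
        occurrences = page_content.count(url)
        if occurrences:
            hits[url] = occurrences
    return hits


page = '<a href="http://bad.example/x">a</a> <img src="http://bad.example/x">'
urls = ["http://bad.example/x", "http://other.example/"]
hits = count_url_hits(page, urls)
```

A linear scan like this also shows why a very large URL list slows scanning: every entry is checked against every page.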
You must familiarize yourself with the configuration file cl_config.ini before running any session. All of the sections and parameters are documented in the configuration file itself.
The Yara offline scanning feature is a standalone option, meaning, if enabled, Crawlector will execute this feature only irrespective of other enabled features.
The same is true for the feature that crawls for domains'/sites' digital certificates. Either way, it is recommended that you disable all unused features in the configuration file.
For either log option (log_to_file or log_to_cons), if a Yara rule references only a module’s attributes (ex., PE, ELF, Hash, etc.), then Crawlector will display only the rule’s name upon a match, excluding offset and length data.
Note: for any option that takes a path, always provide the absolute path.
To visit/scan a website, the list of URLs must be stored in text files, in the directory “cl_sites”.
Crawlector accepts three types of URLs:
[a-zA-Z0-9_-]{1,128} = <url>
<id>[depth:<0|1>-><\d+>,total:<\d+>,sleep:<\d+>] = <url>
For example,
mfmokbel[depth:1->3,total:10,sleep:0] = https://www.mfmokbel.com
which is equivalent to: mfmokbel[d:1->3,t:10,s:0] = https://www.mfmokbel.com
where, <id> := [a-zA-Z0-9_-]{1,128}
depth, total and sleep, can also be replaced with their shortened versions d, t and s, respectively.
40 (10 + (10*3)) URLs.
Note 1: A type 3 URL can be turned into a type 1 URL by setting the configuration parameter live_crawler to false, in the configuration file, in the spider section.
Note 2: Empty lines and lines that start with “;”, “#” or “//” are ignored.
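The line formats above can be parsed with a few regular expressions. The following is a rough Python sketch, not Crawlector's own parser. It assumes that a line without an id is a bare URL (type 1, as suggested by the id postfix descriptions), and the field names `depth_flag` and `depth_value` are hypothetical labels for the two numbers in `depth:<0|1>-><\d+>`.

```python
import re
from typing import Optional

ID = r"[a-zA-Z0-9_-]{1,128}"

# <id>[depth:<0|1>-><\d+>,total:<\d+>,sleep:<\d+>] = <url>, long or short keys
TYPE3 = re.compile(
    rf"^(?P<id>{ID})\["
    r"(?:depth|d):(?P<depth_flag>[01])->(?P<depth_value>\d+),"
    r"(?:total|t):(?P<total>\d+),"
    r"(?:sleep|s):(?P<sleep>\d+)\]"
    r"\s*=\s*(?P<url>\S+)$"
)
# <id> = <url>
TYPE2 = re.compile(rf"^(?P<id>{ID})\s*=\s*(?P<url>\S+)$")


def parse_site_line(line: str) -> Optional[dict]:
    """Parse one line of a cl_sites file; return None for ignored lines."""
    line = line.strip()
    if not line or line.startswith((";", "#", "//")):
        return None  # empty lines and comments are ignored
    m = TYPE3.match(line)
    if m:
        return {
            "type": 3,
            "id": m["id"],
            "depth_flag": int(m["depth_flag"]),
            "depth_value": int(m["depth_value"]),
            "total": int(m["total"]),
            "sleep": int(m["sleep"]),
            "url": m["url"],
        }
    m = TYPE2.match(line)
    if m:
        return {"type": 2, "id": m["id"], "url": m["url"]}
    # Assumption: anything else is a bare URL with no id.
    return {"type": 1, "url": line}


long_form = parse_site_line(
    "mfmokbel[depth:1->3,total:10,sleep:0] = https://www.mfmokbel.com"
)
short_form = parse_site_line(
    "mfmokbel[d:1->3,t:10,s:0] = https://www.mfmokbel.com"
)
```

Both the long and the shortened key forms produce the same parsed record, which matches the equivalence stated in the example.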
The spider functionality is what gives Crawlector the capability to find additional links on the targeted page. The Spider supports the following features:
- The URL must be of Type 3 for the Spider functionality to work
- exclude_url config. option. For example, *.zip|*.exe|*.rar|*.zip|*.7z|*.pdf|*.bat|*.db
- include_url config. option. For example, */checkout/*|*/products/*
- exclude_https
- add_ext_links. This feature honours the exclude_url and include_url config. options.
- ext_links_only. This feature honours the exclude_url and include_url config. options.

In release 2.0, the ids have their types explicitly assigned by appending either of the following types to the id itself:
id_postfix (type) | description |
---|---|
_t1_p | type 1 plain with no id |
_sd | sub-type for subdomains |
_tld | sub-type for tlds |
_t2_p | type 2 plain with an id |
_t3_s | type 3 spidered domains |
_t3_sc | type 3 spidered domains with a child node |
_t3_ss | type 3 when a type 3 (_t3_s) url is turned into type 1 url |
_t3_s_e | type 3 spidered domains external links |
_obj_ | for deep scanning and object extraction |
_t4_ru | for redirect url (for all types) |
Having each id carry its type with it, makes browsing and filtering the results easier. Moreover, this is used internally for various reasons.
- The site_ranking section in the configuration file provides some options to alter how the CSV file is to be read
- The site section provides the capability to expand on a given site, by attempting to find all available top-level domains (TLDs) and/or subdomains for the same domain. If found, new tlds/subdomains will be checked like any other domain
- rapid_api_key in the configuration file
- With find_tlds enabled, in addition to Omnisint Labs API tlds results, the framework attempts to find other active/registered domains by going through every tld entry, either in the tlds_file or tlds_url
- If tlds_url is set, it should point to a url that hosts tlds, each one on a new line (lines that start with either of the characters ‘;’, ‘#’ or ‘//’ are ignored)
- tlds_file holds the filename that contains the list of tlds (same as for tlds_url; only the tld is present, excluding the ‘.’, for ex., “com”, “org”)
- If tlds_file is set, it takes precedence over tlds_url
- tld_dl_time_out, this is for setting the maximum timeout for the dnslookup function when attempting to check if the domain in question resolves or not
- tld_use_connect, this option enables the functionality to connect to the domain in question over a list of ports, defined in the option tlds_connect_ports
- tlds_connect_ports accepts a list of ports, comma separated, or a list of ranges, such as 25-40,90-100,80,443,8443 (range start and end are inclusive)
- tld_con_time_out, this is for setting the maximum timeout for the connect function
- tld_con_use_ssl, enable/disable the use of ssl when attempting to connect to the domain
- If save_to_file_subd is set to true, discovered subdomains will be saved to “\expanded\exp_subdomain_<pm|am>.txt”
- If save_to_file_tld is set to true, discovered domains will be saved to “\expanded\exp_tld_<pm|am>.txt”
- If exit_here is set to true, then Crawlector bails out after executing this [site] function, irrespective of other enabled options. It means found sites won’t be crawled/spidered

The url redirect functionality in previous releases was broken. This release provides a complete rewrite of the redirect feature, with a high degree of parametrization for controlling its operation.
In release version 2.0, the redirect has a dedicated section in the configuration file, named [redirect]. The entirety of the redirection functionality could be turned on/off via the option follow_redir, under the section [default].
The redirect function checks the HTTP response status codes: 301, 302, 303, 307 and 308. In case of a match, Crawlector will parse the Location header, for the redirect to url, accounting for both, absolute and relative redirect urls.
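As a rough illustration of the check described above, the Python sketch below resolves a redirect target from a response's status code and Location header. It is not Crawlector's code; the function name `resolve_redirect` and the plain-dict header representation are assumptions made for the example. The standard library's `urljoin` handles both absolute and relative Location values.

```python
from typing import Optional
from urllib.parse import urljoin

# Status codes the redirect function is documented to check.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_redirect(request_url: str, status: int, headers: dict) -> Optional[str]:
    """Return the absolute redirect target, or None if there is no redirect."""
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive, so compare in lowercase.
    location = next(
        (value for name, value in headers.items() if name.lower() == "location"),
        None,
    )
    if not location:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL that was requested.
    return urljoin(request_url, location)


relative = resolve_redirect("https://www.example.com/a/b", 302, {"Location": "/login"})
absolute = resolve_redirect("https://www.example.com/", 301, {"location": "https://other.example/"})
```

Storing the resolved value, rather than the raw header, is what allows the redirect list to be recorded in absolute form.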
The redirect functionality in Crawlector was designed for performance and agility. The [redirect] section provides the following list of options:
The depth option takes either of the values, last or all. It controls what found redirect urls to visit, depending on whether the visit option is enabled or not. all is for visiting all found redirect urls. last is for visiting the last redirect url.
The visiting of those urls happens in the same/current session. Keep in mind that irrespective of the depth value, Crawlector will record the list of all found redirect to urls, along with the total number, in absolute form.
They will be written to the cl_mlog CSV file, under the columns redirect_urls and redirect_total.
The option max_redirect sets an upper limit on the total number of url redirects to discover.
The option skip_similar is best explained via the following example:
Assume that the original url given to Crawlector to crawl is “https://www.mfa.gov.law” and one of the found redirect_urls is “https://mfa.gov.law/“.
As you can tell, the only difference is the forward slash at the end of the url. These two urls are the same, and the server will respond with the same page.
If the option visit is set to true, Crawlector will crawl both urls, thereby wasting resources, and performing the same task twice.
This might not be an issue for 1 or 2 urls, but if you have 1000s of urls you want to crawl, and the option visit is enabled, then the chances are very high that more than half of them will have such a discovered url, at which point this becomes a pressing issue to account for.
Thus, setting the option skip_similar to true will help solve this issue by skipping over visiting similar urls.
In addition to the forward slash scenario, the skip_similar option also accounts for redirect urls that differ only by either or both of the prefixes “https://” and “www.”.
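One way to picture the comparison is to reduce each url to a key that ignores the listed differences. The Python sketch below is an assumption about how such a check could work, not Crawlector's actual logic; `similarity_key` and `is_similar` are hypothetical names, and only the three documented differences (trailing forward slash, “https://” and “www.”) are handled.

```python
def similarity_key(url: str) -> str:
    """Reduce a url to a form that ignores the documented differences."""
    key = url.strip()
    # Strip the scheme first so that "https://www." loses both prefixes.
    for prefix in ("https://", "www."):
        if key.startswith(prefix):
            key = key[len(prefix):]
    # Ignore a single trailing forward slash.
    if key.endswith("/"):
        key = key[:-1]
    return key


def is_similar(original_url: str, redirect_url: str) -> bool:
    """True when the redirect url should be skipped as a duplicate."""
    return similarity_key(original_url) == similarity_key(redirect_url)


same = is_similar("https://www.mfa.gov.law", "https://mfa.gov.law/")
different = is_similar("https://www.mfa.gov.law", "https://mfa.gov.law/en/")
```

With the urls from the example above, both reduce to the same key, so the redirect url would not be visited a second time.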
One of the major additions to release 2.0 is the capability to extract different types of objects from the page, save them to disk, Yara & URLHaus scan them and save the results to the CSV file.
To enable this feature, set the option extract_obj to true, under the section [page].
The implementation of the deep object extraction (DOE) feature works by creating an MHT web archive file from the webpage, including external scripts, images and CSS files.
All embedded files will be extracted into the path specified by the option obj_dir (path: obj_dir/objects/), where each file will be scanned. The implementation is not to be confused with headless browser functionality.
DOE is different and doesn’t involve loading the page to retrieve all dynamically queried URLs. Therefore, it has its limitations.
All of the extracted objects will have some of their metadata written into the CSV file.
Keep the following in mind when reading the CSV file: the id of the domain with the extracted object has a unique format, as follows, <domain_id>_<type>_p_obj_<counter> (for example, _mfa_gov_cef40bc5-ba6a-41_t1_p_obj_0_). The url will have the following format, <url>__<object_filename> (for example, https://www.mfa.gov.law__bilmur.min.js).
If the option delete_obj is set to true, then all extracted objects that aren’t detected by Yara are deleted from disk. If the option log_all_objs is set to true, then the metadata of all extracted objects is logged to the same cl_mlog CSV file.
If the option check_urlhaus under the [page] section is set to true, then every extracted object will be URLHaus scanned. Note that the settings for this option are inherited from the section [urlhaus].
Note: if the domain being crawled redirects to another domain, then, the last redirect to URL has to be passed to DOE to work. Moreover, the domain has to start with “HTTP(S)://” for DOE to work.
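When post-processing the CSV file, the <url>__<object_filename> format described above can be split back into its two parts. This is a small Python sketch for that purpose only; the helper name `split_object_url` is hypothetical, and it assumes the object filename itself contains no double underscore.

```python
def split_object_url(logged_url: str) -> tuple:
    """Split '<url>__<object_filename>' into (url, object_filename).

    Returns (logged_url, None) when no '__' separator is present.
    Assumes the last '__' in the string is the separator.
    """
    page_url, separator, filename = logged_url.rpartition("__")
    if not separator:
        return (logged_url, None)
    return (page_url, filename)


parts = split_object_url("https://www.mfa.gov.law__bilmur.min.js")
```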
Sometimes, you might want to run Crawlector sessions that might take days to complete, for example, by crawling the top 1-million Alexa websites, and for such a scenario, you need a way to monitor the framework’s operation and progress, remotely.
Therefore, in release 2.1, I’ve added the Slack alert notification feature to provide a mechanism to monitor the execution of Crawlector in real-time, by sending Yara’s alerts, std::exit() events, and process warnings and errors, to a Slack channel of your choosing.
In addition to that, Crawlector installs a console handler, in an attempt to monitor certain event types, including ctrl_c, ctrl_close, ctrl_break, ctrl_logoff and ctrl_shutdown.
It is important to keep in mind that Crawlector doesn’t change/alter the default handler’s behaviour, it merely reports to the Slack channel the receiving of any of the listed events. This could be extended in the future to account for other types of events.
This feature uses Slack REST API, and for authentication with the server, it uses OAuth 2.0. You’ll need a Slack API token to use it, and a channel configured with the right permissions.
This feature only posts messages to the Slack channel and doesn’t receive or process any incoming messages.
The [slack_alert] section provides the following list of options:
To enable or disable this feature, simply set the option alert to true or false. Moreover, you need to specify the api_token, along with a channel name.
Note-1: In the initialization phase of Crawlector, it tests whether the provided authentication token is valid or not, or if the channel is set, and in case of failure, this feature is disabled automatically.
All alerts reported to the Slack channel are reported under the user’s name Crawlector v<version_number>, for example, Crawlector v2.1. The user has the icon of a spider web.
Additionally, all alerts are threaded, meaning all subsequent alerts after the first starting message, are posted as replies.
This was a design decision and helps in case you’re running multiple sessions at the same time, all reporting to the same channel. Some alerts use the markdown markup language for formatting.
When the process has finished successfully and is about to exit, it posts the following message:
Crawlector has finished and is shutting down successfully
Note-2: Slack rate limit on the post message API is one message per second, with leeway for some bursts. Crawlector does not queue messages to account for more posts per second.
This might change in the future if required, however, the option sleep allows for the process to sleep for a specified amount of time after every successfully posted message.
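For readers unfamiliar with the Slack Web API, the sketch below shows roughly what a threaded alert post with a pause afterwards could look like in Python. It is not Crawlector's code. It uses Slack's `chat.postMessage` method with its `thread_ts` parameter; the helper names and the `sleep_seconds` parameter (including its unit) are assumptions for illustration.

```python
import json
import time
import urllib.request

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def build_alert(channel: str, text: str, thread_ts: str = None) -> dict:
    """Build the JSON payload for one alert.

    Passing the 'ts' of the session's first message as thread_ts posts
    the alert as a reply in that thread.
    """
    payload = {"channel": channel, "text": text}
    if thread_ts:
        payload["thread_ts"] = thread_ts
    return payload


def post_alert(api_token: str, payload: dict, sleep_seconds: float = 1.0) -> dict:
    """Send the payload to Slack, then sleep to stay under the rate limit."""
    request = urllib.request.Request(
        SLACK_POST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.loads(response.read().decode("utf-8"))
    time.sleep(sleep_seconds)  # hypothetical stand-in for the sleep option
    return body


first = build_alert("#crawlector-alerts", "Session started")
reply = build_alert("#crawlector-alerts", "Yara match", thread_ts="1699999999.000100")
```

In use, the `ts` value returned for the first message would be kept and passed as `thread_ts` for every later alert in the same session.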
With release 2.2 (code-named Hallstatt), I’m introducing the capability to remotely control Crawlector via a selected set of specially designed control commands.
The reason for introducing this functionality is to monitor and control certain behaviours of sessions that are supposed to run for hours or days.
For example, you might want to turn on/off the Slack alert functionality, terminate Crawlector, and upload a configuration file, among others.
This feature uses Slack REST API, and for authentication with the server, it uses OAuth 2.0. You’ll need a Slack API token to use it, and a channel configured with the right permissions. The API token is the same as that used in the [slack_alert] section, option api_token.
The [slack_alert] section provides the following additional list of options for the remote control functionality:
To enable or disable this feature, simply set the option control to true or false. The ctrl_channel value has to be the channel ID and not the channel name. You can get it by right-clicking on the channel name -> View channel details -> Scroll down to the bottom of the window, and you’ll see the Channel ID: <channel_id> field.
The option ctrl_sleep determines the frequency of calling out to the control channel specified in the ctrl_channel option for retrieving control commands. You could also update this option via the control command cl_update_delay <time_in_ms>.
The list of supported control commands is the following:
Control Command | Description |
---|---|
cl_get_date | Retrieves the date and time Crawlector was started and the current date and time. |
cl_ping | Sends back the message “Pong…“. This is to check that C&C channel is working. |
cl_get_config | Uploads the currently used configuration (e.g., cl_config.ini) file as a text file. |
cl_update_delay <integer_in_milliseconds> | Updates the check-in time between every pull request for control commands. Changes the value (ctrl_sleep) for the current session only. |
cl_turn_off_slack_alert | Turns off Slack alert feature for the currently active session. |
cl_turn_on_slack_alert | Turns on Slack alert feature for the currently active session. |
cl_help | Lists this help message. |
cl_exit | Terminates Crawlector, forcefully. |
Note-1: In the initialization phase of Crawlector, it tests whether the provided authentication token is valid or not, or if the channel is set, and in case of failure, this feature is disabled automatically.
If this functionality is enabled, and once it passes API token validation, Crawlector sends the message “Crawlector is ready for receiving control commands. Type the command cl_help for a list of supported control commands.” to the designated ctrl_channel.
All responses to a given control command are threaded. Moreover, control commands are read on a session-by-session basis, from the time a session is started.
Note-2: Slack rate limit on the retrieval (conversation history) message API is one request per second, with leeway for some bursts. So, if the ctrl_sleep option is set to a value less than a second or greater than a second, Crawlector does queue messages to account for more control commands per second, and executes them in the order received.
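The polling loop described above can be approximated in Python as follows. This is a sketch under assumptions, not Crawlector's implementation: it targets Slack's `conversations.history` method with its `channel` and `oldest` parameters, and the helper names are hypothetical. Only the request URL building, command parsing and ordering are shown; the HTTP call itself is omitted.

```python
from typing import Optional
from urllib.parse import urlencode

HISTORY_URL = "https://slack.com/api/conversations.history"

# Control commands listed in the table above.
COMMANDS = {
    "cl_get_date", "cl_ping", "cl_get_config", "cl_update_delay",
    "cl_turn_off_slack_alert", "cl_turn_on_slack_alert", "cl_help", "cl_exit",
}


def build_history_url(channel_id: str, oldest_ts: str) -> str:
    """URL for fetching messages newer than oldest_ts (e.g. session start)."""
    return HISTORY_URL + "?" + urlencode({"channel": channel_id, "oldest": oldest_ts})


def parse_control_command(text: str) -> Optional[tuple]:
    """Return (command, argument) or None if the text is not a valid command."""
    parts = text.strip().split()
    if not parts or parts[0] not in COMMANDS:
        return None
    name, args = parts[0], parts[1:]
    if name == "cl_update_delay":
        # Takes exactly one integer argument, in milliseconds.
        if len(args) != 1 or not args[0].isdigit():
            return None
        return (name, int(args[0]))
    return (name, None) if not args else None


def commands_in_order(messages: list) -> list:
    """Parse messages and return valid commands, oldest first."""
    ordered = sorted(messages, key=lambda m: float(m["ts"]))
    parsed = (parse_control_command(m.get("text", "")) for m in ordered)
    return [command for command in parsed if command is not None]


queue = commands_in_order([
    {"ts": "1700000002.000200", "text": "cl_update_delay 5000"},
    {"ts": "1700000001.000100", "text": "cl_ping"},
    {"ts": "1700000003.000300", "text": "hello there"},
])
```

Sorting by timestamp before executing matters because the history endpoint returns the most recent messages first.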
cl_sites are allowed.