IOCExtract : Advanced Indicator Of Compromise (IOC) Extractor

IOCExtract is an advanced Indicator of Compromise (IOC) extractor. This library extracts URLs, IP addresses, MD5/SHA hashes, email addresses, and YARA rules from text corpora. It includes some encoded and “defanged” IOCs in the output, and optionally decodes/refangs them.

The Problem

It is common practice for malware analysts or endpoint software to “defang” IOCs such as URLs and IP addresses, in order to prevent accidental exposure to live malicious content. Being able to extract and aggregate these IOCs is often valuable for analysts. Unfortunately, existing “IOC extraction” tools often pass right by them, as they are not caught by standard regex.

For example, the simple defanging technique of surrounding periods with brackets:

127[.]0[.]0[.]1


Existing tools that use a simple IP address regex will ignore this IOC entirely.
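To make the problem concrete, the short sketch below applies a deliberately simple IPv4 pattern to a plain and a bracket-defanged address. The pattern, the name NAIVE_IPV4, and the sample addresses are illustrative assumptions and are not taken from any particular tool.

```python
import re

# A deliberately simple IPv4 pattern: four groups of 1-3 digits
# separated by literal dots (illustrative, not from a real tool).
NAIVE_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

plain = "beacon seen from 10.0.0.5"
defanged = "beacon seen from 10[.]0[.]0[.]5"

plain_hits = NAIVE_IPV4.findall(plain)        # the normal address is matched
defanged_hits = NAIVE_IPV4.findall(defanged)  # the bracketed address is missed

print(plain_hits)
print(defanged_hits)
```

The second search comes back empty because the pattern only accepts a bare dot between octets.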


The Solution

By combining specially crafted regex with some custom postprocessing, we are able to both detect and deobfuscate “defanged” IOCs. This saves time and effort for the analyst, who might otherwise have to manually find and convert IOCs into machine-readable format.
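As a rough illustration of the idea, the sketch below pairs a defang-tolerant IPv4 pattern with a small cleanup step. It is a simplified stand-in written with the standard re module, not iocextract's actual regex or postprocessing; the names SEP, DEFANGED_IPV4, and sketch_refang_ipv4 are hypothetical.

```python
import re

# Separator between octets: a dot wrapped in [ ] or ( ), a backslash-escaped
# dot, or a bare dot. All groups are non-capturing so findall returns full matches.
SEP = r"(?:\[\.\]|\(\.\)|\\\.|\.)"
DEFANGED_IPV4 = re.compile(r"\b\d{1,3}(?:" + SEP + r"\d{1,3}){3}\b")


def sketch_refang_ipv4(ip):
    """Postprocessing step: drop brackets, parentheses, and backslashes."""
    return re.sub(r"[\[\]()\\]", "", ip)


text = "C2 at 10[.]0[.]0[.]5, fallback 10(.)0(.)0(.)6, resolver 8.8.8.8"
found = DEFANGED_IPV4.findall(text)              # detection
clean = [sketch_refang_ipv4(ip) for ip in found]  # deobfuscation

print(found)
print(clean)
```

Keeping detection and cleanup as separate steps means the caller can choose whether to see the IOC as written or in machine-readable form.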

A Simple Use Case

Many Twitter users post C2s or other valuable IOC information with defanged URLs. For example, this tweet from @InQuest:

  • Recommended reading and great work from @unit42_intel: …
  • InQuest customers have had detection for threats delivered from hotfixmsupload[.]com
  • since 6/3/2017 and cdnverify[.]net since 2/1/18.

If we run this through the extractor, we can easily pull out the URLs:

  • hotfixmsupload[.]com
  • cdnverify[.]net

Passing in refang=True at extraction time would remove the obfuscation, but since these are real IOCs, let’s leave them defanged in our documentation. 🙂


Installation

You may need to install the Python development headers in order to install the regex dependency. On Ubuntu/Debian-based systems, try:

sudo apt-get install python-dev

Then install iocextract from pip:

pip install iocextract

If you have problems installing on Windows, try installing regex directly by downloading the appropriate wheel from PyPI and running e.g.:

pip install regex-2018.06.21-cp27-none-win_amd64.whl


Usage

Try extracting some defanged URLs:

>>> content = """
... I really love example[.]com!
... All the bots are on hxxp:// these days.
... C2: tcp://example[.]com:8989/bad
... """
>>> import iocextract
>>> for url in iocextract.extract_urls(content):
...     print(url)


Note that some URLs may show up twice if they are caught by multiple regexes.

If you want, you can also “refang”, or remove common obfuscation methods from IOCs:

>>> for url in iocextract.extract_urls(content, refang=True):
...     print(url)

You can even extract and decode hex-encoded and base64-encoded URLs:

>>> content = '612062756e6368206f6620776f72647320687474703a2f2f6578616d706c652e636f6d2f70617468206d6f726520776f726473'
>>> for url in iocextract.extract_urls(content):
...     print(url)

>>> for url in iocextract.extract_urls(content, refang=True):
...     print(url)
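To show what is hidden inside such encoded input, the sketch below decodes the hex string above and the base64 sample from the URL table using only the standard library, then searches the decoded text for a URL. This only illustrates what decoding yields; it is not how iocextract locates encoded URLs, and URL_RE is a simplified pattern chosen for the example.

```python
import base64
import binascii
import re

# Simplified URL pattern, used here only to search the decoded text.
URL_RE = re.compile(r"https?://\S+")

hex_blob = (
    "612062756e6368206f6620776f72647320687474703a2f2f6578616d706c652e"
    "636f6d2f70617468206d6f726520776f726473"
)
decoded_hex = binascii.unhexlify(hex_blob).decode("ascii")
hex_urls = URL_RE.findall(decoded_hex)

b64_blob = "aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK"
decoded_b64 = base64.b64decode(b64_blob).decode("ascii")
b64_urls = URL_RE.findall(decoded_b64)

print(decoded_hex)
print(hex_urls)
print(b64_urls)
```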

All extract_* functions in this library return iterators, not lists. The benefit of this behavior is that iocextract can process extremely large inputs with very low overhead. However, an iterator can only be consumed once, so if you need to iterate over the IOCs more than once, you will have to save the results as a list:

>>> list(iocextract.extract_urls(content))
['hxxp://', 'tcp://example[.]com:8989/bad', 'example[.]com', 'tcp://example[.]com:8989/bad']
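The one-pass behavior is a general property of Python iterators and can be seen without the library. In the sketch below, extract_tokens is a hypothetical stand-in for an extract_* function; it is a plain generator and shares nothing with iocextract except that it yields results lazily.

```python
def extract_tokens(text):
    """Hypothetical stand-in for an extract_* function: yields matches lazily."""
    for token in text.split():
        if "[.]" in token:
            yield token


content = "see example[.]com and cdn[.]example[.]net for details"

results = extract_tokens(content)
first_pass = list(results)   # consumes the generator
second_pass = list(results)  # nothing left to yield

# Saving the results as a list allows repeated iteration.
saved = list(extract_tokens(content))

print(first_pass)
print(second_pass)
print(len(saved))
```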

A command-line tool is also included:

$ iocextract -h
usage: iocextract [-h] [--input INPUT] [--output OUTPUT] [--extract-emails]
                  [--extract-ips] [--extract-ipv4s] [--extract-ipv6s]
                  [--extract-urls] [--extract-yara-rules] [--extract-hashes]
                  [--custom-regex REGEX_FILE] [--refang] [--strip-urls]

Advanced Indicator of Compromise (IOC) extractor. If no arguments are
specified, the default behavior is to extract all IOCs.

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         default: stdin
  --output OUTPUT       default: stdout
  --custom-regex REGEX_FILE
                        file with custom regex strings, one per line, with one
                        capture group each
  --refang              default: no
  --strip-urls          remove possible garbage from the end of urls. default:
                        no
  --wide                preprocess input to allow wide-encoded character
                        matches. default: no

Only URLs, emails, and IPv4 addresses can be “refanged”.

More Details

This library currently supports the following IOCs:

  • IP Addresses
    • IPv4 fully supported
    • IPv6 partially supported
  • URLs
    • With protocol specifier: http, https, tcp, udp, ftp, sftp, ftps
    • With [.] anchor, even with no protocol specifier
    • IPv4 and IPv6 (RFC2732) URLs are supported
    • Hex-encoded URLs with protocol specifier: http, https, ftp
    • URL-encoded URLs with protocol specifier: http, https, ftp, ftps, sftp
    • Base64-encoded URLs with protocol specifier: http, https, ftp
  • Emails
    • Partially supported, anchoring on @ or at
  • YARA rules
    • With imports, includes, and comments
  • Hashes
    • MD5
    • SHA1
    • SHA256
    • SHA512
  • Custom regex
    • With exactly one capture group

For IPv4 addresses, the following defang techniques are supported:

Technique         Defanged
. -> [.]          1[.]1[.]1[.]1
. -> (.)          1(.)1(.)1(.)1
. -> \.           1\.1\.1\.1
Any combination   1.)1[.1.)1

For email addresses, the following defang techniques are supported:

Technique         Defanged
. -> [.]          me@example[.]com
. -> (.)          me@example(.)com
. -> {.}          me@example{.}com
. -> _dot_        me@example dot com
@ -> [@]          me[@]example.com
@ -> (@)          me(@)example.com
@ -> {@}          me{@}example.com
@ -> _at_         me at example.com
Partial           me@} example[.com
Added spaces      me@example [.] com
Any combination   me @example [.)com
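A minimal sketch of refanging a few of these email forms follows. It handles only fully bracketed and spelled-out separators, not the partial or mixed cases, and it is not the library's implementation; sketch_refang_email is a hypothetical helper built on the standard re module.

```python
import re


def sketch_refang_email(email):
    """Undo bracketed or spelled-out '@' and '.' separators (simple cases only)."""
    # [@], (@), {@}, or the word "at" surrounded by spaces -> @
    email = re.sub(r"\s*[\[({]\s*@\s*[\])}]\s*|\s+at\s+", "@", email)
    # [.], (.), {.}, or the word "dot" surrounded by spaces -> .
    email = re.sub(r"\s*[\[({]\s*\.\s*[\])}]\s*|\s+dot\s+", ".", email)
    return email


samples = [
    "me@example[.]com",
    "me(@)example.com",
    "me at example dot com",
    "me@example [.] com",
]
refanged = [sketch_refang_email(s) for s in samples]
print(refanged)
```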

For URLs, the following defang techniques are supported:

Technique         Defanged
. -> [.]          example[.]com/path
. -> (.)          example(.)com/path
. -> \.           example\.com/path
/ -> [/]          http://example.com[/]path
Cisco ESA         http:// example .com /path
:// -> __         http__example.com/path
:// -> :\\        http:\\example.com/path
Any combination   hxxp__ example( .com[/]path
Hex encoded       687474703a2f2f6578616d706c652e636f6d2f70617468
URL encoded       http%3A%2F%2fexample%2Ecom%2Fpath
Base64 encoded    aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK

Note that the tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the GitHub Issues.
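As a rough illustration of a few of the URL techniques, the sketch below reverses the hxxp, [.], (.), \. and [/] substitutions with plain string operations and decodes the URL-encoded sample with the standard library. It covers only these simple cases and is not how iocextract refangs URLs; sketch_refang_url is a hypothetical helper.

```python
import re
from urllib.parse import unquote


def sketch_refang_url(url):
    """Reverse a handful of simple defang substitutions."""
    url = re.sub(r"^hxxp", "http", url, flags=re.IGNORECASE)
    url = url.replace("[.]", ".").replace("(.)", ".").replace("\\.", ".")
    url = url.replace("[/]", "/")
    return url


samples = [
    "hxxp://example[.]com/path",
    "http://example(.)com[/]path",
    "http://example\\.com/path",
]
refanged = [sketch_refang_url(s) for s in samples]

# Percent-encoded URLs can be decoded with urllib.
url_decoded = unquote("http%3A%2F%2fexample%2Ecom%2Fpath")

print(refanged)
print(url_decoded)
```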

The base64 regex was generated with @deadpixi's base64 regex tool.

Custom Regex

If you’d like to use the CLI to extract IOCs using your own custom regex, create a plain text file with one regex string per line, and pass it in with the --custom-regex flag. Be sure each regex string includes exactly one capture group. For example:


This custom regex file will extract the domain from matching URLs. The (?: ) noncapture group won't be included in matches.

If you would like to extract the entire match, just put parentheses around your entire regex string, like this:


If your regex is invalid, you’ll see an error message like this:

Error in custom regex: missing ) at position 5

If your regex does not include a capture group, you’ll see an error message like this:

Error in custom regex: no such group
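The role of the single capture group can be explored with Python's standard re module. In the sketch below, both patterns and the extract_custom helper are hypothetical illustrations rather than the library's own code or sample file. It assumes the extractor returns group 1 of each match, which is consistent with the "no such group" message above: that is the text of the IndexError Python raises when group 1 is requested from a pattern that has no groups.

```python
import re


def extract_custom(text, patterns):
    """Yield capture group 1 of every match, for each pattern in turn."""
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            yield match.group(1)


text = "fetch http://files.example.com/a.bin then http://cdn.example.net/b"

# Hypothetical patterns: (?: ) repeats "label." without capturing it,
# so the only capture group is the one wrapping the domain (or the whole URL).
domain_only = r"http://((?:[\w\-]+\.)+[\w\-]+)/"
whole_match = r"(http://(?:[\w\-]+\.)+[\w\-]+/)"

domains = list(extract_custom(text, [domain_only]))
wholes = list(extract_custom(text, [whole_match]))

# A pattern with no capture group fails when group 1 is requested.
try:
    list(extract_custom(text, [r"http://[\w.]+"]))
    no_group_error = None
except IndexError as exc:
    no_group_error = str(exc)

print(domains)
print(wholes)
print(no_group_error)
```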