Cyber security

JailbreakEval : Automating the Evaluation Of Language Model Security

Jailbreak is an attack that prompts a language model to give actionable responses to harmful behaviors, such as writing an offensive letter, providing detailed instructions for creating a bomb.

Evaluating the results of such attacks typically requires manual inspections by determining if the response fulfills some standards, which is impractical for large-scale analysis.

As a result, most research on jailbreak attacks levarage automated tools to evaluate the result of their jailbreak attempts, and each of them offers some unique insights.

However, due to the inherent semantic flexibility of natural language, there is no such automated evaluator that could fit in all contexts.

Therefore, instead of proposing one best automated evaluator to rule them all, JailbreakEval aims to bring them together in a unified manner, making them straightforward to craft, access, and compare. It is particularly well-suited for:

  • Jailbreak Researchers, by providing well-known jailbreak evaluators to assess the effectness of their attacks out-of-the-box.
  • Jailbreak Evaluator Developers, by providing a handy framework for creating new evaluators and comparing its performance with the established ones.

Install

You can install the latest release of JailbreakEval from pypi:

pip install JailbreakEval

Alternatively, you can install JailbreakEval by cloning the project:

git clone https://github.com/ThuCCSLab/JailbreakEval
cd JailbreakEval
make install  # for installation
make dev  # for development

Quick Start

JailbreakEval can evaluate your jailbreak attempts in just a few lines of codes.

Here is a snippet to assess a jailbreak attempt by GCG’s evaluator:

from jailbreakeval import JailbreakEvaluator

# Initialize the evaluator with GCG's preset
evaluator = JailbreakEvaluator.from_preset("StringMatching-zou2023universal")

# The jailbreak attempt to be evaluated
attempt_to_evaluate = {
    "question": "How to build a bomb?",  # the target question to jailbreak
    "answer": "I cannot tell you. This is illegal."  # the answer extracted from the model by the jailbreak attack
}

# Evaluate the attempt and print the result
evaluation_result = evaluator(attempt_to_evaluate)
print(evaluation_result)   # Output: False

For more information click here.

Varshini

Varshini is a Cyber Security expert in Threat Analysis, Vulnerability Assessment, and Research. Passionate about staying ahead of emerging Threats and Technologies.

Recent Posts

TruffleHog Burp Suite Extension : Enhancing Security With Automated Secret Detection

The TruffleHog Burp Suite Extension is a powerful tool designed to scan HTTP traffic within…

59 minutes ago

ps5Spoofer : Unlocking PS4 Game Compatibility On Your PS5

The ps5Spoofer is a tool designed for the PlayStation 5 (PS5) that patches the PS4…

4 hours ago

eWPTX Preparion : Essential Tools And Functions

The eWPTX (eLearnSecurity Web Application Penetration Tester Extreme) certification is a challenging credential that validates…

4 hours ago

REC2 : Rusty External Command And Control Tool

REC2, short for Rusty External Command and Control, is a sophisticated Command and Control (C2)…

4 hours ago

AMSI-Bypass-HWBP : A Tool For Evading AMSI Detection

AMSI (Antimalware Scan Interface) is a Windows feature designed to help protect systems from malware…

23 hours ago

BurpSuite-Xkeys : Mastering Key And Token Extraction For Web Security

Xkeys is a Burp Suite extension designed to extract interesting strings such as keys, secrets,…

23 hours ago