OpenAttack is an open-source Python-based textual adversarial attack toolkit, which handles the whole process of textual adversarial attacking, including preprocessing text, accessing the victim model, generating adversarial examples and evaluation.
Features & Uses
OpenAttack has the following features:
OpenAttack has a wide range of uses, including:
Installation
You can install OpenAttack either with pip:
pip install OpenAttack
or by cloning this repo:
git clone https://github.com/thunlp/OpenAttack.git
cd OpenAttack
python setup.py install
After installation, you can try running demo.py to check if OpenAttack works well:
python demo.py
Usage Examples
Basic: Use Built-in Attacks
OpenAttack builds in some commonly used text classification models such as LSTM and BERT as well as datasets such as SST for sentiment analysis and SNLI for natural language inference. You can effortlessly conduct adversarial attacks against the built-in victim models on the datasets.
The following code snippet shows how to use a genetic algorithm-based attack model (Alzantot et al., 2018) to attack BERT on the SST dataset:
import OpenAttack as oa
# choose a trained victim classification model
victim = oa.DataManager.load("Victim.BERT.SST")
# choose an evaluation dataset
dataset = oa.DataManager.load("Dataset.SST.sample")
# choose Genetic as the attacker and initialize it with default parameters
attacker = oa.attackers.GeneticAttacker()
# prepare for attacking
attack_eval = oa.attack_evals.DefaultAttackEval(attacker, victim)
# launch attacks and print attack results
attack_eval.eval(dataset, visualize=True)
Advanced: Attack a Customized Victim Model
The following code snippet shows how to use the genetic algorithm-based attack model to attack a customized sentiment analysis model (a statistical model built in NLTK) on SST.
import OpenAttack as oa
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# configure access interface of the customized victim model
class MyClassifier(oa.Classifier):
    def __init__(self):
        self.model = SentimentIntensityAnalyzer()

    # access the classification probability scores with respect to input sentences
    def get_prob(self, input_):
        rt = []
        for sent in input_:
            rs = self.model.polarity_scores(sent)
            # small constants avoid division by zero when both scores are 0
            prob = (rs["pos"] + 1e-6) / (rs["neg"] + rs["pos"] + 2e-6)
            rt.append(np.array([1 - prob, prob]))
        return np.array(rt)

# choose the customized classifier as the victim model
victim = MyClassifier()
# choose an evaluation dataset
dataset = oa.DataManager.load("Dataset.SST.sample")
# choose Genetic as the attacker and initialize it with default parameters
attacker = oa.attackers.GeneticAttacker()
# prepare for attacking
attack_eval = oa.attack_evals.DefaultAttackEval(attacker, victim)
# launch attacks and print attack results
attack_eval.eval(dataset, visualize=True)
Advanced: Design a Customized Attack Model
OpenAttack incorporates many handy components which can be easily assembled into a new attack model.
A simple example is an attack model that shuffles the tokens in the original sentence.
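The token-shuffling idea can be sketched without any OpenAttack classes. This is a minimal illustration, not the toolkit's attacker interface: `shuffle_attack`, `toy_predict`, the `max_tries` budget, and whitespace tokenization are all assumptions made for the example.

```python
import random
from typing import Callable, Optional


def shuffle_attack(
    sentence: str,
    predict: Callable[[str], int],
    max_tries: int = 50,
    seed: Optional[int] = 0,
) -> Optional[str]:
    """Return a token-shuffled sentence that changes the prediction, or None."""
    rng = random.Random(seed)
    original_label = predict(sentence)
    tokens = sentence.split()  # whitespace tokenization (a simplification)
    for _ in range(max_tries):
        candidate_tokens = tokens[:]
        rng.shuffle(candidate_tokens)
        candidate = " ".join(candidate_tokens)
        if predict(candidate) != original_label:
            return candidate  # the label flipped: attack succeeded
    return None  # no shuffle within the budget flipped the label


# Hypothetical toy victim: predicts 1 if the first token is "good", else 0.
def toy_predict(text: str) -> int:
    words = text.split()
    return int(bool(words) and words[0] == "good")


original = "good movie but far too long"
adversarial = shuffle_attack(original, toy_predict)
print(adversarial)
```

The attack only needs the predicted label, so in the terminology used for attack models it is decision-based, and it perturbs the input at the word level.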
Advanced: Adversarial Training
OpenAttack can easily generate adversarial examples by attacking instances in the training set. These examples can be added to the original training set to retrain a more robust victim model, i.e., adversarial training.
The following example shows how to conduct adversarial training with OpenAttack.
'''
This example code shows how to conduct adversarial training to improve the robustness of a sentiment analysis model.
The most important part is the "attack()" function, in which adversarial examples are easily generated with an API "attack_eval.generate_adv()"
'''
import OpenAttack
import torch
import datasets
# Design a feedforward neural network as the victim sentiment analysis model
def make_model(vocab_size):
    """
    See the PyTorch tutorial: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#define-the-model
    """
    import torch.nn as nn

    class TextSentiment(nn.Module):
        def __init__(self, vocab_size, embed_dim=32, num_class=2):
            super().__init__()
            self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
            self.fc = nn.Linear(embed_dim, num_class)
            self.softmax = nn.Softmax(dim=1)
            self.init_weights()

        def init_weights(self):
            initrange = 0.5
            self.embedding.weight.data.uniform_(-initrange, initrange)
            self.fc.weight.data.uniform_(-initrange, initrange)
            self.fc.bias.data.zero_()

        def forward(self, text):
            embedded = self.embedding(text, None)
            return self.softmax(self.fc(embedded))

    return TextSentiment(vocab_size)
def dataset_mapping(x):
    return {
        "x": x["sentence"],
        "y": 1 if x["label"] > 0.5 else 0,
        "tokens": x["tokens"].split("|")
    }

# Choose SST-2 as the dataset
def prepare_data():
    # Reserve index 0 for unknown tokens and index 1 for padding
    vocab = {
        "<UNK>": 0,
        "<PAD>": 1
    }
    dataset = datasets.load_dataset("sst").map(function=dataset_mapping).remove_columns(["label", "sentence", "tree"])
    for dataset_name in ["train", "validation", "test"]:
        for inst in dataset[dataset_name]:
            for token in inst["tokens"]:
                if token not in vocab:
                    vocab[token] = len(vocab)
    return dataset["train"], dataset["validation"], dataset["test"], vocab
# Batch data
def make_batch(data, vocab):
    # Map tokens to ids, falling back to <UNK> for out-of-vocabulary tokens
    batch_x = [
        [
            vocab[token] if token in vocab else vocab["<UNK>"]
            for token in tokens
        ] for tokens in data["tokens"]
    ]
    max_len = max( [len(tokens) for tokens in data["tokens"]] )
    # Pad every sentence to the length of the longest one in the batch
    batch_x = [
        sentence + [vocab["<PAD>"]] * (max_len - len(sentence))
        for sentence in batch_x
    ]
    batch_y = data["y"]
    return torch.LongTensor(batch_x), torch.LongTensor(batch_y)

# Train the victim model for one epoch
def train_epoch(model, dataset, vocab, batch_size=128, learning_rate=5e-3):
    dataset = dataset.shuffle()
    model.train()
    criterion = torch.nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    avg_loss = 0
    for start in range(0, len(dataset), batch_size):
        train_x, train_y = make_batch(dataset[start: start + batch_size], vocab)
        pred = model(train_x)
        loss = criterion(pred.log(), train_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        avg_loss += loss.item()
    return avg_loss / len(dataset)

def eval_classifier_acc(dataset, clsf):
    correct = 0
    for inst in dataset:
        correct += (clsf.get_pred( [inst["x"]] )[0] == inst["y"])
    return correct / len(dataset)
# Train the victim model and conduct evaluation
def train_model(model, data_train, data_valid, vocab, num_epoch=10):
    mx_acc = None
    mx_model = None
    for i in range(num_epoch):
        loss = train_epoch(model, data_train, vocab)
        clsf = OpenAttack.PytorchClassifier(model, word2id=vocab)
        accuracy = eval_classifier_acc(data_valid, clsf)
        print("Epoch %d: loss: %lf, accuracy %lf" % (i, loss, accuracy))
        if mx_acc is None or mx_acc < accuracy:
            # Remember the best accuracy and snapshot the weights;
            # cloning keeps the snapshot from changing in later epochs
            mx_acc = accuracy
            mx_model = {k: v.clone() for k, v in model.state_dict().items()}
    # Restore the weights of the best epoch
    model.load_state_dict(mx_model)
    return model
# Launch adversarial attacks and generate adversarial examples
def attack(classifier, dataset, attacker = OpenAttack.attackers.PWWSAttacker()):
    attack_eval = OpenAttack.attack_evals.DefaultAttackEval(
        attacker = attacker,
        classifier = classifier,
        success_rate = True
    )
    # Only attack the instances that the classifier predicts correctly
    correct_samples = [
        inst for inst in dataset if classifier.get_pred( [inst["x"]] )[0] == inst["y"]
    ]
    accuracy = len(correct_samples) / len(dataset)
    adversarial_samples = attack_eval.generate_adv(correct_samples)
    attack_success_rate = attack_eval.get_result()["Attack Success Rate"]
    print("Accuracy: %lf%%\nAttack success rate: %lf%%" % (accuracy * 100, attack_success_rate * 100))
    tp = OpenAttack.text_processors.DefaultTextProcessor()

    def adversarial_samples_mapping(x):
        return {
            "tokens": list(map(lambda x:x[0], tp.get_tokens(x["x"])))
        }

    return adversarial_samples.map(adversarial_samples_mapping).remove_columns(["pred", "original", "info"])
def main():
    print("Loading data")
    train, valid, test, vocab = prepare_data() # Load dataset
    model = make_model(len(vocab)) # Design a victim model
    print("Training")
    trained_model = train_model(model, train, valid, vocab) # Train the victim model
    print("Generating adversarial samples (this step will take dozens of minutes)")
    clsf = OpenAttack.PytorchClassifier(trained_model, word2id=vocab) # Wrap the victim model
    adversarial_samples = attack(clsf, train) # Conduct adversarial attacks and generate adversarial examples
    print("Adversarially training classifier")
    print(train.features)
    print(adversarial_samples.features)
    new_dataset = {
        "x": [],
        "y": [],
        "tokens": []
    }
    for it in train:
        new_dataset["x"].append( it["x"] )
        new_dataset["y"].append( it["y"] )
        new_dataset["tokens"].append( it["tokens"] )
    for it in adversarial_samples:
        new_dataset["x"].append( it["x"] )
        new_dataset["y"].append( it["y"] )
        new_dataset["tokens"].append( it["tokens"] )
    finetune_model = train_model(trained_model, datasets.Dataset.from_dict(new_dataset), valid, vocab) # Retrain the classifier with additional adversarial examples
    print("Testing enhanced model (this step will take dozens of minutes)")
    attack(clsf, train) # Re-attack the victim model to measure the effect of adversarial training

if __name__ == '__main__':
    main()
Advanced: Design a Customized Evaluation Metric
OpenAttack supports designing a customized adversarial attack evaluation metric.
The following example shows how to add BLEU score as a customized evaluation metric to evaluate adversarial attacks.
'''
This example code shows how to design a customized attack evaluation metric, namely BLEU score.
'''
import OpenAttack
from nltk.translate.bleu_score import sentence_bleu
import datasets

class CustomAttackEval(OpenAttack.DefaultAttackEval):
    def __init__(self, attacker, clsf, processor=OpenAttack.DefaultTextProcessor(), **kwargs):
        super().__init__(attacker, clsf, processor=processor, **kwargs)
        self.__processor = processor
        # We extend DefaultAttackEval and use the processor option to specify
        # the TextProcessor used in our CustomAttackEval.

    def measure(self, x_orig, x_adv):
        # Invoke the original measure method to get measurements
        info = super().measure(x_orig, x_adv)
        if info["Succeed"]:
            # Add BLEU score, calculated by the NLTK toolkit, if the attack succeeds.
            token_orig = [token for token, pos in self.__processor.get_tokens(x_orig)]
            token_adv = [token for token, pos in self.__processor.get_tokens(x_adv)]
            # sentence_bleu expects token lists, not raw strings
            info["Bleu"] = sentence_bleu([token_orig], token_adv)
        return info

    def update(self, info):
        info = super().update(info)
        if info["Succeed"]:
            # Add the BLEU score that we just calculated to the total result.
            self.__result["bleu"] += info["Bleu"]
        return info

    def clear(self):
        super().clear()
        # Clear results
        self.__result = { "bleu": 0 }

    def get_result(self):
        result = super().get_result()
        # Calculate the average BLEU score and return.
        result["Avg. Bleu"] = self.__result["bleu"] / result["Successful Instances"]
        return result
def dataset_mapping(x):
    return {
        "x": x["sentence"],
        "y": 1 if x["label"] > 0.5 else 0,
    }

def main():
    clsf = OpenAttack.load("Victim.BiLSTM.SST")
    dataset = datasets.load_dataset("sst", split="train[:20]").map(function=dataset_mapping)
    attacker = OpenAttack.attackers.GeneticAttacker()
    attack_eval = CustomAttackEval(attacker, clsf)
    attack_eval.eval(dataset, visualize=True)

if __name__ == "__main__":
    main()
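The accumulate-then-average pattern behind `update`, `clear`, and `get_result` can be tried out without OpenAttack or NLTK. The sketch below is only an illustration under stated assumptions: `RunningMetric` is a hypothetical stand-alone class, and clipped unigram precision (the 1-gram building block of BLEU) stands in for the full BLEU score.

```python
from collections import Counter
from typing import Dict, List


def unigram_precision(reference: List[str], candidate: List[str]) -> float:
    """Clipped unigram precision: share of candidate tokens found in the reference."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / len(candidate)


class RunningMetric:
    """Accumulates a per-instance score over successful attacks only."""

    def __init__(self) -> None:
        self.clear()

    def clear(self) -> None:
        self.total = 0.0
        self.successes = 0

    def update(self, succeeded: bool, x_orig: str, x_adv: str) -> None:
        if succeeded:
            self.total += unigram_precision(x_orig.split(), x_adv.split())
            self.successes += 1

    def get_result(self) -> Dict[str, float]:
        # Guard against division by zero when no attack succeeded
        avg = self.total / self.successes if self.successes else 0.0
        return {"Successful Instances": self.successes, "Avg. Unigram Precision": avg}


metric = RunningMetric()
metric.update(True, "the movie was good", "the film was good")  # 3 of 4 tokens match
metric.update(False, "a dull plot", "a dull plot")              # failed attack, ignored
metric.update(True, "great acting", "great acting")             # identical sentences
result = metric.get_result()
print(result)
```

Unlike the example above, this sketch returns 0 when there are no successful instances instead of dividing by zero; whether that default is appropriate depends on how the result is reported.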
Attack Models
According to the level of perturbations imposed on the original input, textual adversarial attack models can be categorized into sentence-level, word-level, and character-level attack models.
According to the accessibility to the victim model, textual adversarial attack models can be categorized into gradient-based, score-based, decision-based and blind attack models.
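The difference between score-based and decision-based access can be made concrete with a toy victim. This is a sketch under stated assumptions: `toy_get_prob` and `decision_access` are hypothetical helpers, not OpenAttack functions. Gradient-based attacks additionally need the model's internals, and blind attacks query nothing at all, so neither is shown.

```python
from typing import Callable, List

ProbFn = Callable[[str], List[float]]


def toy_get_prob(sentence: str) -> List[float]:
    """Hypothetical victim: returns [P(negative), P(positive)]."""
    words = sentence.lower().split()
    pos = sum(w in {"good", "great"} for w in words)
    neg = sum(w in {"bad", "awful"} for w in words)
    p_pos = (pos + 1) / (pos + neg + 2)  # smoothed share of positive words
    return [1 - p_pos, p_pos]


def decision_access(get_prob: ProbFn) -> Callable[[str], int]:
    """Decision-based view: only the predicted label is visible."""
    def get_pred(sentence: str) -> int:
        probs = get_prob(sentence)
        return max(range(len(probs)), key=probs.__getitem__)
    return get_pred


score_view = toy_get_prob                      # score-based attackers see probabilities
decision_view = decision_access(toy_get_prob)  # decision-based attackers see labels only

# A score-based attacker can rank candidate edits by how far they lower the
# positive score, even while the predicted label has not flipped yet.
candidates = ["good great film", "good film", "good bad film"]
ranked = sorted(candidates, key=lambda s: score_view(s)[1])
print(ranked)
print([decision_view(s) for s in candidates[:2]])
```

Here the first two candidates receive the same label, so a decision-based attacker cannot tell them apart, while the score view shows that dropping "great" already moves the model toward the negative class.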
TAADPapers is a paper list which summarizes almost all the papers concerning textual adversarial attack and defense. You can have a look at this list to find more attack models.
Currently OpenAttack includes 13 typical attack models against text classification models that cover all attack types.
The following table lists the currently included attack models and compares them.
Model | Accessibility | Perturbation | Main Idea |
---|---|---|---|
SEA | Decision | Sentence | Rule-based paraphrasing |
SCPN | Blind | Sentence | Paraphrasing |
GAN | Decision | Sentence | Text generation by encoder-decoder |
SememePSO | Score | Word | Particle Swarm Optimization-based word substitution |
TextFooler | Score | Word | Greedy word substitution |
PWWS | Score | Word | Greedy word substitution |
Genetic | Score | Word | Genetic algorithm-based word substitution |
FD | Gradient | Word | Gradient-based word substitution |
TextBugger | Gradient, Score | Word+Char | Greedy word substitution and character manipulation |
UAT | Gradient | Word, Char | Gradient-based word or character manipulation |
HotFlip | Gradient | Word, Char | Gradient-based word or character substitution |
VIPER | Blind | Char | Visually similar character substitution |
DeepWordBug | Score | Char | Greedy character manipulation |
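To pick attack models by category programmatically, the table can be transcribed into a small lookup structure. The sketch below is an illustration only: `AttackModel`, `ATTACK_MODELS`, and `select` are hypothetical names and not part of OpenAttack; the rows are copied from the table above.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class AttackModel(NamedTuple):
    name: str
    accessibility: FrozenSet[str]
    perturbation: FrozenSet[str]


def _row(name: str, accessibility: str, perturbation: str) -> AttackModel:
    # Comma-separated values cover models that fall into several categories
    return AttackModel(
        name,
        frozenset(accessibility.split(",")),
        frozenset(perturbation.split(",")),
    )


# Rows transcribed from the comparison table
ATTACK_MODELS = [
    _row("SEA", "decision", "sentence"),
    _row("SCPN", "blind", "sentence"),
    _row("GAN", "decision", "sentence"),
    _row("SememePSO", "score", "word"),
    _row("TextFooler", "score", "word"),
    _row("PWWS", "score", "word"),
    _row("Genetic", "score", "word"),
    _row("FD", "gradient", "word"),
    _row("TextBugger", "gradient,score", "word,char"),
    _row("UAT", "gradient", "word,char"),
    _row("HotFlip", "gradient", "word,char"),
    _row("VIPER", "blind", "char"),
    _row("DeepWordBug", "score", "char"),
]


def select(accessibility: Optional[str] = None,
           perturbation: Optional[str] = None) -> List[str]:
    """Return the names of models matching the given categories."""
    return [
        m.name for m in ATTACK_MODELS
        if (accessibility is None or accessibility in m.accessibility)
        and (perturbation is None or perturbation in m.perturbation)
    ]


print(select(accessibility="blind"))
print(select(accessibility="gradient", perturbation="char"))
```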
Toolkit Design
Considering the significant distinctions among different attack models, we leave considerable freedom for the skeleton design of attack models, and focus more on streamlining the general processing of adversarial attacking and the common components used in attack models.
OpenAttack has 7 main modules: