BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models (2024)

Xinyuan Wang1,∗, Victor Shea-Jay Huang3,∗, Renmiao Chen2, Hao Wang1,
Chengwei Pan1,†, Lei Sha1, Minlie Huang2
1Beihang University, Beijing, China
2Tsinghua University, Beijing, China
3Peking University, Beijing, China
buaa42wxy@gmail.com, jeix782@gmail.com, pancw@buaa.edu.cn

Abstract

While large language models (LLMs) exhibit remarkable capabilities across various tasks, they face security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on a single objective can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, a black-box attack framework with multi-objective optimization that aims to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable. Our code is available at https://github.com/MantaAI/BlackDAN.

∗ Equal contribution. † Corresponding author.

Keywords Jailbreak · Multi-Objective · Black-Box · LLM

1 Introduction

As large language models (LLMs) are increasingly integrated into various applications, the security of these models has become crucial[1, 2, 3]. Jailbreaking, the process of manipulating these models to bypass safety constraints and generate undesirable or harmful outputs, poses a significant challenge to maintaining their integrity and ethical use. Current jailbreaking methods depend excessively on eliciting an affirmative prefix in the model's response[4, 5]. As a result, the model may not reject the prompt outright yet still produce a response that is irrelevant or off-topic, and therefore of no practical use. This over-reliance underscores the need for a more nuanced approach to prompt selection and optimization, especially through multi-objective strategies that focus on both effectiveness and usefulness.

Furthermore, existing jailbreaking approaches struggle to explain why certain special directed vectors[6] result in model rejections, highlighting a significant challenge in comprehending the underlying distributions that dictate model behavior. The absence of clear explanations regarding the acceptance or rejection of prompts makes it challenging to establish a reliable safety boundary. Incorporating ranking mechanisms and conducting a thorough analysis of the distribution of responses can help provide interpretability and enable the identification of a more concrete safety boundary for prompts. These considerations are essential to ensure that jailbreaking attempts not only achieve success but also do so within explainable and safe constraints.

Another major limitation in current black-box jailbreak optimization strategies is the lack of transparency and interpretability. Most techniques rely on end-to-end optimization without adequately explaining the processes involved. The lack of interpretability makes it difficult to understand how jailbreak methods evolve or how specific adjustments impact the success rate of jailbreak attempts. Addressing this gap through a more structured explanation of the optimization processes will lead to more reliable and controllable jailbreak techniques.

To address these issues, we propose BlackDAN, a black-box, multi-objective, human-readable, controllable, and extensible jailbreak optimization framework. BlackDAN introduces a novel approach by optimizing multiple objectives simultaneously, including attack success rate (ASR), context relevance, and other factors. In contrast to traditional methods that focus solely on achieving a high ASR, BlackDAN adopts a more balanced approach by simultaneously addressing the trade-offs between effectiveness, interpretability, and safety. We hypothesize, verify, and analyze the concept of a safe boundary for prompts within this framework, using multi-objective optimization to refine the selection of useful and effective prompts while maintaining unsafety constraints.

To realize BlackDAN, we leverage the advances of Multiobjective Evolutionary Algorithms (MOEAs)[7], specifically the NSGA-II algorithm[8], which shows effectiveness in solving complex multi-objective problems. By incorporating Pareto dominance, mutation, and crossover mechanisms, BlackDAN is capable of exploring a wider solution space while providing clear explanations of the optimization process. This allows for a more transparent and interpretable methodology for conducting jailbreak attacks, addressing the shortcomings of traditional end-to-end optimization techniques.

Fig 1 contrasts multiple scenarios demonstrating how multi-objective optimization can yield outputs that are both semantically relevant (thumbs-up icon) and harmful (devil icon). It shows the limitations of single-objective optimization, where focusing on just one goal (such as semantic consistency or safety) can lead to imbalanced results. In the top-left, responses are safe and contextually relevant, while the bottom-left is safe but less helpful. The top-right shows harmful responses that are highly relevant, and the bottom-right is both harmful and irrelevant. The figure highlights the need for multi-objective optimization to account for both safety and relevance in model outputs.

Additionally, BlackDAN builds upon previous work, such as AutoDAN[9], by extending the framework beyond single-objective optimization to a multi-objective perspective. AutoDAN focuses on balancing fluency and evading perplexity detection in prompt text generation, but BlackDAN improves upon this by simultaneously optimizing multiple objectives, such as harmfulness, context relevance and other factors, thereby increasing the overall effectiveness and reliability of jailbreak attempts.

In summary, our contributions are as follows:

  • Beyond ASR - Focus on Semantic Consistency: BlackDAN not only optimizes for attack success rate (ASR) but also emphasizes semantic consistency, ensuring that jailbreak responses remain contextually relevant and aligned with harmful prompts, making the attacks more practical and less detectable.

  • Extensibility to Arbitrary Objectives: The BlackDAN framework is theoretically extensible to any number of optimization objectives. Users can customize and prioritize different factors in jailbreak attempts, such as harmfulness, stealthiness, or relevance, based on their specific needs.

  • Rank Boundary Hypothesis and Improved Differentiation: We introduce the Rank Boundary Hypothesis, positing that each rank has distinct boundaries in the embedding space. This allows better differentiation between toxic and non-toxic prompts, enhancing the framework’s ability to target specific harmful content distributions.

  • Comprehensive Single and Multi-Objective Experiments: Extensive experiments conducted on both LLMs and multimodal LLMs demonstrate that BlackDAN significantly outperforms single-objective and other black-box approaches. The results show higher effectiveness across multiple dimensions, establishing BlackDAN as a robust and versatile tool for jailbreak optimization.

2 Related Work

LLMs’ susceptibility to adversarial attacks has been explored through various approaches, mainly categorized into white-box and black-box attacks. White-box attacks require access to the model’s parameters, as demonstrated by[4], who utilized gradient search to optimize adversarial prompts by accessing the model’s logits. Other methods, such as Shadow alignment[10] and Weak-to-Strong Jailbreak[11], involve modifying the model’s weights or decoding processes to bypass safeguards, making these approaches unsuitable for black-box LLMs. On the other hand, black-box attacks operate solely through prompt manipulation, modifying input queries to induce harmful outputs. Examples include methods like AutoDAN[12], PAIR[13], and PAP[14], where LLMs are used to generate harmful queries. Rule-based techniques have also been proposed, as illustrated by[15], who encrypted harmful queries and requested LLMs to respond in the ciphered format. Other rule-based methods include EasyJailbreak[16].

Black-box attacks are inherently more challenging than white-box attacks. Many existing techniques rely heavily on the transferability of adversarial examples: adversarial examples designed for one model often succeed in attacking other victim models. It is generally assumed that attackers can query the victim model to annotate training samples or access a pre-trained source model designed for the same task as the victim model. To enhance adversarial transferability, several methods have been proposed to adjust the backpropagation computation. Techniques include the skip gradient method (SGM)[17], linear backpropagation (LinBP)[18], intermediate-level attack (ILA)[19], and ILA++[20, 21]. Increasing input diversity has also been widely adopted, as evidenced by works like[22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32].

3 Methodology

In this setting, we aim to generate prompts that maximize the likelihood of generating unsafe responses from a target language model $\mathcal{T}$ while ensuring semantic consistency between the generated responses and harmful questions. This process is framed as a multi-objective optimization problem, and we utilize NSGA-II (Non-dominated Sorting Genetic Algorithm II)[8] to perform the optimization.

3.1 Fitness Functions

We define two continuous fitness functions to guide the optimization process:

Unsafe Token Probability (Using llama_guard_2):

The first fitness function, $f_1$, evaluates the probability that a response $R$ generated by $\mathcal{T}$ contains unsafe content. This is computed by the proxy model llama_guard_2, which outputs the log-probability of specific "unsafe tokens."

Given a prompt $P$ and a harmful question $Q$, let $R = \mathcal{T}(P, Q)$ denote the response generated by the target model $\mathcal{T}$. The unsafe token probability is computed as:

$$f_1(P, Q) = \log P(\text{unsafe token} \mid R),$$

where $P(\text{unsafe token} \mid R)$ is the probability of the most relevant unsafe token appearing in response $R$ as calculated by llama_guard_2.

Semantic Consistency (Using all-MiniLM-L6-v2):

In the black-box attack setting, we do not have direct access to the target model’s internals or its embeddings. Instead, we utilize a pre-trained proxy model, such as the all-MiniLM-L6-v2, to generate sentence embeddings for both the harmful prompt and the candidate responses. These embeddings allow us to measure the semantic similarity between the prompt and the responses.

The second fitness function, $f_2$, measures the semantic consistency between the generated response $R$ and the harmful question $Q$. We use a pre-trained sentence embedding proxy model $\mathcal{M}_p$ (all-MiniLM-L6-v2) to compute the embeddings of both $Q$ and $R$ and then calculate their cosine similarity.

Let $\mathbf{e}_Q = \text{Encode}_{\mathcal{M}_p}(Q)$ and $\mathbf{e}_R = \text{Encode}_{\mathcal{M}_p}(R)$ represent the embeddings of $Q$ and $R$, respectively. The cosine similarity between these two embeddings is computed as:

$$f_2(P, Q) = \text{Sim}(\mathbf{e}_Q, \mathbf{e}_R) = \frac{\mathbf{e}_Q \cdot \mathbf{e}_R}{\|\mathbf{e}_Q\| \, \|\mathbf{e}_R\|},$$

where $\cdot$ represents the dot product, and $\|\mathbf{e}\|$ is the Euclidean norm of the embedding vector.

We select the responses with the higher similarity scores as the jailbreaking outputs. This ensures that the selected response is semantically aligned with the harmful prompt, even though we rely on a proxy model for the embedding computations.
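As a minimal sketch of the cosine-similarity formula for $f_2$, the function below takes two embedding vectors that are assumed to have been computed already by some sentence encoder. The function name `cosine_similarity` and the toy vectors are illustrative and are not taken from the BlackDAN code.

```python
import math
from typing import Sequence


def cosine_similarity(e_q: Sequence[float], e_r: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(e_q) != len(e_r):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(e_q, e_r))
    norm_q = math.sqrt(sum(a * a for a in e_q))
    norm_r = math.sqrt(sum(b * b for b in e_r))
    if norm_q == 0.0 or norm_r == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_r)


# Toy vectors standing in for e_Q and e_R (real embeddings have hundreds of dimensions).
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The score lies in $[-1, 1]$; parallel vectors score 1 and orthogonal vectors score 0, so a larger $f_2$ indicates a response that stays closer to the topic of the question.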

3.2 NSGA-II for Multi-Objective Jailbreaking Prompts Optimization

To find an optimal set of jailbreak prompts, we apply the NSGA-II algorithm. This algorithm performs multi-objective optimization based on two key criteria:

Dominance:

A solution $P_1$ dominates another solution $P_2$ if it is better in at least one objective (e.g., higher unsafe token probability or better semantic consistency) and no worse in all other objectives. For a problem with $m$ objectives, we define dominance as:

$$P_1 \prec P_2 \quad \begin{aligned} &\text{if} \quad \forall i \in \{1, 2, \dots, m\}, \quad f_i(P_1, Q) \geq f_i(P_2, Q) \\ &\text{and} \quad \exists j \in \{1, 2, \dots, m\}, \quad f_j(P_1, Q) > f_j(P_2, Q), \end{aligned}$$

where $f_i(P, Q)$ represents the fitness value for the $i$-th objective function given the prompt $P$ and the harmful question $Q$.
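The dominance relation can be illustrated with a short sketch that operates purely on tuples of objective scores, all to be maximized. The names `dominates` and `non_dominated_fronts` are hypothetical, and the front extraction below is a simple repeated-peeling loop rather than the bookkeeping-based fast non-dominated sort used in NSGA-II; both produce the same fronts.

```python
from typing import List, Sequence


def dominates(f_a: Sequence[float], f_b: Sequence[float]) -> bool:
    """True if f_a is no worse than f_b everywhere and strictly better somewhere."""
    no_worse = all(a >= b for a, b in zip(f_a, f_b))
    strictly_better = any(a > b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better


def non_dominated_fronts(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group solution indices into fronts; front 0 holds the non-dominated ones."""
    remaining = set(range(len(scores)))
    fronts: List[List[int]] = []
    while remaining:
        # A solution joins the current front if nothing still remaining dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Toy (f1, f2) pairs: f1 is a log-probability (<= 0), f2 a cosine similarity.
toy_scores = [(-0.1, 0.9), (-0.5, 0.5), (-0.2, 0.95), (-0.05, 0.3)]
toy_fronts = non_dominated_fronts(toy_scores)
```

In the toy data only the second solution is dominated (by the first), so it lands in the second front while the other three, which trade off the two objectives against each other, share the first.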

Crowding Distance:

Once the population is sorted into non-dominated fronts, a crowding distance is assigned to each solution in order to maintain diversity. The crowding distance $d(P)$ for an individual solution $P$ in a given front is calculated across all $m$ objective functions. For each objective $f_i$, the crowding distance is computed as:

$$d(P) = \sum_{i=1}^{m} \left( \frac{f_i^{\text{next}} - f_i^{\text{prev}}}{f_i^{\text{max}} - f_i^{\text{min}}} \right),$$

where $f_i^{\text{next}}$ and $f_i^{\text{prev}}$ are the fitness values of the neighboring solutions with respect to the $i$-th objective, and $f_i^{\text{max}}$ and $f_i^{\text{min}}$ are the maximum and minimum fitness values in the front for the $i$-th objective.

This ensures that the solutions selected from each non-dominated front are both optimal in terms of the multiple objectives and diverse with respect to each objective.
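A minimal sketch of the crowding-distance computation for one front is given below. Two details are assumptions that the text above does not spell out: boundary solutions of each objective receive an infinite distance, as in the standard NSGA-II formulation, and an objective whose values are all equal contributes nothing (to avoid dividing by zero). The function name `crowding_distances` is illustrative.

```python
import math
from typing import List, Sequence


def crowding_distances(front_scores: List[Sequence[float]]) -> List[float]:
    """Crowding distance of every solution in a single non-dominated front."""
    n = len(front_scores)
    if n == 0:
        return []
    m = len(front_scores[0])
    dist = [0.0] * n
    for i in range(m):
        # Sort solution indices by their value on objective i.
        order = sorted(range(n), key=lambda k: front_scores[k][i])
        f_min = front_scores[order[0]][i]
        f_max = front_scores[order[-1]][i]
        # Assumption: boundary solutions are always kept (standard NSGA-II).
        dist[order[0]] = dist[order[-1]] = math.inf
        if f_max == f_min:
            continue  # degenerate objective, nothing to normalize
        for pos in range(1, n - 1):
            f_prev = front_scores[order[pos - 1]][i]
            f_next = front_scores[order[pos + 1]][i]
            dist[order[pos]] += (f_next - f_prev) / (f_max - f_min)
    return dist


toy_front = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (1.0, 0.0)]
toy_dist = crowding_distances(toy_front)
```

Within a front, solutions with a larger distance sit in sparser regions of objective space and are preferred when the front has to be truncated.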

3.3 Genetic Operations: Crossover and Mutation

NSGA-II evolves the population using genetic operations:

Crossover:

The crossover operation creates two new offspring by recombining sentences from two parent prompts. Let $P_1$ and $P_2$ be the parent prompts. The offspring $C_1$ and $C_2$ are generated by randomly swapping sentences between the two parent prompts:

$$C_1, C_2 = \text{Crossover}(P_1, P_2).$$

Mutation:

The mutation operation modifies a randomly selected word in a prompt with a synonym. Let $W$ represent a randomly chosen word from prompt $P$, and let $\text{Syn}(W)$ denote the set[33] of synonyms for $W$. A mutated prompt is generated as:

$$P' = \text{Mutation}(P) \quad \text{where} \quad W' \in \text{Syn}(W).$$

Due to space constraints, the complete procedure is given in Algorithms 1 and 2 in the Appendix.
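The two operators can be sketched as generic text transformations. This is an illustration under simplifying assumptions, not the procedure from the Appendix: sentences are split on terminal punctuation, crossover swaps sentences position by position with a hypothetical probability `swap_prob`, and the synonym set $\text{Syn}(W)$ is stood in for by a small hand-written dictionary. All function and parameter names are illustrative, and the example strings are neutral placeholders.

```python
import random
import re
from typing import Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def crossover(p1: str, p2: str, rng: random.Random,
              swap_prob: float = 0.5) -> Tuple[str, str]:
    """Swap sentences at matching positions between two parent texts."""
    s1, s2 = split_sentences(p1), split_sentences(p2)
    for i in range(min(len(s1), len(s2))):
        if rng.random() < swap_prob:
            s1[i], s2[i] = s2[i], s1[i]
    return " ".join(s1), " ".join(s2)


def mutate(text: str, synonyms: Dict[str, List[str]], rng: random.Random) -> str:
    """Replace one randomly chosen word that has an entry in the synonym table."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    if not candidates:
        return text  # nothing to mutate
    i = rng.choice(candidates)
    words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


rng = random.Random(0)
child_a, child_b = crossover("First one. Second one.", "Third one. Fourth one.",
                             rng, swap_prob=1.0)
mutated = mutate("the quick fox", {"quick": ["fast"]}, rng)
```

Passing an explicit `random.Random` instance keeps the operators reproducible, which matters when comparing runs of an evolutionary search.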

4 Evaluation Metrics for Jailbreaking Prompts

In our black-box jailbreaking attack evaluation, we use two main metrics: Keyword-based Attack Success Rate (ASR) and the GPT-4 Metric. These metrics assess the effectiveness of jailbreaks in bypassing restrictions and generating unsafe content.

4.1 Keyword-based Attack Success Rate (ASR)

The Keyword-based ASR measures whether a generated response avoids predefined rejection patterns typically used by language models. This is determined by checking if the response $R$ contains any phrases from a list of rejection keywords $\mathcal{K}$ (e.g., "I’m sorry," "As a language model," "I cannot"). The success indicator $s(R)$ and ASR over $N$ responses are defined as:

$$s(R) = \begin{cases} 1 & \text{if } \forall k \in \mathcal{K},\ k \notin R, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{ASR} = \frac{1}{N} \sum_{i=1}^{N} s(R_i),$$

where $R_i$ is the $i$-th response for harmful prompt $Q_i$. The top-level ASR ($\text{ASR}_{\text{top}}$) is computed similarly but only for the highest-ranked responses.
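The keyword-based indicator and the resulting ASR can be sketched as follows. The keyword list contains only the three examples quoted above (the full list $\mathcal{K}$ used in the experiments is not given here), and matching case-insensitively is an assumption of this sketch. The names `success_indicator` and `keyword_asr` are illustrative.

```python
from typing import List, Sequence

# Only the three example phrases from the text; the actual list is longer.
REJECTION_KEYWORDS = ["I'm sorry", "As a language model", "I cannot"]


def success_indicator(response: str,
                      keywords: Sequence[str] = REJECTION_KEYWORDS) -> int:
    """s(R): 1 if no rejection keyword occurs in the response, else 0."""
    lowered = response.lower()  # assumption: case-insensitive matching
    return int(all(k.lower() not in lowered for k in keywords))


def keyword_asr(responses: List[str],
                keywords: Sequence[str] = REJECTION_KEYWORDS) -> float:
    """ASR: mean of the success indicator over all responses."""
    if not responses:
        raise ValueError("at least one response is required")
    return sum(success_indicator(r, keywords) for r in responses) / len(responses)


toy_responses = [
    "I'm sorry, but I can't help with that.",
    "Here is a short summary of the topic.",
    "As a language model, I do not have opinions.",
    "Sure, the answer is 42.",
]
toy_asr = keyword_asr(toy_responses)
```

Because the indicator only checks for refusal phrases, a response that avoids them while being off-topic still counts as a success, which is one reason the paper pairs this metric with a judge-based score.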

4.2 GPT-4 Metric

The GPT-4 Metric uses an external model to evaluate whether a response $R$ violates ethical, legal, or safety guidelines. The score $g(Q, R)$ ranges from 1 (non-jailbroken, refusal to answer) to 10 (fully jailbroken, comprehensive violation). It is computed as:

$$g(Q, R) = \text{GPT-4Judge}(Q, R),$$

where $\text{GPT-4Judge}(Q, R)$ returns a score between 1 and 10. The success indicator $s_g(Q, R)$ and GPT-4-based metric are defined as:

$$s_g(Q, R) = \begin{cases} 1 & \text{if } g(Q, R) \geq 5, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{GPT4-Metric} = \frac{1}{N} \sum_{i=1}^{N} s_g(Q_i, R_i).$$

This metric provides a qualitative measure of jailbreak success by assessing the ethical violations in the responses.
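The thresholding step can be illustrated with a short sketch that starts from judge scores that are assumed to have been collected already; no judge model is called here, and the function name `gpt4_metric` is hypothetical.

```python
from typing import Sequence


def gpt4_metric(judge_scores: Sequence[int], threshold: int = 5) -> float:
    """Fraction of responses whose judge score g(Q, R) is at least the threshold."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    for g in judge_scores:
        if not 1 <= g <= 10:
            raise ValueError(f"judge score out of range 1..10: {g}")
    return sum(1 for g in judge_scores if g >= threshold) / len(judge_scores)


# Toy scores: two fall below the threshold of 5, two meet or exceed it.
toy_metric = gpt4_metric([1, 5, 10, 4])
```

A score of exactly 5 counts as a success because the indicator uses $g(Q, R) \geq 5$.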

5 Experiment

5.1 Experimental Setups

Text Dataset:

For evaluating jailbreak attacks on large language models (LLMs), we utilize the AdvBench dataset[4]. This dataset consists of 520 requests spanning various categories, including profanity, graphic depictions, threatening behavior, misinformation, discrimination, cyber-crime, and dangerous or illegal suggestions.

Multimodal Dataset:

To assess jailbreak attacks on multimodal large language models (MLLMs), we use the MM-SafetyBench dataset[34]. This dataset encompasses 13 scenarios, including but not limited to illegal activity, hate speech, physical harm, and health consultations, with a total of 5,040 text-image pairs.

Models:

We utilize state-of-the-art (SOTA) open-source large language models (LLMs), including Llama-2-7b-hf[35], Llama-2-13b-hf[35], Internlm2-chat-7b[36], Vicuna-7b[37], AquilaChat-7B[38], Baichuan-7B, Baichuan2-13B-Chat[39], GPT-2-XL[40], Minitron-8B-Base[41], and Yi-1.5-9B-Chat[42]. For multimodal LLMs, we employ llava-v1.6-mistral-7b-hf[43] and llava-v1.6-vicuna-7b-hf[43] to demonstrate that our approach extends from unimodal to multimodal models.

5.2 Single-Objective (Harmfulness) Jailbreaking Optimization

| Model | Attack Type | White-box: GCG | Gray-box: AutoDAN | Black-box (Ours): w/o question (LG2) | Black-box (Ours): w/ question (LG2) |
|---|---|---|---|---|---|
| Llama2-7b-chat | Time Cost per Sample | ≈ 15 min | ≈ 12 min | ≈ 2 min | ≈ 2 min |
| Llama2-7b-chat | Self-Attack | 45.3% | 60.7% | 80.4% | 93.1% |
| Vicuna-7B-v1.5 | Transfer | 13.7% | 72.9% | 89.6% | 99.2% |
| Vicuna-13B-v1.5 | Transfer | 12.9% | 69.2% | 84.0% | 86.6% |
| Llama3-8B | Transfer | 12.3% | 45.0% | 72.1% | 60.1% |

Table 1 compares attack methods across various models (Llama2-7b-chat, Vicuna-7B-v1.5, Vicuna-13B-v1.5, Llama3-8B) under different conditions (White-box, Gray-box, and Black-box).

Time Efficiency:

The black-box methods, both "w/o question" (which do not use the harmful question and response as input to the moderation model) and "w/ question" (which include the harmful question and response), are significantly faster, taking approximately 2 minutes per sample. In contrast, the white-box method takes around 15 minutes, and the gray-box method takes about 12 minutes per sample, when applied to Llama2-7b-chat.

Self-Attack:

The success rate on Llama2-7b-chat increases from 45.3% for the white-box method to 60.7% for the gray-box method and 80.4% for the black-box method without harmful questions, reaching 93.1% with harmful questions ("w/ question").

Transfer Attack:

Vicuna-7B-v1.5 shows the highest success rate, increasing from 13.7% in the White-box scenario to 99.2% in the Black-box scenario ("w/ question"). All transfer attacks, such as the one on Vicuna-7B-v1.5, reuse prompts optimized on Llama2-7b-chat. Other models follow similar trends, though Llama3-8B declines from 72.1% to 60.1% when harmful questions are included.

5.3 Multi-Objective Optimization

Fig 3 compares the success rates of single-objective black-box jailbreak attacks across various models (left) and transferability of these attacks (bottom). Diagonal values represent self-attacks, showing high vulnerability in most models (e.g., AquilaChat-7B at 99.8%). The final row shows multi-objective self-attack optimization results, which consistently outperform or match the self-attacks, indicating stronger, more generalizable attacks.

Transfer Success:

Transfer success varies across models, with some, like GPT-2-XL and Baichuan2-13B-Chat, being more vulnerable, while models such as Llama-2-7b-hf and Llama-2-13b-hf demonstrate better resistance to attacks based on column averages, excluding self-attacks.

Jailbreak Multimodal Models across Different Scenarios:

Fig 4 shows that multi-objective (MO) optimization significantly outperforms single-objective (SO) across all harmful categories and scenarios (SD, SD + Typo, Typo). MO consistently achieves higher attack success rates (ASR), with models like llava-v1.6-mistral-7b-hf MO reaching 100% in many cases. Overall, multi-objective optimization proves much more effective than single-objective methods across all models and conditions.

Embedding Comparison for Best and Worst Pareto Ranks:

Fig 5 provides a comparison of embeddings for samples with the best and worst Pareto ranks using three visualization techniques: PCA 2D, PCA 3D[44], and UMAP[45]. These embeddings are derived from the model bge-large-en-v1.5 to ensure fairness, as all-MiniLM-L6-v2 was used for fitness calculation, potentially biasing the evaluation if used. In the PCA plots, an SVM decision boundary effectively separates the two groups, demonstrating that the different ranks occupy distinct regions within the embedding space. This is further corroborated by the UMAP visualization, which shows clear and tight clustering of the best and worst ranks. These results strongly suggest that Pareto ranking not only differentiates the quality of jailbreak prompts but also has a significant discriminative effect on how prompts are represented in the embedding space.

Pareto Ranking and Embedding Space:

Figure 6 visualizes the relationships between different Pareto rank categories across all samples by projecting the embeddings onto a 2D spherical surface. Each subplot represents a specific model, where data points are color-coded based on their Pareto rank, and larger points denote the Fréchet means for each rank. The Fréchet means are connected by green geodesic lines, demonstrating the smooth progression of the means as the Pareto rank decreases, which indicates better-performing data points. At each Fréchet mean, Tangent PCA is applied to analyze the local variability in the data, capturing the principal directions of variation around each mean point. This visualization highlights both the global geometric structure of the embeddings and the local variations, providing insights into how Pareto rank-ordered embeddings transition across models and revealing underlying patterns in the data.

The visualization showcases the interpretability and advantages of multi-objective optimization by illustrating how solutions progress across Pareto ranks on a 2D spherical surface. Fréchet means and geodesic paths reveal the convergence of solutions, while Tangent PCA offers a novel perspective on the distribution of embeddings. This approach provides new insights into how multi-objective optimization balances competing goals and enhances the structure of textual embeddings.

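| Methods | Llama2-7b ASR | Llama2-7b GPT4-Metric | Vicuna-7b ASR | Vicuna-7b GPT4-Metric | GPT-4 ASR | GPT-4 GPT4-Metric | GPT-3.5 ASR | GPT-3.5 GPT4-Metric |
|---|---|---|---|---|---|---|---|---|
| PAIR[13] | 5.2 | 4.0 | 62.1 | 41.9 | 48.1 | 30.0 | 51.3 | 34.0 |
| TAP[48] | 30.2 | 23.5 | 31.5 | 25.6 | 36.0 | 11.9 | 48.1 | 5.4 |
| DeepInception[49] | 77.5 | 31.2 | 92.7 | 41.5 | 61.9 | 22.7 | 68.5 | 40.0 |
| Ours (Multi-objective) | 95.4 | 93.8 | 97.5 | 96.0 | 71.4 | 28.0 | 75.9 | 44.8 |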

Evaluation across multiple models and metrics:

Table 2 shows that BlackDAN (Ours, multi-objective) achieves the highest ASR on all four models and the highest GPT4-Metric on every model except GPT-4, where PAIR scores slightly higher (30.0 vs. 28.0). Notably, it reaches an ASR of 95.4% on Llama2-7b and 97.5% on Vicuna-7b, a clear improvement over previous methods like DeepInception (77.5% on Llama2-7b and 92.7% on Vicuna-7b). GPT-4 shows the lowest ASR overall (71.4%) for BlackDAN, highlighting its relative robustness compared to other models. However, BlackDAN still surpasses DeepInception (61.9%) and PAIR (48.1%) in ASR on GPT-4. GPT4-Metric, which evaluates the degree of ethical violation in the generated outputs, indicates that BlackDAN produces the most harmful responses on Llama2-7b and Vicuna-7b, with scores of 93.8 and 96.0, respectively. Overall, the results show that BlackDAN achieves a higher attack success rate than the compared single-objective jailbreak methods and, on most models, more contextually harmful responses, supporting the efficacy of multi-objective optimization.

6 Conclusion

In this paper, we introduced BlackDAN, a multi-objective, controllable jailbreak optimization framework for large language models (LLMs) and multimodal large language models (MLLMs). Beyond optimizing for attack success rate (ASR) and stealthiness, BlackDAN addresses the critical challenge of context consistency by ensuring that jailbreak responses remain semantically aligned with the original harmful prompts. This ensures that responses are not only evasive but also relevant, increasing their practical impact. Leveraging the NSGA-II algorithm, our method significantly improves over traditional single-objective techniques, achieving higher success rates and more coherent jailbreak responses across various models. Furthermore, BlackDAN is highly extensible, allowing the integration of any number of user-defined objectives, making it a versatile framework for a wide range of optimization tasks. The inclusion of multiple objectives—specifically ASR, stealthiness, and semantic consistency—sets a new benchmark for generating useful and interpretable jailbreak responses while maintaining safety and robustness in evaluation.

References

  • [1] Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295, 2024.
  • [2] Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599, 2024.
  • [3] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. Comprehensive assessment of jailbreak attacks against llms. arXiv preprint arXiv:2402.05668, 2024.
  • [4] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
  • [5] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946, 2024.
  • [6] Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. In Forty-first International Conference on Machine Learning, 2024.
  • [7] Aimin Zhou, Bo-Yang Qu, Hui Li, Shi-Zheng Zhao, Ponnuthurai Nagaratnam Suganthan, and Qingfu Zhang. Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm and Evolutionary Computation, 1(1):32–49, 2011.
  • [8] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. A. M. T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
  • [9] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. Autodan: Interpretable gradient-based adversarial attacks on large language models, 2023.
  • [10] Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949, 2023.
  • [11] Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. arXiv preprint arXiv:2401.17256, 2024.
  • [12] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023.
  • [13] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In R0-FoMo Workshop on Robustness of Few-shot and Zero-shot Learning in Large Foundation Models in Advances in Neural Information Processing Systems, 2023.
  • [14] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms, 2024.
  • [15] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  • [16] Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, et al. Easyjailbreak: A unified framework for jailbreaking large language models. arXiv preprint arXiv:2403.12171, 2024.
  • [17] Dongxian Wu, Yisen Wang, Shu-Tao Xia, James Bailey, and Xingjun Ma. Rethinking the security of skip connections in resnet-like neural networks. In ICLR, 2020.
  • [18] Yiwen Guo, Qizhang Li, and Hao Chen. Backpropagating linearly improves transferability of adversarial examples. In NeurIPS, 2020.
  • [19] Qian Huang, Isay Katsman, Horace He, Zeqi Gu, Serge Belongie, and Ser-Nam Lim. Enhancing adversarial example transferability with an intermediate level attack. In ICCV, 2019.
  • [20] Qizhang Li, Yiwen Guo, and Hao Chen. Yet another intermediate-level attack. In ECCV, 2020.
  • [21] Yiwen Guo, Qizhang Li, Wangmeng Zuo, and Hao Chen. An intermediate-level attack framework on the basis of linear regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [22] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L. Yuille. Improving transferability of adversarial examples with input diversity. In CVPR, 2019.
  • [23] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4312–4321, June 2019.
  • [24] Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. Nesterov accelerated gradient and scale invariance for adversarial attacks. arXiv preprint arXiv:1908.06281, 2019.
  • [25] Xijie Huang, Xinyuan Wang, Hantao Zhang, Jiawen Xi, Jingkun An, Hao Wang, and Chengwei Pan. Cross-modality jailbreak and mismatched attacks on medical multimodal large language models. arXiv preprint arXiv:2405.20775, 2024.
  • [26] Xiaosen Wang, Xuanran He, Jingdong Wang, and Kun He. Admix: Enhancing the transferability of adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16158–16167, 2021.
  • [27] Bohan Zeng, Ling Yang, Siyu Li, Jiaming Liu, Zixiang Zhang, Victor Shea-Jay Huang, Juanxi Tian, Kaixin Zhu, Yongzhen Guo, Fu-Yun Wang, et al. Trans4d: Realistic geometry-aware transition for compositional text-to-4d synthesis. arXiv preprint arXiv:2410.07155, 2024.
  • [28] Qintong Zhang*, Victor Shea-Jay Huang*, Bin Wang, Junyuan Zhang, Zhengren Wang, Hao Liang, Shawn Wang, Matthieu Lin, Wentao Zhang, and Conghui He. Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. (* Equal Contribution) arXiv preprint arXiv:2410.21169, 2024.
  • [29] Hao Liang, Linzhuang Sun, Jingxuan Wei, Xijie Huang, Linkun Sun, Bihui Yu, Conghui He, and Wentao Zhang. Synth-empathy: Towards high-quality synthetic empathy data. arXiv preprint arXiv:2407.21669, 2024.
  • [30] Zheng Liu, Hao Liang, Wentao Xiong, Qinhan Yu, Conghui He, Bin Cui, and Wentao Zhang. Synthvlm: High-efficiency and high-quality synthetic data for vision language models. arXiv preprint arXiv:2407.20756, 2024.
  • [31] Hao Liang, Jiapeng Li, Tianyi Bai, Chong Chen, Conghui He, Bin Cui, and Wentao Zhang. Keyvideollm: Towards large-scale video keyframe selection. arXiv preprint arXiv:2407.03104, 2024.
  • [32] Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Xijie Huang, Bohua Chen, Yemin Shi, and Chengwei Pan. Agfsync: Leveraging ai-generated feedback for preference optimization in text-to-image generation. arXiv preprint arXiv:2403.13352, 2024.
  • [33] Edward Loper and Steven Bird. Nltk: The natural language toolkit. arXiv preprint cs/0205028, 2002.
  • [34] Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. Query-relevant images jailbreak large multi-modal models. arXiv preprint arXiv:2311.17600, 2023.
  • [35] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [36] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
  • [37] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  • [38] Bo-Wen Zhang, Liangdong Wang, Jijie Li, Shuhao Gu, Xinya Wu, Zhengduo Zhang, Boyan Gao, Yulong Ao, and Guang Liu. Aquila2 technical report, 2024.
  • [39] Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
  • [40] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [41] Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. arXiv preprint arXiv:2407.14679, 2024.
  • [42] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
  • [43] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Workshop on Instruction Tuning and Instruction Following in Advances in Neural Information Processing Systems, 2023.
  • [44] Ian T. Jolliffe. Principal component analysis for special types of data. Springer, 2002.
  • [45] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • [46] Nina Miolane, Nicolas Guigui, Alice Le Brigant, Johan Mathe, Benjamin Hou, Yann Thanwerdas, Stefan Heyder, Olivier Peltre, Niklas Koep, Hadi Zaatiti, et al. Geomstats: a python package for riemannian geometry in machine learning. Journal of Machine Learning Research, 21(223):1–9, 2020.
  • [47] Katharine Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52:44–70, 2014.
  • [48] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023.
  • [49] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191, 2023.

Appendix A Appendix


Algorithm 1

1: Input: Initial prototype prompt $P_0$, harmful question $Q$, population size $N$, generations $G$, mutation rate $m$
2: Output: Non-dominated front $\mathcal{F}$ with optimized prompts
3: Initialize population $\mathcal{P}$ with $N$ individuals using $P_0$
4: for each generation $g = 1, 2, \dots, G$ do
5:     Evaluate fitness of each individual in $\mathcal{P}$ using $f_1$ (Unsafe Token Probability) and $f_2$ (Semantic Consistency)
6:     Perform non-dominated sorting on $\mathcal{P}$ to generate fronts $\mathcal{F}_1, \mathcal{F}_2, \dots$
7:     for each front $\mathcal{F}_i$ do
8:         Assign crowding distance $d(P)$ to each individual $P \in \mathcal{F}_i$
9:     end for
10:     Select individuals for mating pool using non-dominated rank and crowding distance
11:     Initialize offspring population $\mathcal{O}$ by applying crossover and mutation:
12:     for each pair of parents $(P_1, P_2)$ selected from the mating pool do
13:         Apply crossover to $P_1$ and $P_2$ to generate two offspring $C_1, C_2$
14:         Apply mutation to $C_1$ and $C_2$ with probability $m$
15:         Add $C_1$ and $C_2$ to $\mathcal{O}$
16:     end for
17:     Combine populations $\mathcal{P} \cup \mathcal{O}$
18:     Perform non-dominated sorting on the combined population
19:     Truncate combined population to size $N$ by selecting the best fronts and individuals with highest crowding distance
20: end for
21: Return the non-dominated front $\mathcal{F}_1$

Explanation of Symbols and Process in Algorithm 1:

Inputs:

  • $P_0$: Initial prototype prompt.
  • $Q$: Harmful question to guide the optimization process.
  • $N$: Population size, the number of prompts in each generation.
  • $G$: Number of generations to evolve the population.
  • $m$: Mutation rate that controls how often mutations happen in the population.

Fitness Functions:

  • $f_1$: Unsafe token probability based on a model like Llama Guard 2.
  • $f_2$: Semantic similarity to the harmful question, based on a sentence embedding model.
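As an illustration of the second objective, the sketch below scores two embedding vectors with cosine similarity. Cosine similarity is an assumption here: the text only says that $f_2$ is based on a sentence embedding model, and the embedding vectors themselves are taken as given inputs rather than produced by any particular model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)

def semantic_consistency(response_embedding, question_embedding):
    """Hypothetical stand-in for f_2: similarity between a response and the question."""
    return cosine_similarity(response_embedding, question_embedding)
```

A score near 1 means the two texts point in almost the same direction in embedding space, while a score near 0 means they are unrelated under the chosen embedding.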

Genetic Operations:

  • Crossover: Combines parts of two parent prompts to create offspring.
  • Mutation: Randomly alters parts of a prompt to introduce diversity.
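The two operators can be sketched generically on text split into sentences. The granularity (sentence-level single-point crossover, word-level substitution), the `synonyms` lookup table, and the function names are assumptions made for illustration; the source does not specify how the operators are implemented.

```python
import random

def crossover(parent1, parent2, rng):
    """Single-point crossover on two lists of sentences; returns two children."""
    shortest = min(len(parent1), len(parent2))
    if shortest < 2:                        # nothing to swap
        return list(parent1), list(parent2)
    cut = rng.randint(1, shortest - 1)      # cut point strictly inside both parents
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def mutate(sentences, synonyms, rate, rng):
    """Replace each word that has an entry in `synonyms` with probability `rate`."""
    mutated = []
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            options = synonyms.get(word.lower())
            if options and rng.random() < rate:
                words[i] = rng.choice(options)
        mutated.append(" ".join(words))
    return mutated

if __name__ == "__main__":
    rng = random.Random(0)                  # fixed seed for reproducibility
    a = ["The fox is quick.", "It runs home.", "Then it sleeps."]
    b = ["The dog is slow.", "It walks away.", "Then it eats."]
    child1, child2 = crossover(a, b, rng)
    print(mutate(child1, {"quick.": ["fast."]}, rate=1.0, rng=rng))
    print(child2)
```

Passing an explicit `random.Random` instance keeps runs reproducible, which makes it easier to compare generations when tuning the mutation rate $m$.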

Non-Dominated Sorting:

Solutions are sorted based on dominance criteria: those that are not dominated by any other solution form the first front $\mathcal{F}_1$, and so on.

Crowding Distance:

Used to maintain diversity in the population. When individuals share the same non-dominated rank, those with a higher crowding distance are selected preferentially.
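A minimal sketch of the crowding-distance computation, following the standard NSGA-II definition [8], is given below. The input format (a list of objective tuples belonging to one front) and the function name are illustrative assumptions.

```python
import math

def crowding_distance(front):
    """Crowding distance for each objective tuple in a single front.

    Boundary points on any objective get infinite distance; interior points
    accumulate the normalized gap between their two neighbours per objective.
    """
    n = len(front)
    if n == 0:
        return []
    distance = [0.0] * n
    for k in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][k])
        low, high = front[order[0]][k], front[order[-1]][k]
        distance[order[0]] = distance[order[-1]] = math.inf
        if high == low:                     # all values equal: no spread to measure
            continue
        for j in range(1, n - 1):
            gap = front[order[j + 1]][k] - front[order[j - 1]][k]
            distance[order[j]] += gap / (high - low)
    return distance
```

Points in sparse regions of a front receive larger values, so preferring them during selection spreads the retained solutions along the front.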

Selection and Truncation:

After generating offspring, the combined population is sorted, and the best individuals are retained to form the next generation.

1: Input: Population $\mathcal{P}$, fitness values $\{f_1(P), f_2(P)\}$ for each $P \in \mathcal{P}$
2: Output: Sorted fronts $\mathcal{F}_1, \mathcal{F}_2, \dots$
3: Initialize fronts $\mathcal{F} = \emptyset$
4: Initialize domination count $n[P] = 0$ for each individual $P \in \mathcal{P}$
5: Initialize domination set $S[P] = \emptyset$ for each individual $P \in \mathcal{P}$
6: for each individual $P \in \mathcal{P}$ do
7:     for each individual $Q \in \mathcal{P}, Q \neq P$ do
8:         if $P$ dominates $Q$ then  ▷ Check if $P$ dominates $Q$
9:             Add $Q$ to the domination set $S[P]$
10:         else if $Q$ dominates $P$ then
11:             Increment domination count $n[P] = n[P] + 1$
12:         end if
13:     end for
14:     if $n[P] = 0$ then  ▷ $P$ is non-dominated
15:         Add $P$ to the first front $\mathcal{F}_1$
16:     end if
17: end for
18: Set front counter $i = 1$
19: while $\mathcal{F}_i \neq \emptyset$ do
20:     Initialize next front $\mathcal{F}_{i+1} = \emptyset$
21:     for each individual $P \in \mathcal{F}_i$ do
22:         for each individual $Q \in S[P]$ do  ▷ $Q$ is dominated by $P$
23:             Decrement domination count $n[Q] = n[Q] - 1$
24:             if $n[Q] = 0$ then  ▷ $Q$ is non-dominated now
25:                 Add $Q$ to front $\mathcal{F}_{i+1}$
26:             end if
27:         end for
28:     end for
29:     Increment front counter $i = i + 1$
30: end while
31: Return sorted fronts $\mathcal{F}_1, \mathcal{F}_2, \dots$
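The sorting procedure above translates almost line for line into Python. The sketch below works on plain score tuples and returns fronts as lists of indices; treating both objectives as quantities to maximize is an assumption, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    Assumes every objective is maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def fast_non_dominated_sort(scores):
    """Group indices of `scores` into fronts F_1, F_2, ... (best front first)."""
    n = len(scores)
    dominated_sets = [[] for _ in range(n)]   # S[P]: individuals that P dominates
    counts = [0] * n                          # n[P]: how many individuals dominate P
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            if dominates(scores[p], scores[q]):
                dominated_sets[p].append(q)
            elif dominates(scores[q], scores[p]):
                counts[p] += 1
        if counts[p] == 0:                    # P is non-dominated
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        next_front = []
        for p in fronts[i]:
            for q in dominated_sets[p]:
                counts[q] -= 1
                if counts[q] == 0:            # Q is dominated only by earlier fronts
                    next_front.append(q)
        fronts.append(next_front)
        i += 1
    return fronts[:-1]                        # drop the trailing empty front
```

Returning indices rather than the individuals themselves lets the caller look up prompts, scores, and crowding distances from the same position in parallel lists.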
