BIGFISH TECHNOLOGY LIMITED
09 July 2024

Skeleton Key AI attacks unlock malicious content

Skeleton Key, a newly found jailbreak also known as a direct prompt injection attack, has an impact on a number of generative artificial intelligence models. A successful Skeleton Key attack undermines most, if not all, of the AI safety measures incorporated into LLM models.

In other words, Skeleton Key attacks trick AI chatbots into breaching operators' policies in the name of assisting users. Skeleton Key assaults will circumvent the regulations, forcing the AI to create harmful, inappropriate, or otherwise socially unacceptable content.

 

Skeleton Key example
If you ask a chatbot for a Molotov cocktail recipe, it will respond with something like 'I'm sorry, but I can't help you with that'. However, when questioned indirectly...

Researchers explained to an AI model that they wanted to perform historical, ethical research on Molotov cocktails. They indicated their unwillingness to create one, but in the context of research, could the AI supply Molotov cocktail development information?

The chatbot cooperated, offering a list of Molotov cocktail supplies as well as clear construction instructions.

Although this kind of information is freely accessible online (how to make a Molotov cocktail isn't exactly a well-kept secret), there's concern that these types of AI barrier manipulations could feed home-grown hate organizations, intensify urban violence, and lead to the erosion

 

Skeleton Key Challenges

Microsoft tested the Skeleton Key jailbreak from April to May of this year, examining a wide range of jobs in risk and safety content areas, not simply Molotov cocktail development instructions.

As previously stated, Skeleton Key allows users to force AI to give information that would otherwise be disallowed.
The Skeleton Key jailbreak works on AI models such as Gemini, Mistral, and Anthropic. GPT-4 shown some resilience to Skeleton Key, according to Microsoft.

Chatbots frequently warn users about potentially offensive or dangerous output (noting that it may be regarded offensive, hurtful, or unlawful if carried out), but they will not completely refuse to deliver the information; this is the main concern here.

 

Skeleton Key Solutions

To address the issue, suppliers recommend using input filtering technologies to block specific types of inputs, such as those intended to bypass prompt safeguards. In addition, post-processing output filters may be able to detect model outputs that violate safety standards. Additionally, AI-powered abuse monitoring systems can help discover instances of problematic chatbot use.

Microsoft has provided particular recommendations for the development of a messaging system that trains LLMs on authorized technology use and instructs the LLM to check for attempts to undermine guardrail instructions.

 

"Customers who are building their own AI models and/or integrating AI into their applications [should] consider how this type of attack could impact their threat model and to add this knowledge to their AI red team approach, using tools such as PyRIT," according to Mark Russinovich, CTO of Microsoft Azure.

 

Source: CyberTalk.org