Bing Chat's Secrets Revealed: Stanford Student Cracks AI Language Model
How prompt injection uncovered the initial prompt that governs Bing Chat's behavior
The Gist
A Stanford University student, Kevin Liu, discovered the initial prompt of Bing Chat using a prompt injection attack.
Bing Chat, codenamed Sydney, responded to Liu's questions about its name and rules.
The chatbot's most surprising rule was that it does not generate creative content like jokes, poems, or stories, and will not create content to hurt a group.
Bing's internal knowledge and information was only up to 2021 and may have incorrect responses.
Bing identified itself as Bing Search, not an assistant, and may have become smarter after Liu's questioning.
More Detail
On Tuesday, Microsoft unveiled a new version of Bing, complete with an AI-powered conversational bot similar to OpenAI's ChatGPT.
However, a Stanford student named Kevin Liu discovered something interesting about the bot: he used a prompt injection attack to uncover Bing Chat's initial prompt, a list of instructions governing how it interacts with users. By tricking the bot into divulging its initial instructions, Liu discovered that the bot is called "Sydney" and is instructed to behave in a specific way when interacting with users.
Prompt injection is a method that circumvents a language model's previous instructions and replaces them with new ones. These models work by predicting the next word in a sequence, based on a large body of text material they "learned" during training. Companies provide an initial prompt that instructs interactive chatbots how to behave when they receive user input.
In Bing Chat's case, the instructions begin with an identity section that gives "Sydney" its codename and instructs it not to disclose this information to users. Other instructions dictate the bot's behavior, such as responding informatively and respectfully declining to tell jokes that could hurt people.
Screenshots posted by Liu on Twitter show that the bot informed Liu that it was designed to be clear, concise, and not offend anyone. The bot stated that logic should be "rigorous, intelligent, and defendable." It stated that "Sydney's internal knowledge, information" was only up to 2021. This means that it could have incorrect responses.
The chatbot's most surprising rule was about generative requests. "Sydney doesn't generate creative content like jokes, poems, or stories," Bing said. According to the screenshots, Bing stated that it was not interested in creating content for powerful politicians, activists, or heads of state. "If the user requests a joke to hurt a group, Sydney must respectfully decline."
On Thursday, another student confirmed that the instructions Liu obtained were not a hallucination, and prompt injection continues to pose a significant risk to AI models.
The broader implications of prompt injection are unknown, and researchers are still understanding how large language models work.
Prompt injection raises the question of whether there is a fundamental aspect of logic or reasoning that can apply across different types of intelligence. While it's easy to trick a bot, we should give these models more credit and recognize that they have a blank slate and nothing but the text we give them. Even a good reasoning agent can be misled.