Researchers at Google’s AI lab, DeepMind, have found a simple way to compromise the “ChatGPT Chatbot Alignment” process designed to safeguard the intelligent robot within boundaries.
Studies have shown that researchers can compel the chatbot to disclose complete excerpts of the literature containing its training data by issuing a command in the interface and requesting ChatGPT to endlessly repeat a specific word like “poem,” which should not lead to unintentional leakage with an AI aligned operation.
Additionally, the chatbot can be manipulated to clone individuals’ names, phone numbers, and addresses, constituting a privacy breach with potential serious consequences.
This phenomenon is termed by researchers as “extractive memorization,” an attack that forces the chatbot to reveal stored information in its memory.
In their official research paper, lead author Milad Nasr and team wrote: “We have developed a new attack that distorts the model across generations of previous chatbot alignments, improving the training data release speed by up to 150 times compared to the previous value when executed correctly.”
The essence of targeting Generative AI lies in distancing ChatGPT from the programmed alignment process and reverting it to a simpler operational method.
Data scientists build Generative AI chatbots like ChatGPT through a process known as training. In its initial phase, the robot is exposed to a massive collection of texts, reaching up to a billion bytes, encompassing diverse sources like Wikipedia and published books.
The primary aim of training is to enable the chatbot to express anything given to it, simulating the text compression and decompression process.
In theory, the chatbot can easily retrain when presented with small text excerpts from Wikipedia and asked to respond with identical copies.
Chatbots like ChatGPT undergo additional training and modifications to prevent text reproduction only but respond with outputs intended to be useful, like answering a question or assisting in report development.
The crucial role of the additional layer executed through alignment lies in the precise repetition function. The researchers noted: “Most users typically interact with pre-trained models rather than baseline models to better cater to their needs.”
Nasr relied on a strategy requiring the chatbot to continuously repeat specific words to compel ChatGPT to avoid the additional training layer.
Researchers obtained textual excerpts from novels, complete copies of poems, and also found personal information stored for many individuals, such as phone numbers.
The researchers directed their efforts to determine the volume of training data that might have leaked and discovered large quantities of data, even though the study was limited due to the cost of continuing the experiment.
Nasr and his team wrote: “We extracted over 10,000 unique models with our limited budget of $200, although a person spending larger amounts to query the ChatGPT API may acquire more data.”
The results obtained by the authors regarding OpenAI show that the company has taken measures to counter the attack.