Inadequate Data Cleaning Causes Hallucinations and Performance Issues
- Data Contamination: GPT-4o’s Chinese token data is filled with spam and porn phrases.
- Implications for Performance: This contamination could lead to hallucinations and misuse.
- OpenAI’s Response: The company acknowledges the issue and commits to addressing it.
Soon after OpenAI released GPT-4o, its newest model, on May 13, Chinese-speaking users noticed a troubling issue: the tokens it uses to parse Chinese text were riddled with spam and pornographic phrases. The problem, most likely the result of inadequate data cleaning, raises significant concerns about the model's reliability and safety. Contaminated token data can lead to hallucinations, poor performance, and potential misuse, undermining the chatbot's utility and credibility.
Data Contamination
The issue came to light when Tianle Cai, a PhD student at Princeton University, analyzed GPT-4o’s public token library. He discovered that out of the 100 longest Chinese tokens, only three were commonly used in everyday conversations. The rest consisted of phrases related to gambling and pornography. The longest token was a 10.5-character phrase meaning “_free Japanese porn video to watch.”
“This is sort of ridiculous,” Cai wrote, sharing his findings on GitHub. The tokens used by language models like GPT-4o are crucial for processing text efficiently. These tokens are distinct units that represent consistent and significant meanings, making text processing faster and more economical. However, the presence of spam and inappropriate content in these tokens suggests significant lapses in data cleaning and filtering.
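To illustrate the kind of check Cai ran, here is a minimal sketch that scans a vocabulary for its longest entries. The toy vocabulary and helper name below are hypothetical; GPT-4o's actual public token library is far larger and can be inspected with tooling such as OpenAI's tiktoken library.

```python
# Sketch: find the longest tokens in a (toy) vocabulary, the same
# kind of analysis Cai performed on GPT-4o's public token library.
# TOY_VOCAB is entirely illustrative, not GPT-4o's real vocabulary.
TOY_VOCAB = {
    0: "the",
    1: " language",
    2: " model",
    3: "日本",              # "Japan": a short, common token
    4: "免费观看日本高清视频",  # a long spam-like phrase (illustrative)
    5: "中国福利彩票",        # another long phrase (illustrative)
}

def longest_tokens(vocab, n=3):
    """Return the n longest tokens by character count."""
    return sorted(vocab.values(), key=len, reverse=True)[:n]

if __name__ == "__main__":
    for tok in longest_tokens(TOY_VOCAB):
        print(len(tok), repr(tok))
```

In a healthy vocabulary the longest entries tend to be common words and phrases; Cai's finding was that for Chinese, the longest entries were dominated by gambling and pornography spam instead.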
Implications for Performance
The polluted token data could have severe implications for GPT-4o’s performance, particularly in handling Chinese text. Models rely on tokens to understand and generate meaningful responses. When tokens are contaminated with irrelevant or inappropriate phrases, the model’s ability to grasp and respond accurately to user inputs is compromised. This can lead to hallucinations, where the model generates unrelated or nonsensical responses, and can give users a way to slip past safety measures.
Researchers have already demonstrated that these contaminated tokens can be exploited to make GPT-4o generate unsafe content. For instance, using these tokens, they have managed to jailbreak the model, bypassing the built-in safeguards designed to prevent the generation of harmful content.
OpenAI’s Response
In response to the growing concerns, OpenAI has acknowledged the issue and announced that it would pause the use of the problematic tokens.
However, the underlying problem of data contamination remains a significant challenge. Experts like Deedy Das, an AI investor at Menlo Ventures, have pointed out that the issue is not hard to fix with proper data cleaning techniques. Yet the presence of such glaring issues in GPT-4o indicates that adequate measures were not taken before the model’s release.
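One simple cleaning pass along the lines Das describes is to reject candidate phrases that match a spam blocklist or that appear in too few distinct documents of a vetted corpus. The blocklist entries, thresholds, and corpus counts below are all illustrative assumptions, not OpenAI's actual pipeline:

```python
# Sketch of a pre-tokenization cleaning pass: drop candidate phrases
# that match a spam blocklist or are too rare across documents to
# deserve their own token. All data here is illustrative.
SPAM_KEYWORDS = {"免费观看", "彩票"}  # illustrative blocklist entries

def clean_candidates(candidates, doc_freq, min_docs=100):
    """Keep candidates that are frequent across distinct documents
    and contain no blocklisted substring."""
    kept = []
    for phrase in candidates:
        if any(kw in phrase for kw in SPAM_KEYWORDS):
            continue  # spam/gambling phrase: reject outright
        if doc_freq.get(phrase, 0) < min_docs:
            continue  # too rare across documents to merit a token
        kept.append(phrase)
    return kept

if __name__ == "__main__":
    candidates = ["你好", "免费观看高清视频", "机器学习"]
    doc_freq = {"你好": 100000, "免费观看高清视频": 5000, "机器学习": 80000}
    print(clean_candidates(candidates, doc_freq))
```

The document-frequency test matters because spam phrases often occur millions of times within a small cluster of scraped pages; counting distinct documents, rather than raw occurrences, keeps such clusters from dominating the vocabulary.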
The contamination of GPT-4o’s Chinese token data with spam and pornographic phrases is a significant issue that undermines the model’s reliability and safety. While OpenAI has acknowledged the problem and paused the use of the affected tokens, the incident highlights the need for more robust data cleaning and filtering processes. As AI technology continues to evolve, ensuring the integrity and safety of the training data is paramount to maintaining user trust and the overall efficacy of language models.