Challenges and Innovations in AI Training Data Scarcity

Training a machine learning model effectively requires large amounts of fresh, high-quality data. Freely accessible online magazines and scientific publications have long served this purpose, and major AI companies have already signed licensing agreements with publishers such as Springer, Reuters, and the New York Times to access their content. The problem, however, is that these sources grow far too slowly to satisfy the rapidly increasing data demands of ever-improving AI models.

Experts warned of a looming shortage of training data two years ago, predicting that by 2026 the shortage would become evident because every source of quality data would have been tapped. Other experts agreed with the prediction but estimated that existing data might last roughly two years longer. As a result, AI providers are resorting to whatever data they can obtain, even data considered to be of lower quality.

Meta, the parent company of Facebook, trains its Llama models on posts from its platforms Facebook and Instagram. Other AI providers turn to synthetic data, generated by AI itself, as a training alternative. The AI startup Anthropic, for instance, has used synthetic data in its Claude model series since the Opus version, and OpenAI, the creator of ChatGPT, is reportedly taking a similar approach with its new language model Orion.

These methods are controversial among AI researchers. Social media posts are widely regarded as low-quality data and could degrade the quality of AI-generated content. Synthetic data raises problems of its own: it is unclear how an AI can learn anything new from data it has generated itself, an idea reminiscent of a perpetual motion machine. Moreover, models trained on their own output may narrow over time, imitating only what they already produce and closing themselves into a self-contained “walled garden.”

The situation could get worse still. Experiments at Stanford University have shown that training on synthetic data introduces errors and artifacts into AI responses, and that continued training on such data can render outputs completely unusable, a phenomenon referred to in the research community as “digital mad cow disease.”
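The mechanism behind this degradation can be illustrated with a toy simulation (this is a simplified sketch for intuition, not a reproduction of the Stanford experiments): if each “generation” of a model is fitted only to samples drawn from the previous generation, small statistical errors compound, and the fitted distribution drifts and narrows. Here the “model” is just a Gaussian fitted by mean and standard deviation.

```python
import random
import statistics

def fit(samples):
    # Maximum-likelihood Gaussian fit: sample mean and population std dev.
    return statistics.mean(samples), statistics.pstdev(samples)

def run_chain(rng, n=10, generations=30):
    """One self-consuming training chain: every generation after the first
    is fitted only to synthetic samples drawn from its predecessor."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "real" data, std = 1
    stds = []
    for _ in range(generations):
        mu, sigma = fit(data)
        stds.append(sigma)
        # Next generation sees only the current model's synthetic output.
        data = [rng.gauss(mu, sigma) for _ in range(n)]
    return stds

rng = random.Random(0)
chains = [run_chain(rng) for _ in range(200)]

# Average fitted std dev across chains, first vs. last generation.
avg_first = sum(c[0] for c in chains) / len(chains)
avg_last = sum(c[-1] for c in chains) / len(chains)
print(f"avg fitted std: generation 0 = {avg_first:.3f}, "
      f"generation 29 = {avg_last:.3f}")
```

Averaged over many chains, the fitted standard deviation shrinks sharply across generations: each finite-sample refit slightly underestimates the spread, and with no fresh real data there is nothing to pull the model back, which is the intuition behind the reported collapse.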

In response to these challenges, OpenAI has established a new team dedicated to finding ways to improve future models despite the scarcity of training data. The situation remains dynamic.