AI Training Faces Data Shortage Challenges

For AI models to be trained effectively, they need fresh, high-quality data. So far, freely accessible online magazines and trade publications have served this purpose, along with newspaper and scientific archives and communities such as Reddit and Stack Overflow. Major AI companies have also struck licensing agreements with publishers such as Axel Springer, Reuters, and the New York Times to access their content.

The problem: this content grows far too slowly to satisfy the training demands of rapidly improving AI models. That is hardly surprising. Experts warned of a looming shortage of training data two years ago, predicting that by 2026 all sources of high-quality data would likely be exhausted.

Some experts confirmed this prediction but estimated that the existing data might last another two years. One alternative is to train on sources considered to be of lower quality. Meta, Facebook's parent company, for instance, uses posts from its platforms Facebook and Instagram to train its Llama models. Other AI providers resort to a different trick: synthetic data, that is, training data generated by AI itself. The AI startup Anthropic takes this approach with its Claude model series, as reportedly does OpenAI with its new language model Orion.

These methods are controversial among AI researchers. Social media posts are considered particularly low quality and could degrade the quality of AI outputs. Synthetic data presents problems of its own. It is unclear how an AI can make further progress when it trains only on data it generated itself; the idea is reminiscent of a perpetual motion machine. Moreover, models trained on synthetic data may begin to narrow themselves by imitating their own output, effectively locking themselves into a closed loop.
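The closed-loop effect can be illustrated with a toy simulation (a sketch of my own for illustration, not any lab's actual experiment): a "model" that simply fits a normal distribution to its training data, where each new generation is trained exclusively on samples drawn from the previous generation. The function name and parameters below are made up for the example.

```python
import random
import statistics

def train_generations(generations=300, sample_size=10, seed=0):
    """Toy model of self-consuming training: each generation fits a
    normal distribution to samples produced by the previous one."""
    rng = random.Random(seed)
    # Generation 0 trains on "real" data: standard normal samples.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations):
        mu = statistics.fmean(data)      # the "model" is just (mu, sigma)
        sigma = statistics.stdev(data)
        spreads.append(sigma)
        # The next generation sees only the current model's own output.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads

spreads = train_generations()
print(f"spread of generation 1:   {spreads[0]:.3f}")
print(f"spread of generation 300: {spreads[-1]:.3f}")
```

Run over many generations, the fitted spread drifts toward zero: the "model" ends up reproducing an ever-narrower slice of its original data, which is the shrinking, self-referential behavior the researchers worry about.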

It could get worse. Experiments at Stanford University in California have shown that training on synthetic data can introduce errors, or at least artifacts, into AI responses. If training continues on such data, the outputs can become completely unusable, a failure mode the research literature calls model collapse or "Model Autophagy Disorder" (MAD), loosely: digital madness.

At OpenAI, a new team has been established to tackle this issue, focused on finding ways to improve future models despite the shortage of training data. It remains to be seen how well that will work.