According to a report by The Atlantic, major AI companies are using a source for training their chatbots that few might have considered: subtitles from popular movies and TV shows. A recently discovered AI training dataset reportedly contains subtitles from no fewer than 53,000 movies and 85,000 TV episodes.
The subtitles include those from all films nominated for Best Picture at the Oscars between 1950 and 2016. Regarding TV shows, various AI models are likely familiar with “Breaking Bad,” “The Wire,” and “The Sopranos.”
For these shows, the dataset reportedly contains subtitles from every episode ever released. It also includes 45 episodes of “Twin Peaks,” 170 episodes of “Seinfeld,” and at least 616 episodes of “The Simpsons.”
The Atlantic explains why subtitles are so valuable, even beyond scripts: as a raw form of written dialogue, they capture the rhythms and styles of spoken language. This allows tech companies to expand the repertoire of generative AI beyond academic texts, journalism, and novels.
The collected subtitles come from a website called Opensubtitles.org, where users upload files extracted from Blu-ray discs, DVDs, and internet streams. The result is a vast training dataset with more than 9 million subtitle files in over 100 languages.
This is a welcome source for many at the forefront of the AI industry. Companies like Apple, Meta, Nvidia, Anthropic, Salesforce, and Bloomberg reportedly use it.
In total, at least 140 open-source models have been trained on the data. These models could take over the work of human authors in the future.
While Nvidia, Bloomberg, and Anthropic did not provide official comments, a Salesforce spokesperson stated that the company did use Opensubtitles for developing generative AI. However, the spokesperson said the dataset was not used to improve a Salesforce product offering.
Apple stated that the LLMs trained with the subtitles are intended solely for research purposes. However, the companies have no control over how developers specifically use the open-source models.