Apple has joined the ranks of generative AI developers caught using copyrighted content scraped from social media to train their artificial intelligence systems. A recent report from Proof News reveals that Apple, along with other AI giants like Anthropic, Nvidia, and Salesforce, has been using a dataset containing subtitles from 173,536 YouTube videos without proper authorization. Despite YouTube’s rules against such exploitation, these companies have been infringing on copyrighted material.
The dataset, known as YouTube Subtitles, includes video transcripts from a wide range of channels, including educational institutions like Khan Academy, MIT, and Harvard, as well as popular media outlets like The Wall Street Journal, NPR, and the BBC. Even content from late-night shows and influential YouTubers like Marques Brownlee and MrBeast has been used without permission.
The dataset doesn’t include video files, but it does offer translations in languages such as Japanese, German, and Arabic. The data was initially sourced from a larger dataset called the Pile, compiled by startup EleutherAI, which extracted data from YouTube, European Parliament records, and Wikipedia.
Furthermore, companies like Bloomberg, Anthropic, and Databricks have trained models using the Pile. Despite potential violations of YouTube’s terms of service, companies like Anthropic argue that their use of the Pile is distinct from direct use of YouTube’s platform.
The issue of AI startups mining data from the internet without proper authorization has been ongoing, leading to legal challenges. For instance, Stability AI and Midjourney are currently facing lawsuits from content creators for scraping copyrighted works. Google, which operates YouTube, has also faced class-action lawsuits for data scraping related to generative AI.
Me: What data was used to train Sora? YouTube videos?
OpenAI CTO: I'm actually not sure about that…(I really do encourage you to watch the full @WSJ interview where Murati did answer a lot of the biggest questions about Sora. Full interview, ironically, on YouTube:… pic.twitter.com/51O8Wyt53c
— Joanna Stern (@JoannaStern) March 14, 2024
The lack of transparency about training data sources is a significant concern. In a recent interview, OpenAI CTO Mira Murati struggled to confirm whether the company used videos from platforms like YouTube and Facebook. Similarly, Microsoft AI CEO Mustafa Suleyman argued that content freely available on the web is fair game due to an implied “social contract.”
These instances highlight the ethical implications and legal consequences of AI companies accessing and using data without proper authorization, posing challenges in the evolving landscape of generative AI technologies.