OpenAI utilised one million hours of YouTube content to train GPT-4
The availability of good training data is rapidly diminishing, with companies possibly outpacing new content by 2028. Possible solutions include training models on synthetic data or using curriculum learning, but neither approach is proven.
In recent reports by The New York Times, the challenges faced by AI companies in acquiring high-quality training data have come to light. The New York Times elaborates on how companies like OpenAI and Google have navigated this issue, often treading in legally ambiguous territories related to AI copyright law.
OpenAI, for instance, resorted to developing its Whisper audio transcription model by transcribing over a million hours of YouTube videos to train GPT-4, its advanced language model. Although this approach raised legal concerns, OpenAI believed it fell within fair use. The company’s president, Greg Brockman, reportedly played a hands-on role in collecting these videos.
According to a Google spokesperson, there were unconfirmed reports of OpenAI’s activities, and both Google’s terms of service and robots.txt files prohibit unauthorised scraping or downloading of YouTube content. Google also utilised transcripts from YouTube, aligned with its agreements with content creators.
Similarly, Meta encountered challenges with data availability for training its AI models. The company’s AI team discussed using copyrighted works without permission to catch up with OpenAI. Meta explored options like paying for book licenses or acquiring a large publisher to address this issue.
Why does it matter?
AI companies, including Google and OpenAI, are grappling with the dwindling availability of quality training data to improve their models. The future of AI training may involve synthetic data or curriculum learning methods, but these approaches still need to be proven. In the meantime, companies continue to explore various avenues for data acquisition, sometimes straying into legally contentious territories as they navigate this evolving landscape.