OpenAI used over 1 million hours of YouTube data to train GPT-4 AI: Report

OpenAI used more than a million hours of YouTube videos to train its large language model GPT-4, according to a report that comes as major tech companies race to acquire ever more data to train their artificial intelligence (AI) models. OpenAI used a speech recognition tool named Whisper to transcribe YouTube videos, and the transcripts were used to train GPT-4, The New York Times reported. More than one million hours of video content were transcribed in this way, raising concerns about compliance with YouTube’s policies, since Google-owned YouTube restricts the use of its videos for independent applications.
This comes days after YouTube CEO Neal Mohan was asked, in an interview with The Wall Street Journal, whether OpenAI’s Sora video generator uses data from YouTube. He said he was not aware whether OpenAI had used any YouTube data to train its new video tool, but said it would be a problem if it had.
The report also claimed that Google transcribed YouTube videos for AI training, which could potentially have breached copyright laws. Meanwhile, Mark Zuckerberg’s Meta discussed acquiring Simon & Schuster to access a vast library of books.
AI models generally become more effective as the volume of data they’re trained on grows. It was earlier reported that demand for high-quality data is so high that some tech companies might exhaust the available internet data by 2026.
OpenAI said that each of its AI models is trained on a unique dataset, while Google acknowledged training AI models on some YouTube content under agreements with creators.
