# GPT-4 trained on YouTube transcripts

According to the search results, OpenAI reportedly used transcriptions of more than a million hours of YouTube videos
to train GPT-4, its most advanced large language model. This was part of the company's effort to gather high-quality
training data, which is crucial for developing and improving AI models like GPT-4. The company developed its Whisper
audio transcription model to assist in this process, which allowed it to transcribe the YouTube content.

OpenAI reportedly considered the use of YouTube videos as training data legally questionable but believed it to be
fair use. OpenAI president Greg Brockman was personally involved in collecting the videos used for this purpose.
A company spokesperson, Lindsay Held, stated that OpenAI curates unique datasets for each of its models to help
their understanding of the world and draws on numerous sources, including publicly available data and partnerships
for non-public data.

Google, which owns YouTube, has robots.txt files and Terms of Service that prohibit unauthorized scraping or
downloading of YouTube content. Google spokesperson Matt Bryant said that the company takes technical and legal
measures to prevent such unauthorized use when it has a clear legal or policy basis to do so.
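To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler might check such rules before fetching a page. It uses Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen for illustration; they are not YouTube's actual robots.txt. Note that robots.txt is a voluntary convention that crawlers choose to honor, and it is separate from a site's Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT YouTube's actual robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/public/index.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

A crawler that respects the convention would skip any URL for which the check returns `False`; nothing in the file itself technically blocks a request.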

The search results indicate that training GPT-4 on YouTube transcripts was part of a broader strategy by AI
companies to overcome the challenge of finding sufficient and diverse data to train their models effectively. This
strategy also included using data from other sources such as GitHub, chess move databases, and schoolwork content
from Quizlet.