Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart Thompson, and Nico Grant, reporting for The New York Times:

In late 2021, OpenAI faced a supply problem.

The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest A.I. system. It needed more data to train the next version of its technology — lots more.

So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an A.I. system smarter.

Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful A.I. models and was the basis of the latest version of the ChatGPT chatbot…

Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.

Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps, and other online material for more of its A.I. products.

As I’ve said many times previously, I do not think scraping content from the web to train AI models, even non-consensually, is illegal, since I think large language models are transformative. Granted, if an LLM reproduces text verbatim, that is a concern, because verbatim reproduction is not fair use under U.S. copyright law. But transformative use of copyrighted works is permitted under the law for a good reason. The best way to solve the kerfuffle between publishers, authors, and other creators and the AI companies hungry for their data is comprehensive regulation written by experts, but knowing Congress, that will never happen. For all practical purposes, the current law is the law we will always have, and while it was not written for an era in which copyrighted works are fed to machines as training data, it is what we’re stuck with.

With that said, I am not necessarily upset at OpenAI for using Whisper to transcribe public YouTube videos and train GPT-4: GPT-4 does not quote YouTube videos verbatim and provides helpful information, which is more than enough to qualify as “fair use.” What I do have a problem with is Google’s conduct: its reaction to OpenAI’s scraping, its own scraping, and its use of private Google Docs data. Google owns YouTube, and YouTube users sign a contract with Google in order to use the service: the terms of service. Because of that relationship, Google has a responsibility to tell its users, via a privacy policy, how it is using their data.
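
(As an aside, the transcription step itself is mundane: Whisper is open source, and a minimal sketch of transcribing a single audio file with the whisper Python package looks something like the following. The file name is a placeholder, and this is purely illustrative, not OpenAI’s actual pipeline, which would have had to run at enormous scale.)

    import whisper

    # Load a pretrained Whisper checkpoint ("base" is small and fast;
    # larger checkpoints such as "large-v2" are more accurate but slower).
    model = whisper.load_model("base")

    # Transcribe a local audio file; the result includes the detected
    # language, the full transcript, and per-segment timestamps.
    result = model.transcribe("downloaded_video_audio.mp3")

    print(result["language"])
    print(result["text"])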

Unlike the terms of use, YouTube’s privacy policy says nothing about how Google can use YouTube videos to train Bard and its other LLMs. (I’ll get to Google Docs momentarily.) This creates two issues: (a) it makes YouTube content fair game for any person or company to scrape, since Google never gave YouTube users an explicit guarantee that it would be the only company scraping their videos (or that their videos wouldn’t be scraped at all), and (b) it compromises users’ data without their knowledge. Neither of these issues puts Google in a legally compromising position, thanks to fair use, but it is not a good look for Google.

Looks matter for a company like Google, which must create the illusion of privacy for users to feel confident giving it so much data. Unlike with Meta, people use Google services for all kinds of private matters, sharing personal family videos on YouTube and writing sensitive notes in Google Docs. On Meta’s services, aside from WhatsApp and Messenger, every user expects that their data will be shared with the world, available for anyone to view or use however they like. Google promises privacy and security, and for the most part, it has delivered on that promise, but it can’t keep selling users on privacy when its actions directly contradict that pitch.

And about OpenAI: YouTube likes to say that OpenAI’s scraping violates its terms of use, which anyone who uses YouTube, including the OpenAI employees who scraped the data, has implicitly agreed to. But YouTube can’t meaningfully enforce that specific rule, because the same terms also give creators ownership of the content they publish on YouTube. It cannot be against the terms of service for creators to do what they want with their own content; what if a creator wants OpenAI to have access to their videos? And even if YouTube tried to enforce the rule, its argument would be unstable, because YouTube (Google) does the same thing without ever granting itself that right in the very terms of service it claims OpenAI has broken.

And then there is Google Docs. Unlike the YouTube issue, this one is legally concerning. Google claims it only trains on data from users who opt into “experimental features,” which is to say, the features that let users have Bard help them write documents. That part of the agreement is well advertised, but the part where Google grants itself access to private user data to train AI models is only implied. Google does not ask users to sign a new service agreement to use Bard in Google Docs; it just includes, in the main terms of service, language saying that if a user signs up for experimental features, their data may be used for training purposes. That is sleazy.

It might not be illegal, but, as I said earlier, it does not need to be illegal to be harmful. This also raises one more unnecessary question for Google: How is it gaining access to private Google Docs data in the first place? Most users assume that what they write in Google Docs, Sheets, Slides, and so on is for their eyes only: private data that is, presumably, encrypted at rest. But if Google can mine that data and use it however it wants, it must be legible to Google at some point; an LLM cannot train on ciphertext. So Google is either decrypting Google Drive data for users who have opted in, or it is storing everyone’s files in some unencrypted format.

Whatever the case, it is deeply concerning because it breaks trust. What happens to all the people who no longer use Google Docs but once agreed to terms of use that now permit the use of old documents written before the change? Millions, if not hundreds of millions, of people are unwittingly sending their data straight into Bard’s language bank. Using the data may not be illegal, but collecting it this way is immoral and might be a breach of the “contract” between Google and its users. I’m not particularly concerned about my own data being used to train an LLM as long as it is anonymized and obfuscated, and I think many people feel the same, but it is wrong for Google to harvest this data and use it in ways users are unaware of.

Obviously, the best way to solve this problem is for Google to stop collecting Google Docs data (and perhaps YouTube data, though that is less pressing, since videos are public, unlike private documents) or to amend its privacy policy to account for third parties like OpenAI. But all of that ignores a larger question: Where will training data for LLMs come from? Reputable websites such as The Times have blocked ChatGPT’s crawlers from ingesting their articles as training data, and eventually, these bots will run out of internet to train on. That poses a large problem for LLMs, which depend entirely on training data to function.
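
(Blocking, for what it’s worth, is trivial for publishers: OpenAI’s GPTBot crawler respects the robots.txt standard, so a site can opt out with a rule like the minimal sketch below. Real sites typically combine it with rules for other crawlers.)

    # robots.txt: block OpenAI's GPTBot crawler from the entire site
    User-agent: GPTBot
    Disallow: /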

The solution some have proposed is to have LLMs generate training data for themselves, but anyone who knows how transformer models work will immediately see that this leads to heavily biased, inaccurate data. LLMs are not perfect now, and if they are trained on their own imperfect output, they only become more imperfect, and the cycle repeats. The only plausible solution I see is to make LLMs more data-efficient. Currently, AI companies are leaning on research from 2020 which found, plainly, that the more data a model is fed, the more accurate it becomes. But transformer models have improved since then, to the point where they can even correct themselves using data from the web to prevent “hallucinations,” the phenomenon where a chatbot confidently produces information that is wrong or simply made up.

I predict that within the next few years, researchers will stumble upon a breakthrough: LLMs that can learn their weights and make predictions with far less data, using the web to fact-check their output. I’m not a scientist, but this industry is booming right now, and new ideas will come to the table soon. For 2024, though, perhaps AI firms should look somewhere other than private user data to train their models.