The Debate About AI Scraping

Kali Hays, reporting for Business Insider:

The world’s top two AI startups are ignoring requests by media publishers to stop scraping their web content for free model training data, Business Insider has learned.

OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that prevents automated scraping of websites.

TollBit, a startup aiming to broker paid licensing deals between publishers and AI companies, found several AI companies are acting in this way and informed certain large publishers in a Friday letter, which was reported earlier by Reuters. The letter did not include the names of any of the AI companies accused of skirting the rule.

Yours truly, writing on Wednesday about Perplexity, another artificial intelligence firm, doing the same thing:

What makes this different from the New York Times lawsuit against OpenAI from last year is that there is a way to opt out of ChatGPT data scraping by adding two lines to a website’s robots.txt file. Additionally, ChatGPT doesn’t lie about reporting that it sources from other websites.

That aged well. I haven’t been able to replicate Business Insider’s or TollBit’s findings yet through my own ChatGPT requests, but if they’re true, they’re concerning. Hays asked OpenAI for comment, but a spokeswoman for the company refused to say anything more than that it already respects robots.txt files. This brings me back to Perplexity. Mark Sullivan, interviewing Aravind Srinivas, Perplexity’s chief executive, for Fast Company:

“Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,” said Perplexity cofounder and CEO Aravind Srinivas in a phone interview Friday. “I think there is a basic misunderstanding of the way this works,” Srinivas said. “We don’t just rely on our own web crawlers, we rely on third-party web crawlers as well.”

What a cop-out answer. It just proves Srinivas is a pathological liar and his company makes its fortune by stealing other people’s work. Perplexity is ignoring the Robots Exclusion Protocol, and it is lying about it. By saying Perplexity isn’t lying about it, Srinivas is fibbing again. It’s comical and entirely unacceptable. On top of that, he audaciously tells people that they’re the ones misunderstanding him, not the other way around.

Some people, like Federico Viticci and John Voorhees, who write the Apple-focused blog MacStories, have taken particular offense at this AI scraping, which they do not consent to. If it is true that OpenAI and Anthropic are ignoring the Robots Exclusion Protocol, then yes, they deserve to be held to account; they’ll have to explain why they’re defying a “No Trespassing” sign, as I wrote on Wednesday. But I’ve been pondering this ethical dilemma for the past few days, and I’ve concluded that AI scraping in its entirety isn’t a bad thing. If a site doesn’t disallow AI scraping, it is a core tenet of the open web to allow anyone to use that content to learn. Granted, if the chatbot is partaking in plagiarism (copying words without attribution), just like Perplexity does, that’s both morally and probably legally wrong. But if a site doesn’t have disallow rules in place, I think it’s perfectly fine for an AI company to scrape it to help its chatbot learn.

In my case, I’ve disallowed AI chatbot scraping from all the major AI companies for now, but that’s subject to change. (I suspect it will change in the near future.) If OpenAI and Anthropic can prove that they aren’t ignoring robots.txt rules, I’ll be glad to remove them from my disallow list and allow their chatbots to learn from my writing to improve their products. I think these products have every right to learn from the open web: the words themselves aren’t copyrighted, it’s the idea. So if a chatbot is just learning the sequence of words, not the ideas, from my writing, I think it should be able to. That’s not what Perplexity is doing, though: it’s been caught red-handed blatantly copying authors’ work and then lying about it. (It does that to my articles, too.) That’s unethical and wrong; it’s a violation of copyright law.
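For readers wondering what such a disallow list looks like and how a well-behaved crawler is supposed to honor it, here is a minimal sketch in Python using the standard library's urllib.robotparser. The user-agent tokens (GPTBot, ClaudeBot, PerplexityBot) are my assumptions about what each company's crawler calls itself, so check each company's documentation before copying them; the example.com URL and the may_fetch helper are purely illustrative and not anyone's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt. The user-agent tokens below are assumptions;
# each AI company documents the token its own crawler uses.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant crawler runs a check like this before requesting a page.
print(may_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/post"))        # False
print(may_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/post"))  # True
```

Nothing in the file enforces anything: the check only matters if the crawler chooses to run it and obey the answer, which is exactly the "No Trespassing" sign problem.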

I don’t frown on Viticci and Voorhees for being so down on AI scraping. Though I might disagree with their ethical stance that AI scraping of the open web is bad, period, I think they have every right to be annoyed about these reckless AI companies stealing their content when they don’t consent to it. That’s the golden word here: consent. If a publisher doesn’t consent to their content being used by scrapers, it shouldn’t be — but if they haven’t put up disallow rules, it’s a free-for-all unless content is being plagiarized one-to-one. Every writer, no matter how famous, has learned how to write from other people, and large language models should be able to do the same. But if I copied and pasted someone else’s work without attribution, and then lied about taking their words, that would be unethical and illegal. That’s what Perplexity is doing.

I do think we need new legislation to make the robots.txt file of a website legally binding, though. Most writers don’t work for a company with a legal team that can write well-intentioned terms of service for their website, so the robots.txt should be enough to tell AI companies how they can use the data on a site. If an LLM violates that “contract,” the copyright owner should be able to sue. I can’t imagine legislators will take this simple approach to AI regulation, however, which is why I’m wary of dragging the government into this debate. It’ll almost certainly make the situation worse. But for now, here’s my stance: AI companies should continue to sign deals with large publishers and respect robots.txt files. If they’re not barred from a website, they can scrape it. And writers on the internet should decide for themselves whether they’d like LLMs to learn from their writing: if they’re not comfortable, they should put up a “No Trespassing” sign in their robots.txt file.