Dhruv Mehrotra and Tim Marchman, reporting for Wired:

A WIRED analysis and one carried out by developer Robb Knight suggest that Perplexity is able to achieve this partly through apparently ignoring a widely accepted web standard known as the Robots Exclusion Protocol to surreptitiously scrape areas of websites that operators do not want accessed by bots, despite claiming that it won’t. WIRED observed a machine tied to Perplexity—more specifically, one on an Amazon server and almost certainly operated by Perplexity—doing this on WIRED.com and across other Condé Nast publications.

The WIRED analysis also demonstrates that, despite claims that Perplexity’s tools provide “instant, reliable answers to any question with complete sources and citations included,” doing away with the need to “click on different links,” its chatbot, which is capable of accurately summarizing journalistic work with appropriate credit, is also prone to bullshitting, in the technical sense of the word.

WIRED provided the Perplexity chatbot with the headlines of dozens of articles published on our website this year, as well as prompts about the subjects of WIRED reporting. The results showed the chatbot at times closely paraphrasing WIRED stories, and at times summarizing stories inaccurately and with minimal attribution. In one case, the text it generated falsely claimed that WIRED had reported that a specific police officer in California had committed a crime. (The AP similarly identified an instance of the chatbot attributing fake quotes to real people.) Despite its apparent access to original WIRED reporting and its site hosting original WIRED art, though, none of the IP addresses publicly listed by the company left any identifiable trace in our server logs, raising the question of how exactly Perplexity’s system works.

Relatedly, Sara Fischer, reporting for Axios:

Forbes sent a letter to the CEO of AI search startup Perplexity accusing the company of stealing text and images in a “willful infringement” of Forbes’ copyright rights, according to a copy of the letter obtained by Axios…

The letter, dated last Thursday, demands that Perplexity remove the misleading source articles, reimburse Forbes for all advertising revenues Perplexity earned via the infringement, and provide “satisfactory evidence and written assurances” that it has removed the infringing articles.

What makes this different from the New York Times lawsuit against OpenAI from last year is that there is a way to opt out of ChatGPT data scraping by adding two lines to a website’s robots.txt file. Additionally, ChatGPT doesn’t lie about the reporting it sources from other websites. Perplexity not only sleazily ignores disallow rules on sites it crawls, by using a different user agent than the one it advertises on its website and in its support documentation, but also lies about journalists’ reporting to users, potentially exposing publishers to defamation claims and other legal nonsense. Perplexity is both a thief and a serial fabulist.
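For reference, opting out really is that small. A sketch of the two robots.txt lines in question, assuming a site wants to block OpenAI’s GPTBot crawler entirely:

```
# Block OpenAI's GPTBot crawler from the whole site
User-agent: GPTBot
Disallow: /
```

Of course, a rule like this only works if the crawler honors the Robots Exclusion Protocol, which is exactly the standard Perplexity stands accused of ignoring.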

I maintain my position that scraping the open web is not illegal, merely unethical — and there are exceptions for when it is acceptable to scrape without permission. But I’m no ethicist, and while I have AI scraping disabled on my own website, I’m not sure how to feel about misattribution when quoting other websites. I do feel it’s a threat to journalism, however, and companies should focus on signing content deals with publishers, as OpenAI did. Stealing, though, is a red line: If a website tells an AI scraper not to touch it, masquerading as a completely different computer with a different IP address and user agent is disingenuous and probably illegal. If a property owner calls the police and has an unwanted visitor trespassed from the premises, and the next day that visitor comes back in a different jacket, that’s still illegal. The property owner has trespassed them, so no matter what jacket they wear, they’re still somewhere they’re not allowed to be.

It’s not illegal to walk into a shop that’s open to the public if you haven’t been barred from entering. A disallow rule in a robots.txt file is the internet equivalent of trespassing AI bots from a website. If a website doesn’t have one, I think it’s fair game for AI companies to crawl it; this is why I wasn’t especially disappointed in Apple for scraping the open web. I wish Apple had told publishers how to disable Applebot-Extended — its AI training scraper — before it began training Apple Intelligence’s foundation models, but it doesn’t really matter in the grand scheme: I allowed my website to be scraped by Apple’s robots, so I can’t be mad, only disappointed. (I’ve since disallowed Applebot-Extended from indexing this website.) The same is true for The New York Times and OpenAI, but not for Perplexity, which is putting on a disguise, trespassing, and stealing.
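For anyone wanting to do the same, a sketch of the robots.txt entry, using the Applebot-Extended user agent Apple documents for opting out of foundation-model training:

```
# Opt out of Apple Intelligence model training;
# ordinary Applebot search indexing is unaffected
User-agent: Applebot-Extended
Disallow: /
```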

Perplexity is doing the equivalent of breaking into a Rolex store, stealing a bunch of watches, filing the Rolex logo off of them, and then selling them on the street for 10 times the price while saying, “I made these watches.” It’s purely disingenuous and almost certainly illegal because the robots.txt file acts as a de facto terms of service for a website. Websites like Wired and Forbes, owned by multinational media conglomerates, almost certainly have clauses in their terms of service that disallow AI scraping, and if Perplexity violates those terms, the companies have a right to send it a cease-and-desist. Would suing go a step too far? Probably, but I also don’t see how such a suit wouldn’t be legally sound, unlike The Times’ suit against OpenAI.
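Mechanically, honoring that de facto terms of service is trivial: a compliant crawler checks robots.txt before every fetch. A minimal sketch in Python using the standard library’s urllib.robotparser, with a hypothetical bot name and rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a site might publish in its robots.txt
rules = """
User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler makes this check before fetching anything
print(parser.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/public/page"))   # True

# The evasion Wired and Robb Knight describe: present a user agent the
# rules don't name, and the disallow never applies
print(parser.can_fetch("DisguisedBot", "https://example.com/private/page"))  # True
```

The last line is the whole scandal in miniature: the protocol is purely honor-based, so a crawler that shows up under a different name simply never sees a rule that applies to it.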

You might think I’m playing favorites with Silicon Valley’s golden-child AI startup, but I’m not — they’re just two different cases. One company, Perplexity, is actively violating websites’ terms of service every single day. ChatGPT scraped The Times’ website before The Times could “trespass” OpenAI after ChatGPT’s launch, and that’s entirely fair game. On top of that, The Times used disingenuous means to elicit its articles through ChatGPT, whereas Perplexity’s model just plagiarized without even being asked. Perplexity is designed by its makers to disobey copyright law and is actively encouraged to plagiarize. If Perplexity didn’t want to do harm, it could just switch back to the “PerplexityBot” user agent it told publishers to block, but even when the company is in the news for being nefarious, it’s still not budging. In fact, Aravind Srinivas, Perplexity’s chief executive, had the audacity to say Wired’s reporters were the ones who didn’t know how the internet works, not his company. Shameful. Perplexity is a morally bankrupt institution.