OpenAI, in a blog post last week:

Today we’re releasing Operator, an agent that can go to the web to perform tasks for you. Using its own browser, it can look at a webpage and interact with it by typing, clicking, and scrolling. It is currently a research preview, meaning it has limitations and will evolve based on user feedback. Operator is one of our first agents, which are AIs capable of doing work for you independently—you give it a task and it will execute it.

Operator can be asked to handle a wide variety of repetitive browser tasks such as filling out forms, ordering groceries, and even creating memes. The ability to use the same interfaces and tools that humans interact with on a daily basis broadens the utility of AI, helping people save time on everyday tasks while opening up new engagement opportunities for businesses.

To ensure a safe and iterative rollout, we are starting small. Starting today, Operator is available to Pro users in the U.S. at operator.chatgpt.com. This research preview allows us to learn from our users and the broader ecosystem, refining and improving as we go. Our plan is to expand to Plus, Team, and Enterprise users and integrate these capabilities into ChatGPT in the future.

One thing I noticed while watching OpenAI’s live stream announcing Operator is that the company truly thinks this is the next paradigm of artificial intelligence. I agree, to a certain extent: the emphasis seems to be less on the large language models powering the optical character recognition and reasoning required to make decisions (where to click, how to click, where to enter text; essentially, how to use a computer) and more on the synchrony between the vision and reasoning aspects of the model. Operator needs to understand not only how a computer works but also the relationship between concepts, and that’s only possible with a natively multimodal model. It was obvious this was where GPT-4o was headed when it was announced last spring.

To achieve true multimodality, OpenAI developed a new model that works with GPT-4o: the Computer-Using Agent, or CUA. (Fire whoever named this.) I’m at least a little surprised o1 isn’t involved in this project at all, as I’d think an agent would require advanced reasoning capabilities to make decisions. But while o1 literally writes its thoughts out to arrive at an answer, CUA gets there through reinforcement learning, a machine-learning technique that trains a computer to do something by correcting its mistakes along the way and rewarding its successes. It seems primal, like training a dog with treats, but it works, since humans also tend to perform better when given rewards. That tendency is reflected in the training data: when people compliment each other, the resulting outcome is almost always more favorable.
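To make the reward-and-correction loop concrete, here’s a minimal sketch in the multi-armed-bandit style: an agent learns which of three buttons is worth clicking purely from rewards. Everything here (the actions, the reward probabilities) is invented for illustration and bears no relation to CUA’s actual training setup.

```python
import random

# A toy reward-driven learner: an "agent" figures out which of three
# buttons is worth clicking purely from trial, error, and reward.
ACTIONS = ["button_a", "button_b", "button_c"]
TRUE_REWARD = {"button_a": 0.2, "button_b": 0.8, "button_c": 0.5}  # hidden from the agent

values = {a: 0.0 for a in ACTIONS}  # the agent's running estimate of each action's worth
counts = {a: 0 for a in ACTIONS}

for _ in range(1000):
    # Mostly exploit the best-known action, but explore occasionally.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(values, key=values.get)

    # The environment hands back a reward -- the "treat".
    reward = 1.0 if random.random() < TRUE_REWARD[action] else 0.0

    # Correct the estimate toward what actually happened -- the mistake-fixing.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(max(values, key=values.get))  # almost always "button_b"
```

No written-out chain of thought anywhere: the “knowledge” lives entirely in the learned values, which is the distinction the paragraph above draws between o1 and CUA.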

Instead of thinking through a problem by writing detailed explanations for each step of the process, as o1 does, CUA acts natively, since it’s been pre-trained on how to use a computer. The LLM does the work of reasoning through what a user means and how to get there, and the reinforcement-trained parts take over the rest. It’s a clever mechanism that perfectly articulates why AI scaling will never stop: LLMs aren’t the best of AI. LLMs are extremely proficient at manipulating prose because they’ve been trained on enormous amounts of it. It’s like how traditional computers are fantastic calculators: they think in numbers and thus can compute even massive ones with ease. LLMs think in words (tokens, for the pedants) and produce them better than anything else. But when they’re given numbers, or worse, images, they fail spectacularly, because they can’t put a picture or a calculation into a black box and get another picture or calculation out, the way they can with language. They need to convert that picture into words they can understand first.
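A rough sketch of that conversion step shows the shape of the pipeline. The tokenizer, vision encoder, and model below are invented stand-ins, not any real API; the point is only that an image reaches the model through the same door text does, as tokens.

```python
# Hypothetical stand-ins for a tokenizer, a vision encoder, and a language
# model. None of this is a real library; it only illustrates the pipeline.

def tokenize_text(text: str) -> list[int]:
    # Crude stand-in for a real tokenizer: one token id per word.
    return [hash(word) % 50_000 for word in text.split()]

def encode_image(pixels: list[list[int]]) -> list[int]:
    # A real encoder (a vision transformer, say) maps image patches to
    # embeddings; here we pretend each pixel row becomes one "visual token".
    return [sum(row) % 50_000 for row in pixels]

def generate(tokens: list[int]) -> str:
    # Stand-in for next-token prediction over the combined token stream.
    return f"<answer conditioned on {len(tokens)} tokens>"

prompt = tokenize_text("what button should I click in this screenshot")
screenshot = [[0, 255, 128], [64, 64, 64]]  # a 2x3 toy "image"

# The model never sees raw pixels, only token-like representations.
print(generate(prompt + encode_image(screenshot)))
```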

Words make up a substantial portion of humans’ communication and work. Writing emails, sending text messages, producing reports and slides, and reading the news all involve words and, for the most part, only words. But humans also work with their hands and use their eyes to help. That’s why graphical user interfaces came to be — not everyone knows how a command line works, but everyone can grok the basic idea that files are in folders. GUIs are a digital metaphor for the real world, and the world doesn’t use words — it’s heavily reliant on understanding the visual relationship between physical objects through our five senses. LLMs are atrocious at this, but vision models have potential.

Using a computer, a GUI, requires proficiency in both visual and text processing, and that’s where CUA shines. But OpenAI’s spiel about how Operator is the best thing since sliced bread ignores that a GUI built for people isn’t the best way to do most things online. GUIs were made because they’re the easiest interfaces for humans to use (“GUIs are a metaphor for the real world”), but they aren’t anywhere near the most efficient way for computers to interact. Application programming interfaces are how computers speak to other, unrelated machines, and they’re the best way to craft agentic experiences. If Operator could make an API call to Trivago, the travel search site, instead of navigating to its website and clicking buttons as a person would, it could probably set up a reservation in seconds. It would still require AI to choose which API calls to make based on a user’s request (weights!), but it wouldn’t require the technological prowess Operator possesses.
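For the sake of argument, here’s what that structured path could look like. The endpoint, domain, and fields are all hypothetical; Trivago publishes no such public booking API, which is precisely the point.

```python
import requests

# Hypothetical booking request -- the domain, path, and fields are invented
# for illustration. No such public API exists.
booking = {
    "destination": "Lisbon",
    "check_in": "2025-06-01",
    "check_out": "2025-06-05",
    "guests": 2,
}

# One structured request, answered in milliseconds...
resp = requests.post(
    "https://api.example-travel.test/v1/reservations",
    json=booking,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["confirmation_id"])

# ...versus Operator's path: load the page, visually locate the search box,
# type, click, scroll, and read the results off a screenshot, step by step.
```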

But, alas, Trivago doesn’t have a public API, and neither do hundreds of thousands of internet services humans access online. So we’re stuck with Operator doing computer-like things in a human-centric world. The idea reminds me of humanoid robots, an imperfect solution to completing real-world tasks. Ford doesn’t employ thousands of humanoid robots to build cars on an assembly line; it builds a robot for each part of the car-making process. Operator is a humanoid robot working an assembly line, which makes me believe its existence will be rather short-lived. To do tons of things on the web, Operator needs to pass CAPTCHA tests designed to keep robots off the internet. It passes them with ease, which is impressive on its own. But why are we implementing CAPTCHAs just for robots to pass them? Robots passing CAPTCHAs aren’t the future, and neither is Operator, which is a (crucial) stepping stone to something much, much bigger.

Maybe that explains the ridiculous, comical $200-a-month price of entry, which is too high for even me to be interested.