Amid the explosion of neural networks and self-proclaimed artificial intelligences, many companies are trying to claim their spot in the limelight and come up with new uses for these impressive tools. By far the largest share of new products has been, and continues to be, built on Large Language Models, or LLMs. These natural language generation models are impressive across a wide variety of fields, to the point where one starts to wonder whether the LLM they are texting with is an artificial intelligence comparable to human intelligence.
In addition to generating text, people are trying to use these LLMs to automate everyday tasks such as ordering food, texting friends, playing music, and searching the Internet. However, these tasks can't be done with just a glorified random text generator, so when OpenAI created a way for their ChatGPT to interact with external tools (such as WolframAlpha for solving math equations), it opened the way to many new products. The biggest advance in this direction is the GPT-4o ("o" stands for "omni") model, which can accept and generate images in addition to understanding and generating text.
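To make "interacting with external tools" more concrete, here is a minimal sketch of tool calling with the OpenAI Python SDK. The `play_song` tool and its parameters are hypothetical stand-ins for illustration, not anything the products discussed here actually expose.

```python
# Minimal sketch of LLM tool calling with the OpenAI Python SDK (openai >= 1.0).
# The "play_song" tool is a hypothetical example, not a real integration.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "play_song",
        "description": "Play a song in a music app",
        "parameters": {
            "type": "object",
            "properties": {
                "artist": {"type": "string"},
                "title": {"type": "string"},
            },
            "required": ["artist", "title"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Play me the Beatles song Help"}],
    tools=tools,
)

# The model only returns a structured request (tool name + JSON arguments);
# actually playing the song is still up to our own code.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The key point is that the model only decides which tool to call and with what arguments; something else still has to carry the action out.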
These advances have led to the development of many products, such as the Humane AI Pin and the Rabbit R1, small boxes designed to serve as personal assistants. These products build on the premise that the user speaks an instruction, e.g., "Create a reminder with 'call the doctor'", the box converts the speech into text, passes the text to the LLM, and the LLM executes the command using some predefined note-taking tool. A command of this type is fairly simple, and most products handle it without much trouble. Where they often struggle, however, is with commands like "Play me the Beatles song Help", which very often result in a different song being played, or nothing being played at all. Now you might be thinking, "I mean, even Amazon Alexa, which came out in 2014 and didn't have any LLM in it, can play music." And you'd be right...
Focusing on the Rabbit R1: not only can it not play songs, it can't even order food, call a taxi or reply to a message, even though all of these features were presented as part of its marketing campaign. However, the aim of this article is not to criticise the misleading advertising, but to look at the technical side of the problem and to test whether, and how, an LLM-based personal assistant could be made truly functional.
Not long after the release of this product, one of the main reasons why almost none of the presented R1 features worked came to light. Although its creators claimed to use a proprietary LAM, or Large Action Model, that converts user instructions into actions within individual apps like Spotify, the reality was different. In practice, this LAM was really just an LLM (specifically ChatGPT) and a set of predefined scripts that interacted with the web applications in place of the user. The LLM merely converted the user's command into parameters and passed them to a script, which then clicked through, for example, the Spotify app, found the right song and played it. The fundamental problem with this approach is that the scripts have to be written by a programmer first, and as soon as Spotify changes its layout, moves a button, adds an extra step, etc., the script breaks and stops playing anything at all.
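To illustrate the difference, here is a rough sketch of what such a predefined script might look like, assuming Playwright is used for browser automation. The selectors are hypothetical placeholders, which is exactly the point: real ones would be tied to Spotify's current markup and break the moment it changes.

```python
# Sketch of the "LAM" approach described above: the LLM extracts parameters,
# and a hand-written Playwright script clicks through the web app.
# All selectors are hypothetical placeholders, not Spotify's real ones.
from playwright.sync_api import sync_playwright

def play_song_via_script(artist: str, title: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://open.spotify.com")  # assumes an already logged-in session
        page.click("[data-testid='search-button']")                     # hypothetical selector
        page.fill("[data-testid='search-input']", f"{artist} {title}")  # hypothetical selector
        page.keyboard.press("Enter")
        page.click("[data-testid='top-result-play-button']")            # hypothetical selector
        browser.close()

# The LLM's only job is to turn "Play me the Beatles song Help" into:
# play_song_via_script("The Beatles", "Help!")
```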
When we learned about this unsuccessful attempt to automate UI control with an LLM, we thought it must be possible to do it "honestly", i.e. to actually use only the LLM and not rely on pre-made application flows, especially since GPT-4o can understand images. If the model could not only recognize that a picture of a cat is indeed a cat, but also identify the individual controls in a screenshot of a web application, it would actually be possible to walk through the application based on the LLM's instructions. The structure of such a project would be that the user enters a command, e.g., "Play me the Beatles song Help", the program passes this command to the LLM together with a screenshot of the Spotify home screen, and asks, "If I want to execute this command, where do I click?". The LLM responds with a position on the screen, our program clicks that spot, takes another screenshot, and asks "what next?" until the original command is actually executed. With this approach, we would eliminate the problem of the UI changing, because the LLM would be able to find the button anywhere on the screen, or even open the side menu first if it didn't find what it was looking for.
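A minimal sketch of this loop, assuming pyautogui for screenshots and clicks and the OpenAI chat API for the vision calls, might look like the following. The prompt wording and the "x,y" / "DONE" reply convention are our own assumptions; a production version would need much more robust parsing and error handling.

```python
# Sketch of the screenshot -> GPT-4o -> click loop described above.
# The prompt and the "x,y" / "DONE" reply format are illustrative assumptions.
import base64, io, time
import pyautogui
from openai import OpenAI

client = OpenAI()
command = "Play me the Beatles song Help"

def screenshot_b64() -> str:
    image = pyautogui.screenshot()          # grab the current screen as a PIL image
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

for _ in range(15):                         # safety limit on the number of steps
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"I want to: {command}. Here is the current screen. "
                         "Answer with the pixel coordinates to click as 'x,y', "
                         "or 'DONE' if the task is finished."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
            ],
        }],
    ).choices[0].message.content.strip()

    if reply.upper().startswith("DONE"):
        break
    x, y = (int(v) for v in reply.split(","))   # naive parsing of the model's answer
    pyautogui.click(x, y)
    time.sleep(1)                               # give the UI a moment to update
```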
And lo and behold, it actually worked. GPT-4o was able to identify the on-screen controls very accurately and provided more or less correct locations of the buttons to click to move to the next step. However, if it were really that easy, we would be describing the breakthrough new product we are launching, not an experiment we conducted. So where's the problem? The main problem is response time. What hasn't been mentioned yet is how long it actually takes to play a song on Spotify this way. In practice, it took about a minute to get from the home screen to playing the audio. With that kind of delay, I'm quicker playing the song myself than waiting for my assistant to do it for me. Even existing products like the Rabbit R1 struggle with response speed, and they only take a few seconds to run through the app using a script. From various reviews of existing products, it is clear that if the assistant doesn't do what it is told within about five seconds, the user is annoyed and doesn't have a positive experience with the product. So waiting a minute for a song to play is unthinkable, and Spotify is one of the easier apps to automate this way. If we were talking about ordering food delivery, for example, where everyone wants something different and there is a drink, sauce or dessert to go with each meal, it is not realistic to click through the app fast enough to keep the user satisfied.
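For a rough sense of where that minute goes, here is a back-of-envelope budget. The per-step durations are illustrative assumptions rather than measurements, but they show how the cost compounds with every UI step.

```python
# Back-of-envelope latency budget for the screenshot -> LLM -> click loop.
# All per-step durations are illustrative assumptions, not measurements.
steps = 8                 # UI steps from the Spotify home screen to playback
capture_and_upload = 1.5  # seconds to capture and upload a screenshot
model_response = 5.0      # seconds for a GPT-4o vision response
click_and_settle = 1.0    # seconds for the click and for the UI to update

total = steps * (capture_and_upload + model_response + click_and_settle)
print(f"~{total:.0f} s in total")  # ~60 s, versus the ~5 s users will tolerate
```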
It's entirely possible that the creators of the Rabbit R1 ran a similar experiment to ours, got excited about it and decided to build a product around it. They started presenting it as a LAM, because it could arguably be considered one, and raised a lot of money from investors, but when they started developing the product they found it impossible to get answers fast enough. And since it was too late to change the marketing campaign or return the money, they had to settle for what was, objectively speaking, probably the best alternative that was fast enough. Their story is a reminder that an LLM is really just a glorified text generator and (at least for now) cannot practically be used to drive a UI.
NoxLabs is a team of engineers and designers specializing in web and mobile development. We're passionate about building beautiful software and welcome new project ideas.