RPA is dead, long live RPA(gent)!
Diamond
While a very cool concept, AI Agents simply don’t work for most cases… yet. There are a handful of missing pieces, and the top one that hobbles our autonomous little fellows is a lack of human/Agent collaboration tools. The daemons want to be unleashed, but they need our help.
The rise of agentic AI systems has been swift and all encompassing. There are more agent maps now, and more fights over what an agent is, than there were AI startups a few years ago. “Agent” isn’t exactly a new term, and many people are attributing far too much to it.
The TLDR I always keep in mind (mostly stolen from the OGs themselves, Russell and Norvig) is that an Agent is able to perceive and act in a multi-step process with some aspect of planning, and is generally goal driven. This cuts out basic one-step classification or response type interactions (i.e. asking your LLM a question), and most of the old chatbots that used to be called chat agents (just Q&A type bots on a site, no goal in mind). At its core, it’s not much more than tools + for loop + AI model(s) = Agent.
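To make that equation concrete, here is a minimal sketch of the idea in Python. Every name here is illustrative, and call_llm is a stub standing in for whatever model API you use; this is a sketch of the pattern, not any real framework:

```python
# Minimal "tools + for loop + AI model(s) = Agent" sketch. All names are
# illustrative; call_llm stands in for any real model API.

def call_llm(prompt: str) -> dict:
    """Stand-in for a real model call. A real implementation would return a
    parsed action like {"tool": "search", "args": {"query": "..."}} or
    {"done": "final answer"} based on the prompt."""
    return {"done": "stubbed answer"}

TOOLS = {
    "search": lambda query: f"results for {query!r}",  # stub tool
    "calculator": lambda expr: str(eval(expr)),        # stub tool (demo only)
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):                 # the "for loop"
        action = call_llm("\n".join(history))  # the "AI model"
        if "done" in action:                   # goal reached, stop
            return action["done"]
        tool = TOOLS[action["tool"]]           # the "tools"
        observation = tool(**action["args"])   # act on the world
        history.append(f"Did {action['tool']}, saw: {observation}")
    return "gave up"
```

Strip out the goal, the loop, and the planning over history, and you are back to a one-step LLM call, which is exactly the line the definition draws.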
Agents are everywhere, but nowhere at once, and no one has cracked the code. Many Agents that do work require very little reasoning or thought over multiple turns or time, making them little more than glorified LLM calls.
We all want these daemons/agents to run ahead and do work for us, but they barely work. You know it’s bad when one of the hottest startups in the space brags about 14% correctness on test sets they picked out themselves! The legacy scripting of Robotic Process Automation (RPA) wasn’t much better, with very fragile, step-by-step, on-rails abilities that were mined from capturing every click and step taken.
If we are to build the next generation of intelligent systems, we need a new set of functionality to extend the current systems. Models from MAGMO (Meta, Anthropic, Google, Mistral, OpenAI; commonly used term, don’t worry) continuing to improve on reasoning abilities will help, but we believe a huge part of building Agents that work on complex problems requires human/agent collaboration. This encompasses:
Imparting new knowledge by showing, to share a new skill or set of steps that you want the Agent to learn from you. Today this is largely done by RPA systems with click capture, and with Agents through manual prompt entry and natural language description.
Corrections/feedback during and after completing tasks, to improve and correct how an Agent completes a task. There is very little in this feedback loop today other than observing logs and adjusting. Agents need the ability to consolidate knowledge, learn and adjust skills with counterfactual analysis, and more.
Multimodal dialogue management: easy back and forth at any time while teaching, correcting, etc., with the ability to discuss something you are both “seeing”. Move away from text-based interactions only.
Bi-directional teaching through a shared mouse/keyboard, where agents can not only teach other agents but also teach humans, to further share knowledge as humans and agents collaborate in a group. A human teaches an Agent, and the Agent can teach and explain to a different human. (A rough sketch of what this collaboration surface could look like follows this list.)
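To ground the four capabilities above, here is one hypothetical shape the collaboration surface could take. None of these class or method names are a real API; this is just a sketch of the interface we think Agents are missing:

```python
# Hypothetical human/agent collaboration interface covering the four
# capabilities above. Not a real API, just one way to shape the surface.
from dataclasses import dataclass, field

@dataclass
class Demonstration:
    """One observed example: input events plus the human's narration."""
    events: list = field(default_factory=list)     # clicks, keys, screenshots
    narration: list = field(default_factory=list)  # what the human said/typed

class CollaborativeAgent:
    def learn_from_demonstration(self, demo: Demonstration) -> None:
        """Capability 1: impart new knowledge by showing."""
        ...

    def receive_feedback(self, task_id: str, correction: str) -> None:
        """Capability 2: corrections/feedback during and after a task."""
        ...

    def discuss(self, message: str, shared_screen: bytes | None = None) -> str:
        """Capability 3: multimodal dialogue about something both 'see'."""
        ...

    def demonstrate_to_human(self, skill: str) -> Demonstration:
        """Capability 4: bi-directional teaching; replay a learned skill."""
        ...
```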
There are many new frameworks, infrastructure companies, and more popping up, but the human collaboration aspect is ill supported, and we plan to fix that.
Introducing the WatchMe concept
Today, I want to share our early thoughts on “WatchMe”, aka show and tell. WatchMe gives any Agent the ability to observe and listen to a user while they perform a task, with the goal of replicating it from this or many observed examples. This is part RPA, capturing the context and clicks of what is happening in order to replicate it, and part coaching, like you would coach another human, by explaining and answering questions.
The idea behind WatchMe is to standardize the capture and collaboration space between Agents and Humans, and to let any Agent builder add this superpower as an ability of any Agent. A WatchMe-enabled Agent can work with a human to learn from and interact on a website, app, or across a user’s desktop. Agents can then reverse the script and show how they perform the action, with a chance for the user to provide feedback, in a visual way that moves beyond just looking at traces or logs and feels more like sharing a screen with another person. This allows for learning across multiple steps, pointing out areas to improve/correct, or helping in a more natural way when stuck.
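As a sketch of what a standardized capture stream might look like (the event shape below is our assumption, not a published spec), the key idea is pairing low-level RPA-style events with the human’s narration, so an Agent can learn the “why” and not just the “what”:

```python
# Sketch of a WatchMe-style capture stream; the event shape is an
# assumption for illustration, not a published spec.
import time
from dataclasses import dataclass

@dataclass
class WatchMeEvent:
    timestamp: float   # when it happened
    kind: str          # "click" | "keys" | "screenshot" | "narration"
    target: str | None # e.g. a UI element or URL, when applicable
    payload: str       # text typed, words spoken, or an image reference

def capture_click(element: str) -> WatchMeEvent:
    return WatchMeEvent(time.time(), "click", element, "")

def capture_narration(words: str) -> WatchMeEvent:
    return WatchMeEvent(time.time(), "narration", None, words)

# A recorded session is an ordered list of events that an Agent can replay
# step by step, or consolidate into a reusable skill:
session = [
    capture_click("browser:#submit-button"),
    capture_narration("I always double-check the form before submitting."),
]
```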
We’ve been testing the observation and encoding of actions and explanations up to this point, and have early experiments with replicating these user actions. The way forward is not for every RPA company to become an AI company; it’s to give a million Agents the ability to see, learn, and interact with humans where they work, and to break out of the fragile RPA systems of yesteryear.
Once WatchMe is figured out, there is a second concept worth mentioning: giving Agents not just the ability to learn from what you’re doing and replay it on your computer, but their own “personal” computer. We’re currently calling this the AgentVM (AVM), but that is subject to change as we evolve the tech 🙂. We’ve gone from chatbots to agents that can run simple commands, but now we are giving them their own PC to allow them to be more like drop-in remote workers who go off, get work done, interact with each other and us as needed, and come back afterwards. More to come on this later.
Watch a quick talk on how we can improve Agents here that covers a few more things from a presentation I did at Google recently:
Thanks!
-Diamond
Special thanks to 2x Nick (Arner & Walton) and 2x Alex (Reibman & Fazio) for feedback and thoughts on this concept.
Appendix - YAAD (Yet Another Agent Definition)
What is an AI Agent?
AI Agents and Assistants can overlap or be confused in many ways given the new proliferation of the term “Agent”. In order to be more specific, we can define an Agent as requiring some “agency”, in that they can perceive and act in a multi-step process with some aspect of planning, where they are generally goal driven.
This cuts out basic one-step classification or response type interactions (i.e. asking your LLM a question), and the old chatbots that used to be called chat agents (just Q&A type bots on a site, no goal in mind). An agent’s primary distinction from LLMs is that they run in a self-directed loop, largely augmented by a lightweight prompting layer and some kind of persistence or memory to handle the multi-turn/step interactions over time. I would then posit that we should require some aspect of proactive ability (the ability to take steps without a human reactive trigger).
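Putting those pieces together, a minimal sketch of that definition might look like the following; perceive, decide, and act are stubs standing in for real sensors, a model call behind a prompt template, and actuators (all names here are illustrative):

```python
# Sketch of the definition above: a self-directed loop, a lightweight
# prompting layer, persistence/memory, and a proactive (non-human) trigger.
import time

class Memory:
    """Minimal persistence across turns/steps."""
    def __init__(self):
        self.entries: list[str] = []
    def remember(self, note: str) -> None:
        self.entries.append(note)
    def recall(self) -> str:
        return "\n".join(self.entries[-20:])  # lightweight prompting layer

def perceive() -> str:
    """Sensors: read the environment (stub)."""
    return "nothing new"

def decide(context: str) -> str:
    """Model call behind a prompt template (stub)."""
    return "wait"

def act(decision: str) -> None:
    """Actuators: act on the environment (stub)."""
    pass

def agent_loop(memory: Memory, poll_seconds: int = 60) -> None:
    while True:                       # self-directed: no human trigger needed
        observation = perceive()
        decision = decide(memory.recall() + "\n" + observation)
        act(decision)
        memory.remember(f"saw {observation!r}, did {decision!r}")
        time.sleep(poll_seconds)      # proactive: it wakes itself up
```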
In the classic AI literature, every student who reads Russell and Norvig has seen the definition of an agent as anything that can be considered able to perceive its environment through sensors and act on this environment through actuators.
What does the ecosystem look like, and what else matters besides the Agent itself?
AI Agents have exploded in the last 6 months, and though there is still debate about whether any work yet, we can assume they will as models and Agent capabilities improve. We can categorize the parts of the Agent ecosystem as:
Agent Apps (a domain or task specific Agent, made to be used by an end user)
Agent Builders (tools or frameworks for creating Agents)
Agent Orchestrators (managing and orchestrating multiple agents)
Agent Tools (services/tools/extensions made to improve the capabilities or action space of agents)
Agent Models (foundation models built specifically for Agents)
Agent Ops (observing, debugging, etc. running agents)
There has been significant research in the space (see Appendix - Posts/Docs List), but we’re still very early in making all this work well in production. We’ll review what exists or is being built today in the industry, then get to potential problems and pain points.
Let’s start with Agent Applications: there are a very large number of early stage Agents being built for specific applications or tasks. These tend to be replacing specific non-AI applications or making it easier to get a task done. The top domains/tasks are 1) coding (write some code for me iteratively, more than just high-end auto-complete), 2) sales and marketing (manage and handle outreach to potential customers, think self-driven CRM), 3) misc. back office common workflows such as handling HR interactions for common problems, and 4) research + writing (go off and find data about a subject and write it up/build a knowledge base), with more coming up every day.
These Agent Applications can be seen as the top of the stack. In order to build one you might use an Agent Builder; these largely exist as either 1) libraries and frameworks, or 2) low/no-code builders. Looking at the frameworks first:
LangGraph: Good tool to build workflow automation. But LangChain has some marketing issues and potentially too much complexity in many cases for what it provides.
AutoGen: First to introduce multi-agent interaction. However, due to its complexity, AutoGen has issues around debugging, but it has a large following.
CrewAI: Open-source library built on LangChain. It is faster to build on and debug than AutoGen in usability, but the underlying LangChain framework can make it heavy (though they seem to be offering multiple options past LangChain).
In the low/no-code space there are a growing number, many of which are simple workflow builders (debatably Agents by our definition, but they can evolve, like Google’s and Zapier’s Agent Builders). Fully no-code general Agents can somewhat be seen as Agent Builders too, though they are more akin to RPA-like process mining, where they take a lot of observations and go from natural language to actions. Companies in this space include:
MultiOn (early beta, limited tasks, browser only)
Adept (enterprise only focused right now, browser only, recently beheaded by Amazon, so…)
HyperWrite (same as MultiOn basically)
GetLindy (enterprise focused version so hard to get access, but looks interesting)
Agent Orchestration tends to just be built within one of the Agent Builder frameworks or services today, and in many cases also has the concept of the Reasoning Engine built in, in addition to multi-agent coordination. Companies mentioned earlier like CrewAI, as well as DeepWisdomAI, are making moves in this space.
Agents built and orchestrated can’t do a lot in the ether unless they are given the ability to Act. Agents are generally given a set of actions to execute. These can be custom made by the creator, but agents usually extend their action space with a variety of Agent Tools that allow them to interact with the external world they are observing. This might be using APIs such as the Bing search API (which OpenAI does), asking a human for feedback, sending an email, using a browser (like BrowserBase and equivalents), or handling auth (Anon). This is likely the space with the most room to expand, but it can also cause issues when there are too many tools to choose from.
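As a rough illustration of what an entry in that action space tends to look like, here is a generic tool definition and registry; the exact shape varies by framework, and this is our sketch rather than any specific vendor’s format:

```python
# Generic sketch of a tool definition: a name, a natural-language
# description the model reads, and a parameter schema the model fills in.
SEND_EMAIL_TOOL = {
    "name": "send_email",
    "description": "Send an email on the user's behalf.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient address"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}

ASK_HUMAN_TOOL = {
    "name": "ask_human",
    "description": "Pause and ask the user a clarifying question.",
    "parameters": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}

# The registry the agent loop dispatches against; when this grows too large,
# you hit exactly the "too many tools to choose from" problem noted above.
TOOL_REGISTRY = {t["name"]: t for t in [SEND_EMAIL_TOOL, ASK_HUMAN_TOOL]}
```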
Running any of these Agents in production (which few are doing frequently right now) has similar problems to any distributed and stochastic system, where understanding why or how something happened can be hard to trace and debug. There are a variety of new Agent Ops companies popping up to help with observability, such as the aptly named agentops.ai, which is taking off and very well positioned to take on this observability and ops space (if an incumbent like DataDog or AWS/Azure themselves doesn’t).
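At its simplest, the core primitive this tooling offers is tagging every step with a trace ID so a multi-step, stochastic run can be reconstructed after the fact. Here is a toy illustration of the idea, not how any particular vendor implements it:

```python
# Toy step tracing: tag every agent step with a trace ID and emit a record,
# so a failed multi-step run can be reconstructed later.
import functools, json, time, uuid

def traced(step_name: str):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, trace_id=None, **kwargs):
            trace_id = trace_id or str(uuid.uuid4())
            record = {"trace": trace_id, "step": step_name, "t": time.time()}
            try:
                result = fn(*args, **kwargs)
                record["status"] = "ok"
                return result
            except Exception as e:
                record["status"] = f"error: {e}"
                raise
            finally:
                print(json.dumps(record))  # ship to a real log store instead
        return inner
    return wrap

@traced("plan")
def plan(goal: str) -> list[str]:
    return [f"step for {goal}"]
```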