RPA is dead, long live RPA(gent)!
Diamond
While a very cool concept, AI Agents simply don’t work for most cases… yet. There are a handful of missing pieces, and the top one that hobbles our autonomous little fellows is a lack of human/Agent collaboration tools. The daemons want to be unleashed, but they need our help.
The rise of agentic AI systems has been swift and all encompassing. There are more agent maps now, and more fights over what an agent is, than there were AI startups a few years ago. “Agent” isn’t exactly a new term, and many people are attributing far too much to it.
The TLDR I always keep in mind (mostly stolen from the OGs themselves, Russell and Norvig) is that an Agent is able to perceive and act in a multi-step process with some aspect of planning, and is generally goal driven. This cuts out basic one-step classification or response type interactions (i.e. asking your LLM a question), and most of the old chatbots that used to be called chat agents (just Q&A type bots on a site, no goal in mind). At its core, it’s not much more than tools + for loop + AI model(s) = Agent.
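To make that equation concrete, here is a minimal sketch of the idea in Python. Every name here is illustrative, and call_llm is a stub standing in for whatever model API you use; this is a sketch of the pattern, not any real framework:

```python
# Minimal "tools + for loop + AI model(s) = Agent" sketch. All names are
# illustrative; call_llm stands in for any real model API.

def call_llm(prompt: str) -> dict:
    """Stand-in for a real model call. A real implementation would return a
    parsed action like {"tool": "search", "args": {"query": "..."}} or
    {"done": "final answer"} based on the prompt."""
    return {"done": "stubbed answer"}

TOOLS = {
    "search": lambda query: f"results for {query!r}",  # stub tool
    "calculator": lambda expr: str(eval(expr)),        # stub tool (demo only)
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):                 # the "for loop"
        action = call_llm("\n".join(history))  # the "AI model"
        if "done" in action:                   # goal reached, stop
            return action["done"]
        tool = TOOLS[action["tool"]]           # the "tools"
        observation = tool(**action["args"])   # act on the world
        history.append(f"Did {action['tool']}, saw: {observation}")
    return "gave up"
```

Strip out the goal, the loop, and the planning over history, and you are back to a one-step LLM call, which is exactly the line the definition draws.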
Agents are everywhere, but nowhere at once, and no one has cracked the code. Many Agents that do work require very little reasoning or thought over multiple turns or time, making them little more than glorified LLM calls.
We all want these daemons/agents to run ahead and do work for us, but they barely work. You know it’s bad when one of the hottest startups in the space brags about 14% correctness on test sets they picked out themselves! The legacy scripting of Robotic Process Automation (RPA) wasn’t much better, with very fragile, step-by-step, on-rails abilities that were mined from capturing every click and step taken.
If we are to build the next generation of intelligent systems, we need a new set of functionality to extend the current systems. Models from MAGMO (Meta, Anthropic, Google, Mistral, OpenAI; commonly used term, don’t worry) continuing to improve on reasoning abilities will help, but we believe a huge part of building Agents that work on complex problems requires human/agent collaboration. This encompasses:
Imparting new knowledge by showing, to share a new skill or set of steps that you want the Agent to learn from you. Today this is largely done by RPA systems with click capture, and with Agents through manual prompt entry and natural language description.
Corrections/feedback during and after completing tasks, to improve and correct how an Agent completes a task. There is very little in this feedback loop today other than observing logs and adjusting. Agents need the ability to consolidate knowledge, learn and adjust skills with counterfactual analysis, and more.
Multimodal dialogue management: easy back and forth at any time while teaching, correcting, etc., with the ability to discuss something you are both “seeing”. Move away from text-based interactions only.
Bi-directional teaching through a shared mouse/keyboard, where agents can not only teach other agents but also teach humans, to further share knowledge as humans and agents collaborate in a group. A human teaches an Agent, and the Agent can teach and explain to a different human. (A rough sketch of what this collaboration surface could look like follows this list.)
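To ground the four capabilities above, here is one hypothetical shape the collaboration surface could take. None of these class or method names are a real API; this is just a sketch of the interface we think Agents are missing:

```python
# Hypothetical human/agent collaboration interface covering the four
# capabilities above. Not a real API, just one way to shape the surface.
from dataclasses import dataclass, field

@dataclass
class Demonstration:
    """One observed example: input events plus the human's narration."""
    events: list = field(default_factory=list)     # clicks, keys, screenshots
    narration: list = field(default_factory=list)  # what the human said/typed

class CollaborativeAgent:
    def learn_from_demonstration(self, demo: Demonstration) -> None:
        """Capability 1: impart new knowledge by showing."""
        ...

    def receive_feedback(self, task_id: str, correction: str) -> None:
        """Capability 2: corrections/feedback during and after a task."""
        ...

    def discuss(self, message: str, shared_screen: bytes | None = None) -> str:
        """Capability 3: multimodal dialogue about something both 'see'."""
        ...

    def demonstrate_to_human(self, skill: str) -> Demonstration:
        """Capability 4: bi-directional teaching; replay a learned skill."""
        ...
```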
There are many new frameworks, infrastructure companies, and more popping up, but the human collaboration aspect is ill supported, and we plan to fix that.
Introducing the WatchMe concept
Today, I want to share our early thoughts on “WatchMe”, aka show and tell. WatchMe gives any Agent the ability to observe and listen to a user while they perform a task, with the goal of replicating it from this or many observed examples. This is part RPA, capturing the context and clicks of what is happening in order to replicate it, and part coaching, like you would coach another human, by explaining and answering questions.
The idea behind WatchMe is to standardize the capture and collaboration space between Agents and Humans, and to let any Agent builder add this superpower as an ability of any Agent. A WatchMe-enabled Agent can work with a human to learn from and interact on a website, app, or across a user’s desktop. Agents can then reverse the script and show how they perform the action, with a chance for the user to provide feedback, in a visual way that moves beyond just looking at traces or logs and feels more like sharing a screen with another person. This allows for learning across multiple steps, pointing out areas to improve/correct, or helping in a more natural way when stuck.
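As a sketch of what a standardized capture stream might look like (the event shape below is our assumption, not a published spec), the key idea is pairing low-level RPA-style events with the human’s narration, so an Agent can learn the “why” and not just the “what”:

```python
# Sketch of a WatchMe-style capture stream; the event shape is an
# assumption for illustration, not a published spec.
import time
from dataclasses import dataclass

@dataclass
class WatchMeEvent:
    timestamp: float   # when it happened
    kind: str          # "click" | "keys" | "screenshot" | "narration"
    target: str | None # e.g. a UI element or URL, when applicable
    payload: str       # text typed, words spoken, or an image reference

def capture_click(element: str) -> WatchMeEvent:
    return WatchMeEvent(time.time(), "click", element, "")

def capture_narration(words: str) -> WatchMeEvent:
    return WatchMeEvent(time.time(), "narration", None, words)

# A recorded session is an ordered list of events that an Agent can replay
# step by step, or consolidate into a reusable skill:
session = [
    capture_click("browser:#submit-button"),
    capture_narration("I always double-check the form before submitting."),
]
```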
We’ve been testing the observation and encoding of actions and explanations up to this point, and have early experiments with replicating these user actions. The way forward is not for every RPA company to become an AI company; it’s to give a million Agents the ability to see, learn, and interact with humans where they work, and to break out of the fragile RPA systems of yesteryear.
Once WatchMe is figured out, there is a second concept worth mentioning: giving Agents not just the ability to learn from what you’re doing and replay it on your computer, but their own “personal” computer. We’re currently calling this the AgentVM (AVM), but that is subject to change as we evolve the tech 🙂. We’ve gone from chatbots to agents that can run simple commands, but now we are giving them their own PC to allow them to be more like drop-in remote workers who go off, get work done, interact with each other and us as needed, and come back afterwards. More to come on this later.
Watch a quick talk on how we can improve Agents here that covers a few more things from a presentation I did at Google recently:
Thanks!
-Diamond
Special thanks to 2x Nick (Arner & Walton) and 2x Alex (Reibman & Fazio) for feedback and thoughts on this concept.
Appendix - YAAD (Yet Another Agent Definition)
What is an AI Agent?
AI Agents and Assistants can overlap or be confused in many ways given the new proliferation of the term “Agent”. In order to be more specific, we can define an Agent as requiring some “agency”, in that they can perceive and act in a multi-step process with some aspect of planning, where they are generally goal driven.
This cuts out basic one-step classification or response type interactions (i.e. asking your LLM a question), and the old chatbots that used to be called chat agents (just Q&A type bots on a site, no goal in mind). An agent’s primary distinction from LLMs is that they run in a self-directed loop, largely augmented by a lightweight prompting layer and some kind of persistence or memory to handle the multi-turn/step interactions over time. I would then posit that we should require some aspect of proactive ability (the ability to take steps without a human reactive trigger).
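Putting those pieces together, a minimal sketch of that definition might look like the following; perceive, decide, and act are stubs standing in for real sensors, a model call behind a prompt template, and actuators (all names here are illustrative):

```python
# Sketch of the definition above: a self-directed loop, a lightweight
# prompting layer, persistence/memory, and a proactive (non-human) trigger.
import time

class Memory:
    """Minimal persistence across turns/steps."""
    def __init__(self):
        self.entries: list[str] = []
    def remember(self, note: str) -> None:
        self.entries.append(note)
    def recall(self) -> str:
        return "\n".join(self.entries[-20:])  # lightweight prompting layer

def perceive() -> str:
    """Sensors: read the environment (stub)."""
    return "nothing new"

def decide(context: str) -> str:
    """Model call behind a prompt template (stub)."""
    return "wait"

def act(decision: str) -> None:
    """Actuators: act on the environment (stub)."""
    pass

def agent_loop(memory: Memory, poll_seconds: int = 60) -> None:
    while True:                       # self-directed: no human trigger needed
        observation = perceive()
        decision = decide(memory.recall() + "\n" + observation)
        act(decision)
        memory.remember(f"saw {observation!r}, did {decision!r}")
        time.sleep(poll_seconds)      # proactive: it wakes itself up
```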
In the classic AI literature, every student who reads Russell and Norvig has seen the definition of an agent as anything that can be considered able to perceive its environment through sensors and act on this environment through actuators.
What does the ecosystem look like, and what else matters besides the Agent itself?
AI Agents have exploded in the last 6 months, and though there is still debate about whether any work yet, we can assume they will as models and Agent capabilities improve. We can categorize the parts of the Agent ecosystem as:
Agent Apps (a domain or task specific Agent, made to be used by an end user)
Agent Builders (tools or frameworks for creating Agents)
Agent Orchestrators (managing and orchestrating multiple agents)
Agent Tools (services/tools/extensions made to improve the capabilities or action space of agents)
Agent Models (foundation models built specifically for Agents)
Agent Ops (observing, debugging, etc. running agents)
There has been significant research in the space (see Appendix - Posts/Docs List), but we’re still very early in making all this work well in production. We’ll review what exists or is being built today in the industry, then get to potential problems and pain points.
Let’s start with Agent Applications: there are a very large number of early stage Agents being built for specific applications or tasks. These tend to be replacing specific non-AI applications or making it easier to get a task done. The top domains/tasks are 1) coding (write some code for me iteratively, more than just high-end auto-complete), 2) sales and marketing (manage and handle outreach to potential customers, think self-driven CRM), 3) misc. back office common workflows such as handling HR interactions for common problems, and 4) research + writing (go off and find data about a subject and write it up/build a knowledge base), with more coming up every day.
These Agent Applications can be seen as the top of the stack. In order to build one you might use an Agent Builder; these largely exist as either 1) libraries and frameworks, or 2) low/no-code builders. Looking at the frameworks first:
LangGraph: Good tool to build workflow automation. But LangChain has some marketing issues and potentially too much complexity in many cases for what it provides.
AutoGen: First to introduce multi-agent interaction. However, due to its complexity, AutoGen has issues around debugging, but it has a large following.
CrewAI: Open-source library built on LangChain. It is faster to build on and debug than AutoGen in usability, but the underlying LangChain framework can make it heavy (though they seem to be offering multiple options past LangChain).
In the low/no-code space there are a growing number, many of which are simple workflow builders (debatably Agents by our definition, but they can evolve, like Google’s and Zapier’s Agent Builders). Fully no-code general Agents can somewhat be seen as Agent Builders too, though they are more akin to RPA-like process mining, where they take a lot of observations and go from natural language to actions. Companies in this space include:
MultiOn (early beta, limited tasks, browser only)
Adept (enterprise only focused right now, browser only, recently beheaded by Amazon, so…)
HyperWrite (same as MultiOn basically)
GetLindy (enterprise focused version so hard to get access, but looks interesting)
Agent Orchestration tends to just be built within one of the Agent Builder frameworks or services today, and in many cases also has the concept of the Reasoning Engine built in, in addition to multi-agent coordination. Companies mentioned earlier like CrewAI, as well as DeepWisdomAI, are making moves in this space.
Agents built and orchestrated can’t do a lot in the ether unless they are given the ability to Act. Agents are generally given a set of actions to execute. These can be custom made by the creator, but agents usually extend their action space with a variety of Agent Tools that allow them to interact with the external world they are observing. This might be using APIs such as the Bing search API (which OpenAI does), asking a human for feedback, sending an email, using a browser (like BrowserBase and equivalents), or handling auth (Anon). This is likely the space with the most room to expand, but it can also cause issues when there are too many tools to choose from.
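As a rough illustration of what an entry in that action space tends to look like, here is a generic tool definition and registry; the exact shape varies by framework, and this is our sketch rather than any specific vendor’s format:

```python
# Generic sketch of a tool definition: a name, a natural-language
# description the model reads, and a parameter schema the model fills in.
SEND_EMAIL_TOOL = {
    "name": "send_email",
    "description": "Send an email on the user's behalf.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient address"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}

ASK_HUMAN_TOOL = {
    "name": "ask_human",
    "description": "Pause and ask the user a clarifying question.",
    "parameters": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}

# The registry the agent loop dispatches against; when this grows too large,
# you hit exactly the "too many tools to choose from" problem noted above.
TOOL_REGISTRY = {t["name"]: t for t in [SEND_EMAIL_TOOL, ASK_HUMAN_TOOL]}
```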
Running any of these Agents in production (which few are doing frequently right now) has similar problems to any distributed and stochastic system, where understanding why or how something happened can be hard to trace and debug. There are a variety of new Agent Ops companies popping up to help with observability, such as the aptly named agentops.ai, which is taking off and very well positioned to take on this observability and ops space (if an incumbent like DataDog or AWS/Azure themselves doesn’t).
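At its simplest, the core primitive this tooling offers is tagging every step with a trace ID so a multi-step, stochastic run can be reconstructed after the fact. Here is a toy illustration of the idea, not how any particular vendor implements it:

```python
# Toy step tracing: tag every agent step with a trace ID and emit a record,
# so a failed multi-step run can be reconstructed later.
import functools, json, time, uuid

def traced(step_name: str):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, trace_id=None, **kwargs):
            trace_id = trace_id or str(uuid.uuid4())
            record = {"trace": trace_id, "step": step_name, "t": time.time()}
            try:
                result = fn(*args, **kwargs)
                record["status"] = "ok"
                return result
            except Exception as e:
                record["status"] = f"error: {e}"
                raise
            finally:
                print(json.dumps(record))  # ship to a real log store instead
        return inner
    return wrap

@traced("plan")
def plan(goal: str) -> list[str]:
    return [f"step for {goal}"]
```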