How Rovo Chat embraces multi-agent orchestration
Get a behind-the-scenes glimpse of what this means for cutting-edge enterprise AI solutions
Rovo Chat is Atlassian’s conversational AI assistant, designed to help users retrieve information from their enterprise’s knowledge base to answer questions or perform actions such as updating a page.
With its intuitive interface and deep integration across Atlassian apps, Rovo Chat streamlines workflows and empowers teams to work more efficiently with instant access to knowledge. Since its launch, Rovo Chat’s capabilities have continuously expanded to meet the evolving needs of modern enterprises.
This blog discusses the evolution of Rovo Chat into our current multi-agent framework and what that means for your team. We’ve been constantly iterating, keeping a pulse on the latest AI capabilities, and identifying critical customer feedback to produce the strongest user experience yet.
Why Multi-Agents?
Agents are intelligent systems capable of solving tasks by leveraging both specialized tools and large language models (LLMs) for decision-making. By evolving Rovo Chat into a hierarchical multi-agent system, we allow it to flexibly handle a wider variety of tasks and scenarios, expanding its capabilities without compromising quality: each subtask in the user’s query is delegated to the most appropriate subagent.
Subtask decomposition
We explicitly tune our orchestrator to break down the user query into digestible sub-tasks when the query is complex. We exploit the parallel tool calling capabilities of modern language models here to naturally delegate these sub-tasks to the most appropriate tool/agent at each step of the orchestration.
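As a rough illustration of that delegation (not Rovo Chat’s actual implementation; the `call_llm` stub, tool names, and argument shapes below are hypothetical), a single model turn can emit several tool calls that the orchestrator fans out concurrently:

```python
# Minimal sketch of delegating decomposed sub-tasks via parallel tool calls.
# The LLM call is stubbed out; tool names and arguments are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def call_llm(query: str) -> list[dict]:
    """Placeholder for a model turn that returns parallel tool calls for a query."""
    return [
        {"tool": "jira_agent", "args": {"task": "find my open bugs"}},
        {"tool": "search", "args": {"query": "release checklist"}},
    ]

TOOLS = {
    "jira_agent": lambda args: f"[jira results for {args['task']}]",
    "search": lambda args: f"[search results for {args['query']}]",
}

def orchestrate(query: str) -> list[str]:
    tool_calls = call_llm(query)  # one model turn may emit several calls
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[c["tool"]], c["args"]) for c in tool_calls]
        return [f.result() for f in futures]  # sub-task results gathered in parallel

print(orchestrate("Summarize my open bugs and the release checklist"))
```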
Hierarchical Agent Structure
In the standard RAG flow, we typically see something like:
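In code form, a minimal sketch of that single-agent loop might look like the following; the retriever and model calls are stand-ins, not Rovo Chat internals.

```python
# Minimal single-agent RAG loop: retrieve context, then generate an answer from it.
# retrieve() and generate() are placeholders for a vector search and an LLM call.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder for enterprise knowledge-base retrieval."""
    return [f"document snippet {i} relevant to '{query}'" for i in range(k)]

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion."""
    return f"answer grounded in: {prompt[:80]}..."

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(rag_answer("How do I update a Confluence page?"))
```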

However, as we try to support more and more tools, it’s very easy for a single agent to get confused and make mistakes. To improve reliability, we structure the orchestration into hierarchical layers of subagents, allowing the orchestration to route to the most appropriate “expert” subagent. This also reduces the blast radius of introducing new tools, since a new tool only impacts its expert subagent.
Agents are defined around “domains” of functionality, so each can focus its full attention on a more specific subdomain of problems. This approach takes inspiration from the hierarchical search strategy many modern vector stores are built upon.
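As a hedged sketch of that two-level structure (the domains, routing rule, and tool names here are illustrative rather than the production layout), the orchestrator only picks a domain expert, and each expert owns its narrower set of tools:

```python
# Illustrative two-level hierarchy: the orchestrator selects a domain expert,
# and each expert manages its own narrower set of tools.

class SubAgent:
    def __init__(self, name: str, tools: dict):
        self.name, self.tools = name, tools

    def run(self, task: str) -> str:
        # In practice the expert would use an LLM to choose among its tools;
        # here we just call the first one for brevity.
        tool = next(iter(self.tools.values()))
        return tool(task)

JIRA = SubAgent("jira", {"jql_execution": lambda t: f"JQL results for: {t}"})
CONFLUENCE = SubAgent("confluence", {"page_search": lambda t: f"pages for: {t}"})

def route(task: str) -> SubAgent:
    """Placeholder for LLM-based domain selection at the orchestrator level."""
    return JIRA if "issue" in task or "board" in task else CONFLUENCE

task = "find high-priority issues on my board"
print(route(task).run(task))
```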

Domain Specialized Subagents as Tools
Jira Agent
Let’s take the Jira Agent as an example. Jira search is tied to the Jira Query Language (JQL) and allows for numerous filtering capabilities, such as ticket assignee, project key, and ticket resolution status. All of these nuanced capabilities are baked into the language model’s pre-training data to an extent, but specialized instructions are needed to optimize performance for Jira-related queries.
A single top-level orchestrator agent can easily become confused trying to deal with all the different domains of problems. We can obtain higher precision answers by leaving the domain selection problem to the orchestrator and allowing the domain-specialized subagent to attend to just that category of problems. As an example, some of the custom instructions we give the Jira Agent look like:

The tools exposed to the Jira Agent are:
| Tool Name | Purpose |
|---|---|
| JQL Documentation Search | Search for documentation on JQL |
| JQL Execution | Generate and execute JQL to get results |
| Entity Linking | Match named entities to their user id to help in JQL generation |
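To make that shape concrete, a hypothetical function-calling schema for those three tools might look like the snippet below; the field names and parameters are illustrative, not the exact schema Rovo Chat uses.

```python
# Hypothetical tool schemas a Jira subagent could expose to its LLM via function calling.
# Names and parameters are illustrative only.
JIRA_AGENT_TOOLS = [
    {
        "name": "jql_documentation_search",
        "description": "Search JQL documentation for syntax and field references.",
        "parameters": {"query": {"type": "string"}},
    },
    {
        "name": "jql_execution",
        "description": "Generate a JQL query for the task and execute it against Jira.",
        "parameters": {"jql": {"type": "string"}, "max_results": {"type": "integer"}},
    },
    {
        "name": "entity_linking",
        "description": "Resolve people mentioned in natural language to their user IDs.",
        "parameters": {"names": {"type": "array", "items": {"type": "string"}}},
    },
]
```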
Processing Infinite Jira Issues
One example of a capability we unlock with this hierarchical agent structure is the ability to process massive volumes of Jira issues. A common use case for a Jira-related query is analysis over a large batch of issues, such as “what issues are important for me to focus on in this board?” Some boards have thousands of tickets, which need to be filtered by priority, assignee, etc. However, all of these issues can’t just be stuffed into the LLM context, especially when we get into the territory of 1000+ issues.
To process a large volume of issues, we give the JQLExecutionTool specialized metadata that lets it decide whether it needs to loop over batches of Jira issues returned by a JQL query. From there we can iteratively refine the response output over many Jira issues to understand the full board. This also keeps the specialized instructions and Jira-specific tools out of the top-level orchestration, which already has to handle hundreds of tools across all domains and can easily get confused.
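A hedged sketch of that batched, iterative refinement is below; the page size, pagination interface, and summarize step are assumptions rather than the JQLExecutionTool’s real interface.

```python
# Sketch of iteratively refining an answer over paginated JQL results instead of
# stuffing every issue into the LLM context. fetch_page() and summarize() are stubs.

def fetch_page(jql: str, start: int, page_size: int, total: int = 2500) -> list[dict]:
    """Placeholder for paginated JQL execution over a large board."""
    end = min(start + page_size, total)
    return [{"key": f"PROJ-{i}", "priority": "High" if i % 7 == 0 else "Low"}
            for i in range(start, end)]

def summarize(running_summary: str, batch: list[dict]) -> str:
    """Placeholder for an LLM call that folds a new batch into the running summary."""
    high = [issue["key"] for issue in batch if issue["priority"] == "High"]
    return running_summary + f" | {len(high)} high-priority issues in this batch"

def analyze_board(jql: str, page_size: int = 100) -> str:
    summary, start = "board summary:", 0
    while batch := fetch_page(jql, start, page_size):  # loop until no issues remain
        summary = summarize(summary, batch)
        start += page_size
    return summary

print(analyze_board("project = PROJ ORDER BY priority DESC"))
```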
System Tools
Not all tasks necessarily belong to a particular “domain”. Sometimes a simple tool call is enough, without incurring a full agent call. For these tasks, we include system tools at the top-level orchestrator, letting us bypass the full agentic system when we don’t need that level of complexity.
A few of our system tools:
| Tool | Purpose | Target Query Examples |
|---|---|---|
| Search | Natural language search across enterprise and web data | “who is ceo” |
| UrlRead | Reads the contents of URLs found in text | “what is this about https://youtu.be/dQw4w9WgXcQ” |
| People | Entity linking from natural language for people lookups | “who is joe” |
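The net effect is a flat, top-level registry where lightweight system tools sit next to full domain subagents; a hypothetical sketch of that combined registry (names and signatures are illustrative):

```python
# Illustrative top-level registry mixing cheap system tools with heavier domain
# subagents, so the orchestrator can pick the lightest option that answers the query.

def search(query: str) -> str:          # simple system tool: one call, no agent loop
    return f"top search hits for '{query}'"

def url_read(url: str) -> str:          # simple system tool
    return f"contents of {url}"

def jira_agent(task: str) -> str:       # full domain subagent behind a single entry point
    return f"Jira expert handled: {task}"

TOP_LEVEL_REGISTRY = {
    "search": search,
    "url_read": url_read,
    "people": lambda name: f"profile for {name}",
    "jira_agent": jira_agent,
}
```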
Now that we’ve defined the concept of Agents and System Tools, we leave it to the LLM to decide the level of complexity for the query, routing to the appropriate reasoning mode.
Reasoning Modes
Brainstorming Scenario
These are queries that expect an LLM-only answer, with no tool calls at all and a low-latency expectation.

Tool QnA Scenario
These are queries that expect a single layer of parallel tool calls. Some latency is expected for tool calls such as search.

Reasoning Scenario
These are queries that require multi-step reasoning – leading to sequences of tool calls that carry dependencies on each other. Higher latency is expected for these queries to generate a multi-step plan, execute tool calls to fulfill that plan, and respond to the user. We explicitly output a natural language research plan for these cases to help steer subsequent tool calls for full task completion. This is especially important to help in decomposing queries that require multiple tasks to fulfill.
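Putting the three modes together, a hedged sketch of how a router might dispatch a query is shown below; `classify_mode` stands in for the LLM’s own decision about query complexity, and the heuristics are purely illustrative.

```python
# Illustrative routing between the three reasoning modes described above.
# classify_mode() is a stand-in for the LLM deciding the query's complexity.

def classify_mode(query: str) -> str:
    """Placeholder: the real system lets the LLM choose the mode."""
    if "plan" in query or " and " in query:
        return "reasoning"
    if "search" in query or "find" in query:
        return "tool_qna"
    return "brainstorm"

def answer(query: str) -> str:
    mode = classify_mode(query)
    if mode == "brainstorm":
        return "direct LLM answer, no tool calls"                 # lowest latency
    if mode == "tool_qna":
        return "one layer of parallel tool calls, then answer"    # moderate latency
    # reasoning: emit a natural-language research plan, then execute dependent steps
    return "research plan -> dependent tool calls -> final answer"

for q in ("brainstorm a team name",
          "find the Q3 roadmap page",
          "find my open bugs and plan next sprint"):
    print(q, "->", answer(q))
```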

Exploring Alternative Orchestration Models
Single-Agent Tool Orchestration
The first versions of Rovo Chat were built on a single-agent framework. In this version, we kept all tools at the top level with minimal groupings. To mitigate some of the misclassifications of tool calls, we trained intent classifiers on internal and synthetic data to help reduce the tool selection space. However, the workflow was very static and did not generalize well, especially as queries became more complex.

Multi-Agent Graph Orchestration

As we started moving towards a multi-agent world, we initially explored a planning phase that decomposed the user’s query into a directed acyclic graph (DAG) of subtasks. These subtasks were then assigned to the appropriate subagents. In this design, we wanted to challenge the limits of language models by leveraging a pure natural language interface without being bound by function-calling schemas. We theorized that a long-horizon task planning stage would unlock more fine-grained control over the orchestration states and maximize information flow between subagents by sequencing the most appropriate chains of agents. This approach marked the beginning of our hierarchical agent structure, in which subagents consisted of specialized groups of tools.
The graph approach enabled a natural decomposition of subtasks from the user query, triggering deeper subagent executions for more complex queries. Simple queries would generate a DAG with a shorter path to the sink node, while complex queries would have a longer topological depth. In this graph orchestration approach, we include Think and Answer as “control” subagents, so the orchestrator can explicitly decide to sequence answer or reflection states directly after retrieval. This let us maintain specially tuned agents that optimize for answer and planning quality, giving us stronger flexibility in orchestration state management.
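As a simplified sketch of that approach (the node names, planner output, and executor are illustrative), the planner emits a DAG of subtasks that is then executed in topological order, with Think and Answer closing the graph:

```python
# Simplified DAG execution: a planning step (stubbed) produces subtasks with
# dependencies, which are then run in topological order. Node names are illustrative.
from graphlib import TopologicalSorter

def plan(query: str) -> dict[str, set[str]]:
    """Placeholder planner: maps each subtask to the subtasks it depends on."""
    return {
        "jira_search": set(),
        "confluence_search": set(),
        "think": {"jira_search", "confluence_search"},  # reflection after retrieval
        "answer": {"think"},                             # answer state closes the graph
    }

def run_node(name: str, upstream: dict) -> str:
    """Placeholder subagent execution that can see all upstream results."""
    return f"{name} done (saw: {sorted(upstream)})"

def execute(query: str) -> dict[str, str]:
    results: dict[str, str] = {}
    for node in TopologicalSorter(plan(query)).static_order():
        results[node] = run_node(node, results)
    return results

print(execute("summarize blockers across Jira and Confluence"))
```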
We also introduced a “Direct Answer” path for queries that don’t require tool calls or retrieval to optimize for latency. This skips the entire graph orchestration and returns the LLM output directly.
Why we landed on the Current Hybrid Orchestration Model
The largest challenge with graph-based orchestration was reconfiguring the graph when a subagent failed to execute or did not provide the information required by a downstream task. From just the user’s query, we simply don’t have enough information to plan, in a single shot, an orchestration path in which every subagent’s output stays relevant under the graph’s topological ordering. The graph orchestration model is better suited for LLM post-training efforts, where we can formalize the problem as a partially observable Markov decision process using historical tool success rates as a reward and latencies as a penalty.
At this point, we decided to revert to a tool loop, where we generate a schema for all subagents and allow the LLM to naturally generate subtasks one step at a time, given the current context. This reduced the complexity of the graph-based orchestration and let us better scale our orchestration across Atlassian, enabling other teams to build upon it. We call this the Hybrid Orchestrator since it takes a hybrid approach, mixing tools and subagents to find the most efficient orchestration path.
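At a high level, the resulting tool loop looks something like the sketch below; the model call is stubbed and the registry is trimmed to two entries, but the core idea holds: system tools and subagents share one schema, and the LLM picks the next step from the current context on every turn.

```python
# Sketch of the hybrid tool loop: system tools and subagents share one registry,
# and the LLM decides the next step from current context. The LLM call is a stub.

REGISTRY = {
    "search": lambda args: f"search hits for {args['query']}",
    "jira_agent": lambda args: f"Jira expert handled: {args['task']}",
}

def call_llm(query: str, context: list[str]) -> dict:
    """Placeholder: returns either the next tool call or a final answer."""
    if not context:
        return {"tool": "jira_agent", "args": {"task": query}}
    return {"final_answer": f"answer based on {len(context)} tool result(s)"}

def hybrid_orchestrate(query: str, max_turns: int = 8) -> str:
    context: list[str] = []
    for _ in range(max_turns):
        step = call_llm(query, context)   # one step at a time, given current context
        if "final_answer" in step:
            return step["final_answer"]
        context.append(REGISTRY[step["tool"]](step["args"]))
    return "stopped after reaching max turns"

print(hybrid_orchestrate("what issues should I focus on this sprint?"))
```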
Evaluations
Quality
We maintain a curated offline evaluation set from our own internal and synthetic data, which contains reference answers. We leverage an LLM as a judge to evaluate whether an answer from Rovo Chat aligns with the reference answer. We assign a binary true/false label for response correctness against the golden answer, and the count of correct responses becomes our numerator.
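A hedged sketch of that scoring loop is below; the judge is stubbed with a string check, and the dataset format and prompt wording are assumptions rather than our actual evaluation harness.

```python
# Sketch of LLM-as-judge scoring against reference answers. judge() is a stub;
# in practice it would be an LLM call returning a binary correctness verdict.

def judge(question: str, reference: str, candidate: str) -> bool:
    """Placeholder for an LLM judge comparing the candidate to the reference answer."""
    return reference.lower() in candidate.lower()

EVAL_SET = [
    {"q": "Who owns project PROJ?", "reference": "Alice", "candidate": "Alice owns it."},
    {"q": "When is the release?", "reference": "June 3", "candidate": "Early July."},
]

correct = sum(judge(ex["q"], ex["reference"], ex["candidate"]) for ex in EVAL_SET)
print(f"response correctness: {correct}/{len(EVAL_SET)} = {correct / len(EVAL_SET):.0%}")
```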
The percentages shown below are: num correct responses / total responses.
| Single Agent RAG | Multi-Agent DAG Orchestrator | Hybrid Orchestrator |
|---|---|---|
| baseline | +2.52% | +3.49% |
Latency
While we run our eval sets, we also measure time-to-first-answer-token latency, which reflects the first update the user sees:
| | Single Agent RAG | Multi-Agent DAG Orchestrator | Multi-Agent Hybrid Orchestrator |
|---|---|---|---|
| P10 first token latency | Baseline | -71.7% | -75.96% |
| P50 first token latency | Baseline | -1.16% | -29.5% |
| P90 first token latency | Baseline | +2.24% | -19.97% |
The reductions in P10 latencies from single-agent → multi-agent are directly attributable to the “direct answer” pathways, which dynamic orchestration now allows for queries that don’t require a tool call.
The further reductions in the Hybrid Orchestrator stem from its more intelligent tool selection, which executes one layer of tools at a time without the explicit planning-phase overhead of the DAG Orchestrator.
Conclusion
We’re very excited by the additional capabilities we can unlock with our hierarchical multi-agent framework. There are also many promising research directions we can pursue within this framework to further improve Rovo Chat’s intelligence, thereby better empowering teams to deliver outstanding work.

