Who evaluates the evaluators? The data science behind agent evals

An inside look at the data science and evaluation systems helping teams improve agent quality at scale in Copilot Studio.

At Microsoft Copilot Studio, we talk a lot about ways to make agents better. You can add knowledge sources, fine-tune models, insert prompts, and more. But how do you know whether an agent is actually getting better? How do you detect regressions before your users do? And perhaps most importantly, how do you trust the signals you’re using to make those decisions?

As data scientists, we don’t ship a model without evaluating it. We evaluate it before the first release, and we evaluate it after every meaningful change.

We validate offline, track metrics over time, compare variants, look for regressions, and ask a simple question: Did this change actually make the model better?

When we started building evaluation features in Copilot Studio, we treated them the same way. We asked: Is the evaluation giving the right answer to “is it better”? This is the quality question.

To answer it, we’ll explore three core areas of AI evaluation quality:

The data behind evaluation, and why generated datasets play such an important role in agent testing.
The evaluators (graders) themselves, and how we validate that graders produce reliable signals.
The metrics we use to determine whether those signals are trustworthy enough to support real-world decisions.

Because as organizations increasingly rely on evaluation to improve AI systems, confidence in the evaluation process becomes just as important as confidence in the agent.

From model evaluation to agent evaluation

Traditional machine learning evaluation is relatively well‑defined. You have:

Labeled data
A clear task
A small set of metrics
A mostly static input–output mapping

AI agents challenge many of these assumptions.

Agents operate over multi‑turn conversations, adapt to user behavior, use tools, and rely on implicit reasoning. Accordingly, they are judged across multiple quality dimensions: correctness, completeness, clarity, coherence, tone, and more.

So Copilot Studio ships evaluation features to help makers answer questions like:

Did my agent regress after this change?
Does quality degrade in longer conversations?
Can I trust my agent to behave as expected?

But that immediately raises a second‑order question: How do we know the evaluation features themselves are giving the right answer?

Building the right evaluation data for agents

In data science, we know that evaluation quality is bounded by evaluation data.

If the data you evaluate on is narrow, unrealistic, or biased, the metrics will look confident but be wrong. That principle applies just as much when we evaluate evaluation features as when makers evaluate their agents.

Real data: Grounded, but limited

For makers, Copilot Studio supports importing real production data to evaluate agent performance. Internally, however, we do not use customer data in any form when validating evaluation features.

Instead, we rely on generated datasets and on curated, semi‑real examples that allow controlled and systematic testing. The semi‑real examples include scenarios inspired by conversations shared by design partners, as well as examples curated during feedback cycles.

For us, and for many of our makers, these alone are not sufficient.

Generated data: Scalable, targeted, and intentional

Real and semi‑real examples provide valuable grounding, but in practice, most evaluation workflows (both internally and for makers) rely primarily on generated data. And that isn’t a compromise; it’s an intentional design choice.

Generated data allows evaluation to start earlier, scale faster, and cover a broader range of agent behaviors.

From our perspective as a data science team, generated datasets are essential for supporting the wide variety of agents built in Copilot Studio. They allow us to validate evaluation features across different agent types, domains, and interaction patterns, and to do so at a scale that would not be feasible with curated examples alone.

For makers, the motivations are equally practical:

Evaluating before publishing: Generated datasets make it possible to assess agent behavior and quality before the agent is exposed to real users.
Limited or restricted access to production data: In many cases, makers do not have access to their agent’s production conversations at all, due to compliance, governance, or organizational policies.
Working with production data selectively: Even when production data exists, it often needs filtering or augmentation to support systematic evaluation.

Perhaps the strongest motivation is how easy it is to generate high‑quality evaluation datasets. Copilot Studio enables makers to create test sets that are targeted, repeatable, and aligned with their agent’s intended behavior—without requiring manual data collection.

Evaluating data generation for agent evaluations

We support several data generation strategies because each surfaces different aspects of agent behavior. Applied together, they give makers practical, high‑coverage evaluation datasets.

Data generation strategies

There are four main types of dataset generation:

Single‑turn generation allows makers to test specific behaviors in isolation. These datasets are easier to reason about and are well‑suited for validating correctness, relevance, and instruction adherence.
Multi‑turn generation adds the complexity of context tracking and conversational dependencies. This is particularly useful for makers building predefined flows or agents whose behavior depends on conversation state.
Knowledge‑based generation tends to produce very concrete, sometimes highly specific questions. These queries are effective for testing grounding and answerability against the agent’s connected knowledge sources.
Topic‑based and instruction‑based generation often lead to more general or exploratory questions. These datasets are useful for identifying unsupported or weakly supported areas—reasonable questions users may ask that fall outside the agent’s main flows.

By combining these generation types, makers can build large and diverse evaluation sets that cover both expected and unexpected usage patterns.

How we evaluate generated queries

Because data generation itself is an evaluation feature, we explicitly assess the quality of generated queries. We use an LLM‑as‑a‑judge methodology to assess dataset quality along several dimensions, including:

Relevance. How well queries align with the agent’s intended scope.
Interaction naturalness. Whether queries resemble plausible user goals, confusion, and follow‑ups.
Human likeness. The extent to which generated queries resemble questions a human would naturally ask.
Redundancy. Whether examples add new coverage rather than repeating similar patterns.
Intent diversity. The range of user intents represented in queries (for example, informational, troubleshooting, or exploratory).

In addition, we apply generation‑specific measures where appropriate, such as topic coverage for topic‑based generation or grounding for knowledge‑based generation. These assess, for example, whether questions are answerable using the provided sources.

These metrics allow us to reason systematically about whether a generation capability produces datasets that are broad, targeted, and useful for evaluation—without relying on subjective impressions.
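
As a rough illustration of how per-query judge outputs might be rolled up into dataset-level numbers, here is a minimal Python sketch. It is not the Copilot Studio implementation: the 1 to 5 score scale, the field names, and the use of normalized entropy as an intent diversity measure are all assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean

# Hypothetical judge output: one record per generated query.
# The 1-5 scale and the field names are assumptions for this sketch.
judged = [
    {"relevance": 5, "naturalness": 4, "human_likeness": 4, "intent": "informational"},
    {"relevance": 4, "naturalness": 5, "human_likeness": 3, "intent": "troubleshooting"},
    {"relevance": 2, "naturalness": 3, "human_likeness": 4, "intent": "informational"},
    {"relevance": 5, "naturalness": 4, "human_likeness": 5, "intent": "exploratory"},
]

SCORE_DIMENSIONS = ("relevance", "naturalness", "human_likeness")


def summarize_scores(records):
    """Average each judged dimension across the dataset."""
    return {dim: mean(r[dim] for r in records) for dim in SCORE_DIMENSIONS}


def intent_diversity(records):
    """Normalized Shannon entropy of the observed intent labels.

    Returns 0.0 when every query shares one intent and 1.0 when the
    observed intents are evenly represented.
    """
    counts = Counter(r["intent"] for r in records)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


report = summarize_scores(judged)
report["intent_diversity"] = intent_diversity(judged)
print(report)
```

Reporting one number per dimension keeps low relevance from being masked by, say, high naturalness, which a single blended score would hide.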

Evaluating graders: Assessing the quality of our evaluators

Graders are the evaluators we build to help makers assess their agents. They produce the scores and labels that makers use to understand what works well and what needs to be improved. For that reason, we assess graders explicitly and independently before they are exposed to makers.

What we expect from a high‑quality grader

We treat a grader as a system that estimates quality, not one that produces a single absolute answer. We assess graders using the same principles we would apply to any automated evaluation system.

Concretely, we ask whether a grader:

Measures the intended dimension and only that dimension.
Distinguishes between meaningful differences in responses.
Behaves consistently across similar inputs.
Produces interpretable and stable signals that can support downstream decisions.

A grader that produces reasonable explanations but inconsistent judgments doesn’t meet the bar.
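
One simple way to put a number on consistency is to group grader verdicts for inputs that should be treated the same (for example, repeated runs or paraphrases) and check how often the grader agrees with itself. The Python sketch below assumes verdicts are plain "pass"/"fail" strings and that the grouping already exists; both are illustrative choices, not a description of the product.

```python
from collections import Counter


def consistency_rate(verdicts_by_group):
    """Fraction of groups in which every verdict is identical."""
    groups = list(verdicts_by_group.values())
    if not groups:
        raise ValueError("at least one group of verdicts is required")
    unanimous = sum(1 for verdicts in groups if len(set(verdicts)) == 1)
    return unanimous / len(groups)


def majority_agreement(verdicts):
    """Share of verdicts in one group that match the most common verdict."""
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)


# Hypothetical verdicts: each key is a group of equivalent inputs.
verdicts = {
    "q1": ["pass", "pass", "pass"],
    "q2": ["pass", "fail", "pass"],
    "q3": ["fail", "fail", "fail"],
}

print(consistency_rate(verdicts))
print(majority_agreement(verdicts["q2"]))
```

A group such as q2 is worth inspecting by hand: the explanations may all sound plausible even though the verdicts disagree.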

Purpose‑built datasets for grader assessment

To assess a grader’s quality, we construct purpose-built datasets, each tailored to the specific behavior or quality dimension that grader is designed to measure.

In practice, the composition of grader‑specific datasets depends on the grader. For some graders, we rely primarily on human-labeled data. For others, generated data plays a central role, allowing us to construct targeted test cases with known ground truth. Most often, we use a combination of the two, balancing human judgment with scale and control.

A controlled generation methodology

For many graders, we use controlled synthetic datasets with known ground truth.

The process works as follows:

Define a test agent. We start with a well-scoped agent configuration that represents the behavior domain the grader is intended to evaluate.
Generate high-quality queries. Using our data generation capabilities, we create a set of realistic, high-quality user queries aligned with the agent’s scope.
Generate high-quality responses. For each query, we generate responses that meet the expected quality bar for the dimension under evaluation.
Introduce controlled degradations. We then intentionally degrade a subset of these responses in a controlled and traceable way. Each degradation targets a specific failure mode, and we explicitly track whether a response was degraded and how.
Use the dataset for evaluation. The resulting dataset contains both intact and intentionally degraded responses, with known ground truth about their quality.

Because we control the transformation applied to each response, we can treat this dataset as labeled. We know which responses should be flagged by the grader, and for what reason.
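
The steps above can be sketched in a few lines of Python. Everything here is hypothetical: the two degradation functions, the failure-mode names, and the `LabeledExample` record are stand-ins chosen for illustration, and a real pipeline would generate the queries and responses rather than hard-code them.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledExample:
    query: str
    response: str
    degraded: bool                # ground truth: should the grader flag this?
    failure_mode: Optional[str]   # which degradation was applied, if any


def truncate_response(text: str) -> str:
    """Keep only the first half of the words to simulate an incomplete answer."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])


def replace_with_off_topic(_text: str) -> str:
    """Swap the response for unrelated text to simulate an irrelevant answer."""
    return "Our office is closed on public holidays."


# Hypothetical failure modes, each mapped to a traceable transformation.
DEGRADATIONS = {
    "incomplete": truncate_response,
    "off_topic": replace_with_off_topic,
}


def build_dataset(pairs, degrade_fraction=0.5, seed=0):
    """Degrade a fixed fraction of responses and record what was done to each."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    pairs = list(pairs)
    n_degraded = round(len(pairs) * degrade_fraction)
    degraded_idx = set(rng.sample(range(len(pairs)), n_degraded))
    modes = sorted(DEGRADATIONS)
    dataset = []
    for i, (query, response) in enumerate(pairs):
        if i in degraded_idx:
            mode = rng.choice(modes)
            dataset.append(LabeledExample(query, DEGRADATIONS[mode](response), True, mode))
        else:
            dataset.append(LabeledExample(query, response, False, None))
    return dataset


pairs = [
    ("How do I reset my password?", "You can reset your password from the account settings page."),
    ("Where can I see my invoices?", "Invoices are listed under Billing in the admin portal."),
    ("Can I export my data?", "Yes, use the Export button on the reports page to download a file."),
    ("How do I add a teammate?", "Open Team settings and choose Invite to add a new member."),
]
dataset = build_dataset(pairs)
for example in dataset:
    print(example.degraded, example.failure_mode, "|", example.response)
```

Storing the failure mode alongside the label is what later makes it possible to ask not only whether the grader caught a problem, but which kinds of problems it misses.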

Measuring grader performance

Now that we know which responses were intentionally degraded and how, we can evaluate graders in a concrete and measurable way. Rather than relying on subjective inspection, we can treat grader assessment as a standard classification problem with known ground truth.

The main metrics we track and optimize when developing graders are true positive rate (TPR) and true negative rate (TNR).

TPR measures how often the grader correctly identifies responses that should be flagged. In our context, this reflects the grader’s ability to detect intentionally damaged or low-quality responses when a problem is present.
TNR measures how often the grader correctly accepts responses that should not be flagged. This reflects the grader’s ability to avoid false alarms and not penalize responses that meet the expected quality bar.

These metrics capture the core tradeoff every grader must manage: being sensitive enough to catch real issues while remaining specific enough to avoid over‑penalizing valid responses.
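
For concreteness, TPR is TP / (TP + FN) and TNR is TN / (TN + FP), where "positive" means a response that should be flagged. The short Python sketch below computes both from ground-truth labels and grader decisions; the example data is made up for illustration.

```python
def tpr_tnr(should_flag, was_flagged):
    """Return (TPR, TNR) given ground-truth labels and grader decisions."""
    if len(should_flag) != len(was_flagged):
        raise ValueError("label and decision lists must have the same length")
    tp = fn = tn = fp = 0
    for truth, flagged in zip(should_flag, was_flagged):
        if truth and flagged:
            tp += 1   # degraded response, correctly flagged
        elif truth and not flagged:
            fn += 1   # degraded response, missed
        elif not truth and not flagged:
            tn += 1   # intact response, correctly accepted
        else:
            fp += 1   # intact response, false alarm
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Made-up example: 4 degraded responses followed by 6 intact ones.
should_flag = [True, True, True, True, False, False, False, False, False, False]
was_flagged = [True, True, True, False, False, False, False, False, False, True]

tpr, tnr = tpr_tnr(should_flag, was_flagged)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

Reporting the two rates separately, rather than a single accuracy figure, keeps the result meaningful when degraded and intact responses are not evenly balanced in the dataset.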

By evaluating graders against datasets with controlled degradations, we can measure TPR and TNR directly, analyze failure modes, and iterate systematically. This allows us to tune grader behavior intentionally—understanding where a grader is too permissive, where it’s too strict, and how changes affect its decision boundaries.
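
Because each degraded response carries a record of how it was degraded, the detection rate can also be broken down by failure mode. The sketch below is one minimal way to do that in Python; the failure-mode names and results are invented for the example.

```python
from collections import defaultdict


def detection_rate_by_mode(degraded_results):
    """Per-failure-mode share of degraded responses that the grader flagged.

    `degraded_results` holds (failure_mode, was_flagged) pairs for degraded
    responses only, so each rate is a TPR restricted to one failure mode.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for mode, was_flagged in degraded_results:
        total[mode] += 1
        if was_flagged:
            flagged[mode] += 1
    return {mode: flagged[mode] / total[mode] for mode in total}


# Invented results for three hypothetical failure modes.
results = [
    ("incomplete", True),
    ("incomplete", True),
    ("off_topic", True),
    ("off_topic", False),
    ("wrong_tone", False),
    ("wrong_tone", False),
]

rates = detection_rate_by_mode(results)
for mode, rate in sorted(rates.items()):
    print(f"{mode}: {rate:.2f}")
```

A breakdown like this points to where a grader is too permissive: an overall TPR can look acceptable while one failure mode goes almost entirely undetected.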

Taken together, these techniques allow us to move beyond evaluating individual grader performance and toward a broader goal: building evaluation systems whose behavior can be understood, measured, and improved over time.

Bringing rigor to agent evaluation features

So, who evaluates the evaluators? At Copilot Studio, we approach evals with the same rigor we apply to models themselves. Because as teams increasingly rely on evaluation to guide real-world decisions, they need to trust the systems producing those signals.

In this post, we described how we approach that challenge in Copilot Studio: constructing targeted datasets for graders, using controlled generation to create reliable ground truth, and measuring decision accuracy through metrics such as TPR and TNR. These practices help us understand not only whether an evaluation feature works, but how it behaves, where its limitations are, and how it can be improved over time.

Feedback from design partners and customers plays an important role in this process. When real-world examples reveal gaps in a grader or generated dataset, we incorporate those learnings back into our evaluation process to continuously improve the system.

As the industry moves from experimental AI systems to production-scale agents, evaluation will become a foundational capability. Organizations need to trust not just the agents themselves, but also the systems used to evaluate them.

For us, rigorous evaluation is a core part of helping teams build and improve agents with confidence. Because better agent decisions start with trustworthy evaluation.