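Like LangGraph, MAF Workflows implement persistence through a checkpointing mechanism. Checkpointing lets you save the state of a Workflow at specific points during execution and resume from those points later. It is particularly useful for:
- long-running Workflows that must not lose progress when a failure occurs;
- long-running Workflows that need to be paused and resumed later;
- Workflows whose state must be saved periodically for auditing or compliance review;
- Workflows that need to be migrated between environments or instances.
A Workflow executes in Supersteps. If the InProcessExecutionEnvironment that hosts the run has checkpointing enabled, the specified Checkpointer creates and stores a Checkpoint at the end of each Superstep, or when a Superstep is interrupted for some reason.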
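1. Capturing Checkpoints and Resuming from a Checkpoint
Before formally introducing MAF's checkpointing-based persistence, let's use a simple example to show how to collect the Checkpoints generated per Superstep by iterating over the event stream. A Checkpoint stores the Workflow state at the moment a Superstep ends normally or is interrupted, and we can use any Checkpoint to replay the Workflow from that point.
1.1 Composing the Workflow
Every node of the Workflow we build uses the SimpleExecutor type below. To trace the execution of each Executor, the handler writes a message to the console when the Executor is invoked. To demonstrate persistence of shared state later, it also calls IWorkflowContext.QueueStateUpdateAsync to write a state entry with the key Is{ExecutorId}Invoked in the scope tracking.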
internal partial class SimpleExecutor(string id) : Executor(id)
{
[MessageHandler]
public async ValueTask<string> HandleAsync(string input, IWorkflowContext context)
{
await Task.Delay(10);
Console.WriteLine($"Executor {Id} is invoked");
await context.QueueStateUpdateAsync(key: $"Is{Id}Invoked", value: true, scopeName: "tracking");
return Id;
}
}
We create five SimpleExecutor objects with the IDs "foo", "bar", "baz", "qux", and "quux", and compose them into a Workflow as follows.
var executors = new string[] { "foo", "bar", "baz", "qux", "quux" }
.ToDictionary(it => it, it => new SimpleExecutor(it));
var workflow = new WorkflowBuilder(executors["foo"])
.AddFanOutEdge(executors["foo"], [executors["bar"], executors["baz"]],label:"fan-out")
.AddEdge(executors["bar"], executors["qux"])
.AddFanInBarrierEdge([executors["baz"], executors["qux"]], executors["quux"])
.Build();
await Utilities.GenerateAndShowPngImageAsync(workflow);
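GenerateAndShowPngImageAsync on Utilities is a helper method introduced earlier; it generates a flowchart from a Workflow object and opens the image file locally. In the resulting graph, foo fans out to bar and baz, bar connects to qux, and baz and qux fan in to quux.
1.2 Determining the Superstep in Which Each Executor Runs
Because checkpointing is built on Supersteps, understanding how it works for our Workflow requires knowing how the Workflow advances under BSP (Bulk Synchronous Parallel) and which Superstep each Executor runs in. As shown below, before calling RunStreamingAsync on the default InProcessExecution environment to execute the Workflow as a stream, we call WithCheckpointing to enable checkpointing, passing an ICheckpointManager created by CheckpointManager.CreateInMemory.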
var checkpointManager = CheckpointManager.CreateInMemory();
var run = await InProcessExecution.Default
.WithCheckpointing(checkpointManager)
.RunStreamingAsync(workflow, "start");
await foreach(var @event in run.WatchStreamAsync())
{
if (@event is SuperStepStartedEvent superStepStartedEvent)
{
Console.WriteLine($"{new string('-', 20)}Superstep {superStepStartedEvent.StepNumber}{new string('-', 20)}");
}
}
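We asynchronously iterate over the StreamingRun's event stream, catch the SuperStepStartedEvent raised at the start of each Superstep, and print the Superstep number. The output below shows that the five Executors run across four Supersteps: bar and baz, the fan-out targets, run in the same Superstep, while quux, the fan-in target, is invoked twice in the last Superstep.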
--------------------Superstep 0--------------------
Executor foo is invoked
--------------------Superstep 1--------------------
Executor baz is invoked
Executor bar is invoked
--------------------Superstep 2--------------------
Executor qux is invoked
--------------------Superstep 3--------------------
Executor quux is invoked
Executor quux is invoked
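The schedule above can be reproduced with a small stand-alone model of BSP message passing. This is only a sketch and does not use MAF: the edge table and the barrier bookkeeping are hypothetical stand-ins. It assumes, as the output suggests, that messages sent in one Superstep are delivered in the next, and that the fan-in barrier releases one message per source once every source has reported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of the demo graph; none of these names are MAF APIs.
var direct = new Dictionary<string, string[]>
{
    ["foo"] = new[] { "bar", "baz" },   // fan-out edge
    ["bar"] = new[] { "qux" },          // direct edge
};
var barrierSources = new[] { "baz", "qux" };   // fan-in barrier edge into quux
const string barrierTarget = "quux";

var unseen = new HashSet<string>(barrierSources);
var pending = new List<string>();
var schedule = new List<string[]>();

// Each inbox entry is one message waiting for the named executor.
var inbox = new List<string> { "foo" };
while (inbox.Count > 0)
{
    schedule.Add(inbox.ToArray());
    var next = new List<string>();
    foreach (var executor in inbox)
    {
        // Messages produced now are only delivered in the next superstep.
        if (direct.TryGetValue(executor, out var targets))
        {
            next.AddRange(targets);
        }
        if (unseen.Remove(executor))
        {
            pending.Add(executor);
            if (unseen.Count == 0)
            {
                // Barrier satisfied: release one message per source, then reset.
                next.AddRange(pending.Select(_ => barrierTarget));
                pending.Clear();
                unseen.UnionWith(barrierSources);
            }
        }
    }
    inbox = next;
}

for (var step = 0; step < schedule.Count; step++)
{
    Console.WriteLine($"Superstep {step}: {string.Join(", ", schedule[step])}");
}
```

In this model quux receives two messages in the final Superstep, which would explain why its handler is invoked twice.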
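1.3 Collecting Checkpoints
A Checkpoint, a snapshot of the Workflow's current state, is created at the end of every Superstep, and the SuperStepCompletedEvent exposes a CheckpointInfo that describes it. In the code below, each time we catch a SuperStepCompletedEvent we add its CheckpointInfo to a list, and finally print the collected items.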
var checkpointManager = CheckpointManager.CreateInMemory();
var run = await InProcessExecution.Default
.WithCheckpointing(checkpointManager)
.RunStreamingAsync(workflow, "start");
List<CheckpointInfo> checkpoints = [];
await foreach(var @event in run.WatchStreamAsync())
{
if (@event is SuperStepCompletedEvent superStepCompletedEvent)
{
// CompletionInfo or its Checkpoint may be null, so only non-null entries are collected.
if (superStepCompletedEvent.CompletionInfo?.Checkpoint is { } checkpoint)
{
checkpoints.Add(checkpoint);
}
}
}
foreach (var checkpoint in checkpoints)
{
Console.WriteLine(checkpoint);
}
Debug.Assert(run.Checkpoints.SequenceEqual(checkpoints));
Output:
CheckpointInfo(SessionId: 6b5b0b0eea644bdcb709162f866fbd2f, CheckpointId: fcfa6c69d6d4428a9f72b7d72a64b362)
CheckpointInfo(SessionId: 6b5b0b0eea644bdcb709162f866fbd2f, CheckpointId: 9c5dc779f6ea425d8f6d4172a9036580)
CheckpointInfo(SessionId: 6b5b0b0eea644bdcb709162f866fbd2f, CheckpointId: a2f5d6719d6b4240bd72d383f0613e8d)
CheckpointInfo(SessionId: 6b5b0b0eea644bdcb709162f866fbd2f, CheckpointId: df7db4183e5347ffbb9a3c0045508d65)
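MAF seems to deliberately hide all the details of checkpointing, so we cannot see what an actual Checkpoint object looks like: CheckpointInfo carries only the Session ID and the Checkpoint ID, not the Checkpoint's content. The Checkpoints property of StreamingRun contains all the CheckpointInfo objects, which the Debug.Assert statement above confirms.
1.4 Resuming Execution from a Checkpoint
StreamingRun provides the RestoreCheckpointAsync method to resume a Workflow from a Checkpoint. In the demo below, we first run the Workflow to completion, which produces a CheckpointInfo at the end of each Superstep. We then restore the StreamingRun from each of the first two CheckpointInfo objects in turn and call RunToCompletionAsync to continue from the restore point.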
var checkpointManager = CheckpointManager.CreateInMemory();
var run = await InProcessExecution.Default
.WithCheckpointing(checkpointManager)
.RunStreamingAsync(workflow, "start");
for (var index = -1; index < 2; index++)
{
if (index == -1)
{
Console.WriteLine($"{new string('-',10)}Direct run{new string('-', 10)}");
await run.RunToCompletionAsync();
continue;
}
Console.WriteLine($"{new string('-', 10)}Restore from Checkpoints[{index}]{new string('-', 10)}");
await run.RestoreCheckpointAsync(run.Checkpoints[index]);
await run.RunToCompletionAsync();
}
Output:
----------Direct run----------
Executor foo is invoked
Executor bar is invoked
Executor baz is invoked
Executor qux is invoked
Executor quux is invoked
Executor quux is invoked
----------Restore from Checkpoints[0]----------
Executor bar is invoked
Executor baz is invoked
Executor qux is invoked
Executor quux is invoked
Executor quux is invoked
----------Restore from Checkpoints[1]----------
Executor qux is invoked
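The output shows that the StreamingRun restored from the first CheckpointInfo runs from Superstep 1 through to the end. The StreamingRun restored from the second CheckpointInfo, however, starts at Superstep 2 and stops after running only qux. This is clearly a serious bug: in principle, a StreamingRun restored from any Checkpoint should be able to replay everything that follows. With this question in mind, let's look at how MAF implements checkpointing.
2. A Look at the Hidden Checkpoint Type
As mentioned above, MAF hides the details of checkpointing, so many of the core types are internal, including the Checkpoint type below, which carries all the persisted information.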
internal sealed class Checkpoint
{
public bool IsInitial => StepNumber == -1;
public int StepNumber { get; }
public WorkflowInfo Workflow { get; }
public RunnerStateData RunnerData { get; }
public Dictionary<ScopeKey, PortableValue> StateData { get; } = new Dictionary<ScopeKey, PortableValue>();
public Dictionary<EdgeId, PortableValue> EdgeStateData { get; } = new Dictionary<EdgeId, PortableValue>();
public CheckpointInfo? Parent { get; }
}
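The properties are:
- StepNumber: the number of the Superstep the Checkpoint corresponds to;
- Workflow: a WorkflowInfo object describing the Workflow;
- RunnerData: a RunnerStateData object describing the state of the Workflow runner;
- StateData: a dictionary keyed by ScopeKey with PortableValue values, storing the Workflow's state data per Scope;
- EdgeStateData: a dictionary keyed by EdgeId with PortableValue values, storing the Workflow's state data per Edge;
- Parent: an optional CheckpointInfo pointing to the previous Checkpoint.
WorkflowInfo is a static description of the built Workflow, covering its nodes (Executors) and edges, the RequestPorts used for human-in-the-loop interaction, the input type, and the start and output nodes.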
internal sealed class WorkflowInfo
{
public Dictionary<string, ExecutorInfo> Executors { get; }
public Dictionary<string, List<EdgeInfo>> Edges { get; }
public HashSet<RequestPortInfo> RequestPorts { get; }
public TypeId? InputType { get; }
public string StartExecutorId { get; }
public HashSet<string> OutputExecutorIds { get; }
}
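When we hand the built Workflow to InProcessExecutionEnvironment, it creates an InProcessRunner that executes the Workflow according to the BSP model. The RunnerStateData returned by the RunnerData property represents the dynamic information of that execution.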
internal sealed class RunnerStateData(
HashSet<string> instantiatedExecutors,
Dictionary<string, List<PortableMessageEnvelope>> queuedMessages,
List<ExternalRequest> outstandingRequests)
{
public HashSet<string> InstantiatedExecutors { get; } = instantiatedExecutors;
public Dictionary<string, List<PortableMessageEnvelope>> QueuedMessages { get; } = queuedMessages;
public List<ExternalRequest> OutstandingRequests { get; } = outstandingRequests;
}
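The three properties are:
- InstantiatedExecutors: a HashSet containing the IDs of the Executors that have been instantiated so far;
- QueuedMessages: a dictionary whose key is an Executor ID and whose value is the list of pending messages sent to that Executor;
- OutstandingRequests: a list of the pending external requests sent to RequestPorts.
A Workflow handles two kinds of messages. The first is a message sent from one Executor to another, which we call an internal message and which can be of any type. The second is a message sent to the Workflow from outside through a RequestPort during human-in-the-loop interaction, which we call an external message; its type is ExternalResponse. Internally, messages are wrapped in the MessageEnvelope type below.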
internal sealed class MessageEnvelope(
object message,
ExecutorIdentity source,
TypeId? declaredType = null,
string? targetId = null,
Dictionary<string, string>? traceContext = null)
{
public TypeId MessageType { get; }
public object Message { get; }
public ExecutorIdentity Source { get; }
public string? TargetId => targetId;
public Dictionary<string, string>? TraceContext { get; }
public bool IsExternal { get; }
public string? SourceId { get; }
}
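The properties are:
- MessageType: the portable type of the message (as opposed to its runtime type);
- Message: the message object;
- Source: the ID of the sending Executor; ExecutorIdentity is essentially a thin wrapper around the Executor ID string;
- TargetId: the ID of the receiving Executor;
- TraceContext: an optional dictionary holding the message's tracing context;
- IsExternal: a Boolean that depends on whether Source equals ExecutorIdentity.None;
- SourceId: the string form of Source, or null if Source equals ExecutorIdentity.None.
The messages in RunnerStateData.QueuedMessages are of type PortableMessageEnvelope, a portable (serializable) representation of MessageEnvelope that provides the ToMessageEnvelope method for converting back to a MessageEnvelope.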
internal sealed class PortableMessageEnvelope
{
public TypeId MessageType { get; }
public PortableValue Message { get; }
public ExecutorIdentity Source { get; }
public string? TargetId { get; }
public PortableMessageEnvelope(MessageEnvelope envelope);
public MessageEnvelope ToMessageEnvelope();
}
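Finally, the dictionary returned by Checkpoint.EdgeStateData describes Edge state only for FanInEdge. Of the three edge types (DirectEdge, FanOutEdge, and FanInEdge), only routing through a FanInEdge can span multiple Supersteps. To guarantee that the downstream node runs only after every upstream node has completed, a FanInEdge records the IDs of all upstream nodes when it is initialized. Whenever a message arrives from any upstream node, the edge stores the message in a pending list and removes that node's ID from the set of unseen sources. When the unseen set is empty, messages from all upstream nodes have arrived, and only then is the downstream node triggered. Both collections must be saved in the Checkpoint so that the FanInEdge state can be restored correctly on resume. The state type is defined as follows: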
internal sealed class FanInEdgeState
{
public string[] SourceIds { get; }
public HashSet<string> Unseen ;
public List<PortableMessageEnvelope> PendingMessages ;
}
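The three members are:
- SourceIds: a string array containing the IDs of the FanInEdge's upstream nodes;
- Unseen: a HashSet containing the IDs of the upstream nodes whose messages have not yet reached the FanInEdge;
- PendingMessages: a list containing the messages that have already reached the FanInEdge.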
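The barrier bookkeeping described above can be sketched in a few lines of plain C#. This is a minimal model, not MAF's implementation: the three variables merely mirror the FanInEdgeState members, the Accept helper is hypothetical, and resetting the unseen set after a release is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the three FanInEdgeState members; not MAF code.
string[] sourceIds = { "baz", "qux" };
var unseen = new HashSet<string>(sourceIds);
var pendingMessages = new List<string>();

// baz reports first: its message is buffered and nothing is released.
var releasedAfterBaz = Accept("baz", "message from baz");
Console.WriteLine($"after baz: released {releasedAfterBaz.Count}, still waiting for {string.Join(", ", unseen)}");

// qux reports next: every source has been seen, so the whole batch is released.
var releasedAfterQux = Accept("qux", "message from qux");
Console.WriteLine($"after qux: released {releasedAfterQux.Count}");

// Buffers one upstream message and returns the released batch, which stays
// empty until every source ID has been removed from the unseen set.
List<string> Accept(string sourceId, string message)
{
    pendingMessages.Add(message);
    unseen.Remove(sourceId);
    if (unseen.Count > 0)
    {
        return new List<string>();
    }

    var released = new List<string>(pendingMessages);
    pendingMessages.Clear();
    unseen.UnionWith(sourceIds);   // assumed reset so the barrier can be reused
    return released;
}
```

The point of the model is that the unseen set and the pending list together are the whole state of the barrier, so a snapshot that misses either one cannot be resumed correctly.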
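3. ICheckpointManager: The Object That Actually Persists Checkpoints
In the demos above, we called WithCheckpointing on the execution environment (InProcessExecution.Default) to enable checkpointing, passing an ICheckpointManager created by CheckpointManager.CreateInMemory. ICheckpointManager is the core of the whole checkpointing system, yet it is still an internal interface.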
internal interface ICheckpointManager
{
ValueTask<CheckpointInfo> CommitCheckpointAsync(string sessionId, Checkpoint checkpoint);
ValueTask<Checkpoint> LookupCheckpointAsync(string sessionId, CheckpointInfo checkpointInfo);
ValueTask<IEnumerable<CheckpointInfo>> RetrieveIndexAsync(string sessionId, CheckpointInfo? withParent = null);
}
Persistence of a Checkpoint is implemented in CommitCheckpointAsync. RetrieveIndexAsync retrieves a collection of CheckpointInfo objects for the given session ID and an optional CheckpointInfo. LookupCheckpointAsync finds the Checkpoint that corresponds to the given session ID and CheckpointInfo.
public sealed class CheckpointManager : ICheckpointManager
{
private readonly ICheckpointManager _impl;
public static CheckpointManager Default { get; } = CreateInMemory();
private static CheckpointManagerImpl<TStoreObject> CreateImpl<TStoreObject>(
IWireMarshaller<TStoreObject> marshaller,
ICheckpointStore<TStoreObject> store)
=>new CheckpointManagerImpl<TStoreObject>(marshaller, store);
internal CheckpointManager(ICheckpointManager impl)=>_impl = impl;
public static CheckpointManager CreateInMemory()=> new CheckpointManager(new InMemoryCheckpointManager());
public static CheckpointManager CreateJson(
ICheckpointStore<JsonElement> store,
JsonSerializerOptions? customOptions = null)
{
JsonMarshaller marshaller = new JsonMarshaller(customOptions);
return new CheckpointManager(CreateImpl(marshaller, store));
}
ValueTask<CheckpointInfo> ICheckpointManager.CommitCheckpointAsync(string sessionId, Checkpoint checkpoint)
=>_impl.CommitCheckpointAsync(sessionId, checkpoint);
ValueTask<Checkpoint> ICheckpointManager.LookupCheckpointAsync(string sessionId, CheckpointInfo checkpointInfo)
=>_impl.LookupCheckpointAsync(sessionId, checkpointInfo);
ValueTask<IEnumerable<CheckpointInfo>> ICheckpointManager.RetrieveIndexAsync(string sessionId, CheckpointInfo? withParent)
=>_impl.RetrieveIndexAsync(sessionId, withParent);
}
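Although CheckpointManager implements ICheckpointManager itself, it is merely a wrapper, or proxy, around another ICheckpointManager. The object returned by its static CreateInMemory method wraps an InMemoryCheckpointManager, which stores Checkpoints directly in memory and supports retrieval on top of that.
CreateJson takes a different approach: it serializes the Checkpoint into a JsonElement and hands it to the specified ICheckpointStore<JsonElement> for storage and retrieval. Serialization and deserialization of the Checkpoint are abstracted into JsonMarshaller, which implements IWireMarshaller<JsonElement>.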
public interface IWireMarshaller<TWireContainer>
{
TWireContainer Marshal(object value, Type type);
TWireContainer Marshal<TValue>(TValue value);
TValue Marshal<TValue>(TWireContainer data);
object Marshal(Type targetType, TWireContainer data);
}
internal sealed class JsonMarshaller : IWireMarshaller<JsonElement>
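To make the marshalling idea concrete, here is a minimal sketch of a round trip through JsonElement using System.Text.Json. It is not the JsonMarshaller source: ToWire and FromWire are hypothetical names chosen for this illustration, and the real class may configure its serializer options differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

JsonSerializerOptions options = new();

// Sample state shaped like the tracking entries the demo executors write.
var state = new Dictionary<string, bool> { ["IsfooInvoked"] = true };

// Value -> wire container -> value.
JsonElement wire = ToWire(state);
var roundTripped = FromWire<Dictionary<string, bool>>(wire);

Console.WriteLine(wire.GetRawText());
Console.WriteLine(roundTripped["IsfooInvoked"]);

// Hypothetical counterpart of Marshal<TValue>(TValue value).
JsonElement ToWire<TValue>(TValue value)
    => JsonSerializer.SerializeToElement(value, options);

// Hypothetical counterpart of Marshal<TValue>(TWireContainer data).
// The null-forgiving operator assumes the payload is not a JSON null.
TValue FromWire<TValue>(JsonElement data)
    => data.Deserialize<TValue>(options)!;
```

Keeping the wire container generic, as IWireMarshaller<TWireContainer> does, lets a store work with whatever representation suits its medium while the manager handles conversion.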
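ICheckpointStore<TStoreObject>, the interface for persisting Checkpoints and retrieving them, is defined below; the type parameter TStoreObject is the type that represents a Checkpoint in the storage medium. JsonCheckpointStore is an abstract implementation of ICheckpointStore<JsonElement>, while FileSystemJsonCheckpointStore and CosmosCheckpointStore<T> are concrete JsonCheckpointStore implementations backed by the file system and Cosmos DB respectively. CosmosCheckpointStore derives directly from CosmosCheckpointStore<JsonElement>.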
public interface ICheckpointStore<TStoreObject>
{
ValueTask<IEnumerable<CheckpointInfo>> RetrieveIndexAsync(
string sessionId,
CheckpointInfo? withParent = null);
ValueTask<CheckpointInfo> CreateCheckpointAsync(
string sessionId,
TStoreObject value,
CheckpointInfo? parent = null);
ValueTask<TStoreObject> RetrieveCheckpointAsync(
string sessionId,
CheckpointInfo key);
}
public abstract class JsonCheckpointStore : ICheckpointStore<JsonElement>
{}
public sealed class FileSystemJsonCheckpointStore : JsonCheckpointStore, IDisposable
{}
public class CosmosCheckpointStore<T> : JsonCheckpointStore, IDisposable
{}
public sealed class CosmosCheckpointStore : CosmosCheckpointStore<JsonElement>
{}
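The three store operations can be illustrated with a small in-memory model. This is only a sketch: plain string IDs stand in for CheckpointInfo, the helper names are hypothetical, and the assumption that the optional parent argument filters the index down to the children of that Checkpoint is mine, not something the interface definition states.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical in-memory model: session ID -> checkpoint ID -> (payload, parent ID).
// Plain string IDs stand in for CheckpointInfo; nothing here is a MAF type.
var sessions = new Dictionary<string, Dictionary<string, (JsonElement Value, string? ParentId)>>();

var first = Create("session-1", JsonSerializer.SerializeToElement(new { StepNumber = 0 }), null);
var second = Create("session-1", JsonSerializer.SerializeToElement(new { StepNumber = 1 }), first);

var childrenOfFirst = RetrieveIndex("session-1", first);
var secondPayload = Retrieve("session-1", second);
Console.WriteLine($"children of first: {childrenOfFirst.Count}");
Console.WriteLine($"second payload: {secondPayload.GetRawText()}");

// Counterpart of CreateCheckpointAsync: stores the payload and returns its new ID.
string Create(string sessionId, JsonElement value, string? parentId)
{
    if (!sessions.TryGetValue(sessionId, out var session))
    {
        sessions[sessionId] = session = new();
    }
    var checkpointId = Guid.NewGuid().ToString("N");
    session[checkpointId] = (value, parentId);
    return checkpointId;
}

// Counterpart of RetrieveCheckpointAsync: throws KeyNotFoundException for unknown IDs.
JsonElement Retrieve(string sessionId, string checkpointId)
    => sessions[sessionId][checkpointId].Value;

// Counterpart of RetrieveIndexAsync: all IDs in the session, or only those whose
// parent matches when withParentId is given (the filter semantics are an assumption).
List<string> RetrieveIndex(string sessionId, string? withParentId = null)
    => sessions.TryGetValue(sessionId, out var session)
        ? session.Where(pair => withParentId is null || pair.Value.ParentId == withParentId)
                 .Select(pair => pair.Key)
                 .ToList()
        : new List<string>();
```

A file-system or Cosmos DB store would replace the dictionary with durable storage while keeping the same three operations.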
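4. Inspecting the Generated Checkpoint Objects
Let's return to the example from the beginning and look at what the four Checkpoints generated during the run actually contain. Because Checkpoint and ICheckpointManager are both internal, we cannot access them directly and have to use reflection to obtain a Checkpoint and inspect its content. For this I wrote two methods, PrettyPrint and LookupCheckpoint: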
public static void PrettyPrint(this object checkpoint);
public static object LookupCheckpoint(
this CheckpointManager checkpointManager,
string sessionId, CheckpointInfo checkpointInfo)
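The two methods are:
- PrettyPrint: an extension method for a Checkpoint object (which can only be typed as object); it uses reflection to read the Checkpoint's properties and prints them in a readable format;
- LookupCheckpoint: an extension method for CheckpointManager; it uses CheckpointManager's LookupCheckpointAsync method to obtain the Checkpoint object.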
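The article does not show how these helpers are implemented, so the following is only a sketch of the reflection technique behind a method like PrettyPrint. It is a hypothetical stand-alone helper that runs against an anonymous sample object instead of a real Checkpoint, and its output format is not meant to match the dump shown later.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Reflection;

// A sample object standing in for a Checkpoint; it is not a MAF type.
var sample = new
{
    StepNumber = 0,
    Unseen = new[] { "baz", "qux" },
    Parent = (object?)null,
};
PrettyPrint(sample, Console.Out);

// Hypothetical helper; the author's actual PrettyPrint is not shown in the article.
void PrettyPrint(object? value, TextWriter output, int depth = 0)
{
    var pad = new string(' ', depth * 2);
    if (value is null || depth > 8)   // depth limit guards against runaway recursion
    {
        output.WriteLine($"{pad}{(value is null ? "null" : "...")}");
        return;
    }

    var type = value.GetType();
    if (type.IsPrimitive || type.IsEnum || value is string)
    {
        output.WriteLine($"{pad}{value}");
    }
    else if (value is IEnumerable items)
    {
        foreach (var item in items)
        {
            PrettyPrint(item, output, depth + 1);
        }
    }
    else
    {
        // Include non-public properties so members of internal types are visible too.
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        foreach (var property in type.GetProperties(flags))
        {
            if (property.GetIndexParameters().Length > 0)
            {
                continue;   // skip indexers, which need arguments
            }
            output.WriteLine($"{pad}{property.Name}:");
            PrettyPrint(property.GetValue(value), output, depth + 1);
        }
    }
}
```

Because the helper takes object, it works on instances of internal types without the calling code having to name those types.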
We then wrote the following program: after running the Workflow and collecting the CheckpointInfo objects, we call LookupCheckpoint to obtain each Checkpoint and PrettyPrint to print its content.
using Microsoft.Agents.AI.Workflows;
var executors = new string[] { "foo", "bar", "baz", "qux", "quux" }
.ToDictionary(it => it, it => new SimpleExecutor(it));
var workflow = new WorkflowBuilder(executors["foo"])
.AddFanOutEdge(executors["foo"], [executors["bar"], executors["baz"]],label:"fan-out")
.AddEdge(executors["bar"], executors["qux"])
.AddFanInBarrierEdge([executors["baz"], executors["qux"]], executors["quux"])
.Build();
var checkpointManager = CheckpointManager.CreateInMemory();
var run = await InProcessExecution.Default
.WithCheckpointing(checkpointManager)
.RunStreamingAsync(workflow, "start");
await run.RunToCompletionAsync();
var checkpoints = run.Checkpoints
.Select(it => checkpointManager.LookupCheckpoint(run.SessionId, it))
.ToArray();
var index = 0;
foreach (var checkpoint in checkpoints)
{
Console.WriteLine($"{new string('-', 30)}Checkpoint[{index++}]{new string('-', 30)}");
checkpoint.PrettyPrint();
Console.WriteLine();
}
The content of the four Checkpoint objects is printed as follows:
------------------------------Checkpoint[0]------------------------------
IsInitial: False
StepNumber: 0
RunnerData: <RunnerStateData>
InstantiatedExecutors: [bar, foo, baz]
QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
Key: bar
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: foo
Source: foo
TargetId: null
Key: baz
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: foo
Source: foo
TargetId: null
OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
Key: foo/tracking/IsfooInvoked
Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
Key: 3
Value: <PortableValue>
SourceIds: [baz, qux]
Unseen: [baz, qux]
PendingMessages: []
TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: null
------------------------------Checkpoint[1]------------------------------
IsInitial: False
StepNumber: 1
RunnerData: <RunnerStateData>
InstantiatedExecutors: [bar, foo, baz, qux]
QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
Key: qux
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: bar
Source: bar
TargetId: null
OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
Key: foo/tracking/IsfooInvoked
Value: True
Key: foo/tracking/IsbarInvoked
Value: True
Key: foo/tracking/IsbazInvoked
Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
Key: 3
Value: <PortableValue>
SourceIds: [baz, qux]
Unseen: [baz, qux]
PendingMessages: []
TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
SessionId: e8a3dd4385384e589271758a1435ab58
CheckpointId: fc30a48fe01e4eea83ed794df37b537a
------------------------------Checkpoint[2]------------------------------
IsInitial: False
StepNumber: 2
RunnerData: <RunnerStateData>
InstantiatedExecutors: [bar, foo, baz, quux, qux]
QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
Key: quux
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: baz
Source: baz
TargetId: null
MessageType: System.String
Message: qux
Source: qux
TargetId: null
OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
Key: foo/tracking/IsfooInvoked
Value: True
Key: foo/tracking/IsbarInvoked
Value: True
Key: foo/tracking/IsbazInvoked
Value: True
Key: foo/tracking/IsquxInvoked
Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
Key: 3
Value: <PortableValue>
SourceIds: [baz, qux]
Unseen: [baz, qux]
PendingMessages: []
TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
SessionId: e8a3dd4385384e589271758a1435ab58
CheckpointId: cd63bfae5da74893a4acbbed74e196b9
------------------------------Checkpoint[3]------------------------------
IsInitial: False
StepNumber: 3
RunnerData: <RunnerStateData>
InstantiatedExecutors: [bar, foo, baz, quux, qux]
QueuedMessages: []
OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
Key: foo/tracking/IsfooInvoked
Value: True
Key: foo/tracking/IsbarInvoked
Value: True
Key: foo/tracking/IsbazInvoked
Value: True
Key: foo/tracking/IsquxInvoked
Value: True
Key: foo/tracking/IsquuxInvoked
Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
Key: 3
Value: <PortableValue>
SourceIds: [baz, qux]
Unseen: [baz, qux]
PendingMessages: []
TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
SessionId: e8a3dd4385384e589271758a1435ab58
CheckpointId: 9420e8db25764828b0c1fcc900c5cd34
4.1 Analyzing the First Checkpoint
Let's analyze the content of the first Checkpoint.
IsInitial: False
StepNumber: 0
RunnerData: <RunnerStateData>
InstantiatedExecutors: [bar, foo, baz]
QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
Key: bar
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: foo
Source: foo
TargetId: null
Key: baz
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: foo
Source: foo
TargetId: null
OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
Key: foo/tracking/IsfooInvoked
Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
Key: 3
Value: <PortableValue>
SourceIds: [baz, qux]
Unseen: [baz, qux]
PendingMessages: []
TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: null
- StepNumber is 0, so this Checkpoint was generated at the end of the first Superstep, after the start node foo had finished;
- RunnerData:
  - InstantiatedExecutors shows that the two Executors due to run in the next Superstep (bar and baz) have already been instantiated;
  - QueuedMessages holds the messages that foo sent to bar and baz through the FanOutEdge;
  - OutstandingRequests is empty, because the Workflow involves no human-in-the-loop interaction and has no RequestPort node;
- StateData: holds the state written in foo by calling IWorkflowContext.QueueStateUpdateAsync;
- EdgeStateData: the state of the Workflow's only FanInEdge is stored here. Its upstream nodes are baz and qux; since neither has run yet, Unseen contains baz and qux, and PendingMessages is empty;
- Parent is null: this is the first Checkpoint, so it has no parent Checkpoint.
If we restore a StreamingRun from this Checkpoint, execution resumes at bar and baz, and the rest of the flow can proceed.
4.2 Analyzing the Second Checkpoint
The second Checkpoint was generated at the end of the second Superstep, after bar and baz had finished.
IsInitial: False
StepNumber: 1
RunnerData: <RunnerStateData>
InstantiatedExecutors: [bar, foo, baz, qux]
QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
Key: qux
Value: <List<PortableMessageEnvelope>>
MessageType: System.String
Message: bar
Source: bar
TargetId: null
OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
Key: foo/tracking/IsfooInvoked
Value: True
Key: foo/tracking/IsbarInvoked
Value: True
Key: foo/tracking/IsbazInvoked
Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
Key: 3
Value: <PortableValue>
SourceIds: [baz, qux]
Unseen: [baz, qux]
PendingMessages: []
TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
SessionId: e8a3dd4385384e589271758a1435ab58
CheckpointId: fc30a48fe01e4eea83ed794df37b537a
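- StepNumber is 1, so this Checkpoint was generated at the end of the second Superstep, after bar and baz had finished;
- RunnerData: the message from bar to qux has been added to QueuedMessages, and InstantiatedExecutors now includes qux, the node due to run in the next Superstep;
- StateData: holds the state written by foo, bar, and baz through IWorkflowContext.QueueStateUpdateAsync;
- EdgeStateData: not only is it unchanged here, the FanInEdge state is identical in all four Checkpoints, so the checkpointed FanInEdge state never changed during this run, even though baz had already sent its message into the edge by this point;
- Parent: points to the first Checkpoint.
Now back to the question raised earlier: why does the StreamingRun restored from the second Checkpoint start at Superstep 2 and stop after running only qux? The analysis above gives the answer: the FanInEdge state in EdgeStateData was not updated correctly. On restore, the FanInEdge believes that neither of its upstream nodes, baz and qux, has run. When qux then runs, only qux is marked as seen; the message baz sent before the Checkpoint is gone, so the edge never triggers the downstream node quux and the Workflow cannot continue. It is hard to believe that such a serious bug exists in MAF's core checkpointing mechanism.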
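The effect of the stale edge state can be illustrated with a tiny model of what happens after the restore. This is a sketch, not MAF code: ResumeFromSuperstep2 is a hypothetical helper, the "recorded" state is copied from the Checkpoint[1] dump, and the "expected" state is my assumption of what a correct snapshot would contain after baz has run.

```csharp
using System;
using System.Collections.Generic;

// Fan-in state as recorded in Checkpoint[1] of the dump: nothing seen, nothing pending.
var recorded = ResumeFromSuperstep2(new[] { "baz", "qux" }, Array.Empty<string>());
// State one would expect after baz ran in Superstep 1 (an assumption, not MAF output).
var expected = ResumeFromSuperstep2(new[] { "qux" }, new[] { "baz" });

Console.WriteLine($"recorded state: quux triggered = {recorded.Triggered}, buffered = {recorded.Buffered}");
Console.WriteLine($"expected state: quux triggered = {expected.Triggered}, buffered = {expected.Buffered}");

// Hypothetical replay of what follows the restore: qux handles the queued message
// from bar and reports to the barrier; quux is triggered only if nothing is unseen.
static (bool Triggered, int Buffered) ResumeFromSuperstep2(
    IEnumerable<string> unseen, IEnumerable<string> pending)
{
    var unseenSet = new HashSet<string>(unseen);
    var pendingList = new List<string>(pending);

    pendingList.Add("qux");
    unseenSet.Remove("qux");

    return (unseenSet.Count == 0, pendingList.Count);
}
```

With the recorded state the barrier keeps waiting for baz, which never runs again after the restore; with the expected state it would release both messages and trigger quux.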
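5. Two Checkpointing Callbacks
To make sure an Executor's state is captured in a checkpoint, a custom Executor must override OnCheckpointingAsync and save its state to the workflow context. To make sure the state is restored correctly when resuming from a checkpoint, the Executor must override OnCheckpointRestoredAsync and load its state from the workflow context.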
using Microsoft.Agents.AI.Workflows;
internal sealed partial class CustomExecutor() : Executor("CustomExecutor")
{
private const string StateKey = "CustomExecutorState";
private List<string> messages = new();
[MessageHandler]
private async ValueTask HandleAsync(string message, IWorkflowContext context)
{
messages.Add(message);
...
}
protected override ValueTask OnCheckpointingAsync(IWorkflowContext context, CancellationToken cancellation = default)
=>context.QueueStateUpdateAsync(StateKey, this.messages);
protected override async ValueTask OnCheckpointRestoredAsync(IWorkflowContext context, CancellationToken cancellation = default)
=>messages = await context.ReadStateAsync<List<string>>(StateKey).ConfigureAwait(false);
}




