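In the previous article we covered the Checkpointing mechanism of MAF's Workflow framework in detail and used a sample program to show how Checkpointing persists and restores Agent state. That article mentioned a bug related to FanInEdge. I consider it a very serious one: Checkpointing cannot work correctly for any workflow that involves a FanInEdge. I filed an Issue on GitHub for it, and this article analyzes where the bug comes from.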
1. How FanInEdge works
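FanInEdge is not an ordinary many-to-one DirectEdge, because it also acts as a synchronization barrier: the target node runs only after all N source nodes have finished, and it is then executed N times in a row. Its purpose and design closely resemble LangGraph's NamedBarrierValue.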
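Compared with the other two edge types (DirectEdge and FanOutEdge), the biggest difference is that routing messages through a FanInEdge may span several Supersteps, so pending state has to be maintained across steps. Because the Checkpointing mechanism is designed around Supersteps, this state must be written into every Checkpoint that is created. Concretely, the state has the type FanInEdgeState, shown below: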
internal sealed class FanInEdgeState
{
    public string[] SourceIds { get; }
    public HashSet<string> Unseen { get; private set; }
    public List<PortableMessageEnvelope> PendingMessages { get; private set; }
}
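When a FanInEdgeState is initialized, the IDs of all source nodes are written to both SourceIds and the Unseen set. During execution, whenever a source node finishes and delivers a message, the FanInEdge removes that node's ID from Unseen and appends the message to PendingMessages. Once Unseen is empty, every source node has finished; the messages are then taken from PendingMessages in order, and each one is passed as input to the target node.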
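The barrier behavior described above can be sketched in a few lines. This is only an illustration, not MAF's implementation: `unseen`, `pendingMessages` and `Deliver` are hypothetical names, messages are reduced to plain strings, and resetting the state once the barrier opens is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the members of FanInEdgeState (messages simplified to strings).
string[] sourceIds = { "baz", "qux" };
var unseen = new HashSet<string>(sourceIds);
var pendingMessages = new List<string>();

// Records one message from a source node. Returns the batch for the target node
// once every source has reported, or an empty list while the barrier is still closed.
List<string> Deliver(string sourceId, string message)
{
    pendingMessages.Add(message);
    unseen.Remove(sourceId);
    if (unseen.Count > 0)
    {
        return new List<string>();
    }

    // Barrier released: hand over the buffered messages.
    // Resetting for a next round is an assumption of this sketch.
    var released = pendingMessages;
    pendingMessages = new List<string>();
    unseen = new HashSet<string>(sourceIds);
    return released;
}

// One superstep: baz reports, qux is still unseen, nothing is released.
var afterBaz = Deliver("baz", "baz");
Console.WriteLine($"after baz: released {afterBaz.Count} message(s), pending {pendingMessages.Count}");

// A later superstep: qux reports, the barrier opens and both messages are released.
var afterQux = Deliver("qux", "qux");
Console.WriteLine($"after qux: released [{string.Join(", ", afterQux)}]");
```

The point to note is that `unseen` and `pendingMessages` must survive between the two calls. That in-between state is exactly what a Checkpoint has to capture.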
2. A Checkpoint cannot replay subsequent operations
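The purpose of a Checkpoint is to persist a snapshot of the current state whenever a Superstep ends normally or is interrupted, so that execution can be restored at some later point; it also provides time travel. It should be possible to fully replay all subsequent operations starting from any Checkpoint, which is a basic requirement of the Checkpointing mechanism. The example we demonstrated earlier shows otherwise: if the Superstep that produced the chosen Checkpoint executed one of a FanInEdge's source nodes, the subsequent operations may not replay.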
using Microsoft.Agents.AI.Workflows;

// Five executors of the same type, keyed by ID.
var executors = new string[] { "foo", "bar", "baz", "qux", "quux" }
    .ToDictionary(it => it, it => new SimpleExecutor(it));

// foo fans out to bar and baz; bar feeds qux; baz and qux fan in to quux.
var workflow = new WorkflowBuilder(executors["foo"])
    .AddFanOutEdge(executors["foo"], [executors["bar"], executors["baz"]], label: "fan-out")
    .AddEdge(executors["bar"], executors["qux"])
    .AddFanInBarrierEdge([executors["baz"], executors["qux"]], executors["quux"])
    .Build();

var checkpointManager = CheckpointManager.CreateInMemory();
var run = await InProcessExecution.Default
    .WithCheckpointing(checkpointManager)
    .RunStreamingAsync(workflow, "start");

// index -1: the initial run; index 0 and 1: restore from that Checkpoint and run again.
for (var index = -1; index < 2; index++)
{
    if (index == -1)
    {
        Console.WriteLine($"{new string('-', 10)}Direct run{new string('-', 10)}");
        await run.RunToCompletionAsync();
        continue;
    }

    Console.WriteLine($"{new string('-', 10)}Restore from Checkpoints[{index}]{new string('-', 10)}");
    await run.RestoreCheckpointAsync(run.Checkpoints[index]);
    await run.RunToCompletionAsync();
}

internal partial class SimpleExecutor(string id) : Executor(id)
{
    [MessageHandler]
    public async ValueTask<string> HandleAsync(string input, IWorkflowContext context)
    {
        await Task.Delay(10);
        Console.WriteLine($"Executor {Id} is invoked");
        // Record the invocation in workflow state so that it shows up in the Checkpoint.
        await context.QueueStateUpdateAsync(key: $"Is{Id}Invoked", value: true, scopeName: "tracking");
        return Id;
    }
}
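The program above reproduces the bug. We create five nodes of the same type (SimpleExecutor) with different IDs and build a Workflow that contains a FanOutEdge and a FanInEdge. The Workflow has the following structure: foo fans out to bar and baz, bar connects to qux, and baz and qux fan in to quux.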
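We ran the workflow once to completion in streaming mode and collected four Checkpoints. We then restored the StreamingRun from the first and from the second Checkpoint in turn and ran each to completion. As the output below shows, the StreamingRun restored from the second Checkpoint cannot replay the subsequent operations: only qux is invoked, and quux never runs.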
----------Direct run----------
Executor foo is invoked
Executor baz is invoked
Executor bar is invoked
Executor qux is invoked
Executor quux is invoked
Executor quux is invoked
----------Restore from Checkpoints[0]----------
Executor baz is invoked
Executor bar is invoked
Executor qux is invoked
Executor quux is invoked
Executor quux is invoked
----------Restore from Checkpoints[1]----------
Executor qux is invoked
3. Analyzing the created Checkpoints
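To find the root cause, we used reflection to print the contents of the four Checkpoints in a structured form, with the result below. The FanInEdgeState stored in the EdgeStateData property is identical in all four Checkpoints, which is clearly wrong.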
------------------------------Checkpoint[0]------------------------------
IsInitial: False
StepNumber: 0
RunnerData: <RunnerStateData>
  InstantiatedExecutors: [bar, foo, baz]
  QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
    Key: bar
    Value: <List<PortableMessageEnvelope>>
      MessageType: System.String
      Message: foo
      Source: foo
      TargetId: null
    Key: baz
    Value: <List<PortableMessageEnvelope>>
      MessageType: System.String
      Message: foo
      Source: foo
      TargetId: null
  OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
  Key: foo/tracking/IsfooInvoked
  Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
  Key: 3
  Value: <PortableValue>
    SourceIds: [baz, qux]
    Unseen: [baz, qux]
    PendingMessages: []
    TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: null
------------------------------Checkpoint[1]------------------------------
IsInitial: False
StepNumber: 1
RunnerData: <RunnerStateData>
  InstantiatedExecutors: [bar, foo, baz, qux]
  QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
    Key: qux
    Value: <List<PortableMessageEnvelope>>
      MessageType: System.String
      Message: bar
      Source: bar
      TargetId: null
  OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
  Key: foo/tracking/IsfooInvoked
  Value: True
  Key: foo/tracking/IsbarInvoked
  Value: True
  Key: foo/tracking/IsbazInvoked
  Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
  Key: 3
  Value: <PortableValue>
    SourceIds: [baz, qux]
    Unseen: [baz, qux]
    PendingMessages: []
    TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
  SessionId: e8a3dd4385384e589271758a1435ab58
  CheckpointId: fc30a48fe01e4eea83ed794df37b537a
------------------------------Checkpoint[2]------------------------------
IsInitial: False
StepNumber: 2
RunnerData: <RunnerStateData>
  InstantiatedExecutors: [bar, foo, baz, quux, qux]
  QueuedMessages: <Dictionary<String, List<PortableMessageEnvelope>>>
    Key: quux
    Value: <List<PortableMessageEnvelope>>
      MessageType: System.String
      Message: baz
      Source: baz
      TargetId: null
      MessageType: System.String
      Message: qux
      Source: qux
      TargetId: null
  OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
  Key: foo/tracking/IsfooInvoked
  Value: True
  Key: foo/tracking/IsbarInvoked
  Value: True
  Key: foo/tracking/IsbazInvoked
  Value: True
  Key: foo/tracking/IsquxInvoked
  Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
  Key: 3
  Value: <PortableValue>
    SourceIds: [baz, qux]
    Unseen: [baz, qux]
    PendingMessages: []
    TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
  SessionId: e8a3dd4385384e589271758a1435ab58
  CheckpointId: cd63bfae5da74893a4acbbed74e196b9
------------------------------Checkpoint[3]------------------------------
IsInitial: False
StepNumber: 3
RunnerData: <RunnerStateData>
  InstantiatedExecutors: [bar, foo, baz, quux, qux]
  QueuedMessages: []
  OutstandingRequests: []
StateData: <Dictionary<ScopeKey, PortableValue>>
  Key: foo/tracking/IsfooInvoked
  Value: True
  Key: foo/tracking/IsbarInvoked
  Value: True
  Key: foo/tracking/IsbazInvoked
  Value: True
  Key: foo/tracking/IsquxInvoked
  Value: True
  Key: foo/tracking/IsquuxInvoked
  Value: True
EdgeStateData: <Dictionary<EdgeId, PortableValue>>
  Key: 3
  Value: <PortableValue>
    SourceIds: [baz, qux]
    Unseen: [baz, qux]
    PendingMessages: []
    TypeId: Microsoft.Agents.AI.Workflows.Execution.FanInEdgeState
Parent: <CheckpointInfo>
  SessionId: e8a3dd4385384e589271758a1435ab58
  CheckpointId: 9420e8db25764828b0c1fcc900c5cd34
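For readers who want to produce a similar dump, a reflection-based printer can be sketched as follows. This is not the code used to generate the output above; `Dump` and `output` are hypothetical names, the format is simplified, and including non-public properties is an assumption made so that internal members would also be visible.

```csharp
using System;
using System.Collections;
using System.Reflection;
using System.Text;

var output = new StringBuilder();

// Writes one value: scalars inline, collections item by item, other objects property by property.
void Dump(string name, object? value, int depth)
{
    string indent = new string(' ', depth * 2);

    // Nulls, strings and value types (numbers, booleans, enums, structs) are printed via ToString().
    if (value is null || value is string || value.GetType().IsValueType)
    {
        output.AppendLine($"{indent}{name}: {(value ?? "null")}");
        return;
    }

    // Guard against very deep or cyclic object graphs.
    if (depth > 6)
    {
        output.AppendLine($"{indent}{name}: ...");
        return;
    }

    if (value is IEnumerable items)
    {
        output.AppendLine($"{indent}{name}:");
        int index = 0;
        foreach (object? item in items)
        {
            Dump($"[{index++}]", item, depth + 1);
        }
        return;
    }

    output.AppendLine($"{indent}{name}: <{value.GetType().Name}>");
    const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
    foreach (PropertyInfo property in value.GetType().GetProperties(flags))
    {
        if (property.GetIndexParameters().Length == 0) // skip indexers
        {
            Dump(property.Name, property.GetValue(value), depth + 1);
        }
    }
}

// Usage with a stand-in object; a real Checkpoint instance could be passed the same way.
Dump("Checkpoint", new { StepNumber = 1, Unseen = new[] { "baz", "qux" } }, 0);
Console.Write(output);
```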
4. Did FanInEdgeState really never change?
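So did the FanInEdgeState really stay unchanged across the four Supersteps? Given how FanInEdge works, that is impossible: if the state never changed, the FanInEdge's target node would never execute. An unchanged FanInEdgeState in the Checkpoints also explains exactly why the later Supersteps cannot be replayed. After restoring from Checkpoint[1], Unseen is [baz, qux] again; only qux runs again, baz never does, so the barrier never opens.
But what if my understanding of Checkpointing and of how FanInEdge works is wrong? To check, I set a breakpoint in the HandleAsync method of the custom SimpleExecutor and, when execution reached the qux node, inspected the contents of the IWorkflowContext.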

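As the debugger showed, we located the FanInEdgeState object through the IWorkflowContext. Its Unseen set contained only qux and no longer baz, and its PendingMessages collection already held the message sent by baz. That state is correct, so why is it not reflected in the created Checkpoints?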
5. The real root cause
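At this point we are close to the truth: the FanInEdgeState object used while executing the Workflow and the one used when creating the Checkpoint must be different objects. Reading the source code, we traced the problem to an internal type named InProcessRunner, which InProcessExecutionEnvironment uses as the Runner that executes the Workflow.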
internal sealed class InProcessRunner : ISuperStepRunner, ICheckpointingHandle
{
    private InProcessRunner(
        Workflow workflow,
        ICheckpointManager? checkpointManager,
        string? sessionId = null,
        object? existingOwnerSignoff = null,
        bool subworkflow = false,
        bool enableConcurrentRuns = false,
        IEnumerable<Type>? knownValidInputTypes = null)
    {
        // Execution context used while the Workflow runs.
        this.RunContext = new InProcessRunnerContext(
            workflow,
            this.SessionId,
            checkpointingEnabled: checkpointManager != null,
            this.OutgoingEvents,
            this.StepTracer,
            ...);

        // EdgeMap built from the Workflow's edges; CheckpointAsync exports edge state from it.
        this.EdgeMap = new EdgeMap(
            this.RunContext,
            this.Workflow.Edges,
            this.Workflow.Ports.Values,
            this.Workflow.StartExecutorId,
            this.StepTracer);
    }

    internal async ValueTask CheckpointAsync(CancellationToken cancellationToken = default)
    {
        ...
        Dictionary<EdgeId, PortableValue> edgeData =
            await this.EdgeMap.ExportStateAsync().ConfigureAwait(false);
        ...
        Checkpoint checkpoint = new(this.StepTracer.StepNumber, this._workflowInfoCache,
            runnerData, stateData, edgeData, this._lastCheckpointInfo);
        ...
    }
}
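As the code above shows, the InProcessRunner constructor uses the Workflow to create an InProcessRunnerContext as the execution context, and uses the Workflow's Edges collection to create an EdgeMap object. The FanInEdgeState that gets modified while the Workflow executes evidently lives in the InProcessRunnerContext (I confirmed this in the debugger). The edgeData that CheckpointAsync uses to create the Checkpoint, however, is exported from the EdgeMap, and that copy never changes.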
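The effect of reading from the wrong copy can be reproduced with a tiny simulation. This is a sketch under the assumption, taken from the analysis above, that two independent copies of the fan-in state exist; the variable names and `ExportCheckpoint` are hypothetical and none of this is MAF code.

```csharp
using System;
using System.Collections.Generic;

string[] sourceIds = { "baz", "qux" };

// Assumption: two independent copies of the fan-in state.
var runContextUnseen = new HashSet<string>(sourceIds); // mutated while the workflow executes
var edgeMapUnseen = new HashSet<string>(sourceIds);    // read when a checkpoint is exported

// Hypothetical export step: snapshots the EdgeMap copy, not the live one.
List<string> ExportCheckpoint() => new List<string>(edgeMapUnseen);

// baz finishes: only the live copy is updated.
runContextUnseen.Remove("baz");

List<string> checkpointedUnseen = ExportCheckpoint();
Console.WriteLine($"live Unseen count: {runContextUnseen.Count}");
Console.WriteLine($"checkpointed Unseen count: {checkpointedUnseen.Count}");

// Restoring from that checkpoint brings back the stale state.
// Only qux is replayed, so baz stays unseen and the barrier never opens.
var restoredUnseen = new HashSet<string>(checkpointedUnseen);
restoredUnseen.Remove("qux");
bool barrierOpens = restoredUnseen.Count == 0;
Console.WriteLine($"barrier opens after replaying qux: {barrierOpens}");
```

In this simulation the checkpointed set still contains both baz and qux, which matches the identical `Unseen: [baz, qux]` entries in all four Checkpoints and the missing quux invocation after restoring from Checkpoints[1].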




