From GNNs to Symbolic Surrogates via Kolmogorov–Arnold Networks for Delay Prediction
Abstract
Accurate prediction of flow delay is essential for optimizing and managing modern communication networks. We investigate three levels of modeling for this task. First, we implement a heterogeneous GNN with attention-based message passing, establishing a strong neural baseline. Second, we propose FlowKANet, in which Kolmogorov–Arnold Networks replace standard MLP layers, reducing the number of trainable parameters while maintaining competitive predictive performance. FlowKANet integrates KAMP-Attn (Kolmogorov–Arnold Message Passing with Attention), which embeds KAN operators directly into the message-passing and attention computations. Finally, we distill the model into symbolic surrogates using block-wise regression, producing closed-form equations that eliminate trainable weights while preserving graph-structured dependencies. The results show that KAN layers provide a favorable trade-off between efficiency and accuracy, and that symbolic surrogates offer a path toward lightweight deployment and enhanced transparency.
I Introduction
Flow delay is a key performance metric in communication networks, influencing congestion control, Traffic Engineering (TE), and Quality of Service (QoS). Network performance prediction has traditionally relied on analytical models and Discrete Event Simulation (DES). Queueing models make strong assumptions, while simulations face scalability and runtime limitations. In recent years, data-driven methods have emerged as an alternative, with Graph Neural Networks (GNNs) showing strong ability to capture dependencies between flows, links, and topologies.
Despite their success, GNN-based models face two main challenges. First, they are often parameter-heavy with tens of thousands of weights, hindering deployment in resource-constrained environments. Second, they remain largely black-box models: while accurate, they provide little transparency about how input features combine to yield predicted delays, limiting trust and interpretability in operational use.
Recently, Kolmogorov–Arnold Networks (KANs) [liu2024kan] have emerged as a promising alternative to conventional multi-layer perceptrons (MLPs). Grounded in the Kolmogorov–Arnold representation theorem, KANs replace fixed activations with learnable spline functions, enabling smooth and interpretable functional modeling. Their transparent structure makes them particularly suitable for analyzing relationships in network performance data. In this work, we investigate a spectrum of models that balance precision, compactness, and interpretability. Our contributions are:
- A heterogeneous GNN with attention-based message passing as a strong baseline for flow delay prediction.
- FlowKANet, a fully KAN-based GNN architecture in which all components are implemented with spline-based operators for greater functional consistency, efficiency, and interpretability. The model incorporates KAMP-Attn, a mechanism that leverages KAN operators to compute both feature transformations and attention coefficients within the message-passing process.
- A symbolic distillation of FlowKANet through block-wise regression, yielding compact analytical surrogates that preserve graph dependencies and enable lightweight, interpretable deployment.
II Related Works
II-A Traditional Network Modeling
Analytical queueing models, such as M/M/1 and M/M/k systems, provide closed-form expressions for metrics like average delay. While these models are mathematically elegant, they depend on strong assumptions (e.g., Poisson arrivals, exponential service times). DES approaches, on the other hand, can capture detailed protocol dynamics and queueing interactions with high fidelity. They are widely used in academia and industry, but their computational cost is prohibitive: DES is inherently difficult to parallelize, scales poorly with network size, and remains unsuitable for real-time applications. These limitations motivate the shift toward data-driven machine learning (ML) approaches, which learn performance models directly from traffic traces without restrictive assumptions, allowing more flexible and adaptive modeling of complex network behaviors.
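To make the closed-form nature of the queueing models mentioned in this subsection concrete, the mean time in system of a stable M/M/1 queue with arrival rate λ and service rate μ is 1/(μ − λ). The following minimal sketch evaluates it; the function name and the example rates are illustrative assumptions and are not taken from the dataset used in this paper.

```python
def mm1_mean_delay(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system (waiting + service) of a stable M/M/1 queue."""
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("M/M/1 is stable only when 0 <= arrival_rate < service_rate")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical rates in packets per second.
delay_light = mm1_mean_delay(arrival_rate=500.0, service_rate=1000.0)
delay_heavy = mm1_mean_delay(arrival_rate=950.0, service_rate=1000.0)
print(delay_light, delay_heavy)
```

The delay grows sharply as the arrival rate approaches the service rate, which is the behavior the formula captures only under the Poisson and exponential assumptions noted above.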
II-B Machine Learning for Networking
Machine learning has been used for various networking tasks such as traffic classification, anomaly detection, routing, and resource allocation [almasan2022deep, ridwan2021applications]. However, these approaches typically treat flows as independent samples, failing to capture the inherent graph structure of communication networks. This motivates the use of GNNs, which represent networks as graphs and capture node, link, and flow dependencies.
II-B1 GNN-based Models
GNNs process graph-structured data by iteratively propagating and aggregating information between neighboring nodes through message passing [zhou2020graph, gilmer2017neural]. In general, a GNN layer can be expressed as:
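$$h_v^{(l+1)} = \gamma\Big(h_v^{(l)},\ \bigoplus_{u \in \mathcal{N}(v)} \phi\big(h_v^{(l)}, h_u^{(l)}, e_{uv}\big)\Big) \tag{1}$$
where $e_{uv}$ are edge features, $\bigoplus$ is a permutation-invariant aggregation operator (e.g., sum, mean, max) over the neighborhood $\mathcal{N}(v)$, $h_v^{(l)}$ is the feature of node $v$ at layer $l$, $\phi$ is the message function, and $\gamma$ is the update function. This formulation suits communication networks, where the interaction between flows and links can naturally be represented as a graph.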
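A generic message-passing layer of this kind can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: sum aggregation is used, edge features are omitted for brevity, and the `message` and `update` callables are hypothetical stand-ins for learned functions.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def message_passing_step(
    h: Dict[int, Vector],
    neighbors: Dict[int, Sequence[int]],
    message: Callable[[Vector, Vector], Vector],
    update: Callable[[Vector, Vector], Vector],
) -> Dict[int, Vector]:
    """One layer: sum-aggregate neighbor messages, then update every node."""
    new_h = {}
    for v, h_v in h.items():
        agg = [0.0] * len(h_v)
        for u in neighbors.get(v, ()):
            m = message(h_v, h[u])
            agg = [a + b for a, b in zip(agg, m)]
        new_h[v] = update(h_v, agg)
    return new_h


# Toy path graph 0 - 1 - 2 with hand-written message/update functions.
h0 = {0: [1.0], 1: [2.0], 2: [4.0]}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h1 = message_passing_step(
    h0,
    nbrs,
    message=lambda h_v, h_u: h_u,  # forward the neighbor state unchanged
    update=lambda h_v, agg: [x + a for x, a in zip(h_v, agg)],  # add aggregate
)
print(h1)
```

Because the sum does not depend on the order in which neighbors are visited, the layer is permutation-invariant, as required of the aggregation operator.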
In traffic engineering (TE), GNNs have shown strong potential for resource allocation [xu2023teal, almasan2022deep]. TEAL [xu2023teal] combined a GNN with reinforcement learning and ADMM for WAN optimization, achieving near-optimal results. Building on this direction, FlowAttune [marouani2024advanced] applied graph attention to dynamically weight neighboring nodes during message passing. Formally, in a Graph Attention Network (GAT) [velivckovic2017graph], the message from a neighbor $j$ to node $i$ is weighted by an attention coefficient
$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\|\, W h_j]\big)\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\|\, W h_k]\big)\big)} \tag{2}$$
where $h_i$ and $h_j$ are node features, $W$ is a learnable weight matrix, $a$ is the attention vector, and $\|$ denotes concatenation. The updated node representation is then obtained as
$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\Big) \tag{3}$$
where $\sigma$ is a non-linear activation function.
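The attention coefficients of a GAT layer can be illustrated with a small pure-Python sketch. It follows the standard single-head GAT formulation (shared linear map, LeakyReLU score, softmax over neighbors); the helper names and the toy weights are assumptions chosen for illustration only.

```python
import math
from typing import List, Sequence

Vector = List[float]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def matvec(W: Sequence[Sequence[float]], x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gat_attention(h_i: Vector, neighbors: Sequence[Vector], W, a: Vector) -> Vector:
    """Softmax-normalized attention weights of node i over its neighbors."""
    wh_i = matvec(W, h_i)
    scores = []
    for h_j in neighbors:
        concat = wh_i + matvec(W, h_j)  # list concatenation [W h_i || W h_j]
        scores.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: identity weight matrix, attention vector reading one coordinate.
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 0.0]
alpha = gat_attention([0.0, 0.0], [[1.0, 0.0], [2.0, 0.0]], W, a)
print(alpha)
```

The neighbor with the larger score receives the larger weight, and the weights always sum to one over the neighborhood.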
Attention mechanisms enable GNNs to capture structural dependencies and adapt to dynamic traffic, offering a flexible alternative to fixed aggregation functions. Both TEAL and FlowAttune model the network as a flow–link bipartite graph, which facilitates message passing between flows and their associated links. This representation inspired our own work, in which we adopt a similar bipartite modeling strategy for flow delay prediction. GNNs have also been applied to performance prediction. RouteNet [rusek2020routenet, ferriol2023routenet] and its extensions have estimated end-to-end metrics such as delay and jitter with high accuracy across unseen topologies, while the GNNet Challenge [guemes2023building] further validated GNN-based performance prediction using real traffic traces, adopting RouteNet-Fermi as its baseline. These works highlight the versatility of GNNs in networking, from traffic allocation to performance prediction. However, existing GNNs still face (i) large parameter counts, leading to high training and inference costs; (ii) limited scalability on large graphs; and (iii) a lack of transparency, which constrains their use in operational and real-time network environments.
II-B2 Kolmogorov–Arnold Networks (KANs)
KANs [liu2024kan] are inspired by the Kolmogorov–Arnold representation theorem, which expresses any continuous multivariate function as a finite composition of univariate functions and additions. Accordingly, KANs replace fixed activations with trainable spline operators. A KAN layer computes:
$$x_j^{(l+1)} = \sum_{i=1}^{n_l} \phi_{j,i}^{(l)}\big(x_i^{(l)}\big) \tag{4}$$
where $\phi_{j,i}^{(l)}$ is a learnable spline function defined on a grid and $n_l$ is the input width of layer $l$. This design enables smoother function approximation, improved parameter efficiency, and greater transparency of learned transformations. KANs have shown promising results in scientific machine learning and physics-informed tasks [ji2024comprehensive, rigas2024adaptive, somvanshi2025survey], but remain unexplored in networking and in combination with GNNs. This motivates one of the main axes of our study: exploring the integration of KAN layers within GNN architectures for flow delay prediction.
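The structure of a KAN layer, a sum of learnable univariate functions per output, can be sketched as follows. This is a deliberately simplified illustration: it uses piecewise-linear (order-1) hat functions on a uniform grid instead of higher-order B-splines, it omits the residual base activation of the original KAN formulation, and the names `hat_basis`, `kan_layer`, and `coeffs` are hypothetical.

```python
from typing import List, Sequence


def hat_basis(x: float, grid: Sequence[float]) -> List[float]:
    """Values of the order-1 B-spline (hat) basis at x on a uniform grid."""
    step = grid[1] - grid[0]
    x = min(max(x, grid[0]), grid[-1])  # clamp to the grid range
    return [max(0.0, 1.0 - abs(x - g) / step) for g in grid]


def kan_layer(x: Sequence[float], coeffs, grid: Sequence[float]) -> List[float]:
    """coeffs[j][i][g] is the coefficient of basis g in the function phi_{j,i}."""
    basis = [hat_basis(xi, grid) for xi in x]
    return [
        sum(
            sum(c * b for c, b in zip(coeffs[j][i], basis[i]))
            for i in range(len(x))
        )
        for j in range(len(coeffs))
    ]


# One output, two inputs: phi_00 interpolates x^2 at the knots, phi_01 is identity.
grid = [0.0, 0.5, 1.0]
coeffs = [[[0.0, 0.25, 1.0], [0.0, 0.5, 1.0]]]
out = kan_layer([0.5, 0.25], coeffs, grid)
print(out)
```

Each `coeffs[j][i]` row fully determines one univariate curve, which is what makes the learned functions easy to plot and, later, to replace with closed-form expressions.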
III Framework for Flow Delay Prediction
We introduce a unified framework that provides a single pipeline from raw network data to graph-based models. Unified here means that the same data representation, preprocessing, and normalization steps are shared across architectures, ensuring that improvements can be attributed to the model design itself. The framework starts with two important steps: constructing a heterogeneous bipartite graph that captures flow–link relationships, and selecting a compact set of relevant features that improves efficiency and generalization. On this foundation, we implement two predictive models: a baseline GNN and a KAN-augmented GNN.
III-A Graph Construction
The network is represented as a heterogeneous bipartite graph $\mathcal{G} = (\mathcal{F} \cup \mathcal{L}, \mathcal{E})$, where $\mathcal{F}$ denotes the set of flow nodes, each corresponding to a unidirectional flow characterized by features describing its traffic profile (e.g., rate, packet size, burstiness, loss), and $\mathcal{L}$ denotes the set of link nodes, each representing a physical link annotated with its capacity and load. The edge set $\mathcal{E}$ connects flows to the links they traverse, as determined by routing, and includes both $f \to l$ and $l \to f$ directions to enable bidirectional message passing. This construction captures the dependency of flows on the sequence of links along their path and vice versa.
Data extraction and feature engineering
The raw dataset provides per-flow, per-link, and topology-level information. From this, we build feature vectors for each node type. Flow nodes include basic attributes such as average traffic, number of packets, mean packet size, flow type, and path length, augmented with distributional descriptors that capture burstiness and variability, including inter-packet gap (IPG) statistics (mean, variance, and selected percentiles), packet-size percentiles, packet loss ratio, variance of packet sizes, inter-burst gap (IBG), rate, per-burst bitrate, and type of service (ToS). Each link node is annotated with its capacity and a normalized load, computed as
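$$\rho_l = \frac{1}{c_l} \sum_{f \in \mathcal{F}_l} \mathrm{traffic}_f \tag{5}$$
where $\mathcal{F}_l$ is the set of flows traversing link $l$, $\mathrm{traffic}_f$ is the average traffic of flow $f$, and $c_l$ is the link capacity (Gbps). The concatenation $[c_l, \rho_l]$ forms the feature vector for link $l$. This enriched representation incorporates not only average values but also distributional characteristics of packet timing and size, which are crucial for accurately modeling flow delay.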
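The link feature construction described above amounts to summing the traffic of the flows on a link and dividing by its capacity. A minimal sketch follows; it assumes that flow traffic is expressed in the same unit as capacity (Gbps here), and the function name and sample values are hypothetical.

```python
from typing import List, Sequence


def link_features(capacity_gbps: float, flow_traffic_gbps: Sequence[float]) -> List[float]:
    """Return [capacity, normalized load] for one link.

    Assumes flow traffic and capacity share the same unit (Gbps).
    """
    load = sum(flow_traffic_gbps) / capacity_gbps
    return [capacity_gbps, load]


# Hypothetical link of 10 Gbps traversed by three flows.
features = link_features(10.0, [2.0, 3.0, 1.0])
print(features)
```

A load above 1 would indicate that the offered traffic exceeds the link capacity, which is the congestion signal the models can exploit.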
III-B Feature Selection
The flow feature set exceeds one hundred dimensions, many of which are correlated or redundant. To reduce complexity and improve generalization, we apply Sequential Forward Selection (SFS) with a linear regression proxy and three-fold cross-validation, using mean squared error (MSE) for stable optimization on small delay values. SFS incrementally adds features that maximize performance gain until convergence, yielding a compact subset of 16 flow features (Table I) that balance expressiveness and efficiency.
| Feature name | Description |
|---|---|
| flow_traffic | Average flow bitrate |
| flow_packets | Number of packets in the flow |
| flow_packet_size | Mean packet size |
| flow_type | CBR or MB |
| flow_length | Path length (hop count) |
| flow_p10PktSize | 10th percentile of packet size |
| flow_tos | Type of Service (ToS) field |
| flow_packet_loss | Packet loss ratio (%) |
| ibg | Inter-burst gap |
| rate | Flow generation rate |
| flow_bitrate_per_burst | Average bitrate per burst |
| flow_ipg_mean | Mean inter-packet gap |
| flow_ipg_var | Variance of inter-packet gap |
| IPG percentiles P11, P99, P100 | 11th, 99th, and 100th percentiles of the IPG distribution |
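The selection procedure can be sketched as a greedy loop around a cross-validated linear regression. The sketch below is an assumption-laden illustration rather than the pipeline used in this work: it relies on NumPy, uses contiguous (unshuffled) folds, stops when the MSE gain falls below a hypothetical tolerance `tol`, and runs on synthetic data.

```python
import numpy as np


def cv_mse(X, y, n_folds=3):
    """Mean k-fold cross-validated MSE of least squares with an intercept."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        A_train = np.column_stack([X[train], np.ones(len(train))])
        A_test = np.column_stack([X[test], np.ones(len(test))])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((A_test @ coef - y[test]) ** 2))
    return float(np.mean(errors))


def forward_select(X, y, max_features, tol=1e-9):
    """Greedily add the feature that most reduces CV MSE; stop when gains vanish."""
    selected, best = [], float("inf")
    while len(selected) < max_features:
        scores = {
            j: cv_mse(X[:, selected + [j]], y)
            for j in range(X.shape[1])
            if j not in selected
        }
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] <= tol:
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected


# Synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=120)
chosen = forward_select(X, y, max_features=4)
print(chosen)
```

On this toy problem the two informative columns are picked first; any further columns are added only if they happen to lower the cross-validated error.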
Feature normalization
To stabilize training and ensure comparable scaling across heterogeneous features, we apply min–max normalization to all flow attributes:
$$\tilde{x}_k = \frac{x_k - x_k^{\min}}{x_k^{\max} - x_k^{\min}} \tag{6}$$
where $x_k^{\min}$ and $x_k^{\max}$ are the per-feature minima and maxima computed on the training set. These statistics are stored in model buffers (min_feat, inv_range) and reused during inference, ensuring consistent scaling across datasets.
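The normalization and its stored statistics can be sketched as below. The buffer names `min_feat` and `inv_range` come from the text; the function names, the zero scale for constant features, and the sample rows are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute per-feature minimum and inverse range on the training rows."""
    cols = list(zip(*rows))
    min_feat = [min(c) for c in cols]
    # Assumption: a constant feature gets scale 0 so it maps to 0 instead of dividing by 0.
    inv_range = [1.0 / (max(c) - min(c)) if max(c) > min(c) else 0.0 for c in cols]
    return min_feat, inv_range


def normalize(row: Sequence[float], min_feat, inv_range) -> List[float]:
    """Apply the stored training statistics to one feature vector."""
    return [(x - lo) * inv for x, lo, inv in zip(row, min_feat, inv_range)]


train_rows = [[0.0, 10.0], [5.0, 30.0], [10.0, 20.0]]
min_feat, inv_range = fit_min_max(train_rows)
scaled = normalize([5.0, 20.0], min_feat, inv_range)
print(scaled)
```

Storing the inverse range rather than the range turns inference-time scaling into a subtraction and a multiplication per feature.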
III-C Baseline GNN Model Architecture
The baseline architecture is a heterogeneous GNN designed for flow delay prediction. Its workflow is given in Algorithm 1, while the main architectural blocks are shown in Figure 1.
- Flow and Link Encoders: Raw flow and link features are projected into latent embeddings of dimension $d$.
- Message Passing: Each link aggregates messages from incident flows and each flow from its traversed links. Attention mechanisms compute flow-to-link and link-to-flow scores, weighting contributions by traffic intensity and congestion.
- Readout and Prediction: A gated recurrent unit (GRU) refines flow embeddings across iterations, followed by fusion of flow and aggregated link embeddings. A Softplus layer outputs the final delay prediction.
III-D FlowKANet Model Architecture
To reduce complexity while preserving expressivity, we extend the baseline by replacing all MLPs with KAN layers. The overall message-passing structure remains identical, but every transformation block is spline-based. The workflow is summarized in Algorithm 2.
- Flow and Link Encoders: Flow and link features are projected into latent embeddings by KAN layers. These initial encoders provide compact hidden representations tailored to each node type.
- KAMP-Attn (Kolmogorov–Arnold Message Passing with Attention): Messages are exchanged between flows and links using KAN operators for both transformation and attention, ensuring bidirectional propagation of information across the bipartite graph.
- Readout and Prediction: After message passing, flow embeddings are fused with their aggregated link representations and passed through a final KAN block with Softplus activation to predict the per-flow delay.
KAN-based message passing
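Let $h_f^{(t)}$ and $h_l^{(t)}$ denote the embeddings of flow $f$ and link $l$ at iteration $t$. For each edge $(f, l)$, the message is computed using two shared spline-based operators: (i) a transformation operator $\Phi_{F \to L}$ that maps flow embeddings into the link space, and (ii) an attention operator $\Psi_{F \to L}$ that produces edge-specific importance weights. These operators are shared across all edges in the same direction. Formally,
$$m_{f \to l}^{(t)} = \Phi_{F \to L}\big(h_f^{(t)}\big) \tag{7}$$
$$e_{f \to l}^{(t)} = \Psi_{F \to L}\big([h_f^{(t)} \,\|\, h_l^{(t)}]\big) \tag{8}$$
$$\alpha_{f \to l}^{(t)} = \frac{\exp\big(e_{f \to l}^{(t)}\big)}{\sum_{f' \in \mathcal{N}(l)} \exp\big(e_{f' \to l}^{(t)}\big)} \tag{9}$$
The aggregated message at a node is then
$$M_l^{(t)} = \sum_{f \in \mathcal{N}(l)} \alpha_{f \to l}^{(t)}\, m_{f \to l}^{(t)} \tag{10}$$
and the node embedding is updated with a residual connection:
$$h_l^{(t+1)} = h_l^{(t)} + M_l^{(t)} \tag{11}$$
This mechanism operates in both directions: flows send messages to links via $\Phi_{F \to L}$ and $\Psi_{F \to L}$, while links send messages back to flows through distinct operators $\Phi_{L \to F}$ and $\Psi_{L \to F}$. Each operator is shared across all edges of its respective direction, ensuring consistency and avoiding edge-specific parameterization.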
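The flow-to-link direction of this mechanism can be sketched in plain Python. Several points are assumptions made for illustration: the `transform` and `attention` callables are simple stand-ins for trained KAN operators, the attention operator is assumed to receive the concatenated flow and link embeddings, and the scores are assumed to be softmax-normalized over the flows incident to each link.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flow_to_link_step(
    h_flow: Dict[str, Vector],
    h_link: Dict[str, Vector],
    link_to_flows: Dict[str, Sequence[str]],
    transform: Callable[[Vector], Vector],   # stand-in: maps flow space to link space
    attention: Callable[[Vector], float],    # stand-in: scores [h_flow || h_link]
) -> Dict[str, Vector]:
    """One attention-weighted aggregation with a residual update of each link."""
    new_h_link = {}
    for link, flows in link_to_flows.items():
        if not flows:  # a link without flows keeps its embedding
            new_h_link[link] = list(h_link[link])
            continue
        msgs = [transform(h_flow[f]) for f in flows]
        alphas = softmax([attention(h_flow[f] + h_link[link]) for f in flows])
        agg = [
            sum(a * m[k] for a, m in zip(alphas, msgs))
            for k in range(len(h_link[link]))
        ]
        new_h_link[link] = [h + g for h, g in zip(h_link[link], agg)]
    return new_h_link


# Toy graph: two flows with 2-d embeddings share one link with a 1-d embedding.
h_flow = {"f1": [1.0, 0.0], "f2": [0.0, 1.0]}
h_link = {"l1": [0.5]}
updated = flow_to_link_step(
    h_flow,
    h_link,
    {"l1": ["f1", "f2"]},
    transform=lambda h: [h[0] + 2.0 * h[1]],
    attention=lambda z: 0.0,  # constant score, hence uniform attention
)
print(updated)
```

The link-to-flow direction would follow the same pattern with its own pair of operators and the roles of flows and links swapped.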
Concise single-step composition
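Combining transformation, aggregation, and fusion, the per-flow output after $T$ rounds of bidirectional message passing can be expressed as
$$\hat{y}_f = \Phi_{\mathrm{out}}\Big(\Phi_{\mathrm{fuse}}\big([h_f^{(T)} \,\|\, c_f]\big)\Big), \qquad c_f = \mathrm{AGG}_{l \in \mathcal{N}(f)}\, h_l^{(T)} \tag{12}$$
where $h_f^{(T)}$ and $h_l^{(T)}$ are the final flow and link embeddings, and $c_f$ is the aggregated link context. Here, $\Phi_{\mathrm{fuse}}$ and $\Phi_{\mathrm{out}}$ denote the KAN fusion and readout blocks, respectively, with the latter ending in a Softplus activation to ensure non-negative delay predictions.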
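The role of the Softplus at the end of the readout can be illustrated with a numerically stable implementation. The `predict_delay` wrapper and its `readout` argument are hypothetical stand-ins for the fused KAN blocks; only the Softplus itself is a fixed, well-known function.

```python
import math
from typing import Callable, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def predict_delay(fused: Sequence[float], readout: Callable[[Sequence[float]], float]) -> float:
    """Apply a scalar readout (stand-in for the KAN block), then Softplus."""
    return softplus(readout(fused))


# Hypothetical linear readout over a fused flow/link embedding.
delay = predict_delay([0.4, -1.0, 2.0], readout=lambda z: sum(z) - 3.0)
print(delay)
```

Even when the readout produces a negative raw score, as in this example, the final prediction stays non-negative, which matches the physical meaning of a delay.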
III-E Symbolic Surrogate Models
Although the KAN-augmented GNN is lighter than the MLP baseline, it still contains many trainable parameters. To further reduce deployment overhead, we distill the trained FlowKANet into symbolic surrogate models, replacing each KAN block with an analytical expression that approximates its learned mapping, yielding a fully symbolic pipeline from input features to predicted delay. We employ PySR [cranmer2024pysr, cranmer2023interpretablemachinelearningscience], combined with Optuna-based hyperparameter optimization, to discover compact analytical expressions that closely match the outputs of the corresponding KAN operators.
Sequential block-wise search
The symbolic distillation is performed progressively, one block at a time, following the network structure. At each step, previously symbolized equations are frozen while downstream components remain neural. This ensures consistency of symbolic dependencies and yields a coherent chain of analytical transformations. The procedure can be summarized as follows:
1. For each block $B_i$, freeze all previously symbolized blocks and keep downstream blocks neural.
2. Fit a PySR regressor to approximate the KAN output of $B_i$, while Optuna tunes PySR hyperparameters (population size, mutation rate, parsimony).
3. Evaluate each candidate expression inside the full model on a validation subset, fix the best formula, and proceed to the next block $B_{i+1}$.
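The control flow of this sequential procedure can be sketched on a toy chain of scalar blocks. This is not the PySR-based search itself: a small fixed library of closed-form candidates stands in for the symbolic regressor, the blocks are scalar functions rather than graph operators, and all names are hypothetical. What the sketch preserves is the order of operations, namely that earlier blocks are already symbolic, later blocks are still neural, and each candidate is scored inside the full hybrid chain.

```python
import math
from typing import Callable, List, Sequence, Tuple

Block = Callable[[float], float]


def run_chain(blocks: Sequence[Block], x: float) -> float:
    """Apply the blocks in order to a scalar input."""
    for block in blocks:
        x = block(x)
    return x


def distill_blockwise(
    neural_blocks: Sequence[Block],
    candidates: Sequence[Tuple[str, Block]],
    val_inputs: Sequence[float],
    val_targets: Sequence[float],
) -> List[str]:
    symbolic: List[Block] = []  # blocks already replaced and frozen
    chosen: List[str] = []
    for i in range(len(neural_blocks)):
        downstream = list(neural_blocks[i + 1:])  # still neural

        def hybrid_mse(expr: Block) -> float:
            chain = symbolic + [expr] + downstream
            preds = [run_chain(chain, x) for x in val_inputs]
            return sum((p - t) ** 2 for p, t in zip(preds, val_targets)) / len(preds)

        name, expr = min(candidates, key=lambda c: hybrid_mse(c[1]))
        symbolic.append(expr)  # fix the best formula, then move on
        chosen.append(name)
    return chosen


# Toy "neural" blocks: near-symbolic functions with small perturbations.
neural = [
    lambda x: 2.0 * x + 1e-3 * math.sin(x),
    lambda x: x * x + 1e-3 * math.cos(x),
]
library = [("2*x", lambda x: 2.0 * x), ("x^2", lambda x: x * x), ("exp(x)", math.exp)]
inputs = [0.1 * k for k in range(1, 11)]
targets = [run_chain(neural, x) for x in inputs]
formulas = distill_blockwise(neural, library, inputs, targets)
print(formulas)
```

Scoring candidates end to end, rather than on the block output alone, is what lets an expression be rejected when its small local error is amplified by downstream blocks.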
Final surrogate pipeline
Once all blocks have been symbolized, we obtain a fully analytical model that respects the underlying graph structure. Formally, the surrogate prediction takes the form:
$$\hat{y}_f = \mathcal{S}\big(x_f, \{x_l\}_{l \in \mathcal{N}(f)}, \mathcal{N}\big) \tag{13}$$
where $\mathcal{S}$ denotes the composed surrogate equations, $x_f$ are the selected flow features, $x_l$ are link descriptors, and $\mathcal{N}$ encodes the neighborhood relations in the bipartite flow–link graph. In other words, the symbolic surrogate maintains the same message-passing dependencies as the neural FlowKANet: flow delay predictions depend not only on local features but also on the aggregated symbolic contributions of neighboring links and flows. This yields a fully analytical yet graph-aware predictor that mirrors the inductive bias of the original architecture while eliminating the need for neural inference.
IV Performance Evaluation
IV-A Experimental Setup
We use the GNNet Challenge dataset [guemes2023building], which provides realistic topologies and flow-level traces for graph-based performance prediction. All experiments employ the heterogeneous bipartite graph representation described in Section III. The dataset is accessed through an API that exposes per-flow, per-link, and topology-level features, and is split into 3,511 training graphs (80%) and 878 test graphs (20%).
IV-B Hyperparameter Search with Optuna
We employ Optuna to automatically select the main architectural hyperparameters of FlowKANet. The search space includes the hidden dimensions of flow and link embeddings, the number of message-passing layers $T$, the KAN parameters (grid size $G$, spline order $k$, and scaling $s$), as well as the dropout rate, learning rate, and activation configuration.
| Block | Grid | Order | Scale |
|---|---|---|---|
| flow_init | 9 | 3 | 0.93 |
| link_init | 7 | 5 | 1.66 |
| (i=0) | 5 | 3 | 0.55 |
| (i=1) | 6 | 4 | 0.70 |
| (i=2) | 8 | 4 | 0.82 |
| (i=0) | 7 | 3 | 0.73 |
| (i=1) | 7 | 5 | 0.77 |
| (i=2) | 10 | 3 | 0.33 |
| fuse | 6 | 5 | 1.15 |
| final | 10 | 5 | 2.28 |
We tested several activation functions (ReLU, SiLU, Softplus, Tanh) and four placement strategies: final_only (after the fusion block), except_mp (all blocks except message passing), all (every KAN block), and no_activation (none applied). Regardless of the chosen strategy, the last readout block always applies Softplus to guarantee non-negative flow delay predictions. Each Optuna trial was trained for up to 150 epochs with early stopping on the validation set. The Tree-structured Parzen Estimator (TPE) sampler was used for efficient exploration of the large search space. Table III summarizes the best global hyperparameters, while Table II details the KAN-specific settings for each block.
| Parameter | Best Value | Description |
|---|---|---|
| Flow hidden dim. | 8 | Size of flow embedding |
| Link hidden dim. | 2 | Size of link embedding |
| MP layers | 3 | Number of heterogeneous MP layers |
| Dropout | 0.1 | Regularization between layers |
| Learning rate | 0.002 | Optimizer step size |
| Activation type | Tanh | Activation applied to KAN outputs |
| Activation mode | except_mp | Applied to all blocks except MP |
IV-C Symbolic Surrogate Search
| Component | Options / Constraints |
|---|---|
| Binary operators | , , with exponent range |
| Unary operators | , , , |
| Expression size | maxsize |
| Numerical guards | ; clipped to ; NaN/ replaced |
We apply the symbolic distillation procedure described in Section III-E to the trained FlowKANet model. For each KAN block, PySR performs symbolic regression guided by Optuna-based hyperparameter optimization, jointly tuning the operator sets, tree complexity (maxsize), population size, iteration count, and parsimony coefficient to minimize the validation MSE. Expressions exceeding the desired complexity are penalized via the parsimony term, and model selection favors the most accurate symbolic representations. Numerical robustness is enforced through safe operators and replacement of non-finite values. The search runs in parallel through a shared Optuna RDB and is limited to 250 trials per block. FlowKANet weights remain frozen during distillation. For each block, input–output activations are sampled from the training graphs, with a fraction , set to in our runs used for symbolic fitting and the remainder for validation within the hybrid model. The best expression per block is validated in context and fixed, progressively replacing all KAN modules to obtain a fully symbolic surrogate of the network.
IV-D Predictive Accuracy
We compare the baseline GNN, the KAN-augmented GNN, and the symbolic surrogate distilled from FlowKANet, reporting MSE and $R^2$ on the test set.
| Model | MSE (lower is better) | $R^2$ (higher is better) |
|---|---|---|
| Baseline GNN | 38.6358 | 0.8113 |
| FlowKANet | 40.8094 | 0.8727 |
| Symbolic surrogate | 54.8562 | 0.8290 |
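For reference, the two metrics reported in this comparison can be computed as in the following sketch. The definitions are the standard ones (mean squared error and the coefficient of determination); the sample values are hypothetical and unrelated to the reported results.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical delays and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mse(y_true, y_pred), r2_score(y_true, y_pred))
```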
Both neural models achieve strong predictive power on the GNNet dataset. FlowKANet attains a higher $R^2$ at a slightly higher MSE, suggesting that the spline operators capture small-delay variance while the Softplus readout keeps predictions stable. The symbolic surrogate, though less accurate, provides transparent closed-form equations suited for interpretable or lightweight deployment. Figure 2 shows predicted vs. true delays; FlowKANet points align more closely with the diagonal, consistent with improved variance capture.

During symbolic distillation, we tracked the model MSE as each KAN block was replaced (“progressive hybrid”). Figure 3 shows that replacing early encoders and mid-level message-passing blocks causes mild degradation, while symbolizing the final layers increases error more sharply. Thus, late components are the most critical for accuracy, suggesting that hybrid deployments (symbolic early and mid blocks with a neural readout) offer the best trade-off between interpretability and performance.

Overall, KANs bridge conventional GNNs and symbolic models, reducing parameters while retaining accuracy and enabling transparent surrogates. Future work will address the accuracy loss in the final blocks through symbolic operators for message passing and adaptive hybrid designs.
IV-E Parameter Efficiency
Model compactness is critical for deployment in practical settings. Table VI compares the number of parameters across the three evaluated models. FlowKANet reduces the trainable parameter count by a factor of nearly five compared to the GNN (20k vs. 98k), and the symbolic surrogate eliminates all trainable weights, leaving only 267 fixed constants in the final equations. This progression illustrates an efficiency–interpretability spectrum: from large but flexible GNNs, through compact KAN-augmented models, to purely analytical surrogates suitable for constrained or safety-critical environments.
| Model | Model Parameters |
|---|---|
| Baseline GNN | 98,210 |
| FlowKANet | 20,094 |
| Symbolic surrogate | 267 (Constants in Equations) |
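The reduction factors implied by these counts follow from simple division, as the short check below shows. The counts are the ones listed in the table; the variable names are illustrative.

```python
# Parameter counts as listed in the table above.
params = {
    "Baseline GNN": 98_210,
    "FlowKANet": 20_094,
    "Symbolic surrogate": 267,
}

gnn_to_kan = params["Baseline GNN"] / params["FlowKANet"]
kan_to_symbolic = params["FlowKANet"] / params["Symbolic surrogate"]
gnn_to_symbolic = params["Baseline GNN"] / params["Symbolic surrogate"]

print(f"GNN -> FlowKANet:      {gnn_to_kan:.2f}x fewer parameters")
print(f"FlowKANet -> symbolic: {kan_to_symbolic:.1f}x fewer constants than weights")
print(f"GNN -> symbolic:       {gnn_to_symbolic:.1f}x")
```

The first ratio is roughly 4.9, which is the "nearly five-fold" reduction quoted in the text.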
V Conclusion
We have presented a unified framework for flow delay prediction, covering three levels of model design: a heterogeneous GNN with attention-based message passing, the FlowKANet model with spline-based transformations, and fully symbolic surrogates distilled from the KAN model. Our experiments show that FlowKANet achieves accuracy comparable to the GNN while reducing the parameter count nearly five-fold. The symbolic surrogates, although less accurate, eliminate trainable parameters entirely and produce transparent closed-form equations that respect the original graph structure. This progression highlights a clear spectrum of trade-offs: from accuracy-focused neural models, to compact spline-based architectures, to symbolic predictors suitable for deployment in resource-constrained or safety-critical environments. In future work, we will focus on deeper interpretation of the learned transformations, both in the KAN-based model and in the distilled symbolic equations, to provide further insights into how graph-structured dependencies drive flow delay.
Acknowledgment
This work has been supported by grant ANR-21-CE25-0005 from the Agence Nationale de la Recherche, France for the SAFE project.