A Mechanistic Analysis of Transformers
for Dynamical Systems

Gregory Duthé
ETH Zürich
Nikolaos Evangelou
Johns Hopkins University

Wei Liu
Singapore-ETH Centre
Ioannis G. Kevrekidis
Johns Hopkins University
Eleni Chatzi
ETH Zürich
Abstract

Transformers are increasingly adopted for modeling and forecasting time-series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state–space models, which benefit from well-established theoretical foundations, Transformer architectures are typically treated as black boxes. This gap becomes particularly relevant as attention-based models are considered for general-purpose or zero-shot forecasting across diverse dynamical regimes. In this work, we do not propose a new forecasting model, but instead investigate the representational capabilities and limitations of single-layer Transformers when applied to dynamical data. Building on a dynamical systems perspective, we interpret causal self-attention as a linear, history-dependent recurrence and analyze how it processes temporal information. Through a series of linear and nonlinear case studies, we identify distinct operational regimes. For linear systems, we show that the convexity constraint imposed by softmax attention fundamentally restricts the class of dynamics that can be represented, leading to oversmoothing in oscillatory settings. For nonlinear systems under partial observability, attention instead acts as an adaptive delay-embedding mechanism, enabling effective state reconstruction when sufficient temporal context and latent dimensionality are available. These results help bridge empirical observations with classical dynamical systems theory, providing insight into when and why Transformers succeed or fail as models of dynamical systems.

1 Introduction

Understanding and modeling dynamical systems from data (in the form of observations) is a central problem in nonlinear science, with applications ranging from fluid mechanics and structural dynamics [amoudruz2025, LIU2022109276, RAISSI2019686] to neuroscience, chemical kinetics, weather and power systems and beyond [10.1063/5.0297336, 10.1063/5.0291493, chiavazzo2014reduced, wu2023interpretable]. Classical approaches rely either on explicit governing equations or on well-established data-driven identification frameworks, such as autoregressive and state–space models, for which stability, observability, and identifiability properties are well understood [box1976analysis, Kantz_Schreiber_2003]. These frameworks provide a principled connection between data, latent state representations, and the underlying geometry of dynamical systems, including attractors and invariant manifolds.

More recently, machine-learning architectures originally developed for sequence modeling have been increasingly applied to dynamical systems modeling, particularly for purely data-driven inference, raising fundamental questions about their expressive power and their relationship to classical dynamical systems theory. Among these architectures, the Transformer model [vaswani2017attention] has emerged as a dominant paradigm. Originally introduced for natural language processing, Transformers are now widely used in computer vision, speech processing, and scientific machine learning [pmlr-v267-holzschuh25a], including the modeling and forecasting of dynamical systems [geneva2022transformers, sitapure2023introducing, gao2025learning]. Their defining feature is the attention mechanism, which enables flexible aggregation of information across a temporal context through parallel rather than sequential computation. This property has led to strong empirical performance in step-ahead prediction tasks, including for nonlinear and weakly chaotic systems [valle2025forecasting, choi2025defining].

A growing body of work has explored the use of Transformers for dynamical and physical systems. Early studies demonstrated that attention-based models can learn surrogate evolution maps when provided with suitable spatiotemporal tokenizations. Geneva and Zabaras [geneva2022transformers], for example, modeled diverse dynamical systems using a “vanilla” Transformer architecture, relying on Koopman-based embeddings to project high-dimensional states into lower-dimensional token representations. Subsequent work investigated the direct application of Transformers to chaotic time-series forecasting, showing that autoregressive prediction is feasible when the Lyapunov exponent is sufficiently low [valle2025forecasting]. More recent efforts have extended these ideas towards large pretrained scientific foundation models. Aurora, for instance, is proposed as a foundation model for the Earth system, trained on heterogeneous atmospheric and oceanic datasets and equipped with an encoder–processor–decoder architecture to evolve a latent three-dimensional spatial representation forward in time [bodnar2025aurora]. These studies indicate that Transformers, or Transformer-like operator processors, can act as general temporal integrators across complex physical systems, often at substantially reduced computational cost compared to traditional numerical pipelines.

In parallel, operator-style Transformer architectures have been developed specifically for scientific computing [shih2025transformers]. Poseidon introduces a multiscale operator Transformer pretrained on diverse fluid-dynamics PDE datasets and leverages time-conditioned layers together with semigroup training to enable continuous-in-time evaluation [poseidon2024]. This places Transformers within the broader operator-learning lineage that includes Fourier- and Graph Neural Operators. Related theoretical work has clarified connections between attention mechanisms and classical numerical integration or projection schemes. Li et al. [li2020fourier, kovachki2023neuraloperators] introduced the Fourier Neural Operator framework, which learns mappings between function spaces using spectral convolution kernels and can be interpreted as performing data-driven Galerkin projections. Building on this perspective, Cao et al. [cao2021choose] showed that self-attention can be interpreted as a learnable integral operator, capable of recovering Fourier- or Galerkin-type behavior depending on positional encoding and kernelization. These results position attention mechanisms and neural operators within a shared theoretical space as flexible, possibly nonlocal (in space and even possibly in time) integrators/solvers.

A second, rapidly growing stream concerns time-series foundation models. Chronos treats time series as tokenized sequences via scaling and quantization and reuses T5-style Transformers to obtain zero-shot probabilistic forecasts across domains [chronos2024]. Subsequent models, including Chronos-Bolt, improved speed and accuracy and extended the foundation models to multivariate systems [ansari2025chronos2univariateuniversalforecasting], reinforcing the view that a single pretrained Transformer can generalize across dynamical regimes provided that the data are cast into a language-like format. This paradigm aligns closely with recent zero-shot and universal forecasting studies for chaotic systems [zhang2024zero, lai2025panda, hemmer2025true], as well as with position papers calling for clearer definitions of what constitutes a foundation model for computational science [choi2025defining].

Despite this growing body of work, the role of Transformers in modeling dynamical systems remains poorly understood from a theoretical standpoint, and to our knowledge, connections with first principles of autoregressive modeling and dynamical systems theory have not yet been firmly established. Transformers do not maintain an internal representation that is explicitly interpreted as (i.e. matched with, mapped to) the physical state of the system. Instead, their internal embeddings are optimized for predictive performance and are not directly constrained to represent dynamical invariants such as attractors, invariant manifolds, or conserved quantities associated with Hamiltonian or symplectic structure. This contrasts with structure-preserving architectures, which enforce such physical constraints by design [Bertalan2019, lutter2019, HERNANDEZ2021109950, Bacsa2023]. As a result, it remains unclear what classes of dynamical behavior Transformers can faithfully represent, under what conditions they succeed or fail, and how their internal computations relate to established concepts in nonlinear dynamics.

Recent studies have begun to address these gaps by probing not only forecasting accuracy but also the nature of the representations learned by Transformers. Kantamneni, Liu, and Tegmark analyzed how a Transformer models the simple harmonic oscillator, showing that attention induces a convex, data-driven autoregressive operator with characteristic spectral limitations [kantamneni2024transformers]. Related concerns arise in zero-shot dynamical studies, where long-term statistics may be preserved while the internal representation remains opaque [hemmer2025true]. At the same time, theoretical developments have revealed close connections between causal attention mechanisms, recurrence, and state–space structure. Katharopoulos et al. [katharopoulos2020transformers] showed that kernelized causal self-attention admits an exact recurrent formulation with constant memory, demonstrating that autoregressive Transformers with linear attention are, in a precise sense, recurrent neural networks. Dao et al. [dao2024ssd] generalized this insight through their structured state-space duality framework, proving that broad classes of causal attention mechanisms are equivalent to structured state–space models and that any kernelized attention admitting efficient recurrence must correspond to a state-space realization. Complementarily, Sieber et al. [sieber2024dsf] introduced what they called the "Dynamical Systems Framework", which reformulates attention mechanisms, state–space models, and recurrent neural networks within a unified recursive representation, allowing principled comparisons in terms of stability, expressivity, and state expansion.

These developments motivate interpreting Transformers used for dynamical systems modeling not merely as generic sequence-to-sequence regressors, but as data-adaptive state–space models. Within this view, attention constructs and updates an implicit state from past observations, while subsequent feed-forward components approximate the local flow map governing state evolution. This architecture also suggests a structural analogy to a form of numerical error analysis termed "Backward Error Analysis". The "one-token-ahead" prediction of a Transformer naturally corresponds to the "one-timestep-ahead" prediction of a numerical initial value solver with a fixed time step [geneva2022transformers]. Broadly, this implies that the Transformer constructs what Backward Error Analysis terms the Inverse Modified Differential Equation (IMDE). The IMDE is a transformed differential equation whose numerical solution exactly matches the discretely observed data of the true trajectory [Zhu2023ImplementationA]. By training on discrete tokens, the Transformer may learn to approximate this modified dynamical system, effectively identifying a slightly perturbed governing law; when "numerically solved" by the network’s forward pass, this perturbation compensates for the discretization artifacts inherent in the data. From this perspective, the Transformer may learn a discrete-time map mathematically equivalent to a numerical integrator for a modified physical system. For a sufficiently accurate initial-value solver, this slightly perturbed system constitutes a good approximation of the underlying continuous flow.

To explicitly decode these learned algorithmic structures, we draw on mechanistic interpretability, an emerging field originally developed for large language models (LLMs), which aims to reverse-engineer neural networks into discrete algorithms [rico1992discrete] and resolve challenges such as superposition [elhage2022toy, elhage2021mathematical]. From this perspective, Transformers are viewed not as black boxes, but as collections of mechanistic circuits whose internal computations can be analyzed and related to classical modeling principles. In contrast with typical mechanistic interpretability studies of LLMs, we apply this lens to dynamical systems: we ask what a single-layer Transformer captures about linear versus nonlinear dynamics, and whether its ability to “unfold” an attractor (as shown in Section 4) relies on attention behaving as a specific delay-embedding or state-aggregation mechanism. This bottom-up analysis complements recent top-down investigations of physics foundation models, where large models trained on physical simulations have been shown to learn linearly steerable representations of concepts like vorticity and diffusion [fear2025physics]. We believe that understanding the mechanistic role of attention in minimal architectures could serve as a step toward bridging these concepts.

2 Theory and Methods

2.1 Problem formulation

We consider autonomous continuous-time dynamical systems of the form

$$\frac{d\mathbf{u}}{dt}=f(\mathbf{u}(t)),\quad\mathbf{u}(t)\in\mathbb{R}^{d}, \tag{1}$$

where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ defines the system dynamics. Starting from linear systems, we then focus on canonical nonlinear systems exhibiting a range of properties, including multiple fixed points, limit cycles, bifurcations, and parameter-dependent behavior. Our representative examples include the Van der Pol oscillator, reaction-diffusion PDEs, and the Navier-Stokes equations.

The system is observed at discrete time instants with uniform sampling interval $\Delta t$, yielding the sampled trajectory

$$\mathcal{D}=\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n}\}, \tag{2}$$

where $\mathbf{u}_{t}=\mathbf{u}(t\cdot\Delta t)$ denotes the state at time step $t$. This discrete-time representation induces a flow map $\mathbf{u}_{t+1}=\Phi_{\Delta t}(\mathbf{u}_{t})$, which is the object implicitly approximated by step-ahead prediction models (whether nonlinear IVP solvers or Transformers).

In many practical settings, the full system state $\mathbf{u}(t)$ is not directly accessible. Instead, measurements $\mathbf{u}^{o}_{t}$ are obtained through an observation function $g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p}$ with $p<d$,

$$\mathbf{u}^{o}_{t}=g(\mathbf{u}_{t})=\mathbf{H}\mathbf{u}_{t},\quad\mathbf{u}^{o}_{t}\in\mathbb{R}^{p}, \tag{3}$$

where $\mathbf{H}\in\mathbb{R}^{p\times d}$ denotes a (here, linear) observation operator. For example, in the Van der Pol oscillator, one may observe only the position variable while the velocity remains unmeasured.
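As a minimal illustration of Eqs. (1)-(3), the following sketch samples a Van der Pol trajectory and keeps only the position. The parameter `mu`, the step `dt`, the initial state, and the use of a hand-written fourth-order Runge-Kutta step as the flow map are assumptions made for this example and are not taken from the experiments reported here.

```python
def van_der_pol(u, mu=1.0):
    """Right-hand side f(u) of the Van der Pol oscillator, with u = (x, v)."""
    x, v = u
    return (v, mu * (1.0 - x * x) * v - x)


def rk4_step(f, u, dt):
    """One classical Runge-Kutta step, used here as a stand-in for the flow map."""
    k1 = f(u)
    k2 = f(tuple(ui + 0.5 * dt * ki for ui, ki in zip(u, k1)))
    k3 = f(tuple(ui + 0.5 * dt * ki for ui, ki in zip(u, k2)))
    k4 = f(tuple(ui + dt * ki for ui, ki in zip(u, k3)))
    return tuple(
        ui + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for ui, a, b, c, d in zip(u, k1, k2, k3, k4)
    )


def sample_trajectory(u0, dt, n):
    """Sampled trajectory D = {u_1, ..., u_n} with uniform spacing dt."""
    trajectory = [tuple(u0)]
    for _ in range(n - 1):
        trajectory.append(rk4_step(van_der_pol, trajectory[-1], dt))
    return trajectory


def observe(u, H=((1.0, 0.0),)):
    """Linear observation u_obs = H u; the default H keeps the position only."""
    return tuple(sum(h * ui for h, ui in zip(row, u)) for row in H)


# Hypothetical settings: start near the limit cycle, 200 samples.
trajectory = sample_trajectory((2.0, 0.0), dt=0.05, n=200)
observations = [observe(u) for u in trajectory]
```

Here `observations` is the scalar sequence a model would see in the partially observed setting, while `trajectory` retains the full state for reference.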

This partial observability setting is central to nonlinear time-series analysis and motivates classical state-reconstruction approaches based on delay-coordinate embeddings. Takens’ embedding theorem guarantees that, under suitable conditions, the attractor of the underlying dynamical system can be reconstructed from time-delayed measurements of partial observations [takens1981detecting]. A central question in this work is whether attention-based models implicitly perform an analogous reconstruction when trained on sequences of partial observations.

Research questions

Our mechanistic study investigates connections between Transformer operations and classical dynamical systems formulations, such as autoregressive modeling frameworks and delay-coordinate embeddings (Takens’ theorem), for linear and nonlinear dynamical systems.

By training single-layer Transformers to predict the evolution of dynamical systems, we investigate:

A) For linear systems (Section 3), where the attention mechanism approximates the dynamics directly, we ask: (A1) what does attention learn and how does this connect to classical linear system theory; (A2) what classes of dynamics (e.g., monotonic, resonant, or underdamped) can attention represent, and for which does it fail and why; and (A3) when does attention meaningfully capture multi-modal interactions, and how does this connect to delay-coordinate embeddings?

B) For nonlinear systems (Section 4), where attention no longer approximates the dynamics itself but instead serves as a state reconstruction operator upon which a nonlinear map is learned, we ask: (B1) under what observation regimes—full-state versus partial—does a Transformer provide computational benefits; (B2) can attention identify representations analogous to delay-coordinate embeddings, and how does this relate to classical results such as Takens’ embedding theorem; and (B3) when such delay-based representations are formed, under what conditions do they suffice to (a) unfold nonlinear attractors while preserving their effective dimensionality; (b) capture the dominant modes of the underlying dynamics; and (c) provide meaningful organization of trajectories across different system parameters?

By restricting our analysis to single-layer architectures and addressing these questions, we aim to develop interpretable insights into the fundamental mechanisms through which Transformers process temporal dynamics. The findings provide a foundation for understanding deeper architectures and we hope they can serve as a guide for designing attention-based models tailored to dynamical systems.

2.2 Single-layer Transformer as a Discrete-Time Operator

To analyze how attention-based models process dynamical information, we restrict our experiments to a canonical single-head, single-layer self-attention (decoder only) Transformer architecture. Given an input sequence of $n$ tokens of input dimension $d_{i}$, $X=[\mathbf{u}_{1},\ldots,\mathbf{u}_{n}]^{\top}\in\mathbb{R}^{n\times d_{i}}$, representing system observations over a finite time window, the model first applies a self-attention operation,

$$Z=\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V, \tag{4}$$

where $Q=XW_{Q}$, $K=XW_{K}$, and $V=XW_{V}$ are learned linear projections. The resulting representation $Z\in\mathbb{R}^{n\times d_{v}}$ is then processed by a token-wise output layer, which takes one of two forms depending on the analysis. For nonlinear dynamics, we apply the position-wise feed-forward network typical of Transformers to the final token $\mathbf{z}_{n}$,

$$\hat{\mathbf{u}}_{n+1}=\mathrm{MLP}(\mathbf{z}_{n})=\sigma(\mathbf{z}_{n}W_{1}+b_{1})W_{2}+b_{2}, \tag{5}$$

where $\sigma(\cdot)$ denotes the activation function, $W_{1}$ and $W_{2}$ are learnable weight matrices, and $b_{1}$ and $b_{2}$ are the learnable bias terms. For linear dynamics, we instead use a linear output projection,

$$\hat{\mathbf{u}}_{n+1}=\mathbf{z}_{n}W_{O}. \tag{6}$$

The MLP serves as a universal approximator for nonlinear state transition functions, while the linear projection isolates the representational capacity of attention alone.
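The attention-only variant of Eqs. (4) and (6) can be sketched in a few lines of plain Python. The causal mask reflects the decoder-only setting described above; the helper names and the toy weights (identity matrices) are hypothetical choices for illustration, not the trained parameters of this study.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def causal_softmax(scores):
    """Row-wise softmax in which token t attends only to tokens 1..t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        shift = max(visible)  # subtract the maximum for numerical stability
        exps = [math.exp(s - shift) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights


def attention_forward(X, W_Q, W_K, W_V, W_O):
    """Single-head causal attention followed by the linear output of Eq. (6)."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    alpha = causal_softmax(scores)
    Z = matmul(alpha, V)
    u_next = matmul([Z[-1]], W_O)[0]  # prediction read from the final token
    return u_next, alpha


# Toy example: three tokens of dimension two, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
u_next, alpha = attention_forward(X, identity, identity, identity, identity)
```

With identity value and output projections, `u_next` is a convex combination of the input tokens, which makes the role of the softmax weights directly visible.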

By analyzing this minimal Transformer layer, we can isolate the specific dynamical roles of these components: (a) the attention mechanism functions as a temporal aggregator that navigates the trajectory history, and (b) the MLP serves as a universal approximator for the local state transition function. In the step-ahead prediction setting considered here, the Transformer defines an operator that maps an implicitly reconstructed state, formed from delayed observations, to the next observation, rather than acting on the instantaneous state itself. From a dynamical systems perspective, this can be interpreted as a learned discrete-time evolution operator acting on an implicitly constructed state representation.

Figure 1: Schematic of the single-layer single-head self-attention Transformer architecture with optional linear output studied in this work. Adapted from [raschka2023understanding].

2.3 Classical dynamical system formulations and recursive representations

Classical data-driven modeling of dynamical systems is grounded in recursive representations, most prominently autoregressive (AR) and state–space formulations. In linear autoregressive models, the current observation is expressed as a finite-memory recursion over past outputs,

$$\mathbf{u}_{t}=\sum_{k=1}^{p}\mathbf{B}_{k}\mathbf{u}_{t-k}+\boldsymbol{\varepsilon}_{t}, \tag{7}$$

where $\mathbf{B}_{k}$ are regression operators and $\boldsymbol{\varepsilon}_{t}$ denotes process noise. Such models admit well-characterized notions of stability, identifiability, and spectral structure, and have been widely used in system identification and structural dynamics [box1976analysis, Kantz_Schreiber_2003].

A more expressive and principled formulation is obtained through state–space models, which introduce an explicit latent state $\boldsymbol{\zeta}_{t}$ evolving recursively as

$$\boldsymbol{\zeta}_{t+1}=\mathbf{A}\boldsymbol{\zeta}_{t}+\mathbf{B}\mathbf{u}_{t}+\mathbf{w}_{t}, \tag{8}$$
$$\boldsymbol{\chi}_{t}=\mathbf{C}\boldsymbol{\zeta}_{t}+\mathbf{v}_{t}, \tag{9}$$

with $\boldsymbol{\chi}_{t}$ the output at time $t$, and $\mathbf{w}_{t}$ and $\mathbf{v}_{t}$ denoting process and observation noise, respectively. Here, the latent state is typically endowed with physical meaning (e.g. displacements, velocities, modal coordinates, or internal variables), and recursive filtering schemes, such as the Kalman filter, provide optimal state estimates under linear–Gaussian assumptions. This paradigm embodies the classical principle of latent-state reconstruction, where memory and dynamics are encoded through a compact, physically interpretable state.

From a unifying viewpoint, autoregressive and state–space models are both instances of recursive dynamical systems. Classical autoregressive models admit equivalent state–space realizations by defining the latent state as a stack of delayed outputs (and inputs, when present), a construction standard in system identification and control [ljung1999systemid].
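The stacking construction mentioned above can be made concrete for a scalar, noise-free AR(p) model. In this sketch the state is the vector of the p most recent outputs and the transition matrix is the standard companion form; the function names and the example coefficients are hypothetical.

```python
def companion_matrix(coeffs):
    """Companion matrix for x_t = sum_k coeffs[k-1] * x_{t-k}.

    The state is (x_t, x_{t-1}, ..., x_{t-p+1}), most recent value first.
    """
    p = len(coeffs)
    top = [float(c) for c in coeffs]
    shift = [[1.0 if j == i else 0.0 for j in range(p)] for i in range(p - 1)]
    return [top] + shift


def state_step(A, state):
    """Advance the stacked state by one step: state <- A @ state."""
    return [sum(a * s for a, s in zip(row, state)) for row in A]


def ar_direct(coeffs, history, n_steps):
    """Roll out the AR recursion directly; history is most recent first."""
    past = list(history)
    outputs = []
    for _ in range(n_steps):
        x_new = sum(c * x for c, x in zip(coeffs, past))
        outputs.append(x_new)
        past = [x_new] + past[:-1]
    return outputs


def ar_state_space(coeffs, history, n_steps):
    """Roll out the same model through its state-space realization."""
    A = companion_matrix(coeffs)
    state = list(history)
    outputs = []
    for _ in range(n_steps):
        state = state_step(A, state)
        outputs.append(state[0])  # output map C = [1, 0, ..., 0]
    return outputs


coeffs = [0.5, -0.3]   # hypothetical AR(2) coefficients
history = [1.0, 0.5]   # (x_t, x_{t-1})
direct = ar_direct(coeffs, history, 10)
via_state = ar_state_space(coeffs, history, 10)
```

Both rollouts produce the same sequence, which is the sense in which the AR model and its stacked-delay state-space realization are equivalent.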

Recently, Sieber et al. [sieber2024dsf] introduced the Dynamical Systems Framework (DSF), which represents attention mechanisms, state–space models, and recurrent neural networks within a common linear time-varying recurrence. Within this framework, masked self-attention admits an exact recursive realization where the effective transition operators are determined by the query–key interactions and normalization terms. A key distinction is that the recursive state induced by attention is forced to be a deterministic function of past input tokens, indexed by token position rather than by intrinsic dynamical variables.

Attention as a data-adaptive autoregressive model.

For our purposes, the essential insight is that a single attention head induces a finite-memory linear recursion interpretable as a data-driven AR model. Consider the final row of the attention matrix $\boldsymbol{\alpha}=[\alpha_{n,1},\ldots,\alpha_{n,n}]^{\top}$ with $\alpha_{n,i}>0$ and $\sum_{i}\alpha_{n,i}=1$. Incorporating the output and value projection matrices, the predicted next state becomes

$$\hat{\mathbf{u}}_{n+1}=\sum_{i=1}^{n}\alpha_{n,i}W_{O}^{\top}W_{V}^{\top}\mathbf{u}_{i}=\sum_{i=1}^{n}B_{i}\mathbf{u}_{i}, \tag{10}$$

where $B_{i}=\alpha_{n,i}M$ and $M=W_{O}^{\top}W_{V}^{\top}$. This mirrors an AR model whose coefficients are determined by the attention weights. However, due to the softmax normalization in the attention computation, all $\alpha_{n,i}$ are non-negative (in fact strictly positive) and sum to unity, implying that the $B_{i}$ are non-negative scalar multiples of the same matrix $M$.

This non-negativity constraint introduces a fundamental representational limitation. Unlike conventional AR models, which can employ both positive and negative coefficients to represent oscillatory or phase-inverted dynamics, the attention mechanism cannot directly encode subtractive interactions. Consequently, systems requiring mixed-sign autoregressive dependencies—such as lightly damped oscillators—may not be faithfully represented by only using standard attention layers. This theoretical insight provides the basis for our empirical analyses in Section 3, where we demonstrate both successful and failed cases depending on the sign structure of the underlying linear dynamics.
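The sign restriction can be checked numerically in the scalar case. The sketch below treats the softmax weights as free parameters for a fixed window, so it only tests a necessary condition: a target coefficient vector can be written as $B_i=\alpha_i M$ with $\alpha_i>0$ only if all its entries share one sign. The helper names and the mixed-sign example `[1.6, -0.98]` are hypothetical; the same-sign pair is the one reported for Case 1 in Section 3.1.

```python
import math


def softmax(scores):
    """Numerically stable softmax; outputs are positive and sum to one."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def induced_ar_coefficients(scores, M):
    """Scalar version of Eq. (10): B_i = alpha_i * M."""
    return [a * M for a in softmax(scores)]


def is_sign_compatible(target):
    """True if all target coefficients are strictly of one sign."""
    return all(c > 0 for c in target) or all(c < 0 for c in target)


def matching_weights(target):
    """Return (M, alpha) with target_i = alpha_i * M, or None if impossible."""
    if not is_sign_compatible(target):
        return None
    M = sum(target)  # follows from sum(alpha) = 1
    return M, [c / M for c in target]


# Any choice of scores yields coefficients with the sign of M.
coeffs = induced_ar_coefficients([0.3, -1.2, 2.0], M=-1.5)

same_sign = matching_weights([-0.4352, -0.9802])   # representable
mixed_sign = matching_weights([1.6, -0.98])        # not representable
```

Because the actual attention weights depend on the input window, passing this check does not guarantee that a trained model attains the target coefficients; failing it, however, rules out an exact match.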

Nonlinear systems and delay-coordinate embeddings.

For nonlinear systems, state reconstruction from partial observations is classically addressed through delay-coordinate embeddings. Takens’ embedding theorem [takens1981detecting] establishes that for a smooth, deterministic, generally nonlinear dynamical system evolving on a compact attractor of dimension $d$, and for a generic observation function, the delay map

$$\Phi(\mathbf{u}(t))=\big[g(\mathbf{u}(t)),g(\mathbf{u}(t-\tau)),\ldots,g(\mathbf{u}(t-(n-1)\tau))\big] \tag{11}$$

constitutes an embedding of the attractor provided that $n\geq 2d+1$. This result formalizes the conditions under which the latent state of a nonlinear system can be reconstructed from time-delayed measurements alone, even under partial observability.
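For a sampled scalar observable, the delay map of Eq. (11) reduces to stacking lagged values. The following sketch builds these delay vectors; the function name, the sine test signal, and the choices `n=3` and `tau=5` (in units of samples) are illustrative assumptions.

```python
import math


def delay_embed(series, n, tau=1):
    """Delay vectors [g_t, g_{t-tau}, ..., g_{t-(n-1)tau}] for every valid t."""
    start = (n - 1) * tau  # first index with a complete history
    return [
        [series[t - k * tau] for k in range(n)]
        for t in range(start, len(series))
    ]


# Hypothetical scalar observations of a periodic signal.
series = [math.sin(0.1 * t) for t in range(100)]
embedded = delay_embed(series, n=3, tau=5)
```

Each row of `embedded` is one reconstructed state; a fixed, uniformly spaced window like this is the classical counterpart of the data-adaptive aggregation performed by attention.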

From this perspective, attention mechanisms can be interpreted as constructing adaptive, data-driven delay embeddings by aggregating information from a finite history of past observations. Crucially, while the attention operation preserves linearity, the Transformer’s feed-forward network introduces the capacity to approximate nonlinear vector fields. This decoupling of linear history aggregation from nonlinear state evolution provides the necessary bridge to the nonlinear analyses presented in Section 4.

2.4 Scope, analysis and organization

The formulations above establish two complementary viewpoints on data-driven dynamical modeling. Classical autoregressive, state–space, and delay-embedding approaches rely on either explicit latent states or geometrically justified state reconstructions, with well-defined notions of memory, minimality, and observability. By contrast, attention-based Transformers trained for step-ahead prediction induce a forced recursive representation in which the effective state is constructed deterministically from a finite history of observations through data-adaptive aggregation.

The analysis that follows examines how this architectural distinction manifests in practice. We study the internal computations of a single-layer Transformer trained on representative dynamical systems and investigate how attention weights relate to phase-space geometry, how implicit state representations emerge under partial observability, and how the learned dynamics compare to classical autoregressive and delay-based models. Particular emphasis is placed on identifying structural constraints imposed by finite context length, token-based state construction, and attention normalization, and on understanding how these constraints affect the model’s ability to represent periodic, quasi-periodic, and chaotic behavior.

The subsequent sections are organized as follows. We first analyze linear and weakly nonlinear systems to characterize the effective autoregressive structure induced by attention and its spectral properties. We then consider nonlinear systems with partial observations, examining whether attention mechanisms recover embeddings consistent with classical delay-coordinate constructions. Finally, we investigate regimes in which the single-layer architecture fails to capture essential dynamical features, thereby delineating fundamental limitations that persist independently of training data or optimization.

This organization allows us to connect mechanistic observations of Transformer behavior directly to established concepts in dynamical systems theory, and to assess the extent to which attention-based models can be interpreted as data-adaptive realizations of classical recursive dynamical representations.

3 Linear Dynamical Systems

We begin our analysis with linear dynamical systems, the simplest yet foundational class where analytical insights are most tractable. By isolating the attention mechanism and excluding the feed-forward network, the Transformer reduces to a linear, time-varying recursive operator that admits a direct interpretation as introduced in Section 2.3. This allows us to establish a baseline understanding and derive closed-form expressions for the learned representations. Linear systems serve as an ideal starting point because their mathematical structure is well understood and admits explicit classical representations, thus allowing us to precisely characterize what the attention mechanism computes and how it relates to classical linear system theory.

3.1 Single-DOF Structural System

Dynamical System

To connect the abstract discussion of attention-as-autoregression in Section 2.3 with a physically interpretable example, we now consider a single-degree-of-freedom (SDOF) linear oscillator, whose discrete-time dynamics admit an exact low-order autoregressive representation. This setting allows us to explicitly compare the coefficients of the physical AR model with the effective coefficients induced by a single attention head, and to interpret the results directly. By focusing on an attention-only architecture, we isolate the linear operator induced by self-attention and assess when it can, and cannot, reproduce the signed recursive structure of the underlying dynamics. The governing equation of motion of the considered second-order SDOF system is given by:

$$m\ddot{x}(t)+c\dot{x}(t)+kx(t)=f(t), \tag{12}$$

where $m$, $c$, and $k$ denote the mass, damping coefficient, and stiffness, respectively. Only the displacement response $x$ is assumed to be available. In the present study, we fix $m=1~\mathrm{kg}$ and $c=0.5~\mathrm{Ns/m}$, corresponding to a lightly damped (underdamped) regime. For an initial stiffness of $k=2000~\mathrm{N/m}$, the natural frequency is obtained as:

$$f_{n}=\frac{1}{2\pi}\sqrt{\frac{k}{m}}\approx 7.12~\text{Hz}. \tag{13}$$

A sampling frequency of $25~\text{Hz}$ ($\Delta t=0.04~\text{s}$) places the Nyquist frequency ($12.5~\text{Hz}$) above the natural frequency, so the Nyquist criterion is satisfied.

A free vibration case is considered in this study, $f(t)=0$, where the system is initialized with a displacement of $x_{0}=10~\mathrm{mm}$ and zero velocity, allowing the response to evolve solely under its internal dynamics. The governing second-order differential equation is numerically integrated using a high-accuracy ODE solver. The displacement response $x(t)$ is recorded and discretized to form sequential time series data used for model training and evaluation. The predictive task is one-step-ahead forecasting, that is, predicting the displacement $x_{t+1}$ from a short history $\{x_{t},x_{t-1},\ldots,x_{t-n+1}\}$. This formulation directly parallels the autoregressive structure discussed previously and allows direct comparison between the physical AR(2) system and its attention-based approximation.
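The quantities in this setup can be reproduced with a short calculation. The sketch below assumes exact sampling of the underdamped free response, for which $x_{t+1}=2e^{-\zeta\omega_n\Delta t}\cos(\omega_d\Delta t)\,x_t-e^{-2\zeta\omega_n\Delta t}\,x_{t-1}$, and uses the closed-form solution in place of the numerical ODE solver described above. Coefficients obtained this way need not agree to all digits with values identified by other means.

```python
import math

# Parameters stated in the text.
m, c, k = 1.0, 0.5, 2000.0   # kg, Ns/m, N/m
dt = 0.04                    # s, i.e. 25 Hz sampling

omega_n = math.sqrt(k / m)                      # undamped natural frequency (rad/s)
zeta = c / (2.0 * math.sqrt(k * m))             # damping ratio
omega_d = omega_n * math.sqrt(1.0 - zeta ** 2)  # damped frequency (rad/s)
f_n = omega_n / (2.0 * math.pi)                 # natural frequency (Hz)
nyquist = 0.5 / dt                              # Nyquist frequency (Hz)

# AR(2) coefficients of the exactly sampled free response.
decay = math.exp(-zeta * omega_n * dt)
c1 = 2.0 * decay * math.cos(omega_d * dt)
c2 = -decay ** 2


def free_response(t, x0=10.0):
    """Closed-form displacement (mm) for x(0) = x0 and zero initial velocity."""
    ratio = zeta * omega_n / omega_d
    envelope = x0 * math.exp(-zeta * omega_n * t)
    return envelope * (math.cos(omega_d * t) + ratio * math.sin(omega_d * t))


# Check that the sampled response satisfies the AR(2) recursion.
x = [free_response(i * dt) for i in range(50)]
residual = max(abs(x[i + 1] - (c1 * x[i] + c2 * x[i - 1])) for i in range(1, 49))
```

For these parameters both `c1` and `c2` come out negative, and `residual` is at the level of floating-point round-off, confirming that the sampled displacement obeys a two-term recursion.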

Transformer Setup

A minimal attention-only Transformer is employed to assess whether self-attention can recover the inherent oscillatory structure of the SDOF system. The model operates on scalar input sequences xix_{i} with learned positional encodings pip_{i}. Because no embedding or projection layers are used, the attention transformation can be expressed analytically.

For a two-step input history, the attention scores are computed as:

Aij=qk(x~ix~j),where x~i=xi+pi.A_{ij}=qk\,(\tilde{x}_{i}\cdot\tilde{x}_{j}),\quad\text{where }\tilde{x}_{i}=x_{i}+p_{i}.

After softmax normalization, the predicted output takes the form:

$$\hat{x}=v\left[\alpha_{2,1}(x_{1}+p_{1})+\alpha_{2,2}(x_{2}+p_{2})\right],$$

which can be rearranged as:

$$\hat{x}=\beta_{1}x_{1}+\beta_{2}x_{2}+C,$$

with $\beta_{i}=\alpha_{2,i}v$, $\gamma_{i}=\alpha_{2,i}v$, and $C=\gamma_{1}p_{1}+\gamma_{2}p_{2}$. This formulation mirrors an AR(2) process $x_{t+1}=c_{1}x_{t}+c_{2}x_{t-1}$, but with the key restriction that attention weights $\alpha_{2,i}$ are positive and normalized, enforcing convexity in the combination of past inputs.
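The two-token expression above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and all numerical values (history, positional encodings, `qk`, `v`) are hypothetical, chosen only to show that the effective coefficients $\beta_i$ inherit the sign of $v$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_predict(x, p, qk, v):
    """One-step prediction from the last token of a scalar attention head.

    x  : observed history [x_1, x_2]
    p  : learned positional encodings [p_1, p_2]
    qk : product of the scalar query and key weights
    v  : scalar value weight
    """
    x_tilde = [xi + pi for xi, pi in zip(x, p)]
    scores = [qk * x_tilde[-1] * xj for xj in x_tilde]   # row A_{2,j}
    alpha = softmax(scores)
    beta = [a * v for a in alpha]                        # effective AR coefficients
    offset = sum(b * pi for b, pi in zip(beta, p))       # constant C
    x_hat = sum(b * xi for b, xi in zip(beta, x)) + offset
    return x_hat, alpha, beta

# Hypothetical values, for illustration only.
x_hat, alpha, beta = attention_predict(x=[0.8, -0.3], p=[0.1, 0.2], qk=0.5, v=-1.4)
```

Whatever the inputs, `alpha` is strictly positive and sums to one, so both entries of `beta` carry the sign of `v`.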

Results and Observations

Case 1: $k=2000~\mathrm{N/m}$

For the first configuration, the discrete-time AR(2) coefficients derived from the physical SDOF model are $c_{1}=-0.4352$ and $c_{2}=-0.9802$, both negative. Since the coefficients share the same sign, the attention-based model can emulate the dynamics by adopting positive attention weights and a negative scalar value weight ($v<0$). Under these conditions, the transformer successfully captures the oscillatory characteristics of the system.

As shown in Figure 2(a), the predicted displacement sequence accurately reproduces the oscillatory motion in the time domain. The corresponding frequency-domain representation exhibits a clear peak near the analytical natural frequency ($7.12~\text{Hz}$), while the time–frequency spectrogram shows a coherent modal decay ridge. Together, these results confirm that the learned dynamics capture both the correct oscillatory behavior and damping characteristics of the system.

The spectrum is computed by interpreting the attention-induced linear recurrence as an AR model and evaluating its frequency response. For a linear oscillatory system, an optimally fitted AR model exhibits spectral peaks at the system’s natural frequencies. The correspondence confirms that, when the target dynamics can be expressed as a convex weighted combination of past states, the attention operator can emulate the underlying physical process with high fidelity.
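The spectral evaluation described above can be illustrated with a short sketch that treats the recurrence $x_{t+1}=c_1 x_t + c_2 x_{t-1}$ as an AR(2) filter and locates the maximum of its magnitude response. The grid search and the function names are assumptions made for illustration. The coefficients are the Case 1 values quoted in the text.

```python
import cmath
import math

def ar2_gain(c1, c2, f, dt):
    """Magnitude response of x[t+1] = c1*x[t] + c2*x[t-1] at frequency f (Hz)."""
    z_inv = cmath.exp(-2j * math.pi * f * dt)
    return 1.0 / abs(1.0 - c1 * z_inv - c2 * z_inv ** 2)

def spectral_peak(c1, c2, dt, df=0.01):
    """Grid search for the frequency that maximizes the AR(2) gain."""
    nyquist = 0.5 / dt
    n = int(nyquist / df)
    freqs = [i * df for i in range(1, n)]
    return max(freqs, key=lambda f: ar2_gain(c1, c2, f, dt))

dt = 0.04                                      # 25 Hz sampling
f_peak = spectral_peak(-0.4352, -0.9802, dt)   # Case 1 coefficients from the text
print(f"spectral peak near {f_peak:.2f} Hz")
```

With these coefficients the peak falls close to the analytical natural frequency, consistent with the correspondence reported for Case 1.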

Figure 2: Transformer performance on linear SDOF systems. (a) Successful reproduction of oscillatory dynamics for $k=2000~\mathrm{N/m}$ with same-sign AR coefficients: the transformer accurately captures amplitude and phase, and preserves both the dominant modal peak in the frequency domain and the coherent modal decay ridge in the time–frequency representation. (b) Failure case for $k=500~\mathrm{N/m}$ with mixed-sign AR coefficients: the attention-only model produces over-smoothed responses and fails to recover the resonance peak and coherent modal ridge.

Case 2: $k=500~\mathrm{N/m}$

Reducing the stiffness to $k=500~\mathrm{N/m}$ lowers the natural frequency to approximately $3.56~\text{Hz}$ and alters the discrete-time AR(2) coefficients to mixed signs, requiring $c_{1}>0$ and $c_{2}<0$. To reproduce this structure, the attention model would need to satisfy:

$$\alpha_{2,1}v=c_{1}>0,\quad\alpha_{2,2}v=c_{2}<0.$$

However, since $\alpha_{2,i}$ are strictly positive due to softmax normalization, no real-valued $v$ can simultaneously satisfy these conditions. The model is thus incapable of representing the required subtractive relationship between consecutive time steps that gives rise to oscillatory or resonant behavior.
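The sign argument can be made concrete with a small feasibility check. Because the two attention weights sum to one, the only candidate value weight is $v=c_1+c_2$, and the implied weights $c_i/v$ are positive only when the coefficients share a sign. The sketch below ignores the positional offset $C$; the function name and the mixed-sign pair are hypothetical, while the same-sign pair is the Case 1 values from the text.

```python
def convex_factorization(c1, c2):
    """Find v and softmax-feasible weights (a1, a2) with a_i * v = c_i.

    Since a1 + a2 = 1, the only candidate is v = c1 + c2. The weights
    a_i = c_i / v are strictly positive only when c1 and c2 share a sign.
    Returns (v, a1, a2), or None when no such factorization exists.
    """
    v = c1 + c2
    if v == 0.0:
        return None
    a1, a2 = c1 / v, c2 / v
    if a1 <= 0.0 or a2 <= 0.0:
        return None
    return v, a1, a2

case1 = convex_factorization(-0.4352, -0.9802)   # same-sign coefficients (Case 1)
case2 = convex_factorization(1.2, -0.98)         # illustrative mixed-sign pair
```

Here `case1` yields a negative `v` with two positive weights, whereas `case2` is `None` because one of the implied weights would have to be negative.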

As shown in Figure 2(b), the time-domain predictions fail to reproduce sustained oscillatory behavior. The corresponding frequency-domain representation lacks a distinct resonance peak at the physical natural frequency, and the time–frequency spectrogram shows no coherent modal ridge. These observations indicate that the transformer is unable to capture the correct dynamic signature under this setting.

This experiment highlights a fundamental limitation of the attention mechanism as a linear dynamical operator: due to the non-negativity constraint imposed by the softmax function, attention cannot reproduce the signed coefficient patterns essential for modeling oscillatory systems with phase-alternating behavior. Consequently, the attention operator behaves as a low-pass smoothing filter, capable of representing monotonic or overdamped responses but unable to reproduce resonant or underdamped dynamics that demand alternating sign relationships among autoregressive terms.

3.2 Extension to 2-DOF Systems

Dynamical System

To examine the scalability of attention-based architectures beyond single-mode dynamics, we next consider a two-degree-of-freedom (2DOF) linear structural system with coupled masses. This system introduces modal interaction and a higher-dimensional state-space, offering a more stringent test for the transformer’s ability to capture multi-modal dependencies. The 2DOF system provides a minimal setting in which the effective state dimension must increase to encode multiple interacting modes. Unlike the single-DOF case, the latent dynamics now require the representation of at least a second-order linear recurrence per mode, together with cross-coupling terms. This makes the role of temporal context and observation richness explicit: the attention-induced state must either grow in dimension or compensate through longer memory in order to remain expressive.

The governing equation in this 2DOF system, comprising a coupled system of two second-order ODEs, is expressed as:

$$\mathbf{M}\ddot{\mathbf{x}}(t)+\mathbf{C}\dot{\mathbf{x}}(t)+\mathbf{K}\mathbf{x}(t)=\mathbf{0}, \tag{14}$$

where $\mathbf{x}(t)=[x_{1}(t),x_{2}(t)]^{\top}$ denotes the displacement vector, and the system matrices are defined as:

$$\mathbf{M}=\begin{bmatrix}m_{1}&0\\ 0&m_{2}\end{bmatrix},\quad\mathbf{C}=\begin{bmatrix}c_{1}+c_{2}&-c_{2}\\ -c_{2}&c_{2}\end{bmatrix},\quad\mathbf{K}=\begin{bmatrix}k_{1}+k_{2}&-k_{2}\\ -k_{2}&k_{2}\end{bmatrix}.$$

Although the system is second order in time, it can equivalently be written in first-order state-space form with a four-dimensional state vector, comprising the displacements and velocities. Consequently, four state variables are required to fully characterize and numerically integrate the system dynamics, while only the displacement responses are directly observed in the present study. We adopt the parameters $m_{1}=m_{2}=1.0$ kg, $c_{1}=c_{2}=0.5$ Ns/m, $k_{1}=1000$ N/m, and $k_{2}=1500$ N/m. Solving the undamped eigenvalue problem $\mathbf{K}\boldsymbol{\phi}=\omega^{2}\mathbf{M}\boldsymbol{\phi}$ yields two natural frequencies:

$$f_{1}\approx 4.1~\text{Hz},\quad f_{2}\approx 9.5~\text{Hz}.$$

Free vibration is simulated by initializing the first mass with a displacement perturbation while keeping all other states at rest:

$$\mathbf{x}(0)=\begin{bmatrix}10.0\\ 0.0\end{bmatrix},\quad\dot{\mathbf{x}}(0)=\begin{bmatrix}0.0\\ 0.0\end{bmatrix}.$$

This excitation activates both modes and allows observation of their natural decay through damping. The "ground truth" system response is obtained by numerical integration using a Runge–Kutta solver (RK45) with relative and absolute tolerances set to $10^{-10}$ and $10^{-12}$, respectively, and displacement histories are recorded for subsequent model training and evaluation.
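The first-order state-space form and the free-vibration simulation can be sketched as follows. This is a simplified stand-in under stated assumptions: it uses a fixed-step classical RK4 scheme with fine substeps rather than the adaptive RK45 solver used in the study, and the 25 Hz recording rate is carried over from the SDOF example as an assumption. All function names are illustrative.

```python
M = [1.0, 1.0]                                  # diagonal masses m1, m2 (kg)
C = [[1.0, -0.5], [-0.5, 0.5]]                  # damping matrix (Ns/m)
K = [[2500.0, -1500.0], [-1500.0, 1500.0]]      # stiffness matrix (N/m)

def rhs(state):
    """Time derivative of the state [x1, x2, v1, v2] for M x'' + C x' + K x = 0."""
    x, v = state[:2], state[2:]
    acc = [-(C[i][0] * v[0] + C[i][1] * v[1]
             + K[i][0] * x[0] + K[i][1] * x[1]) / M[i] for i in range(2)]
    return [v[0], v[1], acc[0], acc[1]]

def rk4_step(state, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(slope, scale):
        return [s + scale * k for s, k in zip(state, slope)]
    k1 = rhs(state)
    k2 = rhs(shifted(k1, h / 2))
    k3 = rhs(shifted(k2, h / 2))
    k4 = rhs(shifted(k3, h))
    return [s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def simulate(state, duration, dt=0.04, substeps=400):
    """Integrate with fine internal steps and record displacements every dt."""
    record = [state[:2]]
    for _ in range(round(duration / dt)):
        for _ in range(substeps):
            state = rk4_step(state, dt / substeps)
        record.append(state[:2])
    return record, state

def energy(state):
    """Total mechanical energy: 0.5 x^T K x + 0.5 v^T M v."""
    x, v = state[:2], state[2:]
    potential = 0.5 * sum(x[i] * K[i][j] * x[j] for i in range(2) for j in range(2))
    kinetic = 0.5 * sum(M[i] * v[i] ** 2 for i in range(2))
    return potential + kinetic

initial = [10.0, 0.0, 0.0, 0.0]                 # x(0) = [10, 0], zero velocity
displacements, final = simulate(initial, duration=1.0)
```

Only `displacements` would be passed to the model; the velocities remain internal to the integrator, mirroring the observation setting described above.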

Transformer Setup

The transformer model receives vector-valued inputs $\mathbf{x}_{t}\in\mathbb{R}^{2}$ across consecutive time steps, serving as historical observations for one-step-ahead prediction. Both full and partial observation scenarios are examined:

  • Full displacement observation: both displacement channels $(x_{1},x_{2})$ are available.

  • Partial displacement observation: only a single displacement, $x_{1}$, is provided.

In each setting, the input sequence length (context window) is varied to assess the model’s ability to reconstruct modal information from limited temporal and spatial cues. Within the DSF interpretation of attention [sieber2024dsf], increasing the length of the input sequence effectively enlarges the span of the induced recursive state, allowing the attention mechanism to approximate higher-dimensional linear dynamics through delayed aggregation. In this sense, the context window plays a role analogous to state augmentation in classical state–space realizations of autoregressive models.
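The analogy with state augmentation can be illustrated with the classical companion-form realization of an AR($p$) model, in which the last $p$ samples are stacked into a state vector that evolves under a single matrix. The sketch below uses hypothetical AR(4) coefficients and illustrative function names. It only shows that the augmented state reproduces the scalar recurrence and makes no claim about the learned attention weights.

```python
def companion(coeffs):
    """Companion matrix of x[t+1] = sum_i coeffs[i] * x[t-i]."""
    p = len(coeffs)
    rows = [list(coeffs)]
    for i in range(p - 1):
        rows.append([1.0 if j == i else 0.0 for j in range(p)])
    return rows

def step(matrix, state):
    """Advance the augmented state [x[t], x[t-1], ..., x[t-p+1]] by one step."""
    return [sum(a * s for a, s in zip(row, state)) for row in matrix]

coeffs = [0.5, -0.3, 0.2, -0.1]        # illustrative AR(4) coefficients
A = companion(coeffs)
state = [1.0, 0.0, 0.0, 0.0]           # most recent sample first
direct = list(reversed(state))         # same history in chronological order
for _ in range(10):
    state = step(A, state)
    direct.append(sum(c * direct[-1 - i] for i, c in enumerate(coeffs)))
```

After the loop, `state` holds the four most recent values of `direct` in reverse order: a longer window corresponds to a larger augmented state.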

Results and Observations

Figure 3: Spectral analysis of transformer-predicted responses for the two-DOF linear system under different observability and temporal context settings. (a) Full observation with input sequence length 4: both modal frequencies ($f_{1}=4.1$ Hz, $f_{2}=9.5$ Hz) are accurately recovered. (b) Partial observation with input sequence length 4: distorted spectral content with failure to recover the true modal frequencies. (c) Partial observation with input sequence length 8: extending the temporal context enables recovery of both modal frequencies, indicating that temporal aggregation can partially compensate for limited spatial observability.

Case 1: Full Displacement Observation (Input Sequence Length = 4). With both displacement channels available, the Transformer successfully identifies the two inherent modal frequencies of the system. As shown in Figure 3(a), the spectral analysis of the predicted sequences exhibits distinct peaks at 4.1 Hz and 9.5 Hz, closely matching the analytical modal frequencies. This confirms that the attention mechanism can approximate the underlying coupled dynamics when complete state information is available. The recovery of the correct modal frequencies indicates that the attention-induced linear recurrence implicitly approximates the underlying state evolution operator. For linear systems, modal frequencies are directly linked to the eigenvalues of the discrete-time state transition matrix $\boldsymbol{\Lambda}_{t}$.

Case 2: Partial Displacement Observation (Input Sequence Length = 4). When only $x_{1}(t)$ is observed, the Transformer fails to reconstruct the coupled modal structure. As shown in Figure 3(b), the resulting spectral density is dominated by a low-frequency component and lacks distinct peaks at the true modal frequencies. This degradation arises because the model no longer receives spatial coupling information through $x_{2}(t)$, while the short temporal context prevents it from inferring intermodal interactions from temporal correlations alone. In essence, partial observability combined with limited temporal context yields insufficient information for the attention mechanism to reconstruct the effective state needed to represent intermodal coupling.

Case 3: Partial Displacement Observation (Input Sequence Length = 8). Extending the temporal window to eight steps restores the model’s ability to identify both modal frequencies, even under partial observation. As shown in Figure 3(c), the spectrum again exhibits clear peaks at the correct modal frequencies. The longer temporal context allows the transformer to implicitly capture delayed intermodal correlations that would otherwise require direct spatial measurements. This highlights a fundamental trade-off between spatial observability and temporal context in attention-based modeling of dynamical systems, directly analogous to the classical delay-coordinate embedding principle where temporal history compensates for missing state variables.

In summary, these results demonstrate that transformer-based representations can recover multi-modal linear dynamics when provided with sufficient information—either through direct access to spatial degrees of freedom or through extended temporal context that enables implicit state reconstruction. However, as in the single-DOF case, the attention mechanism remains constrained by the non-negativity of its coefficients, restricting the class of dynamics that can be represented and limiting the ability to encode signed intermodal feedback essential for oscillatory coupling.

4 Nonlinear Systems and Sparse Observations

The analytical approach used for linear systems in the previous section, comparing discrete-time AR coefficients with the representations learned by attention, does not directly extend to nonlinear dynamics. Since attention acts as a linear mixer of historical states, it cannot alone represent nonlinear vector fields; the feed-forward network (MLP) becomes structurally essential for approximating the nonlinear flow map. Through three case studies (the Van der Pol oscillator, the Chafee–Infante reaction–diffusion PDE, and the Navier–Stokes equations), we demonstrate that Transformer architectures (i) provide computational benefits primarily in partial observation regimes, (ii) operate as delay-embedding mechanisms that preserve essential physical state information, and (iii) discover latent spaces that maintain the effective dimensionality of the dynamics while capturing dominant modes and relevant system parameters. However, discrete-time modeling comes with limitations. Although discrete-time forward predictors can approximate the forward flow map arbitrarily well, the identified maps are frequently non-invertible [cui2023certified]. Additionally, the topological structure of infinite-time attractors is often captured incorrectly (e.g., yielding invariant circles instead of limit cycles) [rico1992discrete]. These limitations must be taken into account when designing and interpreting Transformer-based dynamical models, particularly in settings where invertibility, periodic behavior, and parametric dependence play a central role.

4.1 Van der Pol Oscillator

Dynamical system

The Van der Pol oscillator is a classic nonlinear dynamical system described by the second-order differential equation:

$$\ddot{x}-\mu(1-x^{2})\dot{x}+x=0, \tag{15}$$

where $\mu\in\mathbb{R}$ is a scalar parameter controlling the strength of the nonlinear damping. Solving Eq. (15) requires the simultaneous integration of the state $x$ and its time derivative $\dot{x}$.

For our experiments, we fix $\mu=0.5$, a regime that yields a non-stiff oscillator with a smooth, stable limit cycle emerging around the unstable origin.

We generate training and test data by sampling multiple short trajectories from randomly initialized states across the system’s phase space. Initial conditions $\mathbf{x}_{0}=[x,\dot{x}]^{\top}$ are drawn from a uniform distribution $x,\dot{x}\sim\mathcal{U}(-3,3)$, ensuring coverage of diverse dynamical behaviors, including transients and convergence to the limit cycle. A total of 1500 initial conditions are used.

Each trajectory is integrated forward in time over the interval $[0,6.5]$ using the BDF solver from SciPy’s solve_ivp, with relative and absolute tolerances of $10^{-6}$ and $10^{-9}$, respectively. The integration output is sampled at a fixed time step $\Delta t=0.1$. We consider these accurate simulations as our ground truth.

To evaluate generalization near the system’s asymptotic behavior, we additionally simulate a long trajectory from initial state $\mathbf{x}_{0}=[2,0]^{\top}$, chosen to lie very close to the limit cycle. This trajectory is integrated over 65 time units (ten times longer than the short trajectories) using the same solver and discretization settings.
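The data-generation procedure can be sketched as follows. Note the swapped-in integrator: the study uses SciPy's BDF solver, whereas this dependency-free sketch uses a fixed-step classical RK4 scheme with ten substeps per sample, which is an assumption made for brevity. The seed, the reduced number of short trajectories, and the function names are likewise illustrative.

```python
import random

MU = 0.5

def vdp_rhs(state, mu=MU):
    """Van der Pol vector field in first-order form: state = [x, x_dot]."""
    x, x_dot = state
    return [x_dot, mu * (1.0 - x * x) * x_dot - x]

def rk4_step(state, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(slope, scale):
        return [s + scale * k for s, k in zip(state, slope)]
    k1 = vdp_rhs(state)
    k2 = vdp_rhs(shifted(k1, h / 2))
    k3 = vdp_rhs(shifted(k2, h / 2))
    k4 = vdp_rhs(shifted(k3, h))
    return [s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def trajectory(initial, t_end, dt=0.1, substeps=10):
    """Sample the solution every dt on [0, t_end], including t = 0."""
    state, samples = list(initial), [list(initial)]
    for _ in range(round(t_end / dt)):
        for _ in range(substeps):
            state = rk4_step(state, dt / substeps)
        samples.append(state)
    return samples

rng = random.Random(0)                      # fixed seed, chosen arbitrarily
short = [trajectory([rng.uniform(-3, 3), rng.uniform(-3, 3)], 6.5)
         for _ in range(20)]                # the study uses 1500 initial conditions
long_test = trajectory([2.0, 0.0], 65.0)    # held-out trajectory near the limit cycle
```

Each short trajectory contains 66 samples and the long test trajectory 651, since both endpoints of the integration interval are recorded.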

The dataset is divided into training and validation subsets in an 80:10 ratio. The long trajectory on the limit cycle is kept separate and used exclusively for testing. Figure 4(a) shows the resulting phase portrait, with different colors for the training (black), validation (green), and test (red) trajectories, clearly illustrating the long-term attractor behavior.

To illustrate the effectiveness of Transformer architectures under partial observability, and to examine their latent space, we conduct two distinct experiments: (1) full observations and (2) partial observations. From the perspective of Section 2.3, these experiments investigate how the linear, attention-induced recurrence interacts with nonlinear dynamics when combined with a feed-forward mapping, and how the availability of state information versus delayed observations affects the implicit state reconstruction.

Transformer setup

In our first experiment, we compare the predictive performance of a typical Transformer model, with a single attention head followed by a feedforward Multi-Layer Perceptron (MLP) layer, against traditional feedforward MLPs when provided full state observations. For this first experiment we also test Transformers with and without learned positional encoding. The number of neurons used for the MLPs was the same across the Transformer and MLP architectures. For the Transformer model, the prediction task involves forecasting future states based on a fixed-length window of 5 past observations, satisfying the embedding dimension requirement ($2n+1$) from Takens’ embedding theorem. In contrast, the MLP model is tasked with learning the time-one map: given the current state, it predicts the next one after time $\Delta t$, without incorporating any additional historical observations. In this fully observed setting, where all state variables are available at each time step, one might expect both approaches to perform comparably, since the current state alone already encapsulates the full information needed to predict the future evolution of the system.

In our second experiment, we consider a partial observability setting where only $x$ is provided to the model. This reflects a situation where some components of the system are unmeasured. In this setting, successful prediction requires reconstruction of the unobserved state from time-delayed measurements, aligning with the delay-coordinate embedding perspective formalized by Takens’ theorem [takens1981detecting].

Results and observations

For the full observation case, we trained each model ten times with different random seeds, keeping all other hyperparameters constant, to assess robustness. The resulting performance across seeds is summarized in Figure 4(b), which shows the absolute mean squared error (|MSE|) for all runs on the test limit cycle. The MLP and single-layer single-head Transformer exhibit similar performance, both having $|\text{MSE}|\approx 10^{-7}$. This aligns with our expectations: when the full state is observable, a simple nonlinear function approximator can effectively learn the time-one map without requiring access to historical information.

Figure 4: (a) Phase portrait of the Van der Pol oscillator with $\mu=0.5$. The training (black), validation (green), and test (red) trajectories are shown. (b) Prediction error (|MSE|) under full observation across different models. The transformer performs comparably to the MLP, indicating that attention mechanisms do not confer a significant advantage when all state variables are observed. (c) Prediction error (|MSE|) under partial observation (only $x$ observable). The transformer significantly outperforms the MLP, demonstrating its ability to learn latent dynamics from delayed inputs.

For our second experiment, we consider the partial observability setting where only $x$ is provided to the model. Figure 4(c) shows a performance gap: the Transformer (MLP + attention) outperforms the MLP-only model, as expected. This difference arises because the MLP only has access to a single scalar observation $x$ at a single point in time, and must predict future states without access to the full state of the system or a history of past measurements. As a result, it cannot distinguish between points in a trajectory that share the same $x$ value but differ in their phase. In contrast, the Transformer can leverage the window of past observations (5 in our experiments) to resolve this ambiguity. These results further support the interpretation that attention mechanisms enable autoregressive modeling capabilities. Further, under partial observability, the use of delayed observations provides the conditions under which such aggregations can support effective state reconstruction, in line with delay-coordinate embedding theory.
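The phase ambiguity, and its resolution through delayed observations, can be seen in a toy example. As an assumption made purely for illustration, a sine wave stands in for the observed component $x(t)$ of a periodic orbit; the function names are hypothetical.

```python
import math

def observe(t):
    """Scalar observable of a periodic orbit; a sine wave stands in for x(t)."""
    return math.sin(t)

def delay_vector(t, n_delays=5, dt=0.1):
    """Stack the current sample with n_delays - 1 past samples, oldest first."""
    return [observe(t - i * dt) for i in reversed(range(n_delays))]

t_rising = math.pi / 4          # x increasing through sin(pi/4)
t_falling = 3 * math.pi / 4     # x decreasing through the same value

same_scalar = math.isclose(observe(t_rising), observe(t_falling))
gap = max(abs(a - b) for a, b in zip(delay_vector(t_rising), delay_vector(t_falling)))
```

The two instants share the same scalar observation (`same_scalar` is true), yet their five-sample delay vectors differ markedly, which is the information a single-input MLP lacks.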

To investigate the mechanisms that support this behavior, we analyze the Transformer’s latent space for two of the trained models. As shown in Figure 5(a), the predicted trajectories closely match the ground truth, confirming that the models can accurately capture the system’s dynamics. As shown in Figure 1, $Z$ can be interpreted as a correction term that incorporates the temporal history of the system: it results from the application of the attention mechanism across the input sequence and thus encodes information from the past observations. This suggests that for identical values of $x$ that appear at different phases of the trajectory, $Z$ carries the information required to distinguish between them. This is shown in Figure 5(b), in which we plot $x$ against $Z$. For the same value of $x$, we see that the Transformer is able to learn two different corrections (values of $Z$), which matches our expectations.

Although this analysis is not strictly necessary in the $1D$ setting, we include it here to build intuition for cases where it becomes more informative. A complementary way to convey the same information is by plotting $x$ against $Z+x$, as shown in Figure 5(c). This figure again reveals that the same value of $x$ corresponds to two distinct values of $Z+x$. In the next few examples, which have dimensions larger than one, we will prefer this visualization over $Z$ as it provides an easier way to visualize the transformers’ latent space.

We now proceed with experiments using a $2D$ latent space for the Van der Pol system. Recall that the input to the transformer consists of the five one-dimensional delays of $y_{1}(t)$. To embed this input into a 2D latent space, a learned linear transformation is applied: $\mathbf{x}^{\mathrm{emb}}(t)=x\cdot\mathbf{W}^{\mathrm{emb}}$, where $\mathbf{x}_{1}^{\mathrm{emb}}(t)=\left[x_{1}^{\mathrm{emb}_{1}}(t),\;x_{1}^{\mathrm{emb}_{2}}(t)\right]$. This embedding precedes the attention mechanism.

In this setting, we visualize the learned latent trajectories by plotting $x_{1}^{\mathrm{emb}_{1}}(t)+Z_{1}$ against $x_{1}^{\mathrm{emb}_{2}}(t)+Z_{2}$, as shown in Figure 5(d). For both transformer models, with and without positional encoding, we observe that the resulting latent trajectories recover a limit cycle structure. This indicates that the transformer successfully learns a representation aligned with the intrinsic dynamics of the system. To assess robustness, in the Supplementary Information (Section A.1) we present the same visualization as in Figures 5(b), 5(c) and 5(d) for each of the 10 independently trained models (different random seeds).

We then proceed to investigate the entries in the attention matrix, focusing on the last row where the model uses five history tokens to make the next prediction in time. We report these results across all 10 seeds for: (a) 1D MLP + Attention with learned P.E., (b) 2D MLP + Attention with learned P.E., and (c) 2D MLP + Attention without P.E., as shown in Figures 5(e), 5(f), and 5(g), respectively. A consistent observation across all models is that attention is distributed over multiple past tokens rather than concentrating on a single time step. This behavior is expected from a dynamical systems perspective: under partial observability, a scalar measurement is not enough to define the system and multiple delayed observations are required to reconstruct the system state, in accordance with delay-coordinate embedding theory. The distinction between models trained with and without positional encoding seems to be in the structure of the learned distribution. Without positional encoding, attention weights collapse to an approximately uniform averaging over the delay window, yielding an order-invariant aggregation of past observations. In contrast, learned positional encodings allow the attention mechanism to assign non-uniform, time-dependent weights, enabling structured autoregressive representations that distinguish between different lags.
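The order-invariance property mentioned above can be checked directly on a scalar attention head. The sketch below is a simplification under stated assumptions: the query is held fixed and kept separate from the history tokens, and all weights, token values, and positional offsets are hypothetical. It illustrates only the permutation invariance without positional encoding, not the near-uniform weights observed empirically.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def last_row_output(history, query, pos=None, qk=0.7, v=1.0):
    """Attention output for a fixed query token over scalar history tokens.

    pos holds one positional offset per history slot; None means no encoding.
    """
    if pos is None:
        pos = [0.0] * len(history)
    tokens = [x + p for x, p in zip(history, pos)]
    alpha = softmax([qk * query * t for t in tokens])
    return v * sum(a * t for a, t in zip(alpha, tokens))

history = [0.9, 0.4, -0.2, -0.7]
shuffled = [0.4, -0.7, 0.9, -0.2]        # same values, different order
pos = [0.3, 0.1, -0.1, -0.3]             # illustrative positional offsets

no_pe_gap = abs(last_row_output(history, 0.5) - last_row_output(shuffled, 0.5))
pe_gap = abs(last_row_output(history, 0.5, pos) - last_row_output(shuffled, 0.5, pos))
```

Without positional offsets the output is unchanged by shuffling the history (up to floating-point rounding), whereas with offsets the two orderings give different outputs, so only the latter can distinguish lags.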

For both cases (1D and 2D inner dimensions) we provide for representative models additional visualizations that include the query (Q), the key (K) and value (V) in Section A.1 of the Appendix. In our view, these plots do not offer significant additional interpretability of the latent space of the transformer, but we include them for completeness.

Figure 5: (a) Predicted time series on the limit cycle. Ground truth values (black circles) are compared against predictions from two transformer models with identical architectures: one without positional encoding (No P.E., red crosses) and one with learned positional encoding (P.E., cyan stars). (b) Latent correction term $Z$ plotted against $y_{1}(t)$, showing phase-dependent separation. (c) Visualization of $Z+y_{1}(t)$ against $y_{1}(t)$, the effective transformed input signal. (d) Latent space trajectories of 2D transformer models with and without positional encoding. Attention pattern visualizations across 10 trained models: (e) 1D MLP + Attention with learned P.E., (f) 2D MLP + Attention with learned P.E., and (g) 2D MLP + Attention without P.E.

4.2 Chafee-Infante system

In an effort to understand whether basic Transformer architectures are able to uncover meaningful latent spaces not only for systems of ODEs but also for PDEs, we consider the Chafee-Infante reaction diffusion equation. The equation has the form

$$u_{t}=u-u^{3}+\nu u_{xx}, \tag{16}$$

and for our experiments we consider boundary conditions $u(0,t)=u(\pi,t)=0$ and $\nu=0.16$. For $\nu=0.16$ it has been shown that the long-term dynamics live in a two-dimensional inertial manifold [foias1988computation, gear2011slow, evangelou2023double, koronaki2024nonlinear].

We follow the same Galerkin projection approach and sampling scheme as in [gear2011slow, evangelou2023double] to ensure that the data lie near/on this low-dimensional manifold. Specifically, we approximate the solution as

$$u(x,t)\approx\sum_{k=1}^{3}\phi_{k}(t)\,\sin(kx), \tag{17}$$

which yields a system of three spectral ODEs:

$$\frac{d\boldsymbol{\phi}}{dt}=\boldsymbol{f}(\boldsymbol{\phi}), \tag{18}$$

where $\boldsymbol{\phi}=(\phi_{1},\phi_{2},\phi_{3})$. The sampled data were obtained by integrating Eq. (18) from a range of initial conditions and discarding transients, yielding samples that lie on the two-dimensional inertial manifold embedded in the three-dimensional Fourier space. We note that the inertial manifold for this system is parametrized fully by the first two modes $\phi_{1}$ and $\phi_{2}$, shown as a projection in Figure 6(a). The manifold is symmetric with respect to the origin, as can be seen from the projection in Figure 6(a); as we discuss below, this has implications for the embedding structure the Transformer can capture/reveal.
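For readers who wish to reproduce the spectral system, the right-hand side $\boldsymbol{f}(\boldsymbol{\phi})$ can be obtained from a standard Galerkin projection of Eq. (16) onto $\sin(kx)$, $k=1,2,3$. The sketch below evaluates the projection of the cubic term numerically with a midpoint rule. It is an assumption that this matches the construction in the cited references, and the function name and quadrature size are illustrative.

```python
import math

NU = 0.16
N_QUAD = 2000    # midpoint-rule nodes on [0, pi]; an arbitrary but ample choice

def galerkin_rhs(phi, nu=NU):
    """Right-hand side f(phi) for the three-mode sine expansion.

    Projects u_t = u - u^3 + nu * u_xx onto sin(kx), k = 1, 2, 3, using
    (2/pi) * integral_0^pi sin(jx) sin(kx) dx = delta_jk.
    """
    h = math.pi / N_QUAD
    cubic = [0.0, 0.0, 0.0]
    for i in range(N_QUAD):
        x = (i + 0.5) * h
        u = sum(phi[k] * math.sin((k + 1) * x) for k in range(3))
        for k in range(3):
            cubic[k] += u ** 3 * math.sin((k + 1) * x) * h
    return [(1.0 - nu * (k + 1) ** 2) * phi[k] - (2.0 / math.pi) * cubic[k]
            for k in range(3)]

f = galerkin_rhs([1.0, 0.0, 0.0])
```

For $\boldsymbol{\phi}=(1,0,0)$ the identity $\sin^{3}x=(3\sin x-\sin 3x)/4$ gives $f_1=(1-\nu)-3/4$, $f_2=0$ and $f_3=1/4$, which provides a simple analytical check of the quadrature.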

Figure 6: (a) Sampled data expressed in terms of the first three Fourier modes $(\phi_{1},\phi_{2},\phi_{3})$ for the Chafee–Infante PDE, revealing a two-dimensional manifold embedded in $\mathbb{R}^{3}$. The projection of the manifold onto the $(\phi_{1},\phi_{2})$ plane is shown, and a reference trajectory is highlighted in light blue both in the ambient space and in the projection. (b) The (light blue) reference trajectory reconstructed in the physical space $u(x,t)$ is shown. The single pointwise observation at $x=10$, denoted $u_{10}(t)$, is plotted in red. (c) Distribution of reconstruction errors for trajectories with missing observations.

For our purposes, we use these samples as initial conditions and generate short trajectories by integrating the spectral system over the interval $[0,4]$. Each trajectory is sampled at 10 uniformly spaced time points. We then map each trajectory from the Fourier space back to the physical solution $u(x,t)$ using Eq. (17). For the reconstruction, a uniform spatial grid consisting of 256 points was used. As observations for the Transformer model, we extract the signal at a single spatial location, $u(x=10,t)=u_{10}$. We assume that this spatial coordinate is the only measurement that we have access to.

For all experiments reported below, we split the sampled short trajectories into training, validation, and test sets using a 70% / 15% / 15% ratio, respectively. Each sampled trajectory (consisting of 10 snapshots) is decomposed into overlapping 5-step sequences used as delayed input windows (Transformer tokens) for training. The delay length of 5 was chosen with Takens’ embedding theorem in mind, under which $2n+1$ delays generically suffice to reconstruct an $n$-dimensional manifold; here $n=2$, so 5 delays are advisable.
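The decomposition into overlapping delay windows can be sketched as below. It is an assumption of this sketch that each 5-step window is paired with the immediately following snapshot as its prediction target, which gives five training pairs per 10-snapshot trajectory. The function name and the placeholder data are illustrative.

```python
def delay_windows(trajectory, n_delays=5):
    """Split one trajectory into (input window, next-step target) pairs."""
    pairs = []
    for start in range(len(trajectory) - n_delays):
        window = trajectory[start:start + n_delays]
        target = trajectory[start + n_delays]
        pairs.append((window, target))
    return pairs

snapshots = [float(i) for i in range(10)]   # stand-in for 10 samples of u_10
pairs = delay_windows(snapshots)
```

Consecutive windows overlap in four of their five entries, so each snapshot contributes to several training pairs.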

Transformer setup

Figure 7: Representative model MLP + Attn. 3D. (a) Latent-space relationships among $\mathbf{u}_{10}^{\mathrm{emb}},\,\mathbf{Q},\,\mathbf{K},\,\mathbf{V},\,\mathbf{Z}$. (b) Projection of $\mathbf{z}+\mathbf{u}_{10}^{\mathrm{emb}}$ onto the $(Z_{1}+u_{10,1}^{\mathrm{emb}},\,Z_{2}+u_{10,2}^{\mathrm{emb}})$ plane. (b–c) Attention matrices for three representative data points across ten models trained with different random seeds: (c) models with learned positional encodings (PE) and (d) models without PE.

Figure 8: Representative model MLP + Attn. 3D. The latent transformer embeddings of $\boldsymbol{z}+\boldsymbol{u}_{10}^{\mathrm{emb}}$ plotted in the (a) $(Z_{1}+u_{10}^{\mathrm{emb}_{1}},Z_{2}+u_{10}^{\mathrm{emb}_{2}})$ plane and (b) $(Z_{2}+u_{10}^{\mathrm{emb}_{1}},Z_{3}+u_{10}^{\mathrm{emb}_{2}})$ plane. In both subfigures the three panels show the same points colored by $\phi_{1}$, $\phi_{2}$, and $\phi_{3}$, respectively. (c-d) Attention matrices for three representative data points across ten models trained with different random seeds: (c) models with learned positional encodings (PE) and (d) models without PE.

To examine the benefit of temporal attention in settings with limited spatial information, we begin by comparing an MLP-only baseline against transformer-based models. Both are trained to predict the next value of the signal $u(x=10,t)$. The MLP uses only the value $u(x=10,t_{0})$ and attempts to predict $u(x=10,t_{1})$. The transformer model receives 5 delayed observations $\boldsymbol{u}(x=10,t_{1-5})=\{u(x=10,t_{1}),u(x=10,t_{2}),\dots,u(x=10,t_{5})\}$ as input tokens, processes them via a single attention layer, and predicts the next point in time, $u(x=10,t_{6})$.

We consider transformer variants with both $2D$ and $3D$ latent dimensions, trained with and without learned positional encoding (PE) for 10 different seed values. All models share the same MLP architectures and the same training protocol (e.g., number of epochs and learning rate). This setup mirrors our earlier experiments on the Van der Pol system and allows for a direct assessment of the transformer’s ability to recover latent structure when only partial observations are available. The choice of the $2D$ and $3D$ latent dimensions for the transformer was made given the prior knowledge that we have about the true dimensionality of this system. We remind the reader that the data were sampled from a two-dimensional (symmetric) non-linear manifold embedded in a three-dimensional space.

From these experiments (Figure 6(c)), we observe that the MLP-only baseline exhibits the highest mean-squared error (MSE) among all architectures. In contrast, the Transformer-based models generally achieve lower errors, though the 2D-attention variants display substantial variance across training runs, with some instances performing worse than the MLP. This suggests that incorporating 2D attention can lead to unstable training dynamics and inconsistent predictive performance. The 3D-attention models demonstrate the best overall performance and the lowest median errors. Notably, the inclusion of positional encoding (PE) does not appear to provide any systematic improvement in either the 2D or 3D configurations.

Results and observations

Although the underlying inertial manifold of the Chafee–Infante system is two-dimensional, the Transformer with a 2D latent space does not recover a clean, unfolded representation. As illustrated in Figure 8(a), the learned embedding $\mathbf{z}+\mathbf{u}_{10}^{\mathrm{emb}}$ (our effective latent coordinates, following the notation introduced in the Van der Pol example) does not form a smooth 2D surface but instead collapses into a narrow, thickened curve. This indicates that the model is not able to discover the full two-dimensional parametrization of the attractor.

To understand why this occurs, we recall that, from a delay-embedding perspective, Takens’ theorem ensures that $2n+1=5$ delayed observations are sufficient to reconstruct an $n=2$ dimensional attractor from a generic scalar observable. In our setting, we do provide five delayed measurements of $u_{10}(t)$, so the temporal information supplied to the model is, in principle, sufficient. The difficulty therefore does not stem from too short a temporal history. Takens’ theorem guarantees that such delay coordinates contain sufficient information to reconstruct the underlying attractor, but it does not guarantee that this information can be faithfully represented after parametric compression into a prescribed latent dimension. In the Transformer, this compression is enforced by the choice of the internal latent dimension, which constrains how the attention-induced state can represent the unfolded delay embedding. Thus, we argue that the limitation arises from the dimensionality of the Transformer’s latent space. By restricting the inner dimension to two, we require the model to (i) integrate information across the five delayed tokens via attention, (ii) infer the nonlinear relationships among these delays, (iii) construct an appropriate correction term $\mathbf{z}$, and simultaneously (iv) represent the resulting unfolded geometry in only two latent coordinates. This places a strong constraint on the model, as it must find the right delay embedding and compress it directly into a 2D representation. In practice, this appears to overconstrain the learning problem and prevents the Transformer from fully unfolding the underlying manifold. The key issue is that a two-dimensional latent space forces premature compression of the delay embedding.
In contrast, a three-dimensional latent space provides sufficient room for the attention mechanism to first unfold the delay coordinates into a richer intermediate representation, after which the nonlinear MLP can project onto the intrinsic two-dimensional inertial manifold. This separation of concerns (unfolding followed by projection) appears essential for successful state reconstruction.
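
As a compact restatement of this argument, the delay vector supplied to the model and the compression imposed by the latent dimension $d$ can be written as below. The sampling interval $\tau$, the symbol $\boldsymbol{h}$, and the learned map $g_{\theta}$ are notation introduced here purely for illustration.

$$
\boldsymbol{h}(t)=\bigl(u_{10}(t),\,u_{10}(t-\tau),\,\dots,\,u_{10}(t-4\tau)\bigr)\in\mathbb{R}^{2n+1},\qquad n=2,
$$
$$
g_{\theta}:\mathbb{R}^{5}\to\mathbb{R}^{d},\qquad d\in\{2,3\}.
$$

Takens’ theorem concerns the first line only: generically, $\boldsymbol{h}$ embeds the attractor. It does not say whether a learned $g_{\theta}$ remains one-to-one for a given $d$, which is where the $d=2$ models appear to struggle.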

This difficulty may be exacerbated by the symmetry of the Chafee–Infante inertial manifold. Because the attractor is symmetric with respect to the origin, a single spatial observable such as $u_{10}(t)$ induces a delay embedding that is not one-to-one with the true state, yielding a "folded" representation. The geometry presented to the model is therefore more intricate than a simple 2D sheet, making the unfolding problem even more challenging under strict 2D latent constraints.

For the remainder of this section we focus on the latent space of the models trained with 3D attention. In Figures 8(a) and 8(b) we show projections of the transformer’s latent space colored by the Fourier modes $\phi_{1},\phi_{2},\phi_{3}$. The latent space is effectively 2D, since $Z_{3}+u_{10}^{emb_{1}}$ is a function of $Z_{2}+u_{10}^{emb_{1}}$, while $Z_{1}+u_{10}^{emb_{1}}$ and $Z_{2}+u_{10}^{emb_{1}}$ are independent. In addition, Figure 8(a) shows that the Fourier modes are functions on the transformer’s latent space. This suggests that the representation learned by the transformer, despite seeing only a single spatial observation $u_{10}$, is sufficient to recover the true underlying dynamics.

The observed attention patterns indicate that the model distributes weight across multiple delayed tokens rather than collapsing onto a single dominant lag. This is consistent with the need to integrate information across several delays in order to reconstruct the effective state from a single spatial observation. The absence of a sharply peaked attention profile suggests that no single delay is sufficient, and that the attention mechanism functions as a distributed linear aggregator over the reconstructed delay coordinates. Interestingly, in this case the learned distribution of attention weights looks similar for the models with PE and without PE.
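One way to quantify how "spread out" an attention profile is over the delays is sketched below. The entropy-based count `effective_num_lags` is a diagnostic introduced here for illustration and was not used in the experiments; the random tokens and projection matrices are placeholders rather than trained weights.

```python
import numpy as np

def last_token_attention(tokens, W_q, W_k):
    """Softmax attention weights of the most recent token over all tokens in the window."""
    Q = tokens @ W_q
    K = tokens @ W_k
    d = Q.shape[-1]
    scores = (Q[-1] @ K.T) / np.sqrt(d)   # last query against every key
    scores -= scores.max()                # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def effective_num_lags(weights):
    """exp(Shannon entropy): about 1 for a one-hot profile, len(weights) for a uniform one."""
    w = np.clip(weights, 1e-12, None)
    return float(np.exp(-(w * np.log(w)).sum()))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 3))   # 5 delayed tokens in a 3D latent space (placeholder values)
W_q = rng.normal(size=(3, 3))      # placeholder query projection
W_k = rng.normal(size=(3, 3))      # placeholder key projection
weights = last_token_attention(tokens, W_q, W_k)
print(weights, effective_num_lags(weights))
```

A value close to the number of delays would correspond to the distributed aggregation described above, while a value near one would indicate collapse onto a single lag.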

As in the previous example, to assess the robustness of the learned representations we include in the Supplementary Information (Section A.2) additional visualizations across the 10 models trained with different random seeds. For representative models, we also include visualizations of the query (Q), key (K), value (V), and Z.

4.3 Navier-Stokes: Flow past a cylinder

Dynamical system

The final dynamical system considered is the 2D flow past a cylinder, governed by the incompressible Navier–Stokes equations:

$$\frac{\partial\mathbf{u}}{\partial t}+(\mathbf{u}\cdot\nabla)\mathbf{u}=-\frac{1}{\rho}\nabla p+\nu\nabla^{2}\mathbf{u},\qquad\nabla\cdot\mathbf{u}=0 \tag{19}$$

where $\mathbf{u}=(u_{x},u_{y})$ is the velocity field, $p$ the pressure, $\rho$ the density, and $\nu$ the kinematic viscosity. The flow is thus characterized by three scalar fields: the two velocity components and the pressure.

For our experiments, we use the dataset from [geneva2022transformers], which contains flow trajectories for various Reynolds numbers $Re\in[100,750]$, where each Reynolds number corresponds to a single trajectory. To focus on the system’s asymptotic behavior, we discard initial transients. In the post-transient regime the flow exhibits periodic vortex shedding, meaning the dynamics converge to a stable limit cycle. Consequently, the data reside on a low-dimensional manifold, justifying the use of a compact latent representation [deane1991low, menier2025interpretable].

We select a single spatial location from the $u_{x}$ component of the velocity field at coordinates $x=35$, $y=45$, as shown in Figure 9. This scalar observation is made across all parameter values and times, providing a one-dimensional time series for each Reynolds number. The specific choice of coordinates is not critical; any spatial location that exhibits representative dynamics (i.e., not on a boundary or in a stagnant region) would suffice.

Figure 9: Selected spatial location ($x=35$, $y=45$) on the $u_{x}$ velocity field used for training, and the corresponding time series from that location.

Transformer setup

As in all of our previous experiments, the modeling and architectural choices are guided by Takens’ embedding theorem, which suggests that $2n+1$ time-delayed observations are sufficient to capture the underlying dynamics on an $n$-dimensional manifold. We assume that the intrinsic dimension of the system is at most 3: two dimensions to describe the limit cycle, and one for the Reynolds number variation.

Based on this analysis, we use a model with an inner dimension of 3. In our experiments, we found that 5–7 delays provide sufficient temporal context (the upper end of this range being consistent with Takens-style delay embedding heuristics), with diminishing returns beyond this range. The choice of the internal latent dimension determines the capacity of this surrogate state to represent both the oscillatory dynamics and the variation induced by the Reynolds number. Beyond prediction, this experiment also probes whether the learned latent representation is not only minimal but also sufficient for reconstructing the full spatial field, that is, whether the transformer discovers a compact embedding from which the complete flow state can be recovered.

Results and observations

We conduct two experiments to investigate the role of explicit parameter information in learning latent representations: first, training the transformer using only the scalar observation without the Reynolds number; second, incorporating the normalized Reynolds number as an additional input.
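The two input configurations can be sketched as follows. How the Reynolds number enters the model is not detailed in this section, so the sketch makes two assumptions that should be read as illustrative only: the Reynolds number is min-max scaled over the dataset range $[100,750]$, and it is appended as an extra feature to every token. The helper name `build_tokens` is hypothetical.

```python
import numpy as np

RE_MIN, RE_MAX = 100.0, 750.0   # Reynolds range of the dataset quoted in the text

def build_tokens(window, reynolds=None):
    """Return a (n_delays, n_features) token array for one delay window."""
    window = np.asarray(window, dtype=float)
    tokens = window[:, None]   # parameter-unaware: one feature per token
    if reynolds is not None:
        # Assumed normalization: min-max scaling to [0, 1].
        re_norm = (reynolds - RE_MIN) / (RE_MAX - RE_MIN)
        re_column = np.full((len(window), 1), re_norm)
        # Parameter-aware: the same normalized Re is attached to every token.
        tokens = np.concatenate([tokens, re_column], axis=1)
    return tokens

window = np.sin(0.4 * np.arange(7))         # toy 7-delay window, not paper data
unaware = build_tokens(window)              # shape (7, 1)
aware = build_tokens(window, reynolds=425)  # shape (7, 2)
print(unaware.shape, aware.shape)
```

Other conditioning choices, such as a dedicated parameter token, would fit the same description; the sketch only fixes one of them for concreteness.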

Figure 10 presents the main results for both configurations. Both models achieve low errors, with a maximum mean absolute error over both cases of around 0.014. The error stratification by Reynolds number (Figure 10(a)) reveals a key distinction: the parameter-unaware model exhibits non-uniform errors, with intermediate Reynolds numbers ($Re\approx 300$–$500$) performing best while accuracy degrades toward both parameter extremes. In contrast, the parameter-aware model achieves both much lower overall errors and a more uniform distribution across the Reynolds number range.

The latent space visualizations explain this discrepancy. Without explicit parameter information (Figures 10(b) and 10(d)), limit cycles corresponding to different Reynolds numbers overlap substantially rather than forming clearly separated structures. For intermediate Reynolds values, overlapping neighbors in latent space correspond to dynamically similar regimes, so conflating them incurs only modest prediction error. At parameter extremes, however, latent neighbors may correspond to substantially different dynamics, and this conflation leads to larger errors. With the Reynolds number provided, the model successfully disentangles the limit cycles (Figure 10(c)), arranging them along a continuous manifold parameterized by $Re$, as seen in the 3D visualization of Figure 10(e).

This result warrants further discussion, since, in theory, delay embeddings of the observed state alone should suffice for prediction. For a fixed Reynolds number, the dynamics evolve on a well-defined invariant set, and a sufficiently long history uniquely determines future evolution. However, in practice the learned dynamics are only approximate. The Transformer does not enforce exact physical constraints (such as conservation laws or parameter-conditioned invariants) and therefore acts as an imperfect integrator of the underlying flow. Much like a non-symplectic numerical scheme applied to a Hamiltonian system, small local errors may accumulate over time, causing drift away from the true invariant manifold. From a latent-space perspective, the delay-embedded representation spans a space larger than the set of realizable physical histories for any fixed parameter value, allowing trajectories to migrate between nearby parameter regimes. Histories constructed solely from delayed observations may thus correspond to physically plausible trajectories at an incorrect effective Reynolds number, or even emulate dynamics inconsistent with any single parameter value/physical regime. Making the Reynolds number explicit constrains the model to remain on the correct parameter-conditioned manifold, suppressing drift and enforcing dynamical consistency over long horizons.
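The numerical-integration analogy can be made concrete with a toy example unrelated to the cylinder data: a unit harmonic oscillator integrated with explicit Euler (non-symplectic) and with symplectic Euler. The sketch below only illustrates how small local errors compound into drift off the invariant set; it does not model the Transformer itself.

```python
def energy(q, p):
    """Energy of the unit harmonic oscillator, constant along exact trajectories."""
    return 0.5 * (q**2 + p**2)

def explicit_euler(q, p, dt, n_steps):
    """Non-symplectic scheme: both updates use the old state."""
    for _ in range(n_steps):
        q, p = q + dt * p, p - dt * q
    return q, p

def symplectic_euler(q, p, dt, n_steps):
    """Symplectic scheme: the position update uses the already-updated momentum."""
    for _ in range(n_steps):
        p = p - dt * q
        q = q + dt * p
    return q, p

dt, n_steps = 0.1, 1000
e0 = energy(1.0, 0.0)
e_explicit = energy(*explicit_euler(1.0, 0.0, dt, n_steps))
e_symplectic = energy(*symplectic_euler(1.0, 0.0, dt, n_steps))
print(e_explicit / e0, e_symplectic / e0)
```

For explicit Euler the energy is multiplied by $(1+\Delta t^{2})$ at every step, so the trajectory spirals away from the true orbit, whereas the symplectic variant keeps the energy within a bounded band around its initial value. In the analogy above, conditioning on the Reynolds number plays the stabilizing role.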

Figure 10: Comparison of parameter-unaware and parameter-aware Transformer training on the cylinder flow. (a) Mean absolute error stratified by Reynolds number; the hashed bars indicate the parameter-aware model. (b,c) Two-dimensional projection of the latent representation $Z+X$ colored by Reynolds number for (b) parameter-unaware and (c) parameter-aware training. (d,e) Three-dimensional visualization of the latent representation $Z+X$ for (d) parameter-unaware and (e) parameter-aware training. The parameter-aware model achieves lower and more uniform errors across the Reynolds number range, and produces a latent space with clearly separated limit cycles. Without explicit Reynolds number input, limit cycles corresponding to different parameter values exhibit significant overlap. Providing the Reynolds number as input enables the Transformer to disentangle the cycles, arranging them along a continuous manifold parameterized by $Re$.

The implications for full-field reconstruction are shown in Figure 11. We use Geometric Harmonics, a kernel-based regression method proposed by Coifman and Lafon in [lafon2004diffusion], to reconstruct the complete velocity field from the learned latent representation. The choice of reconstruction method is not central to our analysis; any sufficiently expressive regression scheme (e.g., Gaussian Process Regression or a neural network) could serve the same purpose. Without explicit parameter information (Figure 11(a)), the reconstruction accurately recovers the broad spatial structure, but exhibits notable errors concentrated near the cylinder wake, where vortex shedding dynamics are most sensitive to Reynolds number. These errors stem directly from the overlapping limit cycles observed in the latent space: the learned representation is predictive but not injective with respect to the underlying flow fields. Points that share similar latent coordinates may correspond to different dynamical regimes, and without parameter information the reconstruction cannot disambiguate between them. When the Reynolds number is provided as additional input (Figure 11(b)), this ambiguity is resolved and reconstruction quality improves dramatically, with accurate recovery of the velocity field across the entire domain.
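For readers unfamiliar with the method, a minimal sketch of a Geometric-Harmonics-style extension is given below: a Gaussian kernel is eigendecomposed on the training latents, the target is projected onto the well-conditioned eigenvectors, and those eigenvectors are extended to new latent points via the Nyström formula. The function names, the kernel scale `eps`, the truncation threshold `delta`, and the one-dimensional toy data are all assumptions made here for illustration; they are not the settings used for Figure 11.

```python
import numpy as np

def gaussian_kernel(A, B, eps):
    """Pairwise Gaussian kernel exp(-||a - b||^2 / eps) between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / eps)

def fit_geometric_harmonics(X, F, eps, delta=1e-8):
    """Eigendecompose the kernel on the training latents and project F onto retained modes."""
    K = gaussian_kernel(X, X, eps)
    eigvals, eigvecs = np.linalg.eigh(K)      # symmetric eigenproblem, ascending eigenvalues
    keep = eigvals > delta * eigvals.max()    # discard ill-conditioned modes
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    coeffs = eigvecs.T @ F                    # projection coefficients <psi_j, F>
    return {"X": X, "eps": eps, "eigvals": eigvals, "eigvecs": eigvecs, "coeffs": coeffs}

def extend(model, X_new):
    """Nystrom-extend the retained modes to X_new and reconstruct F there."""
    K_new = gaussian_kernel(X_new, model["X"], model["eps"])
    psi_new = K_new @ model["eigvecs"] / model["eigvals"]   # extended eigenfunctions
    return psi_new @ model["coeffs"]

# Toy example: a 1D "latent" coordinate and a scalar target (not paper data).
X = np.linspace(0.0, 2.0 * np.pi, 50)[:, None]
F = np.sin(X[:, 0])
model = fit_geometric_harmonics(X, F, eps=1.0)
X_new = np.array([[1.0], [2.5]])
F_new = extend(model, X_new)
print(F_new, np.sin(X_new[:, 0]))
```

In a full-field setting, `F` would have one column per grid point of the velocity field, which the code above handles without modification. The sketch also makes the injectivity issue visible: the extension is a function of the latent coordinates alone, so two flow states mapped to the same latent point necessarily receive the same reconstruction.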

Figure 11: Full-field reconstruction of the $u_{x}$ velocity via Geometric Harmonics from the learned latent representation. (a) Using only $Z+X$ from the parameter-unaware model, reconstruction errors are concentrated near the cylinder wake where vortex shedding dynamics are most sensitive to Reynolds number. (b) With the parameter-aware latent representation, the reconstruction accurately recovers the velocity field across the domain. The black square indicates the observation location used for training.

These experiments demonstrate that simple, single-layer Transformers can learn meaningful latent representations of complex fluid dynamics from a single scalar observable. However, with this minimal architecture, additional input information such as system parameters may be necessary to fully disentangle complex behaviors in the latent space and enable accurate reconstruction of complete spatial fields. Deeper architectures with multiple layers or attention heads may reduce this dependence, though we leave this investigation to future work.

5 Conclusions and Future Directions

In this work, we have attempted to go beyond standard forecasting benchmarks to provide a mechanistic understanding of how Transformer architectures represent dynamical systems in their latent space. Rather than assessing models through aggregate performance or post hoc interpretability, we adopted a bottom-up perspective focused on the self-attention mechanism in isolation within minimal, single-layer architectures. We summarize the key findings below and outline directions for future research.

5.1 Summary of key insights

The key insights drawn from our numerical experiments across linear (Section 3) and nonlinear (Section 4) dynamical systems are summarized below:

Insights from linear systems

For linear dynamical systems, our analysis adopted a classical autoregressive (AR) viewpoint that allowed a direct, mechanistic comparison between analytically derived AR models and the effective recursions learned by attention-only Transformer architectures.

We showed that a single-head causal attention mechanism induces a linear, time-varying autoregressive operator whose coefficients are determined by the attention weights and value projections. This enabled a direct comparison between the closed-form AR coefficients derived from the underlying physical system and the effective coefficients learned by the Transformer. In regimes where the true dynamics admit an AR representation whose coefficients share the same sign, the attention mechanism accurately recovers the dominant modal decay ridge, effectively learning a data-adaptive AR model consistent with classical linear system theory.

Our experiments demonstrate that attention-only Transformers can correctly capture overdamped or monotonic dynamics, as well as certain oscillatory regimes where the discrete-time AR coefficients share the same sign (Case 1). In contrast, the model fails to reproduce resonant or lightly damped dynamics that require mixed-sign autoregressive coefficients. This failure arises from the softmax normalization inherent in attention, which enforces non-negativity and convexity of the attention weights. As a result, attention cannot represent subtractive interactions between past states, leading to oversmoothing and an incorrect dynamic signature in resonant regimes (Case 2). This fundamental limitation means that even simple linear oscillators may require additional architectural components or modifications.
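The sign argument can be illustrated on a textbook example. For the sampled free response of an underdamped oscillator, the discrete-time poles are $r e^{\pm i\theta}$ and the AR(2) recursion is $x_{k}=2r\cos\theta\,x_{k-1}-r^{2}x_{k-2}$. The sketch below is a hedged illustration and not a reproduction of Cases 1 and 2: the parameter values are arbitrary, and `effective_coefficients` assumes a scalar head in which a single value gain multiplies the softmax weights.

```python
import numpy as np

def ar2_coefficients(omega_n, zeta, dt):
    """AR(2) coefficients (a1, a2) of x_k = a1*x_{k-1} + a2*x_{k-2} for the sampled
    free response of x'' + 2*zeta*omega_n*x' + omega_n**2*x = 0, with 0 <= zeta < 1."""
    omega_d = omega_n * np.sqrt(1.0 - zeta**2)   # damped natural frequency
    r = np.exp(-zeta * omega_n * dt)             # modulus of the discrete-time poles
    a1 = 2.0 * r * np.cos(omega_d * dt)
    a2 = -r**2
    return a1, a2

def same_sign(a1, a2):
    """True when both AR coefficients are strictly positive or strictly negative."""
    return a1 * a2 > 0

def effective_coefficients(attn_weights, value_scale):
    """Assumed scalar head: non-negative softmax weights times one shared value gain."""
    return value_scale * np.asarray(attn_weights, dtype=float)

for dt in (0.2, 2.0):
    a1, a2 = ar2_coefficients(omega_n=1.0, zeta=0.05, dt=dt)
    print(dt, a1, a2, same_sign(a1, a2))
```

With fine sampling (`dt=0.2`) the coefficients have opposite signs, which a convex combination scaled by a single gain cannot reproduce. With coarse sampling (`dt=2.0`) both coefficients are negative, and a negative `value_scale` applied to non-negative weights has the right sign pattern.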

In multi-degree-of-freedom linear systems, we showed that attention can recover multiple interacting modes when enough delayed information is available to form an expressive effective state. In this case, the attention mechanism aggregates past observations in a way that captures multiple spectral modes. However, this ability is still limited by the same non-negativity constraint discussed above.

Insights from non-linear systems

Attention provides little advantage when the full state vector is directly observable. In the Van der Pol oscillator and the Chafee-Infante system under full observation, a feed-forward MLP trained to approximate the time-one map achieves predictive accuracy comparable to that of a Transformer with access to temporal context. In this setting, the system is already Markovian, and temporal aggregation does not introduce additional useful information. In contrast, under partial observability the role of attention becomes important. When incomplete measurements are available, the Transformer leverages temporal context through attention to construct an implicit state from delayed observations. This enables the model to disambiguate states that are indistinguishable from a single snapshot but correspond to different phases or dynamical regimes. As a result, Transformers provide a clear computational benefit in experimental and data-driven settings where sensing is limited. These findings are further corroborated by recent work showing that Transformer-based models are particularly well-suited for history-dependent flows with limited data, whereas simpler architectures may suffice when dynamics depend solely on instantaneous variables [urdeitx2025can].

Under partial observability, the behavior of attention is consistent with classical delay-coordinate reconstruction principles. Rather than introducing new dynamical information, attention learns a data-adaptive mechanism for aggregating delayed measurements into an internal representation that supports prediction. Across all nonlinear case studies, attention distributes weight across multiple past observations, reflecting the need to integrate information over time to reconstruct the effective state. From a dynamical systems perspective, this behavior aligns closely with Takens-style delay embeddings, where temporal context compensates for missing state variables.

The quality of the learned latent representations depends critically on both temporal context and latent dimensionality. Consistent with Takens’ embedding theorem, in our experiments $2n+1$ delayed observations were sufficient to reconstruct an $n$-dimensional attractor. Systems with higher intrinsic dimensionality require longer input sequences to enable accurate reconstruction, as reflected in the longer delay windows used for the higher-dimensional examples. In addition to sufficient temporal history, the internal latent dimension of the Transformer must be large enough to unfold the reconstructed geometry. Insufficient latent dimensionality leads to overlapping or collapsed representations that degrade the final predictive accuracy. This effect is particularly evident in symmetric low-dimensional manifolds and parameter-varying systems. In the cylinder flow example, Transformers trained without explicit parameter input implicitly encode Reynolds number variation, but the resulting latent space does not uniquely separate different parameter values. As we pointed out earlier, this happens because the latent trajectories inferred from delays alone may drift away from the true Reynolds number, either toward other physically realizable regimes or into unphysical regions of the latent space; this can adversely affect the learned representation. By prescribing the Reynolds number explicitly, the model is constrained to remain on the correct parameter-conditioned manifold, mitigating drift and stabilizing the latent dynamics. We note that this ambiguity may also be exacerbated by the relatively close Reynolds numbers considered in our experiments, which limit the degree of parametric separation present in the data.

Together, these results indicate that attention-based models succeed in nonlinear settings when architectural choices—such as sequence length and latent dimension—are aligned with the intrinsic dimensionality of the dynamics and any relevant parameter variations. Grounding these choices in delay-embedding and dynamical systems considerations can yield strong predictive performance without unnecessary model complexity, while also improving interpretability of the learned representations.

5.2 Future directions

These findings suggest that standard Transformer components inherited from NLP are not inherently optimal for physical dynamics. Future research should focus on aligning architectural inductive biases with the mathematical structure of dynamical systems:

  • Multi-head attention and spectral expressivity: While a single head is constrained to convex combinations, an open question is whether multi-head attention can circumvent the spectral limitations identified in section 3. Interactions across heads might enable subtractive or complementary combinations that recover mixed-sign autoregressive structure, or different heads might specialize in distinct modes to mitigate folding or collapse of the latent representation. Rigorous analysis of whether such head-wise specialization can preserve correct spectral structure while enabling richer state representations remains an important direction.

  • Mechanistic Interpretability for stiff and multiscale systems: Our results indicate that the primary computational benefit of training Transformers arises in partial observation regimes, where attention supports delay-based state reconstruction. Whether this conclusion extends to systems with strong stiffness or pronounced multiscale structure is an open question. In such regimes, attention may offer additional advantages by adaptively integrating information across disparate time scales, even when the full state is observable. Systematically investigating these settings would clarify when attention-based architectures genuinely outperform classical approaches. From the perspective of Backward Error Analysis, such investigations could reveal whether Transformers learn an Inverse Modified Differential Equation (IMDE) [Zhu2023ImplementationA], effectively adapting their implicit numerical scheme to the system’s varying time scales and stiffness. This could provide a rigorous mathematical grounding for their empirical success in stiff and multiscale dynamical regimes.

  • Shared latent space geometry in foundation models: Foundation models for dynamical systems require multiple systems or parameter regimes to share a common latent representation. Understanding how distinct dynamical behaviors are embedded, separated, or entangled within this shared space is essential for both interpretability and generalization. Our Navier–Stokes results illustrate this challenge: while Transformers can implicitly track parameter variations, explicit conditioning significantly disentangles the latent space. Developing architectures that formally separate state dynamics from parameter manifolds, and analyzing the resulting geometry, could be critical for universal forecasting models.

  • Normalization, positional encoding, and latent dimensionality: Several architectural components inherited from NLP appear suboptimal for dynamical systems. Softmax normalization enforces non-negativity and convexity that restrict representable dynamics; positional encodings do not consistently benefit time-continuous physical systems; and our Chafee–Infante results show that latent dimension must exceed the intrinsic manifold dimension to allow proper unfolding before projection. These observations motivate the development of alternative normalization schemes, position-handling mechanisms, and principled guidelines for latent dimensionality—potentially drawing on numerical analysis and dynamical systems theory.

Ultimately, we hope this study may serve as a bridge between the empirical success of large-scale models and classical dynamical systems theory. It cautions against treating Transformers as "black-box" universal approximators, highlighting that their inductive biases, specifically regarding spectral filtering and manifold topology, must be carefully aligned with the physical systems they are intended to model.

Acknowledgments

GD and EC were partially supported by the French-Swiss project MISTERY funded by the French National Research Agency (ANR PRCI Grant No. 266157) and the Swiss National Science Foundation (Grant No. 200021L_212718). NE and IGK were partially supported by the US Department of Energy and the US National Science Foundation.

Appendix A Additional Results

A.1 Van der Pol

In this section, we provide additional complementary results for the Van der Pol oscillator described in Section 4.1.

Specifically, we provide additional visualizations for the models we focused on in the main text.

For the models trained with a one-dimensional inner dimension, in Figures S1 and S2 we plot $x(t)$ against the query (Q), key (K), and value (V) vectors for the model without positional encoding and the model with learned positional encoding (P.E.), respectively. As discussed in the main text, we find that visual inspection of these components does not offer additional interpretability for our use cases.

We also include visualizations for models trained with a two-dimensional inner dimension (Figures S3 and S4), in which the embedded inputs, queries, values, and correction terms are all plotted in the two-dimensional latent space.

Figure S1: Visualization of attention-related latent variables for one of the one-dimensional Transformers we trained without positional encoding.
Figure S2: Visualization of attention-related latent variables for the 1D Transformer we trained with learned positional encoding.
Figure S3: Visualization of the learned latent space for the 2D Transformer model without positional encoding.
Figure S4: Visualization of the learned latent space for the 2D Transformer model with learned positional encoding.

We also report the internal embeddings across all models that we trained for the Van der Pol oscillator. For the models trained with a one-dimensional embedding space we plot $x(t)$ against $x(t)+Z$ in Figure S5. As is evident, the characteristic limit-cycle structure of the Van der Pol oscillator begins to emerge in the learned representation.

Figure S5: For the 10 models with inner dimension one trained with different random seeds, we inspect their internal embeddings by plotting $y_{1}(t)$ versus $y_{1}(t)+Z$: (a) model with learned positional encoding; (b) model without positional encoding.

For the models with a two-dimensional embedding space, Fig. S6 further shows that the learned embedding reveals the limit cycles. In this case, the limit-cycle structure is more evident than in the one-dimensional inner-dimension case.

Figure S6: For the 10 models with inner dimension two trained with different random seeds, we inspect their internal embeddings by plotting $Z_{1}+y_{1}^{emb_{1}}(t)$ versus $Z_{1}+y_{1}^{emb_{2}}(t)$: (a) model with learned positional encoding; (b) model without positional encoding.

A.2 Chafee–Infante

We provide additional complementary results for the Chafee–Infante equation discussed in Section 4.2 of the main text.

As in the Van der Pol case, we focus on additional visualizations for the models analyzed in the main text.

We first report visualizations of the query (Q), key (K), value (V), and correction terms for representative models with three- and two-dimensional latent spaces in Figures S7 and S8 respectively.

We also examine the consistency of the learned internal embeddings across all trained models by plotting $\boldsymbol{z}+\boldsymbol{u}_{10}^{\mathrm{emb}}$, colored by the leading Fourier modes $\boldsymbol{\phi}$. The models trained with a two-dimensional inner dimension are shown in Fig. S9. For none of these models does visual inspection of the latent space reveal a clearly unfolded two-dimensional structure. Nevertheless, the Fourier modes still exhibit a discernible relationship with the latent variables. For the models trained with a three-dimensional inner dimension (Figure S10), visual inspection of the latent space (based on two-dimensional projections) indicates a clearly unfolded three-dimensional structure for all models except one. In this case, the leading Fourier modes exhibit a clearer association with the latent variables across the different random seeds.

Figure S7: Visualization of attention-related latent variables for one of the three-dimensional latent space transformers we trained with no positional encoding.
Figure S8: Visualization of attention-related latent variables for one of the two-dimensional latent space transformers we trained with no positional encoding.
Figure S9: For the 10 models with inner dimension two trained with different random seeds, we inspect their internal embeddings by plotting $\boldsymbol{z}+\boldsymbol{u}_{10}^{\mathrm{emb}}$ colored by the Fourier modes $\phi_{1},\phi_{2},\phi_{3}$: (a) model with learned positional encoding; (b) model without positional encoding.
Figure S10: For the 10 models with inner dimension three trained with different random seeds, we inspect their internal embeddings by plotting $\boldsymbol{z}+\boldsymbol{u}_{10}^{\mathrm{emb}}$ colored by the Fourier modes $\phi_{1},\phi_{2},\phi_{3}$: (a) model with learned positional encoding; (b) model without positional encoding.