Transformers for Physical Systems
Reference: Geneva & Zabaras (2022) established the connection between self-attention and explicit time-integration for physical systems.
Transformers, originally developed for natural language processing, can learn to approximate the dynamics of physical systems by modeling temporal sequences. The key insight is that self-attention with residual connections can approximate explicit multi-step time-integration methods, providing a learned alternative to traditional numerical schemes.
Self-Attention Mechanism
Self-attention computes a weighted combination of features from past time-steps. For each time-step \(i\), three vectors are computed:
- Query: \(q_i = F_q(x_i) \in \mathbb{R}^{d_k}\)
- Key: \(k_i = F_k(x_i) \in \mathbb{R}^{d_k}\)
- Value: \(v_i = F_v(x_i) \in \mathbb{R}^{d_v}\)
where \(F_q\), \(F_k\), \(F_v\) are learnable neural networks. The attention mechanism computes:
Attention scores (softmax-normalized similarities): \[\alpha_{n,i} = \frac{\exp(q_n^\top k_i / \sqrt{d_k})}{\sum_{j=1}^{k} \exp(q_n^\top k_j / \sqrt{d_k})}\]
Context output (weighted combination): \[c_n = \sum_{i=1}^{k} \alpha_{n,i} v_i\]
In matrix form for a context window of length \(k\): \[C = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]
where \(Q, K \in \mathbb{R}^{k \times d_k}\) and \(V \in \mathbb{R}^{k \times d_v}\) contain all query, key, and value vectors.
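The matrix form above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the function name, shapes, and random inputs are our own choices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute C = softmax(Q K^T / sqrt(d_k)) V over a context of k steps.

    Q, K: (k, d_k) query/key matrices; V: (k, d_v) value matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (k, k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # each row of alpha sums to 1
    return alpha @ V                              # context matrix C, shape (k, d_v)

rng = np.random.default_rng(0)
k, d_k, d_v = 5, 8, 4
Q, K, V = rng.normal(size=(k, d_k)), rng.normal(size=(k, d_k)), rng.normal(size=(k, d_v))
C = scaled_dot_product_attention(Q, K, V)
print(C.shape)  # (5, 4)
```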
Connection to Explicit Time Integration
The fundamental connection: self-attention with residual connections can approximate explicit Adams multi-step methods.
Mathematical Equivalence
Explicit Adams method (\(s\)-step): \[\varphi_{n+s} = \varphi_{n+s-1} + \Delta t \sum_{j=0}^{s-1} b_j f(t_{n+j}, \varphi_{n+j})\]
where \(b_j\) are method coefficients and \(f\) is the right-hand side of \(\dot{\varphi} = f(t, \varphi)\).
Residual self-attention: \[\hat{\varphi}_{n+s} = \varphi_{n+s-1} + \sum_{i=0}^{s-1} \alpha_{n+s-1,i} v_i\]
Both compute: previous state + weighted combination of past information.
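The correspondence can be made concrete with a small NumPy sketch (illustrative, not from the paper): a two-step Adams-Bashforth update rewritten as a residual attention step with \(\alpha_i = b_i / \sum_j b_j\) and \(v_i = \Delta t \sum_j b_j \, f(t_i, \varphi_i)\). Note that one of the resulting weights is negative, which a softmax cannot produce directly; in a trained model the value network absorbs such signs and scales.

```python
import numpy as np

# Scalar ODE: dphi/dt = f(t, phi) = -phi, with exact solution phi(t) = exp(-t).
f = lambda t, phi: -phi

dt = 0.1
t = np.array([0.0, dt])
phi = np.exp(-t)                      # two exact past states phi_n, phi_{n+1}

# Two-step Adams-Bashforth: phi_{n+2} = phi_{n+1} + dt*(3/2 f_{n+1} - 1/2 f_n)
b = np.array([-0.5, 1.5])             # method coefficients b_0, b_1
ab2 = phi[1] + dt * (b[0] * f(t[0], phi[0]) + b[1] * f(t[1], phi[1]))

# The same step written as residual attention: previous state + sum_i alpha_i * v_i,
# with alpha_i = b_i / sum(b) and v_i = dt * sum(b) * f(t_i, phi_i).
alpha = b / b.sum()                   # attention-style weights (sum to 1)
v = dt * b.sum() * np.array([f(t[0], phi[0]), f(t[1], phi[1])])
attn = phi[1] + alpha @ v

print(ab2, attn)                      # identical up to floating-point rounding
```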
Universal Approximation Theorem
Theorem 1 (Geneva & Zabaras, 2022): For a dynamical system \(\dot{\varphi} = f(t, \varphi)\) with Lipschitz continuous \(f\), a self-attention layer \(A_\theta\) with residual connection and context length \(k\) can approximate any explicit Adams method \(M_i\) of order \(i \leq k\) to within arbitrarily small error \(\epsilon > 0\).
Proof sketch: By setting attention weights \(\alpha_{n+s-1,i} = b_i / \sum_j b_j\) and using a neural network to approximate \(v_i = c \cdot f(t_{n+i}, \varphi_{n+i}) + O(\epsilon)\) with \(c = \Delta t \sum_j b_j\), the self-attention layer reproduces the Adams method. The universal approximation property of neural networks ensures the error can be made arbitrarily small.
Beyond Fixed Schemes
While a single layer approximates Adams methods, the full transformer architecture learns:
- Multi-scale temporal patterns: Different attention heads capture different time scales
- Adaptive integration: The effective integration scheme adapts to system state
- Long-term dependencies: Representations encode relationships beyond the context window
Transformer Architecture for Physical Systems

The model uses a transformer decoder (GPT-style) architecture:
- Masked self-attention: Each time-step attends only to previous time-steps
- Positional encoding: Sinusoidal or time-based encoding provides temporal information
- Residual connections: Enable identity mappings and gradient flow
- Multi-head attention: Parallel attention heads learn different dependency patterns
- Auto-regressive prediction: Next state predicted from all previous states
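The masked (causal) attention in the first bullet can be sketched in NumPy: positions above the diagonal of the score matrix are set to \(-\infty\) before the softmax, so each time-step attends only to itself and its past. The function name and shapes here are illustrative, not from the paper.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: time-step n attends only to steps <= n."""
    k, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((k, k), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)          # block attention to the future
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    alpha = np.exp(scores)                            # exp(-inf) = 0 for future steps
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha, alpha @ V

rng = np.random.default_rng(1)
k, d = 4, 3
alpha, C = causal_attention(*(rng.normal(size=(k, d)) for _ in range(3)))
print(np.triu(alpha, k=1))  # strictly upper triangle is all zeros: no future leakage
```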
Training Objective
Given embedded time-series \(\mathcal{D} = \{\Xi_i\}_{i=1}^D\) where \(\Xi_i = (\xi_0, \ldots, \xi_T)\), the transformer is trained auto-regressively to predict each next embedded state from its predecessors. Unlike NLP, where discrete tokens are predicted with a softmax over a vocabulary, physical states are continuous; modeling them with a Gaussian likelihood yields an L2 (mean-squared-error) loss.
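A plausible form of this objective, consistent with the Gaussian-likelihood statement above (the exact normalization and weighting in the paper may differ), is:

\[\mathcal{L} = \frac{1}{D}\sum_{i=1}^{D} \sum_{t=1}^{T} \left\| \hat{\xi}_t^{(i)} - \xi_t^{(i)} \right\|_2^2\]

where \(\hat{\xi}_t^{(i)}\) is the transformer's prediction of the \(t\)-th embedded state of trajectory \(i\).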
Koopman Embedding
Physical states must be embedded into a lower-dimensional representation. The approach uses Koopman operator theory to learn embeddings where dynamics are approximately linear.
Koopman Operator
For a discrete-time system \(\varphi_{i+1} = F(\varphi_i)\), the Koopman operator \(K\) acts on observables \(g\) by composition with the dynamics: \[(K g)(\varphi_i) = g(F(\varphi_i)) = g(\varphi_{i+1})\]
This implies linear evolution in observable space: \(g(\varphi_{i+1}) = K g(\varphi_i)\), \(g(\varphi_{i+2}) = K^2 g(\varphi_i)\), and so on, even when \(F\) itself is nonlinear.
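A minimal worked example, for a system simple enough that the Koopman operator is exactly finite-dimensional (the choice of system and observables is ours, for illustration): for \(\varphi_{i+1} = \lambda \varphi_i\) with monomial observables \(g_m(\varphi) = \varphi^m\), \(K\) is diagonal with entries \(\lambda^m\).

```python
import numpy as np

lam = 0.9
F = lambda phi: lam * phi                  # linear dynamics phi_{i+1} = lam * phi_i

# Observables g_m(phi) = phi^m for m = 1, 2, 3, stacked into a vector g(phi).
g = lambda phi: np.array([phi, phi**2, phi**3])

# For this system the Koopman operator is exactly linear and diagonal:
# g_m(F(phi)) = (lam * phi)^m = lam^m * g_m(phi).
Kop = np.diag([lam, lam**2, lam**3])

phi0 = 2.0
lhs = g(F(phi0))        # observables of the advanced state
rhs = Kop @ g(phi0)     # linear evolution in observable space
print(lhs, rhs)         # equal: g(phi_{i+1}) = K g(phi_i)
```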
Encoder-Decoder Architecture

- Encoder: \(F : \mathbb{R}^{n \times d} \to \mathbb{R}^e\) maps states to observables \(\xi_i = F(\varphi_i)\)
- Decoder: \(G : \mathbb{R}^e \to \mathbb{R}^{n \times d}\) reconstructs states \(\hat{\varphi}_i = G(\xi_i)\)
- Koopman operator: Learnable linear \(K\) (often banded) evolves observables: \(\xi_{i+1} = K \xi_i\)
Training Loss
The embedding model minimizes:
- Reconstruction: Accurate encoding/decoding
- Dynamics: Linear evolution in embedded space
- Regularization: Prevents overfitting
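The three loss terms can be sketched on toy data with linear stand-ins for the encoder, decoder, and \(K\). All names, shapes, and loss weights here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for the learned maps (linear here, purely for illustration):
W_enc = rng.normal(size=(4, 6)) * 0.1   # encoder F: state (6,) -> observable (4,)
W_dec = np.linalg.pinv(W_enc)           # decoder G: observable (4,) -> state (6,)
Kop   = np.eye(4) * 0.95                # learnable Koopman matrix

encode = lambda phi: W_enc @ phi
decode = lambda xi: W_dec @ xi

traj = rng.normal(size=(10, 6))         # one trajectory of 10 physical states
xi = np.array([encode(p) for p in traj])

# Reconstruction: decoded observables should recover the original states.
recon = np.mean([np.sum((decode(x) - p) ** 2) for x, p in zip(xi, traj)])
# Dynamics: observables should evolve linearly, xi_{i+1} ~ K xi_i.
dyn = np.mean(np.sum((xi[1:] - xi[:-1] @ Kop.T) ** 2, axis=1))
# Regularization: penalize the Koopman matrix (weight is an arbitrary choice).
reg = np.sum(Kop ** 2)

loss = recon + dyn + 1e-4 * reg
print(loss)
```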
The embedding is trained first, then frozen. The transformer trains on embedded trajectories.
Applications
Surrogate Modeling
Transformers replace expensive numerical solvers for:
- Optimization: Many repeated simulations with different parameters
- Design: Exploration of large parameter spaces
- Inverse problems: Iterative parameter estimation
- Uncertainty quantification: Monte Carlo sampling
Advantages
- Long-term dependencies: Self-attention maintains relationships across the full context window
- Multi-scale patterns: Different heads capture different temporal scales
- Generalization: Models entire families of problems (distributions of initial/boundary conditions, parameters)
- Learned schemes: Adaptive integration tailored to system dynamics
Demonstrated Applications
- Chaotic systems: Lorenz system with sensitive dependence on initial conditions
- Fluid dynamics: 2D Navier-Stokes flow around a cylinder (various Reynolds numbers)
- Reaction-diffusion: 3D Gray-Scott system with complex spatiotemporal patterns
Transformers outperform LSTMs, auto-regressive models, and echo-state networks, particularly when extrapolating beyond the training context length.
How to Use Transformers for Physical Systems
Workflow
- Collect training data: Generate trajectories using a numerical solver for various initial conditions, boundary conditions, and parameters
- Train embedding model: Learn Koopman observables that linearize the dynamics
- Embed all data: Convert physical states to embedded representations
- Train transformer: Learn temporal dynamics in embedded space
- Inference: Predict future states by encoding initial conditions, running the transformer, and decoding
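The inference step of this workflow can be sketched as an auto-regressive rollout, with trivial stand-in functions in place of the trained encoder, transformer, and decoder. Everything here (names, the decay dynamics, the context length \(k\)) is a hypothetical placeholder.

```python
import numpy as np

def rollout(encode, transformer_step, decode, initial_states, n_steps, k):
    """Auto-regressive inference: encode -> predict in embedded space -> decode.

    transformer_step consumes the last <= k embedded states and returns the next one.
    """
    xi = [encode(phi) for phi in initial_states]
    for _ in range(n_steps):
        xi.append(transformer_step(np.stack(xi[-k:])))   # sliding context window
    return np.array([decode(x) for x in xi])

# Stand-in components (the real ones would be the trained networks):
encode = lambda phi: phi * 2.0
decode = lambda xi: xi / 2.0
transformer_step = lambda ctx: 0.9 * ctx[-1]             # pretend dynamics: decay

traj = rollout(encode, transformer_step, decode, [np.ones(3)], n_steps=5, k=4)
print(traj.shape)  # (6, 3): the initial state plus 5 predicted steps
```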
Key Design Choices
- Context length: Determines how many past time-steps the model can attend to
- Embedding dimension: Balance between compression and information preservation
- Architecture depth: Number of transformer layers affects expressivity
- Multi-head attention: More heads enable richer temporal pattern learning
The transformer learns an adaptive integration scheme that can outperform fixed numerical methods for the specific dynamics it was trained on.
References
Geneva, N., & Zabaras, N. (2022). Transformers for modeling physical systems. Neural Networks, 146, 272-289.