Transformers for Physical Systems
Reference: Geneva & Zabaras (2022) established the connection between self-attention and explicit time-integration for physical systems.
Transformers, originally developed for natural language processing, can learn to approximate the dynamics of physical systems by modeling temporal sequences. The key insight is that self-attention with residual connections can approximate explicit multi-step time-integration methods, providing a learned alternative to traditional numerical schemes.
Self-Attention Mechanism
Self-attention computes a weighted combination of features from past time-steps. For each time-step \(i\), three vectors are computed:
- Query: \(q_i = F_q(x_i) \in \mathbb{R}^{d_k}\)
- Key: \(k_i = F_k(x_i) \in \mathbb{R}^{d_k}\)
- Value: \(v_i = F_v(x_i) \in \mathbb{R}^{d_v}\)
where \(F_q\), \(F_k\), \(F_v\) are learnable neural networks. The attention mechanism computes:
Attention scores (softmax-normalized similarities): \[\alpha_{n,i} = \frac{\exp(q_n^\top k_i / \sqrt{d_k})}{\sum_{j=1}^{k} \exp(q_n^\top k_j / \sqrt{d_k})}\]
Context output (weighted combination): \[c_n = \sum_{i=1}^{k} \alpha_{n,i} v_i\]
In matrix form for a context window of length \(k\): \[C = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]
where \(Q, K \in \mathbb{R}^{k \times d_k}\) and \(V \in \mathbb{R}^{k \times d_v}\) contain all query, key, and value vectors.
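The matrix form above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the function name, shapes, and random inputs are our own choices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute C = softmax(Q K^T / sqrt(d_k)) V over a context of k steps.

    Q, K: (k, d_k) query/key matrices; V: (k, d_v) value matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (k, k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # each row of alpha sums to 1
    return alpha @ V                              # context matrix C, shape (k, d_v)

rng = np.random.default_rng(0)
k, d_k, d_v = 5, 8, 4
Q, K, V = rng.normal(size=(k, d_k)), rng.normal(size=(k, d_k)), rng.normal(size=(k, d_v))
C = scaled_dot_product_attention(Q, K, V)
print(C.shape)  # (5, 4)
```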
Connection to Explicit Time Integration
The fundamental connection: self-attention with residual connections can approximate explicit Adams multi-step methods.
Mathematical Equivalence
Explicit Adams method (\(s\)-step): \[\varphi_{n+s} = \varphi_{n+s-1} + \Delta t \sum_{j=0}^{s-1} b_j f(t_{n+j}, \varphi_{n+j})\]
where \(b_j\) are method coefficients and \(f\) is the right-hand side of \(\dot{\varphi} = f(t, \varphi)\).
Residual self-attention: \[\hat{\varphi}_{n+s} = \varphi_{n+s-1} + \sum_{i=0}^{s-1} \alpha_{n+s-1,i} v_i\]
Both compute: previous state + weighted combination of past information.
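The correspondence can be made concrete with a small NumPy sketch (illustrative, not from the paper): a two-step Adams-Bashforth update rewritten as a residual attention step with \(\alpha_i = b_i / \sum_j b_j\) and \(v_i = \Delta t \sum_j b_j \, f(t_i, \varphi_i)\). Note that one of the resulting weights is negative, which a softmax cannot produce directly; in a trained model the value network absorbs such signs and scales.

```python
import numpy as np

# Scalar ODE: dphi/dt = f(t, phi) = -phi, with exact solution phi(t) = exp(-t).
f = lambda t, phi: -phi

dt = 0.1
t = np.array([0.0, dt])
phi = np.exp(-t)                      # two exact past states phi_n, phi_{n+1}

# Two-step Adams-Bashforth: phi_{n+2} = phi_{n+1} + dt*(3/2 f_{n+1} - 1/2 f_n)
b = np.array([-0.5, 1.5])             # method coefficients b_0, b_1
ab2 = phi[1] + dt * (b[0] * f(t[0], phi[0]) + b[1] * f(t[1], phi[1]))

# The same step written as residual attention: previous state + sum_i alpha_i * v_i,
# with alpha_i = b_i / sum(b) and v_i = dt * sum(b) * f(t_i, phi_i).
alpha = b / b.sum()                   # attention-style weights (sum to 1)
v = dt * b.sum() * np.array([f(t[0], phi[0]), f(t[1], phi[1])])
attn = phi[1] + alpha @ v

print(ab2, attn)                      # identical up to floating-point rounding
```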
Universal Approximation Theorem
Theorem 1 (Geneva & Zabaras, 2022): For a dynamical system \(\dot{\varphi} = f(t, \varphi)\) with Lipschitz continuous \(f\), a self-attention layer \(A_\theta\) with residual connection and context length \(k\) can approximate any explicit Adams method \(M_i\) of order \(i \leq k\) to within arbitrarily small error \(\epsilon > 0\).
Proof sketch: By setting attention weights \(\alpha_{n+s-1,i} = b_i / \sum_j b_j\) and using a neural network to approximate \(v_i = c \cdot f(t_{n+i}, \varphi_{n+i}) + O(\epsilon)\) with \(c = \Delta t \sum_j b_j\), the self-attention layer reproduces the Adams method. The universal approximation property of neural networks ensures the error can be made arbitrarily small.
Beyond Fixed Schemes
While a single layer approximates Adams methods, the full transformer architecture learns:
- Multi-scale temporal patterns: Different attention heads capture different time scales
- Adaptive integration: The effective integration scheme adapts to system state
- Long-term dependencies: Representations encode relationships beyond the context window
Transformer Architecture for Physical Systems

The model uses a transformer decoder (GPT-style) architecture:
- Masked self-attention: Each time-step attends only to previous time-steps
- Positional encoding: Sinusoidal or time-based encoding provides temporal information
- Residual connections: Enable identity mappings and gradient flow
- Multi-head attention: Parallel attention heads learn different dependency patterns
- Auto-regressive prediction: Next state predicted from all previous states
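The masked (causal) attention in the first bullet can be sketched in NumPy: positions above the diagonal of the score matrix are set to \(-\infty\) before the softmax, so each time-step attends only to itself and its past. The function name and shapes here are illustrative, not from the paper.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: time-step n attends only to steps <= n."""
    k, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((k, k), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)          # block attention to the future
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    alpha = np.exp(scores)                            # exp(-inf) = 0 for future steps
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha, alpha @ V

rng = np.random.default_rng(1)
k, d = 4, 3
alpha, C = causal_attention(*(rng.normal(size=(k, d)) for _ in range(3)))
print(np.triu(alpha, k=1))  # strictly upper triangle is all zeros: no future leakage
```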
Training Objective
Given embedded time-series \(\mathcal{D} = \{\Xi_i\}_{i=1}^D\) where \(\Xi_i = (\xi_0, \ldots, \xi_T)\), the transformer is trained auto-regressively to predict each next embedded state from its predecessors. Unlike NLP, where discrete tokens are predicted with a softmax over a vocabulary, physical states are continuous; modeling them with a Gaussian likelihood yields an L2 (mean-squared-error) loss.
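A plausible form of this objective, consistent with the Gaussian-likelihood statement above (the exact normalization and weighting in the paper may differ), is:

\[\mathcal{L} = \frac{1}{D}\sum_{i=1}^{D} \sum_{t=1}^{T} \left\| \hat{\xi}_t^{(i)} - \xi_t^{(i)} \right\|_2^2\]

where \(\hat{\xi}_t^{(i)}\) is the transformer's prediction of the \(t\)-th embedded state of trajectory \(i\).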
Koopman Embedding
Physical states must be embedded into a lower-dimensional representation. The approach uses Koopman operator theory to learn embeddings where dynamics are approximately linear.
Koopman Operator
For a discrete-time system \(\varphi_{i+1} = F(\varphi_i)\), the Koopman operator \(K\) acts on observables \(g\) by composition with the dynamics: \[(K g)(\varphi_i) = g(F(\varphi_i)) = g(\varphi_{i+1})\]
This implies linear evolution in observable space: \(g(\varphi_{i+1}) = K g(\varphi_i)\), \(g(\varphi_{i+2}) = K^2 g(\varphi_i)\), and so on, even when \(F\) itself is nonlinear.
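A minimal worked example, for a system simple enough that the Koopman operator is exactly finite-dimensional (the choice of system and observables is ours, for illustration): for \(\varphi_{i+1} = \lambda \varphi_i\) with monomial observables \(g_m(\varphi) = \varphi^m\), \(K\) is diagonal with entries \(\lambda^m\).

```python
import numpy as np

lam = 0.9
F = lambda phi: lam * phi                  # linear dynamics phi_{i+1} = lam * phi_i

# Observables g_m(phi) = phi^m for m = 1, 2, 3, stacked into a vector g(phi).
g = lambda phi: np.array([phi, phi**2, phi**3])

# For this system the Koopman operator is exactly linear and diagonal:
# g_m(F(phi)) = (lam * phi)^m = lam^m * g_m(phi).
Kop = np.diag([lam, lam**2, lam**3])

phi0 = 2.0
lhs = g(F(phi0))        # observables of the advanced state
rhs = Kop @ g(phi0)     # linear evolution in observable space
print(lhs, rhs)         # equal: g(phi_{i+1}) = K g(phi_i)
```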
Encoder-Decoder Architecture

- Encoder: \(F : \mathbb{R}^{n \times d} \to \mathbb{R}^e\) maps states to observables \(\xi_i = F(\varphi_i)\)
- Decoder: \(G : \mathbb{R}^e \to \mathbb{R}^{n \times d}\) reconstructs states \(\hat{\varphi}_i = G(\xi_i)\)
- Koopman operator: Learnable linear \(K\) (often banded) evolves observables: \(\xi_{i+1} = K \xi_i\)
Training Loss
The embedding model minimizes:
- Reconstruction: Accurate encoding/decoding
- Dynamics: Linear evolution in embedded space
- Regularization: Prevents overfitting
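The three loss terms can be sketched on toy data with linear stand-ins for the encoder, decoder, and \(K\). All names, shapes, and loss weights here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for the learned maps (linear here, purely for illustration):
W_enc = rng.normal(size=(4, 6)) * 0.1   # encoder F: state (6,) -> observable (4,)
W_dec = np.linalg.pinv(W_enc)           # decoder G: observable (4,) -> state (6,)
Kop   = np.eye(4) * 0.95                # learnable Koopman matrix

encode = lambda phi: W_enc @ phi
decode = lambda xi: W_dec @ xi

traj = rng.normal(size=(10, 6))         # one trajectory of 10 physical states
xi = np.array([encode(p) for p in traj])

# Reconstruction: decoded observables should recover the original states.
recon = np.mean([np.sum((decode(x) - p) ** 2) for x, p in zip(xi, traj)])
# Dynamics: observables should evolve linearly, xi_{i+1} ~ K xi_i.
dyn = np.mean(np.sum((xi[1:] - xi[:-1] @ Kop.T) ** 2, axis=1))
# Regularization: penalize the Koopman matrix (weight is an arbitrary choice).
reg = np.sum(Kop ** 2)

loss = recon + dyn + 1e-4 * reg
print(loss)
```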
The embedding is trained first, then frozen. The transformer trains on embedded trajectories.
Applications
Surrogate Modeling
Transformers replace expensive numerical solvers for:
- Optimization: Many repeated simulations with different parameters
- Design: Exploration of large parameter spaces
- Inverse problems: Iterative parameter estimation
- Uncertainty quantification: Monte Carlo sampling
Advantages
- Long-term dependencies: Self-attention maintains relationships across the full context window
- Multi-scale patterns: Different heads capture different temporal scales
- Generalization: Models entire families of problems (distributions of initial/boundary conditions, parameters)
- Learned schemes: Adaptive integration tailored to system dynamics
Demonstrated Applications
- Chaotic systems: Lorenz system with sensitive dependence on initial conditions
- Fluid dynamics: 2D Navier-Stokes flow around a cylinder (various Reynolds numbers)
- Reaction-diffusion: 3D Gray-Scott system with complex spatiotemporal patterns
Transformers outperform LSTMs, auto-regressive models, and echo-state networks, particularly when extrapolating beyond the training context length.
How to Use Transformers for Physical Systems
Workflow
- Collect training data: Generate trajectories using a numerical solver for various initial conditions, boundary conditions, and parameters
- Train embedding model: Learn Koopman observables that linearize the dynamics
- Embed all data: Convert physical states to embedded representations
- Train transformer: Learn temporal dynamics in embedded space
- Inference: Predict future states by encoding initial conditions, running the transformer, and decoding
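The inference step of this workflow can be sketched as an auto-regressive rollout, with trivial stand-in functions in place of the trained encoder, transformer, and decoder. Everything here (names, the decay dynamics, the context length \(k\)) is a hypothetical placeholder.

```python
import numpy as np

def rollout(encode, transformer_step, decode, initial_states, n_steps, k):
    """Auto-regressive inference: encode -> predict in embedded space -> decode.

    transformer_step consumes the last <= k embedded states and returns the next one.
    """
    xi = [encode(phi) for phi in initial_states]
    for _ in range(n_steps):
        xi.append(transformer_step(np.stack(xi[-k:])))   # sliding context window
    return np.array([decode(x) for x in xi])

# Stand-in components (the real ones would be the trained networks):
encode = lambda phi: phi * 2.0
decode = lambda xi: xi / 2.0
transformer_step = lambda ctx: 0.9 * ctx[-1]             # pretend dynamics: decay

traj = rollout(encode, transformer_step, decode, [np.ones(3)], n_steps=5, k=4)
print(traj.shape)  # (6, 3): the initial state plus 5 predicted steps
```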
Key Design Choices
- Context length: Determines how many past time-steps the model can attend to
- Embedding dimension: Balance between compression and information preservation
- Architecture depth: Number of transformer layers affects expressivity
- Multi-head attention: More heads enable richer temporal pattern learning
The transformer learns an adaptive integration scheme that can outperform fixed numerical methods for the specific dynamics it was trained on.
References
Geneva, N., & Zabaras, N. (2022). Transformers for modeling physical systems. Neural Networks, 146, 272-289.