
Introduction
Simulation is at the heart of modern engineering design. Whether you are optimizing an aerofoil, analyzing stress in a structural component, or predicting heat distribution in an electronic assembly, the workflow is the same: adjust the design, run the solver, interpret the result, repeat. Each iteration can take minutes to hours, and in a design study with hundreds of variants, the solver becomes the bottleneck.
Neural operators (NOs) offer a fundamentally different approach. Instead of solving each partial differential equation (PDE) instance from scratch, they learn the solution map itself from data. At inference time, a new solution takes milliseconds rather than hours. This makes them a natural choice for physics and engineering domains governed by PDEs, where traditional solvers are often the computational bottleneck. Neural operators also serve as a powerful backbone for larger pre-trained models aimed at solving domain-specific problems. They encode the structure of PDE solution maps as an inductive bias, which makes understanding them all the more important. Even amid rapid advances in the neural surrogate literature, the Fourier Neural Operator (FNO) remains one of the best-studied architectures in this space. Understanding how FNO works gives valuable insight into how neural operators more broadly are designed, and that is the focus of this article.
We focus on steady-state problems, where the solution depends only on space, not time. This already covers a large and practically important class of engineering problems. The ideas extend naturally to time-dependent problems, which we will cover in a follow-up post.
By the end of this article, you will understand:
- Why the Fourier transform is the natural foundation for FNO;
- How the architecture learns solution maps rather than single solutions;
- And where FNO genuinely struggles — so you can make an informed decision about whether it is the right tool for your problem.
We use the terms "Fourier space" and "frequency space" interchangeably throughout this article; "physical space" refers to the original spatial domain where the solution $u(x)$ is defined.
From Derivatives to Multiplications: Building Fourier Intuition
If you are already comfortable with the Fourier transform, skip ahead to the Neural Operators section.
The solution to a steady-state PDE is a spatial field $u(x)$ — a value at every point in the domain $\Omega$. The Fourier transform $\mathcal{F}$ decomposes this field into sinusoidal waves at different spatial frequencies, each described by a wavenumber $k$ and a coefficient $\hat{u}(k)$:
$$u(x) = \sum_k \hat{u}(k)\, e^{ikx} \quad \longleftrightarrow \quad \hat{u}(k) = \mathcal{F}\{u\}(k)$$
Think of $u(x)$ as telling you where things happen, and $\hat{u}(k)$ as telling you at what scale. A sharp spike in physical space spreads energy across many frequencies. A smooth, slowly varying field concentrates almost all its energy in just a few low-$k$ modes. Same information but in a different language.
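This energy-concentration picture is easy to check numerically. A minimal NumPy sketch (the two fields are illustrative choices, not from any particular PDE):

```python
import numpy as np

# Compare a smooth two-mode field against a single sharp spike.
N = 256
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
smooth = np.sin(x) + 0.5 * np.cos(2 * x)   # two low-frequency waves
spike = np.zeros(N)
spike[N // 2] = 1.0                        # one sharp impulse

def low_mode_energy_fraction(u, k_max=4):
    """Fraction of spectral energy carried by modes |k| <= k_max."""
    u_hat = np.fft.fft(u)
    energy = np.abs(u_hat) ** 2
    k = np.fft.fftfreq(len(u), d=1.0 / len(u))   # integer wavenumbers
    return energy[np.abs(k) <= k_max].sum() / energy.sum()

print(low_mode_energy_fraction(smooth))   # ~1.0: a few low modes suffice
print(low_mode_energy_fraction(spike))    # ~0.035: spread across all modes
```

The smooth field is fully described by its lowest modes; the spike needs the entire spectrum.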
The practical payoff for PDEs is significant. Differentiating $u$ in physical space is equivalent, after transforming, to a simple multiplication in frequency space: each derivative just multiplies the Fourier coefficient $\hat{u}(k)$ by a factor involving the wavenumber $k$. Under periodic boundaries and spatially constant coefficients, each Fourier mode evolves independently, reducing a large coupled system to a set of decoupled algebraic equations. Variable coefficients generally couple the modes, though the Fourier basis remains computationally valuable even in that case.
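The derivative-to-multiplication property is worth seeing in code. A short sketch, using the arbitrary test function $\sin(3x)$ on a periodic grid:

```python
import numpy as np

# Spectral differentiation: multiply each Fourier coefficient by i*k.
N = 128
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
u = np.sin(3 * x)

k = np.fft.fftfreq(N, d=1.0 / N)          # integer wavenumbers
du_dx = np.fft.ifft(1j * k * np.fft.fft(u)).real

exact = 3 * np.cos(3 * x)
print(np.max(np.abs(du_dx - exact)))      # near machine precision
```

No finite-difference stencil appears anywhere: differentiation has become a pointwise multiplication in frequency space.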
Derivatives are not the only operation that simplifies in this way. A convolution slides a fixed kernel across a signal, blending each point with its neighbours; in frequency space, it also reduces to pointwise multiplication. Because so many operations collapse to per-mode arithmetic, the cost of a spectral method hinges on how many modes the solution actually needs — and in smooth, elliptic, or laminar regimes, most of the energy is captured by a handful of low-frequency modes. Sharp features such as boundary layers, shocks, or discontinuities, however, spread energy across the entire spectrum.
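The convolution theorem can be verified the same way. A sketch comparing a direct $\mathcal{O}(N^2)$ circular convolution against the FFT route (signals are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
u = rng.standard_normal(N)
kern = rng.standard_normal(N)

# Direct circular convolution: O(N^2) sum over all pairs of points.
direct = np.array([
    sum(u[m] * kern[(n - m) % N] for m in range(N)) for n in range(N)
])

# FFT route: pointwise multiplication of spectra, O(N log N).
via_fft = np.fft.ifft(np.fft.fft(u) * np.fft.fft(kern)).real

print(np.allclose(direct, via_fft))   # True
```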
Classical spectral solvers exploit all of this for fixed, known PDEs — but they require you to know the governing equations upfront and re-run from scratch for every new parameter set. What if you could instead learn the solution operator from data? That is the idea behind NOs, and it is where we turn next.
Neural Operators for PDE Surrogates
NOs have emerged as one of the most promising approaches to this problem. Unlike standard neural networks — which are tied to a fixed discretization (the number of grid points used during training) and must be retrained if the mesh changes — a neural operator learns a mapping between function spaces. Given an input field $a(x)$ over $\Omega$ (a spatially varying parameter, boundary condition, or source term), it produces the corresponding solution field $u(x)$ at any resolution. In a steady-state structural problem, for instance, $a(x)$ might be a spatially varying elastic modulus and $u(x)$ the resulting displacement field — feed in a new material distribution, and the trained operator instantly returns the new solution, without re-running the solver.
To understand how this works in practice, it's helpful to walk through the three-stage architecture that all NOs share. Think of it as a pipeline: the input field goes in one end, passes through a sequence of layers that progressively build up a rich picture of the solution, and comes out the other end as the output field. The diagram below shows this at a glance — the sections that follow explain each stage in detail.
(Figure: the three-stage neural operator pipeline, from lifting through operator layers to projection.)
1. Lifting (encoding)
Think of lifting as a pre-processing step. The raw input $a(x)$ — a scalar or small vector at each spatial location, like an elastic modulus or temperature load — is expanded into a much richer internal representation by a small pointwise network $\mathcal{P}$. The same network is applied independently at every point, producing a feature vector of dimension $d_v$ (typically 32–256) at each location. This richer representation, $v_0(x)$, gives the model a more expressive vocabulary before any information exchange between locations begins. That exchange happens only in the operator layers that follow.
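As a sketch, lifting is nothing more than a tiny network broadcast over grid points. All names and sizes below are illustrative stand-ins, not the trained operator:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_a, d_v = 64, 1, 32        # grid points, input channels, hidden width

# Hypothetical learned weights of the pointwise network P (random here).
P1 = rng.standard_normal((d_a, d_v)) * 0.1
P2 = rng.standard_normal((d_v, d_v)) * 0.1

def lift(a):
    """Map a(x) with d_a channels to v_0(x) with d_v channels, pointwise."""
    h = np.maximum(a @ P1, 0.0)   # same weights applied at every location
    return h @ P2                 # no information exchange across points

a = rng.standard_normal((N, d_a))
v0 = lift(a)
print(v0.shape)                   # (64, 32)
```

Because the network is pointwise, lifting a single location in isolation gives the same feature vector as lifting it within the full field.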
2. Operator layers — global meets local
This is where the real computation happens. At each of the $L$ layers, the hidden field $v_\ell(x)$ is updated by asking two questions simultaneously: what is happening across the whole domain? and what is happening locally at this point? The answer to the first comes from an integral over the domain; the answer to the second comes from a pointwise transformation. Combined and passed through a nonlinearity, the result is $v_{\ell+1}(x)$:
$$v_{\ell+1}(x) = \sigma\!\left(W_\ell\, v_\ell(x) + \int_{\Omega} K_\ell(x,y)\, v_\ell(y)\, dy\right), \quad \ell = 0, \ldots, L-1$$
The global term, $\int_\Omega K_\ell(x,y)\,v_\ell(y)\,dy$, aggregates information from every location $y$ in the domain and brings it to bear at location $x$. The learned kernel $K_\ell(x,y)$ controls how much influence each location $y$ has on $x$. This integral formulation is what sets NOs apart from standard neural networks. A conventional network operates on fixed-size vectors — it takes a specific number of inputs and produces a specific number of outputs. The integral operator, by contrast, defines a map between functions: it can be evaluated at any collection of points in the domain, regardless of how many there are or where they are placed. As a side effect, every point can see every other point in a single pass, rather than requiring many stacked layers to propagate information across the domain.
The local term, $W_\ell\,v_\ell(x)$, is a simple learned matrix applied independently at each point. It handles purely local structure that the global integral might miss — for example, how the solution at a point responds directly to the local boundary condition or source term. It also acts as a residual bypass that stabilizes training.
The pointwise function, $\sigma$, is a nonlinearity applied after combining both terms. Without it, stacking layers would collapse to a single linear operation — $\sigma$ is what allows the network to represent non-linear solution behavior.
Why stack $L$ layers? Each layer performs one round of global-local mixing. Early layers pick up coarse, large-scale structure; later layers progressively refine the detail. One layer is rarely enough for complex PDE solutions — stacking several of them lets the network build up a multi-scale picture of the solution.
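The update equation above can be sketched directly, with the kernel integral discretized as a quadrature sum over grid points. The weights are random stand-ins for learned parameters, and the dense kernel makes the $\mathcal{O}(N^2)$ cost explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_v = 64, 16                    # grid points, hidden channels

W = rng.standard_normal((d_v, d_v)) * 0.1          # local weights
K = rng.standard_normal((N, N, d_v, d_v)) * 0.01   # kernel K(x_i, y_j)

def operator_layer(v, dx):
    """v_{l+1}(x) = sigma(W v_l(x) + sum_j K(x, y_j) v_l(y_j) dx)."""
    local_term = v @ W.T                               # pointwise
    # Kernel integral as a quadrature sum over all y_j: O(N^2) pairs.
    global_term = np.einsum('xyij,yj->xi', K, v) * dx
    return np.maximum(local_term + global_term, 0.0)   # ReLU as sigma

v0 = rng.standard_normal((N, d_v))
v1 = operator_layer(v0, dx=1.0 / N)
print(v1.shape)                    # (64, 16)
```

In a real neural operator the kernel is parameterized continuously rather than stored as an $N \times N$ table, which is what keeps the learned map independent of the discretization.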
3. Projection (decoding)
The mirror image of lifting. A second small pointwise network $\mathcal{Q}$ takes the final hidden representation $v_L(x)$ — which by now encodes a rich, multi-scale picture of the solution built up over $L$ layers — and collapses it back down to the physical output dimension: a scalar pressure, a displacement vector, or whatever the problem demands. The output $u_{out}(x)$ is a field over the same domain $\Omega$ as the input, ready for use at inference or comparison against ground truth during training.
The key design decision in any NO is how to approximate the kernel integral efficiently. Different approximation strategies give rise to different architectures — for example, approximating the kernel on a graph of spatial locations yields the Graph Neural Operator (GNO), which naturally handles irregular meshes; parameterizing the kernel directly in Fourier space as learned per-mode matrices yields the FNO, and is what we cover next.
The Fourier Neural Operator (FNO)
From Classical Spectral Methods to a Learned Operator
If you have worked with spectral methods before, FNO will feel immediately familiar, because it is directly motivated by them. Classical spectral solvers expand the solution in a Fourier basis, truncate to a finite number of modes, and exploit the derivative-to-multiplication property to turn the PDE into algebraic equations in frequency space. The core recipe is: work in frequency space, keep only the dominant low-$k$ modes, transform back to physical space.
FNO takes this same recipe and makes it learnable. Instead of hard-coding the spectral operator based on a known PDE, FNO learns the per-mode transformations directly from data. The three pillars of classical spectral methods map directly onto FNO's design: mode truncation (keep only $K_{cut}$ modes), the fast Fourier transform (FFT) as the computational backbone, and a global Fourier basis that handles smooth, spatially-coupled problems naturally. The key departure is that FNO does not require you to know the PDE — the hard-coded physics is replaced by a neural network trained on data.
How Each FNO Layer Works
The bottleneck in any NO is the kernel integral — directly evaluating it costs $\mathcal{O}(N^2)$ for $N$ grid points. FNO's FFT-based computation path reduces this to $\mathcal{O}(N \log N)$ plus per-mode channel mixing — a substantial win for large grids, though the cost also depends on the number of channels and modes retained.
(Figure: the four steps of an FNO layer, from FFT through spectral filtering and inverse FFT to the local bypass and activation.)
Each FNO layer does this in four steps:
Step 1 — FFT: Transform the hidden features from physical space into frequency space. Each spatial pattern becomes a set of Fourier coefficients, one per mode.
Step 2 — Spectral filtering: For each retained mode up to $K_{cut}$, apply a learned matrix $R_\ell(k)$ that mixes the feature channels. Crucially, $R_\ell(k)$ is learned independently for each mode, giving FNO substantial flexibility to construct very complex kernels — though this comes at the cost of a high parameter count. Modes beyond $K_{cut}$ are simply zeroed out.
Step 3 — Inverse FFT: Transform back to physical space.
Step 4 — Combine and apply activation: Add the spectral output to the local bypass term $W_\ell\,v_\ell(x)$ — the same skip connection from the general operator layer — then pass through a nonlinearity $\sigma$:
$$v_{\ell+1}(x) = \sigma\!\left(\underbrace{\mathcal{F}^{-1}\!\big(R_\ell(k)\,\hat{v}_\ell(k)\big)(x)}_{\text{global: spectral path}}+\underbrace{W_\ell\,v_\ell(x) + b}_{\text{local: bypass}}\right)$$
The two terms are labelled directly in the equation — the left handles global frequency-domain mixing, the right handles local pointwise correction.
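Putting the four steps together, one 1D FNO layer can be sketched as follows. The complex per-mode matrices $R_\ell(k)$ and the bypass $W_\ell$ are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_v, K_cut = 64, 8, 6           # grid size, channels, retained modes

R = (rng.standard_normal((K_cut, d_v, d_v))
     + 1j * rng.standard_normal((K_cut, d_v, d_v))) * 0.1
W = rng.standard_normal((d_v, d_v)) * 0.1

def fno_layer(v):
    """One FNO layer on a real (N, d_v) field in 1D."""
    v_hat = np.fft.rfft(v, axis=0)                 # Step 1: FFT
    out_hat = np.zeros_like(v_hat)
    # Step 2: per-mode channel mixing; modes >= K_cut stay zeroed out.
    out_hat[:K_cut] = np.einsum('kij,kj->ki', R, v_hat[:K_cut])
    spectral = np.fft.irfft(out_hat, n=N, axis=0)  # Step 3: inverse FFT
    return np.maximum(spectral + v @ W.T, 0.0)     # Step 4: bypass + ReLU

v = rng.standard_normal((N, d_v))
print(fno_layer(v).shape)          # (64, 8)
```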
Why keep only $K_{cut}$ modes? First, mode truncation brings the practical convenience of resolution-invariance: because the layer operates on a fixed set of low-frequency modes rather than on the full grid, FNOs can generalize across data resolutions without retraining. Second, from a theoretical standpoint, the elliptic and laminar problems that FNO is best suited for tend to have their dominant physics concentrated in the first few Fourier modes. In turbulent flows, energy cascades from large scales (low $k$) to small scales (high $k$) where it is dissipated — the Kolmogorov energy cascade. Keeping only modes up to $K_{cut}$ captures the dominant physics, cuts cost dramatically, and filters out high-frequency noise. For problems with shocks or sharp gradients, this assumption breaks — which is precisely why those cases are listed as limitations.
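Both halves of this argument are easy to verify numerically: truncation is essentially lossless for a smooth field but produces ringing on a discontinuity. A quick sketch with illustrative fields:

```python
import numpy as np

N = 256
x = np.linspace(0, 2 * np.pi, N, endpoint=False)

def truncate(u, k_cut):
    """Zero all Fourier modes above k_cut and transform back."""
    u_hat = np.fft.rfft(u)
    u_hat[k_cut + 1:] = 0.0
    return np.fft.irfft(u_hat, n=N)

smooth = np.sin(x) + 0.3 * np.cos(3 * x)       # laminar-like field
step = np.where(x < np.pi, 1.0, -1.0)          # shock-like field

print(np.max(np.abs(truncate(smooth, 8) - smooth)))  # near zero: lossless
print(np.max(np.abs(truncate(step, 8) - step)))      # large: Gibbs ringing
```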
Practical Considerations
FNO's design choices — working in Fourier space, truncating to $K_{cut}$ modes, relying on the FFT — are what make it fast and effective. But each comes with a trade-off.
Non-Periodic Boundary Conditions (BCs)
Most real engineering domains have inlets, outlets, and solid walls — boundaries that are not periodic. This matters because the FFT implicitly treats the domain as a loop: values at one edge wrap around to the opposite side. If the solution is not periodic, this wrap-around introduces artificial discontinuities that corrupt the spectral representation.
FNO handles this with two complementary strategies. Zero-padding extends the domain with zeros before the FFT, reducing the circular convolution artifacts that arise from wrap-around — it is a practical mitigation, not a principled BC enforcement mechanism. The local bypass term $W_\ell\,v_\ell(x)$ operates purely in physical space and can learn to approximate boundary behavior from data, but it does not guarantee satisfaction of Dirichlet, Neumann, or flux conditions either. BC satisfaction is not guaranteed with standard FNO — for strict boundary enforcement, you typically need to add explicit BC channels as additional inputs, or use geometry- and BC-aware variants. That said, in practice, FNO trained on data that includes BC information often learns to respect boundaries implicitly, which is why it performs reasonably well on many non-periodic problems.
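The effect of zero-padding shows up already in one dimension: the unpadded FFT computes a circular convolution whose edges wrap around, while padding to full length recovers an ordinary linear convolution. A minimal sketch with random signals:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(32)
kern = rng.standard_normal(8)

# Unpadded: circular convolution, edge values wrap to the other side.
circular = np.fft.ifft(np.fft.fft(u) * np.fft.fft(kern, n=32)).real

# Padded to >= len(u) + len(kern) - 1: matches linear convolution.
M = 64
padded = np.fft.ifft(np.fft.fft(u, n=M) * np.fft.fft(kern, n=M)).real
linear = np.convolve(u, kern)              # length 32 + 8 - 1 = 39

print(np.allclose(padded[:39], linear))    # True: no wrap-around
print(np.allclose(circular, linear[:32]))  # False: edges are corrupted
```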
Recovering Fine-Scale Structure from Truncated Modes
If FNO only keeps low-wavenumber modes, how does it produce fine-scale features in the output? The answer is nonlinearity. Applying a nonlinear activation to a sum of low-frequency modes generates new frequency components — the same way multiplying two sinusoids produces new ones at their sum and difference frequencies. Stacking multiple FNO layers progressively couples modes, allowing some fine-scale structure to emerge. However, because truncation happens at every layer, very high-frequency content generated by the nonlinearity can be suppressed again at the next spectral step. In practice, fine detail usually comes from a combination of the local pointwise path, retaining enough modes in $K_{cut}$, and the target solution being predictable from the coarse low-frequency structure.
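Frequency generation by a nonlinearity is easy to observe. Squaring a single mode, an illustrative stand-in for an activation, creates energy at wavenumbers that were not there before:

```python
import numpy as np

N = 128
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
u = np.sin(2 * x)                  # a single mode at k = 2

def active_modes(v, tol=1e-6):
    """Integer wavenumbers carrying non-negligible energy."""
    v_hat = np.fft.rfft(v) / len(v)
    return list(np.nonzero(np.abs(v_hat) > tol)[0])

print(active_modes(u))             # [2]
print(active_modes(u * u))         # [0, 4]: sin^2(2x) = (1 - cos 4x)/2
print(len(active_modes(np.maximum(u, 0.0))))   # many: ReLU spreads energy
```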
The hard limit remains: if the target contains high-frequency structure uncorrelated with the retained modes, no amount of depth will recover it. For smooth, low-to-moderate frequency solutions — which dominate most engineering PDE problems — this limitation rarely binds.
Other Known Limitations
Regular grids only (vanilla FNO). The standard FNO assumes a uniform Cartesian grid — a requirement of the FFT. Real engineering geometries rarely fit this mould: curved surfaces, unstructured meshes, and complex boundaries all require geometry-aware extensions. These extensions exist and are actively developed, but they trade some of FNO's FFT efficiency and simplicity for mesh flexibility. If your problem involves irregular geometry, vanilla FNO is not the right starting point — see the Further Reading section for pointers.
Shock-dominated and discontinuous flows. FNO's strength — concentrating on low-frequency modes — becomes a weakness when the solution contains shocks or sharp discontinuities. Energy spreads across the full spectrum (the Gibbs phenomenon), and FNO struggles unless $K_{cut}$ is set very high, negating the efficiency gains. Hybrid approaches are more appropriate for such problems.
High-dimensional problems. The representation power that comes from vanilla FNO's kernel parameterization carries a steep cost: the parameter count scales exponentially with the problem's dimensionality $d$. In one dimension, the model learns a matrix $R_\ell(k)$ for each mode up to $K_{cut}$; in two dimensions, it must learn a separate matrix for each pair of modes — and so on. This exponential growth of $O(K_{cut}^d)$ quickly makes training FNOs impractical for high-dimensional problems, with training becoming infeasible on consumer-grade hardware. Our Research team devised a physically-motivated kernel design to eliminate that parameter growth. Check the Further Reading section for a reference.
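A back-of-the-envelope count makes the growth concrete. The sizes below are hypothetical but typical in scale; real implementations retain roughly $K_{cut}$ modes per dimension, so the number of mode tuples grows as $K_{cut}^d$:

```python
# Spectral weight count: one complex d_v x d_v matrix per retained mode
# tuple, and the number of tuples grows as K_cut ** d.
d_v, K_cut = 64, 16

for d in (1, 2, 3):
    n_modes = K_cut ** d
    params = n_modes * d_v * d_v * 2   # x2 for real and imaginary parts
    print(f"{d}D: {params:,} spectral parameters per layer")
```

Going from 1D to 3D multiplies the spectral parameter count by $K_{cut}^2 = 256$, before any increase in channel width.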
When to Use FNO — and When Not To
Use FNO when:
- The solution is smooth (elliptic or laminar regime)
- You need many evaluations — design sweeps, uncertainty quantification, and real-time control
- Your domain is a regular Cartesian grid
- Moderate boundary condition strictness is acceptable
- You can generate training data from a classical solver
Be cautious or avoid when:
- The solution contains shocks, discontinuities, or sharp boundary layers
- Strict enforcement of Dirichlet, Neumann, or flux BCs is required
- Your geometry is unstructured or curved (use geometry-aware extensions instead)
- Fine-scale features are highly localised and unpredictable from coarse structure
- Training data is very scarce (fewer than a few hundred solver runs)
- The problem has more than three dimensions
Summary
- Fourier Neural Operators (FNOs) learn a solution operator from data, mapping an input field $a(x)$ to an output solution field $u(x)$, so new PDE instances can be evaluated in milliseconds rather than re-solving from scratch.
- The Fourier transform perspective matters because it turns differentiation into multiplication in frequency space and concentrates smooth physics into low-frequency modes.
- Neural operators share a common lifting → operator layers → projection structure. FNO implements the global interaction term efficiently by moving to Fourier space, mixing channels with learned per-mode matrices, truncating to $K_{cut}$ modes, and transforming back.
- FNO works best for smooth, spatially coupled problems on regular grids, where most of the energy sits in low modes and moderate boundary-condition fidelity is acceptable.
- Its main failure modes are tied to the same design choices: FFT assumptions (periodicity), mode truncation (difficulty with shocks and sharp gradients), and scalability limits in higher dimensions.
Conclusion
FNO is a strong surrogate when you need fast repeated PDE evaluations on regular grids and your solutions are predominantly smooth. It offers an appealing blend of spectral efficiency and learnable global coupling, but it is not a drop-in replacement for PDE solvers in settings that demand strict boundary-condition enforcement, handle irregular geometries, or exhibit shock-dominated dynamics. In those cases, geometry-aware neural operator variants or hybrid approaches are typically a better starting point.
Further Reading
Foundations
- Li et al., Fourier Neural Operator for Parametric PDEs, NeurIPS 2020 — the original FNO paper
- Kovachki et al., Neural Operator: Learning Maps Between Function Spaces, JMLR 2023 — the theoretical framework underlying all neural operators
- Lu et al., Learning Nonlinear Operators via DeepONet, Nature Machine Intelligence 2021 — the main alternative architecture to FNO
- Pope, Turbulent Flows, Cambridge University Press 2000 — the standard reference on turbulence and the Kolmogorov energy cascade
Extensions
- Li et al., Geo-FNO: Fourier Neural Operator with Learned Deformations for PDEs on General Geometries, JMLR 2023 — extends FNO to arbitrary geometries
- Li et al., GINO: Geometry-Informed Neural Operator for Large-Scale 3D PDEs, NeurIPS 2023 — scales to large irregular geometries
- Matveev et al., Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators, NeurIPS 2025 — modifies kernel parametrisation to make it dimensionality invariant, 13×-275× parameter count reduction
Real-World Applications
- Pathak et al., FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators, 2022 — week-long global weather forecasts in under 2 seconds