Model Error, Information Barriers, State Estimation and Prediction in Complex Multiscale Systems

Complex multiscale systems are ubiquitous in many areas. This research expository article discusses the development and applications of a recent information-theoretic framework as well as novel reduced-order nonlinear modeling strategies for understanding and predicting complex multiscale systems. The topics include the basic mathematical properties and qualitative features of complex multiscale systems, statistical prediction and uncertainty quantification, state estimation or data assimilation, and coping with the inevitable model errors in approximating such complex systems. Here, the information-theoretic framework is applied to rigorously quantify the model fidelity, model sensitivity and information barriers arising from different approximation strategies. It also succeeds in assessing the skill of filtering and predicting complex dynamical systems and overcomes the shortcomings in traditional path-wise measurements such as the failure in measuring extreme events. In addition, information theory is incorporated into a systematic data-driven nonlinear stochastic modeling framework that allows effective predictions of nonlinear intermittent time series. Finally, new efficient reduced-order nonlinear modeling strategies combined with information theory for model calibration provide skillful predictions of intermittent extreme events in spatially-extended complex dynamical systems. The contents here include the general mathematical theories, effective numerical procedures, instructive qualitative models, and concrete models from climate, atmosphere and ocean science.

One of the central difficulties in studying these complex multiscale turbulent dynamical systems is that either the dynamical equations for the truth are unknown due to the lack of physical understanding or the resolution in the models is inadequate due to the current computing power [1, 13,18,[23][24][25]. Therefore, understanding the model error from the imperfect dynamics as well as the coarse-grained processes becomes important. From both the theoretical and practical point of view, the following issues are of great interest.

1.
How to measure the skill (i.e., the statistical accuracy) of a given imperfect model in reproducing the present states and predicting the future states in an unbiased fashion? 2.
How to make the best possible estimate of model sensitivity to changes in external or internal parameters by utilizing the imperfect knowledge available of the present state? What are the most sensitive parameters for the change of the model status given uncertain knowledge of the present state? 3.
How to design cheap and practical reduced models that are nevertheless able to capture both the main statistical features of nature and the correct response to external/internal perturbations? 4.
How to develop a systematic data-driven nonlinear modeling and prediction framework that provides skillful forecasts and allows accurate quantifications of the forecast uncertainty? 5.
How to build effective models, efficient algorithms and unbiased quantification criteria for online data assimilation (state estimation or filtering) and prediction especially in the presence of model error?
physics-constrained nonlinear stochastic models [46][47][48]. Applying the information-theoretic framework for model calibration, the reduced-order models with suitable model structures are able to capture both the key dynamical and statistical features as well as the crucial nonlinear and non-Gaussian characteristics such as intermittency and extreme/rare events as observed in nature. Nevertheless, the choice of the reduced or simplified models plays a crucial role in approximating nature. Within an improper model family, even the best model with the most elaborate calibration will result in a large model error. This is known as the information barrier and can be quantified by the information-theoretic framework [27,[49][50][51][52]. In fact, the information-theoretic framework allows a rigorous decomposition of the total model error into an intrinsic barrier and an actual model error. The latter can be eliminated or at least be minimized to a negligible level by the information-optimization criterion [27,28]. Quantifying such information barriers have both theoretic and practical importance. It indicates the futility of model calibration if the information barrier is significant. It can also be used as a guidance to expand the model family of reduced models for the improvement of imperfect models. Note that information barriers appear in both the model fidelity and model sensitivity. A model with perfect model fidelity can still have a significant information barrier in response to the internal/external perturbation and in short term predictions [27].
Another important application of the information-theoretic framework is that it provides a novel and unbiased approach to assess the online data assimilation/filtering and prediction skill in complex multiscale dynamical systems [31,[53][54][55][56][57][58]. The traditional path-wise measurements such as the root-mean-square error and pattern correlation [59,60] are misleading in assessing the model error in both filtering and prediction [31,61]. In fact, these traditional measurements completely fail to quantify the ability of the imperfect models in reproducing the extreme events in nature even in the linear and Gaussian setup. In addiction, these traditional path-wise measurements take into account the information only up to the second order statistics and therefore they have no skill in quantifying the features of intermittency and non-Gaussian probability density functions (PDFs) as well as other salient characteristics in nonlinear multiscale turbulent dynamical systems. On the other hand, the information-theoretic framework combining different information measurements is able to quantify the model error in an unbiased fashion and succeeds in assessing the ability of imperfect models in reproducing both the Gaussian and non-Gaussian features in filtering and forecasting complex nonlinear dynamical systems [31,61].
In practice, due to the incomplete knowledge and the limited computing power for dealing with the complex nonlinear turbulent dynamical systems of nature, reduced-order models are often designed for the state estimation and prediction [62][63][64][65][66][67][68]. Parameterizations of unresolved variables and coarse-grained processes are typically involved in the reduced-order models [69][70][71][72], which result in large uncertainties. Therefore, the reduced-order models aim at capturing the statistical features rather than a single realization of random trajectories of the complex nonlinear turbulent dynamical systems. Among all types of the reduced-order models, linear tangential approximations and Gaussian closure models are widely used in approximating the time evolutions of the statistics of nature [71,[73][74][75]. Despite their simplicity and skillful behavior in some scenarios, these crude approximations fail to capture many crucial features in nature that result from the nonlinear interactions between different variables or nonlinear feedback within different scales. Therefore, nonlinear and non-Gaussian closure becomes important in describing the turbulence [76][77][78][79]. Recently, a new framework of statistical closure models has been developed for improving the skill of the reduced-order models. The new reduced-order models take into account higher-order moments but nevertheless remain computationally efficient [1, 45,50]. With the model calibration based on effective information criteria, these new reduced-order models succeed in capturing the non-Gaussian statistical characteristics including intermittency and extreme events as well as memory effect and temporal correlation. The new reduced-order models have also been used to provide accurate state estimation and prediction of many high-dimensional complex nonlinear turbulent systems [80][81][82].
This research expository article blends new viewpoints and results with a current research summary of a specific perspective. It focuses on both the development and applications of the information-theoretic framework as well as the new reduced-order nonlinear modeling strategies for dealing with model error, information barriers, state estimation and prediction in complex multiscale systems. The contents include the general mathematical framework and theory, effective numerical procedures, instructive qualitative models, and concrete models from climate, atmosphere and ocean science. The remaining of the article is organized as follows. The information-theoretic framework is developed in Section 2. In the same section, various information barriers in the presence of model error are shown via simple but instructive stochastic models. In Section 3, the information-theoretic framework is applied to assess model error in state estimation and prediction with examples coming from both complex scalar models and spatially-extended multiscale turbulent systems. The advantage of the information-theoretic framework over the traditional path-wise measurements are illustrated. Section 4 deals with sensitivity and linear statistical response using the fluctuation-dissipation theorem. An efficient and effective algorithm in finding the most sensitive change directions using information theory is also included in this section. Then, in Section 5, a novel framework of data-driven physics-constrained nonlinear stochastic models and predictions is developed and is applied to predicting the time series of an important atmospheric phenomenon with strong intermittent instabilities and extreme events. Section 6 includes the development of the new effective reduced-order models that involve higher order statistical features. These new models together with the information-optimization model calibration strategy are applied to predicting passive tracer extreme events driven by spatially-extended complex dynamical systems. The article is concluded in Section 7.

An Information-Theoretic Framework of Quantifying Model Error and Model Sensitivity
An information-theoretic framework has recently been developed and applied to quantify model error, model sensitivity and prediction skill [10,[26][27][28][29][30][31][32][33][34]. The natural way to measure the lack of information in one probability density q(u) compared with the true probability density p(u) is through the relative entropy P (p, q) [26,32,40], which is also known as Kullback-Leibler divergence or information divergence [83][84][85]. Despite the lack of symmetry, the relative entropy has two attractive features. First, P (p, q) ≥ 0 with equality if and only if p = q. Second, P (p, q) is invariant under general nonlinear changes of variables. These provide an attract framework for assessing model errors in many applications [23,26,33,34,[86][87][88][89]. To quantify the model error and information barriers, let us denote π the probability density of the perfect model, which is actually unknown in practice. Nevertheless, the least biased estimate π L based on L measurements of the perfect model during the training phase is typically available. Therefore, P (π, π L ) precisely quantifies the intrinsic error that is due to the ignorance of the information beyond the L measurements in the perfect model. On the other hand, denote π M the probability density associated with an imperfect model. Then, the model error in the imperfect model compared with the truth is given by the difference between π M and π, which is quantified by P (π, π M ). Note that P (π, π M ) quantifies the overall model error. Signal-dispersion (e.g., in Equation (6)) and other decomposition methods are often used to access different components of the model error.
In addition, a general description of the model error in complex turbulent systems includes both the statistical information in terms of the PDFs and the dynamical information such as the temporal autocorrelation function. The latter will be emphasized in Sections 2.3-2.5.
In practice, π M is determined by no more information than the available in the prefect model. In addition, the imperfect model is typically defined on a subspace of the coarse-grained, resolved variables of the perfect model. Therefore, throughout the following analysis, we focus on characterizing the statistical departures of the imperfect model dynamics relative to the perfect model on these coarse-grained variables u. Now, consider a class of imperfect models M. The best imperfect model for the coarse-grained variables u is the M * ∈ M so that the perfect model has the smallest additional information beyond the imperfect model distribution π M * , namely P (π, π M * ) = min M∈M P (π, π M ). (2) The following general principle [26] facilitates the practical calculation of (2), P (π, π M L ) = P (π, π L ) + P (π L , π M L ) = (S(π L ) − S(π)) + P (π L , π M L ).
In (3), we have assumed in practice that only L measurements are available in the perfect system and the imperfect model takes into accounts L measurements with L ≤ L. In (3), S(π L ) − S(π) = − π L log π L + π log π (4) is the entropy difference, which precisely measures an intrinsic error from the L measurements of the prefect system. Consequently, the optimization in (2) can be computed by replacing the unknown π by the hypothetically known π L so that the optimal model within the given class satisfies P (π L , π M * L ) = min M∈M P (π L , π M L ).
The most practical setup for utilizing the framework of empirical information theory in many applications arises when both the measurements of the perfect system and its imperfect model involve only the mean and covariance of the resolved variables u so that π L = π G ∼ N (ū, R) and π M := π M G ∼ N (ū M , R M ) are both Gaussian. In this case, P (π G , π M G ) has the explicit formula [2,26] P (π G , π M G ) = Rigorous theorems guarantee this smooth dependence under minimal hypothesis for stochastic systems [90]. By assuming the parameter δ is small enough and doing leading order Taylor expansion of (3), we reach the following result: P (π δ , π M δ ) =S(π L,δ ) − S(π δ ) + P (π L , π M ) + log In the case with perfect model fidelity in terms of the L measurements, namely P (π L , π M ) = 0 or π L (u) = π M (u), the expansion in (7) becomes P (π δ , π M δ ) = S(π L,δ ) − S(π δ ) + where the quadratic discrepancy is measured in the Fisher information metric [91][92][93]. One important scenario in practice involves measuring only the mean and covariance for an imperfect model. We denote π 2,δ = π G,δ the unbiased Gaussian estimate of the perfect model. For the simplicity of statement, we further assume both the covariance of the perfect and imperfect models, R δ and R M,δ , are diagonal such that R δ = (R k ) + (δR k ) and R M,δ = (R M,k ) + (δR M,k ), where |k| ≤ N and (δR k ) and (δR M,k ) are the covariance response to the external perturbation which are all scalar variances. In such a Gaussian setup, Equation (7) becomes P (π δ , π M δ ) =S(π G,δ ) − S(π δ ) + P (π G , π M ) Under the same Gaussian assumptions and perfect model fidelity, the formula in (8) becomes P (π δ , π M δ ) =S(π G,δ ) − S(π δ ) + where the first summation represents the signal contribution whereas the second summation represents the dispersion contribution. The formula in (9) or (10) can be applied to quantify the information barrier in the model sensitivity using imperfect models (see, for example, Sections 2.3 and 4). The information theory developed here plays important roles in quantifying model error, model sensitivity and information barrier, assessing data assimilation and prediction skill as well as developing new reduced-order nonlinear modeling strategies. These topics will all be addressed with instructive examples in the following sections.

Information Barriers in Capturing Model Fidelity
Information barriers are defined broadly as the gap of information obtained from the imperfect model related to that from the perfect one that can never be overcome. In other words, the information barriers imply the impossibility of generating stochastic models in a given model family that capture the missing physics. For example, if the minimization of the information model error in (5) remains significant, then there always exists a portion of information in the perfect model that cannot be recovered by the reduced imperfect models. Such information barriers play important roles in both model fidelity and sensitivity. Quantifying the information barriers have both theoretic and practical importance. It indicates the futility of model calibration if the information barrier is significant. It can also be used as a guidance to expand the model family of reduced models for the improvement of imperfect models.
Below, two simple but illustrative examples will be shown for the information barriers in capturing model fidelity. The study of these information barriers to more sophisticated turbulent dynamical models can be found in [30,50]. The information barrier in the model sensitivity will be discussed in Section 2.3.

First Information Barrier: Using Gaussian Approximation in Non-Gaussian Models
The first piece of information involves using linear Gaussian models to approximate non-Gaussian nature, which is a typical (crude) strategy in many real-world applications [30,94,95]. In addition to the intrinsic barrier in capturing the higher-order statistics (namely the non-Gaussian features), the goal here is to show that there exists an information barrier using the linear Gaussian models in capturing even the second order statistics of the truth in the presence of a time-periodic forcing.
As a simple but illustrative example, consider the following nonlinear dynamical system: This is a simplified version of the model named as "stochastic parameterized extended Kalman filter (SPEKF) model" that is widely used in nonlinear data assimilation and prediction [96][97][98][99].
Here, u can be regarded as a resolved variable and γ is an unresolved process which interacts with u in a nonlinear way. The external forcing F(t) is usually a periodic function that mimics the seasonal/annual cycle or any deterministic cycle that contributes to the system. In (11), the unresolved process γ plays the role of stochastic damping and therefore the statistics of u can be highly non-Gaussian with intermittent instabilities. One nice property of the model in (11) is that the time evolution of all the moments can be written down in closed analytical forms [15,97].
A natural way to approximate u without knowing the detailed structure of the unresolved process γ is the following mean stochastic model (MSm) [15,41], where the mean value of the hidden process γ is used in the dynamics of the resolved variable u. The MSm in (12) is a linear model with additive noise and therefore it has Gaussian statistics.
In both regimes, the periodic forcing is given by F(t) = 5 sin(t). Figure 1 shows sample trajectories of u and γ in the two regimes, respectively. It is clear that in the highly intermittent regime, γ has a frequent transition to values below zero, which triggers large bursts in the signal of u, namely the intermittent instability. On the other hand, in the nearly Gaussian regime, γ stays positive and therefore the signal of u has no intermittent instability. In panels (a)-(d) of Figure 2, the time evolution of the first four moments of u is shown. Despite the strong nonlinear interactions between γ and u, the external forcing F(t) results in periodic behavior in all these statistics. In panels (e)-(g), the PDFs of u at three different time instants within one period t = 13.5, 15 and 16.5 are demonstrated, which are all highly non-Gaussian with significant skewness and kurtosis. As comparison, the skewness and kurtosis in the nearly Gaussian regime (Figure 3) are tiny and the amplitudes of the periodic oscillation in the variance, skewness and kurtosis become much weaker.  Let's denote π the PDF of the nonlinear dynamics in (11) and π M G = π M 2 that of the MSm in (12). It is clear according to (3) that there is an intrinsic barrier S(π) − S(π G ) due to the non-Gaussian nature of π. Next, we show that there exists an information barrier in the statistics of MSm π M G even compared with the Gaussian approximation of π. Below, the mean and variance of u from the perfect model are both computed using their closed analytical forms [97].
First, we take the same parameters F M (t) = F(t) and σ M u = σ u in the MSm (12) as those in the perfect model (11). The time evolutions of the mean u(t) and the variance Var(u(t)) within one period at the statistical equilibrium from the MSm (blue) are shown in panels (a) and (b) of Figure 4. Although the evolution of the mean using MSm is quite close to the truth (black), the variance is strongly underestimated. This is as expected since a large portion of the variance comes from the intermittent events while the MSm stabilizes the system and does not allow such large bursts. As a consequence, the dispersion part of the model error, as defined in (6), becomes huge. Interestingly, despite the small error in the time evolution of the mean (panel (a)), the signal part of the model error remains significant. In fact, according to (6), the signal part of the model error is weighted by the inverse of the model variance, the severe underestimation of which results in such a large error. To overcome the issue of underestimating the variance, a common strategy in improving imperfect model is to inflate the stochastic forcing coefficient σ M u [29,100,101]. Here, the optimization is based on the minimization of the averaged information content P (π, π M ) between the perfect and imperfect models within one period at the statistical equilibrium. In panel (f), we show P (π, π M ) as a function of σ M u . With an inflation of σ M u , the model error does decrease significantly. However, even at the minimum of the curve where σ M * u = 1.5, the model error is still far from zero. In fact, due to the linear nature of the MSm, the forcing F M (t) in the MSm only affects the evolution of the mean. The periodic behavior in the variance can never be captured by the MSm (see panel (h)), which leads to an information barrier. As comparison, the nonlinearity in the nearly Gaussian regime as shown in Figure 5 is much weaker and therefore the information barrier becomes insignificant. In Figure 6, the optimal stochastic forcing σ M * u as well as the information barrier as a function ofγ are shown. It is clear that the information barrier decreases when the dynamical regime goes more towards Gaussian (with an increase ofγ).    The above analysis indicates an important fact. That is, even if only the first two moments (mean and variance) are taken into account in the perfect model, the linear MSm still fails to capture the evolution of these Gaussian statistics, which evolve in a strongly nonlinear way driven by the underlying nonlinear perfect dynamics. Such an information barrier cannot be overcome unless the imperfect model contains nonlinear information. In practice, various closure models are used to approximate the nonlinear behavior in the underlying perfect model. Below, we briefly report the results using a simple Gaussian closure model (GCm) [73,102,103]. Recall the Reynolds' decomposition where· is the ensemble mean and · is the fluctuation with· = 0. With the Reynolds' decomposition, the evolution equations of the mean u =ū, γ =γ, the variance Var(u) = u 2 , Var(γ) = γ 2 and the covariance Cov(u , γ ) = u γ are given by Note that the third and the fifth equations of (14) involve triad interactions u γ 2 and −2u 2 γ , which represent the third order moments. These triad terms come from the nonlinearity of the underlying perfect system. In fact, the perfect model involves quadratic nonlinearity, and therefore the evolution of the k-th order moments always depend on the k + 1-th order ones. To close the system, the Gaussian closure model assumes u γ 2 = −2u 2 γ = 0. The resulting system then involves only the interactions between the mean and covariance, In Figure 4, it is clear from panel (b) that even using exactly the same parameters in GCm as in the perfect one and without optimizing σ M u , the variance recovered from the GCm is much improved compared with that from the MSm. This indicates the fact that the information barrier can be largely overcome by taking into account the nonlinearity in the imperfect model. The remaining error comes from ignoring the third order moments u γ 2 and −2u 2 γ , which are nonzero in the intermittent non-Gaussian regime. More elaborate closure model techniques involve calibrating the third order moments using various approximations [11,44,104], which are not necessary in this simple example but have been shown to be crucial for more complex turbulent dynamical systems. Such topics will be discussed in detail in Section 6.3. Notably, the periodic behavior in the variance using the GCm is captured due to the nonlinear interactions between mean and variance. For example, the time-periodic meanū appears in the equation of u 2 . This is a significant difference compared with the linear MSm. Finally, with the optimal choice of σ M u (panel (f)), the information model error here becomes negligible.

Second Information Barrier: Using Single Point Correlation to Approximate Full Correlation Matrix
The strategy with single point statistics is widely used in climate science [105]. The single point statistics takes into account only the variance at each grid point and ignores the correlations between different grids. Despite the fact that both equilibrium consistency and sensitivity in response in single point statistics can be achieved by turning at most one parameter of the imperfect model, such a strategy is not enough for desirable model performance, which can be measured by the information barrier [10,26,27]. Such an information barrier was first quantified in [44,50]. In the following, we show this information barrier.
Let the PDF from the true model be π(u) with u = (u 0 , u 1 , . . . u J−1 ) T as before. Consider a Gaussian imperfect model where we only measure pointwise marginal PDF π M 1pt (u j ) ≡ π M 1pt,j at each grid pint j = 0, . . . , J − 1. Then, we construct the PDF with only single point statistics from the marginal distribution as π M 1pt = ∏ J−1 j=0 π M 1pt,j . According to [34], the information distance between the truth and imperfect model prediction has the form: with π G 1pt,j = N (ū j , R j ). The first part on the right-hand side of (16) is the intrinsic information barrier in Gaussian approximation. The third part is the model error in the imperfect model as compared to the single point statistics of the perfect model Gaussian fit, which can be vanished (or at least be minimized) by calibrating the imperfect model. The error from single point approximation by ignoring the cross-covariance then comes only from the information barrier in the marginal approximation as shown in the second part on the right-hand side of (16).
Below, we assume the true system is statistically homogeneous, which means the statistics is invariant at different grid points. This is in fact a typical feature in many real applications [2,106,107]. With statistical homogeneity, it is straightforward to show [50] that the diagonal entries of the covariance matrix corresponding to the single point approximation are all the same R 1pt . In addition, the covariance matrixR in the spectral space (associated with the Fourier modes of u) is also a diagonal matrix. DenoteR j the j-th diagonal entry ofR: where u n is the n-th component of u by subtracting the mean. Therefore,R 1pt = ∑ J−1 j=0R j J . By further assuming the same pointwise mean for π G and its single point approximation, the information barrier due to single point statistics approximation becomes where the second equality just applies the definition of R 1pt such that ∑ In addition to compute the information barrier explicitly using (18), the following result provides an effective estimation of such an information barrier, where σ max and σ min are the largest and smallest variances inR. See [50] for more details. Below, we construct a simple linear system to illustrate the information barrier due to the single point statistics approximation and showing the calculation of the formula in (18). An illustration of such information barrier based on a more sophisticated (40-dimensional) turbulent system is included in [50]. Here, the truth is given by a two-dimensional linear model, with the following parameters It is easy to check that the two eigenvalues of L are λ 1 = −1.5 and λ 2 = −0.5 and therefore the linear system (20) is stable. In addition, due to the non-zero coefficients L 01 and L 10 , u 0 and u 1 are correlated. Figure 7 shows the sample trajectories of the two-dimensional model (20) with parameters (21). The covariance matrix and statistical equilibrium can be written down explicitly [15], The non-zero off-diagonal entries clearly indicate the cross-covariance between u 0 and u 1 . Now, we implement the single point statistics approximation, which ignores the off-diagonal entries in (22) and the result is Since this example is extremely simple, it is straightforward to compute the information barrier by plugging the covariance matrices (22) and (23) as well as the zero mean into the explicit formula of the relative entropy (6), which yields Alternatively, according to (17), it is also easy to show thatR is a diagonal matrix and is given bŷ Plugging (26) into (18) gives the same result as in (24). This clearly shows the information barrier due to single point statistics approximation. Panels (c) and (d) in Figure 7 show true joint probability density function (PDF) π(u 0 , u 1 ) and the one with single point statistics approximation, the difference between which indicates the information barrier.

Intrinsic Information Barrier in Predicting Mean Response to the Change of Forcing
In Section 2.2, we have demonstrated the information barrier in the model fidelity via simple but illustrative examples. In this section, we aim at using a simple example to illustrate the intrinsic information barrier in the model sensitivity. Ref. [27] is a good reference for this topic. The example shown here also reveals the following fact. Even if the model fidelity represented by the equilibrium PDF is captured, the dynamical feature in the perfect model can still be missed if the model sensitivity is not recovered by the imperfect model. Therefore, both the model fidelity and model sensitivity are required in calibrating the imperfect model, which will be discussed with more details in Sections 5 and 6.
Here, the focus is the mean response to the change of forcing in linear models. A general framework of any system response (e.g., variance or higher order moments) to different types of external perturbations (e.g., forcing, dissipation or phase) in complex nonlinear model will be developed in Section 4 using the so-called fluctuation-dissipation theorem (FDT).
Consider a general linear system with noise In (27), L is a linear operator whose eigenvalues all have a negative real part, which guarantees the existence of a Gaussian statistical steady state of u. Here F is an external forcing, which can be a function of time t, andẆ is stochastic white noise. Now, we impose a forcing perturbation δF to the original system in (27) du Since both (27) and (28) are linear models, the mean values u and u δ at the statistical steady state can be written explicitly, Therefore, the mean response of u to the forcing perturbation δF is given by In practice, model error is usually inevitable. A suitable imperfect model is expected to generate at least the same mean response as in the perfect model in (29) in addition to the model fidelity.
A typical situation with model error for complex systems arises when the true system has additional degrees of freedom that are hidden from the family of imperfect models due to either the lack of scientific understanding or practical lack of computational resolution. The simple example below involves such features.
Consider the following perfect model with linear stochastic equations, whereẆ is white noise. The system in (30) has a smooth Gaussian statistical steady state provided that In (30), u can be treated as a resolved variable while v is a hidden one. All the imperfect models are given by the scalar stochastic equation that involves only the process of the observed variable, It is natural to require γ M > 0 such that the imperfect model (32) has a Gaussian statistical steady state. Next, the imperfect model (32) is tuned to capture the model fidelity in the perfect system (30) by matching the equilibrium mean and variance of u and u M . This implies with a suitable choice of the three tuning parameters F M , σ M and γ M (> 0), it is clear that the conditions in (33) can always be satisfied.
In addition to the model fidelity, an important and practical issue is to understand the response of the system to the external forcing perturbation δF. Therefore, it is crucial to have an imperfect model that has the same forcing response as the perfect model. To test the response of the external forcing, we replace F and F M by F + δF and F M + δF in the two linear systems (30) and (32), respectively. Note that the external forcing will not change the variance in linear systems. The only change in the equilibrium response is through the change in mean, Now assume A > 0 in the perfect model (30). We claim that no model in the family (32) can match the response of u correctly. In fact, with A > 0 and aA − q > 0 as required in (31), δu ∝ −δF. However, γ M > 0 implies δu M ∝ δF. In other words, the responses in the perfect and imperfect models are always anti-correlated, which implies an information barrier. To quantify this information barrier, we insert the response of mean (35) into (10) and make use of the fact that the response in the variance is always zero. Then, (10) yields It is clear that with A > 0 there is no finite minimum over γ M of (35) and necessarily γ M → ∞ in the approach to this minimum value. Thus, there is an intrinsic information barrier to skill in the mean response that cannot be overcome with the imperfect models in (32) even if they satisfy perfect model fidelity (33). On the other hand, if A < 0, then (35) indicates a unique minimum with γ * M = −A −1 (aA − q), in which case both the model fidelity and mean response are captured.

Slow-Fast System and Reduced Model
An important practical issue for complex dynamical systems is how to account for the indirect influence of the unresolved variables u II on the response of the resolved variables u I beyond bare truncation formulae. The importance of this has already been in Sections 2.2 and 2.3 for calibrating the model fidelity and predicting the mean response using linear imperfect models. Understanding this issue also has practical significance since simplified models are always preferred to decrease the computational cost in solving the complex multiscale dynamical systems. Therefore, developing reduced stochastic models for the variables u I with high skill for the low-frequency response is a central issue. While nature can be highly nonlinear and non-Gaussian, the focus here is linear systems with slow-fast multiscale features. Below, we will show that the stochastic mode reduction techniques [40,[108][109][110] are able to produce a reduced stochastic model for the low-frequency variables u I . Despite its simplicity, such a reduced stochastic model has exactly the same mean response operator as that in the complete stochastic system! More Gaussian and non-Gaussian examples can be found in [111].
Consider a linear multiscale stochastic model for variables u = (u I , u II ) T given by which can also be written in a compact form The parameter > 0 in (36) can be large or small. Here, we require that L has eigenvalues with a negative real part for all and in particular (L 11 u, u) < 0, (Γu, u) > 0. (38) for u = 0. These requirements guarantee that L is invertible and the climate mean state is given by u = −(L ) −1 F . This together with (37) and (38) implies in particular that the change in the first components of the climate mean state, δ u I , in response to a change in forcing, δF 1 , is given exactly by Stochastic mode reduction techniques [40,[108][109][110] systematically produce a reduced stochastic model for the variables u I alone, which is a valid model in the limit → 0; such models often have significant skill for moderate variables of [112,113]. Here, we focus on their skill in producing infinite time-mean response in (39) of the full dynamics from (36) independent of .
First, the local equations in (36) can be rewritten exactly as an equivalent equation with memory in time for the u I variable alone [114] given by (40) For simplicity in exposition, zero initial data are assumed for u II . As discussed in detail [40,109], the second and third terms in (40) simplify in the limit → 0 and yield reduced simplified local stochastic dynamics for u I alone given by This is an explicit example of stochastic mode reduction where the variables u II have been eliminated and there is a reduced local stochastic equation forũ I alone with explicit corrections that reflect the interaction with the unresolved variables. Here, we address the skill of the approximation in (41) in recovering the exact mean climate response in (39) independent of . Reasoning as discussed earlier in general below (38), the response of the climate mean in (41) to a change in forcing is given exactly by Remarkably, the mean response operator in (42) coincides exactly with the projected mean climate response operator in (39) for the complete stochastic system in (36) for any value of > 0! This general result points to the high skill of the response for the reduced stochastic model in calculating the mean climate response. Note that the asymptotic behavior and the filtering skill of the linear multiscale stochastic model in (36) have both been studied in [115].
We wrap up this subsection with the following remark. Unlike the linear models as shown in Sections 2.3 and 2.4, direct calculations of the response in general nonlinear models become a great challenge. Nevertheless, fluctuation-dissipation theorem (FDT) provides an efficient and practical way for computing the response in nonlinear systems. The general framework of FDT will be developed in Section 4. Note that low-frequency regimes of general circulation models (GCMs) typically exhibit subtle but systematic departures from Gaussianity. In [41], the stochastic mode reduction technique is applied to a simple prototype nonlinear stochastic model that mimics structural features of low-frequency variability of GCMs with non-Gaussian features [116]. FDT is then used to study the skill of the resulting reduced nonlinear stochastic models.

Fitting Autocorrelation Function of Time Series by a Spectral Information Criteria
As was seen in the previous subsections, the model sensitivity is applied to quantify the information of the temporal evolution of the system. In fact, the autocorrelation function of a given stochastic system is a simple and easily computed measurement that can be used for accessing the model sensitivity. In this subsection, we make use of the autocorrelation to quantify the model sensitivity based on a new information-theoretic framework. Autocorrelation is the correlation of a signal with a delayed copy of itself, as a function of delay. For a zero mean and stationary random process u, the autocorrelation function can be calculated as Clearly, a linear Gaussian process is completely determined by its mean and autocorrelation function, where the autocorrelation function characterizes the memory of the system. Therefore, an accurate estimation of the autocorrelation function in the imperfect models plays an important role in prediction. In many applications, the integral of the autocorrelation function, which is known as the decorrelation time, is used for model calibration. Although fitting the decorrelation time in the imperfect model is a simpler strategy, it is however insufficient for pointwise agreement with the true autocorrelation. In particular, if the underlying nonlinear turbulent dynamics has a slow mixing rate and involves wave-like behavior, then the profile of the true autocorrelation function is very likely to be a damped oscillation. As a consequence, fitting only the decorrelation time in the imperfect model results in a large model error due to the failure of capturing the detailed oscillation structures of the autocorrelation function, which severely deteriorates the prediction skill. Thus, it is of practical importance to calibrate the autocorrelation function in imperfect models in order to capture the dynamical features beyond the equilibrium statistics of the truth. The autocorrelation function is also directly linked with the model sensitive in terms of the mean response as well as the prediction skill. Information theory provides a rigorous and practical way to quantify the error in the two autocorrelation functions associated with the perfect and imperfect models respectively [81,117]. However, direct application of the information distance in (1) is not suitable for measuring the difference between the two autocorrelation functions. This is because the autocorrelation function R(t) may oscillate with negative values while π and π M have to be positive in (1). To generalize the information-theoretic framework to include the autocorrelation functions, the theory of spectral representation of stationary random fields [118] is exploited here. It is proved by Khinchin's formula [118] that if the autocorrelation function R(t) is smooth and rapid-decay, which is the typical property for most systems, then there exists a non-negative function E(λ) ≥ 0 such that with dF(λ) = E(λ)dλ a non-decreasing function. Therefore, the spectral representation of the stationary process of u can be constructed as The exact spectral random measureẐ(dλ) has independent increments whose energy spectrum can be represented by E(λ) or dF(λ) Applying the theory for spectra representation of stationary processes, an one-to-one correspondence between the autocorrelation function R(t) and non-negative energy spectra E(λ) together with the spectral representationẐ(dλ) of the process u(t) can be found. Consider the approximation of this random process with only second order statistics by a lattice random field with spacing ∆λ. By independence, the true incrementẐ(∆λ j ) =Ẑ(λ j + ∆λ) −Ẑ(λ j ) has the second order Gaussian probability density function approximation and the corresponding spectral representation from the imperfect model also has the density function where N (µ, σ 2 ) denotes a Gaussian random variable with mean µ and variance σ 2 . Since the spectral measure has independent increment, we approximate the true and imperfect model Gaussian random fields by Then, the normalized relative entropy between these two Gaussian fields becomes Therefore, given spectral density E(λ) and E M (λ), the spectral relative entropy is given by where we slightly abuse the notion above by using the spectra E(λ) to denote density functions. Since E and E M are variances for the spectral random variables, it is well-defined in the last part of the above formula (47) using the information distance formula (1). Through measuring the information distance in the spectral coefficientsẐ(λ), we arrive at the lack of information in the autocorrelation function R(t) from the model. See [81] for more details as well as an efficient algorithm of solving (47). Finally, let there be the set of parameters θ for the imperfect model. Minimum relative entropy criterion implies the process of minimizing the lack of spectral information distance by picking the optimal parameter set θ * for the imperfect model so that The following example makes use of the above spectral information criteria to reveal the importance in calibrating the autocorrelation function in the imperfect linear prediction model. The perfect model considered here is a noisy version of the so-called Lorenz 84 model [119,120], This model is an extremely simple analogue of the global atmospheric circulation and the noise-free version can be derived as a Galerkin truncation of the two-layer quasigeostrophic potential vorticity equations in a channel [121]. In (49), x represents the intensity of the mid-latitude westerly wind current while y and z represent the cosine and sine phases of a chain of vortices superimposed on the zonal flow.
With g = 0, the processes y and z in (49) form a pair of stochastic nonlinear oscillator through the skew-symmetric terms −bxz and bxy, where the frequency of the oscillation is stochastic and it depends on the amplitude of x. Meanwhile, x also plays the role of stochastic damping, which can be seen in the nonlinear terms xy and xz that modify the wave amplitudes.
The parameters used in this test are as follows: In Figure 8, sample trajectories and the corresponding autocorrelation functions associated with x, y and z in Equation (49) with parameters (50) are shown. It is clear that the autocorrelation functions associated with y and z oscillate and decay to zero, which satisfies the features of wave pairs. The goal here is to predict the vortex variables y and z. The imperfect model for prediction is a mean stochastic model (MSm) with a constant phase, Note that u M is a complex process and its real and imaginary parts correspond to the pair of vortex y and z in (49), respectively. Due to the simple linear structure, the autocorrelation function R M (t) and spectra density E M (λ) of the MSm in (51) can be written down explicitly, The model (51) has three parameters to determine: d M u , ω M u and σ M u . Now, we apply the spectral relative entropy in (47) and make use of the analytic formula in (52) for E M (λ) to implement the optimization (48). Note that the spectra density E M (λ) in (52) does not depend on the stochastic forcing coefficient σ M u . Therefore, the optimization in (48) is over all the possible choices of d M u and ω M u . This gives the following results: Optimal parameters by fitting the autocorrelation function: For comparison, we also adopt the traditional parameter estimation strategy in MSm by fitting only the decorrelation time (44) [15]: Optimal parameters by fitting only the decorrelation time: The remaining parameter σ M u is calibrated by matching the variance of y and Re(u M ) at the statistical steady state, which results in σ M u = 0.205 in Case (53) and σ M u = 1.850 in Case (54). In Figure 9, two sample trajectories of Re(u M ) and the corresponding autocorrelation functions with parameters in (53) and (54) are shown. It is clear that the sample trajectory of Re(u M ) and the corresponding autocorrelation function with parameters in (53) highly resemble those of y in Figure 8. On the other hand, the decorrelation time of y is very short, which is due to the canceling out of the oscillation patterns with positive and negative values by integrating the autocorrelation function. Thus, the oscillation patterns in the trajectory of Re(u M ) are overwhelmed by the noise due to the strong mixing property when the model is calibrated using only the decorrelation time (Panels (c) in Figure 9). This example indicates the necessity of using the information criterion developed here for calibrating the autocorrelation function rather than simply matching the decorrelation time as used in many earlier works.
Finally, in Figure 10, we show the prediction of the time evolution of the mean and variance of y and Re(u M ) starting from the same initial value y = Re(u M ) = 1 and others = 0. In column (a), the mean evolution of Re(u M ) from the linear model (51) captures that of y in the nonlinear Lorenz 84 model (49) quite accurately with a significant oscillation structure. The trend of the variance using the linear model is also quite similar to that using the Lorenz 84 model, though the oscillation structure in the variance due to the fact that the nonlinearity in the Lorenz 84 model is not predicted by the linear MSm. The latter has already been discussed in [111] and in Section 2.2.1 as an information barrier. On the other hand, by calibrating only the decorrelation time (Column (b) in Figure 10), the time evolutions of the mean and variance have dramatically fast relaxations towards the statistical steady state, which are completely different from the truth.
In [81], the information theory as shown above has been applied to calibrate more complicated reduced order models. The calibrated models succeed in predicting fat-tailed intermittent PDFs in passive scalar turbulence.

Kalman Filter, State Estimation and Linear Stochastic Model Prediction
Filtering (also known as data assimilation or state estimation) is the process of obtaining the optimal statistical estimate (based on a Bayesian framework for example) of a natural system from partial observations of the true signal. Important contemporary examples involve the real-time filtering and prediction of weather and climate as well as the spread of hazardous plumes or pollutants [13][14][15][16][122][123][124].
The general procedure of filtering complex turbulent dynamical systems with partial and noisy observations contains two steps at each time step t = m∆t. The first step involves a statistical prediction of a probability distribution u m+1|m starting from the initial value u m|m using the given dynamical model. Then, in the second step, u m+1|m is corrected on the basis of the statistical input of noisy observation v m+1 , which results in u m+1|m+1 . See the illustration of Figure 11.  Figure 11. Illustration of the prediction-filtering procedure.
Let u m ∈ C be a complex random variable whose dynamics are given by the following: where σ m+1 is a complex Gaussian noise with σ m+1 = (σ 1,m+1 + iσ 2,m+1 )/ √ 2 and it has zero mean and variance r = σ m+1 σ * m+1 = 1 2 ∑ 2 j=1 σ 2 j,m+1 . Here, F is a complex number known as the forward operator and F is an external forcing which can vary in time. The goal of the Kalman filter is to estimate the unknown true state u m+1 , given noisy observations where g is a linear observation operator and σ o m ∈ C is an unbiased Gaussian noise with variance The Kalman filter is the optimal (in the least-squares sense) solution found by assuming that the model and the observation operator that relates the model state with the observation variables are both linear and both the observation and prior forecast error uncertainties are Gaussian, unbiased and uncorrelated. In particular, the observation error distribution of v at time t m+1 is a Gaussian conditional distribution which depends on the true state u m+1 through (55). In (57), p(v m+1 |u m+1 ) is known as the likelihood of estimating u m+1 given observation v m+1 . Assume the filter model is perfectly specified [128]. An estimate of the true state prior to knowledge of the observation at time t m+1 , which is known as the prior state or forecast state, is given by See the first step in Figure 11. From the probabilistic point of view, we can represent this prior estimate with a probability density p(u m+1 ). This prior distribution acounts only for the earlier observations up to time t m , p(u m+1 ) ∼ N (ū m+1|m , r m+1|m ), (59) where the prior mean and prior variancē are solved viaū with r m|m = (u m −ū m|m )(u m −ū m|m ) * . Note that in order to solve the prior distribution p(u m+1|m ), the posterior information in the previous stepū m|m , r m|m has been used. Next, we derive the posterior state (or the filtered state) that combines the prior information p(u m+1|m ) with the observation v m+1 at t m+1 . This estimate is given in the probabilistic sense by the Bayesian update through maximizing the following conditional density, which is equivalent to minimizing The value of u at which J(u) attains its minimum is the estimate for the mean and is given bȳ where K m+1 = gr m+1|m is the Kalman gain. Note that 0 ≤ K m+1 g ≤ 1. The filter fully weighs to the model or prior forecast when K m+1 g = 0 and fully weighs to the observation when K m+1 g = 1. Such weights depend on the ratio of the uncertainty (reflected by the noise) in the observations and the model. Finally, the posterior variance is calculated via the following: These result in the expression of the posterior variance Note that the above Kalman filter is designed for a linear system with Gaussian noise. In practice, different generalizations of the Kalman filter and various nonlinear filters such as the ensemble Kalman filter, particle filter and blended filtering techniques are applied to nonlinear and non-Gaussian systems. See [13][14][15][16][17]101,129,130] for more details.

Asymptotic Behavior of Prediction and Filtering in One-Dimensional Linear Stochastic Models with Model Error
Recall in Section 3.1 that the true underlying linear stochastic model is given by However, the true underlying dynamics is typically unknown in practice. Therefore, imperfect forecast models are used in the prediction stage. Now, let's assume the forecast model has the following form where the model error comes from the imperfect forward operator, forcing and noise coefficient. Due to the appearance of such model errors, the updates of prediction and filtering distributions becomē respectively. Now, we study the asymptotic behavior of the updates (69) and (70) with model error compared with the truth based on (67). The detailed calculations are included in Appendix C, which exploit the augmented system involving the truth and the prediction/filtering with model error [31]. Here, we summarize the results.
From (67) and (68), it is easy to show the equilibrium mean estimates of the perfect and imperfect modelū

Prediction
The asymptotic prediction mean ofū M m+1|m is given by Clearly, the asymptotic mean of the prediction state is a linear combination of the equilibrium mean of original true modelū eq and that of the forecast model of the meanū M eq . With (67), the error in the asymptotic prediction mean ofū M m+1|m yields, The asymptotic prediction mean ofū M m+1|m is equal to the equilibrium mean of the perfect model if and only if the imperfect model has the same equilibrium mean as the perfect model, namelyū eq =ū M eq . On the other hand, the asymptotic prediction variance r P ∞ = lim m→∞ r m+1|m is given by where the asymptotic Kalman gain is Therefore, the asymptotic prediction variance simplifies to

Filtering
The asymptotic filtering mean ofū M m+1|m+1 is given by with (67), the error in the asymptotic filtering mean ofū M m+1|m+1 yields, The asymptotic prediction mean ofū M m+1|m+1 is equal to the equilibrium mean of the perfect model provided that (1) the imperfect model has the same equilibrium mean as the perfect model, namelȳ u eq =ū M eq , or (2) K M ∞ g = 1, namely the observational noise is zero and the filter trusts completely towards the observations.
On the other hand, the dynamics of prediction variance r A ∞ = lim m→∞ r m+1|m+1 are given by with direct manipulations, the asymptotic analysis variance becomes

Comparison
Comparing the asymptotic prediction mean (73) and filtering mean (77), we have which indicates that the error in the filtering mean is always smaller than that in the prediction mean in the sense of standard mean deviation, with the existence of observational error. Next, comparing the asymptotic prediction variance (75) and filtering variance (79), we have See Appendix C for the detailed derivations. This implies that filtering state always results in a smaller uncertainty (variance) than the prediction state. Such an uncertainty reduction is due to the extra information in the noisy observations.
Note that the conclusions made from (80) and (81) are valid only when full observations are available. In high dimensional situations, if the observations are only available on part of the variables (known as partial observations), then the prediction error can be smaller than filtering error. See Section 3.5.1 for simple examples.

Motivation Examples
To illustrate the importance and necessity of developing an information theoretical framework in assessing the filtering and predicting skill, let us first review the two traditional path-wise measurements that are widely used in filtering and prediction [13,[131][132][133][134][135]. Denote u i , i = 1, . . . , n the true signal andû i the filtering/prediction estimate. These measurements are given by 1.
The pattern correlation (PC): whereû i and u i denotes the mean ofû i and u i , respectively.
While these two path-wise measurements are easy to implement and are able to quantify the filtering/prediction skill to some extent, they have fundamental limitations. To see this, let us consider a simple motivation example, where the true dynamics is given by with parameters For the imperfect forecast model, we assume the same ansatz as the prefect one in (84) but the parameters related to the forcing amplitudes contain model errors. Consider the following two imperfect models: In panels (a) and (b) of Figure 12, we show the predictions using these two imperfect forecast model (green curve) and they are compared with the truth (blue curve). Here, the observational time step is ∆t obs = 0.5 which is less than the decorrelation time τ = 1/γ = 1, and the observational noise level is r o = 1. In terms of the RMSE and PC, the two predictions are comparable with each other and the one in panel (a) is even slightly more skillful. However, the prediction in panel (a) by intuition is worse than that in panel (b). In fact, the amplitude of the prediction using the imperfect model (a) is severely underestimated. The consequence is that the prediction fails to capture all the important extreme events in the true signal. On the other hand, the prediction using the imperfect model (b) results in a time series which has the same amplitude as the truth. See the PDFs associated with the time series in panel (d) for estimating the amplitudes. Therefore, the two traditional path-wise measurements-RMSE and PC-are misleading here in providing the prediction skill. In addition, according to the definitions of the RMSE and the PC in (82) and (83), both the measurements take into account the information only up to the second-order statistics. Therefore, they are not able to capture the information beyond Gaussian statistics and are not suitable to assess the filtering/prediction skill for any non-Gaussian models as in nature.
Due to the above fundamental limitations of these two traditional path-wise measurements, various information measurements become useful in assessing the filtering/predictino skill.
In [34,56,[136][137][138], an information measurement called Shannon entropy difference was introduced and was used to assess the filtering/prediction skill. The Shannon entropy difference is defined as In particular, if both π ≡ π G and π M ≡ π M G are Gaussian (as in the linear models), then the Shannon entropy difference has the following explicit formula: where R and R M are the covariance of π G and π M G . Intuitively, the Shannon entropy difference quantifies the uncertainty between π and π M . For Gaussian distributions, the uncertainty is reflected by the variance. Connecting the Shannon entropy difference with the two predictions in panels (a) and (b) of Figure 12, it is expected that the Shannon entropy difference is able to distinguish the two predictions since the associated PDFs of the two predictions have different variances. In fact, the Shannon entropy difference in the imperfect model (a) (0.7122) is much larger than that in the imperfect model (b) (0.1502), which indicates that the prediction in panel (b) is more skillful than that in panel (a).
However, relying solely on the Shannon entropy difference in assessing the filtering/prediction skill is also misleading. Consider an imperfect model with the following parameters: Comparing with the perfect model and the other two imperfect models in (86), a non-zero constant forcing f M 0 = 2 is imposed in (89). The prediction results are shown in panel (c) of Figure 12. Since the Shannon entropy difference in Gaussian framework (88) takes into account only the the variance but completely ignores the information in the mean, the resulting Shannon entropy difference using the imperfect models (b) and (c) give exactly the same value. In addition, the pattern correlations in these two models are also identical to each other. However, the prediction using the imperfect model (c) has an obvious mean bias and therefore the prediction is not as skillful as that using the imperfect model (b).
From these simple motivation examples, it seems that the combination of the RMSE, the PC and the Shannon entropy difference can overcome the fundamental limitations as discussed above. However, there are at least two extra shortcomings even in the combination of these three measurements. First, two different PDFs associated with the imperfect model, namely π M {1} and π M {2} , may result in the same Shannon entropy difference compared with the truth. For example, such situation happens when {2} has a mean shift compared with π M {1} . This is because the Shannon entropy difference computes the uncertainty of the two distributions separately rather than considering the pointwise difference between the two PDFs. Therefore, a more sophisticated measurement should take into account the relative difference between the PDFs associated with the perfect and imperfect models. Second, as has been discussed above, the RMSE and PC only make use of the information up to the second order statistics. The important non-Gaussian features as they appeared in many realistic applications are not reflected in these path-wise measurements. Here, the truth is the same in (a-c), which is generated from (84) and (85). The three imperfect forecast models are given by (86) and (89). In column (d), the associated PDFs are shown. In all the panels, only the real part of u is shown.

Assessing the Skill of Estimation and Prediction Using Information Theory
Due to the fundamental limitations in the two classical path-wise measurement, RMSE and PC, as well as those in the Shannon entropy difference, a new information-theoretic framework [102] has developed to assess the filtering/prediction skill. Again, denote π ≡ π(u) and π M ≡ π(u M ) the PDFs associated with truth u and the filtering/prediction estimate u M , respectively. Denote p(u, u M ) the joint PDF of u and u M . Let U = u − u M be the residual between the truth and the estimate. This information-theoretic framework involves three information measurements: 1.
The mutual information, 3.
The relative entropy, The Shannon entropy residual quantifies the uncertainty in the point-wise difference between u and u M . It is an information surrogate of the RMSE in the Gaussian framework. The mutual information quantifies the dependence between the two processes. It measures the lack of information in the factorized density π(u)π(u M ) relative to the joint density p(u, u M ), which follows the identity, The mutual information is an information surrogate of the PC in the Gaussian framework. On the other hand, the relative entropy quantifies the lack of information in π M related to π and it is a good indicator of the skill of u M in capturing the peaks and extreme events of u. It also takes into account the pointwise discrepancy between π and π M rather than only computing the difference between the uncertainties associated with the two individual PDFs (as in the Shannon entropy difference). Therefore, the combination of these three information measurements is able to capture all the features in assessing the filtering/prediction skill and overcomes the shortcomings as discussed in the previous subsection.
Note that when π ∼ N (u, R) and π M ∼ N (u M , R M ) are both Gaussian, then the above three information measurements have explicit expressions: 1.
The relative entropy (Gaussian framework), In (94)- (96), N is the dimension of π and π M , I is the identity matrix with size N × N, and Cov(u, u M ) is the covariance between u and u M . More discussions of Gaussian and non-Gaussian cases can be found in [56] and [54], respectively.
The information-theoretic framework (90)- (92) or (94)-(96) is usually defined in the super-ensemble sense [31] in accessing the data assimilation and prediction skill of the imperfect model given the perfect one. However, in some more realistic situations, although the imperfect models can be run in the ensemble mode, the ensemble run of the perfect model or the truth is never available. This is because the perfect model that describes nature is unknown. The only available information is one realization from observations (e.g., satellites). Nevertheless, the information-theoretic framework can also be used in a path-wise way, where the statistics are computed by collecting all the sample points in the given realization. Some realistic applications of the information-theoretic framework for filtering and prediction can be found in [31,61,139].

State Estimation and Prediction for Complex Scalar Forced Ornstein-Uhlenbeck (OU) Processes
Now, we study the state estimation (filtering) and prediction. The focus here is a complex scalar forced Ornstein-Uhlenbeck (OU) process, where γ and ω 0 are the damping and oscillation frequency while f 0 and f 1 e iω 1 t are a constant and time-periodic large-scale forcing, respectively, and σẆ is stochastic noise. Despite the simplicity of this model in (97), it can be used to mimic some climate physics [15,120]. For example, the deterministic forcing f 0 + f 1 e iω 1 t can be regarded as the annual cycle while ω 0 can be treated as internal oscillation which may occur in the intreaseasonal time scale. The damping term γ measures the system memory and the stochastic term represents the input to the system from small or unresolved scales. The model in (97) can also be regarded as one Fourier mode of complex spatially-extended systems. The information-theoretic framework developed above will be used to assess the filtering/prediction skill and quantify the model error. The complex scalar forced OU process in (97) will be used to generate the true signal. The imperfect forecast model has the same structure as (97) but with model errors in the parameters. The goal here is to systematically study the model error as functions of the observational time step, observational noise and the forcing amplitude.
The exact solution of (97) can be written down explicitly, which provides the analytical forms of the time evolution of the forecast mean u(t) and the forecast variance r(t), With the explicit expressions in (98) and (99), it is easy to write down the corresponding operators F, F m+1 and σ m+1 in (67) or those in the imperfect forecast model (68) in each prediction/filtering assimilation step.
To understand the prediction and filtering skill of the complex scalar forced OU process (97), we start with a simple case which involves only a constant forcing. We also adopt the perfect forecast model in this example. The parameters of (97) are given as follows: and the observational operator g = 1. Here, we take ∆t obs = 0.5 and r o = 0.5 as the default values of the observational time step and the observational noise level, respectively. Since the phase ω 0 = 1 is nonzero, the autocorrelation function has an oscillation structure. Therefore, the decorrelation time which includes the cancellation of the positive and negative values in the autocorrelation function may be misleading for measuring the system memory. Here, we have checked both the standard decorrelation time, namely the integral of the autocorrelation function (ACF), and the integral of the absolute value of the ACF. These two quantities have values τ ACF = 0.34 and τ |ACF| = 1.63. Thus, ∆t obs = 0.5 is a reasonable observational time step that includes some memory of the information in the previous assimilation step. On the other hand, r o = 0.5 here results in a polluted signal with roughly 10% observational noise. Figure 13 show the prediction and filtering skill in terms of the three information measurements, namely the Shannon entropy residual, the mutual information and the relative entropy, as a function of ∆t obs (panels (a)-(c)) and r o (panels (d)-(f)), respectively. First, the filtering estimate is always more skillful than the prediction one. This is consistent with the theoretical analysis in Section 3.2 in that the filter estimate combines the prediction result with the information observation and the error is therefore reduced. Next, it is clear from all the panels in Figure 13 that with the increase of either ∆t obs or r o , both the prediction and filtering skill deteriorates, which is as expected. Nevertheless, the prediction skill decreases more quickly with the increase of the observational time step ∆t obs . In particular, the model error in terms of the relative entropy increases exponentially. As comparison, the filtering skill only has a slight deterioration with the change of ∆t obs . On the other hand, the error in the filter estimate increases quickly with the observational noise r o when r o is small. When r o becomes moderate to large, the error in the filter estimate increases steadily. The error in the prediction estimate always increases steadily as a function of r o .
To have a more intuitive understanding of these results, the time series of the truth, the filter estimate and the prediction estimate are shown in Figure 14 with three different ∆t obs or r o . Comparing with panels (a) and (b), it is clear that a long observational time step ∆t obs leads to a smaller fluctuation of the prediction estimate around its steady state mean value. In fact, the signal due to the memory effect in the previous assimilation step is strongly damped with a long observational time step and the resulting signal is dominated by the constant forcing. The consequence is that the PDF associated with the predicted time series has a much smaller variance compared with the truth and the prediction fails to capture the extreme events and large variabilities in the true signal. On the other hand, due to the incorporation of the information from the observations, the filter estimate even with a long observational time step provides a quite skillful result in terms of both the correlation and the signal amplitude. Note that the asymptotic Kalman gain in the filtering in panel (b) is K ∞ = 0.9, which means the observations play an important role in regaining the skill in the filter estimate. In panel (c), the observational time step ∆t obs = 0.5 is the same as that in panel (a) but the observational noise level is increased from r o = 0.5 to r o = 3. Both the filtering and prediction skill becomes worse as compared with those in panel (a). Nevertheless, the deterioration is not quite significant which is consistent with the statistics shown in Figure 13.  Next, we consider the complex forced scalar system with a time-periodic forcing in (97). The parameters are as follows: and the observational operator g = 1. Two dynamical regimes are studied here: Regime I: ω 0 = 0.5, and Regime II: It is important to note that, in Regime II, the phase ω 0 and the forcing period ω 1 are equal to each other, which means this dynamical regime has a resonance forcing. On the other hand, the dynamical Regime I has a non-resonance forcing. We take ∆t obs = 0.5 and r o = 9 as the default values of the observational time step and the observational noise level, respectively. Note that although r o = 9 is much larger than that in the previous example, the signal amplitude due to the periodic forcing also increases. The observational noise here is about 25% compared with the true signal. See Figure 15 for the true signal and the noisy observations. Now, let us assume the imperfect model shares the same model structure as the perfect one in (97). The imperfect part comes from the parameter ω M 0 . This comes from the motivation that the situation that the large-scale forcing f 0 + f 1 e iω 1 t is in general known quite well but measuring the internal oscillation ω 0 usually contains error. In Figure 16, we show the model error in terms of the three information measurements, namely the Shannon entropy residual, the mutual information and the relative entropy, as a function of ω M 0 . Here, the information measurements are computed based on the Gaussian framework (94)-(96) due to their simplicity from a practical point of view, where the statistics here are averaged directly over the time series. Note that such direct average results in a bimodal distribution in the true model due to the large amplitude of the periodic forcing such that the mixing and ergodicity are not satisfied [108] (see Figure 15 and a detailed discussion in Appendix D). Nevertheless, the Gaussian approximation in the information measurements here provides a qualitatively accurate estimate of the model error as can be seen in  There is an alternative way of computing the model error by first collecting all the points t + mT, m = 1, 2, . . . with t fixed and T being the period of the time series. These points appear in the same location within a period and the collection forms a Gaussian distribution. Compute the information measurements for these Gaussian distributions. Then, let t vary within t ∈ (0, T] and repeat the above procedure. Eventually, take the average of the information measurements within one period to finalize the results. This alternative method is more rigorous in terms of applying the Gaussian framework of the information measurements. However, it requires a very long time series to guarantee the sampling size (the number of period) is sufficient. It also requires knowing the perfect information of the period, which may not be realistic in practice if the period contains randomness.
In Figure 16, the minimum of the model error appears around the perfect value. The small bias in the optimal value compared with the truth comes from applying the Gaussian framework of the information measurements. When the discrepancy between ω M 0 and the truth ω 0 increases, the model error in all the three information measurements becomes large as well. Despite the similar profiles in the model error curves in the two regimes, the model error increases significantly faster in Regime II (the resonance regime). In fact, according to (98) or (99), the contribution of the time-periodic forcing to the forecast solution is given by In addition to the error appears in the phase e −(γ+iω 1 −iω 0 ) due to an imperfect ω M 0 , the resonance forcing also greatly modifies the amplitude of the contribution in (103). With a resonance forcing ω 0 = ω 1 , the amplitude in the contribution (103) reduces to f 1 /γ which can be much larger than If in the imperfect model ω M 0 = 0 = 1 = ω 0 , then a large error in the amplitude of the forcing contribution using the imperfect is expected. This is shown in Panel (b) of Figure 15. It is clear that in addition to the phase shift in both the filtering and prediction estimates, the amplitudes of these estimates are severely underestimated as well. Notably, the relative entropy here unambiguously indicates such underestimations of the amplitudes and extreme events, which cannot be captured by the RMS error and the pattern correlation.
In order to reduce the model error in the imperfect model, a typical strategy is to optimize the noise coefficient σ M [1, 13,15,101,140]. In Figure 17, we show the model error in the perfect model as a function of σ M , where ω M 0 = 0. Comparing to the non-optimized values σ M = σ = 2 as indicated by the blue 'x', the model error with the optimal value σ M = 7 for the filter estimate (which is also nearly the optimal value for the prediction estimate) has a significant decrease. This noise inflation strategy is in fact consistent with that in dealing with many operational models or complex dynamical systems [18,141,142]. Figure 17 also confirms that noise inflation leads to a much smaller model error than the underdispersion [15,18,100,140]. In Panel (c) of Figure 15, the filter and prediction estimates with this optimized noise are shown. The amplitudes of the true signal are recaptured by the imperfect model estimates. Interestingly, the filter estimates are now almost perfectly in phase with the true signal and even the discrepancy between prediction estimates and the truth is greatly decreased. See Panel (b) for a comparison. In fact, we note that the Kalman gain has increased from K ∞ = 0.27 to K ∞ = 0.73, which implies that the observations now play a more important role in obtaining the filter estimates. This is the underlying reason that the filtering becomes more skillful, which also increases the skill of the prediction since the filter estimate now provides a much more accurate initial value of the prediction.
The blue 'x' shows the non-optimized values σ M = σ = 2. The dot σ M = 7 indicates the optimal value for the filter estimate which is also nearly the optimal value for the prediction estimate. (a) Shannon entropy residual, (b) Mutual information, (c) Relative entropy.

State Estimation and Prediction for Multiscale Slow-Fast Systems
Multiscale slow-fast systems are commonly seen in many geophysical and engineering turbulent flows [18,[143][144][145][146]. A concrete example involves the coupling of random incompressible geostrophically balanced (GB) flows and random rotating compressible gravity waves in the middle latitude atmosphere [8]. Under the situation with a small Rossby number, the coupled system becomes a multiscale slow-fast system where the GB component dominates the slow-varying geophysical flows [8,[147][148][149].

A 3 × 3 Linear Coupled Multiscale Slow-Fast System
Here, we start with a simple 3 × 3 linear coupled multiscale slow-fast system, In (104), we assume the linear coefficients L 12 = −L 21 , L 13 = −L 31 and L 23 = −L 32 such that the L ij forms a skew-symmetric matrix. The three damping coefficients −d u 1 , −d u 2 , −d u 3 < 0 to guarantee the mean stability. F 1 (t), F 2 (t) and F 3 (t) are external forcing that can depend on time t. Here, is a controllable parameter. With 1, the coupled system has a fast oscillation structure in u 2 and u 3 while u 1 remains as a slow variable. All the variables here are real.
The coupled system in (104) can be regarded as one Fourier mode of the shallow water equations, where u 1 mimics the large-scale GB flow while u 2 and u 3 represent the analogies of the real and imaginary parts of the gravity waves. Note that the gravity waves appear in pairs and therefore the linear combinations of u 2 and u 3 in the complex plane are good surrogates of the two components of the gravity waves associated with one Fourier mode in the shallow water equation. These three variables are coupled in a linear way in (104).
Below, we study the filtering/prediction skill. The following parameters are taken: Here, we only impose the deterministic time-periodic forcing to u 1 . This is because we denote u 1 as the slow (or large) scale variable, which is typically driven by external forcing, such as the seasonal cycle or annual cycle [15]. On the other hand, the other two variables mostly occur in a faster time scale and the forcing is basically stochastic.
To understand the filtering/prediction skill, the following four setups are adopted:

1.
Full observations, full forecast model (F/F). The observational operator g is an identity such that The forecast model is the same as in (105). Although this straightforward setup may not be practical (see below) and can be expensive when a much larger dimension of the system is considered (see next subsection), the results from such a setup can be used as a baseline for testing various modifications and reduced models as will be presented below.

2.
Partial observations, full forecast model (P/F). The real observations typically involve the superposition of different wave components. It is usually impossible to artificially separate these components from the noisy observations. Therefore, here we let the observational operator be g = (1, 1, 1), namely the observation is the combination of the three variables, The forecast model remains the same as that in (105).

3.
Partial observations, reduced forecast model (P/R). In practice, only part of the state variables are of particular interest in filtering and prediction. These state variables usually lie in large or resolved scales, such as the GB flow. Therefore, simple reduced forecast models are typically designed to reduce the computational cost and retain the key features in filtering and predicting these variables. To this end, the following reduced forecast model is used and the observation remains the same as that in (107). Here, we have completely dropped the dependence of u 1 on u 2 and u 3 since their mean is zero according to the setup above.

4.
Partial observations, reduced forecast model and tuned observational noise level with inflation (P/R tuned). It is easy to notice that in the previous setup (P/R), the signals of u 2 and u 3 actually become part of the observational noise in filtering and predicting u 1 . This is known as the representation error [53,100,[150][151][152][153][154]. However, if the original observational noise level r o is still used in updating the Kalman gain, then the filtering and prediction skill may be affected by the representation error. To resolve this issue, we utilize an inflated r o M in the analysis step to compute the Kalman gain while the other setups remain the same as in the P/R case. Here, the inflated r o M is given by where var(u 2 ) and var(u 3 ) are the variance of u 2 and u 3 respectively at the statistical steady state. The inflation in (109) is the most straightforward one. More elaborate inflation techniques can be reached by applying the information theory in the training phase. Nevertheless, with such a simple inflation of the observational noise, the signals of u 2 and u 3 are treated as part of the observational noise. The estimation of the Kalman gain using the imperfect forecast model (108) is therefore expected to be improved.
Below, we consider two dynamical regimes with = 0.1 and = 1, respectively. The two variables u 2 and u 3 evolve in a much faster time scale than u 1 in the regime with = 0.1 while the three variables lie in the same time scale with = 1. Now, we compare the filtering and prediction skill using the four setups as discussed above. In Figures 18 and 19, the skill as a function of the observational time step ∆t obs is shown in Regime = 0.1. The following conclusions are reached. First, both the filtering and prediction skill overall deteriorates with the increase of the observational time step ∆t obs . Second, the filter estimates are almost always more accurate than the prediction estimates since the former contains extra information from observations. Third, the results with F/F is the best among all the four setups, as expected. Nevertheless, the filtering and prediction results of u 1 based on the other three setups remain comparable to that of F/F. However, the predictions of u 2 and u 3 using both the full and partial observations (F/F and P/F) contain a large error when the observational time step becomes large. Such an error is not reflected by the RMSE and PC but is clearly indicated by the relative entropy. In fact, since u 2 and u 3 both lie in faster time scales, their decorrelation times become much shorter than the observational time step when the latter increases. The consequence is that, regardless of the initial value, the prediction estimates always relax to the equilibrium mean and the amplitudes are thus severely weakened. Despite the success in capturing the pattern correlation, the prediction fails to catch any extreme events. On the other hand, the observations help the state estimation of filtering. In fact, the filter estimates with full observations (F/F) can almost perfectly capture the amplitudes of the truth while the partial observations (P/F) at least allow the filter estimates to reach some of the events with large amplitudes, which is nevertheless more skillful than the prediction. See Figures 20 and 21 for the true time series as well as the prediction and filtering estimates.
Next, in Figures 22 and 23, the filtering and prediction skill in Regime = 1 is shown. Now, the difference in the results between using different setups becomes more significant. The filtering and prediction skill of u 1 using P/F remains good but the gap compared with that using F/F is more obvious. Interestingly, the reduced strategy P/R now becomes much worse and the filtering results are even worse than the predictions especially with short observational time step ∆t obs (see also the time series in Figures 24 and 25). In fact, there are two sources of error that bring about such unskillful results. First, the variances of u 2 and u 3 with = 1 are now much larger, which leads to a large representation error. Such representation error leads to a larger error in the filtering than prediction (For comparison, see Section 3.2 for the conclusion with no representation error). Second, recall that in Regime = 0.1, u 2 and u 3 evolve in a much faster time scale and therefore they can be treated as noise. Ignoring them in the dynamics of u 1 provides a good approximation in (108). This is however not true in Regime = 1 where the long memory of u 2 and u 3 plays an important role in the reduced dynamics of u 1 . In other words, the reduced forecast model in (108) results in a large model error in Regime = 1 due to the ignorance of u 2 and u 3 . Thus, the combined effect from both the representation error and the imperfect model leads to the large error in filtering as well as prediction. With a short observational time step (for example, ∆t obs = 0.2 in Figure 24), the representation error becomes dominant. Therefore, an inflation of the observational noise to compensate the representation error (P/R tuned in Figure 23) improves the filtering and prediction skill. However, with a much longer observational time step, say ∆t obs = 2, an inflation of the observational noise (P/R tuned) will reduce the Kalman gain and the model rather than the observation provide more information to the filter results. When ∆t obs is large, the model will relax towards its equilibrium mean and thus the amplitude will be underestimated. See Figure 23. This again indicates the importance of using the relative entropy as one of the quantification criteria. Note that, with ∆t obs = 2, Figure 23 clearly states that both the RMSE and PC of the filtering estimates in P/R tuned setup are better than those in P/R setup, but the relative entropy in P/R tuned setup is much larger. This is a good example to show the importance and necessity of using the information-theoretic framework in quantifying the filtering and prediction skill instead of using only the path-wise RMSE and PC. Finally, we note that the signals of u 2 and u 3 here do not behave as a pair of oscillator as in the Regime with = 0.1. This is because all the three variables now lie in the same time scale and they interact with each other. Here, the large-scale time-periodic forcing in u 1 leads to a time-periodic pattern in u 2 as well. However, the strong anti-correlation in u 1 and u 2 provides a cancelation in the feedback to u 3 , which makes the signal of u 3 more noisy than u 2 . The consequence is that the filtering and prediction skill in u 3 is much worse than those in u 2 due to the much larger noise to signal ratio in u 3 .
To summarize, in the Regime with = 0.1, all four of the setups lead to comparable results for both filtering and predicting the slow variable u 1 . In particular, the most efficient strategy P/R works quite well. The filtering and prediction of the two fast variables u 2 and u 3 using F/F and P/F also show skillful results when the observational time step ∆t obs is short. When ∆t obs exceeds the decorrelation time of u 2 and u 3 , the filter estimates tend to miss some large events while the prediction results fail to capture all the extreme events. In the Regime with = 1, the reduced strategy (P/R) for u 1 does not work well especially with small observational time step ∆t obs . Nevertheless, if an observational noise inflation is adopted (P/R tuned), then both the filtering and prediction skill can be improved and becomes nearly comparable to those to the full filter with full observations (F/F) when ∆t obs is small to moderate. When the ∆t obs is large, the model error in the reduced forecast model (108) becomes dominant. In such a situation, only a full forecast model provides skillful prediction and filtering results while a partial observation (P/F) is allowed for retaining the skill. The partial observation (P/F) also gives a comparable skill as the full observation (F/F) in filtering and predicting u 2 but only in the setup with both the full forecast model and the full observations (F/F) leads to skillful results for u 3 . A summary is shown in Table 1. Table 1. Summary of the four setups in filtering the 3 × 3 system in (104). The four setups are: Full observations, full forecast model (F/F); partial observations, full forecast model (P/F); partial observations, reduced forecast model (P/R); and partial observations, reduced forecast model and tuned observational noise level with inflation (P/R tuned). Here, √ means the strategy works for small, moderate and moderately large ∆ obs . Small ∆ obs implies ∆ obs ≤ 0.4 which is roughly the decorrelation time of u 2 and u 3 in = 0.1 regime. Moderate ∆ obs means 0.4 ≤ ∆ obs ≤ 1.2 and moderately large ∆ obs is up to ∆ obs ≤ 2, which is nevertheless below the decorrelation time of u 1 since u 1 has a slow-varying time-periodic forcing.  The three rows are shown for the skill of u 1 , u 2 and u 3 , respectively. The numerical simulation is based on time series with total length T total = 5000 while the largest observational time step here is ∆t obs = 2.

Shallow Water Flows
Finally, let us study the filtering and prediction for spatially-extended systems. Consider the linearized two-dimensional rotating shallow water equation [8,143] ∂u ∂t where u = (u, v) T is the two-dimensional velocity field and η is the geophysical height. Here, is the Rossby number representing the ratio between the Coriolis term and the advection term. We also set the Froude number equal to the Rossby number, which is the typical case in realistic geophysical flows [8]. Applying the Fourier decomposition method (See Section 4.4 in [8]) to (110), a 3 × 3 system is obtained for each Fourier wavenumber. In particular, associated with each Fourier wavenumber, there are: 1.
One geostrophically balanced (GB) mode with eigenvalue The GB mode is incompressible.
The gravity modes are compressible.
Therefore, the solution of the shallow water equation in (110) can be written as a superposition of different Fourier modes, where the eigenvectors associated with the GB and gravity modes, i.e., r k,0 and r k,± , are given by respectively, for |k| = 0 and respectively, for |k| = 0. Here, in (111)-(115), k = (k 1 , k 2 ) and x = (x, y).
On the other hand,û k,1 =û k,B . Without L k,12 , L k, 21 , L k, 13 and L k,31 , these setups are similar to those in [139,155] except that the starting 3 × 3 systems in [139,155] are complex and there are two extra freedoms for noise in the pair of the gravity modes. In (116), the GB and gravity modes are coupled with each other linearly through nonzero coefficients L k,12 , L k,21 , L k, 13 and L k, 31 .
Next, the noisy observations are given by the velocity fields u and v at each grid point in physical space. This is known as the Euler observations. Note that Lagrangian observations (via Lagrangian tracers) are also widely used in filtering the shallow water flows or more generally the geophysical flows [49,139,[155][156][157][158][159]. Here, the Fourier expansion is applied to the noisy observational data of u and v. We assume the observational noise is white. Therefore, the noise level associated with each Fourier wavenumber is the same [108]. Note that the observations are not the Fourier coefficients in (116). They are the summation of the three Fourier componentsû k,α for α = {B, ±} multiplying by the associated eigenvectors r k,α in (114) and (115) according to the expression of the velocity in (113). These correspond to the setups of P/F, P/R and P/R as discussed in Section 3.5. We will also report the filtering and prediction skill using the F/F as introduced in Section 3.5, which assumes that the observation for each GB and gravity mode is available. Although such a setup is idealized, it provides the optimal filtering and prediction results and can be used to examine the skill in the other setups. Once the results are obtained for each Fourier mode associated with the 3 × 3 system in (116), the summation of different Fourier modes are taken to recover the velocity field in the physical space. In practice, recovering and predicting the GB flow are of particular interest since GB flows lie in a longer time scale. Therefore, we focus on the study of the GB flow in different setups (F/F, P/F, P/R and P/R tuned). Since the GB modes are incompressible, it is more convenient to show the stream function ψ instead of the velocity field, where (u, v) = (∂ψ/∂y, −∂ψ/∂x).
In the following, we consider the Fourier wavenumbers k in [−2, 2] 2 , where there are 25 GB modes and 50 gravity modes. The modes with k = (0, 0) are the background modes, which are usually deterministic. Thus, we filter and predict the other 24 wavenumbers. Note that the mode k and −k are complex conjugate. The following parameters are taken in rotating shallow water Equation (110), Two dynamical regimes will be studied. They are = 0.1 (fast rotation regime) and = 1.0 (moderate rotation regime). The observational noise level is r o k = 1.5. The noise to signal ratio varies in different Fourier modes, but the noise is about 30% to 40% compared with the amplitude of the true signals multiplying by the eigenvectors (which is also the observational operator here) when the mode has observability [15,140,160,161]. The observability issue will be discussed at the end of this section.
The statistical behavior in filtering and predicting each Fourier wavenumber based on the information-theoretic framework are quite similar to those in Section 3.5.1. Therefore, in the following, we focus only on the comparison in the physical space, the results of which are given by taking the summation of different Fourier modes. In Figures 26 and 27, the prediction and filtering results in Regime = 0.1 are shown. With a short observational time step ∆t obs (shorter than the decorrelation time of the gravity waves), both the filtering and prediction estimates are quite accurate, despite the fact that the prediction estimates in Figure 26 contain small errors in recovering the vortex in the right bottom corner. When ∆t obs is increased to ∆t obs = 1, which is longer than the memory time of the gravity waves, obvious errors are found in the predicted GB flows as shown in Figure 27. Nevertheless, the overall patterns and the amplitudes of the predicted GB flows in all the setups remain acceptable. The filtering estimates are more accurate than the predictions, especially in recovering the vortex near the left edge. On the other hand, in Regime = 1, even with a short observational time step ∆t obs = 0.1, the prediction is inaccurate. See Figure 28. The error comes from both the pattern and the amplitude, the latter of which is quantified by the relative entropy. When ∆t obs becomes ∆t obs = 1, the filtering skills using the three practical setups (P/F, P/R and P/R tuned) all contain significant errors while the prediction estimates provide completely wrong patterns such as those at the right bottom corner. See Figure 29.
One interesting question to ask is that whether the observations of both the velocity fields u and v are needed in filtering and predicting the rotation shallow water flows since these two velocity components are strongly linked through the eigenvectors (114). To answer this question, we show the filtering and prediction estimates in physical space by observing both u and v (Panels (b) and (d)) and observing only u (Panels (c) and (e)). See the first two rows of Figure 30. Here, the fast rotation regime = 0.1 is chosen and a short observational time step ∆t obs = 0.1 is adopted. It is clear that by observing only u, both the filtering and prediction estimates contain significant errors under the setups of both F/F and P/R (and others, not shown here). In fact, it is expected from Section 3.5 that with such a small ∆t obs and in the small regime, both the filtering and prediction results are accurate. This is true for most of the Fourier modes, such as k = (1, 1) as shown in the last row of Figure 30. However, it is seen in the third row that the estimates of mode k = (1, 0) for both filtering and predictions are quite different from the truth by observing only u. The reason is that the first component of r k,B in (114) which multipliesû k,B in obtaining the observation of u is zero for all modes with k = (k 1 , 0). This means any modeû k,B with k = (k 1 , 0) has no observability. In other words, the observation u plays no role in the filtering process. The consequence is that both the filtering and prediction estimates of u k,B follow exactly the mean evolution of the dynamics. This is clearly demonstrated in column (e) for P/R. On the other hand, the small fluctuations in the estimates ofû k,B in F/F are due to the coupling betweenû k,B andû k,± where the latter is observable. These findings indicate the importance and necessity of observing both u and v.
There are a few issues that have not been fully addressed here but can be good directions for future works. First, there might not be necessary to observe all the components of u and v. For example, observing v only for the modes that u has no observability may provide a cheaper strategy. Second, comparing the Euler and Lagrangian observations is an interesting topic. In fact, it has been shown in [49] that there exists an information barrier in recovering the velocity field using the Lagrangian observations. Whether this information barrier can be rigorously quantified by using Euler measurements and how to combine Euler and Lagrangian observations to maximize the information are both important topics that deserve further explorations. Finally, as has been noticed here, the P/R tuned setup does not significantly reduce the biases due to the representation error. Therefore, a more systematical study of understanding and improving the representation error is a good future direction.

Information, Sensitivity and Linear Statistical Response-Fluctuation-Dissipation Theorem (FDT)
In Sections 2.3 and 2.4, we have shown the response in the statistical mean as a function of the external forcing perturbation in linear models, where analytic formulae were available and they were used to explicitly illustrate the response. For complex nonlinear dynamical systems, computing the system response due to different types of external perturbations is an important issue in many areas including climate change in climate science and feedback control in engineering. These external perturbations can be forcing (as in the examples shown in Sections 2.3 and 2.4), dissipation, phase as well as all other types of perturbations. In addition, the response function of interest is not only the statistical mean but also the energy (variance) and many other nonlinear functions of the state variables. Clearly, for most of the nonlinear systems, analytic formulae for the statistical response are not available and direct numerical methods are too expensive to adopt. Therefore, it is important to develop a general strategy of efficiently computing the system response to any external perturbation in complex nonlinear dynamical systems.
The fluctuation-dissipation theorem (FDT) [38][39][40]162] is an attractive way to assess the system response by using the statistics of the present states. For example, the important practical and conceptual advantages for climate change science when a skillful FDT algorithm can be established is that the linear statistical response operator produced by FDT can be used directly for multiple climate change scenarios, multiple changes in forcing, dissipation and other parameters and inverse modelling directly [163,164] without the need of running the complex climate model in each individual case, often a computational problem of overwhelming complexity. With systematic approximations, FDT has been shown to have high skill for suitable regimes of general circulation models (GCMs), which are extremely complicated with an order of a million degrees of freedom [163,164].

The General Framework
Here, we summarize the general framework of the FDT [40]. Consider a general nonlinear dynamical system with noise where u ∈ R N is the state variables, σ is an N × K noise matrix andẆ ∈ R K is a K-dimensional white noise. The evolution of the PDF p(u) associated with u is driven by the so-called Fokker-Planck equation [108], where Σ = σσ T and p| t=0 = p 0 (u). Let p eq (u) be the smooth equilibrium PDF that satisfies L FP p eq = 0. The statistics of some function A(u) are determined by Now, consider the dynamical in (119) by a small external forcing perturbation δF(u, t). The perturbed system reads du dt = F(u) + δF(u, t) + σ(u)Ẇ.
We further assume an explicit time-separable structure for δF(u, t), which occurs in many applications [40,97,165], namely δF(u, t) = δw(u) f (t). (123) Then, the Fokker-Planck equation associated with the perturbed system (122) is given by where δL ext p δ = L ext p·δF(t), Similar to (121), for the perturbed system (124) the expected value of the nonlinear functional A(u) is given by The goal here is to calculate the change in the expected value To this end, let's take the difference between (120) and (124), where δp = p δ − p eq is the small perturbation in the PDF. Ignoring the higher order term δL ext δp assuming δ is small, (127) reduces to ∂ ∂t δp = L FP δpp + δL ext p eq , δp| t=0 = 0.
Since L FP is a linear operator, with the semigroup notation, exp[tL FP ], for this solution operator, the solution of (128) is written concisely as Now, combining (129) with (124) and (126), we arrive at the linear response formula where the vector linear response operator is given by This general calculation is the first step in the FDT. However, for nonlinear systems with many degrees of freedom, direct use of the formula in (131) is completely impractical because the exponential exp[tL FP ], cannot be calculated directly.
FDT states that, if δ is small enough, then the leading-order correction to the statistics in (121) becomes [40] δ A(u) (t) = t 0 R(t − s)δ f (s)ds, (132) where R(t) is the linear response operator, which is calculated through correlation functions in the unperturbed climate: See [40] for a rigorous proof of (132) and (133). Clearly, calculating the correlation functions in (133) via FDT is much cheaper and practical than directly computing the linear response operator (131).
Before we move on to the more specific FDT algorithms, let's comment on the perturbation function in (122) and (123). In fact, if w has no dependence on u, then δF(t) naturally represents the forcing perturbation. If w(u) is a linear function of u, then δF(u, t) represents the perturbation in dissipation. It is also clear that if the functional A(u) in (132) is given by A(u) = u, then the response computed is for the statistical mean. Likewise, A(u) = (u −ū) 2 is used for computing the response in the variance.
Notably, despite the small perturbation, FDT (132) and (133) does not require any linearization of the underlying dynamics in (119). Therefore, it captures the nonlinear features in the underlying turbulent systems.

Approximate FDT Methods
One major issue in applying FDT directly in the form of (133) is that the equilibrium measure p eq (u) is not known exactly. Therefore, different approximate methods have been proposed to compute the linear response operator.

Quasi-Gaussian (qG) FDT.
Among all the approximate methods, the quasi-Gaussian (qG) approximation is one of the most effective approaches. It uses the approximate equilibrium measure where the meanū and covariance matrix R match those in the equilibrium p eq . One then calculates (135) and replaces B(u) by B G (u) in the qG FDT. The correlation in (133) with this approximation is calculated by integrating the original system in (119) over a long trajectory or an ensemble of trajectories covering the attractor for shorter times assuming mixing and ergodicity for (133).
For the special case of changes in external forcing w(u) i = e i , i ≤ i ≤ N, the response operator for the qG FDT is given by the matrix The qG FDT will be applied in the simple example in Section 4.2.
Kicked FDT. One strategy to approximate the linear response operator which avoids direct evaluation of π eq through the FDT formula is through the kicked response of an unperturbed system to a perturbation δu of the initial state from the equilibrium measure [30], that is, One important advantage of adopting this kicked response strategy is that higher order statistics due to nonlinear dynamics will not be ignored (compared with other linearized strategy using only Gaussian statistics [162]). Then, the kicked response theory gives the following fact [28,40] for calculating the linear response operator: Fact: For δ small enough, the linear response operator R (t) can be calculated by solving the unperturbed system (119) with a perturbed initial distribution in (137). Therefore, the linear response operator can be achieved through Here, δπ is the resulting leading order expansion of the transient density function from unperturbed dynamics using initial value perturbation. The straightforward Monte Carlo algorithm to approximate (138) is sketched elsewhere [40,50]. The use of kicked FDT in calibrating the reduced-order models will be illustrated in Section 6.3.

Information Barrier for Linear Reduced Models in Capturing the Response in the Second Order Statistics
In this subsection, we use a simple 2D example to systematically illustrate the procedure of the FDT as introduced above. We also aim at showing the information barrier for linear reduced models in capturing the response beyond the first-order statistics. Note that such an information barrier was first pointed out in [41] with detailed discussions and more complicated examples.
The perfect model here is the SPEKF type of non-Gaussian model as discussed in (11), except that for simplicity we adopt a constant forcing f u in the equation of u, The following parameters are used in (139) in order to generate non-Gaussian statistics of u, In Figure 31, sample trajectories and the associated PDFs of the SPEKF type non-Gaussian model (139) with parameters (140) are shown. Since γ frequently crosses zero and becomes negative, the corresponding signal of u is intermittent. Consequently, u has a skewed non-Gaussian PDF with a one-side fat tail.
With this constant forcing f u ≡ 1, the time evolutions of the mean u and variance Var(u) of u are shown in Figure 32. For simplicity of the discussion below, the initial time here is set to be t 0 = −12. It is clear that after t reaches around t = −6, the model (139) arrives at the statistical equilibrium. Now, we add a forcing perturbation δ f u (t) to the model in (139), The function δ f u (t) is a ramp-type perturbation with the following form with The profile of δ f u (t) is shown in panel (c) of Figure 32. The forcing perturbation δ f u (t) starts from 0 at time t = 0 and it reaches 0.1 at roughly t = 5. After t = 5, δ f u (t) stays at δ f u (t) = 0.1. Due to this forcing perturbation, the mean u and variance Var(u) also have corresponding changes, which are shown in panels (a) and (b) of Figure 32. Note that these responses are computed by using the analytical formulas of the time evolutions of the statistics, which are accurate. They are known as the idealized responses.
In most realistic scenarios, the true dynamics is unknown or it is too expensive to run the full perfect model. Therefore, simplified or reduced models are widely used in computing the responses. One type of the simple models that are widely adopted is the linear model, Note that adopting such a linear model to compute the responses shares the same philosophy as one of the ad-hoc-FDT procedures [166], where linear regression approximate stochastic model [87] is used for the variables of interest before applying FDT.
The three parameters in (144) are calibrated by matching the equilibrium mean, equilibrium variance and decorrelation time with those of u in the perfect model (139), where Note that the autocorrelation function of u in (139) with parameters in (140) does not have a strong oscillation decaying structure, and therefore matching the decorrelation is sufficient for the calibration purpose here. With such calibrations, the linear model (144) automatically fit the unperturbed mean and variance at t = 0. Now, we add the same forcing perturbation to the linear model, Since the statistics in the linear model is Gaussian, the formulas in (134)-(136) become rigorous with no approximation. In computing the responses to the forcing perturbation δ f u (t) in the mean and variance of u, the functional A(u(t)) is set to be Response in the mean : A(u(t)) = u M , Response in the variance : respectively. These responses using such a linear model are shown in Figure 33 (green colors) while the idealized responses are shown in blue for reference. It is clear that the response in the mean using the linear model captures the trend of the truth, but the amplitude is severely overestimated. On the other hand, the response in the variance using the linear model is identically zero and therefore it completely misses the truth. In fact, inserting the second Equation (147) into (136) yields solving a third-order centered moment. However, all odd-centered moments automatically vanish for Gaussian distribution and therefore the response in the variance using the linear model is zero [41], which has already been mentioned in Sections 2.2.1 and 2.3. These results unambiguously indicate the insufficiency of using linear approximate models as well as the ad hoc-FDT [166] to compute the responses when the underlying dynamics is highly nonlinear. As a comparison, we also show the responses using the qG FDT based on the perfect model (139). Since the forcing perturbation is only on the direction of u, w(u) in (135) is given by w(u) = [1, 0] T . In Figure 33, it is clear that the qG FDT based on the perfect model (red) captures the response in the mean quite accurately. In addition, this qG FDT also results in a response in the variance and the skill in recovering the time evolution of the variance response is pretty good. Notably, although the response operator R(t) in (132) is linear and the Gaussian approximation (134) is used in computing the equilibrium PDF of the unperturbed system, the underlying nonlinear dynamics is used in computing the functional A(u(t)) in (147). Therefore, the nonlinear interaction is included in the FDT and the response in the variance is captured to a large extent. It is of importance to keep in mind that FDT does not implement linearization on the original underlying nonlinear system. Thus, the nonlinear dynamical features are reflected in the FDT. The linearization is applied only in the response operator due to small perturbations.
Although the simple test example here deals with a constant forcing in the unperturbed system, the FDT technique can be easily generalized to the systems with time-periodic settings, which usually corresponds to annual or seasonal cycles in climate, atmosphere and ocean sciences. Mathematical theories of the generalizations of FDT to time-dependent ensembles can be found in [162]. In [167], a triad nonlinear stochastic model with time-periodic setting was developed, which mimics the nonlinear interaction of two Rossby waves forced by baroclinic processes with a zonal jet forced by a polar temperature gradient. Systematical studies showed that qG FDT has surprisingly high skill for the mean response to the changes in forcing. The performance of qG FDT for the variance response to the perturbations of dissipation is good in the nearly Gaussian regime and deteriorates in the strongly non-Gaussian regime. More examples can be found in [15,40].

(c) f u and f u + f u (t)
Response to forcing perturbation Equilibrium of the unperturbed model New equilibrium Relaxation towards equilibrium   (panel (a)) and variance Var(u(t)) (panel (b)) of u with the forcing perturbation δ f (t) given in (142). The perturbation starts at time t = 0, which is consistent with that in Figure 32.
Other FDT techniques that have skillful performance in dealing with complex nonlinear dynamical systems includes blended response algorithms [168,169] and kicked FDT [30]. FDT has been demonstrated to have high skill for the mean and variance response in the upper troposphere for changes in tropical heating in a prototype atmospheric GCM and can be utilized for complex multiple forcing and inverse modeling issues of interest in climate change science [163,164]. Note that GCMs usually have a huge number of state variables and applying FDT on the entire phase space is impractical due to the limitations in calculating the covariance matrix. Practical strategies involve computing the response operator on a reduced subspace. Mathematical principles of applying FDT on reduced subspaces can be found in [41].

Information Theory for Finding the Most Sensitive Change Directions
An important question in climate change is how to find the most sensitive directions for climate change given the present climate. To quantify these most sensitive directions, consider a family of parameters λ ∈ R p with π λ the PDF of the true climate as a function of λ. Here λ = 0 corresponds to the unperturbed state or the present climate π. Note that λ can consist of external parameters such as changes in forcing or parameters of internal variability such as a change in dissipation. In light of the information theoretic framework, the most sensitive perturbed climate is the one with the largest uncertainty related to the unperturbed one, P (π λ * , π) = max λ∈R p P (π λ , π).
The calculation of the most sensitive perturbation for the present climate in (148) is through the information theoretical framework. Assume that π λ is differentiable with respect to the parameter λ [90,162,170]. Since π λ | λ=0 = π, for small values of λ, we have where λ · I(π)λ is the quadratic form in λ given by the Fisher information [40,93,162,171] λ · I(π)λ = (λ · ∇ λ π) 2 π , (150) and the elements of the matrix of this quadratic form are given by Detailed derivations of (149)- (151) are included in Appendix A. Note that the gradients are calculated at the unperturbed state λ = 0. Therefore, if both the unperturbed state π and the gradients λ · ∇ λ π are known, then the most sensitive perturbation direction occurs along the unit direction e * π ∈ R p which is associated with the largest eigenvalue λ * π of the quadratic form in (150). Below, we use two simple examples to provide insights of the above information theoretical framework in finding the most sensitive change direction in the underlying models. We will start with a linear example, where all the results using the direct calculation method can be written down explicitly. We aim at comparing the results using the direct method and using the Fisher information in (150). The analytic formulae associated with this linear example also allow us to understand the contributions of the uncertainty in the perturbation from the signal and the dispersion parts, respectively. Then, we will use a more complicated nonlinear example with non-Gaussian noise to show the efficiency and accuracy of using the information criterion in (150).
The first example is an one-dimensional linear model, the equilibrium PDF of which is Gaussian and is given by N (ū, C), The two-dimensional parameters λ = ( f , a) T ∈ R 2 for external forcing and dissipation are the natural parameters which are varied in this model. Therefore, the corresponding I(λ) in (150) is a 2 × 2 matrix with entries I ij , i, j = 1, 2. Using (153), it is straightforward to compute the first-order derivatives of π with respect to f and a, In light of (151) and (155), the four elements of I have the following explicit expressions: Now, let's implement numerical experiments. The following two groups of parameters are used: Since I is a 2 × 2 matrix, there are only two eigenmodes. The eigenvector w associated with the larger eigenvalue corresponds to the most sensitive direction with respect to the perturbation (δ f , δa) T . By plugging the model parameters (157) into the I matrix in (156), we find the most sensitive direction in both of the cases: To gain more intuition on the results of these most sensitive directions, we make use of the simple structure of (152) to solve this problem in an alternative way. In fact, given small perturbations (δ f , δa) T to ( f , a) T , the corresponding perturbed mean and variance can be written down explicitlȳ Since both the unperturbed and perturbed PDFs are Gaussian, we can easily make use of the explicit formula of the relative entropy in (6) to compute the uncertainty due to the perturbation P (π, π δ ) and find the most sensitive direction in the two-dimensional parameter space. Recall in (6) that the total uncertainty can be decomposed into signal and dispersion parts. Making use of (154) and (159), we have Note that the dispersion part depends only on the perturbation in the dissipation δa since f has no effect on the variance. In addition, it is clear that δa and δ f should have opposite signs in order to maximize the relative entropy in the signal part. Figure 34 shows the total relative entropy as well as its two components, namely signal and dispersion, as a function of the perturbations in the two-dimensional parameter space (δ f , δa) T using the direct formula (160). The numerical simulation here assumes δ f 2 + δa 2 ≤ 0.05 to guarantee the perturbation is small enough. In both cases, the most sensible direction with respect to only the dispersion part lies in the direction (δa, δ f ) T = (1, 0) T , due to the fact that δ f has no effect on the dispersion part. In the signal part, the most sensitive direction satisfies aδ f = − f δa. The overall most sensitive direction depends naturally on the weights of signal and dispersion parts. When σ becomes larger, the weight on the signal part reduces since the signal part is proportional to the inverse of the model variance. It is easy to see that the most sensitive directions as indicated by the black dashed lines in Figure 34 are consistent with the theoretical prediction from (158) using the Fisher information (148)- (151).  (6). The black dashed line shows the direction of the maximum uncertainty, namely the most sensitive direction of perturbation. Now, we consider a second example with a nonlinear model [116,170], The nonlinear model in (161) is a canonical model for low frequency atmospheric variability and was derived based on stochastic mode reduction strategies. This one-dimensional, normal form was applied in a regression strategy in [116] for data from a prototype AOS model [112] to build one-dimensional stochastic models for low-frequency patterns such as the North Atlantic Oscillation (NAO) and the leading principal component (PC-1) that has features of the Arctic Oscillation. Note that the model in (161) has both correlated additive and multiplicative noise (A − Bu)Ẇ C as well as an extra uncorrelated additive noise σẆ A . The nonlinearity interacting with noise allows a rich dynamical features in the model such as strongly non-Gaussian PDFs and multiple attractors. Different from the previous example with linear dynamics, the direct method has no explicit solution for the nonlinear system (161). The goal here is to find the most sensitive directions using the information theory developed above in different dynamical regimes.
Here, we consider a simple case with A = B = 0 such that the model has only additive noise. Nevertheless, the cubic nonlinearity still allows the model to have strong non-Gaussian characteristics. With A = B = 0, the equilibrium PDF of (161) is given by the following explicit formula π(u) = N 0 exp 2 We again look at the perturbation in the two-dimensional parameter space λ = ( f , a) T , which represent the changes in forcing and damping. Following (149)- (151), we aim at solving the eigenvectors of the 2 × 2 matrix I(λ). To explicitly write down the elements in I(λ), we define Straightforward calculations show that Now, we focus on the case studies in the following three regimes, Regime I : Regime II : The sample trajectories and equilibrium PDFs associated with these regimes are shown in Figure 35. The PDF in Regime I is unimodal with skewness and an one-side fat tail. Interestingly, the time series in Regime I shows a distinct regimes of behavior [172,173]. Regimes II and III correspond to PC-1 and NAO for the low frequency data as discussed above, where Regime II has a slight skewed PDF with sub-Gaussian tails while Regime III is nearly Gaussian.
With the parameters in (165) and the explicit expression of I(λ) in (164), the most sensitive direction of the parameter perturbation in the two-dimensional space (δ f , δa) T is given by respectively The results in (166) imply that the forcing perturbation leads to more significant changes of the system in Regimes I and II while damping perturbation is more crucial in Regime III for the NAO. In column (c) of Figure 35, we show the numerical simulations of the relative entropy in (1) with perturbations in all the directions within the entire two-dimensional parameter space (δ f , δa) T . Here, we take smaller (δ f , δa) T in Regime II than those in Regimes I and III due to the smaller parameter values ( f , a) T in Regime II. These numerical results, which are more expensive to compute, are consistent with the theoretical predictions in (166). Note that, although the most sensitive directions in Regimes I and II are close to each other, the ratio of the larger to smaller eigenvalues in the two regimes are quite different with 18.2979 in Regime I and 2.5307 in Regime II. This means that there is a direction of (δ f , δa) T in Regime I in which the perturbation results in almost no change in the PDF, which can also be seen in column (c) of Figure 35.
Note that both the simple examples shown above contain the perfect knowledge of the present climate given by the unperturbed equilibrium PDFs. However, it is often quite difficult in practice to know the exact expression of these PDFs or it is computationally unaffordable to compute the gradient in high dimensions. Therefore, many approximations are combined with the information theoretical framework developed above. One common practical strategy is to adopt some approximated PDFs based on a few measurements such as the mean and covariance. It is also common to use imperfect or reduced models from a practical point of view, where FDT can also be incorporated to calculate the gradient of the present climate. Then, quantifying the model error in finding the most sensitive directions using imperfect models is an important issue. For detailed discussions of these topics, please see the reference [26]. The parameter values of the three regimes are given in (166). The black dashed line shows the direction of the maximum uncertainty, namely the most sensitive direction of perturbation.

A General Framework
A central issue in contemporary science is the development of data-driven statistical dynamical models for the time series of a partial set of observed variables which arise from suitable observations from nature ( [174] and references therein). Examples involve multi-level linear autoregressive models as well as ad hoc quadratic nonlinear regression models. It has been established recently [111] that ad hoc quadratic multilevel regression models can have finite time blow up of statistical solutions and pathological behavior of their invariant measure even though they match the data with high precision. Recently, a new class of physics-constrained multi-level nonlinear regression models was developed which involve both memory effects in time as well as physics constrained energy conserving nonlinear interactions [47,48] and completely avoid the above pathological behavior with full mathematical rigor.
The physics-constrained multi-level nonlinear regression models have the following forms: where B(u, u) is a quadratic nonlinearity which imposes the physical constraint of energy conservation on the nonlinear terms, namely u · B(u, u) = 0.
In (167), the noise has the form r = (r 1 , . . . , r p ) T where p denotes the number of memory levels and these noises are characterizes by the triangular matrix A. The situation with p = 0 denotes the special zero-memory level model See [47,48] for more details.
The ideas of developing physics-constrained nonlinear regression models can be combined with information calibration for predicting strongly nonlinear time series. The general procedure is shown in Figure 36. Here, the observed time series are divided into two parts, namely the training phase and the prediction phase. In the first step, physics-constrained nonlinear stochastic models are developed based on the characteristics of the given time series in the training phase. The second step involves applying information theory for model calibration using the time series again in the training phase. Then, the remaining time series is used for testing the prediction skill of the calibrated model.

t Calibration Phase
Prediction Phase (Using Information Theory) Step 2: Model Calibration Step 1: Model Development Step 3: Prediction

Model Calibration via Information Theory
The key step above is the model calibration. As has been seen in Section 2, an effective model is expected to capture both the fidelity and sensitivity of nature. Therefore, two objective functions are utilized here for model calibration. The first one aims at capturing the model fidelity, which is given by minimizing the information distance between the PDF associated with the time series π(u) and that associated with the model π M (u). The model fidelity guarantees the model's ability in recovering the long-term statistics of nature. However, the model fidelity does not necessarily provide skillful predictions at short and medium ranges. See examples in Figures 9 and 10. Thus, a second objective function is launched, which aims at minimizing the distance between the two autocorrelation functions associated with the observed time series and the model, respectively. As has been shown in Section 2.5, the autocorrelation function is associated with the mean response of the system. In fact, autocorrelation function characterizes the overall time-evolving patterns of the underlying dynamical system. Capturing the autocorrelation function ensures the dynamical consistency and is crucial for skillful short and medium-range forecasts using the proposed model.
Denote θ the parameters in the physics-constrained nonlinear stochastic model. If both the model and nature are stationary, then the model calibration is given by the following optimization problem: where w 1 and w 2 are weight functions. In (170), E(λ) and E M (λ) are the energy spectra corresponding to the autocorrelation functions R(t) and R M (t) of nature and the model, respectively, as studied in Section 2.5. In practice, time-periodic forcing may be involved in both the observed time series and the physics-constrained model. In such a situation, both π M and R M (t) can be formed by making use of the sample points in a long trajectory from the model. Since the stationary assumption is broken, the target function in (170) can be modified as the average value of the minimizations at different points within one period. Alternatively, an even cruder but practically useful target function involves a modified version of the first part in (170) given by the empirical measurements based on the time-averaged PDFs while the second part in (170) is replaced by directly computing the difference between the two autocorrelation functions. The important issue here is that both the PDF and the temporal correlation must be included in the target function. The model calibration based on (170) or its modified versions has several salient features. First, the information distance P(π, π M ) is able to quantify the difference in the non-Gaussian statistics between the model and nature. Particularly, it is able to assess the skill of the model in recovering extreme events. Second, the two target functions play the role of improving long-term and short-term prediction skill, respectively. Therefore, the calibrated model can be used for predicting both transit phases and the statistical equilibrium state. Third, although the number of the parameters, namely the dimension of θ, can be large, the cost function L is in general robust with respect to the perturbation of θ around the optimal values with a suitable choice of the physics-constrained nonlinear model. This is crucial in practice because it requires only a crude estimation of the parameters for the model, which greatly reduces the computational cost for searching in high-dimensional parameter space. In fact, as has been shown in [46], the energy-conserving nonlinear interaction in these physics-constrained nonlinear models is the underlying mechanism for such robustness property even in the presence of strong nonlinearity and intermittency. Finally, the physics-constrained nonlinear stochastic models require only a short training period [61,175] because the model development automatically involves a large portion of the information of nature. Thus, the data-driven physics-constrained modeling framework as discussed above is much cheaper and more practical than most non-parametric methods where a massive training data is typically required.

Applications: Assessing the Predictability Limits of Time Series Associated with Tropical Intraseasonal Variability
A striking application combining physics-constrained nonlinear model strategy with the above procedure is to assess the predictability limits of time series associated with the tropical intraseasonal variability such as the the Madden-Julian oscillation (MJO) and monsoon [46,61,176]. They yield an interesting class of low-order turbulent dynamical systems with extreme events and intermittency. Denote by u 1 and u 2 the two observed large-scale components of tropical intraseasonal variability. Here, we focus on the MJO time series [46], which are measured by outgoing longwave radiation (OLR; a proxy for convective activity) from satellite data [177]. See panel (a) of Figure 37. The PDFs for u 1 and u 2 (panel (c)) are highly non-Gaussian with fat tails indicative of the temporal intermittency in the large-scale cloud patterns. To describe the variability of the time series u 1 and u 2 , the following family of low-order stochastic models are proposed: where In (171), in addition to the two observed variables u 1 and u 2 , the other two variables v and ω u are hidden and unobserved, representing the stochastic damping and stochastic phase, respectively. Here,Ẇ u 1 ,Ẇ u 2 ,Ẇ v andẆ ω are independent white noise. The constant coefficients d u , d v , d ω represent damping for each stochastic process, and the non-dimensional constant γ is the coefficient of the nonlinear interaction. The time periodic damping v f (t) in the Equation (171) is utilized to crudely model the active winter and the quiescent summer in the annual cycle. The constant coefficients ω f and φ in (172) are the frequency and phase of the damping, respectively. All of the model variables are real. The energy conserving nonlinear interactions between u 1 , u 2 and v, ω u are seen in the following way. First, by dropping the linear and external forcing terms in (171), the remaining equations involving only the nonlinear parts of (171) read To form the evolution equation of the energy from nonlinear interactions E = (u 2 1 + u 2 2 + v 2 + ω 2 u )/2, we multiply the four equations in (173) by u 1 , u 2 , v, ω u respectively and then sum them up. The resulting equation yields The vanishing of the right-hand side in (174) is due to the opposite signs of the nonlinear terms involving v multiplying u 1 and u 2 in (174) and those in (174) multiplying by v as well as the trivial cancellation of skew-symmetric terms involving ω u .
Further motivation for the models in (171) is provided by the stochastic skeleton model which predicts key features of the MJO [178][179][180][181]. These are coupled nonlinear oscillator models of the MJO where if we identify the OLR variables with the envelope of synoptic scale convective activity, the hidden variables v, ω u , and their dynamics become phenomenological surrogates for the energy-conserving interactions in the skeleton model involving the synoptic scale convective activity and the equatorial dynamic equations for temperature, velocity, and moisture.
It is shown in Figure 37 that, with the optimized parameters, the model in (171) almost perfectly capture the highly non-Gaussian fat-tailed PDFs, the autocorrelation functions (up to three months) and the power spectrums. In addition, the wiggles around one year in the autocorrelation functions, representing the annual cycle, are also recovered. Importantly, these parameters are pretty robust around the optimal values. In panel (b), a sample trajectory of u 1 from the model is shown, which shares many salient features as those of the observed MJO time series in panel (a). Another notable advantage of the physics-constrained nonlinear low-order stochastic models developed here is that the model structure allows an efficient nonlinear data assimilation scheme to determine the initial values of the hidden variables v, ω u [140]. This facilitates the ensemble prediction algorithm since no direct observation is available for these hidden variables. In [46], significant prediction skill of these MJO indices using the physics-constrained nonlinear stochastic model (171) was shown. The prediction based on ensemble mean can have skill even up to 40 days. In addition, the ensemble spread accurately quantify the forecast uncertainty in both short and long terms. In light of a twin experiment, it was also revealed in [46] that the model in (171) is able to reach the predictability limit of the large-scale cloud patterns of the MJO.

Reduced-Order Models (ROMs) for Complex Turbulent Dynamical Systems
The model in (175) has the following properties: • L = L + D is a linear operator representing dissipation and dispersion. Here, L is skew symmetric representing dispersion and D is a negative definite symmetric operator representing dissipative process such as surface drag, radiative damping, viscosity, etc. • B (u, u) is a bilinear term and it satisfies energy conserving property with u · B (u, u) = 0.
The energy-conserving quadratic nonlinearity is one of the representative features in many turbulent dynamical systems in nature. The energy is transferred from the unstable modes to stable modes where the energy is dissipated resulting in a statistical steady state.
We use a finite-dimensional representation of the stochastic field consisting of a fixed-in-time, whereū (t) = u (t) represents the ensemble average of the response, i.e., the mean field, and Z i (t; ω) are stochastic processes. By taking the average of (175) and using (176), the mean equation is given by with R = ZZ * the covariance matrix. Moreover, the random component of the solution, u = Z i (t; ω) v i satisfies By projecting the above equation to each basis element v i , we obtain From the last equation, we directly obtain the exact evolution equation of the covariant matrix where we have: 1.
The linear dynamical operator expressing energy transfers between the mean field and the stochastic modes (effect due to B), as well as energy dissipation (effect due to D) and non-normal dynamics (effect due to L) 2.
The positive definite operator expressing energy transfer due to the external stochastic forcing 3.
The energy flux between different modes due to non-Gaussian statistics (or nonlinear terms) modeled through third-order moments We note that the energy conservation property of the quadratic operator B is inherited by the matrix Based on the observation that the eigenvalues are effectively changed by the existence of the nonlinear energy transfer mechanism, we propose a special form of the flux Q F that will make the correct steady state statistics a stable equilibrium. More specifically, we split the nonlinear fluxes into a positive semi-definite part Q F,+ and a negative semi-definite part Q F,− : As in (184), the nonlinear fluxes should always satisfy the conservative property of B, namely, The positive fluxes Q F,+ indicate the energy being 'fed' to the stable modes in the form of external stochastic noise. On the other hand, the negative fluxes Q F,− should act directly on the linearly unstable modes of the spectrum, effectively stabilizing the unstable modes.

Modeling the Effect of Nonlinear Fluxes
The first idea here is to model the effect of the nonlinear energy transfers on each mode by adding additional damping balancing the linearly unstable character of these modes, and adding additional (white) stochastic excitation with standard deviation which will model the energy received by the stable modes, In (185) Quasilinear Gaussian closure model: The simplest approximation for the closure methods at the first stage should be simply neglecting the nonlinear part entirely [182][183][184]. That is, set Thus, the nonlinear energy transfer mechanism will be entirely neglected in this Gaussian closure model. This is the similar idea in the eddy-damped Markovian model where the moment hierarchy is closed at the level of second moments with Gaussian assumption and a much larger eddy-damped parameter is introduced to replace the molecular viscosity [121,185]. Obviously, this crude Gaussian approximation will not work well in general due to the cutoff of the energy flow when strong nonlinear interactions between modes occur. Actually, the deficiency of this crude approximation has been shown under the Lorenz 96 framework, and in a final equilibrium state, there exists only one active mode with a critical wavenumber [11,50]. Such closures are only useful in the weakly nonlinear case where the quasi-linear effects are dominant.

2.
Models with consistent equilibrium statistics: The next strategy is to construct the simplest closure model with consistent equilibrium statistics. Thus, the direct way is to choose constant damping and noise term scaled with the total variance. We propose two possible choices as in [50] for the damping and noise in (185) Above, only two scalar model parameters ( M , σ M ) are introduced, and I N represents the N × N identity matrix. GC1 is the familiar strategy of adding constant damping and white noise forcing to represent nonlinear interaction; GC2 scales with the total variance trR (or total statistical energy) so that the model sensitivity can be further improved as the system is perturbed. From both GC1 and GC2, we introduce uniform additional damping rate for each spectral mode controlled by a single scalar parameter M , while the additional noise with variance σ 2 M is added to make sure climate fidelity in equilibrium. The statistical model closure Q M F is used to approximate the third-order moments in the true dynamics, thus the exponents of the total energy trR in GC2 should be consistent in scaling dimension. In the positive-definite part Q M F+ , it calibrates the rate of energy injected into the spectral mode due to nonlinear effect in the order |u | 3 . The factor scales with the total energy with exponent 3/2 so that the corrections keep consistent with the third-order moment approximations; In the negative damping rate Q M F,− , the scaling function is used to characterize the amount of energy that flows out the spectral mode due to nonlinear interactions. Scaling factor with a square-root of the total energy with exponent 1/2 is applied for this damping rate multiplying the variance in order |u | 2 to make it consistent in scaling dimension with third moments. However, the damping and noise are chosen empirically without consideration about the true dynamical features in each mode. A more sophisticated strategy with slightly more complexity in computation is to introduce the damping and noise judiciously according to the linearized dynamics. Then, climate consistency for each mode can be satisfied automatically.

3.
Modified quasi-Gaussian (MQG) closure with equilibrium statistics: In this modified quasi-Gaussian closure model originally proposed in [11,45], we exploit more about the true nonlinear energy transfer mechanism from the equilibrium statistical information. Thus, the additional damping and noise proposed as before are calibrated through the equilibrium nonlinear flux by letting N M,eq is the effective damping from equilibrium, and Q + F,eq is the effective noise from the positive-definite component. Unperturbed equilibrium statistics in the nonlinear flux Q F,eq are used to calibrate the higher-order moments as additional energy sink and source. The true equilibrium higher-order flux can be calculated without error from first and second order moments in (ū eq , R eq ) from the unperturbed true dynamics (180) in steady state following the steady state statistical solution relation: where Q − F,eq , Q + F,eq are the negative and positive definite components in the unperturbed equilibrium nonlinear flux Q F,eq . Since exact model statistics are used in the imperfect model approximations, the true mechanism in the nonlinear energy transfer can be modeled under this first correction form. This is the similar idea used for measuring higher order interactions in [45], where more sophisticated and expensive calibrations are required to make that model work there.

A Reduced-Order Statistical Energy Model with Optimal Consistency and Sensitivity
The above closure model ideas, especially (187)- (189), have advantages of their own. Models in (187) and (188) are simple and efficient to construct with consistent equilibrium consistency, while (189) involves the true information about the higher-order statistics in equilibrium so that the energy mechanism can be characterized well. The validity of these approaches has been tested and compared from several papers [11,45,50] using the simplified triad model and L96 model. The methods have also been applied to the two-layer quasi-geostrophic equation [44,117], where the phase space of the original system is 5 × 10 5 while the ROM contains only 0.1% of the large scale modes and can cope with the changes in external forcing. Still, when it comes to the more complicated and realistic flow systems such as the quasi-geostrophic equations, more detailed calibration for model consistency and sensitivity is required to achieve the optimal performance. A preferred approach for the nonlinear flux Q M F combining both the detailed model energy mechanism and control over model sensitivity is proposed in the form The closure form (191) consists of three indispensable components: (i).
Higher-order corrections from equilibrium statistics: In the first part of the correction using the damping and noise operator as N M,eq , Q + F,eq , unperturbed equilibrium statistics in the nonlinear flux Q F,eq are used to calibrate the higher-order moments as additional energy sink and source following the procedure in (189). Therefore, the equilibrium statistics can be guaranteed to be consistent with the truth, and the true energy mechanism can be restored. (ii).
Additional damping and noise to model changes in nonlinear flux: The above corrections in step (i) by using equilibrium information for nonlinear flux is found to be insufficient for accurate prediction in the reduced-order methods since the scheme is only marginally stable and the energy transferring mechanism may change with large deviation from the equilibrium case when external perturbations are applied. Thus, we also introduce the additional damping and noise (d M , Σ M ) as from (187). d M is just a constant scalar parameter to add uniform dissipation on each mode, and Σ M is the further correction as an additional energy source to maintain climate fidelity. (iii).
Statistical energy scaling to improve model sensitivity: Still note that these additional parameters are added regardless of the true nonlinear perturbed energy mechanism where only unperturbed equilibrium statistics are used. To capture the responses to a specific perturbation forcing, it is better to make the imperfect model parameters change adaptively according to the total energy structure. Considering this, the additional damping and noise corrections are scaled with factors f 1 (R), f 2 (R) related with the total statistical variance trR as

Calibration Strategy
As discussed in the previous sections, the calibration should involve two criteria: (1) the model fidelity (consistency) with the same equilibrium statistics as the truth, and (2) the optimal model sensitivity.
Let's denote π G (u) and π M G (u) the Gaussian distributions of the truth and the imperfect model, as in Section 2.1. Here, the Gaussian approximation of the PDFs is adopted since the above reduced-order model strategy only involves the evolution of mean and variance in the imperfect model. According to the information-theoretic framework in Section 2.1, the statistical equilibrium fidelity means that the Gaussian relative entropy satisfies Practically, we can make use of the explicit form (6) to calibrate the parameters such that (193) is satisfied. Statistical equilibrium fidelity is a natural necessary condition to tune the mean and variance of the imperfect model to match those of the perfect model; it is far from a sufficient condition. To see this, recall from Section 2.5 that different dynamical systems can have the same Gaussian invariant measure and therefore statistical equilibrium fidelity among the models is obviously satisfied (see [40] for several concrete examples). Thus, the condition in (193) should be regarded as an important necessary condition. UQ requires an accurate assessment of both the mean and variance and at least (193) guarantees calibration of this on a subspace, u ∈ R M , for the unperturbed model. Climate scientists often just tune only the means (see [26] and references therein).
Next, the prediction skill of imperfect models can be improved by comparing the information distance through the linear response operator with the true model. The response in the Gaussian framework P (π δ , π M δ ) can be computed by making use of (10). This condition has been shown to be crucial in calibrating the parameters (see examples in Sections 2.5 and 5). The optimal model M * ∈ M that ensures best information consistent responses to various kinds of perturbations is characterized with the smallest additional information in the linear response operator R among all the imperfect models, such that where π M δ can be achieved through a kicked response procedure (138) in the training phase compared with the actual observed data π δ in nature, and the information distance between perturbed responses P π δ , π M δ can be calculated with ease through the expansion formula (10). For the time-periodic cases, the information distance P π δ (t) , π M δ (t) is measured at each time instant, so the entire error is averaged under the L 1 -norm inside proper time window [0, T] (such as one period). Some low dimensional examples of this procedure for turbulent systems can be found in [10,30,186].
Below, in the example of predicting passive tracer extreme events using low-order reduced model (Section 6.3), the imperfect models are all linear Gaussian models. As we have already seen in Sections 2.2.1 and 2.3, the linear Gaussian models are only able to capture the response in the mean statistics. Therefore, minimizing the model sensitivity is based on optimizing the mean response in the imperfect models compared with that in the truth. Note that optimizing the mean response is equivalent to optimizing the autocorrelation function in the linear Gaussian models. Thus, instead of applying the general strategy with FDT for the optimization of the response, minimizing the spectral density (for autocorrelations) using information theory as discussed in Section 2.5 is an efficient alternative approach, which will be adopted below. The readers should keep in mind that these two methods share the same goal of optimizing the sensitivity in imperfect models.

Physics-Tuned Linear Regression Models for Hidden (Latent) Variables
Before we show concrete applications of the reduced-order modeling framework developed in Section 6.1, let's briefly discuss the physics-tuned linear regression modeling strategy. These physics-tuned linear regression models are particularly useful for simplifying the hidden or latent processes in complex dynamical systems while they preserve the important feedback from the latent variables to the resolved variables. Thus, such physics-tuned linear regression technique allows the dynamics and statistical structure of the models to become much more tractable and facilitates the application of the reduced-order modeling strategy in Section 6.1.
Consider the following general nonlinear system, where F 1 and F 2 are nonlinear functions of u and v. The model in (195) is usually written as a collection of Fourier modes. In (195), the state variables u are resolved variables that are our primary interest. The state variables v are the latent or unresolved variables, which nevertheless play an important role in contributing to the variability of u through nonlinear interactions. Here, the goal is to develop physics-tuned linear regression models of v such that their dynamics and statistical structure become much more tractable. Meanwhile, the feedback from v to u under the physics-tuned linear regression modeling framework are expected to be preserved to a large extent. The physics-tuned linear regression modeling framework for the latent variables v is given as follows: where Λ M are σ M v both diagonal matrices. The i-th diagonal entry of Λ M usually has the form where the real part of each diagonal entry of λ M j , namely −d M j , is negative. In other words, each component v j of v satisfies a one-dimensional OU process: The physics-tuned linear regression modeling strategy requires that each v M j in (197) satisfies both the model fidelity and model sensitivity compared with the i-th component of the truth v, namely v j , in (195). Therefore, the model parameters d M j , ω M j ,v M j and σ M v,j in (197) are calibrated by matching the equilibrium Gaussian PDF and the autocorrelation function with those associated with v j in (195). See Section 2.5 for more technical details. Note that the goal here is to let v M provide the least biased statistical feedback to u instead of recovering all the point-wise details of the latent random process v M .
Below, we use a simple example to illustrate physics-tuned linear regression modeling strategy and emphasize the importance of capturing the feedback from v to u. Note that such ideas will be adopted in Section 6.3 for predicting the extreme events in passive tracers, where v and u can be thought of as the surrogates of the advection flow and the passive tracer fields there, respectively.
The true model is a two-dimensional nonlinear model: In (198), u is complex but v is real. The unresolved process v is given by the canonical model for low frequency atmospheric variability, derived based on stochastic mode reduction strategies [116,170]. It has been used in Section 4.3. The unresolved variable v serves as the stochastic damping for the resolved variable u. The feedback mechanism of v with suitable parameters can result in the intermittent instability of u. The parameters in this coupled model are all constants. We consider the following two dynamical regimes: Bimodal regime: where the fat-tailed regime is a typical non-Gaussian regime for the unresolved process while the bimodal one is an extremely tough test regime. The other two parameters in the dynamics of u are the same in the two regimes, The reduced model with simplified process of v is given by Since v is real in the true model (198), v M is also real in the reduced model (201).
here, which is a special case of the general framework in (197). The three parameters d M ,v M and σ M v are tuned by capturing the model sensitivity (autocorrelation function) and model fidelity (equilibrium PDF) of the those associated with the truth of v in (198). The sample trajectories and the statistics of both the truth and the reduced model with the physics-tuned parameters are shown in Figures 38 and 39 for the fat-tailed and the bimodal regimes, respectively. In both figures, despite the failure in capturing the non-Gaussianity in the hidden process v, the non-Gaussian fat-tailed PDFs as well as the intermittent trajectories of the resolved variable u are both recovered with high accuracy in the reduced model with physics-tuned parameters. One crucial reason for the high skill in the reduced model is that the autocorrelation function of v M resembles that of v in the truth. Therefore, the duration and frequency of the intermittent phases of u M are statistically similar to those of u. In fact, even in the bimodal regime which is an extremely tough test case (Figure 39), where the PDF of v M is highly biased from that of v, the feedback mechanism from v to u is well recovered by the reduced model due to the capturing of both the model fidelity and model sensitivity. In Figure 40, we show that only matching the equilibrium PDF of v M with that of v but ignoring the autocorrelation function in the calibration process is insufficient in recovering the key features of u. This emphasizes the importance of physics-tuned calibration strategy and the skill of using the resulting linear regression model for the hidden unresolved variables in capturing the non-Gaussian intermittent behavior of the resolved variable u.  Figure 39 where the process of the truth of v is in a bimodal regime. In the reduced model, the parameters are not physical-tuned. Thus, a large model error is seen from both the trajectories in (b,d) compared with the truth (a,c) and the statistics in (e-h).

Predicting Passive Tracer Extreme Events
Turbulent diffusion models of passive tracers have numerous applications in geophysical science and engineering. These applications range from, for example, the spread of contaminants or hazardous plumes in environmental science, to the behavior of anthropogenic and natural tracers in climate change science, and many related areas [187][188][189][190]. The scalar field T(x, t) describes the concentration of the passive tracer immersed in the fluid which is carried with the local fluid velocity but which does not itself significantly influence the dynamics of the fluid. The evolution of the passive tracer is driven by turbulent advection, diffusion and usually uniform damping, where v(x, y, t) is a velocity field. One key feature of great interest in the tracer turbulent model (202) is the existence of intermittency, which can be found in atmosphere observational data [190], laboratory experiments [191], and numerical simulations of idealized models [15,189,192,193]. A special form of the velocity field v, which is a superposition of a spatially uniform but possibly temporally fluctuating cross-sweep in the x-direction and a random shear flow (with fluctuations possibly in both time and spatial x-direction) in the y-direction v(x, y, t) = (U(t), v(x, t)), (203) has been proposed by Majda et al. [189,193] and tested on simple mathematical models [26,29,194]. Assume the existence of a background mean gradient for the tracer varying in only y-variable and a tracer fluctuation component dependent with only the x-variable Combining (204) with the tracer dynamics (202) and the simplified flow field (203), the fluctuation part of the tracer T satisfies the following equation: Despite their simplicity, the model (205) in random shear flow with a mean sweep can capture and preserve key features for various inertia range statistics for turbulent diffusion [192,[195][196][197][198] including intermittency. Even for roughly Gaussian velocity fields v in (203) as observed in turbulent flows, the linear scalar field can experience rare but very large fluctuations in amplitude, and its statistics can depart significantly from Gaussianity displaying fatter tails representing the intermittency [199][200][201][202][203]. Explicit formulations about the statistical solutions of the tracers have been derived in [193] under this simplified flow system, and a rigorous mathematical proof about the intermittent fat tails in tracer distributions has been achieved recently in [195].
Complex nonlinear and non-Gaussian features in the flow components are unavoidable and ubiquitous especially in realistic turbulent flows. The complexity and large computational expense in resolving the highly turbulent advection flow equations require the introduction of simpler and more tractable imperfect models while still maintaining the ability in capturing the key intermittent features in the tracer field. Below, we investigate the effects from a nonlinear advection flow on the steady state passive tracer intermittency, and especially the errors and performances of imperfect approximation models are tested in a variety of turbulent regimes. In particular, the following two issues will be addressed in the remaining of this section:

1.
Whether a linear Gaussian dynamics in approximating the advection flow is able to capture tracer non-Gaussian statistical structures? 2.
How to design an unambiguous reduced-order stochastic modeling strategy with high prediction skill of the tracer field?
There are at least two motivations for using linear Gaussian imperfect models for describing advection flow v. First, the dynamics and statistical structure become much more tractable with explicit solutions that enable us to design the model and tune parameters with ease. Second, the computational difficulty and cost are also greatly reduced considering the simple and controllable structure of the linear model. However, it is challenging by applying these linear Gaussian models with no positive Lyapunov exponents to estimate the non-Gaussian flow field including various degrees of internal instabilities. Therefore, a systematic procedure in calibrating the imperfect model parameters is required.
Here, the information-theoretic framework developed in Section 6.1.4 is applied to train the imperfect model parameters in a training phase so that the model predicted stationary process can possess the least biased estimation in energy and autocorrelation function, the latter of which plays an particularly important role in determining the structure of tracer statistics. With such a systematical calibration strategy, these linear stochastic models can be greatly improved through this proposed tuning strategy under a proper information metric. On the other hand, the reduced-order strategy aims at using low order equilibrium statistics in the model calibration as a correction to the nonlinear small scale feedback, which avoids high computational cost.

Approximating Nonlinear Advection Flow Using Physics-Tuned Linear Regression Model
Here, we aim at answering the question that whether a linear Gaussian dynamics is able to approximate the advection flow such that the non-Gaussian statistical structures in the tracer field is preserved. One good reference of this topic is [50].
To this end, we consider a background flow, which is driven by the 40-dimensional Lorenz 96 (L96) system [204]. The L96 model is able to mimic baroclinic turbulence in the midlatitude atmosphere with the effects of energy conserving nonlinear advection and dissipation, displaying a wide range of distinct dynamical regimes from Gaussian to extremely non-Gaussian statistics. Therefore, the L96 model is a representative testbed for studying the model error here.
The L96 system with state variables u(x, t) = (u 0 , u 1 , . . . u J−1 ) T is given by with periodic boundary condition u 0 = u J . Nonlinearity comes from the bilinear quadratic form B j (u, u) = (u j+1 − u j−2 )u j−1 , which conserves energy through u · B(u, u) = 0. By changing the amplitude of the external forcing F, the L96 system displays a wide range of different dynamical regimes ranging from weakly chaotic (F = 5), strongly chaotic (F = 8), to finally full turbulence (F = 16) with varying statistics. The advection flow field v = (U(t), v(x, t)) is constructed from the L96 model solution. Note that the system is homogeneous and translation invariant along each grid point, so standard Fourier basis e k = e 2πikj/J J−1 j=1 naturally becomes the empirical orthogonal functions (EOFs) of the system [50].
The state variables of the system can be decomposed under Fourier basis as where · is the ensemble average. We construct the passive tracer fields nonlinearly advected by the flow generated through the L-96 system. The gradient cross-sweeping component U(t) is from the mean state with randomness from zero mode, while the shearing component v(x j , t) simulated by the flow fluctuation modes with varying values at each grid point, Below, we will focus on the statistical features of the scalar tracer field in stationary steady state. To make sure the system converges to the final stationary state, that is,ū(t) →ū ∞ , r k (t) = |û k | 2 (t) → r k,∞ as t → ∞, we consider the simplified dynamics of (206) with constant damping and forcing terms The exact dynamical equations for each mode in the shearing flowû k and the mean gradient U can be derived from the L96 system (206) as where k = 1, . . . , J/2 and the energy transfer rate is Γ k = cos 4πk J − cos 2πk J . The cross-sweep field U is forced by the combined effects from each fluctuation mode ∑ k =0 Γ k |û k | 2 , and conversely the shearing flow is advected by the mean drift through the second term in the first line in (210).
The accuracy in the steady state passive tracer statistics is limited by the modeling and computation skill of the complex background advection flow v. However, several difficulties cannot be bypassed if we directly go with the true flow system with nonlinearity. First, simple Galerkin truncation of high frequency wave-numbers in the dynamical equations may introduce large errors to the flow system due to strong nonlinear interactions between the (truncated) small scale and large scale modes. Second, even with a low dimensional Galerkin truncation model, large ensemble size may still be required to resolve the flow if non-Gaussian features and intermittencies are important and of interest. On the other hand, returning to our original problem, the central issue of major interest is the turbulent fluctuation and statistical structure of the passive tracer T rather than the background flow field v. Considering both sides of the problem, the question that is worth asking is whether we can predict the crucial features (such as, intermittency) in steady state tracer statistics advected and forced by nonlinear non-Gaussian background flow v using simpler imperfect models with error for the background dynamical field. Now, we adopt the simplest approximation about the advection flow with imperfect models using linear stochastic dynamics along each spectral mode from the Ornstein-Uhlenbeck process [15,193,196,205]. With the simple structures in these linear Gaussian models, the dynamics and statistical structure become much more tractable with explicit solutions that enable us to design the model and tune parameters with ease. The linear stochastic models for each mode can be written as with γ u k , ω u k and σ u k as parameters to be determined, together with the dynamics for the mean with r M k = |û M k | 2 . Note that, in both (211) and (212), we consider all the Fourier modes. In practice, Galerkin truncation is naturally applied to these imperfect models, which greatly reduces the dimension of the imperfect system [81]. Since the goal of this subsection is to understand the role of these linear models with optimized parameters, we do not include the Galerkin truncation here. In the next subsection, we will apply the Galerkin truncation forû M k and reach a suite of low-order models in approximating both the velocity and the tracer fields.
Under the approximations in (211) and (212), the background flow v M = (U M (t), v M (x j , t) can be constructed as before for the mean cross-sweep U M and the shearing flow v M in the tracer model (202), Now, the problem is converted to finding systematic strategies of assigning values to the three undetermined coefficients γ u k , ω u k , σ u k so that the tracer structure (intermittency) can be reconstructed from this imperfect model. They should be chosen in an unambiguous way according to the true steady state statistics of the system (which is available from observations). In comparison with the original equation for each mode described in (210), the linear Gaussian approximation of L96 system replaces the nonlinear interaction part in the second line of (210) by linear damping and rotation together with a white noise The white noise σ u kẆ k is added to each Fourier mode in order to make sure that the system converges to the consistent equilibrium steady state spectra. γ u k represents the damping that neutralizes the additional energy from the white noise. The imaginary component ω u k is the additional degree of freedom for tuning the autocorrelation function (or in other words, to control the 'memory' of this mode of its previous history). Note that the quasi-linear part with U(t) in the first line of the formula (210) is also included in the coefficients γ u k , ω u k . It is discovered that even under this linear flow field with Gaussian statistics, intermittency with fat-tailed distributions can be generated in the steady state tracer distributions [193,195]. Here, the challenge is whether we can still capture the correct structure in the tracer spectra and density functions, especially for the intermittency, under these imperfect linear models. Therefore, judicious choice of the model parameters needs to be investigated.
One of the simplest and most direction way to estimate the undetermined coefficients γ u k , ω u k , σ u k is through the mean stochastic model (MSM) [15,41] by calibrating the energy (variance) E k = |û k (t) − û k | 2 | and decorrelation time τ in (44) of the truth (known as "MSM tuned parameters"). Note that the decorrelation time τ = T k + iθ k here contains real and imaginary parts, fitting both as well as the energy provides three conditions. Despite the simplicity in this mean stochastic model, reasonably skillful prediction in uncertainty quantification as well as filtering under this strategy have been obtained for some turbulent systems [193,195]. However, MSM still suffers several shortcomings when strong nonlinearity takes place in the system. Most importantly, the decorrelation time τ = T k + iθ k involves only the time-integrated effects in each mode. This works well when the system is strongly mixing within a nearly Gaussian regime, whereas, when non-Gaussian features become crucial in the system, the pointwise decaying process of the entire autocorrelation function R(t) becomes important and we need take into account the temporal performance of the autocorrelation in the linear model approximation. This has already been seen in the simple example in Section 2.5. In fact, the autocorrelation function becomes strongly oscillatory when F = 5 in the L96 model, which shows the insufficiency of fitting only the decorrelation time.
Therefore, following the physics-tuned linear regression modeling strategy in Section 6.2 and the information-theoretic framework of calibrating the autocorrelation function in Section 2.5, we fit the autocorrelation function of eachû k by the spectral information criteria (47) and (48). Note that the linear Gaussian model in (211) has explicit solution for the autocorrelation function and power spectrum (52), which provides an extremely efficient way of calibrating the two parameters γ u k , ω u k . The remaining parameter σ u k is calibrated by fitting the energy. Finally, we keep the tracer equation to be the same in this example. Finding a reduced order model for the tracer equation following the general strategy in Section 6.1 will be discussed in the next subsection.
In Figure 41, the statistical features of both the advection field v and the tracer T are shown. Here, the parameters in the true model (205), (209) and (210) are as follows: Note that F = 5 corresponds to the weakly chaotic regime in L96 model, which results in a very slow mixing and therefore the autocorrelation function in a certain modes decays quite slowly with strong oscillations. See the black curves Panel (a) of Figure 41. It is expected from Section 2.5 that using the strategy of MSM by fitting only the decorrelation time results in a large bias, which is clearly indicated by the blue curve in Panel (a). On the other hand, with the physics-tuned parameters calibrated based on fitting the autocrrelation function via the spectral density, the imperfect model provides a significantly accurate estimation of the autocorrelation function even in this tough regime. Next, the comparison of the PDFs associated with both v and T are shown in Panels (b)-(g). Clearly, the linear Gaussian models of v fail to capture the sub-Gaussian PDFs of the velocity field, which indicates an information barrier. Nevertheless, the nonlinear interaction between U and T allows the imperfect model to capture the non-Gaussian features in the tracer field with fat-tailed PDFs in both physical space (Panel (e)) and spectrum space (Panels (f)-(g)). The sample time series using the linear Gaussian velocity model (209) and (210) with the physics-tuned parameters also resembles that of the truth with significant intermittency (Panels (h) and (i)). On the other hand, the linear model with MSM tuned parameters (fitting only the decorrelation time) fails to capture these features (not shown here). See [81] for more discussions and numerical tests in other regimes (F = 8 and F = 16).

Predicting Passive Tracer Extreme Events with Low-Order Stochastic Models
Now, we aim at answering the second question proposed at the beginning of this section. That is, how to design an unambiguous reduced-order modeling (ROM) strategy with high prediction skill of the tracer field [82]. Here, we consider a more realistic and complicated advection flow v(x, t), which is described from the solution of the two-layer quasi-geophysics (QG) equation [121,143] ∂q j ∂t + v j · ∇q j + (β + k 2 d U j ) ∂ψ j ∂x = −δ 2j r∆ψ j − ν∆ s q j , Above, the subindex j = 1, 2 is used to represent the upper and lower layer of the two-layer flow model. The two-dimensional incompressible velocity field v j is decomposed into a zonal mean cross sweep, (U j , 0), and a fluctuating shear flow v j = ∇ ⊥ ψ j = (−∂ y ψ j , ∂ x ψ j ). For the passive tracer field, now we assume the background mean gradient varying in both x and y directions together with a tracer fluctuation component where α = (α x , α y ). Plugging (217) into (202), the fluctuation part of the passive tracer model yields In (218), v j = (u j , v j ) is the fluctuating advection flow field from the solution of (216) together with a zonal mean flow U j . In addition, a scale separation in the tracer Equation (218) with order is introduced. The difference in time scale in the tracer is through a different time scale,t = −1 t, in the tracer time as in various previous works [189,193,195]. As < 1, the velocity field is varying at a faster time scale than the passive tracer process, while on the other hand with > 1 the tracer evolves in a more rapid rate than the advection field. A long time rescaling limit with explicit analytic tracer solutions is derived in [195] and numerical simulations for varying values of among a wide range are investigated in [192] under a much simpler linear model. In general, different intermittent features will be generated from near Gaussian statistics to distributions with fat tails as the scale separation parameter value changes [189,192].
Given periodic boundary condition in both the two-layer flow and the tracer field, we formulate the flow and tracer fields with Galerkin truncation to finite number of Fourier modes. Spatial Fourier decomposition in flow potential vorticity q j and passive tracer disturbance T j can be written in the expansion under modes exp(ik · x) as q j = ∑ kq j,k e ik·x , Note that here we focus on the homogeneous flow on mesoscale and therefore the periodic condition is reasonable. By projecting the tracer and flow Equations (218) and (216) to each Fourier spectral mode, equations for the spectral coefficients in each wavenumber of the two-layer tracer field T k = (T 1,k ,T 2,k ) T , and two-layer advection flow field q k = (q 1,k ,q 2,k ) T , form the set of ODEs in the spectral domain as where '•' is used to denote the pointwise produce, namely a • b = (a i b i ). The potential vorticities q k and stream function ψ k in two layers are related by the transform matrix H k , through the relation q j = ∇ 2 ψ j + (216). The other operators and terms in the nonlinear dynamics (220) and (221) are given by In (223), the linear dissipation γ T,k is due to the Ekman friction applied only on the bottom layer and the hyperviscosity. The dispersion ω T,k is from the rotational β-effect as well as the background zonal mean flow advection from the original Equation (216) applied on the vorticity modes.
The advection terms in the tracer and flow Equations (220) and (221) involve interactions between modes of different scales along the entire spectrum in a large dimensional phase space, thus usually high computational cost is required in achieving accurate statistical results from direct numerical simulations. In general, intermittency in a tracer field is dominated by the variability in largest scales, thus we will concentrate on the large-scale modes with wavenumber |k| ≤ M N, where M is the number of resolved modes and N is the full dimensionality of the system. Usually, we could choose M much smaller than N that only covers the essential most energetic directions in the flow system. Below, we first develop the simple strategy with linear corrections to approximate the advection flow field in the leading modes, which is similar to that in Section 6.3.1. Then, the calibration and improvement of the imperfect models due to model errors from this approximation will be discussed.
As in Section 6.3.1, in order to approximate the advection flow, the simple Gaussian approximation is adopted to replace the quadratic interactions (v · ∇q) k in the flow equations by additional linear damping and random Gaussian noise. Thus, the reduced-order advection flow equations are given by with only Gaussian statistics generated. Only the first M large-scale modes 1 ≤ |k| ≤ M are resolved in the reduced-order model (224). In addition to the linear dissipation and dispersion operators (γ q,k , ω q,k ), additional damping and noise D M q,k , σ M q,k are introduced to correct model errors due to the neglected nonlinear interactions in the flow equations. On the other hand, there is no additional model calibrations of the tracer field statistics in case of over fitting of data. The idea here is to improve the reduced-order model prediction skill by optimizing the background advection flow field, thus the reduced order passive tracer equations can be modeled through a direct truncation where only the first leading modes of the advection flow 1 ≤ |k| ≤ M N are resolved in the tracer approximation model.
Again, the major difficulty in modeling the tracer dynamics is from the accurate approximation of the tracer advection A km q m • T n in (220). Exact modeling about this nonlinear interaction term requires the flow mode solution q M,k along the entire spectrum 0 < |k| ≤ N, while only the first M leading modes are available through the reduced-order model. One crude approximation idea could be to replace the nonlinear advection in the tracer field v(x, t) · ∇ T(x, t) with damping and noise in a similar fashion as the flow approximation model (224). However, as discussed in previous works [50,195], the nonlinear advection in the tracer equation is crucial in the generation of many important statistical features including the intermittency. Thus, including of nonlinear effects from the flow solution is essential, at least for the large scale modes. On the left-hand side of the Equation (225), the nonlinear advectionṽ M · ∇ T M is modeled explicitly, but only the first M 1 ≤ M largest scale flow modes in the model velocity solutionṽ M are used to calculate the imperfect model tracer advection. This nonlinear advection is essential in generating the accurate spectra in tracer statistics, while it is also not expensive to calculate since only leading modes are involved. The idea for this approximation is through the assumption that the dominant features in tracer statistics (such as intermittency and equilibrium spectrum in large scales) are due to the leading advection flow modes with largest energy. Now, we calibrate the imperfect low-order linear Gaussian model for advection flow system (225) using equilibrium statistics and information theory. Such calibration procedure is divided into two steps: 1.
Properly reflecting the nonlinear energy mechanism from the true system. 2.
Imperfect stochastic model consistency in equilibrium statistics and autocorrelation functions.
In the first step, we aim at making sure that the imperfect model calibration parameters (D M q,k , Σ M q,k ) can properly reflect the true nonlinear energy mechanism from the true system. The consistent imperfect model then can be proposed by consulting the model statistical dynamics. Therefore, it is useful to investigate the statistical equations for the second order moments from the fluctuation equations of (221). The dynamics for the covariance matrix R q k = q k q * k of flow vorticity can be derived as a 2 × 2 blocked system for each wavenumber [117], The linear operators (L q , D q ) represent the skew-symmetric dispersion and dissipation effects from the right-hand side of (221). The additional operator L k (q) represents the interactions with a non-zero statistical mean state, where internal instability occurs with positive growth rate. Most importantly, the nonlinear interactions between different spectral modes introduce the additional nonlinear flux term Q q F indicating higher-order interactions, that is, Therefore, the small and large scale modes are linked through third-order moments q m q n q * k in (227) between the triad modes m + n = k. The nonlinear flux Q q F plays the central role in the energy mechanism that balances the unstable directions due to internal instability from the linear operators. Here, our focus is on the low-order stochastic realization in (224) of the statistical closure model of (226), thus solving the statistical Equation (226) directly is not favorable considering its complexity.
Below, we follow the general framework developed in Section 6.1 to determine the reduced order model. The nonlinear flux Q q F in (227) corresponds to the unresolved nonlinear effects in the stochastic model in (224). Thus, it is useful to exploit the nonlinear flux Q q F so that the imperfect model parameters (D M q , Σ M q ) in (224) can be proposed according to the true model energy transfer mechanism. Especially in statistical equilibrium, as t → ∞ the nonlinear fluxes can be calculated easily from the localized lower-order moments Next, we further decompose the matrix Q q F = Q q,+ F + Q q,− F by singular value decomposition into positive-definite and negative-definite components. The positive definite part Q q,+ F illustrates the additional energy that injected into this mode from other scales, while the negative definite part Q q,− F shows the extraction of energy through nonlinear transfer to other scales. In adopting the true equilibrium statistics from Q q F,eq , the true model energy transfer mechanism is respected and the least artificial effect is introduced into the imperfect approximation model. Considering all these aspects, the first proposal for the linear damping and Gaussian random noise correction can be introduced as D eq q,k = − 1 2 Q q,− F,eq,k (R q k,eq ) −1 , Σ eq q,k = (Q q,+ F,eq,k ) 1/2 .
The additional damping is from the negative definite equilibrium flux Q q,− F,eq and the positive definite equilibrium flux Q q,+ F,eq acts as additional noise to the system. The above additional damping and noise (229) offer a desirable quantification for the minimum amount of corrections to stabilize the system with consistent equilibrium statistics for the mean and variance. This is the same idea applied to the statistical modified quasi-linear Gaussian closures developed in [45].
As discussed in Section 6.1.3, the above estimation of parameters (229) may not be optimal for the reduced-order Gaussian model considering that: (i) it only guarantees marginal stability in the unstable modes for equilibrium; and more importantly (ii) the time mixing scale in each mode (represented by the autocorrelation functions) may still lack the accuracy in the approximation using only equilibrium information. The nonlinear energy transferring mechanism may change with large deviation from the equilibrium case when intermittent fluctuations are present. The shortcomings for purely using the approximation (229) only from equilibrium statistics can be observed from the various numerical simulations [82]. As a further correction, we propose additional terms on top of (229) with a simple constant damping for all the spectral modes and an additional noise accordingly to make sure consistency in energy, The correction term in (230) is aimed to offer stabilizing effects in the marginal stable equilibrium form (229), and to offer further corrections in modeling the autocorrelation function that is important for the mixing rate in each spectral mode. Combining (229) Comparing with the exact true system (221), the reduced-order approximation is equivalent to replacing the nonlinear interaction terms with the judiciously calibrated damping and noise in consideration with both the equilibrium energy transfer mechanism and further sensitivity correction. Now, we move to the second step. Here, we tune the undermined model parameters (D add M , Σ add M ) to guarantee the imperfect stochastic model consistency in equilibrium statistics (the leading two moments) and autocorrelation functions. The procedure here is exactly the same as that in Section 6.3.1, where information theory developed in Section 2.5 is used for calibrating the autocorrelation function. Thus, we neglect the details here.
Finally, let us show a simple example for predicting the tracer statistics using the low-order model prediction. The example here has the same setup as one of the regimes considered in [81], that is, the high latitude atmosphere regime, where the parameters are given as follows: Here, N = 128 is the number of grid points in each direction. The true statistics are calculated by a pseudo-spectra code with 128 × 128 × 2 grid points in total. The zonal mean flow U = (U, −U) is taken as the same strength with opposite directions in the two layers. In the tracer simulations, for simplicity, we consider the mean gradient along y direction, that is to assume, T = T + αy. This assumption is representative in many previous investigations [117,193,195]. The scale separation parameter in this example is chosen to be −1 = 5 such that intermittency is prominent. In the reduced-order model, we only compute the modes k ≤ M = 10 in largest scales, compared with the true system resolution N = 128.
The autocorrelation functions of the first four leading modes (1, 0), (0, 1), (1, 1) and (−1, 1) in both the flow stream functions and the tracer fields are plotted in Figure 42. It is clear that for both the flow and tracer fields, the reduced order model with optimized parameters calibrated by information theory succeeds in capturing the autocorrelation function of the truth. As a comparison, equipped with the parameters with no additional corrections d M = 0, σ M = 0, the reduced order model has a huge bias in recovering the autocorrelation function of the flow field. Next, we test the prediction skill of the turbulent tracer statistics in the reduced-order model. As we have seen in Section 6.3.1 and the discussions above, the nonlinear advection in the tracer equation v M · ∇T M is important for the final tracer statistical structure. The goal here is to see whether the intermittent and other features in the tracer field can be accurately predicted using only principal modes with largest variance in v M in calculating the nonlinear term. Figure 43 compares the representative time-series and tracer PDFs of the leading modes in statistical steady state. Despite only 0.6% of the modes being involved in the flow field, the fat-tails in the distribution functions of the tracer can be captured, and similar characteristic structures can be seen in the truth and reduced model time series. In fact, the high skill of recovering the non-Gaussian features is due to the fact that the advection term v M · ∇T M is captured quite well even with such a crude truncation of the flow field. The results in Figures 42 and 43 imply the skillful predictions using the reduced order model with the optimized parameters. In [82], the recovered tracer field using different truncation size M has been explored. It is important to note that with only the first two modes M = 2 being included in calculating the nonlinear advection, larger errors appear due to the insufficient quantification for flow advection. The recovering skill of other statistical features such as the power spectrum and eddy diffusivity approximations for the tracers in this regime as well as the test examples in other regimes have also been systematically discussed in [82].

Conclusions
This research expository article discusses various important topics related to model error, information barriers, state estimation and prediction in complex multiscale systems. A recent information-theoretic framework is developed and summarized, which is applied together with other mathematical tools to study all these topics. It is also combined with novel reduced-order nonlinear modeling strategies for understanding and predicting complex multiscale systems. The contents of this article include the general mathematical framework and theory, effective numerical procedures, instructive qualitative models, and concrete models from climate, atmosphere and ocean science. The information-theoretic framework is developed in Section 2 and is applied to understand various information barriers in the presence of model error via instructive stochastic models. In Section 3, the information-theoretic framework is adopted to assess model error in state estimation and prediction with examples coming from both complex scalar models and spatially-extended multiscale turbulent systems. The advantage of the information-theoretic framework over the traditional path-wise measurements are illustrated. Section 4 deals with sensitivity and linear statistical response using the fluctuation-dissipation theorem. An efficient and effective algorithm in finding the most sensitive change directions using information theory is also included in this section. In Section 5, a novel framework of data-driven physics-constrained nonlinear stochastic models and predictions is developed and is applied to predicting the MJO which contains strong intermittent instabilities and extreme events. Section 6 includes the development of the new effective reduced-order models that involve higher order statistical features but nevertheless remain computationally efficient. These new models together with the information-optimization model calibration strategy are applied to predicting passive tracers extreme events.
The simple imperfect models used in Sections 2 and 4 are all motivated from the strategies that are commonly used in practice for approximating extremely complicated systems such as GCMs. Therefore, the information barriers shown in this article clearly indicate the deficiency of these strategies and point out the directions of improving the imperfect models. The computationally efficient reduced-order modeling framework developed in Section 6 is promising in dealing with many complicated real-world issues. In particular, including higher order statistical features using the novel approach allows these new reduced-order models to capture the nonlinear evolution and non-Gaussian characteristics in both the dynamics and statistics. Therefore, these models are able to overcome those information barriers resulting from the linear tangential or Gaussian closure approximations as well as the ignorance of the cross-correlations between different modes or grid points. In addition to studying the spatially-extended systems associated with predicting passive tracers extreme events, the other applications of these low-order modeling strategies are good future directions. Note that these low-order modeling strategies combined with FDT can also be powerful tools to study the effective statistical control of complex turbulent dynamical systems [206]. On the other hand, although great efforts have been put in understanding the sources of model errors in data assimilation (or filtering), the representation error was nevertheless overlooked in the past. In Section 3, the issue of representation error is emphasized and some practical strategies have been proposed and tested. More systematic studies are required in this area as future works. It is also of great importance to study filtering and prediction as a whole and understand the model error and improved strategies for both procedures instead of focusing solely on the filtering part. In addition, as is briefly discussed in Section 3.5.2, combining the Euler and Lagrangian observations is another interesting topic in improving the skill of data assimilation and prediction of spatially-extended systems as well as quantifying the uncertainty reduction. Finally, it has been shown in Section 5 that the data-driven physics-constrained nonlinear stochastic modeling framework has several salient advantages over the purely data-driven non-parametric methods in terms of both understanding the underlying physics and obtaining skillful predictions. These advantages include a much shorter training phase, a systematic calibration strategy, gaining clear physical insights and reaching model robustness. Applying the data-driven physics-constrained nonlinear stochastic modeling framework to many other complex real-world problems is potentially important. Many related issues remain as future work.

Appendix C. Augmented System for Prediction and Filtering Distributions
Here, we show the details of using augmented systems for studying the prediction/filtering state estimates compared to the truth as discussed in Section 3.2.

Appendix C.1. Augmented System for Prediction
In light of the truth (67) and the prediction mean (69), the coupled evolution of the augmented state X m := (u m ,ū m+1|m ) T is given by The system in (A6) is a Gaussian system and therefore its behavior is completely characterized by its mean and variance. The evolution of the mean of (A6) is given by With the equilibrium mean of the perfect and imperfect modelū eq = F ∞ 1−F andū M eq = F M ∞ 1−F M , the asymptotic mean of X m is given by where the asymptotic Kalman gain K M ∞ is given by Clearly, the asymptotic mean of the prediction state is a linear combination of the equilibrium mean of original true modelū eq and that of the forecast model of the meanū M eq . With (A8), the mean bias yields According to (A9), the asymptotic prediction mean is equal to the equilibrium mean of the perfect model if and only if the imperfect model has the same equilibrium mean as the perfect model, namelyū eq =ū M eq . Next, we derive the covariance of the augmented system. Denote the operators F P and R P as Denote the covariance of X m by C P m = Cov(X m , X m ), where C P m = Cov(u m , u m ) Cov(u m ,ū m|m−1 ) Cov(ū m|m−1 , u m ) Cov(ū m|m−1 ,ū m|m−1 ) ≡ C P (11)m C P The evolution of the covariance matrix of the augmented system is given by and the components of the asymptotic limit C P ∞ are C P (11)∞ = r 1 − |F| 2 , C P (12)∞ = FF M * K M ∞ gC P , C P (21)∞ = C P (12)∞ , With direct calculation, the dynamics of the prediction variance r m+1|m is r m+1|m = |F M | 2 r m|m + r M = |F M | 2 (1 − K M m g)r m|m−1 + r M .
Asymptotically, it is given by

Appendix C.2. Augmented System for Filtering
The coupled evolution of the augmented state Y m := (u m ,ū m|m ) T is Again, the Gaussian statistics of the augmented state Y m is fully characterized by its mean and covariance. The evolution of its mean is given by (A17) When m → ∞, the asymptotic mean of Y m is and the deviation of analysis mean from the truth signal is therefore given as follows: Next, the covariance of prediction mean is where the operator F A and R A are given respectively by .
As discussed in [15], the asymptotic analysis variance is Finally, we compare the asymptotic prediction and filtering variance. We have the following conclusion: if the time-periodic forcing f 1 = 0. This can be easily seen in the examples in Figure 15. In other words, the system is not ergodic [108] when the time-periodic forcing is strong. Below, we provide a mathematical quantification of such behavior. We focus on the system at the attractor and ignore the contribution from the initial value. For simplicity, we also assume f 0 = 0 since this constant forcing only shifts the mean of the solution by a constant and won't affect the non-Gaussian behavior in the time-averaged PDF. Therefore, the solution of (A28), according to (98), reduces to Clearly, the true signal u(t) can be decomposed into two parts: the deterministic time-periodic mean stateū(t) and the fluctuation around the time-periodic mean state u (t): For the simplicity of illustration, in the following, we only consider the real part of the true signal. Thus,ū(t) and u (t) becomē e (−γ+iω 0 )(t−s) dW(s).

(A31)
As a simple illustration, we show in Panel (a) of Figure A1 the signal u(t) and its decomposition u(t) and u (t) within one period.
With such a mean-fluctuation decomposition, the PDF based on the samples of the long trajectory can be computed in the following way. Say the total length of the trajectory is NT, where T = 2π/ω 1 is the period. Now, we use take n o grid points with uniform increment within one period, namely nT, nT + ∆T, nT + 2∆T, . . . , nT + (n o − 1)∆T, where ∆T = T/n o and n = 1, . . . , N. For different n and fixed i, the time-dependent meanū(nT + i∆T) is the same while u (nT + i∆T) is different due to the randomness in the fluctuation. It is known from (A31) that the collection of the n points u (nT + i∆T) with n = 1, . . . , N and fixed i satisfies a Gaussian distribution N (µ i , R i ) with µ i =ū(nT + i∆T), and R i = u (nT + i∆T)u * (nT + i∆T) = σ 2 4γ , where the variance is computed from the second equation of (A31) and it has no dependence on t. Therefore, the PDF of u(t) is given by the summation of all the Gaussian distributions in (A32) for i = 1, . . . , n o with n o → ∞. See Figure A2 for an illustration. Mathematically, this can be written as a convolution, with C a normalized constant to guarantee ∞ −∞ p(u)du = 1. Here, p u (u − t) is the Gaussian PDF with mean and variance given by (A32), namely, To compute pū(u), we refer to Figure A1. Sinceū is bounded by −A and A, the support of pū(u) is also within [−A, A]. Direct calculation of pū(u) is difficult. Nevertheless, the cumulative distribution function (CDF) can be used, which facilitates the derivation of pū(u). Now, consider the interval [−φ/ω 1 , (2π − φ)/ω 1 ]. In fact, for any u ∈ [−A, A], the CDF is given by P(ū < u) = 1 − 2 ω 1 arccos u A /( 2π ω 1 ) = 1 − 1 π arccos u A , |u| ≤ A, The CDF in (A35) is easily derived by making use of the fact that the samples are uniformly distributed in t ∈ [−φ/ω 1 , (2π − φ)/ω 1 ], which is of length 2π/ω 1 . With the CDF (A35) in hand, the PDF pū(t) is given by the derivative of the CDF. Therefore, Therefore, combining (A34) and (A36) leads to the calculation of (A33). It is clear that the full PDF depends on the variation of the time-dependent meanū, and the variance of the Gaussian PDF p u for the fluctuation part. If the variation ofū is much smaller than the variance of u , then the full PDF is nearly Gaussian. With the increase of the variation ofū and a fixed variance of u , then the full PDF becomes bimodal. Thus, the ratio of the bandwidths associated with pū and p u can be used to quantify the Gaussianity of the full PDF. The bandwidth of pū is On the other hand, although there is no finite support of the Gaussian distribution p u , the three-sigma rule of thumb [208] is always used as an empirical bandwidth to quantify the "range" of the Gaussian distribution, where the three sigma here means the three standard deviation from the mean of the Gaussian distribution that covers 99.73% of the values of p u . Thus, the bandwidth of p u is The ratio of (A37) and (A38) is given by (A39) Figure A3 shows the full PDFs with different ratio r. In Panel (a), the time-periodic forcing f 1 increases from f 1 = 0 to f 1 = 5. When f 1 = 0, the PDF is Gaussian and the system is ergodic. When f 1 increases, Lū becomes larger while L u is fixed. Note that, in this example, ω 1 = ω 0 , which means the forcing is resonant and therefore the bandwidth Lū is sensitive to the change of forcing amplitude f 1 . With f 1 > 1, the bimodality in the PDF becomes significant. In Panel (b), we fix f 1 = 5, ω 1 = 1 and let ω 0 change. With the resonant forcing ω 0 = ω 1 = 1, the PDF is significantly bimodal but with a non-resonant forcing ω 0 = 3, the PDF is nearly Gaussian.