Estimating Predictive Rate–Distortion Curves via Neural Variational Inference

The Predictive Rate–Distortion curve quantifies the trade-off between compressing information about the past of a stochastic process and predicting its future accurately. Existing estimation methods for this curve work by clustering finite sequences of observations or by utilizing analytically known causal states. Neither type of approach scales to processes such as natural languages, which have large alphabets and long dependencies, and where the causal states are not known analytically. We describe Neural Predictive Rate–Distortion (NPRD), an estimation method that scales to such processes, leveraging the universal approximation capabilities of neural networks. Taking only time series data as input, the method computes a variational bound on the Predictive Rate–Distortion curve. We validate the method on processes where Predictive Rate–Distortion is analytically known. As an application, we provide bounds on the Predictive Rate–Distortion of natural language, improving on bounds provided by clustering sequences. Based on the results, we argue that the Predictive Rate–Distortion curve is more useful than the usual notion of statistical complexity for characterizing highly complex processes such as natural language.


Introduction
Predicting the future from past observations is a key problem across science. Constructing models that predict future observations is a fundamental method of making and testing scientific discoveries. Understanding and predicting dynamics has been a fundamental goal of physics for centuries. In engineering, devices often have to predict the future of their environment to perform efficiently. Living organisms need to make inferences about their environment and its future for survival.
Real-world systems often generate complex dynamics that can be hard to compute. A biological organism typically will not have the computational capacity to perfectly represent the environment. In science, measurements have limited precision, limiting the precision with which one can hope to make predictions. In these settings, the main goal will typically be to get good prediction at low computational cost.
This motivates the study of models that try to extract those key features of past observations that are most relevant to predicting the future. A general information-theoretic framework for this problem is provided by Predictive Rate-Distortion [1,2], also known as the past-future information bottleneck [3]. The Predictive Rate-Distortion trade-off seeks to find an encoding of past observations that is maximally informative about future observations, while satisfying a bound on the amount of information that has to be stored. More formally, this framework trades off prediction loss in the future, formalized as cross-entropy, with the cost of representing the past, formalized as the mutual information between the past observations and the compressed representations of the past. Due to its information-theoretic nature, this framework is extremely general and applies to processes across vastly different domains. It has been applied to linear dynamical systems [3,4], but is equally meaningful for discrete dynamical systems [2]. For biological systems that make predictions about their environment, this corresponds to placing an information-theoretic constraint on the computational cost used for conditioning actions on past observations [5].
The problem of determining encodings that optimize the Predictive Rate-Distortion trade-off has been solved for certain specific kinds of dynamics, namely for linear dynamic systems [3] and for processes whose predictive dynamic is represented exactly by a known, finite Hidden Markov Model [2]. However, real-world processes are often more complicated. When the dynamics are known, a representing Hidden Markov Model may still be extremely large or even infinite, making general-purpose automated computation difficult. Even more importantly, the underlying dynamics are often not known exactly. An organism typically does not have access to the exact dynamics of its surroundings. Similarly, the exact distribution of sentences in a natural language is not known exactly, precluding the application of methods that require an exact model. Such processes are typically only available implicitly, through a finite sample of trajectories.
Optimal Causal Filtering (OCF, Still et al. [6]) addresses the problem of estimating Predictive Rate-Distortion from a finite sample of observation trajectories. It does so by constructing a matrix of observed frequencies of different pairs of past and future observations. However, such a method faces a series of challenges [2]. One is the curse of dimensionality: modeling long dependencies requires storing an exponential number of observations, which quickly becomes intractable for current computation methods. This exponential growth is particularly problematic for processes with a large state space. For instance, the number of distinct words in a human language, as found in large-scale text data, easily exceeds 10^5, making storing and manipulating counts of longer word sequences very challenging. A second challenge is overfitting: when deploying a predictive model constructed via OCF on new data to predict upcoming observations, such a model can only succeed when the past sequences occurred in the sample to which OCF was applied. This is because OCF relies on counts of full past and future observation sequences; it does not generalize to unseen past sequences.
Extrapolating to unseen past sequences is possible in traditional time series models of processes that take continuous values; however, such methods are less easily applied to discrete sequences such as natural language. Recent research has seen a flurry of interest in flexible nonlinear function approximators, in particular recurrent neural networks, which can handle sequences with discrete observations. Such machine learning methods provide generic models of sequence data. They form the basis of the state of the art, by a clear and significant margin, for prediction in natural language [7][8][9][10]. They have also been successfully applied to modeling many other kinds of time series found across disciplines [11][12][13][14][15][16][17][18].
We propose Neural Predictive Rate-Distortion (NPRD) to estimate Predictive Rate-Distortion when only a finite set of sample trajectories is given. We use neural networks both to encode past observations into a summary code, and to predict future observations from it. The universal function approximation capabilities of neural networks enable such networks to capture complex dynamics, with computational cost scaling only linearly with the length of observed trajectories, compared to the exponential cost of OCF. When deploying on new data, such a neural model can generalize seamlessly to unseen sequences, and generate appropriate novel summary encodings on the fly. Recent advances in neural variational inference [19,20] allow us to construct predictive models that provide almost optimal predictive performance at a given rate, and to estimate the Predictive Rate-Distortion trade-off from such networks. Our method can be applied to sample trajectories in an off-the-shelf manner, without prior knowledge of the underlying process.
In Section 2, we formally introduce Predictive Rate-Distortion, and discuss related notions of predictive complexity. In Section 3, we describe the prior method of Optimal Causal Filtering (OCF).
Due to the Markov condition, the distortion measure satisfies the relation

H[→X | Z] − H[→X | ←X] = I[←X, →X] − I[→X, Z] ≥ 0,

i.e., it captures how much less information Z carries about the future →X compared to the full past ←X. For a fixed process (X_t)_t, choosing Z to minimize the distortion is equivalent to maximizing the mutual information between the code and the future:

I[→X, Z].   (4)

We will refer to (4) as the predictiveness of the code Z.
The rate-distortion trade-off then chooses Z to minimize distortion at bounded rate I[←X, Z], or, equivalently, to maximize predictiveness at bounded rate. Equivalently, for each λ ≥ 0, we study the problem

max_Z  ( I[→X, Z] − λ I[←X, Z] ),   (7)

where the scope of the maximization is the class of all random variables Z satisfying the Markov condition (1), i.e., Z ⊥ →X | ←X: the code is computed from the past alone. The objective (7) is equivalent to the Information Bottleneck [21], applied to the past and future of a stochastic process. The coefficient λ indicates how strongly a high rate I[←X, Z] is penalized; higher values of λ result in lower rates and thus lower values of predictiveness.
The largest achievable predictiveness is the mutual information between past and future, I[←X, →X], which is known as the excess entropy of the process [23]. Due to the Markov condition (1) and the Data Processing Inequality, the predictiveness of a code Z is always upper-bounded by its rate:

I[→X, Z] ≤ I[←X, Z].

As a consequence, when λ ≥ 1, (7) is always optimized by a trivial Z with zero rate and zero predictiveness. When λ = 0, any lossless code optimizes the problem. Therefore, we will be concerned with the situation where λ ∈ (0, 1).

Relation to Statistical Complexity
Predictive Rate-Distortion is closely related to Statistical Complexity and the ε-machine [24,25]. Given a stationary process (X_t)_t, its causal states are the equivalence classes of semi-infinite pasts ←x that induce the same conditional probability over semi-infinite futures →X: two pasts ←x, ←x′ belong to the same causal state if and only if

P(x_{1...k} | ←x) = P(x_{1...k} | ←x′)

holds for all finite sequences x_{1...k} (k ∈ N). Note that this definition is not measure-theoretically rigorous; such a treatment is provided by Löhr [26].
The causal states constitute the state set of a Hidden Markov Model (HMM) for the process, referred to as the ε-machine [24]. The statistical complexity of a process is the state entropy of the ε-machine. Statistical complexity can be computed easily if the ε-machine is analytically known, but estimating statistical complexity empirically from time series data is very challenging and seems to at least require additional assumptions about the process [27].
Marzen and Crutchfield [2] show that Predictive Rate-Distortion can be computed when the ε-machine is analytically known, by proving that the problem is equivalent to compressing causal states, i.e., equivalence classes of pasts, to predict causal states of the backwards process, i.e., equivalence classes of futures. Furthermore, Still et al. [6] show that, in the limit λ → 0, the code Z that optimizes Predictive Rate-Distortion (7) turns into the causal states.
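To make the definition of causal states concrete, the following sketch (our own illustration, not from the paper; the process, names, and horizon are invented) groups finite-length pasts of the Golden Mean Process, a standard binary process forbidding two consecutive 1s, by their exact conditional distribution over length-3 futures. Pasts inducing the same conditional future distribution fall into the same (finite-history approximation of a) causal state.

```python
from itertools import product

# Hypothetical example: the Golden Mean Process as an HMM with states A, B.
# From A: emit 0 (prob 1/2, stay in A) or 1 (prob 1/2, go to B).
# From B: emit 0 (prob 1, go to A).  No two 1s ever occur in a row.
TRANS = {('A', 0): ('A', 0.5), ('A', 1): ('B', 0.5), ('B', 0): ('A', 1.0)}
STATIONARY = {'A': 2 / 3, 'B': 1 / 3}

def forward(history):
    """Distribution over hidden states after observing `history` (None if impossible)."""
    belief = dict(STATIONARY)
    for sym in history:
        new = {}
        for state, p in belief.items():
            if (state, sym) in TRANS:
                nxt, ep = TRANS[(state, sym)]
                new[nxt] = new.get(nxt, 0.0) + p * ep
        total = sum(new.values())
        if total == 0:
            return None
        belief = {s: p / total for s, p in new.items()}
    return belief

def future_dist(history, k=3):
    """Exact conditional distribution over length-k futures given the past."""
    belief = forward(history)
    if belief is None:
        return None
    dist = {}
    for fut in product([0, 1], repeat=k):
        p_total = 0.0
        for state, p0 in belief.items():
            p, s = p0, state
            for sym in fut:
                if (s, sym) not in TRANS:
                    p = 0.0
                    break
                s, ep = TRANS[(s, sym)]
                p *= ep
            p_total += p
        if p_total > 0:
            dist[fut] = round(p_total, 10)
    return dist

# Group all valid length-4 pasts by their conditional future distribution:
# each group approximates one causal state.
groups = {}
for past in product([0, 1], repeat=4):
    d = future_dist(past)
    if d is not None:
        groups.setdefault(tuple(sorted(d.items())), []).append(past)

print(len(groups))  # 2: pasts ending in 0 vs. pasts ending in 1
```

For this process the grouping recovers exactly two causal states, determined by the last observed symbol; for processes with infinite Markov order (such as the Even Process discussed later), no finite history length recovers the causal states exactly.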

Related Work
There are related models that represent past observations by extracting those features that are relevant for prediction. Predictive State Representations [28,29] and Observable Operator Models [30] encode past observations as sets of predictions about future observations. Rubin et al. [31] study agents that trade the cost of representing information about the environment against the reward they can obtain by choosing actions based on the representation. Relatedly, Still [1] introduces a Recursive Past Future Information Bottleneck where past information is compressed repeatedly, not just at one reference point in time.
As discussed in Section 2.1, estimating Predictive Rate-Distortion is related to the problem of estimating statistical complexity. Clarke et al. [27] and Still et al. [6] consider the problem of estimating statistical complexity from finite data. While statistical complexity is hard to identify from finite data in general, Clarke et al. [27] introduce certain restrictions on the underlying process that make estimation more feasible.

Prior Work: Optimal Causal Filtering
The main prior method for estimating Predictive Rate-Distortion from data is Optimal Causal Filtering (OCF, Still et al. [6]). This method approximates Predictive Rate-Distortion using two approximations. First, it replaces semi-infinite pasts and futures with bounded-length contexts, i.e., finite past contexts ←X_M := X_{−M} ... X_{−1} and future contexts →X_M := X_0 ... X_{M−1} of some finite length M. (It is not crucial that past and future contexts have the same length, and indeed Still et al. [6] do not assume this; we do assume equal length throughout this paper for simplicity of our experiments, though nothing depends on this.) The PRD objective (7) then becomes (9), aiming to predict length-M finite futures from summary codes Z of length-M finite pasts:

max_Z  ( I[→X_M, Z] − λ I[←X_M, Z] ).   (9)

Second, OCF estimates information measures directly from the observed counts of (←X_M, →X_M) pairs, using the plug-in estimator of mutual information. With such an estimator, the problem (9) can be solved using a variant of the Blahut-Arimoto algorithm [21], obtaining an encoder P(Z | ←X_M) that maps each observed past sequence to a distribution over a finite set of code words Z. Two main challenges have been noted in prior work. First, solving the problem for a finite empirical sample leads to overfitting, overestimating the amount of structure in the process. Still et al. [6] address this by subtracting an asymptotic correction term that becomes valid in the limit of large M and λ → 0, when the codebook becomes deterministic, and which allows them to select a deterministic codebook of appropriate complexity. This leaves open how to obtain estimates outside of this regime, when the codebook can be far from deterministic.
The second challenge is that OCF requires the construction of a matrix whose rows and columns are indexed by the observed past and future sequences [2]. Depending on the topological entropy of the process, the number of such sequences can grow as |A|^M, where A is the set of observed symbols, and processes of interest often do show this exponential growth [2]. Worse still, in the case of natural language, A contains many thousands of words.
A further challenge is that OCF is infeasible if the number of required codewords is too large, again because it requires constructing a matrix whose rows and columns are indexed by the codewords and observed sequences. Given that storing and manipulating matrices larger than 10^5 × 10^5 is currently not feasible, a setting where I[←X, Z] > log 10^5 ≈ 11.5 nats cannot be captured with OCF.
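For concreteness, the Blahut-Arimoto-style self-consistent iteration at the heart of OCF can be sketched on an invented two-symbol joint distribution (our own minimal illustration; the toy numbers and variable names are not from the paper). The iteration alternates between computing the codeword marginal, the Bayes-optimal decoder, and the exponential-family encoder update:

```python
import math, random

# Minimal Blahut-Arimoto-style iteration for the Information Bottleneck
# objective used by OCF (our own illustration, not the authors' code).
# p_xy[i][j]: empirical joint probability of past sequence i and future j.
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
n_past, n_fut, n_code = 2, 2, 2
lam = 0.1  # trade-off coefficient lambda

p_x = [sum(row) for row in p_xy]
p_y_x = [[p_xy[i][j] / p_x[i] for j in range(n_fut)] for i in range(n_past)]

random.seed(0)
enc = []  # enc[i][z] = P(z | past i), initialized randomly
for _ in range(n_past):
    row = [random.random() for _ in range(n_code)]
    enc.append([e / sum(row) for e in row])

for _ in range(100):
    p_z = [sum(p_x[i] * enc[i][z] for i in range(n_past)) for z in range(n_code)]
    # decoder P(y|z), obtained from the encoder by Bayes' rule
    p_y_z = [[sum(p_xy[i][j] * enc[i][z] for i in range(n_past)) / p_z[z]
              for j in range(n_fut)] for z in range(n_code)]
    # encoder update: P(z|x) proportional to p(z) * exp(-KL(p(y|x) || p(y|z)) / lam)
    new_enc = []
    for i in range(n_past):
        scores = []
        for z in range(n_code):
            kl = sum(p_y_x[i][j] * math.log(p_y_x[i][j] / p_y_z[z][j])
                     for j in range(n_fut) if p_y_x[i][j] > 0)
            scores.append(p_z[z] * math.exp(-kl / lam))
        total = sum(scores)
        new_enc.append([s / total for s in scores])
    enc = new_enc

# rate I[X;Z] and predictiveness I[Y;Z] of the resulting codebook
p_z = [sum(p_x[i] * enc[i][z] for i in range(n_past)) for z in range(n_code)]
p_y = [sum(p_xy[i][j] for i in range(n_past)) for j in range(n_fut)]
rate = sum(p_x[i] * enc[i][z] * math.log(enc[i][z] / p_z[z])
           for i in range(n_past) for z in range(n_code) if enc[i][z] > 0)
p_yz = [[sum(p_xy[i][j] * enc[i][z] for i in range(n_past)) for j in range(n_fut)]
        for z in range(n_code)]
pred = sum(p_yz[z][j] * math.log(p_yz[z][j] / (p_z[z] * p_y[j]))
           for z in range(n_code) for j in range(n_fut) if p_yz[z][j] > 0)
print(round(rate, 3), round(pred, 3))  # predictiveness never exceeds rate
```

In actual OCF the rows and columns of p_xy are indexed by observed length-M past and future sequences, which is exactly where the |A|^M blow-up described above arises.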

Neural Estimation via Variational Upper Bound
We now introduce our method, Neural Predictive Rate-Distortion (NPRD), to address the limitations of OCF by using parametric function approximation: whereas OCF constructs a codebook mapping between observed sequences and codes, we use general-purpose function approximation methods to compute the representation Z from the past and to estimate a distribution over future sequences from Z. In particular, we will use recurrent neural networks, which are known to provide good models of sequences from a wide range of domains; our method is also applicable to other types of function approximators.
This will have two main advantages, addressing the limitations of OCF: first, unlike OCF, function approximators can discover generalizations across similar sequences, enabling the method to calculate good codes Z even for past sequences that were not seen previously. This is of paramount importance in settings where the state space is large, such as the set of words of a natural language. Second, the cost of storing and evaluating the function approximators will scale linearly with the length of observed sequences both in space and in time, as opposed to the exponential memory demand of OCF. This is crucial for modeling long dependencies.

Main Result: Variational Bound on Predictive Rate-Distortion
We will first describe the general method, without committing to a specific framework for function approximation yet. We will construct a bound on Predictive Rate-Distortion and optimize this bound in a parametric family of function approximators to obtain an encoding Z that is close to optimal for the nonparametric objective (7).
As in OCF (Section 3), we assume that a set of finite sample trajectories x_{−M} ... x_{M−1} is available, and we aim to compress pasts of length M to predict futures of length M. To carry this out, we restrict the PRD objective (7) to such finite-length pasts and futures:

max_Z  ( I[→X_M, Z] − λ I[←X_M, Z] ).   (10)

It will be convenient to equivalently rewrite (10) as the minimization

min_Z  ( H[→X_M | Z] + λ I[←X_M, Z] ),

where H[→X_M | Z] is the prediction loss. Note that minimizing prediction loss is equivalent to maximizing predictiveness I[→X_M, Z], since H[→X_M | Z] = H[→X_M] − I[→X_M, Z] and H[→X_M] does not depend on Z. When deploying such a predictive code Z, two components have to be computed: a distribution P(Z | ←X_M) that encodes past observations into a code Z, and a distribution P(→X_M | Z) that decodes the code Z into predictions about the future.
Let us assume that we already have some encoding distribution P_φ(Z | ←X_M), where φ is the encoder, expressed in some family of function approximators; the encoder transduces an observation sequence ←X_M into the parameters of the distribution P_φ(· | ←X_M). From this encoding distribution, one can obtain the optimal decoding distribution over future observations via Bayes' rule:

P(→X_M | Z) = E_{←X_M} [ P(→X_M | ←X_M) P_φ(Z | ←X_M) ] / E_{←X_M} [ P_φ(Z | ←X_M) ],   (13)

where the Markov condition Z ⊥ →X_M | ←X_M is used to replace P(→X_M | ←X_M, Z) by P(→X_M | ←X_M). However, neither of the two expectations in (13) is tractable, as they require summation over exponentially many sequences, and algorithms (e.g., dynamic programming) to compute this sum efficiently are not available in general. For a similar reason, the rate I[←X_M, Z] of the code Z ∼ P_φ(Z | ←X_M) is also intractable in general.
Our method will be to introduce additional functions, also expressed using function approximators, that approximate some of these intractable quantities. First, we will use a parameterized probability distribution q as an approximation to the intractable marginal

P(Z) = E_{←X_M} P_φ(Z | ←X_M).

Second, to approximate the decoding distribution P(→X_M | Z), we introduce a parameterized decoder ψ that maps each code Z ∈ R^N to a probability distribution P_ψ(→X_M | Z) over future observations →X_M. Crucially, P_ψ(→X_M | Z) will be easy to compute efficiently, unlike the exact decoding distribution P(→X_M | Z). If we fix a stochastic process (X_t)_{t∈Z} and an encoder φ, then the following two bounds hold for any choice of the decoder ψ and the distribution q:

Proposition 1. The loss incurred when predicting the future from Z via ψ upper-bounds the true conditional entropy of the future given Z, i.e., the loss attained when predicting via the exact decoding distribution (13):

E[ −log P_ψ(→X_M | Z) ] ≥ H[→X_M | Z].

Furthermore, equality is attained if and only if P_ψ(→X_M | Z) = P(→X_M | Z).
Proof. By Gibbs' inequality, the cross-entropy E[−log P_ψ(→X_M | Z)] is lower-bounded by the conditional entropy H[→X_M | Z], with equality if and only if P_ψ(→X_M | Z) = P(→X_M | Z).

Proposition 2. The rate of the code Z ∼ P_φ(· | ←X_M) is upper-bounded by a quantity computed from φ and q:

I[←X_M, Z] = E_{←X_M} D_KL( P_φ(· | ←X_M) ‖ P(·) ) = E[ log ( P_φ(Z | ←X_M) / P(Z) ) ] ≤ E[ log ( P_φ(Z | ←X_M) / q(Z) ) ].

Equality is attained if and only if q(Z) is equal to the true marginal P(Z) = E_{←X_M} P_φ(Z | ←X_M).

Proof. The two equalities follow from the definitions of KL Divergence and Mutual Information. To show the inequality, we again use Gibbs' inequality:

E[ log ( P_φ(Z | ←X_M) / q(Z) ) ] − E[ log ( P_φ(Z | ←X_M) / P(Z) ) ] = E[ log ( P(Z) / q(Z) ) ] = D_KL( P(Z) ‖ q(Z) ) ≥ 0.

Here, equality holds if and only if q(Z) = P(Z), proving the second assertion.
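Gibbs' inequality, which underlies both proofs, can be checked numerically on toy distributions (our own numbers, purely illustrative): the cross-entropy of a true distribution p under any approximation q exceeds the entropy of p, and the gap is exactly the KL divergence.

```python
import math

# Numerical check of Gibbs' inequality with invented toy numbers:
# predicting with an approximate decoder q incurs cross-entropy at least
# the entropy of the true conditional distribution p.
p = [0.7, 0.2, 0.1]   # true P(future | Z) for one fixed code value Z
q = [0.5, 0.3, 0.2]   # an approximate decoder P_psi(future | Z)

entropy = -sum(pi * math.log(pi) for pi in p)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

assert cross_entropy >= entropy            # Gibbs' inequality
print(round(cross_entropy - entropy, 4))   # 0.0851, the gap KL(p || q)
```

The gap vanishes exactly when q equals p, which is the equality condition in Proposition 1.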
We now use the two propositions to rewrite the Predictive Rate-Distortion objective in a way amenable to function approximators, which is our main theoretical result and the foundation of our proposed estimation method: the objective

min_Z  ( H[→X_M | Z] + λ I[←X_M, Z] )   (18)

is equivalent to

min_{φ,ψ,q}  E[ −log P_ψ(→X_M | Z) ] + λ E[ log ( P_φ(Z | ←X_M) / q(Z) ) ],   (19)

where φ, ψ, q range over all triples of the appropriate types described above.
From a solution to (19), one obtains a solution to (18) by setting Z ∼ P_φ(· | ←X_M). The rate of this code is given by

I[←X_M, Z] = E[ log ( P_φ(Z | ←X_M) / q(Z) ) ],   (20)

and the prediction loss is given by

H[→X_M | Z] = E[ −log P_ψ(→X_M | Z) ].   (21)

Proof. By the two propositions, the term inside the minimization in (19) is an upper bound on (18), and takes on that value if and only if P_φ(· | ←X_M) equals the distribution of the Z optimizing (18), and ψ, q are as described in those propositions.
Note that the right-hand sides of (20) and (21) can both be estimated efficiently using Monte Carlo sampling. If φ, ψ, q are not exact solutions to (19), the two propositions guarantee that we still have bounds on rate and prediction loss for the code Z generated by φ:

I[←X_M, Z] ≤ E[ log ( P_φ(Z | ←X_M) / q(Z) ) ],   (22)

H[→X_M | Z] ≤ E[ −log P_ψ(→X_M | Z) ].   (23)

To carry out the optimization (19), we will restrict φ, ψ, q to a powerful family of parametric function approximators, within which we will optimize the objective with gradient descent. While the solutions may not be exact solutions to the nonparametric objective (19), they will still satisfy the bounds (22) and (23), and, if the family of approximators is sufficiently rich, can come close to turning these into the equalities (20) and (21).

Choosing Approximating Families
For our method of Neural Predictive Rate-Distortion (NPRD), we choose the approximating families for the encoder φ, the decoder ψ, and the distribution q to be certain types of neural networks that are known to provide strong and general models of sequences and distributions.
For φ and ψ, we use recurrent neural networks with Long Short-Term Memory (LSTM) cells [32], widely used for modeling sequential data across different domains. We parameterize the distribution P_φ(Z | ←X_M) as a Gaussian whose mean and variance are computed from the past ←X_M:

Z ∼ N(µ, σ²).   (24)

We use an LSTM network to compute a vector h ∈ R^k from the past observations ←X_M, and then compute µ and σ from h using the parameter matrices W_µ, W_σ ∈ R^{k×k}. While we found Gaussians sufficiently flexible for φ, more powerful encoders could be constructed using more flexible parametric families, such as normalizing flows [19,33].
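With a Gaussian encoder, the per-sequence rate term in (19) has a closed form when q is also Gaussian. As a simplified sketch (our own illustration: the paper learns q as a neural autoregressive flow, but a fixed standard-normal q makes the rate term transparent):

```python
import math

# Rate term for a Gaussian encoder Z ~ N(mu, sigma^2) against a fixed
# standard-normal marginal q = N(0, 1) (a simplification of the paper's
# learned q): E[log P_phi(Z|past) - log q(Z)] = KL(N(mu, sigma^2) || N(0, 1)).
def gaussian_rate(mu, sigma):
    return 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))

print(gaussian_rate(0.0, 1.0))            # 0.0: encoder equals q, code carries nothing
print(round(gaussian_rate(2.0, 0.5), 4))  # 2.3181: a confident, shifted code pays a high rate
```

Codes that stay close to the shared marginal are cheap; codes that deviate from it, i.e., that actually depend on the past, pay rate. Learning q (as NPRD does) tightens this term toward the true rate (20).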
For the decoder ψ, we use a second LSTM network to compute a sequence of vector representations g_0, ..., g_{M−1} ∈ R^k, one per future position. We compute predictions using the softmax rule

P_ψ(X_t = s_i | Z, X_{0...t−1}) = softmax(W_o g_t)_i

for each element s_i of the state space S = {s_1, ..., s_|S|}, where W_o ∈ R^{|S|×k} is a parameter matrix to be optimized together with the parameters of the LSTM networks. For q, we choose the family of Neural Autoregressive Flows [20]. This is a parametric family of distributions that allows efficient estimation of the probability density and its gradients. This method widely generalizes a family of prior methods [19,33,34], offering efficient estimation while surpassing prior methods in expressivity.

Parameter Estimation and Evaluation
We optimize the parameters of the neural networks expressing φ, ψ, q for the objective (19) using Backpropagation and Adam [35], a standard and widely used gradient descent-based method for optimizing neural networks. During optimization, we approximate the gradient by taking a single sample from Z (24) per sample trajectory X −M , . . . , X M−1 and use the reparametrized gradient estimator introduced by Kingma and Welling [36]. This results in an unbiased estimator of the gradient of (19) w.r.t. the parameters of φ, ψ, q.
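The reparametrized estimator of Kingma and Welling [36] can be sketched in a few lines (a toy of our own, not the paper's training code): writing Z = µ + σε with ε ∼ N(0, 1) moves the parameters out of the sampling step, so gradients of an expectation over Z can be estimated by differentiating through samples of ε.

```python
import random

# Toy reparameterization-trick gradient estimate (our own illustration).
# For f(z) = z^2 and Z = mu + sigma * eps, d/dmu E[f(Z)] = E[f'(mu + sigma*eps)],
# which we estimate by Monte Carlo; the analytic answer is 2*mu.
random.seed(0)
mu, sigma = 1.0, 0.5
f_prime = lambda z: 2.0 * z

n = 100_000
grad_est = sum(f_prime(mu + sigma * random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(grad_est)  # close to the analytic gradient 2*mu = 2.0
```

In NPRD the same trick is applied per trajectory with a single sample, which keeps the gradient estimator unbiased while adding only one sampling step per training example.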
Following standard practice in machine learning, we split the data set of sample time series into three partitions (training set, validation set, and test set). We use the training set for optimizing the parameters as described above. After every pass through the training set, the objective (19) is evaluated on the validation set using a Monte Carlo estimate with one sample Z per trajectory; optimization terminates once the value on the validation set does not decrease any more.
After optimizing the parameters on a set of observed trajectories, we estimate rate and prediction loss on the test set. Given parameters for φ, ψ, q, we evaluate the PRD objective (19), the rate (22), and the prediction loss (23) on the test set by taking, for each time series X_{−M} ... X_{M−1} from the test set, a single sample z ∼ N(µ, σ²) and computing Monte Carlo estimates for rate,

(1/N) Σ_i [ log p_{N(µ_i, σ_i²)}(z_i) − log q(z_i) ],   (26)

where p_{N(µ,σ²)}(z) is the Gaussian density with µ, σ² computed from the past as in (24), and for prediction loss,

−(1/N) Σ_i log P_ψ(→x_M^{(i)} | z_i).   (27)

Thanks to (22) and (23), these estimates are guaranteed to be upper bounds on the true rate and prediction loss achieved by the code Z ∼ N(µ, σ²), up to sampling error introduced into the Monte Carlo estimation by sampling z and by the finite size of the test set.
It is important to note that this sampling error is different from the overfitting issue affecting OCF: Equations (26) and (27) provide unbiased estimators of upper bounds, whereas overfitting biases the values obtained by OCF. Given that Neural Predictive Rate-Distortion provably provides upper bounds on rate and prediction loss (up to sampling error), one can objectively compare the quality of different estimation methods: among methods that provide upper bounds, the one that provides the lowest such bound for a given problem is the one giving results closest to the true curve.
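As a toy illustration of the Monte Carlo rate estimate (our own numbers; both the encoder and q are taken to be one-dimensional Gaussians so that the limit is available in closed form): averaging log p(z) − log q(z) over samples from the encoder converges to the KL divergence between encoder and q, which is exactly the per-sequence contribution to the rate bound.

```python
import math, random

# Monte Carlo rate estimate in the style of (26), toy version: the average of
# log p_{N(mu,sigma^2)}(z) - log q(z) over z ~ N(mu, sigma^2) converges to
# KL(N(mu, sigma^2) || q), computable here in closed form.
def log_normal_pdf(z, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma * sigma) - (z - mu) ** 2 / (2 * sigma * sigma)

random.seed(1)
mu, sigma = 1.0, 1.0        # encoder parameters for one past sequence
q_mu, q_sigma = 0.0, 2.0    # approximate marginal q (here just another Gaussian)

n = 50_000
est = 0.0
for _ in range(n):
    z = random.gauss(mu, sigma)
    est += log_normal_pdf(z, mu, sigma) - log_normal_pdf(z, q_mu, q_sigma)
est /= n

# closed-form KL between two univariate Gaussians
analytic = math.log(q_sigma / sigma) + (sigma**2 + (mu - q_mu)**2) / (2 * q_sigma**2) - 0.5
print(round(analytic, 4))  # 0.4431; est agrees up to Monte Carlo noise
```

Per-sample values of the log ratio can be negative, but the average is nonnegative in expectation, consistent with (22) being an upper bound on the rate.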

Estimating Predictiveness
Given the estimate for prediction loss, we estimate the predictiveness I[Z, →X_M] as follows. We use the encoder network that computes the code vector h (24) to also estimate the marginal probability P_η(←X_M) of the past observation sequence; P_η has support over sequences of length M. Similar to the decoder ψ, we use the vector representations f_t ∈ R^k computed by the LSTM encoder after processing X_{−M...t} for t = −M, ..., −1, and compute predictions using the softmax rule

P_η(X_{t+1} = s_i | X_{−M...t}) = softmax(W_o f_t)_i,

where W_o ∈ R^{|S|×k} is another parameter matrix.
Because we consider stationary processes, the cross-entropy of →X_M under P_η is equal to the cross-entropy of ←X_M under the same distribution:

E[ −log P_η(→X_M) ] = E[ −log P_η(←X_M) ],

which we approximate using Monte Carlo sampling on the test set as in (26) and (27). In order to optimize the parameters for the estimation of P_η, we add this cross-entropy term to (19) during optimization, so that the full training objective comes out to

min_{φ,ψ,q,η}  E[ −log P_ψ(→X_M | Z) ] + λ E[ log ( P_φ(Z | ←X_M) / q(Z) ) ] + E[ −log P_η(←X_M) ].

Again, by Gibbs' inequality and Propositions 1 and 2, this is minimized when P_η represents the true distribution over length-M blocks P(←X_M), P_φ(Z | ←X_M) describes an optimal code for the given λ, q is its marginal distribution, and P_ψ(→X_M | Z) is the Bayes-optimal decoder. For approximate solutions to this augmented objective, the inequalities (22) and (23) remain true due to Propositions 1 and 2. The predictiveness is then estimated as the difference between the cross-entropy estimate of H[→X_M] obtained from P_η and the prediction loss (27).

Related Work
In (19), we derived a variational formulation of Predictive Rate-Distortion. This is formally related to a variational formulation of the Information Bottleneck that was introduced by [38], who applied it to neural-network based image recognition. Unlike our approach, they used a fixed diagonal Gaussian instead of a flexible parameterized distribution for q. Some recent work has applied similar approaches to the modeling of sequences, employing models corresponding to the objective (19) [39][40][41][42].
In the neural networks literature, the most commonly used method using variational bounds similar to Equation (19) is the Variational Autoencoder [36,43], which corresponds to the setting where λ = 1 and the predicted output is equal to the observed input. The β-VAE [44], a variant of the Variational Autoencoder, uses λ > 1 (whereas the Predictive Rate-Distortion objective (7) uses λ ∈ (0, 1)), and has been linked to the Information Bottleneck by [45].

Experiments
We now test the ability of our new method NPRD to estimate rate-distortion curves. Before we apply NPRD to obtain the first estimates of Predictive Rate-Distortion for natural language in Section 6, we validate the method on processes whose trade-off curves are analytically known, and compare with OCF.

OCF
As discussed in Section 3, OCF is affected by overfitting, and will systematically overestimate the predictiveness achieved at a given rate [6]. To address this problem, we follow the evaluation method used for NPRD, evaluating rate and predictiveness on held-out test data. We partition the available time series data into a training and a test set. We use the training set to create the encoder P(Z | ←X_M) using the Blahut-Arimoto algorithm as described by Still et al. [6]. We then use the held-out test set to estimate rate and prediction loss. This method not only enables fair comparison between OCF and NPRD, but also provides a more realistic evaluation, by focusing on the performance of the code Z when deployed on new data samples. For rate, we use the same variational bound that we use for NPRD, stated in Proposition 2:

(1/N) Σ_{i=1}^{N} Σ_Z P(Z | ←x_M^{(i)}) log ( P(Z | ←x_M^{(i)}) / s(Z) ),   (31)

where P(Z | ←X_M) is the encoder created by the Blahut-Arimoto algorithm, s(Z) is the marginal distribution of Z on the training set, ←x_M^{(i)} ranges over the past sequences in the test data, and N is the number of sample time series in the test data. In the limit of enough training data, when s(Z) matches the actual population marginal of Z, (31) is an unbiased estimate of the rate. We estimate the prediction loss on the future observations as the empirical cross-entropy, i.e., the variational bound stated in Proposition 1:

−(1/N) Σ_{i=1}^{N} Σ_Z P(Z | ←x_M^{(i)}) log P(→x_M^{(i)} | Z),

where P(→X_M | Z) is the decoder obtained from the Blahut-Arimoto algorithm on the training set. Thanks to Propositions 1 and 2, these quantities provide upper bounds, up to sampling error introduced by the finiteness of the held-out data. Again, sampling error does not bias the results in either direction, unlike overfitting, which introduces a systematic bias.
Held-out test data may contain sequences that did not occur in the training data. Therefore, we add a pseudo-sequence ω, together with corresponding pseudo-observations, to the matrix of observed counts that serves as the input to the Blahut-Arimoto algorithm. These pseudo-observations were assigned pseudo-counts γ in this matrix of observed counts; we found that a wide range of values from 0.0001 to 1.0 yielded essentially the same results. When evaluating the codebook on held-out data, previously unseen sequences were mapped to ω.

Neural Predictive Rate-Distortion
For all experiments, we used M = 15. Neural networks have hyperparameters, such as the number of units and the step size used in optimization, which affect the quality of approximation depending on properties of the dataset and the function being approximated. Given that NPRD provably provides upper bounds on the PRD objective (18), one can in principle identify the best hyperparameters for a given process by choosing the combination that leads to the lowest estimated upper bounds. As a computationally more efficient method, we defined a range of plausible hyperparameters based both on experience reported in the literature, and considerations of computational efficiency. These parameters are discussed in Appendix A. We then randomly sampled, for each of the processes that we experimented on, combinations of λ and these hyperparameters to run NPRD on. We implemented the model using PyTorch [46].

Analytically Tractable Problems
We first test NPRD on two processes whose Predictive Rate-Distortion trade-off is analytically tractable. The Even Process [2] is the process of IID fair coin flips over {0, 1}, conditioned on all blocks of consecutive ones having even length. Its complexity and excess entropy are both ≈ 0.63 nats. It has infinite Markov order, and Marzen and Crutchfield [2] find that OCF (at M = 5) performs poorly on it. The true Predictive Rate-Distortion curve was computed in [2] using the analytically known ε-machine. The Random Insertion Process [2] consists of sequences of uniform coin flips X_t ∈ {0, 1}, subject to the constraint that, if X_{t−2} was a 0, then X_t has to be a 1.
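Trajectories from such processes can be sampled directly from their ε-machines. As a minimal sketch (our own code, not the authors' data-generation pipeline), the Even Process can be sampled from the standard two-state construction:

```python
import random

# Sampler for the Even Process via a two-state epsilon-machine (our own
# illustration): state A emits 0 or 1 with prob 1/2, emitting 1 moves to B;
# state B emits 1 with certainty and returns to A. Hence every completed
# block of consecutive 1s has even length.
def sample_even_process(n, seed=0):
    rng = random.Random(seed)
    state, out = 'A', []
    for _ in range(n):
        if state == 'A':
            if rng.random() < 0.5:
                out.append(0)
            else:
                out.append(1)
                state = 'B'
        else:
            out.append(1)
            state = 'A'
    return out

seq = sample_even_process(10_000)
# every maximal interior run of 1s (bounded by 0s on both sides) has even length
runs = [len(r) for r in ''.join(map(str, seq)).strip('1').split('0') if r]
print(all(n % 2 == 0 for n in runs))  # True
```

A sampler for the Random Insertion Process can be written analogously from its ε-machine; runs at the sequence boundaries are discarded above because they may be truncated mid-block.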
We applied NPRD to these processes by training on 3M random trajectories of length 30, and using 3000 additional trajectories for validation and test data. For each process, we ran NPRD 1000 times for random choices of λ ∈ [0, 1]. Due to computational constraints, when running OCF, we limited sample size to 3000 trajectories for estimation and as held-out data. Following Marzen and Crutchfield [2], we ran OCF for M = 1, ..., 5.
The resulting estimates are shown in Figure 1, together with the analytical rate-distortion curves computed by Marzen and Crutchfield [2]. Individual runs of NPRD show variation (red dots), but most runs lead to results close to the analytical curve (gray line), and strongly surpass the curves computed by OCF at M = 5. Bounding the trade-off curve using the sets of runs of NPRD results in a close fit (red line) to the analytical trade-off curves.

Recovering Causal States
Does NPRD lead to interpretable codes Z? To answer this, we further investigated the NPRD approximation to the Random Insertion Process (RIP), obtained in the previous paragraph.
The analytical solution to the Predictive Rate-Distortion problem for RIP was computed by Marzen and Crutchfield [2]; the process has three causal states, labeled A, B, and C (Figure 2, left). At λ > 0.5, the optimal solution collapses A and B into a single codeword, while all three states are mapped to separate codewords for λ ≤ 0.5.
Does NPRD recover this picture? We applied PCA to samples from Z computed at two different values of λ, λ = 0.25 and λ = 0.6. The first two principal components of Z are shown in Figure 3. Samples are colored by the causal states corresponding to the pasts of the trajectories that were encoded into the respective points by NPRD. On the left, obtained at λ = 0.6, the states A and B are collapsed, as expected. On the right, obtained at λ = 0.25, the three causal states are reflected as distinct modes of Z. Note that, at finite M, a fraction of pasts is ambiguous between the green and blue causal states; these are colored in black, and NPRD maps them into a region between the modes corresponding to these states.
Figure 2. Left: the ε-machine of the Random Insertion Process, as described by [2]. Right: after computing a code Z from a past x_{−15···−1}, we recorded which of the three clusters the code moves to when appending the symbol 0 or 1 to the past sequence. The resulting transitions mirror those in the ε-machine. A small number of samples come from sequences that, at M = 15, cannot be uniquely attributed to any of the states (ambiguous between A and C); these are indicated in black.
In Figure 2 (right), we record, for each of the three modes, to which cluster the distribution of the code Z shifts when a symbol is appended. We restrict to those strings that have nonzero probability for RIP (no code will ever be needed for other strings). For comparison, we show the ε-machine computed by Marzen and Crutchfield [2]. Comparing the two diagrams shows that NPRD effectively recovers the ε-machine: the three causal states are represented by the three different modes of Z, and the effect of appending a symbol also mirrors the state transitions of the ε-machine.
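The PCA projection used in this analysis takes a few lines of linear algebra. The sketch below is an illustrative stand-in: in the experiment, the rows of `Z` are code samples produced by the trained NPRD encoder.

```python
import numpy as np

def pca_2d(Z):
    """Project code samples Z (shape: n_samples x dim) onto their first
    two principal components, as used to visualize the modes of the
    code distribution."""
    Zc = Z - Z.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:2].T
```

Coloring each projected point by the causal state of the past it encodes then reveals whether the causal states appear as distinct modes, as in Figure 3.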

A Process with Many Causal States
We have seen that NPRD recovers the correct trade-off, and the structure of the causal states, in processes with a small number of causal states. How does it behave when the number of causal states is very large? In particular, is it capable of extrapolating to causal states that were never seen during training?
We consider the following process, which we will call COPY3: X_{−15}, ..., X_{−1} are independent uniform draws from {1, 2, 3}, and X_1 = X_{−1}, ..., X_15 = X_{−15}. This process deviates a bit from our usual setup, since we define it only for t ∈ {−15, ..., 15}, but it is well-suited to investigating this question: the number of causal states is 3^15 ≈ 14 million. With exactly the same setup as for the Even Process and the Random Insertion Process, NPRD achieved essentially zero distortion on unseen data, even though the number of training samples (3 million) was far lower than the number of distinct causal states. However, we found that, in this setup, NPRD overestimated the rate. After increasing the number of training samples from 3M to 6M, NPRD recovered codebooks that achieved both almost zero distortion and almost optimal rate on fresh samples (Figure 4). Even then, the number of distinct causal states is more than twice the number of training samples. These results demonstrate that, by using function approximation, NPRD is capable of extrapolating to unseen causal states, encoding and decoding appropriate codes on the fly. Note that one could easily design an optimal encoder and decoder for COPY3 by hand; the point of this experiment is to demonstrate that NPRD is capable of inducing such a codebook purely from data, in a general-purpose, off-the-shelf manner. This contrasts with OCF: without optimizations specific to the task at hand, a direct application of OCF would require brute-force storage of all 14 million distinct pasts and futures.
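To make the last point concrete, here is a hypothetical hand-built codec for COPY3 (our own illustration, not part of NPRD): packing the 15 ternary symbols into a single integer achieves zero distortion at the optimal rate of 15 log 3 ≈ 16.5 nats.

```python
import random

def sample_copy3(rng=random.Random(0)):
    """One trajectory of COPY3: the future repeats the past in reverse
    order (X_1 = X_{-1}, ..., X_15 = X_{-15})."""
    past = [rng.choice([1, 2, 3]) for _ in range(15)]
    return past, past[::-1]

def encode(past):
    """Pack the past into one base-3 integer; X_{-1} is the lowest digit."""
    z = 0
    for x in past:
        z = 3 * z + (x - 1)
    return z

def decode(z):
    """Unpack, lowest digit first: yields X_{-1}, ..., X_{-15},
    which is exactly the future X_1, ..., X_15."""
    future = []
    for _ in range(15):
        future.append(z % 3 + 1)
        z //= 3
    return future
```

This hand-built codebook is trivial to specify, but it requires knowing the process; NPRD has to induce an equivalent codebook purely from sample trajectories.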

Estimating Predictive Rate-Distortion for Natural Language
We consider the problem of estimating rate-distortion for natural language. Natural language has been a testing ground for information-theoretic ideas since Shannon's work. Much interest has been devoted to estimating the entropy rate of natural language [10,47–49]. Indeed, the information density of language has been linked to human processing effort and to language structure. The word-by-word information content has been shown to impact human processing effort as measured both by per-word reading times [50–52] and by brain signals measured through EEG [53,54]. Consequently, prediction is a fundamental component across theories of human language processing [54]. Relatedly, the Uniform Information Density and Constant Entropy Rate hypotheses [55–57] state that languages order information in sentences and discourse so that the entropy rate stays approximately constant.
The relevance of prediction to human language processing makes the difficulty of prediction another interesting aspect of language complexity: Predictive Rate-Distortion describes how much memory of the past humans need to maintain to predict future words accurately. Beyond the entropy rate, it thus forms another important aspect of linguistic complexity.
Understanding the complexity of prediction in language holds promise for a deeper understanding of the nature of language as a stochastic process and of human language processing. Long-range correlations in text have been a subject of study for a while [58–63]. Recently, Dębowski [64] has studied the excess entropy of language across long-range discourses, aiming to better understand the nature of the stochastic processes underlying language. Koplenig et al. [65] show a link between traditional linguistic notions of grammatical structure and the information contained in word forms and word order. The idea that predicting future words creates a need to represent the past well also forms a cornerstone of theories of how humans process sentences [66,67].
We study prediction at the scale of words within individual sentences. As in the previous experiments, we limit our computations to sequences of length 30, already an order-of-magnitude improvement over OCF. One motivation is that, when directly estimating PRD, computational cost has to increase with the length of the sequences considered, making the consideration of sequences of hundreds or thousands of words computationally infeasible. Another motivation is that we are ultimately interested in Predictive Rate-Distortion as a model of memory in human processing of grammatical structure, formalizing psycholinguistic models of how humans process individual sentences [66,67] and linking to studies of the relation between information theory and grammar [65].

Part-of-Speech-Level Language Modeling
We first consider the problem of predicting English on the level of Part-of-Speech (POS) tags, using the Universal POS tagset [68]. This is a simplified setting where the vocabulary is small (20 word types) and one can hope that OCF will produce reasonable results. We use the English portions of the Universal Dependencies Project [69] tagged with Universal POS Tags [68], consisting of approximately 586K words. We used the training portions to estimate NPRD and OCF, and the validation portions to estimate the rate-distortion curve. We used NPRD to generate 350 codebooks for values of λ sampled from [0, 0.4]. We were only able to run OCF for M ≤ 3, as the number of sequences exceeds 10^4 already at M = 4.
The PRD curve is shown in Figure 5 (left). In the setting of low rate and high distortion, NPRD and OCF (blue, M = 1, 2, 3) show essentially identical results. This holds true until I[Z, →X] ≈ 0.7, at which point the bounds provided by OCF deteriorate, showing the effects of overfitting. NPRD continues to provide estimates at greater rates. Figure 5 (center) shows rate as a function of log(1/λ). Recall that λ is the trade-off parameter from the objective function (7). In Figure 5 (right), we show rate and the mutual information with the future as functions of log(1/λ). As λ → 0, NPRD (red, M = 15) continues to discover structure, while OCF (blue, plotted for M = 1, 2, 3) exhausts its capacity.
Note that NPRD reports rates of 15 nats and more when modeling with very low distortion. A discrete codebook would need over 3 million distinct codewords for a code of such a rate (e^15 ≈ 3.3 million), exceeding the size of the training corpus (about 500K words). This replicates what we found for the COPY3 process: neural encoders and decoders can use the geometric structure of the code space to encode generalizations across different dimensions, supporting a very large effective number of distinct possible codes. Unlike discrete codebooks, the geometric structure makes it possible for neural encoders to construct appropriate codes 'on the fly' for new input.

Discussion
Let us now consider the curves in Figure 5 in more detail. Fitting parametric curves to the empirical PRD curves in Figure 5, we find the surprising result that the statistical complexity of English sentences at the POS level appears to be unbounded.
The rate-predictiveness curve (left) shows that, at low rates, predictiveness is approximately proportional to the rate. At greater degrees of predictiveness, the rate grows faster and faster, whereas predictiveness seems to asymptote to ≈ 1.1 nats. The asymptote of predictiveness can be identified with the mutual information between past and future observations, E_0 := I[←X, →X], which is a lower bound on the excess entropy. The rate should asymptote to the statistical complexity. Judging by the curve, natural language (at the time scale we are measuring in this experiment) has a statistical complexity much higher than its excess entropy: at the highest rate measured by NPRD in our experiment, the rate is about 20 nats, whereas predictiveness is about 1.1 nats. If these values are correct, then, due to the convexity of the rate-predictiveness curve, statistical complexity exceeds the excess entropy by a factor of at least 20/1.1 ≈ 18. This picture agrees qualitatively with the OCF results, which suggest a lower bound on this ratio of at least 2.5/0.6 > 4. Now, turning to the other plots in Figure 5, we observe that rate increases at least linearly with log(1/λ), whereas predictiveness again asymptotes. This is in qualitative agreement with the picture gained from the rate-predictiveness curve.
Let us consider this more quantitatively. Based on Figure 5 (center), we make the ansatz that the map from log(1/λ) to the rate R := I[←X, Z] is superlinear:

R ≈ α (log(1/λ))^β,    (33)

with α > 0, β > 1. We fitted R ≈ (log(1/λ))^1.7, i.e., α = 1, β = 1.7. Equivalently,

λ = exp(−(R/α)^{1/β}).    (34)

At an optimum of the objective function (7), the codebook θ defining the encoding distribution P(Z | ←X) trades predictiveness P := I[Z, →X] against rate at slope λ, so that dP/dR = λ. Our ansatz therefore leads to the equation

dP/dR = exp(−(R/α)^{1/β}).    (37)

Qualitatively, this says that predictiveness P asymptotes to a finite value, whereas rate R, which should asymptote to the statistical complexity, is unbounded.
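As a sanity check on the ansatz (33), the exponent β can be recovered by linear regression in log-log coordinates. The snippet below is a sketch using synthetic (λ, R) pairs in place of the actual NPRD runs:

```python
import numpy as np

# Hypothetical (lam, rate) pairs standing in for the NPRD runs;
# generated from the ansatz R = alpha * log(1/lam)**beta.
alpha_true, beta_true = 1.0, 1.7
lam = np.linspace(0.01, 0.9, 50)
rate = alpha_true * np.log(1.0 / lam) ** beta_true

# Linear regression in log-log coordinates:
#   log R = log(alpha) + beta * log(log(1/lam))
x = np.log(np.log(1.0 / lam))
y = np.log(rate)
beta_hat, log_alpha_hat = np.polyfit(x, y, 1)
```

On the real NPRD estimates, the same regression yields the reported fit α = 1, β = 1.7.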

Equation (37) has the solution

P(R) = C − αβ Γ(β, (R/α)^{1/β}),

where Γ is the upper incomplete Gamma function. Since lim_{R→∞} P = C, the constant C has to equal the maximal possible predictiveness E_0 := I[←X, →X]. Given the values fitted above (α = 1, β = 1.7), we found that E_0 = 1.13 yielded a good fit. Using (33), this can be extended without further parameters to the third curve in Figure 5. The resulting fits are shown in Figure 6.
Note that there are other possible ways of fitting these curves; we have described a simple one that requires only two parameters α > 0 and β > 1, in addition to a guess for the maximal predictiveness E_0. In any case, the results show that natural language exhibits approximately linear growth of predictiveness with rate at small rates, and exploding rates with diminishing returns in predictiveness later.

Word-Level Language Modeling
We applied NPRD to the problem of predicting English on the level of part-of-speech tags in Section 6.1. We found that the resulting curves were described well by Equation (37). We now consider the more realistic problem of prediction at the level of words, using data from multiple languages. This problem is much closer to the task faced by a human in the process of comprehending text, having to encode prior observations so as to minimize prediction loss on the upcoming words. We will examine whether Equation (37) describes the resulting trade-off in this more realistic setting, and whether it holds across languages.
For the setup, we followed a standard setup for recurrent neural language modeling. The hyperparameters are shown in Table A1. Following standard practice in neural language modeling, we restrict the observation space to the most frequent 10^4 words; other words are replaced by their part-of-speech tag. We do this for simplicity and to stay close to standard practice in natural language processing; NPRD could deal with unbounded state spaces through a range of more sophisticated techniques such as subword modeling and character-level prediction [70,71].
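A minimal sketch of this vocabulary-truncation step is given below; the function and variable names are ours, not the paper's actual preprocessing code:

```python
from collections import Counter

def truncate_vocab(tagged_corpus, max_vocab=10_000):
    """Replace words outside the max_vocab most frequent types with
    their part-of-speech tag, as done before word-level training.

    tagged_corpus: list of (word, pos_tag) pairs.
    Returns the corpus as a list of tokens.
    """
    counts = Counter(word for word, _ in tagged_corpus)
    keep = {w for w, _ in counts.most_common(max_vocab)}
    return [w if w in keep else tag for w, tag in tagged_corpus]
```

Rare words are thus mapped to a small closed set of POS tags, keeping the observation space bounded without discarding tokens entirely.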
We used data from five diverse languages. For English, we turn to the Wall Street Journal portion of the Penn Treebank [72], a standard benchmark for language modeling, containing about 1.2 million tokens. For Arabic, we pooled all relevant portions of the Universal Dependencies treebanks [73][74][75]. We obtained 1 million tokens. We applied the same method to construct a Russian corpus [76], obtaining 1.2 million tokens. For Chinese, we use the Chinese Dependency Treebank [77], containing 0.9 million tokens. For Japanese, we use the first 2 million words from a large processed corpus of Japanese business text [78]. For all these languages, we used the predefined splits into training, validation, and test sets.
For each language, we sampled about 120 values of log λ uniformly from [−6, 0] and applied NPRD to these. The resulting curves are shown in Figures 7 and 8, together with fitted curves resulting from Equation (37). As can be seen, the curves are qualitatively very similar across languages to what we observed in Figure 6: in all languages, rate initially scales linearly with predictiveness, but diverges as the predictiveness approaches its supremum E_0. As a function of log(1/λ), rate grows at a slightly superlinear speed, confirming our ansatz (33). These results confirm our results from Section 6.1. At the time scale of individual sentences, Predictive Rate-Distortion of natural language appears to quantitatively follow Equation (37). NPRD reports rates up to ≈ 60 nats, more than ten times the largest values of predictiveness. On the other hand, the growth of rate with predictiveness is relatively gentle in the low-rate regime. We conclude that predicting words in natural language can be approximated with small memory capacity, but more accurate prediction requires very fine-grained memory representations.

General Discussion
Our analysis of PRD curves for natural language suggests that human language is characterized by very high, perhaps infinite, statistical complexity, far exceeding its excess entropy. In a similar vein, Dębowski [64] has argued that the excess entropy of connected texts in natural language is infinite (in contrast, our result is for isolated sentences). If the statistical complexity of natural language is indeed infinite, then statistical complexity is not a sufficiently fine-grained complexity metric for characterizing natural language.
We suggest that the PRD curve may form a more natural complexity metric for highly complex processes such as language. Among those processes with infinite statistical complexity, some will have a gentle PRD curve-meaning that they can be well-approximated at low rates-while others will have a steep curve, meaning they cannot be well-approximated at low rates. We conjecture that, although natural language may have infinite statistical complexity, it has a gentler PRD curve than other processes with this property, meaning that achieving a reasonable approximation of the predictive distribution does not require inordinate memory resources.

Conclusions
We introduce Neural Predictive Rate-Distortion (NPRD), a method for estimating Predictive Rate-Distortion when only sample trajectories are given. Unlike OCF, the most general prior method, NPRD scales to long sequences and large state spaces. On analytically tractable processes, we show that it closely fits the analytical rate-distortion curve and recovers the causal states of the process. On part-of-speech-level modeling of natural language, it agrees with OCF in the setting of low rate and short sequences; outside these settings, OCF fails due to combinatorial explosion and overfitting, while NPRD continues to provide estimates. Finally, we use NPRD to provide the first estimates of Predictive Rate-Distortion for modeling natural language in five different languages, finding qualitatively very similar curves in all languages.
All code for reproducing the results in this work is available at https://github.com/m-hahn/predictive-rate--distortion.

Acknowledgments: We thank Dan Jurafsky for helpful discussion, and the anonymous reviewers for helpful comments.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Hyperparameters
All hyperparameter choices are shown in Table A1. We defined separate hyperparameter ranges for the three Sections 5.2, 6.1, and 6.3. Guided by the fact that the analytically known processes in Section 5.2 are arguably less complex than natural language, we allowed larger and more powerful models for modeling of natural language (Sections 6.1 and 6.3), in particular word-level modeling (Section 6.3).
In Table A1, hyperparameters are organized into four groups. The first group contains the dimensions of the input embedding and the recurrent LSTM states. The second group contains regularization parameters: we apply dropout [79] to the input and output layers. The third group concerns the optimization procedure [35]. Neural Autoregressive Flows, used to approximate the marginal q, have their own set of hyperparameters [20]: the length of the flow (1, 2, ...), the type (DSF, Deep Sigmoid Flow, or DDSF, Deep Dense Sigmoid Flow), the dimension of the flow (an integer), and the number of layers in the flow (1, 2, ...). Larger dimensions, more layers, and longer flows lead to more expressive models; however, they are computationally more expensive.
Training used early stopping on the development set, as described in Section 4.3. Models were trained on a TITAN Xp graphics card. On the Even Process and the Random Insertion Process, NPRD took a median of 10 min to process the 3M training samples. OCF took less than one minute at M ≤ 5; however, it does not scale to larger values of M. On English word-level modeling, training took a median of nine epochs (max 467 epochs) and five minutes (max 126 min).

Appendix B. Alternative Modeling Choices
In this section, we investigate the trade-offs involved in alternative modeling choices. It may be possible to use simpler function approximators for φ, ψ, and q, or smaller context window sizes M, without harming accuracy too much.
First, we investigated the performance of a simple fixed approximation to the marginal q. We considered a diagonal unit-variance Gaussian, as is common in the literature on Variational Autoencoders [36]. We show results in Figure A1. Even with this fixed approximation to q, NPRD continues to provide estimates not far from the analytical curve. However, comparison with the results obtained from full NPRD (Figure 1) shows that a flexible parameterized approximation still provides a considerably better fit.
Second, we investigated whether the use of recurrent neural networks is necessary. As recurrent models such as LSTMs process input sequentially, they cannot be fully parallelized, posing the question of whether they can be replaced by models that can be parallelized. Specifically, we considered Quasi-Recurrent Neural Networks (QRNNs), which combine convolutions with a weak form of recurrence, and which have shown strong results on language modeling benchmarks [80]. We replaced the LSTM encoder and decoder with QRNNs and fit NPRD to the Even Process and the Random Insertion Process. We found that, when using QRNNs, NPRD consistently fitted codes with zero rate for the Even Process and the Random Insertion Process, indicating that the QRNN was failing to extract useful information from the past of these processes. We also found that the cross-entropies of the estimated marginal distribution P η ( M← X ) were considerably worse than when using LSTMs or simple RNNs. We conjecture that, due to the convolutional nature of QRNNs, they cannot model such processes effectively in principle: QRNNs extract representations by pooling embeddings of words or n-grams occurring in the past. When modeling, e.g., the Even Process, the occurrence of specific n-grams is not informative about whether the length of the last block is even or odd; this requires some information about the positions in which these n-grams occur, which is not available to the QRNN, but which is generally available in a more general autoregressive model such as an LSTM.
Third, we varied M, comparing M = 5, 10 to the results obtained with M = 15. Results are shown in Figure A2. At M = 5, NPRD provides estimates similar to OCF (Figure 1). At M = 10, the estimates are close to the analytical curve; nonetheless, M = 15 yields clearly more accurate results.

Appendix C. Sample Runs on English Text
In Figure A3, we provide four sample outputs from English word-level modeling with three values of log(1/λ) (1.0, 3.0, 5.0), corresponding to low (1.0), medium (3.0), and high (5.0) rates (compare Figure 7). We obtained sample sequences by selecting the first 32 sequences (←X, →X) (at M = 15) from the Penn Treebank validation set, and selected four examples where the variation in cross-entropy values at X_0 was largest between the three models.
Across these samples, models generated at log(1/λ) = 5 show lower cross-entropy on the first future observation X_0, as these codes have higher rates. For instance, in the first sample, the cross-entropy on the first word Jersey is lowest for the code with the highest rate; indeed, this word is presumably strongly predicted by the preceding sequence ...Sen. Bill Bradley of New. Codes with higher rates are better at encoding such predictive information from the past.
Figure A3. Four example outputs from English word-level modeling, with low rate (log(1/λ) = 1.0; red, dotted), medium rate (log(1/λ) = 3.0; green, dashed), and high rate (log(1/λ) = 5.0; blue, solid). For each sample, we provide the prior context ←X (top) and the per-word cross-entropies (in nats) on the future words →X (bottom). The four past contexts are:
Past: within minutes after the stock market closed friday , i called Sen. Bill Bradley of new
Past: asset you own would be worth more if the government suddenly announced that if you sold
Past: he thinks , when the white house is forced to ask for higher taxes to meet
Past: to be dealt with by the oct. 15 deadline , and these senate democrats refused to