From Boltzmann to Zipf through Shannon and Jaynes

The word-frequency distribution provides the fundamental building blocks that generate discourse in language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf's law, at least approximately. Following Stephens and Bialek [Phys. Rev. E 81, 066119, 2010], we interpret the frequency of any word as arising from the interaction potentials between its constituent letters. Indeed, Jaynes' maximum-entropy principle, with the constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of all pairwise (two-letter) potentials. The improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. Applying this formalism to words with up to six letters from the English subset of the recently created Standardized Project Gutenberg Corpus, we find that the model is able to reproduce Zipf's law, but with some limitations: the general Zipf's power-law regime is obtained, but the probability of individual words shows considerable scattering. In this way, a pure statistical-physics framework is used to describe the probabilities of words. As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws.


INTRODUCTION
Zipf's law is a pattern that emerges in many complex systems composed of individual elements that can be grouped into different classes or types [1]. It has been reported in demography, with citizens linked to the city or village where they live [2]; in sociology, with believers gathering into religions [3]; in economics, with employees hired by companies [4]; and also in ecology [5,6], communications [3,7], cell biology [8], and even music [9][10][11]. In all these cases, the size of the groups, in terms of the number of their constituent elements, shows extremely large variability, more or less well described in some range of sizes by a power-law distribution with an exponent close to two (for the probability mass function; this turns into an exponent close to one for the complementary cumulative distribution). Of particular interest is Zipf's law in linguistics [12][13][14][15][16][17], for which the individual elements are word tokens (i.e., word occurrences in a text), and the classes or groups are the words themselves (word types). In this way, the "size" of a word type is given by the number of its tokens that appear in the text under study (in other words, the absolute frequency of the word).
There have been many attempts to provide a mechanism for this curious law [18][19][20].
With text generation in mind, we can mention monkey typing, also called intermittent silence [21,22], the least-effort principle [23][24][25], sample-space reduction [26], and codification optimization [27]. More general mechanistic models for Zipf's law are preferential attachment [28][29][30][31], birth-and-death processes [32], variations of Polya urns [33], and random walks on networks [34]. The existence of so many models and explanations is a clear indication of the controversial origin of the law. Further, there have also been important attempts to explain not only Zipf's law but any sort of power-law distribution in nature [35][36][37][38].
A different approach is provided by the maximum-entropy principle. In statistical physics it is well known that a closed system in equilibrium with a thermal bath displays fluctuations in its energy but keeps a constant mean energy. As Jaynes showed [39], the maximization of the Shannon entropy with the constraint that the mean energy is fixed yields the Boltzmann factor, which states that the probability of any microstate has to be an exponential function of its energy (note that this does not mean that the distribution of energy is exponential, as the number of microstates as a function of the energy is not necessarily constant).
Therefore, some authors have looked for an analogue of the Boltzmann factor for power laws. For example, one can easily obtain a power law by imposing not a constant (arithmetic) mean but a constant geometric mean [40] (assuming also a degeneracy that is constant with respect to the energy). Also, fixing both the arithmetic and the geometric mean leads to a power law with an exponential tail [41]. Nevertheless, the physical meaning of these constraints is difficult to justify. More recently, Peterson et al. [42] have proposed a concrete non-extensive energy function that leads to power-law tails for sizes when maximizing the Shannon entropy. The main idea is that the probability is exponential in the energy, but the energy is logarithmic in the size, resulting in an overall power law for sizes. Other authors have found the use of the Shannon entropy inadequate, due to its close connection with exponential distributions, and have generalized the entropy concept itself, yielding nonextensive entropies such as the Havrda-Charvát entropies [43], also called Tsallis entropies [44], and the Hanel-Thurner entropies [45,46].
Here we will follow the distinct approach of Stephens and Bialek [47]. Like Peterson et al. [42], these authors consider maximization of the plain Shannon entropy, but in contrast to them, no functional form is proposed "a priori" for the energy. Instead, the constraints are provided by the empirical two-body marginal distributions. The framework is that of word occurrence in texts, and words are considered as composed of letters that interact in pairs. The interaction potentials are provided in a natural way by the Lagrange multipliers obtained in the maximization of the entropy under the empirical values of the constraints.
Stephens and Bialek [47] only considered four-letter English words and performed a visual comparison with the empirical frequencies of words. We will considerably extend their results by analyzing words of any length from 1 to 6 letters, and will undertake a quantitative statistical analysis of the fulfillment of Zipf's law. We will pay special attention to the values of the interaction potentials. The main conclusion is that two-body (two-letter) pairwise interactions are able to reproduce a power-law regime for the probabilities of words (which is the hallmark of Zipf's law), but with considerable scatter of the concrete values of the probabilities.
In the next section we review the maximum-entropy formalism and its application to pairwise interaction of letters in words, using the useful concept of feature functions. Next, we describe the empirical data we use and the results, including the empirical pairwise marginals (which are the input of the procedure) and the resulting pairwise potentials (which are the output from which the theoretical word distribution is built). The Zipfian character of the theoretical word distribution as well as its correspondence with the empirical distribution is evaluated. In the final section we discuss limitations and extensions of this work.

MAXIMUM ENTROPY AND PAIRWISE INTERACTIONS
"Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge," which leads to a special type of statistical inference. This is the key idea of Jaynes' maximum-entropy principle [39]. The recipe can be summarized as: use that probability distribution which has maximum Shannon entropy subject to whatever is known. In other words, everything should be made as random as possible, but not more [48] [64].
Let us consider words in texts. Labelling each word type by j, with j = 1, 2, . . . , V, and V the size of the vocabulary (the total number of word types), the Shannon entropy is
$$S = -\sum_{j=1}^{V} P_j \ln P_j,$$
where $P_j$ is the probability of occurrence of word type j. Note that, as we use natural logarithms, the entropy is in principle measured in nats rather than in bits. In order to maximize the entropy under a series of constraints one uses the method of Lagrange multipliers, finding the solution of
$$\frac{\partial L}{\partial P_j} = 0 \qquad (1)$$
for all j, with α, β, etc. the Lagrange multipliers associated with constraint 1, constraint 2, etc., and $L = S - \alpha \times (\text{constraint 1}) - \beta \times (\text{constraint 2}) - \ldots$ the Lagrangian function.
One can see that the maximum-entropy method yields intuitive solutions in very simple cases. For example, if no constraints are provided one obtains the equiprobable case, $P_j^{\mu c} = 1/V$ (there is in fact one implicit constraint, normalization; µc stands for microcanonical, in analogy with statistical physics). If there are no other constraints it is clear that one cannot escape this "rudimentary" solution. If, instead, one uses all the empirical values as constraints, one gets out the same thing one puts in, with a solution $P_j^{full} = \rho(j)$, with ρ(j) the empirical probability of occurrence of word j (i.e., the relative frequency of j). So, the full data set is the solution, which is of little practical interest, as this model lacks generalization and does not bring any understanding. More interestingly, when the mean value of the energy is used as a constraint (as happens in thermodynamics for closed systems in thermal equilibrium with a bath), the solution is given by the Boltzmann distribution
$$P_j^{can} = \frac{e^{-\beta E_j}}{Z}, \qquad (2)$$
with the term "can" coming from the analogy with the canonical ensemble and with $Z = \sum_j e^{-\beta E_j}$. Needless to say, we have no idea yet what the energy $E_j$ of a word is.
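As a quick illustration of this recipe (not part of the original derivation, with invented energy levels and mean energy), the following Python sketch maximizes the Shannon entropy of a four-state toy system under a fixed mean energy and checks that the resulting probabilities are exponential in the energy.

import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 2.0, 3.0])        # toy energy levels
mean_E = 1.2                               # imposed mean energy <E>

def neg_entropy(P):
    return float(np.sum(P * np.log(P)))    # minus the Shannon entropy (to minimize)

constraints = (
    {"type": "eq", "fun": lambda P: np.sum(P) - 1.0},        # normalization
    {"type": "eq", "fun": lambda P: np.dot(P, E) - mean_E},   # fixed mean energy
)
res = minimize(neg_entropy, x0=np.full(E.size, 1.0 / E.size),
               bounds=[(1e-12, 1.0)] * E.size, constraints=constraints)
P = res.x

# For a Boltzmann distribution, ln P_j is linear in E_j; minus the slope estimates beta.
beta = -np.polyfit(E, np.log(P), 1)[0]
print(P, beta)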

Feature functions and marginal probabilities
At this point it becomes useful to introduce the feature functions [49]. Given a feature i, the feature function $f_i(j)$ takes the value 1 if word j possesses feature i and 0 otherwise. Considering m features, each one yielding a constraint for its expected value, we have
$$\sum_{j=1}^{V} f_i(j)\, P_j = F_i \qquad (3)$$
for i = 1, 2, . . . , m, with $F_i$ the empirical value of feature i. Note that the $P_j$ (and hence the theoretical expected values of the features) are unknown, whereas the $F_i$ are not. With these m constraints, the method of Lagrange multipliers [Eq. (1)] leads to
$$-\ln P_j - 1 + \sum_{i=1}^{m} \lambda_i f_i(j) = 0,$$
where the $\lambda_i$ are now the Lagrange multipliers (we have in fact inverted their sign with respect to the previous examples, for convenience). The solution is
$$P_j = \exp\Big(-1 + \sum_{i=1}^{m} \lambda_i f_i(j)\Big), \qquad (4)$$
i.e., the exponential of minus one plus the sum of the λ's of the features of word j.
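To make the notation concrete, here is a minimal Python illustration (hypothetical helper functions, not from the paper) of a positional feature function and of the expected value that enters the constraints.

def f(word, k, a, kp=None, b=None):
    """Indicator feature: letter a at position k (and, if given, letter b at position kp)."""
    if kp is None:
        return 1.0 if len(word) > k and word[k] == a else 0.0
    return 1.0 if len(word) > kp and word[k] == a and word[kp] == b else 0.0

def feature_expectation(P, k, a, kp=None, b=None):
    """Expected value of the feature under a word distribution P (dict: word -> probability)."""
    return sum(p * f(w, k, a, kp, b) for w, p in P.items())

# Example: probability that the first letter (position 0) is "c" under a toy distribution.
P = {"cat": 0.5, "car": 0.3, "bat": 0.2}
print(feature_expectation(P, 0, "c"))      # marginal probability 0.8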
In contrast with the previous simplistic models, we are now able to deal with the inner structure of words as composed of letters, i.e., $j = \{\ell_1, \ell_2, \ldots\}$ and $P_j = P(\ell_1 \ell_2 \ldots)$, with $\ell_1$ the letter at the first position of word j and so on. If we consider that the features describe the individual letters of a word, then, for example, for i = 1c we have
$$\sum_{j} f_{1c}(j)\, P_j = \sum_{\ell_2 \ell_3 \ldots} P(c\,\ell_2 \ell_3 \ldots) = P^{I}_1(c) \qquad (5)$$
(using that only words starting with $\ell_1 = c$ contribute to the sum); in words, we obtain that the expected value of the feature 1c is the marginal probability $P^{I}_1(c)$ that the first letter in a word is c, which we make equal to its empirical value $\rho_1(c)$ (which is just the number of tokens with letter c in position 1 divided by the total number of tokens). Notice that we do not impose a normalization constraint for the $P_j$'s, as this is implicit in the marginals. Coming back to the expression for the probabilities, Eq. (4), we have, for a three-letter example,
$$P^{I}(\ell_1 \ell_2 \ell_3) = e^{-1 + \lambda_{1\ell_1} + \lambda_{2\ell_2} + \lambda_{3\ell_3}},$$
the label I standing for the fact that the solution is obtained from the constraints of one-letter marginals. Substituting this into the constraint, Eq. (5), we arrive at solutions of the form $e^{\lambda_{1c} - 1/3} = \rho_1(c)$ and so
$$P^{I}(\ell_1 \ell_2 \ell_3) = \rho_1(\ell_1)\,\rho_2(\ell_2)\,\rho_3(\ell_3)$$
(note that other solutions for the $\lambda_{1c}$'s are possible, but they lead to the same $P^{I}$'s; in particular, the origin of each potential is not fixed and one could replace, for instance, $\lambda_{1\ell_1} \to \lambda_{1\ell_1} + C_1$ for all $\ell_1$, provided that the other potentials are modified accordingly to yield the same value of the sum). This model based on univariate (single-letter) marginals is very simple indeed, and closely related to monkey-typing models [21,22], as we obtain that each word is an independent combination of letters, with each letter having its own probability of occurrence (depending, though, on its position in the word).
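A small Python sketch of this one-letter-marginal model (with toy token counts, not the actual corpus): each word probability $P^I$ is the product of the positional letter frequencies, so unseen combinations of seen letters still receive nonzero probability.

from collections import Counter

token_counts = {"cat": 50, "car": 30, "bat": 20}     # hypothetical corpus
total = sum(token_counts.values())
K = 3

# empirical marginals rho_k(letter): relative frequency of each letter at position k
rho = [Counter() for _ in range(K)]
for word, n in token_counts.items():
    for k, letter in enumerate(word):
        rho[k][letter] += n / total

def P_I(word):
    """Maximum-entropy probability using one-letter marginals only (independent letters)."""
    p = 1.0
    for k, letter in enumerate(word):
        p *= rho[k].get(letter, 0.0)
    return p

print(P_I("cat"), P_I("bar"))    # "bar" is unseen in the toy corpus but gets nonzero probability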
The next, more informative model arises when the features describe pairs of letters. For instance, for the feature i = 12ca (letter c at position 1 and letter a at position 2), the constraint becomes
$$\sum_{j} f_{12ca}(j)\, P_j = \sum_{\ell_3 \ell_4 \ldots} P(c\,a\,\ell_3 \ell_4 \ldots) = \rho_{12}(ca), \qquad (6)$$
with $\rho_{12}(ca)$ the two-letter marginal, provided by the empirical data,
$$\rho_{12}(ca) = \frac{\text{number of tokens with } c \text{ in position 1 and } a \text{ in position 2}}{\text{total number of tokens}}.$$
The solution (4), restricted to the particular example of a three-letter word, can be written as
$$P^{II}(\ell_1 \ell_2 \ell_3) = e^{-1 + \lambda_{12}(\ell_1\ell_2) + \lambda_{13}(\ell_1\ell_3) + \lambda_{23}(\ell_2\ell_3)}, \qquad (7)$$
using the notation $\lambda_{12ca} = \lambda_{12}(ca)$ for the multipliers, and the label II denoting that we are dealing with theoretical probabilities arising from two-letter features, i.e., two-letter marginals. The same result is written, in general, as
$$P^{II}_j = P^{II}(\ell_1 \ell_2 \ldots \ell_K) = \exp\Big(-1 + \sum_{k=1}^{K-1}\sum_{k'=k+1}^{K} \lambda_{kk'}(\ell_k \ell_{k'})\Big), \qquad (8)$$
with K the word length (in number of letters). Comparing with the Boltzmann distribution, Eq. (2), we can identify the Lagrange multiplier for each feature with the pairwise interaction potential between the letters defining the feature (with a minus sign, and with a shift of one unit); for example, $-\beta E(cat) = \lambda_{12}(ca) + \lambda_{13}(ct) + \lambda_{23}(at) - 1$, and in general,
$$-\beta E_j = -1 + \sum_{k<k'} \lambda_{kk'}(\ell_k \ell_{k'}).$$
Therefore, words can be seen as networks of interacting letters (with all-to-all interaction between pairs, and where the positions of the two letters matter for the interaction). Note that three-letter interactions, common in English orthographic rules, are not captured by the pairwise interactions; for example, in positions 3 to 5: believe (rule) versus deceive (exception, due to the letter c). Remarkably, this pairwise approach has also been used for neuronal, biochemical, and genetic networks [47]. A very simplified case of this letter system turns out to be equivalent to an Ising model (or, more properly, a spin-glass model): just consider an alphabet of two letters (a and b) and impose the symmetries (not present in linguistic data, in general) $\lambda_{kk'}(ab) = \lambda_{kk'}(ba)$ and $\lambda_{kk'}(aa) = \lambda_{kk'}(bb)$ (if one wants to get rid of these symmetries in the Ising system one can consider external "magnetic" fields, associated with the one-letter marginals).
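The following Python fragment (with made-up multiplier values, not fitted ones) shows how a word probability $P^{II}$ is assembled from the pairwise multipliers, and how a missing pair acts as an infinite potential.

import itertools, math

# lam[(k, kp)][(a, b)]: Lagrange multiplier for letters a, b at positions k < kp (toy values)
lam = {
    (0, 1): {("c", "a"): -0.5},
    (0, 2): {("c", "t"): -0.8},
    (1, 2): {("a", "t"): -0.2},
}

def P_II(word):
    """exp(-1 + sum of pairwise multipliers); a missing pair means lambda = -infinity."""
    s = -1.0
    for k, kp in itertools.combinations(range(len(word)), 2):
        pair = (word[k], word[kp])
        if pair not in lam.get((k, kp), {}):
            return 0.0                      # forbidden pair: interaction potential = +infinity
        s += lam[(k, kp)][pair]
    return math.exp(s)

print(P_II("cat"), P_II("cab"))             # "cab" contains pairs with no multiplier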
Substituting the solution (4) or (7) into the constraints (6), the equations we need to solve would be like
$$\sum_{\ell_3} e^{-1 + \lambda_{12}(\ell_1\ell_2) + \lambda_{13}(\ell_1\ell_3) + \lambda_{23}(\ell_2\ell_3)} = \rho_{12}(\ell_1\ell_2)$$
(together with the analogous equations for $\rho_{13}$ and $\rho_{23}$) if we restricted ourselves to three-letter words.
Due to computational limitations, we will only treat words comprising from 1 to 6 letters.
As the numerical algorithm we will use requires that the number of letters is constant (see the Appendix), we will consider that words shorter than six letters are six-letter words whose remaining positions are occupied by blank characters. In this general case the constraint equations take the form
$$\sum_{j:\ \ell_k = a,\ \ell_{k'} = b} \exp\Big(-1 + \sum_{q<q'} \lambda_{qq'}(\ell_q \ell_{q'})\Big) = \rho_{kk'}(ab), \qquad (9)$$
where the sum runs over all words with letter a at position k and letter b at position k'; the solution is not straightforward anymore and has to be found numerically. So, we deal with a constrained optimization problem, for which the Appendix provides complete information. Here we just mention that the improved iterative-scaling method consists of the successive application of transformations $\lambda_i \to \lambda_i + \delta_i$, see Eq. (13) in the Appendix. Note that, as in the case of univariate marginals, the potentials are undetermined up to a shift, i.e., $\lambda_{12}(\ell_1 \ell_2) \to \lambda_{12}(\ell_1 \ell_2) + C_{12}$, as long as the other potentials are correspondingly shifted to give the same value for the sum.
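A compact Python sketch of the improved iterative-scaling loop on a toy three-letter problem (invented token list, not the authors' implementation); the update follows the rule discussed in the Appendix (Eq. (13)), and candidate words are restricted to those whose letter pairs all have nonzero empirical marginals.

import itertools, math
from collections import defaultdict

K = 3
pairs = list(itertools.combinations(range(K), 2))     # position pairs (0,1), (0,2), (1,2)
n = len(pairs)                                        # number of features per word

tokens = ["abc", "abc", "aba", "cab", "cab", "bca"]   # toy corpus
F = defaultdict(float)                                # empirical pair marginals
for w in tokens:
    for k, kp in pairs:
        F[(k, kp, w[k], w[kp])] += 1.0 / len(tokens)

# candidate words: every letter pair must have a nonzero empirical marginal (see text)
alphabet = sorted(set("".join(tokens)))
candidates = ["".join(t) for t in itertools.product(alphabet, repeat=K)
              if all((k, kp, t[k], t[kp]) in F for k, kp in pairs)]

lam = defaultdict(float)                              # multipliers, initialized to zero

def P(word):
    return math.exp(-1.0 + sum(lam[(k, kp, word[k], word[kp])] for k, kp in pairs))

for _ in range(500):                                  # iterative-scaling sweeps
    model = defaultdict(float)                        # pair marginals of the current model
    for w in candidates:
        p = P(w)
        for k, kp in pairs:
            model[(k, kp, w[k], w[kp])] += p
    for key, target in F.items():
        lam[key] += math.log(target / model[key]) / n # increment delta_i of Eq. (13)

print(sorted(((round(P(w), 3), w) for w in candidates), reverse=True))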

Data
As a corpus, we use all the English books in the recently presented Standardized Project Gutenberg Corpus [50]. This comprises more than 40,000 books in English, with a total number of tokens of 2,016,391,406 and a vocabulary size V = 2,268,043. The entropy of the corresponding word-probability distribution is S = 10.27 bits. In order to avoid spurious words (misspellings, etc.), and also due to computational limitations, we disregard word types with absolute frequency smaller than 10,000. Also, word types (unigrams) containing characters other than the plain 26 letters from a to z are disregarded (note that we do not distinguish between capital and lower-case letters). Finally, we also remove Roman numerals (these are not words for our purposes, as they are not formed by interaction between letters).
This reduces the number of tokens to 1,881,679,476 and V to 11,042, and so the entropy becomes S = 9.45 bits. Finally, the subset of words with length smaller than or equal to 6 yields 1,597,358,419 tokens, V = 5,081, and S = 8.35 bits. We will see that these sub-corpora fulfill Zipf's law, but each one with a slightly different power-law exponent.
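A rough Python sketch of the filtering steps just described (toy counts; the Roman-numeral check below is a naive stand-in, since the paper does not specify how these were identified, and it would also flag some ordinary words such as "mix").

import re
import numpy as np

counts = {"the": 100000, "xvii": 20000, "caf\u00e9": 15000, "cat": 12000, "zyx": 50}  # toy data

def keep(word, n, min_count=10000):
    if n < min_count:                               # drop rare types
        return False
    if not re.fullmatch(r"[a-z]+", word):           # plain a-z only (no accents, digits, ...)
        return False
    if re.fullmatch(r"[ivxlcdm]+", word):           # naive Roman-numeral filter (over-removes)
        return False
    return True

filtered = {w: n for w, n in counts.items() if keep(w, n)}

p = np.array(list(filtered.values()), dtype=float)
p /= p.sum()
entropy_bits = float(-(p * np.log2(p)).sum())
print(filtered, entropy_bits)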
Marginal distributions

Figure 1 displays the empirical two-letter marginal probabilities (obtained from the 6-or-fewer-letter sub-corpus just described), which constitute the target of the optimization procedure. There are a total of 5,092 non-zero values of the marginals. Notice that, although the two-letter marginals are bivariate probabilities [47], Zipf's representation allows one to display them as univariate. This is achieved by defining a rank variable, assigning rank r = 1 to the type with the highest empirical frequency ρ (i.e., the most common type), r = 2 to the second most common type, and so on (Fig. 1(left)). This is called the rank-frequency representation (or, sometimes, distribution of ranks). Then, Zipf's law can be formulated as a power-law relation between ρ and r,
$$\rho \propto \frac{1}{r^{1/\beta}}, \qquad (10)$$
for some range of ranks (typically the lowest ones, i.e., the highest frequencies), with the exponent $\beta^{-1}$ taking values close to one (the symbol ∝ denotes proportionality). When we calculate and report entropies we always use the rank-frequency representation.
An approximate alternative representation (also used by Zipf) considers the empirical frequency ρ as a random variable, whose distribution is computed. In terms of the complementary cumulative distribution, G(ρ), Zipf's law can be written as
$$G(\rho) \propto \frac{1}{\rho^{\beta}}, \qquad (11)$$
which in terms of the probability density or probability mass function of ρ leads to
$$g(\rho) \propto \frac{1}{\rho^{\beta+1}}, \qquad (12)$$
asymptotically, for large ρ (Fig. 1(right)). Both G(ρ) and g(ρ) constitute a representation in terms of the distribution of frequencies. For more subtle arguments relating ρ(r), G(ρ), and g(ρ), see Refs. [17,51].
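The two representations can be related with a few lines of Python (toy frequency values, just to fix ideas): sorting the frequencies gives the rank-frequency curve, while G(ρ) counts the fraction of types with frequency at least ρ.

import numpy as np

freqs = np.array([0.10, 0.05, 0.05, 0.02, 0.01, 0.01, 0.005])   # toy relative frequencies

# Zipf's rank-frequency representation: frequencies sorted in decreasing order
rho_of_r = np.sort(freqs)[::-1]
ranks = np.arange(1, freqs.size + 1)

def G(rho):
    """Complementary cumulative distribution: fraction of types with frequency >= rho."""
    return float(np.mean(freqs >= rho))

print(list(zip(ranks, rho_of_r)), G(0.02))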
We can test the applicability of Zipf's law to our two-letter marginals, in order to evaluate how surprising or unsurprising the emergence of Zipf's law in the word distribution from them is. Remember that, in the case of the marginal distributions, types are pairs of letters. Figure 1(left) shows that, although the number of data points in the marginals is relatively low (a few hundred, as shown in Table I, with a theoretical maximum equal to 26^2 = 676), the marginal frequencies appear broadly distributed, varying over 4 orders of magnitude. Although the double-logarithmic plots do not correspond to straight lines, the high-frequency (low-rank) part of each distribution can be fitted to a power law, for a number of orders of magnitude ranging from 0.5 to 2 and an exponent β typically between 1 and 2, as can be seen in Table I. Thus, the two-letter marginal distributions display a certain Zipfian character (at least considering words of length not larger than 6, in letters), with a short power-law range, in general, and with a somewhat large value of β (remember that β has to be close to one for the fulfilment of Zipf's law).
Remarkably, Fig. 1(right) also shows that all the marginal distributions present roughly the same characteristic shape, with the only difference being the scale parameter of the frequency distribution, which is determined by the mean frequency $\langle\rho_{kk'}\rangle$ (the marginals being denoted generically in the figure as $\rho_{emp}$). This means, as shown in the figure, that the distribution $g(\rho_{emp})$, when multiplied (rescaled) by $\langle\rho_{emp}\rangle$, can be considered, approximately, a function that depends only on the rescaled frequency $\rho_{emp}/\langle\rho_{emp}\rangle$, independently of which marginal $\rho_{kk'}$ one is considering. In terms of the distribution of ranks, this scaling property translates into the fact that $\rho_{emp}/\langle\rho_{emp}\rangle$ can be considered a function of r/V only.
For the fitting we have used the method proposed in Refs. [52,53], based on maximum-likelihood estimation and Kolmogorov-Smirnov goodness-of-fit testing. This method avoids the problems present in the popular recipe of Clauset et al. [3,54,55]. The fitting method is applied to ρ as a random variable (rather than to r [16]); this choice presents a number of important advantages, as discussed in Ref. [56]. The outcome of the method is an estimated value of the exponent β together with a value of ρ, denoted by a, from which the power-law fit, Eqs. (11) and (12), is non-rejectable (with a p-value larger than 0.20, by prescription).
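As an orientation (a simplified stand-in, not the implementation of Refs. [52,53]), a continuous power-law fit by maximum likelihood with a Monte Carlo Kolmogorov-Smirnov test can be sketched in Python as follows.

import numpy as np

def fit_beta(x, a):
    """MLE of beta for a continuous power law g(x) ~ 1/x^(beta+1), restricted to x >= a."""
    x = x[x >= a]
    return x.size / np.log(x / a).sum(), x

def ks_distance(x, a, beta):
    x = np.sort(x)
    cdf_model = 1.0 - (a / x) ** beta
    cdf_emp = np.arange(1, x.size + 1) / x.size
    return np.max(np.abs(cdf_emp - cdf_model))

def powerlaw_pvalue(x, a, n_sim=200, seed=0):
    rng = np.random.default_rng(seed)
    beta, x = fit_beta(x, a)
    d_obs = ks_distance(x, a, beta)
    d_sim = []
    for _ in range(n_sim):
        sim = a * (1.0 - rng.random(x.size)) ** (-1.0 / beta)   # inverse-CDF sampling
        b, sim = fit_beta(sim, a)
        d_sim.append(ks_distance(sim, a, b))
    return beta, float(np.mean(np.array(d_sim) >= d_obs))

# usage (hypothetical variable name): beta, p = powerlaw_pvalue(marginal_frequencies, a=1e-4);
# the fit is considered non-rejectable if p > 0.20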
Although other distributions different from the power law can be fitted to the marginal data (e.g., the lognormal [53]), our purpose is not to find the best-fitting distribution, but just to evaluate how much Zipf's power law depends on a possible Zipfian behavior of the marginals.

Word distributions

Figure 2 shows that the optimization succeeds in obtaining values of the theoretical marginal distributions very close to the empirical ones. However, despite the fact that the target of the optimization is the marginal distributions (whose empirical values are the input of the procedure), we are interested in the distribution of words, whose empirical values are known but do not enter into the procedure, as this is the quantity we seek to "explain". Zipf's rank-frequency representation allows us to display in one dimension the six-dimensional nature (from our point of view) of the word frequencies; for the empirical word frequencies this is shown in Fig. 3. We find that the distribution is better fitted in terms of an upper truncated power law [52,58], given, as in Eq. (12), by $g(\rho) \propto 1/\rho^{\beta+1}$ but in a finite range $a \le \rho \le b < \infty$ (the untruncated case would be recovered by taking $b \to \infty$). This corresponds, in the continuous case, to a cumulative distribution $G(\rho) \propto 1/\rho^{\beta} - 1/b^{\beta}$ and to a rank-frequency relation
$$\rho \propto \frac{1}{\left(r + V/b^{\beta}\right)^{1/\beta}},$$
which coincides in its mathematical expression with the so-called Zipf-Mandelbrot distribution (although the continuous fit makes r a continuous variable; remember that V is the number of types). The fitting procedure is essentially the same as the one for the untruncated power law outlined in the previous subsection, with the maximization of the likelihood a bit more involved [52,53]. In Fig. 3 we also display the theoretical result, $P^{II}$, Eq. (8), arising from the solution of Eq. (9). We see that, qualitatively, $P^{II}$ has a shape rather similar to the empirical one.
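The step from the truncated cumulative distribution above to this rank-frequency form can be sketched as follows, using that the rank essentially counts the number of types with frequency at least ρ, i.e., $r \approx V G(\rho)$, and absorbing constants into the proportionalities:
$$r \approx V G(\rho) \propto V\left(\frac{1}{\rho^{\beta}} - \frac{1}{b^{\beta}}\right) \;\Longrightarrow\; \frac{1}{\rho^{\beta}} \propto \frac{r}{V} + \frac{1}{b^{\beta}} \;\Longrightarrow\; \rho \propto \frac{1}{\left(r + V/b^{\beta}\right)^{1/\beta}}.$$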

Both distributions fulfill Zipf's law, with exponents β equal to 0.89 and 0.81, respectively.
We also see in the figure that the quantitative agreement between the values of the probability ($P^{II}$ and $\rho_{word}$) is rather good for the smallest values of the rank (r < 10); however, both curves start to depart slightly from each other for r > 10. In addition, the rank values are associated with the same word types for r ≤ 6 (the, of, and, to, a, in), but for larger ranks the correspondence may differ (r = 7 corresponds to i in one case and to that in the other). If we could represent $\rho_{word}$ and $P^{II}$ in six dimensions (instead of as functions of the rank) we would see the differences between the two more clearly.
Zipf's law is, in part, the reason for this problem, as for r ≥ 10 the difference in probabilities between consecutive ranks becomes smaller than 10%, see Eq. (10), and for r ≥ 100 the difference decreases to less than 1% (assuming β ≃ 1). So, finite resolution in the calculation of $P^{II}$ will lead to the "mixing of the ranks." However, the main part of the problem comes from the inability of the algorithm, in some cases, to yield values of $P^{II}$ close to the empirical value $\rho_{word}$, as can be seen in the scatter plot of Fig.
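To make the rank-mixing argument explicit, assume the pure Zipf form $\rho(r) \propto 1/r$ (i.e., $\beta \simeq 1$ in Eq. (10)); then the relative gap between consecutive ranks is
$$\frac{\rho(r) - \rho(r+1)}{\rho(r)} = 1 - \frac{r}{r+1} = \frac{1}{r+1},$$
which is about 9% at r = 10 and about 1% at r = 100, so even small numerical errors in $P^{II}$ are enough to reshuffle neighboring ranks.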

Values of Lagrange multipliers and potentials
We have established that, for a given word, the value of its occurrence probability $P^{II}$ comes from the exponentiation of the sum of the 15 interaction potentials between the 6 letter positions that constitute the word (in our maximum-entropy approach). So, the values of the potentials (or the values of the Lagrange multipliers) determine the value of the probability $P^{II}$. It is interesting to investigate, given a potential or a multiplier (for instance $\lambda_{12}$), how the different values it takes ($\lambda_{12}(aa)$, $\lambda_{12}(ab)$, etc.) are distributed. Curiously, we find that the 15 different potentials are (more or less) equally distributed, i.e., they follow the same skewed and spiky distribution, as shown in Fig. 5(left).
One can try to use this fact to shed some light on the origin of Zipf's law. Indeed, exponentiation is a mechanism of power-law generation [37,54]. We may argue that the sum of 15 random numbers drawn from the same spiky distribution has to approach, by the central limit theorem, a normal distribution, and therefore the exponentiation of the sum would yield a lognormal distribution for $P^{II}$ (i.e., a lognormal shape for $g(P^{II})$). However, this may be true for the central part of the distribution, but not for its rightmost extreme values, which are the part of the distribution we are most interested in (high values of $P^{II}$, i.e., the most common words). Note also that, in practice, for calculating the probability of a word, we are not summing 15 equally distributed independent random numbers, as not all words are possible; i.e., there are potentials that take a value equal to infinity, due to forbidden combinations, and these infinite values are not taken into account in the distribution of the potentials. An additional problem with this approach is that, although most values of the potentials converge to a fixed value (and the distribution of potentials shown in the figure is stable), there are single values that do not converge, related to words with very low probability. These issues need to be further investigated in future research. In addition, Fig. 5(right) shows, as a scatter plot, the dependence between the value of each potential and the corresponding two-letter marginal probability. Although Eq. (9) seems to indicate a rough proportionality between the two, the figure shows that such proportionality does not hold (naturally, the rest of the terms in the equation play their role).
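A quick numerical illustration of this central-limit argument (with a toy skewed distribution standing in for the potential values, not the fitted ones): the exponential of a sum of 15 such variables is close to lognormal in its bulk, while the tails can still deviate.

import numpy as np

rng = np.random.default_rng(1)
# skewed, spiky stand-in for the distribution of a single potential's values
samples = rng.gamma(shape=1.5, scale=1.0, size=(100_000, 15)) - 2.0
log_P = -1.0 + samples.sum(axis=1)          # -1 + sum of the 15 "multipliers"

# if P = exp(log_P) were exactly lognormal, log_P would be Gaussian;
# the residual skewness of log_P measures the departure, concentrated in the tails
z = (log_P - log_P.mean()) / log_P.std()
print(log_P.mean(), log_P.std(), float((z ** 3).mean()))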

DISCUSSION
We have generalized a previous study by Stephens and Bialek [47]. Instead of restricting our study to four-letter words, we consider words of any length from one to six letters, which leads to greater computational difficulties, and we employ a much larger English corpus as well. We perform an analysis of the fulfilment of Zipf's law using state-of-the-art statistical tools. Our more general results are nevertheless in line with those of Ref. [47]. We see how the frequency of occurrence of pairs of letters in words (the pairwise marginal distributions), together with the maximum-entropy principle (which provides the distribution with the maximum possible randomness), constrains the probabilities of word occurrences in English.
Regarding the shape of the distributions, the agreement between the maximum-entropy solution for the word distribution and its empirical counterpart is very good at the qualitative level, and reasonably good at the quantitative level for the most common words, as shown in the figures discussed above. A possible modification of the model would be to impose that the potentials depend only on the distance between letter positions and not on the positions themselves; this leads to only one potential (in the case of nearest-neighbor interaction; 5 potentials in the all-to-all case). It would be interesting to see how these modifications compare with the original model; this is left for future research. An extension in a different direction would be to use phonemes or syllables instead of letters as the constituents of words; we urge the authors of the corpus in Ref. [50] to provide the decomposition of the words of the corpus into these parts. Remarkably, the approach presented here and in Ref. [47] has also been applied to music [63].
ACKNOWLEDGMENTS
her especially for drawing our attention to Refs. [49,59]. Some preliminary results of this research were presented at the workshop "Statistics of Languages" (Warsaw, July 2017); we are grateful to the organizers. This work was largely completed at l'Abadia de Burg, Pallars Sobirà, Lleida.

APPENDIX
We summarize here the main formulas in Ref. [59] for the improved iterative-scaling method. The per-datum log-likelihood $\mathcal{L}(\vec{\lambda})$ of the model $P_j(\vec{\lambda})$ (stressing the dependence on the value of the set of parameters $\vec{\lambda}$) is given by
$$\mathcal{L}(\vec{\lambda}) = \sum_{j=1}^{V} \rho(j) \ln P_j(\vec{\lambda}),$$
with $\vec{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_m)$ and ρ(j) the empirical probability for word type j. Substituting the maximum-entropy solution for the theoretical probability, Eq. (4), written as $P_j = e^{-1 + \sum_i \lambda_i f_i(j)}$, one finds that the variation of $\mathcal{L}$ with respect to the parameters vanishes exactly when the constraints, Eq. (3), are satisfied. This indicates that the parameters $\vec{\lambda}$ that fulfill the constraints also maximize the log-likelihood, and vice versa, and therefore the maximum-entropy parameters can be obtained from maximum likelihood.
It can be shown that, for a change $\vec{\delta}$ in the values of the parameters, the increase in log-likelihood fulfils
$$\mathcal{L}(\vec{\lambda}+\vec{\delta}) - \mathcal{L}(\vec{\lambda}) \;\geq\; \sum_i \delta_i F_i + 1 - \sum_j P_j(\vec{\lambda}) \sum_i \frac{f_i(j)}{n(j)}\, e^{\delta_i\, n(j)},$$
with $n(j) = \sum_i f_i(j)$ = number of features of word j. Now one should look for the values of $\vec{\delta}$ that maximize the lower bound (right-hand side of the previous inequality). Curiously, Ref. [59] does not provide the final solution, but it is given in Ref. [49] instead. Maximizing, one gets
$$\delta_i = \frac{1}{n} \ln \frac{F_i}{\sum_j P_j(\vec{\lambda})\, f_i(j)}, \qquad (13)$$
using that n(j) = constant = n if the word length is constant (6 in our case, considering that blanks complete shorter words). The improved iterative-scaling algorithm is just: initialize the $\lambda_i$ (for instance, to zero); compute the theoretical probabilities $P_j(\vec{\lambda})$; obtain the increments $\delta_i$ from Eq. (13); update $\lambda_i \to \lambda_i + \delta_i$; and repeat from the second step until convergence. As the equation to solve, Eq. (9), is a sum of exponentials, when a marginal value is not present in the empirical data, i.e., when the right-hand side of Eq. (9) is zero, the left-hand side of the equation cannot fulfil the equality unless some Lagrange multiplier is minus infinity, which is a value that the numerical algorithm cannot reach. We therefore take from the beginning the corresponding multiplier to be equal to minus infinity (i.e., an interaction potential equal to infinity). To be concrete, if for example $\rho_{12}(zz) = 0$, we take $\lambda_{12}(zz) = -\infty$, which leads to $P_j = 0$ for any $j = \{z\,z\,\ell_3\,\ell_4 \ldots\}$. This means that we can restrict our analysis of possible words to those with all pairs of letters corresponding to non-null empirical marginals, because the rest of the words have zero probability.
TABLE II: Most common theoretical words from the maximum-entropy procedure that are not present in the analyzed sub-corpus. In fact, some of these words are present in the original complete corpus (maybe as misspellings), but not in our sub-corpus (as we have disregarded frequencies smaller than 10,000); r is the (theoretical) rank and $P^{II}$ the (theoretical) probability.