Estimating the Mutual Information between Two Discrete, Asymmetric Variables with Limited Samples

Determining the strength of nonlinear, statistical dependencies between two variables is a crucial matter in many research fields. The established measure for quantifying such relations is the mutual information. However, estimating mutual information from limited samples is a challenging task. Since the mutual information is the difference of two entropies, the existing Bayesian estimators of entropy may be used to estimate information. This procedure, however, is still biased in the severely under-sampled regime. Here, we propose an alternative estimator that is applicable to those cases in which the marginal distribution of one of the two variables—the one with minimal entropy—is well sampled. The other variable, as well as the joint and conditional distributions, can be severely undersampled. We obtain a consistent estimator that presents very low bias, outperforming previous methods even when the sampled data contain few coincidences. As with other Bayesian estimators, our proposal focuses on the strength of the interaction between the two variables, without seeking to model the specific way in which they are related. A distinctive property of our method is that the main data statistics determining the amount of mutual information is the inhomogeneity of the conditional distribution of the low-entropy variable in those states in which the large-entropy variable registers coincidences.


Introduction
Inferring the statistical dependencies between two variables from a few measured samples is a ubiquitous task in many areas of study. Variables are often linked through nonlinear, relations, which contain stochastic components. The standard measure employed to quantify the amount of dependency is the mutual information, defined as the reduction in entropy of one of the variables when conditioning the other variable [1,2]. If the states of the joint distribution are well-sampled, the joint probabilities can be estimated by the observed frequencies. Replacing such estimates in the formula for the mutual information yields the so-called "plug-in" estimator of mutual information. However, unless all the states of the joint distribution are sampled extensively, this procedure typically over-estimates the mutual information [3][4][5]. In fact, when the number of samples is of the order of the effective number of joint states, even independent variables tend to appear as correlated.
The search for an estimator of mutual information that remains approximately unbiased even with small data samples is an open field of research [6][7][8][9][10][11]. Here, we focus on discrete variables, and assume it is not possible to overcome the scarceness of samples by grouping states that are close according to some metric. In addition to corrections that only work in the limit of large samples [12], the state of the art for this problem corresponds to quasi-Bayesian methods that estimate mutual information indirectly through measures of the entropies of the involved variables [8,13,14]. These approaches have the drawback of not being strictly Bayesian, since the linear combination of two or more Bayesian estimates of entropies does not, in general, yield a Bayesian estimator of the combination of entropies [8]. The concern is not so much to remain within theoretical Bayesian purity, but rather to avoid frameworks that may be unnecessarily biased, or where negative estimates of information may arise.
Here, we propose a new method for estimating mutual information that is valid in the specific case in which there is an asymmetry between the two variables: One variable has a large number of effective states, and the other only a few. Examples of asymmetric problems can be found for instance in neuroscience, when assessing the statistical dependence between the activity of a population of neurons and a few stereotyped behaviors, as "lever to the left" or "lever to the right." If the neural activity is represented as a collection of binary strings, with zeroes and ones tagging the absence or presence of spikes in small time bins, the set of possible responses is huge, and typically remains severely undersampled in electrophysiological experiments. If the behavioral paradigm is formulated in terms of a binary forced choice, the behavioral response only has two possible states.
In our approach, no hypotheses are made about the probability distribution of the large-entropy variable, but the marginal distribution of the low-entropy variable is assumed to be well sampled. The prior is chosen so as to accurately represent the amount of dispersion of the conditional distribution of the low-entropy variable around its marginal distribution. This prior is motivated in the framework of Bayesian estimation of information, and then tested with selected examples in which the hypotheses implied in the prior are fulfilled in varying degree, from high to low. The examples show that our estimator has very low bias, even in the severely under-sampled regime, where there are few coincidences that is, when any given state of the large-entropy variable has a low probability of being sampled more than once. The key data statistics that determine the estimated information is the inhomogeneity of the distribution of the low-entropy variable in those states of the high-entropy variable where two or more samples are observed. In addition to providing a practical algorithm to estimate mutual information, our approach sheds light on the way in which just a few samples reveal those specific properties of the underlying joint probability distribution that determine the amount of mutual information.

Bayesian Approaches to the Estimation of Entropies
We seek a low-bias estimate of the mutual information between two discrete variables. Let X be a random variable with a large number k x of effective states {x 1 , . . . , x k x } with probabilities q x , and Y be a variable that varies in a small set y ∈ {y 1 , . . . , y k y }, with k y k x . Given the conditional probabilities q y|x , the marginal and joint probabilities are q y = ∑ x q x q y|x and q xy = q x q y|x , respectively. The entropy H(Y) is and can be interpreted as the average number of well-chosen yes/no questions required to guess the sampled value of Y (when using a logarithm of base two). Since the entropy is a function of the probabilities, and not of the actual values taken by the random variable, here we follow the usual notation in which the expressions H(Y) and H(q) = H(q y 1 , . . . , q y ky ) are taken to represent the same concept, defined in Equation (1). The conditional entropy H(Y|X) is the average uncertainty of the variable Y once X is known, The mutual information is the reduction in uncertainty of one variable once we know the other [2] I(X, Our aim is to estimate I(X, Y) when Y is well sampled, but X is severely undersampled, in particular, when the sampled data contain few coincidences in X. Hence, for most values x, the number of samples n x is too small to estimate the conditional probability q y|x from the frequencies n xy /n x . The plug-in estimators of entropy, conditional entropy and mutual information are defined as the ones in which all the probabilities appearing in Equations (1)-(3) are estimated naïvely by the frequencies obtained in a given sample. In fact, when n x ∼ O(1), the plug-in estimator typically underestimates H(Y|x) severely [5], and often leads to an overestimation of I(X, Y).
One possibility is to estimate H(X), H(Y) and H(X, Y) using a Bayesian estimator, and then insert the obtained values in Equation (3) to estimate the mutual information. We now discuss previous approaches to Bayesian estimators for entropy, to later analyze the case of information. For definiteness, we focus on H(X), but the same logic applies to H(Y), or H(X, Y).
We seek the Bayesian estimator of the entropy conditional to having sampled the state x i a number n i of times. We denote the collection of sampled data as a vector n = (n 1 , . . . , n k x ), and the total number of samples N = ∑ i n i . The Bayesian estimator of the entropy is a function that maps each vector n on a non-negative real numberĤ(n), in such a way that the posterior expected error is minimized [15,16]. In the Bayesian framework, the sampled data n are known, whereas the underlying distribution q originating the data is unknown, and must be inferred. In this framework, the estimator of the entropy is the functionĤ(n) for which is minimized. The Bayesian nature of this framework is embodied in the fact that the integral weighs all candidate distributions q that could have originated the data with the posterior distribution p(q|n), which can be related to the multinomial distribution p(n|q) by means of Bayes rule. The estimator H(n) can be found by taking the derivative of Equation (4) with respect toĤ and equating to zero. The result is [17] where the first equation results from the minimization procedure, and the second, from using Bayes rule. As a result, the Bayesian estimator is the expected value of H(q|n).
Since p(n|q) is the multinomial distribution and since the normalization constant p(n) can be calculated from the integral the entire gist of the Bayesian approach is to find an adequate prior p(q) to plug into Equations (5) and (7). For the sake of analytical tractability, p(q) is often decomposed into a weighted combination of distributions p(q|β) that can be easily integrated, each tagged by one or a few parameters, here generically called β that vary within a certain domain, The decomposition requires to introduce a prior p(β). Hence, the former search for an adequate prior p(q) is now replaced by the search for an adequate prior p(β). The replacement implies an assumption and also a simplification. The family of priors that can be generated by Equation (8) does not necessarily encompass the entire space of possible priors. The decomposition relies on the assumption that the remaining family is still rich enough to make good inference about the quantity of interest, in this case, the entropy. The simplification stems from the fact that the search for p(β) is more restricted than the search for p(q) because the space of alternatives is smaller (the dimensionality of q is typically high, whereas the one of β is low). Two popular proposals of Bayesian estimators for entropies are Nemenman-Shafee-Bialek (NSB) [13] and Pitman-Yor Mixture (PYM) [14]. In NSB, the functions p(q|β) are Dirichlet distributions, in which β takes the role of a concentration parameter. In PYM, these functions are Pitman-Yor processes, and β stands for two parameters: one accounting for the concentration, and the other for the so-called discount. In both cases, the Bayesian machinery implies where W(β|n) is the weight of each β in the estimation of the expected entropy When choosing the family of functions p(q|β), it is convenient to select them in such a way that the weight W(β|n) can be solved analytically. However, this is not the only requirement. In order to calculate the integral in β, the prior p(β) also plays a role. The decomposition of Equation (8) becomes most useful when the arbitrariness in the choice of p(β) is less serious than the arbitrariness in the choice of p(q). This assumption is justified when W(β|n) is peaked around a specific β-value, so that, in practice, the shape of p(β) hardly has an effect. In these cases, a narrow range of relevant β-values is selected by the sampled data, and all assumptions about the prior probability outside this range play a minor role. For the choices of the families p(q|β) proposed by NSB and PYM, W(β|n) can be calculated analytically, and one can verify that, indeed, a few coincidences in the data suffice for a peak to develop. In both cases, the selected β is one for which p(q|β) favours a range of q values that are compatible with the measured data (as assessed by p(n|q)), and also produce non-negligible entropies (Equation (10)).
When the chosen Bayesian estimates of the entropies are plugged into Equation (3) to obtain an estimate of the information, each term is dominated by its own preferred β. Since the different entropies are estimated independently, the β values selected by the data to dominate the priors p(q x ) and p(q y ) need not be compatible with the ones dominating the priors of the joint or the conditional distributions. As a consequence, the estimation of the mutual information is no longer Bayesian, and can suffer from theoretical issues, as, for example, yield negative estimates [8].
A first alternative would be to consider an integrable prior containing a single β for the joint probability distribution q xy , and then replace H by I in the equations above, to calculate I . This procedure was tested by Archer et al. [8], and the results were only good when the collection of q xy values governing the data were well described by a distribution that was contained in the family of proposed priors p(q|β). The authors concluded that mixtures of Dirichlet priors do not provide a flexible enough family of priors for highly-structured joint distributions, at least for the purpose of estimating mutual information.
To make progress, we note that I(X, Y) can be written as where q y|x and q y stand for the k y -dimensional vectors (q y 1 |x , . . . , q y ky |x ) and (q y 1 , . . . , q y ky ), and D KL represents the Kullback-Leibler divergence. The average divergence between q y|x and q y captures a notion of spread. Therefore, the mutual information is sensitive not so much to the value of the probabilities q y|x , but rather, to their degree of scatter around the marginal q y . The parameters controlling the prior should hence be selected in order to match the width of the distribution of q y|x values, and not so much each probability. With this intuition in mind, in this paper, we put forward a new prior for the whole ensemble of conditional probabilities q y|x obtained for different x values. In this prior, the parameter β controls the spread of the conditionals q y|x around the marginal q y .

A Prior Distribution for the Conditional Entropies
Our approach is valid when the total number of samples N is at least of the order of magnitude of √ e H(X) , since in this regime, some of the x states are expected to be sampled more than once [18,19]. In addition, the marginal distribution q y must be well sampled. This regime is typically achieved when X has a much larger set of available states than Y. In this case, the maximum likelihood estimatorsq y of the marginal probabilities q y can be assumed to be accurate that is, In this paper, we put forward a Dirichlet prior distribution centered atq y that is, where {q y|x } contains the k x conditional probabilities q y|x corresponding to different x values. Large β values select conditional probabilities close toq y , while small values imply a large spread that pushes the selection towards the border of the k y -simplex. For the moment, for simplicity, we work with a prior p({q y|x }) defined on the conditional probabilities q y|x , and make no effort to model the prior probability of q x . In practice, we estimate the values of q x with the maximum likelihood estimatorq x = n x /N. Since X is assumed to be severely undersampled, this is a poor procedure to estimate q x . Still, the effect on the mutual information turns out to be negligible, since the only role of q x in Equation (11) is to weigh each of the Kullback-Leibler divergences appearing in the average. If k x is large, each D KL -value appears in several terms of the sum, rendering the individual value of the accompanying q x irrelevant, only the sum of all those with the same D KL matters. In Section 6, we tackle the full problem of making Bayesian inference both in q x and {q y|x }.
The choice of prior of Equation (13) is inspired in three facts. First, β captures the spread of q y|x around q y , as implied by the Kullback-Leibler divergence in Equation (13). Admittedly, this divergence is not exactly the one governing the mutual information (Equation (11)), since q y|x and q y are swapped. Yet, it is still a measure of spread. The exchange, as well as the denominator in Equation (13), were introduced for the sake of the second fact, namely, analytical tractability. The third fact regards the emergence of a single relevant β when the sampled data begin to register coincidences. If we follow the Bayesian rationale of the previous section, now replacing the entropy by the mutual information, we can again define a weight W(β|n) for the parameter β where F(β, n) can be obtained analytically, and is a well behaved function of its arguments, whereas For each x, the vector q y|x varies in a k y -dimensional simplex. For p n|q x , {q y|x } we take the multinomial The important point here is that the ratio of the Gamma functions of Equation (14) develops a peak in β as soon as the collected data register a few coincidences in x. Hence, with few samples, the prior proposed in Equation (13) renders the choice of p(β) inconsequential.
Assuming that the marginal probability of Y is well-sampled, the entropy H(Y) is well approximated by the plug-in estimatorĤ(Y) = − ∑ y (n y /N) log(n y /N). For each β, the expected posterior information can be calculated analytically, where ψ 0 is the digamma function. A code implementing the estimate of Equation (16) is publicly available in the site mentioned in the Supplementary Material below.
When the system is well sampled, n xy 1, so the effect of β becomes negligible, the digamma functions tend to logarithms, and the frequencies match the probabilities. In this limit, Equation (16) coincides with the plug-in estimator, which is consistent [20]. As a consequence, our estimator is also consistent. The rest of the paper focuses on the case in which the marginal probability of X is severely undersampled.

A Closer Look on the Case of a Symmetric and Binary Y-Variable
In this section, we analyze the case of a binary Y-variable, which, for simplicity, is assumed to be symmetric that is, q y=0 = q y=1 = 1 /2, such that H(Y) = log 2 nats. In this case, the Dirichlet prior in each factor of Equation (13) becomes a Beta distribution and p {q 1|x }|β = ∏ x p(q 1|x |β). Large values of β mostly select conditional probabilities q y|x close to 1 /2. If all conditional probabilities are similar, and similar to the marginal, the mutual information is low, since the probability of sampling a specific y value hardly depends on x. Instead, small values of β produce conditional probabilities q y|x around the borders (q y|x ∼ 0 or q y|x ∼ 1). In this case, q y|x is strongly dependent on x (see Figure 1b), so the mutual information is large. The expected prior mutual information I|n = 0, β can be calculated using the analytical approach developed by [14,17], The prior information is a slowly-varying function of the order of magnitude of β, namely of log β. Therefore, if a uniform prior in information is desired, it suffices to choose a prior on log β such that p(log β) ∝ |∂ log β I|n = 0, β |, When k y = 2, the expected posterior information (Equation (16)) becomes The marginal likelihood of the data given β is also analytically tractable. The likelihood is binomial for each x, so The posterior for β can be obtained by adding a prior p(β), as p(β|n) ∝ p(n|β)p(β). The role of the prior becomes relevant when the number of coincidences is too low for the posterior to develop a peak (see below). We consider different hypotheses about the strength with which the probability of each y-value varies with x; (b) one possibility is that the conditional probability of each of the two y-values hardly varies with x. This situation is modeled by assuming that the different q y|x are random variables governed by a Beta distribution with a large hyper-parameter β 1 ; (c) On the other hand, the conditional probability q y|x could vary strongly with x. This situation is modeled by a Beta distribution with a small hyper-parameter β 2 . (d) As β varies, so does the prior mutual information (Equation (18)). This prior is obtained by averaging all the I({q 1|x }) values obtained from different possible sets of marginal distributions {q 1|x } that can be generated when sampling the prior p({q 1|x }|β) of Equation (17). The shaded area around the solid line illustrates such fluctuations in I({q 1|x }) when k x = 50.
In order to gain intuition about the statistical dependence between variables with few samples, we here highlight the specific aspects of the data that influence the estimator of Equation (20). Grouping together the terms of Equation (21) that are equal, the marginal likelihood can be rewritten in terms of the multiplicities m nn that is, the number of states x with specific occurrences {n x1 = n, n x0 = n } or {n x1 = n , n x0 = n}, where , The posterior for β is independent from states x with just a single count, as p 10 (β) = constant. Only states x with coincidences matter. In order to see how the sampled data favor a particular β, we search for the β-value (denoted as β * ) that maximizes log p(n|β) in the particular case where at most two samples coincide on the same x, obtaining Denoting the fraction of 2-count states that have one count for each y-value as f 11 = m 11 /(m 11 + m 20 ), Equation (24) implies that the likelihood p(n|β) is maximized at If the y-values are independent of x, we expect f 11 ∼ 1 /2. This case corresponds to a large β * and, consequently, to little information. On the other side, for small f 11 , the parameter β * is also small and the information grows. Moreover, the width of p(n|β) is also modulated by f 11 . When the information is large, the peak around β * is narrow. Low information values, instead, require more evidence, and they come about with more uncertainty around β * .
In Equation (24), the data only intervene through m 11 and m 20 , which characterize the degree of asymmetry of the y-values throughout the different x-states. This asymmetry, hence, constitutes a sufficient statistics for β. If a prior p(β) is included, the β that maximizes the posterior p(β|n) may shift, but the effect becomes negligible as the number of coincidences grows.
We now discuss the role of the selected β in the estimation of information, Equation (20), focusing on the conditional entropy H Y|X (β). First, in terms of the multiplicities, the conditional entropy can be rewritten as where f r is the fraction of the N samples that fall in states x with r counts, and f nn is the fraction of all states x with n + n counts, of which n correspond to one y-value (whichever) and n for the other. Finally, H nn (β) is the estimation of the entropy of a binary variable after {n, n } samples, A priori, I|n = 0, β = log 2 − H 00 (β), as in Figure 1d. Surprisingly, from the property ψ 0 (z + 1) = ψ 0 (z) + 1/z, it turns out that H 00 = H 10 (in fact, H nn = H (n+1)n ). Hence, if only a single count breaks the symmetry between the two y-values, there is no effect on the conditional entropy. This is a reasonable result, since a single extra count is no evidence of an imbalance between the underlying conditional probabilities, it is just the natural consequence of comparing the counts falling on an even number of states (2) when taking an odd number of samples. Expanding the first terms for the conditional entropy, In the severely under-sampled regime, these first terms are the most important ones. Moreover, when evaluating these terms in β = β * (Equation (25)), the conditional entropy simplifies into Typically, ( f 1 + f 2 ) takes most of the weight, so the estimation is close to the prior H 00 evaluated at the β-value that maximizes the marginal likelihood (or the posterior).
When dealing with few samples, it is important to have not just a good estimate of the mutual information, but also a confidence interval. Even a small information may be relevant, if the evidence attests that it is strictly above zero. The theory developed here also allows us to estimate the posterior variance of the mutual information, as shown in Appendix A. The variance (Equation (A4)) is shown to be inversely proportional to the number of states k x , thereby implying that our method benefits from a large number of available states X, even if undersampled.
If an estimator, such as ours, is guaranteed to provide a non-negative estimate for all possible sets of sampled data, it cannot be free of bias, not at least if the samples are generated by an independent distribution (q y|x = q y ), for which the true information vanishes. In Appendix B, we show that, in this specific case, the bias decreases with the square root of the number of coincidences. This number may be large, even in the severely undersampled regime, if exp[H(X)/2] < N exp[H(X)]. If the distribution q x is approximately uniform, the number of coincidences is proportional to the square of the total number of samples, so the bias is inversely proportional to N.

Testing the Estimator
We now analyze the performance of our estimator in three examples where the number of samples N is below or in the order of the effective size of the system exp(H XY ). In this regime, most observed x-states have very few samples. In each example, we define the probabilities q x and q 1|x with three different criteria, giving rise to collections of probabilities that can be described with varying success by the prior proposed in this paper, Equation (17). Once the probabilities are defined, the true value I XY of the mutual information can be calculated, and compared to the one estimated by our method in 50 different sets of samples n of the measured data. We present the estimate obtained with I|n, β * from Equation (20) evaluated in the β * that maximizes the marginal likelihood p(n|β). We did not observe any improvement when integrating over the whole posterior p(β|n) with the prior p(β) of Equation (19), except when m 20 or m 10 were of order 1. This fact implies the existence of a well-defined peak in the marginal likelihood.
In Figure 2, the performance of our estimator is compared with that of three other methods widely employed in the literature: Plug in, NSB and PYM. In addition, two other estimators were evaluated, but not shown in the figure to avoid cramming: the one of reference [21], which is a particularly convenient case of the Schuermann family of estimators [22], and the one of reference [23], extensively used in ecology. Their estimates fell between the plug-in estimator and NSB in the first case, and between NSB and PYM in the second case.  (20)) calculated with the β that maximizes the marginal likelihood p(n|β) (Equation (21)). The curves represent the average over 50 different data sets n, with the standard deviation displayed as a colored area around the mean. (a) estimates of mutual information as a function of the total number of samples N, when the values of q 1|x are generated under the hypothesis of our method (Equation (17)). We sample once the marginal probabilities q x ∼ PY(d = 0.55, α = 50) (as described in [14]), as well as the conditionals q y|x ∼ Beta(β/2, β/2) with β = 2.3. The effective size of the system is exp(H XY ) 800. The exact value of I XY is shown as a horizontal dashed line; (b) E vbvbsgv gtimates of mutual information, for data sets where the conditional probabilities have spherical symmetry. X, a binary variable of dimension 12, corresponds to the presence of 12 delta functions equally spaced in a sphere (q x = 2 −12 , for all x). We generate the conditional probabilities such that they are invariant under rotations of the sphere, namely q y|x = q y|R(x) , being R a rotation. To this aim, we set q y|x as a sigmoid function of a combination of frequency components (π 0 − π 1 − π 2 ) of the spherical spectrum [24]. The effective size of the system is exp(H XY ) 5000; (c) estimates of mutual information, for a conditional distribution far away from our hypotheses. The x states are generated as Bernoulli (p = 0.05) binary vectors of dimension D = 40, while the conditional probabilities depend on the parity of the sum of the components of the vector. When the sum is even we set q y|x = 1/2, and when is odd, q y|x is generated by sampling a mixture of two deltas of equal weight q y|x ∼ [δ(q − q 0 ) + δ(q − 1 + q 0 )]/2 with q 0 = 0.1. The resulting distribution of q y|x -values contains three peaks, and therefore, cannot be described with a Dirichlet distribution. The effective size of the system is exp(H XY ) 4000; (d) bias in the estimation as a function of the value of mutual information. Settings remain the same as in (a), but fixing N = 500 and changing β ∈ (0.04, 14) in the conditional; (e) bias in the estimation as a function of the value of mutual information. Settings as in (b), but fixing N = 2000 and changing the gain of the sigmoid in the conditional; (f) bias in the estimation as a function of the value of mutual information. Settings as in (c), but fixing N = 2000 and changing q 0 ∈ (0.01, 0.4) in the conditional.
In the first example (Figure 2a,d), the probabilities q x are obtained by sampling a Pitman-Yor distribution with concentration parameter α = 50 and tail parameter d = 0.55 (as described in [14]). These values correspond to a Pitman-Yor prior with a heavy tail. The conditional probabilities q y|x are defined by sampling a symmetric Beta distribution q y|x ∼ Beta(β/2, β/2), as in Equation (17). In Figure 2a, we use β = 2.3. Once the joint probability q xy is defined, 50 sets of samples n are generated. The effective size of the system is exp(H XY ) 800. We compare our estimator to the plug-in estimator (Plug-in), NSB and PYM when applied to H X and H XY (all methods coincide in the estimation of H Y ). Our estimator has a low bias, even when the number of samples per effective state is as low as N/e H XY = 0.15. The variance is larger than in the Plug-in estimator, comparable to NSB and smaller than PYM. All the other methods (Plug-in, NSB and to a lesser extend PYM) overestimate the mutual information. In Figure 2d, the performance of the estimators is also tested for different values of the exact mutual information I XY , which we explore by varying β ∈ (0.04, 14). For each β, the conditional probabilities q 1|x are sampled once. Each vector n contains N = 500 samples, and n is sampled 50 times. Our estimates have very low bias, even as the mutual information goes to zero -namely, for independent variables.
Secondly, we analyze an example where the statistical relation between X and Y is remarkably intricate (example inspired by [25]), which underscores the fact that making inference about the mutual information does not require inferences on the joint probability distribution. The variable x is a binary vector of dimension 12. Each component represents the presence or absence of one of a maximum of 12 delta functions equally spaced on the surface of a sphere. There are 2 12 possible x vectors, and they are governed by a uniform prior probability: q x = 2 −12 . The conditional probabilities are generated in such a way that they be invariant under rotations of the sphere that is, q y|x = q y|R(x) , where R is a rotation. Using a spherical harmonic representation [24], the frequency components π ( f (x)) of the spherical spectrum are obtained, where f (x) is the combination of deltas. The conditional probabilities q y|x are defined as a sigmoid function of (π 0 − π 1 − π 2 ). The offset of the sigmoid is chosen such that q y=1 0.5, and the gain such that I XY 0.5 nats. In this example, and, unlike the Dirichlet prior implied by our estimator, p(q y|x ) has some level of roughness (inset in Figure 2b), due to peaks coming from the invariant classes in {x 1 , . . . , x 2 12 }. Hence, the example does not truly fit into the hypothesis of our method. With these settings, the effective size of the system is exp(H XY ) 5000. Our estimator has little bias (Figure 2b,e), even with N/e H XY = 0.2 samples per effective state. In this regime, around ∼ 80% of the samples fall on x states that occur only once ( f 1 0.8), ∼ 19% on states that occur twice and ∼ 1% on states with three counts, or maybe four. As mentioned above, in such cases, the value of I XY is very similar to the one that would be obtained by evaluating the prior information I|n = 0, β of (Equation (18)) at the β * that maximizes the marginal likelihood p(n|β), which in turn is mainly determined by f 11 . In Figure 2e, the estimator is tested with a fixed number of samples N = 2000 for different values of the mutual information, which we explore by varying the gain of the sigmoid. The bias of the estimate is small in the entire range of mutual informations.
In the third place, we consider an example where the conditional probabilities are generated from a distribution that is poorly approximated by a Dirichlet prior. The conditional probabilities are sampled from three Dirac deltas, as q y|x ∼ [0.5 δ(q − 1 /2) + 0.25 δ(q − q 0 ) + 0.25 δ(q − 1 + q 0 )], with q 0 = 0.1. The delta placed in q = 1 /2 could be approximated by a Dirichlet prior with a large β, while the other two deltas could be approximated by a small β, but there is no single value of β that can approximate all three deltas at the same time. The x states are generated as Bernoulli (p = 0.05) binary vectors of dimension D = 40, while the conditional probabilities q 1|x depend on the parity of the sum of the components of the vector x. When the sum is even, we assign q y|x = 1 /2, and when it is odd, we assign q y|x = q 0 or q y|x = 1 − q 0 , both options with equal probability. Although in this case our method has some degree of bias, it still preserves a good performance in relation to the other approaches (see Figure 2c,f). The marginal likelihood p(n|β) contains a single peak in an intermediate value of β, coinciding with none of the deltas in p(q 1|x ), but still capturing the right value of the mutual information. As in the previous examples, we also test the performance of the estimator for different values of the mutual information, varying in this case the value of q 0 (with N = 2000). Our method performs acceptably for all values of mutual information. The other methods, instead, are challenged more severely, probably because a large fraction of the x states have a very low probability, and are therefore difficult to sample. Those states, however, provide a crucial contribution to the relative weight of each of the three values of q 1|x . PYM, in particular, sometimes produces a negative estimate for I XY , even on average.
Finally, we check numerically the accuracy of the analytically predicted mean posterior information (Equation (20)) and variance (Equation (A4)) in the severely under-sampled regime. The test is performed in a different spirit than the numerical evaluations of Figure 2. There, averages were taken for multiple samples of the vector n, from a fixed choice of the probabilities q x and q y|x . The averages of Equations (20) and (A4), however, must be interpreted in the Bayesian sense. The square brackets in I|n and H 2 Y|X represent averages taken for a fixed data sample n, and unknown underlying probability distributions q x and q y|x . We generate many such distributions with q x ∼ DP(α) (a Dirichlet Process with concentration parameter α) and q y|x ∼ Beta(β/2, β/2). A total of 13,500 distributions q xy are produced, with log β sampled from Equation (19), and three equiprobable values of α = {e 4 , e 5 , e 6 }. For each of these distributions, we generate five (5) sets of just N = 40 samples, thereby constructing a list of 5× 13,500 cases, each case characterized by specific values of α, β, q x , {q y|x }, I(q x , {q y|x }), n, I|n and σ 2 (I|n). Following the Bayesian rationale, we partition this list in classes, each class containing all the cases that end up in the same set of multiplicities {m nn } -for example, {m 10 = 36, m 20 = 2}. For each of the 100 most occurring sets of multiplicities (which together cover 70% of all the cases), we calculate the mean and the standard deviation of the mutual information I(q, {q y|x }) of the corresponding class, and compare them with our predicted estimates I|{m nn } and σ 2 I |{m nn } , using the prior p(log β) from Equation (19). Figure 3 shows a good match between the numerical (y-axis) and analytical (x-axis) averages that define the mean information (panel a) and the standard deviation (b). The small departures from the diagonal stem from the fact that the analytical average contains all the possible q x and {q y|x }, even if some of them are highly improbable for one given set of multiplicities. The numerical average, instead, includes the subset of the 13,500 explored cases that produced the tested multiplicity. All the depicted subsets contained many cases, but, still, they remained unavoidably below the infinity covered by the theoretical result.   (20)) and variance (Equation (A4)) in the severely under-sampled regime. A collection of 13,500 distributions q xy are constructed by sampling q x ∼ DP(α) and q y|x ∼ Beta(β/2, β/2), with α varying in the set {e 4 , e 5 , e 6 } and log β from Equation (19). Each distribution q xy has an associated I XY (q xy ). From each q xy , we take five (5)  We have also tested cases where Y takes more than two values, and where the marginal distribution q y is not uniform, observing similar performance of our estimator.

A Prior Distribution for the Large Entropy Variable
The prior considered so far did not model the probability q x of the large-entropy variable X. Throughout the calculation, the probabilities q x were approximated by the maximum likelihood estimatorq X = n x /N. Here, we justify such procedure by demonstrating that proper Bayesian inference on q x hardly modifies the estimation of the mutual information. To that end, we replace the prior of Equation (13) by another prior that depends on both q x and {q y|x }.
The simplest hypothesis is to assume that the prior p(q x , {q y|x }) factorizes as p(q x ) p({q y|x }), implying that the marginal probabilities q x are independent of the conditional probabilities q y|x . We propose q x ∼ DP(α), so that the marginal probabilities q x are drawn from a Dirichlet Process with concentration parameter α, associated with the total number of pseudo-counts. After integrating in q x and in q y|x , the mean posterior mutual information for fixed hyper-parameters β and α is Before including the prior p(q x ), in the severely undersampled regime, the mean posterior information was approximately equal to the prior information evaluated in the best β (Equation (16)). The new calculation (Equation (30)) contains the prior information explicitly, weighted by α/ (N + α), that is, the ratio between the number of pseudo-counts from the prior and the total number of counts. Thereby, the role of the non-observed (but still inferred) states is established.
The inference over α coincides with the one of PYM with the tail parameter as d = 0 [14], since where k 1 = ∑ x,n x >0 1 is the number of states x with at least one sample. With few coincidences in x, p(n x |α) develops a peak around a single α-value that represents the number of effective states. Compared to the present Bayesian approach, maximum likelihood underestimates the number of effective states (or entropy) in x. Since the expected variance of the mutual information decreases with the square root of the number of effective states, the Bayesian variance is reduced with respect to the one of the Plug-in estimator.

Discussion
In this work, we propose a novel estimator for mutual information of discrete variables X and Y, which is adequate when X has a much larger number of effective states than Y. If this condition does not hold, the performance of the estimator breaks down. We inspire our proposal in the Bayesian framework, in which the core issue can be boiled down to finding an adequate prior. The more the prior is dictated by the data, the less we need to assume from outside. Equation (11) implies that the mutual information I(X, Y) is the spread of the conditional probabilities of one of the variables (for example, q y|x , but the same holds for q x|y ) around the corresponding marginal (q y or q x , respectively). This observation inspires the choice of our prior (Equation (13)), which is designed to capture the same idea, and, in addition, to be analytically tractable. We choose to work with a hyper-parameter β that regulates the scatter of q y|x around q y , and not the scatter of q x|y around q x because the asymmetry in the number of available states of the two variables makes the β of the first option (and not the second) strongly modulated by the data, by the emergence of a peak in p(n|β).
Although our proposal is inspired in previous Bayesian studies, the procedure described here is not strictly Bayesian, since our prior (Equation (13)) requires the knowledge ofq y , which depends on the sampled data. However, in the limit in which q y is well sampled, this is a pardonable crime, sinceq y is defined by a negligible fraction of the measured data. Still, Bayesian purists should employ a two-step procedure to define their priors. First, they should perform Bayesian inference on the center of the Dirichlet distribution of Equation (13) by maximizing p(q y |n), and then replaceq y in Equation (13) by the inferred q y . For all practical purposes, however, if the conditions of validity of our method hold, both procedures lead to the same result. By confining the set or possible priors p({q y|x }) to those generated by Equation (13), we relinquish all aspiration to model the prior of, say, q y|x=3 , in terms of the observed frequencies at x = 3. In fact, the preferred β-value is totally blind to the specific x-value of each sampled datum. Only the number of x-values containing different counts of each y-value matters. Hence, the estimation of mutual information is performed without attempting to infer the specific way the variables X and Y are related, a property named equitability [26], and that is shared also by other methods [8,13,14]. Although this fact may be seen as a disadvantage, deriving a functional relation between the variables can actually bias the inference on mutual information [26]. Moreover, fitting a relation is unreasonable in the severe under-sampled regime, in which not all x-states are observed, most sampled x-states contain a single count, and few x-states contain more than two counts-at least without a strong assumption about the probability space. In fact, if the space of probabilities of the involved variables has some known structure or smoothness condition, other approaches that estimate information by fitting the relation first may perform well [9][10][11]. Part of the approach developed here could be extended to continuous variables or spaces with a determined metric. This extension is left for future work.
In the explored examples, our estimator had a small bias, even in the severely under-sampled regime, and it outperformed other estimators discussed in the literature. More importantly, the second and third examples of Section 5 showed that it even worked when the collection of true conditional probabilities q y|x was not contained in the family of priors generated by p(q y|x |β). In these cases, the success of the method relies on the peaked nature of the posterior distribution p(β|n). Even if the selected p(q y|x |β) provides a poor description of the actual collection of probabilities, the dominant β captures the right value of mutual information. This is the sheer instantiation of the equitability property discussed above.
When the number of samples N is much larger than the total number of available states k x × k y , our estimator of mutual information tends to the plug-in estimator, which is known to be consistent [20]. Consequently, our estimator is also consistent. By construction, I|n is equal to the mutual information averaged over all distributions q x and q y|x compatible with the measured data, each weighted by its posterior probability. As such, it can never produce a negative result, which is a desirable property. The down side is that the estimator must be positively biased, at least, when the true information vanishes. The derivation in Appendix B shows that, when the number of samples N ∈ [e H(X)/2 , e H(X) ], this bias is inversely proportional to the square root of the number of pairwise coincidences or, when q x is fairly uniform, to the inverse of the total number of samples. Moreover, the factor of proportionality is significantly smaller than the one obtained in the bound of the bias of other frequentist estimators [3,27]. If the number of samples N grows even further, the bias tends to zero, since the bias of all consistent estimators vanishes asymptotically [28].
Our method provides also a transparent way to identify the statistics that matter, out of all the measured data. Quite naturally, the x-states that have not been sampled provide no evidence in shaping p(β|n), as indicated by Equation (14), and only shift the posterior information towards the prior (Equation (30)). More interestingly, the x-states with just a single count are also irrelevant, both in shaping p(β|n) and in modifying the posterior information away from the prior. These states are unable to provide evidence about the existence of either flat or skewed conditional probabilities q y|x .
Only the states x that have been sampled at least twice contribute to the formation of a peak in p(β|n), and in deviating the posterior information away from the prior.
Our method can also be extended to generalizations of mutual information designed to characterize the degree of interdependence of more than two variables [29][30][31][32][33][34][35][36][37][38][39][40][41], as long as all but one of the variables are extensively sampled. The applicability of the method to these measures will be the object of future work, once a certain degree of consensus has been reached regarding their meaning and range of applicability.
Several fields can benefit from the application of our estimator of mutual information. Examples can be found in neuroscience, when studying whether neural activity (a variable with many possible states) correlates with a few selected stimuli or behavioral responses [12,42,43], or in genomics, to understand associations between genes (large-entropy variable) and a few specific phenotypes [44]. The method can also shed light on the development of rate-distortion methods to be employed in situations in which only a few samples are available. In particular, it can be applied within the information bottleneck framework [45,46], aimed at extracting a maximally compressed representation of an input variable, but still preserving those features that are relevant for the prediction of an output variable. The possibility of detecting statistical dependencies with only few samples is of key importance, not just for analyzing data sets, but also to understand how living organisms quickly infer dependencies in their environments and adapt accordingly [47].

Conclusions
We have proposed a novel estimator for the mutual information I(X; Y) between two variables, applicable to those cases in which the marginal distribution of one of the variables-the one with minimal entropy-is well sampled. The other variable, as well as the joint and conditional distributions, can be severely undersampled. We obtain a consistent estimator that presents very low bias, outperforming previous methods discussed in the literature. The main data statistics determining the estimated value is the inhomogeneity of the conditional distribution of the low-entropy variable in those states in which the large-entropy variable registers coincidences. it is easy to show that In other words, if the marginal entropy is well sampled, the variance in the information is mainly due to the variance in the conditional entropy. In turn, H Y|X is defined as an average of k x terms (Equation (A3)). The independence hypothesis implied in the prior of Equation (13), and in the way different q 1|x and q x factor out in q(n|q x , {q y|x }) (Equation (6)), imply that the different terms of (Equation (A3)) are all independent from each other. The average of k x independent terms has a variance proportional to 1/k x , so the estimator proposed here becomes increasingly accurate as k x grows.
We now derive the detailed dependence of H 2 Y|X − H Y|X 2 on the sampled data n. The mean conditional entropy H Y|X can be written in terms of H Y|x (n x1 , n x0 , β), that is, of the entropy of the variable Y for a particular state x with n x = n x0 + n x1 counts at fixed β β/2 + n xy β + n x ψ 0 (β/2 + n xy + 1) Similarly, for the second moment, In turn, Var[H Y|x ((n x0 , n x1 , β)] = H 2 Y|x (n x0 , n x1 , β) − H Y|x (n x0 , n x1 , β) 2 , where the first moment H Y|x (n x0 , n x1 , β) is given in Equation (A5). The second moment is [14], In this equation, and ψ 1 (z) is the first polygamma function.
Replacing the obtained expressions in Equation (A4), the variance of the estimated information is obtained. The two terms of Equation (A6) represent the two sources of uncertainty of the conditional entropy: the uncertainty of β (first term), manifest in the width of p(β|n), and the uncertainty of the conditional entropies for a fixed β (second term), manifest in the width of p({q 1|x }|β). As the number of samples decreases, the uncertainty in β becomes the dominant term.
The approximate symbol in Equation (A2) stems from the fact that H(Y) is assumed to be well approximated by the plug-in estimator. The error in the marginal entropy H Y is thereby neglected, and the error in the mutual information is assumed to derive from the uncertainty in the conditional entropy H Y|X (Equation (A4)) alone. The assumption is well justified in the context explored in this paper, that is, when exp[H(X)] exp[H(Y)].

Appendix B. Bounding the Bias for Independent Variables
The core idea of a Bayesian approach is that the measured datum n is given, and the underlying probabilities are unknown. Bayesian estimators are hence constructed by averaging over all possible inferred probabilities. In this context, the calculation of the bias of an estimator is somehow at odds with the core idea, since the bias is obtained by fixing the underlying probability, and averaging on the possible outcomes of the sampled data. Perhaps for this reason, the bias of Bayesian estimators is hardly ever calculated. Yet, if a Bayesian estimator is regarded as a function of n, it is always possible to average the estimate over all the possible sampled data for a fixed underlying distribution. This average is performed in this appendix for the estimator I|n, β * , evaluated at the β-value that maximizes p(n|β), for a binary and symmetric Y variable, in the severely undersampled regime, and when the variables X and Y are independent (that is, q y|x = q y = 1 /2, ∀x). In this regime, the true information vanishes, so the calculated average equals the bias.
We work with a number of samples N in which the previous estimators begin to fail, but ours is still useful that is, exp[H(X)/2] < N exp[H(X)]. In this regime, the majority of the x-states are sampled at most once, and most of the x-states with coincidences are sampled only twice. As an approximation, we assume that no x-states are sampled more than twice. Since the β-value that maximizes p(n|β) cannot be calculated in the absence of coincidences (see Equation (25)), we assume that there is at least one coincidence. The total number m of x-states with two-fold coincidences is decomposed into m = m 20 + m 11 , where m 20 is the number of states sampled with two equal y-values (whichever), and m 11 = m − m 20 is the number of states with two different y-values.
We define a variable z = m 20 /m − 1 /2. Equation (25) implies that the optimal β is β * (z) = ( 1 /2 − z)/z, if z > 0 and β * → ∞ if z ≤ 0. In turn, assuming no more than two-fold coincidences occur (see Equations (24) and (29)), I(n, β * ) can be expressed as I(n, β * ) log 2 − H 00 (β * (z)) = log 2 − ψ 1 2z + ψ In order to calculate the bias in the non-informative case, we take q y|x = 1 /2, ∀ x (a distribution with I(X; Y) = 0). The bias B is obtained by averaging I(n, β * ) over p n|q The function I(n, β * ) of Equation (A9) is continuous in z, and starts growing from zero when z = 0 (m 20 = m/2). As the main contributions to the integral of Equation (A11) come from small values of z, we expand I(n, β * ) in a Taylor series, obtaining I(n, β * ) z + z 2 − 2z 4 + O(z 5 ). The integral of Equation (A11) can be calculated using the moments of the Half Gaussian, yielding where the averages indicated with a horizontal line are weighted by p (m|q x ). We have checked in simulations that in the regime exp[H(X)/2] < N exp[H(X)], the dependence of the bias on m scales as predicted by Equation (A12). It is worth noticing that, when m is as small as 20 coincidences, the bias drops to B 0.05 nats. If the probabilities q x are approximately homogeneous, then the number coincidences scales as m 0.5N 2 / exp[H(X)] [18]. In that case, the leading term in the bias is For comparison, the bias of the entropy estimator H G proposed by Grassberger [21] is bounded by −∆H G < 0.14 . . . e H(X) /N. Although the scaling in N is the same, the accompanying factor is much larger than in our estimator, especially in the undersampled regime considered here, where exp[H(X)/2] < N exp[H(X)].