Assessing the Relevance of Specific Response Features in the Neural Code

The study of the neural code aims at deciphering how the nervous system maps external stimuli into neural activity (the encoding phase) and subsequently transforms such activity into adequate responses to the original stimuli (the decoding phase). Several information-theoretical methods have been proposed to assess the relevance of individual response features, for example, the spike count of a given neuron, or the amount of correlation in the activity of two cells. These methods work under the premise that the relevance of a feature is reflected in the information loss that is induced by eliminating the feature from the response. The methods differ in the procedure by which the tested feature is removed, and in the algorithm with which the lost information is calculated. Here we compare these methods, and show that more often than not, each method assigns a different relevance to the tested feature. We demonstrate that the differences are both quantitative and qualitative, and connect them with the method employed to remove the tested feature, as well as the procedure to calculate the lost information. By studying a collection of carefully designed examples, and working on analytic derivations, we identify the conditions under which the relevance of features diagnosed by different methods can be ranked, or sometimes even equated. The condition for equality involves both the amount and the type of information contributed by the tested feature. We conclude that the quest for relevant response features is more delicate than previously thought, and may yield multiple answers depending on methodological subtleties.


Introduction
Understanding the neural code involves, among other things, identifying the relevant response features that participate in the representation of information. Different studies have proposed several candidates, for example, the spiking rate [1,2], the response latency [3], the temporal organisation of spikes [4], the amount of synchrony in a given brain area [5], the amount of correlation between the activity of different neurons [6], or the phase of the local field potential at the time of spiking [7], to cite a few. One way of evaluating the relevance of each candidate feature is to assess how much information is lost by ignoring that feature. This strategy involves the comparison of the mutual information between the stimulus and the so-called full response (a collection of response features including the tested one) and the same information calculated with a reduced response, obtained by dropping the tested feature from the full response. If the tested feature is relevant, the information encoded by the reduced response should be smaller than that of the full response.
The procedure is fairly straightforward when the response features are defined in terms of variables that take definite values in each stimulus presentation, for example, the spike count C fired in a fixed time window, or the latency L between the stimulus and the first spike. The full response in this case is a two-component vector [C, L], the value of which is uniquely defined for each stimulus presentation (we assume that, in this example, C is never equal to 0, so L is always well defined). The reduced response is a one-component vector, either C or L, depending on whether we are evaluating the relevance of the latency or the spike count, respectively. If the latency or the spike count is relevant, then the information encoded by C or L, respectively, should be smaller than that of the pair [C, L]. Throughout this paper, we often use C and L as examples of response features that take a precise value in each trial, in contrast with other features that are only defined in the whole collection of trials, as discussed below.
The method becomes more controversial when applied to response properties that can only be defined in multiple stimulus presentations, for example, the amount of correlation in the activity of two or more neurons, or the temporal precision of the elicited spikes. These properties cannot be calculated from single responses, so more sophisticated methods are required to delete the tested feature. There are several alternative procedures to perform such deletion, and also several ways in which the lost information can be calculated. Interestingly, the lost information depends markedly on the chosen method, implying that the so-called relevance of a given feature is a subtle concept that needs to be specified precisely. When assessing the relevance of noise correlations, two different sets of strategies have been proposed by the seminal works of Nirenberg et al. [8] and Schneidman et al. [9]. The first proposal evaluated the role of noise correlations in decoding the information represented in neural activity, whereas the second evaluated their role in the amount of encoded information. Quite surprisingly, the contribution of correlations to the decoded information was shown to sometimes exceed the amount of encoded information [9], seemingly contradicting the intuitive idea that the encoded information constitutes an upper bound to the decoded information. The apparent inconsistency between the two measures has not been observed in later extensions of the technique, where the relevance of other response aspects was evaluated, such as spike-time precision, spike counts or spike onsets. Moreover, it has even been argued that the inconsistency is exclusive to the assessment of noise correlations [10][11][12][13].
In this paper, the different methods used in the literature to delete a given response feature are distinguished for the first time, and the implications of each method are discussed and compared. We show that the data processing inequality, stating that the decoded information cannot surpass the encoded information, can only be invoked with some, and not all, deletion procedures. The distinction between such procedures allows us to identify the conditions in which the decoded information can exceed the encoded information, and to demonstrate that there was no logical inconsistency in previous studies. We also show explicit examples where the decoded information surpasses the encoded information when assessing the role of response aspects other than noise correlations. In order to explain why such behaviours have not been identified until now, we scrutinise the arguments given in the literature to claim that only noise correlations could exhibit such behaviour. We conclude that although the measures employed to assess the relevance of individual response features initially distinguished clearly between the relevance for encoding and the relevance for decoding, this distinction was eventually lost in later modifications of the measures. By diagnosing the confusion, we prove that the response features for which the decoded information can surpass the encoded information are not restricted to noise correlations.
More generally, we discuss a wide collection of strategies employed to assess the relevance of individual response features, ranging from encoding-oriented to decoding-oriented ones. This distinction is related to the way the tested feature contributes to the performance of decoders, which can be mismatched or not. The relevance of the tested feature obtained with some of the measures is always bounded by the relevance obtained with another measure. Yet, not all measures can be ordered hierarchically.
There are examples where the relevance of a feature obtained with one method may surpass or be surpassed by the relevance of another, depending on the specific values taken by the prior stimulus probability and the conditional response probabilities. We analyse a collection of carefully chosen examples to identify the cases where this is so. In certain restricted conditions, however, the hierarchy, or even the equality, can be ensured. Here we establish these conditions by means of analytic reasoning, and discuss their implications in terms of the amount and type of information encoded by the tested feature.
We also present examples in which the measures to assess the relevance of a given feature can be used to extract qualitative knowledge about the type of information encoded by the feature. In other words, we assess not only how much information is encoded by an individual feature, but also what kind of information is provided, with respect to individual stimulus attributes. Again, we prove that the type of encoded information depends on the method employed to assess it.
Finally, given that one important property of measures of relevance hinges on whether they represent the operation of matched or mismatched decoders, we also explore the consequences of operating mismatched decoders on noisy responses, instead of real responses. We conclude that it may be possible to improve the performance of a mismatched decoder by adding noise. From the theoretical point of view, this observation underscores the fact that the conditions for optimality for matched decoders need not hold for mismatched decoders. From the practical perspective, our results open new opportunities for potentially simpler, more efficient and more resilient decoding algorithms.
In Section 2.1, we establish the notation and introduce some of the key concepts that will be used throughout the paper. These concepts are employed in Section 2.2 to determine the cases where the data-processing inequality can be ensured. In Section 2.3 we introduce 9 measures of feature relevance that were previously defined in the literature, and briefly discuss their meaning, similarities and discrepancies. A numeric exploration of a set of carefully chosen examples is employed in Section 2.4 to detect the pairs of measures for which no general hierarchical order exists. In Section 2.5 we discuss the consequences of employing measures that are conceptually linked to matched or mismatched decoders. Later, in Section 2.6, we explore the way in which different measures of feature relevance ascribe different qualitative meaning to the type of information encoded by the tested feature. In Section 2.7 we discuss the conditions under which encoding-oriented measures provide the same amount of information as their decoding-oriented counterparts, and also the conditions under which the equality extends to the content of that information. Then, in Section 2.8, we observe that mismatched decoders may sometimes improve their performance when operating upon noisy responses. We discuss some relations of our work to other approaches and to the limited sampling problem in Section 3, and we close with a summary of the main results of the paper in Section 4.

Statistical Notation
When no risk of ambiguity arises, we here employ the standard abbreviated notation of statistical inference [14], denoting random variables with letters in upper case, and their values, in lower case. For example, the symbol P(x|y) always denotes the conditional probability of the random variable X taking the value x given that the random variable Y takes the value y. This notation may lead to confusion or be inappropriate, for example, when the random variable X takes the value u given that the random variable Y takes the value v. In those cases, we explicitly indicate the random variables and their values, as for example P(X = u|Y = v).
In the study of the neural code, the relevant random variables are the stimulus $S$ and the response $R$ generated by the nervous system. In this paper, we discuss the statistics of the true responses observed experimentally, and compare them with a theoretical model that describes how responses would be if the encoding strategy were different. To differentiate these two situations, we employ the variable $R_{\mathrm{ex}}$ for the experimental responses (the real ones), and $R_{\mathrm{su}}$ for the surrogate responses (the fictitious ones). The associated conditional probability distributions are $P_{\mathrm{ex}}(R_{\mathrm{ex}} = r|S = s)$ and $P_{\mathrm{su}}(R_{\mathrm{su}} = r|S = s)$, which are often abbreviated as $P_{\mathrm{ex}}(r|s)$ and $P_{\mathrm{su}}(r|s)$, respectively. Once these distributions are known, and given the prior stimulus probabilities $P(s)$, the joint probabilities $P_{\mathrm{ex}}(r, s)$ and $P_{\mathrm{su}}(r, s)$ can be deduced, as well as the marginals $P_{\mathrm{ex}}(r)$ and $P_{\mathrm{su}}(r)$. When interpreting the abbreviated notation, readers should keep in mind that $P_{\mathrm{ex}}$ governs the variable $R_{\mathrm{ex}}$, and $P_{\mathrm{su}}$ governs $R_{\mathrm{su}}$. If a statement is made about a distribution $P$ or a response variable $R$ that has no subscript, the statement applies to both the real and the surrogate distributions or variables.

Encoding
The process of converting stimuli $S$ into neural responses $R$ (e.g., spike trains, local field potentials, electroencephalographic or other brain signals) is called "encoding" [9,15]. The encoding process is typically noisy, in the sense that repeated presentations of the same stimulus may yield different neural responses, and is characterised by the joint probability distribution $P(s, r)$. The associated marginal probabilities are $P(s) = \sum_r P(s, r)$ and $P(r) = \sum_s P(s, r)$, from which the conditional response probability $P(r|s) = P(s, r)/P(s)$ and the posterior stimulus probability $P(s|r) = P(s, r)/P(r)$ can be defined.
The mutual information that $R$ contains about $S$ is
$$I(S; R) = \sum_{s,r} P(s, r)\, \log_2 \frac{P(s|r)}{P(s)}. \tag{1}$$
More generally, the mutual information I(S; X) about S contained in any random variable X, including but not limited to R, can be computed using the above formula with R replaced by X. For compactness, we denote I(S; X) as I X unless ambiguity arises.
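As a numerical illustration of the mutual information defined above, the following minimal Python sketch evaluates it from a joint probability table. The function name `mutual_information` and the values in `p_sr` are hypothetical, chosen only for illustration, and the code assumes that the full joint distribution is known exactly (no limited-sampling correction).

```python
import numpy as np

def mutual_information(p_sr):
    """I(S;R) in bits from a joint probability table p_sr[s, r]."""
    p_sr = np.asarray(p_sr, dtype=float)
    p_s = p_sr.sum(axis=1, keepdims=True)   # marginal P(s), shape (n_s, 1)
    p_r = p_sr.sum(axis=0, keepdims=True)   # marginal P(r), shape (1, n_r)
    mask = p_sr > 0                         # terms with P(s, r) = 0 contribute 0
    # P(s, r) / (P(s) P(r)) equals P(s|r) / P(s)
    ratio = p_sr[mask] / (p_s @ p_r)[mask]
    return float(np.sum(p_sr[mask] * np.log2(ratio)))

# Hypothetical joint distribution: 2 stimuli (rows) x 3 responses (columns).
p_sr = np.array([[0.25, 0.25, 0.00],
                 [0.00, 0.25, 0.25]])
info = mutual_information(p_sr)
print(f"I(S;R) = {info:.3f} bits")
```

In this toy table the middle response is equally likely under both stimuli and therefore uninformative, whereas the two outer responses identify the stimulus unambiguously.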

Data Processing Inequalities
When the response $R_2$ is a post-processed version of the response $R_1$, the joint probability distribution $P(s, r_1, r_2)$ can be written as $P(s, r_1)\, P(r_2|r_1)$. This decomposition implies that $R_2$ is conditionally independent of $S$ given $R_1$. In these circumstances, the information about $S$ contained in $R_2$ cannot exceed the information about $S$ contained in $R_1$ [16]. In addition, the accuracy of the optimal decoder operating on $R_2$ cannot exceed the accuracy of the optimal decoder operating on $R_1$ [17]. These results constitute the data processing inequalities.

Decoding
The process of transforming responses $r$ into estimated stimuli $\hat{s}$ is called "decoding" [9,15]. More precisely, a decoder is a mapping $r \to \hat{s}$ defined by a function $\hat{s} = D(r)$. The inverse of this function is $D^{-1}$, and when $D$ is not injective, $D^{-1}$ is a multi-valued mapping. The joint probability $P(s, \hat{s})$ of the presented and estimated stimuli, also called "confusion matrix" [12], is
$$P(s, \hat{s}) = \sum_{r \in D^{-1}(\hat{s})} P(s, r), \tag{2}$$
where the sum runs over all responses $r$ that are mapped onto $\hat{s}$ by $D$. The information that $\hat{S}$ preserves about $S$ is $I_{\hat{S}}$, and can be calculated from the confusion matrix of Equation (2). The decoding accuracy above chance level is defined by Equation (3).

Optimal Decoding
Although all mappings $D$ are formally admissible as decoders, not all are useful. The aim of a decoder is to make a good guess of the external stimulus $S$ from the neural response $R$. It is therefore important to be able to construct decoders that make good guesses, or at least, as good as the mapping from stimuli to responses allows. Optimal decoders (also called Bayesian or maximum-a-posteriori decoders, as well as ideal homunculus, or observer, among other names) are defined as [18,19]
$$\hat{s} = D(r) = \arg\max_s P(s|r) = \arg\max_s P(s, r). \tag{4}$$
This mapping selects, for each response $r$, the stimulus $\hat{s}$ that most likely generated $r$. It is optimal in the sense that no other decoding algorithm yields a confusion matrix with higher decoding accuracy. Equation (4) depends on $P(s, r)$, so the decoder cannot be defined before knowing the functional shape of the joint probability distribution between stimuli and responses. The process of estimating $P(s, r)$ from real data, and the subsequent insertion of the obtained distribution in Equation (4), is called the training of the decoder. The word "training" makes reference to a gradual process, originally stemming from a computational strategy employed to estimate the distribution progressively, while the data were being gathered. However, in this paper we do not discuss estimation strategies from limited samples, so for us, "training a decoder" is equivalent to constructing a decoder from Equation (4).
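The construction of a maximum-a-posteriori decoder and of its confusion matrix can be sketched in a few lines of Python. This is only an illustration under the assumption that the joint distribution is known exactly; the helper names `map_decoder` and `confusion_matrix` and the numerical values are hypothetical. The sketch reports the plain fraction of correct decodings, without subtracting any chance level.

```python
import numpy as np

def map_decoder(p_sr):
    """For each response r, return the stimulus index maximising P(s, r)."""
    # Maximising P(s, r) over s is equivalent to maximising P(s|r).
    return np.argmax(p_sr, axis=0)

def confusion_matrix(p_sr, decoder):
    """P(s, s_hat): add P(s, r) over all responses r decoded as s_hat."""
    n_s = p_sr.shape[0]
    conf = np.zeros((n_s, n_s))
    for r, s_hat in enumerate(decoder):
        conf[:, s_hat] += p_sr[:, r]
    return conf

# Hypothetical joint distribution: 2 stimuli (rows) x 3 responses (columns).
p_sr = np.array([[0.30, 0.15, 0.05],
                 [0.05, 0.10, 0.35]])

decoder = map_decoder(p_sr)              # decoded stimulus for each response
conf = confusion_matrix(p_sr, decoder)   # joint probability of (s, s_hat)
fraction_correct = np.trace(conf)        # probability that s_hat equals s
print(decoder, fraction_correct)
```

Ties in the posterior are broken here in favour of the lowest stimulus index, which is an arbitrary choice of this sketch.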

Extensions of Optimal Decoding
The study of Ince et al. [20] introduced the concept of ranked decoding, in which each response $r$ is mapped onto a list of $K$ stimuli $\hat{\mathbf{s}} = (\hat{s}_1, \ldots, \hat{s}_K)$ ordered according to their posterior probabilities, so that $P(\hat{s}_k|r) \geq P(\hat{s}_{k+1}|r)$ (with $1 \leq k < K$, and $K$ not larger than the total number of stimuli in the experiment). Ranked decoding can provide useful models for intermediate stages in the decision pathway, and the information loss induced by ranked decoding was computed recently [17]. The joint probability associated with ranked decoding is
$$P(s, \hat{\mathbf{s}}) = \sum_{r \in D^{-1}(\hat{\mathbf{s}})} P(s, r), \tag{5}$$
where the sum runs over all response vectors $r$ that produce the same ranking $\hat{\mathbf{s}}$. Although $P(s, \hat{\mathbf{s}})$ can be used to compute the information $I_{\hat{S}}$ between $S$ and $\hat{S}$, it cannot be used to compute the decoding accuracy above chance level, because the support of $\hat{S}$ (i.e., the set of stimulus lists) is not contained in the support of $S$ (i.e., the set of stimuli).
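Ranked decoding, as described above, can be sketched as follows. The code is only an illustration under assumed values: the names `ranked_decoder` and `ranked_joint`, the joint table `p_sr`, and the tie-breaking rule (lowest stimulus index first) are hypothetical choices and are not taken from the cited studies.

```python
import numpy as np
from collections import defaultdict

def ranked_decoder(p_sr, k):
    """Map each response r to the tuple of its k most probable stimuli."""
    # Sorting by P(s, r) within a column is the same as sorting by P(s|r).
    order = np.argsort(-p_sr, axis=0, kind="stable")[:k]
    return [tuple(int(s) for s in order[:, r]) for r in range(p_sr.shape[1])]

def ranked_joint(p_sr, rankings):
    """Joint probability of stimulus s and ranking, summed over responses."""
    joint = defaultdict(float)
    for r, ranking in enumerate(rankings):
        for s in range(p_sr.shape[0]):
            joint[(s, ranking)] += p_sr[s, r]
    return dict(joint)

# Hypothetical joint distribution: 3 stimuli (rows) x 3 responses (columns).
p_sr = np.array([[0.20, 0.05, 0.05],
                 [0.10, 0.15, 0.05],
                 [0.05, 0.10, 0.25]])

rankings = ranked_decoder(p_sr, k=2)   # one stimulus list per response
joint = ranked_joint(p_sr, rankings)
print(rankings)
```

The keys of `joint` pair a stimulus with a stimulus list, which makes explicit why a fraction of correct decodings cannot be read off directly from this table.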

Approximations to Optimal Decoding
For given probabilities $P(r|s)$ and $P(s)$, Equation (4) defines a mapping between each response $r$ and a candidate stimulus $\hat{s}$. In the study of the neural code, scientists often wonder what would happen if responses were not governed by the experimentally recorded distribution $P_{\mathrm{ex}}(r|s)$, but by some other surrogate distribution $P_{\mathrm{su}}(r|s)$. If we replace $P_{\mathrm{ex}}(r|s)$ by $P_{\mathrm{su}}(r|s)$ in Equation (4), we define a new decoding algorithm
$$\hat{s} = D_{\mathrm{su}}(r) = \arg\max_s P_{\mathrm{su}}(s|r) = \arg\max_s P_{\mathrm{su}}(s, r), \tag{6}$$
which, as discussed below, may or may not be optimal, depending on how the decoder is used.

Two Different Decoding Strategies
One alternative, here referred to as "decoding method α", is that, for each response $r$ obtained experimentally, one decodes a stimulus $\hat{s}$ using the new mapping of Equation (6). In this case, the chain $s \to r \to \hat{s}$ gives rise to the confusion matrix
$$P^{\alpha}(s, \hat{s}) = \sum_{r \in D_{\mathrm{su}}^{-1}(\hat{s})} P_{\mathrm{ex}}(r, s), \tag{7}$$
where the sum runs over all response vectors $r$ that are mapped onto $\hat{s}$ by the new decoding algorithm $D_{\mathrm{su}}$, and the probability $P_{\mathrm{ex}}(r, s)$ appearing in the right-hand side is the real one, since responses $r$ are generated experimentally. It is easy to see that in this case, the decoding accuracy of the new algorithm is in general suboptimal, since responses $r$ are generated with the original distribution $P_{\mathrm{ex}}(r|s)$, and for that distribution, the optimal decoder is given by Equation (4) with $P = P_{\mathrm{ex}}$. In the literature, training a decoder with a probability $P_{\mathrm{su}}(r|s)$ and then operating it on variables that are generated with $P_{\mathrm{ex}}(r|s)$ is called mismatched decoding. In what follows, information values calculated from the distribution of Equation (7) are denoted $I^{\alpha}_{\hat{S}}$. A second alternative, "decoding method β", is that, for each stimulus $s$, a surrogate response $R_{\mathrm{su}}$ is drawn using the new distribution $P_{\mathrm{su}}(r|s)$. If the sampled value is $R_{\mathrm{su}} = r$, the stimulus $\hat{s} = D_{\mathrm{su}}(r)$ is decoded. In this case, the confusion matrix is
$$P^{\beta}(s, \hat{s}) = \sum_{r \in D_{\mathrm{su}}^{-1}(\hat{s})} P_{\mathrm{su}}(r, s), \tag{8}$$
where, as before, the sum runs over all response vectors $r$ that are mapped onto $\hat{s}$ by the decoding algorithm $D_{\mathrm{su}}(r)$, but now the probability $P_{\mathrm{su}}(r, s)$ appearing in the right-hand side is the surrogate one, since responses $R_{\mathrm{su}}$ are not generated experimentally. In this case, there is no mismatch between the construction and operation of the decoder, and $D_{\mathrm{su}}$ is optimal, in the sense that no other algorithm decodes $R_{\mathrm{su}}$ with higher decoding accuracy.
One should bear in mind, however, that the surrogate responses are not the responses observed experimentally, that they may well take values in a response set that does not coincide with the set of real responses, and that $R_{\mathrm{su}}$ is not necessarily obtained by transforming the real response $R_{\mathrm{ex}}$ with a stimulus-independent mapping (see below). Methods α and β can be easily extended to encompass also ranked decoding, mutatis mutandis. The two alternative decoding methods yield two different decoding accuracies. To distinguish them, we use the notation $A^{R_2}_{R_1}$. The superscript indicates the variable whose probability distribution is used to construct the decoder in Equation (4), and consequently, determines the set of $r \in D_{\mathrm{su}}^{-1}(\hat{s})$ that contribute to the sums of Equations (7) and (8). The subscript indicates the variable upon which the decoder is applied, whose probability distribution is summed in the right-hand side of Equations (7) and (8). With the same convention applied to the confusion matrices, $P^{\alpha}(s, \hat{s}) = P^{R_{\mathrm{su}}}_{R_{\mathrm{ex}}}(s, \hat{s})$ and $P^{\beta}(s, \hat{s}) = P^{R_{\mathrm{su}}}_{R_{\mathrm{su}}}(s, \hat{s})$.
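The difference between decoding methods α and β can be made concrete with a small numerical sketch. The two joint distributions below are hypothetical and share the same response set, which is an assumption of this example rather than a general property of surrogate responses. The decoder is always built from the surrogate distribution; it is then operated either on the real distribution (method α, mismatched) or on the surrogate distribution (method β, matched).

```python
import numpy as np

def map_decoder(p_sr):
    """For each response r, return the stimulus index maximising P(s, r)."""
    return np.argmax(p_sr, axis=0)

def confusion_matrix(p_sr, decoder):
    """P(s, s_hat): add P(s, r) over all responses r decoded as s_hat."""
    n_s = p_sr.shape[0]
    conf = np.zeros((n_s, n_s))
    for r, s_hat in enumerate(decoder):
        conf[:, s_hat] += p_sr[:, r]
    return conf

# Hypothetical joint distributions (2 stimuli x 3 responses).
p_ex = np.array([[0.30, 0.15, 0.05],    # real responses
                 [0.05, 0.10, 0.35]])
p_su = np.array([[0.25, 0.10, 0.15],    # surrogate responses
                 [0.05, 0.20, 0.25]])

d_ex = map_decoder(p_ex)   # decoder trained on the real distribution
d_su = map_decoder(p_su)   # decoder trained on the surrogate distribution

acc_ex = np.trace(confusion_matrix(p_ex, d_ex))      # matched, real data
acc_alpha = np.trace(confusion_matrix(p_ex, d_su))   # method alpha
acc_beta = np.trace(confusion_matrix(p_su, d_su))    # method beta
print(acc_ex, acc_alpha, acc_beta)
```

In this example the mismatched decoder of method α is less accurate on the real responses than the matched decoder, as expected. The accuracy of method β refers to a different response variable, so it need not lie above or below the other two values.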

The Applicability of the Data-Processing Inequality
Assessing the relevance of a response feature typically involves a subtraction $\Delta I = I - I'$, where $I$ and $I'$ represent the mutual information between stimuli and a set of response features containing or not containing the tested feature, respectively. The magnitude of $\Delta I$ is often interpreted as the information provided by the tested feature. This interpretation requires $\Delta I$ to be non-negative, since intuitively, one would imagine that removing a response feature cannot increase the encoded information.
As shown below, a formal proof of this intuition may or may not be possible invoking the data processing inequality (see Section 2.1.3 and reference [16]), depending on the method used to eliminate the tested feature. As a consequence, there are cases in which ∆I is indeed negative (see below).
In these cases, the tested feature is detrimental to information encoding [9].

Reduced Representations
There are several procedures by which the tested feature can be removed from the response. The validity of the data-processing inequalities (see definition in Section 2.1.3) depends on the chosen procedure. In order to specify the conditions in which the inequalities hold, we here introduce the concept of reduced representations. When the response feature under evaluation is removed from $R_{\mathrm{ex}}$ by a deterministic mapping $R_{\mathrm{su}} = f(R_{\mathrm{ex}})$, we call the obtained variable $R_{\mathrm{su}}$ a reduced representation of $R_{\mathrm{ex}}$. A required condition for a mapping to be a reduced representation is that the function $f$ be stimulus-independent, that is, that the value of $R_{\mathrm{su}}$ be conditionally independent of $s$ given $R_{\mathrm{ex}}$. Mathematically, this means that $P(r_{\mathrm{su}}, s|r_{\mathrm{ex}}) = P(r_{\mathrm{su}}|r_{\mathrm{ex}})\, P(s|r_{\mathrm{ex}})$. If the mapping $f$ and the conditional response distribution $P_{\mathrm{ex}}(r|s)$ are known, the distribution $P_{\mathrm{su}}(r|s)$ can be derived using standard methods. The data processing inequality ensures that for all reduced representations,
$$I_{R_{\mathrm{ex}}} \geq I_{R_{\mathrm{su}}}.$$
Reduced representations are usually employed when the response feature whose relevance is to be assessed takes a definite value in each trial, as happens, for example, with the number of spikes in a fixed time window, the latency of the firing response, or the activity of a specific neuron in a larger population of neurons. In these cases it is easy to construct $R_{\mathrm{su}}$ simply by dropping the tested feature from $R_{\mathrm{ex}}$, or by fixing its value with some deterministic rule.
Reduced representations can also be used in other cases, for example, when the relevance of the feature response accuracy is assessed. This feature does not take a specific value in each trial; only by comparing multiple trials can the response accuracy be determined. A widely-used strategy is to represent spike trains with temporal bins of increasing duration, and to evaluate how the amount of information decreases as the representation becomes coarser. A sequence of surrogate responses is thereby defined, by progressively disregarding the fine temporal precision with which spike trains were recorded (Figure 1).
Several studies have reported an information $I_{R_{\mathrm{su}}}$ that decreases monotonically with the duration $\delta t$ of the time bin (for example [21][22][23]). If there is a specific temporal scale in which spike-time precision is relevant, so the argument goes, a sudden drop in $I_{R_{\mathrm{su}}}(\delta t)$ appears at the relevant scale. It should be noted, however, that the data processing inequality does not ensure that $I_{R_{\mathrm{su}}}(\delta t)$ be a monotonically decreasing function of $\delta t$. In the example of Figure 1, representations $R^1_{\mathrm{su}}$ and $R^2_{\mathrm{su}}$ are defined with long temporal bins, the durations of which are integer multiples of the bin used for $R_{\mathrm{ex}}$. Hence, $R^1_{\mathrm{su}}$ and $R^2_{\mathrm{su}}$ are reduced representations of $R_{\mathrm{ex}}$, and the data processing inequality does indeed guarantee that $I_{R_{\mathrm{ex}}} \geq I_{R^1_{\mathrm{su}}}$ and $I_{R_{\mathrm{ex}}} \geq I_{R^2_{\mathrm{su}}}$. However, $R^2_{\mathrm{su}}$ is not a reduced representation of $R^1_{\mathrm{su}}$, so there is no reason why $I_{R^2_{\mathrm{su}}}$ should be smaller than $I_{R^1_{\mathrm{su}}}$, and indeed, Figure 1b shows an example where it is not. The representation constructed with bins of intermediate duration, namely 10 ms, does not distinguish between the two stimuli, whereas those of shorter and longer duration, 5 and 15 ms, do. A similar effect can be observed in the experimental data (freely available online) of Lefebvre et al. [24], when analysed with bins of sizes 5, 10 and 15 ms in windows of total duration 60 ms. Although these examples are rare, they demonstrate that there is no theoretical basis for expecting $I_{R_{\mathrm{su}}}$ to drop monotonically with increasing $\delta t$.
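The absence of a guaranteed monotonic decrease can be reproduced with a deliberately simple example. The sketch below is not the example of Figure 1 nor the data of [24]: it assumes two equiprobable hypothetical stimuli, labelled "A" and "B", each eliciting a single deterministic spike in a 30 ms window discretised in 5 ms bins, and it re-bins the responses at 5, 10 and 15 ms.

```python
from collections import defaultdict
from math import log2

def rebin(pattern, factor):
    """Add spike counts over consecutive groups of `factor` fine bins."""
    return tuple(sum(pattern[i:i + factor])
                 for i in range(0, len(pattern), factor))

def mutual_information(joint):
    """I(S;R) in bits from a dict {(s, r): probability}."""
    p_s, p_r = defaultdict(float), defaultdict(float)
    for (s, r), p in joint.items():
        p_s[s] += p
        p_r[r] += p
    return sum(p * log2(p / (p_s[s] * p_r[r]))
               for (s, r), p in joint.items() if p > 0)

# Hypothetical responses in six bins of 5 ms (one spike per response).
fine = {"A": (0, 0, 1, 0, 0, 0),   # spike between 10 and 15 ms
        "B": (0, 0, 0, 1, 0, 0)}   # spike between 15 and 20 ms
prior = {"A": 0.5, "B": 0.5}

info = {}
for factor, width_ms in [(1, 5), (2, 10), (3, 15)]:
    joint = defaultdict(float)
    for s, pattern in fine.items():
        joint[(s, rebin(pattern, factor))] += prior[s]
    info[width_ms] = mutual_information(joint)

print(info)
```

With 10 ms bins both spikes fall in the same bin and the stimuli become indistinguishable, whereas the 15 ms bin boundary happens to separate them. The 15 ms representation is a function of the 5 ms one, but not of the 10 ms one, so no inequality relates the last two.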

Figure 1. (a) Hypothetical intracellular recording of the spike patterns elicited by a single neuron after presenting in alternation two visual stimuli, each of which triggers two possible responses, displayed in columns 1 and 3 for the first stimulus, and in columns 2 and 4 for the second. Stimulus probabilities and conditional response probabilities are arbitrary. Time is discretized in bins of 5 ms. The responses are recorded within 30 ms time windows after stimulus onset. Spikes are fired with latencies that are uniformly distributed between 0 and 10 ms after the onset of the first stimulus, and between 20 and 30 ms after the onset of the second. Responses are represented by counting the number of spikes within consecutive time bins of size 5, 10 and 15 ms starting from stimulus onset, thereby yielding discrete-time sequences $R_{\mathrm{ex}}$, $R^1_{\mathrm{su}}$ and $R^2_{\mathrm{su}}$, respectively; (b) Same as (a), but with stimuli producing two different types of response patterns composed of 2 or 3 spikes. Panel labels: time scale (ms), stimulus, spike counts.

Stochastically Reduced Representations
When the response feature under evaluation is removed from the response variable $R_{\mathrm{ex}}$ by a stochastic mapping $R_{\mathrm{ex}} \to R_{\mathrm{su}}$, the obtained variable $R_{\mathrm{su}}$ is called a stochastically reduced representation of $R_{\mathrm{ex}}$. A required condition for a mapping to be a stochastically reduced representation is that the probability distribution of each $R_{\mathrm{su}}$ be dependent on $R_{\mathrm{ex}}$, but conditionally independent of $s$ given $R_{\mathrm{ex}}$. In these circumstances, the data processing inequality ensures that $I_{R_{\mathrm{ex}}} \geq I_{R_{\mathrm{su}}}$. If the statistical properties of the noisy components of the mapping are known, as well as the conditional response probability distribution $P_{\mathrm{ex}}(r|s)$, the distribution $P_{\mathrm{su}}(r|s)$ can be derived using standard methods. Formally, stochastic representations $R_{\mathrm{su}}$ are obtained through stimulus-independent stochastic functions of the original representation $R_{\mathrm{ex}}$. After observing that $R_{\mathrm{ex}}$ adopted the value $r_{\mathrm{ex}}$, these functions produce a single value $r_{\mathrm{su}}$ for $R_{\mathrm{su}}$, chosen with transition probabilities $Q(r_{\mathrm{su}}|r_{\mathrm{ex}})$ such that
$$P_{\mathrm{su}}(r_{\mathrm{su}}|s) = \sum_{r_{\mathrm{ex}}} P_{\mathrm{ex}}(r_{\mathrm{ex}}|s)\, Q(r_{\mathrm{su}}|r_{\mathrm{ex}}).$$
To illustrate the utility of stochastically reduced representations, we discuss their role in providing alternative strategies when assessing the relevance of spike-timing precision, not by changing the bin size as in Figure 1, but by randomly manipulating the responses, as illustrated in Figure 2.
Figure 2. (a) The elicited response $r_{\mathrm{ex}}$ is turned into a surrogate response $r_{\mathrm{su}}$ with a transition probability $Q(r_{\mathrm{su}}|r_{\mathrm{ex}})$ given by Equation (11). This function turns $R_{\mathrm{ex}}$ into a stochastic representation $R_{\mathrm{su}}$ by shuffling spikes and silences within bins of 15 ms starting from stimulus onset; (b) Responses $r_{\mathrm{ex}}$ in panel (a) are transformed by a stochastic function with $Q(r_{\mathrm{su}}|r_{\mathrm{ex}})$ given by Equation (12), which introduces jitter uniformly distributed within 15 ms windows centered at each spike; (c) Responses $r_{\mathrm{ex}}$ in panel (a) are transformed by a stochastic function with $Q(r_{\mathrm{su}}|r_{\mathrm{ex}})$ given by Equation (13), which models the inability to distinguish responses with spikes occurring in adjacent bins, or equivalently, responses that lie within a given distance of each other (see [25,26] for further remarks on these distances). Notice that $R_{\mathrm{su}}$ samples the same response set as $R_{\mathrm{ex}}$.
The method of Figure 2a yields the same information $I_{R_{\mathrm{su}}}$ and response accuracy as the method producing $R^2_{\mathrm{su}}$ in Figure 1. Each method yields responses that can be related to the responses of the other method through a stimulus-independent deterministic or stochastic function. Both methods suffer from the same drawback: they treat spikes differently depending on their location within the 15 ms time window. Indeed, both methods preserve the distinction between two spikes located in different windows, but not within the same window, even if the separation between the spikes is the same. The mapping illustrated in Figure 2a has transition probabilities
$$Q(r_{\mathrm{su}}|r_{\mathrm{ex}}) = \frac{1}{3} \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}, \tag{11}$$
where rows enumerate the elements of the ordered set $\mathcal{R}_{\mathrm{ex}} = \{[2], [3], [4]\}$ from where $R_{\mathrm{ex}}$ is sampled, and columns enumerate the elements of the ordered set $\mathcal{R}_{\mathrm{su}} = \{[1], [2], [3], [4], [5], [6]\}$ from where $R_{\mathrm{su}}$ is sampled. A third method, jittering, consists in shuffling the recorded spikes within time windows centered at each spike (Figure 2b). The responses generated by this method need not be obtainable from the responses generated by the mappings of Figure 2a or Figure 1 through stimulus-independent stochastic functions. Still, the method of Figure 2b inherently yields a stochastic code, and, unlike the methods discussed previously, treats all spikes in the same manner. The mapping illustrated in Figure 2b has transition probabilities
$$Q(r_{\mathrm{su}}|r_{\mathrm{ex}}) = \frac{1}{3} \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix}, \tag{12}$$
where rows enumerate the elements of the ordered set $\mathcal{R}_{\mathrm{ex}} = \{[2], [3], [4]\}$ from where $R_{\mathrm{ex}}$ is sampled, and columns enumerate the elements of the ordered set $\mathcal{R}_{\mathrm{su}} = \{[1], [2], [3], [4], [5]\}$ from where $R_{\mathrm{su}}$ is sampled. As a fourth example, consider the effect of response discrimination, as studied in the seminal work of Victor and Purpura [25]. There, two responses were considered indistinguishable when some measure of distance between the responses was less than a predefined threshold. However, neural responses were transformed through a method based on cross-validation that is not guaranteed to be stimulus-independent. Hence, depending on the case, this fourth method may or may not be a stochastically reduced representation. The case chosen in Figure 2c is a successful example, and the associated matrix of transition probabilities is given by Equation (13), where rows and columns enumerate the elements of the ordered set $\mathcal{R}_{\mathrm{ex}} = \mathcal{R}_{\mathrm{su}} = \{[2], [3], [4]\}$ from where both $R_{\mathrm{ex}}$ and $R_{\mathrm{su}}$ are sampled.
Other methods exist which merge indistinguishable responses, thereby yielding reduced representations. These methods, however, are limited to notions of similarity that are transitive, a condition not fulfilled, for example, by those based on Euclidean distance, edit distance, or by the case of Figure 2c.
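The way in which a stimulus-independent stochastic function transforms the conditional response distribution, and the resulting information loss, can be checked numerically. The following sketch assumes a jitter-type transition matrix of the kind described for Figure 2b (each spike moves to its own bin or to one of its two neighbours with probability 1/3). The matrix entries, the conditional probabilities `p_ex_given_s` and the prior `p_s` are assumptions made for illustration only.

```python
import numpy as np

def mutual_information(p_sr):
    """I(S;R) in bits from a joint probability table p_sr[s, r]."""
    p_sr = np.asarray(p_sr, dtype=float)
    p_s = p_sr.sum(axis=1, keepdims=True)
    p_r = p_sr.sum(axis=0, keepdims=True)
    mask = p_sr > 0
    ratio = p_sr[mask] / (p_s @ p_r)[mask]
    return float(np.sum(p_sr[mask] * np.log2(ratio)))

# Hypothetical P_ex(r|s): 2 stimuli x 3 real responses ([2], [3], [4]).
p_ex_given_s = np.array([[0.8, 0.2, 0.0],
                         [0.0, 0.2, 0.8]])
p_s = np.array([0.5, 0.5])

# Assumed transition matrix Q(r_su | r_ex): rows are the real responses
# [2], [3], [4]; columns are the surrogate responses [1], ..., [5].
Q = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1]]) / 3.0

# P_su(r_su | s) = sum over r_ex of P_ex(r_ex | s) Q(r_su | r_ex)
p_su_given_s = p_ex_given_s @ Q

i_ex = mutual_information(p_s[:, None] * p_ex_given_s)
i_su = mutual_information(p_s[:, None] * p_su_given_s)
print(f"I_ex = {i_ex:.3f} bits, I_su = {i_su:.3f} bits")
```

Because `Q` does not depend on the stimulus, the surrogate responses form a stochastically reduced representation, and the information they carry cannot exceed that of the real responses.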
Stochastically reduced representations include reduced representations as limiting cases. Indeed, when for each r ex there is a r su such that Q(r su |r ex ) = 1, stochastic representations become reduced representations ( Figure 3). The possibility to include stochasticity, however, broadens the range of alternatives. Consider for example the hypothetical experiment in Figure 3a, in which the neural responses R ex =[L, C] can be completely characterized by the first-spike latencies (L) and the spike counts (C). The importance of C can be studied for example by using a reduced code that replaces all C-values with a constant (Figure 3b). In this case, where rows enumerate the elements of the ordered set R ex ={[2, 1], [3,1], [3,2], [4,2]} from where R ex is sampled, and columns enumerate the elements of the ordered set R su ={[2, 1], [3,1], [4,1]} from where R su is sampled. Another alternative is to assess the relevance of C by means of a stochastic code that shuffles the values of C across all responses with the same L ( Figure 3c). In this case, where rows enumerate the elements of the ordered set R ex ={[2, 1], [3,1], [3,2], [4,2]} from where R ex is sampled, and columns enumerate the elements of the ordered set R su ={[2, 1], [3,1], [3,2], [4,2]} from where R su is sampled. The parameter a is arbitrary, as long as 0 < a < 1. We use the notation A third option is to use a stochastic code that preserves the original value of L but chooses the value of C from some possibly L−dependent probability distribution (Figure 3d), for which where rows enumerate the elements of the ordered set R ex ={[2, 1], [3,1], [3,2], [4,2]} from where R ex is sampled, and columns enumerate the elements of the ordered set R su ={ [2,1], [3,1], [4,1], [2,2], [3,2], [4,2]} from where R su is sampled. The parameters a, b, c and d are arbitrary, as long as 0<a, b, c, d<1; and we have used the notationx=1−x for any number x.  
Figure 3. [...] [L, 1], which ignores the additional information carried in C by considering it constant and equal to unity. This reduced code can also be reinterpreted as a stochastic code with transition probabilities Q(r su |r ex ) defined by Equation (14); (c) The additional information carried in C is here ignored by shuffling the values of C across all trials with the same L, thereby turning R ex in panel (a) into a stochastic code R su =[L,Ĉ] with transition probabilities Q(r su |r ex ) defined by Equation (15); (d) The additional information carried in C is here ignored by replacing the actual value of C with one chosen from some possibly L-dependent probability distribution (Equation (16)).
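The effect of a reduced code on the encoded information can be sketched numerically. The example below assumes that Equation (10) has the form P su (r su |s) = Σ Q(r su |r ex ) P ex (r ex |s), with the sum running over r ex. The matrix Q_REDUCED is our reading of the deterministic reduction described above (every C replaced by 1), and the stimulus labels and response probabilities are hypothetical, not those of Figure 3a.

```python
from math import log2

# Ordered response sets quoted in the text, written as (L, C) pairs.
R_EX = [(2, 1), (3, 1), (3, 2), (4, 2)]
R_SU = [(2, 1), (3, 1), (4, 1)]

# Deterministic reduction that replaces every C by 1 (assumed reading of
# Figure 3b): each row contains a single entry equal to 1.
Q_REDUCED = [[1.0 if (l, 1) == r_su else 0.0 for r_su in R_SU] for (l, _c) in R_EX]

# Hypothetical conditional distributions P_ex(r|s) for two equiprobable stimuli.
P_EX = {"s1": [0.5, 0.5, 0.0, 0.0], "s2": [0.0, 0.0, 0.5, 0.5]}
PRIOR = {"s1": 0.5, "s2": 0.5}


def push_forward(p_ex, q):
    """Return P_su(r_su|s) = sum over r_ex of P_ex(r_ex|s) * Q(r_su|r_ex)."""
    return [sum(p * row[j] for p, row in zip(p_ex, q)) for j in range(len(q[0]))]


def mutual_information(cond, prior):
    """Mutual information (in bits) between stimulus and response."""
    n = len(next(iter(cond.values())))
    p_r = [sum(prior[s] * cond[s][j] for s in cond) for j in range(n)]
    return sum(
        prior[s] * p * log2(p / p_r[j])
        for s in cond
        for j, p in enumerate(cond[s])
        if p > 0
    )


P_SU = {s: push_forward(p, Q_REDUCED) for s, p in P_EX.items()}
I_EX = mutual_information(P_EX, PRIOR)
I_SU = mutual_information(P_SU, PRIOR)
print(f"I_ex = {I_EX:.3f} bits, I_su = {I_SU:.3f} bits")
```

In this toy case the reduced code keeps half of the information, in agreement with the data processing inequality, which applies because the reduction is a mapping R ex →R su.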

Modification of the Conditional Response Probability Distribution
When the response feature under evaluation is removed by altering the real conditional response probability distribution P ex (r|s), and transforming it into a surrogate distribution P su (r|s), the obtained response model is here said to implement a probabilistic removal of the tested feature. Probabilistic removals are usually employed when assessing the relevance of correlations between neurons in a population, since correlations are not a variable that can be deleted from each individual response. For example, if R=(R 1 , . . . , R n ) represents the spike count of n different neurons, the real distribution P ex (r 1 , . . . , r n |s) is replaced by a new distribution P su (r 1 , . . . , r n |s) in which all neurons are conditionally independent, that is,
$$P_{NI}(r_1, \ldots, r_n|s) = \prod_{i=1}^{n} P_{ex}(r_i|s), \qquad (17)$$
where, following the notation introduced previously [17], the generic subscript "su" was replaced by "NI" to indicate "noise-independent". The probabilistic removal of a response feature may or may not be describable in terms of a deterministically or a stochastically reduced representation. In other words, there may or may not exist a mapping R ex →R su , or equivalently, a matrix of transition probabilities Q(r su |r ex ), that captures the replacement of P ex (r|s) by P su (r|s). It is important to assess whether such a matrix exists, since the data processing inequality is only guaranteed to hold with reduced representations, stochastic or not. If no reduced representation can capture the effect of a probabilistic removal, the data processing inequality may not hold, and I R su may well be larger than I R ex .
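A minimal sketch of a probabilistic removal may help fix ideas. The code below builds the noise-independent surrogate as the product of the marginals of a two-feature response (L, C), and compares the information encoded before and after the removal. The distributions are hypothetical and only loosely inspired by the situation discussed for Figure 4b; they are chosen so that the surrogate responses encode more information than the real ones.

```python
from math import log2


def noise_independent(joint):
    """Product of the marginals of a joint distribution over (L, C) pairs."""
    p_l, p_c = {}, {}
    for (l, c), p in joint.items():
        p_l[l] = p_l.get(l, 0.0) + p
        p_c[c] = p_c.get(c, 0.0) + p
    return {(l, c): p_l[l] * p_c[c] for l in p_l for c in p_c}


def mutual_information(cond, prior):
    """Mutual information (in bits) between stimulus and response."""
    responses = {r for dist in cond.values() for r in dist}
    total = 0.0
    for r in responses:
        p_r = sum(prior[s] * cond[s].get(r, 0.0) for s in cond)
        for s in cond:
            p = cond[s].get(r, 0.0)
            if p > 0:
                total += prior[s] * p * log2(p / p_r)
    return total


# Hypothetical experiment: s1 elicits two equiprobable, perfectly correlated
# responses; s2 always elicits the same response (already noise independent).
P_EX = {"s1": {(1, 1): 0.5, (2, 2): 0.5}, "s2": {(1, 1): 1.0}}
PRIOR = {"s1": 0.5, "s2": 0.5}

P_NI = {s: noise_independent(dist) for s, dist in P_EX.items()}
I_EX = mutual_information(P_EX, PRIOR)
I_NI = mutual_information(P_NI, PRIOR)
print(f"I_ex = {I_EX:.4f} bits, I_NI = {I_NI:.4f} bits")
```

Here I NI exceeds I ex , so no mapping R ex →R su can reproduce this particular removal; otherwise the data processing inequality would be violated.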
In order to determine whether a stochastically reduced representation exists, the first step is to discern whether Equation (10) constitutes a compatible or an incompatible linear system for the matrix elements Q(r su |r ex ). If the system is incompatible, there is no solution. In the compatible case, which is often indeterminate, a solution entirely composed of non-negative numbers that sum up to unity in each row is required. Given enough time and computational power, the problem can always be solved in the framework of linear programming [27]. In practical cases, however, the search is often hampered by the curse of dimensionality. To facilitate the labour, here we list a few necessary (though not sufficient) conditions that must be fulfilled for the mapping to exist. If any of the following properties does not hold, Equation (10) has no solution, so there is no need to begin a search.
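Although finding a solution may be costly, verifying a candidate is cheap. The following sketch checks the two requirements stated above, assuming that Equation (10) reads P su (r su |s) = Σ Q(r su |r ex ) P ex (r ex |s). The function names, the distributions, and the candidate matrices are hypothetical illustrations.

```python
def is_row_stochastic(q, tol=1e-12):
    """All entries non-negative and every row summing to one."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in q)


def solves_equation_10(q, p_ex, p_su, tol=1e-12):
    """Check P_su(r_su|s) == sum_{r_ex} P_ex(r_ex|s) Q(r_su|r_ex) for all s."""
    for s in p_ex:
        for j in range(len(q[0])):
            predicted = sum(p * row[j] for p, row in zip(p_ex[s], q))
            if abs(predicted - p_su[s][j]) > tol:
                return False
    return True


# Hypothetical distributions over four responses, for two stimuli.
P_EX = {"s1": [0.5, 0.5, 0.0, 0.0], "s2": [0.0, 0.0, 0.5, 0.5]}
P_SU = {"s1": [0.5, 0.25, 0.25, 0.0], "s2": [0.0, 0.25, 0.25, 0.5]}

# Candidate 1: the two middle responses are shuffled with probability 1/2.
Q_SHUFFLE = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Candidate 2: the identity, which leaves P_ex untouched.
Q_IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

for name, q in [("shuffle", Q_SHUFFLE), ("identity", Q_IDENTITY)]:
    print(name, is_row_stochastic(q), solves_equation_10(q, P_EX, P_SU))
```

Both candidates are valid stochastic matrices, but only the first one transforms P ex (r|s) into P su (r|s).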

Property 1.
Let µ(s) be a probability distribution defined in the set of stimuli that may or may not be equal to the actual distribution with which stimuli appear in the experiment under study. For any stimulus s, the inequality I µ (R su ; S = s) ≤ I µ (R ex ; S = s) between stimulus-specific informations [28,29] must hold, with the stimulus-specific information I µ defined by Equation (18).
Proof. If Q(r su |r ex ) exists, then Equation (10) can be inserted in Equation (18). Using the log-sum inequality [16], Property 1 follows.
If we multiply both sides of the inequality by µ(s) and sum over s, we obtain an inequality between the mutual informations I µ (R su ; S) ≤ I µ (R ex ; S). If µ(s) = P(s), this result reduces to the data-processing inequality I R su ≤ I R ex .
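Property 1 can be tested numerically before any search for Q begins. The sketch below assumes that the stimulus-specific information takes the Kullback-Leibler form I µ (R; S = s) = Σ P(r|s) log2 [P(r|s)/P µ (r)], with P µ (r) = Σ µ(s') P(r|s'); this form is our assumption, chosen because it averages to the mutual information and is compatible with a proof based on the log-sum inequality. The distributions are hypothetical.

```python
from math import log2


def specific_information(cond, mu, s):
    """Assumed KL form of the stimulus-specific information, in bits."""
    total = 0.0
    for j, p in enumerate(cond[s]):
        if p > 0:
            p_mu = sum(mu[t] * cond[t][j] for t in cond)
            total += p * log2(p / p_mu)
    return total


def property_1_holds(p_ex, p_su, mu, tol=1e-12):
    """Necessary condition: I_mu(R_su; S=s) <= I_mu(R_ex; S=s) for every s."""
    return all(
        specific_information(p_su, mu, s) <= specific_information(p_ex, mu, s) + tol
        for s in p_ex
    )


# Hypothetical responses ordered as [(1,1), (1,2), (2,1), (2,2)]: s1 elicits
# two correlated responses, and P_SU is the product of the marginals.
P_EX = {"s1": [0.5, 0.0, 0.0, 0.5], "s2": [1.0, 0.0, 0.0, 0.0]}
P_SU = {"s1": [0.25, 0.25, 0.25, 0.25], "s2": [1.0, 0.0, 0.0, 0.0]}
MU = {"s1": 0.5, "s2": 0.5}

for s in P_EX:
    print(s, specific_information(P_EX, MU, s), specific_information(P_SU, MU, s))
print("Property 1 holds:", property_1_holds(P_EX, P_SU, MU))
```

In this example the surrogate responses are more informative about each stimulus than the real ones, so Property 1 fails and no stochastic mapping R ex →R su can exist.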

Property 2.
If Q(r su |r ex ) exists, then Q(r su |r ex ) = 0 whenever P ex (s, r ex ) > 0 and P su (s, r su ) = 0 for at least some s.
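Property 2 fixes some entries of Q before any search is attempted. A minimal sketch, with hypothetical distributions and function name, marks the entries that are forced to vanish; conditional probabilities are used in place of joint ones, which is equivalent as long as every stimulus has non-zero prior probability.

```python
def forced_zeros(p_ex, p_su):
    """Mask of entries Q(r_su|r_ex) that Property 2 forces to be zero.

    Entry [i][j] is True when some stimulus s elicits the real response i
    with non-zero probability but never elicits the surrogate response j.
    """
    n_ex = len(next(iter(p_ex.values())))
    n_su = len(next(iter(p_su.values())))
    return [
        [any(p_ex[s][i] > 0 and p_su[s][j] == 0 for s in p_ex) for j in range(n_su)]
        for i in range(n_ex)
    ]


# Hypothetical distributions over four responses, for two stimuli.
P_EX = {"s1": [0.5, 0.5, 0.0, 0.0], "s2": [0.0, 0.0, 0.5, 0.5]}
P_SU = {"s1": [0.5, 0.25, 0.25, 0.0], "s2": [0.0, 0.25, 0.25, 0.5]}

MASK = forced_zeros(P_EX, P_SU)
for row in MASK:
    print(["0" if forced else "?" for forced in row])
```

The entries marked "?" remain free, so the linear system only needs to be solved for them.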
For example, in Figure 4a, we decorrelate first-spike latencies (L) and spike counts (C) by replacing the true conditional distribution P ex (r|s) (left panel) by its noise-independent version P su = P NI (r|s) defined in Equation (17) (middle panel). Before searching for a mapping R ex →R su , we verify that the condition I R ex > I R su holds. Moreover, for several choices of µ, one may confirm that I µ (R ex ; S = s) > I µ (R su ; S = s) for each of the two stimuli. These results motivate the search for a solution of Equation (10) for Q(r su |r ex ). The transition probability must be zero at least whenever R su ∈{[1,3]; [2,3]; [3,3]; [3,2]; [3,1]} and R ex ∈{[1,2]; [2,1]} (Property 2). One possible solution is given by Equation (19).
Figure 4. Relation between probabilistic removal and stochastic codes. (a) Cartesian coordinates depicting: on the left, responses R ex of a neuron for which L and C are positively correlated when elicited by one stimulus, and negatively correlated when elicited by the other; in the middle, the surrogate responses R su = R NI that would occur should L and C be noise independent; and on the right, a stimulus-independent stochastic function that turns R ex into R su with Q(r su |r ex ) given by Equation (19); (b) Same description as in (a), but with L and C noise independent given one of the stimuli, and with the stochastic function depicted on the right turning R ex into R NI for one stimulus but not for the other.
However, stochastically reduced representations are not always guaranteed to exist. For example, in Figure 4b, it is easy to verify that, for one of the two stimuli, the condition I µ (R ex ; S = s) < I µ (R su ; S = s) holds for any µ that assigns non-zero probability to both stimuli. Property 1 is therefore violated, and no stochastic mapping can transform R ex into R su in such a way that P ex (r|s) is converted into P su (r|s). Schneidman et al. [9] employed an analogous example, but involving different neurons instead of response aspects. The two examples of Figure 4 motivate the following theorem:
Theorem 1. No deterministic mapping R ex →R su exists transforming the conditional probability P ex (r|s) into its noise-independent version P su =P NI (r|s) defined in Equation (17). Stochastic mappings R ex →R su may or may not exist, depending on the conditional probability P ex (r|s).
In addition, when a stochastic mapping R ex →R su exists, the values of the probabilities Q(r su |r ex ) may well depend on the discarded response aspect, as well as on the preserved response aspects. We mention this fact because, when assessing the relevance of noise correlations, the marginals P ex (r i |s) suffice for us to write down the surrogate distribution P su (r|s) = P NI (r|s), with no need to know the full distribution P ex (r|s) containing the noise correlations. One could have hoped that the mapping R ex →R su (assuming that such a mapping exists) could also be calculated with no knowledge of the noise correlations. This is, however, not always true, as stated in the theorem below. Two experiments with the same marginals and different amounts of noise correlations may require different mappings to eliminate noise correlations, as illustrated in the example of Figure 5. More formally:
Theorem 2. The transition probabilities Q(r su |r ex ) of stochastic codes that ignore noise correlations may depend both on the marginal likelihoods (preserved at the output of the mapping), and on the noise correlations (eliminated at the output of the mapping).
The solution of Equation (10) for the example of Figure 5 is given by Equation (20). The fact that Equation (20) bears an explicit dependence on the parameters of the full distribution P ex (r|s), and not only on P ex (L|S) and P ex (C|S), implies that the transformation between R ex and R su depends on the amount of noise correlations in R ex .
Figure 5. [...] (a) Two stimuli elicit single neuron responses (R su = R NI ) that are completely characterized by their first-spike latency (L) and spike counts (C). Both L and C are noise independent; (b) Cartesian coordinates representing a hypothetical experiment with the same marginal probabilities P ex (l|s) and P ex (c|s) as in panel (a), with one among many possible types of noise correlations between L and C; (c) Stimulus-independent stochastic function transforming the noise-correlated responses R ex of panel (b) into the noise-independent responses R su = R NI of panel (a). The transition probabilities Q(r su |r ex ) are given in Equation (20), and they bear an explicit dependence on the amount of noise correlations.

Multiple Measures to Assess the Relevance of a Specific Response Feature
The importance of a specific response feature has been previously quantified in many ways (see [17,30] and references therein), which have oftentimes led to heated debates about their merits and drawbacks [9,11,12,17,31-33]. Here we consider several measures, to underscore the diversity of the meanings with which the relevance of a given feature has been assessed so far. They are mathematically defined in Equations (21)-(29), and their relations are illustrated in Figure 6.
Figure 6. Relations between the measures defined in Equations (21)-(29). The four measures on the left are either encoding-oriented (∆I R su , on a pink background), or half-way between encoding- and decoding-oriented (the last three, gray background). The five measures on the right are all decoding-oriented (light-blue background). Each measure on the left has a conceptually related measure on the right on the same line, except for ∆I R su , which has two associated decoding-oriented measures: ∆I D and ∆I DL . The distinction between the measures on pink and on gray background relies on the fact that ∆I R su does not involve a decoding process. Instead, ∆IŜ, ∆IŜ and ∆A R su decode a stimulus (or rank the stimuli) with decoding method β. This decoding is not meant to be applicable to real experiments, since (as opposed to the truly decoding-oriented measures on the right, which operate with method α) the decoding is applied to the surrogate responses R su , not the real ones R ex .
We here describe the measures briefly, and refer the interested reader to the original papers. In Equation (21), I R ex and I R su are the mutual informations between the set of stimuli and a set of responses governed by the distributions P ex (r|s) and P su (r|s), respectively. Thus, ∆I R su is the simplest way in which the information encoded by the true responses can be compared with that of the surrogate responses. This comparison has been employed for more than six decades in neuroscience [34,35] to study, for example, the encoding of different stimulus features in spike counts, in synchronous spikes, and in other forms of spike patterns, both in single neurons and populations (see [30] and references therein).
The measure ∆I D defined in Equation (25) was introduced by Nirenberg et al. [8] to study the role of noise correlations, and was later extended to arbitrary deterministic mappings [10,12,13]. Here we use the superscript D to indicate that the measure is the "divergence" (in the Kullback-Leibler sense) between the posterior stimulus distributions calculated with the real and the surrogate responses, respectively. In [10], Nirenberg and Latham argued that the important feature of ∆I D is that it represents the information loss of a mismatched decoder trained with P su (r|s) but operated on the real responses, sampled from P ex (r|s). Before long, Schneidman et al. [9] noticed that ∆I D can exceed I R ex . The interpretation of ∆I D as a measure of information loss would then imply that decoders trained with surrogate responses can lose more information than is encoded by the real response. In fact, ∆I D tends to infinity if P su (s|r) → 0 when P(s|r) > 0 for some s. In the limit, ∆I D becomes undefined when P su (r) = 0 and P ex (r) > 0. To avoid this peculiar behavior, Latham and Nirenberg generalized the theoretical framework used to derive ∆I D [11], giving rise to the measure ∆I DL of Equation (26). Here, the superscript DL makes reference to "Divergence Lowest", since the measure was presented as the lowest possible information loss of a decoder trained with P su (r|s). In the definition of ∆I DL , the parameter θ is a real scalar. The distribution P su (s|r, θ) was defined by Latham and Nirenberg [11] as proportional to P(s) P su (r|s)^θ . This definition has several problems, as discussed in [11,17,36-39].
In Appendix B.1 we demonstrate a theorem that resolves the issues appearing in previous definitions, and justifies the use of
$$P_{su}(s|r,\theta) \propto \begin{cases} 0 & \text{if } P_{su}(r|s) = P_{ex}(r|s) = 0 \text{ for some but not all } s \\ P(s)\, P_{su}(r|s)^{\theta} & \text{otherwise} \end{cases} \qquad (30)$$
From the conceptual point of view, ∆I DL represents the information loss of a mismatched decoder trained with P su (r|s) and operated on R ex . Latham and Nirenberg [11] showed that, unlike ∆I D , it is possible to demonstrate that ∆I DL ≤ I R ex . Hence, ∆I DL never yields a tested feature encoding more information than the full response. The proof in [11] ignored a few specific cases that we discuss in Theorem A1 of Appendix B.1. Still, even in those additional cases, the inequality ∆I DL ≤ I R ex holds.
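The relation between the two divergence-based measures can be sketched as follows. We assume the usual form ∆I D = Σ P ex (s, r) log2 [P ex (s|r)/P su (s|r)], and take ∆I DL as the minimum of the same expression over θ when P su (s|r) is replaced by P su (s|r, θ). The minimization is approximated by a coarse grid search, the distributions are hypothetical, and P su (r|s) is taken strictly positive so that the special cases handled by Equation (30) do not arise.

```python
from math import inf, log2


def posterior(prior, cond, r, theta=1.0):
    """P(s|r, theta), proportional to P(s) * P(r|s)**theta."""
    weights = {s: prior[s] * cond[s][r] ** theta for s in prior}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def divergence_loss(prior, p_ex, p_su, theta=1.0):
    """Loss (bits) of a decoder built with P_su(s|r, theta), fed with R_ex."""
    loss = 0.0
    for r in range(len(next(iter(p_ex.values())))):
        if sum(prior[s] * p_ex[s][r] for s in prior) == 0:
            continue  # responses that never occur do not contribute
        post_ex = posterior(prior, p_ex, r)
        post_su = posterior(prior, p_su, r, theta)
        for s in prior:
            joint = prior[s] * p_ex[s][r]
            if joint > 0:
                if post_su[s] == 0:
                    return inf
                loss += joint * log2(post_ex[s] / post_su[s])
    return loss


# Hypothetical pair of binary neurons, responses ordered as 00, 01, 10, 11.
PRIOR = {"s1": 0.5, "s2": 0.5}
P_EX = {"s1": [0.5, 0.2, 0.2, 0.1], "s2": [0.1, 0.2, 0.2, 0.5]}
# Product of the marginals of P_EX (noise-independent surrogate).
P_SU = {"s1": [0.49, 0.21, 0.21, 0.09], "s2": [0.09, 0.21, 0.21, 0.49]}

THETAS = [k / 100 for k in range(0, 501)]  # coarse grid, includes 0 and 1
D = divergence_loss(PRIOR, P_EX, P_SU, theta=1.0)
DL = min(divergence_loss(PRIOR, P_EX, P_SU, theta) for theta in THETAS)
# With theta = 0 the surrogate posterior equals the prior, so the loss is I_ex.
I_EX = divergence_loss(PRIOR, P_EX, P_SU, theta=0.0)
print(f"Delta I_D = {D:.6f}, Delta I_DL ~ {DL:.6f}, I_ex = {I_EX:.6f}")
```

Because θ = 1 and θ = 0 belong to the grid, the approximate ∆I DL can exceed neither ∆I D nor I R ex , which mirrors the bounds discussed above.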
In Equations (22) and (23), Ŝ and Ŝ denote a sorted stimulus list and the most-likely stimulus, respectively, both decoded by evaluating Equation (6) (or its ranked version) on a response r sampled from the surrogate distribution P su (r|s) (method β). Estimating mutual informations using decoders can be traced back at least to Gochin et al. [40], and comparing the estimations of two decoders that take different response features into account, at least to Warland et al. [41].
The measures ∆IŜ and ∆IŜ are paired with ∆I LS and ∆I B , respectively, since the latter are obtained from the former when replacing the decoding method from β to α. The measure ∆I LS was introduced by Ince et al. [20], and quantifies the difference between the information in R ex and that in the output of decoders that, after observing a variable r sampled with distribution P ex (r|s) (method α), produce a stimulus list sorted according to P su (s|r). The superscript LS indicates "List of Stimuli". Similarly, ∆I B quantifies the difference between the information encoded in R ex and that encoded in the output of a decoder trained by inserting P su (s|r) into Equation (6), and operated on r sampled with distribution P ex (r|s) (method α). The superscript B stands for the "Bayesian" nature of the involved decoder. The use of these measures can be traced back at least to Nirenberg et al. [8], although in that case, decoders were restricted to be linear. The measure ∆IŜ of Equation (22) is new, and we have introduced it here as the homologue of ∆I LS . When the number of stimuli is two, ∆IŜ=∆IŜ, since selecting the optimal stimulus is (as a computation) in one-to-one correspondence with ranking the two candidate stimuli.
The accuracy loss ∆A R su defined in Equation (24) entails the comparison between the performance of two decoders, one trained with and applied on R ex , and one trained with and applied on R su . Such comparisons also have a long history in neuroscience [42,43] (see [9,12] for further discussion). The accuracy loss ∆A B also compares two decoders. The first is the same as for ∆A R su , but the second is trained with R su and applied on R ex .
The measures ∆I LS , ∆I B , and ∆A B are undefined if the actual responses R ex are not contained in the set of surrogate responses R su . In other words, a decoder constructed with P su (r|s) does not know what output to produce when evaluated on a response r for which P su (r) = 0. This situation never happens when evaluating the relevance of noise correlations with P su = P NI , but it may well be encountered in more general situations, as for example, in Figure 3b.

Relating the Values Obtained with Different Measures
If a mapping R ex →R su exists transforming P ex (r|s) into P su (r|s), we may use the decoding procedure of Equation (6) to construct the transformation chain R ex →R su →Ŝ→Ŝ [17,44]. Consequently, ∆I R su , ∆IŜ and ∆IŜ can be interpreted as accumulated information losses after the first, second and third transformations, respectively, and ∆A R su , as the accuracy loss after the first transformation. The data processing theorems (Section 2.1.3) ensure that these measures are never negative. This property, however, cannot be guaranteed in the absence of a reduced transformation R ex →R su , stochastic or deterministic. Indeed, in the example of Figure 4b, if both stimuli are equiprobable, and both responses R ex associated with are equiprobable, then ∆I R su = ∆IŜ = ∆IŜ ≈ − 79 % of I R ex ≈ 0.31 bits, implying that the surrogate responses encode more information about the stimulus than the original, experimental responses. Removing the correlations between spike count and latency, hence, increases the information, so correlations can be concluded to be detrimental to information encoding.
Irrespective of whether a (deterministic or stochastic) mapping R ex →R su exists, the data processing inequality guarantees that ∆I R su ≤ ∆IŜ ≤ ∆IŜ, sinceŜ is a deterministic function of R su , andŜ is a deterministic function ofŜ. The inequality holds irrespective of the sign of each measure.
All decoding-oriented measures are guaranteed to be non-negative. The very definitions of ∆I D and of ∆I DL imply they cannot be negative, since they are both Kullback-Leibler divergences between two probability distributions. The sequence of reduced transformations R ex →Ŝ → S, in turn, guarantees the non-negativity of ∆I LS , ∆I B and ∆A B , through the data processing inequalities.
In order to assess whether decoding-oriented measures are always larger or smaller than their encoding (or gray) counterparts, we performed a numerical exploration comparing each encoding/gray-oriented measure with its decoding-oriented homologue.
Figure 7. [...] and wrapped to the interval [0, 2π). The encoding process is followed by a circular phase-shift that transforms R ex =Φ into another code R su =Φ with transition probabilities Q(r su |r ex ) defined by Equation (31). The set of all R su coincides with the set of all R ex ; (b) Same as (a), except that stimuli are four (the letters A and B, each in one of two frames), and phases are measured with respect to a cycle of 30 ms period and discretized in intervals of size π/3. The encoding process is followed by a stochastic transformation (lines on the right) that introduces jitter, thereby transforming R ex =Φ into another code R su =Φ with transition probabilities Q(r su |r ex ) defined by Equation (32).
An important issue is to identify the situations in which ∆I R su gives exactly the same result as either ∆I D or ∆I DL . It is not easy to determine the conditions for the equality between ∆I R su and ∆I DL . Yet, for the equality between ∆I R su and ∆I D , and in the specific case in which P su (r|s) = P NI (r|s) as given by Equation (17), the following theorem holds.
Theorem 3. [...] Moreover, λ ≶ 0 implies that ∆I D ≶ ∆I R su .
Equation (33) implies that neither the prior stimulus probabilities P(s) nor the conditional response probabilities P ex (r|s) intervene in the condition for the equality, beyond the effect they have in fixing the value of P ex (r) and P su (r). Each response r makes a contribution to the value of λ, which favours ∆I D whenever P su (r) > P ex (r), and I R ex in the opposite case. As pointed out by [10], all responses r for which P ex (r) = 0 and P su (r) > 0 give a null contribution to ∆I D , and a negative contribution to I R ex , implying that correlations in such responses are irrelevant for decoding, and detrimental to encoding.
The fact that encoding-oriented measures neither bound nor are bounded by decoding-oriented measures is a daunting result. If, when working in a specific example, one gets a positive value with one measure and a negative value with another, the interpretation must carefully distinguish between the two paradigms. One may wonder, however, whether such a distinction is also required when correlations are absolutely essential for one of the measures, in that they capture the whole of the encoded information. Could the other measure conclude that they are irrelevant? Or that they are only mildly relevant? Luckily, in this case, the answer is negative. In other words, when the tested feature is fundamental, ∆I DL and ∆I R su coincide, and no conflict arises between encoding and decoding, as proven by the following theorem:
Theorem 4. ∆I DL =I R ex if and only if ∆I R su =I R ex , regardless of whether stochastic codes exist that map the actual responses R ex into the surrogate responses R su =R NI generated assuming noise independence.

Proof. See Appendix B.5.
The conclusion is that if a given feature is 100% relevant for encoding, then it is also 100% relevant for decoding, and vice versa. Hence, although ∆I R su and ∆I DL often differ in the relevance they ascribe to a given feature, the discrepancy is only encountered when the tested feature is not the only informative feature in play. When the removal of the feature is catastrophic (in the sense that it brings about a complete information loss), then both ∆I R su and ∆I DL diagnose the situation equally.

Relation between Measures Based on Decoding Strategies α and β
The results of Table 1 may seem puzzling because decoding happens after encoding. Therefore, one may naively reason, the data processing theorems should have prevented ∆I R su from surpassing ∆I D , ∆I DL , or ∆I B , and ∆A R su from surpassing ∆A B . However, even though decoding indeed happens after encoding, the data processing theorem is not violated. The theorem certainly ensures that ∆I R su and ∆A R su constitute lower bounds for measures related to decoders that operate on responses generated by P su (r|s), but not for measures related to decoders that operate on responses generated by P ex (r|s), as happens with ∆I D , ∆I DL , ∆I B , and ∆A B .
This observation about the validity of the data processing inequality is different from the one discussed in Section 2.2. There, we discussed the conditions under which ∆I R su could be guaranteed to be non-negative, the crucial factor being the existence of a stochastic mapping R ex →R su . Now we are discussing a different aspect, regarding whether decoding-related measures can or cannot be bounded by encoding-oriented measures. The conclusion is that in general terms, the answer is negative, because decoding-related measures operate with decoding strategy α, a strategy never addressed by the encoding measures. The surrogate variable R su participating in the encoding measure ∆I R su is not the response decoded by the measures of Equations (25)-(28), so the data processing inequalities need not hold. That being said, there are specific instances in which both types of measures coincide, two of them discussed in Theorems 3 and 4 and a third case later in Theorem 5.
Other explanations have been given in the literature for the fact that decoding-oriented measures sometimes yield smaller losses than their encoding counterparts. For example, it has been alleged [10] that when ∆I D , ∆I DL or ∆I B are smaller than ∆I R su , this is either due to (a) the impossibility to define a stimulus-independent reduction R ex →R su that yields P ex (r|s)→P su (r|s) (and therefore the data-processing inequality is not guaranteed to hold), or due to (b) the fact that surrogate responses often sample values of response space that are never reached by real responses (and therefore, the losses of matched decoders may be larger than the ones of mismatched ones). However, Figure 2c constitutes a counterexample to both arguments, since there, the stimulus-independent stochastic reduction exists, and the response sets of R ex and R su coincide.
One could also wonder whether the discrepancy between the values obtained with encoding-oriented measures and decoding-oriented measures only occurs in examples where a stochastic reduction R ex →R su exists, and the involved transition matrix Q(r su |r ex ) depends on the joint probabilities P ex (r, s), and not only on the marginals, as discussed in Theorem 2. However, Figure 2b,c provide examples in which Q(r su |r ex ) does not depend on P(r, s), and yet, the discrepancies are still observed.
The distinction between decoding strategies α and β is also crucial when using the measure ∆I D . This measure was introduced by Nirenberg et al. [8] for the specific case in which the tested feature is the amount of noise correlations, that is, when P su (s|r)=P NI (s|r). The measure was later extended to arbitrary deterministic mappings R su = f (R ex ) [10,12,13], with the instruction to use an expression like Equation (25), but with P su (s|r) replaced by P(s|R su = f (r)) = P su (s| f (r)). It should be noted, however, that as soon as this replacement is made, ∆I D becomes exactly equal to ∆I R su . Specifically, the measure ∆I D now describes the information loss of a decoder that operates on a response variable generated with the surrogate distribution P su (r|s) (decoding method β). If we want to keep the original spirit, and associate ∆I D with a decoder that operates on a response variable generated with the real distribution P ex (r|s) (decoding method α), then P su (s|r) in Equation (25) should not be modified. Only the evaluation of the surrogate variable R su at the experimentally observed value R ex = r describes a mismatched decoder constructed with P su (r|s) and operated on R ex (mathematical details in Appendix C).

Assessing the Type of Information Encoded by Individual Response Features
When the stimulus contains several attributes (such as shape, color, sound, etc.), by removing a specific response feature it is possible to assess not only how much information is encoded by the feature, but also what type of information. Identifying the type of encoded information implies determining the stimulus feature represented by the tested response feature. As shown in this section, the type of encoded information is as dependent on the method of removal as is the amount. In other words, the different measures defined in Equations (21)-(29) sometimes associate a feature with the encoding of different stimulus attributes.
In the example of Figure 8, we use four compound stimuli S=[S F , S L ], generated by choosing independently a frame (S F , one of two frames) and a letter (S L = A or B), thereby yielding four framed letters. Stimuli are transformed into neural responses R = [L, C] with different numbers of spikes (1 ≤ C ≤ 5) fired at different first-spike latencies (1 ≤ L ≤ 4; time has been discretized in 5 ms bins). Latencies are only sensitive to frames whereas spike counts are only sensitive to letters, thereby constituting independent-information streams: P(s, r) = P(s F , l) P(s L , c) [33].
The equality in the numerical value of two measures does not imply that both measures assign the same meaning to the information encoded by the tested response feature. Indeed, the two measures may sometimes report the tested response feature to encode two different aspects of the set of stimuli. Consider a decoder that is trained using the noisy data R su shown in Figure 8a, but is asked to operate on either the same noisy data with which it was trained (strategy β), or on the quality data R ex of Figure 8b (strategy α). The information losses ∆I R su , ∆I D , and ∆I DL are all equal to 50 % of I(S, R ex ) = 2 bits. Therefore, the information loss is independent of whether, in the operation phase, the decoder is fed with responses generated with P su (r|s) or with P ex (r|s).
Figure 8. [...] (c) Stimulus-independent stochastic transformation with transition probabilities Q(r su |r ex ) given by Equation (34), that introduces independent noise both in the latencies and in the spike counts, thereby transforming R ex into R su and rendering R su a stochastic code; (d) Degraded data Ȓ obtained by adding latency noise to the quality data; (e) Representation of the stimulus-independent stochastic transformation R ex →Ȓ with transition probabilities Q(ȓ|r ex ) given by Equation (35) that adds latency noise in panel (d).
The transformation Q(r su |r ex ) causes some responses R su to occur for all stimuli, so when decoding with method β, some information about frames is lost (that is, I(S F , R su ) ≈ 33 % of I(S F , R ex ) = 1 bit), as well as some information about letters (that is, I(S L , R su ) ≈ 67 % of I(S L , R ex ) = 1 bit). In other words, decoding R su causes a partial information loss ∆I R su that is composed of both frame and letter information. Instead, when decoding R ex with method α, there is no information loss about letters: for the responses R ex that actually occur, the decoder trained with R su can perfectly identify the letters, because P su (C = 2|S L = A) = P su (C = 4|S L = B) = 1. The information about frames, on the other hand, is completely lost, since P su (l|s F ) takes the same value for both frames whenever l adopts a value that actually occurs in R ex , namely 2 or 3. This example shows that the fact that two decoding procedures give the same numerical loss does not mean that they draw the same conclusions regarding the role of the tested feature in the neural code. Analogous computations yield analogous results for the hypothetical experiment shown in Figure 7b.

Conditions for Equality of the Amount and Type of Information Loss Reported by Different Measures
We now derive the conditions under which encoding/gray-oriented measures coincide with their decoding-oriented counterparts, as observed in Figures 2a and 3d. That is, we derive the conditions under which the equalities of Equations (37)-(40) hold, in which each encoding/gray-oriented measure equals its decoding-oriented homologue of Figure 6. The example in Figure 7a showed that the existence of deterministic mappings does not suffice for a qualitative and quantitative equivalence of different measures. Furthermore, the example of Figure 3b showed that the equalities require the space of R su to include the space of R ex , or else the decoding method α may be undefined. We demonstrate that Equations (37)-(40) hold, and moreover, that there is no discrepancy in the type of information assessed by these different measures, whenever the mapping from R ex into R su can be described using positive-diagonal idempotent stochastic matrices [45]. Specifically, we prove the following theorem:
Theorem 5. Consider a stimulus-independent stochastic function f from a representation R ex into another representation R su , such that the range R of R su includes that of R ex , and with transition probabilities Q(r su |r ex ) that can be written as positive-diagonal idempotent right stochastic matrices with row and column indices that enumerate the elements of R in the same order. Then, Equations (37)-(40) hold.
The theorem states that the equalities of Equations (37)-(40) can be guaranteed whenever the removal of the tested response feature involves a (deterministic or) stochastic mapping R ex →R su that induces a partition within the set of real responses R ex , and R su is obtained by rendering all responses inside each partition indistinguishable (but not across partitions). To sample R su , the probabilities of individual responses inside each partition are re-assigned, rendering their distinction uninformative [30].
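The hypotheses of Theorem 5 on the transition matrix are easy to check numerically. The sketch below tests whether a square matrix is right stochastic, idempotent, and has a strictly positive diagonal. The two example matrices are hypothetical: the first shuffles two responses within a block (with an arbitrary parameter a = 0.3), whereas the second is a cyclic shift, which is deterministic but neither idempotent nor positive on the diagonal.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def satisfies_theorem_5_conditions(q, tol=1e-12):
    """Square, right stochastic, idempotent, with strictly positive diagonal."""
    n = len(q)
    if any(len(row) != n for row in q):
        return False
    stochastic = all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in q)
    positive_diagonal = all(q[i][i] > 0 for i in range(n))
    q2 = matmul(q, q)
    idempotent = all(abs(q2[i][j] - q[i][j]) <= tol for i in range(n) for j in range(n))
    return stochastic and positive_diagonal and idempotent


A = 0.3  # arbitrary shuffling parameter, 0 < A < 1
Q_BLOCK = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, A, 1 - A, 0.0],
    [0.0, A, 1 - A, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
Q_SHIFT = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

print(satisfies_theorem_5_conditions(Q_BLOCK))  # block-wise shuffle
print(satisfies_theorem_5_conditions(Q_SHIFT))  # cyclic shift
```

The block-wise shuffle passes the test because all rows within a block are identical, which is what makes the responses inside each partition indistinguishable.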
This theorem provides sufficient but not necessary conditions for the equalities to hold. The important aspect, however, is that it ensures that the equalities hold not only in numerical value, but also, in the type of information that different measures ascribe to the tested feature. Two different methods preserve or lose information of different type if, when decoding a stimulus, the trials with decoding errors tend to confound different attributes of the stimulus, as in the example of Figure 8. The conditions of Theorem 5, however, ensure that the strategies α and β always decode exactly the same stimulus (see Appendix B.6), so there can be no difference in the confounded attributes. Pushing the argument further, one could even argue that responses (real or surrogate) encode more information than the identity of the stimulus that originated them. For a fixed decoded stimulus, the response still contains additional information [46], that refers to (a) the degree of certainty with which the stimulus is decoded, and (b) the rank of the alternative stimuli, in case the decoded stimulus was mistaken [20]. Both meanings are embodied in the whole rank of a posteriori probabilities P su (s|r), not just the maximal one. Yet, under the conditions of the theorem, the entire rankings obtained with methods α and β coincide (see Appendix B.6). Therefore, even within this broader interpretation, there can be no difference in the qualitative aspects of the information preserved or lost by one and the other.
For example, in Figure 7b, we found that all information losses are equal (that is, ∆IR, ∆IŜ, ∆IŜ, ∆I D , ∆I DL , ∆I LS , and ∆I B are all 50 %), and both accuracy losses are equal (that is, ∆AR and ∆A B are both ≈67 %). However, the conditions of Theorem 5 do not hold. The matrix of Equation (32) is not block-diagonal, nor can it be brought into that shape by incorporating new rows (to make it square) and permuting both rows and columns in such a way that the response vectors are enumerated in the same order by both indices. For this reason, the losses are not guaranteed to be of the same type.

Improving the Performance of Decoders Operating with Strategy α
In a previous paper [17], we demonstrated that neither ∆I D nor ∆I DL constitutes a lower bound on the information loss induced by decoders constructed by disregarding the tested response feature. This means that decoders may exist that perform better than D su (r), defined in Equation (6). In this section we discuss one possible way in which some of these improved decoders may be constructed, inspired by the example of Figure 8. Quite remarkably, the construction involves the addition of noise to the real responses before feeding them to the decoder of Equation (6). Panel (a) of Figure 8 shows a decoder constructed with noisy data (R su ), and then employed to decode quality data (R ex ; Figure 8b), thereby yielding information losses ∆I D = ∆I DL = 50 %. These losses can be decreased by feeding the decoder with a degraded version Ȓ of the quality data (Figure 8d), generated through a stimulus-independent transformation that adds latency noise (Figure 8e). Decoding R ex as if it were R su by first transforming R ex into Ȓ results in ∆I D = ∆I DL ≈ 33 %, thereby recovering 33 % of the information previously lost. On the contrary, adding spike-count noise will tend to increase the losses. Thus, adding suitable amounts and types of noise can increase the performance of approximate decoders, and the result is not limited to the case in which the response aspect is the amount of noise correlations. In addition, this result also indicates that, contrary to what was previously thought [47], decoding algorithms need not match the encoding mechanisms to perform optimally from an information-theoretical standpoint. All these results are a consequence of the fact that decoders operating with strategy α are not optimal, so it is possible to improve their performance by deterministic or stochastic manipulations of the response.
In practice, our results open up the possibility of increasing the efficiency of decoders constructed with approximate descriptions of the neural responses, usually called approximate or mismatched decoders, by adding suitable amounts and types of noise to the decoder input.
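The effect of adding noise before a mismatched decoder can be sketched with a toy calculation. The example below is not the experiment of Figure 8: the two-stimulus code `p_ex`, the deliberately poor decoder posterior `post_su`, and the flip-noise kernel are all hypothetical. It also assumes that the loss after the transformation is computed as in Equation (A7), with the second average weighted by the joint distribution of stimuli and transformed responses.

```python
from math import log2

STIMULI = (0, 1)
RESPONSES = (0, 1)

# Hypothetical real code P ex(s, r): the response tends to equal the stimulus.
p_ex = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Hypothetical mismatched decoder posterior P su(s | r), deliberately poor.
post_su = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.8, (1, 1): 0.2}


def posterior(joint):
    """P(s | r) from a joint table {(s, r): p}."""
    p_r = {r: sum(joint[s, r] for s in STIMULI) for r in RESPONSES}
    return {(s, r): joint[s, r] / p_r[r] for s in STIMULI for r in RESPONSES}


def add_flip_noise(joint, eps):
    """Stimulus-independent noise: each response is flipped with probability eps."""
    return {(s, r): (1 - eps) * joint[s, r] + eps * joint[s, 1 - r]
            for s in STIMULI for r in RESPONSES}


def decoding_loss(eps):
    """Loss in bits when the decoder built on post_su reads the noisy responses."""
    post_ex = posterior(p_ex)
    matched = sum(p * log2(post_ex[k]) for k, p in p_ex.items() if p > 0)
    noisy = add_flip_noise(p_ex, eps)
    mismatched = sum(p * log2(post_su[k]) for k, p in noisy.items() if p > 0)
    return matched - mismatched


loss_clean = decoding_loss(0.0)  # decoder fed with the real responses
loss_noisy = decoding_loss(0.5)  # decoder fed with heavily degraded responses
print(f"loss without noise: {loss_clean:.3f} bits, with noise: {loss_noisy:.3f} bits")
```

In this toy case the loss drops from about 1.2 bits to about 0.6 bits when the noise is added, although it remains larger than the encoded information, so the sketch only illustrates the direction of the effect, not the magnitudes reported for Figure 8.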

Relation to Decomposition-Based Methods
Many measures of different types have been developed to assess how different response features of the neural code interact with each other. Some are based on direct comparisons between the information encoded by individual features, or collections of features (see for example [48][49][50], to cite just a few among many). Others distinguish between two or more potential dynamical models of brain activity [51], for example, by differentiating between conditional and unconditional correlations between neurons in the frequency domain [52]. Yet others rely on decompositions or projections based on information geometry. In those, the mutual information between stimuli and responses I R is broken down as I R = ∑ i I R i + Synergy Terms + Redundancy Terms, where I R i represents the information contributed by the individual response feature R i , and the remaining terms incorporate the synergy or redundancy between them. In the original approaches [53][54][55][56][57], the terms I R i represented the information I(R i ; S) encoded in single response aspects, irrespective of what is encoded in other aspects. In later studies [58][59][60][61][62], these terms accounted for the information that is only encoded in individual aspects, taking care to exclude whatever is redundant with other aspects. The approach discussed in this paper is in line with the studies of Nirenberg et al. [8] and Schneidman et al. [9] and all their consequences. This line has some similarities and some discrepancies with the decomposition-based studies. We here comment on some of these relations.
- First, the measure ∆I R su quantifies the relevance of a given feature with the difference I R ex − I R su . When the surrogate response R su is equal to the original response R ex with just a single component R i eliminated, ∆I R su is equal to I(R i ; s|R̄ i ), where R̄ i is the collection of all response aspects except R i . In this case, ∆I R su coincides with the sum of the unique and the synergistic contributions of the dual decompositions in the newest set of methods [63].
- Second, when assessing the relevance of a given response feature, we are often inclined to draw conclusions about the cost of ignoring the tested feature when aiming to decode the original stimulus. As shown in this paper, those conclusions depend not only on how stimuli are encoded, but also on how they are decoded. The decomposition-based methods are mainly focused on the encoding problem, so they are less suited to draw conclusions about decoding.
- Finally, as discussed in Figure 8, not only the amount of (encoded or decoded) information matters, but also its type. Decomposition-based methods, although not yet reaching a full consensus in their formulation, provide a valuable attempt to characterize how both the type and the amount of information are structured within the set of analyzed variables, in a way that is complementary to the present approach, specifically in analyzing the structure of the lattices obtained by associating different response features [58,63].
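The first relation above follows from the chain rule of mutual information and can be checked numerically. The sketch below uses a hypothetical two-feature code in which the stimulus is the XOR of two independent binary features, so that all the information is synergistic; the function and variable names are illustrative only.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(joint):
    """I(X;Y) in bits for a joint table {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)


# Hypothetical code: two independent, uniform binary features; s = r1 XOR r2.
p_full = {(r1 ^ r2, (r1, r2)): 0.25 for r1, r2 in product((0, 1), repeat=2)}

# Reduced response: drop R1 and keep only R2.
p_reduced = defaultdict(float)
for (s, (r1, r2)), p in p_full.items():
    p_reduced[s, r2] += p

# Conditional information I(R1; S | R2) = sum over r2 of P(r2) I(R1; S | R2 = r2).
p_r2 = defaultdict(float)
for (s, (r1, r2)), p in p_full.items():
    p_r2[r2] += p
cond_info = 0.0
for value, weight in p_r2.items():
    slice_joint = {(s, r1): p / weight
                   for (s, (r1, r2)), p in p_full.items() if r2 == value}
    cond_info += weight * mutual_information(slice_joint)

# Information lost by removing R1 from the full response.
delta_i = mutual_information(p_full) - mutual_information(p_reduced)
print(f"loss = {delta_i:.3f} bits, conditional information = {cond_info:.3f} bits")
```

Here the remaining feature alone carries no information, so the whole bit is attributed to the removed feature through its interaction with the other one.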

The Problem of Limited Sampling
Throughout the paper we assumed that the distribution P ex (s, r) is known, or is accessible to the experimenter. In the examples, when we calculated information values, we plugged the true distributions into the formulas, without discussing the fact that such distributions may not be easily estimated with finite amounts of data. Whichever method is used to estimate P ex (s, r), the outcome is, to a larger or lesser degree, no more than an approximation. Hence, even I R ex (which is supposed to be the full information) is estimated approximately. Since P su (s, r) is a modified version of P ex (s, r), P su (s, r) can likewise only be estimated approximately. Information measures, including Kullback-Leibler divergences, are highly sensitive to variations in the involved probabilities [20,32,64-69], and the latter are unavoidable in high-dimensional response spaces. The assessment of the relevance of a given feature, hence, requires experiments that contain sufficient samples so as to ensure that the correcting methods work. When the response space is large, the measures ∆I S , ∆I B and the loss of accuracies are less sensitive to limited sampling than ∆I R su , ∆I D and ∆I LD .
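The sensitivity to limited sampling can be illustrated with the plug-in estimator, in which the empirical frequencies are inserted directly into the information formula. The sketch below assumes a hypothetical experiment with 4 stimuli and 8 responses that are statistically independent, so the true information is zero and any positive estimate is pure sampling bias. The trial counts and the random seed are arbitrary choices.

```python
import random
from collections import Counter
from math import log2


def plugin_information(samples):
    """Plug-in estimate of I(S;R) in bits from a list of (s, r) pairs."""
    n = len(samples)
    n_sr = Counter(samples)
    n_s = Counter(s for s, _ in samples)
    n_r = Counter(r for _, r in samples)
    return sum(c / n * log2(c * n / (n_s[s] * n_r[r]))
               for (s, r), c in n_sr.items())


def mean_estimate(n_trials, n_repeats, rng, n_stimuli=4, n_responses=8):
    """Average plug-in estimate over repeated simulated experiments."""
    total = 0.0
    for _ in range(n_repeats):
        samples = [(rng.randrange(n_stimuli), rng.randrange(n_responses))
                   for _ in range(n_trials)]
        total += plugin_information(samples)
    return total / n_repeats


rng = random.Random(0)
bias_small = mean_estimate(20, 20, rng)    # few trials per experiment
bias_large = mean_estimate(5000, 20, rng)  # many trials per experiment
print(f"mean estimate, 20 trials: {bias_small:.3f} bits")
print(f"mean estimate, 5000 trials: {bias_large:.4f} bits")
```

The estimate obtained with few trials is substantially above zero, whereas the one obtained with many trials is close to the true value, which is why differences between information measures require either large samples or bias-correction methods.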
In addition, the problem of finite sampling can also be formulated as an attempt to determine the relevance of the feature "Accuracy in the estimation of P ex (r|s)". This feature is not a property of the nervous system, but rather, of our ability to characterise it. Still, the framework developed here can also handle this methodological problem. The estimated distribution can be interpreted as a stochastic modification P su (r|s) of the true distribution P ex (r|s). As long as the caveats discussed in this paper are taken into account, the measures of Equations (21)-(29) may serve to evaluate the cost of modeling P ex (r|s) out of finite amounts of data.

Conclusions
Several measures have been proposed in the literature to assess the relevance of specific response features in the neural code. All proposals are based on the idea that by removing the tested feature from the response, the neural code deteriorates, and the lost information is a useful measure of the relevance of the feature. In this paper, we demonstrated that the neural code may or may not deteriorate when removing a response feature, depending on the nature of the tested feature, and on the method of removal, in ways previously unseen. First, we determined the conditions under which the data processing inequality can be invoked. Second, we showed that decoding-oriented measures may result in larger or smaller losses than their encoding (or gray) counterparts, even for response aspects that, unlike noise correlations, can be modeled as stimulus-independent transformations of the full response. Third, we demonstrated that both types of measures coincide under the conditions of Theorem 5. Fourth, we showed that evaluating the role of a response feature in the neural code involves not only an assessment of its contribution to the amount of encoded information, but also, to the meaning of that information. Such meaning is as dependent as the amount on the measure employed to assess it. Finally, our results open up the possibility that simple and cheap decoding strategies, based on the addition of an adequate type and amount of noise, be more efficient and resilient than previously thought. We conclude that the assessment of the relevance of a specific response feature cannot be performed without a careful justification for the selection of a specific method of removal.

Conflicts of Interest:
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. On the Information and Accuracy Differences
Each value in Table 1 (except for those associated with Figure 3b; see below) was computed using the Nelder-Mead simplex algorithm for optimization, as implemented by the function fminsearch of Matlab 2016. For accuracy reasons, only examples in which I R ex ≥ 10^−6 bits and A R ex R ex ≥ 10^−6 were considered. Furthermore, the parameters defining the joint stimulus-response probabilities and the transition matrices were restricted to the interval [0.05, 0.95]. Each difference between two measures defined in Equations (21)-(29) was computed repeatedly, with random initial values for the stimulus-response probabilities and the transition matrices, until the value of the difference failed to increase or decrease in 20 consecutive runs.
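The restart procedure described above can be sketched as follows. This is not the code used for Table 1: a simple compass search stands in for the Nelder-Mead simplex of fminsearch, the objective `toy_difference` is a hypothetical two-peak function replacing the actual difference between two measures, and the names `patience` and `min_gain` are illustrative. Only the interval [0.05, 0.95] and the 20-run stopping rule are taken from the text.

```python
import random

LOW, HIGH = 0.05, 0.95  # parameter interval quoted in the text


def toy_difference(x):
    """Hypothetical stand-in for a difference between two measures (two local maxima)."""
    a, b = x
    peak_1 = 1.0 - 10.0 * ((a - 0.2) ** 2 + (b - 0.2) ** 2)
    peak_2 = 2.0 - 10.0 * ((a - 0.8) ** 2 + (b - 0.8) ** 2)
    return max(peak_1, peak_2)


def compass_search(f, x0, step=0.2, tol=1e-6):
    """Derivative-free local maximization, used here instead of Nelder-Mead."""
    x, fx = list(x0), f(x0)
    while step >= tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = list(x)
                y[i] = min(HIGH, max(LOW, y[i] + delta))  # stay inside the interval
                fy = f(y)
                if fy > fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2.0  # refine the search once no neighbour is better
    return x, fx


def maximize_with_restarts(f, dim, rng, patience=20, min_gain=1e-9):
    """Restart from random points until `patience` consecutive runs fail to improve."""
    best_x, best_f, stale = None, float("-inf"), 0
    while stale < patience:
        x0 = [rng.uniform(LOW, HIGH) for _ in range(dim)]
        x, fx = compass_search(f, x0)
        if fx > best_f + min_gain:
            best_x, best_f, stale = x, fx, 0
        else:
            stale += 1
    return best_x, best_f


best_x, best_f = maximize_with_restarts(toy_difference, 2, random.Random(1))
print(best_x, best_f)
```

The same loop with the sign of the objective reversed yields the minimum of the difference, which is how both the upper and the lower extremes of a table entry could be explored.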
The values in Table 1 for Figure 3b were computed analytically with P ex ([3, 2]) > 0 or P ex ([4, 2]) > 0, but not both. In those cases, the measures ∆I D , ∆I B , and ∆A B are undefined, whereas ∆I DL = 100 %, for the reasons given in Section 2.4. However, ∆I R su and ∆IŜ can vary between 0 % and 100 %, for example, attaining 0 % when P ex ([3, 1])→0, and 100 % when P ex ([2, 1])→0 and P ex ([4, 2])→0. The information I R ex equals the stimulus entropy, regardless of the response probabilities. The values in Table 1 for Figure 3d were computed by setting b = c = d = 0.5 in Equation (16). The values in Section 2.4 for Figure 7b were obtained by setting P ex (s, r) = 1/4 for the stimulus-response pairs shown in the figure, and are valid for any transition probability matrix set as in Equation (32) with b=a. The values in Section 2.4 for Figure 8 were obtained by setting P ex (s, r) = 1/4 for the stimulus-response pairs shown in the figure.

Appendix B. Proofs
Appendix B.1. Derivation of Equation (30)
The definition of ∆I DL involves the probability P su (s|r, θ), defined in [11,36,38] as proportional to P(S) ∏ i P(R i |S)^θ, where the exponent θ is chosen so as to minimize ∆I DL . This definition has recently been shown to be invalid when ∃ r,s such that P su (r|s) = 0 for a stimulus s or a response r for which P ex (r|s) = 0 [17]. This problem never appears when evaluating the relevance of noise correlations with P su (r|s) = P NI (r|s), as stated by Equation (17). Yet, it may well appear in more general cases, including those arising from stochastically reduced codes. To overcome it, we prove the following theorem.

Theorem A1. The probability P(s|r, θ) that appears in the definition of ∆I DL is

$$
P^{su}(s|r,\theta) \propto
\begin{cases}
P(s) & \text{if } \exists\, \hat{s}, r \text{ such that } P^{ex}(r|\hat{s}) > P^{su}(r|\hat{s}) = 0 \\
0 & \text{if } P^{su}(r|s) = P^{ex}(r|s) = 0 \text{ for some but not all } s \\
P(s)\, P^{su}(r|s)^{\theta} & \text{otherwise}
\end{cases}
$$

Proof. According to Latham and Nirenberg [11], the probability P su (s|r, θ) is the one that minimizes the Kullback-Leibler divergence D KL [P * (r, s)||p(r)p(s)] with respect to the distribution P * (r, s), subject to the constraints

$$ \left\langle \log_2 P^{su}(r|s) \right\rangle_{P^{*}(r,s)} = \left\langle \log_2 q(r|s) \right\rangle_{P(s,r)} \qquad \text{(A1)} $$

$$ \sum_{s} P^{*}(r,s) = P(r). $$

The minimization problem can be formulated in terms of an objective function to be minimized, in which the constraints appear with Lagrange multipliers, and θ is the one accompanying Equation (A1). Using the standard conventions that 0 log 0 = 0 and x log 0 = ∞ for x > 0, Equation (A1) is fulfilled if ∃r,ŝ such that P(ŝ|r, θ) > 0 if P ex (r,ŝ) > P su (r|ŝ) = 0. The first part of the theorem immediately follows by solving Equation (B15) in [11] as there indicated with β = 0. If ∄ r,ŝ such that P ex (r,ŝ) > P su (r|ŝ) = 0, then Equation (A1) is fulfilled only if P(s, r|θ) = 0 when P su (r|s) = P ex (r, s) = 0. The second and third parts of the theorem immediately follow using Bayes' rule.

Appendix B.2. Proof of Theorem 1
Proof. The first part was proved in [9], at least for cases in which the set of the surrogate responses R su = R NI differs from the set of the real responses R ex . When they both coincide, we can prove the first part by contradiction, assuming that a deterministic mapping exists from R ex into R NI . If both variables sample the same response space, the deterministic mapping must be one-to-one, otherwise the variable R NI would sample a smaller set. Therefore, both R NI and R ex maximize the conditional entropy given S over the probability distributions with the same marginals, since one-to-one mappings do not modify the entropy, and R NI is defined as the distribution with maximal conditional entropy with fixed marginals. Because the probability distribution achieving this maximum is unique [16], P su (r|s) and P ex (r|s) must be the same, thereby proving the theorem.

Appendix B.3. Proof of Theorem 2
Proof. We prove the dependency on the marginal likelihoods by computing Q(r su |r ex ) for the hypothetical experiment of Figure 4a, and observing that the result depends on the marginal likelihood P ex (L|s). To that end, we rewrite Equation (10) for R su = [1,2] as Using this and rearranging the terms, we obtain the quadratic equation  [2,1]). Hence, any change in P ex (L = 1| ) must be followed by some change in Q(r su |r ex ), thereby proving the first part.
We prove the dependency on the noise correlations by computing Q(r su |r ex ) for the hypothetical experiment of Figure 5, and observing that the result not only depends on the marginal likelihoods P ex (L|s) and P ex (C|s), but in many cases, it also depends on the joint distributions P ex (L, C|s). Hence, varying the amount of noise correlations, even if keeping the marginals fixed, yields a variation in the mapping Q(r su |r ex ).
We proceed by reductio ad absurdum. If Q(r su |r ex ) does not depend on the amount of noise correlations in P ex (r|s), we may assume that if we vary P ex (r|s) but keep the marginals P ex (r i |s) fixed, the transition probabilities Q(r su |r ex ) remain unchanged. Under this hypothesis, Equation (10) is valid for many choices of P ex (r|s). In this context, consider the set of all response distributions with the same marginals as P ex (r|s) that can be turned into P su (r|s) through Q(r su |r ex ). This set includes P su (r|s), and therefore, Q(r su |r ex ) should be able to transform P su (r|s) into itself. In addition, Property 2 requires that Q(r su |[2, 2]) = 0 when r su = [2, 2] because either P(r su | ) = 0 or P(r su | ) = 0 for those responses.  [1,2], [2,1]}. Consequently, the resulting Q(r su |r ex ) yields through Equation 10 that P su ([2, 2]| ) = P ex ([2, 2]| ). After noticing that P su ([2, 2]| )=P ex (L = 2| ) P ex (C = 2| ) , and that P ex ([2, 2]| ) = P ex (L = 2| )−P ex ([1, 2]| ) = P ex (C = 2| )−P ex ([2, 1]| ) , we can show that, after some straightforward algebra, Equation (10) only holds if P su (r| ) = P ex (r| ) for all r. Thus, the initial hypothesis yields a transition matrix Q(r su |r ex ) that is unable to transform R ex into R su when R ex is noise correlated, and thus Q(r su |r ex ) necessarily depends on the amount of noise correlations in R ex .
Appendix B.4. Proof of Theorem 3
Proof. The condition ∆I D = ∆I R su implies that

$$ \sum_{s,r} P^{ex}(s,r) \log \frac{P^{ex}(s|r)}{P^{su}(s|r)} = I_{R^{ex}} - I_{R^{su}} \qquad \text{(A3)} $$

However,

$$ \Delta I_{D} = I_{R^{ex}} - \sum_{s,r} P^{ex}(s,r) \log \frac{P^{su}(r|s)}{P^{su}(r)} . $$

Hence, Equation (A3) becomes

$$ -\sum_{s,r} P^{ex}(s,r) \log \frac{P^{su}(r|s)}{P^{su}(r)} = -I_{R^{su}} \qquad \text{(A4)} $$

In addition, when evaluating the relevance of noise correlations, P su (r, s) = P(s) P NI (r|s), as established by Equation (17).

If instead of an equality, we start with an inequality, that same inequality can be kept all through the proof.
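The expression for ∆I D used in this proof can be obtained in two steps from the definition of ∆I D as the average of the logarithm of the ratio between matched and mismatched posteriors. The short derivation below is a sketch that assumes the surrogate code leaves the stimulus prior unchanged, that is, P su (s) = P(s), so that Bayes' rule gives P su (s|r)/P(s) = P su (r|s)/P su (r).

```latex
\begin{align}
\Delta I_{D}
  &= \sum_{s,r} P^{ex}(s,r)\,\log\frac{P^{ex}(s|r)}{P^{su}(s|r)} \\
  &= \sum_{s,r} P^{ex}(s,r)\,\log\frac{P^{ex}(s|r)}{P(s)}
   - \sum_{s,r} P^{ex}(s,r)\,\log\frac{P^{su}(s|r)}{P(s)} \\
  &= I_{R^{ex}} - \sum_{s,r} P^{ex}(s,r)\,\log\frac{P^{su}(r|s)}{P^{su}(r)} .
\end{align}
```

The first sum in the second line is the mutual information I R ex , and the second sum differs from I R su only in that the average is weighted by P ex (s, r) instead of P su (s, r).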

Appendix B.5. Proof of Theorem 4
Proof. Consider a neural code R ex = [R 1 , . . . , R N ] and recall that the range of R NI includes that of R ex . Therefore, ∆I DL = I R ex implies that the minimum in Equation (26) is attained when θ = 0. In that case, Equation (B13a) in [11] yields

$$ \sum_{s,r_n} P(s,r_n) \log_2 P(r_n|s) = \sum_{s,r_n} P(s)\, P^{ex}(r_n) \log_2 P^{ex}(r_n|s), \qquad \forall\, 1 \le n \le N . $$

After some more algebra and recalling that the Kullback-Leibler divergence is never negative, this equation becomes I R n = 0, implying that single responses, when read in isolation, contain no information about the stimulus. Consequently ∆I R NI = I R ex , thereby proving the "only if" part. For the "if" part, it is sufficient to notice that the last equality implies that P NI (r|s) = P NI (r).

Appendix B.6. Proof of Theorem 5
Proof. The conditions on f and Q(r su |r ex ) ensure that Q(r su |r ex ) can be written as a block-diagonal matrix, each block composed of identical rows with no zeros, and that each block can be associated with a non-overlapping partition R 1 , . . . , R M of the range of f . Under these conditions, P(r su |r ex ) = P(r su |R m ) when r ex ∈R m . Hence, for r su ∈R m , P(r su |s) = P(r su |R m ) P(R m |s), yielding P(s|r su ) = P(s|R m ) and P(s|r su , θ) = P(s|R m , θ). Recomputing Equations (21)-(29) with these equalities in mind immediately yields the equalities in the theorem.
Even when the amount of information is equal, differences in the type of information may arise because the measures are based on different decoding strategies, here denoted α and β. However, under the conditions of the theorem, decoding strategy α and decoding strategy β are one and the same. Because P(s|r su ) = P(s|R m ), both decoding strategies choose s only based on the partition R of r ex or r su , respectively. Mathematically, both choose s according to ŝ = arg max s P(s|R(r)), where R(r) denotes the mapping from r into R, which is the same regardless of whether r is r ex or r su . Because Q(r su |r ex ) maps each partition onto itself, the responses within each partition of r su are completely generated by the responses in each partition of r ex , and thus the decoding strategies are applied to the same set of r ex . Hence, both decoding strategies are defined and operate in the same manner, yielding the same information.
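The argument of this proof can be checked numerically on a small example. The sketch below assumes a hypothetical code with two stimuli and four responses, split into the partitions {0, 1} and {2, 3}, and a block-diagonal transition matrix with identical rows inside each block. The probabilities are arbitrary, and ∆I D is computed as the average logarithm of the ratio between matched and mismatched posteriors.

```python
from math import log2

STIMULI = (0, 1)
RESPONSES = (0, 1, 2, 3)
BLOCK_OF = {0: 0, 1: 0, 2: 1, 3: 1}  # partition R_1 = {0, 1}, R_2 = {2, 3}

# Hypothetical real code P ex(s, r).
p_ex = {(0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.05, (0, 3): 0.05,
        (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.10, (1, 3): 0.30}

# Block-diagonal Q(r su | r ex): identical rows without zeros inside each block.
q = {0: (0.3, 0.7, 0.0, 0.0), 1: (0.3, 0.7, 0.0, 0.0),
     2: (0.0, 0.0, 0.6, 0.4), 3: (0.0, 0.0, 0.6, 0.4)}

p_su = {(s, r_su): sum(p_ex[s, r] * q[r][r_su] for r in RESPONSES)
        for s in STIMULI for r_su in RESPONSES}
p_block = {(s, m): sum(p_ex[s, r] for r in RESPONSES if BLOCK_OF[r] == m)
           for s in STIMULI for m in (0, 1)}


def posterior(joint, outcomes):
    """P(s | outcome) from a joint table {(s, outcome): p}."""
    p_o = {o: sum(joint[s, o] for s in STIMULI) for o in outcomes}
    return {(s, o): joint[s, o] / p_o[o] for s in STIMULI for o in outcomes}


def information(joint, outcomes):
    """Mutual information in bits between the stimulus and the outcome."""
    post = posterior(joint, outcomes)
    p_s = {s: sum(joint[s, o] for o in outcomes) for s in STIMULI}
    return sum(p * log2(post[k] / p_s[k[0]]) for k, p in joint.items() if p > 0)


post_ex = posterior(p_ex, RESPONSES)
post_su = posterior(p_su, RESPONSES)
post_block = posterior(p_block, (0, 1))

delta_i_rsu = information(p_ex, RESPONSES) - information(p_su, RESPONSES)
delta_i_d = sum(p * log2(post_ex[k] / post_su[k]) for k, p in p_ex.items())

# Maximum-posterior decoding with P su(s | r); the result depends only on the block of r.
decode = {r: max(STIMULI, key=lambda s: post_su[s, r]) for r in RESPONSES}
print(delta_i_rsu, delta_i_d, decode)
```

Because the decoded stimulus is constant inside each block and Q never moves a response out of its block, applying `decode` to r ex (strategy α) or to a sample r su drawn from Q (strategy β) returns the same stimulus, and the two losses coincide.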

Appendix C. On the Computation of ∆I D
The information loss caused by mismatched decoders (decoding strategy α) when R su = f (R ex ) has previously been computed as ∆I D but with P su (s|r) replaced by P(s|R su = f (r)) = P su (s| f (r)) [10,12,13]. The latter represents the probability of s given that R su takes the value f (r), thereby limiting f to deterministic mappings. However, the probabilities P su (s|r) and P su (s| f (r)) are not equivalent, since

$$ P^{su}(s|r) \propto \sum_{r' :\, r = f(r')} P^{ex}(r', s) $$

$$ P^{su}(s|f(r)) \propto \sum_{r' :\, f(r) = f(r')} P^{ex}(r', s) $$

These two definitions raise the question of which alternative is the appropriate one when computing the information loss caused by mismatched decoders.
To resolve this question, notice that replacing P su (s|r) with P su (s| f (r)) in Equation (6) yields the decoding algorithm ŝ = arg max s P su (s| f (r)).
This algorithm entails first transforming the observed r into r su = f (r), and then choosing the stimulus ŝ = D su (r) with a matched probability. Hence, its operation is analogous to the decoding algorithm β, and not, as originally intended, to the decoding algorithm α.
To illustrate the difference, recall the experiment in Figure 7a and suppose that the observed response is r = 0.25π. The decoding algorithm α reads this value, computes P su (s|0.25π), and decodes ŝ = . Instead, the decoding algorithm proposed in [10,12,13] first transforms the value of r = 0.25π into f (r) = 0.75π, then computes P su (s|0.75π), and finally decodes ŝ = . This mode of operation corresponds to the decoding algorithm β.
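The distinction between the two posteriors can be reproduced with a minimal sketch. The example below does not use the data of Figure 7a: it assumes a hypothetical code with two stimuli and two responses, and a hypothetical deterministic mapping f that swaps the two responses.

```python
from collections import defaultdict

STIMULI = (0, 1)
RESPONSES = (0, 1)

# Hypothetical real code P ex(s, r): the response tends to equal the stimulus.
p_ex = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def f(r):
    """Hypothetical deterministic mapping from real to surrogate responses (a swap)."""
    return 1 - r


# Surrogate joint distribution: P su(s, r_su) sums P ex(s, r') over all r' with f(r') = r_su.
p_su = defaultdict(float)
for (s, r), p in p_ex.items():
    p_su[s, f(r)] += p


def map_stimulus(joint, r):
    """Stimulus with maximal joint (hence posterior) probability given r."""
    return max(STIMULI, key=lambda s: joint[s, r])


decode_alpha = {r: map_stimulus(p_su, r) for r in RESPONSES}        # uses P su(s | r)
decode_replaced = {r: map_stimulus(p_su, f(r)) for r in RESPONSES}  # uses P su(s | f(r))
decode_matched = {r: map_stimulus(p_ex, r) for r in RESPONSES}      # uses P ex(s | r)
print(decode_alpha, decode_replaced, decode_matched)
```

With this invertible mapping, the decoder based on P su (s| f (r)) makes the same choices as the matched decoder, as expected for strategy β, whereas the decoder based on P su (s|r) systematically selects the other stimulus.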
The above discrepancy can also be seen from the change in the operational meaning of ∆I D caused by the replacement. To that end, recall that ∆I D was first introduced as a comparison between the average number of binary questions required to identify s after observing r when using two optimal question-asking strategies, one tailored for P ex (s|r) and the other for P su (s|r) [8]. Mathematically, this difference can be written as

$$ \Delta I_{D} = \sum_{s,r} P^{ex}(s,r) \log_2 P^{ex}(s|r) - \sum_{s,r} P^{ex}(s,r) \log_2 P^{su}(s|r) . \qquad \text{(A7)} $$

In each term, the argument of the logarithms is determined by the question-asking strategy, whereas the weight of the averages is determined by the probability distribution of the variables on which the strategy is applied [8,10,16]. Equation (A7) describes the decoding strategy α.
Unlike ∆I D , this difference compares the average number of binary questions required to identify s after observing r using a question-asking strategy that is optimal for P ex (s|r), with the average number of binary questions required to identify s after observing r su using a question-asking strategy that is optimal for P su (s|r su ). This is the way the decoding strategy β operates, not α.
Naively, one may think that a change in P su (s|r), regardless of its size, may turn the measure ∆I R su , typically regarded as an encoding-oriented measure and here linked to the decoding algorithm β, into the decoding-oriented measure ∆I D . However, notice that this change cannot occur through the equations above due to the change induced in P su (s, r su ). For that to actually occur, one must write ∆I R su differently, as for example: ∆I R su = ∑ s,r P ex (s, r) log 2 P ex (s|r) − ∑ s,r su P ex (s, r) log 2 P su (s|r su ) .
In this reformulation, the second term can be interpreted as the average number of binary questions required to identify s after observing r using a question-asking strategy that is optimal for P su (s|r su ), but only after converting r into r su . Any change in P su (s|r su ) immediately renders P su (s|r su ) a mismatched probability for r su , and makes the second term represent the average number of binary questions required to identify s after observing r using the question-asking strategy that is optimal for an altered version of P su (s|r su ) but only after converting r into r su , which need not resemble the meaning of the second term in ∆I D .
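The consequences of the replacement for the information losses can also be verified numerically. The sketch below assumes a hypothetical two-stimulus code and a hypothetical deterministic mapping f that swaps the two responses, and compares Equation (A7), the same expression with P su (s|r) replaced by P su (s| f (r)), and the difference I R ex − I R su .

```python
from math import log2

STIMULI = (0, 1)
RESPONSES = (0, 1)

# Hypothetical real code P ex(s, r).
p_ex = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def f(r):
    """Hypothetical deterministic mapping from real to surrogate responses (a swap)."""
    return 1 - r


# Surrogate joint distribution P su(s, r_su) induced by f.
p_su = {(s, r): 0.0 for s in STIMULI for r in RESPONSES}
for (s, r), p in p_ex.items():
    p_su[s, f(r)] += p


def posterior(joint):
    """P(s | r) from a joint table {(s, r): p}."""
    p_r = {r: sum(joint[s, r] for s in STIMULI) for r in RESPONSES}
    return {(s, r): joint[s, r] / p_r[r] for s in STIMULI for r in RESPONSES}


def information(joint):
    """Mutual information in bits between stimulus and response."""
    post = posterior(joint)
    p_s = {s: sum(joint[s, r] for r in RESPONSES) for s in STIMULI}
    return sum(p * log2(post[k] / p_s[k[0]]) for k, p in joint.items() if p > 0)


post_ex, post_su = posterior(p_ex), posterior(p_su)

# Equation (A7): mismatched decoder applied to the real responses (strategy alpha).
delta_i_d = sum(p * log2(post_ex[s, r] / post_su[s, r])
                for (s, r), p in p_ex.items())
# Same expression with P su(s | r) replaced by P su(s | f(r)).
delta_replaced = sum(p * log2(post_ex[s, r] / post_su[s, f(r)])
                     for (s, r), p in p_ex.items())
# Encoding-oriented loss I_Rex - I_Rsu.
delta_i_rsu = information(p_ex) - information(p_su)
print(delta_i_d, delta_replaced, delta_i_rsu)
```

Since the swap is invertible, no information is lost by the mapping, and the expression with the replaced posterior vanishes together with I R ex − I R su . Equation (A7), in contrast, yields a loss of about 1.2 bits, larger than the information encoded by the real responses in this toy case.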