Testing Differential Gene Networks under Nonparanormal Graphical Models with False Discovery Rate Control

The nonparanormal graphical model has emerged as an important tool for modeling dependency structure between variables because it is flexible to non-Gaussian data while maintaining the good interpretability and computational convenience of Gaussian graphical models. In this paper, we consider the problem of detecting differential substructure between two nonparanormal graphical models with false discovery rate control. We construct a new statistic based on a truncated estimator of the unknown transformation functions, together with a bias-corrected sample covariance. Furthermore, we show that the new test statistic converges to the same distribution as its oracle counterpart does. Both synthetic data and real cancer genomic data are used to illustrate the promise of the new method. Our proposed testing framework is simple and scalable, facilitating its applications to large-scale data. The computational pipeline has been implemented in the R package DNetFinder, which is freely available through the Comprehensive R Archive Network.


Background
Inferring the structural change of a network under different conditions is essential in many problems arising in biology, medicine, and other scientific fields. For instance, in genomics, it is often of importance to study the structural change of a genetic pathway between diseased and normal groups. In the field of brain mapping, it is critical to identify the difference in brain connectivity between groups (for example, the brain connectivity network of normal subjects and patients often possess different structures). Most of these applications have relied on the prevailing Gaussian graphical models (GGMs) because of its good interpretability and computational convenience, and there is a rich and growing literature on learning differential networks under GGMs. To name a few, Guo et al. (2015) [1] introduced a joint estimation for multiple GGMs by a group lasso approach, under the assumption that the GGMs being studied are sparse and only differ in a small portion of edges. Danaher et al. (2014) [2] proposed a fused graphical lasso method which is free from the sparsity assumption on condition-specific networks and only requires the sparsity of the differential network. Zhao et al. (2014) [3] constructed a new estimator which directly estimates the differential network defined as ∆ ∆ ∆ = Σ Σ Σ −1 X − Σ Σ Σ −1 Y , where Σ Σ Σ −1 X and Σ Σ Σ −1 Y represent the two condition-specific precision matrices and ∆ ∆ ∆, Σ Σ Σ −1 X , Σ Σ Σ −1 Y have the same dimension. Liu (2017) [4] presented a new test to simultaneously study structural similarities and differences between multiple high-dimensional GGMs, which adopts the partial correlation coefficients to characterize the potential changes of dependency strength between two variables.
Most of the aforementioned algorithms were based upon penalized likelihood maximization. Although some algorithms were consistent under certain regularity conditions, they failed to control the false discovery rate (FDR) of the substructure detection as it is difficult to choose a tuning parameter to control the FDR at the desired level [1][2][3]. One exception is Liu (2017), who introduced a hierarchical testing framework to adjust for the multiplicity. Liu's test was constructed to asymptotically control the FDR while keeping satisfactory statistical power. Simulation studies in [4] have shown that this new test exhibits substantial power gains over existing methods such as graphical lasso. One major drawback that limits the application of Liu's test is the Gaussian assumption, which is often violated in practice especially in genomics. For instance, some digital measurements of gene expression level such as RNA-Seq data often greatly deviate from normality even after log-transformation or other variance-stabilizing transformations. In this paper, we aim to extend Liu's work to a more flexible semiparametric framework, namely the nonparanormal graphical models (NPNGMs), where the random variables are assumed to follow a multivariate normal distribution after a set of monotonically increasing transformations. We use a novel rank-based multiple testing method to detect the structural difference between multiple networks from non-Gaussian data. The method is computationally efficient and asymptotically controls the FDR at a desired level. To begin with, we give the formal definition of nonparanormal distribution: where µ µ µ and Σ Σ Σ denote the mean and covariance matrix in the multivariate normal distribution, respectively. The distribution of Y Y Y depends on three parameters and it can be generally written as Y Y Y ∼ NPN(µ µ µ, Σ Σ Σ, f f f ).
By Definition 1 and Sklar's theorem, it is easy to verify that when the transformation functions f j s are all differentiable, the nonparanormal distribution NPN(µ µ µ, Σ Σ Σ, f f f ) is equivalent to a Gaussian copula [5]. As graphical models, the NPNGMs are much more flexible than GGMs in modeling non-Gaussian data while retaining the interpretability of the latter. Some recent studies have established the estimation and properties of high dimensional nonparanormal graphical models. For example, Liu et al. (2009) [5], who first studied high-dimensional NPNGMs, bridged the estimations of GGMs and NPNGMs by a nonparametric and truncated (Winsorized) estimator of the unknown transformation functions. Xue and Zou (2012) [6] proposed to use an adjusted Spearman's correlation to estimate the structure of high-dimensional NPNGMs, and they showed that the rank-based estimator achieves the same rate of convergence as its oracle counterpart (i.e., assuming known transformation functions). Despite the advances in single NPNGM estimation, to the best of our knowledge, the inference of differential substructure between multiple NPNGMs has not been studied. In this paper, we tackled this problem by embedding the Winsorized estimator into the testing framework of Liu (2017). Under some regularity conditions, we showed that the new test statistic converges to the same distribution as its oracle counterpart does [4].
We begin with the notations and problem formulation. For a vector a a a = (a 1 , ..., a p ), we define its 0 norm as a a a 0 = ∑ p i=1 I{a i = 0}, its 1 norm as a a a 1 = ∑ p i=1 |a i |, its 2 norm as a a a 2 = ∑ p i=1 a 2 i , and its ∞ norm as a a a ∞ = max i |a i |. For a matrix A A A = (a ij ) ∈ R p×q , we define its 0 norm as ij and its ∞ norm as A A A ∞ = max i,j |a ij |. Let A A A i,−j denote the ith row of A A A with its jth entry being removed and A A A −i,j denote the jth column with its ith entry being removed. We use A A A −i,−j to denote a (p − 1) × (q − 1) matrix by removing the ith row and the jth column. For square matrix B B B, we let λ max (B B B) and λ min (B B B) denote the largest and smallest eigenvalues of B B B respectively. In addition, for a given sequence of random variable {X n , n = 1, 2, ...} and a constant sequence {a n , n = 1, 2, ...}, X n = o p (a n ) denotes that X n /a n converges to zero in probability as n approaches to infinity and X n = O p (a n ) denotes that X n /a n is stochastically bounded. If there are positive constants c and C such that c ≤ X n /a n ≤ C for all n ≥ 1, we write X n ∼ a n .
To formulate the problem, we let k ∈ {1, 2, ..., K} be the index of class, p be the dimension, and (Y (k) , f f f (k) ), we test the following hypothesis: ij· represents the partial correlation coefficient between X ij· for some k, k ∈ {1, ..., K}, and the differential network is defined as the set of all differential edges. As a well-known result in statistics, ρ jj . Here, we consider an equivalent alternative of the hypothesis testing above. Similar as in [4], let then the hypothesis testing can be simplified as As S ij (Ω Ω Ω) = S ji (Ω Ω Ω), we define H H H 0 = {H 0ij , 1 ≤ i < j ≤ p} and H H H a = {H aij , 1 ≤ i < j ≤ p}, and the total numbers of tests are p(p − 1)/2, i.e., card(H H H 0 ) = card(H H H a ) = p(p − 1)/2. The rest of this paper is structured as follows: In Section 2, we introduce the new test statistic and multiple testing procedure. In Section 3 we perform a simulation study to evaluate the finite sample performance of the proposed test in terms of FDR control and statistical power. We then apply the new method to a rich genomic data to study the genetic difference between four breast cancer subtypes. We discuss the strength and shortcomings of the test in Section 5. Technical proof of the asymptotic results is provided in Appendix A.

Winsorized Estimator of the Latent Gaussian Variables
In practice, the transformation functions f f f (k) = ( f is some estimator of the cumulative distribution function of Y One major drawback of the eCDF above is that under high dimensionality, the variance ofF where δ n serves as the truncation parameter that should be carefully chosen. Liu et al. (2009) [5] suggested δ n = 1/(4n 1/4 π log n) to balance the bias and variance of eCDF, and so we will use this value in our calculations. To estimate the transformation functions and impute the latent Gaussian variable X X X, we define X The Winsorized estimator X (k) * mj generally works well in approximating the unknown X (k) mj , and it could be used to estimate the oracle sample covariance. LetΣ Σ Σ (k) be the sample covariance matrix by the oracle data, andΣ Σ Σ (k) be the sample covariance matrix by (X When estimating the precision matrix Ω Ω Ω (k) , one can consider a modified graphical lasso based on imputed data, i.e.,Ω Ω Ω Liu et al. (2009) showed the following convergence, which elucidated the asymptotic equivalence between the oracle data and imputed data in the structural estimation of NPNGM

Asymptotic Results for a Single Class
To extend Liu's test to a nonparanormal case, we first consider the problem of single GGM estimation based on oracle data, i.e., (X It is not hard to show that the regression coefficients β β β j,p ) and the error term As the oracle data (X (3) are generally unknown, we consider a new regression model based on Winsorized imputations: In solving the problem of single GGM estimation, Liu (2017) proposed an elegant test based on a bias-corrected sample covariance. This has motivated us to construct the following new statistic where r jj, − 1, we will prove that, under mild conditions (see a detailed proof in Appendix A) Similar as in [4], the estimated coefficientsβ β β (k) j must satisfy the following conditions: Equation (6) is our main result, which is essentially a counterpart of Proposition 3.1 in [4]. The detailed proof is given in Appendix A. The asymptotic result we obtained here suggested that, by an appropriate choice of regression coefficientsβ β β j under NPNGMs, one can use the rank-based method introduced by Xue and Zou (2012) [6]. Xue and Zou (2012) showed that the rank-based estimator (e.g., rank-based lasso and rank-based Dantzig selector) achieved exactly the same convergence rate as its oracle counterpart, therefore, it also satisfies our condition (7).

Multiple Testing Procedure for FDR Control
Now we introduce the multiple testing procedure for FDR control based on the single-class result from Equation (6). As suggested in [4], the partial correlation coefficient can be well estimated by a thresholding estimatorρ and we define the following two-sample test statistics In the multi-sample case S S S ij = (S (k,k ) ij ) 1≤k<k ≤K , we consider a sum squared test statistics Motivated by [4] (Equations 2.6 and 2.7) and [7], we define the following statistic (1) and the FDR can be controlled at level α, if we reject H 0ij : S ij (Ω Ω Ω) = 0 when T ij ≥ t(α 0 ). One may refer to [7] for the detailed proof about this testing procedure. Our proposed computational pipeline consisted of three steps: (1) Winsorized imputation for the latent Gaussian variables; (2) rank-based estimation of regression coefficients, and (3) multiple testing with FDR control. On the whole, we put forward a simple procedure to estimate the structural difference between multiple nonparanormal graphical models. The computational pipeline for a two-sample comparison has been implemented in the R package DNetFinder, which can be downloaded from the Comprehensive R Archive Network (CRAN).

Numerical Study
We performed a simulation study to evaluate the finite sample performance of the proposed procedure. In particular, we evaluated the empirical false discovery rate (eFDR) as well as the statistical power under two classes, i.e., K = 2. The dimension and sample size were set to be p = 200 and n 1 = n 2 = 100. We consider two commonly used graph-generating models including the band graph and Erdős-Rényi (ER) graph, and two estimators for regression coefficients including lasso estimator and Dantzig selector. Detailed set-up for precision matrices Ω Ω Ω 1 and Ω Ω Ω 2 are given below:
For each graph, we generated the latent Gaussian data (oracle data) from N(0 0 0, Ω Ω Ω −1 ), Ω Ω Ω ∈ {Ω Ω Ω 1 , Ω Ω Ω 2 }, and a Winsorized estimator with truncation parameter δ n = 1/(4n 1/4 π log n) was used to implement our test. The performance of the proposed method was then evaluated in two aspects: false discovery rate control and statistical power. In particular, we compared the results based on oracle data and imputed data by the Winsorized estimator. Two estimators including the lasso estimator and Dantzig selector were used to estimate coefficientsβ β β. For oracle data, we applied the R package flare to calculate the solution path over a sequence of 20 candidate λ's and tune by Akaike information criterion (AIC). For imputed data, we adopted the rank-based methods introduced by [6], i.e., the rank-based lasso and rank-based Dantzig selector. The simulation was repeated for 100 times for each FDR level (α ∈ {0.05, 0.10, 0.15, ..., 0.50}) and the average empirical FDR and statistical power were summarized. Figures 1 and 2 compared the empirical false discovery rate (eFDR) with the desired level α under the band graph and ER graph. It can be seen that the empirical FDR based on imputed data is close to the one by oracle data, both close to the desired level of α, suggesting that the FDRs were controlled quite well for both cases. The lasso estimator works almost equally well as Dantzig selector in both settings.

Figures 3 and 4 summarized the statistical power of the test for the band graph and ER graph.
As can be seen, the power for ER graph is substantially lower than the band graph, indicating that the complexity and denseness of the underlying differential network may significantly decrease the power of our test. The test based on oracle data performs slightly better than the imputed data, which is due to the loss of information during Winsorized imputation. Similar as we observed from Figures 1 and 2, the lasso estimator works almost equally well as Dantzig selector.   In addition, we compared the proposed test with a direct estimator, recently developed by Zhang (2019) [8]. The direct estimator is a rank-based estimator and can be solved by a parametric simplex algorithm. We simulated the data from the Erdős-Rényi (ER) graph with different sample sizes (n = 25, 50, 100, 150) and numbers of dimensions (p = 40, 60, 90, 120). As the direct estimator does not control the false discovery rate, we set the FDR level at 0.05 for our proposed test. Figure 5 summarized the empirical FDR and statistical power under different sample sizes (with dimension fixed at 100) and different dimensions (with sample size fixed at 100). It can be seen that the two methods have comparable performance and our proposed test achieves lower FDR but slightly lower statistical power. However, it is noteworthy that the direct estimator is computationally expensive and becomes impractical when the dimensions exceed 150. Table 1 summarized the running time of the two methods, where it can be seen that our test is much faster than the direct estimator, especially for relatively high dimensions. For instance, when p = 120, the direct estimator takes hours while our test takes less than 10 seconds. As the core part of the proposed algorithm is the estimation of regression coefficients, the time complexity is the same as the linear regression. For instance, with LASSO and p > n, the time complexity is O(np 2 ), while the direct estimator by Zhang (2019) has a time complexity O(np 4 ).

A Genomic Application
In this part, we applied the proposed test to the Cancer Genome Atlas data (TCGA, [9]) to study the different roles of the cell cycle pathway in the two subtypes of breast cancer including luminal A subtype and basal-like subtype. The cell cycle pathway is known to play a critical role in the initiation and progression of many human cancers including breast cancer and ovarian cancer [10,11]. For instance, the cell cycle pathway provided by KEGG (Kyoto Encyclopedia of Genes and Genomes, [12]) contains 128 important genes that co-regulate cell proliferation, including ATM, RB1, CCNE1, and MYC. Abnormal regulation among these genes may cause the over-proliferation of cells and an accumulation of tumor cell numbers [11].
The transcriptome profiling data for breast cancer were downloaded through the Genomic Data Commons portal [13] in January 2017. The expression level of each gene was quantified by the count of reads mapped to the gene. The quantifications were done by software HTSeq of version 0.9.1 [14]. In our analysis, we excluded 43 subjects including 12 male subjects and 31 subjects with >1% missing values. In addition, we removed the effects due to different age groups and batches using a medianmatching and variance-matching strategy [10,15,16]. For example, the batch effect can be removed in the following way: where g ijk refers to the expression value for gene i from sample k in batch j (j = 1, 2, ..., J; k = 1, 2, ..., n j ), M ij represents the median of g ij = (g ij1 , ..., g ijn j ), M i refers to the median of g i = (g i1 , ..., g iJ ),σ g i and σ g ij stand for the standard deviations of g i and g ij , respectively.
The remaining 959 breast cancer samples were further classified into five subtypes according to two molecular signatures, namely PAM50 [17] and SCMOD2 [18]. The two classifications were implemented separately using R package genefu [19] and we obtained 530 subjects with concordant classification by two classifiers. The resulting set contains 221 subjects in the luminal A group, 119 in the luminal B group, 74 in the her2-enriched group, 105 in the basal-like group, and 11 in the normal-like group. For illustration purposes, we conducted two pairwise comparisons (1) Luminal A vs basal-like and (2) Luminal B vs basal-like.
To balance the bias and variance, we choose the same truncation parameter in Winsorized imputation as in our simulation study δ (k) where k ∈ {1, 2}, n 1 = 221, n 2 = 105. The proposed test based on the Winsorized estimator was then conducted for each gene pair with different FDR cutoffs. Figures 6 and 7 summarized all the identified differential edges under FDR levels α = 0.05, 0.10, 0.15, 0.20, with all isolated genes being removed. Our results suggested a list of important genes that play different roles in different breast cancer subtypes. For instance, in Figure 6, genes CCNB1 and PRKDC contribute to several differential edges. According to recent studies, gene CCNB1 is a prognostic biomarker for certain subtypes of breast cancer and it is closely associated with hormone therapy resistance [20]. It has also been reported in the literature that the PRKDC regulates chemosensitivity and is a potential prognostic and predictive marker of response to adjuvant chemotherapy in breast cancer patients [21]. Our findings about several other genes including CHEK2 and CDC7 also confirmed some existing reports [22,23]. As we observed from the two examples, as the desired FDR level increases, the resulting differential network tends to be denser and denser (Figure 8 showed the correlation between FDR and the number of differential edges). In practice, users should consider the trade-off between the accuracy (FDR) and number of new hypotheses (number of differential edges) and choose an appropriate FDR [24].

Discussion
Detecting the differential substructure on multiple graphical models is a fundamental and challenging problem in statistics. Liu (2017) studied the problem under the Gaussian framework and introduced an elegant hierarchical test based on the estimation of single GGM. Unlike most existing methods, Liu's approach asymptotically controlled the false discovery rate at a nominal level, which guarantees the quality of the estimated differential network. In this work, we further extended Liu's test to a more flexible semiparametric framework, namely the nonparanormal graphical models. Our test is built upon a Winsorized estimator of the unknown transformation functions and it enjoys similar asymptotic properties as its oracle counterpart does.
Although the new test holds great promise in many applications such as genetic network modeling, it has some practical limitations. First, as we see from the theoretical derivation, the good performance of the test relied on the sparsity assumption on the differential network. Although the sparsity assumption is reasonable in many cases, it still could be violated in some applications. For instance, some genetic pathways may exhibit a global change of gene-gene regulations between different phenotypes. When the differential network is dense or locally dense, the method may fail to control the FDR. To solve the problem, a new test needs to be defined to evaluate the level of the sparseness of the change between two conditions. However, there is still a gap on the literature of this topic.
Second, one key assumption in NPNGMs is that the transformed variables follow a joint Gaussian distribution. This assumption also needs to be checked in real-world applications. Under low dimensions, one can employ some popular normality tests, including the Anderson-Darling test and Shapiro-Wilk test, on the imputed data or other normal scores. However, most of these tests fail to detect non-normality for high-dimension data. The normality test under high dimension is still an open and challenging problem and we left it for future research.
It is also noteworthy to mention that the new test relied on an accurate estimator for the coefficients β β β. Motivated by [6], we chose two popular estimators including lasso estimator and Dantzig selector based on the adjusted Spearman's rank, which satisfies Condition (7). In fact, some other estimators also satisfy the conditions, for instance, the rank-based adaptive lasso [6,25] and square-root lasso estimator [6,26]. These estimators can also be incorporated into our testing framework.

Conclusions
We have introduced a novel statistical test to detect the structural difference between the two nonparanormal graphical models. The proposed test dropped the Gaussian assumption and can be potentially applied to many non-Gaussian data for differential network analysis. For instance, some digital gene expression data (e.g., RNA-seq data) do not follow Gaussian distribution even after log transformation or other variance-stabilizing transformations. In such cases, one can model the data with a nonparanormal graphical model and apply our test to find differential edges between two or multiple phenotypic conditions. The proposed test may also be used to detect the difference between normal and disease populations in the brain connectivity network.
Funding: This research received no external funding.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

FDR
False discovery rate NPNGM Nonparanormal graphical model GGM Gaussian graphical model TCGA The Cancer Genome Atlas

Appendix A. Proof of Equation (6)
Define the estimated residuals based on Winsorized estimator as: * whereX * j = 1/n ∑ n m=1 X * mj andX X X * −j = 1/n ∑ n m=1 X X X * m,−j . The choice ofβ β β j must satisfy the following two conditions: β β β j − β β β j 1 = O p (a n ), where a It is noteworthy to mention that the conditions above are slightly different from the conditions in [4] due to the different convergence rates by oracle data and imputed data. The conditions above can be satisfied by the rank-based estimators introduced in [6], e.g., rank-based lasso estimator or rank-based Dantzig selector. By letting˜ mj = mj −¯ i , we have: First, for term (A3), we have: where the last term can be bounded as follows: To bound ∑ l =i,j˜ mi (X ml −X l )(β l,j − β l,j ), we use the independence between mi and X X X m,−i . It is easy to show that max l =i mi (X ml −X l )(β l,j − β l,j ))| = O p a n log p n .
By the independence between mi and X X X * m − X X X m , it is not hard to show =O p log p n + a n log p n .
Summarizing all the results above, by Equations (22) and (23)  mj (X mj −X j )(β j,i − β j,i )I{i = j} + O p log 2 p log 2 n n 3/2 + a 2 n log p log 2 n n 1/2 + a n log p n + b 2 n . mj (β j,i − β j,i )I{i = j} + O p log 2 p log 2 n n 3/2 + a 2 n log p log 2 n n 1/2 + a n log p n + b 2 n .
In addition 1 n n ∑ m=1 * 2 mi = 1 n n ∑ m=1˜ 2 mi + O p log 2 p log 2 n n 3/2 + a 2 n log p log 2 n n 1/2 + a n log p n + b 2 n .
Equation (6) follows immediately by central limit theorem.