Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates

Low-coverage next-generation sequencing experiments assisted by statistical methods are popular in a genetic association study. Next-generation sequencing experiments produce genotype data that include allele read counts and read depths. For low sequencing depths, the genotypes tend to be highly uncertain; therefore, the uncertain genotypes are usually removed or imputed before performing a statistical analysis. It may result in the inflated type I error rate and in a loss of statistical power. In this paper, we propose a mixture-based penalized score association test adjusting for non-genetic covariates. The proposed score test statistic is based on a sandwich variance estimator so that it is robust under the model misspecification between the covariates and the latent genotypes. The proposed method takes advantage of not requiring either external imputation or elimination of uncertain genotypes. The results of our simulation study show that the type I error rates are well controlled and the proposed association test have reasonable statistical power. As an illustration, we apply our statistic to pharmacogenomics data for drug responsiveness among 400 epilepsy patients.


Introduction
Genome-wide association study (GWAS) is a powerful tool for screening a high-dimensional genome data set and selecting candidate genetic variants such as single nucleotide polymorphisms (SNPs) in genetic association studies. Next-generation sequencing (NGS) technology is widely used to produce a large amount of genetic information in a fast way. In the past decade, there have been numerous studies using NGS data such as rare variants association study [1,2], pharmacogenomics [3,4], machine learning and deep learning applications [5,6], and big data analysis [7,8]. Many NGS experiments are based on low-coverage sequencing with a large sized sample since there is a trade-off between sample size and sequencing depth in the NGS experiments [9,10]. For the low-coverage NGS data, a high uncertainty of the inferred genotypes is common; however, it causes biased and unreliable results on genetic association analyses. In genetic research based on NGS data, therefore, it is important to obtain accurate genotypes to perform an association analysis.
A number of researchers have worked on the effects of genotype misclassification in genetic association studies. There are two types of genotype misclassifications: differential and non-differential misclassifications, determined by whether the misclassification mechanism differs in the case and control groups or not. In summary, non-differential misclassifications result in a loss of statistical power and differential misclassifications distort type I error rates in a genetic case-control association study [11][12][13][14].
While there have been many research on improving the accuracy of genotypes such as the joint genotype calling algorithms across all samples were suggested to increase the accuracy of genotype calls [15][16][17], several researchers have tried to develop new association statistics accounting for the genotype errors. Their approaches are based on the raw measurements rather than inferred genotypes. In statistical genetics literature, Kim et al. [18] extended a chi-squared test of independence and developed a mixture likelihood based association test using the continuous measurements for copy number polymorphisms. Barnes et al. [19] proposed a mixture model linear trend test for the continuous copy number measurements. In NGS experiments, a likelihood ratio test based on allele read counts of pooled samples was proposed to test independence of genetic variants with a binary phenotype [20]. Gordon et al. [21] proposed a likelihood ratio test of the binomial mixture model of allele read counts with known error parameters. Kim et al. [13,22] proposed an extended version of Cochran-Armitage (CA) trend test and a multi-variant linear trend test for next-generation sequences data by using binomial mixture models. For a case-parent trio design, the binomial mixture model was applied to develop extended transmission disequilibrium tests (TDTs) based on read counts and read depths and to provide power analysis and sample size formulas [23]. All these approaches do not require genotype calls that can be highly uncertain when the read depth or coverage is low. However, none of these previous research has addressed how to include covariates in their mixture-based association studies.
When the covariates are independent of the latent genotypes, the extension of the mixture model based association tests is straightforward. However, if the latent genotype variable is associated with other covariates, then a likelihood based approach requires a model specification between the genotype variable and the other covariates as opposed to the previous research [16][17][18][19][20][21][22][23]. To our knowledge, this is the first study that investigates a genetic case-control association test controlling for covariates in low-coverage NGS experiments. Since we do not know the true model, we apply a sandwich variance estimator to develop a robust genetic association test statistic.

Mixture Model Accounting for Covariates
Let w be a covariate vector. Let y be a random variable indicating the case-control status of an individual such that y = 1 if a subject is in the case group and y = 0, otherwise. Let z = (z 0 , z 1 , z 2 ) denote an unobservable latent genotype vector, where ∑ 2 g=0 z g = 1 and z g = 1 if and only if the genotype is equal to g. Let x and v denote the minor allele read count and the read depth, respectively. The probability function is given by If the probability function of the read count x does not depend on the phenotype y, that is, p(x|v, z, y) = p(x|v, z), then it is called a non-differential error model. We apply a binary logit model to the case-control phenotype response variable y that is the same model for Cochran-Armitage trend test when perfect genotypes are available: We assume a binomial error model to the allele read counts as in previous research [13,16,20,21,23]: where u = ( , 1/2, 1 − ) T . For a differential error model, we can use u = y( 1 , 1/2, 1 − 1 ) T + (1 − y)( 0 , 1/2, 1 − 0 ) T . When perfect genotypes are available, we do not need the conditional probability of the genotype z given covariates w to perform genetic association tests since the logistic regression model is a conditional model given the genotypes and covariates. In this work, we assume a multinomial logit model for the latent genotype given the covariates as follows: where γ 0 = (0, 0, 0) T to remove over-parametrization. Other statistical models without the assumptions of a multinomial logit model may also be used for the relationship between covariates and latent genotypes, where we do not know the true model. The likelihood function L and the log-likelihood function are written as The error parameter is commonly small and hence the estimate of is often equal to zero. The zero estimate of the error parameter results in a divergent information matrix. It prevents us from calculating Rao's score test statistic. In order to overcome this issue, we include a beta density penalty term to prevent from zero estimate of the error parameter. The penalized log-likelihood function is given by During this work, we choose C = 1 as in [24,25]. The penalized complete-data likelihood function is given by The complete data log-likelihood function is written as

Derivation of EM Algorithm under H 0
We apply the Expectation-Maximization (EM) algorithm [26] to estimate the parameters in our mixture model. Given data and the (r)-th step estimated parameters, the (r + 1)-th E-step of the EM algorithm is written as We note that the posterior probability of subject k belonging to genotype class i depends on the all parameters. In M-step, the (r + 1)-th estimates of the parameters are obtained by maximizing Q (r+1) : where we use notations π ik = π ik (β, β w ) = e βs i +β T w w k for simplicity. From Equation (14), we derive a closed form iteration formula to update the allele read error parameter : There is no closed form iteration formulas to update other parameters β, β w , γ i . The M-step for β, β w , and γ can be obtained by the Newton-Raphson method. The Hessian matrix of Q (r+1) is given by (r) and update the parameter estimate by Let i and update the parameters γ i by In order to obtain β 2 || 2 ≤ tol 2 or the number of iterations reaches the prespecified maximum number of iterations. In our work, we set tol = 10 −6 and fix the maximum iteration as 1000.

Hypothesis Tests of Genetic Association Controlling for Covariates
To test genetic association between the latent genetic variables and the binary response variable while controlling covariates, we employ Rao's score test. There are several advantages for the use of the score test. Cochran-Armitage trend test with perfect genotypes is a score test, and we extend this test to when the genotypes are highly uncertain. The score test requires less computational cost compared to the likelihood ratio test since it requires the parameter estimates only under the null hypothesis of no association. The score function calculated in previous section is given by where the subscript (0) denotes the estimated parameter under the null hypothesis. Another important issue to be considered when we include the covariates in a low-coverage next-generation sequencing genetic association study is a model misspecification of the latent genotypes on the covariates. To overcome this model misspecification problem, we employ the sandwich variance estimator [27].
In this work, we derive a robust generalized score test using the sandwich variance-covariance estimator. In general, one of the difficulties in applying the sandwich estimator in practice is that it requires analytic derivation for the covariance matrix of the proposed model. For simplicity in our derivation of the sandwich variance estimator, θ denotes the vector of all parameters θ = (β, β w , γ, ), and φ = (β w , γ, ) denotes the parameter vector except β, and hence θ = (β, φ). The sandwich variance estimator for the score function S under H 0 is given by where ∂θ∂θ T under the unknown true distribution f 0 . For simplicity, we may use h ik during derivation of the sandwich variance estimator: so that the likelihood function is written as The relationship between J and V can be written as If there is no model misspecification, we have J = V and the robust score test statistic is reduced to Rao's score test statistic.
We denote the difference R = V − J so that where δ g (i) = 1 if g = i and δ g (i) = 0 if g = i. It is straightforward to calculate V from the above first derivatives. The second term ∂ 2 ∂θ∂θ T log h gk of R has components as where i = 1 or 2. All other second derivatives that are not presented are equal to zero. Using these first and second derivatives of log h gk , we can obtain the components of the difference matrix R as follows: Therefore, our proposed robust score test statistic Z R can be written as which asymptotically has a standard normal distribution under H 0 . Another common approach to obtain p-values is to use Monte Carlo permutation method based on the score vector or function. However, the Monte Carlo permutation p-value calculation given a very small Bonferroni's corrected level of significance needs high computational expenses since it requires at least 10 7 or 10 8 permuted resamples. In this work, we employ the asymptotic permutation p-value calculation. The score function is given by where the subscript (0) denotes the estimated parameter under the null hypothesis. We define a score r k = ∑ 2 i=0 τ ik(0) s i associated with subject k and the kth residual We can permute the residuals e k 's to calculate the permutation p-value for adjusting covariate effects. The asymptotic permutation test statistic Z AP for a large sample size is given by where r = 1 N ∑ N i=1 r i and e = 1 N ∑ N i=1 e i . The simple linear rank test statistic Z AP asymptotically has a standard normal distribution under the null hypothesis [28].

Simulation Study
In this section, we simulate data from the following process: For simplicity, we assume genetic relative risk R i = P(Y=1|G=i,w) P(Y=1|G=0,w) , for i = 1, 2, does not depend on the covariate W. We assume that the genotype frequency π i = P(G = i) satisfies Hardy-Weinberg equilibrium (HWE), so that P(G = 0) = p 2 , P(G = 1) = 2pq, and P(G = 2) = q 2 , where q is the minor allele frequency. Then, the prevalence is given by We consider two scenarios when generating covariates w: (1) f (w|G = i) is equal to a standard normal N(0, 1) for all i = 0, 1, 2, called by a single normal, and (2) f (w|G = i) has a normal distribution with mean µ i and standard deviation σ = 1, we call this a normal mixture. For the single normal model, We finally assume P(Y = 1|G = 0, w) = e α+βw w 1+e α+βw w . During the simulation study, we compute α by numerical integration given prevalence φ and other parameters.

Simulation Study for Null Distribution
To evaluate the type I error rate of the proposed test statistic, we perform simulations with 5000 replicates per each parameter setting. We fixed the proportion of cases as 0.5. The parameter settings that we consider are: Minor allele frequency (q): 0.05, 0.3 (iv) Total sample size (n): 500, 1000, 1500 (v) Covariate (w 1 ): single normal or normal mixture with mean µ = (0, 1 2 , 1 2 ) given genotype (0, 1, 2) (vi) Regression coefficient β w : 0, 1 We consider prevalence φ = 0.3 that may be large in a genetic association study. It is chosen to reflect pharmacogenomics data that we use in the real data analysis. Figure 1 shows boxplots of the null simulations. The permutation method appears to have more variability of the empirical rejection rates over different configurations and to have the smaller empirical rejection rates compared to the proposed robust score test based on the sandwich variance estimator. When the sample size was small as 500 and the coverage was 4×, the permutation-based test had less than 2.5% rejection rate though the desired value is 5%. The smallest empirical rejection rate for the proposed robust test was greater than 3.5%, and it appears the empirical rejection rates become closer to 5% as the sample size increases. If the coverage is 30× or higher, then the estimated posterior probabilities in our approach are close to zero-or-one and most inferred genotypes are quite clear. When the coverage was 30×, our proposed test seems to well control the type I error rates regardless of other parameter settings as expected. Table 1 shows the empirical rejection rates under the null settings by combining our simulation results for the lower level of significance. Table 1. Empirical rejection rates under null settings for level 1 × 10 −2 , 1 × 10 −3 , 1 × 10 −4 , and 1 × 10 −5 .

Simulation Study for Statistical Power
We used the same parameter settings as in the null simulation study. Additionally, we set multiplicative genetic relative risks vector (1, 1.5, 1.5 2 ) in the alternative parameter configurations. In the alternative simulations, we calculated empirical rejection rates under Bonferroni corrected level of significance, that is, 5 × 10 −8 . Figure 2 shows the boxplots of empirical power under various alternative settings. We removed the results when the sample size was 500 or the minor allele frequency was 0.05 since all the rejection rates were small in Figure 2. It appears interesting that the power of the proposed test when the coverage was 4× and the sample size was 1500 is higher than the power of the test when the coverage was 30× and the sample size was 1000. If the two design costs are similar, then the low-coverage with more samples seems more effective than the high-coverage with less samples.  The level of significance was set as 5 × 10 −8 . The notation 0.1.1000.4 represents prevalence 0.1, total sample size 1000, and coverage 4×. Table 2 summarizes statistical power of our proposed method and a naive approach. The naive approach uses uncertain genotypes by the maximum posterior probability classification rule [29]. The standard logistic regression was applied to the uncertain genotypes. As expected, the proposed robust method shows higher power than the naive approach when the sequencing coverage is as low as 4×. When the sequencing coverage is high as 30×, two approaches show similar performance in terms of statistical power.

Real Data Analysis
The proposed robust generalized score test was applied to the pharmacogenomics data consisting of 400 epilepsy patients [22]. The data were collected from several epilepsy clinics in Korea and were genotyped for whole-exomes by NGS experiments [30]. All study participants followed the criteria in [31] if the participants had drug-resistant (case group) or drug-responsive (control group) epilepsy. We defined the drug resistance as the occurrence of at least four unprovoked seizures during the past one year at the time of recruitment, with trials of two or more appropriate antiepileptic drugs (AEDs) at maximal tolerated doses. Patients who underwent surgical treatment for drug-resistant epilepsy were classified as having drug-resistant epilepsy, regardless of the surgical outcome. We excluded some patients from the study if they were frequently in poor compliance with AED therapy and had reported seizures with a questionable semiology. In addition, we defined the drug responsiveness as complete freedom from seizures for at least one year up to the date of the last follow-up visit.
We included two non-genetic covariates in our association analysis. The two covariates were age of patient and duration of epileptic seizures. The age variable was definitely independent of genetic information, whereas duration variable may be associated with genetic variables. Due to the relatively small sample size 400, we did not expect to find a significantly associated SNP controlling for the two covariates. Therefore, instead of reporting a genome-wide association study, we illustrated the results of a SNP with low read depths and a SNP with high read depths. For the low read depths example, we selected a SNP from chromosome 1, which is rs3811406. The distribution of read depths for the SNP was summarized in Table 3. More than 10% of the sample had five or less read depths and more than 30% of the sample had 10 or less read depths at the SNP. When applying our proposed mixture-based association test, the test statistic value was z R = 2.864 and the p-value was p = 4.183 × 10 −3 , while the standard logistic regression analysis using pooled genotype calls had z = 2.601 and the p-value p = 9.30 × 10 −3 that was more than twice the p-value of the proposed robust test. In addition, we applied our proposed test to SNP rs4915154 at which all patients had 13× or higher read depths and 85% patients had 25× or higher read depths. For this SNP, the proposed robust test statistic was z R = 2.940 with p-value = 3.28 × 10 −3 and the multiple logistic regression with the pooled genotype calls reported z = 2.963 with p-value = 3.05 × 10 −3 . The two results were quite close, as expected, due to high read depths at the SNP.

Discussion and Conclusions
In the present study, we developed the mixture-based genetic association tests adjusting the effects of non-genetic covariates in low-coverage NGS data. In order to construct a robust test statistic under model misspecification, we derived the sandwich variance estimator of the mixture model. The proposed test statistic is calculated from allele read counts and read depths instead of inferred genotypes so that we can apply this association test to low-coverage NGS data controlling for non-genetic covariates without external imputation or elimination of uncertain genotypes. Another important issue that we addressed in the present study is that the proposed test takes account of potential dependence between latent genotypes and the non-genetic covariates. Regarding computational cost, our proposed method is efficient because it is a generalized score test that uses the estimates of the parameters only under the null hypothesis of no association. When the sequencing depth is 4×, it takes around 1.2 s for sample size 500, 4 s for sample size 1000, and 9 s for sample size 1500 to simulate a dataset and to calculate both test statistics Z AP and Z R . When the sequencing depth is 30×, it takes approximately 0.13 s for sample size 500, 0.3 s for sample size 1000, and 0.53 s for sample size 1500. Time for these computations is measured based on a single core work of a 3.5 GHz Intel Xeon processor. As illustrated in the real data analysis section, the read depth is not a fixed constant. Therefore, the computational time for real data is usually less than that for the coverage 4× simulation setting. We used statistical software R, which is known to be slow. It would be computationally beneficial to run our proposed methods in other faster program languages for a high-dimensional genome-wide association study.
We applied the penalized likelihood method to avoid singularity of information matrix when calculating the proposed score test statistic. Therefore, the penalty term is not necessary for a non-zero estimate of the error parameter. During our work, we fixed the degree of penalization C = 1, a = 0.01, and b = 0.99 that implies 1% of allele read error as prior information. This parameter choice does not affect the proposed test statistic much since the likelihood function is merely changed when the sample size is greater than 500. It may be of interest to find optimal values for the parameters of the penalty term.
The simulation study confirms that the type I error rates of the proposed test are well controlled under the various parameter settings. The proposed robust test appears to perform better than the permutation based approach. Simulation results indicate that coverage 4× with sample size 1500 shows higher power as compared to coverage 30× with sample size 1000. Our method can be applied to an NGS experimental design by simulations to select coverage and sample size given a fixed amount of budget.
We presented a real data example in which the proposed test and multiple logistic regression results are similar to one another if the sequencing depth is high, whereas the test results may differ when the sequencing depth is low. This might have been caused because the proposed test is an extension of the multiple logistic regression with the unobserved latent genotype predictor. If the sequencing depth is high enough to call accurate genotypes, then our probability model becomes identical to the probability model of the multiple logistic regression. It would be more beneficial to compare with the previous methods by evaluating our proposed methods using a larger sized public dataset.
In this work, we focused on a single variant association test while controlling covariates. By adopting a multivariate mixture model, the proposed method can be extended to the multi-variant genetic association test including covariates. We can also extend the present method to differential genotype misclassifications.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

EM
Expectation-Maximization GWAS Genome-wide association study HWE Hardy-Weinberg equilibrium maf Minor allele frequency NGS Next-generation sequence SNP Single nucleotide polymorphism TDT Transmission disequilibrium test