Bayesian Computational Methods for Sampling from the Posterior Distribution of a Bivariate Survival Model, Based on AMH Copula in the Presence of Right-Censored Data

In this paper, we study the performance of Bayesian computational methods for estimating the parameters of a bivariate survival model based on the Ali–Mikhail–Haq copula with Weibull marginal distributions. The estimation procedure is based on Markov chain Monte Carlo (MCMC) algorithms. We present three versions of the Metropolis–Hastings algorithm: Independent Metropolis–Hastings (IMH), Random Walk Metropolis (RWM) and Metropolis–Hastings with a natural candidate-generating density (MH). Since constructing a good candidate-generating density for IMH and RWM may be difficult, we also describe how to update a parameter of interest using the slice sampling (SS) method. A simulation study was carried out to compare the performances of IMH, RWM and SS, using the sample root mean square error as the performance indicator. The simulation results show that the SS algorithm is an effective alternative to the IMH and RWM methods for simulating values from the posterior distribution, especially for small sample sizes. We also applied these methods to a real data set.


Introduction
In survival studies, it is common to observe two or more lifetimes for the same client, patient or equipment. For instance, in a bivariate scenario, the lifetimes of a pair of organs can be observed, such as a pair of kidneys, liver, or eyes in patients; or the lifetimes of engines in a twin-engine airplane.
These variables are usually correlated and we are interested in the bivariate model that considers the dependence between them. The copula model is useful for modeling this kind of bivariate data. It has been used in several articles, including the following: [1] describes a comparison between bivariate frailty models, and models based on bivariate exponential and Weibull distributions; [2] proposes a copula model to study the association between survival time of individuals infected with HIV and persistence time of infection; [3] models the association of bivariate failure times by copula functions, and investigates two-stage parametric and semi-parametric procedures; and [4] considers a Gaussian copula model and estimates the copula association parameter using a two-stage estimation procedure.
According to [5,6], a copula is a joint distribution function of random variables for which the marginal probability distribution of each variable is uniformly distributed on the interval [0, 1].
In Section 4, the results of a simulation study are reported. In Section 5 we apply the three algorithms to the real data set. Section 6 summarizes our findings.

Bayesian Approach
In order to develop the Bayesian approach, we need to specify prior distributions for α_j, β_j and φ, for j = 1, 2. We assume independent priors, i.e., π(θ, φ) = π(α_1)π(β_1)π(α_2)π(β_2)π(φ). We consider Gamma prior distributions, α_j ∼ Γ(a_j1, a_j2) and β_j ∼ Γ(b_j1, b_j2), where a_j1, a_j2, b_j1 and b_j2 are known hyperparameters, all with support on (0, +∞), for j = 1, 2. The parametrization of the Gamma distribution is such that the mean is a_j1/a_j2 and the variance is a_j1/a_j2², for j = 1, 2. The choice of hyperparameter values depends on the application. In the remainder of the article, we set hyperparameter values that give prior distributions with large variances; in particular, we set a_j1 = a_j2 = b_j1 = b_j2 = 0.01, for j = 1, 2. For φ we chose the uniform prior distribution on the interval (−1, 1), φ ∼ U(−1, 1).
The conditional posterior distributions in Equations (4)–(6) are not standard distributions. Thus, in order to simulate from them, we used the Metropolis–Hastings algorithm. At each iteration, the Metropolis–Hastings algorithm considers a value generated from a proposal distribution; this value is accepted according to a properly specified acceptance probability. This procedure guarantees the convergence of the Markov chain to the target density. More details on the Metropolis–Hastings algorithm can be found in [22–25] and their references.

Metropolis-Hastings Algorithm:
Let the current state of the Markov chain be (α_1^(l−1), θ_{−α_1}^(l−1), φ^(l−1)), where l indexes the l-th iteration of the algorithm and α_1^(l−1), θ_{−α_1}^(l−1) and φ^(l−1) are the values of α_1, θ_{−α_1} and φ at the (l−1)-th iteration, for l = 1, . . . , L, with initial values α_1^(0), θ_{−α_1}^(0) and φ^(0). At the l-th iteration of the algorithm, we update α_1 as follows: generate a candidate α*_1 from the candidate-generating density q[α*_1 | α_1], and accept it with the probability A_{α_1} given in (7); otherwise, keep the current value.
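A generic Metropolis-Hastings update of the kind described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R code; the target and proposal are placeholders supplied by the caller:

```python
import math
import random


def mh_update(current, log_target, propose, log_q):
    """One Metropolis-Hastings update for a scalar parameter.

    log_target : log of the (unnormalized) conditional posterior
    propose    : draws a candidate given the current value
    log_q      : log_q(to, given) = log of the proposal density q(to | given)
    """
    cand = propose(current)
    # log acceptance ratio: [pi(cand) q(current|cand)] / [pi(current) q(cand|current)]
    log_A = (log_target(cand) - log_target(current)
             + log_q(current, cand) - log_q(cand, current))
    if math.log(random.random()) < min(0.0, log_A):
        return cand      # accept the candidate
    return current       # reject: keep the current state
```

Iterating `mh_update` produces a Markov chain whose stationary distribution is the target, whatever valid proposal is plugged in.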
Another option is to generate α*_1 from a candidate-generating density that does not depend on the current value of α_1; that is, we may set q[α*_1 | α_1] = q[α*_1]. This gives a special case of the original MH algorithm, called Independent Metropolis-Hastings (IMH), in which A_{α_1}, given in (7), simplifies to A_{α_1} = min{1, [π(α*_1 | t, δ, θ_{−α_1}, φ) q(α_1)] / [π(α_1 | t, δ, θ_{−α_1}, φ) q(α*_1)]}.
In order to implement this case, one may set q[α*_1] as the prior distribution, i.e., q[α*_1] = π(α*_1). Then A_{α_1} reduces to the likelihood ratio, A_{α_1} = min{1, L(t, δ | α*_1, θ_{−α_1}, φ) / L(t, δ | α_1, θ_{−α_1}, φ)}. This algorithm is implemented as follows.
• Independent Metropolis-Hastings Algorithm: Let the current state of the Markov chain be (α_1^(l−1), θ_{−α_1}^(l−1), φ^(l−1)). For the l-th iteration of the algorithm, do the following: (1) generate α*_1 from the prior distribution, α*_1 ∼ Γ(a_11, a_12); (2) accept α*_1 with probability A_{α_1} given by the likelihood ratio; otherwise, set α_1^(l) = α_1^(l−1).
Although the choice of the prior distribution as the candidate-generating density may be mathematically attractive, it usually leads to slow convergence of the algorithm. This happens when only vague prior information is available and the prior distribution has a large variance; as a consequence, many of the proposed values are rejected.
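When the prior is used as the proposal, the acceptance probability reduces to the likelihood ratio, which makes the update very compact. A minimal Python sketch, illustrative rather than the authors' R implementation:

```python
import math
import random


def imh_prior_update(current, log_lik, sample_prior, rng=random):
    """One IMH step with the prior as the proposal.

    Because q = prior, the acceptance probability reduces to
    min{1, L(candidate) / L(current)} (the likelihood ratio).
    """
    cand = sample_prior()
    if math.log(rng.random()) < min(0.0, log_lik(cand) - log_lik(current)):
        return cand
    return current
```

As the text notes, a vague prior makes most candidates fall in low-likelihood regions, so the ratio is small and most proposals are rejected.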
An alternative is to explore the neighborhood of the current value of the Markov chain to propose a new value. This method is termed the random walk Metropolis (RWM). In the RWM, the candidate value α*_1 is generated from a symmetric density g(·); that is, we set q[α*_1 | α_1] = g(|α_1 − α*_1|), so the probability of generating a move from α_1 to α*_1 depends only on the distance between them. In this case, A_{α_1} given in (7) simplifies to A_{α_1} = min{1, π(α*_1 | t, δ, θ_{−α_1}, φ) / π(α_1 | t, δ, θ_{−α_1}, φ)}, since the proposal kernels in the numerator and denominator cancel. To implement the RWM, we simulate α*_1 by setting α*_1 = α_1 + ε, where ε is a random perturbation generated from a Normal distribution with mean 0 and variance σ²_{α_1}. This algorithm is implemented as follows.
• Random Walk Metropolis Algorithm: Let the current state of the Markov chain be (α_1^(l−1), θ_{−α_1}^(l−1), φ^(l−1)). For the l-th iteration of the algorithm, l = 1, . . . , L, do the following: (1) generate α*_1 = α_1^(l−1) + ε, with ε ∼ N(0, σ²_{α_1}); (2) accept α*_1 with probability A_{α_1}; otherwise, set α_1^(l) = α_1^(l−1).
An issue in RWM is how to choose the value of σ²_{α_1}, which has a strong influence on the efficiency of the algorithm. If σ²_{α_1} is too small, the random perturbations will be small in magnitude and almost all will be accepted; the consequence is that a large number of iterations will be needed to explore the entire state space. On the other hand, if σ²_{α_1} is too large, many of the proposed values will be rejected, slowing down the convergence. More details on this issue can be found in [23,26–28].
Typically, one may fix the value of σ²_{α_1} by testing some values in a few pilot runs and then choosing a value whose acceptance ratio lies between 20% and 30% (see, for example, [24,25]). Thus, after a pilot run, we set σ²_α = 1.
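A pilot-run tuning loop of this kind can be sketched as follows. This is an illustrative Python version; the target density, the candidate values of σ and the 20-30% acceptance band are placeholders, not the paper's actual settings:

```python
import math
import random


def rwm_chain(x0, log_target, sigma, n_iter, seed=0):
    """Random walk Metropolis; returns the chain and the acceptance rate."""
    rng = random.Random(seed)
    chain, x, accepted = [x0], x0, 0
    for _ in range(n_iter):
        cand = x + rng.gauss(0.0, sigma)  # symmetric Normal perturbation
        if math.log(rng.random()) < min(0.0, log_target(cand) - log_target(x)):
            x, accepted = cand, accepted + 1
        chain.append(x)
    return chain, accepted / n_iter


def tune_sigma(x0, log_target, candidates, pilot_iter=2000, lo=0.20, hi=0.30):
    """Pilot runs: return the first sigma whose acceptance rate lies in [lo, hi]."""
    for s in candidates:
        _, rate = rwm_chain(x0, log_target, s, pilot_iter)
        if lo <= rate <= hi:
            return s
    return candidates[-1]  # fallback if no candidate hits the band
```

In practice the pilot chains are short and discarded; only the chosen σ is carried into the production run.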

Slice Sampling Algorithm
An alternative to the IMH and RWM for sampling from a generic distribution is the slice sampling (SS) algorithm. This algorithm is a type of Gibbs sampler based on the simulation of suitably defined uniform random variables. Here we explain slice sampling in the context of the simulation of α_1; the sampling procedure for α_2 is similar. More details about SS can be found in [13].
In SS, an auxiliary variable U is introduced, and the joint distribution π(α_1, U | t, δ, θ_{−α_1}, φ) is given by a uniform distribution over the region {(α_1, u) : 0 < u < κ(α_1)}, where κ(·) is the unnormalized conditional posterior density of α_1. Marginalizing over U recovers the conditional posterior distribution of α_1. However, as the inverse of κ(α_1) cannot be obtained analytically, we adopted the following procedure to update α_1: let λ = 0.01 and let Ã be an empty set.
• Slice sampling algorithm: Let the current state of the Markov chain be α_1^(l−1). For the l-th iteration of the algorithm, l = 1, . . . , L: (1) generate u^(l) ∼ U(0, κ(α_1^(l−1))), where κ(·) is given by (10); (2) approximate the slice S^(l) = {α_1 : κ(α_1) ≥ u^(l)} by evaluating κ(·) on a grid with step λ and collecting in Ã the grid points belonging to the slice; (3) generate α_1^(l) uniformly from Ã.
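One standard way to update a parameter by slice sampling when the slice cannot be inverted analytically is Neal's stepping-out and shrinkage procedure. The sketch below (Python, illustrative; the bracket width `w` plays a role analogous to the paper's small step λ) is one such variant, not necessarily the authors' exact procedure:

```python
import math
import random


def slice_update(x, log_kappa, w=0.5, rng=random):
    """One slice-sampling update via stepping-out and shrinkage (Neal, 2003).

    log_kappa : log of the unnormalized density kappa(.)
    w         : initial bracket width used to step out along the slice
    """
    # Auxiliary level: u ~ U(0, kappa(x)), worked with on the log scale.
    log_u = log_kappa(x) + math.log(rng.random())
    # Random initial bracket of width w containing x, then step out.
    left = x - w * rng.random()
    right = left + w
    while log_kappa(left) > log_u:
        left -= w
    while log_kappa(right) > log_u:
        right += w
    # Shrinkage: sample uniformly in the bracket until a point lands in the slice.
    while True:
        cand = left + (right - left) * rng.random()
        if log_kappa(cand) > log_u:
            return cand
        if cand < x:
            left = cand
        else:
            right = cand
```

The update never rejects: each iteration returns a new (possibly unchanged only with probability zero) value, which is one reason slice sampling often mixes well without tuning a proposal density.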

Metropolis-Hastings Algorithm:
Let the current state of the Markov chain be β_1^(l−1). For the l-th iteration of the algorithm, l = 1, . . . , L, generate a candidate β*_1 from its natural candidate-generating density and accept it with the corresponding MH acceptance probability. The Metropolis-Hastings algorithm for updating β_2 is similar. To update the dependence parameter φ conditional on the remaining parameters θ = (α_1, β_1, α_2, β_2), we used the following IMH algorithm. Let G_φ be a grid from −1 to 1 with increments of 0.1, and let [I_a, I_a+1) be the interval defined by two adjacent grid values of G_φ, where a indexes the a-th grid value, for a = 1, . . . , 20. For example, for a = 1 we have the interval [−1, −0.9); for a = 11, the interval [0, 0.1); and for a = 20, the interval [0.9, 1). A candidate value φ* is then generated as follows: if the current value of φ is in the interval [I_1, I_2), generate φ* from one of the two following uniform distributions, φ* ∼ U(I_1, I_2) with probability 1/2 or φ* ∼ U(I_2, I_3) with probability 1/2.
In this case, we generate an auxiliary variable U ∼ U(0, 1); if u ≤ 1/2, we generate φ* from U(I_1, I_2), otherwise from U(I_2, I_3). If the current value of φ is in [I_20, I_21), generate φ* from one of the two following uniform distributions: U(I_19, I_20) with probability 1/2, or U(I_20, I_21) with probability 1/2. If the current value of φ is in the interval [I_a, I_a+1), for a ≠ 1 and a ≠ 20, generate φ* from one of the three following uniform distributions: U(I_a−1, I_a) with probability 1/3, U(I_a, I_a+1) with probability 1/3, or U(I_a+1, I_a+2) with probability 1/3.
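The grid-based proposal for φ described above can be sketched as follows. This is a hypothetical Python illustration (the paper's implementation is in R); interior cells draw from one of three adjacent intervals with probability 1/3 each, and the two boundary cells from one of two intervals with probability 1/2 each:

```python
import random


def propose_phi(phi, rng=random):
    """Candidate phi* from neighboring cells of the grid G_phi = {-1, -0.9, ..., 1}."""
    grid = [round(-1 + 0.1 * k, 1) for k in range(21)]  # I_1, ..., I_21
    a = min(int((phi + 1.0) / 0.1), 19)                 # 0-based index of [I_a, I_{a+1})
    if a == 0:                      # boundary cell [-1, -0.9)
        cells = [0, 1]
    elif a == 19:                   # boundary cell [0.9, 1)
        cells = [18, 19]
    else:                           # interior: three adjacent intervals
        cells = [a - 1, a, a + 1]
    c = rng.choice(cells)
    return rng.uniform(grid[c], grid[c + 1])
```

By construction every candidate stays inside the support (−1, 1) of the uniform prior on φ.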

MCMC Algorithms
Using the IMH, RWM, SS and MH algorithms described above, we implemented three MCMC algorithms: A_1, which updates α_j via IMH; A_2, which updates α_j via RWM; and A_3, which updates α_j via SS. For all three algorithms, the parameters β_j and φ are updated via MH and IMH, respectively, as described in Section 3.2, for j = 1, 2.
After defining the algorithms, we ran each of them for L iterations with a burn-in of B. We also considered jumps of size J, i.e., only one draw from every J was extracted from the original sequence, yielding a subsequence of size S = [(L − B)/J] for inference.
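The burn-in and thinning step reduces to simple slicing. A hypothetical Python helper; with the settings used later in the paper (L = 55,000, B = 5000, J = 10) it yields S = 5000 draws:

```python
def thin(chain, burn_in, jump):
    """Discard the first burn_in draws, then keep every jump-th draw.

    Returns the subsequence of size S = (len(chain) - burn_in) // jump
    used for posterior inference.
    """
    return chain[burn_in::jump]
```

Thinning trades Monte Carlo sample size for lower autocorrelation and storage; the ESS/IAT diagnostics reported later quantify how much correlation remains.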

Simulation Study
In this section, we compare the performances of the three algorithms on simulated data sets. Random samples of sizes n = 25, 50, 100 and 250, with 0%, 5%, 10%, 20% and 30% random right-censoring, were generated to represent small, medium and large data sets. Using these settings, we generated four simulated data-set configurations with fixed parameters, as specified in Table 1.
Data set D_1 has two increasing hazard functions with a positive dependence parameter, while data set D_2 has a constant and an increasing hazard function with a negative dependence parameter. Data set D_3 has parameters that produce a decreasing and a constant hazard function with weak dependence, while data set D_4 has strong dependence and two increasing hazard functions. The simulation procedure to generate n observations (t_i1, t_i2), for i = 1, . . . , n, is given by the following steps: (i) set the sample size n and set i = 1; (ii) generate the censoring times C_ij ∼ U(0, τ_j), where τ_j controls the percentage of censored observations, for j = 1, 2; (iii) generate uniform values u_ij ∼ U(0, 1), j = 1, 2, and calculate w_i, the solution of the nonlinear equation given by the conditional copula distribution (here we used the rootSolve package and the uniroot.all command in R to solve the nonlinear equation and obtain w_i); (iv) obtain the lifetimes T_ij from the marginal quantile functions; (v) calculate the times t_ij = min(T_ij, C_ij) and the censoring indicators δ_ij, which are equal to 1 if t_ij < T_ij and 0 otherwise, for j = 1, 2; (vi) set i = i + 1; if i > n, stop, otherwise return to step (ii).
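Step (iii) can be implemented with any scalar root-finder. The sketch below, in Python rather than the authors' R code, uses the closed form of the AMH conditional distribution C_{2|1}(w | u) and a plain bisection in place of uniroot.all; the Weibull parametrization S(t) = exp(−(t/β)^α), the function names and the event-indicator convention (d = 1 when the event is observed; flip it to match the paper's censoring indicator) are assumptions for illustration:

```python
import math
import random


def amh_cond_cdf(w, u, phi):
    """C_{2|1}(w | u) = dC/du for the AMH copula C(u, v) = uv / (1 - phi(1-u)(1-v))."""
    d = 1.0 - phi * (1.0 - u) * (1.0 - w)
    return w * (1.0 - phi * (1.0 - w)) / (d * d)


def solve_w(u, v, phi, tol=1e-10):
    """Bisection for w such that C_{2|1}(w | u) = v (the role of uniroot.all)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if amh_cond_cdf(mid, u, phi) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_pair(alpha, beta, phi, tau, rng=random):
    """One censored pair [(t1, d1), (t2, d2)]; Weibull with S(t) = exp(-(t/beta)^alpha)."""
    u1 = rng.random()
    w = solve_w(u1, rng.random(), phi)   # second uniform via the conditional copula
    out = []
    for a, b, u in ((alpha[0], beta[0], u1), (alpha[1], beta[1], w)):
        t = b * (-math.log(1.0 - u)) ** (1.0 / a)  # Weibull inverse CDF
        c = rng.uniform(0.0, tau)                  # censoring time C ~ U(0, tau)
        out.append((min(t, c), 1 if t <= c else 0))
    return out
```

Setting φ = 0 reduces the conditional CDF to the identity, so the two margins are generated independently, which is a convenient sanity check.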
We generated M = 200 different simulated data sets according to steps (i)-(vi) described above and the parameters were estimated according to algorithms A 1 , A 2 and A 3 .
We used hyperparameters a_j1 = a_j2 = b_j1 = b_j2 = 0.01 to obtain prior distributions with large variance, for j = 1, 2. For the m-th generated data set, we applied algorithms A_1, A_2 and A_3, fixing L = 55,000 iterations, burn-in B = 5000 and J = 10.
The algorithms were compared using the sample Root Mean Square Error (RMSE), i.e., the square root of the average squared difference between the replicate estimates and the true parameter values. A smaller RMSE indicates better overall quality of the estimates. Table 2 presents the RMSE value for each simulated data set by algorithm, sample size and censoring percentage; the smallest RMSE value for each sample size and censoring percentage is highlighted in bold. For all three algorithms, fixing the sample size and increasing the censoring percentage (% cens.) increased the RMSE values. When the sample size increases at a fixed censoring percentage, the RMSE values decrease, improving the precision of the estimators.
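The sample RMSE for one parameter across the M replicates can be computed as in this minimal, illustrative Python sketch:

```python
import math


def rmse(estimates, truth):
    """Sample root mean square error of M replicate estimates of one parameter."""
    m = len(estimates)
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / m)
```

A per-data-set summary like the one in Table 2 can then be obtained by aggregating this quantity over the model parameters.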
Based on the results presented in Table 2, for the smallest sample size, n = 25, algorithm A_3 (with SS) outperformed algorithm A_1 (with IMH) and algorithm A_2 (with RWM), i.e., it gave a smaller RMSE value for all censoring percentages. This better performance also occurred for data sets D_3 and D_4 with n = 50. In all other simulated cases, algorithm A_2 outperformed algorithms A_1 and A_3, with the exception of the case with n = 250 and 0% censoring in data set D_2, in which algorithm A_1 performed best. These results suggest a possible complementarity between algorithms A_2 and A_3: algorithm A_2 performs better for larger sample sizes and algorithm A_3 performs better for smaller sample sizes.
We verified the convergence of algorithms A_1, A_2 and A_3 using the effective sample size [14] and the integrated autocorrelation time (IAT). The effective sample size (ESS) is the number of effectively independent draws from the posterior distribution; methods with larger ESS are more efficient. The IAT is an MCMC diagnostic that estimates the average number of autocorrelated samples required to produce one independent draw; a lower IAT means greater efficiency. The ESS and IAT values were obtained using the coda and LaplacesDemon packages, both available in R.
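A rough, self-contained sketch of how IAT and ESS can be estimated from a chain follows. The coda and LaplacesDemon packages use more refined estimators; this illustrative Python version simply truncates the autocorrelation sum at the first non-positive term:

```python
def autocorr(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var


def iat_and_ess(chain, max_lag=200):
    """IAT = 1 + 2 * sum of positive autocorrelations (truncated at the first
    non-positive term); ESS = n / IAT."""
    tau = 1.0
    for lag in range(1, max_lag + 1):
        rho = autocorr(chain, lag)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau, len(chain) / tau
```

For an i.i.d. chain the IAT is close to 1 and the ESS close to the chain length; strongly autocorrelated chains, like those produced by a poorly mixing IMH, show large IAT and small ESS.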
Tables A1 and A2 in Appendix A show the average ESS and IAT values for each algorithm by parameter for data set D_1. Algorithm A_3 showed better performance than algorithms A_1 and A_2, i.e., it had the highest ESS values and the smallest IAT values per parameter in all simulated cases. Note that algorithm A_1 had the worst results, especially for the simulated values of α_j, j = 1, 2. Results for data sets D_2, D_3 and D_4 were similar. Appendix B presents an empirical convergence check of the sampled α_1 values for each algorithm. As shown in Figure A1, the values of α_1 generated by algorithm A_1 did not mix well, and the stability of the ergodic mean and the estimated autocorrelation were not satisfactory. On the other hand, the values generated by algorithms A_2 and A_3 mixed well and presented satisfactory stability of the ergodic mean and autocorrelation. As an illustration of convergence diagnostics, Figure A1(j-l) shows the Gelman plot for the sequence of α_1 values in two chains for each algorithm. As can be seen in the figure, the number of iterations was sufficient for algorithms A_2 and A_3 to reach convergence, but not for algorithm A_1. In addition, the scale reduction factor of the Gelman-Rubin diagnostic [29] for each parameter in algorithms A_2 and A_3 was smaller than 1.1, meaning that there is no indication of non-convergence. This implies faster convergence of algorithms A_2 and A_3 relative to algorithm A_1. For the β_1 sampled values, the three algorithms present satisfactory properties, i.e., good mixing and satisfactory stability of the ergodic mean and autocorrelation (see Figure A2 in Appendix B).

The results indicate that algorithm A_3 (SS for α_j) is an effective alternative to algorithms A_1 (IMH for α_j) and A_2 (RWM for α_j) for simulating samples from the posterior distribution of bivariate survival models based on the Ali-Mikhail-Haq copula with marginal Weibull distributions.

Application to a Real Data Set
Next, we examine the performance of algorithms A 1 , A 2 and A 3 on the diabetic retinopathy data set described in [15], which is available in the R software 'survival' package [16]. This data set consists of the follow-up times of 197 diabetic patients under 60 years of age. The main objective of the study was to evaluate the effectiveness of the photocoagulation treatment for proliferative retinopathy. The treatment was randomly assigned to one eye of each patient and the other eye was taken as a control.
Let (T_1, T_2) be the bivariate times, where T_1 is the time to visual loss for the treated eye and T_2 is the time to visual loss for the control eye. The percentage of censored times is 72.59% (143 observations) for T_1 and 48.73% (96 observations) for T_2.
We used (1) to model this data with Weibull marginal distributions with parameters α j and β j and dependence parameter φ.
We compared the performances of the algorithms using the RMSE in relation to the empirical distribution function, where F̂_j(t_ij) is obtained by substituting the estimates of α_j, β_j and φ (obtained by each algorithm) into the model distribution function, and F_j(t_ij) is the empirical distribution function obtained from the Kaplan-Meier estimates, for j = 1, 2 and i = 1, . . . , n.
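This fit criterion can be sketched as follows (Python, illustrative only; `km_cdf` stands for the empirical CDF values at the observed times, and the Weibull parametrization F(t) = 1 − exp(−(t/β)^α) is an assumption made for the sketch):

```python
import math


def fit_rmse(times, alpha, beta, km_cdf):
    """RMSE between the model CDF F(t) = 1 - exp(-(t/beta)^alpha) and an
    empirical (e.g. Kaplan-Meier) CDF evaluated at the observed times."""
    n = len(times)
    sq = 0.0
    for t, f_emp in zip(times, km_cdf):
        f_mod = 1.0 - math.exp(-((t / beta) ** alpha))
        sq += (f_mod - f_emp) ** 2
    return math.sqrt(sq / n)
```

The same quantity is computed for each margin j = 1, 2 and each algorithm's parameter estimates, and smaller values indicate a fitted curve closer to the Kaplan-Meier steps.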
We ran the three algorithms using the same number of iterations, burn-in, thinning and hyperparameter values used with the simulated data. Table 3 shows the parameter estimates, the 95% credible intervals and the RMSE values by algorithm. For this data set, algorithm A_3 (with SS for α_j) gave the smallest RMSE value. Figure 1 shows the survival functions estimated by algorithms A_1 (red line) and A_3 (blue line); the step functions (black lines) are the Kaplan-Meier estimates. The curves estimated by algorithms A_1 and A_2 are very close, so we show only the curve estimated by A_1 in order to keep the figure readable. The Kaplan-Meier estimates were obtained using the survival package and the survfit command in R. Table 4 shows the ESS and IAT values for the sequences generated by algorithms A_1, A_2 and A_3. Algorithm A_3 performed better than algorithms A_1 and A_2, i.e., it had the highest ESS value and the lowest IAT value per parameter.
We also compared the performances of the algorithms in relation to the sequences generated for each parameter. Figure 2 shows the traceplots, ergodic means and autocorrelations for the sequences of α_1 values simulated by algorithms A_1, A_2 and A_3. It can be observed in these graphs that the α_1 values generated by the IMH (algorithm A_1) mix poorly, do not show satisfactory stability of the ergodic mean, and have high autocorrelation at long lags. On the other hand, the values generated by the RWM (algorithm A_2) and SS (algorithm A_3) mix better and present satisfactory stability of the ergodic mean; the sequence produced by the SS presents the most steeply decreasing autocorrelation. Figure 3 shows the same graphs for parameter β_1; for β_1 the performances of the three algorithms are satisfactory. These results, together with the RMSE values, show that for the data set analyzed here SS provides better performance than IMH or RWM. Figure 4 shows the Gelman plot for the simulated values of α_1, β_1 and φ in two chains for each algorithm. As can be seen, the number of iterations was sufficient for algorithms A_2 and A_3 to reach convergence, but not for algorithm A_1 (Figure 4a,b). The scale reduction factors for each parameter in algorithms A_2 and A_3 are all less than 1.1, while for algorithm A_1 only φ presents a scale reduction factor less than 1.1.

Final Remarks
We investigated the performances of three Bayesian computational methods to estimate parameters of a bivariate survival model based on the Ali-Mikhail-Haq copula with marginal Weibull distributions. The performances of the MCMC algorithms were compared using the RMSE criterion.
The RMSE values were calculated for different sample sizes and different censoring percentages.
The results obtained from the simulated data sets showed that the RWM and SS algorithms outperformed the IMH algorithm, and that the SS algorithm performed better for smaller sample sizes. The results show evidence that MCMC sequences obtained with SS, for the same number of iterations L, burn-in B and thinning value, have better properties (i.e., higher ESS and lower IAT values) than those obtained with IMH and RWM, which are standard methods for sampling from the joint posterior distribution.
We also illustrated the application of the algorithms using a real data set available in the literature. Algorithm A_3 (with SS generating the α_j's) presented the best performance when applied to this data set. The criteria used to reach this conclusion were the stability of the ergodic mean, the autocorrelation, the minimum RMSE value, the maximum ESS value and the minimum IAT value. In addition, the algorithm using SS presented a satisfactory performance in relation to the scale reduction factor and the Gelman plot of the Gelman-Rubin convergence diagnostic.
Our results show that algorithm A_3, which combines SS for generating α_j, MH for β_j and IMH for φ, is an effective algorithm for simulating values from the joint posterior distribution of an AMH copula with Weibull marginal distributions. Moreover, two advantages of SS are that it is easy to implement and does not require the specification of a candidate-generating density. A disadvantage in our specific case is that it took longer to run than IMH and RWM, because an iterative method was needed to obtain the inverse of the function κ(α_j), for which no analytical solution is available. All calculations were implemented in R and can be obtained from the authors. Extending the results obtained here to other Archimedean copulas, as well as to other marginal distributions, would be a fruitful area for future work.

Table A1. ESS by algorithm for data set D_1.

Appendix B. Empirical Illustration of the Convergence
We present here an empirical illustration of the convergence of the simulated sequences for parameters α_1 and β_1. We randomly selected one of the M = 200 data sets generated under configuration D_1 with n = 100 and %cens = 5, and present the traceplot, graphs of the ergodic mean and autocorrelation of the sampled values by algorithm, and the Gelman plot. Figure A1 shows the performance of the algorithms for the sampled α_1 values. It can be observed that the IMH (algorithm A_1) does not mix well, does not show stability of the ergodic mean, and has an estimated autocorrelation that does not decrease as fast as those of the other algorithms. The sequences of α_1 values generated by RWM and SS mix well and present satisfactory stability of the ergodic mean, with faster-decreasing autocorrelation and a clear advantage for algorithm A_3. The Gelman plot indicates that the number of iterations used was sufficient for algorithms A_2 and A_3 to reach convergence. Figure A2 presents the performance of each algorithm for the sequence generated for β_1; as can be observed, the three algorithms present satisfactory properties. This satisfactory performance is mainly due to the fact that β_1 has a natural candidate-generating density with parameters depending on the observed data and the hyperparameter values.

Figure A1. Traceplot, ergodic mean and autocorrelation for sequences produced by algorithms A_1, A_2 and A_3 for α_1.

Figure A2. Traceplot, ergodic mean and autocorrelation for sequences produced by algorithms A_1, A_2 and A_3 for β_1.