In Search of Covariates of HIV-1 Subtype B Spread in the United States—A Cautionary Tale of Large-Scale Bayesian Phylogeography

Infections with HIV-1 group M subtype B viruses account for the majority of the HIV epidemic in the Western world. Phylogeographic studies have placed the introduction of subtype B in the United States in New York around 1970, where it grew into a major source of spread. Currently, it is estimated that over one million people are living with HIV in the US and that most are infected with subtype B variants. Here, we aim to identify the drivers of HIV-1 subtype B dispersal in the United States by analyzing a collection of 23,588 pol sequences, collected for drug resistance testing from 45 states during 2004–2011. To this end, we introduce a workflow to reduce this large collection of data to more computationally-manageable sample sizes and apply the BEAST framework to test which covariates associate with the spread of HIV-1 across state borders. Our results show that we are able to consistently identify certain predictors of spread under reasonable run times across datasets of up to 10,000 sequences. However, the general lack of phylogenetic structure and the high uncertainty associated with HIV trees make it difficult to interpret the epidemiological relevance of the drivers of spread we are able to identify. While the workflow we present here could be applied to other virus datasets of a similar scale, the characteristic star-like shape of HIV-1 phylogenies poses a serious obstacle to reconstructing a detailed evolutionary and spatial history for HIV-1 subtype B in the US.

subsampling procedure, we see that it closely matches that of the HIV population.  23 We considered a total of 38 potential predictors in each phylogeographic reconstruction. An origin 24 and destination predictor was specified for covariates that were state-specific, rather than pairwise.

25
Pseudocounts were added to predictors with zero values, to ensure strictly positive values before 26 log-transformation and standardization to a mean of 0 and a variance of 1. The potential predictors of 27 HIV spatial spread we considered for the GLM diffusion model can be grouped into the following Adjacency between locations was incorporated as a binary matrix that represents states sharing a 48 border with a value of 1 and 0 otherwise. This predictor was not log-transformed or standardized.

49
We constructed this matrix from a shape file downloaded from the US Census Bureau [5] by 50 looking at polygons with overlapping points.

51
The simulated distance predictor is a matrix of average geodesic distances between two discrete 52 locations. This distance measure was chosen as a more realistic alternative over centroid 53 great-circle distances. The simulated distance matrix was constructed by drawing a number of 54 points from each state corresponding to the number of sequences in each dataset. The locations 55 of each point followed a mean population density raster [6] such that points were more likely to 56 be drawn from areas with higher density. The distance between two states was then defined to 57 be the average between all pairs of points between two states across 100 replicates.

58
Environmental distances represent the distance between two points by taking into account 59 inaccessible regions using a circuit theory framework [7]. In this framework, pairwise 60 connectivity can be approximated by estimating pairwise electric resistance measures between 61 locations. They incorporate environmental rasters as grids of electric resistance or conductance to 62 study the connectivity among locations. This allows us to have a more realistic representation of 63 the path needed to traverse when moving from one point to another. We compute these distances 64 using the Circuitscape software [8,9]. We introduced two predictors to account for environmental is approximately five times that of the general adult population [16], which drove us to include 94 these potential predictors into our model.

95
Multi-year averages for total population and black population were obtained using the CPS 96   3. Connectivity of phylogeographic reconstructions on the PDA subsampled datasets Figure S3. Connectivity plots for the PDA subsampled datasets. The geographical reconstructions from each MCC tree are projected into geographical space by drawing a line for each transition event, connecting origin (circle) and destination. The connectivity plot for the 250 taxa tree is not included in this plot because it was found to be similar to those of the 500, 750 and 1,000 taxa datasets. As the number of sequences grows, the number of predictors included grows. (b) Number of predictors that never get included in the model. As the number of sequences grows, the number of predictors that never get included in the model also grows. This suggests reduced uncertainty with increasing data but we are unable to decouple this effect from the fact that the analysis is conditioned only on a single tree topology for the large sample sizes.

Evaluation of predictors on the randomly subsampled datasets
143 Figure S5. Predictors of HIV diffusion between US states for subsamples of 250, 500, 750, 1000, 2500, 5000 and 10000 taxa, selected using the random sampling scheme. Bar plots show the expecation of the GLM indicators associated to each explanatory variable, and represent the inclusion probability of each predictor. Indicator expectations corresponding to Bayes Factors of 5, 25, and 100 are represented with dotted vertical lines. Point plots show the mean and credible intervals of GLM coefficients on a log scale. The contribution of each predictor, when included in the model is given by β|δ = 1. Predictors that were never included in the model are represented by dots with no filling.