Machine Learning and DWI Brain Communicability Networks for Alzheimer’s Disease Detection

Abstract: Signal processing and machine learning techniques are changing clinical practice based on medical imaging from many perspectives. A major topic is related to (i) the development of computer aided diagnosis systems to provide clinicians with novel, non-invasive and low-cost support tools, and (ii) the development of new methodologies for the analysis of biomedical data aimed at finding new disease biomarkers. Advancements have recently been achieved in the context of Alzheimer's disease (AD) diagnosis through the use of diffusion weighted imaging (DWI) data. When combined with tractography algorithms, this imaging modality enables the reconstruction of the physical connections of the brain, which can subsequently be investigated through a complex network-based approach. A graph metric particularly suited to describing the disruption of brain connectivity due to AD is communicability. In this work, we develop a machine learning framework for the classification and feature importance analysis of AD based on communicability at the whole-brain level. We fairly compare the performance of three state-of-the-art classification models, namely support vector machines, random forests and artificial neural networks, on the connectivity networks of a balanced cohort of healthy control subjects and AD patients from the ADNI database. Moreover, we clinically validate the information content of the communicability metric by performing a feature importance analysis. Both the performance comparison and the feature importance analysis provide evidence of the robustness of the method. The results obtained confirm that whole-brain structural communicability alterations due to AD are a valuable biomarker for the characterization and investigation of pathological conditions.


Introduction
Alzheimer's disease (AD) is the most widespread neurodegenerative disorder and is a growing health problem. It is mainly characterized by short-term memory loss in its earlier stages, followed by a progressive decline in other cognitive and behavioural functions as the disease advances [1]. Investigating useful biomarkers for the early diagnosis, prognosis and response to therapy is one of the primary goals of the current research activity in neuroscience.
A number of studies have provided evidence that the decline due to AD is related to disrupted connectivity among brain regions, caused by white matter (WM) degeneration, e.g., [2,3]. Because of the homogeneous chemical composition of WM fibers, conventional MRI is not able to highlight their structure and is therefore not suited to investigating the physical disconnections arising among them. Conversely, a promising technique for such an investigation is diffusion weighted imaging (DWI). This technique, in fact, is able to probe the WM micro-structural integrity and can thus help identify WM alterations that may occur due to AD [4].
Different approaches have been proposed to study the diagnostic potential of DWI data, ranging from a finer voxelwise analysis [5] to an ROI-based approach [6]. In the last few years, a growing interest has arisen towards the application of an alternative approach based on complex network theory. When combined with tractography algorithms [7], in fact, DWI enables the reconstruction of the WM fiber tracts, providing a characterization of the physical connections of the brain that can be subsequently investigated through a complex network-based approach [8]. More precisely, the brain can be modeled as a network whose nodes are the anatomical regions and whose edges are related to the fiber tracts connecting them. Traditional network metrics suitable for describing topological properties of the brain include nodal degree and strength, and shortest path length.
In particular, a very promising research direction consists in feeding graph-based topological measures into machine learning algorithms so as to automate disease detection, e.g., [9][10][11]. Developing computer aided diagnosis systems is desirable, as they can provide non-invasive, low-cost support to the traditional neuropsychological assessment performed by expert clinicians. Moreover, a great variety of state-of-the-art machine learning approaches has shown outstanding performance for early detection and automated classification of Alzheimer's disease (AD) [12,13].
Recently, we investigated the usefulness in this context of an uncommon graph measure, namely communicability. Communicability quantifies the ease of communication between node pairs in a network by considering not only the shortest path connecting them, but all possible available routes [14]. For this reason, this metric proved to be particularly sensitive to the disruption of communication between brain regions due to AD [15,16]. In [15], communicability was able to outperform more classic graph measures on a mixed cohort of healthy control (HC) subjects and AD patients from the public Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). Since the main goal was to compare communicability to other classic measures, we fixed the classification model to be used, i.e., support vector machines. Furthermore, a cortical parcellation scheme was adopted for the estimated brain networks. On the one hand, the use of only one classifier may not be enough to claim the robustness of the communicability-based approach for supporting automatic disease detection. On the other hand, the use of a coarse anatomical scheme could have overlooked detailed patterns of connectivity, which may play a key role in the investigation of neurological diseases. In [16], we partially addressed these open issues by conducting a connectivity analysis on the sub-cortical connectivity sub-network only.
In the present work, we extend our previous analyses by comparing different state-of-the-art classification algorithms and, at the same time, by using a different parcellation scheme which takes into account the overall brain structural connectivity patterns. To this end, we developed a machine learning framework for both classification and feature analysis of AD based on communicability.
In Section 2, we describe the dataset used for the study. In Section 3, we outline the steps of the analysis: we describe the image processing pipeline used to obtain the connectivity network starting from DWI scans, we explain the feature extraction step, consisting in the calculation of the graph communicability for each node pair, and we provide a description of the different machine learning algorithms used for the classification comparison. In Section 4, we report the results of the classification comparison and of the feature analysis, identifying the region pairs most related to disease detection according to the three classification methods. The results are discussed in the last section.

Materials
For the purposes of the present study, we used the data of a balanced cohort of 40 HC subjects and 40 age-matched AD patients from the ADNI database, in order to compare different classifiers while avoiding the potential problems arising from an unbalanced dataset. ADNI is a multi-site, longitudinal research study that actively supports research on medical treatments to slow or stop the progression of AD. The overall goal of the study is to validate candidate biomarkers for use in clinical treatment trials.
The diffusion-weighted scans were randomly selected from baseline and follow-up study visits. HC subjects do not show signs of depression, mild cognitive impairment or dementia; participants with AD meet the NINCDS/ADRDA criteria for probable AD. Demographics and clinical scores for the participants are summarized in Table 1, where the MMSE [17] and the ADAS 11 and ADAS 13 [18] scores are reported. According to the t-test statistics, MMSE, ADAS 11 and ADAS 13 are significantly different between healthy controls (HC) and Alzheimer's disease (AD) patients; for age and gender, the chi-squared test was performed. Scans were acquired using a 3-T GE Medical Systems scanner; more precisely, 46 separate images were acquired for each scan: five with negligible diffusion effects (b0 images) and 41 diffusion-weighted images (b = 1000 s/mm²). For each subject, the T1-weighted anatomical scan was also used to perform tractography.

Methods
The proposed framework includes several steps, which are described in the following subsections. It is worth noting that these steps typically entail a substantial computational burden, with image processing in particular taking about ten hours per subject. In order to carry out such an expensive computation, we used the distributed infrastructure of the ReCaS-Bari computing farm (https://www.recas-bari.it/index.php/it/).

Image Preprocessing
For each subject, the DICOM images were acquired from the ADNI database. The dcm2nii tool, included in the MRIcron suite, was used to convert the DICOM images into the NIFTI format. The NIFTI images were then organized in the standard BIDS format. The other processing steps, from image preprocessing to connectome reconstruction, were carried out with tools provided within the MRtrix3 software package (http://www.mrtrix.org/) and the FMRIB Software Library (FSL) (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki).
The main steps of the whole processing pipeline, which are well-established in the literature, are shown in Figure 1 and outlined in the following. First, a denoising step was performed in order to enhance the signal-to-noise ratio of the diffusion weighted MR signals and reduce the thermal noise. This noise is due to the stochastic thermal motion of the water molecules and their interaction with the surrounding micro-structure [19]. Head motion and eddy current distortions were corrected by aligning the DWI images of each subject to the average b0 image. The brain extraction tool (BET) was then used for the skull-stripping of the brain [20]. Bias-field correction was applied to all DWI volumes. Similarly, the T1-weighted scans were preprocessed by performing the following steps: reorientation to the standard MNI152 image, automatic cropping, bias-field correction, registration to the linear and non-linear standard space, and brain extraction. The next step was the inter-modal registration of the diffusion weighted and T1-weighted images. After the preprocessing and co-registration steps, we performed the structural connectome generation. First, we generated a tissue-segmented image tailored to anatomically constrained tractography [21]. Then, we performed an unsupervised estimation of the WM, gray matter and cerebro-spinal fluid response functions. In the next step, the fiber orientation distributions for spherical deconvolution were estimated [22]. We then performed probabilistic tractography [23] using dynamic seeding [24] and anatomically-constrained tractography [25], which improves the tractography reconstruction by using anatomical information through a dynamic thresholding strategy.
We applied the spherical-deconvolution informed filtering of tractograms (SIFT2) methodology [24], which not only provides more biologically meaningful estimates of the structural connection density, but also a more efficient solution to the streamline connectivity quantification problem. The obtained streamlines were mapped through a T1 parcellation scheme by using the AAL2 atlas [26], a revised version of the automated anatomical labeling (AAL) atlas including 120 regions. Finally, a robust structural connectome construction was performed to generate the connectivity matrices [27]. The pipeline described here has been used in recent structural connectivity studies, for example [28] and [29].
The final output of the image processing step was a 120 × 120 weighted symmetric connectivity matrix for each subject, whose entry (i, j) corresponded to the fiber tracts connecting region i to region j. In contrast to our previous works, where only cortical or only sub-cortical regions were considered, here we employed networks at the whole-brain level.

Feature Extraction
The connectivity matrix represents the structural complexity of the brain network. A powerful framework for mathematically treating such a complex system is graph theory. Several graph metrics can be computed from the connectivity matrix to describe the topological properties of the brain. Most of these measures are based on the shortest path connecting two nodes of the network. Their relevance rests on the idea that the communication between two nodes takes place through the shortest path connecting them. However, in many real-world networks, such as social networks, information does not necessarily flow along the shortest paths; moreover, it can go back and forth several times before reaching its final destination (e.g., [30,31]). As a consequence, relying only on shortest path-based models can lead to relevant information loss.
In order to overcome this drawback, Estrada and Hatano proposed the concept of communicability, initially for binary networks only, defining the communicability between two nodes as a function of the total number of walks connecting them, giving more weight to shorter walks than to longer ones [14].
Let G be a graph with N nodes and A the corresponding N × N adjacency matrix. Then the entry

$$(A^k)_{pq} = \sum_{r_1, \dots, r_{k-1}} a_{p,r_1} a_{r_1,r_2} a_{r_2,r_3} \cdots a_{r_{k-2},r_{k-1}} a_{r_{k-1},q}$$

counts the number of walks of length k starting at node p and ending at node q. The communicability between p and q is given by the total number of walks connecting them, weighted in decreasing order of their lengths:

$$G_{pq} = \sum_{k=0}^{\infty} \frac{(A^k)_{pq}}{k!} = \left(e^{A}\right)_{pq}.$$

This equation can also be rewritten in terms of the graph spectrum as

$$G_{pq} = \sum_{j=1}^{N} \varphi_j(p)\, \varphi_j(q)\, e^{\lambda_j},$$

where $\varphi_j(p)$ is the p-th element of the j-th orthonormal eigenvector of the adjacency matrix associated with the eigenvalue $\lambda_j$. The concept of communicability was extended to the weighted case by Crofts and Higham [32]. The definition provided above is still valid, but A is now the N × N weighted matrix and the terms $a_{p,r_1} a_{r_1,r_2} a_{r_2,r_3} \cdots a_{r_{k-2},r_{k-1}} a_{r_{k-1},q}$ represent the weights of the walks $p \to r_1$, $r_1 \to r_2$, etc. In order to avoid the excessive influence of a node due to its high weight, the authors proposed a normalization step which consists in dividing each weight $a_{ij}$ by the product $\sqrt{s_i s_j}$, where $s_i = \sum_j a_{ij}$ is the strength of node i. Therefore, the communicability between two nodes p and q in a weighted network is defined as

$$G_{pq} = \left(e^{D^{-1/2} A D^{-1/2}}\right)_{pq},$$

where $D = \operatorname{diag}(s_1, \dots, s_N)$ is the diagonal strength matrix.
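The weighted, strength-normalized communicability can be computed directly via the matrix exponential. The following is an illustrative sketch (not the code used in the study), assuming a symmetric connectivity matrix with strictly positive node strengths:

```python
import numpy as np
from scipy.linalg import expm

def communicability(W):
    """Weighted communicability matrix exp(D^{-1/2} W D^{-1/2}).

    W is a symmetric N x N weighted connectivity matrix with positive
    node strengths; entry (p, q) of the result is the strength-normalized
    weighted sum over all walks from p to q.
    """
    s = W.sum(axis=1)                 # node strengths s_i
    d = 1.0 / np.sqrt(s)              # diagonal of D^{-1/2}
    return expm(W * np.outer(d, d))   # matrix exponential of D^{-1/2} W D^{-1/2}
```

For a 120 × 120 connectome this yields a 120 × 120 communicability matrix, whose upper triangle (7140 node pairs) can serve as the feature vector for a subject.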

Model Fitting
The main goal of our analysis was to compare the performance of different classification algorithms, on the same data set, for discriminating AD from HC through communicability. To this end, we employed three state-of-the-art classification models:
• Support vector machines (SVMs);
• Random forests (RFs);
• Artificial neural networks (ANNs).
They are briefly described in the following.

Support Vector Machines
SVMs construct a separating hyperplane between the two classes such that the minimal distance from the closest data points of either class is largest [33]. Previously unseen examples are predicted to belong to a class based on which side of the hyperplane they fall on. In order to mitigate the effects of overfitting, the margin of the hyperplane is chosen so as to correctly separate most of the training examples, while tolerating the misclassification of some of them. To learn nonlinear decision boundaries, the data points can be mapped to a higher dimensional space via a kernel function; in the present work, we used a Gaussian radial basis function kernel. It is worth noting that the bias-variance trade-off of the algorithm is governed by the tuning of the penalty parameter C, which controls the amount of violation the algorithm can tolerate when constructing the hyperplane, and the kernel coefficient γ. In this work, we set C to 1 and γ to 1/n, where n is the number of features: this is a typical parameter setting.
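The setting above corresponds to scikit-learn's defaults C = 1 and gamma = "auto" (i.e., 1/n_features). A minimal sketch, using synthetic stand-in features since the real communicability vectors are not reproduced here:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data: 80 "subjects" (40 HC = 0, 40 AD = 1) with
# 7140 features, i.e., the upper triangle of a 120 x 120 matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 7140))
y = np.repeat([0, 1], 40)

# RBF-kernel SVM with the paper's settings: C = 1, gamma = 1/n_features.
svm = SVC(kernel="rbf", C=1.0, gamma="auto")
svm.fit(X, y)
```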

Random Forests
RF is a tree-based method for classification which relies on the concept of bootstrap aggregating (or bagging): it builds a multitude of decision trees at training time and outputs the mode of the classes predicted by the individual trees at test time [34]. Bagging consists in iteratively selecting a random sample with replacement from the training set and fitting a decision tree to this sample.
In contrast to ordinary bagging, when building a decision tree RF does not consider the entire set of available features but chooses random subsets of them. This serves to avoid growing highly correlated trees. In the present work, 500 trees were used to build the forest: this is a common choice.
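A minimal sketch of this configuration (synthetic stand-in features; the `max_features="sqrt"` subset rule is a common convention, not stated in the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 200))   # synthetic stand-in features
y = np.repeat([0, 1], 40)

# 500 bagged trees; each split considers a random subset of features
# (max_features="sqrt"), which decorrelates the individual trees.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            random_state=1)
rf.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = rf.feature_importances_
```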

Artificial Neural Networks
In this work, we used the classic multi-layer perceptron (MLP) architecture. Briefly, an MLP is a feed-forward artificial neural network that can learn a non-linear function approximator for either classification or regression [35]. In contrast to traditional logistic regression, which is based on a single weighted linear combination between the input layer and the output layer, an MLP provides one or more non-linear (hidden) layers. In the present work, we used an MLP with two hidden layers (32 hidden units each) and the widely used ReLU activation function. Employing more hidden layers would have had a negative impact on classification performance, given the higher number of parameters to be optimized with respect to the number of examples. The network optimizes the log-loss function via backpropagation using the limited-memory BFGS algorithm [36]. This is an optimization algorithm in the family of quasi-Newton methods which is known to perform well when, as in this case, the training set is small [37].
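The architecture described maps directly onto scikit-learn's `MLPClassifier`; a sketch with synthetic stand-in features (the `max_iter` value is an assumption):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 200))   # synthetic stand-in features
y = np.repeat([0, 1], 40)

# Two hidden layers of 32 ReLU units each, trained with L-BFGS,
# which is well suited to small training sets.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",
                    solver="lbfgs", max_iter=1000, random_state=2)
mlp.fit(X, y)
```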

Feature Analysis
The supervised algorithms provided within our framework are naturally equipped with methods to assess the importance of the input features by computing weighted rankings:
• Support vector machines: we used the popular recursive feature elimination (SVM-RFE) algorithm [38]. The method uses criteria derived from the SVM model to assess feature importance and iteratively removes the features with the smallest criteria, until all features have been removed from the feature set: the final output is a ranked feature list.
• Random forests: for each tree, the feature importance was calculated as the decrease in node impurity weighted by the expected fraction of the samples reaching that node; the normalized feature importances were then summed over the forest.
• Artificial neural networks: we used the well-known Gedeon method [39], which computes a feature ranking by considering the weights connecting the input features to the two hidden layers.
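The recursive elimination step can be sketched with scikit-learn's `RFE`; note that the classic SVM-RFE formulation ranks features via a linear SVM's weights, so a linear kernel is used here (an assumption; this is not necessarily the exact configuration of the study):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 50))    # synthetic stand-in features
y = np.repeat([0, 1], 40)

# Eliminate one feature per iteration until all are ranked;
# ranking_ assigns 1 to the most important feature.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_
```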

Experimental Results
In this section, the results of two analyses are reported. The first was aimed at comparing the performance of the three classification models employed. The second was a feature importance analysis aimed at evaluating the effectiveness of communicability in identifying the brain regions whose connectivity is most related to AD.

Classification Performance
In order to validate the classification performance, we used 10-fold cross-validation. With this scheme, the set of examples is divided into 10 folds of equal size: nine folds are used to train the learning algorithm, and the remaining fold is used to test it. The procedure is repeated 10 times, until every fold has been used as the test set once. Note that the splitting within each iteration was stratified by diagnosis so as to have approximately the same number of examples from each diagnostic group in each fold. The entire cross-validation was repeated ten times, with different permutations of the training and test examples, in order to obtain a more reliable estimate of the generalization performance.
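This scheme corresponds to repeated stratified 10-fold cross-validation; a sketch with synthetic stand-in data and an SVM as the example classifier:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 50))    # synthetic stand-in features
y = np.repeat([0, 1], 40)

# Stratified 10-fold CV, repeated 10 times with different shuffles:
# 100 train/test splits in total, each fold balanced by diagnosis.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=4)
scores = cross_validate(SVC(kernel="rbf", C=1.0, gamma="auto"), X, y,
                        cv=cv, scoring=["accuracy", "roc_auc"])
mean_acc = scores["test_accuracy"].mean()
sem_acc = scores["test_accuracy"].std() / np.sqrt(len(scores["test_accuracy"]))
```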
The results obtained are expressed in terms of traditional performance metrics: accuracy, area under the ROC curve (AUC), sensitivity and specificity. In the following, the mean values are reported, averaged over all the cross-validation iterations, together with the standard errors. Figure 2 shows the classification performance of the three classifiers. Although quite comparable performance can be observed for all of them, it can be noticed that ANN slightly outperformed SVM and RF both in terms of accuracy (i.e., 0.75 ± 0.01) and AUC (i.e., 0.83 ± 0.01). Interestingly, SVM and RF provided a specificity higher than sensitivity; with ANN, instead, this trend was reversed, as a sensitivity higher than specificity was obtained (i.e., 0.80 ± 0.02 and 0.70 ± 0.01, respectively). The mean value of sensitivity for ANN was found to be statistically significantly different from that obtained by the other two classifiers (p-value < 0.0001 and p-value = 0.0003 in the comparison with SVM and RF, respectively, with the Mann-Whitney U test). Concerning the other performance metrics, no statistically significant difference, at the significance level of 0.01, was found between the three models.
Another important question concerns the agreement between the three classification models on the labels assigned to the test examples. In fact, in principle, two distinct models could perform similarly while misclassifying different examples. In order to evaluate the inter-rater agreement, we calculated the well-known Cohen's κ between the pair-wise predictions. This value ranges from −1 to +1: high values indicate good agreement; values of zero or lower indicate chance agreement. We observed a κ of ∼0.67 between ANN and each of the other two models and a κ of ∼0.85 between RF and SVM.
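Pair-wise agreement can be computed directly from the two prediction vectors; a toy sketch with hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy predictions from two classifiers on the same eight test examples.
ann_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])
svm_pred = np.array([1, 0, 0, 1, 0, 0, 1, 1])

# Observed agreement is 6/8 = 0.75, chance agreement is 0.5, so
# kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5.
kappa = cohen_kappa_score(ann_pred, svm_pred)
```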

Feature Importance
In the second part of our analysis, we evaluated which features played a special role in the disease prediction. To this end, we used the feature ranking methods described in Section 3.4. It is worth noting that, for a more robust evaluation, we computed the importance rankings over one hundred different random sub-samples of the subjects, each having the same class distribution as the original sample.
We found 25 region pairs common to the 90th percentiles of the importance distributions of the three methods. The region pairs with the highest relative importance, according to the AAL2 atlas, are depicted in Figure 3. Table 2 shows the seven regions which occurred more than once in the 25 most important region pairs. Finally, Table 3 shows the anatomical areas whose regions occur most often. It is worth noting the importance of the Occipital Lobe and its regions, as well as the importance of the subcortical sub-network, particularly the Caudate.
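The intersection of the top deciles of the three rankings can be sketched as follows (the importance scores here are random stand-ins; the real ones come from the ranking methods of Section 3.4):

```python
import numpy as np

# Hypothetical importance scores for 7140 node pairs from the three
# ranking methods (names are illustrative only).
rng = np.random.default_rng(5)
scores = {m: rng.random(7140) for m in ("svm_rfe", "rf", "gedeon")}

# Keep, for each method, the node pairs at or above its own 90th
# percentile, then intersect the three sets.
top_sets = [set(np.flatnonzero(s >= np.percentile(s, 90)))
            for s in scores.values()]
common_pairs = set.intersection(*top_sets)
```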

Discussion and Conclusions
In this paper, we developed a framework that simultaneously exploits complex network features and machine learning algorithms to investigate the information content of the communicability metric in the discrimination between AD patients and controls, and to explore the most relevant AD-related brain regions at the whole-brain level. In doing this, we extended our previous research [15,16] by addressing two major open issues. The first important issue is that, since the goal in [15] was to compare communicability to other traditional network metrics, we used only one classification algorithm. In this work, we employed three different state-of-the-art classification algorithms in order to evaluate whether the information content of communicability is robust against the choice of a particular learning algorithm. We found that all the models employed, i.e., SVMs, RFs and ANNs, provided comparable values of accuracy, AUC and specificity, which are in line with our previous work. Concerning sensitivity, instead, a significantly higher value was obtained with ANN; we can thus conclude that this classification model is more sensitive in detecting the disease starting from the features used in this analysis.
There is usually a trade-off between sensitivity and specificity. A sensitivity higher than specificity is preferable in diagnosis support systems, as this means that the system is better at detecting the presence of disease in the pathological group than at detecting the absence of disease in the healthy group. Therefore, the system is more effective for ruling out disease when it returns a negative response. In addition, it is worth remarking that the high value of sensitivity was obtained here with a perfectly balanced dataset. Having balanced data provides a more reliable evaluation of the performance, as it is well known that, in the case of unbalanced data, classification algorithms tend to favor the correct prediction of the over-represented class at the expense of the others. However, it is worth noting that in a clinical context achieving a high sensitivity is of paramount importance, so it could be appropriate to also explore situations with unbalanced datasets or to adopt strategies that alleviate the risk of overlooking the diagnosis [40].
The second important issue concerns the dependence of the proposed method on the network size. In our previous work [15], coarse anatomical connectivities were considered for the estimated brain networks. Connectivities obtained between relatively large regions are more robust and reproducible; however, they could overlook detailed patterns of connectivity, which may play a key role in the investigation of neurological diseases. This issue was partially addressed in [16], where we focused only on the subcortical sub-network. In this paper, we further extended our previous studies by taking into account a different parcellation scheme and a different reconstruction of the brain connectivity network to study patterns of connectivity emerging at the whole-brain level.
First of all, we observed that the most significant brain region pairs involve connections between cortical regions or between subcortical and cortical regions. No significant connections between subcortical regions were found, confirming our previous results supporting the hypothesis that AD connectivity alterations mostly regard the inter-connectivity between subcortical and cortical regions rather than the intra-subcortical connectivity. Among the most significantly different connections we found the Temporal-Frontal and Temporal-Parietal ones, which are known to be affected in AD [41,42]. The anatomical areas occurring most often in the most significant brain region pairs are the Occipital, Parietal and Temporal Lobes, which are highly AD-related brain regions [43][44][45]. Indeed, the Occipital area is responsible for visual processing, the Parietal area has an important role in integrating the senses, while the Temporal area is essential for memory. The Cerebellum was also found among the most occurring regions. The role of the Cerebellum in AD has been a topic of debate in recent years: only recently has its close association with cognitive deterioration emerged, e.g., [46,47]. Three subcortical regions were found among the most significantly different node pairs: the right Caudate, the left and right Hippocampus and the Putamen. These regions also play an important role in AD [48][49][50]. It is interesting to note that some region pairs show significantly greater communicability in AD compared to HC. Similar findings have been reported, for example, in [51], where some areas of greater communicability in stroke patients compared to controls were found even in the lesioned hemisphere, also in that case starting from DWI data to study a disconnection syndrome. One possible interpretation of this pattern is the hypothesis that the increased communicability in AD could reflect adaptive changes in the white matter structure occurring secondary to the disease.
We also compared the most occurring regions resulting from this study to our previous findings [15,16]. In particular, we found a significant overlap with some cortical and sub-cortical anatomical regions: parietal lobe, paracentral gyrus and temporal areas are the most overlapping regions. In addition, we found a consistent overlap in both parahippocampal and caudal regions.
Another important point concerns the agreement between the three classification algorithms. Although the classification models are based on different algorithms, they showed very similar results in terms of accuracy, AUC and specificity, as well as in the detection of the most significant features. These results attest to the robustness of the framework based on whole-brain graph communicability.
Other open issues should be addressed in future work. In this paper, only the binary discrimination HC/AD was taken into account: the classification task involving mild cognitive impairment (MCI) subjects should also be considered in order to support disease diagnosis at earlier stages. However, it should be noted that the AD patients taken into account in the present study, from the ADNI database, are characterized by an MMSE score indicating moderate rather than severe cognitive impairment.
In addition, a greater sample size should be used and other classification strategies should be explored to further improve the diagnostic accuracy. Novel insights could be obtained, for example, by using a multi-expert approach or a set of fuzzy inference rules. Promising results from using these techniques for diagnostic purposes in other domains have been recently reported, e.g., [52,53]. A multi-expert approach, which is based on ensembling different classifiers trained for the same task, may further improve prediction accuracy, as it can benefit from the different viewpoints from which these classifiers look at the data. Concerning the use of fuzzy inference rules, they can be extracted from data with the aim of providing more explicit classification rules and thus better supporting the decisions made by physicians. The use of fuzzy logic has proven beneficial in brain MRI: as a matter of fact, Fuzzy C-Means clustering has been effectively applied to MR image segmentation in neuroimaging for neurodegenerative diseases [54,55], as well as in oncology [56]. Future work will address these issues.

Funding: This paper has been partially supported by the Apulian regional INNONETWORK project, project code BNLGWP7.
Acknowledgments: Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from private-sector partners.

Conflicts of Interest:
The authors declare no conflict of interest.