Computer-Aided Diagnosis of Skin Diseases using Deep Neural Networks

Abstract: The propensity of skin diseases to manifest in a variety of forms, the lack and maldistribution of qualified dermatologists, and the exigency of timely and accurate diagnosis call for automated Computer-Aided Diagnosis (CAD). This study aims to extend previous work on CAD for dermatology by exploring the potential of Deep Learning to classify hundreds of skin diseases, improving classification performance, and utilizing disease taxonomy. We trained state-of-the-art Deep Neural Networks on two of the largest publicly available skin image datasets, namely DermNet and ISIC Archive, and also leveraged disease taxonomy, where available, to improve the classification performance of these models. On DermNet we establish a new state-of-the-art with 80% accuracy and 98% Area Under the Curve (AUC) for classification of 23 diseases. We also set a precedent for classifying all 622 unique sub-classes in this dataset, achieving 67% accuracy and 98% AUC. On ISIC Archive we classified all 7 diseases with 93% average accuracy and 99% AUC. This study shows that Deep Learning has great potential to classify a vast array of skin diseases with near-human accuracy and far better reproducibility. It can play a promising role in practical real-time skin disease diagnosis by assisting physicians in large-scale screening using clinical or dermoscopic images.


Introduction
Deep Learning (DL) [1] is a branch of Artificial Intelligence (AI) in which a computer algorithm analyses raw data and automatically learns the discriminatory features needed for recognizing hidden patterns in them. Over the last decade, this field has witnessed striking advances in the ability of DL-based algorithms to analyse various types of data, especially images [2] and natural language [3]. The most common DL models are trained using supervised learning, in which datasets are composed of inputs (e.g., dermoscopic images of skin diseases) and corresponding target output labels (e.g., diagnoses or skin disease classes such as 'benign' or 'malignant'). Healthcare and medicine can greatly benefit from recent advances in image classification and object detection [4], particularly those medical disciplines in which diagnoses are primarily based on the detection of morphologic changes, such as pathology, radiology, ophthalmology and dermatology. In such medical domains, digital images are captured and provided to DL algorithms for Computer-Aided Diagnosis (CAD). These advanced algorithms have already made their mark on automated detection of tuberculosis [5], breast malignancy [6], glaucoma [7], diabetic retinopathy [8] and serious brain findings such as stroke, haemorrhage, and mass effects [9].
Large-scale manual screening for diseases is exhaustively laborious, extremely protracted, and severely susceptible to human bias and fatigue. Since manual diagnosis may also be affected by physicians' level of experience and the different dermoscopic algorithms in which they are formally trained, multiple experts might disagree on their diagnosis for a certain condition [10,11]. Additionally, due to physicians' subjective judgements, manual diagnosis is hardly reproducible [12]. On the other hand, CAD can provide swift, reliable and standardized diagnosis of various diseases with consistency and accuracy. CAD can also offer efficient and cost-effective screening and prevention of advanced tumour diseases to people living in rural or remote areas where expert dermatologists are not readily available.
Most publicly available datasets of clinical or dermoscopic images, like the Interactive Atlas of Dermoscopy [13], Dermofit Image Library [14], Global Skin Atlas, MED-NODE [15] and PH2 [16], contain only a few hundred to a couple of thousand images. Ali et al. [17] reported that around 78% of the studies they surveyed used datasets smaller than 1000 images, and the study using the largest dataset had 2430 images. Therefore, most existing works on CAD of skin diseases use either private or very small publicly available datasets. Additionally, these studies usually focus overwhelmingly on binary or ternary classification of skin diseases, and little attention is paid to multi-class classification that would explore the full potential of DL. Therefore, such studies act merely as a proof-of-concept for the efficacy of AI in dermatology.
In this work, we extend previous works by showing that DL models are fairly capable of recognising hundreds of skin lesions, and therefore should be exploited to their full extent. We trained many state-of-the-art DL models for classification of skin diseases using two of the largest publicly available datasets, namely DermNet and ISIC Archive (2018 version). We also employed non-visual data in the form of disease taxonomy to improve our classification results and show that DL can process and utilize multi-modal input for better classification performance.

Related Work
Convolutional Neural Networks (CNNs) are computer models inspired by the biological visual cortex. These models have been proven to be very efficient, accurate and reliable in image classification. They have already achieved near-human performance in many challenging natural image classification tasks [18][19][20][21] and have also been used to classify diseases from medical images [4].
Towards automated skin disease classification, Kawahara et al. [22] employed CNNs to extract features and trained a linear classifier on them using 1300 images of the Dermofit Image Library to perform 10-ary classification. A similar approach was used by Ge et al. [23] on the MoleMap dataset to do 15-ary classification. Esteva et al. [24] used a pre-trained Inception v3 on around 130,000 images. Although their results for two binary-classification tasks are merely "on par with all tested experts", this work was the first credible proof-of-concept based on a large dataset that DL can make a practical contribution to real-world diagnosis. Following their steps, Haenssle et al. [25] pitched their fine-tuned Inception v4 model against 58 dermatologists after evaluating the binary classification performance of their model on two test sets of only 100 and 300 images. The sensitivity and specificity of their Deep Neural Network (DNN) model is certainly higher than the dermatologists' mean performance on two private test sets; however, its performance on the publicly available International Symposium on Biomedical Imaging (ISBI) 2016 Challenge [26] test data is below that of the first two winning entries in that challenge.
To address the scarcity of available data for tracking and detecting skin diseases, Li et al. [27] developed a domain-specific data augmentation technique, merging individual lesions with full-body images to generate a large volume of synthetic data. Li and Shen [28] also used DNNs to segment lesions, extract their dermoscopic features and classify them.

Datasets
DermNet is a freely available dataset of around 23,000 images gathered and labelled by the Dermnet Skin Disease Atlas. We were able to download 22,501 images because the links for the rest of them appeared to be inactive. This dataset provides diagnoses for 23 super-classes of diseases which are taxonomically divided into 642 sub-classes. However, there were some duplicate, empty and irrelevant sub-classes in the data. After pruning, 21,844 images in 622 sub-classes remained. The distribution of the DermNet dataset used in this work is given in Table 1. The second dataset is an online archive of around 24,000 images divided into seven classes, maintained by The International Skin Imaging Collaboration (ISIC). Their growing archive of high-quality clinical and dermoscopic images is manually labelled. The distribution of images in the ISIC Archive-2018 dataset can be found in Table 2.

Experimental Setup
We used various state-of-the-art DNN architectures developed in recent years, like residual networks, inception networks, densely connected networks, and frameworks facilitating architecture search. To cope with the never-ending appetite of deep CNNs for data, we used these models pre-trained on ImageNet, which is a large dataset of around 1.5 million natural scene images divided into 1000 classes. We fine-tuned these models on the dermatology datasets to leverage the benefits of transfer learning. Of the various CNN architectures tried for this task, we eventually selected ResNet-152 [29], DenseNet-161 [30], SE-ResNeXt-101 [31], and NASNet [32] for their better performance. To report the final results, we combined the potential of all of these biologically inspired neural networks by taking an ensemble of their individual predictions. For the ensemble, we averaged the individual predictions of the four best-performing CNNs to produce the final prediction.
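A minimal sketch of the averaging ensemble described above, assuming each model outputs softmax probabilities over the same classes; the random tensors below merely stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Softmax outputs of four hypothetical models on a batch of 2 images, 5 classes.
model_probs = [F.softmax(torch.randn(2, 5), dim=1) for _ in range(4)]

# Unweighted ensemble: average the per-model probabilities, then take argmax.
ensemble = torch.stack(model_probs).mean(dim=0)   # (2, 5)
prediction = ensemble.argmax(dim=1)               # final class per image
```

Because each model's output rows are valid probability distributions, their average is one as well, so the ensemble output can be used directly for Top-N accuracy and AUC computations.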
It is important to note here that comparing studies that use different datasets, or different subsets or train/test splits of the same dataset, is not scientifically sound. Since neither of the two datasets used in this work provides instructions on dividing the data into train and test sets, we used stratified k-fold cross validation (k = 5 in this work) so that future research can at least be compared with ours. The k-fold cross validation is a statistical method to ensure that the classifier's performance is less biased towards a randomly taken train/test split. It is performed by dividing the whole dataset into k, possibly equal, portions or folds. During a training iteration, one of these folds is kept aside for validation and the remaining k − 1 folds are used for training the model. In the next training iteration, a different fold is kept aside for validation and the remaining k − 1 are used for training. This way, the train and test sets in each iteration are completely mutually exclusive. This process is repeated k times such that each of the k folds is used for validation exactly once. This cross-validation approach provides a more realistic approximation of generalization. For training, we randomly cropped the images with a scale factor ranging between 0.7 and 1.0 while maintaining the aspect ratio. These cropped images are then resized to 224 × 224 pixels (for NASNet the input is resized to 331 × 331) before feeding them to the network. The images are also randomly flipped horizontally with flip probability 0.5. During testing, an image is cropped into four corners (top left, top right, bottom left, and bottom right) and one central crop of the required size. These crops are given to the classifier for inference and an ensemble of the five predictions is taken to produce the final output. The initial learning rate is set to 10^-4 and is halved every five epochs.
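The stratified 5-fold protocol above can be sketched with scikit-learn's StratifiedKFold; the image paths and labels below are hypothetical stand-ins, not the actual datasets or training pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical image paths and labels: 100 images over 5 balanced classes.
paths = np.array([f"img_{i}.jpg" for i in range(100)])
labels = np.array([i % 5 for i in range(100)])

# Stratified 5-fold CV: each fold preserves the class distribution,
# and every image is used for validation exactly once.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

val_sizes = []
for train_idx, val_idx in skf.split(paths, labels):
    # Train and validation sets are completely mutually exclusive.
    assert set(train_idx).isdisjoint(val_idx)
    # ... train a model on paths[train_idx], evaluate on paths[val_idx] ...
    val_sizes.append(len(val_idx))
```

Reporting the mean and standard deviation of a metric across the five validation folds is what yields results of the form 77.53 ± 0.64% used throughout the paper.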
The networks are trained for 20 epochs and 10 epochs for DermNet and ISIC Archive, respectively. The number of training epochs for each dataset and the initial learning rate were determined empirically. To handle class imbalance, we used a weighted loss where the weight for a certain class equals the reciprocal of that class's ratio in the dataset.
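The weighting scheme described above (class weight = reciprocal of the class's ratio in the dataset) can be sketched in PyTorch as follows; the class counts below are illustrative stand-ins for an imbalanced 7-class dataset, not the paper's exact figures.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical per-class image counts for an imbalanced 7-class dataset.
class_counts = np.array([6705, 1113, 1099, 514, 327, 157, 115], dtype=np.float64)
class_ratios = class_counts / class_counts.sum()

# Weight each class by the reciprocal of its ratio, so that rare classes
# contribute proportionally more to the training loss.
weights = torch.tensor(1.0 / class_ratios, dtype=torch.float32)

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 7)              # a batch of 8 raw network outputs
targets = torch.randint(0, 7, (8,))     # ground-truth class indices
loss = criterion(logits, targets)
```

Under this scheme the rarest class receives the largest weight, which counteracts the gradient dominance of majority classes such as melanocytic nevi in the ISIC data.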

Results on DermNet
As DermNet provides the opportunity to leverage the taxonomical relationship among various diseases, we conducted our 23-ary classification experiments in two ways. In the first experiment (Exp-1), we trained our networks on 23 classes and inferred on 23 classes. This is the most prevalent approach. We achieved 77.53 ± 0.64% Top-1 accuracy and 93.87 ± 0.37% Top-5 accuracy with 97.60 ± 0.15% Area Under the Curve (AUC) using an ensemble of the four best models. In the second experiment (Exp-2), we made use of the additional ontology given in the dataset. We trained our networks on 622 classes but inferred on 23 classes only. The use of this disease ontology information translates into incorporation of expert knowledge into the network. We implemented this by summing the predictions of all sub-classes to calculate the prediction of the respective super-class. This approach gave a noticeable boost to our classifiers' performance. We got 79.94 ± 0.45% Top-1 accuracy, 95.02 ± 0.15% Top-5 accuracy and 98.07 ± 0.07% AUC using the ensemble.
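The sub-class-to-super-class aggregation of Exp-2 can be sketched as follows; the tiny 6-sub-class/3-super-class mapping is hypothetical (DermNet's real taxonomy maps 622 sub-classes onto 23 super-classes), but the summation mirrors the description above.

```python
import torch
import torch.nn.functional as F

# Hypothetical taxonomy: super-class index for each of 6 sub-classes.
sub_to_super = torch.tensor([0, 0, 1, 1, 1, 2])
num_super = 3

# The network is trained on sub-classes; convert its logits to probabilities.
logits = torch.randn(4, 6)              # batch of 4 images, 6 sub-classes
sub_probs = F.softmax(logits, dim=1)

# Sum the sub-class probabilities into their super-class bins.
super_probs = torch.zeros(4, num_super)
super_probs.index_add_(1, sub_to_super, sub_probs)

pred_super = super_probs.argmax(dim=1)  # super-class prediction per image
```

Since each sub-class belongs to exactly one super-class, the summed super-class scores remain valid probability distributions, so the same Top-N and AUC metrics apply directly.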
Top-N accuracy indicates the capability of a classifier to predict the correct class within its N most probable guesses. This metric gives a deeper insight into the classifier's learning and discriminating ability. Our results, of Exp-2 for example, show that the model was able to predict the correct diagnosis out of 23 possible diseases on the first attempt with almost 80% accuracy. However, when allowed to make the 5 most probable predictions about a given image, the classifier achieved more than 95% accuracy. This means that even when the first prediction of the classifier is wrong, the correct diagnosis is high on the list of the next four predictions. Table 3 shows detailed performance metrics of 23-ary classification in both experiments. Accuracies and AUC scores of individual classifiers for Exp-1 and Exp-2 are given in Table A1 in Appendix A. Figure 1 shows that many reciprocatory misclassifications in Exp-1, like between Eczema (abbreviated as ECZ in Figure 1) and Psoriasis Lichen Planus (PSO) and between Actinic Keratosis BCC (AKBCC) and Seborrheic Keratosis (SEB), are corrected to a large extent in Exp-2 by utilizing the taxonomical relationship among diseases.
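Top-N accuracy as described above can be computed with a short helper; the toy probabilities and targets below are illustrative, not drawn from the experiments.

```python
import torch

def top_n_accuracy(probs: torch.Tensor, targets: torch.Tensor, n: int) -> float:
    """Fraction of samples whose true class is among the n highest-scored classes."""
    top_n = probs.topk(n, dim=1).indices               # (batch, n)
    hits = (top_n == targets.unsqueeze(1)).any(dim=1)  # true class in top n?
    return hits.float().mean().item()

# Toy example: 3 samples over 4 classes.
probs = torch.tensor([[0.10, 0.60, 0.20, 0.10],
                      [0.50, 0.10, 0.30, 0.10],
                      [0.25, 0.20, 0.45, 0.10]])
targets = torch.tensor([1, 2, 0])

top1 = top_n_accuracy(probs, targets, 1)  # only the first sample is correct
top2 = top_n_accuracy(probs, targets, 2)  # all three true classes are in the top 2
```

Top-1 here corresponds to ordinary accuracy; Top-5 on the 23-class task tells how often the correct diagnosis appears in the classifier's five best guesses.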
We not only performed classification for the 23 super-classes but also took a step forward and classified all 622 unique sub-classes as well. We obtained 66.74 ± 0.64% Top-1 accuracy and 86.26 ± 0.54% Top-5 accuracy with 98.34 ± 0.09% AUC. The small standard deviations in all of these results signify the stability and consistency of our classifier's performance.
Previous works on DermNet have generally opted for a subset of the 23 super-classes for classification. However, Haofu Liao [33] chose to classify all 23 classes and reported a best Top-1 accuracy of 73.1% and Top-5 accuracy of 91% on 1000 randomly chosen test images. Cícero et al. [34] reported a Top-1 accuracy of 60% on 24 classes (they split "Melanoma and Melanocytic Nevi" into malignant and benign classes). They picked only 100 examples of each class for their test set. To the best of our knowledge, the classification task with the highest number of classes on DermNet was previously performed by Prabhu et al. [35]. They performed 200-ary classification and obtained a highest Mean Class Accuracy (MCA) of around 51%. Classification accuracy and AUC of individual models for 622-ary classification are given in Table A2 in Appendix A.

Results on ISIC Archive-2018
ISIC Archive consists of high-resolution clinical and dermoscopic images. It does not provide any ontology information about the diseases; therefore, the approach used in Exp-2 for DermNet cannot be applied here. We achieved a Top-1 accuracy of 93.06% ± 0.31% and Top-2 accuracy of 98.18% ± 0.06% with 99.23% ± 0.02% AUC using the ensemble approach. Since this dataset has only seven classes, we restricted ourselves to Top-2 accuracy. Table 4 shows that the ensemble of four classifiers was able to achieve high precision of over 80% for all classes except Vascular Lesions, which can be attributed to the small number of images (only 157) in this class. The confusion matrix showing the number of correctly classified and misclassified images per class in this dataset is shown in Figure 2. Table A3 in Appendix A presents accuracy and AUC scores of individual classifiers. The ISIC Challenges of 2016 [26] and 2017 [36] focused on binary classification of skin lesions, whereas the ISIC Challenge 2018 [37] included seven classes. However, as shown in our experiments, DL has enormous capacity to discern far more diseases with high sensitivity and specificity if given enough data. While reliable and accurate detection of melanoma is of utmost importance because of its lethality, it might also be of interest for dermatologists to use CAD to detect other non-lethal skin diseases. Figure 3 shows some examples of correct and misclassified images. We observed that some of these misclassified images had very high correlation with other classes. For example, there is notably small inter-class variance between Figure 3a

Discussion
Automated diagnosis of skin diseases has enjoyed much attention from researchers for quite some time now. However, most of these studies confine themselves to only binary or ternary classification [38][39][40][41][42][43] even when a large number of classes is available [44]. The importance of early detection of melanoma is understandable, given the growing risk it poses to the patient's survival with every passing day. However, there are thousands of other skin diseases [24] that might not be as fatal as melanoma but have an enormous impact on a patient's quality of life. DL is extremely competent to take on hundreds of classes simultaneously, as evident from our results. We believe that this is the right time to harvest the potential of DL to its full extent and start conducting impactful research that can actually translate into industry-standard solutions for automated skin disease diagnosis on a larger scale. These solutions can have a far-reaching social impact by not only helping dermatologists with their diagnosis in a clinical setup but also providing economical and efficient initial screening for underprivileged people in both developed and developing countries.
Another consideration in terms of the application of DL in dermatology is that many researchers either use private datasets or public datasets with their own choice of train/test splits (although randomly taken) and number of classes. For this reason, there is little common ground, and oftentimes no ground at all, to compare various classification methods, as also noted by Brinker et al. [45]. This issue of non-comparability can be resolved by collecting and maintaining a standardized, publicly available large dataset with explicitly specified train/test splits and standard performance metrics for benchmarking. Some public datasets, like the ISIC Challenge datasets, do provide a predefined train/test split, but their size is normally small and the task is usually restricted to binary or ternary classification. Research on such small datasets cannot be reliably generalized, and although the results are publishable, they cannot be used as a stepping stone for practical applications of AI in real-world diagnosis. On the other hand, large public datasets normally have a lot of noise, contain images with very low resolution, or are watermarked. Significant useful information required for fine-grained classification of seemingly similar diseases is lost in such low-resolution or watermarked images. Additionally, non-visual metadata, like medical history, is not usually available with medical image datasets. However, this additional information could be pivotal for confident and accurate diagnosis. We were able to utilize disease taxonomy for the DermNet dataset and improve our results by 2.5% (refer to Table A1). If multi-modal datasets are curated and provided publicly, AI can surely leverage the additional information to improve its classification performance.
While understanding and interpreting the results of any AI-based classifier, it is important to realize that accuracy, or even sensitivity and specificity, might not portray the complete picture of a model's performance. That is why the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) is also reported along with other performance metrics. From an AI point of view, we might argue that achieving around 80% average sensitivity with a 1.6% average false positive rate (Table 3, Exp-2) for a 23-ary classification task using highly unbalanced datasets of low-resolution and watermarked images is a reasonable achievement. Nevertheless, the actual performance of any AI-based classifier can be significantly different in a practical clinical setup, as noted by Navarrete-Dechent et al. [46]. They found that the classifier developed by Han et al. [47] did not generalize well when presented with data from an archive of a different demography than the one used to train the classifier. For a dermatologist this is certainly a cause of concern. However, Han et al. advocated in their response [48] that a classifier should not be judged merely on the basis of sensitivity and specificity. The ROC curves indicate the true ability of a classifier to perform under a wide range of operating points or thresholds while making a diagnosis prediction for a given image. Varying this threshold from 0 to 1 on the model's output can change the trade-off between sensitivity and specificity and yield different accuracy. Therefore, higher AUC values ensure that the model has the ability to correctly predict a certain disease, for example melanoma, with a minimum chance of classifying any other disease as that particular disorder.
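The relationship between the decision threshold, the sensitivity/specificity trade-off, and AUC described above can be illustrated with scikit-learn on toy binary scores (hypothetical values, not the paper's results).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy scores for a binary "melanoma vs. rest" decision.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])           # 1 = melanoma
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.6])

# Sweeping the decision threshold yields one (false positive rate,
# sensitivity) operating point per threshold; the ROC curve traces them all.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC summarizes performance across every possible operating point.
auc = roc_auc_score(y_true, y_score)
```

A single accuracy figure corresponds to one point on this curve; reporting the AUC captures how the classifier behaves as the threshold, and hence the sensitivity/specificity trade-off, is varied.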

Conclusions
In this paper we have built on previous works on CAD for dermatology and exhibited that DNNs are fairly competent to identify hundreds of skin lesions, and therefore should be exploited to their full potential instead of being employed to classify only a handful of diseases. We have also set a new state-of-the-art result for 23-ary classification on DermNet. Non-visual metadata is not normally available with most medical image datasets. However, if such additional information is available, DNNs are capable of utilizing it to improve their classification performance, as is evident from our experiment using disease taxonomy to noticeably improve our classification accuracy.

Funding: This work was partly funded by National University of Science and Technology (NUST) Pakistan (0972/F008/HRD/FDP), BMBF project DeFuseNN (01IW17002) and BMBF project ExplAINN (01IS19074).

Acknowledgments:
The authors would like to extend their gratitude to Dieter Metze and Kerstin Steinbrink for providing insightful feedback and valuable suggestions to improve the draft. M. N. Bajwa is also thankful to Arbab Naila for helping with validation of results.

Conflicts of Interest:
The authors have no conflicts of interest to declare.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
This section presents classification accuracy and AUC for individual classifiers and their ensembles for both DermNet and ISIC Archive-2018 datasets.  Figure A1 shows ROC curves and Area under these ROC curves for all experiments conducted and reported above.