Computer Aided Classification of Neuroblastoma Histological Images Using Scale Invariant Feature Transform with Feature Encoding

Neuroblastoma is the most common extracranial solid malignancy in early childhood. Optimal management of neuroblastoma depends on many factors, including histopathological classification. Although histopathology is considered the gold standard for classifying neuroblastoma histological images, computers can help to extract many more features, some of which may not be recognizable by the human eye. This paper proposes a combination of the Scale Invariant Feature Transform with a feature encoding algorithm to extract highly discriminative features. These distinctive image features are then classified by a Support Vector Machine into five clinically relevant classes. The advantage of our model is that it extracts features that are more robust to scale variation than the Patched Completed Local Binary Pattern and Completed Local Binary Pattern methods. We gathered a database of 1043 histological images of neuroblastic tumours classified into five subtypes. Our approach identified features that outperformed the state of the art on both our neuroblastoma dataset and a benchmark breast cancer dataset. Our method shows promise for the classification of neuroblastoma histological images.


Introduction
Neuroblastoma represents 8% of all malignancies in infants [1] and is the most common extracranial solid malignancy in childhood [2]. More than 15% of paediatric cancer deaths are the result of neuroblastoma [3]. Optimal management of neuroblastoma depends on many factors, one of which is histopathological classification, performed by pathologists who use an optical microscope to examine thin slices of stained tumour tissue on a glass slide. Pathologists commonly use the Shimada system [4] to classify neuroblastic tumours. The contributions of this paper are: 1. We apply SIFT to extract features of neuroblastoma histological images that are robust to scale variations. 2. We combine SIFT with a bag of features to reduce the number of features by extracting highly discriminative ones. 3. We evaluate our method against other state-of-the-art benchmarks, which shows its effectiveness in the classification of neuroblastoma histological images. 4. For further evaluation, we apply our method to a benchmark breast cancer dataset, the BreaKHis dataset, and compare the results with state-of-the-art methods; the results show the effectiveness of our method.
The rest of this paper is organized as follows. Section 2 presents the material and method, Section 3 shows the results, Section 4 presents the discussion, and Section 5 concludes the work.

Dataset Construction
As there is no publicly available dataset for the analysis of neuroblastic tumours, we constructed a dataset of images of neuroblastic tumours from The Tumour Bank of the Kid's Research Institute at The Children's Hospital at Westmead, Sydney, Australia. All patient-specific details were removed, and a de-identified dataset was used in this research. The initial dataset was generated from cancer tissue biopsy slides and consisted of images of H&E-stained tissue microarrays (TMAs) of neuroblastic tumours. We used six TMA slides and seven whole sections representing 125 patients. Each TMA slide contains 20 to 40 cores of neuroblastic tumour. Samples of TMA slides containing cores of different neuroblastic tumours are shown in Figures 2 and 3. The TMA cores are 1.2 mm in diameter, stained with H&E and cut at 3 µm thickness. Although most images belong to different patients, some are duplicates. Our constructed dataset is much larger, in terms of both patients and images, than the datasets used by Tafavogh et al. [12] and Kong et al. [11]. All collected tissue samples were classified by pathologists as undifferentiated neuroblastoma, poorly-differentiated neuroblastoma, differentiating neuroblastoma, ganglioneuroblastoma, or ganglioneuroma. Representative images from these categories are shown in Figure 4. Because the scanned neuroblastic tumour images are very large, up to a gigabyte each, the computational cost of processing them whole is high. They were therefore cropped by an expert histopathologist [AC] into equal-sized square regions (300 × 300 pixels), each of which contains the characteristics of a particular subtype of neuroblastic tumour. This size is large enough to capture the diagnostic features of each category and small enough to keep the computational cost manageable. Table 1 shows the number of images in our dataset.

Method
The overall framework of our approach is shown in Figure 5. The Scale Invariant Feature Transform (SIFT) extracts features from the test images and feeds them into the feature encoding block, which summarizes them into new discriminative features. An SVM then classifies each test image by comparing its encoded features with those of the training set images.

Scale Invariant Feature Transform
SIFT was proposed by Lowe in 2004 [15]. In this method, distinctive image features are extracted at scale-invariant keypoints. SIFT features are invariant to rotation and robust to scale and illumination variations. The SIFT algorithm consists of four stages: detection of scale-space extrema, keypoint localization, orientation assignment, and descriptor representation. These are described in detail below.

Detection of Scale-Space Extrema
The first step searches over multiple scales and image locations to identify keypoints that are invariant to changes in scale. Locations that are invariant to scale changes are detected using a scale-space function [16]. Lindeberg [17] showed that the only possible scale-space kernel is the Gaussian function. The scale space of an image, L(x, y, σ), is calculated by convolving a variable-scale Gaussian function, G(x, y, σ), with the input image, I(x, y):

L(x, y, σ) = G(x, y, σ) ∗ I(x, y), with G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²))

where x and y are image coordinates, σ is the width of the Gaussian function, and ∗ denotes convolution. Stable keypoint locations are defined by the Difference of Gaussian (DoG) function in the scale space. The scale-space extrema are located in D(x, y, σ), which is calculated as the difference between two images, one at scale σ and the other at scale kσ, where k is a constant multiplicative factor.
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ). Points in the DoG function that are maxima or minima among their 26 neighbours (eight neighbours in the current scale and nine in each of the scales above and below) are taken as extrema and are called keypoints. The detection of keypoints is shown in Figure 6.

Keypoint Localization
Some of the extracted keypoints are unstable because they have low contrast or are poorly localized along edges; these keypoints are eliminated in this stage. The refined keypoint location ẑ is obtained from the Taylor expansion of D(x, y, σ) around the sample point by setting its derivative to zero:

ẑ = −(∂²D/∂x²)⁻¹ (∂D/∂x)

If the function value at the extremum, |D(ẑ)|, is less than a threshold value, called the contrast threshold (C_C), the point is rejected. Therefore, the extracted keypoints with low contrast are eliminated. Some of the extracted keypoints also have a strong response along edges, and these must be eliminated to increase stability. Lowe [15] used a 2 × 2 Hessian matrix, a square matrix of second-order partial derivatives that describes the local curvature of a function, to compare the two eigenvalues at edge points. If the ratio between the larger and smaller eigenvalue is greater than the edge threshold (C_E), the keypoint is discarded. After eliminating the low-contrast and edge keypoints, the remaining keypoints are more stable. The values of C_C and C_E are set in the experimental setup subsection.
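For illustration, the edge-rejection criterion can be written without computing eigenvalues explicitly, using Lowe's trace/determinant shortcut (a sketch; the function name and default threshold are illustrative):

```python
def passes_edge_test(dxx, dyy, dxy, edge_threshold=10.0):
    """Reject keypoints lying on edges using the 2x2 Hessian of D.

    If r is the ratio of the larger to the smaller eigenvalue, then
    Tr(H)^2 / Det(H) = (r + 1)^2 / r, so comparing against that bound
    is equivalent to thresholding the eigenvalue ratio directly.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:                 # eigenvalues of opposite sign: reject
        return False
    r = edge_threshold
    return trace ** 2 / det < (r + 1) ** 2 / r
```

A blob-like point (similar curvature in both directions) passes, while an edge point (one large, one small curvature) is rejected.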

Orientation Assignment
Before a descriptor is extracted for a keypoint, the keypoint is assigned an orientation to make it invariant to rotation. The orientation is calculated from the orientation histogram of local gradients in the closest smoothed image L(x, y, σ), the scale at which the keypoint was detected. The gradient magnitude, m(x, y), and orientation, θ(x, y), are calculated as follows:

m(x, y) = √((L(x + 1, y) − L(x − 1, y))² + (L(x, y + 1) − L(x, y − 1))²)

θ(x, y) = tan⁻¹((L(x, y + 1) − L(x, y − 1)) / (L(x + 1, y) − L(x − 1, y)))
An orientation histogram is created from the gradient orientations of sample points within a region around the keypoint. It consists of 36 bins covering the 360 degrees of orientation [15]. Each sample added to the histogram is weighted by its gradient magnitude, m(x, y), and by a circular Gaussian with σ equal to 1.5 times the scale of the keypoint [15]. Any peak above 80% of the maximum peak height is also used to assign an orientation; this creates keypoints with the same location and scale but different directions. An accurate orientation is then obtained by interpolating around the dominant peaks in the histogram. The orientation calculation is illustrated in Figure 7.
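The 36-bin weighted histogram and the 80% peak rule can be sketched as follows (a simplified illustration that omits the circular Gaussian weighting and the peak interpolation; the names are ours):

```python
import numpy as np

def orientation_histogram(magnitudes, orientations_deg, n_bins=36):
    """Build a 36-bin orientation histogram (10 degrees per bin).

    Each sample contributes its gradient magnitude to the bin of its
    orientation; peaks >= 80% of the maximum yield candidate orientations.
    """
    bin_width = 360 // n_bins
    bins = (np.asarray(orientations_deg) // bin_width).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins, magnitudes)   # magnitude-weighted accumulation
    peaks = np.flatnonzero(hist >= 0.8 * hist.max())
    return hist, peaks * bin_width      # peak orientations (bin start, deg)
```

A keypoint whose histogram has two peaks above the 80% threshold would be duplicated, once per dominant orientation.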

Descriptor Representation
In this stage, a SIFT descriptor is calculated for each keypoint as a histogram of orientations in eight directions (as shown in Figure 8). Following [15,18,19], a 16 × 16 neighbourhood around each keypoint is divided into a 4 × 4 grid of sub-regions, and the gradient vectors in each sub-region are aggregated into an 8-bin histogram. SIFT therefore extracts a 4 × 4 × 8 = 128-element feature vector for each keypoint. SIFT has three free parameters that must be set. The first is σ, the width of the Gaussian; increasing σ reduces the number of features generated from an image [20]. The second is the contrast threshold (C_C), which eliminates keypoints with low contrast. The third is the edge threshold (C_E), which eliminates unstable keypoints near edges. These parameters are set in the experimental setup subsection.

Feature Encoding
The SIFT descriptor extracts a feature vector with 128 elements for each keypoint. Using all of the extracted features to classify neuroblastoma histological images is time-consuming and has a high computational complexity. We therefore use a bag of features [21] as a feature encoding to compute more discriminative representations. The scheme of the bag of features is shown in Figure 9. It starts by selecting keypoints, step (b), and then describes the keypoints from the input image, step (c). Next, the extracted features are clustered, step (d), and a codebook consisting of codewords is constructed, step (e); the codewords are the centroids of the clustered SIFT features. A bag-of-features histogram then counts the occurrences of each codeword, step (f). To construct the codebook, all extracted keypoint descriptors are clustered to find a set of centroids; in this work, we use the k-means algorithm [22] for clustering. An important factor in constructing the codebook is the choice of the number of codewords. We evaluate different codebook sizes and select the best one; experimental results are given in the experimental setup subsection. The input image is thus represented by a histogram of codewords produced by the feature encoding block. In the final stage, histological images are classified by training a classifier on these histograms. We use an SVM classifier, which can efficiently perform non-linear classification using an optimal hyperplane in a high-dimensional space.
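A minimal sketch of the codebook construction and histogram encoding, using scikit-learn's k-means and SVM (the function names and toy dimensions are ours; the paper's codebook size is 500):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(all_descriptors, n_codewords=500):
    """Cluster training descriptors; the centroids become the codewords."""
    return KMeans(n_clusters=n_codewords, n_init=10,
                  random_state=0).fit(all_descriptors)

def encode(descriptors, codebook):
    """Bag-of-features encoding: count codeword occurrences, L1-normalize."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# Per-image histograms are then classified with an RBF-kernel SVM,
# as in the paper (gamma = 0.004 is the paper's tuned value).
classifier = SVC(kernel="rbf", gamma=0.004)
```

Each training image contributes its descriptors to `build_codebook`; at test time, `encode` maps an image's descriptors onto the fixed codebook before classification.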

Results
In this section, we evaluate the performance of the proposed approach on the constructed dataset of histological images of neuroblastoma. The dataset is divided randomly into three subsets: a training set (623 images), a validation set (211 images) and a testing set (209 images). We use the validation set to select the optimum values of the free parameters of the proposed approach. Finally, the system is evaluated using the training and testing sets.

Experimental Setup
SIFT has three parameters to be selected: the Gaussian width σ, the contrast threshold (C_C), and the edge threshold (C_E). We randomly divide the validation set into two subsets of 150 and 61 images, train the algorithm on the 150 images, and compute its accuracy on the 61 images to tune the free parameters. We repeat this procedure ten times and average the results. We first tested SIFT with σ = 0.1, with C_C and C_E fixed at 0.02 and 5, respectively, and increased σ in steps of 0.4. According to Table 2, the best accuracy was found at σ = 1.7, so we use σ = 1.7 in the subsequent experiments. In another experiment, we searched for the optimum C_C while keeping C_E at 5; according to Table 3, the best accuracy was found at C_C = 0.04. We then tested five different values of C_E and, according to Table 4, found the best accuracy at C_E = 11. Furthermore, we tested the proposed system with five codebook sizes: 300, 400, 500, 600 and 700. Table 5 shows the average classification accuracy versus codebook size; the best accuracy was found at a codebook size of 500, which was used in the subsequent experiments. For the SVM, we tried different kernels: linear, polynomial, Radial Basis Function (RBF), and sigmoid. We retained the RBF kernel, which produced the best results. The kernel parameter γ was set empirically, with a best value of 0.004.
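The one-parameter-at-a-time search with averaging over repeated splits can be sketched generically (the hypothetical `evaluate` callback stands in for one train/validate run on a 150/61 split):

```python
import numpy as np

def tune_parameter(evaluate, candidate_values, n_repeats=10, seed=0):
    """Pick the candidate value with the best mean validation accuracy.

    `evaluate(value, rng)` is a stand-in for: split the validation set
    150/61, train with the candidate parameter value, return accuracy.
    """
    rng = np.random.default_rng(seed)
    mean_scores = {value: float(np.mean([evaluate(value, rng)
                                         for _ in range(n_repeats)]))
                   for value in candidate_values}
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```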

System Evaluation
To evaluate the proposed method, we use the remaining 80% of the dataset, which was not seen during the experimental setup. These images are divided randomly into a training set (623 images) and a testing set (209 images). We train the final model on the training set, test it on the testing set, repeat this procedure ten times, and report the average accuracy. We use the standard bag of features and the pyramid bag of features [21] algorithms to evaluate the proposed method, in each case with both the RBF kernel and the histogram intersection kernel [23]. We measure performance by the weighted average (weighted by the number of samples in each class) F-measure, recall and precision. The results are shown in Table 6. We compared our method with the Patched Completed Local Binary Pattern (PCLBP) and Completed Local Binary Pattern (CLBP) benchmarks. The distribution of the computed F-measures for the different approaches over the ten trials is presented in Figure 10; it shows that the combination of SIFT with the bag of features works better than CLBP and PCLBP.
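The weighted-average metrics can be computed with scikit-learn (the labels below are a toy five-class-style example, not the paper's results):

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels; "weighted" averages per-class scores by class support.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
```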
We completed a t-test (with a significance level of 5%), obtaining p-values of 15 × 10⁻⁷ and 12 × 10⁻⁵ for CLBP and PCLBP with the histogram intersection kernel SVM, respectively. The results show that the combination of SIFT with bag of features and a histogram intersection kernel SVM statistically significantly improves the classification accuracy compared with CLBP and PCLBP with an RBF kernel SVM. In the best case, the F-measure of our proposed method is 10.04% and 9.23% higher than that of CLBP and PCLBP, respectively. The weighted average precision, recall, and F-measure of our algorithm are all better than those of PCLBP and CLBP. Table 7 shows a representative confusion matrix. Interestingly, the poorest computer performance was in discriminating between poorly-differentiated and differentiating neuroblastoma, a distinction that human pathologists also find difficult in limited fields of view.

Comparison to Other Datasets
We tested our algorithm on the collected dataset of 1043 (300 × 300 pixel) neuroblastoma histological images. As there was no publicly available dataset in this field, we also evaluated the proposed algorithm on a second dataset provided by the University of Bristol, consisting of five whole tissue sections: one ganglioneuroblastoma, three poorly-differentiated neuroblastomas and one ganglioneuroma. To evaluate the proposed method, we use the 623 sub-images cropped from the Children's Hospital at Westmead dataset as the training set and the five whole tissue sections from the University of Bristol as the testing set. We randomly select ten sub-images (300 × 300 pixels) from each whole tissue section. First, the algorithm assigns a label to each sub-image; then each tissue section is classified by a majority vote among the ten labels of its sub-images. The predicted labels for the whole tissue sections and the results of the majority vote are shown in Table 8. Our algorithm correctly classified three of the five whole tissue sections; the other two were misclassified as the differentiating type, while their actual types are poorly-differentiated and ganglioneuroblastoma. Figures 11 and 12 show the sub-images for whole tissue sections 4906 and 4909. The actual label of whole tissue section 4906 is poorly-differentiated: four of the ten sub-images extracted from 4906 were predicted as poorly-differentiated neuroblastoma and the other six as differentiating neuroblastoma, so the proposed algorithm assigns the label differentiating to section 4906 by majority vote. Similarly, two of the ten sub-images extracted from 4909, whose actual label is ganglioneuroblastoma, were predicted as ganglioneuroblastoma and the rest as differentiating; therefore, by majority vote, whole tissue section 4909 was predicted as differentiating.
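The majority vote over sub-image labels is straightforward (a sketch; `classify_section` is our name for it):

```python
from collections import Counter

def classify_section(sub_image_labels):
    """Whole-section label = most common label among its sub-images."""
    return Counter(sub_image_labels).most_common(1)[0][0]

# Mirrors section 4906: four poorly-differentiated vs six differentiating
# sub-image predictions yield the (incorrect) section label differentiating.
```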
For further evaluation, we applied the proposed algorithm to the BreaKHis breast cancer dataset [10]. It consists of 7909 breast cancer histopathology images divided into benign and malignant tumours. The images are provided at four levels of magnification, with a dimension of 700 × 460 pixels. The image distribution is shown in Table 9.
Spanhol et al. [10] applied a combination of six feature extractors (CLBP, LBP, Gray Level Co-occurrence Matrices (GLCM), Local Phase Quantization (LPQ), Oriented FAST and Rotated BRIEF (ORB) and Parameter-Free Threshold Adjacency Statistics (PFTAS)) with 1-Nearest Neighbor (1-NN), Quadratic Discriminant Analysis (QDA), Random Forest (RF) and SVM classifiers to classify breast cancer histopathology images. They calculated average recognition rates over five trials for magnification factors of 40×, 100×, 200× and 400×, and achieved the best rates using PFTAS as the descriptor. Table 10 shows the best average recognition rates of the classifiers trained with the PFTAS descriptor. For comparison, we follow the same testing protocol as their paper: the BreaKHis dataset is randomly divided into a training set (70%) and a testing set (30%); we train the model on the training set, test it on the testing set, compute the recognition rates, repeat this procedure five times, and report the average recognition rates for all magnification factors. Following Spanhol et al. [10], the recognition rate is calculated as the patient score N_rec / N_P averaged over all patients, where N_rec is the number of images of patient P that are correctly classified and N_P is the number of images of patient P in the dataset. The results are shown in Table 11; they show that the combination of SIFT with bag of features works better than the methods that Spanhol et al. [10] tested.
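The patient-level recognition rate can be sketched as follows (assuming Spanhol et al.'s definition: the patient score N_rec / N_P, averaged over all patients; the dictionary layout is ours):

```python
def recognition_rate(per_patient):
    """Average of per-patient scores N_rec / N_P.

    `per_patient` maps patient id -> (n_correctly_classified, n_images).
    """
    scores = [n_rec / n_p for n_rec, n_p in per_patient.values()]
    return sum(scores) / len(scores)
```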

Discussion
We used SIFT as a feature extractor and combined it with a bag of features, a new approach for classifying neuroblastoma histological images into five groups. Although several methods have been proposed in the literature, our system has multiple advantages over them: 1. It is the first time that neuroblastoma histological images, with their complex texture, have been classified into five classes using features that are robust to scale variations. 2. We apply the bag of features algorithm to refine the features extracted by SIFT and improve the classification accuracy. 3. The poorest performance of our algorithm is in discriminating between poorly-differentiated and differentiating neuroblastoma, a distinction that human pathologists also find difficult in limited fields of view; in this respect, our model behaves similarly to pathologists. 4. As there is no publicly available dataset of neuroblastoma histological images, we evaluated our algorithm on the BreaKHis dataset, a publicly available breast cancer dataset. Moreover, we evaluated our algorithm on a second neuroblastoma tumour dataset provided by the University of Bristol. 5. Pathologists scan slides, naturally pick out local regions, and decide on the tumour type based on these regions. Our experimental design sought to emulate that process through the evaluation of cropped images: ten sub-images were extracted randomly from each whole tissue section provided by the University of Bristol, and each whole tissue section was then classified by a majority vote among the ten labels assigned to its sub-images. As shown in Figures 11 and 12, the visual features that distinguish our cropped images are not immediately apparent. However, the features extracted by SIFT are 128-element vectors; they are mathematical features, not visual features.
It remains to be determined whether these mathematical features represent biological distinctions within the tumours that have clinical sequelae. 6. Our 'gold standard' is the human eye and pathologist assessment. But we also know that the human eye is influenced by subjectivity, and that if multiple pathologists looked over the cropped images our specificity would likely not be 100%. For example, the distinction between poorly-differentiated and differentiating tumour types is known to be difficult for pathologists, which is reflected in our analysis (Table 7). That said, SIFT has not given us 100% specificity either. Perhaps that reflects the fidelity of the algorithm in finding very minor variations, possibly unobservable to the naked eye; perhaps SIFT is simply not as good as the naked eye. This distinction has not been tested in this paper and is beyond the scope of the immediate study. However, it does highlight that SIFT may have the potential to delve deeper into images of neuroblastoma tumours and identify new characteristics that relate to tumour growth and treatment response. As a future direction of this study, we would have the 1043 cropped images classified by several independent pathologists, examine the classification differences between them, and compare those with the results from the SIFT algorithm. 7. Our experimental results show that the combination of SIFT with the bag of features is a promising tool for the classification of neuroblastoma histological images, but at this stage the specificity of SIFT has not been shown to improve on that of visual analysis by a pathologist. Consequently, further testing of human-computer differences needs to be undertaken.

Conclusions
We proposed a combination of the Scale Invariant Feature Transform (SIFT) and bag of features to classify histological images of neuroblastic tumours into five categories using an SVM classifier. The algorithm builds its feature vectors from 128-element SIFT descriptors. The advantage of the proposed method is the extraction of features that are robust to scale variations, combined with the bag of features to yield more discriminative features that increase the classification accuracy. We evaluated the proposed algorithm on our constructed dataset of 1043 cropped neuroblastoma images spanning five categories. As there is no publicly available dataset of neuroblastoma histological images, we also tested the proposed method on the BreaKHis breast cancer dataset. Results indicate that a CAD system based on the combination of SIFT with bag of features outperforms state-of-the-art methods on both the neuroblastoma and breast cancer datasets.