Multi-Scale Feature Fusion for Coal-Rock Recognition Based on Completed Local Binary Pattern and Convolution Neural Network

Automatic coal-rock recognition is one of the critical technologies for intelligent coal mining and processing. Most existing coal-rock recognition methods have some defects, such as unsatisfactory performance and low robustness. To solve these problems, and taking distinctive visual features of coal and rock into consideration, the multi-scale feature fusion coal-rock recognition (MFFCRR) model based on a multi-scale Completed Local Binary Pattern (CLBP) and a Convolution Neural Network (CNN) is proposed in this paper. Firstly, the multi-scale CLBP features are extracted from coal-rock image samples in the Texture Feature Extraction (TFE) sub-model, which represents texture information of the coal-rock image. Secondly, the high-level deep features are extracted from coal-rock image samples in the Deep Feature Extraction (DFE) sub-model, which represents macroscopic information of the coal-rock image. The texture information and macroscopic information are acquired based on information theory. Thirdly, the multi-scale feature vector is generated by fusing the multi-scale CLBP feature vector and deep feature vector. Finally, multi-scale feature vectors are input to the nearest neighbor classifier with the chi-square distance to realize coal-rock recognition. Experimental results show the coal-rock image recognition accuracy of the proposed MFFCRR model reaches 97.9167%, which increased by 2%–3% compared with state-of-the-art coal-rock recognition methods.


Introduction
Coal is a precious natural resource all over the world [1]. China has comparatively abundant coal resources; the nation is and will continue to be the largest coal consumer and producer in the foreseeable future [2,3]. Automatic coal-rock recognition is a critical technology for intelligent coal mining and processing [4], which is helpful for adaptive height adjustment of the shearer's drum, the process control of fully mechanized top-coal caving, and fast coal-gangue separation in coal preparation plants [5]. Due to the constraints of geological conditions and coal mining technologies, traditional coal-rock recognition methods, such as gamma ray detection, infrared detection and radar detection, are difficult to apply in practice [6]. Considering that coal and rock have distinctive visual and low robustness. Experimental results show that the MFFCRR method has better performance than state-of-the-art coal-rock recognition methods.
The rest of this article is organized as follows. Section 2 is the overall structure of the MFFCRR model. Section 3 presents the proposed MFFCRR method based on CLBP and CNN. Section 4 shows a contrast of different methods with our method, the model performance and experimental results. Section 5 presents the conclusions and directions for future work.

Overview of the Proposed MFFCRR Model
As shown in Figure 1, the proposed MFFCRR model is mainly composed of three parts: multi-scale feature extraction, feature fusion and recognition.
The multi-scale feature extraction part includes two parallel steps: extracting the texture features and the deep features, which are extracted in the TFE sub-model based on CLBP and the DFE sub-model based on CNN, respectively. Firstly, in the TFE sub-model, the multi-scale CLBP feature vector is extracted from coal-rock image samples, which represents texture information of the coal-rock image. Secondly, in the DFE sub-model, the high-level deep feature vector is extracted layer by layer from coal-rock image samples, which represents more abstract and macroscopic information of the coal-rock image.
After the multi-scale feature extraction is completed, the multi-scale feature vector is generated by fusing the multi-scale CLBP feature vector and deep feature vector. Finally, the multi-scale feature vectors, which are extracted from training samples and testing samples respectively, are input to the nearest neighbor classifier with the chi-square distance to realize coal-rock recognition. The following section describes these three parts of the MFFCRR model in detail.

Texture Feature Extraction Sub-Model (TFE Sub-Model)
Completed Local Binary Pattern (CLBP) [14] is a completed form of the LBP operator for texture classification based on the LDSMT. It comprises three operators: CLBP-Sign (CLBP_S), CLBP-Magnitude (CLBP_M) and CLBP-Center (CLBP_C). The CLBP_S operator is equivalent to the classical LBP. Given a pixel in the image, a traditional LBP [11] code is calculated by comparing it with its neighbors:

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^p, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{1}$$

where $g_c$ and $g_p$ ($p = 0, \ldots, P-1$) denote the gray values of the central pixel and its circularly symmetric neighbors, respectively. $R$ denotes the radius of the neighborhood, and $P$ denotes the total number of neighbors. Assuming that the coordinate of $g_c$ is $(0, 0)$, the coordinates of $g_p$ are $(R\cos(2\pi p/P),\, R\sin(2\pi p/P))$. Note that, if the neighbors do not fall on the image grid, their gray values can be estimated by interpolation.
Figure 2 shows an example of LBP coding. By thresholding, the white and black spots denote $s(\cdot) = 0$ and $s(\cdot) = 1$, respectively. It is clearly seen that the traditional LBP codes the local pattern as the 8-bit string "01001101", starting from $O_8$ and coding clockwise.
After the LBP operator values are calculated, the histogram of LBP values is built to represent the texture features of the image. However, if the image is rotated, the LBP values change, and so does the resulting texture feature. Hence, rotation invariance cannot be guaranteed. In addition, the $\mathrm{LBP}_{P,R}$ operator (Equation (1)) produces $2^P$ distinct output values, which makes the corresponding histogram too sparse and leaves it with a lot of redundant information.
To decrease the redundant information of the texture features and achieve rotation invariance, the rotation invariant uniform pattern (the most effective of the patterns introduced in [11]) is adopted, and the following operator is used:

$$\mathrm{LBP}^{riu2}_{P,R} = \begin{cases} \sum_{p=0}^{P-1} s(g_p - g_c), & U(\mathrm{LBP}_{P,R}) \le 2 \\ P + 1, & \text{otherwise} \end{cases}$$

where

$$U(\mathrm{LBP}_{P,R}) = \left| s(g_{P-1} - g_c) - s(g_0 - g_c) \right| + \sum_{p=1}^{P-1} \left| s(g_p - g_c) - s(g_{p-1} - g_c) \right|$$

denotes the number of spatial transitions (bitwise 0/1 changes) and the superscript "riu2" denotes the rotation invariant uniform pattern with $U \le 2$. Similarly, the histogram of $\mathrm{LBP}^{riu2}_{P,R}$ values is built to represent the local image texture feature. The $\mathrm{LBP}^{riu2}_{P,R}$ operator has only $P + 2$ different output values, compared with $2^P$ for $\mathrm{LBP}_{P,R}$. Hence, the dimension of the histogram and the redundant information of the texture features are decreased. Meanwhile, the LBP image is also obtained in the above process.
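The two codes described above can be illustrated with a minimal sketch. It assumes P = 8 and takes the eight pixels of a 3 × 3 patch as the neighbors instead of interpolating points on a circle of radius R; the function names and the sample patch are hypothetical and are not taken from the paper.

```python
def sign_bits(patch):
    """Threshold the 8 neighbors of a 3x3 patch against its center: s(g_p - g_c)."""
    gc = patch[1][1]
    # Assumed neighbor order: start at the right-hand pixel and walk around the ring.
    offsets = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    return [1 if patch[r][c] >= gc else 0 for r, c in offsets]


def lbp_code(bits):
    """Classical LBP: bit p is weighted by 2**p."""
    return sum(b << p for p, b in enumerate(bits))


def transitions(bits):
    """U value: number of 0/1 changes around the circular bit string."""
    # bits[p - 1] with p = 0 wraps to the last bit, which closes the circle.
    return sum(bits[p] != bits[p - 1] for p in range(len(bits)))


def lbp_riu2(bits):
    """Rotation invariant uniform code: count of ones if U <= 2, else P + 1."""
    return sum(bits) if transitions(bits) <= 2 else len(bits) + 1


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
bits = sign_bits(patch)
rotated = bits[-2:] + bits[:-2]  # same pattern seen after rotating the neighborhood
print(lbp_code(bits), lbp_code(rotated))  # the classical codes differ
print(lbp_riu2(bits), lbp_riu2(rotated))  # the riu2 codes agree
```

Rotating the bit string changes the classical code but leaves the riu2 code unchanged, which is the property the text relies on.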
The LDSMT [14,20] is defined as:

$$d_p = s_p \cdot m_p, \qquad s_p = \operatorname{sign}(d_p), \quad m_p = |d_p|$$

where $d_p = g_p - g_c$ and

$$s_p = \begin{cases} 1, & d_p \ge 0 \\ -1, & d_p < 0 \end{cases}$$

Thus $d_p$ is decomposed into two components: $s_p$ and $m_p$ denote the sign and the magnitude of $d_p$, respectively. Namely, $m_p$ represents the magnitude change of gray values between the central pixel and the circularly symmetric neighbor, and $s_p$ represents the sign change. The operator CLBP_S (namely LBP) only encodes the sign component and does not consider the magnitude change. Consequently, CLBP [14] was proposed as a completed LBP. The CLBP_S operator is the same as the traditional LBP defined in Equation (1), so the $\mathrm{CLBP\_S}^{riu2}_{P,R}$ operator also has $P + 2$ different output values.
In order to code the operator CLBP_M in a format consistent with CLBP_S, it is defined as follows:

$$\mathrm{CLBP\_M}_{P,R} = \sum_{p=0}^{P-1} t(m_p, c)\,2^p, \qquad t(x, c) = \begin{cases} 1, & x \ge c \\ 0, & x < c \end{cases}$$

Here, $c$ denotes the average value of $m_p$ over the entire image. Similar to $\mathrm{LBP}^{riu2}_{P,R}$, the rotation invariant uniform pattern of the operator $\mathrm{CLBP\_M}_{P,R}$ can also be defined, denoted by $\mathrm{CLBP\_M}^{riu2}_{P,R}$, which also has $P + 2$ different output values.
The central pixel, which reflects the local gray level of the image, also carries useful information. To make the operator CLBP_C consistent with CLBP_M and CLBP_S, it is defined as:

$$\mathrm{CLBP\_C}_{P,R} = t(g_c, c_I)$$

where the threshold $c_I$ denotes the mean gray value of the entire image. The $\mathrm{CLBP\_C}_{P,R}$ image is therefore a binary image; in other words, the $\mathrm{CLBP\_C}_{P,R}$ operator has only 2 different output values.
The three operators, namely CLBP_S, CLBP_M and CLBP_C, can be combined by building a 3-D joint histogram of them, denoted by "CLBP_S/M/C". As a very powerful tool for local texture analysis, multi-scale analysis can be utilized to improve recognition accuracy by combining the information provided by multiple operators with diverse (R, P).
In this paper, the joint distribution $\mathrm{CLBP\_S}^{riu2}_{P,R}/\mathrm{M}^{riu2}_{P,R}/\mathrm{C}_{P,R}$, abbreviated as $\mathrm{CLBP}_{P,R}$, is used to characterize the texture features of each coal-rock image. The multi-scale CLBP, $\mathrm{CLBP}_{P_1,R_1} + \cdots + \mathrm{CLBP}_{P_n,R_n}$, abbreviated as Multi-CLBP, is used to extract the texture features from each coal-rock gray-scale image. Firstly, the histograms of the $\mathrm{CLBP\_C}_{P,R}$ and $\mathrm{CLBP\_S}^{riu2}_{P,R}$ codes are calculated separately, and a joint 2-D histogram of the $\mathrm{CLBP\_S}^{riu2}_{P,R}/\mathrm{C}_{P,R}$ code is acquired by concatenating the two histograms together. Then, the histogram of the $\mathrm{CLBP\_M}^{riu2}_{P,R}$ code is calculated, and the three histograms are concatenated to build a 3-D joint histogram. Finally, the 3-D joint histogram is transformed into a vector, denoted by $\mathrm{CLBP\_S}^{riu2}_{P,R}/\mathrm{M}^{riu2}_{P,R}/\mathrm{C}_{P,R}$. By applying the multi-scale CLBP, the local texture information can be captured effectively on diverse scales. In the TFE sub-model, we use the multi-scale CLBP to extract the texture features of the coal-rock image. Meanwhile, the experiment (see Section 4.3.1) also demonstrates that better recognition results can be acquired than with single-scale CLBP.
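As a rough illustration of the single-scale feature vector, the sketch below counts (S, M, C) code triples over the interior pixels of a small image. It assumes P = 8 with the 3 × 3 neighborhood standing in for interpolated circular samples, and it assumes the 3-D joint histogram is a count over code triples that is then flattened; `clbp_histogram` and `multi_scale_dim` are hypothetical helper names.

```python
# Assumed 3x3 neighbor offsets (dy, dx) standing in for P = 8, R = 1.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def riu2(bits):
    """Rotation invariant uniform mapping: values 0..P for U <= 2, else P + 1."""
    u = sum(bits[p] != bits[p - 1] for p in range(len(bits)))
    return sum(bits) if u <= 2 else len(bits) + 1


def clbp_histogram(img):
    """Flattened joint histogram of CLBP_S^riu2 / CLBP_M^riu2 / CLBP_C codes."""
    h, w = len(img), len(img[0])
    P = len(OFFSETS)
    interior = [(y, x) for y in range(1, h - 1) for x in range(1, w - 1)]
    # Local differences d_p = g_p - g_c for every interior pixel.
    diffs = {(y, x): [img[y + dy][x + dx] - img[y][x] for dy, dx in OFFSETS]
             for y, x in interior}
    # c: mean magnitude |d_p| over the image; c_I: mean gray value of the image.
    c = sum(abs(d) for ds in diffs.values() for d in ds) / (P * len(interior))
    c_I = sum(map(sum, img)) / (h * w)
    hist = [0] * ((P + 2) * (P + 2) * 2)
    for (y, x), ds in diffs.items():
        s = riu2([1 if d >= 0 else 0 for d in ds])       # sign component
        m = riu2([1 if abs(d) >= c else 0 for d in ds])  # magnitude component
        ctr = 1 if img[y][x] >= c_I else 0               # center component
        hist[(s * (P + 2) + m) * 2 + ctr] += 1
    return hist


def multi_scale_dim(scales):
    """Length of the concatenated multi-scale vector for a list of (P, R) pairs."""
    return sum((P + 2) * (P + 2) * 2 for P, _R in scales)


print(multi_scale_dim([(8, 1), (24, 3)]))  # length for the (8,1) + (24,3) combination
```

Under these assumptions each scale contributes (P + 2) × (P + 2) × 2 bins, so the vector grows quickly with P, which is one reason the riu2 mapping matters.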

Deep Feature Extraction Sub-Model (DFE Sub-Model)
After extracting the local texture features by multi-scale CLBP in the TFE sub-model, we extract the deep features from each image using a CNN in the DFE sub-model. The DFE sub-model adopted in our MFFCRR model is designed based on the classic LeNet-5 network [21], and its architecture is shown in Figures 1 and 3. It contains six learned layers, namely two convolutional layers (C1, C3) and four fully connected layers (F5, F6, F7, F8). Spatial pooling is carried out by two max-pooling layers (P2, P4), each of which follows one of the convolutional layers. The Parametric Rectified Linear Unit (PReLU) non-linearity is applied to the output of each convolutional and fully connected layer. In this paper, the deep features are extracted from the last fully connected layer.
Below, we describe our network's architecture in detail and the two ways of reducing overfitting.
Entropy 2019, 21, x 6 of 16
The input of our network is a fixed-size 28 × 28 gray-scale image (the 128 × 128 images are resized to 28 × 28 to fit the network). The first layer of the sub-model is a convolutional layer, which applies 32 convolution kernels of 5 × 5 and outputs 32 feature maps of 24 × 24 pixels. This layer is followed by a max-pooling layer, in which 2 × 2 sliding windows with a stride of 2 pixels halve the size of each map, outputting 32 feature maps of 12 × 12 pixels. The second convolutional layer performs 64 convolutions with a 5 × 5 kernel on the output of the previous layer and outputs 64 feature maps of 8 × 8 pixels. This layer is followed by another max-pooling layer, again with a 2 × 2 kernel, which outputs 64 feature maps of 4 × 4 pixels. The second max-pooling layer is followed by four fully connected layers: the first two have 256 neurons each, and the third and last have 2 and 4 neurons, respectively. The outputs are generated from the last fully connected layer, where the deep features are extracted.
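The feature-map sizes quoted above follow from the usual output-size rule for unpadded convolution and pooling. The short check below assumes no padding and a convolution stride of 1, neither of which is stated explicitly in the text; the layer names are those used in the paper.

```python
def out_size(size, kernel, stride=1):
    """Output width of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


# (layer name, kernel, stride, number of output maps)
layers = [("C1", 5, 1, 32), ("P2", 2, 2, 32), ("C3", 5, 1, 64), ("P4", 2, 2, 64)]

size = 28  # input is a 28 x 28 gray-scale image
shapes = {}
for name, kernel, stride, maps in layers:
    size = out_size(size, kernel, stride)
    shapes[name] = (maps, size, size)
    print(name, shapes[name])

# Number of values handed to the first fully connected layer (F5).
flattened = shapes["P4"][0] * shapes["P4"][1] * shapes["P4"][2]
print("inputs to F5:", flattened)
```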
The first convolutional layer aims to learn elementary visual features for coal-rock recognition. The convolution operation is expressed as

$$y_j^{(r)} = f\Big( b_j^{(r)} + \sum_i k_{ij}^{(r)} * x_i^{(r)} \Big)$$

where $x_i$ and $y_j$ are the $i$-th input feature map and the $j$-th output feature map, respectively, $k_{ij}$ denotes the convolution kernel between the $i$-th input feature map and the $j$-th output feature map, $*$ represents the convolution operation, $b_j$ denotes the bias of the $j$-th output feature map, and $f$ is the activation function. Weights in the higher convolutional layer of our network are locally shared to learn different middle-level visual features in different regions [22]; $r$ in the equation above denotes a local region where weights are shared. We use the PReLU non-linearity ($f(x) = \max(0, x) + a_i \min(0, x)$) as the activation function of our network, which is detailed as follows.
The PReLU, which follows every convolutional and fully connected layer, improves model fitting by adaptively learning the parameters of the rectifiers. As a generalization of the Rectified Linear Unit (ReLU), PReLU was proposed by He et al. [23] and is computed as

$$f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$$

where $y_i$ is the input of the nonlinear activation function $f$ on the $i$-th channel, and $a_i$ is a coefficient that controls the slope of the negative part. The subscript $i$ in $a_i$ indicates that the nonlinear activation can vary on different channels. If $a_i = 0$, the activation function becomes ReLU [24]; if $a_i$ is a learnable parameter, it is called Parametric ReLU (PReLU). The shapes of ReLU and PReLU are shown in Figure 4. In this paper, we use $a_i = 0.25$ as the initialization (empirically chosen).
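A small numeric sketch of this activation may help. It evaluates PReLU for a single channel with the initial slope a = 0.25 mentioned above, and also gives the derivative with respect to the slope, which is what makes the coefficient learnable; the function names are illustrative only.

```python
def prelu(y, a=0.25):
    """PReLU: identity for positive inputs, slope a for negative inputs."""
    return max(0.0, y) + a * min(0.0, y)


def prelu_grad_a(y):
    """Derivative of PReLU with respect to the slope a (zero for positive inputs)."""
    return min(0.0, y)


for y in (-2.0, -0.5, 0.0, 1.5):
    print(y, prelu(y), prelu(y, a=0.0))  # a = 0 reduces PReLU to ReLU
```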
The max-pooling layer reduces the spatial resolution of the feature map output by the previous (convolutional) layer, and max-pooling is formulated as

$$(y_i)_{j,k} = \max_{0 \le m,\, n < s} \, (x_i)_{j \cdot s + m,\; k \cdot s + n}$$

where each neuron in the $i$-th output feature map $y_i$ pools over an $s \times s$ non-overlapping local region (the pooling unit) in the $i$-th input feature map $x_i$.
Four fully connected layers are set in our network, which are used for extracting the high-level deep features of the coal-rock image. The fully connected layer takes the function

$$y_j = \max\Big(0, \sum_i x_i w_{i,j} + b_j\Big) + a_j \min\Big(0, \sum_i x_i w_{i,j} + b_j\Big)$$

where $x$ and $w$ denote the neurons of the previous layer and the weights in the current layer, respectively, and $b_j$ is the bias. Each fully connected layer is followed by the PReLU non-linearity.
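The pooling and fully connected operations can be sketched on plain nested lists. This is only an illustration under assumptions: the pooling window is square and non-overlapping as described, the bias term `b` and a single shared slope `a` are assumed for the fully connected layer, and the function names are hypothetical.

```python
def max_pool(x, s=2):
    """Max over each non-overlapping s x s block of a 2-D feature map."""
    h, w = len(x), len(x[0])
    return [[max(x[j * s + m][k * s + n] for m in range(s) for n in range(s))
             for k in range(w // s)]
            for j in range(h // s)]


def fully_connected(x, w, b, a=0.25):
    """y_j = PReLU(sum_i x_i * w[i][j] + b_j), with one slope a for all units."""
    out = []
    for j in range(len(b)):
        z = sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
        out.append(max(0.0, z) + a * min(0.0, z))
    return out


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
print(max_pool(feature_map))  # a 4 x 4 map becomes 2 x 2

weights = [[1.0, -1.0], [1.0, -1.0]]
print(fully_connected([1.0, 2.0], weights, [0.0, 0.0]))
```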
The loss in our network is computed using cross entropy, which is used for constraining the coal-rock recognition task. The cross-entropy loss can be calculated as

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{ y^{(i)} = j \} \log \frac{ e^{\theta_j^{T} x^{(i)}} }{ \sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}} } \right] + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^{2}$$

where $m$ and $k$ denote the number of labeled samples and classes, respectively, $y^{(i)} \in \{1, 2, \cdots, k\}$ is the class label of the sample $x^{(i)} \in \mathbb{R}^{n+1}$, and $\theta_1, \theta_2, \cdots, \theta_k \in \mathbb{R}^{n+1}$ are the parameters of the loss function. The term $\frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^{2}$ is used for weight decay. Our network uses the Adam [25] stochastic optimization algorithm to perform parameter updates. Adam is an efficient update rule because it relies only on estimates of the first and second moments of the gradients, which are themselves computed by back-propagation [26].
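To make the loss concrete, the sketch below evaluates a softmax cross-entropy with an L2 weight-decay term on a toy input. It assumes labels numbered 1 to k as in the text and uses the log-sum-exp trick for numerical stability, which the paper does not discuss; `softmax_loss` is a hypothetical name.

```python
import math


def softmax_loss(theta, xs, ys, lam):
    """Mean cross-entropy over the samples plus (lam / 2) * sum of squared weights."""
    data_term = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(t * v for t, v in zip(row, x)) for row in theta]
        top = max(scores)  # subtract the maximum before exponentiating
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        data_term += scores[y - 1] - log_z  # log-probability of the true class
    decay = 0.5 * lam * sum(t * t for row in theta for t in row)
    return -data_term / len(xs) + decay


# With all-zero parameters every class is equally likely, so the loss is log(k).
theta = [[0.0, 0.0] for _ in range(4)]
print(softmax_loss(theta, [[1.0, 2.0]], [3], lam=0.0), math.log(4))
```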

Reducing Overfitting
Generally, the deep model needs to learn a larger number of parameters during training, which makes it more prone to overfitting. We research the following two ways in which to combat this problem.
We artificially enlarge the dataset by rotating the coal-rock images, which is one of the easiest and most common ways to reduce overfitting. The amount of data available in our dataset is not sufficient to extract the deep features of the coal-rock image; therefore, we rotate each coal-rock image from 30° to 330° at intervals of 30°, so that 11 additional rotated images are generated per image. This method of data augmentation is applied to our network and effectively prevents overfitting. The coal-rock image is also flipped horizontally, which is another form of data augmentation applied to the DFE sub-model.
Additionally, we use dropout [27,28] in the first three fully connected layers, which is an efficient way of reducing overfitting. Dropout sets the output of each hidden neuron to zero with probability 0.5; the neurons that are "dropped out" do not contribute to forward propagation and do not participate in backward propagation. A different architecture is therefore sampled by the network for each input, but these different architectures share identical weights. Hence, dropout can effectively prevent complex co-adaptations on the training data. In this paper, we use 50% dropout (empirically chosen).
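The dropout step can be sketched in a few lines. The version below only zeroes activations with the stated probability; any rescaling of the surviving activations, at training or at test time, is not described in the text and is left out. The helper name and the sample activations are hypothetical.

```python
import random


def dropout(activations, p_drop=0.5, rng=random):
    """Zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


rng = random.Random(0)  # fixed seed so the sampled mask is repeatable
hidden = [0.3, 1.2, 0.7, 2.1, 0.9, 1.5]
print(dropout(hidden, 0.5, rng))  # one sampled "thinned" layer
print(dropout(hidden, 0.5, rng))  # a different mask for the next input
```

Each call samples a new mask, which corresponds to the statement that a different architecture is sampled for each input while the weights stay shared.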

Multi-Scale Feature Fusion and Recognition
In this paper, we use a straightforward way to fuse the features extracted in the TFE sub-model and the DFE sub-model, namely, concatenating the feature vectors. The extracted local texture features are low-level features of the coal-rock image, while the extracted deep features are mid- and high-level features; combining the two therefore improves the overall coal-rock recognition performance.
Firstly, in the TFE sub-model, the multi-scale CLBP feature vector (the texture feature vector) is extracted from the coal-rock image samples, denoted as the feature vector $H$. Secondly, in the DFE sub-model, the deep feature vector is extracted from the coal-rock image samples, denoted as the feature vector $D$. Then, the feature vectors $H$ and $D$ are normalized to $H^*$ and $D^*$, respectively. Finally, the weighting factors $\mu$ and $\delta$ are applied to $H^*$ and $D^*$, respectively, and the multi-scale feature vector is generated by concatenating the two weighted vectors, denoted as $X = (\mu H^*, \delta D^*)$.
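The fusion step amounts to normalizing, weighting and concatenating. The sketch below assumes L1 normalization (dividing by the sum of absolute values), since the text does not say which normalization is used, and takes the weights µ = 0.6 and δ = 0.4 reported later in the experiments as defaults; `l1_normalize` and `fuse` are hypothetical names.

```python
def l1_normalize(v):
    """Scale a vector so its absolute values sum to 1 (assumed normalization)."""
    total = sum(abs(x) for x in v)
    return [x / total for x in v] if total else list(v)


def fuse(H, D, mu=0.6, delta=0.4):
    """Build X = (mu * H*, delta * D*) by weighted concatenation."""
    return [mu * h for h in l1_normalize(H)] + [delta * d for d in l1_normalize(D)]


H = [1.0, 3.0]  # toy texture histogram
D = [2.0, 2.0]  # toy deep feature vector
X = fuse(H, D)
print(len(X), X)
```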
After generating the multi-scale feature vector $X$, the nearest neighbor classifier (NNC) with the chi-square distance is utilized to recognize coal-rock images. In other words, the distance between two normalized multi-scale feature vectors $X_1$ and $X_2$ is measured using the chi-square distance. Given two feature vectors $X_1, X_2 \in \mathbb{R}^d$, the chi-square distance is defined as [14,20]:

$$\chi^2(X_1, X_2) = \sum_{i=1}^{d} \frac{(X_{1i} - X_{2i})^2}{X_{1i} + X_{2i}}$$

where $X_{1i}$ and $X_{2i}$ are the $i$-th elements of feature vectors $X_1$ and $X_2$, respectively. The smaller $\chi^2(\cdot)$ is, the higher the similarity between $X_1$ and $X_2$, which means that the probability that the two coal-rock images belong to the same class is higher.
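The classifier can be sketched as a linear scan over the training vectors. The code assumes that bins where both entries are zero are skipped to avoid division by zero, a detail the text does not mention; the toy vectors and their labels are hypothetical.

```python
def chi_square(x1, x2):
    """Chi-square distance; bins whose two entries sum to zero are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x1, x2) if a + b != 0)


def nearest_neighbor(query, train_vectors, train_labels):
    """Return the label of the training vector closest to the query."""
    best = min(range(len(train_vectors)),
               key=lambda i: chi_square(query, train_vectors[i]))
    return train_labels[best]


train_vectors = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
train_labels = ["lignite", "sandstone"]
print(nearest_neighbor([0.6, 0.3, 0.1], train_vectors, train_labels))
```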

Dataset
In order to evaluate the performance of the proposed MFFCRR model, we carried out the experiments on an image dataset of coal-rock (CR dataset). This dataset consists of 4800 coal-rock gray-scale images of 128 × 128 pixels, which were collected under different illuminations and from different viewpoints. There are four classes of coal-rock: lignite, anthracite, mudstone and sandstone; each has 1200 gray-scale images. Eighty percent of the samples are used for training and 20% for testing, i.e., 3840 training samples and 960 testing samples. Figure 5 shows some example coal-rock images from the four classes (anthracite, lignite, mudstone and sandstone).

Evaluation Metrics
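Two common evaluation metrics, accuracy and macro-average F1, are used to evaluate the performance of the MFFCRR model. They are computed based on the following four outcomes [29]: true positive (TP), a positive sample predicted as positive; false positive (FP), a negative sample predicted as positive; true negative (TN), a negative sample predicted as negative; and false negative (FN), a positive sample predicted as negative. Hence, the accuracy is defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy alone does not always reflect the model performance, so the precision, recall and F1 score are introduced to evaluate the MFFCRR model comprehensively. For multi-class tasks, the performance evaluation of the proposed method should consider the prediction results of each class. Macro-average F1 is the average of the F1 scores of all classes. The precision, recall, F1 score and macro-average F1 can be computed as follows [30]:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \text{Macro-average } F_1 = \frac{1}{k} \sum_{i=1}^{k} F_{1i}$$

where $k$ represents the number of classes, and $F_{1i}$ denotes the F1 score of the $i$-th class.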
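The two metrics can be computed directly from lists of true and predicted labels, treating each class in turn as the positive class. The sketch below reports overall accuracy as the fraction of correctly labeled samples and defines a score as 0 when its denominator is 0, which is an assumed convention; the toy labels are invented for illustration.

```python
def accuracy_and_macro_f1(y_true, y_pred, classes):
    """Overall accuracy and the unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(f1_scores) / len(f1_scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred, classes=[0, 1])
print(acc, macro_f1)
```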

Parameters of the TFE Sub-Model
In this experiment, we studied the effect of the parameters P and R on the MFFCRR model. For the parameters P and R of the TFE sub-model, we chose the three common combinations of (P, R), namely (8,1), (16,2) and (24,3) [14], to carry out the experiment. The three 2-scale combinations and one 3-scale combination are used for constructing the multi-scale CLBP. Table 1 shows the experimental results with macro-average F1 and accuracy metrics for the three single-scale and four multi-scale combinations. As seen in Table 1, the MFFCRR model using the multi-scale CLBP performs better than the model using single-scale CLBP. Further, the model based on the 2-scale combination (8,1) + (24,3) achieves the best recognition accuracy, 97.9167%, and the best macro-average F1 score, 97.3333%. Nevertheless, the performance of the model based on the 3-scale combination (8,1) + (16,2) + (24,3) degrades a little, because more unstable distribution patterns are generated. Experimental results show that the multi-scale CLBP is a powerful tool to enhance the performance of the proposed MFFCRR model.

Parameters of the DFE Sub-Model
In order to efficiently train the DFE sub-model, data augmentation is used throughout the whole training process (see Section 3.2.2). Figure 6 shows the training accuracy and training loss of the DFE sub-model. As the number of training epochs increases, the DFE sub-model gradually converges: the training loss takes about 95 epochs to converge, and the training accuracy is close to 90% after 95 epochs. This indicates that the deep features learned by the CNN can be effectively extracted from the last fully connected layer.
After performing several experiments on the recognition performance of the MFFCRR model, the hyperparameters of the DFE sub-model were obtained (summarized in Table 2). In order to reduce overfitting, we use 50% dropout (see Section 3.2.2). In addition, the DFE sub-model is trained with the Adam optimizer by setting ε = 1e−8, β1 = 0.9 and β2 = 0.999.
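For reference, one Adam update with these settings can be written out explicitly. The sketch follows the standard bias-corrected form of the algorithm; the learning rate of 1e-3 is an assumption, since the value listed in Table 2 is not reproduced in the text, and `adam_step` is a hypothetical name.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moment estimates."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for step t (t starts at 1).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta, m, v = adam_step([1.0], [0.5], [0.0], [0.0], t=1)
print(theta, m, v)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of the gradient's scale.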

Parameters of the Multi-Scale Feature Fusion
In this experiment, we studied the effect of the weighting factors µ and δ (µ, δ ∈ [0, 1]). As two necessary parameters, the weighting factors µ and δ are used to fuse the texture feature vector and the deep feature vector (two normalized feature vectors) extracted from the coal-rock image samples, generating the multi-scale feature vector X = (µH * , δD * ). Figure 7 shows the recognition accuracy at different values of the weighting factors µ and δ. As can be seen from Figure 7, using only the texture feature vector (µ = 1) or the deep feature vector (δ = 1) on the MFFCRR model cannot acquire better recognition accuracy. Obviously, when µ = 0.6 and δ = 0.4, the best recognition accuracy is acquired.

Activations
The activation function is an essential component of state-of-the-art networks and significantly affects the performance of the model. In this experiment, we compare ReLU non-linearity (f(x) = max(0, x)), one of the most common activation functions, with PReLU non-linearity. Table 3 shows the experimental results in terms of macro-average F1 and accuracy. As shown in Table 3, PReLU non-linearity offers better performance.
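The difference between the two non-linearities can be illustrated with a short sketch. ReLU follows the definition given above; PReLU replaces the zero output for negative inputs with a·x, where the slope a is learned during training. The value a = 0.25 used here is only an assumed starting value for illustration, and the scalar functions are a simplification of the per-channel operation in a real network.

```python
def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)

def prelu(x, a=0.25):
    """PReLU: f(x) = x for x > 0 and a * x otherwise.
    The slope a is a learnable parameter; 0.25 is an assumed example value."""
    return x if x > 0 else a * x

# Negative inputs are zeroed by ReLU but scaled by PReLU.
inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in inputs]
prelu_out = [prelu(x) for x in inputs]
```

Because PReLU keeps a non-zero gradient for negative inputs, units that would be inactive under ReLU can still be updated, which is one commonly cited reason for its better performance.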

ROC Curve
For the multi-class image recognition task, the receiver operating characteristic (ROC) curve is also an important tool for evaluating the performance of the model [30][31][32]. Hence, the ROC curves are shown in Figure 8. Figure 8 contains six curves: two average curves and four per-class curves. The two average curves show the macro-average and micro-average results together with their areas under the curve; the four per-class curves show the area under the curve of each class, where the class labels 0, 1, 2 and 3 correspond to lignite, anthracite, sandstone and mudstone, respectively.
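To make the distinction between per-class, macro-average and micro-average areas concrete, the sketch below computes them for a four-class problem in a one-vs-rest fashion. It is an illustration under assumptions, not the procedure used to produce Figure 8: the area under the curve is computed through its rank interpretation (the probability that a positive sample scores higher than a negative one), and the function names and toy scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative),
    computed as the fraction of positive/negative pairs ranked correctly
    (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def multiclass_auc(y_true, y_score, n_classes=4):
    """Return (per-class AUCs, macro-average AUC, micro-average AUC).

    y_true holds integer class labels; y_score holds one score per class
    for every sample (for example, class probabilities)."""
    onehot = [[1 if y == c else 0 for c in range(n_classes)] for y in y_true]
    per_class = [
        auc([row[c] for row in onehot], [row[c] for row in y_score])
        for c in range(n_classes)
    ]
    macro = sum(per_class) / n_classes          # unweighted mean over classes
    micro = auc(                                # pool every (sample, class) decision
        [v for row in onehot for v in row],
        [s for row in y_score for s in row],
    )
    return per_class, macro, micro

# Toy usage: labels 0..3 stand for lignite, anthracite, sandstone and mudstone.
y_true = [0, 1, 2, 3]
y_score = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
per_class, macro, micro = multiclass_auc(y_true, y_score)
```

The macro average treats every class equally, whereas the micro average pools all individual decisions and is therefore dominated by the classes with the most samples.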

White Gaussian Noise
To evaluate the robustness of the proposed MFFCRR model against noise, we carried out an experiment in which white Gaussian noise was added to the original coal-rock image samples. Figure 9 shows an original image sample (the noiseless image) and the associated noisy images at different signal-to-noise ratio (SNR) levels (5 dB, 10 dB, 15 dB, 20 dB and 25 dB). As seen in Figure 9, the noisy images become increasingly distorted as the SNR decreases, which makes feature extraction more difficult. Table 4 shows the recognition results in terms of macro-average F1 and accuracy at different SNR levels. Note that, in this paper, the noiseless image denotes the original image sample without added white Gaussian noise. As shown in Table 4, when little white Gaussian noise is added to the original image (higher SNR), the performance of the MFFCRR model degrades only slightly and still meets present requirements. However, as more white Gaussian noise is added (lower SNR), the performance of the MFFCRR model drops dramatically and no longer meets practical requirements. Hence, the experimental results indicate that collected coal-rock images should be denoised before the MFFCRR model is applied.
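The way a target SNR determines the noise level can be sketched as follows. This is an illustrative reconstruction, not the procedure used to generate Figure 9: the signal power is assumed to be the mean squared pixel value, the image is treated as a flat list of pixel values, no clipping to the valid intensity range is applied, and the function names are hypothetical.

```python
import math
import random

def add_awgn(pixels, snr_db, rng=None):
    """Add white Gaussian noise so that the result has approximately the
    requested SNR in dB, with SNR = 10 * log10(signal_power / noise_power)."""
    if rng is None:
        rng = random.Random(0)
    signal_power = sum(p * p for p in pixels) / len(pixels)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)          # standard deviation of the noise
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(p * p for p in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)

# Toy usage: a synthetic "image" and the SNR levels listed in the text.
clean = [float(i % 256) for i in range(20000)]
for snr in (5, 10, 15, 20, 25):
    noisy = add_awgn(clean, snr, random.Random(42))
```

Each decrease of 10 dB in SNR multiplies the noise power by ten, which is consistent with the rapid visual degradation at the lowest SNR levels.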

Comparison with State-of-the-Art Methods
We compare the performance of the proposed MFFCRR method with state-of-the-art coal-rock recognition methods on the CR dataset, including the curvelet transform and compressed sensing method (denoted as CT-CS) [33], the CLBP and support vector guided dictionary learning method (denoted as CLBP-SVGDL) [34], and the locality-constrained self-taught learning method (denoted as LCSL) [10]. To compare the proposed MFFCRR model more comprehensively, the TFE sub-model and the DFE sub-model are each also used on their own for coal-rock recognition on the CR dataset (two comparative experiments). In other words, the CLBP method (the TFE sub-model using only CLBP) and the CNN method (the DFE sub-model using only CNN) are compared with our method (the MFFCRR model using both CLBP and CNN) under the same parameter settings. The comparison results are listed in Table 5.
As shown in Table 5, our proposed MFFCRR method performs better than the state-of-the-art coal-rock recognition methods in terms of both accuracy and macro-average F1 score, mainly due to the efficient multi-scale feature extraction and fusion techniques we used. Meanwhile, the proposed MFFCRR method based on CLBP and CNN performs better than the CLBP method or the CNN method under the same parameter settings, which indicates that the multi-scale features obtained by fusing the texture features and the deep features are more discriminative than the texture features or the deep features alone. Therefore, the proposed MFFCRR model is feasible and effective for coal-rock recognition with less data, and achieves the best recognition accuracy (97.9167%) and the best macro-average F1 score (97.3333%).
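For reference, the two metrics reported in Table 5 can be computed from predicted and true labels as in the following sketch. It is a generic illustration, not the evaluation script behind the reported numbers; the function names and the toy labels are hypothetical, and a class without any true or predicted samples is assigned an F1 score of zero by assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy usage with four classes (0..3) and one misclassified sample.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2, 3, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, classes=[0, 1, 2, 3])
```

Because the macro average weights every class equally, a single error in a small class lowers the macro-average F1 more than it lowers the accuracy, which is why the two metrics are reported side by side.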

Conclusions and Outlook
In this paper, an MFFCRR model based on CLBP and CNN is proposed to extract and fuse the texture features and deep features of coal-rock images for coal-rock recognition. The TFE sub-model uses CLBP to extract the local texture features, which represent the texture information of coal-rock images, while the DFE sub-model uses a CNN to learn the deep features, which provide the macroscopic spatial information of coal-rock images. The fused texture and deep features are then input to the nearest neighbor classifier with the chi-square distance to realize coal-rock recognition. The proposed MFFCRR model not only reduces the heavy workload of manual feature extraction but also addresses the problems of unsatisfactory performance and low robustness in existing coal-rock recognition methods. Experimental results show that the coal-rock recognition accuracy of the proposed MFFCRR method reaches 97.9167%, and that the MFFCRR model outperforms existing coal-rock recognition methods in terms of the performance metrics (accuracy and macro-average F1 score).
However, coal-rock images are acquired from the coal mine and are easily affected by noise. Therefore, improving the recognition accuracy of the MFFCRR model under low SNR will be the focus of our future work. In the future, we plan to collect more coal-rock images to enrich our dataset, and study more deep models to improve the coal-rock recognition accuracy.
Author Contributions: X.L. and W.J. contributed the multi-scale feature fusion coal-rock recognition method. X.L., W.J., M.Z., and Y.L. analyzed the experiments. All authors participated in writing the manuscript.
Funding: The authors gratefully acknowledge the support of the National Key R&D Program of China (No. 2016YFC0801800).

Conflicts of Interest: The authors declare no conflict of interest.