A Case Study on Attribute Recognition of Heated Metal Mark Image Using Deep Convolutional Neural Networks

Heated metal marks are important traces for identifying the cause of a fire. However, traditional methods rely mainly on qualitative analysis grounded in physics and chemistry, so the problem remains challenging. This paper presents a case study on attribute recognition of heated metal mark images using computer vision and machine learning technologies. The proposed work is composed of three parts. First, the material is generated. According to national standards, actual needs and feasibility, seven attributes are selected for research. Data generation and organization are conducted, and a small benchmark dataset is constructed. Second, a recognition model is implemented. Feature representation and classifier construction methods based on deep convolutional neural networks are introduced. Finally, the experimental evaluation is carried out. Tests are performed with various model structures, data augmentation settings, training modes, optimization methods and batch sizes. The influence of parameters, recognition efficiency and execution time are also analyzed. The results show that with a fine-tuned model, the recognition rates of the attributes metal type, heating mode, heating temperature, heating duration, cooling mode, placing duration and relative humidity are 0.925, 0.908, 0.835, 0.917, 0.928, 0.805 and 0.92, respectively. The proposed method recognizes the attributes of heated metal marks effectively, and it can be used in practical applications.


Introduction
With the rapid development of the construction industry and material technology, metal components are widely used in modern architecture and domestic appliances. Generally, metal components are nonflammable and are retained after a fire. When a metal component is heated, complex physical and chemical changes happen, and various marks are left on its surface. The heated metal mark is influenced by attributes such as heating temperature, heating duration, heating mode and cooling mode. The oxidation reactions on the metal surface differ under different attribute conditions. These attributes of heated metal marks are useful clues for locating the fire point, from which the source and situation of the fire can be further analyzed. A fire scene is very complicated and cannot be reproduced. Therefore, recognizing or classifying the attributes of heated metal by observing its mark images is a more practical approach. Table 1 gives the inspection methods for trace and physical evidences from fire scene (a national standard of the People's Republic of China) [1]. It includes the relationship between the color of a heated metal mark and the heating temperature. It should be noted that the color range of the metal is determined by a human expert. According to the standard, Ying Wu et al. utilized metal oxidation theory to analyze the relation between the color of a metal surface and its heating temperature. When the heating temperature approached or exceeded the melting point, the metallographic organization changed significantly [2,3]. Zejing Xu and Yupu Song proposed a method to record the attribute values of an object surface by means of macro-inspection and micro-analysis. Fuzzy mathematics was then adopted to estimate the temperature of building components [4]. Dadong Li and Tengyi Yu analyzed the changes in the surface of Zn-Fe alloy under different temperatures and heating durations. Using a stereo microscope and an electron microscope, they found that the chemical composition and organization structure changed, and that these changes led to the color change on the metal surface [5]. It can be seen that traditional methods for this problem are mainly based on qualitative analysis using knowledge of physics and chemistry. However, they are usually impractical to implement and offer little automation. This paper takes another point of view and relies completely on computer vision and machine learning technologies. The attributes of heated metal are modeled and analyzed in a data-driven manner, and an intelligent recognition method is devised.
Image recognition is a classical problem in the fields of computer vision and machine learning. With an annotated training dataset, supervised or unsupervised learning methods can be adopted. Recognition has two main steps. First, features are extracted from the training images. Second, classifier models are trained with the feature vectors and the corresponding attribute labels.
Image feature representation is a key research field, and many works have been reported. Before 2012, mainstream methods for image feature extraction and representation were based on hand-crafted features designed by experienced scientists and engineers. Local feature extraction and representation algorithms were designed to be as robust as possible to translation, scale variation, rotation, illumination change and distortion. Local image descriptors were then transformed into feature vectors, and a global image representation was aggregated from all local feature vectors. Some representative works are introduced below. David Lowe proposed SIFT (scale-invariant feature transform) [6,7]. It was computed from the pixel intensities around a specific interest point in the image domain. A SIFT descriptor was encoded with 4 × 4 × 8 = 128 dimensions for each interest point. Dalal and Triggs developed the local image descriptor HOG (histograms of oriented gradients), which was computed from a group of gradient orientation histograms within subregions [8]. The dimension of the HOG descriptor was determined by the number of cells per block, the number of pixels per cell and the number of channels per cell histogram. The SURF (speeded-up robust features) descriptor proposed by Bay et al. was closely related to SIFT [9]. The main difference was that SURF was computed based on Haar wavelets, and the interest points were determined from approximations of the scale-space extrema of the determinant of the Hessian matrix. SURF had better computational efficiency. The Gabor filter was a linear filter commonly used in image texture analysis [10,11]. A classical 2D Gabor filter in the spatial domain can be seen as a sinusoidal plane wave modulated by a Gaussian kernel; it estimates whether a localized region of an image contains specific frequency content in specific directions. LBP (local binary patterns) was another powerful feature descriptor for image classification [12]. It was computed by comparing a pixel with each of its 8 neighboring pixels, which defined an 8-digit binary number read in clockwise or counter-clockwise order. The frequency of each binary number was computed within a cell, and the final feature vector was formed by accumulating the histograms of all cells in a region. Moreover, improved versions of these local feature descriptors were continually proposed.
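The LBP computation described above can be illustrated with a minimal pure-Python sketch. The function names `lbp_code` and `lbp_histogram`, the "greater than or equal to the center" thresholding rule and the choice of the top-left neighbor as the starting bit are assumptions made for illustration; the cited work may use a different convention.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the center pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, c in offsets:
        bit = 1 if patch[r][c] >= center else 0  # assumed thresholding rule
        code = (code << 1) | bit
    return code


def lbp_histogram(image):
    """Histogram (256 bins) of LBP codes over all interior pixels of an image."""
    hist = [0] * 256
    for i in range(1, len(image) - 1):
        for j in range(1, len(image[0]) - 1):
            patch = [row[j - 1:j + 2] for row in image[i - 1:i + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

The histogram returned by `lbp_histogram` plays the role of the per-cell frequency vector; concatenating such histograms over several cells would give the region-level feature vector.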
A global image descriptor is then built from these local feature descriptors. The BoVW (bag of visual words) model was one of the most widely adopted methods [13]. First, visual words were obtained by clustering all local feature vectors, and the visual vocabulary comprised all visual words. Then each local patch of an image was mapped to a visual word, and the whole image was represented by the histogram of visual word frequencies. One disadvantage of the original BoVW model was that it lacked the spatial relationships of the image content. Kristen Grauman and Trevor Darrell proposed SPM (spatial pyramid matching method) [14]. SPM treated an image at multiple resolutions and generated histograms by binning data points into discrete regions of different sizes. Thus, features that did not match at high resolutions could still be matched at low resolutions. The VLAD (vector of locally aggregated descriptors) [15] and FV (Fisher vector) [16] methods were based on encoding the first and second order statistics of the feature vectors. They not only increased classification performance, but also decreased the size of the visual vocabulary and lowered the computational effort.
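As a small illustration of the BoVW representation, the sketch below maps each local descriptor to its nearest visual word and builds the normalized word-frequency histogram. It assumes that the vocabulary has already been obtained by clustering; the names `nearest_word` and `bovw_histogram` and the use of squared Euclidean distance are illustrative choices, not taken from the cited work.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to a descriptor (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(descriptor, vocabulary[k]))


def bovw_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word frequencies for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    # An image without descriptors keeps an all-zero histogram.
    return [c / total for c in counts] if total else counts
```

Because the histogram only counts word occurrences, two images with the same local patches in different positions receive the same representation, which is the loss of spatial information noted above.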
With the global image feature representation, metrics between high-dimensional feature vectors are used to measure the difference between image objects. SVM (support vector machine) was the most widely used classifier training method [17]. It treated features as points in a high-dimensional space and learned a mapping in which the examples of separate categories were divided by hyperplanes with a margin that was as wide as possible.
Many publicly available image benchmark datasets with large numbers of labeled training samples were provided to speed up technology development. ImageNet and COCO were the two most famous ones. ImageNet was first released by Jia Deng et al. in 2009 [18]. It contained at least 14 million images and covered over 20,000 categories. The Microsoft COCO dataset was released in 2014 with a total of 2.5 million labeled instances in 328,000 images [19]. These datasets not only provided large numbers of labeled images, but also provided platforms for comparing different algorithms under unified standards.
Recently, deep learning has achieved great success in the machine learning field, especially for image classification [20]. It is also called deep structured learning or hierarchical learning, and it is essentially a special form of neural network. It uses a cascade of multiple layers of nonlinear processing units for feature transformation and extraction. The main advantages of deep learning are: (1) deep-level feature extraction. It generates compositional models in which the object is expressed as a layered composition of primitives; and (2) efficient parameter adjustment. The parameters of a deep model for feature extraction are tuned automatically based on the training data and the loss function. Yann LeCun designed a small-scale convolutional neural network, LeNet, for recognizing handwritten mail ZIP codes [21,22]. A medium-scale deep convolutional neural network, AlexNet, proposed by Krizhevsky and Hinton, won the ImageNet competition by a significant margin over traditional methods [23]. In the next few years, several more powerful models were proposed. ZFNet, VGGNet, GoogLeNet and ResNet achieved leading results in the ImageNet image classification competition in successive years [24][25][26][27]. ResNet achieved a top-5 error of 3.57%, which is below the reported human-level error.
To the best of our knowledge, no existing research focuses on our problem, so the most relevant works are reviewed here. A method for detecting the type of rail surface defects was proposed in [28]. It constructed a deep network with three convolutional layers, three max-pooling layers and two fully connected layers. A total of 22,408 object images were manually labeled. Using the larger network and 90% of the data for training, a multi-class accuracy of 92.47% was obtained. A bearing fault diagnosis algorithm based on ensemble deep networks and an improved Dempster-Shafer theory was introduced in [29]. The model used in this work was a smaller one with 3 convolutional layers and 1 fully connected layer. The fusion model combined multiple uncertain evidences and computed the result by merging consensus information and excluding conflicting information. Ten thousand image samples were used for training and 2500 image samples were used for testing. With fusion and ensemble, it reached a performance of 98.72% for 10-type fault classification. A deep learning-based method was proposed for the characterization of defected areas in steel elements, using a magnetic multi-sensor matrix transducer and data integration [30]. In this method, three united architectures for multi-label classification were used to evaluate defect occurrence, rotation and depth. Basically, this model contained three convolutional layers, three max-pooling layers and one fully connected layer. Thirty-five thousand simulated data samples were generated, and the data were split between training and testing with a ratio of 85:15. A surface defect classification method was proposed for hot-rolled steel sheet [31]. The network contained seven layers, and eight surface defects were defined. The whole dataset contained 14,400 samples, with 1800 samples for each type. An accuracy of 94% was obtained with 5/9 of the data used for training. A damage detection method for civil infrastructure was designed in [32]. The model contained three convolutional layers, three pooling layers and one fully connected layer. The images were divided into small patches and manually annotated as crack or intact. The dataset contained 40,000 samples, 90% of which were used for training. An accuracy of 98.22% was obtained with sliding windows. A Faster R-CNN-based method was used for structural surface damage detection [33]. Five types of surface damage were defined: concrete cracks, steel corrosion (medium and high levels), bolt corrosion and steel delamination. ZFNet was used as the backbone network. A total of 2366 image samples were collected as the dataset. This model achieved an accuracy of 87.8% with a 2.3:1 proportion of training to testing samples. A multilevel deep learning model was proposed for surface defect and crack detection inside steel box girders [34]. This model included three bypasses that were concatenated into the final feature representation. Three types were defined: crack sub-image, handwriting sub-image and background sub-image. Raw images were obtained with a common digital camera. After division, 67,200 sub-image samples were generated. With 80% of the dataset used for training, a mean accuracy precision of 95% was obtained. Moreover, the effects of super-resolution inputs were also investigated. These related works address problems similar to ours. However, they usually adopted relatively simple models and did not consider state-of-the-art deep learning models, and their training and optimization procedures were not demonstrated clearly. Based on these points, we carry out our research.
This paper presents a case study on heated metal attribute recognition with deep convolutional neural network models. There are three important stages: (1) Material construction stage. The attributes of heated metal are first defined as needed. Then the procedure of raw image data generation is designed, including material type, heating mode, cooling method, capture device, etc. Finally, the benchmark dataset is organized; (2) Model training stage. The deep convolutional neural network models used in this work are introduced, including the basic structure, the structures of top models and useful techniques; (3) Experimental evaluation stage. Experiments and analyses are carried out on many aspects, including the performance of different models, parameter settings, data augmentation, model convergence, recognition efficiency and execution time. Figure 1 gives the whole framework of this study.
The main contributions of this paper are threefold:
• Deep convolutional neural network models are adopted to recognize the attributes of heated metal based on its mark images;
• The material benchmark dataset is newly designed and generated;
• Extensive experimental evaluations and analyses are carried out.
The rest of this paper is organized as follows. Section 2 presents the materials generation. Section 3 describes the methodology. Experimental evaluation and analysis are given in Section 4. Section 5 concludes this paper.

Materials Generation
Since there are no benchmarks in related fields, the dataset for training and testing is constructed in this work. This section covers attribute definition, raw image generation and benchmark dataset construction.

Attribute Definition
According to the conditions defined in the national standard of the People's Republic of China GB/T42327905.3-2011 (inspection methods for trace and physical evidences from fire scene-Part 3: Ferrous metal work) [1], the heating temperature of the metal is the most important factor. In addition, other important factors are covered to meet practical demands. Therefore, metal type, heating mode, heating temperature, heating duration, cooling mode, placing duration and relative humidity are used as the basic attributes to be recognized from a heated metal mark image. For each attribute, the value ranges considered in this study are detailed in Table 2. For simplicity, attribute i is abbreviated as a_i in the subsequent sections.

Raw Image Generation
Two widely used metal materials, galvanized steel and cold rolled steel, are selected as research objects. The metal plate is first cut into pieces of equal size (length = 1.0 cm, width = 1.0 cm, thickness = 1.0 mm). This guarantees the consistency of the experimental conditions. Three devices, a vacuum resistance furnace, a muffle furnace and a gasoline burner, are used to heat the metals in order to simulate three different heating scenes. Figure 2a-c shows the three devices, respectively. After being heated at a specific temperature (a_3) for a specific duration (a_4), the metals are placed in a test chamber, as shown in Figure 2d. The test chamber provides constant temperature and humidity, so the attributes cooling mode (a_5), placing duration (a_6) and relative humidity (a_7) can be controlled. To better exhibit the appearance features of the sample images, a special purpose microscope is used instead of a traditional camera to capture the heated metal mark images, as shown in Figure 2e.

Benchmark Dataset Construction
According to the conditions and processes set up above, independent productions are conducted. Each image sample is captured with a resolution of 2152 × 1616 pixels and is labeled with 7 attribute values, as illustrated in Table 2. Image samples are shown in Figure 3, and there are 900 image samples in total. Based on the generated image dataset with attribute label values, this work makes a case study to analyze and construct the relations between heated metal mark images and their attributes using computer vision and deep convolutional neural network models.

Methodology
In this study, we want to design a model that predicts a metal attribute from its mark image. The basic formulation can be written as Equation (1), y = f(x), where x denotes a heated metal mark image and y is the attribute value estimated by a classifier model f(). In the following, the basic structures of convolutional neural networks, top CNN models, useful techniques and pseudocode are explained.

Basic Structures in CNNs
CNNs (convolutional neural networks) are a special form of neural network and have proved to be the most powerful models for computer vision, especially for image classification and object detection [20]. Classical CNNs are composed of three principal layer types: the convolutional layer, the pooling layer and the fully connected layer.

Convolutional Layer
The convolutional layer is the core building block of CNNs. It contains a set of trainable filters. Typically, each filter slides over the image spatially, and the feature map is computed by the convolution operation (a dot product) across the whole image. Equation (2) gives the basic convolution operation. The convolution is an elementwise multiplication and sum of a filter with a local image region. confeature[i, j], c[i, j] and I[i, j] represent the convolution result, the convolution filter kernel and the image at indices i and j. The height and width of the filter kernel are denoted by l. An activation operation usually follows the convolution and introduces non-linearity into the model. Equation (3) gives the ReLU (rectified linear units) function, r(z) = max(0, z), one of the most popular activation functions for CNNs [35]. z is the result of the convolution, r(z) denotes the activation value, and all activation results constitute the feature map. Figure 4 shows the basic convolution operation on an image. Let the input image have 32 × 32 pixels with RGB channels, so that it can be represented as a 32 × 32 × 3 matrix. If there are 6 filters of size 5 × 5 × 3, then 6 separate activation feature maps, stacked to a size of 28 × 28 × 6, are computed (with ReLU activation, no padding and a stride of 1 pixel).
Ideally, one filter corresponds to a specific feature. The advantages of the convolutional layer are that the local structure of an image can be captured and that the parameters of a filter are shared across image positions.
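The sliding dot product, the ReLU activation and the 32 → 28 size arithmetic can be checked with a short sketch. It handles a single-channel image and a single square kernel, and it uses the unflipped sliding-window form that deep learning frameworks commonly call convolution; the index convention and the function names are assumptions for illustration rather than the exact form of Equation (2).

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of the feature map along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def relu(z):
    """Rectified linear unit: r(z) = max(0, z)."""
    return max(0.0, z)


def conv2d_relu(image, kernel):
    """No-padding, stride-1 sliding dot product of a single-channel image
    with one square l x l kernel, followed by ReLU."""
    l = len(kernel)
    out_h = conv_output_size(len(image), l)
    out_w = conv_output_size(len(image[0]), l)
    feature = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Elementwise multiplication and sum over the local region.
            z = sum(kernel[m][n] * image[i + m][j + n]
                    for m in range(l) for n in range(l))
            row.append(relu(z))
        feature.append(row)
    return feature
```

For example, `conv_output_size(32, 5)` gives 28, which matches the 28 × 28 feature maps obtained from a 32 × 32 input with 5 × 5 filters, no padding and a stride of 1.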

Pooling Layer
The aim of the pooling layer is to reduce the dimension of a feature map while retaining its important features. It makes the feature representation smaller and more compact. The result of the pooling layer is shown in Figure 5a. The basic pooling operation slides a window of specified size and stride over a feature map and computes the output value with a max or mean operation inside the window, as shown in Figure 5b. The pooling layer decreases the scale of the feature map, so the subsequent computation is also reduced. Moreover, pooling reduces the number of parameters needed in subsequent layers and makes the model more robust to small translations, distortions and scale changes.
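A minimal sketch of the window-based pooling operation is given below. The function name `pool2d` and the default window size and stride of 2 are illustrative assumptions.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D feature map and reduce each
    window with max (default) or mean."""
    reduce = max if mode == "max" else (lambda w: sum(w) / len(w))
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + m][j + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out
```

With a 2 × 2 window and a stride of 2, a 4 × 4 feature map is reduced to 2 × 2, so the number of values passed to the next layer drops by a factor of four.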

Fully Connected Layer
A fully connected layer can be seen as a traditional multi-layer perceptron. Fully connected means that all nodes in the previous layer are connected with all nodes of the next layer. It has two basic roles: (1) a fully connected layer is another way of learning non-linear combinations of features from different depths; (2) it can be used as the output layer, in which the last feature map is transformed into the classification result through full connection. In this case, the softmax activation function is usually adopted. Figure 6 shows the fully connected layer. As shown in the figure, an image is used as the input of the CNN model. Layer m − 1 and layer m are two consecutive hidden layers, which are fully connected. The output layer has two nodes, which represent the 2 attribute values of metal type that the model predicts.
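The fully connected output layer with softmax activation can be sketched as follows. The helper names `dense` and `softmax` and the example weights are hypothetical; the two outputs stand for the two metal type values mentioned above.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: every input node feeds every output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-node output layer on top of a 2-dimensional feature vector.
scores = dense([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5])
probabilities = softmax(scores)
```

The predicted attribute value is the index of the largest entry in `probabilities`.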

Loss Function and Model Training
Let {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} denote the training image dataset. x_i denotes a heated metal mark image of size w × h × c, where w, h and c denote the width, height and number of channels of the input image. y_i ∈ {0, 1, ..., n} is the attribute label value.
For an input heated metal mark image sample x_i, we want to compute the probability p(y = j | x_i) for each j ∈ {0, 1, ..., n}. The output, an n-dimensional vector, represents the probability of each attribute value that x_i may belong to. The hypothesis function can be expressed as Equation (4),
where θ = {θ_0, θ_1, ..., θ_n} denotes the model parameters and θ_i is the parameter vector that belongs to the ith predicted attribute value. This equation normalizes the result so that the probabilities sum to 1. For model training, the loss function is given as Equation (5), in which 1{·} represents the indicator function and the second term is a commonly used term for model regularization. The loss function indicates the difference between the predicted and the true attribute label values, and the goal of training is to minimize it. The SGD (stochastic gradient descent) method is used for optimization, and the corresponding derivative functions are given as Equations (6) and (7).
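To make the roles of the hypothesis function, the loss and the SGD update concrete, the sketch below uses the standard softmax regression form with an L2 (weight decay) regularization term. This is an assumed textbook formulation that is consistent with the definitions above, not necessarily the exact form of Equations (4) to (7); the regularization weight `lam`, the learning rate `alpha` and all function names are hypothetical.

```python
import math


def class_probabilities(theta, x):
    """p(y = j | x) for every class j, normalized to sum to 1."""
    scores = [sum(t * v for t, v in zip(theta_j, x)) for theta_j in theta]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss(theta, samples, labels, lam=0.0):
    """Mean negative log-likelihood plus an L2 regularization term (assumed form)."""
    m = len(samples)
    data_term = -sum(math.log(class_probabilities(theta, x)[y])
                     for x, y in zip(samples, labels)) / m
    reg_term = 0.5 * lam * sum(t * t for theta_j in theta for t in theta_j)
    return data_term + reg_term


def sgd_step(theta, x, y, alpha=0.1, lam=0.0):
    """One stochastic gradient step on a single sample (x, y)."""
    probs = class_probabilities(theta, x)
    new_theta = []
    for j, theta_j in enumerate(theta):
        indicator = 1.0 if y == j else 0.0  # the indicator function 1{y = j}
        grad_j = [-(indicator - probs[j]) * v + lam * t
                  for v, t in zip(x, theta_j)]
        new_theta.append([t - alpha * g for t, g in zip(theta_j, grad_j)])
    return new_theta
```

In a deep network the input `x` of this layer is the feature vector produced by the preceding layers, and the same gradient is propagated backwards to update the filters.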

Top CNNs Models
In this subsection, the state-of-the-art CNN models used in our study are introduced.

VGGNet
VGGNet was introduced by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group, University of Oxford [25]. This work explored the feasibility of increasing the depth of CNN models with very small convolution filters (a 3 × 3 receptive field and up to 19 weight layers) for large-scale image classification. The performance on the ImageNet challenge demonstrated the effectiveness of the VGGNet model.

ResNet
Kaiming He et al. proposed ResNet at Microsoft Research [27]. This model focused on training extremely deep networks. To solve the problem of vanishing gradients, a residual learning framework was devised. The output of a building block was computed by adding the output of the traditional stacked layers to a shortcut connection that performs identity mapping. The deep residual model stacked these basic building blocks of residual learning and made the model extremely deep (up to 152 layers). The experimental results on the ImageNet dataset demonstrated that this seemingly simple technique makes extremely deep models easier to optimize. It won 1st place in the ILSVRC 2015 classification task while still having lower complexity.
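The addition of the stacked-layer output and the identity shortcut can be written in a few lines. In this sketch `stacked_layers` is a hypothetical callable standing in for the weight layers, and the activation applied after the addition in the full model is omitted for brevity.

```python
def residual_block(x, stacked_layers):
    """Return F(x) + x, where F is the stacked-layer mapping and x the identity shortcut."""
    fx = stacked_layers(x)
    return [f + s for f, s in zip(fx, x)]


# If the stacked layers output zeros, the block reduces to the identity mapping.
identity_out = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))
```

Because the block only has to learn the residual F(x) rather than the full mapping, adding more blocks does not make the identity mapping harder to represent.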

Inception
This model was first introduced by Christian Szegedy et al. at Google Inc. [26]. Guided by the Hebbian principle and multi-scale processing, it increased both the depth and the width of the network while keeping the computational budget constant. They devised the Inception block as a new organization, with filters at multiple scales (1 × 1, 3 × 3 and 5 × 5). 1 × 1 convolutions were used as bottlenecks to reduce high dimensions. A 22-layer deep network was finally constructed by stacking these Inception modules. This model was referred to as Inception-v1.
Sergey Ioffe and Christian Szegedy addressed the problem of internal covariate shift in the training of deep CNN models [36] by normalizing the inputs of each layer. This enabled training with much higher learning rates and made training less sensitive to parameter initialization. This model was called Inception-v2.
Christian Szegedy and Vincent Vanhoucke explored ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization [37]. The resulting higher-quality model, Inception-v3, achieved better performance.
Christian Szegedy and Sergey Ioffe et al. combined the Inception architecture with residual connections [38]. The empirical results clearly showed that the new model accelerated network training significantly and outperformed the traditional Inception model. This new model was referred to as Inception-v4.

MobileNet
Aiming to deploy deep learning models on computationally limited platforms such as mobile and embedded systems, Andrew G. Howard and Menglong Zhu et al. proposed MobileNets [39]. Depth-wise separable convolutions were introduced to build lightweight deep neural networks, and the trade-off between model latency and accuracy was considered. Two global hyperparameters, a width multiplier and a resolution multiplier, allow the model size to be chosen according to the constraints of the problem.
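The saving obtained from depth-wise separable convolutions can be illustrated by counting weights. The sketch below ignores bias terms, and the function names and example layer sizes are illustrative assumptions.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a standard convolution (bias terms ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = separable_conv_params(3, 32, 64)
ratio = separable / standard
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, since the ratio equals 1/out_channels + 1/kernel_size².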

Useful Techniques
Here we introduce some useful techniques used for the training and optimization of deep learning models.

Dropout
Deep convolutional neural networks usually contain a huge number of parameters and are therefore prone to overfitting. Dropout is a technique in which each node is dropped with probability 1 − p or kept with probability p during training. Only the retained nodes are trained. In the testing stage, the full network is used, which approximates averaging over all the thinned networks. By limiting node interactions, this method makes the model learn feature representations that generalize better to new data. Dropout not only reduces overfitting but also improves training efficiency [40].
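A minimal sketch of dropout on a list of activations follows. It uses the "inverted" variant, which rescales the kept activations by 1/p during training so that nothing needs to be rescaled at test time; this is a common implementation choice and an assumption here, and the function name and the `rng` parameter are hypothetical.

```python
import random


def dropout(activations, keep_prob, training=True, rng=random):
    """Inverted dropout: keep each node with probability keep_prob and
    rescale kept values; at test time the activations pass through unchanged."""
    if not training:
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

With `keep_prob=0.8`, about 20% of the nodes are zeroed in each training step, and the remaining ones are scaled by 1.25 so that the expected activation stays the same.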

Data Augmentation
Most applications face the problem of insufficient training data, which is a key obstacle to training large-scale deep learning models. Data augmentation is a widely used method that generates new data by perturbing existing data [41,42]. It provides more training data, reduces overfitting and improves generalization to a certain extent.
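The idea of perturbing existing images can be sketched on small grayscale images stored as nested lists. The three operations, the brightness range of ±20 and the per-image augmentation probability are illustrative assumptions, and all function names are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Reverse the order of the rows."""
    return image[::-1]


def adjust_brightness(image, delta):
    """Shift every pixel by delta and clip to the range [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def augment_batch(batch, probability, rng=random):
    """Apply one randomly chosen perturbation to each image with the given probability."""
    ops = [horizontal_flip, vertical_flip,
           lambda img: adjust_brightness(img, rng.randint(-20, 20))]
    return [rng.choice(ops)(img) if rng.random() < probability else img
            for img in batch]
```

The attribute labels are left unchanged by these operations, so each perturbed image can be used as an additional training sample with the label of its source image.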

Pre-Trained Model
Another way to handle insufficient training data is to use an existing model for initialization. The parameters of the existing model are loaded into the network, and training of the new model starts from them [43]. The pre-trained models are often trained on other large datasets from related domains or from the same domain.

Pseudocode
In this subsection, Algorithm 1 gives the pseudocode of the proposed method. The inputs are the training set, the test set and the initialized model parameter θ. A batch of sample images S is randomly selected from the training set. After augmentation, the model is trained once with S and the parameter θ is updated. The loss L is then computed on the test set. Training ends when L falls below a predefined threshold or the iteration counter i exceeds the predefined threshold N.
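The control flow of this procedure can be sketched as a generic loop. The callables `update`, `evaluate_loss` and `augment`, the parameter names and the exact stopping comparison are assumptions for illustration; they stand in for the model-specific steps and are not the authors' implementation.

```python
import random


def train(train_set, test_set, theta, update, evaluate_loss, augment,
          batch_size, loss_threshold, max_iterations, rng=random):
    """Repeat: sample a batch, augment it, update theta, evaluate the loss.
    Stop when the loss drops below loss_threshold or max_iterations is reached."""
    i = 0
    while True:
        batch = rng.sample(train_set, min(batch_size, len(train_set)))
        batch = [augment(sample) for sample in batch]
        theta = update(theta, batch)             # one training step on the batch
        loss = evaluate_loss(theta, test_set)    # loss computed on the test set
        i += 1
        if loss < loss_threshold or i >= max_iterations:
            return theta, loss, i
```

A toy call with a scalar parameter that is halved at every step, `train([1.0] * 10, [], 1.0, lambda th, b: 0.5 * th, lambda th, ts: th * th, lambda s: s, 4, 0.01, 100)`, stops as soon as the squared parameter falls below the threshold.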

Experiment Setup
The generated benchmark dataset is used to evaluate the performance of heated metal mark attribute recognition with the deep convolutional neural network models described in the above sections. In this case study, seven groups of attributes of heated metal marks are considered. Each group of attributes is tested independently. Python is used as the programming language. TensorFlow is adopted as the deep learning framework, with Keras as the high-level library. All experiments are run on a PC with a Pentium i5-7 series CPU, 16G RAM, an Nvidia GTX 1070 GPU and Ubuntu OS.
The experiments include the following aspects: (1) Evaluation of recognition rate with cross validation; (2) Evaluation of recognition efficiency; (3) Evaluation of different optimization methods; (4) Evaluation of different batch sizes; (5) Evaluation of execution time.

Evaluation of Recognition Rate with Cross Validation
The recognition performance is evaluated independently for each attribute, so there are 7 groups of tests. The performance of heated metal mark image attribute recognition is measured by the overall recognition rate, N_correct / N_all, as shown in Equation (8). N_correct denotes the number of correctly recognized samples and N_all denotes the number of all testing samples. For each attribute, the dataset is divided into 5 subsets with the attribute values equally distributed. 4 randomly chosen subsets (720 image samples) are used for training and the remaining subset (180 image samples) is used for testing. This process is repeated 4 times and the result is computed by averaging the 4 independent tests.
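The recognition rate and a split with equally distributed attribute values can be sketched as below. The fold assignment is deterministic (round-robin per attribute value, without shuffling), which is a simplifying assumption, and the function names are hypothetical.

```python
def recognition_rate(predicted, actual):
    """Overall recognition rate: N_correct / N_all."""
    n_correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return n_correct / len(actual)


def stratified_folds(labels, n_folds=5):
    """Assign sample indices to folds so that each label value is spread evenly."""
    folds = [[] for _ in range(n_folds)]
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    for indices in by_label.values():
        for k, idx in enumerate(indices):
            folds[k % n_folds].append(idx)
    return folds
```

For 900 samples with two equally frequent attribute values, each of the 5 folds receives 180 samples, so 4 folds give the 720 training samples and the remaining fold gives the 180 testing samples.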
Inception-v4, Inception-v3, ResNet, VGG16 and MobileNet are selected as the basic CNN architectures for evaluation. The factors of pre-trained model and data augmentation are considered. In this subsection, the pre-trained models are trained with the COCO dataset [19]. If a pre-trained model is used for initialization, the parameters of the low-level layers are fixed and the remaining parameters are trainable. If no pre-trained model is used for initialization, the parameters of all layers are trainable. For data augmentation, commonly used transformations are adopted, including random cropping, vertical and horizontal flipping, and perturbation of brightness, saturation, hue and contrast. If the model is trained with data augmentation, 40% of the training images in each batch are augmented; otherwise the proportion is 10%. The model input size is set to 224 × 224 × 3 pixels. The number of epochs is set to 20 and the batch size to 12. SGD is used as the preferred optimization method, with a learning rate of 0.0001 and a momentum of 0.9. Dropout is set to 0.2.
The average recognition results are shown in Table 3. The configurations of CNN model, pre-trained model and data augmentation are listed in the 1st, 2nd and 3rd columns, respectively. The experiments are conducted under various combinations of these conditions. a_i denotes the ith attribute. train_a and test_a denote the recognition rates of training and testing. For convenience, NetModel(p_1, p_2) is used to represent the model structure and parameters, where NetModel ∈ {Inception-v4, Inception-v3, ResNet, VGG16, MobileNet}. p_1 and p_2 are the parameters for pre-training and data augmentation, with p_i ∈ {0, 1}, where 0 stands for off and 1 stands for on. For example, VGG16(0, 1) means that the model is trained with the VGG16 structure, with pre-training off and data augmentation on.
For training performance, most models (with various configurations) finally reach an accuracy of 0.9, and some are close to 1. Meanwhile, the training accuracy stabilizes after about 10 epochs for all attributes. The results demonstrate that the training accuracy for a_1 to a_7 is acceptable. This is mainly because the CNN models are relatively large, and their strong feature abstraction ability gives them high recognition performance on the training dataset. These results are similar to those of other research reports.
For However, there are significant differences in testing accuracy versus epoch. Large fluctuations are shown especially on a 2 , a 4 and a 7 in our experiments. This also reveal that different training modes have great influence on increasing the testing accuracy of models. Models with configuration (0,1) obtain better testing accuracy for all attributes. Unlike researches of other image recognition field that using pre-trained model can get better optimization, the experimental results in our study show divergences that heated metal mark image attribute recognition with pre-trained off gets best performances. The main reasons are that heated metal mark image is a very special research object, and there is large gap from the common image dataset, so the filters provided by pre-trained model obtained with common image dataset do not have much impact on our study. Inception-v4 outperforms other models on recognition performance demonstrates superiority of combining Inception and Residual. The result also shows that models with data augment can improve performance effectively. This is reasonable for training data with certain augment can increase the diversity of sample and model robust can be improved. It is especially important for large scale CNNs for its huge parameters are prone to overfitting with insufficient training data. However, among all tests only a small group of models achieve good convergence. The reason for this situation may originate from the complexity of this study, including uncertain noisy generated in process of training image generation or unsuitable attribute values definition. This can be solved by detailed model design and more careful tuning.

Evaluation of Recognition Efficiency
In this subsection, the individual class recognition efficiency η_i, the average recognition efficiency η_a and the overall recognition efficiency η_o are evaluated. They are defined in [44] and shown in Equations (9) and (10), where q_ij is the number of samples of class i that are classified into class j, n_c is the total number of classes and N is the total number of samples. Table 4 gives the recognition efficiency of the 5 deep learning models. For ease of comparison, all models are configured with (0,1). The results show that Inception-v4 performs better than the other models, which coincides with the results of Section 4.2. Moreover, the individual efficiency η_i has low variance, which indicates the stability of the model.
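The three efficiencies can be computed directly from the confusion matrix q. The sketch below assumes the usual confusion-matrix definitions, namely η_i = q_ii / Σ_j q_ij, η_a = (1/n_c) Σ_i η_i and η_o = (1/N) Σ_i q_ii. These formulas are an assumption based on the symbols described above and should be checked against Equations (9) and (10) and [44]. The function name recognition_efficiency is hypothetical.

```python
from typing import List, Sequence, Tuple


def recognition_efficiency(
    q: Sequence[Sequence[int]],
) -> Tuple[List[float], float, float]:
    """Return (eta_i list, eta_a, eta_o) for a square confusion matrix q.

    q[i][j] is the number of samples of class i classified into class j.
    Assumed definitions (to be checked against the cited equations):
      eta_i = q[i][i] / sum_j q[i][j]
      eta_a = mean of eta_i over the n_c classes
      eta_o = sum_i q[i][i] / N
    """
    n_c = len(q)
    if n_c == 0 or any(len(row) != n_c for row in q):
        raise ValueError("q must be a non-empty square matrix")
    total = sum(sum(row) for row in q)  # N, the total number of samples
    if total == 0:
        raise ValueError("q contains no samples")
    eta_i = []
    for i in range(n_c):
        row_total = sum(q[i])
        # A class with no samples gets efficiency 0.0 by convention here.
        eta_i.append(q[i][i] / row_total if row_total else 0.0)
    eta_a = sum(eta_i) / n_c
    eta_o = sum(q[i][i] for i in range(n_c)) / total
    return eta_i, eta_a, eta_o
```

With an unbalanced two-class matrix such as [[18, 2], [5, 5]], the average efficiency weights both classes equally while the overall efficiency is dominated by the larger class, so the two values differ.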

Evaluation of Optimization Method
In this subsection, the performance of different optimization methods (Adam, Adagrad and SGD) is evaluated. For comparison, Inception-v4(0,1) is used as the basic CNN model structure, and the other parameters are the same as in Section 4.2.
In training, Adam and SGD obtain the better results on attributes a_1, a_2, a_3, a_5 and a_7, SGD obtains the best result on attributes a_4 and a_6, and Adagrad obtains the worst result on all attributes. The testing results follow the same pattern. A possible reason is that SGD is simple but works well for most tasks, whereas Adam and Adagrad are more fragile in our study and may need more careful tuning.
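For reference, the update rules of the three optimizers can be written in a few lines. The sketch below applies them to the toy objective f(x) = x^2 rather than to a CNN, so it says nothing about their relative ranking in this study. The momentum of 0.9 follows the text; the learning rate of 0.1, the step count and the Adam and Adagrad constants are illustrative assumptions (the experiments use a learning rate of 0.0001).

```python
import math


def grad(x: float) -> float:
    """Gradient of the toy objective f(x) = x^2."""
    return 2.0 * x


def run_sgd_momentum(x: float, lr: float, momentum: float, steps: int) -> float:
    """SGD with classical momentum: v <- momentum * v - lr * g; x <- x + v."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(x)
        x += velocity
    return x


def run_adagrad(x: float, lr: float, steps: int, eps: float = 1e-8) -> float:
    """Adagrad: scale each step by the root of the accumulated squared gradients."""
    accumulated = 0.0
    for _ in range(steps):
        g = grad(x)
        accumulated += g * g
        x -= lr * g / (math.sqrt(accumulated) + eps)
    return x


def run_adam(x: float, lr: float, steps: int,
             beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Adam: bias-corrected moving averages of the gradient and its square."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

Starting from x = 1.0, all three rules move x toward the minimum at 0, but at different rates, which is one reason the optimizers can need different learning-rate tuning on the same task.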

Evaluation of Batch Size
Batch size is another important factor in training CNN models. In this subsection, the performance of different batch sizes (8, 12, 16, 24 and 32) is evaluated. For comparison, Inception-v4(0,1) is used as the basic CNN model structure, and the other parameters are the same as in Section 4.2.
In training, models with batch size = 8 converge slowly, especially on attributes a_1, a_2 and a_4. Models with batch size = 12, 16, 24 and 32 obtain good results. In testing, a larger batch size leads to better model training and generalization. Compared with the models trained in Section 4.2 (batch size = 12), the performance of the models trained with batch size = 32 improves by 0.5%, 0.8%, 0.5%, 0.7%, 0.8%, 2.5% and 0.9% for attributes a_1 to a_7, respectively. The enhancement is obvious for attribute a_6, while the improvements for the other attributes are small.
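The reported gains can be summarized with a short calculation. The sketch below only uses the seven improvement values quoted above; the text does not state whether they are percentage points or relative changes, so they are treated simply as reported percentages. The name summarize_gains is hypothetical.

```python
# Improvement of batch size 32 over batch size 12, as reported in the text (%).
GAINS_PERCENT = {
    "a1": 0.5, "a2": 0.8, "a3": 0.5, "a4": 0.7,
    "a5": 0.8, "a6": 2.5, "a7": 0.9,
}


def summarize_gains(gains: dict) -> tuple:
    """Return (attribute with largest gain, mean gain, mean gain of the others)."""
    best = max(gains, key=gains.get)
    mean_all = sum(gains.values()) / len(gains)
    others = [value for name, value in gains.items() if name != best]
    mean_others = sum(others) / len(others)
    return best, mean_all, mean_others
```

Calling summarize_gains(GAINS_PERCENT) identifies a6 as the attribute with the largest gain and shows that the mean gain of the remaining six attributes is 0.7%, well below the 2.5% reported for a6.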

Evaluation of Execution Times
In this subsection, model execution time is evaluated. The training and testing times of the 5 deep learning models are measured with various batch sizes (8, 12, 16, 24 and 32). Table 5 shows that the VGG16 model has the highest execution time, at 2.67 s, 3.05 s, 3.33 s, 3.67 s and 4.33 s per training iteration for batch sizes of 8, 12, 16, 24 and 32, respectively. MobileNet has the lowest execution time, at about 60% of that of VGG16. Inception-v4 is the preferred model because of its excellent performance and acceptable cost. For testing, MobileNet has the lowest cost, 0.031 s. A trade-off between accuracy and execution time can be made according to the needs of the application.
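Dividing the time per training iteration by the batch size gives the cost per image, which makes the effect of batch size easier to see. The sketch below uses only the VGG16 figures quoted above; the name seconds_per_image is hypothetical.

```python
# VGG16 training time per iteration in seconds, keyed by batch size (from the text).
VGG16_SECONDS_PER_ITERATION = {8: 2.67, 12: 3.05, 16: 3.33, 24: 3.67, 32: 4.33}


def seconds_per_image(seconds_per_iteration: dict) -> dict:
    """Convert time per iteration into time per image for each batch size."""
    return {
        batch_size: seconds / batch_size
        for batch_size, seconds in seconds_per_iteration.items()
    }


if __name__ == "__main__":
    for batch_size, cost in seconds_per_image(VGG16_SECONDS_PER_ITERATION).items():
        print(f"batch size {batch_size:2d}: {cost:.3f} s per image")
```

Although the time per iteration grows with batch size, the time per image falls steadily across the five batch sizes, so larger batches use the hardware more efficiently for these figures.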

Conclusion and Future Works
Heated metals are usually retained at the fire scene, and their marks can be used as an important trace for fire analysis. Traditional methods for recognizing the attributes of heated metal marks depend mainly on human experts with knowledge of physics and chemistry. This makes the work very difficult to popularize and automate. This paper presents a case study on heated metal mark image attribute recognition based on convolutional neural networks. A benchmark dataset for training and testing is designed. For the seven selected attributes, various CNN architectures, parameters, training modes, recognition efficiency and execution time are evaluated and analyzed. A main contribution of this work is the study of the feasibility of attribute recognition on heated metal mark images based on leading CNN models. From the study, we conclude that it is possible to recognize the attributes of heated metal marks with acceptable accuracy using a tuned deep CNN model, and that the approach can be implemented for real-time applications.
This study needs further improvement. Our future work will focus on three aspects: (1) more datasets will be generated to enrich the benchmark; (2) recognition with a multi-attribute joint model will be studied; and (3) customized CNN model structures will be designed and more fine-tuning techniques will be developed.