Deep Residual Learning for Nonlinear Regression

Deep learning plays a key role in recent developments in machine learning. This paper develops a deep residual neural network (ResNet) for the regression of nonlinear functions. Convolutional layers and pooling layers are replaced by fully connected layers in the residual block. To evaluate the new regression model, we train and test neural networks of different depths and widths on simulated data and find the optimal parameters. We perform multiple numerical tests of the optimal regression model on several simulated datasets, and the results show that the new regression model performs well on simulated data. We also compare the optimal residual regression with other linear and nonlinear approximation techniques, such as lasso regression, decision trees, and support vector machines; the optimal residual regression model has better approximation capacity than the other models. Finally, the residual regression is applied to the prediction of a real-world relative humidity series. Our study indicates that the residual regression model is stable and applicable in practice.


Introduction
Functions are common concepts in science and engineering that quantify the dependence or interaction between one variable and others. From the perspective of the superposition principle, functions are classified as linear or nonlinear. Linear functions are analytical, easy to analyze mathematically, and satisfy the superposition principle, while nonlinear functions are complicated and may even be nonanalytical. Linear models are often used to approximate linear functions, such as multiple linear regression [1], stepwise linear regression [2], ridge regression [3], lasso regression [4], and elastic net regression [5], but these do not work for nonlinear functions. However, nonlinear functions are more common in the real world. Therefore, the approximation or regression of nonlinear functions has gained a lot of attention and is more practical and meaningful [6][7][8][9][10][11][12][13][14].
One of the classical results is the Weierstrass approximation theorem, which shows that every continuous function defined on a closed interval can be uniformly approximated as closely as desired by a polynomial function [15]. The support vector regression machine [16,17] is also a well-known nonlinear approximation technique. However, it often takes a long time to train support vector regression machines on large datasets, and it is difficult to choose proper kernel functions. In addition, decision tree regression [18] is widely used in practice, but decision trees can be very non-robust.

In the original ResNet architecture, residual shortcuts are classified as identity shortcuts and convolution shortcuts, as shown in Figure 1. The identity blocks in panel (a) only require addition of tensors and can be used directly when the input and output have the same dimensions. Linear projection shortcuts are used when the dimensions differ. One linear projection shortcut consists of a 1 × 1 convolution unit and a Batch Normalization unit; it is therefore also called a convolution block, as shown in panel (b). The 1 × 1 convolution unit changes the input dimension so that it matches the output dimension, after which the addition can be performed. The inputs of ResNet are often images with many channels, and ResNet behaves well in vision tasks. However, the inputs of nonlinear functions are often one-dimensional vectors, and convolution is not a global but a local operator. This means that if vectors are reshaped into matrices, the convolution kernels cannot affect the whole sequence but only parts of it, which conflicts with the aim of nonlinear regression. Therefore, the original architecture of ResNet is not suitable for nonlinear regression problems.
Based on the structure of ResNet, we build a new neural network for nonlinear regression. Its architecture is presented in Figure 2. Convolutional layers and pooling layers are replaced by fully connected layers (or dense layers) in the residual block. Batch Normalization layers from the primary model are kept in our new model; they act as a regularizer in some cases, eliminating the need for Dropout, and allow much higher learning rates with less care about initialization [36]. Panel (a) shows identity blocks, which are used when the input and output have the same dimensions. The dense blocks in panel (b) correspond to the convolution blocks in Figure 1 and are used when the dimensions differ.

Deep learning usually employs a multilayer network trained with gradient algorithms, so it requires heavy computation, and the learning is often trapped in a saddle point or local minimum [37]. To tackle this issue, the rectified linear unit (ReLU), whose gradient can be computed easily, is used as the activation function:

ReLU(x) = max(0, x).

In this paper, the new residual regression model employs tens or hundreds of layers. Hence, to speed up convergence, ReLU is applied as the activation function in the hidden layers. For the output (or top) layer, a linear activation function is used to meet the nonlinear regression requirement.

The fundamental blocks of our model are shown in Table 1. The output dimension of the input block is equal to the input dimension of the first hidden layer. Usually, the input and output of an identity block have the same dimension (N1), but this is not true for the input dimension (N2) and output dimension (N3) of a dense block (N2 ≠ N3). A dense block is usually inserted between two identity blocks when the output shape of one identity block differs from the input shape of the other. The structure of the regression model is presented in Figure 3. The model begins with an input block, followed by dense blocks and identity blocks, and ends with the output block. In this paper, every dense block is followed by two identity blocks, which are in turn followed by one dense block, and so forth. The last two identity blocks are followed by the output layer.
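To make the block structure concrete, the following is a minimal NumPy sketch of the forward pass of the two block types described above. The weight names (W1, W2, Ws) are hypothetical, and Batch Normalization is omitted for brevity; the paper's actual model is built in Keras.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(0, x), applied elementwise.
    return np.maximum(0.0, x)

def identity_block(x, W1, b1, W2, b2):
    """Forward pass of a fully connected identity block (sketch).

    Both dense layers keep the width of x, so the shortcut is a
    plain addition of the input to the block output.
    """
    h = relu(x @ W1 + b1)   # first dense layer + ReLU
    h = h @ W2 + b2         # second dense layer (pre-activation)
    return relu(h + x)      # add the identity shortcut, then activate

def dense_block(x, W1, b1, W2, b2, Ws, bs):
    """Dense block: the shortcut is itself a dense projection, so the
    input dimension N2 can differ from the output dimension N3.
    This plays the role of the 1 x 1 convolution shortcut in ResNet."""
    h = relu(x @ W1 + b1)
    h = h @ W2 + b2
    shortcut = x @ Ws + bs  # projection shortcut matches dimensions
    return relu(h + shortcut)
```

The identity block requires no extra parameters for the shortcut, which is why it is preferred whenever the dimensions already match.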

Datasets
In this part, we introduce multiple nonlinear functions to evaluate the residual regression model. The order of nonlinearity corresponds to the number of minimum functions. Functions with orders of nonlinearity from 1 to 4 are shown as Equations (2)–(5). In this paper, 10,000,000 samples are generated for each function. The datasets are shown in


Regression Models with Different Depths and Widths
As mentioned, the depth and width of ResNet can affect the approximation capacity. To evaluate these effects, we fix one factor and vary the other. Specifically, when we assess the effect of depth, the width of every hidden layer is fixed and the depth is varied; when we consider the effect of width, the depth of ResNet is fixed and the width is varied. Training data in this part are generated by Equation (4). Before training, the original data are standardized by the Min-Max scaler:

ŵ_k = (w_k − min(w)) / (max(w) − min(w)),

where ŵ_k stands for the standardized value of w_k. The residual regression models are built in Keras using TensorFlow as the backend and are trained on computer clusters with 64 CPUs and 126 GB of Random Access Memory (RAM). The CPUs are Intel(R) Xeon(R) CPU E5-2683 V4 processors working at 2.10 GHz, and every core has two threads. Training is also accelerated by two graphics processing units (GPUs), GeForce GTX 1080 Ti cards produced by NVIDIA, each with a 10421 MB memory. We use the mean squared error (MSE) as the loss function.
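The Min-Max standardization above can be sketched in a few lines of NumPy; scaling each feature column to [0, 1] is the standard definition, though the paper does not show its exact implementation.

```python
import numpy as np

def min_max_scale(w):
    # Min-Max scaler: w_hat_k = (w_k - min(w)) / (max(w) - min(w)),
    # applied per feature column, mapping each column to [0, 1].
    w = np.asarray(w, dtype=float)
    lo = w.min(axis=0)
    hi = w.max(axis=0)
    return (w - lo) / (hi - lo)
```

In practice the minima and maxima should be computed on the training split only and reused for the validation and testing splits, so that no information leaks from held-out data.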
The Adam method is applied to minimize the loss function. It computes individual adaptive learning rates for different parameters [38] and combines the advantages of AdaGrad, which works well with sparse gradients [39], and RMSProp, which works well in online and non-stationary settings [40].
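For reference, a single Adam parameter update follows the rule below; this is a NumPy sketch of the update in Kingma and Ba [38], not the paper's Keras optimizer code, and the default hyperparameter values shown are the commonly used ones.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update step (sketch).

    Maintains exponential moving averages of the gradient (m) and its
    elementwise square (v), with bias correction for early steps t.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-parameter scaling by the square root of v_hat is what gives each parameter its individual adaptive learning rate.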
An early stopping strategy is also used to avoid overfitting. When training large models, people often observe that the training loss decreases over time while the validation loss begins to rise again. This means that a model with better validation loss can be obtained by returning to the parameters from the epoch with the lowest validation loss, a strategy known as early stopping, which is probably the most commonly used form of regularization in deep learning due to its effectiveness and simplicity [41]. The algorithm stops when no progress has been made over the best recorded validation loss for some pre-specified number (or patience) of epochs.
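The patience logic described above can be sketched as follows; here the per-epoch validation losses are passed in as a plain sequence standing in for one validation evaluation per training epoch.

```python
def train_with_early_stopping(val_losses, patience=10):
    """Return (best_epoch, best_loss), stopping once no improvement
    over the best recorded validation loss has been seen for
    `patience` consecutive epochs (sketch of early stopping).
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has plateaued; stop training
    return best_epoch, best_loss
```

In Keras the equivalent behavior is provided by the EarlyStopping callback with its monitor and patience arguments.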
In this paper, the batch size for gradient descent is 5000, and the number of training epochs is 50. The patience of early stopping is 10 epochs. The training, validation, and testing losses are computed on standardized data and have a magnitude of 10⁻⁴. Table 2 shows the training process of regression models with different depths. Before considering the effects of depth and width, we trained residual regression models with different widths many times and found that the testing loss is small when the width is fixed at 20. Hence, by experience, the width of every hidden layer is initially fixed at 20. From Table 2, we can see that when the depth is less than 100, testing losses range from 3.4855 × 10⁻⁴ to 5.3401 × 10⁻⁴; they have the same magnitude and change little. However, when the depth is equal to or beyond 100, the testing and validation losses increase greatly, with testing losses varying from 1.8493 × 10⁻³ to 2.4995 × 10⁻¹. Thus, the optimal depth of the residual regression model is approximately 28, since that model has the minimum testing loss, i.e., 3.4855 × 10⁻⁴. Moreover, a deeper neural network does not necessarily approximate better. This is because neural networks use the back propagation (BP) algorithm to minimize their loss functions, and very deep neural networks are difficult to optimize.
Based on the optimal depth, we consider the effect of width. Table 3 shows the training information of residual regression models with different widths; the depth is fixed at the optimal value, i.e., 28. From Table 3, the optimal width is approximately 16, since that model has the minimum testing loss. When the width changes from 8 to 700, testing losses range from 2.1360 × 10⁻⁴ to 8.1343 × 10⁻⁴, all of the same magnitude 10⁻⁴ and with small variations. Nevertheless, when the width is small (less than 4), the testing and validation losses are greater than 1 × 10⁻³. Hence, wide neural networks have stronger approximation capacities than very narrow ones, because residual regression models with small widths are too simple to approximate complex nonlinear mappings. Therefore, it is not recommended to set the residual regression model to a great depth (more than 100) or a small width (less than 4).

Table 4 shows the training information of the optimal regression model on simulated nonlinear data, with the depth fixed at 28 and the width at 16. Preprocessing of the data is the same as before. From Table 4, one can see that the maximum and minimum testing losses of the optimal regression model are 4.2117 × 10⁻⁴ and 2.550 × 10⁻⁵, respectively. This indicates that the optimal model has small testing losses on these nonlinear datasets. Figure 5 visually compares the regression results of the optimal model with the real values of the simulated datasets; for each dataset, 200 samples and the corresponding predictions are plotted. From Table 4 and Figure 5, we can see that the optimal model behaves well on the simulated nonlinear datasets.

Comparisons with Other Approximation Techniques
In this section, we compare the optimal residual model with other linear and nonlinear approximation techniques. The linear techniques include linear regression, ridge regression, lasso regression, and elastic net regression, which combines ridge and lasso regression. The nonlinear techniques include a usual artificial neural network (ANN), a decision tree, and a support vector regression (SVR) machine. The ANN has the same architecture (the same depth and width) as the optimal residual regression model, except for the shortcut connections mentioned in Figure 2, and is modeled in Keras using TensorFlow as the backend. The other regression models are built with the machine learning package scikit-learn in Python [42]. A total of 10,000,000 samples generated by Equation (4) are used here. The ANN is trained on 6,750,000 samples and validated on 750,000 samples; the remaining 2,500,000 samples are used as testing data. The epoch number and early stopping patience for the ANN are 50 and 10, respectively. The other models are trained on 7,500,000 samples and tested on the same 2,500,000 samples. Table 5 shows the comparison results, including training time, validation loss, and testing loss. Every approximation technique except linear regression has hyperparameters, which cannot be learned through training and must be set in advance. Hence, to get a good approximation, we use the grid search method to find the optimal values. Table 6 shows the information about the hyperparameters, including the name, the range, and the corresponding optimal value of each. NA in Tables 5 and 6 stands for Not Applicable.
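The grid search over a penalty parameter can be illustrated with ridge regression, one of the baselines above. The sketch below uses the closed-form ridge solution on synthetic data rather than the paper's datasets or its scikit-learn pipeline; the variable names and the toy data are illustrative only.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def grid_search_alpha(X_tr, y_tr, X_val, y_val, alphas):
    """Pick the ridge penalty alpha with the smallest validation MSE."""
    best_alpha, best_mse = None, float("inf")
    for alpha in alphas:
        w = ridge_fit(X_tr, y_tr, alpha)
        mse = float(np.mean((X_val @ w - y_val) ** 2))
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha, best_mse
```

In scikit-learn the same search is typically done with GridSearchCV, which additionally cross-validates each candidate value instead of using a single validation split.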

Table 6 (excerpt). Hyperparameter, search range, and optimal value for each model:
- Linear regression: NA; NA; NA
- Ridge regression: penalty parameter α of the L2 norm; 10⁻¹⁰, 10⁻⁹, ..., 10⁹, 10¹⁰; optimal value 10²
- Lasso regression: penalty parameter α of the L1 norm; …

From Table 5, one can observe that the testing losses of the linear models listed at the top are approximately 3.72 × 10⁻², which is much greater than the testing losses of the nonlinear approximation techniques. This indicates that these linear models are not appropriate for nonlinear approximation. It is notable that the support vector regression (SVR) machine with the radial basis function (RBF) kernel has the maximum training time, close to 44 h, and the second greatest testing loss (1.2676 × 10⁻²). The testing loss of SVR is also greater than that of the residual regression, the ANN without shortcuts, and the decision tree. These drawbacks limit further applications of SVR in practice. It is also worth mentioning that the testing loss of decision tree regression is 4.58 times as great as that of the residual regression model, which implies that the residual regression model has a better approximation capacity than the decision tree. In addition, the influence of shortcut connections is investigated. The ANN without shortcuts has the same depth and width as the optimal residual model and has the second smallest testing loss; it is trained for nearly 27 min and stops early at 34 epochs. The optimal residual model, by contrast, is trained for approximately 23 min through 50 epochs. This means that the residual regression model is more efficient and approximates nonlinear functions better.

Application of Residual Regression Model on Climate Data
In this paper, we employ the residual regression model to approximate relative humidity. Relative humidity is defined as the ratio of the water vapor pressure to the saturated water vapor pressure at a given temperature. It is a key factor affecting cloud microphysics and dynamics, and it plays an important role in climate [43]. The formation of cloud condensation nuclei requires water vapor to be supersaturated in the air. However, there is currently no widely accepted and reliable method to measure the supersaturated vapor pressure accurately [44], which means that the relative humidity is not accurate under supersaturation circumstances. Therefore, finding the nonlinear relationship between relative humidity and other factors is meaningful.
Our training data are from the ERA5 hourly reanalysis datasets on the 1000 hPa pressure level [45]. Pressure, temperature, and specific humidity are used as input features. ERA5 is the fifth-generation ECMWF (European Centre for Medium-Range Weather Forecasts) atmospheric reanalysis of the global climate, and it provides hourly outputs at a spatial resolution of 0.25°. The training data span 00:00:00 to 23:00:00 on September 1, 2007, giving 24,917,709 samples in total. The optimal residual regression model is trained on 20,183,344 samples and validated on 2,242,594 samples; the remaining 2,491,771 samples are testing data. The other parameters are the same as before. A total of 200 observations and the corresponding predicted data are plotted in Figure 6. The relative error on the testing data for relative humidity is 9%. We randomly replace the testing dataset 10 times, and the averaged relative error is still 9%. This verifies that the residual regression model is stable and applicable in practice.
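The paper does not give the exact formula for the relative error it reports; a common definition, and the one sketched here as an assumption, is the mean of |prediction − observation| / |observation| over the test set.

```python
import numpy as np

def mean_relative_error(y_true, y_pred):
    # Average relative deviation of predictions from observations:
    # mean(|y_pred - y_true| / |y_true|). Assumes y_true has no zeros,
    # which holds for relative humidity values on this pressure level.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))
```

A value of 0.09 from this function corresponds to the 9% relative error quoted above.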


Figure 6. The prediction of the regression model for relative humidity. The red cross symbol (×) denotes predicted data, and the green circle symbol (○) denotes observations from ERA5.

Conclusions
In this paper, we develop deep residual regression models for nonlinear regression. Traditional deep residual learning behaves well in image processing due to its local convolution kernels and deep neural networks. However, the convolution kernels have no effect on the whole input sequence, so the original architecture is not suitable for the regression of nonlinear functions. We replace the convolutional layers and pooling layers with fully connected layers so that deep residual learning can be applied to nonlinear regression. The residual regression model is carefully and numerically evaluated on simulated nonlinear data, and the results show that the improved regression model works well. It is recommended to avoid setting the residual regression model to a great depth or a small width, since it has a great testing loss under these circumstances. In addition, we compare the residual regression model with other linear and nonlinear approximation techniques. It turns out that the optimal residual regression model has a better approximation capacity than the others. Finally, the residual regression model is applied to the prediction of relative humidity, and we obtain a low relative error, which indicates that the residual regression model is stable and applicable in practice. In the future, we intend to apply the residual regression model to large eddy simulation (LES) datasets of turbulence to improve the subgrid-scale parameterizations of LES.

Patents
No patents resulted from the work reported in this manuscript.