Research and Study of the Hybrid Algorithms Based on the Collective Behavior of Fish Schools and Classical Optimization Methods

: Inspired by biological systems, swarm intelligence algorithms are widely used to solve multimodal optimization problems. In this study, we consider the hybridization problem of an algorithm based on the collective behavior of fish schools. The algorithm is computationally inexpensive compared to other population-based algorithms. Accuracy of fish school search increases with the increase of predefined iteration count, but this also affects computation time required to find a suboptimal solution. We propose two hybrid approaches, intending to improve the evolutionary-inspired algorithm accuracy by using classical optimization methods, such as gradient descent and Newton's optimization method. The study shows the effectiveness of the proposed hybrid algorithms, and the strong advantage of the hybrid algorithm based on fish school search and gradient descent. We provide a solution for the linearly inseparable exclusive disjunction problem using the developed algorithm and a perceptron with one hidden layer. To demonstrate the effectiveness of the algorithms, we visualize high dimensional loss surfaces near global extreme points. In addition, we apply the distributed version of the most effective hybrid algorithm to the hyperparameter optimization problem of a neural network.


Introduction
Optimization problems arise in various modern economic sectors, including engineering [1], chemistry [2], economics [3], operations research and computer science. The recent research in artificial intelligence expanded the variety of optimization problems. Indeed, when a neural network is being trained, its loss function is being optimized in respect to the weights of the connections among neural network layers [4]; when a data clustering algorithm like k-means is evaluated, an optimization algorithm minimizes the distance between points in respect to the locations of centroids [5]. Neural network hyperparameter optimization is another promising research area.
In mathematics, optimization denotes the selection of the best element ⃗ ∈ from some set of available alternatives . The optimization process consists of maximizing or minimizing a real function : → ℝ by systematically choosing ⃗ ∈ and computing the value of ( ⃗). The goal of the maximization process is to find such ⃗ , that ∀ ⃗ ∈ : ( ⃗ ) ≥ ( ⃗) . The goal of the dual minimization process is to find such ⃗ , that ∀ ⃗ ∈ : ( ⃗ ) ≤ ( ⃗).
Evolutionary algorithms are a family of algorithms inspired by biological evolution. These algorithms are also known as population-based, due to the fact that they process a variety of solutions to an optimization problem at a time. Many kinds of population-based algorithms were introduced, including genetic algorithms [6], swarm intelligence algorithms [7], ant colony algorithm [8], bee swarm algorithm [9], fish school search [10] and others. Modifications and hybrid versions of these algorithms exist [11], applied to solve practical problems. Such problems include parameter optimization of models based on support vector machine (SVM) algorithms [12] or random forests [13], neural network architecture optimization using evolutionary-inspired algorithms [14].
In this paper, we consider the hybridization problem of the evolutionary-inspired algorithm based on the collective behavior of fish schools, invented by Bastos Filho et al. in [10] and known as Fish School Search (FSS). Variations of FSS exist, intending to improve the performance of the original algorithm on multi-plateau functions [15]. In 2017, Bastos Filho et al. proposed a multi-objective version of FSS [16].
Recently, many advanced population-based optimization algorithms have been proposed, such as particle swarm optimization [17], memetic computing [18], genetic algorithms [6], differential evolution [19] and others. The original FSS algorithm showed superior results in comparison to particle swarm optimization in [20]. Moreover, FSS and its modifications outperformed genetic and differential evolution algorithms in image reconstruction of electrical impedance tomography in [21]. Other known FSS applications include its use in intellectual assistant systems [22], in solving assembly line balancing problems [23] and in multi-layer perceptron training for mobility prediction [24], where the application of FSS allowed to obtain more accurate results compared to back propagation and bee swarm algorithm. The aims of the evolutionary-inspired FSS algorithm include the minimization of time required to find the suboptimal solution and the elimination of the premature convergence problem, inherent to many population-based algorithms. FSS is a relatively lightweight algorithm, compared to other swarm intelligence algorithms.
However, the results produced by evolutionary computation can be less accurate compared to the results that can be obtained from classical optimization methods; and classical optimization methods have a higher chance to converge to a locally optimal solution. Due to the fact that population-based algorithms process a variety of solutions at a time, by starting from different points in the search space of a given fitness function, there is a higher chance for such algorithms to converge closer to a globally optimal solution, especially when the search space is multimodal. Hence, in global optimization tasks, when high solution accuracy matters, it is reasonable to take the best from both worlds, by first applying a global search metaheuristic and then using a local search method to converge to the closest optimum.
The hybridization idea of evolutionary-inspired algorithms and classical optimization methods is not new. A similar approach was proposed by Requena-Pérez et al. in 2006 for accurate inverse permittivity measurement of arbitrarily shaped materials [25]. In the proposed hybrid approach, the best solution obtained by a genetic algorithm was used as a starting point for gradient descent. The results demonstrate that the hybrid algorithm requires fewer iterations to obtain a similar accuracy, compared to the original genetic and gradient descent algorithms used separately, applied to the calculations of complex permittivity of materials. Using a similar strategy, Ganjefar et al. in [26] achieved better accuracy and faster convergence speed when training a qubit neural network. Another similar hybrid algorithm was proposed by Reddy et al. in [27], which involved posthybridization of the enhanced bat algorithm with gradient-based local search. The algorithm proved its superiority over other bat algorithms in dealing with multidimensional test functions for optimization. Coelho and Mariani proposed the use of the population-based particle swarm optimization algorithm with a classical Quasi-Newton local search method [28]. In the proposed technique, the best solution obtained by swarm intelligence was used as a starting point for the Broyden-Fletcher-Goldfarb-Shanno optimization method. This paper extends the research presented earlier by Liliya A. Demidova and Artyom V. Gorchakov [4], where we proposed a hybrid algorithm based on FSS and classical optimization methods, such as the Newton's method in optimization and gradient ascent. In this paper, we provide additional benchmarks on test functions for optimization and the results of the non-parametric statistical hypothesis Wilcoxon signed-rank test, to demonstrate the effectiveness of the proposed techniques. The test was used to determine if significant statistical differences exist among the original algorithm and the considered hybrids. We also describe the applications of the most effective hybrid algorithm, including loss function optimization during neural network training, and hyperparameter optimization of neural networks. Evolutionary-inspired algorithms can be easily distributed across multiple computational nodes using various distribution strategies [29], such as master-slave model or island model. We provide the solution of the hyperparameter optimization problem of a neural network, by computing fitness function values in distributed mode, to decrease the time required to find the suboptimal solution.
However, the proposed algorithms based on FSS and classical optimization techniques impose a limitation-the methods cannot optimize non-differentiable numerical functions. Hence, the discussed techniques cannot be used for neural network training wherein the derivative of the loss function does not exist. The computational complexity of FSS depends on the considered fitness function, the dimensionality of the search space, predefined iteration count and population size. In the proposed hybrid algorithms, the impact on the time complexity of the incorporated classical methods is relatively small, compared to the impact of the evolutionary-inspired algorithm. In Section 3 of this work, alongside with accuracy comparison, we compare the time required for the hybrid algorithms to converge. The results of the numerical experiment confirm, that the accuracy of the solutions obtained by the discussed hybrid algorithms improve at the cost of minor time losses.

Materials and Methods
The objective of any optimization technique is to find the optimal or approximately optimal solution. The core idea of FSS is to make the agents in a population move towards the positive gradient of the optimized function in order to gain weight; agents with larger weight have a greater impact on the population behavior. On each iteration of the algorithm, one feeding and three movement operators are applied sequentially, and then the best agent ⃗ is chosen, which is considered as the suboptimal solution until a new agent is found with better fitness function value. First, the individual movement operator is applied, given by: where, ⃗ is the vector containing random real numbers, uniformly distributed on [−1,1] . The variable denotes the maximum displacement of an agent, ⃗ , is the position of -th agent on -th iteration. The new position is accepted only in case if ⃗ , > ( ⃗ , ). To perform the check, the delta value is computed for each agent, according to the following: where, is the fitness function. After the individual movement step, the feeding operator is applied to the whole population. The scalar weight value ⃗ , is associated with each ⃗ , agent. The weights of all agents are adjusted on every iteration according to the following formula: As a result, the weights of agents with greater fitness function improvements increase more, compared to other agents. On the next step, the collective-instinctive movement operator is applied to every agent in the population. The position delta value is computed for each agent. The collectiveinstinctive movement occurs according to the following formula: Then, the barycenter vector ⃗ , required for the collective-volitive movement step, is computed: The collective-volitive movement operator is given by: The ⃗ vector contains real numbers, uniformly distributed on the [−1, 1] interval. The variable denotes the collective-volitive movement step size. If the total weight of the population has increased since the last iteration, the '−' sign is used in Equation (6); this means that the agents are attracted to the barycenter of the population. Otherwise we use the '+' sign in Equation (6); the agents are spread away from the population barycenter.
On each iteration of FSS, the variables and decay linearly. If we are maximizing a function , then on each iteration the best agent ⃗ , is chosen from the ℙ population, that ∀ ⃗ ∈ ℙ : ⃗ , ≥ ( ⃗ ). The algorithm stops when the predefined iteration count is reached. The best agent found on the last iteration is assumed as the optimal solution of the maximized function .
The FSS algorithm shows accurate results in multimodal optimization because of its populationbased nature. The algorithm is computationally inexpensive compared to other swarm intelligence algorithms, does not require storing or computing large matrices or finding solutions to complicated equations. However, due to the heuristic nature of the algorithm, the estimated optimum can sometimes not be as accurate, as the optimum that classical optimization algorithms can find. The drawback of the classical gradient-based or Hessian-based algorithms is that they often converge to local minima or maxima, although various techniques exist, that can prevent the premature convergence of such algorithms. To take the best from both worlds, we propose the combination of the evolutionary-inspired FSS algorithm and the two classical optimization methods.
The first proposed algorithm consists of the population-based FSS described above in detail and of the Newton's method in optimization. The latter classical optimization algorithm is based on the Newton's method for finding roots of a differentiable function by solving the ( ⃗) = 0 equation. Such solutions are the stationary points of the considered function , which can either be minimum, maximum or saddle points [30]. In general, the iterative scheme for the Newton's classical optimization method is given by: where, is the iteration number, ⃗ vector represents the solution on the -th iteration, [ ( ⃗ )] denotes the inversion of the Hessian matrix, ∇ ( ⃗ ) represents the gradient of the function at the given point ⃗ . The derivatives can be computed either symbolically or numerically, using finite difference approximation [31]. The iterative optimization process continues until the specified iteration limit is not achieved, or while , − , > , where , represents the -th component of the ⃗ vector on iteration .
In the proposed hybrid approach, we run iterations of the FSS algorithm first, and then, using the discovered ⃗ solution as the starting point, we apply the Newton's method in optimization. The Newton's method converges to the closest to ⃗ stationary point. Notably, in some cases the Newton's method is not applicable, for example, when the Hessian matrix is singular, and, consequently, not invertible.
The second proposed algorithm is based on gradient descent methods. Gradient descent is an iterative optimization technique for finding the minimum of a function, and gradient ascent is a technique for solving the dual maximization problem. To minimize a given function , one takes steps proportional to the negative of the gradient ∇ ; to maximize a given function, one takes steps proportional to the positive of the gradient: Gradient descent formula is given by Equation (8) with the '−' sign used, gradient ascent formula is given by Equation (8) with the '+' sign used. The ⃗ vector represents the solution on the -th iteration, is the learning rate, ∇ ( ⃗ ) is the gradient of the optimized function at the point ⃗.
Convergence to the local extreme point is guaranteed, when the learning rate is adjusted on each iteration according to: In the proposed hybrid algorithm, we first run iterations of the evolutionary-inspired FSS algorithm. Then, we apply the gradient ascent algorithm, starting from the ⃗ solution obtained on the previous step. The population-based algorithm is expected to find the most promising region of many, and the classical optimization algorithm will improve the resulting solution.

Numerical Experiment
The original FSS algorithm and the hybrids were implemented using the interpreted Python programming language and such libraries, as SciPy and NumPy. First-order and second-order partial derivatives, required by gradient ascent and Newton's method, respectively, were computed using finite difference approximation.
We benchmarked the described hybrid algorithms on multidimensional test functions for optimization, such as the Sphere [32] function , the Styblinsky-Tang function [33] , the Rosenbrock [34] function and the Rastrigin [35] function . We performed the tests using 3, 5 and 10 dimensions. Additionally, we considered such two-dimensional test functions, as Ackley [32], Matyas [36], Eggholder [37] and Booth [36] functions, defined in Table 1 as , , and , respectively. The Michalewicz [32] function was tested with 2, 5 and 10 dimensions. The cause of this is that according to [32] the approximate optimum values for the Michalewicz function are well known only for these dimensions.
The functions from Table 1 were designed to be minimized. However, we solved the dual maximization problem in our experiments, so the functions were multiplied by −1 and then maximized. We spawned 100 agents and ran = 50 iterations multiple times for each of the functions, except for the Eggholder. Due to the fact, that the Eggholder function has many local extreme points in its search area, as shown in Figure 1a, we used a population of 512 agents there. The evolutionary-inspired algorithm was preliminary approbated on the two-dimensional Styblinsky-Tang function , the optimization process on 5-th iteration of 25 total is illustrated in Figure 1b.   For each of the test runs, the individual step size was set to be equal to the radius of the search area, and the collective-volitive step size was defined as = 0.5 × . We used = 5 × 10 initial step size for gradient ascent, and = 10 desired accuracy for both of the two classical optimization algorithms. Iteration limit for each of the classical methods was set to 25. The benchmarks were evaluated on a PC running Ubuntu 18.04 operating system, inside a Docker container with Python, Anaconda, Keras and Jupiter Notebooks software installed. The PC had Intel® Core™ i7-4770 CPU installed with 8 virtual cores and 3.40 GHz processor clock frequency, 16GB of RAM and an HDD hard drive.
Each of the test functions listed in Table 1 was evaluated 50 times to determine how accurate the algorithms are, compared to each other. We gathered such metrics, as the best function value, mean function value, variance and standard deviation. Here, the best value denotes the closest value to the optimum of the considered fitness function. The results are shown in Table 2 and in Table 3, where the best result in each row is highlighted in bold. We use the FSSN abbreviation for the hybrid algorithm, which is based on fish school search and Newton's optimization method. The FSSGD abbreviation denotes the hybrid algorithm where gradient ascent is used. Benchmark results for twodimensional, three-dimensional and ten-dimensional versions of multidimensional functions , , , and were deliberately omitted for brevity. The omitted results are similar to those shown in Table 2 and in Table 3.  From the obtained results, shown in Table 2 and in Table 3, we see that the hybrid algorithms generally perform better than the original algorithm. To demonstrate the convergence process of the hybrid algorithms, we obtained the plots shown in Figure 2. As shown in Figure 2, the classical algorithms can either slightly improve the obtained solution, as shown in (a) and (d) plots, or considerably, as shown in (b) and (e) plots. The plots (c) and (f) show that the FSSN hybrid can diverge on multimodal optimization problems that can be solved fine by FSSGD. The FSSN hybrid algorithm based on Newton's method diverges in a multimodal search space in the direction towards saddle points, local minima and local maxima. The hybrid based on the collective behavior of fish schools and gradient ascent successfully converged to the optimal solution.
The hybrid algorithm based on fish school search and Newton's method in optimization can show accurate results, but has limited applications. For example, the Newton's algorithm is not applicable to a function which Hessian matrix is singular, because a singular matrix is not invertible. Another limitation is the possible convergence to a wrong solution when optimizing a multimodal function. The method can converge, for example, to a saddle point. The use of the hybrid algorithm based on the collective behavior of fish schools and gradient ascent does not impose such limitations. This hybrid algorithm can be used in cases, when a function is differentiable inside the study region, by computing the derivative either numerically, by using finite difference approximations, or symbolically.
Additionally, we benchmarked the time required for the original and hybrid algorithms to optimize the functions listed in Table 1. The results are shown in Table 4 and Table 5, where the best result in each row is highlighted in bold.   Table 4 and Table 5 we can conclude that the hybrid algorithms usually run a bit longer than the evolutionary-inspired algorithm without modifications. However, according to Table 2 and  Table 3, in most cases the hybrids find solutions that are more accurate. The FSSGD algorithm is generally faster than the FSSN algorithm. The cause of this is that the hybrid algorithm based on Newton's method computes the Hessian matrix on each iteration, using finite difference approximation, and such computations can be expensive, especially in high-dimensional space.
To verify, that the hybrid algorithms outperform the original algorithm, we used the Wilcoxon signed-ranks statistical test, often used to determine if significant statistical differences exist between two algorithms in computational intelligence. The results are shown in Table 6.
In our case, we ensured that the significant differences exist between FSS and FSSN, and between FSS and FSSGD. We assumed the level of significance = 5 × 10 , which means the 95% confidence level. The null hypothesis is that the two related paired samples come from the same distribution. For each test function listed in Table 1, we ran the original algorithm and hybrids 20 times. Then, we applied the Wilcoxon signed-ranks statistical test to the obtained distributions of solutions. Based on the results shown in Table 6, we observed, that both of the hybrid algorithms can improve the accuracy of the original population-based algorithm, and in most cases, a statistically significant difference exists between the distributions the considered algorithms produce. The FSSN hybrid, which is based on Newton's optimization method, outperforms the FSSGD hybrid based on gradient ascent, however in some cases Newton's method converges to a stationary point or is not applicable to a function due to the singularity of its Hessian matrix in the given point. The FSSGD hybrid based on gradient ascent is more stable and shows more accurate results, than the original FSS algorithm; and can be used in cases, when the fitness function is differentiable in the study region.

Solution for the Linearly Inseparable XOR Problem
A neural network can be described as a programmatic system which allows making decisions using the evolution of a complex non-linear system. Such decisions are based on recognized underlying relationships in datasets. Neural networks take in a vector encoding the input object; the output signal of a neural network encodes the decision made by the system [38]. A perceptron is a mathematical model, proposed by Frank Rosenblatt in 1957; it can be treated as a simple neural network used to classify the data into two classes. Neural networks are often trained using back propagation, however, evolutionary algorithms show promising results as well [39][40][41]. As shown in [26], a hybrid algorithm which incorporates evolutionary-inspired and classical optimization techniques can achieve faster convergence speed and better accuracy.
The FSSGD, as the most effective hybrid algorithm proposed in this paper, can be used to optimize loss functions of neural networks, in cases, when the loss function of a perceptron is differentiable inside the study region. As a proof-of-concept, we used FSSGD as the loss function optimization algorithm to train a multi-layer perceptron. We were solving the linearly inseparable exclusive disjunction (XOR) problem [42], commonly used to benchmark neural network optimization algorithms [43]. In the problem we consider, we have the set of input objects, and the set of answers. Both the objects and the answers are Boolean values, belonging to the {0, 1} set. The goal is to restore the function * : → using a multi-layer perceptron. The binary XOR function can be described by Table 7.  To solve the formulated problem, we built a perceptron with one hidden layer consisting of eight neurons. Other neural network architectures exist, which can solve the XOR problem [43,44], but the goal of the experiment was to see how the FSSGD algorithm would deal with a plenty of conflicting solutions. Perceptron architecture is shown in Figure 3. On the hidden layer, we used the hyperbolic tangent activation function, given by: where, is the input signal multiplied by the weight value of a neuron. While the set of decisions of our neural network should be represented as the Boolean values, we used the sigmoid activation function on the output layer. The sigmoid activation function is given by: We used the binary cross-entropy as the loss function for the multi-layer perceptron: ( , , ⃗) = − 1 log ( ⃗ , ⃗) + (1 − ) log 1 − ( ⃗ , ⃗) . (12) where, and are the set of objects and the set of answers, respectively; ⃗ denotes the weights; is the -th answer, ∈ ; ⃗ is the -th object, ⃗ ∈ ; = | | = | |.
The training process of a neural network leads to the optimization of its loss function. In our case, we optimized the Equation (12) loss function, using the hybrid FSSGD algorithm, in respect to the weights vector ⃗. After the neural network loss function was optimized, we plotted the surface of the loss function using the method [45], which allows visualizing functions that live in very highdimensional spaces. In this approach, one chooses two orthogonal direction vectors ⃗ and ⃗. The loss function is transformed into the following function parameterized by two arguments: where, and are the objects and answers, respectively, ⃗ is the optimized weights vector, and are the scalar coefficients required to visualize the function in the three-dimensional space. The loss function was optimized twice, using different parameter sets of the hybrid algorithm. The obtained results are shown in Figure 4.  4,4], other parameters were left the same. When using = 10, it took 22 iterations for FSSGD to converge. When using = 4, it took 30 iterations for the hybrid to converge. When using = 10, the neural network output is more accurate, compared to the results obtained when = 4 . The same pseudorandom number generator seed was used in both experiments. Assuming the fact, that the obtained solutions differ, we can suppose, that if we increase the value, then we can get other, probably better, solutions. Notably, in Figure 4b,c,e,f, we see that the surface of the neural network loss function has a plenty of local minima points that can prevent the classical optimization algorithms from finding the global optimum. In Figure 4d we see, how finite difference-based gradient descent improves the solution obtained using Fish School Search.

Neural Network Hyperparameter Optimization
Modern neural networks achieve excellent performance in a wide variety of fields, but accuracy of predictions made by a neural network highly depends on chosen hyperparameters. The hyperparameters are the parameters usually adjusted by a researcher; such parameters include layer count in a neural network, neuron count on each layer, optimizer and learning rate, activation functions. Evolutionary-inspired algorithms are commonly used for neural network architecture optimization; neural network topologies can be evolved, for example, by using genetic algorithms [46] or particle swarm algorithms [47]. Training neural networks takes time, especially in deep learning. The benefit of using population-based algorithms is that such algorithms can be distributed across different worker nodes easily, and that can drastically reduce the amount of time required to run a single iteration of an optimization algorithm.
We applied the FSSGD algorithm to neural network hyperparameter optimization, evolving the architecture of a neural network for the prediction of house prices. Different neural networks were trained on the well-known Boston Housing dataset [48]. The Boston Housing dataset is relatively small, consisting of 506 rows and 14 columns. Each column represents the factor, which can potentially affect the price of a house. The dataset is split into training and testing data frames, the former consisted of 404 rows, and the latter consisted of 102 rows. We applied the feature normalization technique to the dataset by subtracting the mean value from the data and diving the data by standard deviation.
Each of the considered neural network architectures consisted of two layers. The optimized parameters were defined as follows: denoted neuron count on the first layer, ∈ [10, 90], ∈ ℤ; denoted neuron count on the second layer, ∈ [10, 90], ∈ ℤ; denoted the activation function used on the first layer, ∈ { , ℎ, , }; denoted the activation function used on the second layer, ∈ { , ℎ, , }; denoted the learning rate; denoted the optimizer used to train the neural network, ∈ { , , }. Neural network models were built with Keras and Tensorflow libraries. During every call to the fitness function of FSSGD, a parameterized model was trained during 50 epochs. The k-fold cross-validation technique was used to obtain the score of the model, assuming = 3. During the training process of different neural networks, the mean squared error loss function was used: Assuming and are the sets of objects and answers, respectively, ⃗ is the -th object, ⃗ ∈ ; is the -th answer, ∈ ; ( ⃗ ) denotes the predictions generated by the neural network; = | | = | |. Due to the use of the k-fold cross-validation technique, the training dataset was split into three data frames, then two of the three data frames were used to train the neural network; the one data frame left was used to test the network. During the training phase, the value was set equal to 270; during the testing phase, the value was set to 134. The mean absolute error function was used to estimate the fitness score of the model: The reason for using the mean absolute error metric seen in Equation (15) is that it produces results that can be easily interpreted by a human. In the Boston Housing dataset, we were solving a linear regression problem to predict house prices, so the mean absolute error metric gives us the error value in thousands of dollars.
To handle the experiment, five Ubuntu 18.04 LTS nodes were rented on Microsoft Azure. For the FSS part of the proposed FSSGD algorithm, four agents were spawned. The FSS algorithm was running on the server node; fitness function values were computed on each iteration of the algorithm, the computations were distributed across four worker nodes. For the FSS part of the hybrid algorithm, we assumed = 8, = 4, = 2. On each fitness function call, separate neural network models were built and trained on different worker nodes multiple times, using k-fold cross-validation, to estimate the quality of a model according to Equation (15). The distributed computations architecture of the FSS algorithm is illustrated in Figure 5a.
For the gradient descent part of the FSSGD hybrid algorithm, we used finite difference approximation to compute first-order partial derivatives for the gradient vector numerically on each iteration; the computations were distributed across four worker nodes as well, assuming = 1; negative and positive derivative values were rounded down and up, respectively. Each worker node computed its own first-order partial derivative. The convergence of the hybrid algorithm optimizing neural network architecture is shown in Figure 5b. Here, ⃗ denotes theth agent; is the time-consuming fitness function; is the -th worker; is the server, which hosts the algorithm; (b) FSSGD algorithm convergence when optimizing hyperparameters.
The first layer of the evolved neural network model consisted of 58 neurons; the second layer consisted of 78 neurons; the softmax and tanh activation functions were used on the first and second layers, respectively. The optimizer selected by FSSGD was RMSprop with 10 learning rate. The mean absolute error Equation (15) metric value obtained on the testing data frame was equal to 2.43, this is equivalent to 2430 USA dollars. The minimum price of a house in the Boston Housing dataset is 5000, the maximum price equals to 50,000 USA dollars. It took 1097 seconds total for the FSSGD algorithm to optimize the architecture in distributed mode, using four worker nodes, as shown in Figure 5a; 560 seconds for 8 iterations of FSS and 537 seconds for 2 iterations of gradient descent; the gradient descent computations were based on finite difference approximation of the fitness function. Without using the distributed approach to fitness function computations, it took more than an hour for the hybrid algorithm to converge.
The obtained results confirm that the hybrid algorithm based on the collective behavior of fish schools and gradient descent can produce slightly more accurate results, than the original population-based algorithm. However, when optimizing hyperparameters of neural networks the time losses required to compute the gradient vector numerically are too big, due to a large number of fitness function calls required to obtain the vector consisting of first-order partial derivatives, even in case if computations are distributed across multiple nodes. With the increase of search space dimensionality, the time required to compute the gradient of a fitness function increases, and this makes this technique not suitable for hyperparameter optimization of deep neural networks.

Discussion
The benchmarks of the proposed FSSN and FSSGD hybrid algorithms, which are based on the collective behavior of fish schools and classical optimization methods, on test functions for optimization, including the multidimensional Rastrigin, Rosenbrock, Sphere, Ackley, Michalewicz, Eggholder, Styblinsky-Tang function and others, indicate, that the hybrid algorithms generally produce more accurate solutions, and the Wilcoxon signed-rank test confirms that. The improvement in accuracy costs minor time losses, but the time losses increase with the increase of search space dimensionality.
The FSSN algorithm imposes a number of limitations upon the optimized function and has a chance to converge to the nearest saddle point, or to the wrong extremum. The FSSN algorithm can often produce solutions that are more accurate compared to FSS or FSSGD. However, the computations of the Hessian matrix can be quite expensive, especially in high-dimensional search space, or in cases, when the evaluation of a fitness function takes a long time to complete. In addition, there are such cases, when the Hessian matrix of the optimized function becomes singular. The Newton's optimization method requires the Hessian matrix to be invertible, so in case if the matrix is singular, the method is unable to proceed with the optimization process. The FSSGD algorithm only requires the fitness function to be differentiable inside the study region. Gradient vector computations using finite difference approximation are far less expensive than Hessian matrix computations, so the time losses introduced by performing local search by gradient ascent are quite small, as shown in Table 4 and in Table 5. Except from using numerical differentiation, the gradient of a fitness function can be obtained by either using symbolical or automatic differentiation, and in this case, the time losses become even smaller.
The proposed FSSGD algorithm can be applied to such practical tasks, as optimization of loss functions in neural networks or SVM classifiers. When training neural networks, the full gradient descent method can be replaced with its modifications, commonly used to train neural networks, to speed up the optimization process.
The FSSGD algorithm can be applied to hyperparameter tuning of a neural network as well, but in this case, the time losses become unaffordable due to the fact, that the derivative of a fitness function cannot be found analytically or symbolically here. The only option left is numerical differentiation, which takes a large amount of time, especially in high-dimensional spaces and in cases, when the fitness function is either computationally expensive or takes a long time to execute due to other reasons. Hence, for neural network hyperparameter optimization it is better to use the original evolutionary-inspired FSS algorithm. In evolutionary-inspired algorithms, fitness function computations can be easily distributed across different worker nodes, and this can considerably speed up the optimization process.
Future research could compare various hybrid algorithms based on FSS and Quasi-Newton methods. Notably, a hybrid algorithm based on particle swarm optimization and the Broyden-Fletcher-Goldfarb-Shanno optimization method showed a strong advantage in [28]. Considering loss function optimization in neural networks, future work could compare the proposed FSSGD algorithm with other gradient-based methods, commonly used to train neural networks, such as Adam or stochastic gradient descent with momentum [49]. Another promising research area is chaos theory in evolutionary optimization. The initial locations of agents in swarm intelligence algorithms affect the quality of the obtained solution, so a chaos-based pseudorandom number generator could be used to initialize the population in an evolutionary-inspired algorithm [50].