A Hybrid Forward–Backward Algorithm and Its Optimization Application

Abstract: In this paper, we study a hybrid forward–backward algorithm for sparse reconstruction. The algorithm combines descent, splitting, and inertial ideas. Under suitable conditions on the algorithm parameters, we establish a strong convergence theorem in the framework of Hilbert spaces. Numerical experiments are also provided to illustrate an application in the field of signal processing.


Introduction
Let H be a real Hilbert space; ⟨·, ·⟩ denotes the associated scalar product and ‖·‖ stands for the induced norm. Recall that a set-valued operator G : H → 2^H is said to be monotone iff ⟨x − x′, y − y′⟩ ≥ 0 for all y ∈ G(x), y′ ∈ G(x′), and maximally monotone iff, in addition, its graph, denoted by gph G := {(x, y) ∈ H × H | y ∈ G(x)}, is not properly contained in the graph of any other monotone operator on H.
A fundamental and classical problem is to find a zero of a maximally monotone operator G in a real Hilbert space H, namely

find x ∈ H such that 0 ∈ G(x). (1)
This problem includes, as special cases, bifunction equilibrium problems, convex–concave saddle-point problems, minimization problems, non-smooth variational inequalities, etc. Due to its diverse applications in economics, medicine, and engineering, techniques and methods for solving (1) have received much attention in the optimization community; see [1][2][3][4]. Indeed, the interdisciplinary nature of Problem (1) is evident from the viewpoint of algorithmic developments; see, for instance, [5][6][7][8] and the references therein. Among them, a celebrated method for solving (1) is the proximal point algorithm, which can be traced back to early results obtained in [6][7][8][9]. Given the current iterate p_n, it calculates the next iterate p_{n+1} via

p_{n+1} = (Id + γ_n G)^{-1} p_n, ∀n ∈ N, (2)

where Id stands for the identity operator and γ_n is a positive real number. The operator (Id + γ_n G)^{-1} is the so-called resolvent operator, which was introduced by Moreau in [8]. In the context of algorithms, the resolvent operator is often referred to as the backward operator. In [7], Rockafellar studied this operator and proved that the sequence generated by algorithm (2) converges weakly even under inexact evaluation of the resolvent operator in the framework of infinite-dimensional real Hilbert spaces. However, the evaluation error has to satisfy a summability restriction, which essentially requires that the resolvent operator be computed with increasing accuracy. Since the computation of the resolvent operator is often hard to control in practical situations, this requirement is somewhat limiting. Indeed, computing the resolvent of a maximally monotone operator is itself an inverse problem and is not easy in many cases, which limits the practicability of the proximal point algorithm in its plain form. A favorable situation occurs when the operator G can be written as the sum of two operators.
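As a concrete illustration, the proximal point update (2) can be sketched in a few lines. The toy operator G(x) = x and its closed-form resolvent below are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

def proximal_point(resolvent, p0, gammas):
    """Proximal point algorithm (2): p_{n+1} = (Id + gamma_n G)^{-1} p_n,
    where `resolvent(p, gamma)` evaluates the resolvent of G."""
    p = np.asarray(p0, dtype=float)
    for gamma in gammas:
        p = resolvent(p, gamma)
    return p

# Toy maximally monotone operator G(x) = x: its resolvent is p / (1 + gamma),
# and the unique zero of G is the origin, so the iterates shrink toward 0.
p = proximal_point(lambda p, g: p / (1 + g), [4.0], [1.0] * 30)
```

For operators whose resolvent has no closed form, this evaluation is itself an inverse problem, which is exactly the limitation discussed above.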
Precisely, let us define G = A + B such that the resolvent (Id + αA)^{-1} (implicit, backward step) and the evaluation of B (explicit, forward step) are much easier to compute than the full resolvent (Id + αG)^{-1}. In so doing, we consider the following inclusion problem: find a zero of an additively structured operator A + B acting on the space H, namely

find x ∈ H such that 0 ∈ Ax + Bx, (3)

where B is a smooth operator for which we can use direct evaluations, and A is a prox-friendly operator for which we can compute the resolvent. For solving (3), Lions and Mercier [10] proposed the forward–backward splitting method, which is based on the recursive application of a forward step with respect to B followed by a backward step with respect to A. For any initial data p_0 ∈ H, it iterates

p_{n+1} = (Id + γ_n A)^{-1}(p_n − γ_n B p_n), ∀n ∈ N,

where Id stands for the identity operator and γ_n > 0. Basically, the operator A : H → 2^H is maximally monotone, and the operator B is ρ-cocoercive (i.e., ∃ρ > 0 such that ⟨Bx − By, x − y⟩ ≥ ρ‖Bx − By‖², ∀x, y ∈ H) and ι-Lipschitz continuous (i.e., ∃ι > 0 such that ‖Bx − By‖ ≤ ι‖x − y‖, ∀x, y ∈ H). It was proven that the generated sequence (p_n)_{n≥0} converges weakly to a solution of (3). Of course, problem decomposition is not the only consideration; the convergence rate is another. Accelerating first-order methods is a subject of active research and has been extensively studied. Since Polyak [6] introduced the so-called heavy ball method for minimizing a smooth convex function, much has been done on the development of first-order accelerated methods; see [11][12][13][14][15]. The inertial nature of a first-order accelerated method can be exploited in numerical computations to accelerate the trajectories and speed up the rate of convergence. In the context of algorithms, it is often referred to as inertial extrapolation, which involves two iterative steps, the second of which is defined based on the previous two iterates.
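The forward–backward iteration above can be sketched as follows. The choice A = ∂(α‖·‖₁), whose resolvent is soft thresholding, and B = ∇(½‖· − b‖²) is an illustrative assumption of ours with a known closed-form solution.

```python
import numpy as np

def forward_backward(resolvent_A, B, p0, gamma, iters):
    """Forward-backward splitting: a forward (explicit) step on B
    followed by a backward (resolvent) step on A."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        p = resolvent_A(p - gamma * B(p), gamma)
    return p

# Toy problem: A = subdifferential of alpha*||.||_1 (resolvent = soft
# thresholding), B(p) = p - b (1-cocoercive), so 0 in Ap + Bp has the
# closed-form solution soft(b, alpha); any gamma in (0, 2) converges.
alpha, b = 0.5, np.array([2.0, -0.2])
soft = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g * alpha, 0.0)
x = forward_backward(soft, lambda p: p - b, np.zeros(2), 1.0, 50)
```

Here each factor of the sum is handled by the cheap operation available for it, which is the whole point of the splitting.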
In [16], Alvarez and Attouch employed the first-order acceleration idea to study an inertial proximal point algorithm for the problem of finding a zero of a maximally monotone operator. For any initial data p_0, p_1 ∈ H, the iteration can be written in the following form:

p_{n+1} = (Id + γ_n G)^{-1}(p_n + α_n(p_n − p_{n−1})), ∀n ∈ N, (4)

where Id stands for the identity operator, G is a maximally monotone operator, and the parameters γ_n > 0 and α_n ∈ [0, 1) satisfy suitable conditions. It was shown that the iterative sequence (p_n)_{n≥0} generated by (4) converges weakly to a solution of inclusion Problem (1). An alternative inertial modification of the forward–backward splitting algorithm is the following algorithm proposed by Lorenz and Pock [17], which converges weakly to a zero of the sum of two maximally monotone operators, one of them being Lipschitz continuous and single-valued. Starting with any data p_0, p_1 ∈ H, we define a sequence (p_n)_{n≥0} via

w_n = p_n + α_n(p_n − p_{n−1}), p_{n+1} = (Id + γ_n M^{-1} A)^{-1}(w_n − γ_n M^{-1} B w_n),

where M is a positive definite, linear, self-adjoint operator that can be used as a preconditioner for the scheme, γ_n is a step-size parameter, and α_n ∈ [0, 1) is an extrapolation factor. The algorithm keeps the weak convergence property of the iterates. To motivate the inertial forward–backward splitting algorithm, we consider the second-order differential equation (5), which is discretized explicitly with a time step h_n > 0. Set t_n = ∑_{i=1}^n h_i, ι_n = ι(t_n), γ_n = γ(t_n) and p_n = p(t_n). An explicit finite-difference scheme for (5) with a centered second-order variation gives an equality (6) in which w_n is a point on the line passing through p_{n−1} and p_n (we have some flexibility in the choice of w_n). The equality (6) can be rewritten as (7) by setting γ_n = h_n² and α_n = 1 − ι_n h_n.
With the aid of the classical Nesterov extrapolation choice for w_n, the continuous dynamics (5) leads to a special case of the algorithm:

w_n = p_n + α_n(p_n − p_{n−1}), p_{n+1} = (Id + γ_n A)^{-1}(w_n − γ_n B(w_n)), (8)

where γ_n is a step-size parameter and α_n is an extrapolation factor; the extrapolation term α_n(p_n − p_{n−1}) is intended to speed up the rate of convergence. In so doing, this dynamical approach leads to a special case of the forward–backward algorithm of inertial type. Consider also the following monotone variational inequality problem (VIP, for short), which consists of finding z ∈ Ω such that

⟨S(z), x − z⟩ ≥ 0, ∀x ∈ Ω, (9)

where S : Ω → H is a monotone operator. We denote the solution set of the variational inequality problem by VI(Ω, S).
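Scheme (8) can be sketched in a few lines. The smooth toy operator below (A = 0, so the resolvent is the identity, and B(p) = p − b) and the fixed extrapolation factor are illustrative assumptions of ours.

```python
import numpy as np

def inertial_fb(resolvent_A, B, p0, p1, gamma, alpha, iters):
    """Scheme (8): w_n = p_n + alpha*(p_n - p_{n-1}), then
    p_{n+1} = (Id + gamma*A)^{-1}(w_n - gamma*B(w_n))."""
    p_prev, p = np.asarray(p0, dtype=float), np.asarray(p1, dtype=float)
    for _ in range(iters):
        w = p + alpha * (p - p_prev)       # inertial extrapolation
        p_prev, p = p, resolvent_A(w - gamma * B(w))
    return p

# Toy case: A = 0 (resolvent = identity) and B(p) = p - b, whose unique
# zero is b; the inertial term changes the path, not the limit.
b = np.array([1.0, -3.0])
x = inertial_fb(lambda v: v, lambda p: p - b, np.zeros(2), np.zeros(2),
                0.5, 0.3, 200)
```

Note that the update needs the two previous iterates, which is why two initial points p_0, p_1 appear in the schemes above.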

Remark 1.
It is known that x solves VI(Ω, S) iff x is an equilibrium point of the corresponding projected dynamical system, i.e., x = Proj_Ω(Id − δS)x for δ > 0, where Proj_Ω is the nearest point (or metric) projection from H onto Ω.
Variational inequality problems, which serve as a powerful mathematical model, unify several important concepts in applied mathematics, such as systems of nonlinear equations, complementarity problems, and equilibrium problems, under a general framework [18][19][20]. Recently, the spotlight has been on developing efficient and implementable iterative schemes for solving monotone variational inequality problems; see [21][22][23][24]. A significant body of work on iterative methods for VIPs has accumulated in the literature. In 2001, Yamada [25] investigated a so-called hybrid steepest descent method. Given the current iterate p_n, it calculates the next iterate p_{n+1} via

p_{n+1} = (Id − αχ_n S)T p_n, ∀n ≥ 0,

where the operator T : Ω → H is nonexpansive, the operator S is ι-Lipschitz continuous for some ι > 0 and κ-strongly monotone (i.e., ∃κ > 0 such that ⟨Sx − Sy, x − y⟩ ≥ κ‖x − y‖², ∀x, y ∈ H), while the parameters satisfy α ∈ (0, 2κ/ι²), (χ_n)_{n≥0} ⊂ (0, 1], lim_{n→∞} χ_n = 0 and ∑_{n=1}^∞ χ_n = ∞. In this paper, the set Ω can be regarded as the solution set of the inclusion problem. In the continuous case, Attouch and Maingé [26] developed a second-order autonomous system (10) with a Tikhonov-like regularizing term ∇F, where α > 0 and (p_0, q_0) ∈ H² is an arbitrarily given initial datum. Under the assumptions that (i) F : H → R is a convex, differentiable function with ∇F strongly monotone and Lipschitz continuous, and (ii) ∫ β(t)dt = +∞ and β(t) → 0 as t → +∞, with β Lipschitz continuous and bounded, they proved that each trajectory of (10) converges strongly as t → ∞ to the point p* solving the variational inequality problem (11). In the spirit of the forward–backward splitting method and the hybrid steepest descent method, we present an iterative scheme as a new strategy, parallel to that of the autonomous system (10).
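Yamada's hybrid steepest descent update can be sketched as below. The box projection T and the operator S(x) = x − s are toy choices of ours that satisfy the stated assumptions (T nonexpansive, S 1-strongly monotone and 1-Lipschitz, α ∈ (0, 2κ/ι²)).

```python
import numpy as np

def hybrid_steepest_descent(T, S, p0, alpha, iters):
    """p_{n+1} = (Id - alpha*chi_n*S) T(p_n) with chi_n = 1/(n+1),
    so chi_n -> 0 and sum(chi_n) = infinity, as required."""
    p = np.asarray(p0, dtype=float)
    for n in range(iters):
        chi = 1.0 / (n + 1)
        q = T(p)
        p = q - alpha * chi * S(q)
    return p

# Toy case: T = projection onto the box [-1, 1]^2 (its fixed-point set is
# the box) and S(x) = x - s; the iterates approach the point of the box
# solving the variational inequality, i.e., the projection of s onto it.
s = np.array([0.5, 2.0])
x = hybrid_steepest_descent(lambda v: np.clip(v, -1.0, 1.0),
                            lambda v: v - s, np.array([3.0, -3.0]), 1.0, 4000)
```

The vanishing factor χ_n is what selects, among all fixed points of T, the particular one solving the variational inequality governed by S.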
We analyze the convergence with the underlying operator B cocoercive, and the extension from condition (ii) to the general maximally monotone case is considered in Section 3. From this perspective, our study is a natural extension of the convergence results obtained by Attouch and Maingé [26] in the case of continuous dynamical systems.
where t and s are any positive real numbers with t ≥ s.
Proof. Since A is maximally monotone, one has

Lemma 4 ([28]). Let S be a nonexpansive mapping defined on a closed convex subset C of a Hilbert space H. Then Id − S is demiclosed; i.e., whenever (x_n)_{n≥0} is a sequence in C converging weakly to some x ∈ C and the sequence ((Id − S)x_n)_{n≥0} converges strongly to some y ∈ H, it follows that y = (Id − S)x.

Lemma 6 ([30]). Let (ζ_n)_{n≥0} be a sequence of nonnegative real numbers for which there exists a subsequence (ζ_{n_i}) of (ζ_n)_{n≥0} such that ζ_{n_i+1} > ζ_{n_i} for all i ∈ N. Then there exists a nondecreasing sequence (m_j) ⊂ N such that lim_{j→∞} m_j = ∞ and, for all sufficiently large j ∈ N, ζ_{m_j} ≤ ζ_{m_j+1} and ζ_j ≤ ζ_{m_j+1}. In fact, m_j is the largest number n in the set {1, 2, · · · , j} such that ζ_{n+1} ≥ ζ_n.

Main Results
Throughout this section, we make the following standing assumptions.

Assumption 1. (a) Id stands for the identity operator; the operator S : H → H is supposed to be κ-strongly monotone and ι-Lipschitz continuous for some κ, ι > 0; (b) A : H → 2^H is a maximally monotone operator, and B : H → H is a ρ-cocoercive operator for some ρ > 0.

Our Algorithm 1 is formally designed as follows.

Algorithm 1: The hybrid forward–backward algorithm with the inertial term
Input: Input the algorithm parameters (ν_i)_{i≥0}, (η_i)_{i≥0}, (µ_i)_{i≥0} and δ; Output: Output p̄; 1 Initialize the data p_0, p_1 ∈ H; 2 Set n ← 1; 3 while not converged do 4 Update w_n = p_n + µ_n(p_n − p_{n−1}); 5 Update q_n = (Id + ν_n A)^{-1}(Id − ν_n B)w_n; 6 Update p_{n+1} = (Id − δη_n S)q_n; 7 Set n ← n + 1; 8 end 9 return p̄ = p_n

We make the following assumption with respect to the algorithm parameters.
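The three-step update of Algorithm 1 can be sketched as follows. The operators in the toy run (A = 0, whose resolvent is the identity, B(p) = p − b, and S = Id) and the constant inertial factor are illustrative assumptions of ours, not choices made in the paper.

```python
import numpy as np

def hybrid_fb(resolvent_A, B, S, p0, p1, delta, nus, etas, mus):
    """Algorithm 1: inertial step, forward-backward step, then a hybrid
    steepest-descent damping step (Id - delta*eta_n*S)."""
    p_prev, p = np.asarray(p0, dtype=float), np.asarray(p1, dtype=float)
    for nu, eta, mu in zip(nus, etas, mus):
        w = p + mu * (p - p_prev)                 # inertial extrapolation
        q = resolvent_A(w - nu * B(w), nu)        # forward-backward step
        p_prev, p = p, q - delta * eta * S(q)     # damping step
    return p

# Toy run: A = 0, B(p) = p - b, S = Id; the solution set of 0 in Ap + Bp
# is {b}, so the iterates should approach b as eta_n -> 0.
b, N = np.array([1.0, 2.0]), 5000
x = hybrid_fb(lambda v, nu: v, lambda p: p - b, lambda q: q,
              np.zeros(2), np.zeros(2), 0.1,
              [0.5] * N, [1.0 / n for n in range(1, N + 1)], [0.1] * N)
```

Since S = Id makes the solution set of the variational inequality collapse to the whole zero set, the damping step here only provides the vanishing Tikhonov-like perturbation that forces strong convergence.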

Remark 2.
Please note that condition (C_3) of Assumption 2 is easily implemented, since the value of ‖p_n − p_{n−1}‖ is known before choosing µ_n. Indeed, the parameter µ_n can be chosen as 0 ≤ µ_n ≤ µ̄_n, with µ̄_n = min{ω, τ_n/‖p_n − p_{n−1}‖} if p_n ≠ p_{n−1} and µ̄_n = ω otherwise, where ω ≥ 0 and (τ_n)_{n≥0} is a positive sequence such that τ_n = o(η_n). Now we are in a position to state and prove the main result of this section.

Theorem 1. Suppose that Assumptions 1 and 2 hold. Then, for any initial data p_0, p_1 ∈ H, every weak sequential cluster point of the sequences (p_n)_{n≥0}, (w_n)_{n≥0} and (q_n)_{n≥0} generated by Algorithm 1 belongs to the solution set of Problem (3). In addition, the three sequences converge strongly to the unique element of VI(Ω, S).
Proof. First, we show that Id − δS is a contraction operator. According to Lemma 1, we have an estimate with α = (1/2)δ(2κ − δι²); thus, Id − δS is a contraction operator with constant 1 − α. In light of the nonexpansivity of Proj_Ω, we further obtain that Proj_Ω(Id − δS) is a contraction operator. By the Banach contraction principle, there exists a unique point a ∈ Λ such that a = Proj_Ω(Id − δS)a. It remains to prove that (p_n)_{n≥0} is bounded. Since B : H → H is ρ-cocoercive and (Id + ν_n A)^{-1} is firmly nonexpansive, one concludes from Lemma 2 that Id − ν_n B is nonexpansive. Let z ∈ Λ be arbitrarily chosen. Invoking Assumption 2 (C_1), one infers the estimates (14) and (15). From the definition of w_n, one reaches (16). Invoking Assumption 2 (C_3), there exists a positive constant M_1 ∈ R such that (µ_n/η_n)‖p_n − p_{n−1}‖ ≤ M_1, ∀n ≥ 0.
From the definition of p_n, together with (14)–(16), one obtains an estimate with ξ = 1 − √(1 − δ(2κ − δι²)) ∈ (0, 1], due to Assumption 2 (C_4). This implies that the sequence (p_n)_{n≥0} is bounded. At the same time, by putting together (14) and (15), one concludes that (q_n)_{n≥0} and (w_n)_{n≥0} are bounded. Once again, using the definition of w_n, one obtains (18), with a suitable constant M_2. By combining (12) with (18), one immediately obtains (20). Invoking (20), together with the definition of p_n, one deduces (21), where M_3 = sup_{n≥0} 2δ⟨z − p_{n+1}, Sz⟩ < ∞, due to the boundedness of (p_n)_{n≥0}. Let us rewrite (21) as (22). By using the firm nonexpansivity of (Id + ν_n A)^{-1}, one arrives at an inequality which can be equivalently rewritten as

‖q_n − z‖² ≤ ‖w_n − z‖² − ‖q_n − w_n‖² − ν_n²‖Bw_n − Bz‖² + 2ν_n‖q_n − w_n‖ ‖Bw_n − Bz‖. (23)
Returning to (14), (18) and (23), one concludes (24). Next, we show that (‖p_n − z‖²)_{n≥0} converges to zero by considering two possible cases for the sequence (‖p_n − z‖²)_{n≥0}. Case 1. There exists N ∈ N such that ‖p_{n+1} − z‖² ≤ ‖p_n − z‖², ∀n ≥ N. Recalling that ‖p_n − z‖² is bounded from below, one deduces that lim_{n→∞} ‖p_n − z‖² exists. Using Assumption 2 (C_2), and letting n tend to infinity in (22), one finds that lim_{n→∞} ν_n‖Bz − Bw_n‖ = 0.
Denote G_{ν_n} = (Id + ν_n A)^{-1}(Id − ν_n B). It then follows immediately that lim_{n→∞} ‖G_{ν_n} w_n − w_n‖ = lim_{n→∞} ‖q_n − w_n‖ = 0.
Since (p_n)_{n≥0} is a bounded sequence, there exist a subsequence (p_{n_i})_{i≥0} of (p_n)_{n≥0} and a weak sequential cluster point s ∈ H such that p_{n_i} ⇀ s as i → ∞. Due to (26) and (27), one finds that w_{n_i} ⇀ s and q_{n_i} ⇀ s as i → ∞. Thus, lim sup_{n→∞} ⟨Sz, z − w_n⟩ = lim_{i→∞} ⟨Sz, z − w_{n_i}⟩.
Case 2. In this case, there exists a nondecreasing sequence (m_k)_{k≥0} ⊂ N such that m_k → ∞ as k → ∞, and Lemma 6 asserts that the corresponding inequalities hold. Thanks to (21) and (36), one finds

ν_n(2ρ − ν_n)‖Bz − Bw_n‖² ≤ µ_n M_1‖p_n − p_{n−1}‖ + M_2 η_n.
By letting n tend to infinity in the above inequality, one infers that lim_{n→∞} ν_n‖Bz − Bw_n‖ = 0. According to (24) and (36), one finds that

‖q_n − w_n‖² ≤ η_n‖(Id − δS)q_n − z‖² + µ_n M_1‖p_n − p_{n−1}‖ + 2ν_n‖q_n − w_n‖ ‖Bw_n − Bz‖.
Recalling that the sequence (p_n)_{n≥0} converges strongly to a, which together with (41) implies that the sequences (q_n)_{n≥0} and (w_n)_{n≥0} also converge strongly to a, one concludes that the sequences generated by Algorithm 1 converge strongly to the unique point a ∈ Ω such that a = Proj_Ω(Id − δS)a. In view of Remark 1, one sees that a ∈ VI(Ω, S). Furthermore, the solution of this variational inequality is unique due to the properties of S. This completes the proof.
If µ_n := 0 for all n ∈ N, we obtain the following Algorithm 2, without the inertial extrapolation, as a special case of Algorithm 1.

Algorithm 2:
The hybrid forward–backward algorithm without the inertial term
Input: Input the algorithm parameters (ν_i)_{i≥0}, (η_i)_{i≥0} and δ; Output: Output p̄; 1 Initialize the data p_0 ∈ H; 2 Set n ← 1; 3 while not converged do 4 Update q_n = (Id + ν_n A)^{-1}(Id − ν_n B)p_n; 5 Update p_{n+1} = (Id − δη_n S)q_n; 6 Set n ← n + 1; 7 end 8 return p̄ = p_n

Recently, much attention has been paid to the relationship between continuous-time systems and neural network models; see [31,32]. One can view an iterative algorithm as a time-discretized version of a continuous-time dynamical system. We propose a recurrent neural network for Problem (1): by taking constant parameters, we obtain the dynamical system (42) with respect to the time variable t. Algorithm 1 is strongly connected with the damped inertial system (42); indeed, Algorithm 1 comes naturally into play by performing an implicit discretization of the inertial system with respect to the time variable t. Without ambiguity, we omit the variable t to simplify notation, so p stands for p(t), and so on. By setting S := Id and k = 1 − δη, the circuit architecture diagram of the model (42) is shown in Figure 1. The operational amplifiers U1, U2 are combined with capacitors C1, C2 and resistors R1, R2 to work as integrators realizing the transformations between p, ṗ and p̈. The amplifier U9, cooperating with resistors R20, R21, provides amplification and sign inversion; hence ṗ is translated into −αṗ. The amplifier U3 and resistors R3, R4, R5, R6 form a subtractor block to obtain −p. In the same way, the amplifier U8 and resistors R17, R18, R19 form another subtractor block to obtain p + αṗ. One sees that A((ṗ + p̈)/k) and B(p + αṗ) are obtained by the "A(·) realization" and "B(·) realization" blocks, respectively.
The amplifier U6, cooperating with resistors R13, R14, provides amplification and sign inversion, so A((ṗ + p̈)/k) is translated into −νA((ṗ + p̈)/k); likewise, the amplifier U7 with resistors R15, R16 translates B(p + αṗ) into −νB(p + αṗ). The amplifier U5 and resistors R9, R10, R11, R12 form a subtractor block realizing νA((ṗ + p̈)/k) + νB(p + αṗ) − (p + αṗ), that is, −(ṗ + p̈)/k. Finally, −(ṗ + p̈)/k is translated into (ṗ + p̈)/k through the phase inverter composed of the operational amplifier U4 and resistors R7, R8.

Numerical Experiment
In this section, we consider a computational experiment to illustrate the convergence properties of the proposed method. The experiment is performed on a PC with Intel (R) Core (TM) i5-8250U CPU @1.60GHz under the MATLAB computing environment.
Digital signal reconstruction is one of the earliest and most classical problems in file restoration, video and image coding, medical and astronomical imaging, and other applications. Many problems in signal processing can be formulated as inverting a linear system, modeled as

b = Qx + ν, (43)

where x ∈ R^i is the original signal to be reconstructed, ν ∈ R^j is the noise, b ∈ R^j is the noisy measurement, and Q ∈ R^{j×i} is a bounded linear observation operator, often ill-conditioned because it models a process with loss of information.
In our experiment, we consider a general compressed sensing scenario, where the goal is to reconstruct an i-length sparse signal x with exactly k nonzero components from j observations (k ≪ j < i), i.e., the number of measurements is much larger than the sparsity level of x and at the same time smaller than the signal length. Considering the storage limitation of the PC, we test a small-size signal with i = 2^12, where the original signal contains k = 180 randomly placed nonzero elements. We reconstruct this signal from j = 2^10 observations. More precisely, the observation is b = Qx + ν, where Q ∈ R^{j×i} is a Gaussian matrix whose elements are drawn independently from the standard normal distribution N(0, 1) and ν is Gaussian noise distributed as N(0, σ²I) with σ² = 10^{-4}.
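The experimental data can be generated as follows; the ±1 spike amplitudes and the RNG seed are assumptions of ours, since the paper only specifies the dimensions, the sparsity level, and the noise variance.

```python
import numpy as np

def make_cs_problem(i=4096, j=1024, k=180, sigma2=1e-4, seed=0):
    """k-sparse signal x in R^i, Gaussian sensing matrix Q in R^{j x i}
    with N(0, 1) entries, and noisy observation b = Qx + nu."""
    rng = np.random.default_rng(seed)
    x = np.zeros(i)
    support = rng.choice(i, size=k, replace=False)
    x[support] = rng.choice([-1.0, 1.0], size=k)   # assumed spike amplitudes
    Q = rng.standard_normal((j, i))
    nu = np.sqrt(sigma2) * rng.standard_normal(j)  # Gaussian noise N(0, sigma2*I)
    b = Q @ x + nu
    return Q, b, x
```

Drawing the support without replacement guarantees exactly k nonzero components, matching the sparsity level stated above.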
A classical and significant approach to signal processing problems is the regularization method, which has attracted considerable attention and revived much interest in the compressed sensing literature. We restrict our attention to the l_1-regularized least squares model of (43). The Lasso framework is a particular instance of the linear problems of type (43) with the non-smooth l_1-norm as a regularizer; it minimizes a squared loss function and seeks sparse solutions of

min_x (1/2)‖Qx − b‖_2² + α‖x‖_1, (44)

where the regularization parameter α (α ≥ 0) provides a tradeoff between noise sensitivity and fidelity to the measurements. One knows that Problem (44) always has a solution; however, it need not be unique. Please note that the optimal solution tends to zero as α → ∞. As α → 0, the limit point has the minimum l_1 norm among all points satisfying Q^T(Qx − b) = 0, i.e., x = arg min_{Q^T(Qx−b)=0} ‖x‖_1. Since the objective function of Problem (44) is convex but not differentiable, one uses a first-order optimality condition based on subdifferential calculus: for m = 1, 2, · · · , a point x is optimal if and only if (Q^T(Qx − b))_m = −α sgn(x_m) whenever x_m ≠ 0 and |(Q^T(Qx − b))_m| ≤ α whenever x_m = 0. From the above, one sees that the optimal solution of Problem (44) is 0 when (Q^T b)_m ∈ [−α, α] for all m, i.e., α ≥ ‖Q^T b‖_∞. Thus, one can derive the formula α_max = ‖Q^T b‖_∞: for l_1-regularized least squares, the zero solution is optimal precisely for the finite values of α with α ≥ α_max. To avoid the optimal sparse solution being the zero vector, in this experiment the regularization parameter is set to α := 0.01 α_max, where the value of α_max is computed by the above formula.
Proximal methods give a better modeling of the sparsity structure in the dataset. The major step in a proximal method is to find a solution of arg min_x (g(x) + (1/2α)‖x − v‖_2²) with respect to the function g and a parameter α > 0. The proximity operator is defined as the resolvent of the subdifferential, prox_{αg} = (Id + α∂g)^{-1}. Furthermore, the proximity operator of the l_1-norm is the shrinkage operator, defined componentwise as (prox_{α‖·‖_1}(x))_m = (|x_m| − α)_+ sgn(x_m). Now one considers the deblurring Problem (43) via the iterative shrinkage algorithm. Setting g(x) = α‖x‖_1 and f(x) = (1/2)‖Qx − b‖_2², one sees that Problem (44) is a special instance of the problem: find x ∈ H such that 0 ∈ (∂g + ∇f)x.
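The shrinkage operator admits a one-line implementation of the standard componentwise soft-thresholding formula above.

```python
import numpy as np

def prox_l1(x, alpha):
    """Proximity operator of alpha*||.||_1: the shrinkage operator
    (prox(x))_m = (|x_m| - alpha)_+ * sgn(x_m), applied componentwise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)
```

Components with magnitude below α are set exactly to zero, which is what produces sparse iterates.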
In the class of regularized loss minimization problems, the function g is a non-smooth regularization function and the function f is a smooth loss function whose gradient ∇f is Lipschitz continuous with constant ‖Q^T Q‖ and hence 1/‖Q^T Q‖-cocoercive. One then solves the deblurring problem using the iterative shrinkage algorithm. Setting A := ∂g, B := ∇f and S := Id in Algorithm 1, one obtains the proximal gradient type algorithm

q_n = (Id + ν_n A)^{-1}(Id − ν_n B)(p_n + µ_n(p_n − p_{n−1})), p_{n+1} = (Id − δη_n S)q_n,

where (µ_n)_{n∈N} ⊂ R_+. Meanwhile, we set δ := 1/50, η_n := 1/n and ν_n = ν := 10^{-8} × ‖Q^T Q‖ for n ∈ N. Next, we randomly choose the starting signal in the range (0, 1)^{4096} and take the number of iterations n = 2000 as the stopping criterion.
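Putting the pieces together, the specialization of Algorithm 1 used in the experiment can be sketched as below. The values δ = 1/50 and η_n = 1/n follow the text; the step size ν = 1/‖Q^T Q‖, the constant inertial factor, and the small synthetic problem are simplifying assumptions of ours (the paper uses ν = 10^{-8} × ‖Q^T Q‖ and the µ_n rule of Remark 2).

```python
import numpy as np

def lasso_hybrid_fb(Q, b, alpha, iters=500, delta=1.0 / 50, mu=0.3):
    """Algorithm 1 with A = subdifferential of alpha*||.||_1, B = gradient
    of 0.5*||Qx - b||^2, and S = Id (the damping step is a scalar shrink)."""
    nu = 1.0 / np.linalg.norm(Q.T @ Q, 2)   # assumed step size 1/||Q^T Q||
    p_prev = p = np.zeros(Q.shape[1])
    for n in range(1, iters + 1):
        w = p + mu * (p - p_prev)                                 # inertial step
        v = w - nu * (Q.T @ (Q @ w - b))                          # forward step on B
        q = np.sign(v) * np.maximum(np.abs(v) - nu * alpha, 0.0)  # backward step on A
        p_prev, p = p, (1.0 - delta / n) * q                      # (Id - delta*eta_n*Id)
    return p

# Small synthetic instance; alpha = 0.01 * alpha_max as in the text.
rng = np.random.default_rng(0)
Q = rng.standard_normal((30, 60))
x_true = np.zeros(60); x_true[[3, 17, 42]] = [1.0, -1.0, 1.0]
b = Q @ x_true + 0.01 * rng.standard_normal(30)
alpha = 0.01 * np.linalg.norm(Q.T @ b, np.inf)
x_hat = lasso_hybrid_fb(Q, b, alpha)
```

The residual ‖Qx̂ − b‖ decreases far below ‖b‖, mirroring the qualitative behavior reported for the full-size experiment.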
In this experiment, the minimum norm solution is the point in the set {x ∈ R^4096 | Q^T Qx = Q^T b} closest to the original sparse signal. The result of the sparse signal reconstruction is shown in Figure 2. By comparing the last two plots in Figure 2 with the top plot, one finds that the original sparse signal is recovered almost exactly. We hence conclude that our algorithm is efficient for the signal deblurring problem. To illustrate that Algorithm 1 has competitive performance compared with Algorithm 2, we report the numerical results shown in Figure 3. The left plot in Figure 3 shows the behavior of (‖p_n‖)_{n∈N} (y-axis) with respect to the number of iterations (x-axis), where p_n is generated by Algorithms 1 and 2. It can be observed from the test result reported in Figure 3 that (‖p_n‖)_{n∈N} converges to 12.8603. Thus, we can use (‖p_n‖)_{n∈N} to study the convergence and computational performance of Algorithms 1 and 2. We denote D_n = p_n − x, where x is the original signal to be reconstructed and p_n is an estimate of x. The right plot in Figure 3 shows the value of (‖D_n‖)_{n∈N} (y-axis) against the elapsed execution time in seconds (x-axis). The sequence (‖D_n‖)_{n∈N} converges to 0.4152 as the running time increases. The restoration accuracy can be measured by means of the mean squared error: we find that MSE = ‖D_n‖²/i converges to 4.2088 × 10^{-5}, which further indicates that the iterative sequence converges to the original signal x in this experiment.
We compare the behaviors of (‖p_n‖)_{n∈N} and (‖D_n‖)_{n∈N} generated by Algorithms 1 and 2, respectively. The comparison shows that the bigger µ_n is, the fewer iterations are required and the faster the convergence becomes. Furthermore, all cases of Algorithm 1 need less computer time and enjoy a faster convergence rate than Algorithm 2; it can be observed from the plots that Algorithm 1 outperforms Algorithm 2 in every tested case. The inertial extrapolation of Algorithm 1 plays a key role in this acceleration: for µ_n judiciously chosen, the inertial term improves the convergence speed of the algorithm.

Conclusions
In this paper, we introduced a variational inequality problem over the solution set of an inclusion problem. Since this problem has a double structure, it can be considered a double-hierarchical optimization problem. The proposed algorithm uses the extrapolation term µ_n(p_n − p_{n−1}), which can be viewed as a procedure for speeding up the rate of convergence. As an application, the algorithm was used to solve the signal deblurring problem, supporting our convergence result.