FDD Channel Estimation Via Covariance Estimation in Wideband Massive MIMO Systems

A method for channel estimation in wideband massive Multiple-Input Multiple-Output systems using hybrid digital analog architectures is developed. The proposed method is useful for Frequency-Division Duplex at either sub-6 GHz or millimeter wave frequency bands and takes into account the beam squint effect caused by the large bandwidth of the signals. To circumvent the estimation of large channel vectors, the posed algorithm relies on the slow time variation of the channel spatial covariance matrix, thus allowing for the utilization of very short training sequences. This is possibledue to the exploitation of the channel structure. After identifying the channel covariance matrix, the channel is estimated on the basis of the recovered information. To that end, we propose a novel method that relies on estimating the tap delays and the gains as sociated with each path. As a consequence, the proposed channel estimator achieves low computational complexity and significantly reduces the training overhead. Moreover, our numerical simulations show better performance results compared to the minimum mean-squared error solution.


Introduction
Massive Multiple-Input Multiple-Output (MIMO) and millimeter wave (mmWave) technologies are promising candidates to satisfy the demands of future wideband wireless communication systems. A common feature of these technologies is the deployment of antenna arrays with a large number of elements while keeping the size of the antenna aperture small. This way, advantages such as large beamforming gains necessary to compensate the propagation losses at mmWave frequencies, or the ability to support many spatially multiplexed streams, are efficiently achieved [1,2]. In addition, it has been shown that in multi-cell massive MIMO systems the capacity increases without bound as the number of antennas increases, even under pilot contamination, if the channel covariance matrices of the contaminating users are asymptotically linearly independent [3]. Measurements campaigns have proved that this is usually the case in practice [4].
Deploying massive antenna arrays raises concerns in terms of hardware cost and power consumption, especially at mmWave frequencies. To alleviate these requirements, hybrid architectures with an analog preprocessing network operating in the Radio Frequency (RF) domain have been proposed to reduce the number of complete RF chains [2]. Hybrid approaches imply stringent limitations, such as the reduced spatial dimensionality and the lack of flexibility of the analog part, usually implemented with a Phase Shifter (PS) network. Thus, results obtained for fully digital scenarios are not applicable to the hybrid case in general, and new precoding or combining solutions are needed.
The benefits of large antenna arrays rely on the accuracy of the Channel State Information (CSI). Considering wideband systems with multicarrier modulations, this information should be acquired In this work, we propose a downlink channel estimation method for wideband massive MIMO systems over hybrid architectures based on the explicit acquisition of the channel covariance matrix. With the aim of achieving low training overheads, the proposed method employs training sequences based on sparse rulers. The considered setup typically arises in mmWave frequencies, but the proposed framework is general enough to be applicable in sub-6 GHz bands since channel sparsity is not required. Indeed, our approach only as sumes (1) the channel is wide-sense stationary over a period of time, and (2) the channel covariance matrix has a Toeplitz structure. Moreover, we consider the nonlinear spatial transformation caused by the beam squint effect. In the following, we summarize the contributions of our work: • We propose a wideband downlink channel covariance estimation method for hybrid FDD massive MIMO systems that can be utilized for mmWave frequencies. The covariance matrices for different subcarriers lie in distinct subspacesdue to the beam squint effect. The proposed covariance estimation algorithm implements dictionary rotations to overcome this difficulty and exploit the common structure of the wideband covariance matrices.

•
Differently from the prior art, and with the aim of reducing the training overhead, Wichmann sparse rulers [36] are used to design the training sequences. Prior work made use of coprime arrays, but this solution cannot attain the lowest training lengths for a given number of transmit antennas.
Other works that relied on sparse rulers did not provide methods to design the training sequences for a number of antennas arbitrarily large.

•
We analyze the limitations of the proposed method based on MUltiple SIgnal Classification (MUSIC): (1) The low rank characteristic of the sample covariance matrices limits the applicability of MUSIC-like techniques. The use of sparse rulers make it possible to extend the proposed method using spatial smoothing techniques and allows for even shorter training sequences; (2) Prior knowledge on the number of channel paths. We show numerically that the number of channel paths can be estimated without significant performance losses and negligible computational cost.
(3) The dictionary dependency. We provide a discussion on dictionary sizes, posing that moderated sizes provide enough angular resolution. Furthermore, since sparsity is not a requirement for the proposed method, we show empirically how to avoid the power leakage arising when the Angle of Departures (AoDs) do not lie on the dictionary.

•
We propose a low complexity channel estimator, leveraging on the structure of the wideband channels to avoid the individual per-subcarrier procedure. This estimator obtains the delays and gains corresponding to each channel propagation path exploiting underlying the common information for all the subcarriers. Moreover, the estimation is performed using the training sequence employed for the acquisition of the covariance matrices, and it does not require an additional training stage.

Notation
In this work, scalars are represented by lower case letters, vectors by bold lower case letters, and matrices by bold capital letters. The superscripts T and H are the transpose and the Hermitian operators, respectively. The operator | · | with a scalar argument represents the absolute value, · with a vector argument represents its norm and · F with a matrix argument the Frobenius norm. The operator · represents the element-wise floor operation. A ⊗ B is the Kronecker product of matrices A and B, and A • B the Khatri-Rao product (or the column-wise Kronecker product) of matrices A and B. diag (·) is the diagonal matrix with the arguments in its main diagonal, and blockdiag (·) constructs a diagonal supermatrix in which the diagonal elements are given by the matrices in the argument, and the off-diagonal elements are zero matrices. 0 and 1 denote the vectors of zeros and ones of the right dimension, respectively. The constant c is the speed of light.

System Model
We consider a hybrid FDD massive MIMO system with M transmit antennas and N RF RF chains communicating with several single-antenna users over a multipath channel. To overcome the channel frequency selectivity, we assume an Orthogonal Frequency-Division Multiplexing (OFDM) modulation with N c subcarriers and a cyclic prefix large enough to avoid intercarrier and intersymbol interference. One of the aims of this work is to estimate the downlink channel covariance; hence, as suming that the channel is locally stationary, we will consider several blocks k ∈ {1, . . . , K} with the duration of the channel coherence time. At the beginning of each block, the BS transmits a training sequence of length T tr OFDM symbols to sound both the covariance and the channel. The remaining time is devoted to data transmission. That is, we collect K wideband observations of size T tr during K training periods. Since the block duration is limited by the channel coherence time, the training sequence length must satisfy T tr Mdue to the large antenna array deployed at the BS. Indeed, we circumvent the requirement of training sequence lengths larger than the number of paths L, i.e., we propose solutions to the scenario T tr ≤ L.
Finally, we will as sume that the users feed the observations back to the BS using, e.g., scalar quantization [8,37]. With the received feedback, the BS estimates the channel covariance matrices for all users and frequency subcarriers. This scheme allows for tracking the channel covariance matrices using subsequent received feedbackdue to the slow variation of the channel statistics. As previously mentioned, these slow changes make channel covariance acquisition simpler than estimating the channel itself. Next, with the aid of the obtained channel covariance matrices, we estimate the wideband channels employing the K aforementioned observations.

Channel Model
We represent the channel response to an arbitrary user using a model similar to that in [34,38]. We as sume a multipath channel model with N taps, where the gains for all antennas in the f -th subcarrier during the k-th coherence block are given by where h n,k ( f ) ∈ C M are the channel coefficients of the n-th tap of the channel. Each tap, in turn, is decomposed into L paths as where the a(ϑ l , f ) ∈ C M are the steering vectors for the AoD of the l-th path ϑ l , g l,k ∼ N C (0, σ 2 l ) and τ l are the complex channel gain and the relative delay corresponding to the l-th path, respectively, and p rc (t) is the transmit filter as sociated with the Digital to Analog Converter (DAC) sampled with time interval T s . We as sume that the channel gains for different paths are statistically independent, i.e., E[g l g * l ] = σ 2 l δ(l − l ). Regarding the steering vectors, we consider a ULA configuration together with the so-called beam squint effect that arises for the typical antenna apertures in massive MIMO and large relative signal bandwidths B/ f c [11]. We express this effect as a rotation of the steering vectors with respect to the central frequency f c , i.e., a(ϑ l , f ) = Γ(ϑ l , f )a(ϑ l ), where a(ϑ l ) ∈ C M is the steering vector for ϑ l in a ULA, whose elements are [a(ϑ l )] m = e −j 2π λ d(m−1) sin ϑ l , m = 1, . . . , M, with d is the distance between two consecutive array elements, λ is the carrier wavelength, and Γ(ϑ l , f ) is a diagonal matrix with the beam squint rotation coefficients. Under the common as sumption of inter-antenna distance d = c 2 f c , the elements of Γ(ϑ l , f ) read as where ∆ f B N c is the frequency offset of the f -th subcarrier with respect to f c , and ∆ f = f − N c −1 2 is its relative position to the central one. This frequency-dependent rotations allows for exploiting the common structure of the wideband covariance matrices, since the beam squint effect for the subcarriers is isolated as a rotation of the frequency-independent vectors a(ϑ l ).
Combining Equations (1) and (2), the channel frequency response for the f -th subcarrier reads as where N is the number of delay taps and β l ( f ) = ∑ N−1 n=0 p rc (nT s − τ l )e −j2π f n/N c . Since the actual AoD of the channel are unknown, the linear combination of steering vectors in (5) can be approximated by means of the steering vectors as sociated with G predefined angles θ 1 , . . . , θ G comprised in a frequency-dependent dictionary where g k ∈ C G is a vector with L non-zero entries in the positions corresponding to the AoDs, i.e., . Note that equality holds for (6) if the AoDs lie on the dictionary. According to the assumptions regarding the channel statistics, we obtain h k ( f ) ∼ N C (0, C h( f ) ). Moreover, we use the approximation in (6) yielding with C h( f ) being Toeplitz, andG( f ) = B( f )GB H ( f ) the effective gains for the f -th subcarrier.
In the following, we will consider that the AoDs belong to the dictionary, i.e., we as sume The consequences of this as sumption are discussed in Section 4.4.

Channel Observations
We aim to estimate the channel response using hybrid architectures to reduce the power consumption and implementation costs. This is achieved by using a number of RF chains significantly lower than the number of antennas, i.e., N RF M. As a counterpart, the feasible hybrid precoding vectors are restricted to the subspace imposed by the hardware limitations. Considering hybrid precoding, the signal received by the users during the k-th training block in the f -th subcarrier is where ×M is a matrix whose rows are the training vectors for each of the T tr training channel uses, such that . , x T tr ( f )) contains the transmitted pilot symbols in the f -subcarrier satisfying |x t ( f )| = 1, and P T = [p 1 , . . . , p T tr ] is the hybrid precoder. Therefore, each of these training vectors is the result of the product where p t is the configuration of the precoders during the t-th training period of k-th training block, and P A,t ∈ C M×N RF and p D,t ∈ C N RF denote the analog and digital precoding matrices. Recall that P A,t is flat in the frequency domain. Hence, it is not possible to design independent training sequences for each subcarrier.
After the transmission of the T tr vectors and the subsequent feedback stage, the channel observation ϕ k ( f ) of the k-th training period will be available at the BS. Since the channel and the noise vectors are statistically independent, the covariance of the observations is given by By considering the channel covariance matrix model in (7), we rewrite the previous equation as where we introduce the measurement matrices When the K samples of the random variable of (8) are available at the BS,

Training Design
This section is devoted to the design of training sequences to estimate the channel covariance, together with bounds on their length to guarantee its correct identification under the assumption of a Toeplitz structure.
Toeplitz covariance matrices of size M × M have M different coefficients and a training sequence of length T tr allows for defining (T 2 tr + T tr )/2 equations. This establishes the following trivial lower bound on the training length: T tr ≥ √ 2M + 1/4 − 1/2. A necessary condition to achieve this bound is that the training sequence should allow for measuring the correlation between pairs of antennas for any separation. This has a strong connection with complete sparse rulers, defined mathematically as an ordered integer sequence of with T elements R = (r 1 , . . . , r T ), such that r 1 = 0 and any z ≤ r T can be written as z = r i − r j for some r i , r j ∈ R. In this case, we speak of a complete sparse ruler of length r T , and a perfect ruler is a complete ruler that contains the minimum possible number of elements for a given r T . It has been shown that a training sequence built from a perfect sparse ruler is sufficient for determining Toeplitz covariance matrices [32,33]. In that case, the length of the training sequence is equal to the number of elements of the ruler, i.e., T tr = T, and the last element of the ruler is r T = M − 1.
In the following, we summarize some results found in the literature for sparse rulers. First, there exist bounds on the size of perfect rulers for a given length [39]. Theorem 1. (see [39]) Denoting l(r) as the minimum number of marks of a complete sparse ruler of length r, lim r→∞ l 2 (r) r exists and Hence, it is possible to identify a Toeplitz covariance matrix using sparse rulers with a training length that satisfies √ 2.434M ≤ T tr ≤ √ 3.348M when M approaches infinity.r. Utilizing the bounds provided by (12), we evaluate the efficiency of the training sequences employed in previous works on AoD estimation [30,31]. Let us start by introducing a trivial example of a complete sparse ruler of length m 2 , for any arbitrary m. Using the 2m − 1 marks (0, 1, 2, . . . , m − 1, 2m − 1, 3m − 1, . . . , m 2 − 1) provides a length of m 2 − 1 different differences. This complete ruler achieves a value of lim m→∞ (2m−1) 2 m 2 = 4, larger than the bound of (12). A similar result is obtained with the use of coprime arrays [30,31]. This method is based on two co-prime numbers p and q, i.e., mcd(p, q) = 1, and on generating the ruler as the set {mp : 0 ≤ m < q} ∪ {nq : 0 ≤ n < p}. This method achieves a training length of T tr = p + q and allows for identifying pq different lags.
Thus, the limiting function reads as Hence, coprime arrays cannot improve the asymptotic efficiency of the previous trivial complete ruler.
In the literature, we can find some techniques to build sparse rulers for arbitrary lengths satisfying the bound in (12). The following section summarizes a method to obtain rulers that are close to that bounds with very low computational complexity.

Wichmann Rulers
Wichmann [36] proposed a direct method to obtain sparse rulers of any size that, denoted as the differences between consecutive terms, have the form for some positive integers r and s such that 2r − 2 ≤ s ≤ 2r + 4. The number of elements of this ruler is l(r, s) = 4r + s + 3 and its length n(r, s) = 4r 2 + 8r With a change of variable s = s + (2r − 2), the length expression changes to The parameter r roughly adjusts the length to intervals starting at n lo (r). Then, for each r value, the s parameter finely adjusts length in steps of size 4r + 3. This allows for finding parameters r and s that generate a sparse ruler whose length is as close as possible to a given value.
Finally, the ruler can be completed up to the desired length, relying on the following observation: it is always possible to build a ruler that contains all differences in the interval [a, b] with a set of the form R = (0, 1, 2, . . . , The cardinality of this set is at most 2c + 3. In this case, given a Wichmann ruler R = (r 1 , . . . , r l(r,s)+1 ) of length n(r, s), and a set R that contains all differences in the interval [n(r, s), M], a new ruler extended to length M can be found by R ∪ R .

Training Based on Sparse Rulers
A training sequence to estimate a Toeplitz covariance matrix can be built from a sparse ruler. In that case, the length of the ruler must be equal to M − 1, the number of correlation coefficients. Given a sparse ruler of length M − 1 with elements (r 1 , . . . , M − 1) ∈ Z T tr , a training sequence for a Toeplitz covariance matrix can be designed as , where e i is the M-length indicator column vector of zeros and a one in the i-th element.
In the case of Wichmann rulers, it is not possible to obtain them for arbitrary lengths, but instead a length close to a given value. Hence, to obtain a ruler close to length M − 1, the r parameter can be found as the largest such that n lo (r) ≤ M − 1, which is r * = √ 12M+33− 3 12 . The other parameter is the largest s such that n(r * , s ) . The length of the ruler can be completed up to M − 1 by the method detailed in the previous section. Since the Wichmann rulers always contain entries from 0 to r, and the s parameter increases length in steps of 4r + 3, this completion procedure will add always at most 4r+3 r = 5 marks to the ruler, and the minimum training period will be in the interval T tr ∈ [l(r * , s * ), l(r * , s * ) + 5]. Figure 1 shows the training length obtained with this strategy compared to the upper and lower bounds in (12). It also shows the length achieved with the coprime arrays strategy. In the following, we show that the mild as sumption N RF ≥ 2 is enough to obtain the sparse ruler training sequence T( f ) according to (9). Let φ r , r = 1, . . . , N RF be blueeligible arbitrary phases for the PSs in the analog processing network, and P T A,t = [p A,t,1 , . . . , p A,t,M ], with the p T A,t,m ∈ C N RF is the m-th row of the analog precoding matrix for the t-th channel training use. Since the desired training based on sparse rulers must satisfy with the digital part as p D,t ( f ) = 1 2 e −jφ 1 [1, 1, Note that only two elements of p T A,t,m are employed to achieve P A,t p D,t ( f ) = e r t . Indeed, N RF = 2 is enough to apply (15).

Covariance Matrix Estimation
In this section, we propose algorithmic solutions to estimate the downlink covariance in wideband massive MIMO systems under the beam squint effect. It is apparent that per-subcarrier (or per-sub-band) covariance estimation is not a practical approach since it does not exploit the structure of the wideband channel. Furthermore, system scalability is compromised when the number of subcarriers is large. The proposed method is based on the estimation of the subspace spanned by the steering vectors corresponding to the L AoD of the channel propagation paths. Well-known estimation methods like MUSIC have been employed to estimate a single signal subspace. However, the beam squint effect results in different subspaces for the channel covariance matrices at distinct frequencies.
Resorting to the rotation matrices in (4), we extend MUSIC to the considered setup.

Subspace Estimation
To estimate the subspace, let us start by computing the eigendecomposition of the sample covariance matricesĈ ϕ( f ) = U( f )Λ( f )U H ( f ) for all the frequencies f . Given that the covariance converges to the second order moment of the channel for a large number of samples, we havê C ϕ( f ) → C ϕ( f ) when K → ∞. Next, the eigenvectors corresponding to the L largest eigenvalues are discarded to obtain the basisŪ( f ) that spans the noise subspace of the f subcarrier. Due to the beam squint effect, the relationship among dictionary matrices for different frequencies is nonlinear. Therefore, it is not feasible to perform this step jointly for all the subcarriers unless the relative signal bandwidth B/ f c is very small. A remarkable feature of this subspace discrimination is its lack of dependency with the SNR, as long as the noise is spatially white. Under such as sumption, for the approximation in (11), we get Therefore, it is crucial to force this condition by applying a pre-whitening filter, if necessary. According to this observation, the search is performed in the effective subspace, contrarily to OMP-based methods like Covariance OMP (COMP) [34]. Thus, we identify the support S = {θ s 1 , . . . , θ s L }, with s i ∈ {1, . . . , G} ∀i, with the following wideband estimator function where ψ i ( f ) = T( f )a(θ i , f ). Note that this function greatly increases the robustness of the decision, compared to a per-carrier estimator because it exploits the common structure for all the frequencies, given that the vectors a(θ i , f ) are generated from a common dictionary as shown in (3). Finally, the support S is identified as the L largest values of J i .

Estimating Gain Variances
Once the angles are determined, we construct the matrices Recall that the path delays τ l of (2) are unknown, as well as the matrices B( f ) in (7). Therefore, we aim at estimating the per-carrier gain variancesG( f ) in (7). In particular, we denote byĜ S ( f ) the estimated gain variances as sociated with the angles in the support S. Then, it is possible to use the estimator in [40] This solution clearly depends on the rank of Ψ S ( f ), and cannot be applied in the following situations: 1. L > T tr . This scenario corresponds to short training sequences compared to the number of propagation paths. Under this as sumption, the pseudo-inverse Ψ † S ( f ) cannot recover the L variance gains. 2. L ≤ T tr . In this case, the rank of Ψ S ( f ) depends on the choice of the dictionary matrix, A, for a proper training matrix T. Recall that the dictionary contains non-orthogonal vectors when M is finite. When the distance between two consecutive angles in the dictionary is small, a situation that arises when G is large, this may lead to dictionary matrices such that rank(A) ≤ min{M, G}, To circumvent (2), one possibility is to reduce the dictionary size (see Section 4.4). Alternatively, we propose to exploit the diagonal structure ofG( f ) and estimate it as with • the Khatri-Rao product (or the column-wise Kronecker product). The product Ψ * S ( f ) • Ψ S ( f ) produces L-rank matrices when the columns of Ψ S ( f ) are not co-linear (see Appendix A).
Regarding the channel gain variances, it is again possible to exploit the channel structure to increase the estimation accuracy. Consider the relationship between the values of β l ( f ) across frequencies, in particular, and, denoting f = − f + N c , we obtain β l ( f ) = β * l ( f ). Recall now that the L non-zero diagonal elements ofG( f ) are β l ( f )σ l β * l ( f ) = |β l ( f )| 2 σ l , l = 1, . . . , L. This implies thatG( f ) =G( f ), and we can increase the robustness of the estimated gain variances in (18) Observe that not all the gains can be calculated by means of the latter expression. In the case of odd N c , f = 0 has to be determined using (18), whereas f = 0 and f = N c /2 have to be computed via (18) for N c even.

Dictionary-Rotation Wideband Spatial Smoothing
Note that Algorithm 1 requires rank(Ĉ ϕ ( f )) > L, i.e., the number of estimated paths (or angles) is bounded by the number of training periods and the number of observations, i.e., L ≤ min{T tr , K}. Solutions to address rank deficient scenarios have been proposed in [41,42]. In particular, these solutions apply when rank(Ĉ ϕ ( f )) < L. Nevertheless, in the proposed scenario, it is desirable to reduce the training sequence length T tr as much as possible. Unfortunately, we obtain a more involved setup for rank(Ĉ ϕ ) = T tr < L. This scenario can be addressed by using spatial smoothing.
Spatial Smoothing (SS) is an array processing technique based on nested arrays with different antenna separation [43]. An effect similar to that of the nested arrays is obtained thanks to the use of sparse rulers [44]. Again, the proposed methods addressed the identification of signals lying in a single subspace, and we have to resort to dictionary rotations to exploit the wideband channel structure.
We start by introducing the sample covariance matrix vectorization y( f ) = vec(Ĉ ϕ( f ) ), i.e, where e = [e T 1 , . . . , e T T tr ] T . Observe that, due to the utilization of a perfect sparse ruler training T( f ), (Ψ * ( f ) • Ψ( f )) contains the products corresponding to the spatial differences z ∈ {−M + 1, M − 1}. Conventional SS in [44] discards the repeated distances. However, this means employing 2M − 1 elements out of T 2 tr from y( f ) in (22). Hence, we propose to refine the estimation by averaging over all the samples from the same inter-antenna distance. To that end, let us introduce y j z ( f ), j = 1, . . . , J z , z = −M + 1, . . . , M − 1 as the j-th sample of the z-th difference in y( f ). Note that J z represents the number of samples available for the z-th difference, and its values depend on the sparse ruler structure and T tr . Next, we introduce the reduced vectory( f ) ∈ C 2M−1 with sorted elements [y( f )] z = 1 J z ∑ J z j=1 y j z ( f ) corresponding to the 2M − 1 distances. Accordingly, we define the matrix Z( f ) containing 2M − 1 sorted rows from (Ψ * ( f ) • Ψ( f )). Thus, the reduced vector iš These differences iny( f ) are next seen as a phase shift of the M differences to be estimated. Considering M overlapping subarrays, such that the m-th subarray comprises the differences Due to the beam squint effect, (23) has to be computed individually for each subcarrier f . Nevertheless, we desire to achieve the benefits of exploiting the structure of the wideband channel.
In this regard, we employ of Algorithm 1. Then, following Algorithm 1 steps, the L eigenvectors corresponding to the L largest eigenvalues are discarded to obtain an incomplete basisŪ( f ) spanning the noise subspace. Hence, the wideband estimator incorporating the dictionary rotation capability of line 11 is updated as where θ i is the i-th angle contained in the dictionary matrix A. To estimate the variance gains, we follow a similar procedure as that in Section 4.2 to derive the robust estimator This procedure replaces lines 16 to 18, and completes the posed DR-W-SS method.

Dictionary Size
In this section, we provide some insight regarding the choice of the dictionary size G. Let us start by determining a sufficient condition for unique angular identification. To that end, we first define the Kruskal rank of a matrix X, krank(X). If krank(X) = x, then x is the maximum number such that any x columns of X are linearly independent. Theorem 2. In the noiseless scenario, and assuming AoD lying on the dictionary A( f ) and rank(Ĉ ϕ( f ) ) = L, the inequality L < krank(Ψ( f )) guarantees the unique identification of the L AoD [41]. See [45] for the proof.
Therefore, when krank(A) is small, the probability of identifying false directions increases. To illustrate the dependence with G, we compute in Figure 2 an upper bound of the Kruskal rank for G and M. This upper bound is computationally feasible and takes into account only z consecutive columns, a reasonable approximationdue to the structure of A. As shown in the figure, large G values complicate the AoD detection, which suggests to reduce the dictionary size. However, this strategy entails a power leakagedue to basis mismatching, that is, a performance lossdue to off-grid AoDs. However, the basis mismatching is moderate in practical settingsdue to the finite angular resolution of the ULA. For instance, a typical inter-antenna spacing d = λ/2 entails an angular resolution of about 2 M radians [46], leading to G ≈ ∆ϑπ 360 o M for an angular range ∆ϑ. That is, G ≈ M provides enough resolution to cover the sector ∆ϑ = 120 o . In addition, an alternative to increasing the dictionary size is to overestimate the number of channel propagation paths L. This strategy is explored in Section 6.2.

Delay-Gain Channel Estimator
As a means to estimate the channel, we propose a (DG) Delay-Gains estimator that exploits the structure of the wideband channels. This method employs the observations for all the subcarriers ϕ k ( f ) to estimate the frequency independent gains of each propagation path g k (cf. (6)). Thus, even in the case where certain frequencies present very poor SNR, their corresponding channels can be estimated by using the information from the remaining frequencies.
The covariance estimation procedure provides the measurement matrix Ψ S ( f ) and the estimated gain variancesĜ S ( f ) for all the frequencies. Using this information, we estimate the delay matrices B( f ) and the gain vector g k from the observations gathered in the covariance estimation stage. Next, a robust low complexity estimator is calculated for the frequency selective channels.
First, notice that the observations in (8) can be rewritten as . . , β s L ( f )) and g S,k contains the gains for the positions s 1 , . . . , s L of g k . Then, we focus on the unknown vector ζ k ( f ) = B S ( f )g S,k that contains the channel gains affected by the shaping pulse. Accordingly, the linear Minimum Mean Squared Error (MMSE) estimator of ζ k ( f ) reads aŝ Onceζ k ( f ) is determined, we have to identify the matrices B S ( f ) before performing the estimation of g S,k jointly using the vectorsζ k ( f ) for all the frequencies. Since B S ( f ) contains samples of the shaping filter p rc (·) in the frequency domain, we obtain B S ( f ) by estimating the delays corresponding to the L channel paths. In particular, to determine delay for the l-th path τ l , we collect the entries as sociated with the l-th path for all the frequencies, i.e., [ζ k ( f )] l for f = 1, . . . , N c , in a vector. Then, we multiply this vector times the first N columns of the Discrete Fourier Transform (DFT) matrix, F ∈ C N c ×N , that is Observe now that ζ l,k = g l,k p rc (τ l ) + w k contains the aforementioned N stacked pulse samples p rc (τ l ) = [p rc (−τ l ), p rc (T s − τ l ), . . . , p rc (N − 1)T s − τ l ] T , and the estimation noise w k ∼ N C (0, σ 2 w I N c ). Unfortunately, these samples are scaled by the unknown complex-valued channel gain g l of (2). Note that we assume that the error is statistically independent for different frequencies.
Hence, for given shaping filter p rc (t), we propose the following estimator function for τ where p rc (τ) = [p rc (−τ), p rc (T s − τ), . . . , p rc (N − 1)T s − τ ] T contains the pulse samples for the delay τ. Unfortunately, D(τ) is a non-monotonic function. Thus, we minimize D(τ) over τ, e.g., by linear search over [0, (N − 1)T s ] to obtain the delay estimateτ l . In the high SNR regime, it is enough to sound within the interval corresponding to the two samples of p rc (τ) with larger absolute values.
Recall that the channel gains are multiplied times B S ( f ) in the unknown vector ζ k ( f ) of (24). Thus, employing the estimated delaysτ l , we computeB S ( f ) for all frequencies (see (6)). To eventually estimate the channel gains g S,k , we compute the minimum variance unbiased estimator [47] using information from all the frequencies, i.e.,ĝ S,k =B †ζ Finally, by invoking (6), the estimated channel yieldŝ

Simulation Results
The following setup is considered for the numerical experiments. We as sume a training sequence T generated from a Wichmann ruler of length T tr . The number of transmit antennas is M = 200, while we consider a hybrid architecture using N RF = 2 RF chains, N c = 64 subcarriers, and a dictionary of size G = 400 with equally spaced angles within the range [0, π]. We consider the channel covariance model of (6) with L propagation paths uniformly distributed in [0, π]; different number of training periods K, and N = 8 random delay taps uniformly distributed in [0, N − 1]T s , with T s = 1/B. The numerical results are averaged over 200 channel covariance realizations for independent users, and over the different subcarriers N c . To evaluate the accuracy of the covariance identification, we employ the Normalized Mean Squared Error (NMSE) metric defined as NMSE = with the Frobenius norm · F . In the case of channel estimation, we use the efficiency η ∈ [0, 1] given by η = |ĥ H h| ĥ h , which is similar to the figures of merit in e.g., [26,34,48]. This insightful metric evaluates the signal power lost by beamforming using the estimated channels. Note that this metric is averaged over the number of channel blocks.
As a benchmark, we include an LS (known θ) curve that individually estimates the gain variances for each subcarrier using (18) as suming that the channel angles are known. That is, it does not exploit the symmetry with respect to the central frequency of (19) to estimate the variance gain. Figure 3a shows the NMSE obtained with the proposed algorithms. The training sequence length is set to T tr = 25 and the channel parameters are adjusted according to massive MIMO settings with f c = 5 GHz, B = 20 MHz, L = {15, 30} channel propagation paths, and a SNR of 15 dB. Using the same training sequence, we compare our methods with COMP in [34], which supports wideband estimation in the absence of beam squint. For L = 15, the proposed methods achieve better performance than COMP when the number of training blocks K is large, but the accuracy of the estimation deteriorates when K < 20. This behavior is motivated by the lack of similarity between the sample covariance matrixĈ ϕ( f ) and the actual channel covariance matrix C ϕ( f ) that makes difficult the discrimination of the signal and noise subspaces (see Algorithm 1). Nevertheless, values of K above 50 are feasible in practical settings, since K is not restricted by the channel coherence time. We also include the case of L = 30 channel paths in Figure 3a. Under such as sumption, COMP and DR-W-MUSIC do not applydue to the lack of angular sparsity, and the limited rank of the sample covariance matrices, respectively. Conversely, the DR-W-SS approach performs close to the benchmark with known angles, denoted by LS (known θ). 5   To evaluate the performance impact of the training sequence length T tr , we fix the parameters to M = 100, L = 50, SNR = 15 dB and K = 50 in Figure 4a. The training sequence is shorter than the number of paths and takes the values 19, 25, 32, 40, 50, to show the capability of DR-W-SS in the challenging scenario L > T tr . As shown, the proposed method performs better than the LS benchmark with known support S.  We now evaluate the consequences of neglecting the beam squint effect for the setup considered in Figure 3a, that is, f c = 28 GHz and different signal bandwidths, B = 200, 800, 1200, 2000 MHz. The methods referred to as W-MUSIC consist of Algorithm 1 ignoring the dictionary rotation. Figure 4b exhibits the performance impact of beam squintdue to the lack of a common subspace for different frequencies. As shown, the proposed dictionary rotation is able to exploit the structure even in the case of B = 2000 MHz.

Covariance Subspace Estimation
In the previous numerical examples, the number of paths L was as sumed to be known. Subspace estimation methods, like the one in [49], can be used to estimate L by comparing the differences between consecutive eigenvalues of the sample covariance matrix with a threshold . In the wideband case, we employ the information of all frequencies jointly, to increase the detection accuracy. Moreover, the computational cost over DR-W-MUSIC and DR-W-SS is negligible while it is significant for other approaches not performing a subspace discrimination. Figure 5a shows the robustness against uncertainty in the number of channel propagation paths L. We employ Algorithm 2 in [49] and empirically adjust the parameter for the different estimation methods. For the considered scenario with SNR = 0 dB, we set the threshold as = 0.025 for DR-W-MUSIC and = 0.01 for DR-W-SS. Remarkably, the performance obtained with subspace estimation for DR-W-SS is similar to that of the strategies with known L. In particular, for DR-W-MUSIC, overestimating the number of channel paths L improves the performance. This observation reveals that the support identified by Algorithm 1 with known L was incorrect. Recall that the support S comprises the angles corresponding to the L larger values on the estimator function J i . Under the strong noise conditions considered in Figure 5a, some angles belonging to the covariance support might be ignored (see line 14 of Algorithm 1). Instead, step 14 selects angles in the dictionary whose steering vectors are linearly dependent. Thus, overestimating L enables a better detection of the channel covariance subspace where angles with lower J i are included in S. In addition, we compare the effect of discarding the repeated distances in DR-W-SS, as performed in conventional SS [44]. In Figure 5a, this strategy is labeled DR-W-SS Discard. As shown, for T tr = 50, the performance loss is significant. In Section 4.4, we discussed the advantages and practical orientation of moderate dictionary sizes. Nevertheless, we investigate the effect of off-grid AoDs in Figure 5b for the mmWave setup and B = 200 MHz. The error floordue to basis mismatch is clear. As an alternative to increase G, we propose to compensate the basis inaccuracy by overestimating the number of channel paths L. This way, part of the power leakage is recovered with the additional AoDs. These approaches are denoted DR-W-MUSIC Overestimated L and DR-W-SS Overestimated L. For the LS estimator with known angles, we choose the closest dictionary angles to the real ones. Figure 6a shows the performance of the channel estimator with the covariance identified and the two different channel estimation methods: LMMSE [21,25,47] h

Wideband Channel Estimation
and the proposed DG estimator of (27). The number of channel blocks is set to K = 100, and the number of delay candidates τ considered in (26) is 20 N. Since the proposed DG method uses the information of all the subcarriers to produce the estimates, it is very robust against noise and results in a considerable performance gain. Indeed, more than 80% of the signal power is captured at −10 dB with the proposed strategies.  Table 1 summarizes the computational complexity of the proposed methods. Recall that a fast covariance estimation is important, but not as critical as the actual channel estimation. Indeed, we only employ a training stage contrarily to other approaches like [21,37].

Computational Complexity
For the favorable scenario, DR-W-MUSIC(a) and DR-W-SS(a), the gain variances are estimated using well-conditioned measurement matrices, while (18) with the Khatri-Rao product is considered for MUSIC(b) and DR-W-SS(b). The complexities have been calculated as suming T tr > L for DR-W-MUSIC and COMP. It is noticeable that neither DR-W-MUSIC nor COMP depend on the number of antennas, although both of them linearly depend on the dictionary size G. The number of channel taps N is also relatively small, and T is the number of τ candidates evaluated to estimate the delays using (26). Since COMP and DR-W-MUSIC apply to scenarios where L and T tr are very small compared to M or G, the complexity of the latter is smaller in general. Furthermore, the number of steps performed by COMP is larger, and its computational burden depends on the number of training periods K.

DG
O N c (LT tr + ML + L 2 ) LMMSE (28) O(N c MT tr ) When L M, the computational complexity virtually depends on the number of antennas. Hence, DR-W-SS and DR-W-MUSIC establish a trade-off among T tr , L and M, since we could use the less complex option by adjusting T tr and K. The effect of increasing L is depicted in Figure 6b. Interestingly, in the region where the channel is sparse and Dynamic COMP (DCOMP) applies, L < 10, the computational complexity of the proposed approaches is lower than that of DCOMP.
We now differentiate between the operations computed once using the estimated covariance matrix, referred to as "Computing Estimator", and the channel estimation for each observation, referred to as "Channel Estimation". Remarkably, the complexity of computing the proposed estimator (DG) linearly depends on M, cf. (24)- (26) whereas for the LMMSE estimator this dependency is quadratic.
Regarding the actual channel estimation, in scenarios suitable for DR-W-MUSIC with L < T tr , DG channel estimation is more efficient. This applies to mmWave frequencies and other massive MIMO scenarios, e.g., [8,37,[50][51][52][53]. Conversely, LMMSE estimator might be less complex for large number of channel propagation paths, i.e., L > √ M.

Conclusions
In this work, we proposed a covariance and channel estimation method for FDD wideband hybrid massive MIMO systems. The numerical experiments showed the good performance of the posed covariance estimation approaches compared to other methods in the literature, even when considering very short training sequence lengths. In addition, we developed a robust channel estimator with computational complexity depending linearly on the number of antennas. Both facts contributed to a reduction in the channel training overhead. Finally, we evaluated the performance of the proposed scheme for a wide variety of scenarios. The extension to Uniform Planar Array (UPA) antenna arrangements, and the use of shorter training sequences is an interesting research line for future works. Funding: This work has been funded by Xunta de Galicia (ED431G2019/01), Agencia Estatal de Investigación of Spain (TEC2016-75067-C4-1-R, RED2018-102668-T, PID2019-104958RB-C42), and ERDF funds of the EU (AEI/FEDER, UE).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
To prove the linear independence, we consider two cases: (1) If the columns of Ψ S ( f ) are linearly independent, the same property holds for the columns of Ψ * S ( f ) • Ψ S ( f ). (2) Consider that the i-th column of Ψ S ( f ), ψ S i for notation simplicity, is a linear combination of the columns whose indices are in the set T ⊆ S, i.e., ψ S i = ∑ |T |