Exploring Efficient Acceleration Architecture for Winograd-Transformed Transposed Convolution of GANs on FPGAs

Abstract: The acceleration architecture of transposed convolution layers is essential since transposed convolution operations, as critical components of the generative model in generative adversarial networks (GANs), are inherently computation-intensive. In addition, the pre-processing of inserting and padding zeros into the input feature maps causes many ineffective operations. Most known FPGA (Field Programmable Gate Array) based architectures for convolution layers cannot tackle these issues. In this paper, we first propose a novel dataflow exploration that splits the filters and their corresponding input feature maps into four sets and then applies the Winograd algorithm for fast, highly efficient processing. Second, we present an underlying FPGA-based accelerator architecture featuring processing units with an embedded parallel, pipelined, and buffered processing flow. Finally, a parallelism-aware memory partition technique and the hardware-based design space are explored in coordination, for the required parallel operations and the optimal design parameters, respectively. Experiments on several state-of-the-art GANs using our methods achieve an average performance of 639.2 GOPS on the Xilinx ZCU102 and 162.5 GOPS on the Xilinx ZC706. In reference to a conventional optimized accelerator baseline, this work demonstrates an 8.6× (up to 11.7×) increase in processing performance, compared to below 2.2× improvement by prior studies in the literature.


Introduction
With the wide application of deep neural networks [1], several convolution-based generative adversarial networks (GANs) [2][3][4] have emerged to accomplish computer-vision tasks such as image generation/synthesis [5][6][7][8] and 3D object modeling [9,10]. We observe that transposed convolution, a domain-specific convolution kernel, is the primary operation in the generator component of GANs, while ordinary convolution is used in the other component of GANs, the discriminator. Abundant prior work accelerated the convolution operation of several CNN models on field programmable gate arrays (FPGAs) [11][12][13][14][15][16]. More recently, a few works have considered increasing the performance of transposed convolution [17][18][19][20][21]. Transposed convolutions are usually implemented directly in the conventional convolution form. Unfortunately, such an equivalent convolution-based dataflow causes more than 70% ineffective operations. GNA [17] and Zhang's work [18] addressed this issue with a distinct computing strategy, the so-called Input-Oriented Mapping (IOM) method, which avoids expanding the sparsity of the matrix at the origin. Following Eyeriss [22], which focused on common convolution, the authors of [19][20][21] concentrated on removing all the invalid operations in the sparse matrix multiplication.
Differing from the works mentioned above, this work explores the fast Winograd algorithm for efficiently deploying the transposed convolution layers of GANs on FPGAs. The fast Winograd algorithm provides an efficient means of transforming the 2D convolution operation into an element-wise matrix multiplication (EWMM) to reduce the computational complexity. In the Winograd domain, multiplications by the elements of a constant transformation matrix reduce to addition and shift operations, and implementing additions and shifts with logic units such as look-up tables (LUTs) and flip-flops (FFs) on the FPGA incurs a lower computational cost. Several reported implementations [23,24] have demonstrated higher performance by deploying Winograd-transformed common convolutional layers on FPGAs. To the best of our knowledge, this is the first work to extend the fast Winograd algorithm to the implementation of the transposed convolution layers of GANs on FPGAs.
This paper makes the following contributions: • We present a novel dataflow exploration, termed Wino-transCONV, which eliminates all the computations involving zero values and implements the transposed convolution operation by adding DECOMPOSITION and REARRANGEMENT stages to a regular Winograd-transformed convolution dataflow. This dataflow optimization allows a significant reduction in computational complexity.

• We propose a custom architecture that efficiently implements transposed convolution layers by mapping the multi-stage Wino-transCONV dataflow to an FPGA device with pre- and post-processing in place. The design comprises processing units and buffers operating in a combined pipelined and parallel fashion.

• We devise a parallelism-aware memory partition to achieve efficient data access. Meanwhile, the optimal hardware-based design space is explored by analyzing the efficiency of resource allocation and the performance of computational execution.

• Several state-of-the-art GANs are verified on Xilinx FPGA devices using the approaches proposed in this paper.
The rest of this paper is organized as follows. Section 2 provides a brief description of concepts regarding the fast Winograd algorithm and transposed convolution; the dataflow exploration is also highlighted in this section. Section 3 presents the custom architecture design optimized and implemented on FPGAs. Section 4 discusses the experimental verification results. Section 5 concludes this paper.

Transposed Convolution Dataflow Basics
The generator component of GANs takes noise as input and generates samples, which involves a vast amount of transposed convolution operations. Figure 1 illustrates a representative data processing flow of the generative model of a GAN used for the large-scale scene understanding (LSUN) challenge scene modeling [4]. Four layers of transposed convolutions (also known as fractionally-strided convolutions) are connected in series to convert the random noise representation into a high-resolution image. This paper explores how to boost the processing capability of transposed convolution layers on FPGAs in a resource-efficient manner. Similar to a general convolutional layer, a transposed convolution layer carries out feature generation by applying N groups of filters (each group owning C×K×K filters) to C channels of two-dimensional input feature maps (inFM, C×W×H) and then outputting N channels of two-dimensional output feature maps (outFM, N×2W×2H), as shown in Figure 2. As detailed in Figure 3, the transposed convolution can be mapped onto the conventional convolution dataflow after the inFM is inserted and padded with zeros where appropriate to obtain the expanded input feature map (EX-inFM).
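The equivalence sketched in Figure 3 can be reproduced numerically. The snippet below is an illustration under simplified conventions, not the paper's code; it keeps the uncropped "full" output rather than the exact 2W×2H outFM. It builds the EX-inFM by inserting and padding zeros and checks the result against a scatter-based transposed convolution:

```python
import numpy as np

def transposed_conv_scatter(infm, f, stride=2):
    """Reference transposed convolution: every input pixel scatters a scaled filter."""
    w, h = infm.shape
    k = f.shape[0]
    out = np.zeros((stride * (w - 1) + k, stride * (h - 1) + k))
    for i in range(w):
        for j in range(h):
            out[stride*i:stride*i + k, stride*j:stride*j + k] += infm[i, j] * f
    return out

def transposed_conv_via_ex_infm(infm, f, stride=2):
    """Same result via the EX-inFM route: insert zeros, pad the border,
    then slide the flipped filter as a plain convolution."""
    w, h = infm.shape
    k = f.shape[0]
    ex = np.zeros((stride * (w - 1) + 1, stride * (h - 1) + 1))
    ex[::stride, ::stride] = infm          # insert zeros between pixels
    exp = np.pad(ex, k - 1)                # pad the border with zeros
    ff = f[::-1, ::-1]                     # convolution flips the kernel
    oh, ow = ex.shape[0] + k - 1, ex.shape[1] + k - 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(ff * exp[y:y + k, x:x + k])
    return out
```

Note that roughly three quarters of the multiply-accumulates in the EX-inFM route touch inserted zeros, which is the source of the more-than-70% ineffective operations discussed above.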

Winograd Transformation Basics
As shown in the equation below, the 2D convolution can be transformed by the fast Winograd algorithm [25]:

OUT = A^T [(G F G^T) ⊙ (B^T IN B)] A

Suppose that IN has a size of n×n (n = m + r − 1), OUT has a size of m×m, F has a size of r×r, and "⊙" symbolizes an element-wise matrix multiplication (EWMM).
Both transformation matrices, G and B, are applied to convert the weight filter (F) and the input feature map (IN), respectively, into the Winograd domain first. Then, in the Winograd domain, the two results are multiplied element-wise before their outcome is converted back by the third transformation matrix, A, to obtain the output feature map OUT. In the example case of m = 2 and r = 3, the Winograd transformation matrices A^T, B^T, and G are defined as follows:

A^T = [1 1 1 0; 0 1 −1 −1]

B^T = [1 0 −1 0; 0 1 1 0; 0 −1 1 0; 0 1 0 −1]

G = [1 0 0; 1/2 1/2 1/2; 1/2 −1/2 1/2; 0 0 1]
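For concreteness, the m = 2, r = 3 instance can be checked numerically with the standard F(2×2, 3×3) matrices from Lavin and Gray's formulation; this is an illustrative sketch rather than the paper's implementation:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transformation matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """OUT = A^T [(G g G^T) (.) (B^T d B)] A, with (.) the element-wise product (EWMM)."""
    U = G @ g @ G.T        # 3x3 filter into the Winograd domain (4x4)
    V = B_T @ d @ B_T.T    # 4x4 input tile into the Winograd domain
    return A_T @ (U * V) @ A_T.T   # 16 multiplications instead of 36

def direct_2x2(d, g):
    """Direct sliding-window computation of the same 2x2 output tile."""
    return np.array([[np.sum(g * d[i:i+3, j:j+3]) for j in range(2)]
                     for i in range(2)])
```

Here 16 element-wise multiplications replace the 36 of the direct method, a 2.25× reduction per tile; on the FPGA, the transform-side multiplications by constants further reduce to shifts and adds realized in LUTs/FFs.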

Wino-transCONV Dataflow Exploration
As discussed above, our purpose is to remove void operations and reduce computational complexity. First, we present the DECOMPOSITION processing to eliminate void operations. As exhibited in Figure 4, there exist four (2×2) computing patterns for a 5×5 filter sliding up, down, left, and right over a 6×6 tile-window of the EX-inFM. Specifically, the upper-left computing pattern correspondingly generates the green-colored grids (also denoted with "×") of the outFM, as depicted in Figures 5a and 6. For the three other computing patterns, the same principle applies to generate the pink, blue, and orange colored grids shown in Figure 5b-d, respectively. The 25 grids (f_{0∼4,0∼4}) belonging to the filter window are divided and distributed across the four sub-filters with irregular numbers of non-zero values (9+6+6+4). Figure 5 illustrates in detail how a transposed convolution dissolves into four equivalent convolutions during the DECOMPOSITION process. More precisely, we pick the non-zero values of the filter and squeeze each sparse matrix into a dense one. This way, one transposed convolution task with a 5×5 filter can be distributed into four common convolution subroutines operating with smaller filters of at most 3×3.
Subsequently, the standard Winograd algorithm can be applied to the four general convolution subroutines to reduce the computational cost. Note that Figure 5 demonstrates the effectiveness of the proposed data-compression pre-processing for applying the fast Winograd algorithm: the filter is picked and squeezed so that only the minimum operations with the inFM remain. This matches the requirement that, for an effective transformation, the fast Winograd algorithm be executed on conventional convolutions with a kernel size not exceeding 3×3 [23,24].
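The picking-and-squeezing above amounts to splitting the filter by the parity of its row and column indices. A minimal numerical sketch (our reading of Figures 4-6; function names are illustrative, and boundary cropping conventions may differ from the actual layers) confirms that interleaving the four sub-convolutions reproduces the transposed convolution:

```python
import numpy as np

def transposed_conv_ref(infm, f):
    """Scatter-based reference transposed convolution, stride 2."""
    w, h = infm.shape
    k = f.shape[0]
    out = np.zeros((2 * (w - 1) + k, 2 * (h - 1) + k))
    for i in range(w):
        for j in range(h):
            out[2*i:2*i + k, 2*j:2*j + k] += infm[i, j] * f
    return out

def conv2d_full(infm, g):
    """Plain 'full' 2D convolution via shifted accumulation."""
    w, h = infm.shape
    a, b = g.shape
    out = np.zeros((w + a - 1, h + b - 1))
    for t in range(a):
        for s in range(b):
            out[t:t + w, s:s + h] += g[t, s] * infm
    return out

def transposed_conv_decomposed(infm, f):
    """DECOMPOSITION: split f by index parity into four dense sub-filters
    (9+6+6+4 taps for a 5x5 filter), convolve each with the whole inFM,
    then REARRANGEMENT: interleave the four partial outputs."""
    w, h = infm.shape
    k = f.shape[0]
    out = np.zeros((2 * (w - 1) + k, 2 * (h - 1) + k))
    for p in range(2):
        for q in range(2):
            sub = f[p::2, q::2]           # dense sub-filter, at most 3x3 for k = 5
            out[p::2, q::2] = conv2d_full(infm, sub)
    return out
```

The four sub-filter sizes fall out of the parity split automatically: for a 5×5 filter they hold 9, 6, 6, and 4 taps, exactly the irregular counts quoted above.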
Eventually, the four intermediate outputs in Figure 6 are alternately rearranged into the final outFM, which is the post-processing shown in Figure 5.

An overview of the proposed Wino-transCONV dataflow is exhibited in Figure 7. In addition to the standard fast Winograd algorithm (S2-S4), two new stages, namely DECOMPOSITION (S1) and REARRANGEMENT (S5), are introduced in this paper. A detailed explanation of DECOMPOSITION is given above (cf. Figures 4 and 5). During S1, the complete K×K filter window, which slides over the (n+1)×(n+1) tile-window of the inFM, is split into four effective sub-filter windows, each of which operates on its corresponding sub-inFM window pattern. S5 is relatively simple: the four m×m intermediate output patterns produced after S4 are alternately rearranged into one 2m×2m outFM. Following the standard fast Winograd algorithm, S2 implements the matrix transformations of both the input feature map and the filter, S3 implements the EWMMs, and S4 implements the matrix transformation of the output feature map.

The computational complexity of the Wino-transCONV dataflow can be measured by the number of multiplication operands that occupy the DSP resources on an FPGA device. For quantitative estimation, the following equations are derived:

Mult_Direct−CONV = (2W) × (2H) × K²

and

Mult_Direct−CONV−eff = W × H × K²

Here, Mult_Direct−CONV represents the number of multipliers needed for the convolution descended from a direct transformation of the transposed convolution [4] when a K×K filter is applied to a W×H inFM. Mult_Direct−CONV−eff corresponds to the number of multipliers needed after a further removal of all the invalid operations [17][18][19][20][21].
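The scale of these savings can be checked with a straightforward per-(input channel, output channel) counting model. This is our own reading, not the paper's exact equations; the Winograd count anticipates the tiling described next, with four sub-convolutions, (W/m)×(H/m) output tiles of size m×m, and n² = (m+2)² element-wise multiplications per tile:

```python
def mults_direct(w, h, k):
    """Transposed conv mapped onto plain convolution over the EX-inFM:
    (2W) x (2H) outputs at K*K multiplications each."""
    return (2 * w) * (2 * h) * k * k

def mults_zero_skipping(w, h, k):
    """After removing all multiplications by inserted zeros, every real
    input pixel meets every filter tap exactly once."""
    return w * h * k * k

def mults_wino_transconv(w, h, k, m):
    """Wino-transCONV: four sub-convolutions with sub-filters of at most
    3x3 (so n = m + 2), tiled into m x m output tiles costing n*n
    multiplications each. Counting model sketched for the 5x5 filter case."""
    assert k == 5
    n = m + 2
    return 4 * (w // m) * (h // m) * n * n
```

For W = H = 16 and K = 5, this gives 25,600 (Direct-CONV), 6,400 (Direct-CONV-eff), and 4,096 (Wino-transCONV, m = 2) multiplications; the roughly 4:1 gap between the first and last is the source of the 70-80% operation reduction once the extra additions are also costed in.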
For our Wino-transCONV dataflow, the number of multipliers required can be determined analogously as

Mult_Wino−transCONV = 4 × (W/m) × (H/m) × (m + 2)²

since each of the four sub-convolutions is tiled into (W/m)×(H/m) output tiles of size m×m, each costing n² = (m+2)² element-wise multiplications. Table 1 lists comparison results for the computing-resource usage of the different acceleration schemes. In the table, the terms mult, add, and total_equiv_add denote the respective numbers of multiplications, additions, and total equivalent additions. Without loss of generality, an L-bit (fixed-point) multiplication can be equivalently dissolved into L additions of the same bit-width; L = 16 is assumed in this analysis. As Table 1 shows, the Wino-transCONV dataflow minimizes the multiplications by replacing them with a reasonable increase in additions. Thus, we advantageously trade DSP resources for LUT-plus-FF resources, which are abundantly available on today's FPGAs, and thereby enable a higher degree of implementation parallelism than the prior studies. In terms of the normalized factor, the Wino-transCONV approach reduces the number of operations by approximately 70-80% compared to Direct-CONV. It also requires almost one-third fewer resources than Direct-CONV-eff. A closer examination of Table 1 further suggests that the transformation properties of Wino-transCONV are not influenced by the size of the inFM, although the reduction effect is more prominent with 5×5 filters than with 4×4 ones.

Custom Architecture Design

Figure 8 illustrates the overall custom architecture design for the proposed dataflow. In the figure, numerous datasets, including inFMs, filters, and outFMs, are transported in batches between on- and off-chip memory via, e.g., the AXI-Stream interface provided by the FPGA device. To overlap the computation time and the data-transfer time, a double line buffer [23] is adopted to realize ping-pong data exchange.
A processing unit (PU) is specifically designed to carry out the execution procedure specified in Figure 8 for a common transposed convolution operation. Multiple PUs are then formed into an array to complete the entire Wino-transCONV dataflow in parallel. A tile-window of the inFM is taken from the line buffer and sent to a relevant PU along with a filter, and the PU generates the result for an outFM. As detailed in Figure 8, each PU consists of a DECOMPOSITION pre-processing module, a Winograd processing element (Wino-PE) module, and a REARRANGEMENT post-processing module, all mapped onto DSP or LUT-register resources on the FPGA, as exhibited in Figure 9. To raise the processing capability at the architecture level, this paper investigates implementation strategies for optimally balancing design conditions such as hardware parallelism vs. network performance. We observe that the transposed convolution admits two kinds of concurrent execution, namely parallel processing among FMs and inside an FM [26]. Moreover, pipelining is an indispensable means of lifting the performance of an accelerator system. The strategies of inter-PU parallelism, intra-PU parallelism, and intra-PU pipelining are used to balance parallelism, peak performance, and resource consumption as follows:

Architecture-Wise Optimization
• Inter-PU parallelism: In the design, each PU is accountable for processing the data from one of the C channels of inFMs to one of the N channels of outFMs. Suppose that P_C and P_N denote the parallel undertakings over inFMs and outFMs, respectively. There are then (P_C × P_N) PUs in total executing their individual operations in parallel.

• Intra-PU parallelism: According to Figure 7, the Winograd processing inside a single PU is responsible for sequentially processing four pairs of decomposed data, i.e., (sub-inFMs, sub-filters). Those four pairs can instead be operated on in parallel, improving the operational speed at the expense of additional DSP blocks and programmable logic resources. It should be conceded that a single PU consuming excessive DSP blocks would in practice reduce the degree of inter-PU parallelism, i.e., force a smaller (P_C × P_N), because the total number of DSP resources available on an FPGA device is always capped.

• Intra-PU pipeline: In a PU, steps S2-S4 can be effectively pipelined according to the Wino-transCONV dataflow of Figure 7, with S1 and S5 excluded because the DECOMPOSITION (S1) and the REARRANGEMENT (S5) of the four pairs of data (sub-inFMs and sub-filters) cannot share one common set of hardware on a time-division multiplexing basis. The element-wise matrix multiplication in the Wino-PE is the only module performed by the DSP blocks, in the form of MAC units. Let ε denote the number of DSP (MAC) blocks allocated to a PU for executing this EWMM in parallel, and suppose that one DSP block carries out a fixed-point multiplication of bit-width η on the FPGA device. To exercise the maximum parallelism, ε = 4 × n × n × (χ/η) DSP blocks would be required, where χ symbolizes the original bit-width dictated by the algorithm itself. In this work, χ and η are chosen to be the same and equal to 16 bits (considering an elementary DSP unit as an 18b×25b multiplier embedded in the FPGA).
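The interplay between ε and the per-PU workload can be sketched with a simple model (our assumption: with ε of the 4n² element-wise products computed per cycle, the EWMM of one tile serializes into 4n²/ε passes):

```python
def ewmm_passes(n, eps):
    """Serial passes a PU needs to cover its four n x n element-wise
    products (4*n*n multiplications per input tile) when eps DSP MAC
    blocks run in parallel."""
    total = 4 * n * n
    assert total % eps == 0, "eps must evenly divide 4*n*n"
    return total // eps

# The six configurations the paper lists for n = 6:
PASSES = {eps: ewmm_passes(6, eps) for eps in (144, 72, 36, 18, 9, 3)}
```

For n = 6 the configurations ε ∈ {144, 72, 36, 18, 9, 3} need 1, 2, 4, 8, 16, and 48 passes, respectively, matching the trend of Table 2: a larger ε shortens the PU latency but consumes DSPs that would otherwise raise the inter-PU count P_C × P_N.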
Regarding the intra-PU parallelism and pipeline of Section 3.1, on the one hand, there exist six possible configurations of concurrency for a typical instance where n = 6, i.e., ε ∈ {144, 72, 36, 18, 9, 3}. As exhibited in Table 2, the larger ε is, the fewer cycles are required to complete the operations, i.e., the latency over a single PU becomes shorter; more EWMM operations are then conducted concurrently, thereby increasing the overall processing throughput. On the other hand, the pipeline technique is applied in conjunction with the parallel strategy. The timing diagrams of a single PU for four typical values of ε are given in Figure 11; they demonstrate a coherent parallel-pipeline scheme for executing all the necessary operations in a PU. The actual value of ε impacts the initiation interval and the iteration latency (II and IL [27,28]) of the pipeline design. II_PU and IL_PU can be estimated from the above analysis as

II_PU = (4n²/ε) × II_S and IL_PU = IL_S + (4n²/ε − 1) × II_S (6)

where II_S and IL_S are the values of II and IL at ε = 4n²; for simplicity, II_S = 1 cycle and IL_S = 5 cycles in Figure 11 are taken as the standard reference measure. II_S and IL_S are further utilized to explore the design space in Section 3.4.

Figure 11. Timing diagrams of four typical parallel implementations for S1-S5 operations in a PU.
The correlation of operational performance vs. ε is characterized on the Xilinx ZCU102 platform and evaluated with the Xilinx tools. As shown in Figure 12, the processing performance in terms of Giga Operations Per Second (GOPS) goes up linearly with ε, while the energy efficiency, denoted as GOPS/W, increases roughly logarithmically with ε. That is, although the processing performance does increase proportionally with the number of DSP blocks working in parallel, the pace at which the energy efficiency rises by aligning more DSP blocks in concurrency gradually slows as ε grows larger. The reason may be that a high degree of parallelism causes lengthy and excessive interconnect wires on the FPGA device; under such conditions, the DSPs' share of the total power consumption also gradually goes up, typically from 1% (when ε = 3) to 40% (when ε = 144). Based on the inter-PU and intra-PU parallelism, the total number of DSP blocks required to realize all (P_C × P_N) PUs on an FPGA may be predicted by

Num_DSP = P_C × P_N × ε
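Since every PU allocates ε DSP blocks and there are P_C × P_N identical PUs, a quick feasibility check against the device budget can be sketched as follows (2520 is the DSP slice count of the XCZU9EG device on the ZCU102 board):

```python
def dsp_required(p_c, p_n, eps):
    """Total DSP blocks for a P_C x P_N array of PUs, each using eps MAC units."""
    return p_c * p_n * eps

def fits_device(p_c, p_n, eps, dsp_cap=2520):
    """Check the PU array against a device budget (default: XCZU9EG on ZCU102)."""
    return dsp_required(p_c, p_n, eps) <= dsp_cap
```

For example, a 4×4 PU array at ε = 144 consumes 2304 DSPs and fits the ZCU102, while a 5×4 array at the same ε would not.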

Memory Sharing-Access and Partition
To make the memory readily accessible, a parallelism-aware partition technique is taken into consideration. The on-chip data of inFMs, filters, and outFMs can be described as respective multi-dimensional matrices (Mats).
In principle, the data in a relevant Mat are partitioned into a certain number of segments so that they can be accessed in parallel, coordinating with the parallel operations required. Table 3 gives the partition explorations of the inFMs, outFMs, and filters, all of which are subject to possible concurrent access in order to maximize the parallelism of the computation. Table 3a estimates the number of segments aligned for parallel access as well as the size of such a segment for each dimension (dim) of the Mat. In addition, it gives the total memory banks required and the volume of each memory bank specified for a layer implementation. Note that the data belonging to one segment form the minimum dataset arranged for serial access. Figure 13 illustrates the process of partitioning the memory requirements into a group of concurrently accessible memory segments more clearly. The benefit of this memory partition measure is that it enables more parallel operations and hence increases the processing performance as well as the energy efficiency.
Two implementation examples are presented in Table 3b,c based on the Xilinx FPGA device integrated on the ZCU102 board, where a total of 1824 18K BRAM banks are available. With our method, the numbers of BRAMs needed by the partition technique are, for instance, 1728 and 1440 for realizing the second layers of DCGAN and EB-GAN, respectively. In practice, to fully exploit the multi-port feature of the BRAM in an FPGA, two data segments can be packed into one single BRAM bank through a true dual-port arrangement, provided that such data segments fit properly; the total number of BRAM banks in Table 3b,c is then halved. Referring to Table 3, we can estimate the number of BRAM banks required as follows: α_in, α_out, and α_f mean that α BRAM banks are required whenever one single BRAM bank does not have enough capacity to store the data for serial access (see Figure 13), i.e., α = ⌈L × S/V⌉ with S the number of data words accessed serially. Specifically, V = 18K and L = 16 in this paper.
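The bank-count estimate can be sketched in a few lines (a hedged reading of the α terms; the function names below are illustrative, not the paper's notation):

```python
import math

def banks_per_segment(words, width_bits=16, bank_bits=18 * 1024):
    """alpha = ceil(L * S / V): BRAM banks needed when one segment's
    serial-access data (S words of L bits) overflow a single V-bit bank."""
    return math.ceil(words * width_bits / bank_bits)

def total_banks(num_segments, words_per_segment, width_bits=16, bank_bits=18 * 1024):
    """Segments accessed in parallel each get their own bank group."""
    return num_segments * banks_per_segment(words_per_segment, width_bits, bank_bits)
```

With the true dual-port packing noted above, two sufficiently small segments can share a single bank, halving the totals this model produces.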

Design Space Exploration
Different FPGA devices offer different combinations of processing resources such as DSPs, BRAMs, LUTs, and flip-flops. The numbers of DSP blocks and BRAM banks required are predicted in Sections 3.2 and 3.3, respectively. Notably, a bandwidth bottleneck emerges when the data-processing time is not well matched to the data-transfer time, compromising the peak performance. In the design space specified by the FPGA device parameters, we need to acquire optimal solutions under the constraints of the algorithm parameters.
Balancing the computation and transfer times is considered in this paper as well, especially when processing C channels of W×(n+1) inFMs and finally outputting 2W×2m outFMs, as expressed in Equations (10)-(12):

T_i_transfer = (16 × C × W × (n+1)) / AchievedBandwidth
T_o_transfer = (16 × N × 2W × 2m) / AchievedBandwidth
T_f_transfer = (16 × C × N × K²) / AchievedBandwidth (10)

T_transfer = T_i_transfer + T_o_transfer + T_f_transfer (11)

Here, T_i_transfer, T_o_transfer, and T_f_transfer are the respective transfer times of the inFMs, outFMs, and filters, and AchievedBandwidth is the bandwidth constraint provided by the FPGA device.
The computation time T_compute is given by Equation (12), where II_PU and IL_PU are characterized in Equation (6). Our goal is to find the minimal T_compute under the premise T_compute ≥ T_transfer, since the peak performance must match the bandwidth. {P_C, P_N, ε, m} are the unknown parameters to be explored. In Equation (6), II_S and IL_S can be obtained via a few small-scale experiments. {W, C, N, K} are parameters of the transposed convolution, while {freq, AchievedBandwidth} relate to the FPGA hardware. Algorithm 1 is devised to explore the optimal solution of {P_C, P_N, ε, m} based on the above analysis.
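Algorithm 1 can be sketched as an exhaustive search. The following is an illustrative reconstruction, not the paper's exact algorithm: the per-strip tile counting and the transfer terms are simplified cost models of our own, standing in for Equations (6) and (10)-(12):

```python
import math

def explore(W, C, N, K, freq_hz, bw_bytes_per_s, dsp_cap):
    """Brute-force the design space {P_C, P_N, eps, m}: minimise a modelled
    compute time subject to compute >= transfer (so the memory bandwidth,
    not the datapath, saturates) and to the DSP budget."""
    best = None
    for m in (2, 4):
        n = m + 2                  # sub-filters are at most 3x3, so n = m + r - 1
        total = 4 * n * n          # EWMM multiplications per input tile
        for eps in [e for e in range(1, total + 1) if total % e == 0]:
            for p_c in range(1, C + 1):
                for p_n in range(1, N + 1):
                    if p_c * p_n * eps > dsp_cap:
                        continue
                    tiles = math.ceil(W / m)   # tiles along one strip of rows
                    cycles = (math.ceil(C / p_c) * math.ceil(N / p_n)
                              * tiles * (total // eps))
                    t_compute = cycles / freq_hz
                    bytes_moved = 2 * (C * W * (n + 1)      # inFM strip, 16-bit data
                                       + N * 2 * W * 2 * m  # outFM strip
                                       + C * N * K * K)     # filters
                    t_transfer = bytes_moved / bw_bytes_per_s
                    if t_compute >= t_transfer and (best is None
                                                    or t_compute < best[0]):
                        best = (t_compute, p_c, p_n, eps, m)
    return best
```

The search space is small (a few tens of thousands of candidate tuples), so exhaustive enumeration is cheap and the bandwidth-matching constraint prunes most of it.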

Experimental Cases for GANs
Table 4 lists the design configurations of the transposed convolution layers of seven typical GAN models. In the table, the label #num denotes the order of the layer in the network. Networks such as DCGAN [4], Disco-GAN [5], Art-GAN [6], GP-GAN [7], and EB-GAN [8] are mainly applied to synthesize images, while 3D-ED-GAN [9] and 3D-GAN [10] fulfill the task of generating 3D objects.

Experimental Setup
To evaluate our design approach, the transposed convolution layers of the state-of-the-art GANs were tested on the Xilinx FPGA platform with Vivado HLS (v2018.2). HLS provides abundant optimization directives for pipelining, parallelism, and memory partitioning through #pragma annotations in the C/C++ code. The RTL representation of the design, in Verilog HDL, can be exported as a Vivado IP core after running C synthesis and C/RTL co-simulation. Finally, the Vivado tool synthesizes the exported RTL code and records the design specifics in the report file. In addition, the XPower analyzer integrated into Vivado performs the power estimation. Two Xilinx devices were adopted in this experiment: the XCZU9EG and the XC7Z045, integrated on the ZCU102 and ZC706 boards, respectively, together with an ARM core, and offering high-speed connectivity through the integrated AXI IP.
We implemented the networks in Table 4 using our techniques. Table 5 lists the parameters of the FPGA devices used in our implementations.

Experimental Results
In this subsection, our experimental results are reported. We adopt the view of [18] on how to compute the GOP count of a transposed convolution. The processing capability of our implementation is therefore defined by Equation (13), where T_prepare denotes the time for loading the first (n+1) rows of inFMs and the first batch of filters into the on-chip BRAM banks. Table 6 shows the experimental results in terms of performance, power, and resource utilization; the average GOPS and average DSP efficiency are also provided. In Table 7, we compare against prior implementations, showing that our accelerator yields a 15× (up to 21×) increase in DSP efficiency over prior work [18]. Notably, neither Zhang's work [18] nor GNA [17] implements real-life GAN models; moreover, GNA's partial study of transposed convolution layers focuses on bit-width flexibility optimization on a TSMC 28 nm ASIC. The meaning of GOPS given by Lu [23] and Zhang [13] differs from ours, since they implemented the general convolutional neural network AlexNet [29]. To further compare with prior work that implements real-life GAN models, we directly mapped the transposed convolution to general convolution using an optimized FPGA-based conventional accelerator as the baseline, the same one employed by FlexiGAN-FPGA [20]. Figure 14 records the speedups (GOPS) of our work and the prior work [20] vs. the Conv-baseline separately. We obtain an 8.6× (up to 11.7×) improvement in processing throughput over the Conv-baseline.

Figure 14. Comparison between our work and the prior work [20] on the speedup ratio against Conv-Baseline.

Conclusions
To address the two issues of ineffective operations and inherent computational intensity when implementing the transposed convolution layers of GANs on FPGAs, we presented the novel Wino-transCONV dataflow together with its corresponding hardware architecture design. A distinct memory partition technique and the hardware-based design space were also explored. Our final implementations of seven state-of-the-art GANs achieve an overall performance of 639.2 GOPS on the Xilinx ZCU102 platform and 162.5 GOPS on the ZC706 platform. The experimental results also show that our accelerator design yields up to 21× improvement in DSP efficiency over prior works. In addition, compared to the best-known work, which delivers 2.2× higher performance than the optimized conventional accelerator baseline, the proposed design achieves an 8.6× (up to 11.7×) increase in processing throughput over the Conv-baseline.