Validation of Deep Learning-Based Artifact Correction on Synthetic FLAIR Images in a Different Scanning Environment

We investigated whether a trained deep learning (DL) model based on a convolutional neural network (CNN) can improve the quality of synthetic fluid-attenuated inversion recovery (FLAIR) images in a different scanning environment. Data from 319 patients, obtained through retrospective review, were used as a test set for the already trained DL model to correct the synthetic FLAIR images. Quantitative analyses were performed for native synthetic FLAIR and DL-FLAIR images against conventional FLAIR images. Two neuroradiologists assessed the quality and artifact degree of the native synthetic FLAIR and DL-FLAIR images. The quantitative parameters were significantly better on DL-FLAIR than on native synthetic FLAIR in all individual tissue segments and in total intracranial tissues (p < 0.0001). DL-FLAIR images showed improved image quality with fewer artifacts than the native synthetic FLAIR images (p < 0.0001). There was no significant difference in the preservation of the periventricular white matter hyperintensities and lesion conspicuity between the two FLAIR image sets (p = 0.217). The quality of synthetic FLAIR images was improved through artifact correction using the trained DL model in a different scan environment. DL-based correction can be a promising solution for improving the quality of synthetic FLAIR images to broaden the clinical use of synthetic magnetic resonance imaging (MRI).


Introduction
Synthetic magnetic resonance imaging (MRI) is based on a quantitative approach using absolute physical properties such as the longitudinal T1-relaxation time, transverse T2-relaxation time, and proton density [1][2][3][4][5]. It can generate multiple contrast-weighted images in a single scan with modifiable acquisition parameters such as repetition time (TR), echo time (TE), and inversion time (TI) derived from mathematical inferences rather than being predetermined [1][2][3][4][5]. In previous studies, the clinical utility of synthetic MRI was investigated by assessing its image quality and diagnostic performance for detecting a range of brain abnormalities [4][5][6][7][8]. However, synthetic fluid-attenuated inversion recovery (FLAIR) artifacts are a major drawback limiting the effectiveness of synthetic MRI for clinical use, even though synthetic MRI has a diagnostic performance comparable to that of conventional MRI and can reduce the scan time in the clinical setting [4][5][6][7][8]. It is well known that synthetic FLAIR artifacts appear as thin, granulated, and marginal hyperintensity along the brain surface [4][5][6][8] or parenchymal swelling in the brain-cerebrospinal fluid (CSF) interface [9], resulting in a decrease in the overall image quality. Therefore, further efforts to improve the image quality of the synthetic FLAIR images are essential to expand the use of synthetic MRI in daily clinical practice.
Two recent studies using deep learning (DL) have reported improvements in synthetic FLAIR image quality [10,11]. Those studies employed different methodological approaches, namely a convolutional neural network (CNN) with a perceptual loss function vs. a pixel-wise neural network with a conditional generative adversarial network (GAN) loss function, yet both showed remarkable potential to solve this issue [10,11]. However, the studies that used DL-based correction of synthetic FLAIR images included a limited number of study participants at each institution; therefore, these new approaches should be validated in a different scan environment to establish their clinical utility. Thus, we aimed to investigate the capability of the already-trained DL model with CNN [10] in a different scanning environment from the perspective of improving the image quality of synthetic FLAIR images.

Study Population
A review of our institutional database identified 321 consecutive patients who underwent routine brain MRI with synthetic acquisition between July and December 2018. Among them, two patients for whom neither 2D nor 3D conventional FLAIR images had been acquired were excluded. We ultimately enrolled the remaining 319 patients in this study, comprising 176 men and 143 women with a mean age of 58.7 ± 12.7 years (range, 21-83 years). Of these, 19 patients had 2D FLAIR images, whereas 300 patients had 3D FLAIR images.
In the present study, the retrospective data collection and analyses were performed in accordance with the local institutional review board guidelines after obtaining its approval. The institutional review board determined that patient approval and informed consent were not required for retrospectively reviewing images and electronic medical records.

DL Framework
To apply the DL-based artifact correction in the present study, a pretrained CNN from an original work by Ryu et al. [10] was used. The network architecture was based on the residual network (ResNet) architecture [12] with several modifications. The network used a combination of two loss functions, namely the mean absolute error and a perceptual loss [13]. While this DL-based method has shown promising results in correcting artifacts in synthetic FLAIR, the previous validation study relied on a dataset from a single scanner and a small test set of only 20 subjects [10]. Moreover, the network in the previous study was trained entirely on images obtained from a single scanner (GE Discovery 750W; GE Healthcare, Milwaukee, USA) in a single institution [10].
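As a rough illustration of a loss that combines the mean absolute error with a perceptual term, the following Python sketch may help. It is not the implementation from [10]: the relative weight `perceptual_weight`, the squared-error form of the perceptual term, and the `toy_features` function (a stand-in for the fixed pretrained feature network that perceptual losses usually rely on) are all assumptions made for illustration. Images are represented as flat sequences of pixel values to keep the example dependency-free.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mean_absolute_error(pred: Vector, target: Vector) -> float:
    """Mean of |pred - target| over all pixels."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def perceptual_loss(pred: Vector, target: Vector,
                    features: Callable[[Vector], Vector]) -> float:
    """Mean squared difference between the feature maps of pred and target.

    `features` is assumed to be a fixed (non-trainable) feature extractor.
    """
    f_pred, f_target = features(pred), features(target)
    return sum((a - b) ** 2 for a, b in zip(f_pred, f_target)) / len(f_pred)


def combined_loss(pred: Vector, target: Vector,
                  features: Callable[[Vector], Vector],
                  perceptual_weight: float = 1.0) -> float:
    """MAE plus a weighted perceptual term (the weight is an assumption)."""
    return (mean_absolute_error(pred, target)
            + perceptual_weight * perceptual_loss(pred, target, features))


def toy_features(image: Vector) -> Vector:
    """Hypothetical feature extractor: differences between neighboring pixels."""
    return [image[i + 1] - image[i] for i in range(len(image) - 1)]
```

For example, `combined_loss(output, reference, toy_features, perceptual_weight=0.5)` penalizes both pixel-wise intensity errors and differences in local structure; in a real training setup the feature extractor would be a pretrained network rather than this toy function.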
For additional validation in this study, the network was tested on data from a different environment, acquired at our institution. The data for this study were obtained from a different scanner (Signa™ Architect; GE Healthcare, Milwaukee, WI, USA) with a different number of receiver coil channels (48-channel head coil). Note that no additional training was performed in the current study. The imaging protocols of the training and test sets were similar in terms of TE/TR, slice thickness, decay times, and bandwidth. The image voxel of the test set in the current study, however, was 40% smaller than that of the training set.
For the subjects in the current study, the forward pass of the network was used to produce the DL-corrected images. This process was repeated slice by slice. Obtaining the output images took approximately 1.1 s per subject (25 slices). After the completion of the process for a subject, the native synthetic FLAIR (input), DL-FLAIR (output), and conventional 2D or 3D FLAIR images were stored separately for evaluation.
The testing was performed using a single GPU (NVIDIA TITAN Xp) with the Keras framework [14] and a TensorFlow [15] backend, CUDA 10.0, and cuDNN 7.1 on a Linux server.

Quantitative Analyses
Of the 319 patients, the quantitative analyses were performed for the 19 patients who had conventional 2D FLAIR images available, to allow a direct region-wise comparison. For the quantitative evaluation, the normalized root mean squared error (NRMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) were used. The NRMSE measures the normalized voxel-wise intensity differences (errors), while the SSIM measures the nonlocal structural similarity. PSNR was calculated by the following equation:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{25{,}500^{2}}{\mathrm{MSE}}\right)$$
where MSE is the mean squared error with respect to the conventional FLAIR image and 25,500 is the maximum range of the FLAIR signal intensity. The NRMSE and PSNR were compared based on three automatically segmented regions: gray matter (GM), white matter (WM), and CSF. This region-wise evaluation was conducted to indicate which region was most improved, and the segmentations for the regions were retrieved via segmentation with FSL-FAST32 using the synthetic T1-weighted images (Figure 1).
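The region-wise error metrics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used in the study: PSNR is taken in its standard form, 10·log10(MAX²/MSE), with MAX set to the 25,500 signal range given in the text; the NRMSE is normalized by the intensity range of the reference image inside the region, which is one common convention and may differ from the one actually used; and images and tissue masks are flat sequences so that the example needs no external libraries.

```python
import math
from typing import List, Sequence

MAX_SIGNAL = 25_500.0  # maximum FLAIR signal range stated in the text


def _masked(values: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Keep only the voxels that belong to the region (mask is True)."""
    return [v for v, m in zip(values, mask) if m]


def mse(image: Sequence[float], reference: Sequence[float],
        mask: Sequence[bool]) -> float:
    """Mean squared error between image and reference inside the mask."""
    a, b = _masked(image, mask), _masked(reference, mask)
    if not a:
        raise ValueError("mask selects no voxels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def nrmse(image: Sequence[float], reference: Sequence[float],
          mask: Sequence[bool]) -> float:
    """RMSE inside the mask divided by the reference intensity range there.

    The choice of normalization is an assumption of this sketch.
    """
    ref = _masked(reference, mask)
    span = max(ref) - min(ref)
    if span == 0:
        raise ValueError("reference is constant inside the mask")
    return math.sqrt(mse(image, reference, mask)) / span


def psnr(image: Sequence[float], reference: Sequence[float],
         mask: Sequence[bool], max_signal: float = MAX_SIGNAL) -> float:
    """Peak signal-to-noise ratio in dB inside the mask."""
    err = mse(image, reference, mask)
    if err == 0:
        return math.inf  # identical images inside the region
    return 10.0 * math.log10(max_signal ** 2 / err)
```

Calling `nrmse` and `psnr` once per tissue mask (GM, WM, CSF) for both the native synthetic FLAIR and the DL-FLAIR image, each against the conventional FLAIR image, yields the kind of region-wise comparison reported in the quantitative results.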

Qualitative Analyses
All the datasets were anonymized, and the readers reviewed all images using the picture archiving and communication system. Two attending neuroradiologists, with nine and four years of experience, independently analyzed the native synthetic FLAIR and DL-FLAIR images of all 319 patients according to the assessment criteria for each item listed in Table 1. The synthetic FLAIR and DL-FLAIR images were assessed in random order after mixing the two FLAIR image sets to minimize bias. The image analyses were performed twice by each reader with a memory wash-out period of two weeks. In each session, the order of review of the studies was random.
Table 1 (artifact grading scale and footnotes): (1) None or negligible; (2) Mild (less than 30% of the axial images); (3) Moderate (30%-50% of the axial images); (4) Severe (more than 50% of the axial images). FLAIR, fluid-attenuated inversion recovery. * Typical synthetic artifacts are surface hyperintensity, granular artifact, or cortical swelling artifact. + The degree of other artifacts that substantially degraded the image quality, for example, flow artifacts, was also assessed.

Statistical Analysis
The data were tested for normal distribution using the Kolmogorov-Smirnov test. Paired t-tests were performed on the quantitative assessment results. For qualitative results, the scores of each image set from the two readers were averaged, and the Wilcoxon signed-rank test was conducted to compare the scores of synthetic FLAIR and DL-FLAIR images. Interobserver agreement between two readers was calculated using weighted kappa statistics. According to the recommendation by Landis and Koch [16], the weighted kappa value was interpreted as follows: 0, no agreement; 0.01-0.20, slight agreement; 0.21-0.40, fair agreement; 0.41-0.60, moderate agreement; 0.61-0.80, substantial agreement; and 0.81-1.00, almost perfect agreement. All the statistical analyses were conducted using SPSS, version 24.0 (IBM Corp., Armonk, NY, USA), and the statistical significance was set at p < 0.05 (two-sided).
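To make the interobserver agreement measure concrete, the sketch below computes a weighted kappa for two readers' ordinal scores and maps the result to the agreement bands listed above. The study itself used SPSS, so this is purely illustrative. The text does not say whether linear or quadratic weights were applied; linear weights are assumed here. The function names are hypothetical, and treating negative kappa values as "no agreement" is also an assumption.

```python
from typing import Sequence


def weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                   categories: Sequence[int]) -> float:
    """Linear-weighted Cohen's kappa for two raters on an ordinal scale.

    `categories` lists every possible score in ascending order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of cases")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (score by A, score by B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement: observed vs. expected by chance from the marginals.
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    if disagree_exp == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - disagree_obs / disagree_exp


def interpret_kappa(kappa: float) -> str:
    """Map kappa to the agreement bands quoted in the text."""
    if kappa <= 0.0:
        return "no agreement"  # negative values are grouped here by assumption
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"
```

With two lists of reader scores for the same cases, `interpret_kappa(weighted_kappa(reader1, reader2, [1, 2, 3, 4]))` returns the verbal label for a four-point artifact scale. Quadratic weights would replace the weight line with `((i - j) / (k - 1)) ** 2`.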
J. Clin. Med. 2020, 9, 364

Results
Representative examples are depicted in Figures 2-5. Table 2 summarizes the results of the quantitative assessment of NRMSE, PSNR, and SSIM for native synthetic and DL-FLAIR images calculated against the conventional 2D FLAIR images of 19 patients. Theoretically, images with a lower NRMSE, higher PSNR, and higher SSIM indicate better image quality. In this study, all values of NRMSE, PSNR, and SSIM were improved by the DL-based correction of the synthetic FLAIR images. The NRMSE was significantly lower for DL-FLAIR than for the native synthetic FLAIR images in GM, WM, CSF, and total intracranial tissues (all p < 0.0001). The NRMSE of the synthetic and DL-FLAIR images was the highest in CSF, with GM showing higher values than WM (all p < 0.0001). However, the percent change in NRMSE was the highest in GM, followed by CSF and WM. The PSNR was significantly higher for DL-FLAIR than for native synthetic FLAIR images in GM, WM, CSF, and total intracranial tissues (all p < 0.0001). The PSNR was the lowest in CSF, with GM showing lower values than WM in both DL-FLAIR and native synthetic FLAIR images (all p < 0.0001). In contrast, the percent change in PSNR was the highest in CSF, followed by GM and WM. In addition, the SSIM improved from 0.907 to 0.938 (p < 0.0001). For the region-wise NRMSE and PSNR values, the improvement was more distinctive in GM and CSF than in WM.
For the qualitative analyses of 319 patients, the mean scores of both DL-FLAIR and native synthetic FLAIR images showed acceptable image quality for diagnostic use.
The qualitative assessment scores given by the two readers and the corresponding interobserver reliability are shown in Table 3. The average mean scores of DL-FLAIR image quality were significantly higher than those of the native synthetic FLAIR (4.73 ± 0.46 vs. 3.12 ± 0.69; p < 0.0001). The average mean scores for the degree of preserving the preexisting periventricular WM hyperintensities or lesion conspicuity did not differ significantly between DL-FLAIR and native synthetic FLAIR (4.69 ± 0.68 vs. 4.70 ± 0.61; p = 0.217). Among the 319 patients, there was no case of generation of artificial pseudolesions during DL processing. However, incomplete preservation of the preexisting true hyperintensities was identified on DL-FLAIR images in 11 of the 319 patients (3.4%), owing to the unexpected partial removal of the true hyperintensities (Figures 3b and 4b). The mean scores for the typical synthetic FLAIR artifacts, including surface hyperintensities, granularities, or cortical swelling, were significantly lower for DL-FLAIR than for native synthetic FLAIR images (1.32 ± 0.51 vs. 3.35 ± 0.68; p < 0.0001) (Figure 2). In addition, other artifacts that substantially degraded the image quality, such as flow artifacts, were also reduced on DL-FLAIR compared with native synthetic FLAIR (1.27 ± 0.46 vs. 2.43 ± 0.72; p < 0.0001) (Figure 2d).

Discussion
The findings of our study indicate that artifact correction using an already-trained DL algorithm could improve the image quality of synthetic FLAIR images by successfully removing native artifacts from an external data set in a different scanning environment, and it could also provide significantly better values of quantitative parameters. In addition, to the best of our knowledge, this is the first study to employ such a large sample size for the external validation of the trained DL model and provide three quantitative parameters for evaluating the image quality of DL-FLAIR and native synthetic FLAIR images.
In previous studies, synthetic FLAIR artifacts did not have a significant effect on the diagnosis because the artifacts could easily be differentiated from pathologic conditions [5,8]. However, synthetic FLAIR artifacts are an issue for routine clinical use because they can mimic a pathology in the CSF-filled spaces or at the CSF-brain interface; to identify them, radiologists need an adaptation period to gain familiarity with this issue. Thus far, the exact cause of such artifacts remains unclear; however, previous studies suggest that it may be related to partial volume and flow effects [4,5,10]. Fortunately, synthetic FLAIR artifacts have characteristic patterns (thin, granulated, and marginal hyperintensity along the brain surface and CSF spaces), and they tend to appear in high convexities and posterior compartments, such as temporo-occipital regions and the brainstem. Because these patterns are consistent, DL-based artifact correction is well suited to improving the image quality of the synthetic FLAIR images.
Recently, DL methods have been applied increasingly in the field of radiology, and they have demonstrated enormous potential in several MRI processing areas [17], including artifact correction for specific pulse sequences [18,19]. Recent studies have accordingly developed DL algorithms using variants of CNN to remove synthetic FLAIR artifacts and have demonstrated the feasibility of this method. However, both studies were limited because they were conducted using the same 3T MR scanner from a single vendor, although the institutions were different [10,11]. Our results are therefore promising for generalizing the application of the DL method for improving synthetic FLAIR image quality, because overfitted DL models work only for internal datasets and exhibit poor performance on external datasets [20].
The results of the current study also revealed that the DL algorithm using CNN improved the image quality of the synthetic FLAIR images by correcting the typical artifacts in both quantitative and qualitative analyses, which is consistent with the results of two recent studies [10,11]. In the current study, the improvements in both NRMSE and PSNR on DL-FLAIR images were more pronounced in the GM and CSF regions than in WM in the region-wise analyses, which is consistent with the quantitative analysis of the recent study [10]. This may indicate that our DL-based artifact correction mainly acted on the brain surface and CSF spaces, which are the most common locations of synthetic FLAIR artifacts. Therefore, these results show the potential for the application of synthetic MRI in clinical use by enabling accurate detection of true intracranial pathologies at the brain-CSF interface on the synthetic FLAIR images. In addition, the improvement shown in the quantitative analysis was the lowest in WM, and no significant difference was noted in the degree of preserving the preexisting periventricular WM hyperintensities or lesion conspicuity on the visual assessment between DL-FLAIR and native synthetic FLAIR images. The reason for this is unclear; therefore, additional studies are required to investigate this issue by comparing native synthetic FLAIR, DL-FLAIR, and conventional FLAIR images for expanding the diagnostic use of synthetic MRI in daily clinical practice.
In terms of image artifacts, the typical synthetic FLAIR artifacts were significantly reduced in DL-FLAIR images (Figure 2), which is consistent with the original work [10]. The CNN used in the current study is well known for being highly effective in sensing and learning spatial patterns or features [20,21]. Fortunately, synthetic FLAIR artifacts have characteristic patterns showing granulated and marginal hyperintensity along the brain-CSF interface [4,5,8,10]. Therefore, our DL method enables the efficient detection and removal of artifacts according to their spatial patterns. However, further comparative studies using different DL methods should be conducted to compare their effectiveness in reducing synthetic FLAIR artifacts.
In the present study, we identified incomplete preservation of preexisting true hyperintensities on DL-FLAIR images in 11 patients, owing to the unexpected partial removal of the true hyperintensities. In all these cases, the hyperintense lesions were located in the vicinity of cystic encephalomalacias, and the lesions were considered to be reactive gliosis. The reason for this finding is unclear; however, it may be related to how the DL-based artifact correction distinguishes artifacts from true hyperintensities, especially when the true hyperintensities are located near fluid-containing lesions, where the fluid-lesion interface is likely to resemble the CSF-brain parenchyma interface. We believe that the issue can be solved if the DL algorithm is improved through further training with various pathologic cases so that it can differentiate a normal CSF-tissue interface from the lesion-fluid interface.
Although the results are promising, this study has certain limitations. First, we did not directly compare the qualities of DL-FLAIR and conventional FLAIR images because this study included two types of conventional FLAIR images: 2D and 3D. This was unavoidable because this study was retrospectively designed. Further studies are required to directly compare the image quality of DL-FLAIR and conventional FLAIR images to establish the clinical use of DL-FLAIR images and validate our results. Second, the quantitative analyses were performed for only 19 patients owing to the aforementioned heterogeneity of the conventional FLAIR images. Third, although we obtained all data using a different MR scanner with different scan parameters in a single institution, the scanner was from the same vendor as those used in the previous studies [10,11]. Therefore, future studies with scanners from other vendors in multiple institutions are needed to validate and generalize our results. Finally, we did not perform a meticulous evaluation of the intracranial pathologies during the analyses because we focused on the improvement of the synthetic FLAIR image quality by DL-based correction and because the brain MRI protocols were heterogeneous, depending on the patients' medical condition.

Conclusions
In conclusion, artifact correction with the already-trained DL algorithm successfully improved the image quality of synthetic FLAIR images when applied to an external dataset acquired on a different MR scanner in a different scan environment. This was verified both qualitatively and quantitatively, with the obtained images compared against conventional FLAIR images. Therefore, we believe that the DL-based approach can provide a promising solution for improving the image quality of synthetic FLAIR images to broaden the clinical use of synthetic MRI in daily clinical practice.