Computer Aided Detection of Pulmonary Embolism Using Multi-Slice Multi-Axial Segmentation

Pulmonary Embolism (PE) is a respiratory disease caused by blood clots lodged in the pulmonary arteries, blocking perfusion, limiting blood oxygenation, and inducing a higher load on the right ventricle. Pulmonary embolism is diagnosed using contrast enhanced Computed Tomography Pulmonary Angiography (CTPA), resulting in a 3D image where the pulmonary arteries appear as bright structures, and emboli appear as filling defects, with these often being difficult to see, especially in the subsegmental case. In comparison to an expert panel, the average radiologist has a sensitivity of between 77% and 94%. Computer Aided Detection (CAD) is regarded as a promising system to detect emboli, but current algorithms are hindered by a high false positive rate. In this paper, we propose a novel methodology for emboli detection. Instead of finding candidate points and characterizing them, we find emboli directly on the whole image slice. Detections across different slices are merged into a single detection volume that is post-processed to generate emboli detections. The system was evaluated on a public PE database of 80 scans. On 20 test scans, our system obtained a per-embolus sensitivity of 68% at a regime of one false positive per scan, improving on state-of-the-art methods. We therefore conclude that our multi-slice emboli segmentation CAD for PE method is a valuable alternative to the standard methods of candidate point selection and classification.


Introduction
Pulmonary Embolism (PE) is a deadly disease formed when emboli lodge in the pulmonary arteries. Emboli can be formed in-situ or as a result of deep venous thrombosis, traveling through the blood stream, and traversing the right heart cavities. Emboli lodged in the pulmonary arteries impede blood flow, causing poor or no oxygen exchange and increasing right ventricular afterload. Decreased oxygenation can result in poor oxygen delivery to vital organs that can cease to function or malfunction. An increased right ventricular afterload results in right heart strain, which can cause right-sided heart failure, ischemia, and death. Lack or delay in treatment increases morbidity and mortality [1]. From an epidemiological standpoint, in the United States, PE affects 300,000-600,000 Americans/year, resulting in 12,000-80,000 deaths [2]. There are no global numbers for Europe, but the incidence of PE in Sweden, for example, is estimated to be 19,000 cases per year and 39,480 France [3].
Pulmonary embolism is clinically presented with a non-specific symptomatology that includes chest pain, shortness of breath and tachycardia, and may include hemoptysis, hypotension, and loss of consciousness. The clinical diagnosis of PE can be challenging, with radiological Computed Tomography Pulmonary Angiography (CTPA) being the gold standard. This procedure involves injecting a contrast agent into the patient's pulmonary arteries, imaging with a computed tomography (CT) scanner and reading the images to find filling defects. Such filling defects are the result of the emboli blockage of the contrast agent. Since the clinical symptoms for PE are non-specific and there is no laboratory test for the diagnosis of PE, many of the CTPA performed in clinical settings are not positive for PE.
Computer Aided Detection (CAD) methods have been used in PE diagnosis to measure the right ventricle strain [4,5] and to detect PE automatically. In comparison with a panel of expert radiologists, the sensitivity of an individual radiologist ranges from 77% to 94% [6,7]. Works on using CAD systems as a second reader have shown increases in reader sensitivity of 92-98% [6][7][8][9][10][11][12][13]. However, current systems of CAD for PE show moderate sensitivity (65%) at a fairly high false positive rate (three false positives per study), preventing their general adoption in clinical practice [14].
CAD algorithms for PE follow a two-step process, in which a set of candidate points are selected and tested to discriminate between emboli and false detections. Older studies extracted hand-crafted features around candidate points and used classification techniques to evaluate them [9,12]. Modern methods take advantage of convolutional neural networks for both feature extraction and classification by generating planar reformatted images centered around candidate points [15,16]. The more recent work [17] also includes context information around the candidate point to improve detection performance.
In this work, we hypothesize that deep convolutional neural networks can be used to directly detect emboli from images, without the need to detect candidate points. By using the whole image to detect emboli, we let the network use both local and global image characteristics to discern between PE and other image structures.

Materials and Methods
We treated the problem of CAD for PE as a segmentation problem, where we directly inputted the image and outputted a segmentation mask of the emboli. We post-processed emboli segmentations to generate emboli detections and evaluated the performance of the method, following the methodology of González et al. [14].

Dataset
We used a total of 80 Computed Tomography Pulmonary Angiograms (CTPA) from the CAD-PE challenge [14,18,19]. While the dataset was not obtained for this work, we describe it here for the sake of clarity. The original scans were supplied from six different hospitals and coordinated by the "Central Diagnostic Radiology Unit" in Madrid (Spain). SIEMENS Somaton Sensation 40 scanners were used for the data acquisition with a pixel size between 0.58 and 0.85 mm and a slice thickness oscillating between 0.75 and 1.5 mm.
Each scan was analyzed independently for the presence of PE by three radiologists with more than 15 years of experience establishing the reference standard. In the case of discrepancy, a majority voting scheme was employed. Furthermore, each radiologist marked regions of interest (ROI) around all the visible clots using the sagittal and coronal views. The regions of interest were segmented by applying an intensity threshold, binary closing operations and connected component analysis. Each of the individual connected components represented a different embolus clot, the Figure 1 shows an example of these findings. There was a total of 167 emboli in 60 training cases, with a per-case total average volume of 5.04 × 10 3 voxels. The number of emboli per scan ranged between 0 and 21. A variation in the reference standard was performed by dilating the border of each embolus with an epsilon value as tolerance margin ( = 2 mm and = 5 mm). This modification affected the number of emboli and their size in the reference segmentations, but was only used for test purposes, leaving the original reference standard ( = 0 mm) to train and validate the system.

2D Network
The 2D network was based on the U-Net structure [20]. The input to the network consisted of axial slices of the training data, and the output was emboli segmentation. We used the reference standard segmentations with =0 mm tolerance level as ground truth. The U-Net was composed of four convolution and polling levels in the down-sampling encoder. Each level was composed of the 3 × 3 convolution operations that doubled the number of filters with respect to the previous level. The first level started with 32 filters and the last encoder block had 256. Each encoder level was followed by a 2 × 2 Max-Pooling operation. At the end of the encoder, we performed two further 3 × 3 convolutions with 512 filters each. The decoding was performed by four up-sampling blocks consisting of an up-sampling operation, followed by the concatenation with the output of the corresponding encoder level data and two 3 × 3 convolution operations, with the same number of filters as its equivalent encoder level. To finish, a last 1 × 1 convolution was performed to obtain the network output. All of the convolution outputs were activated with a "ReLU" function except the last 1 × 1 convolution, which used a sigmoid activation to concentrate the output values in the extremes of the range [0, 1]. In addition, batch normalization was used for the convolution activations, to maintain the mean close to zero and its standard deviation close to one.

2.5D Network
This method was a variation of the 2D network where, instead of using a single axial slice as input, we used a composition of five slices, corresponding to the two above and the two below the target slice we wished to segment. We still used a single slice reference standard segmentation, comparing this with the network output as minimization objective.

3D Network
The 3D method consisted of a three-axis version, where we used the exact same structure as the 2.5D network but trained it with slices in the three axial direction (transverse, coronal, and sagittal planes). Since the scans did not have the same z resolution, we resampled them to achieve a homogeneous size of 512 × 512 × 512. At the prediction stage, we tested each scan three times, with input slices coming from different planes, obtaining three predictions, which were then merged using the maximum value on each pixel.

Training
To train the three different networks, we split the 60 scans into two groups: 55 scans for training and 5 for validation. As pre-processing, the data were clipped in the range [−200, 500] Hounsfield Units (HU) and normalized to the range [0, 1]. Each model was trained for 200 epochs, keeping the models with the best validation performance, which were then used to obtain the final coordinates over the test subset. As objective loss function, all methods minimized the binary cross-entropy between reference standard segmentations and the predicted ones. We used the Adam optimizer with a learning rate of 0.0005.

Post-Processing
After the training process, the model with the best validation performance was used to obtain the probabilistic segmentations on the test subset. These results were then transformed into emboli coordinates using the method shown in Figure 2. First, from the network predictions, a threshold of 0.5 was used to obtain a binarized mask. A binary closing operation with a kernel size of 5 × 3 × 3 was used to eliminate small noise on the mask. Then, connected components analysis was used to generate the different individual emboli.
In a second stage, the coordinates of each individual embolus were extracted. We applied an inner distance transform for each connected component, and found the pixels that were furthest from the perimeter. Then, we obtained the closest coordinates of such pixels to the center of mass of the embolus. We applied this method to all connected components, generating a per-detected embolus list of coordinates. Finally, each coordinate was associated with a score representing its probability of being a clot. This score was the network output value at the coordinate position generating a list of values composed by the coordinate in the "x", "y", and "z" axes and the "score" of probability.

Evaluation Metric
To compare the performance of the three different methods, we used the Free-response Receiver Operating Characteristic (FROC) curve. This tool uses sensitivity in the detection of emboli, defined as the total number of findings divided by the total number of emboli, together with the average false positive per scan obtained by adding the false positive detections divided by the total number of test scans. An embolus was correctly detected if any coordinate obtained by the tested network fell inside its reference standard segmentation. FROC analysis is standard for the evaluation of methods that locate several lesions in images, but unfortunately no clear figure of merit is widely accepted for such curves [21], and we resort to visual inspection of the curves for the comparison of the methods.

Results
First, we evaluated the segmentation performance of the proposed method. Qualitative segmentation results are shown in Figure 3, where green labels represent true positives, red labels false negatives, and yellow labels false positives. The average per-scan segmentation Dice score was 0.485 (standard deviation of 0.297) for the 2D network, 0.367 (standard deviation of 0.214) for the 2.5D network, and 0.296 (standard deviation of 0.160) for the 3D network. In all the cases, the standard deviation showed large discrepancies across scans. Figure A1 shows the relationships between the total size of the emboli in the scan, measured in voxels, and the Dice coefficient. There was a clear trend where the larger was the emboli in the scan, the better was the segmentation.  Table 1 shows the per embolus sensitivity of each of the three methods and the number of false positives per scan for the three tolerance margins = 0 mm, = 2 mm, and = 5 mm. The threshold to set the operating point of the method was set at 0.75. There was a clear improvement of the 2.5D method over the 2D method. For = 0 mm and = 2 mm, the sensitivity was similar, but the number of False Positives per Scan (FPS) was more than halved. For = 5 mm, there was an increase in sensitivity paired with a decrease in the number of FPS. The data in the table show no clear improvement between the 2.5D and 3D methods. The 3D method had higher sensitivity, but also a higher number of FPS.  Figure 4 displays the FROC curves. We included the results of the CAD-PE challenge in these images challenge to enable comparison [14,18]. We refer the reader to these references for a description and thorough evaluation of the other methods. The ASU-Mayo method is best described in [15] and the FUM-Mvlab method in [22]. The x-axis scale changes between the left and the right columns. Please note that most methods do not detect all emboli, since they are not part of their candidate selection mechanism or the segmentations did not reach the 0.5 threshold. The three proposed methods outperformed the other participants of the CAD-PE challenge for all the regimes of FPS. The 2.5D and 3D methods outperformed the 2D method for all the regimes of FPS in which they have data. There was a small increase in sensitivity of the 2.5D method over the 3D method in the range [0.25-4] FPS. The 3D method clearly outperformed the 2.5D method on the range [0-0.25] FPS. The performance of the three proposed methods with respect to embolus volume is shown in Figure 5. Each sub-image displays the histogram of embolus volume in green and the amount of emboli detected in blue. The x-axis is in logarithmic scale. All methods had problems finding emboli with less than 0.5 mL volume. For emboli above 0.5 mL volume, the more complex is the network, the better it performed, with the 3D network being the best-performing method. The cutoff value used to generate these histograms was the same as that used for Table 1.
The 2D network processed one case in 51.5 ± 20.5 s. The 2.5D network in 47.0 ± 14.2 s. The 2.5D method was faster than the 2D method because it produced a lower amount of connected components, and therefore the post processing was faster ( 2.5D took 14.5 ± 8.5 s vs. 2D took 34.8 ± 17.9 s). The 3D method processed one case in 100.8 ± 7.1 s, with post processing time of 10.4 ± 7.1 s. All methods were fast enough to be used in an emergency setting. Timing was measured in an Intel i7-6850k CPU with an Nvidia 1080-Ti graphics card.

Discussion
CAD methods for pulmonary embolism have long suffered from a moderate sensitivity at high regimes of false positives per scan, hindering their clinical use. Recent neural network methods have improved the results of CAD for PE, using the same classic machine learning methodology consisting of finding candidate points and then evaluating whether each candidate belongs to an embolus. We present a method that avoids the use of candidate points by treating CAD for PE as a segmentation problem, turning the segmentation into detection points as a post-processing step. We evaluated three methods with this underlying idea: a method using axial slices as input data, a method using a slab of the CT scan, and a method that combines information from axial, sagittal, and coronal planes.
The three methods presented outperformed the other participants on the CAD-PE challenge. This may be because there was information about the location of the embolus that was relevant to its shape and image characteristics. A distal embolus might not look similar if it is in the upper lobes or the lower lung lobes. By using global segmentation strategies, the network can learn such differences. When compared with the work of Lin et al. [17], the proposed methods showed lower performance at the regime of two false positives per scan. It should be noted that the performance of their method at lower FPS regimes, which are required for clinical practice, was unclear.
The methods improved when more data were added as input to the network. As such, the 2.5D network outperformed the 2D network on all regimes of FPS. This is logical. Emboli are 3D structures, and image variations on the z-axis contain valuable information. The authors were surprised to see that the 3D method did not clearly outperform the 2.5D method on all FPS regimes. However, there was a clear improvement in performance at very low FPS regimes ([0-0.25]), which indicates that the 3D method successfully increases the score of certain emboli. This effect makes sense, since some emboli that might not be clear in axial slices may be very clear in coronal or sagittal slices, and the context on such planes might be more relevant than the few slices that the 2.5D method uses. Surprisingly, the performance in terms of segmentation decreased with complexity, with the 2D method achieving a better Dice coefficient than the 2.5D or 3D method. This might indicate that the 3D network is more restrictive in its findings, producing fewer false positives while keeping the embolus located.
There are several limitations to this study. First, we used only 60 CT scans for training. While they account for more than 7000 slices, this number is small for training modern neural networks and could be considered a small dataset. A larger number of scans would likely improve the performance of the network. A thorough study on the improvement of performance with respect to the number of training scans might shed more light on the optimal number of scans required for training. Second, the scans on which we trained the dataset and the scans on which we tested them come from the same institution and had been obtained with the same acquisition machines, which might bias the results positively. The evaluation on external datasets, such as the one in [22], was left for future work. Third, there is room to improve the segmentation performance of the network, since the baseline network structure is very simple. More complex networks, using squeeze and excite blocks, for instance, may greatly improve segmentation results that could lead to better emboli detection. Fourth, the proposed method, as all other CAD for PE methods referenced in this paper, only detects emboli, and it is not capable of classifying them as obstructive or non-obstructive. Such clinical information is important, but impossible to obtain with the current dataset, since we do not have these labels for the emboli, nor do we have a method to segment the pulmonary artery and test for obstruction. Fifth, it would be of further interest to improve the method to compute clot burden scores, such as those in [23,24]. However, such scores require information on the anatomical location of the embolus, which is not available with the method in the present study.
We do not know whether the proposed network would have an impact on clinical practice. Other methods have been evaluated by radiologists and the sensitivity of radiologists, especially those without extensive experience, has been shown to improve without reducing specificity. The proposed method outperforms other CAD for PE algorithms and would likely have the same effect on radiologist's performance while being easier to use due to its lower rate of false positives per scan.

Conclusions
This article shows that the standard paradigm for CAD for PE algorithms of finding candidate points and then evaluating their belonging to an embolus is inferior to the proposed paradigm of directly locating emboli by treating the problem as a segmentation method. We achieved state-of-the art performance with this paradigm using a simple segmentation network as backbone.

Abbreviations
The following abbreviations are used in this manuscript: