Prediction of Post-Intubation Tachycardia Using Machine-Learning Models

: Tachycardia is deﬁned as a heart rate greater than 100 bpm for more than 1 min. Tachycardia often occurs after endotracheal intubation and can cause serious complication in patients with cardiovascular disease. The ability to predict post-intubation tachycardia would help clinicians by notifying a potential event to pre-treat. In this paper, we predict the potential post-intubation tachycardia. Given electronic medical record and vital signs collected before tracheal intubation, we predict whether post-intubation tachycardia will occur within 10 min. Of 1931 available patient datasets, 257 remained after ﬁltering those with inappropriate data such as outliers and inappropriate annotations. Three feature sets were designed using feature selection algorithms, and two additional feature sets were deﬁned by statistical inspection or manual examination. The ﬁve feature sets were compared with various machine learning models such as naïve Bayes classiﬁers, logistic regression, random forest, support vector machines, extreme gradient boosting, and artiﬁcial neural networks. Parameters of the models were optimized for each feature set. By 10-fold cross validation, we found that an logistic regression model with eight-dimensional hand-crafted features achieved an accuracy of 80.5%, recall of 85.1%, precision of 79.9%, an F1 score of 79.9%, and an area under the receiver operating characteristic curve of 0.85.


Introduction
The usual circulatory response to laryngeal and tracheal stimulation during tracheal intubation in anesthetized patients are tachycardia and a rise in arterial pressure due to reflex sympathetic stimulation [1][2][3]. The resulting tachycardia is transient and poses little risk to healthy patients. However, in patients with cardiovascular disease, it can induce a cardiac oxygen imbalance and cause serious complications such as myocardial ischemia. To prevent such tachycardia, β-blockers, lidocaine, opioids, and deep anesthesia are used [4][5][6]. These methods do not always effectively block tachycardia and the potential for complications associated with drug administration cannot be ignored. If post-intubation tachycardia can be predicted, pre-intubation medication could be administered prior to intubation to mitigate the degree of tachycardia. However, several characteristics of anesthesia induction make prediction difficult. Patients often experience anxiety prior to surgery, which activates the sympathetic system. Medication (e.g., β-blockers) and preexisting medical conditions (e.g., cardiovascular diseases, diabetes) are confounding factors. Furthermore, bolus administration of anesthesia induction agents responds variously according to the patient's general condition and existing disease. Tracheal intubation itself causes various hemodynamic responses, further complicating the task of predicting such responses. With the universal use of electronic medical records (EMR; otherwise referred to as Electronic Healthcare Records) and real-time biological signal collection equipment, data from EMR (e.g., patient demographic information, coexisting disease, and medication history) and physiological and pharmacological data obtained from various anesthesia monitoring devices can be collected simultaneously during anesthesia and surgery. The development of machine learning has reached a stage where it may be possible to train systems to predict blood-pressure fluctuation during anesthesia [7]. However, to our knowledge, no studies have yet predicted the occurrence of tachycardia after induction of anesthesia by machine learning.
Although no studies of predictive systems for post-intubation tachycardia have been conducted, reports on related events (e.g., ventricular tachycardia, cardiac arrest, and arrhythmia) are available. In most studies, the related events were predicted using heart rate variability (HRV) as the main variable and several have adopted machine learning models such as logistic regression (LR), random forest (RF), decision tree (DT), artificial neural networks (ANN) [8][9][10][11]. Some studies have investigated feature selection algorithms to enhance model performance and identify indicative features. Riasi et al. [12] used a linear correlation method to predict ventricular tachycardia (VT). They analyzed the linear correlation of the features, removed highly correlated features, and achieved a sensitivity of 0.88 using support vector machines (SVMs). Yaghouby et al. [13] used generalized discriminant analysis (GDA) feature reduction to predict four types of cardiac arrhythmias. A total of nine linear and nonlinear features were extracted from the HRV signals, which were reduced to three-dimensional values by GDA. Feature selection improved accuracy rates by between 1% and 7%, and they achieved an accuracy of 100% with an ANN model. As feature selection has shown its potential to resolve a variety of complex relationships between factors (or features), we employed feature selection algorithms to deal with factors that affect the heart rate after intubation. Among the known factors are patient characteristics (e.g., age, comorbidity), a history of taking cardiovascular drugs before surgery (e.g., β-blockers, calcium channel blockers, antiarrhythmics), time to intubation, degree of muscle relaxation, intubation methods (e.g., traditional direct laryngoscope, video laryngoscope), and the type and amount of anesthesia drug injected. We also designed feature sets based on statistical clues (e.g., p-value) and manual examination. We applied these carefully designed features to machine learning models for tachycardia prediction.
In this paper, we aimed to predict whether post-intubation tachycardia will appear within 10 min, given EMR and vital signs collected before tracheal intubation. Tachycardia is defined a heart rate greater than 100 bpm for more than 1 min. As far as we know, this is the first study to predict tachycardia after tracheal intubation. To predict tachycardia, we used a variety of machine learning models with feature selection techniques. We believe that this will prove helpful for clinicians by notifying a potential event to pre-treat. This paper is structured as follows. Section 2 describes the characteristics of the data and preprocessing steps. It also supplies details on how we used feature selection algorithms and machine learning models. Section 3 provides details of experimental settings and results, Section 4 discusses about the results, and Section 5 concludes the paper.

Data Preparation
Datasets were collected from nine operating rooms at Soonchunhyang University Bucheon Hospital, Bucheon city, Republic of Korea, between 29 October 2018, and 30 September 2019. The data were collected from patients who had received total intravenous anesthesia (TIVA), and were at least 18 years old. The datasets do not contain any direct patient identifiers and additional approval from our institutional review board was obtained for this retrospective study (No. 2019-10-024-002).
The datasets consisted of electronic medical records (EMRs) and vital signs. The EMR data were manually collected and included age, sex, height, weight, body mass index (BMI), American Society of Anesthesiologists (ASA) physical status grade, and types of coexisting diseases. Vital signs were collected automatically between the time of entering the operating room and the beginning of operation. Vital signs were collected using a Vital Recorder [14] connected to several devices in operating rooms: Bx50 (patient monitor), Solar 8000M (patient monitor), Datex-Ohmeda (anesthesia machine), Primus (anesthesia machine), BIS (brain monitor) and Orchestra (infusion pump). Vital signs included 35 variables such as heart rate (HR), systolic blood pressure (SBP), tidal volume (TV), carbon dioxide (CO2), end-tidal CO2 partial pressure (ETCO2), positive endexpiratory pressure (PEEP), bispectral index (BIS), and so forth. Figure 1 shows a sample of the two types of data; the EMR is collected once for each operation, whereas the vital signs are supposed to be gathered every second continuously. Figure 1 shows a sample of the two types of data; the EMR data were collected once for each operation, whereas vital signs were ideally gathered every second continuously. Details of the two types of data are summarized in Table 1, where the baseline vital values are collected only before anesthesia induction.   The datasets required preprocessing because of missing values (e.g., signal quality index (SQI) of BIS), inconsistent sampling rates, outliers (e.g., height), and incorrect time-logs of intubation. Missing values from the BIS were replaced by the mean of the surrounding values. Other missing values were replaced by the last observed values. Different vital signs had inconsistent sampling rates; for example, generally noninvasive blood pressure was recorded every minute, whereas the respiratory rate was recorded every 3 s. To handle this, as the fastest sampling rate was 1 s, we assume that all vital signs had the same sampling rate of 1 s. If a sampling rate of a particular variable was 3 s, then the corresponding variable values were copied twice.
After the missing values and inconsistent sampling rates were addressed, it was necessary to filter some data as depicted in Figure 2. Of the original 1931 patients collected between 29 October2018, and 30 September 2019, both vital signs and EMR were available for only 1093. Outlier values (e.g., height of 1554.2 cm, remifentanil volume of 538,976.25 µg) as determined by a predefined range of variables were removed. The patients without BIS, neuromuscular transmission train-of-four count (NMT_TOF_CNT), or CO2 were filtered out. The patients without any intubation annotation (i.e., "EVENT") were also removed because our purpose was to predict post-intubation tachycardia. Furthermore, because the intubation annotation may have been wrong for several reasons, we designed Algorithm 1 to correct wrong annotations. The algorithm also found patient data that could not be corrected, and 271 were discarded. Based on the corrected annotations of intubation, we filtered 64 patients data of two cases: (1) intubation only 10 s after anesthesia induction, and (2) operation begins within 10 min after intubation. The major purpose of Algorithm 1 was to correct annotations of intubation. The time-step (or point) of tracheal intubation was annotated manually by the perioperative nurse (or anesthesiologist) in the operating room. Therefore, the annotated points may have been incorrect for several reasons (e.g., a mistake or delay). For example, the EVENT column of Figure 1b has the incorrect annotation "intubation" at 00:08:02, because the CO2 value must remain at zero during tracheal intubation. Our purpose was to predict post-intubation tachycardia, so it was critical to find the correct intubation points. Algorithm 1 describes four steps of finding the start and end points of tracheal intubation. More formally, in Figure 3, the algorithm identifies the beginning point of intubation I s and the ending point of intubation I e . The first step of Algorithm 1 was to find candidate points of tracheal intubation using three conditions, as shown in line 4. The first condition checked if the bispectral index value (BIS) was lower than 70. When patients entered the operating room, TIVA using propofol and remifentanil via a target controlled infusion pump (Orchestra Base Primea with module DPS; Fresenius Kabi AG, Germany) was administered. After anesthesia, patients gradually lost consciousness as BIS decreased. In Figure 1b, this condition is met at 00:07:01. The second condition checks muscle state; neuromuscular transmission train of four count (NMT_TOF_CNT) is 0 or 1 when the muscle is relaxed. The rocuronium was injected intravenously, and we electrically stimulated the nerves dominating the patient's thumb and measured their movements to determine if the muscles were relaxed. The last condition checks if the carbon dioxide (CO2) values for 10 s remain at zero. After receiving the muscle relaxant, the patients could not breathe on their own, so CO2 values were zero. In Figure 1b, the annotated event "intubation" violates this condition, meaning that the annotation must be incorrect. By the above three conditions, the first step finds a candidate set of beginning points of intubation. For example, the candidate set included the point 00:07:01 in Figure 1b. If no candidate was found, then the corresponding data were discarded, as described in line 12. The lines between 14 and 18 were to take only the first candidate point if some candidate points are consecutive; for example, given four candidate points [10,20,30,50], the two points 20 and 30 were to be filtered out, resulting in two remaining candidate points [10,50].
The second step of the algorithm was to find a manually annotated intubation point in the operating room. Unless there was only one annotated intubation point, the corresponding data were discarded. In other words, if there were multiple annotated points or no points at all, then the corresponding data were regarded as incorrect. Such wrong data were discarded even multiple candidate points CI s were obtained from step 1, which means that the CI s and MI were required to find the intubation points I s and I e in the following steps. The third step was to find the beginning point I s . The I s was found by comparing MI and multiple candidate points CI s . By comparing the difference of MI with each element of CI s , the nearest element of CI s to MI became I s . For example, in Figure 1b, assume that CI s and MI are [00:07:01, 00:12:04] and [00:08:02], respectively. The I s will be 00:07:01 because it is the nearest candidate point to 00:08:02. The last step was to find the ending point I e of intubation. It assumes that the post-intubation CO2 emission was greater than the pre-intubation CO2 emission. For example, in Figure 1b, the CO2 pre will be 2.4 at 00:07:00. The I e will be 00:08:01 because it is the earliest point with a greater CO2 value than CO2 pre . If the I e was not found by this step, then the corresponding data were to be discarded; but we did not observe such a case.

Feature Selection
Our purpose was to predict a post-intubation tachycardia, for which the input feature was obtained from the "input area" in Figure 3, and the output was 1 (or true) if the tachycardia occurred in the "output area"; otherwise it was 0 (or false). In this paper, tachycardia is defined as a HR greater than 100 bpm for more than 1 min. The input feature is defined using two types of data (e.g., EMR and vital signs). First, a 24-dimensional feature vector f EMR was obtained from EMR, including age, sex, BMI, and so forth; details of the 24-dimensional feature can be found in the first two rows of Table A1 in the Appendix. For example, if the patient was female and had a cardiovascular disease (e.g., hypertension), then f EMR was [. . . , f sex = 1, . . . , f hypertension = 1, . . . ]. Second, a 129-dimensional feature vector f vital was obtained from the vital signs. The f vital contains mainly min, max, mean, and standard deviation (sd) of each vital sign; details can be found in Table A1  Feature selection finds a promising set of features that has a potential to contribute performance improvement. Feature selection is known to shorten training time, reduce overfitting, and improve accuracy [15]. It has produced useful results for ventricular tachycardia and arrhythmia in previous studies [12,13]. We compared feature selection with three different measurements: recursive feature elimination (RFE), Gini index (GI), and a univariate statistical test (UST) using mutual information. The RFE and GI-based feature selection were performed with a random forest (RF) classifier using scikit-learn [16]. By grid searching, RFE, GI, and UST-based feature selection resulted in 10, 15, and 15 promising feature sets, respectively. Details of the selected features are listed in Table 2.
In addition to feature selection with the three measurements, we also prepared two additional feature sets: a p-value based feature set and a manually designed feature set. In Table 2, the "P-based" and "Hand-crafted" feature sets indicate the two feature sets, respectively. The P-based feature set was defined by statistical clues; we conducted a t-test or Wilcox test for continuous variables (e.g., height, tidal volume), and chi-squared or Fisher tests for categorical variables (e.g., sex, ephedrine). We observed significant statistical differences in baseline heart rate, noninvasive heart rate, and remifentanil values, as shown in Table 3. However, the hand-crafted feature set was carefully designed through an exploratory data analysis process. That is, by manually examining a group of patients, we picked eight promising features, including sex, HR_mean, remifentanil_CE_max, remifentanil_CP_max, remifentanil_VOL_max, remifentanil_VOL_mean, BIS_min, and MV_max.

All 153 Features RFE-based GI-based UST-based P-based Hand-Crafted
End-tidal CO2 partial pressure Max RFE stands for recursive feature elimination, GI means Gini index, UST implies univariate statistical test, and P-based represents p-value based feature sets, and PLETH_SPO2 means saturation of partial pressure of oxygen. The five feature sets were compared using various machine learning models: random forest (RF) [17], logistic regression (LR) [18], naïve Bayes classifiers (NB) [19], support vector machine (SVM) [20], extreme gradient boosting (XGB) [21], and artificial neural networks (ANN) [22]. The ANN had two hidden layers, in which the first and second layers had 15 and 20 nodes, respectively.

Results
Among the 257 patients, there were positive patients that had the post-intubation tachycardia, where the number of positive patients |D pos | = 50. To avoid such imbalance, we randomly divided the remaining 207 negative patients D neg into four disjoint subsets, where |D neg 1 | = 52, |D neg 2 | = 52, |D neg 3 | = 52, and |D neg 4 | = 51. We conducted four independent experiments with every pair of D pos and {D neg 1 , D neg 2 , D neg 3 , D neg 4 }; the total amount of datasets for the four experiments was 102, 102, 102, and 101, respectively. All experimental results were averaged values of 10-fold cross-validation. The performance of different models was compared by accuracy, precision, recall, and F1 score [23]. We also utilized an area under the receiver operating characteristic curve (AUC) to compare the models; larger AUC values implied better models.
We applied our five feature sets to six machine learning models. Parameters of each model were optimized through a grid search, for which the settings are described in Table 4. We used a computer with eight Central Processing Units (CPU) of i7-7700 3.6 GHz and two NVIDIA GeForce 1080 Ti. The machine learning models were implemented with Python3 language. Table 4. Parameter settings for machine learning models.

Models Description
Artificial neural network Common parameter : Activation (relu), Solver (adam), L2 penalty (0.001), Maximum iteration (200) Common structure : two hidden layers = [15,20] All feature : L2 penalty = 0.1, One hidden layer with 10 nodes RFE-based : Initial parameter GI-based : Initial parameter UST-based : Activation = identity, Solver = sgd P-based : Activation = identity, Solver = lbfgs Hand-crafted : Activation = identity, Solver = lbfgs The experimental results are summarized in Table 5, where the first row, "All feature," refers to a 153-dimensional feature vector f input . The feature sets with feature selection algorithms (e.g., RFE-based, GI-based, and UST-based) gave better results than the f input . Among the six machine learning models, LR and SVM achieved the best accuracy of 80.5% with hand-crafted features. As it is more critical if we miss post-intubation tachycardia, better sensitivity is preferable. In terms of the sensitivity, the LR will be the most effective because its recall of tachycardia (85.1%) was greater than that of the SVM. As shown in Figure 4, the LR model achieved a much better AUC with the hand-crafted features than with the GI-based features. Although the hand-crafted feature was not the best for some models (e.g., random forest), all six models generally achieved high performances with it. Table 5. Performance of machine learning models with different feature sets, where each precision, recall, and F1 score is a pair of two results for non-tachycardia(0) and tachycardia(1).

Discussion
Of the five feature sets, hand-crafted features generally provided the best performance. The eight-dimensional hand-crafted features include sex, HR_mean, remifentanil_CE_max, remifentanil_CP_max, remifentanil_VOL_max, remifentanil_VOL_mean, BIS_min, and MV_max. Each of the features was carefully chosen by intensive examination of the datasets, and we found that there have been studies that support the choices. For example, it was discovered that female patients were more likely to have arrhythmia than male patients, which supports the variable "sex" [24,25]. In [26], the heart rate feature was utilized to predict tachycardia, which supports the variable "HR_mean." The four variables related to remifentanil are related to the fact that it may decrease heart rates [27,28] and that post-intubation tachycardia occurs more frequently in small dose patients than large dose patients [29]. The BIS values are known to be strongly related to prognoses in patients [30], which are related to the variable "BIS_min." It has been discovered that respiration features, which are related to the variable 'MV_max', contribute to prediction of tachycardia [31].
One can argue that using all 153-dimensional features f input must be better than other feature sets. More features might help improve performance, but they often fail when we do not balance model complexity and inherent data complexity. The total number of datasets for each experiment is close to 102, and the dimension of all features is 153. Such imbalance between them causes the machine learning models to overfit to a given training data; they will exhibit almost perfect performance only for the training data as shown in Figure 5. Note that we performed a grid search to find optimal parameter settings for every feature set. For example, the ANN model has an L2 penalty of 0.1, and the C value of SVM is 0.1 for f input . The LR model appears not to be overfit, which may be related to the fact that the complexity of the LR model was successfully balanced by regularization. On the other hand, SVM and ANN were overfit, meaning that they were poorly generalized. Nonetheless, in terms of precision, the SVM with hand-crafted features outperformed the LR. In terms of recall, the ANN achieved the same performance as the LR. This implies that different models can be chosen for different purposes or requirements. In general, recall for tachycardia is the most important, so the LR with hand-crafted features will be optimal. This study is limited by its small dataset. Although the original number of collected datasets was 1931, we had only 257 patients data after filtering. As machine learning models learn from the dataset, we believe that performance will improve if more data are gathered. We will continue collecting more data to create better models.

Conclusions
In this paper, to predict post-intubation tachycardia, we compared five feature sets with various machine learning models. We collected datasets of two types (e.g., EMR and vital signs), and developed an algorithm to find intubation points due to annotation errors. The feature sets were defined using feature selection algorithms, statistical inspection, or manual examination. By experimental results, logistic regression (LR) model achieved the best accuracy and sensitivity with eight-dimensional hand-crafted features. To improve performance, more data should be collected and better models investigated. We will also perform further experiments with other different settings.

Conflicts of Interest:
The authors declare no conflict of interest.

Duration of intubation
RFE stands for recursive feature elimination, GI means Gini index, UST implies univariate statistical test, and P-based represents p-value based feature sets, COPD indicates chronic obstructive pulmonary disease, and PLETH_SPO2 means saturation of partial pressure of oxygen.