Chronic Disease Prediction Using Character-Recurrent Neural Network in The Presence of Missing Information

Abstract: The aim of this study was to predict chronic diseases in individual patients using a character-recurrent neural network (Char-RNN), a deep learning model that treats the data in each class as a word, when a large portion of the input values is missing. An advantage of Char-RNN is that it does not require any additional imputation method, because it implicitly infers missing values by considering their relationship with nearby data points. We applied Char-RNN to classify cases in the Korea National Health and Nutrition Examination Survey (KNHANES) VI as normal status or one of five chronic diseases: hypertension, stroke, angina pectoris, myocardial infarction, and diabetes mellitus. For comparison, we also employed a multilayer perceptron network for the same task. The results show higher accuracy for Char-RNN than for the conventional multilayer perceptron model. Char-RNN showed remarkable performance in finding patients with hypertension and stroke. The present study used the KNHANES VI data to demonstrate a practical approach to predicting and managing chronic diseases with partially observed information.


Introduction
Chronic diseases require long-term, continuous management. They take a long time to manifest and are difficult to cure [1]. According to the Current Status and Future Development of Chronic Disease Management Project of the Korean Ministry of Health and Welfare, deaths from five major chronic diseases (hypertension, stroke, angina pectoris, myocardial infarction, and diabetes mellitus) accounted for 63.1% of all deaths in Korea in 2003. The cost burden of these diseases has increased annually, and the number of deaths caused by chronic diseases also continues to increase.
Various approaches have been introduced to prevent chronic diseases, and most of them focus on lifestyle [1][2][3]. However, it is difficult for individuals to change their lifestyle to prevent chronic diseases, because many people do not know which chronic diseases they may be susceptible to based on their physical condition and medical history. Although a few approaches have been used to predict the possibility of contracting these diseases, their performance was limited because relevant information on the physical condition and medical history was often omitted.
Various studies on chronic diseases have received a lot of attention since the 1990s. A few studies were conducted on the assumption that smoking, drinking, and high cholesterol levels cause chronic diseases. Summer et al. [2] examined the association of cholesterol level with stroke and coronary heart disease using experimental groups. Other related studies have included reports investigating the effects of dietary supplements on preventing chronic diseases. One such dietary supplement is

Related Work
Analyzing chronic diseases related to lifestyle habits requires data on individual habits, which, like other health-related data, are generally obtained through surveys. However, respondents are often unable to answer some health survey questions, which introduces missing information into the survey dataset, and a dataset containing missing values often causes the analysis to fail. Because missing data are a common problem in survey datasets, various studies have examined how to handle them. García-Laencina et al. [5] analyzed the missing data problem in pattern classification, in which missing or unknown values must be dealt with as part of the classification task itself. They covered case deletion, missing data imputation, model-based procedures, and machine learning methods, and argued that the handling method should be chosen to suit the data at hand [5]. Missing-value handling has also been applied to medical data. On data collected through the El Alamo-I project, alternative imputation methods, including multilayer perceptron (MLP), self-organizing map (SOM), and k-nearest neighbor (KNN), were compared with statistical techniques, and the accuracy of predicting early cancer recurrence was measured using an artificial neural network (ANN) trained on the data with imputed values [6]. In 2019, Williams et al. [7] proposed knowledge extraction and management (KEM). Compared with statistical approaches, KEM can identify all relationships between variables, even when the correlation is only weak. Conventional methods for identifying multivariate classifiers use univariate analysis of all features, marker identification to allow class discrimination, and optimization algorithms such as random forest, support vector machine (SVM), or neural networks to find the optimal combination.
Several studies on health care data analysis with missing values have been presented. Schuster et al. [8] suggested a multilevel support vector machine framework to handle missing information and incorrect data. Razzaghi et al. [9] imputed missing values by assigning the values of neighboring data points using four approaches: hierarchical multiagglomerative clustering, normal distribution model, normal regression model, and predictive mean matching. Liu et al. [10] handled missing data using a clustering approach to reduce bias when analyzing a virus's potential for circulation. As these examples show, most approaches handle missing data by evaluating the mean of surrounding values and use clustering to compute the distance. Missing data have also been accurately estimated by applying an adjusted weight voting random forest-based model [11]. In 2001, data from the National Health Interview Survey were used to analyze multiple risk factors in the US population [12]. A total of 29,183 data points were analyzed by cluster analysis of the risk factors. Cases with missing data that would impair accuracy were excluded from the analysis.
Health-related prediction is very sensitive to data omission; hence, methods that eliminate cases with missing data are often used. However, this approach is limited when the dataset is small or when values are missing because the participant does not know the answer [13]. In 2002, Casaburi et al. [14] evaluated the safety and efficacy of new drugs for chronic obstructive pulmonary disease. They performed two 12-month clinical trials comparing the placebo effect to the drug effect, conducted a covariance analysis on the collected data, and analyzed patients who could not be evaluated because of disease deterioration using the worst of their existing data. A commonality across these approaches is that missing values were imputed by estimating them from adjacent data points in an arbitrary manner. In 2016, Liu et al. [15] examined physical activity data from the 2003-2004 National Health and Nutrition Examination Survey (NHANES) and handled the data missing due to device failure in accelerometer measurement using a multiple imputation approach based on additive regression, bootstrapping, and predictive mean matching (ARBP). The most accurate ARBP model was selected and analyzed as the final model [15]. In 2017, Beaulieu-Jones and Moore [16] examined electronic health records (EHRs), which are an important source of data on patient status but contain many missing values. Imputation of missing information using deeply learned autoencoders in the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT) showed strong estimation accuracy and yielded the most powerful predictor of disease progression [17].
In 2019, Azimi et al. studied remote health status monitoring, which is used to track patients and provide early detection of disease and preventive care. Internet-of-Things (IoT) technology facilitates the development of these monitoring systems, but serious problems remain in real-world use. In particular, forecasting from real-time health monitoring breaks down when data on human health indicators are missing, because the variability of the indicators is ignored. The IoT-based system therefore learns from data collected in clinical trials and uses what it has learned to make decisions when other data are missing [18].
In 2019, MEthyLation Inference for Single cell Analysis (Melissa), a Bayesian hierarchical method that clusters cells and finds posterior transition patterns between cells, was used to impute CpG sites without coverage (i.e., most CpGs have missing values), and its accuracy was compared with that of a variety of machine learning data imputation methods [19]. Another paper addressed food safety assessment, where the complexity and trace levels of pollutants make it difficult to detect unexpected compounds and to assess chemical stability. Missing data in Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) peak data were therefore substituted using Mean-LOD and Singular Value Decomposition-Quantile Regression Imputation of Left-Censored data (SVD-QRILC), combined with chemical measurement tools, on the MTBLS752 and MTBLS74 datasets [20]. Several studies on analyzing chronic diseases, with attention to preventing chronic diseases in the growing elderly population, have recently been published.
The amount of health care data is steadily increasing; hence, various studies are applying deep learning and other state-of-the-art methods, already used in applications including image classification [21], text analysis [22], and speech recognition [23], to data classification. In the field of health informatics, different deep learning architectures have been proposed as the volume of relevant data has grown, including the convolutional neural network [24], the recurrent neural network (RNN) [25], and the deep neural network (DNN) [26], the last of which is the architecture most commonly used in studies investigating data classification [27]. Deep learning algorithms are accordingly also applied to health status and disease prediction. In 2013, Ahmed and Loutfi [28] introduced various methods and procedures for health monitoring and for the analysis of biometric information and data using wearable sensors. Their methods processed and validated data to ensure that the data configuration was meaningful for the analysis, defined attributes for the datasets, and provided methods for analysis in the health and welfare sector.
Various machine learning methods are used for analysis, and choosing the method according to the characteristics of the data has also been suggested. In 2014, Kaur et al. [29] presented an improved J48 algorithm for predicting diabetes and evaluated it on the Pima Indians diabetes data. Using a total of 768 records, they analyzed information from patients with diabetes and predicted diabetes. In 2019, SVM, naïve Bayes, random forest, and simple CART were compared on data collected from the National Institute of Diabetes and Digestive and Kidney Diseases, and SVM provided the best accuracy for predicting diabetes. Because the variables collected for the prediction of diabetes were fixed and the task was a simple prediction of the presence or absence of the disease, good accuracy was obtained with existing machine learning methods. One study [30] performed spatial prediction of landslide susceptibility in China's Long Country region using kernel logistic regression, naïve Bayes, and RBFNetwork. That study compared the accuracy of spatial prediction between existing machine learning methods and RBFNetwork, because the structure of the analysis data was simple for predicting landslide susceptibility [31]. To classify single-channel electroencephalography (EEG) signals, another study treated the EEG signal as one sequence and analyzed it with an LSTM model, which automatically classifies sleep stages from single-channel EEG signals. By exploiting the order of the EEG samples, this approach showed excellent performance in classifying sleep stages and analyzing sequence data [32].
Moreover, improved analytical models have been presented for health status and disease prediction; however, because they rely on refined data, their application to other similar data in practice is limited. Therefore, we compare the model proposed in this paper with some of the missing data replacement methods and machine learning classification methods mentioned in recent papers.

Materials and Methods
This study employed Char-RNN, which is a deep learning method for text analysis that considers the relationships between nearby values, to classify five chronic diseases in the KNHANES dataset with missing values. This section presents detailed descriptions of the dataset employed in this study, preprocessing applied to the dataset, and learning procedure. The Char-RNN algorithm is also explained.
Char-RNN is a deep learning model that creates short strings of characters using an RNN. Char-RNN can learn from sentences, generate similar new sentences, and derive the class of a sentence similar to the learned sentences. Yuan et al. (2017) trained Char-RNN on drug molecules and derived a new compound-binding equation [33]. RNN is a deep learning method that learns training data letter by letter, whereas Char-RNN segments a sentence into words and learns word by word. Char-RNN is trained on sentences and segments them into n-grams while learning them. Char-RNN is frequently used in translation, because it accurately interprets typographical errors and missing letters and has higher accuracy in sentence learning than RNN. For example, in [34], a Char-RNN model trained on music data to develop a transcription model performed better than the existing methods in music transcription.
RNN is a deep learning method frequently used in work involving natural language processing (NLP) [35]. The model equation is $h_t = \phi(W x_t + U h_{t-1})$, where $h_t$ is the hidden layer at time $t$, which is a function of $x_t$ (the input at the same time $t$), $W$ (a coefficient matrix), and a matrix $U$ applied to the value of the hidden layer at time $t-1$ (i.e., $h_{t-1}$). Memory is reflected in the coefficient matrix $W$. A decision is made based on the current input value $x_t$, an error value is computed, and the computed error is fed to the hidden layers. Next, $W$ is updated based on these values. The sum of the input $x$ and the memory $h$ passes through the function $\phi$ and is compressed. The range of output values is restricted by a hyperbolic tangent function (tanh), which is differentiable; hence, backpropagation can be applied. Accordingly, the feedback between $h_t$ and $h_{t-1}$ occurs at every time step. Through the learning process, the output is produced by applying the tanh function to the input $x$ multiplied by the weight $W$. When an RNN model is trained on text in this way, it can learn short sentences; however, it does not perform well in learning long sentences or determining relationships among words.
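As an illustration of the recurrence above, the following minimal sketch computes the hidden state step by step in plain Python. The function names `rnn_step` and `run_rnn`, the list-of-lists matrix layout, and the zero initial hidden state are assumptions made for this example; they are not taken from the paper, which does not give implementation details.

```python
import math


def rnn_step(x_t, h_prev, W, U):
    """One recurrence step: h_t = tanh(W x_t + U h_{t-1}).

    W has shape (hidden, input) and U has shape (hidden, hidden),
    both stored as lists of rows (an assumption of this sketch).
    """
    h_t = []
    for i in range(len(h_prev)):
        # Weighted sum of the current input ...
        pre = sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        # ... plus the weighted previous hidden state (the "memory").
        pre += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))
        # tanh compresses the sum into the open interval (-1, 1).
        h_t.append(math.tanh(pre))
    return h_t


def run_rnn(inputs, W, U, hidden_size):
    """Feed a sequence of input vectors through the recurrence."""
    h = [0.0] * hidden_size  # assumed initial state
    for x_t in inputs:
        h = rnn_step(x_t, h, W, U)
    return h
```

Calling `run_rnn` on a sequence of one-hot vectors returns the final hidden state; training (the error computation and weight update mentioned in the text) is deliberately left out of this sketch.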
In contrast to RNN, text analysis by Char-RNN learns a sentence by dividing it into n-gram segments, which results in superior learning performance when determining the relationships among words. Given a previous character sequence, Char-RNN effectively learns to predict the next character. This learning mode is similar to classification, in which a probability distribution over classes, such as objects in an image or characters, is generated and the output is drawn from the text vocabulary [35]. In this case, a standard categorical cross-entropy loss is used to classify characters in a sentence and train a model whose output class is the text vocabulary. Char-RNN divides the order of words into n-grams in each sentence, predicts the next word according to the order of the divided words, and grasps the meaning of the sentence. According to function 2, Char-RNN understands the sentences for n-grams before and after function 2 and learns one sentence by using it.
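The n-gram segmentation and the next-character objective described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`char_ngrams`, `next_char_pairs`, `categorical_cross_entropy`) are hypothetical, the n-grams are taken over characters, and the loss is shown for a single prediction represented as a dictionary of probabilities.

```python
import math


def char_ngrams(sentence, n):
    """Return all contiguous character n-grams of a sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def next_char_pairs(sentence, n):
    """Pair each n-character context with the character that follows it."""
    return [(sentence[i:i + n], sentence[i + n])
            for i in range(len(sentence) - n)]


def categorical_cross_entropy(probs, target):
    """Loss for one prediction: minus the log-probability of the true character.

    `probs` maps each vocabulary character to its predicted probability.
    """
    return -math.log(probs[target])
```

For example, `next_char_pairs("abca", 2)` yields the training pairs `("ab", "c")` and `("bc", "a")`; a model would be trained to assign high probability to the second element of each pair given the first.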

Data Description and Learning Procedure
We applied Char-RNN to data from the KNHANES VI (2013, 2014, and 2015) to predict five chronic diseases (hypertension, stroke, myocardial infarction, angina pectoris, and diabetes mellitus) with the greatest influence on comorbidities. The KNHANES is a national health survey conducted annually by the Korea Center for Disease Control that consists of questions to examine characteristics such as health behavior, nutritional intake, and chronic diseases [36]. The survey is administered to participants selected at the city, province, and county level. The screening items included in the survey are selected by the sector advisory committees and the coordinating advisory council. The KNHANES datasets have high reliability and accuracy because the data are collected by a national institute and the survey items are revised during each phase of the survey. The KNHANES datasets consist of 760 variables. This study focused on the phase VI data, containing approximately 22,000 cases reflecting the most recent lifestyle habits and patterns available in the datasets.
From the variety of diseases and related variables covered by the survey, which include osteoarthritis, rheumatoid arthritis, osteoporosis, tuberculosis, asthma, thyroid disease, cancer, inflammation, and hepatitis, we selected five common chronic diseases. Table 1 presents the gender and age composition of the subjects included in the dataset for this study. Although the dataset contained 22,000 cases, most individual cases did not have any diagnosed disease. Consequently, significant variables were selected from the dataset with 760 variables using a regression analysis with stepwise variable selection. A total of 62 variables were selected for the five diseases. Variables that were strongly correlated with other selected variables (correlation coefficient of 0.6 or higher) were then removed, leaving 32 variables. To maximize analytical accuracy, we also excluded selected variables that coexisted with other selected variables during data processing (Table 2), which left 29 variables for the analysis. After variable selection, each variable value was interpreted as text and converted to character format. As shown in Figure 1, the numerical values of the 29 variables relevant to the five diseases were converted to letters using the following rule: (0 → a, 1 → b, 2 → c . . . ). The missing values were replaced with tabs, which cannot be confused with a letter of the alphabet. Each number was mapped to a single letter so that each case could be learned as one sentence; for example, the sequence 0001 becomes aaab. Preprocessing divided the data for the five diseases and normal status into the required variables. We extracted the variables using regression analysis to determine the variables affecting each disease. After preprocessing was complete, we trained Char-RNN with the preprocessed dataset.
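The conversion rule described above (0 → a, 1 → b, 2 → c, with a tab for a missing value) can be sketched as follows. The sketch assumes that every variable is a small non-negative integer code below 26 and that a missing value is represented as `None`; the paper does not state how larger or non-integer values were encoded, and the function names are hypothetical.

```python
import string

MISSING = "\t"  # the text states that missing values were replaced with tabs


def encode_value(value):
    """Map 0 -> 'a', 1 -> 'b', ..., 25 -> 'z'; None (missing) -> tab."""
    if value is None:
        return MISSING
    if not 0 <= value < len(string.ascii_lowercase):
        # Assumption of this sketch: codes outside 0..25 are not handled.
        raise ValueError(f"value {value} is outside the assumed range 0..25")
    return string.ascii_lowercase[value]


def encode_case(values):
    """Turn one survey case (a list of coded answers) into one 'sentence'."""
    return "".join(encode_value(v) for v in values)
```

With this rule, `encode_case([0, 0, 0, 1])` gives `"aaab"`, and a case with a missing second answer, `[0, None, 2]`, gives `"a\tc"`.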
Char-RNN learned each case as a sentence, identified the characteristics of the sentence from the cases, and determined the relationship between the label of each case and the corresponding characteristics of the transformed sentence. This approach assigned missing values in a new instance based on the characteristics or by considering the nearby values identified during the learning phase.

Experimental Results
The analysis was performed using two deep learning models. Figure 2 shows the whole process. Each model was trained to classify cases labeled as normal, hypertension, stroke, myocardial infarction, angina pectoris, or diabetes mellitus, and was then tested on test datasets with missing information to classify new cases. Learning was conducted with approximately 600 cases per label. The numbers of angina pectoris and myocardial infarction cases fell below 600 during data processing and were therefore lower than those of the other chronic diseases. Hence, the sizes of these groups were increased by replicating existing cases to prevent overfitting due to imbalanced data. Char-RNN required the value of each case to be text; thus, the data were converted using the rule (0 → a, 1 → b, 2 → c . . . ). The data format was text separated by tabs. The data were transformed to sentence format for the learning phase. Stepwise regression was used to remove the KNHANES VI variables that did not influence the five selected chronic diseases. Of the 760 variables in KNHANES VI, 652 remained after excluding pediatric- and female-specific variables and cancer-, joint-, or dental-related variables. Next, variables that were significantly associated with the five chronic diseases were selected using a stepwise selection method: variables with a p-value below the significance level of 0.05 were retained, and the rest were excluded. As a result, 17 variables were selected for hypertension, 20 for stroke, 23 for myocardial infarction, 22 for angina pectoris, and 29 for diabetes mellitus (Table A1). A few influential variables were associated with more than one disease; therefore, a total of 62 unique variables were selected.
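The replication step used to bring the smaller classes up to roughly 600 cases per label can be sketched as below. This is a hedged illustration, not the authors' procedure: the paper does not say how the replicated cases were chosen, so random sampling with replacement, the fixed seed, and the function name `oversample_by_replication` are all assumptions.

```python
import random


def oversample_by_replication(cases_by_label, target=600, seed=0):
    """Replicate existing cases until every label has at least `target` cases.

    `cases_by_label` maps a label to a list of cases. Labels that already
    have `target` or more cases are returned unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = {}
    for label, cases in cases_by_label.items():
        if not cases:
            raise ValueError(f"label {label!r} has no cases to replicate")
        extra = max(0, target - len(cases))
        # Assumption: replicated cases are drawn at random with replacement.
        balanced[label] = list(cases) + [rng.choice(cases) for _ in range(extra)]
    return balanced
```

Replication should be applied to the training split only; copying cases before splitting would leak duplicates into the test data.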
Table A2 lists detailed descriptions of the selected variables shown in Table A1 for each of the five diseases; the descriptions are taken from the KNHANES VI guideline. Of the available 22,000 cases, those with missing values in 200 or more variables relevant to the five chronic diseases were completely removed from the analysis dataset. In addition, the training data had to be cleaned before the model could be built on the selected variables. Therefore, for each disease, we removed the cases with at least one missing value. Finally, approximately 3000 cases were selected after the filtering steps. The analysis outcome can be affected by the presence of correlations among the selected variables; hence, a correlational analysis was conducted. The results show that a few variables were correlated with others, and those that were strongly correlated with included variables were removed. The criterion of r > 0.6 was used to identify and remove strongly correlated variables. A total of 32 variables remained after the strongly correlated variables were removed.
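The correlation filter can be illustrated with a small greedy procedure. This sketch makes several assumptions that the paper does not confirm: Pearson correlation is used, the absolute value of r is compared with the 0.6 threshold, and variables are examined in their given order so that the first of a correlated pair is the one kept. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.6):
    """Keep a variable only if |r| with every already kept variable is below threshold.

    `columns` maps a variable name to its list of values.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

On a toy input where column `b` is exactly twice column `a`, `drop_correlated` keeps `a` and drops `b`.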
Next, three variables (time of depression diagnosis, DF2_ag; time of angina pectoris diagnosis, DI6_ag; and presence or absence of comorbidities of myocardial infarction and angina pectoris, DI4_pr) were removed because they had similar values across all five chronic diseases and were likely to reduce the analysis accuracy. The response rates for these three variables were very low because a large majority of respondents did not know the answer. Consequently, the value of DF2_ag, DI4_pr, and DI6_ag was mostly 8, the code for the response "do not know" (Figure 2). Such variables may reduce the analysis accuracy; thus, they were removed, and the analysis was performed on the final set of 29 variables. Based on these 29 variables from the KNHANES VI (2013, 2014, 2015) dataset, we performed a classification of the five chronic diseases (hypertension, stroke, myocardial infarction, angina pectoris, and diabetes mellitus), which affect many individuals but do not yet have clear predictive criteria. Figure 3 depicts a graph of the optimal number of training iterations for the analytical model. The number of iterations during the learning phase was set at 50,000, because the loss increased when the number of iterations exceeded 50,000. Char-RNN was compared against multilayer perceptron (MLP), an extensively employed deep neural network model that was specifically developed for data classification. For MLP, three hidden layers were formed with 256, 128, and 64 nodes. The prediction accuracy was higher for Char-RNN than for MLP. Table 4 shows the accuracies of chronic disease predictions on 100, 200, and 300 test datasets with missing values based on the outcomes of learning via DNN and Char-RNN. When testing on the test data, data imputation was performed for the models other than Char-RNN, using KNN imputation and multiple imputation. KNN imputation applies the k-nearest neighbor algorithm, which is commonly used to find the class of a data point: for a case with missing data, it finds the k nearest neighbors and uses their values to fill in the missing entries.
There are several ways to measure distance in nearest-neighbor algorithms. In this paper, we used Euclidean distance to find the closest neighbor and then used its value as the replacement for the missing data. Alternatively, KNN with k neighbors can take a weighted average of the neighbors' values, using the distance to each neighbor to set its weight. The closer a neighbor is, the more weight it receives in the average. The weighted average appears to be the most commonly used method [37]. Multiple imputation consists of three steps: imputation, analysis, and pooling. Multiple imputation can be used to account for the uncertainty of results in all settings. It can be carried out as multiple imputation using chained equations. Accordingly, we simulated multiple imputation using the existing data, created several sets (m) of substituted missing values, performed the statistical modeling on each set in the analysis step, and averaged the m sets in the pooling step to derive the results. This yielded the replacement values for the missing data [38].
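The KNN imputation described above can be sketched in a few lines. This is an illustrative implementation, not the one used in the study: it assumes that missing entries are `None`, that neighbors are drawn only from fully observed rows, that Euclidean distance is computed over the entries observed in the incomplete row, and that inverse-distance weights are used for the weighted average. With `k=1` it reduces to copying the value of the single closest neighbor.

```python
import math


def knn_impute(row, complete_rows, k=3):
    """Fill the None entries of `row` from its k nearest complete rows."""
    observed = [i for i, v in enumerate(row) if v is not None]

    def distance(other):
        # Euclidean distance over the entries that are observed in `row`.
        return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

    neighbours = sorted(complete_rows, key=distance)[:k]
    # Inverse-distance weights: closer neighbours count more in the average.
    # The small constant avoids division by zero for identical rows.
    weights = [1.0 / (distance(nb) + 1e-9) for nb in neighbours]
    total = sum(weights)

    filled = list(row)
    for i, v in enumerate(row):
        if v is None:
            filled[i] = sum(w * nb[i] for w, nb in zip(weights, neighbours)) / total
    return filled
```

For categorical survey codes, a weighted vote over the neighbors' codes would be a more natural choice than a weighted average; the average is shown here because it is the variant the text describes.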
The variables (HE_HPdg, HE_DMdg, HE_HLdg, and HE_fh) were physician diagnoses; hence, they directly affected the prediction outcome. The brightness contrast in the confusion matrices indicated that Char-RNN performed better than other models in predicting chronic diseases. Overall, the accuracy and precision were higher for Char-RNN, and the recall level was similar between the two models. The predictive power of Char-RNN was particularly high for hypertension and stroke.
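Accuracy, precision, and recall can all be read off a confusion matrix. The sketch below shows the computation under the assumption that rows are true labels and columns are predicted labels; the function name and the matrix in the usage note are made up for illustration and are not results from the paper.

```python
def per_class_metrics(confusion, labels):
    """Compute overall accuracy and per-class (precision, recall).

    confusion[i][j] is the number of cases whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        predicted_as_label = sum(confusion[r][i] for r in range(n))  # column sum
        actually_label = sum(confusion[i])                            # row sum
        precision = true_positive / predicted_as_label if predicted_as_label else 0.0
        recall = true_positive / actually_label if actually_label else 0.0
        metrics[label] = (precision, recall)
    return metrics
```

For a toy two-class matrix `[[8, 2], [1, 9]]` with labels `["normal", "hypertension"]`, the overall accuracy is 17/20 and the recall for `normal` is 8/10.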
The accuracy was higher for Char-RNN than for the DNN, Bayesian, SVM, and long short-term memory (LSTM) models (Tables 3 and 4), most likely because the other models classify new data based on what they learned from the training data, whereas Char-RNN learns the training data by treating words as data patterns and attributes meaning to a word when it encounters a similar word in each label. Therefore, Char-RNN is far more effective than the other models in handling missing values. In addition, it can learn long sequences exceptionally well even when there is missing information, because it learns the training data by dividing sequences into n-grams. For the other models, the missing values in the test datasets were handled through data imputation, using the KNNimpute, mode impute, and multiple impute methods.
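Of the three imputation methods named above, mode imputation is the simplest, and a minimal sketch follows. It assumes that missing entries are `None` and that the mode is computed per column over the observed values; the function name `mode_impute` is hypothetical, and ties are resolved in favor of the value seen first.

```python
from collections import Counter


def mode_impute(rows):
    """Replace None in each column with that column's most frequent observed value."""
    n_cols = len(rows[0])
    modes = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # A column with no observed values is left as None.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
    return [[modes[c] if v is None else v for c, v in enumerate(r)] for r in rows]
```

Because every missing entry in a column receives the same value, mode imputation ignores the relationship between a missing value and the other answers in the same case, which is the information Char-RNN is said to exploit.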

Conclusions
This study applied Char-RNN to the KNHANES VI dataset to classify five chronic diseases (hypertension, stroke, myocardial infarction, angina pectoris, and diabetes) and normal status in the presence of missing values. We first selected 29 of 760 variables using the stepwise selection method. We then applied Char-RNN to classify the five chronic diseases and normal status. A conventional DNN model with three hidden layers having 256, 128, and 64 nodes was applied to the same dataset for comparison. Additionally, LSTM and two machine learning models, naïve Bayes and SVM, were used for comparison on the five chronic diseases. The results show that Char-RNN performed, on average, 10% better than the other models combined with the KNN, mode, and multiple imputation methods. Table 4 shows that LSTM was more accurate for normal status and SVM was more accurate for stroke; however, Char-RNN had higher performance for the remaining four classes. In the comparison of missing values in Table 3, Char-RNN had better accuracy than the other models on the test dataset with missing values, because it predicted the labels of partially observed instances by identifying the data patterns surrounding the missing values. In addition, a data replacement method was used to replace the missing values for the other four models, whereas Char-RNN did not go through any data replacement process for the missing values. Therefore, Char-RNN can provide better results than other machine learning methods when analyzing data with missing values, and it does so without a data replacement step, thus reducing data preprocessing time. These characteristics of Char-RNN allow for more accurate prediction and classification.
However, a few limitations must also be considered. First, the KNHANES dataset was only collected in South Korea. Therefore, applying Char-RNN to a dataset including respondents of diverse ethnicities and lifestyle habits is one future research direction. Second, this study focused on predicting a dataset with missing values using machine learning methods. Identifying the common and different features across chronic diseases with machine learning methods, in order to prevent those diseases, can be studied in future work. Finally, there are several implicit methods for missing data analysis other than Char-RNN. Applying these implicit methods and comparing the results to find the best method of predicting chronic diseases can be useful.

| Chronic Disease | Variable | Description |
|---|---|---|
| | | Whether diagnosed with hypercholesterolemia by a physician (siblings) |
| | HE_IHDfh3 | Whether diagnosed with ischemic heart disease by a physician (siblings) |
| | HE_STRfh1 | Whether diagnosed with stroke by a physician (father) |
| | DI1_dg | Whether diagnosed with hypertension by a physician |
| | DI1_pt | Hypertension treatment |
| | DI1_2 | Taking blood pressure regulator |
| | DI3_dg | Whether diagnosed with stroke by a physician |
| | DI3_ag | Time of stroke diagnosis |
| | DI3_2 | Sequelae of stroke |
| | DI4_dg | Whether diagnosed with myocardial infarction, angina pectoris by a physician |
| | DI4_pr | Current morbidity of myocardial infarction, angina pectoris |
| | DI4_pt | Myocardial infarction, angina pectoris treatment |
| | DI5_dg | Whether diagnosed with myocardial infarction by a physician |
| | DI5_ag | Time of myocardial infarction diagnosis |
| | DI5_pt | Myocardial infarction treatment |
| | DI6_dg | Whether diagnosed with angina pectoris by a physician |
| | DI6_ag | Time of angina pectoris diagnosis |
| | DI6_pt | Angina pectoris treatment |
| | DE1_ag | Time of diabetes mellitus diagnosis |
| | DE1_33 | Diabetes mellitus treatment: non-pharmaceutical therapy |
| | LQ4_04 | Reason for limited activity: heart disease |
| | LQ1_mn | Number of days bedridden in the last month |
| | educ | Education level |
| | BO3_07 | Weight control method: health functional food |
| | BP6_31 | Whether attempted suicide in the past year |
| | HE_HPdg | Whether diagnosed with hypertension by a physician |

Table A2. Cont.

| Chronic Disease | Variable | Description |
|---|---|---|
| Angina pectoris | DI1_dg | Whether diagnosed with hypertension by a physician |
| | DI1_pt | Hypertension treatment |
| | DI1_2 | Taking blood pressure regulator |
| | DI3_dg | Whether diagnosed with stroke by a physician |
| | DI3_ag | Time of stroke diagnosis |
| | DI3_2 | Sequelae of stroke |
| | DI4_dg | Whether diagnosed with myocardial infarction, angina pectoris by a physician |
| | DI4_pr | Current morbidity of myocardial infarction, angina pectoris |
| | DI4_pt | Myocardial infarction, angina pectoris treatment |
| | DI5_dg | Whether diagnosed with myocardial infarction by a physician |
| | DI5_ag | Time of myocardial infarction diagnosis |
| | DI6_dg | Whether diagnosed with angina pectoris by a physician |
| | DI6_pt | Myocardial infarction treatment |
| | DE1_33 | Diabetes mellitus treatment: non-pharmaceutical therapy |
| | LQ4_04 | Reason for limited activity: heart disease |
| | LQ4_06 | (Adult) Reason for limited activity: stroke |
| | LQ1_mn | Number of days bedridden in the last month |
| | educ | Education level |
| | BO3_07 | Weight control method: health functional food |
| | BD2_32 | (Adult) Frequency of heavy drinking |
| | BS6_3 | (Adult) Average daily smoking amount of past smokers |
| | HE_STRfh1 | Whether diagnosed with stroke by a physician (father) |
| Diabetes mellitus | DI1_dg | Whether diagnosed with hypertension by a physician |
| | DI1_pt | Hypertension treatment |
| | DI1_2 | Taking blood pressure regulator |
| | DI5_dg | Whether diagnosed with myocardial infarction by a physician |
| | DI5_ag | Time of myocardial infarction diagnosis |
| | DI6_dg | Whether diagnosed with angina pectoris by a physician |
| | DI6_ag | Time of angina pectoris diagnosis |
| | DE1_dg | Whether diagnosed with diabetes mellitus by a physician |
| | DE1_pt | Diabetes mellitus treatment |
| | DE1_4 | Ophthalmoscopy |
| | DE2_dg | Whether diagnosed with thyroid disease by a physician |
| | DF2_pr | Current morbidity of depression |
| | DK4_pr | Current morbidity of cirrhosis |
| | LQ4_15 | Reason for limited activity: depression/anxiety/emotional problem |
| | LQ4_22 | ( |