The Korea Cancer Big Data Platform (K-CBP) for Cancer Research

Data warehousing is the most important technology to address recent advances in precision medicine. However, a generic clinical data warehouse does not address unstructured and insufficient data. In precision medicine, it is essential to develop a platform that can collect and utilize data. Data were collected from electronic medical records, genomic sequences, tumor biopsy specimens, and national cancer control initiative databases in the National Cancer Center (NCC), Korea. Data were de-identified and stored in a safe and independent space. Unstructured clinical data were standardized and incorporated into cancer registries and linked to cancer genome sequences and tumor biopsy specimens. Finally, national cancer control initiative data from the public domain were independently organized and linked to cancer registries. We constructed a system for integrating and providing various cancer data called the Korea Cancer Big Data Platform (K-CBP). Although the K-CBP could be used for cancer research, the legal and regulatory aspects of data distribution and usage need to be addressed first. Nonetheless, the system will continue collecting data from cancer-related resources that will hopefully facilitate precision-based research.


Introduction
In clinical oncology, there is a major paradigm shift with the advent and dissemination of high-throughput genome sequencing technologies and molecular-targeted therapies. The subsequent explosion of biomedical data requires that cancer hospitals and laboratories be equipped with proper information technology capabilities, such as massive data storage and a high-capacity analysis platform, to address the accumulation of genomic and clinical data. Indeed, advances in information technology enable integration and analysis of large-scale biomedical data [1], and provide vital support for "precision medicine" that promises to deliver lifestyle modification and therapy tailored to the individual patient [2]. Now that this precision-based approach is a part of standard care in clinical oncology in some high-income countries, the need for a database or data warehouse (DW) to collect and store a wide range of medical, genomic, and lifestyle data cannot be overemphasized [3].
Initial versions of clinical DW simply stored raw data that were generated over the course of diagnosis and treatment without data editing [4][5][6][7], essentially ruling out the possibility of downstream analysis [8]. Since the early 2000s, many institutions collected large quantities of clinical data and constructed patient registries by extracting critical information such as outcome data [9]. Subsequent efforts were made to improve the quality of clinical data and track these data on a long-term basis [10,11]. Databases were built by collecting data from multiple institutions [12,13], and a system was created that

Data Sources
Data collected by the K-CBP consisted of (1) clinical data from 2001 to August 2018, (2) genomic data from April 2017 to August 2018, (3) biological sample data from 2002 to August 2018, and (4) cancer registration statistics data from the national cancer control initiatives from 1999 to August 2018. Clinical data were obtained from patients who visited the National Cancer Center (NCC) Hospital. Information on diagnosis and treatment was extracted from the electronic medical records (EMR) system. Information on genomic data was collected from targeted next-generation sequencing (NGS) analysis. Datasets of biological samples that included blood and tissue based on meta-information such as sample storage and storage locations were also collected from cancer screening examinees. Cancer registration statistics data generated from four distinct national cancer control activities conducted by the NCCI (namely, the Korea Central Cancer Registry, National Cancer Screening Program, Financial Aid Program for Cancer Patients, and Cancer Hospice and Palliative Care Program) were linked using an alternative key system independent of clinical cancer registries. Furthermore, "family history" and "social history" data based on survey results were additionally merged into the data. Data imported into the platform were de-identified and stored in an independent DW (IDW). All processes of DW generation were logged, which allowed for a newly updated version of DW to be explicitly characterized and deal with previous versions of a DW. Figure 1 shows an overview of the K-CBP holdings data.

De-Identification of Records
To protect patient privacy and ensure data safety, direct identifiers (ID), such as name and resident registration number present in healthcare data, need to be removed or replaced with alternative keys. Quasi-identifiers that can identify individuals when combining multiple items such as address, height, and weight were subjected to de-identification [31]. In the K-CBP, de-identification was performed by replacing the direct identifiers of source data with alternative patient keys before importing the data to K-CBP. Quasi-identifiers, which have the potential capability of identifying individuals when multiple features are combined, were deleted during the data provision process. The level of data deletion was determined depending on the level of the user's authority.
The alternative patient keys were generated by combining information such as institution source and registration order on the basis of personal information, as well as social security number, which is the unique identification number assigned to every legal citizen in Korea. The direct IDs used in the generation of the alternative patient keys were encrypted and stored in a closed system to minimize the possibility of external leakage. Only an honest broker who preprocessed the provided data can access the closed system. Alternative patient keys were configured to be changed randomly every time the dataset was generated so they could not be traced backward.

Creation of Clinical Datasets and Cancer Registries
The process of screening clinical data and creating a clinical cancer registry is shown in Figure  2. To create and refine datasets that contain clinically relevant information, raw data need to be carefully examined, and only meaningful features need to be extracted. To this end, we chose clinical datasets from raw clinical data, and potentially meaningful features were derived from the medical

De-Identification of Records
To protect patient privacy and ensure data safety, direct identifiers (ID), such as name and resident registration number present in healthcare data, need to be removed or replaced with alternative keys. Quasi-identifiers that can identify individuals when combining multiple items such as address, height, and weight were subjected to de-identification [31]. In the K-CBP, de-identification was performed by replacing the direct identifiers of source data with alternative patient keys before importing the data to K-CBP. Quasi-identifiers, which have the potential capability of identifying individuals when multiple features are combined, were deleted during the data provision process. The level of data deletion was determined depending on the level of the user's authority.
The alternative patient keys were generated by combining information such as institution source and registration order on the basis of personal information, as well as social security number, which is the unique identification number assigned to every legal citizen in Korea. The direct IDs used in the generation of the alternative patient keys were encrypted and stored in a closed system to minimize the possibility of external leakage. Only an honest broker who preprocessed the provided data can access the closed system. Alternative patient keys were configured to be changed randomly every time the dataset was generated so they could not be traced backward.

Creation of Clinical Datasets and Cancer Registries
The process of screening clinical data and creating a clinical cancer registry is shown in Figure 2. To create and refine datasets that contain clinically relevant information, raw data need to be carefully examined, and only meaningful features need to be extracted. To this end, we chose clinical datasets from raw clinical data, and potentially meaningful features were derived from the medical literature. Then, a panel of clinical experts and potential data users selected datasets and relevant features for each clinical cancer registry based on the evidence in the literature, clinical significance, and their potential for linkage to external data sources such as personal lifelog information, which includes health determinants and health status. The selected features were then extracted from the EMR system and converted to terms that conformed to terminology standards such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT), Logical Observation Identifiers Names and Codes (LOINC), RxNorm, and the Healthcare Common Procedure Coding System (HCPCS). Unstructured data in the clinical datasets were parsed into structured ones. Clinical data such as patient information, prescribed medications, and surgical information were selected from the formal data features that were processed as above. The created clinical datasets were reclassified by cancer type and used to generate a temporary clinical cancer registry. This temporary clinical cancer registry was moved to a manual input platform (i.e., an electronic case record form (e-CRF)), and the necessary data were added to complete the final clinical cancer registry. The completed clinical cancer registry was transferred to a DW and merged with clinical datasets. These processes were carried our weekly. Functions for data retrieval and analysis were implemented in a web-based platform of the K-CBP. literature. Then, a panel of clinical experts and potential data users selected datasets and relevant features for each clinical cancer registry based on the evidence in the literature, clinical significance, and their potential for linkage to external data sources such as personal lifelog information, which includes health determinants and health status. The selected features were then extracted from the EMR system and converted to terms that conformed to terminology standards such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT), Logical Observation Identifiers Names and Codes (LOINC), RxNorm, and the Healthcare Common Procedure Coding System (HCPCS). Unstructured data in the clinical datasets were parsed into structured ones. Clinical data such as patient information, prescribed medications, and surgical information were selected from the formal data features that were processed as above. The created clinical datasets were reclassified by cancer type and used to generate a temporary clinical cancer registry. This temporary clinical cancer registry was moved to a manual input platform (i.e., an electronic case record form (e-CRF)), and the necessary data were added to complete the final clinical cancer registry. The completed clinical cancer registry was transferred to a DW and merged with clinical datasets. These processes were carried our weekly. Functions for data retrieval and analysis were implemented in a web-based platform of the K-CBP.

Data Validation and Monitoring
To ensure that we created the correct system, we designed three validation procedures, and applied one or more of these procedures to each step of the dataset construction process ( Figure 3).

Data Validation and Monitoring
To ensure that we created the correct system, we designed three validation procedures, and applied one or more of these procedures to each step of the dataset construction process (Figure 3).
The goal of this process was to verify a large amount of data in a short period of time. Clinical datasets were validated by checking whether the process was performed (VA), comparing the number of original data and extracted data (VB), and finally checking the contents of sample data (VC). VA examines each step of the extract, transform, and load (ETL) procedure, which generates a temporary dataset called "temp mart". VB compares the temporary dataset with a preprocessed dataset with respect to the amount of the data by calculating some predetermined indexes, including the number of extracted items. In VC, the data manager validates the contents by directly sampling the data. During this process, any content with an error is modified by the input platform. Each validation process can be performed independently and can be applied at any ETL stage. The goal of this process was to verify a large amount of data in a short period of time. Clinical datasets were validated by checking whether the process was performed (VA), comparing the number of original data and extracted data (VB), and finally checking the contents of sample data (VC). VA examines each step of the extract, transform, and load (ETL) procedure, which generates a temporary dataset called "temp mart". VB compares the temporary dataset with a preprocessed dataset with respect to the amount of the data by calculating some predetermined indexes, including the number of extracted items. In VC, the data manager validates the contents by directly sampling the data. During this process, any content with an error is modified by the input platform. Each validation process can be performed independently and can be applied at any ETL stage.

Data Merging
The collected and processed datasets were merged and stored in an integrated DW. Data merging was performed on datasets such as biological sample information, genomic data, and national cancer control initiative data, starting with clinical datasets. The first step was to apply the same terminology standard to datasets based on clinical datasets with data matching. Then, each

Data Merging
The collected and processed datasets were merged and stored in an integrated DW. Data merging was performed on datasets such as biological sample information, genomic data, and national cancer control initiative data, starting with clinical datasets. The first step was to apply the same terminology standard to datasets based on clinical datasets with data matching. Then, each dataset was matched on the basis of the primary key of the clinical datasets. The primary key was used to restore the patient ID of each dataset and generate an alternative patient key. To avoid patient ID duplication and to match datasets, encrypted personal data were decrypted. After merging datasets, personal data were re-encrypted and stored in a closed system. Patients with duplicate IDs were disambiguated on the basis of their IDs at the NCC registry. Data from the clinical registries, biological samples, and gene mutations were transferred to the integrated DW where they could be linked to one another on the basis of the alternative patient key. Patient IDs could not be restored for data from the NCCI and external cancer registries. Instead of linking data to the clinical cancer registries at the NCC Hospital, we calculated summary statistics for NCCI data and presented them at the K-CBP.

Construction of the K-CBP
The K-CBP was constructed as a web-based data retrieval and analysis platform to utilize merged data. The platform was designed to connect with an integrated DW to reflect changes instantly. Data that can be utilized through the platform are shown in Figure 4.
ID duplication and to match datasets, encrypted personal data were decrypted. After merging datasets, personal data were re-encrypted and stored in a closed system. Patients with duplicate IDs were disambiguated on the basis of their IDs at the NCC registry. Data from the clinical registries, biological samples, and gene mutations were transferred to the integrated DW where they could be linked to one another on the basis of the alternative patient key. Patient IDs could not be restored for data from the NCCI and external cancer registries. Instead of linking data to the clinical cancer registries at the NCC Hospital, we calculated summary statistics for NCCI data and presented them at the K-CBP.

Construction of the K-CBP
The K-CBP was constructed as a web-based data retrieval and analysis platform to utilize merged data. The platform was designed to connect with an integrated DW to reflect changes instantly. Data that can be utilized through the platform are shown in Figure 4.

Legal and Regulatory Processes
The K-CBP data were provided in strict compliance with the user rights management and privacy policy. Because of legal regulation, only a few medical staff can use data from the K-CBP. Data distribution was permitted only after institutional review board approval. Data de-

Legal and Regulatory Processes
The K-CBP data were provided in strict compliance with the user rights management and privacy policy. Because of legal regulation, only a few medical staff can use data from the K-CBP. Data distribution was permitted only after institutional review board approval. Data de-identification, processing, and provision were carried out through an honest broker to minimize the possibility of personal information leakage.

Collection of Data
The cancer-related data held by the NCC included clinical, genomic, tumor biomarker, and cancer registration statistics data (Table 1). These data were split into subcategories according to data types, such as cancer sample type. Clinical cancer registries are in use for clinical and research purposes on specific cancer sites, such as prostate, lung, pancreatic, kidney, and ovarian cancers, and others are under construction for colon, liver, stomach, breast, and thyroid cancers. Additionally, the external registries will also expand the cancer type. These cancer registration statistics data comprise information such as diagnosis, death, surgery, drug prescription, Surveillance, Epidemiology, and End Results (SEER) summary stages, personal or family income, and healthcare institution included in hospice and palliative care.

Construction of Clinical Datasets and Clinical Cancer Registries
We generated clinical cancer registries for 10 major cancers. Unstructured data during the registry construction by cancer type averaged 24.3% compared to the whole features but decreased to 7.64% after structuring. Table 2 shows the features of structured, unstructured, and manual input data by cancer type. Structured data, data that can be represented by a specific number or word and whose format is roughly defined; unstructured data, free-text format data; manual input data, data that cannot be automatically entered through computerization.
More information about these terms and features is shown in Table 3 and Supplementary Material 1. Detailed information on the established registries for prostate, lung, pancreatobiliary, kidney, and ovarian cancers can be explored on sites linked to URL and QR code shown in Supplementary Material 2. Table 3. Definition of terms.

Term Definition
Alternative patient key A primary key that replaces a direct identifier with a random 8-digit number De-identification Elimination of direct identifiers and quasi-identifiers so that individuals cannot be identified Clinical cancer registry Outcome data such as diagnosis, treatment, and surgery that are selected among cancer clinical data; dataset refined in a form that can be used meaningfully National cancer control initiative data Cancer-related data collected under a nationally led project External clinical cancer registry Cancer-related clinical data from multiple institutions, including diagnosis, treatment, or surgery; dataset selected and refined for outcome data Structured data Data that can be represented by a specific number or word and whose format is standardized Unstructured data Free-text format data Manual input data Data that cannot be automatically imported through computerization

Data Merging
Clinical data from 515,780 cancer patients, NGS data from 280 patients, metadata of blood samples from 32,760 subjects, and metadata of tissue samples from 17,813 subjects were merged on the basis of the alternative key. In addition, cancer registration statistics data generated from the cancer control activities conducted by the four branches of the NCCI and external clinical cancer registries from separate cancer societies were transferred to the K-CBP. However, these data were not directly linked to the NCC clinical registries and the K-CBP because of the lack of information needed for re-identification.

Construction of the K-CBP
The K-CBP consists of a pre-generated dataset area and a dataset combination area. Through this, the K-CBP provides detailed information on each feature of the dataset. The user can inquire about the desired information through the K-CBP and reconstruct this new information to create a personalized dataset. Furthermore, the K-CBP provides status information about the data held by the platform, as shown in Figure 5.
The K-CBP consists of a pre-generated dataset area and a dataset combination area. Through this, the K-CBP provides detailed information on each feature of the dataset. The user can inquire about the desired information through the K-CBP and reconstruct this new information to create a personalized dataset. Furthermore, the K-CBP provides status information about the data held by the platform, as shown in Figure 5. On the web platform of the K-CBP, a user can query for the clinical registry data, analyze the features of different datasets, and carry out studies such as tracking and observing subjects who overlap in a specific period and for specific conditions. The generated personal dataset can be extracted and utilized after approval of the administrator. The National Cancer Data Center website (NCDC, http://www.cancerdata.kr) provides a detailed description of data held by the K-CBP, as well as a description of and data on the features of the retrieved function ( Figure 6). On the web platform of the K-CBP, a user can query for the clinical registry data, analyze the features of different datasets, and carry out studies such as tracking and observing subjects who overlap in a specific period and for specific conditions. The generated personal dataset can be extracted and utilized after approval of the administrator. The National Cancer Data Center website (NCDC, http://www.cancerdata.kr) provides a detailed description of data held by the K-CBP, as well as a description of and data on the features of the retrieved function ( Figure 6).

Discussion
The DW introduced in this paper is an integration of medical data that can enrich the quality of research and offer high-quality data services in medical fields. It is regularly being updated through strict evaluation steps to assure data quality. The DW is also the foundation of precision medicine, as it allows researchers or healthcare experts to trace patients' cumulative records over several years.
Although the established K-CBP has the potential to be used in precision medicine and clinical research, some limitations should be discussed. Genomic data are currently being integrated through the K-CBP, but the amount of data is insufficient to provide confidence in the results. Data on lifestyle and environment from various survey or study results were not integrated through the K-CBP. We expect that these datasets can be merged into K-CBP on the basis of the alternative key. On this platform, there are only 10 kinds of clinical cancer registries for each cancer type, combined with the number of cancers to be generated. Therefore, there is a limitation for research on rare and intractable cancers.
The merging of the generated datasets is a critical issue to be solved. When the source of the data

Discussion
The DW introduced in this paper is an integration of medical data that can enrich the quality of research and offer high-quality data services in medical fields. It is regularly being updated through strict evaluation steps to assure data quality. The DW is also the foundation of precision medicine, as it allows researchers or healthcare experts to trace patients' cumulative records over several years.
Although the established K-CBP has the potential to be used in precision medicine and clinical research, some limitations should be discussed. Genomic data are currently being integrated through the K-CBP, but the amount of data is insufficient to provide confidence in the results. Data on lifestyle and environment from various survey or study results were not integrated through the K-CBP. We expect that these datasets can be merged into K-CBP on the basis of the alternative key. On this platform, there are only 10 kinds of clinical cancer registries for each cancer type, combined with the number of cancers to be generated. Therefore, there is a limitation for research on rare and intractable cancers.
The merging of the generated datasets is a critical issue to be solved. When the source of the data is one institution, it is possible to merge the data using the common identifier used in that institution. However, when the sources of the data are multiple institutions, it is necessary to select a common identifier that covers institutions; most common identifiers that satisfy the conditions are not free from the laws and regulations imposed on the medical data. Data on lifestyle and environment, which are essential for precision medicine, are usually limited in terms of incorporation into the cancer big data platform because of the different clinical data and sources. Legal regulations in Korea that pertain to permission to access data, including personal information, should be eased. However, the legislation that concerns the authority to access these data is under discussion, and specific legal boundaries against the risk of using personal information should be proposed.
Despite these limitations, the K-CBP attempts to establish and implement basic datasets for cancer-related clinical studies and precision medicine. It creates and manages high-quality datasets. Particularly, we tried to secure the quality of data through the structuring of unstructured data, creation of datasets, and use of a validation system for the ETL process. However, there are limitations on the validation of source data and the presence of features that cannot be structured, which should be addressed in the future. Unstructured data that cannot be structured depend on manual input, and there is a possibility of error. Validation of manual input data is only through direct human validation; it is not systematic in comparison to the validation of other datasets. Moreover, a large dataset takes a long time to manually input, which makes it difficult to immediately update the data. In addition, unstructured text-formatted data are generated as a dataset without processing, which makes it necessary for the researcher to perform secondary processing. To secure the quality of the data, it is necessary to devise a method of generating a dataset by structuring the manual input data in addition to the conventional method of shaping. Furthermore, it is necessary to reduce human error and ensure quality by constructing a system that can instantly reflect all the changes as the source data changes.

Conclusions
In this study, we suggest the need for an integrated medical DW and introduce the construction process and expected outcomes from the utilization. K-CBP collects, links, and refines clinical, genomic, biological, statistical, and multi-institutional medical data under the jurisdiction of the law. Through this, comprehensive medical data for one subject (patient) are obtained. We developed a web-based platform that can utilize the collected data to enhance accessibility. Security is also taken into consideration through de-identification and user authority management over several stages. In the future, based on the expansion of features, data accumulation, and quality improvement, we intend to contribute to the improvement of the quality of public health by providing a system that can produce and enable healthcare experts to utilize high-quality base materials necessary for precision medicine.