A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce

Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses various issues such as high processing time and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with a Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated through benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed; the results are compared with the Bowtie and BWA aligners in the alignment phase, and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% more efficient F-score and a consistent 99.80% average accuracy; comparatively, a 0.21% higher mean accuracy is also achieved. Moreover, SNP mining has been performed to identify specific regions in genome sequences. All frameworks are implemented with the default memory management configuration, and the observations show that all workflows have approximately the same memory requirement. In the future, we intend to graphically present the mined SNPs for user-friendly interaction, and to analyze and optimize the memory requirements as well.


Introduction
The knowledge base of biological data can be collected from natural life, scientific experiments, and research archives. Classical organism databases are purposeful where species-specific data are available, as they have great significance in new discoveries. Biological databases have a significant role in bioinformatics, as they provide access to a wide range of biological data across an increasing variety of organisms. Many biological research studies have been piloted and have formed significant resources of genomic data. It is often stated that these data resources have not been fully explored yet [1]. These data sources also pose statistical problems; e.g., the family-wise error rate (FWER) [2] gives the probability of at least one false discovery across multiple tests, as it is well known that multiple tests may cause serious false positive problems. The FWER increases with the number of marker candidates [2,3]. It has also been observed that there is a serious issue of computational skew in genomic data, i.e., the size of the input file is the same while the processing time of variant calling is still significantly different [4]. A single nucleotide polymorphism (SNP) is a variant of a single nucleotide that exists at a particular locus in the genome, where the respective variant exists to a noticeable degree within a population [5][6][7][8]. An SNP is a genetic variation triggered by the alteration of a single nucleotide. Detection approaches are powerful methodologies, but they are prone to infrequent patterns in datasets that tend to produce false-positive results [9].
High-performance computing technology is being developed to process genomic data sources and perform computational analyses in the life sciences [27]. Many researchers have devised filtering approaches and effective computational algorithms to efficiently detect SNPs [9]. An alternative is cloud computing, as a replacement for owning and maintaining dedicated hardware. Cloud computing provides Map-Reduce as a parallel computing environment; an open-source implementation of the Hadoop Map-Reduce model has been developed for big data analytics, for example on NGS data [12]. With the emergence of these technologies, the cost of sequencing has decreased but the cost of processing and storage has increased, while processing huge amounts of data remains challenging. NGS takes input data and processes it to produce output; during processing, the data becomes huge in volume, which requires more space and computing resources [28]. Several distributed computing frameworks, e.g., Apache Spark, have been developed to provide suitable solutions addressing the scalability issues of variant calling such as SNP detection [29]. A large number of genome analysis tools based on distributed and grid computing frameworks have been proposed in [29,30]. The framework presented in [30], called BAMSI, is used for filtering large genomic datasets; it is a multi-cloud service and is flexible in its use of compute and storage resources. The framework presented in [31], called the SeqWare Query Engine, is used for storing and searching genome sequence data. The Genome Analysis Toolkit (GATK) is an effective and widely used exploratory tool for NGS based on the functional programming model of Map-Reduce. GATK targets accuracy, consistency, and CPU and memory effectiveness, and allows shared- and distributed-memory parallelization [32]. Halvade uses a Hadoop MapReduce based approach for genome analysis, where variant calling is carried out via chromosome divisions.
Due to the noticeable variance in the length of chromosomes, this division may cause a load imbalance issue [33,34]. Churchill is a tightly integrated DNA analysis pipeline and can perform variant calling via HaplotypeCaller or FreeBayes [35][36][37]. The load imbalance created by the uneven length of chromosomes can be reduced by using parallel variant calls; however, the problem is still considered computationally intensive. The authors in [38] use Spark for parallel analysis of genomes. The strategy in that work is simple, but it does not consider adjacent block overlap. Another tool, GATK4.0 [39], equipped with many tools for the analysis of genome data, is also based on the Spark framework. It supports multi-node and multi-core variant calling with parallelization, but it demands high computational resources and memory for large datasets, and its shuffle operation causes performance bottlenecks. To address the issue of SNP detection, the genome sequence analysis pipeline has also been implemented in parallel through a scalable distributed framework, e.g., SparkGA [38]. SparkGA has been widely used with the popularity of big data technology. This implementation is highly capable of parallelizing computation at the data level and is highly scalable, along with its load balancing techniques. GenomeVIP [40] is an open-source platform for genomic variant discovery, interpretation, and annotation, running on the cloud and/or local high-performance computing infrastructure. Although a number of tools have been developed independently, they contain innumerable configuration options and lack integration, which makes them cumbersome for a bioinformatician to use properly. SNP detection in NGS is critical, as its analysis is used in many applications such as genome-based drug design, disease detection, and microarray analysis. Therefore, more investigation is required to develop a fast, scalable, and more accurate SNP detection framework.
In this research study, we propose a fast and scalable workflow for SNP detection based on Hadoop Map-Reduce with the integration of the Bowtie aligner and a parallelized Heap, which enhances the SNP detection rate and optimizes the execution time. Moreover, mining of SNPs is also introduced in the proposed workflow. The results obtained are compared with state-of-the-art algorithms, i.e., the GATK [32], FaSD [22], Halvade [33], SparkGA [38], and Heap [8] algorithms.

Materials and Methods
This research aims to improve SNP detection in order to enhance the accuracy rate and optimize execution time. Our proposed framework relies on the Hadoop Map-Reduce programming model [41], which enables parallel and in-memory distributed computation. Hadoop is a free and open-source software platform that is used to process huge amounts of data and run applications in parallel in a cluster environment. It works on divide-and-conquer techniques and aggregates the results. It consists of map and reduce functions for processing and the Hadoop Distributed File System (HDFS) for storage [13]. Map-Reduce works by breaking the processing into two phases, i.e., a map phase and a reduce phase. The fundamental concept of Map-Reduce is based on <key, value> pairs. The map phase takes input in <key, value> pairs and produces output in the form of <key, value> pairs; the output key-value types can differ from the input key-value types. The output of the various map tasks is grouped together, and the keys and their associated sets of values are sent to the reduce phase. The reduce phase operates on keys and an associated list of values. The output of reduce is concatenated and written to HDFS. The proposed framework for SNP detection using the Map-Reduce paradigm is presented in Figure 1; the stepwise processes are shown in Figures 2 and 3 respectively. Moreover, the proposed framework also utilizes a dynamic load balancing algorithm based on [38], with some preprocessing of the data format for compatibility, to efficiently use the available resources. The proposed model consists of preprocessing, sequence alignment, and SNP calling and mining, integrated with dynamic load balancing, as discussed next. The graphical representation of the proposed workflow: both target and reference sequences are given as input to the model. Both input files are preprocessed as described in Section 2.1.
Then the generated segments, i.e., interleaved and non-overlapping segments, are uploaded to the Hadoop Distributed File System (HDFS) for onward processing. In the map phase, the input data is aligned to the reference genome using the Bowtie v.2 aligner as described in Section 2.2. The output of the map phase is collected in a reduce phase for SNP detection, where Heap is used for detecting the single nucleotide polymorphisms (SNPs) as described in Section 2.3. Finally, the detected SNPs are mined, and the output is generated into a single variant calling format (VCF) file.
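As an illustration only (the function names and toy data below are ours, not part of the framework), the map, shuffle, and reduce flow over <key, value> pairs can be sketched in Python:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run the map function over each input record, emitting <key, value> pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group the values emitted under the same key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Run the reduce function on each key and its associated list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Toy example: count aligned reads per chromosomal region.
reads = [("chr1", 100), ("chr1", 250), ("chr2", 40)]
pairs = map_phase(reads, lambda r: [(r[0], 1)])
counts = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
# counts == {"chr1": 2, "chr2": 1}
```

The grouping step is what allows all reads of one chromosomal region to arrive at a single reduce task, as the workflow requires.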

Preprocessing
FASTA [42] and FASTQ [43] are widely used for biological sequences because they are fast to process and readily available. They have emerged as common file formats for sharing sequencing read data; FASTQ additionally associates a per-base quality score with each read. Initially, the segmentation utility [44], which runs locally on the master node, takes the input dataset in FASTA and/or FASTQ format to make it accessible to all active computing instances, e.g., map tasks. The segmentation utility creates compressed segments of the default HDFS block size, e.g., 64 MB, for parallel execution using map tasks. For example, it reads 'N' blocks in one iteration from a file, where 'N' represents the number of map tasks available for execution. Upon reading the specified blocks, each block is assigned to a separate map task. All map tasks are executed in parallel to compress the assigned blocks, which are then uploaded to HDFS. The utility reads one block of data at a time from the input file and looks for the read boundary at the end of each block in order to detect the end of the last complete read. The data is taken up to the last complete read, and the leftover portion is stored in a buffer, which is then prepended to the next block of incoming data. Meanwhile, the data for a segment is interleaved in the map tasks, e.g., a particular map task interleaves data and writes it to a segment. Block-by-block reading of the dataset is one of the reasons that the proposed model performs significantly better than other programs, e.g., Halvade [33,34], which reads the data line by line. A status file is also uploaded in order to keep track of the input segments; it is used to inform the alignment program that a particular segment has been uploaded.
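A minimal sketch of this block-wise reading with read-boundary buffering, assuming four-line FASTQ records (all names and the toy blocks are illustrative):

```python
def segment_stream(blocks):
    """Cut raw text blocks into segments of complete FASTQ reads.
    A FASTQ read spans four lines; any partial read left at the end of a
    block is buffered and prepended to the next incoming block, as the
    segmentation utility does at each block boundary."""
    leftover = ""
    for block in blocks:
        data = leftover + block
        lines = data.split("\n")
        # Keep only whole 4-line reads; the last element may be a partial line.
        n_complete = (len(lines) - 1) // 4 * 4
        segment = "\n".join(lines[:n_complete])
        leftover = "\n".join(lines[n_complete:])
        if segment:
            yield segment

# Two raw blocks whose boundary falls in the middle of read r2.
blocks = ["@r1\nACGT\n+\nIIII\n@r2\nTT", "GG\n+\nIIII\n"]
segments = list(segment_stream(blocks))
# segments[0] holds read r1; the split read r2 is reassembled in segments[1]
```

The buffer carried between blocks is what lets the utility consume the file block by block without ever splitting a read across two segments.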
The status file contains IDs starting from 0; therefore, if 'N' map tasks are available for execution, segments 0 to 'N-1' are uploaded first, segments 'N' to '2N-1' are uploaded next, and so on. When all segments have been uploaded, a sentinel file is sent as a signal that all input datasets are available.
Moreover, some preprocessing steps are also applied to the reference genome prior to the actual execution of the Map-Reduce functions, e.g., the reference genome is divided into a preset number of non-overlapping segments. This segmentation produces chromosomal regions of approximately equal size, where each chromosomal region corresponds to one of the reduce tasks available for execution. The number of reduce tasks can be configured in advance based on the size of the reference genome. It is also ensured that all the required data, i.e., configuration files and binaries, are accessible to each compute node. Once all the required data have been fetched to each compute node, these preprocessing phases can be skipped. Performing preprocessing on the datasets to make them available on each compute node before actual execution minimizes the overhead of file I/O.
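The division of the reference into non-overlapping, approximately equal-sized regions might be sketched as follows (the sizing rule and names are our assumptions; the actual utility operates on chromosomal coordinates):

```python
def split_reference(chrom_lengths, n_regions):
    """Divide a reference genome into non-overlapping regions of roughly
    equal size; each region is handled by one reduce task."""
    total = sum(chrom_lengths.values())
    target = max(1, total // n_regions)  # approximate bases per region
    regions = []
    for chrom, length in chrom_lengths.items():
        start = 0
        while start < length:
            end = min(start + target, length)
            regions.append((chrom, start, end))  # half-open interval [start, end)
            start = end
    return regions

regions = split_reference({"chr1": 1000, "chr2": 600}, n_regions=4)
# 400-bp target size: chr1 yields 3 regions, chr2 yields 2
```

Cutting within chromosomes, rather than assigning whole chromosomes to tasks, is what avoids the load imbalance caused by uneven chromosome lengths.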

Map Function and Sequence Alignment
The input sequence reads are divided into segments of the default HDFS block size, i.e., 64 MB. The Bowtie v.2 aligner [45] is used for aligning reads. Bowtie is a very fast and memory-efficient sequence alignment tool that aligns reads against reference genome sequences. Bowtie performs chromosome-wise data partitioning and shuffling and aligns the sequence reads with the reference reads. It performs exact matching, which is the foremost feature of Bowtie and helps to detect more SNPs. In the map phase, each segment is considered a separate split and is hence processed by a single aligner instance. These are executed in parallel on each compute node while utilizing all available mappers. Generally, the number of map tasks is much greater than the number of mappers, meaning that several map tasks will be processed by each mapper. In order to reduce the network communication overhead and minimize repeated access to remotely stored files, our proposed model preferably assigns map tasks whose input segments are stored locally as part of HDFS. The indexing, concatenation, and sorting functions are based on Hadoop-BAM [46], as shown stepwise in Figure 2. Hadoop-BAM utilizes Java libraries to manipulate files in common bioinformatics formats through the Hadoop Map-Reduce framework, along with the Picard SAM JDK as well as command-line tools, e.g., SAMtools. Hadoop-BAM is a library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. The genome reads are parsed through Hadoop-BAM and aligned to the reference genome, as shown in Figure 3, which is already available on each compute node. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map functions that can operate directly on BAM records.
It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are easily convertible to support large-scale distributed processing. Upon successful completion of all alignments, the reads are transformed into <key, value> pairs, where each key is generated from the SAM record, i.e., <id_chromosomal_region, position_of_mapping>; the key gives the exact mapping position in the reference genome. The index function indexes the BAM (binary sequence alignment map [SAM]) file, indexing a coordinate-sorted BAM file for fast random access. The concatenation function processes intermediate SAM and BAM files and replaces read groups in the BAM file: it allows us to replace all read groups in the input file with a single new read group and to assign all reads to this read group in the output BAM file. The sort function sorts and merges BAM or SAM files and removes duplicate reads. Genome reads that are aligned to the same chromosomal region are grouped together to form a single reduce task.
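The key generation from an aligned SAM record could be sketched as follows (the region size and field handling are illustrative assumptions; the field layout follows the SAM specification):

```python
def sam_to_keyvalue(sam_line, region_size=1_000_000):
    """Transform one aligned SAM record into the <key, value> pair used to
    route reads to reduce tasks: the key combines the chromosomal region id
    with the 1-based mapping position on the reference."""
    fields = sam_line.split("\t")
    rname, pos = fields[2], int(fields[3])  # reference name, mapping position
    region_id = f"{rname}_{(pos - 1) // region_size}"
    return (region_id, pos), sam_line

key, value = sam_to_keyvalue("read1\t0\tchr1\t150\t60\t4M\t*\t0\t0\tACGT\tIIII")
# key == ("chr1_0", 150): region chr1_0, mapping position 150
```

Because the region id is the leading component of the key, Hadoop's shuffle naturally delivers all reads of one region, sorted by position, to the same reduce task.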

Reduce Function and Genome Single Nucleotide Polymorphisms (SNPs) Calling
Generally, the number of reduce tasks is much greater than the number of reducers, and a number of reduce tasks are executed in parallel. A particular task accepts all sorted intermediate <key, value> pairs as input for a single chromosomal region, which is stored in SAM or BAM file format. Here, multiple instances are created to perform SNP calling. Heap is an accurate and highly sensitive SNP detection tool for high-throughput sequencing data and offers equally dependable SNPs with distinct loci for genomic prediction (GP) and genome-wide association studies (GWAS) [8]. Heap performs read filtering in order to retain high-quality data based on the Phred scale, as shown in Equations (1) and (2). Reads with a score less than 20 and bases with a score less than 13 are removed from the search scope of valid SNP calling sites. Based on this quality filtering, the frequency of each allele is computed at all nucleotide sites in order to determine genotype sampling. Heap then performs the actual SNP calling while comparing the genotypes between the reference genome and the sample available at each compute node. The reducer function extracts the keys and associated values. It mines the bases A, T, C, and G through the utilization of a fast algorithm for statistical assessment of very large-scale databases [47], which executes the itemset mining algorithm only once, while other algorithms execute it several times. It then counts each base in a read and checks whether it matches the corresponding reference sequence. It also maintains a definite record based on the base quality, which is very helpful for realigning and recalling SNPs if detection accuracy remains inconsistent. The reduce function then emits the <key, value> pairs. A variant calling format (VCF) file is generated at the end of each reduce task; it consists of the SNPs detected in the corresponding chromosomal region.
Finally, all the VCF files are merged into a single VCF file to present all the SNPs detected among the samples. The mining of SNPs generates output that shows the region-wise saturation. The SNP caller calls the SNPs and generates output providing the number of SNPs; this study improves the SNP caller results and adds SNP mining, which shows the specific positions in the genome where the SNPs exist. This is helpful for target-based investigation of SNPs in a specific range of the genome.
Q = -10 log10(P) (1)

P = 10^(-Q/10) (2)

where P represents the base-calling error probability and Q represents the quality score (Phred score).
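The Phred relation and Heap's quality thresholds (reads below Q20 dropped, bases below Q13 excluded) can be sketched as follows; using the mean base quality as the read-level score is our simplifying assumption, not necessarily Heap's exact criterion:

```python
import math

def phred_q(error_prob):
    """Phred quality score from the base-calling error probability (Eq. 1)."""
    return -10 * math.log10(error_prob)

def filter_reads(reads, min_read_q=20, min_base_q=13):
    """Drop reads whose (mean) quality is below Q20 and mask individual
    bases below Q13, mirroring the filtering thresholds described above."""
    kept = []
    for seq, quals in reads:
        if sum(quals) / len(quals) < min_read_q:
            continue  # whole read removed from the SNP-calling search scope
        masked = "".join(b if q >= min_base_q else "N"
                         for b, q in zip(seq, quals))
        kept.append((masked, quals))
    return kept

reads = [("ACGT", [30, 30, 10, 30]), ("TTTT", [5, 5, 5, 5])]
kept = filter_reads(reads)
# second read dropped entirely; the Q10 base in the first is masked to 'N'
```

Note that Q20 corresponds to a 1% error probability and Q13 to about 5%, which is why these thresholds are common cut-offs for valid SNP calling sites.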

Dynamic Load Balancing
In order to get the best performance from the available resources, a dynamic load balancing algorithm, shown in Algorithm 1, is applied to balance the load; it remains active throughout process execution. A region with too many reads can be further divided via dynamic load balancing, as the execution time of several procedures in the workflow depends on the number of reads being processed. The algorithm acts as a local resource manager and is responsible for managing computing resources. In particular, the dynamic load balancing algorithm consists of load estimation and resource management components. The load estimation component calculates the load of a task instance by considering the size of the data and training parameters, which represent the computational complexity. The resource management component physically assigns the estimated amount of resources. It is worth noting that the dynamic load balancing algorithm does not change the resource scheduling algorithm of the Hadoop framework. Rather, it takes over the resources that have been pre-assigned to each launched task, and is then used to re-assign the resources for sub-constituent tools in each task by reconfiguring their runtime parameters. The core steps of Algorithm 1 are:

Obtain the total number of sequence reads:
    total_reads ← number_of_reads_per_segment.reduce_by_key()

Compute the average number of reads per load-balancing region:
    avg_seq_reads ← total_reads / chromosomal_region.count()
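Building on these per-region averages, the splitting of overloaded regions might be sketched as follows (the 2x threshold and the even split are illustrative assumptions, not the paper's exact policy):

```python
def rebalance(region_reads, avg_reads):
    """Split any region whose read count far exceeds the average into
    smaller sub-regions so that reduce tasks receive comparable loads."""
    balanced = []
    for region, count in region_reads.items():
        if count > 2 * avg_reads:
            parts = -(-count // avg_reads)  # ceiling division
            for i in range(parts):
                # Distribute the reads evenly over the new sub-regions.
                balanced.append((f"{region}.{i}", count // parts))
        else:
            balanced.append((region, count))
    return balanced

region_reads = {"chr1_0": 900, "chr2_0": 100, "chr3_0": 200}
avg = sum(region_reads.values()) // len(region_reads)  # 400
balanced = rebalance(region_reads, avg)
# chr1_0 (900 reads) is split into 3 sub-regions of 300 reads each
```

Splitting by read count rather than by chromosome length addresses the imbalance that chromosome-wise division causes in frameworks such as Halvade.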

Experimental Setup
Experimental datasets were obtained from the NCBI [23] and DDBJ DRA [48,49] web portals, which provide free access to biomedical and genomic data along with verified statistics. Two benchmark organisms were selected for the experiments based on compatibility of parameters, i.e., Sorghum and the human genome. The three Sorghum datasets, GULUM_ABIAD (DRR045054), RTx430 (DRR045061), and SOR_1 (DRR045065), consist of 1,573,011, 2,251,325, and 2,942,974 reads respectively; the number of base pairs in each dataset is 158,874,111, 227,383,825, and 297,240,374 respectively. Each dataset has a genome length of 1,000,000 and a read length of 101. The reference genome Sbicolor_v2.1_255 is used for the Sorghum datasets. The human genome dataset NA12878 consists of 1.6 billion 101 bp paired-end reads stored in two FASTQ files, 97 GB in size, compressed with the gzip compression tool (https://www.gzip.org/). The human genome hg19 resource bundle available from [50] is used as the reference. For visualization and ease of understanding, the results obtained for the two organisms are plotted separately, while the same parameters and experimental setup are used for comparison and analysis.
Various experimental setups are used for the evaluation of the proposed framework in comparison with the other state-of-the-art models, e.g., GATK 4.0, FaSD, Halvade, and SparkGA. A single-node pseudo cluster and real clusters consisting of 8, 16, and 32 worker nodes are used for scaling and analysis. The single-node pseudo cluster consists of an Intel® Core™ i7-7700K with four cores @ 4.20 GHz (eight threads) and 64 GB of installed memory, running a 64-bit Linux (Ubuntu 16.04.6 LTS) operating system (OS). The real clusters comprise 8, 16, and 32 compute nodes; the machine used in the single-node pseudo cluster is configured as the server, and each remaining node consists of an Intel® Core™ i5-7600K with four cores @ 3.8 GHz and 16 GB of installed memory, running a 64-bit Linux (Ubuntu 16.04.4 LTS) OS. All the nodes are connected through a 10 Gbit/s Ethernet network.

Measurement Metrics
Sensitivity, specificity, and accuracy are terms mostly associated with a classification test; they statistically measure the performance of the test. In classification, we divide a given dataset into two categories based on whether the items have common properties or not, by identifying their significance in a classification test. In general, sensitivity indicates how well the test predicts one category and specificity measures how well the test predicts the other category, whereas accuracy measures how well the test predicts both categories. If an SNP is detected, there are two possibilities: the detection is either true or false, termed a true positive (TP) or false positive (FP) respectively. Similarly, if an SNP is not detected, there are also two categories, i.e., true negative (TN) or false negative (FN). In [8], true detection of SNPs is based on sensitivity, positive predictive value (PPV), F-score, and accuracy. With the use of an efficient SNP detection algorithm, the rates of TP and TN help to increase the F-score and accuracy. The SNPs detected through the GATK, FaSD, and Heap SNP callers integrated with the BWA and Bowtie aligners, SparkGA, and Halvade are compared with the results of the proposed framework, i.e., the Hadoop-based Heap SNP caller integrated with the Bowtie aligner. The F-score and accuracy of the SNP callers are also recorded, where TP, FP, FN, TN, and PPV are considered standard measurement parameters. The computations of the chosen parameters are presented in Equations (3)-(6). Table 1 shows the empirical results of F-score and accuracy for all algorithms and the respective datasets used. Figures 4 and 5 show the comparative results of accuracy and F-score for all frameworks respectively. The frameworks GATK and FaSD are integrated with the BWA and Bowtie aligners.
The results show that the Bowtie aligner produces better results than BWA in terms of F-score, while the accuracy of BWA is better than that of the Bowtie aligner. The Heap SNP caller is then integrated with the BWA aligner and the results are recorded for comparison. The comparative analysis shows that Heap integrated with BWA produces better results than GATK and FaSD integrated with the BWA and Bowtie aligners. The SparkGA model is also executed, and its results are slightly better than those of the previous frameworks. The Halvade framework results are also compared with the other frameworks; however, its results are not significant on the selected parameters. The analysis shows that the proposed framework outperforms the existing algorithms in terms of the parameters used in the comparison.
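The metrics of Equations (3)-(6) follow the standard definitions; a minimal sketch with hypothetical counts (not taken from the paper's experiments):

```python
def snp_metrics(tp, fp, fn, tn):
    """Sensitivity, positive predictive value (PPV), F-score, and accuracy
    from the TP/FP/FN/TN counts of an SNP caller (Equations (3)-(6))."""
    sensitivity = tp / (tp + fn)                # fraction of real SNPs found
    ppv = tp / (tp + fp)                        # fraction of calls that are real
    f_score = 2 * sensitivity * ppv / (sensitivity + ppv)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, ppv, f_score, accuracy

sens, ppv, f_score, acc = snp_metrics(tp=90, fp=10, fn=10, tn=890)
# sensitivity 0.9, PPV 0.9, F-score 0.9, accuracy 0.98
```

Because the F-score is the harmonic mean of sensitivity and PPV, it penalizes a caller that trades many false positives for a few extra true positives, which is why it separates the pipelines more sharply than accuracy does.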

Single Nucleotide Polymorphism (SNP) Mining
Most SNP caller algorithms detect the SNPs and generate the output in VCF file format; the output shows the details of the SNPs detected and their number. SNP mining facilitates identifying the region-wise position of SNPs throughout the genome length in terms of a position ID. The ID contains the starting and ending positions of a genomic region where SNPs exist, and the region length gives the span of the region containing these SNPs.
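The region-wise grouping could be sketched as follows (the gap threshold separating two regions is an illustrative assumption, as the paper does not state one):

```python
def mine_snp_regions(snp_positions, gap=1000):
    """Group detected SNP positions into genomic regions and report each
    region as a position ID 'start-end' together with the region length."""
    positions = sorted(snp_positions)
    regions = []
    start = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev > gap:  # too far apart: close the current region
            regions.append({"id": f"{start}-{prev}", "length": prev - start + 1})
            start = pos
        prev = pos
    regions.append({"id": f"{start}-{prev}", "length": prev - start + 1})
    return regions

regions = mine_snp_regions([120, 300, 450, 5000, 5200])
# two regions: '120-450' (length 331) and '5000-5200' (length 201)
```

The resulting region IDs support the target-based investigation of SNPs within a specific range of the genome described above.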

Results and Discussion
To evaluate the correctness and validity of the proposed framework, sample datasets were extracted from all benchmark datasets with consistent length, i.e., a genome length of 2000 with a read length of 101, and executed on a single node. Each workflow experiment was executed 100 times and the average time in seconds was computed for the sample datasets, while the results on the real clusters are recorded in minutes for clear visualization and ease of understanding. The analysis of the sample datasets shows that the proposed framework produced better results than the others. For scalability analysis, all the workflows were evaluated on real compute clusters of different configurations, i.e., 8 compute nodes @ 116 GHz aggregate processing power with 32 cores and 112 GB of memory, 16 compute nodes @ 237.60 GHz aggregate processing power with 64 cores and 304 GB of memory, and 32 compute nodes @ 471.2 GHz aggregate processing power with 128 cores and 560 GB of memory. All the nodes are connected through a 10 Gbit/s Ethernet network. GATK correctly calls SNPs if a sufficient number of reads (coverage) is delivered, i.e., 20× or more for adequate sensitivity in genome re-sequencing, which is difficult under low read coverage of 7× or lower. FaSD uses Bowtie for sequence read alignment by default; additionally, it requires a high-performance hardware infrastructure. Heap improves the sensitivity and accuracy of SNP calling with lower-coverage NGS data; it reduces the FP rate and accomplishes the highest F-scores at low coverage (7×). The F-score is the harmonic mean of sensitivity and PPV.
The default configurations for memory utilization and management are used for all the existing workflows. For a fair comparison, the default memory management configuration of Hadoop Map-Reduce is also used for the proposed model, as described next. On every node, Map-Reduce updates the mapred-site.xml file with the number of map and reduce slots based on the number of computing instances available on the node. Traditionally, data are stored in block units. The memory path is updated upon the writing of each data block and finally reaches the end of the array, where it is redirected to the head. To make sure that data are written into memory, the policy for selecting the storage path in HDFS is rewritten: data files are assigned paths with different priorities, the paths are sorted based on priority and stored in the data node's array, and the paths in the array are checked from the start when data is written. The observations and analysis of memory utilization show that all the workflows, including the proposed model, consume approximately the same amount of memory.
GATK uses the BWA aligner by default; however, in [51] GATK's results are reviewed and regenerated using the Bowtie aligner, which improves the results with respect to SNP calling. Similarly, FaSD uses the Bowtie aligner by default, and [52] presents the performance of FaSD with respect to SNP calling using both the BWA and Bowtie aligners. The integration of Bowtie with FaSD produces more improved results than BWA, as does GATK integrated with the Bowtie aligner. Heap uses BWA as its default aligner for sequence alignment. We integrated the Bowtie aligner with Heap, executed it on Hadoop clusters, and obtained improved results.
Results given in Figures 4 and 5 show the accuracy and F-score measurement analysis of the proposed framework in comparison with the GATK + BWA, GATK + Bowtie, FaSD + BWA, FaSD + Bowtie, Heap + BWA, SparkGA, and Halvade pipelines. The analysis shows that the proposed model is 52.3%, 29.6%, 23.4%, 20.9%, 6.3%, 6.5%, and 18% more efficient in F-score than the GATK + BWA, GATK + Bowtie, FaSD + BWA, FaSD + Bowtie, Heap + BWA, SparkGA, and Halvade pipelines, respectively. It also shows that the proposed framework is 0.63%, 0.20%, 0.08%, 0.17%, 0.04%, 0.05%, and 0.31% more accurate than these pipelines, respectively. Results from Table 1 and Figure 4 show that the proposed model achieved 99.998% accuracy on the human genome, 99.75% on GULUM_ABIAD, 99.75% on RTx430, and 99.71% on the SOR_1 dataset, which shows that the proposed framework is consistent in accuracy gain compared to the others. The overall analysis of Figures 4 and 5 shows that the proposed framework is, on average, 22.46% more efficient and 0.21% more accurate than the compared pipelines.

Figure 8a-c presents the cluster-wise speedup gained by the proposed model over the other workflows while running on the 8-node, 16-node, and 32-node clusters, respectively, for all datasets. The scalability analysis of all workflows shows that the proposed framework is highly scalable, as it achieved good speedup on 8, 16, and 32 compute nodes. Figure 9 shows the average speedup measurement analysis on the 8-, 16-, and 32-node real compute clusters for all datasets; it shows that the proposed framework outperforms the others on all datasets. Here, it is clear that the proposed workflow takes less time to detect SNPs than the others. It is worth noting that the efficiency of the proposed framework is much better on larger datasets than on smaller ones.
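As a quick sanity check, the overall averages can be reproduced from the per-pipeline gains listed above. Note that averaging the already-rounded per-pipeline F-score values gives about 22.43%, slightly below the reported 22.46%; the small discrepancy is presumably due to rounding of the individual figures:

```python
# Per-pipeline F-score and accuracy gains (%) reported above, in the order
# GATK+BWA, GATK+Bowtie, FaSD+BWA, FaSD+Bowtie, Heap+BWA, SparkGA, Halvade.
f_score_gains = [52.3, 29.6, 23.4, 20.9, 6.3, 6.5, 18.0]
accuracy_gains = [0.63, 0.20, 0.08, 0.17, 0.04, 0.05, 0.31]

mean_f = sum(f_score_gains) / len(f_score_gains)
mean_acc = sum(accuracy_gains) / len(accuracy_gains)
print(round(mean_f, 2), round(mean_acc, 2))  # → 22.43 0.21
```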

Conclusions
An SNP is a variation of a single nucleotide that exists at a particular locus in the genome, where the respective variant exists to a noticeable degree in a population. Detecting SNPs in high-dimensional genomic data is difficult due to the growing number of genetic variations in genome sequences. SNP detection is helpful in biological research to assess an individual's reaction to certain drugs, susceptibility to environmental factors such as toxins, and risk of disease. Hadoop is a platform based on the Map-Reduce programming framework that runs on any cluster with Java as the only prerequisite, and it provides scalability, reusability, and reproducibility. Hadoop Map-Reduce can also be used for fast computation and processing to detect SNPs in genome sequences, and it has proven capable of processing NGS data to detect SNPs in less time with higher accuracy. In this research study, we proposed a Hadoop-based framework integrated with Heap for SNP detection, which enhances the SNP detection rate and optimizes the execution time. The proposed framework was executed on varying numbers of nodes with different configurations. To validate the framework, different benchmark datasets were used and the results were recorded for comparison with other state-of-the-art pipelines. This research contributes a novel framework for SNP detection that improves the SNP detection rate, optimizes the execution time, and mines SNPs as well.
In the future, it is intended to identify and mine SNPs associated with complex diseases such as cancer, diabetes, and heart disease on a large scale, e.g., in a cloud computing environment integrated with AI-based optimization techniques. It is also intended to optimize the memory requirements.
