Discrimination of Cultivated Regions of Soybeans (Glycine max) Based on Multivariate Data Analysis of Volatile Metabolite Profiles

Soybean (Glycine max) is a major crop cultivated in various regions and consumed globally. The formation of volatile compounds in soybeans is influenced by the cultivar as well as by environmental factors, such as the climate and soil of the cultivation area. This study used gas chromatography-mass spectrometry (GC-MS) combined with headspace solid-phase microextraction (HS-SPME) to analyze the volatile compounds of soybeans cultivated in Korea, China, and North America. Multivariate data analyses, namely partial least square-discriminant analysis (PLS-DA) and hierarchical clustering analysis (HCA), were then applied to the GC-MS data sets. The soybeans could be clearly discriminated according to their geographical origins on the PLS-DA score plot. In particular, 25 volatile compounds, including terpenes (limonene, myrcene), esters (ethyl hexanoate, butyl butanoate, butyl prop-2-enoate, butyl acetate, butyl propanoate), and aldehydes (nonanal, heptanal, (E)-hex-2-enal, (E)-hept-2-enal, acetaldehyde), were the main contributors to the discrimination of soybeans cultivated in China from those cultivated in other regions on the PLS-DA score plot. On the other hand, 15 volatile compounds, such as 2-ethylhexan-1-ol, 2,5-dimethylhexan-2-ol, octanal, and heptanal, were related to Korean soybeans located on the negative PLS 2 axis, whereas 12 volatile compounds, such as oct-1-en-3-ol, heptan-4-ol, butyl butanoate, and butyl acetate, were responsible for the separation of North American soybeans. However, PLS-DA was not able to clearly distinguish soybeans cultivated in different regions of Korea, except for those from the Gyeonggi and Kyeongsangbuk provinces.


Introduction
Soybean (Glycine max) is among the most important crops in the world and is extensively used in the production of soybean flour, soybean milk, fermented products, and oil for consumption by both humans and animals, mainly due to its high protein and fat contents [1]. It is generally accepted that soybean cultivation originated in China, but nowadays, soybeans are produced worldwide, including in North America, South America, and Asia [1]. The importing and exporting of agricultural products

Profiling of Total Volatile Compounds in Soybeans
In total, 146 volatile compounds were identified in the GC-MS data sets obtained from soybean samples of different geographical origins. Tables S1-S3 indicate that diverse lipid-derived volatile compounds and terpenes were detected in this study. Previous studies have found the major volatiles of soybeans to be ethanol, 1-octen-3-ol, maltol, phenylethyl alcohol, hexanal, octanal, 2-propanone, and γ-butyrolactone [11,12]. All of these volatile compounds were detected in the present study with the exception of maltol, which could have been due to the use of different extraction techniques [11]. The present study employed headspace extraction using SPME, which generally favors the detection of highly volatile compounds with low boiling points.
The soybeans cultivated in North America contained 50 volatile compounds: 1 acid, 16 alcohols, 5 aldehydes, 4 esters, 2 furans, 4 benzenes, 4 ketones, 3 lactones, 1 nitrogen-containing compound, 1 sulfur-containing compound, 5 hydrocarbons, 2 terpenes, 1 phenol, and 1 pyrazine. The number of volatile compounds detected in North American soybeans was clearly smaller than in those from the other cultivation areas, but there was a greater diversity of alcohols. Oct-1-en-3-ol was detected at higher levels, while propan-2-one and 2-methylprop-1-ene were present at lower levels in North American soybeans. The only pyrazine detected was 2-methylpyrazine. Among esters, the content of 3-hydroxy-2,4,4-trimethylpentyl 2-methylpropanoate was higher in soybeans from the state of Indiana (IN) than in those from other regions of North America.
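The class counts listed for North American soybeans can be tallied with a short script. This is only an arithmetic check on the numbers quoted in the text; the dictionary keys and variable names are illustrative and do not come from the supplementary tables.

```python
# Number of volatile compounds per chemical class in North American soybeans,
# copied from the counts quoted in the text.
NORTH_AMERICA_CLASSES = {
    "acids": 1,
    "alcohols": 16,
    "aldehydes": 5,
    "esters": 4,
    "furans": 2,
    "benzenes": 4,
    "ketones": 4,
    "lactones": 3,
    "nitrogen-containing compounds": 1,
    "sulfur-containing compounds": 1,
    "hydrocarbons": 5,
    "terpenes": 2,
    "phenols": 1,
    "pyrazines": 1,
}

# Total number of compounds and the fraction contributed by alcohols.
total_compounds = sum(NORTH_AMERICA_CLASSES.values())
alcohol_share = NORTH_AMERICA_CLASSES["alcohols"] / total_compounds

print(total_compounds, alcohol_share)
```

The tally reproduces the stated total of 50 compounds, with alcohols as the largest single class.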
Several enzymes of soybeans have been studied by various researchers, including lipoxygenase, lipase, urease, amylase, and protease [26,27]. In particular, soybeans are known to be a rich source of lipoxygenase [27], which is one of several enzymes involved in producing aldehydes and alcohols via enzymatic oxidation [28]. This study found hexanal (13-linoleate hydroperoxide) and heptanal (11-linoleate hydroperoxide), known as the major oxidative products of linoleate hydroperoxides, in most of the cultivation regions, along with octanal (11-oleate hydroperoxide) and nonanal (9-/10-oleate hydroperoxide) [29], which are known to be decomposition products of oleate hydroperoxides [30].
Benelli et al. found that the amount of hexanal was related to precipitation and light conditions in the cultivation area [7]. Although Table 1 [31] shows differences in precipitation between the cultivation regions, the amount of hexanal did not differ significantly between the geographical regions studied. In this study, alcohols, which are known to be secondary oxidative products of unsaturated fatty acids [29], predominated in soybeans from Korea, China, and North America, among which pentan-1-ol and hexan-1-ol (both derived from 13-linoleate hydroperoxide [30,32]) were observed in most samples. As mentioned above, oct-1-en-3-ol (produced from 10-linoleate hydroperoxide [32]) was the most abundant alcohol in soybeans cultivated in North America. On the other hand, furans can be produced from the oxidation of polyunsaturated fatty acids and carotenoids [33], and 2-alkylfurans are commonly derived from lipid degradation [34]. 2-Methylfuran, 2-ethylfuran, and 2-pentylfuran were detected in soybeans from Korea and China, whereas 2-methylfuran was not found in soybeans from North America.
Several ketones were also identified in soybeans from Korea, China, and North America. Other diverse ketones that are mainly formed from unsaturated fatty acids (e.g., linoleic acid) by lipoxygenase were found in soybeans from China [35,36]. Certain ketones, such as propan-2-one, butan-2-one, and 3-methylheptan-4-one, were commonly found in samples from China. Cheesbrough et al. and Gulen et al. explained that the activities of enzymes, such as peroxidase, increased with the temperature at which the plants were grown [37,38]. Some previous studies have also reported that lipoxygenase activity is affected by the minimum mean temperature from flowering to maturity [39], which in turn affects the formation of volatile compounds [40]. Table 1 indicates that the annual mean temperature was higher in China (excluding the northeast region) than in the other cultivation areas (Korea and North America). It could, therefore, be assumed that the formation of various ketones in soybeans from China is due to high lipoxygenase activity related to the temperatures of their cultivation areas.
Diverse terpenes that occur naturally as metabolites are commonly found in plants [41]. In general, terpenes are produced from isopentenyl diphosphate, which is elongated to geranyl diphosphate, farnesyl diphosphate, and geranylgeranyl diphosphate [42]. Terpenes were identified in soybeans from all of the cultivation areas studied but showed the greatest abundance and variety in those from China. Eleven terpenes (α-pinene, α-thujene, sabinene, l-phellandrene, myrcene, α-terpinene, limonene, β-phellandrene, γ-terpinene, terpinolene, and α-cedrene) were detected in soybeans from China. The formation of terpenes could depend on various factors, such as cultivar and region [43]. Marais reported that certain factors, such as increased temperature and acidic conditions, could affect the concentration and diversity of terpenes formed [43]. Terpene synthases could also be affected by CO₂ levels [40]. According to the PBL Netherlands Environmental Assessment Agency (Planbureau voor de Leefomgeving), China had the largest CO₂ emissions in 2016 [44]. In particular, limonene, which is derived from geranyl pyrophosphate, was identified in all samples from China. A previous study suggested that a higher CO₂ concentration could enhance the activity of limonene synthase [40]. Therefore, the formation of limonene could be significantly affected by CO₂ concentration as well as by other factors, such as temperature.

Discrimination of Soybeans by Different Geographical Origins
To discriminate soybeans according to their geographical origins, the relationship between soybeans from different cultivation regions and their volatile profiles was investigated. GC-MS data sets were processed using unsupervised statistical analysis (principal components analysis (PCA) and hierarchical clustering analysis (HCA)) as well as supervised statistical analysis (partial least square-discriminant analysis (PLS-DA)) [45]. PCA, HCA, and PLS-DA were performed to identify the differences in the volatile profiles obtained from GC-MS analyses of soybeans of different geographical origins.
PCA separated the samples according to their geographical origins (data not shown). Since the PCA and PLS-DA score plots were similar, only the PLS-DA results are presented to show the separation of samples according to the cultivation area (Figure 1). Partial least square (PLS) components 1, 2, and 3 in the PLS-DA 3D score plot for soybeans of different origins together explained 37.9% of the total variance: 24.66%, 6.84%, and 6.40%, respectively (Figure 1a). The PLS-DA score plot for PLS component 1 and PLS component 2 is also presented (Figure 1b). The cross-validated model used three components, with R²X = 0.379, R²Y = 0.788, and Q²(cum) = 0.709. After 100 permutations, R² = 0.177 and Q² = −0.219 were obtained.
Some previous studies have shown that the chemical compositions of soybeans can vary significantly with differences in soils, fertilizer treatment, and climatic conditions, as well as other environmental factors [46-48]. Grieshop and Fahey showed that soybeans from China had greater crude protein content than those from North America [8]. Also, Shi et al. [47] demonstrated that soybeans from Korea contained more protein and less oil than those from North America. On the other hand, soybeans from China have been shown to have a lower lipid concentration than those from North America [9]. Volatile compounds of soybeans are produced from nonvolatile precursors, such as lipids, sugars, and proteins [49]. In particular, oxidative degradation of lipids can lead to the formation of diverse volatiles. The amounts of certain lipid-derived compounds, such as oct-1-en-3-ol, differed significantly between soybeans from North America and those cultivated in other regions, which could be due to the higher lipid concentration of North American soybeans.
On the other hand, the amounts of benzaldehyde, 2,6-dimethylpyrazine, and 2,5-dimethylpyrazine, which are known to be mainly produced from amino acids as major precursors [50], differed significantly between soybeans from China and those cultivated in other regions. This could be at least partially due to the differences in protein content between soybeans from different cultivation regions [46,48]. However, their exact formation mechanisms remain unclear and could involve both biological and chemical mechanisms during the cultivation and storage of the soybeans.
Medic et al. reported that the constituents of soybeans could be significantly altered by diverse environmental factors exerting complex combined effects [50]. This makes it difficult to explain how specific environmental factors influence the formation of volatile components in soybeans. As shown in Figure 1 [51], the major volatile metabolite contributing to the positive PLS 1 axis was 2-ethylhexan-1-ol, while those on the negative PLS 1 axis were heptan-4-ol, butan-1-ol, butyl butanoate, octanal, butyl prop-2-enoate, 5-methyl-2-propan-2-ylcyclohexan-1-ol, butyl acetate, butyl propanoate, nonanal, toluene, heptanal, heptan-4-one, 5-ethyloxolan-2-one, 1,2,3-trimethylbenzene, heptan-2-one, (E)-hex-2-enal, ethyl hexanoate, (E)-hept-2-enal, limonene, 1-butoxybutane, 2-pentylfuran, acetaldehyde, myrcene, and 3-hydroxybutan-2-one; the latter compounds were found in all soybeans from China. On the other hand, the main volatile metabolites contributing to the negative PLS 2 axis were 2-ethylhexan-1-ol, 2,5-dimethylhexan-2-ol, styrene, 2-methylfuran, 2-methylprop-1-ene, propan-2-one, 2-methylprop-2-enal, hexane, methyl acetate, 2-methylpentan-1-ol, octanal, butyl prop-2-enoate, 1-methoxypropan-2-ol, heptanal, and toluene, whereas those on the positive PLS 2 axis were oct-1-en-3-ol, nonane, 4-methyloxolan-2-one, heptan-4-ol, butan-1-ol, octan-3-one, butyl butanoate, 3-hydroxy-2,4,4-trimethylpentyl 2-methylpropanoate, 5-methyl-2-propan-2-ylcyclohexan-1-ol, butyl acetate, butyl propanoate, and nonanal. In Figure 2, soybeans from each country are clustered according to their cultivation regions. The figure shows that soybeans from Korea were clustered more closely than the others, which is possibly due to the much smaller land area of that country (100,339 km²) compared to China (9,596,951 km²), Canada (9,984,670 km²), and the United States (9,826,676 km²).
1 Retention indices were determined using n-alkanes C₆ to C₃₀ as an external standard; 2 retention indices were obtained from the NIST database (http://webbook.nist.gov/chemistry); 3 identification of the compounds was based on the following: A, mass spectrum and retention index agreed with those of the authentic compounds under similar conditions (positive identification); B, mass spectrum and retention index were consistent with those from the NIST database; C, mass spectrum was consistent with that of W9N08 (Wiley and NIST) and manual interpretation (tentative identification).
Figure 2 shows the HCA dendrogram with its associated heatmap, in which all of the samples are grouped in terms of their nearness or similarity [52]. The figure shows that all of the samples except Kyeongsangnam province Changnyeong (KNCN) could be clustered into two groups: group I consisted of 13 soybean samples cultivated in China, and group II comprised 22 soybean samples from Korea and North America. The amounts of terpenes and esters were greater in group I than in group II. In group II, the soybean samples from Korea (except for KNCN) and those from North America were classified into separate subgroups. Among the soybean samples grown in North America, those from the states of Illinois (IL) and Indiana (IN) could be distinguished from the others. Table S3 indicates that the samples from Illinois and Indiana contained greater amounts of alcohols than the other North American soybeans (samples MI, MN, ON, and QB). The annual mean precipitation was similar across North America, but the annual mean temperatures were higher in Illinois and Indiana than in the other regions. Wills et al. reported that the concentrations of esters and alcohols were positively related to temperature [53]. Therefore, it could be inferred that the formation of volatile compounds in soybeans from North America was affected by the cultivation temperature.
When the multivariate statistical analysis was performed only on the domestic samples from Korea, to test whether our method could discriminate samples cultivated in regions close to each other, neither PCA (data not shown) nor PLS-DA could distinguish the soybeans according to region. Figure 3a shows that PLS components 1, 2, and 3 together explained 43.7% of the total variance (19.09%, 15.36%, and 8.92%, respectively), while Figure 3b shows that two PLS components (PLS components 1 and 2) explained 33.86%. The cross-validated model used three components, with R²X = 0.437, R²Y = 0.169, and Q²(cum) = 0.0535. After 100 permutations, R² = 0.0951 and Q² = −0.0676 were obtained. Soybean samples from the Gyeonggi and Kyeongsangbuk provinces were clustered according to their regions, whereas the other samples were not clearly clustered in the PLS-DA score plot. As shown in Table 1, the climatic conditions varied with the cultivation area. The mean temperatures in 2016 showed similar tendencies in all of the cultivation regions studied, but there were slight differences in the total precipitation and sun exposure times. Various plant volatiles can be affected by changing biotic and abiotic factors [54]. Vallat et al. explained that the concentrations of nonanal and benzaldehyde were both positively related to precipitation, and positively and negatively related to temperature, respectively [54]. Such climate factors could together affect the volatile metabolites formed in soybeans cultivated in different regions. However, the relationships between climate and the amounts of nonanal and benzaldehyde formed were not clear in this study. Domestic samples other than those from the Gyeonggi and Kyeongsangbuk provinces were not clearly grouped in the PLS-DA score plot.
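The goodness-of-fit (R²Y) and predictive ability (Q²) values reported for the PLS-DA models can be illustrated with a minimal sketch. It assumes the common definitions R² = 1 − RSS/TSS and Q² = 1 − PRESS/TSS for a single 0/1 class vector; the exact cumulative computation in SIMCA-P may differ, and the function name and the numbers below are hypothetical, not data from this study.

```python
def r2_q2(y_true, y_fitted, y_cv):
    """Return (R2Y, Q2) for one response vector.

    y_true   -- observed class membership coded as 0/1
    y_fitted -- predictions from the model fitted on all samples
    y_cv     -- predictions for each sample made while it was left out
    """
    y_bar = sum(y_true) / len(y_true)
    tss = sum((y - y_bar) ** 2 for y in y_true)                  # total sum of squares
    rss = sum((y - f) ** 2 for y, f in zip(y_true, y_fitted))    # residual sum of squares
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv))      # predictive residual sum of squares
    return 1.0 - rss / tss, 1.0 - press / tss


# Hypothetical example: two samples from the class of interest, two from elsewhere.
r2y, q2 = r2_q2(
    y_true=[1, 1, 0, 0],
    y_fitted=[0.9, 0.8, 0.1, 0.2],
    y_cv=[0.7, 0.6, 0.3, 0.5],
)
print(round(r2y, 3), round(q2, 3))
```

Under these definitions, a Q² close to zero, as reported for the model built on domestic samples only, means that the cross-validated predictions are hardly better than predicting the mean class value for every sample.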

Materials
Thirty-six different soybean samples (17 from Korea, 13 from China, and 6 from North America) cultivated in 2016 (Figures S1-S3, Table S4) were used. Soybeans from Korea were provided by the National Agricultural Products Quality Management Service, whereas those from China were obtained from Chinese markets (Figures S4 and S5, Table S4). Soybeans from North America were gifts from a soybean processing company in Korea (Figure S6, Table S4). All samples were stored at −70 °C in a deep freezer before they were analyzed. Solid-phase microextraction (SPME) fibers and holders were purchased from Supelco (Bellefonte, PA, USA), whereas vials and screw caps (Ultraclean 18 mm) were purchased from Agilent Technologies (Santa Clara, CA, USA). l-Borneol was purchased from Sigma-Aldrich (St. Louis, MO, USA). Authentic standard compounds for positive identification of volatile compounds were purchased as follows: 3-methylphenol and hexan-1-ol were purchased from Supelco (Bellefonte, PA, USA); 1,3-benzothiazole, acetaldehyde, and α-terpinene were obtained from Fluka (St. Gallen, Switzerland); and acetonitrile was bought from J.T. Baker (Phillipsburg, NJ, USA), while all of the other authentic standards were purchased from Sigma-Aldrich (St. Louis, MO, USA).

Extraction of Volatile Metabolites Using SPME
l-Borneol was prepared at 200 mg/L in tert-butanol and then diluted with distilled water to a final concentration of 1 mg/L. Soybean (5 g) was placed in a 20 mL screw vial with a screw cap. SPME was used to obtain the volatile metabolites of soybeans. The sample was maintained at 40 °C for 30 min to reach the equilibrium state. An SPME fiber coated with carboxen/polydimethylsiloxane/divinylbenzene (CAR/PDMS/DVB) was used to adsorb volatile compounds at 40 °C for 20 min, and desorption was carried out at 200 °C in the GC injector for 5 min while cryo-trapping at −80 °C. After every ten runs in the GC-MS analysis, we included quality control (QC) soybean samples to confirm the relative peak areas and retention times of several main volatile compounds.
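The internal standard dilution can be checked with a short calculation. The sketch below uses the 200 mg/L stock stated above together with the 1:200 (v/v) solvent ratio and the 5 µL spike volume given in the quantification section; because the text does not say whether 1:200 means 1 part in 200 total or 1 part plus 200 parts, both readings are computed. The function names are illustrative.

```python
STOCK_MG_PER_L = 200.0  # l-borneol stock concentration in tert-butanol (from the text)


def diluted_concentration(stock, parts_stock, parts_diluent):
    """Concentration after mixing parts_stock of stock with parts_diluent of solvent."""
    return stock * parts_stock / (parts_stock + parts_diluent)


def spiked_mass_ng(conc_mg_per_l, volume_ul):
    """Mass delivered by a spike: 1 mg/L equals 1 ng/uL, so ng = (mg/L) * uL."""
    return conc_mg_per_l * volume_ul


# Reading 1: 1 part stock made up to 200 parts in total (1 + 199).
reading_total = diluted_concentration(STOCK_MG_PER_L, 1, 199)
# Reading 2: 1 part stock added to 200 parts of water (1 + 200).
reading_added = diluted_concentration(STOCK_MG_PER_L, 1, 200)

print(reading_total, round(reading_added, 3), spiked_mass_ng(1.0, 5.0))
```

Both readings round to the stated 1 mg/L, and a 5 µL spike at that concentration corresponds to about 5 ng of l-borneol per vial.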

GC-MS Analysis
The GC-MS analysis was performed using a 7890A series gas chromatograph (Agilent Technologies, Santa Clara, CA, USA) and a 5975C mass detector (Agilent Technologies, Santa Clara, CA, USA) equipped with a DB-Wax column (30 m length × 0.25 mm i.d. × 0.25 µm film thickness, J&W Scientific, Folsom, CA, USA). The GC oven temperature was programmed as follows: the initial temperature was maintained at 40 °C for 10 min, raised to 42 °C at a rate of 2 °C/min and held for 3 min, increased to 100 °C at a rate of 4 °C/min and held for 5 min, raised to 180 °C at a rate of 4 °C/min, and then ramped to 200 °C at a rate of 10 °C/min. The flow rate of helium, the carrier gas, was constant at 0.8 mL/min. Mass spectra were obtained over a mass scan range of 35-350 atomic mass units (a.m.u.) at a rate of 4.5 scans/s, and electron impact (EI) ionization was performed at 70 eV. All sample preparations and analyses were independently performed in triplicate. In the preliminary study, we confirmed the repeatability and precision of our method for the main volatile compounds in soybean using more than six replicates.
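The oven program described above can be written down as a list of steps to check the total run time. This is a minimal sketch: the tuple layout and function name are illustrative, and steps for which the text gives no hold time are assumed to have a hold of zero.

```python
# Oven program from the text: (ramp rate in degC/min, target degC, hold in min).
# A rate of None marks the initial isothermal step.
OVEN_PROGRAM = [
    (None, 40.0, 10.0),  # hold at 40 degC for 10 min
    (2.0, 42.0, 3.0),    # 2 degC/min to 42 degC, hold 3 min
    (4.0, 100.0, 5.0),   # 4 degC/min to 100 degC, hold 5 min
    (4.0, 180.0, 0.0),   # 4 degC/min to 180 degC, no hold stated (assumed 0)
    (10.0, 200.0, 0.0),  # 10 degC/min to 200 degC, no final hold stated (assumed 0)
]


def run_time_minutes(program):
    """Return the total duration of a stepwise oven program in minutes."""
    total = 0.0
    current = None
    for rate, target, hold in program:
        if rate is not None:
            total += (target - current) / rate  # time spent ramping to the target
        total += hold                           # isothermal hold at the target
        current = target
    return total


total_run_time = run_time_minutes(OVEN_PROGRAM)
print(total_run_time)
```

Under the zero-hold assumption the program lasts 55.5 min; any final hold at 200 °C would add to that figure.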

Identification and Quantification of Volatile Metabolites
The identification of each volatile compound was positively confirmed by comparing its retention time and mass spectral data with those of authentic standard compounds. When standard compounds were not available, each volatile compound was identified on the basis of its mass spectral data using the NIST.08 and Wiley.9 mass spectral libraries and the retention index (RI) values reported in the literature. The RI values of the volatile compounds were calculated using n-alkanes from C₆ to C₃₀ as an external standard. The volatile components were quantified as relative peak areas by comparing their peak areas with that of the internal standard compound on the total ion chromatogram of GC-MS. Five microliters of l-borneol (1 mg/L in a tert-butanol/distilled water solvent mixture (1:200, v/v)) was used as the internal standard.
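The two calculations described above can be sketched as follows. The text does not state which retention index formula was used; the sketch assumes the linear (van den Dool and Kratz) form that is commonly applied to temperature-programmed runs. The function names, alkane retention times, and peak areas are hypothetical.

```python
from bisect import bisect_right


def linear_retention_index(t_x, alkane_times):
    """Linear retention index of a peak eluting at t_x (min).

    alkane_times maps carbon number -> retention time (min) of that n-alkane.
    t_x must elute at or after the first alkane and before the last one.
    """
    carbons = sorted(alkane_times)
    times = [alkane_times[c] for c in carbons]
    i = bisect_right(times, t_x) - 1  # index of the alkane eluting just before the peak
    if i < 0 or i >= len(times) - 1:
        raise ValueError("retention time is not bracketed by the alkane series")
    step = carbons[i + 1] - carbons[i]
    fraction = (t_x - times[i]) / (times[i + 1] - times[i])
    return 100.0 * (carbons[i] + step * fraction)


def relative_peak_area(peak_area, internal_standard_area):
    """Peak area of a compound relative to the internal standard (l-borneol)."""
    return peak_area / internal_standard_area


# Hypothetical alkane ladder (C10 to C12) and one unknown peak.
alkanes = {10: 12.0, 11: 15.0, 12: 18.5}
ri = linear_retention_index(13.5, alkanes)
rel = relative_peak_area(2.5e6, 5.0e6)
print(ri, rel)
```

A peak eluting halfway between two consecutive alkanes receives an index halfway between theirs, which is what makes the values comparable with literature and database entries measured on the same stationary phase.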

Statistical Analysis
All the data sets obtained were processed by multivariate statistical analysis, namely principal components analysis (PCA) and partial least square-discriminant analysis (PLS-DA), using SIMCA-P (version 11.0, Umetrics, Umeå, Sweden) to discriminate soybeans according to their geographical origins. Heatmap visualization and hierarchical clustering analysis were performed based on Pearson's correlation and the average linkage method using MultiExperiment Viewer (MeV) software (version 4.9, The Institute for Genomic Research (TIGR)) [55].
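Hierarchical clustering with a Pearson correlation distance and average linkage can be illustrated with a minimal standard-library sketch. It is not the MeV implementation; the sample names and profiles are hypothetical, and in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="correlation"` would normally be used.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    pair_dist = {
        frozenset(p): pearson_distance(profiles[p[0]], profiles[p[1]])
        for p in combinations(names, 2)
    }

    def between(pair):
        # Average linkage: mean of all pairwise distances between the two clusters.
        a, b = pair
        return mean(pair_dist[frozenset((i, j))] for i in a for j in b)

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=between)
        merges.append((a, b, between((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical relative peak areas of four volatiles in four samples.
profiles = {
    "CN_a": [5.0, 4.0, 1.0, 0.5],
    "CN_b": [4.5, 4.2, 1.2, 0.4],
    "KR_a": [1.0, 0.8, 3.0, 4.0],
    "KR_b": [1.1, 0.7, 3.2, 4.1],
}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

With a correlation-based distance, samples are grouped by the shape of their volatile profile rather than by absolute abundance, so two samples with proportional profiles are treated as identical.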

Conclusions
This study applied GC-MS analysis combined with multivariate statistical analysis to distinguish the geographical origins of soybeans. The profiles of volatile compounds in the soybean samples varied with their cultivation regions. In the PLS-DA results, all soybean samples were clearly discriminated by their geographical origins. However, those cultivated in Korea (except for the samples from the Gyeonggi and Kyeongsangbuk provinces) could not be clearly separated according to region on the PLS-DA score plot. We also determined the major volatile metabolites that contributed to the discrimination of geographical origins on the basis of PLS-DA. This study has the advantage of being able to distinguish the geographical origin of soybeans without any sample pretreatment on the basis of volatile metabolite profiles, which are highly related to their quality. However, we did not have enough sample information on post-harvest practices, such as drying and storage conditions, which could affect the volatile profiles. Nevertheless, our results could be applied to the discrimination of soybeans distributed and commercially available in Korea, which was the main objective of this study.
In summary, the findings of this study suggest that combining GC-MS-based analysis of volatile compounds with multivariate data analysis is a useful tool for discriminating the geographical origins of soybeans, but with some limitations for domestically cultivated ones.