Complete Chloroplast Genome of Fokienia hodginsii (Dunn) Henry et Thomas: Insights into Repeat Regions Variation and Phylogenetic Relationships in Cupressophyta

Fokienia hodginsii (Dunn) Henry et Thomas is a relic gymnosperm with broad application value. It is a fit candidate when choosing species for the construction of artificial forests. We determined the complete chloroplast genome sequence of F. hodginsii, which is 129,534 bp in length and encodes 83 protein genes, 33 transfer RNA (tRNA) genes, as well as four ribosomal RNA genes. The GC content of the complete sequence and protein coding regions is 34.8% and 36.2%, respectively. We identified 11 tandem repeats, 11 forward repeats, and three palindromic repeats and classified them by size. Following our microsatellite analysis, a total number of 73 simple sequence repeats were detected, preferentially within the intergenic space. Being a member of Cupressophyta, F. hodginsii owns several common characters; the trnR-CCG gene has been deleted, while the trnI-CAU and trnQ-UUG genes have been duplicated. Moreover, the accD gene, which encodes acetyl-CoA carboxylase, contains 771 codons in F. hodginsii, similar to Cryptomeria japonica (L. F.) D. Don, further supporting the diversity of accD and its size expansion in Cupressophyta. Concerning the loss of inverted repeat (IR) regions, the 86-bp sequence with the duplicated trnI-CAU gene is inferred to be the footprint of IR contraction. Phylogenetically, F. hodginsii is placed as a sister taxon to Chamaecyparis hodginsii (Dunn) Rushforth. This work offers meaningful guidance as well as reference value to the breeding research and improvement of F. hodginsii. Moreover, it gives us a better understanding of the genomic structure and evolutionary history of gymnosperms, especially coniferales.


Introduction
Fokienia hodginsii (Dunn) Henry et Thomas, a relic gymnosperm from the tertiary period, is the only species of Fokienia Henry et Thomas. This threatened species was listed as vulnerable by the International Union for Conservation of Nature (IUCN) and was also under the second class state protection recorded by China Species Red List. F. hodginsii grows well in warm and humid

DNA Extraction and Genome Assembly
Using the high salt concentration method [16], genomic DNA was isolated from young leaves of F. hodginsii grown in Fujian Yangkou Forest Farm, Shunchang, China. A 500 bp paired-end library was constructed with 5 µg of the isolated cp DNA. About 2 Gb of sequence with an average read length of 301 bp was obtained using the Illumina MiSeq platform. As to the assembly of the whole cp genome sequence, MiSeq raw reads were trimmed to 200 bp in length using a script developed in-house ('fasta_length_trimmer') to remove the potential low-quality bases. Then, initial contigs were assembled using Velvet Assembler version 1.2.07 [17]. Contigs with high similarity (E-value <1 e-10 ) compared with the reference cp genome (KX832622.1) were chosen to be spliced using SSPACE Premium version 2.2 [18], followed by a manual check. Finally, we acquired one single circular cp genome sequence (129,534) without ambiguous bases.

Genome Annotation and Sequence Analyses
The online program Dual Organellar GenoMe Annotator [19] and the Basic Local Alignment Search Tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi) were used to annotate the F. hodginsii cp genome. Compared with homologous genes from other cp genomes, we manually checked the initial annotation and determined the putative start and stop codons, as well as intron positions. All transfer RNA genes were further verified by using tRNAscan-SE [20]. The map of the circular cp genome was drawn using OrganellarGenomeDRAW [21]. Codon usage and GC content were calculated by MEGA7 [22].

Repeat Sequence Statistics and IR Identification
Tandem repeats were identified using Tandem Repeats Finder [23] with parameters set to 2 for match, 7 for mismatch, and 7 for indel. The minimum alignment score and the maximum period length were 50 bp and 500 bp, respectively. Statistics of forward, reverse, complement, and palindromic repeats were computed by REPuter [24]. The minimal size was set to 30 bp and >90% identity with a hamming distance of 3. Upon the detection of the inverted repeat sequence, we conducted a comparison of IR regions among F. hodginsii and the other four conifers (Podocarpus lambertii Klotzch ex Endl, Cephalotaxus oliveri Mast, Taiwania cryptomerioides Hayata, and C. japonica). Additionally, in order to explore the conserved region as well as the repeated region of accD, we performed an amino acid sequence alignment using online MAFFT version7 [25] and visualized the results with jalview [26].
For the analysis of simple sequence repeats (SSRs), the Perl script MISA (https://pgrc.ipkgatersleben.de/misa/) was used. The threshold of minimal repeat units was 10 for mononucleotides, five for dinucleotides, and four for tri-, tetra-, penta-, and hexanucleotides. All the outputs from the programs above were manually adjusted and redundant data were removed.

Phylogenetic Research
In order to construct a phylogenetic tree, which shows the evolutionary relationships between candidates in a clear tree-like map, we selected 21 species (Table S1) and downloaded the sequences of their cp protein-coding genes from the National Center for Biotechnology Information (https: //www.ncbi.nlm.nih.gov/). We excluded all ndh (NADH dehydrogenase) genes due to their absence across Pinaceae [27]. After manual adjustment, 63 eligible genes were ultimately aligned using Clustal W [28] and concatenated by SequenceMatrix [29]. Phylogenetic trees were estimated by MEGA7 [16] using an array with 48967 bp nucleotide sites by maximum likelihood (ML) and neighbor-joining (NJ) methods. In the ML tree, the model General Time Reversible + Proportion Invariant + Gamma (GTR + I + R) was preferred for nucleotide substitution. We selected the Kimura two-parameter model [30] to estimate evolutionary rates of base substitutions and nearest-neighbor-interchange (NNI) on random trees, while initial trees were constructed automatically. For NJ analyses, we selected the the maximum composite likelihood method. One thousand bootstrap replicates were applied to evaluate clade supports in both methods.

Genome Features
We determined the full length of the F. hodginsii cp genome (MK890147) to be 129,534 bp ( Figure 1). The complete cp DNA does not share the common quadripartite structure, which contains a pair of IRs separated by the LSC and SSC regions. However, the trnI-CAU and trnQ-UUG genes were found to be duplicated within the short inverted repeats. We identified 120 genes (Table 1) in total, including 83 protein genes, 33 transfer RNA genes, and four ribosomal RNA genes. Among the 16 intron-containing genes (Table 2), eight protein-coding genes and six tRNA genes contain just one intron while another two genes, ribosomal protein S12 (rps12), and hypothetical chloroplast reading frame 3 (ycf3) contain two introns. It is worth mentioning that rps12 is trans-spliced [31], meaning that exonI is about 35 kilobase pairs downstream of the nearest copy of the other two exons. Moreover, the largest intron within trnK-UUU (2457 bp) includes the entire matK gene.  The GC contents of the complete cp genome sequence and protein-coding regions in F. hodginsii are 34.8% and 36.2%, respectively. Eighty-three genes constitute the protein-coding regions, being 74,715 bp in length. The first codon position shows a higher GC content than the second one; by contrast, the lowest GC content at the third codon position indicates a codon usage bias in A and T (Table 3), which contribute to the enrichment of AT content in the cp genome [32]. A total of 24,905 codons were identified, with the most and least frequently used codons being AAA (1234) and UGC (77), encoding lysine and cysteine, respectively. Leucine (10.95%) and cysteine (1.14%) respectively were the most and least often encoded amino acids ( Figure 2). Table 1. List of genes found in the F. hodginsii cp genome.

Functional Category
Group of Genes Gene Names

Genes of unknown function Conserved open reading frames ycf2
One or two asterisks after a gene indicate that the gene contains one or two introns.

Repeat Sequence and SSR Analysis
Repeat sequences play a crucial role in the occurrence of genomic rearrangements and the invention of evolutionary novelties. Occurring frequently in genomic sequences, they are important laboratory and analytic tools [23]. Our repeat sequence analyses are shown in Figure 3. We classified tandem, forward, and palindromic repeats, of which we identified 11, 11, and three, respectively ( Figure 3A). Then we subsequently categorized these by size as shown in Figure 3B. A tandem repeat in DNA is two or more adjacent, approximate copies of a pattern of nucleotides. In F. hodginsii, the majority of tandem repeats (81.81%) are below 60 bp in length. As to forward repeats, the direct sequences that have the same nucleotide bases and directions in common, they are mainly between 60 and 74 bp long. Palindromic repeats, which frequently occur as essential elements in regulatory regions [33], vary in length from 52 to 233 bp (Table S2). Concerning the distribution of all these repeats, 59% of them are located within intergenic regions, whereas the remaining 41% occur within protein-coding genes ( Figure 3C).

40
Simple sequence repeats (SSRs), also called microsatellites, have a high level of polymorphisms 41 and are considered potential genetic markers in ecological and evolutionary studies of plants [34].

44
we did not detect any tetra-or penta-nucleotide motifs. The most common motif was mononucleotide  Simple sequence repeats (SSRs), also called microsatellites, have a high level of polymorphisms and are considered potential genetic markers in ecological and evolutionary studies of plants [34]. From our SSR analysis, we detected 73 microsatellites in total (Table S3), with the minimum number of 10, five, four, and six for mono-, di-, tri, and hexa-nucleotides, respectively ( Figure 4A). However, we did not detect any tetra-or penta-nucleotide motifs. The most common motif was mononucleotide A, followed by mononucleotide T, which together make up 53.4% of all SSRs. Among dinucleotide and trinucleotide repeats, AT and AAG occur the most frequently. The only hexanucleotide repeat found in the F. hodginsii cp genome was AGGAAC. Most (68.49%) of the SSRs are located in the intergenic space, followed by protein-coding genes (20.55%) and introns (10.96%) ( Figure 4C). The exact numbers of SSRs found in these three regions were 50, 15, and 8, respectively ( Figure 4B). This distribution preference supports previous findings that the SSR frequency varies between different regions of the genome [35]. Among all SSRs, a majority of homopolymers is entirely composed of A/T sequences (76.7%), indicating an enrichment in A/T content and a bias in base composition.

57
When examining repeat location, we noticed that the protein-coding gene accD and the 58 intergenic space in its close proximity contain a high number of repeated sequences of various types.

59
This gene encodes acetyl-CoA carboxylase (ACCase), which is one of the key enzymes regulating the 60 rate of fatty acid biosynthesis in plants [36,37]. We also performed an alignment of the accD gene from  When examining repeat location, we noticed that the protein-coding gene accD and the intergenic space in its close proximity contain a high number of repeated sequences of various types. This gene encodes acetyl-CoA carboxylase (ACCase), which is one of the key enzymes regulating the rate of fatty acid biosynthesis in plants [36,37]. We also performed an alignment of the accD gene from five Cupressophyta members including F. hodginsii ( Figure 5). The accD reading frame of F. hodginsii contains 771 codons. Concerning the repetitive sequence, which most frequently occurs in medial portion, in F. hodginsii three repeats of EEEEQ were detected. Several previous studies support that IRs in gymnosperms display a high amount of variation 71 in sequence length and composition [5,11,38]. In order to explore the loss of IR regions in F. hodginsii,

72
we compared it with four other conifers, the results of which are shown below ( Figure 6). All five conifers conserved one copy of the IRB, and left footprints of the IRA region with duplicated genes 74 that can be detected in the sequence. We identified two main palindromic repeats in the F. hodginsii

Residual Inverted Repeat (IR) Regions
Several previous studies support that IRs in gymnosperms display a high amount of variation in sequence length and composition [5,11,38]. In order to explore the loss of IR regions in F. hodginsii, we compared it with four other conifers, the results of which are shown below ( Figure 6). All five conifers conserved one copy of the IR B , and left footprints of the IR A region with duplicated genes that can be detected in the sequence. We identified two main palindromic repeats in the F. hodginsii cp DNA sequence, each containing a duplicated gene: trnQ-UUG and trnI-CAU. Based on our comparison and preceding analysis, the 86 bp sequence that contains trnI-CAU and extends into the intergenic space between the trnI-CAU and psbA genes is inferred as the residual IR in F. hodginsii, which is the same length as the residual IR in Metasequoia glyptostroboides Hu et Cheng [39].

Phylogenetic Research
Ginkgo biloba was set as an outgroup for the construction of our phylogenetic tree with 21 plant species selected for this study. In the ML tree, 17 out of 18 nodes show bootstrap values of 100%, accounting for 94.44% (Figure 7), which is identical to our constructed NJ tree. Remarkably, in both topology structures, F. hodginsii belongs to the Cupressaceae family with a node bootstrap value of 100% and is placed as a sister clade to Chamaecyparis hodginsii (Dunn) Rushforth, indicating that F. hodginsii and C. hodginsii are very closely related. The 18 Cupressaceae gymnosperms included in our analysis form six sub-families of Cupressophyta (Figure 7). inverted repeat regions, one can effectively avoid the disturbance of paralogs when conducting 105 phylogenetic analyses. Moreover, the variation of the evolutionary rate between coding and 106 noncoding regions can be applied in distinguishing different categories [40]. Based on these benefits, 107 phylogeny has undergone rapid progress with the help of cp genome sequencing.

108
From the respect of cp genomic studies in Cupressophyta, previous experiments paid enough 109 attention to controversial issues about the taxonomy among Taxaceae, Cephalotaxaceae, and However, the taxonomic status of three genera from Araucariaceae remains to be solved [44,45].

112
Chloroplast genes rbcL and matK were the most commonly used gene segments because their 113 sequences variation contained historical evidence appropriate for evolutionary analysis [46]. As the 114 complete cp genome of more Cupressophyta species were published, new markers and methods can 115 be invented in related fields.

116
In our research, the complete cp genome of F. hodginsii was found to be 129,534 bp in length,

117
which falls in between the 128,290 bp of Taxus mairei [47] and the 131,810 bp of C. japonica [5].

118
Compared with Pinaceae cp genomes, which are mainly around 119 kbp [48][49][50], the Cupressophyta 119 plastomes are of a bigger size. When annotating genes, the absence of trnR-CCG, which has been 120 reported in Cupressophyta before and also was found to occur in F. hodginsii, supports the hypothesis 121 based on phytochrome phylogenetic trees that the trnR-CCG gene was lost during the second split 122 separating Araucariaceae and Podocarpaceae in the evolutionary progress of plants [51].

123
The comparative alignment of the accD gene revealed that the accD reading frame of F. hodginsii 124 contains 771 codons, slightly more than C. japonica (700 codons) [5] and similar to other

Discussion
In recent years, with the rapid development of large-scale sequencing technologies, more and more research studies concerning the evolutionary relationships of plants have been carried out. Chloroplast is an organelle specialized for carrying out photosynthesis in plants. Cp genomes have a great deal of merits. The conservatism of the genome content and the relatively small size of the sequence make cp genomes easy to be sequenced. Using single-copy genes located outside the inverted repeat regions, one can effectively avoid the disturbance of paralogs when conducting phylogenetic analyses. Moreover, the variation of the evolutionary rate between coding and noncoding regions can be applied in distinguishing different categories [40]. Based on these benefits, phylogeny has undergone rapid progress with the help of cp genome sequencing.
From the respect of cp genomic studies in Cupressophyta, previous experiments paid enough attention to controversial issues about the taxonomy among Taxaceae, Cephalotaxaceae, and Podocarpaceae as well as the phylogenetic relationships among the genera of these families [41][42][43]. However, the taxonomic status of three genera from Araucariaceae remains to be solved [44,45]. Chloroplast genes rbcL and matK were the most commonly used gene segments because their sequences variation contained historical evidence appropriate for evolutionary analysis [46]. As the complete cp genome of more Cupressophyta species were published, new markers and methods can be invented in related fields.
In our research, the complete cp genome of F. hodginsii was found to be 129,534 bp in length, which falls in between the 128,290 bp of Taxus mairei [47] and the 131,810 bp of C. japonica [5]. Compared with Pinaceae cp genomes, which are mainly around 119 kbp [48][49][50], the Cupressophyta plastomes are of a bigger size. When annotating genes, the absence of trnR-CCG, which has been reported in Cupressophyta before and also was found to occur in F. hodginsii, supports the hypothesis based on phytochrome phylogenetic trees that the trnR-CCG gene was lost during the second split separating Araucariaceae and Podocarpaceae in the evolutionary progress of plants [51].
The comparative alignment of the accD gene revealed that the accD reading frame of F. hodginsii contains 771 codons, slightly more than C. japonica (700 codons) [5] and similar to other Cupressophyta species such as T. cryptomerioides (800 codons), P. lambertii (864 codons), and C. oliveri (936 codons). Concerning the repetitive sequence that most frequently occurs in medial portion, in F. hodginsii, three repeats of EEEEQ were detected, almost in the same region where nine repeats of SDIEED in C. oliveri occur [52]. In P. lambertii as well as T. cryptomerioides, more than one type of repeat sequence was found. The insertion of repetitive elements is thought to lead to the variation of the whole sequence, in an attempt to accelerate the substitution rate [37]. As indicated by the yellow histograms below each line ( Figure 5), the amino acid sequence of both terminals remained more conserved than the middle part. Moreover, the C-terminal appeared as more conserved than the N-terminal. Since the most significant function of this region is considered as encoding the carboxyl transferase, this specific utilization may explain why the C-terminal remains more conserved in some degree. These results support that the accD reading frame displays a tendency towards enlarging size in Cupressophyta [6,37,51], which is mainly caused by the great number of insertions that consist of tandemly repeated motifs.
Rearrangements of cp genomes occurred more frequently in conifers and the loss of typical IR sequences may be an outcome of the reduction of gene content. For example, it is believed that the Pinus thunbergii Parl IR sequence only contains a region encompassing trnI and a portion of 3 psbA as a remnant of the whole IR [53]. The residual IR of C. japonica is 114 bp in length [5]. Short remaining sequences of 544 bp and 326 bp can be detected in C. oliveri [52] and P. lambertii [51], respectively. Wu et al. (2011) concluded that in conifers, the IR B region was retained in Pinaceae whereas the IR A region was retained in Cupressophyta. The regions encompassing the whole ycf2 gene and the adjoined psbA or rpl23 genes are considered as ancestral IRs [14]. This conclusion is supported by the 86 bp IR contraction footprint in F. hodginsii in our study. Notably, repeat regions still have function after expansion [41]. However, the complex mechanism of the insertion and loss of genes, as well as the induction of shrinkage and expansion in specific regions of cp genomes, remains unclear.
Our phylogenetic topology supports that the gymnosperm representative G. biloba forms a monophyletic branch.
Gymnosperms are considered to contain three separate clades: Cycadales-Ginkgoales, Gnetales, and Coniferales [54,55]. A previous phylogenetic study using 43 Coniferales species concluded that they were divided into two main sub-clades: Pinaceae and Cupressophyta. The Pinaceae are represented by Picea morrisonicola Hayata and P. thunbergii in our analysis and are considered basal to the remaining conifer families [55]. Within the Cupressophyta, which is another branch of Coniferales, the relationship between Sciadopitys verticillate (Thunb.) Sieb. et Zucc (representing Sciadopityaceae) and the other species stood out. Its position supports that Sciadopityaceae form their own separate family.

Conclusions
In this work, we sequenced the complete cp genome of F. hodginsii (129,534 bp) by using Illumina high-throughput sequencing technology. Repeat motifs found by our statistical analyses can be further used for the development of molecular markers, which can find broad applications in genetic studies as well as phylogenetic research. We will also be able to exploit new strategies to protect this endangered species based on developed markers. As the only endemic species of Fokienia Henry et Thomas, the phylogenetic position of F. hodginsii we constructed is of great value in terms of understanding the evolutionary history of this genus and enriching our comprehension of the systematic status of Cupressophyta. However, there are still some problems that remain to be explored, such as the elaborate explanation of different cpDNA forms in Cupressophyta, the principle of the insertion and loss of genes, as well as the mechanism of IR shrinkage and expansion in cp genomes. We believe that the complete cp genome of F. hodginsii is an ideal example when studying these issues.