A near complete genome of Arachis monticola, an allotetraploid wild peanut
Plant Biotechnology Journal ( IF 13.8 ) Pub Date : 2024-03-04 , DOI: 10.1111/pbi.14331
Hongzhang Xue 1, 2 , Kai Zhao 1 , Kunkun Zhao 1 , Suoyi Han 3 , Annapurna Chitikineni 4 , Lin Zhang 1 , Ding Qiu 1 , Rui Ren 1 , Fangping Gong 1 , Zhongfeng Li 1 , Xingli Ma 1 , Xingguo Zhang 1 , Rajeev K. Varshney 4 , Xinyou Zhang 3 , Chaochun Wei 2 , Dongmei Yin 1

Peanut (Arachis hypogaea L.) is one of the most economically important oilseed legume crops in the world. Most Arachis species are diploid, but two are allotetraploid: one cultivated species (A. hypogaea) and one wild relative (A. monticola). A. monticola is an important intermediate between diploid wild peanut species and tetraploid cultivated peanut varieties (Zhuang et al., 2019). We released the first A. monticola genome in 2018 (Yin et al., 2018), but the quality of that assembly left room for improvement.

The development of long-read sequencing technologies has enabled the assembly of higher-quality genomes, even at the telomere-to-telomere scale, as for Arabidopsis thaliana (Wang et al., 2022) and Oryza sativa (Song et al., 2021). Here we present Amon2.0, an updated high-quality genome assembly of A. monticola generated by combining Nanopore ultra-long reads, MGI short reads and Hi-C reads. Heterozygosity was estimated at 0.8%. The genome was assembled de novo and then polished with both long and short reads. After scaffolding with Hi-C reads, the assembly comprised 46 scaffolds with a total size of 2.56 Gbp and an N50 of 137.6 Mbp, and 99.6% of the sequence was assembled into 20 chromosomes (Table S1; Figure 1b). The final assembly contained only 34 gaps, far fewer than the previously published tetraploid peanut genome (Bertioli et al., 2019). In total, 11 chromosomes have complete telomeric sequences at both ends, and the remaining 9 have a complete telomeric sequence at one end. All chromosomes have centromeric sequences, including 13 with complete centromeric sequences.
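As an illustration of the N50 statistic reported above, the following minimal sketch computes N50 from a list of sequence lengths. The function name and the toy lengths are hypothetical and are not taken from the study; N50 is taken here as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated
    until the running total reaches half of the overall length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

The same function applies to contig lengths or scaffold lengths; only the input list differs.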

Figure 1
Genomic and transcriptomic features of Amon2.0. (a) Circos plot summarizing information about Amon2.0. From the outermost to the innermost circle, the data shown are: telomere and centromere; GC content; gene content; long terminal repeat (LTR)/Gypsy distribution; LTR/Copia distribution; and DNA transposon distribution. (b) Genome-wide all-by-all Hi-C interaction map of Amon2.0. (c) Comparison of genome sequences between Amon1.0 and Amon2.0. (d) Comparison of chromosome A03 between Amon1.0 and Amon2.0. (e) Long-read sequence alignment near the 15.89 Mbp region of A03 in Amon1.0. (f) Long-read sequence alignment near the 14.94 Mbp region of A03 in Amon2.0. (g) Expression level distribution for genes in the whole genome, genes in sub-genome A and genes in sub-genome B. Gene expression was calculated in fragments per kilobase of transcript per million mapped reads (FPKM). (h) Expression levels of A-specific, A homologue, B-specific and B homologue genes. (i) Differences in expression levels between paired single-copy genes in sub-genomes A and B. (j) Expression levels of LECC1 (SL-I).

A total of 2 147 185 085 bp (83.96%) was annotated as repeat elements, mainly long terminal repeat (LTR) retroelements (65.3%) and DNA transposons (12.03%). Gypsy sequences made up the largest share of the LTR retroelements (67.1%). LTR Assembly Index (LAI) assessment demonstrated that Amon2.0 achieved gold-standard quality for repeat regions, with an LAI score of 21.27. In total, 75 226 protein-coding genes from 35 556 gene families were predicted. Specifically, 34 728 and 40 282 genes were identified in sub-genomes A and B, respectively. Of these genes, 74.6% were functionally annotated and 91.2% contained at least one known structural domain. The complete and duplicated Benchmarking Universal Single-Copy Orthologs (BUSCO) scores were 99.3% and 91.5%, respectively, corresponding to 96.2% and 96.7% in sub-genomes A and B, respectively. The densities of genes, DNA transposons and LTR/Copia elements were higher near the telomeres, whereas LTR/Gypsy elements were mainly distributed near the centromeres (Figure 1a).

Compared with the previous A. monticola genome release (Amon1.0), Amon2.0 filled 248.1 Mbp of unknown sequence, with 243.6 Mbp and 180.8 Mbp of sequence anchored in sub-genomes A and B, respectively (Table S2). Amon2.0 was highly contiguous, with a contig N50 of 58.3 Mbp. Amon2.0 had an assembly consensus quality value (QV) score of 34.2, corresponding to >99.9% accuracy in consensus base calls. All of these measures indicate that the contiguity and completeness of both sub-genomes in Amon2.0 are superior to those of Amon1.0.
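The link between a consensus quality value and base-call accuracy can be checked with a short calculation. The sketch below assumes the quality value is Phred-scaled, i.e. QV = -10 · log10(error rate), which is the usual convention but is not stated explicitly in the text; the function names are illustrative.

```python
def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1 - qv_to_error_rate(qv)


# Quality value reported for Amon2.0.
reported_qv = 34.2
print(f"error rate: {qv_to_error_rate(reported_qv):.2e}")
print(f"accuracy:   {qv_to_accuracy(reported_qv):.4%}")
```

Under this assumption, any quality value above 30 corresponds to fewer than one error per 1000 bases, which is consistent with the >99.9% accuracy stated for a score of 34.2.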

Collinearity analysis of Amon2.0 and Amon1.0 indicated high similarity between corresponding chromosomes (Figure 1c; Figure S1) and lower similarity between homologous chromosomes. We identified an inconsistent region in chromosome A03 (14.94–32 Mbp in Amon2.0) (Figure 1d). This region was verified by the alignment of long reads (Figure 1e,f), indicating that Amon2.0 eliminated some assembly errors present in Amon1.0. Compared with Amon2.0, 10 818 structural variations (SVs) were detected in the cultivated peanut Tifrunner (Table S3). Some SVs influence the structure and expression of nearby genes (e.g. LRK10L and ARF3) (Figure S2).

We next performed RNA sequencing in five A. monticola tissues. Across the five tissues, 57.5%–64.8% of all genes in the genome were expressed (FPKM > 0), corresponding to 61.3%–68.6% and 54.7%–61.8% of the genes in sub-genomes A and B, respectively (Figure 1g). In total, 71.7% of all predicted genes were expressed in at least one tissue, corresponding to 75.6% and 68.6% of the genes in sub-genomes A and B, respectively. Based on gene family data, all sub-genomic genes were further divided into four groups: A-specific (genes present only in sub-genome A), A homologue (sub-genome A genes present in both sub-genomes), B-specific (genes present only in sub-genome B) and B homologue (sub-genome B genes present in both sub-genomes); there were 7682, 25 237, 10 718 and 27 162 genes in these groups, respectively. Overall, A homologue and B homologue genes were expressed at significantly higher levels than A-specific or B-specific genes (P < 0.05, Wilcoxon rank sum test) (Figure 1h). We further investigated the expression levels of 16 007 single-copy orthologous genes. Preliminary results showed a trend of tissue-specific asymmetric gene expression for paired genes belonging to the two sub-genomes (Figure 1i). For example, the orthologous gene LECC1 (SL-I) was more highly expressed from sub-genome A in the fruits (FPKM of AM09G34310 = 9657; FPKM of AM19G32920 = 3833; Figure 1j). SL-I is related to a gene encoding a reportedly drought-inducible alpha-methyl-mannoside-specific lectin.
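The four-way grouping described above can be sketched as follows. This is one plausible reading of the description rather than the authors' actual pipeline: it assumes that a gene counts as a "homologue" when its gene family has members in both sub-genomes, and as "specific" otherwise. The function name, input dictionaries and toy gene identifiers are all hypothetical.

```python
from collections import defaultdict


def classify_genes(gene_to_family, gene_to_subgenome):
    """Label each gene as 'A-specific', 'A homologue', 'B-specific' or 'B homologue'.

    gene_to_family maps gene ID -> gene family ID.
    gene_to_subgenome maps gene ID -> 'A' or 'B'.
    """
    # Record which sub-genomes contribute members to each family.
    subgenomes_per_family = defaultdict(set)
    for gene, family in gene_to_family.items():
        subgenomes_per_family[family].add(gene_to_subgenome[gene])

    groups = {}
    for gene, family in gene_to_family.items():
        sub = gene_to_subgenome[gene]
        # A family shared by both sub-genomes makes its genes homologues.
        shared = {"A", "B"} <= subgenomes_per_family[family]
        groups[gene] = f"{sub} homologue" if shared else f"{sub}-specific"
    return groups


# Toy input (hypothetical identifiers).
toy_families = {"geneA1": "fam1", "geneB1": "fam1", "geneA2": "fam2", "geneB2": "fam3"}
toy_subgenomes = {"geneA1": "A", "geneB1": "B", "geneA2": "A", "geneB2": "B"}
for gene, group in classify_genes(toy_families, toy_subgenomes).items():
    print(gene, group)
```

Genes on unanchored scaffolds would need a separate rule; the sketch assumes every gene is assigned to sub-genome A or B.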

In conclusion, this study introduces Amon2.0, a near-complete, highly accurate genome assembly for A. monticola. Comparison of this assembly with the previous A. monticola reference genome clearly demonstrated the increased continuity, completeness and accuracy of Amon2.0. The quality of this genome enabled us to observe tissue-specific asymmetric gene expression patterns between the A. monticola sub-genomes. This genome assembly will serve as a fundamental basis for further understanding of the domestication and evolutionary histories of Arachis spp. and the family Fabaceae more broadly. Furthermore, this genetic resource will contribute to functional genomics and future molecular-assisted breeding in these economically important legume crops.




Updated: 2024-03-04