Over the past decade, we have gained considerable insight into the identification of sequence variation within the rDNA array of Saccharomyces cerevisiae and its closest wild relative, Saccharomyces paradoxus. Yet substantial challenges remain in the computational characterisation of this complex genomic region. This study aimed to evaluate the use of variation graphs for this purpose, formally comparing their effectiveness with traditional linear approaches.

Specifically, we aimed to identify both partial and fixed variants (i.e. pSNPs, SNPs, pINDELs and INDELs) in the rDNA arrays of 10 diverse, haploid strains with high-quality genomic datasets. We constructed two computational pipelines based on two markedly different approaches. The first pipeline used the BWA read mapper and the BCFtools variant caller to identify variants against the linear S288c reference. The second used the vg tool to call variants against a graphical reference (based either on a graphical representation of the S288c genome or on a pan-genome).
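
The distinction between partial and fixed variants can be illustrated with a minimal sketch. Because reads from every repeat copy of the rDNA array pile up on a single reference unit, a variant carried by only some copies would be expected to appear at an intermediate alternate-allele fraction, whereas a fixed variant should be supported by nearly all reads. The `VariantSite` type, the `classify` function and the 0.95 cut-off below are illustrative assumptions; they are not taken from either pipeline described here.

```python
from dataclasses import dataclass

# Hypothetical cut-off: the text does not state how partial and fixed
# variants were separated, so this value is an assumption for illustration.
FIXED_THRESHOLD = 0.95


@dataclass
class VariantSite:
    """Read support at one candidate variant position (hypothetical type)."""
    ref: str        # reference allele
    alt: str        # alternate allele
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def classify(site: VariantSite, fixed_threshold: float = FIXED_THRESHOLD) -> str:
    """Return 'SNP', 'pSNP', 'INDEL', 'pINDEL', or 'none' for a site."""
    depth = site.ref_reads + site.alt_reads
    if depth == 0 or site.alt_reads == 0:
        return "none"  # no alternate-allele support at this site
    # Single-base substitutions are SNPs; any length change is an INDEL.
    kind = "SNP" if len(site.ref) == 1 and len(site.alt) == 1 else "INDEL"
    alt_fraction = site.alt_reads / depth
    # At or above the cut-off the variant is treated as fixed across copies;
    # below it, as partial (present in only some copies).
    return kind if alt_fraction >= fixed_threshold else "p" + kind
```

For example, a single-base change supported by 300 of 1000 reads would be labelled a pSNP under this assumed cut-off, while the same change supported by all reads would be labelled a SNP.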

The results showed that the graph-based pipeline identified more variants than the linear pipeline, particularly partial variants, but missed some key variants identified by BWA/BCFtools. A major discrepancy between the two pipelines was found in the read coverage at loci where the vg pipeline identified variants. In the coming months, we aim to investigate the cause of these differences and to develop a new graph-based computational pipeline that can accurately identify the full range of sequence and copy number variation within this key genomic region.

