Population / Reference bias in BaseRecalibrator?

AndresRibone Member
edited November 2019 in Ask the GATK team

Hi, I work in sunflower (H. annuus) transcriptomics.

I'm trying to genotype a large number of RNA-seq samples from wild accessions collected in distant locations.
I aligned them against the reference genome and got a reasonably good alignment rate (mean = 92%, min = 81%, max = 95%). For the BQSR step, I have a reference SNP database (a large .vcf file) built from commercial cultivars, as is the reference genome itself.

Since I expect substantial differences between these genomes and the reference, is it OK to use the "reference" SNPs? Or should I ignore them and use the VariantCalling -> BQSR -> VariantCalling approach? Or perhaps a mixed approach, where I feed both the reference .vcf and the new .vcf to BQSR?
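For concreteness, the VariantCalling -> BQSR -> VariantCalling option could look roughly like the dry-run sketch below. It assumes GATK 4 argument names, and every file name is a hypothetical placeholder. It only prints the planned commands and does not run them. Filtering the first-pass calls down to a high-confidence set is omitted here.

```shell
# Dry-run sketch of a bootstrapped BQSR round (prints commands, runs nothing).
# Assumptions: GATK 4 argument names; all file names are hypothetical.
set -eu

ref=reference.fasta
bam=sample.bam

plan=$(cat <<EOF
gatk HaplotypeCaller -R $ref -I $bam -O firstpass.vcf
gatk BaseRecalibrator -R $ref -I $bam --known-sites firstpass.vcf -O recal.table
gatk ApplyBQSR -R $ref -I $bam --bqsr-recal-file recal.table -O recal.bam
gatk HaplotypeCaller -R $ref -I recal.bam -O secondpass.vcf
EOF
)

# Show the four planned steps: call, build recalibration table, apply, call again.
echo "$plan"
```

In the mixed approach, the BaseRecalibrator step would additionally receive the cultivar-derived .vcf as a second set of known sites.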

From what I have read about reference bias in variant calling, the bias starts at the alignment step (of course) and trickles down from there.

Also, to feed several .vcf files to BaseRecalibrator, do I need to merge them first, or can I pass them separately, like this?
gatk BaseRecalibrator [...] --known-sites VCF1.vcf --known-sites VCF2.vcf [...]
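As a small sketch of the unmerged option, the snippet below builds a BaseRecalibrator command with one known-sites argument per VCF. It assumes the GATK 4 spelling `--known-sites` and that the argument may be repeated. The file names are hypothetical placeholders and must not contain spaces. The command is printed, not executed.

```shell
# Sketch: assemble a BaseRecalibrator call with several known-sites files.
# Assumptions: GATK 4 argument names; file names are hypothetical and space-free.
set -eu

known_vcfs="cultivars.vcf wild_firstpass.vcf"

# Repeat the known-sites flag once per VCF instead of merging the files.
known_args=""
for vcf in $known_vcfs; do
  known_args="$known_args --known-sites $vcf"
done

cmd="gatk BaseRecalibrator -R reference.fasta -I sample.bam$known_args -O sample.recal.table"

# Print the command for inspection; run it manually once it looks right.
echo "$cmd"
```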

Thanks for your time!


