BQSR on single samples or merged BAMs

simono101 London, UK Member

I have ~150 WGS samples, all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells.

I want to perform BQSR on my BAM files before calling variants. However, for my organism I do not have access to a resource like dbSNP for true variants. So I am following the bootstrapping protocol: doing a first round of variant calling, subsetting to the SNPs with highest confidence, and using these as the known variants for BQSR.

My question is: should I carry this out on samples individually? That is, with one BAM per sample, I would use HaplotypeCaller for initial variant calling, subset to the best SNPs, use these for BaseRecalibrator, and apply the recalibration to the original sample before carrying on with single-sample variant calling and then joint genotyping as per best practices.


Alternatively, as I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR are actually calculated, and whether I can increase the accuracy of my recalibration in this way. Each sample has between 500M and 5 billion nucleotides sequenced.

Many thanks.


  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Internally, the recalibration is done per read group, which is the fundamental unit at which it makes sense to do the recalibration. So it is not useful to combine the data before BQSR. However, for the bootstrapping process, you can absolutely group the callsets from multiple samples into a common pseudo-dbSNP resource.
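
    To illustrate why merging does not help, here is a minimal Python sketch of a per-read-group tally. It is not GATK's implementation: the real tool uses further covariates (such as machine cycle and sequence context) and a different estimator. The read group IDs, the toy counts, and the +1/+2 pseudocount are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def tally(observations):
    """Count bases and mismatches per (read_group, reported_quality) bin."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_bases, n_mismatches]
    for read_group, reported_q, is_mismatch in observations:
        bin_counts = counts[(read_group, reported_q)]
        bin_counts[0] += 1
        bin_counts[1] += int(is_mismatch)
    return {key: tuple(value) for key, value in counts.items()}


def empirical_quality(n_bases, n_mismatches):
    """Phred-scaled error rate with a +1/+2 pseudocount (a simplification)."""
    error_rate = (n_mismatches + 1) / (n_bases + 2)
    return -10 * math.log10(error_rate)


# Hypothetical toy data: (read group, reported quality, mismatch at a site
# that is NOT in the known-variants resource). Two samples share a lane but
# still have distinct read groups.
sample_a = ([("flowcell1.lane1.A", 30, False)] * 998
            + [("flowcell1.lane1.A", 30, True)] * 2)
sample_b = ([("flowcell1.lane1.B", 30, False)] * 990
            + [("flowcell1.lane1.B", 30, True)] * 10)

# Tallying each sample on its own, then combining the tables...
separate = {**tally(sample_a), **tally(sample_b)}
# ...gives the same bins as tallying a merged input, because every bin is
# keyed by read group.
merged = tally(sample_a + sample_b)

for (read_group, reported_q), (n_bases, n_mismatches) in sorted(merged.items()):
    q_emp = empirical_quality(n_bases, n_mismatches)
    print(f"{read_group}: reported Q{reported_q}, empirical Q{q_emp:.1f}")
```

    Because each bin is keyed by read group, the merged and per-sample tallies are identical. What does benefit from pooling is the known-sites resource used to decide which mismatches to ignore.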

