Targeted Sequencing: appropriate to use BaseRecalibrator (BQSR) on 150M bases over small intervals?

I know the best practices document is on its way, but I recently came across this quite new page on how to use the -L argument and was surprised to see that:
Small Targeted Experiments
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.
I knew that VQSR could not be used on small targeted experiments, but didn't know that BQSR should not be used. The Base Quality Score Recalibration page includes the note:
A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
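The reason more bases help can be sketched with a toy calculation. This is not GATK's actual model (GATK uses a more sophisticated Bayesian approach), just a binomial approximation of how the uncertainty of an empirical quality estimate shrinks as the number of observed bases in a bin grows:

```python
import math

def quality_uncertainty(error_rate, n_bases):
    """Approximate uncertainty (in Phred units) of the empirical quality
    estimated from n_bases observations at a given true error rate.
    Toy illustration only, not GATK's actual recalibration model."""
    se = math.sqrt(error_rate * (1 - error_rate) / n_bases)  # binomial standard error
    return (10 / math.log(10)) * se / error_rate             # delta-method conversion to Phred

# For a Q30 bin (true error rate 0.001), more data means a tighter estimate:
small = quality_uncertainty(0.001, 100_000)      # ~0.43 Phred units of uncertainty
large = quality_uncertainty(0.001, 100_000_000)  # ~0.014 Phred units of uncertainty
```

With only 100K observed bases in a bin, the recalibrated quality can easily be off by nearly half a Phred unit; with 100M bases the estimate is effectively exact, which matches the guideline quoted above.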
If I have 150M bases of data over a set of small target intervals, does that count as a small dataset for which I should avoid using BQSR? What about 1B bases, again over a small set of intervals? What are the best practices in this case?
Thanks for your help!
Comments
@teamcoop
Hi,
"Small datasets" from the -L FAQ refers to datasets with less than 100M bases.
If you have 150M bases for a small targeted experiment, you should be fine to run BQSR. 1B is even better. Pretty much, anything above 100M bases is fine, but the bigger the dataset, the more accurate BQSR will be.
-Sheila
@Sheila, great! Thanks for your quick response.
Hi @Sheila,
I'd like to ask for a quick clarification: are small datasets ones with fewer than 100M targeted bases, or ones with fewer than 100M total bases, i.e. targeted bases × coverage?
Thanks!
@aliwali
Hi,
Small datasets have fewer than 100M total bases.
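That definition (total bases = targeted bases × coverage) can be turned into a quick back-of-the-envelope check. The function name, the duplicate-fraction discount, and the example numbers below are hypothetical, chosen to mirror the 150M-base case from the original question:

```python
def total_usable_bases(target_bp, mean_coverage, n_samples=1, dup_fraction=0.0):
    """Rough estimate of usable bases: targeted bases x coverage,
    optionally discounting duplicates, which don't count as usable data."""
    return target_bp * mean_coverage * n_samples * (1 - dup_fraction)

THRESHOLD = 100_000_000  # ~100M usable bases suggested before running BQSR

# Hypothetical example: 50 kb of target intervals at 3000x mean coverage
# gives 150M usable bases, so BQSR would be fine by this rule of thumb.
enough = total_usable_bases(50_000, 3_000) >= THRESHOLD
```

Note that, per the follow-up below, these numbers are rules of thumb rather than hard thresholds, so this check is only a sanity test, not a gatekeeper.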
-Sheila
When you talk about 100M total bases, do you mean the number of bases that map to the genome, i.e. the bases contained in the BAM file that is input to BQSR?
What about duplicate reads detected by MarkDuplicates? Are they counted in total bases?
Keep in mind these are very general numbers to give you an idea of the range, rather than absolute thresholds. These numbers refer to the amount of usable data, which typically would not include duplicates.
I have another question: does this 100M cut-off apply per read group or to the whole set?
I have an experiment with a small target set (31,243 bases) but 719 read groups/samples, all run on one MiSeq flowcell.
The mean number of bases is ~3.5M per read group, but ~2.5G in total.
I also have not marked duplicates, as this is a Fluidigm Access Array (PCR-based).
Please could you clarify if BQSR is appropriate?
@annat
Hi,
Usually it is best to have greater than 100 million bases per read group. But, in your case, it is fine to run BQSR on your entire dataset because BQSR accounts for errors in the lane and all your data comes from one lane.
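For reference, the numbers in this case can be checked with some quick arithmetic (the variable names are mine; the figures come from the posts above):

```python
target_bp = 31_243      # size of the target set, from the post above
read_groups = 719       # read groups/samples run on one MiSeq flowcell
bases_per_rg = 3.5e6    # mean bases per read group, from the post above

total_bases = bases_per_rg * read_groups     # ~2.5e9, matching the ~2.5G quoted
coverage_per_rg = bases_per_rg / target_bp   # ~112x mean coverage per read group
```

Each read group alone (~3.5M bases) is well under the 100M-base guideline, which is why pooling the whole lane for recalibration, as suggested above, is the sensible route here.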
-Sheila
Hi @Sheila ,
I have run BQSR on all my read groups together, each specified with -I. I am now trying to run PrintReads using the bqsr table and the individual bam files for each read group separately.
The job starts and I get the following output (see below), but nothing happens after that: it doesn't error out or produce any error message. I am running with java -Xmx4g. I have left these jobs running overnight, and they have started using greater than 100% CPU, even though I believe -nt doesn't work for this tool and I am not specifying it, nor am I specifying -nct.
Some jobs do eventually run out of memory, but I don't think more than 4G should be required, nor such a long runtime: I have previously used this tool and it was quite quick with much larger BAM files than these (though that was with an older version of GATK). I am trying with 20G of memory but still having the same problem.
I have also tried using the -sn flag with the sample name but am still getting the same problem.
Is the problem that not all samples present in the table file are present in the BAM, or is there a bug in this tool?
TIA, anna
It's hard to say but if this is a new problem it could be an issue with your compute infrastructure. Have you checked with your IT support people that the servers are running normally? Sometimes nodes get stuck and need to be rebooted.
Hi @Geraldine_VdAuwera
It seems to be just a memory issue. I tried on two different compute infrastructures and both required more memory to get going; 8G seemed to work, giving a total runtime of ~20 minutes.
It just seems like a lot of memory, as the BAM files are only ~4MB and the target is small.