Targeted Sequencing: appropriate to use BaseRecalibrator (BQSR) on 150M bases over small intervals?

teamcoopteamcoop Ontario, CanadaMember

I know the best practices document is on its way, but I recently came across this quite new page on how to use the -L argument as was surprised to see that:

Small Targeted Experiments
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I knew that VQSR could not be used on small targeted experiments, but didn't know that BQSR should not be used. The Base Quality Score Recalibration page includes the note:

A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.

If I have 150Mbases of data over a set of small target intervals, does that count as a small dataset for which I should avoid using BQSR? What about 1B bases, again over a small set of intervals? What are the best practices in this case?

Thanks for your help!

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @teamcoop‌

    Hi,

    "Small datasets" from the -L FAQ refers to datasets with less than 100M bases.

    If you have 150M bases for a small targeted experiment, you should be fine to run BQSR. 1B is even better. Pretty much, anything above 100M bases is fine, but the bigger the dataset, the more accurate BQSR will be.

    -Sheila

  • teamcoopteamcoop Ontario, CanadaMember

    @Shiela, great! Thanks for your quick response.

  • Hi @Sheila,

    I'd like to ask for a quick clarification - are small datasets ones with fewer than 100M targeted bases or ones with fewer than 100M total bases i.e. targeted bases x coverage?

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @aliwali

    Hi,

    Small datasets have fewer than 100M total bases.

    -Sheila

  • When you talk about 100M total bases, does it mean the number of bases that can be mapped to the genome such that they are contained in the bam file to be inputted to BQSR???

    What about duplicate reads detected by MarkDuplicates? Are they counted in total bases?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Keep in mind these are very general numbers to give you an idea of the range, rather than absolute thresholds. These numbers refer to the amount of useable data, which typically would not include duplicates.

  • annatannat Member
    edited June 2015

    I have another question - does this 100M cut off apply per read group or for the whole set.
    I have an experiment with a small target set (31,243 bases) but 719 read groups/samples, all run on one MiSeq flowcell.
    The mean number of bases is ~3.5M per read group, but in total 2.5G.
    Also have not marked duplicates as this is a Fluidigm Access Array (PCR based).
    Please could you clarify if BQSR is appropriate?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @annat
    Hi,

    Usually it is best to have greater than 100 million bases per read group. But, in your case, it is fine to run BQSR on your entire dataset because BQSR accounts for errors in the lane and all your data comes from one lane.

    -Sheila

  • annatannat Member
    edited July 2015

    Hi @Sheila ,
    I have run BQSR on all my read groups together, each specified with -I. I am now trying to run PrintReads using the bqsr table and the individual bam files for each read group separately.

    The job starts and I get the following output (see below), but nothing happens after that. It doesn't error out, produce any error message etc. I am running with java -Xmx4g . I have left these jobs running overnight and they have started using greater than 100% cpu even though I believe -nt doesn't work for this tool and I am not specifying it, nor am I specifying -nct.
    Some jobs do eventually run out of memory, but I don't think that more than 4G should be required nor any long runtime, as I have previously used this tool and it is quite quick with much larger bam files than this (this would have been with older version of GATK), I am trying with 20G memory but still having same problem.

    I have also tried using the -sn flag with the sample name but am still getting the same problem.
    Is the problem that not all samples present in the table file are present in the bam or is there a bug in this tool?

    TIA, anna

    INFO  23:56:24,978 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  23:56:24,994 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 
    INFO  23:56:24,994 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  23:56:24,994 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  23:56:25,002 HelpFormatter - Program Args: -T PrintReads -I sam/bwa/gatk/10081.ir.bam -R data/GATKbundle/2.8/hg19/ucsc.hg19.fasta -BQSR sam/bwa/gatk/all.baseRecal.tab -o sam/bwa/gatk/10081.br.bam 
    INFO  23:56:25,011 HelpFormatter - Executing as [email protected] on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_65-b17. 
    INFO  23:56:25,012 HelpFormatter - Date/Time: 2015/07/06 23:56:24 
    INFO  23:56:25,012 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  23:56:25,012 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  23:56:26,397 GenomeAnalysisEngine - Strictness is SILENT
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It's hard to say but if this is a new problem it could be an issue with your compute infrastructure. Have you checked with your IT support people that the servers are running normally? Sometimes nodes get stuck and need to be rebooted.

  • annatannat Member

    Hi @Geraldine_VdAuwera
    It seems to be just a memory issue, I tried on two different compute infrastructures and both required more memory to get going, 8G seemed to work, and get a total runtime of ~ 20 minutes.
    Just seems a lot of memory as bam files are only ~4MB and target is small.

Sign In or Register to comment.