
BaseRecalibrator - Trade-Off between runtime and accuracy

tonytony Member Posts: 4
edited August 2012 in Ask the GATK team

Hi,

We are working with Illumina HiSeq 2000 paired-end data and as time goes by, lanes yield more and more sequences.

We are processing data at the lane BAM level (only one read group). The pipeline includes, among other steps, BWA mapping, indel realignment, duplicate flagging, and base quality recalibration. As expected this is a long process, but the base recalibration stage is by far the longest, especially when lanes contain many sequences. We are using the QualityScoreCovariate, ReadGroupCovariate, ContextCovariate and CycleCovariate covariates.

For instance, we have quite big lanes:

1 lane of 140,000,000 pairs (280,000,000 reads): ~36 hours for recalibration

1 lane of 185,000,000 pairs (370,000,000 reads): ~48 hours for recalibration

We obviously wish to reduce this run time, and the following link has a short section on the topic (at the very end of the page):
http://gatk.vanillaforums.com/discussion/44/base-quality-score-recalibrator#latest

So we are really keen to downsample our BAM files to reduce run time, but at the same time we want our data to be as accurate as possible, for instance to help us reduce the false positive substitution rate. If it is worth the wait, we will wait.

Nevertheless, in the plot shown in the link above, the x axis stops at 5,000,000 reads, where the RMSE value seems to have reached a plateau.

1) We were thus wondering: is there an empirical read count threshold above which the accuracy of the recalibration no longer improves?

2) If such a threshold exists, I cannot find the '--process_nth_locus' switch described in the link above; should I use the '-dt', '-dfrac' or '-dcov' options instead to downsample?

3) Does '--num_threads' work with the BaseRecalibrator walker? Up to how many threads?

Thanks a lot,

Best Regards,

Anthony

PS : GATK version used is v2.0-23-ge9a19be

Best Answer

  • ebanksebanks Broad InstituteMember, Broadie, Dev Posts: 692 admin
    Accepted Answer

    Good questions.

    1) Theoretically, the more data you have, the more accurate your recalibration will be. But the marginal difference in accuracy is very small (as shown by the plots in the thread you referenced) once you have enough data. By default the GATK engine will downsample a position to 1000x coverage, but you could certainly knock that number down without harm. You can use the plot at the bottom of the other thread to determine how much you'll want to downsample.

    2) -dcov. --process_nth_locus is no longer a recommended/valid option and I've updated the other post. Thanks for the heads up.

    3) Yes. As many as your machine can handle well (every system is different) - but you'll need to bump up the memory accordingly.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT
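For reference, a run along the lines Eric describes might look like the sketch below. This is GATK 2.x syntax; the reference, BAM, and known-sites file names are placeholders, and the -dcov value is an illustrative choice based on his comment that the default 1000x target coverage can be knocked down, not a tested recommendation:

```shell
# Hedged sketch of lane-level BQSR with engine downsampling (GATK 2.x).
# ref.fasta, lane.bam and dbsnp.vcf are placeholder file names.
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ref.fasta \
    -I lane.bam \
    -knownSites dbsnp.vcf \
    -dcov 250 \
    -o recal_data.grp
```

The plot in the thread linked above is the right guide for picking the -dcov value for your own data.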

Answers

  • avinashavinash Member Posts: 1

    I have the same question.
    Avinash


  • tonytony Member Posts: 4

    Just a quick point:
    The same link also mentions that one can downsample by selecting a genome interval with the -L option. Would you rather recommend downsampling with -dcov over the whole genome, or is it fine to use only chromosome 1, for instance, if enough reads were mapped back to that genomic region?

  • ebanksebanks Broad InstituteMember, Broadie, Dev Posts: 692 admin

    Ideally you'd use the whole genome with downsampling (because then you see more instances for the context covariate).

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • pylpyl Member Posts: 9

    Hi,

    I was wondering which version of GATK you were referring to when you pointed out that one could use -nt to parallelize this walker, I get an error message, see below:

    INFO 08:42:39,778 HelpFormatter - ---------------------------------------------------------------------------------
    INFO 08:42:39,782 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-10-gbbafb72, Compiled 2012/11/26 03:37:03
    [...]

    ERROR
    ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis BaseRecalibrator currently does not support parallel execution with nt. Please run your analysis without the nt option.
    ERROR ------------------------------------------------------------------------------------------

    Cheers,
    Paul

  • chapmanbchapmanb Boston, MAMember Posts: 33

    GATK 2.2+ uses the -nct option for parallelization of this walker, instead of -nt. See the release notes on GATK 2.2, specifically the 'Multi-threading power returns to BaseRecalibrator' section, for more details:

    http://www.broadinstitute.org/gatk/guide/version-history

    Brad Chapman, Bioinformatics Core at Harvard Chan School

  • amiami Dev Posts: 50

    Just a quick note: you can use both -nt and -nct in GATK 2.2+ (each one by itself or even together); they provide different types of parallelization.
    BQSR, on the other hand, works only with -nct, which is the reason for the error you got.
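Putting the two threading flags in context, a multi-threaded BQSR invocation using -nct might look like this sketch (GATK 2.2+ syntax; the file names and the thread count of 8 are placeholders to adapt to your own machine):

```shell
# Hedged sketch: -nct spreads CPU threads within one data thread,
# which is the only parallel mode BaseRecalibrator supports in 2.2+.
# Remember to bump -Xmx up along with the thread count.
java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ref.fasta \
    -I lane.bam \
    -knownSites dbsnp.vcf \
    -nct 8 \
    -o recal_data.grp
```

Passing -nt instead would reproduce the error Paul pasted above.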

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,427 admin

    We'll have a document outlining which tools can use which modes up soon, possibly later today.

    Geraldine Van der Auwera, PhD
