GATK HaplotypeCaller run

Hi,

I am using HaplotypeCaller in GATK version 3.2.2 for variant calling on whole-genome Illumina reads. I ran the following command, as per the Best Practices, both with and without the multithreading option (-nct).

java -jar /GenomeAnalysisTK-3-2-2/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 10 -I infile.re.recal.bam -R /genome/human_g1k_v37.fasta -o outfile_raw.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3

Without the -nct option, variants found: 207828
With the -nct option, variants found: 207850

Can you please help me understand why 22 extra variants were found with the multithreading option? And would you suggest using the multithreading option or not?

Thank you in advance.

-SK

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @surendrk

    Hi SK,

    These extra calls are probably marginal effects of downsampling. Can you check the qualities and annotations of the 22 different calls?

    Thanks,
    Sheila

  • surendrk (Member)

    Hi Sheila,

    I checked the files again, and the exact numbers are slightly higher.

    The breakdown is as follows:

    Without -nct there are 209 unique calls (83 calls with QUAL > 500, 16 calls with QUAL > 6000).

    With -nct there are 231 unique calls (66 calls with QUAL > 500, 7 calls with QUAL > 6000).

    These calls are unique to their respective run, i.e. with or without -nct. Any pointers would be very helpful.

    Thanks,
    SK
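For anyone wanting to reproduce this kind of breakdown, here is a minimal Python sketch, not the method SK actually used. It assumes plain-text, single-sample VCFs and treats a call as identical when CHROM, POS, REF and ALT all match; the helper names (`load_calls`, `only_in`, `count_above`) and the toy records are made up for illustration.

```python
import io


def load_calls(handle):
    """Map (CHROM, POS, REF, ALT) -> QUAL for every record in a VCF stream."""
    calls = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        calls[(chrom, int(pos), ref, alt)] = None if qual == "." else float(qual)
    return calls


def only_in(first, second):
    """Calls present in `first` but absent from `second`."""
    return {site: qual for site, qual in first.items() if site not in second}


def count_above(calls, threshold):
    """Number of calls whose QUAL exceeds `threshold`."""
    return sum(1 for qual in calls.values() if qual is not None and qual > threshold)


def vcf_text(records):
    """Build a tiny tab-separated VCF body from tuples (toy data only)."""
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    return header + "".join("\t".join(map(str, r)) + "\n" for r in records)


# Toy records standing in for the two runs; the values are invented.
no_nct = io.StringIO(vcf_text([
    ("1", 1000, ".", "A", "G", 812.77, ".", "."),
    ("1", 2000, ".", "C", "T", 45.10, ".", "."),
]))
with_nct = io.StringIO(vcf_text([
    ("1", 1000, ".", "A", "G", 812.77, ".", "."),
    ("1", 3000, ".", "G", "GA", 6120.50, ".", "."),
]))

calls_no_nct = load_calls(no_nct)
calls_with_nct = load_calls(with_nct)
unique_no_nct = only_in(calls_no_nct, calls_with_nct)
unique_with_nct = only_in(calls_with_nct, calls_no_nct)

for label, unique in (("without -nct", unique_no_nct), ("with -nct", unique_with_nct)):
    print(label, len(unique), count_above(unique, 500), count_above(unique, 6000))
```

With real data, the `io.StringIO` objects would be replaced by open file handles for the two output VCFs. Because ALT is part of the key, a site called in both runs with different alleles is counted as unique to each run.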

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    It's hard to comment based on just that information. What I can tell you is that typically we see some marginal differences in calls that are due to non-deterministic downsampling effects when -nct is used. This should only affect borderline calls that would get filtered out by VQSR anyway. If you see this happen for calls that are high-quality (not just high QUAL but also looking good on other annotations) then it may be cause for concern. We are definitely moving away from using -nct with the HaplotypeCaller because it seems to cause issues for people which are really hard to pin down.

  • surendrk (Member)

    OK, I'll get back to you after the VQSR and annotation steps.

    Thank you Sheila and Geraldine.

    SK

  • mglclinical (USA; Member)

    Based on the previous comments, it looks like Queue is a better choice for parallelizing HaplotypeCaller than the -nct option.

    I tested the performance of HaplotypeCaller (GATK 3.4) by increasing the number of CPU threads (-nct) while keeping the rest of the parameters constant, and recorded the time each run took. Our server has 128 GB of RAM in total and 20 cores. I made sure that no other applications were running on the server during the performance testing:

    HaplotypeCaller was fastest when run with just 1 CPU thread (-nct 1), and it got slower as the number of CPU threads increased. The content (SNPs and indels) of the raw VCF files (${variants_file}) remained almost the same across all runs.

    java -Xmx90g -Djava.io.tmpdir=pwd/tmp -jar GenomeAnalysisTK_3.4.jar -nct ${cpu_cores} -T HaplotypeCaller -R ucsc.hg19.fasta -I realigned_bam_file.bam -stand_emit_conf 10 -stand_call_conf 30 -L nexterarapidcapture_exome.bed -o ${variants_file}

    -nct 1
    GATK HC ... 1 threads
    Start : Mon Nov 16 10:04:53 EST 2015
    End : Mon Nov 16 11:36:48 EST 2015
    GATK HC Elapsed Time 1 hours 31 minutes 55 seconds

    -nct 2
    GATK HC ... 2 threads
    Start : Mon Nov 16 11:36:48 EST 2015
    End : Mon Nov 16 13:26:53 EST 2015
    GATK HC Elapsed Time 1 hours 50 minutes 5 seconds

    -nct 3
    GATK HC ... 3 threads
    Start : Mon Nov 16 13:26:53 EST 2015
    End : Mon Nov 16 15:26:06 EST 2015
    GATK HC Elapsed Time 1 hours 59 minutes 13 seconds

    -nct 4
    GATK HC ... 4 threads
    Start : Mon Nov 16 15:26:06 EST 2015
    End : Mon Nov 16 17:29:19 EST 2015
    GATK HC Elapsed Time 2 hours 3 minutes 13 seconds

    -nct 5
    GATK HC ... 5 threads
    Start : Mon Nov 16 17:29:19 EST 2015
    End : Mon Nov 16 19:37:14 EST 2015
    GATK HC Elapsed Time 2 hours 7 minutes 55 seconds

    -nct 6
    GATK HC ... 6 threads
    Start : Mon Nov 16 19:37:14 EST 2015
    End : Mon Nov 16 21:45:59 EST 2015
    GATK HC Elapsed Time 2 hours 8 minutes 45 seconds

    -nct 7
    GATK HC ... 7 threads
    Start : Mon Nov 16 21:45:59 EST 2015
    End : Mon Nov 16 23:57:47 EST 2015
    GATK HC Elapsed Time 2 hours 11 minutes 48 seconds

    -nct 8
    GATK HC ... 8 threads
    Start : Mon Nov 16 23:57:47 EST 2015
    End : Tue Nov 17 02:09:44 EST 2015
    GATK HC Elapsed Time 2 hours 11 minutes 57 seconds

    -nct 12
    GATK HC ... 12 threads
    Start : Fri Nov 13 18:40:43 EST 2015
    End : Fri Nov 13 20:55:51 EST 2015
    GATK HC Elapsed Time 2 hours 15 minutes 8 seconds

    -nct 16
    GATK HC ... 16 threads
    Start : Fri Nov 13 20:55:51 EST 2015
    End : Fri Nov 13 23:14:52 EST 2015
    GATK HC Elapsed Time 2 hours 19 minutes 1 seconds
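To put the timings above on a common scale, the following Python sketch expresses each run as a multiple of the single-threaded run. The elapsed times are copied from the log above; the `elapsed` dictionary and the `slowdown` helper are illustrative names only.

```python
from datetime import timedelta

# Elapsed wall-clock time per -nct value, copied from the log above.
elapsed = {
    1: timedelta(hours=1, minutes=31, seconds=55),
    2: timedelta(hours=1, minutes=50, seconds=5),
    3: timedelta(hours=1, minutes=59, seconds=13),
    4: timedelta(hours=2, minutes=3, seconds=13),
    5: timedelta(hours=2, minutes=7, seconds=55),
    6: timedelta(hours=2, minutes=8, seconds=45),
    7: timedelta(hours=2, minutes=11, seconds=48),
    8: timedelta(hours=2, minutes=11, seconds=57),
    12: timedelta(hours=2, minutes=15, seconds=8),
    16: timedelta(hours=2, minutes=19, seconds=1),
}


def slowdown(threads):
    """Elapsed time with `threads` threads relative to the -nct 1 run."""
    return elapsed[threads] / elapsed[1]


for threads in sorted(elapsed):
    print(f"-nct {threads:>2}: {elapsed[threads]}  ({slowdown(threads):.2f}x of -nct 1)")
```

On these numbers the 16-thread run takes roughly 1.5 times as long as the single-threaded one. This is a single run per setting on one server, so it says nothing about run-to-run variance.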

  • mglclinical (USA; Member)

    Would the performance (speed) of HaplotypeCaller change if I changed (decreased or increased) the Java heap size?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    Sorry @mglclinical, we haven't done that sort of profiling ourselves, so I can't give you a proper answer beyond "we just use Queue to scatter-gather". We have some collaborators who have been doing work along these lines, however, and who intend to share their results in the near future. At that time we'll be sure to link to their materials.
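As a rough illustration of the scatter half of scatter-gather, the Python sketch below builds one single-threaded HaplotypeCaller command per contig. This is not Queue and not how the GATK team runs it; it only reuses flags that already appear in the commands earlier in this thread. The function name `scatter_commands`, the contig list, and the output naming scheme are assumptions made for the example.

```python
def scatter_commands(contigs, jar, reference, bam, out_prefix):
    """Build one HaplotypeCaller command (as an argument list) per contig."""
    commands = []
    for contig in contigs:
        commands.append([
            "java", "-jar", jar,
            "-T", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-L", contig,  # restrict this job to a single contig
            "-stand_call_conf", "30",
            "-stand_emit_conf", "10",
            "-o", f"{out_prefix}.{contig}.vcf",  # hypothetical naming scheme
        ])
    return commands


# Hypothetical example: two contigs of a b37-style reference.
jobs = scatter_commands(
    contigs=["20", "21"],
    jar="GenomeAnalysisTK.jar",
    reference="/genome/human_g1k_v37.fasta",
    bam="infile.re.recal.bam",
    out_prefix="outfile_raw",
)

for job in jobs:
    print(" ".join(job))
```

Each argument list could be handed to `subprocess.run` or a cluster scheduler as an independent process. The per-contig VCFs would then still have to be merged in a separate gather step, which is not shown here.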
