
Huge time difference between HC and UG against 2000 known variants

asaki (China), Member

Hi,

We are running a benchmark against amplicon sequencing data for 2,000 known variants.

We have tried both HaplotypeCaller (HC) and UnifiedGenotyper (UG). However, HC takes much longer than UG, which is not acceptable for us, even though HC should be the more accurate caller.

Is there any way to accelerate the calling process?

One-sample analysis:

1. HC
   real 138m14.093s
   user 147m53.771s
   sys  0m48.910s

2. UG
   real 0m22.982s
   user 1m34.687s
   sys  0m8.169s


Answers

  • Sheila (Broad Institute), Member, Broadie, Moderator, admin

    @asaki
    Hi,

    Yes, this is expected because HaplotypeCaller does a reassembly step that takes time. You can read more about it in the Methods and Algorithms section.

    To speed things up, you can parallelize the calling across intervals, for example with a WDL pipeline. Also, GATK4 should be faster and has easier ways to speed things up, if you can try that.
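For illustration, a minimal scatter-gather sketch of the intervals approach. All paths, file names, and interval names below are placeholders, and the commands are only printed (a dry run), not executed:

```shell
#!/bin/sh
# Dry-run sketch: build one HaplotypeCaller command per interval so the
# intervals can be processed in parallel. Paths and names are placeholders.
REF=genome.fa
BAM=sample.bam

make_hc_cmd() {
    # Print the GATK 3.x HaplotypeCaller command for a single interval.
    printf 'java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R %s -I %s -L %s -o %s.vcf\n' \
        "$REF" "$BAM" "$1" "$1"
}

for CHR in chr1 chr2 chr3; do
    make_hc_cmd "$CHR"
done
# Run the printed commands concurrently (e.g. with '&' or a job scheduler),
# then combine the per-interval VCFs afterwards, e.g. with CatVariants.
```

This is only a sketch of the scatter step; how you gather and merge the per-interval outputs depends on your setup.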

    -Sheila

  • asaki (China), Member

    Hi @Sheila ,

    Thanks for the reply. I knew about the reassembly around active regions in HaplotypeCaller, but did not expect it to take so much time on a single sample with 2,000 targets.

    As you suggested, I ran GATK4 beta3 (HaplotypeCallerSpark) to speed up the calling; however, no results were returned, which confuses me.

    java -jar ~/program/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar HaplotypeCallerSpark \
        -R /bioinfo/data/iGenomes/genome.2bit \
        --alleles ~/lib/wellwise_snps.vcf \
        --genotyping_mode GENOTYPE_GIVEN_ALLELES \
        --output_mode EMIT_ALL_SITES \
        -A StrandBiasBySample \
        -ip 50 \
        -L ~/lib/wellwise_snps.vcf \
        -O GW7O0002D04.hc.vcf \
        -I GWI211.reheader.bam
    

    The last information output by GATK:

    17/08/24 11:16:32 INFO TaskSetManager: Finished task 5.0 in stage 3.0 (TID 19) in 30 ms on localhost (3/7)
    17/08/24 11:16:32 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 17) in 32 ms on localhost (4/7)
    17/08/24 11:16:32 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 14) in 36 ms on localhost (5/7)
    17/08/24 11:16:32 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 16) in 35 ms on localhost (6/7)
    17/08/24 11:16:32 INFO TaskSetManager: Finished task 6.0 in stage 3.0 (TID 20) in 36 ms on localhost (7/7)
    17/08/24 11:16:32 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
    17/08/24 11:16:32 INFO DAGScheduler: ShuffleMapStage 3 (mapToPair at VariantsSparkSink.java:129) finished in 0.039 s
    17/08/24 11:16:32 INFO DAGScheduler: looking for newly runnable stages
    17/08/24 11:16:32 INFO DAGScheduler: running: Set()
    17/08/24 11:16:32 INFO DAGScheduler: waiting: Set(ResultStage 4)
    17/08/24 11:16:32 INFO DAGScheduler: failed: Set()
    17/08/24 11:16:32 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[16] at mapToPair at VariantsSparkSink.java:170), which has no missing parents
    17/08/24 11:16:32 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 114.6 KB, free 14.4 GB)
    17/08/24 11:16:32 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 50.4 KB, free 14.4 GB)
    17/08/24 11:16:32 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.0.21:42085 (size: 50.4 KB, free: 15.2 GB)
    17/08/24 11:16:32 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1012
    17/08/24 11:16:32 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[16] at mapToPair at VariantsSparkSink.java:170)
    17/08/24 11:16:32 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
    17/08/24 11:16:32 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 21, localhost, partition 0, PROCESS_LOCAL, 5185 bytes)
    17/08/24 11:16:32 INFO Executor: Running task 0.0 in stage 4.0 (TID 21)
    17/08/24 11:16:32 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 7 blocks
    17/08/24 11:16:32 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
    17/08/24 11:16:32 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    17/08/24 11:16:32 INFO FileOutputCommitter: Saved output of task 'attempt_201708241116_0016_r_000000_0' to file:/home/jjiang/work/PanelEval/wellwise/deep_analysis/0609/gatk4_hc_recall/GW7O0002D04.hc.vcf.parts/_temporary/0/task_201708241116_0016_r_000000
    17/08/24 11:16:32 INFO Executor: Finished task 0.0 in stage 4.0 (TID 21). 1511 bytes result sent to driver
    17/08/24 11:16:32 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 21) in 110 ms on localhost (1/1)
    17/08/24 11:16:32 INFO DAGScheduler: ResultStage 4 (saveAsNewAPIHadoopFile at VariantsSparkSink.java:159) finished in 0.110 s
    17/08/24 11:16:32 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
    17/08/24 11:16:32 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at VariantsSparkSink.java:159, took 0.190293 s
    17/08/24 11:16:32 INFO SparkUI: Stopped Spark web UI at http://192.168.0.21:4040
    17/08/24 11:16:32 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/08/24 11:16:32 INFO MemoryStore: MemoryStore cleared
    17/08/24 11:16:32 INFO BlockManager: BlockManager stopped
    17/08/24 11:16:32 INFO BlockManagerMaster: BlockManagerMaster stopped
    17/08/24 11:16:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/08/24 11:16:32 INFO SparkContext: Successfully stopped SparkContext
    11:16:32.533 INFO  HaplotypeCallerSpark - Shutting down engine
    [August 24, 2017 11:16:32 AM CST] org.broadinstitute.hellbender.tools.HaplotypeCallerSpark done. Elapsed time: 3.95 minutes.
    Runtime.totalMemory()=14827913216
    17/08/24 11:16:32 INFO ShutdownHookManager: Shutdown hook called
    17/08/24 11:16:32 INFO ShutdownHookManager: Deleting directory /tmp/jjiang/spark-7ec09abf-d8f5-4f6e-a74f-9372a1420403
    

    Any suggestions?

  • Sheila (Broad Institute), Member, Broadie, Moderator, admin

    @asaki
    Hi,

    GATK4 does not support GGA mode. What happens if you remove --alleles ~/lib/wellwise_snps.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES from your command?

    Thanks,
    Sheila

  • asaki (China), Member

    @Sheila said:
    GATK4 does not support GGA mode. What happens if you remove --alleles ~/lib/wellwise_snps.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES from your command?

    Hi @Sheila ,

    Thanks,

    It is a pity that GATK4 does not support GGA mode, since we are actually using NGS to genotype a known set of variants.
    I will try your suggestion and see what happens.

  • Sheila (Broad Institute), Member, Broadie, Moderator, admin

    @asaki
    Hi,

    Yes, I think GGA mode will be supported in GATK4, just not quite yet.

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie, admin

    Support for GGA mode has indeed been requested by some of our internal stakeholders as well, so it is likely that it will be implemented eventually. However, there are no immediate plans to prioritize this in the coming quarters, so I would not expect this feature to become available for at least another six months. In the meantime, the closest possible approximation is to joint genotype in allSites mode (exact spelling may be different, my memory is hazy) with the VCF of sites of interest provided as an intervals list.
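As a concrete sketch of that workaround: flag spellings below follow GATK 3.x, all paths and file names are placeholders of mine, and the commands are only echoed as a dry run rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of the allSites workaround: call in GVCF mode restricted to
# the sites of interest, then joint-genotype emitting non-variant sites too.
REF=genome.fa
BAM=sample.bam
SITES=sites_of_interest.vcf

workaround_cmds() {
    # Step 1: per-sample calling in GVCF mode over the target sites only.
    echo "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF -I $BAM" \
         "-L $SITES -ip 50 --emitRefConfidence GVCF -o sample.g.vcf"
    # Step 2: joint genotyping, also emitting non-variant (hom-ref) sites.
    echo "java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $REF -V sample.g.vcf" \
         "-L $SITES -allSites -o sample.vcf"
}
workaround_cmds
```

The second step is what approximates GGA output: every target site appears in the final VCF, whether or not a variant was called there.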

  • asaki (China), Member

    Hi @Geraldine_VdAuwera

    Thanks for your suggestion. Genotyping with HC in allSites mode takes longer, and it sometimes misses real indels. So it may be the closest workaround, but not the best.

    @Sheila ,

    I tried removing the --alleles option while still restricting the calling to my sites with -L. It did work, but not all of the sites are genotyped when they are homozygous. Although I can run a script to "impute" these SNPs, by assuming the "missed" variants in my list are homozygous, this may introduce mistakes.

    Thanks,
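For what it's worth, a minimal Python sketch of the "impute missing targets as hom-ref" post-processing described above. VCF handling is deliberately simplified to tab-split fields, and, as noted in the thread, the hom-ref assumption itself can be wrong (a missing site may simply have failed to genotype):

```python
def parse_sites(vcf_lines):
    """Return {(chrom, pos): ref} for the data lines of a VCF."""
    sites = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        chrom, pos, _id, ref = line.split("\t")[:4]
        sites[(chrom, int(pos))] = ref
    return sites

def fill_hom_ref(target_lines, called_lines):
    """For each target site, report 'called' if the caller emitted it,
    otherwise assume homozygous reference (the risky step)."""
    called = parse_sites(called_lines)
    out = []
    for (chrom, pos), _ref in sorted(parse_sites(target_lines).items()):
        if (chrom, pos) in called:
            out.append((chrom, pos, "called"))
        else:
            out.append((chrom, pos, "0/0 (assumed hom-ref)"))
    return out
```

For example, with targets at chr1:100 and chr1:200 but a call only at chr1:100, the second site comes back flagged as assumed hom-ref rather than silently dropped, so the assumption stays visible downstream.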

  • Sheila (Broad Institute), Member, Broadie, Moderator, admin

    @asaki
    Hi,

    I am not sure I understand what you mean. Why will sites that are homozygous not be genotyped? Do you mean homozygous reference or homozygous variant?

    Thanks,
    Sheila

  • asaki (China), Member

    @Sheila ,

    It is the homozygous reference.

  • asaki (China), Member

    Hi @Sheila ,

    Using -allSites sometimes leads to genotyping problems with indels.
