Problem when running HaplotypeCallerSpark locally

I'm running HaplotypeCallerSpark locally, but I don't see progress lines on the console the way I did with GATK 3.
It only shows Spark-related information.
How can I get a real-time progress report so I know whether it is running correctly?

Thanks!

Answers

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)

    Hi @sean6016,

    We'll get to your question at some point. In the meantime, if you run the command on a small piece of data, does it complete?

  • sean6016 (Taiwan; Member)

    @shlee said:
    Hi @sean6016,

    We'll get to your question at some point. In the meantime, if you run the command on a small piece of data, does it complete?

    Hi @shlee,

    I'm running HaplotypeCallerSpark on NA12878; it ran for a long time and didn't complete.
    Is there anywhere I can get a small piece of data to test whether it completes?

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)
    edited February 2017

    @sean6016, are you using a genomic intervals list to limit your analyses to particular regions?

    We offer various types of small data but ultimately what will help you the most is to use your own data by either taking a genomic interval of reads (e.g. PrintReads -L 20) or downsampling the reads to lower coverage depth (see this article).

    Otherwise, here are the locations of some unaligned NA12878 BAMs: https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.inputs.json

    They are in Google Cloud Storage buckets, which you should be able to download from using gsutil, e.g.

    gsutil cp gs://genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.bam .
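
For the first option above (taking an interval from your own data), a minimal sketch follows. It assumes a GATK4 launcher on the PATH named `gatk` (older builds may use a different launcher name), a coordinate-sorted and indexed BAM, and a contig named 20. The file names `input.bam` and `input.chr20.bam` and the helper `subset_cmd` are hypothetical placeholders. The helper only prints the command so it can be checked before it is run.

```shell
#!/usr/bin/env bash
# Sketch: build a PrintReads command that extracts one interval from a BAM.
# Assumptions (not from the thread): launcher is called `gatk`; the contig is
# named "20" (it may be "chr20" depending on the reference build).

subset_cmd() {
  local in_bam="$1" interval="$2" out_bam="$3"
  # Print the command instead of executing it, so it can be reviewed first.
  printf 'gatk PrintReads -I %s -L %s -O %s\n' "$in_bam" "$interval" "$out_bam"
}

# prints: gatk PrintReads -I input.bam -L 20 -O input.chr20.bam
subset_cmd input.bam 20 input.chr20.bam
```

Once the printed command looks right, it can be pasted into the terminal and run on the real file.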
    
  • sean6016 (Taiwan; Member)

    Thanks for the advice!

    I switched to running HaplotypeCaller with an additional -L 20 argument
    (because HaplotypeCallerSpark had been running for more than 5 days and still had not finished).
    But I'm curious about the differences between HaplotypeCaller in GATK 4 and GATK 3.
    Does it use Spark as well?
    It seems to use a different programming model from the one in GATK 3
    (for example, I cannot find the map and reduce functions in HaplotypeCaller.java).

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    The engine framework was completely rewritten for GATK4, so you'll find many differences between 3 and 4, including that 3 does not support Spark. GATK3 instead has a homegrown multithreading mechanism that is not as reliable.
  • sean6016 (Taiwan; Member)

    I see.
    Can you elaborate on this 'homegrown multithreading mechanism'?
    How can I use it?
    Also, I'm still trying to run HaplotypeCallerSpark in local mode,
    but memory-related errors such as 'java.lang.OutOfMemoryError: Java heap space'
    and 'java.lang.OutOfMemoryError: GC overhead limit exceeded' keep showing up.
    How much memory is enough to run HaplotypeCallerSpark in local mode?

    Thanks!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    In GATK3 the multithreading args are -nt and -nct. See the multithreading/parallelism documentation for details.
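
As an illustration of the -nct argument mentioned above, here is a sketch of a GATK3 command line. The jar location, file names, thread count, and the helper `gatk3_hc_cmd` are hypothetical placeholders. Which of -nt and -nct a given tool accepts should be checked in that tool's documentation; this sketch assumes HaplotypeCaller accepts -nct.

```shell
#!/usr/bin/env bash
# Sketch: build a GATK3 HaplotypeCaller command that uses -nct.
# Assumptions (not from the thread): GenomeAnalysisTK.jar is in the current
# directory; reference.fasta, input.bam and output.vcf are placeholder names.

gatk3_hc_cmd() {
  local threads="$1" interval="$2"
  # Print the command rather than running it, so it can be reviewed first.
  printf 'java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I input.bam -L %s -nct %s -o output.vcf\n' \
    "$interval" "$threads"
}

# Example: 4 CPU threads, restricted to contig 20.
gatk3_hc_cmd 4 20
```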

    We have not yet produced publishable benchmarks for memory usage of the GATK4 tools so we can't advise you on this point, sorry.
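
In the absence of benchmarks, the heap size has to be found by trial. The 'Java heap space' error is normally addressed by raising the JVM maximum heap with -Xmx, and lowering the number of local Spark cores may also reduce memory pressure. The sketch below assumes a GATK 4.x launcher named `gatk` that accepts `--java-options`, `--spark-runner`, and `--spark-master`; these flag spellings may differ in older builds. The values 16g and 4, the file names, and the helper `hc_spark_cmd` are arbitrary placeholders, not recommendations.

```shell
#!/usr/bin/env bash
# Sketch: build a local-mode HaplotypeCallerSpark command with an explicit
# JVM heap size and Spark core count.
# Assumptions (not from the thread): launcher name and flag spellings as in
# released GATK 4.x; reference.fasta, input.bam, output.vcf are placeholders.

hc_spark_cmd() {
  local heap="$1" cores="$2" interval="$3"
  # Print the command rather than running it, so the settings can be reviewed.
  printf 'gatk --java-options "-Xmx%s" HaplotypeCallerSpark -R reference.fasta -I input.bam -L %s -O output.vcf -- --spark-runner LOCAL --spark-master "local[%s]"\n' \
    "$heap" "$interval" "$cores"
}

# Example: 16 GB heap, 4 local cores, restricted to contig 20.
hc_spark_cmd 16g 4 20
```

If the out-of-memory errors persist, the usual next steps are to raise the heap further, lower the core count, or narrow the interval.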