Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

The GATK reference model pipeline for incremental joint discovery, in full detail

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin
edited March 7 in Announcements

Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot) so we're really stoked to finally be able to share the details of how it's is going to work in practice.

You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:

image


The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.

The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.

So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.

See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.

Let us know what you think!

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • mgymrekmgymrek Posts: 2Member

    The link to the "gory details" just goes to this page. Where can I find these gory details?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Link fixed.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine,

    I have started this program on my bam file containing all chromosomes (mouse). It managed to call genotypes for 3 chromosomes in 5 days. I cannot run programs for longer than that due to limits in wall time. Do you recommend to run each chromosome separately, maybe? Or is there any other way to speed up the calculations? Thanks and best,

    Bettina

  • mike_boursnellmike_boursnell Posts: 72Member

    Hi Geraldine,

    This new method seems terribly slow. I am running a batch of BAM files which are about 2.5GB each, over only a 14MB region of the genome, and each one is taking about 12 hours. I am using nct=1 because I can't risk using higher values as that was unreliable in the past.

    Am I doing something wrong?

    java -Xmx4g -Djava.io.tmpdir=javatempdir -jar GenomeAnalysisTK.jar -R canfam3.fasta -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L chr13:38900000-53300000 -I AS_case_07_final.bam -o gVCF1_AS_case_07_final_variants.vcf -S STRICT -nct 1

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    @Bettina_Harr‌ and @mike_boursnell‌ That's odd as in our hands this goes much faster than the previous versions. Are you working with very deep coverage? You can certainly parallelize by running on each chromosome individually, to help cut down on runtimes.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine, no my coverage is not extremely high, average 20-fold per site.

  • mike_boursnellmike_boursnell Posts: 72Member

    Quite high coverage 300-400. What is the 128000 parameter?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    The 128000 parameter has to do with the index compression scheme -- I think it specifies block size.

    Let me call in some of the devs who have been working on the HC's performance. Are you running 3.0 or 3.1?

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine, ok, I have restarted the program processing one chromosome at a time. Looking at my intermediate output, I noticed that the software issued a warning:

    WARN 02:33:07,773 DiploidExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:65884356 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

    I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?

    Thanks!

  • Bettina_HarrBettina_Harr Posts: 22Member

    oh, I am running 2.8.1. Did not know there was already a new version. Could this be part of the reason? I have done all my pre-processing now with 2.8.1. Can I switch to 3.1 with the HaplotypeCaller? Or should I stay with 2.8.1? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?

    Yes, this is typically due to indels in a context with homopolymers and/or repeats. You can limit the number of max alt alleles if you want, which will make the program run faster.

    I am running 2.8.1.

    You should definitely switch to 3.1; this new method is specifically designed for work in 3.1, and involves new code that is not in 2.8.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    ok, but I do not need to repeat the pre-processing, i.e. indel and base recalibration, when switching to 3.1 now, correct?

  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin
    edited March 24

    Correct @Bettina_Harr‌ you don't need to reprocess (BQSR & Indel Realignment) to switch. Just run the haplotype caller. I actually didn't even know that we made the reference model available on 2.8, it was very experimental at that point. Try 3.1 and let us know how it is.

    Post edited by Carneiro on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    @mike_boursnell To clarify, are you finding that running HC in GVCF mode is much slower than previously?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    @mike_boursnell‌ Do you have any comparison data, e.g. runtimes for running two different commands on the same data? We'd like to get a sense of what the difference is. The HC does do quite a bit more work in GVCF mode, so it's expected that the upfront runtime will be a little more, but not orders of magnitude.

    Geraldine Van der Auwera, PhD

  • mike_boursnellmike_boursnell Posts: 72Member

    OK, I'll have a look. I'm not sure how to compensate for other things running on our server at the same time. Any suggestions?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Not really -- if you were able to reserve a set of nodes that would be ideal of course. Not knowing your setup/access I can't really advise. We have dedicated machines for testing this kind of thing so it's a lot easier.

    Geraldine Van der Auwera, PhD

  • sergiomvsergiomv Posts: 1Member

    Hi Geraldine, I run GATK 3.1 Unified Genotyper (UG) and Haplotype Caller (HC) for whole genome NA12878. With UG I have my vcf file with all variants called (near 4 milions), but when I run HC I obtained very few variants (near 1,3 milions) and in the log of HG I see this warning (a lot of): WARN 05:26:17,926 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:7458214 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. My BWA alignment is OK I see only two alternate alleles (diploid). Do you know why of this warning ? Thanks.

    chr1-7458214.PNG
    1158 x 872 - 84K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Hi @sergiomv‌,

    First I would say that the different numbers of variants is most probably unrelated to the warnings you're seeing. I can't really comment on those numbers; you should try to examine what is the overlap and what are the major differences between the two callsets. Have a look at the distribution of annotation values. See if you are missing major blocks of variants, or if there is a correlation with annotation values, e.g. if the variants HC is missing all have low MQ in the UG callset. HaplotypeCaller is more stringent about thing slike MQ than UG.

    Regarding the alleles question, I would bet that HC is seeing multiple possible deletion alleles in your data, because of the low complexity of the context. Keep in mind that the HC does a de novo reassembly step with the data so it may come to a different alignment than BWA. And since considering multiple alleles decreases performance, the HC is set to only proceed with the most likely alleles and discard the rest. You don't need to worry about this.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member
    edited April 22

    Hi Geraldine, a couple of questions: 1) I ran the Haplotype Caller and in the end I am getting a short summary of my reads.

    INFO  05:16:33,184 ProgressMeter - Total runtime 30549.58 secs, 509.16 min, 8.49 hours 
    INFO  05:16:33,185 MicroScheduler - 5481436 reads were filtered out during the traversal out of approximately 21788872 total reads (25.16%) 
    INFO  05:16:33,186 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter 
    INFO  05:16:33,186 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
    INFO  05:16:33,187 MicroScheduler -   -> 5434226 reads (24.94% of total) failing HCMappingQualityFilter 
    INFO  05:16:33,187 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
    INFO  05:16:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
    INFO  05:16:33,188 MicroScheduler -   -> 47210 reads (0.22% of total) failing NotPrimaryAlignmentFilter 
    INFO  05:16:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
    

    I am wondering about those 25% reads that were "filtered" during traversal. Could you tell me if I have to be concerned about this, and what that actually means?

    Post edited by Geraldine_VdAuwera on
  • Bettina_HarrBettina_Harr Posts: 22Member
    edited April 22

    2) I am still struggling with hitting the wall time when I am trying to run Haplotype caller on the "large" mouse chromosomes, i.e. chr1. It takes more than 4 days. We have a arge computing cluster, I tried to start Java with 16 processors and 20g RAM, but the program would not run any faster. I also set the max alternate allele to 2 hoping it would save some time, but it does not. I attach my jobscript below. Is there anything else I could try to speed things up? I am now using version 3.1.1. Thanks in advance!

    #!/bin/sh
    #BSUB -q fat-long
    #BSUB -R scratch
    #BSUB -W 120:00
    #BSUB -o TP7-10F1A2.recal.bamX_out
    #BSUB -e TP7-10F1A2.recal.bamX_error
    #BSUB -R 'span[hosts=1]'
    #BSUB -n 16
    java -Xmx20g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF
    
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Hi @Bettina_Harr‌,

    1) HaplotypeCaller applies more stringent quality filters than our other tools. Any reads that have a mapping quality less than 20 (if I remember correctly) will be ignored. In your case it looks like many reads have a mapping quality that's considered too low to be useful, unfortunately. This depends mainly on the aligner you used and the quality of your data.

    2) You can try using a more aggressive pruning setting (see here) and/or adding a layer of multithreading if your cluster allows it (see here). In addition, if you this is exome data, you can provide a list of capture targets using the -L argument.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    thank you Geraldine. One more short question: does it make sense to set anything for the -Xmx parameter?

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine, fwi, the -nt option does not seem to be supported in the Haplotype caller. I am testing the -nct option now.

    INFO 01:45:01,617 HelpFormatter - -------------------------------------------------------------------------------- INFO 01:45:01,620 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21 INFO 01:45:01,621 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 01:45:01,621 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 01:45:01,626 HelpFormatter - Program Args: -T HaplotypeCaller -nt 2 -nct 4 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF INFO 01:45:01,630 HelpFormatter - Executing as bharr@gwda036 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18. INFO 01:45:01,631 HelpFormatter - Date/Time: 2014/04/23 01:45:01 INFO 01:45:01,631 HelpFormatter - -------------------------------------------------------------------------------- INFO 01:45:01,632 HelpFormatter - -------------------------------------------------------------------------------- INFO 01:45:02,319 GenomeAnalysisEngine - Strictness is SILENT INFO 01:45:02,461 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 01:45:02,471 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 01:45:02,522 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05 INFO 01:45:02,595 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 01:45:02,604 IntervalUtils - Processing 171031299 bp from intervals INFO 01:45:02,617 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 4 CPU thread(s) for each of 2 data thread(s), of 64 processors available on this machine

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis HaplotypeCaller currently does not support parallel execution with nt. Please run your analysis without the nt option.
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    @Bettina_Harr‌ If you're using -nct (which is what I meant to recommend, sorry if that wasn't clear) you may want to increases memory allocation, yes. see the FAQs on multithreading for some general guidelines.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    unfortunately, I am not very experienced with parallelism. I ran the code overnight WITHOUT a special -Xmx parameter (I did not know what reasonable value I should use in the multi-processing mode). It runs fine for about 5h and then exists with "exit code 1). The error file contains this at the end. Do you know what could be the problem? Simply ran out of memory? I used -ntc 6 and reserved 10 processors in my jobscript, to be on the safe side.

    INFO 01:42:29,369 ProgressMeter - 1:85086078 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h INFO 01:43:29,371 ProgressMeter - 1:85087145 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.NullPointerException at org.broadinstitute.sting.gatk.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:443) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:417) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.calculateGLsForThisEvent(GenotypingEngine.java:385) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:222) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Code exception (see stack trace for error itself)
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    This looks like it might be a bug, actually. If you can isolate the region in which this issue occurred and produce a test case, we'll have a look and try to fix it. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine, ok, I perfomed the following test: the program generated the error after position 85299925 on chromosome 1 (see error below). Therefore I re-ran it with the L option (-L 1:85298925-85309925 ):

    java -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -L 1:85298925-85309925 -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 16B.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o 16B.recal.bam1.gVCF1

    This generated no error and a normal looking gVCF file.

    I think it is not a bug in itself, as the error only arises with chromosome 1 files, and not with any other chromosome of the same individual mouse.

    I have now rerun the chr1 files with the -Xmx option set to 20g, and reserving 20 processors for it (each processor has 4Gb memory allocated to it. The run is still waiting in the queue, so I will only be able to say what happened tomorrow morning.

    When I start the program interactively, and then use "top" to monitor the memory I can see it uses 18.6 Gb. Maybe -Xmx20g is still not enough?

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    23772 bharr 20 0 18.6g 842m 11m S 781.1 1.3 0:16.51 java

    error file:

    INFO 22:32:03,103 ProgressMeter - 1:85296437 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h INFO 22:32:33,105 ProgressMeter - 1:85299925 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h WARN 22:32:33,427 ExactAFCalc - this tool is currently set to genotype at most 2 alternate alleles in a given context, but the context at 1:85287439 has 4 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(Unknown Source) at java.util.LinkedHashMap$EntryIterator.next(Unknown Source) at java.util.LinkedHashMap$EntryIterator.next(Unknown Source) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.addMiscellaneousAllele(GenotypingEngine.java:257) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:227) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Ack, this is worse than I thought. "ConcurrentModificationException" means this is most likely a problem with multithreading; it seems several threads may be trying to modify the same information at the same time, which is bad and causes the crash you see. This might be triggered by something in your data. At this point I would recommend not using multithreading at all with HaplotypeCaller. Instead of multithreading you may want to run HaplotypeCaller individually per chromosome, then concatenate the resulting variants (e.g. using the CatVariants utility). This will enable you to still use your cluster for parallelism without running into this thread safety issue.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 148Member ✭✭✭

    I've been wondering about that. I've been using both -nct and -pairHMM VECTOR blah when running in gvcf mode and I have been experiencing random crashes that aren't reproducible (I run it again and it goes through fine) on human exomes. About 1 or 2 crashes per 100 runs so far. My stack trace is not the same as above...I haven't had time to look at all of them, but below is my latest crash that I've already rerun through...I'm actually going to do a sanity check through my last 150 to see if there were any that crashed that I missed.

    `##### ERROR stack trace java.lang.NullPointerException at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:51) at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:41) at java.util.TimSort.binarySort(TimSort.java:265) at java.util.TimSort.sort(TimSort.java:208) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.broadinstitute.sting.utils.sam.ReadUtils.sortReadsByCoordinate(ReadUtils.java:320) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.finalizeActiveRegion(HaplotypeCaller.java:1111) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.referenceModelForNoVariation(HaplotypeCaller.java:1001) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:808) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Code exception (see stack trace for error itself)

    `

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine, I am already running everything on a per chromosome basis and this is were I hit my runtime limits with chromosome 1. It takes more than four days, even with allowing for 20g when I start Java. It seems that maybe I have to subdivide chromsome 1 in chuncks to be able to do it :-(!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Wait - you're saying it's taking four days per chromosome per sample? What kind of coverage do you have?

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member
    edited April 24

    these are my CPU times (in seconds): TP7-10F1A2.recal.bamMT_out: CPU time : 73

    TP7-10F1A2.recal.bamX_out: CPU time : 26442

    TP7-10F1A2.recal.bamY_out: CPU time : 42362

    TP7-10F1A2.recal.bam19_out: CPU time : 45158

    TP7-10F1A2.recal.bam15_out: CPU time : 70149

    TP7-10F1A2.recal.bam18_out: CPU time : 91582

    TP7-10F1A2.recal.bam16_out: CPU time : 113535

    TP7-10F1A2.recal.bam14_out: CPU time : 116572

    TP7-10F1A2.recal.bam12_out: CPU time : 123313

    TP7-10F1A2.recal.bam17_out: CPU time : 132176

    TP7-10F1A2.recal.bam3_out: CPU time : 133359

    TP7-10F1A2.recal.bam7_out: CPU time : 143940

    TP7-10F1A2.recal.bam8_out: CPU time : 145699

    TP7-10F1A2.recal.bam6_out: CPU time : 148685

    TP7-10F1A2.recal.bam11_out: CPU time : 150079

    TP7-10F1A2.recal.bam9_out: CPU time : 153481

    TP7-10F1A2.recal.bam10_out: CPU time : 160582

    TP7-10F1A2.recal.bam13_out: CPU time : 163721

    TP7-10F1A2.recal.bam5_out: CPU time : 196669

    TP7-10F1A2.recal.bam2_out: CPU time : 253768

    TP7-10F1A2.recal.bam4_out: CPU time : 258922

    TP7-10F1A2.recal.bam1_out: CPU time : 355144

    Post edited by Bettina_Harr on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    That seems awfully slow. Unfortunately I can't really help you deal with performance and platform issues as that is beyond the scope of support I can provide, sorry.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine,

    ok, I got it to work. With this configuration, the whole chromosome 1 takes a little more than 1 day. So it is not a bug after all using the -nct option. One just seems to need tons of memory.

    !/bin/sh

    BSUB -q fat-long

    BSUB -R scratch

    BSUB -W 120:00

    BSUB -o 14.recal.bam1_out

    BSUB -e 14.recal.bam1_error

    BSUB -R 'span[hosts=1]'

    BSUB -n 20

    java -jar -Xmx40g /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 14.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L 1 --variant_index_type LINEAR --variant_index_parameter 128000 -o 14.recal.bam1.gVCF

  • mike_boursnellmike_boursnell Posts: 72Member

    Hi. Can we use this gVCF method to analyse a bunch of BAM files all from the same sample (just so they can be analysed in parallel?) If so how would you have to set up the Read Group information in these BAM files?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    @mike_boursnell I'm not sure what you mean, but if you're saying you want to compare the BAMs as if they belonged to different samples (e.g. to compare results on technical replicates), then all you need to do is give them different SM tags so that GATK treats them separate when you do the GenotypeGVCFs step.

    Geraldine Van der Auwera, PhD

  • mike_boursnellmike_boursnell Posts: 72Member

    This is for whole genome sequencing of a single sample. We are looking at splitting the single FASTQ file into (say) 10 smaller bits and then processing them in parallel. When it gets to the HaplotypeCaller gVCFs stage there will be 10 separate BAM files, but which are all actually the same sample. If we give them the same SM tag will GATK assume they are the same sample and produce a single-column VCF file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    @mike_boursnell‌ Oh OK, got it. Yes, you should be able to just pass in the separate bam files. As long as they have the same SM tag, the data will get re-aggregated by the engine and served appropriately to the HC, leading to a single-sample VCF file.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    Hi Geraldine, I have now completed the variant calling using Haplotype Caller and recalibrated my variants. It all worked but I am now in a position where my outfiles contain ONLY variable sites, and not the invariable ones. The vcf files that "GenotypeGVCFs" produces does no longer contain the invariable sites. I would like to have the info on all sites in the final vcf file, even for those sites that are not variable as Id like to see their coverage. is this possible at all at this point?

    Thanks in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Hi @Bettina_Harr‌,

    Yes, you can get that using the -inv argument with GenotypeGVCFs. However, in the current version this will produce no-call site records to be emitted, so the hom-ref sites won't have much information attached to them. I believe the next version will produce detailed records at all sites.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 22Member

    thank you Geraldine! I have another question. The R script does not automatically run on my machine, so I was trying to run the script manually by typing:

    library(ggplot2)

    source("7.recalibrate_SNP_plots.R") Error: 'opts' is deprecated. Use 'theme' instead. (Defunct; last used in version 0.9.1)

    I am getting the error above.

    I am using ggplot2 1.0.0.

    R version 3.1.0 (2014-04-10) -- "Spring Dance" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin13.1.0 (64-bit)

    Simply replacing "opts" with "theme" does not do the trick, I will get another error then:

    Error: theme_rect is deprecated. Use 'element_rect' instead. (Defunct; last used in version 0.9.1)

    Thanks!

    Bettina

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,089Administrator, GATK Developer admin

    Hi @Bettina_Harr‌,

    There are a few other functions related to the opt -> theme change that also need to be adapted. The error message tells you which. Just step through the script and change every one that throws an error the same way.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.