The current GATK version is 3.6-0

GATK 2.6.2 Exceptions with HaplotypeCaller and -nct

Member Posts: 68 ✭✭

When trying to run the HaplotypeCaller in 2.6.2 with -nct, I'm getting a number of crashes. Is -nct currently supported for the HaplotypeCaller, or is it still experimental? With multithreading I'm not sure exactly where the error occurs, and it's a pretty big BAM. If needed, I can try to narrow it down further and create a subset BAM.


../jre1.7.0_25/bin/java -jar ../GenomeAnalysisTK-2.6-2-ge03a5e9/GenomeAnalysisTK.jar -R ../refs/bosTau6.lic.fa -T HaplotypeCaller -I ../Chr15.ir.bam -bamout Chr15.bam -o Chr15.vcf.gz -L Chr15 -nct 5 -rf BadCigar

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
at net.sf.samtools.SAMRecordCoordinateComparator.fileOrderCompare(SAMRecordCoordinateComparator.java:82)
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:43)
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:41)
at java.util.TimSort.countRunAndMakeAscending(Unknown Source)
at java.util.TimSort.sort(Unknown Source)
at java.util.Arrays.sort(Unknown Source)
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.6-2-ge03a5e9):
##### ERROR
##### ERROR Please check the documentation guide to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------



../jre1.7.0_25/bin/java -jar ../GenomeAnalysisTK-2.6-2-ge03a5e9/GenomeAnalysisTK.jar -R ../refs/bosTau6.lic.fa -T HaplotypeCaller -I ../Chr15.ir.bam -bamout Chr15.bam -o Chr15.vcf.gz -L Chr15 -nct 16 -rf BadCigar

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.util.TimSort.mergeLo(Unknown Source)
at java.util.TimSort.mergeAt(Unknown Source)
at java.util.TimSort.mergeCollapse(Unknown Source)
at java.util.TimSort.sort(Unknown Source)
at java.util.Arrays.sort(Unknown Source)
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.6-2-ge03a5e9):
##### ERROR
##### ERROR Please check the documentation guide to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Comparison method violates its general contract!
##### ERROR ------------------------------------------------------------------------------------------


• Member Posts: 68 ✭✭

Note that these may be due to a bug in the handling of -bamout with -nct. If I remove -bamout, the job appears to continue running without exceptions.

Hmm, I'm not sure -- let me pass this on to the team.

Geraldine Van der Auwera, PhD

• Member Posts: 68 ✭✭

Any chance of the two options being made compatible in the future? The bamout option was very useful for seeing exactly what was happening with the indels, while -nct makes it simple to get the HC to run at a decent rate without having to deal with 10000+ subfiles that all need to be merged.

Both together would be ideal.
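
For readers unfamiliar with the "subfiles" alternative mentioned above: it amounts to splitting the contig into intervals and running one single-threaded HaplotypeCaller job per interval, each writing its own VCF and bamout. The sketch below only builds the command lines and does not run or merge anything. The contig length, chunk size, jar path, and output file names are hypothetical placeholders, and the other arguments are copied from the command in the first post.

```python
def scatter_intervals(contig, contig_length, chunk_size):
    """Yield 1-based, inclusive interval strings (contig:start-end) covering the contig."""
    for start in range(1, contig_length + 1, chunk_size):
        end = min(start + chunk_size - 1, contig_length)
        yield f"{contig}:{start}-{end}"


def build_commands(contig, contig_length, chunk_size):
    """Build one single-threaded HaplotypeCaller command per interval.

    Paths and output names are illustrative assumptions, not a recommended layout.
    """
    commands = []
    intervals = scatter_intervals(contig, contig_length, chunk_size)
    for part, interval in enumerate(intervals):
        commands.append([
            "java", "-jar", "GenomeAnalysisTK.jar",
            "-T", "HaplotypeCaller",
            "-R", "../refs/bosTau6.lic.fa",
            "-I", "../Chr15.ir.bam",
            "-L", interval,                              # restrict this job to one chunk
            "-bamout", f"{contig}.part{part:05d}.bam",   # hypothetical per-chunk output
            "-o", f"{contig}.part{part:05d}.vcf.gz",     # hypothetical per-chunk output
            "-rf", "BadCigar",
        ])
    return commands


if __name__ == "__main__":
    # Toy numbers so the output is short; a real contig is far longer.
    for command in build_commands("Chr15", 2500, 1000):
        print(" ".join(command))
```

Every chunk produces its own pair of files, which is where the merge burden described above comes from. Variants near chunk boundaries may also need extra care.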

I'm curious how you are using bamout. It's not efficiently implemented -- it's really more of a debugging tool -- so even without multiple threads I suspect that bamout must be slowing down the caller. Is that not your experience? Or does it just not matter, given that you can more easily understand what the HC is doing? Would some other type of output work better?

It's entirely possible to make the bamout option work with multiple threads. It's just a bit complex, since the reads could be coming out of order. I'll throw it in JIRA.

--
Mark A. DePristo, Ph.D.
Co-Director, Medical and Population Genetics
Broad Institute of MIT and Harvard
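
To illustrate the out-of-order problem described in the reply above, here is a minimal sketch of one common way to restore order when parallel workers finish at different times: buffer each region's reads until every earlier region has been written. This is not how GATK implements or later fixed bamout. The `OrderedEmitter` class, the region index, and the read names are all hypothetical.

```python
import heapq
import threading


class OrderedEmitter:
    """Accept per-region results in any order and pass them to a sink in index order."""

    def __init__(self, sink):
        self._sink = sink            # callable that receives each region's reads
        self._next_index = 0         # index of the next region allowed to be written
        self._pending = []           # min-heap of (region_index, reads)
        self._lock = threading.Lock()

    def submit(self, index, reads):
        """Called by worker threads; region indices are assumed unique and contiguous."""
        with self._lock:
            heapq.heappush(self._pending, (index, reads))
            # Flush every buffered region that is now next in line.
            while self._pending and self._pending[0][0] == self._next_index:
                _, ready = heapq.heappop(self._pending)
                self._sink(ready)
                self._next_index += 1


def demo():
    """Simulate four regions finishing in the order 2, 0, 3, 1."""
    written = []
    emitter = OrderedEmitter(written.extend)
    for index in [2, 0, 3, 1]:
        emitter.submit(index, [f"region{index}_read{k}" for k in range(2)])
    return written


if __name__ == "__main__":
    print(demo())
```

The cost of this approach is memory: a slow early region forces all later results to sit in the buffer until it completes, which is part of why ordered output from multiple threads is more complex than it first appears.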

• Member Posts: 68 ✭✭

Hi Mark
I've been using bamout to help our more biologically oriented staff see exactly what the HC has done with the reads when making the call. The last validation step they do is simply to check the BAM for the population to make sure it makes sense based on the genotype supplied by GATK.

It allows them to look at bigger indels where the input BAM is ambiguous, to look at regions with multiple indels and check for compensatory mutations (i.e., two frameshifts canceling each other out), and to spot other error types that VQSR and other filtering have problems dealing with.

Also, if you're discussing the possible impact of an indel, it's nice to be able to put up a clear image from IGV showing the indel, where it fits in the genome, and what is nearby.

• Member Posts: 15

Hi Geraldine,
I am trying to run the HaplotypeCaller with version GenomeAnalysisTK-2.4-9.
I used the command below:
java -Xmx6g -jar /home//GenomeAnalysisTK.jar -T HaplotypeCaller -nct 5 -R equcab2.fa -I /home/cleaned.sorted.bam1 -I /home/cleaned.sorted.bam2 -I /home/cleaned.sorted.bam3 -stand_call_conf 20 -stand_emit_conf 10.0 -o output.raw.snps.indels.vcf

When I run the command, I receive the following error:

INFO 14:12:21,883 HelpFormatter - Date/Time: 2013/08/21 14:12:21
INFO 14:12:21,883 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:12:21,883 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:12:21,952 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:12:22,057 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 14:12:22,064 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:12:22,091 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 14:12:22,115 MicroScheduler - Running the GATK in parallel mode with 5 total threads, 5 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine
INFO 14:12:22,739 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------

I am sure the documentation says that the HaplotypeCaller supports parallel execution; see http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html

What do you think the problem may be?

Thank you

Well, the documentation on the website is for version 2.6. I would have to check, but I believe version 2.4 wasn't yet capable of running the HC multithreaded. I would recommend that you update to the latest version to take advantage of the latest performance improvements.

Geraldine Van der Auwera, PhD