The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# IndelRealigner Fails on References with Colons in Scaffold Names

Posts: 5Member

I can't seem to run the IndelRealigner on reads that contain colons, ":" in the reference scaffold names. The RealignerTargetCreator step works correctly and generates the interval table, but the second, IndelRealigner, step fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.

Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.

I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this a real simple fix.

Thanks!

Tagged:

• Posts: 5Member

• Posts: 8Member

I downloaded the fix from github and the performance of RealignerTargetCreator has tanked between the original version I grabbed a few months back (v2.0-39-ge14ea5c, Compiled 2012/08/10 15:13:41) and the one I built last week (v2.2-14-g3cb83c2, Compiled 2012/11/29 01:54:33).

For the same bam and reference and no -known the time estimate goes from 27 minutes to 6.8 hours. Both instances appear to be CPU bound and I don't appear to have any I/O or memory issues. The program is still running, so I can give you better times tomorrow when they're done, but the estimates look to be pretty good.

Interestingly the following line appears in the 2.2 output, but not the 2.0:
INFO 22:49:41,515 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
I would have expected the downsampling to speed things up. Perhaps the 2.0 was doing so implicitly?

I am running a human sample from 1000 Genomes against GRCh37. Here are the edited messages. Let me know if you need everything.

INFO  01:38:09,722 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-39-ge14ea5c, Compiled 2012/08/10 15:13:41
INFO  01:38:09,723 HelpFormatter - Program Args: -T RealignerTargetCreator -R /reference/homo_sapiens_GRCh37_dbSNP/reference.fasta -o old_gatk_all.intervals -I ERR009283.bam
INFO  01:38:09,740 GenomeAnalysisEngine - Strictness is SILENT
INFO  01:38:10,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 01:38:10,881 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21
INFO  01:38:11,891 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  01:38:11,892 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  01:38:41,746 TraversalEngine - 1:30028082:24362919        5.41e+07   30.0 s        0.6 s      1.8%        27.3 m    26.8 m

INFO  22:49:40,711 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-14-g3cb83c2, Compiled 2012/11/29 01:54:33
INFO  22:49:40,719 HelpFormatter - Program Args: -T RealignerTargetCreator -R /reference/homo_sapiens_GRCh37_dbSNP/reference.fasta -o all.intervals -I ERR009283.bam
INFO  22:49:40,728 GenomeAnalysisEngine - Strictness is SILENT
INFO  22:49:41,515 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
INFO  22:49:41,525 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 22:49:41,590 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  22:49:41,737 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  22:49:41,738 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  22:50:11,779 ProgressMeter - 1:2684220:1128209        3.49e+06   30.0 s        8.6 s      0.1%         6.8 h     6.8 h

• Posts: 679GATK Developer mod

Are you sure that the correct results were output during the original (fast) run? Because the bug report was showing that the GATK was not handling contig names correctly if they contained a colon. Is the output between the two runs identical?

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 8Member

I'm pretty sure the results are different, but this is the first time that I'm running both with identical bams. I figured you had improved things between 2.0 and 2.2, but it didn't occur to me that the bug was the problem.

The problem with the colons was only fatal in the second, IndelRealigner step, not the RealignerTargetCreator step.

I'll compare the results tomorrow when the 2.2 version finishes. What should I look for in the interval table to check the results other than that an indel is present?

• Posts: 679GATK Developer mod

Honestly, I have no idea... :) We don't generally use contigs with colons. Do you have any known indels in your data?

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 8Member

The generated interval tables are identical in the 2.0 and 2.2 versions. Only the runtime has changed.

• Posts: 679GATK Developer mod

Interesting. Would you mind uploading your bam (plus index) and reference (plus dict and fai files) to our ftp site? I'd be happy to profile this for you to see whether we can make it faster.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 8Member

You will have to fix or rebuild the dict file as I didn't preserve the reference path.

• Posts: 8Member

I ran git bisect to see if I could find the problem. The first commit where the run of RealignerTargetCreator fails is 7bbabdd756650a49f9a45e88e3df2a290f6f0b4c (the hashes refer to the public github repo). In this commit, the command seems to hang, although I admit I didn't wait more than five minutes. Typically I would get the first time estimate in 20-30 seconds. Not surprisingly, it looks like you redid the locus traversal code.

The next commit, 76e0027001420298d44512bc559c48dba4ad43bd, fixes things, but now the time estimate has jumped from 30 minutes to 84 minutes.

At commit 8f605e87c5d9b0a3cb02596982b0ccd9a64f43df, the time jumps from 84 minutes to 6 hours.

edited December 2012

Hi Cheinan,

I wanted to personally thank you for this excellent bug report. With that bisect in hand we figured out that some monitoring code in the new NanoScheduler was causing the GATK to issue many calls to System.currentTimeMillis, which invokes an expensive system call but doesn't show up in the profiler for some reason. By removing this polling we've brought the runtime back down to the 2.0 version. These fixes will all be available in GATK 2.3.

Best,

Mark

Post edited by Mark_DePristo on

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

• Posts: 8Member

Glad to help. Git bisect and open source software rock.

I'd suggest timing your regression tests or putting timeouts on them so you know right away when something like this happens in the future.

• Posts: 8Member

I just tried the fix with v2.3-4-gb8f1308 and I'm getting a time estimate of 3.2 hours. I retried it with v2.0-39-ge14ea5c and got an estimate of 31 minutes so while the performance improved by a factor of two, it's still an order of magnitude slower than it used to be.

I also tried this with a version I build from source and got the same results. In fact, the version ID was identical, v2.3-4-gb8f1308.

Thanks! I'll have a look at this. It's possible that my most recent optimizations fix the problem.

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

edited December 2012

Just had a quick lot. 2.0 takes 15 min, 2.3 takes 90 minutes, but our internal 2.4 takes 14 min. I've been fighting through performance issues with BQSR, improving the nano scheduler, and those optimizations have fixed the problems. We'll be pushing 2.4 on our regular schedule (~6 weeks) so that should finally fix the whole issue. It'll also come with at least 3x performance improvements to BQSR

Post edited by Mark_DePristo on

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

• Posts: 8Member

Would it be possible to get a patch for this bug please? We won't be upgrading to 2.4 because of the license change and all I can do is ask that you leave the final open source version as bug-free as possible.