The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.4 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

IndelRealigner Fails on References with Colons in Scaffold Names

I can't seem to run the IndelRealigner on reads that contain colons, ":" in the reference scaffold names. The RealignerTargetCreator step works correctly and generates the interval table, but the second, IndelRealigner, step fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.

Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.

I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this a real simple fix.


Best Answer


  • Thanks. I downloaded the fix, built the GATK and confirmed it.

  • I downloaded the fix from github and the performance of RealignerTargetCreator has tanked between the original version I grabbed a few months back (v2.0-39-ge14ea5c, Compiled 2012/08/10 15:13:41) and the one I built last week (v2.2-14-g3cb83c2, Compiled 2012/11/29 01:54:33).

    For the same bam and reference and no -known the time estimate goes from 27 minutes to 6.8 hours. Both instances appear to be CPU bound and I don't appear to have any I/O or memory issues. The program is still running, so I can give you better times tomorrow when they're done, but the estimates look to be pretty good.

    Interestingly the following line appears in the 2.2 output, but not the 2.0:
    INFO 22:49:41,515 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
    I would have expected the downsampling to speed things up. Perhaps the 2.0 was doing so implicitly?

    I am running a human sample from 1000 Genomes against GRCh37. Here are the edited messages. Let me know if you need everything.

    INFO  01:38:09,722 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-39-ge14ea5c, Compiled 2012/08/10 15:13:41 
    INFO  01:38:09,723 HelpFormatter - Program Args: -T RealignerTargetCreator -R /reference/homo_sapiens_GRCh37_dbSNP/reference.fasta -o old_gatk_all.intervals -I ERR009283.bam 
    INFO  01:38:09,740 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  01:38:10,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  01:38:10,881 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21 
    INFO  01:38:11,892 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  01:38:41,746 TraversalEngine - 1:30028082:24362919        5.41e+07   30.0 s        0.6 s      1.8%        27.3 m    26.8 m  
    INFO  22:49:40,711 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-14-g3cb83c2, Compiled 2012/11/29 01:54:33 
    INFO  22:49:40,719 HelpFormatter - Program Args: -T RealignerTargetCreator -R /reference/homo_sapiens_GRCh37_dbSNP/reference.fasta -o all.intervals -I ERR009283.bam 
    INFO  22:49:40,728 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  22:49:41,515 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000  
    INFO  22:49:41,525 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  22:49:41,590 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 
    INFO  22:49:41,738 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  22:50:11,779 ProgressMeter - 1:2684220:1128209        3.49e+06   30.0 s        8.6 s      0.1%         6.8 h     6.8 h 
  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Are you sure that the correct results were output during the original (fast) run? Because the bug report was showing that the GATK was not handling contig names correctly if they contained a colon. Is the output between the two runs identical?

  • I'm pretty sure the results are different, but this is the first time that I'm running both with identical bams. I figured you had improved things between 2.0 and 2.2, but it didn't occur to me that the bug was the problem.

    The problem with the colons was only fatal in the second, IndelRealigner step, not the RealignerTargetCreator step.

    I'll compare the results tomorrow when the 2.2 version finishes. What should I look for in the interval table to check the results other than that an indel is present?

  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Honestly, I have no idea... :)
    We don't generally use contigs with colons.
    Do you have any known indels in your data?

  • The generated interval tables are identical in the 2.0 and 2.2 versions. Only the runtime has changed.

  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Interesting. Would you mind uploading your bam (plus index) and reference (plus dict and fai files) to our ftp site? I'd be happy to profile this for you to see whether we can make it faster.

  • Uploaded to cheinan.indel_aligner_data.tar

    You will have to fix or rebuild the dict file as I didn't preserve the reference path.

  • I ran git bisect to see if I could find the problem. The first commit where the run of RealignerTargetCreator fails is 7bbabdd756650a49f9a45e88e3df2a290f6f0b4c (the hashes refer to the public github repo). In this commit, the command seems to hang, although I admit I didn't wait more than five minutes. Typically I would get the first time estimate in 20-30 seconds. Not surprisingly, it looks like you redid the locus traversal code.

    The next commit, 76e0027001420298d44512bc559c48dba4ad43bd, fixes things, but now the time estimate has jumped from 30 minutes to 84 minutes.

    At commit 8f605e87c5d9b0a3cb02596982b0ccd9a64f43df, the time jumps from 84 minutes to 6 hours.

  • Mark_DePristoMark_DePristo Broad InstituteMember
    edited December 2012

    Hi Cheinan,

    I wanted to personally thank you for this excellent bug report. With that bisect in hand we figured out that some monitoring code in the new NanoScheduler was causing the GATK to issue many calls to System.currentTimeMillis, which invokes an expensive system call but doesn't show up in the profiler for some reason. By removing this polling we've brought the runtime back down to the 2.0 version. These fixes will all be available in GATK 2.3.



    Post edited by Mark_DePristo on
  • Glad to help. Git bisect and open source software rock.

    I'd suggest timing your regression tests or putting timeouts on them so you know right away when something like this happens in the future.

  • I just tried the fix with v2.3-4-gb8f1308 and I'm getting a time estimate of 3.2 hours. I retried it with v2.0-39-ge14ea5c and got an estimate of 31 minutes so while the performance improved by a factor of two, it's still an order of magnitude slower than it used to be.

    I also tried this with a version I build from source and got the same results. In fact, the version ID was identical, v2.3-4-gb8f1308.

  • Mark_DePristoMark_DePristo Broad InstituteMember

    Thanks! I'll have a look at this. It's possible that my most recent optimizations fix the problem.

  • Mark_DePristoMark_DePristo Broad InstituteMember
    edited December 2012

    Just had a quick lot. 2.0 takes 15 min, 2.3 takes 90 minutes, but our internal 2.4 takes 14 min. I've been fighting through performance issues with BQSR, improving the nano scheduler, and those optimizations have fixed the problems. We'll be pushing 2.4 on our regular schedule (~6 weeks) so that should finally fix the whole issue. It'll also come with at least 3x performance improvements to BQSR

  • Would it be possible to get a patch for this bug please? We won't be upgrading to 2.4 because of the license change and all I can do is ask that you leave the final open source version as bug-free as possible.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @cheinan,

    Unfortunately that's not going to be possible. I understand your concerns but this is an inevitable consequence of our development model; we have considerably re-engineered portions of the GATK infrastructure ahead of 2.4 and it would be far too much work to back-port these improvements onto 2.3. Our apologies for the inconvenience.

Sign In or Register to comment.