IndelRealigner Fails on References with Colons in Scaffold Names

I can't seem to run the IndelRealigner on reads that contain colons, ":" in the reference scaffold names. The RealignerTargetCreator step works correctly and generates the interval table, but the second, IndelRealigner, step fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.

Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.

I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this a real simple fix.

Thanks!

Best Answer

Answers

  • PipePlumberPipePlumber Posts: 5Member

    Thanks. I downloaded the fix, built the GATK and confirmed it.

  • cheinancheinan Posts: 8Member

    I downloaded the fix from github and the performance of RealignerTargetCreator has tanked between the original version I grabbed a few months back (v2.0-39-ge14ea5c, Compiled 2012/08/10 15:13:41) and the one I built last week (v2.2-14-g3cb83c2, Compiled 2012/11/29 01:54:33).

    For the same bam and reference and no -known the time estimate goes from 27 minutes to 6.8 hours. Both instances appear to be CPU bound and I don't appear to have any I/O or memory issues. The program is still running, so I can give you better times tomorrow when they're done, but the estimates look to be pretty good.

    Interestingly the following line appears in the 2.2 output, but not the 2.0:
    INFO 22:49:41,515 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
    I would have expected the downsampling to speed things up. Perhaps the 2.0 was doing so implicitly?

    I am running a human sample from 1000 Genomes against GRCh37. Here are the edited messages. Let me know if you need everything.

    INFO  01:38:09,722 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-39-ge14ea5c, Compiled 2012/08/10 15:13:41 
    INFO  01:38:09,723 HelpFormatter - Program Args: -T RealignerTargetCreator -R /reference/homo_sapiens_GRCh37_dbSNP/reference.fasta -o old_gatk_all.intervals -I ERR009283.bam 
    INFO  01:38:09,740 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  01:38:10,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  01:38:10,881 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21 
    INFO  01:38:11,891 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
    INFO  01:38:11,892 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  01:38:41,746 TraversalEngine - 1:30028082:24362919        5.41e+07   30.0 s        0.6 s      1.8%        27.3 m    26.8 m  
    
    INFO  22:49:40,711 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-14-g3cb83c2, Compiled 2012/11/29 01:54:33 
    INFO  22:49:40,719 HelpFormatter - Program Args: -T RealignerTargetCreator -R /reference/homo_sapiens_GRCh37_dbSNP/reference.fasta -o all.intervals -I ERR009283.bam 
    INFO  22:49:40,728 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  22:49:41,515 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000  
    INFO  22:49:41,525 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  22:49:41,590 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 
    INFO  22:49:41,737 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  22:49:41,738 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  22:50:11,779 ProgressMeter - 1:2684220:1128209        3.49e+06   30.0 s        8.6 s      0.1%         6.8 h     6.8 h 
    
  • ebanksebanks Posts: 684GATK Developer mod

    Are you sure that the correct results were output during the original (fast) run? Because the bug report was showing that the GATK was not handling contig names correctly if they contained a colon. Is the output between the two runs identical?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • cheinancheinan Posts: 8Member

    I'm pretty sure the results are different, but this is the first time that I'm running both with identical bams. I figured you had improved things between 2.0 and 2.2, but it didn't occur to me that the bug was the problem.

    The problem with the colons was only fatal in the second, IndelRealigner step, not the RealignerTargetCreator step.

    I'll compare the results tomorrow when the 2.2 version finishes. What should I look for in the interval table to check the results other than that an indel is present?

  • ebanksebanks Posts: 684GATK Developer mod

    Honestly, I have no idea... :) We don't generally use contigs with colons. Do you have any known indels in your data?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • cheinancheinan Posts: 8Member

    The generated interval tables are identical in the 2.0 and 2.2 versions. Only the runtime has changed.

  • ebanksebanks Posts: 684GATK Developer mod

    Interesting. Would you mind uploading your bam (plus index) and reference (plus dict and fai files) to our ftp site? I'd be happy to profile this for you to see whether we can make it faster.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • cheinancheinan Posts: 8Member

    Uploaded to cheinan.indel_aligner_data.tar

    You will have to fix or rebuild the dict file as I didn't preserve the reference path.

  • cheinancheinan Posts: 8Member

    I ran git bisect to see if I could find the problem. The first commit where the run of RealignerTargetCreator fails is 7bbabdd756650a49f9a45e88e3df2a290f6f0b4c (the hashes refer to the public github repo). In this commit, the command seems to hang, although I admit I didn't wait more than five minutes. Typically I would get the first time estimate in 20-30 seconds. Not surprisingly, it looks like you redid the locus traversal code.

    The next commit, 76e0027001420298d44512bc559c48dba4ad43bd, fixes things, but now the time estimate has jumped from 30 minutes to 84 minutes.

    At commit 8f605e87c5d9b0a3cb02596982b0ccd9a64f43df, the time jumps from 84 minutes to 6 hours.

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin
    edited December 2012

    Hi Cheinan,

    I wanted to personally thank you for this excellent bug report. With that bisect in hand we figured out that some monitoring code in the new NanoScheduler was causing the GATK to issue many calls to System.currentTimeMillis, which invokes an expensive system call but doesn't show up in the profiler for some reason. By removing this polling we've brought the runtime back down to the 2.0 version. These fixes will all be available in GATK 2.3.

    Best,

    Mark

    Post edited by Mark_DePristo on

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • cheinancheinan Posts: 8Member

    Glad to help. Git bisect and open source software rock.

    I'd suggest timing your regression tests or putting timeouts on them so you know right away when something like this happens in the future.

  • cheinancheinan Posts: 8Member

    I just tried the fix with v2.3-4-gb8f1308 and I'm getting a time estimate of 3.2 hours. I retried it with v2.0-39-ge14ea5c and got an estimate of 31 minutes so while the performance improved by a factor of two, it's still an order of magnitude slower than it used to be.

    I also tried this with a version I build from source and got the same results. In fact, the version ID was identical, v2.3-4-gb8f1308.

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

    Thanks! I'll have a look at this. It's possible that my most recent optimizations fix the problem.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin
    edited December 2012

    Just had a quick lot. 2.0 takes 15 min, 2.3 takes 90 minutes, but our internal 2.4 takes 14 min. I've been fighting through performance issues with BQSR, improving the nano scheduler, and those optimizations have fixed the problems. We'll be pushing 2.4 on our regular schedule (~6 weeks) so that should finally fix the whole issue. It'll also come with at least 3x performance improvements to BQSR

    Post edited by Mark_DePristo on

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • cheinancheinan Posts: 8Member

    Would it be possible to get a patch for this bug please? We won't be upgrading to 2.4 because of the license change and all I can do is ask that you leave the final open source version as bug-free as possible.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Hi @cheinan,

    Unfortunately that's not going to be possible. I understand your concerns but this is an inevitable consequence of our development model; we have considerably re-engineered portions of the GATK infrastructure ahead of 2.4 and it would be far too much work to back-port these improvements onto 2.3. Our apologies for the inconvenience.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.