Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

UnifiedGenotyper behaving differently when supplied with single or multiple bam files

genecgenec Posts: 8Member
edited October 2012 in Ask the team

I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.

I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:

GT:AD:DP:GQ:MQ0:PL 1/1:0,66:66:99:0:2337,187,0

This seems to me to be a serious bug. Is this anything that's been noted before?

I am running GATKLite version 2.1-3-ge1dbcc8

Gene

Post edited by Geraldine_VdAuwera on

Best Answer

Answers

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GSA Member admin

    That is indeed alarming. Are you certain that your BAM files are well formed? For example, are the read groups properly formatted? If you can reproduce the error at just this site it would be great if you can extract just this site from the single BAM with PrintReads as well as the merged BAM with PrintReads using -L site, and send it up to our FTP server. But before you do that, make sure that everything is good with your BAMs, that they pass ValidateSamFile, and you are not running with things like -U

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • genecgenec Posts: 8Member

    Hi Mark. I've extracted just that region and the error is reproducing. I ran all the bam files through ValidateSamFile, and I get some errors about missing mates (as expected due to subsetting the full bam file) and some NM numbers not matching up (probably not serious).

    If I run UnifiedGenotyper on the individual bam files, I am able to identify homozygous mutations corresponding to rs28934576 in samples 6, 17, 20, and 22. If I run UnifiedGenotyper on all samples together, I see homozygous mutations in samples 17, 20, and 22, no call for samples 6 and 21, and homozygous wt for all the other samples. If I run just samples 5, 6, and 7 together, I again get the correct homozygous mutant call for sample 6.

    Please give me the details on where to ftp the sample files and I will get them to you.

    Thanks!

  • genecgenec Posts: 8Member

    Thanks, Geraldine for the ftp info.

    The files are uploaded as genotyper_debugging.tgz. This archive contains all the individual bam files (after extracting out the region around the SNP in question), the vcf files generated by UnifiedGenotyper on all the bam files separately, all the bam files together, or just samples 5, 6, and 7 together, along with a text file containing the exact command line arguments that I made to UnifiedGenotyper.

    Please let me know if you can figure anything out.

    Gene

  • genecgenec Posts: 8Member

    I've been doing a bit more debugging and have the issue a little more isolated. To recap, of samples 5 through 28, rs28934576 is present in samples 6, 17, 20, and 22. If any of these is processed alone, the SNP is called correctly. When done in combination:

    Sample 6 is called correctly when processed along with samples 5 through 16. Sample 6 is not called when processed with sample 17. Sample 17 is still called. Sample 6 is called when processed with sample 20. Sample 20 is not called. Sample 6 is not called when processed with sample 22. Sample 22 is still called. Samples 17, 20, and 22 are all properly called when processed together.

    I also tested an older version of GATK (1.6-13-g91f02df) and it appears to have the same behavior.

    Gene

  • genecgenec Posts: 8Member

    Your problem has already been fixed in the latest unstable version of the GATK and will make its way out to you in the next major release (2.2)

    Is there anyway to get access to the dev build of GATK? I'm not finding any reliable way to work around this bug and I'm working on a time-critical project?

    Thanks for your help!

    Gene

  • ebanksebanks Posts: 671GSA Member mod

    Unless you are at the Broad Institute, then sorry but this isn't possible.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.