# UnifiedGenotyper behaving differently when supplied with single or multiple bam files

edited October 2012

I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.

I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:

1/1:0,66:66:99:0:2337,187,0
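For readers unfamiliar with the per-sample genotype field, here is a minimal sketch of how it can be split into named values. The FORMAT column is not shown in the post, so the key order `GT:AD:DP:GQ:MQ0:PL` is an assumption (in particular the fifth key), and `parse_genotype` is a hypothetical helper, not part of GATK.

```python
def parse_genotype(format_keys: str, sample_field: str) -> dict:
    """Pair each FORMAT key with the matching colon-separated sample value."""
    keys = format_keys.split(":")
    values = sample_field.split(":")
    if len(keys) != len(values):
        raise ValueError(f"{len(keys)} FORMAT keys but {len(values)} values")
    return dict(zip(keys, values))


# ASSUMPTION: the FORMAT column is not shown in the post; this key order is a guess.
ASSUMED_FORMAT = "GT:AD:DP:GQ:MQ0:PL"

call = parse_genotype(ASSUMED_FORMAT, "1/1:0,66:66:99:0:2337,187,0")
ref_depth, alt_depth = (int(n) for n in call["AD"].split(","))
print(call["GT"], ref_depth, alt_depth)
```

Under that assumed key order, the field reads as a homozygous-variant genotype supported by 66 alternate-allele reads and no reference reads.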

This seems to me to be a serious bug. Is this anything that's been noted before?

I am running GATKLite version 2.1-3-ge1dbcc8

Gene


That is indeed alarming. Are you certain that your BAM files are well formed? For example, are the read groups properly formatted? If you can reproduce the error at just this site, it would be great if you could extract just this site from both the single BAM and the merged BAM with PrintReads using -L site, and upload the results to our FTP server. But before you do that, make sure that everything is good with your BAMs, that they pass ValidateSamFile, and that you are not running with things like -U.

--
Mark A. DePristo, Ph.D.
Co-Director, Medical and Population Genetics
Broad Institute of MIT and Harvard
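The extraction step suggested above could be scripted roughly as follows. This is only a sketch: the jar path, file names, and interval are placeholders, and `print_reads_command` is a hypothetical helper. It relies on the standard GATK engine arguments `-T`, `-R`, `-I`, `-L`, and `-o`.

```python
import shlex

GATK_JAR = "GenomeAnalysisTK.jar"  # placeholder path to the GATK jar


def print_reads_command(reference, bam, interval, output):
    """Build a PrintReads command that keeps only reads overlapping `interval`."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "PrintReads",
        "-R", reference,
        "-I", bam,
        "-L", interval,
        "-o", output,
    ]


# Placeholder inputs; substitute the real reference, BAM, and site.
cmd = print_reads_command("ref.fasta", "sample6.bam", "chr1:1000-2000", "sample6.site.bam")
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.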


Hi Mark. I've extracted just that region and the error is reproducing. I ran all the bam files through ValidateSamFile, and I get some errors about missing mates (as expected due to subsetting the full bam file) and some NM numbers not matching up (probably not serious).

If I run UnifiedGenotyper on the individual bam files, I am able to identify homozygous mutations corresponding to rs28934576 in samples 6, 17, 20, and 22. If I run UnifiedGenotyper on all samples together, I see homozygous mutations in samples 17, 20, and 22, no call for samples 6 and 21, and homozygous wt for all the other samples. If I run just samples 5, 6, and 7 together, I again get the correct homozygous mutant call for sample 6.
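The comparison described above can be expressed as a small check. This is a sketch that encodes only the genotypes stated in the post (homozygous mutant as `1/1`, no-call as `./.`). Samples whose call in a given run is not stated explicitly are left out, and `find_discrepancies` is a hypothetical helper.

```python
def find_discrepancies(single_calls, joint_calls):
    """Return {sample: (single_gt, joint_gt)} where the two runs disagree."""
    return {
        sample: (gt, joint_calls[sample])
        for sample, gt in sorted(single_calls.items())
        if sample in joint_calls and joint_calls[sample] != gt
    }


# Genotypes at the site as reported in the post.
single_sample = {6: "1/1", 17: "1/1", 20: "1/1", 22: "1/1"}
all_together = {6: "./.", 17: "1/1", 20: "1/1", 21: "./.", 22: "1/1"}

print(find_discrepancies(single_sample, all_together))
# prints {6: ('1/1', './.')}
```

Sample 21 does not appear in the result because its single-sample call is not given in the post, so there is nothing to compare it against.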

Please give me the details on where to ftp the sample files and I will get them to you.

Thanks!

Geraldine Van der Auwera, PhD


Thanks, Geraldine for the ftp info.

The files are uploaded as genotyper_debugging.tgz. This archive contains all the individual bam files (after extracting the region around the SNP in question) and the vcf files generated by running UnifiedGenotyper on each bam file separately, on all the bam files together, and on just samples 5, 6, and 7 together. It also contains a text file with the exact command line arguments that I passed to UnifiedGenotyper.

Please let me know if you can figure anything out.

Gene


I've been doing a bit more debugging and have the issue a little more isolated. To recap, of samples 5 through 28, rs28934576 is present in samples 6, 17, 20, and 22. If any of these is processed alone, the SNP is called correctly. When done in combination:

- Sample 6 is called correctly when processed along with samples 5 through 16.
- Sample 6 is not called when processed with sample 17. Sample 17 is still called.
- Sample 6 is called when processed with sample 20. Sample 20 is not called.
- Sample 6 is not called when processed with sample 22. Sample 22 is still called.
- Samples 17, 20, and 22 are all properly called when processed together.
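To make the pattern easier to see, the pairwise results listed above can be tabulated against every possible pair of carrier samples. This is only an illustrative sketch. The `observed` mapping encodes the three pairs reported in this post, and the remaining pairs are marked as untested because they were only run together as a triple.

```python
from itertools import combinations

carriers = [6, 17, 20, 22]  # samples reported to carry rs28934576

# For each pair that was actually run, the samples whose variant was still called.
observed = {
    (6, 17): {17},
    (6, 20): {6},
    (6, 22): {22},
}

for pair in combinations(carriers, 2):
    if pair in observed:
        dropped = set(pair) - observed[pair]
        print(pair, "lost call for", sorted(dropped))
    else:
        print(pair, "not tested as a pair")
```

In each reported pair exactly one of the two carriers loses its call, and it is not always sample 6.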

I also tested an older version of GATK (1.6-13-g91f02df) and it appears to have the same behavior.

Gene

Hi Gene,

Your problem has already been fixed in the latest unstable version of the GATK and will make its way out to you in the next major release (2.2). See here for more details:

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT


> Your problem has already been fixed in the latest unstable version of the GATK and will make its way out to you in the next major release (2.2).

Is there any way to get access to the dev build of GATK? I'm not finding any reliable way to work around this bug, and I'm working on a time-critical project.