The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

**GATK4-BETA.2** is here. That's TWO, as in the second beta release. Be sure to read about the known issues before test driving. See Article#9881 to start and https://github.com/broadinstitute/gatk/blob/master/README.md for details.

# Counter-intuitive Genotype calls from UG in Exome samples

Member

I have a question regarding the bayesian methods employed by the UnifiedGenotyper. I recently upgraded to v2.4-9-g532efad after running into a bug in an old version.

The context is sixty whole exome samples (human), and my pipeline performs mark-duplicate, then bq recalibration, then indel realigning, then finally UnifiedGenotyper. Each sample has some number of variants called (combined VCF output of SNP and INDEL for about 200K lines per sample), without reference calls. Later when looking at individual variants we need to know which samples were homozygous reference versus no-coverage, which is information not contained in my initial VCFs. To get the coverage at each variant I'm running a new big UnifiedGenotyper at the alleles specified by a vcf-merge of each file. This alleles file is about 4 million entries, where I have asked UG to call SNPs and reference bases too.

Now we have more complete data, and some strange things are found. Here's an excerpt from the ouput:

1 17538 . C A 710.37 . AC=22;AF=0.183;AN=120;BaseQRankSum=-5.993;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.773;QD=5.46;ReadPosRankSum=-0.115 GT:AD:DP:GQ:PL 0/1:5,1:6:27:27,0,96 0/0:5,1:6:12:0,12,156 0/1:5,1:6:26:26,0,113

The concern is that the same five reads reference, one read alternate is sometimes called Hom and sometimes Het.
At another site I've got
0/1:6,1:7:28:28,0,79 and 0/0:4,1:5:12:0,12,147 which shows a 6/1 being called hom and a 4/1 being called het!

Apologies for my lack of a well-posed question, I'm trying to get my head wrapped around this so that the next time I tell my lab a variant is discovered, we have more confidence.

Tagged:

Hi there,

This sounds like it's related to an issue we're aware of. We've been doing a lot of work on the likelihood model that is involved and we believe it's much improved in our development version. You can try it out by running the latest nightly build (see Downloads page). If you do please let us know if the calls you get at those sites make more sense with the nightly.

• Member

I have upgraded to vnightly-2013-04-18-g2fd787a and restarted the analysis. Again it estimates a week of compute-time, but these chr1 positions are ready to review.

The three 5,1 samples at locus chr1:17538 have identical calls and qualities in both versions. The new VCF line looks like this:

1 17538 . C A 710.37 . AC=22;AF=0.183;AN=120;BaseQRankSum=-5.943;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.662;QD=5.46;ReadPosRankSum=-0.113 GT:AD:DP:GQ:PL 0/1:5,1:6:27:27,0,96 0/0:5,1:6:12:0,12,156 0/1:5,1:6:26:26,0,113 

Looking closely, the only changes are

BaseQRankSum from -5.993 to -5.943

MQRankSum from -0.773 to -0.662

ReadPosRankSum from -0.115 to -0.113`

So the genotype calls are stable, and likely driven by my dataset. Perhaps the underlying base quality is the culprit and this is simply a borderline case for the algorithm. The fourth segment of the colon-delimited genotype (the next-most-likely call score) is low, and perhaps we need to filter variants that are ambiguously called. We hadn't considered ambiguous calls, and might need a new annotation for that (like the non-ATCG entries in a FASTA file).