The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block.
Powered by Vanilla. Made with Bootstrap.
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Counter-intuitive Genotype calls from UG in Exome samples

KStammKStamm Member Posts: 31

I have a question regarding the bayesian methods employed by the UnifiedGenotyper. I recently upgraded to v2.4-9-g532efad after running into a bug in an old version.

The context is sixty whole exome samples (human), and my pipeline performs mark-duplicate, then bq recalibration, then indel realigning, then finally UnifiedGenotyper. Each sample has some number of variants called (combined VCF output of SNP and INDEL for about 200K lines per sample), without reference calls. Later when looking at individual variants we need to know which samples were homozygous reference versus no-coverage, which is information not contained in my initial VCFs. To get the coverage at each variant I'm running a new big UnifiedGenotyper at the alleles specified by a vcf-merge of each file. This alleles file is about 4 million entries, where I have asked UG to call SNPs and reference bases too.

Now we have more complete data, and some strange things are found. Here's an excerpt from the ouput:

1       17538   .       C       A       710.37  .  
AC=22;AF=0.183;AN=120;BaseQRankSum=-5.993;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.773;QD=5.46;ReadPosRankSum=-0.115   GT:AD:DP:GQ:PL
0/1:5,1:6:27:27,0,96    0/0:5,1:6:12:0,12,156   0/1:5,1:6:26:26,0,113

The concern is that the same five reads reference, one read alternate is sometimes called Hom and sometimes Het.
At another site I've got
0/1:6,1:7:28:28,0,79 and 0/0:4,1:5:12:0,12,147 which shows a 6/1 being called hom and a 4/1 being called het!

Apologies for my lack of a well-posed question, I'm trying to get my head wrapped around this so that the next time I tell my lab a variant is discovered, we have more confidence.

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,421 admin

    Hi there,

    This sounds like it's related to an issue we're aware of. We've been doing a lot of work on the likelihood model that is involved and we believe it's much improved in our development version. You can try it out by running the latest nightly build (see Downloads page). If you do please let us know if the calls you get at those sites make more sense with the nightly.

    Geraldine Van der Auwera, PhD

  • KStammKStamm Member Posts: 31

    I have upgraded to vnightly-2013-04-18-g2fd787a and restarted the analysis. Again it estimates a week of compute-time, but these chr1 positions are ready to review.

    The three 5,1 samples at locus chr1:17538 have identical calls and qualities in both versions. The new VCF line looks like this:

    1 17538 . C A 710.37 . AC=22;AF=0.183;AN=120;BaseQRankSum=-5.943;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.662;QD=5.46;ReadPosRankSum=-0.113 GT:AD:DP:GQ:PL 0/1:5,1:6:27:27,0,96 0/0:5,1:6:12:0,12,156 0/1:5,1:6:26:26,0,113

    Looking closely, the only changes are

    BaseQRankSum from -5.993 to -5.943

    MQRankSum from -0.773 to -0.662

    ReadPosRankSum from -0.115 to -0.113

    So the genotype calls are stable, and likely driven by my dataset. Perhaps the underlying base quality is the culprit and this is simply a borderline case for the algorithm. The fourth segment of the colon-delimited genotype (the next-most-likely call score) is low, and perhaps we need to filter variants that are ambiguously called. We hadn't considered ambiguous calls, and might need a new annotation for that (like the non-ATCG entries in a FASTA file).

Sign In or Register to comment.