Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Counter-intuitive Genotype calls from UG in Exome samples

KStammKStamm Posts: 14Member

I have a question regarding the bayesian methods employed by the UnifiedGenotyper. I recently upgraded to v2.4-9-g532efad after running into a bug in an old version.

The context is sixty whole exome samples (human), and my pipeline performs mark-duplicate, then bq recalibration, then indel realigning, then finally UnifiedGenotyper. Each sample has some number of variants called (combined VCF output of SNP and INDEL for about 200K lines per sample), without reference calls. Later when looking at individual variants we need to know which samples were homozygous reference versus no-coverage, which is information not contained in my initial VCFs. To get the coverage at each variant I'm running a new big UnifiedGenotyper at the alleles specified by a vcf-merge of each file. This alleles file is about 4 million entries, where I have asked UG to call SNPs and reference bases too.

Now we have more complete data, and some strange things are found. Here's an excerpt from the ouput:

1       17538   .       C       A       710.37  .  
AC=22;AF=0.183;AN=120;BaseQRankSum=-5.993;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.773;QD=5.46;ReadPosRankSum=-0.115   GT:AD:DP:GQ:PL
0/1:5,1:6:27:27,0,96    0/0:5,1:6:12:0,12,156   0/1:5,1:6:26:26,0,113

The concern is that the same five reads reference, one read alternate is sometimes called Hom and sometimes Het. At another site I've got 0/1:6,1:7:28:28,0,79 and 0/0:4,1:5:12:0,12,147 which shows a 6/1 being called hom and a 4/1 being called het!

Apologies for my lack of a well-posed question, I'm trying to get my head wrapped around this so that the next time I tell my lab a variant is discovered, we have more confidence.

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    Hi there,

    This sounds like it's related to an issue we're aware of. We've been doing a lot of work on the likelihood model that is involved and we believe it's much improved in our development version. You can try it out by running the latest nightly build (see Downloads page). If you do please let us know if the calls you get at those sites make more sense with the nightly.

    Geraldine Van der Auwera, PhD

  • KStammKStamm Posts: 14Member

    I have upgraded to vnightly-2013-04-18-g2fd787a and restarted the analysis. Again it estimates a week of compute-time, but these chr1 positions are ready to review.

    The three 5,1 samples at locus chr1:17538 have identical calls and qualities in both versions. The new VCF line looks like this:

    1 17538 . C A 710.37 . AC=22;AF=0.183;AN=120;BaseQRankSum=-5.943;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.662;QD=5.46;ReadPosRankSum=-0.113 GT:AD:DP:GQ:PL 0/1:5,1:6:27:27,0,96 0/0:5,1:6:12:0,12,156 0/1:5,1:6:26:26,0,113

    Looking closely, the only changes are

    BaseQRankSum from -5.993 to -5.943

    MQRankSum from -0.773 to -0.662

    ReadPosRankSum from -0.115 to -0.113

    So the genotype calls are stable, and likely driven by my dataset. Perhaps the underlying base quality is the culprit and this is simply a borderline case for the algorithm. The fourth segment of the colon-delimited genotype (the next-most-likely call score) is low, and perhaps we need to filter variants that are ambiguously called. We hadn't considered ambiguous calls, and might need a new annotation for that (like the non-ATCG entries in a FASTA file).

Sign In or Register to comment.