# How is it possible that the GT field contradicts the AD field?

NetherlandsPosts: 41Member
edited February 2014

generated with gatk 2.8-1-g932cd3a

Although it is rare, I see genotype fields that are inconsistent with the AD values (read as a table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802


This shows that, for example, sample 5 has an AD value of '23,12' and a GT of 'A/A', i.e. homozygous for the reference allele. I've included a screenshot which shows low base quality and complete strand bias (which I suspect causes variants to be missed). So what's the problem, and how can I recalculate the GTs based on AD? I cannot filter on genotypes when they are buggy.


• NetherlandsPosts: 41Member

Thanks,
To summarise:
This inconsistency of AD and GT is then probably indicative of error and should be filtered out. I think I'll look into that, and maybe even post the results.

> This inconsistency of AD and GT is then probably indicative of error and should be filtered out.

I would not put it like that. It is not really that AD and GT are inconsistent; the problem here is that you are ascribing too much meaning to the AD annotation, and not taking into account its limitations.
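To make the point concrete, here is a minimal Python sketch using the AD, GT and PL values copied from the record posted above. It assumes the usual VCF ordering of PL for a biallelic site (hom-ref, het, hom-alt); the helper names `most_likely_genotype` and `alt_fraction` are hypothetical and not part of GATK. It shows that each reported GT matches the genotype with the lowest PL, even where the raw alt-read fraction from AD looks like a heterozygote.

```python
# Per-sample AD, GT and PL values copied from the record at 11:92616485 above.
samples = {
    "1": {"AD": (24, 9), "GT": "A/A", "PL": (0, 54, 537)},
    "2": {"AD": (9, 18), "GT": "A/C", "PL": (96, 0, 178)},
    "3": {"AD": (21, 11), "GT": "A/A", "PL": (0, 21, 466)},
    "4": {"AD": (12, 4), "GT": "A/A", "PL": (0, 27, 272)},
    "5": {"AD": (23, 12), "GT": "A/A", "PL": (0, 42, 537)},
}

# Assumption: biallelic site, so PL is ordered hom-ref, het, hom-alt.
PL_ORDER = ("A/A", "A/C", "C/C")


def most_likely_genotype(pl):
    """Return the genotype whose Phred-scaled likelihood (PL) is smallest."""
    best_index = min(range(len(pl)), key=lambda i: pl[i])
    return PL_ORDER[best_index]


def alt_fraction(ad):
    """Fraction of counted reads that support the alternate allele."""
    ref_reads, alt_reads = ad
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


for name, fields in samples.items():
    print(
        name,
        "GT:", fields["GT"],
        "best PL genotype:", most_likely_genotype(fields["PL"]),
        "alt fraction from AD:", round(alt_fraction(fields["AD"]), 3),
    )
```

The PL values reflect base qualities and other evidence that a plain read count ignores, which is one way a sample with roughly a third of its reads supporting the alternate allele can still be called A/A.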

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 41Member

I'm looking for "novel" variants in comparison to control sample(s). These inconsistent AD/GT values occur because the data is far too buggy and the software is unable to handle it.

The best thing is to filter these variants out, perhaps in practice with something like this (I haven't tested it, so it will be buggy):
vc.getGenotype("NA12878").isHomRef() && (((Double.valueOf(vc.getGenotype("NA12878").getAD().1) / Double.valueOf(vc.getGenotype("NA12878").getDP())) > 0.01 && vc.getGenotype("NA12878").getAD().1 > 2) || ((Double.valueOf(vc.getGenotype("NA12878").getAD().1) / Double.valueOf(vc.getGenotype("NA12878").getDP())) > 0.2 && vc.getGenotype("NA12878").getDP() < 8))

This 'data-sensitive' genotype interpretation leads to false positives (although the underlying model should be better for real genotyping).
In real life, when comparing genotypes across a small number of samples, the best thing is to apply additional filtering using more sensitive AD-inferred genotypes for the controls. This assumes that, for each individual variant, the underlying errors are similar in more than one sample, which should be a safe assumption. @Geraldine_VdAuwera, what do you think of this?

PS: What is the best way to infer the genotype from the AD fields? (Should it be something like my JEXL jumble, which allows for a PCR error rate of 0.01 and somewhat lenient genotyping at low coverage?)
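For readers who find the JEXL expression above hard to follow, the same heuristic can be written as a short Python sketch. This is only one reading of the untested expression: it assumes the hom-ref check applies to both depth rules, and the function name `control_has_suspicious_alt_reads` and its parameter names are hypothetical. It illustrates the proposed filter on a control sample, not a recommended way to assign genotypes.

```python
def control_has_suspicious_alt_reads(
    is_hom_ref,
    ad,
    dp,
    min_alt_fraction=0.01,   # tolerated error rate from the post
    min_alt_reads=2,         # alt reads must exceed this count
    low_dp=8,                # depth below which the lenient rule applies
    low_dp_alt_fraction=0.2,
):
    """Flag a hom-ref control call that still carries alt-supporting reads.

    ad is (ref_reads, alt_reads) for a biallelic site; dp is the sample depth.
    """
    if not is_hom_ref or dp <= 0:
        return False
    alt_reads = ad[1]
    fraction = alt_reads / dp
    # Rule 1: more alt reads than the error rate would explain.
    normal_coverage_rule = fraction > min_alt_fraction and alt_reads > min_alt_reads
    # Rule 2: at low depth, require a larger fraction instead of a read count.
    low_coverage_rule = fraction > low_dp_alt_fraction and dp < low_dp
    return normal_coverage_rule or low_coverage_rule


# Sample 5 from the record above: GT A/A, AD 23,12, DP 35.
print(control_has_suspicious_alt_reads(True, (23, 12), 35))
```

With the thresholds from the post, sample 5 would be flagged, so a variant called in another sample at this site would not be reported as novel relative to that control.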

• Posts: 541Member, Dev ✭✭✭✭

I wonder why you're going to all the trouble of running GATK if you just want to make assessments on allele depth. If that's all you want to do, why don't you just run samtools pileup and a counting Python script to assign genotypes? The fact is, AD counting is too simplistic to get accurate genotypes. That's why every variant caller that I'm aware of (GATK, MAQ/samtools, SOAP, etc.) uses more complicated models.

Yes, agreed. You should never infer genotypes based on the AD field. The documentation for the AD field explicitly says this.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• NetherlandsPosts: 41Member

Should I then use a straight filter on the AD field for the negative control samples? Like:

@mmterpstra, I'm not sure I understand what you're trying to achieve with that expression...

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 41Member

From my 11 March post: I'm looking for "novel" variants in comparison to control sample(s). The expression is for variants with only one alternate allele: filter the variant out if, in the negative control sample, the ratio of the alternate allele depth to the total depth exceeds 0.01. Will this filter out questionable variants?