How is it possible that the GT field contradicts the AD field?

mmterpstra (Netherlands, Member)
edited February 2014 in Ask the GATK team

Generated with GATK 2.8-1-g932cd3a.

Although it is rare, I see genotype (GT) fields that are inconsistent with the AD values (read as a table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that, for example, sample 5 has an AD value of '23,12' and a GT of 'A/A', i.e. homozygous for the reference allele. I've included a screenshot which shows low base quality and complete strand bias (which I suspect causes variants to be missed). So what's the problem? And how can I recalculate the GTs based on AD? I cannot filter based on genotypes when they are buggy.

Best Answer


  • mmterpstra (Netherlands, Member)

    To summarise:
    This inconsistency of AD and GT is then probably indicative of Error and should be filtered out. I think I'll look into that, and maybe even post the results :) .

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    This inconsistency of AD and GT is then probably indicative of Error and should be filtered out.

    I would not put it like that. It is not really that AD and GT are inconsistent; the problem here is that you are ascribing too much meaning to the AD annotation, and not taking into account its limitations.
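One way to see this from the table in the question: under the VCF format, GT is the genotype with the lowest Phred-scaled likelihood (PL), and GQ is the gap between the two lowest PL values, capped at 99. AD does not enter that calculation. The following is a minimal Python sketch using the per-sample values copied from the table; the helper names are illustrative and not part of GATK.

```python
# Per-sample values copied from the table in the question (site 11:92616485, A>C).
samples = {
    1: {"AD": (24, 9), "GT": "A/A", "GQ": 54, "PL": (0, 54, 537)},
    2: {"AD": (9, 18), "GT": "A/C", "GQ": 96, "PL": (96, 0, 178)},
    3: {"AD": (21, 11), "GT": "A/A", "GQ": 21, "PL": (0, 21, 466)},
    4: {"AD": (12, 4), "GT": "A/A", "GQ": 27, "PL": (0, 27, 272)},
    5: {"AD": (23, 12), "GT": "A/A", "GQ": 42, "PL": (0, 42, 537)},
}

# PL ordering for a biallelic site: hom-ref, het, hom-alt.
GENOTYPES = ("A/A", "A/C", "C/C")


def most_likely_genotype(pl):
    """Return the genotype whose PL is lowest (0 marks the most likely one)."""
    return GENOTYPES[pl.index(min(pl))]


def genotype_quality(pl):
    """Return the gap between the two lowest PL values, capped at 99."""
    lowest, second = sorted(pl)[:2]
    return min(second - lowest, 99)


for name, s in samples.items():
    alt_fraction = s["AD"][1] / sum(s["AD"])
    print(name, s["GT"], most_likely_genotype(s["PL"]),
          s["GQ"], genotype_quality(s["PL"]), round(alt_fraction, 3))
```

For all five samples the reported GT and GQ agree with the PL values, so the genotypes are consistent with the likelihoods the caller computed, whatever the raw AD counts suggest.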

  • mmterpstra (Netherlands, Member)

    I'm looking for "novel" variants in comparison to control sample(s). These inconsistent AD/GT values occur because the data is way too buggy and the software is unable to handle it.

    The best thing is to filter these variants out, maybe with something practical like this (I haven't tested it, so it will be buggy):
    vc.getGenotype('NA12878').isHomRef() && (((Double.valueOf(vc.getGenotype('NA12878').getAD().1) / Double.valueOf(vc.getGenotype('NA12878').getDP())) > 0.01 && vc.getGenotype('NA12878').getAD().1 > 2) || ((Double.valueOf(vc.getGenotype('NA12878').getAD().1) / Double.valueOf(vc.getGenotype('NA12878').getDP())) > 0.2 && vc.getGenotype('NA12878').getDP() < 8))

    This 'data-sensitive' genotype interpretation leads to false positives (although the underlying model should be better for real genotyping).
    In real life, when filtering between different genotypes with a low number of samples, the best thing to do is additional filtering using more sensitive AD-inferred genotypes for the controls. This assumes that for each individual variant the underlying errors are similar in more than one sample, which should be safe. @Geraldine_VdAuwera, what do you think of this?

    PS: What is the best way to infer the genotype based on AD fields? (Should it be something like my JEXL jumble, which should allow a PCR error rate of 0.01 and somewhat lenient genotyping at low coverage?)
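The logic of the JEXL expression in this post is easier to follow when written out step by step. Below is a minimal Python sketch of that heuristic as proposed here, not a recommended genotyping method; the function name and argument layout are hypothetical, and the thresholds are the ones from the post.

```python
def homref_with_alt_support(is_hom_ref, ad, dp):
    """Flag a hom-ref call that still shows notable ALT read support.

    ad is (ref_count, alt_count) and dp is the sample depth, mirroring
    getAD() and getDP() in the JEXL expression. Biallelic sites only.
    """
    if not is_hom_ref or dp <= 0:
        return False  # nothing to flag, and avoid dividing by zero
    alt = ad[1]
    alt_fraction = alt / dp
    # Normal coverage: ALT above a 1% error rate and more than two reads.
    enough_reads = alt_fraction > 0.01 and alt > 2
    # Low coverage (DP < 8): more lenient on counts, stricter on fraction.
    low_coverage = alt_fraction > 0.2 and dp < 8
    return enough_reads or low_coverage


# Sample 5 from the table: GT A/A, AD 23,12, DP 35.
print(homref_with_alt_support(True, (23, 12), 35))
```

The explicit depth check matters because a sample with no coverage at the site would otherwise trigger a division by zero.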

  • pdexheimer (Member)

    I wonder why you're going to all the trouble of running GATK if you just want to make assessments on allele depth. If that's all you want to do, why don't you just run samtools pileup and a counting Python script to assign genotypes? The fact is, AD counting is too simplistic to get accurate genotypes. That's why every variant caller that I'm aware of (GATK, MAQ/samtools, SOAP, etc.) uses more complicated models.
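To illustrate why raw counts and model-based genotypes can disagree, here is a toy diploid genotype likelihood calculation in Python. It is a textbook-style sketch, not GATK's actual model: it assumes independent reads, uses only base qualities, and the quality values (Q30 and Q3) are made up for the example. The read counts match sample 5 in the question (23 REF, 12 ALT).

```python
import math


def log10_genotype_likelihoods(ref_quals, alt_quals):
    """Toy model: log10 P(reads | genotype) for RR, RA and AA.

    ref_quals and alt_quals are Phred base qualities of the reads
    showing the REF and ALT base respectively.
    """
    totals = {"RR": 0.0, "RA": 0.0, "AA": 0.0}
    observations = [("R", q) for q in ref_quals] + [("A", q) for q in alt_quals]
    for base, qual in observations:
        err = 10 ** (-qual / 10)   # probability the base call is wrong
        p_match = 1 - err          # observed base equals the true allele
        p_mismatch = err / 3       # error happened to yield this base
        p_if_ref = p_match if base == "R" else p_mismatch
        p_if_alt = p_match if base == "A" else p_mismatch
        totals["RR"] += math.log10(p_if_ref)
        totals["RA"] += math.log10(0.5 * p_if_ref + 0.5 * p_if_alt)
        totals["AA"] += math.log10(p_if_alt)
    return totals


def best_genotype(ref_quals, alt_quals):
    """Return the genotype with the highest likelihood."""
    likelihoods = log10_genotype_likelihoods(ref_quals, alt_quals)
    return max(likelihoods, key=likelihoods.get)


ref_reads = [30] * 23
print(best_genotype(ref_reads, [30] * 12))  # ALT reads at Q30: het (RA)
print(best_genotype(ref_reads, [3] * 12))   # ALT reads at Q3: hom-ref (RR)
```

With identical counts of 23 and 12, the call flips from het to hom-ref once the ALT bases are of very low quality, which is the kind of evidence a count-based rule cannot see.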

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Yes, agreed. You should never infer genotypes based on the AD field. The documentation for the AD field explicitly says this.

  • mmterpstra (Netherlands, Member)

    Should I then use a straight filter on the AD field for the negative control samples? Like:
    (Double.valueOf(vc.getGenotype('Control').getAD().1) / Double.valueOf(vc.getGenotype('Control').getDP())) > 0.01

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @mmterpstra, I'm not sure I understand what you're trying to achieve with that expression...

  • mmterpstra (Netherlands, Member)

    From my 11 March post: I'm looking for "novel" variants in comparison to control sample(s). The expression is for variants with only one alternate allele: filter the variant out if, in the negative control sample, the ratio of the alternate allele's depth to the total depth exceeds 0.01. Will this filter out questionable variants?
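The rule described in this post can be sketched in a few lines of Python. This only restates the proposed filter so its effect is visible; it is not an endorsed strategy. The function names are hypothetical, and the four hom-ref samples from the table in the question are treated as stand-in controls purely for illustration.

```python
def control_alt_fraction(ad, dp):
    """ALT read depth divided by total depth for one biallelic sample."""
    if dp <= 0:
        return 0.0  # no coverage: treat as no ALT evidence
    return ad[1] / dp


def fails_control_filter(ad, dp, max_fraction=0.01):
    """True when the control shows more ALT support than the threshold."""
    return control_alt_fraction(ad, dp) > max_fraction


# Hom-ref samples from the table, as (AD, DP).
hom_ref_samples = {1: ((24, 9), 33), 3: ((21, 11), 32),
                   4: ((12, 4), 16), 5: ((23, 12), 35)}

for name, (ad, dp) in hom_ref_samples.items():
    print(name, round(control_alt_fraction(ad, dp), 4),
          fails_control_filter(ad, dp))
```

The computed fractions match the F column of the table (for example 12/35 for sample 5), and all four samples exceed the 0.01 threshold, so this site would be removed if any of them were the negative control.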

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @mmterpstra, sorry for the late response.

    Thanks for clarifying. I don't have any experience with filtering on a case/control basis, so I don't feel comfortable commenting on whether that is an appropriate filtering strategy or not. If pressed I would have to say, it seems to me this assumes there is a correlation between sequencing errors in the separate samples. However, BQSR would typically have dealt with any systematic errors, so what remains should be random and therefore uncorrelated between samples. But I'd have to give this a lot more thought (which I sadly don't have the time for) before I could give a really informed opinion, sorry. Perhaps someone else with more experience can jump in with a more developed reasoning.
