
How is it possible that the GT field contradicts the AD field?

mmterpstra (Netherlands, Member)
edited February 2014 in Ask the GATK team

Generated with GATK 2.8-1-g932cd3a.

Although it is rare, I see genotype fields that are inconsistent with the AD values (read as a table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that, for example, sample 5 has an AD value of '23,12' and a GT of 'A/A', i.e. homozygous for the reference allele. I've included a screenshot which shows low base quality and complete strand bias (which I suspect causes variants to be missed). So what's the problem? And how can I recalculate the GTs based on AD? I cannot filter based on genotypes when they are buggy...
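For reference, the alternate-allele fraction implied by each sample's AD in the record above can be tabulated with a few lines of Python. This is only a sketch: the AD and GT values are copied from the posted row, and the names `samples` and `alt_fraction` are made up for illustration.

```python
# AD (ref, alt) and GT per sample, copied from the record posted above.
samples = {
    "1": {"AD": (24, 9), "GT": "A/A"},
    "2": {"AD": (9, 18), "GT": "A/C"},
    "3": {"AD": (21, 11), "GT": "A/A"},
    "4": {"AD": (12, 4), "GT": "A/A"},
    "5": {"AD": (23, 12), "GT": "A/A"},
}


def alt_fraction(ad):
    """Fraction of AD-counted reads that support the alternate allele."""
    ref, alt = ad
    total = ref + alt
    return alt / total if total else 0.0


for name, s in samples.items():
    frac = alt_fraction(s["AD"])
    print(f"sample {name}: GT={s['GT']} AD={s['AD']} alt fraction={frac:.3f}")
```

The fractions agree with the per-sample `F` column in the posted table (for sample 5, 12/35).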


Best Answer


  • mmterpstra (Netherlands, Member)

    To summarise:
    This inconsistency of AD and GT is then probably indicative of Error and should be filtered out. I think I'll look into that, and maybe even post the results :) .

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    mmterpstra wrote: "This inconsistency of AD and GT is then probably indicative of Error and should be filtered out."

    I would not put it like that. It is not really that AD and GT are inconsistent; the problem here is that you are ascribing too much meaning to the AD annotation, and not taking into account its limitations.
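One way to see where the GT in the posted record comes from is to look at the PL field rather than AD. In the VCF convention, PL lists phred-scaled genotype likelihoods in the order ref/ref, ref/alt, alt/alt for a biallelic diploid site, normalized so that the most likely genotype is 0; the called GT is that genotype, and GQ is the gap to the next-best PL (capped at 99 here). The sketch below applies that to the PL values from the question; the helper name `call_from_pl` is hypothetical.

```python
# PL triplets in the order (A/A, A/C, C/C), copied from the posted record.
pls = {
    "1": (0, 54, 537),
    "2": (96, 0, 178),
    "3": (0, 21, 466),
    "4": (0, 27, 272),
    "5": (0, 42, 537),
}
GENOTYPES = ("A/A", "A/C", "C/C")


def call_from_pl(pl):
    """Return (genotype, GQ) for a biallelic diploid PL triplet."""
    best = min(range(len(pl)), key=lambda i: pl[i])
    others = sorted(p for i, p in enumerate(pl) if i != best)
    gq = min(others[0] - pl[best], 99)  # assumed cap of 99 on GQ
    return GENOTYPES[best], gq


for name, pl in pls.items():
    print(name, call_from_pl(pl))
```

For sample 5 this reproduces the posted GT of A/A with GQ 42, even though its AD is 23,12.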

  • mmterpstra (Netherlands, Member)

    I'm looking for "novel" variants in comparison to control sample(s). These inconsistent AD/GT values occur because the data is way too buggy and the software is unable to handle it.

    The best thing is to filter these variants out, perhaps with a practical expression like this (I haven't tested it, so it will probably be buggy):
    vc.getGenotype('NA12878').isHomRef() && (((Double.valueOf(vc.getGenotype('NA12878').getAD().1) / Double.valueOf(vc.getGenotype('NA12878').getDP())) > 0.01 && vc.getGenotype('NA12878').getAD().1 > 2) || ((Double.valueOf(vc.getGenotype('NA12878').getAD().1) / Double.valueOf(vc.getGenotype('NA12878').getDP())) > 0.2 && vc.getGenotype('NA12878').getDP() < 8))

    This 'data-sensitive' genotype interpretation leads to false positives (although the underlying model should be better for real genotyping).
    In real life, when comparing genotypes across a small number of samples, the best thing is to do additional filtering using more sensitive AD-inferred genotypes for the controls. This assumes that, for each individual variant, the underlying errors are similar in more than one sample, which should be safe. @Geraldine_VdAuwera, what do you think of this?

    PS: What is the best way to infer the genotype based on AD fields? (Should it be something like my JEXL jumble, which should allow for a PCR error rate of 0.01 and somewhat lenient genotyping at low coverage?)
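The rule described in this post can be written out in plain Python to make its grouping explicit. This is a sketch of one reading of the expression above (a hom-ref call, and either more than 1% alt reads with more than 2 alt reads, or more than 20% alt reads at a depth below 8). The function name and that grouping are assumptions, and this is not a GATK feature.

```python
def suspicious_hom_ref(gt_is_hom_ref, ad, dp):
    """Flag a hom-ref call whose AD still shows alt-supporting reads.

    ad is (ref_count, alt_count); dp is the sample's DP value.
    """
    if not gt_is_hom_ref or dp <= 0:
        return False
    alt = ad[1]
    frac = alt / dp
    well_covered = frac > 0.01 and alt > 2   # above the assumed 1% PCR error rate
    low_coverage = frac > 0.2 and dp < 8     # more lenient branch for shallow sites
    return well_covered or low_coverage


# Sample 5 from the posted record: GT A/A, AD 23,12, DP 35.
print(suspicious_hom_ref(True, (23, 12), 35))
```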

  • pdexheimer (Member, Dev)

    I wonder why you're going to all the trouble of running GATK if you just want to make assessments on allele depth. If that's all you want to do, why don't you just run samtools pileup and a counting Python script to assign genotypes? The fact is, AD counting is too simplistic to get accurate genotypes; that's why every variant caller that I'm aware of (GATK, MAQ/samtools, SOAP, etc.) uses more complicated models.
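As a rough idea of what such a counting script involves, the sketch below tallies base calls from the read-bases column of a single `samtools mpileup` line. It assumes the standard text pileup encoding (`.`/`,` for reference matches, letters for mismatches, `^` plus a mapping-quality character at read starts, `$` at read ends, `+N`/`-N` followed by N bases for indels, `*` for deleted bases). The function name `count_bases` and the example line are made up for illustration.

```python
from collections import Counter


def count_bases(ref, bases):
    """Count base calls in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read start marker: skip it and the mapping-quality character.
            i += 2
            continue
        if c in "+-":
            # Indel: skip the length digits and the inserted/deleted bases.
            i += 1
            digits = ""
            while i < len(bases) and bases[i].isdigit():
                digits += bases[i]
                i += 1
            i += int(digits)
            continue
        if c in ".,":
            counts[ref.upper()] += 1       # match to the reference base
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1         # mismatch on either strand
        # '$' (read end), '*' (deleted base) and anything else are ignored.
        i += 1
    return counts


# Hypothetical pileup line: chrom, pos, ref, depth, read bases, qualities.
line = "11\t92616485\tA\t8\t..,C,^F.c+2AGC$\tIIIIIIII"
chrom, pos, ref, depth, bases, quals = line.split("\t")
print(dict(count_bases(ref, bases)))
```

Counts like these ignore base and mapping qualities, which is part of why they are too simplistic for genotyping.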

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Yes, agreed. You should never infer genotypes based on the AD field. The documentation for the AD field explicitly says this.

  • mmterpstra (Netherlands, Member)

    Should I then use a straight filter on the AD field for the negative control samples? Like:
    (Double.valueOf(vc.getGenotype('Control').getAD().1) / Double.valueOf(vc.getGenotype('Control').getDP())) > 0.01

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @mmterpstra, I'm not sure I understand what you're trying to achieve with that expression...

  • mmterpstra (Netherlands, Member)

    From my 11 March post: I'm looking for "novel" variants in comparison to control sample(s). The expression is for variants with only one alternate allele: filter the variant out if, in the negative control sample, the ratio of the alternate allele's depth to the total depth exceeds 0.01. Will this filter out questionable variants?
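Stated as code, the proposed control filter amounts to the check below. This is a minimal sketch of the rule as described (biallelic sites only, a threshold of 0.01 on alt depth over total depth in the control). The function name is hypothetical, and returning False for an uncovered control is an assumption of this sketch.

```python
def fails_control_filter(control_ad, control_dp, max_alt_fraction=0.01):
    """True if the control sample carries too many alt-supporting reads.

    control_ad is (ref_count, alt_count); control_dp is the control's DP.
    """
    if control_dp <= 0:
        # Assumption: with no coverage in the control there is nothing to compare.
        return False
    return control_ad[1] / control_dp > max_alt_fraction


# Treating sample 1 of the posted record (AD 24,9; DP 33) as a control,
# purely for illustration.
print(fails_control_filter((24, 9), 33))
```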

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @mmterpstra, sorry for the late response.

    Thanks for clarifying. I don't have any experience with filtering on a case/control basis, so I don't feel comfortable commenting on whether that is an appropriate filtering strategy or not. If pressed I would have to say, it seems to me this assumes there is a correlation between sequencing errors in the separate samples. However, BQSR would typically have dealt with any systematic errors, so what remains should be random and therefore uncorrelated between samples. But I'd have to give this a lot more thought (which I sadly don't have the time for) before I could give a really informed opinion, sorry. Perhaps someone else with more experience can jump in with a more developed reasoning.
