The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at

EMIT_ALL_SITES in UnifiedGenotyper

newbie16newbie16 Member Posts: 42
edited January 2013 in Ask the GATK team

I have an alignment with 140 reference reads (Ref Base = C) and 10 variant Reads (Var Base = T) at locus: Chr17:7578406. When I use the "EMIT_ALL_SITES" mode, the UnifiedGenotyper generates the following output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=150;Dels=0.00;FS=0.000;HaplotypeScore=11.1188;MLEAC=0;MLEAF=0.00;MQ=38.26;MQ0=0      GT:AD:DP:GQ:PL  0/0:140,0:150:99:0,361,4503

My questions are:

1) In ALT column, why is the base "A" being shown? Is it a randomly selected base when no SNP is identified at that position?
2) The ID column shows dbSNP record rs28934578 which is a C>T mutation (which is what my data has). Why is the dbSNP records for C>T mutation in the output when no variant is identified at this position (or C>A variant is shown?). Does this imply that ID column shows ALL dbSNP records at that position rather than a dbSNP record of the identified variant?
3) Is there a document that details the VCF output when EMIT_ALL_VARIANTS is used so I could understand the output vcf?

Post edited by Geraldine_VdAuwera on


  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 11,015 admin

    There is in fact a SNP identified at that position; but it is low quality, presumably under the emission quality threshold, which is why it doesn't get emitted when the tool is run in normal mode. The output vcf is exactly the same format as usual.

    Geraldine Van der Auwera, PhD

  • newbie16newbie16 Member Posts: 42


    Maybe I did'nt ask the question correctly.
    Since the data contains absolutely no reads with "A" base for positions where it is identifying an "A" ALT variant, I don't see how a SNP (e.g. G>A at position 7578402) is being emitted even if it is at a low confidence. Below is the output from an example. For the following command:

    java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o temp.vcf --dbsnp dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410" -out_mode EMIT_ALL_SITES -contamination 0.0

    below is a subsection of the output:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
    17      7578401 rs147002414     G       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.7458;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0        GT:AD:DP:GQ:PL  0/0:12,0:12:18:0,18,172
    17      7578402 .       G       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=5.5344;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:18:0,18,172
    17      7578403 .       C       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=4.5411;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,360
    17      7578404 .       A       C       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.5437;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,360
    17      7578405 .       G       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=4.5436;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,364
    17      7578406 rs28934578      C       T       83.77   .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.380;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=2.7884;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.135;QD=6.98;ReadPosRankSum=2.367     GT:AD:DP:GQ:PL  0/1:7,5:12:99:112,0,195
    17      7578407 rs138729528     G       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.7883;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0        GT:AD:DP:GQ:PL  0/0:12,0:12:30:0,30,367

    Position 7578406 shows a C>T SNP which is Okay. However at neighboring sites, the positions 7578401 to 7578403, 7578405, and 7578407 all show an n>A SNP; and position 7578404 shows an A>C SNP.
    If we look at the IGV screenshot(attached) for position 7578407, it show no read with base A (the pointed position in attached file). Then why is this position being called a A ALT variant.
    So I am wondering that since the stand_emit_conf is 0 at the above sites and I am using EMIT_ALL_SITES, and if a site is not callable, does the vcf output show a randomly selected variant?


    1280 x 1024 - 68K
  • ebanksebanks Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 698 admin

    Yes, that's right: the genotyper is assigning a "random" alternate allele just so that it can create genotype likelihoods for the site.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • newbie16newbie16 Member Posts: 42

    Thanks. Is this new change in Version 2.2? Since in version 2.0, with emit_all_sites, I would have seen a "." in the ALT column with a Qual score. Could you explain why this change was made?

Sign In or Register to comment.