Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

EMIT_ALL_SITES in UnifiedGenotyper

newbie16newbie16 Posts: 15Member
edited January 2013 in Ask the GATK team

Hello, I have an alignment with 140 reference reads (Ref Base = C) and 10 variant Reads (Var Base = T) at locus: Chr17:7578406. When I use the "EMIT_ALL_SITES" mode, the UnifiedGenotyper generates the following output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=150;Dels=0.00;FS=0.000;HaplotypeScore=11.1188;MLEAC=0;MLEAF=0.00;MQ=38.26;MQ0=0      GT:AD:DP:GQ:PL  0/0:140,0:150:99:0,361,4503

My questions are:

1) In ALT column, why is the base "A" being shown? Is it a randomly selected base when no SNP is identified at that position?
2) The ID column shows dbSNP record rs28934578 which is a C>T mutation (which is what my data has). Why is the dbSNP records for C>T mutation in the output when no variant is identified at this position (or C>A variant is shown?). Does this imply that ID column shows ALL dbSNP records at that position rather than a dbSNP record of the identified variant?
3) Is there a document that details the VCF output when EMIT_ALL_VARIANTS is used so I could understand the output vcf?

Post edited by Geraldine_VdAuwera on

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,295Administrator, GATK Developer admin

    There is in fact a SNP identified at that position; but it is low quality, presumably under the emission quality threshold, which is why it doesn't get emitted when the tool is run in normal mode. The output vcf is exactly the same format as usual.

    Geraldine Van der Auwera, PhD

  • newbie16newbie16 Posts: 15Member

    Hi,

    Maybe I did'nt ask the question correctly. Since the data contains absolutely no reads with "A" base for positions where it is identifying an "A" ALT variant, I don't see how a SNP (e.g. G>A at position 7578402) is being emitted even if it is at a low confidence. Below is the output from an example. For the following command:

    java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o temp.vcf --dbsnp dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410" -out_mode EMIT_ALL_SITES -contamination 0.0

    below is a subsection of the output:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
    17      7578401 rs147002414     G       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.7458;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0        GT:AD:DP:GQ:PL  0/0:12,0:12:18:0,18,172
    17      7578402 .       G       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=5.5344;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:18:0,18,172
    17      7578403 .       C       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=4.5411;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,360
    17      7578404 .       A       C       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.5437;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,360
    17      7578405 .       G       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=4.5436;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,364
    17      7578406 rs28934578      C       T       83.77   .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.380;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=2.7884;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.135;QD=6.98;ReadPosRankSum=2.367     GT:AD:DP:GQ:PL  0/1:7,5:12:99:112,0,195
    17      7578407 rs138729528     G       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.7883;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0        GT:AD:DP:GQ:PL  0/0:12,0:12:30:0,30,367
    

    Position 7578406 shows a C>T SNP which is Okay. However at neighboring sites, the positions 7578401 to 7578403, 7578405, and 7578407 all show an n>A SNP; and position 7578404 shows an A>C SNP. If we look at the IGV screenshot(attached) for position 7578407, it show no read with base A (the pointed position in attached file). Then why is this position being called a A ALT variant. So I am wondering that since the stand_emit_conf is 0 at the above sites and I am using EMIT_ALL_SITES, and if a site is not callable, does the vcf output show a randomly selected variant?

    Thanks

    Screenshot.png
    1280 x 1024 - 68K
  • ebanksebanks Posts: 683GATK Developer mod

    Yes, that's right: the genotyper is assigning a "random" alternate allele just so that it can create genotype likelihoods for the site.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • newbie16newbie16 Posts: 15Member

    Thanks. Is this new change in Version 2.2? Since in version 2.0, with emit_all_sites, I would have seen a "." in the ALT column with a Qual score. Could you explain why this change was made?

Sign In or Register to comment.