The current GATK version is 3.4-0

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

EMIT_ALL_SITES in UnifiedGenotyper

Posts: 37Member
edited January 2013

Hello,
I have an alignment with 140 reference reads (Ref Base = C) and 10 variant Reads (Var Base = T) at locus: Chr17:7578406. When I use the "EMIT_ALL_SITES" mode, the UnifiedGenotyper generates the following output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=150;Dels=0.00;FS=0.000;HaplotypeScore=11.1188;MLEAC=0;MLEAF=0.00;MQ=38.26;MQ0=0      GT:AD:DP:GQ:PL  0/0:140,0:150:99:0,361,4503


My questions are:

1) In ALT column, why is the base "A" being shown? Is it a randomly selected base when no SNP is identified at that position?
2) The ID column shows dbSNP record rs28934578 which is a C>T mutation (which is what my data has). Why is the dbSNP records for C>T mutation in the output when no variant is identified at this position (or C>A variant is shown?). Does this imply that ID column shows ALL dbSNP records at that position rather than a dbSNP record of the identified variant?
3) Is there a document that details the VCF output when EMIT_ALL_VARIANTS is used so I could understand the output vcf?

Post edited by Geraldine_VdAuwera on
Tagged:

There is in fact a SNP identified at that position; but it is low quality, presumably under the emission quality threshold, which is why it doesn't get emitted when the tool is run in normal mode. The output vcf is exactly the same format as usual.

Geraldine Van der Auwera, PhD

• Posts: 37Member

Hi,

Maybe I did'nt ask the question correctly.
Since the data contains absolutely no reads with "A" base for positions where it is identifying an "A" ALT variant, I don't see how a SNP (e.g. G>A at position 7578402) is being emitted even if it is at a low confidence. Below is the output from an example. For the following command:

java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o temp.vcf --dbsnp dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410" -out_mode EMIT_ALL_SITES -contamination 0.0

below is a subsection of the output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578401 rs147002414     G       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.7458;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0        GT:AD:DP:GQ:PL  0/0:12,0:12:18:0,18,172
17      7578402 .       G       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=5.5344;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:18:0,18,172
17      7578403 .       C       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=4.5411;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,360
17      7578404 .       A       C       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.5437;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,360
17      7578405 .       G       A       0       LowQual AC=0;AF=0.00;AN=2;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=4.5436;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0  GT:AD:DP:GQ:PL   0/0:12,0:12:30:0,30,364
17      7578407 rs138729528     G       A       0       LowQual AC=0;AF=0.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=3.7883;MLEAC=0;MLEAF=0.00;MQ=38.06;MQ0=0        GT:AD:DP:GQ:PL  0/0:12,0:12:30:0,30,367


Position 7578406 shows a C>T SNP which is Okay. However at neighboring sites, the positions 7578401 to 7578403, 7578405, and 7578407 all show an n>A SNP; and position 7578404 shows an A>C SNP.
If we look at the IGV screenshot(attached) for position 7578407, it show no read with base A (the pointed position in attached file). Then why is this position being called a A ALT variant.
So I am wondering that since the stand_emit_conf is 0 at the above sites and I am using EMIT_ALL_SITES, and if a site is not callable, does the vcf output show a randomly selected variant?

Thanks