How to get heterozygous SNPs with HaplotypeCaller?

Hi,

I am new to GATK and am trying to call SNPs from paired-end data in the mosquito. The mosquito genome is highly polymorphic.
I would like a VCF file that lists, for each position, every possible allele of a SNP. However, when I look at my VCF file I see only one possible allele per SNP, even though the site is often heterozygous.

Example:
R 86 . T A 73.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;ClippingRankSum=0.000;DP=8;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=53.84;MQRankSum=-2.369;QD=9.22;ReadPosRankSum=0.992;SOR=0.368 GT:AD:DP:GQ:PL 0/1:5,3:8:99:102,0,188

At this position, IGV shows a heterozygous SNP: some reads carry A and others carry T, like the reference. Is it possible to get this information from the VCF?

This is my command line:
java -Xmx8g -jar GenomeAnalysisTK.jar -nct 4 -T HaplotypeCaller -R ../GENOME/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa -I ../RESULTS/NJ3-5302_2016-09-30/MAPPING_NJ3-5302.sorted.bam -o ../RESULTS/NJ3-5302_2016-09-30/test.vcf -mbq 25 -gt_mode DISCOVERY -L 2R:1-500000

Thx,
Nicolas

Best Answer

  • nkaspric (France)
    Accepted Answer

    I am answering my own question :)

    For people who need this information: http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

    The VCF file does report heterozygosity :smile: Quoting from that article:
    5. How the genotype and other sample-level information is represented

    The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

    Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

    1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
    1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
    1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26

    Looking at that last column, here is what the tags mean:

    GT : The genotype of this sample at this site.
    For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    0/0 - the sample is homozygous reference
    0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    1/1 - the sample is homozygous alternate

    In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
    For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
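
To apply this to the record from the question, here is a minimal Python sketch that turns the GT indices back into bases. The function name `decode_genotype` is hypothetical and not part of GATK. The sketch also pairs the AD values with the alleles, on the assumption (not covered in the excerpt above) that AD lists per-allele read counts in REF, ALT order, which is how GATK documents that tag. It expects numeric AD values and is not a full VCF parser.

```python
import re


def decode_genotype(ref, alt, fmt, sample):
    """Translate GT indices into bases and pair AD counts with alleles.

    ref, alt    : REF and ALT columns of a VCF record (ALT may be comma-separated)
    fmt, sample : FORMAT column and the matching sample column
    """
    # Index 0 is the REF allele, 1 the first ALT allele, 2 the second, etc.
    alleles = [ref] + alt.split(",")
    fields = dict(zip(fmt.split(":"), sample.split(":")))

    # GT indices are separated by "/" (unphased) or "|" (phased); "." is a no-call.
    indices = re.split(r"[/|]", fields["GT"])
    called = [alleles[int(i)] if i != "." else "." for i in indices]

    # Assumption: AD holds one read count per allele, in REF, ALT order.
    depths = {}
    if fields.get("AD", ".") != ".":
        depths = dict(zip(alleles, (int(x) for x in fields["AD"].split(","))))
    return called, depths


# The record from the question: R 86 . T A ... GT:AD:DP:GQ:PL 0/1:5,3:8:99:102,0,188
called, depths = decode_genotype("T", "A", "GT:AD:DP:GQ:PL", "0/1:5,3:8:99:102,0,188")
print("/".join(called), depths)
```

For the record at position 86 this prints `T/A {'T': 5, 'A': 3}`: a heterozygous call, with 5 reads supporting the reference T and 3 supporting the alternate A, which matches the mix of reads visible in IGV.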

    Nicolas

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @nkaspric
    Hi Nicolas,

    I am happy you figured it out yourself! :smile:

    -Sheila
