Outputting genotypes after variants are called

Hi,
How do I output a sample's genotypes (including non-variant sites) after I have run the pipeline and obtained the VCF file?

Thanks!

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @cenny
    Hi,

    I am not sure I understand your question. The genotypes are represented in the VCF. Have a look at this article, which may help.

    What tool are you using to call variants? If you are using the GVCF workflow, you will need to use -allSites in GenotypeGVCFs to get non-variant sites as well as the variant sites.

    -Sheila

  • cennycenny Member

    Hi Sheila,

    Thanks for your reply. I am using HaplotypeCaller to call the variants. For example: in the VCF, I have this result:
    20 17148914 . GAGGA G,GAGGAAGGAAGGAAGGA 3023.73 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=90;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=59.95;QD=35.57;SOR=2.334;VQSLOD=-Infinity;culprit=FS GT:AD:DP:GQ:PL 1/2:0,24,51:75:99:3061,2082,2235,988,0,837
    Is there a way to extract the sequences, so that if I specify a longer region, for example chr20:17,148,914-17,150,000, it gives me two sequences: one with the reference genome plus the first allele, and the other with the reference genome plus the second allele?
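
    Editor's sketch: to see which allele strings a GT value such as 1/2 refers to, the record above can be decoded with a few lines of Python. This is only an illustration, not a GATK feature. The function name `genotype_alleles` is made up for this example, the INFO column is shortened, and the line is split on whitespace because the forum rendering does not preserve the tabs of a real VCF.

```python
def genotype_alleles(vcf_line, sample_index=0):
    """Return the allele strings named by one sample's GT field.

    Assumes a single VCF data line with a FORMAT column and at least one
    sample column. Missing calls (".") are returned as None.
    """
    fields = vcf_line.split()
    ref, alt = fields[3], fields[4]
    # Allele index 0 is REF; indices 1..n are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[format_keys.index("GT")]
    # "/" separates unphased alleles, "|" separates phased alleles.
    indices = gt.replace("|", "/").split("/")
    return [None if i == "." else alleles[int(i)] for i in indices]


# Record from the post above, with the INFO column shortened for readability.
record = (
    "20 17148914 . GAGGA G,GAGGAAGGAAGGAAGGA 3023.73 PASS "
    "AC=1,1;AN=2;DP=90 GT:AD:DP:GQ:PL "
    "1/2:0,24,51:75:99:3061,2082,2235,988,0,837"
)
print(genotype_alleles(record))
```

    For this record the sample carries both alternate alleles and no copy of the reference allele, which is why a single "alternate reference" sequence cannot represent it.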

  • cennycenny Member

    I just saw another person asking a similar question. So I think I can use SelectVariants followed by FastaAlternateReferenceMaker to get the sequences?

  • cennycenny Member
    edited November 2016

    I just tried SelectVariants to get the variant in this interval, chr20:17372520-17372693, and then used FastaAlternateReferenceMaker to get the sequence, but it only returns the reference sequence, not the variants. Shouldn't it return the allele with more evidence, which is the second allele?

    These are the commands I ran:

    java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R human_g1k_v37_decoy.fasta \
    -V recalibrated_variants.vcf \
    -o output_20_17372520_17372693.vcf \
    -L 20:17372520-17372693

    java -jar $HOME/GenomeAnalysisTK.jar \
    -T FastaAlternateReferenceMaker \
    -R human_g1k_v37_decoy.fasta \
    -o recalibrated_20_17372520_17372693.fasta \
    -L 20:17372520-17372693 \
    -V output_20_17372520_17372693.vcf

    vcf file output_20_17372520_17372693.vcf:
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S6
    20 17372551 . G GAGGAAGGAAGGAAGGA,GAGGAAGGAAGGAAGGAAGGA 626.73 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=19;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=59.85;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=29.87;SOR=4.977;VQSLOD=1.17;culprit=SOR GT:AD:DP:GQ:PL 1/2:0,4,7:11:99:664,307,281,182,0,147

    fasta file recalibrated_20_17372520_17372693.fasta :
    >1 20:17372520
    AGAAAAAAAGAAAGAGAGAGAGAAAGAGAGAGAGGAAGGAAGGAAGGAAGGAAGGAAGGA
    AGGAAGGAAGGAAGGGGAAGGGAAAGAAGGAAAGGAAGGAAGGGAAGGGAAGGAAGGAAA
    GGAAGGAAAGGAAGGAAAGGAAAGAGAAGCCATCCTGTCAAGGAGTCCCAAATA
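
    Editor's sketch: since the goal is one sequence per allele of the genotype, one possible workaround is to substitute each allele into the reference window directly. The Python below is a minimal illustration on toy data, not what FastaAlternateReferenceMaker does. The function names, the 7-base window and the toy variant are all hypothetical. It handles a single variant per window and does not handle missing calls (".").

```python
def apply_allele(ref_seq, window_start, pos, ref, allele):
    """Replace REF at 1-based position `pos` with `allele`.

    `ref_seq` is the reference sequence of a window whose first base has
    the 1-based coordinate `window_start`.
    """
    offset = pos - window_start
    if ref_seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference window")
    return ref_seq[:offset] + allele + ref_seq[offset + len(ref):]


def haplotype_sequences(ref_seq, window_start, pos, ref, alts, gt):
    """Return one sequence per allele listed in the GT string."""
    alleles = [ref] + alts
    indices = gt.replace("|", "/").split("/")
    return [
        apply_allele(ref_seq, window_start, pos, ref, alleles[int(i)])
        for i in indices
    ]


# Toy data (hypothetical): a 7-base window starting at position 100,
# with a two-allele insertion site at position 102 and genotype 1/2.
window = "TTGACCA"
sequences = haplotype_sequences(window, 100, 102, "G", ["GAA", "GAAAA"], "1/2")
for seq in sequences:
    print(seq)
```

    With several variants in one window, the edits would need to be applied from right to left so that earlier insertions do not shift later coordinates, and unphased genotypes would not say which alleles sit on the same haplotype.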

  • cennycenny Member

    "FastaAlternateReferenceMaker chooses an alternate allele at random." I don't think this is true. Can you please confirm the following?
    1. Will FastaAlternateReferenceMaker always output the reference sequence when there is more than one alternate allele?
    2. Will FastaAlternateReferenceMaker always output the alternate sequence when there is exactly one alternate allele?
    3. Does that mean none of the GT information is used to generate the sequence?
    4. Is there documentation for this behavior? It would be nice to have some. What about deletions?

    Thanks for the LeftAlignAndTrimVariants suggestions. I will try that.

    Issue · GitHub, by Sheila: issue number 1412, state closed, closed by vdauwera

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The tool is designed to choose an allele at random for multiallelic sites, but there is some logic to give up on sites that are too complex, including long indels, if I recall correctly. This might be causing what you're seeing here.

    You're correct that GT information is not used.

    The only documentation we have for this tool is here; it's not part of our core tools (we don't use it ourselves much) so documenting and improving it has not been a priority -- and to be frank it's unlikely to become a priority anytime soon.

  • cennycenny Member

    @Geraldine, Thanks for your comments.
    So, for sites with one alternative allele, will FastaAlternateReferenceMaker always output the alternative regardless of the GT?

  • SheilaSheila Broad InstituteMember, Broadie admin

    @cenny
    Hi,

    That is correct. The only caveat is that if you select a specific sample from a multi-sample VCF and that sample is not variant at the site, the tool will not output the variant. (So yes, the GT is taken into account in that case.)

    -Sheila
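
    Editor's sketch: the caveat above depends on whether the selected sample is "variant at the site", that is, whether its GT names at least one non-reference allele. A quick way to check this for your own records is shown below in Python. The helper name `is_variant_in_sample` is made up for this example, and this is not taken from the GATK source.

```python
def is_variant_in_sample(gt):
    """Return True if the GT string names at least one non-reference allele.

    Allele index "0" is the reference allele and "." is a missing call;
    any other index refers to an ALT allele.
    """
    indices = gt.replace("|", "/").split("/")
    return any(i not in ("0", ".") for i in indices)


for gt in ("0/0", "0/1", "1/2", "./.", "0|1"):
    print(gt, is_variant_in_sample(gt))
```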

  • cennycenny Member

    @Sheila, to clarify:
    If I do not specify a specific sample and only have 1 alternative allele, FastaAlternateReferenceMaker will always output this alternative allele instead of reference allele regardless of GT, correct?
