
outputting genotype after variants are called

How do I output the sample's genotype (including non-variant sites) after I have run the pipeline and obtained the VCF file?


Best Answers


  • Sheila (Broad Institute) Member, Broadie ✭✭✭✭✭


    I am not sure I understand your question. The genotypes are represented in the VCF. Have a look at this article which may help.

    What tool are you using to call variants? If you are using the GVCF workflow, you will need to use -allSites in GenotypeGVCFs to get non-variant sites as well as the variant sites.
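As a rough illustration of how the genotype is stored in a VCF record, here is a minimal Python sketch that maps one sample's GT field back to allele sequences. The function name `genotype_alleles` and both records are made up for the example, and it assumes a tab-separated record in which a non-variant site is written with ALT `.` and GT `0/0`.

```python
def genotype_alleles(vcf_line, sample_index=0):
    """Return the allele sequences named by one sample's GT field."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    # Allele 0 is REF; alleles 1..n are the comma-separated ALT entries.
    alleles = [ref] + ([] if alt == "." else alt.split(","))
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[format_keys.index("GT")]
    # "/" separates unphased alleles, "|" phased ones; "." marks a no-call.
    indices = gt.replace("|", "/").split("/")
    return [None if i == "." else alleles[int(i)] for i in indices]


# Hypothetical records, not taken from any real call set.
variant = "\t".join(["20", "1000", ".", "A", "G,T", "50", "PASS", ".", "GT:DP", "1/2:30"])
non_variant = "\t".join(["20", "1001", ".", "A", ".", "50", ".", ".", "GT:DP", "0/0:30"])

print(genotype_alleles(variant))      # ['G', 'T']
print(genotype_alleles(non_variant))  # ['A', 'A']
```

The same lookup works for variant and non-variant records, so once non-variant sites are present in the VCF every position can be reported this way.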


  • cenny Member

    Hi Sheila,

    Thanks for your reply. I am using HaplotypeCaller to call the variants. For example: in the VCF, I have this result:
    20 17148914 . GAGGA G,GAGGAAGGAAGGAAGGA 3023.73 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=90;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=59.95;QD=35.57;SOR=2.334;VQSLOD=-Infinity;culprit=FS GT:AD:DP:GQ:PL 1/2:0,24,51:75:99:3061,2082,2235,988,0,837
    Is there a way to extract the sequences, so that if I specify a longer region, for example chr20:17,148,914-17,150,000, it gives me two sequences: one with the reference genome plus the first allele, and the other with the reference genome plus the second allele?
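What is being asked for here can be sketched in a few lines of Python: take a reference window, then substitute each allele named by the GT field to get one sequence per allele. This is only a sketch under several assumptions: the helper names are hypothetical, the flanking bases in the toy window are invented (only REF, ALT and GT come from the record above), the genotype is fully called, and a single variant falls inside the window.

```python
def apply_allele(ref_window, window_start, pos, ref_allele, allele):
    """Replace ref_allele at 1-based position pos with allele."""
    offset = pos - window_start
    if ref_window[offset:offset + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference window")
    return ref_window[:offset] + allele + ref_window[offset + len(ref_allele):]


def haplotypes(ref_window, window_start, pos, ref_allele, alt_alleles, gt):
    """Return one sequence per allele listed in the GT string."""
    alleles = [ref_allele] + alt_alleles
    indices = [int(i) for i in gt.replace("|", "/").split("/")]
    return [
        apply_allele(ref_window, window_start, pos, ref_allele, alleles[i])
        for i in indices
    ]


# Toy window: "TT" and "CC" are invented flanks around the REF allele GAGGA.
window = "TTGAGGACC"
haps = haplotypes(
    window, 17148912, 17148914,
    "GAGGA", ["G", "GAGGAAGGAAGGAAGGA"], "1/2",
)
print(haps)  # ['TTGCC', 'TTGAGGAAGGAAGGAAGGACC']
```

With several variants in the region, the substitutions would have to be applied from right to left so earlier offsets stay valid, and unphased genotypes (`/`) do not say which alleles at different sites belong on the same sequence.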

  • cenny Member

    I just saw another person asking a similar question. So I think I can use SelectVariants followed by FastaAlternateReferenceMaker to get the sequences?

  • cenny Member
    edited November 2016

    I just tried SelectVariants to get the variant in this interval, chr20:17372520-17372693, and then used FastaAlternateReferenceMaker to get the sequence, but it only returns the reference sequence, not the variants. Shouldn't it return the allele with more evidence, which is the second one?

    These are the commands I ran:

    java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R human_g1k_v37_decoy.fasta \
    -V recalibrated_variants.vcf \
    -o output_20_17372520_17372693.vcf \
    -L 20:17372520-17372693

    java -jar $HOME/GenomeAnalysisTK.jar \
    -T FastaAlternateReferenceMaker \
    -R human_g1k_v37_decoy.fasta \
    -o recalibrated_20_17372520_17372693.fasta \
    -L 20:17372520-17372693 \
    -V output_20_17372520_17372693.vcf

    vcf file output_20_17372520_17372693.vcf:
    20 17372551 . G GAGGAAGGAAGGAAGGA,GAGGAAGGAAGGAAGGAAGGA 626.73 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=19;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=59.85;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=29.87;SOR=4.977;VQSLOD=1.17;culprit=SOR GT:AD:DP:GQ:PL 1/2:0,4,7:11:99:664,307,281,182,0,147

    fasta file recalibrated_20_17372520_17372693.fasta :
    >1 20:17372520
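One way to confirm that the output really is identical to the reference is to compare the two sequences directly. The Python sketch below does that on FASTA text; the function names and the toy sequences are hypothetical, and it assumes both inputs cover the same interval with the same start.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def first_difference(a, b):
    """Return the 0-based index where a and b first differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Toy inputs standing in for the reference slice and the tool's output.
reference = read_fasta(">ref\nACGTACGT\n")["ref"]
unchanged = read_fasta(">1 20:17372520\nACGT\nACGT\n")["1 20:17372520"]
with_insertion = "ACGTAAGGACGT"

print(first_difference(reference, unchanged))       # None
print(first_difference(reference, with_insertion))  # 5
```

A result of `None` means no allele was substituted anywhere in the interval; any integer gives the offset from the interval start where the sequences begin to diverge.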

  • cenny Member

    "FastaAlternateReferenceMaker chooses an alternate allele at random". I don't think this is true. Can you please confirm?
    1. Will FastaAlternateReferenceMaker always output the reference sequence when there is more than one alternate allele?
    2. Will FastaAlternateReferenceMaker always output the alternate sequence when there is exactly one alternate allele?
    3. Does that mean none of the information in GT is used to generate the sequence?
    4. Is there documentation for this behavior? It would be nice to have some. What about deletions?

    Thanks for the LeftAlignAndTrimVariants suggestion. I will try that.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    The tool is designed to choose an allele at random for multiallelic sites, but there is some logic to give up on sites that are too complex, including long indels, if I recall correctly. This might be causing what you're seeing here.

    You're correct that GT information is not used.

    The only documentation we have for this tool is here; it's not part of our core tools (we don't use it ourselves much) so documenting and improving it has not been a priority -- and to be frank it's unlikely to become a priority anytime soon.

  • cenny Member

    @Geraldine, Thanks for your comments.
    So, for sites with one alternative allele, will FastaAlternateReferenceMaker always output the alternative regardless of the GT?

  • Sheila (Broad Institute) Member, Broadie ✭✭✭✭✭


    That is correct. The only caveat is that if you specify a particular sample to use from a multi-sample VCF, and that sample is not variant at the site, the tool will not output the variant. (So, yes, the GT is taken into account in that case.)


  • cenny Member

    @Sheila, to clarify:
    If I do not specify a specific sample and only have 1 alternative allele, FastaAlternateReferenceMaker will always output this alternative allele instead of reference allele regardless of GT, correct?
