Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

VCF to FASTA

I have a multisample VCF file generated from GATK and I need to convert it into FASTA format for performing Multiple sequence alignments.

Briefly, my protocol is:

1) Convert the multisample VCF to single sample VCF (using GATK, SelectVariants option)
2) Convert the single sample VCF to FASTA for each genotype (using GATK, FastaAlternateReferenceMaker option)
3) Sequence alignments

After two weeks of trying this out, I figured that all steps work fine except for Step 2 of the protocol above. GATK does actually convert the
VCF to FASTA but it doesn't take into consideration the genotype (GT) or allele depth (AD) information which is actually the deciding factor for the differences in alleles of the different genotypes. In the end, I have the same fasta file for all the genotypes.

I am not sure if this is the right way to approach this problem. Do you have any better suggestions for performing this task?

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @suda_ravindran
    Hi,

    Can you post your command for SelectVariants?

    Thanks,
    Sheila

  • suda_ravindransuda_ravindran HamburgMember

    Hi Sheila,

    I used this command for SelectVariants (I am just showing an example for extracting my sample called "G1.6"):

    java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R allaugust.okay.tr.fasta \
    -V allpops.vcf \
    -o G16.vcf \
    -sn G1.6

    This was my FastaAlternateReferenceMaker command:

    java -jar GenomeAnalysisTK.jar \
    -T FastaAlternateReferenceMaker \
    -R allaugust.okay.tr.fasta \
    -o G16.fasta \
    -V G16.vcf \

    I used the same command for all my other genotypes. I have 24 gentoypes in total, coming from four different populations. Each population has 6 genotypes.

    Thank you,
    Suda

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @suda_ravindran
    Hi Suda,

    Try adding -env and -trimAlternates to your command.

    -Sheila

  • nkobmoonkobmoo ParisMember

    Hi @suda_ravindran ,

    Have you solved the problem ? I encountered also a similar problem.

  • HovhannesSahakyanHovhannesSahakyan Tartu, EstoniaMember

    @Sheila said:
    @suda_ravindran
    Hi Suda,

    Try adding -env and -trimAlternates to your command.

    -Sheila

    Hello Sheila,

    I wanted to do the same with mtDNA. While using your suggestions I get the following error:

    INFO 17:09:16,431 GenomeAnalysisEngine - Strictness is SILENT
    INFO 17:09:16,534 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 17:09:16,690 GenomeAnalysisEngine - Preparing for traversal
    INFO 17:09:16,695 GenomeAnalysisEngine - Done preparing for traversal
    INFO 17:09:16,695 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 17:09:16,696 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 17:09:16,696 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 17:09:16,704 SelectVariants - Including sample 'HG00096'

    ERROR --
    ERROR stack trace

    org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: All samples must be diploid
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getDiploidLikelihoodIndexes(GATKVariantContextUtils.java:688)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineDiploidLikelihoodIndexesToUse(GATKVariantContextUtils.java:647)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.fixDiploidGenotypesFromSubsettedAlleles(GATKVariantContextUtils.java:1421)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.updatePLsSACsAD(GATKVariantContextUtils.java:1403)
    at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.subsetRecord(SelectVariants.java:1080)
    at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:854)
    at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:309)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: All samples must be diploid
    ERROR ------------------------------------------------------------------------------------------

    Can you please help to solve the problem?

  • HovhannesSahakyanHovhannesSahakyan Tartu, EstoniaMember

    I have an update on this issue. With GATK version 3.8.1 this error is eliminated. Nevertheless, all different one-sample mtDNA vcf files convert into exactly the same fasta sequences.

    Here are the commands:

    java -Xmx4G -jarGenomeAnalysisTK.jar -T SelectVariants -R hs37d5.fasta -V ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf -o HG00096.vcf -sn HG00096 -env -trimAlternates

    java -Xmx4G -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -L MT -R hs37d5.fasta -o HG00096.fasta -V HG00096.vcf

    Any help on this issue is appreciated!

    Issue · Github
    by bshifaw

    Issue Number
    3159
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    bshifaw
  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Hi @HovhannesSahakyan ,

    It seems like there is an issue with using SelectVariants on diploid samples for GATK V3.
    https://gatkforums.broadinstitute.org/gatk/discussion/7298/removealternates-not-working-on-haploid-vcfs

    But there is a fix for this on GATK4, could you try running the command using GATK4?
    https://github.com/broadinstitute/gatk/issues/1382

  • HovhannesSahakyanHovhannesSahakyan Tartu, EstoniaMember

    Dear @bshifaw ,

    Thank you very much for your answer!

    I wrote in my previous post that SelectVariants now work with GATK 3.8.1. The bug was corrected with that version already. Nevertheless, I ran it with GATK4 as you mentioned.

    gatk --java-options "-Xmx4G" SelectVariants -R hs37d5.fasta -V ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf -O HG00096.vcf -sn HG00096 --exclude-non-variants --remove-unused-alternates

    Then I ran FastaAlternateReferenceMaker with gatk3, because it seems there is no this tool in GATK4, is there?

    And the resulted fastas for the samples HG00096 and HG00097 look exactly the same as previously. So it does not matter with which version of GATK you'll run SelectVariants.

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    I checked the Tool Docs, there doesn't seem to be FastaAlternateReferenceMaker in GATK4.

    Ahh, I miss read your previous post. Sounds like you were able to move past the first step which was creating single sample vcfs using SelectVariants. Now you are on the second step which is creating a alternate reference fasta using FastaAlternateReferenceMaker but the generated fasta are identical amongst the different fasta samples.

    Can you confirm that single sample vcfs were properly created by SelectVariants by checking the contents of the VCF files for different sample names and variants.

  • HovhannesSahakyanHovhannesSahakyan Tartu, EstoniaMember

    Thank you very much for your feedback!

    Yes, you are right. I am now in the second phase.

    I confirm that single sample VCF files are different for different samples, and the variant positions and substitutions are correct.

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    @HovhannesSahakyan ,

    Would it be possible to share two of the vcfs files you are having trouble with so that we could try to debug it on our end?

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @HovhannesSahakyan

    Please follow this link for information on sharing your data: https://software.broadinstitute.org/gatk/documentation/article.php?id=1894

    Regards
    Bhanu

  • HovhannesSahakyanHovhannesSahakyan Tartu, EstoniaMember
    edited October 11

    Dear @bshifaw and @bhanuGandham ,

    Can't upload the file. After successful login it gives the following error "550 example_vcfs_hovhannes.tar.gz: Overwrite permission denied".

    Meanwhile I managed to get fastas.

    First, I follow @bshifaw 's previous suggestion to use gatk4 and make single-sample vcf files. Previously I told that gatk3.8.1 does it, but sorry I was wrong.

    gatk --java-options "-Xmx4G" SelectVariants -R hs37d5.fasta -V ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf -O HG00096.vcf -sn HG00096 --exclude-non-variants --remove-unused-alternates

    Then, I converted the single sample vcf into fasta with bcftools consensus. Vcf must be bgziped and indexed with tabix.

    samtools faidx hs37d5.fasta MT | bcftools consensus -I -o HG00096.fasta --sample HG00096 HG00096.vcf.gz

    FastaAlternateReferenceMaker of gatk3 still generates reference fastas for all vcfs (generated by gatk4).

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Thanks for sharing your solution!

Sign In or Register to comment.