Construct genome with gvcf or vcf file?

I have 100 rice accessions and their mapping files from DNA sequencing. My goal is to construct the genome sequence of a specific region for each rice accession. Could I just use the GVCF file from the initial variant calling? If so, how could I guarantee the quality of those SNP/indel sites? Or must I use the VCF file from the final genotype calling?

Answers

  • AdelaideR Member admin

    Hello @purod

    I assume you are thinking of using the tool FastaAlternateReferenceMaker to do this?

    I guess it depends on how you are going to end up using the reference.

    The differences between GVCF and VCF are outlined in this post

    The GVCF will not necessarily contain all of the variant calls, so if you have subpopulations represented in your 100 rice accessions, you may want to generate alternate references using GVCF for each subpopulation instead of grouping them all together.

    But it really depends on what you are using the reference for downstream. Perhaps if you could provide a little more information about that, I can provide some more feedback.

  • purod Member

    Thank you so much @AdelaideR,
    For the downstream analysis, I would like to accurately construct an alternate reference genome for some specific regions. To do that, I need to identify all reliable differences between the reference genome and each accession-specific genome. These differences come from the SNP and indel calls that the GATK pipeline makes from the DNA-sequencing data. My questions are listed below:
    1. Which one contains more differences, the GVCF or the VCF, and how much can I trust those differences?
    2. Will FastaAlternateReferenceMaker accept the GVCF format?
    3. As I have learned, FastaAlternateReferenceMaker has some caveats, especially that it cannot handle complex alleles. How can I deal with this?
    4. Some SNPs or indels might be missing. Could I use imputation to recover the missing information?
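
For readers comparing the two formats in questions 1 and 2, a minimal sketch may help. It relies on one fact about GATK GVCFs: reference blocks carry the symbolic `<NON_REF>` allele as their only ALT, while candidate variant sites list real ALT alleles alongside it. The function name `classify_gvcf_record` and the example lines are illustrative assumptions, not part of any GATK tool, and a "variant" label here means only a candidate site, not a final genotype call.

```python
def classify_gvcf_record(line):
    """Label one line of GVCF text as 'header', 'ref_block', or 'variant'.

    Hypothetical helper for illustration; it inspects only the ALT column.
    """
    if line.startswith("#"):
        return "header"
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # column 5 of a VCF record is ALT
    # A record whose only ALT allele is <NON_REF> describes a reference block.
    if alts == ["<NON_REF>"]:
        return "ref_block"
    # Anything else lists at least one real alternate allele (a candidate site).
    return "variant"


# Made-up example records, tab-separated as in a real file.
ref_block = "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=150\tGT:DP\t0/0:30"
candidate = "chr1\t151\t.\tA\tG,<NON_REF>\t50\t.\t.\tGT\t0/1"
print(classify_gvcf_record(ref_block), classify_gvcf_record(candidate))
```

Counting the two labels over a per-sample GVCF gives a rough sense of how much of the file is reference-confidence blocks and how much is candidate variation.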

  • purod Member

    Thank you for your response. It helps a lot. I will try to do it manually or take a look at plink.

  • purod Member

    In the end, I used bcftools to construct the personalized genomes.
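
The thread does not show which bcftools command was used (bcftools does provide a `consensus` subcommand that applies VCF variants to a FASTA sequence). As a rough illustration of the underlying idea only, the sketch below applies simple SNPs and indels to a reference region in plain Python. The function `apply_variants`, its arguments, and the toy data are assumptions for this example; it is not how bcftools is implemented, and it ignores genotypes, filters, and overlapping or symbolic alleles.

```python
def apply_variants(ref_seq, variants, region_start=1):
    """Return ref_seq with (pos, ref, alt) variants applied.

    pos is 1-based on the chromosome; region_start is the chromosome
    position of ref_seq[0]. Hypothetical helper for illustration.
    """
    seq = ref_seq
    # Work from the rightmost variant so indels do not shift the
    # coordinates of variants that still have to be applied.
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - region_start
        if i < 0 or i + len(ref) > len(ref_seq):
            raise ValueError(f"variant at {pos} lies outside the region")
        # The REF allele must match the sequence, otherwise the variant
        # was called against a different reference or overlaps another one.
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


# Toy region: one SNP, one insertion, one deletion.
toy_variants = [(2, "C", "T"), (5, "A", "AGG"), (7, "GT", "G")]
print(apply_variants("ACGTACGT", toy_variants))
```

Applying variants right to left is a deliberate choice: each edit changes only the sequence to its right, so the positions of the remaining variants stay valid without any offset bookkeeping.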
