Merging population vcf files without gvcf

Hi Everyone,

I have two separate raw VCF datasets processed with GATK version 3.5 (one from a population of ~2,600 and one from a population of ~160). Since the upstream data cleaning and processing were done elsewhere, I do not have access to the gVCF files. To combine these two populations, is it possible to simply merge the VCF files using existing tools instead of joint genotyping via gVCFs? Do you think this will introduce batch effects, and if so, how can I minimise them? I could rerun everything from scratch (from the BAM files), but that would take a lot of computational resources since these are whole-genome data. Feel free to contact me if anything in my question is unclear.

Sincerely,

Myo

Answers

  • SChaluvadi Member, Broadie, Moderator admin

    @MyoNaung Thanks for your question! I will look into your question and get back to you with some suggestions to help with merging your two datasets!

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin
    Accepted Answer

    @MyoNaung, it is difficult to determine whether combining your VCFs will suit your research question. Could you please provide some more information about your main objective?

    Also, which programs were used to generate these VCFs? Different versions of GATK can produce different numbers of variant calls.

    If you are trying to tease out the true variants from your samples, you would want to have the gVCFs for a more accurate estimate of reference confidence. This will enable joint genotyping analysis downstream. Otherwise, your confidence intervals will be highly correlated with the sample number, which may introduce false-positive variant calls in the cohort with fewer samples.

    Here is some more information about VCF versus gVCF files

    If you are concerned about computational resources, it is possible to use the Broad FireCloud service to run this analysis, which is rather straightforward. You can find out more about FireCloud [here](https://software.broadinstitute.org/firecloud/). New users get free credits on Google Cloud to get started.
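
To make the point about reference confidence concrete, here is a minimal Python sketch (not GATK code, and not anything used in this thread) of what a plain merge of two variant-only call sets has to do at a site that appears in only one of them. The site names, sample names, genotypes, and the helpers `naive_merge` and `alt_allele_frequency` are all hypothetical toy values chosen for illustration. Without gVCF reference blocks, the merge cannot tell "homozygous reference" apart from "no data", so the allele frequency depends on which assumption is made.

```python
from typing import Dict, Optional

# Toy call sets (hypothetical): site -> {sample: genotype}.
# A variant-only VCF lists a site only if some sample carries an ALT allele.
CallSet = Dict[str, Dict[str, str]]

large: CallSet = {
    "chr1:100": {"L1": "0/1", "L2": "0/0", "L3": "0/0"},
    "chr1:200": {"L1": "0/1", "L2": "0/0", "L3": "1/1"},
}
small: CallSet = {
    "chr1:100": {"S1": "0/1"},  # chr1:200 is absent: hom-ref or no coverage?
}


def samples_of(calls: CallSet) -> list:
    """Collect every sample name that appears in a call set."""
    return sorted({s for row in calls.values() for s in row})


def naive_merge(a: CallSet, b: CallSet, missing: str = "./.") -> CallSet:
    """Union the sites of both call sets; fill absent genotypes with `missing`."""
    merged: CallSet = {}
    for site in sorted(set(a) | set(b)):
        row = {}
        for calls in (a, b):
            for sample in samples_of(calls):
                row[sample] = calls.get(site, {}).get(sample, missing)
        merged[site] = row
    return merged


def alt_allele_frequency(row: Dict[str, str]) -> Optional[float]:
    """ALT allele frequency over called alleles only ('.' is skipped)."""
    alleles = [al for gt in row.values() for al in gt.split("/") if al != "."]
    if not alleles:
        return None
    return alleles.count("1") / len(alleles)


merged = naive_merge(large, small)
print(merged["chr1:200"]["S1"])                    # ./.
print(alt_allele_frequency(merged["chr1:200"]))    # 0.5 (S1 treated as no-call)

as_ref = naive_merge(large, small, missing="0/0")
print(alt_allele_frequency(as_ref["chr1:200"]))    # 0.375 (S1 assumed hom-ref)
```

Neither fill value is justified by the data alone; joint genotyping from gVCFs avoids the guess because each sample carries its own reference-confidence record for the site.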

  • MyoNaung Member
    Hi @AdelaideR,

    Thanks for the responses. The objective of my research is to extract specific genes from these datasets for downstream population genetic analysis using SNPs. Therefore, I'm a bit worried about false-positive variants due to the difference in sample sizes. Both datasets were processed by the same GATK pipeline. Do you think merging the VCF files will introduce a large bias, given that we are not comparing between the datasets?

    Myo
  • MyoNaung Member
    Thanks very much for the kind and very useful suggestion. I will create gVCFs from the BAMs.
  • MyoNaung Member
    One more question: since I will be calling variants on a non-diploid organism (Plasmodium falciparum), do you recommend using UnifiedGenotyper, or HaplotypeCaller as usual?