MergeVcfs and SortVcf in GATK4?

Hi there,

After variant calling with HaplotypeCaller (GATK4) and hard filtering, I merged the SNP and indel VCF files using MergeVcfs, as suggested here, in place of CombineVariants in GATK4. However, the variant count in the merged VCF file is not the sum of the variant counts in the SNP and indel VCF files. So I tried SortVcf, which produced a merged VCF file whose total variant count does equal the sum of the counts in the SNP and indel VCF files. From what I can see, SortVcf, unlike MergeVcfs, doesn't resolve overlapping SNPs and indels; is that right? If so, could you please let me know which type of variant (SNP or indel) takes precedence in the merged VCF file, and why?
Also, which tool (MergeVcfs or SortVcf) do you suggest, and why?

Thanks in advance

Answers

  • Ana_22 Member

    No response yet. Any suggestions, please?

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @Ana_22

    Sorry about the late response; I was out sick most of last week and we have been facing a high volume of questions.
    If the SNP and the indel share the same REF allele, MergeVcfs will produce one variant, and which one would depend on various factors, such as the quality of the variants called.
    For your purpose you can use CombineVariants with the --genotypemergeoption argument to prioritize the source of genotypes. Its documentation is here: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php#--genotypemergeoption

    Hope this helps.

    Regards
    Bhanu

  • Ana_22 Member

    Hi Bhanu,

    Thanks for your reply and hope you feel better.
    As you know, CombineVariants doesn’t exist in GATK version 4, which I’m using, so I used MergeVcfs instead to combine the SNP and indel VCF files. Is that the right approach? Also, I’m not sure what the best way to deal with overlapping variants is. Could you please advise me on this?

    I have another question, about using MergeVcfs (GATK4) to merge VCF files from different samples. As GATK recommends, the VCF files should first be sorted with SortVcf and then merged with MergeVcfs. However, when we ran SortVcf for this purpose, it produced a single combined VCF file, so how do we use MergeVcfs after SortVcf? In other words, we would need separate sorted VCF files, rather than a single combined VCF file, to apply MergeVcfs to them. Could you please help me with this?

    Thank you in advance

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @Ana_22

    Yes, using MergeVcfs is the right approach, although unlike CombineVariants it does not have an option to prioritize the source of genotypes. I am looking into which variant type takes precedence for overlapping SNPs and indels. To test the precedence, would you please send me the SNP and indel VCFs you are working with? Please follow this link for information on how to send us the files. I will get back to you with an answer soon.

    SortVcf does not resolve overlaps. It puts them in chromosome/position order, thereby preserving the same number of variants as the two individual VCFs.
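    The count-preserving behaviour described above can be illustrated with a small sketch. This is not how SortVcf is implemented; it is plain Python working on made-up (CHROM, POS, REF, ALT) tuples, and the function name `sort_records` and the contig list are assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical minimal record: (CHROM, POS, REF, ALT)
Record = Tuple[str, int, str, str]


def sort_records(records: List[Record], contig_order: List[str]) -> List[Record]:
    """Order records by contig (in header order), then by position.

    Nothing is dropped or collapsed, so overlapping records survive as-is.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    return sorted(records, key=lambda r: (rank[r[0]], r[1]))


# Made-up example data: one SNP and one indel start at the same position.
snps = [("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")]
indels = [("chr1", 100, "AT", "A"), ("chr1", 40, "G", "GA")]

merged = sort_records(snps + indels, contig_order=["chr1", "chr2"])

# The output count is the sum of the input counts.
assert len(merged) == len(snps) + len(indels)
for rec in merged:
    print(*rec, sep="\t")
```

    Both records at chr1:100 remain in the output as separate lines, which is why the total equals the sum of the two inputs.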

    You could run SortVcf on each VCF individually and then use MergeVcfs to merge the sorted VCFs.

    Regards
    Bhanu

  • Ana_22 Member

    Hi Bhanu,

    Thank you for following up on the post and for your response. I’ll send you the files as you suggested.
    You suggested running SortVcf on each VCF individually to create the sorted VCF files, and then merging them. With about 100 VCF files, that means running SortVcf 100 times! It’s hard to believe there isn’t a better solution. Could you please tell me what I can do in this situation?

    Thanks in advance and all the best

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @Ana_22

    That's a good question, and thank you for bringing it up. Once SortVcf has been run with multiple VCFs as input and you have a combined VCF, there is no further need to run MergeVcfs. MergeVcfs merges a bit more efficiently if the VCFs are already correctly sorted, so it should probably be used instead of SortVcf in that case. In your case the two tools use different techniques to do essentially the same thing, so my suggestion would be to use SortVcf.
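    For the many-files case, a single SortVcf run with repeated inputs might be assembled as in the sketch below. The helper `build_sortvcf_command` and the file names are hypothetical; `-I` and `-O` are the short input and output argument names the Picard tools accept in GATK4. The Picard documentation for SortVcf also notes that multiple inputs must have the same sample names, so that is worth checking before relying on this for per-sample files.

```python
from typing import List, Sequence


def build_sortvcf_command(inputs: Sequence[str], output: str) -> List[str]:
    """Build the argument list for one SortVcf run over many input VCFs."""
    if not inputs:
        raise ValueError("at least one input VCF is required")
    cmd = ["gatk", "SortVcf"]
    for path in inputs:
        cmd += ["-I", path]  # one -I per input file
    cmd += ["-O", output]
    return cmd


# Hypothetical file names; replace with your own paths.
inputs = [f"part_{i:03d}.vcf.gz" for i in range(1, 101)]
cmd = build_sortvcf_command(inputs, "combined.sorted.vcf.gz")

# Show the start and end of the command without printing all 100 inputs.
print(" ".join(cmd[:6]), "...", " ".join(cmd[-2:]))
```

    The sketch only builds the command; passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, assuming `gatk` is on the PATH.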

    We also ran some tests on our end and found that if SNP and indel positions overlap, MergeVcfs writes them as separate records in the output file. So looking at your specific dataset, where that is not the case, will help us understand this issue better. Looking forward to receiving the files from you.

    Regards
    Bhanu

  • Ana_22 Member

    Hi Bhanu,

    Thank you very much for your answer and sorry for the further question.
    Since running SortVcf on multiple VCF files both sorts them and creates a single combined VCF file, you suggested using only SortVcf to merge multiple VCF files, without MergeVcfs. Now I’m concerned about duplicated variants in the final combined VCF file produced by SortVcf. Could you please tell me whether SortVcf removes duplicated (identical) variants from the final combined VCF file? If it does not, how should we solve this issue?

    Thanks in advance

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @Ana_22

    You mentioned that you are trying to combine a SNP VCF and an indel VCF from the same sample. In that case you should not have any duplicates.

    Regards
    Bhanu

  • Ana_22 Member

    Hi Bhanu,
    Thank you. Actually, I asked two questions in the post. One was about merging SNP and indel VCF files, for which, as you mentioned, merging with SortVcf is the right approach and doesn’t produce duplicates.

    The other question was about merging several VCF files from several samples. Since running SortVcf on multiple VCF files (from multiple samples) both sorts them and creates a single combined VCF file, you suggested using only SortVcf to merge multiple VCF files, without MergeVcfs. Here I’m concerned about duplicated variants in the final combined VCF file produced by SortVcf. Could you please tell me whether SortVcf removes duplicated (identical) variants that occur across the input VCF files from the final combined VCF file? If it does not, how should we solve this issue?

    Thanks in advance

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @Ana_22

    In case you have several VCF files from several samples and you are concerned about duplicated variants in the final combined vcf file, then you should use CombineVariants.

    MergeVcfs and SortVcf will create duplicate records of identical variants from the input samples.

    I hope this helps.

    Regards
    Bhanu

  • Ana_22 Member

    Hi Bhanu,

    Thank you, but as I mentioned in the original question, I'm using GATK version 4, in which CombineVariants is not available. Given that, and given that MergeVcfs and SortVcf create duplicate records of identical variants, as you mentioned in your previous response, could you please tell me how we can combine multiple VCF files from different samples (each sample has its own VCF file) into a single VCF file?

    Thanks in advance

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @Ana_22

    Alright, let's recap:
    1) We do not have a tool in GATK4 that will combine variants from multiple VCFs and remove duplicates. To do that, we recommend using CombineVariants from GATK 3.8. CombineVariants also has the option to prioritize the source of genotypes. I will add, though, that it's an older tool, and although we support it, we are not accepting bug reports or feature requests for it.
    2) MergeVcfs and SortVcf will write duplicate/overlapping records from the input files as separate lines in the output file.
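    To see how many duplicate records point 2 leaves in a combined file, one could count records that share the same site and alleles. The sketch below is a plain-Python illustration, not a GATK feature; the function name `duplicate_sites` and the example lines are made up, and it treats two records as identical only when CHROM, POS, REF and ALT all match.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical duplicate key: (CHROM, POS, REF, ALT)
Key = Tuple[str, str, str, str]


def duplicate_sites(vcf_lines: Iterable[str]) -> List[Tuple[Key, int]]:
    """Return (key, count) for every key seen on more than one record line."""
    counts: Counter = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines (## meta lines and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        counts[(chrom, pos, ref, alt)] += 1
    return [(key, n) for key, n in counts.items() if n > 1]


# Made-up lines standing in for a combined VCF.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t100\t.\tA\tG\t48\tPASS\t.\n",   # same site and alleles: duplicate
    "chr1\t100\t.\tAT\tA\t60\tPASS\t.\n",  # overlapping indel: not a duplicate
]
print(duplicate_sites(example))
```

    For a real file, the lines could come from `open("combined.vcf")` (path hypothetical). The overlapping indel at the same position is not flagged because its REF and ALT differ.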

    Regards
    Bhanu
