How can I get a VCF file without repeated SNPs?

I called SNPs from the RNA-seq data of several samples.
That gave me several VCF files, so I used "MergeVcfs" to combine them into one big VCF file,
and I used "CollectVariantCallingMetrics" to evaluate it.
Then I found that this big VCF file contains all the SNPs from all my samples, including SNPs that share the same site.
So what I wonder is: can I get a VCF file in which every SNP has a unique site?
I know this may be a complicated question, because I think this kind of big VCF file contains SNPs at the same site whose genotypes differ between samples.
So if I make a one-site-one-SNP VCF file, the genotype information may end up wrong.
Or, to put my question more simply: I just want to delete the repeated SNPs to get the net number of SNPs.
Maybe my description is not very clear, but I am really trying to describe my question as well as I can.
Thanks a lot.
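
For the simplified version of the question (the net number of SNP sites), a minimal Python sketch is shown here. It assumes that a "site" means a unique (CHROM, POS) pair and that the merged file is plain tab-delimited VCF text; the function name `count_unique_sites` and the example records are made up for illustration. It only counts sites and does not attempt to merge genotypes across samples.

```python
def count_unique_sites(vcf_lines):
    """Return (total_records, unique_sites) for an iterable of VCF lines.

    Assumption: two records are "repeats" when they share CHROM and POS.
    """
    sites = set()
    total = 0
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        sites.add((chrom, pos))
        total += 1
    return total, len(sites)


# Hypothetical merged records: chr1:100 appears three times.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr1\t100\t.\tA\tT\t40\tPASS\t.",
    "chr2\t200\t.\tC\tT\t70\tPASS\t.",
]

total, unique = count_unique_sites(example)
print(total, unique)  # 4 2
```

For a real file, the same function could be called on an open file handle, for example `count_unique_sites(open("big.vcf"))`, where `big.vcf` is a placeholder name.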

Best Answers


  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi, can you show a few lines of what you have right now and of what you would like to get instead?

  • Zea1nfO Member

    I drew a quick draft for you:

    As in the picture: I have A.vcf, B.vcf, and C.vcf,
    and the big VCF is what I want. I have been working on this for the past few days. I think maybe I can use VariantsToTable to convert it into a tab-delimited file, and then use RStudio to get what I want.
    Maybe it is a complicated question.

    And I have two more questions.

    The first question: what is the difference between these two methods?
    Method A: HaplotypeCaller calls SNPs and indels directly.
    Method B: HaplotypeCaller in -ERC GVCF mode, followed by GenotypeGVCFs to call SNPs and indels.
    Maybe A is for RNA-seq and B is for DNA-seq?

    The second question: how is a multi-allelic SNP shown in the VCF?
    Maybe like the following?

    Thanks a lot :smile:
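
On the second question, the VCF format writes a multi-allelic site as a single record whose ALT column lists the alternate alleles separated by commas; genotype indices then refer to REF as 0 and the ALT alleles as 1, 2, and so on. A small Python sketch of reading such a record follows. The record itself, the three unnamed samples, and the helper names `parse_site` and `decode_genotype` are invented for illustration and are not GATK output.

```python
import re

BASES = {"A", "C", "G", "T"}


def parse_site(line):
    """Parse one VCF data line and flag multi-allelic SNP sites."""
    fields = line.rstrip("\n").split("\t")
    ref = fields[3]
    alts = fields[4].split(",")  # comma-separated ALT alleles
    alleles = [ref] + alts       # index 0 = REF, 1.. = ALT alleles
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "alleles": alleles,
        # More than one ALT allele, and every allele is a single base.
        "is_multiallelic_snp": len(alts) > 1 and all(a in BASES for a in alleles),
    }


def decode_genotype(gt, alleles):
    """Turn a GT string such as '1/2' into the bases it refers to."""
    return [alleles[int(i)] if i != "." else "." for i in re.split(r"[/|]", gt)]


# Hypothetical tri-allelic site with three diploid samples.
record = "chr1\t100\t.\tA\tG,T\t50\tPASS\t.\tGT\t0/1\t1/2\t0/2"

site = parse_site(record)
genotypes = [
    decode_genotype(sample.split(":")[0], site["alleles"])
    for sample in record.split("\t")[9:]
]

print(site["is_multiallelic_snp"])  # True
print(genotypes)  # [['A', 'G'], ['G', 'T'], ['A', 'T']]
```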

  • Zea1nfO Member

    By the way, my data come from diploid samples.

  • Zea1nfO Member

    Thank you very much for your patient explanation.
    So you mean the big.vcf should look like this:

    rather than like this:

    I am sorry, but I have another question:
    How can I get the multi-allelic SNPs from my big VCF?
    I think I can use CombineVariants to combine the VCFs of my several diploid samples, and then use SelectVariants with -selectType SNP -selectType MNP -restrictAllelesTo MULTIALLELIC to get a VCF that contains only multi-allelic SNPs.
    Am I right?

  • Zea1nfO Member

    Thank you very, very much. I will try this approach to count the multi-allelic SNPs in my samples.
    Have a nice day.

  • Zea1nfO Member

    I know you are on vacation now, so enjoy your days off.
    But I still want to ask you a question; maybe you can answer it when you are back at work.
    When I try to use CombineVariants in GATK3 to get the big VCF I described above, I don't know which mode I should choose.
    I found there are two options:

    I tried to understand the explanation of them, but I am still confused.
    I just want to know which mode I should choose, and what the difference between them is.
    Thanks a lot.
    By the way, Happy Thanksgiving! :)

  • Zea1nfO Member

    I got it, thank you very much.
    Have a nice day :)
