
How can I get a VCF file without repeated SNPs?

I called SNPs on RNA-seq data from several samples, which gave me several VCF files.
I then used "MergeVcfs" to combine them into one big VCF file and evaluated it with "CollectVariantCallingMetrics".
I found that the big VCF file contains every SNP record from every sample, including SNPs that share the same site.
What I would like to know is whether I can get a VCF file in which each site appears only once.
I know this may be a complicated question, because records at a shared site can carry different genotypes.
So if I collapse them into a one-site-one-SNP VCF file, the genotype information may end up wrong.
Put more simply: I just want to remove the repeated SNPs to get the net number of SNPs.
My description may not be very clear, but I am doing my best to describe the problem.
Thanks a lot.
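
For the simplified version of the question (a net count of sites), one possible sketch in Python is to key each record on chromosome, position and reference allele, and pool the ALT alleles seen there. The function name `unique_sites` and the example records are made up for illustration. The sketch only counts sites and discards genotypes, so it does not produce a valid multi-sample VCF.

```python
from collections import OrderedDict

def unique_sites(vcf_lines):
    """Collapse records sharing (CHROM, POS, REF) and pool their ALT alleles."""
    sites = OrderedDict()
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alts = sites.setdefault((chrom, pos, ref), [])
        for allele in alt.split(","):
            if allele != "." and allele not in alts:
                alts.append(allele)
    return sites

# Hypothetical merged records: chr1:100 appears three times.
merged = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr1\t100\t.\tA\tT\t40\tPASS\t.",
    "chr1\t250\t.\tC\tT\t70\tPASS\t.",
]

sites = unique_sites(merged)
print(len(sites))  # number of distinct sites
for (chrom, pos, ref), alts in sites.items():
    print(chrom, pos, ref, ",".join(alts))
```

With a real file, the same function could be fed an open file handle instead of a list, since it only iterates over lines.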

Best Answers


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi, can you show a few lines of what you have right now and of what you would like to get instead?

  • Zea1nfO (Member)

    I drew a draft for you:

    As in the picture, I have A.vcf, B.vcf and C.vcf, and the big VCF is what I want.
    I have been working on this for days. I think I could use VariantsToTable to convert it into a tab-delimited file and then use RStudio to get what I want.
    Maybe it is a complicated question.

    I also have two more questions.

    The first question: what is the difference between these two methods?
    Method A: HaplotypeCaller calls SNPs and indels directly.
    Method B: HaplotypeCaller in -ERC GVCF mode, followed by GenotypeGVCFs, calls SNPs and indels.
    Is A perhaps meant for RNA-seq and B for DNA-seq?

    The second question: how is a multi-allelic SNP shown in the VCF?
    Maybe like the following?

    Thanks a lot :smile:
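
On the second question: in the VCF format, a multi-allelic site is written as a single record whose ALT column lists the alternate alleles separated by commas, and each genotype refers to alleles by index (0 for REF, 1 for the first ALT, 2 for the second, and so on). The record below is invented for illustration, and `decode_genotype` is a hypothetical helper name; this is a sketch of how such a record reads, not GATK code.

```python
def decode_genotype(ref, alt, gt):
    """Translate a GT string such as '1/2' into the allele bases it points to."""
    alleles = [ref] + alt.split(",")      # index 0 = REF, 1.. = ALT alleles
    sep = "|" if "|" in gt else "/"       # phased or unphased separator
    return [alleles[int(i)] if i != "." else "." for i in gt.split(sep)]

# Hypothetical multi-allelic SNP with three diploid samples.
record = "chr1\t100\t.\tA\tG,T\t50\tPASS\t.\tGT\t0/1\t1/2\t0/2"
fields = record.split("\t")
ref, alt = fields[3], fields[4]

# GT is the first FORMAT key when present, so take the part before any ':'.
calls = [decode_genotype(ref, alt, sample.split(":")[0]) for sample in fields[9:]]
print(calls)
```

Here the second sample (`1/2`) carries both alternate alleles, G and T, and no reference allele.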

  • Zea1nfO (Member)

    By the way, my data comes from diploid samples.

  • Zea1nfO (Member)

    Thank you very much for your patient explanation.
    So you mean the big.vcf should look like this:

    rather than like this:

    I am sorry, but I have another question:
    How can I get the multi-allelic SNPs from my big VCF?
    I think I can use CombineVariants to combine the VCFs from my several diploid samples, and then use SelectVariants with -selectType SNP -selectType MNP -restrictAllelesTo MULTIALLELIC to get a VCF that contains only multi-allelic SNPs.
    Am I right?
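
As a rough cross-check on whatever SelectVariants returns, the same idea can be sketched in plain Python: keep a record only if REF is a single base and ALT lists more than one single-base allele. The function name `is_multiallelic_snp` and the example records are assumptions made for illustration. The sketch ignores MNPs and treats symbolic or spanning-deletion alleles (`<NON_REF>`, `*`) as not counting, which may differ from how GATK classifies them.

```python
def is_multiallelic_snp(ref, alt):
    """True if REF is one base and ALT has at least two single-base alleles."""
    # Assumption: symbolic, missing and spanning-deletion alleles are not counted.
    alts = [a for a in alt.split(",") if a not in (".", "*", "<NON_REF>")]
    return len(ref) == 1 and len(alts) > 1 and all(len(a) == 1 for a in alts)

# Hypothetical combined records (header lines omitted).
records = [
    "chr1\t100\t.\tA\tG,T\t50\tPASS\t.",     # multi-allelic SNP
    "chr1\t250\t.\tC\tT\t70\tPASS\t.",       # bi-allelic SNP
    "chr2\t300\t.\tAT\tA,ATT\t30\tPASS\t.",  # multi-allelic indel
]

kept = []
for rec in records:
    fields = rec.split("\t")
    if is_multiallelic_snp(fields[3], fields[4]):
        kept.append((fields[0], int(fields[1])))

print(kept)
```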

  • Zea1nfO (Member)

    Thank you very much. I will try this approach to count the multi-allelic SNPs in my samples.
    Have a nice day.

  • Zea1nfO (Member)

    I know you are on vacation now, so enjoy your days off.
    I still have a question for you, though; maybe you can answer it when you are back at work.
    When I try to use CombineVariants in GATK3 to get the big VCF I described above, I don't know which mode I should choose.
    I found that there are two options:

    I tried to understand their explanation, but I am still confused.
    I just want to know which mode I should choose and what the difference between them is.
    Thanks a lot.
    By the way, Happy Thanksgiving! :)

  • Zea1nfO (Member)

    I got it, thank you very much.
    Have a nice day :)
