I am trying to use the GATK (3.6) SelectVariants tool, and I want to input several VCF files, preferably all as --variant (-V). The key here is that these have to be separate files, not one multi-sample VCF. Supposedly SelectVariants can take a list of VCF files, but it is unclear (to me) how. I tried -V input1.vcf -V input2.vcf ... and -V input1.vcf input2.vcf ..., and both throw an error. So how exactly can I provide a list of VCF files? Alternatively, which tool should I use to select variants from multiple VCF files? (I do not want to use the --concordance option, because in some cases I want to select variants present in a fraction of the input files, and I do not care much which specific file a given variant shows up in.) I'm grateful for any hints! Thanks!


  • agsm (Sweden; Member)

    Hi Sheila,
    Thanks for your quick reply. I have found some scattered posts (here and on Biostars) suggesting that SelectVariants can take a list of VCF files, so I wanted to give it a go. Initially, I did not want to combine the variants into one file, as it's apparently not good practice.
    I'll combine the VCFs, as it seems to be the only way to do what I want (if I want to stay within the GATK framework). Just to be sure: after combining the VCF files, the sample-level annotations (FORMAT) can be used for filtering. What about the variant-level annotations (INFO)?
    Thanks a lot for the clarification!
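
For readers unsure where the two annotation levels live: in a VCF data line, column 8 is INFO (one value per site) and column 9 is FORMAT, which names the colon-separated fields in each sample column that follows. The sketch below only illustrates that layout in plain Python. The function name `parse_record` and the sample names are made up for illustration, and it says nothing about which INFO annotations GATK keeps or recomputes when files are combined, which this thread does not settle.

```python
def parse_record(line, sample_names):
    """Split one tab-delimited VCF data line into site-level and sample-level annotations.

    Returns (info, samples): info maps INFO keys to string values (True for
    valueless flags); samples maps each sample name to a dict of FORMAT fields.
    """
    fields = line.rstrip("\n").split("\t")

    # Column 8 (index 7): INFO, semicolon-separated key=value pairs or bare flags.
    info = {}
    if fields[7] != ".":
        for entry in fields[7].split(";"):
            if "=" in entry:
                key, value = entry.split("=", 1)
                info[key] = value
            else:
                info[entry] = True

    # Column 9 (index 8): FORMAT keys; columns 10+ hold one value set per sample.
    format_keys = fields[8].split(":")
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        samples[name] = dict(zip(format_keys, column.split(":")))
    return info, samples


# Hypothetical record with two samples, for illustration only.
example = "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;DB\tGT:DP\t0/1:12\t0/0:18"
info, samples = parse_record(example, ["rep1", "rep2"])
```

Here `info` describes the site as a whole, while `samples["rep1"]` and `samples["rep2"]` carry the per-sample values, so a filter on a FORMAT field can differ between samples whereas a filter on INFO applies to the whole record.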

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    Hi A,

    Indeed we do not recommend combining VCFs. If you would like to analyze samples together, you should use the GVCF workflow. Have a look at this article.
    This article should convince you to use the GVCF workflow :smile:


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    To clarify, whether "combining VCFs" is a good idea or not depends entirely on what they contain and how they relate to each other. There are many possible scenarios: for some it's fine (some of our variant comparison tools require it), while for others it's a bad idea (for example, assembling a cohort from individual callsets).

  • agsm (Sweden; Member)

    Hi Sheila and Geraldine,

    Thanks for your comments. I work with RNA-seq data, and did not want to use a workflow which is not fully recommended (gVCF). It's not my own data, so I prefer to stick to your best practice guidelines. I think for my question though it's legit to combine the VCFs for downstream analysis, as I am only interested in the presence / absence of genotype calls in different samples (biological replicates of an experimental treatment).
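
For anyone wanting the "present in a fraction of input files" selection without merging anything first, here is a minimal sketch in plain Python. It is not a GATK feature, and the function names are made up for illustration. It assumes each per-sample VCF lists only sites where that sample carries a variant, that the same variant is written identically in every file (no indel normalization is done), and it ignores genotypes, filters, and all annotations.

```python
from collections import Counter


def variant_keys(vcf_lines):
    """Return the set of (CHROM, POS, REF, ALT) keys found in one VCF's lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines, ## metadata lines, and the #CHROM header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic sites list ALT alleles comma-separated; count each one.
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def variants_in_fraction(vcfs, min_fraction):
    """Return sorted variant keys seen in at least min_fraction of the inputs.

    vcfs is a list of iterables of lines, one per VCF (open file handles work).
    """
    counts = Counter()
    for lines in vcfs:
        # A set per file means each file contributes at most 1 per variant.
        counts.update(variant_keys(lines))
    needed = min_fraction * len(vcfs)
    return sorted(key for key, n in counts.items() if n >= needed)
```

With real files, something like `variants_in_fraction([open(p) for p in paths], 0.5)` would work, where `paths` is your own list of VCF paths. The returned positions could then be used as a site list for whatever downstream selection you prefer.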


