Individual VCF for each sample vs. a single VCF for all samples: do the output contents differ?

Nanda (Canada; Member)

Dear All,

I have performed variant calling for 24 samples using the GATK pipeline. I need some clarification on the following points.

1) If I generate a separate VCF file for each of the 24 samples individually, and also generate a single VCF file containing all 24 samples,

  • Are there any differences between the two kinds of output VCF?

  • If yes, what are the differences?

The reason I am asking is that I have family-level and symptom-level information for these 24 samples.

Family-level information for the 24 samples

FamilyA : Sample1, Sample2, Sample3

FamilyB : Sample4, Sample5, Sample6


FamilyH : Sample22, Sample23, Sample24

Symptom-level information for the 24 samples

Joint pain : Sample1, Sample4, Sample14, Sample15, Sample16, Sample17

Bleeding : Sample2, Sample5, Sample6

Symptom X : …..

For instance,

I would like to know whether the samples that are grouped together in the above scenario have any genetic variants in common. In other words, are there 'secondary' variants elsewhere in the exome (other than the X gene) that are common among patients who suffer from the same symptoms?

  • If I want to find common variants for the bleeding symptom, do the common variants differ between case1 and case2?

case1: I compare the individual VCF files (sample2.vcf, sample5.vcf and sample6.vcf) and keep the variants common to all three

case2: I extract just Sample2, Sample5 and Sample6 from the single VCF file containing all 24 samples

As in the above example, I would like to find common variants at the family level as well.
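To make case1 and case2 concrete, here is a minimal Python sketch rather than a GATK tool. It assumes uncompressed, tab-separated VCF text in which GT is the first FORMAT sub-field; the function names and the three toy records are invented for illustration. It shows one difference that holds for ordinary (non-GVCF) per-sample VCFs: a site missing from a per-sample file does not say whether the sample was homozygous reference or simply had no usable data, whereas a joint-called VCF records `0/0` or `./.` explicitly.

```python
def parse_vcf(text):
    """Return (sample_names, records) from uncompressed VCF text.

    Each record is ((chrom, pos, ref, alt), {sample: GT string}).
    """
    samples, records = [], []
    for line in text.strip().splitlines():
        if line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        # GT is assumed to be the first colon-separated sub-field.
        gts = {s: col.split(":")[0] for s, col in zip(samples, fields[9:])}
        records.append((key, gts))
    return samples, records


def carries_variant(gt):
    """True if the genotype has at least one non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def shared_variants(records, group):
    """case2: sites where every sample in `group` is non-reference."""
    return [key for key, gts in records
            if all(carries_variant(gts[s]) for s in group)]


def per_sample_sites(records, sample):
    """Sites a variant-only per-sample VCF would list for `sample`."""
    return {key for key, gts in records if carries_variant(gts[sample])}


# Hypothetical joint-called records for the "bleeding" group.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "Sample2", "Sample5", "Sample6"]
JOINT = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(HEADER),
    "\t".join(["1", "100", ".", "A", "G", "50", "PASS", ".", "GT", "0/1", "1/1", "0/1"]),
    "\t".join(["1", "200", ".", "C", "T", "50", "PASS", ".", "GT", "0/1", "0/0", "0/1"]),
    "\t".join(["1", "300", ".", "G", "A", "50", "PASS", ".", "GT", "0/1", "./.", "0/1"]),
])

samples, records = parse_vcf(JOINT)
group = ["Sample2", "Sample5", "Sample6"]

# case2: read the shared sites straight from the joint VCF.
case2 = shared_variants(records, group)

# case1: intersect the site lists of three variant-only per-sample files.
case1 = set.intersection(*(per_sample_sites(records, s) for s in group))

# Sample5 is absent at both 1:200 and 1:300 in a per-sample file, but the
# joint VCF distinguishes 0/0 (hom-ref) from ./. (no call).
sample5_status = {key[1]: gts["Sample5"] for key, gts in records}
```

In this toy example both cases return the same single site, because the per-sample lists are derived from the same genotypes. With real data the calls themselves can also differ, since joint genotyping uses evidence from all samples at once. For multi-allelic records, `carries_variant` does not check that the samples share the same ALT allele.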



  • Sheila (Broad Institute; Member, Broadie, Moderator)


    This article should answer your questions.


  • Nanda (Canada; Member)

    Thanks, Sheila, for pointing me to the article.

  • Nanda (Canada; Member)

    Dear Sheila,

    From the article you provided, I can see the significance of joint calling. Suppose I perform variant calling on all 24 samples together, producing a single VCF. My understanding is that there will then be records in which some samples are 0/0 (homozygous reference) while other samples in the same record carry a variant.

    As mentioned in the question above, if I want to identify common variants at the family level for FamilyA (i.e. Sample1, Sample2 and Sample3) using joint calling, which of the following is the more reasonable approach?
    1) Joint calling of just Sample1, Sample2 and Sample3 (output sample_1_2_3.vcf)
    2) Joint calling of all 24 samples together (output samples_1-24.vcf), then:
    a) extract Sample1, Sample2 and Sample3 from the main samples_1-24.vcf;
    b) remove the records that are 0/0 (homozygous reference) from the extracted sample_1_2_3.vcf
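Steps 2a and 2b can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how GATK does it: it assumes uncompressed, tab-separated VCF text with GT as the first FORMAT sub-field, and `subset_vcf`, `is_variant` and the toy records are hypothetical. It keeps only the requested sample columns and drops any record in which none of those samples carries a non-reference allele.

```python
def is_variant(gt):
    """True if the genotype has at least one non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def subset_vcf(text, keep):
    """Step 2a: keep only the sample columns named in `keep`.
    Step 2b: drop records where all kept samples are 0/0 or ./. ."""
    out, idx = [], []
    for line in text.strip().splitlines():
        if line.startswith("##"):
            out.append(line)  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        is_header = line.startswith("#CHROM")
        if is_header:
            # Sample columns start at index 9 in a VCF header line.
            idx = [i for i in range(9, len(fields)) if fields[i] in keep]
        cols = [fields[i] for i in idx]
        if is_header or any(is_variant(c.split(":")[0]) for c in cols):
            out.append("\t".join(fields[:9] + cols))
    return "\n".join(out)


# Hypothetical joint-called records; only four samples shown for brevity.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "Sample1", "Sample2", "Sample3", "Sample4"]
JOINT = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(HEADER),
    "\t".join(["1", "100", ".", "A", "G", "50", "PASS", ".", "GT", "0/1", "0/0", "1/1", "0/0"]),
    "\t".join(["1", "200", ".", "C", "T", "50", "PASS", ".", "GT", "0/0", "0/0", "0/0", "0/1"]),
    "\t".join(["1", "300", ".", "G", "A", "50", "PASS", ".", "GT", "0/0", "0/1", "./.", "0/0"]),
])

family_a = subset_vcf(JOINT, {"Sample1", "Sample2", "Sample3"})
family_a_lines = family_a.splitlines()
```

The record at position 200 disappears because only Sample4 is variant there. Records at 100 and 300 remain even though not all three family members carry the variant, so finding variants common to the whole family needs one further filter. The sketch also leaves cohort-level INFO annotations untouched, which a real subsetting tool would have to update.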

  • Sheila (Broad Institute; Member, Broadie, Moderator)


    The best thing to do is call variants on all 24 of your samples together. Then you can use SelectVariants to subset your final VCF to the samples of interest.


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Note that when you subset the samples, you can have the program remove lines where those samples are non-variant. I forget what the argument name is at the moment, but it should be in the SelectVariants tool documentation.
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    (So you don't have to do a separate run just for that)
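For reference, the subsetting described in the last few replies might be assembled as below. This is a hedged sketch that only builds the command line and does not run it. It assumes GATK4 argument naming, where the option for dropping non-variant records is, to the best of my knowledge, `--exclude-non-variants` (`-env` / `--excludeNonVariants` in GATK 3.x); check the SelectVariants documentation for your version. The file names and the helper `select_variants_cmd` are hypothetical.

```python
import shlex


def select_variants_cmd(reference, joint_vcf, samples, out_vcf):
    """Build a GATK4-style SelectVariants command as an argument list."""
    cmd = ["gatk", "SelectVariants", "-R", reference, "-V", joint_vcf]
    for name in samples:
        # One --sample-name argument per sample to keep.
        cmd += ["--sample-name", name]
    cmd += [
        "--exclude-non-variants",      # drop sites non-variant in the kept samples
        "--remove-unused-alternates",  # trim ALT alleles no kept sample carries
        "-O", out_vcf,
    ]
    return cmd


# Hypothetical file names for FamilyA.
cmd = select_variants_cmd(
    "reference.fasta",
    "samples_1-24.vcf",
    ["Sample1", "Sample2", "Sample3"],
    "sample_1_2_3.vcf",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming `gatk` is on the PATH.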