To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at

selecting/intersecting variants

Hello GATK,
I have a WGS experiment of 3 genotypes from my favorite non-model organism for which I called variants based on GATK best practices (a pain for non-models!). I am interested in finding variants that are found in both A and B genotypes as heterozygotes and are not in common with genotype C. I am using SelectVariants but when I check the output files in IGV I don't see the desired variants. I would appreciate any input on modifying my commands:

Get all varinats from C
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o C.vcf -sn C --excludeNonVariants --excludeFiltered;

Get heterozyguos varinats from A
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o A_het.vcf -select 'vc.getGenotype("A").isHet()' -env -ef;

Get heterozyguos varinats from B
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o B_het.vcf -select 'vc.getGenotype("B").isHet()' -env -ef;

Find het varinats common to A and B and are differeant from C
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta
-V A_het.vcf \
--concordance B_het.vcf \
--discordance C.vcf \
-o ABh_no_C.vcf

Thanks a lot for any suggestion/correction

Best Answer


  • Thank you very much for the checking my script Geraldine. I think I was misinterpreting what I saw in the genome browser but it is ok.

  • when comparing the vcf files C.vcf and ABh_no_C.vcf I don't expect overlapping variants, however this dosn't seem to be the case in this example:

    • taken from file ABh_no_C.vcf
      ch04 6518437 . T A 1381.02 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=0.553;ClippingRankSum=0.00;DP=87;ExcessHet=6.9897;FS=5.432;MLEAC=3;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=16.06;ReadPosRankSum=-2.600e-01;SOR=0.938;set=SNP GT:AD:DP:GQ:PGT:PID:PL 0/1:9,8:17:99:0|1:6518437_T_A:263,0,354 0/1:13,27:40:99:0|1:6518437_T_A:1090,0,465

    • taken from file C.vcf
      ch04 6518437 . T A 1381.02 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.553;ClippingRankSum=0.00;DP=29;ExcessHet=6.9897;FS=5.432;MQ=60.00;MQRankSum=0.00;QD=16.06;ReadPosRankSum=-2.600e-01;SOR=0.938;set=SNP GT:AD:DP:GQ:PGT:PID:PL 0/1:26,3:29:48:0|1:6518437_T_A:48,0,1436

    If I rerun SelectVariants with on the final ABh_no_C.vcf with --discordance C.vcf this SNP (and others) are removed. Any explanation?


  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Why do you expect no overlapping variants? Do you only want the variants that are discordant between the two VCFs? If so, SelectVariants with discordance is the correct way to go about it.

    Note, if you want the variants in only one VCF and not in the other, you will need to run SelectVariants with discordance twice. Once with one VCF as the discordant file, then again with the other VCF as the other discordant file. That will give you the variants in one VCF but not the other.

  • Thanks, Sheila. Yes, need only the het variants that are common to 2 samples and are different from a 3rd.
    When you say run twice do you mean something like A -discordance C and also B -discordance C ? I thought if I run as above then the concordance of A+B will give the variants that are different from C

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Ah, actually, if you only have 3 samples, you may be better off using CombineVariants with the set annotation. Have a look at this document. You can then use SelectVariants to choose only sites that are het in the two samples.


    P.S. You can also go about it the -discordance way. You are correct in having A+B VCF and getting the discordance from C.

  • I see, thank you. Do you see an obvious reason why my initial approach left some variants that should have been filtered out?
    I have to comment that these comparisons are lengthy and complex, especially since when starting from a single vcf with multiple samples. It would have been nice to have a tool to streamline variants comparisons (Venns ?) since I think that relates to most genetic studies.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hi @morgene, you're right that variant callset comparisons can get quite complex. The current GATK variant manipulation tools are mainly designed for general post processing and quality evaluation operations, not so much downstream analysis. There are other toolkits that offer such analysis functionality, and I would encourage you to find one that suits your needs.
Sign In or Register to comment.