If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office on October 14, 2019, due to the U.S. holiday. We will return to monitoring the forum on October 15.

selecting/intersecting variants

Hello GATK,
I have a WGS experiment of 3 genotypes from my favorite non-model organism for which I called variants based on GATK best practices (a pain for non-models!). I am interested in finding variants that are found in both A and B genotypes as heterozygotes and are not in common with genotype C. I am using SelectVariants but when I check the output files in IGV I don't see the desired variants. I would appreciate any input on modifying my commands:

Get all varinats from C
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o C.vcf -sn C --excludeNonVariants --excludeFiltered;

Get heterozyguos varinats from A
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o A_het.vcf -select 'vc.getGenotype("A").isHet()' -env -ef;

Get heterozyguos varinats from B
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o B_het.vcf -select 'vc.getGenotype("B").isHet()' -env -ef;

Find het varinats common to A and B and are differeant from C
GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta
-V A_het.vcf \
--concordance B_het.vcf \
--discordance C.vcf \
-o ABh_no_C.vcf

Thanks a lot for any suggestion/correction

Best Answer


  • morgenemorgene USMember

    Thank you very much for the checking my script Geraldine. I think I was misinterpreting what I saw in the genome browser but it is ok.

  • morgenemorgene USMember

    when comparing the vcf files C.vcf and ABh_no_C.vcf I don't expect overlapping variants, however this dosn't seem to be the case in this example:

    • taken from file ABh_no_C.vcf
      ch04 6518437 . T A 1381.02 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=0.553;ClippingRankSum=0.00;DP=87;ExcessHet=6.9897;FS=5.432;MLEAC=3;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=16.06;ReadPosRankSum=-2.600e-01;SOR=0.938;set=SNP GT:AD:DP:GQ:PGT:PID:PL 0/1:9,8:17:99:0|1:6518437_T_A:263,0,354 0/1:13,27:40:99:0|1:6518437_T_A:1090,0,465

    • taken from file C.vcf
      ch04 6518437 . T A 1381.02 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.553;ClippingRankSum=0.00;DP=29;ExcessHet=6.9897;FS=5.432;MQ=60.00;MQRankSum=0.00;QD=16.06;ReadPosRankSum=-2.600e-01;SOR=0.938;set=SNP GT:AD:DP:GQ:PGT:PID:PL 0/1:26,3:29:48:0|1:6518437_T_A:48,0,1436

    If I rerun SelectVariants with on the final ABh_no_C.vcf with --discordance C.vcf this SNP (and others) are removed. Any explanation?


  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin


    Why do you expect no overlapping variants? Do you only want the variants that are discordant between the two VCFs? If so, SelectVariants with discordance is the correct way to go about it.

    Note, if you want the variants in only one VCF and not in the other, you will need to run SelectVariants with discordance twice. Once with one VCF as the discordant file, then again with the other VCF as the other discordant file. That will give you the variants in one VCF but not the other.

  • morgenemorgene USMember

    Thanks, Sheila. Yes, need only the het variants that are common to 2 samples and are different from a 3rd.
    When you say run twice do you mean something like A -discordance C and also B -discordance C ? I thought if I run as above then the concordance of A+B will give the variants that are different from C

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin


    Ah, actually, if you only have 3 samples, you may be better off using CombineVariants with the set annotation. Have a look at this document. You can then use SelectVariants to choose only sites that are het in the two samples.


    P.S. You can also go about it the -discordance way. You are correct in having A+B VCF and getting the discordance from C.

  • morgenemorgene USMember

    I see, thank you. Do you see an obvious reason why my initial approach left some variants that should have been filtered out?
    I have to comment that these comparisons are lengthy and complex, especially since when starting from a single vcf with multiple samples. It would have been nice to have a tool to streamline variants comparisons (Venns ?) since I think that relates to most genetic studies.

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin
    Hi @morgene, you're right that variant callset comparisons can get quite complex. The current GATK variant manipulation tools are mainly designed for general post processing and quality evaluation operations, not so much downstream analysis. There are other toolkits that offer such analysis functionality, and I would encourage you to find one that suits your needs.
Sign In or Register to comment.