If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office on October 14, 2019, due to the U.S. holiday. We will return to monitoring the forum on October 15.

SelectVariants/getHomVarCount JEXL question

estif74estif74 Saint Paul, MN, USAMember

Hi there,

Had a JEXL syntax question that I was hoping to get some thoughts on. Let's say that I have a VCF file with 10 samples in it, S1 through S10. I would like to select the sites where among 5 specific samples, 3 are homozygous variant. So something like:

select ' vc.getHomVarCount(S1, S2, S3, S4, S5) >= 10 '

However this syntax doesn't work unfortunately. Is there a way to specify how to do something like this using vc.GetHomVarCount()?

Thanks as always!



  • estif74estif74 Saint Paul, MN, USAMember

    sorry my syntax example should have read:

    select ' vc.getHomVarCount(S1, S2, S3, S4, S5) >= 3 '

    Not enough coffee yet :-)

  • estif74estif74 Saint Paul, MN, USAMember

    Played around a bit with the subsetToSamples and sampleNames functions after reading through the variant context documentation, but couldn't get this to work. I'm guessing there may be an elegant way to accomplish this using these functions, but I think I figured out a bit of a workaround. Here's what I did:

    1. Run SelectVariants but subset the original VCF to include only the samples I'm interested in, ie -sn S1 \ -sn S2 \ etc. as well as a select statement -select 'vc.getHomVarCount()>=3'
    2. Then re-run SelectVariants using the concordance function to find the sites in the original VCF that are concordant with the output from the VCF generated in #1. The output in this VCF file will have all the samples in it: -V original.vcf \ -conc output1.vcf \ -o new.vcf

    Would still be curious about a more elegant solution from more of a learning perspective on how to better use JEXL.


  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    If anyone has an elegant solution to do this it would be @gauthier :)

  • adominguesadomingues GermanyMember

    @Geraldine_VdAuwera, I am trying to achieve something very similar to what @estif74 did:

    • multisample vcf with 8 samples
    • select only homozygous variants that exist in at least 3 samples

    Using @estif74 solution I did:
    java -jar ~/imb-kettinggr/common_bin/GATK/3.6/GenomeAnalysisTK.jar \
    -T SelectVariants \
    --variant AllSamples.vcf \
    --maxNOCALLnumber 3 \
    --selectTypeToInclude SNP \
    -select "(QUAL > 30.0 || AD > 20 ) && vc.getHomVarCount() >=3 " \
    -o AllSamples.snps.filt.vcf

    However there are still variants left that are homozygous in less than 3 samples:
    chr10 283484 . T C 97.10 QD AC=6;AF=0.500;AN=12;BaseQRankSum=2.210;ClippingRankSum=-1.029;DP=294;ExcessHet=3.0103;FS=7.510;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.328;QD=0.74;ReadPosRankSum=-1.335;SF=1f,2,3f,5f,6f,7;SOR=2.270 GT:GQ:DP:PL:AD . 0/1:54:35:54,0,741:29,6 0/1:99:54:168,0,1225:41,13 0/1:68:44:68,0,921:37,7 . 0/1:78:54:78,0,1191:45,9 0/1:80:35:80,0,667:28,7 0/1:99:37:304,0,523:24,13

    Am I missing something?


  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    Hi António,

    I just tried using -select 'vc.getHomVarCount() >= 2' and it works for me. I only get sites where at least 2 samples are hom-var.

    Can you try running -select '(QUAL > 30.0 || AD > 20 )' -select 'vc.getHomVarCount() >=3' so the two conditions are on different lines? I'm wondering if somehow the second statement is getting lost.


  • adominguesadomingues GermanyMember

    Hi @Sheila,
    sorry for belated reply. It is working now. I think it was indeed the way I constructing the query that lead to the unexpected results. Great tip to split it into multiple -select chunks.


Sign In or Register to comment.