Suggested improvement to SelectVariants
I have recently discovered this tool and it is fantastic!!!
I have been digging a bit in both the documentation of JEXL that you have on the site (which was very useful) and into the code especially VariantContext and Genotype. A lot of the power of this tool is only unlocked when you have access to the javadoc for these classes. I have generated the javadoc myself, but I think there are a lot of people out there who would have great use for this tool but are not able to download from github and generate their own javadoc. Also, when I generate the javadoc, I get bad formatting of the section that has been written in markdown (where you give usage examples): this is no big deal of course but if you could solve this centrally.....
The suggested improvement to the tool is the following is a tweak to the code to so that the tool can be used for selecting non-variants (as well as variants) after the samples have been eliminated from consideration.
Say you have a vcf file with two samples unknown1 and unknown2 and you want to know how many sites are invariant. Then you can identify non-variant sites in several ways:
alias selectVariants="java -Xms2G -Xmx8G -jar GenomeAnalysisTK-2.4-9/GenomeAnalysisTK.jar -T SelectVariants -R /Users/timothyh/home/seqDb/seq/gatkBundle_2.3/b37/bwa_5.10/human_g1k_v37.fasta --intervals /Users/timothyh/home/proj_amg/amg/annotation/cardioAssess/CardioCodingExons_b37+-2_GeneName_merged.interval_list --validation_strictness STRICT --num_threads 4 --variant out.vcf --out temp.vcf" alias count='grep -v "#" temp.vcf | wc -l'; selectVariants --selectTypeToInclude NO_VARIATION count #189485 selectVariants \ --select_expressions '!vc.isVariant()' count # 189485 selectVariants --select_expressions '(vc.getGenotype("unknown1").isHomRef() || vc.getGenotype("unknown1").isNoCall()) && (vc.getGenotype("unknown2").isHomRef() || vc.getGenotype("unknown2").isNoCall())' count # 189485
For the samples separately, the non-variants are:
# Non variants sites in first sample selectVariants \ --sample_name unknown1 \ --select_expressions 'vc.getGenotype("unknown1").isHomRef() || vc.getGenotype("unknown1").isNoCall()' count # 189504 >> higher than 189485 as expected # Non variants sites in second sample selectVariants \ --sample_name unknown2 \ --select_expressions 'vc.getGenotype("unknown2").isHomRef() || vc.getGenotype("unknown2").isNoCall()' count # 189539 >> higher than 189485 as expected
But, if you only consider the 'unknown1' sample and you then try to use the VariantContext object rather than the Genotypes to identify non-variant sites, you get an erroneous result. This is something you would want to do if you had a lot of samples and could not list all the samples in consideration.
# Limiting to unknown1 and trying to use the variantContext selectVariants \ --sample_name unknown1 \ --select_expressions '!vc.isVariant()' count # 189485 >> should be 189504
I took a look at the code of SelectVariants and saw that the VariantContext only get updated if the user has specified --excludeNonVariants. Which you would not want to call when you are interested in non-variants.
In summary, although I can see why the code of a tool called SelectVariants does this, I would like to suggest that there be an option to update the VariantContext even when --excludeNonVariants has not been called as this would expand the utility of the tool.
One last thing Is there a way to get the tool to print to stdout the number of records in the output?