Suggested improvement to SelectVariants

Hi,

I have recently discovered this tool and it is fantastic!!!

I have been digging a bit in both the documentation of JEXL that you have on the site (which was very useful) and into the code especially VariantContext and Genotype. A lot of the power of this tool is only unlocked when you have access to the javadoc for these classes. I have generated the javadoc myself, but I think there are a lot of people out there who would have great use for this tool but are not able to download from github and generate their own javadoc. Also, when I generate the javadoc, I get bad formatting of the section that has been written in markdown (where you give usage examples): this is no big deal of course but if you could solve this centrally.....

The suggested improvement to the tool is the following is a tweak to the code to so that the tool can be used for selecting non-variants (as well as variants) after the samples have been eliminated from consideration.

Say you have a vcf file with two samples unknown1 and unknown2 and you want to know how many sites are invariant. Then you can identify non-variant sites in several ways:

alias selectVariants="java -Xms2G -Xmx8G -jar GenomeAnalysisTK-2.4-9/GenomeAnalysisTK.jar -T SelectVariants -R /Users/timothyh/home/seqDb/seq/gatkBundle_2.3/b37/bwa_5.10/human_g1k_v37.fasta --intervals /Users/timothyh/home/proj_amg/amg/annotation/cardioAssess/CardioCodingExons_b37+-2_GeneName_merged.interval_list --validation_strictness STRICT --num_threads 4 --variant out.vcf --out temp.vcf"

alias count='grep -v "#" temp.vcf | wc -l';

selectVariants --selectTypeToInclude NO_VARIATION
count
#189485

selectVariants \
--select_expressions '!vc.isVariant()'
count
# 189485

selectVariants --select_expressions '(vc.getGenotype("unknown1").isHomRef() || vc.getGenotype("unknown1").isNoCall()) && (vc.getGenotype("unknown2").isHomRef()  || vc.getGenotype("unknown2").isNoCall())'
count
# 189485

For the samples separately, the non-variants are:

# Non variants sites in first sample
selectVariants \
--sample_name unknown1 \
--select_expressions 'vc.getGenotype("unknown1").isHomRef() || vc.getGenotype("unknown1").isNoCall()'
count
# 189504 >>  higher than 189485 as expected

# Non variants sites in second sample
selectVariants \
--sample_name unknown2 \
--select_expressions 'vc.getGenotype("unknown2").isHomRef()  || vc.getGenotype("unknown2").isNoCall()'
count
# 189539 >> higher than 189485 as expected

But, if you only consider the 'unknown1' sample and you then try to use the VariantContext object rather than the Genotypes to identify non-variant sites, you get an erroneous result. This is something you would want to do if you had a lot of samples and could not list all the samples in consideration.

# Limiting to unknown1 and trying to use the variantContext
selectVariants \
--sample_name unknown1 \
--select_expressions '!vc.isVariant()'
count
# 189485 >> should be 189504

I took a look at the code of SelectVariants and saw that the VariantContext only get updated if the user has specified --excludeNonVariants. Which you would not want to call when you are interested in non-variants.

In summary, although I can see why the code of a tool called SelectVariants does this, I would like to suggest that there be an option to update the VariantContext even when --excludeNonVariants has not been called as this would expand the utility of the tool.

One last thing :) Is there a way to get the tool to print to stdout the number of records in the output?

Tim

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Tim,

    You want fries with that? Just kidding ;-)

    Regarding the javadocs, we decided not to provide them separately because we figured that anyone capable of understanding / making use of them is probably also capable of getting the source from github and building them. There are few if any other tools (than SelectVariants) for which analysts with no coding experience would find the javadocs helpful. That said perhaps we could make a doc article about the VariantContext methods, based on the javadocs, if you think that would be helpful?

    Regarding the formatting issue, this is a side-effect of how the "marked-down" bits are used for generating the Tech Docs. Eventually we will change how this is all formatted so that the formatting conflict goes away, but that project is not a priority so in the meantime you'll need to consult the Tech Docs if you would like to see the properly-formatted examples. Sorry for the inconvenience...

    Regarding your proposed tweak, we have no plans to implement that ourselves (again, it's an issue or prioritizing resources), but we would be very happy to look at a patch if you do it.

    Finally, I'm not sure what you mean by print out the number of records -- you mean like a summary that would say "we saw X records and selected Y records based on the input criteria"?

  • TimHughesTimHughes Member

    Hi Geraldine,

    Fries? yes please. I mean the burger is great, so I am sure the fries are super. Any chance of super sizing that? :)

    On the javadocs: I do see your point and if you did make the javadocs available you would probably get bombarded with basic questions precisely because there are a lot of people out there who need the functionality in this tool but are often biologists. In my experience, it is at this point in the process that the biologist come back into the picture and want to get their hands on the data.

    On the tweak: I will have a go.

    On outputing the number of records: I think you have got me right. The VCF manipulation tools print to screen the number of records in the input file but not the number of records in the output file. This means that one has to perform a count separately eg after SelectVariants. This is not big deal of course, but since you print to screen the number of input records why not do the same for number of output records.

    Tim

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Tim,

    We aim to please. Regular, curly or sweet potato?

    Yep, we're aware that the VCF manipulation stage is where the biologists come in and have the most need for help. And unfortunately it's probably where we have the least documentation right now, apart from the standard set of examples, so folks who don't have the programming background to understand the variant context code are sadly unable to leverage the full power of these tools. Eventually we'll tackle developing better docs on the variant context, but in the immediate future we're going to focus on expanding the set of use case examples.

    Good news on the output summary front: we agree that this would be valuable so I'm adding that to the team list of bite-size to-dos. I expect it will get done fairly quickly since it's a trivial feature to implement.

    Bon appetit!

Sign In or Register to comment.