VariantRecalibrator difference between calling variants per individual and jointly

TimHughes, Member
edited April 2013 in Ask the GATK team

Hi,

We are working on a small targeted capture. We first called variants sample by sample and then ran VariantRecalibrator. The results were not great, but we did see decent separation in the 2D plots for some features.

Now we have redone the variant calling, this time on all samples jointly. When we run VariantRecalibrator on the joint calls, we get very poor separation across the features/INFO variables.

This made us wonder about the differences between single-sample and multi-sample variant calling. In single-sample calling, the values written to the INFO field are obviously specific to one sample, whereas in a multi-sample VCF file they are the sum/average over all samples. Wouldn't this affect VariantRecalibrator, since the INFO field values specific to the variant sample(s) are "lost" in the averaging over all samples (most of which are not variant)?

In particular, is there not a loss of information in the INFO field values (which is critical to the VariantRecalibrator) when one does multi-sample variant calling?

Can this explain the change in separation we see between single and multisample variant calling?

Tim.
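
To make the averaging concern in the question concrete, here is a toy sketch in Python. It is not how GATK computes any annotation: the function name, the per-sample scores, and the carrier flags are all hypothetical, and the only assumption is the one stated in the question, namely that a site-level value is a plain average over every sample.

```python
from statistics import mean


def pooled_vs_carrier_mean(per_sample_values, is_carrier):
    """Return (mean over all samples, mean over carrier samples only).

    Hypothetical helper for illustration; not part of GATK.
    """
    if len(per_sample_values) != len(is_carrier):
        raise ValueError("values and carrier flags must have the same length")
    carrier_values = [v for v, c in zip(per_sample_values, is_carrier) if c]
    if not carrier_values:
        raise ValueError("at least one carrier sample is required")
    return mean(per_sample_values), mean(carrier_values)


# Made-up numbers: 100 samples, 2 of which carry the variant and show an
# extreme score (8.0); the 98 non-variant samples all score 0.5.
values = [8.0, 8.0] + [0.5] * 98
carrier = [True, True] + [False] * 98

pooled, carriers_only = pooled_vs_carrier_mean(values, carrier)
print(f"pooled over all samples: {pooled:.2f}")
print(f"carriers only:           {carriers_only:.2f}")
```

With these made-up numbers the pooled mean is 0.65 while the carrier-only mean is 8.0, which is the kind of dilution the question is asking about. Whether a given INFO annotation really behaves this way depends on how that annotation is defined.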

Answers

  • It just occurred to me that it is possible to supply VariantRecalibrator with the original BAM files that went into UnifiedGenotyper (when we did the multi-sample variant calling). Could VariantRecalibrator then compute the INFO fields for only the subset of samples that are variant at a site (excluding all the non-variant samples)?

  • Hi Geraldine,

    Thanks for the feedback. I think what may have happened here is that, for the reason you mention above (as well as another), the multi-sample calling is so much better that we are left with too few "bad" variants to see any separation on (this is a very small target with 100 samples sequenced to a ridiculous depth).

    Tim

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    OK, that makes sense; VQSR on small targets is tricky. Though I would say having too few bad variants is not a terrible problem to have :)