
Filtering individual calls using CombineVariants

I was wondering whether there is a way to filter individual genotype calls when using CombineVariants to merge single-called VCF files. The behavior I would like is a hybrid between the KEEP_IF_ANY_UNFILTERED and KEEP_IF_ALL_UNFILTERED options of the -filteredRecordsMergeType argument. By this I mean that any site that is unfiltered in any input remains unfiltered in the output, but any genotype call that comes from a filtered input gets a filter annotation in the "FT" field of the genotype. A simplified example follows (extraneous columns removed from the sample files):

Input 1:

#CHROM POS       ID         (...) FILTER  FORMAT  SAMPLE1
1      11916764  rs79387574 (...) PASS    GT:DP   0/0:45

Input 2:

#CHROM POS       ID         (...) FILTER  FORMAT  SAMPLE2
1      11916764  rs79387574 (...) LowQ    GT:DP   0/1:3

Desired Output:

#CHROM POS       ID         (...) FILTER  FORMAT    SAMPLE1      SAMPLE2
1      11916764  rs79387574 (...) PASS    GT:DP:FT  0/0:45:PASS  0/1:3:LowQ

The reason for this request is that occasionally a single sample has a bad call at a site. Using "KEEP_IF_ALL_UNFILTERED" then filters out the other N-1 high-quality calls. At the other extreme, if we use "KEEP_IF_ANY_UNFILTERED" and only a single sample passes the filters, we introduce N-1 low-quality calls and assert that they pass our requisite filters. The requested hybrid method would keep all the information from the input samples and allow finer granularity.
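To make the requested behavior concrete, here is a minimal Python sketch of the hybrid merge for a single site. It is not part of GATK or CombineVariants; the names `Call` and `merge_hybrid` are hypothetical and exist only for illustration. It assumes every input is a single-sample record for the same site, that "PASS" or "." in FILTER means unfiltered, and that any FT value already present in an input can be overwritten.

```python
from typing import Dict, List, NamedTuple, Tuple

# Assumption: FILTER values that count as "unfiltered".
UNFILTERED = {"PASS", "."}


class Call(NamedTuple):
    """One single-sample VCF record, reduced to the columns used here (hypothetical type)."""
    chrom: str
    pos: int
    site_filter: str  # FILTER column, e.g. "PASS" or "LowQ"
    fmt: str          # FORMAT column, e.g. "GT:DP"
    sample: str       # sample name from the header
    values: str       # sample column, e.g. "0/0:45"


def merge_hybrid(calls: List[Call]) -> Tuple[str, str, Dict[str, str]]:
    """Return (site FILTER, FORMAT, {sample: values}) for one merged site."""
    if not calls:
        raise ValueError("need at least one input record")
    if len({(c.chrom, c.pos) for c in calls}) != 1:
        raise ValueError("all records must describe the same site")

    # Site level: unfiltered if ANY input is unfiltered; otherwise the
    # union of the failing filter names, in first-seen order.
    failed: List[str] = []
    for c in calls:
        if c.site_filter not in UNFILTERED:
            for name in c.site_filter.split(";"):
                if name not in failed:
                    failed.append(name)
    any_unfiltered = any(c.site_filter in UNFILTERED for c in calls)
    site_filter = "PASS" if any_unfiltered else ";".join(failed)

    # FORMAT: union of the input keys in first-seen order, with FT last.
    keys: List[str] = []
    for c in calls:
        for key in c.fmt.split(":"):
            if key != "FT" and key not in keys:
                keys.append(key)
    keys.append("FT")

    # Sample level: copy each input's site FILTER into that genotype's FT.
    samples: Dict[str, str] = {}
    for c in calls:
        fields = dict(zip(c.fmt.split(":"), c.values.split(":")))
        fields["FT"] = c.site_filter
        samples[c.sample] = ":".join(fields.get(k, ".") for k in keys)

    return site_filter, ":".join(keys), samples


if __name__ == "__main__":
    merged = merge_hybrid([
        Call("1", 11916764, "PASS", "GT:DP", "SAMPLE1", "0/0:45"),
        Call("1", 11916764, "LowQ", "GT:DP", "SAMPLE2", "0/1:3"),
    ])
    print(merged)
```

With the two example inputs above, the site stays PASS while SAMPLE2 carries LowQ in its FT field, which matches the desired output. A real implementation would also need to reconcile alleles, IDs, and INFO fields across inputs, which this sketch ignores.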

John Wallace



  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hmm, I see what you're trying to do but this would mix site-level and sample-level filter annotations in a way that could be problematic. In any case we don't recommend combining single-called VCFs for cohort analysis. You'll be much better off switching to the new workflow, where you generate single-called GVCFs then use GenotypeGVCFs followed by filtering to generate the highly granular cohort-aware callset you want.

  • johnwallace123 (Member)

    While I agree that the new workflow is better (and we're switching to it now), part of our pipeline involves sanity-checking the inputs at various stages.

    The data that comes to us is typically single-called, as it is streaming in from multiple sites (using multiple calling methods - typically UnifiedGenotyper version 2.x). Also, the input VCFs emit all bases in our target region, so the VCFs can almost be seen as an approximation to the gVCFs.

    I understand that it's not the optimal solution, but it seems that the hybrid method would avoid the extremes of including some bad data or throwing away a bunch of good data. Do you know how mixing site-level and sample-level filter annotations could be problematic? I can see that you would need to take care when reading the data, but perhaps there's something internal that makes it difficult.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Oh, I really don't recommend using the UG's "emit all sites" mode as a substitute for gVCFs. That method was very naive and will give inferior results.

    I'm not sure what you mean by "the extremes of including some bad data or throwing away a bunch of good data". With the new workflow, you have a way to know how good or bad any data point is, so you can weigh it appropriately in your analysis. This is good for all sorts of reasons that I don't have time to go into right now (but much of it is in the docs). And it's much better than deciding up front what to include or exclude based on single-sample results.
