
How can I filter my callset if I cannot use VQSR / recalibrate variants?

Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer
edited September 2013 in FAQs

If you are sure that you cannot use VQSR / recalibrate variants (typically because your dataset is too small, or because there are no truth/training resources available for your organism), then you will need to use the VariantFiltration tool to manually filter your variants. To do this, you will need to compose filter expressions as explained here, here and here based on the recommendations detailed further below.

But first, some caveats

Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values depends heavily on the specific callset: how it was called, what the data was like, and so on.

HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.

In case you missed the key point there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.

In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both the number of samples and the size of the targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking of the filtering parameters.

So, here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you (a command-line sketch putting them together follows the two lists):

Filtering recommendations for SNPs:

  • QD < 2.0
  • MQ < 40.0
  • FS > 60.0
  • HaplotypeScore > 13.0
  • MQRankSum < -12.5
  • ReadPosRankSum < -8.0

Filtering recommendations for indels:

  • QD < 2.0
  • ReadPosRankSum < -20.0
  • InbreedingCoeff < -0.8
  • FS > 200.0
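
To make these concrete, here is a minimal sketch of how the thresholds above might be applied, assuming GATK 3.x-era command-line syntax (file names are placeholders, and as stressed above, you should expect to adjust the thresholds to your data). SNPs and indels are split out first because they get different filters:

    # Split the raw callset, since SNPs and indels get different filters
    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta -V raw_variants.vcf \
        -selectType SNP -o raw_snps.vcf
    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta -V raw_variants.vcf \
        -selectType INDEL -o raw_indels.vcf

    # Apply the SNP recommendations; matching sites are marked with the
    # filter name in the FILTER column, not removed from the file
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V raw_snps.vcf \
        --filterExpression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName "snp_hard_filter" \
        -o filtered_snps.vcf

    # Apply the indel recommendations
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V raw_indels.vcf \
        --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0" \
        --filterName "indel_hard_filter" \
        -o filtered_indels.vcf

You can also use one --filterExpression / --filterName pair per annotation if you want the FILTER column to tell you which test failed; either way, be aware that sites missing one of these annotations can trip up the expression evaluation.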

And now some more IMPORTANT caveats (don't skip this!)

  • The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

  • For shallow-coverage data (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.

  • The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set at 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts (see the sketch below). Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
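
As a purely illustrative sketch (the numbers here are hypothetical): for a whole genome with ~30x mean coverage and a standard deviation of roughly 5.5x, a 6-sigma cutoff lands at about 30 + 6 * 5.5 ≈ 63, giving an additional filter along these lines:

    # Hypothetical whole-genome numbers: mean coverage 30x, sd ~5.5x,
    # so a 6-sigma DP cutoff is approximately 63
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V filtered_snps.vcf \
        --filterExpression "DP > 63" \
        --filterName "high_DP" \
        -o filtered_snps_dp.vcf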

Finally, a note of hope

Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.

HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.

And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?

Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)

We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.

Geraldine Van der Auwera, PhD

Comments

  • pdexheimer Posts: 324 Member, GSA Collaborator ✭✭✭

    Thanks for this document, Geraldine - I hadn't appreciated before how tentative these recommendations are.

    we do want to hear from you about how we can best help you help yourself.

    Ask and ye shall receive! I don't have a very principled way to evaluate the quality of a callset. If there are enough variants, Ti/Tv and dbSNP % are very useful metrics (though hardly the complete picture). But I find that the data I have to hard filter is usually too small for even these numbers to be meaningful, and they never apply well to indels. Most of the time, I end up essentially just saying "Good Luck" to the investigator, which isn't especially helpful. Any advice you might be able to provide on evaluation would be very appreciated.

    Regarding the annotations, I would say that more information is helpful. For a long time (and I think this is still true), the only information about the annotations has been a one sentence description, barely more than is listed in the VCF. For things like DP and MQ this is fine, but some of the complex ones (I'm thinking particularly of InbreedingCoeff, HaplotypeScore, and FS here) would benefit from a much more detailed description. Personally, I like to see both the "less math" and "more math" descriptions available - maybe treat the "more math" track as Supplementary Methods. That way I can read the overview and get a good idea of what's going on, but if I ever want/need to really understand it I can dig into the math. The downside, of course, is that it doubles the documentation burden, but those annotations are complex enough that I think they would warrant it.

    Another note that would be helpful (which I think you alluded to) would be some reasoning for why the starting points here include some annotations but not others. For instance, why is InbreedingCoeff used for Indels but not SNVs? The explanation for DP is exactly the kind of thing I'm thinking of.

    Finally, I still encounter people (both in real life and online) who want to use what I think of as the "Sanger filters" - they set a minimum depth, require a certain number of reads supporting the alternate allele, and that's it. It might be nice to have an explanation of why those kinds of filters are suboptimal that I can point them toward.

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Oy, talk about being careful what you wish for ;-)

    Most of the time, I end up essentially just saying "Good Luck" to the investigator, which isn't especially helpful.

    We've mostly been limited to that as well in the past, but now we're hoping to ramp up our efforts to provide guidance. The hard part can be in identifying what to tackle first, so thanks for your thoughtful comments.

    Improving the explanations of what the annotations represent (the "less math" route) is probably the lowest-hanging fruit; while "more math" will probably consist of "go look at the code" until we get around to documenting it in the real world. Explaining why we recommend/discourage use of some filters and not others basically means writing down the experience accumulated by several people over years of analysis; it will take some time but we see the value in this so we will get there.

    Seriously though, thanks for your feedback, @pdexheimer. When people comment like you did and say "this part is really helpful", for example, that is immensely useful for defining priorities, but also for getting buy-in from team members to sit down and formulate this stuff explicitly.

    Geraldine Van der Auwera, PhD

  • pdexheimer Posts: 324 Member, GSA Collaborator ✭✭✭

    "more math" will probably consist of "go look at the code"

    Indeed, that's been my documentation for some time - one of the great benefits of open source. And the annotations are generally simple enough that you only need to look at one or two files to really grok them. But doing that led me to two very different experiences:

    1. A wonderful "Oh, I get it!" moment with InbreedingCoeff. Once you see the math (and understand it), it's a very elegant metric
    2. A morning of confusion with HaplotypeScore. A forum discussion with ebanks cleared it up (in fact, most of the "less math" for that metric is probably already in the forum), but there were some subtleties in the code that I just didn't catch in my reading

    And there are certainly users of GATK who are just not comfortable or experienced reading code. I can understand the need to prioritize; I would just encourage you to consider that not all metrics are equal, and there may be value in spending more effort on some of the more complex ideas. (Preaching to the choir, you do that already. Just trying to bump up the priority a bit.)

  • AdminTim (London) Posts: 19 Member

    Hi Folks,

    Are the recommended values suitable for use in the process of generating a bootstrapped 'truth' set for use in GATK, or should more stringent cutoffs be used for that? Is it possible to come up with some guidelines for that please? I suppose that in filtering for this purpose we don't want to include the 'twilight zone' that we might want for our final results. What metrics can be used to judge whether variants are 'gold standard'?

    Thanks

    Tim

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Hi @AdminTim,

    To bootstrap a truth set you probably want to be more stringent on the filtering. Our gold standards are typically variants that have been validated using orthogonal approaches such as arrays. We're just now starting a project to develop guidelines for this kind of thing; it will probably take a month before we have them worked out and ready to share publicly.

    Geraldine Van der Auwera, PhD

  • AdminTim (London) Posts: 19 Member

    Hi Geraldine - sorry, for some reason I've only just seen your reply. Thanks for the info - I'll keep an eye open for those guidelines.

    For the time being I thought I could do the following for my 48-sample, 100x coverage exome data (rough command sketch below):

    - Apply the above cutoffs to filter variations.
    - Include only variations where all samples are PASSes (i.e. not in an ambiguous/edge region where some samples were filtered).
    - Include only variations that occur in more than one sample.
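
    A rough sketch of what I mean, in GATK 3.x syntax with placeholder file names (the genotype-counting JEXL calls are my reading of the docs, so treat them as unverified):

        # Keep only sites that passed the hard filters
        java -jar GenomeAnalysisTK.jar -T SelectVariants \
            -R reference.fasta -V filtered_variants.vcf \
            --excludeFiltered \
            -o passing_variants.vcf

        # Keep only sites carried by more than one sample
        java -jar GenomeAnalysisTK.jar -T SelectVariants \
            -R reference.fasta -V passing_variants.vcf \
            -select "vc.getHetCount() + vc.getHomVarCount() > 1" \
            -o truth_candidates.vcf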

    Is this reasonable for use in an iterative bootstrapping process?

    Thanks

    Tim

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Hi Tim,

    It's a good start. Ideally you'll want to have a look at a bunch of your variants in context in IGV and check that the calls look reasonable. Obviously you can't do all of them, but it's worth doing as many as you can spare the time for, especially if you have no existing callset to compare them against.

    Geraldine Van der Auwera, PhD

  • ryanabashbash (Texas A&M University) Posts: 9 Member

    Hi all,

    I'm trying to bootstrap a truth set for an organism lacking an existing variant database, and I am looking for a sanity check. Here's the outline of what I've done, with a sketch of the recalibration commands after the list; please let me know if you have any suggestions:

    I have WGS (i.e. not exome capture) for 2 different samples from an organism with a ~800Mbp genome.

    1. Alignment: BWA

    2. Mark Duplicates: Picard

    3. Generate realignment targets: GATK RealignerTargetCreator

    4. Realign around indels: GATK IndelRealigner

    5. Determine depth of coverage, to set the depth threshold used in step 7: GATK DepthOfCoverage
    Median coverage for sample 1 and sample 2 is 28x and 26x, respectively, with 82% and 80% of bases above 15x.

    6. Initial variant discovery on both samples simultaneously: GATK HaplotypeCaller

    7. Hard filtration of variants to produce "initial truth set": GATK VariantFiltration
    Using the following filters:

    "DP < 10 || DP > 100" 
    "QD < 20.0"
    "MQ < 50.0"
    "FS > 10.0"
    "HaplotypeScore > 8.0"
    "MQRankSum < -12.5"
    "ReadPosRankSum < -3.0"
    "AC != 2 || AC != 4" # These samples are supposed to be fairly inbred so I'm considering heterozygous loci suspect.
    

    This reduces the initial ~3.7 million called variants across the two samples to 1.5 million variants (counted using GATK VariantEval). These 1.5 million variants were considered the "initial truth set" (and that value seems to be a reasonable ballpark value for a "pared down" variant set for the organism given reports by others).

    8. Base recalibration of initial BAMs using the "initial truth set": GATK BaseRecalibrator

    9. Generate Recalibrated BAMs with new quality scores: GATK PrintReads

    10. Base recalibration of Recalibrated BAMs using the "initial truth set": GATK BaseRecalibrator

    11. Compare the recalibration for both BAMs: GATK AnalyzeCovariates

    12. Perform variant discovery on both samples simultaneously using their Recalibrated BAMs: GATK HaplotypeCaller

    13. Repeat steps 7 through 12 until plots from AnalyzeCovariates converge.

    14. On convergence, perform steps 12 and 7 a final time to generate a "final truth set".

    15. Use the "final truth set" as a training and truth set for VQSR, assigning a prior of 6 (~75%) to the "final truth set": GATK VariantRecalibrator
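
    For concreteness, here is a sketch of one recalibration round (steps 8 and 9) for one sample, with file names simplified:

        # Step 8: model base quality errors against the current truth set
        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
            -R reference.fasta \
            -I sample1.realigned.bam \
            -knownSites initial_truth_set.vcf \
            -o sample1.recal.table

        # Step 9: write a new BAM with recalibrated quality scores
        java -jar GenomeAnalysisTK.jar -T PrintReads \
            -R reference.fasta \
            -I sample1.realigned.bam \
            -BQSR sample1.recal.table \
            -o sample1.recal.bam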

    Please let me know if you have any suggestions for improvement.

    Thank you,
    Ryan McCormick

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Hi Ryan,

    That all sounds perfectly reasonable, I don't really have anything to add. Out of curiosity, what size final truth set do you end up with? Do you see a big difference compared to the initial truth set?

    Geraldine Van der Auwera, PhD

  • ryanabashbash (Texas A&M University) Posts: 9 Member

    Hi Geraldine,

    Thanks for the response and feedback. I'll report back when I have a final truth set for the comparison (I'm currently iterating to obtain convergence).

    Thanks,
    Ryan

  • AdminTim (London) Posts: 19 Member

    Hi Folks,

    When using HaplotypeCaller I get mixed indel/SNP variant entries in the VCF, which is great, but what's the best bet for applying hard filtering to these? Since HC has a single method for SNPs and indels, is it possible to apply a single filter that will work well for both?

    Thanks

    Tim

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Hi @AdminTim,

    No, as indicated in the Best Practices documentation as well as the article above, we have found that SNPs and Indels have different error modes, so it's better to use different filters on them.

    Geraldine Van der Auwera, PhD

  • AdminTim (London) Posts: 19 Member

    Hi Geraldine,

    Ok thanks, but HC produces SNPs and indels that are on the same line like this:

    chr10 712670 . C CG,T 36963.32 . AC=79,4;AF=0.823,0.042;AN=96;BaseQRankSum=1.656;ClippingRankSum=0.692;DP=851;FS=0.000;InbreedingCoeff=0.4204;MLEAC=79,4;MLEAF=0.823,0.042;MQ=56.43;MQ0=0;MQRankSum=-0.272;QD=26.11;ReadPosRankSum=-0.778 GT:AD:DP:GQ:PL 0/1:8,7,0:15:99:193,0,418,238,433,813 1/2:0,7,8:15:99:1042,229,313,855,0,1039 1/1:0,14,0:14:45:439,45,0,479,45,647 1/1:0,25,0:25:78:838,78,0,912,78,1440 ......

    These are selected using the MIXED selectType with SelectVariants. How should I apply filtering to these?

    Thanks

    Tim

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Ah, fair question. I'm not sure, let me check with the team.

    Geraldine Van der Auwera, PhD

  • ryanabashbash (Texas A&M University) Posts: 9 Member
    edited December 2013

    Out of curiosity, what size final truth set do you end up with? Do you see a big difference compared to the initial truth set?

    The size of the truth set changed from ~1.551 million variants (~1.383 million SNPs and ~0.166 million indels) in the initial set to ~1.532 million variants (~1.385 million SNPs and ~0.146 million indels) in the final set after 4 rounds of recalibration; the number of SNPs increased by ~2,000 (~0.1% increase) and the number of indels decreased by about 20,000 (~12% decrease). Between the 3rd and 4th (final) iterations, the number of variants called was very similar (~0.01% difference), and the AnalyzeCovariates plots were very similar. Please let me know if any of that sounds out of the ordinary.

    Lastly, the AC filtering statement in my previous post doesn't logically work ("AC != 2 || AC != 4" is true for every value of AC, so it doesn't select what I intended). I ended up using "AC < 2 || AC > 4 || AC == 3", and I think "AC != 2 && AC != 4" would work as well.
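
    In context, the corrected statement slots into VariantFiltration something like this (placeholder file names):

        java -jar GenomeAnalysisTK.jar -T VariantFiltration \
            -R reference.fasta -V raw_variants.vcf \
            --filterExpression "AC < 2 || AC > 4 || AC == 3" \
            --filterName "suspect_het" \
            -o filtered_variants.vcf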

    Thanks,
    Ryan

  • Kurt Posts: 126 Member ✭✭✭

    Howdy,

    I know that this is the section on how to filter when you are SOL (you can't do VQSR), but I would like to comment on the use of QD as a filter for indels in this situation (as a testimonial or whatever). I am in a situation where I have to do single sample calling AND am dealing with a small capture region (essentially a gene panel using Agilent as the capture product on human samples). When doing a quick check of the data, I noticed that where the filters were applied to indels, quite a few of them had been filtered by QD < 2.0 (in fact, that was the only parameter from the above cut-offs that failed when the indel calls did not pass). Since QD is normalized by event length, it would seem that quite a few long indels will in the end fail this threshold. I believe all of the ones that I have come across have been confirmed via Sanger sequencing (hopefully I can have this confirmed after the new year, but I am reasonably sure that is the case from what I have heard). They generally have QUAL ~4000, depth ~300, MQ ~60, event lengths of 9 or greater, are hets (fair balance b/w alt/ref), etc.

    I did have one indel that failed QD that was definitely not confirmed by Sanger, but that particular one had a QUAL of about 90 and did not have a very good balance b/w alt and ref; the overall depth was ~300 and MQ ~60.

    This was a GATK 2.7-4 workflow using HC as the caller, bwa-mem as the aligner, and MiSeq 2x100 reads.

    In any case, I think in my situation I will end up taking out QD < 2.0 (or supplementing it with a conditional argument).

    Kurt

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Hi @Kurt,

    This is a very interesting observation, and we'd like to follow up on our end to evaluate whether we're doing the right thing with QD, or if we need to make the method smarter. Would you mind sending us a couple of snippets of good indels that fail the QD filter?

    Geraldine Van der Auwera, PhD

  • Kurt Posts: 126 Member ✭✭✭

    I'll forward your request to the people whose data it is :)

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Ah, fair enough :) Please assure them that we would treat any sensitive data with the utmost care for confidentiality.

    Geraldine Van der Auwera, PhD

  • Kurt Posts: 126 Member ✭✭✭

    @Geraldine_VdAuwera

    and ummm, yeah.

    "...Rare mutations, as most of these larger indels are, can be considered as identifiable information, so if our data is transferred outside of Hopkins to the Broad we may need her to sign a confidentiality agreement (data will not be shared, discarded after use, etc.)."

    So would your group be willing to do so?

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    If necessary, yes we would; but we're first going to check if we can generate some test data ourselves before initiating a paperwork-laden process, to make things easier on everyone. Eric is going to look into this during the vacation period, and we'll get back to you if we still need the Hopkins data. Thanks for playing middle man :)

    Geraldine Van der Auwera, PhD

  • Kurt Posts: 126 Member ✭✭✭

    Sounds good, Geraldine. You're welcome ;) and I'm sorry for giving Eric something to do during vacation!

  • Kurt Posts: 126 Member ✭✭✭

    Also @Geraldine_VdAuwera

    This was the command line for the HC.

    $JAVA_1_7/java -jar $GATK_DIR/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R $REF_GENOME \
    --input_file $CORE_PATH/$OUTPATH/BAM/$SM_TAG".bam" \
    --dbsnp /isilon/sequencing/GATK_resource_bundle/2.2/b37/dbsnp_137.b37.vcf \
    -L $BED_DIR/$BAIT_BED \
    -stand_emit_conf 0 \
    -stand_call_conf 0 \
    -A VariantType \
    -A DepthPerSampleHC \
    -A ClippingRankSumTest \
    -A MappingQualityRankSumTest \
    -A ReadPosRankSumTest \
    -A FisherStrand \
    -A GCContent \
    -A AlleleBalanceBySample \
    -A AlleleBalance \
    -A QualByDepth \
    --emitRefConfidence GVCF \
    -dt NONE \
    -et NO_ET \
    -K $KEY \
    -o $CORE_PATH/$OUTPATH/temp/$SM_TAG".HC.ERC.raw.OnBait.vcf"

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Thanks Kurt! Don't worry too much about Eric -- if it wasn't this he'd find something else to work on :)

    Geraldine Van der Auwera, PhD

  • ebanks Posts: 678 GATK Developer

    Any excuse not to have to take care of my kids.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Geraldine_VdAuwera Posts: 5,856 Administrator, GATK Developer

    Yeah, right - he's such a terrible dad that he has a pic of his kids as his profile picture ;-)

    You're not fooling anyone, @banks.

    Geraldine Van der Auwera, PhD

  • Kurt Posts: 126 Member ✭✭✭

    Happy New Year, @banks and @Geraldine_VdAuwera! Just checking in to see whether you've had time to confirm that you were seeing the same thing I was with larger indels and QD.

    Best Regards,

    Kurt

  • ebanks Posts: 678 GATK Developer

    Hi Kurt,

    I am actively working on it now.

    (I am a bit behind because those aforementioned kids broke my laptop at the beginning of vacation. No joke)

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Kurt Posts: 126 Member ✭✭✭

    Apparently they decided that you were going to take care of them after all ;)

  • ebanks Posts: 678 GATK Developer

    Hi Kurt, just wanted to let you know that the fix is going in over the next day or two. Thanks again for finding this big problem!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Kurt Posts: 126 Member ✭✭✭

    Hi Eric, good to hear and you're welcome! Glad to have contributed!

  • Kurt Posts: 126 Member ✭✭✭

    Oh, when deployed, will this be found in a nightly build or do you plan on doing a stable release?

  • ebanks Posts: 678 GATK Developer

    Nightly build until the 3.0 release.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
