SNP calling for Cell lines - how does the ploidy affect HC

JulsJuls Member
edited November 2017 in Ask the GATK team

Hi all,

I am calling SNPs in various immortalised cell lines, which are known to be very unstable, so the ploidy is not known. Generally it should be diploid. So my question is: what can happen if the ploidy setting is not correct? Would HC miss SNPs? I see a relatively low overlap of common SNPs between two related cell lines and I want to make sure this low overlap is real.

Thank you in advance.

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Short response: Yes it will.

  • JulsJuls Member

    well, yes but how and how much? Can it lead to many missed SNPs?
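
One way to build intuition for the "how and how much" question is a toy calculation. The sketch below is not HaplotypeCaller's actual model (which works from per-read haplotype likelihoods); it is a simplified binomial model, with a made-up 1% error rate and hypothetical function names, that asks which alt-allele copy number best explains an observed read count under a given ploidy.

```python
from math import comb


def genotype_likelihoods(alt_reads, depth, ploidy, error=0.01):
    """Binomial likelihood of the alt-read count for each alt copy number 0..ploidy.

    Toy model only: every read is treated as an independent draw, and
    `error` is an assumed per-read miscall rate.
    """
    likelihoods = []
    for copies in range(ploidy + 1):
        frac = copies / ploidy
        # Expected fraction of alt-supporting reads, allowing for miscalls.
        p_alt = frac * (1 - error) + (1 - frac) * error
        likelihoods.append(
            comb(depth, alt_reads)
            * p_alt ** alt_reads
            * (1 - p_alt) ** (depth - alt_reads)
        )
    return likelihoods


def best_copy_number(alt_reads, depth, ploidy, error=0.01):
    """Alt copy number with the highest likelihood under the toy model."""
    lks = genotype_likelihoods(alt_reads, depth, ploidy, error)
    return max(range(len(lks)), key=lks.__getitem__)
```

With 5 alt reads out of 40, the diploid version of this toy model prefers zero alt copies (the only alternatives are 50% or 100% alt), whereas the tetraploid version prefers one copy (25% alt). That is the kind of low-fraction site a too-low ploidy setting could plausibly lose.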

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    Accepted Answer

    @Juls
    Hi,

    Perhaps the best thing is to try different ploidies (e.g. 2, 3, 4) and compare the outputs. HaplotypeCaller in GVCF mode is designed to be very sensitive, but the ploidy does play an important role. You can read more about the math in the Methods and Algorithms section.

    You may also try setting --standard_min_confidence_threshold_for_calling 0 to try and recover any low quality missed calls.

    -Sheila

    P.S. You may be interested in Mutect2, which is for somatic variant calling and does not assume any ploidy. It may be worth trying to call variants on each of your samples in tumor-only mode and seeing whether you still get so many differences.
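
A minimal sketch of the ploidy sweep suggested above. It only builds the command lines and does not run them. The file names are placeholders, and the flags use the GATK4 spelling (`--sample-ploidy`, `--standard-min-confidence-threshold-for-calling`), which is an assumption about the version in use; GATK3 writes the same options with underscores, as in the post above. Check the tool documentation for your release.

```python
import shlex


def haplotypecaller_commands(reference, bam, ploidies=(2, 3, 4), call_conf=None):
    """Build one HaplotypeCaller command line per ploidy (nothing is executed)."""
    commands = []
    for ploidy in ploidies:
        cmd = [
            "gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "--sample-ploidy", str(ploidy),
            "-O", f"hc.ploidy{ploidy}.vcf.gz",  # placeholder output name
        ]
        if call_conf is not None:
            # Lowering the calling threshold to try to recover low-quality calls.
            cmd += ["--standard-min-confidence-threshold-for-calling", str(call_conf)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in haplotypecaller_commands("ref.fasta", "cell_line.bam", call_conf=0):
        print(shlex.join(cmd))
```

The resulting VCFs can then be compared site by site to see how many calls appear or disappear as the ploidy changes.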

  • JulsJuls Member

    Thanks @Sheila I will give this a try.
    I thought about Mutect2, but it does need known SNPs as input in tumor-only mode (--dbsnp option), correct? Would it make sense to feed SNPs called by HC into Mutect2 if one does not have any known SNPs?

  • JulsJuls Member
    edited December 2017

    @Sheila
    Hi,

    Thank you so much for your help! Just to make sure: I have an immortalised cell line from a non-model organism (no known SNPs); ploidy should be 2 but it's very heterogeneous/unstable. So you would still recommend running HC rather than Mutect2, and using Mutect2 in tumour-only mode just to check for missed calls, rather than switching to Mutect2 for the complete analysis? May I ask why? Is it because Mutect2 is mainly made for tumor and matched normal pairs?

    Best & many thanks!!!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited December 2017

    @Juls
    Hi,

    Yes, I had said that originally. Honestly, I don't have any experience with what you are doing, so I cannot give hard recommendations :smiley: However, I do think it is worth trying Mutect2 in tumor only mode to see what is missed. You will have to do a lot of manual review to see if the results look good or if there are lots of false positives.

    I recommend sticking with HaplotypeCaller because you are not looking to detect somatic variants. Indeed, Mutect2 is optimized to run with tumor-normal matched pairs. If you run with tumor-only mode, you can pick up some low frequency artifacts that may not be picked up with HaplotypeCaller. However, you will need to do some extra work to determine whether you believe those extra calls. I was suggesting trying Mutect2 to get a sense of just how many extra variants are called. I would also suggest trying HaplotypeCaller with other ploidies to see if that makes a difference as well.

    -Sheila

    EDIT: Are you trying to find germline or somatic mutations? I was assuming germline in my answer.

  • JulsJuls Member
    edited December 2017

    Hi @Sheila ,

    Thank you again for your help!
    Well it's an immortalised cell line for a non-model organism and I am looking for any difference - any mutation - compared to a reference genome.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Juls
    Hi,

    Okay, well just keep in mind HaplotypeCaller and Mutect2 are designed for different purposes. In your case, it may be a good idea to use both callers to determine both germline and somatic variants.

    Good luck and let us know how things go :smile:

    -Sheila

  • JulsJuls Member
    edited June 2018

    Hi @Sheila,

    I was wondering how to go about the comparison between HC and Mutect2. I have a couple of questions:
    First, I have applied hard filtering on the HC results - are there similar suggestions for Mutect2 to make the results somewhat similarly filtered?
    Second, could hard filtering of the HC results eliminate low frequency variants? Should I be careful here?
    Third, I have compared the Mutect2 and HC results using vcftools vcf-compare just on the position level and I get around 90% overlap. The remaining 10% are made up by 2% variants detected by HC but not Mutect2 and 8% variants detected by Mutect2 but not HC. Note that I fed the unfiltered results for now as I am not sure how to do comparable filtering. So it appears that both callers are missing some variants in my case. Is this a surprising result?

    Thank you again for your continuous input and help!
    Best J

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited June 2018

    @Juls
    Hi J,

    First, I have applied hard filtering on the HC results - are there similar suggestions for Mutect2 to make the results somewhat similarly filtered?

    The filtering tool for Mutect2 is FilterMutectCalls.
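
For reference, a sketch of the two-step Mutect2 workflow in tumor-only mode, again only building the command lines. File names are placeholders and the flags assume GATK4; some older 4.0.x releases also required the sample name via `-tumor`, so treat this as an assumption to verify against your version.

```python
import shlex


def mutect2_tumor_only_commands(reference, bam, prefix="sample"):
    """Build the Mutect2 call and the FilterMutectCalls step (nothing is executed)."""
    unfiltered = f"{prefix}.mutect2.unfiltered.vcf.gz"  # placeholder names
    filtered = f"{prefix}.mutect2.filtered.vcf.gz"
    call = ["gatk", "Mutect2", "-R", reference, "-I", bam, "-O", unfiltered]
    filt = ["gatk", "FilterMutectCalls", "-R", reference, "-V", unfiltered, "-O", filtered]
    return [call, filt]


if __name__ == "__main__":
    for cmd in mutect2_tumor_only_commands("ref.fasta", "cell_line.bam"):
        print(shlex.join(cmd))
```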

    Second, could hard filtering of the HC results eliminate low frequency variants? Should I be careful here?

    If you used the standard filters we recommend, they are designed to be very sensitive, so you should not lose many true positives when eliminating the false positives. You may consider plotting the annotations as well to see if you can do better with the filters. Have a look at this document.
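
To see which annotation is responsible for removing a given call, something like the following sketch can help. The thresholds are an assumption: they are the commonly cited GATK starting points for SNP hard filtering and are not quoted in this thread, so substitute whatever you actually used. The function names are hypothetical.

```python
# Assumed SNP hard-filter thresholds; a record fails an annotation when the
# predicate is true. Replace these with the values used in your own run.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "SOR": lambda v: v > 3.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only entries are skipped."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def failed_filters(info, filters=SNP_HARD_FILTERS):
    """Names of the annotations on which this record would be filtered."""
    fields = parse_info(info)
    failed = []
    for name, fails in filters.items():
        if name not in fields:
            continue  # a missing annotation is not treated as a failure here
        try:
            value = float(fields[name])
        except ValueError:
            continue  # non-numeric or multi-valued entry
        if fails(value):
            failed.append(name)
    return failed
```

Tallying the returned names over all records shows which threshold removes the most calls, which is a useful complement to plotting the annotation distributions.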

    The remaining 10% are made up by 2% variants detected by HC but not Mutect2 and 8% variants detected by Mutect2 but not HC.

    So, Mutect2 is slightly more sensitive than HaplotypeCaller. This is expected because you have not filtered the Mutect2 output. In this case, you can first try filtering the Mutect2 output and comparing. Can you tell me what your end goal is?

    -Sheila

  • JulsJuls Member
    edited February 1

    @Sheila

    Thank you for your answer! Sorry for my late response. As this is a side project unfortunately, I haven't had time to work on it for a while.

    If you used the standard filters we recommend, they are designed to be very sensitive, so you should not lose many true positives when eliminating the false positives. You may consider plotting the annotations as well to see if you can do better with the filters. Have a look at this document.

    And thanks for the link to the document - I have plotted the annotations and adjusted the hard filtering options somewhat. I will set them a bit harsher for the final results.

    The filtering tool for Mutect2 is FilterMutectCalls.

    Thanks I've used this now.

    Can you tell me what your end goal is?

    I have an immortalised cell line - ploidy should be 2 but it's very heterogenous/unstable - so this is just an average!

    I wanted to try Mutect2 in tumor-only mode to see if HC missed many SNPs as the ploidy of the cell line as well as the cell population itself is very heterogenous. Now I would like to compare the Mutect2 and the HC results.

    So, Mutect2 is slightly more sensitive than HaplotypeCaller. This is expected because you have not filtered the Mutect2 output. In this case, you can first try filtering the Mutect2 output and comparing.

    No, I have used in both cases the unfiltered output so they were comparable (since I had no filtered results for the Mutect2 calls yet).

    Now I also have the filtered results (for the comparison I used the default hard filtering recommendations for HC and the default settings of FilterMutectCalls for Mutect2, ignoring the germline_risk and clustered_events filters). Now the differences are even larger: I get about 80% overlap, compared to 90% before. Most of the remaining 20% (unique to one caller) are detected by Mutect2. So is Mutect2 much more sensitive, or are these remaining 20% false positives? Would it make more sense to use the consensus between the two callers as my final high quality variants?

    I am having trouble with the filtering step for the Mutect2 calls though. I am not sure how to properly compare the two results when the filtering is so different. I have used FilterMutectCalls ignoring the germline_risk and clustered_events filters. germline_risk doesn't matter as I am looking for any variant compared to the reference. clustered_events kicked out a great many variants, but it's a cell line, so many variants are expected, and there isn't a comparable filter in HC (not that I am aware of...). Your hard filtering recommendations for HC kick out only about 10% of the variants (I also tried slightly harsher ones, which kicked out 20%).
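
"Ignoring" selected filters can be made explicit when reading the filtered VCF. The sketch below works on the FILTER column only; the helper name is hypothetical, and treating `.` (no filtering applied) as passing is an assumption that may not suit every comparison.

```python
IGNORED_FILTERS = frozenset({"germline_risk", "clustered_events"})


def passes(filter_field, ignored=IGNORED_FILTERS):
    """True if the record passes once the ignored filter names are disregarded."""
    if filter_field in ("PASS", "."):
        return True
    # FILTER holds a semicolon-separated list of the filters a record failed.
    return set(filter_field.split(";")) <= ignored
```

A record flagged only with `clustered_events` is kept, while one that also carries any other filter name is still dropped.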

    Btw, is there documentation that describes the FilterMutectCalls filters?

    Thank you so much for your help and input!!!

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @Juls

    In GATK4, the somatic calling and filtering functionalities are embodied by separate tools.

    Have you had a chance to take a look at the tutorial here?

    It provides some information on filtering.

    The FilterMutectCalls documentation is found here.

  • JulsJuls Member

    @AdelaideR

    Thanks for your answer and the link to the tutorial.

    In GATK4, the somatic calling and filtering functionalities are embodied by separate tools.

    I know this of course and I've used the appropriate tools. However, I am trying to compare HC and Mutect2 calls (filtered and unfiltered each) to dissect the difference in observed calls (see my previous comment). Any suggestions on this?

    Thanks,

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    You can use a tool called SelectVariants

    I guess I posted the previous comment because the function used to be combined in GATK3.

  • JulsJuls Member
    edited February 5

    @AdelaideR

    I am sorry - I am not sure I understand.

    So I have done the following:

    • I've called SNPs/Indels on a number of cell line samples using HC as well as Mutect2 (to compare the callers). Mutect2 was of course used in tumour-only mode as I cannot have a matched normal for a cell line.
    • (after using SelectVariants to split in SNPs/Indels first) I filtered the HC calls with VariantFiltration according to the hard filtering recommendations to get the filtered HC variants.
    • I also filtered the Mutect2 calls using FilterMutectCalls with default parameters to get the filtered Mutect2 variants.
    • I used vcf-compare to compare the calls (position-wise) and hence got the overlap between
      Mutect2 unfiltered and HC unfiltered as well as between Mutect2 filtered and HC filtered.
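
The position-wise comparison in the last step can be sketched in a few lines. This is not how vcf-compare is implemented; it is a hypothetical stand-in that reads uncompressed VCF text, keys each record on (CHROM, POS), and reports percentages relative to the union of both call sets, which may differ from the denominators vcf-compare uses.

```python
def variant_sites(vcf_text):
    """Set of (chrom, pos) pairs from uncompressed VCF text; header lines are skipped."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites


def overlap_summary(sites_a, sites_b):
    """Percent of all sites that are shared, unique to A, or unique to B."""
    total = len(sites_a | sites_b)
    if total == 0:
        return {"shared": 0.0, "only_a": 0.0, "only_b": 0.0}
    return {
        "shared": 100 * len(sites_a & sites_b) / total,
        "only_a": 100 * len(sites_a - sites_b) / total,
        "only_b": 100 * len(sites_b - sites_a) / total,
    }
```

The consensus call set discussed in this thread is simply `sites_a & sites_b`. Note that a position-level match ignores whether the two callers report the same alleles at that site.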

    The reason behind this is that I have an immortalised cell line (ploidy should be 2 but it's very heterogenous/unstable - so this is just an average!) and I was wondering if HC would miss calls - hence if the ploidy affects calling.

    Now I was wondering what your thoughts were on the overlap I got:

    Mutect2 unfiltered and HC unfiltered: I get about 90% overlap. The remaining 10% are made up by 2% variants detected by HC but not Mutect2 and 8% variants detected by Mutect2 but not HC.
    Mutect2 filtered vs. HC filtered: I get about 80% overlap. Most of the remaining 20% (unique to one caller) are detected by Mutect2.

    So is Mutect2 much more sensitive, with HC missing calls? Or are these remaining 20% false positives? Would it make more sense to use the consensus between the two callers as my final high quality variants?

    Thanks!!!

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin
    edited February 11

    @Juls It would be helpful to have a few more details. Mutect2 was recently updated, so which version of GATK4 are you running? Also, if you could please provide the commands for HC and Mutect2 with the filter settings, that would be very helpful. Thanks.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin
    edited February 11 Accepted Answer

    @Juls -

    I have heard back from the development team.

    The advice is that an immortalised cell line behaves like a tumor, so Mutect2 is the right tool to use since it does not make any ploidy assumption. This is consistent with what the user says, namely that Mutect2 is much more sensitive than HaplotypeCaller in this scenario.

    Also, it seems that we have no tool that can estimate the ploidy in a situation like this, so Mutect2 (which does not make a ploidy assumption) is the only reasonable tool to use.
