Benchmark SNV calls between MuTect2 (GATK 3.7) and CaVEMan?

alaabadre, France (Member)
edited April 2018 in Ask the GATK team

Hello,

I am trying to benchmark MuTect2 against CaVEMan. I invite anyone who has compared the two to share the overlap between their call sets. My results show an overlap of 20%. I would also appreciate it if the admins could shed some light on this and share their thoughts (it is urgent for me).

I followed the Best Practices steps: I aligned with BWA-MEM, cleaned the alignments with Picard, and ran BQSR on the BAM files. The same BAM files were used for MuTect2 and CaVEMan. I used public data from TCGA as well as my own data (22 samples in total), and the two callers consistently share approximately 20% of their calls (+/- 1%, rarely more).

To calculate the overlap, I used VCFtools (vcf-compare and vcf-isec).
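As a rough cross-check on such numbers, an exact-match overlap can be sketched in a few lines of Python. This is only an illustration, not what vcf-compare or vcf-isec do internally: the function names (`parse_vcf_line`, `read_sites`, `overlap_stats`) are hypothetical, and "overlap" is assumed to mean records that agree exactly on chromosome, position, REF and ALT. Since a single percentage can be computed against either call set or against their union, the sketch reports all three.

```python
import gzip
from typing import Dict, List, Set, Tuple

Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def parse_vcf_line(line: str) -> List[Site]:
    """Return one (chrom, pos, ref, alt) key per ALT allele; [] for headers."""
    if line.startswith("#") or not line.strip():
        return []
    chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
    return [(chrom, int(pos), ref.upper(), alt.upper()) for alt in alts.split(",")]


def read_sites(path: str) -> Set[Site]:
    """Collect all site keys from a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    sites: Set[Site] = set()
    with opener(path, "rt") as handle:
        for line in handle:
            sites.update(parse_vcf_line(line))
    return sites


def overlap_stats(a: Set[Site], b: Set[Site]) -> Dict[str, float]:
    """Summarise exact-match overlap between two call sets."""
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # The same shared count gives different percentages per denominator.
        "frac_of_a": len(shared) / len(a) if a else 0.0,
        "frac_of_b": len(shared) / len(b) if b else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

For example, `overlap_stats(read_sites("mutect2.vcf.gz"), read_sites("caveman.vcf.gz"))` (file names are placeholders) would show whether the 20% figure is relative to one caller's calls or to the union of both.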

Also, if you know of any papers that benchmark the two tools, or at least use both of them, could you please share them?

Thanks a lot!

Best regards,
Alaa

GitHub issue #3036, filed by Sheila. State: closed (closed by sooheelee).

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @alaabadre
    Hi Alaa,

    I will see if the team has anything they can share with you.

    -Sheila

  • alaabadre, France (Member)

    Hello Sheila,

    Thank you very much. Let me know as soon as possible.

    Best regards,
    Alaa

  • alaabadre, France (Member)

    Also, check this post I put on Biostars recently to give you an idea of what I am talking about: https://www.biostars.org/p/307752/

    Thanks!

  • shlee, Cambridge (Member, Broadie) ✭✭✭✭✭

    Hi @alaabadre,

    Thanks for sharing a link to your preliminary comparison. I've only had a chance to glance at your post. One thing to watch out for in your comparisons is differential variant representations, especially those that can occur after reassembly of reads. For our workshops, we have a tutorial that in part uses RTG-Tools (see these installation instructions) because it will map back variants to the reference to see if the representations in two callsets converge.

    Mutect2 allows for calling indels in addition to SNVs because it performs graph assembly. I'm not familiar with CaVEMan or other somatic callers. We on the Communications team rely on our developers to inform us of benchmarking results. Hopefully, one of them will have time to chime in here.
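Because one caller emits indels and the other may not, a like-for-like comparison would restrict both call sets to SNVs first. A minimal sketch of such a filter follows, assuming records are held as (chrom, pos, ref, alt) tuples; the names `is_snv` and `keep_snvs` are hypothetical and not part of GATK or any other tool.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

BASES = {"A", "C", "G", "T"}


def is_snv(ref: str, alt: str) -> bool:
    """True only for a single-base substitution between two distinct bases."""
    ref, alt = ref.upper(), alt.upper()
    return len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES and ref != alt


def keep_snvs(records: Iterable[Record]) -> List[Record]:
    """Drop indels, MNPs, symbolic alleles and anything that is not a plain SNV."""
    return [rec for rec in records if is_snv(rec[2], rec[3])]
```

Applying the same filter to both call sets before counting shared records keeps indel calls from one tool from deflating the apparent overlap.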

    Thanks again for sharing.

  • alaabadre, France (Member)
    edited April 2018

    Hi @shlee,

    Thanks for your reply. When you mention differential variant representations, are you talking about the local reassembly of reads performed by MuTect2, or about BQSR? (I don't think the latter would change the results much.)

    If it's the former, then I can understand that the output could differ; it is something I had thought about. Also, CaVEMan doesn't discover INDELs, only SNPs. In any case, is it possible to disable the reassembly of reads? Do you think that would be a good idea? And finally, what is the command for this option?

    I also see that the issue has been closed on GitHub :(

    Thank you in advance !

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    edited April 2018

    @alaabadre
    Hi Alaa,

    I think Soo Hee was talking about different indel representations. For example, if the reference is CA and the reads show CTG, the output could be either an inserted T plus a SNP to G, or a SNP to T plus an inserted G.
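The CA to CTG example can be made concrete with a small sketch. It applies each set of records to the reference and compares the resulting sequences, which is the general idea behind mapping variants back to the reference, though it is not a description of how RTG-Tools is implemented. The helper `apply_variants` and the two record encodings are illustrative assumptions; the second option is written as one combined record because a SNP and an insertion anchored at the same base would overlap in this simple model.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (1-based position, REF allele, ALT allele)


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Apply non-overlapping variants to a reference and return the haplotype."""
    seq = reference
    limit = len(reference)  # 0-based start of the variant applied just before
    # Work right to left so earlier coordinates stay valid after each edit.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        end = start + len(ref)
        if end > limit:
            raise ValueError(f"variant at position {pos} overlaps another variant")
        if seq[start:end] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        seq = seq[:start] + alt + seq[end:]
        limit = start
    return seq


reference = "CA"
# Option 1: insert T after the C, then substitute A with G.
option_1 = [(1, "C", "CT"), (2, "A", "G")]
# Option 2: substitute A with T and insert G, written as one combined record.
option_2 = [(2, "A", "TG")]

same_records = set(option_1) & set(option_2)
same_haplotype = apply_variants(reference, option_1) == apply_variants(reference, option_2)
```

Here `same_records` is empty while `same_haplotype` is true: a record-by-record intersection would count these calls as disagreeing even though both describe the sequence CTG.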

    We are going to talk to the developers soon to discuss benchmarking. (Github issue is back open :smile: ). We will get back to you soon.

    @alaabadre said: Also, CaVEMan doesn't discover INDELs but only SNPs. In any case, is it possible to disable the reassembly of reads? Also, do you think it is a good idea? And finally, what is the command related to this option?

    In this case, I guess indel representation won't matter as much. However, it is possible the other tools will call false positive SNPs where there are unrecovered indels. You cannot disable reassembly, as it is a crucial part of making good indel calls and of the tool in general.

    -Sheila

  • alaabadre, France (Member)

    Hi @Sheila !

    Thanks for the clear explanation. I am patiently waiting for news from your side.

    Best regards,
    Alaa

  • alaabadre, France (Member)

    @Sheila,

    Thanks for the presentation slides. I have skimmed the contents of each PDF. I found references to two papers that benchmark other tools, but neither includes a comparison between CaVEMan and MuTect2. However, we can agree that benchmarking is challenging and that the callers don't have a high overlap.

    Thanks Sheila.
