What's the difference between MuTect and GATK-Mutect?

YingLiu (China; Member)

Hi,
What is the difference between the MuTect from http://archive.broadinstitute.org/cancer/cga/mutect and the one in GATK?
I see that the latest version at CGA (http://archive.broadinstitute.org/cancer/cga/mutect_download) is 1.14, whereas the version in GATK is 2.
Are they the same?

Thank you!

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    No, these are very different versions. MuTect 1.x only calls SNVs, but 2.x also calls indels. And they work quite differently.
  • YingLiu (China; Member)

    @Geraldine_VdAuwera, is there any difference in the SNV calling method, or is it the same?

  • shlee (Cambridge; Member, Broadie, Moderator)

    @YingLiu, MuTect 1 (M1) is basically a pileup caller that allows for low-fraction alleles, whereas Mutect2 (M2) uses HaplotypeCaller's graph assembly approach. There is a publication about M1 (Cibulskis et al. 2013). M2 has a whitepaper discussing its algorithms at https://github.com/broadinstitute/gatk/tree/master/docs/mutect. That is, in the gatk repo, go to gatk/docs/mutect/mutect.pdf.

  • YingLiu (China; Member)

    @shlee, thank you!

    I have read the paper about M1, but I cannot understand the meaning of "P(bi|ei,r,m,f)". I guess it means the probability that the called base is real?
    The detailed rule is at the following link:

    I look forward to your answer.

    GitHub issue 2317 (opened by shlee; state: closed; closed by sooheelee)
  • shlee (Cambridge; Member, Broadie, Moderator)

    @YingLiu, I have to consult with one of our developers/mathematicians.

  • shlee (Cambridge; Member, Broadie, Moderator)

    Since your link is broken, I've included the equation you refer to again below, @YingLiu.

    [image: the equation for P(bi|ei,r,m,f) that the question refers to]


    Our developer explains:

    This equation is the likelihood of observing base b given error probability e, reference base r, alt base m, and alt allele fraction f. Since e, r, and m are known, this comes down to the probability of the data (b) given the allele fraction. Mutect 1's LOD score is the log of the ratio of 1) this quantity when we plug in the observed alt allele fraction, i.e. assume a somatic variant, to 2) this quantity when we plug in f = 0, i.e. assume hom ref. In a fully Bayesian framework (which M1 is not), we would use Bayes' rule and a prior to "invert" this likelihood and get the probability of f given the observed data.

    We can parse the three parts of this equation as follows. Note that f is the fraction of DNA that is alt, 1 - f is the fraction of DNA that is ref, e is the probability of error, 1 - e is the probability of no error, and (because there are three possible substitution errors) e/3 is the probability of a ref-to-alt or alt-to-ref error.

    • A ref base can come from alt DNA (f) with an alt-to-ref error (e/3) or from ref DNA (1 - f) with no error (1 - e).
    • An alt base can come from alt DNA (f) with no error (1 - e) or from ref DNA (1 - f) with a ref-to-alt error (e/3).
    • A different base (e.g. G if the ref is A and the alt is T) can only come from an error (e/3).
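
    As a sketch, the three cases above can be collected into one piecewise equation. This is reconstructed from the bullets rather than copied from the original image, and the per-read subscript i is taken from the "P(bi|ei,r,m,f)" notation used earlier in the thread, so treat the exact typesetting as an assumption:

```latex
P(b_i \mid e_i, r, m, f) =
\begin{cases}
f \,\dfrac{e_i}{3} + (1 - f)(1 - e_i) & \text{if } b_i = r \\[6pt]
f \,(1 - e_i) + (1 - f)\,\dfrac{e_i}{3} & \text{if } b_i = m \\[6pt]
\dfrac{e_i}{3} & \text{otherwise}
\end{cases}
```

    A quick consistency check: summing over the four possible bases (the ref case, the alt case, and two "otherwise" cases) gives e_i/3 + (1 - e_i) + 2 e_i/3 = 1 for any f.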

    Note that GATK 4 M2 does something much more sophisticated.


    We apologize that the docs for M2 are incomplete. Since GATK4 and many of its tools are in BETA, the documentation is also in BETA, and its completion is pending finalization of the tools themselves.

  • YingLiu (China; Member)

    @shlee, thank you very much! Please pass on my thanks to the developer for helping us.

    Why does the following rule multiply this quantity (the probability of observing the read base, for each read) across all reads?
    I guess its goal is to calculate the genotype (ref|alt) probability at this site. If so, why use multiplication?
    Any detailed explanation is welcome.
    Thank you!

  • YingLiu (China; Member)

    @shlee, any answer would be welcome.

  • shlee (Cambridge; Member, Broadie, Moderator)

    Hi @YingLiu,

    Our developer says:

    The original equation we discussed was the likelihood of a single base in the read pileup given the alt allele fraction f. This equation is the likelihood of the set of all observed bases (that is, all reads in the pileup) given f. Since each read is an independent measurement, the probabilities multiply.

    This assumes that errors are independent and that the base qualities can be trusted. If errors were not independent, it would be very hard to make a tractable statistical model. However, systematic sequencing artifacts do exist (i.e. errors are not fully independent, because some events in library prep and sequencing may create multiple reads with errors), so Mutect needs an additional filtering step that detects systematic artifacts.
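
    The multiplication over reads, and the LOD comparison against f = 0 described earlier in the thread, can be sketched in a few lines of Python. This is only an illustration of the formulas discussed here, not MuTect's actual code: the function names and the (base, error probability) pileup representation are hypothetical, base qualities are assumed to be standard Phred scores, and the alt fraction is simply taken as the observed fraction of alt bases. Because the reads are treated as independent, the product of per-read likelihoods becomes a sum of logs, which also avoids numerical underflow on deep pileups.

```python
import math

def phred_to_error(q):
    """Convert a Phred base quality to an error probability (e = 10^(-Q/10))."""
    return 10 ** (-q / 10)

def base_likelihood(b, e, r, m, f):
    """Likelihood of one observed base b, given error probability e,
    ref base r, alt base m and alt allele fraction f (the three cases
    from the thread)."""
    if b == r:
        return f * e / 3 + (1 - f) * (1 - e)
    if b == m:
        return f * (1 - e) + (1 - f) * e / 3
    return e / 3

def log10_likelihood(pileup, r, m, f):
    """log10 likelihood of the whole pileup: independent reads multiply,
    so their log-likelihoods add."""
    return sum(math.log10(base_likelihood(b, e, r, m, f)) for b, e in pileup)

def lod(pileup, r, m):
    """log10 ratio of the likelihood at the observed alt fraction to the
    likelihood at f = 0 (hypothetical helper, illustration only)."""
    if not pileup:
        raise ValueError("pileup must contain at least one base")
    n_alt = sum(1 for b, _ in pileup if b == m)
    f_hat = n_alt / len(pileup)
    return log10_likelihood(pileup, r, m, f_hat) - log10_likelihood(pileup, r, m, 0.0)

# Toy pileup: 8 reads showing the ref base A and 2 showing the alt base T, all Q30.
e30 = phred_to_error(30)
pileup = [("A", e30)] * 8 + [("T", e30)] * 2
print(lod(pileup, r="A", m="T"))  # positive: the data favor f = 0.2 over f = 0
```

    With no alt bases in the pileup, the observed alt fraction is 0, both likelihoods coincide, and this sketch returns a LOD of exactly 0. It inherits the independence assumption described above, so it says nothing about systematic artifacts.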


    We didn't forget you. Sometimes it takes a while for us to respond because we have other higher priority work or folks are on vacation, etc. I hope this information is helpful.

  • YingLiu (China; Member)

    @shlee,
    thank you very much! You wrote that "Mutect needs an additional filtering step that detects systematic artifacts". Which additional filtering step should I use?

  • shlee (Cambridge; Member, Broadie, Moderator)