Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

accuracy produced by Mutect2 is too low

Lanting GuoLanting Guo ShenzhenMember
edited April 2016 in Ask the GATK team

With simulated data by BAMSurgeon, I tested Mutect2 and Varscan2. There are almost no overlap between Mutect2's vcf and my ground truth while Varscan2's result is resonable.

Why? Is my usage is wrong?

Next is my code.

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class run_M2 extends QScript {

  @Argument(shortName = "L",  required=false, doc = "Intervals file")
  var intervalsFile: List[File] = Nil
  @Argument(shortName = "normal",  required=true, doc = "Normal sample BAM")
  var normalBAM: String = ""
  @Argument(shortName = "tumor", required=true, doc = "Tumor sample BAM")
  var tumorBAM: String = ""
  @Argument(shortName = "o",  required=true, doc = "Output file")
  var outputFile: String = ""
  @Argument(shortName = "sc",  required=false, doc = "base scatter count")
  var scatter: Int = 200
//  @Argument(shortName = "output_mode",  required=false, doc = "output_mode")                                     
//  var output_mode: String = "EMIT_ALL_SITES"                                                                     
  /*                                                                                                               
  @Argument(shortName = "cosmic",  required=true, doc = "cosmic vcf file ")                                        
  var cosmic: String = ""                                                                                          
  @Argument(shortName = "pon",  required=true, doc = "vcf panel of normal file ")                                  
  var pon: String = ""                                                                                             
   */
    def script() {

    val mutect2 = new MuTect2

    mutect2.reference_sequence = new File("/ref/GATK/ucsc.hg19/ucsc.hg19.fasta")
   /*                                                                                                              
    mutect2.cosmic = new File(cosmic)                                                                              
    mutect2.normal_panel = new File(pon)                                                                           
    */
    mutect2.dbsnp = new File("/ref/GATK/knownsites/dbsnp_138.hg19.vcf")
    mutect2.intervalsString = intervalsFile
    mutect2.memoryLimit = 2
    mutect2.input_file = List(new TaggedFile(normalBAM, "normal"), new TaggedFile(tumorBAM, "tumor"))

    mutect2.scatterCount = scatter
    mutect2.out = outputFile
//    mutect2.output_mode = output_mode                                                                            
    add(mutect2)
  }

}

java -jar ~/Github/gatk-protected/target/Queue.jar -S M2.scala --job_queue main.q -qsub -startFromScratch -\
sc 1000 -tumor ../data/benchmark/tumor.bam -normal ../data/benchmark/normal.bam  -o ../data/benchmark/Test1_noInte\
rval_benchmark.vcf --start_from_scratch -run
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Lanting Guo ,

    Your usage seems correct, but it would be more better to post one of the actual GATK commands that get produced by the Queue script, to be sure.

    It would also be helpful to post some details of what is your ground truth, how many variants are found vs. missed by each program, and how you performed the evaluation. Make sure your evaluation accounts for all records in the vcfs including filtered sites.

    Without that information we cannot comment on the results.

  • Lanting GuoLanting Guo ShenzhenMember
    edited April 2016

    Thanks, @Geraldine_VdAuwera .

    The actual GATK commands produced by Queue is:

    INFO  09:20:51,156 QScriptManager - Compiling 1 QScript 
    INFO  09:20:53,677 QScriptManager - Compilation complete 
    INFO  09:20:53,770 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  09:20:53,770 HelpFormatter - Queue v3.5-0-ge91472d, Compiled 2016/03/15 10:16:03 
    INFO  09:20:53,770 HelpFormatter - Copyright (c) 2012 The Broad Institute 
    INFO  09:20:53,771 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  09:20:53,771 HelpFormatter - Program Args: -S M2.scala --job_queue main.q -qsub -startFromScratch -sc 1000 -tumor ../data/benchmark/tumor.bam -normal ../data/benchmark/normal.bam -o ../data/benchmark/Test1_noInterval_benchmark.vcf --start_from_scratch -run 
    INFO  09:20:53,772 HelpFormatter - Executing as [email protected] on Linux 3.13.0-32-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_72-internal-b15. 
    INFO  09:20:53,772 HelpFormatter - Date/Time: 2016/04/14 09:20:53 
    INFO  09:20:53,773 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  09:20:53,773 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  09:20:53,781 QCommandLine - Scripting run_M2 
    INFO  09:20:53,895 QCommandLine - Added 1 functions 
    INFO  09:20:53,895 QGraph - Generating graph. 
    INFO  09:20:53,916 QGraph - Generating scatter gather jobs. 
    INFO  09:20:54,296 QGraph - Removing original jobs. 
    INFO  09:20:54,298 QGraph - Adding scatter gather jobs. 
    INFO  09:20:56,398 QGraph - Regenerating graph. 
    INFO  09:20:58,918 QGraph - Running jobs. 
    INFO  09:20:58,985 QGraph - Removing outputs from previous runs. 
    
    

    My Ground truth are 138 point mutation such as:

    chrom  start               end                mutation fraction
    chr1    11164752        11164752        0.1913878612370747
    chr1    11182075        11182075        0.05611858181557292
    chr1    11183125        11183125        0.3279659236169655
    chr1    11212241        11212241        0.2181217316104322
    chr1    11293593        11293593        0.2733698514916046
    chr1    11298201        11298201        0.4035859581569984
    chr1    11311087        11311087        0.03665408197994013
    chr1    20940016        20940016        0.018758052968420678
    chr1    26226164        26226164        0.21716032134290142
    chr1    65296817        65296817        0.4996091493125231
    chr1    65318420        65318420        0.05942835237567236
    chr1    65327561        65327561        0.22014024006447724
    chr1    97976456        97976456        0.11613254329181226
    chr1    115247279       115247279       0.30346891866584846
    chr1    144918733       144918733       0.01531894240809231
    chr1    162724705       162724705       0.2585497585951497
    chr2    29411229        29411229        0.17009283871671482
    chr2    29431892        29431892        0.22523844254466044
    chr2    29446772        29446772        0.2038176031460158
    chr2    178091413       178091413       0.4785042071976912
    chr2    178093984       178093984       0.29044722190217764
    chr2    212243399       212243399       0.44222607601477665
    chr2    212281754       212281754       0.2622518946853114
    chr2    212984591       212984591       0.20804629357801552
    chr2    234676040       234676040       0.43042709098097026
    chr3    12627317        12627317        0.23486487131991132
    chr3    14194724        14194724        0.42184894777736986
    chr3    14195132        14195132        0.26834308111424376
    chr3    37033200        37033200        0.45965986871507425
    chr3    37062146        37062146        0.33357290713846605
    chr3    41270285        41270285        0.35459192464007827
    chr3    41272290        41272290        0.013668646473863826
    chr3    100462289       100462289       0.19331743741901028
    chr3    100462351       100462351       0.45534283100969425
    chr4    25670931        25670931        0.28289491284336477
    chr4    25672809        25672809        0.25542013314715645
    chr4    55128538        55128538        0.33528570613140746
    chr4    55128774        55128774        0.39635467484909265
    chr4    55150015        55150015        0.13707126469916223
    chr4    55958864        55958864        0.3273899530375076
    chr5    112168954       112168954       0.45931334877536834
    chr5    112171164       112171164       0.47375653742042884
    chr5    149781471       149781471       0.4403670570666818
    chr6    117636129       117636129       0.28037514306169564
    chr6    117705610       117705610       0.027064280122495555
    chr6    159183483       159183483       0.07574933549481852
    chr7    55244071        55244071        0.25976471053435396
    chr7    87128728        87128728        0.14596510567620247
    chr7    87174279        87174279        0.21459972091281504
    chr7    87174816        87174816        0.3341362128986858
    chr7    98496107        98496107        0.1309361243409307
    chr7    98508387        98508387        0.31382034005889897
    chr7    98540946        98540946        0.3787340987720286
    chr7    98542199        98542199        0.06932923487947656
    chr7    98548889        98548889        0.4545045279719027
    chr7    98552008        98552008        0.3324178126063289
    chr7    98557326        98557326        0.27510869399014015
    chr7    98584805        98584805        0.4635153262479244
    chr7    98604865        98604865        0.1317240337781959
    chr7    116335189       116335189       0.4488705486112907
    chr7    128844217       128844217       0.03940262677827463
    chr7    140489246       140489246       0.22408510083432973
    chr7    140495184       140495184       0.4516062613561172
    chr8    38266796        38266796        0.2721969079284002
    

    The normal data is real sequence data. The tumor data is produced by BAMSurgeon based on normal data.
    Mutect2 got 2065 mutations with only 3 in ground truth.
    Varscan2 got 23 mutations with 21 in ground truth.

    All the above are focused on SNP.

    A variant is in ground truth only when its POS is in ground truth.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Lanting Guo
    Hi,

    Can you try running MuTect2 with -stand_call_conf and -stand_emit_conf set to 0?
    I am wondering if the missed sites have low confidence.

    Also, can you post a few IGV screenshots of the original bam file and bamout file for a few of the missed sites?

    Thanks,
    Sheila

  • zhmz90zhmz90 ShenzhenMember
    edited April 2016

    Hi, @Sheila . After running MuTect2 with -stand_call_conf and -stand_emit_conf set to 0, the result is not improved.

    When I load bamout files produced by the command
    java -jar GenomeAnalysisTK.jar -T MuTect2 -R /ref/GATK/u\ csc.hg19/ucsc.hg19.fasta -I:tumor tumor.bam -o hc.vcf\ -L $interval -bamout out_tumor.bam

    java -jar GenomeAnalysisTK.jar -T MuTect2 -R /ref/GATK/u\ csc.hg19/ucsc.hg19.fasta -I:normal normal.bam -o hc.vcf\ -L $interval -bamout out_normal.bam

    To my surprise, there is no normal bamout data while the tumor bamout data is consistent with original normal.bam and tumor.bam.
    The coverage at the interval is about 120X. I choose about 10 intervals to generate bamout data and all have empty normal bamout files.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You need to run the MuTect2 command with both normal and tumor to generate a bamout that includes data from both files.

    Also, the stand_*_conf thresholds don't actually do anything in MuTect2; to tweak sensitivity you have to tweak the lod score thresholds.

    Note that the current version doesn't support Java 1.8 and there may be correctness errors due to list ordering not being guaranteed between java 1.7 and 1.8, so we can't provide any further support for results obtained with GATK 3.5 on Java 1.8. In addition, we don't support OpenJDK, you need to use the Oracle JDK.

  • alongaloralongalor Member
    edited September 18

    I have exactly the same problem using the most current version of MT2. Strelka2 and Varscan both have high sensitivities and specificities while MT2 finds exactly 0 of the 1000 variants I inject with Bamsurgeon. Interestingly, when running MT2 in Tumor-Only mode, the tool finds the injected variants but labels them as non-PASS

  • alongaloralongalor Member
    edited September 18

    Update: The original Mutect (v1.1.7) works just fine. Papers I have seen using Bamsurgeon seem to use either the original mutect (https://www.nature.com/articles/nmeth.3407) or old versions of Mutect2 (https://www.nature.com/articles/s42003-018-0023-9 - v.2.3.5). It appears there is something going on in the newer versions that is causing Mutect2 to perform un-representatively on Bamsurgeon-generated tumors.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @alongalor,

    Interestingly, when running MT2 in Tumor-Only mode, the tool finds the injected variants but labels them as non-PASS

    Remember that Mutect1 only calls SNVs and performs no realignment around indels. MuTect2/Mutect2 uses the HaplotypeCaller reassembly machinery to reassemble reads so that you can call both SNVs and indels.

    GATK3's MuTect2 and GATK4's Mutect2 are different in multiple ways to Mutect1. To help researchers get familiar with the differences, I have developed and written a number of resources. Links to these are listed in Blog#11337. In particular, you may be interested in the Mutect2 tutorial. I hope these resources clarify usage and changes to priors that you should tune when using Mutect2.

    As is the practice in somatic mutation discovery, I am certain you will want to include Mutect2 calls in your union of calls from different callers for downstream analyses. Thanks for your interest in our tools.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    @alongalor A few questions:

    • How are the BamSurgeon loci chosen? Do they overlap with common germline variants?
    • Does BamSurgeon just randomly inject variants in a pileup, or does it try to respect phasing ie, if there is a germline SNP 30 bases away from an injected variant, will the injected variant show up only in reads with/without the SNP but not in both?
    • Similarly, does it respect phasing with reads' mates? The realignment filter, in case you're using it, is mate-aware.
    • Are the variants completely absent from the tumor-normal vcf, or are they showing up with filters? If so, what filters?
    • What allele fractions do the injected variants have?
    • What is your command line?
    • Do the injected reads occur on both strands, and on both read1 and read2?

    This is surprising because we regularly run M2 through the DREAM challenge synthetic bams, which were generated with BamSurgeon.

Sign In or Register to comment.