A definitive answer on running MuTect in STD mode? For a panel of normals or a single tumour sample.

alexbeesleyalexbeesley Telethon Kids InstituteMember

Hi, I refer to the discussions on these posts:

http://gatkforums.broadinstitute.org/discussion/4641/build-a-panel-of-normal-for-mutect?
http://gatkforums.broadinstitute.org/discussion/4536/mutect-run-in-hc-pon-mode

The answers on these two posts appear to disagree on whether it is or isn't possible to run MuTect against multiple normal samples. In addition, it appears from another previous post
(http://gatkforums.broadinstitute.org/discussion/2432/mutect-in-high-confidence-hc-mode) that the correct way to run MuTect on "normal" samples is to disable all filters using the flag "--artifact_detection_mode". However, the first two links do not seem to reflect this requirement. And nowhere on the internet can I find any documentation to support the use of the "--normal_panel" flag as suggested by kcibul to specify the 'panel of normals' vcf needed to run in the HC+PON mode.

Finally, the original CGA documentation for MuTect specifies that when running the command for a tumour-normal pair, the order of the BAM files specified in the command is positional (i.e. that the normal BAM should go before the tumour BAM). This seems incompatible with running either multiple normal samples, or running a single tumour sample.

I have unsuccessfully scoured the internet for hours to get a definitive answer on how the correct way run MuTect to (a) generate a panel of normals and (b) analyse a single tumour sample (unpaired). It would be terrific if you could post a definitive example of the full command required to achieve these two objectives.

Many thanks
Alex B

Telethon Kids Institute, Perth, Australia

Best Answer

Answers

  • alexbeesleyalexbeesley Telethon Kids InstituteMember

    Many thanks Geraldine! And please don't take my question for any criticism directed at yourself or the GATK team, you all do an amazing job!

    I have Mutect 1.1.7 running with and without matched pairs now, so thanks. Rather than a panel of normals however, I am now intending to use the the ExAC data to filter variant calls, which I think should be an even more robust approach than the PON filter. However, if you believe that there may be some additional benefit of running the PON filter within MuTect (recognising that the PON vcf's will themselves have been generated using MuTect and its underlying assumptions) then I'd be interested to hear your thoughts.

    Cheers, Alex

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No worries Alex, and thank you for your kind words :)

    I'm not yet familiar enough with MuTect's internals to assert this with certainty, but I do suspect that it would be better to use MuTect's built-in PON filter rather than do post-hoc filtering with the PON. I'll try to get the MuTect devs to opine on this and maybe provide some supporting arguments.

    Note that I don't think the PON necessarily has to be called with MuTect itself -- you could use calls from HaplotypeCaller to the same (or possibly better, at the moment) effect. That's another thing I need to check on though.

  • Hi Geraldine and Alex,

    I am trying to create a PON as we speak and came across your conversation. Anything happen on the new documentation materials for MuTect? I tried using the "--help" flag but this only gave me an error, but you can see the parameters well enough just checking the scala files on github.
    (see https://github.com/broadinstitute/mutect/blob/master/MuTectPipeline.scala)

    How many normal samples do you recommend using for the PON? I have about 50.

    My plan is basically: Use Mutect or HaplotypeCaller to get vcf files, merge them using bcftools or CombineVariants or something, use this for --panel_of_normals.

    This is a quite sloppy approach, I found this information in the appistry Q&A quoting the mutect paper:

    To reduce false positives and miscalled germ-line events, we used a panel of normal samples as a filter. To create this filter, we ran MuTect on a set of normal samples as if they were tumor samples without a matched normal sample in STD mode. From these data, a VCF file is created for the sites that were identified as variant by MuTect in more than one normal sample.

    Should I go through the trouble to fullfill the "more than one sample"? In that case I would probably use CombineVariants and count occurences of "./." on each line to figure out which variants have only single sample support. Is there an easier way?

    Thanks for reading and any input you might have on what I am doing.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @sebastian_d @alexbeesley

    Some of the new materials (the most in demand) will be available next week.

    A few more details on PONs for now:

    • The devs tell me it is actually better to generate the PON calls with MuTect rather than any other caller because it will also call low-fraction events that are probably noise in normals, which UG or HC wouldn't call. It sounds backwards but this is actually a good thing as it helps filter out noisy sites that might otherwise be called as low allele fraction variants in the tumor samples (especially if they are unmatched). To call variants with MuTect on a single sample, you need to add the --artifact_detection_mode flag.

    • You definitely should use more than one sample to create the PON. The goal is to capture as much of the background germline variation as possible, to avoid calling those sites as somatic, so including multiple individuals makes it more likely that you will observe most of the germline variations in your unmatched samples. We don't have firm guidelines yet on how many normals you should include in the PON, but I can tell you that internally we're currently working with a PON of about 400 exomes.

    • Once you have your PoN calls, you merge those VCFs into a panel VCF using CombineVariants with --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED and --filteredAreUncalled. Then you chop the VCF into a pseudo-sites-only VCF (pseudo because we keep the AC) using eg cat %s | cut -f1-8 > %s. The resulting VCF is your panel of normals.

    I hope this helps.

  • Dear Geraldine,

    sorry for the naive question, but for the CombineVariants part which -genotypeMergeOptions should i use? UNIQUIFY?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    In principle it shouldn't matter because you'll be combining distinct samples, so you shouldn't see any genotype merging happen. You can just run with the default.

  • Great, thank you!

  • varshavarsha FloridaMember

    Hi @Geraldine_VdAuwera , please let me know if this is the command to run MuTect on each of my normal samples before combining the output vcfs to get a panel of normals? I am not sure if I am missing something here. Thanks for your help.

    java -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf --cosmic hg19_cosmic_v54_120711.vcf --intervals hg19intervals_ucsc.bed --artifact_detection_mode --input_file:normal normal.bam --out normal.call_stats.txt --coverage_file normal.coverage.wig.txt --vcf normal.vcf

    ERROR stack trace

    java.lang.RuntimeException: At least one BAM tagged as 'tumor' required
    at org.broadinstitute.cga.tools.gatk.walkers.cancer.mutect.MuTect.initialize(MuTect.java:205)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:58)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.2-25-g2a68eab):
    ERROR
    ERROR Please visit the wiki to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: At least one BAM tagged as 'tumor' required
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I believe you need to use the tumor tag instead of the normal tag for your input file.

  • varshavarsha FloridaMember

    Got it, thanks @Geraldine_VdAuwera. Once I obtain all the normal vcfs, is there a command in MuTect similar to the GATK 'CombineVariants' to get the PON vcf? Please let me know.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No command in MuTect, you'll actually use GATK CombineVariants to do it. Do a search for "panel of normals" in the forum and that should pull up a few threads where we've answered this previously. Eventually there will be a proper documentation article on this.

  • varshavarsha FloridaMember

    Alright, thank you so much for your help. Also, the tumor tag instead of the normal tag did work!

  • varshavarsha FloridaMember

    Hi @Geraldine_VdAuwera, the forums were not very clear on the command to use in GATK combinevariants for creating a merged PON vcf. I am using -

    java -jar GenomeAnalysisTK.jar -T CombineVariants -R ucsc.hg19.fasta --variant normal1.vcf --variant_normal2.vcf --variant normal3.vcf -o PON38.vcf -genotypeMergeOptions UNIQUIFY

    This works for 2 vcf inputs but not more. I have about 40 normal vcfs to be used as the input for generating the PON and was wondering what would be a good way to do this. I appreciate your help. Thank you.

  • varshavarsha FloridaMember

    Sorry there is an extra '_' in the previous post. The command I used was -
    java -jar GenomeAnalysisTK.jar -T CombineVariants -R ucsc.hg19.fasta --variant normal1.vcf --variant normal2.vcf --variant normal3.vcf -o PON38.vcf -genotypeMergeOptions UNIQUIFY

    The GATK version is 3.3.0. Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Can you clarify what you mean when you say "This works for 2 vcf inputs but not more"? Do you get an error, or does it run but not produce the result you want?

  • varshavarsha FloridaMember

    Sorry for not being clear earlier. It runs without any errors with 2 vcf inputs but when I give more than 2 input vcf files I get an error -

    ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Problem detecting index type
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    That means one of your vcfs probably has a bad index file. You can delete all the indices and let GATK regenerate them for you.

  • varshavarsha FloridaMember

    Is there an option to add that in the GATK command or is it by default once i delete the indices? Also just to confirm, the option -genotypeMergeOptions UNIQUIFY would exclude only the variants in more than 1 vcf input files, am I right? I am trying to get the PON vcf with no repeating variants if the same variant is present in more than 1 normal vcf.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Indices will be regenerated by default if you delete them.

    -genotypeMergeOptions UNIQUIFY doesn't exclude variants, see the doc for an explanation of what it does.

  • varshavarsha FloridaMember

    @Geraldine_VdAuwera, I saw some previous documentation on using -genotypeMergeOptions REQUIRE_UNIQUE and UNIQUIFY and also -­‐filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED -­‐filteredAreUncalled for combining variants but I am not sure which of these is ideal for generating PON vcf with only 1 record for each variant (record positions with the same variant in the same location only once). Please let me know where I can find documentation regarding this. Sorry for the trouble, your help is much appreciated. I have looked for the info in these threads-

    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php
    http://gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one
    http://gatkforums.broadinstitute.org/discussion/4641/build-a-panel-of-normal-for-mutect

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It doesn't matter how you choose to merge genotypes because when MuTect uses the PON, it ignores the genotype information and only looks at site-level information. So you don't need to set the genotype merge options. But you should use the filtering arguments as recommended in the doc you read.

  • varshavarsha FloridaMember

    Hi @Geraldine_VdAuwera, I used the --genotypemergeoption UNSORTED in the command and it seemed to work fine and generated the PON vcf successfully and I was able to use this to run MuTect on my unpaired samples. I had tried using the filtering arguments but got the ERROR - Invalid argument value 'KEEP_IF_ANY_UNFILTERED' at position 83. Does the fact I did not use the filtering affect my PON vcf and thus my MuTect results? Please let me know, I appreciate your help.

  • Hi Geraldine,

    My question is again about the PON and MuTect.

    Reading this thread i think i am clear on that to create a PON sec for instance in 10 normal samples i must to run Mutect using each normal sample with the tumor flag to create a VCF and then merge the 10 VCFs either using other GATK tools to obtain a merged VFC which is in fact the PON.

    However i am still not clear about how i must input the PON VCF when running MuTect. Should i use the flag --dbsnp before of i should enter the info with another different flag?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No, there is a panel_of_normals argument; check the arguments list by running java -jar mutect.jar -T MuTect -h to get the help output. Make sure you're using the latest version.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @varsha sorry for the late reply. It looks like you may have used incorrect syntax when passing the filteredrecordsmergetype argument. I would recommend going back and fixing your command to be able to set the filter behavior appropriately.

    If you don't, you might have some over filtering or some under filtering of candidate variants. I forget what is the default behavior for filteredrecordsmergetype ; depending on that your PON may have too man false positives, or may lack some true positives.

  • varshavarsha FloridaMember

    Hi @Geraldine_VdAuwera , would you be able to specify the filteredrecordsmergetype to be used for combining the vcfs for PON? I am unable to narrow down the exact syntax from the forum posts. Thank you for your help.

  • varshavarsha FloridaMember

    @Geraldine_VdAuwera, I was wondering if I should extract the "PASSed" lines from the POn vcf for using this file as a normal panel to run MuTect with or do I leave the REJECTed lines in as well? Please let me know.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I believe MuTect should ignore filtered/rejected variants by default, but I will check.

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number
    255
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • varshavarsha FloridaMember

    Thanks @Geraldine_VdAuwera , please let me know :smile:

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @varsha, sorry for the delayed response. The --filteredAreUncalled flag in the CombineVariants command should cause all filtered (REJECT) variants to be omitted from the combined VCF pon file, so you shouldn't need to do anything further.

  • Hi everyone,
    I see some more posts have popped up in this topic, but no one has posted complete commands. Here's what worked for me:

    First I ran this on each of the normal samples
    java -Xmx6g -jar mutect-1.1.7.jar \ --analysis_type MuTect \ --reference_sequence ucsc.hg19.fasta \ --cosmic COSMICv71.hg19.vcf \ --dbsnp dbsnp_138.hg19.vcf \ --intervals intervals.file \ -I:tumor $1.bam \ --out ${1}.call_stats.out \ --coverage_file ${1}.coverage.wig.txt \ -vcf ${1}.vcf \ --disable_auto_index_creation_and_locking_when_reading_rods \ --artifact_detection_mode

    Then I combined them into PON
    java -jar GATK/3.4.0/GenomeAnalysisTK.jar \ -T CombineVariants \ -R ucsc.hg19.fasta \ -V 1.vcf \ -V 2.vcf \ -V 3.vcf \ -V etc.. \ -o output.vcf \ --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED \ --filteredAreUncalled \ --genotypemergeoption UNIQUIFY

    Cheers!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    FYI now that MuTect2 is available as part of GATK (3.5), this is documented in the tool docs.

Sign In or Register to comment.