Which reads from the original BAM are listed in the Mutect2 VCF output?

shabardi Member
edited January 16 in Ask the GATK team

Dear GATK team,
I am sorry if my question is naive. After running Mutect2 (version 4.2), which positions (and corresponding reads) from the original BAM file are reported in the VCF file? All that passed the filters, or only those that have the potential to be a SNP/mutation?
I am analysing several cancer data sets in parallel and want to choose the SNPs that are most interesting to me. For this I want to know how many patients were WT at a particular position, and how many had a possible mutation at this position. Can I find this information in the VCF file, or do I need to refer to the original BAM?

For example, in the VCF file I have a position chr1:111222 which is found in 8 patients, with some AD values for the normal and tumor samples. Does this mean that the original BAM files contain the same 8 patients at this position, or may there be more patients, some of whom are clearly wild-type and are not reported in the VCF file?

BTW, I am using VCF files (processed with Mutect2) downloaded from the GDC data repository.

I hope you can help :)

Answers

  • shlee Cambridge Member, Broadie, Moderator

    Hi @shabardi,

    I think you might find https://software.broadinstitute.org/gatk/documentation/article?id=11127 helpful. The article is still being tweaked but it is accurate.

  • @Sheila Hi, thank you for your answer! Now I understand better :)
    But the problem remains: we downloaded the VCF files produced by the Mutect2 analysis from the GDC portal; they were generated using tumor-normal pairs. We chose the sites that are interesting to us, but now we want to know the ratio of these potential mutations to WT (in number of patients). As I understand it, the Mutect2 analysis does not provide this info? Any hints on what I can do here?

  • Sheila Broad Institute Member, Broadie, Moderator
    edited January 22

    @shabardi
    Hi,

    No, the Mutect2 VCF does not provide information on the WT sites. It only provides the sites which have potential somatic mutations. Do you know the number of sites you ran Mutect2 on? For example, in exome analysis, you may have input an interval file which has positions for the tool to run on. If you know the number of sites the tool analyzed, you can simply subtract the number of sites that were output as somatic mutations in your VCF from the number of total sites input to get the number of WT sites.

    I hope this makes sense.

    -Sheila
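
The subtraction described above can be sketched in a few lines of Python. This is only a rough illustration: the function names, the `(contig, start, end)` tuple format and the tiny inline VCF are assumptions made for the example and are not part of any GATK tool. Intervals are assumed to be 1-based, inclusive and non-overlapping.

```python
def total_sites(intervals):
    """Number of positions covered by 1-based, inclusive, non-overlapping intervals."""
    return sum(end - start + 1 for _contig, start, end in intervals)


def called_sites(vcf_lines, intervals):
    """Distinct (contig, position) pairs of VCF records that fall inside the intervals."""
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        contig, pos = line.split("\t")[:2]
        pos = int(pos)
        if any(c == contig and s <= pos <= e for c, s, e in intervals):
            sites.add((contig, pos))
    return sites


def estimate_wt_sites(vcf_lines, intervals):
    """Rough WT estimate: analyzed positions minus positions reported in the VCF."""
    return total_sites(intervals) - len(called_sites(vcf_lines, intervals))


# Hypothetical toy data: two calls inside the interval, one outside it.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr2\t100\t.\tA\tT\t.\tPASS\t.",
    "chr2\t250\t.\tG\tC\t.\tPASS\t.",
    "chr3\t100\t.\tA\tT\t.\tPASS\t.",
]
print(estimate_wt_sites(example_vcf, [("chr2", 1, 1000)]))  # 1000 - 2 = 998
```

For a real file, the lines could be read from the (decompressed) VCF instead of the inline list.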

  • Hi @Sheila,
    thank you again. But I chose a specific set of sites, so I need to calculate the number of WT cases for each one individually. I guess I will have to dig down to the original BAM files.
    Could you give me some advice: is it, in general, necessary to calculate this ratio of potentially mutated sites to WT sites for cancer studies to produce reliable results? :) If yes, why don't SNP callers do this job?
    Have a good day!

  • Sheila Broad Institute Member, Broadie, Moderator

    @shabardi
    Hi,

    I think this thread may help you.

    -Sheila

  • Hi @Sheila, the thing is that I'm working with the ready-made VCF files from the NIH GDC portal, and I have already done quite a lot of work. So running a new SNP analysis would be a completely different project :) Seems like there is no solution yet...

  • Sheila Broad Institute Member, Broadie, Moderator

    @shabardi
    Hi,

    I guess even with the ready-made VCF, you can treat the sites that are not output as WT sites?

    -Sheila

  • The WT sites are not in the VCF files at all, right? Then I cannot use them. Or am I mistaken about something again? :)

  • Sheila Broad Institute Member, Broadie, Moderator

    @shabardi
    Hi,

    No, the WT sites are not in the VCF. I was thinking for a rough estimate you can count the sites that are not output in the VCF as WT sites (ones in between the mutant sites).

    On second thought, there may have been an interval list used and you won't know which sites actually were run on. However, you may be able to check in the header what command was run and if an intervals file was used. Can you check the header and let us know if there is a command given? If so, can you tell us if there is an intervals file provided? Perhaps you can ask to have access to the intervals file and count all positions the tool ran on but did not output as mutant as the WT sites.

    -Sheila
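
As a rough sketch of the header check suggested above, the following Python pulls interval arguments out of the `##` header lines. It assumes the command line is recorded with a fragment of the form `intervals=[chr2:1-30000000]`; the rest of the example header line (the key name and surrounding fields) is made up for illustration, and other tool versions may record the command differently.

```python
import re

# Assumed header convention: the command line contains "intervals=[...]".
INTERVALS_RE = re.compile(r"\bintervals=\[([^\]]*)\]")
REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def intervals_from_header(header_lines):
    """Return intervals named in '##' header lines.

    Entries of the form contig:start-end become (contig, start, end) tuples;
    anything else (for example a path to an intervals file) is kept as a string.
    """
    found = []
    for line in header_lines:
        if not line.startswith("##"):
            continue  # only meta-information lines can carry the command
        for match in INTERVALS_RE.finditer(line):
            for item in match.group(1).split(","):
                item = item.strip()
                region = REGION_RE.match(item)
                if region:
                    found.append(
                        (region["contig"], int(region["start"]), int(region["end"]))
                    )
                elif item:
                    found.append(item)
    return found


# Hypothetical header line; only the intervals=[...] convention is assumed.
example_header = [
    '##GATKCommandLine=<ID=MuTect2,CommandLineOptions="intervals=[chr2:1-30000000] '
    'excludeIntervals=null interval_set_rule=UNION">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(intervals_from_header(example_header))  # [('chr2', 1, 30000000)]
```

If the result is a file path rather than a region, the intervals file itself would still be needed to know which positions were analyzed.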

  • @Sheila thank you very much! I will go with your first suggestion; it seems reasonable and less "painful" :)
    The command is provided, and what surprised me is that each sample was run on only one chromosome (or maybe I am mistaken). This is the part concerning intervals: "intervals=[chr2:1-30000000] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0". Maybe you can explain it to me if you have time :)

  • Sheila Broad Institute Member, Broadie, Moderator

    @shabardi
    Hi,

    Ah, so you do know the intervals. This is great. So, any positions from 1-30,000,000 that are not in the VCF, you can consider Wild Type :smile:

    I do not know why the team ran on only that interval. You will have to ask them to find out. Perhaps, they are only interested in that region, but you should find out why :smiley:

    -Sheila
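
To turn this rule into per-site patient counts, one possible sketch is below. The data structures and names (`patient_calls`, `analyzed_intervals`, `tally_site`) are hypothetical and assume the called sites and the analyzed intervals have already been extracted from each patient's VCF. Note that absence from the VCF is only treated as WT here as a rough estimate; a site can also be missing because of low or no coverage, which only the BAM can show.

```python
def tally_site(site, patient_calls, analyzed_intervals):
    """Classify each patient for one (contig, position) site.

    patient_calls:      {patient: set of (contig, pos) present in that patient's VCF}
    analyzed_intervals: {patient: list of (contig, start, end), 1-based inclusive}
    """
    contig, pos = site
    result = {"mutated": [], "assumed_wt": [], "not_analyzed": []}
    for patient, sites in patient_calls.items():
        covered = any(
            c == contig and s <= pos <= e
            for c, s, e in analyzed_intervals.get(patient, [])
        )
        if site in sites:
            result["mutated"].append(patient)
        elif covered:
            # Inside the analyzed interval but not reported: treated as WT.
            result["assumed_wt"].append(patient)
        else:
            result["not_analyzed"].append(patient)
    return result


# Hypothetical toy data for four patients.
patient_calls = {
    "P1": {("chr2", 111222)},
    "P2": set(),
    "P3": {("chr2", 500)},
    "P4": set(),
}
analyzed_intervals = {
    "P1": [("chr2", 1, 30000000)],
    "P2": [("chr2", 1, 30000000)],
    "P3": [("chr2", 1, 30000000)],
    "P4": [("chr3", 1, 30000000)],
}
print(tally_site(("chr2", 111222), patient_calls, analyzed_intervals))
# {'mutated': ['P1'], 'assumed_wt': ['P2', 'P3'], 'not_analyzed': ['P4']}
```

Keeping a separate "not_analyzed" group avoids counting patients as WT when the site lies outside the interval their VCF was generated on.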

  • @Sheila, thank you very much for your help! I hope I can manage it now :)