Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Mutect2 variant numbers differing 10-20 fold for samples from same patient

Hi,

I am running Mutect2 on a colorectal cancer brain metastasis WES dataset using matched normals. However, I am running into some trouble. Mainly, different samples from the same patient give me a wildly varying number of variants. This wouldn't be that large of a problem, but it seems like all these extra variants are very low allele frequency, and somehow the allele frequency in the normal and tumor seem to correlate.

I am following the GATK best practices from here: https://gatkforums.broadinstitute.org/gatk/discussion/24057/how-to-call-somatic-mutations-using-gatk4-mutect2#latest
I am using GATK version 4.1.2.0, although older ones give me the same problem. I cannot for some reason add images to this post (same problem as user Monete in this link: https://gatkforums.broadinstitute.org/gatk/discussion/8856) so sorry for having to post a link to my image, but it looks like this: https://imgur.com/a/mJHdztg
Both these figures are from brain metastasis of the same patient with similar purity (80-90%). The first one has 5267 variants total, most with allele frequency under 0.1. The second has 426 variants total. All variants were filtered on minimum depth (30) and number of variants calls (5). Median depth for my whole dataset was about 120.

My code looks a bit like this:
./gatk-4.1.2.0/gatk Mutect2 -R Homo_sapiens_assembly38.fasta -I "$Tumor".bam -tumor "$Tumor" -I "$Normal".bam -normal $Normal -O $Tumor.vcf -L Homo_sapiens_assembly38_exome_target_interval_list.bed -ip 100 --panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz --germline-resource af-only-gnomad.hg38.vcf.gz

After I use FilterMutectCalls paired with the output of CalculateContamination (contamination is usually around 2-3% or less). Removing/adding the pon or germline recourses or contamination changes a bit as expected but not much.

Any suggestions on what the cause of this might be? I see this problem in multiple datasets, not in all patients but I cannot seem to find the reason. The correlation in Allele Frequency strongly makes me believe it is not true variants that I'm finding.

Answers

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @TomvdBosch Are you using the same normal sample for each normal? This could certainly happen if you are using different normals of varying quality. In particular, tumors matched with a low-depth normal will show more germline variants as PASS.

  • @davidben Yes this is with the same normal sample.
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    All variants were filtered on minimum depth (30) and number of variants calls (5).

    How do your callsets compare when you only use FilterMutectCalls for filtering? And, other than this hard filtering, is your pipeline identical to the best practices pipeline?

    Another question: do these extra low-AF calls tend to occur when the normal coverage is low? An informative figure, if you so desire, would be a histogram of normal depth at all calls that are in one metastasis but not the other.

    If you are free to share your VCFs, I could also look at those.

  • Before FilterMutectCalls it's equally large of a difference (6400 vs 470). I think I am not allowed to share my vcf's, as the data is not public (from EGA).

    I'll try to make the histogram for normal calls today or monday and come back asap, thanks for the help.
  • Interestingly, the depth in the normal sample is lower in positions of the low-variant sample. See https://imgur.com/a/oRcws9g for a normalized histogram. So looks like Mutect2 calls more variants if the depth in the normal is higher. This still doesn't explain the correlation in Allele Frequency though.

    I did this using samtools mpileup (because samtools depths does not allow a .bed file). If I instead take the AD from the variants.vcf file the histograms look similar.
  • To see if it depends on the normal I ran Mutect2 in tumor-only mode for these samples. The results are about 26000 and 46000 variants each before FilterMutectCalls and 20500 and 22600 variants after FilterMutectCalls. In this case, it seems that FilterMutectCalls does make the number of variants somewhat equal.
    Doing the same with another pair in which I saw the same variant number difference after following the whole pipeline (200 vs 2600) I get 30000 and 41000 variants before FilterMutectCalls and 21800 and 22400 after.
    These numbers certainly look a lot closer but of course not using a normal in my variant calling is not what I want, as most of these variants will be germline variants so the difference could still be very large after filtering those out.
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @TomvdBosch Thank you for that histogram. It's a good clue, but I don't have a good answer yet.

    The most perplexing thing to me is the small difference before and after filtering -- 6400 and 470 pre-filtered calls versus 5267 and 426 PASS calls after filtering. Mutect2 is very liberal and usually one sees less than a tenth of all variants ending up passing after FilterMutectCalls.

    Could you describe any meaningful differences -- in the biological samples, in the library preparation, in the sequencing, and in the informatics pipeline -- between the different tumor samples? Also, could you post the rest of your pipeline from Mutect2 through FilterMutectCalls?

  • @davidben thanks for your response!

    I seem to have made a mistake and looked at the wrong files: The before numbers should have been about 30500 and 4000. The 6400 and 700 were after my own filtering on depth which I accidently turned on.

    Since the samples are not mine, I can only assume that all original samples were handled the same. Since both samples are from the same brain metastasis from the same patient there should be no difference in any way. Library prep is reported to be the same, as is the sequencing. They are treated the same in my pipeline as well.

    Here's my Mutect2 and after pipeline, apologies for formatting:
    ./gatk GetPileupSummaries -I "$Tumor".bam -V af-only-gnomad.hg38.vcf.gz -L Homo_sapiens_assembly38_exome_target_interval_list.bed -O "$Tumor".getpileupsummaries.table

    ./gatk CalculateContamination -I "$Tumor".getpileupsummaries.table -matched "$Normal".getpileupsummaries.table -O "$Tumor".contamination.table

    ./gatk Mutect2 -R Homo_sapiens_assembly38.fasta -I "$Tumor".bam -tumor "$Tumor" -I "$Normal".bam -normal $Normal -O $Tumor.vcf -L Homo_sapiens_assembly38_exome_target_interval_list.bed -ip 100 --panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz --germline-resource /mnt/r/bundle/af-only-gnomad.hg38.vcf.gz

    ./gatk FilterMutectCalls -R /mnt/r/bundle/Homo_sapiens_assembly38.fasta -V $Tumor.vcf -O $Tumor.filtered.vcf -contamination-table "$Tumor".contamination.table

    ./gatk VariantsToTable -V $Tumor.filtered.vcf -O $Tumor.variants.tsv -F CHROM -F POS -F TYPE -GF AD -GF AF

    The resulting variants.tsv file I load in matlab which I use to filter on depth and create plots.
  • @davidben I might have some more insight from the tumor-only mutect runs:

    Like I said, running Mutect2 in tumor-only mode equalizes the number of variants a bit. Then I used FilterMutectCalls, filtering out positions in both samples and then filtering on depth > 20. Afterwards, I saw a striking difference: Although the number of variants is similar-ish (2900 vs 3600 for one patient and 3200 vs 5500 for another) the allele frequencies were extremely different! I added histograms of this in the link below. The samples with many variants in my original problem have allele frequencies of <0.1 mostly whereas for the other sample there are peaks around 0.5 and 0.9-1. The depth of the variants (which I took from summing the two entries in the AD field of the variants file) also show some difference.
    The reported purity is similar in all files so getting low allele frequencies in one sample but 0.5 or 1 in the other seems very odd. It almost seems like there could be some degradation causing the low AF variants or chemo influencing, but that again should be the same for samples from the same patient.

    https://imgur.com/a/grCfzlp
Sign In or Register to comment.