Attention:
The frontline support team will be unavailable to answer questions until May27th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

Do you need to do Variant Quality Score Recalibration when calling somatic variants with Mutect2?

Hi,
I am currently working to call somatic variants from tumour samples with matched normal pairs from the same patient. I have carried out all of the steps in this tutorial: https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2#2. However I am now confused as to whether I am supposed to run the VQSR step described in some of the GATK tutorials on Youtube (for example here: ) .

I initially thought I did have to do it but as I've gone over the explanation of VQSR I now get the impression that it is designed for calling germline variants on many samples simultaneously. However I wanted to make sure before taking that step out of my pipeline that you genuinely do not need to do VQSR for somatic variant calling. I would also like to understand WHY it is that you need VQSR for germline variant calling but do not use it for somatic mutation calling.
Thanks!
Peter O'Donovan

Best Answers

Answers

  • Accepted Answer

    Right, thank you very much for the answer!

  • BegaliBegali GermanyMember

    @Sheila
    Hi Sheila

    I am working on cfDNA_NGS , I did calling somatic mutations and also I have already performed FilterMutectCalls, then add annotation for gene names based on this file hg19_GENCODE_v19_basic.Cancer_genes.bed , I was successful to add them after filtering, however I noticed at not all position row has been added based on the FILTER col I think has been added for real somatic mutation

    I would like to also distinguish between pathogenic and non- pathogenic based on ClinVar will be correct to add new annotation at each row

    Also I generated vcf by SAMTOOLs and I have added gene names before filtering as I notice VSQR only for Germline , then how I can filter those genetics variants which generated by SAMTOOLs . Is Hard filtering will be correct chose or what ?

    OR updated vcf with excluded position where gene names have been not added , then add process for ClinVar

    MY Goal is called somatic mutation for cancer on 33 sample cfdna(cysts fluid, plasma and blood for pancreas..

    and compare results for both hg38 and hg19 with different pipelines

    Kindly if you provide me some info for my question, I appreciate your reply and helping

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Begali
    Hi,

    however I noticed at not all position row has been added based on the FILTER col I think has been added for real somatic mutation

    Are you saying the annotations are not added to filtered sites, but they are added to non-filtered sites? Which tool did you use?

    Also I generated vcf by SAMTOOLs and I have added gene names before filtering as I notice VSQR only for Germline , then how I can filter those genetics variants which generated by SAMTOOLs . Is Hard filtering will be correct chose or what ?

    I am not sure I understand. I thought you ran Mutect2 for somatic variants only?

    MY Goal is called somatic mutation for cancer on 33 sample cfdna(cysts fluid, plasma and blood for pancreas..

    Can you post the exact commands you ran to get the variants?

    Thanks,
    Sheila

  • BegaliBegali GermanyMember

    @Sheila
    hi
    the command here:
    I called them by 4 different methods

    First method:

    gatk --java-options "-Xmx2g" Mutect2 -R ucsc.hg19.fasta -I recal109_17.bam -tumor 109_17 --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -pon somatic-hg19_1000g_pon.hg19.vcf -L resources_broad_2hg19_v0_wgs_calling_regions.hg19.interval_list -O tumor_unmatched_m2_snvs_indels109_17.vcf
    Filtering variants:

    1. Estimate cross-sample contamination using GetPileupSummaries

    gatk --java-options "-Xmx2g" GetPileupSummaries -I recal109_17.bam -V somatic-hg19_small_exac_common_3.hg19.vcf -O 109_17getpileupsummaries.table

    2.Second, estimate contamination with CalculateContamination
    gatk --java-options "-Xmx2g" CalculateContamination -I 109_17getpileupsummaries.table -O 109_17_tumor_calculatecontamination.table

    gatk --java-options "-Xmx2g" FilterMutectCalls -V tumor_unmatched_m2_snvs_indels109_17.vcf --contamination-table 109_17_tumor_calculatecontamination.table -O filtered109_17.vcf.gz

    2nd method:
    gatk --java-options "-Xmx2g" Mutect2 -R ucsc.hg19.fasta -I recal109_17.bam -tumor 109_17 -O Sample109_17.vcf
    Filtering variants: such as above except I replaced tumor_unmatched_m2_snvs_indels109_17.vcf with
    Sample109_17.vcf
    See the OUTPUT for 1st method and 2nd similar like this :smiley:

    3rd method:
    samtools mpileup -g -f ucsc.hg19.fasta
    realigned109_17.bam > raw109_17.bcf
    bcftools call -mv raw109_17.bcf > var109_17.bcf
    bcftools view var109_17.bcf > var109_17.vcf

    4th method:
    samtools mpileup -g -f ucsc.hg19.fasta
    sorted109_17.bam > raw109_17.bcf
    bcftools call -mv raw109_17.bcf > var109_17.bcf
    bcftools view var109_17.bcf > var109_17.vcf

    Filtering 3rd and 4th methods I did not apply VSQR as should perform only for Germline varaints

    so Then I wanted to add gene_name based on GOOGLE info
    1- vcf-sort hg19_GENCODE_v19_basic.Cancer_genes.bed | bgzip -c > hg19_GENCODE_v19_basic.Cancer_genes.bed.gz

    2- tabix hg19_GENCODE_v19_basic.Cancer_genes.bed.gz
    3- bcftools annotate -a hg19_GENCODE_v19_basic.Cancer_genes.bed.gz
    -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') filtered109_17.vcf > annonated_genes109_17.vcf

    here after adding genes names

    ex> for 1st method

    ex> for 4th method

    My Q is why gene names only have been added at some rows not all
    I assumed others are FP or not somatic mutations

    2nd Q is what I understood Pass means real somatic mutation or for FilterMutectCalls
    if correct then why gene name at this position has not been added
    see here

    FILTER=<ID=PASS,Description="All filters passed">

    FILTER=<ID=artifact_in_normal,Description="artifact_in_normal">

    FILTER=<ID=base_quality,Description="alt median base quality">

    FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">

    FILTER=<ID=contamination,Description="contamination">

    FILTER=<ID=duplicate_evidence,Description="evidence for alt allele is overrepresented by apparent duplicates">

    FILTER=<ID=fragment_length,Description="abs(ref - alt) median fragment length">

    FILTER=<ID=germline_risk,Description="Evidence indicates this site is germline, not somatic">

    FILTER=<ID=mapping_quality,Description="ref - alt median mapping quality">

    FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">

    FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">

    FILTER=<ID=read_position,Description="median distance of alt variants from end of reads">

    FILTER=<ID=str_contraction,Description="Site filtered due to contraction of short tandem repeat region">

    FILTER=<ID=strand_artifact,Description="Evidence for alt allele comes from one read direction only">

    FILTER=<ID=t_lod,Description="Tumor does not meet likelihood threshold">

    My Goals to determine which method is the best one with min FP

    Also I repeated that with hg38 to compare with results for hg19

    3rd Q in order to determine pathogenic and non pathogenic I was not successful with VariantsAnnotator based on ClinVAr.vcf

    I am glad to recieve ur info as this is first time deal deal with those somatic mutation for human data

    with best regards

  • BegaliBegali GermanyMember

    @Sheila
    hi

    I could fix adding ClnVAr annontation as previously gives me errors

    ERROR MESSAGE: Input files /home/pathology/Desktop/My_Data/Processing/clinvar/Chr37/clinvar_20180701.vcf and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: No overlapping contigs found.

    ERROR /home/pathology/Desktop/My_Data/Processing/clinvar/Chr37/clinvar_20180701.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, NW_003315947.1, NW_003315950.2, NW_003871103.3]
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

    even if I used the same reference genome but when I capture
    grep -v "^#" Myfile.vcf | awk '{print $1}' | uniq -c
    started with chr1....
    grep -v "^#" ClinVar.vcf | awk '{print $1}' | uniq -c
    started with 1 ....
    so I fixed it by
    awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' ClinVar.vcf > with_chr.vcf

    then apply this command again
    java -jar /home/pathology/Desktop/My_Data/GenomeAnalysisTK/GenomeAnalysisTK-3.8-1-0-gf15c1c3ef/GenomeAnalysisTK.jar -R /home/pathology/Desktop/My_Data/Pre_Processing/ucsc.hg19.fasta -T VariantAnnotator --variant annonated_genes109_17.vcf -o final_109_17.vcf --resource:ClinVar /home/pathology/Desktop/My_Data/Processing/clinvar/Chr37/with_chr.vcf -E ClinVar.AF_ESP -E ClinVar.AF_EXAC -E ClinVar.AF_TGP -E ClinVar.ALLELEID -E ClinVar.CLNDN -E ClinVar.CLNDNINCL -E ClinVar.CLNDISDB -E ClinVar.CLNDISDBINCL -E ClinVar.CLNHGVS -E ClinVar.CLNREVSTAT -E ClinVar.CLNSIG -E ClinVar.CLNSIGNCL -E ClinVar.CLNVC -E ClinVar.CLNVCSO -E ClinVar.CLNSIGCONF -U ALLOW_SEQ_DICT_INCOMPATIBILITY

    the output for 2nd method was updated see

    I assuem will be work with others method , can I considered it also more filtering then I have to caputer only rows with ClinVar annotation correct

    with best regards

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Begali
    Hi,

    My Q is why gene names only have been added at some rows not all I assumed others are FP or not somatic mutations

    I see. The gene name can only be added if it is in your resource file. If it is not in your resource file, it is possible it is a false positive, or it simply has not yet been detected.

    2nd Q is what I understood Pass means real somatic mutation or for FilterMutectCalls if correct then why gene name at this position has not been added.

    It is possible Mutect2 discovered a true positive that is not picked up by other callers :smile:

    For your last question, you are using resource files that are incompatible with your reference used for analysis. You need to make sure all files have been generated using the exact same reference.

    -Sheila

Sign In or Register to comment.