GATK best pratices for RNA-seq somatic mutation finding

AlvaAlva GermanyMember
edited March 2017 in Ask the GATK team

Dear All,

I have followed GATK best practices for finding Somatic mutations from cancer versus normal sample from RNA-seq data using Mutect2 as the final caller and rest all quality control steps as per GATK best practices for RNA-seq. But the resulting filtered sets are false positives as we tried to validate them in the wet lab. So we are now planning to call somatic variants using haplotype caller as per guidelines in GTAK practices, please someone who has experience with this caller comment on this.How is it with Haplotype caller, I would like to know the Sensitivity and specificity rate, also the parameters used in the caller to get true positive variants.
Thank you

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Alva
    Hi,

    I am not sure I understand. Are you saying the VCF from MuTect2 on your RNA-seq data contains all false positives?

    We do not recommend using HaplotypeCaller for finding somatic mutations.

    -Sheila

  • AlvaAlva GermanyMember

    Actually, I started using Mutect tools from http://archive.broadinstitute.org/cancer/cga/mutect_run after following all pre-workflow from GATK best practices to find Somatic mutations. Then we used our filtered set of confident somatic mutations for wet lab validation and it turned out to be all False positives. So at this stage I just wanted to give a try with haplotype caller from GATK for finding Somatic mutations.. And what do you suggest ?

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited March 2017

    Hi @Alva,

    I'm sorry to hear you went through so much bench validation to find only false-positives. Unfortunately, calling somatic variants using RNA-Seq data is not a typical workflow and I don't believe we have recommendations towards this kind of analyses. But researchers are ingenious and the field moves forward with new approaches. Is there a paper that you are basing your analysis approach on?

    Typical MuTect use cases involve a pair of DNA samples--a tumor and a matched normal control. Somatic callers are sensitive to low allele fraction mutations and we recommend careful manual review of callsets derived from somatic callers. We have a new tutorial that outlines some considerations for calling somatic variants for DNA data--you can find a link to the worksheet in this blogpost. Careful pre-processing of data is necessary to minimize false positive calls. This is why we also recommend using a panel of normals (PoN).

    Let me know your thoughts.

  • AlvaAlva GermanyMember

    @Shlee.

    Thank you for your reply. So now I started with MuTect2 from GATK and I have a completely different result.Also, it is taking more tan 20 hours to finish one sample.

    The command line I used is from here,
    java -jar GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R reference.fasta \
    -I:tumor tumor.bam \
    -I:normal normal.bam \
    [--dbsnp dbSNP.vcf] \
    [--cosmic COSMIC.vcf] \
    [-L targets.interval_list] \
    -o output.vcf

    As I have RNA-seq data from normal and matched tumor samples

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited April 2017

    Hi @Alva,

    I meant to share the hands-on tutorial to illustrate roughly the thought process and some considerations in a somatic analysis, e.g. contamination. MuTect2 is a tool that is in beta, which means it is not ready for research use. Faster run times and more sensitive results are something that MuTect2 will have when we release it in GATK4. You might ask why do we document a tool that is not ready for use? Well, for one, the tutorial focuses on general workflow considerations and not on the performance of MuTect2. For benchmarking, and the final thumbs-up, we need to wait for data from our developers.

    On the other hand, MuTect1 is widely used in production. Remember that MuTect1 only calls SNVs and requires prior indel realignment.

    If you want to see actual somatic (DNA) analyses, I highly recommend the workflows that FireCloud provide that use the original Mutect (MuTect1) offer for SNV calling. Last I checked, the workflow incorporates indel calls made by MuTect2. Here is the link: https://software.broadinstitute.org/firecloud/guide/article?id=7512. There are of course journal publications that use MuTect1 that should also inform your analyses.

    All this being said, at this time, I cannot recommend these workflows for RNA data. If you can figure out parameters for these workflows/tools to work well with your RNA-Seq data, then this is something we would be greatly interested in hearing about.

  • AlvaAlva GermanyMember

    Thank you @Shlee for the explanation. Now I have results from MuTect2 I see all the variants (from 40 patient with the corresponding matching sample as Normal or control) have germline_risk and multi_event_alt_allele_in_normal as clustered events. I don't understand how can every variant have same explanation for 40 samples?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Alva
    Hi,

    Can you post some example VCF records and IGV screenshots of the BAM file and bamout file at the site? Please include ~300 bases before and after the site in question in the BAMs.

    Thanks,
    Sheila

  • AlvaAlva GermanyMember
    edited May 2017

    Hello
    My bamout file is empty anyhow I am attaching the IGV screenshot for position
    VCF file ,

    > > > > #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TUMOR   NORMAL
    > > > > 1   14748   .   G   C   .   alt_allele_in_normal;t_lod_fstar    ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=29.34;TLOD=4.58;CSQ=C|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene|10/10||||1284|||||||-1||HGNC|38034|||||||||||||||||||||||||||||,C|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000438504|unprocessed_pseudogene|12/12||||1398|||||||-1||HGNC|38034|||||||||||||||||||||||||||||,C|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||1078|1||HGNC|37102|||||||||||||||||||||||||||||,C|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|||||||||||339|1||HGNC|37102|||||||||||||||||||||||||||||,C|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene||10/10||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||,C|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000515242|transcribed_unprocessed_pseudogene|||||||||||336|1||HGNC|37102|||||||||||||||||||||||||||||,C|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene|||||||||||339|1||HGNC|37102|||||||||||||||||||||||||||||,C|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000538476|unprocessed_pseudogene||12/12||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||,C|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000541675|unprocessed_pseudogene|9/9||||1031|||||||-1||HGNC|38034|||||||||||||||||||||||||||||    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/1:86,2:0.027:1:1:0.500:2974,76:43:43  0/0:176,5:0.032:3:2:0.600:6048,155:92:84
    > > > > 1   15250   .   C   G   .   t_lod_fstar ECNT=1;HCNT=8;MAX_ED=.;MIN_ED=.;NLOD=8.13;TLOD=5.13;CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene||8/9||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||,G|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000438504|unprocessed_pseudogene||10/11||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||1580|1||HGNC|37102|||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|||||||||||841|1||HGNC|37102|||||||||||||||||||||||||||||,G|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene||9/10||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000515242|transcribed_unprocessed_pseudogene|||||||||||838|1||HGNC|37102|||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene|||||||||||841|1||HGNC|37102|||||||||||||||||||||||||||||,G|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000538476|unprocessed_pseudogene||11/12||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||,G|intron_variant&non_coding_transcript_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000541675|unprocessed_pseudogene||7/8||||||||||-1||HGNC|38034|||||||||||||||||||||||||||||  GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/1:26,3:0.100:1:2:0.667:900,105:15:11  0/0:38,0:0.00:0:0:.:1334,0:18:20
    > > > 
    > > > 
    
    Post edited by shlee on
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Alva,

    The region appears to be an intronic region and you IGV screenshot shows the blue lines that mark split reads. So this is RNA-Seq data, yes?

    The filters you mention--germline_risk, multi_event_alt_allele_in_normal and clustered events--are all of your variant calls marked by all three of these or just one of these? Filter annotations are meant to be conservative so that you can then manually review sites and remove the borderline cases. I imagine for RNA-Seq data, because of the clipping in split reads, all of the sites near these sites could trigger these filters.

    There may be some advanced parameters you could tweak, to algorithmically lessen this stringent filtering. Currently, some of the filters in the GATK3 MuTect2 beta release are being tweaked by our developers for the GATK4 MuTect2 alpha/beta release. Again, the default parameters will be for DNA data so I think you will still have similar issues. But I can't say for sure because I have yet to learn about the improvements.

    I think the best thing you can do for yourself is (i) figure out if this is the right approach for the question you want to answer, (ii) figure out the optimal parameters for your use case and (iii) use small data for your testing to iterate faster. Sorry I can't be more helpful at this time. We are looking ahead to the GATK4 release tools.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Alva
    Hi,

    What do you mean you ran the Best Practices for RNA-seq? Can you tell us all of the steps you ran? Did you run SplitNCigarReads?

    Thanks,
    Sheila

  • shleeshlee CambridgeMember, Broadie, Moderator

    @Alva, our developers say that GATK4's MuTect2 (in beta) is now ready for testing. It has improvements to GATK3's MuTect2. You can find the jar at https://github.com/broadinstitute/gatk-protected/releases/tag/1.0.0.0-alpha1.2.6.

Sign In or Register to comment.