To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Help me understand why this variant was called

escaonescaon Limoges, FranceMember

This variant was called with MuTect2 T/N mode :

chr7 152008861 . A T . PASS \
ECNT=1;HCNT=70;MAX_ED=.;MIN_ED=.;NLOD=96.12;TLOD=37.01 \
GT:AD:AF:ALT_F1R2:ALT_F2R1:DP:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC \
0/1:201,0:0.112:0:0:201:.:5721,0:0:0:101,100,0,0 \
0/0:445,0:0.055:0:0:445:.:15084,0:0:0:219,226,0,0

If i read this correctly :
AF for this SNP (A->T) is 0.112 (11.2%) in the tumor sample.
AF for this SNP (A->T) is 0.055 in the normal sample.
Depth at this position (AD) in tumor sample : 201 (ref), 0 (alt).
Depth at this position (AD) in normal sample : 445 (ref), 0 (alt).

This seems strange to call a SNP with AF = 0.112 when no reads at all show up for the alt_allele in the AD field.
We have nothing to support alt_allele in the QSS field either (for both tumor and normal sample, the sum of base quality scores for alt_allele equals 0).

Ps : We wanted to have some clue about strand biaisfor our variants (strand_artifact flag nether appears in MuTect2 VCF), so we requested the SAC to be present in the VCF output. No reads at all in the SAC field are supporting alternate allele on +/- strands.

I know that it's stated in the doc that "the allele counts provided by AD and SAC should not be used to make assumptions about the called genotype", but still, i think it's very strange to have no supporting allele count in both AD & SAC when AF = 11.2%. And given that AF = AD[format] / DP[format], i was expecting some correlation between AF for a given variant and its AD field.

What am I missing ? Did i misread how AF is calculated ?

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @escaon
    Hi,

    Can you please post IGV screenshots of the original BAM file and bamout file at the position (include ~300 bases before and after the site of interest)?

    Thanks,
    Sheila

    P.S. You are using the latest version of GATK?

  • escaonescaon Limoges, FranceMember
    edited April 2017

    Hi, thanks for your interest.

    Some precisions first :

    Project :
    Call somatic variants on cancer samples (for each patient (13), we have 3 tissues, 2 different kind of tumoral tissues and one normal (blood)). We followed GATK best practises (https://software.broadinstitute.org/gatk/img/BP_somatic_workflow_M2.png). The only step that we excluded is the 'Mark Duplicates" step. That's because our NGS data is ampliseq (high depth), and that it's hard to distinguish between real PCR dups & true reads starting on same positions (sequenced on Ion Proton platform). Ps : if we mark dups, we have ~96% dups.

    Versions / Infos :
    GATK version : v3.7-0-gcfedb67
    Java version : javac 1.8.0_121 (java-8-openjdk-amd64)
    IGV : IGV_2.3.91
    Ref genome : hg19
    Nb : The variant we are interested in is called outside our amplicons boundaries (because we used padding=100 with MuTect2, given that we have ampliseq data). See below for amplicons positions surrounding this call. This variant is called in T/N mode, for one of our tumor tissu vs normal tissu.
    chr7 152007067 152007168
    chr7 152008863 152008946
    chr7 152008946 152009049

    BAMOUT generation :
    men='/home/erwann/Desktop/MEN/BISCEm/Hg19_files'; glio='/home/erwann/Desktop/GLIO/BISCEm/Outputs'; java -jar $GATK -T MuTect2 \ -R $men/ucsc.hg19.fasta \ -L chr7:152008861 \ --interval_padding 300 \ -I:tumor $glio/3_BQSR/V1/p1_TT_sorted_bqsr.bam \ -I:normal $glio/3_BQSR/V1/p1_sang_sorted_bqsr.bam \ --dbsnp $men/dbsnp_138.hg19.vcf \ --cosmic $men/cosmic_coding_and_noncoding_chr_M_sorted.vcf \ -PON $glio/5_PoN/V1/combined_pon_prun2.vcf \ -contamination $(awk 'NR==2 {print $4/100}' $glio/4_ContEst/V1/p1_TT_contest.txt) \ --maxReadsInRegionPerSample 10000 \ -A DepthPerAlleleBySample -A BaseQualitySumPerAlleleBySample -A TandemRepeatAnnotator -A OxoGReadCounts -A StrandAlleleCountsBySample -A DepthPerSampleHC \ -minPruning 2 \ --disableOptimizations --dontTrimActiveRegions --forceActive \ --bamOutput bamout.bam \ -o out.vcf;

    In out.vcf we have :
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL chr7 152008861 . A T . PASS ECNT=1;HCNT=78;MAX_ED=.;MIN_ED=.;NLOD=62.14;TLOD=92.86 GT:AD:AF:ALT_F1R2:ALT_F2R1:DP:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:220,0:0.196:0:0:220:.:6250,0:0:0:108,112,0,0 0/0:445,0:0.084:0:0:445:.:15101,0:0:0:215,230,0,0

    BAMOUT : Screenshots :
    See attached files.
    bamout_1.png : ArtificialHaplotypes sorted by base, they clearly show why A->T SNP was called. In the blue pop-up, depth seems to be 796 at this pos. Does this correspond to the sum of tumor reads + normal reads + ArtificialHaplotypes at this position ? (given that DP tumor = 220, DP normal = 445, we lack 131 depth, but we have 127 ArtificialHaplotypes (well, i managed to found an artificialhaplotype called HC127) ).
    bamout_2.png : Less zoom.
    bamout_3.png : Scrolled down to see normal reads.
    bamout_4.png : Scrolled down to see tumor reads.
    tumor_bam_1.png : Tumor BAM used to call the variant.
    tumor_bam_2.png : Less zoom.

    My understanding so far :
    For this variant (chr7 152008861 A->T), the HC graph used by MuTect2 to make the call is highly different from the reads pileup in the BAM . I am still wondering how we have 0.112 AF for the alt in tumor sample with no alt count in either AD, QSS or SAC fields.

    Underlying question is : How to get a good strand count for alt allele (to estimate strand biais). The strand count seems to be included in the BAMOUT (blue pop-up in the bamout_1.png seems to indicate that the 79 alt are on the minus strand).

    Ps : Questions about BAMOUT and IGV :
    In the BAMOUT, are ArtificialHaplotypes contructed from reads coming from both the tumor and the normal samples ?

  • escaonescaon Limoges, FranceMember

    Ps : I also wonder why a variant with AF_tum = 0.196 & AF_nor = 0.084 was not flagged "alt_allele_in_normal".

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @escaon
    Hi,

    This is most likely a bug, but since the new version of MuTect2 in GATK4 is coming out very soon, there is probably nothing we can do. The best thing to do is wait for the new version and test this out again. Hopefully the issue no longer persists.

    -Sheila

  • escaonescaon Limoges, FranceMember

    Thanks for your reply.

    The "bug" is that this variant is called ? Or the associated values in AD, QSS & SAC fields ?
    At the end of the day, should this variant be discarded ? Or should we keep it (with some associated warning) ?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @escaon
    Hi,

    The issue is the messy region and the AD values that don't show evidence for the alternate allele. There are also many variants (specifically indels) in the region which could be due to homologous genes in other regions. You can try a tool such as BLAT to check for those.

    -Sheila

Sign In or Register to comment.