SomaticGenotypingEngine - At Locus chrchr8:129712469, we detected that variant context had alleles t

mmokrejs (Czech Republic, Member)

Hi,
I wonder what I am doing wrong with my input to MuTect2.

java -Xmx16g -Djavaio.tmpdir=. -Xmx58g -jar GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T MuTect2 --num_cpu_threads_per_data_thread 16 -R ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa -I:tumor ../../tumor.bam -I:normal ../../normal.bam --dbsnp /scratch/work/project/bio/db/ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/00-All.vcf.gz --cosmic sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicAllMuts__Broad-style.vcf.bgz -o CR-MGUS-10_10-PB.bwa.gatk.MuTect2.vcf
INFO 10:42:52,168 HelpFormatter - ------------------------------------------------------------------------------------------
INFO 10:42:52,170 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
...
INFO 16:31:31,339 ProgressMeter - chr8:128735831 1.391350276E9 5.8 h 15.0 s 47.2% 12.3 h 6.5 h
WARN 16:31:39,909 SomaticGenotypingEngine - At Locus chrchr8:129712469, we detected that variant context had alleles that not in PRALM. VC alleles = [T, G*], PRALM alleles = []
WARN 16:31:39,934 SomaticGenotypingEngine - At Locus chrchr8:129727023, we detected that variant context had alleles that not in PRALM. VC alleles = [C*, CACACACACACAT], PRALM alleles = []
WARN 16:31:47,311 SomaticGenotypingEngine - At Locus chrchr8:130343105, we detected that variant context had alleles that not in PRALM. VC alleles = [A*, AT], PRALM alleles = []
WARN 16:31:47,311 SomaticGenotypingEngine - At Locus chrchr8:130343112, we detected that variant context had alleles that not in PRALM. VC alleles = [A*, G], PRALM alleles = []
WARN 16:31:47,816 SomaticGenotypingEngine - At Locus chrchr8:130638567, we detected that variant context had alleles that not in PRALM. VC alleles = [A*, G], PRALM alleles = []
WARN 16:31:50,176 SomaticGenotypingEngine - At Locus chrchr8:130805529, we detected that variant context had alleles that not in PRALM. VC alleles = [A, G*], PRALM alleles = []
WARN 16:31:51,902 SomaticGenotypingEngine - At Locus chrchr8:130936261, we detected that variant context had alleles that not in PRALM. VC alleles = [C, T*], PRALM alleles = []
INFO 16:32:31,348 ProgressMeter - chr8:134299699 1.391350276E9 5.8 h 15.0 s 47.4% 12.3 h 6.5 h

$ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicAllMuts__Broad-style.vcf.bgz | grep -v "^#" | head
chr1 1 COSN24297174 N NT . . .
chr1 1 COSN24297168 N NA . . .
chr1 1 COSN24297179 N NC . . .
chr1 1 COSN24297177 N NAT . . .
chr1 1 COSN24297166 N NT . . .
chr1 1 COSN24297175 N NT . . .
chr1 1 COSN24297162 N NC . . .
chr1 1 COSN24297163 N NG . . .
chr1 1 COSN24297172 N NGCCG . . .
chr1 1 COSN24297176 N NA . . .
^C

Hmm, it seems the above variants from COSMIC v80 are not very useful, right? Damn, maybe I screwed up my BaseRecalibrator results, since I passed this file to it as well.

Anyway, here are my chromosome names.

``

$ grep "^>" ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa

chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38
chr3 AC:CM000665.2 gi:568336021 LN:198295559 rl:Chromosome M5:76635a41ea913a405ded820447d067b0 AS:GRCh38
chr4 AC:CM000666.2 gi:568336020 LN:190214555 rl:Chromosome M5:3210fecf1eb92d5489da4346b3fddc6e AS:GRCh38
chr5 AC:CM000667.2 gi:568336019 LN:181538259 rl:Chromosome M5:a811b3dc9fe66af729dc0dddf7fa4f13 AS:GRCh38 hm:47309185-49591369
chr6 AC:CM000668.2 gi:568336018 LN:170805979 rl:Chromosome M5:5691468a67c7e7a7b5f2a3a683792c29 AS:GRCh38
chr7 AC:CM000669.2 gi:568336017 LN:159345973 rl:Chromosome M5:cc044cc2256a1141212660fb07b6171e AS:GRCh38

``

Thank you for your comments

Answers

  • mmokrejs (Czech Republic, Member)

    Thanks @EADG. I searched the GATK site for "chrchr" and found no good hits. It looks like the message still contains "chrchr" in v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18. After reading the thread, I am not certain whether I can simply ignore the messages on the assumption that those regions are just too messy.

    Thinking more about the COSMIC entries with N as the reference allele, I think I did not do any harm to BaseRecalibrator, as it should have ignored reference sites with N's anyway. Am I right?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    Hi there,

    The "chrchr" thing sounds like a minor bug of a hardcoded "chr" in the error message, you can safely ignore it (though we'll look to fix it).

    The PRALM warnings are indeed related to messy regions, iirc, and you should be able to ignore them safely as well, though I would still check that the calls at those sites, if any, look reasonable.

    Finally, I think BaseRecalibrator only uses the position coordinates and will have ignored the alleles -- if not you would have run into a validation error. So it's fine for BQSR but it won't be useable for MuTect2, so you should really get a fixed version of cosmic VCF.
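
To check the calls at the warned sites, it helps to first pull the loci out of the log. The following is only a rough sketch: the function name `pralm_loci` is made up for illustration, and it assumes the log has one entry per line and that the warning text matches the lines quoted in this thread. It collapses the doubled "chr" prefix so the coordinates match the reference contig names.

```shell
# pralm_loci (hypothetical helper): read a GATK log on stdin and print one
# contig:position per PRALM warning, collapsing "chrchr8" to "chr8".
# Assumes one log entry per line, worded as in the warnings quoted above.
pralm_loci() {
  grep 'SomaticGenotypingEngine - At Locus' |
    sed -E 's/.*At Locus (chr)?(chr[^:]*:[0-9]+),.*/\2/'
}
```

For example, `pralm_loci < mutect2.log | sort -u` (with `mutect2.log` standing in for wherever the run's output was saved) would give a list of positions to look up in the output VCF.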
  • mmokrejs (Czech Republic, Member)

    For clarity, let me emphasize that I merged CosmicNonCodingVariants.vcf.gz and CosmicCodingMuts.vcf.gz into a single file. I have already emailed COSMIC support asking them to include a chromosome dictionary in their VCF files (it was needed for merging the contents of the two VCF files). I worked around it, but it seems they could improve their pipeline output. The file CosmicAllMuts__Broad-style.vcf.bgz is the one I came up with and used as the value for --knownSites and for --cosmic. I modified the file to contain chromosome names [chr1, chr2, ..., chrM].
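
The renaming step could look roughly like the sketch below. This is not necessarily how the file in this thread was produced: the function name `add_chr_prefix` is made up, and it assumes the original files name the mitochondrial contig "MT". It only rewrites column 1 of data rows, so any `##contig` header lines would need separate handling.

```shell
# add_chr_prefix (hypothetical helper): read VCF text on stdin and rewrite
# column 1 of each data row from names like 1, 2, ..., X, Y, MT to
# chr1, chr2, ..., chrX, chrY, chrM. "MT" as the input name is an assumption.
add_chr_prefix() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }   # header lines pass through unchanged
    { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }
  '
}
```

Setting both FS and OFS to a TAB keeps the columns TAB-separated when awk rebuilds the modified record.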

    Here are the original COSMIC files:

    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz | grep -v "^#" | wc -l
    15553844
    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz | grep -v "^#" | awk '$4 = "N" {print}' | wc -l
    15553844
    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz | grep -v "^#" | awk '$4 != "N" {print}' | wc -l
    15553575

    BTW, the file CosmicCodingMuts.vcf.gz uses a space instead of a TAB as a column separator (at least on rows containing N in the reference). It looks like a mix of rows with either a TAB or a space as the separator. The numbers below may be wrong because of that.

    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz | grep -v "^#" | wc -l
    3420531
    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz | grep -v "^#" | awk '$4 = "N" {print}' | wc -l
    3420531
    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz | grep -v "^#" | awk '$4 != "N" {print}' | wc -l
    3420531
    $

    @Geraldine_VdAuwera, so you agree I should discard rows with N in the reference before feeding the file to MuTect2?
    Would you also please ask your programmers to make MuTect2 ignore such lines and issue a summary warning stating how many lines were ignored?
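
Until something like that exists in the tool, the filtering can be done beforehand. Below is a minimal sketch, assuming a TAB-separated VCF in which the offending rows have exactly "N" in the REF column (column 4); the function name `drop_n_ref` and the wording of the summary message are made up for illustration.

```shell
# drop_n_ref (hypothetical helper): read VCF text on stdin, write header lines
# and rows whose REF (column 4) is not exactly "N" to stdout, and report the
# number of dropped rows on stderr.
drop_n_ref() {
  awk -F'\t' '
    /^#/      { print; next }      # keep header lines untouched
    $4 == "N" { dropped++; next }  # skip rows with N as the reference allele
              { print }
    END       { printf "dropped %d rows with REF=N\n", dropped > "/dev/stderr" }
  '
}
```

It could be used as `gzip -dc in.vcf.gz | drop_n_ref | bgzip > out.vcf.gz`, where the file names are placeholders. Because the rows are only compared and never modified, the TAB separators are left intact.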

  • mmokrejs (Czech Republic, Member)
    edited February 2017

    Indeed, I had an error in the awk scripts, so here are the correct numbers (thanks to John Tate from COSMIC):

    $ for f in sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz; do
        sha1sum=`sha1sum -b $f | awk '{print $1}'`
        num=`gzip -cd $f | grep -v '^#' | awk '$4 == "N" {print}' | wc -l`
        echo "$f $sha1sum $num with N's"
        num=`gzip -cd $f | grep -v '^#' | awk '$4 != "N" {print}' | wc -l`
        echo "$f $sha1sum $num without N's"
        num=`gzip -cd $f | grep -v '^#' | wc -l`
        echo "$f $sha1sum $num total VCF entries"
      done
    sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz 8f7ac90713548dda1e3dff98e6497fbcd7b5efed 0 with N's
    sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz 8f7ac90713548dda1e3dff98e6497fbcd7b5efed 3420531 without N's
    sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicCodingMuts.vcf.gz 8f7ac90713548dda1e3dff98e6497fbcd7b5efed 3420531 total VCF entries
    sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz ed6fee04a69a6bc08a902e90f430d7cdba0b0ffd 269 with N's
    sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz ed6fee04a69a6bc08a902e90f430d7cdba0b0ffd 15553575 without N's
    sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz ed6fee04a69a6bc08a902e90f430d7cdba0b0ffd 15553844 total VCF entries
    $

    I wanted to show whether, and how many, entries in the Coding vs. NonCoding datasets are affected by N's in the genomic reference. It is not very relevant to this GATK issue.

    Also, because of the broken awk syntax I introduced the spaces into the output myself (see the first broken command below) and falsely attributed them to the COSMIC input files. The second attempt below shows there are TABs (the wider spacing is clear on the screen). The third shows more precisely that there are indeed TABs.

    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz | grep -v "^#" | awk '$4 = "N" {print}' | grep ' ' | head -n 1
    1 1 COSN24297174 N NT . . .
    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz | grep -v "^#" | grep COSN24297174
    1 1 COSN24297174 N NT . . .
    $
    $ gzip -dc sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v80/VCF/CosmicNonCodingVariants.vcf.gz | grep -v "^#" | grep COSN24297174 | od -c
    0000000 1 \t 1 \t C O S N 2 4 2 9 7 1 7 4
    0000020 \t N \t N T \t . \t . \t . \n
    0000034
    $
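
The difference between the two awk forms can be reproduced on a tiny example. This is an illustrative sketch with two made-up rows (the IDs and the function names `broken` and `fixed` are not from the COSMIC files): `$4 = "N"` is an assignment, so it overwrites column 4 on every row, evaluates to a non-empty string (true), and makes awk rebuild the record with single spaces, whereas `$4 == "N"` only compares.

```shell
# Two TAB-separated sample rows (made-up IDs): one with REF "N", one with "A".
sample=$(printf '1\t1\tID1\tN\tNT\n1\t100\tID2\tA\tG\n')

# Assignment: sets field 4 to "N" on every row; the non-empty result is true,
# so every row prints, and awk rebuilds the record with OFS (a single space).
broken() { printf '%s\n' "$sample" | awk '$4 = "N" {print}'; }

# Comparison: rows are only tested, never modified, so the TABs survive.
fixed() { printf '%s\n' "$sample" | awk -F'\t' '$4 == "N" {print}'; }

broken
fixed
```

The `broken` call prints both rows, space-separated and both with N in column 4, which matches the symptom above where the "N" count equalled the total. The `fixed` call prints only the first row, still TAB-separated.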

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @mmokrejs, we would consider those files invalid, and while we do our best to build safeguards into our software, we can't possibly guard against every way that an input file might be wrong. So considering the amount of work we currently have on our plate, I don't think it's realistic to expect the developers to take time to add a safeguard for this case.
