COSMIC and dbSNP files for MuTect

I'm having trouble finding the recommended COSMIC and dbSNP files for hg19 to use with MuTect (hg19_cosmic_v54_120711.vcf and dbsnp_132_b37.leftAligned.vcf). I can't find these in any of the bundles on the GATK public FTP site. I see a dbSNP file called dbsnp_132_b37.vcf; is this the same? I don't see any COSMIC file at all. I'm currently using bundle 2.3 for hg19 for the dbSNP files (and the standard indels from 1000G and Mills for indel realignment). Thanks!

Answers

  • Did you find a solution?? I am having the same problem.

  • @rpauly, I have not found a solution!

  • Did you get a reply from the authors?

  • Thanks, @kcibul!

    @kcibul said:
    Hi everyone -- sorry for the delay on this one. Until we have a better solution for these public files, I've put them on the same download page as MuTect. Please let me know if you have any problems.

  • I'm getting the incompatible contigs error with those files. Are there versions that are compatible with the reference sequence in the latest resource bundle (2.3)?

  • kcibul (Cambridge, MA; Member, Broadie, Dev)

    Can you please post exactly which files you are using for dbSNP, COSMIC and your reference?

  • I am using:
    hg19_cosmic_v54_120711.vcf
    dbsnp_132_b37.leftAligned.vcf
    from the Mutect download page and
    ucsc.hg19.fasta
    from the ftp resource bundle (2.3)

  • I'm getting the same error, using the same files as @furgason5. It seems that the cosmic and dbsnp files posted don't have the "chr" prefix for the chromosome names?

  • That is what I am seeing as well. I'm not sure how to correct this: I usually use Picard's ReorderSam to adjust contigs, but it only works for SAM/BAM files.

  • kmdaily (Member)
    edited March 2013

    I used grep/sed to change the file (adding "chr" to the beginning of non-comment lines), and it seems to work fine now. I do get a warning with the edited dbsnp file though:

    INFO  10:46:38,842 RMDTrackBuilder - Creating Tribble index in memory for file dbsnp_132_b37.leftAligned_new.vcf
    WARN  10:46:38,858 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A -- descriptions disagree; header has 'Allele Frequency' but standard is 'Allele Frequency, for each ALT allele, in the same order as listed'
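
    For anyone who wants the grep/sed step spelled out, here is a minimal sketch of one way to do it, using awk rather than sed. It is a guess at what the edit looked like, not kmdaily's actual command. The helper name `add_chr_prefix` and the demo file are made up for illustration.

    ```shell
    # Hypothetical helper: read a VCF on stdin and write it to stdout with "chr"
    # prepended to every data line. Lines starting with "#" pass through as-is.
    add_chr_prefix() {
        awk '/^#/ { print; next } { print "chr" $0 }'
    }

    # Tiny made-up VCF, only to show the effect.
    demo_a=$(mktemp -d)
    printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t69538\t.\tG\tA\t.\t.\tSOMATIC\nX\t100\t.\tC\tT\t.\t.\t.\n' > "$demo_a/b37.vcf"
    add_chr_prefix < "$demo_a/b37.vcf" > "$demo_a/hg19_names.vcf"
    cut -f1,2 "$demo_a/hg19_names.vcf"
    ```

    This only renames contigs on data lines. Any `##contig` header lines keep the old names, and b37 calls the mitochondrial contig "MT" while hg19 calls it "chrM", so that one would need separate handling.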
    
  • Thanks @kmdaily, I will try that. Can you give me the command line that you used to change the file? I'm not very familiar with grep/sed. Thanks.

  • desmo (Member)

    @kcibul Why do you run MuTect with old versions of COSMIC and dbSNP? Is there any problem with using up-to-date resources like dbSNP 137 or COSMIC v63?

  • @kmdaily said:
    I'm getting the same error, using the same files as furgason5. It seems that the cosmic and dbsnp files posted don't have the "chr" prefix for the chromosome names?

    I think kmdaily is right about the chromosome names. I suspect they need to match the names used in your reference genome, which may or may not use the "chr" prefix. The "chr" prefix is not included in the dbSNP/COSMIC files provided on the MuTect download page. Another source of complications could be that the dbSNP file uses "X" and "Y" as chromosome names, while the COSMIC file uses "23" and "24".

    However, if this indeed is the issue (and prefix addition/text replacement in the first column of the uncommented lines in those files could solve the problems then), I wonder whether/how these very files work for kcibul.
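
    A quick way to test the naming theory before editing anything is to compare the contig names in a VCF against the reference index. The sketch below assumes a samtools-style .fai index whose first column is the contig name. The helper names and demo files are hypothetical, and the 23/24 renaming only reflects what is described in this post.

    ```shell
    # Hypothetical helper: print contig names that appear on VCF data lines
    # but are absent from the first column of the .fai index.
    missing_contigs() {
        vcf=$1; fai=$2
        tmp=$(mktemp -d)
        grep -v '^#' "$vcf" | cut -f1 | LC_ALL=C sort -u > "$tmp/vcf.names"
        cut -f1 "$fai" | LC_ALL=C sort -u > "$tmp/fai.names"
        LC_ALL=C comm -23 "$tmp/vcf.names" "$tmp/fai.names"
        rm -rf "$tmp"
    }

    # Hypothetical helper: rename contigs 23 and 24 to X and Y on data lines only.
    rename_23_24() {
        awk 'BEGIN { FS = OFS = "\t" }
             /^#/ { print; next }
             $1 == "23" { $1 = "X" }
             $1 == "24" { $1 = "Y" }
             { print }'
    }

    # Made-up inputs; only the first .fai column matters for this check.
    demo_b=$(mktemp -d)
    printf 'chr1\t249250621\nchrX\t155270560\n' > "$demo_b/ref.fai"
    printf '#CHROM\tPOS\n1\t100\n23\t200\n' > "$demo_b/cosmic.vcf"
    missing_contigs "$demo_b/cosmic.vcf" "$demo_b/ref.fai"
    rename_23_24 < "$demo_b/cosmic.vcf"
    ```

    If `missing_contigs` prints nothing, the names at least overlap completely. It says nothing about contig order or lengths.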

  • @danielvo said:
    However, if this indeed is the issue (and prefix addition/text replacement in the first column of the uncommented lines in those files could solve the problems then), I wonder whether/how these very files work for kcibul.

    Wondering the same thing

    @desmo said:
    kcibul Why do you run MuTect with old versions of COSMIC and dbSNP? Is there any problem with using up-to-date resources like dbSNP 137 or COSMIC v63?

    Where might one download these new versions, and are the contigs changed for these? Would like to get this sorted out as I have files piling up behind this step right now.

  • Thanks @kcibul, exactly the information I needed

  • Thank you, @kcibul, that was very helpful. :)

    I see that you recommend using the latest dbSNP collection available. If I understand correctly, dbSNP mutations not present in the COSMIC database are less likely to be called by MuTect (less likely when compared to mutations not found in the dbSNP file passed to MuTect). The GATK bundle offers also "a version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129, which excludes the impact of the 1000 Genomes project". Would it be advisable to use that dbSNP collection instead?

    Thanks!

  • kcibul (Cambridge, MA; Member, Broadie, Dev)

    I would recommend using the latest dbSNP, not the one that excludes the 1000 Genomes project. You're correct that at sites present in the dbSNP VCF we are slightly less powered to classify mutations (not to discover them in the tumor) given the exact same depth of sequencing. However, as we describe in the publication, in practice these differences really only come into play at very low coverage in the normal (under 20x).

    But say you had a dataset where the normal was covered at only 10x, and you were trying to decide what to do. If you used no dbSNP file, you would make a huge number of mistakes (compared to the number of true somatic events) by misclassifying true germline events as somatic. On the other hand, if you used the dbSNP file, you would be less able to call true somatic mutations at dbSNP positions that you would otherwise have called, but you would not be overwhelmed by false positives. It's a tradeoff, but one we typically don't have to make because most of the data we come across is well over 20x in the normal.

  • @kcibul Thank you very much for the swiftly-posted explanation!

    The coverage "detail" is an important one. I should read the paper again.

  • I tried to use the liftOverVCF.pl script to convert the b37 cosmic file to hg19. However, it threw an error:

    ##### ERROR MESSAGE: Key SOMATIC found in VariantContext field INFO at chr1:69538 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
    

    I had to add the following line to the header, and then it was successful:

    ##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">
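
    If you would rather not edit the header by hand, the same fix can be scripted. This is a minimal sketch that assumes the missing definition is exactly the line quoted above and that it belongs just before the #CHROM line. The helper name `add_somatic_info_header` and the demo file are made up.

    ```shell
    # Hypothetical helper: insert the SOMATIC INFO definition just before the
    # #CHROM line, unless the header already defines it.
    add_somatic_info_header() {
        awk -v def='##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">' '
            /^##INFO=<ID=SOMATIC,/ { have = 1 }
            /^#CHROM/ && !have     { print def }
            { print }'
    }

    # Made-up three-line VCF to show where the line lands.
    demo_c=$(mktemp -d)
    printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t69538\t.\tG\tA\t.\t.\tSOMATIC\n' > "$demo_c/cosmic.vcf"
    add_somatic_info_header < "$demo_c/cosmic.vcf" > "$demo_c/fixed.vcf"
    head -n 3 "$demo_c/fixed.vcf"
    ```

    Running it a second time on its own output should leave the file unchanged, since the definition is then already present.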
    
  • @kcibul: could you please put an hg19 COSMIC VCF file on the MuTect download page? I am having a problem with the GATK liftover while converting the b37 COSMIC file to hg19. I am using the reference and chain file from GATK but get the error below:

    "The chain file you are using is not compatible with the reference you are trying to lift over to; please use the appropriate chain file for the given reference"

    Thanks!

  • desmo (Member)

    @kcibul, I'm trying to understand how you converted the COSMIC file.

    I've downloaded the file CosmicMutantExport_v54_080711.tsv from the COSMIC website.

    Why does this file have 182714 variants while your version, hg19_cosmic_v54_120711.vcf, has just 33500 variants?

    Which kinds of variants did you choose?

    Thanks in advance

  • Thanks @kmdaily for doing the heavy lifting on this, I ended up getting it to work just how you did.

  • Hi @kmdaily and @furgason5 - would it be possible for you to upload this file someplace? Thanks in advance!

  • furgason5 (Member)
    edited March 2013

    While it might be beneficial to know how to do this yourself, I think since @kcibul said he would do it, I will give you the link to my file:
    (http://db.tt/thfHgygB)

  • Here's the command I used after modifying the header; the resulting file won't be much use to anyone else because I've modified the reference sequence.

    perl ./liftOverVCF.pl -vcf b37_cosmic_v54_120711.vcf -chain b37tohg19.chain -out hg19_cosmic_v54_120711.vcf -newRef ucsc.hg19 -oldRef human_g1k_v37 -gatk /usr/local/apps/GATK/GenomeAnalysisTK-2.4-7-g5e89f01/
    

    @desmo, did you figure anything else out about the difference in the original COSMIC file and the vcf version?

  • I am curious if any filtering of variants is done based on the dbSNP vcf? If so would some of the somatic variants present in dbSNP be thrown out?

  • kcibul (Cambridge, MA; Member, Broadie, Dev)

    Hi

    No hard filtering is done, but in the MuTect publication we describe how we use the dbsnp information as a prior for a candidate event in the tumor being either germline or somatic.

  • Hi, I ran into another issue with the dbsnp file.
    ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:

    ERROR Name  FeatureType     Documentation
    ERROR BCF2  VariantContext  http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_bcf2_BCF2Codec.html
    ERROR VCF   VariantContext  http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_vcf_VCFCodec.html

    Where should I add this NAME tag?

  • kcibul (Cambridge, MA; Member, Broadie, Dev)

    Is this using your own VCF file or the one we distribute? If it's the former, then my suspicion is that your VCF file is malformed (and thus the GATK can't figure out what kind of file it is).

  • Thanks so much for your response!
    I downloaded your version but manually re-sorted it according to the reference I have...
    Is there any way to solve this problem if I have to re-sort the dbSNP file? I tested it on one chromosome and still got the same error message...

    Thanks a lot!

  • kcibul (Cambridge, MA; Member, Broadie, Dev)

    My guess is that in the act of re-sorting it you somehow corrupted something in the header, which makes it not look like a VCF file to the GATK. Can you try using another GATK tool which reads a VCF file (say SelectVariants)? If you get the same error with that, it's decent confirmation that the problem is with the VCF.
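
    A rough header sanity check can be done with standard tools before running anything in the GATK. The sketch below only tests three things: a `##fileformat` line on line 1, exactly one `#CHROM` line, and no header lines after the first data line. It is not a VCF validator, and the function name `check_vcf_header` and the demo files are hypothetical.

    ```shell
    # Hypothetical helper: report the first obvious header problem, if any.
    check_vcf_header() {
        vcf=$1
        if ! head -n 1 "$vcf" | grep -q '^##fileformat=VCF'; then
            echo "line 1 is not a ##fileformat line"; return 1
        fi
        if [ "$(grep -c '^#CHROM' "$vcf")" -ne 1 ]; then
            echo "expected exactly one #CHROM line"; return 1
        fi
        # Fail if any "#" line shows up once a data line has been seen.
        if ! awk '/^#/ {
                      if (seen_data) { bad = 1; exit }
                      next
                  }
                  { seen_data = 1 }
                  END { exit (bad ? 1 : 0) }' "$vcf"; then
            echo "header lines appear after data lines"; return 1
        fi
        echo "header looks plausible"
    }

    # Made-up files: one well-formed, one with the #CHROM line below the data.
    demo_d=$(mktemp -d)
    printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n' > "$demo_d/good.vcf"
    printf '##fileformat=VCFv4.1\n1\t100\n#CHROM\tPOS\n' > "$demo_d/bad.vcf"
    check_vcf_header "$demo_d/good.vcf"
    check_vcf_header "$demo_d/bad.vcf" || echo "(flagged)"
    ```

    A header that has been sorted in among the data lines is the kind of damage a manual re-sort can cause, which is why the third check is there.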

  • mducar (Member)

    Hi Kris,

    I came across this thread while searching for something slightly different. COSMIC now provides a VCF-formatted data set:
    ftp://ngs.sanger.ac.uk/production/cosmic/

    They provide two VCF files -- one for coding variants and one for non-coding variants of the recent v64 release. I'm still running into issues using them. First they weren't sorted as the GATK expected (easily fixed using their sortByRef.pl script). I then wanted to combine the two VCF files so I only had to work with a single file -- but the GATK spits out an error "there are not enough columns present in the header line".

    Still trying to figure out the second issue, but wanted to pass along the link to the COSMIC vcf files.

  • Hi @mducar,

    Not sure if you have solved that problem or not, but here is my solution:

    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v64_02042013_noLimit.vcf.gz
    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicCodingMuts_v64_02042013_noLimit.vcf.gz
    gunzip Cosmic*.gz
    grep "^#" CosmicCodingMuts_v64_02042013_noLimit.vcf > VCF_Header
    grep -v "^#" CosmicCodingMuts_v64_02042013_noLimit.vcf > Coding.clean
    grep -v "^#" CosmicNonCodingVariants_v64_02042013_noLimit.vcf > NonCoding.clean
    cat Coding.clean NonCoding.clean | sort -gk 2,2 | awk '{print "chr"$0}' | perl sortByRef.pl --k 1 - ucsc.hg19.fasta.fai > Cosmic.hg19
    cat VCF_Header Cosmic.hg19 > Cosmic.hg19.vcf
    

    If you are trying to use b37, just skip the awk step and use the corresponding .fai file instead.

    The main cause of your error is that sortByRef.pl does not treat the header specially: the header lines get sorted along with everything else and end up at the bottom of the file. Without a proper header at the top, the VCF might not be recognized. Hopefully this code helps.
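
    If sortByRef.pl is not at hand, the same idea (header kept on top, data lines ordered by the contig order of the .fai and then by position) can be sketched with awk and sort. This is an alternative under stated assumptions, not the script's own logic: the function name `sort_vcf_by_fai` is made up, and data lines whose contig is missing from the .fai are silently dropped.

    ```shell
    # Hypothetical helper: print header lines first, then data lines ordered by
    # the contig order of the .fai (column 1) and by POS within each contig.
    sort_vcf_by_fai() {
        vcf=$1; fai=$2
        tab=$(printf '\t')
        grep '^#' "$vcf"
        grep -v '^#' "$vcf" |
            awk -v fai="$fai" '
                BEGIN {
                    FS = OFS = "\t"
                    # rank[contig] = line number of that contig in the .fai
                    while ((getline line < fai) > 0) {
                        split(line, f, "\t")
                        rank[f[1]] = ++n
                    }
                }
                ($1 in rank) { print rank[$1], $0 }' |
            sort -t "$tab" -k1,1n -k3,3n |
            cut -f2-
    }

    # Made-up inputs to show the ordering.
    demo_e=$(mktemp -d)
    printf 'chr1\t249250621\nchr2\t243199373\nchr10\t135534747\n' > "$demo_e/ref.fai"
    printf '##fileformat=VCFv4.1\n#CHROM\tPOS\nchr10\t5\nchr2\t7\nchr1\t9\nchr1\t3\n' > "$demo_e/in.vcf"
    sort_vcf_by_fai "$demo_e/in.vcf" "$demo_e/ref.fai"
    ```

    The rank column is prepended only for sorting and is removed again by `cut`, so the output lines are otherwise identical to the input lines.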

  • nbahlis (Member)

    I am new to MuTect. Can someone help, please? I am getting the error "Input files reads and reference have incompatible contigs: No overlapping contigs found". I realize that my read contigs have "chr" before the chromosome number (chr1, chr2, etc.) while my reference contigs don't (1, 2, ...).
    Can someone please help me with a script to either add or remove "chr" from one of them?
    I am using --reference_sequence human_g1k_v37.fasta

    Thank you

  • Assuming your reference sequence is in the following format:

    >1
    ACTG (Sequence)
    >2
    (Sequence)

    then you can use:

    awk -F ">" '{if(index($0,">")!=0){print ">chr"$2 }else{print $0}}'

    But it would be even better to download the hg19 version of the bundle; then you don't need to do anything to the reference. (I assume your reads were aligned to hg19, while the reference you are passing is the b37 version.)
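
    For comparison, the same header rewrite can be written with sed. This is only a sketch, assuming every sequence header starts with ">" at the beginning of the line. The function name `add_chr_to_fasta_headers` is made up.

    ```shell
    # Hypothetical helper: prepend "chr" to every FASTA header name on stdin.
    add_chr_to_fasta_headers() {
        sed 's/^>/>chr/'
    }

    # Made-up two-record FASTA to show the effect.
    printf '>1\nACTG\n>MT\nGGCC\n' | add_chr_to_fasta_headers
    ```

    Two caveats: the b37 mitochondrial contig "MT" becomes "chrMT" here, whereas hg19 names it "chrM", and any existing .fai or .dict file for the reference would no longer match and would need to be regenerated.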

  • @kmdaily said:
    Here's the command I used after modifying the header; the resulting file won't be much use to anyone else because I've modified the reference sequence.

    perl ./liftOverVCF.pl -vcf b37_cosmic_v54_120711.vcf -chain b37tohg19.chain -out hg19_cosmic_v54_120711.vcf -newRef ucsc.hg19 -oldRef human_g1k_v37 -gatk /usr/local/apps/GATK/GenomeAnalysisTK-2.4-7-g5e89f01/
    

    desmo, did you figure anything else out about the difference in the original COSMIC file and the vcf version?

    Could anyone explain how the output of liftOverVCF.pl differs from simply stripping the 'chr' prefix, for example with something like "sed 's/chr//g'"?
    Thank you, Teja
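
    One narrow, checkable difference is at the text level: liftOverVCF.pl is given a chain file and a new reference (see the command quoted above), whereas sed only edits text, and an unanchored `s/chr//g` edits every occurrence of "chr" on the line, not just the contig name. The line below is made up to show this.

    ```shell
    # Made-up VCF data line whose INFO field happens to contain "chr".
    demo_line=$(printf 'chr1\t100\t.\tA\tT\t.\t.\tNOTE=chromosome_arm')

    # Unanchored, global: also mangles the INFO text.
    printf '%s\n' "$demo_line" | sed 's/chr//g'

    # Anchored to the start of the line: only the contig name changes.
    printf '%s\n' "$demo_line" | sed 's/^chr//'
    ```

    If plain renaming is all that is wanted, the anchored form is the safer of the two. Neither form touches header lines that start with "#".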

  • @apallav2 said:
    I usually get the COSMIC vcfs from here: ftp://ngs.sanger.ac.uk/production/cosmic

    Do you know of an elegant way to sort those vcf's?

  • Oops, somehow I am not getting notifications for the thread I commented on.
    vcf-sort from VCFtools will sort your VCFs.

  • sadiqsaleem09 (New York; Member)

    @shingwan I tried to use your script for combining the coding and non-coding COSMIC VCFs, as follows:

    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicCodingMuts_v68.vcf.gz
    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v68.vcf.gz
    gunzip Cosmic*.gz
    grep "^#" CosmicCodingMuts_v68.vcf > VCF_Header
    grep -v "^#" CosmicCodingMuts_v68.vcf > Coding.clean
    grep -v "^#" CosmicNonCodingVariants_v68.vcf > Noncoding.clean
    cat Coding.clean Noncoding.clean | sort -gk 2,2 | awk '{print "chr"$0}' | perl sortByRef.pl --k 1 - ucsc.hg19.fasta.fai > Cosmic.hg19

    However, I got the following error:

    Bareword found where operator expected at sortByRef.pl line 6, near ""en" class"
    (Missing operator before class?)
    Bareword found where operator expected at sortByRef.pl line 13, near "<title>gatk"
    (Missing operator before gatk?)
    Can't modify numeric lt (<) in scalar assignment at sortByRef.pl line 6, near ""en" class"
    syntax error at sortByRef.pl line 6, near ""en" class"
    Unrecognized character \xC2; marked by <-- HERE after at master <-- HERE near column 40 at sortByRef.pl line 13.

    Any suggestions on how to get past this error related to the sortByRef.pl script?
    Thank you!

  • ste (au; Member)

    Dear @kcibul,

    Maybe I missed it in the publication: does MuTect distinguish between germline and somatic variation present in dbSNP?

    As I understand it, MuTect uses COSMIC to determine which variants in dbSNP are somatic. However, I would like to know whether it also uses the germline/somatic/both flag within dbSNP to make this distinction.

    Thanks a lot,
    Stefano

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @ste,

    Let me clear up a few misconceptions about what roles COSMIC and dbSNP play respectively in MuTect.

    dbSNP is used to reject candidate mutations that are most probably germline because they have been observed in other people. Because the level of validation of submissions to dbSNP is low, we are not confident that things being flagged as germline or somatic are trustworthy.

    In contrast, COSMIC is a more highly validated resource, so it is used essentially as a whitelist to "rescue" candidate mutations that would otherwise be rejected for being in the panel of normals and/or dbSNP. We expect that anything that is really somatic that is flagged as such in dbSNP will also be in COSMIC, so we can rely on COSMIC to rescue those sites.

    Does that clarify how this works?
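
    To make the two roles concrete, here is a toy lookup that labels candidate sites by whether they appear in a COSMIC list, only in a dbSNP list, or in neither. It is purely illustrative and is not MuTect's implementation; earlier in this thread kcibul describes the dbSNP information as a prior rather than a hard filter. The function name, labels, and input tables ("CHROM<TAB>POS" per line) are all made up.

    ```shell
    # Hypothetical helper. Arguments: candidate sites, dbSNP sites, COSMIC sites,
    # each a two-column "CHROM<TAB>POS" table.
    classify_sites() {
        awk 'BEGIN { FS = OFS = "\t" }
             FILENAME == ARGV[1] { dbsnp[$1 FS $2] = 1; next }
             FILENAME == ARGV[2] { cosmic[$1 FS $2] = 1; next }
             {
                 key = $1 FS $2
                 if (key in cosmic)
                     label = "in_COSMIC"      # eligible for "rescue" per the post above
                 else if (key in dbsnp)
                     label = "dbSNP_only"     # treated as probably germline
                 else
                     label = "in_neither"
                 print $1, $2, label
             }' "$2" "$3" "$1"
    }

    # Made-up site lists.
    demo_i=$(mktemp -d)
    printf '1\t100\n1\t200\n' > "$demo_i/dbsnp.sites"
    printf '1\t200\n1\t300\n' > "$demo_i/cosmic.sites"
    printf '1\t100\n1\t200\n1\t300\n1\t400\n' > "$demo_i/candidates.sites"
    classify_sites "$demo_i/candidates.sites" "$demo_i/dbsnp.sites" "$demo_i/cosmic.sites"
    ```

    A site that is in both lists (1:200 in the demo) gets the COSMIC label, mirroring the "rescue" idea described above.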

  • ste (au; Member)
    edited October 2014

    It did indeed.

    The sentence "We expect that anything that is really somatic that is flagged as such in dbSNP will also be in COSMIC" is what I was looking for.

    Thanks,
    Stefano

  • @sadiqsaleem09 said:
    shingwan I tried to use your script of combining coding and non-coding cosmic vcfs as follows:

    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicCodingMuts_v68.vcf.gz
    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v68.vcf.gz
    gunzip Cosmic*.gz
    grep "^#" CosmicCodingMuts_v68.vcf > VCF_Header
    grep -v "^#" CosmicCodingMuts_v68.vcf > Coding.clean
    grep -v "^#" CosmicNonCodingVariants_v68.vcf > Noncoding.clean
    cat Coding.clean Noncoding.clean | sort -gk 2,2 | awk '{print "chr"$0}' | perl sortByRef.pl --k 1 - ucsc.hg19.fasta.fai > Cosmic.hg19

    However, I got the following error:

    Bareword found where operator expected at sortByRef.pl line 6, near ""en" class"
    (Missing operator before class?)
    Bareword found where operator expected at sortByRef.pl line 13, near "<title>gatk"
    (Missing operator before gatk?)
    Can't modify numeric lt (<) in scalar assignment at sortByRef.pl line 6, near ""en" class"
    syntax error at sortByRef.pl line 6, near ""en" class"
    Unrecognized character \xC2; marked by <-- HERE after at master <-- HERE near column 40 at sortByRef.pl line 13.

    Any suggestions in terms of how to get pass this error related to sortByRef.pl script?
    Thank you!

    @sadiqsaleem09, I believe there is some problem with the Perl script itself. You can try chmod 777, or check whether you have the correct Perl script / Perl package installed. However, I haven't touched this for a while, so I might need to get the script again to know what the actual problem is.
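
    One more thing that may be worth checking: the quoted Perl errors mention `"en" class` and `<title>gatk`, which hints that the file saved as sortByRef.pl could be an HTML web page rather than the script itself. That is only a guess from the error text. A minimal check, with a made-up function name and made-up demo files:

    ```shell
    # Hypothetical helper: succeed if the first lines of a file look like HTML.
    looks_like_html() {
        head -n 20 "$1" | grep -Eqi '<!doctype html|<html|<title>'
    }

    # Made-up files: one saved web page, one plausible Perl script.
    demo_h=$(mktemp -d)
    printf '<!DOCTYPE html>\n<html lang="en" class="x">\n<title>gatk</title>\n' > "$demo_h/saved_page.pl"
    printf '#!/usr/bin/perl\nuse strict;\nprint "ok\\n";\n' > "$demo_h/real_script.pl"

    for f in "$demo_h/saved_page.pl" "$demo_h/real_script.pl"; do
        if looks_like_html "$f"; then
            echo "$f: looks like HTML, try downloading the raw file instead"
        else
            echo "$f: does not look like HTML"
        fi
    done
    ```

    If the check fires on your copy, file permissions are unlikely to be the issue, and fetching the raw script text would be the next thing to try.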

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @sadiqsaleem09 It looks like you might have a typo in your command -- there's a lonely - in perl sortByRef.pl --k 1 - ucsc.hg19.fasta.fai > Cosmic.hg19

  • @Geraldine_VdAuwera said:
    sadiqsaleem09 It looks like you might have a typo in your command -- there's a lonely - in perl sortByRef.pl --k 1 - ucsc.hg19.fasta.fai > Cosmic.hg19

    It looks weird, but in my case it only worked with this extra hyphen (otherwise sortByRef.pl gives a "Wrong number of arguments" error).

  • varsha (Florida; Member)

    Sorry, I am still confused. For an hg19 reference, if I use ucsc.hg19.fasta, can I use dbsnp_138.hg19.vcf and b37_cosmic_v54_120711.vcf (from the download page) to run MuTect? Please let me know. Thank you.

  • nroak (Houston; Member)

    @sadiqsaleem09 said:
    shingwan I tried to use your script of combining coding and non-coding cosmic vcfs as follows:

    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicCodingMuts_v68.vcf.gz
    wget ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v68.vcf.gz
    gunzip Cosmic*.gz
    grep "^#" CosmicCodingMuts_v68.vcf > VCF_Header
    grep -v "^#" CosmicCodingMuts_v68.vcf > Coding.clean
    grep -v "^#" CosmicNonCodingVariants_v68.vcf > Noncoding.clean
    cat Coding.clean Noncoding.clean | sort -gk 2,2 | awk '{print "chr"$0}' | perl sortByRef.pl --k 1 - ucsc.hg19.fasta.fai > Cosmic.hg19

    However, I got the following error:

    Bareword found where operator expected at sortByRef.pl line 6, near ""en" class"
    (Missing operator before class?)
    Bareword found where operator expected at sortByRef.pl line 13, near "<title>gatk"
    (Missing operator before gatk?)
    Can't modify numeric lt (<) in scalar assignment at sortByRef.pl line 6, near ""en" class"
    syntax error at sortByRef.pl line 6, near ""en" class"
    Unrecognized character \xC2; marked by <-- HERE after at master <-- HERE near column 40 at sortByRef.pl line 13.

    Any suggestions in terms of how to get pass this error related to sortByRef.pl script?
    Thank you!

    Thanks @sadiqsaleem09 for the code. In fact I could run exactly what you have written without any issues. I ran it for b37 though without the 'awk' part. The extra - is indeed needed for the script to run without any errors, which I noticed is because there is an 'if' statement in the code that requires minimum of 2 arguments apart from the input .fai file.

  • aammar1 (Chicago; Member)

    Hello,

    I am trying to run MuTect and I am running into compatibility problems with the cosmic file. I initially used hg19 to do alignment. In the hg19 folder (in the resource bundle) I cannot find the proper cosmic file. I tried to use the b37_cosmic_v54_120711.vcf but it does not work. I know that hg19 needs the chr# format. I am looking for the hg19_cosmic_v54_120711.vcf file. I'm sorry if this is a repeated question but I tried looking and can't find it.

    Thank you in advance!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Sorry, we don't distribute that file.

  • Hi @Geraldine_VdAuwera and @kcibul,

    About the COSMIC VCF file for MuTect: we can now download CosmicCodingMuts.vcf.gz and CosmicNonCodingVariants.vcf.gz from COSMIC directly. Is simply combining these two parts a good and easy way to generate the VCF?

    I notice, though, that these two parts seem to come from CosmicCompleteExport.tsv.gz, whereas the current b37_cosmic_v54_120711.vcf was converted from CosmicMutantExport. So I am a bit confused. Do you know the differences between them?

    For WGS somatic mutation detection, should I use the CompleteExport?

    Many thanks in advance.

  • Judging from the descriptions on their website, CosmicMutantExport seems to be the right choice for MuTect. Any thoughts?

    Complete COSMIC data:
    A tab separated table of complete curated COSMIC dataset from the current release. It includes all point mutations, negative data set and gene fusion mutations (CosmicCompleteExport).

    Complete mutation data:
    A tab separated table of all the point mutations in cosmic with all the fusion mutations from the current release (CosmicMutantExport).

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Sorry, we can't comment on the choice of specific file. What MuTect expects is the list of point mutations.

  • YingLiu (China; Member)

    ftp ftp.broadinstitute.org
    then enter 'gsapubftp-anonymous' as the user name;
    no password is needed.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @YingLiu
    Hi,

    Were you able to resolve the issue after trying a few more times?

    -Sheila

  • YingLiu (China; Member)

    @Sheila, resolved, thank you!

  • iriantoj (Member)
    edited November 2017

    Thank you @shingwan @mducar !!!

    I downloaded the Cosmic files for h37 from https://cancer.sanger.ac.uk/cosmic/download

    sortByRef.pl was part of https://github.com/amplab/smash

    Then I used the following code:
    gunzip Cosmic*.gz
    grep "^#" CosmicCodingMuts.vcf > VCF_Header
    grep -v "^#" CosmicCodingMuts.vcf > Coding.clean
    grep -v "^#" CosmicNonCodingVariants.vcf > NonCoding.clean
    cat Coding.clean NonCoding.clean | sort -gk 2,2 | awk '{print "chr"$0}' | perl sortByRef.pl --k 1 - hg19UCSC.fa.fai > Cosmic.hg19
    cat VCF_Header Cosmic.hg19 > Cosmic.hg19.vcf

    The product is the following:
    https://upenn.box.com/s/x4xkno440jxmyglw0wohjk74ipso0hqr

    Going to try MuTect now ... fingers crossed
