Ensembl ref and GATK

mattqdeanmattqdean CAMember
edited January 2016 in Ask the GATK team

I have a cancer rna-seq bam file that I would like to use to call variants using MuTect2, but I ran into a couple of errors that I would appreciate some assistance with. Thank you.

  1. The chr notations in my bam file is absent in the reference fasta (Homo_sapiens_assembly19.fasta). I'd rather not change my bam file header using sed and samtools reheader and rather append chr notation into the reference file. Will this mess up the corresponding snp (00-All.vcf)?

  2. Is there a karytypically sorted Ensembl reference (and its corresponding dbSNP vcf) that I can use for GATK workflow?

  3. The biggest issue is "Error: Discordant contig lengths: read MT LN=16571, ref MT LN=16569" when I use picard's reordersam function. This is because the read was aligned against Ensembl reference. Is there a quick fix to resolve this?

EDIT: I should mention that realigning my seq against GATK's reference is not an option because I don't have the original fastq files available.

Post edited by mattqdean on

Best Answers

Answers

  • mattqdeanmattqdean CAMember
    edited January 2016

    Thank you for your help! I have the ucsc.hg19.fa which does have chr notations. The contig length matches with the reads but it also has these random lengths that are not in my reads:

    @SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd
    @SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623
    @SQ SN:chr1_gl000191_random LN:106433 M5:d75b436f50a8214ee9c2a51d30b2c2cc
    @SQ SN:chr1_gl000192_random LN:547496 M5:325ba9e808f669dfeee210fdd7b470ac
    @SQ SN:chr4_ctg9_hap1 LN:590426 M5:fa24f81b680df26bcfb6d69b784fbe36
    @SQ SN:chr4_gl000193_random LN:189789 M5:dbb6e8ece0b5de29da56601613007c2a
    @SQ SN:chr4_gl000194_random LN:191469 M5:6ac8f815bf8e845bb3031b73f812c012
    @SQ SN:chr6_apd_hap1 LN:4622290 M5:fe71bc63420d666884f37a3ad79f3317
    @SQ SN:chr6_cox_hap2 LN:4795371 M5:18c17e1641ef04873b15f40f6c8659a4

    1. If I use ucsc.hg19.fa to sort my reads, will I have to revise my headers to include these random contains?
    2. Is ucsc.hg19 sort order applicable for MuTect2?
    3. May I use ncbi's dbSNP (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz) for calling somatic variants using ucsc.hg19 in MuTect2?
    4. If I'm only interested in specific regions (chr1, chr8, chr17, chr22) is there a way to use a custom reference and custom dbSNP for mutect? I saw in your previous post that this is usually not recommended(?) but I am only trying to verify that there exists variants in these regions with (some) confidence.

    EDIT: Concerning #3, I found that, no, this particular build does not work in mutect.

    ERROR dbsnp contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605]
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

    Can you refer me to a build that will be compatible with ucsc.hg19.fa?

    Post edited by mattqdean on
  • mattqdeanmattqdean CAMember

    I found a lead for a working snp from ftp://ftp.broadinstitute.org/bundle/2.8/hg19/. There is build 138 that I will try and report if it works.

Sign In or Register to comment.