Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Ensembl ref and GATK

mattqdeanmattqdean CAMember
edited January 2016 in Ask the GATK team

I have a cancer rna-seq bam file that I would like to use to call variants using MuTect2, but I ran into a couple of errors that I would appreciate some assistance with. Thank you.

  1. The chr notations in my bam file is absent in the reference fasta (Homo_sapiens_assembly19.fasta). I'd rather not change my bam file header using sed and samtools reheader and rather append chr notation into the reference file. Will this mess up the corresponding snp (00-All.vcf)?

  2. Is there a karytypically sorted Ensembl reference (and its corresponding dbSNP vcf) that I can use for GATK workflow?

  3. The biggest issue is "Error: Discordant contig lengths: read MT LN=16571, ref MT LN=16569" when I use picard's reordersam function. This is because the read was aligned against Ensembl reference. Is there a quick fix to resolve this?

EDIT: I should mention that realigning my seq against GATK's reference is not an option because I don't have the original fastq files available.

Post edited by mattqdean on

Best Answers

Answers

  • mattqdeanmattqdean CAMember
    edited January 2016

    Thank you for your help! I have the ucsc.hg19.fa which does have chr notations. The contig length matches with the reads but it also has these random lengths that are not in my reads:

    @SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd
    @SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623
    @SQ SN:chr1_gl000191_random LN:106433 M5:d75b436f50a8214ee9c2a51d30b2c2cc
    @SQ SN:chr1_gl000192_random LN:547496 M5:325ba9e808f669dfeee210fdd7b470ac
    @SQ SN:chr4_ctg9_hap1 LN:590426 M5:fa24f81b680df26bcfb6d69b784fbe36
    @SQ SN:chr4_gl000193_random LN:189789 M5:dbb6e8ece0b5de29da56601613007c2a
    @SQ SN:chr4_gl000194_random LN:191469 M5:6ac8f815bf8e845bb3031b73f812c012
    @SQ SN:chr6_apd_hap1 LN:4622290 M5:fe71bc63420d666884f37a3ad79f3317
    @SQ SN:chr6_cox_hap2 LN:4795371 M5:18c17e1641ef04873b15f40f6c8659a4

    1. If I use ucsc.hg19.fa to sort my reads, will I have to revise my headers to include these random contains?
    2. Is ucsc.hg19 sort order applicable for MuTect2?
    3. May I use ncbi's dbSNP (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz) for calling somatic variants using ucsc.hg19 in MuTect2?
    4. If I'm only interested in specific regions (chr1, chr8, chr17, chr22) is there a way to use a custom reference and custom dbSNP for mutect? I saw in your previous post that this is usually not recommended(?) but I am only trying to verify that there exists variants in these regions with (some) confidence.

    EDIT: Concerning #3, I found that, no, this particular build does not work in mutect.

    ERROR dbsnp contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605]
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

    Can you refer me to a build that will be compatible with ucsc.hg19.fa?

    Post edited by mattqdean on
  • mattqdeanmattqdean CAMember

    I found a lead for a working snp from ftp://ftp.broadinstitute.org/bundle/2.8/hg19/. There is build 138 that I will try and report if it works.

Sign In or Register to comment.