Badly formed genome loc: Contig 'chr1' does not match any contig in the GATK sequence dictionary

nroaknroak HoustonMember
edited September 2015 in Ask the GATK team

I hate to put this same error on the GATK forum again, but I went through many of these errors already posted on the forum, but none of the answers shed light on my issue.
I have my bam files aligned to GRCh37-lite and am using the same reference genome downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/special_requests

I have next performed GATK best practices for pre-processing of these bams using the same ref genome without throwing any error in the process.
Currently I'm running MuTect as
java -Xmx56g -jar muTect-1.1.4.jar
--analysis_type MuTect
--reference_sequence ./resources/b37/human_g1k_v37.fasta
--cosmic ./resources/Cosmic.b37.vcf
--dbsnp ./resources/dbsnp_138.b37.vcf
--intervals ./resources/mirna.1.5flank-interval-list.list
--input_file:normal $normal.recal_reads.bam
--input_file:tumor $tumor.recal_reads.bam
--out $sample.call_stats.out
--coverage_file $sample.coverage.wig.txt

And getting this error message:

ERROR MESSAGE: Badly formed genome loc: Contig 'chr1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

What more tests should I run to troubleshoot this issue?
Also, the interval list is what I created from a .bed file. I have restricted my bam files to a limited bed regions using the same file in a command "samtools view -@8 -b -h -L"

This was the file I was most confused about. Is it possible that this file is causing the error? First few lines of this file are:
chr1:15869-18936
chr1:28866-32003
chr1:566205-569293
chr1:1100984-1104078
chr1:1101743-1104832
chr1:1102885-1105967
chr1:1229990-1233050
chr1:1246382-1249446
chr1:1273530-1276588
chr1:3043039-3046099
chr1:3475759-3478854
chr1:5622631-5625703
chr1:5921232-5924301
chr1:6488394-6491456
chr1:8925061-8928149
chr1:9210227-9213336
chr1:10025939-10029016
chr1:10286276-10289361

chr1:12087715-12090779

-

-

Thanks a ton for your help!

Tagged:

Best Answers

Answers

  • nroaknroak HoustonMember
    edited September 2015

    And as soon as I posted this, I figured out it was indeed the intervals file. I just gave the list of chr in this file and now MuTect is running fine.
    Sorry for the inconvenience.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @nroak
    I am happy to hear you fixed the issue on your own :smile:

    -Sheila

  • nroaknroak HoustonMember

    Thank you @Sheila for reply. however, soon after MuTect completed the analysis, I faced a new issue.
    I'm not getting any mutations in my variant output file and only files header is printed out as

    muTector v1.0.47986

    contig position context ref_allele alt_allele tumor_name normal_name score dbsnp_site covered power tumor_power normal_power normal_power_nsp normal_power_wsp total_pairs improper_pairs map_Q0_reads init_t_lod t_lod_fstar t_lod_lqs t_lod_fstar_forward t_lod_fstar_reverse tumor_f tumor_f_lb contaminant_fraction contaminant_lod t_q20_count t_ref_count t_alt_count t_ref_sum t_alt_sum t_ref_max_mapq t_alt_max_mapq t_ins_count t_del_count normal_best_gt init_n_lod n_lod_fstar normal_f normal_f_quals normal_artifact_lod_tf normal_artifact_lod_low_tf normal_artifact_lod_nf normal_artifact_lod_nfq n_q20_count n_ref_count n_alt_count n_ref_sum n_alt_sum power_to_detect_positive_strand_artifact power_to_detect_negative_strand_artifact strand_bias_counts tumor_alt_fpir_median tumor_alt_fpir_mad tumor_alt_rpir_median tumor_alt_rpir_mad alt_fpir alt_rpir powered_filters normal_artifact_power_tf normal_artifact_power_low_tf normal_artifact_power_nf normal_global_qll normal_local_qll normal_qmodel_lod observed_in_normals_count failure_reasons judgement

    The bash .out file ends with a warning:
    INFO 12:59:29,640 ProgressMeter - 9:37527101 1.54e+09 31.0 m 1.2 s 50.8% 61.0 m 30.0 m
    INFO 12:59:59,641 ProgressMeter - 9:66039601 1.54e+09 31.5 m 1.2 s 51.8% 60.9 m 29.4 m
    INFO 13:00:29,642 ProgressMeter - 9:92353601 1.54e+09 32.0 m 1.2 s 52.6% 60.8 m 28.8 m
    INFO 13:00:59,643 ProgressMeter - 9:118460201 1.54e+09 32.5 m 1.3 s 53.4% 60.8 m 28.3 m
    WARN 13:01:26,954 RestStorageService - Error Response: PUT '/GATK_Run_Reports/iaglT1lmX64RthqnQ8T93274nf4eOqhI.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1177, Content-MD5: FypXzYZQjY9GI08cdNy+lw==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 172a57cd86508d8f46234f1c74dcbe97, Date: Fri, 11 Sep 2015 18:01:26 GMT, Authorization: AWS AKIAJXU7VIHBPDW4TDSQ:9gBFJ2MkRG5Z/rhaF1sdFBKZ7tM=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-358.el6.x86_64; amd64; en; JVM 1.7.0_13), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 621CC6794053E5D5, x-amz-id-2: G8fWcIqMQi7I0a6WHcZdqicLRfU3/SPfuPjxCyb3t2+IOobPZRIH9dGCMhSbuAs3, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Fri, 11 Sep 2015 18:01:26 GMT, Connection: close, Server: AmazonS3]

    I next tried haplotypeCaller to call variants in my bam files independently but again only VCF headers were returned without any variants in the file.

    Can you please let me know what's the possible cause?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @nroak
    Hi,

    Can you try again? I think maybe the internet connection was lost or had some issues.

    Thanks,
    Sheila

  • nroaknroak HoustonMember

    So I just finished running it again and it is still giving me the exact same output.

    Issue · Github
    by Sheila

    Issue Number
    170
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • nroaknroak HoustonMember

    @Sheila , Thank you for the response. I talked to our IT/ cluster admins and they believe I have all the required access. Perhaps, can you please elaborate on the issue I may have here? Meanwhile, I also tried using HaplotypeCaller to call variants individually in each file and I'm facing the same issue where the output vcf contains only the header with no variants in it.
    Can it be because of a filtering criteria?

  • nroaknroak HoustonMember

    Thank you @Geraldine_VdAuwera for the response! I will use the latest release from gatk downloads page.

  • nroaknroak HoustonMember

    Hi @Geraldine_VdAuwera and @Sheila I was not sure if I should start a new thread or update with my problem here. So I still have the issue of empty variant files either by MuTect or HaplotypeCaller. That made me think that the problem has to be in one of my upstream pre-processing steps. I have now identified the problem but don't know how to fix it.
    In my first step, I am restricting my WGS bams to a subset of regions that I'm interested in using the command:
    _samtools view -@8 -b -h -L $regions_of_interest -o ROI.bam $WGS.bam
    _
    This is where the original mapping qualities are being reset to 0 for all the reads in the resulting bam and subsequently HaplotypeCaller or MuTect are not calling any variants in any of these regions.

    Can you please let me know how should I restrict my WGS bams without losing mapping quality information?

    Thank you very much!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I don't know if your problem is due to the samtools command (that surprises me) but you could try just using the -L argument of GATK and MuTect to run on only your intervals of interest, without having to subset your data up front.

  • nroaknroak HoustonMember

    Ok. So if I understand this correct, In my first step of RealignerTargetCreator, I can use -L option and give the *.bed file of a subset of genomic co-ordinates (chr start end gene_name).
    One query is, do i use the -L option all the way through the PrintReads step or do I need to use it only in first step?
    Second is, does this option merge the overlapping genomic regions if any or just repeats the reads in the region?
    Lastly, I was running the samtools prior to all the pre-processing steps to save the computational time, I'm curious if gatk -L will serve this purpose?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
  • nroaknroak HoustonMember

    Thank you @Sheila for the link. I was able to find this link later on and used it as a guide to run the pipeline. But I noticed that even after using -L option, my .bam size is constant where I expect it to go from ~80gb to ~3-5gb (as seen with samtools -L, which unfortunately is resetting my mapq values to '0'). I was able to run through the entire pipeline and call variants using HaplotypeCaller. But this won't be the optimal pipeline for me to use since I will need to deal with WGS bams for all samples where I need only a small non-coding region of the genome. Am I expected to get a smaller bam after I use -L option with GATK indel realigner? In that case something might be going wrong (error file has no errors or warnings). Can you please let me know some suggestions based on your experience? That will be very helpful! Thank you.
    Ninad

  • nroaknroak HoustonMember

    I initially had followed the outline of using -L option from the link you gave this morning. That post suggests not to use -L for commands where bam or vcf files are emitted as an output. I understand this was done to make sure no data was lost. Although I realized that was precisely what I was planning to do, to restrict bams to a smaller size. So after I added -L option to those commands emitting bam files, I now have smaller bams that are processed much faster. i have anyways padded my regions with 1.5kb flanking sequences. I'm hoping this won't affect the resultant mutations. Thank you for your help!

Sign In or Register to comment.