ReduceReads Unknown Contig

kylechbkylechb Boston Children's HospitalPosts: 4Member

Hello,

I am new to the field, so please forgive me if this has a simple answer. I am attempting to call variants on 6 whole exome sequences. The best practices documentation suggested using 30 or more samples, so I downloaded 24 bam files from the 1000genomes database to use with mine. However, whenever I attempted to use ReduceReads on the files, I received the following error right near 100% completion:

ERROR MESSAGE: BUG: requested unknown contig=NC_007605 index=-1

I am using the latest b37 reference file from the bundle, and I have tried re-indexing it and re-forming the .dict file. Here is the stack trace from the above error. What is causing this, and how do I fix it?

ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: requested unknown contig=NC_007605 index=-1
at org.broadinstitute.sting.utils.MRUCachingSAMSequenceDictionary.updateCache(MRUCachingSAMSequenceDictionary.java:178)
at org.broadinstitute.sting.utils.MRUCachingSAMSequenceDictionary.getSequence(MRUCachingSAMSequenceDictionary.java:109)
at org.broadinstitute.sting.utils.GenomeLocParser.validateGenomeLoc(GenomeLocParser.java:284)
at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:239)
at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:449)
at org.broadinstitute.sting.gatk.datasources.providers.ReadReferenceView.getReferenceContext(ReadReferenceView.java:98)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:139)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:128)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.aggregateMapData(TraverseReadsNano.java:119)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:101)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Which file(s) did this error happen with? Can you post your command lines?

    Geraldine Van der Auwera, PhD

  • kylechbkylechb Boston Children's HospitalPosts: 4Member

    It happened with the 24 bam files I got from the 1000genomes project, but not with the 6 samples I have. My command line was:

    for f in *.bam ; do echo $f ; java -jar /opt/GenomeAnalysisTK-2.7-4/GenomeAnalysisTK.jar -T ReduceReads -R ../References/human_g1k_v37.fa -I "$f" -o "$f".reduced.bam

    I'm using the 1000genome exome bams (such as HG00116.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam, for example)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Hmm, it looks like the 1000G exomes include some extra contigs (seems NC_007605 is herpes virus) compared to our reference. Can you check the headers of the 1000G bam files for the reference info?

    Geraldine Van der Auwera, PhD

  • kylechbkylechb Boston Children's HospitalPosts: 4Member

    Ah, I see that now. They included a couple of different things. I copied the lines that differed from the reference file:

    @SQ SN:NC_007605 LN:171823 M5:6743bd63b3ff2b5b8985d8933c53290a UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
    @SQ SN:hs37d5 LN:35477943 M5:5b6a4b3a81a2d3c134b7d14bf6ad39f1 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human

    So I should download and use the hs37d5.fa file for reducing reads instead of the b37 build I found in the GATK bundle, correct? If I do that though, will I need to realign all of my samples as well? Or can I plug them into the haplotypecaller along with the 24 samples from the 1000genomes project even though the latter apparently contains some extra contigs?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    I think @Kurt is correct -- our bundle's *_decoy reference should work with the 1000G files. If you use that as a reference you shouldn't need to realign your bam files.

    Geraldine Van der Auwera, PhD

  • kylechbkylechb Boston Children's HospitalPosts: 4Member

    Excellent, thank you both so much!

Sign In or Register to comment.