Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

BaseRecalibrator: Reads file is unmapped. Skipping validation against reference.

mdoeringmdoering Saarbruecken, GermanyMember
edited August 2015 in Ask the GATK team

Hi,
I'm currently trying to recalibrate base qualities. I already prepared my BAM file to include the necessary fields (RG) for GATK. When I call

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R myref.fasta -I reads.bam -knownSites empty.vcf -o recal_data.table

I get the error that the reads file is unmapped. Of course, this is not true. The BAM file is coordinate-sorted and has an associated index file (bai). The reference also has an index file (fai) and just contains a single contig. I tested renaming the index files from .bam.bai and fasta.fai to .bai and .fai but this didn't help. Furthermore, I checked the content of my BAM file to see whether the given reference sequence name matches with that in the .fasta file and it does

The .fasta looks like this (first couple of lines):

>sample_consensus_048 GCCAGCCCCCTGATGGGGGCGACACTCCACCATGAATCACTCCCCTGTGAGGAACTATTG TCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAGTGTCGTGCAGCCTCCAGGAC CCCCCCTCCCGGGAGAGCCATAGTGGTCTGCGGAACCGGTGAGTACACCGGAATTGCCAG GACGACCGGGTCCTTTCTTGGATAAACCCGCTCAATGCCTGGAGATTTGGGCGTGCCCCC
The .bam looks like this (first couple of lines, for RG info I added some placeholder values):
@HD VN:1.4 SO:coordinate @SQ SN:sample_consensus_048 LN:9374 @RG ID:1 LB:UsedLibrary PL:Illumina SM:ReadGroupSampleName PU:BARCODE @PG ID:smalt PN:smalt VN:0.7.4 CL:smaltcommand MISEQ-02:155:000000000-AF9CT:1:1101:7745:13826 99 sample_consensus_048 3499 60 218M = 3861 486 ACCAAGTGGAGGGTGAGGTCCAGATTGTGTCAACTGCTGCCCAGACTTTCCTGGCAACGTGCATCAATGGGGTGTGCTGGACCGTCTACCACGGGGCGGGAACGAGGACCATTGCATCACCCAAGGGTCCTGTCATACAGATGTACACCAATGTAGACAAAGACCTTGTAGGCTGGCCCGCTCCTCAAGGTACCCGCTCACTGACACCCTGCACCTGC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGAFGGGGGGGGGGGGGGGGGGFFGGGGGDEEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGFFGGCFGGGGGGGGECEGGGGGGFGGFEGFGGGGFGGGGGGCGEGGGGGGG:8DFGFGGFGFB:(9C<6+<[email protected];534;@@FEF RG:Z:1 NM:i:0 AS:i:218 MISEQ-02:155:000000000-AF9CT:1:1103:24094:6940 99 sample_consensus_048 3499 60 213M = 3871 487 ACCAAGTGGAGGGTGAGGTCCAGATTGTGTCAACTGCTGCCCAGACTTTCCTGGCAACGTGCATCAATGGGGTGTGCTGGACCGTCTACCACGGGGCGGGAACGAGGACCATTGCATCACCCAAGGGTCTCGTCATACAGATGTACACCAATGTAGACAAAGACCTTGTAGGCTGGCCCGGTCCTCAAGGTACCCGCTCACTGACACCCTGCA CCCCCFEFGDFEGEGGGGGEFGGGGGGGGGGFFGFGFGC96EEGCG<C<[email protected]@FFGG<CFCFGFGGGGCG:[email protected]<=BDAE<FGG83<<[email protected],,73:CGC,EAA;,>3EFAE,<[email protected]@BBF9=3,@EF;?C*>=C*=1:+?9<>FCFAGFG*0/3++2=*2;D68=?D RG:Z:1 NM:i:3 AS:i:204
At the moment I just use an empty VCF file, as it doesn't make any difference and I wanted to exclude the structure of the VCF as an error source (backslashes before bashes are just inserted for formatting on the forum).

\##fileformat=VCFv4.0 \##fileDate=20150730 \##source=lofreq_cmd \##reference=ref_seq.fasta \##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth"> \##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency"> \##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position"> \##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"> \##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL."> \##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant)."> \##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position"> \##FILTER=<ID=min_snvqual_49,Description="Minimum SNV Quality (Phred) 49"> \##FILTER=<ID=min_indelqual_20,Description="Minimum Indel Quality (Phred) 20"> \#CHROM POS ID REF ALT QUAL FILTER INFO
I don't have any problems with my BAM file in any other usage scenario, e.g., using samtools or Biopython, so I'm not sure why GATK throws an error about unmapped reads. I would be really grateful if someone had an idea what might cause the problem with my bam/reference/vcf! I could only find one post where a person had the same error, but I didn't find anything that could help me.

Best
Matthias

Tagged:

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mdoering
    Hi Matthias,

    Can you try running Picard's validate Sam File on your bam file? http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile

    Also, we don't recommend running Base Recalibrator with no known sites file. Have a look at this article to see what we recommend when working with organisms that don't have a good known variation database: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

    -Sheila

  • mdoeringmdoering Saarbruecken, GermanyMember

    Dear Sheila,
    that was quick! :-)

    I validated the bam file and got "No errors found". Looks good, right?

    I know that the calibrator should use a VCF file, otherwise recalibration wouldn't be so useful right? I just inputted an empty VCF for testing. For the actual usage I would first call variants and then use the called variants as 'true SNPs'. However, I now tried with a VCF with some content:

    \##reference=ref.fasta \##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth"> \##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency"> \##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position"> \##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"> \##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL."> \##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant)."> \##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position"> \##FILTER=<ID=min_snvqual_49,Description="Minimum SNV Quality (Phred) 49"> \##FILTER=<ID=min_indelqual_20,Description="Minimum Indel Quality (Phred) 20"> \#CHROM POS ID REF ALT QUAL FILTER INFO sample_consensus_048 6288 . T C 107 PASS DP=257857;AF=0.000717;SB=0;DP4=257578,0,186,0

    Anyway, I still get the same error:

    INFO 17:58:29,900 GenomeAnalysisEngine - Strictness is SILENT INFO 17:58:29,971 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 17:58:29,977 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:58:29,995 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 17:58:30,033 GenomeAnalysisEngine - Reads file is unmapped. Skipping validation against reference. INFO 17:58:30,086 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:58:30,090 GenomeAnalysisEngine - Done preparing for traversal INFO 17:58:30,090 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:58:30,090 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:58:30,091 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 17:58:30,113 BaseRecalibrator - The covariates being used here: INFO 17:58:30,113 BaseRecalibrator - ReadGroupCovariate INFO 17:58:30,113 BaseRecalibrator - QualityScoreCovariate INFO 17:58:30,113 BaseRecalibrator - ContextCovariate INFO 17:58:30,113 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3 INFO 17:58:30,113 BaseRecalibrator - CycleCovariate INFO 17:58:30,116 ReadShardBalancer$1 - Loading BAM index data INFO 17:58:30,117 ReadShardBalancer$1 - Done loading BAM index data INFO 17:58:30,126 BaseRecalibrator - Calculating quantized quality scores... INFO 17:58:30,255 BaseRecalibrator - Writing recalibration report... INFO 17:58:30,300 BaseRecalibrator - ...done! INFO 17:58:30,300 BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads INFO 17:58:30,301 ProgressMeter - done 0.0 0.0 s 58.5 h 100.0% 0.0 s 0.0 s INFO 17:58:30,302 ProgressMeter - Total runtime 0.21 secs, 0.00 min, 0.00 hours INFO 17:58:31,530 GATKRunReport - Uploaded run statistics report to AWS S3

    Cheers
    Matthias

    Issue · Github
    by Sheila

    Issue Number
    93
    State
    closed
    Last Updated
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mdoering
    Hi Matthias,

    Hmm. I don't think you have to worry about anything because the program ran to completion. Does your reference fasta file have a dictionary file?

    -Sheila

  • mdoeringmdoering Saarbruecken, GermanyMember

    Dear Sheila,
    sorry to have bother you with the problem. I just re-indexed the reads bam file and now GATK is running fine. It's a bit weird, since I thought that I had re-indexed the bam file after using picard to insert the missing RG tags, and, anyway, Picard shouldn't change the order of the reads when inserting the RG tags. Well, well ... anyway, it works ;-)

Sign In or Register to comment.