Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
BaseRecalibrator: Reads file is unmapped. Skipping validation against reference.
I'm currently trying to recalibrate base qualities. I already prepared my BAM file to include the necessary fields (RG) for GATK. When I call
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R myref.fasta -I reads.bam -knownSites empty.vcf -o recal_data.table
I get the error that the reads file is unmapped. Of course, this is not true. The BAM file is coordinate-sorted and has an associated index file (bai). The reference also has an index file (fai) and just contains a single contig. I tested renaming the index files from .bam.bai and fasta.fai to .bai and .fai but this didn't help. Furthermore, I checked the content of my BAM file to see whether the given reference sequence name matches with that in the .fasta file and it does
The .fasta looks like this (first couple of lines):
>sample_consensus_048 GCCAGCCCCCTGATGGGGGCGACACTCCACCATGAATCACTCCCCTGTGAGGAACTATTG TCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAGTGTCGTGCAGCCTCCAGGAC CCCCCCTCCCGGGAGAGCCATAGTGGTCTGCGGAACCGGTGAGTACACCGGAATTGCCAG GACGACCGGGTCCTTTCTTGGATAAACCCGCTCAATGCCTGGAGATTTGGGCGTGCCCCC
The .bam looks like this (first couple of lines, for RG info I added some placeholder values):
@HD VN:1.4 SO:coordinate @SQ SN:sample_consensus_048 LN:9374 @RG ID:1 LB:UsedLibrary PL:Illumina SM:ReadGroupSampleName PU:BARCODE @PG ID:smalt PN:smalt VN:0.7.4 CL:smaltcommand MISEQ-02:155:000000000-AF9CT:1:1101:7745:13826 99 sample_consensus_048 3499 60 218M = 3861 486 ACCAAGTGGAGGGTGAGGTCCAGATTGTGTCAACTGCTGCCCAGACTTTCCTGGCAACGTGCATCAATGGGGTGTGCTGGACCGTCTACCACGGGGCGGGAACGAGGACCATTGCATCACCCAAGGGTCCTGTCATACAGATGTACACCAATGTAGACAAAGACCTTGTAGGCTGGCCCGCTCCTCAAGGTACCCGCTCACTGACACCCTGCACCTGC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGAFGGGGGGGGGGGGGGGGGGFFGGGGGDEEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGFFGGCFGGGGGGGGECEGGGGGGFGGFEGFGGGGFGGGGGGCGEGGGGGGG:8DFGFGGFGFB:(9C<6+<[email protected];534;@@FEF RG:Z:1 NM:i:0 AS:i:218 MISEQ-02:155:000000000-AF9CT:1:1103:24094:6940 99 sample_consensus_048 3499 60 213M = 3871 487 ACCAAGTGGAGGGTGAGGTCCAGATTGTGTCAACTGCTGCCCAGACTTTCCTGGCAACGTGCATCAATGGGGTGTGCTGGACCGTCTACCACGGGGCGGGAACGAGGACCATTGCATCACCCAAGGGTCTCGTCATACAGATGTACACCAATGTAGACAAAGACCTTGTAGGCTGGCCCGGTCCTCAAGGTACCCGCTCACTGACACCCTGCA CCCCCFEFGDFEGEGGGGGEFGGGGGGGGGGFFGFGFGC96EEGCG<C<[email protected]@FFGG<CFCFGFGGGGCG:[email protected]<=BDAE<FGG83<<[email protected],,73:CGC,EAA;,>3EFAE,<[email protected]@BBF9=3,@EF;?C*>=C*=1:+?9<>FCFAGFG*0/3++2=*2;D68=?D RG:Z:1 NM:i:3 AS:i:204
At the moment I just use an empty VCF file, as it doesn't make any difference and I wanted to exclude the structure of the VCF as an error source (backslashes before bashes are just inserted for formatting on the forum).
\##fileformat=VCFv4.0 \##fileDate=20150730 \##source=lofreq_cmd \##reference=ref_seq.fasta \##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth"> \##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency"> \##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position"> \##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"> \##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL."> \##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant)."> \##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position"> \##FILTER=<ID=min_snvqual_49,Description="Minimum SNV Quality (Phred) 49"> \##FILTER=<ID=min_indelqual_20,Description="Minimum Indel Quality (Phred) 20"> \#CHROM POS ID REF ALT QUAL FILTER INFO
I don't have any problems with my BAM file in any other usage scenario, e.g., using samtools or Biopython, so I'm not sure why GATK throws an error about unmapped reads. I would be really grateful if someone had an idea what might cause the problem with my bam/reference/vcf! I could only find one post where a person had the same error, but I didn't find anything that could help me.