We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
How to select specific chromosomes and convert chromosome notation in BAM for HaplotypeCaller

I am preparing BAM files from the 1000 genomes project to use in my GATK pipeline (along with other already processed BAMs) and I have the following issues:
- chromosome notation on my BAMs is from GRCh37 but my pipeline uses hg19, so I would like to replace chromosome notation (1 -> chr1)
- the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
- and actually leave out all alternate contigs
This sounds quite trivial, but I haven't found a clean way to do this yet.
I have tried the following:
i=INPUT.bam j=OUTPUT.bam samtools view -h $i | awk 'BEGIN{FS=OFS="\t"} (/^@/ && !/@SQ/){print $0} $2~/^SN:[1-9]|^SN:X|^SN:Y/{print $0} $3~/^[1-9]|X|Y/{$3="chr"$3; print $0} ' | sed 's/SN:/SN:chr/g' | samtools view -bS - > $j
However, when I try running the HaplotypeCaller, I get the following error:
ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with
Could you help me prepare these BAM files for processing?
Thanks a lot in advance
Best Answer
-
pdexheimer ✭✭✭✭
Well, with the somewhat large caveat that I had no part in the decision process, I still think it's reasonable. You're right that GRCh37 and hg19 are substantially the same, but you also previously pointed out a handful of subtle ways in which they're different. The issue is that you could put into a bunch of heuristics to ensure that the references are used in a compatible way, but that would be entirely specific to this particular build of this particular organism. I haven't yet looked into the differences between GRCh38 and hg20 for instance, and GATK is used with plenty of non-human organisms.
Another potential solution to your problem is to reheader your bam after the transformation - just copy out the @SQ lines from one of your bams that was processed with hg19
Answers
This is dangerous, and GATK is semi-intentionally set up to make this kind of thing hard. The official advice is to realign and reprocess your data against GRCh37 (or reprocess 1KG to hg19, which is probably a bigger task).
I think you're getting this error because the reference fasta you're using has contigs (like chrM) that aren't listed in the bam headers. You can probably work around this by messing with your reference, but keep in mind that you'll run into another round of compatibility issues with every new resource you bring in (like training variants for VQSR). I've tried doing things like this in the past, but ultimately found it to be better and safer to start from scratch with a common genome
@palmeira, I see you're in Liege (I assume the Liege in Belgium). FYI we are doing a GATK workshop at the Royal Institute for Natural Sciences in Brussels later this month; I encourage you to attend. I'll ask the organizers if there's a webpage you can register at, if you're interested.
Thanks a lot @pdexheimer, but I was trying to avoid having to re-run my pipeline on my data on GRCh37 or re-process (part of) 1KG in hg19. Is it really such bad practice to change chromosome notation, given that GRCh37 and hg19 are the same assembly? Could someone explain to me why the GATK spits out this error for a contig present in the reference but not in the bam? Should I then remove all these unwanted contigs from my reference and other resources?
Thanks @Geraldine_VdAuwera, a colleague of mine sent me the link for this GATK workshop and I've registered.
Well, with the somewhat large caveat that I had no part in the decision process, I still think it's reasonable. You're right that GRCh37 and hg19 are substantially the same, but you also previously pointed out a handful of subtle ways in which they're different. The issue is that you could put into a bunch of heuristics to ensure that the references are used in a compatible way, but that would be entirely specific to this particular build of this particular organism. I haven't yet looked into the differences between GRCh38 and hg20 for instance, and GATK is used with plenty of non-human organisms.
Another potential solution to your problem is to reheader your bam after the transformation - just copy out the @SQ lines from one of your bams that was processed with hg19