Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

How to select specific chromosomes and convert chromosome notation in BAM for HaplotypeCaller

I am preparing BAM files from the 1000 genomes project to use in my GATK pipeline (along with other already processed BAMs) and I have the following issues:

  • chromosome notation on my BAMs is from GRCh37 but my pipeline uses hg19, so I would like to replace chromosome notation (1 -> chr1)
  • the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
  • and actually leave out all alternate contigs

This sounds quite trivial, but I haven't found a clean way to do this yet.
I have tried the following:

i=INPUT.bam
j=OUTPUT.bam
samtools view -h $i | awk 'BEGIN{FS=OFS="\t"} (/^@/ && !/@SQ/){print $0} $2~/^SN:[1-9]|^SN:X|^SN:Y/{print $0}  $3~/^[1-9]|X|Y/{$3="chr"$3; print $0} ' | sed 's/SN:/SN:chr/g' | samtools view -bS - > $j

However, when I try running the HaplotypeCaller, I get the following error:

ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with

Could you help me prepare these BAM files for processing?
Thanks a lot in advance

Best Answer

Answers

  • This is dangerous, and GATK is semi-intentionally set up to make this kind of thing hard. The official advice is to realign and reprocess your data against GRCh37 (or reprocess 1KG to hg19, which is probably a bigger task).

    I think you're getting this error because the reference fasta you're using has contigs (like chrM) that aren't listed in the bam headers. You can probably work around this by messing with your reference, but keep in mind that you'll run into another round of compatibility issues with every new resource you bring in (like training variants for VQSR). I've tried doing things like this in the past, but ultimately found it to be better and safer to start from scratch with a common genome

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @palmeira, I see you're in Liege (I assume the Liege in Belgium). FYI we are doing a GATK workshop at the Royal Institute for Natural Sciences in Brussels later this month; I encourage you to attend. I'll ask the organizers if there's a webpage you can register at, if you're interested.

  • palmeirapalmeira LiegeMember

    Thanks a lot @pdexheimer, but I was trying to avoid having to re-run my pipeline on my data on GRCh37 or re-process (part of) 1KG in hg19. Is it really such bad practice to change chromosome notation, given that GRCh37 and hg19 are the same assembly? Could someone explain to me why the GATK spits out this error for a contig present in the reference but not in the bam? Should I then remove all these unwanted contigs from my reference and other resources?

  • palmeirapalmeira LiegeMember

    Thanks @Geraldine_VdAuwera, a colleague of mine sent me the link for this GATK workshop and I've registered.

Sign In or Register to comment.