How to select specific chromosomes and convert chromosome notation in BAM for HaplotypeCaller

palmeirapalmeira LiegeMember ✭✭

I am preparing BAM files from the 1000 genomes project to use in my GATK pipeline (along with other already processed BAMs) and I have the following issues:

  • chromosome notation on my BAMs is from GRCh37 but my pipeline uses hg19, so I would like to replace chromosome notation (1 -> chr1)
  • the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
  • and actually leave out all alternate contigs

This sounds quite trivial, but I haven't found a clean way to do this yet.
I have tried the following:

i=INPUT.bam
j=OUTPUT.bam
samtools view -h $i | awk 'BEGIN{FS=OFS="\t"} (/^@/ && !/@SQ/){print $0} $2~/^SN:[1-9]|^SN:X|^SN:Y/{print $0}  $3~/^[1-9]|X|Y/{$3="chr"$3; print $0} ' | sed 's/SN:/SN:chr/g' | samtools view -bS - > $j

However, when I try running the HaplotypeCaller, I get the following error:

ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with

Could you help me prepare these BAM files for processing?
Thanks a lot in advance

Best Answer

Answers

  • pdexheimerpdexheimer Member ✭✭✭✭

    This is dangerous, and GATK is semi-intentionally set up to make this kind of thing hard. The official advice is to realign and reprocess your data against GRCh37 (or reprocess 1KG to hg19, which is probably a bigger task).

    I think you're getting this error because the reference fasta you're using has contigs (like chrM) that aren't listed in the bam headers. You can probably work around this by messing with your reference, but keep in mind that you'll run into another round of compatibility issues with every new resource you bring in (like training variants for VQSR). I've tried doing things like this in the past, but ultimately found it to be better and safer to start from scratch with a common genome

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @palmeira, I see you're in Liege (I assume the Liege in Belgium). FYI we are doing a GATK workshop at the Royal Institute for Natural Sciences in Brussels later this month; I encourage you to attend. I'll ask the organizers if there's a webpage you can register at, if you're interested.

  • palmeirapalmeira LiegeMember ✭✭

    Thanks a lot @pdexheimer, but I was trying to avoid having to re-run my pipeline on my data on GRCh37 or re-process (part of) 1KG in hg19. Is it really such bad practice to change chromosome notation, given that GRCh37 and hg19 are the same assembly? Could someone explain to me why the GATK spits out this error for a contig present in the reference but not in the bam? Should I then remove all these unwanted contigs from my reference and other resources?

  • palmeirapalmeira LiegeMember ✭✭

    Thanks @Geraldine_VdAuwera, a colleague of mine sent me the link for this GATK workshop and I've registered.

Sign In or Register to comment.