Badly formed genome loc in HaplotypeCaller; HLA contigs
I am trying to run HaplotypeCaller on a .bam file that I mapped against the new human reference hg38 and post-processed according to your recommendations (MarkDuplicates, indel realignment and BQSR). I used the human reference that includes decoy sequences to improve mapping, but I do not think it is useful to include them in SNP calling, so I decided to use the -L option and provide a list of chromosome and contig names. My command looks like this:
java -Xmx7g -jar /GenomeAnalysisTK.jar -T HaplotypeCaller -R reference_hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa -I dbsnp_recal/ind_dbsnp_recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 30 -L interval_lists/chr.intervals -o SNPcall_BQSR/ind_SNPcall_BQSR_rawcalls
My interval list starts with entries like 'chrX', then 'chrX_Z_random', then 'chrUn_Z', then 'chrX_Z_alt', and finally the HLA contigs (for example HLA-A*01:01:01:01).
I get this error:
Badly formed genome loc: Contig 'HLA-A*01:01:01' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
I checked the reference .dict and .fa files and my .bam: this contig does not exist; the real contig is called HLA-A*01:01:01:01.
So I removed the HLA-A contigs from my interval list; next come the HLA-B contigs. I get the same error, this time about the contig “HLA-B*07:02”, which does not exist either; the correct contig is called HLA-B*07:02:01.
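A check of interval names against the sequence dictionary can be sketched in Python as follows. This is only an illustration: it assumes the .dict file uses the usual tab-separated @SQ lines with an SN: field, the function names are made up, and the toy contig lengths are placeholders rather than real values.

```python
def dict_contigs(dict_lines):
    """Collect contig names from the SN: field of @SQ header lines."""
    names = set()
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def missing_contigs(interval_lines, known):
    """Return interval entries (one name per line) absent from the dictionary."""
    missing = []
    for line in interval_lines:
        name = line.strip()
        if name and name not in known:
            missing.append(name)
    return missing


# Toy in-memory data; the LN values are placeholders, not real lengths.
example_dict = [
    "@HD\tVN:1.5\n",
    "@SQ\tSN:chrX\tLN:100\n",
    "@SQ\tSN:HLA-A*01:01:01:01\tLN:100\n",
]
example_intervals = ["chrX\n", "HLA-A*01:01:01:01\n", "HLA-A*01:01:01\n"]

print(missing_contigs(example_intervals, dict_contigs(example_dict)))
```

With real files, the two lists would be replaced by open file handles for the .dict and the interval list. In this toy data, only the shortened name is reported as missing.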
Is there any reason why the contig names are not read properly? Is it because of the “:”?
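To make my guess about the “:” concrete, here is a small Python sketch. It is not GATK code and I do not know how GATK actually parses intervals; it only assumes that an entry is treated as contig:position and split at its last colon. The function name is made up for illustration.

```python
def split_at_last_colon(interval):
    """Split an interval string at its last ':' (assumed behaviour, not GATK's)."""
    contig, sep, rest = interval.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the contig name.
        return interval, None
    return contig, rest


for name in ["chrX", "HLA-A*01:01:01:01", "HLA-B*07:02:01"]:
    print(name, "->", split_at_last_colon(name))
```

Under that assumption, the last “:01” of each HLA name is taken as a position and the remaining prefix is exactly the shortened contig name reported in each of the two error messages, while names without a colon such as chrX are left untouched.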
Thank you in advance for your suggestions! Please tell me if you need more information.