Badly formed genome loc in HaplotypeCaller; HLA contigs
I am trying to run HaplotypeCaller on a .bam file that I mapped against the new human reference hg38 and post-processed according to your recommendations (MarkDuplicates, indel realignment and BQSR). I used the human reference that includes decoy sequences to improve mapping, but I do not think it is useful to include them in SNP calling, so I decided to use the -L option and provide a list of chromosome and contig names. My command looks like this:
java -Xmx7g -jar /GenomeAnalysisTK.jar -T HaplotypeCaller -R reference_hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa -I dbsnp_recal/ind_dbsnp_recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 30 -L interval_lists/chr.intervals -o SNPcall_BQSR/ind_SNPcall_BQSR_rawcalls
My interval list starts with entries like 'chrX', then 'chrX_Z_random', then 'chrUn_Z', then 'chrX_Z_alt', and finally the HLA contigs (for example HLA-A*01:01:01:01).
I get this error:
Badly formed genome loc: Contig 'HLA-A*01:01:01' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
I checked the reference .dict and .fa, and my .bam: this contig does not exist in any of them. The real contig is called HLA-A*01:01:01:01.
So I removed the HLA-A contigs from my interval list; the HLA-B contigs come next. I get the same error, this time complaining about the contig 'HLA-B*07:02', which does not exist either. The right contig is called HLA-B*07:02:01.
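For reference, the check I did by hand can be sketched roughly as below. It assumes the .dict file is a SAM-style header whose @SQ lines carry the contig name in an SN: field, and that the interval list holds one bare contig name per line. The function names and the sample data (including the LN values, which are placeholders) are made up for illustration.

```python
def contigs_in_dict(dict_lines):
    """Collect contig names from the SN: field of each @SQ line."""
    names = set()
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header lines
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def missing_intervals(interval_lines, known):
    """Return interval entries that are not contig names in the dictionary."""
    entries = (line.strip() for line in interval_lines)
    return [name for name in entries if name and name not in known]


# Hypothetical in-memory data; LN values are placeholders, not real lengths.
example_dict = [
    "@HD\tVN:1.5\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:HLA-A*01:01:01:01\tLN:1000\n",
]
example_intervals = ["chr1\n", "HLA-A*01:01:01:01\n", "HLA-A*01:01:01\n"]

known = contigs_in_dict(example_dict)
print(missing_intervals(example_intervals, known))  # ['HLA-A*01:01:01']
```

With my real files, every name in the interval list is present in the dictionary, so the truncated names only appear in the error messages.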
Is there any reason why the names of the contigs are not read properly? Is it because of the ':'?
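To make my guess concrete, here is a minimal sketch of what I suspect is happening. It assumes the interval parser treats each entry as 'contig:start-stop' and splits at the last ':'. This is only my hypothesis, not GATK's actual code, and split_interval is a made-up name.

```python
def split_interval(text):
    """Split an interval string at the LAST colon (assumed behaviour).

    Returns (contig, range_part); range_part is None when there is no colon.
    """
    contig, sep, rest = text.rpartition(":")
    if not sep:
        return text, None
    return contig, rest


print(split_interval("chr1:100-200"))       # ('chr1', '100-200')
print(split_interval("HLA-A*01:01:01:01"))  # ('HLA-A*01:01:01', '01')
print(split_interval("HLA-B*07:02:01"))     # ('HLA-B*07:02', '01')
```

Under that assumption, a bare HLA contig name loses its last colon-separated field, which is the pattern in both errors above.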
Thank you in advance for your suggestions! And tell me if you need more information.