Why does HaplotypeCaller produce a VCF with per-lane output?

tirohia, Auckland, New Zealand (Member)
edited October 2019 in Ask the GATK team
I'm using GATK4, and I have input data for a single sample across four different lanes:
v02_L1.fastq.gz
v02_L2.fastq.gz
v02_L3.fastq.gz
v02_L4.fastq.gz

I want to merge these into per-sample results. Following the suggestion here: https://software.broadinstitute.org/gatk/documentation/article?id=6057 I merged the BAMs from the alignment of the individual lanes by feeding all four BAMs into Picard's MarkDuplicates step. I've also tried doing this with samtools merge, and I get the same results. When I feed the resulting single BAM file to HaplotypeCaller, I get a VCF with results for the four individual lanes, i.e. the header looks something like:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT v02_L1 v02_L2 v02_L3 v02_L4
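For context, the sample columns in a VCF come from the SM field of the @RG (read group) lines in the BAM header, so four columns suggest four distinct SM values in the merged file. Below is a minimal Python sketch for checking that, assuming the header text has already been dumped (for example with `samtools view -H`). The function name `read_group_samples` and the example header are hypothetical, made up to match the column names above; they are not taken from the actual BAM.

```python
def read_group_samples(header_text):
    """Map each read-group ID to its SM (sample) value from SAM header text."""
    samples = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue  # only read-group lines carry the SM tag
        # Each field after the record type looks like TAG:value
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        samples[fields.get("ID")] = fields.get("SM")
    return samples


# Hypothetical header, invented to mirror the four per-lane columns above.
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:v02_L1\tSM:v02_L1\tPL:ILLUMINA",
    "@RG\tID:v02_L2\tSM:v02_L2\tPL:ILLUMINA",
    "@RG\tID:v02_L3\tSM:v02_L3\tPL:ILLUMINA",
    "@RG\tID:v02_L4\tSM:v02_L4\tPL:ILLUMINA",
])

rg_to_sample = read_group_samples(example_header)
distinct_samples = sorted(set(rg_to_sample.values()))
print(distinct_samples)
```

If this reports one SM value per lane, the lanes are being treated as separate samples; read groups that share a single SM value would be expected to end up in one column.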

This makes downstream filtering based on the number of reads supporting ref/alt difficult, as one has to add an extra step to combine the per-lane totals into a single number, which is really, really annoying. Technically, the article I link to above is for GATK3, but there's a thread about a similar starting scenario in GATK4 (https://gatkforums.broadinstitute.org/gatk/discussion/10881/gatk4-importgenomicsdb-multiple-lanes-per-sample) where the user is pointed towards that same article, so I'm assuming I'm still following the recommended practice.
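The extra combining step would look roughly like the following Python sketch, which sums the AD (allelic depth) values across all sample columns of one VCF data line. It assumes tab-separated lines with an AD entry in the FORMAT column; the function name `sum_allele_depths` and the example record are hypothetical and not taken from the real output. It only adds up read counts and does not re-derive a combined genotype.

```python
def sum_allele_depths(vcf_line):
    """Sum per-allele AD counts across every sample column of a VCF data line.

    Returns a list [ref_total, alt1_total, ...], or None if FORMAT has no AD.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    format_keys = cols[8].split(":")
    if "AD" not in format_keys:
        return None
    ad_index = format_keys.index("AD")

    totals = []
    for sample in cols[9:]:
        parts = sample.split(":")
        if ad_index >= len(parts):
            continue  # trailing fields may be dropped for missing data
        values = parts[ad_index].split(",")
        # Grow the totals list if this sample reports more alleles
        if len(totals) < len(values):
            totals.extend([0] * (len(values) - len(totals)))
        for i, value in enumerate(values):
            if value != ".":  # "." marks a missing count
                totals[i] += int(value)
    return totals


# Hypothetical record with four per-lane sample columns.
example_line = (
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:AD:DP"
    "\t0/1:3,2:5\t0/1:4,1:5\t0/0:6,0:6\t./.:.:."
)
combined = sum_allele_depths(example_line)
print(combined)  # [13, 3]
```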

Why does HaplotypeCaller do this? And is there a way to either easily collate the results or force it to produce per-sample rather than per-lane totals?

Thanks
Ben.