If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
WES : too many variants ?
I'm working on exome sequences.
My colleague used the following pipeline (GATK 3.7) :
1) bwa mem (reference genome : GRChr37) => sam
2) samtools view -bS -F 0x04 **(eliminates unmapped reads) => bam
3) sorting and indexing the bam file with **sambaba sort and sambaba index => bam + bai
4) samtools view -h -o => sam
5) GATK AddOrRepleaceReadGroups => bam
6) GATK Markduplicates => bam
7) GATK SplitNCigarReads => bam
8) GATK HaplotypeCaller -R GRChr37 -dontUseSoftClippedBases -stand_call_conf 20.0 => vcf
And got 4000 variants.
We saw that many variants were missed, and decided therefore to try another algorithm.
I tried to follow GATK recommandations (GATK 3.8) :
1) bwa mem (reference genome : hg38, downloaded from the GATK bundle) => sam
2) samtools view -f 0x2 (select mapped read in proper pair ; it seems that GATK 3.8 tools don't work with unvalid bam files, that's why I couldn't write -F 0x04 as my colleague did, "-F 0x04" did indeed leave some unpaired reads) => bam
3) GATK AddOrRepleaceReadGroups => bam
4) GATK Markduplicates => bam
5) GATK SortSam => bam
6) GATK BuildBamIndex => bam
7) GATK BaseRecalibrator => bam
8) GATK Printreads => bam
9) GATK HaplotypeCaller -R hg38 -dontUseSoftClippedBases -stand_call_conf 20.0 => vcf
And I get a vcf file with... ** 800 000** variants.
Do you have any idea of why the difference is so big ? I expected around 25 000 variants...
The main differences between both pipelines are :
- the reference genome => but I tried this pipeline with hg19 and got the same result as with hg38
- how we select the reads in samtools view : my colleague eliminated unmapped reads while I selected mapped reads in proper pair => but I should get less reads than he does, and that shouldn't change the number of called variants...
- I sorted and indexed the reads with GATK tools, he did it with sambaba tools => it shouldn't change anything, should it ?
- the Base Recalibration step => I tried with and without it, I get more variants without this step (more than 800 000...)
Another interesting information :
We're working on exomes that have been sequenced in 4 lanes. That means we have 8 FASTQ for each sample. At first, I tried my pipeline with only one lane (2 FASTQ) and got 200 000 variants. When I concatenate the FASTQ (Lane1_R1, Lane2_R1, Lane3_R1, Lane4_R1 => R1 ; same for R2) and then run my pipeline, I get 800 000 variants.
I opened the vcf file with 200 000 variants, and every variant seemed to appear only once. I looked at the bam on IGV, it seems that the new pipeline is much more sensitive than my colleage's one.
BWA : 0.7.5a-r405
samtools : 1.6 (using htslib 1.6)
GATK : GenomeAnalysisTK-3.8-0-ge9d806836
Any help would be really appreciated !
Many thanks, and happy new year