We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

WES : too many variants ?

NH2ANH2A Member
edited January 2018 in Ask the GATK team


I'm working on exome sequences.

My colleague used the following pipeline (GATK 3.7) :
1) bwa mem (reference genome : GRChr37) => sam
2) samtools view -bS -F 0x04 **(eliminates unmapped reads) => bam
3) sorting and indexing the bam file with **sambaba
sort and sambaba index => bam + bai
4) samtools view -h -o => sam
5) GATK AddOrRepleaceReadGroups => bam
6) GATK Markduplicates => bam
7) GATK SplitNCigarReads => bam
8) GATK HaplotypeCaller -R GRChr37 -dontUseSoftClippedBases -stand_call_conf 20.0 => vcf
And got 4000 variants.

We saw that many variants were missed, and decided therefore to try another algorithm.

I tried to follow GATK recommandations (GATK 3.8) :
1) bwa mem (reference genome : hg38, downloaded from the GATK bundle) => sam
2) samtools view -f 0x2 (select mapped read in proper pair ; it seems that GATK 3.8 tools don't work with unvalid bam files, that's why I couldn't write -F 0x04 as my colleague did, "-F 0x04" did indeed leave some unpaired reads) => bam
3) GATK AddOrRepleaceReadGroups => bam
4) GATK Markduplicates => bam
5) GATK SortSam => bam
6) GATK BuildBamIndex => bam
7) GATK BaseRecalibrator => bam
8) GATK Printreads => bam
9) GATK HaplotypeCaller -R hg38 -dontUseSoftClippedBases -stand_call_conf 20.0 => vcf
And I get a vcf file with... ** 800 000** variants.

Do you have any idea of why the difference is so big ? I expected around 25 000 variants...

The main differences between both pipelines are :

  • the reference genome => but I tried this pipeline with hg19 and got the same result as with hg38
  • how we select the reads in samtools view : my colleague eliminated unmapped reads while I selected mapped reads in proper pair => but I should get less reads than he does, and that shouldn't change the number of called variants...
  • I sorted and indexed the reads with GATK tools, he did it with sambaba tools => it shouldn't change anything, should it ?
  • the Base Recalibration step => I tried with and without it, I get more variants without this step (more than 800 000...)

Another interesting information :
We're working on exomes that have been sequenced in 4 lanes. That means we have 8 FASTQ for each sample. At first, I tried my pipeline with only one lane (2 FASTQ) and got 200 000 variants. When I concatenate the FASTQ (Lane1_R1, Lane2_R1, Lane3_R1, Lane4_R1 => R1 ; same for R2) and then run my pipeline, I get 800 000 variants.
I opened the vcf file with 200 000 variants, and every variant seemed to appear only once. I looked at the bam on IGV, it seems that the new pipeline is much more sensitive than my colleage's one.

Versions :
BWA : 0.7.5a-r405
samtools : 1.6 (using htslib 1.6)
GATK : GenomeAnalysisTK-3.8-0-ge9d806836

Any help would be really appreciated !

Many thanks, and happy new year ;)

Best Answer


  • Hi shlee, thank you for your answer,
    Finally I discovered that my colleague did some hard filtering on his vcf before giving them to me, it explains everything.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Glad to hear you've sorted your data out @NH2A.

Sign In or Register to comment.