Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

WES : too many variants ?

NH2ANH2A Member
edited January 2018 in Ask the GATK team

Hi!

I'm working on exome sequences.

My colleague used the following pipeline (GATK 3.7) :
1) bwa mem (reference genome : GRChr37) => sam
2) samtools view -bS -F 0x04 **(eliminates unmapped reads) => bam
3) sorting and indexing the bam file with **sambaba
sort and sambaba index => bam + bai
4) samtools view -h -o => sam
5) GATK AddOrRepleaceReadGroups => bam
6) GATK Markduplicates => bam
7) GATK SplitNCigarReads => bam
8) GATK HaplotypeCaller -R GRChr37 -dontUseSoftClippedBases -stand_call_conf 20.0 => vcf
And got 4000 variants.

We saw that many variants were missed, and decided therefore to try another algorithm.

I tried to follow GATK recommandations (GATK 3.8) :
1) bwa mem (reference genome : hg38, downloaded from the GATK bundle) => sam
2) samtools view -f 0x2 (select mapped read in proper pair ; it seems that GATK 3.8 tools don't work with unvalid bam files, that's why I couldn't write -F 0x04 as my colleague did, "-F 0x04" did indeed leave some unpaired reads) => bam
3) GATK AddOrRepleaceReadGroups => bam
4) GATK Markduplicates => bam
5) GATK SortSam => bam
6) GATK BuildBamIndex => bam
7) GATK BaseRecalibrator => bam
8) GATK Printreads => bam
9) GATK HaplotypeCaller -R hg38 -dontUseSoftClippedBases -stand_call_conf 20.0 => vcf
And I get a vcf file with... ** 800 000** variants.

Do you have any idea of why the difference is so big ? I expected around 25 000 variants...

The main differences between both pipelines are :

  • the reference genome => but I tried this pipeline with hg19 and got the same result as with hg38
  • how we select the reads in samtools view : my colleague eliminated unmapped reads while I selected mapped reads in proper pair => but I should get less reads than he does, and that shouldn't change the number of called variants...
  • I sorted and indexed the reads with GATK tools, he did it with sambaba tools => it shouldn't change anything, should it ?
  • the Base Recalibration step => I tried with and without it, I get more variants without this step (more than 800 000...)

Another interesting information :
We're working on exomes that have been sequenced in 4 lanes. That means we have 8 FASTQ for each sample. At first, I tried my pipeline with only one lane (2 FASTQ) and got 200 000 variants. When I concatenate the FASTQ (Lane1_R1, Lane2_R1, Lane3_R1, Lane4_R1 => R1 ; same for R2) and then run my pipeline, I get 800 000 variants.
I opened the vcf file with 200 000 variants, and every variant seemed to appear only once. I looked at the bam on IGV, it seems that the new pipeline is much more sensitive than my colleage's one.

Versions :
BWA : 0.7.5a-r405
samtools : 1.6 (using htslib 1.6)
GATK : GenomeAnalysisTK-3.8-0-ge9d806836

Any help would be really appreciated !

Many thanks, and happy new year ;)

Best Answer

Answers

  • Hi shlee, thank you for your answer,
    Finally I discovered that my colleague did some hard filtering on his vcf before giving them to me, it explains everything.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Glad to hear you've sorted your data out @NH2A.

Sign In or Register to comment.