Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Loosing variants when indel realignment step is omitted
I have inherited a number of 2x sequences which were processed in large with agreement with the Best Practices. One thing which could be improved upon in the pipeline was deleting the redundant step of indel realignment (as HaplotypeCaller was used for calling variants). I have reprocessed a number of samples without this step, and with the newer version of GATK, but instead of getting either similar variant files, I am getting much smaller gvcf files.
The old code:
## Mark duplicates using Picard java -jar -Djava.io.tmpdir=./tmp ./picard-tools-1.88/MarkDuplicates.jar I=sorted.bam O=sample.dedup.bam M=sample.dedup.metrics ASSUME_SORTED=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=4000 CREATE_INDEX=FALSE ## Realigner Creator java -jar ./GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I sample.dedup.bam -o sample.target.intervals ## Indel realignment java -jar ./GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -targetIntervals sample.target.intervals -I sample.dedup.bam -o sample.dedup.realigned.bam ## Variant calling java -jar -Djava.io.tmpdir=./tmp ./GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample.dedup.realigned.bam -o sample.variants.g.vcf.gz --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
The above code results in a gvcf file of 1.6G and 6045722 rows.
The new code, without indel realignment and with the new version of GATK
## Mark duplicates using Picard - same as above ## HAplotype Caller ./gatk-126.96.36.199/gatk HaplotypeCaller -R reference.fa -I sample.dedup.bam -O sample.variants.g.vcf.gz -stand-call-conf 30 -ERC GVCF
This code results (for the same sample of course) in a variant file of 1.1G and 1892469 rows.
Could you please advise where this difference comes from? As the indel realignment step is described as redundant, I expected the new code to result in a similar-sized variants file. In contrast, across several samples tested, the new code results in variants which are considerably smaller. Was indel realignment (or old version of GATK) introducing such large number of false positives? Or am I missing a bit of code?