Losing variants when the indel realignment step is omitted

I have inherited a number of samples sequenced at 2x coverage, which were processed largely in agreement with the Best Practices. One thing that could be improved in the pipeline was removing the redundant indel realignment step (as HaplotypeCaller was used for calling variants). I have reprocessed a number of samples without this step, and with the newer version of GATK, but instead of getting similar variant files, I am getting much smaller GVCF files.
The old code:

## Mark duplicates using Picard
java -jar -Djava.io.tmpdir=./tmp ./picard-tools-1.88/MarkDuplicates.jar I=sorted.bam O=sample.dedup.bam M=sample.dedup.metrics ASSUME_SORTED=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=4000 CREATE_INDEX=FALSE

## Create realignment targets with RealignerTargetCreator
java -jar ./GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I sample.dedup.bam -o sample.target.intervals

## Indel realignment
java -jar ./GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -targetIntervals sample.target.intervals -I sample.dedup.bam -o sample.dedup.realigned.bam

## Variant calling
java -jar -Djava.io.tmpdir=./tmp ./GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample.dedup.realigned.bam -o sample.variants.g.vcf.gz --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

The above code results in a GVCF file of 1.6G and 6045722 rows.

The new code, without the indel realignment step and with the newer version of GATK:

## Mark duplicates using Picard - same as above

## HaplotypeCaller
./gatk-4.0.2.1/gatk HaplotypeCaller -R reference.fa -I sample.dedup.bam -O sample.variants.g.vcf.gz -stand-call-conf 30 -ERC GVCF

This code results (for the same sample, of course) in a GVCF file of 1.1G and 1892469 rows.
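
For reference, the row counts above are simple record counts; one way to obtain them, assuming the GVCFs are bgzipped:

## Count non-header records in a bgzipped GVCF
zcat sample.variants.g.vcf.gz | grep -vc '^#'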

Could you please advise where this difference comes from? As the indel realignment step is described as redundant, I expected the new code to produce a similar-sized variant file. Instead, across several samples tested, the new code produces considerably smaller files. Was indel realignment (or the old version of GATK) introducing such a large number of false positives? Or am I missing a bit of code?

Comments

  • I have now realised that the variant files after genotyping are actually the same. Still, could you please explain the mechanism through which the GVCFs produced after indel realignment are larger than those produced without it?

  • SkyWarrior (Turkey; Member ✭✭✭)

    IndelRealignment changes mapping qualities by 10, so your former MAPQ 0 reads become MAPQ 10, and MAPQ 10 reads become MAPQ 20. This may result in formerly filtered reads being included in your variant calls when IndelRealignment is part of the workflow. If you are working with very low coverage, the effects of this step can be more drastic; with samples of depth higher than 30-50x, the effects are usually negligible.
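
    You can check this directly by comparing the MAPQ distributions of the two BAMs. A minimal sketch, assuming samtools is installed (file names taken from the commands above):

    ## Tally mapping-quality values (SAM field 5) before and after realignment
    samtools view sample.dedup.bam | awk '{print $5}' | sort -n | uniq -c
    samtools view sample.dedup.realigned.bam | awk '{print $5}' | sort -n | uniq -c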

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)

    @jilska, what @SkyWarrior is saying is detailed in https://software.broadinstitute.org/gatk/blog?id=7847.

    You should compare the two GVCFs and see what is different. It's hard to say what is going on without seeing the VCF records.
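
    A minimal sketch of such a comparison, assuming bcftools is installed and using hypothetical old/ and new/ paths for the two GVCFs:

    ## Index both files if needed
    tabix -p vcf old/sample.variants.g.vcf.gz
    tabix -p vcf new/sample.variants.g.vcf.gz
    ## Write records unique to each file into gvcf_diff/:
    ## gvcf_diff/0000.vcf = records only in the old GVCF, 0001.vcf = only in the new one
    bcftools isec -p gvcf_diff old/sample.variants.g.vcf.gz new/sample.variants.g.vcf.gz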
