When does IndelRealigner discard reads?
I'm using IndelRealigner (VN:3.4-46-gbc02625). The command-line (CL) field is:
CL:knownAlleles= targetIntervals=/data2/processed/dreamchallenge_set1/synthetic.challenge.set1.tumor.v2/tmp/synthetic.challenge.set1.tumor.v2.target.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
The (sambamba-produced) flagstat output is very different before and after realignment: roughly 18M reads are gone. We don't normally see this, but we also haven't run the DREAM data before. In what situations can IndelRealigner discard reads?
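To see which read categories actually shrink, the two flagstat outputs can be compared line by line. Below is a minimal Python sketch of that comparison, assuming the usual samtools-style layout (`<QC-passed> + <QC-failed> <label>`), which as far as I know sambamba also emits. The function names `parse_flagstat` and `diff_flagstat` are made up for illustration, and the numbers in the example are toy values, not the counts from this run.

```python
import re

# One flagstat line looks like: "1000 + 0 in total (QC-passed reads + QC-failed reads)"
LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+)$")

# Trailing percentage group such as "(90.00%:N/A)" or "(90.00% : N/A)".
# It changes between files, so it is stripped to keep labels comparable.
PCT_RE = re.compile(r"\s*\([^()]*:[^()]*\)$")


def parse_flagstat(text):
    """Return {label: QC-passed + QC-failed count} for each flagstat line."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a count line
        passed, failed, label = match.groups()
        label = PCT_RE.sub("", label).strip()
        counts[label] = int(passed) + int(failed)
    return counts


def diff_flagstat(before_text, after_text):
    """Return {label: after - before} for every category whose count changed."""
    before = parse_flagstat(before_text)
    after = parse_flagstat(after_text)
    changes = {}
    for label, old in before.items():
        new = after.get(label, 0)
        if new != old:
            changes[label] = new - old
    return changes


if __name__ == "__main__":
    # Toy inputs only; real use would read the two flagstat files from disk.
    before = "1000 + 0 in total (QC-passed reads + QC-failed reads)\n900 + 0 mapped (90.00%:N/A)\n"
    after = "800 + 0 in total (QC-passed reads + QC-failed reads)\n750 + 0 mapped (93.75%:N/A)\n"
    for label, delta in diff_flagstat(before, after).items():
        print(f"{delta:+d}\t{label}")
```

If the drop is concentrated in one category (for example only unmapped or only secondary reads), that should narrow down which filter or step is removing them.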
The documentation states that this tool does not downsample by default. I notice the RealignerTargetCreator logs show some reads failing filters and downsampling to 1000x coverage. My assumption is that this should not affect the final result, since that step only produces the list of regions to clean.
(I'm aware that in many cases indel realignment is no longer recommended.)