Best Practices for Batch Calling Malaria

virpatel524virpatel524 Member
edited March 2017 in Ask the GATK team

Since there is no gold standard SNP database for Plasmodium vivax, my lab and I have been tinkering with a pipeline for variant calling with GATK. I'm pretty confident with the basic framework we have, but the complication is that we want to identify variants across different populations that were sequenced at different centers.

So far I have as follows:

java -jar -Xmx13g $1 AddOrReplaceReadGroups INPUT=$4 OUTPUT=readgroups.bam RGID=1 RGLB=$base RGPL=illumina RGPU=unit1 RGSM=$base VALIDATION_STRINGENCY=LENIENT java -jar -Xmx13g $1 MarkDuplicates INPUT=readgroups.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt VALIDATION_STRINGENCY=LENIENT rm -rf readgroups.bam java -jar -Xmx13g $1 BuildBamIndex INPUT=dedup_reads.bam VALIDATION_STRINGENCY=LENIENT java -jar -Xmx13g $2 -T RealignerTargetCreator -R $ref -I dedup_reads.bam -o realignment_targets.list java -jar -Xmx13g $2 -T IndelRealigner -R $ref -I dedup_reads.bam -targetIntervals realignment_targets.list -o realigned_reads.bam rm -rf realignment_targets.list dedup_reads.bam java -jar -Xmx13g $2 -T HaplotypeCaller -R $ref -I realigned_reads.bam -o raw_variants.vcf java -jar -Xmx13g $2 -T SelectVariants -R $ref -V raw_variants.vcf -selectType SNP -o raw_snps.vcf java -jar -Xmx13g $2 -T SelectVariants -R $ref -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf rm -rf raw_variants.vcf java -jar -Xmx13g $2 -T VariantFiltration -R $ref -V raw_snps.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps.vcf java -jar -Xmx13g $2 -T VariantFiltration -R $ref -V raw_indels.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels.vcf java -jar -Xmx13g $2 -T BaseRecalibrator -R $ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -o recal_data.table java -jar -Xmx13g $2 -T BaseRecalibrator -R $ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -BQSR recal_data.table -o post_recal_data.table java -jar -Xmx13g $2 -T AnalyzeCovariates -R $ref -before recal_data.table -after post_recal_data.table -plots ${base}recalibration_plots.pdf java -jar -Xmx13g $2 -T PrintReads -R $ref -I realigned_reads.bam -BQSR recal_data.table -o recal_reads.bam java -jar -Xmx13g $2 -T HaplotypeCaller -R $ref -I recal_reads.bam -ERC GVCF -o /data/wraycompute/vdp5/variants_out_gvcf/${base}.g.vcf

This is done for each sample, and then I do batch calling using GenotypeGVCFs to produce a final file. However, I'm concerned that this approach isn't appropriate for two reasons:

  1. Some subsets of the samples have different characteristics, so we could really get a wide diversity of genotypes for each that may be sequencing bias.

  2. I am not performing the AnalyzeCovariates step after I produce my final set of SNPs. Should I be doing so? Do you have any suggestions for how I should modify my pipeline?

Best Answers


  • Hi Shlee, This is excellent information—thank you very much for your input! I will post an update here after my analyses for future reference.

Sign In or Register to comment.