same setting, different results (GATK realignment,recalibration,haptypecaller,hard filtering)

gatk2013gatk2013 Posts: 15Member

When I run GATK with identical settings on some amplicon sequencing data from MiSeq (150KB region), I get different numbers of variants (approximately 10% difference), even after setting -dfrac to 1. What is the cause for this variation? How to make results reproducible?

Thank you so much!

RESULT

#8.1.vcf and 8.2.vcf are the raw VCFs from two runs with filters applied
java -Xmx4g -jar gatk.jar -T CombineVariants -R human_g1k_v37.fasta -o 8merge_union.vcf -V 8.1.vcf -V 8.2.vcf
grep Intersection 8merge_union.vcf > 8merge_intersection.vcf
grep -v Intersection 8merge_union.vcf > 8merge_NOintersection.vcf
grep PASS 8merge_NOintersection.vcf > 8merge_NOintersection.PASS.vcf

wc -l *vcf
   711 8.1.vcf
   642 8.2.vcf
   308 8merge_intersection.vcf
    86 8merge_NOintersection.PASS.vcf
   462 8merge_NOintersection.vcf
   770 8merge_union.vcf

Please find 8.1.vcf and 8.2.vcf in attachment.

COMMAND:

##realign
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -o 8.intervals -nt 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T IndelRealigner -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -targetIntervals 8.intervals --out 8.realign.bam -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf --consensusDeterminationModel USE_READS -LOD 0.4 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
#
##base recal
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T BaseRecalibrator -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --default_platform ILLUMINA -knownSites ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -knownSites ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -nct 6 -o 8.recal_data -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx6g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T PrintReads -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -o 8.recal.bam -BQSR 8.recal_data -nct 1  -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#variant calling
java -Xmx8g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T HaplotypeCaller  -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -I 8.recal.bam -o 8.raw.vcf --dbsnp ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -nct 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#variant filtering
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.indel.vcf -selectType INDEL -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.snp.vcf -selectType SNP -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#use hard filtering due to small input file
#SNP
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.snp.vcf --filterName QDFilter --filterExpression 'QD<2.0' --filterName FSFilter --filterExpression 'FS>60.0' --filterName MQFilter --filterExpression 'MQ<40.0' --filterName MaqQualRankSumFilter --filterExpression 'MappingQualityRankSum<-12.5' --filterName ReadPosFilter --filterExpression 'ReadPosRankSum<-8.0' -o 8.filter.snp.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
#indel
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.indel.vcf --clusterWindowSize 10 --filterExpression 'QD<2.0' --filterName QDFilter --filterExpression 'ReadPosRankSum<-20.0' --filterName ReadPosFilter --filterExpression 'FS>200.0' --filterName FSFilter -o 8.filter.indel.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T CombineVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.filter.snp.vcf --variant 8.filter.indel.vcf -o 8.filter.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
7z
7z
sample8.7z
39K

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin
    Answer ✓

    Hi there,

    GATK runs are non-deterministic when run with multithreading (nt or nct). Only marginal calls are affected, but if you absolutely must have deterministic behavior, you'll need to disable multithreading.

    Geraldine Van der Auwera, PhD

  • rernstrernst UtrechtPosts: 2Member

    Are GATK runs also non-deterministic when using queue scatter/gather (wihtout nt or nct settings)?

  • gatk2013gatk2013 Posts: 15Member

    Thanks!

    I ran GATK without multithreading, indeed the results are the same. But I think when using mutlithreading, not only marginal calls are affected, some affected calls actually have quite high GQ, eg 300+.

    @Geraldine_VdAuwera said: Hi there,

    GATK runs are non-deterministic when run with multithreading (nt or nct). Only marginal calls are affected, but if you absolutely must have deterministic behavior, you'll need to disable multithreading.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @rernst Queue scatter/gather generally affects the intervals list, which influences the downsampling seed, so it may cause non-deterministic behavior.

    @gatk2013 If there are any effects that cause you concern, feel free to post the affected variant records for us to take a look at.

    Geraldine Van der Auwera, PhD

  • gatk2013gatk2013 Posts: 15Member

    Thanks for help.

    Here is the VCF that contains variants not shared by two runs. They are all marked as PASS after hard filtering (VQSR unavailable due to small size).

    txt
    txt
    8merge_NOintersection.PASS.vcf.txt
    18K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @gatk2013 I meant post a couple of records from the two different runs (with vs. without multithreading) that show differences in the calls.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @gatk2013, that's too much data, I need you to point out specific records that you want me to look at. Please just copy one or two lines of each and post them in your comment (not as attached files), and tell me explicitly what values you think are problematic. I don't have the time to look at all your data in detail.

    Geraldine Van der Auwera, PhD

  • gatk2013gatk2013 Posts: 15Member
    edited March 18

    Sorry for this massive post.

    Could you please look at the following lines (with and without multithreading)? I just to be confident when we pick high quality variants for downstream analysis, they are really there.

    13      32941465        .       T       C       2654.77 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-9.609;ClippingRankSum=5.979;DP=221;FS=14.269;MLEAC=1;MLEAF=0.500;MQ=59.00;MQ0=0;MQRankSum=-1.708;QD=12.01;ReadPosRankSum=-7.834;set=8.multi.vcf        GT:AD:DP:GQ:PL  0/1:153,67:220:99:2683,0,8451
    17      41252613        rs10445318      T       A       3168.77 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.955;ClippingRankSum=0.964;DB;DP=52;FS=1.105;MLEAC=1;MLEAF=0.500;MQ=59.21;MQ0=0;MQRankSum=-0.009;QD=31.09;ReadPosRankSum=-0.395;set=8.nomulti.vcf      GT:AD:DP:GQ:PL  0/1:24,28:52:99:3197,0,7320
    
    Post edited by gatk2013 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    These calls look perfectly reasonable. Are you saying that they are called/not called depending on whether multithreading is enabled?

    Ultimately the best thing to do is look at the data in the BAMs for these sites, and see what the pileups look like.

    Geraldine Van der Auwera, PhD

  • gatk2013gatk2013 Posts: 15Member

    @Geraldine_VdAuwera‌ Thanks for your reply!

    As you mentioned, these calls look good, so I suppose they should show up regardless of multithreading.

    When I looked the BAM file, I saw there are quite a few reads with 2 or more indels around these two calls, and the region seems to be repetitive. This definitely will leave me doubt the validity of these calls. I'd appreciate it if gatk warns me about these loci, e.g., by assigning a low QUAL. But right now gatk shows some non-deterministic behavior instead.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    You may want to try calling variants using the latest version with HaplotypeCaller in GVCF mode. This will emit a record for every site (or block of sites) along with an estimate of reference confidence. This will give you better visibility into what's going on.

    Geraldine Van der Auwera, PhD

  • gatk2013gatk2013 Posts: 15Member

    Thank you, Dr. Auwera. I will try it.

  • wchenwchen Posts: 20Member
    edited November 13

    I am also comparing results for amplicon-based data from MiSeq to GATK. Results from GATK all gave very low depth, and all AF=0.5 and PL for 0/1 alt allele is always 0 same as above examples??? How to correctly calculate the variant allele frequency, which is very important for tumor samples?

    I was told that you have to mark the duplicates for GATK to work, is that right? If not, how to avoid it? What is the best practice for dealing with amplicon-based sequence data in GATK at every step? Thanks!

    Post edited by wchen on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @wchen We do not have recommendations for amplicon-based sequence analysis, so you will have to decide for yourself what makes sense for you. Marking duplicates is recommended for other types of experiments, but may not be applicable for amplicon sequence. If so, you just don't mark duplicates (or you unmark them if they are already marked) so GATK will not filter them out.

    The PL is always 0 for the selected genotype by definition, and the other genotype PLs vary depending on how likely they are relative to the chosen genotype. The AF is always 0.5 for heterozygous diploid calls because it is the model frequency. To calculate the effective frequency, you can use the AD field values, which give the depth per allele, over the total DP (depth of coverage).

    If this is cancer data you should be using MuTect to call SNPs.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.