GATK / UnifiedGenotyper -dcov parameter values

ngsgenengsgene Posts: 6Member
edited November 2012 in Ask the GATK team

I ran the same sample through a pipeline using GATK twice and received different variants. I am trying to understand the reason behind this. My samples are from a MiSeq/capture kit run and downsampling could be one reason (given in one scenario that variant is called and in other it isn't) the variant is called at 32% when looked into the .bam files.

As I understand the UnifiedGenotyper downsamples my dataset randomly to 250, so I played around with -dcov parameter

  • same sample run twice, 1st run reports a variant; 2nd run doesn't.
  • up -dcov to 1000 neither run reports the variant.
  • up -dcov to 10,000 1st run again reports a variant; 2nd run doesn't.
  • set -dt NONE both runs call that variant

But setting -dt to NONE could be computationally exhaustive for a big sample set. Is there an identifiable reason to why this is happening..?

Curious..!

Post edited by Geraldine_VdAuwera on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,642Administrator, GATK Developer admin

    Differences in calls can indeed be explained by downsampling. This usually affects marginal, low-confidence calls. If that's your case it probably doesn't matter because those calls would get filtered out in the next step. If that's not your case, can you tell us more about these variant calls? What are their properties?

    Geraldine Van der Auwera, PhD

  • ngsgenengsgene Posts: 6Member

    Taking an example of a variant at chr4, its the second base in the codon, the reference reports it as T (and 67% alleles that map are also T) while the variant call is G at 32%. Mapping quality of both the variant and ref allele are around 150 and base phred quality for the variant call ranges from 25 to 29 while its 37 for the allele reported same as the reference. Total count of the bases at this position are 10056. Still capturing a variant at -dcov 250 and not getting it at -dcov 1000 looks strange..

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,642Administrator, GATK Developer admin

    Hmm. Could you please post your command line and the actual lines in the VCF output for the variant?

    Geraldine Van der Auwera, PhD

  • ngsgenengsgene Posts: 6Member
    edited November 2012
    /usr/java/latest/bin/java -Xmx6g -Xms512m -Djava.io.tmpdir=/path/to/sample \
        -jar /path/to/GenomeAnalysisTK/1.6-7-g2be5704//GenomeAnalysisTK.jar \
        -R /path/to/reference/ncbi/37.1/allchr.fa \
        -et NO_ET \
        -K /path/to/GenomeAnalysisTK/1.6-7-g2be5704//<name>.key \
        -T UnifiedGenotyper \
        --output_mode EMIT_VARIANTS_ONLY \
        --min_base_quality_score 20 \
        -nt 4 \
        --max_alternate_alleles 5 \
        -glm BOTH \
        -L chr4 \
        -dcov 250 \
        -I /path/to/file/IGV_BAM/sample.igv-sorted.bam
    

    I ran this script multiple times to find whether the chromosome of interest (in bold) was called or not. I've pasted results of two such runs, one where it isn't and the second where it is called.

    > chr4    624617  .       C       G       92.72   .       AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=30.91;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,3:3:9.03:125,9,0
    > chr4    624815  .       G       C       104.35  .       AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=26.09;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,4:4:12.03:137,12,0
    > chr4    106156187       .       C       T       3744.58 .       AC=1;AF=0.50;AN=2;BaseQRankSum=11.817;DP=238;DS;Dels=0.00;FS=1.254;HRun=1;HaplotypeScore=3.6078;MQ=148.43;MQ0=0;MQRankSum=-1.547;QD=15.73;ReadPosRankSum=0.866;SB=-198.11       GT:AD:DP:GQ:PL  0/1:116,122:238:99:3745,0,2854
    > chr4    106163109       .       TT      T       1215.96 .       AC=1;AF=0.50;AN=2;BaseQRankSum=-3.950;DP=235;DS;FS=66.843;HRun=13;HaplotypeScore=1358.8083;MQ=149.02;MQ0=0;MQRankSum=-1.786;QD=5.17;ReadPosRankSum=-7.710;SB=-819.24    GT:AD:DP:GQ:PL  0/1:163,47:235:99:1255,0,1705
    > chr4    106196951       .       A       G       2816.60 .       AC=1;AF=0.50;AN=2;BaseQRankSum=-11.240;DP=250;DS;Dels=0.00;FS=6.191;HRun=0;HaplotypeScore=10.2140;MQ=149.31;MQ0=0;MQRankSum=-1.125;QD=11.27;ReadPosRankSum=-2.318;SB=-371.15    GT:AD:DP:GQ:PL  0/1:123,124:247:99:2847,0,3830
    
    
    > chr4    624617  .       C       G       92.72   .       AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=30.91;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,3:3:9.03:125,9,0
    > chr4    624815  .       G       C       104.35  .       AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=26.09;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,4:4:12.03:137,12,0
    > chr4    106156187       .       C       T       3986.22 .       AC=1;AF=0.50;AN=2;BaseQRankSum=12.002;DP=239;DS;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=4.5550;MQ=148.19;MQ0=0;MQRankSum=-0.115;QD=16.68;ReadPosRankSum=-1.902;SB=-190.96      GT:AD:DP:GQ:PL  0/1:112,127:239:99:3986,0,2741
    > chr4    106163109       .       TTC     T       1453.27 .       AC=1;AF=0.50;AN=2;BaseQRankSum=-4.717;DP=236;DS;FS=72.463;HRun=0;HaplotypeScore=1395.2377;MQ=149.13;MQ0=0;MQRankSum=-0.037;QD=6.16;ReadPosRankSum=-7.710;SB=-819.78     GT:AD:DP:GQ:PL  0/1:171,26:236:99:1492,0,4510
    > **chr4    106196829       **.       T       G       39.25   .       AC=1;AF=0.50;AN=2;BaseQRankSum=-5.726;DP=250;DS;Dels=0.00;FS=134.542;HRun=1;HaplotypeScore=3.7543;MQ=146.85;MQ0=0;MQRankSum=-1.167;QD=0.16;ReadPosRankSum=5.967;SB=-0.01        GT:AD:DP:GQ:PL  0/1:225,24:249:69.24:69,0,7843
    > chr4    106196951       .       A       G       2971.45 .       AC=1;AF=0.50;AN=2;BaseQRankSum=-11.623;DP=250;DS;Dels=0.00;FS=0.707;HRun=0;HaplotypeScore=12.3292;MQ=149.27;MQ0=0;MQRankSum=-0.494;QD=11.89;ReadPosRankSum=-1.451;SB=-290.04    GT:AD:DP:GQ:PL  0/1:121,126:247:99:3001,0,3738
    
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,642Administrator, GATK Developer admin

    Well, nothing really stands out but I notice you're running version 1.6. I would strongly recommend you upgrade to the latest version to take advantage of the latest improvements we've made to the UG (including downsampling).

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 684GATK Developer mod

    Just looking at the call speaks volumes. Notice the QUAL score of the records around the one in question; they are all extremely high. But the QUAL for your record is just barely over the calling threshold. Once you run VQSR this record is absolutely, positively going to get filtered out (the QD is an infinitesimally small 0.16). This is what we mean when we say that the differences are marginal and make no practical differences.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ngsgenengsgene Posts: 6Member

    Another thing to note, when i use

    "-nt 1" instead of "-nt 4"

    I don't get as many variant calls and this variant

    chr4 106196829

    is not reported atleast not in the few multiple runs that I did.

  • ebanksebanks Posts: 684GATK Developer mod

    Did you see my previous comment? This is ultimately not novel discussion and has already been addressed multiple times on this forum...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ngsgenengsgene Posts: 6Member

    I did see your post, thanks for pointing out the QD score. The post is not to alarm or trigger any novelty, my focus is to understand the tool better and implement different thresholds, such that it calls the same variants everytime. I did not see posts on downsampling revolving around different values calling different variants, so I went ahead and made one, please feel free to get rid of this it has not yeilded a lot of feedback anyways.

    Though I'd like to point out here, that VQSR was run both times and I ran the exact same data twice and in one of the runs, it reported this variant. Hence I went back to look at each step to identify if I could, why there was a difference.

  • ebanksebanks Posts: 684GATK Developer mod

    Are you saying that this site was not filtered out via VQSR? If that is the case, then there is a problem. But you should not be comparing the raw calls between 2 different runs; rather you need to be assessing whether the filtered call sets are the same.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ngsgenengsgene Posts: 6Member
    edited November 2012

    These are the filtered results, any insight? The first one calls this variant, the second doesn't.

    > chr4    624617  .       C       G       92.72   DPFilter        AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;ED=0;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=30.91;SB=-0.01;set=variant2        GT:AD:DP:GQ:PL    1/1:0,3:3:9.03:125,9,0
    > chr4    624815  .       G       C       104.35  DPFilter        AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;ED=0;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=26.09;SB=-0.01;set=variant2        GT:AD:DP:GQ:PL    1/1:0,4:4:12.03:137,12,0
    > chr4    106156187       .       C       T       4220.27 PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=11.818;DP=238;DS;Dels=0.00;ED=0;FS=1.147;HRun=1;HaplotypeScore=3.6083;MQ=147.63;MQ0=0;MQRankSum=0.203;QD=17.73;ReadPosRankSum=-0.536;SB=-192.02;set=variant2 GT:AD:DP:GQ:PL  0/1:103,135:238:99:4220,0,2479
    > chr4    106163109       .       TT      T       1328.61 PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=-3.537;DP=237;DS;ED=0;FS=66.900;HRun=13;HaplotypeScore=1364.4270;MQ=149.09;MQ0=0;MQRankSum=-1.753;QD=5.61;ReadPosRankSum=-9.594;SB=-740.36;set=variant       GT:AD:DP:GQ:PL  0/1:157,53:237:99:1368,0,1520
    > chr4    106196829       .       T       G       110.16  FSFilter;QDFilter       AC=1;AF=0.50;AN=2;BaseQRankSum=-6.199;DP=250;DS;Dels=0.00;ED=0;FS=148.757;HRun=1;HaplotypeScore=2.8934;MQ=146.54;MQ0=0;MQRankSum=-0.586;QD=0.44;ReadPosRankSum=6.504;SB=-0.01;set=FilteredInAll     GT:AD:DP:GQ:PL  0/1:223,26:250:99:140,0,7859
    > chr4    106196951       .       A       G       3027.53 PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=-11.591;DP=250;DS;Dels=0.00;ED=0;FS=0.000;HRun=0;HaplotypeScore=12.7628;MQ=149.28;MQ0=0;MQRankSum=-2.261;QD=12.11;ReadPosRankSum=-0.358;SB=-295.06;set=variant2      GT:AD:DP:GQ:PL  0/1:117,130:247:99:3058,0,3622
    
    
    
    > chr4    624617  .       C       G       89.21   DPFilter        AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;ED=0;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=29.74;SB=-0.01;set=variant2       GT:AD:DP:GQ:PL   1/1:0,3:3:9.03:121,9,0
    > chr4    624815  .       G       C       98.83   DPFilter        AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;ED=0;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=24.71;SB=-0.01;set=variant2       GT:AD:DP:GQ:PL   1/1:0,4:4:12.02:131,12,0
    > chr4    106156187       .       C       T       4093.73 PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=12.419;DP=238;DS;Dels=0.00;ED=0;FS=4.056;HRun=1;HaplotypeScore=1.5572;MQ=148.13;MQ0=0;MQRankSum=0.409;QD=17.20;ReadPosRankSum=0.292;SB=-270.34;set=variant2        GT:AD:DP:GQ:PL  0/1:110,128:238:99:4094,0,2717
    > chr4    106163109       .       TT      T       1260.04 PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=-2.908;DP=236;DS;ED=0;FS=60.294;HRun=13;HaplotypeScore=1403.8554;MQ=149.49;MQ0=0;MQRankSum=-0.489;QD=5.34;ReadPosRankSum=-8.758;SB=-540.28;set=variant     GT:AD:DP:GQ:PL  0/1:169,52:236:99:1299,0,1744
    > chr4    106196951       .       A       G       2807.05 PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=-12.174;DP=250;DS;Dels=0.00;ED=0;FS=0.000;HRun=0;HaplotypeScore=9.5110;MQ=149.49;MQ0=0;MQRankSum=-0.069;QD=11.23;ReadPosRankSum=-1.077;SB=-333.34;set=variant2     GT:AD:DP:GQ:PL  0/1:126,121:247:99:2837,0,3904
    
    Post edited by Geraldine_VdAuwera on
  • ebanksebanks Posts: 684GATK Developer mod

    I think perhaps you need to look over the VCF specification. In particular, it is critical that you understand what the FILTER column is used for and what it means when the value there is not PASS. Good luck!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.