# GATK / UnifiedGenotyper -dcov parameter values

Member
edited November 2012

I ran the same sample through a GATK pipeline twice and got different variants, and I am trying to understand why. My samples are from a MiSeq/capture kit run, and downsampling could be one reason: the variant is called in one run but not in the other, even though it is present at 32% when I look at the .bam files.

As I understand it, the UnifiedGenotyper randomly downsamples my dataset to 250, so I experimented with the -dcov parameter:

• Same sample run twice: the 1st run reports the variant; the 2nd doesn't.
• Raising -dcov to 1000: neither run reports the variant.
• Raising -dcov to 10,000: the 1st run again reports the variant; the 2nd doesn't.
• Setting -dt NONE: both runs call the variant.

But setting -dt to NONE could be computationally expensive for a big sample set. Is there an identifiable reason why this is happening?

Curious..!


Differences in calls can indeed be explained by downsampling. This usually affects marginal, low-confidence calls. If that's your case it probably doesn't matter because those calls would get filtered out in the next step. If that's not your case, can you tell us more about these variant calls? What are their properties?

• Member

Taking as an example a variant on chr4: it's the second base in the codon. The reference reports it as T (and 67% of the mapped alleles are also T), while the variant call is G at 32%. Mapping quality for both the variant and reference alleles is around 150; base Phred quality ranges from 25 to 29 for the variant allele, while it's 37 for the reference allele.
The total count of bases at this position is 10056.
Still, capturing the variant at -dcov 250 and not getting it at -dcov 1000 looks strange.
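
To put rough numbers on this, here is a minimal sketch of what uniformly random downsampling does to the allele fraction at a site like this one. It is not GATK's downsampling implementation: it simply draws reads without replacement from a pile of 10056 bases, of which about 32% carry the alternate allele. The function name `downsampled_alt_fraction` and the seeds are made up for illustration.

```python
import random

def downsampled_alt_fraction(total_reads, alt_fraction, target_depth, seed):
    """Draw target_depth reads without replacement and return the alt fraction.

    Illustrative only: this models uniform random downsampling, not the
    actual UnifiedGenotyper implementation.
    """
    n_alt = round(total_reads * alt_fraction)
    # 1 marks a read carrying the alternate allele, 0 a reference read.
    pile = [1] * n_alt + [0] * (total_reads - n_alt)
    rng = random.Random(seed)
    # If the target exceeds the pile size, every read is kept.
    depth = min(target_depth, total_reads)
    sample = rng.sample(pile, depth)
    return sum(sample) / depth

# Numbers from the post: 10056 bases at the site, about 32% supporting G.
for dcov in (250, 1000, 10000):
    fractions = [downsampled_alt_fraction(10056, 0.32, dcov, seed) for seed in range(5)]
    print(dcov, [f"{f:.3f}" for f in fractions])
```

Different seeds give slightly different fractions, and the spread narrows as the target depth grows. The point is only that the subset of reads the caller sees can change from run to run even when the input file is identical.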

Hmm. Could you please post your command line and the actual lines in the VCF output for the variant?

• Member
edited November 2012
/usr/java/latest/bin/java -Xmx6g -Xms512m -Djava.io.tmpdir=/path/to/sample \
-jar /path/to/GenomeAnalysisTK/1.6-7-g2be5704//GenomeAnalysisTK.jar \
-R /path/to/reference/ncbi/37.1/allchr.fa \
-et NO_ET \
-K /path/to/GenomeAnalysisTK/1.6-7-g2be5704//<name>.key \
-T UnifiedGenotyper \
--output_mode EMIT_VARIANTS_ONLY \
--min_base_quality_score 20 \
-nt 4 \
--max_alternate_alleles 5 \
-glm BOTH \
-L chr4 \
-dcov 250 \
-I /path/to/file/IGV_BAM/sample.igv-sorted.bam


I ran this command multiple times to check whether the variant of interest (in bold) was called or not.
I've pasted results from two such runs: one where it isn't called and one where it is.

> chr4    624617  .       C       G       92.72   .       AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=30.91;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,3:3:9.03:125,9,0
> chr4    624815  .       G       C       104.35  .       AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=26.09;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,4:4:12.03:137,12,0

> chr4    624617  .       C       G       92.72   .       AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=30.91;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,3:3:9.03:125,9,0
> chr4    624815  .       G       C       104.35  .       AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=26.09;SB=-0.01        GT:AD:DP:GQ:PL  1/1:0,4:4:12.03:137,12,0


Well, nothing really stands out but I notice you're running version 1.6. I would strongly recommend you upgrade to the latest version to take advantage of the latest improvements we've made to the UG (including downsampling).

Just looking at the call speaks volumes. Notice the QUAL scores of the records around the one in question; they are all extremely high. But the QUAL for your record is just barely over the calling threshold. Once you run VQSR this record is absolutely, positively going to get filtered out (the QD is an infinitesimally small 0.16). This is what we mean when we say that the differences are marginal and make no practical difference.
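
For readers unfamiliar with QD: it is the QUAL score normalized by depth, so a very small QD means very little confidence per read of coverage. The sketch below is an illustration, not GATK code. It recomputes QUAL / DP for one of the records pasted earlier in this thread and compares it with the reported QD. The helper names `parse_info` and `qual_by_depth` are invented here, and the exact depth GATK uses in the denominator may differ between versions.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=3;QD=30.91' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info

def qual_by_depth(vcf_line):
    """Return (QUAL / DP, reported QD) for one single-sample VCF record."""
    fields = vcf_line.split()
    qual = float(fields[5])        # column 6 is QUAL
    info = parse_info(fields[7])   # column 8 is INFO
    return qual / int(info["DP"]), float(info["QD"])

# One of the records posted earlier in this thread.
record = (
    "chr4 624617 . C G 92.72 . "
    "AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;"
    "MQ=114.91;MQ0=0;QD=30.91;SB=-0.01 GT:AD:DP:GQ:PL 1/1:0,3:3:9.03:125,9,0"
)
computed, reported = qual_by_depth(record)
print(f"QUAL/DP = {computed:.2f}, reported QD = {reported}")
```

For the pasted records the two values agree to within rounding, which is consistent with reading QD as QUAL divided by depth.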

• Member

Another thing to note: when I use

"-nt 1" instead of "-nt 4"

I don't get as many variant calls and this variant

chr4 106196829

is not reported, at least not in the few runs that I did.

Did you see my previous comment? This is ultimately not a novel discussion and has already been addressed multiple times on this forum...

• Member

I did see your post; thanks for pointing out the QD score. The post is not meant to alarm anyone or to claim any novelty. My focus is to understand the tool better and to set thresholds such that it calls the same variants every time. I did not see any posts about different downsampling values calling different variants, so I went ahead and made one. Please feel free to get rid of this; it has not yielded a lot of feedback anyway.

I'd like to point out, though, that VQSR was run both times. I ran the exact same data twice, and in one of the runs it reported this variant. Hence I went back to look at each step to see if I could identify why there was a difference.

Are you saying that this site was not filtered out via VQSR? If that is the case, then there is a problem.
But you should not be comparing the raw calls between 2 different runs; rather you need to be assessing whether the filtered call sets are the same.

• Member
edited November 2012

These are the filtered results, any insight? The first one calls this variant, the second doesn't.

> chr4    624617  .       C       G       92.72   DPFilter        AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;ED=0;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=30.91;SB=-0.01;set=variant2        GT:AD:DP:GQ:PL    1/1:0,3:3:9.03:125,9,0
> chr4    624815  .       G       C       104.35  DPFilter        AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;ED=0;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=26.09;SB=-0.01;set=variant2        GT:AD:DP:GQ:PL    1/1:0,4:4:12.03:137,12,0

> chr4    624617  .       C       G       89.21   DPFilter        AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;ED=0;FS=0.000;HRun=0;HaplotypeScore=0.0000;MQ=114.91;MQ0=0;QD=29.74;SB=-0.01;set=variant2       GT:AD:DP:GQ:PL   1/1:0,3:3:9.03:121,9,0
> chr4    624815  .       G       C       98.83   DPFilter        AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;ED=0;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=104.69;MQ0=0;QD=24.71;SB=-0.01;set=variant2       GT:AD:DP:GQ:PL   1/1:0,4:4:12.02:131,12,0
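
The comparison suggested earlier in the thread (filtered call sets rather than raw calls) can be sketched roughly as follows. This is a hedged illustration, not a GATK tool: the function names `passing_sites` and `compare_runs` are hypothetical, only records whose FILTER column is exactly `PASS` are counted as passing, and the example lines are the filtered records above with INFO and sample columns shortened.

```python
def passing_sites(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) for records whose FILTER is PASS.

    Header lines starting with '#' and empty lines are skipped.
    """
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, _qual, flt = line.split()[:7]
        if flt == "PASS":
            sites.add((chrom, int(pos), ref, alt))
    return sites

def compare_runs(run1_lines, run2_lines):
    """Return (only_in_run1, only_in_run2) among passing sites."""
    a, b = passing_sites(run1_lines), passing_sites(run2_lines)
    return a - b, b - a

# Abbreviated versions of the filtered records posted above (INFO shortened).
run1 = ["chr4 624617 . C G 92.72 DPFilter DP=3;QD=30.91 GT 1/1",
        "chr4 624815 . G C 104.35 DPFilter DP=4;QD=26.09 GT 1/1"]
run2 = ["chr4 624617 . C G 89.21 DPFilter DP=3;QD=29.74 GT 1/1",
        "chr4 624815 . G C 98.83 DPFilter DP=4;QD=24.71 GT 1/1"]
only1, only2 = compare_runs(run1, run2)
print("only in run 1:", sorted(only1))
print("only in run 2:", sorted(only2))
```

Both pasted records carry DPFilter in both runs, so under this definition neither run has a passing site among these lines and both differences come back empty. For these lines the two runs differ in QUAL and the values derived from it, not in which sites are reported.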