The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

# [Screenshot + info] Appearent deletion not called by UnifiedGenotyper

Posts: 23Member

Hello!

I have what seems like a deletion in my sample, that is not called by UnifiedGenotyper. The data has gone through the best practises pipeline (duplicate marking, indel-realignment, base recalibration) before calling the variants. This is what it looks like in IGV:

Fullsize: postimg.org/image/57u1srm9r/

The mapping quality is 60 for all reads that overlap the deletion, the cigars are 100M except for the three reads that have the deletion.

I ran UnifiedGenotyper with as low as -stand_emit_conf 4.0 but there is still no record of an indel at this position, neither in the raw call set or in the call set after variant recalibration.

Why isn't this deletion called? What else can I do to shed some light on this?

Tagged:

can you post your command line for the UG ? Ideally with -L for this position only and the output with EMIT_ALL_SITES so we can see what is going on here?

• Posts: 23Member

I tried to run the command with -L set at the coordinate of the deletion, as well as a larger window of about 1 kb, but I got nothing in the output VCF file. Instead, I switched to -glm SNP and that produced an output.

Command:

java -Xmx8g -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R ref.fa \
-I input_bams.list \
-L 1:204159612 \
-o output.raw.vcf \
--dbsnp dbsnp_137.b37.vcf \
-glm SNP \
-dcov 200 \
--output_mode EMIT_ALL_SITES \
--validation_strictness LENIENT \
-l INFO \
-nt 4

Output:

1 204159612 . T . 44.76 . AN=4;DP=15;MQ=60.00;MQ0=0 GT:DP 0/0:6 0/0:6

There are two input samples, where one seems to have the deletion as shown in the screenshot above and the other does not seem to have any variant. Both samples are whole exomes.

well, with -glm SNP it is doing the right thing, which is to call it reference. What does it say with -glm indel ?

• Posts: 23Member

With -glm INDEL I get nothing in the output. The command is identical to the one above, except that -glm SNP is -glm INDEL.

• Posts: 23Member

To clearify: by "nothing" I mean that the only thing in the output VCF is the header section.

• Posts: 71Dev mod

Hi there - you might want to check the -minIndelCnt and -minIndelFrac arguments. As stated in the documentation, you need by default at least 5 reads with an indel for reads to be candidate for genotyping. You should also see the warnings in your output about using -glm INDEL and -out_mode EMIT_ALL_SITES.

• Posts: 23Member

Yeah there was a warning message about that, which is why I tried -glm SNP.

Ok, -minIndelCnt and -minIndelFrac may be the reasons why I didn't see this variant in the output VCF. Although I don't fully understand how:

The documentation says that -minIndelCnt is the minimum number of reads to trigger genotyping and that a locus must fulfill -minIndelFrac before the number of reads is considered. Is it enough that the fraction is at least 0.25 (default) or do both criteria need to be fulfilled? The reason I ask is because I have other samples with about 4x coverage and the UG finds a lot of variants in these samples, where the DP field in the VCF can be as low as 2, with one read per allele so the fraction is 0.5. In the example above, the fraction is 0.33 but there are only 3 reads that have the indel so it is not considered.

Either case has less than 5 reads but both have a fraction above 0.25 (the default), yet the variant is not called in the case in the screenshot above (3 reads, fraction 0.33).

What am I missing here?

Both conditions need to be fulfilled.

When you say that the UG finds a lot of variants in your other low-coverage samples, do you mean it finds indels or SNPs (or both)? SNPs are not subject to the same constraints as indels.

Geraldine Van der Auwera, PhD

• Posts: 23Member

Ok, I see.

I use UG to find INDELs in the low-coverage samples as well.

This is an example of the genotype information from that call set (there are lots of samples so I omitted some columns):

GT:AD:DP:GQ:PL ... 0/1:1,1:2:33:33,0,38 0/1:1,1:2:33:33,0,34 0/0:4,0:4:12:0,12,167 ...

Unless I'm completely off here, it seems like the first two sample are called as heterozygous for the variants with one read supporting either allele, with a total read depth of 2. This call set was generated using the same settings as above, plus -stand_emit_conf 10, -glm INDEL, and no -L interval specified.

I see -- just to be clear, this line is for an indel call, yes?

Let me check if the minimum count is on the filtered or unfiltered depth...

Geraldine Van der Auwera, PhD

• Posts: 23Member

Yes, this is a line for an INDEL call for the low-coverage samples.

• Posts: 23Member
edited May 2013

@Carneiro said:
The minimum thresholds are for the pileup, once you're over the threshold, every sample gets genotyped.

Ok, so the count is the combined count for all samples? In that case, I should lower the minimum count when I'm just running the UG with two samples, right?

EDIT:
I ran the UG again on the two exome samples with -minIndelCnt set to 3 and now it does indeed pick up the deletion. Thank you all for your support and for helping me understand what was going on here!