The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Did we ask for a bug report?

Then follow instructions in Article#1894.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

##### Jump to another community
Picard 2.10.4 has MAJOR CHANGES that impact throughput of pipelines. Default compression is now 1 instead of 5, and Picard now handles compressed data with the Intel Deflator/Inflator instead of JDK.
GATK version 4.beta.3 (i.e. the third beta release) is out. See the github release page for download and details.

# [Screenshot + info] Appearent deletion not called by UnifiedGenotyper

Member

Hello!

I have what seems like a deletion in my sample, that is not called by UnifiedGenotyper. The data has gone through the best practises pipeline (duplicate marking, indel-realignment, base recalibration) before calling the variants. This is what it looks like in IGV:

Fullsize: postimg.org/image/57u1srm9r/

The mapping quality is 60 for all reads that overlap the deletion, the cigars are 100M except for the three reads that have the deletion.

I ran UnifiedGenotyper with as low as -stand_emit_conf 4.0 but there is still no record of an indel at this position, neither in the raw call set or in the call set after variant recalibration.

Why isn't this deletion called? What else can I do to shed some light on this?

Tagged:

## Answers

• Charlestown, MAMember

can you post your command line for the UG ? Ideally with -L for this position only and the output with EMIT_ALL_SITES so we can see what is going on here?

• Member

I tried to run the command with -L set at the coordinate of the deletion, as well as a larger window of about 1 kb, but I got nothing in the output VCF file. Instead, I switched to -glm SNP and that produced an output.

Command:

java -Xmx8g -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R ref.fa \ -I input_bams.list \ -L 1:204159612 \ -o output.raw.vcf \ --dbsnp dbsnp_137.b37.vcf \ -glm SNP \ -dcov 200 \ --output_mode EMIT_ALL_SITES \ --validation_strictness LENIENT \ -l INFO \ -nt 4

Output:

1 204159612 . T . 44.76 . AN=4;DP=15;MQ=60.00;MQ0=0 GT:DP 0/0:6 0/0:6

There are two input samples, where one seems to have the deletion as shown in the screenshot above and the other does not seem to have any variant. Both samples are whole exomes.

• Charlestown, MAMember

well, with -glm SNP it is doing the right thing, which is to call it reference. What does it say with -glm indel ?

• Member

With -glm INDEL I get nothing in the output. The command is identical to the one above, except that -glm SNP is -glm INDEL.

• Member

To clearify: by "nothing" I mean that the only thing in the output VCF is the header section.

• Broad InstituteMember

Hi there - you might want to check the -minIndelCnt and -minIndelFrac arguments. As stated in the documentation, you need by default at least 5 reads with an indel for reads to be candidate for genotyping. You should also see the warnings in your output about using -glm INDEL and -out_mode EMIT_ALL_SITES.

• Member

Yeah there was a warning message about that, which is why I tried -glm SNP.

Ok, -minIndelCnt and -minIndelFrac may be the reasons why I didn't see this variant in the output VCF. Although I don't fully understand how:

The documentation says that -minIndelCnt is the minimum number of reads to trigger genotyping and that a locus must fulfill -minIndelFrac before the number of reads is considered. Is it enough that the fraction is at least 0.25 (default) or do both criteria need to be fulfilled? The reason I ask is because I have other samples with about 4x coverage and the UG finds a lot of variants in these samples, where the DP field in the VCF can be as low as 2, with one read per allele so the fraction is 0.5. In the example above, the fraction is 0.33 but there are only 3 reads that have the indel so it is not considered.

Either case has less than 5 reads but both have a fraction above 0.25 (the default), yet the variant is not called in the case in the screenshot above (3 reads, fraction 0.33).

What am I missing here?

• Cambridge, MAMember, Administrator, Broadie

Both conditions need to be fulfilled.

When you say that the UG finds a lot of variants in your other low-coverage samples, do you mean it finds indels or SNPs (or both)? SNPs are not subject to the same constraints as indels.

• Member

Ok, I see.

I use UG to find INDELs in the low-coverage samples as well.

This is an example of the genotype information from that call set (there are lots of samples so I omitted some columns):

GT:AD:DP:GQ:PL ... 0/1:1,1:2:33:33,0,38 0/1:1,1:2:33:33,0,34 0/0:4,0:4:12:0,12,167 ...

Unless I'm completely off here, it seems like the first two sample are called as heterozygous for the variants with one read supporting either allele, with a total read depth of 2. This call set was generated using the same settings as above, plus -stand_emit_conf 10, -glm INDEL, and no -L` interval specified.

• Cambridge, MAMember, Administrator, Broadie

I see -- just to be clear, this line is for an indel call, yes?

Let me check if the minimum count is on the filtered or unfiltered depth...

• Member

Yes, this is a line for an INDEL call for the low-coverage samples.

• Member
edited May 2013

@Carneiro said:
The minimum thresholds are for the pileup, once you're over the threshold, every sample gets genotyped.

Ok, so the count is the combined count for all samples? In that case, I should lower the minimum count when I'm just running the UG with two samples, right?

EDIT:
I ran the UG again on the two exome samples with -minIndelCnt set to 3 and now it does indeed pick up the deletion. Thank you all for your support and for helping me understand what was going on here!

• Charlestown, MAMember

Here is the doc:

-minIndelCnt / --min_indel_count_for_genotyping ( int with default value 5 )
Minimum number of consensus indels required to trigger genotyping run. A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value will increase sensitivity but at the cost of larger calling time and a larger number of false positives

Sign In or Register to comment.