The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Did we ask for a bug report?

Then follow instructions in Article#1894.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block.
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Genotype and Validate, or Haplotype Caller GGA: what am I doing wrong?

Member Posts: 2

Hi,

My use case is quite straightforward, but has been surprisingly hard to achieve:
For each sample, I have both Omni 2.5M SNP genotype data and RNA-seq variant call data (done with GATK3).
Now I want to see how well the RNA-seq variant calling is performing, using the SNP genotypes as reference.
To do this, I need not only the variant calls in the RNA-seq data (as HC is outputting normally), but all genotypes for a given set of positions.
Ideally, I would like to keep all the normal info fields from the RNA VCF, to allow calculation of some concordance metrics based on depth of coverage and other quality parameters later.

I've tried the following:
1. GenotypeAndValidate. With SNP VCF as "truth" and BAM to evaluate. The command:

java -Xmx32g -jar ${GATK} \ -T GenotypeAndValidate \ -R${REF} \
-I ${BAM} \ -alleles${SNPVCF} \
-L ${SNPVCF} \ -o$SAMPLEID.rnasnp.vcf \
-nt 4


The results (running only chr 1, with ~185k SNPs):

(empty) ALT REF No Status
called alt 0 0 4096
called ref 0 0 12995
not called 0 0 153034

sensitivity: NaN%
specificity: 100.000000%
not confident: 3678
not covered: 149356

This runs surprisingly fast - which makes me think I'm not inputting the files as expected.

2. Haplotype Caller in GGA mode. Giving it the SNP VCF as the --alleles file. The command, adjusted for RNA-seq data:

java -Xmx32g -jar ${GATK} \ -T HaplotypeCaller \ -R${REF} \
--dbsnp ${DBSNP} \ -I${BAM} \
-L ${SNPVCF} \ -alleles${SNPVCF} \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-dontUseSoftClippedBases \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0 \
-o $SAMPLEID.rnasnp.vcf \ -nct 16  This almost results in what I want, in that HC starts outputting also 0/0 and ./. calls for reference and non-covered bases. But, it does so only for SNP-positions with non-reference alleles in the SNP VCF. Again, I want all positions called - including those that are homozygous reference in the SNP VCF. I am using these tools wrong? Or should I be doing this differently? Thanks in advance, Vasilios Tagged: ## Comments • Cambridge, MAMember, Administrator, Broadie Posts: 11,421 admin Hi there, You need to run HC in -ERC GVCF mode (or BP_RESOLUTION) to get calls at all positions. Geraldine Van der Auwera, PhD • Member Posts: 2 Thanks Geraldine, That does work if I also remove the -gt_mode, -alleles and --interval_padding (otherwise HC crashed with "code error" after a while): java -Xmx32g -jar${GATK} \
-T HaplotypeCaller \
-R ${REF} \ --dbsnp${DBSNP} \
-I ${BAM} \ -L${SNPVCF} \
-dontUseSoftClippedBases \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0 \
-o \$sampleid.rnasnp.vcf \
-ERC BP_RESOLUTION \
-nct 16
`

Which also makes it run very very fast. But I assume calling only in 1bp intervals will also stop any local re-assembly (like HC normally does)? Perhaps that won't matter for my use.

Finally, any idea regarding the GenotypeAndValidate - is it not meant to deal with this type of situation?

Thanks for the help.
Vasilios

• ChinaMember Posts: 60

I just want to do this. Do you solve this problem ? I hope you can help me too.I have a VCF file and a mapping file (BAM),I want to konw the genotype for each site in the VCF file which also included the ref-hom . Thanks !

@vasilios, I think HC pads the interval to a minimum window size even if the original request is just one position, so you shouldn't need to worry on that front --but I will ask @vruano‌ to confirm.

Geraldine Van der Auwera, PhD

@zzq, I am answering you in the discussion thread you started.

Geraldine Van der Auwera, PhD

• Cambridge, MAMember, Dev Posts: 74 ✭✭
edited April 2014

Hi,

As it stands HC can make use of data up to 100 bp down and up-stream from any given interval and it will use as much of that as it thinks it needs too (can be controlled by some advance parameters). However it won't ever emit variation (or no-variation in GVCF/BP_RESOLUTION output mode) outside the interval provided with the exception of indels that extend out from that interval.

So answering @vasilios doubt, HC will do local re-assembly even if the requested intervals are small (e.g. 1bp)