
Genotype and Validate, or Haplotype Caller GGA: what am I doing wrong?

vasilios (Member, Posts: 2)


My use case is quite straightforward, but has been surprisingly hard to achieve:
For each sample, I have both Omni 2.5M SNP genotype data and RNA-seq variant calls (made with GATK 3). I want to see how well the RNA-seq variant calling performs, using the SNP genotypes as the reference. To do this, I need not only the variant calls from the RNA-seq data (which is what HC normally outputs), but genotypes at every position in a given set. Ideally, I would also like to keep all the usual INFO fields from the RNA VCF, so that I can later calculate concordance metrics based on depth of coverage and other quality parameters.
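A minimal sketch of the comparison described above, assuming both VCFs are single-sample, tab-delimited, and carry GT as the first FORMAT key. The function names `parse_genotypes` and `concordance` and the toy records are hypothetical and only illustrate the idea; genotypes are compared as allele bases so that differing ALT orderings do not matter.

```python
def parse_genotypes(vcf_lines):
    """Map (chrom, pos) -> sorted tuple of allele bases, or None for a no-call.

    Assumes single-sample VCF records with GT as the first FORMAT key.
    """
    genotypes = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        indices = fields[9].split(":")[0].replace("|", "/").split("/")
        if "." in indices:
            genotypes[(chrom, pos)] = None
        else:
            genotypes[(chrom, pos)] = tuple(sorted(alleles[int(i)] for i in indices))
    return genotypes


def concordance(truth, calls):
    """Classify every genotyped truth site by what the call set says about it."""
    counts = {"concordant": 0, "discordant": 0, "no_call": 0, "missing": 0}
    for site, truth_gt in truth.items():
        if truth_gt is None:
            continue  # no array genotype to compare against
        if site not in calls:
            counts["missing"] += 1      # site absent from the RNA-seq VCF
        elif calls[site] is None:
            counts["no_call"] += 1      # site present but genotyped ./.
        elif calls[site] == truth_gt:
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    return counts


# Toy records (hypothetical), standing in for the SNP array VCF and the RNA-seq VCF.
TRUTH = [
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/0",
    "1\t300\trs3\tG\tA\t.\tPASS\t.\tGT\t1/1",
    "1\t400\trs4\tT\tC\t.\tPASS\t.\tGT\t0/1",
]
CALLS = [
    "1\t100\t.\tA\tG\t50\t.\tDP=20\tGT:DP\t0/1:20",
    "1\t300\t.\tG\tA\t30\t.\tDP=5\tGT:DP\t0/1:5",
    "1\t400\t.\tT\tC\t0\t.\t.\tGT\t./.",
]

counts = concordance(parse_genotypes(TRUTH), parse_genotypes(CALLS))
print(counts)
```

The "missing" bucket is the reason all positions need to be emitted: a truth site that never appears in the RNA-seq VCF cannot be told apart from a confident homozygous-reference call.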

I've tried the following:
1. GenotypeAndValidate, with the SNP VCF as "truth" and the BAM as the data to evaluate. The command:

java -Xmx32g -jar ${GATK} \
-T GenotypeAndValidate \
-R ${REF} \
-I ${BAM} \
-alleles ${SNPVCF} \
-L ${SNPVCF} \
-o $SAMPLEID.rnasnp.vcf \
-nt 4

The results (running only chr 1, with ~185k SNPs):

             ALT   REF   No Status
called alt     0     0        4096
called ref     0     0       12995
not called     0     0      153034

sensitivity: NaN%
specificity: 100.000000%
not confident: 3678
not covered: 149356

This runs surprisingly fast, which makes me suspect I'm not supplying the input files the way the tool expects.
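A quick arithmetic check of the summary above, as a sketch. The counts are copied from the report; the sensitivity formula TP / (TP + FN) is an assumption about how such a metric is typically defined, not something taken from the GATK source. Under that assumption, an all-zero ALT column gives 0/0, which would be consistent with the reported NaN.

```python
import math

# Counts copied from the "No Status" column of the summary above.
called_alt, called_ref, not_called = 4096, 12995, 153034
not_confident, not_covered = 3678, 149356

total_sites = called_alt + called_ref + not_called

# The two "not called" sub-categories add up to the "not called" row.
assert not_confident + not_covered == not_called


def ratio_percent(numerator, denominator):
    """Return a percentage, or NaN when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else math.nan


# ASSUMED definition: sensitivity = TP / (TP + FN), taken from the ALT column,
# which is all zeros in this run.
true_positives, false_negatives = 0, 0
sensitivity = ratio_percent(true_positives, true_positives + false_negatives)

# Share of evaluated sites with no coverage in the RNA-seq BAM.
percent_not_covered = ratio_percent(not_covered, total_sites)

print(total_sites, sensitivity, round(percent_not_covered, 1))
```

Every site lands in the "No Status" column, so no site was treated as a known true or false variant, and most sites (roughly 88%) have no RNA-seq coverage at all.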

2. HaplotypeCaller in GGA (GENOTYPE_GIVEN_ALLELES) mode, giving it the SNP VCF as the --alleles file. The command, adjusted for RNA-seq data:

java -Xmx32g -jar ${GATK} \
-T HaplotypeCaller \
-R ${REF} \
--dbsnp ${DBSNP} \
-I ${BAM} \
-L ${SNPVCF} \
-alleles ${SNPVCF} \
--interval_padding 150 \
-recoverDanglingHeads \
-dontUseSoftClippedBases \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0 \
-o $SAMPLEID.rnasnp.vcf \
-nct 16

This almost gives me what I want: HC also starts outputting 0/0 and ./. calls for reference and non-covered positions.
However, it does so only for SNP positions that have non-reference alleles in the SNP VCF. Again, I want every position called, including those that are homozygous reference in the SNP VCF.
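One way to confirm this pattern is to tally the truth sites that are absent from the HC output by their genotype class in the SNP VCF. The sketch below assumes single-sample VCFs with GT as the first FORMAT key; `genotype_class`, `site_classes`, `missing_by_class`, and the toy records are hypothetical names and data used only for illustration.

```python
from collections import Counter


def genotype_class(gt):
    """Classify a GT string such as '0/1' as hom_ref, het, hom_alt or no_call."""
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return "no_call"
    distinct = set(indices)
    if distinct == {"0"}:
        return "hom_ref"
    return "hom_alt" if len(distinct) == 1 else "het"


def site_classes(vcf_lines):
    """Map (chrom, pos) -> genotype class for each record of a single-sample VCF."""
    classes = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        gt = fields[9].split(":")[0]
        classes[(fields[0], int(fields[1]))] = genotype_class(gt)
    return classes


def missing_by_class(truth_lines, call_lines):
    """Count truth sites absent from the call set, grouped by truth genotype class."""
    truth = site_classes(truth_lines)
    called_sites = set(site_classes(call_lines))
    return Counter(cls for site, cls in truth.items() if site not in called_sites)


# Toy records (hypothetical): two hom-ref truth sites never appear in the calls.
TRUTH = [
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/0",
    "1\t300\trs3\tG\tA\t.\tPASS\t.\tGT\t1/1",
    "1\t500\trs5\tT\tC\t.\tPASS\t.\tGT\t0/0",
]
CALLS = [
    "1\t100\t.\tA\tG\t50\t.\tDP=20\tGT:DP\t0/1:20",
    "1\t300\t.\tG\tA\t0\t.\t.\tGT\t./.",
]

missing = missing_by_class(TRUTH, CALLS)
print(dict(missing))
```

If the behaviour described above holds on the real data, the tally should contain only the hom_ref class; any het or hom_alt entries would point to a different cause.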

Am I using these tools wrong? Or should I be doing this differently?

Thanks in advance, Vasilios

