Genotype and Validate, or Haplotype Caller GGA: what am I doing wrong?

vasiliosvasilios Posts: 2Member

Hi,

My use case is quite straightforward, but has been surprisingly hard to achieve:
For each sample, I have both Omni 2.5M SNP genotype data and RNA-seq variant call data (done with GATK3).
Now I want to see how well the RNA-seq variant calling is performing, using the SNP genotypes as reference.
To do this, I need not only the variant calls in the RNA-seq data (as HC is outputting normally), but all genotypes for a given set of positions.
Ideally, I would like to keep all the normal info fields from the RNA VCF, to allow calculation of some concordance metrics based on depth of coverage and other quality parameters later.

I've tried the following:
1. GenotypeAndValidate. With SNP VCF as "truth" and BAM to evaluate. The command:

java -Xmx32g -jar ${GATK} \  
-T GenotypeAndValidate \  
-R ${REF} \  
-I ${BAM} \  
-alleles ${SNPVCF} \  
-L ${SNPVCF} \  
-o $SAMPLEID.rnasnp.vcf \  
-nt 4  

The results (running only chr 1, with ~185k SNPs):

(empty) ALT REF No Status
called alt 0 0 4096
called ref 0 0 12995
not called 0 0 153034

sensitivity: NaN%
specificity: 100.000000%
not confident: 3678
not covered: 149356

This runs surprisingly fast - which makes me think I'm not inputting the files as expected.

2. Haplotype Caller in GGA mode. Giving it the SNP VCF as the --alleles file. The command, adjusted for RNA-seq data:

java -Xmx32g -jar ${GATK} \
-T HaplotypeCaller \
-R ${REF} \
--dbsnp ${DBSNP} \
-I ${BAM} \
-L ${SNPVCF} \
-alleles ${SNPVCF} \
--interval_padding 150 \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-recoverDanglingHeads \
-dontUseSoftClippedBases \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0 \
-o $SAMPLEID.rnasnp.vcf \
-nct 16

This almost results in what I want, in that HC starts outputting also 0/0 and ./. calls for reference and non-covered bases.
But, it does so only for SNP-positions with non-reference alleles in the SNP VCF. Again, I want all positions called - including those that are homozygous reference in the SNP VCF.

I am using these tools wrong? Or should I be doing this differently?

Thanks in advance,
Vasilios

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,424Administrator, GATK Developer admin

    Hi there,

    You need to run HC in -ERC GVCF mode (or BP_RESOLUTION) to get calls at all positions.

    Geraldine Van der Auwera, PhD

  • vasiliosvasilios Posts: 2Member

    Thanks Geraldine,
    That does work if I also remove the -gt_mode, -alleles and --interval_padding (otherwise HC crashed with "code error" after a while):

    java -Xmx32g -jar ${GATK} \
    -T HaplotypeCaller \
    -R ${REF} \
    --dbsnp ${DBSNP} \
    -I ${BAM} \
    -L ${SNPVCF} \
    -recoverDanglingHeads \
    -dontUseSoftClippedBases \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0 \
    -o $sampleid.rnasnp.vcf \
    -ERC BP_RESOLUTION \
    -nct 16  
    

    Which also makes it run very very fast. But I assume calling only in 1bp intervals will also stop any local re-assembly (like HC normally does)? Perhaps that won't matter for my use.

    Finally, any idea regarding the GenotypeAndValidate - is it not meant to deal with this type of situation?

    Thanks for the help.
    Vasilios

  • zzqzzq ChinaPosts: 15Member

    I just want to do this. Do you solve this problem ? I hope you can help me too.I have a VCF file and a mapping file (BAM),I want to konw the genotype for each site in the VCF file which also included the ref-hom . Thanks !

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,424Administrator, GATK Developer admin

    @vasilios, I think HC pads the interval to a minimum window size even if the original request is just one position, so you shouldn't need to worry on that front --but I will ask @vruano‌ to confirm.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,424Administrator, GATK Developer admin

    @zzq, I am answering you in the discussion thread you started.

    Geraldine Van der Auwera, PhD

  • valentinvalentin Cambridge, MAPosts: 19Member, GATK Developer mod
    edited April 2014

    Hi,

    As it stands HC can make use of data up to 100 bp down and up-stream from any given interval and it will use as much of that as it thinks it needs too (can be controlled by some advance parameters). However it won't ever emit variation (or no-variation in GVCF/BP_RESOLUTION output mode) outside the interval provided with the exception of indels that extend out from that interval.

    So answering @vasilios doubt, HC will do local re-assembly even if the requested intervals are small (e.g. 1bp)

    Post edited by valentin on
Sign In or Register to comment.