Joint variant calling using RNA-Seq data

xin.huangxin.huang Washington, DCMember

Dear GATK Team,

I realized that the Best Practices does not recommend the using the joing genotyping for cohort analysis using RNA-Seq data. Since I didn't see a clear alternative recommendation (or I may well have missed key points), I tried the ERC mode of the HaplotypeCaller. I was trying to find out fixed SNPs between two populations with 11 and 9 individuals, respectively. The number of SNPs (after window filtering, ~10,000) that I found seemed to be reasonable, compared to the numbers of segregating SNPs (after window filtering, ~160,000) between individual pairs from the two populations.

I expected the fixed/segregating SNPs between the two populations to exist between each of the individual pairs, since the segregating SNPs should be subsets of SNPs between individual pairs. However, I found that about half of the SNPs between populations were not found in one of invidual pairs that I started to look at. Originally I thought it was because window filtering for each individual sample could have got rid of valid SNPs between populations, because some SNPs in a 35bp window would no longer exist in a cluster when all individuals have been taken into consideration. So I dialed back without the window filtering steps, but still more than 40% of the segregating SNPs between populations were not found between the individual pair that I looked at.

I realized that I need to use the joint genotyping here at my own risk, but if there could be any insight into what could have happened, I would greatly appreciated it! Also, if joint genotyping is not approriate for RNA-Seq data, why is that and what alternatives may I try to call SNPs using multiple individual libraries?

Many thanks for taking time to consider my questions,

Xin

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @xin.huang
    Hi Xin,

    Can you tell me a little more about what you are trying to do? What exactly did you do for window filtering? Can you post the exact commands you ran for each type of filtering? Also, what do you mean by pairs of individuals?

    Thanks,
    Sheila

  • xin.huangxin.huang Washington, DCMember

    Hi Sheila,

    Sorry my question was not clear. We have mosquitoes from two populations, and we crossed individual females (N=9) from one population with individual males (N=11) from another. I was trying to find fixed SNPs between these two populations, meaning that at the particular locus, all females from Pop 1 have the same nucleotide, and differ from all the males from Pop 2, which share the same nucleotide. What we have are RNA-Seq reads for each of the individuals. I mapped the reads to a transcriptome that we built, because at that time the genome was not available. First I called variants on individual BAM files:

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R <transcriptome.fa> -I <coord_sorted_dedupped.bam> --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o <raw.vcf>

    Then I did the join calling with all individual GVCFs from one population:

    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R <transcriptome.fa> -stand_call_conf 20 -stand_emit_conf 20 -V <raw.vcf1> -V <raw.vcf2> ... -o <joint.vcf>

    Then I selected all the SNPs and used QD<2.0 and FS>30.0 for variant filtering, and I also only kept SNPs with a G/T = 1/1, to get fixed SNPs from each populations. The command I used was:
    bcftools filter -s LowQual -e '%QUAL<20 || GT!="1/1" || QD<2.0 || FS>30.0' -O z <joint.vcf> > <joint.flt.vcf>

    The reason I didn't apply the windown filtering in this step was that I couldn't figure out a way to use GATK to get SNPs with a G/T=1/1. After that I used bcftools to get SNPs identified from both populations after the filtering, by using the command
    bcftools isec -p -n+1 -O z -f .,PASS -c all <joint1.flt.vcf.gz> <joint2.flt.vcf.gz>

    Then I used -window 35 -cluster 3 to filter both joint VCFs by using:
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R <transcriptome.fa> -V <joint.flt.vcf> -window 35 -cluster 3 -o <joint.flt.windowfiltered.vcf>

    Then I found identical SNPs between the two populations by using:
    bcftools isec -p -n=2 -O z -f .,PASS -c none <joint1.flt.windowfilterd.vcf.gz> <joint2.flt.windowfiltered.vcf.gz>

    Finally I excluded these shared SNPs from both variant sets to get the final set by using:
    bcftools isec -p -C -O z -f .,PASS <joint1.flt.windowfilterd.vcf.gz> <shared_snps.vcf.gz>

    By pair of individuals, I meant that a female from one population and its mating male from another population. For these individuals, I used:

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R <transcriptome.fa> -I <coord_sorted_dedupped.bam> -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o <raw.vcf>

    And I followed similar filtering procedures as described above.

    Thank you for any insight you might have into this!

    Xin

  • xin.huangxin.huang Washington, DCMember

    Hi Sheila,

    Yes, your understanding is exactly right. Thanks for your suggestions to get the segregating SNPs after the variant calling.

    I was wondering what I can use to do the join calling using these RNA-Seq libraries. I used the ERC mode of the HaplotypeCaller. The number of SNPs (after window filtering, ~10,000) that I found seemed to be reasonable, compared to the numbers of segregating SNPs (after window filtering, ~160,000) between individual pairs from the two populations.

    I expected the fixed/segregating SNPs between the two populations to exist between each of the individual pairs of male and female from population 1 and population 2, respectively, since the segregating SNPs should be subsets of SNPs between individual pairs. However, I found that about half of the SNPs between populations were not found in one of invidual pairs that I started to look at. Originally I thought it was because window filtering for each individual sample could have got rid of valid SNPs between populations, because some SNPs in a 35bp window would no longer exist in a cluster when all individuals have been taken into consideration. So I dialed back without the window filtering steps, but still more than 40% of the segregating SNPs between populations were not found between the individual pair that I looked at.

    I realized that I need to use the joint genotyping here at my own risk, but if there could be any insight into what could have happened, I would greatly appreciated it! Also, if joint genotyping is not approriate for RNA-Seq data, why is that and what alternatives may I try to call SNPs using multiple individual libraries?

    Thanks for any insight you might have!

    Best,

    Xin

Sign In or Register to comment.