Finding mutagen caused mutations

gvjgvj NLMember

Hi GATK team,
I am using GATK to call random mutations caused by mutagen. I have more than 5 mutant lines sequenced and I want to call SNP/INDELs which are unique for each samples (since they are randomly introduced they must be unique for each samples). Majoirity of SNP will be shared between all mutant samples as this are the natural variations between line used for mutagenesis and the Reference sequence.

I am using UnifiedGenotyper (because I have sequenced pooled DNA for each mutant family) by giving all the 5 bam files from samples. I wonder, by calling SNP from all 5 mutant samples together may negatively influence the unique SNPs? Does GATK consider all the 5 samples as a population and give preference (or high score to) locus which are shared between majority of samples? In my case, would it be better to call SNP/INDEL separately for all 5 samples?

The command I am using is java -jar GenomeAnalysisTKLite-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -glm BOTH -R ref.fa -nt 10 -I results/1_sorted.bam_fixed_rmdup_realign.bam -I results/2_sorted.bam_fixed_rmdup_realign.bam -I results/3_sorted.bam_fixed_rmdup_realign.bam -I results/4_sorted.bam_fixed_rmdup_realign.bam -I results/5_sorted.bam_fixed_rmdup_realign.bam -o results/out.vcf

I hope I am clear, and looking forward to learn more,

Thank you in advance

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @gvj‌

    Hi,

    Calling SNPs together will not negatively influence the unique SNPs. GATK does give a higher score to a locus which is shared between a majority of samples, but it does not decrease the score of a unique locus. It is not better in your case to call SNP/INDELs separately on all 5 samples.

    Hope this helps!

    -Sheila

  • gvjgvj NLMember

    Hi Sheila,
    Thank you for the reply. I tried calling SNPs individually just to test. I found that there are unique SNPs called when I call SNPs individually. I am pasting two of those just to get an impression of SNPs for you.
    ch0 5715440 . T C 636.77 . AC=2;AF=1.00;AN=2;DP=26;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=25.62;MQ0=3;QD=24.49 GT:AD:DP:GQ:PL 1/1:0,26:26:63:665,63,0 ch0 7677372 . A C 875.77 . AC=2;AF=1.00;AN=2;DP=26;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=53.11;MQ0=0;QD=33.68 GT:AD:DP:GQ:PL 1/1:0,26:26:72:904,72,0

    What do you think about those? Since they have good coverage and alternative allele frequency, I tend to believe that they are true.

    Thank you again for your help

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @gvj‌
    Hi,
    You are right that the allele frequencies do look good and the coverage is good, however there are some quality issues with the other annotations that might explain why the SNP is not being called when you do joint calling. You can see that the mapping quality, quality by depth and genotype quality are all low in the first sample. In the second sample, the mapping quality is very good and the quality of depth is fine, but the genotype quality is low considering the allelic frequency is so clear cut. This indicates there are problems that make it a marginal call rather than a confident call.

    If you want to be highly sensitive to singletons, you are probably better off calling the samples separately, but please note this can lead to a higher false positive rate.

    -Sheila

  • gvjgvj NLMember

    Thanks Sheila for the insight. I know that it must be hard to define some cutoffs, however it would be great if you can give me some figures (or even a range ) for MQ, QD and GQ which you think a good cutoff to select confident SNPs in combined and individual calls?

    It would also great if you could comment on which of these (or combination) I should use to play around to make the prediction more inclusive?

    Thanks you once again for your help

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @gvj‌

    Hi,

    For cutoff guidance, you can refer to (howto) Apply hard filters to a call set here: http://www.broadinstitute.org/gatk/guide/topic?name=tutorials

    You can also try the next step in Best Practices which is to run Variant Quality Score Recalibration, if you have enough data.

    Also, to find out why Haplotype Caller is calling the SNPs in individual samples and not in a population, you can use the -bamOut argument to compare haplotypes. Specifically, you want to run it on the regions that are outputting different SNPs. When you input the intervals you want to focus on, the output bam file will have the haplotypes generated for each of those regions and you can compare the individual haplotypes and the population haplotypes.

    I hope this helps.

    -Sheila

Sign In or Register to comment.