Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Multi-sample calling vs single sample calling

SharonCoxSharonCox Posts: 4Member

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,412Administrator, GATK Developer admin

    Hi Sharon,

    Ah, I understand your question better now -- I thought you were calling indels and SNPs together, then separately, and seeing differences in the totals of variants called (there shouldn't be).

    If your samples belong to a coherent cohort (as opposed to being all a bunch of individuals lumped together at random) then it is better to call them together. What you're seeing here is probably that you have some variants which don't look very good by themselves (the program is not confident that they are real -- possibly because the context is tricky as you describe), but occur in many of your samples. When the program sees that they are present in many samples, even at low confidence, this increases the overall confidence that those sites are variant. This is the strength of multisample calling.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 17Member

    @Geraldine_VdAuwera : Hi, we also use all the time the multiplesample calling (WholeExomeSequencing), but we make the experience that as good as no variants of this multiplehits is validated by Sanger. How this could be explained? I know most the time the qualities itself are doesn't looks very well at this candidates, but it seems to be that all identified multiplehits are false positives. Best Air

  • SheilaSheila Broad InstitutePosts: 531Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi Air,

    Do you mean that you have tried calling samples individually and together, and that the variants that are novel to the joint callset are all false positives?

    If so, how low are the qualities? Can you post an IGV screenshot of any of the false positive regions?

    Thanks, Sheila

  • airtimeairtime Posts: 17Member

    Hi Sheila, we call all samples together and for example candidates like this following one couldn't be validated. (And this example is one with very good qualities) Here is the UG output line without chr and position:

    . C   T   173.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.802;DP=781;Dels=0.00;FS=7.678;
    HaplotypeScore=1.1802;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;MQ=59.93;MQ0=0;MQRankSum=-1.180;QD=2.56;
    ReadPosRankSum=0.650    GT:AD:DP:GQ:PL  
    0/0:69,0:69:99:0,190,2382   0/0:58,0:58:99:0,150,1944   0/0:62,0:62:99:0,181,2249   
    0/0:75,0:75:99:0,199,2547   0/0:86,0:86:99:0,223,2859   0/1:25,6:31:99:123,0,707    0/0:59,0:59:99:0,172,2122   
    0/0:54,0:54:99:0,153,1941   0/1:30,7:37:94:94,0,846 0/0:68,0:68:99:0,175,2253   0/0:66,0:66:99:0,181,2293   
    0/0:64,1:66:99:0,150,2189   0/0:50,0:50:99:0,135,1701

    And the corresponding IGV image is attached

    _false_positive_UGcall.png
    1681 x 855 - 22K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,412Administrator, GATK Developer admin

    @airtime, Using the UnifiedGenotyper is no longer considered best practice. You should try calling this with HaplotypeCaller and see if you get a different call.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 17Member

    @Geraldine_VdAuwera, how about low sample amount? Under 30 samples we should use UG instead of HC or did this suggestion changed?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,412Administrator, GATK Developer admin

    I'm not aware that we ever made that recommendation. There are some some complications for filtering if you have lower than 30 samples, because variant recalibration does not work as well with low amounts of data. But you can do variant calling with HaplotypeCaller on any number of samples. We have a new workflow that now makes it easier to use it on many samples, but low number of samples was never a problem for HC.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 17Member

    Ok, thank you very much. Sorry for this confusion. I was so sure that in Best Practise v4 it was recommended. I would take a look again and try out HC for variant calling and for this multiple hits.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,412Administrator, GATK Developer admin

    If you find the text that caused your misunderstanding, let me know and I will try to clarify what we wrote in order to avoid future misunderstandings. Good luck with your work.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 17Member

    I take a look at my notice and at the best pratice documentation and at the output of HC it is written "unrecalibrated SNP and indel calls" So I thought the variant recalibration is needed for the whole HC calling. But it seem to be enough to make a hard-filtering methode? It was something wrong associated in my head, I think the documentations is fine.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,412Administrator, GATK Developer admin

    Oh I see, we'll need to clarify that. This only means that the calls coming from HC are unfiltered, the same as for UG, and they need to be filtered before use in an analysis project. Filtering is best with VQSR, but if that's not possible, hard filtering is ok.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 17Member

    Hi, so I repeat the analysis with the HC for this project (previously done by UG) and for this multihit example we have again the false positive hit. The read amount was only changed to 25,3 and 28,6 from the example in the picture, respectively.

    I'm currently running another project analysis with more validated multiple hits which are false positives. I hope we can remove some of them.

  • SheilaSheila Broad InstitutePosts: 531Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    Can you please confirm that this false positive still exists after variant filtering? Haplotype Caller is designed to be very sensitive so it does not miss any potential variants. However, after filtering the false positives should be eliminated.

    -Sheila

Sign In or Register to comment.