The current GATK version is 3.2-2

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Multi-sample calling vs single sample calling

Posts: 4Member

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?

Tagged:

Hi Sharon,

Ah, I understand your question better now -- I thought you were calling indels and SNPs together, then separately, and seeing differences in the totals of variants called (there shouldn't be).

If your samples belong to a coherent cohort (as opposed to being all a bunch of individuals lumped together at random) then it is better to call them together. What you're seeing here is probably that you have some variants which don't look very good by themselves (the program is not confident that they are real -- possibly because the context is tricky as you describe), but occur in many of your samples. When the program sees that they are present in many samples, even at low confidence, this increases the overall confidence that those sites are variant. This is the strength of multisample calling.

Geraldine Van der Auwera, PhD

• Posts: 17Member

@Geraldine_VdAuwera : Hi, we also use all the time the multiplesample calling (WholeExomeSequencing), but we make the experience that as good as no variants of this multiplehits is validated by Sanger. How this could be explained? I know most the time the qualities itself are doesn't looks very well at this candidates, but it seems to be that all identified multiplehits are false positives. Best Air

Hi Air,

Do you mean that you have tried calling samples individually and together, and that the variants that are novel to the joint callset are all false positives?

If so, how low are the qualities? Can you post an IGV screenshot of any of the false positive regions?

Thanks, Sheila

• Posts: 17Member

Hi Sheila, we call all samples together and for example candidates like this following one couldn't be validated. (And this example is one with very good qualities) Here is the UG output line without chr and position:

. C   T   173.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.802;DP=781;Dels=0.00;FS=7.678;
HaplotypeScore=1.1802;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;MQ=59.93;MQ0=0;MQRankSum=-1.180;QD=2.56;
0/0:69,0:69:99:0,190,2382   0/0:58,0:58:99:0,150,1944   0/0:62,0:62:99:0,181,2249
0/0:75,0:75:99:0,199,2547   0/0:86,0:86:99:0,223,2859   0/1:25,6:31:99:123,0,707    0/0:59,0:59:99:0,172,2122
0/0:54,0:54:99:0,153,1941   0/1:30,7:37:94:94,0,846 0/0:68,0:68:99:0,175,2253   0/0:66,0:66:99:0,181,2293
0/0:64,1:66:99:0,150,2189   0/0:50,0:50:99:0,135,1701

And the corresponding IGV image is attached

@airtime, Using the UnifiedGenotyper is no longer considered best practice. You should try calling this with HaplotypeCaller and see if you get a different call.

Geraldine Van der Auwera, PhD

• Posts: 17Member

@Geraldine_VdAuwera, how about low sample amount? Under 30 samples we should use UG instead of HC or did this suggestion changed?

I'm not aware that we ever made that recommendation. There are some some complications for filtering if you have lower than 30 samples, because variant recalibration does not work as well with low amounts of data. But you can do variant calling with HaplotypeCaller on any number of samples. We have a new workflow that now makes it easier to use it on many samples, but low number of samples was never a problem for HC.

Geraldine Van der Auwera, PhD

• Posts: 17Member

Ok, thank you very much. Sorry for this confusion. I was so sure that in Best Practise v4 it was recommended. I would take a look again and try out HC for variant calling and for this multiple hits.

If you find the text that caused your misunderstanding, let me know and I will try to clarify what we wrote in order to avoid future misunderstandings. Good luck with your work.

Geraldine Van der Auwera, PhD

• Posts: 17Member

I take a look at my notice and at the best pratice documentation and at the output of HC it is written "unrecalibrated SNP and indel calls" So I thought the variant recalibration is needed for the whole HC calling. But it seem to be enough to make a hard-filtering methode? It was something wrong associated in my head, I think the documentations is fine.

Oh I see, we'll need to clarify that. This only means that the calls coming from HC are unfiltered, the same as for UG, and they need to be filtered before use in an analysis project. Filtering is best with VQSR, but if that's not possible, hard filtering is ok.

Geraldine Van der Auwera, PhD

• Posts: 17Member

Hi, so I repeat the analysis with the HC for this project (previously done by UG) and for this multihit example we have again the false positive hit. The read amount was only changed to 25,3 and 28,6 from the example in the picture, respectively.

I'm currently running another project analysis with more validated multiple hits which are false positives. I hope we can remove some of them.