The current GATK version is 3.3-0


# Multi-sample calling vs single sample calling


Dear GATK team, I have used the UG following the GATK Best Practices workflow to call SNPs and indels from exome-seq data of 24 human samples. First I called SNPs and indels separately for each BAM file and obtained separate VCFs. Then I decided to try calling the SNPs and indels all in one go. I noticed that the output was quite different, and the number of insertions/deletions was higher when I called variants jointly (starting from separate BAM files: -I sample1.bam -I sample2.bam ...etc.). I also noticed that the called indels were mostly adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or poly-A runs (AAAAA). These SNPs and indels weren't called by the UG when I called the variants separately. Is it more error-prone to call variants jointly?


Hi Sharon,

Ah, I understand your question better now -- I thought you were calling indels and SNPs together, then separately, and seeing differences in the totals of variants called (there shouldn't be).

If your samples belong to a coherent cohort (as opposed to being all a bunch of individuals lumped together at random) then it is better to call them together. What you're seeing here is probably that you have some variants which don't look very good by themselves (the program is not confident that they are real -- possibly because the context is tricky as you describe), but occur in many of your samples. When the program sees that they are present in many samples, even at low confidence, this increases the overall confidence that those sites are variant. This is the strength of multisample calling.
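To make the "many weak signals" intuition concrete, here is a toy sketch (hypothetical numbers, not GATK's actual genotyping model): for independent samples, phred-scaled evidence adds in log space, so a site that fails the calling threshold in every individual sample can still clear it in a joint call.

```python
# Toy model only: assumes per-sample evidence is independent, so that
# P(site is NOT variant) multiplies across samples, i.e. phred scores add.
per_sample_qual = [8.0] * 6   # hypothetical: six samples, each with weak evidence
threshold = 30.0              # hypothetical emission threshold

# Individually, every sample fails the threshold...
individually_called = [q >= threshold for q in per_sample_qual]
print(individually_called)            # all False

# ...but the summed (joint) evidence passes it easily.
joint_qual = sum(per_sample_qual)
print(joint_qual >= threshold)        # True (48 >= 30)
```

This is why sites that look marginal in any single sample, but recur across the cohort, show up only in the multi-sample callset.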

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera : Hi, we also use multi-sample calling all the time (whole-exome sequencing), but in our experience almost none of the variants found only by these multi-sample calls are validated by Sanger sequencing. How can this be explained? I know that most of the time the qualities at these candidate sites don't look very good, but it seems that all of the identified multi-sample hits are false positives. Best, Air

Hi Air,

Do you mean that you have tried calling samples individually and together, and that the variants that are novel to the joint callset are all false positives?

If so, how low are the qualities? Can you post an IGV screenshot of any of the false positive regions?

Thanks, Sheila


Hi Sheila, we call all samples together, and candidates like the following one, for example, could not be validated (and this one has very good qualities). Here is the UG output line, without chrom and position:

```
. C   T   173.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.802;DP=781;Dels=0.00;FS=7.678;
HaplotypeScore=1.1802;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;MQ=59.93;MQ0=0;MQRankSum=-1.180;QD=2.56;
0/0:69,0:69:99:0,190,2382   0/0:58,0:58:99:0,150,1944   0/0:62,0:62:99:0,181,2249
0/0:75,0:75:99:0,199,2547   0/0:86,0:86:99:0,223,2859   0/1:25,6:31:99:123,0,707    0/0:59,0:59:99:0,172,2122
0/0:54,0:54:99:0,153,1941   0/1:30,7:37:94:94,0,846     0/0:68,0:68:99:0,175,2253   0/0:66,0:66:99:0,181,2293
0/0:64,1:66:99:0,150,2189   0/0:50,0:50:99:0,135,1701
```

And the corresponding IGV image is attached
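For readers decoding the record above: the per-sample columns follow the FORMAT layout GT:AD:DP:GQ:PL. A quick script (a sketch for illustration, not part of any GATK tooling) shows why the two het calls look suspect: their alt-allele fractions are far below the ~0.5 expected for a true heterozygote in a diploid sample.

```python
def parse_sample(field):
    """Split one per-sample VCF column (GT:AD:DP:GQ:PL) into its parts."""
    gt, ad, dp, gq, pl = field.split(":")
    ref_reads, alt_reads = (int(x) for x in ad.split(","))
    return {
        "GT": gt,
        "ref": ref_reads,
        "alt": alt_reads,
        "DP": int(dp),
        "GQ": int(gq),
        "alt_fraction": alt_reads / max(ref_reads + alt_reads, 1),
    }

# The two het calls and one typical hom-ref sample from the record above:
for field in ("0/1:25,6:31:99:123,0,707",
              "0/1:30,7:37:94:94,0,846",
              "0/0:69,0:69:99:0,190,2382"):
    s = parse_sample(field)
    print(s["GT"], f"alt fraction = {s['alt_fraction']:.2f}")
# -> 0/1 alt fraction = 0.19
# -> 0/1 alt fraction = 0.19
# -> 0/0 alt fraction = 0.00
```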

@airtime, Using the UnifiedGenotyper is no longer considered best practice. You should try calling this with HaplotypeCaller and see if you get a different call.

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera, what about a low sample count? With fewer than 30 samples, should we use UG instead of HC, or has that suggestion changed?

I'm not aware that we ever made that recommendation. There are some complications for filtering if you have fewer than 30 samples, because variant recalibration does not work as well with low amounts of data. But you can do variant calling with HaplotypeCaller on any number of samples. We have a new workflow that now makes it easier to use it on many samples, but a low number of samples was never a problem for HC.

Geraldine Van der Auwera, PhD


Ok, thank you very much. Sorry for the confusion. I was so sure it was recommended in Best Practices v4. I will take another look and try out HC for variant calling and for these multiple hits.

If you find the text that caused your misunderstanding, let me know and I will try to clarify what we wrote in order to avoid future misunderstandings. Good luck with your work.

Geraldine Van der Auwera, PhD


I took a look at my notes, at the Best Practices documentation, and at the output of HC, where it says "unrecalibrated SNP and indel calls". So I thought variant recalibration was required for the whole HC calling workflow. But it seems it is enough to use a hard-filtering method? I had associated something wrongly in my head; I think the documentation is fine.

Oh I see, we'll need to clarify that. This only means that the calls coming from HC are unfiltered, the same as for UG, and they need to be filtered before use in an analysis project. Filtering is best with VQSR, but if that's not possible, hard filtering is ok.

Geraldine Van der Auwera, PhD


Hi, so I repeated the analysis with HC for this project (previously done with UG), and for this multi-hit example we again get the false positive call. Only the read counts (ref,alt) changed from the example in the picture, to 25,3 and 28,6 respectively.

I'm currently running the analysis for another project, which has more validated multiple hits that are false positives. I hope we can remove some of them.

Hi,

Can you please confirm that this false positive still exists after variant filtering? HaplotypeCaller is designed to be very sensitive so that it does not miss any potential variants. However, after filtering, the false positives should be eliminated.

-Sheila


@Sheila Hi, after HC I select indels and SNPs/MNPs separately, then I filter with the default "hard to validate" filter settings.

Here is the same hit with different read counts:

```
.  C   T   175.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.940e-01;ClippingRankSum=0.158;
DP=518;FS=1.517;GQ_MEAN=100.54;GQ_STDDEV=31.03;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;
0/0:46,0:46:99:0,120,1800   0/0:41,0:41:93:0,93,1395    0/0:46,0:46:99:0,117,1755   0/0:50,0:50:99:0,120,1800
0/0:51,0:51:99:0,120,1800   0/1:25,3:28:51:51,0,1237    0/0:39,0:39:91:0,91,1377    0/0:28,0:28:69:0,69,1035
0/1:28,6:34:99:168,0,1200   0/0:38,0:38:90:0,90,1349    0/0:35,0:35:81:0,81,1215    0/0:52,0:52:99:0,120,1800
0/0:30,0:30:67:0,67,1110
```

Hi,

I do not understand what exactly you did. Can you post your command line for filtering?

Thanks, Sheila


Hi, I use the following commands:

```
java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R hg19.fasta --variant S1_snp_indel.raw.vcf --variant S2_snp_indel.raw.vcf --variant \
    -o all_snp_indel.raw.vcf

java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T SelectVariants \
    -R hg19.fasta --variant all_snp_indel.raw.vcf -o only_all_snp_mnp.raw.vcf \
    --selectTypeToInclude SNP --selectTypeToInclude MNP

java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T VariantFiltration \
    -R hg19.fasta --variant only_all_snp_mnp.raw.vcf -o HaTyCa_only_all_snp_mnp.hard-filtered.vcf \
    --filterExpression "QD < 2.0" --filterName "HARD_TO_VALIDATE_QD" \
    --filterExpression "MQ < 40.0" --filterName "HARD_TO_VALIDATE_MQ" \
    --filterExpression "FS > 60.0" --filterName "HARD_TO_VALIDATE_FS" \
    --filterExpression "HaplotypeScore > 13.0" --filterName "HARD_TO_VALIDATE_HaploTypeScore" \
    --filterExpression "MQRankSum < -12.5" --filterName "HARD_TO_VALIDATE_MQRankSum" \
    --filterExpression "ReadPosRankSum < -8.0" --filterName "HARD_TO_VALIDATE_ReadPosRankSum"

java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T SelectVariants \
    -R hg19.fasta --variant HaTyCa_only_all_snp_mnp.hard-filtered.vcf \
    -o HaTyCa_only_all_snp_mnp.hard-filtered_results.vcf \
    -select 'vc.isNotFiltered()'
```

OK, that all looks pretty normal. I think you're encountering the limitations of hard-filtering here. One way to deal with this is to use more stringent filters by trial and error, to find out what filters will remove this type of false positive but still retain the true positives you want. The other way (which is better) is to increase your cohort size (e.g. by using exomes from the 1000 Genomes project) so that you can use variant recalibration. That should provide more discriminating filtering.
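As a rough sanity check on the record posted earlier (a simplified recomputation for illustration; GATK's actual QualByDepth annotation differs in details), QUAL divided by the combined depth of the variant-carrying samples lands just above the default QD < 2.0 cutoff, so even a modestly stricter threshold would have filtered this site:

```python
qual = 173.78                      # site QUAL from the UG record above
variant_sample_depths = [31, 37]   # DP of the two 0/1 samples

# QualByDepth ~ QUAL normalized by depth of samples carrying the variant
qd = qual / sum(variant_sample_depths)
print(f"QD = {qd:.2f}")            # matches the QD=2.56 in the INFO field

for threshold in (2.0, 3.0):
    verdict = "FILTERED" if qd < threshold else "PASS"
    print(f"QD < {threshold}: {verdict}")
# -> QD < 2.0: PASS
# -> QD < 3.0: FILTERED
```

This is the trial-and-error tightening described above; the risk is that a stricter QD cutoff also removes true low-allele-fraction variants, which is why VQSR is preferred when enough data is available.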

Geraldine Van der Auwera, PhD


Hi, yeah, the second possibility sounds nice. But how should I increase the cohort, just by adding further exome samples without regard to the project? In some projects we have cell lines, in others we have tumor samples with corresponding normal samples. Is it important to also include cell lines in the cell-line projects? And what about the tumor entity, should I match the tumor type? What is important when increasing the sample count?
