Multi-sample calling vs single sample calling

SharonCoxSharonCox Posts: 4Member

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Sharon,

    Ah, I understand your question better now -- I thought you were calling indels and SNPs together, then separately, and seeing differences in the totals of variants called (there shouldn't be).

    If your samples belong to a coherent cohort (as opposed to being all a bunch of individuals lumped together at random) then it is better to call them together. What you're seeing here is probably that you have some variants which don't look very good by themselves (the program is not confident that they are real -- possibly because the context is tricky as you describe), but occur in many of your samples. When the program sees that they are present in many samples, even at low confidence, this increases the overall confidence that those sites are variant. This is the strength of multisample calling.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 20Member

    @Geraldine_VdAuwera : Hi, we also use all the time the multiplesample calling (WholeExomeSequencing), but we make the experience that as good as no variants of this multiplehits is validated by Sanger. How this could be explained? I know most the time the qualities itself are doesn't looks very well at this candidates, but it seems to be that all identified multiplehits are false positives. Best Air

  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi Air,

    Do you mean that you have tried calling samples individually and together, and that the variants that are novel to the joint callset are all false positives?

    If so, how low are the qualities? Can you post an IGV screenshot of any of the false positive regions?

    Thanks, Sheila

  • airtimeairtime Posts: 20Member

    Hi Sheila, we call all samples together and for example candidates like this following one couldn't be validated. (And this example is one with very good qualities) Here is the UG output line without chr and position:

    . C   T   173.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.802;DP=781;Dels=0.00;FS=7.678;
    HaplotypeScore=1.1802;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;MQ=59.93;MQ0=0;MQRankSum=-1.180;QD=2.56;
    ReadPosRankSum=0.650    GT:AD:DP:GQ:PL  
    0/0:69,0:69:99:0,190,2382   0/0:58,0:58:99:0,150,1944   0/0:62,0:62:99:0,181,2249   
    0/0:75,0:75:99:0,199,2547   0/0:86,0:86:99:0,223,2859   0/1:25,6:31:99:123,0,707    0/0:59,0:59:99:0,172,2122   
    0/0:54,0:54:99:0,153,1941   0/1:30,7:37:94:94,0,846 0/0:68,0:68:99:0,175,2253   0/0:66,0:66:99:0,181,2293   
    0/0:64,1:66:99:0,150,2189   0/0:50,0:50:99:0,135,1701

    And the corresponding IGV image is attached

    _false_positive_UGcall.png
    1681 x 855 - 22K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    @airtime, Using the UnifiedGenotyper is no longer considered best practice. You should try calling this with HaplotypeCaller and see if you get a different call.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 20Member

    @Geraldine_VdAuwera, how about low sample amount? Under 30 samples we should use UG instead of HC or did this suggestion changed?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    I'm not aware that we ever made that recommendation. There are some some complications for filtering if you have lower than 30 samples, because variant recalibration does not work as well with low amounts of data. But you can do variant calling with HaplotypeCaller on any number of samples. We have a new workflow that now makes it easier to use it on many samples, but low number of samples was never a problem for HC.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 20Member

    Ok, thank you very much. Sorry for this confusion. I was so sure that in Best Practise v4 it was recommended. I would take a look again and try out HC for variant calling and for this multiple hits.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    If you find the text that caused your misunderstanding, let me know and I will try to clarify what we wrote in order to avoid future misunderstandings. Good luck with your work.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 20Member

    I take a look at my notice and at the best pratice documentation and at the output of HC it is written "unrecalibrated SNP and indel calls" So I thought the variant recalibration is needed for the whole HC calling. But it seem to be enough to make a hard-filtering methode? It was something wrong associated in my head, I think the documentations is fine.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Oh I see, we'll need to clarify that. This only means that the calls coming from HC are unfiltered, the same as for UG, and they need to be filtered before use in an analysis project. Filtering is best with VQSR, but if that's not possible, hard filtering is ok.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 20Member

    Hi, so I repeat the analysis with the HC for this project (previously done by UG) and for this multihit example we have again the false positive hit. The read amount was only changed to 25,3 and 28,6 from the example in the picture, respectively.

    I'm currently running another project analysis with more validated multiple hits which are false positives. I hope we can remove some of them.

  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    Can you please confirm that this false positive still exists after variant filtering? Haplotype Caller is designed to be very sensitive so it does not miss any potential variants. However, after filtering the false positives should be eliminated.

    -Sheila

  • airtimeairtime Posts: 20Member
    edited October 23

    @Sheila‌ Hi, after HC I select indel and snp/mnp seperately, then I filter with the default filter settings at "hard to validate".

    here is the same hit with different read amounts:

    .  C   T   175.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.940e-01;ClippingRankSum=0.158;
    DP=518;FS=1.517;GQ_MEAN=100.54;GQ_STDDEV=31.03;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;
    MQ=60.00;MQ0=0;MQRankSum=-2.970e-01;NCC=0;QD=2.84;ReadPosRankSum=1.04   GT:AD:DP:GQ:PL
    0/0:46,0:46:99:0,120,1800   0/0:41,0:41:93:0,93,1395    0/0:46,0:46:99:0,117,1755   0/0:50,0:50:99:0,120,1800   
    0/0:51,0:51:99:0,120,1800   0/1:25,3:28:51:51,0,1237    0/0:39,0:39:91:0,91,1377    0/0:28,0:28:69:0,69,1035    
    0/1:28,6:34:99:168,0,1200   0/0:38,0:38:90:0,90,1349    0/0:35,0:35:81:0,81,1215    0/0:52,0:52:99:0,120,1800   
    0/0:30,0:30:67:0,67,1110
    Post edited by airtime on
  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    I do not understand what exactly you did. Can you post your command line for filtering?

    Thanks, Sheila

  • airtimeairtime Posts: 20Member

    @Sheila

    Hi, I use following commands:

    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T GenotypeGVCFs
    -R hg19.fasta --variant S1_snp_indel.raw.vcf --variant S2_snp_indel.raw.vcf --variant  -o all_snp_indel.raw.vcf
    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T SelectVariants
    -R hg19.fasta --variant all_snp_indel.raw.vcf -o only_all_snp_mnp.raw.vcf
    --selectTypeToInclude SNP --selectTypeToInclude MNP
    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T VariantFiltration
    -R hg19.fasta --variant only_all_snp_mnp.raw.vcf -o HaTyCa_only_all_snp_mnp.hard-filtered.vcf
    --filterExpression "QD < 2.0 " --filterName "HARD_TO_VALIDATE_QD" --filterExpression " MQ < 40.0 "
    --filterName "HARD_TO_VALIDATE_MQ" --filterExpression " FS > 60.0 " --filterName "HARD_TO_VALIDATE_FS"
    --filterExpression " HaplotypeScore > 13.0 " --filterName "HARD_TO_VALIDATE_HaploTypeScore"
    --filterExpression "MQRankSum < -12.5 " --filterName "HARD_TO_VALIDATE_MQRankSum"
    --filterExpression " ReadPosRankSum < -8.0" --filterName "HARD_TO_VALIDATE_ReadPosRankSum"
    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T SelectVariants
    -R hg19.fasta --variant HaTyCa_only_all_snp_mnp.hard-filtered.vcf
    -o HaTyCa_only_all_snp_mnp.hard-filtered_results.vcf
    -select 'vc.isNotFiltered()'
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    OK, that all looks pretty normal. I think you're encountering the limitations of hard-filtering here. One way to deal with this is to use more stringent filters by trial and error, to find out what filters will remove this type of false positive but still retain the true positives you want. The other way (which is better) is to increase your cohort size (e.g. by using exomes from the 1000 Genomes project) so that you can use variant recalibration. That should provide more discriminating filtering.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 20Member
    edited October 24

    Hi, yeah the second possibility sounds nice. But how to increase the cohort, only put some further exome samples without respect to the project? Because in some projects we have celllines in other we have tumor samples with corresponding normal samples. It is important to include also celllines at the cellline projects? And what about the tumor entity, should I keep the tumor type? What is important to increase the sample amount?

    Post edited by airtime on
  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    If you are working with tumor samples, we recommend using MuTect. Please read about it here: http://www.broadinstitute.org/cancer/cga/mutect

    -Sheila

Sign In or Register to comment.