US Holiday notice: this Thursday and Friday (Nov 25-26) the forum will be unattended. Normal service will resume Monday Nov 29. Happy Thanksgiving!

Multi-sample calling vs single sample calling

SharonCoxSharonCox Posts: 4Member

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Hi Sharon,

    Ah, I understand your question better now -- I thought you were calling indels and SNPs together, then separately, and seeing differences in the totals of variants called (there shouldn't be).

    If your samples belong to a coherent cohort (as opposed to being all a bunch of individuals lumped together at random) then it is better to call them together. What you're seeing here is probably that you have some variants which don't look very good by themselves (the program is not confident that they are real -- possibly because the context is tricky as you describe), but occur in many of your samples. When the program sees that they are present in many samples, even at low confidence, this increases the overall confidence that those sites are variant. This is the strength of multisample calling.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member

    @Geraldine_VdAuwera : Hi, we also use all the time the multiplesample calling (WholeExomeSequencing), but we make the experience that as good as no variants of this multiplehits is validated by Sanger. How this could be explained? I know most the time the qualities itself are doesn't looks very well at this candidates, but it seems to be that all identified multiplehits are false positives. Best Air

  • SheilaSheila Broad InstitutePosts: 653Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi Air,

    Do you mean that you have tried calling samples individually and together, and that the variants that are novel to the joint callset are all false positives?

    If so, how low are the qualities? Can you post an IGV screenshot of any of the false positive regions?

    Thanks, Sheila

  • airtimeairtime Posts: 23Member

    Hi Sheila, we call all samples together and for example candidates like this following one couldn't be validated. (And this example is one with very good qualities) Here is the UG output line without chr and position:

    . C   T   173.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.802;DP=781;Dels=0.00;FS=7.678;
    HaplotypeScore=1.1802;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;MQ=59.93;MQ0=0;MQRankSum=-1.180;QD=2.56;
    ReadPosRankSum=0.650    GT:AD:DP:GQ:PL  
    0/0:69,0:69:99:0,190,2382   0/0:58,0:58:99:0,150,1944   0/0:62,0:62:99:0,181,2249   
    0/0:75,0:75:99:0,199,2547   0/0:86,0:86:99:0,223,2859   0/1:25,6:31:99:123,0,707    0/0:59,0:59:99:0,172,2122   
    0/0:54,0:54:99:0,153,1941   0/1:30,7:37:94:94,0,846 0/0:68,0:68:99:0,175,2253   0/0:66,0:66:99:0,181,2293   
    0/0:64,1:66:99:0,150,2189   0/0:50,0:50:99:0,135,1701

    And the corresponding IGV image is attached

    _false_positive_UGcall.png
    1681 x 855 - 22K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    @airtime, Using the UnifiedGenotyper is no longer considered best practice. You should try calling this with HaplotypeCaller and see if you get a different call.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member

    @Geraldine_VdAuwera, how about low sample amount? Under 30 samples we should use UG instead of HC or did this suggestion changed?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    I'm not aware that we ever made that recommendation. There are some some complications for filtering if you have lower than 30 samples, because variant recalibration does not work as well with low amounts of data. But you can do variant calling with HaplotypeCaller on any number of samples. We have a new workflow that now makes it easier to use it on many samples, but low number of samples was never a problem for HC.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member

    Ok, thank you very much. Sorry for this confusion. I was so sure that in Best Practise v4 it was recommended. I would take a look again and try out HC for variant calling and for this multiple hits.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    If you find the text that caused your misunderstanding, let me know and I will try to clarify what we wrote in order to avoid future misunderstandings. Good luck with your work.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member

    I take a look at my notice and at the best pratice documentation and at the output of HC it is written "unrecalibrated SNP and indel calls" So I thought the variant recalibration is needed for the whole HC calling. But it seem to be enough to make a hard-filtering methode? It was something wrong associated in my head, I think the documentations is fine.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Oh I see, we'll need to clarify that. This only means that the calls coming from HC are unfiltered, the same as for UG, and they need to be filtered before use in an analysis project. Filtering is best with VQSR, but if that's not possible, hard filtering is ok.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member

    Hi, so I repeat the analysis with the HC for this project (previously done by UG) and for this multihit example we have again the false positive hit. The read amount was only changed to 25,3 and 28,6 from the example in the picture, respectively.

    I'm currently running another project analysis with more validated multiple hits which are false positives. I hope we can remove some of them.

  • SheilaSheila Broad InstitutePosts: 653Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    Can you please confirm that this false positive still exists after variant filtering? Haplotype Caller is designed to be very sensitive so it does not miss any potential variants. However, after filtering the false positives should be eliminated.

    -Sheila

  • airtimeairtime Posts: 23Member
    edited October 23

    @Sheila‌ Hi, after HC I select indel and snp/mnp seperately, then I filter with the default filter settings at "hard to validate".

    here is the same hit with different read amounts:

    .  C   T   175.78  PASS    AC=2;AF=0.077;AN=26;BaseQRankSum=-5.940e-01;ClippingRankSum=0.158;
    DP=518;FS=1.517;GQ_MEAN=100.54;GQ_STDDEV=31.03;InbreedingCoeff=-0.0833;MLEAC=2;MLEAF=0.077;
    MQ=60.00;MQ0=0;MQRankSum=-2.970e-01;NCC=0;QD=2.84;ReadPosRankSum=1.04   GT:AD:DP:GQ:PL
    0/0:46,0:46:99:0,120,1800   0/0:41,0:41:93:0,93,1395    0/0:46,0:46:99:0,117,1755   0/0:50,0:50:99:0,120,1800   
    0/0:51,0:51:99:0,120,1800   0/1:25,3:28:51:51,0,1237    0/0:39,0:39:91:0,91,1377    0/0:28,0:28:69:0,69,1035    
    0/1:28,6:34:99:168,0,1200   0/0:38,0:38:90:0,90,1349    0/0:35,0:35:81:0,81,1215    0/0:52,0:52:99:0,120,1800   
    0/0:30,0:30:67:0,67,1110
    Post edited by airtime on
  • SheilaSheila Broad InstitutePosts: 653Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    I do not understand what exactly you did. Can you post your command line for filtering?

    Thanks, Sheila

  • airtimeairtime Posts: 23Member

    @Sheila

    Hi, I use following commands:

    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T GenotypeGVCFs
    -R hg19.fasta --variant S1_snp_indel.raw.vcf --variant S2_snp_indel.raw.vcf --variant  -o all_snp_indel.raw.vcf
    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T SelectVariants
    -R hg19.fasta --variant all_snp_indel.raw.vcf -o only_all_snp_mnp.raw.vcf
    --selectTypeToInclude SNP --selectTypeToInclude MNP
    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T VariantFiltration
    -R hg19.fasta --variant only_all_snp_mnp.raw.vcf -o HaTyCa_only_all_snp_mnp.hard-filtered.vcf
    --filterExpression "QD < 2.0 " --filterName "HARD_TO_VALIDATE_QD" --filterExpression " MQ < 40.0 "
    --filterName "HARD_TO_VALIDATE_MQ" --filterExpression " FS > 60.0 " --filterName "HARD_TO_VALIDATE_FS"
    --filterExpression " HaplotypeScore > 13.0 " --filterName "HARD_TO_VALIDATE_HaploTypeScore"
    --filterExpression "MQRankSum < -12.5 " --filterName "HARD_TO_VALIDATE_MQRankSum"
    --filterExpression " ReadPosRankSum < -8.0" --filterName "HARD_TO_VALIDATE_ReadPosRankSum"
    java -Xmx12G -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T SelectVariants
    -R hg19.fasta --variant HaTyCa_only_all_snp_mnp.hard-filtered.vcf
    -o HaTyCa_only_all_snp_mnp.hard-filtered_results.vcf
    -select 'vc.isNotFiltered()'
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    OK, that all looks pretty normal. I think you're encountering the limitations of hard-filtering here. One way to deal with this is to use more stringent filters by trial and error, to find out what filters will remove this type of false positive but still retain the true positives you want. The other way (which is better) is to increase your cohort size (e.g. by using exomes from the 1000 Genomes project) so that you can use variant recalibration. That should provide more discriminating filtering.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member
    edited October 24

    Hi, yeah the second possibility sounds nice. But how to increase the cohort, only put some further exome samples without respect to the project? Because in some projects we have celllines in other we have tumor samples with corresponding normal samples. It is important to include also celllines at the cellline projects? And what about the tumor entity, should I keep the tumor type? What is important to increase the sample amount?

    Post edited by airtime on
  • SheilaSheila Broad InstitutePosts: 653Member, GATK Developer, Broadie, Moderator admin

    @airtime

    Hi,

    If you are working with tumor samples, we recommend using MuTect. Please read about it here: http://www.broadinstitute.org/cancer/cga/mutect

    -Sheila

  • airtimeairtime Posts: 23Member

    @Sheila

    Hi, I know MuTect, but it is to identify somatic mutation. First of all we are screening whole exomes for variations(snp/snv/indel). For this task is the GATK-pipeline(UG/HC) the best or not? Please correct me if I'm wrong.

    best Air

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    @‌airtime

    If you are trying to evaluate what is the germline background in your samples and generate a panel of normals for MuTect to use for filtering, then the GATK pipeline is appropriate. For that, the basic recommendation is to choose samples with similar ethnic background to increase the chance that their germline mutations will be mostly the same.

    We don't have recommendations for dealing with cell lines, sorry. The rest of your questions are outside the scope of support we can provide at the level of experimental design. My suggestion would be to read the literature and find out what others have done successfully in the past.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member

    @Geraldine_VdAuwera

    Thank you very much. I read many articles in the literature and in many variant calling comparison acrticles it is mentioned that GATK has a very good calling rate with low false positive rates and this is the same which we observed in our experiments/data. Also I take a look at the slides for "Best Practices for variant calling with the GATK". I thought it is possible to screen for variations in NGS human DNA Exome data, but the limitation that only normal samples can be used is new for me. Of course there are further algorithms which specialized to identify germline events.

    best Air

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    @airtime, I didn't mean to imply that only normal samples can be used. My point is that we do not have enough experience with tumor/cancer samples to give recommendations about how to put together a cohort for those purposes.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 23Member
    edited October 28

    @Geraldine_VdAuwera

    Ahh, so for cohort extension at tumor/normal samples you cannot give me a recommendation. So is there another possibility to remove this multiple hits? Or should I try to create a background with all normal samples and then to analyse each tumor with MuTect?

    Post edited by airtime on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Yes, I think making a panel of normals then analyzing the tumors with MuTect is a good approach.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.