Different reference allele in two files processed with the same pipeline and same reference files

vifehe (Spain, Member)

Dear GATK team,

I have a set of samples for which I have both WES and WGS data. I ran the same pipeline (mostly following GATK's Best Practices) on both datasets, with the same reference sets and target sites, down to VQSR, so I now have two VCFs ready to analyze. My idea is to merge these datasets, and I believe my situation parallels the third case described in this document: gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one.

Therefore, I used -T CombineVariants to merge the two files into one. Up to this point everything runs fine and I obtain a working final VCF. However, when I look deeper into it, some variants have been kept as duplicates with different reference alleles. When I look back at the separate VCF files, these differences are already there. My question is: how can two sets of files that have been processed in exactly the same way present different reference alleles at the same site? I paste a couple of examples for clarification:

chr10 100167436 . T C 177.06 PASS AC=1;AF=0.011;AN=94; set=WGS
chr10 100167436 . TGTCACCAGGGGTCACCAGGGATGAGGACC CGTCACCAGGGGTCACCAGGGATGAGGACC,T 56690.81 PASS AC=50,2;AF=0.024,9.533e-04;AN=2098; set=WES

chr10 101462504 . CT CTT,C 1539.20 PASS AC=34,48;AF=0.014,0.020;AN=2348; set=WES
chr10 101462504 . C T,CT . PASS AC=0,0;AF=0.00,0.00;AN=114; set=WGS

chr10 102295645 . G GT 627.02 PASS AC=7;AF=0.061;AN=114; set=WGS
chr10 102295645 . GT GTT,G,TT 63780.89 PASS AC=203,403,17;AF=0.088,0.174,7.328e-03;AN=2320; set=WES

Any thoughts on this behaviour, and on possible solutions?

Thanks in advance

Victoria

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    WEx and WGS technologies have different strengths and weaknesses, so it is not unexpected to see some differences in the resulting calls. My advice would be to quantify those differences (how many sites differ substantially) and try to identify what the sites that differ have in common (e.g., very different coverage profiles). This should help you determine whether the differences are indicative of major errors or whether they are just marginal.
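
Before counting sites that differ, it may help to separate real disagreements from differences in representation. In VCF, the REF allele of a multiallelic record has to span the longest allele at that site, so the same SNP can appear with a short REF in one file and a long REF in another. The sketch below is only an illustration under that assumption: `trim_alleles` is a hypothetical helper, not a GATK or BCFtools function, and it only trims shared bases; it does not left-align indels against the reference sequence. It is applied to the first example pair from the question.

```python
def trim_alleles(pos, ref, alt):
    """Return the minimal (pos, ref, alt) for one REF/ALT pair.

    Hypothetical helper for illustration only: it removes bases shared
    by both alleles but does not left-align indels.
    """
    # Drop shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases, advancing the position each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# First ALT allele of the WES record at chr10:100167436 (copied from the question).
wes_ref = "TGTCACCAGGGGTCACCAGGGATGAGGACC"
wes_alt = "CGTCACCAGGGGTCACCAGGGATGAGGACC"
wes_snv = trim_alleles(100167436, wes_ref, wes_alt)

# The WGS record at the same position is already minimal.
wgs_snv = trim_alleles(100167436, "T", "C")

# True when both records describe the same T>C substitution.
print(wes_snv == wgs_snv)
```

Under this view, the long WES REF is needed only because the second ALT allele at that site is a deletion; the first ALT allele is the same T>C change reported in the WGS file.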

  • vifehe (Spain, Member)

    Thanks @Geraldine_VdAuwera and @KlausNZ, your comments are very much appreciated!

  • vifehe (Spain, Member)
    edited August 2015

    @KlausNZ @Geraldine_VdAuwera
    I wanted to give an update on how I finally resolved this. I ended up splitting the multiallelic sites following this post: apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html, and then merging those records that, after splitting, referred to the same positions and alleles, regardless of whether they came from the WES or the WGS set.

    The solution in that link worked better for my problem, since BCFtools kept the genotype information whereas VariantsToTable lost it.

    Hope this helps future users with similar problems.
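
The split-then-merge idea described above can be sketched in a few lines. This is a toy illustration, not what BCFtools does internally: the helper names `trim` and `split_record` are hypothetical, genotypes and INFO fields are ignored, and indels are not left-aligned against the reference. It uses the second and third example pairs from the question.

```python
from collections import defaultdict


def trim(pos, ref, alt):
    """Remove bases shared by REF and ALT (hypothetical helper, no left-alignment)."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


def split_record(chrom, pos, ref, alts, source):
    """Yield one ((chrom, pos, ref, alt), source) pair per ALT allele."""
    for alt in alts.split(","):
        yield (chrom, *trim(pos, ref, alt)), source


# CHROM, POS, REF, ALT and the "set" tag, copied from the question.
records = [
    ("chr10", 101462504, "CT", "CTT,C", "WES"),
    ("chr10", 101462504, "C", "T,CT", "WGS"),
    ("chr10", 102295645, "G", "GT", "WGS"),
    ("chr10", 102295645, "GT", "GTT,G,TT", "WES"),
]

# Group the split, trimmed alleles by their key and remember which set saw them.
merged = defaultdict(set)
for rec in records:
    for key, source in split_record(*rec):
        merged[key].add(source)

for key, sources in sorted(merged.items()):
    print(*key, "+".join(sorted(sources)))
```

In this sketch the insertions C>CT and G>GT end up under a single key shared by both sets, while the remaining alleles stay specific to one set, which is the kind of grouping the post describes.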

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @vifehe
    Hi,

    Thank you for reporting your solution.
    I hope it will help other users too.

    -Sheila
