How to evaluate the accuracy of a variant calling pipeline

blueskypy Member ✭✭
edited November 2013 in Ask the GATK team

I'd like to evaluate the accuracy of my variant calling pipeline, so I assembled a cohort of 46 Caucasian whole-exome seq samples from 1KG. Compared against the official 1KG calls, the results for one sample are the following.

region  Non-Reference Sensitivity  Non-Reference Discrepancy  Overall Genotype Concordance
chr1    0.714                      0.034                      0.988
chr22   0.731                      0.014                      0.986

My questions are the following:

  1. What do those numbers say about my pipeline? Are they good or bad?

  2. Are there better ways to evaluate a variant calling pipeline? In my test, there are a few sources of error: (a) the official calls come from whole-genome seq, whereas mine come from whole-exome seq, so even though I restricted the comparison to the exome capture regions, the coverage depths may differ; (b) the official calls may have been produced by a different calling pipeline; (c) the cohort compositions are different.
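To make the three metrics in the table above concrete, here is a small sketch of how they can be computed from paired truth/call genotypes. The formulas follow my reading of the GATK genotype-concordance definitions (non-reference sensitivity over truth non-ref sites, non-reference discrepancy over sites where either set is non-ref, overall concordance over all shared sites); the exact tool definitions may differ in edge cases, so treat this as illustrative rather than a reimplementation.

```python
# Sketch of genotype-concordance metrics, assuming genotypes are encoded as
# 0 (hom-ref), 1 (het), 2 (hom-var) at sites shared by both callsets.
# Illustrative only; GATK's tool handles no-calls, filters, and alleles too.

def concordance_metrics(truth, calls):
    """truth, calls: equal-length lists of genotype codes at shared sites."""
    assert len(truth) == len(calls)

    # Non-reference sensitivity: of sites where the truth set is non-ref,
    # what fraction did the pipeline also call non-ref?
    truth_nonref = [(t, c) for t, c in zip(truth, calls) if t > 0]
    nrs = sum(c > 0 for _, c in truth_nonref) / len(truth_nonref)

    # Non-reference discrepancy: among sites where either set is non-ref,
    # what fraction of genotypes disagree?
    either_nonref = [(t, c) for t, c in zip(truth, calls) if t > 0 or c > 0]
    nrd = sum(t != c for t, c in either_nonref) / len(either_nonref)

    # Overall genotype concordance: fraction of all sites with equal genotypes.
    ogc = sum(t == c for t, c in zip(truth, calls)) / len(truth)
    return nrs, nrd, ogc

# Toy example: 10 shared sites, one het missed by the pipeline.
truth = [0, 0, 1, 1, 2, 0, 1, 2, 0, 0]
calls = [0, 0, 1, 0, 2, 0, 1, 2, 0, 0]
print(concordance_metrics(truth, calls))  # (0.8, 0.2, 0.9)
```

With definitions like these, note that the denominators differ per metric, which is why a pipeline can show high overall concordance (dominated by hom-ref sites) while non-reference sensitivity stays comparatively low.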

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi there,

    This is a difficult question, and one we are currently working to address in the documentation -- but the new materials won't be ready for another few weeks at least, sorry.

    As a general principle, though, I would say that you need to develop a comparison with fewer differences. As you rightly note, working with different data types (exome vs. WGS) and different cohorts necessarily introduces differences that will confound your results.

  • blueskypy Member ✭✭

    If I run the sample individually, the result is much better than using a cohort (see below). Why is that? Actually, in my 46-sample Caucasian cohort there is one Japanese sample, but would one Japanese sample mess up the calling so much that it becomes worse than not using a cohort at all?

    region  Non-Reference Sensitivity  Non-Reference Discrepancy  Overall Genotype Concordance
    chr1    0.927                      0.005                      0.995
    
  • blueskypy Member ✭✭
    edited November 2013

    By the way, the mean sequencing depth over the capture regions on chr1 is around 178× for those samples.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Unfortunately I don't have the time to go into the details of your results in different cases, but I can say that I wouldn't expect your one Japanese sample to be messing up anything.

  • blueskypy Member ✭✭

    Thanks, Geraldine!

  • blueskypy Member ✭✭

    Besides 1KG, are there other datasets that can be used to test the performance of an NGS variant calling pipeline?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    We use several additional resources such as HapMap and Omni, and internally we have developed a knowledge base of highly curated sites which we use for evaluating proposed improvements in our tools and methods. I believe we recently added the knowledge base file to our bundle.

  • blueskypy Member ✭✭

    Hi Geraldine,
    Thanks for the info! May I know the names of those knowledge base files in the bundle?
    Also, I found this site useful for testing the pipeline: http://www.bioplanet.com/gcat

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Ah yes, we've heard of GCAT, though we haven't tried it ourselves.

    For now the knowledge base (or truth set) files are in a directory called "NA12878KB" on our FTP site (at the same level as the bundle directory, not inside it). When we release the next GATK update the new bundle will include a snapshot of the knowledge base.

  • blueskypy Member ✭✭
    edited December 2013

    I tested my pipeline on GCAT and the performance is very good. Now I wonder if my experimental design is bad. Will pooling FASTQ files from different sequencing centers in 1KG cause problems for BQSR and VQSR, and result in calling errors for some samples?

    At least for BQSR, is there a hidden assumption that the FASTQ files should come from the same sequencer, and that otherwise BQSR may not be very helpful?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @blueskypy,

    Having sequence data from different origins is not really a problem in itself. For BQSR, recalibration is done per read group, so if your read groups are correctly identified, the data from different sequencing centers will be recalibrated separately.
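The per-read-group behavior described above can be illustrated with a toy sketch: error observations are pooled per read group, so each sequencing center or run gets its own empirical quality estimate rather than being averaged with the others. The mismatch counts and read-group names below are made up for illustration; this is not GATK's actual covariate model, which also conditions on machine cycle and sequence context.

```python
import math
from collections import defaultdict

# Toy per-read-group recalibration (hypothetical data, not GATK's model):
# mismatch counts are pooled per read group, so each group gets its own
# empirical quality even if all bases report the same machine quality.

def empirical_qual(errors, total):
    """Phred-scaled empirical quality from observed error counts."""
    rate = (errors + 1) / (total + 2)  # pseudocounts avoid log(0)
    return -10 * math.log10(rate)

# (read_group, reported_qual, bases_observed, mismatches) -- made-up numbers
observations = [
    ("centerA.lane1", 30, 100000, 120),
    ("centerA.lane1", 30, 50000, 70),
    ("centerB.lane3", 30, 100000, 400),
]

by_rg = defaultdict(lambda: [0, 0])  # read group -> [mismatches, bases]
for rg, _q, bases, mism in observations:
    by_rg[rg][0] += mism
    by_rg[rg][1] += bases

for rg, (mism, bases) in sorted(by_rg.items()):
    print(rg, round(empirical_qual(mism, bases), 1))
```

In this toy data, centerB's reads have a higher error rate than centerA's despite identical reported qualities, so its recalibrated quality comes out lower; that separation is exactly why correct read-group tagging matters when mixing data sources.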

  • blueskypy Member ✭✭
    edited December 2013

    My colleagues would like me to show the following:

    1. Using a cohort provides higher calling accuracy than running samples alone. But my test shows the opposite, and now I don't know how to explain it.
    2. For minority samples, e.g. a few Asian or Black samples among a large number of Caucasian samples, is it better to include those minority samples in the Caucasian cohort or to run those few samples alone? The coverage in our samples is around 70×. Because of the problem in point 1, I cannot test point 2 either.

    Geraldine, could you give any advice? Thanks a lot!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I'll ask @ebanks to opine about the cohort breakdown / minority samples issue, but regarding your first question, I should point out that multisample calling is meant to increase your calling power, which is not the same thing as accuracy. That is to say, using a cohort should increase your ability to detect tricky variants. But then, to achieve high accuracy, you still need to apply a judicious filtering strategy, preferably using VQSR, which is not necessarily straightforward.

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    To add on to Geraldine's previous point: if you have 70X coverage, you don't need to mix your samples together for accurate allele discovery. You certainly have enough depth for that. Pooling helps only for the statistical filtering (VQSR), and I'm wondering if maybe you need to optimize the filtering for your particular data set.
    Given that, the answer to your second question is pretty much the same: including them with the Caucasians should be able to give better power to the VQSR.
    Hope this helps!

  • blueskypy Member ✭✭

    Hi Geraldine and ebanks,
    Thanks for the help! You both mentioned that "you still need to apply a judicious filtering strategy" and "maybe you need to optimize the filtering". Could you explain what you mean regarding the filtering? In my case, the output isn't much different whether I use the --ignoreFilters flag or not.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @blueskypy,

    The recommendations we provide regarding filtering in the Best Practices are only a starting point; depending on your data, you may need to tweak the filtering recommendations to get best results. We currently don't provide recommendations on how to do this tweaking.
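As an illustration of what "tweaking the filtering" can look like in practice, here is a sketch of hard filtering: each site carries annotation values, and it passes only if every threshold is met. The annotation names (QD, FS, MQ) follow GATK conventions, but the threshold values below are placeholder assumptions to be tuned against a truth set, not recommendations from this thread.

```python
# Illustrative hard-filter sketch. Annotation names follow GATK conventions
# (QD = QualByDepth, FS = FisherStrand, MQ = RMSMappingQuality); the
# thresholds are placeholders to tune, not official recommendations.

FILTERS = {
    "QD": lambda v: v >= 2.0,    # fail low confidence-per-depth sites
    "FS": lambda v: v <= 60.0,   # fail strongly strand-biased sites
    "MQ": lambda v: v >= 40.0,   # fail poorly mapped sites
}

def apply_filters(site):
    """Return 'PASS' or a semicolon-joined list of failed filter names."""
    failed = [name for name, ok in FILTERS.items()
              if name in site and not ok(site[name])]
    return "PASS" if not failed else ";".join(sorted(failed))

sites = [
    {"QD": 12.3, "FS": 1.2, "MQ": 59.8},   # clean site
    {"QD": 1.1, "FS": 80.0, "MQ": 59.0},   # fails QD and FS
]
print([apply_filters(s) for s in sites])  # ['PASS', 'FS;QD']
```

Tuning in this framing means adjusting the thresholds (or switching to VQSR's learned model) and re-measuring sensitivity and discrepancy against the truth set after each change.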
