Testing different capture reaction template

I ran a capture sequencing test:

96 libraries representing 56 individuals.
For each individual we made 1, 2, 3 or 4 libraries.

The libraries were pooled before capture reactions so that the same individual (not the same library) can be found in different capture reactions.

In total we made 6 capture reactions with different conditions (level of multiplexing or dilution).

The sequencing was done on a NextSeq, on one flowcell with 4 lanes.

I want to evaluate the effect of the different capture reaction conditions on my ability to call good-quality variants.

I have FASTQ files (forward and reverse) for the 96 libraries: all reads in a FASTQ file come from the same library and the same run, but from different lanes.

I have read your documentation and I was wondering whether I should split my dataset by capture condition and then follow the Best Practices for each condition, or somehow include the capture condition in the "READ_GROUP_NAME" of the uBAM file?

Thank you

Best Answers


  • Sheila (Broad Institute; Member, Broadie admin)


    Sorry for the delay. If your end goal is strictly to determine which capture condition gives you the best quality data, you can split the dataset based on capture condition.

    But, can you tell us a little more about how exactly you will determine which capture condition is best?
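
For readers with the same setup (one FASTQ per library, reads from several lanes), here is a minimal Python sketch of how per-lane read-group fields could be derived from an Illumina FASTQ header and then grouped by capture condition. It assumes the Casava 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y ...`). The function names, the `flowcell.lane.library` ID scheme and the use of the `DS` (description) field to carry the capture condition are illustrative choices, not something the thread or the GATK documentation prescribes.

```python
def read_group_from_header(header, sample, library, capture_condition):
    """Build read-group fields from one Illumina (Casava 1.8+) FASTQ header line."""
    # Header layout: @instrument:run:flowcell:lane:tile:x:y <read>:<filtered>:<control>:<index>
    name = header.lstrip("@").split()[0]
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    flowcell, lane = fields[2], fields[3]
    return {
        "ID": f"{flowcell}.{lane}.{library}",   # assumption: unique per lane and library
        "PU": f"{flowcell}.{lane}",             # platform unit: flowcell and lane
        "SM": sample,                           # individual
        "LB": library,                          # library
        "PL": "ILLUMINA",
        "DS": f"capture={capture_condition}",   # assumption: condition kept as a description
    }


def split_by_condition(read_groups):
    """Group read-group IDs by the capture condition stored in DS."""
    by_condition = {}
    for rg in read_groups:
        by_condition.setdefault(rg["DS"], []).append(rg["ID"])
    return by_condition
```

Calling `split_by_condition` on the read groups of all libraries gives one list of IDs per capture reaction, which is one way to run the same pipeline separately on each condition.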


  • Hi,

    My goal is to obtain SNPs for many individuals in a non-model species.
    The baits were designed from transcriptome sequences.

    So I am interested in knowing whether, under the different capture conditions (multiplexing libraries in one capture reaction), I can call the same number of good-quality SNPs in many individuals.
    I think the main focus is to have a maximum number of reads on target so that SNP calling is reliable.

    There will then be a trade-off between the number of individuals and the number of SNPs: by increasing the number of individuals per capture reaction, at some point I am going to decrease the coverage. So I want to find the maximum number of individuals I can put in one reaction while still being able to call reliable SNPs.

    If you have any suggestions on parameters to investigate to achieve this goal, please let me know.

    Thank you
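
The trade-off described in this post can be put into rough numbers. The sketch below is a back-of-the-envelope estimate only: it assumes every individual in a capture reaction receives an equal share of reads, ignores duplicates and uneven coverage across targets, and all parameter names and example values are hypothetical.

```python
def mean_target_coverage(total_reads, read_length, on_target_fraction,
                         n_individuals, target_size_bp):
    """Expected mean coverage per individual on the target region."""
    on_target_bases = total_reads * read_length * on_target_fraction
    return on_target_bases / (n_individuals * target_size_bp)


def max_individuals(total_reads, read_length, on_target_fraction,
                    target_size_bp, min_coverage):
    """Largest number of individuals that still reaches min_coverage on average."""
    on_target_bases = total_reads * read_length * on_target_fraction
    return int(on_target_bases // (min_coverage * target_size_bp))


# Hypothetical example: 100 M reads of 150 bp, 50% on target, 5 Mb of targets.
print(mean_target_coverage(100_000_000, 150, 0.5, 16, 5_000_000))  # 93.75
print(max_individuals(100_000_000, 150, 0.5, 5_000_000, 30))       # 50
```

In practice the on-target fraction itself may change with the level of multiplexing, which is exactly what the six capture reactions are meant to measure.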

  • Sheila (Broad Institute; Member, Broadie admin)


    Let me confirm my answer with the team and get back to you.


  • Hi,

    I followed the Best Practices pipeline up to and including indel realignment.
    I used the CollectHsMetrics tool to compare the different capture reactions, but there is something I don't understand:
    I gave the same file for the bait interval and the target interval, but I get different results between the bait and target metrics.

    Is this normal? I think there's something I didn't get, but I can't figure out what.
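
To look at bait and target metrics side by side across capture reactions, the CollectHsMetrics output files can be tabulated. The following is a minimal Python sketch that assumes the usual Picard metrics layout (a `## METRICS CLASS` line, then a tab-separated header row, then the data rows up to the next blank line). The function names and the reaction labels are hypothetical.

```python
def parse_picard_metrics(text):
    """Return the rows of the METRICS CLASS table as a list of dicts (values kept as strings)."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            rows = []
            for row in lines[i + 2:]:
                if not row.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, row.split("\t"))))
            return rows
    raise ValueError("no METRICS CLASS section found")


def compare_reactions(reports, columns):
    """reports maps a reaction label to the text of its metrics file.

    Returns {label: {column: value}} using the first data row of each file.
    """
    table = {}
    for reaction, text in reports.items():
        row = parse_picard_metrics(text)[0]
        table[reaction] = {column: row.get(column) for column in columns}
    return table
```

For example, passing `["MEAN_BAIT_COVERAGE", "MEAN_TARGET_COVERAGE", "PCT_SELECTED_BASES"]` as `columns` puts the bait-level and target-level values of each reaction next to each other.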


  • Hello,

    I want to run BQSR, but I am working on a non-model species. So I ran HaplotypeCaller first, and now I have to filter my SNPs and indels to obtain known sites to feed BQSR, then run BQSR and HaplotypeCaller in a loop until convergence.
    I have raw VCF files from several individuals coming from several populations.

    Should I combine them into a multisample raw VCF, apply hard filtering to that file, and then use the filtered file to feed BQSR?

    Or should I do it in a sample-centered way, or a population-centered way?

    Thanks for your advice.

  • Sheila (Broad Institute; Member, Broadie admin)
    edited August 2017


    Assuming you want to compare all samples together, you can simply combine the raw VCFs into one using CombineVariants and filter that.

    Note, when you perform your actual variant calling step, we recommend using the GVCF workflow, which you can read about here.


  • Hello,

    I am trying to compare VCF files obtained for the same individuals from different capture reactions, to measure the effect of capture reaction parameters on call set quality.

    I am using the GenotypeConcordance tool from Picard and I'd like confirmation of the meaning of several variant "states":

    "MISSING": the variant is not present in the call set

    "NO_CALL": the variant is present in the call set but the genotype couldn't be determined because of a lack of data

    "VC_FILTERED": the variant didn't pass the filtration step

    Is that correct?



  • Or maybe it's better to combine the VCFs obtained from the different capture reactions?

  • Sheila (Broad Institute; Member, Broadie admin)


    Your understanding of the terms in your first post is correct :smiley:

    I don't understand what you mean by combining the VCFs from different capture kits. Are you running GenotypeConcordance on the different capture kits to compare them? If so, that is the way to go.
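
The three states confirmed in this exchange can be written down as a small decision function. This is only an illustrative Python sketch of the definitions given in the thread, not Picard's implementation: the input structure, the order of the checks and the `CALLED` placeholder (standing in for the genotype states such as homozygous or heterozygous calls) are all assumptions.

```python
def call_state(record):
    """Classify one site in a call set.

    record is None when the site is absent from the call set; otherwise a dict with
    'filters' (list of failed site-level filters) and 'genotype' (tuple of allele
    indices, with None for a missing allele such as './.').
    """
    if record is None:
        return "MISSING"        # the variant is not present in the call set
    if record["filters"]:
        return "VC_FILTERED"    # the variant did not pass site-level filtering
    if all(allele is None for allele in record["genotype"]):
        return "NO_CALL"        # the site is present but no genotype could be determined
    return "CALLED"             # placeholder for the actual genotype states


print(call_state(None))                                        # MISSING
print(call_state({"filters": [], "genotype": (None, None)}))   # NO_CALL
print(call_state({"filters": ["QD2"], "genotype": (0, 1)}))    # VC_FILTERED
```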


  • Hi,

    I obtained my VCF files without running BQSR or VQSR.
    I have a question about hard filtering.

    As a first step I used your recommendations for hard filtering, and judging from my first analysis (PCA on allele frequencies) the results seem to be consistent.

    Now I am looking in detail at the distributions of the different metrics used in hard filtering, and their shapes differ from the ones in your documentation.

    I have attached the distributions obtained from my data.

    Do you have any recommendations for filtering my SNPs based on these distributions?

  • Sheila (Broad Institute; Member, Broadie admin)


    Those distributions look fine. They don't have to match the distributions in this doc exactly. As for setting hard filters, you may find the hard filtering tutorials helpful. You can find them in the GATK presentations section.
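
When trying out thresholds against one's own distributions, it can help to see which annotation fails for which record. Here is a minimal Python sketch that checks the INFO field of a VCF record against a set of cutoffs. The default values are the generic starting points from GATK's hard-filtering recommendations for SNPs and are meant to be adjusted to the observed distributions. The function names are hypothetical, and skipping annotations that are absent from a record is an assumption of this sketch.

```python
# Generic starting cutoffs for SNPs; a record fails when the comparison is true.
SNP_THRESHOLDS = {
    "QD": ("<", 2.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
    "MQRankSum": ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
}


def parse_info(info):
    """Turn 'QD=12.3;FS=1.2;DB' into {'QD': '12.3', 'FS': '1.2'} (flag entries are dropped)."""
    values = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            values[key] = value
    return values


def failed_filters(info, thresholds=SNP_THRESHOLDS):
    """Return the names of the annotations that fail their cutoff for one record."""
    values = parse_info(info)
    failed = []
    for name, (op, cutoff) in thresholds.items():
        if name not in values:
            continue  # assumption: a missing annotation does not fail the record
        x = float(values[name])
        if (op == "<" and x < cutoff) or (op == ">" and x > cutoff):
            failed.append(name)
    return failed


print(failed_filters("DP=30;QD=1.5;FS=10.0;MQ=60.0"))  # ['QD']
```

Counting how often each annotation appears in the returned lists shows which cutoff removes the most SNPs, which is one way to judge a threshold against the plotted distributions.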

