Testing different capture reaction template

I ran a capture sequencing test:

96 libraries representing 56 individuals.
For each individual we made 1, 2, 3 or 4 libraries.

The libraries were pooled before capture reactions so that the same individual (not the same library) can be found in different capture reactions.

In total we ran 6 capture reactions under different conditions (level of multiplexing or dilution).

The sequencing was done on a NextSeq, on one flowcell with 4 lanes.

I want to evaluate the effect of the different capture reaction conditions on my ability to call good-quality variants.

I have 96 (f and r) FASTQ files: all reads in a FASTQ file come from the same library and the same run, but from different lanes.

I have read your documentation and I was wondering whether I should split my dataset by capture condition and then follow the Best Practices for each condition, or instead somehow include the capture condition in the "READ_GROUP_NAME" of the uBAM file?

Thank you
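
Since each FASTQ file mixes reads from several lanes, one way to see what read groups the data would need is to read the lane from the Illumina read headers. The sketch below assumes Casava 1.8+ style headers (`@instrument:run:flowcell:lane:tile:x:y ...`). The ID/PU naming scheme and the use of the SAM `DS` (description) field to record the capture condition are illustrative assumptions, not a recommendation from this thread. The function names and the sample, library and condition labels are hypothetical.

```python
from collections import Counter


def _header_fields(header):
    """Split the first whitespace-delimited token of a FASTQ header on ':'."""
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected FASTQ header: {header!r}")
    return fields


def read_group_from_header(header, sample, library, capture_condition):
    """Build read-group fields for the flowcell/lane named in one header.

    Assumption: ID and PU are flowcell.lane.library so that every
    library/lane combination gets its own read group.
    """
    fields = _header_fields(header)
    flowcell, lane = fields[2], fields[3]
    unit = f"{flowcell}.{lane}.{library}"
    return {
        "ID": unit,
        "PU": unit,
        "SM": sample,
        "LB": library,
        "PL": "ILLUMINA",
        # Assumption: keep the capture condition as free text in DS.
        "DS": f"capture={capture_condition}",
    }


def lanes_in_fastq(headers):
    """Count reads per lane from an iterable of FASTQ header lines."""
    return Counter(_header_fields(h)[3] for h in headers)
```

Running `lanes_in_fastq` over the header lines of one file shows how many lanes, and therefore how many read groups, that library contributes.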

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @adejode
    Hi,

    Sorry for the delay. If your end goal is strictly to determine which capture condition gives you the best quality data, you can split the dataset based on capture condition.

    But, can you tell us a little more about how exactly you will determine which capture condition is best?

    Thanks,
    Sheila

  • Hi,

    My goal is to obtain SNPs for many individuals in a non-model species.
    The baits were designed from transcriptome sequences.

    So I am interested in knowing whether, under the different capture conditions (multiplexing libraries in one capture reaction), I can call the same number of good-quality SNPs in many individuals.
    I think the main focus is to maximize the number of on-target reads so that SNP calling is reliable.

    There will then be a trade-off between the number of individuals and the number of SNPs: by increasing the number of individuals per capture reaction, at some point I am going to decrease the coverage. So I want to find the maximum number of individuals I can put in one reaction while still being able to call reliable SNPs.

    If you have any suggestions on parameters to investigate to achieve this goal, please let me know.

    Thank you
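
The trade-off described above can be put into rough numbers. This is a back-of-the-envelope sketch, assuming that reads are split evenly between the individuals in a capture reaction and that the on-target fraction does not change with the level of multiplexing. Both are simplifications, and all names and figures are hypothetical.

```python
def mean_target_coverage(total_reads, read_length, on_target_fraction,
                         target_size_bp, n_individuals):
    """Expected mean on-target coverage per individual in one capture reaction."""
    on_target_bases = total_reads * read_length * on_target_fraction
    return on_target_bases / (target_size_bp * n_individuals)


def max_individuals(total_reads, read_length, on_target_fraction,
                    target_size_bp, min_coverage):
    """Largest number of individuals that still reaches min_coverage each."""
    on_target_bases = total_reads * read_length * on_target_fraction
    bases_needed_per_individual = min_coverage * target_size_bp
    return int(on_target_bases // bases_needed_per_individual)


# Hypothetical figures: 10 M reads of 150 bp, 50% on target, 1 Mb of targets.
coverage = mean_target_coverage(10_000_000, 150, 0.5, 1_000_000, 25)
limit = max_individuals(10_000_000, 150, 0.5, 1_000_000, 30)
```

With measured on-target fractions from each capture reaction in place of the assumed 0.5, the same arithmetic gives a first estimate of how far multiplexing can be pushed before coverage drops below a chosen minimum.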

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @adejode
    Hi,

    Let me confirm my answer with the team and get back to you.

    -Sheila

  • Hi,

    I followed the Best Practices pipeline up to and including indel realignment.
    I used the CollectHsMetrics tool to compare the different capture reactions, but there is something I don't understand:
    I gave the same file for the bait intervals and the target intervals, but I get different results between the bait and target metrics.

    Is that normal? I think there's something I didn't get, but I can't figure out what.

    Thanks
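
To compare bait and target metrics side by side across capture reactions, the Picard metrics files can be loaded into plain dictionaries. This is a minimal sketch, assuming the usual Picard metrics layout (comment lines, a `## METRICS CLASS` line, a tab-separated header row, then data rows ending at a blank line). The example text and its values are made up for illustration, and the parser name is hypothetical.

```python
def parse_picard_metrics(text):
    """Return the rows of the first METRICS CLASS table as a list of dicts."""
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            for data in lines[i + 2:]:
                if not data.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, data.split("\t"))))
            break
    return rows


# Made-up example with only three of the many HsMetrics columns.
EXAMPLE = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# (command line elided)\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tMEAN_BAIT_COVERAGE\tMEAN_TARGET_COVERAGE\n"
    "capture_A\t85.2\t61.7\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
)

rows = parse_picard_metrics(EXAMPLE)
```

Collecting one such row per capture reaction makes it easy to tabulate the bait and target columns next to each other.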

  • Hello,

    I want to run BQSR but I am working on a non-model species. So I ran HaplotypeCaller first, and now I have to filter my SNPs and indels to obtain known sites to feed BQSR, then run BQSR and HaplotypeCaller in a loop until convergence.
    I have raw VCF files from several individuals coming from several populations.

    Should I combine them into a multisample raw VCF, apply hard filtering to that file, and then use the filtered file to feed BQSR?

    Or should I do it in a sample-centered way, or a population-centered way?

    Thanks for your advice
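
The loop described above needs some stopping rule. The sketch below uses one possible proxy, the overlap between the filtered call sets of two successive rounds. This criterion is an assumption made for illustration, not the documented way to judge BQSR convergence, and the 0.99 threshold and function names are hypothetical.

```python
def variant_sites(vcf_text):
    """Return the set of (CHROM, POS, REF, ALT) keys found in VCF text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def has_converged(previous_sites, current_sites, threshold=0.99):
    """True when successive call sets overlap at least `threshold`."""
    return jaccard(previous_sites, current_sites) >= threshold
```

In use, `variant_sites` would be applied to the filtered VCF from each round, and the loop would stop once `has_converged` returns True for two consecutive rounds.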

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited August 2017

    @adejode
    Hi,

    Assuming you want to compare all samples together, you can simply combine the raw VCFs into one using CombineVariants and filter that.

    Note, when you perform your actual variant calling step, we recommend using the GVCF workflow, which you can read about here.

    -Sheila

  • Hello,

    I am trying to compare VCF files obtained for the same individuals from different capture reactions, to measure the effect of capture reaction parameters on call set quality.

    I am using the GenotypeConcordance tool from Picard and I'd like confirmation of the meaning of several variant "states":

    "MISSING": the variant is not present in the call set

    "NO_CALL": the variant is present in the call set but the genotype couldn't be determined because of a lack of data

    "VC_FILTERED": the variant didn't pass the filtration step

    Is that correct?

    Thanks

    Cheers
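
The three states as understood above can be written as a small classifier over one VCF record. This is a simplified sketch, not Picard's implementation: it covers only the three states in question, lumps every other outcome under a placeholder label `CALLED`, and the order of the checks is an assumption. It also assumes GT is the first FORMAT key.

```python
def call_state(vcf_line, sample_column=9):
    """Classify one site for one sample.

    vcf_line is None when the site is absent from the call set,
    otherwise a tab-separated VCF data line.
    """
    if vcf_line is None:
        return "MISSING"  # the site is not in the call set at all
    fields = vcf_line.rstrip("\n").split("\t")
    if fields[6] not in ("PASS", "."):
        return "VC_FILTERED"  # the site failed a site-level filter
    genotype = fields[sample_column].split(":")[0]
    alleles = genotype.replace("|", "/").split("/")
    if all(allele == "." for allele in alleles):
        return "NO_CALL"  # the site is present but the genotype is ./.
    return "CALLED"  # placeholder for every other state
```

For example, a record with FILTER `PASS` and genotype `./.` is classified as `NO_CALL`, while the same site absent from the VCF is `MISSING`.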

  • Or maybe it's better to combine the VCFs obtained from the different capture reactions?

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @adejode
    Hi,

    Your understanding of the terms in your first post is correct :smiley:

    I don't understand what you mean by combining the VCFs from different capture kits. Are you running GenotypeConcordance on the different capture kits to compare them? If so, that is the way to go.

    -Sheila

  • Hi,

    I obtained my VCF files without running BQSR or VQSR.
    I have a question about hard filtering.

    As a first step I used your recommendations for hard filtering, and judging by my first analysis (PCA on allele frequencies) the results seem consistent.

    Now I am looking in detail at the distributions of the different metrics used in hard filtering, and their shapes differ from those in your documentation.

    I have attached the distributions obtained from my data.

    Do you have any recommendations for filtering my SNPs based on these distributions?

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @adejode
    Hi,

    Those distributions look fine. They don't have to match the distributions in this doc exactly. As for setting hard filters, you may find the hard filtering tutorials helpful. You can find them in the GATK presentations section.

    -Sheila
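
As an illustration of how hard filters act on the annotations discussed above, here is a minimal sketch that checks the INFO field of a SNP record against a set of thresholds. The values shown are the generic SNP starting points commonly quoted from the GATK hard-filtering documentation. They should be checked against the current documentation and adjusted to the distributions in your own data. The function names are hypothetical.

```python
# Each entry maps an annotation to a test that is True when the site FAILS.
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def parse_info(info):
    """Turn a VCF INFO string into a dict of key -> raw string value."""
    values = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            values[key] = value
    return values


def failed_filters(info, filters=SNP_FILTERS):
    """Return the names of the annotations that fail their threshold."""
    values = parse_info(info)
    failed = []
    for name, fails in filters.items():
        if name not in values:
            continue  # a missing annotation is not evaluated
        try:
            if fails(float(values[name])):
                failed.append(name)
        except ValueError:
            continue  # non-numeric value, skip it
    return failed
```

Passing a different `filters` dictionary makes it easy to try thresholds read off your own distributions and count how many sites each one removes.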
