Why so few PASS variants?

Weijian_leaf (China, Denmark), Member

Hi,
I was trying to run Genome STRiP on 30 high-coverage samples using the Discovery -> Genotyping processing pipeline. It ran successfully, but the final VCF had very few PASS sites: on chr21, for example, only 70 DELs were left. In the 1000 Genomes phase 3 VCF, there are over 300 DELs on chr21.
I am running the SV preprocessing, discovery, and genotyping processes separately on each chromosome. Could this be because I generated the metadata per chromosome rather than for the whole genome? Could you help me?

Answers

  • bhandsaker (Member, Broadie, Moderator, admin)

    This is a good question, but not necessarily easy to answer.

    First, the 1000 Genomes phase 3 VCF is looking at deletions in 2500 people, not 30. Most of those deletions are rare, so you shouldn't necessarily expect to observe as many deletions in 30 people as in 2500. Also, the 1000G VCFs contain small deletions (below 1Kb in length) that are predominantly from Pindel. The deletions larger than 1Kb are predominantly from Genome STRiP.
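    To make the comparison fairer, you could count only the 1000G chr21 deletions in the size range Genome STRiP contributes most of. Here is a minimal sketch in Python; the file path is a placeholder, the 1 Kb cutoff is the dividing line mentioned above, and the SVLEN/END parsing follows the VCF spec.

    ```python
    # Count deletions in a VCF stratified by length, so the comparison with
    # the 1000G phase 3 call set is restricted to the >= 1 kb size range
    # that Genome STRiP contributes most of. The path is a placeholder.
    import gzip

    VCF_PATH = "ALL.chr21.sites.vcf.gz"  # placeholder: your 1000G chr21 VCF
    CUTOFF = 1000                        # 1 kb dividing line

    def open_vcf(path):
        return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

    def parse_info(info_field):
        pairs = (entry.partition("=") for entry in info_field.split(";"))
        return {key: value for key, _, value in pairs}

    short_dels = long_dels = 0
    with open_vcf(VCF_PATH) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            info = parse_info(fields[7])
            if info.get("SVTYPE") != "DEL":
                continue
            if "SVLEN" in info:                  # SVLEN is negative for deletions
                length = abs(int(info["SVLEN"].split(",")[0]))
            elif "END" in info:                  # fall back to END - POS
                length = int(info["END"]) - int(fields[1])
            else:
                continue
            if length >= CUTOFF:
                long_dels += 1
            else:
                short_dels += 1

    print(f"DELs >= {CUTOFF} bp: {long_dels}")
    print(f"DELs <  {CUTOFF} bp: {short_dels}")
    ```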

    Second, this is confounded by deletion length. Shorter deletions are more numerous than longer ones (in general). If you have more read depth (or tighter fragment length distributions), then you can more effectively target shorter deletions, which can increase the number of discovered sites (at a given false discovery rate) dramatically. Ethnicity of the samples also affects how much variation you should expect to see.

    As you've probably seen on other posts, having few PASS sites is not unexpected. Genome STRiP outputs a VCF line for every site it considers, most of which are supported by very few read pairs. The default filtering is very conservative. We usually target a false discovery rate below 5% and we often achieve that out-of-the-box with the default filters. You can relax these filters if you want to be more sensitive, but you should look at the distributions of the metrics being filtered on or have some other way to calibrate.
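    To see which filters are doing the most work before relaxing anything, you can tally the FILTER column of the discovery VCF. A minimal sketch (the path is a placeholder; the filter names are whatever appears in your own VCF):

    ```python
    # Tally the FILTER column of a Genome STRiP discovery VCF to see which
    # filters remove the most candidate sites before deciding what to relax.
    # The path is a placeholder; filter names are whatever your VCF contains.
    import gzip
    from collections import Counter

    VCF_PATH = "discovery.sites.vcf.gz"  # placeholder: your discovery output

    filter_counts = Counter()
    opener = gzip.open(VCF_PATH, "rt") if VCF_PATH.endswith(".gz") else open(VCF_PATH)
    with opener as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            filter_field = line.split("\t", 8)[6]
            # A site can fail several filters at once (semicolon-separated).
            for name in filter_field.split(";"):
                filter_counts[name] += 1

    for name, count in filter_counts.most_common():
        print(f"{name}\t{count}")
    ```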

    A couple of things to think about: It is relatively cheap to take the 1000G phase 3 sites and genotype them in your cohort. By this method, you can see if these sites are, in fact, non-variant in your cohort. Second, you can relax the discovery filters and send more sites through genotyping. Some people have done this and then used the genotyping results as a secondary filter.
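    If you do send more sites through genotyping, one simple secondary filter is a per-site call-rate check on the genotyped VCF. A sketch of the idea (the path and the 0.9 call-rate threshold are illustrative assumptions, not Genome STRiP defaults; calibrate on your own data):

    ```python
    # Secondary filter sketch: after genotyping (either the 1000G phase 3
    # sites or relaxed discovery calls), keep sites where enough samples got
    # a non-missing genotype. The path and the 0.9 threshold are illustrative
    # assumptions, not Genome STRiP defaults.
    import gzip

    VCF_PATH = "genotypes.vcf.gz"  # placeholder: your genotyping output
    MIN_CALL_RATE = 0.9            # assumed threshold

    opener = gzip.open(VCF_PATH, "rt") if VCF_PATH.endswith(".gz") else open(VCF_PATH)
    with opener as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            samples = fields[9:]
            if not samples:
                continue
            called = sum(1 for s in samples if not s.split(":")[0].startswith("."))
            if called / len(samples) >= MIN_CALL_RATE:
                print("\t".join(fields[:5]))  # site survives the secondary filter
    ```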

  • Weijian_leaf (China, Denmark), Member

    Hi bhandsaker,
    Thank you very much; I am grateful for your detailed answer. So you think the problem is not that I ran the preprocessing step per chromosome instead of on the whole genome? I am worried that I may have made a manual mistake somewhere. On the other hand, I agree that the sample size introduces a bias, so a direct comparison is not appropriate. However, given my data (over 50X read depth), with the variant size range set to 100-1000000 and all other Genome STRiP parameters left at their defaults, the small number of PASS DELs (6k in the whole genome, 70 on chr21) is hard to accept. Besides, after Genome STRiP genotyping there are many low-quality genotypes (entries in the VCF look like .:LowQual:-0.48,-0.48,-0.48:0:0,0,0), even at PASS sites, which makes me think something is wrong with my result.
    I am checking whether I made mistakes in the preprocessing step; do you have any advice for me? Thank you again. To quantify the low-quality genotypes, I am counting, per site, the fraction of samples whose genotype is missing or whose per-sample filter (FT) is LowQual, with a small script like the one below.
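    ```python
    # Per-site fraction of samples with a missing genotype or an FT of
    # LowQual, to quantify the .:LowQual:... entries described above.
    # The path is a placeholder; FORMAT parsing follows the VCF spec.
    import gzip

    VCF_PATH = "genotypes.vcf.gz"  # placeholder: the genotyped VCF

    opener = gzip.open(VCF_PATH, "rt") if VCF_PATH.endswith(".gz") else open(VCF_PATH)
    with opener as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            samples = fields[9:]
            if not samples:
                continue
            fmt = fields[8].split(":")
            ft_idx = fmt.index("FT") if "FT" in fmt else None
            low = 0
            for s in samples:
                parts = s.split(":")
                missing = parts[0].startswith(".")
                lowqual = (ft_idx is not None and len(parts) > ft_idx
                           and parts[ft_idx] == "LowQual")
                if missing or lowqual:
                    low += 1
            print(f"{fields[0]}\t{fields[1]}\t{low / len(samples):.2f}")
    ```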
