High proportion of spanning deletion in a whole-genome callset

GBrGBr SwedenMember

Hi GATK team,
I am working on a callset of ~150 high coverage human genomes. My processing pipeline follows the GATK Best Practices except that 1-I use IndelRealigner because we started processing the data before you stopped recommending it and 2-at the BQSR step I run additional steps involving variant calling on each sample (to avoid being too dependent on reference datasets). I am using GATK3.5 and GATK3.7 (3.7 for HaplotypeCaller and the steps afterwards).
I have mostly samples from Africa but I also have ~20 samples from outside Africa.
In my callset before VQSR I observe a very high proportion of variants with a * as the alternate allele (spanning deletion). It represents more than 25% of my biallelic variants. I looked at some of the variants in the VCF file and they seem to make sense as long as I can tell (close to indels etc). However I am worried about such a high proportion and have difficulties finding information about these variants and what to do with them.
Do you think that such a high proportion of spanning deletion is possible or do you think it is caused by something else in my pipeline? In that case, which checks could I run? Do you know where I could find data that I could compare mine with?
Thanks in advance for your insights,
Best,
Gwenna

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi Gwenna (@GBr),

    Can you explain in more detail what do you mean by the following?

    2-at the BQSR step I run additional steps involving variant calling on each sample (to avoid being too dependent on reference datasets).

    Can you tell me:

    [1] What reference are you using and are you performing alt-aware alignment?
    [2] What do you know about the quality of your data? Have you collected quality control metrics? E.g. user apfuentes here shows some confidence instilling quality metrics for their data. Any reason to believe samples may be contaminated or degraded?
    [3] When you compare your calls against known population datasets, e.g. 1000 Genomes Project, ExAc or gnomAD, and/or truthsets, what percentage of the variant sites and variant alleles are concordant?

    We have a variety of tools that collect metrics to assess the quality of a callset, e.g. Ti/Tv ratio. See these documents by @KateN:

    Does the result of running one of these show some unusual metrics?

  • GBrGBr SwedenMember

    Hi @shlee ,

    [0] for the BQSR step we do the following (for each sample):
    a-run HaplotypeCaller on the non-BQSRed BAMs, output a VCF
    b-BQSR the BAMs with the recommended reference datasets; run HaplotypeCaller, output a VCF
    c-BQSR the BAMs with the recommended datasets + the two VCF obtained at steps a- and b-.
    We use only the output of c- for the rest of the pipeline.

    [1] I am using hg38 and I am using bwakit for mapping (bwa mem and bwa-postalt.js)

    [2] I am quite confident in the quality of the data - most samples are above 30X. Also we processed publicly available samples (1000G high-coverage, Simon's project samples) with the same pipeline and there is no evidence that the in-house samples have more of the spanning deletions than the publicly available. I considered doing hard filtering (before choosing VQSR) and the "classical" annotations (DP, MQ, FS, QD, MQRankSum, ReadPosRankSum) follow the expected distributions.

    [3] I haven't compared my callsets to e.g. 1000Genomes Project. I will do that. However I did some concordance analysis on samples for which we have SNP array data and the concordance was good (~99%).
    I ran picard's CollectVariantCallingMetrics on the output of JointGenotyping and also on my VQSRed VCF. If I keep the variants having a * as the only alternate allele the proportion of variants in dbsnp150 is
    ~0.6 (before and after VQSR). If I remove them (after VQSR) it is closer to 0.8. The one other statistic that stands out is the TiTv ratio which is particularly low in the novel variants (~0.3 before any filtering, ~1.2 after VQSR + removing the * alleles).

    Does that make you think of what could explain the high proportion of * alleles? What kind of proportions do you observe in your own processing?
    Thanks in advance, and I'll write back once I have results from comparing with 1000 Genomes!
    Best,
    Gwenna

    Issue · Github
    by shlee

    Issue Number
    2665
    State
    closed
    Last Updated
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @GBr
    Hi Gwenna,

    I have mostly samples from Africa but I also have ~20 samples from outside Africa.

    Is there a difference between the African samples and the non-African samples? You can also try comparing to only the African samples in the 1000Genomes project. It may be worth running your pipeline on some African BAMs and non-African BAMs that are publicly available and seeing if the * allele appears at approximately the same rate as in your samples.

    -Sheila

  • GBrGBr SwedenMember

    @shlee,
    Hi Sheila,
    Yes there is more spanning deletions in the African samples than in the non-African. About half of my dataset consists of publicly available genomes (that I ran through my pipeline) and the * alleles are abundant in these samples as well (slightly less, but I suppose it could also have to do with sample size).
    The more things I look at the more I have the feeling that these variants are genuine. However, I do not really know how to best deal with them. They are not comparable to A/C/T/G variants in term of e.g. mutation processes. Thus, we think that they should be treated separately, but we are unsure of 1-how to do that and 2-at which stage to do that. For instance, do you think that these variants should be included in the VQSR step? Also, would it work to rename them to something else than "SNP"?
    Thanks in advance for your insight,
    /Gwenna

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @GBr
    Hi Gwenna,

    Does your VCF have all sites in them (including non-variant sites)? Can you post a few example records with the * allele?

    If you output a variant sites only VCF, the sites that have * alleles in your VCF will also have SNPs for the other samples. The * allele does not show up unless other sample(s) have a variant at the site. There should be no need to remove the * allele sites from your VCF because the annotations are calculated separately for the different variant alleles.

    Thanks,
    Sheila

  • GBrGBr SwedenMember

    Hi @Sheila,
    I outputted a all sites file with GenotypeGVCF. But before running VQSR I extracted the SNP using VariantEval --selectType SNP and the file still contains all records which have only the * as alternate allele. Is that what you meant? Or should I run GenotypeGVCF and emit only variant sites to get ride of them?

    Here are a few example records with the * allele (including one which also has another alternate allele). I included only the first sample because else it is not readable at all, but I can include the others if needed.

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  IND1
    chr1    10147   .   C   *   12576.99    .   AC=60;AF=0.214;AN=280;AS_InbreedingCoeff=-0.2136;AS_QD=7.32;DP=8151;ExcessHet=29.4446;FS=1.959;InbreedingCoeff=-0.2136;MLEAC=84;MLEAF=0.300;QD=7.53;SOR=1.096   GT:AD:DP:GQ:PGT:PID:PL  ./.:9,0:9:.:.:.:0,0,0
    chr1    10177   rs201752861 A   C   272.38  .   AC=1;AF=8.333e-03;AN=120;AS_BaseQRankSum=-0.799;AS_FS=0.000;AS_InbreedingCoeff=-0.2813;AS_MQ=37.59;AS_MQRankSum=-0.702;AS_QD=13.32;AS_ReadPosRankSum=1.233;AS_SOR=0.446;BaseQRankSum=-7.990e-01;DB;DP=5840;ExcessHet=22.0577;FS=0.000;InbreedingCoeff=-0.2813;MLEAC=2;MLEAF=0.017;MQ=2.35;MQRankSum=-7.020e-01;QD=20.95;ReadPosRankSum=1.23;SOR=0.446   GT:AD:DP:GQ:PGT:PID:PL  0/0:6,0:6:0:.:.:0,0,14
    chr1    10245   .   C   *   25.12   .   AC=1;AF=5.814e-03;AN=172;AS_InbreedingCoeff=-0.0433;AS_QD=2.27;DP=7222;ExcessHet=3.7155;FS=3.522;InbreedingCoeff=-0.0433;MLEAC=1;MLEAF=5.814e-03;QD=2.79;SOR=2.266  GT:AD:DP:GQ:PGT:PID:PL  0/0:19,0:19:33:.:.:0,33,495
    chr1    10246   .   C   *   23.84   .   AC=1;AF=5.435e-03;AN=184;AS_InbreedingCoeff=-0.0371;AS_QD=2.22;DP=7401;ExcessHet=3.3634;FS=3.522;InbreedingCoeff=-0.0371;MLEAC=1;MLEAF=5.435e-03;QD=2.65;SOR=2.266  GT:AD:DP:GQ:PGT:PID:PL  0/0:20,0:20:25:.:.:0,25,593
    chr1    10353   .   A   *   191.45  .   AC=2;AF=0.200;AN=10;AS_QD=8.38;DP=10145;ExcessHet=3.9794;FS=3.349;MLEAC=2;MLEAF=0.200;QD=9.57;SOR=2.200 GT:AD:DP:GQ:PGT:PID:PL  ./.:16,0:16:.:.:.:0,0,0
    chr1    10354   rs1015856060    C   A,* 425.65  .   AC=3,2;AF=0.115,0.077;AN=26;AS_BaseQRankSum=-1.324,NaN;AS_FS=4.633,0.000;AS_InbreedingCoeff=-0.2044,-0.1504;AS_MQ=37.01,0.00;AS_MQRankSum=-0.527,NaN;AS_QD=4.08,2.55;AS_ReadPosRankSum=0.517,NaN;AS_SOR=0.254,0.206;BaseQRankSum=-1.136e+00;DB;DP=10020;ExcessHet=6.8797;FS=1.017;InbreedingCoeff=-0.2306;MLEAC=3,2;MLEAF=0.115,0.077;MQ=3.18;MQRankSum=-1.420e-01;QD=6.35;ReadPosRankSum=-9.400e-02;SOR=0.883  GT:AD:DP:GQ:PGT:PID:PL  ./.:16,0,0:16:.:.:.:0,0,0,0,0,0
    chr1    10355   .   C   *   141.62  .   AC=2;AF=0.010;AN=200;AS_InbreedingCoeff=-0.0281;AS_QD=6.88;DP=10886;ExcessHet=3.1403;FS=3.349;InbreedingCoeff=-0.0281;MLEAC=2;MLEAF=0.010;QD=7.08;SOR=2.200 GT:AD:DP:GQ:PGT:PID:PL  0/0:17,0:17:15:.:.:0,15,225
    chr1    10356   .   C   *   143.72  .   AC=2;AF=9.009e-03;AN=222;AS_InbreedingCoeff=-0.0556;AS_QD=6.87;DP=10935;ExcessHet=5.8225;FS=3.349;InbreedingCoeff=-0.0556;MLEAC=2;MLEAF=9.009e-03;QD=7.19;SOR=2.200 GT:AD:DP:GQ:PGT:PID:PL  0/0:17,0:17:15:.:.:0,15,225
    

    Thanks,
    Gwenna

  • GBrGBr SwedenMember

    I forgot to add - the records are examples of what I get after running SelectVariants -selectType SNP on the all sites VCF that I obtain with GenotypeGVCFs.
    Also in my answer above I mentioned "VariantEval" - it is a mistake, I meant SelectVariants. Sorry for that!
    Gwenna

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @GBr
    Hi Gwenna,

    I see. Thanks for clarifying. You can either run GenotypeGVCFs without -allSites or use SelectVariants as explained here to remove the sites with only * allele.

    -Sheila

  • GBrGBr SwedenMember

    Hi @Sheila,
    even when running SelectVariants with --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude SYMBOLIC -env I still have the records like "REF/*". So I'll run GenotypeGVCFs twice, once to get the all-sites file which we need, and once to get the file with the variants only.
    Thanks for the help!
    /Gwenna

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @GBr
    Hi Gwenna,

    Hmm. That is odd. I thought that solved the other user's issue in the other thread. In any case, l GenotypeGVCFs without -allSites should solve your issue.

    -Sheila

Sign In or Register to comment.