
Genotype refinement details

Monete Brazil Member
Hi,

I have a simple (I believe) question that I couldn't find an answer to anywhere.

After the VQSR steps, each variant in my VCF is tagged in the FILTER field either with "PASS" or with the name of its tranche ("VQSRTrancheSNP..." or "VQSRTrancheINDEL...").

So, after running VariantRecalibrator and ApplyRecalibration for SNPs, and then again for indels, I'm trying to apply the genotype refinement steps.

Will those next steps ignore the variants in my VCF that are not tagged PASS, or do I need to remove them before I do genotype refinement?

I'm using GATK v3.8.

Thanks for your help. ;)

Best,

Monete

Answers

  • SChaluvadi Member, Broadie, Moderator admin

    @Monete It looks as though the Genotype Refinement steps filter low-quality genotypes based on GQ < 20 after deriving the genotype posterior probabilities, so I don't think the variants tagged PASS or VQSRTrancheSNP/INDEL need to be filtered beforehand.

    Here is the document that describes the steps and a tutorial.
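
To make the GQ-based check concrete, here is a minimal Python sketch of how a per-sample GQ value could be read from the FORMAT and sample columns of a VCF record and compared against the threshold of 20 mentioned above. The function names are hypothetical and this is not GATK's implementation; it only illustrates that such a check looks at the genotype fields and never at the FILTER column. The sample strings are taken from records posted in this thread.

```python
def genotype_quality(format_field, sample_field):
    """Return the GQ of one sample as an int, or None if it is missing."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "GQ" not in keys:
        return None
    idx = keys.index("GQ")
    # Trailing fields may be dropped, and "." means "no value".
    if idx >= len(values) or values[idx] == ".":
        return None
    return int(values[idx])


def is_low_gq(format_field, sample_field, threshold=20):
    """True when the sample has a GQ value below the threshold."""
    gq = genotype_quality(format_field, sample_field)
    return gq is not None and gq < threshold


# Sample columns copied from records in this thread.
print(is_low_gq("GT:AD:DP:GQ:PL", "0/0:1,0:1:3:0,3,29"))
print(is_low_gq("GT:AD:DP:GQ:PGT:PID:PL", "./.:1,0:1:.:.:.:0,0,0"))
```

How a missing GQ should be treated is a choice made here for illustration (it is simply not flagged); the actual tools may handle that case differently.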

  • Monete Brazil Member
    edited May 17
    Thanks for your prompt answer, @SChaluvadi.

    Just to check: I have a VCF with ~6,500,000 variants, and ~5,900,000 of those have "PASS" in their FILTER field.
    So, as you said, the genotype refinement steps will only work on the ~5,900,000 variants that have "PASS" in their FILTER field. Am I right?

    Sorry for the repetitive questions. I just want to confirm this, since I had a lot of doubts about these steps and tutorials.

    Thanks again
  • SChaluvadi Member, Broadie, Moderator admin

    @Monete
    I think the genotype refinement steps will work on all 6,500,000 variants, because the filtering is done on the GQ score and not on the FILTER field. Can you post a snippet of your VCF file that shows records tagged with PASS and with each VQSRTrancheSNP/INDEL tranche?

  • Monete Brazil Member
    Hi @SChaluvadi
    (sorry, my markdown hasn't worked for a long time, and I don't know why or whom to ask for help)

    Considering this description of the VQSR steps (from: https://software.broadinstitute.org/gatk/documentation/article?id=39):

    > Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.

    So, if I have a VCF with ~6,500,000 variants and ~5,900,000 of those have "PASS" in their FILTER field, then I have ~600,000 false positives.

    Then:
    1) Will genotype refinement also process these false-positive variants?
    2) Wouldn't it be better to remove these false positives with SelectVariants before genotype refinement?

    Snippets from my VCF (after VQSR) with 91 samples:

    FILTER = PASS
    ```
    chrM 63 . T A 1807.92 PASS AC=2;AF=0.012;AN=172;DP=1431;ExcessHet=0.0129;FS=0.000;InbreedingCoeff=0.2922;MLEAC=2;MLEAF=0.012;MQ=58.28;NDA=1;QD=27.51;SOR=1.075;VQSLOD=4.63;culprit=FS GT:AD:DP:GQ:PGT:PID:PL 0/0:1,0:1:3:.:.:0,3,28 ...........
    ```

    FILTER = VQSRTrancheINDEL99.00to99.90
    ```
    chrM 7710 . CT C 12.61 VQSRTrancheINDEL99.00to99.90 AC=1;AF=5.747e-03;AN=174;BaseQRankSum=-8.970e-01;ClippingRankSum=0.00;DP=1219;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0447;MLEAC=1;MLEAF=5.747e-03;MQ=47.85;MQRankSum=-1.859e+00;NDA=1;NEGATIVE_TRAIN_SITE;QD=1.05;ReadPosRankSum=-2.160e-01;SOR=0.412;VQSLOD=-1.841e+00;culprit=QD GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,29 .................
    ```

    FILTER = VQSRTrancheINDEL99.90to100.00+
    None


    FILTER = VQSRTrancheINDEL99.90to100.00
    ```
    chrM 5895 . A AC 69.13 VQSRTrancheINDEL99.90to100.00 AC=1;AF=6.849e-03;AN=146;BaseQRankSum=1.58;ClippingRankSum=0.00;DP=768;ExcessHet=3.0103;FS=3.136;InbreedingCoeff=-0.0700;MLEAC=2;MLEAF=0.014;MQ=31.16;MQRankSum=-3.280e-01;NDA=1;NEGATIVE_TRAIN_SITE;QD=9.88;ReadPosRankSum=-1.368e+00;SOR=1.492;VQSLOD=-3.845e+00;culprit=ReadPosRankSum GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,35 ...........
    ```

    FILTER = VQSRTrancheSNP99.00to99.90
    ```
    chrM 150 . T C 43579.35 VQSRTrancheSNP99.00to99.90 AC=82;AF=0.554;AN=148;DP=1719;ExcessHet=-0.0000;FS=0.000;InbreedingCoeff=0.8917;MLEAC=99;MLEAF=0.669;MQ=54.12;NDA=1;QD=32.01;SOR=1.221;VQSLOD=2.19;culprit=InbreedingCoeff GT:AD:DP:GQ:PGT:PID:PL 1/1:0,2:2:6:1|1:146_T_C:90,6,0 ...........
    ```

    FILTER = VQSRTrancheSNP99.90to100.00
    ```
    chrM 195 . C T 30784.46 VQSRTrancheSNP99.90to100.00 AC=54;AF=0.333;AN=162;BaseQRankSum=-1.291e+00;ClippingRankSum=0.00;DP=1463;ExcessHet=-0.0000;FS=0.000;InbreedingCoeff=0.8355;MLEAC=62;MLEAF=0.383;MQ=59.43;MQRankSum=-2.870e-01;NDA=1;NEGATIVE_TRAIN_SITE;QD=23.40;ReadPosRankSum=1.12;SOR=0.285;VQSLOD=-2.134e+00;culprit=InbreedingCoeff GT:AD:DP:GQ:PGT:PID:PL ./.:1,0:1:.:.:.:0,0,0 ...........
    ```

    FILTER = VQSRTrancheSNP99.90to100.00+
    None

    Thanks for your help
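
For anyone wanting to check how many records carry PASS versus a tranche name, here is a small Python sketch that tallies the FILTER column (the 7th column of a VCF data line). The function name `count_filters` is hypothetical, and the records are the ones posted above with the INFO column truncated for brevity. Splitting on whitespace is an assumption that matches how the snippets are displayed here; real VCF files are tab-delimited.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Tally the FILTER column (7th field) over the data lines of a VCF."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        # CHROM POS ID REF ALT QUAL FILTER <rest of line>
        fields = line.split(None, 7)
        if len(fields) < 7:
            continue  # not a complete data line
        counts[fields[6]] += 1
    return counts


# Records from this thread, with INFO truncated to its first key.
records = [
    "chrM 63 . T A 1807.92 PASS AC=2",
    "chrM 7710 . CT C 12.61 VQSRTrancheINDEL99.00to99.90 AC=1",
    "chrM 150 . T C 43579.35 VQSRTrancheSNP99.00to99.90 AC=82",
    "chrM 195 . C T 30784.46 VQSRTrancheSNP99.90to100.00 AC=54",
]
counts = count_filters(records)
for name, n in sorted(counts.items()):
    print(name, n)
```

On a full file, passing an open file handle instead of the list should work the same way, since the function only iterates over lines.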
  • SChaluvadi Member, Broadie, Moderator admin
    edited May 20

    @Monete
    You can filter the output of your VQSR steps with ApplyVQSR, so that the filtering is done before you run your Genotype Refinement steps.

  • Monete Brazil Member
    Hi @SChaluvadi

    Thanks for your reply.

    The snippets from my VCF shown above came from the VQSR steps (VariantRecalibrator and ApplyRecalibration), so I have already done this. That's why each variant in my VCF is tagged with PASS or with the VQSR tranche it belongs to.

    My question is:

    Considering this description of the VQSR steps (from: https://software.broadinstitute.org/gatk/documentation/article?id=39):

    > Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.

    So, if I have a VCF with ~6,500,000 variants and ~5,900,000 of those have "PASS" in their FILTER field, then I have ~600,000 false positives.

    Then:
    1) Will genotype refinement also process these false-positive variants?
    2) Wouldn't it be better to remove these false positives with SelectVariants before genotype refinement?

    Thanks again
  • Monete Brazil Member
    Thanks @SChaluvadi

    I think I'll proceed with SelectVariants to remove these false positives before genotype refinement.

    Thanks for your great explanation.
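
As an illustration of what removing the filtered records amounts to, here is a minimal Python sketch that keeps header lines and only those data lines whose FILTER column is PASS or "." (unfiltered). The function name `keep_unfiltered` is hypothetical, and keeping "." records is an assumption made for this sketch. In GATK 3.x, SelectVariants has an `--excludeFiltered` option intended for this purpose; check the tool documentation for your version rather than relying on this sketch for real data.

```python
def keep_unfiltered(vcf_lines):
    """Yield header lines plus data lines whose FILTER is PASS or '.'."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # always keep the header
            continue
        # CHROM POS ID REF ALT QUAL FILTER <rest of line>
        fields = line.split(None, 7)
        if len(fields) >= 7 and fields[6] in ("PASS", "."):
            yield line


# Records from this thread, with INFO truncated to its first key.
records = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO",
    "chrM 63 . T A 1807.92 PASS AC=2",
    "chrM 150 . T C 43579.35 VQSRTrancheSNP99.00to99.90 AC=82",
]
for line in keep_unfiltered(records):
    print(line)
```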