Which annotations can be used for VQSR if only dbSNP is available?

nb82 (United States; Member)
edited February 2015 in Ask the GATK team

Hi,
I am working on variant calling for a genotype of rice and I just generated a GVCF file. Here is a snippet of the GVCF file.

Chr1 173923 . C T,<NON_REF> 40.74 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:9:73,9,0,73,9,73:0,0,2,1
Chr1 247702 . G A,<NON_REF> 16.63 . DP=2;MLEAC=2,0;MLEAF=1.00,0.00;MQ=29.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:48,6,0,48,6,48:0,0,0,0
Chr1 247703 . G <NON_REF> . . END=247714 GT:DP:GQ:MIN_DP:PL 0/0:2:5:2:0,6,43
.....
Chr1 248043 . A C,<NON_REF> 0 . DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQ=27.07;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,45,3,45,45:0,0,0,0

For the VQSR step, I only have the dbSNP file for rice (there are no hapmap or 1000G files) which has following annotation:
CHROM POS ID REF ALT QUAL FILTER INFO
Chr1 84 rs349504122 T C . . RSPOS=84;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
Chr1 2223 rs349642167 C T . . RSPOS=2223;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
Chr1 2228 rs350936653 A C . . RSPOS=2228;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
Chr1 2417 rs347865191 G A . . RSPOS=2417;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
Chr1 2546 rs349956414 A C . . RSPOS=2546;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
Chr1 2588 rs352913853 A C . . RSPOS=2588;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100
Chr1 2634 rs348515762 C A . . RSPOS=2634;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100

Now, if I use -an QD for VQSR, it won't work because QD is not present in the input VCF file. So, given the two files above, which annotations can I use for the VQSR step? Can I use the VQSR step at all?

And this is the setting in which I am using the dbSNP: -resource:dbSNP,known=true,training=true,truth=true,prior=6.0
Are the settings of known/training/truth correct if dbSNP is the only resource?

Thanks in advance, you have been very helpful
NB
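One quick way to see which annotations a file actually carries is to collect the keys of its INFO fields and compare them with the annotations you plan to pass via -an. The following is a minimal sketch, not a GATK feature: the INFO strings are copied from the GVCF snippet above, while the helper name `info_keys` and the `requested` list are illustrative assumptions.

```python
def info_keys(info_field):
    """Return the set of annotation keys found in one VCF INFO field."""
    if info_field in ("", "."):
        return set()
    # Each entry is either KEY=VALUE or a bare flag; keep only the key.
    return {entry.split("=", 1)[0] for entry in info_field.split(";")}


# INFO fields copied from the GVCF records quoted in the question.
gvcf_info = [
    "DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0",
    "DP=2;MLEAC=2,0;MLEAF=1.00,0.00;MQ=29.00;MQ0=0",
    "END=247714",
    "DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQ=27.07;MQ0=0",
]

# Union of all keys seen across the records.
present = set().union(*(info_keys(field) for field in gvcf_info))

# Hypothetical list of annotations one might request with -an.
requested = ["QD", "MQ", "DP"]
missing = [name for name in requested if name not in present]

print("present:", sorted(present))
print("missing:", missing)
```

For these four records the check reports QD as missing, which matches the failure described in the question.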

Answers

  • nb82 (United States; Member)

    Aahh, Thanks Geraldine!
    So, if I understand correctly: I take the multiple GVCF files produced by HaplotypeCaller (HC) and run them together through GenotypeGVCFs, which gives me a single merged regular VCF file that I then use for all downstream steps such as VQSR and functional annotation. Is that correct? Please confirm.

    Also, is there a limit to the number of files that GenotypeGVCFs can take at a time? I have more than 3,200 samples. Is it going to be a very time-consuming step?

    Thanks for all the help, I really appreciate it.
    Best,
    NB

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That's correct. For a large cohort, we recommend using the CombineGVCFs tool to combine the GVCFs in batches of 200 before joint genotyping.

  • nb82 (United States; Member)

    OK, I think I get it now!
    So, I first use CombineGVCFs to combine the 3,200 files in batches of 200, thus generating 16 GVCF files. Then I take these 16 files and run GenotypeGVCFs on them in one go to make one VCF file. Correct?
    Again, thanks a lot for all your prompt help.
    Best,
    NB.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @nb82
    Hi NB,

    Yes, you are correct.

    -Sheila
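The plan confirmed above (3,200 GVCFs, batches of 200, then one joint-genotyping run over the 16 combined files) can be sketched as a small script that only builds the command lines. This is an illustration under stated assumptions: it uses GATK 3-style invocations (`-T CombineGVCFs`, `-T GenotypeGVCFs`, `--variant`, `-o`), and every file name, including `GenomeAnalysisTK.jar`, `rice_reference.fasta` and the sample and batch names, is a hypothetical placeholder.

```python
def make_batches(paths, batch_size=200):
    """Split a list of GVCF paths into consecutive batches."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def gatk_command(tool, reference, variants, output):
    """Build a GATK 3-style argument list (assumed syntax, not executed here)."""
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", tool, "-R", reference]
    for path in variants:
        cmd += ["--variant", path]
    return cmd + ["-o", output]


# Hypothetical inputs: 3,200 per-sample GVCFs and a rice reference.
reference = "rice_reference.fasta"
sample_gvcfs = [f"sample_{n:04d}.g.vcf" for n in range(1, 3201)]

# Step 1: one CombineGVCFs command per batch of 200 samples.
batches = make_batches(sample_gvcfs, batch_size=200)
batch_outputs = [f"batch_{i:02d}.g.vcf" for i in range(1, len(batches) + 1)]
combine_cmds = [
    gatk_command("CombineGVCFs", reference, batch, out)
    for batch, out in zip(batches, batch_outputs)
]

# Step 2: a single GenotypeGVCFs command over the combined batch files.
genotype_cmd = gatk_command("GenotypeGVCFs", reference, batch_outputs, "cohort.vcf")

print(len(combine_cmds), "CombineGVCFs commands")
print(" ".join(genotype_cmd[:12]), "...")
```

Each list could be handed to `subprocess.run` or written to a job script. Building the commands separately from running them makes it easy to check the batch count before launching anything.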

  • nb82 (United States; Member)

    Thank you Sheila! :)

  • nb82 (United States; Member)

    Hi, just one more question, which I asked earlier in the thread:
    I am doing the analysis for rice, for which I only have the dbSNP file (there are no hapmap or 1000G files).
    This is the setting in which I am using the dbSNP: -resource:dbSNP,known=true,training=true,truth=true,prior=6.0
    Are the settings of known/training/truth correct if dbSNP is the only resource for VQSR?
    Thanks in advance,
    NB

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    We can't comment on the correctness of your settings because this depends so much on the dataset and the reliability of your dbSNP resource. Defining new resources for VQSR takes a lot of validation. This looks like a step in the right direction, but you will need to find orthogonal methods (e.g. gene chip data) to validate that your results make sense.
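For reference, the resource setting quoted in the question can be assembled programmatically so that the known/training/truth/prior values are easy to vary during validation. This is only a sketch: the resource values are the ones from the question and are not validated settings, the surrounding command assumes GATK 3-style VariantRecalibrator arguments, and the file names and the annotation list are hypothetical placeholders.

```python
def resource_flag(name, known, training, truth, prior):
    """Format a VQSR -resource argument from its individual settings."""
    def as_text(value):
        return "true" if value else "false"
    return (
        f"-resource:{name},known={as_text(known)},"
        f"training={as_text(training)},truth={as_text(truth)},prior={prior}"
    )


# Settings copied from the question; their suitability is NOT established.
dbsnp_flag = resource_flag("dbSNP", known=True, training=True, truth=True, prior=6.0)

# Hypothetical annotation list; use only annotations present in the input VCF.
annotations = ["QD", "MQ"]

cmd = [
    "java", "-jar", "GenomeAnalysisTK.jar",   # placeholder jar path
    "-T", "VariantRecalibrator",
    "-R", "rice_reference.fasta",             # placeholder reference
    "-input", "cohort.vcf",                   # placeholder joint-genotyped VCF
    dbsnp_flag, "rice_dbsnp.vcf",             # placeholder dbSNP file
    "-mode", "SNP",
    "-recalFile", "cohort.snp.recal",
    "-tranchesFile", "cohort.snp.tranches",
]
for name in annotations:
    cmd += ["-an", name]

print(dbsnp_flag)
print(" ".join(cmd))
```

Keeping the settings as function arguments makes it straightforward to rerun the recalibration with different priors or training sets and compare each result against an orthogonal validation set.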
