SNV gets dbSNP annotation in one sample, doesn't get annotated in another one

wariobregawariobrega RomeMember
edited August 2017 in Ask the GATK team

Hello everyone,

I recently run HaplotypeCaller for GATK3.7 on a series of samples (several GATK runs performed at the same time), using the latest release of dbSNP(150). This was the command line I used for both cases (I omissed the full paths for privacy concerns):

/usr/bin/java -Djava.io.tmpdir=/scratch/javatmp/ngs_pipe<br /> \ -Xmx4g -jar /data01/Softwares/GATK/3.7/GenomeAnalysisTK.jar \<br /> -T HaplotypeCaller \<br /> -R /path/to/hg19 \<br /> -I input_bam \<br /> -o output.vcf \<br /> --dbsnp dbSNP_150_NEW_hg19_chr.vcf

and here's the same variant reported in two different files of the same data (exomes) performed using the same kit, on the same NextSeq run

sample 1:

chr11 125479363 rs2241502 G A 208.01 . AC=2;AF=1.00;AN=2;DB;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=23.11;SOR=0.892 GT:AD:DP:GQ:PL 1/1:0,9:9:27:222,27,0

sample2:

chr11 125479363 . G A 597.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.140;ClippingRankSum=0.000;DP=41;ExcessHet=3.0103;FS=2.820;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.58;ReadPosRankSum=-1.110;SOR=0.448 GT:AD:DP:GQ:PL 0/1:16,25:41:99:605,0,343

I've seen this error running sistematically for several other positions in the same run, and I fear that the error might be always been there and I didn't notice before. I'm wondering if you know why this occurs, if it's a bug and it is known and if I should reannotate with VariantAnnotator every vcf I got in order to fix the issue (if VariantAnnotator is immune from this bug)

Thanks a lot for your help and time, I'm here for every clarification

Best Answer

Answers

  • Hi Sheila, I'll send a bug report, thanks for your attention!

  • HI Sheila, I've uploaded a case through your FTP as you requested. The file containing a case of missing rsIDs is in a .tar.gz file named wariobrega_GATK_Debug_missingrsIDs.tar.gz. In It you'll find

    • The command I used in a .sh script;
    • the 2 GATK log for Haplotype Caller in two *.log files;
    • The snippets of the two bam files I used (only chr1):
    • the relative vcf files and their indices (Genome Wide);
    • a .tsv files showing the missing rsids for chr1 only (the one I found in one bam but not in another one). If you request, I can upload all the missing rsids for each side;

    That's it I guess. Note that as --dbsnp I used the latest dbsnp version (150) and the hg19 downloaded from UCSC as reference.

    In case you need other info/data, just let me know.

    Thanks for your support!

    cheers,

    Daniele

    Issue · Github
    by Sheila

    Issue Number
    2498
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @wariobrega
    Hi Daniele,

    Thanks, I will have a look soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @wariobrega
    Hi Daniele,

    Sorry for the delay. I am just looking at this now. The reference you used is different from our hg19 reference. Can you upload the one you used? It looks like the UCSC site has per chromosome FASTA files, and it would just be easier if you upload your reference.

    Thanks,
    Sheila

    P.S. Can you check if this still occurs with GATK4 latest beta?

  • Hi Sheila,

    Thanks for your reply.

    I uploaded the reference genome I used (a cleaned version of hg19 downloaded from UCSC) to the ftp server in a tar.gz file named wariobrega_GATK_Debug_missingrsIDs_ReferenceGenome.tar.gz. There you'll find both the fasta and the relative index generated with Picard.

    As for the GATK4 test, I'd rather wait for the official release that will come in one and a half month. For the moment I opted for an "in house" resolution using a local database that is queried using the variant coordinate at the end of the variant calling, so I'd rather wait for the GATK4 Official so I can test out all the other cool stuff you've added altogether.

    Thanks a lot for your help and let me know if I can provide any other information to reproduce the error!

    Daniele

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @wariobrega
    Hi Daniele,

    I just tested this, and most of my sites are annotated with the dbSNP id (as long as the site is in dbSNP). I ran the default HaplotypeCaller command with -L chr1:1-100000 -D dbsnp_138.hg19.vcf. I am not sure if the version 150 dbSNP is what is causing the issue, but version 138 from our bundle should work.

    -Sheila

    P.S. I did not run on all of chr1, but I am assuming there will be no issue with the rest of it. I used the latest nightly build.

  • Hi Sheila,

    Thanks again for your interest in this concern. It may be that the extremely larger size of dbSNP150 compared to dbSNP138 may cause the issue. On the other hand, this missing annotation occurred once every 10k variants, so the problem can't be seen on a small test sample. Anyways we dealt with it by reannotating the RSID with custom scripts. Also it may be possible that the problem has been fixed already, so I will test it again on GATK4 once it comes out.

    Thanks a lot for your help!

    Cheers,

    Daniele

Sign In or Register to comment.