Exception thrown while running VariantRecalibrator

The command I am running is

gatk VariantRecalibrator \
    -R /home/nick/Exome/reference/ucsc.hg19.fasta \
    -input GRC14335721.vcf \
    -recalFile GRC14335721.recal \
    -tranchesFile GRC14335721_recalibrate_SNP.tranches \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /misc/exome/datasets/bundle/2.8/hg19/hapmap_3.3.hg19.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /misc/exome/datasets/bundle/2.8/hg19/1000G_omni2.5.hg19.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /misc/exome/datasets/bundle/2.8/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /misc/exome/datasets/bundle/2.8/hg19/dbsnp_138.hg19.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
    -mode SNP -l debug &> ../log/variant-recalibrator.log

and the debug output is in the attached log (scroll to the end to see error message). I thought maybe the execution was OK, but the generated recal and tranches files are empty.


Answers

  • Sheila, Broad Institute, Member, Broadie, Moderator

    @23andaspie
    Hi,

    It looks like you don't have enough data to run VQSR. We recommend using at least 30 exomes or 1 whole genome. Can you tell us what kind of data you are working with and how many samples are in your VCF?

    Thanks,
    Sheila

  • 23andaspie, Seattle, Member

    Hi Sheila,

    Thanks for the reply. This is my personal exome (Illumina) data through genebygene. Of course, the sample size here is 1.

    I was reading a similar post about 1000 Genomes and VQSR, but it didn't describe which specific datasets I should download from 1000 Genomes, or what commands to run (or how) to incorporate this with my personal data.

    Could you please point me in the right direction?

    Much appreciated,
    Nick

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie
    Accepted Answer

    Hi Nick @23andaspie,

    In general, the most important criterion for supplementing your exome with an external cohort is to find other exomes that are as close as possible to yours in technical terms, i.e. that were prepared and sequenced using the same kits and technology if possible. Ethnicity etc. isn't really important. So the first thing you need to do is find out exactly how your exome data was generated. Then, you have to look up the 1000 Genomes sample list (it's available as an Excel document) and identify which of the technologies used is most similar to what was used for your exome. Once you have that, you'll need to filter the sample list by those criteria and obtain the data files from the 1000 Genomes FTP server. Then you'll need to generate GVCFs from each of those samples individually, run joint genotyping on all the GVCFs together, and finally filter the resulting multisample VCF callset with VQSR.

    To be frank, this is not a trivial undertaking -- in addition to finding the right datasets, you'll need a lot of disk space and quite a bit of compute power. Are you doing this on personal computing equipment or do you have access to an analysis server?

    Note also that this is the procedure for filtering variants if you're interested in maximizing your analysis sensitivity overall. If you're looking to see if you have specific variants based on previous research, you may not need to jump through quite so many hoops. If you know what you're looking for and there are not too many sites, you could choose to subset your variant calls to just the sites of interest, and then examine each site individually. If you tell us what you want to get out of your exome analysis we may be able to give you some advice on how to work around some of the problems you're likely to encounter.
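    The per-sample GVCF, joint genotyping, and VQSR steps described above can be sketched roughly as follows, written in current GATK4 syntax (flag names differ from older GATK versions, including the command in the original post). All file names are placeholders, and the resource lines are abbreviated to two of the four typically used:

```shell
# 1. Call variants per sample in GVCF mode (your exome plus each cohort exome).
gatk HaplotypeCaller -R ucsc.hg19.fasta -I sampleN.bam -O sampleN.g.vcf.gz -ERC GVCF

# 2. Combine the per-sample GVCFs, then joint-genotype the whole cohort.
gatk CombineGVCFs -R ucsc.hg19.fasta \
    -V my_exome.g.vcf.gz -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    -O cohort.g.vcf.gz
gatk GenotypeGVCFs -R ucsc.hg19.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz

# 3. Model and apply VQSR on the multisample callset (SNP pass shown only).
gatk VariantRecalibrator -R ucsc.hg19.fasta -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP -O cohort.recal --tranches-file cohort.tranches
gatk ApplyVQSR -R ucsc.hg19.fasta -V cohort.vcf.gz \
    --recal-file cohort.recal --tranches-file cohort.tranches \
    --truth-sensitivity-filter-level 99.5 -mode SNP \
    -O cohort.filtered.vcf.gz
```

    An indel pass (a second VariantRecalibrator/ApplyVQSR round with -mode INDEL and indel-appropriate resources) would normally follow the SNP pass.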

  • 23andaspie, Seattle, Member

    Hi Dr. VdAuwera,

    Thank you for the thorough and insightful response! I have found a spreadsheet, which has a "Phase1" tab with columns indicating Platform (e.g. Illumina) and Center -- is this what you are referring to? My exome was done by genebygene, which indicates that the "Nextera Rapid Capture Expanded Exome Kit - FC-140-1006" and the Illumina HiSeq platform were used. However, I do not see kit information in the 1000genomes spreadsheet. Please let me know if I am looking in the correct place.

    I have access to additional computational resources if need be (i.e. EC2/S3), should a cluster or fleet rather than a single machine be warranted for this scale of analysis. Up to this point, I have been running sequence analysis on my hardware at home (a quad-core i7, 32 GB of RAM, a 1 TB internal SSD, and several TB of non-SSD external storage).

    At the age of 23, I was diagnosed with high-functioning autism spectrum disorder, a condition which has been shown to be overwhelmingly genetic in origin. My goal is to pinpoint the underlying genetic risk factors I carry that are known to be associated with autism spectrum disorders. This is not as simple as it may sound, though: the genetics of ASDs is extremely complex, with several hundred genes each identified as having only a minor correlation.

    I am continually looking into recent research on this subject. Here's a recent study on molecular diagnostics in individuals with high-functioning autism spectrum disorder that aligns closely with my interests and goals here.

    Issue · GitHub (filed by Sheila): issue #909, state closed, closed by vdauwera.
  • 23andaspie, Seattle, Member

    Oh, there was one other thing I was hoping to get from my exome. There was a publication I read a few years ago titled The Science of Success, which discusses a gene known as 5-HTTLPR:

    As I researched this story, I thought about such questions a lot, including how they pertained to my own temperament and genetic makeup. Having felt the black dog’s teeth a few times over the years, I’d considered many times having one of my own genes assayed—specifically, the serotonin-transporter gene, also called the SERT gene, or 5-HTTLPR. This gene helps regulate the processing of serotonin, a chemical messenger crucial to mood, among other things. The two shorter, less efficient versions of the gene’s three forms, known as short/short and short/long (or S/S and S/L), greatly magnify your risk of serious depression—if you hit enough rough road. The gene’s long/long form, on the other hand, appears to be protective.

    How would I be able to tell, from my exome data, which form of the gene I have?

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie
    Accepted Answer

    Hi Nick @23andaspie,

    I thought that might be your interest based on your username.

    My mistake on the capture kit -- you have the right spreadsheet but the information is not in there, it's here in the 1000G FAQs, listed per sequencing center. Looking for protocol resources in genomics often turns into a scavenger hunt. Anyway, it looks like the technology used for your exome is more recent than anything used in the 1000G project, which is unfortunate for VQSR purposes, even though it's good in general because it's a more advanced technology.

    You could try to evaluate which of the technologies used in 1000G is most like the one used for your exome, or you could try to hunt down some other publicly available datasets done with your technology. The latter would be better but it's hard to even know where to start -- searching for data by capture kit is not something anyone has thought to make easy, as far as I know, and more's the pity.

    Alternatively you could just pick one of the centers (pick us! pick us! "BI" for Broad Institute) and select a set of ~30 exomes that score high on the QC metrics ("% Targets Covered to 20x or greater" is probably the most important) -- but best not include outliers that score exceptionally high. The advantages of joint calling should still outweigh the disadvantages of the technical mismatch.
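    As a concrete (entirely made-up) illustration of that selection step, here's a minimal shell sketch that filters a sample-metrics table by center and coverage. The column layout (sample, center, % targets at 20x) and the thresholds are assumptions for the toy data below, so check the real spreadsheet's header before adapting it:

```shell
# Build a tiny stand-in for the 1000G sample-metrics spreadsheet (tab-separated).
printf 'sample\tcenter\tpct20x\nHG00096\tBI\t95.2\nHG00097\tWUGSC\t96.0\nHG00099\tBI\t99.5\nHG00100\tBI\t85.0\n' > samples.tsv

# Keep Broad ("BI") exomes with high but not extreme 20x coverage, take up to 30.
awk -F'\t' 'NR > 1 && $2 == "BI" && $3 >= 90 && $3 <= 98 { print $1 }' samples.tsv \
    | head -n 30 > cohort_samples.txt

cat cohort_samples.txt   # only HG00096 passes: BI, well covered, not an outlier
```

    The upper bound on coverage is just one way of expressing "skip the exceptional outliers"; any cutoff should be chosen by eyeballing the real distribution first.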

    For a single exome you could get by on a beefy personal machine, but once you get into higher sample numbers frankly you'll probably need to go to the cloud or equivalent. We're working with Google on a Google Cloud Platform implementation of GATK-as-a-service; it's currently in closed-access beta status but we're looking to open it up to the public in the near future. Can't say how soon or not soon that might happen though.

    All this is assuming you want to go the route of joint calling -- which is probably worth it if you're going to be looking at potentially hundreds of loci. But you can already start by calling variants on your exome alone if you run the GVCF workflow (as detailed in our Best Practices documentation). This will at least allow you to do a "first draft" genotyping and start answering some, if not all of your questions.

    Speaking of which -- I'm really not the right person to advise you on designing your analysis because I have no clinical training, but a solid, simple way to start is to compile a list of all sites that have been reported as being of interest, and then look for genotype concordance between your own genotype (as reported by the variant caller) and the variant alleles reported in the literature at those sites.
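    As a toy sketch of that concordance check, assuming an uncompressed single-sample VCF with GT first in the FORMAT column; all coordinates, alleles, and file names below are invented for illustration:

```shell
# Sites of interest: chrom, pos, allele reported in the literature.
printf 'chr1\t100\tG\nchr17\t44000\tA\n' > sites_of_interest.tsv

# Minimal single-sample VCF standing in for your exome callset.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf 'chr1\t100\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:30\n'
  printf 'chr17\t44000\t.\tC\tA\t60\tPASS\t.\tGT:DP\t1/1:25\n'
} > my_exome.vcf

# For each site of interest, report whether either called allele matches the
# reported allele (handles biallelic sites with 0 = REF, 1 = first ALT only).
awk 'NR == FNR { want[$1 "\t" $2] = $3; next }        # load sites of interest
     !/^#/ && (($1 "\t" $2) in want) {
         split($10, s, ":")                           # sample column, GT first
         split(s[1], gt, /[\/|]/)                     # e.g. "0/1" -> 0 and 1
         a1 = (gt[1] == "0") ? $4 : $5
         a2 = (gt[2] == "0") ? $4 : $5
         hit = (a1 == want[$1 "\t" $2] || a2 == want[$1 "\t" $2]) ? "MATCH" : "no_match"
         print $1, $2, s[1], want[$1 "\t" $2], hit
     }' sites_of_interest.tsv my_exome.vcf
```

    A real callset would need bgzip/tabix handling and multiallelic-site logic, but the shape of the check is the same: look up each literature site, compare alleles, flag concordance.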

    For the serotonin transporter gene (SERT/5-HTT) mentioned in the Atlantic article, you may want to start by having a look in the ExAC database here. There's also a recent paper here that seems like it might be particularly relevant based on the abstract. (By the way, if you're having trouble accessing paywalled research articles, there are ways around those.)

    Be sure to look for papers from the Stanley Center for Psychiatric Research, which is doing a lot of exciting work in that space here at the Broad. I know of at least this paper and this paper, but there are surely others that I'm not aware of. You can also follow them on Twitter under the handle @stanleycenter.

    That's all I can think of at the moment; again I'm not an expert on this, but I'd be happy to help if you run into any difficulties during the GATK-centric portion of your analysis. Good luck!
