johnwallace123 ✭✭

About

Username
johnwallace123
Joined
Visits
47
Last Active
Roles
Member
Points
83
Badges
7

Comments

  • @JensR and @Geraldine_VdAuwera: We are also dealing with files in the 100's of GB range, but we aren't noticing an issue with VQSR. Since all of the VQSR training data is at the variant level, we work with sites-only files for training the VQSR mod…
  • @xiaolicbs I just saw this post, and wanted to let you know that we are also experiencing the same issue. The way we've worked around it is through a call to VariantAnnotator after the SelectVariants call. Our general workflow is as follows: java…
  • @Sheila, We currently have >40,000 samples, which requires merging gVCFs twice (typically in sets of ~150, then merging ~30 combined gVCFs). So, this means that we will have to re-merge and re-merge again for any gVCF that contains a sample tha…
  • @Sheila, @Geraldine_VdAuwera, I did some more debugging, and I'm pretty sure that it is a site that has a large amount of variation. With a bit of code modification, I've tracked down the Out of Memory error to the stack trace below. ##### ERROR …
  • @Sheila, Running GenotypeGVCFs on the batch of 64 combined GVCFs works. However, this set of 10,000 samples is only one batch in our final cohort, which will be substantially larger (>40,000 samples in total, so we'll need to merge the gVCFs tw…
  • @Geraldine_VdAuwera Thanks for the update. I'm surprised that the BCF performance would be worse, but the only performance testing that I've done has been vcf->bcf and bcf->vcf (using the SelectVariants in a probably silly way), and I found …
  • @pdexheimer, Thanks so much for the pointer in the right direction! I did have one final question (for now). I made the changes that I thought would be useful in the trunk of htsjdk, but I noticed that the trunk gatk-protected failed to build bec…
  • @pdexheimer, That's somewhat disappointing. Not having this available means that we can't use GATK for our concordance checks. Would it be possible to add a helper key "isFiltered" for this use case? If you point me to the construction…
  • @Sheila @azo121, Using the previous VCF as "test.vcf" (with the missing fields fleshed out), our base command line (omitting reference) is as follows: GATK -T GenotypeConcordance -comp test.vcf -eval test.vcf -moltenize | grep 'SAMP' | g…
  • @Geraldine_VdAuwera, @azo121, We do use the VariantFiltration with the JEXL epressions such as 'GQ<20' to produce our VCFs that we want to then check for concordance. Rather than re-using the 'GQ<20' expression, we want to evaluate the VCF a…
  • @SteveL: I've also had some issues with CombineGVCFs running quite slowly, and I've found that splitting by chromosome is a simple and effective way of parallelizing the process. While Queue would help, I also can't run it, so we've had to manually…
  • @Geraldine_VdAuwera, I'm also seeing strangeness in our tranche plots that looks like it might have the same cause. It looks like this is caused by the R script ordering the data by TiTv ratio instead of the truth sensitivity tranches. Normally, …
  • @thibault, @Geraldine_VdAuwera, Thanks so much for the answers; they confirm my suspicion about how gVCFs and SelectVariants work. I'm not sure that using the -L with CombineGVCFs will work well, as it seems that "-L" makes GenotypeGVCFs…
  • @Sheila‌ Sorry it took me so long to get back to you. I've uploaded "HighDP_LowGQ.tar.gz" that should include everything needed to take a look at. If you need any additional information or files, please let me know and I'd be happy to p…
  • @Sheila‌ I took a look at the BAM from the HC in IGV for both the problematic sample and a sample that has a variant called at that position, and I didn't see anything glaring. Most of the base qualities were in the 20-35 range, and the read count…
  • @Sheila‌ , The nightly build still shows this behavior. From the GVCF, it looks like there are similar non-variants nearby which have a "PL" of 0,120,1800. Do you know why I might be seeing this behavior sometimes? Below is a snippet of…
  • @Shiela I'll definitely try the nightly build to see if it fixes anything. I was just a little surprised that given what appear to be 601 REF reads, that it was a coin flip between hom-ref and het, while the probability of hom-alt was 10^-1374. I…
  • While I agree that the new workflow is better (and we're switching to that now), part of our pipeline is sanity checking the inputs at various stages of the pipeline. The data that comes to us is typically single-called, as it is streaming in from …
  • @Geraldine_VdAuwera‌ I think I've fixed the issue that we've been seeing; where should I submit a patch? In the how to submit a patch FAQs, I saw it said "attach them as an email", but I couldn't find a good email address. Thanks, John
  • @Geraldine_VdAuwera‌ and @Sheila‌, No problem; thanks for all the help! Even with the final VA step, the new 3.1-1 pipeline is much faster than the 2.x joint-calling version. If I have the time and the brainpower, I might try to submit a patch b/…
  • @Sheila‌, When I tried to run the VariantAnnotator on the GVCF, I didn't get the "AB" in the INFO field. It seems that I only ever get "AB" for heterozygous biallelic markers. However, there's never a heterozygous biallelic ma…
  • @Shelia From the VariantAnnotator documentation, it looks like I need to use the BAM files to generate the "AB" annotation; is this correct? Is there any way to generate this annotation from the VCF alone? Alternatively, can I add this …
  • I think throwing an error would alert users to the issue, but it would be nice if it was supported. Ideally, there would be some kind of "smart" merging of the 2 genotypes, but simply taking the first one seen would be just as valid. As …