Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
tommycarstensen ✭✭✭
About
- Username
- tommycarstensen
- Location
- United Kingdom
- Joined
- Visits
- 203
- Last Active
- Roles
- Member
- Points
- 293
- Badges
- 17
- @tommycarstensen
- Location
- United Kingdom
- Full Name
- Tommy Carstensen
Reactions
Comments
-
I think bcftools can check the validity of gVCFs. GATK and vcftools can probably do the same. As a simple check you could make sure that the number of fields are identical for all records/lines. If I was in your shoes I would run the job again irres…
-
There are lots of tools for going from MNPs to SNPs. But unless I have misunderstood something (which happens frequently), then you can't actually go the other way without having phased genotypes or somehow utilising the sequence data. In your case …
-
Sorry about the spam above. I solved my problem. I reduced the number of VR input files from ~30k to ~3k by concatenation (with a competing tool) and the runtime dropped to hours instead of weeks.
-
Please ignore my question. I already asked about it here. Previously VR3.3 took me ~12 hours to run with -nt 24. Now it will take me several weeks according to the stderr and it has already run for more than 8 hours. I must somehow have done somethi…
-
I get the same warning some 30k times when running VariantRecalibrator3.4. I'm assuming it's safe to ignore.
-
@EmmaDan Perhaps you can try with --maxGaussians 4 as suggested here? P.S. Your awk command is not correct in the case of multiallelic SNPs. As far as I recall GATK can count and return statistics. I use another competing tool myself, that I will n…
-
@manon_sourdeix this thread might be of interest to you? http://gatkforums.broadinstitute.org/discussion/5581/unifiedgenotyper-genotype-calling-oddity
-
Human? How many annotations? Compressed? I suggest you run on a fragment that you know have an average SNP density and which is much larger than the size of the metadata lines and multiply/extrapolate.
-
The beauty of HC is that you can keep adding samples to your set and still get a non sparse matrix. If your data allows it, then I highly recommend the workflow suggested by Sheila.
-
Thanks so much for this Geraldine! It's only chrom20, but it's still waste. I switched my brain on now :wink: I have attached the plot to this post as well. The hard link is: https://us.v-cdn.net/5019796/uploads/FileUpload/76/56678cde353eae12877c07…
-
@Geraldine_VdAuwera I was a bit impatient, because I need this project to move forward fast (as you might have guessed from my 3am GMT posts), so I didn't wait for the --minPruning 1 ROC curve to be generated. Please see the attachment with the incl…
-
MISTAKE POST - DELETED
-
@Sheila works for me...
-
@blueskypy I can help your with your 2nd/3rd question. You are meant to apply the variant recalibration/filtering with ApplyRecalibration, if you follow the default practices. VQSR is a two (four with indels) step process, if you follow the best pra…
-
@MaxideSousa Have you tried fixing the metadata lines of your input file? You can do this with bcftools reheader and bcftools annotate. You probably need to get rid of all metadata lines starting with ##FORMAT. Your records/lines only contain 8 &quo…
-
@tstuber this thread might be of interest to you: http://gatkforums.broadinstitute.org/discussion/5581/unifiedgenotyper-genotype-calling-oddity
-
It sounds to me as if one of your files A.g.vcf B.g.vcf C.g.vcf is corrupted. Can you try to grep END=21994810 and post the line? Perhaps you ran out of disk space?
-
Thanks @Geraldine_VdAuwera! HC4 sounds exciting! I don't think low coverage is where the puck is going to be though (with apologies for the Gretzky reference). I'll put a smile on your face some other time :) I'm very sad myself about not being able…
-
@Geraldine_VdAuwera I'm not sure I want to cycle to work anymore, if I spot a drone above me... Yes, lots of 0,0,X PLs after HC and GG. But what puzzles me is that UG seems to get the AD right, whereas HC/GG seems to get it wrong all the time with …
-
(Quote) I think you have understood the concept of PL. I think what you haven't understood is the fact that your AD is 3,3 and hence you make a heterozygous call (0/1). If the AD had been 0,6 or 6,0, then your call would have been 1/1 and 0/0, respe…
-
@Geraldine_VdAuwera Just don't tell your spouse I'm throwing stones after you... I have done the ROC curves for indels. HaplotypeCaller3.4 does just as bad as UnifiedGenotyper3.4 as one would expect for low coverage data. That's all fine. I'm not c…
-
+1 I have experienced the same problem with version 3.2 or 3.3 of UG and GenotypeGVCFs.
-
@santiagorevale just a small comment; UG emits SNPs and indels at the same position separately, whereas HC emits them as one VCF record. There are many tools to split your multiallelic sites into biallelic sites.
-
@amsci8 would it be an option for you to do EMIT_ALL_SITES instead along with --intervals ?
-
I finally got around to running HC with --minPruning and minDanglingBranchLength 2/4, 2/1, 1/4, 1/1. Initially those ROC curves do not look good for HC compared to UG; I will attach tomorrow. UG performs better in terms of sensitivity. This is low c…
-
@Sheila Nothing noticeable, but I never timed it though. @кфьутылн can you try adding --intervals to your command to see if that speeds up things? Can you post your command, if it doesn't?
-
@MUHAMMADSOHAILRAZA That is great news! Thanks for sharing! Hoping you will share a figure with the rest of us upon completion :) I forgot to ask about your depth of coverage.
-
@agaricx Is AC=1 before or after annotation with SNPEff? Did you run GenotypeGVCFs after HaplotypeCaller? It might be beneficial to post your commands?
-
@MUHAMMADSOHAILRAZA Did you reach any conclusions about UG and HC? Are you allowed to share? I would be interested. Thanks.
-
@bioinfo_89 I added the locations of the PARs to Wikipedia some months ago. Irrespective of your build your coordinate chrX:77298857 is not in the PARs.
-
@Sheila @Geraldine_VdAuwera I'm calling variants in some low coverage data (I promise it's the last!). I also have high coverage trios for some of these populations. Would it be a) really ingenious or b) really stupid or c) really indifferent or d)…
-
@JMFA Can you post the command you used? Did you run HC+GGVCFs or just HC; i.e. is this a gVCF?
-
Thanks @Geraldine_VdAuwera. I will use bcftools to generate the -alleles VCF and do the recalling with UG. I feel I'm on shaky ground doing this. Sorry for never making it to your workshop. I mostly just wanted to meet some celebrities.
-
@bioinfo_89 What is the coordinate of your variant? Is it possible that it is located in one of the PARs?
-
@h_asif This might be helpful? https://www.broadinstitute.org/gatk/guide/article.php?id=1213 http://gatkforums.broadinstitute.org/discussion/2217/how-where-to-download-resource-vcf-files
-
@Sheila @Geraldine_VdAuwera I have heard a few people at different occasions saying it's not necessary to run BQSR for Illumina HiSeq X Ten high coverage data. I thought I would hear it from the horse's mouth myself. Is it best practice to run BQSR …
-
@Geraldine_VdAuwera For version 3.4 do you recommend using UG for low coverage data? I have a few more low coverage datasets to be called. I was about to test out the sensitivity of HC3.4 compared to UG, but maybe you can save me the effort? Thank y…
-
@mahyarhey Thanks. I'm still not sure what is going on. As a fellow user I want to. (Quote) I noticed lots of missing calls in your HC output. How did you merge your HC output files? Preferably you should be doing joint calling for your 33 samples …
-
@mahyarhey Can you post the first record from each of your two VCF files being compared? Can you also post a record count for one chromosome (e.g. chr20) for each of your VCF files? What software did you use to convert from .gen.gz to .vcf.gz? Did y…
-
I think a GitHub issue has been raised: http://gatkforums.broadinstitute.org/discussion/5482/qual-for-non-variant-sites
-
@Berry123 How many records in your VCF and how many samples? Also are you using annotations that are present in your VCF? Can you print your command line and the first record of your VCF? Does your stderr contain this line by any chance? VariantDat…
-
Stupid question: How do you know, what your expected heterozygosity is? It could deviate from HWE. How many samples are you dealing with? Is this low coverage data?
-
(Quote) Have you completely lost your marbles? I would be broke then :smiley: I just wanted to advertise for bcftools, which would be my weapon of choice for a task like this. I used to do this with awk :)
-
(Quote) @TechnicalVault I ran VR3.3 and VR3.4 on your 15x VCF with -nt 1. It seems to work frequently (I did multiple runs), when I don't include -an InbreedingCoeff. Can you try without it? And can you try to do multiple runs? (Quote) Your chromos…
-
@shazly Try version 3.4.
-
@tinu It should not matter, but can you try with -stand_call_conf 0 and -stand_emit_conf 0? Any reason you are using -U LENIENT_VCF_PROCESSING? Can you print an example of a variant, which fails to be called?
-
@EmmaDan would it perhaps be an option for you to use VariantAnnotator or recall your variants in GGA mode with UG to avoid local realignment?
-
@55816815 Personally I do it in a third way. I run two VR jobs for SNPs and indels simultaneously and then I apply the recalibration using either the unix utilities paste, sort -m and awk or Python and the function heapq.merge() on the .recal and .v…
-
@Geraldine_VdAuwera @Sheila Is GATK3.4 just about to be released? Will it be released this week? If not, then version 3.3 it is for my next project. Thanks.
-
@55816815 I'm a bit confused. The best practices do not require that you run SelectVariants. Instead you can just run VariantRecalibrator with -mode SNP and -mode INDEL on a file containing all of your called variants. See the best practices procedu…
-
@zyx_n Just out of curiosity. Which software generated your vcf? It does not seem to be malformed at first glance.
-
@pdexheimer Did -allSites -L sites.vcf work for you; i.e. did it print all sites? Did you receive any -maxAltAlleles warnings? Did you combine it with -isr INTERSECTION? I am asking, because I had issues with HC in GGA mode as suggested by @Sheila a…
-
@shiroann Yes, basically don't run VQSR per chromosome, unless you need the results quickly, before the Death Star reaches Alderaan or something like that. And you would have to run ApplyRecalibration separately per chromosome as well, if you were t…
-
@shiroann It is better to maximize the number of data points by running simultaneously across all chromosomes. That being said I think VQSR still might give reasonable results for individual chromosomes.
-
What happens with the runtime if you compress and tabix index your files?
-
@SteveL No, starting from "tranche-90" and going up your specificity will decrease and your sensitivity will increase. The sentence is correct.
-
@newbee If you want to include the "non-passed" sites in your downstream analysis, then you might have to change the filter column to . or PASS. Otherwise a lot of tools will automatically filter out these sites. But going for a threshold …
-
@severinec The default value according to the documentation is 1000.
-
I am curious to see what answer you will receive to your question. I don't care about dbSNP annotations myself, so I never noticed this issue. One workaround is to split your multiallelics into biallelics, normalize (trim and left align) indels and …
-
A knownTiTv of 2.6 is not really low, is it? Your novelTiTv suffers from the fact that you are using dbSNP138. I remember when switching from 138 to 135 that my novel TiTv ratio increased. I'm sure if you look up the version history of dbSNP somewhe…
-
(Quote) Your file name could indicate that you are using a gVCF as input. I don't think you are meant to use gVCFs as input for VR. If your file is not a gVCF, then please ignore my comment. Best of luck getting it to work.
-
Are md5 sums available? Have you checked them? Have you checked the file size is correct? Have you done a tail on the file? Have you validated the file in other ways? Do you have write permission to your disk?
-
@pdexheimer Ooops, I didn't know about GATK CatVariants despite it apparently having been around for ages. I use bcftools concat for this task. I like bcftools concat, because I can pipe the output. I dislike bcftools concat, because it requires the…
-
Am I allowed to advertise for bcftools here or will my account be closed NSA style? :) See the filter options of the bcftools sub command view.
-
@Blue I also work with related samples. I would be very interested in hearing about your findings in this thread, when you get further with your analysis. I would be very interested in knowing about the proportion of your variants that have Inbreedi…
-
Well that sucks @Geraldine_VdAuwera, but no despair :) Hopefully this is the last time I will be working with low coverage data. I will take UG for one last dance instead. Thanks for looking into this so thoroughly and for not abandoning the questio…
-
@sletort Maybe I misunderstand but either your site is marked as LowQual (if -stand_call_conf and -stand_emit_conf are different) or a missing genotype has GT ./. See the VCF specifications.
-
@golharam I noticed your interval being named VQSRTranchesSNP99.00to99.90. I just wanted to let you know, that the VQSR best practices for SNPs are: --ts_filter_level 99.5 \-mode SNP \ According to the VR documentation the default --TStranche valu…
-
@biobio Is it perhaps an option for you to use PLINK1.9, if you are dealing with just one site? If you want to use haplotype input instead of estimating LD with the EM algorithm from unphased genotypes (and your data is in the SHAPEIT2 .haps format)…
-
@mayaab Interesting. Do you mind, if I ask, which version of GATK are you using? I haven't experienced this problem myself with the older version 3.2 on ~2000 samples. I'll follow this thread. Thanks for posting.
-
@Rebecca Yup, you need a set of truth sites for any machine learning method. See the best practices for further information. Good luck. Have fun.
-
@Rebecca if you have a training set, then you can use VQSR. If not, then you can't.
-
@bsmith030465 The current version is 3.3, so you can just go to the current best practices: https://www.broadinstitute.org/gatk/guide/best-practices It is possible to find the old 2.8 guide book here: https://www.broadinstitute.org/gatk/guide/pdf…
-
@pdexheimer Guru! Thanks!
-
To anyone reading this thread. It's continued here: http://gatkforums.broadinstitute.org/discussion/5265/
-
(Quote) Is this UG2.8 or UG3.3 producing the expected result? Did UG2.8 call the deletion 7:117199640:TATC:T correctly? I might have to revert from UG3.3 to UG2.8 myself based on your findings. I'm watching this thread with hawk eyes and I'm very gr…
-
@mlinderm I follow this thread with interest. Thanks for posting it. Have you by any chance calculated your homRef, het and homAlt genotype discordance per sample (assuming your original calls are the truth) for SNPs and indels for UG GGA and HC GGA…
-
I agree. VR for example gives an obscure error message, when an input resource file is missing. And only gives this error message after running for a while. No coffee for the sys admin, who moved my files ;) That being said GATK is an awesome tool a…
-
Hi @mayaab. I'm a user and I'm very interested in your question. I don't think there are currently any best practices for the sex chromosomes. See this thread: http://gatkforums.broadinstitute.org/discussion/2895/vqsr-and-sex-chromosomes With that…
-
@Bettina_Harr I was able to run my jobs. I just kept trying, when they failed. See the other thread. Perhaps also try changing -nct 6 to -nct 2 -nt 4. If your data is not high coverage, then you should be using HaplotypeCaller instead. Good luck!
-
Hi @SteveL. I'm just a user and I could be completely wrong, but I don't think stand_emit_conf and stand_call_conf applies to HaplotypeCaller, when you use it in GVCF mode. I only think they apply to GenotypeGVCFs in this case. Here is a link to the…
-
It would be great, if GenotypeGVCFs could support GENOTYPE_GIVEN_ALLELES if it doesn't already.
-
@zgtman I just found a description of VF and GQX here: https://support.basespace.illumina.com/knowledgebase/articles/144844-vcf-file You can calculate those annotations from existing annotations and add the annotations to your VCF file with bcftoo…
-
Is transcript information a valid annotation? Here documentation on annotations: https://www.broadinstitute.org/gatk/guide/tooldocs/#VariantAnnotations
-
@rcastelo if you click on "Guide" in the top menu, then you will get to the documentation. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php#--max_alternate_alleleshttps://ww…
-
(Quote) This sounds cool. Is there an example somewhere?
-
(Quote) The walker GenotypeGVCFs was introduced with GATK3.0 as far as I recall. Make sure you always use the latest version of GATK. Which is currently 3.3.
-
@apolyak also see this great post by @Sheila titled: "When should I use -L to pass in a list of intervals?"
-
Oh, and I forgot that @Sheila wrote "RankSumTest annotations do not work in Variant Annotator" in another thread. So definitely give re-calling a try.
-
It depends on a lot of factors. I will share my limited experience with Homo sapiens. Are you running HaplotypeCaller in GVCF mode? It should not require more than 4GB if you do. If you are doing joint calling in non-GVCF mode with HC or UG, then I …
-
I am not an expert on java, but you can set your tmp directory location like this: java -Djava.io.tmpdir=tmp I believe the default location is /tmp in the root, which can fill up quickly, when on a shared system.
-
@Kath I had the same problem when annotating calls from samtools, FreeBayes, etc. However, it is still best practice to use SOR for VQSR in INDEL mode. See the indel specific recommendations here. I get a neat continuous Gaussian distribution of SOR…
-
Yes, that's exactly the answer from Sheila I had in mind. Thanks for locating it. I think you forgot to include positions for those variants you posted. I'm not sure I understand your question in its current format. This thread on left alignment o…
-
Hi @rfuentes. There was someone else asking about this recently and Geraldine or Sheila gave a really good answer, but I can't find the thread. Short answer is that you can't know which position was deleted. GATK to the best of my knowledge by defau…
-
P.S. Not sure if FS is the best of annotations for SNPs (and indels), because all the values seem to cluster at 0 making it a binary classifier rather than a continuous value and it was mentioned in another thread that "the VQSR algorithm assum…
-
This thread is related to another one on a similar topic. I was afraid that the Y chromosome would perform differently under the VQSR model, because some of the annotations used by VQSR are depth dependent. Indeed relatively more chromY sites have …
-
Thanks Sheila. Doesn't bother me at all. Just thought I would mention it. Have a nice weekend.
-
It could be due to the random downsampling carried out by HaplotypeCaller. The default downsampling settings are described in the documentation. There are probably other reasons.
-
From the resource bundle and other places.
-
If this is low (or high) coverage data and you have at least 100 samples, then you should see reasonably good sensitivity and specificity with UG. I would find the calls reliable. Maybe you have some SNP array data or high coverage data for the same…