(howto) Run the genotype refinement workflow

Overview
This tutorial gives step-by-step instructions for applying the Genotype Refinement workflow (described in this method article) to your data.
Step 1: Derive posterior probabilities of genotypes
In this first step, we are deriving the posteriors of genotype calls in our callset, recalibratedVariants.vcf, which just came out of the VQSR filtering step; it contains among other samples a trio of individuals (mother, father and child) whose family structure is described in the pedigree file trio.ped (which you need to supply). To do this, we are using the most comprehensive set of high confidence SNPs available to us, a set of sites from Phase 3 of the 1000 Genomes project (available in our resource bundle), which we pass via the --supporting argument.
java -jar GenomeAnalysisToolkit.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf
This produces the output file recalibratedVariants.postCGP.vcf, in which the posteriors have been annotated wherever possible.
Step 2: Filter low quality genotypes
In this second, very simple step, we are tagging low quality genotypes so we know not to use them in our downstream analyses. We use Q20 as the quality threshold, which means that any passing genotype has at least a 99% chance of being correct.
java -jar $GATKjar -T VariantFiltration -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf
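The 99% figure follows from GQ being a Phred-scaled quality: the probability that a genotype call is wrong is 10^(-GQ/10). Here is a minimal sketch of that arithmetic, assuming awk is available; the helper name gq_to_error_prob is made up for illustration and is not a GATK command.

```shell
# Convert a Phred-scaled genotype quality (GQ) into the probability
# that the genotype call is wrong: P(error) = 10^(-GQ/10).
# gq_to_error_prob is a hypothetical helper name, not part of GATK.
gq_to_error_prob() {
    awk -v gq="$1" 'BEGIN { printf "%.4f\n", exp(-gq / 10 * log(10)) }'
}

gq_to_error_prob 20   # error probability exactly at the Q20 threshold
gq_to_error_prob 30   # a higher GQ means a smaller error probability
```

At GQ 20 the error probability is 0.01, so genotypes that pass the "GQ < 20.0" filter have an error probability of 1% or less.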
Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ with the FT tag of the FORMAT field.
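To see which sample genotypes were tagged, you can read the FT value out of each sample column. The following is a rough sketch, not a GATK feature: the function name list_lowgq_genotypes and the toy records are invented for illustration, and it assumes a tab-delimited, uncompressed VCF on standard input.

```shell
# List CHROM, POS and sample name for every genotype whose FT value
# contains lowGQ. list_lowgq_genotypes is a hypothetical helper name.
list_lowgq_genotypes() {
    awk '
        BEGIN { FS = "\t" }
        /^##/ { next }                                   # skip meta lines
        /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
        {
            # Find the position of FT within the FORMAT column.
            n = split($9, keys, ":"); ft = 0
            for (k = 1; k <= n; k++) if (keys[k] == "FT") ft = k
            if (!ft) next                                # site has no FT tag
            for (i = 10; i <= NF; i++) {
                split($i, vals, ":")
                if (vals[ft] ~ /(^|;)lowGQ(;|$)/) print $1, $2, name[i]
            }
        }
    '
}

# Toy records (made up for this example), written with printf so that
# the tab delimiters are explicit.
demo_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmother\tfather\tchild\n'
    printf '1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:GQ:FT\t0/1:35:PASS\t0/0:12:lowGQ\t0/1:40:PASS\n'
    printf '1\t2000\t.\tC\tT\t60\tPASS\t.\tGT:GQ\t0/1:50\t0/1:45\t1/1:60\n'
}

demo_vcf | list_lowgq_genotypes
```

In the toy data the site-level FILTER column stays PASS for both records; only the father's genotype at position 1000 carries the lowGQ tag, which is the distinction between site-level and genotype-level filtering.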
Step 3: Annotate possible de novo mutations
In this third and final step, we tag variants for which at least one family in the callset shows evidence of a de novo mutation based on the genotypes of the family members.
java -jar $GATKjar -T VariantAnnotator -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf
The annotation output will include a list of the children with possible de novo mutations, classified as either high or low confidence.
See section 3 of the method article for a complete description of annotation outputs and section 4 for an example of a call and the interpretation of the annotation values.
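If you want to pull the annotated sites into their own file, a quick sketch follows. It assumes the annotation is written to the INFO column as hiConfDeNovo=... or loConfDeNovo=... and that the VCF is tab-delimited and uncompressed; the function name select_denovo and the toy records are invented for illustration.

```shell
# Keep all header lines plus the records whose INFO column carries the
# requested key, so the output is still a VCF with its header.
# select_denovo is a hypothetical helper name, not a GATK command.
select_denovo() {
    awk -v key="$1" '
        BEGIN { FS = "\t" }
        /^#/ { print; next }                 # always keep header lines
        {
            n = split($8, field, ";")
            for (i = 1; i <= n; i++)
                if (index(field[i], key "=") == 1) { print; next }
        }
    '
}

# Toy records (made up for this example).
demo_denovo_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t1000\t.\tA\tG\t50\tPASS\tAC=1;hiConfDeNovo=child\n'
    printf '1\t2000\t.\tC\tT\t60\tPASS\tAC=1;loConfDeNovo=child\n'
    printf '1\t3000\t.\tG\tA\t70\tPASS\tAC=2\n'
}

demo_denovo_vcf | select_denovo hiConfDeNovo
```

Matching on the INFO key rather than searching the whole line avoids accidental hits elsewhere in the record, and keeping the header means downstream tools can still read the result.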
Comments
Hello,
I have been unable to locate "ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf" on the 1000G site or the Broad's FTP. Where can I find it, or how can it be generated?
Thank you.
Hi @Gustav,
We don't currently provide that file, but we are planning to include it in our resource bundle in the near future. In the meantime, you can generate it from the data that is publicly available from the 1000Genomes project website.
Hi there Geraldine, had 2 comments/questions about this:
In step 2 of the above, should the variant file (-V) as an input be the same as the output file (-o) of step 1? The input file in the second step is listed as "C1643.PbyT.CGP.vcf" but the output file of the first step is listed as "recalibratedVariants.postCGP.vcf"
Was just wondering the rationale for filtering on GQ < 20 in step 2? If you've passed your VCF through VQSR, would it be OK to just filter (using SelectVariants) for those that passed the filter (-ef)?
Thanks for your help as always!
Hi @estif74,
I think that's a copy/paste error; I'll check and fix if it is.
VQSR only tells you which variant sites are ok; whereas filtering on GQ tells you, within the pool of good sites, which sample genotypes are potentially unreliable. You can have a good variant, where you know you have a sample that's not hom-ref, but where you don't know if the sample is het or hom-var. Good variant, bad genotype. Two different levels of filtering.
Got it, that's very helpful. Thanks for the explanation as always!
Hello, I am wondering why you use 1000 genome phase 3 SNP vcf files in step 1? And why do you use 1000G_phase1.snps.high_confidence.b37.vcf for VariantRecalibrator in the Best practice?
After I download vcf files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ (ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz) and merged them into one file, the size of this merged file is really huge. Are these files the correct files I should use for step 1?
Thank you.
@albertyu
Hi,
The only reason the phases differ between the two files is that different phases were available when each document was written. This document for genotype refinement is newer than the VQSR document.
The files you are using from the 1000genomes ftp have the genotypes included, but you can just use the sites_only files in our bundle which are much smaller. You do not need the genotypes since VQSR does not use them.
-Sheila
Hello
After reading these comments, it is still not clear to me which file I could use for step 1, to "Derive posterior probabilities of genotypes". Could this file be OK:
ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf
(from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ )
Otherwise could you please clarify how to prepare the required file
Thank you!
Dear Geraldine, I have exome data for ~300 unrelated individuals. I have called variants following GATK's best practices and I am wondering whether genotype refinement is necessary and if so, whether I need to calculate genotype posteriors using any other database. Thanks a lot!
@terestahl That looks ok based on the name; though you may need to subset just biallelic snps from that file.
@simonsanchezj Genotype refinement is not required for variant discovery, but it may be useful if your research project involves using genotype information (e.g. for de novo mutation discovery).
Hello Geraldine, Thank you for your answer! I extracted the biallelic SNPs (from the file ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf) and passed it with --supporting to derive posterior probabilities of genotypes. After that I filtered low quality genotypes (-G_filter "GQ < 20.0"). The output file is annotated with lowGQ or PASS. However not all variants get this annotation. I then wonder if the file looks OK. Thanks a lot. /Teresa
Dear Geraldine, thanks a lot for your reply. For this project I have exome-sequenced ~300 unrelated individuals for a case-control study. Thus, I don't intend to perform de novo mutation discovery. Is there any way I can benefit from the genotype refinement workflow? Is it worth it? If so, which collection of VCFs should I use for informing allele frequency priors?
Finally, in case I decide not to do this, would you still recommend filtering low quality genotypes as suggested in Step 2?
Thanks a lot for your help.
@terestahl I think I answered someone else with the exact same question recently -- in a nutshell, not all genotypes are "eligible" to be evaluated by the GQ filter. Some may be skipped if they do not fulfill the necessary conditions, and therefore will not get the FT annotation.
@simonsanchezj
Hi,
Unfortunately, we cannot decide whether the genotype refinement workflow is appropriate for your analysis or not. It depends if you care at all about being able to distinguish hets from hom-vars. There are other reasons for caring about this besides looking for de novos, and it is up to you to decide. If you only care about identifying variant sites, VQSR should be sufficient.
Good luck.
-Sheila
Thanks for your reply, Sheila. I do care about distinguishing hets from hom-vars. I think I understand now how the refinement workflow works. Which collection of VCFs should I use for informing allele frequency priors?
On a different matter, I want to calculate genotype posteriors in some families. I have read somewhere in this forum that the input .ped file should only contain trios. Thus, for each child, I created a dummy family containing both parents and that person. However, when I run CalculateGenotypePosteriors, I get the following error: No PED file passed or no non-skipped trios found in PED file. Skipping family priors.
What am I doing wrong?
My command line looks like:
java -Xmx"$MEM"g -jar "$GATK" \
-R "$REFERENCE" \
-T CalculateGenotypePosteriors \
-V "$INPUT" \
--skipPopulationPriors
-ped "$PED"
-o "$OUTPUT"
My 'dummy' ped looks like
FAM1 sample1 founder1 founder2 2 2
FAM1 founder1 0 0 1 1
FAM1 founder2 0 0 2 1
FAM2 sample2 founder1 founder2 2 2
FAM2 founder1 0 0 1 1
FAM2 founder2 0 0 2 1
FAM3 sample3 founder1 founder2 1 1
FAM3 founder1 0 0 1 1
FAM3 founder2 0 0 2 1
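For readers assembling a similar file, here is a rough way to check which complete trios a PED file describes. It assumes the standard six whitespace-separated PED columns (family ID, individual ID, father ID, mother ID, sex, phenotype). The function name list_trios and the toy pedigree are invented for illustration. The check only looks at the file's internal consistency; it does not reproduce how CalculateGenotypePosteriors decides which trios to skip.

```shell
# Print "family child father mother" for every individual whose two
# parents are both named and both present as rows in the same family.
# list_trios is a hypothetical helper name, not a GATK command.
list_trios() {
    awk '
        NF >= 6 {
            n++
            fam[n] = $1; ind[n] = $2; pat[n] = $3; mat[n] = $4
            seen[$1, $2] = 1          # remember every (family, individual)
        }
        END {
            for (i = 1; i <= n; i++) {
                if (pat[i] == "0" || mat[i] == "0") continue   # founder
                if (((fam[i], pat[i]) in seen) && ((fam[i], mat[i]) in seen))
                    print fam[i], ind[i], pat[i], mat[i]
            }
        }
    '
}

# Toy pedigree (made up): FAM1 is a complete trio, FAM2 lacks the mother row.
demo_ped() {
    printf '%s\n' \
        'FAM1 child1 dad1 mom1 2 2' \
        'FAM1 dad1 0 0 1 1' \
        'FAM1 mom1 0 0 2 1' \
        'FAM2 child2 dad2 mom2 1 2' \
        'FAM2 dad2 0 0 1 1'
}

demo_ped | list_trios
```

Only FAM1 is reported for the toy input, because the mother of child2 has no row of her own in FAM2.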
What do you mean by "created a dummy family containing both parents and that person"? Are they part of the same family or not?
Hi Geraldine,
Sheila already answered the ped-related question in another post. Thank you both for your help.
As for population-based analyses, which collection of VCFs should I use for informing allele frequency priors?
Thanks
@simonsanchezj
Hi,
You can use the Phase 3 1000 Genomes biallelic SNPs. Please see Geraldine's response to Gustav in this thread.
-Sheila
Dear Sheila, thanks a lot for your reply. Are you planning to include ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf in the 2.8 bundle in the near future? Otherwise, would you recommend using vcf's contained here? ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/
Thanks a lot for your collaboration in this and other matters.
We are planning to release a new version of the data bundle with the next version (3.4). It will probably be a few more weeks before we can talk about release schedule though.
Hi,
I also have trouble in finding supporting file for CalculateGenotypePosteriors walker.
I've found biallelic_snp.vcf file at 1000genome ftp server ;
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/site_assessment/
Here, for example
ALL.chr1.unfiltered_union_sites_with_svm.20130502.biallelic_snps.sites.vcf.gz
is available. But I have noticed that this directory doesn't have a biallelic_snp.vcf file for chrY.
It includes biallelic_snp.vcf files for the other chromosomes.
Are these files adequate "supporting" files ?
Yes, that sounds right. Not sure why chrY is not represented.
Thank you for your answer. I'll try merging these files and if in trouble again, post here again.
Hi~
Can you tell me about Sheila's answer for the ped-related question?
That's what I want to know.
Many thanks in advance.
@jh7521
Hi,
I believe @simonsanchezj is referring to this thread: http://gatkforums.broadinstitute.org/discussion/5068/ped-file-structure-for-calculategenotypeposteriors-walker
I hope this helps.
-Sheila
Hi,
I've been trying to prepare the ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf file based on the recommendations given in this thread in order to get Step 1 running. After downloading ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf from the 1000G ftp site, I believe my next step is to select only the biallelic sites. I tried this out using the SelectVariants walker. However, I receive the error below:
[email protected]:/media/GenomeAnalysisTK-3.4-0$ java -jar GenomeAnalysisTK.jar -T SelectVariants -R /media/GATK\ Bundle/GRCh37/human_g1k_v37.fasta -V /media/GATK\ Bundle/ALL.phase3.20130502.biallelic_snps.integrated.sites/1000G_ftp/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf -o /media/GATK\ Bundle/ALL.phase3.20130502.biallelic_snps.integrated.sites/ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf -selectType SNP -selectType MNP -restrictAllelesTo BIALLELIC
...............................
...............................
INFO 10:43:36,934 ProgressMeter - 21:32804795 6.75401394E8 34.5 m 3.0 s 90.7% 38.0 m 3.5 m
INFO 10:44:06,936 ProgressMeter - 22:38919126 6.8447717E8 35.0 m 3.0 s 92.5% 37.8 m 2.8 m
WARN 10:44:26,933 RestStorageService - Error Response: PUT '/ZfGg9SNeuIpdIumzexlHAQZig40nfm0z.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1061, Content-MD5: 201Aq8dwhl0j5fJGHKRP0Q==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: db4d40abc770865d23e5f2461ca44fd1, Date: Wed, 08 Jul 2015 02:44:25 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:9h2p7rrbmckoIZny2zI+E3bpGAI=, User-Agent: JetS3t/0.8.1 (Linux/3.11.0-26-generic; amd64; en; JVM 1.7.0_55), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 26A9B7914689A8F5, x-amz-id-2: WS4wp1NQct07M27rc8EUappYwQ9OF0RMJmEtPJl4pbeQZBgL71Ts3pTS0j/h9vmGVFALomFrCG0=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 08 Jul 2015 02:25:44 GMT, Connection: close, Server: AmazonS3]
WARN 10:44:27,927 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately -1122 seconds. Retrying connection.
INFO 10:44:29,256 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.IllegalStateException: Key OLD_VARIANT found in VariantContext field INFO at X:7151130 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176)
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222)
at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182)
at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:271)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:774)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:288)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Key OLD_VARIANT found in VariantContext field INFO at X:7151130 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
ERROR ------------------------------------------------------------------------------------------
Please help! Am I doing this right?
@rose_ismet
Hi,
This could be a bug. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/3962/genotypeandvalidate-error-key-callstatus-found-in-variantcontext-field-info
Can you please post the vcf record at position X:7151130?
Thanks,
Sheila
Hi Sheila,
Thank you for the prompt reply. I'll try to read through the forum to gather more info on this issue.
In the meantime, here's the VCF record for the position requested.
X 7151130 . TT TC 100 PASS AC=1;AF=0.000264901;AN=3775;NS=2504;OLD_VARIANT=X:7151131:TC/CC/C;DP=13746;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;EAS_AF=0.001;VT=MNP
Anything out of the ordinary there?
Hi,
My VCF file contains both INDELs and SNPs. Can genotype refinement only be applied to the SNPs, and not to the INDELs present in the file?
I am sorry for my stupid question, but I am confused because the --supporting file (ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf) contains only SNPs.
Thanks!
@rose_ismet
Hi,
I am not sure why you are having this error if you downloaded from the 1000Genomes website. I just downloaded the X chromosome vcf and it has OLD_VARIANT defined in the header.
The reason you are getting the error is because OLD_VARIANT is not defined in the vcf header. You can simply add in a line to define it yourself, and that should solve the problem.
-Sheila
P.S. The header line from the vcf I downloaded is ##INFO=<ID=OLD_VARIANT,Number=.,Type=String,Description="Original before vt normalize was run. FORMAT chr:pos:ref:alt">
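One way to add that definition is to insert the line just before the #CHROM column header. The following is only a sketch, assuming an uncompressed VCF on standard input; the function name add_old_variant_header and the toy record are invented for illustration, and the ##INFO line is the one quoted above.

```shell
# Insert the OLD_VARIANT definition before the #CHROM line, unless the
# header already defines it. add_old_variant_header is a hypothetical
# helper name, not a GATK command.
add_old_variant_header() {
    awk -v line='##INFO=<ID=OLD_VARIANT,Number=.,Type=String,Description="Original before vt normalize was run. FORMAT chr:pos:ref:alt">' '
        index($0, "##INFO=<ID=OLD_VARIANT,") == 1 { have = 1 }
        /^#CHROM/ && !have { print line }
        { print }
    '
}

# Toy input (made up): a header without the definition and one record.
demo_sites_vcf() {
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'X\t7151130\t.\tTT\tTC\t100\tPASS\tOLD_VARIANT=X:7151131:TC/CC/C;VT=MNP\n'
}

demo_sites_vcf | add_old_variant_header
```

Because the function first checks for an existing definition, running it on a file that already has the line leaves the header unchanged.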
@MUHAMMADSOHAILRAZA
Hi,
Yes, as of now, the genotype refinement workflow works on SNPs only.
You don't need to select for SNPs only. You can simply input your recalibrated vcf with everything.
-Sheila
Sheila thank you...!
Hi,
After following all the steps I finally got the "recalibratedVariants.postCGP.Gfiltered.deNovos.vcf" file. Is there any way in GATK to extract only the "hiConfDeNovo" (high confidence de novo) and "loConfDeNovo" (low confidence de novo) mutations?
I just used grep -w "hiConfDeNovo" recalibratedVariants.postCGP.Gfiltered.deNovos.vcf >> hiConfDeNovo.vcf to extract the corresponding records. Is there a better way?
Thanks!
My other question is:
If we phase the VCF file with PhaseByTransmission (i.e. after step 2, before using VariantAnnotator), how does the tool affect the genotype information, and does that affect the later de novo mutation discovery in step 3? And if we use other GATK tools for downstream analysis, do they automatically recognize the updated GQ values and ignore the PLs?
Hi @Sheila @Geraldine_VdAuwera
I am waiting for your kind reply.....
@MUHAMMADSOHAILRAZA
Hi,
For your first question, you can use SelectVariants. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php
For your second question, this thread should help: http://gatkforums.broadinstitute.org/discussion/5573/some-queries-on-phasebytransmission-output
The GQ field is updated in the output VCF, so it will be recognized by downstream tools. Because the PLs are not changed, the downstream tools will use the original PLs. As for ignoring PLs, it depends on which tools you use. If the PLs are taken into account, the original PLs will be used.
I hope this helps to clarify things.
-Sheila
@Sheila
Does the PhaseByTransmission tool use GQ info?
Hi Sheila,
Thanks for the tips. I finally managed to get the missing header line into the file that I have, and also managed to separate out the biallelic SNPs.
Now my issue is that I have used hg19 as a reference, and I believe that the SNPs from Phase 3 of the 1000 Genomes project are on a GRCh build. I can't seem to insert 'chr' into the chromosome names. Does GATK provide this resource in hg19 format, or is there any other alternative for deriving posterior probabilities in analyses done using hg19?
Coming from a biological background, you can imagine the dilemma I am having.
Thank you for your time.
@MUHAMMADSOHAILRAZA
Hi,
Sorry for the late response! Yes, PhaseByTransmission uses GQ.
-Sheila
@rose_ismet
Hi,
You can use LiftoverVariants to fix your problem. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_LiftoverVariants.php
-Sheila
@Sheila
I also am getting the OLD_VARIANT error. It is strange because running SelectVariants on dbsnp142 and 1kG_phase3.snps vcfs I didn't get this error. I only got it trying to use CombineVariants on them.
I found the entry, it was in the 1kG_phase3.snps.b37.vcf. I think I got that from the resource bundle. I will try inserting the header.
Nelson
@nchuang
Hi Nelson,
You can either add the OLD_VARIANT to the header, or check again in the resource bundle for 1000G_phase3_v4_20130502.sites.vcf. The file was recently added and should work properly.
-Sheila
Hi,
Just to point out a mistake in the Step 2 output file description. At the end of Step 2 you can read "Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ in the FILTER field.". However, -G_filter through "VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag)."
Cheers
Ahmed
@ahmed_chakroun
Hi Ahmed,
I am afraid I don't understand your post. What exactly needs to be changed in the document? Posting some before and after records might help too.
-Sheila
Thanks for reporting this, @ahmed_chakroun. You are correct, we meant the "genotype filter field" which of course corresponds to the FT tag in FORMAT, not the site-level FILTER field. I'll fix that now.
Hello,
I'm trying to lift over the 1000G_phase3_v4_20130502.sites.vcf from the resource bundle to hg19 with Picard 2.1.0
using the command:
java -jar /usr/local/picard/2.1.0/lib/picard.jar LiftoverVcf I=1000G_phase3_v4_20130502.sites.vcf O=1000G_phase3_v4_20130502.sites.hg19.vcf CHAIN=b37tohg19.chain REJECT=liftover.rejected_variants.vcf R=hg19/ucsc.hg19.fasta.
I get the following error:
[Thu Feb 18 12:28:24 EST 2016] picard.vcf.LiftoverVcf INPUT=1000G_phase3_v4_20130502.sites.vcf OUTPUT=1000G_phase3_v4_20130502.sites.hg19.vcf CHAIN=b37tohg19.chain REJECT=liftover.rejected_variants.vcf REFERENCE_SEQUENCE=hg19/ucsc.hg19.fasta WARN_ON_MISSING_CONTIG=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Feb 18 12:28:24 EST 2016] Executing as [email protected] on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater
INFO 2016-02-18 12:28:24 LiftoverVcf Loading up the target reference genome.
INFO 2016-02-18 12:28:40 LiftoverVcf Lifting variants over and sorting.
ERROR 2016-02-18 12:28:40 LiftoverVcf Encountered a contig, chr1 that is not part of the target reference.
[Thu Feb 18 12:28:40 EST 2016] picard.vcf.LiftoverVcf done. Elapsed time: 0.26 minutes.
Runtime.totalMemory()=5031067648
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
I have checked that chr1 is in the target reference (it was downloaded from the resource bundle), and have tried searching the forum for similar issues but don't seem to find any. Is it possible to liftover this file, or do you have any other suggestions on how I can overcome this issue?
Many thanks for your help.
@gkphilip
Hi,
I think your question is being answered here.
-Sheila
hi,
I have exome data for a trio (parents and a child) in which I want to call de novo variants. I followed the Best Practices to call variants; in the filtering step, SNPs were filtered with VQSR, but indels were hard-filtered because there were too few of them.
Before running CalculateGenotypePosteriors, the VQSR-filtered SNPs and the hard-filtered indels were concatenated.
The problem is that after CalculateGenotypePosteriors, all SNPs were removed from the output without any warning; everything is fine if I run it on the SNP set alone.
@hujingchu
Hi,
What do you mean "after CalculateGenotypePosteriors, snps were all removed from output without any warnings"? Can you post the exact command you ran?
Thanks,
Sheila
Hello,
I found the 1000G_phase3_v4_20130502.sites.vcf file under the b37 folder in your bundle. There is no such file in the hg19 folder. Since my reference genome is hg19, does that mean I would need to realign my files? Thank you.
Hi @helene,
I believe our resource bundles are provided as is. If the hg19 folder doesn't contain the resource, then please check with the 1000 Genomes Project to see if they have the equivalent.
Hi all,
I am trying to run a trio containing 3 samples (dad, mum and one child) with the aim of identifying mutations in the child. I managed to get the final VCF file: recalibratedVariants.postCGP.Gfiltered.deNovos.vcf. Based on the statement that high confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion, I filtered these callsets for GQ >= 20 in each trio sample. However, there are still more than 0.5 million SNPs as potential mutations, which cannot be true. I am looking for fewer than 100 SNPs as mutations.
Am I doing something wrong? Could anyone give me some suggestions?
Bests
Dr yan
@zejunyan
Hi Dr. Yan,
Are you seeing 0.5 million high confidence de novo mutations when you expect to see ~100? If so, can you please post the exact commands you ran to produce the final VCF with de novos?
Thanks,
Sheila
@Sheila
Yes, I see a lot more than expected. The command-lines are :
java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -T CalculateGenotypePosteriors --supporting ../../../../snp_db/vcf_chr_1-28_30_32.vcf.gz -ped trio.ped -V genotyped.trio.cohort.g.vcf -o recalibratedVariants.postCGP.vcf
java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -T VariantFiltration -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf
java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -T VariantAnnotator -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf
I used hard-filtered snp (genotyped.trio.cohort.g.vcf) as input.
I followed the documentation.
Is there something wrong with these three command-lines??
Thank you very much
Dr yan
@zejunyan
Hi Dr. Yan,
No, those look fine. Have a look at Geraldine's suggestion above as well.
-Sheila
@Vergilius
Hi,
I think you can input a general ped file containing information for all trios.
-Sheila
@Sheila
Ok! It's what I did. Looks to work fine. Thanks
Hello,
I am working with the hg19 reference, so the contigs in 1000G_phase3_v4_20130502.sites.vcf are not compatible. Will it make much difference to use 1000G_phase1.snps.high_confidence.hg19.sites.vcf from the bundle, considering it's phase 1 and not phase 3?
Thanks!
@mcvu
Hi,
Yes, that is the file you should use. The 1000G_phase3_v4_20130502.sites.vcf file is for use with b37 reference (which is what we use in our example commands)
-Sheila
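A quick way to spot this kind of mismatch before running a tool is to compare the contig names used in the VCF against the reference index. The following is only a sketch: the function name missing_contigs and the toy files are invented for illustration, and it assumes an uncompressed VCF and a samtools-style .fai index whose first column is the contig name.

```shell
# missing_contigs VCF FAI
# Print each contig name used by VCF records that is not listed in the
# first column of the reference index. Hypothetical helper, not GATK.
missing_contigs() {
    awk '
        NR == FNR { ref[$1] = 1; next }      # first file: reference index
        /^#/ { next }                         # skip VCF header lines
        !($1 in ref) && !seen[$1]++ { print $1 }
    ' "$2" "$1"
}

# Toy demonstration with made-up files in a temporary directory: an
# hg19-style index (chr1, chr2) and a VCF that mixes both naming styles.
tmp=$(mktemp -d)
printf 'chr1\t249250621\nchr2\t243199373\n' > "$tmp/ref.fai"
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n1\t200\nchr2\t300\n' > "$tmp/sites.vcf"
missing_contigs "$tmp/sites.vcf" "$tmp/ref.fai"
rm -r "$tmp"
```

For the toy files this reports the contig named 1, which is the b37-style name that the hg19-style index does not contain. An empty result only means the names match; it says nothing about whether the underlying sequences are the same build.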
Hello
Thanks so much for those workflows
I'm pretty new to the subject. If I understand correctly, this workflow is for VQSR-filtered callsets; which workflow applies if I use hard filtering instead? I'm interested in calling de novo mutations in my trio.
Thanks
Hello again
I have another question about de novo variant callers and somatic callers (so far no one has answered me in the forums); maybe this is not the right place, and what I am doing may be odd. If I merge the parental BAM files and treat them as the normal sample against the child (treated as the tumor), shouldn't the resulting somatic variants be similar to what I get when I use the parents' and child's BAM files directly with de novo trio caller tools? In my case I used VarScan tools. I am now trying to learn to use GATK.
Thanks
Does this workflow work on INDELs now?
When I ran the second step, I got this error:
##### ERROR MESSAGE: Invalid argument value '<' at position 8.
##### ERROR Invalid argument value '20.0"' at position 9.
and this is my command
java -Xmx16g -jar /home/mgujral/tools/GATK/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T VariantFiltration -R /oasis/projects/nsf/ddp195/dantakli/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -V /home/a1lian/recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o 11000.SSC02220.Gfiltered.vcf
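The two messages are consistent with GATK receiving "GQ, < and 20.0" as three separate arguments: counting from zero after the jar name, < and 20.0" land at positions 8 and 9. That is what happens when a command is stored in a shell variable as one string and then expanded unquoted, because the quote characters inside the variable are treated as ordinary text. Whether the command was launched that way here is an assumption; the post does not say. A sketch of the effect, using a made-up helper print_args in place of GATK:

```shell
# print_args shows each argument a program would receive, one per line.
# It is a hypothetical stand-in for the real tool.
print_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Typed directly: the shell strips the quotes and passes the whole
# expression as a single argument (4 arguments in total).
print_args -G_filter "GQ < 20.0" -G_filterName lowGQ

# Stored in one string and expanded unquoted: the shell splits on
# whitespace and keeps the quote characters (6 arguments in total).
opts='-G_filter "GQ < 20.0" -G_filterName lowGQ'
print_args $opts

# Keeping the arguments as separate words preserves the grouping.
set -- -G_filter "GQ < 20.0" -G_filterName lowGQ
print_args "$@"
```

If the command is being built by a wrapper script or a job scheduler, passing the arguments as separate words (as in the last call) rather than as one string keeps the filter expression intact.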
Hello
I'm pretty confused and hope someone can help me. I did the alignment and variant calling following the GATK tutorials, using the reference genome from ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz. Is there any way to make --supporting 1000G_phase3_v4_20130502.sites.vcf compatible with my reference, or do I need to start over using human_g1k_v37_decoy.fasta?
Thanks