Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

(howto) Run the genotype refinement workflow

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited November 2015 in Tutorials

Overview

This tutorial describes step-by-step instruction for applying the Genotype Refinement workflow (described in this method article) to your data.


Step 1: Derive posterior probabilities of genotypes

In this first step, we are deriving the posteriors of genotype calls in our callset, recalibratedVariants.vcf, which just came out of the VQSR filtering step; it contains among other samples a trio of individuals (mother, father and child) whose family structure is described in the pedigree file trio.ped (which you need to supply). To do this, we are using the most comprehensive set of high confidence SNPs available to us, a set of sites from Phase 3 of the 1000 Genomes project (available in our resource bundle), which we pass via the --supporting argument.

 java -jar GenomeAnalysisToolkit.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf

This produces the output file recalibratedVariants.postCGP.vcf, in which the posteriors have been annotated wherever possible.


Step 2: Filter low quality genotypes

In this second, very simple step, we are tagging low quality genotypes so we know not to use them in our downstream analyses. We use Q20 as threshold for quality, which means that any passing genotype has a 99% chance of being correct.

java -jar $GATKjar -T VariantFiltration -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf

Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ with the FT tag of the FORMAT field.


Step 3: Annotate possible de novo mutations

In this third and final step, we tag variants for which at least one family in the callset shows evidence of a de novo mutation based on the genotypes of the family members.

java -jar $GATKjar -T VariantAnnotator -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf

The annotation output will include a list of the children with possible de novo mutations, classified as either high or low confidence.

See section 3 of the method article for a complete description of annotation outputs and section 4 for an example of a call and the interpretation of the annotation values.

Post edited by Geraldine_VdAuwera on

Comments

  • GustavGustav SwedenMember

    Hello,
    I have been unable to locate "ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf " on 1000G site and Broads ftp. Where can I actually find or how can it be generated?
    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Gustav,

    We don't currently provide that file, but we are planning to include it in our resource bundle in the near future. In the meantime, you can generate it from the data that is publicly available from the 1000Genomes project website.

  • estif74estif74 Saint Paul, MN, USAMember

    Hi there Geraldine, had 2 comments/questions about this:

    1. In step 2 of the above, should the variant file (-V) as an input be the same as the output file (-o) of step 1? The input file in the second step is listed as "C1643.PbyT.CGP.vcf" but the output file of the first step is listed as "recalibratedVariants.postCGP.vcf"

    2. Was just wondering the rationale for filtering on GQ < 20 in step 2? If you've passed your VCF through VQSR, would it be OK to just filter (using SelectVariants) for those that passed the filter (-ef)?

    Thanks for your help as always!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @estif74,

    1. I think that's a copy/paste error; I'll check and fix if it is.

    2. VQSR only tells you which variant sites are ok; whereas filtering on GQ tells you, within the pool of good sites, which sample genotypes are potentially unreliable. You can have a good variant, where you know you have a sample that's not hom-ref, but where you don't know if the sample is het or hom-var. Good variant, bad genotype. Two different levels of filtering.

  • estif74estif74 Saint Paul, MN, USAMember

    Got it, that's very helpful. Thanks for the explanation as always!

  • albertyualbertyu Member

    Hello, I am wondering why you use 1000 genome phase 3 SNP vcf files in step 1? And why do you use 1000G_phase1.snps.high_confidence.b37.vcf for VariantRecalibrator in the Best practice?

    After I download vcf files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ (ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz) and merged them into one file, the size of this merged file is really huge. Are these files the correct files I should use for step 1?
    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @albertyu‌

    Hi,

    The only reason the phases are different between the two files are because there were different phases when each of the documents were written. This document for genotype refinement is newer than the vqsr document.

    The files you are using from the 1000genomes ftp have the genotypes included, but you can just use the sites_only files in our bundle which are much smaller. You do not need the genotypes since VQSR does not use them.

    -Sheila

  • terestahlterestahl karolinska InstitutetMember

    Hello
    After reading these comments it is still not clear to me which file could I use for step 1, to "Derive posterior probabilities of genotypes". Could this file be OK:
    ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf
    (from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ )
    Otherwise could you please clarify how to prepare the required file
    Thank you!

  • simonsanchezjsimonsanchezj GermanyMember

    Dear Geraldine, I have exome data for ~300 unrelated individuals. I have called variants following GATK's best practices and I am wondering whether genotype refinement is necessary and if so, whether I need to calculate genotype posteriors using any other database. Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @terestahl That looks ok based on the name; though you may need to subset just biallelic snps from that file.

    @simonsanchezj‌ Genotype refinement is not required for variant discovery, but it may be useful if your research project involves using genotype information (e.g. for de novo mutation discovery).

  • terestahlterestahl karolinska InstitutetMember
    edited January 2015

    Hello Geraldine, Thank you for your answer! I extracted the biallelic SNPs (from the file ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf) and passed it with --supporting to derive posterior probabilities of genotypes. After that I filtered low quality genotypes (-G_filter "GQ < 20.0"). The output file is annotated with lowGQ or PASS. However not all variants get this annotation. I then wonder if the file looks OK. Thanks a lot. /Teresa

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  101 P1148_102   …..
    1   65745   .   A   G   92.55   VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:FT:GQ:PL:PP    0/0:5,0:5:lowGQ:7:0,0,112:0,7,131   0/0:4,0:4:lowGQ:19:0,12,147:0,19,166    […..]
    1   69511   rs75062661  A   G   47644.66    VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:GQ:PGT:PID:PL:PP   1/1:0,75:75:99:.:.:2157,225,0:2187,239,0    ./.:1,0:1   […..]
    1   721290  rs12565286  G   C   970.96  PASS    […..]   GT:AD:DP:GQ:PL:PP   0/1:8,3:11:65:76,0,263:65,0,280 0/0:12,0:12:44:0,33,495:0,44,523    […..]
    1   721450  .   G   A   1036.89 VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:GQ:PL:PP   0/1:100,17:117:99:180,0,3000:172,0,3014 0/0:35,0:35:99:0,99,1485:0,107,1507 […..]
    1   752894  rs3131971   T   C   10644.86    PASS    […..]   GT:AD:DP:FT:GQ:PL:PP    0/1:5,5:10:PASS:99:138,0,157:146,0,155  1/1:0,11:11:PASS:35:348,33,0:358,35,0   […..]
    1   752928  .   C   T   50.23   PASS    […..]   GT:AD:DP:GQ:PL:PP   0/0:16,0:16:77:0,48,583:0,77,646    0/0:13,0:13:68:0,39,473:0,68,536    […..]
    1   762109  .   C   T   4807.89 VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:GQ:PL:PP   0/1:251,52:303:99:842,0,7599:834,0,7613 0/0:45,0:45:99:0,118,1771:0,126,1793    […..]
    1   762273  rs3115849   G   A   133553.81   VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:GQ:PL:PP   0/1:95,121:216:99:3583,0,2621:3590,0,2620   1/1:0,143:143:99:5091,430,0:5100,431,0  […..]
    1   762366  .   G   C   612.23  PASS    […..]   GT:AD:DP:GQ:PL:PP   0/0:45,0:45:99:0,108,1716:0,137,1779    0/0:36,0:36:99:0,99,1485:0,128,1548 […..]
    1   762472  rs145493205 C   T   1584.69 VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:FT:GQ:PGT:PID:PL:PP    0/0:51,0:51:lowGQ:7:.:.:0,2,1318:0,7,1334   0/0:36,0:36:PASS:99:.:.:0,99,1485:0,104,1501    […..]
    1   762485  rs148989274 C   A   1516.82 PASS    […..]   GT:AD:DP:FT:GQ:PGT:PID:PL:PP    0/0:48,0:48:PASS:34:.:.:0,30,1508:0,34,1522 0/0:36,0:36:PASS:99:.:.:0,99,1485:0,103,1499    […..]
    1   762589  rs71507461  G   C   22684.91    VQSRTrancheSNP99.90to100.00 […..]   GT:AD:DP:FT:GQ:PGT:PID:PL:PP    0/1:5,22:27:PASS:99:0|1:762589_G_C:888,0,250:896,0,248  1/1:0,19:19:PASS:59:1|1:762589_G_C:822,57,0:832,59,0    […..]
    1   762592  rs71507462  C   G   22490.93    PASS    […..]   GT:AD:DP:FT:GQ:PGT:PID:PL:PP    0/1:5,21:26:PASS:99:0|1:762589_G_C:888,0,250:896,0,248  1/1:0,17:17:PASS:59:1|1:762589_G_C:822,57,0:832,59,0    […..]
    
    Post edited by Geraldine_VdAuwera on
  • simonsanchezjsimonsanchezj GermanyMember

    Dear Geraldine, thanks a lot for your reply. For this project I have exomed ~300 unrelated individuals for a case-control study. Thus, I don't intend to perform de novo mutation discovery. Is there anyway I can benefit from the genotype refinement workflow? Worth it? If so, which collection of VCFs should I use for informing allele frequency priors?
    Finally, in case i decide not to do this, whould you still recommend to filter low quality genotypes as suggested in Step 2?

    Thanks a lot for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @terestahl I think I answered someone else with the exact same question recently -- in a nutshell, not all genotypes are "eligible" to be evaluated by the GQ filter. Some may be skipped if they do not fulfill the necessary conditions, and therefore will not get the FT annotation.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @simonsanchezj‌

    Hi,

    Unfortunately, we cannot decide whether the genotype refinement workflow is appropriate for your analysis or not. It depends if you care at all about being able to distinguish hets from hom-vars. There are other reasons for caring about this besides looking for de novos, and it is up to you to decide. If you only care about identifying variant sites, VQSR should be sufficient.

    Good luck.

    -Sheila

  • simonsanchezjsimonsanchezj GermanyMember

    Thanks for your reply, Sheila. I do care about distinguishing hets from hom-vars. I think I understand now how the refinement workflow works. Which collection of VCFs should I use for informing allele frequency priors?

    On a different matter, I want to calculate genotype posteriors in some families. I have read somewhere in this forum that the input .ped file should only contain trios. Thus, for each child, I created a dummy family containing both parents and that person. However, when I run CalculateGenotypePosteriors, I get the following error: No PED file passed or no non-skipped trios found in PED file. Skipping family priors.

    What am I doing wrong?

    Mi command line looks like:

    java -Xmx"$MEM"g -jar "$GATK" \
    -R "$REFERENCE" \
    -T CalculateGenotypePosteriors \
    -V "$INPUT" \
    --skipPopulationPriors
    -ped "$PED"
    -o "$OUTPUT"

    My 'dummy' ped looks like
    FAM1 sample1 founder1 founder2 2 2
    FAM1 founder1 0 0 1 1
    FAM1 founder2 0 0 2 1
    FAM2 sample2 founder1 founder2 2 2
    FAM2 founder1 0 0 1 1
    FAM2 founder2 0 0 2 1
    FAM3 sample3 founder1 founder2 1 1
    FAM3 founder1 0 0 1 1
    FAM3 founder2 0 0 2 1

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    What do you mean by "created a dummy family containing both parents and that person"? Are they part of the same family or not?

  • simonsanchezjsimonsanchezj GermanyMember

    Hi Geraldine,
    Sheila already answered the ped-related question in another post. Thank you both for your help.

    As for population-based analyses, which collection of VCFs should I use for informing allele frequency priors?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @simonsanchezj‌

    Hi,

    You can use the Phase 3 1000 Genomes biallelic SNPs. Please see Geraldine's response to Gustav in this thread.

    -Sheila

  • simonsanchezjsimonsanchezj GermanyMember

    Dear Sheila, thanks a lot for your reply. Are you planning to include ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf in the 2.8 bundle in the near future? Otherwise, would you recommend using vcf's contained here? ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/

    Thanks a lot for your collaboration in this and other matters.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    We are planning to release a new version of the data bundle with the next version (3.4). It will probably be a few more weeks before we can talk about release schedule though.

  • yoskeriyoskeri tokyo,japanMember

    Hi,
    I also have trouble in finding supporting file for CalculateGenotypePosteriors walker.
    I've found biallelic_snp.vcf file at 1000genome ftp server ;
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/site_assessment/
    Here, for example
    ALL.chr1.unfiltered_union_sites_with_svm.20130502.biallelic_snps.sites.vcf.gz
    is available. But I have noticed that this directory doesn't have biallelic_snp.vcf file for chrY.
    It involves biallelic_snp.vcf file for other chromosomes.

    Are these files adequate "supporting" files ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, that sounds right. Not sure why chrY is not represented.

  • yoskeriyoskeri tokyo,japanMember

    Thank you for your answer. I'll try merging these files and if in trouble again, post here again.

  • jh7521jh7521 South KoreaMember
    edited June 2015

    @simonsanchezj said:
    Hi Geraldine,
    Sheila already answered the ped-related question in another post. Thank you both for your help.

    As for population-based analyses, which collection of VCFs should I use for informing allele frequency priors?

    Thanks

    Hi~
    Can you tell me about Sheila's answer for the ped-related question?
    That's what I want to know.
    Many thanks in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
  • rose_ismetrose_ismet MalaysiaMember

    Hi,

    I've been trying to prepare the ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf file based on the recommendations given in this thread in order to get Step 1 running. After downloading ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf from the 1000G ftp site, I believe my next step is to select only the biallelic site. I tried this out using the SleectVariants walker. However I receive an error as below:

    [email protected]:/media/GenomeAnalysisTK-3.4-0$ java -jar GenomeAnalysisTK.jar -T SelectVariants -R /media/GATK\ Bundle/GRCh37/human_g1k_v37.fasta -V /media/GATK\ Bundle/ALL.phase3.20130502.biallelic_snps.integrated.sites/1000G_ftp/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf -o /media/GATK\ Bundle/ALL.phase3.20130502.biallelic_snps.integrated.sites/ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf -selectType SNP -selectType MNP -restrictAllelesTo BIALLELIC
    ...............................
    ...............................
    INFO 10:43:36,934 ProgressMeter - 21:32804795 6.75401394E8 34.5 m 3.0 s 90.7% 38.0 m 3.5 m
    INFO 10:44:06,936 ProgressMeter - 22:38919126 6.8447717E8 35.0 m 3.0 s 92.5% 37.8 m 2.8 m
    WARN 10:44:26,933 RestStorageService - Error Response: PUT '/ZfGg9SNeuIpdIumzexlHAQZig40nfm0z.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1061, Content-MD5: 201Aq8dwhl0j5fJGHKRP0Q==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: db4d40abc770865d23e5f2461ca44fd1, Date: Wed, 08 Jul 2015 02:44:25 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:9h2p7rrbmckoIZny2zI+E3bpGAI=, User-Agent: JetS3t/0.8.1 (Linux/3.11.0-26-generic; amd64; en; JVM 1.7.0_55), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 26A9B7914689A8F5, x-amz-id-2: WS4wp1NQct07M27rc8EUappYwQ9OF0RMJmEtPJl4pbeQZBgL71Ts3pTS0j/h9vmGVFALomFrCG0=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 08 Jul 2015 02:25:44 GMT, Connection: close, Server: AmazonS3]
    WARN 10:44:27,927 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately -1122 seconds. Retrying connection.
    INFO 10:44:29,256 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalStateException: Key OLD_VARIANT found in VariantContext field INFO at X:7151130 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
    at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176)
    at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222)
    at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182)
    at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:271)
    at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:774)
    at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:288)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Key OLD_VARIANT found in VariantContext field INFO at X:7151130 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
    ERROR ------------------------------------------------------------------------------------------

    Please help! Am I doing this right?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @rose_ismet
    Hi,

    This could be a bug. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/3962/genotypeandvalidate-error-key-callstatus-found-in-variantcontext-field-info

    Can you please post the vcf record at position X:7151130?

    Thanks,
    Sheila

  • rose_ismetrose_ismet MalaysiaMember

    Hi Shiela,

    Thank you for the promp reply. I'll try to read through the forum to gather more info on this issue.

    At the meantime, here's the vcf record for the position requested.

    X 7151130 . TT TC 100 PASS AC=1;AF=0.000264901;AN=3775;NS=2504;OLD_VARIANT=X:7151131:TC/CC/C;DP=13746;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;EAS_AF=0.001;VT=MNP

    Anything out of the ordinary there?

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    Hi,
    My VCF file contain both INDELs and SNPs, Is that genotype refinement can only be applied on SNPs but not INDELs present in the file?

    I am sorry for my stupid question.. But i am confused. because --supporting file (ALL.phase3.20130502.biallelic_snps.integrated.sites.vcf) contain snps.

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @rose_ismet
    Hi,

    I am not sure why you are having this error if you downloaded from the 1000Genomes website. I just downloaded the X chromosome vcf and it has OLD_VARIANT defined in the header.

    The reason you are getting the error is because OLD_VARIANT is not defined in the vcf header. You can simply add in a line to define it yourself, and that should solve the problem.

    -Sheila

    P.S. The header line from the vcf I downloaded is ##INFO=<ID=OLD_VARIANT,Number=.,Type=String,Description="Original before vt normalize was run. FORMAT chr:pos:ref:alt">

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @MUHAMMADSOHAILRAZA
    Hi,

    Yes, as of now, the genotype refinement workflow works on SNPs only.

    You don't need to select for SNPs only. You can simply input your recalibrated vcf with everything.

    -Sheila

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    Sheila thank you...!

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    Hi,

    After following all the steps i finally got the "recalibratedVariants.postCGP.Gfiltered.deNovos.vcf" file. Is there any way in GATK that i can extract and filter only "hiConfDeNovo" (high confidence De novo mutation)and "loConfDeNovo" (low confidence de novo mutations) mutations?

    i just utilized the grep -w "hiConfDeNovo" recalibratedVariants.postCGP.Gfiltered.deNovos.vcf >> hiConfDeNovo.vcf to extract the corresponding records, is there another good way?

    Thanks!

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    My another question is:

    If we Phased the VCF file by PhaseByTransmission (i.e. after step 2, before using GenotypeAnnotator), how the tool effect on Genotypes information, and is that affect lator de novo mutation discovery in step 3 ? and if we utilize other GATK tools for downstream analysis, are they automatically recognize updated GQ values and ignore PLs?

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember
    edited July 2015

    Hi @Sheila @Geraldine_VdAuwera

    I am waiting for your kind reply.....

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @MUHAMMADSOHAILRAZA
    Hi,

    For your first question, you can use Select Variants. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

    For your second question, this thread should help: http://gatkforums.broadinstitute.org/discussion/5573/some-queries-on-phasebytransmission-output

    The GQ field is updated in the output VCF, so it will be recognized by downstream tools. Because the PLs are not changed, the downstream tools will use the original PLs. As for ignoring PLs, it depends on which tools you use. If the PLs are taken into account, the original PLs will be used.

    I hope this helps to clarify things.

    -Sheila

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    @Sheila

    Do PhaseByTransmission tool use GQ info?

  • rose_ismetrose_ismet MalaysiaMember

    Hi Shiela,

    Thanks for the tips. Finally managed to get the mising header line into the file that I have. Also managed to seperate out the biallelic SNPs.

    Now my issue is that I have used the hg19 as a referance. And i believe that the SNPs from Phase 3 of the 1000 Genomes project are of a GRCh build. I can't seem to be able to insert 'chr' to the chromosomes. I wonder if GATK provide this referance in hg19 format. or would there be any other alternative for analysis done using hg19 to derive posterior probabilities?

    Coming from a biological background you can imagine the dilleme I am having.

    Thank you for your time.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @MUHAMMADSOHAILRAZA
    Hi,

    Sorry for the late response! Yes. Phase By Transmission uses GQ.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
  • nchuangnchuang Member
    edited July 2015

    @Sheila

    I also am getting the OLD_VARIANT error. It is strange because running SelectVariants on dbsnp142 and 1kG_phase3.snps vcfs I didn't get this error. I only got it trying to use CombineVariants on them.

    I found the entry, it was in the 1kG_phase3.snps.b37.vcf. I think I got that from the resource bundle. I will try inserting the header.

    Nelson

    Post edited by nchuang on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @nchuang
    Hi Nelson,

    You can either add the OLD_VARIANT to the header, or check again in the resource bundle for 1000G_phase3_v4_20130502.sites.vcf. The file was recently added and should work properly.

    -Sheila

  • ahmed_chakrounahmed_chakroun TunisiaMember

    Hi,

    Just to point out a mistake on the Step 2 output file description. Indeed, at the end of the Step 2 your can read "Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ in the FILTER field.". However, -G_filter through "VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag)."

    Cheers

    Ahmed

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ahmed_chakroun
    Hi Ahmed,

    I am afraid I don't understand your post. What exactly needs to be changed in the document? Posting some before and after records might help too.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for reporting this, @ahmed_chakroun. You are correct, we meant the "genotype filter field" which of course corresponds to the FT tag in FORMAT, not the site-level FILTER field. I'll fix that now.

  • gkphilipgkphilip Member

    Hello,

    I'm trying to lift over the 1000G_phase3_v4_20130502.sites.vcf from the resource bundle to hg19 with Picard 2.1.0
    using the command:
    java -jar /usr/local/picard/2.1.0/lib/picard.jar LiftoverVcf I=1000G_phase3_v4_20130502.sites.vcf O=1000G_phase3_v4_20130502.sites.hg19.vcf CHAIN=b37tohg19.chain REJECT=liftover.rejected_variants.vcf R=hg19/ucsc.hg19.fasta.

    I get the following error:

    [Thu Feb 18 12:28:24 EST 2016] picard.vcf.LiftoverVcf INPUT=1000G_phase3_v4_20130502.sites.vcf OUTPUT=1000G_phase3_v4_20130502.sites.hg19.vcf CHAIN=b37tohg19.chain REJECT=liftover.rejected_variants.vcf REFERENCE_SEQUENCE=hg19/ucsc.hg19.fasta WARN_ON_MISSING_CONTIG=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Thu Feb 18 12:28:24 EST 2016] Executing as [email protected] on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater
    INFO 2016-02-18 12:28:24 LiftoverVcf Loading up the target reference genome.
    INFO 2016-02-18 12:28:40 LiftoverVcf Lifting variants over and sorting.
    ERROR 2016-02-18 12:28:40 LiftoverVcf Encountered a contig, chr1 that is not part of the target reference.
    [Thu Feb 18 12:28:40 EST 2016] picard.vcf.LiftoverVcf done. Elapsed time: 0.26 minutes.
    Runtime.totalMemory()=5031067648
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

    I have checked that chr1 is in the target reference (it was downloaded from the resource bundle), and have tried searching the forum for similar issues but don't seem to find any. Is it possible to liftover this file, or do you have any other suggestions on how I can overcome this issue?

    Many thanks for your help.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @gkphilip
    Hi,

    I think your question is being answered here.

    -Sheila

  • hujingchuhujingchu Member

    hi,
    I have a trio(parents and a child) exome data which is going to call de novo variants, what i have done is following the best practices to get variants, during the filtering step, SNPs were using a VQSR, but Indels were using a hard filter since insufficient amount,.
    before running CalculateGenotypePosteriors, both VQSR snps and hard filter indels were cat together.
    the question is, after CalculateGenotypePosteriors, snps were all removed from output without any warnings, everything is ok if only run on SNPs set.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @hujingchu
    Hi,

    What do you mean "after CalculateGenotypePosteriors, snps were all removed from output without any warnings"? Can you post the exact command you ran?

    Thanks,
    Sheila

  • helenehelene LAMember

    Hello,

    I found that the 1000G_phase3_v4_20130502.sites.vcf file under b37 folder in your bundle. No such file in hg19 folder. Since my reference genome has been hg19, does that mean I would need to realign my files? Thank you.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @helene,

    I believe our resource bundles are provided as is. If the hg19 folder doesn't contain the resource, then please check with the 1000 Genomes Project to see if they have the equivalent.

  • zejunyanzejunyan EdinburghMember

    Hi all,

    I am trying to run a trio containing 3 samples (dad, mum and one child) with the aim to identify mutations in child. I managed to get the final vcf file: recalibratedVariants.postCGP.Gfiltered.deNovos.vcf. Based on the statement that high confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion, I filtered out these callsets with GQ >=20 in each trio sample. However, there is still more than 0.5 million snps as potential mutations, which cannot be true. I am looking for less than 100 snps as mutations.

    Am I doing something wrong? Could anyone give me some suggestions?

    Bests

    Dr yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @zejunyan
    Hi Dr. Yan,

    Are you seeing 0.5 million high confidence de novo mutations when you expect to see ~100? If so, can you please post the exact commands you ran to produce the final VCF with de novos?

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Would also be good to check for sample swaps, and verify paternity...
  • zejunyanzejunyan EdinburghMember

    @Sheila
    Yes, I see a lot more than expected. The command-lines are :

    java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -T CalculateGenotypePosteriors --supporting ../../../../snp_db/vcf_chr_1-28_30_32.vcf.gz -ped trio.ped -V genotyped.trio.cohort.g.vcf -o recalibratedVariants.postCGP.vcf
    java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -T VariantFiltration -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf
    java -jar ../../../../../zyan/tools/GenomeAnalysisTK.jar -T VariantAnnotator -R ../../../../../zyan/pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf
    I used hard-filtered snp (genotyped.trio.cohort.g.vcf) as input.

    I followed the documentation.

    Is there something wrong with these three command-lines??

    Thank you very much

    Dr yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @zejunyan
    Hi Dr. Yan,

    No, those look fine. Have a look at Geraldine's suggestion above as well.

    -Sheila

  • VergiliusVergilius ItalyMember
    I was wondering whether I have a Joint analysis of X samples containing Y trios. Should I run a genotype refinement for each trio (using a single ped for the single family) or should I run the GRW for all the samples together and using in input a general PED file contianing information of all the families?
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Vergilius
    Hi,

    I think you can input a general ped file containing information for all trios.

    -Sheila

  • VergiliusVergilius ItalyMember

    @Sheila
    Ok! It's what I did. Looks to work fine. Thanks

  • mcvumcvu IrelandMember

    Hello,

    I am working with hg19 reference, and so the contigs in 1000G_phase3_v4_20130502.sites.vcf are not compatible. Will it make much difference to use 1000G_phase1.snps.high_confidence.hg19.sites.vcf from the bundel? Considering it's phase 1 not phase 3?

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited November 2017

    @mcvu
    Hi,

    Yes, that is the file you should use. The 1000G_phase3_v4_20130502.sites.vcf file is for use with b37 reference (which is what we use in our example commands) :smile:

    -Sheila

  • ccs76ccs76 Member

    Hello

    Thanks so much for those workflows :)

    I'm pretty new in the materia, if I understand right this workflow is for VQSR filtering, which one is in case of apply hard filtering? Im interested on call the novo mutations in my trio.

    Thanks

  • ccs76ccs76 Member

    Hello again

    I have another question about de novo variants callers and somatic caller (atm no one answered me in the forums), maybe this is not the right place and its weird what I do. If I merge the parental bam files and I consider them as normal cell against the child (consider as tumor) those somatic variants results shouldnt be similar as if I use directly the bam files from the parents and child with de novo trio callers tools? In my case I used VarScan tools. Im trying to learn now using GATK.

    Thanks

Sign In or Register to comment.