Looking for gnomad-common-biallelic file.

RTM, Singapore, Member
edited March 27 in Ask the GATK team

Hello everyone!

I have exome sequencing data for 20 pairs of primary and matched metastatic tumors from different patients. However, I have no normal samples, so I am following the TCGA tumor-only workflow.

Step 2 requires this file: af-only-gnomad-common-biallelic.grch38.main.vcf.gz but I am only able to find af-only-gnomad.hg38.vcf.gz.

Is it the same file? If not, does anyone know where to get the gnomad-common-biallelic file?

Thanks a lot in advance!!

Answers

  • bshifaw Member, Broadie, Moderator admin

    It's probably not the same file. From the GetPileupSummaries documentation:

    "The tool requires a common germline variant sites VCF, e.g. the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF."

    So you'll need to provide a VCF file with only biallelic sites. Here is some info on generating this file:

    "The WDL script mutect_resources.wdl takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in section 1 and a common biallelic variants resource for use in section 3. The script first generates a sites-only VCF and in the process removes all extraneous annotations except for AF allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF."

    (source)
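
    To make "only biallelic SNPs" concrete at the record level, here is a minimal awk sketch. It is not what mutect_resources.wdl does (that script uses GATK tools); the function name keep_biallelic_snps is hypothetical. It assumes an uncompressed, tab-separated VCF on stdin in which a multiallelic site appears as one record with comma-separated ALT alleles. If the sites were previously split into one record per allele, this check will not detect them.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical helper: keep header lines plus records that are biallelic SNPs.
    # Assumes a plain-text, tab-separated VCF on stdin.
    keep_biallelic_snps() {
      awk -F'\t' '
        /^#/ { print; next }                 # pass header lines through unchanged
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/   # single-base REF and exactly one single-base ALT
      '
    }

    # Example (file names are placeholders):
    # zcat af-only-gnomad.hg38.vcf.gz | keep_biallelic_snps > biallelic_snps.vcf
    ```

    With GATK itself, SelectVariants has --select-type-to-include SNP and --restrict-alleles-to BIALLELIC for the same purpose; check the tool documentation for your version before relying on them.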

  • 2904359495 Member

    @bshifaw Hi, is there an hg19 af-only-gnomad VCF? I think liftover may not be a good choice.

  • davidben Boston, Member, Broadie, Dev ✭✭✭

    @RTM The resource files are in the GATK resource bucket: gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf and gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf. The small exac vcf is used in the contamination part of the workflow.

    By the way, the GDC Mutect2 instructions are badly out of date and probably incorrect. We recommend using official GATK tutorials.

    Also, as of GATK 4.1 you can jointly call a primary tumor and metastasis with Mutect2. Just run it with gatk Mutect2 -R ref.fasta -I primary.bam -I metastasis.bam --germline-resource af-only-gnomad.raw.sites.vcf . . .

  • tony Member ✭✭

    Hi,

    When processing gnomAD files obtained from the official GCS bucket, should we also keep only PASS variants when building the resource?

    I ask because grep -v "^#" ${input_vcf} | sed -e 's#\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t\(.*\)\t.*;AF=\([0-9]*\.[e0-9+-]*\).*#\1\t\2\t.\t\4\t\5\t.\t\7\tAF=\8#g' > simplified_body & in the WDL does not seem to select PASS variants only. Nevertheless, the resources found in the bundle contain only PASS variants.

    Thanks,
    Anthony

  • davidben Boston, Member, Broadie, Dev ✭✭✭

    @tony, yes, you should take only PASS variants. This occurs elsewhere in the WDL.
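
    As an illustration only, the PASS filter and the AF-only simplification can be combined in one pass. This is a sketch, not the WDL's actual code; the function name simplify_pass_af is hypothetical. It assumes an uncompressed, tab-separated, sites-only VCF on stdin. Like the sed command quoted above, it blanks ID and QUAL and reduces INFO to AF, but it also drops non-PASS records and records without an AF annotation, and it passes header lines through unchanged.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical helper: keep PASS records and reduce INFO to the AF annotation.
    # Assumes a plain-text, tab-separated, sites-only VCF on stdin.
    simplify_pass_af() {
      awk -F'\t' -v OFS='\t' '
        /^#/ { print; next }          # header lines pass through unchanged
        $7 != "PASS" { next }         # drop records whose FILTER is not PASS
        {
          af = ""
          n = split($8, kv, ";")      # INFO is a semicolon-separated key=value list
          for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^AF=/) { af = kv[i]; break }   # exact AF key, not AF_popmax etc.
          }
          if (af == "") next          # no AF annotation: skip the record
          print $1, $2, ".", $4, $5, ".", $7, af
        }
      '
    }

    # Example (file names are placeholders):
    # zcat gnomad.sites.vcf.gz | simplify_pass_af > af_only_pass.vcf
    ```

    Splitting INFO on ";" rather than matching ";AF=" with a regex also handles the case where AF is the first INFO key.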

  • 2904359495 Member

    @davidben wrote: "@RTM The resource files are in the GATK resource bucket: gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf and gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf. The small exac vcf is used in the contamination part of the workflow."

    So is it wrong to use gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf for both the Mutect2 command and the contamination step?

    thanks a lot

  • AdelaideR Member admin

    @2904359495

    Both commands will accept this file.

  • 2904359495 Member

    @AdelaideR Thanks a lot, but @davidben said to use /somatic-b37/small_exac_common_3.vcf for contamination. Which file is better, and what is the difference?

  • davidben Boston, Member, Broadie, Dev ✭✭✭

    What @AdelaideR said is correct: both commands accept the gnomAD vcf. We recommend the small ExAC vcf in the contamination part of our pipeline as a speed optimization because exonic sites alone are enough for a good estimate of contamination. However, you could use the AF-only gnomAD instead if you wanted to trade speed for accuracy.
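
    For reference, the contamination step might look like the sketch below, where the sites resource is just an argument, so either the small ExAC VCF or the AF-only gnomAD VCF can be passed. The wrapper names run and estimate_contamination, the DRY_RUN switch, and all file names are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

    ```shell
    #!/usr/bin/env bash
    # DRY_RUN=1 (the default here) prints each command instead of executing it.
    DRY_RUN="${DRY_RUN:-1}"
    run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

    # Hypothetical wrapper: summarize pileups at the resource sites, then estimate contamination.
    # $1 = tumor BAM, $2 = common biallelic sites VCF, $3 = output prefix
    estimate_contamination() {
      local bam="$1" sites="$2" prefix="$3"
      # The sites VCF is used both as the variant resource (-V) and as the intervals (-L).
      run gatk GetPileupSummaries -I "$bam" -V "$sites" -L "$sites" -O "${prefix}.pileups.table"
      run gatk CalculateContamination -I "${prefix}.pileups.table" -O "${prefix}.contamination.table"
    }

    # Example with placeholder file names:
    # estimate_contamination tumor.bam small_exac_common_3.vcf tumor
    ```

    Swapping the second argument for the gnomAD file is the only change needed to switch resources.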

  • 2904359495 Member

    @davidben
    Accuracy is my priority; speed is secondary. Thanks a lot.

  • davidben Boston, Member, Broadie, Dev ✭✭✭

    In theory, you would also want to use the gnomAD resource if you had a targeted panel that didn't overlap the exome.

  • 2904359495 Member

    @davidben
    Thanks a lot,
    you are really a very thoughtful person