Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

How to create and collect PONs using Gatk (4.1.2.0)

EsraEsra Member
edited May 11 in Ask the GATK team
Hi GATK team,
I am using last version of Gatk (4.1.2.0) for Somatic short variant discovery pipeline. I have some problems at Pon creation part using CreateSomaticPanelOfNormal fuction. When I try following commonds it returns A USER ERROR has occurred: v is not a recognized option

Create PONs: gatk Mutect2 -R ref.fasta -I normal1.recalib.bam -normal normal1 -disable-read-filter MateOnSameContigOrNoMappedMateReadFilter -L targets.interval_list -O normal1.vcf.gz

Collect PONs: gatk CreateSomaticPanelOfNormals -vcfs normal1.vcf.gz -vcfs normal2.vcf.gz -vcfs normal3.vcf.gz -O all_pon.vcf.gz
I also tried -V instead of -vcfs but I took the same error.

Also I tried following commonds:

Step 1. Run Mutect2 in tumor-only mode for each normal sample.
gatk Mutect2 -R reference.fasta -I normal1.bam -O normal1.vcf.gz

Step 2. Create a GenomicsDB from the normal Mutect2 calls.
gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz \
-V normal2.vcf.gz \
-V normal3.vcf.gz

Step 3. Combine the normal calls using CreateSomaticPanelOfNormals.
gatk CreateSomaticPanelOfNormals -R reference.fasta -V gendb://pon_db -O pon.vcf.gz

And it returned " Warning: CreateSomaticPanelOfNormals is a BETA tool and is not yet ready for use in production".

Should I use one of the older versions of GATK?

Could you suggest any commonds that I can use for gatk-4.1.2.0 version to create PONs?

Regards,

Answers

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @Esra
    The warning indicates that the tool is still in BETA phase but the error is just to let the user know about this status. If you check the output, assuming the syntax and inputs were done correctly, you should be able to see the output.

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Hi @Esra

    You can follow along in the mutect2 tutorial which has example commands: (How to) Call somatic mutations using GATK4 Mutect2. However the best place for up to date commands would be the tool documentation for CreateSomaticPanelOfNormals.

    If you are still having problems then copy and paste the exact command you used along with the full error message you received.

  • alanhoylealanhoyle UNC Lineberger Member
    edited May 22

    I did the same thing and was able to create a PON, with the same warning. However, I have a couple of questions. I'm running the GATK 4.1.2.0 Docker image on a MacBookPro, allocating 2 CPUs, 8 GiB of RAM, and 1 GiB swap. (I did the variant calling on a cluster node, but I thought generating the PON should be fast.)

    I started with 3 artifact VCFs created by running mutect2 with --map-mnp-distance 0 on the GDC GRCH38.d1.vd1 reference. The bams were from targeted capture, and the number of variants listed in the resulting VCFs were 650K, 820K, and 960K.

    The GenomicsDBImport creation took 175 minutes to import the three VCFs. Then I ran the CreateSomaticPanelOfNormals on the DB, and it ran overnight and took 1,028 minutes, and threw many thousands of warnings like the following:

    WARNING: No valid combination operation found for INFO field ECNT - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field GERMQ - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MBQ - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MFRL - the field will NOT be part of INFO fields in the generated VCF records
    

    The resulting PON.VCF only appears to contain 3524 data lines.

    Is this level of performance expected? Are there tuning parameters I should be using? I have access to hefty machines, but, again, I was expecting this to be a fast/light process.

  • bshifawbshifaw moonMember, Broadie, Moderator admin
    edited May 23

    @alanhoyle

    If you followed the commands in the Mutect2 tutorial than you should be fine. The parameters that are available for tuning are listed in the tool documentation. If you have access to a hefty machine and would like to speed up execution time, you could try mutect2_pon.wdl workflow which scatters CreateSomaticPanelOfNormals task using intervals and merges the scattered vcfs into a single pon vcf.

    Regarding the warning: Here

    Post edited by bshifaw on
  • alanhoylealanhoyle UNC Lineberger Member

    The Mutect2 tutorial you linked to does not work with the current version (4.1.2.0) of the GATK.

    The example commands listed there are:

    gatk CreateSomaticPanelOfNormals \
    -vcfs 3_HG00190.vcf.gz \
    -vcfs 4_NA19771.vcf.gz \
    -vcfs 5_HG02759.vcf.gz \
    -O 6_threesamplepon.vcf.gz
    

    throws an error with gatk 4.1.1.0 and 4.1.2.0:
    A USER ERROR has occurred: v is not a recognized option
    or if you change those to "-V" it says:
    A USER ERROR has occurred: Argument '[V, variant]' cannot be specified more than once.

    If I step back to gatk 4.1.0.0, I can get it to start running, but it dies almost immediately because of this error, I'm guessing because the I did the variant calling based on the 4.1.2.0 PON instructions:

    21:09:44.954 WARN  CreateSomaticPanelOfNormals - Some input files don't seem to be .vcf or .args files.  Make sure that any input vcfs list end in .args.
    21:09:45.015 INFO  CreateSomaticPanelOfNormals - Shutting down engine
    [May 22, 2019 9:09:45 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.CreateSomaticPanelOfNormals done. Elapsed time: 0.04 minutes.
    Runtime.totalMemory()=235405312
    htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: sample1_tumor_only_vcf.gz, for input source: file:///sample1_tumor_only_vcf.gz
        at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:263)
        at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:102)
        at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:127)
        at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:120)
    

    The commands I'm running are only slightly modified from the 4.1.2.0 PON instructions, in that I add an additional --max-mnp-distance 0 to the tumor-only Mutect2 call as GenomicsDBImport can't seem to handle the VCFs without it.

    It just surprises me that this was a "run all night" activity rather than a quick filter activity.

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Your right, looks like the tutorial needs to be updated. No need to use the older version, you should stick with the latest.

    How large are your input vcfs and are these wgs or exomes? I can check with the team if the duration is abnormal. You may want to double check the vcfs with ValidateVariants as a precaution to make sure there's nothing wrong with the files.

  • EsraEsra Member
    Hi,
    I really struggle the last version of GATK and definitely tutorial needs to be updated. Actually to save time I started to use older version GATK (4.0.1.2) using Docker and I don't take any error. So if you don't have so much time to finish analysis like me, I can suggest you that way.
  • alanhoylealanhoyle UNC Lineberger Member
    edited May 23

    @bshifaw said:
    Your right, looks like the tutorial needs to be updated. No need to use the older version, you should stick with the latest.

    How large are your input vcfs and are these wgs or exomes? I can check with the team if the duration is abnormal. You may want to double check the vcfs with ValidateVariants as a precaution to make sure there's nothing wrong with the files.

    The three VCFs are 20-25 MiB in size gzipped, with 650K to 960K variants per VCF. The sequencing is targeted sequencing with less than 1K genes in the capture (~ 3 megabases in the target). ValidateVariants reports nothing with any of the three VCFs.

    The resulting pon.vcf.gz is 84KiB gzipped, and only contains 3524 variant lines. I know the recommended number of source VCFs for a PON is 40+, but I'm just running tests right now.

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    @alanhoyle ,

    A new version of the Mutect2 tutorial has been released for GATK4.1.1.0 or later, you can find it here
    Talk to the dev team on performance and they mentioned

    what the user is seeing does not seem normal. However, he's seeing 650K-960K variants in a targeted screen with 3Mb. This is an extraordinary number of variants (1/3 to 1/4 bases mutated) which is almost certainly the source of the problem.

    You may see some improvements in performance if you use differently sourced samples for the panel of normal.

  • alanhoylealanhoyle UNC Lineberger Member

    Thank you for updating the tutorial.

    I notice that the tumor-only (artifact) VCFs that are being input to the PON generation step contain a large number of calls that are based on a single-digit number of reads. In one sample, I noticed that of the 960K calls, 476K of them were "DP=1" and another 291K were "DP=2" If I filtered down to DP>5, it would have less than 40K calls, and DP>9 would result in only 21K calls.

Sign In or Register to comment.