Forum Login Issue:
Currently the "Log in with Google" button redirects you to a "Page not found." This is an issue that our forum vendors are working on fixing. In the meantime, while on the "Page not found" you can edit the URL to delete the second gatk, firecloud, or wdl (depending on what subforum you are acessing).
ex: https://gatkforums.broadinstitute.org/gatk/gatk/entry/...

Use Select Variants on a gnomAD vcf for Mutect2 contamination filtering.

I am trying to follow this set of steps to use the mutect2 wdl from the Broad.

https://github.com/broadinstitute/gatk/tree/master/scripts/mutect2_wdl

In that file, it is recommended to make a variants_for_contamination file, to filter out contaminating reads.

The command that is requested to run is here:
java -jar $gatk SelectVariants -V gnomad.vcf -L 1 --select "AF > 0.05" -O variants_for_contamination.vcf

I first got gnomad by going here: http://gnomad.broadinstitute.org/downloads
and getting the vcf for exomes, as I believe was instructed.

I pull the gatk container:
sudo docker pull broadinstitute/gatk:latest

and in that container run:
java -jar /gatk/gatk.jar SelectVariants -V gnomad.exomes.r2.0.2.sites.vcf.bgz -L 1 --select "AF > 0.05" -O variants_for_contamination.vcf

The output that I get is:

02:50:34.291 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.beta.6-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
[December 16, 2017 2:50:33 AM UTC] SelectVariants --output variants_for_contamination.vcf --selectExpressions AF > 0.05 --variant gnomad.exomes.r2.0.2.sites.vcf.bgz --intervals 1 --invertSelect false --excludeNonVariants false --excludeFiltered false --preserveAlleles false --removeUnusedAlternates false --restrictAllelesTo ALL --keepOriginalAC false --keepOriginalDP false --mendelianViolation false --invertMendelianViolation false --mendelianViolationQualThreshold 0.0 --select_random_fraction 0.0 --remove_fraction_genotypes 0.0 --fullyDecode false --maxIndelSize 2147483647 --minIndelSize 0 --maxFilteredGenotypes 2147483647 --minFilteredGenotypes 0 --maxFractionFilteredGenotypes 1.0 --minFractionFilteredGenotypes 0.0 --maxNOCALLnumber 2147483647 --maxNOCALLfraction 1.0 --setFilteredGtToNocall false --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES false --SUPPRESS_REFERENCE_PATH false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 16, 2017 2:50:33 AM UTC] Executing as root@ae5a49b74378 on Linux 4.8.0-59-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Version: 4.beta.6-SNAPSHOT
02:50:35.167 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 5
02:50:35.167 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
02:50:35.168 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
02:50:35.168 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
02:50:35.168 INFO SelectVariants - Deflater: IntelDeflater
02:50:35.169 INFO SelectVariants - Inflater: IntelInflater
02:50:35.169 INFO SelectVariants - GCS max retries/reopens: 20
02:50:35.170 INFO SelectVariants - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
02:50:35.171 INFO SelectVariants - Initializing engine
02:50:35.925 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz
02:50:36.143 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz
02:50:36.220 WARN IndexUtils - Feature file "/mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
02:50:36.341 WARN IndexUtils - Index file /mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz.tbi is out of date (index older than input file). Use IndexFeatureFile to make a new index.
02:50:36.361 INFO SelectVariants - Shutting down engine
[December 16, 2017 2:50:36 AM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=422051840


A USER ERROR has occurred: Badly formed genome unclippedLoc: Parameters to GenomeLocParser are incorrect:The stop position 0 is less than start 1 in contig 1


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

I have no idea what is going on here. I am just trying to follow the instructions to use the provided wdl.

Any help would be great!

Best Answer

Answers

  • dannykwellsdannykwells San FranciscoMember

    I wanted to update:

    I downloaded a new version of gnomad that is part of the Broad Bundle here: ftp://ftp.broadinstitute.org/bundle/beta/Mutect2/

    I added a reference genome (b37) and re-ran the command:

    java -jar /gatk/gatk.jar SelectVariants -V af-only-gnomad.raw.sites.b37.vcf.gz -L 1 -O variants_for_contamination.vcf -R Homo_sapiens_assembly19.fasta --select "AF>0.05"

    In this case, I get the error:
    `root@abe5a8aedbc3:/mnt# java -jar /gatk/gatk.jar SelectVariants -V af-only-gnomad.raw.sites.b37.vcf.gz -L 1 -O variants_for_contamination.vcf -R Homo_sapiens_assembly19.fasta --select "AF>0.05"
    22:14:12.822 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.beta.6-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [December 17, 2017 10:14:12 PM UTC] SelectVariants --output variants_for_contamination.vcf --selectExpressions AF>0.05 --variant af-only-gnomad.raw.sites.b37.vcf.gz --intervals 1 --reference Homo_sapiens_assembly19.fasta --invertSelect false --excludeNonVariants false --excludeFiltered false --preserveAlleles false --removeUnusedAlternates false --restrictAllelesTo ALL --keepOriginalAC false --keepOriginalDP false --mendelianViolation false --invertMendelianViolation false --mendelianViolationQualThreshold 0.0 --select_random_fraction 0.0 --remove_fraction_genotypes 0.0 --fullyDecode false --maxIndelSize 2147483647 --minIndelSize 0 --maxFilteredGenotypes 2147483647 --minFilteredGenotypes 0 --maxFractionFilteredGenotypes 1.0 --minFractionFilteredGenotypes 0.0 --maxNOCALLnumber 2147483647 --maxNOCALLfraction 1.0 --setFilteredGtToNocall false --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES false --SUPPRESS_REFERENCE_PATH false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
    [December 17, 2017 10:14:12 PM UTC] Executing as root@abe5a8aedbc3 on Linux 4.8.0-59-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Version: 4.beta.6-SNAPSHOT
    22:14:12.949 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 5
    22:14:12.949 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    22:14:12.950 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
    22:14:12.950 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    22:14:12.950 INFO SelectVariants - Deflater: IntelDeflater
    22:14:12.950 INFO SelectVariants - Inflater: IntelInflater
    22:14:12.950 INFO SelectVariants - GCS max retries/reopens: 20
    22:14:12.950 INFO SelectVariants - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    22:14:12.950 INFO SelectVariants - Initializing engine
    22:14:13.339 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/af-only-gnomad.raw.sites.b37.vcf.gz
    22:14:13.436 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/af-only-gnomad.raw.sites.b37.vcf.gz
    22:14:13.526 WARN IndexUtils - Feature file "/mnt/af-only-gnomad.raw.sites.b37.vcf.gz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
    22:14:13.647 INFO IntervalArgumentCollection - Processing 249250621 bp from intervals
    22:14:13.649 WARN IndexUtils - Feature file "/mnt/af-only-gnomad.raw.sites.b37.vcf.gz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
    22:14:13.738 INFO SelectVariants - Done initializing engine
    22:14:13.795 WARN IndexUtils - Feature file "/mnt/af-only-gnomad.raw.sites.b37.vcf.gz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
    22:14:13.891 INFO ProgressMeter - Starting traversal
    22:14:13.891 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    22:14:13.919 INFO SelectVariants - Shutting down engine
    [December 17, 2017 10:14:13 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.02 minutes.
    Runtime.totalMemory()=1128792064


    A USER ERROR has occurred: Invalid JEXL expression detected for select-0
    See https://www.broadinstitute.org/gatk/guide/article?id=1255 for documentation on using JEXL in GATK


    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
    `

    Any help would be appreciated.

  • dannykwellsdannykwells San FranciscoMember

    Ok, I think I figured this out: it appears that SelectVariants has an undocumented bug that does not allow it to be applied to VCF with multi-allelic sites, like are in all of the gnomad files. So these need to be split with, say, bcftools norm -m . Then things appear to be working.

    Does that agree on your side?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @dannykwells,

    That sounds like it could be the case as I do use bcftools norm myself to get around the multiallelic issue. Sorry to inform you after all of your efforts but there is an already processed gnomad file for use towards contamination estimation. It is under beta>GetPileupSummaries. Do you think it would be helpful for the file small_exac_common to be moved to under beta>Mutect2 instead?

  • dannykwellsdannykwells San FranciscoMember

    Hi @shlee I think it would likely help to have it in the mutect2 folder (and maybe with a pointer in the docs to it to say what it can be used for).

    Another question (somewhat related, fine asking a new question if necessary): some of these scripts require a "common sites" file in the style of picard. Is there an easy way to convert small_exac_common to work with these programs, like CollectAllelicCounts? Or is there a pre-generated list that would work?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @dannykwells,

    Our workshop tutorial that covers Mutect2 describes these files. We've yet to put these on the forum or include READMEs for these beta workflows. I've made a copy of the GetPileupSummaries folder in the Mutect2 folder to help users like yourself.

    To convert a VCF to a Picard-style intervals list, use Picard IntervalListTools. You can also try providing the VCF directly through the -L param. Most GATK tools know to access the VCF coordinates as intervals. I think last I checked CollectAllelicCounts did not but perhaps this has been updated.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @dannykwells To be clear we don't plan to co-locate data/resource-type files with scripts, for several reasons, mainly of logistics. However we do make fully worked-out example pipelines available in workspaces on our cloud-based platform, FireCloud. These are completely free to access (you just need to login with a google identity) and we even have a free credits program if you want to take the pipelines out for a spin with minimal effort. You can learn more about this here: https://software.broadinstitute.org/firecloud/documentation/freecredits

  • dannykwellsdannykwells San FranciscoMember

    @shlee Incredible! Thank you!
    @Geraldine_VdAuwera understood. I think the WDL provided above will help a lot. In general we work on our own cloud (still GCP) to have tight integration with downstream workflow, but FireCloud is an amazing resource. I have "borrowed" a non-trivial amount of code from WDLs found there.

Sign In or Register to comment.