How to relax parameters in GenotypeGVCFs to get more variants?

ManishManish IndiaMember

Hi,

I am using GenotypeGVCFs in following command,

java -Xmx2g -jar CancerAnalysisPackage-2015.1-3/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Refgenome_ucsc/ucsc.hg19.fasta --variant chrn.g.vcf --out chrn
.rawVariants.vcf -L chrn.bed --interval_padding 100 --disable_auto_index_creation_and_locking_when_reading_rods -nt 4

no of variants in chrn.g.vcf = 12248378
no of variant in chrn.rawVariants.vcf = 85579

Is there any way to get more variants as i miss lot of vairants during this step? Please help me out as i am in great need.

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    You don't miss anything. GVCF contains variants + blocks of homrefs. raw variants include only homvars or hetvars.

  • ManishManish IndiaMember

    @SkyWarrior said:
    You don't miss anything. GVCF contains variants + blocks of homrefs. raw variants include only homvars or hetvars.

    I can find my variant at gvcf level but miss at raw variant level. How can i find the same in out (Raw Variants).

  • ManishManish IndiaMember

    Like in this example
    chr1 27088546 . A T, 29.77 . BaseQRankSum=0.358;ClippingRankSum=0.358;DP=5;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=0.358;ReadPosRankSum=0.358 GT:AD:DP:GQ:PL:SB 0/1:3,2,0:5:58:58,0,100,68,106,174:3,0,2,0 is present at gvcf level but not present in output (Raw Variants).

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    That variant does not seem to be too confident however can you try running GenotypeGVCFs with --new-qual parameter? That parameter helps recovering singlets in gvcfs.

  • ManishManish IndiaMember

    @SkyWarrior said:
    That variant does not seem to be too confident however can you try running GenotypeGVCFs with --new-qual parameter? That parameter helps recovering singlets in gvcfs.

    Do i need to pass some value here with this parameter? As i am getting following ERROR MESSAGE: Argument with name 'new-qual' isn't defined

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    sorry for the double dashes. Use -new-qual or -new-qual true.

  • ManishManish IndiaMember

    i tried -new-qual and -new-qual true but i am getting same > @SkyWarrior said:

    sorry for the double dashes. Use -new-qual or -new-qual true.

    I tied both but
    java -Xmx4g -jar CancerAnalysisPackage-2015.1-3/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Refgenome_ucsc/ucsc.hg19.fasta --variant chrn.g.vcf -new-qual true --out chrn.paddingXrawVariants.vcf -G -nt 32

    and
    java -Xmx4g -jar CancerAnalysisPackage-2015.1-3/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Refgenome_ucsc/ucsc.hg19.fasta --variant chrn.g.vcf -new-qual --out chrn.paddingXrawVariants.vcf -G -nt 32

    But i am getting follwing error message.

    ERROR MESSAGE: Argument with name 'new-qual' isn't defined.
  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited May 2018

    you may need to check your command line once again. It is possible that whitespace is not being recognized properly. try typing slower.

    -new-qual and -new-qual true works fine for me.

    EDIT: You are using an older version. Your parameter should be -newQual

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @Manish,

    I see your toolkit is from 2015. Please update to GATKv3.7 or GATK4 and try to see if you can call your variant. It may be that this is a false positive variant that calling discards or the rare instance of a variant that may be helped by calling within a cohort.

  • ManishManish IndiaMember

    @shlee said:
    Hi @Manish,

    I see your toolkit is from 2015. Please update to GATKv3.7 or GATK4 and try to see if you can call your variant. It may be that this is a false positive variant that calling discards or the rare instance of a variant that may be helped by calling within a cohort.

    i am bench-marking one old pipeline so i am using older version of software. Is there any way to use old software and still get those false positive in raw variants y changing some parameters?

  • ManishManish IndiaMember

    @SkyWarrior said:
    you may need to check your command line once again. It is possible that whitespace is not being recognized properly. try typing slower.

    -new-qual and -new-qual true works fine for me.

    EDIT: You are using an older version. Your parameter should be -newQual

    None of the above options works for me. Below command line function that works for me.

    java -Xmx2g -jar CancerAnalysisPackage-2015.1-3/GenomeAnalysisTK.jar -T GenotypeGVCFs --help

    The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41
    Copyright (c) 2010 The Broad Institute

    For support and documentation go to http://www.broadinstitute.org/gatk


    usage: java -jar GenomeAnalysisTK.jar -T [-args ] [-I ] [--showFullBamList] [-rbs
    ] [-et ] [-K ] [-tag ] [-rf ] [-drf ] [-L
    ] [-XL ] [-isr ] [-im ] [-ip ] [-R
    ] [-ndrs] [-maxRuntime ] [-maxRuntimeUnits ] [-dt ]
    [-dfrac ] [-dcov ] [-baq ] [-baqGOP ] [-fixNDN]
    [-fixMisencodedQuals] [-allowPotentiallyMisencodedQuals] [-OQ] [-DBQ ] [-PF ]
    [-BQSR ] [-qq ] [-DIQ] [-EOQ] [-preserveQ ] [-globalQScorePrior
    ] [-S ] [-rpr] [-kpr] [-sample_rename_mapping_file
    ] [-U ] [-disable_auto_index_creation_and_locking_when_reading_rods] [-sites_only]
    [-writeFullFormat] [-compress ] [-simplifyBAM] [--disable_bam_indexing] [--generate_md5] [-nt
    ] [-nct ] [-mte] [-bfh ] [-rgbl
    ] [-ped ] [-pedString ] [-pedValidationType ]
    [-variant_index_type ] [-variant_index_parameter ] [-l ]
    [-log ] [-h] [-version] -V [-o ] [-allSites] [-nda] [-hets ]
    [-indelHeterozygosity ] [-stand_call_conf ]
    [-stand_emit_conf ] [-maxAltAlleles ]
    [-inputPrior ] [-ploidy ] [-A ] [-D ] [-filterRNC] [-filterMBQ]
    [-filterNoBases]

    -T,--analysis_type Name of the tool to run
    -args,--arg_file Reads arguments from the
    specified file
    -I,--input_file Input file containing sequence
    data (SAM or BAM)
    --showFullBamList Emit a log entry (level INFO)
    containing the full list of
    sequence data files to be
    included in the analysis
    (including files inside
    .bam.list files).
    -rbs,--read_buffer_size Number of reads per SAM file
    to buffer in memory
    -et,--phone_home Run reporting mode (NO_ET|AWS|
    STDOUT)
    -K,--gatk_key GATK key file required to run
    with -et NO_ET
    -tag,--tag Tag to identify this GATK run
    as part of a group of runs
    -rf,--read_filter Filters to apply to reads
    before analysis
    -drf,--disable_read_filter Read filters to disable
    -L,--intervals One or more genomic intervals
    over which to operate
    -XL,--excludeIntervals One or more genomic intervals
    to exclude from processing
    -isr,--interval_set_rule Set merging approach to use
    for combining interval inputs
    (UNION|INTERSECTION)
    -im,--interval_merging Interval merging rule for
    abutting intervals (ALL|
    OVERLAPPING_ONLY)
    -ip,--interval_padding Amount of padding (in bp) to
    add to each interval
    -R,--reference_sequence Reference sequence file
    -ndrs,--nonDeterministicRandomSeed Use a non-deterministic random
    seed
    -maxRuntime,--maxRuntime Stop execution cleanly as soon
    as maxRuntime has been reached
    -maxRuntimeUnits,--maxRuntimeUnits Unit of time used by
    maxRuntime (NANOSECONDS|
    MICROSECONDS|MILLISECONDS|
    SECONDS|MINUTES|HOURS|DAYS)
    -dt,--downsampling_type Type of read downsampling to
    employ at a given locus (NONE|
    ALL_READS|BY_SAMPLE)
    -dfrac,--downsample_to_fraction Fraction of reads to
    downsample to
    -dcov,--downsample_to_coverage Target coverage threshold for
    downsampling to coverage
    -baq,--baq Type of BAQ calculation to
    apply in the engine (OFF|
    CALCULATE_AS_NECESSARY|
    RECALCULATE)
    -baqGOP,--baqGapOpenPenalty BAQ gap open penalty
    -fixNDN,--refactor_NDN_cigar_string Reduce NDN elements in CIGAR
    string
    -fixMisencodedQuals,--fix_misencoded_quality_scores Fix mis-encoded base quality
    scores
    -allowPotentiallyMisencodedQuals,--allow_potentially_misencoded_quality_scores Ignore warnings about base
    quality score encoding
    -OQ,--useOriginalQualities Use the base quality scores
    from the OQ tag
    -DBQ,--defaultBaseQualities Assign a default base quality
    -PF,--performanceLog Write GATK runtime performance
    log to this file
    -BQSR,--BQSR Input covariates table file
    for on-the-fly base quality
    score recalibration
    -qq,--quantize_quals Quantize quality scores to a
    given number of levels (with
    -BQSR)
    -DIQ,--disable_indel_quals Disable printing of base
    insertion and deletion tags
    (with -BQSR)
    -EOQ,--emit_original_quals Emit the OQ tag with the
    original base qualities (with
    -BQSR)
    -preserveQ,--preserve_qscores_less_than Don't recalibrate bases with
    quality scores less than this
    threshold (with -BQSR)
    -globalQScorePrior,--globalQScorePrior Global Qscore Bayesian prior
    to use for BQSR
    -S,--validation_strictness How strict should we be with
    validation (STRICT|LENIENT|
    SILENT)
    -rpr,--remove_program_records Remove program records from
    the SAM header
    -kpr,--keep_program_records Keep program records in the
    SAM header
    -sample_rename_mapping_file,--sample_rename_mapping_file Rename sample IDs on-the-fly
    at runtime using the provided
    mapping file
    -U,--unsafe Enable unsafe operations:
    nothing will be checked at
    runtime (ALLOW_N_CIGAR_READS|
    ALLOW_UNINDEXED_BAM|
    ALLOW_UNSET_BAM_SORT_ORDER|
    NO_READ_ORDER_VERIFICATION|
    ALLOW_SEQ_DICT_INCOMPATIBILITY|
    LENIENT_VCF_PROCESSING|ALL)
    d_locking_when_reading_rods,--disable_auto_index_creation_and_locking_when_reading_rods Disable both auto-generation
    of index files and index file
    locking
    -sites_only,--sites_only Just output sites without
    genotypes (i.e. only the first
    8 columns of the VCF)
    -writeFullFormat,--never_trim_vcf_format_field Always output all the records
    in VCF FORMAT fields, even if
    some are missing
    -compress,--bam_compression Compression level to use for
    writing BAM files (0 - 9,
    higher is more compressed)
    -simplifyBAM,--simplifyBAM If provided, output BAM files
    will be simplified to include
    just key reads for downstream
    variation discovery analyses
    (removing duplicates, PF-,
    non-primary reads), as well
    stripping all extended tags
    from the kept reads except the
    read group identifier
    --disable_bam_indexing Turn off on-the-fly creation
    of indices for output BAM
    files.
    --generate_md5 Enable on-the-fly creation of
    md5s for output BAM files.
    -nt,--num_threads Number of data threads to
    allocate to this analysis
    -nct,--num_cpu_threads_per_data_thread Number of CPU threads to
    allocate per data thread
    -mte,--monitorThreadEfficiency Enable threading efficiency
    monitoring
    -bfh,--num_bam_file_handles Total number of BAM file
    handles to keep open
    simultaneously
    -rgbl,--read_group_black_list Exclude read groups based on
    tags
    -ped,--pedigree Pedigree files for samples
    -pedString,--pedigreeString Pedigree string for samples
    -pedValidationType,--pedigreeValidationType Validation strictness for
    pedigree information (STRICT|
    SILENT)
    -variant_index_type,--variant_index_type Type of IndexCreator to use
    for VCF/BCF indices
    (DYNAMIC_SEEK|DYNAMIC_SIZE|
    LINEAR|INTERVAL)
    -variant_index_parameter,--variant_index_parameter Parameter to pass to the
    VCF/BCF IndexCreator
    -l,--logging_level Set the minimum level of
    logging
    -log,--log_to_file Set the logging location
    -h,--help Generate the help message
    -version,--version Output version information

    Arguments for GenotypeGVCFs:
    -V,--variant One or more input gVCF files
    -o,--out File to which variants should
    be written
    -allSites,--includeNonVariantSites Include loci found to be
    non-variant after genotyping
    -nda,--annotateNDA If provided, we will annotate
    records with the number of
    alternate alleles that were
    discovered (but not
    necessarily genotyped) at a
    given site
    -hets,--heterozygosity Heterozygosity value used to
    compute prior likelihoods for
    any locus. See the GATKDocs
    for full details on the
    meaning of this population
    genetics concept
    -indelHeterozygosity,--indel_heterozygosity Heterozygosity for indel
    calling. See the GATKDocs for
    heterozygosity for full
    details on the meaning of this
    population genetics concept
    -stand_call_conf,--standard_min_confidence_threshold_for_calling The minimum phred-scaled
    confidence threshold at which
    variants should be called
    -stand_emit_conf,--standard_min_confidence_threshold_for_emitting The minimum phred-scaled
    confidence threshold at which
    variants should be emitted
    (and filtered with LowQual if
    less than the calling
    threshold)
    -maxAltAlleles,--max_alternate_alleles Maximum number of alternate
    alleles to genotype
    -inputPrior,--input_prior Input prior for calls
    -ploidy,--sample_ploidy Ploidy (number of chromosomes)
    per sample. For pooled data,
    set to (Number of samples in
    each pool * Sample Ploidy).
    -A,--annotation One or more specific
    annotations to recompute. The
    single value 'none' removes
    the default annotations
    -D,--dbsnp dbSNP file

    Arguments for MalformedReadFilter:
    -filterRNC,--filter_reads_with_N_cigar Filter out reads with CIGAR containing the N operator, instead of
    failing with an error
    -filterMBQ,--filter_mismatching_base_and_quals Filter out reads with mismatching numbers of bases and base qualities,
    instead of failing with an error
    -filterNoBases,--filter_bases_not_stored Filter out reads with no stored bases (i.e. '*' where the sequence
    should be), instead of failing with an error

    Available Reference Ordered Data types:
    Name FeatureType Documentation
    BCF2 VariantContext (this is an external codec and is not documented within GATK)
    BEAGLE BeagleFeature (this is an external codec and is not documented within GATK)
    BED BEDFeature (this is an external codec and is not documented within GATK)
    BEDTABLE TableFeature (this is an external codec and is not documented within GATK)
    EXAMPLEBINARY Feature (this is an external codec and is not documented within GATK)
    GELITEXT GeliTextFeature (this is an external codec and is not documented within GATK)
    OLDDBSNP OldDbSNPFeature (this is an external codec and is not documented within GATK)
    RAWHAPMAP RawHapMapFeature (this is an external codec and is not documented within GATK)
    REFSEQ RefSeqFeature (this is an external codec and is not documented within GATK)
    SAMPILEUP SAMPileupFeature (this is an external codec and is not documented within GATK)
    SAMREAD SAMReadFeature (this is an external codec and is not documented within GATK)
    TABLE TableFeature (this is an external codec and is not documented within GATK)
    VCF VariantContext (this is an external codec and is not documented within GATK)
    VCF3 VariantContext (this is an external codec and is not documented within GATK)

    For a full description of this walker, see its GATKdocs at:
    http://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Yes, @Manish, newQual became available for the final GATK3 releases.

  • ManishManish IndiaMember

    So i can't use it here?

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    If your datestamping (2015) is reflective of the version of GATK you are using, then no.

  • ManishManish IndiaMember

    @shlee said:
    If your datestamping (2015) is reflective of the version of GATK you are using, then no.

    Is there any other way to print more low quality variants? As I want to tune parameters to get missing variants.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
Sign In or Register to comment.