The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.10.2 is now available at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.2 (i.e. the second beta release) is out. See the GATK4 BETA page for download and details.

Select SNP and reference monomorphic sites from a large callset

zzqzzq ChinaMember
edited February 8 in Ask the GATK team

Dear all,

Recently, I have called genotype using GenotypeGVCFs with --includeNonVariantSites. I want to pull out the sites with one allele or . in ALT column. I found the following command will also report the sites with * or more than one allele listed in ALT column. I also tried to add --restrictAllelesTo BIALLELIC, but the non-variant sites (sites with . listed in ALT column) will be excluded. I do not know how to set these parameters to meet my requirements.

-T SelectVariants \ --selectTypeToInclude NO_VARIATION \ --selectTypeToInclude SNP \

Hope for your help.

Best
Zhuqing

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zzq
    Hi Zhuqing,

    I think you can use --selectTypeToExclude. If you add --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC that should give you what you are looking for.

    -Sheila

  • zzqzzq ChinaMember

    Hi @Sheila

    Thank you. I have run with --selectTypeToExclude, but the results also report the sites with * or multiple allele listed in ALT column (like following). The version of GATK 3.5-0-g36282e4

    NC_005044.2 1229 . A . . . AN=424;DP=39131 GT:DP:RGQ 0/0:2:6 0/0:43:99 0/0:49 NC_005044.2 1230 . A . . . AN=426;DP=39198 GT:DP:RGQ 0/0:2:6 0/0:43:99 0/0:49 NC_005044.2 1232 . A . . . AN=442;DP=39386 GT:DP:RGQ 0/0:2:6 0/0:43:99 0/0:49 NC_005044.2 1233 . A . . . AN=446;DP=39415 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 1234 . A . . . AN=446;DP=39512 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 1237 . G A,* 605604.72 . AC=5,32;AF=0.011,0.071;AN=448;BaseQRankSum=0.979;Clipp NC_005044.2 1238 . A . . . AN=416;DP=39597 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 1239 . A . . . AN=448;DP=38210 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 1240 . T . . . AN=448;DP=38200 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 1241 . T . . . AN=448;DP=38201 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 1242 . A . . . AN=448;DP=38192 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49 NC_005044.2 5558 . G . . . AN=448;DP=13759 GT:DP:RGQ 0/0:8:24 0/0:44:99 NC_005044.2 5559 . G A,C 19917 . AC=150,8;AF=0.336,0.018;AN=446;BaseQRankSum=0.361;ClippingRank NC_005044.2 5560 . T . . . AN=448;DP=13810 GT:DP:RGQ 0/0:9:27 0/0:44:99 NC_005044.2 5561 . T . . . AN=448;DP=13790 GT:DP:RGQ 0/0:9:27 0/0:44:99 NC_005044.2 5562 . T . . . AN=440;DP=13786 GT:DP:RGQ 0/0:9:27 ./.:45:0 NC_005044.2 5563 . G . . . AN=448;DP=13774 GT:DP:RGQ 0/0:9:27 0/0:45:99 NC_005044.2 5564 . G . . . AN=446;DP=13766 GT:DP:RGQ 0/0:10:30 0/0:45:99 NC_005044.2 5565 . C . . . AN=440;DP=13752 GT:DP:RGQ 0/0:10:30 ./.:45:0

    Maybe I have missed something.

    Best
    Zhuqing

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited February 17

    Hi @zzq,

    Can you reproduce this result with the latest GATK (v3.7)? If so, would you mind sharing a snippet of your data so we can debug this? Instructions for sending us data are at http://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report. Thanks.

    Post edited by shlee on
  • zzqzzq ChinaMember

    Dear @shlee

    Yes, the same problem reproduced with the latest GATK.
    The command
    java -Xmx100g /bin/GATK3.7/GenomeAnalysisTK.jar \ -T SelectVariants \ -R ref.fa \ --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \ -V $VCF \ -o $OUT
    It will keep the MNP sites like following
    1 5969 . C . 84.60 . 1 6033 . C T 623.14 . 1 6085 . A . 72.87 . 1 6093 . A . 72.87 . 1 6097 . T . 66.87 . 1 6357 . C A,G 4303.95 . 1 6374 . C . 45.87 . 1 6417 . C . 51.87 . 1 6418 . G . 51.87 . 1 6508 . C . 42.87 . 1 6540 . C . 51.87 . 1 6547 . G A 2612.19 . 1 6597 . C . 66.87 . 1 6753 . A . 72.80 . 1 6799 . C . 77.90 . 1 6890 . G . 72.86 .
    Best
    Zhuqing

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @zzq none of these would be considered a MNP in GATK -- this list is a mix of SNPs (including a multiallelic SNP) and reference sites. A MNP would be e.g. AAT -> ACC.

  • zzqzzq ChinaMember
    Hi @Geraldine_VdAuwera

    thanks, but how can i filter these sites using SelectVariants. Hope your help.

    best
    Zhuqing
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zzq
    Hi Zhuqing,

    Ah, I see. I think you are asking to remove sites that have more than one alternate allele. In that case, you need to use --restrictAllelesTo BIALLELIC.

    -Sheila

  • zzqzzq ChinaMember

    Hi @Sheila

    Yes, I have tried with --restrictAllelesTo BIALLELIC, but the non-variant sites (sites with . listed in ALT column) will also be excluded. But
    I want to keep them.

    Best
    Zhuqing

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @zzq,

    The team may have more suggestions for using SelectVariants or JEXL expressions. In the meanwhile, perhaps you can move forward by using awk to exclude rows where column 5 contains a comma. This should then give you rows where the ALT allele is singular or a ..

    For example, try this command:

    cat input.vcf | awk '{if ($5 !~ /,/) print $0}' > output.vcf
    

    Be sure to review the resulting file to make sure it is as you expect.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zzq
    Hi,

    So, adding --selectTypeToInclude NO_VARIATION and --restrictAllelesTo BIALLELIC does not get you what you want? If so, what esle do you want to keep and/or not keep?

    -Sheila

  • zzqzzq ChinaMember

    Dear @Sheila

    Sorry for the late reply. When using your recommended parameters, there will be nothing in the output. I just want to keep the sites with one allele or . in ALT column.

    Best
    Zhuqing

  • zzqzzq ChinaMember

    Dear @Sheila

    I think following can get I want. Thank you.
    -T SelectVariants \ -R ref.fa \ --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \ --selectTypeToInclude NO_VARIATION \ --maxNOCALLfraction 0 --excludeFiltered \ -V $VCF \ -o $OUT.Monomorphic.vcf.gz -T SelectVariants \ -R ref.fa \ --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \ --selectTypeToInclude SNP \ --maxNOCALLfraction 0 --excludeFiltered \ --restrictAllelesTo BIALLELIC \ -V $VCF \ -o $OUT.SNP.vcf.gz -T CombineVariants \ -R ref.fa \ -V $OUT.Monomorphic.vcf.gz \ -V $OUT.SNP.vcf.gz \ -o $OUT.Clean.vcf.gz \ --assumeIdenticalSamples
    Best wishes
    Zhuqing

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zzq
    Hi Zhuqing,

    Thanks for posting your solution.

    -Sheila

Sign In or Register to comment.