
Select SNP and reference monomorphic sites from a large callset

zzq (China) Member
edited February 2017 in Ask the GATK team

Dear all,

I recently called genotypes using GenotypeGVCFs with --includeNonVariantSites. I want to pull out the sites that have either a single allele or . in the ALT column. However, the following command also reports sites with * or more than one allele listed in the ALT column. I also tried adding --restrictAllelesTo BIALLELIC, but then the non-variant sites (sites with . in the ALT column) are excluded. I do not know how to set these parameters to get what I need.

-T SelectVariants \
--selectTypeToInclude NO_VARIATION \
--selectTypeToInclude SNP \

I hope you can help.

Best
Zhuqing

Answers

  • Sheila (Broad Institute) Member, Broadie, Moderator

    @zzq
    Hi Zhuqing,

    I think you can use --selectTypeToExclude. If you add --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC that should give you what you are looking for.

    -Sheila

  • zzq (China) Member

    Hi @Sheila

    Thank you. I ran it with --selectTypeToExclude, but the results still include sites with * or multiple alleles in the ALT column (see below). The GATK version is 3.5-0-g36282e4.

    NC_005044.2 1229 . A . . . AN=424;DP=39131 GT:DP:RGQ 0/0:2:6 0/0:43:99 0/0:49
    NC_005044.2 1230 . A . . . AN=426;DP=39198 GT:DP:RGQ 0/0:2:6 0/0:43:99 0/0:49
    NC_005044.2 1232 . A . . . AN=442;DP=39386 GT:DP:RGQ 0/0:2:6 0/0:43:99 0/0:49
    NC_005044.2 1233 . A . . . AN=446;DP=39415 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 1234 . A . . . AN=446;DP=39512 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 1237 . G A,* 605604.72 . AC=5,32;AF=0.011,0.071;AN=448;BaseQRankSum=0.979;Clipp
    NC_005044.2 1238 . A . . . AN=416;DP=39597 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 1239 . A . . . AN=448;DP=38210 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 1240 . T . . . AN=448;DP=38200 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 1241 . T . . . AN=448;DP=38201 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 1242 . A . . . AN=448;DP=38192 GT:DP:RGQ 0/0:3:9 0/0:43:99 0/0:49
    NC_005044.2 5558 . G . . . AN=448;DP=13759 GT:DP:RGQ 0/0:8:24 0/0:44:99
    NC_005044.2 5559 . G A,C 19917 . AC=150,8;AF=0.336,0.018;AN=446;BaseQRankSum=0.361;ClippingRank
    NC_005044.2 5560 . T . . . AN=448;DP=13810 GT:DP:RGQ 0/0:9:27 0/0:44:99
    NC_005044.2 5561 . T . . . AN=448;DP=13790 GT:DP:RGQ 0/0:9:27 0/0:44:99
    NC_005044.2 5562 . T . . . AN=440;DP=13786 GT:DP:RGQ 0/0:9:27 ./.:45:0
    NC_005044.2 5563 . G . . . AN=448;DP=13774 GT:DP:RGQ 0/0:9:27 0/0:45:99
    NC_005044.2 5564 . G . . . AN=446;DP=13766 GT:DP:RGQ 0/0:10:30 0/0:45:99
    NC_005044.2 5565 . C . . . AN=440;DP=13752 GT:DP:RGQ 0/0:10:30 ./.:45:0

    Maybe I have missed something.

    Best
    Zhuqing

  • shlee (Cambridge) Member, Broadie, Moderator
    edited February 2017

    Hi @zzq,

    Can you reproduce this result with the latest GATK (v3.7)? If so, would you mind sharing a snippet of your data so we can debug this? Instructions for sending us data are at http://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report. Thanks.

  • zzq (China) Member

    Dear @shlee

    Yes, I can reproduce the same problem with the latest GATK.
    The command:

    java -Xmx100g -jar /bin/GATK3.7/GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R ref.fa \
        --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \
        -V $VCF \
        -o $OUT

    It still keeps the MNP sites, like the following:

    1 5969 . C . 84.60 .
    1 6033 . C T 623.14 .
    1 6085 . A . 72.87 .
    1 6093 . A . 72.87 .
    1 6097 . T . 66.87 .
    1 6357 . C A,G 4303.95 .
    1 6374 . C . 45.87 .
    1 6417 . C . 51.87 .
    1 6418 . G . 51.87 .
    1 6508 . C . 42.87 .
    1 6540 . C . 51.87 .
    1 6547 . G A 2612.19 .
    1 6597 . C . 66.87 .
    1 6753 . A . 72.80 .
    1 6799 . C . 77.90 .
    1 6890 . G . 72.86 .

    Best
    Zhuqing

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    @zzq none of these would be considered an MNP in GATK -- this list is a mix of SNPs (including a multiallelic SNP) and reference sites. An MNP would be e.g. AAT -> ACC.
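
To make the distinction above concrete, here is a minimal Python sketch that labels a site from its REF and ALT columns. The function name `classify_site` and the labelling rules are my own simplification for illustration; they borrow the type names used in this thread but are not GATK's implementation, and GATK's own handling (particularly of the `*` allele) may differ.

```python
def classify_site(ref, alt):
    """Return (label, number_of_alt_alleles) for one VCF record.

    Hypothetical helper: a simplified reading of the type names in this
    thread, not GATK's actual classification logic.
    """
    if alt == ".":
        return ("NO_VARIATION", 0)      # reference (monomorphic) site
    alleles = alt.split(",")
    labels = set()
    for allele in alleles:
        if allele == "*":
            labels.add("SPAN_DEL")      # this sketch's own label for '*'
        elif allele.startswith("<"):
            labels.add("SYMBOLIC")
        elif len(allele) != len(ref):
            labels.add("INDEL")
        elif len(ref) == 1:
            labels.add("SNP")           # single base replaced by single base
        else:
            labels.add("MNP")           # several adjacent bases replaced
    label = labels.pop() if len(labels) == 1 else "MIXED"
    return (label, len(alleles))


# REF/ALT pairs taken from the records and the example quoted in the thread.
for ref, alt in [("C", "."), ("C", "T"), ("C", "A,G"), ("AAT", "ACC")]:
    print(ref, alt, classify_site(ref, alt))
```

Under these rules, `C -> A,G` is still a SNP, just one with two alternate alleles, which is why excluding MNP does not remove it.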

  • zzq (China) Member
    Hi @Geraldine_VdAuwera

    Thanks, but how can I filter out these sites using SelectVariants? I hope you can help.

    best
    Zhuqing
  • Sheila (Broad Institute) Member, Broadie, Moderator

    @zzq
    Hi Zhuqing,

    Ah, I see. I think you are asking to remove sites that have more than one alternate allele. In that case, you need to use --restrictAllelesTo BIALLELIC.

    -Sheila

  • zzq (China) Member

    Hi @Sheila

    Yes, I have tried --restrictAllelesTo BIALLELIC, but then the non-variant sites (sites with . in the ALT column) are also excluded, and I want to keep them.

    Best
    Zhuqing

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @zzq,

    The team may have more suggestions for using SelectVariants or JEXL expressions. In the meantime, perhaps you can move forward by using awk to exclude rows where column 5 contains a comma. This should then give you rows where the ALT column holds a single allele or a dot (.).

    For example, try this command:

    # Keep header lines as they are; drop records whose ALT (column 5) contains a comma
    awk -F'\t' '/^#/ || $5 !~ /,/' input.vcf > output.vcf
    

    Be sure to review the resulting file to make sure it is as you expect.
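
One way to do that review is to count what is left in the ALT column. The Python sketch below is only an illustration: `tally_alt_shapes` and its category names are hypothetical, and it assumes tab-separated VCF records with ALT in column 5.

```python
from collections import Counter


def tally_alt_shapes(vcf_lines):
    """Count records by what their ALT column (column 5) looks like.

    Hypothetical helper for eyeballing a filtered VCF; header lines
    (starting with '#') and blank lines are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        alt = line.rstrip("\n").split("\t")[4]
        if alt == ".":
            counts["no_alt"] += 1        # reference site
        elif "," in alt:
            counts["multi_alt"] += 1     # should be absent after filtering
        elif alt == "*":
            counts["star_only"] += 1     # a lone '*' has no comma, so awk keeps it
        else:
            counts["single_alt"] += 1
    return counts


# Illustrative input built from a few records quoted earlier in the thread.
records = [
    "\t".join(["1", "5969", ".", "C", ".", "84.60", "."]),
    "\t".join(["1", "6033", ".", "C", "T", "623.14", "."]),
    "\t".join(["1", "6357", ".", "C", "A,G", "4303.95", "."]),
]
print(tally_alt_shapes(records))
```

To check a real file, pass an open file handle instead of the list, for example `tally_alt_shapes(open("output.vcf"))`. A non-zero `multi_alt` or `star_only` count means the filter did not do what was intended.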

  • Sheila (Broad Institute) Member, Broadie, Moderator

    @zzq
    Hi,

    So, adding --selectTypeToInclude NO_VARIATION and --restrictAllelesTo BIALLELIC does not get you what you want? If so, what else do you want to keep and/or not keep?

    -Sheila

  • zzq (China) Member

    Dear @Sheila

    Sorry for the late reply. With the parameters you recommended, the output is empty. I just want to keep the sites that have a single allele or . in the ALT column.

    Best
    Zhuqing

  • zzq (China) Member

    Dear @Sheila

    I think the following gets me what I want. Thank you.

    -T SelectVariants \
    -R ref.fa \
    --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \
    --selectTypeToInclude NO_VARIATION \
    --maxNOCALLfraction 0 --excludeFiltered \
    -V $VCF \
    -o $OUT.Monomorphic.vcf.gz

    -T SelectVariants \
    -R ref.fa \
    --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \
    --selectTypeToInclude SNP \
    --maxNOCALLfraction 0 --excludeFiltered \
    --restrictAllelesTo BIALLELIC \
    -V $VCF \
    -o $OUT.SNP.vcf.gz

    -T CombineVariants \
    -R ref.fa \
    -V $OUT.Monomorphic.vcf.gz \
    -V $OUT.SNP.vcf.gz \
    -o $OUT.Clean.vcf.gz \
    --assumeIdenticalSamples

    Best wishes
    Zhuqing
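
For readers who want to cross-check the result of the three commands above, the same selection can be approximated in a single pass over an uncompressed VCF. This Python sketch is not a replacement for SelectVariants: `keep_site`, `is_no_call` and `filter_vcf` are hypothetical names, and it encodes my reading of the options, namely that --excludeFiltered drops records whose FILTER is neither `.` nor `PASS` and that --maxNOCALLfraction 0 drops any site with at least one no-call genotype.

```python
def is_no_call(gt):
    """True when every allele in a GT string such as './.' is missing."""
    return all(a == "." for a in gt.replace("|", "/").split("/"))


def keep_site(line):
    """Keep monomorphic sites and biallelic SNPs, per the assumptions above."""
    fields = line.rstrip("\n").split("\t")
    ref, alt, filt = fields[3], fields[4], fields[6]
    if filt not in (".", "PASS"):
        return False                      # assumed --excludeFiltered behaviour
    # Sample columns start at column 10; GT is the first subfield.
    if any(is_no_call(sample.split(":")[0]) for sample in fields[9:]):
        return False                      # assumed --maxNOCALLfraction 0 behaviour
    if alt == ".":
        return True                       # reference monomorphic site
    # Biallelic SNP: one-base REF and exactly one single-base ALT.
    return len(ref) == 1 and alt in ("A", "C", "G", "T")


def filter_vcf(lines):
    """Yield header lines unchanged, plus the records that pass keep_site."""
    for line in lines:
        if line.startswith("#") or keep_site(line):
            yield line
```

Because records are emitted in input order, no merge step comparable to CombineVariants is needed. Comparing the record count from `filter_vcf` against `$OUT.Clean.vcf.gz` is a cheap sanity check, provided the assumptions above match the GATK version in use.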

  • Sheila (Broad Institute) Member, Broadie, Moderator

    @zzq
    Hi Zhuqing,

    Thanks for posting your solution.

    -Sheila
